From 866a429a6f4b0c54104da3c037dedf88d6b13824 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Thu, 11 Jun 2026 16:55:48 +0000
Subject: [PATCH] journal(dstamp): root cause = swarm failure_action:rollback reverts chaos-version label to base spec (start-first masks it via wait_healthy); concurrency refuted; repro3 capturing UpdateStatus

---
 JOURNAL-dstamp.md | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/JOURNAL-dstamp.md b/JOURNAL-dstamp.md
index 61562f9..8e87302 100644
--- a/JOURNAL-dstamp.md
+++ b/JOURNAL-dstamp.md
@@ -61,3 +61,48 @@ Open: must still explain *exactly* how a concurrent peer produces an `eb96de9+U` label on the shared stack — a base deploy is pinned/non-chaos (no chaos label), so the +U chaos label must come from some chaos deploy with HEAD=eb96de9. The isolated real run + (if needed) a deliberate 2-run concurrency repro will nail the mechanism. Will NOT claim M1 on inference.
+
+## 2026-06-11 (cont.) — REAL runs: concurrency REFUTED, true root cause = swarm rollback
+
+Three real install+upgrade runs of discourse @7ae7b0f (CCCI_RUN_ID=dstamp-repro{1,2,3}), each
+SOLO/isolated (no concurrent discourse run):
+
+- **base deploy is CHAOS** (not pinned): `compose.ccci.yml` overlay is present ⇒
+ `deploy_app` takes the `has_ccci_overlay` auto-chaos branch (`lifecycle.py:291-298`). So the
+ base stamps `chaos-version = eb96de9+U` on the shared stack. (My earlier bail-at-secrets repros
+ used a non-chaos/manual base → that's why they didn't expose it.)
+- **repro1 (unpatched): upgrade FAIL** — `chaos commit 'eb96de94+U', not 7ae7b0f76efb`. The
+ per-run tree reflog + snapshot prove HEAD = **7ae7b0f** at the upgrade deploy (last checkout
+ 16:39:03, no checkout-back), yet the deployed `.Spec` chaos label was eb96de9+U.
+- **repro2 (instrumented: abra deploy `--debug` + a HEAD-print subprocess before the redeploy):
+ upgrade PASS** — `[DSTAMP] taking chaos version: 7ae7b0f7+U`, HEAD=7ae7b0f,
+ `deployed_identity = {version 0.9.0+3.5.0, image bitnamilegacy/discourse:3.3.1, chaos 7ae7b0f7+U}`.
+
+So the SAME solo config is **intermittent** (184✓ 06-05, m2b/m2p/ab✗ 06-10/11, repro1✗, repro2✓);
+flipping with a tiny timing change ⇒ **NOT a concurrency artifact, NOT abra version-resolution**
+(abra computes 7ae7b0f7 correctly — proven by repro2's debug line AND all 3 bail-at-secrets repros).
+
+**TRUE ROOT CAUSE (recipe deploy policy + heavy/flaky new task):** discourse `compose.yml` app
+service sets `deploy.update_config: { failure_action: rollback, order: start-first }` with a
+`healthcheck.start_period: 20m`. The upgrade chaos deploy applies the head spec
+(`chaos-version=7ae7b0f7+U`) start-first (old + new task co-resident = ~2× memory for a
+precompile-heavy Rails app). When the NEW task intermittently fails swarm's update monitor,
+swarm executes **failure_action: rollback ⇒ reverts the app service to its PreviousSpec (the
+base: `chaos-version=eb96de9+U`)**. Under `start-first` the OLD task keeps serving, so the
+harness `wait_healthy` still passes — but `deployed_identity` reads `.Spec.Labels` of the
+ROLLED-BACK spec and sees the base commit. The "since ~06-10 on every run" pattern = the
+rcust-phase runs happened under heavier host load (warm keycloak etc.), so the new task reliably
+failed the monitor ⇒ rollback every time; the solo 06-05 run (184) didn't roll back. Harness- and
+abra-neutral, exactly as observed.
+
+repro3 (UpdateStatus + PreviousSpec capture, NO --debug to preserve failing timing) running to
+get the swarm rollback in the act (expect `UpdateStatus.State = rollback_*`, `PreviousSpec.Labels`
+chaos=eb96de9+U == the read `.Spec.Labels` after revert). That is the direct-evidence smoking gun.
+
+Fix direction (HC1 must keep its teeth — do NOT relax the commit match): the upgrade chaos redeploy
+must assert against the *intended* applied spec, not a silently rolled-back one — i.e. the harness
+must DETECT a swarm rollback (UpdateStatus.State rollback*) and treat it as an upgrade FAILURE with
+a clear message (the deploy did not converge to the head spec), AND/OR make the upgrade redeploy not
+subject to silent rollback masking (e.g. assert UpdateStatus completed before reading identity).
+The recipe's rollback policy is legitimate for prod; the harness bug is that a rollback is invisible
+to HC1 and masquerades as "stamped the wrong commit". Will finalise the fix after repro3 confirms.
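
The fix direction in the last journal paragraph (detect a swarm rollback before reading identity) might look roughly like the sketch below. It is an illustration under stated assumptions, not the harness's actual code: `inspect_service`, `read_identity_label`, and `UpgradeNotConverged` are hypothetical names, and the label key is a parameter because the journal abbreviates it as `chaos-version` and the real key may be namespaced. The sketch assumes the standard `docker service inspect` JSON, in which `UpdateStatus.State` takes values such as `updating`, `completed`, `rollback_started`, and `rollback_completed`.

```python
import json
import subprocess


class UpgradeNotConverged(RuntimeError):
    """Hypothetical error: the service did not settle on the spec that was applied."""


def inspect_service(name: str) -> dict:
    """Return the parsed `docker service inspect` record for one service."""
    out = subprocess.run(
        ["docker", "service", "inspect", name],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)[0]  # the CLI prints a JSON array, one entry per service


def read_identity_label(service: dict, label_key: str) -> str:
    """Read a spec label only after confirming the last update converged.

    `service` is one record as returned by inspect_service(). Any rollback
    state is reported as a failed upgrade instead of letting the reverted
    spec's label be read as if it were the applied one.
    """
    status = service.get("UpdateStatus") or {}
    state = status.get("State", "")
    message = status.get("Message", "")

    if state.startswith("rollback"):
        raise UpgradeNotConverged(
            f"swarm rolled the service back (state={state}): {message}"
        )
    # An empty state means no update status is recorded on the service;
    # the commit match itself still has to be checked by the caller.
    if state and state != "completed":
        raise UpgradeNotConverged(f"update has not finished (state={state})")

    labels = (service.get("Spec") or {}).get("Labels") or {}
    return labels.get(label_key, "")
```

With a check of this shape, a rollback surfaces as "the deploy did not converge to the head spec" rather than as a commit mismatch, and the commit comparison in HC1 stays as strict as before. A caller would still need to wait out the `updating` state (polling with a timeout) before treating it as a failure.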