From 866a429a6f4b0c54104da3c037dedf88d6b13824 Mon Sep 17 00:00:00 2001
From: autonomic-bot <autonomic-bot@noreply.git.autonomic.zone>
Date: Thu, 11 Jun 2026 16:55:48 +0000
Subject: [PATCH] journal(dstamp): root cause = swarm failure_action:rollback
 reverts chaos-version label to base spec (start-first masks it via
 wait_healthy); concurrency refuted; repro3 capturing UpdateStatus

---
 JOURNAL-dstamp.md | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/JOURNAL-dstamp.md b/JOURNAL-dstamp.md
index 61562f9..8e87302 100644
--- a/JOURNAL-dstamp.md
+++ b/JOURNAL-dstamp.md
@@ -61,3 +61,48 @@ Open: must still explain *exactly* how a concurrent peer produces an `eb96de9+U`
 label on the shared stack — a base deploy is pinned/non-chaos (no chaos label), so the +U chaos
 label must come from some chaos deploy with HEAD=eb96de9. The isolated real run + (if needed) a
 deliberate 2-run concurrency repro will nail the mechanism. Will NOT claim M1 on inference.
+
+## 2026-06-11 (cont.) — REAL runs: concurrency REFUTED, true root cause = swarm rollback
+
+Three real install+upgrade runs of discourse @7ae7b0f (CCCI_RUN_ID=dstamp-repro{1,2,3}), each
+SOLO/isolated (no concurrent discourse run):
+
+- **base deploy is CHAOS** (not pinned): `compose.ccci.yml` overlay is present ⇒
+  `deploy_app` takes the `has_ccci_overlay` auto-chaos branch (`lifecycle.py:291-298`). So the
+  base stamps `chaos-version = eb96de9+U` on the shared stack. (My earlier bail-at-secrets repros
+  used a non-chaos/manual base → that's why they didn't expose it.)
+- **repro1 (unpatched): upgrade FAIL** — `chaos commit 'eb96de94+U', not 7ae7b0f76efb`. The
+  per-run tree reflog + snapshot prove HEAD = **7ae7b0f** at the upgrade deploy (last checkout
+  16:39:03, no checkout-back), yet the deployed `.Spec` chaos label was eb96de9+U.
+- **repro2 (instrumented: abra deploy `--debug` + a HEAD-print subprocess before the redeploy):
+  upgrade PASS** — `[DSTAMP] taking chaos version: 7ae7b0f7+U`, HEAD=7ae7b0f,
+  `deployed_identity = {version 0.9.0+3.5.0, image bitnamilegacy/discourse:3.3.1, chaos 7ae7b0f7+U}`.
+
+So the SAME solo config is **intermittent** (184✓ 06-05, m2b/m2p/ab✗ 06-10/11, repro1✗, repro2✓);
+flipping with a tiny timing change ⇒ **NOT a concurrency artifact, NOT abra version-resolution**
+(abra computes 7ae7b0f7 correctly — proven by repro2's debug line AND all 3 bail-at-secrets repros).
+
+**TRUE ROOT CAUSE (recipe deploy policy + heavy/flaky new task):** discourse `compose.yml` app
+service sets `deploy.update_config: { failure_action: rollback, order: start-first }` with a
+`healthcheck.start_period: 20m`. The upgrade chaos deploy applies the head spec
+(`chaos-version=7ae7b0f7+U`) start-first (old + new task co-resident = ~2× memory for a
+precompile-heavy Rails app). When the NEW task intermittently fails swarm's update monitor,
+swarm executes **failure_action: rollback ⇒ reverts the app service to its PreviousSpec (the
+base: `chaos-version=eb96de9+U`)**. Under `start-first` the OLD task keeps serving, so the
+harness `wait_healthy` still passes — but `deployed_identity` reads `.Spec.Labels` of the
+ROLLED-BACK spec and sees the base commit. The "since ~06-10 on every run" pattern = the
+rcust-phase runs happened under heavier host load (warm keycloak etc.), so the new task reliably
+failed the monitor ⇒ rollback every time; the solo 06-05 run (184) didn't roll back. Harness- and
+abra-neutral, exactly as observed.
+
+repro3 (UpdateStatus + PreviousSpec capture, NO --debug to preserve failing timing) running to
+get the swarm rollback in the act (expect `UpdateStatus.State = rollback_*`, `PreviousSpec.Labels`
+chaos=eb96de9+U == the read `.Spec.Labels` after revert). That is the direct-evidence smoking gun.
+
+Fix direction (HC1 must keep its teeth — do NOT relax the commit match): the upgrade chaos redeploy
+must assert against the *intended* applied spec, not a silently rolled-back one — i.e. the harness
+must DETECT a swarm rollback (UpdateStatus.State rollback*) and treat it as an upgrade FAILURE with
+a clear message (the deploy did not converge to the head spec), AND/OR make the upgrade redeploy not
+subject to silent rollback masking (e.g. assert UpdateStatus completed before reading identity).
+The recipe's rollback policy is legitimate for prod; the harness bug is that a rollback is invisible
+to HC1 and masquerades as "stamped the wrong commit". Will finalise the fix after repro3 confirms.