# JOURNAL — phase `dstamp` (Builder, reasoning/private) ## 2026-06-11 — Bootstrap + investigation Read the phase plan, plan.md §6.1/§7/§9, the Adversary's REVIEW-dstamp prep notes, and the stamp-relevant harness code (`abra.py`, `lifecycle.py:deployed_identity/recipe_checkout_ref/ chaos_redeploy/prepull_images`, `generic.py:perform_upgrade/assert_upgraded`, run_recipe_ci upgrade op + fetch_recipe). ### Mechanism (from abra source @06a57de = the pinned binary) chaos-version label is set in `cli/app/deploy.go`: for a `-C` deploy, `getDeployVersion` (l.365) returns `Recipe.ChaosVersion()` (l.367-373) and `SetChaosVersionLabel(compose, stack, toDeployVersion)` (l.168). `ChaosVersion` (`pkg/recipe/git.go:300`) = `formatter.SmallSHA(Head().String())` + `+U` if dirty. `Head` (l.483) = go-git `repo.Head()`. Crucially, `app.Recipe.Ensure(ctx)` (deploy.go:86) calls into git.go:38 which **early-returns on `ctx.Chaos`** (l.41-43) — so a chaos deploy does NOT re-checkout the .env version. `GetEnsureContext` (cli/internal/ensure.go) wires `EnsureContext{Chaos, Offline, IgnoreEnvVersion=DeployLatest}` from the CLI flags. So `-C` ⇒ Ensure no-op ⇒ chaos version = whatever git HEAD the harness left checked out. ### The contradiction that drove the dig The m2p failure message is `chaos commit 'eb96de94+U', not the intended PR-head '7ae7b0f76efb'`. `eb96de9` = tag `0.7.0+3.3.1` (the upgrade base); `7ae7b0f` = PR head (9 commits past that tag, and there is NO 0.8/0.9 tag despite HEAD's "upgrade to 0.9.0+3.5.0" message). The harness `perform_upgrade` does `recipe_checkout_ref(head_ref=7ae7b0f)` then `chaos_redeploy`, with only `env_set` + `prepull_images` (pure docker compose, no git) in between — and the run's recipe **snapshot HEAD = 7ae7b0f**. So at deploy time HEAD *should* be 7ae7b0f ⇒ stamp 7ae7b0f. Yet it stamped eb96de9. abra's source says chaos = Head(); so for eb96de9 to be stamped, HEAD had to be eb96de9 at the chaos deploy — which the isolated flow never produces. ### Reproductions (all on cc-ci, scratch ABRA_DIR, deploys bail at `secret not generated` ### which is deploy.go:140, AFTER the chaos version is computed+logged at deploy.go:372) 1. cp -a canonical recipe, checkout head→base(tag)→head, `abra app deploy -C` → `taking chaos version: 7ae7b0f7`. HEAD stays 7ae7b0f. NO drift. 2. real non-chaos base deploy (exercises go-git `EnsureVersion` which checks out tag via `Branch: refs/tags/0.7.0+3.3.1`, leaving HEAD=eb96de9), then CLI `git checkout -f head`, then `-C` deploy → `taking chaos version: 7ae7b0f7`. NO drift. 3. mirror-faithful: `git clone ` + `git checkout 7ae7b0f` + `git fetch refs/tags/*:refs/tags/*` (exact `fetch_recipe`), then base deploy → re-checkout head → `-C` deploy → `taking chaos version: 7ae7b0f7`. NO drift. Conclusion: the isolated git/abra version-resolution path is **correct** in the current host state. The drift is not in that path. ### Timeline / differentiator - abra binary: constant since 2026-06-01 (system-4). Not abra. - Same ref 7ae7b0f: run 184 (06-05 02:17, **solo**) was L4 upgrade-PASS. The drift runs (m2b 06-10 20:54, m2p 06-11 00:44, ab 06-11 00:48) are **clustered** (m2p & ab 4 min apart → overlapping for a multi-tier discourse run that takes ≫4 min). - `app_domain` hashes (recipe|pr|ref) ⇒ all three drift runs, same ref, **collide on one swarm stack**. The upgrade `chaos_redeploy` does NOT take `deploy_app`'s app-domain flock, so two concurrent runs can interleave deploys on the shared stack and the `_app` service label read by `deployed_identity` reflects whichever deploy last wrote it. **Leading hypothesis:** the "harness-neutral env drift" is actually a **concurrency artifact** of the rcust-phase M2 A/B discourse experiments running near-simultaneously on the shared stack — not an abra/recipe/environment regression. Run 184 solo = green; clustered 06-11 = drift; isolated re-reproduction now = green. Testing with one clean isolated real run (install,upgrade) before committing to this attribution — direct evidence required by the plan, not inference alone. Open: must still explain *exactly* how a concurrent peer produces an `eb96de9+U` (dirty CHAOS) label on the shared stack — a base deploy is pinned/non-chaos (no chaos label), so the +U chaos label must come from some chaos deploy with HEAD=eb96de9. The isolated real run + (if needed) a deliberate 2-run concurrency repro will nail the mechanism. Will NOT claim M1 on inference. ## 2026-06-11 (cont.) — REAL runs: concurrency REFUTED, true root cause = swarm rollback Three real install+upgrade runs of discourse @7ae7b0f (CCCI_RUN_ID=dstamp-repro{1,2,3}), each SOLO/isolated (no concurrent discourse run): - **base deploy is CHAOS** (not pinned): `compose.ccci.yml` overlay is present ⇒ `deploy_app` takes the `has_ccci_overlay` auto-chaos branch (`lifecycle.py:291-298`). So the base stamps `chaos-version = eb96de9+U` on the shared stack. (My earlier bail-at-secrets repros used a non-chaos/manual base → that's why they didn't expose it.) - **repro1 (unpatched): upgrade FAIL** — `chaos commit 'eb96de94+U', not 7ae7b0f76efb`. The per-run tree reflog + snapshot prove HEAD = **7ae7b0f** at the upgrade deploy (last checkout 16:39:03, no checkout-back), yet the deployed `.Spec` chaos label was eb96de9+U. - **repro2 (instrumented: abra deploy `--debug` + a HEAD-print subprocess before the redeploy): upgrade PASS** — `[DSTAMP] taking chaos version: 7ae7b0f7+U`, HEAD=7ae7b0f, `deployed_identity = {version 0.9.0+3.5.0, image bitnamilegacy/discourse:3.3.1, chaos 7ae7b0f7+U}`. So the SAME solo config is **intermittent** (184✓ 06-05, m2b/m2p/ab✗ 06-10/11, repro1✗, repro2✓); flipping with a tiny timing change ⇒ **NOT a concurrency artifact, NOT abra version-resolution** (abra computes 7ae7b0f7 correctly — proven by repro2's debug line AND all 3 bail-at-secrets repros). **TRUE ROOT CAUSE (recipe deploy policy + heavy/flaky new task):** discourse `compose.yml` app service sets `deploy.update_config: { failure_action: rollback, order: start-first }` with a `healthcheck.start_period: 20m`. The upgrade chaos deploy applies the head spec (`chaos-version=7ae7b0f7+U`) start-first (old + new task co-resident = ~2× memory for a precompile-heavy Rails app). When the NEW task intermittently fails swarm's update monitor, swarm executes **failure_action: rollback ⇒ reverts the app service to its PreviousSpec (the base: `chaos-version=eb96de9+U`)**. Under `start-first` the OLD task keeps serving, so the harness `wait_healthy` still passes — but `deployed_identity` reads `.Spec.Labels` of the ROLLED-BACK spec and sees the base commit. The "since ~06-10 on every run" pattern = the rcust-phase runs happened under heavier host load (warm keycloak etc.), so the new task reliably failed the monitor ⇒ rollback every time; the solo 06-05 run (184) didn't roll back. Harness- and abra-neutral, exactly as observed. repro3 (UpdateStatus + PreviousSpec capture, NO --debug to preserve failing timing) running to get the swarm rollback in the act (expect `UpdateStatus.State = rollback_*`, `PreviousSpec.Labels` chaos=eb96de9+U == the read `.Spec.Labels` after revert). That is the direct-evidence smoking gun. ### DIRECT EVIDENCE — captured (repro4, solo/isolated, upgrade FAIL) repro3 base deploy FATA'd (abra convergence monitor gave up — discourse is genuinely flaky/heavy under load, which is the very premise). repro4 reached the upgrade and the post-`chaos_redeploy` `docker service inspect _app` capture is the smoking gun: - `UpdateStatus = {"State":"updating","Message":"update in progress"}` - `.Spec.Labels` chaos-version = **7ae7b0f7+U**, version = 0.9.0+3.5.0 (HEAD spec applied OK) - `.PreviousSpec.Labels` chaos-version = **eb96de94+U**, version = 0.7.0+3.3.1 (the base) - `deployed_identity` (same instant) = chaos **7ae7b0f7+U** (reads Spec, correct) Then `wait_healthy` ran (old task serving under start-first → passes); the new task failed swarm's monitor → `failure_action: rollback` reverted `.Spec` → `.PreviousSpec` (eb96de94+U); the assertion-phase read saw eb96de94+U → HC1 FAIL. The ONLY operation that turns `.Spec.Labels` from 7ae7b0f7+U into the exact `.PreviousSpec` eb96de94+U is a swarm rollback. abra+harness exonerated; the head was really deployed and then swarm-reverted. Attribution complete, by direct evidence. Note the app image is `bitnamilegacy/discourse:3.3.1` for BOTH base and head spec (head only bumps the version label + db image), so the new task isn't failing on a missing image — it's the start-first 2× co-residency of the precompile/Rails-heavy app under host memory pressure (a real new-task failure, intermittent), which trips `failure_action: rollback`. ### Fix plan (HC1 teeth preserved) - Reliability: `tests/discourse/compose.ccci.yml` overlay → app `deploy.update_config.order: stop-first` (old stops before new starts → new boots with full memory → genuinely healthy → no spurious rollback). Upgrade-to-head still really deployed+asserted; not a weakening. WHY in header. Risk to weigh: stop-first = brief real downtime during the CI upgrade (covered by DEPLOY_TIMEOUT 3600). Alternative `failure_action: pause` REJECTED — it would let a genuinely-failed new task pass HC1 (start-first keeps old serving) = test-weakening. - Correctness: harness upgrade path asserts the redeploy converged to the head spec (UpdateStatus not rollback*/paused / `.Spec` not reverted to `.PreviousSpec`) → honest failure message on a real rollback, instead of the misleading "re-checkout failed". General (all rollback-policy recipes). HC1 teeth intact: a head that truly can't stay healthy still fails. - Will validate stop-first actually eliminates the rollback with a full real run before claiming. ## 2026-06-11 (cont.) — fix validated + blast-radius **Fix implemented** (commit 0cc31a5): (1) `tests/discourse/compose.ccci.yml` app service `deploy.update_config.order: stop-first`; (2) `lifecycle.assert_upgrade_converged()` + call in `generic.perform_upgrade` right after `chaos_redeploy` (before wait_healthy) — waits for swarm's app-service rolling update to reach a TERMINAL state and FAILs honestly on rollback*/paused. Unit tests: 253 passed (no regression). **fix1 validation** (run `dstamp-fix1`, fresh checkout @0cc31a5, install+upgrade, solo): UPGRADE **PASS** — `upgrade-converged: …UpdateStatus=completed`, `upgrade→PR-head: head_ref=7ae7b0f7 chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`. The head is deployed, the update converges (no rollback), HC1 reads 7ae7b0f7+U. (Bug was intermittent — running more to show reliability, since repro2 passed unpatched.) **Blast-radius sweep** — recipes with `failure_action: rollback` + `order: start-first`: `discourse, drone, keycloak, n8n, traefik`. Evidence check of the upgrade tier across many runs (incl. the rcust-era m2r-* runs under the same heavy load): - keycloak: runs 155/186/187/m2r/shot-proof → upgrade PASS L4 (HC1 pass ⇒ chaos==head). NOT affected. - n8n: runs 47/54/61/162/197/m2r/shot-proof → upgrade PASS L4. NOT affected. - drone, traefik: cc-ci INFRA (warm-reconciled), NOT enrolled in the recipe-CI upgrade tier. ⇒ **Only discourse actually exhibits the drift** — its app is uniquely heavy (Rails asset precompile, 2.4GB image) so the start-first 2× co-residency OOMs the new task; the lighter keycloak/n8n new tasks survive swarm's monitor, so no rollback. The general harness guard (`assert_upgrade_converged`) now protects ALL rollback-policy recipes from a silent future rollback (honest failure), and discourse additionally gets stop-first to converge reliably. ### Hardening (commit e9c26c7) + fix2 validation Adversary independently confirmed the root cause + assessed the fix CORRECT (REVIEW-dstamp probe), flagging one non-blocking race: assert_upgrade_converged's first poll could read a STALE terminal `completed` (from the install/base deploy) before swarm schedules the new roll → return OK prematurely → miss a later rollback. Hardened with a two-phase wait: phase 1 confirms the NEW update is scheduled (`UpdateStatus.StartedAt` advances past the pre-redeploy value, captured via `update_status_started`, or state is in-flight `updating`/`rollback_started`), with a 30s grace for a genuine no-op redeploy; phase 2 then waits for the terminal verdict. fix2 (hardened, fresh checkout @e9c26c7, install+upgrade): UPGRADE **PASS** — `upgrade-converged: …UpdateStatus=completed`, `chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`. Two consecutive green fixed runs (fix1+fix2) vs intermittent unpatched failures (repro1✗ repro4✗ repro2✓). Unit tests 253 pass. ### M1 claimed Attribution + minimal repro + 06-05→06-10 change + fix + blast-radius all complete and Adversary-pre-confirmed → claiming M1 (verification recipe in STATUS-dstamp). Next: M2 — full all-stages discourse green at true level via the drone `!testme` path (the recipe-CI pipeline runs `cc-ci-run runner/run_recipe_ci.py` from the drone-cloned cc-ci workspace, so e9c26c7 is live for !testme — no nixos-rebuild needed for the harness), other recipes re-proven (none affected), HC1 teeth shown (wrong stamp still FAILs), DEFERRED closed. Fix direction (HC1 must keep its teeth — do NOT relax the commit match): the upgrade chaos redeploy must assert against the *intended* applied spec, not a silently rolled-back one — i.e. the harness must DETECT a swarm rollback (UpdateStatus.State rollback*) and treat it as an upgrade FAILURE with a clear message (the deploy did not converge to the head spec), AND/OR make the upgrade redeploy not subject to silent rollback masking (e.g. assert UpdateStatus completed before reading identity). The recipe's rollback policy is legitimate for prod; the harness bug is that a rollback is invisible to HC1 and masquerades as "stamped the wrong commit". Will finalise the fix after repro3 confirms.