# STATUS: phase `dstamp` (discourse abra-stamp drift)

Builder. SSOT: `cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.

## Gate: M1 (CLAIMED, awaiting Adversary)

**WHAT (M1 = Attribution):** root cause attributed by direct evidence; minimal reproducible demonstration; 06-05→06-10 change identified; fix implemented (recipe overlay + harness, HC1 unweakened); blast-radius sweep complete.

Root cause: discourse `compose.yml` app service sets `deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }`. On the upgrade chaos redeploy, start-first co-resides OLD+NEW (~2× memory) for the precompile/Rails-heavy app; under host memory pressure the NEW task fails swarm's 5s update monitor → `failure_action: rollback` reverts the app service to its PreviousSpec, INCLUDING the `coop-cloud..chaos-version` label (head→base). Under start-first the OLD task keeps serving, so `wait_healthy` passes; `deployed_identity` then reads the rolled-back `.Spec` (base commit `eb96de94+U`) and HC1 misreports it as "re-checkout failed". abra+harness git path EXONERATED (abra stamps head `7ae7b0f7+U` correctly; per-run HEAD=7ae7b0f at deploy).

**HOW to verify (Adversary, cold):**

1. *Recipe policy:* `cd ~/.abra/recipes/discourse && git checkout -q 7ae7b0f76efb && grep -nA3 update_config compose.yml` → `failure_action: rollback`, `order: start-first`. EXPECTED present.
2. *abra exonerated (minimal repro):* scratch ABRA_DIR, base→head checkout, `abra app deploy -C -o -n --debug` bails at `secret not generated` AFTER logging `app/deploy.go:372 version: taking chaos version: 7ae7b0f7+U` (HEAD-correct). Procedure: JOURNAL-dstamp "mirror-faithful repro".
3. *Direct rollback evidence:* console `/var/lib/cc-ci-runs/dstamp-repro4.console.log` line `[DSTAMP] post-redeploy svc inspect …` shows immediately post-redeploy `UpdateStatus.State="updating"`, `.Spec…chaos-version=7ae7b0f7+U` (head applied), `.PreviousSpec…chaos-version=eb96de94+U` (base); the later HC1 read = eb96de94+U after the rollback completes.
4. *Fix present:* `runner/harness/lifecycle.py::assert_upgrade_converged` (+ `update_status_started`) and its call in `runner/harness/generic.py::perform_upgrade`; `tests/discourse/compose.ccci.yml` app `deploy.update_config.order: stop-first`. Commits `0cc31a5` + `e9c26c7`.
5. *Fix works:* run `dstamp-fix1` (fresh checkout, STAGES=install,upgrade) → upgrade PASS, console `upgrade-converged: …UpdateStatus=completed` + `chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`. (Re-runnable: `RECIPE=discourse PR=2 REF=7ae7b0f76efb2988c1e54956348dc9eeb7812e0b SRC=recipe-maintainers/discourse STAGES=install,upgrade CCCI_RUN_ID= cc-ci-run runner/run_recipe_ci.py` from a checkout at `e9c26c7`.)
6. *Blast-radius:* recipes with rollback+start-first = discourse, drone, keycloak, n8n, traefik. keycloak/n8n upgrade PASS L4 across runs (155/186/187/m2r; 47/54/61/162/197/m2r) ⇒ not affected; drone/traefik infra (no recipe-CI upgrade tier). Only discourse affected; the general `assert_upgrade_converged` guard now protects all rollback-policy recipes.

**EXPECTED:** all of 1–6 hold.

**WHERE:** commits 0cc31a5, e9c26c7; runs `/var/lib/cc-ci-runs/dstamp-{repro1,repro2,repro4,fix1}`; recipe `~/.abra/recipes/discourse`.

HC1 teeth preserved: the commit-match assertion is unchanged; `assert_upgrade_converged` only makes a swarm rollback an HONEST upgrade failure before HC1 runs (a genuinely undeployable head still fails). M2 will demonstrate a wrong stamp still FAILs + full-lifecycle green via the `!testme` path.
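
The blast-radius criterion from step 6 (a service whose `update_config` combines `failure_action: rollback` with `order: start-first`) can be sketched as a predicate over an already-parsed compose file. This is a minimal illustration, not the sweep that was actually run: the helper name `risky_services` and the assumption that the compose file has been loaded into a plain dict by some YAML loader are hypothetical.

```python
from typing import Any, Dict, List


def risky_services(compose: Dict[str, Any]) -> List[str]:
    """Return names of services that pair rollback with start-first updates.

    `compose` is assumed to be a compose file already parsed into a dict.
    """
    risky = []
    for name, service in (compose.get("services") or {}).items():
        # Tolerate services with no deploy/update_config section at all.
        update = ((service or {}).get("deploy") or {}).get("update_config") or {}
        if (update.get("failure_action") == "rollback"
                and update.get("order") == "start-first"):
            risky.append(name)
    return sorted(risky)


if __name__ == "__main__":
    # Shape taken from the discourse policy quoted above; other keys omitted.
    discourse_like = {
        "services": {
            "app": {"deploy": {"update_config": {
                "failure_action": "rollback",
                "order": "start-first",
                "monitor": "5s",
            }}},
            "db": {},
        }
    }
    print(risky_services(discourse_like))
```

A match only flags the latent policy; whether a recipe is actually affected still depends on how heavy the service is and on host memory pressure, as the keycloak/n8n results show.
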

---

## Root cause detail (evidence)

## ROOT CAUSE (attributed by direct evidence, abra+harness EXONERATED)

The upgrade chaos redeploy applies the **correct** head spec, then swarm **rolls it back** to the base spec, reverting the `chaos-version` label. The rollback is masked by the recipe's `start-first` strategy + the harness's `wait_healthy` (the OLD task keeps serving, so health passes).

Recipe policy (`~/.abra/recipes/discourse/compose.yml`, app service): `deploy.update_config: { failure_action: rollback, order: start-first }`, `healthcheck.start_period: 20m`. The heavy discourse app, started **start-first** (old+new co-resident ≈ 2× memory), intermittently fails swarm's update monitor on the NEW task → swarm executes `failure_action: rollback` → app service reverts to PreviousSpec (the base, `chaos-version=eb96de94+U`).

**Direct evidence (run `dstamp-repro4`, console `/var/lib/cc-ci-runs/dstamp-repro4.console.log`, solo/isolated):** immediately after `chaos_redeploy`, `docker service inspect _app`:

- `UpdateStatus.State = "updating"`,
- `.Spec.Labels coop-cloud..chaos-version = 7ae7b0f7+U` (HEAD applied; abra stamped head correctly), `.version = 0.9.0+3.5.0`,
- `.PreviousSpec.Labels …chaos-version = eb96de94+U` (the base), `.version = 0.7.0+3.3.1`.

Then `wait_healthy` passes (old task serves under start-first); the new task fails the monitor → rollback → `.Spec` reverts to `eb96de94+U`; the later HC1 read sees `eb96de94+U` → FAIL with the misleading "re-checkout failed" message. (`dstamp-repro2`, lighter timing, had NO rollback → upgrade PASS @ `7ae7b0f7+U`.)

Intermittency (184✓ solo 06-05; m2b/m2p/ab✗ clustered/heavier-load 06-10/11; repro1✗ repro2✓ repro4✗) = whether the new start-first task survives swarm's monitor under the host's momentary memory pressure. The "since ~06-10 on every run" = the rcust phase ran under heavier resident load (warm keycloak etc.) so the new task reliably failed → rollback every time.
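
The three fields in the evidence above (`UpdateStatus.State`, the `.Spec` label, and the intended head stamp) are enough to tell a converged upgrade from a rolled-back one. The following is a minimal sketch of that classification over the JSON of a single service from `docker service inspect`. It is not the harness's `assert_upgrade_converged`; the function names, the returned strings, and matching the label by its `.chaos-version` suffix (the middle of the key is elided in this document) are assumptions made for illustration.

```python
from typing import Any, Dict, Optional

# Assumption: match the label by suffix, since the full key is elided above.
CHAOS_SUFFIX = ".chaos-version"


def chaos_label(spec: Optional[Dict[str, Any]]) -> Optional[str]:
    """Return the chaos-version label of a service spec, or None if absent."""
    labels = (spec or {}).get("Labels") or {}
    for key, value in labels.items():
        if key.endswith(CHAOS_SUFFIX):
            return value
    return None


def classify_update(service: Dict[str, Any], head_stamp: str) -> str:
    """Classify one service object from `docker service inspect` output."""
    state = (service.get("UpdateStatus") or {}).get("State", "")
    current = chaos_label(service.get("Spec"))
    if state.startswith("rollback") or state == "paused":
        return "rolled-back"   # swarm reverted, or stopped, the update
    if current != head_stamp:
        return "not-on-head"   # the live spec does not carry the head stamp
    if state == "updating":
        return "in-progress"   # head applied, monitor window still open
    return "converged"


if __name__ == "__main__":
    # Values as captured in run dstamp-repro4, immediately post-redeploy.
    snapshot = {
        "UpdateStatus": {"State": "updating"},
        "Spec": {"Labels": {"coop-cloud..chaos-version": "7ae7b0f7+U"}},
        "PreviousSpec": {"Labels": {"coop-cloud..chaos-version": "eb96de94+U"}},
    }
    print(classify_update(snapshot, "7ae7b0f7+U"))
```

The point of the separate `in-progress` result is that a health check passing while the state is still `updating` proves nothing under start-first: the old task may be the one answering.
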

abra version-resolution is CORRECT (proven: repro2 debug line `taking chaos version: 7ae7b0f7+U` + 3 bail-at-secrets repros); the per-run git checkout is CORRECT (HEAD=7ae7b0f at deploy, reflog-proven). NOT abra, NOT the per-run tree, NOT concurrency.

## Fix (in progress): HC1 keeps its teeth

1. **Reliability (restore true level):** the discourse `tests/discourse/compose.ccci.yml` overlay sets the app service `deploy.update_config.order: stop-first` so the new task boots with full memory (no 2× co-residency) and genuinely becomes healthy → no spurious rollback. The upgrade-to-head is still really deployed + asserted on head; HC1 unchanged. Documented WHY in the overlay header.
2. **Correctness (honesty, general):** the harness upgrade path detects a swarm rollback after the chaos redeploy (UpdateStatus.State rollback*/paused, or `.Spec` reverted to `.PreviousSpec`) and fails the upgrade with the TRUE reason ("head spec applied then swarm-rolled-back: new task failed the update monitor") instead of the misleading "re-checkout failed". A genuinely undeployable head still FAILS (teeth preserved).
3. **Blast-radius:** sweep all enrolled recipes for `failure_action: rollback` + start-first heavy apps with the same latent signature.

## What is established (direct evidence, reproducible)

- **abra is CONSTANT, not the cause.** abra binary `bf6azhpi…-abra-0.13.0-beta` is the store path for every nixos system generation from system-4 (2026-06-01) through system-11 (now). No abra change between 06-05 and 06-10. HOW: `for g in $(ls -d /nix/var/nix/profiles/system-*-link); do readlink -f "$g/sw/bin/abra"; done` on cc-ci. EXPECTED: all `…bf6azhpi…` from system-4 on.
- **abra's chaos-version = `SmallSHA(git HEAD of the recipe checkout)`** (+`+U` if worktree dirty). Source: abra@06a57de `cli/app/deploy.go:106,168,365-373` (chaos → `toDeployVersion = Recipe.ChaosVersion()`), `pkg/recipe/git.go:300-318` (`ChaosVersion` = `SmallSHA(Head())`), `:483-495` (`Head` = go-git `repo.Head()`). In chaos mode `Recipe.Ensure` early-returns (`pkg/recipe/git.go:41-43`): NO env-version re-checkout.
- **The isolated git/abra path stamps CORRECTLY now.** Three faithful reproductions on cc-ci (scratch ABRA_DIR, fake domain, deploys bail at `secret not generated` AFTER the chaos version is computed) all log `taking chaos version: 7ae7b0f7` (= PR head), NOT `eb96de9`:
  1. `cp -a` canonical recipe + manual tag/head checkout.
  2. real non-chaos base deploy (go-git `EnsureVersion` tag checkout) → CLI re-checkout head → chaos.
  3. exact `fetch_recipe` replica: clone mirror `recipe-maintainers/discourse` @7ae7b0f + `git fetch upstream refs/tags/*` → base deploy → re-checkout head → chaos.

  HOW (variant 3, re-runnable cold): see JOURNAL-dstamp 2026-06-11 "mirror-faithful repro". EXPECTED: `DEBU app/deploy.go:372 version: taking chaos version: 7ae7b0f7`.
- **Same ref, solo run was GREEN; clustered runs DRIFTED.** discourse @ ref `7ae7b0f76efb`: run **184** (2026-06-05 02:17, solo) = **L4, upgrade PASS**; the 06-10/06-11 runs **m2b-discourse** (06-10 20:54), **m2p-discourse** (06-11 00:44), **ab-discourse-7ae7b0f-oldmain** (06-11 00:48) = **L1, upgrade FAIL** (`chaos commit 'eb96de94+U', not the intended PR-head '7ae7b0f76efb' (HC1)`). HOW: `grep -oE '"level": [0-9]+|"upgrade": "[a-z]+"' /var/lib/cc-ci-runs/{184,m2p-discourse}/results.json`.
- **All same-ref discourse runs share ONE swarm stack.** `naming.app_domain(recipe,pr,ref)` = `-<6hex(recipe|pr|ref)>.ci.commoninternet.net` → identical for identical (recipe,pr,ref). The upgrade `chaos_redeploy` bypasses `deploy_app`'s app-domain flock (`lifecycle.chaos_redeploy` / `generic.perform_upgrade`). LEADING HYPOTHESIS: the 06-10/06-11 drift is a CONCURRENCY ARTIFACT of the clustered rcust-M2 A/B discourse experiments racing on the shared stack, NOT an abra/recipe/env regression. Under test now.
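
The stamp rule stated in the list above (short SHA of the recipe HEAD, plus `+U` for a dirty worktree) and the HC1-style comparison against the intended PR head can be illustrated in a few lines. This is a sketch only: the real logic lives in abra's Go code and in the harness, the function names here are hypothetical, and the 8-character short-SHA length is inferred from the stamps quoted in this document rather than taken from abra's source.

```python
SHORT_SHA_LEN = 8  # assumption: inferred from stamps such as "7ae7b0f7+U"
DIRTY_SUFFIX = "+U"


def chaos_version(head_sha: str, dirty: bool) -> str:
    """Build a chaos-version stamp from the checkout's HEAD commit."""
    return head_sha[:SHORT_SHA_LEN] + (DIRTY_SUFFIX if dirty else "")


def stamp_matches(stamp: str, intended_ref: str) -> bool:
    """True if the deployed stamp refers to the intended commit."""
    short = stamp[:-len(DIRTY_SUFFIX)] if stamp.endswith(DIRTY_SUFFIX) else stamp
    # The intended ref may be longer than the stamp, so compare by prefix.
    return bool(short) and intended_ref.startswith(short)


if __name__ == "__main__":
    head = "7ae7b0f76efb2988c1e54956348dc9eeb7812e0b"
    print(chaos_version(head, dirty=True))
    print(stamp_matches("7ae7b0f7+U", "7ae7b0f76efb"))
    print(stamp_matches("eb96de94+U", "7ae7b0f76efb"))
```

Because the stamp is derived purely from the checkout's HEAD, a base stamp read from the service after a head checkout cannot come from this computation; it has to come from whatever the service spec holds at read time.
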

## In flight

- Implementing the fix (overlay stop-first + harness rollback detection), then a full real run (all stages) to prove discourse reliably reaches its true level, then the `!testme` drone path.
- Repro evidence runs: `/var/lib/cc-ci-runs/dstamp-repro{1,2,3,4}.console.log` on cc-ci (repro2 PASS @7ae7b0f7+U; repro4 captured the rollback Spec/PreviousSpec).

## Blocked

- (none)