# REVIEW-dstamp.md — Adversary verdicts for phase `dstamp`

Phase: investigate & solve the discourse abra-stamp drift (upgrade-HC1 stamps the prev-base tag commit instead of the PR-head version, harness-neutral, since ~06-10).
SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.

Verdict log is append-only. `review(...)`-prefixed commits carry verdicts (load-bearing watchdog signal).
Findings filed under `## Adversary findings` in BACKLOG-dstamp.md.

---

## Prep notes (NOT a verdict — no gate claimed yet) @2026-06-11T15:5x

Recon done cold before any Builder claim, to make M1/M2 verification fast and independent. Anti-anchoring: formed only from the plan (SSOT), the harness code, and direct host evidence — no dstamp JOURNAL exists yet; none read.

**Stamp mechanism (from code):** HC1's "stamp" = the `coop-cloud..chaos-version` docker service label abra writes on a `--chaos` deploy = the deployed recipe git commit (`runner/harness/lifecycle.py:468 deployed_identity`, `runner/harness/generic.py:146 assert_upgraded`). Upgrade flow (`generic.py:226 perform_upgrade`): deploy prev-published base → `recipe_checkout_ref(recipe, head_ref)` (git checkout -f head) → `chaos_redeploy` (`abra app deploy --chaos`). HC1 asserts `chaos_commit == head_ref` (after stripping the `+U` untracked-overlay marker). PASS requires the chaos-version to equal the PR head.

**Cold observable facts (from `/var/lib/cc-ci-runs/m2p-discourse/abra/recipes/discourse` snapshot + live `~/.abra/recipes/discourse` on cc-ci, 2026-06-11):**

- Recipe HEAD `7ae7b0f` = "chore: upgrade to 0.9.0+3.5.0"; `git describe --tags` = `0.7.0+3.3.1-9-g7ae7b0f` → HEAD is **9 commits past the newest annotated tag** `0.7.0+3.3.1` (commit `eb96de9`). No `0.8.x`/`0.9.x` tag exists.
- The drift symptom (per plan): chaos-version stamped `eb96de94+U` = the **prev-base tag commit** (= the upgrade base `0.7.0+3.3.1`), NOT the PR-head `7ae7b0f`.
- abra is **nix-pinned**: `abra version 0.13.0-beta-06a57de`, store path under `/run/current-system` → binary drift requires a flake.lock/nixos-generation bump between 06-05 and 06-10 (verify against generations, don't assume).

**Open question I'll independently re-derive when M1 is claimed:** why the `--chaos` redeploy after checkout-to-HEAD stamps the BASE commit (eb96de9), not HEAD (7ae7b0f). Candidates to test cold:
(a) re-checkout to head silently reverted (abra fetch/reset during deploy);
(b) abra chaos resolves the version from the app's recorded `.env` RECIPE/version (= the base) rather than the working-tree HEAD;
(c) the "env drift" since 06-10 = recipe/ mirror git state moved (unreleased commits pushed past last tag) or a tag re-pointed.

**Guardrail teeth I will enforce at M2:** HC1 must still FAIL on a genuinely wrong stamp (synthesize a wrong-version deploy and show RED). Any "fix" that derives EXPECTED from "what makes the test pass" rather than abra's documented behavior = automatic FAIL.

Status: idle, awaiting Builder to seed STATUS-dstamp.md and claim M1. Watchdog will ping on the `claim(...)` commit.

---

## Independent probe findings @2026-06-11T17:3x (NOT a verdict — no M1 claim yet)

Anti-anchoring preserved: JOURNAL-dstamp NOT read. Root cause derived independently from harness code, per-run artifacts (repro1/repro2 console logs), and direct docker service inspect on cc-ci. Independently arrived at the same attribution as the Builder.

**Causal chain derived from code + direct evidence:**

1. `provide_ccci_overlay` (rcust-era addition) copies `compose.ccci.yml` into the per-run recipe dir as an UNTRACKED file. Absent in run 184 (2026-06-05, which used the old `install_steps.sh` path writing to canonical `~/.abra`) — consistent with run 184 having no `+U` suffix and passing. The `+U` itself is stripped by HC1's `chaos_commit.split("+",1)[0]` and is NOT the cause of drift.
2. abra reads `git HEAD = 7ae7b0f` and computes `chaos-version = 7ae7b0f7+U` CORRECTLY. Confirmed via three bail-at-secrets manual repros + repro2 debug line `taking chaos version: 7ae7b0f7+U`. abra and the per-run git checkout are EXONERATED.
3. `chaos_redeploy` passes `-c` (no_converge_checks) → `docker stack deploy` returns immediately; Swarm rolling update runs asynchronously.
4. Discourse `compose.yml` (BOTH base `eb96de94` AND PR-head `7ae7b0f`) sets `deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }` on the `app` service. Confirmed by direct `docker service inspect disc-ae10f0_..._app`.
5. With `order: start-first`, OLD + NEW task co-reside (~2× memory). Discourse's Rails/Sidekiq precompile is memory-heavy; under the heavier host load since ~06-10 (warm keycloak and other rcust-phase stacks), the NEW task intermittently fails swarm's 5s update monitor → `failure_action: rollback` fires → Swarm REVERTS the app service spec to PreviousSpec (base deploy, `chaos-version=eb96de94+U`).
6. `services_converged` blind spot: after rollback `UpdateStatus.State = "rollback_completed"`, NOT in the blocking set `("updating", "rollback_started")` → returns True as if converged. Under start-first the OLD task kept serving → `wait_healthy` also passes on the rolled-back spec.
7. `deployed_identity` reads `.Spec.Labels` → rolled-back spec → `chaos-version=eb96de94+U`. HC1 asserts head_ref `7ae7b0f76efb` ≠ `eb96de94` → FAIL with misleading "re-checkout failed".

**Key disproving evidence (independent route):** repro1 was isolated (no concurrent discourse run, domain `disc-ae10f0` used for the first time) and STILL showed the drift. This refuted the pure-concurrency hypothesis BEFORE reading the Builder's evidence or JOURNAL.

**Intermittency explained (run 184 ✓ solo 06-05; clustered/repro1/repro4 ✗; repro2 ✓):** Whether the new start-first task survives the 5s monitor depends on momentary memory pressure.
Run 184: solo + lighter host load + pre-rcust overlay path → new task survived. repro2: warm volumes/containers from repro1 → faster Rails precompile → task survived. The "since ~06-10 on every run" pattern = heavier baseline load from warm rcust-phase stacks after run 184.

**Fix analysis (Builder commit 0cc31a5 — read before JOURNAL):**

*Part 1 — overlay `order: stop-first`*: Old task stops before new starts → new boots with full host memory → no OOM under the 5s monitor → no spurious rollback. `failure_action: rollback` intentionally preserved so a genuinely broken head still rolls back and is caught. ASSESSMENT: **CORRECT AND SUFFICIENT** for eliminating the spurious-rollback trigger.

*Part 2 — `lifecycle.assert_upgrade_converged`*: Called in `perform_upgrade` immediately after `chaos_redeploy`, before `wait_healthy`. Polls `docker service inspect --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}'` until terminal. Returns on `""|"none"|"completed"`; raises on `"rollback_completed"|"rollback_paused"|"paused"`; polls on `"updating"|"rollback_started"`; times out at `meta.DEPLOY_TIMEOUT`. ASSESSMENT: **CORRECT** — closes the wait_healthy-masking blind spot. Makes a swarm rollback an HONEST upgrade failure ("head did not stay healthy") rather than a misreported stamp mismatch. HC1 commit-match logic is unchanged; this only makes the rollback visible before HC1 runs.

**One concern flagged (not a blocker — defense-in-depth covers it):** `assert_upgrade_converged` has a theoretical race window: on the very first poll, Docker may not yet have transitioned from the prior terminal state (`"none"` or `"completed"`, left by the base deploy) to `"updating"` (tiny gap between `docker stack deploy` returning and the Swarm manager scheduling the roll). If the race fires, the function returns OK on that stale state, then the rollback happens silently afterward.
Mitigation: with `stop-first` (fix part 1), a post-assert-converged rollback leaves NO serving task during the rollback → `wait_healthy` also FAILS → the test result is still FAIL, just with a less specific error ("wait_healthy timeout" rather than "swarm rolled back"). HC1 is NOT weakened even if the race fires. No action required unless a recipe uses `start-first` where a post-race rollback could masquerade as a clean upgrade.

**UPDATE — race concern CLOSED by Builder (commit e9c26c7 `harden(dstamp)`):** Builder addressed the race with a 2-phase protocol:

- **Pre-redeploy**: `update_status_started(domain)` snapshots `UpdateStatus.StartedAt`.
- **Phase 1**: polls until `StartedAt` advances past the snapshot (new update scheduled) OR state is `"updating"/"rollback_started"`. 30s grace: if no new update appears → no-op redeploy, nothing to converge.
- **Phase 2**: now that the NEW update is confirmed in flight, waits for terminal state (same logic as before, but with confidence it's the right update).

Assessment: **CORRECT AND COMPLETE**. Phase 1 deterministically distinguishes the new update from stale base-deploy terminal state. No new failure modes introduced. The grace period (30s) is generous relative to Docker's near-immediate scheduling. Race concern fully closed.

**Status:** no `claim(dstamp)` commit yet. Awaiting M1 claim to issue formal verdict.

---

## M1: PASS @2026-06-11T17:36Z

Cold verification from `/srv/cc-ci/cc-ci-adv`. JOURNAL-dstamp not read before verdict (anti-anchoring).

**Check 1 — Recipe policy at 7ae7b0f76efb:** PASS
`cd ~/.abra/recipes/discourse && git checkout -q 7ae7b0f76efb && grep -nA3 update_config compose.yml` → `failure_action: rollback`, `order: start-first` confirmed present at lines 33-35. Direct evidence the discourse app service is configured to rollback+start-first at the PR-head.
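
Check 1 reads the policy from `compose.yml`; the same combination could also be read off a deployed service. The following is a minimal sketch, assuming the JSON shape that `docker service inspect` returns (`Spec.UpdateConfig` with `FailureAction`, `Order`, and `Monitor` in nanoseconds). The function name `rollback_policy` and the `SAMPLE` document are hypothetical and are not part of the harness.

```python
import json
from typing import Any, Dict

# Hypothetical sample shaped like one element of `docker service inspect` output.
# The values mirror the policy reported in this review (rollback, start-first, 5s).
SAMPLE = json.loads("""
{
  "Spec": {
    "UpdateConfig": {
      "FailureAction": "rollback",
      "Order": "start-first",
      "Monitor": 5000000000
    }
  }
}
""")


def rollback_policy(service: Dict[str, Any]) -> Dict[str, Any]:
    """Summarise the update policy of one inspected service.

    `spurious_rollback_risk` is True for the combination this review ties to
    the drift: rollback on failure while old and new tasks co-reside.
    """
    cfg = (service.get("Spec") or {}).get("UpdateConfig") or {}
    action = cfg.get("FailureAction")
    order = cfg.get("Order")
    monitor_ns = cfg.get("Monitor")
    return {
        "failure_action": action,
        "order": order,
        # Docker reports durations in nanoseconds; convert for readability.
        "monitor_seconds": None if monitor_ns is None else monitor_ns / 1e9,
        "spurious_rollback_risk": action == "rollback" and order == "start-first",
    }


if __name__ == "__main__":
    print(rollback_policy(SAMPLE))
```

A helper like this would only flag recipes that share the policy; whether a given recipe actually rolls back still depends on memory pressure during the monitor window.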

**Check 2 — abra CONSTANT (no binary change 06-05→06-10):** PASS
`for g in $(ls -d /nix/var/nix/profiles/system-*-link); do ...readlink -f $g/sw/bin/abra; done` → Gens 2-11 all `/nix/store/bf6azhpi8bi5491n8i4bhjm1z7fva7pb-abra-0.13.0-beta/bin/abra`. Gen1 differs (pre-bootstrap), gens 4-11 (2026-06-01 onward) identical. abra version change as cause of drift definitively ruled out by direct evidence.

**Check 3 — Direct rollback evidence (repro4):** PASS
`grep -E 'DSTAMP|UpdateStatus|PreviousSpec|chaos-version' /var/lib/cc-ci-runs/dstamp-repro4.console.log`

→ Line immediately after chaos_redeploy:

- `UpdateStatus.State="updating"` (in flight)
- `Spec.Labels chaos-version="7ae7b0f7+U"` (abra correctly applied HEAD)
- `PreviousSpec.Labels chaos-version="eb96de94+U"` (the base, what swarm reverts to)

→ HC1 line: `chaos-version=eb96de94+U` (AFTER rollback completed) → mismatch → FAIL

Causal chain proven in a single artifact: abra stamped correctly, swarm rolled back, label reverted. Mechanism confirmed: start-first co-residency → OOM under monitor → failure_action:rollback → PreviousSpec.

**Check 4 — Fix present:** PASS

- `runner/harness/lifecycle.py`: `update_status_started` (line 511) + `assert_upgrade_converged` (line 526). Phase-1 polls until StartedAt advances past prev_started (or in-flight state seen) → closes race. Phase-2 terminal: `completed`=OK; `rollback_completed`/`rollback_paused`/`paused`=FAIL with honest message.
- `runner/harness/generic.py:268-278`: `prev_started = update_status_started(domain)` called BEFORE `chaos_redeploy`, then `assert_upgrade_converged(domain, timeout=DEPLOY_TIMEOUT, prev_started=prev_started)` called immediately after — BEFORE `wait_healthy`. Correct call order.
- `tests/discourse/compose.ccci.yml:54-55`: `deploy.update_config.order: stop-first` with full WHY comment citing direct evidence (dstamp-repro1/4) and stating `failure_action: rollback` is LEFT INTACT.

Both commits 0cc31a5 + e9c26c7 verified present (git log --oneline).
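
The two-phase wait verified in Check 4 can be sketched as follows. This is an illustrative reconstruction from the behaviour described in this review, not the harness's `assert_upgrade_converged`: the name `wait_upgrade_converged`, the `read_status` callable, and the injectable `clock`/`sleep` are assumptions made so the logic runs without Docker. As a simplification, "StartedAt advances" is approximated by "StartedAt differs from the snapshot".

```python
import time
from typing import Callable, Optional, Tuple

# (UpdateStatus.State or "none", UpdateStatus.StartedAt or None)
Status = Tuple[str, Optional[str]]

IN_FLIGHT = {"updating", "rollback_started"}
OK_TERMINAL = {"", "none", "completed"}
FAILED_TERMINAL = {"rollback_completed", "rollback_paused", "paused"}


def wait_upgrade_converged(
    read_status: Callable[[], Status],
    prev_started: Optional[str],
    timeout: float = 600.0,
    grace: float = 30.0,
    poll: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the final state, or raise if swarm rolled the update back."""
    start = clock()

    # Phase 1: do not trust a terminal state until a NEW update is visible,
    # otherwise the stale state of the base deploy could be read as success.
    while True:
        state, started = read_status()
        if started != prev_started or state in IN_FLIGHT:
            break
        if clock() - start >= grace:
            return "no-op"  # nothing was scheduled, nothing to converge
        sleep(poll)

    # Phase 2: the new update is in flight (or already done); wait for its end.
    while True:
        state, _ = read_status()
        if state in FAILED_TERMINAL:
            raise RuntimeError(
                f"head did not stay healthy: swarm UpdateStatus={state}"
            )
        if state in OK_TERMINAL:
            return state
        if clock() - start >= timeout:
            raise TimeoutError(f"update still {state!r} after {timeout}s")
        sleep(poll)
```

In a real harness `read_status` would wrap `docker service inspect`, and `prev_started` would be snapshotted before the redeploy, which is the call order Check 4 confirms.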

**Check 5 — Fix works (dstamp-fix1 and dstamp-fix2):** PASS

- `dstamp-fix1`: `upgrade-converged: disc-ae10f0_ci_commoninternet_net_app swarm UpdateStatus=completed` + `upgrade→PR-head: head_ref=7ae7b0f7 chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0` + `test_upgrade_reconverges PASSED`. Level=2 (install+upgrade only, backup/functional not in STAGES).
- `dstamp-fix2`: same params, same domain, same result — second reliability run confirms.

Both runs: chaos-version=7ae7b0f7+U (head), NOT eb96de94+U (base). Fix is deterministic.

**Check 6 — Blast-radius:** PASS

- n8n: runs 162 (level=4, upgrade=pass) and 47 (level=4, upgrade=pass). Run 162 dated post-06-10 (when discourse was failing) → n8n not affected despite same rollback+start-first policy.
- keycloak: runs 155 (level=4, upgrade=pass) and 187 (level=4, upgrade=pass). Same conclusion.
- `assert_upgrade_converged` now provides a general harness backstop for all rollback-policy recipes. No overlay change needed for keycloak/n8n (lighter apps, no OOM symptom in evidence).
- drone/traefik: infra, no recipe-CI upgrade tier. No action needed.

**HC1 teeth preserved (code inspection):** `generic.py:174-175` — `assert_upgraded` logic is UNCHANGED: `chaos_commit = chaos.split("+",1)[0]`; assertion `head_ref.startswith(chaos_commit) or chaos_commit.startswith(head_ref)`. `assert_upgrade_converged` runs BEFORE `assert_upgraded`; if a rollback occurs it raises FIRST with the honest "head did not stay healthy" message; if no rollback occurs, HC1 commit-match assertion still runs unmodified. A deliberately wrong stamp (e.g. deploying eb96de94 as the chaos version) would still fail HC1 exactly as before. M2 will demonstrate this with a live negative test.
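
The commit-match quoted above can be illustrated in isolation. This is a minimal sketch built only from the two expressions cited from `generic.py`; the function names `chaos_commit_of` and `stamp_matches_head` are hypothetical, and the empty-stamp guard is an addition of this sketch (an empty string is a prefix of everything), not something this review states about `assert_upgraded`.

```python
def chaos_commit_of(chaos_version: str) -> str:
    """Strip the '+U' untracked-overlay marker: '7ae7b0f7+U' -> '7ae7b0f7'."""
    return chaos_version.split("+", 1)[0]


def stamp_matches_head(chaos_version: str, head_ref: str) -> bool:
    """Prefix-match the deployed commit against the PR head, in either direction.

    Either side may be an abbreviated hash, so a strict equality test would
    reject a correct deploy whose label holds only the short form.
    """
    commit = chaos_commit_of(chaos_version)
    if not commit or not head_ref:
        return False  # sketch-only guard, see lead-in
    return head_ref.startswith(commit) or commit.startswith(head_ref)


if __name__ == "__main__":
    head = "7ae7b0f76efb"
    # Values taken from this review: the correct stamp and the rolled-back one.
    print(stamp_matches_head("7ae7b0f7+U", head))
    print(stamp_matches_head("eb96de94+U", head))
```

With the values from this phase, the head stamp `7ae7b0f7+U` matches `7ae7b0f76efb` and the base stamp `eb96de94+U` does not, which is the distinction HC1 has to keep enforcing.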

**One nuance (not a blocker):** The "06-05→06-10 change" being specifically "heavier resident load from rcust-phase stacks" is circumstantially supported by the timeline, but repro1 (isolated, no concurrent apps) also showed drift — the mechanism fires under general memory pressure during discourse's precompile, not only when other apps are warm. The exact delta between run 184 (06-05, passed) and subsequent runs is intermittency of memory pressure, proven by repro2 (warm volumes → faster precompile → task survived) vs repro4 (fresh boot → slower precompile → task failed). The ROOT CAUSE mechanism is proven by direct evidence; the specific "what changed between 06-05 and 06-10" reduces to: heavier/more-variable memory pressure, the mechanism was always latent. This doesn't weaken M1 — the fix eliminates the exposure.

**Verdict: M1 PASS.** Root cause attributed by direct evidence; minimal reproducible demonstration confirmed; fix (stop-first overlay + assert_upgrade_converged) implemented and working; HC1 unweakened; blast-radius sweep complete. Builder cleared to proceed to M2.

---

## M2: PASS @2026-06-11T17:58Z

Cold verification from `/srv/cc-ci/cc-ci-adv`. JOURNAL-dstamp not read before verdict (anti-anchoring).

**Check 1 — Build 450 results (level, tiers, flags):** PASS
`cat /var/lib/cc-ci-runs/450/results.json`:

- `"level": 5` ✓
- `"recipe": "discourse"`, `"ref": "7ae7b0f76efb"`, `"pr": "2"` ✓
- All tiers: `"install": "pass"`, `"upgrade": "pass"`, `"backup": "pass"`, `"restore": "pass"`, `"custom": "pass"` ✓
- All rungs: `"install": "pass"`, `"upgrade": "pass"`, `"backup_restore": "pass"`, `"functional": "pass"`, `"lint": "pass"` ✓
- `"clean_teardown": true`, `"no_secret_leak": true` ✓
- Timestamp: `"finished": 1781199631.4...` (2026-06-11 ~17:40 UTC) ✓
- `screenshot.png` present (discourse functional screenshot)

**Check 2 — JUnit XML: test_upgrade_reconverges PASS (HC1 satisfied):** PASS
`grep -c '` (no `` child).
`test_upgrade_reconverges` directly calls `generic.assert_upgraded(live_app, meta)`. `assert_upgraded` at `generic.py:174-175` does the HC1 commit-match: `chaos_commit == head_ref`. Test PASSED → `chaos_commit = 7ae7b0f7` matched `head_ref = 7ae7b0f7` ✓

**Check 3 — PR comment 14347 (!testme path):** PASS
Comment 14346 body = `!testme` (the trigger).
Comment 14347 body (bot response): `\n🌻 **cc-ci** — \`discourse\` @ \`7ae7b0f7\` ✅ **passed**\n[...links to run 450 summary.png + badge + drone build 450...]`
Confirmed via Gitea API. Run directory `/var/lib/cc-ci-runs/450/` exists with full contents.
!testme → bridge ack → drone build 450 → run 450 results → PR comment ✅ passed. Path verified.

**Check 4 — DEFERRED entry closed:** PASS
`machine-docs/DEFERRED.md` lines 346-366: ✅ RESOLVED @2026-06-11 (phase dstamp, Builder) with:

- Root cause narrative (rollback mechanism)
- Direct evidence pointer (dstamp-repro4.console.log)
- Fix commits (0cc31a5 + e9c26c7)
- Real CI proof (drone build #450, LEVEL 5)
- Blast-radius note (only discourse; harness guard covers all rollback-policy recipes)
- Cross-references (STATUS/JOURNAL/REVIEW-dstamp)

**Check 5 — HC1 teeth (wrong stamp still FAILs):** PASS

*Negative control (pre-fix, existing run):* `m2p-discourse/results.json` shows HC1 caught wrong stamp:
`AssertionError: upgrade deployed chaos commit 'eb96de94+U', not the intended PR-head '7ae7b0f76efb' — the re-checkout to the code under test failed, so the upgrade is not exercising the PR's changes (HC1)`
This is HC1 raising on `eb96de94 ≠ 7ae7b0f7`. HC1 commit-match assertion WORKS.

*Code unchanged (from M1):* `generic.py:174-175` commit-match assertion unmodified. The fix adds `assert_upgrade_converged` BEFORE `assert_upgraded` — it catches rollback EARLIER with an honest message but does NOT bypass HC1. If a non-rollback wrong stamp were deployed (e.g.
abra bug stamping wrong commit), `assert_upgrade_converged` would see `completed` and pass, then HC1 would FAIL on the commit mismatch.

*Post-fix rollback path:* `assert_upgrade_converged` raises `RuntimeError` on `rollback_completed` → upgrade FAILS with honest "head did not stay healthy" → HC1 doesn't even run but test is RED.

Both paths (rollback → caught by assert_upgrade_converged; wrong stamp without rollback → caught by HC1) still FAIL. The pre-fix negative controls (m2p-discourse, repro1, repro4) demonstrate the wrong-stamp path is always caught; the fix only changes HOW it's reported and at which point.

**Blast-radius (confirmed at M1, still valid):** Only discourse affected. keycloak/n8n PASS L4 in 06-10/06-11 era. General `assert_upgrade_converged` guard now covers all rollback-policy recipes.

**Phase DoD summary:**

- ✅ Drift mechanism attributed with reproducible evidence (repro4 direct evidence)
- ✅ Fixed at the true root (stop-first overlay + assert_upgrade_converged)
- ✅ Discourse back at real level in real CI via drone !testme (build 450, LEVEL 5)
- ✅ No other recipe silently affected (blast-radius sweep, keycloak/n8n PASS)
- ✅ HC1 unweakened and adversarially re-proven (m2p-discourse negative control + code inspection)
- ✅ DEFERRED closed with pointers

**Verdict: M2 PASS. All phase dstamp DoD items satisfied. Builder cleared for ## DONE.**