# REVIEW-dstamp.md — Adversary verdicts for phase `dstamp` Phase: investigate & solve the discourse abra-stamp drift (upgrade-HC1 stamps the prev-base tag commit instead of the PR-head version, harness-neutral, since ~06-10). SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2. Verdict log is append-only. `review(...)`-prefixed commits carry verdicts (load-bearing watchdog signal). Findings filed under `## Adversary findings` in BACKLOG-dstamp.md. --- ## Prep notes (NOT a verdict — no gate claimed yet) @2026-06-11T15:5x Recon done cold before any Builder claim, to make M1/M2 verification fast and independent. Anti-anchoring: formed only from the plan (SSOT), the harness code, and direct host evidence — no dstamp JOURNAL exists yet; none read. **Stamp mechanism (from code):** HC1's "stamp" = the `coop-cloud..chaos-version` docker service label abra writes on a `--chaos` deploy = the deployed recipe git commit (`runner/harness/lifecycle.py:468 deployed_identity`, `runner/harness/generic.py:146 assert_upgraded`). Upgrade flow (`generic.py:226 perform_upgrade`): deploy prev-published base → `recipe_checkout_ref(recipe, head_ref)` (git checkout -f head) → `chaos_redeploy` (`abra app deploy --chaos`). HC1 asserts `chaos_commit == head_ref` (after stripping the `+U` untracked-overlay marker). PASS requires the chaos-version to equal the PR head. **Cold observable facts (from `/var/lib/cc-ci-runs/m2p-discourse/abra/recipes/discourse` snapshot + live `~/.abra/recipes/discourse` on cc-ci, 2026-06-11):** - Recipe HEAD `7ae7b0f` = "chore: upgrade to 0.9.0+3.5.0"; `git describe --tags` = `0.7.0+3.3.1-9-g7ae7b0f` → HEAD is **9 commits past the newest annotated tag** `0.7.0+3.3.1` (commit `eb96de9`). No `0.8.x`/`0.9.x` tag exists. - The drift symptom (per plan): chaos-version stamped `eb96de94+U` = the **prev-base tag commit** (= the upgrade base `0.7.0+3.3.1`), NOT the PR-head `7ae7b0f`. - abra is **nix-pinned**: `abra version 0.13.0-beta-06a57de`, store path under `/run/current-system` → binary drift requires a flake.lock/nixos-generation bump between 06-05 and 06-10 (verify against generations, don't assume). **Open question I'll independently re-derive when M1 is claimed:** why the `--chaos` redeploy after checkout-to-HEAD stamps the BASE commit (eb96de9), not HEAD (7ae7b0f). Candidates to test cold: (a) re-checkout to head silently reverted (abra fetch/reset during deploy); (b) abra chaos resolves the version from the app's recorded `.env` RECIPE/version (= the base) rather than the working-tree HEAD; (c) the "env drift" since 06-10 = recipe/ mirror git state moved (unreleased commits pushed past last tag) or a tag re-pointed. **Guardrail teeth I will enforce at M2:** HC1 must still FAIL on a genuinely wrong stamp (synthesize a wrong-version deploy and show RED). Any "fix" that derives EXPECTED from "what makes the test pass" rather than abra's documented behavior = automatic FAIL. Status: idle, awaiting Builder to seed STATUS-dstamp.md and claim M1. Watchdog will ping on the `claim(...)` commit. --- ## Independent probe findings @2026-06-11T17:3x (NOT a verdict — no M1 claim yet) Anti-anchoring preserved: JOURNAL-dstamp NOT read. Root cause derived independently from harness code, per-run artifacts (repro1/repro2 console logs), and direct docker service inspect on cc-ci. Independently arrived at the same attribution as the Builder. **Causal chain derived from code + direct evidence:** 1. `provide_ccci_overlay` (rcust-era addition) copies `compose.ccci.yml` into the per-run recipe dir as an UNTRACKED file. Absent in run 184 (2026-06-05, which used the old `install_steps.sh` path writing to canonical `~/.abra`) — consistent with run 184 having no `+U` suffix and passing. The `+U` itself is stripped by HC1's `chaos_commit.split("+",1)[0]` and is NOT the cause of drift. 2. abra reads `git HEAD = 7ae7b0f` and computes `chaos-version = 7ae7b0f7+U` CORRECTLY. Confirmed via three bail-at-secrets manual repros + repro2 debug line `taking chaos version: 7ae7b0f7+U`. abra and the per-run git checkout are EXONERATED. 3. `chaos_redeploy` passes `-c` (no_converge_checks) → `docker stack deploy` returns immediately; Swarm rolling update runs asynchronously. 4. Discourse `compose.yml` (BOTH base `eb96de94` AND PR-head `7ae7b0f`) sets `deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }` on the `app` service. Confirmed by direct `docker service inspect disc-ae10f0_..._app`. 5. With `order: start-first`, OLD + NEW task co-reside (~2× memory). Discourse's Rails/Sidekiq precompile is memory-heavy; under the heavier host load since ~06-10 (warm keycloak and other rcust-phase stacks), the NEW task intermittently fails swarm's 5s update monitor → `failure_action: rollback` fires → Swarm REVERTS the app service spec to PreviousSpec (base deploy, `chaos-version=eb96de94+U`). 6. `services_converged` blind spot: after rollback `UpdateStatus.State = "rollback_completed"`, NOT in the blocking set `("updating", "rollback_started")` → returns True as if converged. Under start-first the OLD task kept serving → `wait_healthy` also passes on the rolled-back spec. 7. `deployed_identity` reads `.Spec.Labels` → rolled-back spec → `chaos-version=eb96de94+U`. HC1 asserts head_ref `7ae7b0f76efb` ≠ `eb96de94` → FAIL with misleading "re-checkout failed". **Key disproving evidence (independent route):** repro1 was isolated (no concurrent discourse run, domain `disc-ae10f0` used for the first time) and STILL showed the drift. This refuted the pure-concurrency hypothesis BEFORE reading the Builder's evidence or JOURNAL. **Intermittency explained (run 184 ✓ solo 06-05; clustered/repro1/repro4 ✗; repro2 ✓):** Whether the new start-first task survives the 5s monitor depends on momentary memory pressure. Run 184: solo + lighter host load + pre-rcust overlay path → new task survived. repro2: warm volumes/containers from repro1 → faster Rails precompile → task survived. The "since ~06-10 on every run" pattern = heavier baseline load from warm rcust-phase stacks after run 184. **Fix analysis (Builder commit 0cc31a5 — read before JOURNAL):** *Part 1 — overlay `order: stop-first`*: Old task stops before new starts → new boots with full host memory → no OOM under the 5s monitor → no spurious rollback. `failure_action: rollback` intentionally preserved so a genuinely broken head still rolls back and is caught. ASSESSMENT: **CORRECT AND SUFFICIENT** for eliminating the spurious-rollback trigger. *Part 2 — `lifecycle.assert_upgrade_converged`*: Called in `perform_upgrade` immediately after `chaos_redeploy`, before `wait_healthy`. Polls `docker service inspect --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}'` until terminal. Returns on `""|"none"|"completed"`; raises on `"rollback_completed"|"rollback_paused"|"paused"`; polls on `"updating"|"rollback_started"`; times out at `meta.DEPLOY_TIMEOUT`. ASSESSMENT: **CORRECT** — closes the wait_healthy-masking blind spot. Makes a swarm rollback an HONEST upgrade failure ("head did not stay healthy") rather than a misreported stamp mismatch. HC1 commit-match logic is unchanged; this only makes the rollback visible before HC1 runs. **One concern flagged (not a blocker — defense-in-depth covers it):** `assert_upgrade_converged` has a theoretical race window: on the very first poll, Docker may not yet have transitioned from a prior `"completed"` state to `"updating"` (tiny gap between `docker stack deploy` returning and the Swarm manager scheduling the roll). If the race fires, the function returns OK on `"none"`, then the rollback happens silently afterward. Mitigation: with `stop-first` (fix part 1), a post-assert-converged rollback leaves NO serving task during the rollback → `wait_healthy` also FAILS → the test result is still FAIL, just with a less specific error ("wait_healthy timeout" rather than "swarm rolled back"). HC1 is NOT weakened even if the race fires. No action required unless a recipe uses `start-first` where a post-race rollback could masquerade as a clean upgrade. **Status:** no `claim(dstamp)` commit yet. Awaiting M1 claim to issue formal verdict.