cc-ci/REVIEW-dstamp.md

# REVIEW-dstamp.md — Adversary verdicts for phase `dstamp`

Phase: investigate & solve the discourse abra-stamp drift (upgrade-HC1 stamps the
prev-base tag commit instead of the PR-head version, harness-neutral, since ~06-10).
SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.

Verdict log is append-only. `review(...)`-prefixed commits carry verdicts (load-bearing
watchdog signal). Findings filed under `## Adversary findings` in BACKLOG-dstamp.md.

---

## Prep notes (NOT a verdict — no gate claimed yet) @2026-06-11T15:5x

Recon done cold before any Builder claim, to make M1/M2 verification fast and independent.
Anti-anchoring: these notes draw only on the plan (SSOT), the harness code, and direct host
evidence; no dstamp JOURNAL exists yet, so none was read.

**Stamp mechanism (from code):** HC1's "stamp" = the `coop-cloud.<stack>.chaos-version`
docker service label abra writes on a `--chaos` deploy = the deployed recipe git commit
(`runner/harness/lifecycle.py:468 deployed_identity`, `runner/harness/generic.py:146
assert_upgraded`). Upgrade flow (`generic.py:226 perform_upgrade`): deploy prev-published
base → `recipe_checkout_ref(recipe, head_ref)` (git checkout -f head) → `chaos_redeploy`
(`abra app deploy --chaos`). HC1 asserts `chaos_commit == head_ref` (after stripping the
`+U` untracked-overlay marker). PASS requires the chaos-version to equal the PR head.
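
A minimal sketch of the comparison described above, for reference while verifying M1/M2. The function name `stamp_matches_head` is hypothetical, and the prefix comparison is an assumption: the notes show an 8-character stamp next to a 12-character `head_ref`, so strict string equality may not be what the real `assert_upgraded` does.

```python
def stamp_matches_head(chaos_version: str, head_ref: str) -> bool:
    """Return True when the stamped commit identifies the PR-head commit.

    Hypothetical helper; not the harness's actual assert_upgraded.
    """
    # Drop abra's "+U" untracked-overlay marker, mirroring split("+", 1)[0].
    commit = chaos_version.split("+", 1)[0]
    if not commit or not head_ref:
        return False
    # Assumption: the stamp and head_ref may be abbreviated to different
    # lengths, so one must be a prefix of the other.
    return head_ref.startswith(commit) or commit.startswith(head_ref)


if __name__ == "__main__":
    # Values taken from the observations in this log.
    print(stamp_matches_head("7ae7b0f7+U", "7ae7b0f76efb"))  # head stamp
    print(stamp_matches_head("eb96de94+U", "7ae7b0f76efb"))  # base stamp
```

The second call is the negative control: a base-commit stamp must never satisfy the check, whatever the overlay marker says.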

**Cold observable facts (from `/var/lib/cc-ci-runs/m2p-discourse/abra/recipes/discourse`
snapshot + live `~/.abra/recipes/discourse` on cc-ci, 2026-06-11):**
- Recipe HEAD `7ae7b0f` = "chore: upgrade to 0.9.0+3.5.0"; `git describe --tags` =
  `0.7.0+3.3.1-9-g7ae7b0f` → HEAD is **9 commits past the newest annotated tag**
  `0.7.0+3.3.1` (commit `eb96de9`). No `0.8.x`/`0.9.x` tag exists.
- The drift symptom (per plan): chaos-version stamped `eb96de94+U` = the **prev-base tag
  commit** (= the upgrade base `0.7.0+3.3.1`), NOT the PR-head `7ae7b0f`.
- abra is **nix-pinned**: `abra version 0.13.0-beta-06a57de`, store path under
  `/run/current-system` → binary drift requires a flake.lock/nixos-generation bump between
  06-05 and 06-10 (verify against generations, don't assume).
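
The `git describe --tags` reading above can be checked mechanically. The sketch below splits the output into tag, commits-ahead count, and abbreviated hash; `parse_describe` and `Describe` are hypothetical names for illustration, not part of the harness.

```python
import re
from typing import NamedTuple


class Describe(NamedTuple):
    tag: str
    commits_ahead: int
    short_sha: str


# git describe prints "<tag>-<n>-g<abbrev>" when HEAD is past the tag.
_DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<ahead>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(text: str) -> Describe:
    """Parse `git describe --tags` output (hypothetical helper)."""
    text = text.strip()
    match = _DESCRIBE_RE.match(text)
    if match is None:
        # HEAD sits exactly on a tag: git prints the bare tag name.
        return Describe(text, 0, "")
    return Describe(match["tag"], int(match["ahead"]), match["sha"])


if __name__ == "__main__":
    print(parse_describe("0.7.0+3.3.1-9-g7ae7b0f"))
```

A non-zero `commits_ahead` is the condition that matters here: the PR head is an untagged commit, so any version resolution that falls back to the nearest tag lands on the base.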

**Open question I'll independently re-derive when M1 is claimed:** why the `--chaos`
redeploy after checkout-to-HEAD stamps the BASE commit (eb96de9), not HEAD (7ae7b0f).
Candidates to test cold: (a) re-checkout to head silently reverted (abra fetch/reset during
deploy); (b) abra chaos resolves the version from the app's recorded `.env` RECIPE/version
(= the base) rather than the working-tree HEAD; (c) the "env drift" since 06-10 = recipe/
mirror git state moved (unreleased commits pushed past last tag) or a tag re-pointed.

**Guardrail teeth I will enforce at M2:** HC1 must still FAIL on a genuinely wrong stamp
(synthesize a wrong-version deploy and show RED). Any "fix" that derives EXPECTED from
"what makes the test pass" rather than abra's documented behavior = automatic FAIL.

Status: idle, awaiting Builder to seed STATUS-dstamp.md and claim M1. Watchdog will ping
on the `claim(...)` commit.

---

## Independent probe findings @2026-06-11T17:3x (NOT a verdict — no M1 claim yet)

Anti-anchoring preserved: JOURNAL-dstamp NOT read. Root cause derived independently from
harness code, per-run artifacts (repro1/repro2 console logs), and direct docker service
inspect on cc-ci. The derivation below arrives independently at the same attribution as the Builder.

**Causal chain derived from code + direct evidence:**

1. `provide_ccci_overlay` (rcust-era addition) copies `compose.ccci.yml` into the per-run
   recipe dir as an UNTRACKED file. Absent in run 184 (2026-06-05, which used the old
   `install_steps.sh` path writing to canonical `~/.abra`) — consistent with run 184 having
   no `+U` suffix and passing. The `+U` itself is stripped by HC1's `chaos_commit.split("+",1)[0]`
   and is NOT the cause of drift.

2. abra reads `git HEAD = 7ae7b0f` and computes `chaos-version = 7ae7b0f7+U` CORRECTLY.
   Confirmed via three bail-at-secrets manual repros + repro2 debug line
   `taking chaos version: 7ae7b0f7+U`. abra and the per-run git checkout are EXONERATED.

3. `chaos_redeploy` passes `-c` (no_converge_checks) → `docker stack deploy` returns
   immediately; Swarm rolling update runs asynchronously.

4. Discourse `compose.yml` (BOTH base `eb96de94` AND PR-head `7ae7b0f`) sets
   `deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }`
   on the `app` service. Confirmed by direct `docker service inspect disc-ae10f0_..._app`.

5. With `order: start-first`, OLD + NEW task co-reside (~2× memory). Discourse's
   Rails/Sidekiq precompile is memory-heavy; under the heavier host load since ~06-10
   (warm keycloak and other rcust-phase stacks), the NEW task intermittently fails swarm's
   5s update monitor → `failure_action: rollback` fires → Swarm REVERTS the app service
   spec to PreviousSpec (base deploy, `chaos-version=eb96de94+U`).

6. `services_converged` blind spot: after a rollback, `UpdateStatus.State = "rollback_completed"`,
   which is NOT in the blocking set `("updating", "rollback_started")` → the check returns True
   as if converged. Under start-first the OLD task kept serving → `wait_healthy` also passes on
   the rolled-back spec.

7. `deployed_identity` reads `.Spec.Labels` → rolled-back spec → `chaos-version=eb96de94+U`.
   HC1 asserts head_ref `7ae7b0f76efb` ≠ `eb96de94` → FAIL with misleading "re-checkout failed".
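
Steps 6 and 7 can be reproduced offline against a hand-built service record. This is a sketch under stated assumptions: `naive_converged` and `stamped_commit` are hypothetical stand-ins for `services_converged` and `deployed_identity`, the stack name in the label key is invented, and the dict only mimics the `.Spec.Labels` and `.UpdateStatus.State` fields of `docker service inspect` output.

```python
BLOCKING_STATES = ("updating", "rollback_started")
HEAD_REF = "7ae7b0f76efb"

# Hand-built stand-in for the app service after Swarm has rolled back.
ROLLED_BACK_SERVICE = {
    "Spec": {"Labels": {"coop-cloud.example_stack.chaos-version": "eb96de94+U"}},
    "UpdateStatus": {"State": "rollback_completed"},
}


def naive_converged(update_state: str) -> bool:
    """Blind spot: anything outside the blocking set counts as converged."""
    return update_state not in BLOCKING_STATES


def stamped_commit(service: dict) -> str:
    """Read the chaos-version label and strip the +U overlay marker."""
    for key, value in service["Spec"]["Labels"].items():
        if key.startswith("coop-cloud.") and key.endswith(".chaos-version"):
            return value.split("+", 1)[0]
    raise KeyError("no chaos-version label on service")


if __name__ == "__main__":
    state = ROLLED_BACK_SERVICE["UpdateStatus"]["State"]
    print("converged?", naive_converged(state))
    print("stamp:", stamped_commit(ROLLED_BACK_SERVICE), "head:", HEAD_REF)
```

The convergence check accepts the rolled-back service, and the label then reports the base commit, which is exactly the pair of observations that surfaces as a stamp mismatch in HC1.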

**Key disproving evidence (independent route):** repro1 was isolated (no concurrent discourse
run, domain `disc-ae10f0` used for the first time) and STILL showed the drift. This refuted
the pure-concurrency hypothesis BEFORE reading the Builder's evidence or JOURNAL.

**Intermittency explained (run 184 ✓ solo 06-05; clustered/repro1/repro4 ✗; repro2 ✓):**
Whether the new start-first task survives the 5s monitor depends on momentary memory pressure.
Run 184: solo + lighter host load + pre-rcust overlay path → new task survived. repro2: warm
volumes/containers from repro1 → faster Rails precompile → task survived. The "since ~06-10
on every run" pattern = heavier baseline load from warm rcust-phase stacks after run 184.

**Fix analysis (Builder commit 0cc31a5 — read before JOURNAL):**

*Part 1 — overlay `order: stop-first`*: Old task stops before new starts → new boots with full
host memory → no OOM under the 5s monitor → no spurious rollback. `failure_action: rollback`
intentionally preserved so a genuinely broken head still rolls back and is caught.
ASSESSMENT: **CORRECT AND SUFFICIENT** for eliminating the spurious-rollback trigger.

*Part 2 — `lifecycle.assert_upgrade_converged`*: Called in `perform_upgrade` immediately after
`chaos_redeploy`, before `wait_healthy`. Polls `docker service inspect
--format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}'` until terminal.
Returns on `""|"none"|"completed"`; raises on `"rollback_completed"|"rollback_paused"|"paused"`;
polls on `"updating"|"rollback_started"`; times out at `meta.DEPLOY_TIMEOUT`.
ASSESSMENT: **CORRECT** — closes the wait_healthy-masking blind spot. Makes a swarm rollback
an HONEST upgrade failure ("head did not stay healthy") rather than a misreported stamp mismatch.
HC1 commit-match logic is unchanged; this only makes the rollback visible before HC1 runs.
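
For cross-checking the Builder's implementation at M2, the polling contract described above can be sketched as follows. This is not the code in commit 0cc31a5: `wait_for_update`, `UpgradeRolledBack`, and `docker_update_state` are hypothetical names, the state reader is injected so the loop can be exercised without Docker, and treating unrecognised states as "keep polling" is an assumption.

```python
import subprocess
import time
from typing import Callable

OK_STATES = ("", "none", "completed")
FAILED_STATES = ("rollback_completed", "rollback_paused", "paused")


class UpgradeRolledBack(RuntimeError):
    """The rolling update ended in a rollback or paused state."""


def docker_update_state(service: str) -> str:
    """Read UpdateStatus.State using the format string quoted in this log."""
    fmt = "{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}"
    result = subprocess.run(
        ["docker", "service", "inspect", "--format", fmt, service],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_for_update(
    read_state: Callable[[], str],
    timeout_s: float,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the update reaches a terminal state or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = read_state().strip()
        if state in OK_STATES:
            return state
        if state in FAILED_STATES:
            raise UpgradeRolledBack(f"swarm update ended in state {state!r}")
        # "updating", "rollback_started", or (assumption) an unknown state.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"update still {state!r} after {timeout_s}s")
        sleep(interval_s)
```

In a real run the reader would be something like `lambda: docker_update_state(service_name)`; in a test it can be a canned sequence of states, which is how the rollback branch can be shown RED without deploying anything.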

**One concern flagged (not a blocker — defense-in-depth covers it):**
`assert_upgrade_converged` has a theoretical race window: on the very first poll, Docker may
not yet have transitioned from its prior state (`"completed"`, or `"none"` if the service has
never been updated) to `"updating"` (tiny gap between `docker stack deploy` returning and the
Swarm manager scheduling the roll). If the race fires, the function returns OK on that stale
state, then the rollback happens silently afterward.
Mitigation: with `stop-first` (fix part 1), a post-assert-converged rollback leaves NO serving
task during the rollback → `wait_healthy` also FAILS → the test result is still FAIL, just
with a less specific error ("wait_healthy timeout" rather than "swarm rolled back"). HC1 is
NOT weakened even if the race fires. No action required unless a recipe uses `start-first`
where a post-race rollback could masquerade as a clean upgrade.

**Status:** no `claim(dstamp)` commit yet. Awaiting M1 claim to issue formal verdict.