9.0 KiB
REVIEW-dstamp.md — Adversary verdicts for phase dstamp
Phase: investigate & solve the discourse abra-stamp drift (upgrade-HC1 stamps the
prev-base tag commit instead of the PR-head version, harness-neutral, since ~06-10).
SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-dstamp-discourse-drift.md. Gates M1, M2.
Verdict log is append-only. review(...)-prefixed commits carry verdicts (load-bearing
watchdog signal). Findings filed under ## Adversary findings in BACKLOG-dstamp.md.
Prep notes (NOT a verdict — no gate claimed yet) @2026-06-11T15:5x
Recon done cold before any Builder claim, to make M1/M2 verification fast and independent. Anti-anchoring: formed only from the plan (SSOT), the harness code, and direct host evidence — no dstamp JOURNAL exists yet; none read.
Stamp mechanism (from code): HC1's "stamp" = the coop-cloud.<stack>.chaos-version
docker service label abra writes on a --chaos deploy = the deployed recipe git commit
(runner/harness/lifecycle.py:468 deployed_identity, runner/harness/generic.py:146 assert_upgraded). Upgrade flow (generic.py:226 perform_upgrade): deploy prev-published
base → recipe_checkout_ref(recipe, head_ref) (git checkout -f head) → chaos_redeploy
(abra app deploy --chaos). HC1 asserts chaos_commit == head_ref (after stripping the
+U untracked-overlay marker). PASS requires the chaos-version to equal the PR head.
Cold observable facts (from /var/lib/cc-ci-runs/m2p-discourse/abra/recipes/discourse
snapshot + live ~/.abra/recipes/discourse on cc-ci, 2026-06-11):
- Recipe HEAD
7ae7b0f= "chore: upgrade to 0.9.0+3.5.0";git describe --tags=0.7.0+3.3.1-9-g7ae7b0f→ HEAD is 9 commits past the newest annotated tag0.7.0+3.3.1(commiteb96de9). No0.8.x/0.9.xtag exists. - The drift symptom (per plan): chaos-version stamped
eb96de94+U= the prev-base tag commit (= the upgrade base0.7.0+3.3.1), NOT the PR-head7ae7b0f. - abra is nix-pinned:
abra version 0.13.0-beta-06a57de, store path under/run/current-system→ binary drift requires a flake.lock/nixos-generation bump between 06-05 and 06-10 (verify against generations, don't assume).
Open question I'll independently re-derive when M1 is claimed: why the --chaos
redeploy after checkout-to-HEAD stamps the BASE commit (eb96de9), not HEAD (7ae7b0f).
Candidates to test cold: (a) re-checkout to head silently reverted (abra fetch/reset during
deploy); (b) abra chaos resolves the version from the app's recorded .env RECIPE/version
(= the base) rather than the working-tree HEAD; (c) the "env drift" since 06-10 = recipe/
mirror git state moved (unreleased commits pushed past last tag) or a tag re-pointed.
Guardrail teeth I will enforce at M2: HC1 must still FAIL on a genuinely wrong stamp (synthesize a wrong-version deploy and show RED). Any "fix" that derives EXPECTED from "what makes the test pass" rather than abra's documented behavior = automatic FAIL.
Status: idle, awaiting Builder to seed STATUS-dstamp.md and claim M1. Watchdog will ping
on the claim(...) commit.
Independent probe findings @2026-06-11T17:3x (NOT a verdict — no M1 claim yet)
Anti-anchoring preserved: JOURNAL-dstamp NOT read. Root cause derived independently from harness code, per-run artifacts (repro1/repro2 console logs), and direct docker service inspect on cc-ci. Independently arrived at the same attribution as the Builder.
Causal chain derived from code + direct evidence:
-
provide_ccci_overlay(rcust-era addition) copiescompose.ccci.ymlinto the per-run recipe dir as an UNTRACKED file. Absent in run 184 (2026-06-05, which used the oldinstall_steps.shpath writing to canonical~/.abra) — consistent with run 184 having no+Usuffix and passing. The+Uitself is stripped by HC1'schaos_commit.split("+",1)[0]and is NOT the cause of drift. -
abra reads
git HEAD = 7ae7b0fand computeschaos-version = 7ae7b0f7+UCORRECTLY. Confirmed via three bail-at-secrets manual repros + repro2 debug linetaking chaos version: 7ae7b0f7+U. abra and the per-run git checkout are EXONERATED. -
chaos_redeploypasses-c(no_converge_checks) →docker stack deployreturns immediately; Swarm rolling update runs asynchronously. -
Discourse
compose.yml(BOTH baseeb96de94AND PR-head7ae7b0f) setsdeploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }on theappservice. Confirmed by directdocker service inspect disc-ae10f0_..._app. -
With
order: start-first, OLD + NEW task co-reside (~2× memory). Discourse's Rails/Sidekiq precompile is memory-heavy; under the heavier host load since ~06-10 (warm keycloak and other rcust-phase stacks), the NEW task intermittently fails swarm's 5s update monitor →failure_action: rollbackfires → Swarm REVERTS the app service spec to PreviousSpec (base deploy,chaos-version=eb96de94+U). -
services_convergedblind spot: after rollbackUpdateStatus.State = "rollback_completed", NOT in the blocking set("updating", "rollback_started")→ returns True as if converged. Under start-first the OLD task kept serving →wait_healthyalso passes on the rolled-back spec. -
deployed_identityreads.Spec.Labels→ rolled-back spec →chaos-version=eb96de94+U. HC1 asserts head_ref7ae7b0f76efb≠eb96de94→ FAIL with misleading "re-checkout failed".
Key disproving evidence (independent route): repro1 was isolated (no concurrent discourse
run, domain disc-ae10f0 used for the first time) and STILL showed the drift. This refuted
the pure-concurrency hypothesis BEFORE reading the Builder's evidence or JOURNAL.
Intermittency explained (run 184 ✓ solo 06-05; clustered/repro1/repro4 ✗; repro2 ✓): Whether the new start-first task survives the 5s monitor depends on momentary memory pressure. Run 184: solo + lighter host load + pre-rcust overlay path → new task survived. repro2: warm volumes/containers from repro1 → faster Rails precompile → task survived. The "since ~06-10 on every run" pattern = heavier baseline load from warm rcust-phase stacks after run 184.
Fix analysis (Builder commit 0cc31a5 — read before JOURNAL):
Part 1 — overlay order: stop-first: Old task stops before new starts → new boots with full
host memory → no OOM under the 5s monitor → no spurious rollback. failure_action: rollback
intentionally preserved so a genuinely broken head still rolls back and is caught.
ASSESSMENT: CORRECT AND SUFFICIENT for eliminating the spurious-rollback trigger.
Part 2 — lifecycle.assert_upgrade_converged: Called in perform_upgrade immediately after
chaos_redeploy, before wait_healthy. Polls docker service inspect --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}' until terminal.
Returns on ""|"none"|"completed"; raises on "rollback_completed"|"rollback_paused"|"paused";
polls on "updating"|"rollback_started"; times out at meta.DEPLOY_TIMEOUT.
ASSESSMENT: CORRECT — closes the wait_healthy-masking blind spot. Makes a swarm rollback
an HONEST upgrade failure ("head did not stay healthy") rather than a misreported stamp mismatch.
HC1 commit-match logic is unchanged; this only makes the rollback visible before HC1 runs.
One concern flagged (not a blocker — defense-in-depth covers it):
assert_upgrade_converged has a theoretical race window: on the very first poll, Docker may
not yet have transitioned from a prior "completed" state to "updating" (tiny gap between
docker stack deploy returning and the Swarm manager scheduling the roll). If the race fires,
the function returns OK on "none", then the rollback happens silently afterward.
Mitigation: with stop-first (fix part 1), a post-assert-converged rollback leaves NO serving
task during the rollback → wait_healthy also FAILS → the test result is still FAIL, just
with a less specific error ("wait_healthy timeout" rather than "swarm rolled back"). HC1 is
NOT weakened even if the race fires. No action required unless a recipe uses start-first
where a post-race rollback could masquerade as a clean upgrade.
UPDATE — race concern CLOSED by Builder (commit e9c26c7 harden(dstamp)):
Builder addressed the race with a 2-phase protocol:
- Pre-redeploy:
update_status_started(domain)snapshotsUpdateStatus.StartedAt. - Phase 1: polls until
StartedAtadvances past the snapshot (new update scheduled) OR state is"updating"/"rollback_started". 30s grace: if no new update appears → no-op redeploy, nothing to converge. - Phase 2: now that the NEW update is confirmed in flight, waits for terminal state (same logic as before, but with confidence it's the right update). Assessment: CORRECT AND COMPLETE. Phase 1 deterministically distinguishes the new update from stale base-deploy terminal state. No new failure modes introduced. The grace period (30s) is generous relative to Docker's near-immediate scheduling. Race concern fully closed.
Status: no claim(dstamp) commit yet. Awaiting M1 claim to issue formal verdict.