18 KiB
REVIEW-dstamp.md — Adversary verdicts for phase dstamp
Phase: investigate & solve the discourse abra-stamp drift (upgrade-HC1 stamps the
prev-base tag commit instead of the PR-head version, harness-neutral, since ~06-10).
SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-dstamp-discourse-drift.md. Gates M1, M2.
Verdict log is append-only. review(...)-prefixed commits carry verdicts (load-bearing
watchdog signal). Findings filed under ## Adversary findings in BACKLOG-dstamp.md.
Prep notes (NOT a verdict — no gate claimed yet) @2026-06-11T15:5x
Recon done cold before any Builder claim, to make M1/M2 verification fast and independent. Anti-anchoring: formed only from the plan (SSOT), the harness code, and direct host evidence — no dstamp JOURNAL exists yet; none read.
Stamp mechanism (from code): HC1's "stamp" = the coop-cloud.<stack>.chaos-version
docker service label abra writes on a --chaos deploy = the deployed recipe git commit
(runner/harness/lifecycle.py:468 deployed_identity, runner/harness/generic.py:146 assert_upgraded). Upgrade flow (generic.py:226 perform_upgrade): deploy prev-published
base → recipe_checkout_ref(recipe, head_ref) (git checkout -f head) → chaos_redeploy
(abra app deploy --chaos). HC1 asserts chaos_commit == head_ref (after stripping the
+U untracked-overlay marker). PASS requires the chaos-version to equal the PR head.
Cold observable facts (from /var/lib/cc-ci-runs/m2p-discourse/abra/recipes/discourse
snapshot + live ~/.abra/recipes/discourse on cc-ci, 2026-06-11):
- Recipe HEAD
7ae7b0f= "chore: upgrade to 0.9.0+3.5.0";git describe --tags=0.7.0+3.3.1-9-g7ae7b0f→ HEAD is 9 commits past the newest annotated tag0.7.0+3.3.1(commiteb96de9). No0.8.x/0.9.xtag exists. - The drift symptom (per plan): chaos-version stamped
eb96de94+U= the prev-base tag commit (= the upgrade base0.7.0+3.3.1), NOT the PR-head7ae7b0f. - abra is nix-pinned:
abra version 0.13.0-beta-06a57de, store path under/run/current-system→ binary drift requires a flake.lock/nixos-generation bump between 06-05 and 06-10 (verify against generations, don't assume).
Open question I'll independently re-derive when M1 is claimed: why the --chaos
redeploy after checkout-to-HEAD stamps the BASE commit (eb96de9), not HEAD (7ae7b0f).
Candidates to test cold: (a) re-checkout to head silently reverted (abra fetch/reset during
deploy); (b) abra chaos resolves the version from the app's recorded .env RECIPE/version
(= the base) rather than the working-tree HEAD; (c) the "env drift" since 06-10 = recipe/
mirror git state moved (unreleased commits pushed past last tag) or a tag re-pointed.
Guardrail teeth I will enforce at M2: HC1 must still FAIL on a genuinely wrong stamp (synthesize a wrong-version deploy and show RED). Any "fix" that derives EXPECTED from "what makes the test pass" rather than abra's documented behavior = automatic FAIL.
Status: idle, awaiting Builder to seed STATUS-dstamp.md and claim M1. Watchdog will ping
on the claim(...) commit.
Independent probe findings @2026-06-11T17:3x (NOT a verdict — no M1 claim yet)
Anti-anchoring preserved: JOURNAL-dstamp NOT read. Root cause derived independently from harness code, per-run artifacts (repro1/repro2 console logs), and direct docker service inspect on cc-ci. Independently arrived at the same attribution as the Builder.
Causal chain derived from code + direct evidence:
-
provide_ccci_overlay(rcust-era addition) copiescompose.ccci.ymlinto the per-run recipe dir as an UNTRACKED file. Absent in run 184 (2026-06-05, which used the oldinstall_steps.shpath writing to canonical~/.abra) — consistent with run 184 having no+Usuffix and passing. The+Uitself is stripped by HC1'schaos_commit.split("+",1)[0]and is NOT the cause of drift. -
abra reads
git HEAD = 7ae7b0fand computeschaos-version = 7ae7b0f7+UCORRECTLY. Confirmed via three bail-at-secrets manual repros + repro2 debug linetaking chaos version: 7ae7b0f7+U. abra and the per-run git checkout are EXONERATED. -
chaos_redeploypasses-c(no_converge_checks) →docker stack deployreturns immediately; Swarm rolling update runs asynchronously. -
Discourse
compose.yml(BOTH baseeb96de94AND PR-head7ae7b0f) setsdeploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }on theappservice. Confirmed by directdocker service inspect disc-ae10f0_..._app. -
With
order: start-first, OLD + NEW task co-reside (~2× memory). Discourse's Rails/Sidekiq precompile is memory-heavy; under the heavier host load since ~06-10 (warm keycloak and other rcust-phase stacks), the NEW task intermittently fails swarm's 5s update monitor →failure_action: rollbackfires → Swarm REVERTS the app service spec to PreviousSpec (base deploy,chaos-version=eb96de94+U). -
services_convergedblind spot: after rollbackUpdateStatus.State = "rollback_completed", NOT in the blocking set("updating", "rollback_started")→ returns True as if converged. Under start-first the OLD task kept serving →wait_healthyalso passes on the rolled-back spec. -
deployed_identityreads.Spec.Labels→ rolled-back spec →chaos-version=eb96de94+U. HC1 asserts head_ref7ae7b0f76efb≠eb96de94→ FAIL with misleading "re-checkout failed".
Key disproving evidence (independent route): repro1 was isolated (no concurrent discourse
run, domain disc-ae10f0 used for the first time) and STILL showed the drift. This refuted
the pure-concurrency hypothesis BEFORE reading the Builder's evidence or JOURNAL.
Intermittency explained (run 184 ✓ solo 06-05; clustered/repro1/repro4 ✗; repro2 ✓): Whether the new start-first task survives the 5s monitor depends on momentary memory pressure. Run 184: solo + lighter host load + pre-rcust overlay path → new task survived. repro2: warm volumes/containers from repro1 → faster Rails precompile → task survived. The "since ~06-10 on every run" pattern = heavier baseline load from warm rcust-phase stacks after run 184.
Fix analysis (Builder commit 0cc31a5 — read before JOURNAL):
Part 1 — overlay order: stop-first: Old task stops before new starts → new boots with full
host memory → no OOM under the 5s monitor → no spurious rollback. failure_action: rollback
intentionally preserved so a genuinely broken head still rolls back and is caught.
ASSESSMENT: CORRECT AND SUFFICIENT for eliminating the spurious-rollback trigger.
Part 2 — lifecycle.assert_upgrade_converged: Called in perform_upgrade immediately after
chaos_redeploy, before wait_healthy. Polls docker service inspect --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}' until terminal.
Returns on ""|"none"|"completed"; raises on "rollback_completed"|"rollback_paused"|"paused";
polls on "updating"|"rollback_started"; times out at meta.DEPLOY_TIMEOUT.
ASSESSMENT: CORRECT — closes the wait_healthy-masking blind spot. Makes a swarm rollback
an HONEST upgrade failure ("head did not stay healthy") rather than a misreported stamp mismatch.
HC1 commit-match logic is unchanged; this only makes the rollback visible before HC1 runs.
One concern flagged (not a blocker — defense-in-depth covers it):
assert_upgrade_converged has a theoretical race window: on the very first poll, Docker may
not yet have transitioned from a prior "completed" state to "updating" (tiny gap between
docker stack deploy returning and the Swarm manager scheduling the roll). If the race fires,
the function returns OK on "none", then the rollback happens silently afterward.
Mitigation: with stop-first (fix part 1), a post-assert-converged rollback leaves NO serving
task during the rollback → wait_healthy also FAILS → the test result is still FAIL, just
with a less specific error ("wait_healthy timeout" rather than "swarm rolled back"). HC1 is
NOT weakened even if the race fires. No action required unless a recipe uses start-first
where a post-race rollback could masquerade as a clean upgrade.
UPDATE — race concern CLOSED by Builder (commit e9c26c7 harden(dstamp)):
Builder addressed the race with a 2-phase protocol:
- Pre-redeploy:
update_status_started(domain)snapshotsUpdateStatus.StartedAt. - Phase 1: polls until
StartedAtadvances past the snapshot (new update scheduled) OR state is"updating"/"rollback_started". 30s grace: if no new update appears → no-op redeploy, nothing to converge. - Phase 2: now that the NEW update is confirmed in flight, waits for terminal state (same logic as before, but with confidence it's the right update). Assessment: CORRECT AND COMPLETE. Phase 1 deterministically distinguishes the new update from stale base-deploy terminal state. No new failure modes introduced. The grace period (30s) is generous relative to Docker's near-immediate scheduling. Race concern fully closed.
Status: no claim(dstamp) commit yet. Awaiting M1 claim to issue formal verdict.
M1: PASS @2026-06-11T17:36Z
Cold verification from /srv/cc-ci/cc-ci-adv. JOURNAL-dstamp not read before verdict (anti-anchoring).
Check 1 — Recipe policy at 7ae7b0f76efb: PASS
cd ~/.abra/recipes/discourse && git checkout -q 7ae7b0f76efb && grep -nA3 update_config compose.yml
→ failure_action: rollback, order: start-first confirmed present at lines 33-35. Direct evidence the
discourse app service is configured to rollback+start-first at the PR-head.
Check 2 — abra CONSTANT (no binary change 06-05→06-10): PASS
for g in $(ls -d /nix/var/nix/profiles/system-*-link); do ...readlink -f $g/sw/bin/abra; done
→ Gens 2-11 all /nix/store/bf6azhpi8bi5491n8i4bhjm1z7fva7pb-abra-0.13.0-beta/bin/abra.
Gen1 differs (pre-bootstrap), gens 4-11 (2026-06-01 onward) identical. abra version change as
cause of drift definitively ruled out by direct evidence.
Check 3 — Direct rollback evidence (repro4): PASS
grep -E 'DSTAMP|UpdateStatus|PreviousSpec|chaos-version' /var/lib/cc-ci-runs/dstamp-repro4.console.log
→ Line immediately after chaos_redeploy:
UpdateStatus.State="updating"(in flight)Spec.Labels chaos-version="7ae7b0f7+U"(abra correctly applied HEAD)PreviousSpec.Labels chaos-version="eb96de94+U"(the base, what swarm reverts to) → HC1 line:chaos-version=eb96de94+U(AFTER rollback completed) → mismatch → FAIL
Causal chain proven in a single artifact: abra stamped correctly, swarm rolled back, label reverted. Mechanism confirmed: start-first co-residency → OOM under monitor → failure_action:rollback → PreviousSpec.
Check 4 — Fix present: PASS
runner/harness/lifecycle.py:update_status_started(line 511) +assert_upgrade_converged(line 526). Phase-1 polls until StartedAt advances past prev_started (or in-flight state seen) → closes race. Phase-2 terminal:completed=OK;rollback_completed/rollback_paused/paused=FAIL with honest message.runner/harness/generic.py:268-278:prev_started = update_status_started(domain)called BEFOREchaos_redeploy, thenassert_upgrade_converged(domain, timeout=DEPLOY_TIMEOUT, prev_started=prev_started)called immediately after — BEFOREwait_healthy. Correct call order.tests/discourse/compose.ccci.yml:54-55:deploy.update_config.order: stop-firstwith full WHY comment citing direct evidence (dstamp-repro1/4) and statingfailure_action: rollbackis LEFT INTACT. Both commits0cc31a5+e9c26c7verified present (git log --oneline).
Check 5 — Fix works (dstamp-fix1 and dstamp-fix2): PASS
dstamp-fix1:upgrade-converged: disc-ae10f0_ci_commoninternet_net_app swarm UpdateStatus=completedupgrade→PR-head: head_ref=7ae7b0f7 chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0test_upgrade_reconverges PASSED. Level=2 (install+upgrade only, backup/functional not in STAGES).
dstamp-fix2: same params, same domain, same result — second reliability run confirms. Both runs: chaos-version=7ae7b0f7+U (head), NOT eb96de94+U (base). Fix is deterministic.
Check 6 — Blast-radius: PASS
- n8n: runs 162 (level=4, upgrade=pass) and 47 (level=4, upgrade=pass). Run 162 dated post-06-10 (when discourse was failing) → n8n not affected despite same rollback+start-first policy.
- keycloak: runs 155 (level=4, upgrade=pass) and 187 (level=4, upgrade=pass). Same conclusion.
assert_upgrade_convergednow provides a general harness backstop for all rollback-policy recipes. No overlay change needed for keycloak/n8n (lighter apps, no OOM symptom in evidence).- drone/traefik: infra, no recipe-CI upgrade tier. No action needed.
HC1 teeth preserved (code inspection): generic.py:174-175 — assert_upgraded logic is UNCHANGED:
chaos_commit = chaos.split("+",1)[0]; assertion head_ref.startswith(chaos_commit) or chaos_commit.startswith(head_ref). assert_upgrade_converged runs BEFORE assert_upgraded; if a
rollback occurs it raises FIRST with the honest "head did not stay healthy" message; if no rollback occurs,
HC1 commit-match assertion still runs unmodified. A deliberately wrong stamp (e.g. deploying eb96de94
as the chaos version) would still fail HC1 exactly as before. M2 will demonstrate this with a live negative test.
One nuance (not a blocker): The "06-05→06-10 change" being specifically "heavier resident load from rcust-phase stacks" is circumstantially supported by the timeline, but repro1 (isolated, no concurrent apps) also showed drift — the mechanism fires under general memory pressure during discourse's precompile, not only when other apps are warm. The exact delta between run 184 (06-05, passed) and subsequent runs is intermittency of memory pressure, proven by repro2 (warm volumes → faster precompile → task survived) vs repro4 (fresh boot → slower precompile → task failed). The ROOT CAUSE mechanism is proven by direct evidence; the specific "what changed between 06-05 and 06-10" reduces to: heavier/more-variable memory pressure, the mechanism was always latent. This doesn't weaken M1 — the fix eliminates the exposure.
Verdict: M1 PASS. Root cause attributed by direct evidence; minimal reproducible demonstration confirmed; fix (stop-first overlay + assert_upgrade_converged) implemented and working; HC1 unweakened; blast-radius sweep complete. Builder cleared to proceed to M2.
M2: PASS @2026-06-11T17:58Z
Cold verification from /srv/cc-ci/cc-ci-adv. JOURNAL-dstamp not read before verdict (anti-anchoring).
Check 1 — Build 450 results (level, tiers, flags): PASS
cat /var/lib/cc-ci-runs/450/results.json:
"level": 5✓"recipe": "discourse","ref": "7ae7b0f76efb","pr": "2"✓- All tiers:
"install": "pass","upgrade": "pass","backup": "pass","restore": "pass","custom": "pass"✓ - All rungs:
"install": "pass","upgrade": "pass","backup_restore": "pass","functional": "pass","lint": "pass"✓ "clean_teardown": true,"no_secret_leak": true✓- Timestamp:
"finished": 1781199631.4...(2026-06-11 ~17:40 UTC) ✓ screenshot.pngpresent (discourse functional screenshot)
Check 2 — JUnit XML: test_upgrade_reconverges PASS (HC1 satisfied): PASS
grep -c '<failure\|<error' upgrade__generic__test_upgrade.xml → 0
Full XML: <testcase classname="tests._generic.test_upgrade" name="test_upgrade_reconverges" time="0.260"/>
(no <failure> child). test_upgrade_reconverges directly calls generic.assert_upgraded(live_app, meta).
assert_upgraded at generic.py:174-175 does the HC1 commit-match: chaos_commit == head_ref.
Test PASSED → chaos_commit = 7ae7b0f7 matched head_ref = 7ae7b0f7 ✓
Check 3 — PR comment 14347 (!testme path): PASS
Comment 14346 body = !testme (the trigger).
Comment 14347 body (bot response):
<!-- cc-ci:testme -->\n🌻 **cc-ci** — \discourse` @ `7ae7b0f7` ✅ passed\n[...links to run 450 summary.png + badge + drone build 450...]Confirmed via Gitea API. Run directory/var/lib/cc-ci-runs/450/` exists with full contents.
!testme → bridge ack → drone build 450 → run 450 results → PR comment ✅ passed. Path verified.
Check 4 — DEFERRED entry closed: PASS
machine-docs/DEFERRED.md lines 346-366: ✅ RESOLVED @2026-06-11 (phase dstamp, Builder) with:
- Root cause narrative (rollback mechanism)
- Direct evidence pointer (dstamp-repro4.console.log)
- Fix commits (
0cc31a5+e9c26c7) - Real CI proof (drone build #450, LEVEL 5)
- Blast-radius note (only discourse; harness guard covers all rollback-policy recipes)
- Cross-references (STATUS/JOURNAL/REVIEW-dstamp)
Check 5 — HC1 teeth (wrong stamp still FAILs): PASS
Negative control (pre-fix, existing run): m2p-discourse/results.json shows HC1 caught wrong stamp:
AssertionError: upgrade deployed chaos commit 'eb96de94+U', not the intended PR-head '7ae7b0f76efb' — the re-checkout to the code under test failed, so the upgrade is not exercising the PR's changes (HC1)
This is HC1 raising on eb96de94 ≠ 7ae7b0f7. HC1 commit-match assertion WORKS.
Code unchanged (from M1): generic.py:174-175 commit-match assertion unmodified. The fix adds
assert_upgrade_converged BEFORE assert_upgraded — it catches rollback EARLIER with an honest message
but does NOT bypass HC1. If a non-rollback wrong stamp were deployed (e.g. abra bug stamping wrong commit),
assert_upgrade_converged would see completed and pass, then HC1 would FAIL on the commit mismatch.
Post-fix rollback path: assert_upgrade_converged raises RuntimeError on rollback_completed →
upgrade FAILS with honest "head did not stay healthy" → HC1 doesn't even run but test is RED.
Both paths (rollback → caught by assert_upgrade_converged; wrong stamp without rollback → caught by HC1)
still FAIL. The pre-fix negative controls (m2p-discourse, repro1, repro4) demonstrate the wrong-stamp
path is always caught; the fix only changes HOW it's reported and at which point.
Blast-radius (confirmed at M1, still valid): Only discourse affected. keycloak/n8n PASS L4
in 06-10/06-11 era. General assert_upgrade_converged guard now covers all rollback-policy recipes.
Phase DoD summary:
- ✅ Drift mechanism attributed with reproducible evidence (repro4 direct evidence)
- ✅ Fixed at the true root (stop-first overlay + assert_upgrade_converged)
- ✅ Discourse back at real level in real CI via drone !testme (build 450, LEVEL 5)
- ✅ No other recipe silently affected (blast-radius sweep, keycloak/n8n PASS)
- ✅ HC1 unweakened and adversarially re-proven (m2p-discourse negative control + code inspection)
- ✅ DEFERRED closed with pointers
Verdict: M2 PASS. All phase dstamp DoD items satisfied. Builder cleared for ## DONE.