From fb411b256347e2247ebd16d21509d9ab1df6e730 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Thu, 11 Jun 2026 17:37:35 +0000
Subject: [PATCH] =?UTF-8?q?review(dstamp):=20M1=20PASS=20@2026-06-11T17:36?=
 =?UTF-8?q?Z=20=E2=80=94=20root=20cause=20proven=20by=20direct=20evidence?=
 =?UTF-8?q?=20(repro4:=20Spec=3D7ae7b0f7+U=E2=86=92PreviousSpec=3Deb96de94?=
 =?UTF-8?q?+U,=20swarm=20rollback=20confirmed);=20abra=20constant=20(gens4?=
 =?UTF-8?q?-11=20same=20store=20path);=20fix=20verified=20(stop-first=20ov?=
 =?UTF-8?q?erlay=20+=20assert=5Fupgrade=5Fconverged=202-phase,=20HC1=20cod?=
 =?UTF-8?q?e=20unchanged);=20blast-radius=20n8n/keycloak=20PASS=20L4=20in?=
 =?UTF-8?q?=2006-10/06-11=20era;=20dstamp-fix1/fix2=20upgrade=3DPASS=20@7a?=
 =?UTF-8?q?e7b0f7+U.=20Builder=20cleared=20for=20M2.?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 REVIEW-dstamp.md | 74 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/REVIEW-dstamp.md b/REVIEW-dstamp.md
index f9f7d6c..9e6665a 100644
--- a/REVIEW-dstamp.md
+++ b/REVIEW-dstamp.md
@@ -139,3 +139,77 @@ from stale base-deploy terminal state. No new failure modes introduced. The grac
 is generous relative to Docker's near-immediate scheduling. Race concern fully closed.
 
 **Status:** no `claim(dstamp)` commit yet. Awaiting M1 claim to issue formal verdict.
+
+---
+
+## M1: PASS @2026-06-11T17:36Z
+
+Cold verification from `/srv/cc-ci/cc-ci-adv`. JOURNAL-dstamp not read before verdict (anti-anchoring).
+
+**Check 1 — Recipe policy at 7ae7b0f76efb:** PASS
+`cd ~/.abra/recipes/discourse && git checkout -q 7ae7b0f76efb && grep -nA3 update_config compose.yml`
+→ `failure_action: rollback`, `order: start-first` confirmed present at lines 33-35. Direct evidence the
+discourse app service is configured to rollback+start-first at the PR-head.
+
+**Check 2 — abra CONSTANT (no binary change 06-05→06-10):** PASS
+`for g in $(ls -d /nix/var/nix/profiles/system-*-link); do ...readlink -f $g/sw/bin/abra; done`
+→ Gens 2-11 all `/nix/store/bf6azhpi8bi5491n8i4bhjm1z7fva7pb-abra-0.13.0-beta/bin/abra`.
+Gen1 differs (pre-bootstrap), gens 4-11 (2026-06-01 onward) identical. abra version change as
+cause of drift definitively ruled out by direct evidence.
+
+**Check 3 — Direct rollback evidence (repro4):** PASS
+`grep -E 'DSTAMP|UpdateStatus|PreviousSpec|chaos-version' /var/lib/cc-ci-runs/dstamp-repro4.console.log`
+→ Line immediately after chaos_redeploy:
+- `UpdateStatus.State="updating"` (in flight)
+- `Spec.Labels chaos-version="7ae7b0f7+U"` (abra correctly applied HEAD)
+- `PreviousSpec.Labels chaos-version="eb96de94+U"` (the base, what swarm reverts to)
+→ HC1 line: `chaos-version=eb96de94+U` (AFTER rollback completed) → mismatch → FAIL
+
+Causal chain proven in a single artifact: abra stamped correctly, swarm rolled back, label reverted.
+Mechanism confirmed: start-first co-residency → OOM under monitor → failure_action:rollback → PreviousSpec.
+
+**Check 4 — Fix present:** PASS
+- `runner/harness/lifecycle.py`: `update_status_started` (line 511) + `assert_upgrade_converged` (line 526).
+  Phase-1 polls until StartedAt advances past prev_started (or in-flight state seen) → closes race.
+  Phase-2 terminal: `completed`=OK; `rollback_completed`/`rollback_paused`/`paused`=FAIL with honest message.
+- `runner/harness/generic.py:268-278`: `prev_started = update_status_started(domain)` called BEFORE
+  `chaos_redeploy`, then `assert_upgrade_converged(domain, timeout=DEPLOY_TIMEOUT, prev_started=prev_started)`
+  called immediately after — BEFORE `wait_healthy`. Correct call order.
+- `tests/discourse/compose.ccci.yml:54-55`: `deploy.update_config.order: stop-first` with full WHY
+  comment citing direct evidence (dstamp-repro1/4) and stating `failure_action: rollback` is LEFT INTACT.
+  Both commits 0cc31a5 + e9c26c7 verified present (git log --oneline).
+
+**Check 5 — Fix works (dstamp-fix1 and dstamp-fix2):** PASS
+- `dstamp-fix1`: `upgrade-converged: disc-ae10f0_ci_commoninternet_net_app swarm UpdateStatus=completed`
+  + `upgrade→PR-head: head_ref=7ae7b0f7 chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`
+  + `test_upgrade_reconverges PASSED`. Level=2 (install+upgrade only, backup/functional not in STAGES).
+- `dstamp-fix2`: same params, same domain, same result — second reliability run confirms.
+  Both runs: chaos-version=7ae7b0f7+U (head), NOT eb96de94+U (base). Fix is deterministic.
+
+**Check 6 — Blast-radius:** PASS
+- n8n: runs 162 (level=4, upgrade=pass) and 47 (level=4, upgrade=pass). Run 162 dated post-06-10
+  (when discourse was failing) → n8n not affected despite same rollback+start-first policy.
+- keycloak: runs 155 (level=4, upgrade=pass) and 187 (level=4, upgrade=pass). Same conclusion.
+- `assert_upgrade_converged` now provides a general harness backstop for all rollback-policy recipes.
+  No overlay change needed for keycloak/n8n (lighter apps, no OOM symptom in evidence).
+- drone/traefik: infra, no recipe-CI upgrade tier. No action needed.
+
+**HC1 teeth preserved (code inspection):** `generic.py:174-175` — `assert_upgraded` logic is UNCHANGED:
+`chaos_commit = chaos.split("+",1)[0]`; assertion `head_ref.startswith(chaos_commit) or
+chaos_commit.startswith(head_ref)`. `assert_upgrade_converged` runs BEFORE `assert_upgraded`; if a
+rollback occurs it raises FIRST with the honest "head did not stay healthy" message; if no rollback occurs,
+HC1 commit-match assertion still runs unmodified. A deliberately wrong stamp (e.g. deploying eb96de94
+as the chaos version) would still fail HC1 exactly as before. M2 will demonstrate this with a live negative test.
+
+**One nuance (not a blocker):** The "06-05→06-10 change" being specifically "heavier resident load from
+rcust-phase stacks" is circumstantially supported by the timeline, but repro1 (isolated, no concurrent apps)
+also showed drift — the mechanism fires under general memory pressure during discourse's precompile, not
+only when other apps are warm. The exact delta between run 184 (06-05, passed) and subsequent runs is
+intermittency of memory pressure, proven by repro2 (warm volumes → faster precompile → task survived) vs
+repro4 (fresh boot → slower precompile → task failed). The ROOT CAUSE mechanism is proven by direct
+evidence; the specific "what changed between 06-05 and 06-10" reduces to: heavier/more-variable memory
+pressure, the mechanism was always latent. This doesn't weaken M1 — the fix eliminates the exposure.
+
+**Verdict: M1 PASS.** Root cause attributed by direct evidence; minimal reproducible demonstration
+confirmed; fix (stop-first overlay + assert_upgrade_converged) implemented and working; HC1 unweakened;
+blast-radius sweep complete. Builder cleared to proceed to M2.
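
For readers of this review, the two-phase convergence check from Check 4 and the HC1 commit match can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the code in `runner/harness/lifecycle.py` or `generic.py`: `classify_upgrade`, `stamp_matches_head`, and `UpgradeDidNotConverge` are hypothetical names, the snapshots are plain dicts standing in for polled swarm `UpdateStatus` values, and "StartedAt advances past prev_started" is simplified to "StartedAt differs from `prev_started`".

```python
from typing import Iterable, Mapping, Optional

# Swarm UpdateStatus states, grouped the way Check 4 describes them.
IN_FLIGHT = {"updating", "rollback_started"}
FAILED = {"rollback_completed", "rollback_paused", "paused"}


class UpgradeDidNotConverge(AssertionError):
    """Raised when the upgrade rolled back, paused, or was never observed."""


def classify_upgrade(
    snapshots: Iterable[Mapping[str, Optional[str]]],
    prev_started: Optional[str],
) -> str:
    """Two-phase check over successive UpdateStatus snapshots (hypothetical helper).

    Phase 1 ignores snapshots until a NEW update is visible, so a stale
    terminal state left by the base deploy cannot be mistaken for success.
    Phase 2 accepts only `completed` and rejects rollback/paused states.
    """
    seen_new_update = False
    for snap in snapshots:
        state = snap.get("State")
        started = snap.get("StartedAt")
        if not seen_new_update:
            # Assumption: "advanced" is approximated as "differs from prev_started".
            advanced = started is not None and started != prev_started
            if state in IN_FLIGHT or advanced:
                seen_new_update = True
            else:
                continue  # still looking at the previous update's status
        if state == "completed":
            return "completed"
        if state in FAILED:
            raise UpgradeDidNotConverge(
                f"head did not stay healthy: swarm UpdateStatus={state}"
            )
    raise UpgradeDidNotConverge("no terminal UpdateStatus observed before timeout")


def stamp_matches_head(chaos_version: str, head_ref: str) -> bool:
    """HC1 commit match as quoted in the review: prefix match in either direction."""
    chaos_commit = chaos_version.split("+", 1)[0]
    return head_ref.startswith(chaos_commit) or chaos_commit.startswith(head_ref)
```

With the labels quoted in Check 3, `stamp_matches_head("7ae7b0f7+U", "7ae7b0f7")` holds while `stamp_matches_head("eb96de94+U", "7ae7b0f7")` does not, which is why a rolled-back stamp still fails HC1 if the convergence check were ever bypassed.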