review(dstamp): M1 PASS @2026-06-11T17:36Z — root cause proven by direct evidence (repro4: Spec=7ae7b0f7+U→PreviousSpec=eb96de94+U, swarm rollback confirmed); abra constant (gens4-11 same store path); fix verified (stop-first overlay + assert_upgrade_converged 2-phase, HC1 code unchanged); blast-radius n8n/keycloak PASS L4 in 06-10/06-11 era; dstamp-fix1/fix2 upgrade=PASS @7ae7b0f7+U. Builder cleared for M2.

autonomic-bot
2026-06-11 17:37:35 +00:00
parent 2da1f01849
commit fb411b2563


@ -139,3 +139,77 @@ from stale base-deploy terminal state. No new failure modes introduced. The grac
is generous relative to Docker's near-immediate scheduling. Race concern fully closed.
**Status:** no `claim(dstamp)` commit yet. Awaiting M1 claim to issue formal verdict.
---
## M1: PASS @2026-06-11T17:36Z
Cold verification from `/srv/cc-ci/cc-ci-adv`. JOURNAL-dstamp not read before verdict (anti-anchoring).
**Check 1 — Recipe policy at 7ae7b0f76efb:** PASS
`cd ~/.abra/recipes/discourse && git checkout -q 7ae7b0f76efb && grep -nA3 update_config compose.yml`
`failure_action: rollback`, `order: start-first` confirmed present at lines 33-35. Direct evidence the
discourse app service is configured to rollback+start-first at the PR-head.
**Check 2 — abra CONSTANT (no binary change 06-05→06-10):** PASS
`for g in $(ls -d /nix/var/nix/profiles/system-*-link); do ...readlink -f $g/sw/bin/abra; done`
→ Gens 2-11 all resolve to `/nix/store/bf6azhpi8bi5491n8i4bhjm1z7fva7pb-abra-0.13.0-beta/bin/abra`.
Only Gen1 differs (pre-bootstrap), so gens 4-11 (2026-06-01 onward) are identical. An abra version
change as the cause of drift is definitively ruled out by direct evidence.
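
The generation sweep above can also be expressed as a small grouping helper. This is a sketch only: `group_generations_by_binary` is a hypothetical name, not part of the harness, and it assumes each profile link exposes the binary at `sw/bin/abra` as in the command quoted above.

```python
import os
from collections import defaultdict


def group_generations_by_binary(profile_links, rel_path="sw/bin/abra"):
    """Map each resolved binary path to the profile links that point at it.

    Hypothetical helper: one key in the result means the binary was constant
    across every generation passed in.
    """
    groups = defaultdict(list)
    for link in sorted(profile_links):
        # os.path.realpath follows every symlink, like `readlink -f`.
        resolved = os.path.realpath(os.path.join(link, rel_path))
        groups[resolved].append(link)
    return dict(groups)
```

Fed with `glob.glob("/nix/var/nix/profiles/system-*-link")`, a result with a single key for the generations of interest corresponds to the "same store path" finding.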
**Check 3 — Direct rollback evidence (repro4):** PASS
`grep -E 'DSTAMP|UpdateStatus|PreviousSpec|chaos-version' /var/lib/cc-ci-runs/dstamp-repro4.console.log`
→ Line immediately after chaos_redeploy:
- `UpdateStatus.State="updating"` (in flight)
- `Spec.Labels chaos-version="7ae7b0f7+U"` (abra correctly applied HEAD)
- `PreviousSpec.Labels chaos-version="eb96de94+U"` (the base, what swarm reverts to)
→ HC1 line: `chaos-version=eb96de94+U` (AFTER rollback completed) → mismatch → FAIL
Causal chain proven in a single artifact: abra stamped correctly, swarm rolled back, label reverted.
Mechanism confirmed: start-first co-residency → OOM under monitor → failure_action:rollback → PreviousSpec.
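
The three fields read from the console log map onto the JSON that `docker service inspect` returns (`Spec`, `PreviousSpec`, `UpdateStatus`). The sketch below summarises them from an already-parsed service object. `summarise_update` is a hypothetical name, and the bare `chaos-version` label key follows the shorthand used in this review; the real label key on the service may be namespaced.

```python
def summarise_update(service, label="chaos-version"):
    """Extract the update state plus current and previous label values.

    `service` is one parsed element of `docker service inspect` output.
    The label key is an assumption taken from the shorthand in this review.
    """
    spec_labels = (service.get("Spec") or {}).get("Labels") or {}
    prev_labels = (service.get("PreviousSpec") or {}).get("Labels") or {}
    state = (service.get("UpdateStatus") or {}).get("State")
    current = spec_labels.get(label)
    previous = prev_labels.get(label)
    return {
        "state": state,
        "spec": current,
        "previous": previous,
        # True when a rollback would change the stamped version.
        "rollback_would_revert": previous is not None and previous != current,
    }


# Values as reported for repro4 immediately after chaos_redeploy.
in_flight = {
    "Spec": {"Labels": {"chaos-version": "7ae7b0f7+U"}},
    "PreviousSpec": {"Labels": {"chaos-version": "eb96de94+U"}},
    "UpdateStatus": {"State": "updating"},
}
summary = summarise_update(in_flight)
```

With the repro4 values, `summary["rollback_would_revert"]` is true: if swarm rolls back, the label returns to the base stamp, which is exactly what the HC1 line then observed.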
**Check 4 — Fix present:** PASS
- `runner/harness/lifecycle.py`: `update_status_started` (line 511) + `assert_upgrade_converged` (line 526).
Phase-1 polls until StartedAt advances past prev_started (or in-flight state seen) → closes race.
Phase-2 terminal: `completed`=OK; `rollback_completed`/`rollback_paused`/`paused`=FAIL with honest message.
- `runner/harness/generic.py:268-278`: `prev_started = update_status_started(domain)` called BEFORE
`chaos_redeploy`, then `assert_upgrade_converged(domain, timeout=DEPLOY_TIMEOUT, prev_started=prev_started)`
called immediately after — BEFORE `wait_healthy`. Correct call order.
- `tests/discourse/compose.ccci.yml:54-55`: `deploy.update_config.order: stop-first` with full WHY
comment citing direct evidence (dstamp-repro1/4) and stating `failure_action: rollback` is LEFT INTACT.
Both commits 0cc31a5 + e9c26c7 verified present (git log --oneline).
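
For readers without the repo at hand, the two-phase behaviour described above can be sketched as follows. This is not the harness code: `wait_converged`, `fetch`, and `UpgradeNotConverged` are hypothetical names, and "StartedAt advances" is simplified here to "StartedAt differs from the value captured before the redeploy".

```python
import time

# Swarm UpdateStatus states, grouped by how this sketch treats them.
IN_FLIGHT = {"updating", "rollback_started"}
FAILED = {"rollback_completed", "rollback_paused", "paused"}


class UpgradeNotConverged(AssertionError):
    """Raised when the update rolled back, paused, or never finished."""


def wait_converged(fetch, prev_started, timeout=60.0, interval=1.0,
                   sleep=time.sleep):
    """Two-phase wait on a service's UpdateStatus.

    fetch: callable returning the UpdateStatus dict (or None).
    prev_started: StartedAt value captured BEFORE the redeploy.
    """
    deadline = time.monotonic() + timeout
    new_update_seen = False
    while time.monotonic() < deadline:
        status = fetch() or {}
        state = status.get("State")
        started = status.get("StartedAt")
        # Phase 1: ignore stale terminal state left by the previous deploy.
        if not new_update_seen:
            if state in IN_FLIGHT or (started and started != prev_started):
                new_update_seen = True
        # Phase 2: only now is a terminal state meaningful.
        if new_update_seen:
            if state == "completed":
                return state
            if state in FAILED:
                raise UpgradeNotConverged(
                    "head did not stay healthy: UpdateStatus=%s" % state)
        sleep(interval)
    raise UpgradeNotConverged("timed out waiting for the update to converge")
```

Capturing `prev_started` before the redeploy is what keeps a leftover `completed` from the base deploy from being mistaken for success.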
**Check 5 — Fix works (dstamp-fix1 and dstamp-fix2):** PASS
- `dstamp-fix1`: `upgrade-converged: disc-ae10f0_ci_commoninternet_net_app swarm UpdateStatus=completed`
+ `upgrade→PR-head: head_ref=7ae7b0f7 chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`
+ `test_upgrade_reconverges PASSED`. Level=2 (install+upgrade only, backup/functional not in STAGES).
- `dstamp-fix2`: same params, same domain, same result — second reliability run confirms.
Both runs: chaos-version=7ae7b0f7+U (head), NOT eb96de94+U (base). Fix is deterministic.
**Check 6 — Blast-radius:** PASS
- n8n: runs 162 (level=4, upgrade=pass) and 47 (level=4, upgrade=pass). Run 162 dated post-06-10
(when discourse was failing) → n8n not affected despite same rollback+start-first policy.
- keycloak: runs 155 (level=4, upgrade=pass) and 187 (level=4, upgrade=pass). Same conclusion.
- `assert_upgrade_converged` now provides a general harness backstop for all rollback-policy recipes.
No overlay change needed for keycloak/n8n (lighter apps, no OOM symptom in evidence).
- drone/traefik: infra, no recipe-CI upgrade tier. No action needed.
**HC1 teeth preserved (code inspection):** `generic.py:174-175`: the `assert_upgraded` logic is UNCHANGED:
`chaos_commit = chaos.split("+",1)[0]`; assertion `head_ref.startswith(chaos_commit) or
chaos_commit.startswith(head_ref)`. `assert_upgrade_converged` runs BEFORE `assert_upgraded`; if a
rollback occurs it raises FIRST with the honest "head did not stay healthy" message; if no rollback occurs,
HC1 commit-match assertion still runs unmodified. A deliberately wrong stamp (e.g. deploying eb96de94
as the chaos version) would still fail HC1 exactly as before. M2 will demonstrate this with a live negative test.
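
The commit-match predicate quoted above is small enough to check in isolation. The wrapper name `stamp_matches_head` is hypothetical; the two expressions inside it are the ones cited from `generic.py`.

```python
def stamp_matches_head(chaos, head_ref):
    """Return True when the stamped chaos-version names the head commit.

    Prefix matching in both directions tolerates abbreviated hashes on
    either side (e.g. an 8-char stamp against a 12-char head ref).
    """
    chaos_commit = chaos.split("+", 1)[0]  # drop the "+U" suffix
    return head_ref.startswith(chaos_commit) or chaos_commit.startswith(head_ref)


head_ok = stamp_matches_head("7ae7b0f7+U", "7ae7b0f76efb")    # head stamp
base_bad = stamp_matches_head("eb96de94+U", "7ae7b0f76efb")   # base stamp
```

With the hashes from this review, the head stamp matches and the base stamp does not, which is the failure HC1 reported in repro4.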
**One nuance (not a blocker):** The claim that the "06-05→06-10 change" was specifically "heavier resident
load from rcust-phase stacks" is only circumstantially supported by the timeline: repro1 (isolated, no
concurrent apps) also showed drift, so the mechanism fires under general memory pressure during discourse's
precompile, not only when other apps are warm. The delta between run 184 (06-05, passed) and subsequent
runs is intermittent memory pressure, as shown by repro2 (warm volumes → faster precompile → task survived)
vs repro4 (fresh boot → slower precompile → task failed). The ROOT CAUSE mechanism is proven by direct
evidence; the specific "what changed between 06-05 and 06-10" reduces to heavier/more-variable memory
pressure exposing a mechanism that was always latent. This doesn't weaken M1: the fix eliminates the exposure.
**Verdict: M1 PASS.** Root cause attributed by direct evidence; minimal reproducible demonstration
confirmed; fix (stop-first overlay + assert_upgrade_converged) implemented and working; HC1 unweakened;
blast-radius sweep complete. Builder cleared to proceed to M2.