diff --git a/JOURNAL-dstamp.md b/JOURNAL-dstamp.md
index 3a7648f..042cfe2 100644
--- a/JOURNAL-dstamp.md
+++ b/JOURNAL-dstamp.md
@@ -157,6 +157,26 @@ keycloak/n8n new tasks survive swarm's monitor, so no rollback. The general harn
 (`assert_upgrade_converged`) now protects ALL rollback-policy recipes from a silent future rollback (honest failure), and discourse additionally gets stop-first to converge reliably.
+### Hardening (commit e9c26c7) + fix2 validation
+Adversary independently confirmed the root cause + assessed the fix CORRECT (REVIEW-dstamp probe),
+flagging one non-blocking race: assert_upgrade_converged's first poll could read a STALE terminal
+`completed` (from the install/base deploy) before swarm schedules the new roll → return OK
+prematurely → miss a later rollback. Hardened with a two-phase wait: phase 1 confirms the NEW
+update is scheduled (`UpdateStatus.StartedAt` advances past the pre-redeploy value, captured via
+`update_status_started`, or state is in-flight `updating`/`rollback_started`), with a 30s grace for
+a genuine no-op redeploy; phase 2 then waits for the terminal verdict. fix2 (hardened, fresh
+checkout @e9c26c7, install+upgrade): UPGRADE **PASS** — `upgrade-converged: …UpdateStatus=completed`,
+`chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`. Two consecutive green fixed runs
+(fix1+fix2) vs intermittent unpatched failures (repro1✗ repro4✗ repro2✓). Unit tests 253 pass.
+
+### M1 claimed
+Attribution + minimal repro + 06-05→06-10 change + fix + blast-radius all complete and
+Adversary-pre-confirmed → claiming M1 (verification recipe in STATUS-dstamp). Next: M2 — full
+all-stages discourse green at true level via the drone `!testme` path (the recipe-CI pipeline runs
+`cc-ci-run runner/run_recipe_ci.py` from the drone-cloned cc-ci workspace, so e9c26c7 is live for
+!testme — no nixos-rebuild needed for the harness), other recipes re-proven (none affected), HC1
+teeth shown (wrong stamp still FAILs), DEFERRED closed.
+
 Fix direction (HC1 must keep its teeth — do NOT relax the commit match): the upgrade chaos redeploy must assert against the *intended* applied spec, not a silently rolled-back one — i.e. the harness must DETECT a swarm rollback (UpdateStatus.State rollback*) and treat it as an upgrade FAILURE with
diff --git a/STATUS-dstamp.md b/STATUS-dstamp.md
index d36520f..cdd2736 100644
--- a/STATUS-dstamp.md
+++ b/STATUS-dstamp.md
@@ -2,7 +2,55 @@
 Builder. SSOT: `cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.
-## Phase state: ROOT CAUSE ATTRIBUTED (direct evidence) — building fix, no gate claimed yet
+## Gate: M1 — CLAIMED, awaiting Adversary
+
+**WHAT (M1 = Attribution):** root cause attributed by direct evidence; minimal reproducible
+demonstration; 06-05→06-10 change identified; fix implemented (recipe overlay + harness, HC1
+unweakened); blast-radius sweep complete.
+
+Root cause: discourse `compose.yml` app service sets `deploy.update_config: { failure_action:
+rollback, order: start-first, monitor: 5s }`. On the upgrade chaos redeploy, start-first co-resides
+OLD+NEW (~2× memory) for the precompile/Rails-heavy app; under host memory pressure the NEW task
+fails swarm's 5s update monitor → `failure_action: rollback` reverts the app service to its
+PreviousSpec — INCLUDING the `coop-cloud..chaos-version` label (head→base). Under start-first
+the OLD task keeps serving, so `wait_healthy` passes; `deployed_identity` then reads the rolled-back
+`.Spec` (base commit `eb96de94+U`) and HC1 misreports it as "re-checkout failed". abra+harness git
+path EXONERATED (abra stamps head `7ae7b0f7+U` correctly; per-run HEAD=7ae7b0f at deploy).
+
+**HOW to verify (Adversary, cold):**
+1. *Recipe policy:* `cd ~/.abra/recipes/discourse && git checkout -q 7ae7b0f76efb && grep -nA3
+   update_config compose.yml` → `failure_action: rollback`, `order: start-first`. EXPECTED present.
+2. *abra exonerated (minimal repro):* scratch ABRA_DIR, base→head checkout, `abra app deploy -C
+   -o -n --debug` bails at `secret not generated` AFTER logging `app/deploy.go:372 version: taking
+   chaos version: 7ae7b0f7+U` (HEAD-correct). Procedure: JOURNAL-dstamp "mirror-faithful repro".
+3. *Direct rollback evidence:* console `/var/lib/cc-ci-runs/dstamp-repro4.console.log` line
+   `[DSTAMP] post-redeploy svc inspect …` shows immediately post-redeploy `UpdateStatus.State=
+   "updating"`, `.Spec…chaos-version=7ae7b0f7+U` (head applied), `.PreviousSpec…chaos-version=
+   eb96de94+U` (base); the later HC1 read = eb96de94+U after the rollback completes.
+4. *Fix present:* `runner/harness/lifecycle.py::assert_upgrade_converged` (+ `update_status_started`)
+   and its call in `runner/harness/generic.py::perform_upgrade`; `tests/discourse/compose.ccci.yml`
+   app `deploy.update_config.order: stop-first`. Commits `0cc31a5` + `e9c26c7`.
+5. *Fix works:* run `dstamp-fix1` (fresh checkout, STAGES=install,upgrade) → upgrade PASS,
+   console `upgrade-converged: …UpdateStatus=completed` + `chaos-version=7ae7b0f7+U version=
+   0.7.0+3.3.1→0.9.0+3.5.0`. (Re-runnable: `RECIPE=discourse PR=2
+   REF=7ae7b0f76efb2988c1e54956348dc9eeb7812e0b SRC=recipe-maintainers/discourse
+   STAGES=install,upgrade CCCI_RUN_ID= cc-ci-run runner/run_recipe_ci.py` from a checkout at
+   `e9c26c7`.)
+6. *Blast-radius:* recipes with rollback+start-first = discourse, drone, keycloak, n8n, traefik.
+   keycloak/n8n upgrade PASS L4 across runs (155/186/187/m2r; 47/54/61/162/197/m2r) ⇒ not affected;
+   drone/traefik infra (no recipe-CI upgrade tier). Only discourse affected; the general
+   `assert_upgrade_converged` guard now protects all rollback-policy recipes.
+
+**EXPECTED:** all of 1–6 hold. **WHERE:** commits 0cc31a5, e9c26c7; runs
+`/var/lib/cc-ci-runs/dstamp-{repro1,repro2,repro4,fix1}`; recipe `~/.abra/recipes/discourse`.
+
+HC1 teeth preserved: the commit-match assertion is unchanged; `assert_upgrade_converged` only makes
+a swarm rollback an HONEST upgrade failure before HC1 runs (a genuinely undeployable head still
+fails). M2 will demonstrate a wrong stamp still FAILs + full-lifecycle green via the `!testme` path.
+
+---
+
+## Root cause detail (evidence)
 ## ROOT CAUSE (attributed by direct evidence, abra+harness EXONERATED)