claim(dstamp): M1 — root cause attributed by DIRECT evidence (swarm failure_action:rollback reverts chaos-version label, masked by start-first+wait_healthy; abra+harness git path exonerated); minimal repro + 06-05→06-10 load change + fix (stop-first overlay + assert_upgrade_converged, HC1 unweakened) + blast-radius (only discourse). fix1+fix2 validate green @7ae7b0f7+U. Verification recipe in STATUS-dstamp.
Some checks failed
continuous-integration/drone/push Build is failing
continuous-integration/drone Build is passing

2026-06-11 17:32:11 +00:00
committed by autonomic-bot
parent 53db62258e
commit 2da1f01849
2 changed files with 69 additions and 1 deletions


@@ -157,6 +157,26 @@ keycloak/n8n new tasks survive swarm's monitor, so no rollback. The general harn
(`assert_upgrade_converged`) now protects ALL rollback-policy recipes from a silent future
rollback (honest failure), and discourse additionally gets stop-first to converge reliably.
### Hardening (commit e9c26c7) + fix2 validation
Adversary independently confirmed the root cause + assessed the fix CORRECT (REVIEW-dstamp probe),
flagging one non-blocking race: assert_upgrade_converged's first poll could read a STALE terminal
`completed` (from the install/base deploy) before swarm schedules the new roll → return OK
prematurely → miss a later rollback. Hardened with a two-phase wait: phase 1 confirms the NEW
update is scheduled (`UpdateStatus.StartedAt` advances past the pre-redeploy value, captured via
`update_status_started`, or state is in-flight `updating`/`rollback_started`), with a 30s grace for
a genuine no-op redeploy; phase 2 then waits for the terminal verdict. fix2 (hardened, fresh
checkout @e9c26c7, install+upgrade): UPGRADE **PASS** — `upgrade-converged: …UpdateStatus=completed`,
`chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`. Two consecutive green fixed runs
(fix1+fix2) vs intermittent unpatched failures (repro1✗ repro4✗ repro2✓). Unit tests 253 pass.
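The two-phase wait described above can be sketched roughly as follows. This is an illustration only, not the code in `runner/harness/lifecycle.py`: the function name `wait_upgrade_converged`, the `read_status` callable, the `UpgradeRolledBack` exception, and the timeout/poll defaults are all hypothetical, and "StartedAt advances" is approximated here as "the timestamp differs from the pre-redeploy value".

```python
import time
from typing import Callable, Optional, Tuple

# In-flight states named in the journal entry above.
IN_FLIGHT = {"updating", "rollback_started"}

Status = Tuple[Optional[str], Optional[str]]  # (UpdateStatus.State, UpdateStatus.StartedAt)


class UpgradeRolledBack(Exception):
    """Hypothetical error type: swarm rolled the service back."""


def wait_upgrade_converged(
    read_status: Callable[[], Status],   # hypothetical: returns the current (state, started_at)
    started_before: Optional[str],       # StartedAt captured before the redeploy
    grace_s: float = 30.0,               # grace for a genuine no-op redeploy
    timeout_s: float = 600.0,            # assumed overall budget
    poll_s: float = 2.0,                 # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    start = clock()
    # Phase 1: do not trust a terminal state until the NEW update is scheduled.
    while True:
        state, started_at = read_status()
        new_update = started_at is not None and started_at != started_before
        if state in IN_FLIGHT or new_update:
            break
        if clock() - start >= grace_s:
            return "no-op"  # nothing was scheduled within the grace window
        sleep(poll_s)
    # Phase 2: wait for the terminal verdict of the new update.
    while clock() - start < timeout_s:
        state, _ = read_status()
        if state == "completed":
            return "completed"
        if state is not None and state.startswith("rollback"):
            raise UpgradeRolledBack(state)  # honest upgrade failure
        sleep(poll_s)
    raise TimeoutError("update did not reach a terminal state")
```

Injecting `clock` and `sleep` keeps the sketch testable without a live swarm; a stale `completed` with an unchanged `StartedAt` never short-circuits phase 1.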
### M1 claimed
Attribution + minimal repro + 06-05→06-10 change + fix + blast-radius all complete and
Adversary-pre-confirmed → claiming M1 (verification recipe in STATUS-dstamp). Next: M2 — full
all-stages discourse green at true level via the drone `!testme` path (the recipe-CI pipeline runs
`cc-ci-run runner/run_recipe_ci.py` from the drone-cloned cc-ci workspace, so e9c26c7 is live for
!testme — no nixos-rebuild needed for the harness), other recipes re-proven (none affected), HC1
teeth shown (wrong stamp still FAILs), DEFERRED closed.
Fix direction (HC1 must keep its teeth — do NOT relax the commit match): the upgrade chaos redeploy
must assert against the *intended* applied spec, not a silently rolled-back one — i.e. the harness
must DETECT a swarm rollback (UpdateStatus.State rollback*) and treat it as an upgrade FAILURE with


@@ -2,7 +2,55 @@
Builder. SSOT: `cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.
## Phase state: ROOT CAUSE ATTRIBUTED (direct evidence) — building fix, no gate claimed yet
## Gate: M1 — CLAIMED, awaiting Adversary
**WHAT (M1 = Attribution):** root cause attributed by direct evidence; minimal reproducible
demonstration; 06-05→06-10 change identified; fix implemented (recipe overlay + harness, HC1
unweakened); blast-radius sweep complete.
Root cause: discourse `compose.yml` app service sets `deploy.update_config: { failure_action:
rollback, order: start-first, monitor: 5s }`. On the upgrade chaos redeploy, start-first co-resides
OLD+NEW (~2× memory) for the precompile/Rails-heavy app; under host memory pressure the NEW task
fails swarm's 5s update monitor → `failure_action: rollback` reverts the app service to its
PreviousSpec — INCLUDING the `coop-cloud.<stack>.chaos-version` label (head→base). Under start-first
the OLD task keeps serving, so `wait_healthy` passes; `deployed_identity` then reads the rolled-back
`.Spec` (base commit `eb96de94+U`) and HC1 misreports it as "re-checkout failed". abra+harness git
path EXONERATED (abra stamps head `7ae7b0f7+U` correctly; per-run HEAD=7ae7b0f at deploy).
**HOW to verify (Adversary, cold):**
1. *Recipe policy:* `cd ~/.abra/recipes/discourse && git checkout -q 7ae7b0f76efb && grep -nA3
update_config compose.yml` → `failure_action: rollback`, `order: start-first`. EXPECTED present.
2. *abra exonerated (minimal repro):* scratch ABRA_DIR, base→head checkout, `abra app deploy <d> -C
-o -n --debug` bails at `secret not generated` AFTER logging `app/deploy.go:372 version: taking
chaos version: 7ae7b0f7+U` (HEAD-correct). Procedure: JOURNAL-dstamp "mirror-faithful repro".
3. *Direct rollback evidence:* console `/var/lib/cc-ci-runs/dstamp-repro4.console.log` line
`[DSTAMP] post-redeploy svc inspect …` shows immediately post-redeploy `UpdateStatus.State=
"updating"`, `.Spec…chaos-version=7ae7b0f7+U` (head applied), `.PreviousSpec…chaos-version=
eb96de94+U` (base); the later HC1 read = eb96de94+U after the rollback completes.
4. *Fix present:* `runner/harness/lifecycle.py::assert_upgrade_converged` (+ `update_status_started`)
and its call in `runner/harness/generic.py::perform_upgrade`; `tests/discourse/compose.ccci.yml`
app `deploy.update_config.order: stop-first`. Commits `0cc31a5` + `e9c26c7`.
5. *Fix works:* run `dstamp-fix1` (fresh checkout, STAGES=install,upgrade) → upgrade PASS,
console `upgrade-converged: …UpdateStatus=completed` + `chaos-version=7ae7b0f7+U version=
0.7.0+3.3.1→0.9.0+3.5.0`. (Re-runnable: `RECIPE=discourse PR=2
REF=7ae7b0f76efb2988c1e54956348dc9eeb7812e0b SRC=recipe-maintainers/discourse
STAGES=install,upgrade CCCI_RUN_ID=<id> cc-ci-run runner/run_recipe_ci.py` from a checkout at
`e9c26c7`.)
6. *Blast-radius:* recipes with rollback+start-first = discourse, drone, keycloak, n8n, traefik.
keycloak/n8n upgrade PASS L4 across runs (155/186/187/m2r; 47/54/61/162/197/m2r) ⇒ not affected;
drone/traefik are infra (no recipe-CI upgrade tier). Only discourse is affected; the general
`assert_upgrade_converged` guard now protects all rollback-policy recipes.
**EXPECTED:** all of 1-6 hold. **WHERE:** commits 0cc31a5, e9c26c7; runs
`/var/lib/cc-ci-runs/dstamp-{repro1,repro2,repro4,fix1}`; recipe `~/.abra/recipes/discourse`.
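The policy check behind the blast-radius sweep in step 6 could be sketched like this. It is not the sweep that was actually run: the helper name `services_with_rollback_start_first` is hypothetical, and the input is assumed to be a compose file already parsed into a plain mapping (for example by a YAML loader). The sample values mirror the `update_config` quoted in the root cause; the `db` service is made up for contrast.

```python
from typing import Any, Dict, List


def services_with_rollback_start_first(compose: Dict[str, Any]) -> List[str]:
    """Return names of services whose update policy is rollback + start-first."""
    hits = []
    for name, svc in (compose.get("services") or {}).items():
        cfg = ((svc or {}).get("deploy") or {}).get("update_config") or {}
        if cfg.get("failure_action") == "rollback" and cfg.get("order") == "start-first":
            hits.append(name)
    return sorted(hits)


# Parsed-compose sample: `app` uses the policy from the root cause, `db` is illustrative.
sample = {
    "services": {
        "app": {
            "deploy": {
                "update_config": {
                    "failure_action": "rollback",
                    "order": "start-first",
                    "monitor": "5s",
                }
            }
        },
        "db": {"deploy": {"update_config": {"order": "stop-first"}}},
    }
}

flagged = services_with_rollback_start_first(sample)
print(flagged)
```

Applying a `stop-first` overlay to `app` before the check would drop it from the result, which is the effect the `compose.ccci.yml` overlay is meant to have on discourse.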
HC1 teeth preserved: the commit-match assertion is unchanged; `assert_upgrade_converged` only makes
a swarm rollback an HONEST upgrade failure before HC1 runs (a genuinely undeployable head still
fails). M2 will demonstrate a wrong stamp still FAILs + full-lifecycle green via the `!testme` path.
---
## Root cause detail (evidence)
## ROOT CAUSE (attributed by direct evidence, abra+harness EXONERATED)