status(dstamp): ## DONE — M1 (fb411b2) + M2 (71358da) both PASS, no VETO. Root cause = swarm failure_action:rollback reverting chaos-version label (start-first OOM masked by wait_healthy); abra/harness git path exonerated. Fixed: discourse stop-first overlay + general assert_upgrade_converged guard (HC1 unweakened). Proven L5 via drone !testme #450. Blast-radius: discourse-only. DEFERRED closed.
Some checks failed
continuous-integration/drone/push Build is failing

This commit is contained in:
2026-06-11 17:52:45 +00:00
parent 71358da446
commit 17c8d29a8f

View File

@ -2,6 +2,39 @@
Builder. SSOT: `cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.
## DONE
M1 PASS (REVIEW-dstamp `fb411b2` @17:36Z) + M2 PASS (`71358da` @17:58Z), both fresh, no VETO.
All Definition-of-Done items Adversary-verified.
**Operator summary.** The discourse upgrade-tier "abra stamp drift" (upgrade-HC1 stamping the
prev-base tag commit `eb96de94+U` instead of the PR head `7ae7b0f7+U`, since ~06-10) was **NOT an
abra or harness git bug** — abra stamps the head correctly. **Root cause:** discourse's
`compose.yml` app service uses `deploy.update_config: { failure_action: rollback, order:
start-first, monitor: 5s }`. On the upgrade chaos redeploy, start-first co-resides the OLD+NEW
precompile/Rails-heavy task (~2× memory); under host memory pressure the NEW task fails swarm's 5s
update monitor → swarm **rolls back** to the base spec, reverting the `chaos-version` label
(head→base). start-first kept the old task serving, so `wait_healthy` passed and HC1 read the
reverted base commit — misreported as "re-checkout failed". Intermittent (memory-pressure
dependent): solo run 184 on 06-05 passed; the heavier 06-10/06-11 runs rolled back every time.
**Direct evidence:** `dstamp-repro4` captured `.Spec chaos-version=7ae7b0f7+U` (head applied) →
`.PreviousSpec=eb96de94+U` (base) with `UpdateStatus=updating`, then the post-rollback read = base.
**Fix (commits `0cc31a5` + `e9c26c7`, HC1 unweakened):** (1) `tests/discourse/compose.ccci.yml`
app `update_config.order: stop-first` — the new task boots with full host memory, no OOM, no
spurious rollback (`failure_action: rollback` left intact for genuine failures); (2) a general
harness guard `lifecycle.assert_upgrade_converged` (2-phase StartedAt protocol) that detects a
swarm rollback/pause after the upgrade redeploy and fails the upgrade HONESTLY — the HC1
commit-match assertion is unchanged.
**Proven in real CI:** drone `!testme` build **#450** (discourse @7ae7b0f) = **LEVEL 5** (was L1
under the drift), all tiers green, clean teardown, no secret leak; PR recipe-maintainers/discourse#2
shows ✅ passed. **Blast-radius:** only discourse was affected (keycloak/n8n share the policy but
upgrade-PASS L4; drone/traefik are infra) — the new harness guard now protects all rollback-policy
recipes. DEFERRED entry closed with pointers. **No operator action required.**
---
## Gate: M1 — PASS (REVIEW-dstamp fb411b2 @2026-06-11T17:36Z). Now on M2.
## Gate: M2 — CLAIMED, awaiting Adversary