diff --git a/BACKLOG-dstamp.md b/BACKLOG-dstamp.md index 70af3e7..4f2f883 100644 --- a/BACKLOG-dstamp.md +++ b/BACKLOG-dstamp.md @@ -10,15 +10,12 @@ - [x] Timeline: run 184 (06-05, solo) green @7ae7b0f; clustered 06-10/06-11 runs drift @ same ref. - [x] Identify shared-stack collision vector (`app_domain` = hash(recipe|pr|ref); upgrade chaos_redeploy bypasses app-domain flock). -- [ ] **IN FLIGHT:** single clean ISOLATED real run (install,upgrade @7ae7b0f, console-captured) - → decide concurrency-artifact vs real drift. -- [ ] If concurrency artifact: pin the exact mechanism producing the `eb96de9+U` chaos label on - the shared stack (deliberate 2-run concurrency repro if needed); decide the fix - (app-lock the upgrade chaos_redeploy / serialize same-stack runs) WITHOUT weakening HC1. -- [ ] If real env drift: read the isolated-run console, attribute the exact 06-05→06-10 change. -- [ ] Blast-radius sweep: every enrolled recipe's latest upgrade-tier evidence for the same - signature (prev-base tag commit stamped where a version was expected). -- [ ] Restore discourse to its true level in real CI via the drone `!testme` path (M2). +- [x] Isolated real runs (repro1–4) + direct UpdateStatus/PreviousSpec capture → root cause attributed. +- [x] Concurrency REFUTED (solo repro1/4 reproduce). Mechanism = swarm `failure_action:rollback` + reverts the chaos-version label (direct evidence repro4: Spec=7ae7b0f7+U→PreviousSpec=eb96de9+U). +- [x] 06-05→06-10 change = rcust-phase heavier resident host load → start-first new task reliably OOMs → rollback every run (solo 06-05 run 184 didn't; my repro2 didn't either). +- [x] Blast-radius: only discourse affected (keycloak/n8n have the policy but upgrade PASS L4 across runs; drone/traefik infra). General harness guard covers all. +- [ ] Restore discourse to its true level in real CI via the drone `!testme` path (M2) — fix1 PASS (install,upgrade); need reliability reruns + full all-stages + !testme. - [ ] Prove HC1 still has teeth (a deliberately wrong stamp still FAILs). - [ ] Close the DEFERRED.md dstamp re-entry with pointers. diff --git a/JOURNAL-dstamp.md b/JOURNAL-dstamp.md index 2467e64..3a7648f 100644 --- a/JOURNAL-dstamp.md +++ b/JOURNAL-dstamp.md @@ -131,6 +131,32 @@ new-task failure, intermittent), which trips `failure_action: rollback`. recipes). HC1 teeth intact: a head that truly can't stay healthy still fails. - Will validate stop-first actually eliminates the rollback with a full real run before claiming. +## 2026-06-11 (cont.) — fix validated + blast-radius + +**Fix implemented** (commit 0cc31a5): (1) `tests/discourse/compose.ccci.yml` app service +`deploy.update_config.order: stop-first`; (2) `lifecycle.assert_upgrade_converged()` + call in +`generic.perform_upgrade` right after `chaos_redeploy` (before wait_healthy) — waits for swarm's +app-service rolling update to reach a TERMINAL state and FAILs honestly on rollback*/paused. +Unit tests: 253 passed (no regression). + +**fix1 validation** (run `dstamp-fix1`, fresh checkout @0cc31a5, install+upgrade, solo): UPGRADE +**PASS** — `upgrade-converged: …UpdateStatus=completed`, `upgrade→PR-head: head_ref=7ae7b0f7 +chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`. The head is deployed, the update +converges (no rollback), HC1 reads 7ae7b0f7+U. (Bug was intermittent — running more to show +reliability, since repro2 passed unpatched.) + +**Blast-radius sweep** — recipes with `failure_action: rollback` + `order: start-first`: +`discourse, drone, keycloak, n8n, traefik`. Evidence check of the upgrade tier across many runs +(incl. the rcust-era m2r-* runs under the same heavy load): +- keycloak: runs 155/186/187/m2r/shot-proof → upgrade PASS L4 (HC1 pass ⇒ chaos==head). NOT affected. +- n8n: runs 47/54/61/162/197/m2r/shot-proof → upgrade PASS L4. NOT affected. +- drone, traefik: cc-ci INFRA (warm-reconciled), NOT enrolled in the recipe-CI upgrade tier. +⇒ **Only discourse actually exhibits the drift** — its app is uniquely heavy (Rails asset +precompile, 2.4GB image) so the start-first 2× co-residency OOMs the new task; the lighter +keycloak/n8n new tasks survive swarm's monitor, so no rollback. The general harness guard +(`assert_upgrade_converged`) now protects ALL rollback-policy recipes from a silent future +rollback (honest failure), and discourse additionally gets stop-first to converge reliably. + Fix direction (HC1 must keep its teeth — do NOT relax the commit match): the upgrade chaos redeploy must assert against the *intended* applied spec, not a silently rolled-back one — i.e. the harness must DETECT a swarm rollback (UpdateStatus.State rollback*) and treat it as an upgrade FAILURE with