journal(dstamp): fix1 validation PASS (chaos 7ae7b0f7+U, converged); blast-radius = only discourse affected (keycloak/n8n upgrade-PASS L4; drone/traefik infra); general guard covers all
All checks were successful
continuous-integration/drone/push Build is passing

This commit is contained in:
2026-06-11 17:16:40 +00:00
committed by autonomic-bot
parent e9eed8e7b7
commit d0d762c9c8
2 changed files with 32 additions and 9 deletions

View File

@ -10,15 +10,12 @@
- [x] Timeline: run 184 (06-05, solo) green @7ae7b0f; clustered 06-10/06-11 runs drift @ same ref.
- [x] Identify shared-stack collision vector (`app_domain` = hash(recipe|pr|ref); upgrade
chaos_redeploy bypasses app-domain flock).
- [ ] **IN FLIGHT:** single clean ISOLATED real run (install,upgrade @7ae7b0f, console-captured)
→ decide concurrency-artifact vs real drift.
- [ ] If concurrency artifact: pin the exact mechanism producing the `eb96de9+U` chaos label on
the shared stack (deliberate 2-run concurrency repro if needed); decide the fix
(app-lock the upgrade chaos_redeploy / serialize same-stack runs) WITHOUT weakening HC1.
- [ ] If real env drift: read the isolated-run console, attribute the exact 06-05→06-10 change.
- [ ] Blast-radius sweep: every enrolled recipe's latest upgrade-tier evidence for the same
signature (prev-base tag commit stamped where a version was expected).
- [ ] Restore discourse to its true level in real CI via the drone `!testme` path (M2).
- [x] Isolated real runs (repro14) + direct UpdateStatus/PreviousSpec capture → root cause attributed.
- [x] Concurrency REFUTED (solo repro1/4 reproduce). Mechanism = swarm `failure_action:rollback`
reverts the chaos-version label (direct evidence repro4: Spec=7ae7b0f7+U→PreviousSpec=eb96de9+U).
- [x] 06-05→06-10 change = rcust-phase heavier resident host load → start-first new task reliably OOMs → rollback every run (solo 06-05 run 184 didn't; my repro2 didn't either).
- [x] Blast-radius: only discourse affected (keycloak/n8n have the policy but upgrade PASS L4 across runs; drone/traefik infra). General harness guard covers all.
- [ ] Restore discourse to its true level in real CI via the drone `!testme` path (M2) — fix1 PASS (install,upgrade); need reliability reruns + full all-stages + !testme.
- [ ] Prove HC1 still has teeth (a deliberately wrong stamp still FAILs).
- [ ] Close the DEFERRED.md dstamp re-entry with pointers.

View File

@ -131,6 +131,32 @@ new-task failure, intermittent), which trips `failure_action: rollback`.
recipes). HC1 teeth intact: a head that truly can't stay healthy still fails.
- Will validate stop-first actually eliminates the rollback with a full real run before claiming.
## 2026-06-11 (cont.) — fix validated + blast-radius
**Fix implemented** (commit 0cc31a5): (1) `tests/discourse/compose.ccci.yml` app service
`deploy.update_config.order: stop-first`; (2) `lifecycle.assert_upgrade_converged()` + call in
`generic.perform_upgrade` right after `chaos_redeploy` (before wait_healthy) — waits for swarm's
app-service rolling update to reach a TERMINAL state and FAILs honestly on rollback*/paused.
Unit tests: 253 passed (no regression).
**fix1 validation** (run `dstamp-fix1`, fresh checkout @0cc31a5, install+upgrade, solo): UPGRADE
**PASS** — `upgrade-converged: …UpdateStatus=completed`, `upgrade→PR-head: head_ref=7ae7b0f7
chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`. The head is deployed, the update
converges (no rollback), HC1 reads 7ae7b0f7+U. (Bug was intermittent — running more to show
reliability, since repro2 passed unpatched.)
**Blast-radius sweep** — recipes with `failure_action: rollback` + `order: start-first`:
`discourse, drone, keycloak, n8n, traefik`. Evidence check of the upgrade tier across many runs
(incl. the rcust-era m2r-* runs under the same heavy load):
- keycloak: runs 155/186/187/m2r/shot-proof → upgrade PASS L4 (HC1 pass ⇒ chaos==head). NOT affected.
- n8n: runs 47/54/61/162/197/m2r/shot-proof → upgrade PASS L4. NOT affected.
- drone, traefik: cc-ci INFRA (warm-reconciled), NOT enrolled in the recipe-CI upgrade tier.
⇒ **Only discourse actually exhibits the drift** — its app is uniquely heavy (Rails asset
precompile, 2.4GB image) so the start-first 2× co-residency OOMs the new task; the lighter
keycloak/n8n new tasks survive swarm's monitor, so no rollback. The general harness guard
(`assert_upgrade_converged`) now protects ALL rollback-policy recipes from a silent future
rollback (honest failure), and discourse additionally gets stop-first to converge reliably.
Fix direction (HC1 must keep its teeth — do NOT relax the commit match): the upgrade chaos redeploy
must assert against the *intended* applied spec, not a silently rolled-back one — i.e. the harness
must DETECT a swarm rollback (UpdateStatus.State rollback*) and treat it as an upgrade FAILURE with