journal(dstamp): fix1 validation PASS (chaos 7ae7b0f7+U, converged); blast-radius = only discourse affected (keycloak/n8n upgrade-PASS L4; drone/traefik infra); general guard covers all
All checks were successful
continuous-integration/drone/push Build is passing

This commit is contained in:
2026-06-11 17:16:40 +00:00
committed by autonomic-bot
parent e9eed8e7b7
commit d0d762c9c8
2 changed files with 32 additions and 9 deletions

View File

@ -131,6 +131,32 @@ new-task failure, intermittent), which trips `failure_action: rollback`.
recipes). HC1 teeth intact: a head that truly can't stay healthy still fails.
- Will validate stop-first actually eliminates the rollback with a full real run before claiming.
## 2026-06-11 (cont.) — fix validated + blast-radius
**Fix implemented** (commit 0cc31a5): (1) `tests/discourse/compose.ccci.yml` app service
`deploy.update_config.order: stop-first`; (2) `lifecycle.assert_upgrade_converged()` + call in
`generic.perform_upgrade` right after `chaos_redeploy` (before wait_healthy) — waits for swarm's
app-service rolling update to reach a TERMINAL state and FAILs honestly on rollback*/paused.
Unit tests: 253 passed (no regression).
**fix1 validation** (run `dstamp-fix1`, fresh checkout @0cc31a5, install+upgrade, solo): UPGRADE
**PASS** — `upgrade-converged: …UpdateStatus=completed`, `upgrade→PR-head: head_ref=7ae7b0f7
chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`. The head is deployed, the update
converges (no rollback), HC1 reads 7ae7b0f7+U. (Bug was intermittent — running more to show
reliability, since repro2 passed unpatched.)
**Blast-radius sweep** — recipes with `failure_action: rollback` + `order: start-first`:
`discourse, drone, keycloak, n8n, traefik`. Evidence check of the upgrade tier across many runs
(incl. the rcust-era m2r-* runs under the same heavy load):
- keycloak: runs 155/186/187/m2r/shot-proof → upgrade PASS L4 (HC1 pass ⇒ chaos==head). NOT affected.
- n8n: runs 47/54/61/162/197/m2r/shot-proof → upgrade PASS L4. NOT affected.
- drone, traefik: cc-ci INFRA (warm-reconciled), NOT enrolled in the recipe-CI upgrade tier.
⇒ **Only discourse actually exhibits the drift** — its app is uniquely heavy (Rails asset
precompile, 2.4GB image) so the start-first 2× co-residency OOMs the new task; the lighter
keycloak/n8n new tasks survive swarm's monitor, so no rollback. The general harness guard
(`assert_upgrade_converged`) now protects ALL rollback-policy recipes from a silent future
rollback (honest failure), and discourse additionally gets stop-first to converge reliably.
Fix direction (HC1 must keep its teeth — do NOT relax the commit match): the upgrade chaos redeploy
must assert against the *intended* applied spec, not a silently rolled-back one — i.e. the harness
must DETECT a swarm rollback (UpdateStatus.State rollback*) and treat it as an upgrade FAILURE with