journal(dstamp): fix1 validation PASS (chaos 7ae7b0f7+U, converged); blast-radius = only discourse affected (keycloak/n8n upgrade-PASS L4; drone/traefik infra); general guard covers all
All checks were successful
continuous-integration/drone/push Build is passing
@@ -131,6 +131,32 @@ new-task failure, intermittent), which trips `failure_action: rollback`.
recipes). HC1 teeth intact: a head that truly can't stay healthy still fails.
- Will validate stop-first actually eliminates the rollback with a full real run before claiming.

## 2026-06-11 (cont.) — fix validated + blast-radius

**Fix implemented** (commit 0cc31a5): (1) `tests/discourse/compose.ccci.yml` app service sets
`deploy.update_config.order: stop-first`; (2) `lifecycle.assert_upgrade_converged()` + call in
`generic.perform_upgrade` right after `chaos_redeploy` (before `wait_healthy`): waits for swarm's
app-service rolling update to reach a TERMINAL state and FAILs honestly on rollback*/paused.
Unit tests: 253 passed (no regression).
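
A minimal sketch of the convergence guard described above, not the actual `lifecycle.assert_upgrade_converged()` code. The names `assert_converged`, `docker_update_state`, and `UpgradeNotConverged` are hypothetical, as are the timeout defaults. It assumes the state is read from `docker service inspect` and that a missing `UpdateStatus` means the update has not started yet.

```python
import json
import subprocess
import time
from typing import Callable, Optional


class UpgradeNotConverged(AssertionError):
    """The rolling update rolled back, paused, or did not finish in time."""


def docker_update_state(service: str) -> Optional[str]:
    """Read UpdateStatus.State for a swarm service (None if never updated)."""
    out = subprocess.run(
        ["docker", "service", "inspect", "--format", "{{json .UpdateStatus}}", service],
        check=True, capture_output=True, text=True,
    )
    status = json.loads(out.stdout) or {}
    return status.get("State")


def assert_converged(
    fetch_state: Callable[[], Optional[str]],
    timeout_s: float = 600.0,   # hypothetical default
    poll_s: float = 5.0,        # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the update completes; fail on rollback*/paused or timeout."""
    deadline = clock() + timeout_s
    while True:
        state = fetch_state()
        if state == "completed":
            return state
        # Any rollback_* state or a paused update is an honest upgrade failure.
        if state is not None and (state == "paused" or state.startswith("rollback")):
            raise UpgradeNotConverged(f"update did not converge: UpdateStatus={state}")
        if clock() >= deadline:
            raise UpgradeNotConverged(f"timed out; last UpdateStatus={state}")
        sleep(poll_s)
```

In a harness this would be called as `assert_converged(lambda: docker_update_state("<stack>_app"))`, with the service name being whatever the deployment actually uses. Injecting `fetch_state`, `sleep`, and `clock` keeps the guard unit-testable without a swarm.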

**fix1 validation** (run `dstamp-fix1`, fresh checkout @0cc31a5, install+upgrade, solo): UPGRADE
**PASS**: `upgrade-converged: …UpdateStatus=completed`, `upgrade→PR-head: head_ref=7ae7b0f7
chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`. The head is deployed, the update
converges (no rollback), HC1 reads 7ae7b0f7+U. (The bug was intermittent and repro2 passed
unpatched, so a single PASS is not proof; running more to show reliability.)

**Blast-radius sweep** of recipes with `failure_action: rollback` + `order: start-first`:
`discourse, drone, keycloak, n8n, traefik`. Evidence check of the upgrade tier across many runs
(incl. the rcust-era m2r-* runs under the same heavy load):
- keycloak: runs 155/186/187/m2r/shot-proof → upgrade PASS L4 (HC1 pass ⇒ chaos==head). NOT affected.
- n8n: runs 47/54/61/162/197/m2r/shot-proof → upgrade PASS L4. NOT affected.
- drone, traefik: cc-ci INFRA (warm-reconciled), NOT enrolled in the recipe-CI upgrade tier.

⇒ **Only discourse actually exhibits the drift**: its app is uniquely heavy (Rails asset
precompile, 2.4GB image) so the start-first 2× co-residency OOMs the new task; the lighter
keycloak/n8n new tasks survive swarm's monitor, so no rollback. The general harness guard
(`assert_upgrade_converged`) now protects ALL rollback-policy recipes: a future rollback fails
honestly instead of passing silently. Discourse additionally gets stop-first to converge reliably.
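
The sweep criterion above can be sketched as a small filter over an already-parsed compose file. This is an illustration only, not the tooling used for the sweep. The function name `at_risk_services` is hypothetical, and it assumes the compose YAML has been loaded into a plain dict by whatever parser the harness uses.

```python
from typing import Any, Dict, List


def at_risk_services(compose: Dict[str, Any]) -> List[str]:
    """Names of services with failure_action: rollback AND order: start-first."""
    hits = []
    for name, svc in (compose.get("services") or {}).items():
        # Tolerate services with no deploy or update_config section.
        update = ((svc or {}).get("deploy") or {}).get("update_config") or {}
        if (update.get("failure_action") == "rollback"
                and update.get("order") == "start-first"):
            hits.append(name)
    return sorted(hits)


# Made-up compose content for illustration, not taken from any recipe.
example = {
    "services": {
        "app": {"deploy": {"update_config": {
            "failure_action": "rollback", "order": "start-first"}}},
        "db": {"deploy": {"update_config": {
            "failure_action": "rollback", "order": "stop-first"}}},
        "cache": {},
    }
}
print(at_risk_services(example))  # ['app']
```

A match only says the policy combination is present. Whether a recipe actually rolls back still has to come from run evidence, as in the list above.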

Fix direction (HC1 must keep its teeth; do NOT relax the commit match): the upgrade chaos redeploy
must assert against the *intended* applied spec, not a silently rolled-back one, i.e. the harness
must DETECT a swarm rollback (UpdateStatus.State rollback*) and treat it as an upgrade FAILURE with