status(dstamp): DIRECT EVIDENCE — repro4 caught Spec=7ae7b0f7+U + PreviousSpec=eb96de94+U + State=updating post-redeploy; swarm failure_action:rollback reverts label (masked by start-first+wait_healthy); abra+harness exonerated. Fix: stop-first overlay + harness rollback detection
All checks were successful
continuous-integration/drone/push Build is passing
All checks were successful
continuous-integration/drone/push Build is passing
This commit is contained in:
@ -99,6 +99,38 @@ repro3 (UpdateStatus + PreviousSpec capture, NO --debug to preserve failing timi
|
||||
get the swarm rollback in the act (expect `UpdateStatus.State = rollback_*`, `PreviousSpec.Labels`
|
||||
chaos=eb96de9+U == the read `.Spec.Labels` after revert). That is the direct-evidence smoking gun.
|
||||
|
||||
### DIRECT EVIDENCE — captured (repro4, solo/isolated, upgrade FAIL)
|
||||
repro3 base deploy FATA'd (abra convergence monitor gave up — discourse is genuinely flaky/heavy
|
||||
under load, which is the very premise). repro4 reached the upgrade and the post-`chaos_redeploy`
|
||||
`docker service inspect <stack>_app` capture is the smoking gun:
|
||||
- `UpdateStatus = {"State":"updating","Message":"update in progress"}`
|
||||
- `.Spec.Labels` chaos-version = **7ae7b0f7+U**, version = 0.9.0+3.5.0 (HEAD spec applied OK)
|
||||
- `.PreviousSpec.Labels` chaos-version = **eb96de94+U**, version = 0.7.0+3.3.1 (the base)
|
||||
- `deployed_identity` (same instant) = chaos **7ae7b0f7+U** (reads Spec, correct)
|
||||
Then `wait_healthy` ran (old task serving under start-first → passes); the new task failed swarm's
|
||||
monitor → `failure_action: rollback` reverted `.Spec` → `.PreviousSpec` (eb96de94+U); the
|
||||
assertion-phase read saw eb96de94+U → HC1 FAIL. The ONLY operation that turns `.Spec.Labels` from
|
||||
7ae7b0f7+U into the exact `.PreviousSpec` eb96de94+U is a swarm rollback. abra+harness exonerated;
|
||||
the head was really deployed and then swarm-reverted. Attribution complete, by direct evidence.
|
||||
|
||||
Note the app image is `bitnamilegacy/discourse:3.3.1` for BOTH base and head spec (head only bumps
|
||||
the version label + db image), so the new task isn't failing on a missing image — it's the
|
||||
start-first 2× co-residency of the precompile/Rails-heavy app under host memory pressure (a real
|
||||
new-task failure, intermittent), which trips `failure_action: rollback`.
|
||||
|
||||
### Fix plan (HC1 teeth preserved)
|
||||
- Reliability: `tests/discourse/compose.ccci.yml` overlay → app `deploy.update_config.order:
|
||||
stop-first` (old stops before new starts → new boots with full memory → genuinely healthy → no
|
||||
spurious rollback). Upgrade-to-head still really deployed+asserted; not a weakening. WHY in header.
|
||||
Risk to weigh: stop-first = brief real downtime during the CI upgrade (covered by DEPLOY_TIMEOUT
|
||||
3600). Alternative `failure_action: pause` REJECTED — it would let a genuinely-failed new task
|
||||
pass HC1 (start-first keeps old serving) = test-weakening.
|
||||
- Correctness: harness upgrade path asserts the redeploy converged to the head spec (UpdateStatus
|
||||
not rollback*/paused / `.Spec` not reverted to `.PreviousSpec`) → honest failure message on a
|
||||
real rollback, instead of the misleading "re-checkout failed". General (all rollback-policy
|
||||
recipes). HC1 teeth intact: a head that truly can't stay healthy still fails.
|
||||
- Will validate stop-first actually eliminates the rollback with a full real run before claiming.
|
||||
|
||||
Fix direction (HC1 must keep its teeth — do NOT relax the commit match): the upgrade chaos redeploy
|
||||
must assert against the *intended* applied spec, not a silently rolled-back one — i.e. the harness
|
||||
must DETECT a swarm rollback (UpdateStatus.State rollback*) and treat it as an upgrade FAILURE with
|
||||
|
||||
Reference in New Issue
Block a user