journal(dstamp): root cause = swarm failure_action:rollback reverts chaos-version label to base spec (start-first masks it via wait_healthy); concurrency refuted; repro3 capturing UpdateStatus
All checks were successful
continuous-integration/drone/push Build is passing
@@ -61,3 +61,48 @@ Open: must still explain *exactly* how a concurrent peer produces an `eb96de9+U`
label on the shared stack; a base deploy is pinned/non-chaos (no chaos label), so the +U chaos
label must come from some chaos deploy with HEAD=eb96de9. The isolated real run + (if needed) a
deliberate 2-run concurrency repro will nail the mechanism. Will NOT claim M1 on inference.

## 2026-06-11 (cont.) — REAL runs: concurrency REFUTED, true root cause = swarm rollback

Three real install+upgrade runs of discourse @7ae7b0f (CCCI_RUN_ID=dstamp-repro{1,2,3}), each
SOLO/isolated (no concurrent discourse run):

- **base deploy is CHAOS** (not pinned): the `compose.ccci.yml` overlay is present, so
  `deploy_app` takes the `has_ccci_overlay` auto-chaos branch (`lifecycle.py:291-298`) and the
  base stamps `chaos-version = eb96de9+U` on the shared stack. (My earlier bail-at-secrets repros
  used a non-chaos/manual base, which is why they didn't expose it.)
- **repro1 (unpatched): upgrade FAIL** with `chaos commit 'eb96de94+U', not 7ae7b0f76efb`. The
  per-run tree reflog + snapshot prove HEAD = **7ae7b0f** at the upgrade deploy (last checkout
  16:39:03, no checkout-back), yet the deployed `.Spec` chaos label was eb96de9+U.
- **repro2 (instrumented: abra deploy `--debug` + a HEAD-print subprocess before the redeploy):
  upgrade PASS** with `[DSTAMP] taking chaos version: 7ae7b0f7+U`, HEAD=7ae7b0f,
  `deployed_identity = {version 0.9.0+3.5.0, image bitnamilegacy/discourse:3.3.1, chaos 7ae7b0f7+U}`.

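The comparison behind repro1's failure message can be sketched as follows. This is a minimal illustration, not the harness's actual HC1 code: `chaos_label_matches_head` is a hypothetical helper, and it assumes a chaos label is an abbreviated commit hash followed by an optional `+...` suffix, as in the values quoted above.

```python
def chaos_label_matches_head(chaos_label: str, head_sha: str) -> bool:
    """Return True if the short hash in a chaos label is a prefix of HEAD.

    Assumption: labels look like '7ae7b0f7+U' (short hash, optional '+...' suffix).
    """
    # Drop everything from the first '+' onward, keeping only the short hash.
    short_sha = chaos_label.split("+", 1)[0].strip().lower()
    if not short_sha:
        return False  # an empty label can never match a commit
    return head_sha.strip().lower().startswith(short_sha)


# Values taken from the repro1 / repro2 notes above.
repro1_ok = chaos_label_matches_head("eb96de94+U", "7ae7b0f76efb")
repro2_ok = chaos_label_matches_head("7ae7b0f7+U", "7ae7b0f76efb")
print(repro1_ok, repro2_ok)  # False True
```

The check itself is sound in both runs; what differs is the label the service carried when it was read.
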
So the SAME solo config is **intermittent** (184✓ 06-05, m2b/m2p/ab✗ 06-10/11, repro1✗, repro2✓).
A result that flips with a tiny timing change ⇒ **NOT a concurrency artifact, NOT abra version-resolution**
(abra computes 7ae7b0f7 correctly, as proven by repro2's debug line AND all 3 bail-at-secrets repros).

**TRUE ROOT CAUSE (recipe deploy policy + heavy/flaky new task):** the discourse `compose.yml` app
service sets `deploy.update_config: { failure_action: rollback, order: start-first }` with a
`healthcheck.start_period: 20m`. The upgrade chaos deploy applies the head spec
(`chaos-version=7ae7b0f7+U`) start-first (old + new task co-resident = ~2× memory for a
precompile-heavy Rails app). When the NEW task intermittently fails swarm's update monitor,
swarm executes **failure_action: rollback ⇒ reverts the app service to its PreviousSpec (the
base: `chaos-version=eb96de9+U`)**. Under `start-first` the OLD task keeps serving, so the
harness `wait_healthy` still passes, but `deployed_identity` reads `.Spec.Labels` of the
ROLLED-BACK spec and sees the base commit. The "since ~06-10 on every run" pattern fits: the
rcust-phase runs happened under heavier host load (warm keycloak etc.), so the new task reliably
failed the monitor ⇒ rollback every time, whereas the solo 06-05 run (184) didn't roll back. The
mechanism is harness- and abra-neutral, exactly as observed.

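The masking effect can be reproduced in a toy model. The sketch below is not swarm's implementation: `ToyService`, `apply_update` and the two probe functions are illustrative stand-ins that assume only what this entry states, namely that a failed update monitor under `failure_action: rollback` restores the previous spec, and that under `start-first` the old task never stops serving.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToyService:
    """Toy stand-in for a swarm service; NOT swarm's real data model."""
    spec_labels: Dict[str, str]
    previous_labels: Optional[Dict[str, str]] = None
    update_state: Optional[str] = None
    serving_task: str = "old"


def apply_update(svc: ToyService, new_labels: Dict[str, str],
                 new_task_passes_monitor: bool) -> None:
    """Model a start-first update with failure_action: rollback."""
    svc.previous_labels = dict(svc.spec_labels)
    svc.spec_labels = dict(new_labels)
    if new_task_passes_monitor:
        svc.serving_task = "new"  # old task is retired only after success
        svc.update_state = "completed"
        return
    # Monitor failed: the spec reverts, and the old task was never stopped.
    # (What swarm does with PreviousSpec afterwards is not modelled here.)
    svc.spec_labels = dict(svc.previous_labels)
    svc.update_state = "rollback_completed"


def wait_healthy(svc: ToyService) -> bool:
    """Analogue of the harness probe: something is serving."""
    return svc.serving_task in ("old", "new")


def deployed_identity(svc: ToyService) -> Optional[str]:
    """Analogue of reading the chaos label from the current spec."""
    return svc.spec_labels.get("chaos-version")


svc = ToyService(spec_labels={"chaos-version": "eb96de9+U"})
apply_update(svc, {"chaos-version": "7ae7b0f7+U"}, new_task_passes_monitor=False)
print(wait_healthy(svc), deployed_identity(svc), svc.update_state)
# True eb96de9+U rollback_completed
```

In the model, the health probe and the identity read both "succeed" after a rollback; only `update_state` distinguishes the rolled-back service from a converged one.
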
repro3 (UpdateStatus + PreviousSpec capture, NO `--debug` so the failing timing is preserved) is
running to catch the swarm rollback in the act. Expected: `UpdateStatus.State = rollback_*`, and
`PreviousSpec.Labels` chaos=eb96de9+U equal to the `.Spec.Labels` read after the revert. That would
be the direct-evidence smoking gun.

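A capture of this kind could look roughly like the sketch below. It is not the repro3 script: `inspect_service`, `snapshot` and `capture_timeline` are hypothetical names, the bare `chaos-version` label key is an assumption (the real key may carry a prefix), and the only external call used is `docker service inspect`, whose JSON output is an array with one object per service.

```python
import json
import subprocess
import time
from typing import Callable, Dict, List, Optional

CHAOS_LABEL = "chaos-version"  # assumption: the real label key may be prefixed


def inspect_service(name: str) -> dict:
    """Return `docker service inspect <name>` as a dict (needs a swarm manager)."""
    out = subprocess.run(
        ["docker", "service", "inspect", name],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)[0]


def snapshot(svc: dict) -> Dict[str, Optional[str]]:
    """Reduce an inspect document to the fields relevant to a rollback."""
    status = svc.get("UpdateStatus") or {}
    spec = svc.get("Spec") or {}
    prev = svc.get("PreviousSpec") or {}  # may be absent
    return {
        "state": status.get("State"),
        "message": status.get("Message"),
        "spec_chaos": (spec.get("Labels") or {}).get(CHAOS_LABEL),
        "previous_chaos": (prev.get("Labels") or {}).get(CHAOS_LABEL),
    }


def capture_timeline(inspect: Callable[[], dict], polls: int,
                     interval_s: float = 5.0) -> List[Dict[str, Optional[str]]]:
    """Poll `inspect` and record a snapshot whenever it differs from the last."""
    timeline: List[Dict[str, Optional[str]]] = []
    for i in range(polls):
        snap = snapshot(inspect())
        if not timeline or timeline[-1] != snap:
            timeline.append(snap)
        if i + 1 < polls:
            time.sleep(interval_s)
    return timeline
```

Taking the inspect call as a parameter keeps the polling loop testable without a swarm; in a real run it would be something like `lambda: inspect_service(service_name)`, started before the upgrade redeploy so the transition into a `rollback_*` state is recorded rather than inferred afterwards.
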
Fix direction (HC1 must keep its teeth; do NOT relax the commit match): the upgrade chaos redeploy
must assert against the *intended* applied spec, not a silently rolled-back one. That is, the
harness must DETECT a swarm rollback (`UpdateStatus.State` = `rollback_*`) and treat it as an
upgrade FAILURE with a clear message (the deploy did not converge to the head spec), AND/OR make
the upgrade redeploy immune to silent rollback masking (e.g. assert `UpdateStatus.State` is
`completed` before reading identity). The recipe's rollback policy is legitimate for prod; the
harness bug is that a rollback is invisible to HC1 and masquerades as "stamped the wrong commit".
Will finalise the fix after repro3 confirms.

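One possible shape for that guard is sketched below, pending repro3. It is not the final fix: `UpgradeDidNotConverge` and `read_converged_chaos_label` are hypothetical names, the input is assumed to be the parsed `docker service inspect` document for the app service, and the bare `chaos-version` key is an assumption.

```python
from typing import Optional

CHAOS_LABEL = "chaos-version"  # assumption: the real label key may be prefixed


class UpgradeDidNotConverge(AssertionError):
    """The service is not running the spec the upgrade deploy applied."""


def read_converged_chaos_label(svc: dict) -> Optional[str]:
    """Return the chaos label only if no rollback or in-flight update is reported."""
    status = svc.get("UpdateStatus") or {}
    state = status.get("State")
    message = status.get("Message")

    if state is not None and state.startswith("rollback"):
        raise UpgradeDidNotConverge(
            f"swarm rolled the service back (UpdateStatus.State={state!r}, "
            f"Message={message!r}): the deploy did not converge to the head spec"
        )
    # A missing UpdateStatus is treated as "no update recorded", not as failure.
    if state not in (None, "completed"):
        raise UpgradeDidNotConverge(
            f"update not finished (UpdateStatus.State={state!r}); "
            "refusing to read identity yet"
        )
    labels = (svc.get("Spec") or {}).get("Labels") or {}
    return labels.get(CHAOS_LABEL)
```

The commit match stays exactly as strict as before; the guard only ensures it runs against a converged spec and that a rollback surfaces under its own name instead of as a wrong-commit failure.
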