From 9959ad6a2d26f447869edffeaf286d8a0ecb00a7 Mon Sep 17 00:00:00 2001 From: autonomic-bot Date: Thu, 11 Jun 2026 17:04:13 +0000 Subject: [PATCH] =?UTF-8?q?status(dstamp):=20DIRECT=20EVIDENCE=20=E2=80=94?= =?UTF-8?q?=20repro4=20caught=20Spec=3D7ae7b0f7+U=20+=20PreviousSpec=3Deb9?= =?UTF-8?q?6de94+U=20+=20State=3Dupdating=20post-redeploy;=20swarm=20failu?= =?UTF-8?q?re=5Faction:rollback=20reverts=20label=20(masked=20by=20start-f?= =?UTF-8?q?irst+wait=5Fhealthy);=20abra+harness=20exonerated.=20Fix:=20sto?= =?UTF-8?q?p-first=20overlay=20+=20harness=20rollback=20detection?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- JOURNAL-dstamp.md | 32 ++++++++++++++++++++++++++++ STATUS-dstamp.md | 53 +++++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 81 insertions(+), 4 deletions(-) diff --git a/JOURNAL-dstamp.md b/JOURNAL-dstamp.md index 8e87302..2467e64 100644 --- a/JOURNAL-dstamp.md +++ b/JOURNAL-dstamp.md 
@@ -99,6 +99,38 @@ repro3 (UpdateStatus + PreviousSpec capture, NO --debug to preserve failing timi get the swarm rollback in the act (expect `UpdateStatus.State = rollback_*`, `PreviousSpec.Labels` chaos=eb96de9+U == the read `.Spec.Labels` after revert). That is the direct-evidence smoking gun. +### DIRECT EVIDENCE — captured (repro4, solo/isolated, upgrade FAIL) +repro3 base deploy FATA'd (abra convergence monitor gave up — discourse is genuinely flaky/heavy +under load, which is the very premise). repro4 reached the upgrade and the post-`chaos_redeploy` +`docker service inspect _app` capture is the smoking gun: +- `UpdateStatus = {"State":"updating","Message":"update in progress"}` +- `.Spec.Labels` chaos-version = **7ae7b0f7+U**, version = 0.9.0+3.5.0 (HEAD spec applied OK) +- `.PreviousSpec.Labels` chaos-version = **eb96de94+U**, version = 0.7.0+3.3.1 (the base) +- `deployed_identity` (same instant) = chaos **7ae7b0f7+U** (reads Spec, correct) +Then `wait_healthy` ran (old task serving under start-first → passes); the new task failed swarm's +monitor → `failure_action: rollback` reverted `.Spec` → `.PreviousSpec` (eb96de94+U); the +assertion-phase read saw eb96de94+U → HC1 FAIL. The ONLY operation that turns `.Spec.Labels` from +7ae7b0f7+U into the exact `.PreviousSpec` eb96de94+U is a swarm rollback. abra+harness exonerated; +the head was really deployed and then swarm-reverted. Attribution complete, by direct evidence. + +Note the app image is `bitnamilegacy/discourse:3.3.1` for BOTH base and head spec (head only bumps +the version label + db image), so the new task isn't failing on a missing image — it's the +start-first 2× co-residency of the precompile/Rails-heavy app under host memory pressure (a real +new-task failure, intermittent), which trips `failure_action: rollback`. 
+ +### Fix plan (HC1 teeth preserved) +- Reliability: `tests/discourse/compose.ccci.yml` overlay → app `deploy.update_config.order: + stop-first` (old stops before new starts → new boots with full memory → genuinely healthy → no + spurious rollback). Upgrade-to-head still really deployed+asserted; not a weakening. WHY in header. + Risk to weigh: stop-first = brief real downtime during the CI upgrade (covered by DEPLOY_TIMEOUT + 3600). Alternative `failure_action: pause` REJECTED — it would let a genuinely-failed new task + pass HC1 (start-first keeps old serving) = test-weakening. +- Correctness: harness upgrade path asserts the redeploy converged to the head spec (UpdateStatus + not rollback*/paused / `.Spec` not reverted to `.PreviousSpec`) → honest failure message on a + real rollback, instead of the misleading "re-checkout failed". General (all rollback-policy + recipes). HC1 teeth intact: a head that truly can't stay healthy still fails. +- Will validate stop-first actually eliminates the rollback with a full real run before claiming. + Fix direction (HC1 must keep its teeth — do NOT relax the commit match): the upgrade chaos redeploy must assert against the *intended* applied spec, not a silently rolled-back one — i.e. the harness must DETECT a swarm rollback (UpdateStatus.State rollback*) and treat it as an upgrade FAILURE with diff --git a/STATUS-dstamp.md b/STATUS-dstamp.md index f46dbe2..d36520f 100644 --- a/STATUS-dstamp.md +++ b/STATUS-dstamp.md 
@@ -2,7 +2,51 @@ Builder. SSOT: `cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2. -## Phase state: INVESTIGATING (no gate claimed yet) +## Phase state: ROOT CAUSE ATTRIBUTED (direct evidence) — building fix, no gate claimed yet + +## ROOT CAUSE (attributed by direct evidence, abra+harness EXONERATED) + +The upgrade chaos redeploy applies the **correct** head spec, then swarm **rolls it back** to the +base spec, reverting the `chaos-version` label — masked by the recipe's `start-first` strategy + +the harness's `wait_healthy` (the OLD task keeps serving, so health passes). + +Recipe policy (`~/.abra/recipes/discourse/compose.yml`, app service): `deploy.update_config: +{ failure_action: rollback, order: start-first }`, `healthcheck.start_period: 20m`. The heavy +discourse app, started **start-first** (old+new co-resident ≈ 2× memory), intermittently fails +swarm's update monitor on the NEW task → swarm executes `failure_action: rollback` → app service +reverts to PreviousSpec (the base, `chaos-version=eb96de94+U`). + +**Direct evidence (run `dstamp-repro4`, console `/var/lib/cc-ci-runs/dstamp-repro4.console.log`, +solo/isolated):** immediately after `chaos_redeploy`, `docker service inspect _app`: +- `UpdateStatus.State = "updating"`, +- `.Spec.Labels coop-cloud..chaos-version = 7ae7b0f7+U` (HEAD applied — abra stamped head + correctly), `.version = 0.9.0+3.5.0`, +- `.PreviousSpec.Labels …chaos-version = eb96de94+U` (the base), `.version = 0.7.0+3.3.1`. +Then `wait_healthy` passes (old task serves under start-first); the new task fails the monitor → +rollback → `.Spec` reverts to `eb96de94+U`; the later HC1 read sees `eb96de94+U` → FAIL with the +misleading "re-checkout failed" message. (`dstamp-repro2`, lighter timing, had NO rollback → +upgrade PASS @ `7ae7b0f7+U`.) 
+ +Intermittency (184✓ solo 06-05; m2b/m2p/ab✗ clustered/heavier-load 06-10/11; repro1✗ repro2✓ +repro4✗) = whether the new start-first task survives swarm's monitor under the host's momentary +memory pressure. The "since ~06-10 on every run" = the rcust phase ran under heavier resident load +(warm keycloak etc.) so the new task reliably failed → rollback every time. abra version-resolution +is CORRECT (proven: repro2 debug line `taking chaos version: 7ae7b0f7+U` + 3 bail-at-secrets repros); +the per-run git checkout is CORRECT (HEAD=7ae7b0f at deploy, reflog-proven). NOT abra, NOT the +per-run tree, NOT concurrency. + +## Fix (in progress) — HC1 keeps its teeth +1. **Reliability (restore true level):** discourse `tests/discourse/compose.ccci.yml` overlay set + the app service `deploy.update_config.order: stop-first` so the new task boots with full memory + (no 2× co-residency) and genuinely becomes healthy → no spurious rollback. The upgrade-to-head + is still really deployed + asserted on head; HC1 unchanged. Documented WHY in the overlay header. +2. **Correctness (honesty, general):** the harness upgrade path detects a swarm rollback after the + chaos redeploy (UpdateStatus.State rollback*/paused, or `.Spec` reverted to `.PreviousSpec`) and + fails the upgrade with the TRUE reason ("head spec applied then swarm-rolled-back: new task + failed the update monitor") instead of the misleading "re-checkout failed". A genuinely + undeployable head still FAILS (teeth preserved). +3. **Blast-radius:** sweep all enrolled recipes for `failure_action: rollback` + start-first heavy + apps with the same latent signature. ## What is established (direct evidence, reproducible) 
@@ -43,9 +87,10 @@ Builder. SSOT: `cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2. the shared stack — NOT an abra/recipe/env regression. Under test now. ## In flight -- Isolated clean real run (`CCCI_RUN_ID=dstamp-repro1`, STAGES=install,upgrade, ref 7ae7b0f, - no concurrent discourse run) with full console capture → decides: isolated real run GREEN - (⇒ concurrency artifact) vs DRIFT (⇒ read exact console). Console: `/var/lib/cc-ci-runs/dstamp-repro1.console.log` on cc-ci. +- Implementing the fix (overlay stop-first + harness rollback detection), then a full real run + (all stages) to prove discourse reliably reaches its true level, then the `!testme` drone path. +- Repro evidence runs: `/var/lib/cc-ci-runs/dstamp-repro{1,2,3,4}.console.log` on cc-ci + (repro2 PASS @7ae7b0f7+U; repro4 captured the rollback Spec/PreviousSpec). ## Blocked - (none)
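Review bot: In the JOURNAL-dstamp.md hunk, the "DIRECT EVIDENCE" bullets record `UpdateStatus = {"State":"updating",...}` with `.Spec.Labels` still at 7ae7b0f7+U, which is the state before any revert, while the unchanged context just above says the smoking gun would be `UpdateStatus.State = rollback_*` with the read `.Spec.Labels` equal to `PreviousSpec.Labels`. The revert itself is inferred from the later HC1 read, not captured. Either add the assertion-time `docker service inspect` output (State and `.Spec.Labels` after the revert) to the bullet list, or reword "Attribution complete, by direct evidence" to say which step is captured and which is inferred. The memory-pressure paragraph has the same gap: no task error or memory figure is cited for the new task's failure, so it should be marked as a hypothesis until the stop-first run confirms it.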
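Review bot: In the first STATUS-dstamp.md hunk, Fix item 2 does not say when the rollback check reads the service, and the evidence a few lines above it shows `UpdateStatus.State = "updating"` at the point where `wait_healthy` then passes. A single read at that moment would see neither rollback*/paused nor a reverted `.Spec`, so the rollback would stay masked. State that the harness waits, with a timeout, for UpdateStatus to leave `updating` before asserting, and that it compares the `.Spec.Labels` chaos-version against the intended head value rather than only against `.PreviousSpec`. Item 1 also omits the stop-first downtime risk and the rejected `failure_action: pause` alternative that the journal hunk records; a one-line pointer to the journal would keep the two files consistent.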
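Review bot: In the STATUS-dstamp.md "In flight" hunk, "a full real run (all stages) to prove discourse reliably reaches its true level" makes one run the proof for a failure this same patch describes as intermittent, and repro2 already passed at 7ae7b0f7+U with no fix applied. State a pass criterion: how many consecutive runs, under the heavier resident load the root-cause section ties to the failures, and that each run's post-redeploy inspect shows no rollback. The evidence line annotates only repro2 and repro4; add the outcomes of repro1 and repro3 so the result of the removed repro1 entry is still recorded in this file.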