status(dstamp): DIRECT EVIDENCE — repro4 caught Spec=7ae7b0f7+U + PreviousSpec=eb96de94+U + State=updating post-redeploy; swarm failure_action:rollback reverts label (masked by start-first+wait_healthy); abra+harness exonerated. Fix: stop-first overlay + harness rollback detection
All checks were successful
continuous-integration/drone/push Build is passing
@@ -2,7 +2,51 @@
Builder. SSOT: `cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.
## Phase state: INVESTIGATING (no gate claimed yet)
## Phase state: ROOT CAUSE ATTRIBUTED (direct evidence) — building fix, no gate claimed yet
## ROOT CAUSE (attributed by direct evidence, abra+harness EXONERATED)

The upgrade chaos redeploy applies the **correct** head spec, then swarm **rolls it back** to the
base spec, reverting the `chaos-version` label. The rollback is masked by the recipe's `start-first`
strategy plus the harness's `wait_healthy` (the OLD task keeps serving, so health passes).

Recipe policy (`~/.abra/recipes/discourse/compose.yml`, app service):
`deploy.update_config: { failure_action: rollback, order: start-first }`, `healthcheck.start_period: 20m`.
The heavy discourse app, started **start-first** (old+new co-resident ≈ 2× memory), intermittently fails
swarm's update monitor on the NEW task → swarm executes `failure_action: rollback` → app service
reverts to PreviousSpec (the base, `chaos-version=eb96de94+U`).

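The policy combination described above can be recognized mechanically. The following is a minimal sketch, assuming the compose file has already been parsed into a plain dict (the parsing step is not shown); the function name `rollback_start_first_services` and the `db` service in the sample are hypothetical and only illustrate the check.

```python
from __future__ import annotations

from typing import Any, Mapping


def rollback_start_first_services(compose: Mapping[str, Any]) -> list[str]:
    """Return the services that combine failure_action: rollback with order: start-first."""
    flagged = []
    for name, service in (compose.get("services") or {}).items():
        deploy = (service or {}).get("deploy") or {}
        update_config = deploy.get("update_config") or {}
        if (
            update_config.get("failure_action") == "rollback"
            and update_config.get("order") == "start-first"
        ):
            flagged.append(name)
    return sorted(flagged)


# Sample input: the `app` policy mirrors the one quoted above;
# the `db` service is a hypothetical counter-example.
sample_compose = {
    "services": {
        "app": {
            "deploy": {
                "update_config": {"failure_action": "rollback", "order": "start-first"}
            },
            "healthcheck": {"start_period": "20m"},
        },
        "db": {"deploy": {"update_config": {"order": "stop-first"}}},
    }
}

print(rollback_start_first_services(sample_compose))
```

The same check could plausibly be reused when sweeping other enrolled recipes for the same signature, although it says nothing about how heavy a given app is.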
**Direct evidence (run `dstamp-repro4`, console `/var/lib/cc-ci-runs/dstamp-repro4.console.log`,
solo/isolated):** immediately after `chaos_redeploy`, `docker service inspect <stack>_app` showed:

- `UpdateStatus.State = "updating"`,
- `.Spec.Labels coop-cloud.<stack>.chaos-version = 7ae7b0f7+U` (HEAD applied; abra stamped head
  correctly), `.version = 0.9.0+3.5.0`,
- `.PreviousSpec.Labels …chaos-version = eb96de94+U` (the base), `.version = 0.7.0+3.3.1`.

Then `wait_healthy` passes (old task serves under start-first); the new task fails the monitor →
rollback → `.Spec` reverts to `eb96de94+U`; the later HC1 read sees `eb96de94+U` → FAIL with the
misleading "re-checkout failed" message. (`dstamp-repro2`, lighter timing, had NO rollback →
upgrade PASS @ `7ae7b0f7+U`.)

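The three fields read in the evidence above can be pulled out of the inspect output in one step. This is a sketch only, assuming the default JSON array that `docker service inspect` prints when no `--format` is given; the helper name `rollout_snapshot` and the stack name `demo` are hypothetical, and the sample document is trimmed to the fields of interest with the label values quoted from `dstamp-repro4`.

```python
from __future__ import annotations

import json


def rollout_snapshot(inspect_json: str, stack: str) -> dict:
    """Extract update state plus current and previous chaos-version labels."""
    service = json.loads(inspect_json)[0]  # inspect prints a JSON array of services
    key = f"coop-cloud.{stack}.chaos-version"
    spec_labels = (service.get("Spec") or {}).get("Labels") or {}
    previous_labels = (service.get("PreviousSpec") or {}).get("Labels") or {}
    return {
        "state": (service.get("UpdateStatus") or {}).get("State"),
        "spec": spec_labels.get(key),
        "previous": previous_labels.get(key),
    }


# Trimmed, hand-written sample; `demo` is a hypothetical stack name.
sample_inspect = json.dumps([
    {
        "Spec": {"Labels": {"coop-cloud.demo.chaos-version": "7ae7b0f7+U"}},
        "PreviousSpec": {"Labels": {"coop-cloud.demo.chaos-version": "eb96de94+U"}},
        "UpdateStatus": {"State": "updating"},
    }
])

snapshot = rollout_snapshot(sample_inspect, "demo")
print(snapshot)
```

Taking such a snapshot right after the redeploy and again at the HC1 read would make a revert of `.Spec` directly visible, instead of relying on a single late read.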
Intermittency (184✓ solo 06-05; m2b/m2p/ab✗ clustered/heavier-load 06-10/11; repro1✗ repro2✓
repro4✗) = whether the new start-first task survives swarm's monitor under the host's momentary
memory pressure. The "since ~06-10 on every run" pattern = the rcust phase ran under heavier resident
load (warm keycloak etc.), so the new task reliably failed → rollback every time. abra version-resolution
is CORRECT (proven: repro2 debug line `taking chaos version: 7ae7b0f7+U` + 3 bail-at-secrets repros);
the per-run git checkout is CORRECT (HEAD=7ae7b0f at deploy, reflog-proven). NOT abra, NOT the
per-run tree, NOT concurrency.

## Fix (in progress) — HC1 keeps its teeth

1. **Reliability (restore true level):** the discourse `tests/discourse/compose.ccci.yml` overlay sets
   the app service `deploy.update_config.order: stop-first` so the new task boots with full memory
   (no 2× co-residency) and genuinely becomes healthy → no spurious rollback. The upgrade-to-head
   is still really deployed and asserted on head; HC1 unchanged. WHY is documented in the overlay header.
2. **Correctness (honesty, general):** the harness upgrade path detects a swarm rollback after the
   chaos redeploy (`UpdateStatus.State` rollback*/paused, or `.Spec` reverted to `.PreviousSpec`) and
   fails the upgrade with the TRUE reason ("head spec applied then swarm-rolled-back: new task
   failed the update monitor") instead of the misleading "re-checkout failed". A genuinely
   undeployable head still FAILS (teeth preserved).
3. **Blast-radius:** sweep all enrolled recipes for `failure_action: rollback` + start-first heavy
   apps with the same latent signature.

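The detection in item 2 can be sketched as a comparison of two snapshots of the app service: one taken right after the chaos redeploy and one taken at the later read. This is an illustration under stated assumptions, not the harness's actual code: the function name `detect_rollback` and the snapshot dict shape (`state`, `spec`, `previous`) are hypothetical, and the state strings follow the `rollback*`/`paused` wording above.

```python
from __future__ import annotations

TRUE_REASON = (
    "head spec applied then swarm-rolled-back: "
    "new task failed the update monitor"
)


def detect_rollback(after_redeploy: dict, later: dict) -> str | None:
    """Return a failure reason if swarm rolled the update back, else None."""
    state = later.get("state") or ""
    # Signal 1: the update status itself reports a rollback or a pause.
    if state.startswith("rollback") or state == "paused":
        return f"{TRUE_REASON} (UpdateStatus.State={state})"
    # Signal 2: the label that was applied at redeploy time has reverted to the base.
    applied = after_redeploy.get("spec")
    base = after_redeploy.get("previous")
    current = later.get("spec")
    if base is not None and current != applied and current == base:
        return f"{TRUE_REASON} (.Spec reverted from {applied} to {base})"
    return None


# Snapshot values quoted from dstamp-repro4; the later snapshots are hypothetical.
after = {"state": "updating", "spec": "7ae7b0f7+U", "previous": "eb96de94+U"}
print(detect_rollback(after, {"state": "completed", "spec": "7ae7b0f7+U"}))
print(detect_rollback(after, {"state": "rollback_completed", "spec": "eb96de94+U"}))
```

Comparing against the base recorded at redeploy time avoids depending on what `.PreviousSpec` holds after a rollback. A head that never becomes healthy still fails, only with the accurate reason attached.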
## What is established (direct evidence, reproducible)
@@ -43,9 +87,10 @@ Builder. SSOT: `cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.
the shared stack — NOT an abra/recipe/env regression. Under test now.
## In flight

- Isolated clean real run (`CCCI_RUN_ID=dstamp-repro1`, STAGES=install,upgrade, ref 7ae7b0f,
  no concurrent discourse run) with full console capture → decides: isolated real run GREEN
  (⇒ concurrency artifact) vs DRIFT (⇒ read exact console). Console: `/var/lib/cc-ci-runs/dstamp-repro1.console.log` on cc-ci.
- Implementing the fix (overlay stop-first + harness rollback detection), then a full real run
  (all stages) to prove discourse reliably reaches its true level, then the `!testme` drone path.
- Repro evidence runs: `/var/lib/cc-ci-runs/dstamp-repro{1,2,3,4}.console.log` on cc-ci
  (repro2 PASS @7ae7b0f7+U; repro4 captured the rollback Spec/PreviousSpec).

## Blocked
- (none)