diff --git a/BACKLOG-dstamp.md b/BACKLOG-dstamp.md
index 60b7723..70af3e7 100644
--- a/BACKLOG-dstamp.md
+++ b/BACKLOG-dstamp.md
@@ -24,3 +24,30 @@ ## Adversary findings
+
+**Root cause independently confirmed @2026-06-11T17:3x (JOURNAL not read, anti-anchoring preserved):**
+
+Docker Swarm `failure_action: rollback` + `order: start-first` in discourse's `compose.yml` app
+service (BOTH `eb96de94` base AND `7ae7b0f` PR-head). On the upgrade chaos redeploy, `start-first`
+runs OLD + NEW tasks co-resident (~2× memory); the heavy Rails/precompile app fails swarm's 5s
+update monitor under host memory pressure → rollback fires → app service spec reverts to
+PreviousSpec (`chaos-version=eb96de94+U`). Because `start-first` kept the OLD task serving,
+`wait_healthy` passed; `deployed_identity` read the rolled-back spec; HC1 misreported it as
+"stamp mismatch" (the real failure was "new task failed the update monitor").
+
+`services_converged` blind spot: `"rollback_completed"` not in blocking states → returned True.
+
+Evidence: `docker service inspect disc-ae10f0_..._app` confirmed `UpdateConfig: {On failure:
+rollback, Order: start-first, Monitoring Period: 5s}`. repro1 (isolated, no concurrency) ALSO
+showed drift → pure-concurrency hypothesis REFUTED independently before reading Builder evidence.
+
+abra exonerated: abra reads `git HEAD = 7ae7b0f` and stamps `7ae7b0f7+U` CORRECTLY. Three
+bail-at-secrets repros + repro2 debug line confirm. The `+U` comes from `compose.ccci.yml` as
+untracked file in per-run recipe dir (rcust-era overlay absent from run 184's pre-rcust path).
+
+Fix 0cc31a5 assessed CORRECT: overlay sets `order: stop-first` (eliminates OOM 2×-memory
+trigger); `lifecycle.assert_upgrade_converged` closes the wait_healthy blind spot by catching
+`"rollback_completed"|"rollback_paused"|"paused"` and failing HONESTLY. HC1 unchanged.
+Minor race window in `assert_upgrade_converged` (first poll could see "none" before Docker
+starts the roll) is covered: with stop-first, a post-race rollback also fails `wait_healthy`.
+No blocker. Formal verdict awaits Builder's `claim(dstamp)` commit.
diff --git a/REVIEW-dstamp.md b/REVIEW-dstamp.md
index 2c78ae5..f1558b7 100644
--- a/REVIEW-dstamp.md
+++ b/REVIEW-dstamp.md
@@ -47,3 +47,83 @@ mirror git state moved (unreleased commits pushed past last tag) or a tag re-poi
 Status: idle, awaiting Builder to seed STATUS-dstamp.md and claim M1. Watchdog will ping on the `claim(...)` commit.
+
+---
+
+## Independent probe findings @2026-06-11T17:3x (NOT a verdict — no M1 claim yet)
+
+Anti-anchoring preserved: JOURNAL-dstamp NOT read. Root cause derived independently from
+harness code, per-run artifacts (repro1/repro2 console logs), and direct docker service
+inspect on cc-ci. Independently arrived at the same attribution as the Builder.
+
+**Causal chain derived from code + direct evidence:**
+
+1. `provide_ccci_overlay` (rcust-era addition) copies `compose.ccci.yml` into the per-run
+   recipe dir as an UNTRACKED file. Absent in run 184 (2026-06-05, which used the old
+   `install_steps.sh` path writing to canonical `~/.abra`) — consistent with run 184 having
+   no `+U` suffix and passing. The `+U` itself is stripped by HC1's `chaos_commit.split("+",1)[0]`
+   and is NOT the cause of drift.
+
+2. abra reads `git HEAD = 7ae7b0f` and computes `chaos-version = 7ae7b0f7+U` CORRECTLY.
+   Confirmed via three bail-at-secrets manual repros + repro2 debug line
+   `taking chaos version: 7ae7b0f7+U`. abra and the per-run git checkout are EXONERATED.
+
+3. `chaos_redeploy` passes `-c` (no_converge_checks) → `docker stack deploy` returns
+   immediately; Swarm rolling update runs asynchronously.
+
+4. Discourse `compose.yml` (BOTH base `eb96de94` AND PR-head `7ae7b0f`) sets
+   `deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }`
+   on the `app` service. Confirmed by direct `docker service inspect disc-ae10f0_..._app`.
+
+5. With `order: start-first`, OLD + NEW tasks co-reside (~2× memory). Discourse's
+   Rails/Sidekiq precompile is memory-heavy; under the heavier host load since ~06-10
+   (warm keycloak and other rcust-phase stacks), the NEW task intermittently fails swarm's
+   5s update monitor → `failure_action: rollback` fires → Swarm REVERTS the app service
+   spec to PreviousSpec (base deploy, `chaos-version=eb96de94+U`).
+
+6. `services_converged` blind spot: after rollback `UpdateStatus.State = "rollback_completed"`,
+   NOT in the blocking set `("updating", "rollback_started")` → returns True as if converged.
+   Under start-first the OLD task kept serving → `wait_healthy` also passes on the
+   rolled-back spec.
+
+7. `deployed_identity` reads `.Spec.Labels` → rolled-back spec → `chaos-version=eb96de94+U`.
+   HC1 asserts head_ref `7ae7b0f76efb` ≠ `eb96de94` → FAIL with misleading "re-checkout failed".
+
+**Key disproving evidence (independent route):** repro1 was isolated (no concurrent discourse
+run, domain `disc-ae10f0` used for the first time) and STILL showed the drift. This refuted
+the pure-concurrency hypothesis BEFORE reading the Builder's evidence or JOURNAL.
+
+**Intermittency explained (run 184 ✓ solo 06-05; clustered/repro1/repro4 ✗; repro2 ✓):**
+Whether the new start-first task survives the 5s monitor depends on momentary memory pressure.
+Run 184: solo + lighter host load + pre-rcust overlay path → new task survived. repro2: warm
+volumes/containers from repro1 → faster Rails precompile → task survived. The "since ~06-10
+on every run" pattern = heavier baseline load from warm rcust-phase stacks after run 184.
+
+**Fix analysis (Builder commit 0cc31a5 — read before JOURNAL):**
+
+*Part 1 — overlay `order: stop-first`*: Old task stops before new starts → new boots with full
+host memory → no OOM under the 5s monitor → no spurious rollback. `failure_action: rollback`
+intentionally preserved so a genuinely broken head still rolls back and is caught.
+ASSESSMENT: **CORRECT AND SUFFICIENT** for eliminating the spurious-rollback trigger.
+
+*Part 2 — `lifecycle.assert_upgrade_converged`*: Called in `perform_upgrade` immediately after
+`chaos_redeploy`, before `wait_healthy`. Polls `docker service inspect
+--format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}'` until terminal.
+Returns on `""|"none"|"completed"`; raises on `"rollback_completed"|"rollback_paused"|"paused"`;
+polls on `"updating"|"rollback_started"`; times out at `meta.DEPLOY_TIMEOUT`.
+ASSESSMENT: **CORRECT** — closes the wait_healthy-masking blind spot. Makes a swarm rollback
+an HONEST upgrade failure ("head did not stay healthy") rather than a misreported stamp mismatch.
+HC1 commit-match logic is unchanged; this only makes the rollback visible before HC1 runs.
+
+**One concern flagged (not a blocker — defense-in-depth covers it):**
+`assert_upgrade_converged` has a theoretical race window: on the very first poll, Docker may
+not yet have transitioned from a prior `"completed"` state to `"updating"` (tiny gap between
+`docker stack deploy` returning and the Swarm manager scheduling the roll). If the race fires,
+the function returns OK on the stale `"completed"` (or `"none"`), then the rollback happens silently afterward.
+Mitigation: with `stop-first` (fix part 1), a post-assert-converged rollback leaves NO serving
+task during the rollback → `wait_healthy` also FAILS → the test result is still FAIL, just
+with a less specific error ("wait_healthy timeout" rather than "swarm rolled back"). HC1 is
+NOT weakened even if the race fires. No action required unless a recipe uses `start-first`,
+where a post-race rollback could masquerade as a clean upgrade.
+
+**Status:** no `claim(dstamp)` commit yet. Awaiting M1 claim to issue formal verdict.
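
Reviewer's illustration (not part of the diff above): the state handling that the review attributes to `lifecycle.assert_upgrade_converged` can be sketched as below. This is a minimal sketch, not the Builder's code from 0cc31a5. The names `classify_update_state`, `wait_for_update`, `docker_update_state`, and `UpgradeRolledBack` are hypothetical, and treating an unrecognised state as "keep polling" is an assumption. Only the three state groups and the `--format` template come from the review text.

```python
import subprocess
import time
from typing import Callable

# State groups as listed in the review of assert_upgrade_converged.
OK_STATES = {"", "none", "completed"}
FAILED_STATES = {"rollback_completed", "rollback_paused", "paused"}
IN_PROGRESS_STATES = {"updating", "rollback_started"}

# Template quoted in the review; prints "none" when the service has no UpdateStatus.
STATE_FORMAT = "{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}"


class UpgradeRolledBack(RuntimeError):
    """Hypothetical error type: Swarm rolled back or paused the update."""


def classify_update_state(state: str) -> str:
    """Map a raw UpdateStatus.State string to 'ok', 'failed', or 'pending'."""
    state = state.strip()
    if state in OK_STATES:
        return "ok"
    if state in FAILED_STATES:
        return "failed"
    # Assumption: unknown states are polled like the in-progress ones.
    return "pending"


def docker_update_state(service: str) -> str:
    """Read the update state of one service via `docker service inspect`."""
    result = subprocess.run(
        ["docker", "service", "inspect", "--format", STATE_FORMAT, service],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_update(
    read_state: Callable[[], str],
    timeout_s: float,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the update is terminal; raise on rollback/pause or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = read_state().strip()
        verdict = classify_update_state(state)
        if verdict == "ok":
            return state
        if verdict == "failed":
            raise UpgradeRolledBack(f"swarm update ended in state {state!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"update still {state!r} after {timeout_s}s")
        sleep(poll_s)
```

A caller would bind the service name, for example `wait_for_update(lambda: docker_update_state(name), timeout_s=...)`. Injecting `read_state` keeps the polling logic testable without a Swarm. The sketch does not address the first-poll race flagged above: a stale `"completed"` read before Swarm schedules the roll still returns as OK.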