probe(dstamp): Adversary independent probe findings — Docker rollback root cause confirmed, fix 0cc31a5 assessed CORRECT, race-window concern flagged (covered by defence-in-depth). Anti-anchoring preserved: JOURNAL not read. Awaiting claim(dstamp) for formal verdict.

2026-06-11 17:12:01 +00:00
parent 0cc31a507e
commit e9eed8e7b7
2 changed files with 107 additions and 0 deletions
--- a/BACKLOG-dstamp.md
+++ b/BACKLOG-dstamp.md
@@ -24,3 +24,30 @@

 ## Adversary findings
 <!-- Adversary-owned. Do not edit above this line in this section. -->
+
+**Root cause independently confirmed @2026-06-11T17:3x (JOURNAL not read, anti-anchoring preserved):**
+
+Docker Swarm `failure_action: rollback` + `order: start-first` in discourse's `compose.yml` app
+service (BOTH `eb96de94` base AND `7ae7b0f` PR-head). On the upgrade chaos redeploy, `start-first`
+runs OLD + NEW tasks co-resident (~2× memory); the heavy Rails/precompile app fails swarm's 5s
+update monitor under host memory pressure → rollback fires → app service spec reverts to
+PreviousSpec (`chaos-version=eb96de94+U`). Because `start-first` kept the OLD task serving,
+`wait_healthy` passed; `deployed_identity` read the rolled-back spec; HC1 misreported it as
+"stamp mismatch" (the real failure was "new task failed the update monitor").
+
+`services_converged` blind spot: `"rollback_completed"` not in blocking states → returned True.
+
+Evidence: `docker service inspect disc-ae10f0_..._app` confirmed `UpdateConfig: {On failure:
+rollback, Order: start-first, Monitoring Period: 5s}`. repro1 (isolated, no concurrency) ALSO
+showed drift → pure-concurrency hypothesis REFUTED independently before reading Builder evidence.
+
+abra exonerated: abra reads `git HEAD = 7ae7b0f` and stamps `7ae7b0f7+U` CORRECTLY. Three
+bail-at-secrets repros + repro2 debug line confirm. The `+U` comes from `compose.ccci.yml` being an
+untracked file in the per-run recipe dir (rcust-era overlay, absent from run 184's pre-rcust path).
+
+Fix 0cc31a5 assessed CORRECT: overlay sets `order: stop-first` (eliminates OOM 2×-memory
+trigger); `lifecycle.assert_upgrade_converged` closes the wait_healthy blind spot by catching
+`"rollback_completed"|"rollback_paused"|"paused"` and failing HONESTLY. HC1 unchanged.
+Minor race window in `assert_upgrade_converged` (first poll could see "none" before Docker
+starts the roll) is covered: with stop-first, a post-race rollback also fails `wait_healthy`.
+No blocker. Formal verdict awaits Builder's `claim(dstamp)` commit.
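
The stamp comparison that the findings above refer to can be sketched as follows. This is a minimal illustration, not the harness code: `stamp_commit` and `stamp_matches_head` are hypothetical names, and comparing the short stamp against the longer head ref by prefix is an assumption. Only the `split("+", 1)[0]` suffix stripping and the concrete stamp values are taken from the notes in this commit.

```python
def stamp_commit(chaos_version: str) -> str:
    """Drop any '+...' suffix (e.g. '+U') from a chaos-version stamp."""
    return chaos_version.split("+", 1)[0]


def stamp_matches_head(chaos_version: str, head_ref: str) -> bool:
    """Return True when the deployed stamp identifies the expected head.

    Assumption: the stamp is a short commit hash, so it is compared as a
    prefix of the longer head ref. The real HC1 logic may differ.
    """
    commit = stamp_commit(chaos_version)
    return bool(commit) and head_ref.startswith(commit)


if __name__ == "__main__":
    head_ref = "7ae7b0f76efb"
    # Stamp written by abra for the PR head: matches once '+U' is stripped.
    print(stamp_matches_head("7ae7b0f7+U", head_ref))
    # Stamp read back from the rolled-back spec: does not match the head.
    print(stamp_matches_head("eb96de94+U", head_ref))
```

Under these assumptions the `+U` suffix never affects the comparison, which is consistent with the finding that the suffix is not the cause of the drift: the mismatch only appears once the service spec has reverted to the base stamp.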
--- a/REVIEW-dstamp.md
+++ b/REVIEW-dstamp.md
@@ -47,3 +47,83 @@ mirror git state moved (unreleased commits pushed past last tag) or a tag re-poi

 Status: idle, awaiting Builder to seed STATUS-dstamp.md and claim M1. Watchdog will ping
 on the `claim(...)` commit.
+
+---
+
+## Independent probe findings @2026-06-11T17:3x (NOT a verdict — no M1 claim yet)
+
+Anti-anchoring preserved: JOURNAL-dstamp NOT read. Root cause derived independently from
+harness code, per-run artifacts (repro1/repro2 console logs), and direct docker service
+inspect on cc-ci. Independently arrived at the same attribution as the Builder.
+
+**Causal chain derived from code + direct evidence:**
+
+1. `provide_ccci_overlay` (rcust-era addition) copies `compose.ccci.yml` into the per-run
+   recipe dir as an UNTRACKED file. Absent in run 184 (2026-06-05, which used the old
+   `install_steps.sh` path writing to canonical `~/.abra`) — consistent with run 184 having
+   no `+U` suffix and passing. The `+U` itself is stripped by HC1's `chaos_commit.split("+",1)[0]`
+   and is NOT the cause of drift.
+
+2. abra reads `git HEAD = 7ae7b0f` and computes `chaos-version = 7ae7b0f7+U` CORRECTLY.
+   Confirmed via three bail-at-secrets manual repros + repro2 debug line
+   `taking chaos version: 7ae7b0f7+U`. abra and the per-run git checkout are EXONERATED.
+
+3. `chaos_redeploy` passes `-c` (no_converge_checks) → `docker stack deploy` returns
+   immediately; Swarm rolling update runs asynchronously.
+
+4. Discourse `compose.yml` (BOTH base `eb96de94` AND PR-head `7ae7b0f`) sets
+   `deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }`
+   on the `app` service. Confirmed by direct `docker service inspect disc-ae10f0_..._app`.
+
+5. With `order: start-first`, OLD + NEW tasks co-reside (~2× memory). Discourse's
+   Rails/Sidekiq precompile is memory-heavy; under the heavier host load since ~06-10
+   (warm keycloak and other rcust-phase stacks), the NEW task intermittently fails swarm's
+   5s update monitor → `failure_action: rollback` fires → Swarm REVERTS the app service
+   spec to PreviousSpec (base deploy, `chaos-version=eb96de94+U`).
+
+6. `services_converged` blind spot: after a rollback, `UpdateStatus.State` is `"rollback_completed"`,
+   which is NOT in the blocking set `("updating", "rollback_started")`, so the function returns
+   True as if converged. Under start-first the OLD task kept serving → `wait_healthy` also
+   passes on the rolled-back spec.
+
+7. `deployed_identity` reads `.Spec.Labels` → rolled-back spec → `chaos-version=eb96de94+U`.
+   HC1 asserts head_ref `7ae7b0f76efb` ≠ `eb96de94` → FAIL with misleading "re-checkout failed".
+
+**Key disproving evidence (independent route):** repro1 was isolated (no concurrent discourse
+run, domain `disc-ae10f0` used for the first time) and STILL showed the drift. This refuted
+the pure-concurrency hypothesis BEFORE reading the Builder's evidence or JOURNAL.
+
+**Intermittency explained (run 184 ✓ solo 06-05; clustered/repro1/repro4 ✗; repro2 ✓):**
+Whether the new start-first task survives the 5s monitor depends on momentary memory pressure.
+Run 184: solo + lighter host load + pre-rcust overlay path → new task survived. repro2: warm
+volumes/containers from repro1 → faster Rails precompile → task survived. The "since ~06-10
+on every run" pattern = heavier baseline load from warm rcust-phase stacks after run 184.
+
+**Fix analysis (Builder commit 0cc31a5 — read before JOURNAL):**
+
+*Part 1 — overlay `order: stop-first`*: Old task stops before new starts → new boots with full
+host memory → no OOM under the 5s monitor → no spurious rollback. `failure_action: rollback`
+intentionally preserved so a genuinely broken head still rolls back and is caught.
+ASSESSMENT: **CORRECT AND SUFFICIENT** for eliminating the spurious-rollback trigger.
+
+*Part 2 — `lifecycle.assert_upgrade_converged`*: Called in `perform_upgrade` immediately after
+`chaos_redeploy`, before `wait_healthy`. Polls `docker service inspect
+--format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}'` until terminal.
+Returns on `""|"none"|"completed"`; raises on `"rollback_completed"|"rollback_paused"|"paused"`;
+polls on `"updating"|"rollback_started"`; times out at `meta.DEPLOY_TIMEOUT`.
+ASSESSMENT: **CORRECT** — closes the wait_healthy-masking blind spot. Makes a swarm rollback
+an HONEST upgrade failure ("head did not stay healthy") rather than a misreported stamp mismatch.
+HC1 commit-match logic is unchanged; this only makes the rollback visible before HC1 runs.
+
+**One concern flagged (not a blocker — defense-in-depth covers it):**
+`assert_upgrade_converged` has a theoretical race window: on the very first poll, Docker may
+not yet have left the pre-update state (`"none"`, or a prior `"completed"`) for `"updating"`
+(tiny gap between `docker stack deploy` returning and the Swarm manager scheduling the roll).
+If the race fires, the function returns OK on that stale state, then the rollback happens silently.
+Mitigation: with `stop-first` (fix part 1), a post-assert-converged rollback leaves NO serving
+task during the rollback → `wait_healthy` also FAILS → the test result is still FAIL, just
+with a less specific error ("wait_healthy timeout" rather than "swarm rolled back"). HC1 is
+NOT weakened even if the race fires. No action required unless a recipe uses `start-first`
+where a post-race rollback could masquerade as a clean upgrade.
+
+**Status:** no `claim(dstamp)` commit yet. Awaiting M1 claim to issue formal verdict.
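
The polling behaviour and the race window described in the review notes can be sketched as follows. This is an illustration under stated assumptions, not the code in 0cc31a5: the function name `assert_upgrade_converged_sketch`, the `UpgradeRolledBack` exception, the injected `read_state` callable (standing in for the `docker service inspect --format ...` call), and the choice to keep polling on unrecognised states are all hypothetical. Only the three state sets are taken from the notes.

```python
import time
from typing import Callable, Iterable

# State sets as listed in the review notes for assert_upgrade_converged.
OK_STATES = {"", "none", "completed"}
FAIL_STATES = {"rollback_completed", "rollback_paused", "paused"}
POLL_STATES = {"updating", "rollback_started"}


class UpgradeRolledBack(RuntimeError):
    """Hypothetical error type: Swarm rolled the update back or paused it."""


def assert_upgrade_converged_sketch(
    read_state: Callable[[], str],
    timeout_s: float,
    interval_s: float = 0.0,
) -> str:
    """Poll the update state until it is terminal; return the final OK state."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = read_state()
        if state in FAIL_STATES:
            raise UpgradeRolledBack(f"swarm update ended in state {state!r}")
        if state in OK_STATES:
            return state
        # POLL_STATES land here. Assumption: unknown states also keep
        # polling until the timeout rather than passing or failing.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"update still in state {state!r}")
        time.sleep(interval_s)


def scripted(states: Iterable[str]) -> Callable[[], str]:
    """Build a fake read_state that replays a fixed sequence of states."""
    it = iter(states)
    return lambda: next(it)


if __name__ == "__main__":
    # Normal case: the roll is already visible, so the rollback is caught.
    try:
        assert_upgrade_converged_sketch(
            scripted(["updating", "rollback_completed"]), timeout_s=1.0
        )
    except UpgradeRolledBack as exc:
        print("caught:", exc)

    # Race case: the first poll lands before the roll starts and returns
    # early, so the later rollback is never observed by this function.
    print(assert_upgrade_converged_sketch(
        scripted(["none", "updating", "rollback_completed"]), timeout_s=1.0
    ))
```

In the race case the function returns on the stale first state, which is the gap the notes rely on `wait_healthy` under `stop-first` to cover.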