probe(dstamp): Adversary independent probe findings — Docker rollback root cause confirmed, fix 0cc31a5 assessed CORRECT, race-window concern flagged (covered by defence-in-depth). Anti-anchoring preserved: JOURNAL not read. Awaiting claim(dstamp) for formal verdict.
All checks were successful
continuous-integration/drone/push Build is passing

This commit is contained in:
2026-06-11 17:12:01 +00:00
committed by autonomic-bot
parent 0cc31a507e
commit e9eed8e7b7
2 changed files with 107 additions and 0 deletions

View File

@ -24,3 +24,30 @@
## Adversary findings
<!-- Adversary-owned. Do not edit above this line in this section. -->
**Root cause independently confirmed @2026-06-11T17:3x (JOURNAL not read, anti-anchoring preserved):**
Docker Swarm `failure_action: rollback` + `order: start-first` in discourse's `compose.yml` app
service (BOTH `eb96de94` base AND `7ae7b0f` PR-head). On the upgrade chaos redeploy, `start-first`
runs OLD + NEW tasks co-resident (~2× memory); the heavy Rails/precompile app fails swarm's 5s
update monitor under host memory pressure → rollback fires → app service spec reverts to
PreviousSpec (`chaos-version=eb96de94+U`). Because `start-first` kept the OLD task serving,
`wait_healthy` passed; `deployed_identity` read the rolled-back spec; HC1 misreported it as
"stamp mismatch" (the real failure was "new task failed the update monitor").
`services_converged` blind spot: `"rollback_completed"` not in blocking states → returned True.
Evidence: `docker service inspect disc-ae10f0_..._app` confirmed `UpdateConfig: {On failure:
rollback, Order: start-first, Monitoring Period: 5s}`. repro1 (isolated, no concurrency) ALSO
showed drift → pure-concurrency hypothesis REFUTED independently before reading Builder evidence.
abra exonerated: abra reads `git HEAD = 7ae7b0f` and stamps `7ae7b0f7+U` CORRECTLY. Three
bail-at-secrets repros + repro2 debug line confirm. The `+U` comes from `compose.ccci.yml` as
untracked file in per-run recipe dir (rcust-era overlay absent from run 184's pre-rcust path).
Fix 0cc31a5 assessed CORRECT: overlay sets `order: stop-first` (eliminates OOM 2×-memory
trigger); `lifecycle.assert_upgrade_converged` closes the wait_healthy blind spot by catching
`"rollback_completed"|"rollback_paused"|"paused"` and failing HONESTLY. HC1 unchanged.
Minor race window in `assert_upgrade_converged` (first poll could see "none" before Docker
starts the roll) is covered: with stop-first, a post-race rollback also fails `wait_healthy`.
No blocker. Formal verdict awaits Builder's `claim(dstamp)` commit.

View File

@ -47,3 +47,83 @@ mirror git state moved (unreleased commits pushed past last tag) or a tag re-poi
Status: idle, awaiting Builder to seed STATUS-dstamp.md and claim M1. Watchdog will ping
on the `claim(...)` commit.
---
## Independent probe findings @2026-06-11T17:3x (NOT a verdict — no M1 claim yet)
Anti-anchoring preserved: JOURNAL-dstamp NOT read. Root cause derived independently from
harness code, per-run artifacts (repro1/repro2 console logs), and direct docker service
inspect on cc-ci. Independently arrived at the same attribution as the Builder.
**Causal chain derived from code + direct evidence:**
1. `provide_ccci_overlay` (rcust-era addition) copies `compose.ccci.yml` into the per-run
recipe dir as an UNTRACKED file. Absent in run 184 (2026-06-05, which used the old
`install_steps.sh` path writing to canonical `~/.abra`) — consistent with run 184 having
no `+U` suffix and passing. The `+U` itself is stripped by HC1's `chaos_commit.split("+",1)[0]`
and is NOT the cause of drift.
2. abra reads `git HEAD = 7ae7b0f` and computes `chaos-version = 7ae7b0f7+U` CORRECTLY.
Confirmed via three bail-at-secrets manual repros + repro2 debug line
`taking chaos version: 7ae7b0f7+U`. abra and the per-run git checkout are EXONERATED.
3. `chaos_redeploy` passes `-c` (no_converge_checks) → `docker stack deploy` returns
immediately; Swarm rolling update runs asynchronously.
4. Discourse `compose.yml` (BOTH base `eb96de94` AND PR-head `7ae7b0f`) sets
`deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }`
on the `app` service. Confirmed by direct `docker service inspect disc-ae10f0_..._app`.
5. With `order: start-first`, OLD + NEW task co-reside (~2× memory). Discourse's
Rails/Sidekiq precompile is memory-heavy; under the heavier host load since ~06-10
(warm keycloak and other rcust-phase stacks), the NEW task intermittently fails swarm's
5s update monitor → `failure_action: rollback` fires → Swarm REVERTS the app service
spec to PreviousSpec (base deploy, `chaos-version=eb96de94+U`).
6. `services_converged` blind spot: after rollback `UpdateStatus.State = "rollback_completed"`,
NOT in the blocking set `("updating", "rollback_started")` → returns True as if converged.
Under start-first the OLD task kept serving → `wait_healthy` also passes on the
rolled-back spec.
7. `deployed_identity` reads `.Spec.Labels` → rolled-back spec → `chaos-version=eb96de94+U`.
HC1 asserts head_ref `7ae7b0f76efb``eb96de94` → FAIL with misleading "re-checkout failed".
**Key disproving evidence (independent route):** repro1 was isolated (no concurrent discourse
run, domain `disc-ae10f0` used for the first time) and STILL showed the drift. This refuted
the pure-concurrency hypothesis BEFORE reading the Builder's evidence or JOURNAL.
**Intermittency explained (run 184 ✓ solo 06-05; clustered/repro1/repro4 ✗; repro2 ✓):**
Whether the new start-first task survives the 5s monitor depends on momentary memory pressure.
Run 184: solo + lighter host load + pre-rcust overlay path → new task survived. repro2: warm
volumes/containers from repro1 → faster Rails precompile → task survived. The "since ~06-10
on every run" pattern = heavier baseline load from warm rcust-phase stacks after run 184.
**Fix analysis (Builder commit 0cc31a5 — read before JOURNAL):**
*Part 1 — overlay `order: stop-first`*: Old task stops before new starts → new boots with full
host memory → no OOM under the 5s monitor → no spurious rollback. `failure_action: rollback`
intentionally preserved so a genuinely broken head still rolls back and is caught.
ASSESSMENT: **CORRECT AND SUFFICIENT** for eliminating the spurious-rollback trigger.
*Part 2 — `lifecycle.assert_upgrade_converged`*: Called in `perform_upgrade` immediately after
`chaos_redeploy`, before `wait_healthy`. Polls `docker service inspect
--format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}'` until terminal.
Returns on `""|"none"|"completed"`; raises on `"rollback_completed"|"rollback_paused"|"paused"`;
polls on `"updating"|"rollback_started"`; times out at `meta.DEPLOY_TIMEOUT`.
ASSESSMENT: **CORRECT** — closes the wait_healthy-masking blind spot. Makes a swarm rollback
an HONEST upgrade failure ("head did not stay healthy") rather than a misreported stamp mismatch.
HC1 commit-match logic is unchanged; this only makes the rollback visible before HC1 runs.
**One concern flagged (not a blocker — defense-in-depth covers it):**
`assert_upgrade_converged` has a theoretical race window: on the very first poll, Docker may
not yet have transitioned from a prior `"completed"` state to `"updating"` (tiny gap between
`docker stack deploy` returning and the Swarm manager scheduling the roll). If the race fires,
the function returns OK on `"none"`, then the rollback happens silently afterward.
Mitigation: with `stop-first` (fix part 1), a post-assert-converged rollback leaves NO serving
task during the rollback → `wait_healthy` also FAILS → the test result is still FAIL, just
with a less specific error ("wait_healthy timeout" rather than "swarm rolled back"). HC1 is
NOT weakened even if the race fires. No action required unless a recipe uses `start-first`
where a post-race rollback could masquerade as a clean upgrade.
**Status:** no `claim(dstamp)` commit yet. Awaiting M1 claim to issue formal verdict.