probe(dstamp): Adversary independent probe findings — Docker rollback root cause confirmed, fix 0cc31a5 assessed CORRECT, race-window concern flagged (covered by defence-in-depth). Anti-anchoring preserved: JOURNAL not read. Awaiting claim(dstamp) for formal verdict.
All checks were successful
continuous-integration/drone/push Build is passing
@ -24,3 +24,30 @@

## Adversary findings

<!-- Adversary-owned. Do not edit above this line in this section. -->

**Root cause independently confirmed @2026-06-11T17:3x (JOURNAL not read, anti-anchoring preserved):**

Docker Swarm `failure_action: rollback` + `order: start-first` in discourse's `compose.yml` app
service (BOTH `eb96de94` base AND `7ae7b0f` PR-head). On the upgrade chaos redeploy, `start-first`
runs OLD + NEW tasks co-resident (~2× memory); the heavy Rails/precompile app fails swarm's 5s
update monitor under host memory pressure → rollback fires → app service spec reverts to
PreviousSpec (`chaos-version=eb96de94+U`). Because `start-first` kept the OLD task serving,
`wait_healthy` passed; `deployed_identity` read the rolled-back spec; HC1 misreported it as
"stamp mismatch" (the real failure was "new task failed the update monitor").

`services_converged` blind spot: `"rollback_completed"` not in blocking states → returned True.
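
A minimal sketch of that blind spot, assuming `services_converged` reduces to a membership test against the blocking set `("updating", "rollback_started")` reported in these findings. The function body is a hypothetical reconstruction for illustration, not the harness code:

```python
# Hypothetical reconstruction: only the blocking set is taken from the findings.
BLOCKING_STATES = ("updating", "rollback_started")


def services_converged(update_states):
    """Return True when no service reports a blocking UpdateStatus.State."""
    return not any(state in BLOCKING_STATES for state in update_states)


# A service that Swarm has already rolled back reports "rollback_completed".
# That state is not in the blocking set, so the check reports convergence.
print(services_converged(["completed", "rollback_completed"]))  # True

# An update that is still in flight is correctly treated as not converged.
print(services_converged(["updating"]))  # False
```

Under this reading, the check answers "has Swarm stopped moving?" rather than "did the update succeed?", which is why a finished rollback passes it.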

Evidence: `docker service inspect disc-ae10f0_..._app` confirmed `UpdateConfig: {On failure:
rollback, Order: start-first, Monitoring Period: 5s}`. repro1 (isolated, no concurrency) ALSO
showed drift → pure-concurrency hypothesis REFUTED independently before reading Builder evidence.

abra exonerated: abra reads `git HEAD = 7ae7b0f` and stamps `7ae7b0f7+U` CORRECTLY. Three
bail-at-secrets repros + repro2 debug line confirm. The `+U` comes from `compose.ccci.yml` as
untracked file in per-run recipe dir (rcust-era overlay absent from run 184's pre-rcust path).

Fix 0cc31a5 assessed CORRECT: overlay sets `order: stop-first` (eliminates OOM 2×-memory
trigger); `lifecycle.assert_upgrade_converged` closes the wait_healthy blind spot by catching
`"rollback_completed"|"rollback_paused"|"paused"` and failing HONESTLY. HC1 unchanged.
Minor race window in `assert_upgrade_converged` (first poll could see "none" before Docker
starts the roll) is covered: with stop-first, a post-race rollback also fails `wait_healthy`.
No blocker. Formal verdict awaits Builder's `claim(dstamp)` commit.

@ -47,3 +47,83 @@ mirror git state moved (unreleased commits pushed past last tag) or a tag re-poi

Status: idle, awaiting Builder to seed STATUS-dstamp.md and claim M1. Watchdog will ping
on the `claim(...)` commit.

---

## Independent probe findings @2026-06-11T17:3x (NOT a verdict; no M1 claim yet)

Anti-anchoring preserved: JOURNAL-dstamp NOT read. Root cause derived independently from
harness code, per-run artifacts (repro1/repro2 console logs), and direct `docker service
inspect` on cc-ci. Independently arrived at the same attribution as the Builder.

**Causal chain derived from code + direct evidence:**

1. `provide_ccci_overlay` (rcust-era addition) copies `compose.ccci.yml` into the per-run
   recipe dir as an UNTRACKED file. Absent in run 184 (2026-06-05, which used the old
   `install_steps.sh` path writing to canonical `~/.abra`), consistent with run 184 having
   no `+U` suffix and passing. The `+U` itself is stripped by HC1's `chaos_commit.split("+",1)[0]`
   and is NOT the cause of drift.

2. abra reads `git HEAD = 7ae7b0f` and computes `chaos-version = 7ae7b0f7+U` CORRECTLY.
   Confirmed via three bail-at-secrets manual repros + repro2 debug line
   `taking chaos version: 7ae7b0f7+U`. abra and the per-run git checkout are EXONERATED.

3. `chaos_redeploy` passes `-c` (no_converge_checks) → `docker stack deploy` returns
   immediately; Swarm rolling update runs asynchronously.

4. Discourse `compose.yml` (BOTH base `eb96de94` AND PR-head `7ae7b0f`) sets
   `deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }`
   on the `app` service. Confirmed by direct `docker service inspect disc-ae10f0_..._app`.

5. With `order: start-first`, the OLD and NEW tasks co-reside (~2× memory). Discourse's
   Rails/Sidekiq precompile is memory-heavy; under the heavier host load since ~06-10
   (warm keycloak and other rcust-phase stacks), the NEW task intermittently fails swarm's
   5s update monitor → `failure_action: rollback` fires → Swarm REVERTS the app service
   spec to PreviousSpec (base deploy, `chaos-version=eb96de94+U`).

6. `services_converged` blind spot: after rollback `UpdateStatus.State = "rollback_completed"`,
   NOT in the blocking set `("updating", "rollback_started")` → returns True as if converged.
   Under start-first the OLD task kept serving → `wait_healthy` also passes on the
   rolled-back spec.

7. `deployed_identity` reads `.Spec.Labels` → rolled-back spec → `chaos-version=eb96de94+U`.
   HC1 compares head_ref `7ae7b0f76efb` against `eb96de94`, finds a mismatch → FAIL with the
   misleading "re-checkout failed".
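
Steps 1 and 7 can be illustrated with a small sketch of the commit comparison. Only the `split("+", 1)[0]` call is taken from the notes above; the helper names and the prefix comparison between the short stamp and the full `head_ref` are assumptions made for illustration:

```python
def base_commit(chaos_version: str) -> str:
    """Drop the '+U' suffix, mirroring the split quoted for HC1."""
    return chaos_version.split("+", 1)[0]


def commits_match(head_ref: str, chaos_version: str) -> bool:
    """Assumed comparison: the stamped short hash must be a prefix of head_ref."""
    stamped = base_commit(chaos_version)
    return bool(stamped) and head_ref.startswith(stamped)


head_ref = "7ae7b0f76efb"

# The stamp abra computed for the PR head: the +U suffix is harmless.
print(commits_match(head_ref, "7ae7b0f7+U"))  # True

# The label read back from the rolled-back spec: a genuine mismatch.
print(commits_match(head_ref, "eb96de94+U"))  # False
```

Read this way, HC1 itself behaves correctly: the label it inspects really does carry the base commit, because Swarm restored the previous spec before the check ran.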

**Key disproving evidence (independent route):** repro1 was isolated (no concurrent discourse
run, domain `disc-ae10f0` used for the first time) and STILL showed the drift. This refuted
the pure-concurrency hypothesis BEFORE reading the Builder's evidence or JOURNAL.

**Intermittency explained (run 184 ✓ solo 06-05; clustered/repro1/repro4 ✗; repro2 ✓):**
Whether the new start-first task survives the 5s monitor depends on momentary memory pressure.
Run 184: solo + lighter host load + pre-rcust overlay path → new task survived. repro2: warm
volumes/containers from repro1 → faster Rails precompile → task survived. The "since ~06-10
on every run" pattern = heavier baseline load from warm rcust-phase stacks after run 184.

**Fix analysis (Builder commit 0cc31a5, read before JOURNAL):**

*Part 1, overlay `order: stop-first`*: Old task stops before new starts → new boots with full
host memory → no OOM under the 5s monitor → no spurious rollback. `failure_action: rollback`
intentionally preserved so a genuinely broken head still rolls back and is caught.
ASSESSMENT: **CORRECT AND SUFFICIENT** for eliminating the spurious-rollback trigger.

*Part 2, `lifecycle.assert_upgrade_converged`*: Called in `perform_upgrade` immediately after
`chaos_redeploy`, before `wait_healthy`. Polls `docker service inspect
--format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}'` until terminal.
Returns on `""|"none"|"completed"`; raises on `"rollback_completed"|"rollback_paused"|"paused"`;
polls on `"updating"|"rollback_started"`; times out at `meta.DEPLOY_TIMEOUT`.
ASSESSMENT: **CORRECT**. Closes the wait_healthy-masking blind spot and makes a swarm rollback
an HONEST upgrade failure ("head did not stay healthy") rather than a misreported stamp mismatch.
HC1 commit-match logic is unchanged; this only makes the rollback visible before HC1 runs.
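
The state handling described for Part 2 can be sketched as follows. This is not the Builder's code: the `read_state` callable (standing in for the `docker service inspect` call), the exception name, the default timeout, and the choice to keep polling on unrecognised states are all assumptions made so the example runs without Docker:

```python
import time

OK_STATES = ("", "none", "completed")
FAILED_STATES = ("rollback_completed", "rollback_paused", "paused")


class UpgradeRolledBack(RuntimeError):
    """Raised when Swarm reports that the update was rolled back or paused."""


def assert_upgrade_converged(read_state, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll read_state() until the update reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while True:
        state = read_state().strip()
        if state in OK_STATES:
            return state
        if state in FAILED_STATES:
            raise UpgradeRolledBack(f"swarm update ended in state {state!r}")
        # "updating", "rollback_started", or (assumption) any unknown state:
        # keep polling until the deadline passes.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"update still {state!r} after {timeout}s")
        sleep(interval)


def scripted(states):
    """Test double: returns the given states one per call."""
    it = iter(states)
    return lambda: next(it)


print(assert_upgrade_converged(scripted(["updating", "completed"]), interval=0))  # completed

try:
    assert_upgrade_converged(
        scripted(["updating", "rollback_started", "rollback_completed"]), interval=0
    )
except UpgradeRolledBack as exc:
    print(exc)  # swarm update ended in state 'rollback_completed'
```

Injecting `read_state` keeps the state machine testable in isolation; in the harness the equivalent value would come from the `--format` template quoted above.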

**One concern flagged (not a blocker; defense-in-depth covers it):**
`assert_upgrade_converged` has a theoretical race window: on the very first poll, Docker may
not yet have transitioned from a prior `"completed"` (or `"none"`) state to `"updating"` (a tiny
gap between `docker stack deploy` returning and the Swarm manager scheduling the roll). If the
race fires, the function returns OK on that stale state, then the rollback happens silently
afterward.
Mitigation: with `stop-first` (fix part 1), a rollback that starts after
`assert_upgrade_converged` returns leaves NO serving task during the rollback → `wait_healthy`
also FAILS → the test result is still FAIL, just with a less specific error ("wait_healthy
timeout" rather than "swarm rolled back"). HC1 is NOT weakened even if the race fires. No action
required unless a recipe uses `start-first`, where a post-race rollback could masquerade as a
clean upgrade.
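
The race can be reproduced on paper with a scripted state sequence. The poller below is a deliberately simplified stand-in (no timeout, no Docker) and the sequences are invented for illustration; only the state names come from the notes above:

```python
OK_STATES = ("", "none", "completed")
FAILED_STATES = ("rollback_completed", "rollback_paused", "paused")


def first_terminal_state(states):
    """Return (state, polls) for the first terminal state in the sequence."""
    polls = 0
    for state in states:
        polls += 1
        if state in OK_STATES or state in FAILED_STATES:
            return state, polls
    raise RuntimeError("sequence ended without a terminal state")


# Race fires: the first poll still sees the previous update's "completed".
stale = ["completed", "updating", "rollback_started", "rollback_completed"]
print(first_terminal_state(stale))  # ('completed', 1)

# Race does not fire: Swarm has already moved to "updating" by the first poll.
prompt = ["updating", "rollback_started", "rollback_completed"]
print(first_terminal_state(prompt))  # ('rollback_completed', 3)
```

In the stale case the poller stops after one read and never observes the later rollback, which is the window the `stop-first` mitigation is relied on to cover.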

**Status:** no `claim(dstamp)` commit yet. Awaiting M1 claim to issue formal verdict.