Files
cc-ci/BACKLOG-dstamp.md

51 lines
3.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# BACKLOG — phase `dstamp`
## Build backlog (Builder-owned)
- [x] Read phase plan + plan.md §6.1/§7/§9 + Adversary prep notes + stamp-relevant harness code.
- [x] Establish abra's chaos-version mechanism from abra source @06a57de (= pinned binary).
- [x] Rule out abra-version drift (constant store path since nixos system-4, 2026-06-01).
- [x] Minimal reproductions of the git/abra chaos-version path (cp-a; go-git base; mirror-faithful)
— all stamp the CORRECT head 7ae7b0f7, NO drift in current host state.
- [x] Timeline: run 184 (06-05, solo) green @7ae7b0f; clustered 06-10/06-11 runs drift @ same ref.
- [x] Identify shared-stack collision vector (`app_domain` = hash(recipe|pr|ref); upgrade
chaos_redeploy bypasses app-domain flock).
- [x] Isolated real runs (repro14) + direct UpdateStatus/PreviousSpec capture → root cause attributed.
- [x] Concurrency REFUTED (solo repro1/4 reproduce). Mechanism = swarm `failure_action:rollback`
reverts the chaos-version label (direct evidence repro4: Spec=7ae7b0f7+U→PreviousSpec=eb96de9+U).
- [x] 06-05→06-10 change = rcust-phase heavier resident host load → start-first new task reliably OOMs → rollback every run (solo 06-05 run 184 didn't; my repro2 didn't either).
- [x] Blast-radius: only discourse affected (keycloak/n8n have the policy but upgrade PASS L4 across runs; drone/traefik infra). General harness guard covers all.
- [ ] Restore discourse to its true level in real CI via the drone `!testme` path (M2) — fix1 PASS (install,upgrade); need reliability reruns + full all-stages + !testme.
- [ ] Prove HC1 still has teeth (a deliberately wrong stamp still FAILs).
- [ ] Close the DEFERRED.md dstamp re-entry with pointers.
## Adversary findings
<!-- Adversary-owned. Do not edit above this line in this section. -->
**Root cause independently confirmed @2026-06-11T17:3x (JOURNAL not read, anti-anchoring preserved):**
Docker Swarm `failure_action: rollback` + `order: start-first` in discourse's `compose.yml` app
service (BOTH `eb96de94` base AND `7ae7b0f` PR-head). On the upgrade chaos redeploy, `start-first`
runs OLD + NEW tasks co-resident (~2× memory); the heavy Rails/precompile app fails swarm's 5s
update monitor under host memory pressure → rollback fires → app service spec reverts to
PreviousSpec (`chaos-version=eb96de94+U`). Because `start-first` kept the OLD task serving,
`wait_healthy` passed; `deployed_identity` read the rolled-back spec; HC1 misreported it as
"stamp mismatch" (the real failure was "new task failed the update monitor").
`services_converged` blind spot: `"rollback_completed"` not in blocking states → returned True.
Evidence: `docker service inspect disc-ae10f0_..._app` confirmed `UpdateConfig: {On failure:
rollback, Order: start-first, Monitoring Period: 5s}`. repro1 (isolated, no concurrency) ALSO
showed drift → pure-concurrency hypothesis REFUTED independently before reading Builder evidence.
abra exonerated: abra reads `git HEAD = 7ae7b0f` and stamps `7ae7b0f7+U` CORRECTLY. Three
bail-at-secrets repros + repro2 debug line confirm. The `+U` comes from `compose.ccci.yml` as
untracked file in per-run recipe dir (rcust-era overlay absent from run 184's pre-rcust path).
Fix 0cc31a5 assessed CORRECT: overlay sets `order: stop-first` (eliminates OOM 2×-memory
trigger); `lifecycle.assert_upgrade_converged` closes the wait_healthy blind spot by catching
`"rollback_completed"|"rollback_paused"|"paused"` and failing HONESTLY. HC1 unchanged.
Minor race window in `assert_upgrade_converged` (first poll could see "none" before Docker
starts the roll) is covered: with stop-first, a post-race rollback also fails `wait_healthy`.
No blocker. Formal verdict awaits Builder's `claim(dstamp)` commit.