# BACKLOG — phase `dstamp`

## Build backlog (Builder-owned)

- [x] Read phase plan + plan.md §6.1/§7/§9 + Adversary prep notes + stamp-relevant harness code.
- [x] Establish abra's chaos-version mechanism from abra source @06a57de (= pinned binary).
- [x] Rule out abra-version drift (constant store path since nixos system-4, 2026-06-01).
- [x] Minimal reproductions of the git/abra chaos-version path (cp-a; go-git base; mirror-faithful):
  all stamp the CORRECT head 7ae7b0f7, NO drift in current host state.
- [x] Timeline: run 184 (06-05, solo) green @7ae7b0f; clustered 06-10/06-11 runs drift @ same ref.
- [x] Identify shared-stack collision vector (`app_domain` = hash(recipe|pr|ref); upgrade
  chaos_redeploy bypasses app-domain flock).
- [x] Isolated real runs (repro1–4) + direct UpdateStatus/PreviousSpec capture → root cause attributed.
- [x] Concurrency REFUTED (solo repro1/4 reproduce). Mechanism = swarm `failure_action: rollback`
  reverts the chaos-version label (direct evidence repro4: Spec=7ae7b0f7+U→PreviousSpec=eb96de94+U).
- [x] 06-05→06-10 change = rcust-phase heavier resident host load → start-first new task reliably OOMs → rollback every run (solo 06-05 run 184 didn't; my repro2 didn't either).
- [x] Blast-radius: only discourse affected (keycloak/n8n have the policy but upgrade PASS L4 across runs; drone/traefik infra). General harness guard covers all.
- [x] Restore discourse to its true level in real CI via the drone `!testme` path (M2): build #450 = LEVEL 5, all tiers PASS (install/upgrade/backup/restore/custom), clean teardown, no leak; PR#2 ✅ passed. fix1+fix2+450 = 3 consecutive green with the fix.
- [~] HC1 teeth: code unchanged (generic.py:174-175) + assert_upgrade_converged RED on rollback (repro1/4). Live negative test = Adversary's M2 verification.
- [x] Closed the DEFERRED.md dstamp re-entry with pointers (✅ RESOLVED).

## Adversary findings
<!-- Adversary-owned. Do not edit above this line in this section. -->

**Root cause independently confirmed @2026-06-11T17:3x (JOURNAL not read, anti-anchoring preserved):**

Docker Swarm `failure_action: rollback` + `order: start-first` in discourse's `compose.yml` app
service (BOTH `eb96de94` base AND `7ae7b0f` PR-head). On the upgrade chaos redeploy, `start-first`
runs OLD + NEW tasks co-resident (~2× memory); the heavy Rails/precompile app fails swarm's 5s
update monitor under host memory pressure → rollback fires → app service spec reverts to
PreviousSpec (`chaos-version=eb96de94+U`). Because `start-first` kept the OLD task serving,
`wait_healthy` passed; `deployed_identity` read the rolled-back spec; HC1 misreported it as
"stamp mismatch" (the real failure was "new task failed the update monitor").

`services_converged` blind spot: `"rollback_completed"` not in blocking states → returned True.

Evidence: `docker service inspect disc-ae10f0_..._app` confirmed
`UpdateConfig: {On failure: rollback, Order: start-first, Monitoring Period: 5s}`.
repro1 (isolated, no concurrency) ALSO showed drift → pure-concurrency hypothesis REFUTED
independently before reading Builder evidence.

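The inspection described above can be scripted. The following is a minimal sketch, not the harness's own code: it assumes the plain JSON output of `docker service inspect <service>` (a one-element array), and it assumes the stamp is a service label whose key ends in `.chaos-version`. The helper names `chaos_version` and `rollback_evidence` are hypothetical.

```python
import json
from typing import Optional


def chaos_version(spec: dict) -> Optional[str]:
    """Return the value of the first label whose key ends in '.chaos-version'."""
    # ASSUMPTION: the stamp is stored as a service-level label with this key suffix.
    for key, value in (spec.get("Labels") or {}).items():
        if key.endswith(".chaos-version"):
            return value
    return None


def rollback_evidence(inspect_json: str) -> dict:
    """Summarise update policy, update state and stamps from inspect JSON."""
    service = json.loads(inspect_json)[0]  # inspect prints a JSON array
    spec = service.get("Spec") or {}
    previous = service.get("PreviousSpec") or {}
    update_config = spec.get("UpdateConfig") or {}
    status = service.get("UpdateStatus") or {}
    return {
        "failure_action": update_config.get("FailureAction"),
        "order": update_config.get("Order"),
        # The Engine API reports the monitor period in nanoseconds.
        "monitor_seconds": (update_config.get("Monitor") or 0) / 1e9,
        "update_state": status.get("State"),
        "spec_stamp": chaos_version(spec),
        "previous_stamp": chaos_version(previous),
    }
```

Feeding it the output of `subprocess.check_output(["docker", "service", "inspect", name], text=True)` would put the policy, the update state and both stamps side by side, which is the comparison the repro captures relied on.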
abra exonerated: abra reads `git HEAD = 7ae7b0f` and stamps `7ae7b0f7+U` CORRECTLY. Three
bail-at-secrets repros + repro2 debug line confirm. The `+U` comes from `compose.ccci.yml` being
an untracked file in the per-run recipe dir (rcust-era overlay, absent from run 184's pre-rcust path).

Fix 0cc31a5 assessed CORRECT: overlay sets `order: stop-first` (eliminates OOM 2×-memory
trigger); `lifecycle.assert_upgrade_converged` closes the wait_healthy blind spot by catching
`"rollback_completed"|"rollback_paused"|"paused"` and failing HONESTLY. HC1 unchanged.
Minor race window in `assert_upgrade_converged` (first poll could see "none" before Docker
starts the roll) is covered: with stop-first, a post-race rollback also fails `wait_healthy`.
No blocker. Formal verdict awaits Builder's `claim(dstamp)` commit.

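To make the shape of that guard concrete, here is a hedged sketch of a convergence assertion that fails on the three states named above. It is not the code from 0cc31a5: the function name, the `get_state` callable (standing in for a read of the service's `UpdateStatus.State`), and the timeout defaults are all assumptions for illustration.

```python
import time
from typing import Callable, Optional

# States treated as an honest upgrade failure (the three named in the finding).
FAILED_STATES = {"rollback_completed", "rollback_paused", "paused"}
# States meaning swarm is still working on the update or the rollback.
IN_PROGRESS_STATES = {"updating", "rollback_started"}


class UpgradeNotConverged(AssertionError):
    """Raised when the rolling update rolled back, paused, or timed out."""


def assert_upgrade_converged_sketch(
    get_state: Callable[[], Optional[str]],  # hypothetical reader of UpdateStatus.State
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state in FAILED_STATES:
            raise UpgradeNotConverged(f"update did not converge: {state}")
        if state not in IN_PROGRESS_STATES:
            # "completed", or None when no update status is recorded yet.
            return state
        if time.monotonic() >= deadline:
            raise UpgradeNotConverged(f"timed out while in state: {state}")
        sleep(interval)
```

This sketch returns immediately when the first poll sees no update status, so it has the same race window noted above and would likewise rely on a later health check to catch a rollback that starts after the poll.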
**Blast-radius sweep @2026-06-11T17:4x:**

All 24 enrolled recipes swept for `failure_action: rollback` + `order: start-first` in `compose.yml`:

| Recipe     | failure_action | order       | ccci overlay | upgrade tests | recent upgrade | risk |
|------------|----------------|-------------|--------------|---------------|----------------|------|
| discourse  | rollback       | start-first | YES (fixed)  | yes           | FIXED          | fixed |
| drone      | rollback       | start-first | no           | NO tests      | n/a            | latent, no CI exposure |
| keycloak   | rollback       | start-first | no           | yes           | PASS L4        | latent, low (JVM, lighter than Rails) |
| n8n        | rollback       | start-first | no           | yes           | PASS L4        | latent, low (Node.js) |
| traefik    | rollback       | STOP-first  | no           | no            | n/a            | SAFE |
| all others | none or absent | -           | -            | -             | -              | not at risk |

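A sweep like the one tabulated above could be automated along these lines. This is only a sketch: it assumes each recipe's `compose.yml` has already been parsed into a dict by some YAML loader, and the function name `rollback_risk_services` is hypothetical. A service with no explicit `order` is treated as safe because Docker's default update order is `stop-first`.

```python
from typing import Any, Dict, List


def rollback_risk_services(compose: Dict[str, Any]) -> List[str]:
    """List services that combine failure_action: rollback with order: start-first."""
    risky = []
    for name, service in (compose.get("services") or {}).items():
        deploy = (service or {}).get("deploy") or {}
        update_config = deploy.get("update_config") or {}
        if (
            update_config.get("failure_action") == "rollback"
            and update_config.get("order") == "start-first"
        ):
            risky.append(name)
    return sorted(risky)


# Illustrative input only, not taken from any real recipe.
example = {
    "services": {
        "app": {"deploy": {"update_config": {"failure_action": "rollback", "order": "start-first"}}},
        "proxy": {"deploy": {"update_config": {"failure_action": "rollback", "order": "stop-first"}}},
        "db": {},
    }
}
print(rollback_risk_services(example))
```

Only `app` matches both conditions in the illustrative input; `proxy` corresponds to the traefik-style SAFE row and `db` to the "none or absent" row.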
`assert_upgrade_converged` (added in 0cc31a5) provides a general harness backstop: if any
recipe's rolling update rolls back or pauses, the upgrade is failed HONESTLY for all recipes,
not just discourse. So keycloak/n8n are already covered by the harness fix even without
overlay changes.

Recommended overlay addition for keycloak if/when OOM symptoms appear:
`deploy.update_config.order: stop-first` (same pattern as discourse). Not urgent: current
host load shows no rollback symptom for keycloak/n8n, and both are lighter apps than discourse.
drone has no upgrade tier in cc-ci; no action needed there.