# REVIEW-dstamp.md: Adversary verdicts for phase `dstamp`

Phase: investigate & solve the discourse abra-stamp drift (upgrade-HC1 stamps the
prev-base tag commit instead of the PR-head version, harness-neutral, since ~06-10).
SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.

Verdict log is append-only. `review(...)`-prefixed commits carry verdicts (load-bearing
watchdog signal). Findings filed under `## Adversary findings` in BACKLOG-dstamp.md.

---
## Prep notes (NOT a verdict; no gate claimed yet) @2026-06-11T15:5x

Recon done cold before any Builder claim, to make M1/M2 verification fast and independent.
Anti-anchoring: formed only from the plan (SSOT), the harness code, and direct host evidence.
No dstamp JOURNAL exists yet; none read.

**Stamp mechanism (from code):** HC1's "stamp" is the `coop-cloud.<stack>.chaos-version`
docker service label that abra writes on a `--chaos` deploy; its value is the deployed recipe
git commit (`runner/harness/lifecycle.py:468 deployed_identity`, `runner/harness/generic.py:146
assert_upgraded`). Upgrade flow (`generic.py:226 perform_upgrade`): deploy prev-published
base → `recipe_checkout_ref(recipe, head_ref)` (git checkout -f head) → `chaos_redeploy`
(`abra app deploy --chaos`). HC1 asserts that `chaos_commit` matches `head_ref` (after stripping
the `+U` untracked-overlay marker). PASS requires the chaos-version to equal the PR head.

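
The commit match described above can be sketched in a few lines. This is a minimal illustration, not the harness code: the function name `stamp_matches_head` is hypothetical, the prefix comparison follows the `assert_upgraded` logic quoted under M1 in this log, and the empty-stamp guard is an added assumption.

```python
def stamp_matches_head(chaos_version: str, head_ref: str) -> bool:
    """Return True when a chaos-version label points at the PR head commit."""
    # Drop the "+U" untracked-overlay marker (everything after the first "+").
    chaos_commit = chaos_version.split("+", 1)[0]
    # Assumption for this sketch: an empty stamp never counts as a match.
    # Without this guard, "".startswith-style prefix checks would pass trivially.
    if not chaos_commit or not head_ref:
        return False
    # The stamp is an abbreviated hash, so compare by prefix in either direction.
    return head_ref.startswith(chaos_commit) or chaos_commit.startswith(head_ref)
```

With the values recorded in this phase, `stamp_matches_head("7ae7b0f7+U", "7ae7b0f76efb")` is the passing case and `stamp_matches_head("eb96de94+U", "7ae7b0f76efb")` is the drift symptom.
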

**Cold observable facts (from `/var/lib/cc-ci-runs/m2p-discourse/abra/recipes/discourse`
snapshot + live `~/.abra/recipes/discourse` on cc-ci, 2026-06-11):**
- Recipe HEAD `7ae7b0f` = "chore: upgrade to 0.9.0+3.5.0"; `git describe --tags` =
  `0.7.0+3.3.1-9-g7ae7b0f` → HEAD is **9 commits past the newest annotated tag**
  `0.7.0+3.3.1` (commit `eb96de9`). No `0.8.x`/`0.9.x` tag exists.
- The drift symptom (per plan): chaos-version stamped `eb96de94+U` = the **prev-base tag
  commit** (= the upgrade base `0.7.0+3.3.1`), NOT the PR-head `7ae7b0f`.
- abra is **nix-pinned**: `abra version 0.13.0-beta-06a57de`, store path under
  `/run/current-system` → binary drift requires a flake.lock/nixos-generation bump between
  06-05 and 06-10 (verify against generations, don't assume).

**Open question I'll independently re-derive when M1 is claimed:** why the `--chaos`
redeploy after checkout-to-HEAD stamps the BASE commit (eb96de9), not HEAD (7ae7b0f).
Candidates to test cold: (a) re-checkout to head silently reverted (abra fetch/reset during
deploy); (b) abra chaos resolves the version from the app's recorded `.env` RECIPE/version
(= the base) rather than the working-tree HEAD; (c) the "env drift" since 06-10 =
recipe/mirror git state moved (unreleased commits pushed past last tag) or a tag re-pointed.

**Guardrail teeth I will enforce at M2:** HC1 must still FAIL on a genuinely wrong stamp
(synthesize a wrong-version deploy and show RED). Any "fix" that derives EXPECTED from
"what makes the test pass" rather than abra's documented behavior = automatic FAIL.

Status: idle, awaiting Builder to seed STATUS-dstamp.md and claim M1. Watchdog will ping
on the `claim(...)` commit.

---
## Independent probe findings @2026-06-11T17:3x (NOT a verdict; no M1 claim yet)

Anti-anchoring preserved: JOURNAL-dstamp NOT read. Root cause derived independently from
harness code, per-run artifacts (repro1/repro2 console logs), and direct docker service
inspect on cc-ci. Independently arrived at the same attribution as the Builder.

**Causal chain derived from code + direct evidence:**

1. `provide_ccci_overlay` (rcust-era addition) copies `compose.ccci.yml` into the per-run
   recipe dir as an UNTRACKED file. Absent in run 184 (2026-06-05, which used the old
   `install_steps.sh` path writing to canonical `~/.abra`), consistent with run 184 having
   no `+U` suffix and passing. The `+U` itself is stripped by HC1's `chaos_commit.split("+",1)[0]`
   and is NOT the cause of drift.

2. abra reads `git HEAD = 7ae7b0f` and computes `chaos-version = 7ae7b0f7+U` CORRECTLY.
   Confirmed via three bail-at-secrets manual repros + repro2 debug line
   `taking chaos version: 7ae7b0f7+U`. abra and the per-run git checkout are EXONERATED.

3. `chaos_redeploy` passes `-c` (no_converge_checks) → `docker stack deploy` returns
   immediately; Swarm rolling update runs asynchronously.

4. Discourse `compose.yml` (BOTH base `eb96de94` AND PR-head `7ae7b0f`) sets
   `deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }`
   on the `app` service. Confirmed by direct `docker service inspect disc-ae10f0_..._app`.

5. With `order: start-first`, OLD + NEW task co-reside (~2× memory). Discourse's
   Rails/Sidekiq precompile is memory-heavy; under the heavier host load since ~06-10
   (warm keycloak and other rcust-phase stacks), the NEW task intermittently fails swarm's
   5s update monitor → `failure_action: rollback` fires → Swarm REVERTS the app service
   spec to PreviousSpec (base deploy, `chaos-version=eb96de94+U`).

6. `services_converged` blind spot: after rollback, `UpdateStatus.State = "rollback_completed"`,
   which is NOT in the blocking set `("updating", "rollback_started")` → returns True as if
   converged. Under start-first the OLD task kept serving → `wait_healthy` also passes on the
   rolled-back spec.

7. `deployed_identity` reads `.Spec.Labels` → rolled-back spec → `chaos-version=eb96de94+U`.
   HC1 compares head_ref `7ae7b0f76efb` against `eb96de94`, finds a mismatch, and FAILs with
   the misleading "re-checkout failed" message.

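
The blind spot in step 6 comes down to a membership test. The sketch below is illustrative only: the function names are hypothetical, the blocking set is the one quoted in step 6, and the rolled-back set is taken from the states the fix analysis in this log treats as failures.

```python
# States the pre-fix convergence check treated as "still in progress" (step 6).
BLOCKING_STATES = ("updating", "rollback_started")

# Terminal states meaning Swarm abandoned the new spec (assumed set for this sketch).
ROLLED_BACK_STATES = ("rollback_completed", "rollback_paused", "paused")


def converged_pre_fix(state: str) -> bool:
    """Pre-fix behaviour: anything outside the blocking set counts as converged."""
    return state not in BLOCKING_STATES


def converged_strict(state: str) -> bool:
    """Stricter variant: a finished rollback is never reported as convergence."""
    return state not in BLOCKING_STATES and state not in ROLLED_BACK_STATES
```

For `"rollback_completed"` the first function reports convergence and the second does not, which is the gap that let the rolled-back label reach HC1.
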

**Key disproving evidence (independent route):** repro1 was isolated (no concurrent discourse
run, domain `disc-ae10f0` used for the first time) and STILL showed the drift. This refuted
the pure-concurrency hypothesis BEFORE reading the Builder's evidence or JOURNAL.

**Intermittency explained (run 184 ✓ solo 06-05; clustered/repro1/repro4 ✗; repro2 ✓):**
Whether the new start-first task survives the 5s monitor depends on momentary memory pressure.
Run 184: solo + lighter host load + pre-rcust overlay path → new task survived. repro2: warm
volumes/containers from repro1 → faster Rails precompile → task survived. The "since ~06-10
on every run" pattern = heavier baseline load from warm rcust-phase stacks after run 184.

**Fix analysis (Builder commit 0cc31a5, read before JOURNAL):**

*Part 1: overlay `order: stop-first`*: Old task stops before new starts → new boots with full
host memory → no OOM under the 5s monitor → no spurious rollback. `failure_action: rollback`
intentionally preserved so a genuinely broken head still rolls back and is caught.
ASSESSMENT: **CORRECT AND SUFFICIENT** for eliminating the spurious-rollback trigger.

*Part 2: `lifecycle.assert_upgrade_converged`*: Called in `perform_upgrade` immediately after
`chaos_redeploy`, before `wait_healthy`. Polls `docker service inspect
--format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}'` until terminal.
Returns on `""|"none"|"completed"`; raises on `"rollback_completed"|"rollback_paused"|"paused"`;
polls on `"updating"|"rollback_started"`; times out at `meta.DEPLOY_TIMEOUT`.
ASSESSMENT: **CORRECT**. Closes the wait_healthy-masking blind spot. Makes a swarm rollback
an HONEST upgrade failure ("head did not stay healthy") rather than a misreported stamp mismatch.
HC1 commit-match logic is unchanged; this only makes the rollback visible before HC1 runs.

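
The state handling described for Part 2 can be sketched as a small polling loop. This is not the Builder's implementation: `wait_for_terminal` and the injected `read_state` callable are hypothetical names, and treating unknown states as "keep polling" is an assumption. In the real harness the state would come from the `docker service inspect` format string quoted above.

```python
import time
from typing import Callable

OK_STATES = {"", "none", "completed"}
FAILED_STATES = {"rollback_completed", "rollback_paused", "paused"}


def wait_for_terminal(read_state: Callable[[], str], timeout: float,
                      interval: float = 1.0) -> str:
    """Poll read_state() until the update reaches a terminal state.

    Returns the state on success, raises RuntimeError on a rollback or pause,
    and raises TimeoutError if nothing terminal shows up within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = read_state()
        if state in OK_STATES:
            return state
        if state in FAILED_STATES:
            raise RuntimeError(f"head did not stay healthy (UpdateStatus={state})")
        # "updating", "rollback_started", or anything unrecognised: keep polling.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"update still '{state}' after {timeout}s")
        time.sleep(interval)
```

Injecting the reader keeps the loop testable without a Swarm: a test can feed it a scripted sequence of states.
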

**One concern flagged (not a blocker; defense-in-depth covers it):**
`assert_upgrade_converged` has a theoretical race window: on the very first poll, Docker may
not yet have transitioned from a prior `"completed"` state to `"updating"` (tiny gap between
`docker stack deploy` returning and the Swarm manager scheduling the roll). If the race fires,
the function returns OK on the stale `"completed"` (or `"none"`) state, then the rollback
happens silently afterward.
Mitigation: with `stop-first` (fix part 1), a post-assert-converged rollback leaves NO serving
task during the rollback → `wait_healthy` also FAILS → the test result is still FAIL, just
with a less specific error ("wait_healthy timeout" rather than "swarm rolled back"). HC1 is
NOT weakened even if the race fires. No action required unless a recipe uses `start-first`,
where a post-race rollback could masquerade as a clean upgrade.

**UPDATE: race concern CLOSED by Builder (commit e9c26c7 `harden(dstamp)`):**
Builder addressed the race with a 2-phase protocol:
- **Pre-redeploy**: `update_status_started(domain)` snapshots `UpdateStatus.StartedAt`.
- **Phase 1**: polls until `StartedAt` advances past the snapshot (new update scheduled) OR
  state is `"updating"/"rollback_started"`. 30s grace: if no new update appears → no-op
  redeploy, nothing to converge.
- **Phase 2**: now that the NEW update is confirmed in flight, waits for terminal state
  (same logic as before, but with confidence it's the right update).

Assessment: **CORRECT AND COMPLETE**. Phase 1 deterministically distinguishes the new update
from stale base-deploy terminal state. No new failure modes introduced. The grace period (30s)
is generous relative to Docker's near-immediate scheduling. Race concern fully closed.

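
Phase 1 of the protocol above can be sketched as follows. This is a hedged illustration, not the code in commit e9c26c7: `new_update_seen` and the injected `read_status` callable (returning a `(state, started_at)` pair) are hypothetical, and "advances past the snapshot" is approximated here as "differs from the snapshot".

```python
import time
from typing import Callable, Tuple

IN_FLIGHT_STATES = {"updating", "rollback_started"}


def new_update_seen(read_status: Callable[[], Tuple[str, str]], prev_started: str,
                    grace: float = 30.0, interval: float = 1.0) -> bool:
    """Phase 1: wait until a NEW update is visible on the service.

    Returns True once StartedAt differs from the pre-redeploy snapshot or the
    state is in flight. Returns False if the grace period expires first, which
    is read as a no-op redeploy with nothing to converge.
    """
    deadline = time.monotonic() + grace
    while True:
        state, started_at = read_status()
        # Assumption: any change in StartedAt means a newer update was scheduled.
        if started_at != prev_started or state in IN_FLIGHT_STATES:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Only when this returns True does it make sense to run the Phase 2 terminal-state wait, because a terminal state seen at that point belongs to the new update rather than to the base deploy.
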

**Status:** no `claim(dstamp)` commit yet. Awaiting M1 claim to issue formal verdict.

---
## M1: PASS @2026-06-11T17:36Z

Cold verification from `/srv/cc-ci/cc-ci-adv`. JOURNAL-dstamp not read before verdict (anti-anchoring).

**Check 1 (recipe policy at 7ae7b0f76efb):** PASS
`cd ~/.abra/recipes/discourse && git checkout -q 7ae7b0f76efb && grep -nA3 update_config compose.yml`
→ `failure_action: rollback`, `order: start-first` confirmed present at lines 33-35. Direct evidence the
discourse app service is configured to rollback+start-first at the PR-head.

**Check 2 (abra CONSTANT, no binary change 06-05→06-10):** PASS
`for g in $(ls -d /nix/var/nix/profiles/system-*-link); do ...readlink -f $g/sw/bin/abra; done`
→ Gens 2-11 all `/nix/store/bf6azhpi8bi5491n8i4bhjm1z7fva7pb-abra-0.13.0-beta/bin/abra`.
Gen1 differs (pre-bootstrap), gens 4-11 (2026-06-01 onward) identical. abra version change as
cause of drift definitively ruled out by direct evidence.

**Check 3 (direct rollback evidence, repro4):** PASS
`grep -E 'DSTAMP|UpdateStatus|PreviousSpec|chaos-version' /var/lib/cc-ci-runs/dstamp-repro4.console.log`
→ Line immediately after chaos_redeploy:
- `UpdateStatus.State="updating"` (in flight)
- `Spec.Labels chaos-version="7ae7b0f7+U"` (abra correctly applied HEAD)
- `PreviousSpec.Labels chaos-version="eb96de94+U"` (the base, what swarm reverts to)

→ HC1 line: `chaos-version=eb96de94+U` (AFTER rollback completed) → mismatch → FAIL

Causal chain proven in a single artifact: abra stamped correctly, swarm rolled back, label reverted.
Mechanism confirmed: start-first co-residency → OOM under monitor → failure_action:rollback → PreviousSpec.

**Check 4 (fix present):** PASS
- `runner/harness/lifecycle.py`: `update_status_started` (line 511) + `assert_upgrade_converged` (line 526).
  Phase-1 polls until StartedAt advances past prev_started (or in-flight state seen) → closes race.
  Phase-2 terminal: `completed`=OK; `rollback_completed`/`rollback_paused`/`paused`=FAIL with honest message.
- `runner/harness/generic.py:268-278`: `prev_started = update_status_started(domain)` called BEFORE
  `chaos_redeploy`, then `assert_upgrade_converged(domain, timeout=DEPLOY_TIMEOUT, prev_started=prev_started)`
  called immediately after, BEFORE `wait_healthy`. Correct call order.
- `tests/discourse/compose.ccci.yml:54-55`: `deploy.update_config.order: stop-first` with full WHY
  comment citing direct evidence (dstamp-repro1/4) and stating `failure_action: rollback` is LEFT INTACT.

Both commits 0cc31a5 + e9c26c7 verified present (git log --oneline).

**Check 5 (fix works, dstamp-fix1 and dstamp-fix2):** PASS
- `dstamp-fix1`: `upgrade-converged: disc-ae10f0_ci_commoninternet_net_app swarm UpdateStatus=completed`,
  then `upgrade→PR-head: head_ref=7ae7b0f7 chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`,
  then `test_upgrade_reconverges PASSED`. Level=2 (install+upgrade only, backup/functional not in STAGES).
- `dstamp-fix2`: same params, same domain, same result; second reliability run confirms.

Both runs: chaos-version=7ae7b0f7+U (head), NOT eb96de94+U (base). Fix is deterministic.

**Check 6 (blast-radius):** PASS
- n8n: runs 162 (level=4, upgrade=pass) and 47 (level=4, upgrade=pass). Run 162 dated post-06-10
  (when discourse was failing) → n8n not affected despite same rollback+start-first policy.
- keycloak: runs 155 (level=4, upgrade=pass) and 187 (level=4, upgrade=pass). Same conclusion.
- `assert_upgrade_converged` now provides a general harness backstop for all rollback-policy recipes.
  No overlay change needed for keycloak/n8n (lighter apps, no OOM symptom in evidence).
- drone/traefik: infra, no recipe-CI upgrade tier. No action needed.

**HC1 teeth preserved (code inspection):** `generic.py:174-175`: `assert_upgraded` logic is UNCHANGED:
`chaos_commit = chaos.split("+",1)[0]`; assertion `head_ref.startswith(chaos_commit) or
chaos_commit.startswith(head_ref)`. `assert_upgrade_converged` runs BEFORE `assert_upgraded`; if a
rollback occurs it raises FIRST with the honest "head did not stay healthy" message; if no rollback occurs,
HC1 commit-match assertion still runs unmodified. A deliberately wrong stamp (e.g. deploying eb96de94
as the chaos version) would still fail HC1 exactly as before. M2 will demonstrate this with a live negative test.

**One nuance (not a blocker):** The "06-05→06-10 change" being specifically "heavier resident load from
rcust-phase stacks" is circumstantially supported by the timeline, but repro1 (isolated, no concurrent apps)
also showed drift: the mechanism fires under general memory pressure during discourse's precompile, not
only when other apps are warm. The exact delta between run 184 (06-05, passed) and subsequent runs is
intermittency of memory pressure, proven by repro2 (warm volumes → faster precompile → task survived) vs
repro4 (fresh boot → slower precompile → task failed). The ROOT CAUSE mechanism is proven by direct
evidence; the specific "what changed between 06-05 and 06-10" reduces to heavier/more-variable memory
pressure; the mechanism was always latent. This doesn't weaken M1: the fix eliminates the exposure.

**Verdict: M1 PASS.** Root cause attributed by direct evidence; minimal reproducible demonstration
confirmed; fix (stop-first overlay + assert_upgrade_converged) implemented and working; HC1 unweakened;
blast-radius sweep complete. Builder cleared to proceed to M2.

---
## M2: PASS @2026-06-11T17:58Z

Cold verification from `/srv/cc-ci/cc-ci-adv`. JOURNAL-dstamp not read before verdict (anti-anchoring).

**Check 1 (build 450 results: level, tiers, flags):** PASS
`cat /var/lib/cc-ci-runs/450/results.json`:
- `"level": 5` ✓
- `"recipe": "discourse"`, `"ref": "7ae7b0f76efb"`, `"pr": "2"` ✓
- All tiers: `"install": "pass"`, `"upgrade": "pass"`, `"backup": "pass"`, `"restore": "pass"`, `"custom": "pass"` ✓
- All rungs: `"install": "pass"`, `"upgrade": "pass"`, `"backup_restore": "pass"`, `"functional": "pass"`, `"lint": "pass"` ✓
- `"clean_teardown": true`, `"no_secret_leak": true` ✓
- Timestamp: `"finished": 1781199631.4...` (2026-06-11 ~17:40 UTC) ✓
- `screenshot.png` present (discourse functional screenshot)

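
The `finished` value in Check 1 is a Unix epoch timestamp, and the UTC reading can be re-derived with the standard library. A minimal sketch: the helper name `to_utc_iso` is hypothetical, and only the integer part of the timestamp is used because the fractional digits are elided in the log.

```python
from datetime import datetime, timezone


def to_utc_iso(epoch_seconds: int) -> str:
    """Render a Unix epoch timestamp as an ISO-8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


print(to_utc_iso(1781199631))  # 2026-06-11T17:40:31+00:00
```

That lands between the M1 verdict (17:36Z) and the M2 verdict (17:58Z), consistent with build 450 being the run triggered after M1 passed.
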

**Check 2 (JUnit XML: test_upgrade_reconverges PASS, HC1 satisfied):** PASS
`grep -c '<failure\|<error' upgrade__generic__test_upgrade.xml` → 0
Full XML: `<testcase classname="tests._generic.test_upgrade" name="test_upgrade_reconverges" time="0.260"/>`
(no `<failure>` child). `test_upgrade_reconverges` directly calls `generic.assert_upgraded(live_app, meta)`.
`assert_upgraded` at `generic.py:174-175` does the HC1 commit-match: a prefix comparison of
`chaos_commit` against `head_ref`.
Test PASSED → `chaos_commit = 7ae7b0f7` matched `head_ref = 7ae7b0f7` ✓

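
The `grep -c` in Check 2 counts matching lines; a structural count of `<failure>` and `<error>` elements gives the same answer without depending on how the XML is wrapped. The sketch below is an illustration only: `count_failures` is a hypothetical helper, and the `<testsuite>` wrapper around the quoted `<testcase>` is an assumption about the file layout.

```python
import xml.etree.ElementTree as ET


def count_failures(junit_xml: str) -> int:
    """Count <failure> and <error> elements anywhere in a JUnit XML document."""
    root = ET.fromstring(junit_xml)
    return sum(1 for element in root.iter() if element.tag in ("failure", "error"))


# The <testcase> line is quoted from the check; the <testsuite> wrapper is assumed.
SAMPLE = (
    "<testsuite>"
    '<testcase classname="tests._generic.test_upgrade" '
    'name="test_upgrade_reconverges" time="0.260"/>'
    "</testsuite>"
)

print(count_failures(SAMPLE))  # 0
```

A self-closing `<testcase/>` has no children, so a count of zero here means the testcase carried neither a failure nor an error record.
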

**Check 3 (PR comment 14347, !testme path):** PASS
Comment 14346 body = `!testme` (the trigger).
Comment 14347 body (bot response):
`<!-- cc-ci:testme -->\n🌻 **cc-ci** — \`discourse\` @ \`7ae7b0f7\` ✅ **passed**\n[...links to run 450 summary.png + badge + drone build 450...]`
Confirmed via Gitea API. Run directory `/var/lib/cc-ci-runs/450/` exists with full contents.
!testme → bridge ack → drone build 450 → run 450 results → PR comment ✅ passed. Path verified.

**Check 4 (DEFERRED entry closed):** PASS
`machine-docs/DEFERRED.md` lines 346-366: ✅ RESOLVED @2026-06-11 (phase dstamp, Builder) with:
- Root cause narrative (rollback mechanism)
- Direct evidence pointer (dstamp-repro4.console.log)
- Fix commits (0cc31a5 + e9c26c7)
- Real CI proof (drone build #450, LEVEL 5)
- Blast-radius note (only discourse; harness guard covers all rollback-policy recipes)
- Cross-references (STATUS/JOURNAL/REVIEW-dstamp)

**Check 5 (HC1 teeth: wrong stamp still FAILs):** PASS
*Negative control (pre-fix, existing run):* `m2p-discourse/results.json` shows HC1 caught wrong stamp:
`AssertionError: upgrade deployed chaos commit 'eb96de94+U', not the intended PR-head '7ae7b0f76efb'
— the re-checkout to the code under test failed, so the upgrade is not exercising the PR's changes (HC1)`
This is HC1 raising on `eb96de94 ≠ 7ae7b0f7`. HC1 commit-match assertion WORKS.

*Code unchanged (from M1):* `generic.py:174-175` commit-match assertion unmodified. The fix adds
`assert_upgrade_converged` BEFORE `assert_upgraded`; it catches rollback EARLIER with an honest message
but does NOT bypass HC1. If a non-rollback wrong stamp were deployed (e.g. abra bug stamping wrong commit),
`assert_upgrade_converged` would see `completed` and pass, then HC1 would FAIL on the commit mismatch.

*Post-fix rollback path:* `assert_upgrade_converged` raises `RuntimeError` on `rollback_completed` →
upgrade FAILS with honest "head did not stay healthy" → HC1 doesn't even run but test is RED.
Both paths (rollback → caught by assert_upgrade_converged; wrong stamp without rollback → caught by HC1)
still FAIL. The pre-fix negative controls (m2p-discourse, repro1, repro4) demonstrate the wrong-stamp
path is always caught; the fix only changes HOW it's reported and at which point.

**Blast-radius (confirmed at M1, still valid):** Only discourse affected. keycloak/n8n PASS L4
in 06-10/06-11 era. General `assert_upgrade_converged` guard now covers all rollback-policy recipes.

**Phase DoD summary:**
- ✅ Drift mechanism attributed with reproducible evidence (repro4 direct evidence)
- ✅ Fixed at the true root (stop-first overlay + assert_upgrade_converged)
- ✅ Discourse back at real level in real CI via drone !testme (build 450, LEVEL 5)
- ✅ No other recipe silently affected (blast-radius sweep, keycloak/n8n PASS)
- ✅ HC1 unweakened and adversarially re-proven (m2p-discourse negative control + code inspection)
- ✅ DEFERRED closed with pointers

**Verdict: M2 PASS. All phase dstamp DoD items satisfied. Builder cleared for ## DONE.**