cc-ci/machine-docs/STATUS-dstamp.md
autonomic-bot 85a781368a
machine-docs: move all per-phase coordination files out of repo root
STATUS/BACKLOG/REVIEW/JOURNAL for bsky/conc/dstamp/kuma/lvl5/mailu/rcust/shot
(32 files) were at the repo root; move them into machine-docs/ to match the
mandated file-location rule (DECISIONS/DEFERRED/INBOX + older phases already
live there). AGENTS.md gains an explicit File-location rule. No content change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 20:57:03 +00:00


# STATUS — phase `dstamp` (discourse abra-stamp drift)
Builder. SSOT: `cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.
## DONE
M1 PASS (REVIEW-dstamp `fb411b2` @17:36Z) + M2 PASS (`71358da` @17:58Z), both fresh, no VETO.
All Definition-of-Done items Adversary-verified.
**Operator summary.** The discourse upgrade-tier "abra stamp drift" (upgrade-HC1 stamping the
prev-base tag commit `eb96de94+U` instead of the PR head `7ae7b0f7+U`, since ~06-10) was **NOT an
abra or harness git bug** — abra stamps the head correctly. **Root cause:** discourse's
`compose.yml` app service uses `deploy.update_config: { failure_action: rollback, order:
start-first, monitor: 5s }`. On the upgrade chaos redeploy, start-first co-resides the OLD+NEW
precompile/Rails-heavy task (~2× memory); under host memory pressure the NEW task fails swarm's 5s
update monitor → swarm **rolls back** to the base spec, reverting the `chaos-version` label
(head→base). start-first kept the old task serving, so `wait_healthy` passed and HC1 read the
reverted base commit — misreported as "re-checkout failed". Intermittent (memory-pressure
dependent): solo run 184 on 06-05 passed; the heavier 06-10/06-11 runs rolled back every time.
**Direct evidence:** `dstamp-repro4` captured `.Spec chaos-version=7ae7b0f7+U` (head applied) →
`.PreviousSpec=eb96de94+U` (base) with `UpdateStatus=updating`, then the post-rollback read = base.
**Fix (commits `0cc31a5` + `e9c26c7`, HC1 unweakened):** (1) `tests/discourse/compose.ccci.yml`
app `update_config.order: stop-first` — the new task boots with full host memory, no OOM, no
spurious rollback (`failure_action: rollback` left intact for genuine failures); (2) a general
harness guard `lifecycle.assert_upgrade_converged` (2-phase StartedAt protocol) that detects a
swarm rollback/pause after the upgrade redeploy and fails the upgrade HONESTLY — the HC1
commit-match assertion is unchanged.
**Proven in real CI:** drone `!testme` build **#450** (discourse @7ae7b0f) = **LEVEL 5** (was L1
under the drift), all tiers green, clean teardown, no secret leak; PR recipe-maintainers/discourse#2
shows ✅ passed. **Blast-radius:** only discourse was affected (keycloak/n8n share the policy but
upgrade-PASS L4; drone/traefik are infra) — the new harness guard now protects all rollback-policy
recipes. DEFERRED entry closed with pointers. **No operator action required.**
---
## Gate: M1 — PASS (REVIEW-dstamp fb411b2 @2026-06-11T17:36Z). Now on M2.
## Gate: M2 — CLAIMED, awaiting Adversary
**WHAT (M2 = Proven in real CI):** discourse full lifecycle GREEN at its true level via the drone
`!testme` path, upgrade-HC1 stamping the CORRECT head value; no other affected recipe; HC1
unweakened (a wrong stamp still FAILs); DEFERRED closed.
- **Real-CI proof — drone `!testme` build #450:** discourse @ `7ae7b0f76efb` (PR#2), STAGES full
(install,upgrade,backup,restore,custom), drone workspace at cc-ci main `2da1f01` (fix present) →
**LEVEL 5** (max), ALL tiers PASS, `clean_teardown=true`, `no_secret_leak=true`. Upgrade tier
`test_upgrade_reconverges` PASSED (HC1's `assert_upgraded` only passes when the deployed
chaos-version commit == head_ref `7ae7b0f`, after `assert_upgrade_converged` confirmed
`UpdateStatus=completed`). Was L1 (drift) before the fix → L5 now.
- **Triggered via the !testme path:** comment `14346` (`!testme`) on recipe-maintainers/discourse#2
→ bridge ack `14347`, updated to "🌻 cc-ci — discourse @ 7ae7b0f7 ✅ **passed**" with the L5
result card/badge linking drone build 450.
**HOW to verify (Adversary, cold):**
1. `grep -oE '"level": [0-9]+|"(install|upgrade|backup|restore|custom)": "[a-z]+"|"clean_teardown":
(true|false)|"no_secret_leak": (true|false)' /var/lib/cc-ci-runs/450/results.json` → level 5,
all `pass`, both flags `true`.
2. `/var/lib/cc-ci-runs/450/junit/upgrade__generic__test_upgrade.xml` → `test_upgrade_reconverges`
testcase with NO `<failure>` child (passed).
3. PR comment 14347 on recipe-maintainers/discourse#2 = ✅ passed, run 450.
4. *Fresh independent re-trigger (recommended):* post `!testme` on discourse#2 → new drone build on
cc-ci main → expect L5 again (reliability: manual fix1+fix2 + build 450 = 3 consecutive green
with the fix vs intermittent unpatched failures).
5. **HC1 teeth (negative test — Adversary leads):** synthesize a wrong stamp and show RED. Two live
teeth: (a) the unchanged commit-match `generic.py:174-175` — a deployed chaos commit ≠ head_ref
still FAILs (e.g. force the recheckout to the base, or deploy base-as-head); (b) the new
`assert_upgrade_converged` raises on a swarm `rollback_completed`/`paused` (the ORIGINAL drift
path — repro1/repro4 are exactly this RED, now with an honest message). Neither relaxes HC1.
6. DEFERRED closed: `machine-docs/DEFERRED.md` dstamp entry → ✅ RESOLVED with pointers.
**EXPECTED:** build 450 level 5, all tiers pass, both flags true; PR#2 ✅ passed; DEFERRED resolved.
**WHERE:** `/var/lib/cc-ci-runs/450/`; commits `0cc31a5`,`e9c26c7`; PR#2 comments 14346/14347;
`machine-docs/DEFERRED.md`. **No other recipe affected** (blast-radius: keycloak/n8n upgrade-PASS L4
across runs incl. rcust era; drone/traefik infra). Fresh Adversary M2 PASS → `## DONE`.
---
## (M1 — verified PASS; detail retained below)
**WHAT (M1 = Attribution):** root cause attributed by direct evidence; minimal reproducible
demonstration; 06-05→06-10 change identified; fix implemented (recipe overlay + harness, HC1
unweakened); blast-radius sweep complete.
Root cause: discourse `compose.yml` app service sets `deploy.update_config: { failure_action:
rollback, order: start-first, monitor: 5s }`. On the upgrade chaos redeploy, start-first co-resides
OLD+NEW (~2× memory) for the precompile/Rails-heavy app; under host memory pressure the NEW task
fails swarm's 5s update monitor → `failure_action: rollback` reverts the app service to its
PreviousSpec — INCLUDING the `coop-cloud.<stack>.chaos-version` label (head→base). Under start-first
the OLD task keeps serving, so `wait_healthy` passes; `deployed_identity` then reads the rolled-back
`.Spec` (base commit `eb96de94+U`) and HC1 misreports it as "re-checkout failed". abra+harness git
path EXONERATED (abra stamps head `7ae7b0f7+U` correctly; per-run HEAD=7ae7b0f at deploy).
**HOW to verify (Adversary, cold):**
1. *Recipe policy:* `cd ~/.abra/recipes/discourse && git checkout -q 7ae7b0f76efb && grep -nA3
update_config compose.yml` → `failure_action: rollback`, `order: start-first`. EXPECTED present.
2. *abra exonerated (minimal repro):* scratch ABRA_DIR, base→head checkout, `abra app deploy <d> -C
-o -n --debug` bails at `secret not generated` AFTER logging `app/deploy.go:372 version: taking
chaos version: 7ae7b0f7+U` (HEAD-correct). Procedure: JOURNAL-dstamp "mirror-faithful repro".
3. *Direct rollback evidence:* console `/var/lib/cc-ci-runs/dstamp-repro4.console.log` line
`[DSTAMP] post-redeploy svc inspect …` shows immediately post-redeploy `UpdateStatus.State=
"updating"`, `.Spec…chaos-version=7ae7b0f7+U` (head applied), `.PreviousSpec…chaos-version=
eb96de94+U` (base); the later HC1 read = eb96de94+U after the rollback completes.
4. *Fix present:* `runner/harness/lifecycle.py::assert_upgrade_converged` (+ `update_status_started`)
and its call in `runner/harness/generic.py::perform_upgrade`; `tests/discourse/compose.ccci.yml`
app `deploy.update_config.order: stop-first`. Commits `0cc31a5` + `e9c26c7`.
5. *Fix works:* run `dstamp-fix1` (fresh checkout, STAGES=install,upgrade) → upgrade PASS,
console `upgrade-converged: …UpdateStatus=completed` + `chaos-version=7ae7b0f7+U version=
0.7.0+3.3.1→0.9.0+3.5.0`. (Re-runnable: `RECIPE=discourse PR=2
REF=7ae7b0f76efb2988c1e54956348dc9eeb7812e0b SRC=recipe-maintainers/discourse
STAGES=install,upgrade CCCI_RUN_ID=<id> cc-ci-run runner/run_recipe_ci.py` from a checkout at
`e9c26c7`.)
6. *Blast-radius:* recipes with rollback+start-first = discourse, drone, keycloak, n8n, traefik.
keycloak/n8n upgrade PASS L4 across runs (155/186/187/m2r; 47/54/61/162/197/m2r) ⇒ not affected;
drone/traefik infra (no recipe-CI upgrade tier). Only discourse affected; the general
`assert_upgrade_converged` guard now protects all rollback-policy recipes.
**EXPECTED:** all of 1-6 hold. **WHERE:** commits 0cc31a5, e9c26c7; runs
`/var/lib/cc-ci-runs/dstamp-{repro1,repro2,repro4,fix1}`; recipe `~/.abra/recipes/discourse`.
HC1 teeth preserved: the commit-match assertion is unchanged; `assert_upgrade_converged` only makes
a swarm rollback an HONEST upgrade failure before HC1 runs (a genuinely undeployable head still
fails). M2 will demonstrate a wrong stamp still FAILs + full-lifecycle green via the `!testme` path.
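
The blast-radius sweep in step 6 can be approximated with a text scan. This is a heuristic sketch, not the procedure that was actually used: it assumes each recipe keeps a `compose.yml` directly under the recipes directory and looks for both policy keys within a few lines of each `update_config` key rather than parsing YAML. The function names and the window size are hypothetical.

```python
import re
from pathlib import Path

ROLLBACK = re.compile(r"failure_action:\s*rollback\b")
START_FIRST = re.compile(r"order:\s*start-first\b")


def update_config_blocks(text, window=6):
    """Yield each `update_config` line together with the `window` lines after it.

    Heuristic: a long block, or one followed closely by another service, can
    produce false negatives or false positives. Confirm hits by hand.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if "update_config" in line:
            yield "\n".join(lines[index:index + 1 + window])


def has_rollback_start_first(text):
    """True if any update_config block sets rollback and start-first together."""
    return any(
        ROLLBACK.search(block) and START_FIRST.search(block)
        for block in update_config_blocks(text)
    )


def sweep(recipes_dir):
    """Return the sorted names of recipes whose compose.yml matches."""
    hits = []
    for compose in sorted(Path(recipes_dir).glob("*/compose.yml")):
        if has_rollback_start_first(compose.read_text(encoding="utf-8")):
            hits.append(compose.parent.name)
    return hits
```

Calling `sweep(Path.home() / ".abra" / "recipes")` on cc-ci would be expected to list the five recipes named in step 6, provided each is checked out at the relevant ref.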
---
## Root cause detail (evidence)
## ROOT CAUSE (attributed by direct evidence, abra+harness EXONERATED)
The upgrade chaos redeploy applies the **correct** head spec, then swarm **rolls it back** to the
base spec, reverting the `chaos-version` label — masked by the recipe's `start-first` strategy +
the harness's `wait_healthy` (the OLD task keeps serving, so health passes).
Recipe policy (`~/.abra/recipes/discourse/compose.yml`, app service): `deploy.update_config:
{ failure_action: rollback, order: start-first }`, `healthcheck.start_period: 20m`. The heavy
discourse app, started **start-first** (old+new co-resident ≈ 2× memory), intermittently fails
swarm's update monitor on the NEW task → swarm executes `failure_action: rollback` → app service
reverts to PreviousSpec (the base, `chaos-version=eb96de94+U`).
**Direct evidence (run `dstamp-repro4`, console `/var/lib/cc-ci-runs/dstamp-repro4.console.log`,
solo/isolated):** immediately after `chaos_redeploy`, `docker service inspect <stack>_app`:
- `UpdateStatus.State = "updating"`,
- `.Spec.Labels coop-cloud.<stack>.chaos-version = 7ae7b0f7+U` (HEAD applied — abra stamped head
correctly), `.version = 0.9.0+3.5.0`,
- `.PreviousSpec.Labels …chaos-version = eb96de94+U` (the base), `.version = 0.7.0+3.3.1`.
Then `wait_healthy` passes (old task serves under start-first); the new task fails the monitor →
rollback → `.Spec` reverts to `eb96de94+U`; the later HC1 read sees `eb96de94+U` → FAIL with the
misleading "re-checkout failed" message. (`dstamp-repro2`, lighter timing, had NO rollback →
upgrade PASS @ `7ae7b0f7+U`.)
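
The three fields read in the evidence above can be pulled out of the `docker service inspect` JSON in one step. A minimal sketch, assuming the usual one-element JSON array that the command prints and the `coop-cloud.<stack>.chaos-version` label key quoted above; the helper names are hypothetical and are not the harness's `deployed_identity`.

```python
import json


def chaos_label(spec):
    """Return the value of the `coop-cloud.<stack>.chaos-version` label, or None."""
    labels = (spec or {}).get("Labels") or {}
    for key, value in labels.items():
        if key.startswith("coop-cloud.") and key.endswith(".chaos-version"):
            return value
    return None


def summarize(inspect_output):
    """Condense `docker service inspect <service>` output to the fields of interest."""
    service = json.loads(inspect_output)[0]
    return {
        "state": (service.get("UpdateStatus") or {}).get("State"),
        "spec": chaos_label(service.get("Spec")),
        "previous": chaos_label(service.get("PreviousSpec")),
    }


def rolled_back(summary, head_stamp):
    """True if the update state is a rollback, or .Spec no longer carries the head stamp."""
    state = summary["state"] or ""
    return state.startswith("rollback") or summary["spec"] != head_stamp
```

Applied to the capture described above, `summarize` would report `updating`, `7ae7b0f7+U` and `eb96de94+U`; the later post-rollback read is the case where `rolled_back(summary, "7ae7b0f7+U")` becomes true.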
Intermittency (184✓ solo 06-05; m2b/m2p/ab✗ clustered/heavier-load 06-10/11; repro1✗ repro2✓
repro4✗) = whether the new start-first task survives swarm's monitor under the host's momentary
memory pressure. The "since ~06-10 on every run" = the rcust phase ran under heavier resident load
(warm keycloak etc.) so the new task reliably failed → rollback every time. abra version-resolution
is CORRECT (proven: repro2 debug line `taking chaos version: 7ae7b0f7+U` + 3 bail-at-secrets repros);
the per-run git checkout is CORRECT (HEAD=7ae7b0f at deploy, reflog-proven). NOT abra, NOT the
per-run tree, NOT concurrency.
## Fix (in progress) — HC1 keeps its teeth
1. **Reliability (restore true level):** discourse `tests/discourse/compose.ccci.yml` overlay sets
the app service `deploy.update_config.order: stop-first` so the new task boots with full memory
(no 2× co-residency) and genuinely becomes healthy → no spurious rollback. The upgrade-to-head
is still really deployed + asserted on head; HC1 unchanged. Documented WHY in the overlay header.
2. **Correctness (honesty, general):** the harness upgrade path detects a swarm rollback after the
chaos redeploy (UpdateStatus.State rollback*/paused, or `.Spec` reverted to `.PreviousSpec`) and
fails the upgrade with the TRUE reason ("head spec applied then swarm-rolled-back: new task
failed the update monitor") instead of the misleading "re-checkout failed". A genuinely
undeployable head still FAILS (teeth preserved).
3. **Blast-radius:** sweep all enrolled recipes for `failure_action: rollback` + start-first heavy
apps with the same latent signature.
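
The rollback detection in item 2 can be sketched as a two-phase poll. This is one reading of the idea, not the code in `lifecycle.assert_upgrade_converged`: phase 1 waits until `UpdateStatus.StartedAt` differs from the value recorded before the redeploy, and phase 2 waits for that update to reach `completed`, raising on any rollback or paused state. The names, the timeout and the injected `read_status` callable are all assumptions made for illustration.

```python
import time

# Swarm update states that mean the head spec did not stay applied.
FAILED_STATES = {"paused", "rollback_started", "rollback_paused", "rollback_completed"}


class UpgradeNotConverged(AssertionError):
    """Raised when the upgrade redeploy did not converge on the new spec."""


def assert_converged(read_status, started_before, timeout=600.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll UpdateStatus until a new update completes; raise on rollback/pause.

    `read_status` is a caller-supplied function returning the service's
    UpdateStatus as a dict (e.g. parsed from `docker service inspect`).
    `started_before` is the StartedAt value recorded before the redeploy.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        status = read_status() or {}
        state = status.get("State")
        # Phase 1: ignore the status left over from the previous update.
        is_new_update = status.get("StartedAt") != started_before
        # Phase 2: judge only the update triggered by this redeploy.
        if is_new_update and state in FAILED_STATES:
            raise UpgradeNotConverged(
                "head spec applied then swarm-rolled-back: new task failed "
                f"the update monitor (UpdateStatus={state})"
            )
        if is_new_update and state == "completed":
            return status
        sleep(interval)
    raise UpgradeNotConverged("timed out waiting for UpdateStatus=completed")
```

Because the failure is raised before any commit comparison, the HC1 commit-match stays untouched: a rolled-back upgrade fails with the true reason, and a converged one still has to pass HC1 afterwards.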
## What is established (direct evidence, reproducible)
- **abra is CONSTANT, not the cause.** abra binary `bf6azhpi…-abra-0.13.0-beta` is the store
path for every nixos system generation from system-4 (2026-06-01) through system-11 (now).
No abra change between 06-05 and 06-10.
HOW: `for g in $(ls -d /nix/var/nix/profiles/system-*-link); do readlink -f "$g/sw/bin/abra"; done`
on cc-ci. EXPECTED: all `…bf6azhpi…` from system-4 on.
- **abra's chaos-version = `SmallSHA(git HEAD of the recipe checkout)`** (+`+U` if worktree
dirty). Source: abra@06a57de `cli/app/deploy.go:106,168,365-373` (chaos →
`toDeployVersion = Recipe.ChaosVersion()`), `pkg/recipe/git.go:300-318` (`ChaosVersion` =
`SmallSHA(Head())`), `:483-495` (`Head` = go-git `repo.Head()`). In chaos mode
`Recipe.Ensure` early-returns (`pkg/recipe/git.go:41-43`) — NO env-version re-checkout.
- **The isolated git/abra path stamps CORRECTLY now.** Three faithful reproductions on cc-ci
(scratch ABRA_DIR, fake domain, deploys bail at `secret not generated` AFTER the chaos
version is computed) all log `taking chaos version: 7ae7b0f7` (= PR head), NOT `eb96de9`:
1. `cp -a` canonical recipe + manual tag/head checkout.
2. real non-chaos base deploy (go-git `EnsureVersion` tag checkout) → CLI re-checkout head → chaos.
3. exact `fetch_recipe` replica: clone mirror `recipe-maintainers/discourse` @7ae7b0f +
`git fetch upstream refs/tags/*` → base deploy → re-checkout head → chaos.
HOW (variant 3, re-runnable cold): see JOURNAL-dstamp 2026-06-11 "mirror-faithful repro".
EXPECTED: `DEBU app/deploy.go:372 version: taking chaos version: 7ae7b0f7`.
- **Same ref, solo run was GREEN; clustered runs DRIFTED.** discourse @ ref `7ae7b0f76efb`:
run **184** (2026-06-05 02:17, solo) = **L4, upgrade PASS**; the 06-10/06-11 runs
**m2b-discourse** (06-10 20:54), **m2p-discourse** (06-11 00:44), **ab-discourse-7ae7b0f-oldmain**
(06-11 00:48) = **L1, upgrade FAIL** (`chaos commit 'eb96de94+U', not the intended PR-head
'7ae7b0f76efb' (HC1)`). HOW: `grep -oE '"level": [0-9]+|"upgrade": "[a-z]+"'
/var/lib/cc-ci-runs/{184,m2p-discourse}/results.json`.
- **All same-ref discourse runs share ONE swarm stack.** `naming.app_domain(recipe,pr,ref)` =
`<recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net` → identical for identical
(recipe,pr,ref). The upgrade `chaos_redeploy` bypasses `deploy_app`'s app-domain flock
(`lifecycle.chaos_redeploy` / `generic.perform_upgrade`). EARLY HYPOTHESIS, since ruled out (see
ROOT CAUSE: `dstamp-repro4` reproduced the rollback solo/isolated): the 06-10/06-11 drift was a
CONCURRENCY ARTIFACT of the clustered rcust-M2 A/B discourse experiments racing on the shared stack.
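
The stamp format and the HC1 comparison described in this list can be illustrated as follows. This is a sketch under stated assumptions, not abra's Go code or the harness's `generic.py` assertion: the 8-character short SHA is inferred from the stamps quoted in this file, and the prefix comparison is inferred from the HC1 failure message, which sets an 8-character stamp against a 12-character head ref. Both function names are hypothetical.

```python
def chaos_version(head_sha, dirty):
    """Short SHA of the recipe checkout's HEAD, plus '+U' when the worktree is dirty."""
    short = head_sha[:8]  # assumed length, matching stamps such as 7ae7b0f7
    return f"{short}+U" if dirty else short


def stamp_matches_head(deployed, head_ref):
    """Compare a deployed chaos-version stamp with the intended PR head.

    The '+U' suffix is ignored and the shorter of the two hashes must be a
    prefix of the longer one.
    """
    commit = deployed[:-2] if deployed.endswith("+U") else deployed
    if not commit or not head_ref:
        return False
    return head_ref.startswith(commit) or commit.startswith(head_ref)
```

With the values from this phase, `chaos_version("7ae7b0f76efb2988c1e54956348dc9eeb7812e0b", True)` gives `7ae7b0f7+U`, which matches head ref `7ae7b0f76efb`, while the drifted stamp `eb96de94+U` does not.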
## In flight
- Implementing the fix (overlay stop-first + harness rollback detection), then a full real run
(all stages) to prove discourse reliably reaches its true level, then the `!testme` drone path.
- Repro evidence runs: `/var/lib/cc-ci-runs/dstamp-repro{1,2,3,4}.console.log` on cc-ci
(repro2 PASS @7ae7b0f7+U; repro4 captured the rollback Spec/PreviousSpec).
## Blocked
- (none)