# STATUS — phase `dstamp` (discourse abra-stamp drift)

Builder. SSOT: `cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.

## DONE

M1 PASS (REVIEW-dstamp `fb411b2` @17:36Z) + M2 PASS (`71358da` @17:58Z), both fresh, no VETO.
All Definition-of-Done items Adversary-verified.

**Operator summary.** The discourse upgrade-tier "abra stamp drift" (upgrade-HC1 stamping the
prev-base tag commit `eb96de94+U` instead of the PR head `7ae7b0f7+U`, since ~06-10) was **NOT an
abra or harness git bug**: abra stamps the head correctly. **Root cause:** discourse's
`compose.yml` app service uses `deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }`.
On the upgrade chaos redeploy, start-first runs the OLD and NEW precompile/Rails-heavy tasks
side by side (~2× memory); under host memory pressure the NEW task fails swarm's 5s
update monitor → swarm **rolls back** to the base spec, reverting the `chaos-version` label
(head→base). start-first kept the old task serving, so `wait_healthy` passed and HC1 read the
reverted base commit, which it misreported as "re-checkout failed". Intermittent (memory-pressure
dependent): solo run 184 on 06-05 passed; the heavier 06-10/06-11 runs rolled back every time.
**Direct evidence:** `dstamp-repro4` captured `.Spec chaos-version=7ae7b0f7+U` (head applied) →
`.PreviousSpec=eb96de94+U` (base) with `UpdateStatus=updating`, then the post-rollback read = base.

**Fix (commits `0cc31a5` + `e9c26c7`, HC1 unweakened):** (1) `tests/discourse/compose.ccci.yml`
app `update_config.order: stop-first`, so the new task boots with full host memory, no OOM, no
spurious rollback (`failure_action: rollback` left intact for genuine failures); (2) a general
harness guard `lifecycle.assert_upgrade_converged` (2-phase StartedAt protocol) that detects a
swarm rollback/pause after the upgrade redeploy and fails the upgrade HONESTLY. The HC1
commit-match assertion is unchanged.

**Proven in real CI:** drone `!testme` build **#450** (discourse @7ae7b0f) = **LEVEL 5** (was L1
under the drift), all tiers green, clean teardown, no secret leak; PR recipe-maintainers/discourse#2
shows ✅ passed. **Blast-radius:** only discourse was affected (keycloak/n8n share the policy but
upgrade-PASS L4; drone/traefik are infra); the new harness guard now protects all rollback-policy
recipes. DEFERRED entry closed with pointers. **No operator action required.**

---
## Gate: M1 — PASS (REVIEW-dstamp fb411b2 @2026-06-11T17:36Z). Now on M2.
## Gate: M2 — CLAIMED, awaiting Adversary

**WHAT (M2 = Proven in real CI):** discourse full lifecycle GREEN at its true level via the drone
`!testme` path, upgrade-HC1 stamping the CORRECT head value; no other affected recipe; HC1
unweakened (a wrong stamp still FAILs); DEFERRED closed.

- **Real-CI proof, drone `!testme` build #450:** discourse @ `7ae7b0f76efb` (PR#2), STAGES full
  (install,upgrade,backup,restore,custom), drone workspace at cc-ci main `2da1f01` (fix present) →
  **LEVEL 5** (max), ALL tiers PASS, `clean_teardown=true`, `no_secret_leak=true`. Upgrade tier
  `test_upgrade_reconverges` PASSED (HC1's `assert_upgraded` only passes when the deployed
  chaos-version commit == head_ref `7ae7b0f`, after `assert_upgrade_converged` confirmed
  `UpdateStatus=completed`). Was L1 (drift) before the fix → L5 now.
- **Triggered via the !testme path:** comment `14346` (`!testme`) on recipe-maintainers/discourse#2
  → bridge ack `14347`, updated to "🌻 cc-ci — discourse @ 7ae7b0f7 ✅ **passed**" with the L5
  result card/badge linking drone build 450.

**HOW to verify (Adversary, cold):**

1. `grep -oE '"level": [0-9]+|"(install|upgrade|backup|restore|custom)": "[a-z]+"|"clean_teardown": (true|false)|"no_secret_leak": (true|false)' /var/lib/cc-ci-runs/450/results.json` → level 5,
   all `pass`, both flags `true`.
2. `/var/lib/cc-ci-runs/450/junit/upgrade__generic__test_upgrade.xml` → `test_upgrade_reconverges`
   testcase with NO `<failure>` child (passed).
3. PR comment 14347 on recipe-maintainers/discourse#2 = ✅ passed, run 450.
4. *Fresh independent re-trigger (recommended):* post `!testme` on discourse#2 → new drone build on
   cc-ci main → expect L5 again (reliability: manual fix1+fix2 + build 450 = 3 consecutive green
   with the fix vs intermittent unpatched failures).
5. **HC1 teeth (negative test, Adversary leads):** synthesize a wrong stamp and show RED. Two live
   teeth: (a) the unchanged commit-match `generic.py:174-175`: a deployed chaos commit ≠ head_ref
   still FAILs (e.g. force the recheckout to the base, or deploy base-as-head); (b) the new
   `assert_upgrade_converged` raises on a swarm `rollback_completed`/`paused` (the ORIGINAL drift
   path; repro1/repro4 are exactly this RED, now with an honest message). Neither relaxes HC1.
6. DEFERRED closed: `machine-docs/DEFERRED.md` dstamp entry → ✅ RESOLVED with pointers.

**EXPECTED:** build 450 level 5, all tiers pass, both flags true; PR#2 ✅ passed; DEFERRED resolved.
**WHERE:** `/var/lib/cc-ci-runs/450/`; commits `0cc31a5`,`e9c26c7`; PR#2 comments 14346/14347;
`machine-docs/DEFERRED.md`. **No other recipe affected** (blast-radius: keycloak/n8n upgrade-PASS L4
across runs incl. rcust era; drone/traefik infra). Fresh Adversary M2 PASS → `## DONE`.

---
## (M1 — verified PASS; detail retained below)

**WHAT (M1 = Attribution):** root cause attributed by direct evidence; minimal reproducible
demonstration; 06-05→06-10 change identified; fix implemented (recipe overlay + harness, HC1
unweakened); blast-radius sweep complete.

Root cause: discourse `compose.yml` app service sets `deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }`.
On the upgrade chaos redeploy, start-first runs OLD and NEW side by side (~2× memory) for the
precompile/Rails-heavy app; under host memory pressure the NEW task
fails swarm's 5s update monitor → `failure_action: rollback` reverts the app service to its
PreviousSpec, INCLUDING the `coop-cloud.<stack>.chaos-version` label (head→base). Under start-first
the OLD task keeps serving, so `wait_healthy` passes; `deployed_identity` then reads the rolled-back
`.Spec` (base commit `eb96de94+U`) and HC1 misreports it as "re-checkout failed". abra+harness git
path EXONERATED (abra stamps head `7ae7b0f7+U` correctly; per-run HEAD=7ae7b0f at deploy).

**HOW to verify (Adversary, cold):**

1. *Recipe policy:* `cd ~/.abra/recipes/discourse && git checkout -q 7ae7b0f76efb && grep -nA3 update_config compose.yml`
   → `failure_action: rollback`, `order: start-first`. EXPECTED present.
2. *abra exonerated (minimal repro):* scratch ABRA_DIR, base→head checkout,
   `abra app deploy <d> -C -o -n --debug` bails at `secret not generated` AFTER logging
   `app/deploy.go:372 version: taking chaos version: 7ae7b0f7+U` (HEAD-correct).
   Procedure: JOURNAL-dstamp "mirror-faithful repro".
3. *Direct rollback evidence:* console `/var/lib/cc-ci-runs/dstamp-repro4.console.log` line
   `[DSTAMP] post-redeploy svc inspect …` shows immediately post-redeploy
   `UpdateStatus.State="updating"`, `.Spec…chaos-version=7ae7b0f7+U` (head applied),
   `.PreviousSpec…chaos-version=eb96de94+U` (base); the later HC1 read = eb96de94+U after the
   rollback completes.
4. *Fix present:* `runner/harness/lifecycle.py::assert_upgrade_converged` (+ `update_status_started`)
   and its call in `runner/harness/generic.py::perform_upgrade`; `tests/discourse/compose.ccci.yml`
   app `deploy.update_config.order: stop-first`. Commits `0cc31a5` + `e9c26c7`.
5. *Fix works:* run `dstamp-fix1` (fresh checkout, STAGES=install,upgrade) → upgrade PASS,
   console `upgrade-converged: …UpdateStatus=completed` +
   `chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`. (Re-runnable:
   `RECIPE=discourse PR=2 REF=7ae7b0f76efb2988c1e54956348dc9eeb7812e0b SRC=recipe-maintainers/discourse STAGES=install,upgrade CCCI_RUN_ID=<id> cc-ci-run runner/run_recipe_ci.py`
   from a checkout at `e9c26c7`.)
6. *Blast-radius:* recipes with rollback+start-first = discourse, drone, keycloak, n8n, traefik.
   keycloak/n8n upgrade PASS L4 across runs (155/186/187/m2r; 47/54/61/162/197/m2r) ⇒ not affected;
   drone/traefik infra (no recipe-CI upgrade tier). Only discourse affected; the general
   `assert_upgrade_converged` guard now protects all rollback-policy recipes.
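
The sweep in step 6 starts from finding recipes whose compose file combines `failure_action: rollback` with `order: start-first`. A rough sketch of that scan follows. It is a plain-text heuristic (no YAML parsing, so it cannot tell which service a key belongs to), and both the helper names and the `<recipes_dir>/<recipe>/compose.yml` layout (modelled on `~/.abra/recipes/discourse`) are assumptions made for illustration.

```python
import re
from pathlib import Path

ROLLBACK_RE = re.compile(r"failure_action:\s*rollback\b")
START_FIRST_RE = re.compile(r"order:\s*start-first\b")

def has_rollback_start_first(compose_text: str) -> bool:
    """Heuristic: both policy keys appear somewhere in the compose text."""
    return bool(
        ROLLBACK_RE.search(compose_text) and START_FIRST_RE.search(compose_text)
    )

def sweep(recipes_dir: Path) -> list[str]:
    """Sorted names of recipes whose compose.yml matches the heuristic."""
    hits = []
    for compose in sorted(recipes_dir.glob("*/compose.yml")):
        if has_rollback_start_first(compose.read_text(encoding="utf-8")):
            hits.append(compose.parent.name)
    return hits
```

A hit only marks a candidate. Whether a recipe is actually affected was decided above from its upgrade history, not from the policy alone.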

**EXPECTED:** all of 1–6 hold. **WHERE:** commits 0cc31a5, e9c26c7; runs
`/var/lib/cc-ci-runs/dstamp-{repro1,repro2,repro4,fix1}`; recipe `~/.abra/recipes/discourse`.

HC1 teeth preserved: the commit-match assertion is unchanged; `assert_upgrade_converged` only makes
a swarm rollback an HONEST upgrade failure before HC1 runs (a genuinely undeployable head still
fails). M2 will demonstrate a wrong stamp still FAILs + full-lifecycle green via the `!testme` path.

---

## Root cause detail (evidence)

## ROOT CAUSE (attributed by direct evidence, abra+harness EXONERATED)

The upgrade chaos redeploy applies the **correct** head spec, then swarm **rolls it back** to the
base spec, reverting the `chaos-version` label. This is masked by the recipe's `start-first` strategy +
the harness's `wait_healthy` (the OLD task keeps serving, so health passes).

Recipe policy (`~/.abra/recipes/discourse/compose.yml`, app service):
`deploy.update_config: { failure_action: rollback, order: start-first }`, `healthcheck.start_period: 20m`. The heavy
discourse app, started **start-first** (old+new co-resident ≈ 2× memory), intermittently fails
swarm's update monitor on the NEW task → swarm executes `failure_action: rollback` → app service
reverts to PreviousSpec (the base, `chaos-version=eb96de94+U`).

**Direct evidence (run `dstamp-repro4`, console `/var/lib/cc-ci-runs/dstamp-repro4.console.log`,
solo/isolated):** immediately after `chaos_redeploy`, `docker service inspect <stack>_app`:

- `UpdateStatus.State = "updating"`,
- `.Spec.Labels coop-cloud.<stack>.chaos-version = 7ae7b0f7+U` (HEAD applied; abra stamped head
  correctly), `.version = 0.9.0+3.5.0`,
- `.PreviousSpec.Labels …chaos-version = eb96de94+U` (the base), `.version = 0.7.0+3.3.1`.

Then `wait_healthy` passes (old task serves under start-first); the new task fails the monitor →
rollback → `.Spec` reverts to `eb96de94+U`; the later HC1 read sees `eb96de94+U` → FAIL with the
misleading "re-checkout failed" message. (`dstamp-repro2`, lighter timing, had NO rollback →
upgrade PASS @ `7ae7b0f7+U`.)
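
The three fields above come from `docker service inspect`, which prints a JSON array of service objects. A small sketch of pulling them out follows; the helper name `stamp_snapshot` is hypothetical, and the label key is built from the `coop-cloud.<stack>.chaos-version` pattern quoted above.

```python
import json

def stamp_snapshot(inspect_json: str, stack: str) -> dict:
    """Extract the update state and chaos-version labels for one service."""
    service = json.loads(inspect_json)[0]  # first (only) inspected service
    key = f"coop-cloud.{stack}.chaos-version"

    def label(spec_name: str):
        # PreviousSpec and UpdateStatus are absent on a never-updated service.
        return (service.get(spec_name) or {}).get("Labels", {}).get(key)

    return {
        "state": (service.get("UpdateStatus") or {}).get("State"),
        "spec": label("Spec"),
        "previous": label("PreviousSpec"),
    }
```

On the `dstamp-repro4` capture this would be expected to report `updating`, `7ae7b0f7+U` and `eb96de94+U`. A second snapshot taken after the rollback is what shows `.Spec` back at the base stamp.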

Intermittency (184✓ solo 06-05; m2b/m2p/ab✗ clustered/heavier-load 06-10/11; repro1✗ repro2✓
repro4✗) comes down to whether the new start-first task survives swarm's monitor under the host's
momentary memory pressure. The "since ~06-10 on every run" pattern is explained by load: the rcust
phase ran under heavier resident load (warm keycloak etc.), so the new task reliably failed →
rollback every time. abra version-resolution
is CORRECT (proven: repro2 debug line `taking chaos version: 7ae7b0f7+U` + 3 bail-at-secrets repros);
the per-run git checkout is CORRECT (HEAD=7ae7b0f at deploy, reflog-proven). NOT abra, NOT the
per-run tree, NOT concurrency.
## Fix (in progress) — HC1 keeps its teeth

1. **Reliability (restore true level):** the discourse `tests/discourse/compose.ccci.yml` overlay sets
   the app service `deploy.update_config.order: stop-first` so the new task boots with full memory
   (no 2× co-residency) and genuinely becomes healthy → no spurious rollback. The upgrade-to-head
   is still really deployed + asserted on head; HC1 unchanged. Documented WHY in the overlay header.
2. **Correctness (honesty, general):** the harness upgrade path detects a swarm rollback after the
   chaos redeploy (UpdateStatus.State rollback*/paused, or `.Spec` reverted to `.PreviousSpec`) and
   fails the upgrade with the TRUE reason ("head spec applied then swarm-rolled-back: new task
   failed the update monitor") instead of the misleading "re-checkout failed". A genuinely
   undeployable head still FAILS (teeth preserved).
3. **Blast-radius:** sweep all enrolled recipes for `failure_action: rollback` + start-first heavy
   apps with the same latent signature.
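
Item 2 can be illustrated with a short sketch of the decision logic. This is not the harness's `assert_upgrade_converged` (its 2-phase StartedAt protocol is not reproduced here); it only shows the two signals named above: a `rollback*`/`paused` update state, or a `.Spec` stamp that is no longer the one the redeploy applied. The function and exception names are hypothetical.

```python
class UpgradeRolledBack(AssertionError):
    """Swarm reverted or paused the upgrade redeploy."""

def check_upgrade_converged(state, spec_stamp, applied_stamp) -> bool:
    """Classify one observation of the app service after the upgrade redeploy.

    state:         UpdateStatus.State (None if the service was never updated)
    spec_stamp:    chaos-version label currently in .Spec
    applied_stamp: chaos-version the redeploy applied (the head stamp)

    Returns True when converged, False when the update is still running
    (the caller should poll again), and raises on a rollback or pause.
    """
    if state is not None and (state.startswith("rollback") or state == "paused"):
        raise UpgradeRolledBack(
            f"head spec applied then swarm-rolled-back: UpdateStatus={state}"
        )
    if spec_stamp != applied_stamp:
        raise UpgradeRolledBack(
            f".Spec stamp {spec_stamp!r} is not the applied {applied_stamp!r}"
        )
    return state in (None, "completed")
```

Raising a dedicated exception keeps the rollback message separate from the HC1 commit-match failure, which is the point of the honesty fix: the same RED run, with the true reason attached.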

## What is established (direct evidence, reproducible)

- **abra is CONSTANT, not the cause.** abra binary `bf6azhpi…-abra-0.13.0-beta` is the store
  path for every NixOS system generation from system-4 (2026-06-01) through system-11 (now).
  No abra change between 06-05 and 06-10.
  HOW: `for g in $(ls -d /nix/var/nix/profiles/system-*-link); do readlink -f "$g/sw/bin/abra"; done`
  on cc-ci. EXPECTED: all `…bf6azhpi…` from system-4 on.

- **abra's chaos-version = `SmallSHA(git HEAD of the recipe checkout)`** (+`+U` if worktree
  dirty). Source: abra@06a57de `cli/app/deploy.go:106,168,365-373` (chaos →
  `toDeployVersion = Recipe.ChaosVersion()`), `pkg/recipe/git.go:300-318` (`ChaosVersion` =
  `SmallSHA(Head())`), `:483-495` (`Head` = go-git `repo.Head()`). In chaos mode
  `Recipe.Ensure` early-returns (`pkg/recipe/git.go:41-43`), so there is NO env-version re-checkout.

- **The isolated git/abra path stamps CORRECTLY now.** Three faithful reproductions on cc-ci
  (scratch ABRA_DIR, fake domain, deploys bail at `secret not generated` AFTER the chaos
  version is computed) all log `taking chaos version: 7ae7b0f7` (= PR head), NOT `eb96de9`:
  1. `cp -a` canonical recipe + manual tag/head checkout.
  2. real non-chaos base deploy (go-git `EnsureVersion` tag checkout) → CLI re-checkout head → chaos.
  3. exact `fetch_recipe` replica: clone mirror `recipe-maintainers/discourse` @7ae7b0f +
     `git fetch upstream refs/tags/*` → base deploy → re-checkout head → chaos.

  HOW (variant 3, re-runnable cold): see JOURNAL-dstamp 2026-06-11 "mirror-faithful repro".
  EXPECTED: `DEBU app/deploy.go:372 version: taking chaos version: 7ae7b0f7`.

- **Same ref, solo run was GREEN; clustered runs DRIFTED.** discourse @ ref `7ae7b0f76efb`:
  run **184** (2026-06-05 02:17, solo) = **L4, upgrade PASS**; the 06-10/06-11 runs
  **m2b-discourse** (06-10 20:54), **m2p-discourse** (06-11 00:44), **ab-discourse-7ae7b0f-oldmain**
  (06-11 00:48) = **L1, upgrade FAIL**
  (`chaos commit 'eb96de94+U', not the intended PR-head '7ae7b0f76efb' (HC1)`).
  HOW: `grep -oE '"level": [0-9]+|"upgrade": "[a-z]+"' /var/lib/cc-ci-runs/{184,m2p-discourse}/results.json`.

- **All same-ref discourse runs share ONE swarm stack.** `naming.app_domain(recipe,pr,ref)` =
  `<recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net` → identical for identical
  (recipe,pr,ref). The upgrade `chaos_redeploy` bypasses `deploy_app`'s app-domain flock
  (`lifecycle.chaos_redeploy` / `generic.perform_upgrade`). LEADING HYPOTHESIS: the 06-10/06-11
  drift is a CONCURRENCY ARTIFACT of the clustered rcust-M2 A/B discourse experiments racing on
  the shared stack, NOT an abra/recipe/env regression. Under test now.
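
The stamping rule from the second bullet and the HC1 commit-match can be put together in a few lines. This is an illustrative sketch, neither abra's code nor the harness's `generic.py`: the 8-character short SHA is inferred from the stamps quoted in this document (`7ae7b0f7+U` for head `7ae7b0f76efb2988c1e54956348dc9eeb7812e0b`), the prefix comparison is one plausible way to match a short stamp against a longer or shorter `head_ref`, and both function names are hypothetical.

```python
def chaos_version(head_sha: str, dirty: bool) -> str:
    """Short SHA of the recipe checkout's HEAD, plus '+U' if the worktree is dirty."""
    return head_sha[:8] + ("+U" if dirty else "")

def stamp_matches_head(deployed_stamp: str, head_ref: str) -> bool:
    """Commit-match: strip the '+U' marker, then compare by prefix in either direction."""
    commit = deployed_stamp.removesuffix("+U")  # Python 3.9+
    if not commit or not head_ref:
        return False
    return head_ref.startswith(commit) or commit.startswith(head_ref)
```

With these definitions the drifted stamp `eb96de94+U` does not match head ref `7ae7b0f76efb`, while `7ae7b0f7+U` does, which is the distinction HC1 draws between the failing and passing runs listed above.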

## In flight

- Implementing the fix (overlay stop-first + harness rollback detection), then a full real run
  (all stages) to prove discourse reliably reaches its true level, then the `!testme` drone path.
- Repro evidence runs: `/var/lib/cc-ci-runs/dstamp-repro{1,2,3,4}.console.log` on cc-ci
  (repro2 PASS @7ae7b0f7+U; repro4 captured the rollback Spec/PreviousSpec).

## Blocked

- (none)