diff --git a/BACKLOG-dstamp.md b/BACKLOG-dstamp.md
new file mode 100644
index 0000000..60b7723
--- /dev/null
+++ b/BACKLOG-dstamp.md
@@ -0,0 +1,26 @@
+# BACKLOG — phase `dstamp`
+
+## Build backlog (Builder-owned)
+
+- [x] Read phase plan + plan.md §6.1/§7/§9 + Adversary prep notes + stamp-relevant harness code.
+- [x] Establish abra's chaos-version mechanism from abra source @06a57de (= pinned binary).
+- [x] Rule out abra-version drift (constant store path since nixos system-4, 2026-06-01).
+- [x] Minimal reproductions of the git/abra chaos-version path (cp-a; go-git base; mirror-faithful)
+  — all stamp the CORRECT head 7ae7b0f7, NO drift in current host state.
+- [x] Timeline: run 184 (06-05, solo) green @7ae7b0f; clustered 06-10/06-11 runs drift @ same ref.
+- [x] Identify shared-stack collision vector (`app_domain` = hash(recipe|pr|ref); upgrade
+  chaos_redeploy bypasses app-domain flock).
+- [ ] **IN FLIGHT:** single clean ISOLATED real run (install,upgrade @7ae7b0f, console-captured)
+  → decide concurrency-artifact vs real drift.
+- [ ] If concurrency artifact: pin the exact mechanism producing the `eb96de9+U` chaos label on
+  the shared stack (deliberate 2-run concurrency repro if needed); decide the fix
+  (app-lock the upgrade chaos_redeploy / serialize same-stack runs) WITHOUT weakening HC1.
+- [ ] If real env drift: read the isolated-run console, attribute the exact 06-05→06-10 change.
+- [ ] Blast-radius sweep: every enrolled recipe's latest upgrade-tier evidence for the same
+  signature (prev-base tag commit stamped where a version was expected).
+- [ ] Restore discourse to its true level in real CI via the drone `!testme` path (M2).
+- [ ] Prove HC1 still has teeth (a deliberately wrong stamp still FAILs).
+- [ ] Close the DEFERRED.md dstamp re-entry with pointers.
+
+## Adversary findings
+
diff --git a/JOURNAL-dstamp.md b/JOURNAL-dstamp.md
new file mode 100644
index 0000000..61562f9
--- /dev/null
+++ b/JOURNAL-dstamp.md
@@ -0,0 +1,63 @@
+# JOURNAL — phase `dstamp` (Builder, reasoning/private)
+
+## 2026-06-11 — Bootstrap + investigation
+
+Read the phase plan, plan.md §6.1/§7/§9, the Adversary's REVIEW-dstamp prep notes, and the
+stamp-relevant harness code (`abra.py`, `lifecycle.py:deployed_identity/recipe_checkout_ref/
+chaos_redeploy/prepull_images`, `generic.py:perform_upgrade/assert_upgraded`, run_recipe_ci
+upgrade op + fetch_recipe).
+
+### Mechanism (from abra source @06a57de = the pinned binary)
+chaos-version label is set in `cli/app/deploy.go`: for a `-C` deploy, `getDeployVersion` (l.365)
+returns `Recipe.ChaosVersion()` (l.367-373) and `SetChaosVersionLabel(compose, stack, toDeployVersion)`
+(l.168). `ChaosVersion` (`pkg/recipe/git.go:300`) = `formatter.SmallSHA(Head().String())` + `+U`
+if dirty. `Head` (l.483) = go-git `repo.Head()`. Crucially, `app.Recipe.Ensure(ctx)` (deploy.go:86)
+calls into git.go:38 which **early-returns on `ctx.Chaos`** (l.41-43) — so a chaos deploy does NOT
+re-checkout the .env version. `GetEnsureContext` (cli/internal/ensure.go) wires `EnsureContext{Chaos,
+Offline, IgnoreEnvVersion=DeployLatest}` from the CLI flags. So `-C` ⇒ Ensure no-op ⇒ chaos version
+= whatever git HEAD the harness left checked out.
+
+### The contradiction that drove the dig
+The m2p failure message is `chaos commit 'eb96de94+U', not the intended PR-head '7ae7b0f76efb'`.
+`eb96de9` = tag `0.7.0+3.3.1` (the upgrade base); `7ae7b0f` = PR head (9 commits past that tag,
+and there is NO 0.8/0.9 tag despite HEAD's "upgrade to 0.9.0+3.5.0" message). The harness
+`perform_upgrade` does `recipe_checkout_ref(head_ref=7ae7b0f)` then `chaos_redeploy`, with only
+`env_set` + `prepull_images` (pure docker compose, no git) in between — and the run's recipe
+**snapshot HEAD = 7ae7b0f**. So at deploy time HEAD *should* be 7ae7b0f ⇒ stamp 7ae7b0f. Yet it
+stamped eb96de9. abra's source says chaos = Head(); so for eb96de9 to be stamped, HEAD had to be
+eb96de9 at the chaos deploy — which the isolated flow never produces.
+
+### Reproductions (all on cc-ci, scratch ABRA_DIR, deploys bail at `secret not generated`
+### which is deploy.go:140, AFTER the chaos version is computed+logged at deploy.go:372)
+1. cp -a canonical recipe, checkout head→base(tag)→head, `abra app deploy -C` → `taking chaos
+   version: 7ae7b0f7`. HEAD stays 7ae7b0f. NO drift.
+2. real non-chaos base deploy (exercises go-git `EnsureVersion` which checks out tag via
+   `Branch: refs/tags/0.7.0+3.3.1`, leaving HEAD=eb96de9), then CLI `git checkout -f head`, then
+   `-C` deploy → `taking chaos version: 7ae7b0f7`. NO drift.
+3. mirror-faithful: `git clone ` + `git checkout 7ae7b0f` +
+   `git fetch refs/tags/*:refs/tags/*` (exact `fetch_recipe`), then base
+   deploy → re-checkout head → `-C` deploy → `taking chaos version: 7ae7b0f7`. NO drift.
+
+Conclusion: the isolated git/abra version-resolution path is **correct** in the current host
+state. The drift is not in that path.
+
+### Timeline / differentiator
+- abra binary: constant since 2026-06-01 (system-4). Not abra.
+- Same ref 7ae7b0f: run 184 (06-05 02:17, **solo**) was L4 upgrade-PASS. The drift runs
+  (m2b 06-10 20:54, m2p 06-11 00:44, ab 06-11 00:48) are **clustered** (m2p & ab 4 min apart →
+  overlapping for a multi-tier discourse run that takes ≫4 min).
+- `app_domain` hashes (recipe|pr|ref) ⇒ all three drift runs, same ref, **collide on one swarm
+  stack**. The upgrade `chaos_redeploy` does NOT take `deploy_app`'s app-domain flock, so two
+  concurrent runs can interleave deploys on the shared stack and the `_app` service label
+  read by `deployed_identity` reflects whichever deploy last wrote it.
+
+**Leading hypothesis:** the "harness-neutral env drift" is actually a **concurrency artifact** of
+the rcust-phase M2 A/B discourse experiments running near-simultaneously on the shared stack — not
+an abra/recipe/environment regression. Run 184 solo = green; clustered 06-11 = drift; isolated
+re-reproduction now = green. Testing with one clean isolated real run (install,upgrade) before
+committing to this attribution — direct evidence required by the plan, not inference alone.
+
+Open: must still explain *exactly* how a concurrent peer produces an `eb96de9+U` (dirty CHAOS)
+label on the shared stack — a base deploy is pinned/non-chaos (no chaos label), so the +U chaos
+label must come from some chaos deploy with HEAD=eb96de9. The isolated real run + (if needed) a
+deliberate 2-run concurrency repro will nail the mechanism. Will NOT claim M1 on inference.
diff --git a/STATUS-dstamp.md b/STATUS-dstamp.md
new file mode 100644
index 0000000..f46dbe2
--- /dev/null
+++ b/STATUS-dstamp.md
@@ -0,0 +1,51 @@
+# STATUS — phase `dstamp` (discourse abra-stamp drift)
+
+Builder. SSOT: `cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.
+
+## Phase state: INVESTIGATING (no gate claimed yet)
+
+## What is established (direct evidence, reproducible)
+
+- **abra is CONSTANT, not the cause.** abra binary `bf6azhpi…-abra-0.13.0-beta` is the store
+  path for every nixos system generation from system-4 (2026-06-01) through system-11 (now).
+  No abra change between 06-05 and 06-10.
+  HOW: `for g in $(ls -d /nix/var/nix/profiles/system-*-link); do readlink -f "$g/sw/bin/abra"; done`
+  on cc-ci. EXPECTED: all `…bf6azhpi…` from system-4 on.
+
+- **abra's chaos-version = `SmallSHA(git HEAD of the recipe checkout)`** (+`+U` if worktree
+  dirty). Source: abra@06a57de `cli/app/deploy.go:106,168,365-373` (chaos →
+  `toDeployVersion = Recipe.ChaosVersion()`), `pkg/recipe/git.go:300-318` (`ChaosVersion` =
+  `SmallSHA(Head())`), `:483-495` (`Head` = go-git `repo.Head()`). In chaos mode
+  `Recipe.Ensure` early-returns (`pkg/recipe/git.go:41-43`) — NO env-version re-checkout.
+
+- **The isolated git/abra path stamps CORRECTLY now.** Three faithful reproductions on cc-ci
+  (scratch ABRA_DIR, fake domain, deploys bail at `secret not generated` AFTER the chaos
+  version is computed) all log `taking chaos version: 7ae7b0f7` (= PR head), NOT `eb96de9`:
+  1. `cp -a` canonical recipe + manual tag/head checkout.
+  2. real non-chaos base deploy (go-git `EnsureVersion` tag checkout) → CLI re-checkout head → chaos.
+  3. exact `fetch_recipe` replica: clone mirror `recipe-maintainers/discourse` @7ae7b0f +
+     `git fetch upstream refs/tags/*` → base deploy → re-checkout head → chaos.
+  HOW (variant 3, re-runnable cold): see JOURNAL-dstamp 2026-06-11 "mirror-faithful repro".
+  EXPECTED: `DEBU app/deploy.go:372 version: taking chaos version: 7ae7b0f7`.
+
+- **Same ref, solo run was GREEN; clustered runs DRIFTED.** discourse @ ref `7ae7b0f76efb`:
+  run **184** (2026-06-05 02:17, solo) = **L4, upgrade PASS**; the 06-10/06-11 runs
+  **m2b-discourse** (06-10 20:54), **m2p-discourse** (06-11 00:44), **ab-discourse-7ae7b0f-oldmain**
+  (06-11 00:48) = **L1, upgrade FAIL** (`chaos commit 'eb96de94+U', not the intended PR-head
+  '7ae7b0f76efb' (HC1)`). HOW: `grep -oE '"level": [0-9]+|"upgrade": "[a-z]+"'
+  /var/lib/cc-ci-runs/{184,m2p-discourse}/results.json`.
+
+- **All same-ref discourse runs share ONE swarm stack.** `naming.app_domain(recipe,pr,ref)` =
+  `-<6hex(recipe|pr|ref)>.ci.commoninternet.net` → identical for identical
+  (recipe,pr,ref). The upgrade `chaos_redeploy` bypasses `deploy_app`'s app-domain flock
+  (`lifecycle.chaos_redeploy` / `generic.perform_upgrade`). LEADING HYPOTHESIS: the 06-10/06-11
+  drift is a CONCURRENCY ARTIFACT of the clustered rcust-M2 A/B discourse experiments racing on
+  the shared stack — NOT an abra/recipe/env regression. Under test now.
+
+## In flight
+- Isolated clean real run (`CCCI_RUN_ID=dstamp-repro1`, STAGES=install,upgrade, ref 7ae7b0f,
+  no concurrent discourse run) with full console capture → decides: isolated real run GREEN
+  (⇒ concurrency artifact) vs DRIFT (⇒ read exact console). Console: `/var/lib/cc-ci-runs/dstamp-repro1.console.log` on cc-ci.
+
+## Blocked
+- (none)
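
The two mechanisms the STATUS notes above rest on (the chaos label being a pure function of the checked-out HEAD, and same-ref runs resolving to one stack that the upgrade path deploys to without holding a lock) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the harness code: `chaos_label`, `stack_key` and `app_lock` are hypothetical names, SHA-256 stands in for whatever hash `naming.app_domain` really uses, the PR value `"0"` in the usage note is a placeholder, and the 8-character short SHA is inferred from the logged `7ae7b0f7` and `eb96de94+U` values rather than read from abra's `SmallSHA`. It needs a POSIX host for `fcntl`.

```python
import fcntl
import hashlib
import os
from contextlib import contextmanager


def chaos_label(head_sha: str, dirty: bool) -> str:
    """Shape of the label seen in the run output: short SHA, '+U' if dirty.

    Assumption: 8 hex characters, inferred from the logged values only.
    """
    return head_sha[:8] + ("+U" if dirty else "")


def stack_key(recipe: str, pr: str, ref: str) -> str:
    """Hypothetical stand-in for the 6-hex component of the app domain.

    SHA-256 is an assumption. The only property that matters here is that
    the key is a pure function of (recipe, pr, ref), so two runs at the
    same ref always land on the same key, and hence the same stack.
    """
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return digest[:6]


@contextmanager
def app_lock(key: str, lock_dir: str, blocking: bool = True):
    """Hold an exclusive flock for one stack key for the duration of a deploy."""
    path = os.path.join(lock_dir, f"{key}.lock")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        flags = fcntl.LOCK_EX | (0 if blocking else fcntl.LOCK_NB)
        # With LOCK_NB this raises BlockingIOError if another holder exists.
        fcntl.flock(fd, flags)
        yield path
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)
```

With these definitions, `stack_key("discourse", "0", "7ae7b0f76efb")` returns the same value on every call, which is the collision the notes describe. Wrapping the upgrade-tier `chaos_redeploy` in something like `app_lock` corresponds to the "app-lock the upgrade chaos_redeploy" option in the backlog; whether that leaves HC1 intact is still to be proven there.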