# JOURNAL — phase `dstamp` (Builder, reasoning/private)

## 2026-06-11 — Bootstrap + investigation

Read the phase plan, plan.md §6.1/§7/§9, the Adversary's REVIEW-dstamp prep notes, and the
stamp-relevant harness code (`abra.py`,
`lifecycle.py:deployed_identity/recipe_checkout_ref/chaos_redeploy/prepull_images`,
`generic.py:perform_upgrade/assert_upgraded`, run_recipe_ci upgrade op + fetch_recipe).

### Mechanism (from abra source @06a57de = the pinned binary)

The chaos-version label is set in `cli/app/deploy.go`: for a `-C` deploy, `getDeployVersion` (l.365)
returns `Recipe.ChaosVersion()` (l.367-373), and `SetChaosVersionLabel(compose, stack, toDeployVersion)`
(l.168) applies it. `ChaosVersion` (`pkg/recipe/git.go:300`) = `formatter.SmallSHA(Head().String())` + `+U`
if dirty. `Head` (l.483) = go-git `repo.Head()`. Crucially, `app.Recipe.Ensure(ctx)` (deploy.go:86)
calls into git.go:38, which **early-returns on `ctx.Chaos`** (l.41-43), so a chaos deploy does NOT
re-checkout the .env version. `GetEnsureContext` (cli/internal/ensure.go) wires
`EnsureContext{Chaos, Offline, IgnoreEnvVersion=DeployLatest}` from the CLI flags. So `-C` ⇒ Ensure
no-op ⇒ chaos version = whatever git HEAD the harness left checked out.

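The resolution rule above can be modelled in a few lines. This is only a sketch of the behaviour as read from the source, not abra's code: the function names are hypothetical, and the 8-character abbreviation is inferred from the `taking chaos version: 7ae7b0f7` log lines rather than from `formatter.SmallSHA` itself.

```python
def small_sha(full_sha: str, length: int = 8) -> str:
    """Abbreviate a commit hash.

    Assumption: 8 characters, matching the logged `7ae7b0f7`.
    """
    return full_sha[:length]


def chaos_version(head_sha: str, dirty: bool) -> str:
    """Short HEAD hash, with '+U' appended when the work tree is dirty."""
    version = small_sha(head_sha)
    return f"{version}+U" if dirty else version


def version_to_deploy(chaos: bool, env_version: str, head_sha: str, dirty: bool) -> str:
    """With -C the .env version is never re-checked-out, so the stamp
    depends only on whatever HEAD is currently checked out."""
    if chaos:
        return chaos_version(head_sha, dirty)
    return env_version


# PR head from the m2p failure message, dirty tree.
print(version_to_deploy(True, "0.7.0+3.3.1", "7ae7b0f76efb", True))
```

The point of the model is that a chaos stamp of `eb96de94+U` can only come out of this path if HEAD was `eb96de9` when the deploy ran.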
### The contradiction that drove the dig

The m2p failure message is `chaos commit 'eb96de94+U', not the intended PR-head '7ae7b0f76efb'`.
`eb96de9` = tag `0.7.0+3.3.1` (the upgrade base); `7ae7b0f` = PR head (9 commits past that tag,
and there is NO 0.8/0.9 tag despite HEAD's "upgrade to 0.9.0+3.5.0" message). The harness
`perform_upgrade` does `recipe_checkout_ref(head_ref=7ae7b0f)` then `chaos_redeploy`, with only
`env_set` + `prepull_images` (pure docker compose, no git) in between — and the run's recipe
**snapshot HEAD = 7ae7b0f**. So at deploy time HEAD *should* be 7ae7b0f ⇒ stamp 7ae7b0f. Yet it
stamped eb96de9. abra's source says chaos = Head(); so for eb96de9 to be stamped, HEAD had to be
eb96de9 at the chaos deploy — which the isolated flow never produces.

### Reproductions

All on cc-ci with a scratch ABRA_DIR. Deploys bail at `secret not generated` (deploy.go:140), which
is AFTER the chaos version is computed+logged at deploy.go:372.

1. cp -a canonical recipe, checkout head→base(tag)→head, `abra app deploy -C` → `taking chaos
   version: 7ae7b0f7`. HEAD stays 7ae7b0f. NO drift.
2. real non-chaos base deploy (exercises go-git `EnsureVersion`, which checks out the tag via
   `Branch: refs/tags/0.7.0+3.3.1`, leaving HEAD=eb96de9), then CLI `git checkout -f head`, then
   `-C` deploy → `taking chaos version: 7ae7b0f7`. NO drift.
3. mirror-faithful: `git clone <recipe-maintainers/discourse>` + `git checkout 7ae7b0f` +
   `git fetch <coop-cloud/discourse> refs/tags/*:refs/tags/*` (exact `fetch_recipe`), then base
   deploy → re-checkout head → `-C` deploy → `taking chaos version: 7ae7b0f7`. NO drift.

Conclusion: the isolated git/abra version-resolution path is **correct** in the current host
state. The drift is not in that path.

### Timeline / differentiator

- abra binary: constant since 2026-06-01 (system-4). Not abra.
- Same ref 7ae7b0f: run 184 (06-05 02:17, **solo**) was L4 upgrade-PASS. The drift runs
  (m2b 06-10 20:54, m2p 06-11 00:44, ab 06-11 00:48) are **clustered** (m2p & ab 4 min apart →
  overlapping for a multi-tier discourse run that takes ≫4 min).
- `app_domain` hashes (recipe|pr|ref) ⇒ all three drift runs, same ref, **collide on one swarm
  stack**. The upgrade `chaos_redeploy` does NOT take `deploy_app`'s app-domain flock, so two
  concurrent runs can interleave deploys on the shared stack and the `<stack>_app` service label
  read by `deployed_identity` reflects whichever deploy last wrote it.

**Leading hypothesis:** the "harness-neutral env drift" is actually a **concurrency artifact** of
the rcust-phase M2 A/B discourse experiments running near-simultaneously on the shared stack — not
an abra/recipe/environment regression. Run 184 solo = green; clustered 06-11 = drift; isolated
re-reproduction now = green. Testing with one clean isolated real run (install,upgrade) before
committing to this attribution — direct evidence required by the plan, not inference alone.

Open: must still explain *exactly* how a concurrent peer produces an `eb96de9+U` (dirty CHAOS)
label on the shared stack — a base deploy is pinned/non-chaos (no chaos label), so the +U chaos
label must come from some chaos deploy with HEAD=eb96de9. The isolated real run + (if needed) a
deliberate 2-run concurrency repro will nail the mechanism. Will NOT claim M1 on inference.

## 2026-06-11 (cont.) — REAL runs: concurrency REFUTED, true root cause = swarm rollback

Three real install+upgrade runs of discourse @7ae7b0f (CCCI_RUN_ID=dstamp-repro{1,2,3}), each
SOLO/isolated (no concurrent discourse run):

- **base deploy is CHAOS** (not pinned): `compose.ccci.yml` overlay is present ⇒
  `deploy_app` takes the `has_ccci_overlay` auto-chaos branch (`lifecycle.py:291-298`). So the
  base stamps `chaos-version = eb96de9+U` on the shared stack. (My earlier bail-at-secrets repros
  used a non-chaos/manual base → that's why they didn't expose it.)
- **repro1 (unpatched): upgrade FAIL** — `chaos commit 'eb96de94+U', not 7ae7b0f76efb`. The
  per-run tree reflog + snapshot prove HEAD = **7ae7b0f** at the upgrade deploy (last checkout
  16:39:03, no checkout-back), yet the deployed `.Spec` chaos label was eb96de9+U.
- **repro2 (instrumented: abra deploy `--debug` + a HEAD-print subprocess before the redeploy):
  upgrade PASS** — `[DSTAMP] taking chaos version: 7ae7b0f7+U`, HEAD=7ae7b0f,
  `deployed_identity = {version 0.9.0+3.5.0, image bitnamilegacy/discourse:3.3.1, chaos 7ae7b0f7+U}`.

So the SAME solo config is **intermittent** (184✓ 06-05, m2b/m2p/ab✗ 06-10/11, repro1✗, repro2✓);
flipping with a tiny timing change ⇒ **NOT a concurrency artifact, NOT abra version-resolution**
(abra computes 7ae7b0f7 correctly — proven by repro2's debug line AND all 3 bail-at-secrets repros).

**TRUE ROOT CAUSE (recipe deploy policy + heavy/flaky new task):** discourse `compose.yml` app
service sets `deploy.update_config: { failure_action: rollback, order: start-first }` with a
`healthcheck.start_period: 20m`. The upgrade chaos deploy applies the head spec
(`chaos-version=7ae7b0f7+U`) start-first (old + new task co-resident = ~2× memory for a
precompile-heavy Rails app). When the NEW task intermittently fails swarm's update monitor,
swarm executes **failure_action: rollback ⇒ reverts the app service to its PreviousSpec (the
base: `chaos-version=eb96de9+U`)**. Under `start-first` the OLD task keeps serving, so the
harness `wait_healthy` still passes — but `deployed_identity` reads `.Spec.Labels` of the
ROLLED-BACK spec and sees the base commit. The "since ~06-10 on every run" pattern = the
rcust-phase runs happened under heavier host load (warm keycloak etc.), so the new task reliably
failed the monitor ⇒ rollback every time; the solo 06-05 run (184) didn't roll back. Harness- and
abra-neutral, exactly as observed.

repro3 (UpdateStatus + PreviousSpec capture, NO --debug to preserve failing timing) running to
get the swarm rollback in the act (expect `UpdateStatus.State = rollback_*`, `PreviousSpec.Labels`
chaos=eb96de9+U == the read `.Spec.Labels` after revert). That is the direct-evidence smoking gun.

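The masking effect described in the root-cause paragraph can be illustrated with a toy model. Everything here is a stand-in: the `Service` class is hypothetical, and `wait_healthy` / `deployed_identity` only borrow the harness function names to show which field each one effectively looks at. It models only the `.Spec` label and says nothing about what swarm does to `PreviousSpec` after a rollback.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Service:
    """Toy stand-in for a swarm service under start-first + failure_action: rollback."""
    spec_label: str                        # chaos-version in .Spec.Labels
    previous_label: Optional[str] = None   # chaos-version in .PreviousSpec.Labels
    old_task_serving: bool = True          # start-first: old task runs until the new one is OK
    update_state: str = "completed"

    def update(self, new_label: str) -> None:
        """Apply a new spec; the current spec becomes the previous one."""
        self.previous_label = self.spec_label
        self.spec_label = new_label
        self.update_state = "updating"

    def monitor(self, new_task_ok: bool) -> None:
        """Swarm's verdict on the new task."""
        if new_task_ok:
            self.update_state = "completed"
            return
        if self.previous_label is not None:
            # Rollback: the spec reverts, and the old task was never stopped.
            self.spec_label = self.previous_label
        self.update_state = "rollback_completed"


def wait_healthy(service: Service) -> bool:
    """Toy health check: passes as long as some task is serving."""
    return service.old_task_serving


def deployed_identity(service: Service) -> str:
    """Toy identity read: whatever label the current spec carries."""
    return service.spec_label


svc = Service(spec_label="eb96de94+U")   # base deploy
svc.update("7ae7b0f7+U")                 # upgrade to PR head
svc.monitor(new_task_ok=False)           # new task fails the update monitor
print(wait_healthy(svc), deployed_identity(svc), svc.update_state)
```

In this model the health check stays green while the identity read returns the base label, which is the combination HC1 reported as a wrong stamp.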
### DIRECT EVIDENCE — captured (repro4, solo/isolated, upgrade FAIL)

repro3 base deploy FATA'd (abra convergence monitor gave up — discourse is genuinely flaky/heavy
under load, which is the very premise). repro4 reached the upgrade and the post-`chaos_redeploy`
`docker service inspect <stack>_app` capture is the smoking gun:

- `UpdateStatus = {"State":"updating","Message":"update in progress"}`
- `.Spec.Labels` chaos-version = **7ae7b0f7+U**, version = 0.9.0+3.5.0 (HEAD spec applied OK)
- `.PreviousSpec.Labels` chaos-version = **eb96de94+U**, version = 0.7.0+3.3.1 (the base)
- `deployed_identity` (same instant) = chaos **7ae7b0f7+U** (reads Spec, correct)

Then `wait_healthy` ran (old task serving under start-first → passes); the new task failed swarm's
monitor → `failure_action: rollback` reverted `.Spec` → `.PreviousSpec` (eb96de94+U); the
assertion-phase read saw eb96de94+U → HC1 FAIL. The ONLY operation that turns `.Spec.Labels` from
7ae7b0f7+U into the exact `.PreviousSpec` eb96de94+U is a swarm rollback. abra+harness exonerated;
the head was really deployed and then swarm-reverted. Attribution complete, by direct evidence.

Note the app image is `bitnamilegacy/discourse:3.3.1` for BOTH base and head spec (head only bumps
the version label + db image), so the new task isn't failing on a missing image — it's the
start-first 2× co-residency of the precompile/Rails-heavy app under host memory pressure (a real
new-task failure, intermittent), which trips `failure_action: rollback`.

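A capture like the one above can be reduced to the three fields that matter with a small parser. This is a sketch, not the harness code: the helper names are hypothetical, and the full label key is an assumption, so the lookup matches any key ending in `chaos-version`. The sample payload uses only the values quoted in the repro4 capture, with a placeholder key prefix.

```python
import json
from typing import Optional


def chaos_label(labels: Optional[dict]) -> Optional[str]:
    """Return the value of the first label whose key ends in 'chaos-version'."""
    for key, value in (labels or {}).items():
        if key.endswith("chaos-version"):
            return value
    return None


def summarize(inspect_output: str) -> dict:
    """Reduce `docker service inspect` JSON (an array of services) to three fields."""
    service = json.loads(inspect_output)[0]
    return {
        "state": (service.get("UpdateStatus") or {}).get("State"),
        "spec": chaos_label((service.get("Spec") or {}).get("Labels")),
        "previous": chaos_label((service.get("PreviousSpec") or {}).get("Labels")),
    }


def reverted_to_previous(before: dict, after: dict) -> bool:
    """True when the later Spec label equals the earlier PreviousSpec label."""
    return (
        before["previous"] is not None
        and after["spec"] == before["previous"]
        and after["spec"] != before["spec"]
    )


# Values from the repro4 capture; "example." is a placeholder key prefix.
CAPTURE = json.dumps([{
    "UpdateStatus": {"State": "updating", "Message": "update in progress"},
    "Spec": {"Labels": {"example.chaos-version": "7ae7b0f7+U"}},
    "PreviousSpec": {"Labels": {"example.chaos-version": "eb96de94+U"}},
}])

print(summarize(CAPTURE))
```

Comparing a summary taken right after the redeploy with one taken at assertion time is enough to tell a rollback apart from a wrong stamp.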
### Fix plan (HC1 teeth preserved)

- Reliability: `tests/discourse/compose.ccci.yml` overlay → app
  `deploy.update_config.order: stop-first` (old stops before new starts → new boots with full
  memory → genuinely healthy → no spurious rollback). Upgrade-to-head still really
  deployed+asserted; not a weakening. WHY in header.
  Risk to weigh: stop-first = brief real downtime during the CI upgrade (covered by DEPLOY_TIMEOUT
  3600). Alternative `failure_action: pause` REJECTED — it would let a genuinely-failed new task
  pass HC1 (start-first keeps old serving) = test-weakening.
- Correctness: harness upgrade path asserts the redeploy converged to the head spec (UpdateStatus
  not rollback*/paused / `.Spec` not reverted to `.PreviousSpec`) → honest failure message on a
  real rollback, instead of the misleading "re-checkout failed". General (all rollback-policy
  recipes). HC1 teeth intact: a head that truly can't stay healthy still fails.
- Will validate stop-first actually eliminates the rollback with a full real run before claiming.

## 2026-06-11 (cont.) — fix validated + blast-radius

**Fix implemented** (commit 0cc31a5): (1) `tests/discourse/compose.ccci.yml` app service
`deploy.update_config.order: stop-first`; (2) `lifecycle.assert_upgrade_converged()` + call in
`generic.perform_upgrade` right after `chaos_redeploy` (before wait_healthy) — waits for swarm's
app-service rolling update to reach a TERMINAL state and FAILs honestly on rollback*/paused.
Unit tests: 253 passed (no regression).

**fix1 validation** (run `dstamp-fix1`, fresh checkout @0cc31a5, install+upgrade, solo): UPGRADE
**PASS** — `upgrade-converged: …UpdateStatus=completed`,
`upgrade→PR-head: head_ref=7ae7b0f7 chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`.
The head is deployed, the update converges (no rollback), HC1 reads 7ae7b0f7+U. (Bug was
intermittent — running more to show reliability, since repro2 passed unpatched.)

**Blast-radius sweep** — recipes with `failure_action: rollback` + `order: start-first`:
`discourse, drone, keycloak, n8n, traefik`. Evidence check of the upgrade tier across many runs
(incl. the rcust-era m2r-* runs under the same heavy load):

- keycloak: runs 155/186/187/m2r/shot-proof → upgrade PASS L4 (HC1 pass ⇒ chaos==head). NOT affected.
- n8n: runs 47/54/61/162/197/m2r/shot-proof → upgrade PASS L4. NOT affected.
- drone, traefik: cc-ci INFRA (warm-reconciled), NOT enrolled in the recipe-CI upgrade tier.

⇒ **Only discourse actually exhibits the drift** — its app is uniquely heavy (Rails asset
precompile, 2.4GB image) so the start-first 2× co-residency OOMs the new task; the lighter
keycloak/n8n new tasks survive swarm's monitor, so no rollback. The general harness guard
(`assert_upgrade_converged`) now protects ALL rollback-policy recipes from a silent future
rollback (honest failure), and discourse additionally gets stop-first to converge reliably.

### Hardening (commit e9c26c7) + fix2 validation

Adversary independently confirmed the root cause + assessed the fix CORRECT (REVIEW-dstamp probe),
flagging one non-blocking race: assert_upgrade_converged's first poll could read a STALE terminal
`completed` (from the install/base deploy) before swarm schedules the new roll → return OK
prematurely → miss a later rollback. Hardened with a two-phase wait: phase 1 confirms the NEW
update is scheduled (`UpdateStatus.StartedAt` advances past the pre-redeploy value, captured via
`update_status_started`, or state is in-flight `updating`/`rollback_started`), with a 30s grace for
a genuine no-op redeploy; phase 2 then waits for the terminal verdict. fix2 (hardened, fresh
checkout @e9c26c7, install+upgrade): UPGRADE **PASS** — `upgrade-converged: …UpdateStatus=completed`,
`chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0`. Two consecutive green fixed runs
(fix1+fix2) vs intermittent unpatched failures (repro1✗ repro4✗ repro2✓). Unit tests 253 pass.

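The two-phase wait can be sketched as follows. This is an illustration of the idea under stated assumptions, not the code in `lifecycle.assert_upgrade_converged`: the function name, the `poll` callback returning `(State, StartedAt)`, and the injectable `clock`/`sleep` are all hypothetical, chosen so the logic can be exercised without a swarm. The 30s grace and the state names come from the paragraph above.

```python
import time
from typing import Callable, Optional, Tuple

Status = Tuple[Optional[str], Optional[str]]   # (UpdateStatus.State, UpdateStatus.StartedAt)
IN_FLIGHT = {"updating", "rollback_started"}


def wait_upgrade_converged(
    poll: Callable[[], Status],
    started_before: Optional[str],
    grace: float = 30.0,
    timeout: float = 3600.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return 'completed' on convergence; raise on rollback*/paused or timeout."""
    start = clock()

    # Phase 1: make sure we are looking at the NEW update, not a stale
    # terminal state left behind by the base deploy.
    while clock() - start < grace:
        state, started = poll()
        new_update = started is not None and started != started_before
        if state in IN_FLIGHT or new_update:
            break
        sleep(interval)
    # If the grace period expires, treat the redeploy as a genuine no-op
    # and fall through to the terminal check.

    # Phase 2: wait for the terminal verdict.
    while clock() - start < timeout:
        state, _ = poll()
        if state is not None and (state.startswith("rollback") or state == "paused"):
            raise RuntimeError(f"upgrade did not converge to the head spec: {state}")
        if state in (None, "completed"):
            return "completed"
        sleep(interval)
    raise TimeoutError("swarm update did not reach a terminal state")
```

Capturing `StartedAt` before the redeploy is what closes the race: a stale `completed` carries the old timestamp, so phase 1 keeps polling until the new roll shows up or the grace period runs out.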
### M1 claimed

Attribution + minimal repro + 06-05→06-10 change + fix + blast-radius all complete and
Adversary-pre-confirmed → claiming M1 (verification recipe in STATUS-dstamp). Next: M2 — full
all-stages discourse green at true level via the drone `!testme` path (the recipe-CI pipeline runs
`cc-ci-run runner/run_recipe_ci.py` from the drone-cloned cc-ci workspace, so e9c26c7 is live for
!testme — no nixos-rebuild needed for the harness), other recipes re-proven (none affected), HC1
teeth shown (wrong stamp still FAILs), DEFERRED closed.

Fix direction (HC1 must keep its teeth — do NOT relax the commit match): the upgrade chaos redeploy
must assert against the *intended* applied spec, not a silently rolled-back one — i.e. the harness
must DETECT a swarm rollback (UpdateStatus.State rollback*) and treat it as an upgrade FAILURE with
a clear message (the deploy did not converge to the head spec), AND/OR make the upgrade redeploy not
subject to silent rollback masking (e.g. assert UpdateStatus completed before reading identity).
The recipe's rollback policy is legitimate for prod; the harness bug is that a rollback is invisible
to HC1 and masquerades as "stamped the wrong commit". Will finalise the fix after repro3 confirms.