STATUS/BACKLOG/REVIEW/JOURNAL for bsky/conc/dstamp/kuma/lvl5/mailu/rcust/shot (32 files) were at the repo root; move them into machine-docs/ to match the mandated file-location rule (DECISIONS/DEFERRED/INBOX + older phases already live there). AGENTS.md gains an explicit File-location rule. No content change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
14 KiB
JOURNAL — phase dstamp (Builder, reasoning/private)
2026-06-11 — Bootstrap + investigation
Read the phase plan, plan.md §6.1/§7/§9, the Adversary's REVIEW-dstamp prep notes, and the
stamp-relevant harness code (abra.py, lifecycle.py:deployed_identity/recipe_checkout_ref/ chaos_redeploy/prepull_images, generic.py:perform_upgrade/assert_upgraded, run_recipe_ci
upgrade op + fetch_recipe).
Mechanism (from abra source @06a57de = the pinned binary)
chaos-version label is set in cli/app/deploy.go: for a -C deploy, getDeployVersion (l.365)
returns Recipe.ChaosVersion() (l.367-373) and SetChaosVersionLabel(compose, stack, toDeployVersion)
(l.168). ChaosVersion (pkg/recipe/git.go:300) = formatter.SmallSHA(Head().String()) + +U
if dirty. Head (l.483) = go-git repo.Head(). Crucially, app.Recipe.Ensure(ctx) (deploy.go:86)
calls into git.go:38 which early-returns on ctx.Chaos (l.41-43) — so a chaos deploy does NOT
re-checkout the .env version. GetEnsureContext (cli/internal/ensure.go) wires EnsureContext{Chaos, Offline, IgnoreEnvVersion=DeployLatest} from the CLI flags. So -C ⇒ Ensure no-op ⇒ chaos version
= whatever git HEAD the harness left checked out.
The contradiction that drove the dig
The m2p failure message is chaos commit 'eb96de94+U', not the intended PR-head '7ae7b0f76efb'.
eb96de9 = tag 0.7.0+3.3.1 (the upgrade base); 7ae7b0f = PR head (9 commits past that tag,
and there is NO 0.8/0.9 tag despite HEAD's "upgrade to 0.9.0+3.5.0" message). The harness
perform_upgrade does recipe_checkout_ref(head_ref=7ae7b0f) then chaos_redeploy, with only
env_set + prepull_images (pure docker compose, no git) in between — and the run's recipe
snapshot HEAD = 7ae7b0f. So at deploy time HEAD should be 7ae7b0f ⇒ stamp 7ae7b0f. Yet it
stamped eb96de9. abra's source says chaos = Head(); so for eb96de9 to be stamped, HEAD had to be
eb96de9 at the chaos deploy — which the isolated flow never produces.
Reproductions (all on cc-ci, scratch ABRA_DIR, deploys bail at secret not generated
which is deploy.go:140, AFTER the chaos version is computed+logged at deploy.go:372)
- cp -a canonical recipe, checkout head→base(tag)→head,
abra app deploy -C→taking chaos version: 7ae7b0f7. HEAD stays 7ae7b0f. NO drift. - real non-chaos base deploy (exercises go-git
EnsureVersionwhich checks out tag viaBranch: refs/tags/0.7.0+3.3.1, leaving HEAD=eb96de9), then CLIgit checkout -f head, then-Cdeploy →taking chaos version: 7ae7b0f7. NO drift. - mirror-faithful:
git clone <recipe-maintainers/discourse>+git checkout 7ae7b0f+git fetch <coop-cloud/discourse> refs/tags/*:refs/tags/*(exactfetch_recipe), then base deploy → re-checkout head →-Cdeploy →taking chaos version: 7ae7b0f7. NO drift.
Conclusion: the isolated git/abra version-resolution path is correct in the current host state. The drift is not in that path.
Timeline / differentiator
- abra binary: constant since 2026-06-01 (system-4). Not abra.
- Same ref 7ae7b0f: run 184 (06-05 02:17, solo) was L4 upgrade-PASS. The drift runs (m2b 06-10 20:54, m2p 06-11 00:44, ab 06-11 00:48) are clustered (m2p & ab 4 min apart → overlapping for a multi-tier discourse run that takes ≫4 min).
app_domainhashes (recipe|pr|ref) ⇒ all three drift runs, same ref, collide on one swarm stack. The upgradechaos_redeploydoes NOT takedeploy_app's app-domain flock, so two concurrent runs can interleave deploys on the shared stack and the<stack>_appservice label read bydeployed_identityreflects whichever deploy last wrote it.
Leading hypothesis: the "harness-neutral env drift" is actually a concurrency artifact of the rcust-phase M2 A/B discourse experiments running near-simultaneously on the shared stack — not an abra/recipe/environment regression. Run 184 solo = green; clustered 06-11 = drift; isolated re-reproduction now = green. Testing with one clean isolated real run (install,upgrade) before committing to this attribution — direct evidence required by the plan, not inference alone.
Open: must still explain exactly how a concurrent peer produces an eb96de9+U (dirty CHAOS)
label on the shared stack — a base deploy is pinned/non-chaos (no chaos label), so the +U chaos
label must come from some chaos deploy with HEAD=eb96de9. The isolated real run + (if needed) a
deliberate 2-run concurrency repro will nail the mechanism. Will NOT claim M1 on inference.
2026-06-11 (cont.) — REAL runs: concurrency REFUTED, true root cause = swarm rollback
Three real install+upgrade runs of discourse @7ae7b0f (CCCI_RUN_ID=dstamp-repro{1,2,3}), each SOLO/isolated (no concurrent discourse run):
- base deploy is CHAOS (not pinned):
compose.ccci.ymloverlay is present ⇒deploy_apptakes thehas_ccci_overlayauto-chaos branch (lifecycle.py:291-298). So the base stampschaos-version = eb96de9+Uon the shared stack. (My earlier bail-at-secrets repros used a non-chaos/manual base → that's why they didn't expose it.) - repro1 (unpatched): upgrade FAIL —
chaos commit 'eb96de94+U', not 7ae7b0f76efb. The per-run tree reflog + snapshot prove HEAD = 7ae7b0f at the upgrade deploy (last checkout 16:39:03, no checkout-back), yet the deployed.Specchaos label was eb96de9+U. - repro2 (instrumented: abra deploy
--debug+ a HEAD-print subprocess before the redeploy): upgrade PASS —[DSTAMP] taking chaos version: 7ae7b0f7+U, HEAD=7ae7b0f,deployed_identity = {version 0.9.0+3.5.0, image bitnamilegacy/discourse:3.3.1, chaos 7ae7b0f7+U}.
So the SAME solo config is intermittent (184✓ 06-05, m2b/m2p/ab✗ 06-10/11, repro1✗, repro2✓); flipping with a tiny timing change ⇒ NOT a concurrency artifact, NOT abra version-resolution (abra computes 7ae7b0f7 correctly — proven by repro2's debug line AND all 3 bail-at-secrets repros).
TRUE ROOT CAUSE (recipe deploy policy + heavy/flaky new task): discourse compose.yml app
service sets deploy.update_config: { failure_action: rollback, order: start-first } with a
healthcheck.start_period: 20m. The upgrade chaos deploy applies the head spec
(chaos-version=7ae7b0f7+U) start-first (old + new task co-resident = ~2× memory for a
precompile-heavy Rails app). When the NEW task intermittently fails swarm's update monitor,
swarm executes failure_action: rollback ⇒ reverts the app service to its PreviousSpec (the
base: chaos-version=eb96de9+U). Under start-first the OLD task keeps serving, so the
harness wait_healthy still passes — but deployed_identity reads .Spec.Labels of the
ROLLED-BACK spec and sees the base commit. The "since ~06-10 on every run" pattern = the
rcust-phase runs happened under heavier host load (warm keycloak etc.), so the new task reliably
failed the monitor ⇒ rollback every time; the solo 06-05 run (184) didn't roll back. Harness- and
abra-neutral, exactly as observed.
repro3 (UpdateStatus + PreviousSpec capture, NO --debug to preserve failing timing) running to
get the swarm rollback in the act (expect UpdateStatus.State = rollback_*, PreviousSpec.Labels
chaos=eb96de9+U == the read .Spec.Labels after revert). That is the direct-evidence smoking gun.
DIRECT EVIDENCE — captured (repro4, solo/isolated, upgrade FAIL)
repro3 base deploy FATA'd (abra convergence monitor gave up — discourse is genuinely flaky/heavy
under load, which is the very premise). repro4 reached the upgrade and the post-chaos_redeploy
docker service inspect <stack>_app capture is the smoking gun:
UpdateStatus = {"State":"updating","Message":"update in progress"}.Spec.Labelschaos-version = 7ae7b0f7+U, version = 0.9.0+3.5.0 (HEAD spec applied OK).PreviousSpec.Labelschaos-version = eb96de94+U, version = 0.7.0+3.3.1 (the base)deployed_identity(same instant) = chaos 7ae7b0f7+U (reads Spec, correct) Thenwait_healthyran (old task serving under start-first → passes); the new task failed swarm's monitor →failure_action: rollbackreverted.Spec→.PreviousSpec(eb96de94+U); the assertion-phase read saw eb96de94+U → HC1 FAIL. The ONLY operation that turns.Spec.Labelsfrom 7ae7b0f7+U into the exact.PreviousSpeceb96de94+U is a swarm rollback. abra+harness exonerated; the head was really deployed and then swarm-reverted. Attribution complete, by direct evidence.
Note the app image is bitnamilegacy/discourse:3.3.1 for BOTH base and head spec (head only bumps
the version label + db image), so the new task isn't failing on a missing image — it's the
start-first 2× co-residency of the precompile/Rails-heavy app under host memory pressure (a real
new-task failure, intermittent), which trips failure_action: rollback.
Fix plan (HC1 teeth preserved)
- Reliability:
tests/discourse/compose.ccci.ymloverlay → appdeploy.update_config.order: stop-first(old stops before new starts → new boots with full memory → genuinely healthy → no spurious rollback). Upgrade-to-head still really deployed+asserted; not a weakening. WHY in header. Risk to weigh: stop-first = brief real downtime during the CI upgrade (covered by DEPLOY_TIMEOUT 3600). Alternativefailure_action: pauseREJECTED — it would let a genuinely-failed new task pass HC1 (start-first keeps old serving) = test-weakening. - Correctness: harness upgrade path asserts the redeploy converged to the head spec (UpdateStatus
not rollback*/paused /
.Specnot reverted to.PreviousSpec) → honest failure message on a real rollback, instead of the misleading "re-checkout failed". General (all rollback-policy recipes). HC1 teeth intact: a head that truly can't stay healthy still fails. - Will validate stop-first actually eliminates the rollback with a full real run before claiming.
2026-06-11 (cont.) — fix validated + blast-radius
Fix implemented (commit 0cc31a5): (1) tests/discourse/compose.ccci.yml app service
deploy.update_config.order: stop-first; (2) lifecycle.assert_upgrade_converged() + call in
generic.perform_upgrade right after chaos_redeploy (before wait_healthy) — waits for swarm's
app-service rolling update to reach a TERMINAL state and FAILs honestly on rollback*/paused.
Unit tests: 253 passed (no regression).
fix1 validation (run dstamp-fix1, fresh checkout @0cc31a5, install+upgrade, solo): UPGRADE
PASS — upgrade-converged: …UpdateStatus=completed, upgrade→PR-head: head_ref=7ae7b0f7 chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0. The head is deployed, the update
converges (no rollback), HC1 reads 7ae7b0f7+U. (Bug was intermittent — running more to show
reliability, since repro2 passed unpatched.)
Blast-radius sweep — recipes with failure_action: rollback + order: start-first:
discourse, drone, keycloak, n8n, traefik. Evidence check of the upgrade tier across many runs
(incl. the rcust-era m2r-* runs under the same heavy load):
- keycloak: runs 155/186/187/m2r/shot-proof → upgrade PASS L4 (HC1 pass ⇒ chaos==head). NOT affected.
- n8n: runs 47/54/61/162/197/m2r/shot-proof → upgrade PASS L4. NOT affected.
- drone, traefik: cc-ci INFRA (warm-reconciled), NOT enrolled in the recipe-CI upgrade tier.
⇒ Only discourse actually exhibits the drift — its app is uniquely heavy (Rails asset
precompile, 2.4GB image) so the start-first 2× co-residency OOMs the new task; the lighter
keycloak/n8n new tasks survive swarm's monitor, so no rollback. The general harness guard
(
assert_upgrade_converged) now protects ALL rollback-policy recipes from a silent future rollback (honest failure), and discourse additionally gets stop-first to converge reliably.
Hardening (commit e9c26c7) + fix2 validation
Adversary independently confirmed the root cause + assessed the fix CORRECT (REVIEW-dstamp probe),
flagging one non-blocking race: assert_upgrade_converged's first poll could read a STALE terminal
completed (from the install/base deploy) before swarm schedules the new roll → return OK
prematurely → miss a later rollback. Hardened with a two-phase wait: phase 1 confirms the NEW
update is scheduled (UpdateStatus.StartedAt advances past the pre-redeploy value, captured via
update_status_started, or state is in-flight updating/rollback_started), with a 30s grace for
a genuine no-op redeploy; phase 2 then waits for the terminal verdict. fix2 (hardened, fresh
checkout @e9c26c7, install+upgrade): UPGRADE PASS — upgrade-converged: …UpdateStatus=completed,
chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0. Two consecutive green fixed runs
(fix1+fix2) vs intermittent unpatched failures (repro1✗ repro4✗ repro2✓). Unit tests 253 pass.
M1 claimed
Attribution + minimal repro + 06-05→06-10 change + fix + blast-radius all complete and
Adversary-pre-confirmed → claiming M1 (verification recipe in STATUS-dstamp). Next: M2 — full
all-stages discourse green at true level via the drone !testme path (the recipe-CI pipeline runs
cc-ci-run runner/run_recipe_ci.py from the drone-cloned cc-ci workspace, so e9c26c7 is live for
!testme — no nixos-rebuild needed for the harness), other recipes re-proven (none affected), HC1
teeth shown (wrong stamp still FAILs), DEFERRED closed.
Fix direction (HC1 must keep its teeth — do NOT relax the commit match): the upgrade chaos redeploy must assert against the intended applied spec, not a silently rolled-back one — i.e. the harness must DETECT a swarm rollback (UpdateStatus.State rollback*) and treat it as an upgrade FAILURE with a clear message (the deploy did not converge to the head spec), AND/OR make the upgrade redeploy not subject to silent rollback masking (e.g. assert UpdateStatus completed before reading identity). The recipe's rollback policy is legitimate for prod; the harness bug is that a rollback is invisible to HC1 and masquerades as "stamped the wrong commit". Will finalise the fix after repro3 confirms.