STATUS — phase dstamp (discourse abra-stamp drift)
Builder. SSOT: cc-ci-plan/plan-phase-dstamp-discourse-drift.md. Gates M1, M2.
DONE
M1 PASS (REVIEW-dstamp fb411b2 @17:36Z) + M2 PASS (71358da @17:58Z), both fresh, no VETO.
All Definition-of-Done items Adversary-verified.
Operator summary. The discourse upgrade-tier "abra stamp drift" (upgrade-HC1 stamping the
prev-base tag commit eb96de94+U instead of the PR head 7ae7b0f7+U, since ~06-10) was NOT an
abra or harness git bug — abra stamps the head correctly. Root cause: discourse's
compose.yml app service uses deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }. On the upgrade chaos redeploy, start-first co-resides the OLD+NEW
precompile/Rails-heavy task (~2× memory); under host memory pressure the NEW task fails swarm's 5s
update monitor → swarm rolls back to the base spec, reverting the chaos-version label
(head→base). start-first kept the old task serving, so wait_healthy passed and HC1 read the
reverted base commit — misreported as "re-checkout failed". Intermittent (memory-pressure
dependent): solo run 184 on 06-05 passed; the heavier 06-10/06-11 runs rolled back every time.
Direct evidence: dstamp-repro4 captured .Spec chaos-version=7ae7b0f7+U (head applied) →
.PreviousSpec=eb96de94+U (base) with UpdateStatus=updating, then the post-rollback read = base.
Fix (commits 0cc31a5 + e9c26c7, HC1 unweakened): (1) tests/discourse/compose.ccci.yml
app update_config.order: stop-first — the new task boots with full host memory, no OOM, no
spurious rollback (failure_action: rollback left intact for genuine failures); (2) a general
harness guard lifecycle.assert_upgrade_converged (2-phase StartedAt protocol) that detects a
swarm rollback/pause after the upgrade redeploy and fails the upgrade HONESTLY — the HC1
commit-match assertion is unchanged.
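A minimal sketch of the kind of check the guard in (2) performs, under stated assumptions. It is not the code in lifecycle.py: the name check_converged, the exception class, and the exact two-phase rule (compare UpdateStatus.StartedAt sampled before the redeploy with the value seen afterwards, then judge UpdateStatus.State) are illustrative guesses at what a "2-phase StartedAt protocol" could look like. Only the state names are Docker's own.

```python
from typing import Optional

# UpdateStatus.State values that mean swarm gave up on (or suspended) the update.
ROLLED_BACK_OR_PAUSED = {
    "paused",
    "rollback_started",
    "rollback_paused",
    "rollback_completed",
}


class UpgradeNotConverged(AssertionError):
    """The redeploy did not settle on the newly applied spec (hypothetical name)."""


def check_converged(started_before: Optional[str], update_status: Optional[dict]) -> bool:
    """Judge one post-redeploy UpdateStatus sample (phase 2 of the sketch).

    started_before: UpdateStatus.StartedAt sampled BEFORE the redeploy (phase 1);
                    None if the service had never been updated.
    update_status:  the service's UpdateStatus dict sampled AFTER the redeploy.

    Returns True once the new update completed, False while the caller should
    keep polling, and raises if swarm rolled the update back or paused it.
    """
    status = update_status or {}
    started_now = status.get("StartedAt")
    if started_now is None or started_now == started_before:
        # Still looking at the previous update (or none at all): not our redeploy yet.
        return False
    state = status.get("State", "")
    if state in ROLLED_BACK_OR_PAUSED:
        raise UpgradeNotConverged(
            f"upgrade did not converge: UpdateStatus.State={state} "
            "(swarm rolled back or paused the update)"
        )
    return state == "completed"
```

A caller would poll `docker service inspect --format '{{json .UpdateStatus}}' <stack>_app` with a timeout and feed each sample to check_converged. Comparing StartedAt first is what keeps a stale "completed" from the previous deploy from being mistaken for convergence of the new one.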
Proven in real CI: drone !testme build #450 (discourse @7ae7b0f) = LEVEL 5 (was L1
under the drift), all tiers green, clean teardown, no secret leak; PR recipe-maintainers/discourse#2
shows ✅ passed. Blast-radius: only discourse was affected (keycloak/n8n share the policy but
upgrade-PASS L4; drone/traefik are infra) — the new harness guard now protects all rollback-policy
recipes. DEFERRED entry closed with pointers. No operator action required.
Gate: M1 — PASS (REVIEW-dstamp fb411b2 @2026-06-11T17:36Z). Now on M2.
Gate: M2 — CLAIMED, awaiting Adversary
WHAT (M2 = Proven in real CI): discourse full lifecycle GREEN at its true level via the drone
!testme path, upgrade-HC1 stamping the CORRECT head value; no other affected recipe; HC1
unweakened (a wrong stamp still FAILs); DEFERRED closed.
- Real-CI proof: drone !testme build #450: discourse @7ae7b0f76efb (PR#2), STAGES full
  (install,upgrade,backup,restore,custom), drone workspace at cc-ci main 2da1f01 (fix present) →
  LEVEL 5 (max), ALL tiers PASS, clean_teardown=true, no_secret_leak=true. Upgrade tier
  test_upgrade_reconverges PASSED (HC1's assert_upgraded only passes when the deployed
  chaos-version commit == head_ref 7ae7b0f, after assert_upgrade_converged confirmed
  UpdateStatus=completed). Was L1 (drift) before the fix → L5 now.
- Triggered via the !testme path: comment 14346 (!testme) on recipe-maintainers/discourse#2 →
  bridge ack 14347, updated to "🌻 cc-ci — discourse @ 7ae7b0f7 ✅ passed" with the L5 result
  card/badge linking drone build 450.
HOW to verify (Adversary, cold):
- grep -oE '"level": [0-9]+|"(install|upgrade|backup|restore|custom)": "[a-z]+"|"clean_teardown": (true|false)|"no_secret_leak": (true|false)' /var/lib/cc-ci-runs/450/results.json
  → level 5, all pass, both flags true.
- /var/lib/cc-ci-runs/450/junit/upgrade__generic__test_upgrade.xml → test_upgrade_reconverges
  testcase with NO <failure> child (passed).
- PR comment 14347 on recipe-maintainers/discourse#2 = ✅ passed, run 450.
- Fresh independent re-trigger (recommended): post !testme on discourse#2 → new drone build on
  cc-ci main → expect L5 again (reliability: manual fix1+fix2 + build 450 = 3 consecutive green
  with the fix vs intermittent unpatched failures).
- HC1 teeth (negative test, Adversary leads): synthesize a wrong stamp and show RED. Two live
  teeth: (a) the unchanged commit-match generic.py:174-175: a deployed chaos commit ≠ head_ref
  still FAILs (e.g. force the recheckout to the base, or deploy base-as-head); (b) the new
  assert_upgrade_converged raises on a swarm rollback_completed/paused (the ORIGINAL drift path;
  repro1/repro4 are exactly this RED, now with an honest message). Neither relaxes HC1.
- DEFERRED closed: machine-docs/DEFERRED.md dstamp entry → ✅ RESOLVED with pointers.
EXPECTED: build 450 level 5, all tiers pass, both flags true; PR#2 ✅ passed; DEFERRED resolved.
WHERE: /var/lib/cc-ci-runs/450/; commits 0cc31a5,e9c26c7; PR#2 comments 14346/14347;
machine-docs/DEFERRED.md. No other recipe affected (blast-radius: keycloak/n8n upgrade-PASS L4
across runs incl. rcust era; drone/traefik infra). Fresh Adversary M2 PASS → ## DONE.
(M1 — verified PASS; detail retained below)
WHAT (M1 = Attribution): root cause attributed by direct evidence; minimal reproducible demonstration; 06-05→06-10 change identified; fix implemented (recipe overlay + harness, HC1 unweakened); blast-radius sweep complete.
Root cause: discourse compose.yml app service sets deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }. On the upgrade chaos redeploy, start-first co-resides
OLD+NEW (~2× memory) for the precompile/Rails-heavy app; under host memory pressure the NEW task
fails swarm's 5s update monitor → failure_action: rollback reverts the app service to its
PreviousSpec — INCLUDING the coop-cloud.<stack>.chaos-version label (head→base). Under start-first
the OLD task keeps serving, so wait_healthy passes; deployed_identity then reads the rolled-back
.Spec (base commit eb96de94+U) and HC1 misreports it as "re-checkout failed". abra+harness git
path EXONERATED (abra stamps head 7ae7b0f7+U correctly; per-run HEAD=7ae7b0f at deploy).
HOW to verify (Adversary, cold):
1. Recipe policy: cd ~/.abra/recipes/discourse && git checkout -q 7ae7b0f76efb && grep -nA3 update_config compose.yml
   → failure_action: rollback, order: start-first. EXPECTED present.
2. abra exonerated (minimal repro): scratch ABRA_DIR, base→head checkout,
   abra app deploy <d> -C -o -n --debug bails at secret not generated AFTER logging
   app/deploy.go:372 version: taking chaos version: 7ae7b0f7+U (HEAD-correct). Procedure:
   JOURNAL-dstamp "mirror-faithful repro".
3. Direct rollback evidence: console /var/lib/cc-ci-runs/dstamp-repro4.console.log line
   [DSTAMP] post-redeploy svc inspect … shows immediately post-redeploy
   UpdateStatus.State = "updating", .Spec…chaos-version = 7ae7b0f7+U (head applied),
   .PreviousSpec…chaos-version = eb96de94+U (base); the later HC1 read = eb96de94+U after the
   rollback completes.
4. Fix present: runner/harness/lifecycle.py::assert_upgrade_converged (+ update_status_started)
   and its call in runner/harness/generic.py::perform_upgrade; tests/discourse/compose.ccci.yml
   app deploy.update_config.order: stop-first. Commits 0cc31a5 + e9c26c7.
5. Fix works: run dstamp-fix1 (fresh checkout, STAGES=install,upgrade) → upgrade PASS, console
   upgrade-converged: …UpdateStatus=completed + chaos-version=7ae7b0f7+U version= 0.7.0+3.3.1→0.9.0+3.5.0.
   (Re-runnable: RECIPE=discourse PR=2 REF=7ae7b0f76efb2988c1e54956348dc9eeb7812e0b SRC=recipe-maintainers/discourse STAGES=install,upgrade CCCI_RUN_ID=<id> cc-ci-run runner/run_recipe_ci.py
   from a checkout at e9c26c7.)
6. Blast-radius: recipes with rollback+start-first = discourse, drone, keycloak, n8n, traefik.
   keycloak/n8n upgrade PASS L4 across runs (155/186/187/m2r; 47/54/61/162/197/m2r) ⇒ not affected;
   drone/traefik infra (no recipe-CI upgrade tier). Only discourse affected; the general
   assert_upgrade_converged guard now protects all rollback-policy recipes.
EXPECTED: all of 1–6 hold. WHERE: commits 0cc31a5, e9c26c7; runs
/var/lib/cc-ci-runs/dstamp-{repro1,repro2,repro4,fix1}; recipe ~/.abra/recipes/discourse.
HC1 teeth preserved: the commit-match assertion is unchanged; assert_upgrade_converged only makes
a swarm rollback an HONEST upgrade failure before HC1 runs (a genuinely undeployable head still
fails). M2 will demonstrate a wrong stamp still FAILs + full-lifecycle green via the !testme path.
Root cause detail (evidence)
ROOT CAUSE (attributed by direct evidence, abra+harness EXONERATED)
The upgrade chaos redeploy applies the correct head spec, then swarm rolls it back to the
base spec, reverting the chaos-version label — masked by the recipe's start-first strategy +
the harness's wait_healthy (the OLD task keeps serving, so health passes).
Recipe policy (~/.abra/recipes/discourse/compose.yml, app service): deploy.update_config: { failure_action: rollback, order: start-first }, healthcheck.start_period: 20m. The heavy
discourse app, started start-first (old+new co-resident ≈ 2× memory), intermittently fails
swarm's update monitor on the NEW task → swarm executes failure_action: rollback → app service
reverts to PreviousSpec (the base, chaos-version=eb96de94+U).
Direct evidence (run dstamp-repro4, console /var/lib/cc-ci-runs/dstamp-repro4.console.log,
solo/isolated): immediately after chaos_redeploy, docker service inspect <stack>_app:
- UpdateStatus.State = "updating"
- .Spec.Labels coop-cloud.<stack>.chaos-version = 7ae7b0f7+U (HEAD applied; abra stamped head
  correctly), .version = 0.9.0+3.5.0
- .PreviousSpec.Labels …chaos-version = eb96de94+U (the base), .version = 0.7.0+3.3.1

Then wait_healthy passes (old task serves under start-first); the new task fails the monitor →
rollback → .Spec reverts to eb96de94+U; the later HC1 read sees eb96de94+U → FAIL with the
misleading "re-checkout failed" message. (dstamp-repro2, lighter timing, had NO rollback →
upgrade PASS @7ae7b0f7+U.)
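The evidence above comes down to comparing two labels in one inspect document. A minimal sketch of that comparison, assuming the output of `docker service inspect <stack>_app` (a JSON array with Spec, PreviousSpec and UpdateStatus) is already in hand. The helper names and the stack name "demo" used when trying it out are hypothetical, and the prefix comparison against head_ref is an assumed stand-in for HC1's real commit match.

```python
import json
from typing import Optional


def chaos_labels(inspect_json: str, stack: str) -> dict:
    """Extract the update state plus current/previous chaos-version labels."""
    service = json.loads(inspect_json)[0]  # `docker service inspect` prints a JSON array
    key = f"coop-cloud.{stack}.chaos-version"

    def label(spec_name: str) -> Optional[str]:
        spec = service.get(spec_name) or {}
        return (spec.get("Labels") or {}).get(key)

    return {
        "state": (service.get("UpdateStatus") or {}).get("State"),
        "current": label("Spec"),
        "previous": label("PreviousSpec"),
    }


def stamped_commit(label: Optional[str]) -> Optional[str]:
    """Strip the dirty-worktree suffix: '7ae7b0f7+U' -> '7ae7b0f7'."""
    return label.split("+", 1)[0] if label else None


def reverted_to_base(snapshot: dict, head_ref: str) -> bool:
    """True when the service no longer carries a stamp that matches head_ref."""
    commit = stamped_commit(snapshot["current"])
    if not commit:
        return True
    return not head_ref.startswith(commit)
```

Run against the two repro4 moments, the first snapshot (state "updating", current 7ae7b0f7+U, previous eb96de94+U) is not reverted, while the post-rollback snapshot (current eb96de94+U) is. That difference is what HC1 saw without being told why.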
Intermittency (184✓ solo 06-05; m2b/m2p/ab✗ clustered/heavier-load 06-10/11; repro1✗ repro2✓
repro4✗) = whether the new start-first task survives swarm's monitor under the host's momentary
memory pressure. The "since ~06-10 on every run" = the rcust phase ran under heavier resident load
(warm keycloak etc.) so the new task reliably failed → rollback every time. abra version-resolution
is CORRECT (proven: repro2 debug line taking chaos version: 7ae7b0f7+U + 3 bail-at-secrets repros);
the per-run git checkout is CORRECT (HEAD=7ae7b0f at deploy, reflog-proven). NOT abra, NOT the
per-run tree, NOT concurrency.
Fix (in progress) — HC1 keeps its teeth
- Reliability (restore true level): the discourse tests/discourse/compose.ccci.yml overlay sets
  the app service deploy.update_config.order: stop-first so the new task boots with full memory
  (no 2× co-residency) and genuinely becomes healthy → no spurious rollback. The upgrade-to-head
  is still really deployed + asserted on head; HC1 unchanged. Documented WHY in the overlay header.
- Correctness (honesty, general): the harness upgrade path detects a swarm rollback after the
  chaos redeploy (UpdateStatus.State rollback*/paused, or .Spec reverted to .PreviousSpec) and
  fails the upgrade with the TRUE reason ("head spec applied then swarm-rolled-back: new task
  failed the update monitor") instead of the misleading "re-checkout failed". A genuinely
  undeployable head still FAILS (teeth preserved).
- Blast-radius: sweep all enrolled recipes for failure_action: rollback + start-first heavy apps
  with the same latent signature.
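The blast-radius sweep reduces to one predicate per service. A minimal sketch, assuming each recipe's compose file has already been parsed into a plain dict by whatever YAML loader is at hand. The function name is hypothetical. It flags only the policy combination; whether a flagged app is heavy enough to hit the rollback still needs the run history.

```python
def at_risk_services(compose: dict) -> list:
    """Names of services that combine failure_action: rollback with order: start-first.

    A missing `order` is treated as safe, since swarm's default update order
    is stop-first.
    """
    hits = []
    for name, svc in (compose.get("services") or {}).items():
        deploy = (svc or {}).get("deploy") or {}
        cfg = deploy.get("update_config") or {}
        if cfg.get("failure_action") == "rollback" and cfg.get("order") == "start-first":
            hits.append(name)
    return sorted(hits)
```

For the discourse policy quoted above this returns ["app"]; with the overlay's order: stop-first applied it returns an empty list, while failure_action: rollback stays in place for genuine failures.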
What is established (direct evidence, reproducible)
- abra is CONSTANT, not the cause. abra binary bf6azhpi…-abra-0.13.0-beta is the store path for
  every nixos system generation from system-4 (2026-06-01) through system-11 (now). No abra
  change between 06-05 and 06-10. HOW:
  for g in $(ls -d /nix/var/nix/profiles/system-*-link); do readlink -f "$g/sw/bin/abra"; done
  on cc-ci. EXPECTED: all …bf6azhpi… from system-4 on.
- abra's chaos-version = SmallSHA(git HEAD of the recipe checkout) (plus +U if worktree dirty).
  Source: abra@06a57de cli/app/deploy.go:106,168,365-373 (chaos → toDeployVersion =
  Recipe.ChaosVersion()), pkg/recipe/git.go:300-318 (ChaosVersion = SmallSHA(Head())), :483-495
  (Head = go-git repo.Head()). In chaos mode Recipe.Ensure early-returns
  (pkg/recipe/git.go:41-43), so there is NO env-version re-checkout.
- The isolated git/abra path stamps CORRECTLY now. Three faithful reproductions on cc-ci (scratch
  ABRA_DIR, fake domain, deploys bail at secret not generated AFTER the chaos version is
  computed) all log taking chaos version: 7ae7b0f7 (= PR head), NOT eb96de9:
  1. cp -a canonical recipe + manual tag/head checkout.
  2. real non-chaos base deploy (go-git EnsureVersion tag checkout) → CLI re-checkout head → chaos.
  3. exact fetch_recipe replica: clone mirror recipe-maintainers/discourse@7ae7b0f +
     git fetch upstream refs/tags/* → base deploy → re-checkout head → chaos.
  HOW (variant 3, re-runnable cold): see JOURNAL-dstamp 2026-06-11 "mirror-faithful repro".
  EXPECTED: DEBU app/deploy.go:372 version: taking chaos version: 7ae7b0f7.
- Same ref, solo run was GREEN; clustered runs DRIFTED. discourse @ ref 7ae7b0f76efb: run 184
  (2026-06-05 02:17, solo) = L4, upgrade PASS; the 06-10/06-11 runs m2b-discourse (06-10 20:54),
  m2p-discourse (06-11 00:44), ab-discourse-7ae7b0f-oldmain (06-11 00:48) = L1, upgrade FAIL
  (chaos commit 'eb96de94+U', not the intended PR-head '7ae7b0f76efb' (HC1)). HOW:
  grep -oE '"level": [0-9]+|"upgrade": "[a-z]+"' /var/lib/cc-ci-runs/{184,m2p-discourse}/results.json.
- All same-ref discourse runs share ONE swarm stack. naming.app_domain(recipe,pr,ref) =
  <recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net → identical for identical
  (recipe,pr,ref). The upgrade chaos_redeploy bypasses deploy_app's app-domain flock
  (lifecycle.chaos_redeploy / generic.perform_upgrade). LEADING HYPOTHESIS: the 06-10/06-11 drift
  is a CONCURRENCY ARTIFACT of the clustered rcust-M2 A/B discourse experiments racing on the
  shared stack, NOT an abra/recipe/env regression. Under test now.
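Two of the naming rules above are small enough to restate as code. A minimal sketch, not the abra or harness source: the 8-character short SHA is inferred from the stamps quoted in this document (7ae7b0f7 for head 7ae7b0f76efb…), and the digest behind 6hex(recipe|pr|ref) is not stated here, so SHA-256 over the "|"-joined inputs is a placeholder assumption.

```python
import hashlib


def chaos_version(head_sha: str, dirty: bool) -> str:
    """Short SHA of the checkout's HEAD, with +U appended when the worktree is dirty.

    Assumption: the short form is the first 8 hex characters.
    """
    return head_sha[:8] + ("+U" if dirty else "")


def app_domain(recipe: str, pr: int, ref: str) -> str:
    """<recipe[:4]>-<6 hex chars derived from recipe|pr|ref>.ci.commoninternet.net.

    Assumption: SHA-256 stands in for the real (unstated) hash.
    """
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}.ci.commoninternet.net"
```

chaos_version("7ae7b0f76efb2988c1e54956348dc9eeb7812e0b", True) gives 7ae7b0f7+U, the head stamp seen in the repro logs. app_domain is a pure function of (recipe, pr, ref), so two runs of the same discourse ref necessarily resolve to the same domain and therefore the same stack.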
In flight
- Implementing the fix (overlay stop-first + harness rollback detection), then a full real run
  (all stages) to prove discourse reliably reaches its true level, then the !testme drone path.
- Repro evidence runs: /var/lib/cc-ci-runs/dstamp-repro{1,2,3,4}.console.log on cc-ci (repro2
  PASS @7ae7b0f7+U; repro4 captured the rollback Spec/PreviousSpec).
Blocked
- (none)