STATUS — phase dstamp (discourse abra-stamp drift)
Builder. SSOT: cc-ci-plan/plan-phase-dstamp-discourse-drift.md. Gates M1, M2.
Phase state: ROOT CAUSE ATTRIBUTED (direct evidence) — building fix, no gate claimed yet
ROOT CAUSE (attributed by direct evidence, abra+harness EXONERATED)
The upgrade chaos redeploy applies the correct head spec, then swarm rolls it back to the
base spec, reverting the chaos-version label — masked by the recipe's start-first strategy +
the harness's wait_healthy (the OLD task keeps serving, so health passes).
Recipe policy (~/.abra/recipes/discourse/compose.yml, app service): deploy.update_config: { failure_action: rollback, order: start-first }, healthcheck.start_period: 20m. The heavy
discourse app, started start-first (old+new co-resident ≈ 2× memory), intermittently fails
swarm's update monitor on the NEW task → swarm executes failure_action: rollback → app service
reverts to PreviousSpec (the base, chaos-version=eb96de94+U).
Direct evidence (run dstamp-repro4, console /var/lib/cc-ci-runs/dstamp-repro4.console.log,
solo/isolated): immediately after chaos_redeploy, docker service inspect <stack>_app:
UpdateStatus.State = "updating"; .Spec.Labels coop-cloud.<stack>.chaos-version = 7ae7b0f7+U (HEAD
applied; abra stamped head correctly), .version = 0.9.0+3.5.0; .PreviousSpec.Labels …chaos-version =
eb96de94+U (the base), .version = 0.7.0+3.3.1. Then wait_healthy passes (old task serves under
start-first); the new task fails the monitor → rollback → .Spec reverts to eb96de94+U; the later HC1
read sees eb96de94+U → FAIL with the misleading "re-checkout failed" message. (dstamp-repro2, lighter
timing, had NO rollback → upgrade PASS @7ae7b0f7+U.)
Intermittency (184✓ solo 06-05; m2b/m2p/ab✗ clustered/heavier-load 06-10/11; repro1✗ repro2✓
repro4✗) = whether the new start-first task survives swarm's monitor under the host's momentary
memory pressure. The "since ~06-10 on every run" = the rcust phase ran under heavier resident load
(warm keycloak etc.) so the new task reliably failed → rollback every time. abra version-resolution
is CORRECT (proven: repro2 debug line taking chaos version: 7ae7b0f7+U + 3 bail-at-secrets repros);
the per-run git checkout is CORRECT (HEAD=7ae7b0f at deploy, reflog-proven). NOT abra, NOT the
per-run tree, NOT concurrency.
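
A minimal sketch of the label arithmetic behind the HC1 failure, using only values quoted in this file. It assumes the chaos-version label is the first 8 hex characters of the recipe HEAD plus "+U" for a dirty worktree (consistent with 7ae7b0f7+U for ref 7ae7b0f76efb). The names chaos_version and hc1_matches are hypothetical and are not the harness's or abra's actual functions.

```python
def chaos_version(head_sha: str, dirty: bool) -> str:
    """Assumed label form: 8-char short SHA, plus '+U' when the worktree is dirty."""
    return head_sha[:8] + ("+U" if dirty else "")


def hc1_matches(observed_label: str, intended_head: str) -> bool:
    """Hypothetical HC1-style check: does the stamped label point at the intended head?"""
    # Drop the dirty-worktree marker before comparing SHAs.
    short = observed_label[:-2] if observed_label.endswith("+U") else observed_label
    return bool(short) and intended_head.startswith(short)


head = "7ae7b0f76efb"                      # PR head ref quoted in this file
stamped = chaos_version(head, dirty=True)  # label right after chaos_redeploy
after_rollback = "eb96de94+U"              # label once swarm reverts to the base spec

print(stamped, hc1_matches(stamped, head))
print(after_rollback, hc1_matches(after_rollback, head))
```

Under these assumptions the stamp is correct at deploy time and the check only fails on the later read, which is why the label alone cannot tell a bad checkout from a swarm rollback.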
Fix (in progress) — HC1 keeps its teeth
- Reliability (restore true level): the discourse tests/discourse/compose.ccci.yml overlay sets the
  app service deploy.update_config.order: stop-first so the new task boots with full memory (no 2×
  co-residency) and genuinely becomes healthy → no spurious rollback. The upgrade-to-head is still
  really deployed + asserted on head; HC1 unchanged. Documented WHY in the overlay header.
- Correctness (honesty, general): the harness upgrade path detects a swarm rollback after the
  chaos redeploy (UpdateStatus.State rollback*/paused, or .Spec reverted to .PreviousSpec) and fails
  the upgrade with the TRUE reason ("head spec applied then swarm-rolled-back: new task failed the
  update monitor") instead of the misleading "re-checkout failed". A genuinely undeployable head
  still FAILS (teeth preserved).
- Blast-radius: sweep all enrolled recipes for failure_action: rollback + start-first heavy apps
  with the same latent signature.
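
The rollback detection described in the Correctness item could look roughly like the sketch below. It works on two parsed `docker service inspect` objects: one taken right after the chaos redeploy and one taken later. The function name rollback_reason, the two-snapshot shape, and the stack name in the example are assumptions made for illustration; the actual harness code may differ.

```python
from typing import Optional


def _label(service: dict, spec_field: str, key: str) -> Optional[str]:
    """Read one label from .Spec or .PreviousSpec, tolerating missing fields."""
    return ((service.get(spec_field) or {}).get("Labels") or {}).get(key)


def rollback_reason(after_redeploy: dict, later: dict, stack: str) -> Optional[str]:
    """Return a failure reason if swarm rolled the service back, else None."""
    key = f"coop-cloud.{stack}.chaos-version"

    # Signal 1: swarm reports a rollback (rollback_*) or a paused update.
    state = (later.get("UpdateStatus") or {}).get("State") or ""
    if state.startswith("rollback") or state == "paused":
        return f"head spec applied then swarm-rolled-back (UpdateStatus.State={state})"

    # Signal 2: the live label no longer matches what was applied and equals
    # the PreviousSpec label captured right after the redeploy.
    applied = _label(after_redeploy, "Spec", key)
    base = _label(after_redeploy, "PreviousSpec", key)
    current = _label(later, "Spec", key)
    if applied is not None and current != applied and current == base:
        return "head spec applied then swarm-rolled-back: new task failed the update monitor"
    return None


# Example with the label values from dstamp-repro4; the stack name is made up.
KEY = "coop-cloud.example_stack.chaos-version"
snap = {
    "UpdateStatus": {"State": "updating"},
    "Spec": {"Labels": {KEY: "7ae7b0f7+U"}},
    "PreviousSpec": {"Labels": {KEY: "eb96de94+U"}},
}
reverted = {"UpdateStatus": {"State": "rollback_completed"},
            "Spec": {"Labels": {KEY: "eb96de94+U"}}}
print(rollback_reason(snap, reverted, "example_stack"))
```

Comparing against the snapshot taken right after the redeploy avoids relying on what swarm leaves in .PreviousSpec once the rollback has finished.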
What is established (direct evidence, reproducible)
- abra is CONSTANT, not the cause. abra binary bf6azhpi…-abra-0.13.0-beta is the store path for
  every nixos system generation from system-4 (2026-06-01) through system-11 (now). No abra change
  between 06-05 and 06-10.
  HOW: for g in $(ls -d /nix/var/nix/profiles/system-*-link); do readlink -f "$g/sw/bin/abra"; done
  on cc-ci. EXPECTED: all …bf6azhpi… from system-4 on.
- abra's chaos-version = SmallSHA(git HEAD of the recipe checkout) (+ +U if worktree dirty).
  Source: abra@06a57de cli/app/deploy.go:106,168,365-373 (chaos → toDeployVersion =
  Recipe.ChaosVersion()), pkg/recipe/git.go:300-318 (ChaosVersion = SmallSHA(Head())), :483-495
  (Head = go-git repo.Head()). In chaos mode Recipe.Ensure early-returns
  (pkg/recipe/git.go:41-43): NO env-version re-checkout.
- The isolated git/abra path stamps CORRECTLY now. Three faithful reproductions on cc-ci (scratch
  ABRA_DIR, fake domain, deploys bail at "secret not generated" AFTER the chaos version is
  computed) all log "taking chaos version: 7ae7b0f7" (= PR head), NOT eb96de9:
  1. cp -a canonical recipe + manual tag/head checkout.
  2. real non-chaos base deploy (go-git EnsureVersion tag checkout) → CLI re-checkout head → chaos.
  3. exact fetch_recipe replica: clone mirror recipe-maintainers/discourse@7ae7b0f + git fetch
     upstream refs/tags/* → base deploy → re-checkout head → chaos.
  HOW (variant 3, re-runnable cold): see JOURNAL-dstamp 2026-06-11 "mirror-faithful repro".
  EXPECTED: DEBU app/deploy.go:372 version: taking chaos version: 7ae7b0f7.
- Same ref, solo run was GREEN; clustered runs DRIFTED. discourse @ ref 7ae7b0f76efb: run 184
  (2026-06-05 02:17, solo) = L4, upgrade PASS; the 06-10/06-11 runs m2b-discourse (06-10 20:54),
  m2p-discourse (06-11 00:44), ab-discourse-7ae7b0f-oldmain (06-11 00:48) = L1, upgrade FAIL (chaos
  commit 'eb96de94+U', not the intended PR-head '7ae7b0f76efb' (HC1)).
  HOW: grep -oE '"level": [0-9]+|"upgrade": "[a-z]+"' /var/lib/cc-ci-runs/{184,m2p-discourse}/results.json.
- All same-ref discourse runs share ONE swarm stack. naming.app_domain(recipe,pr,ref) =
  <recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net → identical for identical
  (recipe,pr,ref). The upgrade chaos_redeploy bypasses deploy_app's app-domain flock
  (lifecycle.chaos_redeploy / generic.perform_upgrade). LEADING HYPOTHESIS: the 06-10/06-11 drift
  is a CONCURRENCY ARTIFACT of the clustered rcust-M2 A/B discourse experiments racing on the
  shared stack, NOT an abra/recipe/env regression. Under test now.
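
To illustrate why identical (recipe, pr, ref) triples land on one stack, here is a sketch of a domain function with the shape given above. The hash is an assumption: this file only says "6hex(recipe|pr|ref)", so SHA-256 over the "|"-joined string is used as a stand-in, and the resulting hex digits will not match the real naming.app_domain output. The pr value in the example is a placeholder.

```python
import hashlib


def app_domain(recipe: str, pr: str, ref: str) -> str:
    """Stand-in for naming.app_domain: <recipe[:4]>-<6 hex>.ci.commoninternet.net.

    ASSUMPTION: SHA-256 is used for the 6 hex digits; the real hash is not
    stated in this file.
    """
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode("utf-8")).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}.ci.commoninternet.net"


# Two runs of the same ref resolve to the same domain, hence the same swarm stack.
run_a = app_domain("discourse", "0", "7ae7b0f76efb")  # "0" is a placeholder PR id
run_b = app_domain("discourse", "0", "7ae7b0f76efb")
print(run_a, run_a == run_b)
```

Because the domain is a pure function of the triple, any isolation between same-ref runs has to come from locking rather than from naming.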
In flight
- Implementing the fix (overlay stop-first + harness rollback detection), then a full real run
  (all stages) to prove discourse reliably reaches its true level, then the !testme drone path.
- Repro evidence runs: /var/lib/cc-ci-runs/dstamp-repro{1,2,3,4}.console.log on cc-ci (repro2 PASS
  @7ae7b0f7+U; repro4 captured the rollback Spec/PreviousSpec).
Blocked
- (none)