STATUS — phase dstamp (discourse abra-stamp drift)

Builder. SSOT: cc-ci-plan/plan-phase-dstamp-discourse-drift.md. Gates M1, M2.

Phase state: ROOT CAUSE ATTRIBUTED (direct evidence) — building fix, no gate claimed yet

ROOT CAUSE (attributed by direct evidence, abra+harness EXONERATED)

The upgrade chaos redeploy applies the correct head spec; swarm then rolls the service back to the base spec, which reverts the chaos-version label. The rollback is masked by the recipe's start-first strategy combined with the harness's wait_healthy: the OLD task keeps serving, so the health check passes.

Recipe policy (~/.abra/recipes/discourse/compose.yml, app service): deploy.update_config: { failure_action: rollback, order: start-first }, healthcheck.start_period: 20m. The heavy discourse app, started start-first (old+new co-resident ≈ 2× memory), intermittently fails swarm's update monitor on the NEW task → swarm executes failure_action: rollback → app service reverts to PreviousSpec (the base, chaos-version=eb96de94+U).

Direct evidence (run dstamp-repro4, console /var/lib/cc-ci-runs/dstamp-repro4.console.log, solo/isolated): immediately after chaos_redeploy, docker service inspect <stack>_app:

  • UpdateStatus.State = "updating",
  • .Spec.Labels coop-cloud.<stack>.chaos-version = 7ae7b0f7+U (HEAD applied — abra stamped head correctly), .version = 0.9.0+3.5.0,
  • .PreviousSpec.Labels …chaos-version = eb96de94+U (the base), .version = 0.7.0+3.3.1.

Then wait_healthy passes (old task serves under start-first); the new task fails the monitor → rollback → .Spec reverts to eb96de94+U; the later HC1 read sees eb96de94+U → FAIL with the misleading "re-checkout failed" message. (dstamp-repro2, lighter timing, had NO rollback → upgrade PASS @ 7ae7b0f7+U.)
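The inspect fields above suggest a simple post-redeploy check. The following is a minimal sketch, not the harness's actual code: the function name `update_outcome`, the stack name `demo`, and the sample payload are hypothetical, and it assumes the raw JSON array that `docker service inspect` prints by default.

```python
import json

def update_outcome(inspect_output: str, label_key: str, expected: str) -> str:
    """Classify a service after a chaos redeploy from its inspect JSON."""
    service = json.loads(inspect_output)[0]  # inspect prints a JSON array
    state = (service.get("UpdateStatus") or {}).get("State") or ""
    labels = (service.get("Spec") or {}).get("Labels") or {}
    if state.startswith("rollback"):
        return "rolled-back"      # rollback_started / _paused / _completed
    if state == "paused":
        return "paused"           # update halted without rolling back
    if labels.get(label_key) != expected:
        return "label-mismatch"   # spec does not carry the head stamp
    return "ok"

# Hypothetical payload shaped like the post-rollback state described above.
KEY = "coop-cloud.demo.chaos-version"  # "demo" stands in for <stack>
sample = json.dumps([{
    "Spec": {"Labels": {KEY: "eb96de94+U"}},
    "UpdateStatus": {"State": "rollback_completed"},
}])
print(update_outcome(sample, KEY, "7ae7b0f7+U"))
```

Checking the update state before the label keeps the two failure modes apart: a rollback is reported as a rollback rather than as a stamping error.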

Intermittency (184✓ solo 06-05; m2b/m2p/ab✗ clustered/heavier-load 06-10/11; repro1✗ repro2✓ repro4✗) comes down to whether the new start-first task survives swarm's monitor under the host's momentary memory pressure. The "since ~06-10 on every run" pattern is explained by the rcust phase running under heavier resident load (warm keycloak etc.), so the new task reliably failed and was rolled back every time. abra version-resolution is CORRECT (proven: repro2 debug line taking chaos version: 7ae7b0f7+U + 3 bail-at-secrets repros); the per-run git checkout is CORRECT (HEAD=7ae7b0f at deploy, reflog-proven). NOT abra, NOT the per-run tree, NOT concurrency.

Fix (in progress) — HC1 keeps its teeth

  1. Reliability (restore true level): the discourse overlay tests/discourse/compose.ccci.yml sets the app service's deploy.update_config.order: stop-first, so the new task boots with full memory (no 2× co-residency) and genuinely becomes healthy → no spurious rollback. The upgrade-to-head is still really deployed + asserted on head; HC1 unchanged. WHY is documented in the overlay header.
  2. Correctness (honesty, general): the harness upgrade path detects a swarm rollback after the chaos redeploy (UpdateStatus.State rollback*/paused, or .Spec reverted to .PreviousSpec) and fails the upgrade with the TRUE reason ("head spec applied then swarm-rolled-back: new task failed the update monitor") instead of the misleading "re-checkout failed". A genuinely undeployable head still FAILS (teeth preserved).
  3. Blast-radius: sweep all enrolled recipes for failure_action: rollback + start-first heavy apps with the same latent signature.
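The blast-radius sweep in item 3 could start from a check like the one below. This is a sketch under assumptions: `risky_services` is a hypothetical helper, it takes an already-parsed compose mapping (for example the result of a YAML loader, not shown here), and it cannot judge whether an app is "heavy", so it only flags the policy combination.

```python
def risky_services(compose: dict) -> list:
    """Return names of services that roll back on failure AND start first."""
    flagged = []
    for name, service in (compose.get("services") or {}).items():
        deploy = (service or {}).get("deploy") or {}
        update = deploy.get("update_config") or {}
        if (update.get("failure_action") == "rollback"
                and update.get("order") == "start-first"):
            flagged.append(name)
    return sorted(flagged)

# Hypothetical parsed compose data mirroring the recipe policy quoted above.
recipe = {"services": {
    "app": {"deploy": {"update_config": {
        "failure_action": "rollback", "order": "start-first"}}},
    "db": {"deploy": {"update_config": {"order": "stop-first"}}},
}}
# The same recipe with the overlay's stop-first override applied to app.
with_overlay = {"services": {
    "app": {"deploy": {"update_config": {
        "failure_action": "rollback", "order": "stop-first"}}},
    "db": {"deploy": {"update_config": {"order": "stop-first"}}},
}}
print(risky_services(recipe))        # ['app']
print(risky_services(with_overlay))  # []
```

Any recipe the check flags would still need a manual look at its memory footprint before concluding it shares the discourse signature.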

What is established (direct evidence, reproducible)

  • abra is CONSTANT, not the cause. abra binary bf6azhpi…-abra-0.13.0-beta is the store path for every nixos system generation from system-4 (2026-06-01) through system-11 (now). No abra change between 06-05 and 06-10. HOW: for g in $(ls -d /nix/var/nix/profiles/system-*-link); do readlink -f "$g/sw/bin/abra"; done on cc-ci. EXPECTED: all …bf6azhpi… from system-4 on.

  • abra's chaos-version = SmallSHA(git HEAD of the recipe checkout), with +U appended if the worktree is dirty. Source: abra@06a57de cli/app/deploy.go:106,168,365-373 (chaos → toDeployVersion = Recipe.ChaosVersion()), pkg/recipe/git.go:300-318 (ChaosVersion = SmallSHA(Head())), :483-495 (Head = go-git repo.Head()). In chaos mode Recipe.Ensure early-returns (pkg/recipe/git.go:41-43), so there is NO env-version re-checkout.

  • The isolated git/abra path stamps CORRECTLY now. Three faithful reproductions on cc-ci (scratch ABRA_DIR, fake domain, deploys bail at secret not generated AFTER the chaos version is computed) all log taking chaos version: 7ae7b0f7 (= PR head), NOT eb96de9:

    1. cp -a canonical recipe + manual tag/head checkout.
    2. real non-chaos base deploy (go-git EnsureVersion tag checkout) → CLI re-checkout head → chaos.
    3. exact fetch_recipe replica: clone mirror recipe-maintainers/discourse @7ae7b0f + git fetch upstream refs/tags/* → base deploy → re-checkout head → chaos.

    HOW (variant 3, re-runnable cold): see JOURNAL-dstamp 2026-06-11 "mirror-faithful repro". EXPECTED: DEBU app/deploy.go:372 version: taking chaos version: 7ae7b0f7.

  • Same ref, solo run was GREEN; clustered runs DRIFTED. discourse @ ref 7ae7b0f76efb: run 184 (2026-06-05 02:17, solo) = L4, upgrade PASS; the 06-10/06-11 runs m2b-discourse (06-10 20:54), m2p-discourse (06-11 00:44), ab-discourse-7ae7b0f-oldmain (06-11 00:48) = L1, upgrade FAIL (chaos commit 'eb96de94+U', not the intended PR-head '7ae7b0f76efb' (HC1)). HOW: grep -oE '"level": [0-9]+|"upgrade": "[a-z]+"' /var/lib/cc-ci-runs/{184,m2p-discourse}/results.json.

  • All same-ref discourse runs share ONE swarm stack. naming.app_domain(recipe,pr,ref) = <recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net → identical for identical (recipe,pr,ref). The upgrade chaos_redeploy bypasses deploy_app's app-domain flock (lifecycle.chaos_redeploy / generic.perform_upgrade). EARLIER HYPOTHESIS (now ruled out, see ROOT CAUSE): that the 06-10/06-11 drift was a CONCURRENCY ARTIFACT of the clustered rcust-M2 A/B discourse experiments racing on the shared stack. The solo/isolated dstamp-repro4 run reproduced the rollback, so concurrency is NOT the cause, and neither is an abra/recipe-tree/env regression.
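Two of the naming rules established above are small enough to sketch. This is an illustration only: the 8-character short SHA is inferred from the labels seen in the runs (7ae7b0f7, eb96de94), the hash behind `6hex(...)` is not stated in this file so SHA-256 is an assumption, and the PR value is a placeholder.

```python
import hashlib

def chaos_version(head_sha: str, dirty: bool) -> str:
    """Short SHA of HEAD, with +U appended when the worktree is dirty."""
    return head_sha[:8] + ("+U" if dirty else "")

def app_domain(recipe: str, pr: str, ref: str) -> str:
    """<recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net"""
    # ASSUMPTION: SHA-256 stands in for the real, unspecified hash.
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}.ci.commoninternet.net"

# Only the 12-hex ref prefix from this file is used as input.
print(chaos_version("7ae7b0f76efb", dirty=True))   # 7ae7b0f7+U

# "PR" is a placeholder, not a real PR number.
first = app_domain("discourse", "PR", "7ae7b0f76efb")
second = app_domain("discourse", "PR", "7ae7b0f76efb")
print(first == second)                             # True
```

Because the domain is a pure function of (recipe, pr, ref), every run on the same ref resolves to the same stack name, whichever run id started it.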

In flight

  • Implementing the fix (overlay stop-first + harness rollback detection), then a full real run (all stages) to prove discourse reliably reaches its true level, then the !testme drone path.
  • Repro evidence runs: /var/lib/cc-ci-runs/dstamp-repro{1,2,3,4}.console.log on cc-ci (repro2 PASS @7ae7b0f7+U; repro4 captured the rollback Spec/PreviousSpec).

Blocked

  • (none)