cc-ci/STATUS-dstamp.md

STATUS — phase dstamp (discourse abra-stamp drift)

Builder. SSOT: cc-ci-plan/plan-phase-dstamp-discourse-drift.md. Gates M1, M2.

DONE

M1 PASS (REVIEW-dstamp fb411b2 @17:36Z) + M2 PASS (71358da @17:58Z), both fresh, no VETO. All Definition-of-Done items Adversary-verified.

Operator summary. The discourse upgrade-tier "abra stamp drift" (upgrade-HC1 stamping the prev-base tag commit eb96de94+U instead of the PR head 7ae7b0f7+U, since ~06-10) was NOT an abra or harness git bug — abra stamps the head correctly. Root cause: discourse's compose.yml app service uses deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }. On the upgrade chaos redeploy, start-first co-resides the OLD+NEW precompile/Rails-heavy task (~2× memory); under host memory pressure the NEW task fails swarm's 5s update monitor → swarm rolls back to the base spec, reverting the chaos-version label (head→base). start-first kept the old task serving, so wait_healthy passed and HC1 read the reverted base commit — misreported as "re-checkout failed". Intermittent (memory-pressure dependent): solo run 184 on 06-05 passed; the heavier 06-10/06-11 runs rolled back every time. Direct evidence: dstamp-repro4 captured .Spec chaos-version=7ae7b0f7+U (head applied) → .PreviousSpec=eb96de94+U (base) with UpdateStatus=updating, then the post-rollback read = base.

Fix (commits 0cc31a5 + e9c26c7, HC1 unweakened): (1) tests/discourse/compose.ccci.yml app update_config.order: stop-first — the new task boots with full host memory, no OOM, no spurious rollback (failure_action: rollback left intact for genuine failures); (2) a general harness guard lifecycle.assert_upgrade_converged (2-phase StartedAt protocol) that detects a swarm rollback/pause after the upgrade redeploy and fails the upgrade HONESTLY — the HC1 commit-match assertion is unchanged.
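The rollback guard in (2) can be pictured with a short sketch. This is not the harness's assert_upgrade_converged: the name classify_update, its parameters, and the exact two-phase rule (first require that UpdateStatus.StartedAt differs from the value captured before the redeploy, then require State=completed) are assumptions made for illustration. Only the UpdateStatus fields and the state names are taken from what docker service inspect reports.

```python
from typing import Optional

# Swarm UpdateStatus.State values meaning the new spec did not stick.
ROLLBACK_STATES = {"rollback_started", "rollback_paused", "rollback_completed"}


class UpgradeNotConverged(AssertionError):
    """Raised when swarm rolled back or paused the upgrade redeploy."""


def classify_update(started_before: Optional[str], update_status: dict) -> str:
    """Classify one poll of a service's UpdateStatus after a redeploy.

    started_before: UpdateStatus.StartedAt captured BEFORE the redeploy
                    (None if the service had never been updated).
    update_status:  the UpdateStatus object from `docker service inspect`.
    Returns "pending" (keep polling) or "converged"; raises on rollback/pause.
    """
    state = update_status.get("State")
    started = update_status.get("StartedAt")

    # Phase 1: the status must belong to THIS redeploy, not an earlier update.
    if started is None or started == started_before:
        return "pending"

    # Phase 2: the new update must finish without a rollback or a pause.
    if state in ROLLBACK_STATES or state == "paused":
        raise UpgradeNotConverged(
            "head spec applied then swarm-rolled-back or paused: "
            f"state={state} message={update_status.get('Message', '')!r}"
        )
    return "converged" if state == "completed" else "pending"
```

A caller would poll this with a deadline, since a redeploy that changes nothing never starts a new update and would stay "pending". Failing here, before the commit-match runs, is what lets the upgrade report the true reason without touching HC1 itself.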

Proven in real CI: drone !testme build #450 (discourse @7ae7b0f) = LEVEL 5 (was L1 under the drift), all tiers green, clean teardown, no secret leak; PR recipe-maintainers/discourse#2 shows passed. Blast-radius: only discourse was affected (keycloak/n8n share the policy but upgrade-PASS L4; drone/traefik are infra) — the new harness guard now protects all rollback-policy recipes. DEFERRED entry closed with pointers. No operator action required.


Gate: M1 — PASS (REVIEW-dstamp fb411b2 @2026-06-11T17:36Z). Now on M2.

Gate: M2 — CLAIMED, awaiting Adversary

WHAT (M2 = Proven in real CI): discourse full lifecycle GREEN at its true level via the drone !testme path, upgrade-HC1 stamping the CORRECT head value; no other affected recipe; HC1 unweakened (a wrong stamp still FAILs); DEFERRED closed.

  • Real-CI proof — drone !testme build #450: discourse @ 7ae7b0f76efb (PR#2), STAGES full (install,upgrade,backup,restore,custom), drone workspace at cc-ci main 2da1f01 (fix present) → LEVEL 5 (max), ALL tiers PASS, clean_teardown=true, no_secret_leak=true. Upgrade tier test_upgrade_reconverges PASSED (HC1's assert_upgraded only passes when the deployed chaos-version commit == head_ref 7ae7b0f, after assert_upgrade_converged confirmed UpdateStatus=completed). Was L1 (drift) before the fix → L5 now.
  • Triggered via the !testme path: comment 14346 (!testme) on recipe-maintainers/discourse#2 → bridge ack 14347, updated to "🌻 cc-ci — discourse @ 7ae7b0f7 passed" with the L5 result card/badge linking drone build 450.

HOW to verify (Adversary, cold):

  1. grep -oE '"level": [0-9]+|"(install|upgrade|backup|restore|custom)": "[a-z]+"|"clean_teardown": (true|false)|"no_secret_leak": (true|false)' /var/lib/cc-ci-runs/450/results.json → level 5, all pass, both flags true.
  2. /var/lib/cc-ci-runs/450/junit/upgrade__generic__test_upgrade.xml → test_upgrade_reconverges testcase with NO <failure> child (passed).
  3. PR comment 14347 on recipe-maintainers/discourse#2 = passed, run 450.
  4. Fresh independent re-trigger (recommended): post !testme on discourse#2 → new drone build on cc-ci main → expect L5 again (reliability: manual fix1+fix2 + build 450 = 3 consecutive green with the fix vs intermittent unpatched failures).
  5. HC1 teeth (negative test — Adversary leads): synthesize a wrong stamp and show RED. Two live teeth: (a) the unchanged commit-match generic.py:174-175 — a deployed chaos commit ≠ head_ref still FAILs (e.g. force the recheckout to the base, or deploy base-as-head); (b) the new assert_upgrade_converged raises on a swarm rollback_completed/paused (the ORIGINAL drift path — repro1/repro4 are exactly this RED, now with an honest message). Neither relaxes HC1.
  6. DEFERRED closed: machine-docs/DEFERRED.md dstamp entry → RESOLVED with pointers.

EXPECTED: build 450 level 5, all tiers pass, both flags true; PR#2 passed; DEFERRED resolved. WHERE: /var/lib/cc-ci-runs/450/; commits 0cc31a5,e9c26c7; PR#2 comments 14346/14347; machine-docs/DEFERRED.md. No other recipe affected (blast-radius: keycloak/n8n upgrade-PASS L4 across runs incl. rcust era; drone/traefik infra). Fresh Adversary M2 PASS → ## DONE.


(M1 — verified PASS; detail retained below)

WHAT (M1 = Attribution): root cause attributed by direct evidence; minimal reproducible demonstration; 06-05→06-10 change identified; fix implemented (recipe overlay + harness, HC1 unweakened); blast-radius sweep complete.

Root cause: discourse compose.yml app service sets deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }. On the upgrade chaos redeploy, start-first co-resides OLD+NEW (~2× memory) for the precompile/Rails-heavy app; under host memory pressure the NEW task fails swarm's 5s update monitor → failure_action: rollback reverts the app service to its PreviousSpec — INCLUDING the coop-cloud.<stack>.chaos-version label (head→base). Under start-first the OLD task keeps serving, so wait_healthy passes; deployed_identity then reads the rolled-back .Spec (base commit eb96de94+U) and HC1 misreports it as "re-checkout failed". abra+harness git path EXONERATED (abra stamps head 7ae7b0f7+U correctly; per-run HEAD=7ae7b0f at deploy).

HOW to verify (Adversary, cold):

  1. Recipe policy: cd ~/.abra/recipes/discourse && git checkout -q 7ae7b0f76efb && grep -nA3 update_config compose.yml → failure_action: rollback, order: start-first. EXPECTED present.
  2. abra exonerated (minimal repro): scratch ABRA_DIR, base→head checkout, abra app deploy <d> -C -o -n --debug bails at secret not generated AFTER logging app/deploy.go:372 version: taking chaos version: 7ae7b0f7+U (HEAD-correct). Procedure: JOURNAL-dstamp "mirror-faithful repro".
  3. Direct rollback evidence: console /var/lib/cc-ci-runs/dstamp-repro4.console.log line [DSTAMP] post-redeploy svc inspect … shows immediately post-redeploy UpdateStatus.State="updating", .Spec…chaos-version=7ae7b0f7+U (head applied), .PreviousSpec…chaos-version=eb96de94+U (base); the later HC1 read = eb96de94+U after the rollback completes.
  4. Fix present: runner/harness/lifecycle.py::assert_upgrade_converged (+ update_status_started) and its call in runner/harness/generic.py::perform_upgrade; tests/discourse/compose.ccci.yml app deploy.update_config.order: stop-first. Commits 0cc31a5 + e9c26c7.
  5. Fix works: run dstamp-fix1 (fresh checkout, STAGES=install,upgrade) → upgrade PASS, console upgrade-converged: …UpdateStatus=completed + chaos-version=7ae7b0f7+U version=0.7.0+3.3.1→0.9.0+3.5.0. (Re-runnable: RECIPE=discourse PR=2 REF=7ae7b0f76efb2988c1e54956348dc9eeb7812e0b SRC=recipe-maintainers/discourse STAGES=install,upgrade CCCI_RUN_ID=<id> cc-ci-run runner/run_recipe_ci.py from a checkout at e9c26c7.)
  6. Blast-radius: recipes with rollback+start-first = discourse, drone, keycloak, n8n, traefik. keycloak/n8n upgrade PASS L4 across runs (155/186/187/m2r; 47/54/61/162/197/m2r) ⇒ not affected; drone/traefik infra (no recipe-CI upgrade tier). Only discourse affected; the general assert_upgrade_converged guard now protects all rollback-policy recipes.

EXPECTED: all of 1–6 hold. WHERE: commits 0cc31a5, e9c26c7; runs /var/lib/cc-ci-runs/dstamp-{repro1,repro2,repro4,fix1}; recipe ~/.abra/recipes/discourse.

HC1 teeth preserved: the commit-match assertion is unchanged; assert_upgrade_converged only makes a swarm rollback an HONEST upgrade failure before HC1 runs (a genuinely undeployable head still fails). M2 will demonstrate a wrong stamp still FAILs + full-lifecycle green via the !testme path.


Root cause detail (evidence)

ROOT CAUSE (attributed by direct evidence, abra+harness EXONERATED)

The upgrade chaos redeploy applies the correct head spec, then swarm rolls it back to the base spec, reverting the chaos-version label — masked by the recipe's start-first strategy + the harness's wait_healthy (the OLD task keeps serving, so health passes).

Recipe policy (~/.abra/recipes/discourse/compose.yml, app service): deploy.update_config: { failure_action: rollback, order: start-first }, healthcheck.start_period: 20m. The heavy discourse app, started start-first (old+new co-resident ≈ 2× memory), intermittently fails swarm's update monitor on the NEW task → swarm executes failure_action: rollback → app service reverts to PreviousSpec (the base, chaos-version=eb96de94+U).

Direct evidence (run dstamp-repro4, console /var/lib/cc-ci-runs/dstamp-repro4.console.log, solo/isolated): immediately after chaos_redeploy, docker service inspect <stack>_app:

  • UpdateStatus.State = "updating",
  • .Spec.Labels coop-cloud.<stack>.chaos-version = 7ae7b0f7+U (HEAD applied — abra stamped head correctly), .version = 0.9.0+3.5.0,
  • .PreviousSpec.Labels …chaos-version = eb96de94+U (the base), .version = 0.7.0+3.3.1.

Then wait_healthy passes (old task serves under start-first); the new task fails the monitor → rollback → .Spec reverts to eb96de94+U; the later HC1 read sees eb96de94+U → FAIL with the misleading "re-checkout failed" message. (dstamp-repro2, lighter timing, had NO rollback → upgrade PASS @ 7ae7b0f7+U.)
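The evidence above comes from reading three fields out of docker service inspect. A minimal sketch of that read follows, assuming the JSON array that the command prints by default. The helper names chaos_labels and rolled_back are hypothetical, and matching the label by its coop-cloud. prefix and .chaos-version suffix is an assumption chosen so the stack name need not be known.

```python
import json


def chaos_labels(inspect_json: str) -> dict:
    """Extract UpdateStatus.State and the chaos-version labels of
    .Spec and .PreviousSpec from `docker service inspect <svc>` output."""
    service = json.loads(inspect_json)[0]  # inspect prints a JSON array

    def pick(spec):
        labels = (spec or {}).get("Labels") or {}
        # Label key is coop-cloud.<stack>.chaos-version; match prefix + suffix.
        for key, value in labels.items():
            if key.startswith("coop-cloud.") and key.endswith(".chaos-version"):
                return value
        return None

    return {
        "state": (service.get("UpdateStatus") or {}).get("State"),
        "spec": pick(service.get("Spec")),
        "previous": pick(service.get("PreviousSpec")),
    }


def rolled_back(observed: dict, head_stamp: str) -> bool:
    """True when swarm reports a rollback or the live spec lost the head stamp."""
    state = observed["state"] or ""
    return state.startswith("rollback") or observed["spec"] != head_stamp
```

Fed the repro4 values, the first read gives state "updating" with spec 7ae7b0f7+U and previous eb96de94+U, which rolled_back does not flag. A read taken after the rollback, where .Spec carries eb96de94+U again, is flagged.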

Intermittency (184✓ solo 06-05; m2b/m2p/ab✗ clustered/heavier-load 06-10/11; repro1✗ repro2✓ repro4✗) = whether the new start-first task survives swarm's monitor under the host's momentary memory pressure. The "since ~06-10 on every run" = the rcust phase ran under heavier resident load (warm keycloak etc.) so the new task reliably failed → rollback every time. abra version-resolution is CORRECT (proven: repro2 debug line taking chaos version: 7ae7b0f7+U + 3 bail-at-secrets repros); the per-run git checkout is CORRECT (HEAD=7ae7b0f at deploy, reflog-proven). NOT abra, NOT the per-run tree, NOT concurrency.

Fix (in progress) — HC1 keeps its teeth

  1. Reliability (restore true level): the discourse overlay tests/discourse/compose.ccci.yml sets the app service's deploy.update_config.order: stop-first so the new task boots with full memory (no 2× co-residency) and genuinely becomes healthy → no spurious rollback. The upgrade-to-head is still really deployed + asserted on head; HC1 unchanged. The WHY is documented in the overlay header.
  2. Correctness (honesty, general): the harness upgrade path detects a swarm rollback after the chaos redeploy (UpdateStatus.State rollback*/paused, or .Spec reverted to .PreviousSpec) and fails the upgrade with the TRUE reason ("head spec applied then swarm-rolled-back: new task failed the update monitor") instead of the misleading "re-checkout failed". A genuinely undeployable head still FAILS (teeth preserved).
  3. Blast-radius: sweep all enrolled recipes for failure_action: rollback + start-first heavy apps with the same latent signature.
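The sweep in item 3 reduces to one check per service. The sketch below assumes each recipe's compose.yml has already been parsed into a dict by whatever YAML loader is at hand. The function name is hypothetical, and the check only finds the policy combination; whether an app is heavy enough to trip the monitor still needs a human judgment or a run history.

```python
def rollback_start_first_services(compose: dict) -> list:
    """Return the names of services whose deploy.update_config combines
    failure_action: rollback with order: start-first."""
    hits = []
    for name, service in (compose.get("services") or {}).items():
        update = ((service or {}).get("deploy") or {}).get("update_config") or {}
        if (update.get("failure_action") == "rollback"
                and update.get("order") == "start-first"):
            hits.append(name)
    return sorted(hits)
```

Run over the discourse recipe at 7ae7b0f this should report the app service, and over the compose.ccci.yml overlay result (order: stop-first) it should report nothing.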

What is established (direct evidence, reproducible)

  • abra is CONSTANT, not the cause. abra binary bf6azhpi…-abra-0.13.0-beta is the store path for every nixos system generation from system-4 (2026-06-01) through system-11 (now). No abra change between 06-05 and 06-10. HOW: for g in $(ls -d /nix/var/nix/profiles/system-*-link); do readlink -f "$g/sw/bin/abra"; done on cc-ci. EXPECTED: all …bf6azhpi… from system-4 on.

  • abra's chaos-version = SmallSHA(git HEAD of the recipe checkout), plus a +U suffix if the worktree is dirty. Source: abra@06a57de cli/app/deploy.go:106,168,365-373 (chaos → toDeployVersion = Recipe.ChaosVersion()), pkg/recipe/git.go:300-318 (ChaosVersion = SmallSHA(Head())), :483-495 (Head = go-git repo.Head()). In chaos mode Recipe.Ensure early-returns (pkg/recipe/git.go:41-43), so there is NO env-version re-checkout.

  • The isolated git/abra path stamps CORRECTLY now. Three faithful reproductions on cc-ci (scratch ABRA_DIR, fake domain, deploys bail at secret not generated AFTER the chaos version is computed) all log taking chaos version: 7ae7b0f7 (= PR head), NOT eb96de9:

    1. cp -a canonical recipe + manual tag/head checkout.
    2. real non-chaos base deploy (go-git EnsureVersion tag checkout) → CLI re-checkout head → chaos.
    3. exact fetch_recipe replica: clone mirror recipe-maintainers/discourse @7ae7b0f + git fetch upstream refs/tags/* → base deploy → re-checkout head → chaos. HOW (variant 3, re-runnable cold): see JOURNAL-dstamp 2026-06-11 "mirror-faithful repro". EXPECTED: DEBU app/deploy.go:372 version: taking chaos version: 7ae7b0f7.
  • Same ref, solo run was GREEN; clustered runs DRIFTED. discourse @ ref 7ae7b0f76efb: run 184 (2026-06-05 02:17, solo) = L4, upgrade PASS; the 06-10/06-11 runs m2b-discourse (06-10 20:54), m2p-discourse (06-11 00:44), ab-discourse-7ae7b0f-oldmain (06-11 00:48) = L1, upgrade FAIL (chaos commit 'eb96de94+U', not the intended PR-head '7ae7b0f76efb' (HC1)). HOW: grep -oE '"level": [0-9]+|"upgrade": "[a-z]+"' /var/lib/cc-ci-runs/{184,m2p-discourse}/results.json.

  • All same-ref discourse runs share ONE swarm stack. naming.app_domain(recipe,pr,ref) = <recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net → identical for identical (recipe,pr,ref). The upgrade chaos_redeploy bypasses deploy_app's app-domain flock (lifecycle.chaos_redeploy / generic.perform_upgrade). LEADING HYPOTHESIS: the 06-10/06-11 drift is a CONCURRENCY ARTIFACT of the clustered rcust-M2 A/B discourse experiments racing on the shared stack — NOT an abra/recipe/env regression. Under test now.
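The stamp rule established above (short SHA of HEAD, with +U for a dirty worktree) and the commit comparison HC1 performs can be illustrated in a few lines. This is a sketch, not abra's Go code or the assertion at generic.py:174-175: the 8-character short length is inferred from the stamps quoted in this file (7ae7b0f7, eb96de94), and both function names are hypothetical.

```python
def chaos_version(head_sha: str, dirty: bool, short_len: int = 8) -> str:
    """Build a chaos-version stamp: short SHA, plus +U if the worktree is dirty."""
    short = head_sha[:short_len]
    return f"{short}+U" if dirty else short


def stamp_matches_head(stamp: str, head_ref: str) -> bool:
    """True when the deployed stamp's commit is a prefix of the intended head ref."""
    commit = stamp[:-2] if stamp.endswith("+U") else stamp
    return bool(commit) and head_ref.startswith(commit)
```

With REF=7ae7b0f76efb2988c1e54956348dc9eeb7812e0b and a dirty worktree, chaos_version yields 7ae7b0f7+U, which matches the head. The rolled-back stamp eb96de94+U does not match, which is the mismatch HC1 reported during the drift.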

In flight

  • Implementing the fix (overlay stop-first + harness rollback detection), then a full real run (all stages) to prove discourse reliably reaches its true level, then the !testme drone path.
  • Repro evidence runs: /var/lib/cc-ci-runs/dstamp-repro{1,2,3,4}.console.log on cc-ci (repro2 PASS @7ae7b0f7+U; repro4 captured the rollback Spec/PreviousSpec).

Blocked

  • (none)