Files
cc-ci/STATUS-dstamp.md

97 lines
6.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# STATUS — phase `dstamp` (discourse abra-stamp drift)
Builder. SSOT: `cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.
## Phase state: ROOT CAUSE ATTRIBUTED (direct evidence) — building fix, no gate claimed yet
## ROOT CAUSE (attributed by direct evidence, abra+harness EXONERATED)
The upgrade chaos redeploy applies the **correct** head spec, then swarm **rolls it back** to the
base spec, reverting the `chaos-version` label — masked by the recipe's `start-first` strategy +
the harness's `wait_healthy` (the OLD task keeps serving, so health passes).
Recipe policy (`~/.abra/recipes/discourse/compose.yml`, app service): `deploy.update_config:
{ failure_action: rollback, order: start-first }`, `healthcheck.start_period: 20m`. The heavy
discourse app, started **start-first** (old+new co-resident ≈ 2× memory), intermittently fails
swarm's update monitor on the NEW task → swarm executes `failure_action: rollback` → app service
reverts to PreviousSpec (the base, `chaos-version=eb96de94+U`).
**Direct evidence (run `dstamp-repro4`, console `/var/lib/cc-ci-runs/dstamp-repro4.console.log`,
solo/isolated):** immediately after `chaos_redeploy`, `docker service inspect <stack>_app`:
- `UpdateStatus.State = "updating"`,
- `.Spec.Labels coop-cloud.<stack>.chaos-version = 7ae7b0f7+U` (HEAD applied — abra stamped head
correctly), `.version = 0.9.0+3.5.0`,
- `.PreviousSpec.Labels …chaos-version = eb96de94+U` (the base), `.version = 0.7.0+3.3.1`.
Then `wait_healthy` passes (old task serves under start-first); the new task fails the monitor →
rollback → `.Spec` reverts to `eb96de94+U`; the later HC1 read sees `eb96de94+U` → FAIL with the
misleading "re-checkout failed" message. (`dstamp-repro2`, lighter timing, had NO rollback →
upgrade PASS @ `7ae7b0f7+U`.)
Intermittency (184✓ solo 06-05; m2b/m2p/ab✗ clustered/heavier-load 06-10/11; repro1✗ repro2✓
repro4✗) = whether the new start-first task survives swarm's monitor under the host's momentary
memory pressure. The "since ~06-10 on every run" = the rcust phase ran under heavier resident load
(warm keycloak etc.) so the new task reliably failed → rollback every time. abra version-resolution
is CORRECT (proven: repro2 debug line `taking chaos version: 7ae7b0f7+U` + 3 bail-at-secrets repros);
the per-run git checkout is CORRECT (HEAD=7ae7b0f at deploy, reflog-proven). NOT abra, NOT the
per-run tree, NOT concurrency.
## Fix (in progress) — HC1 keeps its teeth
1. **Reliability (restore true level):** discourse `tests/discourse/compose.ccci.yml` overlay set
the app service `deploy.update_config.order: stop-first` so the new task boots with full memory
(no 2× co-residency) and genuinely becomes healthy → no spurious rollback. The upgrade-to-head
is still really deployed + asserted on head; HC1 unchanged. Documented WHY in the overlay header.
2. **Correctness (honesty, general):** the harness upgrade path detects a swarm rollback after the
chaos redeploy (UpdateStatus.State rollback*/paused, or `.Spec` reverted to `.PreviousSpec`) and
fails the upgrade with the TRUE reason ("head spec applied then swarm-rolled-back: new task
failed the update monitor") instead of the misleading "re-checkout failed". A genuinely
undeployable head still FAILS (teeth preserved).
3. **Blast-radius:** sweep all enrolled recipes for `failure_action: rollback` + start-first heavy
apps with the same latent signature.
## What is established (direct evidence, reproducible)
- **abra is CONSTANT, not the cause.** abra binary `bf6azhpi…-abra-0.13.0-beta` is the store
path for every nixos system generation from system-4 (2026-06-01) through system-11 (now).
No abra change between 06-05 and 06-10.
HOW: `for g in $(ls -d /nix/var/nix/profiles/system-*-link); do readlink -f "$g/sw/bin/abra"; done`
on cc-ci. EXPECTED: all `…bf6azhpi…` from system-4 on.
- **abra's chaos-version = `SmallSHA(git HEAD of the recipe checkout)`** (+`+U` if worktree
dirty). Source: abra@06a57de `cli/app/deploy.go:106,168,365-373` (chaos →
`toDeployVersion = Recipe.ChaosVersion()`), `pkg/recipe/git.go:300-318` (`ChaosVersion` =
`SmallSHA(Head())`), `:483-495` (`Head` = go-git `repo.Head()`). In chaos mode
`Recipe.Ensure` early-returns (`pkg/recipe/git.go:41-43`) — NO env-version re-checkout.
- **The isolated git/abra path stamps CORRECTLY now.** Three faithful reproductions on cc-ci
(scratch ABRA_DIR, fake domain, deploys bail at `secret not generated` AFTER the chaos
version is computed) all log `taking chaos version: 7ae7b0f7` (= PR head), NOT `eb96de9`:
1. `cp -a` canonical recipe + manual tag/head checkout.
2. real non-chaos base deploy (go-git `EnsureVersion` tag checkout) → CLI re-checkout head → chaos.
3. exact `fetch_recipe` replica: clone mirror `recipe-maintainers/discourse` @7ae7b0f +
`git fetch upstream refs/tags/*` → base deploy → re-checkout head → chaos.
HOW (variant 3, re-runnable cold): see JOURNAL-dstamp 2026-06-11 "mirror-faithful repro".
EXPECTED: `DEBU app/deploy.go:372 version: taking chaos version: 7ae7b0f7`.
- **Same ref, solo run was GREEN; clustered runs DRIFTED.** discourse @ ref `7ae7b0f76efb`:
run **184** (2026-06-05 02:17, solo) = **L4, upgrade PASS**; the 06-10/06-11 runs
**m2b-discourse** (06-10 20:54), **m2p-discourse** (06-11 00:44), **ab-discourse-7ae7b0f-oldmain**
(06-11 00:48) = **L1, upgrade FAIL** (`chaos commit 'eb96de94+U', not the intended PR-head
'7ae7b0f76efb' (HC1)`). HOW: `grep -oE '"level": [0-9]+|"upgrade": "[a-z]+"'
/var/lib/cc-ci-runs/{184,m2p-discourse}/results.json`.
- **All same-ref discourse runs share ONE swarm stack.** `naming.app_domain(recipe,pr,ref)` =
`<recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net` → identical for identical
(recipe,pr,ref). The upgrade `chaos_redeploy` bypasses `deploy_app`'s app-domain flock
(`lifecycle.chaos_redeploy` / `generic.perform_upgrade`). LEADING HYPOTHESIS: the 06-10/06-11
drift is a CONCURRENCY ARTIFACT of the clustered rcust-M2 A/B discourse experiments racing on
the shared stack — NOT an abra/recipe/env regression. Under test now.
## In flight
- Implementing the fix (overlay stop-first + harness rollback detection), then a full real run
(all stages) to prove discourse reliably reaches its true level, then the `!testme` drone path.
- Repro evidence runs: `/var/lib/cc-ci-runs/dstamp-repro{1,2,3,4}.console.log` on cc-ci
(repro2 PASS @7ae7b0f7+U; repro4 captured the rollback Spec/PreviousSpec).
## Blocked
- (none)