All checks were successful
continuous-integration/drone/push Build is passing
130 lines
8.2 KiB
Markdown
130 lines
8.2 KiB
Markdown
# REVIEW-dstamp.md — Adversary verdicts for phase `dstamp`
|
||
|
||
Phase: investigate & solve the discourse abra-stamp drift (upgrade-HC1 stamps the
|
||
prev-base tag commit instead of the PR-head version, harness-neutral, since ~06-10).
|
||
SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-dstamp-discourse-drift.md`. Gates M1, M2.
|
||
|
||
Verdict log is append-only. `review(...)`-prefixed commits carry verdicts (load-bearing
|
||
watchdog signal). Findings filed under `## Adversary findings` in BACKLOG-dstamp.md.
|
||
|
||
---
|
||
|
||
## Prep notes (NOT a verdict — no gate claimed yet) @2026-06-11T15:5x
|
||
|
||
Recon done cold before any Builder claim, to make M1/M2 verification fast and independent.
|
||
Anti-anchoring: formed only from the plan (SSOT), the harness code, and direct host evidence
|
||
— no dstamp JOURNAL exists yet; none read.
|
||
|
||
**Stamp mechanism (from code):** HC1's "stamp" = the `coop-cloud.<stack>.chaos-version`
|
||
docker service label abra writes on a `--chaos` deploy = the deployed recipe git commit
|
||
(`runner/harness/lifecycle.py:468 deployed_identity`, `runner/harness/generic.py:146
|
||
assert_upgraded`). Upgrade flow (`generic.py:226 perform_upgrade`): deploy prev-published
|
||
base → `recipe_checkout_ref(recipe, head_ref)` (git checkout -f head) → `chaos_redeploy`
|
||
(`abra app deploy --chaos`). HC1 asserts `chaos_commit == head_ref` (after stripping the
|
||
`+U` untracked-overlay marker). PASS requires the chaos-version to equal the PR head.
|
||
|
||
**Cold observable facts (from `/var/lib/cc-ci-runs/m2p-discourse/abra/recipes/discourse`
|
||
snapshot + live `~/.abra/recipes/discourse` on cc-ci, 2026-06-11):**
|
||
- Recipe HEAD `7ae7b0f` = "chore: upgrade to 0.9.0+3.5.0"; `git describe --tags` =
|
||
`0.7.0+3.3.1-9-g7ae7b0f` → HEAD is **9 commits past the newest annotated tag**
|
||
`0.7.0+3.3.1` (commit `eb96de9`). No `0.8.x`/`0.9.x` tag exists.
|
||
- The drift symptom (per plan): chaos-version stamped `eb96de94+U` = the **prev-base tag
|
||
commit** (= the upgrade base `0.7.0+3.3.1`), NOT the PR-head `7ae7b0f`.
|
||
- abra is **nix-pinned**: `abra version 0.13.0-beta-06a57de`, store path under
|
||
`/run/current-system` → binary drift requires a flake.lock/nixos-generation bump between
|
||
06-05 and 06-10 (verify against generations, don't assume).
|
||
|
||
**Open question I'll independently re-derive when M1 is claimed:** why the `--chaos`
|
||
redeploy after checkout-to-HEAD stamps the BASE commit (eb96de9), not HEAD (7ae7b0f).
|
||
Candidates to test cold: (a) re-checkout to head silently reverted (abra fetch/reset during
|
||
deploy); (b) abra chaos resolves the version from the app's recorded `.env` RECIPE/version
|
||
(= the base) rather than the working-tree HEAD; (c) the "env drift" since 06-10 = recipe/
|
||
mirror git state moved (unreleased commits pushed past last tag) or a tag re-pointed.
|
||
|
||
**Guardrail teeth I will enforce at M2:** HC1 must still FAIL on a genuinely wrong stamp
|
||
(synthesize a wrong-version deploy and show RED). Any "fix" that derives EXPECTED from
|
||
"what makes the test pass" rather than abra's documented behavior = automatic FAIL.
|
||
|
||
Status: idle, awaiting Builder to seed STATUS-dstamp.md and claim M1. Watchdog will ping
|
||
on the `claim(...)` commit.
|
||
|
||
---
|
||
|
||
## Independent probe findings @2026-06-11T17:3x (NOT a verdict — no M1 claim yet)
|
||
|
||
Anti-anchoring preserved: JOURNAL-dstamp NOT read. Root cause derived independently from
|
||
harness code, per-run artifacts (repro1/repro2 console logs), and direct docker service
|
||
inspect on cc-ci. Independently arrived at the same attribution as the Builder.
|
||
|
||
**Causal chain derived from code + direct evidence:**
|
||
|
||
1. `provide_ccci_overlay` (rcust-era addition) copies `compose.ccci.yml` into the per-run
|
||
recipe dir as an UNTRACKED file. Absent in run 184 (2026-06-05, which used the old
|
||
`install_steps.sh` path writing to canonical `~/.abra`) — consistent with run 184 having
|
||
no `+U` suffix and passing. The `+U` itself is stripped by HC1's `chaos_commit.split("+",1)[0]`
|
||
and is NOT the cause of drift.
|
||
|
||
2. abra reads `git HEAD = 7ae7b0f` and computes `chaos-version = 7ae7b0f7+U` CORRECTLY.
|
||
Confirmed via three bail-at-secrets manual repros + repro2 debug line
|
||
`taking chaos version: 7ae7b0f7+U`. abra and the per-run git checkout are EXONERATED.
|
||
|
||
3. `chaos_redeploy` passes `-c` (no_converge_checks) → `docker stack deploy` returns
|
||
immediately; Swarm rolling update runs asynchronously.
|
||
|
||
4. Discourse `compose.yml` (BOTH base `eb96de94` AND PR-head `7ae7b0f`) sets
|
||
`deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s }`
|
||
on the `app` service. Confirmed by direct `docker service inspect disc-ae10f0_..._app`.
|
||
|
||
5. With `order: start-first`, OLD + NEW task co-reside (~2× memory). Discourse's
|
||
Rails/Sidekiq precompile is memory-heavy; under the heavier host load since ~06-10
|
||
(warm keycloak and other rcust-phase stacks), the NEW task intermittently fails swarm's
|
||
5s update monitor → `failure_action: rollback` fires → Swarm REVERTS the app service
|
||
spec to PreviousSpec (base deploy, `chaos-version=eb96de94+U`).
|
||
|
||
6. `services_converged` blind spot: after rollback `UpdateStatus.State = "rollback_completed"`,
|
||
NOT in the blocking set `("updating", "rollback_started")` → returns True as if converged.
|
||
Under start-first the OLD task kept serving → `wait_healthy` also passes on the
|
||
rolled-back spec.
|
||
|
||
7. `deployed_identity` reads `.Spec.Labels` → rolled-back spec → `chaos-version=eb96de94+U`.
|
||
HC1 asserts head_ref `7ae7b0f76efb` ≠ `eb96de94` → FAIL with misleading "re-checkout failed".
|
||
|
||
**Key disproving evidence (independent route):** repro1 was isolated (no concurrent discourse
|
||
run, domain `disc-ae10f0` used for the first time) and STILL showed the drift. This refuted
|
||
the pure-concurrency hypothesis BEFORE reading the Builder's evidence or JOURNAL.
|
||
|
||
**Intermittency explained (run 184 ✓ solo 06-05; clustered/repro1/repro4 ✗; repro2 ✓):**
|
||
Whether the new start-first task survives the 5s monitor depends on momentary memory pressure.
|
||
Run 184: solo + lighter host load + pre-rcust overlay path → new task survived. repro2: warm
|
||
volumes/containers from repro1 → faster Rails precompile → task survived. The "since ~06-10
|
||
on every run" pattern = heavier baseline load from warm rcust-phase stacks after run 184.
|
||
|
||
**Fix analysis (Builder commit 0cc31a5 — read before JOURNAL):**
|
||
|
||
*Part 1 — overlay `order: stop-first`*: Old task stops before new starts → new boots with full
|
||
host memory → no OOM under the 5s monitor → no spurious rollback. `failure_action: rollback`
|
||
intentionally preserved so a genuinely broken head still rolls back and is caught.
|
||
ASSESSMENT: **CORRECT AND SUFFICIENT** for eliminating the spurious-rollback trigger.
|
||
|
||
*Part 2 — `lifecycle.assert_upgrade_converged`*: Called in `perform_upgrade` immediately after
|
||
`chaos_redeploy`, before `wait_healthy`. Polls `docker service inspect
|
||
--format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}'` until terminal.
|
||
Returns on `""|"none"|"completed"`; raises on `"rollback_completed"|"rollback_paused"|"paused"`;
|
||
polls on `"updating"|"rollback_started"`; times out at `meta.DEPLOY_TIMEOUT`.
|
||
ASSESSMENT: **CORRECT** — closes the wait_healthy-masking blind spot. Makes a swarm rollback
|
||
an HONEST upgrade failure ("head did not stay healthy") rather than a misreported stamp mismatch.
|
||
HC1 commit-match logic is unchanged; this only makes the rollback visible before HC1 runs.
|
||
|
||
**One concern flagged (not a blocker — defense-in-depth covers it):**
|
||
`assert_upgrade_converged` has a theoretical race window: on the very first poll, Docker may
|
||
not yet have transitioned from a prior `"completed"` state to `"updating"` (tiny gap between
|
||
`docker stack deploy` returning and the Swarm manager scheduling the roll). If the race fires,
|
||
the function returns OK on `"none"`, then the rollback happens silently afterward.
|
||
Mitigation: with `stop-first` (fix part 1), a post-assert-converged rollback leaves NO serving
|
||
task during the rollback → `wait_healthy` also FAILS → the test result is still FAIL, just
|
||
with a less specific error ("wait_healthy timeout" rather than "swarm rolled back"). HC1 is
|
||
NOT weakened even if the race fires. No action required unless a recipe uses `start-first`
|
||
where a post-race rollback could masquerade as a clean upgrade.
|
||
|
||
**Status:** no `claim(dstamp)` commit yet. Awaiting M1 claim to issue formal verdict.
|