Files
cc-ci/REVIEW-dstamp.md

8.2 KiB
Raw Blame History

REVIEW-dstamp.md — Adversary verdicts for phase dstamp

Phase: investigate & solve the discourse abra-stamp drift (upgrade-HC1 stamps the prev-base tag commit instead of the PR-head version, harness-neutral, since ~06-10). SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-dstamp-discourse-drift.md. Gates M1, M2.

Verdict log is append-only. review(...)-prefixed commits carry verdicts (load-bearing watchdog signal). Findings filed under ## Adversary findings in BACKLOG-dstamp.md.


Prep notes (NOT a verdict — no gate claimed yet) @2026-06-11T15:5x

Recon done cold before any Builder claim, to make M1/M2 verification fast and independent. Anti-anchoring: formed only from the plan (SSOT), the harness code, and direct host evidence — no dstamp JOURNAL exists yet; none read.

Stamp mechanism (from code): HC1's "stamp" = the coop-cloud.<stack>.chaos-version docker service label abra writes on a --chaos deploy = the deployed recipe git commit (runner/harness/lifecycle.py:468 deployed_identity, runner/harness/generic.py:146 assert_upgraded). Upgrade flow (generic.py:226 perform_upgrade): deploy prev-published base → recipe_checkout_ref(recipe, head_ref) (git checkout -f head) → chaos_redeploy (abra app deploy --chaos). HC1 asserts chaos_commit == head_ref (after stripping the +U untracked-overlay marker). PASS requires the chaos-version to equal the PR head.

Cold observable facts (from /var/lib/cc-ci-runs/m2p-discourse/abra/recipes/discourse snapshot + live ~/.abra/recipes/discourse on cc-ci, 2026-06-11):

  • Recipe HEAD 7ae7b0f = "chore: upgrade to 0.9.0+3.5.0"; git describe --tags = 0.7.0+3.3.1-9-g7ae7b0f → HEAD is 9 commits past the newest annotated tag 0.7.0+3.3.1 (commit eb96de9). No 0.8.x/0.9.x tag exists.
  • The drift symptom (per plan): chaos-version stamped eb96de94+U = the prev-base tag commit (= the upgrade base 0.7.0+3.3.1), NOT the PR-head 7ae7b0f.
  • abra is nix-pinned: abra version 0.13.0-beta-06a57de, store path under /run/current-system → binary drift requires a flake.lock/nixos-generation bump between 06-05 and 06-10 (verify against generations, don't assume).

Open question I'll independently re-derive when M1 is claimed: why the --chaos redeploy after checkout-to-HEAD stamps the BASE commit (eb96de9), not HEAD (7ae7b0f). Candidates to test cold: (a) re-checkout to head silently reverted (abra fetch/reset during deploy); (b) abra chaos resolves the version from the app's recorded .env RECIPE/version (= the base) rather than the working-tree HEAD; (c) the "env drift" since 06-10 = recipe/ mirror git state moved (unreleased commits pushed past last tag) or a tag re-pointed.

Guardrail teeth I will enforce at M2: HC1 must still FAIL on a genuinely wrong stamp (synthesize a wrong-version deploy and show RED). Any "fix" that derives EXPECTED from "what makes the test pass" rather than abra's documented behavior = automatic FAIL.

Status: idle, awaiting Builder to seed STATUS-dstamp.md and claim M1. Watchdog will ping on the claim(...) commit.


Independent probe findings @2026-06-11T17:3x (NOT a verdict — no M1 claim yet)

Anti-anchoring preserved: JOURNAL-dstamp NOT read. Root cause derived independently from harness code, per-run artifacts (repro1/repro2 console logs), and direct docker service inspect on cc-ci. Independently arrived at the same attribution as the Builder.

Causal chain derived from code + direct evidence:

  1. provide_ccci_overlay (rcust-era addition) copies compose.ccci.yml into the per-run recipe dir as an UNTRACKED file. Absent in run 184 (2026-06-05, which used the old install_steps.sh path writing to canonical ~/.abra) — consistent with run 184 having no +U suffix and passing. The +U itself is stripped by HC1's chaos_commit.split("+",1)[0] and is NOT the cause of drift.

  2. abra reads git HEAD = 7ae7b0f and computes chaos-version = 7ae7b0f7+U CORRECTLY. Confirmed via three bail-at-secrets manual repros + repro2 debug line taking chaos version: 7ae7b0f7+U. abra and the per-run git checkout are EXONERATED.

  3. chaos_redeploy passes -c (no_converge_checks) → docker stack deploy returns immediately; Swarm rolling update runs asynchronously.

  4. Discourse compose.yml (BOTH base eb96de94 AND PR-head 7ae7b0f) sets deploy.update_config: { failure_action: rollback, order: start-first, monitor: 5s } on the app service. Confirmed by direct docker service inspect disc-ae10f0_..._app.

  5. With order: start-first, OLD + NEW task co-reside (~2× memory). Discourse's Rails/Sidekiq precompile is memory-heavy; under the heavier host load since ~06-10 (warm keycloak and other rcust-phase stacks), the NEW task intermittently fails swarm's 5s update monitor → failure_action: rollback fires → Swarm REVERTS the app service spec to PreviousSpec (base deploy, chaos-version=eb96de94+U).

  6. services_converged blind spot: after rollback UpdateStatus.State = "rollback_completed", NOT in the blocking set ("updating", "rollback_started") → returns True as if converged. Under start-first the OLD task kept serving → wait_healthy also passes on the rolled-back spec.

  7. deployed_identity reads .Spec.Labels → rolled-back spec → chaos-version=eb96de94+U. HC1 asserts head_ref 7ae7b0f76efbeb96de94 → FAIL with misleading "re-checkout failed".

Key disproving evidence (independent route): repro1 was isolated (no concurrent discourse run, domain disc-ae10f0 used for the first time) and STILL showed the drift. This refuted the pure-concurrency hypothesis BEFORE reading the Builder's evidence or JOURNAL.

Intermittency explained (run 184 ✓ solo 06-05; clustered/repro1/repro4 ✗; repro2 ✓): Whether the new start-first task survives the 5s monitor depends on momentary memory pressure. Run 184: solo + lighter host load + pre-rcust overlay path → new task survived. repro2: warm volumes/containers from repro1 → faster Rails precompile → task survived. The "since ~06-10 on every run" pattern = heavier baseline load from warm rcust-phase stacks after run 184.

Fix analysis (Builder commit 0cc31a5 — read before JOURNAL):

Part 1 — overlay order: stop-first: Old task stops before new starts → new boots with full host memory → no OOM under the 5s monitor → no spurious rollback. failure_action: rollback intentionally preserved so a genuinely broken head still rolls back and is caught. ASSESSMENT: CORRECT AND SUFFICIENT for eliminating the spurious-rollback trigger.

Part 2 — lifecycle.assert_upgrade_converged: Called in perform_upgrade immediately after chaos_redeploy, before wait_healthy. Polls docker service inspect --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}none{{end}}' until terminal. Returns on ""|"none"|"completed"; raises on "rollback_completed"|"rollback_paused"|"paused"; polls on "updating"|"rollback_started"; times out at meta.DEPLOY_TIMEOUT. ASSESSMENT: CORRECT — closes the wait_healthy-masking blind spot. Makes a swarm rollback an HONEST upgrade failure ("head did not stay healthy") rather than a misreported stamp mismatch. HC1 commit-match logic is unchanged; this only makes the rollback visible before HC1 runs.

One concern flagged (not a blocker — defense-in-depth covers it): assert_upgrade_converged has a theoretical race window: on the very first poll, Docker may not yet have transitioned from a prior "completed" state to "updating" (tiny gap between docker stack deploy returning and the Swarm manager scheduling the roll). If the race fires, the function returns OK on "none", then the rollback happens silently afterward. Mitigation: with stop-first (fix part 1), a post-assert-converged rollback leaves NO serving task during the rollback → wait_healthy also FAILS → the test result is still FAIL, just with a less specific error ("wait_healthy timeout" rather than "swarm rolled back"). HC1 is NOT weakened even if the race fires. No action required unless a recipe uses start-first where a post-race rollback could masquerade as a clean upgrade.

Status: no claim(dstamp) commit yet. Awaiting M1 claim to issue formal verdict.