Files
cc-ci/machine-docs/REVIEW-redfix.md

91 lines
7.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# REVIEW — phase `redfix` (Adversary)
Phase SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md`
Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds,
gitea, keycloak), isolate → root-cause → classify (flake vs genuine; recipe vs test vs
warm-machinery vs load) → FIX each via a recipe PR or harness improvement → verify green.
No standing exceptions. Nothing merged.
Gates:
- **M1** — all six investigated in isolation, classified with evidence. Adversary cold-verifies:
claimed flake = reproducibly green in isolation (and red under load); claimed recipe defect =
genuinely the recipe (not a stale test / harness artifact); claimed warm-machinery bug = in cc-ci.
- **M2** — all six FIXED + verified green (recipe PR via `!testme`; harness/cc-ci PR via the harness;
flake-stabilization green under load). All six promote/pass. No standing exception. Nothing merged.
DONE = Builder writes `## DONE` only after M1+M2 fresh Adversary PASS here.
---
## Verdicts
_(none yet — awaiting Builder bootstrap + first gate claim)_
## Adversary verification log
- 2026-06-17T23:18Z — Phase redfix opened. Refreshed phase plan + plan.md §6.1. Cold access to cc-ci
confirmed (`ssh cc-ci`: host `nixos`, uptime 4d, `systemctl --failed` empty, load ~0.8). No Builder
state files (`STATUS/BACKLOG/JOURNAL-redfix.md`) yet; no gate claimed. Idling for the first claim.
- 2026-06-18T00:10Z — Non-contending pre-staging (M1 NOT yet claimed; Builder mid-investigation:
gitea isolation running, keycloak pending). Stayed OFF the swarm to avoid contaminating the
Builder's isolation runs. Independently corroborated two deterministic static claims via pure
code reads on cc-ci (no deploys):
* **mattermost-lts** (recipe @ `2.1.9+10.11.15`): postgres svc has `backupbot.backup.pre-hook`
(pg_dump → /var/lib/postgresql/data/postgres-backup.sql), `backup.post-hook` (rm dump),
`backup.path=/var/lib/postgresql/data/` (hot live PGDATA) — and **NO `backupbot.restore.post-hook`**.
immich (passes) uses dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook:
/pg_backup.sh restore`. Corroborates "genuine recipe defect — no restore round-trip." ✔ pre-staged.
* **discourse** (recipe @ `0.8.1+3.5.0` = `bitnamilegacy/discourse:3.5.0` + sidekiq): overlay
`tests/discourse/test_upgrade.py` is a phase-prevb PR-faithfulness test asserting app image ==
official `discourse/discourse:3.5.3` AND sidekiq dropped — only true on an unreleased PR head, not
the latest release the canon sweep deploys. So it red-by-construction in the sweep. Corroborates
"stale/PR-specific overlay test, not flake/timeout/recipe-deploy." ✔ pre-staged.
* STILL OWED before any M1 PASS: my OWN cold isolation run of discourse to confirm the
re-classification from the original canon hypothesis ("cold-deploy timeout, ~51-min wedge") to
"deploys+serves fine, only the overlay test reds." Will run when M1 is claimed and the swarm is
free (Builder not deploying). Same for bluesky app-alias collision (needs live caddy/getent diag).
These are NOT verdicts — formal M1 PASS/FAIL awaits the Builder's gate claim.
- 2026-06-18T00:25Z — **M1 CLAIMED** (commit 0a06c41). Node verified idle/clean before any run
(only infra + live warm-keycloak; no bluesky/test stacks; no run_recipe_ci; load 0.03; gitea idle
3.5.3) — Builder "node clean" claim ✔. Began my own COLD isolation re-runs (one at a time, no
concurrent load), swarm confirmed free.
- 2026-06-18T00:29Z — **bluesky-pds CONFIRMED by my own reproduction** (`/tmp/adv-bluesky.log`,
tag 0.3.0+v0.4.219, RECIPE=bluesky-pds CCCI_SKIP_FETCH=1). Cold lifecycle GREEN (install/backup/
restore/custom=pass, upgrade=skip) — reproduced. WC5 promote → unhealthy, 000. DECISIVE live diag
inside the warm caddy container (60326521a2ac, nets: proxy=10.10.52.13 + internal=10.0.5.3):
* `getent hosts app`**10.10.0.4** (a *proxy*-net foreign endpoint) — NOT bluesky's own app.
* bluesky's OWN app is at internal **10.0.5.6** (real target), never resolved.
* caddy TLS log cycles `dial tcp 10.10.0.{4,5,6,8,10,11,12}:3000: connect: connection refused`
on `ask http://app:3000/tls-check` → on-demand cert denied → TLS fails → /xrpc/_health = 000.
Verdict basis: NOT a flake (deterministic, every retry refused); NOT promote-machinery (the probe
correctly refuses an unhealthy endpoint, no false promote); **genuine recipe routing defect**
recipe names its svc `app` + puts caddy on the shared multi-tenant `proxy` net + Caddyfile uses bare
`app`, so docker DNS resolves `app` to OTHER stacks' apps. Builder's classification (recipe defect,
reverses the plan's "cc-ci warm-machinery" prior) is CORRECT. Sharper than Builder's note (my run's
internal IP 10.0.5.6 vs their 10.0.3.3 — same mechanism, different deploy). Letting run finish + will
tear down the orphan warm-bluesky stack. [interim — full M1 verdict batched after mumble+discourse.]
- 2026-06-18T00:38Z — bluesky run finished; promote log `!! WC5 promote failed (non-fatal; known-good
unchanged) … last status 0` — **machinery correctly refused to write canonical** (seals "not
promote-machinery"). Cleaned up: `docker stack rm warm-bluesky-pds…` + removed both volumes
(caddy_data, pds_data). Node verified clean of bluesky.
- 2026-06-18T00:44Z — **mumble CONFIRMED by my own isolation run** (`/tmp/adv-mumble.log`, tag
1.0.0+v1.6.870-0). ALL 5 tiers GREEN: install/upgrade/backup/restore/custom = pass. The exact
canon-sweep failure `tests/mumble/custom/test_protocol_handshake.py::test_handshake_completes_with_
channel_presence` **PASSED** in isolation. WC5 promote SUCCEEDED (canonical advanced to known-good
1.0.0+v1.6.870-0, idle, volume retained). A recipe defect would fail deterministically in isolation
(cf. mattermost restore) — mumble passing cleanly confirms **load/timing FLAKE**, not a recipe bug.
(My 1 isolation green + Builder's 2× = 3 isolation greens / 0 isolation reds vs 1 canon red-under-load
— consistent flake signature.) Builder's classification CORRECT.
- 2026-06-18T00:53Z — **discourse CONFIRMED by my own isolation run** (`/tmp/adv-discourse.log`, tag
0.8.1+3.5.0). Tiers: **install pass / upgrade FAIL / backup pass / restore pass / custom pass** —
exactly the Builder's claim. Deploy **converged in minutes; NO FATA, NO rc=142/143, NO ~51-min
wedge** → the original canon "cold-deploy timeout" hypothesis did NOT reproduce in isolation (Builder
reclassification CORRECT). Upgrade failed on the two PR-faithfulness overlay assertions:
`test_head_runs_official_image_not_bitnamilegacy` (deployed image = `bitnamilegacy/discourse:3.5.0@
sha256:db7e...`, the release's own image) and `test_sidekiq_service_dropped_by_head` (services =
`['app','db','redis','sidekiq']`). The overlay demands official `discourse/discourse:3.5.3` + no
sidekiq — an unreleased PR migration in NO release tag and NOT in main (verified earlier: tag AND
main both `bitnamilegacy:3.5.0`+sidekiq). AssertionError self-documents "the prevb bug." So the
recipe DEPLOYS+SERVES fine; only the stale/PR-specific overlay reds by construction in the canonical
sweep. **stale cc-ci OVERLAY test**, not flake/timeout/recipe-deploy/warm-machinery. Builder CORRECT.