67 lines
5.0 KiB
Markdown
67 lines
5.0 KiB
Markdown
# REVIEW — phase `redfix` (Adversary)
|
|
|
|
Phase SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md`
|
|
Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds,
|
|
gitea, keycloak), isolate → root-cause → classify (flake vs genuine; recipe vs test vs
|
|
warm-machinery vs load) → FIX each via a recipe PR or harness improvement → verify green.
|
|
No standing exceptions. Nothing merged.
|
|
|
|
Gates:
|
|
- **M1** — all six investigated in isolation, classified with evidence. Adversary cold-verifies:
|
|
claimed flake = reproducibly green in isolation (and red under load); claimed recipe defect =
|
|
genuinely the recipe (not a stale test / harness artifact); claimed warm-machinery bug = in cc-ci.
|
|
- **M2** — all six FIXED + verified green (recipe PR via `!testme`; harness/cc-ci PR via the harness;
|
|
flake-stabilization green under load). All six promote/pass. No standing exception. Nothing merged.
|
|
|
|
DONE = Builder writes `## DONE` only after M1+M2 fresh Adversary PASS here.
|
|
|
|
---
|
|
|
|
## Verdicts
|
|
|
|
_(none yet — awaiting Builder bootstrap + first gate claim)_
|
|
|
|
## Adversary verification log
|
|
|
|
- 2026-06-17T23:18Z — Phase redfix opened. Refreshed phase plan + plan.md §6.1. Cold access to cc-ci
|
|
confirmed (`ssh cc-ci`: host `nixos`, uptime 4d, `systemctl --failed` empty, load ~0.8). No Builder
|
|
state files (`STATUS/BACKLOG/JOURNAL-redfix.md`) yet; no gate claimed. Idling for the first claim.
|
|
- 2026-06-18T00:10Z — Non-contending pre-staging (M1 NOT yet claimed; Builder mid-investigation:
|
|
gitea isolation running, keycloak pending). Stayed OFF the swarm to avoid contaminating the
|
|
Builder's isolation runs. Independently corroborated two deterministic static claims via pure
|
|
code reads on cc-ci (no deploys):
|
|
* **mattermost-lts** (recipe @ `2.1.9+10.11.15`): postgres svc has `backupbot.backup.pre-hook`
|
|
(pg_dump → /var/lib/postgresql/data/postgres-backup.sql), `backup.post-hook` (rm dump),
|
|
`backup.path=/var/lib/postgresql/data/` (hot live PGDATA) — and **NO `backupbot.restore.post-hook`**.
|
|
immich (passes) uses dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook:
|
|
/pg_backup.sh restore`. Corroborates "genuine recipe defect — no restore round-trip." ✔ pre-staged.
|
|
* **discourse** (recipe @ `0.8.1+3.5.0` = `bitnamilegacy/discourse:3.5.0` + sidekiq): overlay
|
|
`tests/discourse/test_upgrade.py` is a phase-prevb PR-faithfulness test asserting app image ==
|
|
official `discourse/discourse:3.5.3` AND sidekiq dropped — only true on an unreleased PR head, not
|
|
the latest release the canon sweep deploys. So it red-by-construction in the sweep. Corroborates
|
|
"stale/PR-specific overlay test, not flake/timeout/recipe-deploy." ✔ pre-staged.
|
|
* STILL OWED before any M1 PASS: my OWN cold isolation run of discourse to confirm the
|
|
re-classification from the original canon hypothesis ("cold-deploy timeout, ~51-min wedge") to
|
|
"deploys+serves fine, only the overlay test reds." Will run when M1 is claimed and the swarm is
|
|
free (Builder not deploying). Same for bluesky app-alias collision (needs live caddy/getent diag).
|
|
These are NOT verdicts — formal M1 PASS/FAIL awaits the Builder's gate claim.
|
|
- 2026-06-18T00:25Z — **M1 CLAIMED** (commit 0a06c41). Node verified idle/clean before any run
|
|
(only infra + live warm-keycloak; no bluesky/test stacks; no run_recipe_ci; load 0.03; gitea idle
|
|
3.5.3) — Builder "node clean" claim ✔. Began my own COLD isolation re-runs (one at a time, no
|
|
concurrent load), swarm confirmed free.
|
|
- 2026-06-18T00:29Z — **bluesky-pds CONFIRMED by my own reproduction** (`/tmp/adv-bluesky.log`,
|
|
tag 0.3.0+v0.4.219, RECIPE=bluesky-pds CCCI_SKIP_FETCH=1). Cold lifecycle GREEN (install/backup/
|
|
restore/custom=pass, upgrade=skip) — reproduced. WC5 promote → unhealthy, 000. DECISIVE live diag
|
|
inside the warm caddy container (60326521a2ac, nets: proxy=10.10.52.13 + internal=10.0.5.3):
|
|
* `getent hosts app` → **10.10.0.4** (a *proxy*-net foreign endpoint) — NOT bluesky's own app.
|
|
* bluesky's OWN app is at internal **10.0.5.6** (real target), never resolved.
|
|
* caddy TLS log cycles `dial tcp 10.10.0.{4,5,6,8,10,11,12}:3000: connect: connection refused`
|
|
on `ask http://app:3000/tls-check` → on-demand cert denied → TLS fails → /xrpc/_health = 000.
|
|
Verdict basis: NOT a flake (deterministic, every retry refused); NOT promote-machinery (the probe
|
|
correctly refuses an unhealthy endpoint, no false promote); **genuine recipe routing defect** —
|
|
recipe names its svc `app` + puts caddy on the shared multi-tenant `proxy` net + Caddyfile uses bare
|
|
`app`, so docker DNS resolves `app` to OTHER stacks' apps. Builder's classification (recipe defect,
|
|
reverses the plan's "cc-ci warm-machinery" prior) is CORRECT. Sharper than Builder's note (my run's
|
|
internal IP 10.0.5.6 vs their 10.0.3.3 — same mechanism, different deploy). Letting run finish + will
|
|
tear down the orphan warm-bluesky stack. [interim — full M1 verdict batched after mumble+discourse.]
|