# REVIEW — phase `redfix` (Adversary)
Phase SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md`
Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds,
gitea, keycloak), isolate → root-cause → classify (flake vs genuine; recipe vs test vs
warm-machinery vs load) → FIX each via a recipe PR or harness improvement → verify green.
No standing exceptions. Nothing merged.
Gates:
- **M1** — all six investigated in isolation, classified with evidence. Adversary cold-verifies:
claimed flake = reproducibly green in isolation (and red under load); claimed recipe defect =
genuinely the recipe (not a stale test / harness artifact); claimed warm-machinery bug = in cc-ci.
- **M2** — all six FIXED + verified green (recipe PR via `!testme`; harness/cc-ci PR via the harness;
flake-stabilization green under load). All six promote/pass. No standing exception. Nothing merged.
DONE = Builder writes `## DONE` only after M1+M2 fresh Adversary PASS here.

---
## Verdicts
### M1 — investigate + isolate + classify: **PASS** @ 2026-06-18T01:18Z
Gate claim: `claim(redfix-M1)` commit `0a06c41` (@00:25Z). Verified from a COLD START on cc-ci with my
OWN isolation re-runs (one recipe at a time, no concurrent load) — NOT the Builder's logs. Isolation
discipline honored: verdict formed from the phase plan (SSOT), the recipe code / git history, the
verification info in STATUS, and my own cold acceptance runs; I did NOT read JOURNAL-redfix.md before
writing this verdict.
All six classifications are CORRECT. Evidence per recipe (full detail in the verification log below):
| Recipe | My independent reproduction | Classification — verified |
|---|---|---|
| **discourse** | my isolation run `/tmp/adv-discourse.log`: install/backup/restore/custom PASS, upgrade FAIL on the 2 PR-faithfulness overlay asserts; **converged in minutes, no FATA/rc=142/wedge** | **stale/PR-specific cc-ci OVERLAY test** (canon "timeout" root-cause was WRONG — confirmed). Recipe deploys+serves fine. ✔ |
| **mattermost-lts** | my isolation run `/tmp/adv-mattermost.log`: **restore FAIL deterministically** (`relation "ci_marker" does not exist`, 91s, isolated) | **genuine RECIPE defect** — no `backupbot.restore.post-hook`; NOT the canon "loaded-node race." ✔ |
| **mumble** | my isolation run `/tmp/adv-mumble.log`: ALL 5 tiers GREEN incl `test_handshake_completes_with_channel_presence`; promote OK | **load/timing FLAKE** — green in isolation (a recipe defect would red deterministically; it didn't). ✔ |
| **bluesky-pds** | my isolation run `/tmp/adv-bluesky.log` + live caddy diag: cold GREEN, warm promote **000 deterministic**; `getent app`→10.10.0.4 (foreign proxy), own app 10.0.5.6 never resolved; caddy log cycles `dial 10.10.0.{4..12}:3000 refused` | **genuine recipe ROUTING defect** (bare `app` + caddy on shared `proxy`), NOT cc-ci promote-machinery (it correctly refused to promote), NOT flake. (Reverses the plan's "warm-machinery" prior — confirmed against it.) ✔ |
| **gitea** | my isolation run `/tmp/adv-gitea.log` + container crash log: cold GREEN, warm advance crash-loops 0/1; `LoadCommonSettings() [F] … error saving JWT Secret … "/etc/gitea/app.ini": read-only file system`; canonical correctly stayed 3.5.3 (promote timed out, refused) | **genuine RECIPE defect** (3.6.0 JWT save vs read-only app.ini docker-config mount; `/etc/gitea` is a writable volume but the app.ini file is the RO config). ✔ |
| **keycloak** | code-verified: `canonical.canonical_domain('keycloak')` → `warm.stable_domain` → `warm-keycloak.ci.commoninternet.net` == `warm.WARM_DOMAINS['keycloak']` (warm.py:47 documents the equality); live keycloak 200 on `/realms/master` | **HARNESS defect** (data-warm canonical domain collides with the live-warm OIDC provider; no collision-free namespace). ✔ |
No defects in the classification work. No VETO. Node verified clean before AND after my runs (only infra
+ live warm-keycloak; gitea restored to undeployed idle 3.5.3, volumes retained, canonical commit
`e6a1cc79` unchanged; warm-keycloak healthy throughout). **M1 PASS — Builder cleared to proceed to M2.**
(M2 will re-verify each FIX green; this PASS is for the investigation/classification gate only.)
_(prior placeholder removed)_
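
The decision rule behind the flake-versus-defect column above (green in isolation and red under load reads as a flake; red in isolation reads as deterministic, so the defect is genuine and must then be attributed to recipe, test, or harness) can be sketched as follows. This is a minimal illustration only: `Run` and `classify` are hypothetical names, not part of the cc-ci harness, and treating a single isolation red as deterministic is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One observed run of a recipe's test tiers (hypothetical record)."""
    isolated: bool  # True = one recipe at a time, no concurrent load
    green: bool     # True = every tier passed


def classify(runs: list[Run]) -> str:
    """Apply the M1 rule of thumb to a set of observed runs."""
    isolated = [r for r in runs if r.isolated]
    loaded = [r for r in runs if not r.isolated]
    if not isolated:
        return "unclassified"        # no isolation evidence yet
    if any(not r.green for r in isolated):
        return "deterministic-red"   # reds without load: a genuine defect somewhere
    if any(not r.green for r in loaded):
        return "flake"               # green alone, red only under load
    return "green"


# mumble: 3 isolation greens, 1 red under load
mumble = [Run(isolated=True, green=True)] * 3 + [Run(isolated=False, green=False)]
# mattermost-lts: red in isolation
mattermost = [Run(isolated=True, green=False)]
```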
## Adversary verification log
- 2026-06-17T23:18Z — Phase redfix opened. Refreshed phase plan + plan.md §6.1. Cold access to cc-ci
confirmed (`ssh cc-ci`: host `nixos`, uptime 4d, `systemctl --failed` empty, load ~0.8). No Builder
state files (`STATUS/BACKLOG/JOURNAL-redfix.md`) yet; no gate claimed. Idling for the first claim.
- 2026-06-18T00:10Z — Non-contending pre-staging (M1 NOT yet claimed; Builder mid-investigation:
gitea isolation running, keycloak pending). Stayed OFF the swarm to avoid contaminating the
Builder's isolation runs. Independently corroborated two deterministic static claims via pure
code reads on cc-ci (no deploys):
* **mattermost-lts** (recipe @ `2.1.9+10.11.15`): postgres svc has `backupbot.backup.pre-hook`
(pg_dump → /var/lib/postgresql/data/postgres-backup.sql), `backup.post-hook` (rm dump),
`backup.path=/var/lib/postgresql/data/` (hot live PGDATA) — and **NO `backupbot.restore.post-hook`**.
immich (passes) uses dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook:
/pg_backup.sh restore`. Corroborates "genuine recipe defect — no restore round-trip." ✔ pre-staged.
* **discourse** (recipe @ `0.8.1+3.5.0` = `bitnamilegacy/discourse:3.5.0` + sidekiq): overlay
`tests/discourse/test_upgrade.py` is a phase-prevb PR-faithfulness test asserting app image ==
official `discourse/discourse:3.5.3` AND sidekiq dropped — only true on an unreleased PR head, not
the latest release the canon sweep deploys. So it reds by construction in the sweep. Corroborates
"stale/PR-specific overlay test, not flake/timeout/recipe-deploy." ✔ pre-staged.
* STILL OWED before any M1 PASS: my OWN cold isolation run of discourse to confirm the
re-classification from the original canon hypothesis ("cold-deploy timeout, ~51-min wedge") to
"deploys+serves fine, only the overlay test reds." Will run when M1 is claimed and the swarm is
free (Builder not deploying). Same for bluesky app-alias collision (needs live caddy/getent diag).
These are NOT verdicts — formal M1 PASS/FAIL awaits the Builder's gate claim.
- 2026-06-18T00:25Z — **M1 CLAIMED** (commit 0a06c41). Node verified idle/clean before any run
(only infra + live warm-keycloak; no bluesky/test stacks; no run_recipe_ci; load 0.03; gitea idle
3.5.3) — Builder "node clean" claim ✔. Began my own COLD isolation re-runs (one at a time, no
concurrent load), swarm confirmed free.
- 2026-06-18T00:29Z — **bluesky-pds CONFIRMED by my own reproduction** (`/tmp/adv-bluesky.log`,
tag 0.3.0+v0.4.219, RECIPE=bluesky-pds CCCI_SKIP_FETCH=1). Cold lifecycle GREEN (install/backup/
restore/custom=pass, upgrade=skip) — reproduced. WC5 promote → unhealthy, 000. DECISIVE live diag
inside the warm caddy container (60326521a2ac, nets: proxy=10.10.52.13 + internal=10.0.5.3):
* `getent hosts app` → **10.10.0.4** (a *proxy*-net foreign endpoint), NOT bluesky's own app.
* bluesky's OWN app is at internal **10.0.5.6** (real target), never resolved.
* caddy TLS log cycles `dial tcp 10.10.0.{4,5,6,8,10,11,12}:3000: connect: connection refused`
on `ask http://app:3000/tls-check` → on-demand cert denied → TLS fails → /xrpc/_health = 000.
Verdict basis: NOT a flake (deterministic, every retry refused); NOT promote-machinery (the probe
correctly refuses an unhealthy endpoint, no false promote); **genuine recipe routing defect**: the
recipe names its svc `app` + puts caddy on the shared multi-tenant `proxy` net + Caddyfile uses bare
`app`, so docker DNS resolves `app` to OTHER stacks' apps. Builder's classification (recipe defect,
reverses the plan's "cc-ci warm-machinery" prior) is CORRECT. Sharper than Builder's note (my run's
internal IP 10.0.5.6 vs their 10.0.3.3 — same mechanism, different deploy). Letting run finish + will
tear down the orphan warm-bluesky stack. [interim — full M1 verdict batched after mumble+discourse.]
- 2026-06-18T00:38Z — bluesky run finished; promote log `!! WC5 promote failed (non-fatal; known-good
unchanged) … last status 0` — **machinery correctly refused to write canonical** (seals "not
promote-machinery"). Cleaned up: `docker stack rm warm-bluesky-pds…` + removed both volumes
(caddy_data, pds_data). Node verified clean of bluesky.
- 2026-06-18T00:44Z — **mumble CONFIRMED by my own isolation run** (`/tmp/adv-mumble.log`, tag
1.0.0+v1.6.870-0). ALL 5 tiers GREEN: install/upgrade/backup/restore/custom = pass. The exact
canon-sweep failure
`tests/mumble/custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence`
**PASSED** in isolation. WC5 promote SUCCEEDED (canonical advanced to known-good
1.0.0+v1.6.870-0, idle, volume retained). A recipe defect would fail deterministically in isolation
(cf. mattermost restore) — mumble passing cleanly confirms **load/timing FLAKE**, not a recipe bug.
(My 1 isolation green + Builder's 2× = 3 isolation greens / 0 isolation reds vs 1 canon red-under-load
— consistent flake signature.) Builder's classification CORRECT.
- 2026-06-18T00:53Z — **discourse CONFIRMED by my own isolation run** (`/tmp/adv-discourse.log`, tag
0.8.1+3.5.0). Tiers: **install pass / upgrade FAIL / backup pass / restore pass / custom pass** —
exactly the Builder's claim. Deploy **converged in minutes; NO FATA, NO rc=142/143, NO ~51-min
wedge** → the original canon "cold-deploy timeout" hypothesis did NOT reproduce in isolation (Builder
reclassification CORRECT). Upgrade failed on the two PR-faithfulness overlay assertions:
`test_head_runs_official_image_not_bitnamilegacy` (deployed image =
`bitnamilegacy/discourse:3.5.0@sha256:db7e...`, the release's own image) and
`test_sidekiq_service_dropped_by_head` (services =
`['app','db','redis','sidekiq']`). The overlay demands official `discourse/discourse:3.5.3` + no
sidekiq — an unreleased PR migration in NO release tag and NOT in main (verified earlier: tag AND
main both `bitnamilegacy:3.5.0`+sidekiq). AssertionError self-documents "the prevb bug." So the
recipe DEPLOYS+SERVES fine; only the stale/PR-specific overlay reds by construction in the canonical
sweep. **stale cc-ci OVERLAY test**, not flake/timeout/recipe-deploy/warm-machinery. Builder CORRECT.
- 2026-06-18T01:02Z — **mattermost-lts CONFIRMED by my own isolation run** (`/tmp/adv-mattermost.log`,
tag 2.1.9+10.11.15). Tiers: install pass / upgrade pass / backup pass / **restore FAIL** / custom
pass; exactly Builder's claim. The overlay
`tests/mattermost-lts/test_restore.py::test_restore_returns_state` FAILED with the EXACT
`RuntimeError: docker exec … postgres failed
(rc=1): ERROR: relation "ci_marker" does not exist`. **Deterministic in isolation** (91s, no
concurrent load) → NOT the canon "loaded-node db-cycle race." Note: generic `test_restore_healthy`
PASSED (app returns healthy) but the STATE round-trip failed — the seeded marker is gone after
restore. Mechanism matches the static finding: backup dumps + backs up hot PGDATA but has NO
`backupbot.restore.post-hook` to replay the dump → postgres logical data never round-trips. **genuine
RECIPE defect**, not a flake/load-race/stale-test. Builder's classification CORRECT.
- 2026-06-18T01:09Z — **gitea CONFIRMED by my own isolation run + container crash log**
(`/tmp/adv-gitea.log`, tag 3.6.0+1.24.2-rootless). Cold lifecycle all 5 tiers GREEN (incl fresh
3.5.3→3.6.0 upgrade tier). WC5 advance (reattach idle 3.5.3 volumes with 3.6.0 image) → warm-gitea
app crash-loops 0/1. Container log (every task, e.g. .8zd4952…): `setting.go:105:LoadCommonSettings()
[F] Unable to load settings from config: error saving JWT Secret for custom config: failed to save
"/etc/gitea/app.ini": open /etc/gitea/app.ini: read-only file system`. Mount nuance CONFIRMED:
`/etc/gitea` is a writable VOLUME (RW=true) but app.ini is a docker CONFIG overlaying that path
read-only → gitea can write the dir but NOT the app.ini file. **genuine RECIPE defect** (3.6.0 JWT
save vs read-only app.ini config mount). Cold passes (fresh render, no runtime save). Builder's
classification + proposed fix (render app.ini into the writable volume) CORRECT. Will verify
canonical stays 3.5.3 (promote refused) + restore warm-gitea to undeployed idle.
- 2026-06-18T02:15Z — **M2 interim corroboration (NOT a verdict — M2 not yet claimed).** Node cold-checked
idle (load 0.07, no run_recipe_ci/abra, only live warm-keycloak) — Builder between M2 fixes, so I stayed
OFF the swarm (no contending deploy). Non-contending read-only check of the one fix marked DONE
(mattermost-lts PR #1, ref `4ca7f4182d83`): cc-ci run **#901** artifacts on cc-ci
(`/var/lib/cc-ci-runs/901/`) confirm all tiers pass (install/upgrade/backup/restore/custom), rungs all
pass, `flags.clean_teardown=true`, `flags.no_secret_leak=true`, `WARM_CANONICAL=true`. The exact
M1-failing test now PASSES: `junit/restore__cc-ci__test_restore.xml` → testsuite
`failures="0" errors="0" skipped="0" tests="1"`, testcase `test_restore_returns_state`. This is a
read-only artifact check, NOT my own cold re-run — the formal M2 PASS will require my own cold
re-verification of all six fixes once the Builder claims M2. Pre-staged anchor only.
- 2026-06-18T04:12Z — **Idle break-it probe (NOT a verdict — M2 not yet claimed).** Cold-checked node
while Builder reworks bluesky+gitea (their journal: 4/6 verified, bluesky warm-verify structurally
blocked pre-merge, gitea needs rework). Stayed OFF the swarm. Observations: live
`warm-keycloak.ci.commoninternet.net/realms/master` = **200** (live shared SSO undisturbed by the
keycloak harness fix + its verify run — the keycloak DoD's hard constraint holds). Deployed stacks =
infra + live warm-keycloak + a `warm-gitea` (Builder's active rework; app `/api/v1/version`=404 =
wizard mode, consistent with their "gitea fix v1 broke 3.5.3→3.6.0 transition"). No orphan
test/bluesky stacks, no `run_recipe_ci` procs, load 0.44. **Critical break-it check PASSED: gitea
canonical is UNCHANGED** — `/var/lib/ci-warm/gitea/canonical.json` still `3.5.3+1.24.2-rootless`,
commit `e6a1cc79`, status `idle`, ts `20260617T083930Z` (identical to M1). The Builder's broken gitea
fix attempts did NOT falsely promote 3.6.0 to canonical. Idling for the M2 gate claim.
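
The read-only JUnit artifact check described in the 02:15Z entry (zero failures, errors, and skips, and the previously failing testcase present) could be automated roughly as below. This is a sketch under assumptions: `junit_all_passed` is a hypothetical helper, and `SAMPLE` is an illustrative document shaped like pytest's JUnit output, not the actual run #901 artifact. A missing counter attribute is deliberately treated as a failure so the check cannot pass vacuously.

```python
import xml.etree.ElementTree as ET


def junit_all_passed(xml_text: str, expected_testcase: str) -> bool:
    """Return True only if every suite is clean and the named testcase ran."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so a bare <testsuite> root works too.
    suites = list(root.iter("testsuite"))
    if not suites:
        return False
    for suite in suites:
        for attr in ("failures", "errors", "skipped"):
            if suite.get(attr) != "0":  # missing or non-zero counts as not clean
                return False
    names = {case.get("name") for case in root.iter("testcase")}
    return expected_testcase in names


# Illustrative sample only (assumed shape, not copied from the run artifacts).
SAMPLE = (
    '<testsuites><testsuite name="pytest" failures="0" errors="0" '
    'skipped="0" tests="1">'
    '<testcase classname="test_restore" name="test_restore_returns_state"/>'
    "</testsuite></testsuites>"
)

ok = junit_all_passed(SAMPLE, "test_restore_returns_state")
```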
---
## M2 gate verification (CLAIMED 2026-06-18T05:53Z) — component re-runs in progress
Verifying all 6 fixes from a COLD START via my own independent harness checkout (`/tmp/adv-m2` on cc-ci
@ origin/redfix-m2-harness b96b8a4 = keycloak 61211db + mumble 07fc6d4 + bluesky exec-into-pds b96b8a4)
and my own chaos-deploys. One recipe at a time, no concurrent load. Node idle at start (load 0.02, only
live warm-keycloak). Static code review of the harness branch first: canonical.py adds `warm-canon-<r>`
for r in `warm.WARM_DOMAINS` (ONLY keycloak — confirmed, so zero blast radius on the other 15
canonicals); mumble widens handshake budget 12->36 attempts (60s->180s) with the asserts UNCHANGED
(non-weakening); keycloak recipe_meta WARM_CANONICAL False->True. All three are genuine, not
test-disabling.
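
The namespace split reviewed above can be pictured with a short sketch. It is not the harness code: the function names mirror those cited in this review, but their bodies, the `BASE` constant, and the `warm-<recipe>` pattern for `stable_domain` are assumptions made for illustration. Only the keycloak entry of `WARM_DOMAINS` and the `warm-canon-<r>` prefix are taken from the review itself.

```python
# Assumed base domain, inferred from the hostnames quoted in this review.
BASE = "ci.commoninternet.net"

# Live-warm providers; the review confirms keycloak is the only entry.
WARM_DOMAINS = {"keycloak": f"warm-keycloak.{BASE}"}


def stable_domain(recipe: str) -> str:
    """Assumed pre-fix naming: one warm domain per recipe."""
    return f"warm-{recipe}.{BASE}"


def canonical_domain(recipe: str) -> str:
    """Data-warm canonical domain, kept apart from any live-warm provider."""
    if recipe in WARM_DOMAINS:
        # Collision-free namespace for recipes that also run live-warm.
        return f"warm-canon-{recipe}.{BASE}"
    # Every other recipe keeps its existing domain (zero blast radius).
    return stable_domain(recipe)
```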
- 2026-06-18T06:08Z — **keycloak component VERIFIED (1/6)** by my OWN cold harness run
(`/tmp/adv-keycloak-m2.log`, RECIPE=keycloak from /tmp/adv-m2 @b96b8a4, recipe tag 10.8.0+26.6.3).
RUN SUMMARY: deploy-count=1, **all 5 cold tiers pass** (install/upgrade/backup/restore/custom incl
`custom/test_password_grant_token.py::test_password_grant_issues_valid_jwt`). **WC5 promote landed at
the COLLISION-FREE domain**: `/var/lib/ci-warm/keycloak/canonical.json` domain=
`warm-canon-keycloak.ci.commoninternet.net`, version 10.8.0+26.6.3, status idle, ts 20260618T060549Z
(THIS run). Promote genuinely DEPLOYED there — its own volumes exist (`warm-canon-keycloak_…_mariadb`,
`_providers`). **Hard invariant HOLDS — live shared SSO undisturbed**: live
`warm-keycloak_ci_commoninternet_net_app` up **4 days**, service last Updated **2026-06-13** (predates
my 06:04Z run by days → NOT bounced); `warm-keycloak.ci.commoninternet.net/realms/master` = **200**
before/during/after. The data-warm canonical (warm-canon-keycloak) and live-warm provider
(warm-keycloak) are fully separate deployments that never touched. Builder's keycloak fix CORRECT +
non-weakening; the §2.B de-enrollment is now structurally resolved. (1/6)
- 2026-06-18T06:15Z — **mumble component VERIFIED (2/6)** by my OWN cold harness run
(`/tmp/adv-mumble-m2.log`, RECIPE=mumble from /tmp/adv-m2, recipe tag 1.0.0+v1.6.870-0). RUN SUMMARY:
deploy-count=1, **all 5 cold tiers pass**. The stabilized custom test
`test_handshake_completes_with_channel_presence` **PASSED** (junit failures=0, time=10.3s). The
handshake completing in ~10s confirms M1's **load/timing-FLAKE** classification (fast in isolation,
nowhere near even the OLD 60s budget) and that the fix — widening 12->36 attempts (60s->180s) — is
pure headroom: the asserts are UNCHANGED, so a genuinely dead server still exhausts all 36 retries
and FAILs. **Non-weakening.** WC5 promote: `/var/lib/ci-warm/mumble/canonical.json` version
1.0.0+v1.6.870-0, idle, ts 20260618T061114Z (THIS run). Builder's mumble fix CORRECT. (2/6)
NOTE on branch state: I cloned /tmp/adv-m2 at tip `b96b8a4` just before the Builder force-reset
`redfix-m2-harness` to `07fc6d4` (dropping a bluesky exec-into-pds commit). Confirmed
`git diff 07fc6d4 b96b8a4` = ONLY `tests/bluesky-pds/_p4.py` + `test_account_and_post.py` (2 lines,
bluesky-only) → keycloak (61211db) and mumble (07fc6d4) code are BYTE-IDENTICAL between b96b8a4 and
the claimed tip 07fc6d4, so my keycloak+mumble PASSES hold at the claimed state. bluesky is verified
separately via recipe chaos-deploy (PR #4 @4987ba9, now recipe-PR-only per operator directive), so
the harness-checkout staleness does not touch it.
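
The non-weakening argument made for the mumble fix (2/6) can be illustrated with a minimal retry loop. This is a sketch, not the harness implementation: `wait_for` is a hypothetical helper, and the fixed 5 s interval is inferred from the stated budgets (60 s / 12 attempts and 180 s / 36 attempts) rather than read from the code.

```python
import time
from typing import Callable


def wait_for(
    probe: Callable[[], bool],
    attempts: int,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` up to `attempts` times; the pass condition never changes."""
    for _ in range(attempts):
        if probe():
            return True       # same assertion, met early on a healthy server
        sleep(interval_s)     # budget consumed = failed attempts * interval_s
    return False              # a dead server exhausts the budget and still fails


# Record sleeps instead of waiting, to show the budget arithmetic.
slept: list[float] = []
dead_server_result = wait_for(lambda: False, attempts=36, interval_s=5.0,
                              sleep=slept.append)
total_budget_s = sum(slept)  # 36 attempts x 5 s
```

Widening `attempts` only changes how long a slow but healthy server is given; a probe that never succeeds returns `False` under either budget.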