# REVIEW — phase `redfix` (Adversary)

Phase SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md`
Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds,
gitea, keycloak), isolate → root-cause → classify (flake vs genuine; recipe vs test vs
warm-machinery vs load) → FIX each via a recipe PR or harness improvement → verify green.
No standing exceptions. Nothing merged.

Gates:
- **M1** — all six investigated in isolation, classified with evidence. Adversary cold-verifies:
  claimed flake = reproducibly green in isolation (and red under load); claimed recipe defect =
  genuinely the recipe (not a stale test / harness artifact); claimed warm-machinery bug = in cc-ci.
- **M2** — all six FIXED + verified green (recipe PR via `!testme`; harness/cc-ci PR via the harness;
  flake-stabilization green under load). All six promote/pass. No standing exception. Nothing merged.

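The M1 cold-verification rule for a claimed flake (green in isolation, red under load) can be sketched as a small decision function. This is only an illustration of the rule as stated in the gate, not harness code: `Run`, `classify`, and the label strings are hypothetical names, and it assumes each run is reduced to a pass/fail flag plus whether it ran under concurrent load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One acceptance run of a recipe (hypothetical record, not a cc-ci type)."""
    passed: bool
    under_load: bool


def classify(runs: list[Run]) -> str:
    """Apply the M1 rule: a flake is green in isolation and red under load."""
    isolated = [r for r in runs if not r.under_load]
    loaded = [r for r in runs if r.under_load]
    if not isolated:
        return "unverified"      # no isolation evidence, nothing to conclude
    if all(not r.passed for r in isolated):
        return "genuine"         # deterministic red without any load
    if any(not r.passed for r in isolated):
        return "inconclusive"    # mixed isolation results need more runs
    if any(not r.passed for r in loaded):
        return "flake"           # only the loaded runs went red
    return "green"
```

For example, three isolation greens plus one red under load classify as `"flake"`, while a single isolation red classifies as `"genuine"`.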
DONE = Builder writes `## DONE` only after M1+M2 fresh Adversary PASS here.

---

## Verdicts

### M1 — investigate + isolate + classify: **PASS** @ 2026-06-18T01:18Z

Gate claim: `claim(redfix-M1)` commit `0a06c41` (@00:25Z). Verified from a COLD START on cc-ci with my
OWN isolation re-runs (one recipe at a time, no concurrent load) — NOT the Builder's logs. Isolation
discipline honored: verdict formed from the phase plan (SSOT), the recipe code / git history, the
verification info in STATUS, and my own cold acceptance runs; I did NOT read JOURNAL-redfix.md before
writing this verdict.

All six classifications are CORRECT. Evidence per recipe (full detail in the verification log below):

| Recipe | My independent reproduction | Classification — verified |
|---|---|---|
| **discourse** | my isolation run `/tmp/adv-discourse.log`: install/backup/restore/custom PASS, upgrade FAIL on the 2 PR-faithfulness overlay asserts; **converged in minutes, no FATA/rc=142/wedge** | **stale/PR-specific cc-ci OVERLAY test** (canon "timeout" root-cause was WRONG — confirmed). Recipe deploys+serves fine. ✔ |
| **mattermost-lts** | my isolation run `/tmp/adv-mattermost.log`: **restore FAIL deterministically** (`relation "ci_marker" does not exist`, 91s, isolated) | **genuine RECIPE defect** — no `backupbot.restore.post-hook`; NOT the canon "loaded-node race." ✔ |
| **mumble** | my isolation run `/tmp/adv-mumble.log`: ALL 5 tiers GREEN incl `test_handshake_completes_with_channel_presence`; promote OK | **load/timing FLAKE** — green in isolation (a recipe defect would red deterministically; it didn't). ✔ |
| **bluesky-pds** | my isolation run `/tmp/adv-bluesky.log` + live caddy diag: cold GREEN, warm promote **000 deterministic**; `getent app`→10.10.0.4 (foreign proxy), own app 10.0.5.6 never resolved; caddy log cycles `dial 10.10.0.{4..12}:3000 refused` | **genuine recipe ROUTING defect** (bare `app` + caddy on shared `proxy`), NOT cc-ci promote-machinery (it correctly refused to promote), NOT flake. (Reverses the plan's "warm-machinery" prior — confirmed against it.) ✔ |
| **gitea** | my isolation run `/tmp/adv-gitea.log` + container crash log: cold GREEN, warm advance crash-loops 0/1; `LoadCommonSettings() [F] … error saving JWT Secret … "/etc/gitea/app.ini": read-only file system`; canonical correctly stayed 3.5.3 (promote timed out, refused) | **genuine RECIPE defect** (3.6.0 JWT save vs read-only app.ini docker-config mount; `/etc/gitea` is a writable volume but the app.ini file is the RO config). ✔ |
| **keycloak** | code-verified: `canonical.canonical_domain('keycloak')`→`warm.stable_domain`→`warm-keycloak.ci.commoninternet.net` == `warm.WARM_DOMAINS['keycloak']` (warm.py:47 documents the equality); live keycloak 200 on `/realms/master` | **HARNESS defect** (data-warm canonical domain collides with the live-warm OIDC provider; no collision-free namespace). ✔ |

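The bluesky-pds row rests on one observation: the bare alias `app` resolved to an address outside the stack's own internal network. A minimal sketch of that check follows. It is illustrative only: `alias_is_own` is a hypothetical helper, and the `/24` prefix on the internal network is an assumption (the log records individual addresses such as 10.0.5.6 and 10.10.0.4, not the subnet masks).

```python
import ipaddress
import socket


def alias_is_own(hostname: str, own_cidr: str, resolver=socket.gethostbyname) -> bool:
    """Return True if `hostname` resolves inside the stack's own network.

    `resolver` is injectable so the check can be exercised without live DNS;
    by default it performs a real lookup via socket.gethostbyname.
    """
    resolved = ipaddress.ip_address(resolver(hostname))
    return resolved in ipaddress.ip_network(own_cidr)


# Assumed internal subnet for illustration; only the addresses come from the log.
INTERNAL_CIDR = "10.0.5.0/24"
```

Fed the two addresses from the diagnosis, the foreign proxy-net endpoint 10.10.0.4 fails the check and the stack's own app at 10.0.5.6 passes it, which is the distinction the classification depends on.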
No defects in the classification work. No VETO. Node verified clean before AND after my runs (only
infra + live warm-keycloak; gitea restored to undeployed idle 3.5.3, volumes retained, canonical commit
`e6a1cc79` unchanged; warm-keycloak healthy throughout). **M1 PASS — Builder cleared to proceed to M2.**
(M2 will re-verify each FIX green; this PASS is for the investigation/classification gate only.)

_(prior placeholder removed)_

## Adversary verification log

- 2026-06-17T23:18Z — Phase redfix opened. Refreshed phase plan + plan.md §6.1. Cold access to cc-ci
  confirmed (`ssh cc-ci`: host `nixos`, uptime 4d, `systemctl --failed` empty, load ~0.8). No Builder
  state files (`STATUS/BACKLOG/JOURNAL-redfix.md`) yet; no gate claimed. Idling for the first claim.
- 2026-06-18T00:10Z — Non-contending pre-staging (M1 NOT yet claimed; Builder mid-investigation:
  gitea isolation running, keycloak pending). Stayed OFF the swarm to avoid contaminating the
  Builder's isolation runs. Independently corroborated two deterministic static claims via pure
  code reads on cc-ci (no deploys):
  * **mattermost-lts** (recipe @ `2.1.9+10.11.15`): postgres svc has `backupbot.backup.pre-hook`
    (pg_dump → /var/lib/postgresql/data/postgres-backup.sql), `backup.post-hook` (rm dump),
    `backup.path=/var/lib/postgresql/data/` (hot live PGDATA) — and **NO `backupbot.restore.post-hook`**.
    immich (passes) uses dump-only `backup.volumes.postgres.path: backup.sql` +
    `restore.post-hook: /pg_backup.sh restore`. Corroborates "genuine recipe defect — no restore round-trip." ✔ pre-staged.
  * **discourse** (recipe @ `0.8.1+3.5.0` = `bitnamilegacy/discourse:3.5.0` + sidekiq): overlay
    `tests/discourse/test_upgrade.py` is a phase-prevb PR-faithfulness test asserting app image ==
    official `discourse/discourse:3.5.3` AND sidekiq dropped — only true on an unreleased PR head, not
    the latest release the canon sweep deploys. So it is red by construction in the sweep. Corroborates
    "stale/PR-specific overlay test, not flake/timeout/recipe-deploy." ✔ pre-staged.
  * STILL OWED before any M1 PASS: my OWN cold isolation run of discourse to confirm the
    re-classification from the original canon hypothesis ("cold-deploy timeout, ~51-min wedge") to
    "deploys+serves fine, only the overlay test reds." Will run when M1 is claimed and the swarm is
    free (Builder not deploying). Same for bluesky app-alias collision (needs live caddy/getent diag).

  These are NOT verdicts — formal M1 PASS/FAIL awaits the Builder's gate claim.
- 2026-06-18T00:25Z — **M1 CLAIMED** (commit 0a06c41). Node verified idle/clean before any run
  (only infra + live warm-keycloak; no bluesky/test stacks; no run_recipe_ci; load 0.03; gitea idle
  3.5.3) — Builder "node clean" claim ✔. Began my own COLD isolation re-runs (one at a time, no
  concurrent load), swarm confirmed free.
- 2026-06-18T00:29Z — **bluesky-pds CONFIRMED by my own reproduction** (`/tmp/adv-bluesky.log`,
  tag 0.3.0+v0.4.219, RECIPE=bluesky-pds CCCI_SKIP_FETCH=1). Cold lifecycle GREEN
  (install/backup/restore/custom=pass, upgrade=skip) — reproduced. WC5 promote → unhealthy, 000.
  DECISIVE live diag inside the warm caddy container (60326521a2ac, nets: proxy=10.10.52.13 +
  internal=10.0.5.3):
  * `getent hosts app` → **10.10.0.4** (a *proxy*-net foreign endpoint) — NOT bluesky's own app.
  * bluesky's OWN app is at internal **10.0.5.6** (real target), never resolved.
  * caddy TLS log cycles `dial tcp 10.10.0.{4,5,6,8,10,11,12}:3000: connect: connection refused`
    on `ask http://app:3000/tls-check` → on-demand cert denied → TLS fails → /xrpc/_health = 000.

  Verdict basis: NOT a flake (deterministic, every retry refused); NOT promote-machinery (the probe
  correctly refuses an unhealthy endpoint, no false promote); **genuine recipe routing defect** —
  recipe names its svc `app` + puts caddy on the shared multi-tenant `proxy` net + Caddyfile uses bare
  `app`, so docker DNS resolves `app` to OTHER stacks' apps. Builder's classification (recipe defect,
  reverses the plan's "cc-ci warm-machinery" prior) is CORRECT. Sharper than Builder's note (my run's
  internal IP 10.0.5.6 vs their 10.0.3.3 — same mechanism, different deploy). Letting run finish + will
  tear down the orphan warm-bluesky stack. [interim — full M1 verdict batched after mumble+discourse.]
- 2026-06-18T00:38Z — bluesky run finished; promote log `!! WC5 promote failed (non-fatal; known-good
  unchanged) … last status 0` — **machinery correctly refused to write canonical** (seals "not
  promote-machinery"). Cleaned up: `docker stack rm warm-bluesky-pds…` + removed both volumes
  (caddy_data, pds_data). Node verified clean of bluesky.
- 2026-06-18T00:44Z — **mumble CONFIRMED by my own isolation run** (`/tmp/adv-mumble.log`, tag
  1.0.0+v1.6.870-0). ALL 5 tiers GREEN: install/upgrade/backup/restore/custom = pass. The exact
  canon-sweep failure
  `tests/mumble/custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence`
  **PASSED** in isolation. WC5 promote SUCCEEDED (canonical advanced to known-good
  1.0.0+v1.6.870-0, idle, volume retained). A recipe defect would fail deterministically in isolation
  (cf. mattermost restore) — mumble passing cleanly confirms **load/timing FLAKE**, not a recipe bug.
  (My 1 isolation green + Builder's 2× = 3 isolation greens / 0 isolation reds vs 1 canon red-under-load
  — consistent flake signature.) Builder's classification CORRECT.
- 2026-06-18T00:53Z — **discourse CONFIRMED by my own isolation run** (`/tmp/adv-discourse.log`, tag
  0.8.1+3.5.0). Tiers: **install pass / upgrade FAIL / backup pass / restore pass / custom pass** —
  exactly the Builder's claim. Deploy **converged in minutes; NO FATA, NO rc=142/143, NO ~51-min
  wedge** → the original canon "cold-deploy timeout" hypothesis did NOT reproduce in isolation (Builder
  reclassification CORRECT). Upgrade failed on the two PR-faithfulness overlay assertions:
  `test_head_runs_official_image_not_bitnamilegacy` (deployed image =
  `bitnamilegacy/discourse:3.5.0@sha256:db7e...`, the release's own image) and
  `test_sidekiq_service_dropped_by_head` (services = `['app','db','redis','sidekiq']`).
  The overlay demands official `discourse/discourse:3.5.3` + no
  sidekiq — an unreleased PR migration in NO release tag and NOT in main (verified earlier: tag AND
  main both `bitnamilegacy:3.5.0`+sidekiq). AssertionError self-documents "the prevb bug." So the
  recipe DEPLOYS+SERVES fine; only the stale/PR-specific overlay reds by construction in the canonical
  sweep. **stale cc-ci OVERLAY test**, not flake/timeout/recipe-deploy/warm-machinery. Builder CORRECT.
- 2026-06-18T01:02Z — **mattermost-lts CONFIRMED by my own isolation run** (`/tmp/adv-mattermost.log`,
  tag 2.1.9+10.11.15). Tiers: install pass / upgrade pass / backup pass / **restore FAIL** / custom
  pass — exactly Builder's claim. The overlay
  `tests/mattermost-lts/test_restore.py::test_restore_returns_state` FAILED with the EXACT
  `RuntimeError: docker exec … postgres failed (rc=1): ERROR: relation "ci_marker" does not exist`.
  **Deterministic in isolation** (91s, no
  concurrent load) → NOT the canon "loaded-node db-cycle race." Note: generic `test_restore_healthy`
  PASSED (app returns healthy) but the STATE round-trip failed — the seeded marker is gone after
  restore. Mechanism matches the static finding: backup dumps + backs up hot PGDATA but has NO
  `backupbot.restore.post-hook` to replay the dump → postgres logical data never round-trips. **genuine
  RECIPE defect**, not a flake/load-race/stale-test. Builder's classification CORRECT.
- 2026-06-18T01:09Z — **gitea CONFIRMED by my own isolation run + container crash log**
  (`/tmp/adv-gitea.log`, tag 3.6.0+1.24.2-rootless). Cold lifecycle all 5 tiers GREEN (incl fresh
  3.5.3→3.6.0 upgrade tier). WC5 advance (reattach idle 3.5.3 volumes with 3.6.0 image) → warm-gitea
  app crash-loops 0/1. Container log (every task, e.g. .8zd4952…): `setting.go:105:LoadCommonSettings()
  [F] Unable to load settings from config: error saving JWT Secret for custom config: failed to save
  "/etc/gitea/app.ini": open /etc/gitea/app.ini: read-only file system`. Mount nuance CONFIRMED:
  `/etc/gitea` is a writable VOLUME (RW=true) but app.ini is a docker CONFIG overlaying that path
  read-only → gitea can write the dir but NOT the app.ini file. **genuine RECIPE defect** (3.6.0 JWT
  save vs read-only app.ini config mount). Cold passes (fresh render, no runtime save). Builder's
  classification + proposed fix (render app.ini into the writable volume) CORRECT. Will verify
  canonical stays 3.5.3 (promote refused) + restore warm-gitea to undeployed idle.

- 2026-06-18T02:15Z — **M2 interim corroboration (NOT a verdict — M2 not yet claimed).** Node cold-checked
  idle (load 0.07, no run_recipe_ci/abra, only live warm-keycloak) — Builder between M2 fixes, so I stayed
  OFF the swarm (no contending deploy). Non-contending read-only check of the one fix marked DONE
  (mattermost-lts PR #1, ref `4ca7f4182d83`): cc-ci run **#901** artifacts on cc-ci
  (`/var/lib/cc-ci-runs/901/`) confirm all tiers pass (install/upgrade/backup/restore/custom), rungs all
  pass, `flags.clean_teardown=true`, `flags.no_secret_leak=true`, `WARM_CANONICAL=true`. The exact
  M1-failing test now PASSES: `junit/restore__cc-ci__test_restore.xml` → testsuite
  `failures="0" errors="0" skipped="0" tests="1"`, testcase `test_restore_returns_state`. This is a
  read-only artifact check, NOT my own cold re-run — the formal M2 PASS will require my own cold
  re-verification of all six fixes once the Builder claims M2. Pre-staged anchor only.

- 2026-06-18T04:12Z — **Idle break-it probe (NOT a verdict — M2 not yet claimed).** Cold-checked node
  while Builder reworks bluesky+gitea (their journal: 4/6 verified, bluesky warm-verify structurally
  blocked pre-merge, gitea needs rework). Stayed OFF the swarm. Observations: live
  `warm-keycloak.ci.commoninternet.net/realms/master` = **200** (live shared SSO undisturbed by the
  keycloak harness fix + its verify run — the keycloak DoD's hard constraint holds). Deployed stacks =
  infra + live warm-keycloak + a `warm-gitea` (Builder's active rework; app `/api/v1/version`=404 =
  wizard mode, consistent with their "gitea fix v1 broke 3.5.3→3.6.0 transition"). No orphan
  test/bluesky stacks, no `run_recipe_ci` procs, load 0.44. **Critical break-it check PASSED: gitea
  canonical is UNCHANGED** — `/var/lib/ci-warm/gitea/canonical.json` still `3.5.3+1.24.2-rootless`,
  commit `e6a1cc79`, status `idle`, ts `20260617T083930Z` (identical to M1). The Builder's broken gitea
  fix attempts did NOT falsely promote 3.6.0 to canonical. Idling for the M2 gate claim.

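The "canonical is UNCHANGED" check amounts to comparing a saved snapshot of `canonical.json` against its current contents, field by field. A rough sketch under stated assumptions: the key names `version`, `commit`, `status`, and `ts` are guesses for illustration (the log records the values, not the JSON schema), and `canonical_drift` is a hypothetical helper rather than part of cc-ci.

```python
import json

# Assumed key names; only the values below are taken from the log.
WATCHED_KEYS = ("version", "commit", "status", "ts")


def canonical_drift(before: dict, after: dict, keys=WATCHED_KEYS) -> dict:
    """Return {key: (before, after)} for every watched key whose value changed."""
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


# Snapshot of the gitea canonical as recorded at M1.
m1_snapshot = json.loads(
    '{"version": "3.5.3+1.24.2-rootless", "commit": "e6a1cc79", '
    '"status": "idle", "ts": "20260617T083930Z"}'
)

# An empty result means no watched field moved, i.e. nothing was promoted.
unchanged = canonical_drift(m1_snapshot, dict(m1_snapshot)) == {}
```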
---

## M2 gate verification (CLAIMED 2026-06-18T05:53Z) — component re-runs in progress

Verifying all 6 fixes from a COLD START via my own independent harness checkout (`/tmp/adv-m2` on cc-ci
@ origin/redfix-m2-harness b96b8a4 = keycloak 61211db + mumble 07fc6d4 + bluesky exec-into-pds b96b8a4)
and my own chaos-deploys. One recipe at a time, no concurrent load. Node idle at start (load 0.02, only
live warm-keycloak). Static code review of the harness branch first: canonical.py adds `warm-canon-<r>`
for r in `warm.WARM_DOMAINS` (ONLY keycloak — confirmed, so zero blast radius on the other 15
canonicals); mumble widens handshake budget 12->36 attempts (60s->180s) with the asserts UNCHANGED
(non-weakening); keycloak recipe_meta WARM_CANONICAL False->True. All three are genuine, not
test-disabling.

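The namespace change reviewed above can be pictured as follows. This is a sketch of the behaviour described in the review, not the contents of `canonical.py` or `warm.py`: the function bodies are reconstructed for illustration, and the `warm-<recipe>` pattern used for recipes outside `WARM_DOMAINS` is an assumption.

```python
BASE_DOMAIN = "ci.commoninternet.net"

# Recipes that also run as live-warm providers (per the review: only keycloak).
WARM_DOMAINS = {"keycloak": f"warm-keycloak.{BASE_DOMAIN}"}


def stable_domain(recipe: str) -> str:
    """Assumed pre-fix naming: one stable warm domain per recipe."""
    return f"warm-{recipe}.{BASE_DOMAIN}"


def canonical_domain(recipe: str) -> str:
    """Data-warm canonical domain, kept off any live-warm provider's domain."""
    if recipe in WARM_DOMAINS:
        # Collision-free namespace for recipes that have a live-warm twin.
        return f"warm-canon-{recipe}.{BASE_DOMAIN}"
    return stable_domain(recipe)
```

Under this sketch only recipes listed in `WARM_DOMAINS` change domain, which is the "zero blast radius" property the static review checks for.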
- 2026-06-18T06:08Z — **keycloak component VERIFIED (1/6)** by my OWN cold harness run
  (`/tmp/adv-keycloak-m2.log`, RECIPE=keycloak from /tmp/adv-m2 @b96b8a4, recipe tag 10.8.0+26.6.3).
  RUN SUMMARY: deploy-count=1, **all 5 cold tiers pass** (install/upgrade/backup/restore/custom incl
  `custom/test_password_grant_token.py::test_password_grant_issues_valid_jwt`). **WC5 promote landed at
  the COLLISION-FREE domain**: `/var/lib/ci-warm/keycloak/canonical.json` domain=
  `warm-canon-keycloak.ci.commoninternet.net`, version 10.8.0+26.6.3, status idle, ts 20260618T060549Z
  (THIS run). Promote genuinely DEPLOYED there — its own volumes exist (`warm-canon-keycloak_…_mariadb`,
  `_providers`). **Hard invariant HOLDS — live shared SSO undisturbed**: live
  `warm-keycloak_ci_commoninternet_net_app` up **4 days**, service last Updated **2026-06-13** (predates
  my 06:04Z run by days → NOT bounced); `warm-keycloak.ci.commoninternet.net/realms/master` = **200**
  before/during/after. The data-warm canonical (warm-canon-keycloak) and live-warm provider
  (warm-keycloak) are fully separate deployments that never touched. Builder's keycloak fix CORRECT +
  non-weakening; the §2.B de-enrollment is now structurally resolved. (1/6)

- 2026-06-18T06:15Z — **mumble component VERIFIED (2/6)** by my OWN cold harness run
  (`/tmp/adv-mumble-m2.log`, RECIPE=mumble from /tmp/adv-m2, recipe tag 1.0.0+v1.6.870-0). RUN SUMMARY:
  deploy-count=1, **all 5 cold tiers pass**. The stabilized custom test
  `test_handshake_completes_with_channel_presence` **PASSED** (junit failures=0, time=10.3s). The
  handshake completing in ~10s confirms M1's **load/timing-FLAKE** classification (fast in isolation,
  nowhere near even the OLD 60s budget) and that the fix — widening 12->36 attempts (60s->180s) — is
  pure headroom: the asserts are UNCHANGED, so a genuinely dead server still exhausts all 36 retries
  and FAILs. **Non-weakening.** WC5 promote: `/var/lib/ci-warm/mumble/canonical.json` version
  1.0.0+v1.6.870-0, idle, ts 20260618T061114Z (THIS run). Builder's mumble fix CORRECT. (2/6)

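Why a wider retry budget is non-weakening can be shown with a minimal retry loop. The sketch below is illustrative, not the mumble test itself: `wait_for_handshake` and `probe` are hypothetical names, and the fixed 5-second interval is inferred from the stated budgets (60s/12 = 180s/36 = 5s) rather than read from the test source.

```python
import time


def wait_for_handshake(probe, attempts=36, interval=5.0, sleep=time.sleep):
    """Call `probe` until it returns True; return the attempt number that passed.

    The pass condition is whatever `probe` checks and is not loosened by a
    larger `attempts`; only the time allowed to reach it grows.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(interval)  # no sleep after the final failed attempt
    raise AssertionError(f"handshake did not complete in {attempts} attempts")
```

A server that answers on the third probe returns early regardless of the budget, while a server that never answers still fails after all 36 attempts, so widening 12 to 36 adds headroom without hiding a dead service.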
NOTE on branch state: I cloned /tmp/adv-m2 at tip `b96b8a4` just before the Builder force-reset
`redfix-m2-harness` to `07fc6d4` (dropping a bluesky exec-into-pds commit). Confirmed
`git diff 07fc6d4 b96b8a4` = ONLY `tests/bluesky-pds/_p4.py` + `test_account_and_post.py` (2 lines,
bluesky-only) → keycloak (61211db) and mumble (07fc6d4) code are BYTE-IDENTICAL between b96b8a4 and
the claimed tip 07fc6d4, so my keycloak+mumble PASSES hold at the claimed state. bluesky is verified
separately via recipe chaos-deploy (PR #4 @4987ba9, now recipe-PR-only per operator directive), so
the harness-checkout staleness does not touch it.

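The staleness argument in the note reduces to one check: every path that differs between the two commits sits under the bluesky test directory. A sketch of that check follows, with hedges: `changed_paths` and `paths_outside` are hypothetical helpers, and the example list assumes `test_account_and_post.py` lives under `tests/bluesky-pds/`, which the note implies but does not spell out.

```python
import subprocess


def changed_paths(repo: str, old: str, new: str) -> list[str]:
    """List files that differ between two commits via `git diff --name-only`."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--name-only", old, new],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def paths_outside(paths: list[str], allowed_prefix: str) -> list[str]:
    """Return the paths that fall outside the allowed directory prefix."""
    return [p for p in paths if not p.startswith(allowed_prefix)]


# Illustrative input; the second path's directory is an assumption.
example = ["tests/bluesky-pds/_p4.py", "tests/bluesky-pds/test_account_and_post.py"]
```

An empty result from `paths_outside(changed_paths(repo, "07fc6d4", "b96b8a4"), "tests/bluesky-pds/")` would correspond to the "bluesky-only" finding; any other path in the result would invalidate carrying the keycloak and mumble passes over to the claimed tip.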
- 2026-06-18T06:18Z — **gitea component VERIFIED (3/6)** by my OWN direct chaos-deploy of recipe PR #2
  @a0f2db8 onto the retained idle 3.5.3 canonical volumes (`/tmp/adv-gitea-m2.log`). This reproduces
  the EXACT M1 warm-advance scenario. Two-sided proof: I verified the UNFIXED-crashes side first-hand
  in M1 (`/tmp/adv-gitea.log`: read-only-file-system FATA at LoadCommonSettings). Now the FIX side:
  * **Fix is genuine, not test-disabling** — compose.yml moves the read-only swarm config to
    `/etc/gitea/app.ini.init`; docker-setup.sh.tmpl (v1->v3) seeds it into the WRITABLE `/etc/gitea`
    volume **only when missing OR EMPTY** (`! -s`, handling the 0-byte placeholder the old direct-config
    mount leaves); a non-empty app.ini (gitea's persisted state incl the JWT) is preserved.
  * **Pre-state genuine pre-fix**: config-volume app.ini = **0 bytes**; retained 3.5.3 data (gitea.db
    1347584 B dated 2026-06-17T08:39); canonical 3.5.3 idle e6a1cc79; stack not deployed.
  * **Deploy result**: `deploy succeeded`, NEW DEPLOYMENT a0f2db88, docker_setup_sh v3. **service 1/1,
    ZERO restarts** (task Running, no Error). **M1 read-only crash signature ABSENT** (grep of service
    logs for `read-only file system`/`LoadCommonSettings`/`[F]` = empty). **app.ini seeded 0->1862 B**
    with `[server] INSTALL_LOCK = true` (NOT wizard mode — the very bug that broke the Builder's v1
    fix). `/api/v1/version` -> **200 {"version":"1.24.2"}**; `/api/healthz` -> **200**. Retained
    gitea.db adopted in place (still 1347584 B @08:39, SQLite WAL active) — matches Builder's stated
    adoption signal (data dirs @08:39). (Empty users/repos = minimal canonical install, not a
    regression.)
  * **Merge-gating is HONEST, not a shrug**: published 3.6.0 tag = commit 357926f (independently
    confirmed) != fix commit a0f2db8, so a non-chaos WC5 promote deploys the unfixed release (the abra
    force-fetch of refs/tags/* reverts any local tag-move). Chaos-deploy of the working-tree fix is the
    maximal faithful pre-merge proof; canonical advance follows on operator merge — consistent with the
    phase's "nothing merged" constraint, NOT a standing exception.
  * **Node restored**: undeploy succeeded, app.ini truncated back to 0, recipe back to published tag,
    **canonical UNCHANGED 3.5.3 idle e6a1cc79 ts 20260617T083930Z**, stack gone. Builder's gitea fix
    CORRECT. (3/6)
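The seeding rule verified above (copy the init file only when the target is missing or empty, the shell `! -s` test) can be restated in Python to make the two cases explicit. This is an illustrative equivalent, not the recipe's `docker-setup.sh.tmpl`: `seed_if_missing_or_empty` and `demo` are hypothetical names, and the demo uses a temporary directory rather than `/etc/gitea`.

```python
import os
import shutil
import tempfile


def seed_if_missing_or_empty(init_path: str, target_path: str) -> bool:
    """Copy init_path to target_path unless the target already has content.

    Mirrors the shell test `[ ! -s target ]`: true when the file is missing
    or has size 0. Returns True if seeding happened.
    """
    if os.path.exists(target_path) and os.path.getsize(target_path) > 0:
        return False  # persisted, non-empty config is preserved
    shutil.copyfile(init_path, target_path)
    return True


def demo() -> tuple:
    """Seed over a 0-byte placeholder, then confirm a second call is a no-op."""
    with tempfile.TemporaryDirectory() as workdir:
        init = os.path.join(workdir, "app.ini.init")
        target = os.path.join(workdir, "app.ini")
        with open(init, "w") as handle:
            handle.write("[server]\nINSTALL_LOCK = true\n")
        open(target, "w").close()  # 0-byte placeholder, as in the pre-state
        first = seed_if_missing_or_empty(init, target)
        second = seed_if_missing_or_empty(init, target)
        return first, second
```

The first call seeds over the empty placeholder and the second leaves the now non-empty file alone, which is the property that keeps a persisted `app.ini` (including the saved JWT secret) intact across redeploys.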