cc-ci/machine-docs/REVIEW-redfix.md


REVIEW — phase redfix (Adversary)

Phase SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds, gitea, keycloak), isolate → root-cause → classify (flake vs genuine; recipe vs test vs warm-machinery vs load) → FIX each via a recipe PR or harness improvement → verify green. No standing exceptions. Nothing merged.

Gates:

  • M1 — all six investigated in isolation, classified with evidence. Adversary cold-verifies: claimed flake = reproducibly green in isolation (and red under load); claimed recipe defect = genuinely the recipe (not a stale test / harness artifact); claimed warm-machinery bug = in cc-ci.
  • M2 — all six FIXED + verified green (recipe PR via !testme; harness/cc-ci PR via the harness; flake-stabilization green under load). All six promote/pass. No standing exception. Nothing merged.

DONE = Builder writes ## DONE only after M1+M2 fresh Adversary PASS here.
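
The M1 flake criterion above (green in isolation, red under load) can be sketched as a small decision rule. This is only an illustration, not harness code: `RunSet` and `classify` are hypothetical names, and the rule deliberately stops at "genuine" without deciding between recipe, test, or harness, which still needs the per-recipe evidence.

```python
from dataclasses import dataclass


@dataclass
class RunSet:
    """Pass/fail counts for one recipe under one condition (hypothetical type)."""
    passed: int
    failed: int


def classify(isolation: RunSet, under_load: RunSet) -> str:
    """Apply the M1 flake criterion to a recipe's run history."""
    if isolation.passed + isolation.failed == 0:
        return "unverified"        # no isolation evidence yet
    if isolation.failed > 0:
        # Red with no concurrent load: load cannot be the cause.
        return "genuine" if isolation.passed == 0 else "inconclusive"
    if under_load.failed > 0:
        return "flake"             # green alone, red under load
    return "no-failure"


# mumble: 3 isolation greens, 0 isolation reds, 1 red in the loaded canon sweep
print(classify(RunSet(3, 0), RunSet(0, 1)))   # flake
# mattermost-lts: the restore tier reds with no concurrent load
print(classify(RunSet(0, 1), RunSet(0, 1)))   # genuine
```

A mixed isolation history is reported as "inconclusive" rather than forced into either bucket, since the gate asks for a reproducibly green isolation result before a flake claim is accepted.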


Verdicts

M1 — investigate + isolate + classify: PASS @ 2026-06-18T01:18Z

Gate claim: claim(redfix-M1) commit 0a06c41 (@00:25Z). Verified from a COLD START on cc-ci with my OWN isolation re-runs (one recipe at a time, no concurrent load) — NOT the Builder's logs. Isolation discipline honored: verdict formed from the phase plan (SSOT), the recipe code / git history, the verification info in STATUS, and my own cold acceptance runs; I did NOT read JOURNAL-redfix.md before writing this verdict.

All six classifications are CORRECT. Evidence per recipe (full detail in the verification log below):

| Recipe | My independent reproduction | Classification — verified |
| --- | --- | --- |
| discourse | my isolation run /tmp/adv-discourse.log: install/backup/restore/custom PASS, upgrade FAIL on the 2 PR-faithfulness overlay asserts; converged in minutes, no FATA/rc=142/wedge | stale/PR-specific cc-ci OVERLAY test (canon "timeout" root-cause was WRONG — confirmed). Recipe deploys+serves fine. ✔ |
| mattermost-lts | my isolation run /tmp/adv-mattermost.log: restore FAIL deterministically (relation "ci_marker" does not exist, 91s, isolated) | genuine RECIPE defect — no backupbot.restore.post-hook; NOT the canon "loaded-node race." ✔ |
| mumble | my isolation run /tmp/adv-mumble.log: ALL 5 tiers GREEN incl test_handshake_completes_with_channel_presence; promote OK | load/timing FLAKE — green in isolation (a recipe defect would red deterministically; it didn't). ✔ |
| bluesky-pds | my isolation run /tmp/adv-bluesky.log + live caddy diag: cold GREEN, warm promote 000 deterministic; getent app→10.10.0.4 (foreign proxy), own app 10.0.5.6 never resolved; caddy log cycles dial 10.10.0.{4..12}:3000 refused | genuine recipe ROUTING defect (bare app + caddy on shared proxy), NOT cc-ci promote-machinery (it correctly refused to promote), NOT flake. (Reverses the plan's "warm-machinery" prior — confirmed against it.) ✔ |
| gitea | my isolation run /tmp/adv-gitea.log + container crash log: cold GREEN, warm advance crash-loops 0/1; LoadCommonSettings() [F] … error saving JWT Secret … "/etc/gitea/app.ini": read-only file system; canonical correctly stayed 3.5.3 (promote timed out, refused) | genuine RECIPE defect (3.6.0 JWT save vs read-only app.ini docker-config mount; /etc/gitea is a writable volume but the app.ini file is the RO config). ✔ |
| keycloak | code-verified: canonical.canonical_domain('keycloak') → warm.stable_domain → warm-keycloak.ci.commoninternet.net == warm.WARM_DOMAINS['keycloak'] (warm.py:47 documents the equality); live keycloak 200 on /realms/master | HARNESS defect (data-warm canonical domain collides with the live-warm OIDC provider; no collision-free namespace). ✔ |

No defects in the classification work. No VETO. Node verified clean before AND after my runs (only infra + live warm-keycloak; gitea restored to undeployed idle 3.5.3, volumes retained, canonical commit e6a1cc79 unchanged; warm-keycloak healthy throughout). M1 PASS — Builder cleared to proceed to M2. (M2 will re-verify each FIX green; this PASS is for the investigation/classification gate only.)

(prior placeholder removed)

Adversary verification log

  • 2026-06-17T23:18Z — Phase redfix opened. Refreshed phase plan + plan.md §6.1. Cold access to cc-ci confirmed (ssh cc-ci: host nixos, uptime 4d, systemctl --failed empty, load ~0.8). No Builder state files (STATUS/BACKLOG/JOURNAL-redfix.md) yet; no gate claimed. Idling for the first claim.

  • 2026-06-18T00:10Z — Non-contending pre-staging (M1 NOT yet claimed; Builder mid-investigation: gitea isolation running, keycloak pending). Stayed OFF the swarm to avoid contaminating the Builder's isolation runs. Independently corroborated two deterministic static claims via pure code reads on cc-ci (no deploys):

    • mattermost-lts (recipe @ 2.1.9+10.11.15): postgres svc has backupbot.backup.pre-hook (pg_dump → /var/lib/postgresql/data/postgres-backup.sql), backup.post-hook (rm dump), backup.path=/var/lib/postgresql/data/ (hot live PGDATA) — and NO backupbot.restore.post-hook. immich (passes) uses dump-only backup.volumes.postgres.path: backup.sql + restore.post-hook: /pg_backup.sh restore. Corroborates "genuine recipe defect — no restore round-trip." ✔ pre-staged.
    • discourse (recipe @ 0.8.1+3.5.0 = bitnamilegacy/discourse:3.5.0 + sidekiq): overlay tests/discourse/test_upgrade.py is a phase-prevb PR-faithfulness test asserting app image == official discourse/discourse:3.5.3 AND sidekiq dropped — only true on an unreleased PR head, not on the latest release the canon sweep deploys. So it is red by construction in the sweep. Corroborates "stale/PR-specific overlay test, not flake/timeout/recipe-deploy." ✔ pre-staged.
    • STILL OWED before any M1 PASS: my OWN cold isolation run of discourse to confirm the re-classification from the original canon hypothesis ("cold-deploy timeout, ~51-min wedge") to "deploys+serves fine, only the overlay test reds." Will run when M1 is claimed and the swarm is free (Builder not deploying). Same for bluesky app-alias collision (needs live caddy/getent diag). These are NOT verdicts — formal M1 PASS/FAIL awaits the Builder's gate claim.
  • 2026-06-18T00:25Z — M1 CLAIMED (commit 0a06c41). Node verified idle/clean before any run (only infra + live warm-keycloak; no bluesky/test stacks; no run_recipe_ci; load 0.03; gitea idle 3.5.3) — Builder "node clean" claim ✔. Began my own COLD isolation re-runs (one at a time, no concurrent load), swarm confirmed free.

  • 2026-06-18T00:29Z — bluesky-pds CONFIRMED by my own reproduction (/tmp/adv-bluesky.log, tag 0.3.0+v0.4.219, RECIPE=bluesky-pds CCCI_SKIP_FETCH=1). Cold lifecycle GREEN (install/backup/restore/custom=pass, upgrade=skip) — reproduced. WC5 promote → unhealthy, 000. DECISIVE live diag inside the warm caddy container (60326521a2ac, nets: proxy=10.10.52.13 + internal=10.0.5.3):

    • getent hosts app → 10.10.0.4 (a proxy-net foreign endpoint) — NOT bluesky's own app.
    • bluesky's OWN app is at internal 10.0.5.6 (real target), never resolved.
    • caddy TLS log cycles dial tcp 10.10.0.{4,5,6,8,10,11,12}:3000: connect: connection refused on ask http://app:3000/tls-check → on-demand cert denied → TLS fails → /xrpc/_health = 000. Verdict basis: NOT a flake (deterministic, every retry refused); NOT promote-machinery (the probe correctly refuses an unhealthy endpoint, no false promote); genuine recipe routing defect — recipe names its svc app + puts caddy on the shared multi-tenant proxy net + Caddyfile uses bare app, so docker DNS resolves app to OTHER stacks' apps. Builder's classification (recipe defect, reverses the plan's "cc-ci warm-machinery" prior) is CORRECT. Sharper than Builder's note (my run's internal IP 10.0.5.6 vs their 10.0.3.3 — same mechanism, different deploy). Letting run finish + will tear down the orphan warm-bluesky stack. [interim — full M1 verdict batched after mumble+discourse.]
  • 2026-06-18T00:38Z — bluesky run finished; promote log !! WC5 promote failed (non-fatal; known-good unchanged) … last status 0 → machinery correctly refused to write canonical (seals "not promote-machinery"). Cleaned up: docker stack rm warm-bluesky-pds… + removed both volumes (caddy_data, pds_data). Node verified clean of bluesky.

  • 2026-06-18T00:44Z — mumble CONFIRMED by my own isolation run (/tmp/adv-mumble.log, tag 1.0.0+v1.6.870-0). ALL 5 tiers GREEN: install/upgrade/backup/restore/custom = pass. The exact canon-sweep failure tests/mumble/custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence PASSED in isolation. WC5 promote SUCCEEDED (canonical advanced to known-good 1.0.0+v1.6.870-0, idle, volume retained). A recipe defect would fail deterministically in isolation (cf. mattermost restore) — mumble passing cleanly confirms load/timing FLAKE, not a recipe bug. (My 1 isolation green + Builder's 2× = 3 isolation greens / 0 isolation reds vs 1 canon red-under-load — consistent flake signature.) Builder's classification CORRECT.

  • 2026-06-18T00:53Z — discourse CONFIRMED by my own isolation run (/tmp/adv-discourse.log, tag 0.8.1+3.5.0). Tiers: install pass / upgrade FAIL / backup pass / restore pass / custom pass — exactly the Builder's claim. Deploy converged in minutes; NO FATA, NO rc=142/143, NO ~51-min wedge → the original canon "cold-deploy timeout" hypothesis did NOT reproduce in isolation (Builder reclassification CORRECT). Upgrade failed on the two PR-faithfulness overlay assertions: test_head_runs_official_image_not_bitnamilegacy (deployed image = bitnamilegacy/discourse:3.5.0@sha256:db7e..., the release's own image) and test_sidekiq_service_dropped_by_head (services = ['app','db','redis','sidekiq']). The overlay demands official discourse/discourse:3.5.3 + no sidekiq — an unreleased PR migration in NO release tag and NOT in main (verified earlier: tag AND main both bitnamilegacy:3.5.0+sidekiq). AssertionError self-documents "the prevb bug." So the recipe DEPLOYS+SERVES fine; only the stale/PR-specific overlay reds by construction in the canonical sweep. → stale cc-ci OVERLAY test, not flake/timeout/recipe-deploy/warm-machinery. Builder CORRECT.

  • 2026-06-18T01:02Z — mattermost-lts CONFIRMED by my own isolation run (/tmp/adv-mattermost.log, tag 2.1.9+10.11.15). Tiers: install pass / upgrade pass / backup pass / restore FAIL / custom pass — exactly Builder's claim. The overlay tests/mattermost-lts/test_restore.py::test_restore_returns_state FAILED with the EXACT RuntimeError: docker exec … postgres failed (rc=1): ERROR: relation "ci_marker" does not exist. Deterministic in isolation (91s, no concurrent load) → NOT the canon "loaded-node db-cycle race." Note: generic test_restore_healthy PASSED (app returns healthy) but the STATE round-trip failed — the seeded marker is gone after restore. Mechanism matches the static finding: backup dumps + backs up hot PGDATA but has NO backupbot.restore.post-hook to replay the dump → postgres logical data never round-trips. → genuine RECIPE defect, not a flake/load-race/stale-test. Builder's classification CORRECT.

  • 2026-06-18T01:09Z — gitea CONFIRMED by my own isolation run + container crash log (/tmp/adv-gitea.log, tag 3.6.0+1.24.2-rootless). Cold lifecycle all 5 tiers GREEN (incl fresh 3.5.3→3.6.0 upgrade tier). WC5 advance (reattach idle 3.5.3 volumes with 3.6.0 image) → warm-gitea app crash-loops 0/1. Container log (every task, e.g. .8zd4952…): setting.go:105:LoadCommonSettings() [F] Unable to load settings from config: error saving JWT Secret for custom config: failed to save "/etc/gitea/app.ini": open /etc/gitea/app.ini: read-only file system. Mount nuance CONFIRMED: /etc/gitea is a writable VOLUME (RW=true) but app.ini is a docker CONFIG overlaying that path read-only → gitea can write the dir but NOT the app.ini file. → genuine RECIPE defect (3.6.0 JWT save vs read-only app.ini config mount). Cold passes (fresh render, no runtime save). Builder's classification + proposed fix (render app.ini into the writable volume) CORRECT. Will verify canonical stays 3.5.3 (promote refused) + restore warm-gitea to undeployed idle.

  • 2026-06-18T02:15Z — M2 interim corroboration (NOT a verdict — M2 not yet claimed). Node cold-checked idle (load 0.07, no run_recipe_ci/abra, only live warm-keycloak) — Builder between M2 fixes, so I stayed OFF the swarm (no contending deploy). Non-contending read-only check of the one fix marked DONE (mattermost-lts PR #1, ref 4ca7f4182d83): cc-ci run #901 artifacts on cc-ci (/var/lib/cc-ci-runs/901/) confirm all tiers pass (install/upgrade/backup/restore/custom), rungs all pass, flags.clean_teardown=true, flags.no_secret_leak=true, WARM_CANONICAL=true. The exact M1-failing test now PASSES: junit/restore__cc-ci__test_restore.xml → testsuite failures="0" errors="0" skipped="0" tests="1", testcase test_restore_returns_state. This is a read-only artifact check, NOT my own cold re-run — the formal M2 PASS will require my own cold re-verification of all six fixes once the Builder claims M2. Pre-staged anchor only.

  • 2026-06-18T04:12Z — Idle break-it probe (NOT a verdict — M2 not yet claimed). Cold-checked node while Builder reworks bluesky+gitea (their journal: 4/6 verified, bluesky warm-verify structurally blocked pre-merge, gitea needs rework). Stayed OFF the swarm. Observations: live warm-keycloak.ci.commoninternet.net/realms/master = 200 (live shared SSO undisturbed by the keycloak harness fix + its verify run — the keycloak DoD's hard constraint holds). Deployed stacks = infra + live warm-keycloak + a warm-gitea (Builder's active rework; app /api/v1/version=404 = wizard mode, consistent with their "gitea fix v1 broke 3.5.3→3.6.0 transition"). No orphan test/bluesky stacks, no run_recipe_ci procs, load 0.44. Critical break-it check PASSED: gitea canonical is UNCHANGED: /var/lib/ci-warm/gitea/canonical.json still 3.5.3+1.24.2-rootless, commit e6a1cc79, status idle, ts 20260617T083930Z (identical to M1). The Builder's broken gitea fix attempts did NOT falsely promote 3.6.0 to canonical. Idling for the M2 gate claim.
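
The break-it check at the end of this log (canonical record identical to M1) amounts to comparing a pinned snapshot against the record on disk. A minimal sketch follows, assuming the record is a flat JSON object. The key names `version`, `commit`, `status`, and `ts` are guesses modelled on the values quoted above, not the confirmed schema of canonical.json, and `canonical_drift` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

# Assumed key names; the real canonical.json schema is not shown in this review.
PINNED_FIELDS = ("version", "commit", "status", "ts")


def canonical_drift(path: Path, expected: dict) -> dict:
    """Return {field: (expected, actual)} for every pinned field that differs."""
    actual = json.loads(path.read_text())
    return {
        key: (expected.get(key), actual.get(key))
        for key in PINNED_FIELDS
        if expected.get(key) != actual.get(key)
    }


# Values recorded for gitea at M1.
m1_snapshot = {
    "version": "3.5.3+1.24.2-rootless",
    "commit": "e6a1cc79",
    "status": "idle",
    "ts": "20260617T083930Z",
}

with tempfile.TemporaryDirectory() as tmp:
    record = Path(tmp) / "canonical.json"
    record.write_text(json.dumps(m1_snapshot))
    print(canonical_drift(record, m1_snapshot))   # {} means no drift

    # Simulate a false promote to show what a failing check would report.
    record.write_text(json.dumps({**m1_snapshot, "version": "3.6.0+1.24.2-rootless"}))
    print(canonical_drift(record, m1_snapshot))
```

Comparing the timestamp as well as the version matters here: a re-promote of the same version would leave the version string intact but still change `ts`.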


M2 gate verification (CLAIMED 2026-06-18T05:53Z) — component re-runs in progress

Verifying all 6 fixes from a COLD START via my own independent harness checkout (/tmp/adv-m2 on cc-ci @ origin/redfix-m2-harness b96b8a4 = keycloak 61211db + mumble 07fc6d4 + bluesky exec-into-pds b96b8a4) and my own chaos-deploys. One recipe at a time, no concurrent load. Node idle at start (load 0.02, only live warm-keycloak). Static code review of the harness branch first: canonical.py adds warm-canon-<r> for r in warm.WARM_DOMAINS (ONLY keycloak — confirmed, so zero blast radius on the other 15 canonicals); mumble widens handshake budget 12->36 attempts (60s->180s) with the asserts UNCHANGED (non-weakening); keycloak recipe_meta WARM_CANONICAL False->True. All three are genuine, not test-disabling.
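
The two harness changes reviewed above can be sketched as follows. This is an illustration under stated assumptions, not the code on redfix-m2-harness: `WARM_DOMAINS` is a stand-in for warm.WARM_DOMAINS, the `warm-<recipe>` default is inferred from the domains quoted in this review, `wait_for` is a hypothetical helper, and the 5 s delay is derived from 60s/12 = 180s/36.

```python
import time
from typing import Callable

# Stand-in for warm.WARM_DOMAINS; per the review only keycloak is enrolled.
WARM_DOMAINS = {"keycloak": "warm-keycloak.ci.commoninternet.net"}
BASE = "ci.commoninternet.net"


def canonical_domain(recipe: str) -> str:
    """Give recipes that also run as a live-warm service a separate namespace."""
    prefix = "warm-canon" if recipe in WARM_DOMAINS else "warm"
    return f"{prefix}-{recipe}.{BASE}"


def wait_for(probe: Callable[[], bool], attempts: int = 36, delay_s: float = 5.0,
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it succeeds or the attempt budget is spent."""
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            sleep(delay_s)      # no sleep after the final attempt
    return False


print(canonical_domain("keycloak"))   # warm-canon-keycloak.ci.commoninternet.net
print(canonical_domain("gitea"))      # warm-gitea.ci.commoninternet.net

# A dead server still fails: every attempt is used and the result is False.
calls = []
dead = lambda: calls.append(1) or False
print(wait_for(dead, sleep=lambda s: None), len(calls))   # False 36
```

A test that does `assert wait_for(handshake_ok)` keeps the same pass condition at either budget; only the time allowed before giving up changes, which is the sense in which the widening is non-weakening.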

  • 2026-06-18T06:08Z — keycloak component VERIFIED (1/6) by my OWN cold harness run (/tmp/adv-keycloak-m2.log, RECIPE=keycloak from /tmp/adv-m2 @b96b8a4, recipe tag 10.8.0+26.6.3). RUN SUMMARY: deploy-count=1, all 5 cold tiers pass (install/upgrade/backup/restore/custom incl custom/test_password_grant_token.py::test_password_grant_issues_valid_jwt). WC5 promote landed at the COLLISION-FREE domain: /var/lib/ci-warm/keycloak/canonical.json domain=warm-canon-keycloak.ci.commoninternet.net, version 10.8.0+26.6.3, status idle, ts 20260618T060549Z (THIS run). Promote genuinely DEPLOYED there — its own volumes exist (warm-canon-keycloak_…_mariadb, _providers). Hard invariant HOLDS — live shared SSO undisturbed: live warm-keycloak_ci_commoninternet_net_app up 4 days, service last Updated 2026-06-13 (predates my 06:04Z run by days → NOT bounced); warm-keycloak.ci.commoninternet.net/realms/master = 200 before/during/after. The data-warm canonical (warm-canon-keycloak) and live-warm provider (warm-keycloak) are fully separate deployments that never touched. Builder's keycloak fix CORRECT + non-weakening; the §2.B de-enrollment is now structurally resolved. (1/6)

  • 2026-06-18T06:15Z — mumble component VERIFIED (2/6) by my OWN cold harness run (/tmp/adv-mumble-m2.log, RECIPE=mumble from /tmp/adv-m2, recipe tag 1.0.0+v1.6.870-0). RUN SUMMARY: deploy-count=1, all 5 cold tiers pass. The stabilized custom test test_handshake_completes_with_channel_presence PASSED (junit failures=0, time=10.3s). The handshake completing in ~10s confirms M1's load/timing-FLAKE classification (fast in isolation, nowhere near even the OLD 60s budget) and that the fix — widening 12->36 attempts (60s->180s) — is pure headroom: the asserts are UNCHANGED, so a genuinely dead server still exhausts all 36 retries and FAILs. Non-weakening. WC5 promote: /var/lib/ci-warm/mumble/canonical.json version 1.0.0+v1.6.870-0, idle, ts 20260618T061114Z (THIS run). Builder's mumble fix CORRECT. (2/6)

    NOTE on branch state: I cloned /tmp/adv-m2 at tip b96b8a4 just before the Builder force-reset redfix-m2-harness to 07fc6d4 (dropping a bluesky exec-into-pds commit). Confirmed git diff 07fc6d4 b96b8a4 = ONLY tests/bluesky-pds/_p4.py + test_account_and_post.py (2 lines, bluesky-only) → keycloak (61211db) and mumble (07fc6d4) code are BYTE-IDENTICAL between b96b8a4 and the claimed tip 07fc6d4, so my keycloak+mumble PASSES hold at the claimed state. bluesky is verified separately via recipe chaos-deploy (PR #4 @4987ba9, now recipe-PR-only per operator directive), so the harness-checkout staleness does not touch it.

  • 2026-06-18T06:18Z — gitea component VERIFIED (3/6) by my OWN direct chaos-deploy of recipe PR #2 @a0f2db8 onto the retained idle 3.5.3 canonical volumes (/tmp/adv-gitea-m2.log). This reproduces the EXACT M1 warm-advance scenario. Two-sided proof: I verified the UNFIXED-crashes side first-hand in M1 (/tmp/adv-gitea.log: read-only-file-system FATA at LoadCommonSettings). Now the FIX side:

    • Fix is genuine, not test-disabling — compose.yml moves the read-only swarm config to /etc/gitea/app.ini.init; docker-setup.sh.tmpl (v1->v3) seeds it into the WRITABLE /etc/gitea volume only when missing OR EMPTY (! -s, handling the 0-byte placeholder the old direct-config mount leaves); a non-empty app.ini (gitea's persisted state incl the JWT) is preserved.
    • Pre-state genuine pre-fix: config-volume app.ini = 0 bytes; retained 3.5.3 data (gitea.db 1347584 B dated 2026-06-17T08:39); canonical 3.5.3 idle e6a1cc79; stack not deployed.
    • Deploy result: deploy succeeded, NEW DEPLOYMENT a0f2db88, docker_setup_sh v3. service 1/1, ZERO restarts (task Running, no Error). M1 read-only crash signature ABSENT (grep of service logs for read-only file system/LoadCommonSettings/[F] = empty). app.ini seeded 0->1862 B with [server] INSTALL_LOCK = true (NOT wizard mode — the very bug that broke the Builder's v1 fix). /api/v1/version -> 200 {"version":"1.24.2"}; /api/healthz -> 200. Retained gitea.db adopted in place (still 1347584 B @08:39, SQLite WAL active) — matches Builder's stated adoption signal (data dirs @08:39). (Empty users/repos = minimal canonical install, not a regression.)
    • Merge-gating is HONEST, not a shrug: published 3.6.0 tag = commit 357926f (independently confirmed) != fix commit a0f2db8, so a non-chaos WC5 promote deploys the unfixed release (the abra force-fetch of refs/tags/* reverts any local tag-move). Chaos-deploy of the working-tree fix is the maximal faithful pre-merge proof; canonical advance follows on operator merge — consistent with the phase's "nothing merged" constraint, NOT a standing exception.
    • Node restored: undeploy succeeded, app.ini truncated back to 0, recipe back to published tag, canonical UNCHANGED 3.5.3 idle e6a1cc79 ts 20260617T083930Z, stack gone. Builder's gitea fix CORRECT. (3/6)
  • 2026-06-18T06:25Z — bluesky-pds component VERIFIED (4/6) by my OWN direct chaos-deploy of recipe PR #4 @4987ba9 (/tmp/adv-bluesky-m2.log). Two-sided proof: I verified the M1 000-side first-hand in M1 (/tmp/redfix-bluesky-pds.log + live diag: WC5 promote 000, caddy app -> foreign proxy IP, no cert). Now the FIX side. NOTE: per Builder inbox (06:11Z) + operator directive, the bluesky fix is now recipe-PR-ONLY (NOT the earlier service rename); the dropped harness commit b96b8a4 is irrelevant.

    • Fix is genuine — Caddyfile ask http://app:3000/tls-check -> http://{$APP_HOST}:3000/tls-check and reverse_proxy app:3000 -> {$APP_HOST}:3000; compose sets APP_HOST=${STACK_NAME}_app on the caddy service; CADDYFILE_VERSION v1->v2. Service stays named app. Established coop-cloud pattern.
    • Deploy: secret generate + secp256k1/32B-hex PLC rotation key insert (install_steps logic) + re-checkout 4987ba9 + abra app deploy -C -o -n -> deploy succeeded, NEW DEPLOYMENT 4987ba91, caddyfile v2, pds:0.4.219. app 1/1, caddy 1/1.
    • Root-cause inversion PROVEN inside caddy: getent hosts warm-bluesky-pds_ci_commoninternet_net_app -> 10.0.5.5 (own-stack INTERNAL) while bare getent hosts app -> 10.10.0.12 (FOREIGN proxy IP — the exact M1 collision). The fix makes caddy resolve the FQ swarm name (own app), bypassing the shared-proxy app-alias collision.
    • External health: https://warm-bluesky-pds.ci.commoninternet.net/xrpc/_health -> 200 {"version":"0.4.219"} on 3/3 attempts (M1 was 000). caddy log: 1 certificate obtained successfully (Let's Encrypt ACME), 0 connection refused (M1 had connection-refused -> 000).
    • Merge-gating identical to gitea (warm-promote force-fetches the published unfixed tag f7b6c8df); chaos-deploy of the working-tree fix is the faithful pre-merge proof. NOT a standing exception.
    • Node restored: undeploy + removed both volumes (caddy_data, pds_data) + all 3 secrets; recipe back to published tag 0.3.0+v0.4.219; NO bluesky stack/volume/secret/canonical (matches M1). Builder's bluesky fix CORRECT. (4/6)
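
The gitea seeding rule verified in the log above (seed app.ini only when it is missing or empty, preserve it otherwise) can be sketched in Python. The actual fix lives in the recipe's docker-setup.sh.tmpl and uses the shell test `! -s`; `seed_if_missing_or_empty` below is a hypothetical equivalent written only to make the two cases explicit.

```python
import shutil
import tempfile
from pathlib import Path


def seed_if_missing_or_empty(template: Path, target: Path) -> bool:
    """Copy `template` to `target` unless `target` already holds content."""
    if target.exists() and target.stat().st_size > 0:
        return False      # preserve persisted state (e.g. the saved JWT secret)
    shutil.copyfile(template, target)
    return True


with tempfile.TemporaryDirectory() as tmp:
    template = Path(tmp) / "app.ini.init"
    target = Path(tmp) / "app.ini"
    template.write_text("[server]\nINSTALL_LOCK = true\n")

    target.touch()        # the 0-byte placeholder left by the old direct mount
    print(seed_if_missing_or_empty(template, target))   # True: placeholder replaced

    target.write_text(target.read_text() + "JWT_SECRET = persisted\n")
    print(seed_if_missing_or_empty(template, target))   # False: left untouched
    print("persisted" in target.read_text())            # True
```

Treating a zero-byte file the same as a missing one is what lets the retained 3.5.3 volume, which holds only the empty placeholder, pick up a real config on first start.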