cc-ci/machine-docs/REVIEW-redfix.md


REVIEW — phase redfix (Adversary)

Phase SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md

Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds, gitea, keycloak), isolate → root-cause → classify (flake vs genuine; recipe vs test vs warm-machinery vs load) → FIX each via a recipe PR or harness improvement → verify green. No standing exceptions. Nothing merged.

Gates:

  • M1 — all six investigated in isolation, classified with evidence. Adversary cold-verifies: claimed flake = reproducibly green in isolation (and red under load); claimed recipe defect = genuinely the recipe (not a stale test / harness artifact); claimed warm-machinery bug = in cc-ci.
  • M2 — all six FIXED + verified green (recipe PR via !testme; harness/cc-ci PR via the harness; flake-stabilization green under load). All six promote/pass. No standing exception. Nothing merged.

DONE = Builder writes ## DONE only after M1+M2 fresh Adversary PASS here.


Verdicts

M1 — investigate + isolate + classify: PASS @ 2026-06-18T01:18Z

Gate claim: claim(redfix-M1) commit 0a06c41 (@00:25Z). Verified from a COLD START on cc-ci with my OWN isolation re-runs (one recipe at a time, no concurrent load) — NOT the Builder's logs. Isolation discipline honored: verdict formed from the phase plan (SSOT), the recipe code / git history, the verification info in STATUS, and my own cold acceptance runs; I did NOT read JOURNAL-redfix.md before writing this verdict.

All six classifications are CORRECT. Evidence per recipe (full detail in the verification log below):

| Recipe | My independent reproduction | Classification — verified |
|---|---|---|
| discourse | my isolation run /tmp/adv-discourse.log: install/backup/restore/custom PASS, upgrade FAIL on the 2 PR-faithfulness overlay asserts; converged in minutes, no FATA/rc=142/wedge | stale/PR-specific cc-ci OVERLAY test (canon "timeout" root-cause was WRONG — confirmed). Recipe deploys+serves fine. ✔ |
| mattermost-lts | my isolation run /tmp/adv-mattermost.log: restore FAIL deterministically (relation "ci_marker" does not exist, 91s, isolated) | genuine RECIPE defect — no backupbot.restore.post-hook; NOT the canon "loaded-node race." ✔ |
| mumble | my isolation run /tmp/adv-mumble.log: ALL 5 tiers GREEN incl test_handshake_completes_with_channel_presence; promote OK | load/timing FLAKE — green in isolation (a recipe defect would red deterministically; it didn't). ✔ |
| bluesky-pds | my isolation run /tmp/adv-bluesky.log + live caddy diag: cold GREEN, warm promote 000 deterministic; getent app → 10.10.0.4 (foreign proxy), own app 10.0.5.6 never resolved; caddy log cycles dial 10.10.0.{4..12}:3000 refused | genuine recipe ROUTING defect (bare app + caddy on shared proxy), NOT cc-ci promote-machinery (it correctly refused to promote), NOT flake. (Reverses the plan's "warm-machinery" prior — confirmed against it.) ✔ |
| gitea | my isolation run /tmp/adv-gitea.log + container crash log: cold GREEN, warm advance crash-loops 0/1; LoadCommonSettings() [F] … error saving JWT Secret … "/etc/gitea/app.ini": read-only file system; canonical correctly stayed 3.5.3 (promote timed out, refused) | genuine RECIPE defect (3.6.0 JWT save vs read-only app.ini docker-config mount; /etc/gitea is a writable volume but the app.ini file is the RO config). ✔ |
| keycloak | code-verified: canonical.canonical_domain('keycloak') → warm.stable_domain → warm-keycloak.ci.commoninternet.net == warm.WARM_DOMAINS['keycloak'] (warm.py:47 documents the equality); live keycloak 200 on /realms/master | HARNESS defect (data-warm canonical domain collides with the live-warm OIDC provider; no collision-free namespace). ✔ |
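
The keycloak row rests on two domain lookups returning the same hostname. A minimal sketch of that collision check follows. It assumes hypothetical stand-ins for `warm.WARM_DOMAINS`, `warm.stable_domain` and `canonical.canonical_domain`: only the hostname `warm-keycloak.ci.commoninternet.net` comes from this review, and the real cc-ci functions may be shaped differently.

```python
# Hypothetical stand-ins for cc-ci's warm.py / canonical.py. Only the
# keycloak hostname is taken from the review; the naming scheme is assumed.
WARM_DOMAINS = {"keycloak": "warm-keycloak.ci.commoninternet.net"}


def stable_domain(recipe: str) -> str:
    """Assumed naming scheme for a data-warm canonical stack."""
    return f"warm-{recipe}.ci.commoninternet.net"


def canonical_domain(recipe: str) -> str:
    """Assumed to delegate to stable_domain, as the code read suggests."""
    return stable_domain(recipe)


def colliding_recipes(recipes):
    """Return recipes whose canonical domain equals a live-warm domain."""
    return [r for r in recipes if canonical_domain(r) == WARM_DOMAINS.get(r)]


print(colliding_recipes(["keycloak", "gitea"]))
```

Under these assumptions only keycloak collides, because it is the only recipe with a live-warm entry in the same namespace.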

No defects in the classification work. No VETO. Node verified clean before AND after my runs (only infra + live warm-keycloak; gitea restored to undeployed idle 3.5.3, volumes retained, canonical commit e6a1cc79 unchanged; warm-keycloak healthy throughout). M1 PASS — Builder cleared to proceed to M2. (M2 will re-verify each FIX green; this PASS is for the investigation/classification gate only.)

(prior placeholder removed)

Adversary verification log

  • 2026-06-17T23:18Z — Phase redfix opened. Refreshed phase plan + plan.md §6.1. Cold access to cc-ci confirmed (ssh cc-ci: host nixos, uptime 4d, systemctl --failed empty, load ~0.8). No Builder state files (STATUS/BACKLOG/JOURNAL-redfix.md) yet; no gate claimed. Idling for the first claim.

  • 2026-06-18T00:10Z — Non-contending pre-staging (M1 NOT yet claimed; Builder mid-investigation: gitea isolation running, keycloak pending). Stayed OFF the swarm to avoid contaminating the Builder's isolation runs. Independently corroborated two deterministic static claims via pure code reads on cc-ci (no deploys):

    • mattermost-lts (recipe @ 2.1.9+10.11.15): postgres svc has backupbot.backup.pre-hook (pg_dump → /var/lib/postgresql/data/postgres-backup.sql), backup.post-hook (rm dump), backup.path=/var/lib/postgresql/data/ (hot live PGDATA) — and NO backupbot.restore.post-hook. immich (passes) uses dump-only backup.volumes.postgres.path: backup.sql + restore.post-hook: /pg_backup.sh restore. Corroborates "genuine recipe defect — no restore round-trip." ✔ pre-staged.
    • discourse (recipe @ 0.8.1+3.5.0 = bitnamilegacy/discourse:3.5.0 + sidekiq): overlay tests/discourse/test_upgrade.py is a phase-prevb PR-faithfulness test asserting app image == official discourse/discourse:3.5.3 AND sidekiq dropped — only true on an unreleased PR head, not the latest release the canon sweep deploys. So it is red by construction in the sweep. Corroborates "stale/PR-specific overlay test, not flake/timeout/recipe-deploy." ✔ pre-staged.
    • STILL OWED before any M1 PASS: my OWN cold isolation run of discourse to confirm the re-classification from the original canon hypothesis ("cold-deploy timeout, ~51-min wedge") to "deploys+serves fine, only the overlay test reds." Will run when M1 is claimed and the swarm is free (Builder not deploying). Same for bluesky app-alias collision (needs live caddy/getent diag). These are NOT verdicts — formal M1 PASS/FAIL awaits the Builder's gate claim.
  • 2026-06-18T00:25Z — M1 CLAIMED (commit 0a06c41). Node verified idle/clean before any run (only infra + live warm-keycloak; no bluesky/test stacks; no run_recipe_ci; load 0.03; gitea idle 3.5.3) — Builder "node clean" claim ✔. Began my own COLD isolation re-runs (one at a time, no concurrent load), swarm confirmed free.

  • 2026-06-18T00:29Z — bluesky-pds CONFIRMED by my own reproduction (/tmp/adv-bluesky.log, tag 0.3.0+v0.4.219, RECIPE=bluesky-pds CCCI_SKIP_FETCH=1). Cold lifecycle GREEN (install/backup/restore/custom=pass, upgrade=skip) — reproduced. WC5 promote → unhealthy, 000. DECISIVE live diag inside the warm caddy container (60326521a2ac, nets: proxy=10.10.52.13 + internal=10.0.5.3):

    • getent hosts app → 10.10.0.4 (a proxy-net foreign endpoint) — NOT bluesky's own app.
    • bluesky's OWN app is at internal 10.0.5.6 (real target), never resolved.
    • caddy TLS log cycles dial tcp 10.10.0.{4,5,6,8,10,11,12}:3000: connect: connection refused on ask http://app:3000/tls-check → on-demand cert denied → TLS fails → /xrpc/_health = 000.

    Verdict basis: NOT a flake (deterministic, every retry refused); NOT promote-machinery (the probe correctly refuses an unhealthy endpoint, no false promote); genuine recipe routing defect — recipe names its svc app + puts caddy on the shared multi-tenant proxy net + Caddyfile uses bare app, so docker DNS resolves app to OTHER stacks' apps. Builder's classification (recipe defect, reverses the plan's "cc-ci warm-machinery" prior) is CORRECT. Sharper than Builder's note (my run's internal IP 10.0.5.6 vs their 10.0.3.3 — same mechanism, different deploy). Letting run finish + will tear down the orphan warm-bluesky stack. [interim — full M1 verdict batched after mumble+discourse.]
  • 2026-06-18T00:38Z — bluesky run finished; promote log !! WC5 promote failed (non-fatal; known-good unchanged) … last status 0 → machinery correctly refused to write canonical (seals "not promote-machinery"). Cleaned up: docker stack rm warm-bluesky-pds… + removed both volumes (caddy_data, pds_data). Node verified clean of bluesky.
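
The bluesky routing mechanism can be illustrated with a toy model of alias lookup across attached networks. This is a deliberately simplified sketch, not Docker's actual resolver: the IPs are the ones observed in the diag above, while the network labels and the stack-qualified name `mystack_app` are hypothetical placeholders.

```python
from collections import defaultdict


def build_dns(attachments):
    """Index (network, alias) -> list of IPs from (network, alias, ip) tuples."""
    table = defaultdict(list)
    for network, alias, ip in attachments:
        table[(network, alias)].append(ip)
    return table


def resolve(table, networks, name):
    """Every IP a container attached to `networks` could be handed for `name`."""
    ips = []
    for net in networks:
        ips.extend(table.get((net, name), []))
    return ips


table = build_dns([
    ("proxy", "app", "10.10.0.4"),            # another stack's app on the shared net
    ("internal", "app", "10.0.5.6"),          # bluesky's own app, short alias
    ("internal", "mystack_app", "10.0.5.6"),  # hypothetical stack-qualified name
])

caddy_networks = ["proxy", "internal"]
print(resolve(table, caddy_networks, "app"))          # ambiguous: two candidates
print(resolve(table, caddy_networks, "mystack_app"))  # unambiguous: one candidate
```

In this model the bare name `app` has more than one candidate as soon as caddy joins a shared network, whereas a name unique to the stack has exactly one.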

  • 2026-06-18T00:44Z — mumble CONFIRMED by my own isolation run (/tmp/adv-mumble.log, tag 1.0.0+v1.6.870-0). ALL 5 tiers GREEN: install/upgrade/backup/restore/custom = pass. The exact canon-sweep failure tests/mumble/custom/test_protocol_handshake.py::test_handshake_completes_with_channel_presence PASSED in isolation. WC5 promote SUCCEEDED (canonical advanced to known-good 1.0.0+v1.6.870-0, idle, volume retained). A recipe defect would fail deterministically in isolation (cf. mattermost restore) — mumble passing cleanly confirms load/timing FLAKE, not a recipe bug. (My 1 isolation green + Builder's 2× = 3 isolation greens / 0 isolation reds vs 1 canon red-under-load — consistent flake signature.) Builder's classification CORRECT.
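
The flake-versus-genuine rule used in this entry can be written down as a small decision function. This is a sketch of the reasoning only, with a hypothetical function name and input shape. It covers just the flake/genuine axis and does not distinguish recipe, test, or harness defects.

```python
def classify(isolation_runs, load_runs):
    """Classify a failure from run outcomes (True = green, False = red).

    genuine:      every isolation run is red (deterministic without load)
    flake:        every isolation run is green, at least one loaded run is red
    inconclusive: anything else (mixed isolation results, or no data)
    """
    if isolation_runs and not any(isolation_runs):
        return "genuine"
    if isolation_runs and all(isolation_runs) and load_runs and not all(load_runs):
        return "flake"
    return "inconclusive"


# mumble: 3 isolation greens, 1 canon red under load
print(classify([True, True, True], [False]))
# mattermost-lts restore: red in isolation
print(classify([False], [False]))
```

With the counts recorded above, mumble comes out as "flake" and the mattermost-lts restore as "genuine".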

  • 2026-06-18T00:53Z — discourse CONFIRMED by my own isolation run (/tmp/adv-discourse.log, tag 0.8.1+3.5.0). Tiers: install pass / upgrade FAIL / backup pass / restore pass / custom pass — exactly the Builder's claim. Deploy converged in minutes; NO FATA, NO rc=142/143, NO ~51-min wedge → the original canon "cold-deploy timeout" hypothesis did NOT reproduce in isolation (Builder reclassification CORRECT). Upgrade failed on the two PR-faithfulness overlay assertions: test_head_runs_official_image_not_bitnamilegacy (deployed image = bitnamilegacy/discourse:3.5.0@sha256:db7e..., the release's own image) and test_sidekiq_service_dropped_by_head (services = ['app','db','redis','sidekiq']). The overlay demands official discourse/discourse:3.5.3 + no sidekiq — an unreleased PR migration in NO release tag and NOT in main (verified earlier: tag AND main both bitnamilegacy:3.5.0+sidekiq). AssertionError self-documents "the prevb bug." So the recipe DEPLOYS+SERVES fine; only the stale/PR-specific overlay reds by construction in the canonical sweep. → stale cc-ci OVERLAY test, not flake/timeout/recipe-deploy/warm-machinery. Builder CORRECT.

  • 2026-06-18T01:02Z — mattermost-lts CONFIRMED by my own isolation run (/tmp/adv-mattermost.log, tag 2.1.9+10.11.15). Tiers: install pass / upgrade pass / backup pass / restore FAIL / custom pass — exactly Builder's claim. The overlay tests/mattermost-lts/test_restore.py::test_restore_returns_state FAILED with the EXACT RuntimeError: docker exec … postgres failed (rc=1): ERROR: relation "ci_marker" does not exist. Deterministic in isolation (91s, no concurrent load) → NOT the canon "loaded-node db-cycle race." Note: generic test_restore_healthy PASSED (app returns healthy) but the STATE round-trip failed — the seeded marker is gone after restore. Mechanism matches the static finding: backup dumps + backs up hot PGDATA but has NO backupbot.restore.post-hook to replay the dump → postgres logical data never round-trips. → genuine RECIPE defect, not a flake/load-race/stale-test. Builder's classification CORRECT.
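
The static finding behind this entry (a dump is taken on backup but never replayed on restore) can be expressed as a label check. The sketch below is illustrative: the function name is hypothetical, the label values are placeholders rather than the recipes' real commands, and the fully qualified immich key is an assumption expanded from the abbreviated form quoted earlier in this log.

```python
RESTORE_HOOK = "backupbot.restore.post-hook"


def lacks_restore_replay(labels: dict) -> bool:
    """True when a service declares backupbot backup labels but no restore hook."""
    declares_backup = any(key.startswith("backupbot.backup.") for key in labels)
    return declares_backup and RESTORE_HOOK not in labels


# Placeholder values; only the label keys reflect what the review reports.
mattermost_postgres = {
    "backupbot.backup.pre-hook": "<pg_dump command>",
    "backupbot.backup.post-hook": "<rm dump command>",
    "backupbot.backup.path": "/var/lib/postgresql/data/",
}
immich_postgres = {
    "backupbot.backup.volumes.postgres.path": "backup.sql",  # assumed full key
    "backupbot.restore.post-hook": "/pg_backup.sh restore",
}

print(lacks_restore_replay(mattermost_postgres))
print(lacks_restore_replay(immich_postgres))
```

A check like this only flags the missing hook. It does not prove the restore round-trips, which is what the overlay state test exercises.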

  • 2026-06-18T01:09Z — gitea CONFIRMED by my own isolation run + container crash log (/tmp/adv-gitea.log, tag 3.6.0+1.24.2-rootless). Cold lifecycle all 5 tiers GREEN (incl fresh 3.5.3→3.6.0 upgrade tier). WC5 advance (reattach idle 3.5.3 volumes with 3.6.0 image) → warm-gitea app crash-loops 0/1. Container log (every task, e.g. .8zd4952…): setting.go:105:LoadCommonSettings() [F] Unable to load settings from config: error saving JWT Secret for custom config: failed to save "/etc/gitea/app.ini": open /etc/gitea/app.ini: read-only file system. Mount nuance CONFIRMED: /etc/gitea is a writable VOLUME (RW=true) but app.ini is a docker CONFIG overlaying that path read-only → gitea can write the dir but NOT the app.ini file. → genuine RECIPE defect (3.6.0 JWT save vs read-only app.ini config mount). Cold passes (fresh render, no runtime save). Builder's classification + proposed fix (render app.ini into the writable volume) CORRECT. Will verify canonical stays 3.5.3 (promote refused) + restore warm-gitea to undeployed idle.

  • 2026-06-18T02:15Z — M2 interim corroboration (NOT a verdict — M2 not yet claimed). Node cold-checked idle (load 0.07, no run_recipe_ci/abra, only live warm-keycloak) — Builder between M2 fixes, so I stayed OFF the swarm (no contending deploy). Non-contending read-only check of the one fix marked DONE (mattermost-lts PR #1, ref 4ca7f4182d83): cc-ci run #901 artifacts on cc-ci (/var/lib/cc-ci-runs/901/) confirm all tiers pass (install/upgrade/backup/restore/custom), rungs all pass, flags.clean_teardown=true, flags.no_secret_leak=true, WARM_CANONICAL=true. The exact M1-failing test now PASSES: junit/restore__cc-ci__test_restore.xml → testsuite failures="0" errors="0" skipped="0" tests="1", testcase test_restore_returns_state. This is a read-only artifact check, NOT my own cold re-run — the formal M2 PASS will require my own cold re-verification of all six fixes once the Builder claims M2. Pre-staged anchor only.