STATUS — phase redfix
Phase SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md
Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds, gitea, keycloak) → isolate → root-cause → classify (flake vs genuine; recipe vs test vs warm-machinery vs load) → FIX each (recipe PR or harness improvement) → verify green. No standing exceptions. Nothing merged.
Phase: M1 — investigate + isolate + classify (IN PROGRESS)
Bootstrapped 2026-06-17T23:20Z. cc-ci healthy, no run in flight, next scheduled sweep 2026-06-21
(3-day clear window). Disk / 38G free (75% used).
Isolation harness (how I reproduce each failure ALONE)
Each canon-sweep per-recipe run is runner/nightly_sweep.run_on_tag(recipe, latest):
abra.recipe_checkout(recipe, <latest-tag>), then run_recipe_ci.py with RECIPE=<r> CCCI_SKIP_FETCH=1 and REF/CCCI_QUICK/MODE/VERSION unset (cold, full, head==tag). Isolation = run ONE
recipe at a time with NO concurrent sweep load on the single node (the loaded node is the known
flake source per phase plan §2.1). Runs execute on cc-ci from /etc/cc-ci.
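
The environment for one isolated run, as described above, can be sketched in a few lines of Python. This is a minimal illustration and not the harness's own code: `isolation_env` and `UNSET_VARS` are hypothetical names, and the variable list is the one given in this document (REF, CCCI_QUICK, MODE, VERSION unset; RECIPE and CCCI_SKIP_FETCH=1 set).

```python
import os

# Variables an isolated run must NOT inherit (cold, full, head == tag).
UNSET_VARS = ("REF", "CCCI_QUICK", "MODE", "VERSION")


def isolation_env(recipe, base_env=None):
    """Return a copy of the environment prepared for one isolated recipe run.

    `isolation_env` is a hypothetical helper name, not part of cc-ci.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name in UNSET_VARS:
        env.pop(name, None)          # tolerate variables that are already absent
    env["RECIPE"] = recipe
    env["CCCI_SKIP_FETCH"] = "1"     # the recipe checkout was done beforehand
    return env
```

The returned mapping could be passed as `env=` to `subprocess.run`, with `cwd="/etc/cc-ci"`, after the recipe has been checked out at its latest tag. Running one such process at a time is what keeps concurrent sweep load off the node.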
Starting canonical state (cc-ci /var/lib/ci-warm/<r>/canonical.json, read 2026-06-17T23:19Z)
| Recipe | Canonical now | Note |
|---|---|---|
| discourse | (none) | no canonical dir |
| mattermost-lts | (none) | no canonical dir |
| mumble | 1.0.0+v1.6.870-0 @ 20260617T180501Z | canonical PRESENT, written TODAY (flake signal) |
| bluesky-pds | (none) | no canonical dir |
| gitea | 3.5.3+1.24.2-rootless @ 20260617T083930Z | 3.6.0 advance not promoted |
| keycloak | (none) | de-enrolled (WARM_CANONICAL off) |
M1 investigation tracker
| Recipe | Isolation run | Result | Root cause | Classification |
|---|---|---|---|---|
| discourse | DONE @23:40Z (/tmp/redfix-discourse.log on cc-ci) | install/backup/restore/custom PASS; upgrade overlay FAIL. Deploys+serves fine; NOT a timeout/FATA. | cc-ci overlay tests/discourse/test_upgrade.py asserts head runs official discourse/discourse:3.5.3 + drops sidekiq; latest tag 0.8.1+3.5.0 AND main both still bitnamilegacy/discourse:3.5.0+sidekiq (migration exists in no release/main). The depends_on discourse string is a non-fatal prepull-only warning, not the deploy. | stale/PR-specific cc-ci OVERLAY test mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) |
| mattermost-lts | DONE @00:05Z (/tmp/redfix-mattermost-lts.log) | install/upgrade/backup/custom PASS; restore FAIL ci_marker does not exist; deterministic in isolation (not a load race) | recipe postgres svc backup labels: backs up hot live PGDATA + dump but has NO backupbot.restore.post-hook to replay the dump → restore doesn't round-trip postgres. Contrast immich (passes): dump-only backup.volumes.postgres.path: backup.sql + restore.post-hook: /pg_backup.sh restore. | genuine RECIPE defect at latest → recipe PR (adopt immich-style dump+restore-post-hook) |
| mumble | DONE, 2× isolation GREEN (/tmp/redfix-mumble.log + /tmp/redfix-mumble2.log) | ALL tiers PASS incl. handshake on BOTH runs; no orphans; canonical re-promoted green each time | handshake (TLS+ServerSync) not completing within ~60s retry under heavy concurrent sweep load; fine in isolation | load/timing FLAKE → harness stabilization (readiness gate / retry) |
| bluesky-pds | DONE @00:45Z (/tmp/redfix-bluesky-pds.log + live diag) | cold lifecycle GREEN; WC5 promote 000 reproduces (warm /xrpc/_health last status 0). NOT a flake | caddy on-demand TLS (ask http://app:3000/tls-check) can't reach app: caddy resolves bare app to OTHER stacks' app endpoints on shared proxy net (getent app→only 10.10.0.X, never internal 10.0.3.3; proxy has drone/traefik/keycloak/ccci app aliases) → no cert → 000. Promote machinery correct (refused to write canonical). | genuine routing/RECIPE defect (cross-stack app-alias collision on shared proxy) → recipe PR: unique PDS service name/alias. NOT promote-machinery, NOT flake |
| gitea | DONE @00:14Z (/tmp/redfix-gitea2.log + live container logs) | cold lifecycle (incl fresh 3.5.3→3.6.0 upgrade) PASS; warm advance crash-loops | LoadCommonSettings() [F] error saving JWT Secret … failed to save "/etc/gitea/app.ini": read-only file system. gitea 3.6.0/1.24.2 tries to persist a JWT to the read-only app.ini docker-config mount on warm reattach (before DB migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite). | genuine RECIPE defect (3.6.0 + read-only app.ini config mount on advance) → recipe PR: render app.ini into the writable config volume. (1st gitea run hit a nixenv "already deployed" leftover confound; fixed by undeploying to idle then re-running) |
| keycloak | DONE @01:05Z (code-verified; no run) | de-enrolled. canonical_domain("keycloak") == WARM_DOMAINS["keycloak"] == warm-keycloak.ci.commoninternet.net EXACTLY (canonical.py:42, warm.py:27,44). Live keycloak 200 /realms/master. | data-warm canonical domain uses same warm-<r> scheme as the live-warm OIDC provider → promote would collide with live shared SSO. No collision-free canonical namespace exists. | HARNESS defect (warm-domain namespace collision) → fix: collision-free canonical_domain for live-warm providers (warm-canon-<r>), then enroll keycloak |
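
The mattermost-lts root cause comes down to whether the postgres service's backupbot labels include a restore hook. A minimal sketch of that static check follows, assuming the compose file text has already been obtained (for example via `git show <tag>:compose.yml`). `backupbot_label_keys` and `restore_round_trips` are hypothetical helper names; only the `backupbot.restore.post-hook` key is taken from this document.

```python
import re

# Matches a backupbot label key in either compose label style:
#   map form:  backupbot.restore.post-hook: /pg_backup.sh restore
#   list form: - "backupbot.backup=true"
LABEL_KEY_RE = re.compile(r"(backupbot\.[A-Za-z0-9_.\-]+)\s*[:=]")


def backupbot_label_keys(compose_text):
    """Return the set of backupbot.* label keys found in the compose text."""
    return set(LABEL_KEY_RE.findall(compose_text))


def restore_round_trips(compose_text):
    """True if a restore post-hook label is present to replay the dump."""
    return "backupbot.restore.post-hook" in backupbot_label_keys(compose_text)
```

This only checks that the label exists; it does not prove the hook script restores correctly, which still needs the restore tier of an isolation run.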
M1 results table (recipe → failure → isolation result → root cause → classification → fix approach)
| Recipe | Canon-sweep failure | Isolation result | Flake or genuine | Root cause | Class | Fix approach (M2) |
|---|---|---|---|---|---|---|
| discourse | "cold-deploy timeout / deploy FATA" | install/backup/restore/custom GREEN; upgrade overlay RED | genuine (deterministic), but the canon root-cause was WRONG (no timeout, no deploy FATA) | cc-ci overlay tests/discourse/test_upgrade.py asserts head = official discourse/discourse:3.5.3 + sidekiq dropped; that migration is in NO release tag and NOT in main (all use bitnamilegacy/discourse:3.5.0+sidekiq) | stale/PR-specific cc-ci OVERLAY test | make the overlay assert migration-faithfulness only when the head IS that migration (not vs a release tag), OR a recipe PR migrating off deprecated bitnamilegacy; settle in M2 (NOT a test-weakening) |
| mattermost-lts | test_restore_returns_state RED | install/upgrade/backup/custom GREEN; restore RED (ci_marker does not exist) | genuine (deterministic in isolation), NOT the canon "loaded-node race" | recipe postgres backup labels back up hot PGDATA + a dump but have no backupbot.restore.post-hook to replay it; restore doesn't round-trip. immich (passes) uses dump-only path + restore.post-hook | genuine RECIPE defect | recipe PR: adopt immich-style postgres dump + backupbot.restore.post-hook replay |
| mumble | test_handshake… RED | ALL tiers GREEN in isolation (×N) incl. handshake | FLAKE (load/timing) | handshake (TLS+ServerSync) doesn't complete within the 60s retry under heavy concurrent sweep load; fine isolated; canonical written green today | load/concurrency FLAKE | harness stabilization: stronger readiness gate before the custom tier / longer-or-smarter handshake retry |
| bluesky-pds | warm promote /xrpc/_health → 000 | cold lifecycle GREEN; warm promote 000 reproduces | genuine (deterministic), NOT a load/rate-limit flake | caddy on-demand TLS calls http://app:3000/tls-check; caddy resolves bare app to OTHER stacks' app endpoints on the shared proxy net (every stack aliases its main svc app), never bluesky's own internal app (10.0.3.3) → connection refused → no cert → 000 | genuine ROUTING/RECIPE defect (cross-stack app-alias collision) | recipe PR: give the PDS service a unique name/alias so caddy resolves only bluesky's app |
| gitea | 3.5.3→3.6.0 warm advance doesn't promote | cold (incl fresh upgrade) GREEN; warm advance crash-loops | genuine (deterministic) | gitea 3.6.0/1.24.2 saves a JWT secret to /etc/gitea/app.ini on warm reattach; app.ini is a read-only docker-config mount → read-only file system FATA at LoadCommonSettings (pre-migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite) | genuine RECIPE defect | recipe PR: render app.ini into the writable config:/etc/gitea volume (entrypoint) instead of a read-only docker config |
| keycloak | de-enrolled (not tested) | code-verified (no run) | genuine (structural) | canonical_domain("keycloak") == WARM_DOMAINS["keycloak"] == warm-keycloak.ci.commoninternet.net EXACTLY → a data-warm canonical would collide with the live-warm OIDC provider | HARNESS defect (warm-domain namespace collision) | harness: collision-free canonical_domain for live-warm providers (warm-canon-<r>), then enroll keycloak (WARM_CANONICAL=True) |
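
The keycloak fix approach (a collision-free canonical namespace) could look roughly like the sketch below. It is an illustration under assumptions, not the contents of canonical.py or warm.py: the function body and `BASE_DOMAIN` are invented for the example, and `WARM_DOMAINS` lists only keycloak because that is the only live-warm provider named in this document.

```python
BASE_DOMAIN = "ci.commoninternet.net"

# Live-warm providers and the domains they already occupy.
# Assumption: only keycloak is listed here, as in this document.
WARM_DOMAINS = {"keycloak": f"warm-keycloak.{BASE_DOMAIN}"}


def canonical_domain(recipe):
    """Return a data-warm canonical domain that never equals a live-warm domain."""
    if recipe in WARM_DOMAINS:
        # Live-warm providers get their own warm-canon-<r> namespace.
        return f"warm-canon-{recipe}.{BASE_DOMAIN}"
    return f"warm-{recipe}.{BASE_DOMAIN}"


def collides_with_live_warm(recipe):
    """True if promoting `recipe` would deploy onto a live-warm domain."""
    return canonical_domain(recipe) in WARM_DOMAINS.values()
```

A guard like `collides_with_live_warm` could run before enrollment so that setting WARM_CANONICAL on a provider fails loudly instead of deploying over the shared SSO.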
HOW the Adversary cold-verifies each classification (run ONE recipe at a time, no concurrent load)
Isolation invocation (per recipe R at latest tag T), from /etc/cc-ci on cc-ci:
git -C ~/.abra/recipes/R checkout -f --quiet T && env -u REF -u CCCI_QUICK -u MODE -u VERSION RECIPE=R CCCI_SKIP_FETCH=1 cc-ci-run runner/run_recipe_ci.py
Latest tags: discourse 0.8.1+3.5.0, mattermost-lts 2.1.9+10.11.15, mumble 1.0.0+v1.6.870-0, bluesky-pds 0.3.0+v0.4.219, gitea 3.6.0+1.24.2-rootless.
- discourse: EXPECT install/backup/restore/custom pass, upgrade fail on `test_head_runs_official_image_not_bitnamilegacy` + `test_sidekiq_service_dropped_by_head`. Confirm the overlay mismatch statically: `git -C ~/.abra/recipes/discourse show 0.8.1+3.5.0:compose.yml | grep -A1 ' app:'` and `... show main:compose.yml` both = `bitnamilegacy/discourse:3.5.0`; `grep -c 'sidekiq:'` = 1 in both. So the test's `discourse/discourse:3.5.3`/no-sidekiq expectation exists nowhere upstream.
- mattermost-lts: EXPECT restore fail `relation "ci_marker" does not exist`. Confirm root cause statically: `git -C ~/.abra/recipes/mattermost-lts show 2.1.9+10.11.15:compose.yml | grep backupbot` shows pre-hook + `backup.path` but NO `restore.post-hook`; immich `git -C ~/.abra/recipes/immich show <latest>:compose.yml | grep backupbot` shows `restore.post-hook: /pg_backup.sh restore`.
- mumble: EXPECT all tiers green (run 2–3× to confirm reproducibly green isolated). Canonical written green: `cat /var/lib/ci-warm/mumble/canonical.json`.
- bluesky-pds: EXPECT cold green, WC5 promote `!! WC5 promote failed … warm-bluesky-pds … last status 0`. While the warm stack is up, confirm root cause: caddy logs `dial tcp 10.10.0.X:3000: connect: connection refused` for `app:3000/tls-check`; `docker exec <caddy> getent hosts app` returns proxy IPs (10.10.0.X), the app's real internal IP is 10.0.3.x; `docker network inspect proxy | grep _app` shows many stacks aliasing `app`. (Tear down the orphaned warm-bluesky-pds stack + volumes after.)
- gitea: REQUIRES idle canonical first: if warm-gitea is deployed, `docker stack rm warm-gitea_ci_commoninternet_net` (retains data+config volumes) so the advance reattaches from idle. EXPECT cold green, warm advance crash-loop with container log `LoadCommonSettings() [F] error saving JWT Secret … "/etc/gitea/app.ini": read-only file system`. Restore: leave warm-gitea undeployed (idle 3.5.3, volumes retained); registry stays `3.5.3+1.24.2-rootless`.
- keycloak: no run. Code-verify: `canonical.canonical_domain('keycloak')` → `warm.stable_domain('keycloak')` → `warm-keycloak.ci.commoninternet.net`; `warm.WARM_DOMAINS['keycloak']` == same string (runner/harness/canonical.py:42-44, warm.py:27-29,44-48). Live keycloak 200 on `/realms/master`.
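
For the bluesky-pds check, the `getent hosts app` result can be classified mechanically. The sketch below is a hedged illustration: `classify_resolution` is a hypothetical helper, and the subnet boundaries are assumptions, since this document gives only the address patterns 10.10.0.X (shared proxy) and 10.0.3.x (internal), not the actual network prefixes.

```python
import ipaddress


def classify_resolution(resolved_ips, internal_cidr, proxy_cidr):
    """Classify where a bare service name resolves from inside caddy.

    Returns "internal-only", "proxy-only", "mixed", or "unknown".
    The CIDRs are supplied by the caller; this document does not state them.
    """
    internal = ipaddress.ip_network(internal_cidr)
    proxy = ipaddress.ip_network(proxy_cidr)
    addrs = [ipaddress.ip_address(ip) for ip in resolved_ips]
    on_internal = any(a in internal for a in addrs)
    on_proxy = any(a in proxy for a in addrs)
    if on_internal and on_proxy:
        return "mixed"          # requests may still land on another stack
    if on_internal:
        return "internal-only"  # the healthy case
    if on_proxy:
        return "proxy-only"     # the collision described above
    return "unknown"
```

With assumed /24 prefixes, a result of "proxy-only" for `app` matches the reported symptom, and "internal-only" is what a recipe fix with a unique alias would be expected to produce.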
Node state left clean
All isolation runs torn down; orphaned warm-bluesky-pds stack+volumes removed; warm-gitea restored to idle 3.5.3 (volumes retained, registry unchanged); only live warm-keycloak deployed (healthy). No run_recipe_ci.py processes.
M1 — PASS @ 2026-06-18T01:18Z (REVIEW-redfix.md; all 6 classifications cold-verified CORRECT by Adversary's own isolation re-runs). No VETO. Cleared to M2.
Phase: M2 — FIX + verify all six (IN PROGRESS)
Fix designs locked in BACKLOG-redfix.md. Recipe PRs (mattermost-lts/bluesky/gitea) on git.autonomic.zone
mirrors via the recipe mirror+PR flow, verified !testme (NEVER merge). Harness fixes (keycloak/mumble)
on a cc-ci branch, verified via the harness. discourse: overlay-scope decision. Node now free for my
deploys (Adversary done with M1).
M2 fix tracker
| Recipe | Fix type | PR/branch | Status |
|---|---|---|---|
| mattermost-lts | recipe PR (pg_backup.sh + restore.post-hook) | — | starting |
| bluesky-pds | recipe PR (unique internal alias for caddy→app) | — | pending |
| gitea | recipe PR (app.ini → writable volume) | — | pending |
| keycloak | harness (collision-free canonical_domain) + enroll | — | pending |
| mumble | harness (handshake readiness/retry stabilization) | — | pending |
| discourse | overlay-scope test PR + upstream issue (decide) | — | pending |
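
One possible shape for the mumble readiness/retry stabilization is a deadline-bounded retry with capped exponential backoff, sketched below. This is an assumption about the approach, not the harness's code: `wait_until_ready` and `probe` are hypothetical names, `probe` stands in for whatever performs the TLS+ServerSync handshake, and the 60-second default mirrors the retry window mentioned in this document.

```python
import time


def wait_until_ready(probe, timeout_s=60.0, initial_delay_s=1.0,
                     max_delay_s=8.0, sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` until it returns True or the deadline passes.

    Returns the number of attempts made; raises TimeoutError on expiry.
    `sleep` and `clock` are injectable so the gate can be tested quickly.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    attempts = 0
    while True:
        attempts += 1
        if probe():
            return attempts
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"not ready after {attempts} attempts")
        sleep(min(delay, remaining))         # never sleep past the deadline
        delay = min(delay * 2, max_delay_s)  # back off, but keep probing
```

Gating the custom tier on a probe like this, rather than on a fixed number of retries, would let a loaded node take longer without changing behaviour when the service comes up quickly.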
Gate: M1 — PASS (above). M2 not yet claimed.
WHAT (M1 DoD). All six canon-sweep failures investigated in ISOLATION (one recipe at a time, no
concurrent sweep load), root-caused with first-hand evidence, and classified (flake vs genuine; recipe
vs test vs warm-machinery vs load) — see the M1 results table + HOW the Adversary cold-verifies
sections above. Summary: discourse = stale cc-ci overlay test (canon timeout/FATA root-cause was
wrong); mattermost-lts = genuine recipe defect (no backupbot.restore.post-hook); mumble = load/timing
FLAKE (2× isolation green); bluesky-pds = genuine routing defect (caddy↔app app-alias collision on
shared proxy); gitea = genuine recipe defect (read-only app.ini config mount + 3.6.0 JWT save);
keycloak = harness warm-domain namespace collision. NO "probably a flake" — every classification has
an isolation re-run or code proof.
HOW + EXPECTED + WHERE. Per-recipe cold-verify commands, expected outputs, and evidence paths are
in the two sections above ("M1 results table" and "HOW the Adversary cold-verifies each classification").
Evidence logs on cc-ci: /tmp/redfix-{discourse,mattermost-lts,mumble,mumble2,bluesky-pds,gitea2}.log.
Reasoning/dead-ends in JOURNAL-redfix.md. Node left clean (see "Node state left clean" above).
Blocked
(none)