cc-ci/machine-docs/STATUS-redfix.md
autonomic-bot abfbe8b0aa
journal+status(redfix): M2 recon — discourse #4 (official-image) already !testme-green; mattermost #1 (pg-restore) triggered for verify
2026-06-18 01:24:48 +00:00


STATUS — phase redfix

Phase SSOT: /srv/cc-ci/cc-ci-plan/plan-phase-redfix-canon-sweep-failures.md

Mission: investigate every canon-sweep failure (discourse, mattermost-lts, mumble, bluesky-pds, gitea, keycloak) → isolate → root-cause → classify (flake vs genuine; recipe vs test vs warm-machinery vs load) → FIX each (recipe PR or harness improvement) → verify green. No standing exceptions. Nothing merged.

Phase: M1 — investigate + isolate + classify (IN PROGRESS)

Bootstrapped 2026-06-17T23:20Z. cc-ci healthy, no run in flight, next scheduled sweep 2026-06-21 (3-day clear window). Disk: / has 38G free (75% used).

Isolation harness (how I reproduce each failure ALONE)

Each canon-sweep per-recipe run is runner/nightly_sweep.run_on_tag(recipe, latest): abra.recipe_checkout(recipe, <latest-tag>), then run_recipe_ci.py with RECIPE=<r> CCCI_SKIP_FETCH=1 and REF/CCCI_QUICK/MODE/VERSION unset (cold, full, head==tag). Isolation means running ONE recipe at a time with NO concurrent sweep load on the single node (the loaded node is the known flake source per phase plan §2.1). Runs execute on cc-ci from /etc/cc-ci.
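The environment handling described above can be sketched as follows. This is a minimal illustration, not the harness code: the helper name `isolation_env` is hypothetical, and the variable names are taken from the isolation invocation given later in this document.

```python
# Variables that must be absent for a cold, full, head==tag run
# (names follow the `env -u ...` invocation in this document).
UNSET_FOR_ISOLATION = ("REF", "CCCI_QUICK", "MODE", "VERSION")


def isolation_env(base_env, recipe):
    """Return a copy of base_env prepared for a single-recipe isolation run.

    `isolation_env` is a hypothetical helper, not part of the cc-ci harness.
    """
    env = {k: v for k, v in base_env.items() if k not in UNSET_FOR_ISOLATION}
    env["RECIPE"] = recipe
    env["CCCI_SKIP_FETCH"] = "1"
    return env


# Example with a made-up starting environment.
example = isolation_env({"PATH": "/usr/bin", "REF": "abc123", "MODE": "quick"}, "mumble")
print(sorted(example))
```

The resulting mapping could be passed as the `env` argument of `subprocess.run` when launching a single recipe run, which keeps stale REF or MODE values from a previous shell session out of the run.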

Starting canonical state (cc-ci /var/lib/ci-warm/<r>/canonical.json, read 2026-06-17T23:19Z)

| Recipe | Canonical now | Note |
| --- | --- | --- |
| discourse | (none) | no canonical dir |
| mattermost-lts | (none) | no canonical dir |
| mumble | 1.0.0+v1.6.870-0 @ 20260617T180501Z | canonical PRESENT, written TODAY: flake signal |
| bluesky-pds | (none) | no canonical dir |
| gitea | 3.5.3+1.24.2-rootless @ 20260617T083930Z | 3.6.0 advance not promoted |
| keycloak | (none) | de-enrolled (WARM_CANONICAL off) |

M1 investigation tracker

| Recipe | Isolation run | Result | Root cause | Classification |
| --- | --- | --- | --- | --- |
| discourse | DONE @23:40Z (/tmp/redfix-discourse.log on cc-ci) | install/backup/restore/custom PASS; upgrade overlay FAIL. Deploys+serves fine; NOT a timeout/FATA. | cc-ci overlay tests/discourse/test_upgrade.py asserts head runs official discourse/discourse:3.5.3 + drops sidekiq; latest tag 0.8.1+3.5.0 AND main both still use bitnamilegacy/discourse:3.5.0+sidekiq (migration exists in no release/main). The depends_on discourse string is a non-fatal prepull-only warning, not the deploy. | stale/PR-specific cc-ci OVERLAY test mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) |
| mattermost-lts | DONE @00:05Z (/tmp/redfix-mattermost-lts.log) | install/upgrade/backup/custom PASS; restore FAIL: ci_marker does not exist; deterministic in isolation (not a load race) | the recipe's postgres service backup labels back up the hot live PGDATA plus a dump but have NO backupbot.restore.post-hook to replay the dump, so restore doesn't round-trip postgres. Contrast immich (passes): dump-only backup.volumes.postgres.path: backup.sql + restore.post-hook: /pg_backup.sh restore. | genuine RECIPE defect at latest → recipe PR (adopt immich-style dump+restore-post-hook) |
| mumble | DONE, 2× isolation GREEN (/tmp/redfix-mumble.log + /tmp/redfix-mumble2.log) | ALL tiers PASS incl. handshake on BOTH runs; no orphans; canonical re-promoted green each time | handshake (TLS+ServerSync) not completing within ~60s retry under heavy concurrent sweep load; fine in isolation | load/timing FLAKE → harness stabilization (readiness gate / retry) |
| bluesky-pds | DONE @00:45Z (/tmp/redfix-bluesky-pds.log + live diag) | cold lifecycle GREEN; WC5 promote 000 reproduces (warm /xrpc/_health last status 0). NOT a flake | caddy on-demand TLS (ask http://app:3000/tls-check) can't reach app: caddy resolves bare app to OTHER stacks' app endpoints on the shared proxy net (getent app → only 10.10.0.X, never internal 10.0.3.3; proxy has drone/traefik/keycloak/ccci app aliases) → no cert → 000. Promote machinery correct (refused to write canonical). | genuine routing/RECIPE defect (cross-stack app-alias collision on shared proxy) → recipe PR: unique PDS service name/alias. NOT promote-machinery, NOT flake |
| gitea | DONE @00:14Z (/tmp/redfix-gitea2.log + live container logs) | cold lifecycle (incl fresh 3.5.3→3.6.0 upgrade) PASS; warm advance crash-loops | LoadCommonSettings() [F] error saving JWT Secret … failed to save "/etc/gitea/app.ini": read-only file system. gitea 3.6.0/1.24.2 tries to persist a JWT to the read-only app.ini docker-config mount on warm reattach (before DB migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite). | genuine RECIPE defect (3.6.0 + read-only app.ini config mount on advance) → recipe PR: render app.ini into the writable config volume. (1st gitea run hit a nixenv "already deployed" leftover confound; fixed by undeploying to idle then re-running) |
| keycloak | DONE @01:05Z (code-verified; no run) | de-enrolled. canonical_domain("keycloak") == WARM_DOMAINS["keycloak"] == warm-keycloak.ci.commoninternet.net EXACTLY (canonical.py:42, warm.py:27,44). Live keycloak 200 /realms/master. | data-warm canonical domain uses the same warm-<r> scheme as the live-warm OIDC provider → promote would collide with live shared SSO. No collision-free canonical namespace exists. | HARNESS defect (warm-domain namespace collision) → fix: collision-free canonical_domain for live-warm providers (warm-canon-<r>), then enroll keycloak |
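The bluesky-pds root cause above is an alias shared by several stacks on one network. A minimal sketch of how such collisions could be listed, assuming the (stack, alias) pairs have already been extracted from something like docker network inspect output. The function name and the sample data are hypothetical and only echo stack names mentioned in this document.

```python
from collections import defaultdict


def colliding_aliases(endpoints):
    """Return aliases claimed by more than one stack on a shared network.

    endpoints: iterable of (stack_name, alias) pairs.
    Result maps each colliding alias to the sorted list of stacks using it.
    """
    owners = defaultdict(set)
    for stack, alias in endpoints:
        owners[alias].add(stack)
    return {alias: sorted(stacks) for alias, stacks in owners.items() if len(stacks) > 1}


# Illustrative data only, not captured from the node.
sample = [
    ("drone", "app"),
    ("traefik", "app"),
    ("keycloak", "app"),
    ("ccci", "app"),
    ("bluesky-pds", "caddy"),
]
print(colliding_aliases(sample))
```

Any alias that appears in the result is ambiguous for a client resolving the bare name on that network, which is the situation caddy is in when it looks up app.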

M1 results table (recipe → failure → isolation result → root cause → classification → fix approach)

| Recipe | Canon-sweep failure | Isolation result | Flake or genuine | Root cause | Class | Fix approach (M2) |
| --- | --- | --- | --- | --- | --- | --- |
| discourse | "cold-deploy timeout / deploy FATA" | install/backup/restore/custom GREEN; upgrade overlay RED | genuine (deterministic), but the canon root-cause was WRONG (no timeout, no deploy FATA) | cc-ci overlay tests/discourse/test_upgrade.py asserts head = official discourse/discourse:3.5.3 + sidekiq dropped; that migration is in NO release tag and NOT in main (all use bitnamilegacy/discourse:3.5.0+sidekiq) | stale/PR-specific cc-ci OVERLAY test | make the overlay assert migration-faithfulness only when the head IS that migration (not vs a release tag), OR a recipe PR migrating off deprecated bitnamilegacy; settle in M2 (NOT a test-weakening) |
| mattermost-lts | test_restore_returns_state RED | install/upgrade/backup/custom GREEN; restore RED (ci_marker does not exist) | genuine (deterministic in isolation); NOT the canon "loaded-node race" | recipe postgres backup labels back up hot PGDATA + a dump but have no backupbot.restore.post-hook to replay it; restore doesn't round-trip. immich (passes) uses dump-only path + restore.post-hook | genuine RECIPE defect | recipe PR: adopt immich-style postgres dump + backupbot.restore.post-hook replay |
| mumble | test_handshake… RED | ALL tiers GREEN in isolation (×N) incl. handshake | FLAKE (load/timing) | handshake (TLS+ServerSync) doesn't complete within the 60s retry under heavy concurrent sweep load; fine isolated; canonical written green today | load/concurrency FLAKE | harness stabilization: stronger readiness gate before the custom tier / longer-or-smarter handshake retry |
| bluesky-pds | warm promote /xrpc/_health → 000 | cold lifecycle GREEN; warm promote 000 reproduces | genuine (deterministic); NOT a load/rate-limit flake | caddy on-demand TLS calls http://app:3000/tls-check; caddy resolves bare app to OTHER stacks' app endpoints on the shared proxy net (every stack aliases its main svc app), never bluesky's own internal app (10.0.3.3) → connection refused → no cert → 000 | genuine ROUTING/RECIPE defect (cross-stack app-alias collision) | recipe PR: give the PDS service a unique name/alias so caddy resolves only bluesky's app |
| gitea | 3.5.3→3.6.0 warm advance doesn't promote | cold (incl fresh upgrade) GREEN; warm advance crash-loops | genuine (deterministic) | gitea 3.6.0/1.24.2 saves a JWT secret to /etc/gitea/app.ini on warm reattach; app.ini is a read-only docker-config mount → read-only file system FATA at LoadCommonSettings (pre-migration; 3.5.3 data intact). Cold passes (fresh secrets, no rewrite) | genuine RECIPE defect | recipe PR: render app.ini into the writable config:/etc/gitea volume (entrypoint) instead of a read-only docker config |
| keycloak | de-enrolled (not tested) | code-verified (no run) | genuine (structural) | canonical_domain("keycloak") == WARM_DOMAINS["keycloak"] == warm-keycloak.ci.commoninternet.net EXACTLY → a data-warm canonical would collide with the live-warm OIDC provider | HARNESS defect (warm-domain namespace collision) | harness: collision-free canonical_domain for live-warm providers (warm-canon-<r>), then enroll keycloak (WARM_CANONICAL=True) |
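The keycloak fix approach (a warm-canon-<r> name for live-warm providers) might look roughly like the sketch below. It reuses the names canonical_domain, stable_domain and WARM_DOMAINS from this document, but the bodies and the BASE_DOMAIN constant are assumptions for illustration, not the contents of runner/harness/canonical.py or warm.py.

```python
BASE_DOMAIN = "ci.commoninternet.net"  # assumed from the domains quoted in this document

# Live-warm providers and their stable domains; only keycloak is named in the source.
WARM_DOMAINS = {"keycloak": f"warm-keycloak.{BASE_DOMAIN}"}


def stable_domain(recipe):
    """Domain under the warm-<r> scheme."""
    return f"warm-{recipe}.{BASE_DOMAIN}"


def canonical_domain(recipe):
    """Pick a canonical domain that cannot equal a live-warm provider's domain.

    Hypothetical logic: recipes that are also live-warm providers get the
    warm-canon-<r> name; all others keep the warm-<r> scheme.
    """
    if recipe in WARM_DOMAINS:
        return f"warm-canon-{recipe}.{BASE_DOMAIN}"
    return stable_domain(recipe)


print(canonical_domain("keycloak"))
print(canonical_domain("gitea"))
```

Under these assumptions the data-warm canonical for keycloak no longer shares a name with the live OIDC provider, while recipes without a live-warm instance are unaffected.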

HOW the Adversary cold-verifies each classification (run ONE recipe at a time, no concurrent load)

Isolation invocation (per recipe R at latest tag T), from /etc/cc-ci on cc-ci:

git -C ~/.abra/recipes/R checkout -f --quiet T && env -u REF -u CCCI_QUICK -u MODE -u VERSION RECIPE=R CCCI_SKIP_FETCH=1 cc-ci-run runner/run_recipe_ci.py

Latest tags: discourse 0.8.1+3.5.0, mattermost-lts 2.1.9+10.11.15, mumble 1.0.0+v1.6.870-0, bluesky-pds 0.3.0+v0.4.219, gitea 3.6.0+1.24.2-rootless.

  • discourse: EXPECT install/backup/restore/custom pass, upgrade fail on test_head_runs_official_image_not_bitnamilegacy + test_sidekiq_service_dropped_by_head. Confirm the overlay mismatch statically: git -C ~/.abra/recipes/discourse show 0.8.1+3.5.0:compose.yml | grep -A1 ' app:' and ... show main:compose.yml both show bitnamilegacy/discourse:3.5.0; grep -c 'sidekiq:' = 1 in both. So the test's discourse/discourse:3.5.3/no-sidekiq expectation exists nowhere upstream.
  • mattermost-lts: EXPECT restore fail with relation "ci_marker" does not exist. Confirm root cause statically: git -C ~/.abra/recipes/mattermost-lts show 2.1.9+10.11.15:compose.yml | grep backupbot shows pre-hook + backup.path but NO restore.post-hook; for immich, git -C ~/.abra/recipes/immich show <latest>:compose.yml | grep backupbot shows restore.post-hook: /pg_backup.sh restore.
  • mumble: EXPECT all tiers green (run 2-3× to confirm reproducibly green isolated). Canonical written green: cat /var/lib/ci-warm/mumble/canonical.json.
  • bluesky-pds: EXPECT cold green, then the WC5 promote failing with !! WC5 promote failed … warm-bluesky-pds … last status 0. While the warm stack is up, confirm root cause: caddy logs dial tcp 10.10.0.X:3000: connect: connection refused for app:3000/tls-check; docker exec <caddy> getent hosts app returns proxy IPs (10.10.0.X), while the app's real internal IP is 10.0.3.x; docker network inspect proxy | grep _app shows many stacks aliasing app. (Tear down the orphaned warm-bluesky-pds stack + volumes after.)
  • gitea: REQUIRES idle canonical first: if warm-gitea is deployed, docker stack rm warm-gitea_ci_commoninternet_net (retains data+config volumes) so the advance reattaches from idle. EXPECT cold green, warm advance crash-loop with container log LoadCommonSettings() [F] error saving JWT Secret … "/etc/gitea/app.ini": read-only file system. Restore: leave warm-gitea undeployed (idle 3.5.3, volumes retained); registry stays 3.5.3+1.24.2-rootless.
  • keycloak: no run. Code-verify: canonical.canonical_domain('keycloak') → warm.stable_domain('keycloak') → warm-keycloak.ci.commoninternet.net; warm.WARM_DOMAINS['keycloak'] == same string (runner/harness/canonical.py:42-44, warm.py:27-29,44-48). Live keycloak 200 on /realms/master.
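The static mattermost-lts check in the list above (grep for backupbot labels and look for a restore post-hook) can also be expressed as a small text check. This is a sketch under assumptions: the function name is hypothetical, and the sample label lines are illustrative rather than copied from either recipe.

```python
def has_restore_post_hook(compose_text):
    """True if any backupbot label line in the text declares a restore post-hook."""
    for line in compose_text.splitlines():
        if "backupbot" in line and "restore.post-hook" in line:
            return True
    return False


# Illustrative label lines only (not taken from the real compose files).
dump_and_replay = """
      - "backupbot.backup=true"
      - "backupbot.backup.volumes.postgres.path=backup.sql"
      - "backupbot.restore.post-hook=/pg_backup.sh restore"
"""
backup_only = """
      - "backupbot.backup=true"
      - "backupbot.backup.pre-hook=/pg_backup.sh backup"
"""
print(has_restore_post_hook(dump_and_replay), has_restore_post_hook(backup_only))
```

Fed with the output of git show <tag>:compose.yml for a recipe, a False result would match the mattermost-lts finding that the dump is taken but never replayed on restore.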

Node state left clean

All isolation runs torn down; orphaned warm-bluesky-pds stack+volumes removed; warm-gitea restored to idle 3.5.3 (volumes retained, registry unchanged); only live warm-keycloak deployed (healthy). No run_recipe_ci.py processes.

M1 — PASS @ 2026-06-18T01:18Z (REVIEW-redfix.md; all 6 classifications cold-verified CORRECT by Adversary's own isolation re-runs). No VETO. Cleared to M2.

Phase: M2 — FIX + verify all six (IN PROGRESS)

Fix designs locked in BACKLOG-redfix.md. Recipe PRs (mattermost-lts/bluesky/gitea) on git.autonomic.zone mirrors via the recipe mirror+PR flow, verified !testme (NEVER merge). Harness fixes (keycloak/mumble) on a cc-ci branch, verified via the harness. discourse: overlay-scope decision. Node now free for my deploys (Adversary done with M1).

M2 fix tracker

| Recipe | Fix type | PR/branch | Status |
| --- | --- | --- | --- |
| mattermost-lts | recipe PR (pg_backup.sh + restore.post-hook) | mirror PR #1 ci/pg-restore (exists, correct) | !testme triggered @01:24Z; verifying restore green |
| bluesky-pds | recipe PR (unique internal alias for caddy→app) | | pending |
| gitea | recipe PR (app.ini → writable volume) | | pending |
| keycloak | harness (collision-free canonical_domain) + enroll | | pending |
| mumble | harness (handshake readiness/retry stabilization) | | pending |
| discourse | recipe PR (official-image migration) | mirror PR #4 discourse-official-image | already !testme-GREEN @53ba0910 (run #849, 16:36Z); re-verify fresh |
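For the mumble row, a readiness gate or handshake retry usually reduces to polling a probe against a deadline. The sketch below shows that pattern only; wait_until and the fake probe are hypothetical, and the real fix may differ in where the gate sits and how long it waits.

```python
import time


def wait_until(probe, timeout_s, interval_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True or timeout_s has elapsed.

    Returns True on success, False if the deadline passes first.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demonstration with a fake probe that succeeds on its third call.
attempts = []


def fake_probe():
    attempts.append(1)
    return len(attempts) >= 3


print(wait_until(fake_probe, timeout_s=5, interval_s=0.01))
```

One way to use it would be two stages: first gate on a cheap readiness probe before the custom tier starts, then run the handshake probe with its own budget, so that time spent waiting for a loaded node does not eat into the handshake retry window.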

Gate: M1 — PASS (above). M2 not yet claimed.

WHAT (M1 DoD). All six canon-sweep failures investigated in ISOLATION (one recipe at a time, no concurrent sweep load), root-caused with first-hand evidence, and classified (flake vs genuine; recipe vs test vs warm-machinery vs load) — see the M1 results table + HOW the Adversary cold-verifies sections above. Summary: discourse = stale cc-ci overlay test (canon timeout/FATA root-cause was wrong); mattermost-lts = genuine recipe defect (no backupbot.restore.post-hook); mumble = load/timing FLAKE (2× isolation green); bluesky-pds = genuine routing defect (caddy↔app app-alias collision on shared proxy); gitea = genuine recipe defect (read-only app.ini config mount + 3.6.0 JWT save); keycloak = harness warm-domain namespace collision. NO "probably a flake" — every classification has an isolation re-run or code proof.

HOW + EXPECTED + WHERE. Per-recipe cold-verify commands, expected outputs, and evidence paths are in the two sections above ("M1 results table" and "HOW the Adversary cold-verifies each classification"). Evidence logs on cc-ci: /tmp/redfix-{discourse,mattermost-lts,mumble,mumble2,bluesky-pds,gitea2}.log. Reasoning/dead-ends in JOURNAL-redfix.md. Node left clean (see "Node state left clean" above).

Blocked

(none)