23 KiB
JOURNAL — phase redfix
2026-06-17T23:20Z — Bootstrap
Read phase plan + plan.md §6.1/§7/§9 + canon DECISIONS exceptions (lines ~1494–1552). Six canon-sweep failures to investigate. Confirmed cc-ci access, no run in flight, sweep timer next fires 2026-06-21 (3-day window), disk 38G free.
Isolation mechanism understood: runner/nightly_sweep.run_on_tag = abra.recipe_checkout(r, tag) +
run_recipe_ci.py RECIPE=<r> CCCI_SKIP_FETCH=1 cold/full. I reproduce each failure by running ONE
recipe at a time with no concurrent load.
Starting canonical state notable: mumble canonical IS present (1.0.0+v1.6.870-0, written
20260617T180501Z — during today's nixenv sweep). The canon DECISIONS recorded mumble RED
(test_handshake_completes_with_channel_presence). A canonical only gets written on a GREEN cold run
on latest → mumble flipped green in a recent run. Strong early evidence for the operator's "mumble
passed before" → load flake hypothesis. Must confirm with a clean isolation re-run + check whether the
canon-sweep red was under concurrent load.
Next: start M1 investigation. Plan order (cheap/informative first): triage the existing sweep logs on
cc-ci to pin the EXACT assertion/error for each (mumble, mattermost-lts restore, gitea app.ini,
bluesky routing, discourse compose), then run isolation re-runs. discourse's recorded cause is an
UPSTREAM compose defect (sidekiq.depends_on: discourse while service is app) that FATAs before any
deploy — that's deterministic, not a load timeout, so it may not even need a long isolation run to
confirm; verify the compose at the latest tag directly first.
2026-06-17T23:40Z — M1: discourse isolation run — CANON ROOT-CAUSE WAS WRONG
Ran discourse ALONE on cc-ci (recipe_checkout discourse 0.8.1+3.5.0 + RECIPE=discourse CCCI_SKIP_FETCH=1 cc-ci-run runner/run_recipe_ci.py, log /tmp/redfix-discourse.log).
RESULT: install PASS, upgrade FAIL, backup PASS, restore PASS, custom PASS — the recipe deploys,
serves (200 /srv/status), backs up and restores cleanly. NOT a deploy timeout, NOT a 51-min wedge,
NOT a deploy FATA. The canon DECISIONS root-cause ("abra app deploy FATAs: service sidekiq depends
on undefined service discourse → invalid compose project") is misattributed: that string appears
ONLY from the non-fatal prepull docker compose config --images (rc=15, harness logs "skipping
(deploy will pull as usual)"). The real abra app deploy is a swarm docker stack deploy, which
ignores depends_on entirely → the stack converges (UpdateStatus=completed).
The ONLY failure is the cc-ci upgrade OVERLAY tests/discourse/test_upgrade.py:
test_head_runs_official_image_not_bitnamilegacy— app image isbitnamilegacy/discourse:3.5.0; test demandsdiscourse/discourse:3.5.3(official).test_sidekiq_service_dropped_by_head— services['app','db','redis','sidekiq']; test demands sidekiq dropped.
These prevb-phase overlay tests are PR-FAITHFULNESS assertions for a specific migration PR
(bitnamilegacy → official discourse/discourse:3.5.3, drop sidekiq). Verified that migration exists
in NO upstream release tag and NOT in main — git show main:compose.yml and every tag
(0.1.0…0.8.1+3.5.0) all use bitnamilegacy/discourse:3.5.0 + sidekiq. So the overlay asserts a
state that doesn't exist anywhere upstream → deterministic RED whenever the sweep tests the latest
release tag. The head DID deploy (chaos-version label = head f87c612d+U, converged) — the test
expectation is simply wrong for the released recipe.
Note (M2 design): migrating discourse from the deprecated bitnamilegacy image to official
discourse/discourse is a MAJOR recipe rewrite (different fs layout, entrypoint, no /opt/bitnami
sidekiq run.sh) — not a 1-line image swap. So the overlay test's discourse/discourse:3.5.3
expectation may not be a realistic near-term recipe change. The bitnamilegacy deprecation is real
(bitnami sunset legacy images), so a migration is the right long-term direction, but the test as
written hard-codes a migration target absent upstream. Classification + fix approach to settle in M1
table / M2.
Classification: stale/PR-specific cc-ci OVERLAY test mismatched to the canonical-sweep context
(NOT a flake, NOT a load timeout, NOT a recipe-deploy defect, NOT warm-machinery). Teardown clean (no
discourse stack left). Evidence: /tmp/redfix-discourse.log on cc-ci; junit under
/var/lib/cc-ci-runs/manual/junit/upgrade__cc-ci__test_upgrade.xml.
2026-06-18T00:05Z — M1: mattermost-lts isolation run — DETERMINISTIC restore failure (recipe defect)
Ran mattermost-lts ALONE (tag 2.1.9+10.11.15, log /tmp/redfix-mattermost-lts.log).
RESULT: install/upgrade/backup/custom PASS, restore FAIL — identical to the canon failure:
tests/mattermost-lts/test_restore.py::test_restore_returns_state → relation "ci_marker" does not exist after restore. So it is deterministic in isolation, NOT a loaded-node race (canon framing
was wrong). The marker logic is sound (postgres table seeded pre-backup, dropped pre-restore, asserted
post-restore — same pattern immich uses and PASSES).
ROOT CAUSE (recipe backup/restore labels). Compared mattermost-lts vs immich (immich passes the IDENTICAL test):
- immich
databasesvc:backupbot.backup.pre-hook: /pg_backup.sh backup,backupbot.backup.volumes.postgres.path: backup.sql(backs up ONLY the dump file), andbackupbot.restore.post-hook: /pg_backup.sh restore(replays the dump on restore). → round-trips. - mattermost-lts
postgressvc:pre-hook: pg_dump > /var/lib/postgresql/data/postgres-backup.sql,backup.path: /var/lib/postgresql/data/(backs up the WHOLE live/hot PGDATA dir + the dump),post-hook: rm .../postgres-backup.sql, and NObackupbot.restore.post-hook. So on restore, abra restores the files but NOTHING replays the dump, and a hot-copied live PGDATA over a running postgres does not reload →ci_markerlost. Restore log confirmsRestoring Snapshot b0495d36 at /with no post-hook reimport.
Classification: GENUINE RECIPE DEFECT at latest (postgres backup/restore does not round-trip —
missing restore post-hook + backs up hot PGDATA instead of dump-only). NOT a flake, NOT cc-ci test
weakening (test is correct & unmodified; immich proves the pattern works). Fix (M2) = recipe PR
adopting the immich-style postgres backup/restore (a /pg_backup.sh-style dump + restore post-hook).
Teardown clean (no matt stack). Evidence: /tmp/redfix-mattermost-lts.log; junit
restore__cc-ci__test_restore.xml.
Tooling note: my background "waiter" loop while pgrep -f run_recipe_ci.py self-matched (its own
cmdline contains the string) → never exited, falsely showed a run active. Use pgrep -f "[r]un_recipe_ci.py" or match the python invocation. Killed the stuck waiters; node confirmed free.
2026-06-18T00:18Z — M1: mumble isolation run — GREEN (flake confirmed)
Ran mumble ALONE (tag 1.0.0+v1.6.870-0, log /tmp/redfix-mumble.log). RESULT: ALL tiers PASS
(install/upgrade/backup/restore/custom), including custom/test_protocol_handshake.py:: test_handshake_completes_with_channel_presence PASSED. No orphan stacks. The canon sweep recorded
this RED (test_handshake… failed under concurrent sweep load); it is GREEN here in isolation, and
its canonical was already written green TODAY (1.0.0+v1.6.870-0 @20260617T180501Z) under the lighter
nixenv sweep. → load/timing FLAKE on the control-channel handshake, NOT a recipe defect.
The handshake test already retries (retry_handshake(attempts=12, interval=5.0) = 60s). So the flake
is the voice server not completing the TLS+ServerSync handshake within ~60s under heavy concurrent
node load (deploy contention). M2 fix = harness stabilization (stronger readiness gate before the
custom tier / longer-or-smarter retry / serialize), based on the load failure mode. Classification:
FLAKE (load/concurrency) → harness stabilization.
Reproducibility: 1 green isolation run here + canonical green today + documented red under canon load. Will do 1–2 more isolation repeats before the M1 claim to firm "reproducibly green in isolation."
2026-06-18T00:45Z — M1: bluesky-pds isolation run — 000 REPRODUCES; root cause = app DNS collision on shared proxy
Ran bluesky-pds ALONE (tag 0.3.0+v0.4.219, log /tmp/redfix-bluesky-pds.log). Cold lifecycle GREEN
(install/backup/restore/custom pass; upgrade EXPECTED_NA per recipe_meta — moving pds:0.4 tag). Then
WC5 promote-on-green-cold FAILED exactly as canon: warm-bluesky-pds.ci.commoninternet.net: not healthy over HTTPS /xrpc/_health (last status 0). So the 000 reproduces deterministically in
isolation — NOT a sweep-load/ACME-rate-limit flake (my first hypothesis, refuted).
LIVE DIAGNOSIS (stack left deployed by the failed promote; probed before teardown):
- app service 1/1, healthy:
docker exec app wget localhost:3000/xrpc/_health→{"version":"0.4.219"}; app listens on:::3000; no restarts. So the PDS itself is fine. - HTTPS to warm domain → 000. caddy logs flood:
tls "failed to get permission for on-demand certificate" domain=warm-bluesky-pds… error=… Get "http://app:3000/tls-check?domain=…": dial tcp 10.10.0.X:3000: connect: connection refused(X varies: .2 .4 .5 .6 .8 .9 .10 .12). - bluesky uses caddy on-demand TLS (Caddyfile:
on_demand_tls { ask http://app:3000/tls-check },tls { on_demand },reverse_proxy app:3000). caddy must reach app:3000/tls-check to be GRANTED a cert before serving TLS. It can't → no cert → TLS handshake fails → 000. - WHY can't caddy reach app: service-name
appcollision on the sharedproxyoverlay.- app is on
warm-bluesky-pds…_internalONLY (IP 10.0.3.3). caddy is onproxy(10.10.50.223) +…_internal(10.0.3.6). docker exec caddy getent hosts app→ returns ONLY proxy IPs (8/8 tries: 10.10.0.4/.5/.6/.10/.12), NEVER the internal 10.0.3.3. The proxy-netappalias shadows bluesky's own internal app.docker network inspect proxyshows EVERY stack aliases its main serviceapp:drone…_app=10.10.0.2,traefik…_app=10.10.0.5,warm-keycloak…_app=10.10.0.9,ccci-reports/bridge/dashboard_app, … — exactly the IPs caddy hits. None listens a PDS on 3000 → connection refused. So caddy resolves bareappto OTHER stacks' app endpoints on the shared proxy, never its own PDS.
- app is on
WHY cold passes / warm fails: cold's health window is long (HTTP_TIMEOUT=600) and on first success caddy CACHES the issued cert; the promote's shorter health window doesn't give caddy a chance to ever resolve correctly (and here it provably never resolves to 10.0.3.3 at all). The collision is the root cause; the promote machinery is CORRECT (it refused to write a canonical for an unhealthy 000 — no canonical.json written, verified).
Classification: genuine ROUTING/recipe defect — caddy↔app cross-stack app-alias collision on the
shared proxy net, deterministic, reproducible in isolation. NOT a flake; NOT a promote-machinery bug.
Fix approach (M2): recipe PR giving the PDS service a UNIQUE name/alias (e.g. rename app→pds) so
caddy's reverse_proxy/tls-check resolve only bluesky's own internal service (no shared-proxy app
collision). (Alternatively a caddy-side internal-only resolution; renaming is cleanest.) Will confirm
the exact fix in M2 + verify the warm domain then serves 200.
Cleanup: removed orphaned warm-bluesky-pds stack + its volumes/secrets (promote had left it deployed; no canonical written). Node clean.
2026-06-18T01:05Z — M1: keycloak — warm-domain namespace collision (harness), classification complete
keycloak was de-enrolled (WARM_CANONICAL=False) because its data-warm canonical domain would collide with the LIVE-warm OIDC provider. Verified the collision STRUCTURALLY (code, no run needed):
canonical.canonical_domain(r)→warm.stable_domain(r)→f"warm-{r}.ci.commoninternet.net"(runner/harness/canonical.py:42-44, warm.py:44-48).warm.WARM_DOMAINS["keycloak"] = "warm-keycloak.ci.commoninternet.net"(warm.py:27-29) — the always-on shared OIDC provider lasuite-*/drone consume for SSO; kept current by roll_warm_infra.- So
canonical_domain("keycloak") == WARM_DOMAINS["keycloak"]EXACTLY. Enrolling keycloak as a data-warm canonical → the sweep's promote deploy/teardown at warm-keycloak collides with the live provider. Confirmed live keycloak healthy (200 /realms/master) — I did not disturb it.
The collision is unique to keycloak: it is the ONLY recipe that is both a live-warm provider (in WARM_DOMAINS) AND would want a canonical. No collision-free canonical namespace exists today.
Classification: HARNESS defect — warm canonical domain namespace can collide with a live-warm
provider. NOT a recipe/flake. Fix approach (M2): make canonical_domain(r) collision-free when r
is a live-warm provider — e.g. warm-canon-<r> (or unconditionally) so the canonical deploy gets a
distinct domain → distinct stack → cannot touch the live warm-keycloak. Then set keycloak
WARM_CANONICAL=True and verify it promotes at the collision-free domain WITHOUT disrupting live
keycloak. Minimal blast radius: special-case only providers in WARM_DOMAINS (the 15 other canonicals
keep warm-<r>); confirm in M2.
2026-06-18T01:05Z — M1: gitea first advance attempt hit a LEFTOVER confound (not the real crash)
First gitea cold@3.6.0 run: cold lifecycle (install/upgrade/backup/restore/custom) ALL PASS; promote
advance FAILED with FATA warm-gitea.ci.commoninternet.net is already deployed — NOT the app.ini
crash. Cause: warm-gitea was left DEPLOYED at 3.5.3 by the nixenv-phase sweep (registry said
status=idle but the stack was actually running — a state inconsistency). The advance does abra app deploy warm-gitea assuming the canonical is idle/undeployed; finding it deployed, abra FATAs. This is
the same GREEN-BUT-PROMOTE-FAILED the nixenv phase saw. To reproduce the REAL app.ini issue I undeployed
warm-gitea (docker stack rm; retained data+config volumes → proper idle state) and re-ran gitea
cold@3.6.0 (gitea2). Result pending. NOTE: the "already deployed" promote-failure-when-left-deployed
may be a secondary promote-machinery robustness gap (advance should undeploy-or-chaos an
already-deployed canonical) — will assess after confirming the primary app.ini crash.
2026-06-18T00:14Z — M1: gitea warm advance — app.ini read-only JWT crash CONFIRMED (recipe defect)
After restoring warm-gitea to proper idle state (undeployed, 3.5.3 data+config volumes retained), re-ran gitea cold@3.6.0 (gitea2, log /tmp/redfix-gitea2.log). Cold lifecycle ALL PASS (install/upgrade/backup/restore/custom — incl. the cold FRESH 3.5.3→3.6.0 upgrade tier). WC5 promote advance then crash-loops. Live container logs (warm-gitea_..._app, repeated Failed/exit 1):
modules/setting/setting.go:105:LoadCommonSettings() [F] Unable to load settings from config: error saving JWT Secret for custom config: failed to save "/etc/gitea/app.ini": open /etc/gitea/app.ini: read-only file system
EXACTLY the canon-documented crash. Mechanism: the recipe mounts app.ini as a docker config
(read-only by design) at /etc/gitea/app.ini (compose configs: - source: app_ini target: /etc/gitea/app.ini, app.ini.tmpl). gitea 1.24.2 (3.6.0), on the warm REATTACH of the retained
3.5.3 config volume, decides to (re)generate+SAVE a JWT secret to app.ini → read-only fs → FATA at
config-load, BEFORE any DB migration (so the 3.5.3 data volume stays intact — confirmed canon).
Why cold passes but warm crashes: the cold fresh deploy + cold chaos-upgrade use freshly-generated secrets consistent with a freshly-initialized config, so gitea never needs to rewrite app.ini. The warm advance reattaches an OLDER retained config-volume state (seeded under 3.5.3) against the new run's secrets/3.6.0 binary → gitea reconciles by trying to persist a JWT secret → read-only crash.
Classification: genuine RECIPE defect (gitea 3.6.0/1.24.2 + read-only app.ini docker-config mount
on the warm-reattach advance), deterministic, reproduced first-hand. NOT a flake, NOT promote
machinery. Fix approach (M2): recipe PR making app.ini writable on the advance path — e.g. render the
config into the WRITABLE config:/etc/gitea volume via an entrypoint (not a read-only docker config),
OR ensure the persisted secrets are accepted without rewrite. (Secondary harness option: canonical
advance falls back to clean re-deploy when in-place config rewrite is impossible — but that loses the
reattach data-warm property; recipe fix preferred.) Ties to LFS PR #1 (app.ini secret handling).
ACTION NEEDED after run exits: warm-gitea is left crash-looping at 3.6.0 → restore it to 3.5.3 (redeploy the known-good canonical version) so the canonical is healthy again. Data volume intact.
2026-06-18T00:25Z — M1 CLAIMED (6/6 investigated, isolated, classified)
mumble repeat #2 (mumble2): ALL tiers green again incl. handshake; canonical re-promoted green (ts 20260618T001730Z). So mumble = 2× reproducibly green in isolation → load/timing FLAKE confirmed.
All six classified with first-hand isolation evidence (or code proof for keycloak). Two canon
root-causes were CORRECTED by isolation: discourse (not a timeout/deploy-FATA — it's a stale cc-ci
overlay test asserting an unreleased migration) and mattermost-lts (not a loaded-node race — a
deterministic recipe restore defect: missing backupbot.restore.post-hook). bluesky's 000 is NOT a
load/rate-limit flake (my initial hypothesis) but a deterministic caddy↔app app-alias DNS collision
on the shared proxy. gitea app.ini read-only JWT crash reproduced first-hand. keycloak collision proven
structurally in code.
Node clean: warm-gitea idle@3.5.3 (volumes retained), orphaned warm-bluesky removed, only live warm-keycloak up (healthy 200). Claiming M1; will start M2 fix design while awaiting the Adversary verdict (keep an unblocked item in hand).
2026-06-18T00:25Z — M2 prep (gated on M1 PASS): bluesky fix refinement
While parked at the M1 gate (no node deploys — Adversary cold-verifying), refined the bluesky fix:
cc-ci's bluesky tests probe via HTTP (/xrpc/_health), but the GENERIC harness defaults to
service="app" (deployed_identity/_app_container). So RENAMING the recipe's app service → pds
could break generic harness assumptions. Cleaner fix: keep the service named app but give it a
UNIQUE network ALIAS on the internal net (e.g. aliases: [pds-internal]) and point caddy at
pds-internal:3000 (reverse_proxy + on_demand_tls ask). A unique alias has no collision on the shared
proxy (only the bare app alias collides), and the service name stays app → zero cc-ci-side
breakage. Will validate this exact approach in M2 after M1 PASS.
2026-06-18T01:21Z — M1 PASS; starting M2
Adversary M1 verdict: PASS @01:18Z — all 6 classifications cold-verified CORRECT by its OWN isolation re-runs (discourse/mattermost/mumble/bluesky/gitea) + code-verify (keycloak). No VETO. "Builder cleared to proceed to M2." Two canon root-causes corrected and confirmed (discourse: not a timeout, stale overlay; mattermost: not a load race, recipe defect). bluesky reclassification (recipe, not warm-machinery) confirmed against the plan's prior.
Starting M2. Plan: recipe PRs (mattermost-lts, bluesky-pds, gitea) via the recipe mirror+PR flow
(!testme-verified, never merge); harness fixes (keycloak collision-free canonical_domain + enroll;
mumble handshake stabilization) on a cc-ci branch; discourse overlay-scope decision. Node now mine
(Adversary done). Will examine the recipe-create-pr flow first, then execute one fix at a time.
2026-06-18T01:25Z — M2 recon: prior-phase fix PRs already exist for discourse + mattermost
Surveyed open PRs on all 6 mirrors before doing redundant work:
- discourse #4
discourse-official-image("switch to official discourse/discourse"): created 2026-06-16 by autonomic-bot; !testme PASSED twice, latest @53ba0910 today 16:36Z (run #849) ✅. This migrates off deprecated bitnamilegacy → official image + drops sidekiq = EXACTLY what the upgrade overlay asserts. So the overlay test was correctly demanding the migration; PR #4 IS the discourse fix and is already !testme-green. (Reframes M1 "stale test": the test is right; the release tag predates the migration; the fix is the migration PR, not weakening the test.) - mattermost-lts #1
ci/pg-restore("reimport the postgres dump on restore"): correct immich-pattern fix — pg_backup.sh (backup pg_dump|gzip; restore: terminate conns + DROP DATABASE WITH FORCE + createdb + reimport) + dump-onlybackup.volumes.postgres_data.path: backup.sql+restore.post-hook: /pg_backup.sh restore. Created 2026-05-30; needs a fresh !testme to confirm green NOW. (Also PR #2 upgrade-2.1.11 overlaps — adds restore hook + version bump; #1 is the focused fix.) - mumble #1 = "cfold sweep probe" (not the fix — mumble is a harness flake, no recipe PR needed).
- bluesky #3 = version bump (not the routing fix — need a NEW PR for the app-alias collision).
- gitea, keycloak = no open PRs (gitea LFS #1 closed; keycloak is a harness fix).
M2 plan refined: VERIFY discourse #4 (re-!testme fresh) + mattermost #1 (!testme); CREATE recipe PRs for bluesky (unique alias) + gitea (app.ini writable); HARNESS fixes for mumble (handshake stab) + keycloak (collision-free canonical_domain + enroll). Starting with mattermost #1 !testme.
2026-06-18T01:30Z — M2: mattermost-lts FIXED (verified) + discourse already green + bluesky PR created
- mattermost-lts: !testme on PR #1
ci/pg-restore(@4ca7f418) → run #901 ALL tiers green (install/upgrade/backup/restore/custom, every junit failures=0 skipped=0). The M1-failingrestore__cc-ci__test_restore.py::test_restore_returns_statenow PASSES — the pg_backup.sh restore post-hook (terminate conns + DROP DATABASE WITH FORCE + createdb + reimport dump) round-trips postgres state. FIXED + verified. (Nothing merged — operator merges.) - discourse: PR #4
discourse-official-imagealready !testme-green @53ba0910 (run #849, today 16:36Z) — the official-image migration makes the upgrade overlay pass. Will re-verify fresh for current evidence before the M2 claim. - bluesky-pds: created mirror PR #4
ci/warm-routing-alias(uniquepdsalias on internal + caddy reverse_proxy/ask → pds:3000; service staysapp). compose validated (docker compose configrc=0). VERIFICATION NOTE: bluesky's 000 is warm-promote-only (cold path always green), so !testme (cold) won't reproduce/verify it — I'll verify by running the FIXED recipe through the promote path (cold-on-latest with the fix checked out) → warm-bluesky-pds should serve 200 (vs M1's 000), then tear down the phantom canonical.
Remaining M2: bluesky promote-verify, gitea recipe PR (app.ini writable), keycloak harness (collision-free canonical_domain + enroll), mumble harness (handshake stabilization).