Files
cc-ci/machine-docs/JOURNAL-redfix.md
autonomic-bot 191ddc9fb8
Some checks failed
continuous-integration/drone/push Build is failing
status(redfix): M1 PASS (Adversary cold-verified all 6 classifications CORRECT); begin M2 fixes
2026-06-18 01:20:15 +00:00

271 lines
19 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# JOURNAL — phase `redfix`
## 2026-06-17T23:20Z — Bootstrap
Read phase plan + plan.md §6.1/§7/§9 + canon DECISIONS exceptions (lines ~14941552). Six
canon-sweep failures to investigate. Confirmed cc-ci access, no run in flight, sweep timer next
fires 2026-06-21 (3-day window), disk 38G free.
Isolation mechanism understood: `runner/nightly_sweep.run_on_tag` = `abra.recipe_checkout(r, tag)` +
`run_recipe_ci.py RECIPE=<r> CCCI_SKIP_FETCH=1` cold/full. I reproduce each failure by running ONE
recipe at a time with no concurrent load.
Starting canonical state notable: **mumble canonical IS present** (`1.0.0+v1.6.870-0`, written
20260617T180501Z — during today's nixenv sweep). The canon DECISIONS recorded mumble RED
(`test_handshake_completes_with_channel_presence`). A canonical only gets written on a GREEN cold run
on latest → mumble flipped green in a recent run. Strong early evidence for the operator's "mumble
passed before" → load flake hypothesis. Must confirm with a clean isolation re-run + check whether the
canon-sweep red was under concurrent load.
Next: start M1 investigation. Plan order (cheap/informative first): triage the existing sweep logs on
cc-ci to pin the EXACT assertion/error for each (mumble, mattermost-lts restore, gitea app.ini,
bluesky routing, discourse compose), then run isolation re-runs. discourse's recorded cause is an
UPSTREAM compose defect (`sidekiq.depends_on: discourse` while service is `app`) that FATAs before any
deploy — that's deterministic, not a load timeout, so it may not even need a long isolation run to
confirm; verify the compose at the latest tag directly first.
## 2026-06-17T23:40Z — M1: discourse isolation run — CANON ROOT-CAUSE WAS WRONG
Ran discourse ALONE on cc-ci (`recipe_checkout discourse 0.8.1+3.5.0` + `RECIPE=discourse
CCCI_SKIP_FETCH=1 cc-ci-run runner/run_recipe_ci.py`, log `/tmp/redfix-discourse.log`).
RESULT: **install PASS, upgrade FAIL, backup PASS, restore PASS, custom PASS** — the recipe deploys,
serves (200 /srv/status), backs up and restores cleanly. NOT a deploy timeout, NOT a 51-min wedge,
NOT a deploy FATA. The canon DECISIONS root-cause ("`abra app deploy` FATAs: service sidekiq depends
on undefined service discourse → invalid compose project") is **misattributed**: that string appears
ONLY from the non-fatal prepull `docker compose config --images` (rc=15, harness logs "skipping
(deploy will pull as usual)"). The real `abra app deploy` is a swarm `docker stack deploy`, which
ignores `depends_on` entirely → the stack converges (`UpdateStatus=completed`).
The ONLY failure is the cc-ci upgrade OVERLAY `tests/discourse/test_upgrade.py`:
- `test_head_runs_official_image_not_bitnamilegacy` — app image is `bitnamilegacy/discourse:3.5.0`;
test demands `discourse/discourse:3.5.3` (official).
- `test_sidekiq_service_dropped_by_head` — services `['app','db','redis','sidekiq']`; test demands
sidekiq dropped.
These `prevb`-phase overlay tests are PR-FAITHFULNESS assertions for a specific migration PR
(bitnamilegacy → official `discourse/discourse:3.5.3`, drop sidekiq). Verified that migration exists
in **NO upstream release tag and NOT in main**`git show main:compose.yml` and every tag
(`0.1.0…0.8.1+3.5.0`) all use `bitnamilegacy/discourse:3.5.0` + sidekiq. So the overlay asserts a
state that doesn't exist anywhere upstream → deterministic RED whenever the sweep tests the latest
release tag. The head DID deploy (chaos-version label = head f87c612d+U, converged) — the test
expectation is simply wrong for the released recipe.
Note (M2 design): migrating discourse from the deprecated `bitnamilegacy` image to official
`discourse/discourse` is a MAJOR recipe rewrite (different fs layout, entrypoint, no `/opt/bitnami`
sidekiq run.sh) — not a 1-line image swap. So the overlay test's `discourse/discourse:3.5.3`
expectation may not be a realistic near-term recipe change. The bitnamilegacy deprecation is real
(bitnami sunset legacy images), so a migration is the right long-term direction, but the test as
written hard-codes a migration target absent upstream. Classification + fix approach to settle in M1
table / M2.
Classification: **stale/PR-specific cc-ci OVERLAY test mismatched to the canonical-sweep context**
(NOT a flake, NOT a load timeout, NOT a recipe-deploy defect, NOT warm-machinery). Teardown clean (no
discourse stack left). Evidence: `/tmp/redfix-discourse.log` on cc-ci; junit under
`/var/lib/cc-ci-runs/manual/junit/upgrade__cc-ci__test_upgrade.xml`.
## 2026-06-18T00:05Z — M1: mattermost-lts isolation run — DETERMINISTIC restore failure (recipe defect)
Ran mattermost-lts ALONE (tag 2.1.9+10.11.15, log /tmp/redfix-mattermost-lts.log).
RESULT: **install/upgrade/backup/custom PASS, restore FAIL** — identical to the canon failure:
`tests/mattermost-lts/test_restore.py::test_restore_returns_state``relation "ci_marker" does not
exist` after restore. So it is **deterministic in isolation, NOT a loaded-node race** (canon framing
was wrong). The marker logic is sound (postgres table seeded pre-backup, dropped pre-restore, asserted
post-restore — same pattern immich uses and PASSES).
ROOT CAUSE (recipe backup/restore labels). Compared mattermost-lts vs immich (immich passes the
IDENTICAL test):
- immich `database` svc: `backupbot.backup.pre-hook: /pg_backup.sh backup`,
`backupbot.backup.volumes.postgres.path: backup.sql` (backs up ONLY the dump file), and
**`backupbot.restore.post-hook: /pg_backup.sh restore`** (replays the dump on restore). → round-trips.
- mattermost-lts `postgres` svc: `pre-hook: pg_dump > /var/lib/postgresql/data/postgres-backup.sql`,
`backup.path: /var/lib/postgresql/data/` (backs up the WHOLE live/hot PGDATA dir + the dump),
`post-hook: rm .../postgres-backup.sql`, and **NO `backupbot.restore.post-hook`**. So on restore,
abra restores the files but NOTHING replays the dump, and a hot-copied live PGDATA over a running
postgres does not reload → `ci_marker` lost. Restore log confirms `Restoring Snapshot b0495d36 at /`
with no post-hook reimport.
Classification: **GENUINE RECIPE DEFECT at latest** (postgres backup/restore does not round-trip —
missing restore post-hook + backs up hot PGDATA instead of dump-only). NOT a flake, NOT cc-ci test
weakening (test is correct & unmodified; immich proves the pattern works). Fix (M2) = recipe PR
adopting the immich-style postgres backup/restore (a `/pg_backup.sh`-style dump + restore post-hook).
Teardown clean (no matt stack). Evidence: /tmp/redfix-mattermost-lts.log; junit
restore__cc-ci__test_restore.xml.
Tooling note: my background "waiter" loop `while pgrep -f run_recipe_ci.py` self-matched (its own
cmdline contains the string) → never exited, falsely showed a run active. Use `pgrep -f
"[r]un_recipe_ci.py"` or match the python invocation. Killed the stuck waiters; node confirmed free.
## 2026-06-18T00:18Z — M1: mumble isolation run — GREEN (flake confirmed)
Ran mumble ALONE (tag 1.0.0+v1.6.870-0, log /tmp/redfix-mumble.log). RESULT: **ALL tiers PASS**
(install/upgrade/backup/restore/custom), including `custom/test_protocol_handshake.py::
test_handshake_completes_with_channel_presence` PASSED. No orphan stacks. The canon sweep recorded
this RED (`test_handshake…` failed under concurrent sweep load); it is GREEN here in isolation, and
its canonical was already written green TODAY (1.0.0+v1.6.870-0 @20260617T180501Z) under the lighter
nixenv sweep. → **load/timing FLAKE** on the control-channel handshake, NOT a recipe defect.
The handshake test already retries (`retry_handshake(attempts=12, interval=5.0)` = 60s). So the flake
is the voice server not completing the TLS+ServerSync handshake within ~60s under heavy concurrent
node load (deploy contention). M2 fix = harness stabilization (stronger readiness gate before the
custom tier / longer-or-smarter retry / serialize), based on the load failure mode. Classification:
**FLAKE (load/concurrency)** → harness stabilization.
Reproducibility: 1 green isolation run here + canonical green today + documented red under canon load.
Will do 12 more isolation repeats before the M1 claim to firm "reproducibly green in isolation."
## 2026-06-18T00:45Z — M1: bluesky-pds isolation run — 000 REPRODUCES; root cause = `app` DNS collision on shared proxy
Ran bluesky-pds ALONE (tag 0.3.0+v0.4.219, log /tmp/redfix-bluesky-pds.log). Cold lifecycle GREEN
(install/backup/restore/custom pass; upgrade EXPECTED_NA per recipe_meta — moving pds:0.4 tag). Then
WC5 promote-on-green-cold FAILED exactly as canon: `warm-bluesky-pds.ci.commoninternet.net: not
healthy over HTTPS /xrpc/_health (last status 0)`. So **the 000 reproduces deterministically in
isolation — NOT a sweep-load/ACME-rate-limit flake** (my first hypothesis, refuted).
LIVE DIAGNOSIS (stack left deployed by the failed promote; probed before teardown):
- app service 1/1, healthy: `docker exec app wget localhost:3000/xrpc/_health``{"version":"0.4.219"}`;
app listens on `:::3000`; no restarts. So the PDS itself is fine.
- HTTPS to warm domain → 000. caddy logs flood:
`tls "failed to get permission for on-demand certificate" domain=warm-bluesky-pds…
error=… Get "http://app:3000/tls-check?domain=…": dial tcp 10.10.0.X:3000: connect: connection refused`
(X varies: .2 .4 .5 .6 .8 .9 .10 .12).
- bluesky uses caddy **on-demand TLS** (Caddyfile: `on_demand_tls { ask http://app:3000/tls-check }`,
`tls { on_demand }`, `reverse_proxy app:3000`). caddy must reach app:3000/tls-check to be GRANTED a
cert before serving TLS. It can't → no cert → TLS handshake fails → 000.
- WHY can't caddy reach app: **service-name `app` collision on the shared `proxy` overlay.**
- app is on `warm-bluesky-pds…_internal` ONLY (IP 10.0.3.3). caddy is on `proxy` (10.10.50.223) +
`…_internal` (10.0.3.6).
- `docker exec caddy getent hosts app` → returns ONLY proxy IPs (8/8 tries: 10.10.0.4/.5/.6/.10/.12),
**NEVER the internal 10.0.3.3.** The proxy-net `app` alias shadows bluesky's own internal app.
- `docker network inspect proxy` shows EVERY stack aliases its main service `app`:
`drone…_app=10.10.0.2`, `traefik…_app=10.10.0.5`, `warm-keycloak…_app=10.10.0.9`,
`ccci-reports/bridge/dashboard_app`, … — exactly the IPs caddy hits. None listens a PDS on 3000 →
connection refused.
So caddy resolves bare `app` to OTHER stacks' app endpoints on the shared proxy, never its own PDS.
WHY cold passes / warm fails: cold's health window is long (HTTP_TIMEOUT=600) and on first success
caddy CACHES the issued cert; the promote's shorter health window doesn't give caddy a chance to ever
resolve correctly (and here it provably never resolves to 10.0.3.3 at all). The collision is the root
cause; the promote machinery is CORRECT (it refused to write a canonical for an unhealthy 000 — no
canonical.json written, verified).
Classification: **genuine ROUTING/recipe defect — caddy↔app cross-stack `app`-alias collision on the
shared proxy net**, deterministic, reproducible in isolation. NOT a flake; NOT a promote-machinery bug.
Fix approach (M2): recipe PR giving the PDS service a UNIQUE name/alias (e.g. rename `app``pds`) so
caddy's `reverse_proxy`/`tls-check` resolve only bluesky's own internal service (no shared-proxy `app`
collision). (Alternatively a caddy-side internal-only resolution; renaming is cleanest.) Will confirm
the exact fix in M2 + verify the warm domain then serves 200.
Cleanup: removed orphaned warm-bluesky-pds stack + its volumes/secrets (promote had left it deployed;
no canonical written). Node clean.
## 2026-06-18T01:05Z — M1: keycloak — warm-domain namespace collision (harness), classification complete
keycloak was de-enrolled (WARM_CANONICAL=False) because its data-warm canonical domain would collide
with the LIVE-warm OIDC provider. Verified the collision STRUCTURALLY (code, no run needed):
- `canonical.canonical_domain(r)``warm.stable_domain(r)``f"warm-{r}.ci.commoninternet.net"`
(runner/harness/canonical.py:42-44, warm.py:44-48).
- `warm.WARM_DOMAINS["keycloak"] = "warm-keycloak.ci.commoninternet.net"` (warm.py:27-29) — the
always-on shared OIDC provider lasuite-*/drone consume for SSO; kept current by roll_warm_infra.
- So `canonical_domain("keycloak") == WARM_DOMAINS["keycloak"]` EXACTLY. Enrolling keycloak as a
data-warm canonical → the sweep's promote deploy/teardown at warm-keycloak collides with the live
provider. Confirmed live keycloak healthy (200 /realms/master) — I did not disturb it.
The collision is unique to keycloak: it is the ONLY recipe that is both a live-warm provider (in
WARM_DOMAINS) AND would want a canonical. No collision-free canonical namespace exists today.
Classification: **HARNESS defect — warm canonical domain namespace can collide with a live-warm
provider.** NOT a recipe/flake. Fix approach (M2): make `canonical_domain(r)` collision-free when `r`
is a live-warm provider — e.g. `warm-canon-<r>` (or unconditionally) so the canonical deploy gets a
distinct domain → distinct stack → cannot touch the live `warm-keycloak`. Then set keycloak
WARM_CANONICAL=True and verify it promotes at the collision-free domain WITHOUT disrupting live
keycloak. Minimal blast radius: special-case only providers in WARM_DOMAINS (the 15 other canonicals
keep `warm-<r>`); confirm in M2.
## 2026-06-18T01:05Z — M1: gitea first advance attempt hit a LEFTOVER confound (not the real crash)
First gitea cold@3.6.0 run: cold lifecycle (install/upgrade/backup/restore/custom) ALL PASS; promote
advance FAILED with `FATA warm-gitea.ci.commoninternet.net is already deployed` — NOT the app.ini
crash. Cause: warm-gitea was left DEPLOYED at 3.5.3 by the nixenv-phase sweep (registry said
status=idle but the stack was actually running — a state inconsistency). The advance does `abra app
deploy warm-gitea` assuming the canonical is idle/undeployed; finding it deployed, abra FATAs. This is
the same GREEN-BUT-PROMOTE-FAILED the nixenv phase saw. To reproduce the REAL app.ini issue I undeployed
warm-gitea (docker stack rm; retained data+config volumes → proper idle state) and re-ran gitea
cold@3.6.0 (gitea2). Result pending. NOTE: the "already deployed" promote-failure-when-left-deployed
may be a secondary promote-machinery robustness gap (advance should undeploy-or-chaos an
already-deployed canonical) — will assess after confirming the primary app.ini crash.
## 2026-06-18T00:14Z — M1: gitea warm advance — app.ini read-only JWT crash CONFIRMED (recipe defect)
After restoring warm-gitea to proper idle state (undeployed, 3.5.3 data+config volumes retained),
re-ran gitea cold@3.6.0 (gitea2, log /tmp/redfix-gitea2.log). Cold lifecycle ALL PASS
(install/upgrade/backup/restore/custom — incl. the cold FRESH 3.5.3→3.6.0 upgrade tier). WC5 promote
advance then crash-loops. Live container logs (warm-gitea_..._app, repeated Failed/exit 1):
modules/setting/setting.go:105:LoadCommonSettings() [F] Unable to load settings from config:
error saving JWT Secret for custom config: failed to save "/etc/gitea/app.ini":
open /etc/gitea/app.ini: read-only file system
EXACTLY the canon-documented crash. Mechanism: the recipe mounts app.ini as a docker `config`
(read-only by design) at /etc/gitea/app.ini (compose `configs: - source: app_ini target:
/etc/gitea/app.ini`, app.ini.tmpl). gitea 1.24.2 (3.6.0), on the warm REATTACH of the retained
3.5.3 config volume, decides to (re)generate+SAVE a JWT secret to app.ini → read-only fs → FATA at
config-load, BEFORE any DB migration (so the 3.5.3 data volume stays intact — confirmed canon).
Why cold passes but warm crashes: the cold fresh deploy + cold chaos-upgrade use freshly-generated
secrets consistent with a freshly-initialized config, so gitea never needs to rewrite app.ini. The
warm advance reattaches an OLDER retained config-volume state (seeded under 3.5.3) against the new
run's secrets/3.6.0 binary → gitea reconciles by trying to persist a JWT secret → read-only crash.
Classification: **genuine RECIPE defect** (gitea 3.6.0/1.24.2 + read-only app.ini docker-config mount
on the warm-reattach advance), deterministic, reproduced first-hand. NOT a flake, NOT promote
machinery. Fix approach (M2): recipe PR making app.ini writable on the advance path — e.g. render the
config into the WRITABLE `config:/etc/gitea` volume via an entrypoint (not a read-only docker config),
OR ensure the persisted secrets are accepted without rewrite. (Secondary harness option: canonical
advance falls back to clean re-deploy when in-place config rewrite is impossible — but that loses the
reattach data-warm property; recipe fix preferred.) Ties to LFS PR #1 (app.ini secret handling).
ACTION NEEDED after run exits: warm-gitea is left crash-looping at 3.6.0 → restore it to 3.5.3
(redeploy the known-good canonical version) so the canonical is healthy again. Data volume intact.
## 2026-06-18T00:25Z — M1 CLAIMED (6/6 investigated, isolated, classified)
mumble repeat #2 (mumble2): ALL tiers green again incl. handshake; canonical re-promoted green
(ts 20260618T001730Z). So mumble = 2× reproducibly green in isolation → load/timing FLAKE confirmed.
All six classified with first-hand isolation evidence (or code proof for keycloak). Two canon
root-causes were CORRECTED by isolation: discourse (not a timeout/deploy-FATA — it's a stale cc-ci
overlay test asserting an unreleased migration) and mattermost-lts (not a loaded-node race — a
deterministic recipe restore defect: missing `backupbot.restore.post-hook`). bluesky's 000 is NOT a
load/rate-limit flake (my initial hypothesis) but a deterministic caddy↔app `app`-alias DNS collision
on the shared proxy. gitea app.ini read-only JWT crash reproduced first-hand. keycloak collision proven
structurally in code.
Node clean: warm-gitea idle@3.5.3 (volumes retained), orphaned warm-bluesky removed, only live
warm-keycloak up (healthy 200). Claiming M1; will start M2 fix design while awaiting the Adversary
verdict (keep an unblocked item in hand).
## 2026-06-18T00:25Z — M2 prep (gated on M1 PASS): bluesky fix refinement
While parked at the M1 gate (no node deploys — Adversary cold-verifying), refined the bluesky fix:
cc-ci's bluesky tests probe via HTTP (/xrpc/_health), but the GENERIC harness defaults to
`service="app"` (deployed_identity/_app_container). So RENAMING the recipe's `app` service → `pds`
could break generic harness assumptions. Cleaner fix: keep the service named `app` but give it a
UNIQUE network ALIAS on the internal net (e.g. `aliases: [pds-internal]`) and point caddy at
`pds-internal:3000` (reverse_proxy + on_demand_tls ask). A unique alias has no collision on the shared
proxy (only the bare `app` alias collides), and the service name stays `app` → zero cc-ci-side
breakage. Will validate this exact approach in M2 after M1 PASS.
## 2026-06-18T01:21Z — M1 PASS; starting M2
Adversary M1 verdict: **PASS** @01:18Z — all 6 classifications cold-verified CORRECT by its OWN
isolation re-runs (discourse/mattermost/mumble/bluesky/gitea) + code-verify (keycloak). No VETO.
"Builder cleared to proceed to M2." Two canon root-causes corrected and confirmed (discourse: not a
timeout, stale overlay; mattermost: not a load race, recipe defect). bluesky reclassification (recipe,
not warm-machinery) confirmed against the plan's prior.
Starting M2. Plan: recipe PRs (mattermost-lts, bluesky-pds, gitea) via the recipe mirror+PR flow
(`!testme`-verified, never merge); harness fixes (keycloak collision-free canonical_domain + enroll;
mumble handshake stabilization) on a cc-ci branch; discourse overlay-scope decision. Node now mine
(Adversary done). Will examine the recipe-create-pr flow first, then execute one fix at a time.