From f8ba0c3a1fb9d7f32f5fe21d87951d22c8d4b9a0 Mon Sep 17 00:00:00 2001 From: autonomic-bot Date: Thu, 18 Jun 2026 00:02:26 +0000 Subject: [PATCH] =?UTF-8?q?journal(redfix):=20M1=20bluesky-pds=20=E2=80=94?= =?UTF-8?q?=20000=20reproduces=20deterministically;=20root=20cause=20=3D?= =?UTF-8?q?=20caddy=E2=86=94app=20cross-stack=20'app'=20alias=20collision?= =?UTF-8?q?=20on=20shared=20proxy=20(recipe=20defect)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- machine-docs/JOURNAL-redfix.md | 45 ++++++++++++++++++++++++++++++++++ machine-docs/STATUS-redfix.md | 6 ++--- 2 files changed, 48 insertions(+), 3 deletions(-) diff --git a/machine-docs/JOURNAL-redfix.md b/machine-docs/JOURNAL-redfix.md index 86b3c49..7b12ff4 100644 --- a/machine-docs/JOURNAL-redfix.md +++ b/machine-docs/JOURNAL-redfix.md @@ -113,3 +113,48 @@ custom tier / longer-or-smarter retry / serialize), based on the load failure mo Reproducibility: 1 green isolation run here + canonical green today + documented red under canon load. Will do 1–2 more isolation repeats before the M1 claim to firm "reproducibly green in isolation." + +## 2026-06-18T00:45Z — M1: bluesky-pds isolation run — 000 REPRODUCES; root cause = `app` DNS collision on shared proxy + +Ran bluesky-pds ALONE (tag 0.3.0+v0.4.219, log /tmp/redfix-bluesky-pds.log). Cold lifecycle GREEN +(install/backup/restore/custom pass; upgrade EXPECTED_NA per recipe_meta — moving pds:0.4 tag). Then +WC5 promote-on-green-cold FAILED exactly as canon: `warm-bluesky-pds.ci.commoninternet.net: not +healthy over HTTPS /xrpc/_health (last status 0)`. So **the 000 reproduces deterministically in +isolation — NOT a sweep-load/ACME-rate-limit flake** (my first hypothesis, refuted). + +LIVE DIAGNOSIS (stack left deployed by the failed promote; probed before teardown): +- app service 1/1, healthy: `docker exec app wget localhost:3000/xrpc/_health` → `{"version":"0.4.219"}`; + app listens on `:::3000`; no restarts. So the PDS itself is fine. +- HTTPS to warm domain → 000. caddy logs flood: + `tls "failed to get permission for on-demand certificate" domain=warm-bluesky-pds… + error=… Get "http://app:3000/tls-check?domain=…": dial tcp 10.10.0.X:3000: connect: connection refused` + (X varies: .2 .4 .5 .6 .8 .9 .10 .12). +- bluesky uses caddy **on-demand TLS** (Caddyfile: `on_demand_tls { ask http://app:3000/tls-check }`, + `tls { on_demand }`, `reverse_proxy app:3000`). caddy must reach app:3000/tls-check to be GRANTED a + cert before serving TLS. It can't → no cert → TLS handshake fails → 000. +- WHY can't caddy reach app: **service-name `app` collision on the shared `proxy` overlay.** + - app is on `warm-bluesky-pds…_internal` ONLY (IP 10.0.3.3). caddy is on `proxy` (10.10.50.223) + + `…_internal` (10.0.3.6). + - `docker exec caddy getent hosts app` → returns ONLY proxy IPs (8/8 tries: 10.10.0.4/.5/.6/.10/.12), + **NEVER the internal 10.0.3.3.** The proxy-net `app` alias shadows bluesky's own internal app. + - `docker network inspect proxy` shows EVERY stack aliases its main service `app`: + `drone…_app=10.10.0.2`, `traefik…_app=10.10.0.5`, `warm-keycloak…_app=10.10.0.9`, + `ccci-reports/bridge/dashboard_app`, … — exactly the IPs caddy hits. None listens a PDS on 3000 → + connection refused. + So caddy resolves bare `app` to OTHER stacks' app endpoints on the shared proxy, never its own PDS. + +WHY cold passes / warm fails: cold's health window is long (HTTP_TIMEOUT=600) and on first success +caddy CACHES the issued cert; the promote's shorter health window doesn't give caddy a chance to ever +resolve correctly (and here it provably never resolves to 10.0.3.3 at all). The collision is the root +cause; the promote machinery is CORRECT (it refused to write a canonical for an unhealthy 000 — no +canonical.json written, verified). + +Classification: **genuine ROUTING/recipe defect — caddy↔app cross-stack `app`-alias collision on the +shared proxy net**, deterministic, reproducible in isolation. NOT a flake; NOT a promote-machinery bug. +Fix approach (M2): recipe PR giving the PDS service a UNIQUE name/alias (e.g. rename `app`→`pds`) so +caddy's `reverse_proxy`/`tls-check` resolve only bluesky's own internal service (no shared-proxy `app` +collision). (Alternatively a caddy-side internal-only resolution; renaming is cleanest.) Will confirm +the exact fix in M2 + verify the warm domain then serves 200. + +Cleanup: removed orphaned warm-bluesky-pds stack + its volumes/secrets (promote had left it deployed; +no canonical written). Node clean. diff --git a/machine-docs/STATUS-redfix.md b/machine-docs/STATUS-redfix.md index b669cf9..02c4595 100644 --- a/machine-docs/STATUS-redfix.md +++ b/machine-docs/STATUS-redfix.md @@ -38,9 +38,9 @@ flake source per phase plan §2.1). Runs execute on cc-ci from `/etc/cc-ci`. | discourse | DONE @23:40Z (`/tmp/redfix-discourse.log` on cc-ci) | install/backup/restore/custom PASS; **upgrade overlay FAIL**. Deploys+serves fine — NOT a timeout/FATA. | cc-ci overlay `tests/discourse/test_upgrade.py` asserts head runs official `discourse/discourse:3.5.3` + drops sidekiq; latest tag `0.8.1+3.5.0` AND main both still `bitnamilegacy/discourse:3.5.0`+sidekiq (migration exists in no release/main). The `depends_on discourse` string is a non-fatal prepull-only warning, not the deploy. | **stale/PR-specific cc-ci OVERLAY test** mismatched to canonical-sweep context (not flake/timeout/recipe-deploy/warm-machinery) | | mattermost-lts | DONE @00:05Z (`/tmp/redfix-mattermost-lts.log`) | install/upgrade/backup/custom PASS; **restore FAIL** `ci_marker does not exist` — **deterministic in isolation** (not a load race) | recipe `postgres` svc backup labels: backs up hot live PGDATA + dump but has **NO `backupbot.restore.post-hook`** to replay the dump → restore doesn't round-trip postgres. Contrast immich (passes): dump-only `backup.volumes.postgres.path: backup.sql` + `restore.post-hook: /pg_backup.sh restore`. | **genuine RECIPE defect** at latest → recipe PR (adopt immich-style dump+restore-post-hook) | | mumble | DONE @00:18Z (`/tmp/redfix-mumble.log`) | **ALL tiers PASS** incl. handshake; no orphans. Canon red under load; canonical written green TODAY | handshake (TLS+ServerSync) not completing within ~60s retry under heavy concurrent sweep load; fine in isolation | **load/timing FLAKE** → harness stabilization (readiness gate / retry). (1 isolation green; will repeat 1-2× before M1 claim) | -| bluesky-pds | running (isolation; promote at end) | — | — | — | -| gitea | pending | — | — | — | -| keycloak | pending | — | — | — | +| bluesky-pds | DONE @00:45Z (`/tmp/redfix-bluesky-pds.log` + live diag) | cold lifecycle GREEN; **WC5 promote 000** reproduces (warm /xrpc/_health last status 0). NOT a flake | caddy on-demand TLS (`ask http://app:3000/tls-check`) can't reach app: caddy resolves bare `app` to OTHER stacks' app endpoints on shared `proxy` net (getent app→only 10.10.0.X, never internal 10.0.3.3; proxy has drone/traefik/keycloak/ccci `app` aliases) → no cert → 000. Promote machinery correct (refused to write canonical). | **genuine routing/RECIPE defect** (cross-stack `app`-alias collision on shared proxy) → recipe PR: unique PDS service name/alias. NOT promote-machinery, NOT flake | +| gitea | running (isolation; warm 3.6.0 advance) | — | — | — | +| keycloak | pending (mostly design; verify collision) | — | — | — | Gate: M1 not yet claimed.