journal(redfix): M1 bluesky-pds — 000 reproduces deterministically; root cause = caddy↔app cross-stack 'app' alias collision on shared proxy (recipe defect)
Some checks failed
continuous-integration/drone/push Build is failing
Some checks failed
continuous-integration/drone/push Build is failing
This commit is contained in:
@ -113,3 +113,48 @@ custom tier / longer-or-smarter retry / serialize), based on the load failure mo
|
||||
|
||||
Reproducibility: 1 green isolation run here + canonical green today + documented red under canon load.
|
||||
Will do 1–2 more isolation repeats before the M1 claim to firm "reproducibly green in isolation."
|
||||
|
||||
## 2026-06-18T00:45Z — M1: bluesky-pds isolation run — 000 REPRODUCES; root cause = `app` DNS collision on shared proxy
|
||||
|
||||
Ran bluesky-pds ALONE (tag 0.3.0+v0.4.219, log /tmp/redfix-bluesky-pds.log). Cold lifecycle GREEN
|
||||
(install/backup/restore/custom pass; upgrade EXPECTED_NA per recipe_meta — moving pds:0.4 tag). Then
|
||||
WC5 promote-on-green-cold FAILED exactly as canon: `warm-bluesky-pds.ci.commoninternet.net: not
|
||||
healthy over HTTPS /xrpc/_health (last status 0)`. So **the 000 reproduces deterministically in
|
||||
isolation — NOT a sweep-load/ACME-rate-limit flake** (my first hypothesis, refuted).
|
||||
|
||||
LIVE DIAGNOSIS (stack left deployed by the failed promote; probed before teardown):
|
||||
- app service 1/1, healthy: `docker exec app wget localhost:3000/xrpc/_health` → `{"version":"0.4.219"}`;
|
||||
app listens on `:::3000`; no restarts. So the PDS itself is fine.
|
||||
- HTTPS to warm domain → 000. caddy logs flood:
|
||||
`tls "failed to get permission for on-demand certificate" domain=warm-bluesky-pds…
|
||||
error=… Get "http://app:3000/tls-check?domain=…": dial tcp 10.10.0.X:3000: connect: connection refused`
|
||||
(X varies: .2 .4 .5 .6 .8 .9 .10 .12).
|
||||
- bluesky uses caddy **on-demand TLS** (Caddyfile: `on_demand_tls { ask http://app:3000/tls-check }`,
|
||||
`tls { on_demand }`, `reverse_proxy app:3000`). caddy must reach app:3000/tls-check to be GRANTED a
|
||||
cert before serving TLS. It can't → no cert → TLS handshake fails → 000.
|
||||
- WHY can't caddy reach app: **service-name `app` collision on the shared `proxy` overlay.**
|
||||
- app is on `warm-bluesky-pds…_internal` ONLY (IP 10.0.3.3). caddy is on `proxy` (10.10.50.223) +
|
||||
`…_internal` (10.0.3.6).
|
||||
- `docker exec caddy getent hosts app` → returns ONLY proxy IPs (8/8 tries: 10.10.0.4/.5/.6/.10/.12),
|
||||
**NEVER the internal 10.0.3.3.** The proxy-net `app` alias shadows bluesky's own internal app.
|
||||
- `docker network inspect proxy` shows EVERY stack aliases its main service `app`:
|
||||
`drone…_app=10.10.0.2`, `traefik…_app=10.10.0.5`, `warm-keycloak…_app=10.10.0.9`,
|
||||
`ccci-reports/bridge/dashboard_app`, … — exactly the IPs caddy hits. None listens a PDS on 3000 →
|
||||
connection refused.
|
||||
So caddy resolves bare `app` to OTHER stacks' app endpoints on the shared proxy, never its own PDS.
|
||||
|
||||
WHY cold passes / warm fails: cold's health window is long (HTTP_TIMEOUT=600) and on first success
|
||||
caddy CACHES the issued cert; the promote's shorter health window doesn't give caddy a chance to ever
|
||||
resolve correctly (and here it provably never resolves to 10.0.3.3 at all). The collision is the root
|
||||
cause; the promote machinery is CORRECT (it refused to write a canonical for an unhealthy 000 — no
|
||||
canonical.json written, verified).
|
||||
|
||||
Classification: **genuine ROUTING/recipe defect — caddy↔app cross-stack `app`-alias collision on the
|
||||
shared proxy net**, deterministic, reproducible in isolation. NOT a flake; NOT a promote-machinery bug.
|
||||
Fix approach (M2): recipe PR giving the PDS service a UNIQUE name/alias (e.g. rename `app`→`pds`) so
|
||||
caddy's `reverse_proxy`/`tls-check` resolve only bluesky's own internal service (no shared-proxy `app`
|
||||
collision). (Alternatively a caddy-side internal-only resolution; renaming is cleanest.) Will confirm
|
||||
the exact fix in M2 + verify the warm domain then serves 200.
|
||||
|
||||
Cleanup: removed orphaned warm-bluesky-pds stack + its volumes/secrets (promote had left it deployed;
|
||||
no canonical written). Node clean.
|
||||
|
||||
Reference in New Issue
Block a user