From 8a54c4d0ea75ea0a00c5bc091453d8d554536c65 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Thu, 18 Jun 2026 00:08:25 +0000
Subject: [PATCH] journal(redfix): M1 keycloak (harness warm-domain collision, design-complete) + gitea first-run already-deployed confound

---
 machine-docs/JOURNAL-redfix.md | 36 ++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/machine-docs/JOURNAL-redfix.md b/machine-docs/JOURNAL-redfix.md
index 7b12ff4..24c44ad 100644
--- a/machine-docs/JOURNAL-redfix.md
+++ b/machine-docs/JOURNAL-redfix.md
@@ -158,3 +158,39 @@
 the exact fix in M2 + verify the warm domain then serves 200.
 Cleanup: removed orphaned warm-bluesky-pds stack + its volumes/secrets (promote had left it
 deployed; no canonical written). Node clean.
+
+## 2026-06-18T01:05Z — M1: keycloak — warm-domain namespace collision (harness), classification complete
+
+keycloak was de-enrolled (WARM_CANONICAL=False) because its data-warm canonical domain would collide
+with the LIVE-warm OIDC provider. Verified the collision STRUCTURALLY (code, no run needed):
+- `canonical.canonical_domain(r)` → `warm.stable_domain(r)` → `f"warm-{r}.ci.commoninternet.net"`
+  (runner/harness/canonical.py:42-44, warm.py:44-48).
+- `warm.WARM_DOMAINS["keycloak"] = "warm-keycloak.ci.commoninternet.net"` (warm.py:27-29) — the
+  always-on shared OIDC provider lasuite-*/drone consume for SSO; kept current by roll_warm_infra.
+- So `canonical_domain("keycloak") == WARM_DOMAINS["keycloak"]` EXACTLY. Enrolling keycloak as a
+  data-warm canonical → the sweep's promote deploy/teardown at warm-keycloak collides with the live
+  provider. Confirmed live keycloak healthy (200 /realms/master) — I did not disturb it.
+
+The collision is unique to keycloak: it is the ONLY recipe that is both a live-warm provider (in
+WARM_DOMAINS) AND would want a canonical. No collision-free canonical namespace exists today.
+
+Classification: **HARNESS defect — warm canonical domain namespace can collide with a live-warm
+provider.** NOT a recipe/flake. Fix approach (M2): make `canonical_domain(r)` collision-free when `r`
+is a live-warm provider — e.g. `warm-canon-` (or unconditionally) so the canonical deploy gets a
+distinct domain → distinct stack → cannot touch the live `warm-keycloak`. Then set keycloak
+WARM_CANONICAL=True and verify it promotes at the collision-free domain WITHOUT disrupting live
+keycloak. Minimal blast radius: special-case only providers in WARM_DOMAINS (the 15 other canonicals
+keep `warm-`); confirm in M2.
+
+## 2026-06-18T01:05Z — M1: gitea first advance attempt hit a LEFTOVER confound (not the real crash)
+
+First gitea cold@3.6.0 run: cold lifecycle (install/upgrade/backup/restore/custom) ALL PASS; promote
+advance FAILED with `FATA warm-gitea.ci.commoninternet.net is already deployed` — NOT the app.ini
+crash. Cause: warm-gitea was left DEPLOYED at 3.5.3 by the nixenv-phase sweep (registry said
+status=idle but the stack was actually running — a state inconsistency). The advance does `abra app
+deploy warm-gitea` assuming the canonical is idle/undeployed; finding it deployed, abra FATAs. This is
+the same GREEN-BUT-PROMOTE-FAILED the nixenv phase saw. To reproduce the REAL app.ini issue I undeployed
+warm-gitea (docker stack rm; retained data+config volumes → proper idle state) and re-ran gitea
+cold@3.6.0 (gitea2). Result pending. NOTE: the "already deployed" promote-failure-when-left-deployed
+may be a secondary promote-machinery robustness gap (advance should undeploy-or-chaos an
+already-deployed canonical) — will assess after confirming the primary app.ini crash.