diff --git a/JOURNAL-1c.md b/JOURNAL-1c.md index b40a7fd..12ab025 100644 --- a/JOURNAL-1c.md +++ b/JOURNAL-1c.md @@ -227,3 +227,29 @@ the published repo now builds to izsmiajw==running — this is the form the Adve C4/W5 standard (Adversary dd710a6 == orchestrator guidance): keep DOMAIN=ci.commoninternet.net, verify TLS locally on the VM via `curl --resolve …:443:127.0.0.1` (SNI ci.commoninternet.net), served leaf fingerprint must == git cert leaf `57:8D:67:9E:…:B8:A6`; oneshots converge; only age key out-of-band. + +## 2026-05-27 — W4 Step B: throwaway rebuilt; concurrent-abra race found + fixed + +**Throwaway rebuild result (pre-fix config, clone @dd710a6):** `nixos-rebuild switch` BUILD succeeded +(2.8 G peak RAM < 4 GB, 11.5 min CPU) → toplevel **`izsmiajw…` == cc-ci's running system** (blank VM +reproduces cc-ci byte-for-byte from git + the bootstrap age key). **sops cert decrypted via the +RECOVERY key**: /var/lib/ci-certs/live/{fullchain,privkey}.pem → /run/secrets/*, sha256 `c1d96d61…` +(match). swarm-init + docker active (node Ready/Leader). BUT activation reported "error(s) while +switching": `deploy-proxy` + `deploy-drone` FAILED → system `degraded`. + +**Root cause:** the abra reconcilers (proxy/drone/bridge/dashboard/backupbot) are all +`wantedBy multi-user.target`; drone/bridge/dashboard were `after deploy-proxy` but **concurrent with +each other**, and backupbot concurrent with proxy. On a FRESH `~/.abra` they race on catalogue/recipe +init → fast failures. Confirmed: `abra recipe fetch traefik` works fine alone (rc=0); re-running the +oneshots **sequentially** (`systemctl restart deploy-proxy; …drone; …bridge; …dashboard; …backupbot`) +→ ALL success, system `running`, **0 failed, all 6 stacks 1/1** (traefik app+socket-proxy, drone, +bridge, dashboard, backups) — identical to cc-ci. + +**Fix (7563d47):** serialize the chain via ordering-only `after`: +proxy → drone → bridge → dashboard → backupbot (bridge after drone, dashboard after bridge, backupbot +after dashboard). So a single `nixos-rebuild switch` on a blank host converges with no concurrent abra. +New toplevel `ld19aj2…`. Deploying to cc-ci (reconcilers already deployed there ⇒ serial no-op +re-runs) + re-verify byte-identical, then **recreate the throwaway FRESH** to prove single-switch +convergence (authoritative C4; mirrors the Adversary's W5 cold test). + +This is the LAST planned config change before W4 completes (config stable ld19aj2 thereafter).