# JOURNAL — cc-ci Builder (append-only)

## 2026-05-26 — Bootstrap (§1)

**Access verification (all pass):**
- `ssh cc-ci 'hostname && whoami && nixos-version'` → `nixos` / `root` / `24.11.719113.50ab793786d9 (Vicuna)`
- `curl https://git.autonomic.zone/api/v1/version` → `{"version":"1.24.2"}`
- Gitea bot auth (`curl -u $GITEA_USERNAME:$GITEA_PASSWORD .../api/v1/user`) → `login: autonomic-bot`, id 64
- `getent hosts probe-$RANDOM.ci.commoninternet.net` → `143.244.213.108` (the gateway IP, as expected — TLS passthrough)
- Cert present: `ls /var/lib/ci-certs/live/` → `fullchain.pem` (2909 b), `privkey.pem` (227 b, mode 640)
- recipe-maintainers org exists (private); `recipe-maintainers/cc-ci` → 404 (created below)
- Mirrored recipes already present: bluesky-pds, lasuite-docs, custom-html, custom-html-tiny, n8n, keycloak, lasuite-meet, matrix-synapse, cryptpad

**Baseline (docs/baseline.md):** fresh NixOS 24.11 Incus VM, 2 vCPU, 3.5 GiB RAM, 8.9 GiB disk (3.8 GiB free). No docker/swarm/abra. Channel-based `/etc/nixos/configuration.nix` (no flake).

**Actions:**
- Created repo `recipe-maintainers/cc-ci` (private) via Gitea API.
- `git init` in /srv/cc-ci/cc-ci; credential helper reads creds from /srv/cc-ci/.testenv (no secrets stored in git config).
- Seeded skeleton layout (§3) + loop-state files + docs/baseline.md.

**Next:** commit + push bootstrap, then M0 (flake + base config + sops test secret).

## 2026-05-26 — M0: flake + base config rebuilt from repo

**Authored** `flake.nix` (pins nixpkgs rev `50ab793786d9…`, the exact rev cc-ci ran), `hosts/cc-ci/hardware.nix` (incus VM module + cloud-init + DHCP/nameservers) and `hosts/cc-ci/configuration.nix` (faithful baseline repro: tailscale w/ hardcoded `--hostname=cc-nix-test` since `builtins.readFile /etc/ts-hostname` is impure under flakes; sshd root; firewall trust tailscale0 + tcp/22; base pkgs).

**Disk/inode hiccup → resolved:** first `nix flake lock`/build hit `No space left on device` — diagnosed as **inode** exhaustion (`df -i` → 6005 free of 586336; old 8.9 GiB fs). Operator grew the VM to 28 GiB while I was measuring; ext4 auto-resized → 22 GiB free, 1.21M inodes free. Retried.

**Build + switch (commands + output):**
- `ssh cc-ci 'cd /root/cc-ci && nix flake lock && nixos-rebuild build --flake .#cc-ci'` → `BUILD EXIT 0`, produced `nixos-system-nixos-24.11.20250630.50ab793`.
- `ssh cc-ci 'systemd-run --unit=ccci-rebuild --collect --property=Type=oneshot nixos-rebuild switch --flake /root/cc-ci#cc-ci'` (detached so it survives ssh drop) → unit `Result=success ExecMainStatus=0`.

**Gate verification:**
- `systemctl is-system-running` → `running`
- `readlink /run/current-system` → `…-nixos-system-nixos-24.11.20250630.50ab793` (gen 3, from flake)
- `systemctl is-active tailscaled` → `active`; `sshd.socket` → `active` (sshd is socket-activated, so `sshd.service` reads inactive — live ssh proves it works)
- `systemctl --failed` → none
- `nixos-rebuild list-generations` → gen 3 current @20:23, prior channel gen 2 retained for rollback.

**Known warning (tracked, non-blocking):** incus module enables `systemd.network` while we keep `networking.useDHCP=true` (scripted dhcpcd); Nix warns both may manage interfaces. Inherited from baseline; networking is up. Clean up by choosing one stack later.

**Deploy mechanism settled** (DECISIONS.md): `switch --flake` on-host, repo synced via `tar | ssh`.

**Next:** sops-nix wiring (host age key from ssh host key + a decrypt-a-test-secret proof), then CLAIM the M0 gate for the Adversary.

## 2026-05-26 — M0: sops-nix wiring + decrypt-a-test-secret (M0 COMPLETE, gate CLAIMED)

**Keys:**
- Host age recipient from ssh host key: `ssh cc-ci 'nix run nixpkgs#ssh-to-age -- -i /etc/ssh/ssh_host_ed25519_key.pub'` → `age1h90utdztfc23kx8ewrtrtk80mnddvrf8pg4ppej55rwwwupzhfvqhmp3qa`.
- Master recovery key generated on host (`age-keygen`), public `age1cmk26t…`; private moved off-box to `/srv/cc-ci/.sops/master-age.txt` (mode 600) and `shred`-ded from the host. Never in repo.

**Files:** `.sops.yaml` (both recipients, rule `secrets/.*\.(yaml|json|env)$`); `modules/secrets.nix` (`sops.age.sshKeyPaths=[/etc/ssh/ssh_host_ed25519_key]`, `secrets.test_secret={}`); flake gains `sops-nix` input + `sops-nix.nixosModules.sops`; configuration.nix imports the module.

**sops-nix version pin (dead-end avoided):** master sops-nix wants `buildGo125Module` (Go 1.25), absent in pinned nixpkgs 24.11 → eval error. Pinned sops-nix to `77c423a…` (2025-06-17, last using plain `buildGoModule`). Verified the file at that rev uses `buildGoModule`. Build then OK.

**Encrypt test secret:** on host, `printf 'test_secret: cc-ci-m0-' > secrets/secrets.yaml` then `nix run nixpkgs#sops -- --encrypt --in-place secrets/secrets.yaml` (run inside repo so `.sops.yaml` resolves) → rc=0, two age recipients in the file.

**Build + switch (commands + output):**
- `nixos-rebuild build --flake .#cc-ci` → `BUILD EXIT 0` (built sops-install-secrets w/ Go 1.23.8).
- `systemd-run --unit=ccci-rebuild2 ... nixos-rebuild switch --flake /root/cc-ci#cc-ci` → `Result=success ExecMainStatus=0`.

**Gate verification (M0):**
- `systemctl is-system-running` → `running`; `systemctl --failed` → none.
- `ls -la /run/secrets/test_secret` → `-r-------- 1 root root 41`; `stat` → `root:root 400`.
- `head -c9` → `cc-ci-m0-` (matches generated value), `wc -c` → 41 (9 + 32 hex). Decrypt path proven.
- Pulled encrypted `secrets/secrets.yaml` + `flake.lock` back to clone; `grep cc-ci-m0 secrets.yaml` → no plaintext leak; lock inputs = nixpkgs, sops-nix.

**Gate handshake:** set `Gate: M0 — CLAIMED, awaiting Adversary` in STATUS.md. REVIEW.md still empty (no Adversary activity yet).
Per §6.1 liveness I won't idle-block: I keep M0 claimed and proceed with M1 (independent infra build), without advancing to M2 until M0 shows PASS.

**Next:** M1 — Docker + single-node swarm via Nix (modules/swarm.nix), then Traefik (file provider → /var/lib/ci-certs/live/) + abra, then a by-hand HTTPS deploy/teardown of a trivial recipe.

## 2026-05-26 — M1: Docker + single-node swarm via Nix

**modules/swarm.nix:** `virtualisation.docker.enable` + daily autoprune (--all --volumes until=24h to protect the 28 GiB root), `docker` in systemPackages, and a `swarm-init` oneshot (`docker swarm init --advertise-addr 127.0.0.1` if not active; `docker network create --driver overlay --attachable proxy` if absent). Imported into configuration.nix.

**Build + switch:** `nixos-rebuild build --flake .#cc-ci` → EXIT 0; `systemd-run … switch` → `Result=success`.

**Verify (commands + output):**
- `systemctl show swarm-init -p Result` → `Result=success`
- `docker info --format ...` → `Swarm=active Managers=1 Nodes=1`
- `docker network ls --filter name=proxy` → `proxy overlay swarm`
- `systemctl is-system-running` → `running`; `--failed` → none.

**Next:** Traefik as a swarm stack (Nix-declared compose + `docker stack deploy` oneshot): docker swarm provider + file provider serving /var/lib/ci-certs/live/{fullchain,privkey}.pem on :443, attached to `proxy`. Then abra install + by-hand HTTPS deploy/teardown of a trivial recipe (M1 gate). Rationale for swarm-service Traefik over a host `services.traefik`: a host process isn't on the `proxy` overlay, so it can't reach swarm service VIPs; coop-cloud recipes assume an on-`proxy` Traefik watching swarm labels.

## 2026-05-26 — M1: Traefik swarm stack + HTTPS path proven

**modules/traefik.nix:** Traefik v3.3 as a swarm service on `proxy` (so it reaches recipe VIPs).
Config via Nix `writeText` store files bind-mounted into the container (real files, not /etc symlinks): static `traefik.yml` (entrypoints web/websecure; `providers.swarm` unix socket, exposedByDefault=false, network=proxy; `providers.file` dir /etc/traefik/dynamic; ping; no dashboard) and dynamic `certs.yml` (wildcard at /var/lib/ci-certs/live/* as `stores.default.defaultCertificate` + certificates — so any *.ci.commoninternet.net router with tls=true is covered, no ACME). Deployed by a `traefik-deploy` oneshot (`docker stack deploy`) after swarm-init. Opened firewall 80/443 (gateway forwards over enp5s0).

**Build + switch:** build EXIT 0; switch `Result=success`; `traefik-deploy` `Result=success`; `docker service ls` → `traefik_traefik traefik:v3.3 1/1`.

**Verify (commands + output):**
- Local: `curl -ksv -H 'Host: probe-test.ci.commoninternet.net' https://localhost/` → `subject: CN=*.ci.commoninternet.net`, `issuer: …Let's Encrypt; CN=E8`, TLSv1.3, HTTP 404.
- **End-to-end via gateway:** `curl -ksv --resolve probe-test.ci.commoninternet.net:443:143.244.213.108 https://probe-test.ci.commoninternet.net/` → `Connected to …(143.244.213.108) port 443`, same wildcard cert, HTTP 404. Confirms gateway SNI-passthrough → cc-ci Traefik TLS termination. 404 is correct (no router for that host yet).

**Next:** install abra (M1 last task), `abra app new` a trivial recipe (custom-html) → deploy → reach over HTTPS at .ci.commoninternet.net → teardown leaving no volumes. That completes M1 → CLAIM M1 gate.

## 2026-05-26 — M1: proxy pivot to real coop-cloud/traefik via abra; recipe deploy/teardown (M1 CLAIMED)

**Orchestrator decision (mid-M1):** replace the hand-rolled Traefik with the canonical Co-op Cloud `traefik` recipe deployed via abra, wildcard/file-provider mode, no ACME/token. Removed custom `modules/traefik.nix`; moved firewall 80/443 into `modules/swarm.nix`. Recorded in DECISIONS.md.

**Why the pivot also fixed a real bug:** my custom Traefik used entrypoint `websecure`; coop-cloud recipes label `entrypoints=web-secure`. While chasing that I also hit a sharp **systemd-run gotcha**: `systemd-run … nixos-rebuild switch --flake .#cc-ci` runs with cwd `/`, so `.#` → `/` → "could not find a flake.nix"; the switch silently failed while a post-`--collect` `systemctl show` returned a stale `Result=success`. Fix: always use the **absolute** flake path `/root/cc-ci#cc-ci`, and read the result before resetting. (rebuild6/7 had silently not applied; rebuild2–5 used the absolute path.)

**abra packaged** (modules/abra.nix): release binary 0.13.0-beta, pinned by sha256, autoPatchelf'd. `abra --version` → `0.13.0-beta-06a57de`.

**scripts/deploy-proxy.sh** (idempotent, pure-bash — host has no python3): ensure local abra server, fetch traefik, write wildcard/no-ACME env (`WILDCARDS_ENABLED=1`, `SECRET_WILDCARD_*_VERSION=v1`, `COMPOSE_FILE=compose.yml:compose.wildcard.yml`, `LETS_ENCRYPT_ENV=` empty), insert cert secrets via `abra app secret insert … -f` from /var/lib/ci-certs/live, deploy. Bugs fixed en route: multi-line PEM must use `-f` (not arg); secret-presence must check `docker secret ls` (abra's recipe list always shows the name with `created on server:false`).

**Traefik deploy:** `abra app deploy` → `deploy succeeded 🟢` (traefik v3.6.15 + socket-proxy). Verify: `docker service ls` → app+socket-proxy 1/1; via gateway `curl --resolve probe.*:443:143.244.213.108` → `CN=*.ci.commoninternet.net` (LE E8); **0 ACME log lines**.

**M1 gate (recipe over HTTPS + teardown):**
- `abra app new custom-html -s default -D cchtml1.ci.commoninternet.net -S -n` then set `LETS_ENCRYPT_ENV=` and `abra app deploy -n -C` → `🟢` (nginx 1.29.0).
- `curl -ks --resolve cchtml1.ci.commoninternet.net:443:143.244.213.108 https://…/` → `http_code=200 size=615`, served the nginx welcome page over HTTPS with the wildcard cert.
- Teardown: `abra app undeploy -n` → 🟢; `abra app volume remove -f -n` → "1 volumes removed"; leak check → services 0 / volumes 0 / secrets 0 / containers 0. **Clean.**
- Correct teardown syntax confirmed: `secret remove --all -n` (not `--all-secrets`).

**docs/install.md** seeded (flake apply + deploy-proxy + verify). M1 gate CLAIMED in STATUS.md.

**Next:** M2 — Drone server + exec runner via Nix, Gitea OAuth app, hello-world .drone.yml green.

## 2026-05-26 — M2 start: CI engine decision + Gitea OAuth app + Drone secrets

**Decision (DECISIONS.md):** keep Drone per plan. nixpkgs 24.11 has drone server 2.24.0 but only the abandoned `drone-runner-exec` (unstable-2020) — accepted (stable RPC), Woodpecker is the documented fallback. Deploy shape mirrors traefik: server via coop-cloud `drone` recipe (abra, swarm, traefik-routed at drone.ci.commoninternet.net, no ACME), exec runner as a host Nix systemd service.

**Recipe recon:** coop-cloud `drone` recipe = drone/drone:2.26.0, secrets `rpc_secret` + `CLIENT_SECRET` (Gitea OAuth), Gitea SSO via `compose.gitea.yml` (`GITEA_CLIENT_ID`, `GITEA_DOMAIN`). Server env: DRONE_SERVER_HOST/PROTO, DRONE_USER_CREATE.

**Done this tick:**
- Created Gitea OAuth app `cc-ci-drone` (bot): client_id `ab4cdb9d-…`, redirect `https://drone.ci.commoninternet.net/login`.
- Generated `DRONE_RPC_SECRET` (openssl-equivalent /dev/urandom hex32) + stored client_secret; both added to `secrets/secrets.yaml` via `sops set` (needed `SOPS_AGE_KEY` from the host ssh key: `ssh-to-age -private-key -i /etc/ssh/ssh_host_ed25519_key`). Verified: decrypt shows keys test_secret/drone_rpc_secret/drone_gitea_client_secret; file stays encrypted (4× ENC).

**Next:** scripts/deploy-drone.sh (abra deploy of drone server w/ Gitea SSO + rpc/client secrets), modules/drone-runner.nix (exec runner systemd unit, rpc secret from sops), wire sops secrets for the runner, then push a hello-world .drone.yml and confirm a green build (M2 gate).

## 2026-05-26 — M2: Drone server + exec runner up; infra as idempotent-reconcile oneshots

**Orchestrator steer (2×):** collapse install to a single `nixos-rebuild switch` — convert the manual deploy scripts into **idempotent-reconcile systemd oneshots** (writeShellApplication, embedded in store; after swarm-init+docker; wants network-online; wantedBy multi-user; reconcile every activation/boot, NO run-once sentinel; fail visibly on missing cert). Applied to proxy + drone.

**Refactor done:**
- `modules/packages.nix`: `pkgs.abra` overlay (shared pinned build).
- `modules/proxy.nix`: `deploy-proxy` oneshot — reconciles coop-cloud traefik (wildcard/no-ACME).
- `modules/drone.nix`: `deploy-drone` oneshot — reconciles coop-cloud drone (Gitea SSO, secrets from /run/secrets), after deploy-proxy.
- `modules/drone-runner.nix`: exec runner (fixed PATH conflict via `lib.mkForce`; allowUnfree for drone-runner-exec — Polyform license).
- `modules/secrets.nix`: declared drone_rpc_secret + drone_gitea_client_secret + a sops *template* `drone-runner.env` (DRONE_RPC_SECRET) as the runner's EnvironmentFile (shared secret).
- Removed `scripts/deploy-*.sh`. install.md now = clone + nixos-rebuild switch + preconditions.

**Build/switch:** build EXIT 0 (shellcheck clean via writeShellApplication; runner pkg unfree-allowed). `nixos-rebuild switch` → all three units `active`/`success`:
- `deploy-proxy` success (reconciled traefik), `deploy-drone` → `deploy succeeded 🟢` (drone/drone 2.26.0, secrets client_secret+rpc_secret v1, drone_env config), `drone-runner-exec` active.

**Verify (commands + output):**
- `docker service ls` → `drone_ci_commoninternet_net_app 1/1`, traefik app+socket-proxy 1/1.
- Via gateway: `…/healthz` → **200**; `/` → **303** (login redirect, correct).
- Runner: journal shows a few startup `cannot ping the remote server (404)` (drone RPC not ready yet) then `successfully pinged the remote server` + `polling the remote server capacity=2 endpoint=https://drone.ci.commoninternet.net kind=pipeline type=exec`. **Runner connected via RPC.**

**Remaining for M2 gate:** push a hello-world `.drone.yml` to cc-ci + get a green build. Needs the cc-ci repo activated in Drone, which requires the bot's Gitea OAuth login (browser flow) to grant Drone a Gitea token (to sync repos + set the push webhook). Next tick: script the OAuth login to mint a Drone token, activate cc-ci, push .drone.yml, confirm green. (DRONE_USER_CREATE made autonomic-bot the admin.)

## 2026-05-26 — M2 GATE MET: green build via push (Drone + exec runner)

**Drone↔Gitea OAuth (scripted, the one manual bootstrap):** logged the bot into Gitea (CSRF cookie → form), drove Drone `/login` → Gitea authorize consent (POST `/login/oauth/grant` with _csrf+state+granted=true) → code callback → Drone `_session_`. Captured the whole flow in `scripts/bootstrap-drone-oauth.sh` (reads bot creds from env; documented in install.md §2; one-time, token persists in Drone's data volume).

**Repo activation:** `GET /api/user` → autonomic-bot admin=true; `GET /api/user/repos?latest=true` synced 12 repos; `POST /api/repos/recipe-maintainers/cc-ci` → active=true, config_path .drone.yml (sets the Gitea push webhook).

**Green build:** added `.drone.yml` (exec pipeline), pushed (0d89e28). Polled `/api/repos/recipe-maintainers/cc-ci/builds` → build #1 pending→running→**success**. Steps: clone success exit 0; hello success exit 0 — log shows `whoami=root`, `abra 0.13.0-beta-06a57de`, `swarm=active` (ran on the host via the exec runner). **M2 gate met; CLAIMED.**

**Next:** M3 — comment-bridge service: Gitea issue_comment webhook → verify HMAC + `!testme` exact + collaborator → resolve PR head repo/SHA → trigger a parameterized Drone build; post a PR comment with the run link.
Need a Drone API token for the bridge (mint from the bot's Drone account).

## 2026-05-26 — M3 start: bridge secrets + comment-bridge source

**Secrets (sops):** minted a Gitea API token (`cc-ci-bridge`, scopes read:org/user, write:repo/issue), a Drone API token (`POST /api/user/token`, the stable personal token; rotates on call), and a webhook HMAC (urandom hex64). Stored as bridge_gitea_token / bridge_drone_token / bridge_webhook_hmac via `sops set` (host age identity). secrets.yaml now holds 6 secrets.

**bridge/bridge.py** (Python stdlib only, §4.1): POST /hook handler — verifies Gitea HMAC (`X-Gitea-Signature` sha256), requires `X-Gitea-Event: issue_comment`, action=created, body trimmed == `!testme`, issue is a PR; checks commenter is a collaborator (Gitea collaborators endpoint, 204); resolves PR head sha+repo; triggers a parameterized Drone build (`POST /api/repos//builds?branch=main&RECIPE&REF&PR&SRC`, custom params → pipeline env); posts a PR comment linking the run. Secrets read from mounted files; config via env. `/healthz` GET.

**Next:** package the bridge as a swarm service (dockerTools image, no Docker Hub pull) behind traefik at `ci.commoninternet.net/hook` via a reconcile oneshot (modules/bridge.nix); register a per-repo webhook with the HMAC; demo on a scratch PR (!testme triggers; non-!testme + non-collab rejected). That's the M3 gate.

## 2026-05-26 — M3: bridge deployed + verified; webhook DELIVERY blocked (Gitea-side)

**Deployed** the comment-bridge as a Nix-built OCI image (no Docker Hub pull) → swarm service on `proxy`, behind traefik at `ci.commoninternet.net/hook`, via reconcile oneshot `modules/bridge.nix`. Swarm secrets (webhook_hmac/drone_token/gitea_token) materialised from /run/secrets.

**Verified working (bridge side):**
- `docker service ls` → ccci-bridge_app 1/1.
- `GET /hook/healthz` → 200 **from the sandbox over real public DNS** (ci.commoninternet.net → 143.244.213.108); also 200 via gateway from cc-ci.
- HMAC logic: bad sig → 401; a manually openssl-HMAC-signed body → 204 (passes sig, ignored as non-trigger); wrong event → 204. (Debug log added: `got=/want=/bodylen/seclen`.)
- Registered per-repo `issue_comment` webhook (id 210) on recipe-maintainers/cc-ci → ci.../hook with the HMAC. Created scratch PR #1.

**Blocker found:** commenting `!testme` (×several) and Gitea's "Test Delivery" (UI returns 200) yield ZERO requests at the bridge container. Bridge is publicly reachable by hostname from a 3rd network; gateway accepts public sources; public DNS correct → Gitea is not *sending* the delivery. Deliveries panel is AJAX (uninspectable via curl); bot is not Gitea admin (can't read `ALLOWED_HOST_LIST`). Conclusion: git.autonomic.zone webhook policy (likely `ALLOWED_HOST_LIST`) blocks ci.commoninternet.net. Recorded in STATUS ## Blocked with operator options (whitelist host, or I pivot bridge to polling).

**Plan:** surface to operator; meanwhile proceed to M4 (harness + install stage) which doesn't depend on the webhook (dev recipe-CI builds triggerable directly via the Drone API). Revisit M3 gate once the host is whitelisted or via the polling fallback.
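
**Sketch (illustrative, not the committed `bridge/bridge.py`):** the signature and trigger checks described for the bridge could look roughly like this in stdlib Python. The function names are hypothetical, and the payload field names (`action`, `comment.body`, `issue.pull_request`) are assumptions about the Gitea webhook shape rather than something recorded in this journal.

```python
import hashlib
import hmac


def signature_ok(secret: bytes, body: bytes, header_sig: str) -> bool:
    """True if header_sig is the hex HMAC-SHA256 of the raw body under secret."""
    want = hmac.new(secret, body, hashlib.sha256).hexdigest()
    got = header_sig.strip().lower()
    # Compare as bytes in constant time so the match position is not leaked.
    return hmac.compare_digest(want.encode(), got.encode())


def is_trigger(event: str, payload: dict) -> bool:
    """True only for a newly created PR comment whose trimmed body is exactly !testme."""
    if event != "issue_comment" or payload.get("action") != "created":
        return False
    body = (payload.get("comment") or {}).get("body") or ""
    if body.strip() != "!testme":
        return False
    # Assumption: a comment on a PR carries a non-empty issue.pull_request object.
    return bool((payload.get("issue") or {}).get("pull_request"))
```

In this shape a failed `signature_ok` would map to the 401 seen above, and a correctly signed request that fails `is_trigger` would map to the 204 "ignored" responses.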

## 2026-05-27 — M4: harness + install stage green (custom-html), guaranteed teardown

**Built the harness:** `runner/harness/abra.py` (abra wrappers w/ gotchas: no --chaos on undeploy/volume-remove, `-n` everywhere, parse `app ls -S -m` nested {server:{apps}}, timeouts), `runner/harness/lifecycle.py` (deploy_app forcing `LETS_ENCRYPT_ENV=""` [A1], wait_healthy = services-converged + HTTPS, teardown_app = undeploy+volume+secret+env-config, janitor for orphans), `tests/conftest.py` (`deployed_app` session fixture with finalizer teardown; short unique domain), `tests/custom-html/test_install.py` (HTTP 200 + Playwright/Chromium content assertion), `runner/run_recipe_ci.py` (orchestrator: fetch recipe@REF, run stage pytest), `modules/harness.nix` (`cc-ci-run` = Nix python3+pytest+playwright with PLAYWRIGHT_BROWSERS_PATH from nixpkgs).

**Bugs fixed en route (3):**
1. Swarm config name > 64 chars (long domain) → switched to short `-<6hex>` domain scheme (DECISIONS.md).
2. `services_converged` used wrong stack name (replaced hyphens) → abra keeps hyphens, only dots→_.
3. `http_get` connected to the gateway IP (drops SNI, gateway routes by SNI) → use the real URL (resolves to gateway on cc-ci, correct SNI). Also teardown now removes the app .env config.

**Green run + teardown (commands + output):**
- `RECIPE=custom-html PR=0 REF=m4demo cc-ci-run runner/run_recipe_ci.py` → `tests/custom-html/test_install.py::test_http_reachable PASSED`, `::test_playwright_page PASSED` — **2 passed in 57.99s**.
- Leak check after: services 0 / volumes 0 / secrets 0 / containers 0 / env config removed. Clean.

**A1 addressed:** deploy_app forces `LETS_ENCRYPT_ENV=""` (no ACME) on every deploy. M4 CLAIMED.

**M3 still blocked** (Gitea webhook delivery — operator); no response yet. Next: M5 (upgrade + backup/restore for custom-html), then wire the parameterized Drone pipeline (API-triggerable).
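
**Sketch (illustrative, not the harness code):** bugs 1 and 2 above both come down to how a run domain maps to swarm object names. A minimal version of the two helpers, assuming a prefix of up to four alphanumerics taken from the recipe name (that prefix rule and both function names are hypothetical; only the 6-hex suffix and the dots→`_` rule come from this entry):

```python
import re
import secrets

BASE_DOMAIN = "ci.commoninternet.net"


def short_domain(recipe: str) -> str:
    """Short unique run domain: <prefix>-<6 hex>.<base>.

    Hypothetical prefix rule: first 4 alphanumerics of the recipe name.
    """
    prefix = re.sub(r"[^a-z0-9]", "", recipe.lower())[:4] or "app"
    # token_hex(3) yields 3 random bytes rendered as 6 hex characters.
    return f"{prefix}-{secrets.token_hex(3)}.{BASE_DOMAIN}"


def stack_name(domain: str) -> str:
    """Stack name as abra derives it: dots become underscores, hyphens stay."""
    return domain.replace(".", "_")
```

For example, `stack_name("drone.ci.commoninternet.net")` gives `drone_ci_commoninternet_net`, which matches the `drone_ci_commoninternet_net_app` service name seen during M2.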

## 2026-05-27 — M5: upgrade + backup/restore stages green (custom-html)

**Upgrade stage** (tests/custom-html/test_upgrade.py): deploy previous published version (git-tag sort, second-newest), write a data marker into the served volume (nginx serves /usr/share/nginx/html, so the marker is HTTP-fetchable), `abra app upgrade` to current, assert healthy + marker survived. Fix: `upgrade` has no `--chaos` flag (used `-f -D -n`).

**backup-bot-two** deployed as reconcile oneshot (modules/backupbot.nix): restic repo in a local `backups` volume, restic_password abra-generated (only if missing). Fixes: `abra app secret generate` needs `-m` (machine) to avoid the TTY/ioctl path, and stdout redirected so generated values never hit the journal (D6). `abra app backup create`/`restore` need a real PTY ('input device is not a TTY') → run via util-linux `script -qec` (harness `_run_pty`; util-linux added to cc-ci-run).

**Backup stage** (test_backup.py): write "original" → `abra app backup create` → mutate to "mutated" → `abra app restore` → assert state back to "original".

**Full 3-stage run** (`STAGES=install,upgrade,backup`):
- install: 2 passed (http 200 + playwright)
- upgrade: 1 passed (data survives upgrade)
- backup: 1 passed (restore returns pre-mutation state)
- teardown: 0 orphaned run services/volumes/secrets; infra (traefik/drone/bridge/backupbot) all 1/1.

M5 CLAIMED.

**M3 still blocked** (webhook; no operator response across several ticks). Plan: if still blocked, pivot the bridge to poll the Gitea API (self-service, Adversary-endorsed) to unblock D1. Next: M6.
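
**Sketch (illustrative, not the harness's `_run_pty`):** the PTY workaround above amounts to wrapping the command in util-linux `script`. A minimal version, with hypothetical function names and an assumed default timeout:

```python
import shlex
import subprocess


def pty_argv(cmd: list) -> list:
    """Wrap cmd so util-linux `script` runs it on a pseudo-terminal."""
    # -q: quiet, -e: return the child's exit status, -c: command string to run.
    # /dev/null discards the typescript file that `script` would otherwise write.
    return ["script", "-qec", shlex.join(cmd), "/dev/null"]


def run_pty(cmd: list, timeout: float = 600) -> subprocess.CompletedProcess:
    """Run cmd under a PTY and capture its combined output as text."""
    return subprocess.run(
        pty_argv(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=timeout,  # assumed default, not taken from the harness
    )
```

`shlex.join` quotes each argument, so a domain or path containing spaces survives the trip through `script -c`.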

## 2026-05-27 — Fix adversary findings A2 (dead janitor) + A3 (unverified teardown)

**A2 (janitor matched dead `-pr` filter):** rewrote `harness.lifecycle.janitor` to match the real run-app naming (`RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$`), reap via docker primitives, AND scan `docker service ls` to catch orphans whose `.env` is already gone (reconstructs the domain from the service name). Age-gated (default 2h, env `CCCI_JANITOR_MAX_AGE`) so concurrent in-flight runs are never killed.

**A3 (teardown unverified + unconditional .env removal):** `teardown_app` now (1) `docker stack rm` fallback if `abra undeploy` leaves services, (2) removes volumes/secrets *before* the `.env` and only drops the `.env` after the stack is confirmed gone, (3) retries docker volume rm (a stopped task briefly holds the volume), (4) **verifies** no residual services/volumes/secrets and raises `TeardownError` otherwise — so a partial teardown FAILS the run instead of silently orphaning.

**Re-test (commands + output):**
- Normal install run → 2 passed, verified teardown clean.
- Orphan (deploy, no teardown) → `janitor(CCCI_JANITOR_MAX_AGE=0)` → services/volumes/secrets/env 0.
- **Env-less orphan** (deploy then `rm` the .env, the A3 bad state) → janitor reaps via docker stack rm → services/volumes/secrets 0.
- Full 3-stage run (install/upgrade/backup) still green with verified teardown, no TeardownError.

A2/A3 fixed; left for the Adversary to re-test + close.

## 2026-05-27 — M6 (part 1): harness enhancements for recipe #2 + D4 discovery

Before enrolling recipe #2, made the shared harness recipe-agnostic so enrolling a recipe needs no harness-code change (D5):
- **Per-recipe meta** (`tests//recipe_meta.py`, optional): HEALTH_PATH, HEALTH_OK, DEPLOY_TIMEOUT, HTTP_TIMEOUT. conftest reads it; `wait_healthy` gained a `path` param (e.g. keycloak `/realms/master`). Defaults preserve custom-html behaviour (verified: install still green).
- **Shared naming** (`harness/naming.py`): single source for the `-<6hex>` domain, used by conftest + the orchestrator.
- **D4 recipe-local discovery** (`run_recipe_ci.run_recipe_local`): if a recipe ships `tests/` with `test_*.py`, deploy the app, run those tests against the LIVE deployment (contract: env `CCCI_BASE_URL` + `CCCI_APP_DOMAIN`), merge as another reported stage, guaranteed teardown. Real recipes ship tests/ committed in their repo (clean checkout) → discovered on clone/fetch. (custom-html via catalogue is an awkward case — abra refuses an unstaged recipe and `abra recipe fetch` resets local commits — so D4 is demonstrated end-to-end with recipe #2 hedgedoc, which ships committed tests/.)

**Next:** mirror hedgedoc (postgres+hedgedoc, DB-backed) via the mirror+PR flow with a committed tests/ dir, write tests/hedgedoc/ (install/upgrade/backup + recipe_meta), run all stages + D4 green.

## 2026-05-27 — M6 (part 2): recipe #2 keycloak install green (DB-backed, no harness surgery)

Enrolled keycloak (recipe #2): keycloak 26.6.2 **+ mariadb 12.2** — genuinely DB-backed/multi-service (vs custom-html stateless). Added only `tests/keycloak/recipe_meta.py` (HEALTH_PATH=/realms/master, HEALTH_OK=(200,), 600s timeouts) + `tests/keycloak/test_install.py` (realm-endpoint health + Playwright admin-console login). **No change to runner/harness code** — the recipe-agnostic harness (per-recipe meta) handled it (D5 evidence).

Run: `RECIPE=keycloak STAGES=install cc-ci-run runner/run_recipe_ci.py` → 2 passed in 545s (keycloak is slow: image pull + JVM + mariadb migration). Teardown clean (0 keyc-* services/volumes after).

**Next:** D4 demo via a mirror shipping committed tests/ (recipe-local run against live app); then keycloak upgrade + backup/restore (DB data survival via a realm marker through the admin API).
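
**Sketch (illustrative, not the conftest code):** the optional per-recipe meta can be read by importing `recipe_meta.py` from its path and overlaying it on defaults. The loader name and the default values below are assumptions; only the four key names and the keycloak values come from the entries above.

```python
import importlib.util
from pathlib import Path

# Assumed defaults; the journal does not record the harness's actual values.
DEFAULTS = {
    "HEALTH_PATH": "/",
    "HEALTH_OK": (200,),
    "DEPLOY_TIMEOUT": 300,
    "HTTP_TIMEOUT": 120,
}


def load_recipe_meta(tests_root: Path, recipe: str) -> dict:
    """Return DEFAULTS overlaid with any keys set in tests/<recipe>/recipe_meta.py."""
    meta = dict(DEFAULTS)
    path = tests_root / recipe / "recipe_meta.py"
    if not path.is_file():
        return meta  # meta file is optional: defaults apply unchanged
    module_name = "recipe_meta_" + recipe.replace("-", "_")
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    for key in DEFAULTS:
        if hasattr(module, key):
            meta[key] = getattr(module, key)
    return meta
```

With this shape, enrolling a recipe only adds a file under `tests/`, which is the D5 property the keycloak run is evidence for.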

## 2026-05-27 — M6: D4 recipe-local discovery + recipe #2 enrolled (CLAIMED)

**D4 recipe-local discovery working.** Demo: pushed a committed `tests/test_recipe_local.py` to the mirror on branch `recipe-maintainers/custom-html@ci/d4-recipe-local`; ran `RECIPE=custom-html SRC=recipe-maintainers/custom-html REF=ci/d4-recipe-local STAGES=install` → install 2 passed, then `===== STAGE: recipe-local (D4) =====` ran the recipe-shipped test against the LIVE app (CCCI_BASE_URL) → 1 passed. Clean teardown (0 orphans).

**Hard-won abra behaviour (DECISIONS.md):** private mirror clone needs the bot token (per-command `http.extraHeader`, not persisted/logged). abra commands (`app ls`, `secret generate`, version resolution) silently `git checkout ` the recipe, dropping a PR branch's files — so (1) all harness abra calls use `-C -o` (chaos+offline = current checkout, no remote fetch), and (2) D4 snapshots the recipe's tests/ to a temp dir right after fetch (later abra cmds still reset it). Traced the drop step-by-step: app_new ok, deploy ok, but `secret generate` (no flags) and `app ls` each reset the checkout.

**Recipe #2 = keycloak** (keycloak + mariadb, DB-backed) install green with only `tests/keycloak/recipe_meta.py` + `test_install.py` — **no runner/harness change** (D5). custom-html remains 3-stage green (M5). docs/enroll-recipe.md written. **M6 CLAIMED.** keycloak's full 3-stage (DB data survival via a realm marker) folds into M6.5.

**Next:** M6.5 — keycloak upgrade/backup, then recipes 3–6 across the remaining D10 categories.