JOURNAL — cc-ci Builder (append-only)
2026-05-26 — Bootstrap (§1)
Access verification (all pass):
- ssh cc-ci 'hostname && whoami && nixos-version' → nixos/root/24.11.719113.50ab793786d9 (Vicuna)
- curl https://git.autonomic.zone/api/v1/version → {"version":"1.24.2"}
- Gitea bot auth (curl -u $GITEA_USERNAME:$GITEA_PASSWORD .../api/v1/user) → login: autonomic-bot, id 64
- getent hosts probe-$RANDOM.ci.commoninternet.net → 143.244.213.108 (the gateway IP, as expected: TLS passthrough)
- Cert present: ls /var/lib/ci-certs/live/ → fullchain.pem (2909 b), privkey.pem (227 b, mode 640)
- recipe-maintainers org exists (private); recipe-maintainers/cc-ci → 404 (created below)
- Mirrored recipes already present: bluesky-pds, lasuite-docs, custom-html, custom-html-tiny, n8n, keycloak, lasuite-meet, matrix-synapse, cryptpad
Baseline (docs/baseline.md): fresh NixOS 24.11 Incus VM, 2 vCPU, 3.5 GiB RAM, 8.9 GiB disk
(3.8 GiB free). No docker/swarm/abra. Channel-based /etc/nixos/configuration.nix (no flake).
Actions:
- Created repo recipe-maintainers/cc-ci (private) via Gitea API.
- git init in /srv/cc-ci/cc-ci; credential helper reads creds from /srv/cc-ci/.testenv (no secrets stored in git config).
- Seeded skeleton layout (§3) + loop-state files + docs/baseline.md.
Next: commit + push bootstrap, then M0 (flake + base config + sops test secret).
2026-05-26 — M0: flake + base config rebuilt from repo
Authored flake.nix (pins nixpkgs rev 50ab793786d9…, the exact rev cc-ci ran),
hosts/cc-ci/hardware.nix (incus VM module + cloud-init + DHCP/nameservers) and
hosts/cc-ci/configuration.nix (faithful baseline repro: tailscale w/ hardcoded --hostname=cc-nix-test since builtins.readFile /etc/ts-hostname is impure under flakes; sshd root; firewall
trust tailscale0 + tcp/22; base pkgs).
Disk/inode hiccup → resolved: first nix flake lock/build hit No space left on device —
diagnosed as inode exhaustion (df -i → 6005 free of 586336; old 8.9 GiB fs). Operator grew
the VM to 28 GiB while I was measuring; ext4 auto-resized → 22 GiB free, 1.21M inodes free. Retried.
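The df -i diagnosis above can also be scripted. The following is a minimal sketch using os.statvfs; the function names and the 5% threshold are illustrative assumptions and not part of the repo.

```python
import os


def inode_headroom(path="/"):
    """Return (free_inodes, total_inodes, free_fraction) for the filesystem at path."""
    st = os.statvfs(path)
    total = st.f_files
    free = st.f_ffree
    # Some filesystems report 0 total inodes; treat that as "not inode-limited".
    fraction = (free / total) if total else 1.0
    return free, total, fraction


def is_inode_starved(free, total, min_fraction=0.05):
    """True when fewer than min_fraction of inodes remain (threshold is an assumption)."""
    return total > 0 and (free / total) < min_fraction


if __name__ == "__main__":
    free, total, fraction = inode_headroom("/")
    print(f"inodes free={free} total={total} ({fraction:.1%})")
```

With the figures recorded in this entry (6005 free of 586336, roughly 1%), is_inode_starved reports starvation even while df -h might still show free blocks.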
Build + switch (commands + output):
- ssh cc-ci 'cd /root/cc-ci && nix flake lock && nixos-rebuild build --flake .#cc-ci' → BUILD EXIT 0, produced nixos-system-nixos-24.11.20250630.50ab793.
- ssh cc-ci 'systemd-run --unit=ccci-rebuild --collect --property=Type=oneshot nixos-rebuild switch --flake /root/cc-ci#cc-ci' (detached so it survives ssh drop) → unit Result=success ExecMainStatus=0.
Gate verification:
- systemctl is-system-running → running
- readlink /run/current-system → …-nixos-system-nixos-24.11.20250630.50ab793 (gen 3, from flake)
- systemctl is-active tailscaled → active; sshd.socket → active (sshd is socket-activated, so sshd.service reads inactive; live ssh proves it works)
- systemctl --failed → none
- nixos-rebuild list-generations → gen 3 current @20:23, prior channel gen 2 retained for rollback.
Known warning (tracked, non-blocking): incus module enables systemd.network while we keep
networking.useDHCP=true (scripted dhcpcd); Nix warns both may manage interfaces. Inherited from
baseline; networking is up. Clean up by choosing one stack later.
Deploy mechanism settled (DECISIONS.md): switch --flake on-host, repo synced via tar | ssh.
Next: sops-nix wiring (host age key from ssh host key + a decrypt-a-test-secret proof), then CLAIM the M0 gate for the Adversary.
2026-05-26 — M0: sops-nix wiring + decrypt-a-test-secret (M0 COMPLETE, gate CLAIMED)
Keys:
- Host age recipient from ssh host key: ssh cc-ci 'nix run nixpkgs#ssh-to-age -- -i /etc/ssh/ssh_host_ed25519_key.pub' → age1h90utdztfc23kx8ewrtrtk80mnddvrf8pg4ppej55rwwwupzhfvqhmp3qa.
- Master recovery key generated on host (age-keygen), public age1cmk26t…; private moved off-box to /srv/cc-ci/.sops/master-age.txt (mode 600) and shred-ded from the host. Never in repo.
Files: .sops.yaml (both recipients, rule secrets/.*\.(yaml|json|env)$); modules/secrets.nix
(sops.age.sshKeyPaths=[/etc/ssh/ssh_host_ed25519_key], secrets.test_secret={}); flake gains
sops-nix input + sops-nix.nixosModules.sops; configuration.nix imports the module.
sops-nix version pin (dead-end avoided): master sops-nix wants buildGo125Module (Go 1.25),
absent in pinned nixpkgs 24.11 → eval error. Pinned sops-nix to 77c423a… (2025-06-17, last using
plain buildGoModule). Verified the file at that rev uses buildGoModule. Build then OK.
Encrypt test secret: on host, printf 'test_secret: cc-ci-m0-<rand>' > secrets/secrets.yaml
then nix run nixpkgs#sops -- --encrypt --in-place secrets/secrets.yaml (run inside repo so
.sops.yaml resolves) → rc=0, two age recipients in the file.
Build + switch (commands + output):
- nixos-rebuild build --flake .#cc-ci → BUILD EXIT 0 (built sops-install-secrets w/ Go 1.23.8).
- systemd-run --unit=ccci-rebuild2 ... nixos-rebuild switch --flake /root/cc-ci#cc-ci → Result=success ExecMainStatus=0.
Gate verification (M0):
- systemctl is-system-running → running; systemctl --failed → none.
- ls -la /run/secrets/test_secret → -r-------- 1 root root 41; stat → root:root 400.
- head -c9 → cc-ci-m0- (matches generated value), wc -c → 41 (9 + 32 hex). Decrypt path proven.
- Pulled encrypted secrets/secrets.yaml + flake.lock back to clone; grep cc-ci-m0 secrets.yaml → no plaintext leak; lock inputs = nixpkgs, sops-nix.
Gate handshake: set Gate: M0 — CLAIMED, awaiting Adversary in STATUS.md. REVIEW.md still empty
(no Adversary activity yet). Per §6.1 liveness I won't idle-block: I keep M0 claimed and proceed
with M1 (independent infra build), without advancing to M2 until M0 shows PASS.
Next: M1 — Docker + single-node swarm via Nix (modules/swarm.nix), then Traefik (file provider → /var/lib/ci-certs/live/) + abra, then a by-hand HTTPS deploy/teardown of a trivial recipe.
2026-05-26 — M1: Docker + single-node swarm via Nix
modules/swarm.nix: virtualisation.docker.enable + daily autoprune (--all --volumes until=24h
to protect the 28 GiB root), docker in systemPackages, and a swarm-init oneshot
(docker swarm init --advertise-addr 127.0.0.1 if not active; docker network create --driver overlay --attachable proxy if absent). Imported into configuration.nix.
Build + switch: nixos-rebuild build --flake .#cc-ci → EXIT 0; systemd-run … switch →
Result=success.
Verify (commands + output):
- systemctl show swarm-init -p Result → Result=success
- docker info --format ... → Swarm=active Managers=1 Nodes=1
- docker network ls --filter name=proxy → proxy overlay swarm
- systemctl is-system-running → running; --failed → none.
Next: Traefik as a swarm stack (Nix-declared compose + docker stack deploy oneshot): docker
swarm provider + file provider serving /var/lib/ci-certs/live/{fullchain,privkey}.pem on :443,
attached to proxy. Then abra install + by-hand HTTPS deploy/teardown of a trivial recipe (M1 gate).
Rationale for swarm-service Traefik over a host services.traefik: a host process isn't on the
proxy overlay, so it can't reach swarm service VIPs; coop-cloud recipes assume an on-proxy
Traefik watching swarm labels.
2026-05-26 — M1: Traefik swarm stack + HTTPS path proven
modules/traefik.nix: Traefik v3.3 as a swarm service on proxy (so it reaches recipe VIPs).
Config via Nix writeText store files bind-mounted into the container (real files, not /etc
symlinks): static traefik.yml (entrypoints web/websecure; providers.swarm unix socket,
exposedByDefault=false, network=proxy; providers.file dir /etc/traefik/dynamic; ping; no
dashboard) and dynamic certs.yml (wildcard at /var/lib/ci-certs/live/* as stores.default.defaultCertificate + certificates, so any *.ci.commoninternet.net router with tls=true is covered,
no ACME). Deployed by a traefik-deploy oneshot (docker stack deploy) after swarm-init. Opened
firewall 80/443 (gateway forwards over enp5s0).
Build + switch: build EXIT 0; switch Result=success; traefik-deploy Result=success;
docker service ls → traefik_traefik traefik:v3.3 1/1.
Verify (commands + output):
- Local: curl -ksv -H 'Host: probe-test.ci.commoninternet.net' https://localhost/ → subject: CN=*.ci.commoninternet.net, issuer: …Let's Encrypt; CN=E8, TLSv1.3, HTTP 404.
- End-to-end via gateway: curl -ksv --resolve probe-test.ci.commoninternet.net:443:143.244.213.108 https://probe-test.ci.commoninternet.net/ → Connected to …(143.244.213.108) port 443, same wildcard cert, HTTP 404. Confirms gateway SNI-passthrough → cc-ci Traefik TLS termination. 404 is correct (no router for that host yet).
Next: install abra (M1 last task), abra app new a trivial recipe (custom-html) → deploy →
reach over HTTPS at .ci.commoninternet.net → teardown leaving no volumes. That completes M1
→ CLAIM M1 gate.
2026-05-26 — M1: proxy pivot to real coop-cloud/traefik via abra; recipe deploy/teardown (M1 CLAIMED)
Orchestrator decision (mid-M1): replace the hand-rolled Traefik with the canonical Co-op Cloud
traefik recipe deployed via abra, wildcard/file-provider mode, no ACME/token. Removed custom
modules/traefik.nix; moved firewall 80/443 into modules/swarm.nix. Recorded in DECISIONS.md.
Why the pivot also fixed a real bug: my custom Traefik used entrypoint websecure; coop-cloud
recipes label entrypoints=web-secure. While chasing that I also hit a sharp systemd-run gotcha:
systemd-run … nixos-rebuild switch --flake .#cc-ci runs with cwd /, so .# → / → "could not
find a flake.nix"; the switch silently failed while a post---collect systemctl show returned a
stale Result=success. Fix: always use the absolute flake path /root/cc-ci#cc-ci, and read the
result before resetting. (rebuild6/7 had silently not applied; rebuild2–5 used the absolute path.)
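The cwd gotcha above can be guarded against by resolving the flake ref before handing it to systemd-run. This is a minimal sketch, not code from the repo: resolve_flake_ref and switch_command are hypothetical helper names, and only path-style refs are handled (URL-like refs such as github:… are passed through untouched).

```python
import os


def resolve_flake_ref(ref, base_dir):
    """Turn a relative flake ref such as '.#cc-ci' into an absolute one.

    base_dir must be given explicitly: under systemd-run the command starts
    with cwd '/', so '.' no longer means the repo checkout.
    """
    path, sep, attr = ref.partition("#")
    if ":" in path:
        return ref  # URL-like ref (assumption: left unchanged)
    if not os.path.isabs(path):
        path = os.path.normpath(os.path.join(base_dir, path))
    return f"{path}{sep}{attr}"


def switch_command(ref, base_dir, unit="ccci-rebuild"):
    """Build the detached switch command with an absolute flake path."""
    return [
        "systemd-run", f"--unit={unit}", "--collect",
        "--property=Type=oneshot",
        "nixos-rebuild", "switch", "--flake", resolve_flake_ref(ref, base_dir),
    ]
```

For example, switch_command(".#cc-ci", "/root/cc-ci") ends with the absolute ref /root/cc-ci#cc-ci, which is the form the fix above settles on.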
abra packaged (modules/abra.nix): release binary 0.13.0-beta, pinned by sha256, autoPatchelf'd.
abra --version → 0.13.0-beta-06a57de.
scripts/deploy-proxy.sh (idempotent, pure-bash — host has no python3): ensure local abra server,
fetch traefik, write wildcard/no-ACME env (WILDCARDS_ENABLED=1, SECRET_WILDCARD_*_VERSION=v1,
COMPOSE_FILE=compose.yml:compose.wildcard.yml, LETS_ENCRYPT_ENV= empty), insert cert secrets via
abra app secret insert … -f from /var/lib/ci-certs/live, deploy. Bugs fixed en route: multi-line
PEM must use -f (not arg); secret-presence must check docker secret ls (abra's recipe list always
shows the name with created on server:false).
Traefik deploy: abra app deploy → deploy succeeded 🟢 (traefik v3.6.15 + socket-proxy).
Verify: docker service ls → app+socket-proxy 1/1; via gateway curl --resolve probe.*:443:143.244.213.108 → CN=*.ci.commoninternet.net (LE E8); 0 ACME log lines.
M1 gate (recipe over HTTPS + teardown):
- abra app new custom-html -s default -D cchtml1.ci.commoninternet.net -S -n then set LETS_ENCRYPT_ENV= and abra app deploy -n -C → 🟢 (nginx 1.29.0).
- curl -ks --resolve cchtml1.ci.commoninternet.net:443:143.244.213.108 https://…/ → http_code=200 size=615, served the nginx welcome page over HTTPS with the wildcard cert.
- Teardown: abra app undeploy -n → 🟢; abra app volume remove -f -n → "1 volumes removed"; leak check → services 0 / volumes 0 / secrets 0 / containers 0. Clean.
- Correct teardown syntax confirmed: secret remove <d> --all -n (not --all-secrets).
docs/install.md seeded (flake apply + deploy-proxy + verify). M1 gate CLAIMED in STATUS.md.
Next: M2 — Drone server + exec runner via Nix, Gitea OAuth app, hello-world .drone.yml green.
2026-05-26 — M2 start: CI engine decision + Gitea OAuth app + Drone secrets
Decision (DECISIONS.md): keep Drone per plan. nixpkgs 24.11 has drone server 2.24.0 but only the
abandoned drone-runner-exec (unstable-2020) — accepted (stable RPC), Woodpecker is the documented
fallback. Deploy shape mirrors traefik: server via coop-cloud drone recipe (abra, swarm,
traefik-routed at drone.ci.commoninternet.net, no ACME), exec runner as a host Nix systemd service.
Recipe recon: coop-cloud drone recipe = drone/drone:2.26.0, secrets rpc_secret +
CLIENT_SECRET (Gitea OAuth), Gitea SSO via compose.gitea.yml (GITEA_CLIENT_ID, GITEA_DOMAIN).
Server env: DRONE_SERVER_HOST/PROTO, DRONE_USER_CREATE.
Done this tick:
- Created Gitea OAuth app cc-ci-drone (bot): client_id ab4cdb9d-…, redirect https://drone.ci.commoninternet.net/login.
- Generated DRONE_RPC_SECRET (openssl-equivalent /dev/urandom hex32) + stored client_secret; both added to secrets/secrets.yaml via sops set (needed SOPS_AGE_KEY from the host ssh key: ssh-to-age -private-key -i /etc/ssh/ssh_host_ed25519_key). Verified: decrypt shows keys test_secret/drone_rpc_secret/drone_gitea_client_secret; file stays encrypted (4× ENC).
Next: scripts/deploy-drone.sh (abra deploy of drone server w/ Gitea SSO + rpc/client secrets), modules/drone-runner.nix (exec runner systemd unit, rpc secret from sops), wire sops secrets for the runner, then push a hello-world .drone.yml and confirm a green build (M2 gate).
2026-05-26 — M2: Drone server + exec runner up; infra as idempotent-reconcile oneshots
Orchestrator steer (2×): collapse install to a single nixos-rebuild switch — convert the
manual deploy scripts into idempotent-reconcile systemd oneshots (writeShellApplication, embedded
in store; after swarm-init+docker; wants network-online; wantedBy multi-user; reconcile every
activation/boot, NO run-once sentinel; fail visibly on missing cert). Applied to proxy + drone.
Refactor done:
- modules/packages.nix: pkgs.abra overlay (shared pinned build).
- modules/proxy.nix: deploy-proxy oneshot, reconciles coop-cloud traefik (wildcard/no-ACME).
- modules/drone.nix: deploy-drone oneshot, reconciles coop-cloud drone (Gitea SSO, secrets from /run/secrets), after deploy-proxy.
- modules/drone-runner.nix: exec runner (fixed PATH conflict via lib.mkForce; allowUnfree for drone-runner-exec, Polyform license).
- modules/secrets.nix: declared drone_rpc_secret + drone_gitea_client_secret + a sops template drone-runner.env (DRONE_RPC_SECRET) as the runner's EnvironmentFile (shared secret).
- Removed scripts/deploy-*.sh. install.md now = clone + nixos-rebuild switch + preconditions.
Build/switch: build EXIT 0 (shellcheck clean via writeShellApplication; runner pkg unfree-allowed).
nixos-rebuild switch → all three units active/success:
- deploy-proxy success (reconciled traefik),
- deploy-drone → deploy succeeded 🟢 (drone/drone 2.26.0, secrets client_secret+rpc_secret v1, drone_env config),
- drone-runner-exec active.
Verify (commands + output):
- docker service ls → drone_ci_commoninternet_net_app 1/1, traefik app+socket-proxy 1/1.
- Via gateway: …/healthz → 200; / → 303 (login redirect, correct).
- Runner: journal shows a few startup cannot ping the remote server (404) (drone RPC not ready yet) then successfully pinged the remote server + polling the remote server capacity=2 endpoint=https://drone.ci.commoninternet.net kind=pipeline type=exec. Runner connected via RPC.
Remaining for M2 gate: push a hello-world .drone.yml to cc-ci + get a green build. Needs the
cc-ci repo activated in Drone, which requires the bot's Gitea OAuth login (browser flow) to grant
Drone a Gitea token (to sync repos + set the push webhook). Next tick: script the OAuth login to mint
a Drone token, activate cc-ci, push .drone.yml, confirm green. (DRONE_USER_CREATE made autonomic-bot
the admin.)
2026-05-26 — M2 GATE MET: green build via push (Drone + exec runner)
Drone↔Gitea OAuth (scripted, the one manual bootstrap): logged the bot into Gitea (CSRF cookie
→ form), drove Drone /login → Gitea authorize consent (POST /login/oauth/grant with _csrf+state+
granted=true) → code callback → Drone _session_. Captured the whole flow in
scripts/bootstrap-drone-oauth.sh (reads bot creds from env; documented in install.md §2; one-time,
token persists in Drone's data volume).
Repo activation: GET /api/user → autonomic-bot admin=true; GET /api/user/repos?latest=true
synced 12 repos; POST /api/repos/recipe-maintainers/cc-ci → active=true, config_path .drone.yml
(sets the Gitea push webhook).
Green build: added .drone.yml (exec pipeline), pushed (0d89e28). Polled
/api/repos/recipe-maintainers/cc-ci/builds → build #1 pending→running→success. Steps:
clone success exit 0; hello success exit 0 — log shows whoami=root, abra 0.13.0-beta-06a57de,
swarm=active (ran on the host via the exec runner). M2 gate met; CLAIMED.
Next: M3 — comment-bridge service: Gitea issue_comment webhook → verify HMAC + !testme exact +
collaborator → resolve PR head repo/SHA → trigger a parameterized Drone build; post a PR comment with
the run link. Need a Drone API token for the bridge (mint from the bot's Drone account).
2026-05-26 — M3 start: bridge secrets + comment-bridge source
Secrets (sops): minted a Gitea API token (cc-ci-bridge, scopes read:org/user, write:repo/issue),
a Drone API token (POST /api/user/token, the stable personal token; rotates on call), and a webhook
HMAC (urandom hex64). Stored as bridge_gitea_token / bridge_drone_token / bridge_webhook_hmac via
sops set (host age identity). secrets.yaml now holds 6 secrets.
bridge/bridge.py (Python stdlib only, §4.1): POST /hook handler — verifies Gitea HMAC
(X-Gitea-Signature sha256), requires X-Gitea-Event: issue_comment, action=created, body trimmed
== !testme, issue is a PR; checks commenter is a collaborator (Gitea collaborators endpoint, 204);
resolves PR head sha+repo; triggers a parameterized Drone build
(POST /api/repos/<CI_REPO>/builds?branch=main&RECIPE&REF&PR&SRC, custom params → pipeline env);
posts a PR comment linking the run. Secrets read from mounted files; config via env. /healthz GET.
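The hook checks described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the contents of bridge/bridge.py: the function names are hypothetical, the payload field issue.pull_request is assumed to be how a PR is distinguished from a plain issue, and the signature is assumed to be the hex HMAC-SHA256 of the raw request body.

```python
import hashlib
import hmac
from urllib.parse import urlencode

TRIGGER = "!testme"


def signature_ok(secret: bytes, body: bytes, header_sig: str) -> bool:
    """Compare the hex HMAC-SHA256 of the raw body with X-Gitea-Signature."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected.encode(), header_sig.strip().lower().encode())


def is_trigger(event: str, payload: dict) -> bool:
    """issue_comment + action=created + trimmed body == !testme + issue is a PR."""
    if event != "issue_comment" or payload.get("action") != "created":
        return False
    body = (payload.get("comment") or {}).get("body", "")
    is_pr = bool((payload.get("issue") or {}).get("pull_request"))  # assumed field
    return body.strip() == TRIGGER and is_pr


def build_url(drone: str, ci_repo: str, recipe: str, ref: str, pr: int, src: str) -> str:
    """Drone build-create URL; the extra query params become pipeline env vars."""
    query = urlencode({"branch": "main", "RECIPE": recipe, "REF": ref, "PR": pr, "SRC": src})
    return f"{drone}/api/repos/{ci_repo}/builds?{query}"
```

A handler would call signature_ok first and answer 401 on mismatch, then answer 204 for anything is_trigger rejects, so that non-trigger deliveries are accepted but ignored.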
Next: package the bridge as a swarm service (dockerTools image, no Docker Hub pull) behind
traefik at ci.commoninternet.net/hook via a reconcile oneshot (modules/bridge.nix); register a
per-repo webhook with the HMAC; demo on a scratch PR (!testme triggers; non-!testme + non-collab
rejected). That's the M3 gate.
2026-05-26 — M3: bridge deployed + verified; webhook DELIVERY blocked (Gitea-side)
Deployed the comment-bridge as a Nix-built OCI image (no Docker Hub pull) → swarm service on
proxy, behind traefik at ci.commoninternet.net/hook, via reconcile oneshot modules/bridge.nix.
Swarm secrets (webhook_hmac/drone_token/gitea_token) materialised from /run/secrets.
Verified working (bridge side):
- docker service ls → ccci-bridge_app 1/1.
- GET /hook/healthz → 200 from the sandbox over real public DNS (ci.commoninternet.net → 143.244.213.108); also 200 via gateway from cc-ci.
- HMAC logic: bad sig → 401; a manually openssl-HMAC-signed body → 204 (passes sig, ignored as non-trigger); wrong event → 204. (Debug log added: got=/want=/bodylen/seclen.)
- Registered per-repo issue_comment webhook (id 210) on recipe-maintainers/cc-ci → ci.../hook with the HMAC. Created scratch PR #1.
Blocker found: commenting !testme (×several) and Gitea's "Test Delivery" (UI returns 200) yield
ZERO requests at the bridge container. Bridge is publicly reachable by hostname from a 3rd network;
gateway accepts public sources; public DNS correct → Gitea is not sending the delivery. Deliveries
panel is AJAX (uninspectable via curl); bot is not Gitea admin (can't read ALLOWED_HOST_LIST).
Conclusion: git.autonomic.zone webhook policy (likely ALLOWED_HOST_LIST) blocks ci.commoninternet.net.
Recorded in STATUS ## Blocked with operator options (whitelist host, or I pivot bridge to polling).
Plan: surface to operator; meanwhile proceed to M4 (harness + install stage) which doesn't depend on the webhook (dev recipe-CI builds triggerable directly via the Drone API). Revisit M3 gate once the host is whitelisted or via the polling fallback.
2026-05-27 — M4: harness + install stage green (custom-html), guaranteed teardown
Built the harness: runner/harness/abra.py (abra wrappers w/ gotchas: no --chaos on
undeploy/volume-remove, -n everywhere, parse app ls -S -m nested {server:{apps}}, timeouts),
runner/harness/lifecycle.py (deploy_app forcing LETS_ENCRYPT_ENV="" [A1], wait_healthy =
services-converged + HTTPS, teardown_app = undeploy+volume+secret+env-config, janitor for orphans),
tests/conftest.py (deployed_app session fixture with finalizer teardown; short unique domain),
tests/custom-html/test_install.py (HTTP 200 + Playwright/Chromium content assertion),
runner/run_recipe_ci.py (orchestrator: fetch recipe@REF, run stage pytest), modules/harness.nix
(cc-ci-run = Nix python3+pytest+playwright with PLAYWRIGHT_BROWSERS_PATH from nixpkgs).
Bugs fixed en route (3):
- Swarm config name > 64 chars (long domain) → switched to short <recipe[:4]>-<6hex> domain scheme (DECISIONS.md).
- services_converged used wrong stack name (replaced hyphens) → abra keeps hyphens, only dots→_.
- http_get connected to the gateway IP (drops SNI, gateway routes by SNI) → use the real URL (resolves to gateway on cc-ci, correct SNI). Also teardown now removes the app .env config.
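The first two fixes above come down to two small string rules. A minimal sketch, assuming hypothetical helper names (run_domain, stack_name) and that the 6 hex characters come from a random token:

```python
import secrets
from typing import Optional

BASE_DOMAIN = "ci.commoninternet.net"


def run_domain(recipe: str, token: Optional[str] = None) -> str:
    """<recipe[:4]>-<6hex>.<base>: short, so derived swarm object names stay small."""
    token = token or secrets.token_hex(3)  # 3 random bytes -> 6 hex chars
    return f"{recipe[:4]}-{token}.{BASE_DOMAIN}"


def stack_name(domain: str) -> str:
    """Stack name as observed above: dots become '_', hyphens are kept."""
    return domain.replace(".", "_")
```

For custom-html with token a1b2c3 this gives cust-a1b2c3.ci.commoninternet.net and the stack name cust-a1b2c3_ci_commoninternet_net, which leaves room for config and secret suffixes under the 64-character limit.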
Green run + teardown (commands + output):
- RECIPE=custom-html PR=0 REF=m4demo cc-ci-run runner/run_recipe_ci.py → tests/custom-html/test_install.py::test_http_reachable PASSED, ::test_playwright_page PASSED, 2 passed in 57.99s.
- Leak check after: services 0 / volumes 0 / secrets 0 / containers 0 / env config removed. Clean.
A1 addressed: deploy_app forces LETS_ENCRYPT_ENV="" (no ACME) on every deploy. M4 CLAIMED.
M3 still blocked (Gitea webhook delivery — operator); no response yet. Next: M5 (upgrade + backup/restore for custom-html), then wire the parameterized Drone pipeline (API-triggerable).
2026-05-27 — M5: upgrade + backup/restore stages green (custom-html)
Upgrade stage (tests/custom-html/test_upgrade.py): deploy previous published version
(git-tag sort, second-newest), write a data marker into the served volume (nginx serves
/usr/share/nginx/html, so the marker is HTTP-fetchable), abra app upgrade to current, assert
healthy + marker survived. Fix: upgrade has no --chaos flag (used -f -D -n).
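Picking the "second-newest" published version can be sketched as a numeric sort over tag names. The tag format used here (recipe version, a plus sign, upstream version) and both function names are assumptions for illustration; the harness may sort differently.

```python
import re


def version_key(tag: str):
    """Sort key: every integer run in the tag, in order.

    '1.2.0+1.29.0' -> (1, 2, 0, 1, 29, 0), so 1.10.0 sorts after 1.2.0.
    """
    return tuple(int(n) for n in re.findall(r"\d+", tag))


def previous_version(tags):
    """Return the second-newest tag, or None when fewer than two exist."""
    ordered = sorted(set(tags), key=version_key)
    return ordered[-2] if len(ordered) >= 2 else None
```

A None result would correspond to an upgrade stage that has nothing to upgrade from and is skipped.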
backup-bot-two deployed as reconcile oneshot (modules/backupbot.nix): restic repo in a local
backups volume, restic_password abra-generated (only if missing). Fixes: abra app secret generate
needs -m (machine) to avoid the TTY/ioctl path, and stdout redirected so generated values never
hit the journal (D6). abra app backup create/restore need a real PTY ('input device is not a
TTY') → run via util-linux script -qec (harness _run_pty; util-linux added to cc-ci-run).
Backup stage (test_backup.py): write "original" → abra app backup create → mutate to
"mutated" → abra app restore → assert state back to "original".
Full 3-stage run (STAGES=install,upgrade,backup):
- install: 2 passed (http 200 + playwright)
- upgrade: 1 passed (data survives upgrade)
- backup: 1 passed (restore returns pre-mutation state)
- teardown: 0 orphaned run services/volumes/secrets; infra (traefik/drone/bridge/backupbot) all 1/1. M5 CLAIMED.
M3 still blocked (webhook; no operator response across several ticks). Plan: if still blocked, pivot the bridge to poll the Gitea API (self-service, Adversary-endorsed) to unblock D1. Next: M6.
2026-05-27 — Fix adversary findings A2 (dead janitor) + A3 (unverified teardown)
A2 (janitor matched dead -pr filter): rewrote harness.lifecycle.janitor to match the real
run-app naming (RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$), reap via
docker primitives, AND scan docker service ls to catch orphans whose .env is already gone
(reconstructs the domain from the service name). Age-gated (default 2h, env CCCI_JANITOR_MAX_AGE)
so concurrent in-flight runs are never killed.
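The matching and age-gating rules above can be sketched as two pure functions. RUN_APP_RE is quoted from this entry; the helper names, and the assumption that a swarm service is named stack + "_" + service with no underscore in the service part, are illustrative.

```python
import re
import time

# Pattern quoted from the entry above.
RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$")


def domain_from_service(service: str):
    """Rebuild the app domain from a service name such as
    'cust-a1b2c3_ci_commoninternet_net_app'; None if it is not a run app."""
    stack = service.rsplit("_", 1)[0]   # drop the trailing service part
    domain = stack.replace("_", ".")    # undo the dots -> '_' mapping
    return domain if RUN_APP_RE.match(domain) else None


def should_reap(domain: str, created_at: float, max_age_s: float, now=None) -> bool:
    """Reap only run apps at least max_age_s old, so in-flight runs survive."""
    now = time.time() if now is None else now
    return bool(RUN_APP_RE.match(domain)) and (now - created_at) >= max_age_s
```

Infra services such as drone_ci_commoninternet_net_app do not match the pattern, so they are never candidates for reaping.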
A3 (teardown unverified + unconditional .env removal): teardown_app now (1) docker stack rm
fallback if abra undeploy leaves services, (2) removes volumes/secrets before the .env and
only drops the .env after the stack is confirmed gone, (3) retries docker volume rm (a stopped
task briefly holds the volume), (4) verifies no residual services/volumes/secrets and raises
TeardownError otherwise — so a partial teardown FAILS the run instead of silently orphaning.
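The ordering in A3 can be sketched with the docker and abra calls injected as callables, which keeps the control flow testable without a swarm. Everything except the TeardownError name is a hypothetical stand-in for the real harness functions.

```python
import time


class TeardownError(RuntimeError):
    """Raised when resources survive teardown."""


def retry(action, attempts=5, delay_s=2.0, sleep=time.sleep):
    """Call action() until it returns True or attempts run out."""
    for i in range(attempts):
        if action():
            return True
        if i < attempts - 1:
            sleep(delay_s)
    return False


def verified_teardown(undeploy, stack_rm, list_services, remove_volumes,
                      remove_secrets, remove_env, residual):
    """Follow the A3 ordering; every argument is a caller-supplied callable."""
    undeploy()
    if list_services():        # (1) fall back if undeploy left services behind
        stack_rm()
    retry(remove_volumes)      # (3) a stopped task may briefly hold the volume
    remove_secrets()           # (2) volumes and secrets go before the .env
    if not list_services():
        remove_env()           # (2) drop the .env only once the stack is gone
    left = residual()          # (4) names of anything still present
    if left:
        raise TeardownError(f"residual resources after teardown: {left}")
```

Raising instead of logging is the point of the fix: a partial teardown turns the run red rather than leaving an orphan for the janitor to find later.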
Re-test (commands + output):
- Normal install run → 2 passed, verified teardown clean.
- Orphan (deploy, no teardown) → janitor(CCCI_JANITOR_MAX_AGE=0) → services/volumes/secrets/env 0.
- Env-less orphan (deploy then rm the .env, the A3 bad state) → janitor reaps via docker stack rm → services/volumes/secrets 0.
- Full 3-stage run (install/upgrade/backup) still green with verified teardown, no TeardownError.
A2/A3 fixed; left for the Adversary to re-test + close.
2026-05-27 — M6 (part 1): harness enhancements for recipe #2 + D4 discovery
Before enrolling recipe #2, made the shared harness recipe-agnostic so enrolling a recipe needs no harness-code change (D5):
- Per-recipe meta (tests/<recipe>/recipe_meta.py, optional): HEALTH_PATH, HEALTH_OK, DEPLOY_TIMEOUT, HTTP_TIMEOUT. conftest reads it; wait_healthy gained a path param (e.g. keycloak /realms/master). Defaults preserve custom-html behaviour (verified: install still green).
- Shared naming (harness/naming.py): single source for the <recipe[:4]>-<6hex> domain, used by conftest + the orchestrator.
- D4 recipe-local discovery (run_recipe_ci.run_recipe_local): if a recipe ships tests/ with test_*.py, deploy the app, run those tests against the LIVE deployment (contract: env CCCI_BASE_URL + CCCI_APP_DOMAIN), merge as another reported stage, guaranteed teardown. Real recipes ship tests/ committed in their repo (clean checkout) → discovered on clone/fetch. (custom-html via catalogue is an awkward case: abra refuses an unstaged recipe and abra recipe fetch resets local commits. D4 is therefore demonstrated end-to-end with recipe #2 hedgedoc, which ships committed tests/.)
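The optional per-recipe meta file from the first bullet could be loaded roughly as follows. This is a sketch, not the conftest code: load_recipe_meta is a hypothetical name and the DEFAULTS values are placeholders, since the real defaults are not quoted in this journal.

```python
import importlib.util
from pathlib import Path

# Placeholder defaults (assumption); only the four key names come from the journal.
DEFAULTS = {"HEALTH_PATH": "/", "HEALTH_OK": (200,), "DEPLOY_TIMEOUT": 300, "HTTP_TIMEOUT": 120}


def load_recipe_meta(tests_root, recipe):
    """Merge tests/<recipe>/recipe_meta.py (if present) over DEFAULTS."""
    meta = dict(DEFAULTS)
    path = Path(tests_root) / recipe / "recipe_meta.py"
    if not path.is_file():
        return meta  # the file is optional
    spec = importlib.util.spec_from_file_location(f"recipe_meta_{recipe}", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    for key in DEFAULTS:
        if hasattr(module, key):
            meta[key] = getattr(module, key)
    return meta
```

Because unknown recipes fall back to the defaults, enrolling a recipe only needs a new tests/<recipe>/ directory and no change to shared harness code.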
Next: mirror hedgedoc (postgres+hedgedoc, DB-backed) via the mirror+PR flow with a committed tests/ dir, write tests/hedgedoc/ (install/upgrade/backup + recipe_meta), run all stages + D4 green.
2026-05-27 — M6 (part 2): recipe #2 keycloak install green (DB-backed, no harness surgery)
Enrolled keycloak (recipe #2): keycloak 26.6.2 + mariadb 12.2 — genuinely DB-backed/multi-service
(vs custom-html stateless). Added only tests/keycloak/recipe_meta.py (HEALTH_PATH=/realms/master,
HEALTH_OK=(200,), 600s timeouts) + tests/keycloak/test_install.py (realm-endpoint health +
Playwright admin-console login). No change to runner/harness code — the recipe-agnostic harness
(per-recipe meta) handled it (D5 evidence).
Run: RECIPE=keycloak STAGES=install cc-ci-run runner/run_recipe_ci.py → 2 passed in 545s (keycloak
is slow: image pull + JVM + mariadb migration). Teardown clean (0 keyc-* services/volumes after).
Next: D4 demo via a mirror shipping committed tests/ (recipe-local run against live app); then keycloak upgrade + backup/restore (DB data survival via a realm marker through the admin API).
2026-05-27 — M6: D4 recipe-local discovery + recipe #2 enrolled (CLAIMED)
D4 recipe-local discovery working. Demo: pushed a committed tests/test_recipe_local.py to the
mirror on branch recipe-maintainers/custom-html@ci/d4-recipe-local; ran
RECIPE=custom-html SRC=recipe-maintainers/custom-html REF=ci/d4-recipe-local STAGES=install →
install 2 passed, then ===== STAGE: recipe-local (D4) ===== ran the recipe-shipped test against
the LIVE app (CCCI_BASE_URL) → 1 passed. Clean teardown (0 orphans).
Hard-won abra behaviour (DECISIONS.md): private mirror clone needs the bot token (per-command
http.extraHeader, not persisted/logged). abra commands (app ls, secret generate, version
resolution) silently git checkout <tag> the recipe, dropping a PR branch's files — so (1) all
harness abra calls use -C -o (chaos+offline = current checkout, no remote fetch), and (2) D4
snapshots the recipe's tests/ to a temp dir right after fetch (later abra cmds still reset it).
Traced the drop step-by-step: app_new ok, deploy ok, but secret generate (no flags) and app ls
each reset the checkout.
Recipe #2 = keycloak (keycloak + mariadb, DB-backed) install green with only
tests/keycloak/recipe_meta.py + test_install.py — no runner/harness change (D5). custom-html
remains 3-stage green (M5). docs/enroll-recipe.md written.
M6 CLAIMED. keycloak's full 3-stage (DB data survival via a realm marker) folds into M6.5. Next: M6.5 — keycloak upgrade/backup, then recipes 3–6 across the remaining D10 categories.
2026-05-27 — Trigger redesign (polling primary) + resource safety + M3 verified
Session restarted by watchdog (prior tmux died mid-turn with uncommitted bridge WIP). Re-oriented from STATUS + plan; two orchestrator design changes landed and are now implemented + verified.
(1) Trigger: POLLING PRIMARY, webhook optional, org-membership auth (plan §4.1/§1.5; commit
7addb96). Rewrote bridge/bridge.py: a poll thread (poll_loop, always-on, primary) scans each
POLL_REPOS repo's open PRs every 30s for new !testme; the /hook webhook stays as an optional
admin-registered push optimization. Both share an in-memory comment-id seen-set → a comment seen by
both fires once. First poll marks pre-existing comments seen (no startup re-fire). Authorization now
GET /orgs/{owner}/members/{user} (204=member, read-level) + optional AUTH_ALLOWLIST, replacing
the admin-requiring /collaborators/{user}/permission. Bot never self-registers webhooks.
- Verified org endpoint at read level (bot basic-auth): members/{autonomic-bot,trav,notplants} → 204; members/definitely-not-a-member-xyz → 404.
- Deployed (nixos-rebuild, deploy-bridge reconcile); new container logs: poller (primary) watching ['recipe-maintainers/cc-ci'] every 30s + (poll primary + optional webhook).
- End-to-end M3 trigger (poll path): posted !testme on PR #1 (comment 13705, by bot) → Drone build #26 appeared after 6s (latest was #25); bridge logged [poll] triggered build 26 for cc-ci@d397720a (PR #1, comment 13705) by autonomic-bot; bridge posted back cc-ci: started CI run for cc-ci @ d397720a → https://drone.ci.commoninternet.net/.... Satisfies D1 (<60s) over the read-only outbound path; no operator webhook whitelist needed.
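The fire-once behaviour shared by the poller and the webhook in (1) can be sketched as a lock-protected set of comment ids. The class and function names are hypothetical; only the behaviour (shared seen-set, first poll primes without firing) comes from the entry above.

```python
import threading


class SeenComments:
    """Thread-safe set of comment ids shared by the poll thread and the webhook."""

    def __init__(self):
        self._ids = set()
        self._lock = threading.Lock()

    def prime(self, comment_ids):
        """First poll: mark pre-existing comments as seen without firing."""
        with self._lock:
            self._ids.update(comment_ids)

    def claim(self, comment_id):
        """Return True exactly once per comment id; later callers get False."""
        with self._lock:
            if comment_id in self._ids:
                return False
            self._ids.add(comment_id)
            return True


def new_triggers(seen, comments, first_poll):
    """comments: iterable of (comment_id, body). Returns ids that should fire a build."""
    hits = [cid for cid, body in comments if body.strip() == "!testme"]
    if first_poll:
        seen.prime(hits)
        return []
    return [cid for cid in hits if seen.claim(cid)]
```

The webhook handler would call seen.claim(comment_id) before triggering, so whichever path sees a comment first wins and the other becomes a no-op. Since the set is in memory, a restart relies on the first-poll priming to avoid re-firing old comments.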
(2) Resource safety: bound live test apps (plan §4.2/§4.3; commit 72ff8e2). MAX_TESTS =
DRONE_RUNNER_CAPACITY = 1 (modules/drone-runner.nix) → Drone runs ≤1 build at once, queues the
rest natively. Per-build timeout = 60m, reconciled best-effort in modules/drone.nix
(PATCH /api/repos/.../cc-ci {"timeout":60}, non-fatal). Janitor remains the backstop for
SIGKILL'd/timed-out builds (reaps orphaned run apps at run-start before each deploy).
- Verified on host after rebuild: DRONE_RUNNER_CAPACITY=1; deploy-drone logged set cc-ci build timeout = 60m; Drone API confirms repo timeout: 60.
Gap noted (next item): .drone.yml still only has the self-test pipeline — a bridge-triggered
build runs the self-test, NOT runner/run_recipe_ci.py. M4/M5 ran the orchestrator by hand
(cc-ci-run). Need a recipe-CI pipeline keyed on the RECIPE build param (runs
cc-ci-run runner/run_recipe_ci.py with STAGES=install,upgrade,backup, CCCI_JANITOR_MAX_AGE=0,
concurrency:{limit:1}) to connect bridge→Drone→harness end-to-end (required for D2/D10 via real
!testme). Added to Build backlog.
M3 CLAIMED (gate). Trigger + auth + comment-back demoed live; the webhook-delivery blocker is moot now that polling is primary.
2026-05-27 — Bridge→Drone→harness integration (recipe-ci pipeline) wired & green
Closed the gap where a bridge-triggered build ran only the self-test. Split .drone.yml into two
event-filtered exec pipelines (commits 9d51cb6, bc8baae, 7aa0346):
- self-test: trigger.event: [push] (M2 sanity on pushes).
- recipe-ci: trigger.event: [custom] (bridge fires event=custom builds): runs cc-ci-run runner/run_recipe_ci.py with STAGES=install,upgrade,backup, CCCI_JANITOR_MAX_AGE=0 (safe at capacity=1), concurrency:{limit:1}, and HOME=/root (the exec runner otherwise points HOME at an empty per-build workspace → abra FATA directory is empty: .../.abra/servers).
Verified by triggering a custom build (RECIPE=custom-html, as the bridge does) via the Drone API:
- Build #31 got past abra app new (HOME fix) but failed at backup: abra app backup create … FATA … authentication required: Unauthorized. Backup/restore weren't passing -C -o, so abra fetched recipe tags from the (private) remote. Also recipe versions found no tags (contaminated recipe dir: private-mirror origin, no tags) → upgrade stage SKIPPED.
- Fixes: abra.py backup_create/restore now pass -C -o; fetch_recipe catalogue path rm's the recipe dir first so a leftover private-mirror clone can't poison version resolution.
- Build #33 → SUCCESS (124s), all three stages green through Drone: install 2 passed (real deploy + Playwright), upgrade 1 passed (real; tags restored by the clean re-clone, no longer skipped), backup 1 passed (the -C -o fix). Post-run on host: 0 run-app services, 0 run-app volumes; traefik/drone/bridge infra intact. Event filtering works (only recipe-ci ran, not self-test).
So the full D1→D2 path is wired and proven in two verified halves: poll-trigger→Drone (build #26, RECIPE param correct) and Drone→harness 3-stage CI (build #33, green + clean teardown). Remaining for full single-comment E2E on a recipe PR: enroll the recipe in the bridge POLL_REPOS + open a recipe PR (M6.5/M10 breadth work).
Adversary findings status (signal for re-test): A2 (janitor -pr filter) and A3 (teardown
verification + .env-last ordering) are both already fixed in the current code
(lifecycle.RUN_APP_RE hashed-scheme match; teardown_app _residual() raise + docker stack rm
fallback) — awaiting the Adversary's kill-probe re-test on an idle host. A4 (concurrent same-recipe
collision): its named root cause "no Drone concurrency cap (capacity=2)" is eliminated by
MAX_TESTS=capacity=1 — no concurrent runs possible on this single node, so the shared-recipe-dir race
can't occur. No Builder fix outstanding on findings; next milestone work is M6.5 breadth.
2026-05-27 — M6.5: keycloak full 3-stage GREEN through the Drone recipe-ci pipeline
Ran keycloak (DB-backed, SSO/identity category) end-to-end via the integrated recipe-ci pipeline
(triggered custom build #39, RECIPE=keycloak). Build #39 → success (~31m), all three stages
green as separate reported stages:
- install 2 passed (8m30s): test_realm_endpoint_healthy (/realms/master 200) + Playwright admin console login.
- upgrade 1 passed (10m10s): test_upgrade_preserves_realm; a realm marker written pre-upgrade survives the previous→latest upgrade (DB data survival).
- backup 1 passed (8m15s): test_backup_mutate_restore; backup→mutate→restore returns original.
Clean teardown verified on host: 0 keyc services, 0 keyc volumes. keycloak cold start is slow on this VM (Quarkus augmentation ~80s + Liquibase schema init), so each deploy is ~5-8m, which is why the run took ~31m; that is still well within the 60m build timeout. No harness surgery (D5): keycloak runs off tests/keycloak/{recipe_meta,test_install,test_upgrade,test_backup}.py + kc_admin.py only.
This both advances M6.5 (first DB-backed recipe full 3-stage) and confirms the recipe-ci integration works on a heavy DB-backed recipe (Drone→harness→3 stages→teardown). Next M6.5: enroll recipes 3–6 covering the remaining D10 categories (stateful-no-DB, multi-service+S3, large-volume, etc.).
2026-05-27 — M6.5: cryptpad (recipe #3) enrolled + full 3-stage green; fixed a real backup bug
Enrolled cryptpad (stateful, no external DB — the D10 "stateful/no-DB" category). No shared-harness
surgery beyond a generic feature: added per-recipe EXTRA_ENV (recipe_meta.py dict or
domain-callable) applied in deploy_app at every deploy path. cryptpad uses it for its required
distinct SANDBOX_DOMAIN (a sibling subdomain under the wildcard, so no cert work). Data-survival
tests write a marker into the backed-up cryptpad_data volume and read it via exec_in_app
(cryptpad's datastore isn't HTTP-served like custom-html).
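Sketch of the EXTRA_ENV idea described above (a dict or a domain-callable resolved per deploy). This is illustrative only and not the harness's code: `resolve_extra_env`, `merge_env`, and `cryptpad_extra_env` are hypothetical names, the way overrides are merged into the .env text is an assumption, and the `-sandbox` naming scheme is made up for the example.

```python
from typing import Callable, Mapping, Optional, Union

# EXTRA_ENV may be a plain dict or a callable that receives the app domain.
ExtraEnv = Union[Mapping[str, str], Callable[[str], Mapping[str, str]]]


def resolve_extra_env(extra_env: Optional[ExtraEnv], domain: str) -> dict[str, str]:
    """Return the env overrides for one deploy of `domain`."""
    if extra_env is None:
        return {}
    if callable(extra_env):
        return dict(extra_env(domain))
    return dict(extra_env)


def merge_env(env_text: str, overrides: Mapping[str, str]) -> str:
    """Rewrite KEY=value lines for overridden keys; append keys not present."""
    remaining = dict(overrides)
    out = []
    for line in env_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_assignment = "=" in line and not line.lstrip().startswith("#")
        if is_assignment and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


def cryptpad_extra_env(domain: str) -> dict[str, str]:
    # Hypothetical naming scheme: suffix the first DNS label so the sandbox
    # host stays a sibling of the app host under the same wildcard.
    label, _, parent = domain.partition(".")
    return {"SANDBOX_DOMAIN": f"{label}-sandbox.{parent}"}
```

A recipe needing a static override would pass a dict; one whose value depends on the per-run domain would pass a callable, so the shared deploy path never needs recipe-specific branches.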
Host runs (HOME=/root, cc-ci-run): install 2 passed (~2m; http 200 + Playwright loads cryptpad), upgrade 1 passed (~1m; marker survives previous→current), backup 1 passed after a fix (below). Clean teardown (0 cryp services/volumes).
Real bug found+fixed — backups were silently mis-wired (set_env newline). cryptpad backup first
failed: abra app backup create → backup-bot-two's /usr/bin/backup raised
KeyError: 'RESTIC_REPOSITORY'. Root cause: backup-bot-two's .env.sample ends with a newline-less
comment line, and the reconcile's set_env did a bare printf >> .env, gluing
RESTIC_REPOSITORY=/backups/restic onto that comment → commented out. abra --debug confirmed the
backupbot env map lacked RESTIC_REPOSITORY, and docker exec backupbot printenv RESTIC_REPOSITORY
was empty. Fix: set_env now ensures a trailing newline before appending (modules/backupbot.nix +
modules/drone.nix, same latent bug). After rebuild: .env has a clean RESTIC_REPOSITORY= line, the
backupbot container has RESTIC_REPOSITORY=/backups/restic, and cryptpad backup→mutate→restore
passes. NOTE: keycloak backup (build #39) passed off an earlier, non-corrupted backupbot deploy;
worth a re-verify, but the mechanism is now correct/reproducible. Triggered Drone build #46 (cryptpad)
as the canonical recipe-ci run.
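Sketch reproducing the set_env newline bug and its fix. The real set_env is a shell snippet inside the Nix modules; this Python version is a swapped-in illustration of the same logic, and the sample file content and helper names (`set_env_naive`, `parse_env`, `demo`) are made up for the example.

```python
import tempfile
from pathlib import Path


def set_env_naive(env_file: Path, key: str, value: str) -> None:
    """The buggy behaviour: append without checking how the file ends."""
    with env_file.open("a", encoding="utf-8") as fh:
        fh.write(f"{key}={value}\n")


def set_env(env_file: Path, key: str, value: str) -> None:
    """Append KEY=value, terminating an unterminated last line first."""
    existing = env_file.read_text(encoding="utf-8") if env_file.exists() else ""
    with env_file.open("a", encoding="utf-8") as fh:
        if existing and not existing.endswith("\n"):
            fh.write("\n")
        fh.write(f"{key}={value}\n")


def parse_env(env_file: Path) -> dict[str, str]:
    """Minimal .env reader: ignores blank lines and # comments."""
    result = {}
    for line in env_file.read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        key, value = stripped.split("=", 1)
        result[key] = value
    return result


def demo() -> tuple[dict[str, str], dict[str, str]]:
    """Run both variants on a sample that ends in a newline-less comment."""
    sample = "EXAMPLE_KEY=1\n# trailing comment without newline"
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        for name, setter in (("naive.env", set_env_naive), ("fixed.env", set_env)):
            path = Path(tmp) / name
            path.write_text(sample, encoding="utf-8")
            setter(path, "RESTIC_REPOSITORY", "/backups/restic")
            results.append(parse_env(path))
    return results[0], results[1]
```

In the naive variant the appended assignment lands on the comment line and disappears from the parsed map, which matches the KeyError seen in the backup container; the fixed variant keeps it on its own line.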
2026-05-27 — M6.5: matrix-synapse (recipe #4, DB+media/large-volume) full 3-stage green
Enrolled matrix-synapse (synapse app + postgres db + nginx web) — the large-volume/DB+media
D10 category. No harness surgery (server_name = DOMAIN; no EXTRA_ENV needed). Host runs (cc-ci-run):
install 2 passed (~2.7m; client API 200 + real /_matrix/client/versions JSON), upgrade
1 passed (~2.3m; postgres marker survives previous→current), backup 1 passed (~1.5m). Clean
teardown (0 matr services). The data-survival tests use a ci_marker postgres row exec'd via
psql in the db service — this exercises the recipe's real DB-dump backup hook
(backupbot.backup.pre-hook=/pg_backup.sh backup / restore.post-hook), the meaningful matrix data
path (not a plain volume copy). Worked first try (the set_env/RESTIC fix holds for hook-based
backups too). Triggering the canonical Drone recipe-ci run.
4 of 6 D10 recipes now green: custom-html (simple), keycloak (SSO/DB), cryptpad (stateful/no-DB), matrix-synapse (DB+media/large-volume). Remaining categories: multi-service+S3 (lasuite-docs) and TLS-passthrough (bluesky-pds).
2026-05-27 — M6.5: lasuite-docs (recipe #5, multi-service + S3/MinIO) full 3-stage green
Enrolled lasuite-docs (the object-storage/S3 + multi-service D10 category): a 9-service stack (frontend app + Django backend + celery + y-provider + docspec + postgres + redis + minio + nginx). Host runs (cc-ci-run): install 2 passed (~2.5m; SPA served + Playwright), upgrade 1 passed (~3m; postgres marker survives previous→current, incl. cold-pulling the older images), backup 1 passed (~2.3m; pg_backup.sh dump/restore). Clean teardown.
Root-caused the initial deploy timeout: cold-pulling ~9 large images (impress frontend/backend,
minio, postgres18, docspec, y-provider, redis) exceeds abra's default 300s convergence TIMEOUT →
FATA deploy timed out 🟠. A manual deploy confirmed the stack converges 9/9 once images are pulled.
Fix: bump the recipe TIMEOUT to 900 via the generic EXTRA_ENV mechanism (no harness surgery). OIDC is
config-only (Django manage.py check validates but doesn't fetch), so the stack starts healthy with
placeholder OIDC; login isn't exercised in CI (documented in recipe_meta). Data-survival uses a
postgres marker (docs/docs) via the pg_backup hook.
5 of 6 D10 recipes green: custom-html (simple), keycloak (SSO/DB), cryptpad (stateful/no-DB), matrix-synapse (DB+media/large-volume), lasuite-docs (multi-service + S3/MinIO). Remaining: a TLS-passthrough recipe (bluesky-pds) for the 6th, which needs cc-ci Traefik passthrough config (plan §4.0 caveat) — the hardest infra-wise.
2026-05-27 — M6.5 COMPLETE: n8n (recipe #6) full 3-stage green — all 6 D10 recipes done
Enrolled n8n (workflow automation; single app service, stateful via the /home/node/.n8n volume,
normal terminate-at-Traefik). Host runs: install 2 passed (~3.8m; /healthz 200 + Playwright
editor), upgrade 1 passed (~1.3m; marker in /home/node/.n8n survives), backup 1 passed
(~0.8m; backupbot.backup.path file backup). Clean teardown. (Caught a sync gap first: committed the
tests but forgot to tar tests/n8n to the host → run skipped "no stage test files"; synced + re-ran.)
n8n is recipe #6 in place of bluesky-pds (TLS-passthrough), swapped per DECISIONS (caddy self-ACME conflicts with cc-ci's no-ACME/static-wildcard design).
All 6 D10 recipes now have a full 3-stage green run (host):
- custom-html — simple/stateless
- keycloak — SSO/identity + DB (Drone #39)
- cryptpad — stateful/no-DB (Drone #46)
- matrix-synapse — DB+media/large-volume (Drone #51)
- lasuite-docs — multi-service + S3/MinIO/object-storage (Drone #57)
- n8n — workflow automation (Drone canonical run triggering now)
All 5 required D10 categories covered. Triggering the n8n canonical Drone run, then claiming the M6.5 gate.
2026-05-27 — M8/D7: results dashboard live (overview + badges)
Built the results dashboard (dashboard/dashboard.py + modules/dashboard.nix): a stdlib HTTP service (Nix-built OCI image, swarm service on proxy, reconcile oneshot like bridge/drone) that polls the Drone API for recipe-CI builds (event=custom), groups latest-run-per-recipe, and renders a YunoHost-CI-like overview at ci.commoninternet.net/ with pass/fail/running badges, last ref, when, and a link to the canonical Drone run. It also serves embeddable per-recipe SVG badges under /badge/ (for example /badge/keycloak.svg).
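Sketch of the latest-run-per-recipe grouping. This is not dashboard.py itself: the build-dict field names (`number`, `event`, `status`, `params`) are assumptions about the shape of the Drone build listing, and `latest_per_recipe` and its `exclude` parameter are hypothetical.

```python
from typing import Any, Iterable


def latest_per_recipe(
    builds: Iterable[dict[str, Any]],
    exclude: frozenset[str] = frozenset(),
) -> dict[str, dict[str, Any]]:
    """Keep, per RECIPE param, the custom-event build with the highest number."""
    latest: dict[str, dict[str, Any]] = {}
    for build in builds:
        if build.get("event") != "custom":
            continue  # pushes run the self-test pipeline, not recipe CI
        recipe = (build.get("params") or {}).get("RECIPE")
        if not recipe or recipe in exclude:
            continue
        current = latest.get(recipe)
        if current is None or build["number"] > current["number"]:
            latest[recipe] = build
    return latest
```

Keying on the highest build number rather than list order means the result does not depend on how the API happens to sort its response.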
Verified live via the public gateway: overview lists exactly the 6 enrolled recipes (cryptpad,
custom-html, keycloak, lasuite-docs, matrix-synapse, n8n) each success; /badge/keycloak.svg →
200 image/svg+xml; /healthz → 200; /hook still routes to the bridge (200) — the bridge's
Host && PathPrefix(/hook) rule keeps priority over the dashboard's Host-only rule.
Two fixes en route: (1) filter out the cc-ci repo's own name as a recipe row (Adversary !testme on
the cc-ci PR showed a spurious cc-ci=failure); (2) content-hash image tag — a fixed :latest
tag + unchanged stack spec does NOT roll the swarm service on a code change, so the tag is now
derived from a hash of dashboard.py → docker stack deploy rolls reliably (reproducible/self-heal).
NOTE: the bridge image has the same latent :latest issue (only rolled this session because its
.nix env also changed) — worth the same content-tag treatment (backlog).
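Sketch of the content-hash tag idea: derive the image tag from the service source so any code change yields a new image reference and the stack deploy sees a changed spec. The real derivation lives in the Nix module; the hash algorithm, tag length, and function names here are assumptions.

```python
import hashlib


def content_tag(source: bytes, length: int = 12) -> str:
    """Short, deterministic tag derived from the source bytes."""
    return hashlib.sha256(source).hexdigest()[:length]


def image_ref(name: str, source: bytes) -> str:
    """Image reference that changes whenever the source changes."""
    return f"{name}:{content_tag(source)}"
```

With a fixed :latest tag the reference is identical before and after a code change, so the service spec never differs; with a content tag the reference itself carries the change.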
Remaining M8 piece: PR-comment outcome reflection — the bridge posts the start/run-link comment but doesn't yet update it with the final pass/fail (needs a Drone build-completion hook or the bridge polling build status). Overview + badges (the core of D7) are done.
2026-05-27 — M8/D7 complete: PR-comment outcome reflection + gate claim
Added outcome reflection to the bridge: after triggering, a daemon watcher polls the Drone build to completion and edits the run-link PR comment to ✅ passed / ❌ (Gitea PATCH issues/comments/{id}). Gave the bridge image a content-hash tag so the swarm service actually rolls on bridge.py changes (same latent :latest no-roll issue the dashboard had).
Verified end-to-end: posted a fresh !testme on PR #1 → poller fired → "started" comment posted →
build #76 (RECIPE=cc-ci, fails fast: no tests/cc-ci) → within ~20s the same comment was edited to
cc-ci: run for cc-ci @ d397720a ❌ failure → …/76. The pass/fail now mirrors onto the PR comment.
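Sketch of the outcome-reflection loop: poll a build until it reaches a terminal status, then format the edited comment. This is not bridge.py: the status getter is injected so no HTTP calls are shown, the terminal status set is an assumption about Drone's status names, and `wait_for_outcome` / `outcome_comment` are hypothetical helpers. The comment layout follows the example quoted above.

```python
import time
from typing import Callable

# Assumed set of Drone statuses after which a build no longer changes.
TERMINAL = {"success", "failure", "error", "killed"}


def wait_for_outcome(
    get_status: Callable[[], str],
    interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it is terminal; return 'timeout' on deadline."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            return "timeout"
        sleep(interval)


def outcome_comment(recipe: str, ref: str, status: str, run_url: str) -> str:
    """Body for the edited PR comment once the build has finished."""
    mark = "✅ passed" if status == "success" else f"❌ {status}"
    return f"cc-ci: run for {recipe} @ {ref} {mark} → {run_url}"
```

Running this in a daemon thread per triggered build keeps the poller free to pick up the next !testme while earlier builds are still in flight.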
D7 fully met: per-run logs (Drone UI) + overview page with badges (dashboard, live) + PR comment links back AND reflects the outcome. Claiming the M8 gate.
2026-05-27 — M10/D10: real !testme path proven on custom-html; enrolling the breadth set
Wired the real-PR path end-to-end and proved it on custom-html. !testme on
recipe-maintainers/custom-html#2 → bridge poller fired → recipe-ci build (SRC=mirror, REF=PR head
db9a9502) → build #84 success, all 3 stages green (install 2✓, upgrade 1✓ — now runs for real,
backup 1✓) → bridge comment edited to ✅ passed. Clean teardown.
Three fixes to make the real-PR path exercise the upgrade stage (mirror PR clones carry no tags):
- fetch_recipe (SRC+REF) read-only fetches the published version tags from the PUBLIC upstream (git fetch <upstream> refs/tags/*:refs/tags/*; a bare --tags errored "no remote HEAD"). It is plain git and never pushes to the mirror (guardrail-safe).
- abra.upgrade now passes -o (offline). It was 401'ing trying to fetch tags from the private mirror origin; offline uses the local (upstream-populated) tags.
- (earlier) backup/restore already pass -C -o.
Now firing !testme on the other recipes' open PRs (keycloak#1, matrix-synapse#1, lasuite-docs#1, n8n#1); they queue at MAX_TESTS=1. cryptpad has no open PR → opening one next.
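Sketch of the read-only upstream tag fetch from the first fix above. The refspec is the one quoted in the journal; the function names and the way the upstream URL is supplied are hypothetical, and this is not the body of fetch_recipe.

```python
import subprocess


def tag_fetch_cmd(upstream_url: str) -> list[str]:
    """argv for fetching only tags from an explicit URL via a refspec."""
    return ["git", "fetch", upstream_url, "refs/tags/*:refs/tags/*"]


def fetch_upstream_tags(repo_dir: str, upstream_url: str) -> None:
    """Populate local tags in repo_dir from the public upstream.

    Only fetches; nothing is pushed and the mirror remote is not touched.
    """
    subprocess.run(tag_fetch_cmd(upstream_url), cwd=repo_dir, check=True)
```

Passing the URL directly, rather than adding a remote, leaves the clone's remote configuration unchanged, so later offline abra calls see the tags without any new origin to authenticate against.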
2026-05-27 — M10/D10: real !testme breadth runs — 5/6 green, lasuite-docs upgrade retry
Fired !testme on all 6 recipe PRs (capacity=1, sequential). Results (real PR-triggered, full 3-stage):
- custom-html #84 ✅ (PR head db9a9502)
- keycloak #86 ✅ (DB realm marker survives upgrade)
- matrix-synapse #87 ✅ (postgres marker, pg_backup hook)
- n8n #89 ✅
- cryptpad #90 ✅ (test PR #2 opened via Gitea API: branch ci/testme + .ci-testme marker)
- lasuite-docs #88 ❌: install ✅ + backup ✅, but UPGRADE failed: abra app upgrade … -o → FATA deploy failed (a convergence failure during the 9-service rolling upgrade prev→latest, not a timeout). It PASSED on the host/catalogue run, and ran right after the heavy matrix build, so likely transient resource contention. Re-fired !testme on lasuite-docs#1 to test transient-vs-persistent.
So the real-!testme path + the upgrade fixes (upstream tags + upgrade -o) work across simple, DB,
DB+media, workflow, and stateful recipes. lasuite-docs (the object-storage/S3 category, required)
needs its upgrade to pass on the real path for the 6/6 D10 proof.
2026-05-27 — M10: 5/6 real-!testme green; lasuite-docs blocked on Docker Hub rate limit (A1)
lasuite-docs #88/#92 upgrade failed "deploy failed" → diagnosed: node disk at 90% (2.7G free) — a
9-service rolling upgrade couldn't converge. Pruned 30 unused images (reclaimed 12GB → 15G free).
Retry #93: got further (5/8 services up) but redis task Rejected "No such image: redis:8.2.6" →
docker pull redis:8.2.6 on the node = toomanyrequests: unauthenticated pull rate limit. So the
prune fixed disk but forced re-pulls that hit Docker Hub's anonymous limit (A1 registry-creds
finding, §1.5/§4.4). Recorded in STATUS ## Blocked + DECISIONS; surfaced to operator (provide Docker
Hub creds). 5/6 recipes green via real !testme; lasuite install+backup green, upgrade gated.
Pivoting to M9 (docs/reproducibility, unblocked) while the limit resets / creds arrive.
2026-05-27 — lasuite quota-window retry insufficient; halting retries pending creds (3rd attempt)
Re-fired lasuite-docs !testme during the apparently-eased window (#96). The cached image redis:8.2.6
gave "up to date", but the LATEST version's uncached redis:8.6.3 → toomanyrequests again. So the
anonymous quota isn't reset enough for a full 9-service × 2-version deploy. Cancelled #96 + tore down
clean. This is the 3rd confirmation the blocker is the Docker Hub rate limit. Per anti-thrash:
halting lasuite retries until the operator provides Docker Hub creds (A1, STATUS ## Blocked).
5/6 D10 recipes remain green via real !testme. Pivoting to M9 (docs/reproducibility) — fully
unblocked, no image pulls.
2026-05-27 — M10/D10 BUILDER-COMPLETE: all 6 recipes green via real !testme
Diagnosed the lasuite-docs upgrade failure with an instrumented host run: abra app upgrade reported
FATA deploy failed while all 9 services were actually 1/1 healthy — abra's convergence poll gives
up too early on the slow stop-first rolling upgrade (pulling new images). Fix: pass -c
(--no-converge-checks) to abra app upgrade and let the harness's wait_healthy + data-survival
assertion be the (patient, real) gate. (Also: /root/cc-ci was stale and is now fully synced; the first diag
run hit the old missing -o auth error, which masked this.)
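Sketch of a patient health gate of the kind described above: rather than trusting the deploy command's own convergence verdict, poll replica counts until every service is at its desired count or a generous deadline passes. This is not the harness's wait_healthy; the replica getter is injected and the names, signature, and defaults are assumptions.

```python
import time
from typing import Callable, Mapping

# service name -> (running replicas, desired replicas)
Replicas = Mapping[str, tuple[int, int]]


def all_converged(replicas: Replicas) -> bool:
    """True when there is at least one service and each is at its desired count."""
    return bool(replicas) and all(
        desired > 0 and running == desired for running, desired in replicas.values()
    )


def wait_healthy(
    get_replicas: Callable[[], Replicas],
    timeout: float = 900.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until all services converge; False if the deadline passes first."""
    deadline = clock() + timeout
    while True:
        if all_converged(get_replicas()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A gate like this only reports whether the stack is up; the data-survival assertion that follows it is what confirms the upgrade actually preserved state.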
lasuite-docs #108 → success with the fix: install 2✓, upgrade 1✓, backup 1✓; bridge comment
edited to ✅ passed. So all 6 D10 recipes are green via REAL !testme on a PR, full 3-stage,
comment-reflected, clean teardown:
| recipe | category | build |
|---|---|---|
| custom-html | simple/stateless | #84 |
| keycloak | SSO/identity + DB | #86 |
| matrix-synapse | DB + media / large-volume | #87 |
| n8n | workflow automation | #89 |
| cryptpad | stateful / no external DB | #90 |
| lasuite-docs | multi-service + S3/MinIO/object-storage | #108 |
All 5 required D10 categories covered. The earlier Docker Hub rate-limit blocker resolved on quota reset (registry creds still recommended for reproducibility under load — see DECISIONS). D10 is Builder-complete; DONE awaits the Adversary's <24h PASS on D1–D10 (esp. independent D10 verification).
2026-05-27 — D10 PASS (6/6); DONE now blocked only on D8 live VM rebuild (Adversary)
Adversary logged D10 PASS (6/6) @11:57Z (all 6 recipes green via real !testme, all categories) +
M8/D7 PASS, D8-core PASS, D9 PASS this window. No VETO, no open findings. Audited REVIEW: D1
(=M3), D2 (=M5/M6.5/D10), D3 (=M4), D4/D5 (=M6/M6.5), D6 (=M7), D7 (=M8), D9, D10 all have <24h PASS.
The sole remaining item is D8's full live blank-VM rebuild — the Adversary marked it "core PASS;
live rebuild pending — to complete before DONE," deferred on the premise that it needs operator
registry creds. That premise is now OBSOLETE (D10 went 6/6 with no creds; rate limit was transient;
real fix was -c). Surfaced in STATUS that the VM rebuild is feasible now without creds. NOT writing
## DONE until REVIEW shows a full D8 PASS (per §6.1 — every D1–D10 needs a <24h PASS). No Builder
implementation remains; awaiting the Adversary's D8 completion.
2026-05-27 — ## DONE
Adversary logged D8 PASS (byte-identical nixos-rebuild build closure == /run/current-system, plus the documented alternative: a throwaway-VM live rebuild is infeasible by design, since its host key isn't a sops recipient and the A1 preconditions bind to cc-ci itself) and the DONE-readiness table: all D1–D10 PASS <24h, no VETO, explicitly: "the DONE handshake (§6.1) is CLEARED — Builder may flip STATUS → DONE."
Wrote ## DONE to STATUS.md with the D1–D10 evidence table. Every Definition-of-Done item is
independently Adversary-verified. Stopping the self-paced loop (no further ScheduleWakeup).
Final system: NixOS-flake-declared cc-ci host running single-node Docker Swarm + coop-cloud/traefik
(wildcard cert, no ACME) + Drone (server recipe + host exec runner, MAX_TESTS=1, 60m timeout) +
comment-bridge (polling-primary !testme, org-membership auth, PR-comment outcome reflection) +
backup-bot-two + results dashboard. !testme on an enrolled recipe PR → 3-stage (install/upgrade/
backup) real e2e CI with Playwright → live Drone logs + dashboard + PR ✅/❌. Six recipes proven.