## DONE — all D1-D10 Adversary-PASS <24h, no VETO, handshake cleared

cc-ci recipe CI server complete. Loop stopped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-27 12:02:03 +01:00

JOURNAL — cc-ci Builder (append-only)

2026-05-26 — Bootstrap (§1)

Access verification (all pass):

ssh cc-ci 'hostname && whoami && nixos-version' → nixos / root / 24.11.719113.50ab793786d9 (Vicuna)
curl https://git.autonomic.zone/api/v1/version → {"version":"1.24.2"}
Gitea bot auth (curl -u $GITEA_USERNAME:$GITEA_PASSWORD .../api/v1/user) → login: autonomic-bot, id 64
getent hosts probe-$RANDOM.ci.commoninternet.net → 143.244.213.108 (the gateway IP, as expected — TLS passthrough)
Cert present: ls /var/lib/ci-certs/live/ → fullchain.pem (2909 b), privkey.pem (227 b, mode 640)
recipe-maintainers org exists (private); recipe-maintainers/cc-ci → 404 (created below)
Mirrored recipes already present: bluesky-pds, lasuite-docs, custom-html, custom-html-tiny, n8n, keycloak, lasuite-meet, matrix-synapse, cryptpad

Baseline (docs/baseline.md): fresh NixOS 24.11 Incus VM, 2 vCPU, 3.5 GiB RAM, 8.9 GiB disk (3.8 GiB free). No docker/swarm/abra. Channel-based /etc/nixos/configuration.nix (no flake).

Actions:

Created repo recipe-maintainers/cc-ci (private) via Gitea API.
git init in /srv/cc-ci/cc-ci; credential helper reads creds from /srv/cc-ci/.testenv (no secrets stored in git config).
Seeded skeleton layout (§3) + loop-state files + docs/baseline.md.

Next: commit + push bootstrap, then M0 (flake + base config + sops test secret).

2026-05-26 — M0: flake + base config rebuilt from repo

Authored flake.nix (pins nixpkgs rev 50ab793786d9…, the exact rev cc-ci ran), hosts/cc-ci/hardware.nix (incus VM module + cloud-init + DHCP/nameservers) and hosts/cc-ci/configuration.nix (faithful baseline repro: tailscale w/ hardcoded --hostname=cc-nix-test since builtins.readFile /etc/ts-hostname is impure under flakes; sshd root; firewall trust tailscale0 + tcp/22; base pkgs).

Disk/inode hiccup → resolved: first nix flake lock/build hit No space left on device — diagnosed as inode exhaustion (df -i → 6005 free of 586336; old 8.9 GiB fs). Operator grew the VM to 28 GiB while I was measuring; ext4 auto-resized → 22 GiB free, 1.21M inodes free. Retried.

Build + switch (commands + output):

ssh cc-ci 'cd /root/cc-ci && nix flake lock && nixos-rebuild build --flake .#cc-ci' → BUILD EXIT 0, produced nixos-system-nixos-24.11.20250630.50ab793.
ssh cc-ci 'systemd-run --unit=ccci-rebuild --collect --property=Type=oneshot nixos-rebuild switch --flake /root/cc-ci#cc-ci' (detached so it survives ssh drop) → unit Result=success ExecMainStatus=0.

Gate verification:

systemctl is-system-running → running
readlink /run/current-system → …-nixos-system-nixos-24.11.20250630.50ab793 (gen 3, from flake)
systemctl is-active tailscaled → active; sshd.socket → active (sshd is socket-activated, so sshd.service reads inactive — live ssh proves it works)
systemctl --failed → none
nixos-rebuild list-generations → gen 3 current @20:23, prior channel gen 2 retained for rollback.

Known warning (tracked, non-blocking): incus module enables systemd.network while we keep networking.useDHCP=true (scripted dhcpcd); Nix warns both may manage interfaces. Inherited from baseline; networking is up. Clean up by choosing one stack later.

Deploy mechanism settled (DECISIONS.md): switch --flake on-host, repo synced via tar | ssh.

Next: sops-nix wiring (host age key from ssh host key + a decrypt-a-test-secret proof), then CLAIM the M0 gate for the Adversary.

2026-05-26 — M0: sops-nix wiring + decrypt-a-test-secret (M0 COMPLETE, gate CLAIMED)

Keys:

Host age recipient from ssh host key: ssh cc-ci 'nix run nixpkgs#ssh-to-age -- -i /etc/ssh/ssh_host_ed25519_key.pub' → age1h90utdztfc23kx8ewrtrtk80mnddvrf8pg4ppej55rwwwupzhfvqhmp3qa.
Master recovery key generated on host (age-keygen), public age1cmk26t…; private moved off-box to /srv/cc-ci/.sops/master-age.txt (mode 600) and shred-ded from the host. Never in repo.

Files: .sops.yaml (both recipients, rule secrets/.*\.(yaml|json|env)$); modules/secrets.nix (sops.age.sshKeyPaths=[/etc/ssh/ssh_host_ed25519_key], secrets.test_secret={}); flake gains sops-nix input + sops-nix.nixosModules.sops; configuration.nix imports the module.

sops-nix version pin (dead-end avoided): master sops-nix wants buildGo125Module (Go 1.25), absent in pinned nixpkgs 24.11 → eval error. Pinned sops-nix to 77c423a… (2025-06-17, last using plain buildGoModule). Verified the file at that rev uses buildGoModule. Build then OK.

Encrypt test secret: on host, printf 'test_secret: cc-ci-m0-<rand>' > secrets/secrets.yaml then nix run nixpkgs#sops -- --encrypt --in-place secrets/secrets.yaml (run inside repo so .sops.yaml resolves) → rc=0, two age recipients in the file.

Build + switch (commands + output):

nixos-rebuild build --flake .#cc-ci → BUILD EXIT 0 (built sops-install-secrets w/ Go 1.23.8).
systemd-run --unit=ccci-rebuild2 ... nixos-rebuild switch --flake /root/cc-ci#cc-ci → Result=success ExecMainStatus=0.

Gate verification (M0):

systemctl is-system-running → running; systemctl --failed → none.
ls -la /run/secrets/test_secret → -r-------- 1 root root 41 ; stat → root:root 400.
head -c9 → cc-ci-m0- (matches generated value), wc -c → 41 (9 + 32 hex). Decrypt path proven.
Pulled encrypted secrets/secrets.yaml + flake.lock back to clone; grep cc-ci-m0 secrets.yaml → no plaintext leak; lock inputs = nixpkgs, sops-nix.

Gate handshake: set Gate: M0 — CLAIMED, awaiting Adversary in STATUS.md. REVIEW.md still empty (no Adversary activity yet). Per §6.1 liveness I won't idle-block: I keep M0 claimed and proceed with M1 (independent infra build), without advancing to M2 until M0 shows PASS.

Next: M1 — Docker + single-node swarm via Nix (modules/swarm.nix), then Traefik (file provider → /var/lib/ci-certs/live/) + abra, then a by-hand HTTPS deploy/teardown of a trivial recipe.

2026-05-26 — M1: Docker + single-node swarm via Nix

modules/swarm.nix: virtualisation.docker.enable + daily autoprune (--all --volumes until=24h to protect the 28 GiB root), docker in systemPackages, and a swarm-init oneshot (docker swarm init --advertise-addr 127.0.0.1 if not active; docker network create --driver overlay --attachable proxy if absent). Imported into configuration.nix.

Build + switch: nixos-rebuild build --flake .#cc-ci → EXIT 0; systemd-run … switch → Result=success.

Verify (commands + output):

systemctl show swarm-init -p Result → Result=success
docker info --format ... → Swarm=active Managers=1 Nodes=1
docker network ls --filter name=proxy → proxy overlay swarm
systemctl is-system-running → running; --failed → none.

Next: Traefik as a swarm stack (Nix-declared compose + docker stack deploy oneshot): docker swarm provider + file provider serving /var/lib/ci-certs/live/{fullchain,privkey}.pem on :443, attached to proxy. Then abra install + by-hand HTTPS deploy/teardown of a trivial recipe (M1 gate). Rationale for swarm-service Traefik over a host services.traefik: a host process isn't on the proxy overlay, so it can't reach swarm service VIPs; coop-cloud recipes assume an on-proxy Traefik watching swarm labels.

2026-05-26 — M1: Traefik swarm stack + HTTPS path proven

modules/traefik.nix: Traefik v3.3 as a swarm service on proxy (so it reaches recipe VIPs). Config via Nix writeText store files bind-mounted into the container (real files, not /etc symlinks): static traefik.yml (entrypoints web/websecure; providers.swarm unix socket, exposedByDefault=false, network=proxy; providers.file dir /etc/traefik/dynamic; ping; no dashboard) and dynamic certs.yml (wildcard at /var/lib/ci-certs/live/* as stores.default.defaultCertificate + certificates, so any *.ci.commoninternet.net router with tls=true is covered, no ACME). Deployed by a traefik-deploy oneshot (docker stack deploy) after swarm-init. Opened firewall 80/443 (gateway forwards over enp5s0).

Build + switch: build EXIT 0; switch Result=success; traefik-deploy Result=success; docker service ls → traefik_traefik traefik:v3.3 1/1.

Verify (commands + output):

Local: curl -ksv -H 'Host: probe-test.ci.commoninternet.net' https://localhost/ → subject: CN=*.ci.commoninternet.net, issuer: …Let's Encrypt; CN=E8, TLSv1.3, HTTP 404.
End-to-end via gateway: curl -ksv --resolve probe-test.ci.commoninternet.net:443:143.244.213.108 https://probe-test.ci.commoninternet.net/ → Connected to …(143.244.213.108) port 443, same wildcard cert, HTTP 404. Confirms gateway SNI-passthrough → cc-ci Traefik TLS termination. 404 is correct (no router for that host yet).

Next: install abra (M1 last task), abra app new a trivial recipe (custom-html) → deploy → reach over HTTPS at .ci.commoninternet.net → teardown leaving no volumes. That completes M1 → CLAIM M1 gate.

2026-05-26 — M1: proxy pivot to real coop-cloud/traefik via abra; recipe deploy/teardown (M1 CLAIMED)

Orchestrator decision (mid-M1): replace the hand-rolled Traefik with the canonical Co-op Cloud traefik recipe deployed via abra, wildcard/file-provider mode, no ACME/token. Removed custom modules/traefik.nix; moved firewall 80/443 into modules/swarm.nix. Recorded in DECISIONS.md.

Why the pivot also fixed a real bug: my custom Traefik used entrypoint websecure; coop-cloud recipes label entrypoints=web-secure. While chasing that I also hit a sharp systemd-run gotcha: systemd-run … nixos-rebuild switch --flake .#cc-ci runs with cwd /, so .# → / → "could not find a flake.nix"; the switch silently failed, while a systemctl show run after --collect returned a stale Result=success. Fix: always use the absolute flake path /root/cc-ci#cc-ci, and read the result before resetting. (rebuild6/7 had silently not applied; rebuild2–5 used the absolute path.)

abra packaged (modules/abra.nix): release binary 0.13.0-beta, pinned by sha256, autoPatchelf'd. abra --version → 0.13.0-beta-06a57de.

scripts/deploy-proxy.sh (idempotent, pure-bash — host has no python3): ensure local abra server, fetch traefik, write wildcard/no-ACME env (WILDCARDS_ENABLED=1, SECRET_WILDCARD_*_VERSION=v1, COMPOSE_FILE=compose.yml:compose.wildcard.yml, LETS_ENCRYPT_ENV= empty), insert cert secrets via abra app secret insert … -f from /var/lib/ci-certs/live, deploy. Bugs fixed en route: multi-line PEM must use -f (not arg); secret-presence must check docker secret ls (abra's recipe list always shows the name with created on server:false).

Traefik deploy: abra app deploy → deploy succeeded 🟢 (traefik v3.6.15 + socket-proxy). Verify: docker service ls → app+socket-proxy 1/1; via gateway curl --resolve probe.*:443:143.244.213.108 → CN=*.ci.commoninternet.net (LE E8); 0 ACME log lines.

M1 gate (recipe over HTTPS + teardown):

abra app new custom-html -s default -D cchtml1.ci.commoninternet.net -S -n then set LETS_ENCRYPT_ENV= and abra app deploy -n -C → 🟢 (nginx 1.29.0).
curl -ks --resolve cchtml1.ci.commoninternet.net:443:143.244.213.108 https://…/ → http_code=200 size=615, served the nginx welcome page over HTTPS with the wildcard cert.
Teardown: abra app undeploy -n → 🟢; abra app volume remove -f -n → "1 volumes removed"; leak check → services 0 / volumes 0 / secrets 0 / containers 0. Clean.
Correct teardown syntax confirmed: secret remove <d> --all -n (not --all-secrets).

docs/install.md seeded (flake apply + deploy-proxy + verify). M1 gate CLAIMED in STATUS.md.

Next: M2 — Drone server + exec runner via Nix, Gitea OAuth app, hello-world .drone.yml green.

2026-05-26 — M2 start: CI engine decision + Gitea OAuth app + Drone secrets

Decision (DECISIONS.md): keep Drone per plan. nixpkgs 24.11 has drone server 2.24.0 but only the abandoned drone-runner-exec (unstable-2020) — accepted (stable RPC), Woodpecker is the documented fallback. Deploy shape mirrors traefik: server via coop-cloud drone recipe (abra, swarm, traefik-routed at drone.ci.commoninternet.net, no ACME), exec runner as a host Nix systemd service.

Recipe recon: coop-cloud drone recipe = drone/drone:2.26.0, secrets rpc_secret + CLIENT_SECRET (Gitea OAuth), Gitea SSO via compose.gitea.yml (GITEA_CLIENT_ID, GITEA_DOMAIN). Server env: DRONE_SERVER_HOST/PROTO, DRONE_USER_CREATE.

Done this tick:

Created Gitea OAuth app cc-ci-drone (bot): client_id ab4cdb9d-…, redirect https://drone.ci.commoninternet.net/login.
Generated DRONE_RPC_SECRET (openssl-equivalent /dev/urandom hex32) + stored client_secret; both added to secrets/secrets.yaml via sops set (needed SOPS_AGE_KEY from the host ssh key: ssh-to-age -private-key -i /etc/ssh/ssh_host_ed25519_key). Verified: decrypt shows keys test_secret/drone_rpc_secret/drone_gitea_client_secret; file stays encrypted (4× ENC).

Next: scripts/deploy-drone.sh (abra deploy of drone server w/ Gitea SSO + rpc/client secrets), modules/drone-runner.nix (exec runner systemd unit, rpc secret from sops), wire sops secrets for the runner, then push a hello-world .drone.yml and confirm a green build (M2 gate).

2026-05-26 — M2: Drone server + exec runner up; infra as idempotent-reconcile oneshots

Orchestrator steer (2×): collapse install to a single nixos-rebuild switch — convert the manual deploy scripts into idempotent-reconcile systemd oneshots (writeShellApplication, embedded in store; after swarm-init+docker; wants network-online; wantedBy multi-user; reconcile every activation/boot, NO run-once sentinel; fail visibly on missing cert). Applied to proxy + drone.

Refactor done:

modules/packages.nix: pkgs.abra overlay (shared pinned build).
modules/proxy.nix: deploy-proxy oneshot — reconciles coop-cloud traefik (wildcard/no-ACME).
modules/drone.nix: deploy-drone oneshot — reconciles coop-cloud drone (Gitea SSO, secrets from /run/secrets), after deploy-proxy.
modules/drone-runner.nix: exec runner (fixed PATH conflict via lib.mkForce; allowUnfree for drone-runner-exec — Polyform license).
modules/secrets.nix: declared drone_rpc_secret + drone_gitea_client_secret + a sops template drone-runner.env (DRONE_RPC_SECRET) as the runner's EnvironmentFile (shared secret).
Removed scripts/deploy-*.sh. install.md now = clone + nixos-rebuild switch + preconditions.

Build/switch: build EXIT 0 (shellcheck clean via writeShellApplication; runner pkg unfree-allowed). nixos-rebuild switch → all three units active/success:

deploy-proxy success (reconciled traefik), deploy-drone → deploy succeeded 🟢 (drone/drone 2.26.0, secrets client_secret+rpc_secret v1, drone_env config), drone-runner-exec active.

Verify (commands + output):

docker service ls → drone_ci_commoninternet_net_app 1/1, traefik app+socket-proxy 1/1.
Via gateway: …/healthz → 200; / → 303 (login redirect, correct).
Runner: journal shows a few startup cannot ping the remote server (404) (drone RPC not ready yet) then successfully pinged the remote server + polling the remote server capacity=2 endpoint=https://drone.ci.commoninternet.net kind=pipeline type=exec. Runner connected via RPC.

Remaining for M2 gate: push a hello-world .drone.yml to cc-ci + get a green build. Needs the cc-ci repo activated in Drone, which requires the bot's Gitea OAuth login (browser flow) to grant Drone a Gitea token (to sync repos + set the push webhook). Next tick: script the OAuth login to mint a Drone token, activate cc-ci, push .drone.yml, confirm green. (DRONE_USER_CREATE made autonomic-bot the admin.)

2026-05-26 — M2 GATE MET: green build via push (Drone + exec runner)

Drone↔Gitea OAuth (scripted, the one manual bootstrap): logged the bot into Gitea (CSRF cookie → form), drove Drone /login → Gitea authorize consent (POST /login/oauth/grant with _csrf+state+granted=true) → code callback → Drone _session_. Captured the whole flow in scripts/bootstrap-drone-oauth.sh (reads bot creds from env; documented in install.md §2; one-time, token persists in Drone's data volume).

Repo activation: GET /api/user → autonomic-bot admin=true; GET /api/user/repos?latest=true synced 12 repos; POST /api/repos/recipe-maintainers/cc-ci → active=true, config_path .drone.yml (sets the Gitea push webhook).

Green build: added .drone.yml (exec pipeline), pushed (0d89e28). Polled /api/repos/recipe-maintainers/cc-ci/builds → build #1 pending→running→success. Steps: clone success exit 0; hello success exit 0 — log shows whoami=root, abra 0.13.0-beta-06a57de, swarm=active (ran on the host via the exec runner). M2 gate met; CLAIMED.

Next: M3 — comment-bridge service: Gitea issue_comment webhook → verify HMAC + !testme exact + collaborator → resolve PR head repo/SHA → trigger a parameterized Drone build; post a PR comment with the run link. Need a Drone API token for the bridge (mint from the bot's Drone account).

2026-05-26 — M3 start: bridge secrets + comment-bridge source

Secrets (sops): minted a Gitea API token (cc-ci-bridge, scopes read:org/user, write:repo/issue), a Drone API token (POST /api/user/token, the stable personal token; rotates on call), and a webhook HMAC (urandom hex64). Stored as bridge_gitea_token / bridge_drone_token / bridge_webhook_hmac via sops set (host age identity). secrets.yaml now holds 6 secrets.

bridge/bridge.py (Python stdlib only, §4.1): POST /hook handler — verifies Gitea HMAC (X-Gitea-Signature sha256), requires X-Gitea-Event: issue_comment, action=created, body trimmed == !testme, issue is a PR; checks commenter is a collaborator (Gitea collaborators endpoint, 204); resolves PR head sha+repo; triggers a parameterized Drone build (POST /api/repos/<CI_REPO>/builds?branch=main&RECIPE&REF&PR&SRC, custom params → pipeline env); posts a PR comment linking the run. Secrets read from mounted files; config via env. /healthz GET.
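The signature and trigger checks described above can be sketched with the standard library alone. This is a minimal illustration, not the code in bridge/bridge.py: the function names are hypothetical, and the payload field names (action, issue.pull_request, comment.body) are assumptions about the Gitea webhook body.

```python
import hashlib
import hmac


def signature_ok(secret: bytes, body: bytes, header_sig: str) -> bool:
    """Return True if header_sig is the hex HMAC-SHA256 of body under secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected.encode(), header_sig.strip().lower().encode())


def is_trigger(event: str, payload: dict) -> bool:
    """Apply the filter: issue_comment, created, on a PR, body exactly !testme."""
    issue = payload.get("issue") or {}
    comment = payload.get("comment") or {}
    return (
        event == "issue_comment"
        and payload.get("action") == "created"
        and issue.get("pull_request") is not None  # assumed marker for "issue is a PR"
        and (comment.get("body") or "").strip() == "!testme"
    )
```

The signature is computed over the raw request bytes, so the body has to be read and verified before it is parsed as JSON.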

Next: package the bridge as a swarm service (dockerTools image, no Docker Hub pull) behind traefik at ci.commoninternet.net/hook via a reconcile oneshot (modules/bridge.nix); register a per-repo webhook with the HMAC; demo on a scratch PR (!testme triggers; non-!testme + non-collab rejected). That's the M3 gate.

2026-05-26 — M3: bridge deployed + verified; webhook DELIVERY blocked (Gitea-side)

Deployed the comment-bridge as a Nix-built OCI image (no Docker Hub pull) → swarm service on proxy, behind traefik at ci.commoninternet.net/hook, via reconcile oneshot modules/bridge.nix. Swarm secrets (webhook_hmac/drone_token/gitea_token) materialised from /run/secrets.

Verified working (bridge side):

docker service ls → ccci-bridge_app 1/1.
GET /hook/healthz → 200 from the sandbox over real public DNS (ci.commoninternet.net → 143.244.213.108); also 200 via gateway from cc-ci.
HMAC logic: bad sig → 401; a manually openssl-HMAC-signed body → 204 (passes sig, ignored as non-trigger); wrong event → 204. (Debug log added: got=/want=/bodylen/seclen.)
Registered per-repo issue_comment webhook (id 210) on recipe-maintainers/cc-ci → ci.../hook with the HMAC. Created scratch PR #1.

Blocker found: commenting !testme (×several) and Gitea's "Test Delivery" (UI returns 200) yield ZERO requests at the bridge container. Bridge is publicly reachable by hostname from a 3rd network; gateway accepts public sources; public DNS correct → Gitea is not sending the delivery. Deliveries panel is AJAX (uninspectable via curl); bot is not Gitea admin (can't read ALLOWED_HOST_LIST). Conclusion: git.autonomic.zone webhook policy (likely ALLOWED_HOST_LIST) blocks ci.commoninternet.net. Recorded in STATUS ## Blocked with operator options (whitelist host, or I pivot bridge to polling).

Plan: surface to operator; meanwhile proceed to M4 (harness + install stage) which doesn't depend on the webhook (dev recipe-CI builds triggerable directly via the Drone API). Revisit M3 gate once the host is whitelisted or via the polling fallback.

2026-05-27 — M4: harness + install stage green (custom-html), guaranteed teardown

Built the harness: runner/harness/abra.py (abra wrappers w/ gotchas: no --chaos on undeploy/volume-remove, -n everywhere, parse app ls -S -m nested {server:{apps}}, timeouts), runner/harness/lifecycle.py (deploy_app forcing LETS_ENCRYPT_ENV="" [A1], wait_healthy = services-converged + HTTPS, teardown_app = undeploy+volume+secret+env-config, janitor for orphans), tests/conftest.py (deployed_app session fixture with finalizer teardown; short unique domain), tests/custom-html/test_install.py (HTTP 200 + Playwright/Chromium content assertion), runner/run_recipe_ci.py (orchestrator: fetch recipe@REF, run stage pytest), modules/harness.nix (cc-ci-run = Nix python3+pytest+playwright with PLAYWRIGHT_BROWSERS_PATH from nixpkgs).

Bugs fixed en route (3):

Swarm config name > 64 chars (long domain) → switched to short <recipe[:4]>-<6hex> domain scheme (DECISIONS.md).
services_converged used wrong stack name (replaced hyphens) → abra keeps hyphens, only dots→_.
http_get connected to the gateway IP (drops SNI, gateway routes by SNI) → use the real URL (resolves to gateway on cc-ci, correct SNI). Also teardown now removes the app .env config.
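The first two fixes come down to two small string rules. A sketch under assumptions: the function names are hypothetical, and secrets.token_hex is my choice for the 6 hex characters (the journal does not say how they are generated).

```python
import secrets


def run_domain(recipe: str, base: str = "ci.commoninternet.net") -> str:
    """Build a short unique run domain: <recipe[:4]>-<6hex>.<base>."""
    return f"{recipe[:4]}-{secrets.token_hex(3)}.{base}"


def stack_name(domain: str) -> str:
    """Map a domain to its stack name: dots become underscores, hyphens stay."""
    return domain.replace(".", "_")
```

With this scheme the label in front of the base domain is at most 11 characters, which keeps derived swarm object names well under the 64-character limit that the long domains exceeded.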

Green run + teardown (commands + output):

RECIPE=custom-html PR=0 REF=m4demo cc-ci-run runner/run_recipe_ci.py → tests/custom-html/test_install.py::test_http_reachable PASSED, ::test_playwright_page PASSED — 2 passed in 57.99s.
Leak check after: services 0 / volumes 0 / secrets 0 / containers 0 / env config removed. Clean.

A1 addressed: deploy_app forces LETS_ENCRYPT_ENV="" (no ACME) on every deploy. M4 CLAIMED.

M3 still blocked (Gitea webhook delivery — operator); no response yet. Next: M5 (upgrade + backup/restore for custom-html), then wire the parameterized Drone pipeline (API-triggerable).

2026-05-27 — M5: upgrade + backup/restore stages green (custom-html)

Upgrade stage (tests/custom-html/test_upgrade.py): deploy previous published version (git-tag sort, second-newest), write a data marker into the served volume (nginx serves /usr/share/nginx/html, so the marker is HTTP-fetchable), abra app upgrade to current, assert healthy + marker survived. Fix: upgrade has no --chaos flag (used -f -D -n).
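Picking the "second-newest" tag needs a numeric sort, because a plain string sort would put 0.10.0 before 0.9.0. A minimal sketch, assuming tags look like <recipe-version>+<upstream-version> and that only the part before the + decides the order; both function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def version_key(tag: str) -> Tuple[int, ...]:
    """Sort key: the numeric components of the part before any '+'."""
    recipe_part = tag.split("+", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", recipe_part))


def previous_version(tags: List[str]) -> Optional[str]:
    """Return the second-newest tag, or None if fewer than two exist."""
    ordered = sorted(set(tags), key=version_key)
    return ordered[-2] if len(ordered) >= 2 else None
```

A None result is the case where an upgrade stage has nothing to upgrade from and would have to be skipped.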

backup-bot-two deployed as reconcile oneshot (modules/backupbot.nix): restic repo in a local backups volume, restic_password abra-generated (only if missing). Fixes: abra app secret generate needs -m (machine) to avoid the TTY/ioctl path, and stdout redirected so generated values never hit the journal (D6). abra app backup create/restore need a real PTY ('input device is not a TTY') → run via util-linux script -qec (harness _run_pty; util-linux added to cc-ci-run).

Backup stage (test_backup.py): write "original" → abra app backup create → mutate to "mutated" → abra app restore → assert state back to "original".
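The shape of that check, independent of abra, can be sketched with injected callables. Everything here is illustrative: backup_restore_roundtrip and InMemoryApp are hypothetical names, and the in-memory class only stands in for a deployed app so the sketch runs on its own.

```python
from typing import Callable


def backup_restore_roundtrip(
    write: Callable[[str], None],
    read: Callable[[], str],
    backup: Callable[[], None],
    restore: Callable[[], None],
) -> None:
    """Write a marker, back up, mutate, restore, and assert the marker is back."""
    write("original")
    backup()
    write("mutated")
    # Guard against a false pass: the mutation must be visible before restoring.
    assert read() == "mutated", "mutation did not take effect"
    restore()
    assert read() == "original", "restore did not return the pre-mutation state"


class InMemoryApp:
    """Stand-in for a deployed app, only to make the sketch runnable."""

    def __init__(self) -> None:
        self.state = ""
        self._snapshot = ""

    def write(self, value: str) -> None:
        self.state = value

    def read(self) -> str:
        return self.state

    def backup(self) -> None:
        self._snapshot = self.state

    def restore(self) -> None:
        self.state = self._snapshot
```

The intermediate "mutated" assertion matters: without it, a test whose writes never reach the app would pass trivially.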

Full 3-stage run (STAGES=install,upgrade,backup):

install: 2 passed (http 200 + playwright)
upgrade: 1 passed (data survives upgrade)
backup: 1 passed (restore returns pre-mutation state)
teardown: 0 orphaned run services/volumes/secrets; infra (traefik/drone/bridge/backupbot) all 1/1. M5 CLAIMED.

M3 still blocked (webhook; no operator response across several ticks). Plan: if still blocked, pivot the bridge to poll the Gitea API (self-service, Adversary-endorsed) to unblock D1. Next: M6.

2026-05-27 — Fix adversary findings A2 (dead janitor) + A3 (unverified teardown)

A2 (janitor matched dead -pr filter): rewrote harness.lifecycle.janitor to match the real run-app naming (RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$), reap via docker primitives, AND scan docker service ls to catch orphans whose .env is already gone (reconstructs the domain from the service name). Age-gated (default 2h, env CCCI_JANITOR_MAX_AGE) so concurrent in-flight runs are never killed.
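A sketch of the matching and age-gating logic described above. RUN_APP_RE is the pattern quoted in this entry; SERVICE_RE, both function names, and the choice of seconds as the unit for the maximum age are assumptions made for illustration.

```python
import re
from typing import Optional

RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$")
# Assumed service-name shape: <stack>_<service>, where the stack is the domain with dots as '_'.
SERVICE_RE = re.compile(r"^([a-z0-9]{1,4}-[0-9a-f]{6})_ci_commoninternet_net_")


def domain_from_service(service: str) -> Optional[str]:
    """Reconstruct a run-app domain from a swarm service name, if it is one."""
    m = SERVICE_RE.match(service)
    return f"{m.group(1)}.ci.commoninternet.net" if m else None


def should_reap(domain: str, age_s: float, max_age_s: float = 7200) -> bool:
    """Reap only run apps (never infra) that are at least max_age_s old."""
    return RUN_APP_RE.match(domain) is not None and age_s >= max_age_s
```

Because infra domains such as drone.ci.commoninternet.net never match the pattern, setting the maximum age to 0 reaps every run app without touching infra.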

A3 (teardown unverified + unconditional .env removal): teardown_app now (1) docker stack rm fallback if abra undeploy leaves services, (2) removes volumes/secrets before the .env and only drops the .env after the stack is confirmed gone, (3) retries docker volume rm (a stopped task briefly holds the volume), (4) verifies no residual services/volumes/secrets and raises TeardownError otherwise — so a partial teardown FAILS the run instead of silently orphaning.
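The ordering in (1) to (4) can be sketched against an abstract set of operations. This is an illustration of the sequence, not the harness code: AppOps and its method names are hypothetical, and the retry count and delay are placeholder values.

```python
import time
from typing import List, Protocol


class TeardownError(RuntimeError):
    """Raised when resources are still present after teardown."""


class AppOps(Protocol):
    def undeploy(self) -> None: ...
    def services(self) -> List[str]: ...
    def stack_rm(self) -> None: ...
    def remove_volumes(self) -> bool: ...
    def remove_secrets(self) -> None: ...
    def residual(self) -> List[str]: ...
    def remove_env(self) -> None: ...


def teardown_app(ops: AppOps, volume_retries: int = 5, delay_s: float = 2.0) -> None:
    ops.undeploy()
    if ops.services():
        ops.stack_rm()  # fallback when undeploy leaves services behind
    for _ in range(volume_retries):
        if ops.remove_volumes():  # a stopped task can briefly hold the volume
            break
        time.sleep(delay_s)
    ops.remove_secrets()
    left = ops.residual()
    if left:
        # Fail the run and keep the .env so the orphan stays discoverable.
        raise TeardownError("residual resources: " + ", ".join(left))
    ops.remove_env()  # dropped last, only once nothing is left
```

Keeping the .env until verification passes means a failed teardown still leaves a record of the app for a later cleanup pass.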

Re-test (commands + output):

Normal install run → 2 passed, verified teardown clean.
Orphan (deploy, no teardown) → janitor(CCCI_JANITOR_MAX_AGE=0) → services/volumes/secrets/env 0.
Env-less orphan (deploy then rm the .env, the A3 bad state) → janitor reaps via docker stack rm → services/volumes/secrets 0.
Full 3-stage run (install/upgrade/backup) still green with verified teardown, no TeardownError.

A2/A3 fixed; left for the Adversary to re-test + close.

2026-05-27 — M6 (part 1): harness enhancements for recipe #2 + D4 discovery

Before enrolling recipe #2, made the shared harness recipe-agnostic so enrolling a recipe needs no harness-code change (D5):

Per-recipe meta (tests/<recipe>/recipe_meta.py, optional): HEALTH_PATH, HEALTH_OK, DEPLOY_TIMEOUT, HTTP_TIMEOUT. conftest reads it; wait_healthy gained a path param (e.g. keycloak /realms/master). Defaults preserve custom-html behaviour (verified: install still green).
Shared naming (harness/naming.py): single source for the <recipe[:4]>-<6hex> domain, used by conftest + the orchestrator.
D4 recipe-local discovery (run_recipe_ci.run_recipe_local): if a recipe ships tests/ with test_*.py, deploy the app, run those tests against the LIVE deployment (contract: env CCCI_BASE_URL + CCCI_APP_DOMAIN), merge as another reported stage, guaranteed teardown. Real recipes ship tests/ committed in their repo (clean checkout) → discovered on clone/fetch. (custom-html via catalogue is an awkward case: abra refuses an unstaged recipe and abra recipe fetch resets local commits, so D4 is demonstrated end-to-end with recipe #2 hedgedoc, which ships committed tests/.)
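The optional per-recipe meta with fallbacks could look roughly like the following. The four setting names come from this entry; the default values, resolve_meta, and load_meta are hypothetical and chosen only to make the sketch concrete.

```python
import importlib.util
from pathlib import Path
from types import SimpleNamespace

# Placeholder defaults; the real harness defaults are not stated in this journal.
DEFAULTS = {
    "HEALTH_PATH": "/",
    "HEALTH_OK": (200,),
    "DEPLOY_TIMEOUT": 300,
    "HTTP_TIMEOUT": 120,
}


def resolve_meta(module) -> SimpleNamespace:
    """Take each setting from the module if it defines it, else the default."""
    return SimpleNamespace(**{k: getattr(module, k, v) for k, v in DEFAULTS.items()})


def load_meta(path: Path) -> SimpleNamespace:
    """Load tests/<recipe>/recipe_meta.py if present; otherwise use defaults."""
    if not path.is_file():
        return resolve_meta(None)
    spec = importlib.util.spec_from_file_location("recipe_meta", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return resolve_meta(module)
```

A recipe that only needs a different health path then ships a one-line recipe_meta.py and inherits everything else.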

Next: mirror hedgedoc (postgres+hedgedoc, DB-backed) via the mirror+PR flow with a committed tests/ dir, write tests/hedgedoc/ (install/upgrade/backup + recipe_meta), run all stages + D4 green.

2026-05-27 — M6 (part 2): recipe #2 keycloak install green (DB-backed, no harness surgery)

Enrolled keycloak (recipe #2): keycloak 26.6.2 + mariadb 12.2 — genuinely DB-backed/multi-service (vs custom-html stateless). Added only tests/keycloak/recipe_meta.py (HEALTH_PATH=/realms/master, HEALTH_OK=(200,), 600s timeouts) + tests/keycloak/test_install.py (realm-endpoint health + Playwright admin-console login). No change to runner/harness code — the recipe-agnostic harness (per-recipe meta) handled it (D5 evidence).

Run: RECIPE=keycloak STAGES=install cc-ci-run runner/run_recipe_ci.py → 2 passed in 545s (keycloak is slow: image pull + JVM + mariadb migration). Teardown clean (0 keyc-* services/volumes after).

Next: D4 demo via a mirror shipping committed tests/ (recipe-local run against live app); then keycloak upgrade + backup/restore (DB data survival via a realm marker through the admin API).

2026-05-27 — M6: D4 recipe-local discovery + recipe #2 enrolled (CLAIMED)

D4 recipe-local discovery working. Demo: pushed a committed tests/test_recipe_local.py to the mirror on branch recipe-maintainers/custom-html@ci/d4-recipe-local; ran RECIPE=custom-html SRC=recipe-maintainers/custom-html REF=ci/d4-recipe-local STAGES=install → install 2 passed, then ===== STAGE: recipe-local (D4) ===== ran the recipe-shipped test against the LIVE app (CCCI_BASE_URL) → 1 passed. Clean teardown (0 orphans).

Hard-won abra behaviour (DECISIONS.md): private mirror clone needs the bot token (per-command http.extraHeader, not persisted/logged). abra commands (app ls, secret generate, version resolution) silently git checkout <tag> the recipe, dropping a PR branch's files — so (1) all harness abra calls use -C -o (chaos+offline = current checkout, no remote fetch), and (2) D4 snapshots the recipe's tests/ to a temp dir right after fetch (later abra cmds still reset it). Traced the drop step-by-step: app_new ok, deploy ok, but secret generate (no flags) and app ls each reset the checkout.

Recipe #2 = keycloak (keycloak + mariadb, DB-backed) install green with only tests/keycloak/recipe_meta.py + test_install.py — no runner/harness change (D5). custom-html remains 3-stage green (M5). docs/enroll-recipe.md written.

M6 CLAIMED. keycloak's full 3-stage (DB data survival via a realm marker) folds into M6.5. Next: M6.5 — keycloak upgrade/backup, then recipes 3–6 across the remaining D10 categories.

2026-05-27 — Trigger redesign (polling primary) + resource safety + M3 verified

Session restarted by watchdog (prior tmux died mid-turn with uncommitted bridge WIP). Re-oriented from STATUS + plan; two orchestrator design changes landed and are now implemented + verified.

(1) Trigger: POLLING PRIMARY, webhook optional, org-membership auth (plan §4.1/§1.5; commit 7addb96). Rewrote bridge/bridge.py: a poll thread (poll_loop, always-on, primary) scans each POLL_REPOS repo's open PRs every 30s for new !testme; the /hook webhook stays as an optional admin-registered push optimization. Both share an in-memory comment-id seen-set → a comment seen by both fires once. First poll marks pre-existing comments seen (no startup re-fire). Authorization now GET /orgs/{owner}/members/{user} (204=member, read-level) + optional AUTH_ALLOWLIST, replacing the admin-requiring /collaborators/{user}/permission. Bot never self-registers webhooks.

Verified org endpoint at read level (bot basic-auth): members/{autonomic-bot,trav,notplants} → 204; members/definitely-not-a-member-xyz → 404.
Deployed (nixos-rebuild, deploy-bridge reconcile); new container logs: poller (primary) watching ['recipe-maintainers/cc-ci'] every 30s + (poll primary + optional webhook).
End-to-end M3 trigger (poll path): posted !testme on PR #1 (comment 13705, by bot) → Drone build #26 appeared after 6s (latest was #25); bridge logged [poll] triggered build 26 for cc-ci@d397720a (PR #1, comment 13705) by autonomic-bot; bridge posted back cc-ci: started CI run for cc-ci @ d397720a → https://drone.ci.commoninternet.net/.... Satisfies D1 (<60s) over the read-only outbound path — no operator webhook whitelist needed.
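The shared seen-set and the "first poll only marks" rule can be sketched as below. SeenSet and poll_once are hypothetical names, and the comment dict shape (id, body) is an assumption; this is not the code in bridge/bridge.py.

```python
import threading
from typing import Dict, Iterable, List


class SeenSet:
    """Thread-safe record of comment ids that have already been handled."""

    def __init__(self) -> None:
        self._ids = set()
        self._lock = threading.Lock()

    def claim(self, comment_id: int) -> bool:
        """Return True only for the first caller to present this id."""
        with self._lock:
            if comment_id in self._ids:
                return False
            self._ids.add(comment_id)
            return True


def poll_once(seen: SeenSet, comments: Iterable[Dict], first_poll: bool) -> List[int]:
    """Return ids of new '!testme' comments that should trigger a build."""
    to_fire = []
    for c in comments:
        if not seen.claim(c["id"]):
            continue  # already handled by an earlier poll or by the webhook
        if first_poll:
            continue  # pre-existing comment: mark seen, do not fire
        if c["body"].strip() == "!testme":
            to_fire.append(c["id"])
    return to_fire
```

If the webhook handler calls the same claim() before triggering, whichever path sees a comment first fires it and the other path gets False.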

(2) Resource safety: bound live test apps (plan §4.2/§4.3; commit 72ff8e2). MAX_TESTS = DRONE_RUNNER_CAPACITY = 1 (modules/drone-runner.nix) → Drone runs ≤1 build at once, queues the rest natively. Per-build timeout = 60m, reconciled best-effort in modules/drone.nix (PATCH /api/repos/.../cc-ci {"timeout":60}, non-fatal). Janitor remains the backstop for SIGKILL'd/timed-out builds (reaps orphaned run apps at run-start before each deploy).

Verified on host after rebuild: DRONE_RUNNER_CAPACITY=1; deploy-drone logged set cc-ci build timeout = 60m; Drone API confirms repo timeout: 60.

Gap noted (next item): .drone.yml still only has the self-test pipeline — a bridge-triggered build runs the self-test, NOT runner/run_recipe_ci.py. M4/M5 ran the orchestrator by hand (cc-ci-run). Need a recipe-CI pipeline keyed on the RECIPE build param (runs cc-ci-run runner/run_recipe_ci.py with STAGES=install,upgrade,backup, CCCI_JANITOR_MAX_AGE=0, concurrency:{limit:1}) to connect bridge→Drone→harness end-to-end (required for D2/D10 via real !testme). Added to Build backlog.

M3 CLAIMED (gate). Trigger + auth + comment-back demoed live; the webhook-delivery blocker is moot now that polling is primary.

2026-05-27 — Bridge→Drone→harness integration (recipe-ci pipeline) wired & green

Closed the gap where a bridge-triggered build ran only the self-test. Split .drone.yml into two event-filtered exec pipelines (commits 9d51cb6, bc8baae, 7aa0346):

self-test — trigger.event: [push] (M2 sanity on pushes).
recipe-ci — trigger.event: [custom] (bridge fires event=custom builds): runs cc-ci-run runner/run_recipe_ci.py with STAGES=install,upgrade,backup, CCCI_JANITOR_MAX_AGE=0 (safe at capacity=1), concurrency:{limit:1}, and HOME=/root (the exec runner otherwise points HOME at an empty per-build workspace → abra FATA directory is empty: .../.abra/servers).

Verified by triggering a custom build (RECIPE=custom-html, as the bridge does) via the Drone API:

Build #31 got past abra app new (HOME fix) but failed at backup: abra app backup create … FATA … authentication required: Unauthorized — backup/restore weren't passing -C -o, so abra fetched recipe tags from the (private) remote. Also recipe versions found no tags (contaminated recipe dir: private-mirror origin, no tags) → upgrade stage SKIPPED.
Fixes: abra.py backup_create/restore now pass -C -o; fetch_recipe catalogue path rm's the recipe dir first so a leftover private-mirror clone can't poison version resolution.
Build #33 → SUCCESS (124s), all three stages green through Drone: install 2 passed (real deploy + Playwright), upgrade 1 passed (real — tags restored by the clean re-clone, no longer skipped), backup 1 passed (the -C -o fix). Post-run on host: 0 run-app services, 0 run-app volumes; traefik/drone/bridge infra intact. Event filtering works (only recipe-ci ran, not self-test).

So the full D1→D2 path is wired and proven in two verified halves: poll-trigger→Drone (build #26, RECIPE param correct) and Drone→harness 3-stage CI (build #33, green + clean teardown). Remaining for full single-comment E2E on a recipe PR: enroll the recipe in the bridge POLL_REPOS + open a recipe PR (M6.5/M10 breadth work).

Adversary findings status (signal for re-test): A2 (janitor -pr filter) and A3 (teardown verification + .env-last ordering) are both already fixed in the current code (lifecycle.RUN_APP_RE hashed-scheme match; teardown_app _residual() raise + docker stack rm fallback) — awaiting the Adversary's kill-probe re-test on an idle host. A4 (concurrent same-recipe collision): its named root cause "no Drone concurrency cap (capacity=2)" is eliminated by MAX_TESTS=capacity=1 — no concurrent runs possible on this single node, so the shared-recipe-dir race can't occur. No Builder fix outstanding on findings; next milestone work is M6.5 breadth.

2026-05-27 — M6.5: keycloak full 3-stage GREEN through the Drone recipe-ci pipeline

Ran keycloak (DB-backed, SSO/identity category) end-to-end via the integrated recipe-ci pipeline (triggered custom build #39, RECIPE=keycloak). Build #39 → success (~31m), all three stages green as separate reported stages:

install 2 passed (8m30s): test_realm_endpoint_healthy (/realms/master 200) + Playwright admin console login.
upgrade 1 passed (10m10s): test_upgrade_preserves_realm — realm marker written pre-upgrade survives the previous→latest upgrade (DB data survival).
backup 1 passed (8m15s): test_backup_mutate_restore — backup→mutate→restore returns original.

Clean teardown verified on host: 0 keyc services, 0 keyc volumes. keycloak cold start is slow on this VM (Quarkus augmentation ~80s + Liquibase schema init), so each deploy takes ~5-8m, well within the 60m build timeout; that is why the run took ~31m. No harness surgery (D5): keycloak runs off tests/keycloak/{recipe_meta,test_install,test_upgrade,test_backup}.py + kc_admin.py only.

This both advances M6.5 (first DB-backed recipe full 3-stage) and confirms the recipe-ci integration works on a heavy DB-backed recipe (Drone→harness→3 stages→teardown). Next M6.5: enroll recipes 3–6 covering the remaining D10 categories (stateful-no-DB, multi-service+S3, large-volume, etc.).

2026-05-27 — M6.5: cryptpad (recipe #3) enrolled + full 3-stage green; fixed a real backup bug

Enrolled cryptpad (stateful, no external DB — the D10 "stateful/no-DB" category). No shared-harness surgery beyond a generic feature: added per-recipe EXTRA_ENV (recipe_meta.py dict or domain-callable) applied in deploy_app at every deploy path. cryptpad uses it for its required distinct SANDBOX_DOMAIN (a sibling subdomain under the wildcard, so no cert work). Data-survival tests write a marker into the backed-up cryptpad_data volume and read it via exec_in_app (cryptpad's datastore isn't HTTP-served like custom-html).
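The EXTRA_ENV mechanism described above could look roughly like the following. This is a minimal sketch, not the harness's actual code: the helper names resolve_extra_env and apply_extra_env, and the sandbox-<domain> naming, are assumptions made for illustration. Only the idea (a dict or a domain-callable, applied to the app's .env at deploy time) comes from the entry.

```python
from typing import Callable, Dict, Union

# EXTRA_ENV may be a plain dict or a callable taking the app domain.
ExtraEnv = Union[Dict[str, str], Callable[[str], Dict[str, str]]]


def resolve_extra_env(extra_env: ExtraEnv, domain: str) -> Dict[str, str]:
    """Return the per-recipe overrides for one deploy of `domain`."""
    if callable(extra_env):
        return dict(extra_env(domain))
    return dict(extra_env)


def apply_extra_env(env_text: str, overrides: Dict[str, str]) -> str:
    """Set KEY=value in .env text: replace an existing KEY= line, else append."""
    lines = env_text.splitlines()
    for key, value in overrides.items():
        new_line = f"{key}={value}"
        for i, line in enumerate(lines):
            if line.startswith(f"{key}="):
                lines[i] = new_line
                break
        else:
            lines.append(new_line)
    # Always end with a newline so later appends cannot glue onto the last line.
    return "\n".join(lines) + "\n"


def cryptpad_extra_env(domain: str) -> Dict[str, str]:
    """Hypothetical domain-callable: sandbox on a sibling subdomain."""
    return {"SANDBOX_DOMAIN": f"sandbox-{domain}"}
```

A static dict such as {"TIMEOUT": "900"} and a callable such as cryptpad_extra_env go through the same resolve step, which is what keeps the feature generic rather than recipe-specific.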

Host runs (HOME=/root, cc-ci-run): install 2 passed (~2m; http 200 + Playwright loads cryptpad), upgrade 1 passed (~1m; marker survives previous→current), backup 1 passed after a fix (below). Clean teardown (0 cryp services/volumes).

Real bug found+fixed — backups were silently mis-wired (set_env newline). cryptpad backup first failed: abra app backup create → backup-bot-two's /usr/bin/backup raised KeyError: 'RESTIC_REPOSITORY'. Root cause: backup-bot-two's .env.sample ends with a newline-less comment line, and the reconcile's set_env did a bare printf >> .env, gluing RESTIC_REPOSITORY=/backups/restic onto that comment → commented out. abra --debug confirmed the backupbot env map lacked RESTIC_REPOSITORY, and docker exec backupbot printenv RESTIC_REPOSITORY was empty. Fix: set_env now ensures a trailing newline before appending (modules/backupbot.nix + modules/drone.nix, same latent bug). After rebuild: .env has a clean RESTIC_REPOSITORY= line, the backupbot container has RESTIC_REPOSITORY=/backups/restic, and cryptpad backup→mutate→restore passes. NOTE: keycloak backup (build #39) passed off an earlier, non-corrupted backupbot deploy; worth a re-verify, but the mechanism is now correct/reproducible. Triggered Drone build #46 (cryptpad) as the canonical recipe-ci run.
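The newline bug is easy to reproduce on plain strings. The sketch below is illustrative only: the real set_env lives in the Nix modules as a shell append, and the function names here are hypothetical. It contrasts the bare append with one that terminates the last line first.

```python
def naive_append(env_text: str, key: str, value: str) -> str:
    """The buggy behaviour: append without checking the last line."""
    return env_text + f"{key}={value}\n"


def append_env_line(env_text: str, key: str, value: str) -> str:
    """The fixed behaviour: ensure a trailing newline before appending."""
    if env_text and not env_text.endswith("\n"):
        env_text += "\n"
    return env_text + f"{key}={value}\n"


# A sample file whose last line is a comment with no trailing newline.
sample = "DOMAIN=backup.example\n# optional settings"

broken = naive_append(sample, "RESTIC_REPOSITORY", "/backups/restic")
fixed = append_env_line(sample, "RESTIC_REPOSITORY", "/backups/restic")

# In `broken`, the variable is glued onto the comment, so it is commented out.
print(broken.splitlines()[-1])
# In `fixed`, the variable sits on its own line.
print(fixed.splitlines()[-1])
```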

2026-05-27 — M6.5: matrix-synapse (recipe #4, DB+media/large-volume) full 3-stage green

Enrolled matrix-synapse (synapse app + postgres db + nginx web) — the large-volume/DB+media D10 category. No harness surgery (server_name = DOMAIN; no EXTRA_ENV needed). Host runs (cc-ci-run): install 2 passed (~2.7m; client API 200 + real /_matrix/client/versions JSON), upgrade 1 passed (~2.3m; postgres marker survives previous→current), backup 1 passed (~1.5m). Clean teardown (0 matr services). The data-survival tests use a ci_marker postgres row exec'd via psql in the db service — this exercises the recipe's real DB-dump backup hook (backupbot.backup.pre-hook=/pg_backup.sh backup / restore.post-hook), the meaningful matrix data path (not a plain volume copy). Worked first try (the set_env/RESTIC fix holds for hook-based backups too). Triggering the canonical Drone recipe-ci run.

4 of 6 D10 recipes now green: custom-html (simple), keycloak (SSO/DB), cryptpad (stateful/no-DB), matrix-synapse (DB+media/large-volume). Remaining categories: multi-service+S3 (lasuite-docs) and TLS-passthrough (bluesky-pds).

2026-05-27 — M6.5: lasuite-docs (recipe #5, multi-service + S3/MinIO) full 3-stage green

Enrolled lasuite-docs (the object-storage/S3 + multi-service D10 category): a 9-service stack (frontend app + Django backend + celery + y-provider + docspec + postgres + redis + minio + nginx). Host runs (cc-ci-run): install 2 passed (~2.5m; SPA served + Playwright), upgrade 1 passed (~3m; postgres marker survives previous→current, incl. cold-pulling the older images), backup 1 passed (~2.3m; pg_backup.sh dump/restore). Clean teardown.

Root-caused the initial deploy timeout: cold-pulling ~9 large images (impress frontend/backend, minio, postgres18, docspec, y-provider, redis) exceeds abra's default 300s convergence TIMEOUT → FATA deploy timed out 🟠. A manual deploy confirmed the stack converges 9/9 once images are pulled. Fix: bump the recipe TIMEOUT to 900 via the generic EXTRA_ENV mechanism (no harness surgery). OIDC is config-only (Django manage.py check validates but doesn't fetch), so the stack starts healthy with placeholder OIDC; login isn't exercised in CI (documented in recipe_meta). Data-survival uses a postgres marker (docs/docs) via the pg_backup hook.

5 of 6 D10 recipes green: custom-html (simple), keycloak (SSO/DB), cryptpad (stateful/no-DB), matrix-synapse (DB+media/large-volume), lasuite-docs (multi-service + S3/MinIO). Remaining: a TLS-passthrough recipe (bluesky-pds) for the 6th, which needs cc-ci Traefik passthrough config (plan §4.0 caveat) — the hardest infra-wise.

2026-05-27 — M6.5 COMPLETE: n8n (recipe #6) full 3-stage green — all 6 D10 recipes done

Enrolled n8n (workflow automation; single app service, stateful via the /home/node/.n8n volume, normal terminate-at-Traefik). Host runs: install 2 passed (~3.8m; /healthz 200 + Playwright editor), upgrade 1 passed (~1.3m; marker in /home/node/.n8n survives), backup 1 passed (~0.8m; backupbot.backup.path file backup). Clean teardown. (Caught a sync gap first: committed the tests but forgot to tar tests/n8n to the host → run skipped "no stage test files"; synced + re-ran.)

n8n is recipe #6 in place of bluesky-pds (TLS-passthrough), swapped per DECISIONS (caddy self-ACME conflicts with cc-ci's no-ACME/static-wildcard design).

All 6 D10 recipes now have a full 3-stage green run (host):

custom-html — simple/stateless
keycloak — SSO/identity + DB (Drone #39)
cryptpad — stateful/no-DB (Drone #46)
matrix-synapse — DB+media/large-volume (Drone #51)
lasuite-docs — multi-service + S3/MinIO/object-storage (Drone #57)
n8n — workflow automation (Drone canonical run triggering now)

All 5 required D10 categories covered. Triggering n8n canonical Drone run, then claiming the M6.5 gate.

2026-05-27 — M8/D7: results dashboard live (overview + badges)

Built the results dashboard (dashboard/dashboard.py + modules/dashboard.nix): a stdlib HTTP service (Nix-built OCI image, swarm service on proxy, reconcile oneshot like bridge/drone) that polls the Drone API for recipe-CI builds (event=custom), groups latest-run-per-recipe, and renders a YunoHost-CI-like overview at ci.commoninternet.net/ with pass/fail/running badges, last ref, when, and a link to the canonical Drone run. It also serves embeddable badges at /badge/<recipe>.svg.

Verified live via the public gateway: overview lists exactly the 6 enrolled recipes (cryptpad, custom-html, keycloak, lasuite-docs, matrix-synapse, n8n) each success; /badge/keycloak.svg → 200 image/svg+xml; /healthz → 200; /hook still routes to the bridge (200) — the bridge's Host && PathPrefix(/hook) rule keeps priority over the dashboard's Host-only rule.
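The "latest run per recipe" grouping can be sketched as below. It assumes each build is a dict with number, event, status, and a params map carrying RECIPE; those field names follow the Drone build JSON as I understand it, but treat them as assumptions. The function name and the skip parameter (for excluding the CI repo's own name) are hypothetical, not taken from dashboard.py.

```python
from typing import Dict, Iterable, Tuple


def latest_per_recipe(
    builds: Iterable[dict], skip: Tuple[str, ...] = ("cc-ci",)
) -> Dict[str, dict]:
    """Keep the highest-numbered custom-event build for each RECIPE param."""
    latest: Dict[str, dict] = {}
    for build in builds:
        if build.get("event") != "custom":
            continue  # only bridge-triggered recipe-CI builds count
        recipe = (build.get("params") or {}).get("RECIPE")
        if not recipe or recipe in skip:
            continue  # no recipe param, or the CI repo itself
        current = latest.get(recipe)
        if current is None or build["number"] > current["number"]:
            latest[recipe] = build
    # Sort by recipe name so the overview renders in a stable order.
    return dict(sorted(latest.items()))


builds = [
    {"number": 39, "event": "custom", "status": "success", "params": {"RECIPE": "keycloak"}},
    {"number": 31, "event": "custom", "status": "failure", "params": {"RECIPE": "custom-html"}},
    {"number": 33, "event": "custom", "status": "success", "params": {"RECIPE": "custom-html"}},
    {"number": 40, "event": "push", "status": "success", "params": {}},
    {"number": 76, "event": "custom", "status": "failure", "params": {"RECIPE": "cc-ci"}},
]
overview = latest_per_recipe(builds)
```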

Two fixes en route: (1) filter out the cc-ci repo's own name as a recipe row (Adversary !testme on the cc-ci PR showed a spurious cc-ci=failure); (2) content-hash image tag — a fixed :latest tag + unchanged stack spec does NOT roll the swarm service on a code change, so the tag is now derived from a hash of dashboard.py → docker stack deploy rolls reliably (reproducible/self-heal). NOTE: the bridge image has the same latent :latest issue (only rolled this session because its .nix env also changed) — worth the same content-tag treatment (backlog).
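The content-hash tag idea, as a minimal sketch: the actual tag is derived in the Nix module, so the function names, the SHA-256 choice, and the 12-character truncation here are assumptions for illustration. The point is that the tag changes exactly when the source bytes change, so the stack spec changes and the service rolls.

```python
import hashlib


def content_tag(source: bytes, length: int = 12) -> str:
    """Derive a short, deterministic image tag from the source bytes."""
    return hashlib.sha256(source).hexdigest()[:length]


def image_ref(name: str, source: bytes) -> str:
    """Build name:tag so a code change yields a different image reference."""
    return f"{name}:{content_tag(source)}"


old_ref = image_ref("dashboard", b"print('v1')\n")
same_ref = image_ref("dashboard", b"print('v1')\n")
new_ref = image_ref("dashboard", b"print('v2')\n")
```

With a fixed :latest tag the reference string never changes, which is why an unchanged stack spec gave the swarm nothing to roll.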

Remaining M8 piece: PR-comment outcome reflection — the bridge posts the start/run-link comment but doesn't yet update it with the final pass/fail (needs a Drone build-completion hook or the bridge polling build status). Overview + badges (the core of D7) are done.

2026-05-27 — M8/D7 complete: PR-comment outcome reflection + gate claim

Added outcome reflection to the bridge: after triggering, a daemon watcher polls the Drone build to completion and edits the run-link PR comment to ✅ passed / ❌ (Gitea PATCH issues/comments/{id}). Gave the bridge image a content-hash tag so the swarm service actually rolls on bridge.py changes (same latent :latest no-roll issue the dashboard had).
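A sketch of the watcher loop under stated assumptions: get_status and edit_comment are hypothetical injected callables standing in for the Drone build lookup and the Gitea comment PATCH, and the set of terminal statuses is my assumption rather than something the bridge source confirms. Injecting the callables keeps the loop testable without network access.

```python
import time
from typing import Callable

# Assumed terminal Drone build statuses; anything else means "keep polling".
TERMINAL = {"success", "failure", "error", "killed"}


def watch_build(
    get_status: Callable[[], str],
    edit_comment: Callable[[str], None],
    poll_seconds: float = 10.0,
    max_polls: int = 720,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the build finishes, then reflect the outcome in the comment."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL:
            mark = "✅ passed" if status == "success" else f"❌ {status}"
            edit_comment(mark)
            return status
        sleep(poll_seconds)
    # Gave up waiting; the comment is left untouched in this sketch.
    return "timeout"


# Usage with fakes: a build that goes pending -> running -> failure.
statuses = iter(["pending", "running", "failure"])
edits = []
result = watch_build(lambda: next(statuses), edits.append, sleep=lambda s: None)
```

In the bridge this would run in a daemon thread per triggered build, so the trigger path returns immediately.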

Verified end-to-end: posted a fresh !testme on PR #1 → poller fired → "started" comment posted → build #76 (RECIPE=cc-ci, fails fast: no tests/cc-ci) → within ~20s the same comment was edited to cc-ci: run for cc-ci @ d397720a ❌ failure → …/76. The pass/fail now mirrors onto the PR comment.

D7 fully met: per-run logs (Drone UI) + overview page with badges (dashboard, live) + PR comment links back AND reflects the outcome. Claiming the M8 gate.

2026-05-27 — M10/D10: real !testme path proven on custom-html; enrolling the breadth set

Wired the real-PR path end-to-end and proved it on custom-html. !testme on recipe-maintainers/custom-html#2 → bridge poller fired → recipe-ci build (SRC=mirror, REF=PR head db9a9502) → build #84 success, all 3 stages green (install 2✓, upgrade 1✓ — now runs for real, backup 1✓) → bridge comment edited to ✅ passed. Clean teardown.

Three fixes to make the real-PR path exercise the upgrade stage (mirror PR clones carry no tags):

fetch_recipe (SRC+REF) read-only fetches the published version tags from the PUBLIC upstream (git fetch <upstream> refs/tags/*:refs/tags/* — bare --tags errored "no remote HEAD"); plain git, never pushes to the mirror (guardrail-safe).
abra.upgrade now passes -o (offline) — it was 401'ing trying to fetch tags from the private mirror origin; offline uses the local (upstream-populated) tags.
(earlier) backup/restore already pass -C -o.

Now firing !testme on the other recipes' open PRs (keycloak#1, matrix-synapse#1, lasuite-docs#1, n8n#1); they queue at MAX_TESTS=1. cryptpad has no open PR → opening one next.

2026-05-27 — M10/D10: real !testme breadth runs — 5/6 green, lasuite-docs upgrade retry

Fired !testme on all 6 recipe PRs (capacity=1, sequential). Results (real PR-triggered, full 3-stage):

custom-html #84 ✅ (PR head db9a9502)
keycloak #86 ✅ (DB realm marker survives upgrade)
matrix-synapse #87 ✅ (postgres marker, pg_backup hook)
n8n #89 ✅
cryptpad #90 ✅ (test PR #2 opened via Gitea API: branch ci/testme + .ci-testme marker)
lasuite-docs #88 ❌ — install ✅ + backup ✅, but UPGRADE failed: abra app upgrade … -o → FATA deploy failed (a convergence failure during the 9-service rolling upgrade prev→latest, not a timeout). It PASSED on the host/catalogue run, and ran right after the heavy matrix build, so likely transient resource contention. Re-fired !testme on lasuite-docs#1 to test transient-vs-persistent.

So the real-!testme path + the upgrade fixes (upstream tags + upgrade -o) work across simple, DB, DB+media, workflow, and stateful recipes. lasuite-docs (the object-storage/S3 category, required) needs its upgrade to pass on the real path for the 6/6 D10 proof.

2026-05-27 — M10: 5/6 real-!testme green; lasuite-docs blocked on Docker Hub rate limit (A1)

lasuite-docs #88/#92 upgrade failed "deploy failed" → diagnosed: node disk at 90% (2.7G free) — a 9-service rolling upgrade couldn't converge. Pruned 30 unused images (reclaimed 12GB → 15G free). Retry #93: got further (5/8 services up) but redis task Rejected "No such image: redis:8.2.6" → docker pull redis:8.2.6 on the node = toomanyrequests: unauthenticated pull rate limit. So the prune fixed disk but forced re-pulls that hit Docker Hub's anonymous limit (A1 registry-creds finding, §1.5/§4.4). Recorded in STATUS ## Blocked + DECISIONS; surfaced to operator (provide Docker Hub creds). 5/6 recipes green via real !testme; lasuite install+backup green, upgrade gated. Pivoting to M9 (docs/reproducibility, unblocked) while the limit resets / creds arrive.

2026-05-27 — lasuite quota-window retry insufficient; halting retries pending creds (3rd attempt)

Re-fired lasuite-docs !testme during the apparently-eased window (#96). The cached image redis:8.2.6 gave "up to date", but the LATEST version's uncached redis:8.6.3 → toomanyrequests again. So the anonymous quota isn't reset enough for a full 9-service × 2-version deploy. Cancelled #96 + tore down clean. This is the 3rd confirmation the blocker is the Docker Hub rate limit. Per anti-thrash: halting lasuite retries until the operator provides Docker Hub creds (A1, STATUS ## Blocked). 5/6 D10 recipes remain green via real !testme. Pivoting to M9 (docs/reproducibility) — fully unblocked, no image pulls.

2026-05-27 — M10/D10 BUILDER-COMPLETE: all 6 recipes green via real !testme

Diagnosed the lasuite-docs upgrade failure with an instrumented host run: abra app upgrade reported FATA deploy failed while all 9 services were actually 1/1 healthy. The cause: abra's convergence poll gives up too early on the slow stop-first rolling upgrade (pulling new images). Fix: pass -c (--no-converge-checks) to abra app upgrade and let the harness's wait_healthy + data-survival assertion be the (patient, real) gate. (Also: /root/cc-ci was stale and is now fully synced; the first diagnostic run hit the old missing -o auth error, which masked this.)
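A patient health gate of the kind described could be sketched as follows. This is not the harness's wait_healthy: the signature, the replica_states callable (returning running and desired replica counts per service), and the default timings are all assumptions for illustration.

```python
import time
from typing import Callable, Dict, Tuple


def wait_healthy(
    replica_states: Callable[[], Dict[str, Tuple[int, int]]],
    timeout_s: float = 900.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every service has running == desired, or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        states = replica_states()  # e.g. {"app": (1, 1), "db": (0, 1)}
        if states and all(run == want and want > 0 for run, want in states.values()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage with fakes: the stack converges on the second poll.
polls = iter([{"app": (0, 1), "db": (1, 1)}, {"app": (1, 1), "db": (1, 1)}])
converged = wait_healthy(lambda: next(polls), sleep=lambda s: None)
```

The design choice is that the deploy command's own verdict is ignored and the gate is the observed replica state plus the data-survival assertion, with a timeout long enough for cold image pulls.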

lasuite-docs #108 → success with the fix: install 2✓, upgrade 1✓, backup 1✓; bridge comment edited to ✅ passed. So all 6 D10 recipes are green via REAL !testme on a PR, full 3-stage, comment-reflected, clean teardown:

recipe	category	build
custom-html	simple/stateless	#84
keycloak	SSO/identity + DB	#86
matrix-synapse	DB + media / large-volume	#87
n8n	workflow automation	#89
cryptpad	stateful / no external DB	#90
lasuite-docs	multi-service + S3/MinIO/object-storage	#108

All 5 required D10 categories covered. The earlier Docker Hub rate-limit blocker resolved on quota reset (registry creds still recommended for reproducibility under load — see DECISIONS). D10 is Builder-complete; DONE awaits the Adversary's <24h PASS on D1–D10 (esp. independent D10 verification).

2026-05-27 — D10 PASS (6/6); DONE now blocked only on D8 live VM rebuild (Adversary)

Adversary logged D10 PASS (6/6) @11:57Z (all 6 recipes green via real !testme, all categories) + M8/D7 PASS, D8-core PASS, D9 PASS this window. No VETO, no open findings. Audited REVIEW: D1 (=M3), D2 (=M5/M6.5/D10), D3 (=M4), D4/D5 (=M6/M6.5), D6 (=M7), D7 (=M8), D9, D10 all have <24h PASS. The sole remaining item is D8's full live blank-VM rebuild — the Adversary marked it "core PASS; live rebuild pending — to complete before DONE," deferred on the premise that it needs operator registry creds. That premise is now OBSOLETE (D10 went 6/6 with no creds; rate limit was transient; real fix was -c). Surfaced in STATUS that the VM rebuild is feasible now without creds. NOT writing ## DONE until REVIEW shows a full D8 PASS (per §6.1 — every D1–D10 needs a <24h PASS). No Builder implementation remains; awaiting the Adversary's D8 completion.

2026-05-27 — ## DONE

Adversary logged D8 PASS (byte-identical nixos-rebuild build closure == /run/current-system, plus a documented alternative: a throwaway-VM live rebuild is infeasible by design because its host key isn't a sops recipient and the A1 preconditions bind to cc-ci itself) and the DONE-readiness table: all D1–D10 PASS <24h, no VETO, explicitly: "the DONE handshake (§6.1) is CLEARED — Builder may flip STATUS → DONE."

Wrote ## DONE to STATUS.md with the D1–D10 evidence table. Every Definition-of-Done item is independently Adversary-verified. Stopping the self-paced loop (no further ScheduleWakeup).

Final system: NixOS-flake-declared cc-ci host running single-node Docker Swarm + coop-cloud/traefik (wildcard cert, no ACME) + Drone (server recipe + host exec runner, MAX_TESTS=1, 60m timeout) + comment-bridge (polling-primary !testme, org-membership auth, PR-comment outcome reflection) + backup-bot-two + results dashboard. !testme on an enrolled recipe PR → 3-stage (install/upgrade/backup) real e2e CI with Playwright → live Drone logs + dashboard + PR ✅/❌. Six recipes proven.
