autonomic-bot 239dfd8e26 Watchdog handoff signalling: ping the waiting loop on gate-claim / verdict (kill double-idle)

launch.sh watchdog now runs a fast (~30s) handoff_check alongside the heavy (300s) restart/DONE
check: when the Builder writes a CLAIMED gate it pings the Adversary to verify now; when the
Adversary updates REVIEW.md it pings the Builder to proceed (edge-triggered, reads local clones).
So a pending handoff resolves in <~30s instead of a whole idle interval. Pacing revised: the
Adversary may idle freely when nothing's pending (no pointless re-verify/busy-poll) and is woken
by the watchdog; Builder waits on the ping + a fallback ~2-4m self-poll. kickoff documents the
new "handoff signalling" role.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-27 06:15:25 +01:00

cc-ci — Co-op Cloud Recipe CI Server (Autonomous Build Plan)

Status: ACTIVE (autonomous loop)
Owner agent: Builder (primary) + Adversary (reviewer)
Source brief: brief.md (do not edit; this file supersedes it)
This file's canonical path: /srv/cc-ci/cc-ci-plan/plan.md
Target server: cc-ci (NixOS)
Code/config home: git.autonomic.zone/recipe-maintainers/cc-ci (the CI project repo, distinct from this /srv/cc-ci/cc-ci-plan/ planning+launch folder)
Last updated: keep current via STATUS.md (see §7)

0. How to read this document

This plan is written to be handed to an autonomous Claude agent running in a sandbox over several days, driving itself in a loop until the CI server is "done" per §2. A second agent (the Adversary) independently tries to disprove every "done" claim. Neither agent is trusted to mark its own work complete.

If you are an agent waking up into this loop for the first time, go straight to §1 Bootstrap. On every subsequent wake, go to §7 The Loop Protocol and continue from STATUS.md.

The rest of the document (§3–§6) is the technical design. Treat it as the default architecture, but you are allowed to revise it when reality disagrees — record any deviation in DECISIONS.md with a one-line rationale.

1. Bootstrap (first wake only)

Do these in order. Each step is idempotent; re-running is safe.

Verify access. (Full credential map + how each is used is in §1.5 — read it first.)
- ssh cc-ci 'hostname && whoami' — you log in as root on cc-ci (NixOS), so there is no separate sudo step. ssh cc-ci is preconfigured to tunnel through the userspace-tailscaled SOCKS proxy (§1.5); if it fails, the proxy/daemon is probably down — restart it (§1.5) before declaring blocked.
- ssh cc-ci 'nixos-version' — confirm NixOS.
- Confirm you can reach the Gitea API with the bot creds from .testenv (§1.5): curl -s https://$GITEA_URL/api/v1/version. The bot authenticates with GITEA_USERNAME/GITEA_PASSWORD (basic auth) or a token you mint from them via POST /api/v1/users/<user>/tokens — do not expect a ready-made $GITEA_TOKEN.
- Confirm the preconfigured test-app DNS (§4.0/§4.4): a random subdomain under the wildcard resolves, e.g. getent hosts probe-$RANDOM.ci.commoninternet.net returns the gateway's IP (not cc-ci's — the gateway TLS-passthroughs to cc-ci, so do not expect cc-ci's address; and use getent, not dig, since this host's resolver is Tailscale-only — see §1.5). Traefik is not up yet — you deploy it at M1 (the real coop-cloud/traefik recipe via abra, wildcard/file-provider mode → the pre-issued cert at /var/lib/ci-certs/live/, no ACME); the DNS record + gateway passthrough + cert are the preconditions, and full end-to-end HTTPS reachability is proven at M1, not now. If the wildcard does not resolve at all, that's a ## Blocked item (operator fixes DNS/gateway).
- If any check fails, write the failure to STATUS.md under ## Blocked and stop — a human must fix access. Do not try to work around missing access.
Create the cc-ci repo on git.autonomic.zone if it does not exist. Push an initial skeleton (see §3 layout). The Builder clones to /srv/cc-ci/cc-ci; the Adversary loop keeps its own independent clone at /srv/cc-ci/cc-ci-adv. The repo is the only channel between the two loops (§6.1) — loop state lives inside it (STATUS.md, BACKLOG.md, etc.).
Snapshot the starting environment into cc-ci/docs/baseline.md: current NixOS config on the server (/etc/nixos or existing flake), installed packages, whether Docker/Swarm/abra already exist, DNS that already points at the box. This is the rollback reference.
Seed the loop state files (§7) if absent: STATUS.md, BACKLOG.md, REVIEW.md, JOURNAL.md, DECISIONS.md. Give BACKLOG.md two H2 sections — ## Build backlog (populated from §5 milestones) and ## Adversary findings (empty) — per the single-writer rule in §6.1.
Commit ("chore: bootstrap cc-ci loop state") and begin the loop at §7.
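
The seeding step above can be sketched as a small idempotent helper. This is a minimal illustration, not the project's actual bootstrap code: the file names and the two BACKLOG.md headings come from this plan, while the function name `seed_loop_state` and the placeholder file bodies are assumptions.

```python
from pathlib import Path

# File names are taken from the plan; the placeholder bodies are illustrative.
STATE_FILES = ["STATUS.md", "BACKLOG.md", "REVIEW.md", "JOURNAL.md", "DECISIONS.md"]
BACKLOG_SKELETON = "# BACKLOG\n\n## Build backlog\n\n## Adversary findings\n"


def seed_loop_state(repo_root: Path) -> list[str]:
    """Create any missing loop state file and return the names created.

    Existing files are never overwritten, so re-running is safe.
    """
    created = []
    for name in STATE_FILES:
        path = repo_root / name
        if path.exists():
            continue  # keep whatever a previous wake already wrote
        if name == "BACKLOG.md":
            body = BACKLOG_SKELETON
        else:
            body = f"# {name.removesuffix('.md')}\n"
        path.write_text(body, encoding="utf-8")
        created.append(name)
    return created
```

Calling it a second time returns an empty list, which is a cheap way to confirm the step really is idempotent.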

1.5 Credentials & access — where everything lives and how to use it

The loops run on the sandbox host (not on cc-ci) and reach cc-ci over Tailscale. This section is the authoritative map of what credentials exist, where, and how to use them. Never copy any secret value into the repo, a commit, a log, or the dashboard (§9) — reference locations only.

Provided credentials (already in place)

What	Where	How to use
Tailscale auth key (joins cc-ci's tailnet `taila4a0bf.ts.net`)	`/srv/cc-ci/.testenv` → `TS_AUTH_KEY` (Tailscale SaaS key, keyID ends `CNTRL`)	Used to bring up the userspace tailscaled (below). It's reusable; re-run `tailscale up` with it if the node drops.
cc-ci SSH (root)	private key `~/.ssh/cc-ci-root-ed25519`; config `Host cc-ci` in `~/.ssh/config`	Just run `ssh cc-ci` (logs in as root). The pubkey is already in cc-ci's `/root/.ssh/authorized_keys`.
Gitea bot account	`/srv/cc-ci/.testenv` → `GITEA_USERNAME` (`autonomic-bot`), `GITEA_PASSWORD`, `GITEA_URL` (`git.autonomic.zone`)	Basic-auth to the Gitea API, or mint a scoped token: `POST https://$GITEA_URL/api/v1/users/$GITEA_USERNAME/tokens`. Used to push the `cc-ci` project repo, read recipe repos, comment on PRs, and poll for `!testme` (read-level; the bot does not register webhooks).

Load them in a shell with: set -a; . /srv/cc-ci/.testenv; set +a (don't echo the values).

The Tailscale connection (how `ssh cc-ci` and the proxy work)

cc-ci (cc-nix-test, 100.90.116.4) is on a different tailnet than the sandbox host's default one, so it is reached via a second, userspace tailscaled — this keeps the host's own tailnet untouched. State lives in ~/.cc-ci-ts/; it exposes a SOCKS5/HTTP proxy on 127.0.0.1:1055, which is the only route to that tailnet (userspace networking ⇒ the host OS can't route the tailnet IPs directly).

It runs as a persistent systemd service (cc-ci-tailscaled.service, enabled, Restart=always, starts on boot; unit at /etc/systemd/system/cc-ci-tailscaled.service, runs as user notplants). It reuses the already-authenticated state in ~/.cc-ci-ts/, so it reconnects across reboots/crashes without the auth key.

ssh cc-ci works out of the box (its ProxyCommand uses the proxy; logs in as root).
For HTTP(S) to cc-ci / *.ci.commoninternet.net from the sandbox, go through the proxy, e.g. curl --proxy socks5h://localhost:1055 https://<app>.ci.commoninternet.net.
If connectivity is down: sudo systemctl restart cc-ci-tailscaled (diagnose with systemctl status cc-ci-tailscaled / journalctl -u cc-ci-tailscaled). A dead proxy is an access failure to recover, not a ## Blocked-and-stop condition — unless the auth key itself is rejected (then re-auth with tailscale --socket=$HOME/.cc-ci-ts/tailscaled.sock up --auth-key="$TS_AUTH_KEY" --hostname=cc-ci-claude-sandbox --accept-routes --accept-dns=false, and if that fails the key is a class-A1 blocker).
DNS gotcha: this host's /etc/resolv.conf lists only Tailscale resolvers, so direct dig @1.1.1.1 … queries get no answer and look falsely empty. Use getent hosts <name> to resolve from the sandbox. commoninternet.net itself is a normal public zone hosted at Gandi.

Credentials the loop GENERATES itself (do not wait on a human for these)

Drone RPC secret and webhook HMAC secret — generate (openssl rand -hex 32), store sops-encrypted in secrets/, and wire both ends. Internal shared secrets, not human inputs.
Gitea OAuth app for Drone — create it under the bot account via the API (POST /api/v1/user/applications/oauth2); capture client id/secret into secrets/.
cc-ci host age/GPG key for sops — generate on the host (or derive from its SSH host key); add as a sops recipient. Keep a recovery copy of the master age identity off-box if desired.
Per-recipe app secrets (class-B, §4.4) — the harness generates these per run.
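
For the internal shared secrets, a Python equivalent of `openssl rand -hex 32` might look like the sketch below. The helper name `new_shared_secret` is hypothetical; the value it returns still has to be sops-encrypted into secrets/ and must never be printed or logged.

```python
import secrets


def new_shared_secret(num_bytes: int = 32) -> str:
    """Return num_bytes of CSPRNG output as lowercase hex.

    With the default of 32 bytes this yields 64 hex characters,
    the same shape as `openssl rand -hex 32`.
    """
    return secrets.token_hex(num_bytes)
```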

Credentials STILL NEEDED from the operator (class-A — block if missing, per §9)

Wildcard TLS cert — PROVIDED, not a token. The operator has pre-issued the wildcard SAN cert (*.ci.commoninternet.net + ci.commoninternet.net) and placed it on cc-ci at /var/lib/ci-certs/live/{fullchain.pem,privkey.pem} (§4.0). The agent feeds these into the coop-cloud/traefik recipe as its ssl_cert/ssl_key swarm secrets (wildcard/file-provider mode) and runs no ACME for this domain. Do not request or expect a commoninternet.net DNS token — issuance/renewal is handled out-of-band by the operator (LE 90-day cert; next renewal ~2026-08-24). A missing/expired cert is a finding for the operator, not an agent re-issue.
Registry pull credentials (e.g. Docker Hub) — recommended to avoid anonymous pull-rate limits breaking deploys under load. Treat a rate-limit failure traced to this as a finding, then request creds. Store sops-encrypted in secrets/.
Gitea bot permissions (a grant, not a secret) — least privilege: read, not admin. The bot needs: write on its own recipe-maintainers/cc-ci project repo; read + comment on the recipe repos under test; and org membership in recipe-maintainers (read-level — used both to authorize commenters via the members endpoint and to read members). It does not need repo-admin and does not register webhooks (that's an optional manual admin task, §4.1). If a needed grant is missing, that's a ## Blocked item for the operator.

2. Definition of Done (the loop's exit condition)

The loop terminates only when every item below is true and the Adversary has independently re-verified each one within the last 24h (logged in REVIEW.md with timestamps and command output). Partial credit does not count.

D1 — Trigger. Commenting !testme on any open PR in any enrolled recipe repo on git.autonomic.zone starts a CI run for the code at that PR's head commit within 60s. Other comments do not. Re-commenting re-runs.
D2 — Test matrix. For a recipe under test, the run executes, as separate reported stages: new install, upgrade (previous published version → PR version), and backup + restore. All are genuine end-to-end against a really-deployed recipe (real containers, real Traefik routing, real volumes) — no mocks, no stubs.
D3 — Python + Playwright. Tests are Python. Functional assertions that require a browser use Playwright against the live deployed app.
D4 — Recipe-local tests. If the recipe repo contains its own tests/ folder, those tests are also discovered and run as part of the same CI run, with results merged in.
D5 — Per-recipe test tree. The cc-ci repo holds tests/<recipe>/ with the install/upgrade/backup tests as Python files, plus a shared harness. Adding a new recipe is a documented, small, repeatable operation.
D6 — Secrets. App + infra secrets are handled reproducibly (committed encrypted, decrypted on the server), documented, and rotatable. No plaintext secrets in git, logs, or the results UI.
D7 — Results UX. Each run has a stable URL with live, tail-able logs per stage and a final pass/fail; there is an overview page listing recipes with their latest status — look-and-feel comparable to the YunoHost app CI (ci-apps.yunohost.org). A PR comment links back to its run and reflects the outcome.
D8 — Reproducible server. The entire server (Drone, runner, comment bridge, swarm, Traefik, dashboard, secrets wiring) is declared in the cc-ci repo's NixOS flake and can be rebuilt from scratch onto a blank NixOS host following docs/install.md, verified by the Adversary doing exactly that on a throwaway VM (or documenting why a full from-scratch rebuild was infeasible and what was tested instead).
D9 — Documentation. README.md + docs/ explain architecture, how to enroll a recipe, how to add/run tests locally, how to operate/rotate secrets, and how to debug a failed run. A new engineer can enroll a recipe and get a green run using only the docs.
D10 — Proof (breadth). At least six real recipes spanning the meaningful categories have a full green run triggered by !testme on a real PR, with all three stages (install / upgrade / backup+restore) actually exercised. The set must cover: a stateless/simple app, a single-DB app, a multi-service app, an SSO/identity app, and an object-storage/large-volume app. Target set (all previously verified deployable): hedgedoc (simple), cryptpad (stateful, no external DB), keycloak + authentik (SSO/identity, DB-backed), lasuite-docs and/or lasuite-drive (multi-service + S3/MinIO), matrix-synapse (DB + media store), immich (large volumes + Postgres), bluesky-pds (TLS-passthrough/atproto). Pick six that together satisfy the categories; record the chosen set and per-recipe green-run evidence in REVIEW.md. Any recipe that genuinely cannot be CI'd is a documented finding (in DECISIONS.md) with the reason, not a silent omission. Recipe availability: the testable repos live on the private mirror git.autonomic.zone/recipe-maintainers/<recipe> (already mirrored as of bootstrap: bluesky-pds, cryptpad, keycloak, lasuite-docs, lasuite-meet, matrix-synapse, n8n, custom-html, custom-html-tiny). Any recipe not yet mirrored (e.g. hedgedoc, authentik, immich, lasuite-drive) is pulled from upstream git.coopcloud.tech and created on the mirror via the recipe mirror+PR flow (§4.1) — so the target set is not capped by what currently exists. If the chosen simple/stateless app isn't mirrored, custom-html / custom-html-tiny already are.

When all of D1–D10 hold and are Adversary-verified, write ## DONE to STATUS.md with the evidence links and stop scheduling new iterations.
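
The exit condition can be expressed as a small predicate. The sketch below assumes the Adversary's latest verification time per criterion has already been parsed out of REVIEW.md into a dict; that parsing, and the name `loop_is_done`, are assumptions rather than part of the plan.

```python
from datetime import datetime, timedelta, timezone

CRITERIA = [f"D{i}" for i in range(1, 11)]  # D1..D10 from this section
MAX_AGE = timedelta(hours=24)               # "within the last 24h"


def loop_is_done(verified_at: dict[str, datetime], now: datetime) -> bool:
    """True only if every criterion was Adversary-verified within MAX_AGE.

    A missing or stale entry for any single criterion keeps the loop running:
    partial credit does not count.
    """
    for criterion in CRITERIA:
        ts = verified_at.get(criterion)
        if ts is None or now - ts > MAX_AGE:
            return False
    return True
```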

3. Repository layout (`git.autonomic.zone/recipe-maintainers/cc-ci`)

cc-ci/
├── README.md
├── flake.nix                 # NixOS host(s) + devshell
├── flake.lock
├── hosts/
│   └── cc-ci/
│       ├── configuration.nix # the cc-ci machine
│       └── hardware.nix
├── modules/
│   ├── drone.nix             # Drone server + runner (exec/docker)
│   ├── comment-bridge.nix    # !testme webhook listener service
│   ├── swarm.nix             # Docker + single-node swarm + `proxy` net; deploys the
│   │                         #   coop-cloud/traefik recipe via abra (wildcard/file-provider, §4.2)
│   ├── dashboard.nix         # results overview site
│   └── secrets.nix           # sops-nix / agenix wiring
├── secrets/                  # sops-encrypted (*.enc / *.age); see §4.4
│   └── secrets.yaml
├── bridge/                   # comment-bridge source (small Go/Python service)
├── runner/                   # CI orchestration entrypoint invoked by Drone
│   ├── run_recipe_ci.py      # top-level: deploy→test→teardown for a recipe@ref
│   └── harness/              # shared pytest fixtures (abra wrappers, app lifecycle)
├── dashboard/                # results UI generator (reads Drone API → static site)
├── tests/
│   ├── conftest.py           # shared fixtures, recipe selection, teardown guarantees
│   ├── <recipe>/
│   │   ├── test_install.py
│   │   ├── test_upgrade.py
│   │   ├── test_backup.py
│   │   └── playwright/       # e2e flows for this recipe
│   └── _template/            # copy-to-add-a-recipe template
├── docs/
│   ├── install.md            # from-scratch server build (D8)
│   ├── enroll-recipe.md      # how to add a recipe (D5)
│   ├── secrets.md            # secret model + rotation (D6)
│   ├── architecture.md
│   ├── runbook.md            # debugging failed runs
│   └── baseline.md           # bootstrap snapshot
├── STATUS.md  BACKLOG.md  REVIEW.md  JOURNAL.md  DECISIONS.md   # loop state (§7)
└── .drone.yml                # pipeline for cc-ci's own repo (lint/self-test)

4. Technical design (default architecture)

4.0 Domain model (where things live)

Two DNS zones, deliberately separated — do not conflate them:

git.autonomic.zone: source of truth for code (unchanged, not ours to reconfigure). The Gitea host: the enrolled recipe repos and the cc-ci config repo live here. The loop reads and comments here but deploys nothing here, and it does not register webhooks itself (an optional webhook is added manually by an admin, §4.1). Per §9 this zone is read/comment-only: never push recipe code, never point app DNS at it.
commoninternet.net — the CI server's own zone; everything CI-facing. A wildcard *.ci.commoninternet.net resolves to a gateway (not cc-ci directly — see Network path below). Under it:
- Apps under test: each run deploys to a unique subdomain <recipe>-pr<n>-<short-sha>.ci.commoninternet.net, so concurrent runs never collide on a hostname. The subdomain (app, volumes, secrets, Traefik route) is torn down at run end (§4.3).
- Results dashboard: ci.commoninternet.net — overview page + per-recipe status badges (§4.5).
- Webhook bridge: ci.commoninternet.net/hook — the Gitea issue_comment receiver (§4.1).
Network path (gateway → TLS passthrough → cc-ci). The wildcard record does not point at cc-ci's IP. It points at a gateway that passes TLS through to cc-ci: the gateway routes by SNI and forwards the raw encrypted stream without decrypting it, so TLS still terminates on cc-ci's Traefik. Consequences the agent must respect:
- Resolving <sub>.ci.commoninternet.net (with getent hosts from the sandbox, §1.5) returns the gateway's IP, not cc-ci's, so do not assert the record points at cc-ci. Reachability is proven end-to-end (an HTTPS request lands on cc-ci), not by comparing A records.
- The gateway is assumed to passthrough the whole wildcard, so a fresh per-run subdomain needs no gateway change and no cert work (the pre-issued wildcard already covers it) — the agent only adds the Traefik router on cc-ci. (If the gateway instead needs per-host config, that's an operator/gateway concern and a ## Blocked item, not something the agent reconfigures — the gateway is not ours, only cc-ci is, per §9.)
- The gateway is operator-managed and out of scope; the agent configures only cc-ci.
- Caveat for TLS-passthrough recipes (e.g. bluesky-pds, §2 D10): the default path terminates TLS at cc-ci's Traefik. A recipe that expects to terminate TLS in its own container needs cc-ci's Traefik configured to passthrough that host too (the outer gateway already passes the whole wildcard). Treat this as a per-recipe harness quirk to absorb (§5 M6.5), or pick a non-passthrough recipe for that D10 category and record the swap in DECISIONS.md — not a silent omission.
Wildcard TLS — operator pre-issues, agent serves it statically (no token in the agent). Routing and certs are separate: the preconfigured wildcard DNS solves routing only; a cert is still needed because the gateway passes TLS through and cc-ci's Traefik terminates it. The cert is pre-provisioned out-of-band so the DNS-editing token never enters the agent/repo. A wildcard SAN cert covering *.ci.commoninternet.net + ci.commoninternet.net (issued via Let's Encrypt DNS-01 against Gandi, by the operator, using a token the agent never sees) lives on cc-ci:
- /var/lib/ci-certs/live/fullchain.pem (leaf+intermediate) and …/privkey.pem.
- Traefik is the real coop-cloud/traefik recipe, deployed via abra (for e2e fidelity — see §4.2), run in its wildcard / file-provider mode (WILDCARDS_ENABLED=1 + compose.wildcard.yml). The pre-issued cert is supplied as the recipe's ssl_cert/ssl_key swarm secrets (sourced from the files above); the recipe's file provider then serves it under tls.certificates. No ACME resolver / no DNS provider is enabled — only the cert+key reach cc-ci, never the DNS token. One cert covers every per-run subdomain (matched by SNI), so a new app domain needs no cert work.
- Renewal is a manual operator task (LE 90-day cert): the operator re-issues out-of-band, then updates the ssl_cert/ssl_key secret (bump its version) and redeploys traefik. The agent must not attempt ACME/DNS-01 for commoninternet.net and must not expect a DNS token — a missing/expired cert is an operator action surfaced as a finding, not something the agent re-issues. (Rationale for choosing a wildcard cert over per-subdomain: a wildcard is reused for every churning run subdomain and sidesteps LE's 50-certs/week-per-domain limit; only DNS-01 can mint a wildcard. We keep that DNS-01 issuance with the operator rather than handing the agent the zone token.)
Record the live facts in docs/install.md: the zone + DNS provider (Gandi), that the wildcard *.ci.commoninternet.net (and bare ci.commoninternet.net) point at the gateway, that the gateway TLS-passthroughs the wildcard to cc-ci, the gateway's address, the TTL, and that the wildcard cert is pre-issued/operator-renewed at /var/lib/ci-certs/live/ (no DNS token on cc-ci).
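
The per-run hostname scheme described above can be sketched as follows. The 8-character short SHA and the function name `run_app_domain` are assumptions (the plan does not fix the short-sha length); the 63-character limit is the standard maximum length of a single DNS label.

```python
import re

BASE_DOMAIN = "ci.commoninternet.net"
SHORT_SHA_LEN = 8  # assumption: the plan does not specify the short-sha length
LABEL_RE = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def run_app_domain(recipe: str, pr: int, sha: str) -> str:
    """Build <recipe>-pr<n>-<short-sha>.ci.commoninternet.net for one run."""
    label = f"{recipe}-pr{pr}-{sha[:SHORT_SHA_LEN].lower()}"
    if not LABEL_RE.fullmatch(label):
        raise ValueError(f"not a valid DNS label: {label!r}")
    if len(label) > 63:
        raise ValueError(f"DNS label longer than 63 characters: {label!r}")
    return f"{label}.{BASE_DOMAIN}"
```

Because the label embeds both the PR number and the head commit, two concurrent runs only share a hostname if they test the same commit of the same PR.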

4.1 The `!testme` trigger path

Gitea does not natively forward PR-comment events to Drone, and Drone's built-in triggers fire on push/PR-open, not on a magic comment. So:

PR comment "!testme"
   │  Gitea webhook (issue_comment event)  ──►  comment-bridge (modules/comment-bridge.nix)
   │                                              • verifies webhook HMAC secret
   │                                              • checks comment body == "!testme" (exact, trimmed)
   │                                              • checks commenter is allowed (org member / collaborator)
   │                                              • resolves PR head repo + SHA via Gitea API
   │                                              • calls Drone API: build for cc-ci pipeline,
   │                                                params RECIPE=<repo> REF=<sha> PR=<n> SRC=<headrepo>
   ▼
Drone build (cc-ci repo pipeline, parameterized) ──► runner/run_recipe_ci.py
   ▼
Bridge posts/updates a Gitea PR comment with the run URL and (on completion) pass/fail.
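
The first two checks in the diagram (HMAC verification and the exact, trimmed body match) could look like the sketch below. It assumes Gitea sends a hex-encoded HMAC-SHA256 of the raw request body in its signature header; confirm that against the installed Gitea version. The function names are hypothetical.

```python
import hashlib
import hmac

TRIGGER = "!testme"


def signature_is_valid(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """Compare the received signature with HMAC-SHA256(secret, raw_body).

    Assumption: the header carries a hex digest of the unmodified body bytes.
    compare_digest avoids leaking how many leading characters matched.
    """
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    received = signature_hex.strip().lower()
    return hmac.compare_digest(expected.encode(), received.encode())


def is_trigger_comment(body: str) -> bool:
    """True only when the whole comment, after trimming whitespace, is !testme."""
    return body.strip() == TRIGGER
```

The signature must be computed over the bytes exactly as received; re-serialising the parsed JSON first would change the digest.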

The bridge is a tiny service (Go or Python+FastAPI). Keep it dependency-light; it's a NixOS systemd service behind Traefik at e.g. ci.commoninternet.net/hook (§4.0).
Trigger: POLLING is primary; webhook is an optional, admin-registered push optimization (SETTLED). Hard constraint: the CI server/bot must run on READ-level access — never repo-admin.
- Polling (primary, default): the bridge polls the Gitea API for new !testme comments on enrolled repos at ≤60s (satisfies D1). This is outbound (cc-ci → git.autonomic.zone, the reliably-working direction) and needs only read. It is the source of truth for triggering.
- Webhook (optional): the bridge keeps its /hook endpoint so a Gitea issue_comment webhook, if present, gives lower latency. But the server does NOT self-register webhooks (that needs repo-admin, which we refuse to require). Registration is a manual admin task, documented in docs/enroll-recipe.md (URL https://ci.commoninternet.net/hook, event issue_comment, content-type json, the shared HMAC secret, and the note that the Gitea instance must allow the host). The two paths must be idempotent in combination: a comment seen by both polling and the webhook starts exactly one run (for example, by deduplicating on the comment ID).
- (Webhook delivery on this instance was flaky early on — last_status: None — so polling being primary is also the robust choice, not just the low-privilege one.)
Commenter auth via org membership (read-level — no admin). The repo's explicit collaborator list is empty: the bot and the maintainers (trav/notplants) all reach the repo as recipe-maintainers org members/owners, so GET /collaborators/{user} 404s for everyone, and GET /collaborators/{user}/permission would authorize correctly but requires repo-admin — which we refuse. Instead authorize with GET /orgs/recipe-maintainers/members/{user} (204 = member = authorized; 404 = rejected) — readable by any org member (read-level), verified to admit trav/notplants/the bot and reject non-members. Note public_members is hidden here, so use the authenticated members endpoint (bot must be an org member, still read-level). Fail-closed on error. Zero-privilege fallback: a configured allowlist of usernames. (Still satisfies §6's non-collaborator-rejection check.)
Enrollment = adding the recipe to the bridge's poll list + ensuring a tests/<recipe>/ dir exists. The bot needs only read on the recipe repo (+ comment-back to post status). Registering a webhook is optional and operator/admin-side (documented in enroll-recipe.md), never required for CI to work.
Recipe mirror+PR flow (how a recipe gets a testable PR). Recipe repos under test live on the private mirror git.autonomic.zone/recipe-maintainers/<recipe>, mirrored from the official upstream git.coopcloud.tech. To bring a recipe under CI: abra recipe fetch <recipe> (pulls from upstream into ~/.abra/recipes/<recipe>), then mirror it to the org + open a PR via the recipe mirror+PR procedure — reference implementation: /srv/recipe-maintainer/.claude/commands/recipe-create-pr.md (creates recipe-maintainers/<recipe> if absent, force-syncs main from upstream so the PR diff is clean, pushes a branch, opens the PR). !testme on that PR is what kicks off a run. So a recipe missing from the mirror is not a blocker — mirror it first.
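Decide and record in DECISIONS.md, for the optional webhook path only: one shared Gitea org-level webhook vs per-repo webhooks. Org-level is fewer moving parts; per-repo is more explicit. Default: per-repo, registered manually by an admin following docs/enroll-recipe.md (the bot never registers webhooks itself).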
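
Two of the rules in this section lend themselves to a short sketch: the fail-closed membership check (204 = member, anything else rejected) and the requirement that a comment seen by both polling and the webhook fires only once. The names below are hypothetical, and the in-memory set is an assumption; a real bridge would persist seen IDs so a restart does not re-fire old comments.

```python
from typing import Optional


def commenter_is_authorized(status_code: Optional[int]) -> bool:
    """Map the org-members lookup result to an allow/deny decision.

    Only 204 authorizes. 404, any other status, or None (request failed)
    are all rejected, so errors fail closed.
    """
    return status_code == 204


class TriggerDeduper:
    """Remember which comments already started a run."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, int]] = set()

    def should_fire(self, repo: str, comment_id: int) -> bool:
        """Return True the first time a (repo, comment) pair is seen."""
        key = (repo, comment_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Keying on the comment ID rather than the PR keeps D1's "re-commenting re-runs" behaviour, since each new comment gets its own ID.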

4.2 Drone + the test target

Drone server connects to Gitea via OAuth app (Gitea → Settings → Applications). Runner is the exec runner (or a privileged docker runner) running on cc-ci itself, because tests must drive abra to deploy real recipes onto a real swarm.
cc-ci doubles as the deploy target: single-node Docker Swarm + abra, with the reverse proxy provided by the real coop-cloud/traefik recipe deployed via abra (not a hand-rolled Traefik — chosen for end-to-end fidelity: test apps route through the exact proxy a real Co-op Cloud host uses — web/web-secure entrypoints, the proxy overlay, the swarm provider). TLS terminates on it using the pre-issued static wildcard cert (§4.0): run the recipe in wildcard/file-provider mode (WILDCARDS_ENABLED=1 + compose.wildcard.yml) and supply the cert as the recipe's ssl_cert/ssl_key swarm secrets from /var/lib/ci-certs/live/. The operator preconfigures the wildcard DNS (→ gateway), the gateway's TLS-passthrough, and the cert itself (§4.4); the agent deploys the traefik recipe + swarm on top — no ACME, no DNS token on cc-ci. Make the abra app new/deploy traefik steps reproducible (scripted/Nix-invoked) for D8.
Each CI run gets an isolated app domain <recipe>-pr<n>-<short-sha>.ci.commoninternet.net (§4.0) so concurrent runs don't collide. Teardown removes app, secrets, and volumes.
Concurrency cap + queue — use Drone natively (SETTLED). Don't let the server fill with simultaneously-deployed apps. Expose a configurable MAX_TESTS mapped to the exec runner's DRONE_RUNNER_CAPACITY (Nix-set on the runner; default low — 1–2 given a single 28 GiB node and heavy recipes like matrix/immich). Drone runs at most MAX_TESTS builds at once and automatically queues excess builds (its native pending-build queue), starting them as slots free. Per-build timeout (repo/runner timeout) guarantees a hung test is killed and frees its slot — so "continue once a current test finishes or times out" is built in. No custom queue needed. Optionally also set concurrency: { limit: <N> } in .drone.yml as a per-pipeline cap.
One app at a time per run, torn down at run end. A build deploys its recipe, runs the three stages, then undeploys; the server should not accumulate live test apps. Guaranteed teardown plus the run-start janitor (§4.3) enforce this even when a build is timed out or killed (in-process cleanup can't run, so the janitor reaps it).

4.3 The test harness & recipe test contract

runner/run_recipe_ci.py orchestrates per run:

Fetch recipe at $REF (the PR head) via abra/git.
Install stage → tests/<recipe>/test_install.py: abra app new, generate secrets, abra app deploy, wait healthy, run Playwright smoke + assertions.
Upgrade stage → deploy previous published version first, then upgrade to $REF; assert data survives and app still healthy.
Backup/restore stage → abra app backup, mutate state, abra app restore, assert restored state matches pre-mutation.
Recipe-local tests (D4) → if <recipe-repo>/tests/ exists, discover & run it in the same live environment; merge results.
Teardown (always, even on failure) → abra app undeploy, abra app volume remove, abra app secret remove, DNS/route cleanup.
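
The "teardown always runs" shape of the orchestrator can be sketched with try/finally. This is not the real run_recipe_ci.py: stages are passed in as plain callables, and continuing to later stages after a failure is an assumption made here so that every stage gets its own reported result.

```python
from typing import Callable

Stage = tuple[str, Callable[[], None]]


def run_stages(stages: list[Stage], teardown: Callable[[], None]) -> dict[str, str]:
    """Run each stage in order, record pass/fail, and always call teardown."""
    results: dict[str, str] = {}
    try:
        for name, stage in stages:
            try:
                stage()
                results[name] = "pass"
            except Exception as exc:  # a failing stage must not skip teardown
                results[name] = f"fail: {exc}"
    finally:
        # Runs on success, on failure, and on KeyboardInterrupt.
        # It cannot run after SIGKILL, which is why the janitor exists.
        teardown()
    return results
```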

Shared fixtures (tests/conftest.py + runner/harness/) wrap abra. Known abra gotchas to bake in from day one (carried over from prior work, re-verify on the installed abra version):

abra app undeploy and abra app volume remove do not accept --chaos → never pass it.
Plumb a timeout kwarg through secret-generate/insert/remove-all calls.
abra app ls -S -m returns nested {server: {apps: [...]}} — parse the inner structure.
Pick robust health checks per app (e.g. Keycloak: /realms/master, not /).
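
For the nested `abra app ls -S -m` output, a flattening helper along these lines may help. It relies only on the `{server: {apps: [...]}}` shape stated above; the per-app fields are left untouched because their names should be re-verified on the installed abra version.

```python
def flatten_abra_apps(listing: dict) -> list[tuple[str, dict]]:
    """Turn {server: {"apps": [...]}} into a flat list of (server, app) pairs.

    Servers with no "apps" key, or with a null value, contribute nothing.
    """
    flat: list[tuple[str, dict]] = []
    for server, info in listing.items():
        for app in (info or {}).get("apps") or []:
            flat.append((server, app))
    return flat
```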

The teardown guarantee is sacred: a failed test must never leak a deployed app or volume into the next run. Implement teardown as a pytest fixture finalizer / try/finally in the orchestrator and add a janitor pass at run start that nukes any orphaned *-pr* apps older than N hours. Crucially, the janitor is the backstop for timed-out/killed builds: when Drone hits the per-build timeout (or a build is cancelled) it may SIGKILL the runner process, so the try/finally teardown can't run — those orphaned apps/volumes are reaped by the next build's run-start janitor (and the janitor should run regardless of how the previous build ended). Net effect with the MAX_TESTS/DRONE_RUNNER_CAPACITY cap (§4.2): at most MAX_TESTS apps are ever live at once, and each is torn down (or janitor-reaped) so the single node never accumulates deployments.
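
The janitor's selection step could be sketched as a pure function over app names and deploy times. How the deploy time is obtained is left open here (an assumption), as is the helper name `orphaned_run_apps`. The pattern is deliberately strict so that long-lived apps such as Traefik are never matched.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches only per-run hostnames of the form <recipe>-pr<n>-<sha>.ci.commoninternet.net
RUN_APP_RE = re.compile(r"^[a-z0-9-]+-pr\d+-[0-9a-f]+\.ci\.commoninternet\.net$")


def orphaned_run_apps(
    deployed_at: dict[str, datetime], now: datetime, max_age_hours: float
) -> list[str]:
    """Return per-run apps deployed more than max_age_hours ago, sorted by name."""
    cutoff = now - timedelta(hours=max_age_hours)
    return sorted(
        name
        for name, deployed in deployed_at.items()
        if RUN_APP_RE.match(name) and deployed < cutoff
    )
```

Keeping selection separate from deletion makes the risky part (what gets removed) easy to test without touching a swarm.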

4.4 Secrets (D6)

Secrets fall into two main classes, (A) and (B), which are handled in opposite ways; (C) only covers how the pipeline references values through Drone. Do not conflate them.

(A) Infra secrets. All of these end up sops-nix-encrypted in secrets/, are decrypted at activation to a root-owned runtime path (sops-nix's default is /run/secrets, not the world-readable Nix store), and are never world-readable. But they split into two sub-classes (see §1.5 for the concrete locations/usage), and only the first sub-class blocks:

(A1) External inputs — provided by the operator, the loop cannot create them. The Tailscale auth key + Gitea bot creds (/srv/cc-ci/.testenv, already provided), the pre-issued wildcard TLS cert at /var/lib/ci-certs/live/ (§4.0 — not a DNS token; the agent serves it, never issues it), and registry pull creds (if needed). If one of these is missing or invalid, the loop is blocked — write it to STATUS.md ## Blocked and stop (§9). The agent must not invent or work around an external input it wasn't given, and must not attempt ACME/DNS-01 for commoninternet.net.
(A2) Internal secrets — the loop generates and manages these itself; never block on them. Drone RPC secret + webhook HMAC (openssl rand), the Gitea OAuth app for Drone (created via the bot API), and the cc-ci host age/GPG key for sops. These are not human inputs; generate, store in secrets/, and wire both ends.

Alongside these, three preconfigured network/cert facts are operator-provided inputs the loop also depends on (not secrets the agent makes, but class-A in the same "provided, don't improvise" sense): (1) the wildcard *.ci.commoninternet.net record (and bare ci.commoninternet.net) already points at the gateway, (2) the gateway TLS-passthroughs that wildcard to cc-ci (SNI-routed, no decryption — see §4.0 Network path), and (3) the pre-issued wildcard cert is in place at /var/lib/ci-certs/live/. The operator owns the DNS record, the gateway, and cert issuance/renewal; everything else on cc-ci is the agent's job — Traefik (pointed at the static cert), swarm, per-run subdomain routing, and teardown. If the wildcard does not resolve, the gateway doesn't reach cc-ci, or the cert is missing/expired, that is a ## Blocked condition (operator action), not something to work around (the gateway and DNS are not ours to reconfigure, per §9).

(B) Recipe app secrets — generated by the test, persisted within the run. These are NOT a blocker and are NOT pre-provisioned by a human. The harness creates them itself for each app under test and is responsible for persisting them across the run so the multi-stage lifecycle works:

Generate at install: the harness runs abra app secret generate (+ inserts any deterministic test fixtures like an admin password / test user it chooses) when it deploys the app.
Persist for the run's duration: the same generated secrets must survive across stages (install → upgrade and especially backup → restore), because an app cannot be upgraded or restored against rotated credentials. Persist them in a per-run secret store keyed by the run's unique app name (e.g. <recipe>-pr<n>-<sha>): the live abra/swarm secrets plus a sidecar record the harness writes (e.g. the app's .env + the generated values) to a run-scoped, non-public location on the runner, so any stage can re-read them. They are ephemeral by design.
Destroy at teardown: the same teardown that removes the app/volumes also runs abra app secret remove (with timeout plumbed) and deletes the per-run sidecar. Nothing generated for a run outlives that run.
How the harness should "figure out" persistence (acceptance for D6): decide and document one concrete mechanism — recommended default is "abra/swarm holds the live secrets; the harness keeps a run-scoped sidecar file under a runs/<app-name>/ dir on the runner (mode 600), and reloads from it between stages." Whatever is chosen, it must (1) keep the same values stable across all three stages, (2) isolate concurrent runs from each other, and (3) leave nothing behind.
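A minimal sketch of the recommended sidecar mechanism, assuming a JSON file at runs/<app-name>/secrets.json. The function names, the file name, and the JSON layout are illustrative assumptions, not part of the harness as specified; the live secrets would still be held by abra/swarm.

```python
import json
import os
import shutil
from pathlib import Path


def save_run_secrets(runs_root: Path, app_name: str, secrets: dict) -> Path:
    """Write the generated values to runs/<app-name>/secrets.json (mode 600)."""
    run_dir = Path(runs_root) / app_name
    run_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    sidecar = run_dir / "secrets.json"
    # Open with restrictive permissions from the start, then enforce them
    # again in case the file already existed with a wider mode.
    fd = os.open(sidecar, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    os.fchmod(fd, 0o600)
    with os.fdopen(fd, "w") as handle:
        json.dump(secrets, handle)
    return sidecar


def load_run_secrets(runs_root: Path, app_name: str) -> dict:
    """Reload the same values between stages (install, upgrade, backup/restore)."""
    with open(Path(runs_root) / app_name / "secrets.json") as handle:
        return json.load(handle)


def destroy_run_secrets(runs_root: Path, app_name: str) -> None:
    """Teardown: remove the whole per-run directory; safe to call twice."""
    shutil.rmtree(Path(runs_root) / app_name, ignore_errors=True)
```

Keying the directory by the unique app name is what isolates concurrent runs in this sketch; the abra app secret remove call at teardown is separate and not shown.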

(C) Drone CI tokens: store as Drone org/repo secrets, referenced by the pipeline. Where a value is an external input (A1, e.g. registry creds) it is provided; where it is internal (A2) it is generated — see the (A) split above.

Hard rule across all classes: scrub secrets from logs before they reach the dashboard; the results UI shows sanitized logs only. Add a redaction filter in the log pipeline and an Adversary test that greps published logs and the overview site for known secret patterns and any generated app password.
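The redaction filter and the Adversary's leak grep could look roughly like the sketch below. The key=value pattern and the [REDACTED] marker are assumptions chosen for illustration; a real filter would be tuned to the secret formats actually in use.

```python
import re

# Illustrative pattern only: a keyword followed by "=" or ":" and a value.
SECRET_PATTERNS = [
    re.compile(r"(?i)(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)(\S+)"),
]


def redact(line: str, known_values=()) -> str:
    """Scrub known generated values first, then anything matching a pattern."""
    # Longest values first so a secret containing another is fully replaced.
    for value in sorted(known_values, key=len, reverse=True):
        if value:
            line = line.replace(value, "[REDACTED]")
    for pattern in SECRET_PATTERNS:
        line = pattern.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", line)
    return line


def find_leaks(published_text: str, known_values) -> list:
    """Adversary-side check: which known secrets appear verbatim in the output?"""
    return [value for value in known_values if value and value in published_text]
```

The harness knows every value it generated for a run, so passing those as known_values covers secrets that no generic pattern would catch.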

4.5 Results UX (D7) — YunoHost-CI-like

Per-run logs: Drone's native UI already gives live, per-stage, tail-able logs and a final status — use it as the canonical run view; the PR comment links to it.
Overview page: a small generator (dashboard/) polls the Drone API and renders a static page at ci.commoninternet.net (§4.0): a table of enrolled recipes, latest run status badge (pass/fail/running), last-tested version, link to history — mirroring the YunoHost app-list feel. Served by Traefik; regenerated on build-completion webhook or a short timer.
Provide a status badge endpoint per recipe for embedding in recipe READMEs.
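As a rough sketch of the overview generator's rendering step, assuming the latest run per recipe has already been fetched and flattened into dicts. The field names recipe, status, version and history_url are hypothetical, not the Drone API's shape.

```python
from html import escape

STATUSES = {"pass", "fail", "running"}


def render_overview(runs) -> str:
    """Render one table row per enrolled recipe, sorted by recipe name."""
    rows = []
    for run in sorted(runs, key=lambda r: r["recipe"]):
        # Anything unexpected is shown as "unknown" rather than trusted.
        status = run["status"] if run["status"] in STATUSES else "unknown"
        rows.append(
            "<tr>"
            f"<td>{escape(run['recipe'])}</td>"
            f'<td class="badge {status}">{status}</td>'
            f"<td>{escape(run['version'])}</td>"
            f'<td><a href="{escape(run["history_url"], quote=True)}">history</a></td>'
            "</tr>"
        )
    header = "<tr><th>Recipe</th><th>Status</th><th>Last tested</th><th>History</th></tr>"
    return "<table>\n" + "\n".join([header] + rows) + "\n</table>\n"
```

Escaping every field keeps recipe names or versions from injecting markup into the static page.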

5. Milestones / initial BACKLOG

Work top-down; each milestone ends with an Adversary gate (Adversary must independently verify the acceptance check before the next milestone starts). Seed BACKLOG.md from this.

M0 — Foundations. Repo created; flake builds; nixos-rebuild (or deploy-rs) applies a no-op-then-base config to cc-ci; sops decrypts a test secret on the host. Accept: ssh cc-ci 'systemctl is-system-running' healthy after a rebuild from the repo.
M1 — Swarm + abra target. Docker + single-node swarm + proxy network; the coop-cloud/traefik recipe deployed via abra (wildcard/file-provider mode, serving the pre-issued cert — §4.0/§4.2, not a custom Traefik); abra can deploy and tear down a trivial recipe by hand. Accept: a recipe deployed via abra is reachable over HTTPS (valid wildcard cert) on the web-secure entrypoint at *.ci.commoninternet.net, then fully torn down leaving no volumes; the proxy is verifiably the traefik recipe and no DNS/ACME token is present on cc-ci.
M2 — Drone online. Drone server+runner via Nix, OAuth to Gitea; a hello-world .drone.yml in cc-ci runs green; logs visible in Drone UI. Accept: push to cc-ci triggers a visible green Drone build.
M3 — Comment bridge. !testme on a PR triggers a parameterized Drone build; bridge posts a PR comment with the run link; non-!testme comments and non-collaborators are ignored. Accept: live demo on a scratch PR — comment in, build out, link back, auth enforced.
M4 — Harness + install stage. run_recipe_ci.py + conftest; install stage green for one simple recipe end-to-end with a Playwright assertion; guaranteed teardown. Accept: full green install run for recipe #1, no orphaned app/volume afterward.
M5 — Upgrade + backup/restore stages. Add the other two stages for recipe #1. Accept: upgrade preserves data; backup→mutate→restore returns original state.
M6 — Recipe-local tests (D4) + second recipe. Discover/run recipe-repo tests/; enroll a second, DB-backed recipe via the documented flow. Accept: both recipes green; recipe-local tests demonstrably executed and merged.
M6.5 — Breadth ramp. Enroll recipes 3→6 covering the remaining D10 categories, one at a time, each via the documented enroll flow (this is the real test of D5: enrolling recipe N should be template-copy + recipe-specific tests/fixtures, with no harness surgery). Expect per-recipe quirks — multi-service deps, S3/MinIO config, SSO client setup, TLS passthrough, large-volume backups — and absorb them into the shared harness, not one-off per-recipe hacks. When flakiness appears, add real readiness/wait robustness to the harness rather than sprinkling sleeps. Run benchmarks/long deploys sequentially, never in parallel (network contention). Accept: recipes 3–6 each have a full three-stage green run; enrolling N≥3 needed no changes to shared harness code.
M7 — Secrets hardening (D6). Full sops model, rotation doc, log redaction + leak test. Accept: Adversary's secret-grep over published logs finds nothing; rotation doc followed.
M8 — Dashboard (D7). Overview page + badges + PR-comment outcome reflection. Accept: overview matches reality across several runs; outcomes mirrored to PR comments.
M9 — Reproducibility + docs (D8/D9). docs/install.md rebuilds the server from scratch on a blank VM; all docs complete. Accept: Adversary rebuilds from docs onto a throwaway host (or records the tested subset).
M10 — Proof (D10). All six chosen recipes green via real !testme PRs (the breadth set from M6/M6.5 carried through the hardened pipeline), each with install/upgrade/backup-restore exercised and Adversary-verified; flip STATUS.md to DONE.

6. The two agents

Builder (primary)

Implements the backlog top-down. Discipline:

One backlog item in flight at a time. Small, committed, reversible steps.
Every change verified against the real system (server, Drone, Gitea) before claiming done — never "should work". Paste the verifying command + output into JOURNAL.md.
Touch production carefully: cc-ci is the only target; never deploy test apps onto unrelated production servers; never reuse production domains. Idempotent server changes only (via Nix).
If blocked on access/secrets/external state, write it to STATUS.md ## Blocked and pick up an unblocked item rather than hacking around it.

Adversary (reviewer)

Runs as a separate, independent loop in its own process/sandbox (see §6.1 for how the two loops coordinate). Its job is to disbelieve. It:

Re-verifies each Definition of Done and milestone-acceptance claim independently, from a cold start (fresh shell, own clone, no cached state), and logs PASS/FAIL + evidence in REVIEW.md.
Actively tries to break things: comment !testmexyz (should NOT trigger), comment as a non-collaborator (should be rejected), push a PR that fails tests (must report red, not green), kill an app mid-run (teardown must still clean up), grep published logs/dashboard for secrets, run two !testmes concurrently (no domain/volume/secret collision), confirm the same generated app secrets persist across install→upgrade→backup/restore.
Files every defect as a BACKLOG.md item tagged [adversary] with repro steps. The Builder may not close an adversary item; only the Adversary closes it after re-test.
Has veto power over STATUS.md → DONE.
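Two of the break-it probes above (!testmexyz and the non-collaborator comment) target the bridge's trigger check. A minimal sketch of that check follows, under the assumption that the command must sit alone on its own line; the plan does not fix the exact matching rule, and the function name is illustrative.

```python
import re

# Assumed rule: "!testme" alone on a line, surrounding whitespace allowed.
TRIGGER = re.compile(r"^[ \t]*!testme[ \t]*$", re.MULTILINE)


def should_trigger(comment_body: str, author: str, collaborators) -> bool:
    """True only for an exact !testme line posted by a collaborator."""
    if author not in collaborators:
        return False
    return TRIGGER.search(comment_body) is not None
```

With this rule, "!testmexyz" and "please !testme" are both ignored, while a comment whose second line is just "!testme" triggers a build.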

6.1 Coordination protocol (two independent loops, one shared repo)

The two loops never talk directly; the git repo is the only coordination medium. Each agent has its own clone (e.g. Builder in /srv/cc-ci/cc-ci, Adversary in /srv/cc-ci/cc-ci-adv) and its own pacing. To make concurrent writes conflict-free:

File ownership (one writer each — the other only reads):
- Builder owns: all source code/config, STATUS.md, JOURNAL.md, DECISIONS.md.
- Adversary owns: REVIEW.md.
- BACKLOG.md is split into two H2 sections — ## Build backlog (Builder-only) and ## Adversary findings (Adversary-only). Each agent edits only its own section, so git merges the two cleanly. Closing an item = checking the box in your own section; the Builder fixes an [adversary] finding and notes the fix in JOURNAL, but only the Adversary ticks it closed after re-test.
Append-only where possible. JOURNAL.md and REVIEW.md are append-only logs → they never conflict. Prefer appending over rewriting.
Git discipline (both loops, every write): git pull --rebase before editing, make the smallest change, commit, git push. A rebase conflict can land inside the other agent's file/section only if an ownership rule was broken; if that happens, re-pull and keep to your own files. Never --force.
Gate handshake via STATUS.md. When the Builder believes a milestone gate is met, it sets in STATUS.md: Gate: <Mn> — CLAIMED, awaiting Adversary and stops advancing past it. The Adversary, on its next wake, sees the claim, runs the acceptance check cold, and writes the verdict to REVIEW.md (<Mn>: PASS @<ts> with evidence, or FAIL + an [adversary] item). The Builder only proceeds past the gate after seeing PASS in REVIEW.md.
DONE handshake. Builder may write ## DONE to STATUS.md only when REVIEW.md shows a PASS dated within 24h for every D1–D10. The Adversary can write ## VETO <reason> to REVIEW.md at any time, which forbids DONE until cleared.
Liveness. If the Adversary sees a gate CLAIMED for too long with no Builder progress, or the Builder sees no Adversary verdict on a standing claim, note it in your own ledger and keep doing independent work; neither loop sits idle waiting on the other beyond its gate.

(If you are ever forced to run with a single process, the degraded fallback is to alternate roles per iteration and keep JOURNAL.md and REVIEW.md strictly separate — but two loops is the intended design.)
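The DONE handshake can be checked mechanically. The sketch below assumes REVIEW.md verdict lines of the form "D3: PASS @2026-05-27T05:00:00+00:00" (the D-prefixed form and ISO timestamps are assumptions extrapolated from the <Mn>: PASS @<ts> shape) and treats any "## VETO" line as standing, since how a veto is cleared is not specified here.

```python
import re
from datetime import datetime, timedelta, timezone

VERDICT = re.compile(r"^(D\d+): (PASS|FAIL)(?: @(\S+))?", re.MULTILINE)
REQUIRED = [f"D{i}" for i in range(1, 11)]  # D1..D10


def may_write_done(review_text: str, now: datetime,
                   max_age: timedelta = timedelta(hours=24)) -> bool:
    """True only if every D1-D10 has a fresh PASS as its latest verdict and no veto."""
    if re.search(r"^## VETO\b", review_text, re.MULTILINE):
        return False
    latest = {}
    # REVIEW.md is append-only, so the last verdict seen for an item wins.
    for item, verdict, stamp in VERDICT.findall(review_text):
        when = datetime.fromisoformat(stamp) if stamp else None
        latest[item] = (verdict, when)
    for item in REQUIRED:
        verdict, when = latest.get(item, ("FAIL", None))
        if verdict != "PASS" or when is None or now - when > max_age:
            return False
    return True
```

A later FAIL for an item overrides an earlier PASS, which matches the append-only reading of the ledger.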

7. The Loop Protocol

Both loops run this same shape; state lives in the repo so it survives restarts/compaction. On every wake, git pull --rebase first, then:

Orient. Read STATUS.md (phase, in-flight item, gate claims, blockers), BACKLOG.md, and the tail of REVIEW.md. Reconcile with reality via cheap probes (Drone health, last build, git log) — never trust the ledger blindly; if it disagrees with the system, fix the ledger first (your own files only — see §6.1).
Select.
- Builder: highest-priority open item in ## Build backlog: unresolved [adversary] findings > current milestone's next task > next milestone. Never advance past a milestone gate until REVIEW.md shows its PASS.
- Adversary: any standing Gate: <Mn> CLAIMED in STATUS.md to verify > re-verify a D1–D10 gate whose last PASS is stale (>24h) > a fresh break-it probe from §6.
Act. Smallest change that advances the item. Builder verifies against the real system; Adversary verifies from a cold start. Commit with a clear message (author per repo convention).
Record (your own files only). Builder: append to JOURNAL.md (what you did + verifying command/output + next), update STATUS.md, tick ## Build backlog. Adversary: append PASS/FAIL + evidence to REVIEW.md, add/close items in ## Adversary findings. Then git push.
Gate handshake (§6.1). Builder, on reaching a milestone, sets Gate: <Mn> CLAIMED, awaiting Adversary in STATUS.md and works on other unblocked items meanwhile. Adversary clears it with a REVIEW.md verdict. No gate is "passed" without a logged PASS.
Decide continuation. Builder writes ## DONE only when REVIEW.md shows a <24h PASS for every D1–D10 and no standing ## VETO. Otherwise schedule the next wake.

Pacing. Use /loop (self-paced) or ScheduleWakeup. Most waits here are for things the harness can't notify you about — a Drone build, a nixos-rebuild, a deploy converging — so poll the specific thing. Three cases:

Something in flight (build/deploy/nixos-rebuild) → re-check on a short cadence (≈4 min) to stay cache-warm; keep polling it, don't treat it as idle, and don't spin on a minutes-long build.
Blocked on the other loop — Builder parked at a CLAIMED gate awaiting the Adversary, or Adversary waiting for the Builder to fix an [adversary] finding. You don't need to busy-poll here: the watchdog signals across the handoff. The moment the Builder writes a CLAIMED gate, the watchdog pings the Adversary to verify now; the moment the Adversary updates REVIEW.md (verdict/finding), it pings the Builder to proceed (launch.sh, ~30 s detection). So you may sleep while blocked and trust the ping — but keep a fallback self-poll on a modest cadence (~2–4 min) in case a ping is missed (a dead session is restarted by the watchdog and re-orients from the repo anyway). The goal: a pending handoff resolves in well under a minute, not a whole idle interval.
Genuinely idle, nothing pending from either loop → sleep ~10–15 min, then re-orient.

Notes: The Adversary may idle freely when nothing is pending — it should NOT pointlessly re-verify or busy-poll to look busy. It gets woken by the watchdog the instant the Builder claims a gate, so "start verifying very soon after the Builder waits" is handled by the signal, not by the Adversary spinning. The Builder should prefer keeping an unblocked backlog item in hand so it's rarely fully blocked on a gate; only hit case 2 when everything is genuinely gated behind the pending verification — and then rely on the watchdog ping (+ fallback poll) rather than a long idle.
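The edge-triggered handoff check can be illustrated as follows. This is not the launch.sh implementation (which is a shell watchdog); it is a sketch of the same idea, with a hypothetical class name and an assumed gate-line regex, showing why a ping fires once per change rather than on every ~30 s poll.

```python
import hashlib
import re

# Matches e.g. "Gate: M3 ... CLAIMED" and captures the milestone id.
GATE = re.compile(r"^Gate:\s*(M\d+(?:\.\d+)?)\b.*\bCLAIMED\b", re.MULTILINE)


class HandoffWatcher:
    """Remembers the last seen state and reports only transitions."""

    def __init__(self):
        self._last_gate = None
        self._last_review = None

    def check(self, status_text: str, review_text: str) -> list:
        pings = []
        match = GATE.search(status_text)
        gate = match.group(1) if match else None
        # A newly claimed (or changed) gate wakes the Adversary.
        if gate is not None and gate != self._last_gate:
            pings.append("adversary")
        self._last_gate = gate
        digest = hashlib.sha256(review_text.encode("utf-8")).hexdigest()
        # The first observation only sets the baseline; later changes wake the Builder.
        if self._last_review is not None and digest != self._last_review:
            pings.append("builder")
        self._last_review = digest
        return pings
```

Calling check() on each poll with the current STATUS.md and REVIEW.md contents from the local clones returns an empty list while nothing has changed, so a sleeping loop is not woken needlessly.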

Anti-drift guards.

Cap retries: if an approach fails 3× the same way, stop, write the dead-end in DECISIONS.md, and try a different approach or mark blocked. No thrashing.
Never weaken a test to make it pass. A red test is information; "fix" the recipe/harness or file a finding — do not delete the assertion. (This is the single most important rule; the Adversary watches specifically for tests being softened or skipped.)
Keep changes reversible; prefer Nix-declared state over imperative server edits so any rebuild reproduces it.
Don't expand scope beyond §2. New ideas → BACKLOG.md (tagged [idea]), not into this run.
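The retry cap in the first guard can be tracked with a small counter keyed by approach and failure signature. The class name and the idea of a string "signature" for the failure are illustrative assumptions.

```python
from collections import Counter


class RetryCap:
    """Counts identical failures per approach and signals when to stop."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self._failures = Counter()

    def record_failure(self, approach: str, error_signature: str) -> bool:
        """Return True once the same approach has failed the same way `limit` times."""
        key = (approach, error_signature)
        self._failures[key] += 1
        return self._failures[key] >= self.limit
```

When record_failure returns True, the loop would write the dead end to DECISIONS.md and switch approach or mark the item blocked, rather than retry again.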

8. Open decisions to settle early (log in DECISIONS.md)

Deploy mechanism: nixos-rebuild --target-host vs deploy-rs/colmena. (Default: deploy-rs for atomic rollbacks; nixos-rebuild fine if simpler.)
Webhook scope: per-repo vs org-level Gitea webhook. (Default: per-repo via enroll script.)
Drone runner type: exec vs privileged docker. (Default: exec, since it must drive host abra.)
Secret tool: sops-nix vs agenix. (Default: sops-nix for multi-recipient + yaml ergonomics.)
Reverse proxy / Wildcard TLS: SETTLED — deploy the real coop-cloud/traefik recipe via abra (for e2e fidelity), in wildcard/file-provider mode, serving the operator's pre-issued wildcard cert; no ACME, no token (§4.0/§4.2). Supersedes the original plan's hand-rolled modules/traefik.nix. The operator issued the wildcard SAN cert (*.ci.commoninternet.net + ci.commoninternet.net) via LE DNS-01/Gandi out-of-band into /var/lib/ci-certs/live/; the agent feeds it as the recipe's ssl_cert/ssl_key swarm secrets so the DNS-editing token never reaches cc-ci. Manual renewal ~90 days (next ~2026-08-24): re-issue → update the secret → redeploy.
Proof recipe set (D10 — six, category-spanning). Default candidates, all previously verified deployable: hedgedoc, cryptpad, keycloak, authentik, lasuite-docs/lasuite-drive, matrix-synapse, immich, bluesky-pds. Lock the final six early so M4–M6.5 build toward them. Sequence easy→hard: prove the pipeline on hedgedoc/cryptpad before tackling SSO, S3, media stores, and TLS-passthrough recipes.

Each default stands until the Adversary or reality forces a change; record the change and why.

9. Guardrails / hard rules

Access boundary: only cc-ci is yours to reconfigure. Recipe repos: read + comment + (when enrolling) add a webhook — nothing else. Never push to a recipe repo's code.
No secrets in git/logs/UI. Ever. Verified by the Adversary's leak test.
No mocks for the e2e stages. D2 means real deploys. If something can't be tested for real, it's a finding, not a pass.
Idempotent + reversible. Anything done to the server must be re-derivable from the repo. Infra bring-up is declarative idempotent reconciliation in Nix — not manual post-steps and not run-once scripts. Each piece (swarm + proxy net, the traefik recipe deploy, Drone, the comment-bridge, the dashboard) is a systemd oneshot that re-runs on every activation/boot and converges to the desired state (inspect → act only if needed → no-op if already correct), like swarm-init. No /var/lib/.bootstrapped-style sentinels (they don't self-heal drift). The goal: a from-scratch install is git clone + nixos-rebuild switch + the operator preconditions — docs/install.md must never accumulate manual post-rebuild steps.
Stop on missing external infra inputs (class-A1 in §4.4: cc-ci SSH/root access, the Tailscale auth key, Gitea bot creds, the pre-issued wildcard cert at /var/lib/ci-certs/live/, registry creds — and the preconfigured DNS/gateway facts) rather than improvising around them; surface in STATUS.md ## Blocked. Never attempt ACME/DNS-01 for commoninternet.net — the cert is pre-provided and renewed out-of-band by the operator. This does NOT apply to internal infra secrets (class-A2: Drone RPC, webhook HMAC, Gitea OAuth app, host age key — the agent generates these) or to recipe app secrets (class-B): those the test harness generates itself (abra app secret generate + chosen fixtures), persists for the run, and destroys at teardown — a missing app secret is never a blocker, it is something the harness creates. See §4.4.
Honest reporting. If a stage is skipped or a check failed, say so in STATUS.md/JOURNAL.md with the output. The loop's value depends entirely on the ledgers being true.
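The "inspect → act only if needed → no-op if already correct" shape from the idempotency rule, reduced to its skeleton. In the plan this logic lives in Nix-declared systemd oneshots; the Python below only illustrates the control flow, and the inspect/apply callables are hypothetical stand-ins for real probes and actions.

```python
def reconcile(inspect, desired, apply) -> bool:
    """Converge to `desired`; return True if a change was made, False on no-op."""
    current = inspect()
    if current == desired:
        return False  # already correct: do nothing
    apply(desired)
    return True


# Illustrative in-memory "host" standing in for real state such as swarm status.
host = {"swarm": "inactive"}


def inspect_swarm():
    return host["swarm"]


def apply_swarm(state):
    host["swarm"] = state


first = reconcile(inspect_swarm, "active", apply_swarm)   # acts
second = reconcile(inspect_swarm, "active", apply_swarm)  # no-op
```

Because the decision is based on inspected state rather than a sentinel file, a re-run after drift (the swarm being torn down by hand, say) acts again instead of skipping.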
