M2: Drone server + exec runner up; infra as idempotent-reconcile oneshots

Convert proxy+drone bring-up to writeShellApplication systemd oneshots that
reconcile every activation (orchestrator steer). pkgs.abra overlay. Runner
connected via RPC (polling, capacity=2). install.md = clone + nixos-rebuild switch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 22:59:59 +01:00
parent 62b23e3a41
commit a385148af9
11 changed files with 296 additions and 113 deletions


@ -2,8 +2,12 @@
> WORK IN PROGRESS — grows with each milestone; the full from-scratch rebuild is verified at M9 (D8).
cc-ci is declared as a NixOS flake (this repo) plus a reproducible proxy-deploy step. Target:
a NixOS 24.11 host reachable as `cc-ci` over SSH (root), with the operator preconditions in place.
cc-ci is declared **entirely** as a NixOS flake (this repo). Bringing up the box is just
**clone + `nixos-rebuild switch`** + the operator preconditions — no manual post-steps. The proxy
(traefik) and Drone server are deployed by **idempotent-reconcile systemd oneshots**
(`modules/proxy.nix`, `modules/drone.nix`) that converge the swarm to the desired state on every activation
and boot (and self-heal drift), mirroring `swarm-init`. Target: a NixOS 24.11 host reachable as
`cc-ci` over SSH (root).
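
The idempotent-reconcile pattern described above can be sketched in a few lines of shell. This is a minimal illustration, not the contents of `modules/proxy.nix` or `modules/drone.nix`: a file under `STATE_DIR` stands in for the real swarm state, and the names `current_version`, `apply_version`, and `reconcile` are hypothetical.

```sh
#!/usr/bin/env bash
# Sketch of an idempotent reconcile step. A file under STATE_DIR stands in for
# the deployed state (on cc-ci that would be the swarm). All names are hypothetical.
set -euo pipefail

STATE_DIR="${STATE_DIR:-$(mktemp -d)}"

# Print the currently recorded version of a stack, or nothing if it is absent.
current_version() {
  cat "$STATE_DIR/$1" 2>/dev/null || true
}

# Stand-in for the real deploy command: record the new version.
apply_version() {
  printf '%s\n' "$2" > "$STATE_DIR/$1"
}

# Converge stack $1 to version $2, acting only when the observed state differs.
reconcile() {
  local name="$1" desired="$2" current
  current="$(current_version "$name")"
  if [ "$current" = "$desired" ]; then
    echo "$name: in sync"
  else
    echo "$name: drift (have '${current:-none}', want '$desired'), applying"
    apply_version "$name" "$desired"
  fi
}

reconcile traefik v1   # first run finds nothing deployed and applies
reconcile traefik v1   # second run observes the desired state and does nothing
```

Because the step compares observed against desired state before acting, running it on every activation and boot is harmless when nothing changed, and it repairs the state when something drifted.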
## Operator preconditions (class-A1, see DECISIONS.md / docs/baseline.md)
@ -12,43 +16,40 @@ a NixOS 24.11 host reachable as `cc-ci` over SSH (root), with the operator preco
- DNS: `*.ci.commoninternet.net` (+ bare) → the **gateway**, which TLS-passthroughs (SNI) to cc-ci.
- Firewall path: gateway reaches cc-ci on tcp/80+443 (opened by `modules/swarm.nix`).
## 1. Apply the NixOS flake
## 1. Apply the NixOS flake (this is the whole install)
The flake (`flake.nix`, `hosts/cc-ci/`, `modules/`) declares: base host, sops-nix (decrypts via the
host SSH key), Docker + single-node Swarm + the `proxy` overlay (`modules/swarm.nix`), and abra
(`modules/abra.nix`).
host SSH key), Docker + single-node Swarm + the `proxy` overlay + firewall 80/443
(`modules/swarm.nix`), abra (`modules/abra.nix` / `packages.nix`), the **traefik reconcile oneshot**
(`modules/proxy.nix`), the **Drone server reconcile oneshot** (`modules/drone.nix`), and the
**Drone exec runner** (`modules/drone-runner.nix`).
```sh
# materialise the repo on the host (the build runs on cc-ci itself — see DECISIONS.md deploy mech)
# e.g. git clone <repo> /root/cc-ci (or sync it)
nixos-rebuild switch --flake /root/cc-ci#cc-ci
# verify
systemctl is-system-running # -> running
docker info --format '{{.Swarm.LocalNodeState}}' # -> active
docker network ls | grep proxy # -> proxy ... overlay swarm
```
On activation, the reconcile oneshots (`deploy-proxy`, `deploy-drone`) run automatically and converge
the swarm. Verify:
```sh
systemctl is-system-running # -> running
docker info --format '{{.Swarm.LocalNodeState}}' # -> active
docker service ls # traefik (app+socket-proxy) + drone, all 1/1
systemctl is-active deploy-proxy deploy-drone drone-runner-exec # -> active x3
# wildcard cert served end-to-end via the gateway:
curl -ksv --resolve probe.ci.commoninternet.net:443:<gateway-ip> https://probe.ci.commoninternet.net/ \
2>&1 | grep -E 'subject:|HTTP/' # -> CN=*.ci.commoninternet.net, HTTP 404 (no app router yet)
curl -ks --resolve drone.ci.commoninternet.net:443:<gateway-ip> \
-o /dev/null -w '%{http_code}\n' https://drone.ci.commoninternet.net/healthz # -> 200
```
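
The "all 1/1" check on `docker service ls` can also be scripted rather than read by eye. The sketch below is an assumption-laden helper, not part of the repo: `unconverged_services` is a hypothetical function that reads `name running/desired` lines on stdin, and the service names in the demo are made up.

```sh
#!/usr/bin/env bash
# Hypothetical helper: list swarm services whose running replica count differs
# from the desired count. Input is one "name running/desired" pair per line.
set -euo pipefail

unconverged_services() {
  local name replicas _rest rc=0
  while read -r name replicas _rest; do
    [ -n "$name" ] || continue
    # "1/1" -> running=1, desired=1; any mismatch means not yet converged.
    if [ "${replicas%%/*}" != "${replicas##*/}" ]; then
      echo "$name $replicas"
      rc=1
    fi
  done
  return "$rc"
}

# Demo with made-up service names; prints the lagging service, then the notice.
printf '%s\n' 'traefik_app 1/1' 'drone_app 0/1' \
  | unconverged_services || echo "not converged yet"
```

On the host, the input would come from `docker service ls --format '{{.Name}} {{.Replicas}}'` piped into the function; a zero exit status then means every service reports its desired replica count.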
> Tip: when driving the switch over an SSH session that rides Tailscale, run it as a detached unit so
> it survives a momentary drop, and **use the absolute flake path** (systemd units run with cwd `/`):
> `systemd-run --unit=ccci-sw --property=Type=oneshot nixos-rebuild switch --flake /root/cc-ci#cc-ci`
## 2. Deploy the reverse proxy (coop-cloud traefik, wildcard/file-provider, no ACME)
## 2. (later milestones) comment-bridge, dashboard, recipe enrollment
```sh
bash /root/cc-ci/scripts/deploy-proxy.sh
```
This idempotently deploys the canonical Co-op Cloud `traefik` recipe via abra in wildcard mode,
serving the pre-issued cert as the `ssl_cert`/`ssl_key` swarm secrets, with `LETS_ENCRYPT_ENV` empty
so no ACME ever runs (see DECISIONS.md "Proxy: real coop-cloud/traefik via abra"). Verify:
```sh
docker service ls | grep traefik # app + socket-proxy, 1/1
# wildcard cert served end-to-end via the gateway:
curl -ksv --resolve probe.ci.commoninternet.net:443:<gateway-ip> https://probe.ci.commoninternet.net/ \
2>&1 | grep -E 'subject:|HTTP/' # -> CN=*.ci.commoninternet.net, HTTP 404 (no app router yet)
```
## 3. (later milestones) Drone, comment-bridge, dashboard, recipe enrollment
See `docs/enroll-recipe.md` (D5), `docs/secrets.md` (D6), `docs/runbook.md`. Added as those land.
See `docs/enroll-recipe.md` (D5), `docs/secrets.md` (D6), `docs/runbook.md`. Each new piece of infra
is added as another idempotent reconcile oneshot, so this install stays a single `nixos-rebuild`.