DECISIONS — cc-ci Builder
Architecture decisions and dead-ends. One line of rationale each. (§0, §8)
Settled
- Wildcard TLS: operator pre-issues wildcard cert at `/var/lib/ci-certs/live/`; Traefik file provider serves it; no ACME for commoninternet.net. (Plan §4.0/§8, fixed.)
- Repo: `git.autonomic.zone/recipe-maintainers/cc-ci`, private. Bot is org admin. (Bootstrap.)
- Git credentials: helper script in repo-local git config sources `/srv/cc-ci/.testenv` at call time; no secret values stored in `.git/config` or commits.
- Proxy: real coop-cloud/traefik via abra (SETTLED: M1, orchestrator decision 2026-05-26, overrides plan §3 `modules/traefik.nix`). Instead of a hand-rolled Traefik we deploy the canonical Co-op Cloud `traefik` recipe via abra in wildcard / file-provider mode, for end-to-end fidelity (canonical `web`/`web-secure` entrypoints + `proxy`/swarm conventions every recipe expects; this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO DNS token on the box: `WILDCARDS_ENABLED=1` + append `compose.wildcard.yml`; the pre-issued cert is fed as the `ssl_cert`/`ssl_key` swarm secrets (v1) via `abra app secret insert … -f` from `/var/lib/ci-certs/live/{fullchain,privkey}.pem`. The file provider serves it (`tls.certificates`). `LETS_ENCRYPT_ENV=` empty on the traefik app and on every test app → the recipe's `tls.certresolver=${LETS_ENCRYPT_ENV}` label resolves to no resolver → routers serve the wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
  - Reproducibility (D8): `scripts/deploy-proxy.sh` is idempotent (ensures local abra server, fetches recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in `docs/install.md`. The custom `modules/traefik.nix` was removed; `modules/swarm.nix` keeps swarm init + `proxy` net + firewall 80/443.
  - Renewal (manual, ~90d): operator re-issues the wildcard at the same paths, then `abra app secret rm traefik.ci.commoninternet.net ssl_cert -n` + re-insert at a new version (bump `SECRET_WILDCARD_CERT_VERSION`) and redeploy. (Documented in `docs/secrets.md` at M7.)
  - abra teardown syntax (for harness, §4.3): `abra app undeploy <d> -n`, `abra app volume remove <d> -f -n`, `abra app secret remove <d> --all -n`. None take `--chaos`.
- Infra bring-up = idempotent-reconcile systemd oneshots (SETTLED: M2, orchestrator steer 2026-05-26). Every piece of swarm infra that abra deploys (traefik `modules/proxy.nix`, Drone `modules/drone.nix`, later comment-bridge + dashboard) is a `systemd.services.<x>` with `Type=oneshot` + `RemainAfterExit`, `after`/`requires` swarm-init + docker, `wants` network-online, `wantedBy` multi-user, embedding its script via `pkgs.writeShellApplication` (self-contained in the store, not a `/root/cc-ci` path). The script reconciles (inspect → converge → no-op if correct) on every activation/boot, with no run-once sentinel, so it self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit) on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to `git clone` + `nixos-rebuild switch` + operator preconditions, no manual post-steps. The old `scripts/deploy-*.sh` were folded into these modules and removed. `pkgs.abra` is provided via an overlay (`modules/packages.nix`) so all modules share the one pinned build.
  - Cert rotation note: the proxy reconcile inserts `ssl_cert`/`ssl_key` only if absent; rotating the wildcard means bumping `SECRET_WILDCARD_*_VERSION` (operator) so the next reconcile re-inserts. Documented in `docs/secrets.md` at M7.
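
The inspect → converge → no-op shape described for the bring-up oneshots can be sketched in shell roughly as follows. This is a minimal illustration, not the script embedded in `modules/proxy.nix`: the helper names (`require_file`, `reconcile`, `stack_present`, `deploy_stack`), the `run` argument, and the stack name `traefik_ci_commoninternet_net` are assumptions made for the example, and the cert-secret step is left out because its exact `abra app secret insert` arguments are not spelled out above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fail visibly (non-zero exit, so the oneshot unit is marked failed)
# when an operator precondition is missing or empty.
require_file() {
  [ -s "$1" ] || { echo "precondition missing: $1" >&2; return 1; }
}

# reconcile NAME CHECK CONVERGE
# Runs CONVERGE only when CHECK reports drift; otherwise it is a no-op.
reconcile() {
  local name="$1" check="$2" converge="$3"
  if "$check"; then
    echo "ok: $name"
  else
    echo "converging: $name"
    "$converge"
  fi
}

# Assumed names, for illustration only.
DOMAIN="${DOMAIN:-traefik.ci.commoninternet.net}"
STACK="${STACK:-traefik_ci_commoninternet_net}"

stack_present() {
  # Capture first: piping straight into `grep -q` under pipefail can report
  # a spurious failure when grep exits before docker finishes writing.
  local stacks
  stacks="$(docker stack ls --format '{{.Name}}')"
  grep -qxF "$STACK" <<<"$stacks"
}

deploy_stack() { abra app deploy "$DOMAIN" -n; }

main() {
  require_file /var/lib/ci-certs/live/fullchain.pem
  require_file /var/lib/ci-certs/live/privkey.pem
  reconcile "proxy stack" stack_present deploy_stack
}

# Only act when invoked as `<script> run`, so the functions can be sourced.
if [ "${1:-}" = "run" ]; then main; fi
```

Because there is no sentinel file, running the script twice in a row should deploy at most once: the second pass sees the stack and prints `ok: proxy stack`. The same check/converge pair would apply to each secret.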
Open (defaults from §8, to confirm as reality lands)
- Deploy mechanism (SETTLED: M0): `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run on cc-ci itself, with the repo materialised on the host at `/root/cc-ci`. Chosen over `--target-host`/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS proxy (slow/fragile). Atomic rollback preserved by Nix generations (`nixos-rebuild --rollback`). The switch is launched as a detached transient systemd unit (`systemd-run --unit=ccci-rebuild --collect`) so it survives a momentary ssh-over-tailscale drop during activation. For the build loop the host copy is synced from the sandbox clone via `tar | ssh` (rsync absent on host); source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo on a fresh host, then `nixos-rebuild switch --flake .#cc-ci`).
  - nixpkgs pin: flake pins the exact rev cc-ci already ran (`50ab793…`) so the first rebuild is a true no-op-then-base. Bump deliberately, never drift.
- Webhook scope: default per-repo via enroll script.
- CI engine: Drone (per plan), kept with a noted risk. nixpkgs 24.11 has Drone server 2.24.0 but `drone-runner-exec` is abandoned (unstable-2020-04-19); it is the only exec runner Drone ever shipped (upstream archived ~2021). The maintained fork Woodpecker (2.7.3, with NixOS modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific (D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern Drone server (RPC protocol stable). Fallback: if the exec runner proves incompatible/broken, pivot to Woodpecker (coop-cloud ships a `woodpecker` recipe too) and record it, as with the traefik pivot. Re-evaluate at the M2 gate.
- Drone deployment shape (SETTLED: M2): mirror the traefik pattern. The server is the coop-cloud `drone` recipe (`drone/drone:2.26.0`) deployed via abra (swarm-native, auto-routed by traefik at `drone.ci.commoninternet.net`, `LETS_ENCRYPT_ENV` empty → wildcard cert, no ACME), with Gitea SSO (`compose.gitea.yml`). The exec runner runs as a Nix systemd service on the host (`modules/drone-runner.nix`) so it can drive host abra/swarm (plan §4.2). One generated `DRONE_RPC_SECRET` is shared: inserted as the server's `rpc_secret` swarm secret AND read by the runner from sops. Reproducible deploy: `scripts/deploy-drone.sh`.
  - Gitea OAuth app `cc-ci-drone` created under the bot (client_id `ab4cdb9d-ee96-4867-875f-87384505fc52`, redirect `https://drone.ci.commoninternet.net/login`); client_secret + rpc_secret stored sops-encrypted in `secrets/secrets.yaml` (A2 internal secrets).
- Drone runner type: exec (must drive host abra).
- Secret tool (SETTLED: M0): sops-nix. cc-ci decrypts at activation using its ed25519 SSH host key as the age identity (`sops.age.sshKeyPaths`), so no extra key file to manage on the box. Recipients in `/.sops.yaml`: the host age key (`age1h90ut…`, from ssh-to-age) + an off-box master recovery key (`age1cmk26t…`; private half only at `/srv/cc-ci/.sops/master-age.txt` on the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing plaintext into `secrets/<f>.yaml` then `sops -e -i` (run inside the repo so `.sops.yaml` is found).
- D10 recipe set: lock six early. Candidates favouring already-mirrored: custom-html (simple), cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3), bluesky-pds (TLS-passthrough); covers all five categories. Confirm during M4–M6.5.
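
For the deploy mechanism settled in this list, the sync-then-detached-rebuild loop might look like the sketch below. The function names (`sync_repo`, `rebuild_argv`, `launch_rebuild`) are hypothetical and not taken from the repo; only the `systemd-run` and `nixos-rebuild` arguments come from the entry itself. `sync_repo` is meant to run from the sandbox, `launch_rebuild` on cc-ci.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Copy the sandbox clone to the host over plain ssh (no rsync needed).
sync_repo() {
  local src="$1" host="$2" dest="${3:-/root/cc-ci}"
  tar -C "$src" -cf - . | ssh "$host" "mkdir -p '$dest' && tar -xf - -C '$dest'"
}

# Print the rebuild command, one argument per line, so it can be inspected
# before anything is executed.
rebuild_argv() {
  local flake="${1:-/root/cc-ci#cc-ci}"
  printf '%s\n' systemd-run --unit=ccci-rebuild --collect \
    nixos-rebuild switch --flake "$flake"
}

# Launch the switch as a detached transient unit on the host itself, so a
# dropped ssh session does not interrupt activation.
launch_rebuild() {
  local -a argv
  mapfile -t argv < <(rebuild_argv "$@")
  "${argv[@]}"
}
```

Since the unit is detached, progress would be followed separately, for example with `journalctl -u ccci-rebuild -f`.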
Risks
- Disk (RESOLVED 2026-05-26). Original 8.9 GiB root had only ~3.8 GiB free and a hard inode ceiling (586k total, ~6k free); the flake's nixpkgs fetch (~50k files) hit ENOSPC on inodes before bytes. Operator grew the VM to 28 GiB (22 GiB free, 1.78M inodes / 1.21M free); the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown + periodic `docker image prune` to avoid regressing during M6.5 breadth.
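
Since the original failure was inode exhaustion rather than byte exhaustion, a periodic guard could watch free inodes and prune when they run low. The sketch below is one possible shape, not something the repo is known to contain: the function name `low_on_inodes`, the `MIN_PCT` variable, the 10% default, and the `run` argument are all assumptions, and it relies on GNU `df --output`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# low_on_inodes FREE TOTAL MIN_PCT
# Succeeds when FREE/TOTAL is below MIN_PCT percent (integer arithmetic).
low_on_inodes() {
  local free="$1" total="$2" min_pct="$3"
  [ "$total" -gt 0 ] || return 1   # filesystem without a fixed inode table
  [ $(( free * 100 / total )) -lt "$min_pct" ]
}

main() {
  local total free
  # GNU df prints a header line, then "itotal iavail" for the root filesystem.
  read -r total free < <(df --output=itotal,iavail / | tail -n 1)
  if low_on_inodes "$free" "$total" "${MIN_PCT:-10}"; then
    echo "inodes low ($free of $total free), pruning dangling images"
    docker image prune -f
  fi
}

# Only act when invoked as `<script> run`, so the function can be sourced.
if [ "${1:-}" = "run" ]; then main; fi
```

With the figures recorded above, the pre-resize state (about 6k of 586k free) would trip a 10% threshold, while the post-resize state (1.21M of 1.78M free) would not.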
Dead-ends
- (none yet)