cc-ci/DECISIONS.md
autonomic-bot a385148af9 M2: Drone server + exec runner up; infra as idempotent-reconcile oneshots
Convert proxy+drone bring-up to writeShellApplication systemd oneshots that
reconcile every activation (orchestrator steer). pkgs.abra overlay. Runner
connected via RPC (polling, capacity=2). install.md = clone + nixos-rebuild switch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 22:59:59 +01:00


DECISIONS — cc-ci Builder

Architecture decisions and dead-ends. One line of rationale each. (§0, §8)

Settled

  • Wildcard TLS: operator pre-issues wildcard cert at /var/lib/ci-certs/live/; Traefik file provider serves it; no ACME for commoninternet.net. (Plan §4.0/§8 — fixed.)

  • Repo: git.autonomic.zone/recipe-maintainers/cc-ci, private. Bot is org admin. (Bootstrap.)

  • Git credentials: helper script in repo-local git config sources /srv/cc-ci/.testenv at call time — no secret values stored in .git/config or commits.

  • Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26, overrides plan §3 modules/traefik.nix). Instead of a hand-rolled Traefik we deploy the canonical Co-op Cloud traefik recipe via abra in wildcard / file-provider mode, for end-to-end fidelity (canonical web/web-secure entrypoints + proxy/swarm conventions every recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO DNS token on the box:

    • WILDCARDS_ENABLED=1 + append compose.wildcard.yml; the pre-issued cert is fed as the ssl_cert/ssl_key swarm secrets (v1) via abra app secret insert … -f from /var/lib/ci-certs/live/{fullchain,privkey}.pem. The file provider serves it (tls.certificates).
    • LETS_ENCRYPT_ENV= empty on the traefik app and on every test app → the recipe's tls.certresolver=${LETS_ENCRYPT_ENV} label resolves to no resolver → routers serve the wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
    • Reproducibility (D8): scripts/deploy-proxy.sh is idempotent (ensures local abra server, fetches recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in docs/install.md. The custom modules/traefik.nix was removed; modules/swarm.nix keeps swarm init + proxy net + firewall 80/443.
    • Renewal (manual, ~90d): operator re-issues the wildcard at the same paths, then abra app secret rm traefik.ci.commoninternet.net ssl_cert -n + re-insert at a new version (bump SECRET_WILDCARD_CERT_VERSION) and redeploy. (Documented in docs/secrets.md at M7.)
    • abra teardown syntax (for harness, §4.3): abra app undeploy <d> -n, abra app volume remove <d> -f -n, abra app secret remove <d> --all -n. None take --chaos.
  • Infra bring-up = idempotent-reconcile systemd oneshots — SETTLED (M2, orchestrator steer 2026-05-26). Every piece of swarm infra that abra deploys (traefik modules/proxy.nix, Drone modules/drone.nix, later comment-bridge + dashboard) is a systemd.services.<x> with Type=oneshot + RemainAfterExit, after/requires swarm-init + docker, wants network-online, wantedBy multi-user, embedding its script via pkgs.writeShellApplication (self-contained in the store, not a /root/cc-ci path). The script reconciles (inspect → converge → no-op if correct) on every activation/boot — no run-once sentinel — so it self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit) on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to git clone + nixos-rebuild switch + operator preconditions, no manual post-steps. The old scripts/deploy-*.sh were folded into these modules and removed. pkgs.abra is provided via an overlay (modules/packages.nix) so all modules share the one pinned build.

    • Cert rotation note: the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the wildcard means bumping SECRET_WILDCARD_*_VERSION (operator) so the next reconcile re-inserts. Documented in docs/secrets.md at M7.
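
The inspect → converge → no-op loop described in the infra bring-up entry can be sketched roughly as below. This is a minimal illustration, not the contents of modules/proxy.nix or modules/drone.nix: STATE_FILE, stack_exists, deploy_stack and reconcile_stack are hypothetical names, and the two helpers fake the swarm with a text file so the sketch runs anywhere. A real unit would inspect the swarm and call abra at those two points.

```shell
#!/usr/bin/env bash
# Sketch only: every name below is hypothetical. The swarm is simulated by a
# text file so the reconcile logic can be exercised without docker or abra.
set -eu

STATE_FILE="$(mktemp)"   # simulated swarm: one deployed stack name per line

# Inspect: is the stack already deployed? (stand-in for a real swarm query)
stack_exists() { grep -qxF -- "$1" "$STATE_FILE"; }

# Converge: deploy the stack. (stand-in for the real abra deploy)
deploy_stack() { printf '%s\n' "$1" >> "$STATE_FILE"; }

# reconcile_stack NAME CERT_PATH
# Runs on every activation; there is no run-once sentinel.
reconcile_stack() {
  local name="$1" cert="$2"

  # Fail visibly on a missing precondition so the oneshot ends up "failed".
  if [ ! -s "$cert" ]; then
    echo "precondition missing: $cert" >&2
    return 1
  fi

  if stack_exists "$name"; then
    echo "noop $name"        # already correct: do nothing
  else
    deploy_stack "$name"     # drifted or first run: converge
    echo "deployed $name"
  fi
}

# Demo: the first activation deploys, later activations are no-ops.
cert="$(mktemp)"; printf 'dummy\n' > "$cert"
reconcile_stack traefik "$cert"   # prints: deployed traefik
reconcile_stack traefik "$cert"   # prints: noop traefik
rm -f "$cert"
```

Because the decision is taken from observed state rather than from a marker file, deleting the stack (here: emptying STATE_FILE) makes the next run deploy it again, which is the self-healing behaviour the entry relies on.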

Open (defaults from §8, to confirm as reality lands)

  • Deploy mechanism — SETTLED (M0): nixos-rebuild switch --flake /root/cc-ci#cc-ci run on cc-ci itself, with the repo materialised on the host at /root/cc-ci. Chosen over --target-host/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS proxy (slow/fragile). Atomic rollback preserved by Nix generations (nixos-rebuild --rollback). The switch is launched as a detached transient systemd unit (systemd-run --unit=ccci-rebuild --collect) so it survives a momentary ssh-over-tailscale drop during activation. For the build loop the host copy is synced from the sandbox clone via tar | ssh (rsync absent on host); source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo on a fresh host, then nixos-rebuild switch --flake .#cc-ci).
    • nixpkgs pin: flake pins the exact rev cc-ci already ran (50ab793…) so the first rebuild is a true no-op-then-base. Bump deliberately, never drift.
  • Webhook scope: default per-repo via enroll script.
  • CI engine: Drone (per plan) — kept, with a noted risk. nixpkgs 24.11 has Drone server 2.24.0 but drone-runner-exec is abandoned (unstable-2020-04-19) — the only exec runner Drone ever shipped (upstream archived ~2021). The maintained fork Woodpecker (2.7.3, with NixOS modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific (D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern Drone server (RPC protocol stable). Fallback: if the exec runner proves incompatible/broken, pivot to Woodpecker (coop-cloud ships a woodpecker recipe too) and record it — like the traefik pivot. Re-evaluate at the M2 gate.
  • Drone deployment shape — SETTLED (M2): mirror the traefik pattern. The server is the coop-cloud drone recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by traefik at drone.ci.commoninternet.net, LETS_ENCRYPT_ENV empty → wildcard cert, no ACME), with Gitea SSO (compose.gitea.yml). The exec runner runs as a Nix systemd service on the host (modules/drone-runner.nix) so it can drive host abra/swarm (plan §4.2). One generated DRONE_RPC_SECRET is shared: inserted as the server's rpc_secret swarm secret AND read by the runner from sops. Reproducible deploy: scripts/deploy-drone.sh.
    • Gitea OAuth app cc-ci-drone created under the bot (client_id ab4cdb9d-ee96-4867-875f-87384505fc52, redirect https://drone.ci.commoninternet.net/login); client_secret + rpc_secret stored sops-encrypted in secrets/secrets.yaml (A2 internal secrets).
  • Drone runner type: exec (must drive host abra).
  • Secret tool — SETTLED (M0): sops-nix. cc-ci decrypts at activation using its ed25519 SSH host key as the age identity (sops.age.sshKeyPaths), so no extra key file to manage on the box. Recipients in /.sops.yaml: the host age key (age1h90ut…, from ssh-to-age) + an off-box master recovery key (age1cmk26t…; private half only at /srv/cc-ci/.sops/master-age.txt on the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing plaintext into secrets/<f>.yaml then sops -e -i (run inside the repo so .sops.yaml is found).
  • D10 recipe set: lock six early. Candidates favouring already-mirrored: custom-html (simple), cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3), bluesky-pds (TLS-passthrough); together they cover all five categories. Confirm during M4-M6.5.
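
The detached rebuild from the Deploy mechanism entry can be sketched as follows. The unit name, flags and flake reference are the ones given in that entry; the run_rebuild wrapper and the DRY_RUN switch are hypothetical additions so the sketch only prints the command unless told otherwise.

```shell
#!/usr/bin/env bash
# Sketch only: run_rebuild and DRY_RUN are hypothetical; the command itself is
# assembled from the values in the "Deploy mechanism" entry.
set -eu

UNIT="ccci-rebuild"
FLAKE="/root/cc-ci#cc-ci"

run_rebuild() {
  # Build the argv once so the printed and the executed command cannot differ.
  set -- systemd-run --unit="$UNIT" --collect \
         nixos-rebuild switch --flake "$FLAKE"

  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"   # dry run (default): show what would be launched
  else
    "$@"                 # launch the switch as a transient systemd unit
  fi
}

run_rebuild
```

Running the switch under systemd-run rather than in the ssh session's own process tree is what lets it outlive a dropped connection; with DRY_RUN=0 the call returns once the transient unit is started, and progress would then be followed through the unit's journal.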

Risks

  • Disk — RESOLVED 2026-05-26. Original 8.9 GiB root had only ~3.8 GiB free and a hard inode ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on inodes before bytes. Operator grew the VM to 28 GiB (22 GiB free, 1.78M inodes / 1.21M free); the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown + periodic docker image prune to avoid regressing during M6.5 breadth.
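
Since the original failure was inode exhaustion with bytes still free, a headroom check needs to look at both figures. The sketch below is one way to do that, under stated assumptions: the thresholds (100000 free inodes, 2 GiB free) are hypothetical and not taken from the plan, check_headroom and root_headroom are illustrative names, and root_headroom assumes GNU df.

```shell
#!/usr/bin/env bash
# Sketch only: thresholds and function names are hypothetical.
set -eu

MIN_FREE_INODES="${MIN_FREE_INODES:-100000}"
MIN_FREE_KIB="${MIN_FREE_KIB:-2097152}"   # 2 GiB expressed in KiB

# check_headroom FREE_INODES FREE_KIB
# Prints "ok", "low-inodes" or "low-bytes"; non-zero exit when low.
check_headroom() {
  local inodes="$1" kib="$2"
  if [ "$inodes" -lt "$MIN_FREE_INODES" ]; then
    echo "low-inodes"; return 1
  fi
  if [ "$kib" -lt "$MIN_FREE_KIB" ]; then
    echo "low-bytes"; return 1
  fi
  echo "ok"
}

# root_headroom: feed the live figures for / into the check (GNU df assumed).
# Defined for reference; not invoked by this sketch.
root_headroom() {
  df -k --output=iavail,avail / | tail -n 1 | {
    read -r inodes kib
    check_headroom "$inodes" "$kib"
  }
}

# Approximate figures from the entry above.
check_headroom 6000 3984588 || true   # original VM (~6k inodes, ~3.8 GiB): low-inodes
check_headroom 1210000 23068672       # after growth (1.21M inodes, 22 GiB): ok
```

A bytes-only check would have passed on the original VM, which is why the inode test comes first here.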

Dead-ends

  • (none yet)