cc-ci/DECISIONS.md
autonomic-bot 12f86fd3fb M1: proxy via real coop-cloud/traefik (abra, wildcard/no-ACME); recipe deploy+teardown; M1 CLAIMED
Orchestrator decision: deploy canonical coop-cloud traefik via abra instead of a
hand-rolled module. abra packaged in Nix (pinned). custom-html deployed over HTTPS
(200) via the gateway and torn down clean. docs/install.md seeded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 22:21:12 +01:00


DECISIONS — cc-ci Builder

Architecture decisions and dead-ends. One line of rationale each. (§0, §8)

Settled

  • Wildcard TLS: operator pre-issues wildcard cert at /var/lib/ci-certs/live/; Traefik file provider serves it; no ACME for commoninternet.net. (Plan §4.0/§8 — fixed.)

  • Repo: git.autonomic.zone/recipe-maintainers/cc-ci, private. Bot is org admin. (Bootstrap.)

  • Git credentials: helper script in repo-local git config sources /srv/cc-ci/.testenv at call time — no secret values stored in .git/config or commits.

  • Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26, overrides plan §3 modules/traefik.nix). Instead of a hand-rolled Traefik we deploy the canonical Co-op Cloud traefik recipe via abra in wildcard / file-provider mode, for end-to-end fidelity (canonical web/web-secure entrypoints + proxy/swarm conventions every recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO DNS token on the box:

    • WILDCARDS_ENABLED=1 + append compose.wildcard.yml; the pre-issued cert is fed as the ssl_cert/ssl_key swarm secrets (v1) via abra app secret insert … -f from /var/lib/ci-certs/live/{fullchain,privkey}.pem. The file provider serves it (tls.certificates).
    • LETS_ENCRYPT_ENV= empty on the traefik app and on every test app → the recipe's tls.certresolver=${LETS_ENCRYPT_ENV} label resolves to no resolver → routers serve the wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
    • Reproducibility (D8): scripts/deploy-proxy.sh is idempotent (ensures local abra server, fetches recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in docs/install.md. The custom modules/traefik.nix was removed; modules/swarm.nix keeps swarm init + proxy net + firewall 80/443.
    • Renewal (manual, ~90d): operator re-issues the wildcard at the same paths, then abra app secret rm traefik.ci.commoninternet.net ssl_cert -n + re-insert at a new version (bump SECRET_WILDCARD_CERT_VERSION) and redeploy. (Documented in docs/secrets.md at M7.)
    • abra teardown syntax (for harness, §4.3): abra app undeploy <d> -n, abra app volume remove <d> -f -n, abra app secret remove <d> --all -n. None take --chaos.
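
The teardown calls listed above could be wrapped for the harness roughly as follows. This is a minimal sketch: `teardown_app` is a hypothetical helper name, and the ordering (undeploy, then volumes, then secrets) is an assumption; only the three `abra` invocations and their flags are taken from the list.

```shell
# Sketch: tear down one test app using the three abra calls from the list above.
# `teardown_app` is a hypothetical helper name, not something the repo defines.
teardown_app() {
  local domain="$1"
  if [ -z "$domain" ]; then
    echo "usage: teardown_app <domain>" >&2
    return 2
  fi
  # Flags are copied verbatim from the decision log; none of these take --chaos.
  # Stop at the first failing step so a half-removed app is noticed.
  abra app undeploy "$domain" -n || return 1
  abra app volume remove "$domain" -f -n || return 1
  abra app secret remove "$domain" --all -n || return 1
}
```

A harness would call `teardown_app <d>` after each test run and treat a non-zero return as a leaked deployment to investigate.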

Open (defaults from §8, to confirm as reality lands)

  • Deploy mechanism — SETTLED (M0): nixos-rebuild switch --flake /root/cc-ci#cc-ci run on cc-ci itself, with the repo materialised on the host at /root/cc-ci. Chosen over --target-host/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS proxy (slow/fragile). Atomic rollback preserved by Nix generations (nixos-rebuild --rollback). The switch is launched as a detached transient systemd unit (systemd-run --unit=ccci-rebuild --collect) so it survives a momentary ssh-over-tailscale drop during activation. For the build loop the host copy is synced from the sandbox clone via tar | ssh (rsync absent on host); source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo on a fresh host, then nixos-rebuild switch --flake .#cc-ci).
    • nixpkgs pin: flake pins the exact rev cc-ci already ran (50ab793…) so the first rebuild is a true no-op-then-base. Bump deliberately, never drift.
  • Webhook scope: default per-repo via enroll script.
  • Drone runner type: default exec (must drive host abra).
  • Secret tool — SETTLED (M0): sops-nix. cc-ci decrypts at activation using its ed25519 SSH host key as the age identity (sops.age.sshKeyPaths), so no extra key file to manage on the box. Recipients in /.sops.yaml: the host age key (age1h90ut…, from ssh-to-age) + an off-box master recovery key (age1cmk26t…; private half only at /srv/cc-ci/.sops/master-age.txt on the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing plaintext into secrets/<f>.yaml then sops -e -i (run inside the repo so .sops.yaml is found).
  • D10 recipe set: lock six early. Candidates favouring already-mirrored: custom-html (simple), cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3), bluesky-pds (TLS-passthrough) — covers all five categories. Confirm during M4–M6.5.
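
The deploy mechanism described above (sync the clone with `tar | ssh`, then run the switch as a detached transient unit) might look like the sketch below. The helper names, the `REPO_DIR` and `SYSTEMD_RUN` variables, and the ssh destination argument are assumptions for illustration; the unit name, flake reference and host path come from the notes.

```shell
# Hypothetical helpers; only the commands inside them come from the notes above.
REPO_DIR="${REPO_DIR:-/root/cc-ci}"
SYSTEMD_RUN="${SYSTEMD_RUN:-systemd-run}"

# Run from the sandbox: copy the clone to the host without rsync.
# $1 = local clone path, $2 = ssh destination (hypothetical, e.g. root@cc-ci).
sync_repo() {
  local src="$1" dest="$2"
  tar -C "$src" -cf - . |
    ssh "$dest" "mkdir -p '$REPO_DIR' && tar -C '$REPO_DIR' -xf -"
}

# Run on cc-ci itself: launch the switch as a transient unit so it keeps
# going if the ssh-over-tailscale session drops during activation.
launch_rebuild() {
  "$SYSTEMD_RUN" --unit=ccci-rebuild --collect \
    nixos-rebuild switch --flake "$REPO_DIR#cc-ci"
}
```

Because the switch runs inside the `ccci-rebuild` unit, its progress can be followed after reconnecting with `journalctl -u ccci-rebuild`, and `nixos-rebuild --rollback` remains available if the new generation misbehaves.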

Risks

  • Disk — RESOLVED 2026-05-26. Original 8.9 GiB root had only ~3.8 GiB free and a hard inode ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on inodes before bytes. Operator grew the VM to 28 GiB (22 GiB free, 1.78M inodes / 1.21M free); the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown + periodic docker image prune to avoid regressing during M6.5 breadth.
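
Since the original failure was inode exhaustion rather than byte exhaustion, a periodic guard could watch free inodes and prune images when they run low. The sketch below is one way to do that; the function names and the 100000 threshold are illustrative assumptions, not values from this log.

```shell
# Print the number of free inodes for a mount point (default: /).
# Parses the IFree column of `df -Pi`; assumes a df that supports -i.
free_inodes() {
  df -Pi "${1:-/}" | awk 'NR==2 {print $4}'
}

# Prune dangling Docker images when free inodes fall below a threshold.
# The 100000 default is an illustrative value, not taken from the notes.
prune_if_low() {
  local min="${1:-100000}" free
  free="$(free_inodes /)"
  if [ "$free" -lt "$min" ]; then
    echo "low inodes: $free free (< $min), pruning images"
    docker image prune -f
  else
    echo "inodes ok: $free free"
  fi
}
```

Checking inodes as well as bytes matters here because `df -h` alone would have shown ~3.8 GiB free while the filesystem was already unable to create new files.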

Dead-ends

  • (none yet)