Orchestrator decision: deploy canonical coop-cloud traefik via abra instead of a hand-rolled module. abra packaged in Nix (pinned). custom-html deployed over HTTPS (200) via the gateway and torn down clean. docs/install.md seeded. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# DECISIONS — cc-ci Builder
Architecture decisions and dead-ends. One line of rationale each. (§0, §8)
## Settled
- **Wildcard TLS:** operator pre-issues wildcard cert at `/var/lib/ci-certs/live/`; Traefik file
provider serves it; **no ACME** for commoninternet.net. (Plan §4.0/§8 — fixed.)
- **Repo:** `git.autonomic.zone/recipe-maintainers/cc-ci`, private. Bot is org admin. (Bootstrap.)
- **Git credentials:** helper script in repo-local git config sources `/srv/cc-ci/.testenv` at call
time — no secret values stored in `.git/config` or commits.
- **Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26,
overrides plan §3 `modules/traefik.nix`).** Instead of a hand-rolled Traefik we deploy the
canonical Co-op Cloud `traefik` recipe via abra in **wildcard / file-provider mode**, for
end-to-end fidelity (canonical `web`/`web-secure` entrypoints + proxy/swarm conventions every
recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO
DNS token on the box:
- `WILDCARDS_ENABLED=1` + append `compose.wildcard.yml`; the pre-issued cert is fed as the
`ssl_cert`/`ssl_key` swarm secrets (v1) via `abra app secret insert … -f` from
`/var/lib/ci-certs/live/{fullchain,privkey}.pem`. The file provider serves it (`tls.certificates`).
- `LETS_ENCRYPT_ENV=` **empty** on the traefik app *and* on every test app → the recipe's
`tls.certresolver=${LETS_ENCRYPT_ENV}` label resolves to no resolver → routers serve the
wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
- Reproducibility (D8): `scripts/deploy-proxy.sh` is idempotent (ensures local abra server, fetches
recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in
`docs/install.md`. The custom `modules/traefik.nix` was removed; `modules/swarm.nix` keeps swarm
init + `proxy` net + firewall 80/443.
- **Renewal (manual, ~90d):** operator re-issues the wildcard at the same paths, then
`abra app secret rm traefik.ci.commoninternet.net ssl_cert -n` + re-insert at a new version (bump
`SECRET_WILDCARD_CERT_VERSION`) and redeploy. (Documented in docs/secrets.md at M7.)
- **abra teardown syntax** (for harness, §4.3): `abra app undeploy <d> -n`,
`abra app volume remove <d> -f -n`, `abra app secret remove <d> --all -n`. None take `--chaos`.
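
The teardown order above can be sketched as a small helper that only prints the commands, so the harness (or an operator) can review them before running anything. This is a sketch, not the harness itself: `teardown_cmds` and the `app.example.org` domain are hypothetical names; the three abra invocations are the ones listed in the bullet above.

```sh
# Sketch, not the harness: prints the teardown commands for one app domain
# in the order listed above. Nothing is executed here.
teardown_cmds() {
  domain="$1"
  if [ -z "$domain" ]; then
    echo "usage: teardown_cmds <domain>" >&2
    return 1
  fi
  printf 'abra app undeploy %s -n\n' "$domain"
  printf 'abra app volume remove %s -f -n\n' "$domain"
  printf 'abra app secret remove %s --all -n\n' "$domain"
}

# Dry run (placeholder domain):
#   teardown_cmds app.example.org
```

Printing rather than executing keeps the sketch side-effect free; a real harness would presumably run each line and stop on the first failure.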
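
The wildcard / no-ACME settings in the proxy decision lend themselves to a quick pre-deploy check on an app env file. A minimal sketch follows, under two assumptions: the function names are hypothetical, and `compose.wildcard.yml` is appended through a `COMPOSE_FILE=` line in the env file (the notes above only say "append"). The env file path is supplied by the caller.

```sh
# Sketch: verify an app env file matches the wildcard / no-ACME settings.

check_no_acme_env() {
  env_file="$1"
  # LETS_ENCRYPT_ENV must be present and empty so the certresolver label
  # resolves to no resolver and ACME never fires.
  grep -Eq '^LETS_ENCRYPT_ENV=("")?[[:space:]]*$' "$env_file" || {
    echo "LETS_ENCRYPT_ENV is not set to empty in $env_file" >&2
    return 1
  }
}

check_wildcard_env() {
  env_file="$1"
  check_no_acme_env "$env_file" || return 1
  grep -Eq '^WILDCARDS_ENABLED=1[[:space:]]*$' "$env_file" || {
    echo "WILDCARDS_ENABLED=1 missing in $env_file" >&2
    return 1
  }
  # Assumption: the wildcard compose file is appended via COMPOSE_FILE.
  grep -Eq '^COMPOSE_FILE=.*compose\.wildcard\.yml' "$env_file" || {
    echo "compose.wildcard.yml not appended in $env_file" >&2
    return 1
  }
}
```

`check_no_acme_env` alone applies to every test app; `check_wildcard_env` adds the two traefik-only settings.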
## Open (defaults from §8, to confirm as reality lands)
- **Deploy mechanism — SETTLED (M0):** `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run *on
cc-ci itself*, with the repo materialised on the host at `/root/cc-ci`. Chosen over
`--target-host`/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS
proxy (slow/fragile). Atomic rollback preserved by Nix generations (`nixos-rebuild --rollback`).
The switch is launched as a **detached transient systemd unit** (`systemd-run --unit=ccci-rebuild
--collect`) so it survives a momentary ssh-over-tailscale drop during activation. For the build
loop the host copy is synced from the sandbox clone via `tar | ssh` (rsync absent on host);
source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo
on a fresh host, then `nixos-rebuild switch --flake .#cc-ci`).
- **nixpkgs pin:** flake pins the exact rev cc-ci already ran (`50ab793…`) so the first rebuild
is a true no-op that then serves as the baseline. Bump deliberately, never drift.
- **Webhook scope:** default per-repo via enroll script.
- **Drone runner type:** default exec (must drive host abra).
- **Secret tool — SETTLED (M0):** sops-nix. cc-ci decrypts at activation using its **ed25519 SSH
host key** as the age identity (`sops.age.sshKeyPaths`), so no extra key file to manage on the box.
Recipients in `/.sops.yaml`: the host age key (`age1h90ut…`, from ssh-to-age) + an off-box
**master recovery key** (`age1cmk26t…`; private half only at `/srv/cc-ci/.sops/master-age.txt` on
the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing
plaintext into `secrets/<f>.yaml` then `sops -e -i` (run inside the repo so `.sops.yaml` is found).
- **D10 recipe set:** lock six early. Candidates favouring already-mirrored: custom-html (simple),
cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3),
bluesky-pds (TLS-passthrough) — covers all five categories. Confirm during M4–M6.5.
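
The detached switch from the deploy-mechanism decision can be sketched as a command builder. The unit name and flake path are the ones given above; `rebuild_cmd` itself is a hypothetical helper, and it only prints the line so nothing runs until someone executes it as root on cc-ci.

```sh
# Sketch: builds the detached rebuild command described above.
# Defaults are the unit name and flake reference from the notes.
rebuild_cmd() {
  flake="${1:-/root/cc-ci#cc-ci}"
  unit="${2:-ccci-rebuild}"
  printf 'systemd-run --unit=%s --collect nixos-rebuild switch --flake %s\n' \
    "$unit" "$flake"
}

# Follow progress after an ssh reconnect:
#   journalctl -u ccci-rebuild -f
```

Because the switch runs as a transient unit, its output goes to the journal rather than the ssh session, which is why the follow-up uses `journalctl`.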
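
For the sops-nix workflow above, a guard against encrypting a file twice may be useful. This is a sketch under one assumption, stated in the code as well: a sops-encrypted YAML file carries a top-level `sops:` key, so its absence is read as "still plaintext". Both function names are hypothetical.

```sh
# Sketch: encrypt a secrets file in place only if it is still plaintext.
# Assumption: sops-encrypted YAML has a top-level "sops:" metadata key.
is_sops_encrypted() {
  grep -q '^sops:' "$1"
}

encrypt_secret() {
  repo="$1"   # repo root, so sops finds .sops.yaml
  file="$2"   # path relative to the repo, e.g. secrets/<f>.yaml
  if is_sops_encrypted "$repo/$file"; then
    echo "already encrypted: $file"
    return 0
  fi
  # Run inside the repo so the recipients in .sops.yaml are picked up.
  (cd "$repo" && sops -e -i "$file")
}
```

The subshell keeps the `cd` from leaking into the caller's working directory.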
## Risks
- **Disk — RESOLVED 2026-05-26.** Original 8.9 GiB root had only ~3.8 GiB free *and* a hard
**inode** ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on
inodes before bytes. Operator grew the VM to **28 GiB** (22 GiB free, 1.78M inodes / 1.21M free);
the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown +
periodic `docker image prune` to avoid regressing during M6.5 breadth.
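
Since the original failure was inode exhaustion rather than bytes, an inode headroom check is a natural companion to the prune routine. A minimal sketch, assuming `df -Pi` is available on the host; the helper names are hypothetical, and the 50000 default threshold simply mirrors the ~50k-file nixpkgs fetch mentioned above rather than any measured limit.

```sh
# Sketch: report free inodes from `df -Pi` output and flag low headroom.

free_inodes_from_df() {
  # Reads `df -Pi <path>` output on stdin; IFree is column 4 of row 2.
  awk 'NR == 2 { print $4 }'
}

inodes_low() {
  free="$1"
  threshold="${2:-50000}"   # hypothetical default, see lead-in
  [ "$free" -lt "$threshold" ]
}

# Usage on the host:
#   free="$(df -Pi / | free_inodes_from_df)"
#   if inodes_low "$free"; then
#     echo "low inode headroom on /: $free free" >&2
#   fi
```

Splitting the parser from the `df` call keeps the parsing step checkable against canned output.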
## Dead-ends
- (none yet)