plan: cc-ci on DigitalOcean — terraform/ + nixos-infect + nix provisioning (8GB droplet, reproducible from the cc-ci flake)

autonomic-bot
2026-05-31 00:18:27 +00:00
parent 01874821f2
commit 67226efe72

@@ -0,0 +1,116 @@
# Plan — cc-ci on DigitalOcean: `terraform/` + nixos-infect + Nix provisioning
**Status:** PROPOSED. Add a **`terraform/`** folder to the **cc-ci product repo**
(`recipe-maintainers/cc-ci`) that provisions the cc-ci server on **DigitalOcean** (8 GB droplet),
converts it to NixOS via **nixos-infect**, then applies the existing cc-ci flake config, making the
CI server reproducible from scratch on real cloud hosting.

**Owner:** a cc-ci-repo infra task (implementable by the Builder/Adversary loops as an infra unit,
or by the assistant).

**This file:** `/srv/cc-ci/cc-ci-plan/plan-cc-ci-digitalocean-terraform.md`.

---
## 0. Why
cc-ci currently runs as the Incus VM `cc-nix-test` on b1, a small, shared 4-core host where we kept
hitting resource contention. A dedicated DO **8 GB** droplet gives standard, reliable hosting with a
**public IP**, fully reproducible via Terraform + the existing cc-ci NixOS flake. "Spin up cc-ci from
nothing" becomes a `terraform apply` instead of hand-driven Incus API calls.
## 1. What already exists — build ON this, don't reinvent
- cc-ci is a **flake-based NixOS system**: `flake.nix` → `nixosConfigurations.cc-ci` (pinned nixpkgs
24.11) → `nix/hosts/cc-ci/{configuration.nix, hardware.nix}` + `nix/modules/*` (proxy/traefik,
drone, drone-runner, bridge, dashboard, backupbot, swarm, abra, harness, warm-keycloak, secrets).
- **From-scratch install is already VERIFIED (D8, `docs/install.md`):** a blank NixOS host + the two
repos (cc-ci cloned `--recursive` so the `cc-ci-secrets` submodule at `secrets/` comes too) + the
**one bootstrap age key** at `/var/lib/sops-nix/key.txt` → a single `nixos-rebuild switch` converges
the whole server (0 failed units; serialized reconcile oneshots proxy→drone→bridge→dashboard→
backupbot). The wildcard TLS cert + all secrets are **sops-encrypted in `cc-ci-secrets`** (not
out-of-band).
- So **"provision via Nix in the expected way" = that exact D8 flow:** clone `--recursive` + bootstrap
age key + `nixos-rebuild switch --flake .#<host>`.
- The current `nix/hosts/cc-ci/hardware.nix` is **Incus-VM-specific** — DO needs its own
hardware/bootloader/networking, which **nixos-infect generates**.
## 2. `terraform/` layout (in `recipe-maintainers/cc-ci`)
```
terraform/
versions.tf # terraform + digitalocean/digitalocean provider, pinned
variables.tf # do_token(sensitive), region, size, ssh_key_id, ts_auth_key(sensitive), hostname
main.tf # digitalocean_droplet + ssh key + user_data
outputs.tf # droplet ipv4, id
user-data.sh # cloud-init stage-1: run nixos-infect (pinned)
README.md # apply instructions + operator inputs
.gitignore # *.tfstate*, *.auto.tfvars, .terraform/ (NEVER commit secrets/state)
```
- **Droplet:** `digitalocean_droplet` with size **`s-4vcpu-8gb`** (8 GB RAM / 4 vCPU), region (e.g.
`nyc3`), base image **`ubuntu-24-04-x64`** (nixos-infect supports it), `ssh_keys=[var.ssh_key_id]`,
`user_data=file("${path.module}/user-data.sh")` (so the path resolves regardless of the directory
Terraform is run from), a stable name + tag. Optional: `monitoring=true`.
- Secrets (DO token, TS key) are **sensitive vars** via `TF_VAR_*` env or a **gitignored**
`*.auto.tfvars`; `terraform.tfstate` is **gitignored** (it can hold secrets). Mirrors cc-ci's
no-secrets-in-git rule.
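As a sketch of how the no-secrets rule could also be enforced locally (not something the repo is known to ship), the guard below flags paths that match the `.gitignore` patterns listed above. The function names `is_forbidden` and `check_paths` are hypothetical.

```sh
# Sketch: refuse to commit Terraform state or secret-bearing var files.
# Patterns mirror the .gitignore entries above; helper names are hypothetical.
set -eu

# Return 0 if the path must never be committed, 1 otherwise.
is_forbidden() {
  case "$1" in
    *.tfstate*|*.auto.tfvars|.terraform/*|*/.terraform/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Read candidate paths on stdin (one per line) and print the offenders.
# Returns 1 if any offender was found, 0 otherwise.
check_paths() {
  bad=0
  while IFS= read -r path; do
    if is_forbidden "$path"; then
      printf 'refusing to commit: %s\n' "$path"
      bad=1
    fi
  done
  return "$bad"
}
```

It could be wired into a pre-commit hook as `git diff --cached --name-only | check_paths`, where a non-zero status blocks the commit.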
## 3. Stage 1 — nixos-infect (base Ubuntu → NixOS)
`user-data.sh` runs on first boot of the droplet:
```sh
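#!/usr/bin/env bash
# Stage 1: convert the stock Ubuntu droplet to NixOS in place, then reboot.
set -euo pipefail
# nixos-infect reads both variables from the environment.
export PROVIDER=digitalocean   # enables DO-specific networking.nix generation
export NIX_CHANNEL=nixos-24.11 # match the flake's pinned nixpkgs release
# Replace <PINNED_SHA> with a reviewed commit; never fetch a moving branch.
curl -fsSL https://raw.githubusercontent.com/elitak/nixos-infect/<PINNED_SHA>/nixos-infect | bash -x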
```
nixos-infect converts the droplet to NixOS in place, generates `/etc/nixos/{configuration.nix,
hardware-configuration.nix, networking.nix}` (DO-correct: bootloader on the DO disk, public-IP
networking via the DO metadata), and reboots into NixOS. **Pin the nixos-infect revision** — do not
`curl | bash` master blind. After this, the droplet is **bare NixOS on DO**, ssh-able as root.
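One way to confirm that stage 1 finished, sketched under the assumption that the checks run on the droplet after the reboot: read `ID` from `/etc/os-release` and confirm the three generated files exist. The helpers `os_id`, `is_nixos`, and `has_generated_config` are hypothetical names, not part of nixos-infect.

```sh
# Sketch: post-infect sanity checks. Helper names are hypothetical.
set -eu

# Print the ID field of an os-release file, with optional quotes stripped.
os_id() {
  sed -n 's/^ID=//p' "${1:-/etc/os-release}" | tr -d '"'
}

# Return 0 if the os-release file identifies NixOS.
is_nixos() {
  [ "$(os_id "${1:-/etc/os-release}")" = "nixos" ]
}

# Return 0 if the directory holds the three files nixos-infect generates;
# otherwise print the first missing one and return 1.
has_generated_config() {
  dir=${1:-/etc/nixos}
  for f in configuration.nix hardware-configuration.nix networking.nix; do
    [ -f "$dir/$f" ] || { printf 'missing: %s/%s\n' "$dir" "$f"; return 1; }
  done
}
```

On the droplet, `is_nixos && has_generated_config` succeeding would indicate that stage 2 can begin.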
## 4. Stage 2 — provision via Nix (bare NixOS → converged cc-ci) — "the expected way"
1. **Capture DO hardware into the flake.** Copy the `hardware-configuration.nix` + `networking.nix`
that nixos-infect generated into the repo as a flake host. Rather than overwriting the existing
host, add a **new host `nix/hosts/cc-ci-do/`** that imports the shared `nix/modules/*` + the DO
hardware, with `nixosConfigurations.cc-ci-do` in `flake.nix` (keeps the Incus `cc-ci` host buildable
during transition). Make DO the canonical `cc-ci` after cutover.
2. **Run the D8 install flow on the droplet:** clone `recipe-maintainers/cc-ci` `--recursive` (brings
`cc-ci-secrets`), provision the **bootstrap age key** at `/var/lib/sops-nix/key.txt`, then
`nixos-rebuild switch --flake .#cc-ci-do`. The reconcile oneshots converge the swarm.
3. **Where stage 2 runs (recommendation):** **v1 = documented operator step** (Terraform provisions +
infects; the age-key placement + `nixos-rebuild` is the manual step, exactly like `docs/install.md`
— the age key is operator-provided anyway). Automate later via a Terraform `remote-exec` provisioner
or a cloud-init second stage once the key-delivery story is settled.
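The manual stage-2 step above could be sketched roughly as follows. This is an illustration, not the repo's actual script: the clone URL and the key's source path are operator-supplied arguments, the default checkout path `/root/cc-ci` is an assumption, and `run`, `stage2`, and `DRY_RUN` are hypothetical names.

```sh
# Sketch of the stage-2 flow (D8 install on the droplet). Run as root.
# All names and default paths here are illustrative assumptions.
set -eu

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

stage2() {
  repo_url=$1               # clone URL of recipe-maintainers/cc-ci
  age_key_src=$2            # bootstrap age key, already copied to the droplet
  checkout=${3:-/root/cc-ci}
  host=${4:-cc-ci-do}

  # --recursive also fetches the cc-ci-secrets submodule at secrets/.
  run git clone --recursive "$repo_url" "$checkout"
  # Place the key where sops-nix expects it, readable by root only.
  run install -D -m 0600 "$age_key_src" /var/lib/sops-nix/key.txt
  run nixos-rebuild switch --flake "$checkout#$host"
}
```

Calling `stage2` with `DRY_RUN=1` set prints the three commands without executing them, which allows a review before the real run.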
## 5. Operator inputs (class-A1 — provide at apply, NEVER commit)
- **DO API token** (`TF_VAR_do_token`).
- **DO SSH key** (registered on DO; the operator/agent holds the private half to ssh + run stage 2).
- **`TS_AUTH_KEY`** — tailnet join (cc-ci enables tailscale; the droplet joins the same tailnet so the
orchestrator/loops reach it exactly as today, direct peer).
- **Bootstrap age key** → `/var/lib/sops-nix/key.txt` on the droplet (decrypts `cc-ci-secrets` incl.
the wildcard TLS cert). The single out-of-band secret per `docs/install.md`.
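A preflight check before `terraform apply` might look like the sketch below. It assumes the three inputs are passed through Terraform's `TF_VAR_<name>` environment convention, using the variable names from `variables.tf` above. The bootstrap age key is not a Terraform variable and is therefore not checked here. `missing_inputs` and `preflight` are hypothetical helper names.

```sh
# Sketch: fail early if an operator input is missing from the environment.
set -eu

# Print the name of each required variable that is unset or empty.
missing_inputs() {
  for name in TF_VAR_do_token TF_VAR_ssh_key_id TF_VAR_ts_auth_key; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Return 1 and list the missing names on stderr if anything is absent.
# Values are never printed, only variable names.
preflight() {
  missing=$(missing_inputs)
  if [ -n "$missing" ]; then
    printf 'missing operator inputs:\n%s\n' "$missing" >&2
    return 1
  fi
}
```

A wrapper could then run `preflight && terraform apply` so that an incomplete environment stops before any droplet is created.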
## 6. DNS / gateway — a simplification DO enables (open decision)
Today `*.ci.commoninternet.net` reaches the Incus VM (no public IP) via an external **nginx
TLS-passthrough gateway** → MagicDNS. A DO droplet has a **public IP**, so point
`ci.commoninternet.net` + the `*.ci` wildcard **A record straight at the droplet** and **drop the
gateway** — Traefik terminates TLS directly. The pre-issued sops wildcard cert still works as-is; or,
with a public IP, switch Traefik to **ACME (Let's Encrypt)** and retire the manual cert + renewal.
**v1: keep the sops cert (no behavior change); evaluate ACME-on-public-IP as a follow-up.** Record in
DECISIONS.md.
## 7. Open decisions (log in DECISIONS.md)
- **Replace vs. parallel:** stand DO up **in parallel**, verify a full `!testme` + the D-gates green on
it, then cut DNS over and **retire the Incus `cc-nix-test`**. Nothing stateful is lost — recipes
redeploy, warm canonicals re-seed on first green runs.
- **Flake host:** parallel `cc-ci-do` host until cutover, then make DO the canonical `cc-ci`.
- **Droplet size/region**; **ACME vs sops cert** (§6); **stage-2 automation** (§4.3).
## 8. Definition of Done
- `terraform/` in the cc-ci repo; `terraform apply` (with operator inputs) creates an 8 GB DO droplet
and nixos-infect converts it to NixOS.
- Given the bootstrap age key, the droplet converges to a full cc-ci via `nixos-rebuild switch --flake
.#<host>` (the D8 flow) — 0 failed units; traefik/drone/bridge/dashboard/backupbot up.
- A real recipe `!testme` runs **green** on the DO cc-ci; the dashboard + `*.ci.commoninternet.net`
reachable via the chosen DNS path.
- `terraform/README.md` documents apply + operator inputs; **no secrets/state committed**.
- **Adversary-verifiable:** from-scratch reproducibility (the D8 guarantee) holds on DO.
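The "0 failed units" criterion could be checked mechanically along these lines. The sketch only counts lines of a failed-unit listing read from stdin. It does not check the individual services, whose unit names are not given in this plan. `count_failed` and `converged` are hypothetical names.

```sh
# Sketch: evaluate the "0 failed units" criterion from a unit listing.
set -eu

# Count non-empty lines on stdin (one failed unit per line).
# grep exits 1 when the count is 0, so that status is ignored.
count_failed() {
  grep -c . || true
}

# Return 0 only if the listing on stdin contains no failed units.
converged() {
  n=$(count_failed)
  [ "$n" -eq 0 ]
}
```

On the droplet this could be fed with `systemctl list-units --failed --no-legend --plain | converged`.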
## 9. Guardrails
- **No secrets in git** (DO token, TS key, age key, tfstate all out-of-band/gitignored) — cc-ci's rule.
- **Pin everything** (provider, nixos-infect rev; nixpkgs already pinned) — reproducible, no drift.
- **Don't break the running Incus cc-ci** until the DO one is verified green (parallel bring-up + cutover).
- **Real Nix provisioning** (the flake), not hand-installed packages.
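
The pinning guardrail for nixos-infect could be made checkable with a small guard such as the one below, which accepts only a full 40-character lowercase hex commit SHA before building the download URL. This is a sketch: `is_full_sha` and `infect_url` are hypothetical names, and the URL shape is the one used in `user-data.sh` above.

```sh
# Sketch: only build the nixos-infect URL from a full commit SHA.
set -eu

# Return 0 if the argument is exactly 40 lowercase hex characters.
is_full_sha() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
  [ "${#1}" -eq 40 ]
}

# Print the raw download URL for a pinned revision, or fail for anything
# that is not a full SHA (branch names, short SHAs, unfilled placeholders).
infect_url() {
  is_full_sha "$1" || { printf 'not a pinned commit: %s\n' "$1" >&2; return 1; }
  printf 'https://raw.githubusercontent.com/elitak/nixos-infect/%s/nixos-infect\n' "$1"
}
```

A branch name such as `master` or an unfilled `<PINNED_SHA>` placeholder is rejected, so a template rendered without a real pin would fail fast rather than fetch a moving target.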