- plan-cc-ci-hetzner-migration.md: 3-phase plan — (1) provision the Hetzner cpx32 cc-ci fully + green !testme readiness gate, (2) repoint the loops + dashboard + *.ci at it (one ssh-config + DNS change), (3) stop the b1 cc-nix-test (cold standby). Parallel bring-up, reversible cutover, b1 freed. - plan-cc-ci-hetzner-terraform.md: cpx31 is retired → default to cpx32 (current dedicated-vCPU 8GB). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Plan: cc-ci on Hetzner Cloud with `terraform/` + nixos-infect + Nix provisioning

**Status:** PROPOSED → handed to the assistant to implement. Add a **`terraform/`** folder to the
**cc-ci product repo** (`recipe-maintainers/cc-ci`) that provisions the cc-ci server on **Hetzner
Cloud** (8 GB server), converts it to NixOS via **nixos-infect**, then applies the existing cc-ci
flake config, making the CI server reproducible from scratch on real cloud hosting.

**This file:** `/srv/cc-ci/cc-ci-plan/plan-cc-ci-hetzner-terraform.md`.

**Token (operator, 2026-05-31):** an `HCLOUD_TOKEN` with **read/write to an isolated Hetzner project**
(just for this) is in `/srv/cc-ci/.testenv`. The operator **will invalidate it** once the terraform is
verified working, so the goal is **write + apply + verify the working terraform**, then report.

---

## 0. Why

cc-ci currently runs as the Incus VM `cc-nix-test` on b1, a small, shared 4-core host (the source of
the contention we kept hitting). A dedicated Hetzner **8 GB** server gives standard, reliable hosting
with a **public IP**, fully reproducible via Terraform + the existing cc-ci NixOS flake. "Spin up
cc-ci from nothing" becomes a `terraform apply`.

## 1. What already exists (build ON this, don't reinvent)

- cc-ci is a **flake-based NixOS system**: `flake.nix` → `nixosConfigurations.cc-ci` (pinned nixpkgs
  24.11, **`system = "x86_64-linux"`**) → `nix/hosts/cc-ci/{configuration.nix, hardware.nix}` +
  `nix/modules/*` (proxy/traefik, drone, drone-runner, bridge, dashboard, backupbot, swarm, abra,
  harness, warm-keycloak, secrets).
- **From-scratch install is already VERIFIED (D8, `docs/install.md`):** a blank NixOS host + the two
  repos (cc-ci cloned `--recursive` so the `cc-ci-secrets` submodule at `secrets/` comes too) + the
  **one bootstrap age key** at `/var/lib/sops-nix/key.txt` → a single `nixos-rebuild switch` converges
  the whole server (0 failed units; serialized reconcile oneshots). The wildcard TLS cert + all secrets
  are **sops-encrypted in `cc-ci-secrets`** (not out-of-band).
- So **"provision via Nix in the expected way" = that exact D8 flow:** clone `--recursive` + bootstrap
  age key + `nixos-rebuild switch --flake .#<host>`.
- The current `nix/hosts/cc-ci/hardware.nix` is **Incus-VM-specific**. Hetzner needs its own
  hardware/bootloader/networking, which **nixos-infect generates**.

## 2. `terraform/` layout (in `recipe-maintainers/cc-ci`)

```
terraform/
  versions.tf     # terraform + hetznercloud/hcloud provider, pinned
  variables.tf    # hcloud_token (sensitive), location, server_type, image, ssh_key, ts_auth_key (sensitive), hostname
  main.tf         # hcloud_ssh_key + hcloud_server + user_data
  outputs.tf      # server ipv4, id
  user-data.sh    # cloud-init stage-1: run nixos-infect (pinned)
  README.md       # apply instructions + operator inputs
  .gitignore      # *.tfstate*, *.auto.tfvars, .terraform/ (NEVER commit secrets/state)
```

- **Provider:** `hetznercloud/hcloud` (pinned in `versions.tf`). The token comes from
  **`HCLOUD_TOKEN`** (env, read by the provider) or `TF_VAR_hcloud_token`. It's in `.testenv`; do
  NOT hardcode or commit it.
- **Server:** `hcloud_server`, type **`cpx32`** (AMD **dedicated vCPU**, **8 GB RAM**, NVMe SSD) as the
  **DEFAULT** (operator 2026-05-31: `cpx31` is **retired**; `cpx32` is the current dedicated-vCPU 8 GB
  type). Dedicated vCPU avoids noisy-neighbor variance for bursty CI. Must be **x86** (the flake is
  `x86_64-linux`; do **NOT** use the `cax*` ARM types). `cx32` (Intel shared vCPU, 8 GB) is a cheaper
  alt. Confirm exact specs (including shared vs dedicated vCPU) from the hcloud API at apply time.
- **Server arguments:** `image = "ubuntu-24.04"` (nixos-infect-supported base), a `location`
  (e.g. `nbg1`/`fsn1`/`hel1` EU or `ash`/`hil` US; pick one, make it a var), `ssh_keys=[hcloud_ssh_key.id]`,
  `user_data=file("user-data.sh")`, `public_net { ipv4_enabled = true }`, a stable name + label.
- Keep the token + TS key **sensitive**; `terraform.tfstate` is **gitignored** (it can hold secrets),
  mirroring cc-ci's no-secrets-in-git rule.

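As a rough illustration of the layout above, the following shell sketch scaffolds a minimal `main.tf` from a heredoc. The helper name `write_main_tf`, the resource label `cc_ci`, the `role` label, the `-key` name suffix, and the `${path.module}` prefix are assumptions made for illustration, not part of the plan. The variable names follow the `variables.tf` list above, and every attribute should be checked against the pinned hcloud provider's documentation before use.

```sh
# Hypothetical helper: write a minimal main.tf sketch into the given directory.
# The heredoc delimiter is quoted so the shell leaves ${...} for Terraform.
write_main_tf() {
  dir="$1"
  mkdir -p "$dir"
  cat > "$dir/main.tf" <<'EOF'
# Sketch only; names marked "assumed" are illustrative.
resource "hcloud_ssh_key" "cc_ci" {
  name       = "${var.hostname}-key" # assumed naming scheme
  public_key = var.ssh_key
}

resource "hcloud_server" "cc_ci" {
  name        = var.hostname
  server_type = var.server_type # default cpx32; x86 only, never cax*
  image       = var.image       # ubuntu-24.04
  location    = var.location
  ssh_keys    = [hcloud_ssh_key.cc_ci.id]
  user_data   = file("${path.module}/user-data.sh")

  public_net {
    ipv4_enabled = true
  }

  labels = {
    role = "cc-ci" # assumed label
  }
}
EOF
}
```

Calling `write_main_tf terraform` would create `terraform/main.tf`; running `terraform validate` afterwards checks the result against the pinned provider.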
## 3. Stage 1: nixos-infect (base Ubuntu → NixOS)

`user-data.sh` on first boot:

```sh
#!/usr/bin/env bash
set -euo pipefail
export NIX_CHANNEL=nixos-24.11
export PROVIDER=hetznercloud   # value the nixos-infect README uses for Hetzner Cloud; confirm against the pinned revision
# Replace <PINNED_SHA> with the pinned nixos-infect commit. The URL is quoted so the shell does not treat < and > as redirections.
curl -fsSL "https://raw.githubusercontent.com/elitak/nixos-infect/<PINNED_SHA>/nixos-infect" | bash -x
```

nixos-infect converts the server to NixOS in place, generates `/etc/nixos/{configuration.nix,
hardware-configuration.nix, networking.nix}` (Hetzner-correct bootloader + public-IP networking), and
reboots into NixOS. **Pin the nixos-infect revision**; don't `curl|bash` master blind. After this the
server is **bare NixOS on Hetzner**, ssh-able as root.

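The pinning rule above can be enforced mechanically. The sketch below builds the raw download URL only when given a full 40-character commit SHA and refuses branch names such as `master`. The function name `nixos_infect_url` is hypothetical, and treating "pinned" as "full lowercase hex SHA" is an assumption of this example.

```sh
# Hypothetical helper: print the raw nixos-infect URL for a pinned commit.
# Rejects anything that is not a full 40-character lowercase hex SHA.
nixos_infect_url() {
  rev="$1"
  if ! printf '%s\n' "$rev" | grep -Eq '^[0-9a-f]{40}$'; then
    echo "refusing unpinned nixos-infect revision: $rev" >&2
    return 1
  fi
  printf 'https://raw.githubusercontent.com/elitak/nixos-infect/%s/nixos-infect\n' "$rev"
}
```

A wrapper could call this before rendering `user-data.sh`, so that an unpinned revision fails before `terraform apply` rather than on first boot.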
## 4. Stage 2: provision via Nix (bare NixOS → converged cc-ci), "the expected way"

1. **Capture Hetzner hardware into the flake.** Take the `hardware-configuration.nix` + `networking.nix`
   nixos-infect generated and add them as a flake host. **Cleaner: a new host `nix/hosts/cc-ci-hetzner/`**
   importing the shared `nix/modules/*` + the Hetzner hardware, with `nixosConfigurations.cc-ci-hetzner`
   in `flake.nix` (keeps the Incus `cc-ci` host buildable during transition). Make Hetzner the canonical
   `cc-ci` after cutover.
2. **Run the D8 install flow on the server:** clone `recipe-maintainers/cc-ci` `--recursive` (brings
   `cc-ci-secrets`), provision the **bootstrap age key** at `/var/lib/sops-nix/key.txt`, then
   `nixos-rebuild switch --flake .#cc-ci-hetzner`. The reconcile oneshots converge the swarm.
3. **Where stage 2 runs:** **v1 = documented step run after `terraform apply`** (Terraform provisions +
   infects; the age-key placement + `nixos-rebuild` is the explicit step, like `docs/install.md`).
   Automate later via a Terraform `remote-exec` provisioner once key-delivery is settled.

- **Note on secrets for verification:** full cc-ci convergence needs the bootstrap age key (decrypts
  `cc-ci-secrets`). If that key isn't available to the implementer, verify as far as possible:
  `terraform apply` → nixos-infect → bare NixOS → the flake **builds/evaluates** for the Hetzner host
  (`nixos-rebuild build --flake .#cc-ci-hetzner`), and flag the age-key step as operator-pending.

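The fallback described in the note above (switch when the bootstrap age key is present, otherwise only build) can be sketched as a small shell helper. `stage2_action` is a hypothetical name; the default key path and the flake target are the ones given in this plan.

```sh
# Hypothetical helper: choose the nixos-rebuild action for stage 2.
# Prints "switch" when the bootstrap age key exists and is non-empty,
# otherwise "build" (build/evaluate only; age-key step stays operator-pending).
stage2_action() {
  key_file="${1:-/var/lib/sops-nix/key.txt}"
  if [ -s "$key_file" ]; then
    echo switch
  else
    echo build
  fi
}

# Usage on the server, from the recursively cloned cc-ci checkout:
#   nixos-rebuild "$(stage2_action)" --flake .#cc-ci-hetzner
```

Keeping the decision in one place means the same command line serves both the full D8 flow and the partial verification without the key.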
## 5. Operator inputs (class-A1: provide at apply, NEVER commit)

- **`HCLOUD_TOKEN`**: already in `.testenv` (isolated project, read/write; operator will invalidate
  after). The provider reads it from env.
- **SSH key**: register a public key as `hcloud_ssh_key`; hold the private half to ssh + run stage 2.
- **`TS_AUTH_KEY`**: tailnet join (cc-ci enables tailscale; the server joins the same tailnet so the
  orchestrator/loops reach it as today, direct peer). Already in `.testenv`.
- **Bootstrap age key** → `/var/lib/sops-nix/key.txt` (decrypts `cc-ci-secrets` incl. the TLS cert).
  The single out-of-band secret per `docs/install.md`.

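A pre-flight check for these inputs might look like the sketch below. `require_vars` is a hypothetical helper, and the usage lines assume `.testenv` holds sh-compatible `KEY=VALUE` lines, which this plan does not state. The helper prints only variable names, never values, so it is safe to run in logged sessions.

```sh
# Hypothetical helper: fail if any named variable is unset or empty.
# Only variable NAMES are printed, never their values.
require_vars() {
  missing=""
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  unset value
  if [ -n "$missing" ]; then
    echo "missing operator inputs:$missing" >&2
    return 1
  fi
}

# Usage (assumes .testenv is a sourceable KEY=VALUE file):
#   set -a; . /srv/cc-ci/.testenv; set +a
#   require_vars HCLOUD_TOKEN TS_AUTH_KEY && terraform apply
```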
## 6. DNS / gateway: a simplification the public IP enables (open decision)

Today `*.ci.commoninternet.net` reaches the Incus VM (no public IP) via an external nginx
TLS-passthrough gateway → MagicDNS. A Hetzner server has a **public IP**, so point
`ci.commoninternet.net` + the `*.ci` wildcard **A record straight at the server** and **drop the
gateway**; Traefik then terminates TLS directly. The sops wildcard cert still works as-is; alternatively,
switch Traefik to **ACME** and retire the manual cert + renewal. **v1: keep the sops cert (no behavior
change); evaluate ACME-on-public-IP as a follow-up.** Record in DECISIONS.md.

## 7. Open decisions (log in DECISIONS.md)

- **Replace vs. parallel:** stand Hetzner up **in parallel**, verify a full `!testme` + the D-gates
  green, then cut DNS over and **retire the Incus `cc-nix-test`**. Nothing stateful is lost: recipes
  redeploy, warm canonicals re-seed on first green runs.
- **Flake host:** parallel `cc-ci-hetzner` host until cutover, then make Hetzner the canonical `cc-ci`.
- **Server type/location** (`cpx32` default vs `cx32`; region); **ACME vs sops cert** (§6); **stage-2 automation** (§4.3).

## 8. Definition of Done

- `terraform/` in the cc-ci repo; `terraform apply` (with `HCLOUD_TOKEN`) creates an **8 GB `cpx32`**
  Hetzner server and nixos-infect converts it to NixOS.
- The flake **builds for the Hetzner host** (`nixos-rebuild build --flake .#cc-ci-hetzner`); given the
  bootstrap age key it **switches** to a fully converged cc-ci (the D8 flow) with 0 failed units.
- (Once secrets available) a real recipe `!testme` runs **green** on the Hetzner cc-ci; dashboard +
  `*.ci.commoninternet.net` reachable via the chosen DNS path.
- `terraform/README.md` documents apply + operator inputs; **no secrets/state committed**.
- The terraform is proven **idempotent** (`terraform plan` clean after apply); test resources cleaned
  up (`terraform destroy`) if this is a throwaway verification rather than the real cutover.

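One way to make the idempotency check scriptable is `terraform plan -detailed-exitcode`, which exits 0 when there are no changes, 1 on error, and 2 when changes are pending. The helper below only maps that status to a label; `plan_status` and the labels `clean`, `drift`, and `error` are hypothetical names for this sketch.

```sh
# Hypothetical helper: interpret the exit status of
# `terraform plan -detailed-exitcode` (0 = no changes, 1 = error, 2 = changes pending).
plan_status() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Usage after a successful apply:
#   rc=0; terraform plan -detailed-exitcode >/dev/null || rc=$?
#   plan_status "$rc"
```

Only a `clean` result would satisfy the idempotency item above; `drift` right after an apply suggests a resource attribute that never converges.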
## 9. Guardrails

- **No secrets in git** (HCLOUD_TOKEN, TS key, age key, tfstate all out-of-band/gitignored): cc-ci's rule.
- **Pin everything** (hcloud provider, nixos-infect rev; nixpkgs already pinned): reproducible, no drift.
- **x86 only**: the flake is `x86_64-linux`; use `cpx32`/`cx32`, never `cax*` (ARM).
- **Don't break the running Incus cc-ci** until the Hetzner one is verified green (parallel + cutover).
- **Real Nix provisioning** (the flake), not hand-installed packages.
- **The token is invalidatable + isolated**, but still treat it as a live secret: never commit it,
  never echo it into logs.

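The x86-only guardrail lends itself to a cheap pre-apply check. The sketch below rejects any server type starting with `cax` (the ARM family named above) and empty input; treating every other type name as x86 is a simplifying assumption, and `is_x86_server_type` is a hypothetical name.

```sh
# Hypothetical helper: succeed only for server types that are not ARM (cax*).
# Assumption: any non-empty type that does not start with "cax" is x86.
is_x86_server_type() {
  case "$1" in
    ""|cax*) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage before apply (TF_VAR_server_type follows Terraform's TF_VAR_ convention):
#   is_x86_server_type "${TF_VAR_server_type:-cpx32}" || { echo "ARM type rejected" >&2; exit 1; }
```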