cc-ci-orchestrator/cc-ci-plan/plan-cc-ci-hetzner-terraform.md
autonomic-bot 102427ab5b plan: full migrate-to-Hetzner (provision → cut over loops → stop old b1 VM); server type cpx31→cpx32
- plan-cc-ci-hetzner-migration.md: 3-phase plan — (1) provision the Hetzner cpx32 cc-ci fully + green
  !testme readiness gate, (2) repoint the loops + dashboard + *.ci at it (one ssh-config + DNS change),
  (3) stop the b1 cc-nix-test (cold standby). Parallel bring-up, reversible cutover, b1 freed.
- plan-cc-ci-hetzner-terraform.md: cpx31 is retired → default to cpx32 (current dedicated-vCPU 8GB).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 01:15:29 +00:00


Plan — cc-ci on Hetzner Cloud: terraform/ + nixos-infect + Nix provisioning

Status: PROPOSED → handed to the assistant to implement. Add a terraform/ folder to the cc-ci product repo (recipe-maintainers/cc-ci) that provisions the cc-ci server on Hetzner Cloud (8 GB server), converts it to NixOS via nixos-infect, then applies the existing cc-ci flake config — making the CI server reproducible-from-scratch on real cloud hosting. This file: /srv/cc-ci/cc-ci-plan/plan-cc-ci-hetzner-terraform.md.

Token (operator, 2026-05-31): an HCLOUD_TOKEN with read/write to an isolated Hetzner project (just for this) is in /srv/cc-ci/.testenv. The operator will invalidate it once the terraform is verified working — so the goal is write + apply + verify the working terraform, then report.


0. Why

cc-ci currently runs as the Incus VM cc-nix-test on b1 — a small, shared 4-core host (the contention we kept hitting). A dedicated Hetzner 8 GB server gives standard, reliable hosting with a public IP, fully reproducible via Terraform + the existing cc-ci NixOS flake. "Spin up cc-ci from nothing" becomes a terraform apply.

1. What already exists — build ON this, don't reinvent

  • cc-ci is a flake-based NixOS system: flake.nix → nixosConfigurations.cc-ci (pinned nixpkgs 24.11, system = "x86_64-linux") → nix/hosts/cc-ci/{configuration.nix, hardware.nix} + nix/modules/* (proxy/traefik, drone, drone-runner, bridge, dashboard, backupbot, swarm, abra, harness, warm-keycloak, secrets).
  • From-scratch install is already VERIFIED (D8, docs/install.md): a blank NixOS host + the two repos (cc-ci cloned --recursive so the cc-ci-secrets submodule at secrets/ comes too) + the one bootstrap age key at /var/lib/sops-nix/key.txt → a single nixos-rebuild switch converges the whole server (0 failed units; serialized reconcile oneshots). The wildcard TLS cert + all secrets are sops-encrypted in cc-ci-secrets (not out-of-band).
  • So "provision via Nix in the expected way" = that exact D8 flow: clone --recursive + bootstrap age key + nixos-rebuild switch --flake .#<host>.
  • The current nix/hosts/cc-ci/hardware.nix is Incus-VM-specific — Hetzner needs its own hardware/bootloader/networking, which nixos-infect generates.

2. terraform/ layout (in recipe-maintainers/cc-ci)

terraform/
  versions.tf      # terraform + hetznercloud/hcloud provider, pinned
  variables.tf     # hcloud_token(sensitive), location, server_type, image, ssh_key, ts_auth_key(sensitive), hostname
  main.tf          # hcloud_ssh_key + hcloud_server + user_data
  outputs.tf       # server ipv4, id
  user-data.sh     # cloud-init stage-1: run nixos-infect (pinned)
  README.md        # apply instructions + operator inputs
  .gitignore       # *.tfstate*, *.auto.tfvars, .terraform/  (NEVER commit secrets/state)
  • Provider: hetznercloud/hcloud (pinned in versions.tf). The token comes from HCLOUD_TOKEN (env, read by the provider) or TF_VAR_hcloud_token — it's in .testenv; do NOT hardcode/commit it.
  • Server: hcloud_server, with a stable name + label and these settings:
    • Type: cpx32 (AMD dedicated vCPU, 8 GB RAM, NVMe SSD) is the DEFAULT (operator 2026-05-31: cpx31 is retired; cpx32 is the current dedicated-vCPU 8 GB type). Dedicated vCPU avoids noisy-neighbor variance for bursty CI. cx32 (Intel shared vCPU, 8 GB) is a cheaper alt. Confirm exact specs from the hcloud API at apply time.
    • Architecture: must be x86 (the flake is x86_64-linux; do NOT use the cax* ARM types).
    • image = "ubuntu-24.04" (nixos-infect-supported base).
    • location: e.g. nbg1/fsn1/hel1 (EU) or ash/hil (US); pick one and make it a var.
    • ssh_keys = [hcloud_ssh_key.<name>.id], user_data = file("user-data.sh"), public_net { ipv4_enabled = true }.
  • Keep the token + TS key sensitive; terraform.tfstate gitignored (can hold secrets) — mirrors cc-ci's no-secrets-in-git rule.
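
The .gitignore rule in the layout above can be checked mechanically before the first commit. A minimal sketch, assuming it runs from the repo root; the function name check_tf_gitignore is hypothetical and not part of cc-ci:

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that terraform/.gitignore carries the three
# patterns named in the layout (*.tfstate*, *.auto.tfvars, .terraform/).
check_tf_gitignore() {
  local file="${1:-terraform/.gitignore}" pattern missing=0
  for pattern in '*.tfstate*' '*.auto.tfvars' '.terraform/'; do
    # -x: whole-line match, -F: treat the pattern as a literal string
    if ! grep -qxF -- "$pattern" "$file" 2>/dev/null; then
      echo "missing: $pattern"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_tf_gitignore with no argument inspects terraform/.gitignore; a non-zero status means at least one pattern is absent, which makes it usable as a pre-commit gate.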

3. Stage 1 — nixos-infect (base Ubuntu → NixOS)

user-data.sh on first boot:

#!/usr/bin/env bash
set -euo pipefail
export NIX_CHANNEL=nixos-24.11
export PROVIDER=hetznercloud       # nixos-infect provider hint; its README uses "hetznercloud" for Hetzner Cloud
curl -fsSL https://raw.githubusercontent.com/elitak/nixos-infect/<PINNED_SHA>/nixos-infect | bash -x

nixos-infect converts the server to NixOS in place, generates /etc/nixos/{configuration.nix, hardware-configuration.nix, networking.nix} (Hetzner-correct bootloader + public-IP networking), and reboots into NixOS. Pin the nixos-infect revision — don't curl|bash master blind. After this the server is bare NixOS on Hetzner, ssh-able as root.
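
The "pin the revision" rule can be enforced rather than trusted. A minimal sketch, assuming the pin is a full 40-character commit SHA; the helper name nixos_infect_url is hypothetical, and the actual SHA is left for the implementer to choose:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the nixos-infect URL for a pinned commit and
# refuse anything that is not a full 40-character lowercase hex SHA
# (for example "master" or an abbreviated hash).
nixos_infect_url() {
  local rev="$1"
  case "$rev" in
    ''|*[!0-9a-f]*)
      echo "refusing unpinned ref: ${rev:-<empty>}" >&2
      return 1 ;;
  esac
  if [ "${#rev}" -ne 40 ]; then
    echo "refusing abbreviated ref: $rev" >&2
    return 1
  fi
  echo "https://raw.githubusercontent.com/elitak/nixos-infect/${rev}/nixos-infect"
}
```

A branch name fails the hex check and a short hash fails the length check, so only an immutable commit reference ever reaches curl.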

4. Stage 2 — provision via Nix (bare NixOS → converged cc-ci) — "the expected way"

  1. Capture Hetzner hardware into the flake. Take the hardware-configuration.nix + networking.nix nixos-infect generated and add them as a flake host. Rather than overwriting the existing cc-ci host's hardware.nix, the cleaner option is a new host nix/hosts/cc-ci-hetzner/ importing the shared nix/modules/* + the Hetzner hardware, with nixosConfigurations.cc-ci-hetzner in flake.nix (keeps the Incus cc-ci host buildable during transition). Make Hetzner the canonical cc-ci after cutover.
  2. Run the D8 install flow on the server: clone recipe-maintainers/cc-ci --recursive (brings cc-ci-secrets), provision the bootstrap age key at /var/lib/sops-nix/key.txt, then nixos-rebuild switch --flake .#cc-ci-hetzner. The reconcile oneshots converge the swarm.
  3. Where stage 2 runs: v1 = documented step run after terraform apply (Terraform provisions + infects; the age-key placement + nixos-rebuild is the explicit step, like docs/install.md). Automate later via a Terraform remote-exec provisioner once key-delivery is settled.
    • Note on secrets for verification: full cc-ci convergence needs the bootstrap age key (decrypts cc-ci-secrets). If that key isn't available to the implementer, verify as far as possible — terraform apply → nixos-infect → bare NixOS → the flake builds/evaluates for the Hetzner host (nixos-rebuild build --flake .#cc-ci-hetzner) — and flag the age-key step as operator-pending.
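
The v1 manual step can be written down as a reviewable script before anyone runs it. A minimal sketch under stated assumptions: the helper name stage2_script, the checkout path /root/cc-ci, and the repo URL argument are all hypothetical (the plan names the repo, not its forge URL), and git is assumed to be available on the freshly infected host:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the stage-2 (D8 flow) commands for review.
# It only emits text; nothing is executed on the server by this function.
stage2_script() {
  local repo_url="$1" host_attr="${2:-cc-ci-hetzner}" checkout="${3:-/root/cc-ci}"
  cat <<EOF
set -euo pipefail
git clone --recursive '${repo_url}' '${checkout}'
test -s /var/lib/sops-nix/key.txt   # bootstrap age key must be in place first
cd '${checkout}'
nixos-rebuild switch --flake '.#${host_attr}'
systemctl --failed --no-legend      # expect no output (0 failed units)
EOF
}
# Usage (not run here): stage2_script "$REPO_URL" | ssh root@"$SERVER_IP" bash -s
```

Passing a second argument swaps the flake host, so the same sketch covers the parallel cc-ci-hetzner host now and the canonical cc-ci name after cutover.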

5. Operator inputs (class-A1 — provide at apply, NEVER commit)

  • HCLOUD_TOKEN — already in .testenv (isolated project, read/write; operator will invalidate after). The provider reads it from env.
  • SSH key — register a public key as hcloud_ssh_key; hold the private half to ssh + run stage 2.
  • TS_AUTH_KEY — tailnet join (cc-ci enables tailscale; the server joins the same tailnet so the orchestrator/loops reach it as today, direct peer). Already in .testenv.
  • Bootstrap age key at /var/lib/sops-nix/key.txt (decrypts cc-ci-secrets incl. the TLS cert). The single out-of-band secret per docs/install.md.
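
A preflight check can confirm the environment-borne inputs are present without ever printing them. A minimal sketch; the helper name missing_inputs is hypothetical, and it covers only the two variables the plan places in .testenv (the SSH key and the age key are files and are not checked here):

```shell
#!/usr/bin/env bash
# Hypothetical preflight: list required operator inputs that are unset or
# empty. It prints variable NAMES only, never values, so no secret is logged.
missing_inputs() {
  local name
  for name in HCLOUD_TOKEN TS_AUTH_KEY; do
    # eval-based indirection over fixed names keeps this POSIX-sh compatible
    eval "[ -n \"\${$name:-}\" ]" || echo "$name"
  done
}
# Usage (not run here; assumes .testenv is a sourceable KEY=VALUE file):
#   set -a; . /srv/cc-ci/.testenv; set +a
#   [ -z "$(missing_inputs)" ] || { echo "missing inputs" >&2; exit 1; }
```

Empty output means both inputs are set; any printed name is safe to show in CI logs because the value itself is never expanded into the output.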

6. DNS / gateway — a simplification the public IP enables (open decision)

Today *.ci.commoninternet.net reaches the Incus VM (no public IP) via an external nginx TLS-passthrough gateway → MagicDNS. A Hetzner server has a public IP, so point ci.commoninternet.net + the *.ci wildcard A record straight at the server and drop the gateway — Traefik terminates TLS directly. The sops wildcard cert still works as-is; or switch Traefik to ACME and retire the manual cert + renewal. v1: keep the sops cert (no behavior change); evaluate ACME-on-public-IP as a follow-up. Record in DECISIONS.md.

7. Open decisions (log in DECISIONS.md)

  • Replace vs. parallel: stand Hetzner up in parallel, verify a full !testme + the D-gates green, then cut DNS over and retire the Incus cc-nix-test. Nothing stateful is lost — recipes redeploy, warm canonicals re-seed on first green runs.
  • Flake host: parallel cc-ci-hetzner host until cutover, then make Hetzner the canonical cc-ci.
  • Server type/location (cx32 vs cpx32; region); ACME vs sops cert (§6); stage-2 automation (§4.3).

8. Definition of Done

  • terraform/ in the cc-ci repo; terraform apply (with HCLOUD_TOKEN) creates an 8 GB Hetzner server (cpx32 by default) and nixos-infect converts it to NixOS.
  • The flake builds for the Hetzner host (nixos-rebuild build --flake .#cc-ci-hetzner); given the bootstrap age key it switches to a fully converged cc-ci (the D8 flow) — 0 failed units.
  • (Once secrets available) a real recipe !testme runs green on the Hetzner cc-ci; dashboard + *.ci.commoninternet.net reachable via the chosen DNS path.
  • terraform/README.md documents apply + operator inputs; no secrets/state committed.
  • The terraform is proven idempotent (terraform plan clean after apply); test resources cleaned up (terraform destroy) if this is a throwaway verification rather than the real cutover.
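
The idempotence criterion maps onto terraform plan -detailed-exitcode, which exits 0 when there are no changes, 1 on error, and 2 when changes are pending. A minimal sketch of turning that status into a verdict; the helper name plan_verdict and its messages are hypothetical:

```shell
#!/usr/bin/env bash
# Interpret the exit status of `terraform plan -detailed-exitcode`:
# 0 = no changes, 1 = error, 2 = changes pending.
plan_verdict() {
  case "$1" in
    0) echo "idempotent: plan is clean" ;;
    2) echo "drift: plan still wants changes"; return 1 ;;
    *) echo "error: terraform plan failed (exit $1)"; return 1 ;;
  esac
}
# Usage (not run here):
#   terraform -chdir=terraform plan -detailed-exitcode -input=false >/dev/null
#   plan_verdict "$?"
```

Only a status of 0 passes, so a plan that errors out is never mistaken for a clean one.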

9. Guardrails

  • No secrets in git (HCLOUD_TOKEN, TS key, age key, tfstate all out-of-band/gitignored) — cc-ci's rule.
  • Pin everything (hcloud provider, nixos-infect rev; nixpkgs already pinned) — reproducible, no drift.
  • x86 only: the flake is x86_64-linux; use cx32/cpx32, never cax* (ARM).
  • Don't break the running Incus cc-ci until the Hetzner one is verified green (parallel + cutover).
  • Real Nix provisioning (the flake), not hand-installed packages.
  • The token is invalidatable + isolated — but still treat it as a live secret: never commit it, never echo it into logs.
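
The x86-only guardrail can be backed by a cheap check in front of terraform apply. A minimal sketch; the helper name require_x86_type is hypothetical, and matching on the cax prefix is only a first line of defense that does not replace confirming specs from the hcloud API:

```shell
#!/usr/bin/env bash
# Hypothetical guard: reject ARM (cax*) server types by name before apply.
require_x86_type() {
  case "$1" in
    '')   echo "server type is empty" >&2; return 1 ;;
    cax*) echo "refusing ARM server type: $1" >&2; return 1 ;;
    *)    return 0 ;;
  esac
}
# Usage (not run here): require_x86_type "$TF_VAR_server_type" && terraform apply
```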