diff --git a/cc-ci-plan/plan-cc-ci-digitalocean-terraform.md b/cc-ci-plan/plan-cc-ci-digitalocean-terraform.md deleted file mode 100644 index 7d42829..0000000 --- a/cc-ci-plan/plan-cc-ci-digitalocean-terraform.md +++ /dev/null @@ -1,116 +0,0 @@ -# Plan — cc-ci on DigitalOcean: `terraform/` + nixos-infect + Nix provisioning - -**Status:** PROPOSED. Add a **`terraform/`** folder to the **cc-ci product repo** -(`recipe-maintainers/cc-ci`) that provisions the cc-ci server on **DigitalOcean** (8 GB droplet), -converts it to NixOS via **nixos-infect**, then applies the existing cc-ci flake config — making the -CI server reproducible-from-scratch on real cloud hosting. **Owner:** a cc-ci-repo infra task -(implementable by the Builder/Adversary loops as an infra unit, or by the assistant). **This file:** -`/srv/cc-ci/cc-ci-plan/plan-cc-ci-digitalocean-terraform.md`. - ---- - -## 0. Why -cc-ci currently runs as the Incus VM `cc-nix-test` on b1 — a small, shared 4-core host (the contention -we kept hitting). A dedicated DO **8 GB** droplet gives standard, reliable hosting with a **public -IP**, fully reproducible via Terraform + the existing cc-ci NixOS flake. "Spin up cc-ci from nothing" -becomes a `terraform apply` instead of hand-driven Incus API calls. - -## 1. What already exists — build ON this, don't reinvent -- cc-ci is a **flake-based NixOS system**: `flake.nix` → `nixosConfigurations.cc-ci` (pinned nixpkgs - 24.11) → `nix/hosts/cc-ci/{configuration.nix, hardware.nix}` + `nix/modules/*` (proxy/traefik, - drone, drone-runner, bridge, dashboard, backupbot, swarm, abra, harness, warm-keycloak, secrets). 
-- **From-scratch install is already VERIFIED (D8, `docs/install.md`):** a blank NixOS host + the two - repos (cc-ci cloned `--recursive` so the `cc-ci-secrets` submodule at `secrets/` comes too) + the - **one bootstrap age key** at `/var/lib/sops-nix/key.txt` → a single `nixos-rebuild switch` converges - the whole server (0 failed units; serialized reconcile oneshots proxy→drone→bridge→dashboard→ - backupbot). The wildcard TLS cert + all secrets are **sops-encrypted in `cc-ci-secrets`** (not - out-of-band). -- So **"provision via Nix in the expected way" = that exact D8 flow:** clone `--recursive` + bootstrap - age key + `nixos-rebuild switch --flake .#`. -- The current `nix/hosts/cc-ci/hardware.nix` is **Incus-VM-specific** — DO needs its own - hardware/bootloader/networking, which **nixos-infect generates**. - -## 2. `terraform/` layout (in `recipe-maintainers/cc-ci`) -``` -terraform/ - versions.tf # terraform + digitalocean/digitalocean provider, pinned - variables.tf # do_token(sensitive), region, size, ssh_key_id, ts_auth_key(sensitive), hostname - main.tf # digitalocean_droplet + ssh key + user_data - outputs.tf # droplet ipv4, id - user-data.sh # cloud-init stage-1: run nixos-infect (pinned) - README.md # apply instructions + operator inputs - .gitignore # *.tfstate*, *.auto.tfvars, .terraform/ (NEVER commit secrets/state) -``` -- **Droplet:** `digitalocean_droplet` — size **`s-4vcpu-8gb`** (8 GB RAM / 4 vCPU), region (e.g. - `nyc3`), base image **`ubuntu-24-04-x64`** (nixos-infect supports it), `ssh_keys=[var.ssh_key_id]`, - `user_data=file("user-data.sh")`, a stable name + tag. Optional: `monitoring=true`. -- Secrets (DO token, TS key) are **sensitive vars** via `TF_VAR_*` env or a **gitignored** - `*.auto.tfvars`; `terraform.tfstate` is **gitignored** (it can hold secrets). Mirrors cc-ci's - no-secrets-in-git rule. - -## 3. 
Stage 1 — nixos-infect (base Ubuntu → NixOS) -`user-data.sh` runs on first boot of the droplet: -```sh -#!/usr/bin/env bash -set -euo pipefail -export NIX_CHANNEL=nixos-24.11 -curl -fsSL https://raw.githubusercontent.com/elitak/nixos-infect//nixos-infect | bash -x -``` -nixos-infect converts the droplet to NixOS in place, generates `/etc/nixos/{configuration.nix, -hardware-configuration.nix, networking.nix}` (DO-correct: bootloader on the DO disk, public-IP -networking via the DO metadata), and reboots into NixOS. **Pin the nixos-infect revision** — do not -`curl | bash` master blind. After this, the droplet is **bare NixOS on DO**, ssh-able as root. - -## 4. Stage 2 — provision via Nix (bare NixOS → converged cc-ci) — "the expected way" -1. **Capture DO hardware into the flake.** Take the `hardware-configuration.nix` + `networking.nix` - nixos-infect generated and add them as a flake host. **Cleaner: a new host `nix/hosts/cc-ci-do/`** - that imports the shared `nix/modules/*` + the DO hardware, with `nixosConfigurations.cc-ci-do` in - `flake.nix` (keeps the Incus `cc-ci` host buildable during transition). Make DO the canonical - `cc-ci` after cutover. -2. **Run the D8 install flow on the droplet:** clone `recipe-maintainers/cc-ci` `--recursive` (brings - `cc-ci-secrets`), provision the **bootstrap age key** at `/var/lib/sops-nix/key.txt`, then - `nixos-rebuild switch --flake .#cc-ci-do`. The reconcile oneshots converge the swarm. -3. **Where stage 2 runs (recommendation):** **v1 = documented operator step** (Terraform provisions + - infects; the age-key placement + `nixos-rebuild` is the manual step, exactly like `docs/install.md` - — the age key is operator-provided anyway). Automate later via a Terraform `remote-exec` provisioner - or a cloud-init second stage once the key-delivery story is settled. - -## 5. Operator inputs (class-A1 — provide at apply, NEVER commit) -- **DO API token** (`TF_VAR_do_token`). 
-- **DO SSH key** (registered on DO; the operator/agent holds the private half to ssh + run stage 2). -- **`TS_AUTH_KEY`** — tailnet join (cc-ci enables tailscale; the droplet joins the same tailnet so the - orchestrator/loops reach it exactly as today, direct peer). -- **Bootstrap age key** → `/var/lib/sops-nix/key.txt` on the droplet (decrypts `cc-ci-secrets` incl. - the wildcard TLS cert). The single out-of-band secret per `docs/install.md`. - -## 6. DNS / gateway — a simplification DO enables (open decision) -Today `*.ci.commoninternet.net` reaches the Incus VM (no public IP) via an external **nginx -TLS-passthrough gateway** → MagicDNS. A DO droplet has a **public IP**, so point -`ci.commoninternet.net` + the `*.ci` wildcard **A record straight at the droplet** and **drop the -gateway** — Traefik terminates TLS directly. The pre-issued sops wildcard cert still works as-is; or, -with a public IP, switch Traefik to **ACME (Let's Encrypt)** and retire the manual cert + renewal. -**v1: keep the sops cert (no behavior change); evaluate ACME-on-public-IP as a follow-up.** Record in -DECISIONS.md. - -## 7. Open decisions (log in DECISIONS.md) -- **Replace vs. parallel:** stand DO up **in parallel**, verify a full `!testme` + the D-gates green on - it, then cut DNS over and **retire the Incus `cc-nix-test`**. Nothing stateful is lost — recipes - redeploy, warm canonicals re-seed on first green runs. -- **Flake host:** parallel `cc-ci-do` host until cutover, then make DO the canonical `cc-ci`. -- **Droplet size/region**; **ACME vs sops cert** (§6); **stage-2 automation** (§4.3). - -## 8. Definition of Done -- `terraform/` in the cc-ci repo; `terraform apply` (with operator inputs) creates an 8 GB DO droplet - and nixos-infect converts it to NixOS. -- Given the bootstrap age key, the droplet converges to a full cc-ci via `nixos-rebuild switch --flake - .#` (the D8 flow) — 0 failed units; traefik/drone/bridge/dashboard/backupbot up. 
-- A real recipe `!testme` runs **green** on the DO cc-ci; the dashboard + `*.ci.commoninternet.net` - reachable via the chosen DNS path. -- `terraform/README.md` documents apply + operator inputs; **no secrets/state committed**. -- **Adversary-verifiable:** from-scratch reproducibility (the D8 guarantee) holds on DO. - -## 9. Guardrails -- **No secrets in git** (DO token, TS key, age key, tfstate all out-of-band/gitignored) — cc-ci's rule. -- **Pin everything** (provider, nixos-infect rev; nixpkgs already pinned) — reproducible, no drift. -- **Don't break the running Incus cc-ci** until the DO one is verified green (parallel bring-up + cutover). -- **Real Nix provisioning** (the flake), not hand-installed packages. diff --git a/cc-ci-plan/plan-cc-ci-hetzner-terraform.md b/cc-ci-plan/plan-cc-ci-hetzner-terraform.md new file mode 100644 index 0000000..634aa92 --- /dev/null +++ b/cc-ci-plan/plan-cc-ci-hetzner-terraform.md @@ -0,0 +1,131 @@ +# Plan — cc-ci on Hetzner Cloud: `terraform/` + nixos-infect + Nix provisioning + +**Status:** PROPOSED → handed to the assistant to implement. Add a **`terraform/`** folder to the +**cc-ci product repo** (`recipe-maintainers/cc-ci`) that provisions the cc-ci server on **Hetzner +Cloud** (8 GB server), converts it to NixOS via **nixos-infect**, then applies the existing cc-ci +flake config — making the CI server reproducible-from-scratch on real cloud hosting. +**This file:** `/srv/cc-ci/cc-ci-plan/plan-cc-ci-hetzner-terraform.md`. + +**Token (operator, 2026-05-31):** an `HCLOUD_TOKEN` with **read/write to an isolated Hetzner project** +(just for this) is in `/srv/cc-ci/.testenv`. The operator **will invalidate it** once the terraform is +verified working — so the goal is **write + apply + verify the working terraform**, then report. + +--- + +## 0. Why +cc-ci currently runs as the Incus VM `cc-nix-test` on b1 — a small, shared 4-core host (the contention +we kept hitting). 
A dedicated Hetzner **8 GB** server gives standard, reliable hosting with a **public +IP**, fully reproducible via Terraform + the existing cc-ci NixOS flake. "Spin up cc-ci from nothing" +becomes a `terraform apply`. + +## 1. What already exists — build ON this, don't reinvent +- cc-ci is a **flake-based NixOS system**: `flake.nix` → `nixosConfigurations.cc-ci` (pinned nixpkgs + 24.11, **`system = "x86_64-linux"`**) → `nix/hosts/cc-ci/{configuration.nix, hardware.nix}` + + `nix/modules/*` (proxy/traefik, drone, drone-runner, bridge, dashboard, backupbot, swarm, abra, + harness, warm-keycloak, secrets). +- **From-scratch install is already VERIFIED (D8, `docs/install.md`):** a blank NixOS host + the two + repos (cc-ci cloned `--recursive` so the `cc-ci-secrets` submodule at `secrets/` comes too) + the + **one bootstrap age key** at `/var/lib/sops-nix/key.txt` → a single `nixos-rebuild switch` converges + the whole server (0 failed units; serialized reconcile oneshots). The wildcard TLS cert + all secrets + are **sops-encrypted in `cc-ci-secrets`** (not out-of-band). +- So **"provision via Nix in the expected way" = that exact D8 flow:** clone `--recursive` + bootstrap + age key + `nixos-rebuild switch --flake .#`. +- The current `nix/hosts/cc-ci/hardware.nix` is **Incus-VM-specific** — Hetzner needs its own + hardware/bootloader/networking, which **nixos-infect generates**. + +## 2. 
`terraform/` layout (in `recipe-maintainers/cc-ci`) +``` +terraform/ + versions.tf # terraform + hetznercloud/hcloud provider, pinned + variables.tf # hcloud_token(sensitive), location, server_type, image, ssh_key, ts_auth_key(sensitive), hostname + main.tf # hcloud_ssh_key + hcloud_server + user_data + outputs.tf # server ipv4, id + user-data.sh # cloud-init stage-1: run nixos-infect (pinned) + README.md # apply instructions + operator inputs + .gitignore # *.tfstate*, *.auto.tfvars, .terraform/ (NEVER commit secrets/state) +``` +- **Provider:** `hetznercloud/hcloud` (pinned in `versions.tf`). The token comes from + **`HCLOUD_TOKEN`** (env, read by the provider) or `TF_VAR_hcloud_token` — it's in `.testenv`; do + NOT hardcode/commit it. +- **Server:** `hcloud_server` — type **`cx32`** (Intel **shared vCPU**, **4 vCPU / 8 GB**) — must be + **x86** (the flake is `x86_64-linux`; do **NOT** use the `cax*` ARM types). `cpx31` (AMD, 4 vCPU / + 8 GB) is an acceptable alt. `image = "ubuntu-24.04"` (nixos-infect-supported base), a `location` + (e.g. `nbg1`/`fsn1`/`hel1` EU or `ash`/`hil` US — pick one, make it a var), `ssh_keys=[hcloud_ssh_key.id]`, + `user_data=file("user-data.sh")`, `public_net { ipv4_enabled = true }`, a stable name + label. +- Keep the token + TS key **sensitive**; `terraform.tfstate` **gitignored** (can hold secrets) — mirrors + cc-ci's no-secrets-in-git rule. + +## 3. Stage 1 — nixos-infect (base Ubuntu → NixOS) +`user-data.sh` on first boot: +```sh +#!/usr/bin/env bash +set -euo pipefail +export NIX_CHANNEL=nixos-24.11 +export PROVIDER=hetznercloud # nixos-infect's provider value for Hetzner Cloud +curl -fsSL https://raw.githubusercontent.com/elitak/nixos-infect//nixos-infect | bash -x +``` +nixos-infect converts the server to NixOS in place, generates `/etc/nixos/{configuration.nix, +hardware-configuration.nix, networking.nix}` (Hetzner-correct bootloader + public-IP networking), and +reboots into NixOS. 
**Pin the nixos-infect revision** — don't `curl|bash` master blind. After this the +server is **bare NixOS on Hetzner**, ssh-able as root. + +## 4. Stage 2 — provision via Nix (bare NixOS → converged cc-ci) — "the expected way" +1. **Capture Hetzner hardware into the flake.** Take the `hardware-configuration.nix` + `networking.nix` + nixos-infect generated and add them as a flake host. **Cleaner: a new host `nix/hosts/cc-ci-hetzner/`** + importing the shared `nix/modules/*` + the Hetzner hardware, with `nixosConfigurations.cc-ci-hetzner` + in `flake.nix` (keeps the Incus `cc-ci` host buildable during transition). Make Hetzner the canonical + `cc-ci` after cutover. +2. **Run the D8 install flow on the server:** clone `recipe-maintainers/cc-ci` `--recursive` (brings + `cc-ci-secrets`), provision the **bootstrap age key** at `/var/lib/sops-nix/key.txt`, then + `nixos-rebuild switch --flake .#cc-ci-hetzner`. The reconcile oneshots converge the swarm. +3. **Where stage 2 runs:** **v1 = documented step run after `terraform apply`** (Terraform provisions + + infects; the age-key placement + `nixos-rebuild` is the explicit step, like `docs/install.md`). + Automate later via a Terraform `remote-exec` provisioner once key-delivery is settled. + - **Note on secrets for verification:** full cc-ci convergence needs the bootstrap age key (decrypts + `cc-ci-secrets`). If that key isn't available to the implementer, verify as far as possible — + `terraform apply` → nixos-infect → bare NixOS → the flake **builds/evaluates** for the Hetzner host + (`nixos-rebuild build --flake .#cc-ci-hetzner`) — and flag the age-key step as operator-pending. + +## 5. Operator inputs (class-A1 — provide at apply, NEVER commit) +- **`HCLOUD_TOKEN`** — already in `.testenv` (isolated project, read/write; operator will invalidate + after). The provider reads it from env. +- **SSH key** — register a public key as `hcloud_ssh_key`; hold the private half to ssh + run stage 2. 
+- **`TS_AUTH_KEY`** — tailnet join (cc-ci enables tailscale; the server joins the same tailnet so the + orchestrator/loops reach it as today, direct peer). Already in `.testenv`. +- **Bootstrap age key** → `/var/lib/sops-nix/key.txt` (decrypts `cc-ci-secrets` incl. the TLS cert). + The single out-of-band secret per `docs/install.md`. + +## 6. DNS / gateway — a simplification the public IP enables (open decision) +Today `*.ci.commoninternet.net` reaches the Incus VM (no public IP) via an external nginx +TLS-passthrough gateway → MagicDNS. A Hetzner server has a **public IP**, so point +`ci.commoninternet.net` + the `*.ci` wildcard **A record straight at the server** and **drop the +gateway** — Traefik terminates TLS directly. The sops wildcard cert still works as-is; or switch +Traefik to **ACME** and retire the manual cert + renewal. **v1: keep the sops cert (no behavior +change); evaluate ACME-on-public-IP as a follow-up.** Record in DECISIONS.md. + +## 7. Open decisions (log in DECISIONS.md) +- **Replace vs. parallel:** stand Hetzner up **in parallel**, verify a full `!testme` + the D-gates + green, then cut DNS over and **retire the Incus `cc-nix-test`**. Nothing stateful is lost — recipes + redeploy, warm canonicals re-seed on first green runs. +- **Flake host:** parallel `cc-ci-hetzner` host until cutover, then make Hetzner the canonical `cc-ci`. +- **Server type/location** (cx32 vs cpx31; region); **ACME vs sops cert** (§6); **stage-2 automation** (§4.3). + +## 8. Definition of Done +- `terraform/` in the cc-ci repo; `terraform apply` (with `HCLOUD_TOKEN`) creates an **8 GB cx32** + Hetzner server and nixos-infect converts it to NixOS. +- The flake **builds for the Hetzner host** (`nixos-rebuild build --flake .#cc-ci-hetzner`); given the + bootstrap age key it **switches** to a fully converged cc-ci (the D8 flow) — 0 failed units. 
+- (Once secrets available) a real recipe `!testme` runs **green** on the Hetzner cc-ci; dashboard + + `*.ci.commoninternet.net` reachable via the chosen DNS path. +- `terraform/README.md` documents apply + operator inputs; **no secrets/state committed**. +- The terraform is proven **idempotent** (`terraform plan` clean after apply); test resources cleaned + up (`terraform destroy`) if this is a throwaway verification rather than the real cutover. + +## 9. Guardrails +- **No secrets in git** (HCLOUD_TOKEN, TS key, age key, tfstate all out-of-band/gitignored) — cc-ci's rule. +- **Pin everything** (hcloud provider, nixos-infect rev; nixpkgs already pinned) — reproducible, no drift. +- **x86 only** — the flake is `x86_64-linux`; use `cx32`/`cpx31`, never `cax*` (ARM). +- **Don't break the running Incus cc-ci** until the Hetzner one is verified green (parallel + cutover). +- **Real Nix provisioning** (the flake), not hand-installed packages. +- **The token is invalidatable + isolated** — but still treat it as a live secret: never commit it, + never echo it into logs.
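
Review note: the "x86 only" and "pin everything" guardrails above could be checked mechanically before `terraform apply`. The following is a minimal sketch, not part of the plan itself. The function names `is_x86_server_type` and `is_pinned_infect_url` are hypothetical, and the checks are simple string matches on the server type and on the nixos-infect URL shape used in `user-data.sh`.

```sh
#!/bin/sh
# Hypothetical pre-apply guardrail helpers (names are illustrative, not from the plan).

# Succeeds unless the Hetzner server type is one of the cax* ARM types.
# The flake is x86_64-linux, so cx32 / cpx31 pass and cax* fails.
is_x86_server_type() {
  case "$1" in
    cax*) return 1 ;;
    *)    return 0 ;;
  esac
}

# Succeeds only if the nixos-infect raw URL names a revision that is
# neither "master" nor empty (an empty segment shows up as "//").
is_pinned_infect_url() {
  case "$1" in
    */nixos-infect/master/*)       return 1 ;;
    */nixos-infect//*)             return 1 ;;
    */nixos-infect/*/nixos-infect) return 0 ;;
    *)                             return 1 ;;
  esac
}
```

A wrapper script could call `is_x86_server_type "$server_type" || exit 1` before running `terraform apply`, where `$server_type` is assumed to mirror the `server_type` variable from `variables.tf`.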