diff --git a/cc-ci-plan/plan-cc-ci-hetzner-migration.md b/cc-ci-plan/plan-cc-ci-hetzner-migration.md
new file mode 100644
index 0000000..ff024bc
--- /dev/null
+++ b/cc-ci-plan/plan-cc-ci-hetzner-migration.md
@@ -0,0 +1,96 @@
+# Plan — migrate the cc-ci SERVER from b1 Incus to Hetzner (full cutover)
+
+**Status:** PROPOSED. Move the cc-ci **CI server** (`cc-nix-test`) off the slow b1 host onto a fast
+Hetzner **cpx32** (8 GB, dedicated vCPU, NVMe), repoint the Builder/Adversary loops + everything at it,
+then stop the old VM. **This file:** `/srv/cc-ci/cc-ci-plan/plan-cc-ci-hetzner-migration.md`.
+**Owner:** assistant (provisioning + cutover mechanics) + orchestrator (coordination); operator for the
+secret/DNS gates. **Supersedes** the narrower `plan-cc-ci-hetzner-terraform.md` (that is Phase 1's
+deliverable; this plan wraps it with the cutover + decommission).
+
+---
+
+## 0. Context (why, and what's where)
+- **Two VMs run on b1** (a 2015 **Intel i5-6400T low-power CPU + a spinning HDD** — measured: CPU
+  pressure ~55%, root disk `ROTA=1`):
+  - **cc-ci server** `cc-nix-test` (tailnet `100.90.116.4`, 8 GB) — where the loops deploy recipes +
+    run the harness (the heavy CI work). **This is what we migrate.**
+  - **orchestrator VM** `cc-ci-orchestrator` (tailnet `100.116.55.106`, 2 GB) — where the loops +
+    orchestrator + assistant *run* (claude sessions). Stays for now.
+- b1 is overloaded running both on a slow CPU + HDD — "everything is getting slow."
+- **The win (see the perf analysis):** Hetzner cpx32 = modern dedicated vCPU + **NVMe** vs a 2015
+  low-power CPU + **HDD** → I/O-bound deploys (the ghost/discourse near-timeouts) likely **3–10×**
+  faster, CPU work **~2–3×**. Moving the *heavy* server off b1 also relieves b1, so the orchestrator VM
+  (still there) speeds up too.
+
+## 1. Phase 1 — provision the Hetzner cc-ci, FULLY ready
+The `plan-cc-ci-hetzner-terraform.md` deliverable, taken all the way to a **converged, green** server
+(not just "terraform applies"):
+- `terraform/` in the cc-ci repo (cpx32, ubuntu-24.04, pinned hcloud provider + nixos-infect). `apply`
+  → nixos-infect → bare NixOS on Hetzner.
+- Add the `cc-ci-hetzner` flake host (nixos-infect's DO/Hetzner hardware + the shared `nix/modules/*`).
+- **Full convergence (the D8 flow):** clone cc-ci `--recursive` + place the **bootstrap age key** at
+  `/var/lib/sops-nix/key.txt` (operator) + `nixos-rebuild switch --flake .#cc-ci-hetzner` → traefik /
+  drone / bridge / dashboard / backupbot / swarm all up, **0 failed units**.
+- **DNS/cert:** point `ci.commoninternet.net` + `*.ci` **A record at the Hetzner public IP** (the
+  server has one — can drop the b1 TLS-passthrough gateway). Keep the sops wildcard cert for v1
+  (or ACME — §decision).
+- **Readiness gate (before any cutover):** ssh works; the dashboard + `*.ci.commoninternet.net` are
+  reachable; a **full `!testme` runs GREEN on the Hetzner server** (drive one recipe end-to-end via
+  the harness). Keep the b1 cc-ci running untouched in parallel during all of Phase 1.
+- **Operator inputs for Phase 1:** `HCLOUD_TOKEN` (have), `TS_AUTH_KEY` (have), the **bootstrap age
+  key** (needed for convergence), and the **DNS change**. Note: the token may be invalidated after the
+  KEEPER server is applied — the server runs without it; only future `terraform` needs a (new) token.
+
+## 2. Phase 2 — cut everything over to the Hetzner server
+Once Phase 1 is green, switch all consumers from the b1 `cc-nix-test` to the Hetzner server:
+- **Loop access:** update the `Host cc-ci` entry in the loops' ssh config (on the orchestrator VM,
+  used by builder/adversary/orchestrator/assistant) — `HostName` from `100.90.116.4` →
+  the **Hetzner server's tailnet IP / MagicDNS**. (`ssh cc-ci` is the single indirection the loops
+  use, so this one change repoints all of them. The Hetzner box joins the SAME tailnet via
+  `TS_AUTH_KEY`, so it's a direct peer like today.)
+- **CI flow:** the `!testme` → bridge → Drone → harness path + the dashboard now run on the Hetzner
+  server (they're part of the converged config there). The recipe mirrors stay on Gitea (unaffected).
+- **State carry-over (minimal — mostly stateless):** recipes redeploy from the mirrors; **warm
+  canonicals re-seed** on the first green cold runs; the harness lives in the cc-ci repo. Drone build
+  history + dashboard state start **fresh** on the new server (acceptable; migrate only if wanted).
+- **Verify cutover:** a full loop cycle works against Hetzner — Builder deploys + claims a gate, the
+  Adversary **cold-verifies green** on the Hetzner server; phase-2 recipe work continues, now fast.
+  Watch a ghost/discourse deploy to confirm the timeouts are gone.
+
+## 3. Phase 3 — stop the old cc-ci VM (free b1)
+- Once everything is confirmed serving green on Hetzner, **stop `cc-nix-test` on b1** (Incus
+  `PUT .../state {"action":"stop"}`). **Keep it as a cold standby for a few days** (don't delete) for
+  rollback, then retire.
+- b1 now runs only the small orchestrator VM → it gets b1's full (modest) resources → the loops'
+  *runtime* is less starved too. "Everything faster from here on out."
+- **Rollback (until the old VM is deleted):** if Hetzner has a problem, revert the `Host cc-ci` ssh
+  entry to `100.90.116.4` and start the b1 VM again.
+
+## 4. Sequencing & gates (don't break the running CI)
+- **Strictly parallel bring-up:** Phase 1 stands Hetzner up *alongside* the live b1 cc-ci; **no
+  consumer is repointed until the Hetzner `!testme` is green** (Phase 1 readiness gate).
+- The cutover (Phase 2) is a **single ssh-config repoint** + DNS — fast and reversible.
+- Phase 3 (stop b1) only after Phase 2 is verified.
+- The loops keep working on b1 throughout Phase 1 (no disruption); the brief cutover window is the
+  only moment they switch servers.
+
+## 5. Open decisions (log in DECISIONS.md)
+- **DNS/cert:** point `*.ci` at the Hetzner public IP + drop the gateway; sops cert (v1) vs ACME.
+- **Drone/dashboard history:** fresh on Hetzner (default) vs migrate the volumes.
+- **Orchestrator VM:** leave on b1 (freed) for now; a *later, separate* plan could also move the loops'
+  runtime to Hetzner and fully retire b1 — out of scope here (the runtime needn't be fast).
+- **Token lifecycle:** invalidate `HCLOUD_TOKEN` after the keeper apply, or keep a (rotated) one for
+  ongoing `terraform` management of the server.
+
+## 6. Definition of Done
+- Hetzner cpx32 cc-ci fully converged (0 failed units) + a **green `!testme`** on it.
+- Loops + dashboard + `*.ci.commoninternet.net` all served from Hetzner; a full Builder→Adversary
+  cycle verified green there; deploy/convergence visibly faster (ghost/discourse no longer near-timeout).
+- Old b1 `cc-nix-test` **stopped** (cold standby, not deleted).
+- `terraform/` committed to the cc-ci repo (via PR); no secrets/state in git; `docs/install.md`
+  updated for the Hetzner host. Adversary-verifiable: from-scratch reproducibility holds on Hetzner.
+
+## 7. Guardrails
+- Parallel bring-up; never repoint consumers until Hetzner is green; keep b1 as cold standby.
+- No secrets in git (token, TS key, age key, tfstate). Pin everything. x86 only (cpx32/cx32).
+- Real Nix provisioning (the flake) + real abra; don't weaken anything to make the new server "pass."
diff --git a/cc-ci-plan/plan-cc-ci-hetzner-terraform.md b/cc-ci-plan/plan-cc-ci-hetzner-terraform.md
index 634aa92..2a4d370 100644
--- a/cc-ci-plan/plan-cc-ci-hetzner-terraform.md
+++ b/cc-ci-plan/plan-cc-ci-hetzner-terraform.md
@@ -47,9 +47,11 @@ terraform/
 
 - **Provider:** `hetznercloud/hcloud` (pinned in `versions.tf`). The token comes from **`HCLOUD_TOKEN`** (env,
   read by the provider) or `TF_VAR_hcloud_token` — it's in `.testenv`; do NOT hardcode/commit it.
-- **Server:** `hcloud_server` — type **`cx32`** (Intel **shared vCPU**, **4 vCPU / 8 GB**) — must be
-  **x86** (the flake is `x86_64-linux`; do **NOT** use the `cax*` ARM types). `cpx31` (AMD, 4 vCPU /
-  8 GB) is an acceptable alt. `image = "ubuntu-24.04"` (nixos-infect-supported base), a `location`
+- **Server:** `hcloud_server` — type **`cpx32`** (AMD **dedicated vCPU**, **8 GB RAM**, NVMe SSD) —
+  **DEFAULT** (operator 2026-05-31: `cpx31` is **retired**; `cpx32` is the current dedicated-vCPU 8 GB
+  type). Dedicated vCPU avoids noisy-neighbor variance for bursty CI. Must be **x86** (the flake is
+  `x86_64-linux`; do **NOT** use the `cax*` ARM types). `cx32` (Intel shared vCPU, 8 GB) is a cheaper
+  alt. Confirm exact specs from the hcloud API at apply time. `image = "ubuntu-24.04"` (nixos-infect-supported base), a `location`
   (e.g. `nbg1`/`fsn1`/`hel1` EU or `ash`/`hil` US — pick one, make it a var), `ssh_keys=[hcloud_ssh_key.id]`,
   `user_data=file("user-data.sh")`, `public_net { ipv4_enabled = true }`, a stable name + label.
 - Keep the token + TS key **sensitive**; `terraform.tfstate` **gitignored** (can hold secrets) — mirrors