From 4c418765c8d3e994b1929ceaf105b357cbc548c0 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Sun, 31 May 2026 02:04:02 +0000
Subject: [PATCH] =?UTF-8?q?plan:=20full=20migrate-cc-ci-to-hetzner=20(prov?=
 =?UTF-8?q?ision=20cpx32=20=E2=86=92=20benchmark=202=20recipes=20=E2=86=92?=
 =?UTF-8?q?=20cutover=20loops+pipeline+DNS=20=E2=86=92=20retire=20Incus=20?=
 =?UTF-8?q?VM);=20age=20key=20is=20on=20the=20VM=20so=20no=20secret-blocke?=
 =?UTF-8?q?r;=20harden=20.gitignore=20for=20the=20age=20key?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 cc-ci-plan/plan-migrate-cc-ci-to-hetzner.md | 83 +++++++++++++++++++++
 1 file changed, 83 insertions(+)
 create mode 100644 cc-ci-plan/plan-migrate-cc-ci-to-hetzner.md

diff --git a/cc-ci-plan/plan-migrate-cc-ci-to-hetzner.md b/cc-ci-plan/plan-migrate-cc-ci-to-hetzner.md
new file mode 100644
index 0000000..213e15b
--- /dev/null
+++ b/cc-ci-plan/plan-migrate-cc-ci-to-hetzner.md
@@ -0,0 +1,83 @@
+# Plan — migrate the cc-ci SERVER from the Incus VM to Hetzner (provision → benchmark → cutover → retire)
+
+**Status:** PROPOSED. Move the **cc-ci CI server** off the Incus VM `cc-nix-test` (b1: 2015 i5-6400T
++ **spinning HDD**, CPU-pressure ~55%, getting very slow) onto a **Hetzner `cpx32`** (4 vCPU / 8 GB /
+160 GB **NVMe**, x86, ~€16.49/mo). Everything (Builder, Adversary, the !testme pipeline) then targets
+the fast new server. **This file:** `/srv/cc-ci/cc-ci-plan/plan-migrate-cc-ci-to-hetzner.md`.
+
+**Key enabler (verified 2026-05-31):** the bootstrap age key is **already on this VM** at
+`/srv/cc-ci/.sops/master-age.txt` and the `cc-ci-secrets` submodule is populated — so the new server
+can be **fully provisioned end-to-end with NO operator secret-blocker** (the D8 flow decrypts the TLS
+cert + all secrets). The Pi is not needed.
+
+**Architecture reminder:** the Builder/Adversary **loops run on this orchestrator VM** and reach the
+CI server via `ssh cc-ci`; the **!testme pipeline (Gitea webhook → bridge → Drone → harness) runs ON
+the cc-ci server**, and `*.ci.commoninternet.net` + the dashboard are served from it. "Switch
+everything to the new server" = make the Hetzner box the cc-ci, then repoint `ssh cc-ci`, the
+webhook/DNS, and the dashboard at it. The loops' code/clones don't move — only their target.
+
+---
+
+## Phase 1 — Provision the new Hetzner cc-ci, fully converged (assistant)
+Per **`plan-cc-ci-hetzner-terraform.md`** (the provisioning detail): `terraform/` in the cc-ci repo →
+`hcloud` `cpx32` from `ubuntu-24.04` → **pinned nixos-infect** → bare NixOS → add the **`cc-ci-hetzner`
+flake host** (the nixos-infect-generated DO/Hetzner hardware + the shared `nix/modules/*`) → run the
+**D8 flow**: clone `--recursive`, place `/srv/cc-ci/.sops/master-age.txt` at `/var/lib/sops-nix/key.txt`,
+`nixos-rebuild switch --flake .#cc-ci-hetzner`. The server joins the tailnet (TS_AUTH_KEY).
+- **Accept:** 0 failed units; traefik/drone/bridge/dashboard/backupbot up; the box is on the tailnet
+  and ssh-able; terraform is idempotent (`plan` clean). This is a **real** server we keep (not the
+  throwaway the terraform-plan first described) — do **not** `terraform destroy` once it converges.
+- Done in **parallel** — the old Incus cc-ci keeps serving the loops until Phase 3.
+
+## Phase 2 — Benchmark: old vs new, two recipes (a short report)
+Pick **two representative recipes** — one light (e.g. `n8n` or `custom-html`) and one heavy/slow (e.g.
+`ghost` or `discourse` — the HDD-bound timeout cases). Run the **same full harness** (cold,
+install+upgrade+backup+restore+custom) on **both servers**:
+- old: `ssh cc-ci-incus` (the current `cc-nix-test`), new: `ssh cc-ci-hetzner`.
+- Capture **per-tier + total wall-clock** from the `RUN SUMMARY` for each recipe on each host.
+Write a short comparison report → **`docs/perf/hetzner-vs-incus.md`** in the cc-ci repo (table: recipe
+× tier × old-time × new-time × speedup). This empirically confirms the expected ~2–4× (more on the
+I/O-bound phases). *(Run identical conditions — same recipe versions, cold cache both sides.)*
+
+## Phase 3 — Cutover: point everything at the new server (orchestrated; pick a quiet moment)
+1. **Quiesce briefly:** ensure no live `!testme`/deploy is mid-run on the old server.
+2. **Repoint the loops' `ssh cc-ci`** → the Hetzner box's tailnet IP: update `Host cc-ci` in
+   `/home/loops/.ssh/config` (and root's) `HostName` → new IP. The loops keep working from this VM;
+   only their target changes. (Keep a `Host cc-ci-incus` alias for the old box during the overlap.)
+3. **DNS / webhook / gateway:** point `ci.commoninternet.net` + the `*.ci` wildcard **A record at the
+   Hetzner public IP** (drop the TLS-passthrough gateway — Traefik on the droplet terminates directly;
+   the sops wildcard cert works as-is). Re-point the Gitea `issue_comment` webhook → the new server so
+   `!testme` triggers there. **DNS is operator-owned (`commoninternet.net`)** — the one operator step.
+4. **Verify end-to-end on the new server:** a real PR `!testme` runs green through the new
+   bridge→Drone→harness; the dashboard + `*.ci.commoninternet.net` load; the loops' `ssh cc-ci` deploys
+   land on Hetzner. Re-run the relevant D-gates cold-verified by the Adversary.
+5. Make `cc-ci-hetzner` the **canonical** `nixosConfigurations.cc-ci` in the flake (retire the Incus
+   `hardware.nix` once the old box is gone).
+
+## Phase 4 — Retire the old Incus cc-nix-test
+Once Hetzner is the verified live cc-ci: **stop** the Incus VM via the b1 Incus API (mTLS certs are on
+this VM under `incus-terraform-nix-vm-creator/terraform-secrets/`) — `PUT .../instances/cc-nix-test/
+state {"action":"stop"}`. Keep it as a **cold standby for a few days**, then delete (frees b1).
+Update the memory/docs ([[cc-ci-setup]]) to point cc-ci at Hetzner.
+
+## Who does what
+- **Assistant:** Phase 1 (the terraform + full convergence) and the Phase-2 benchmark runs.
+- **Orchestrator (me) + operator:** Phase 3 cutover (I do the ssh-repoint + the Incus stop via the
+  API; **operator does the DNS change** + the go/no-go) and Phase 4.
+
+## Guardrails
+- **Parallel bring-up** — never break the running Incus cc-ci until Hetzner is verified green; the
+  cutover is the only switch moment, at a quiet point.
+- **No secrets in git** — `HCLOUD_TOKEN`, TS key, the age key (`.sops/`), tfstate all gitignored
+  (`.gitignore` hardened for `*age*.txt`/`.sops/`); never echo/commit them.
+- **x86 `cpx32`**, pin the hcloud provider + nixos-infect rev (nixpkgs already pinned).
+- **Reproducible-from-scratch holds** (the D8 guarantee) — the Hetzner cc-ci comes from `terraform
+  apply` + one `nixos-rebuild switch`, no hand steps beyond the operator DNS + age key.
+
+## Definition of Done
+- Hetzner `cpx32` cc-ci fully converged (0 failed units) via terraform + the D8 flake flow.
+- `docs/perf/hetzner-vs-incus.md` shows the two-recipe old-vs-new comparison (real numbers).
+- The loops, `!testme` pipeline, dashboard, and `*.ci.commoninternet.net` all run on Hetzner; a PR
+  `!testme` is green end-to-end there; D-gates re-verified.
+- The Incus `cc-nix-test` is stopped (cold standby → deletion); the flake's canonical `cc-ci` host is
+  Hetzner; docs/memory updated.
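
Review note on the Phase 2 report: the plan asks for a table of recipe × tier × old-time × new-time × speedup in `docs/perf/hetzner-vs-incus.md`. A minimal sketch of how that table might be rendered is below. It assumes the per-tier wall-clock values have already been read out of each host's `RUN SUMMARY` by hand or by some other tool; the function name `speedup_table`, the `Timings` type, and the numbers in the demo are hypothetical placeholders, not part of the cc-ci repo and not measurements.

```python
from typing import Dict, List, Tuple

# (recipe, tier) -> wall-clock seconds.
# Hypothetical shape: the plan does not specify how RUN SUMMARY is parsed.
Timings = Dict[Tuple[str, str], float]


def speedup_table(old: Timings, new: Timings) -> str:
    """Render a Markdown table: recipe, tier, old time, new time, speedup."""
    rows: List[str] = [
        "| recipe | tier | old (s) | new (s) | speedup |",
        "|---|---|---|---|---|",
    ]
    # Compare only the (recipe, tier) pairs measured on both hosts.
    # Rows come out in alphabetical order, not in harness tier order.
    for key in sorted(old.keys() & new.keys()):
        recipe, tier = key
        o, n = old[key], new[key]
        # Guard against a zero or missing new-side time.
        ratio = f"{o / n:.2f}x" if n > 0 else "n/a"
        rows.append(f"| {recipe} | {tier} | {o:.1f} | {n:.1f} | {ratio} |")
    return "\n".join(rows)


if __name__ == "__main__":
    # Placeholder numbers for illustration only; NOT benchmark results.
    old = {("n8n", "install"): 100.0, ("n8n", "total"): 400.0}
    new = {("n8n", "install"): 50.0, ("n8n", "total"): 160.0}
    print(speedup_table(old, new))
```

Restricting the table to pairs present on both hosts keeps a tier that timed out on one side from producing a misleading ratio; such rows would need to be reported separately.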