# Plan — migrate the cc-ci SERVER from the Incus VM to Hetzner (provision → benchmark → cutover → retire)
**Status:** PROPOSED. Move the **cc-ci CI server** off the Incus VM `cc-nix-test` (b1: 2015 i5-6400T
+ **spinning HDD**, CPU-pressure ~55%, getting very slow) onto a **Hetzner `cpx32`** (4 vCPU / 8 GB /
160 GB **NVMe**, x86, ~€16.49/mo). Everything (Builder, Adversary, the !testme pipeline) then targets
the fast new server. **This file:** `/srv/cc-ci/cc-ci-plan/plan-migrate-cc-ci-to-hetzner.md`.

**Key enabler (verified 2026-05-31):** the bootstrap age key is **already on this VM** at
`/srv/cc-ci/.sops/master-age.txt` and the `cc-ci-secrets` submodule is populated — so the new server
can be **fully provisioned end-to-end with NO operator secret-blocker** (the D8 flow decrypts the TLS
cert + all secrets). The Pi is not needed.

**Architecture reminder:** the Builder/Adversary **loops run on this orchestrator VM** and reach the
CI server via `ssh cc-ci`; the **!testme pipeline (Gitea webhook → bridge → Drone → harness) runs ON
the cc-ci server**, and `*.ci.commoninternet.net` + the dashboard are served from it. "Switch
everything to the new server" = make the Hetzner box the cc-ci, then repoint `ssh cc-ci`, the
webhook/DNS, and the dashboard at it. The loops' code/clones don't move; only their target changes.

---
## Phase 1 — Provision the new Hetzner cc-ci, fully converged (assistant)
Per **`plan-cc-ci-hetzner-terraform.md`** (the provisioning detail): `terraform/` in the cc-ci repo →
`hcloud` `cpx32` from `ubuntu-24.04` → **pinned nixos-infect** → bare NixOS → add the **`cc-ci-hetzner`
flake host** (the nixos-infect-generated DO/Hetzner hardware + the shared `nix/modules/*`) → run the
**D8 flow**: clone `--recursive`, place `/srv/cc-ci/.sops/master-age.txt` at `/var/lib/sops-nix/key.txt`,
`nixos-rebuild switch --flake .#cc-ci-hetzner`. The server joins the tailnet (TS_AUTH_KEY).
- **Accept:** 0 failed units; traefik/drone/bridge/dashboard/backupbot up; the box is on the tailnet
and ssh-able; terraform is idempotent (`plan` clean). This is a **real** server we keep (not the
throwaway the terraform-plan first described) — do **not** `terraform destroy` once it converges.
- Done in **parallel** — the old Incus cc-ci keeps serving the loops until Phase 3.
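
The D8 steps and the acceptance checks above could be driven from the orchestrator VM roughly as follows. This is a dry-run sketch, not the actual provisioning script: `REPO_URL` and `CHECKOUT` are placeholders the plan does not specify, and the `terraform` directory is assumed to be relative to the cc-ci repo root. Commands are only printed unless `DRY_RUN=0`.

```shell
#!/usr/bin/env bash
# Sketch of the D8 flow, driven from the orchestrator VM.
# Dry-run by default: commands are printed, not executed, unless DRY_RUN=0.
set -eu

DRY_RUN="${DRY_RUN:-1}"
HOST="${HOST:-cc-ci-hetzner}"                 # ssh alias used in Phase 2 of the plan
REPO_URL="${REPO_URL:-REPO_URL_PLACEHOLDER}"  # ASSUMPTION: the plan does not state the clone URL
CHECKOUT="${CHECKOUT:-/root/cc-ci}"           # ASSUMPTION: hypothetical checkout path on the new host
AGE_KEY="/srv/cc-ci/.sops/master-age.txt"     # location given in the plan

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

d8_converge() {
  run ssh "$HOST" "git clone --recursive $REPO_URL $CHECKOUT"
  run ssh "$HOST" "install -d -m 700 /var/lib/sops-nix"
  # Only the key's path is ever printed, never its contents.
  run scp "$AGE_KEY" "$HOST:/var/lib/sops-nix/key.txt"
  run ssh "$HOST" "chmod 600 /var/lib/sops-nix/key.txt"
  run ssh "$HOST" "cd $CHECKOUT && nixos-rebuild switch --flake .#cc-ci-hetzner"
  # Acceptance: no failed units; terraform plan reports no changes (exit code 0).
  run ssh "$HOST" "systemctl --failed --no-legend"
  run terraform -chdir=terraform plan -detailed-exitcode
}

d8_converge
```

With `-detailed-exitcode`, `terraform plan` exits 0 when there is nothing to change and 2 when there is a diff, so a clean idempotency check is simply a zero exit status.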
## Phase 2 — Benchmark: old vs new, two recipes (a short report)
Pick **two representative recipes** — one light (e.g. `n8n` or `custom-html`) and one heavy/slow (e.g.
`ghost` or `discourse` — the HDD-bound timeout cases). Run the **same full harness** (cold,
install+upgrade+backup+restore+custom) on **both servers**:
- old: `ssh cc-ci-incus` (the current `cc-nix-test`), new: `ssh cc-ci-hetzner`.
- Capture **per-tier + total wall-clock** from the `RUN SUMMARY` for each recipe on each host.
Write a short comparison report → **`docs/perf/hetzner-vs-incus.md`** in the cc-ci repo (table: recipe
× tier × old-time × new-time × speedup). This empirically confirms the expected ~24× (more on the
I/O-bound phases). *(Run identical conditions — same recipe versions, cold cache both sides.)*
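
One possible way to produce the report table is sketched below. It assumes the per-tier timings have first been copied out of each `RUN SUMMARY` into a tab-separated file with the columns recipe, tier, old seconds, new seconds. That intermediate format and the `speedup_table` name are illustrative only; the harness output format is not described in this plan.

```shell
#!/usr/bin/env bash
# Reads TSV rows "recipe<TAB>tier<TAB>old_seconds<TAB>new_seconds" on stdin
# (an ASSUMED intermediate format) and prints a Markdown table with a
# speedup column computed as old / new.
speedup_table() {
  printf '| recipe | tier | old (s) | new (s) | speedup |\n'
  printf '|---|---|---|---|---|\n'
  awk -F '\t' '
    NF == 4 {
      # Guard against a zero or missing new-time to avoid dividing by zero.
      speedup = ($4 > 0) ? sprintf("%.1fx", $3 / $4) : "n/a"
      printf "| %s | %s | %s | %s | %s |\n", $1, $2, $3, $4, speedup
    }'
}
```

Usage would look like `speedup_table < timings.tsv >> docs/perf/hetzner-vs-incus.md`, where `timings.tsv` is the hypothetical hand-collected file.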
## Phase 3 — Cutover: point everything at the new server (orchestrated; pick a quiet moment)
1. **Quiesce briefly:** ensure no live `!testme`/deploy is mid-run on the old server.
2. **Repoint the loops' `ssh cc-ci`** → the Hetzner box's tailnet IP: update `Host cc-ci` in
`/home/loops/.ssh/config` (and root's) `HostName` → new IP. The loops keep working from this VM;
only their target changes. (Keep a `Host cc-ci-incus` alias for the old box during the overlap.)
3. **DNS / webhook / gateway:** point `ci.commoninternet.net` + the `*.ci` wildcard **A record at the
Hetzner public IP** (drop the TLS-passthrough gateway: Traefik on the Hetzner server terminates TLS directly;
the sops wildcard cert works as-is). Re-point the Gitea `issue_comment` webhook → the new server so
`!testme` triggers there. **DNS is operator-owned (`commoninternet.net`)** — the one operator step.
4. **Verify end-to-end on the new server:** a real PR `!testme` runs green through the new
bridge→Drone→harness; the dashboard + `*.ci.commoninternet.net` load; the loops' `ssh cc-ci` deploys
land on Hetzner. Re-run the relevant D-gates cold-verified by the Adversary.
5. Make `cc-ci-hetzner` the **canonical** `nixosConfigurations.cc-ci` in the flake (retire the Incus
`hardware.nix` once the old box is gone).
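
Steps 2 and 3 can be spot-checked after the cutover with two small helpers, sketched here under assumptions: the helper names are made up, and the expected IPs have to be supplied by whoever runs the check. `ssh -G` prints the effective client configuration for an alias, and `dig +short` prints the resolved addresses.

```shell
#!/usr/bin/env bash
# Extract the effective HostName from `ssh -G <alias>` output on stdin.
ssh_hostname() {
  awk '$1 == "hostname" { print $2 }'
}

# Succeed only if stdin holds at least one address (one per line, e.g. from
# `dig +short`) and every address equals the expected IP passed as $1.
all_match_ip() {
  awk -v want="$1" '
    NF { n++; if ($1 != want) bad = 1 }
    END { if (n == 0 || bad) exit 1 }'
}

# Example checks (need network access, so they are left commented out;
# HETZNER_TAILNET_IP and HETZNER_PUBLIC_IP are hypothetical variables):
#   [ "$(ssh -G cc-ci | ssh_hostname)" = "$HETZNER_TAILNET_IP" ]
#   dig +short ci.commoninternet.net A     | all_match_ip "$HETZNER_PUBLIC_IP"
#   dig +short probe.ci.commoninternet.net A | all_match_ip "$HETZNER_PUBLIC_IP"
```

The third check uses an arbitrary label under `*.ci` to confirm the wildcard record, not just the apex name, has moved.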
## Phase 4 — Retire the old Incus cc-nix-test
Once Hetzner is the verified live cc-ci: **stop** the Incus VM via the b1 Incus API (mTLS certs are on
this VM under `incus-terraform-nix-vm-creator/terraform-secrets/`) with
`PUT .../instances/cc-nix-test/state` and body `{"action":"stop"}`. Keep it as a **cold standby for a
few days**, then delete (frees b1). Update the memory/docs ([[cc-ci-setup]]) to point cc-ci at Hetzner.
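
The stop request could be issued with `curl` along these lines. This is a sketch with assumptions: `INCUS_BASE` stands for whatever the plan elides as `...` before `/instances`, and `INCUS_CERT` / `INCUS_KEY` / `INCUS_CACERT` are hypothetical variables pointing at files under `terraform-secrets/`, whose names the plan does not give.

```shell
#!/usr/bin/env bash
# Stop the old Incus VM over the REST API using mTLS client credentials.
# ASSUMPTIONS (not stated in the plan):
#   INCUS_BASE    everything the plan elides before /instances
#   INCUS_CERT    path to the mTLS client certificate
#   INCUS_KEY     path to the matching private key
#   INCUS_CACERT  optional: CA or server cert to trust, if it is self-signed
incus_stop() {
  : "${INCUS_BASE:?set INCUS_BASE}" "${INCUS_CERT:?set INCUS_CERT}" "${INCUS_KEY:?set INCUS_KEY}"
  curl --fail --silent --show-error \
    --cert "$INCUS_CERT" --key "$INCUS_KEY" \
    ${INCUS_CACERT:+--cacert "$INCUS_CACERT"} \
    -X PUT -H 'Content-Type: application/json' \
    -d '{"action":"stop"}' \
    "$INCUS_BASE/instances/cc-nix-test/state"
}
```

The function is only defined here, not called, so nothing is stopped until `incus_stop` is invoked deliberately after the go/no-go.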
## Who does what
- **Assistant:** Phase 1 (the terraform + full convergence) and the Phase-2 benchmark runs.
- **Orchestrator (me) + operator:** Phase 3 cutover (I do the ssh-repoint + the Incus stop via the
API; **operator does the DNS change** + the go/no-go) and Phase 4.
## Guardrails
- **Parallel bring-up** — never break the running Incus cc-ci until Hetzner is verified green; the
cutover is the only switch moment, at a quiet point.
- **No secrets in git** — `HCLOUD_TOKEN`, TS key, the age key (`.sops/`), tfstate all gitignored
(`.gitignore` hardened for `*age*.txt`/`.sops/`); never echo/commit them.
- **x86 `cpx32`**, pin the hcloud provider + nixos-infect rev (nixpkgs already pinned).
- **Reproducible-from-scratch holds** (the D8 guarantee): the Hetzner cc-ci comes from `terraform
apply` + one `nixos-rebuild switch`, with no hand steps beyond the operator DNS change and placing the age key.
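
The "no secrets in git" guardrail can be checked mechanically before any commit. The helper below is a sketch using `git check-ignore`; the function name and the example paths passed to it are assumptions, so substitute the real locations of the tfstate and token files.

```shell
#!/usr/bin/env bash
# Print every given path that git would NOT ignore and return non-zero if
# there is at least one. `git check-ignore -q` exits 0 only when the path is
# ignored; any other status (including "not a repository") is treated here
# as "not ignored" so the check fails closed.
unignored_paths() {
  rc=0
  for p in "$@"; do
    if ! git check-ignore -q -- "$p"; then
      printf '%s\n' "$p"
      rc=1
    fi
  done
  return "$rc"
}

# Example, run from the repo root (paths other than .sops/ are hypothetical):
#   unignored_paths .sops/master-age.txt terraform/terraform.tfstate
```

Only path names are printed, never file contents, which keeps the check itself within the "never echo" rule.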
## Definition of Done
- Hetzner `cpx32` cc-ci fully converged (0 failed units) via terraform + the D8 flake flow.
- `docs/perf/hetzner-vs-incus.md` shows the two-recipe old-vs-new comparison (real numbers).
- The loops, `!testme` pipeline, dashboard, and `*.ci.commoninternet.net` all run on Hetzner; a PR
`!testme` is green end-to-end there; D-gates re-verified.
- The Incus `cc-nix-test` is stopped (cold standby → deletion); the flake's canonical `cc-ci` host is
Hetzner; docs/memory updated.