Plan — migrate the cc-ci SERVER from the Incus VM to Hetzner (provision → benchmark → cutover → retire)

Status: PROPOSED. Move the cc-ci CI server off the Incus VM cc-nix-test (b1: 2015 i5-6400T + spinning HDD, CPU-pressure ~55%, getting very slow) onto a Hetzner cpx32 (4 vCPU / 8 GB / 160 GB NVMe, x86, ~€16.49/mo). Everything (Builder, Adversary, the !testme pipeline) then targets the fast new server. This file: /srv/cc-ci/cc-ci-plan/plan-migrate-cc-ci-to-hetzner.md.

Key enabler (verified 2026-05-31): the bootstrap age key is already on this VM at /srv/cc-ci/.sops/master-age.txt and the cc-ci-secrets submodule is populated — so the new server can be fully provisioned end-to-end with NO operator secret-blocker (the D8 flow decrypts the TLS cert + all secrets). The Pi is not needed.

Architecture reminder: the Builder/Adversary loops run on this orchestrator VM and reach the CI server via ssh cc-ci; the !testme pipeline (Gitea webhook → bridge → Drone → harness) runs ON the cc-ci server, and *.ci.commoninternet.net + the dashboard are served from it. "Switch everything to the new server" = make the Hetzner box the cc-ci, then repoint ssh cc-ci, the webhook/DNS, and the dashboard at it. The loops' code/clones don't move — only their target.


Phase 1 — Provision the new Hetzner cc-ci, fully converged (assistant)

Per plan-cc-ci-hetzner-terraform.md (the provisioning detail): terraform/ in the cc-ci repo → hcloud cpx32 from ubuntu-24.04 → pinned nixos-infect → bare NixOS → add the cc-ci-hetzner flake host (the nixos-infect-generated DO/Hetzner hardware + the shared nix/modules/*) → run the D8 flow: clone --recursive, copy this VM's /srv/cc-ci/.sops/master-age.txt to /var/lib/sops-nix/key.txt on the new server, nixos-rebuild switch --flake .#cc-ci-hetzner. The server joins the tailnet (TS_AUTH_KEY).

  • Accept: 0 failed units; traefik/drone/bridge/dashboard/backupbot up; the box is on the tailnet and ssh-able; terraform is idempotent (plan clean). This is a real server we keep (not the throwaway the terraform-plan first described) — do not terraform destroy once it converges.
  • Done in parallel — the old Incus cc-ci keeps serving the loops until Phase 3.
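
The two mechanical Accept checks (0 failed units, clean terraform plan) could be scripted along these lines. This is a minimal sketch, not part of the existing tooling: the function names are hypothetical, and it assumes the caller has already captured the output of `systemctl list-units --state=failed --plain --no-legend` from the new server and the exit code of `terraform plan -detailed-exitcode` (0 = no changes, 1 = error, 2 = changes pending).

```python
def failed_units(systemctl_output: str) -> list[str]:
    """Unit names from `systemctl list-units --state=failed --plain --no-legend`."""
    units = []
    for line in systemctl_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Some systemd versions prefix failed units with a bullet; skip it if present.
        name = fields[1] if fields[0] == "●" and len(fields) > 1 else fields[0]
        units.append(name)
    return units


def terraform_plan_is_clean(exit_code: int) -> bool:
    """Interpret the exit code of `terraform plan -detailed-exitcode`."""
    if exit_code == 1:
        raise RuntimeError("terraform plan itself failed")
    return exit_code == 0  # 2 means the plan still contains changes


def phase1_accepted(systemctl_output: str, plan_exit_code: int) -> bool:
    """True only when no unit is failed and the plan shows no pending changes."""
    return not failed_units(systemctl_output) and terraform_plan_is_clean(plan_exit_code)
```

The remaining Accept items (services up, tailnet membership, ssh reachability) still need their own checks; this sketch covers only the two that reduce to parsing a command result.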

Phase 2 — Benchmark: old vs new, two recipes (a short report)

Pick two representative recipes — one light (e.g. n8n or custom-html) and one heavy/slow (e.g. ghost or discourse — the HDD-bound timeout cases). Run the same full harness (cold, install+upgrade+backup+restore+custom) on both servers:

  • old: ssh cc-ci-incus (the current cc-nix-test), new: ssh cc-ci-hetzner.
  • Capture per-tier + total wall-clock from the RUN SUMMARY for each recipe on each host. Write a short comparison report → docs/perf/hetzner-vs-incus.md in the cc-ci repo (table: recipe × tier × old-time × new-time × speedup). This checks the expected ~24× speedup empirically (more on the I/O-bound phases). Run under identical conditions: same recipe versions, cold cache on both sides.
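
The report table could be generated from the captured timings roughly as follows. This is a sketch under assumptions: the `timings` structure and both function names are hypothetical, and the per-recipe total is computed here by summing tiers, whereas the plan says to read totals from the RUN SUMMARY.

```python
from collections import defaultdict


def speedup_rows(timings):
    """timings maps (recipe, tier) -> (old_seconds, new_seconds).

    Returns one row per tier plus a summed 'total' row per recipe;
    speedup is old / new, so values above 1 mean the new host is faster.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    rows = []
    for (recipe, tier), (old_s, new_s) in timings.items():
        rows.append((recipe, tier, old_s, new_s, old_s / new_s))
        totals[recipe][0] += old_s
        totals[recipe][1] += new_s
    for recipe, (old_s, new_s) in totals.items():
        rows.append((recipe, "total", old_s, new_s, old_s / new_s))
    return rows


def render_markdown(rows):
    """Render rows as the recipe x tier x old x new x speedup table."""
    lines = [
        "| recipe | tier | old (s) | new (s) | speedup |",
        "|---|---|---|---|---|",
    ]
    for recipe, tier, old_s, new_s, speedup in rows:
        lines.append(f"| {recipe} | {tier} | {old_s:.0f} | {new_s:.0f} | {speedup:.1f}x |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder numbers for illustration only, NOT measurements.
    demo = {("n8n", "install"): (100.0, 25.0), ("n8n", "backup"): (60.0, 15.0)}
    print(render_markdown(speedup_rows(demo)))
```

Keeping the per-tier rows next to the total makes it visible whether the gain comes mostly from the I/O-bound tiers, which is the claim the benchmark is meant to test.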

Phase 3 — Cutover: point everything at the new server (orchestrated; pick a quiet moment)

  1. Quiesce briefly: ensure no live !testme/deploy is mid-run on the old server.
  2. Repoint the loops' ssh cc-ci → the Hetzner box's tailnet IP: update Host cc-ci in /home/loops/.ssh/config (and root's) HostName → new IP. The loops keep working from this VM; only their target changes. (Keep a Host cc-ci-incus alias for the old box during the overlap.)
  3. DNS / webhook / gateway: point ci.commoninternet.net + the *.ci wildcard A record at the Hetzner public IP (drop the TLS-passthrough gateway: Traefik on the Hetzner server terminates TLS directly; the sops wildcard cert works as-is). Re-point the Gitea issue_comment webhook → the new server so !testme triggers there. DNS is operator-owned (commoninternet.net), so this is the one operator step.
  4. Verify end-to-end on the new server: a real PR !testme runs green through the new bridge→Drone→harness; the dashboard + *.ci.commoninternet.net load; the loops' ssh cc-ci deploys land on Hetzner. Re-run the relevant D-gates cold-verified by the Adversary.
  5. Make cc-ci-hetzner the canonical nixosConfigurations.cc-ci in the flake (retire the Incus hardware.nix once the old box is gone).
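
The ssh repoint in step 2 amounts to changing one HostName line inside the `Host cc-ci` block while leaving the `cc-ci-incus` alias alone. A minimal sketch of that rewrite, assuming the simple `Keyword value` config style (the `Keyword=value` form is not handled); the function name is hypothetical and the addresses in the usage note are documentation placeholders, not the real tailnet IPs:

```python
def repoint_host(config_text: str, alias: str, new_hostname: str) -> str:
    """Set HostName inside the `Host <alias>` block; other blocks are untouched."""
    out = []
    in_block = False
    replaced = False
    for line in config_text.splitlines():
        parts = line.strip().split(None, 1)
        keyword = parts[0].lower() if parts else ""
        if keyword == "host":
            # A Host line may list several patterns; require an exact alias match.
            in_block = len(parts) > 1 and alias in parts[1].split()
        elif keyword == "match":
            in_block = False
        elif in_block and keyword == "hostname":
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}HostName {new_hostname}"
            replaced = True
        out.append(line)
    if not replaced:
        raise ValueError(f"no HostName found under Host {alias}")
    return "\n".join(out) + "\n"
```

For example, `repoint_host(text, "cc-ci", "192.0.2.2")` would change only the `cc-ci` block. Raising when nothing was replaced guards against silently leaving the loops pointed at the old box.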

Phase 4 — Retire the old Incus cc-nix-test

Once Hetzner is the verified live cc-ci: stop the Incus VM via the b1 Incus API (mTLS certs are on this VM under incus-terraform-nix-vm-creator/terraform-secrets/) with PUT .../instances/cc-nix-test/state and body {"action":"stop"}. Keep it as a cold standby for a few days, then delete (frees b1). Update the memory/docs (cc-ci-setup) to point cc-ci at Hetzner.
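
The stop call could be built with the standard library as sketched below. Everything environment-specific is an assumption: `api_prefix` stands for whatever the elided `...` above resolves to, the certificate file names under terraform-secrets/ are not known here, and both function names are hypothetical.

```python
import json
import ssl
import urllib.request


def build_stop_request(api_prefix: str, instance: str) -> urllib.request.Request:
    """Build the PUT <api_prefix>/instances/<instance>/state request with a stop action."""
    body = json.dumps({"action": "stop"}).encode()
    return urllib.request.Request(
        url=f"{api_prefix}/instances/{instance}/state",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def mtls_context(cert_file: str, key_file: str, ca_file=None) -> ssl.SSLContext:
    """TLS context that presents a client certificate (mTLS).

    ca_file should be the server certificate or its CA if the server is not
    signed by a publicly trusted authority.
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

Sending it would then be `urllib.request.urlopen(req, context=mtls_context(...))`. Separating request construction from sending lets the request be inspected before anything touches b1.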

Who does what

  • Assistant: Phase 1 (the terraform + full convergence) and the Phase-2 benchmark runs.
  • Orchestrator (me) + operator: Phase 3 cutover (I do the ssh-repoint + the Incus stop via the API; operator does the DNS change + the go/no-go) and Phase 4.

Guardrails

  • Parallel bring-up — never break the running Incus cc-ci until Hetzner is verified green; the cutover is the only switch moment, at a quiet point.
  • No secrets in git: HCLOUD_TOKEN, the TS key, the age key (.sops/), and tfstate are all gitignored (.gitignore hardened for *age*.txt/.sops/); never echo/commit them.
  • x86 cpx32, pin the hcloud provider + nixos-infect rev (nixpkgs already pinned).
  • Reproducible-from-scratch holds (the D8 guarantee) — the Hetzner cc-ci comes from terraform apply + one nixos-rebuild switch, no hand steps beyond the operator DNS + age key.

Definition of Done

  • Hetzner cpx32 cc-ci fully converged (0 failed units) via terraform + the D8 flake flow.
  • docs/perf/hetzner-vs-incus.md shows the two-recipe old-vs-new comparison (real numbers).
  • The loops, !testme pipeline, dashboard, and *.ci.commoninternet.net all run on Hetzner; a PR !testme is green end-to-end there; D-gates re-verified.
  • The Incus cc-nix-test is stopped (cold standby → deletion); the flake's canonical cc-ci host is Hetzner; docs/memory updated.