cc-ci-orchestrator/cc-ci-plan/plan-cc-ci-hetzner-migration.md
autonomic-bot 102427ab5b plan: full migrate-to-Hetzner (provision → cut over loops → stop old b1 VM); server type cpx31→cpx32
- plan-cc-ci-hetzner-migration.md: 3-phase plan — (1) provision the Hetzner cpx32 cc-ci fully + green
  !testme readiness gate, (2) repoint the loops + dashboard + *.ci at it (one ssh-config + DNS change),
  (3) stop the b1 cc-nix-test (cold standby). Parallel bring-up, reversible cutover, b1 freed.
- plan-cc-ci-hetzner-terraform.md: cpx31 is retired → default to cpx32 (current dedicated-vCPU 8GB).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 01:15:29 +00:00

Plan — migrate the cc-ci SERVER from b1 Incus to Hetzner (full cutover)

Status: PROPOSED. Move the cc-ci CI server (cc-nix-test) off the slow b1 host onto a fast Hetzner cpx32 (8 GB, dedicated vCPU, NVMe), repoint the Builder/Adversary loops + everything at it, then stop the old VM. This file: /srv/cc-ci/cc-ci-plan/plan-cc-ci-hetzner-migration.md. Owner: assistant (provisioning + cutover mechanics) + orchestrator (coordination); operator for the secret/DNS gates. Supersedes the narrower plan-cc-ci-hetzner-terraform.md (that is Phase 1's deliverable; this plan wraps it with the cutover + decommission).


0. Context (why, and what's where)

  • Two VMs run on b1 (a 2015 Intel i5-6400T low-power CPU + a spinning HDD — measured: CPU pressure ~55%, root disk ROTA=1):
    • cc-ci server cc-nix-test (tailnet 100.90.116.4, 8 GB) — where the loops deploy recipes + run the harness (the heavy CI work). This is what we migrate.
    • orchestrator VM cc-ci-orchestrator (tailnet 100.116.55.106, 2 GB) — where the loops + orchestrator + assistant run (claude sessions). Stays for now.
  • b1 is overloaded running both on a slow CPU + HDD — "everything is getting slow."
  • The win (see the perf analysis): Hetzner cpx32 = modern dedicated vCPU + NVMe vs a 2015 low-power CPU + HDD → I/O-bound deploys (the ghost/discourse near-timeouts) likely 3–10× faster, CPU work ~2–3×. Moving the heavy server off b1 also relieves b1, so the orchestrator VM (still there) speeds up too.
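
The two measurements quoted above can be re-checked on any Linux host. The sketch below assumes the CPU pressure figure was read from the kernel's PSI interface (`/proc/pressure/cpu`), which this plan does not state, and that ROTA came from `lsblk`. Both helper names are hypothetical.

```shell
# Print the 10-second "some" CPU pressure average from PSI-formatted input.
# Expected input shape (as in /proc/pressure/cpu):
#   some avg10=1.23 avg60=0.80 avg300=0.50 total=12345
psi_some_avg10() {
  awk '$1 == "some" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
  }'
}

# Print the names of rotational (spinning) disks.
# Expected input: the two-column output of `lsblk -d -n -o NAME,ROTA`.
rotational_disks() {
  awk '$2 == 1 { print $1 }'
}

# Usage on a live host (assumption: PSI is enabled in the running kernel):
#   psi_some_avg10 < /proc/pressure/cpu
#   lsblk -d -n -o NAME,ROTA | rotational_disks
```

Running the same two commands on b1 and on the Hetzner server after Phase 1 gives a like-for-like comparison before any consumer is repointed.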

1. Phase 1 — provision the Hetzner cc-ci, FULLY ready

The plan-cc-ci-hetzner-terraform.md deliverable, taken all the way to a converged, green server (not just "terraform applies"):

  • terraform/ in the cc-ci repo (cpx32, ubuntu-24.04, pinned hcloud provider + nixos-infect). apply → nixos-infect → bare NixOS on Hetzner.
  • Add the cc-ci-hetzner flake host (nixos-infect's DO/Hetzner hardware + the shared nix/modules/*).
  • Full convergence (the D8 flow): clone cc-ci --recursive + place the bootstrap age key at /var/lib/sops-nix/key.txt (operator) + nixos-rebuild switch --flake .#cc-ci-hetzner → traefik / drone / bridge / dashboard / backupbot / swarm all up, 0 failed units.
  • DNS/cert: point the ci.commoninternet.net and *.ci.commoninternet.net A records at the Hetzner public IP (the server has one, so the b1 TLS-passthrough gateway can be dropped). Keep the sops wildcard cert for v1 (or switch to ACME; see §5).
  • Readiness gate (before any cutover): ssh works; the dashboard + *.ci.commoninternet.net are reachable; a full !testme runs GREEN on the Hetzner server (drive one recipe end-to-end via the harness). Keep the b1 cc-ci running untouched in parallel during all of Phase 1.
  • Operator inputs for Phase 1: HCLOUD_TOKEN (have), TS_AUTH_KEY (have), the bootstrap age key (needed for convergence), and the DNS change. Note: the token may be invalidated after the KEEPER server is applied — the server runs without it; only future terraform needs a (new) token.
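
The convergence steps above could be scripted roughly as follows. This is a sketch, not the repo's actual tooling: the repo URL and the age-key source path are placeholders this plan does not give, the `run` and `count_failed_units` helpers are hypothetical, and by default the script only prints the commands (set `DRY_RUN=0` to execute them on the new server).

```shell
# Placeholders (assumptions): the plan does not state these values.
CC_CI_REPO_URL="${CC_CI_REPO_URL:-<cc-ci-repo-url>}"
AGE_KEY_SRC="${AGE_KEY_SRC:-./bootstrap-age-key.txt}"

# run CMD...: print the command; execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# The D8 flow as listed: clone, place the age key, rebuild from the flake.
converge_hetzner() {
  run git clone --recursive "$CC_CI_REPO_URL" cc-ci &&
  run install -D -m 600 "$AGE_KEY_SRC" /var/lib/sops-nix/key.txt &&
  run nixos-rebuild switch --flake ./cc-ci#cc-ci-hetzner
}

# Count failed units from `systemctl --failed --no-legend --plain` output
# (one unit per non-empty line). The target for this phase is 0.
count_failed_units() {
  awk 'NF { n++ } END { print n + 0 }'
}

# Usage after the rebuild:
#   systemctl --failed --no-legend --plain | count_failed_units
```

Keeping the dry-run default means the sequence can be reviewed on the orchestrator VM without touching either server.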

2. Phase 2 — cut everything over to the Hetzner server

Once Phase 1 is green, switch all consumers from the b1 cc-nix-test to the Hetzner server:

  • Loop access: update the Host cc-ci entry in the loops' ssh config (on the orchestrator VM, used by builder/adversary/orchestrator/assistant) — HostName from 100.90.116.4 → the Hetzner server's tailnet IP / MagicDNS. (ssh cc-ci is the single indirection the loops use, so this one change repoints all of them. The Hetzner box joins the SAME tailnet via TS_AUTH_KEY, so it's a direct peer like today.)
  • CI flow: the !testme → bridge → Drone → harness path + the dashboard now run on the Hetzner server (they're part of the converged config there). The recipe mirrors stay on Gitea (unaffected).
  • State carry-over (minimal — mostly stateless): recipes redeploy from the mirrors; warm canonicals re-seed on the first green cold runs; the harness lives in the cc-ci repo. Drone build history + dashboard state start fresh on the new server (acceptable; migrate only if wanted).
  • Verify cutover: a full loop cycle works against Hetzner — Builder deploys + claims a gate, the Adversary cold-verifies green on the Hetzner server; phase-2 recipe work continues, now fast. Watch a ghost/discourse deploy to confirm the timeouts are gone.
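
The ssh-config repoint described above is a one-line change inside the `Host cc-ci` stanza. A minimal sketch of doing it mechanically is shown below; the function name is hypothetical, the new HostName is whatever tailnet IP or MagicDNS name the Hetzner server receives (not known here), and the function only matches stanzas whose `Host` line lists a single pattern.

```shell
# repoint_host FILE HOST NEW_HOSTNAME
# Print FILE with the HostName inside the `Host HOST` stanza replaced.
# Other stanzas and other options are passed through unchanged.
repoint_host() {
  awk -v host="$2" -v new="$3" '
    tolower($1) == "host" { in_block = ($2 == host) }
    in_block && tolower($1) == "hostname" { sub(/[^ \t]+[ \t]*$/, new) }
    { print }
  ' "$1"
}

# Usage (writes to a new file so the old config remains for rollback):
#   repoint_host ~/.ssh/config cc-ci <hetzner-tailnet-name> > ~/.ssh/config.new
# Verify what ssh will actually connect to:
#   ssh -G cc-ci | awk '$1 == "hostname" { print $2 }'
```

Writing the result to a separate file and swapping it in keeps the previous config available, which is what makes the cutover reversible in one step.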

3. Phase 3 — stop the old cc-ci VM (free b1)

  • Once everything is confirmed serving green on Hetzner, stop cc-nix-test on b1 (Incus PUT .../state {"action":"stop"}). Keep it as a cold standby for a few days (don't delete) for rollback, then retire.
  • b1 now runs only the small orchestrator VM → it gets b1's full (modest) resources → the loops' runtime is less starved too. "Everything faster from here on out."
  • Rollback (until the old VM is deleted): if Hetzner has a problem, revert the Host cc-ci ssh entry to 100.90.116.4 and start the b1 VM again.
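
The stop and rollback steps above can be sketched with the Incus CLI. This swaps in `incus stop` / `incus start` for the REST call named in the plan, on the assumption that the CLI is available on b1; the helper names are hypothetical, and nothing is executed unless `DRY_RUN=0`.

```shell
# run CMD...: print the command; execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Phase 3: stop the old server but keep the instance (cold standby).
stop_old_vm() {
  run incus stop cc-nix-test
}

# Rollback: bring the b1 VM back, then remind the operator to revert ssh.
rollback_to_b1() {
  run incus start cc-nix-test &&
  printf 'next: set HostName for Host cc-ci back to %s\n' "100.90.116.4"
}
```

Starting the VM before reverting the ssh entry is a design choice of this sketch, so the loops are never pointed at a stopped server.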

4. Sequencing & gates (don't break the running CI)

  • Strictly parallel bring-up: Phase 1 stands Hetzner up alongside the live b1 cc-ci; no consumer is repointed until the Hetzner !testme is green (Phase 1 readiness gate).
  • The cutover (Phase 2) is a single ssh-config repoint + DNS — fast and reversible.
  • Phase 3 (stop b1) only after Phase 2 is verified.
  • The loops keep working on b1 throughout Phase 1 (no disruption); the brief cutover window is the only moment they switch servers.
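
The gating rule above (no repoint until every readiness check is green) can be expressed as a small all-or-nothing script. This is a sketch under assumptions: `cc-ci-hetzner` as an ssh alias is hypothetical, `https://` for the dashboard URL is assumed, and there is deliberately no automated default for the `!testme` check, so the operator sets `CHECK_TESTME` only after a green run has been observed.

```shell
# gate NAME CMD...: run one check quietly and report PASS or FAIL.
gate() {
  name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    printf 'PASS %s\n' "$name"
  else
    printf 'FAIL %s\n' "$name"
    return 1
  fi
}

# readiness_gate: returns 0 only if every check passes.
# CHECK_* variables override the default commands (left unquoted on
# purpose so a multi-word command is split into arguments).
readiness_gate() {
  ok=0
  gate ssh       ${CHECK_SSH:-ssh -o BatchMode=yes cc-ci-hetzner true} || ok=1
  gate dashboard ${CHECK_WEB:-curl -fsS -o /dev/null https://ci.commoninternet.net/} || ok=1
  gate testme    ${CHECK_TESTME:-false} || ok=1
  return "$ok"
}

# Usage: proceed to Phase 2 only when the gate returns 0.
#   CHECK_TESTME=true readiness_gate && echo "cutover may proceed"
```

All three checks always run, so a single invocation shows every failing gate rather than stopping at the first.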

5. Open decisions (log in DECISIONS.md)

  • DNS/cert: point *.ci at the Hetzner public IP + drop the gateway; sops cert (v1) vs ACME.
  • Drone/dashboard history: fresh on Hetzner (default) vs migrate the volumes.
  • Orchestrator VM: leave on b1 (freed) for now; a later, separate plan could also move the loops' runtime to Hetzner and fully retire b1 — out of scope here (the runtime needn't be fast).
  • Token lifecycle: invalidate HCLOUD_TOKEN after the keeper apply, or keep a (rotated) one for ongoing terraform management of the server.

6. Definition of Done

  • Hetzner cpx32 cc-ci fully converged (0 failed units) + a green !testme on it.
  • Loops + dashboard + *.ci.commoninternet.net all served from Hetzner; a full Builder→Adversary cycle verified green there; deploy/convergence visibly faster (ghost/discourse no longer near-timeout).
  • Old b1 cc-nix-test stopped (cold standby, not deleted).
  • terraform/ committed to the cc-ci repo (via PR); no secrets/state in git; docs/install.md updated for the Hetzner host. Adversary-verifiable: from-scratch reproducibility holds on Hetzner.

7. Guardrails

  • Parallel bring-up; never repoint consumers until Hetzner is green; keep b1 as cold standby.
  • No secrets in git (token, TS key, age key, tfstate). Pin everything. x86 only (cpx32/cx32).
  • Real Nix provisioning (the flake) + real abra; don't weaken anything to make the new server "pass."
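
The "no secrets in git" guardrail can be checked mechanically before the terraform/ PR is opened. The sketch below is illustrative only: the function name and the pattern list are assumptions, and the patterns cover Terraform state, variable files, the local `.terraform/` directory, and a stray `key.txt`, not every possible secret.

```shell
# forbidden_paths: read repo-relative paths on stdin and print the ones
# that should never be tracked. `.terraform.lock.hcl` is intentionally
# NOT matched, since the lock file is what pins the provider.
forbidden_paths() {
  grep -E '(^|/)([^/]*\.tfstate(\.backup)?|[^/]*\.tfvars|\.terraform/.*|key\.txt)$' || true
}

# Usage from the cc-ci repo root (empty output means nothing matched):
#   git ls-files | forbidden_paths
```

An empty result is necessary but not sufficient: it does not detect a token pasted into a tracked file, so a review of the diff is still needed.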