cc-ci-orchestrator/cc-ci-plan/plan-cc-ci-hetzner-migration.md
autonomic-bot 102427ab5b plan: full migrate-to-Hetzner (provision → cut over loops → stop old b1 VM); server type cpx31→cpx32
- plan-cc-ci-hetzner-migration.md: 3-phase plan — (1) provision the Hetzner cpx32 cc-ci fully + green
  !testme readiness gate, (2) repoint the loops + dashboard + *.ci at it (one ssh-config + DNS change),
  (3) stop the b1 cc-nix-test (cold standby). Parallel bring-up, reversible cutover, b1 freed.
- plan-cc-ci-hetzner-terraform.md: cpx31 is retired → default to cpx32 (current dedicated-vCPU 8GB).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 01:15:29 +00:00


# Plan — migrate the cc-ci SERVER from b1 Incus to Hetzner (full cutover)
**Status:** PROPOSED. Move the cc-ci **CI server** (`cc-nix-test`) off the slow b1 host onto a fast
Hetzner **cpx32** (8 GB, dedicated vCPU, NVMe), repoint the Builder/Adversary loops + everything at it,
then stop the old VM. **This file:** `/srv/cc-ci/cc-ci-plan/plan-cc-ci-hetzner-migration.md`.
**Owner:** assistant (provisioning + cutover mechanics) + orchestrator (coordination); operator for the
secret/DNS gates. **Supersedes** the narrower `plan-cc-ci-hetzner-terraform.md` (that is Phase 1's
deliverable; this plan wraps it with the cutover + decommission).

---
## 0. Context (why, and what's where)
- **Two VMs run on b1** (a 2015 **Intel i5-6400T low-power CPU + a spinning HDD** — measured: CPU
pressure ~55%, root disk `ROTA=1`):
  - **cc-ci server** `cc-nix-test` (tailnet `100.90.116.4`, 8 GB): where the loops deploy recipes and
    run the harness (the heavy CI work). **This is what we migrate.**
  - **orchestrator VM** `cc-ci-orchestrator` (tailnet `100.116.55.106`, 2 GB): where the loops,
    orchestrator, and assistant *run* (claude sessions). Stays for now.
- b1 is overloaded running both on a slow CPU + HDD — "everything is getting slow."
- **The win (see the perf analysis):** Hetzner cpx32 = modern dedicated vCPU + **NVMe** vs a 2015
low-power CPU + **HDD** → I/O-bound deploys (the ghost/discourse near-timeouts) likely **3–10×**
faster, CPU work **~2–3×**. Moving the *heavy* server off b1 also relieves b1, so the orchestrator VM
(still there) speeds up too.
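
The `ROTA=1` measurement above is the rotational flag that `lsblk -d -o NAME,ROTA` reports. As a minimal sketch for re-checking it on either host (b1 should report rotational, the Hetzner NVMe disk should not), the same flag can be read from sysfs. The function name `disk_kind` and the `SYSFS_ROOT` override are hypothetical and exist only for illustration and testing.

```shell
# Report whether a block device is rotational (HDD) or not (SSD/NVMe).
# SYSFS_ROOT is a hypothetical override so the function can be pointed at a fake tree.
disk_kind() {
  local dev="$1" root="${SYSFS_ROOT:-/sys}"
  local flag_file="$root/block/$dev/queue/rotational"
  # Missing or unreadable flag file: we cannot tell.
  [ -r "$flag_file" ] || { echo "unknown"; return 1; }
  case "$(cat "$flag_file")" in
    1) echo "rotational" ;;
    0) echo "non-rotational" ;;
    *) echo "unknown"; return 1 ;;
  esac
}
```

Usage would look like `disk_kind sda` on b1; the device name depends on the host.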
## 1. Phase 1 — provision the Hetzner cc-ci, FULLY ready
The `plan-cc-ci-hetzner-terraform.md` deliverable, taken all the way to a **converged, green** server
(not just "terraform applies"):
- `terraform/` in the cc-ci repo (cpx32, ubuntu-24.04, pinned hcloud provider + nixos-infect). `apply`
→ nixos-infect → bare NixOS on Hetzner.
- Add the `cc-ci-hetzner` flake host (nixos-infect's DO/Hetzner hardware + the shared `nix/modules/*`).
- **Full convergence (the D8 flow):** clone cc-ci `--recursive` + place the **bootstrap age key** at
`/var/lib/sops-nix/key.txt` (operator) + `nixos-rebuild switch --flake .#cc-ci-hetzner` → traefik /
drone / bridge / dashboard / backupbot / swarm all up, **0 failed units**.
- **DNS/cert:** point the `ci.commoninternet.net` and `*.ci` **A records at the Hetzner public IP** (the
server has one, so the b1 TLS-passthrough gateway can be dropped). Keep the sops wildcard cert for v1
(or switch to ACME; see §5).
- **Readiness gate (before any cutover):** ssh works; the dashboard + `*.ci.commoninternet.net` are
reachable; a **full `!testme` runs GREEN on the Hetzner server** (drive one recipe end-to-end via
the harness). Keep the b1 cc-ci running untouched in parallel during all of Phase 1.
- **Operator inputs for Phase 1:** `HCLOUD_TOKEN` (have), `TS_AUTH_KEY` (have), the **bootstrap age
key** (needed for convergence), and the **DNS change**. Note: the token may be invalidated after the
KEEPER server is applied — the server runs without it; only future `terraform` needs a (new) token.
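
The preconditions of the readiness gate (ssh works, 0 failed units, dashboard reachable) could be scripted roughly as below. This is a sketch under assumptions, not the plan's actual tooling: the function names are hypothetical, the host alias and URL are supplied by the caller, and the green `!testme` run itself is not covered because the plan does not describe how it is triggered.

```shell
# Count non-empty lines on stdin. Intended input (an assumption):
#   systemctl list-units --state=failed --no-legend --plain
count_failed_units() {
  grep -c . || true   # grep exits 1 on zero matches but still prints 0
}

# readiness_gate HOST URL: check ssh, failed units, and HTTP reachability.
readiness_gate() {
  local host="$1" url="$2" failed
  ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" true \
    || { echo "ssh to $host failed"; return 1; }
  failed=$(ssh "$host" systemctl list-units --state=failed --no-legend --plain \
    | count_failed_units)
  [ "$failed" -eq 0 ] || { echo "$failed failed units on $host"; return 1; }
  curl -fsS -o /dev/null --max-time 15 "$url" \
    || { echo "$url unreachable"; return 1; }
  echo "preconditions met; run a full !testme next"
}
```

A call might look like `readiness_gate <hetzner-ssh-alias> https://ci.commoninternet.net`, run from the orchestrator VM while the b1 server stays untouched.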
## 2. Phase 2 — cut everything over to the Hetzner server
Once Phase 1 is green, switch all consumers from the b1 `cc-nix-test` to the Hetzner server:
- **Loop access:** update the `Host cc-ci` entry in the loops' ssh config (on the orchestrator VM,
used by builder/adversary/orchestrator/assistant): change `HostName` from `100.90.116.4` to
the **Hetzner server's tailnet IP / MagicDNS**. (`ssh cc-ci` is the single indirection the loops
use, so this one change repoints all of them. The Hetzner box joins the SAME tailnet via
`TS_AUTH_KEY`, so it's a direct peer like today.)
- **CI flow:** the `!testme` → bridge → Drone → harness path + the dashboard now run on the Hetzner
server (they're part of the converged config there). The recipe mirrors stay on Gitea (unaffected).
- **State carry-over (minimal — mostly stateless):** recipes redeploy from the mirrors; **warm
canonicals re-seed** on the first green cold runs; the harness lives in the cc-ci repo. Drone build
history + dashboard state start **fresh** on the new server (acceptable; migrate only if wanted).
- **Verify cutover:** a full loop cycle works against Hetzner — Builder deploys + claims a gate, the
Adversary **cold-verifies green** on the Hetzner server; phase-2 recipe work continues, now fast.
Watch a ghost/discourse deploy to confirm the timeouts are gone.
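
The single ssh-config repoint described under "Loop access" can be sketched as a small helper that rewrites `HostName` only inside the matching `Host` block and keeps a `.bak` copy for rollback. The helper name `repoint_host` and the example MagicDNS name are hypothetical; the config path and the real tailnet address are whatever the orchestrator VM actually uses.

```shell
# repoint_host FILE HOST_ALIAS NEW_HOSTNAME
# Rewrites the HostName value only within the "Host HOST_ALIAS" block.
# Does not handle "Match" blocks or the "HostName=value" form.
repoint_host() {
  local file="$1" target="$2" new="$3" tmp
  tmp=$(mktemp) || return 1
  awk -v target="$target" -v new="$new" '
    # A new Host line opens a block; remember whether it names our alias.
    tolower($1) == "host" {
      in_block = 0
      for (i = 2; i <= NF; i++) if ($i == target) in_block = 1
    }
    # Inside the block, replace the last token of the HostName line.
    in_block && tolower($1) == "hostname" { sub(/[^ \t]+[ \t]*$/, new) }
    { print }
  ' "$file" > "$tmp" && cp "$file" "$file.bak" && mv "$tmp" "$file"
}
```

For example, `repoint_host ~/.ssh/config cc-ci cc-ci-hetzner` would cut over, and the same call with `100.90.116.4` as the last argument would revert.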
## 3. Phase 3 — stop the old cc-ci VM (free b1)
- Once everything is confirmed serving green on Hetzner, **stop `cc-nix-test` on b1** (Incus
`PUT .../state {"action":"stop"}`). **Keep it as a cold standby for a few days** (don't delete) for
rollback, then retire.
- b1 now runs only the small orchestrator VM → it gets b1's full (modest) resources → the loops'
*runtime* is less starved too. "Everything faster from here on out."
- **Rollback (until the old VM is deleted):** if Hetzner has a problem, revert the `Host cc-ci` ssh
entry to `100.90.116.4` and start the b1 VM again.
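
The stop (and the rollback start) are the same Incus state call with a different `action`. A minimal sketch follows; the request URL is deliberately left to the operator via a hypothetical `STATE_URL` variable because the plan elides the path, and authentication (for example curl's `--unix-socket` or `--cert`/`--key`) is omitted and must be added to match the actual setup. The 60-second timeout is an illustrative default.

```shell
# Build the JSON body for an instance state change.
# Only the two actions this plan needs are accepted.
state_body() {
  case "$1" in
    stop|start) printf '{"action":"%s","timeout":%d}\n' "$1" "${2:-60}" ;;
    *) echo "unsupported action: $1" >&2; return 1 ;;
  esac
}

# set_state ACTION: PUT the body to STATE_URL (operator-supplied, hypothetical name).
set_state() {
  local body
  [ -n "${STATE_URL:-}" ] || { echo "STATE_URL is not set" >&2; return 1; }
  body=$(state_body "$1") || return 1
  curl -fsS -X PUT -H 'Content-Type: application/json' -d "$body" "$STATE_URL"
}
```

`set_state stop` would put the VM into cold standby; `set_state start` is the second half of the rollback, after the ssh entry has been reverted.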
## 4. Sequencing & gates (don't break the running CI)
- **Strictly parallel bring-up:** Phase 1 stands Hetzner up *alongside* the live b1 cc-ci; **no
consumer is repointed until the Hetzner `!testme` is green** (Phase 1 readiness gate).
- The cutover (Phase 2) is a **single ssh-config repoint** + DNS — fast and reversible.
- Phase 3 (stop b1) only after Phase 2 is verified.
- The loops keep working on b1 throughout Phase 1 (no disruption); the brief cutover window is the
only moment they switch servers.
## 5. Open decisions (log in DECISIONS.md)
- **DNS/cert:** point `*.ci` at the Hetzner public IP + drop the gateway; sops cert (v1) vs ACME.
- **Drone/dashboard history:** fresh on Hetzner (default) vs migrate the volumes.
- **Orchestrator VM:** leave on b1 (freed) for now; a *later, separate* plan could also move the loops'
runtime to Hetzner and fully retire b1 — out of scope here (the runtime needn't be fast).
- **Token lifecycle:** invalidate `HCLOUD_TOKEN` after the keeper apply, or keep a (rotated) one for
ongoing `terraform` management of the server.
## 6. Definition of Done
- Hetzner cpx32 cc-ci fully converged (0 failed units) + a **green `!testme`** on it.
- Loops + dashboard + `*.ci.commoninternet.net` all served from Hetzner; a full Builder→Adversary
cycle verified green there; deploy/convergence visibly faster (ghost/discourse no longer near-timeout).
- Old b1 `cc-nix-test` **stopped** (cold standby, not deleted).
- `terraform/` committed to the cc-ci repo (via PR); no secrets/state in git; `docs/install.md`
updated for the Hetzner host. Adversary-verifiable: from-scratch reproducibility holds on Hetzner.
## 7. Guardrails
- Parallel bring-up; never repoint consumers until Hetzner is green; keep b1 as cold standby.
- No secrets in git (token, TS key, age key, tfstate). Pin everything. x86 only (cpx32/cx32).
- Real Nix provisioning (the flake) + real abra; don't weaken anything to make the new server "pass."
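
The "no secrets in git" guardrail could be backed by a simple check on the tracked file list, for instance before a PR is opened. The sketch below is an assumption about which paths count as sensitive (Terraform state, `.tfvars` files, and an age `key.txt`); the function name is hypothetical and the pattern list should be adjusted to the repo's real layout. It does not detect secrets embedded inside other files.

```shell
# Print any tracked paths that look like Terraform state, tfvars, or an age key.
# Reads one path per line on stdin; prints nothing when the list is clean.
tracked_secret_paths() {
  grep -E '(\.tfstate(\.backup)?|\.tfvars|(^|/)key\.txt)$' || true
}
```

Typical use would be `git ls-files | tracked_secret_paths`, treating any output as a failure.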