Plan — migrate the cc-ci SERVER from the Incus VM to Hetzner (provision → benchmark → cutover → retire)
Status: PROPOSED. Move the cc-ci CI server off the Incus VM cc-nix-test (b1: 2015 i5-6400T
+ spinning HDD, CPU-pressure ~55%, getting very slow) onto a Hetzner cpx32 (4 vCPU / 8 GB /
160 GB NVMe, x86, ~€16.49/mo). Everything (Builder, Adversary, the !testme pipeline) then targets the fast new server. This file: /srv/cc-ci/cc-ci-plan/plan-migrate-cc-ci-to-hetzner.md.
Key enabler (verified 2026-05-31): the bootstrap age key is already on this VM at
/srv/cc-ci/.sops/master-age.txt and the cc-ci-secrets submodule is populated — so the new server
can be fully provisioned end-to-end with NO operator secret-blocker (the D8 flow decrypts the TLS
cert + all secrets). The Pi is not needed.
Architecture reminder: the Builder/Adversary loops run on this orchestrator VM and reach the
CI server via ssh cc-ci; the !testme pipeline (Gitea webhook → bridge → Drone → harness) runs ON
the cc-ci server, and *.ci.commoninternet.net + the dashboard are served from it. "Switch
everything to the new server" = make the Hetzner box the cc-ci, then repoint ssh cc-ci, the
webhook/DNS, and the dashboard at it. The loops' code/clones don't move — only their target.
Phase 1 — Provision the new Hetzner cc-ci, fully converged (assistant)
Per plan-cc-ci-hetzner-terraform.md (the provisioning detail): terraform/ in the cc-ci repo →
hcloud cpx32 from ubuntu-24.04 → pinned nixos-infect → bare NixOS → add the cc-ci-hetzner
flake host (the nixos-infect-generated DO/Hetzner hardware + the shared nix/modules/*) → run the
D8 flow: clone --recursive, place /srv/cc-ci/.sops/master-age.txt at /var/lib/sops-nix/key.txt,
nixos-rebuild switch --flake .#cc-ci-hetzner. The server joins the tailnet (TS_AUTH_KEY).
- Accept: 0 failed units; traefik/drone/bridge/dashboard/backupbot up; the box is on the tailnet and ssh-able; terraform is idempotent (plan clean). This is a real server we keep (not the throwaway the terraform-plan first described): do not terraform destroy once it converges.
- Done in parallel: the old Incus cc-ci keeps serving the loops until Phase 3.
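The "0 failed units + services up" part of the Accept check could be scripted roughly as below. This is a sketch only: the systemd unit names in REQUIRED_UNITS are assumptions (the plan names the services, not their units), and backupbot may be a timer or oneshot for which is-active is the wrong test.

```python
import subprocess

# Assumed unit names; adjust to whatever the flake actually defines.
REQUIRED_UNITS = ["traefik", "drone", "bridge", "dashboard", "backupbot"]


def parse_failed_units(listing: str) -> list[str]:
    """Extract unit names from
    `systemctl list-units --state=failed --plain --no-legend` output
    (first whitespace-separated column of each non-empty line)."""
    return [line.split()[0] for line in listing.splitlines() if line.strip()]


def remote(host: str, *cmd: str) -> subprocess.CompletedProcess:
    """Run a command on `host` over ssh and capture its output."""
    return subprocess.run(["ssh", host, *cmd], capture_output=True, text=True)


def check_acceptance(host: str = "cc-ci-hetzner") -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    failed = remote(host, "systemctl", "list-units",
                    "--state=failed", "--plain", "--no-legend")
    if failed.returncode != 0:
        problems.append("could not list units (ssh or systemctl error)")
    problems += [f"failed unit: {u}" for u in parse_failed_units(failed.stdout)]
    for unit in REQUIRED_UNITS:
        # `systemctl is-active --quiet` exits 0 only when the unit is active.
        if remote(host, "systemctl", "is-active", "--quiet", unit).returncode != 0:
            problems.append(f"not active: {unit}")
    return problems
```

Calling check_acceptance() from the orchestrator VM returns an empty list when the box is ssh-able, has no failed units, and all listed units are active. Terraform idempotence still needs a separate terraform plan run.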
Phase 2 — Benchmark: old vs new, two recipes (a short report)
Pick two representative recipes — one light (e.g. n8n or custom-html) and one heavy/slow (e.g.
ghost or discourse — the HDD-bound timeout cases). Run the same full harness (cold,
install+upgrade+backup+restore+custom) on both servers:
- old: ssh cc-ci-incus (the current cc-nix-test), new: ssh cc-ci-hetzner.
- Capture per-tier + total wall-clock from the RUN SUMMARY for each recipe on each host. Write a short comparison report → docs/perf/hetzner-vs-incus.md in the cc-ci repo (table: recipe × tier × old-time × new-time × speedup). The report should confirm, or correct, the expected ~2–4× speedup (more on the I/O-bound phases). Run under identical conditions: same recipe versions, cold cache on both sides.
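A possible way to turn the captured timings into the report table, with speedup computed as old-time / new-time. This is a sketch: the row format (recipe, tier, old seconds, new seconds) is an assumption, since the plan does not specify how RUN SUMMARY timings are extracted, and the per-recipe total here is simply the sum of the tiers supplied.

```python
def render_report(rows):
    """Render a markdown table from (recipe, tier, old_s, new_s) tuples.

    Adds one 'total' row per recipe, summing the tiers that were passed in.
    """
    def fmt(recipe, tier, old_s, new_s):
        # Speedup > 1 means the new server is faster.
        return f"| {recipe} | {tier} | {old_s:.1f} | {new_s:.1f} | {old_s / new_s:.2f}x |"

    lines = [
        "| recipe | tier | old (s) | new (s) | speedup |",
        "|---|---|---:|---:|---:|",
    ]
    totals = {}  # recipe -> [old_total, new_total], in first-seen order
    for recipe, tier, old_s, new_s in rows:
        lines.append(fmt(recipe, tier, old_s, new_s))
        acc = totals.setdefault(recipe, [0.0, 0.0])
        acc[0] += old_s
        acc[1] += new_s
    for recipe, (old_s, new_s) in totals.items():
        lines.append(fmt(recipe, "total", old_s, new_s))
    return "\n".join(lines)
```

The rows should come from the real RUN SUMMARY output on each host. No numbers are shown here because none have been measured yet.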
Phase 3 — Cutover: point everything at the new server (orchestrated; pick a quiet moment)
- Quiesce briefly: ensure no live !testme/deploy is mid-run on the old server.
- Repoint the loops' ssh cc-ci → the Hetzner box's tailnet IP: in /home/loops/.ssh/config (and root's), set HostName under Host cc-ci to the new IP. The loops keep working from this VM; only their target changes. (Keep a Host cc-ci-incus alias for the old box during the overlap.)
- DNS / webhook / gateway: point ci.commoninternet.net + the *.ci wildcard A record at the Hetzner public IP (drop the TLS-passthrough gateway; Traefik on the new server terminates TLS directly, and the sops wildcard cert works as-is). Re-point the Gitea issue_comment webhook → the new server so !testme triggers there. DNS is operator-owned (commoninternet.net): the one operator step.
- Verify end-to-end on the new server: a real PR !testme runs green through the new bridge→Drone→harness; the dashboard + *.ci.commoninternet.net load; the loops' ssh cc-ci deploys land on Hetzner. Re-run the relevant D-gates, cold-verified by the Adversary.
- Make cc-ci-hetzner the canonical nixosConfigurations.cc-ci in the flake (retire the Incus hardware.nix once the old box is gone).
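The ssh repoint in the second step amounts to changing one HostName line inside one Host block. A minimal sketch of that edit as a text transformation follows. It assumes the simple "Keyword value" form of ssh_config (no "Keyword=value" lines) and that the Host cc-ci block already contains a HostName line. The function name is hypothetical.

```python
def repoint_host(config_text: str, host: str, new_hostname: str) -> str:
    """Return config_text with HostName replaced inside the `Host <host>` block.

    Only blocks whose Host line lists `host` exactly are touched, so an
    alias such as cc-ci-incus keeps its old address.
    """
    out = []
    in_block = False
    replaced = False
    for line in config_text.splitlines():
        parts = line.strip().split(None, 1)
        keyword = parts[0].lower() if parts else ""
        if keyword == "host":
            in_block = len(parts) > 1 and host in parts[1].split()
        elif keyword == "match":
            in_block = False  # a Match block ends the current Host block
        elif in_block and keyword == "hostname":
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}HostName {new_hostname}"
            replaced = True
        out.append(line)
    if not replaced:
        raise ValueError(f"no HostName found under Host {host}")
    return "\n".join(out) + "\n"
```

Apply it to both /home/loops/.ssh/config and root's config, passing the Hetzner box's tailnet IP as new_hostname. Adding the Host cc-ci-incus alias for the overlap remains a manual step in this sketch.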
Phase 4 — Retire the old Incus cc-nix-test
Once Hetzner is the verified live cc-ci: stop the Incus VM via the b1 Incus API (mTLS certs are on
this VM under incus-terraform-nix-vm-creator/terraform-secrets/): PUT .../instances/cc-nix-test/state with body {"action":"stop"}. Keep it as a cold standby for a few days, then delete (frees b1). Update
the memory/docs (cc-ci-setup) to point cc-ci at Hetzner.
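The stop call could look roughly like this with the Python standard library. It is a sketch under assumptions: api_base stands for the URL prefix the plan elides as "...", and the cert, key, and CA file paths are parameters because the plan names only the directory, not the files inside it.

```python
import json
import ssl
import urllib.request


def build_stop_request(api_base: str, instance: str = "cc-nix-test") -> urllib.request.Request:
    """Build the PUT <api_base>/instances/<instance>/state request."""
    url = f"{api_base.rstrip('/')}/instances/{instance}/state"
    body = json.dumps({"action": "stop"}).encode()
    return urllib.request.Request(
        url, data=body, method="PUT",
        headers={"Content-Type": "application/json"},
    )


def stop_instance(api_base: str, cert_file: str, key_file: str, ca_file=None) -> dict:
    """Send the stop request using a client certificate (mTLS).

    ca_file should be the certificate that verifies the Incus server;
    with None, the system trust store is used, which will likely reject
    a self-signed server certificate.
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    with urllib.request.urlopen(build_stop_request(api_base), context=ctx) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes it possible to print and review the exact URL and body before anything is stopped.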
Who does what
- Assistant: Phase 1 (the terraform + full convergence) and the Phase-2 benchmark runs.
- Orchestrator (me) + operator: Phase 3 cutover (I do the ssh-repoint + the Incus stop via the API; operator does the DNS change + the go/no-go) and Phase 4.
Guardrails
- Parallel bring-up — never break the running Incus cc-ci until Hetzner is verified green; the cutover is the only switch moment, at a quiet point.
- No secrets in git: HCLOUD_TOKEN, TS key, the age key (.sops/), tfstate all gitignored (.gitignore hardened for *age*.txt / .sops/); never echo/commit them.
- x86 cpx32, pin the hcloud provider + nixos-infect rev (nixpkgs already pinned).
- Reproducible-from-scratch holds (the D8 guarantee): the Hetzner cc-ci comes from terraform apply + one nixos-rebuild switch, no hand steps beyond the operator DNS + age key.
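As a belt-and-braces check on the no-secrets guardrail, staged paths could be screened before a commit. A sketch: the *age*.txt and .sops/ patterns come from the guardrail above, while the tfstate patterns are an assumption about how the state files are named. This is plain filename matching and does not reimplement gitignore semantics.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# *age*.txt and .sops/ are from the plan; the tfstate patterns are assumed.
SECRET_NAME_PATTERNS = ["*age*.txt", "*.tfstate", "*.tfstate.*"]
SECRET_DIRS = {".sops"}


def secret_paths(paths):
    """Return the paths that look like secrets and must not be committed."""
    flagged = []
    for raw in paths:
        path = PurePosixPath(raw)
        # Any parent directory named like a secret dir flags the file.
        in_secret_dir = any(part in SECRET_DIRS for part in path.parts[:-1])
        name_matches = any(fnmatch(path.name, pat) for pat in SECRET_NAME_PATTERNS)
        if in_secret_dir or name_matches:
            flagged.append(raw)
    return flagged
```

Feeding it the output of git diff --cached --name-only gives a list that should be empty. Like the .gitignore pattern it mirrors, *age*.txt is deliberately broad and will also flag harmless names such as pages.txt.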
Definition of Done
- Hetzner cpx32 cc-ci fully converged (0 failed units) via terraform + the D8 flake flow.
- docs/perf/hetzner-vs-incus.md shows the two-recipe old-vs-new comparison (real numbers).
- The loops, !testme pipeline, dashboard, and *.ci.commoninternet.net all run on Hetzner; a PR !testme is green end-to-end there; D-gates re-verified.
- The Incus cc-nix-test is stopped (cold standby → deletion); the flake's canonical cc-ci host is Hetzner; docs/memory updated.