plan: full migrate-cc-ci-to-hetzner (provision cpx32 → benchmark 2 recipes → cutover loops+pipeline+DNS → retire Incus VM); age key is on the VM so no secret-blocker; harden .gitignore for the age key
83
cc-ci-plan/plan-migrate-cc-ci-to-hetzner.md
Normal file
@@ -0,0 +1,83 @@
# Plan — migrate the cc-ci SERVER from the Incus VM to Hetzner (provision → benchmark → cutover → retire)

**Status:** PROPOSED. Move the **cc-ci CI server** off the Incus VM `cc-nix-test` (b1: 2015 i5-6400T +
**spinning HDD**, CPU-pressure ~55%, getting very slow) onto a **Hetzner `cpx32`** (4 vCPU / 8 GB /
160 GB **NVMe**, x86, ~€16.49/mo). Everything (Builder, Adversary, the !testme pipeline) then targets
the fast new server. **This file:** `/srv/cc-ci/cc-ci-plan/plan-migrate-cc-ci-to-hetzner.md`.

**Key enabler (verified 2026-05-31):** the bootstrap age key is **already on this VM** at
`/srv/cc-ci/.sops/master-age.txt` and the `cc-ci-secrets` submodule is populated, so the new server
can be **fully provisioned end-to-end with NO operator secret-blocker** (the D8 flow decrypts the TLS
cert + all secrets). The Pi is not needed.

**Architecture reminder:** the Builder/Adversary **loops run on this orchestrator VM** and reach the
CI server via `ssh cc-ci`; the **!testme pipeline (Gitea webhook → bridge → Drone → harness) runs ON
the cc-ci server**, and `*.ci.commoninternet.net` + the dashboard are served from it. "Switch
everything to the new server" = make the Hetzner box the cc-ci, then repoint `ssh cc-ci`, the
webhook/DNS, and the dashboard at it. The loops' code/clones don't move; only their target changes.

---

## Phase 1 — Provision the new Hetzner cc-ci, fully converged (assistant)
Per **`plan-cc-ci-hetzner-terraform.md`** (the provisioning detail): `terraform/` in the cc-ci repo →
`hcloud` `cpx32` from `ubuntu-24.04` → **pinned nixos-infect** → bare NixOS → add the **`cc-ci-hetzner`
flake host** (the nixos-infect-generated DO/Hetzner hardware config + the shared `nix/modules/*`) → run
the **D8 flow**: clone `--recursive`, place this VM's `/srv/cc-ci/.sops/master-age.txt` at
`/var/lib/sops-nix/key.txt` on the new server, then run `nixos-rebuild switch --flake .#cc-ci-hetzner`.
The server joins the tailnet (TS_AUTH_KEY).
- **Accept:** 0 failed units; traefik/drone/bridge/dashboard/backupbot up; the box is on the tailnet
  and ssh-able; terraform is idempotent (`plan` clean). This is a **real** server we keep (not the
  throwaway the terraform-plan first described): do **not** `terraform destroy` once it converges.
- Done in **parallel**: the old Incus cc-ci keeps serving the loops until Phase 3.

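The D8 flow and the acceptance checks above can be sketched as a dry-run script driven from the orchestrator VM. This is a minimal sketch, not the project's actual tooling: `REPO_URL`, the `/srv/cc-ci` checkout path on the new server, root SSH access via the `cc-ci-hetzner` alias, and `git` being available on the freshly infected host are all assumptions. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Dry-run sketch of the D8 bring-up. Assumed (not stated in the plan):
# REPO_URL, the remote checkout path /srv/cc-ci, and the root@cc-ci-hetzner login.
REPO_URL="${REPO_URL:-REPLACE-WITH-CC-CI-REPO-URL}"
TARGET="${TARGET:-root@cc-ci-hetzner}"
AGE_KEY="${AGE_KEY:-/srv/cc-ci/.sops/master-age.txt}"
DRY_RUN="${DRY_RUN:-1}"

# run: print the command when DRY_RUN=1, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# d8_converge: clone with submodules, install the age key, switch to the flake host.
d8_converge() {
  run ssh "$TARGET" "git clone --recursive $REPO_URL /srv/cc-ci" &&
  run ssh "$TARGET" "install -d -m 700 /var/lib/sops-nix" &&
  run scp "$AGE_KEY" "$TARGET:/var/lib/sops-nix/key.txt" &&
  run ssh "$TARGET" "chmod 600 /var/lib/sops-nix/key.txt" &&
  run ssh "$TARGET" "cd /srv/cc-ci && nixos-rebuild switch --flake .#cc-ci-hetzner"
}

# d8_accept: no failed systemd units, and terraform reports no pending changes
# (-detailed-exitcode makes `plan` exit 0 only when there is nothing to change).
d8_accept() {
  run ssh "$TARGET" 'test -z "$(systemctl --failed --no-legend)"' &&
  run terraform -chdir=terraform plan -detailed-exitcode
}
```

Calling `d8_converge` and then `d8_accept` with the default `DRY_RUN=1` lists the steps for review; setting `DRY_RUN=0` would execute them. The script only ever passes the key's path, never its contents, which keeps it in line with the "never echo" guardrail.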
## Phase 2 — Benchmark: old vs new, two recipes (a short report)
Pick **two representative recipes**: one light (e.g. `n8n` or `custom-html`) and one heavy/slow (e.g.
`ghost` or `discourse`, the HDD-bound timeout cases). Run the **same full harness** (cold,
install+upgrade+backup+restore+custom) on **both servers**:
- old: `ssh cc-ci-incus` (the current `cc-nix-test`); new: `ssh cc-ci-hetzner`.
- Capture **per-tier + total wall-clock** from the `RUN SUMMARY` for each recipe on each host.

Write a short comparison report → **`docs/perf/hetzner-vs-incus.md`** in the cc-ci repo (table: recipe
× tier × old-time × new-time × speedup). This checks empirically whether the expected ~2–4× speedup
(more on the I/O-bound phases) holds. *(Run under identical conditions: same recipe versions, cold
cache on both sides.)*

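The report table can be generated from the captured timings with a small helper. The sketch below is illustrative only: the function names are hypothetical, the timings are passed in by hand because the `RUN SUMMARY` format is not specified here, and speedup is taken as old time divided by new time.

```shell
#!/bin/sh
# report_header: Markdown header for the recipe x tier comparison table.
report_header() {
  printf '| recipe | tier | old (s) | new (s) | speedup |\n'
  printf '|---|---|---|---|---|\n'
}

# speedup_row RECIPE TIER OLD_SECONDS NEW_SECONDS
# Prints one Markdown table row; speedup = old / new, rounded to two decimals.
speedup_row() {
  awk -v r="$1" -v t="$2" -v o="$3" -v n="$4" 'BEGIN {
    if (n <= 0) { print "new time must be > 0" > "/dev/stderr"; exit 1 }
    printf "| %s | %s | %s | %s | %.2fx |\n", r, t, o, n, o / n
  }'
}
```

For example, `speedup_row ghost install 300 100` prints `| ghost | install | 300 | 100 | 3.00x |`. Those numbers are placeholders to show the arithmetic, not measurements; the real rows come from the benchmark runs.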
## Phase 3 — Cutover: point everything at the new server (orchestrated; pick a quiet moment)
1. **Quiesce briefly:** ensure no live `!testme`/deploy is mid-run on the old server.
2. **Repoint the loops' `ssh cc-ci`** → the Hetzner box's tailnet IP: in `/home/loops/.ssh/config`
   (and root's), set the `HostName` under `Host cc-ci` to the new IP. The loops keep working from
   this VM; only their target changes. (Keep a `Host cc-ci-incus` alias for the old box during the
   overlap.)
3. **DNS / webhook / gateway:** point `ci.commoninternet.net` + the `*.ci` wildcard **A record at the
   Hetzner public IP** (drop the TLS-passthrough gateway; Traefik on the Hetzner server terminates TLS
   directly, and the sops wildcard cert works as-is). Re-point the Gitea `issue_comment` webhook → the
   new server so `!testme` triggers there. **DNS is operator-owned (`commoninternet.net`)**: this is
   the one operator step.
4. **Verify end-to-end on the new server:** a real PR `!testme` runs green through the new
   bridge→Drone→harness; the dashboard + `*.ci.commoninternet.net` load; the loops' `ssh cc-ci` deploys
   land on Hetzner. Re-run the relevant D-gates, cold-verified by the Adversary.
5. Make `cc-ci-hetzner` the **canonical** `nixosConfigurations.cc-ci` in the flake (retire the Incus
   `hardware.nix` once the old box is gone).

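Steps 2 and 3 of the cutover can be sketched with two small helpers: one that renders the `Host` stanzas for the SSH config, and one that checks where a DNS name currently points. The function names are hypothetical, the IPs are supplied by the caller, and `dns_points_at` assumes `dig` is installed on the orchestrator VM. Only `HostName` is shown; any `User` or `IdentityFile` lines in the existing config would stay as they are.

```shell
#!/bin/sh
# render_ssh_hosts NEW_TAILNET_IP OLD_TAILNET_IP
# Prints the stanzas for ~/.ssh/config: cc-ci now targets the Hetzner box,
# cc-ci-incus keeps the old Incus VM reachable during the overlap.
render_ssh_hosts() {
  cat <<EOF
Host cc-ci
    HostName $1
Host cc-ci-incus
    HostName $2
EOF
}

# dns_points_at NAME EXPECTED_IP
# Succeeds if one of NAME's A records equals EXPECTED_IP exactly.
dns_points_at() {
  dig +short "$1" A | grep -qxF "$2"
}
```

A possible check after the operator's DNS change is `dns_points_at ci.commoninternet.net "$HETZNER_IP"`, plus the same call for any name under `*.ci.commoninternet.net` to exercise the wildcard record. `HETZNER_IP` here stands for the server's public IP and is not defined by the plan.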
## Phase 4 — Retire the old Incus cc-nix-test
Once Hetzner is the verified live cc-ci: **stop** the Incus VM via the b1 Incus API (mTLS certs are on
this VM under `incus-terraform-nix-vm-creator/terraform-secrets/`):
`PUT .../instances/cc-nix-test/state {"action":"stop"}`. Keep it as a **cold standby for a few days**,
then delete (frees b1). Update the memory/docs ([[cc-ci-setup]]) to point cc-ci at Hetzner.

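The stop request could be issued with `curl` and the mTLS client certificate. This is a hedged sketch: `INCUS_API` stands for the API base URL that the plan elides and must be supplied by whoever runs it, and the certificate file names under `terraform-secrets/` are hypothetical. Depending on how b1's server certificate is trusted, an extra `--cacert` option may also be needed. It defaults to a dry run that prints the command.

```shell
#!/bin/sh
# Assumed names: client.crt / client.key are placeholders for the real mTLS files.
SECRETS_DIR="${SECRETS_DIR:-incus-terraform-nix-vm-creator/terraform-secrets}"
CLIENT_CERT="${CLIENT_CERT:-$SECRETS_DIR/client.crt}"
CLIENT_KEY="${CLIENT_KEY:-$SECRETS_DIR/client.key}"
DRY_RUN="${DRY_RUN:-1}"

# incus_stop INSTANCE
# Sends PUT <INCUS_API>/instances/INSTANCE/state with {"action":"stop"}.
incus_stop() {
  : "${INCUS_API:?set INCUS_API to the Incus API base URL}"
  instance="$1"
  set -- curl --fail --silent --show-error \
    --cert "$CLIENT_CERT" --key "$CLIENT_KEY" \
    -X PUT -H 'Content-Type: application/json' \
    --data '{"action":"stop"}' \
    "$INCUS_API/instances/$instance/state"
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"   # show the request without sending it
  else
    "$@"
  fi
}
```

Running `incus_stop cc-nix-test` with `DRY_RUN=1` shows the exact request for the go/no-go review; the instance is only stopped when `DRY_RUN=0`.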
## Who does what
- **Assistant:** Phase 1 (the terraform + full convergence) and the Phase-2 benchmark runs.
- **Orchestrator (me) + operator:** Phase 3 cutover (I do the ssh-repoint + the Incus stop via the
  API; **operator does the DNS change** + the go/no-go) and Phase 4.

## Guardrails
- **Parallel bring-up**: never break the running Incus cc-ci until Hetzner is verified green; the
  cutover is the only switch moment, at a quiet point.
- **No secrets in git**: `HCLOUD_TOKEN`, TS key, the age key (`.sops/`), tfstate all gitignored
  (`.gitignore` hardened for `*age*.txt`/`.sops/`); never echo/commit them.
- **x86 `cpx32`**, pin the hcloud provider + nixos-infect rev (nixpkgs already pinned).
- **Reproducible-from-scratch holds** (the D8 guarantee): the Hetzner cc-ci comes from
  `terraform apply` + one `nixos-rebuild switch`, no hand steps beyond the operator DNS + age key.

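The "no secrets in git" guardrail can be spot-checked before committing. In this sketch the `.sops/` and `*age*.txt` patterns come from the plan, while the Terraform patterns are assumed conventional names rather than the repo's confirmed `.gitignore` contents; the function names are hypothetical.

```shell
#!/bin/sh
# gitignore_rules: prints ignore patterns for the secrets named in the guardrail.
gitignore_rules() {
  cat <<'EOF'
# age key material (patterns from the plan)
.sops/
*age*.txt
# Terraform state and local provider cache (assumed patterns)
*.tfstate
*.tfstate.*
.terraform/
EOF
}

# is_ignored PATH: succeeds when git would ignore PATH in the current repo.
# check-ignore only reports untracked paths, so an already committed file
# is NOT protected by .gitignore and would make this check fail.
is_ignored() {
  git check-ignore -q -- "$1"
}
```

Inside the repo, `is_ignored .sops/master-age.txt || echo "age key is NOT ignored"` gives a quick pre-commit check. Note that `*age*.txt` is a broad glob: it also matches unrelated names such as `usage.txt`, which may or may not be intended.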
## Definition of Done
- Hetzner `cpx32` cc-ci fully converged (0 failed units) via terraform + the D8 flake flow.
- `docs/perf/hetzner-vs-incus.md` shows the two-recipe old-vs-new comparison (real numbers).
- The loops target Hetzner, and the `!testme` pipeline, dashboard, and `*.ci.commoninternet.net` all
  run on it; a PR `!testme` is green end-to-end there; D-gates re-verified.
- The Incus `cc-nix-test` is stopped (cold standby → deletion); the flake's canonical `cc-ci` host is
  Hetzner; docs/memory updated.