Now the workspace is staged on the Hetzner cpx22 (server 134487234, public 91.98.47.73, tailnet cc-ci-orchestrator-1 @ 100.84.190.30):

- configuration.nix: enable cc-ci-loops.service (wantedBy multi-user.target) so the loops + watchdog auto-resume on boot; wire reboot-log.sh as ExecStartPre so reboots auto-log to REBOOTS.md (boot_id-gated).
- plan-orchestrator-hetzner-migration.md: full migration record.
- REBOOTS.md / AGENTS.md: point the orchestrator host at Hetzner; first auto-logged reboot line.
- launch-orchestrator.sh: default session id -> the Hetzner orchestrator session.
- flake.lock: pin inputs.

Verified: nixos-rebuild switch applied; systemctl is-enabled cc-ci-loops.service = enabled; ExecStartPre logged this boot to REBOOTS.md; loops healthy on phase 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Plan/record: migrate the ORCHESTRATOR off the Incus VM onto a Hetzner cloud server

**Status:** COMPLETE (2026-05-31). The orchestrator (Builder/Adversary loops + watchdog + this
supervising session) now runs on a dedicated **Hetzner** cloud server, declared by the
`cc-ci-orchestrator-hetzner` flake host. Kept as a historical record.

**Why:** the previous orchestrator host was the Incus VM `cc-ci-orchestrator` on b1
(`100.116.55.106`, 2 GB / 2 vCPU; see [[plan-orchestrator-migration]], the earlier Pi→Incus move).
A dedicated Hetzner box gives dedicated vCPU + NVMe and decouples the orchestrator from b1's hardware.
This is the orchestrator analogue of the cc-ci **server** move in [[plan-migrate-cc-ci-to-hetzner]]
(that one moves the *CI server*; this one moves the *orchestrator that drives the loops*).

> **Note on naming:** this migration was carried out directly via `terraform/` + the
> `cc-ci-orchestrator-hetzner` flake host. It is **not** the same as `plan-migrate-cc-ci-to-hetzner.md`
> (the cc-ci CI server → Hetzner `cpx32`) nor `plan-orchestrator-migration.md` (Pi → Incus VM). All
> three are distinct moves; only this file records the orchestrator → Hetzner step.

---
## The new host (facts)

| | |
|---|---|
| Provider / type | **Hetzner Cloud `cpx22`**: AMD **2 vCPU / 4 GB**, dedicated vCPU, NVMe |
| Location | `nbg1` (cpx11/cpx21 are retired there, hence `cpx22`) |
| Hetzner server ID | **134487234** |
| Public IPv4 | **91.98.47.73** (IPv6 disabled) |
| Tailnet | **`cc-ci-orchestrator-1`** @ **100.84.190.30** (`taila4a0bf.ts.net`); joins via `/etc/ts-auth-key` |
| OS | `debian-12` image → **nixos-infect** → NixOS, converged by the flake |
| Flake host | **`nixosConfigurations.cc-ci-orchestrator-hetzner`** (`flake.nix` → `nix/hosts/cc-ci-orchestrator-hetzner/{configuration,hardware}.nix`) |
| Workspace | `/srv/cc-ci-orch` (this repo); `/srv/cc-ci` is a **symlink** to it. Loop clones: `/srv/cc-ci/cc-ci`, `/srv/cc-ci/cc-ci-adv` |

The login keys (root `authorizedKeys`) and swap (4 GB disk swap, because 4 GB RAM is tight for 3+ claude
sessions) are declared in `configuration.nix`.
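The `/srv/cc-ci` symlink from the Workspace row can be created idempotently, so re-running the staging step does not nest a second link inside the target directory. A minimal sketch, assuming GNU coreutils; `link_workspace` is a hypothetical helper name, not something from the repo:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the repo): point LINK at TARGET, replacing
# an existing symlink instead of creating a new link inside the old target.
link_workspace() {
  local target="$1" link="$2"
  # -s symbolic, -f replace an existing link,
  # -n do not follow LINK if it is already a symlink to a directory
  ln -sfn -- "$target" "$link"
  # print where the link points now
  readlink -- "$link"
}
```

On this host the call would be `link_workspace /srv/cc-ci-orch /srv/cc-ci`.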
## How it was provisioned (reproducible)

The whole box is reproducible from `terraform/` + one `nixos-rebuild`:

1. **`terraform apply`** (`terraform/main.tf`): `hcloud_server` `cpx22` from `debian-12` in `nbg1`;
   `user_data = user-data.sh` runs **nixos-infect** on first boot (Debian→NixOS, reboot).
2. **Stage 2** (`terraform/README.md`): SSH in, capture the nixos-infect hardware config
   (→ `nix/hosts/cc-ci-orchestrator-hetzner/hardware.nix`), then converge:
   ```bash
   # on the server, from the repo root (/srv/cc-ci-orch)
   nixos-rebuild switch --flake .#cc-ci-orchestrator-hetzner
   ```
3. Stage credentials (not in git, placed once): `/etc/ts-auth-key` (tailnet join), the loops'
   `~/.ssh/cc-ci-root-ed25519` + `.testenv`, and the sops master age key. `claude auth login`
   (device code) is the one interactive step so the loops can run `--remote-control`.
4. Stage the workspace: clone this repo to `/srv/cc-ci-orch` (symlink `/srv/cc-ci`), the Builder /
   Adversary clones, `cc-ci-secrets`, `references/`; copy `.cc-ci-logs/.phase-idx` (resume point).

**Commit trail:** `0103f36` (terraform + flake host, initial `cpx11`) → `17951b8` (fix → `cpx22`,
add lock) → `c44b967` (real cpx22 hardware config from nixos-infect, server 134487234). Plus the
close-out commit below (root keys, drop tailscale `--ssh`, enable the loops service, this doc).
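Steps 3 and 4 above place files that are not in git, so a missing one tends to surface only when a loop fails. A small preflight can list what is absent before the service is enabled. This is a sketch, not a script from the repo: `missing_paths` is a hypothetical name, and the paths to pass are whatever was actually staged (this record does not give locations for `.testenv` or the sops age key).

```bash
#!/usr/bin/env bash
# Hypothetical preflight (not from the repo): print every argument that does
# not exist on disk; return non-zero if anything is missing.
missing_paths() {
  local p rc=0
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      printf '%s\n' "$p"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, `missing_paths /etc/ts-auth-key "$HOME/.ssh/cc-ci-root-ed25519" /srv/cc-ci-orch /srv/cc-ci`, run as the user that owns the loops, prints nothing and exits 0 once those four are in place.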
## Reboot-resilience (the point of running on a managed host)

`configuration.nix` declares **`systemd.services.cc-ci-loops`**, a oneshot that runs
`launch.sh start` with `RESUME_PHASE=1` after `network-online`/`tailscaled`, bringing the loops +
watchdog back on boot. It was authored **disabled** ("defined but NOT enabled until workspace is
staged") with `wantedBy` commented out. **Close-out (2026-05-31):** the workspace is staged and the
loops are running, so `wantedBy = [ "multi-user.target" ]` was uncommented and `nixos-rebuild switch`
was re-run; `systemctl is-enabled cc-ci-loops.service` now reports **enabled**. A reboot is now a
non-event: systemd resumes the saved phase. (`reboot-log.sh`, the `ExecStartPre`, appends to
[[REBOOTS.md]], gated on `boot_id`.)
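To illustrate what boot_id-gated logging means: Linux exposes a per-boot UUID at `/proc/sys/kernel/random/boot_id`, so a logger can skip the append when that ID is already in the log, and restarting the service within one boot does not add a second line. The real `reboot-log.sh` is not shown in this record; the function below is a hypothetical sketch of the idea, with an assumed line format.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a boot_id-gated logger; NOT the repo's reboot-log.sh.
# Usage: log_boot_once LOGFILE [BOOT_ID]
log_boot_once() {
  local log="$1"
  # default to the kernel's per-boot UUID (Linux only)
  local boot_id="${2:-$(cat /proc/sys/kernel/random/boot_id)}"
  # gate: this boot is already recorded, so do nothing
  if [ -f "$log" ] && grep -qF -- "boot_id=$boot_id" "$log"; then
    return 0
  fi
  # assumed line format: UTC timestamp plus the boot id
  printf -- '- %s boot_id=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$boot_id" >> "$log"
}
```

Calling `log_boot_once REBOOTS.md` twice in the same boot leaves one line; the next boot has a new ID and adds another.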
> **Caveat seen at first boot on this host:** the loops were initially started *by hand* during
> staging (not by the service), so the first boot did NOT log to `REBOOTS.md` and the service showed
> `linked`/not-enabled. Enabling `wantedBy` (above) is what wires the automatic path.
## Status of the migration

- ✅ Hetzner `cpx22` provisioned + converged from the flake (terraform + nixos-infect + one rebuild).
- ✅ On the tailnet (`cc-ci-orchestrator-1`) and ssh-able on the public IP.
- ✅ Builder + Adversary loops + watchdog running; phase sequence auto-advancing (the watchdog advances on each phase's `## DONE`).
- ✅ `cc-ci-loops.service` **enabled** → reboot-resilient.
- ◻︎ Old Incus orchestrator VM (`100.116.55.106`): keep as cold standby a few days, then delete.
- ◻︎ Rotate the tailnet name once the old `cc-ci-orchestrator` peer is gone (this box is `…-1`).
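The watchdog's advance condition is described here only as a per-phase `## DONE` marker. As a rough illustration (the file that carries the marker and the function name `phase_done` are assumptions, not taken from the repo), such a check could be a line-anchored grep, so that prose merely mentioning the marker does not count:

```bash
#!/usr/bin/env bash
# Hypothetical check (not the repo's watchdog): succeed only if FILE has a
# line that is exactly the heading "## DONE" (trailing spaces tolerated).
phase_done() {
  local file="$1"
  [ -f "$file" ] && grep -qE '^## DONE[[:space:]]*$' "$file"
}
```

A caller would poll with something like `phase_done "$phase_file" && advance`, where both `$phase_file` and `advance` stand in for whatever the real watchdog uses.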