
autonomic-bot 21e7a79f50 orchestrator-hetzner: enable reboot-resilience + record migration

Now the workspace is staged on the Hetzner cpx22 (server 134487234, public
91.98.47.73, tailnet cc-ci-orchestrator-1 @ 100.84.190.30):

- configuration.nix: enable cc-ci-loops.service (wantedBy multi-user.target) so the
  loops + watchdog auto-resume on boot; wire reboot-log.sh as ExecStartPre so reboots
  auto-log to REBOOTS.md (boot_id-gated).
- plan-orchestrator-hetzner-migration.md: full migration record.
- REBOOTS.md / AGENTS.md: point the orchestrator host at Hetzner; first auto-logged
  reboot line.
- launch-orchestrator.sh: default session id -> the Hetzner orchestrator session.
- flake.lock: pin inputs.

Verified: nixos-rebuild switch applied; systemctl is-enabled cc-ci-loops.service =
enabled; ExecStartPre logged this boot to REBOOTS.md; loops healthy on phase 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-05-31 03:54:17 +00:00

5.0 KiB


Plan/record — migrate the ORCHESTRATOR off the Incus VM onto a Hetzner cloud server

Status: COMPLETE (2026-05-31). The orchestrator (Builder/Adversary loops + watchdog + this supervising session) now runs on a dedicated Hetzner cloud server, declared by the cc-ci-orchestrator-hetzner flake host. Kept as a historical record.

Why: the previous orchestrator host was the Incus VM cc-ci-orchestrator on b1 (100.116.55.106, 2 GB / 2 vCPU, see plan-orchestrator-migration — the earlier Pi→Incus move). A dedicated Hetzner box gives dedicated vCPU + NVMe and decouples the orchestrator from b1's hardware. This is the orchestrator analogue of the cc-ci server move in plan-migrate-cc-ci-to-hetzner (that one moves the CI server; this one moves the orchestrator that drives the loops).

Note on naming: this migration was carried out directly via terraform/ + the cc-ci-orchestrator-hetzner flake host. It is not the same as plan-migrate-cc-ci-to-hetzner.md (the cc-ci CI server → Hetzner cpx32) nor plan-orchestrator-migration.md (Pi → Incus VM). All three are distinct moves; only this file records the orchestrator → Hetzner step.

The new host (facts)


| Property | Value |
|---|---|
| Provider / type | Hetzner Cloud `cpx22`: AMD 2 vCPU / 4 GB, dedicated vCPU, NVMe |
| Location | `nbg1` (cpx11/cpx21 are retired there, hence `cpx22`) |
| Hetzner server ID | 134487234 |
| Public IPv4 | 91.98.47.73 (IPv6 disabled) |
| Tailnet | `cc-ci-orchestrator-1` @ 100.84.190.30 (`taila4a0bf.ts.net`); joins via `/etc/ts-auth-key` |
| OS | `debian-12` image → nixos-infect → NixOS, converged by the flake |
| Flake host | `nixosConfigurations.cc-ci-orchestrator-hetzner` (`flake.nix` → `nix/hosts/cc-ci-orchestrator-hetzner/{configuration,hardware}.nix`) |
| Workspace | `/srv/cc-ci-orch` (this repo); `/srv/cc-ci` is a symlink to it. Loop clones: `/srv/cc-ci/cc-ci`, `/srv/cc-ci/cc-ci-adv` |

The login keys (root authorizedKeys) and swap (4 GB disk swap — 4 GB RAM is tight for 3+ claude sessions) are declared in configuration.nix.
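To confirm that the declared swap is actually active next to the 4 GB of RAM, the totals can be read from `/proc/meminfo`. The following is a minimal sketch, not part of the repo; the function name `mem_and_swap_mib` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: mem_and_swap_mib is an illustrative helper, not a script from this repo.

# mem_and_swap_mib [MEMINFO_FILE]
# Prints "<ram_mib> <swap_mib>" parsed from a /proc/meminfo-style file.
# The kernel reports these values in kB (really KiB), so dividing by 1024 gives MiB.
mem_and_swap_mib() {
  awk '
    /^MemTotal:/  { ram  = $2 }
    /^SwapTotal:/ { swap = $2 }
    END { printf "%d %d\n", ram / 1024, swap / 1024 }
  ' "${1:-/proc/meminfo}"
}
```

Calling `mem_and_swap_mib` with no argument reads the live `/proc/meminfo`. On a real host `MemTotal` is typically somewhat below the nominal RAM size because the kernel reserves part of it.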

How it was provisioned (reproducible)

The whole box is reproducible from terraform/ + one nixos-rebuild:

1. terraform apply (terraform/main.tf): hcloud_server cpx22 from debian-12 in nbg1; user_data = user-data.sh runs nixos-infect on first boot (Debian→NixOS, reboot).
2. Stage 2 (terraform/README.md): SSH in, capture the nixos-infect hardware config (→ nix/hosts/cc-ci-orchestrator-hetzner/hardware.nix), then converge:
   ```
   # on the server, from the repo root (/srv/cc-ci-orch)
   nixos-rebuild switch --flake .#cc-ci-orchestrator-hetzner
   ```
3. Stage credentials (not in git, placed once): /etc/ts-auth-key (tailnet join), the loops' ~/.ssh/cc-ci-root-ed25519 + .testenv, and the sops master age key. claude auth login (device code) is the one interactive step so the loops can run --remote-control.
4. Stage the workspace: clone this repo to /srv/cc-ci-orch (symlink /srv/cc-ci), the Builder / Adversary clones, cc-ci-secrets, references/; copy .cc-ci-logs/.phase-idx (resume point).
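Before enabling the loops service, it can help to check that the staged pieces listed above are in place. The sketch below is illustrative and not part of the repo: the helper names `require_path` and `check_staged` are hypothetical, and the location of `.cc-ci-logs/.phase-idx` inside `/srv/cc-ci-orch` is an assumption. The optional root prefix exists only so the check can be tried against a scratch directory.

```shell
#!/usr/bin/env bash
# Sketch only: a staging preflight. Helper names are hypothetical.

# require_path TEST_FLAG PATH
# TEST_FLAG is a unary test operator such as -f, -d or -L.
require_path() {
  local flag="$1" path="$2"
  if [ "$flag" "$path" ]; then
    return 0
  fi
  echo "missing ($flag): $path"
  return 1
}

# check_staged [ROOT_PREFIX]
# Returns 0 when every staged item exists, 1 otherwise (missing items are printed).
check_staged() {
  local root="${1:-}" missing=0
  require_path -f "$root/etc/ts-auth-key"        || missing=$((missing + 1))
  require_path -d "$root/srv/cc-ci-orch"         || missing=$((missing + 1))
  require_path -L "$root/srv/cc-ci"              || missing=$((missing + 1))
  require_path -d "$root/srv/cc-ci/cc-ci"        || missing=$((missing + 1))
  require_path -d "$root/srv/cc-ci/cc-ci-adv"    || missing=$((missing + 1))
  # ASSUMPTION: the resume point lives inside the workspace checkout.
  require_path -f "$root/srv/cc-ci-orch/.cc-ci-logs/.phase-idx" || missing=$((missing + 1))
  [ "$missing" -eq 0 ]
}
```

On the server this would be run as `check_staged` with no argument. It does not cover the SSH key, `.testenv`, or the sops age key, whose exact paths the record does not fully spell out.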

Commit trail: 0103f36 (terraform + flake host, initial cpx11) → 17951b8 (fix → cpx22, add lock) → c44b967 (real cpx22 hardware config from nixos-infect, server 134487234). Plus the close-out commit below (root keys, drop tailscale --ssh, enable the loops service, this doc).

Reboot-resilience (the point of running on a managed host)

configuration.nix declares systemd.services.cc-ci-loops — a oneshot that runs launch.sh start with RESUME_PHASE=1 after network-online/tailscaled, bringing the loops + watchdog back on boot. It was authored disabled ("defined but NOT enabled until workspace is staged") with wantedBy commented out. Close-out (2026-05-31): the workspace is staged and the loops are running, so wantedBy = [ "multi-user.target" ] was uncommented and nixos-rebuild switch re-run → systemctl is-enabled cc-ci-loops.service = enabled. A reboot is now a non-event: systemd resumes the saved phase. (reboot-log.sh, the ExecStartPre, appends to REBOOTS.md boot_id-gated.)
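The boot_id gating mentioned above can be illustrated with a short sketch. This is not the actual reboot-log.sh: the function name `log_boot_once` and the log line format are assumptions. The idea is that the kernel exposes a UUID that changes on every boot, so a line is appended only if that UUID is not already in the log.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the real reboot-log.sh. Function name and line format are assumptions.

# log_boot_once LOG_FILE [BOOT_ID]
# Appends one line per distinct boot_id; repeated calls within the same boot are no-ops.
log_boot_once() {
  local log_file="$1"
  # Default to the kernel's per-boot UUID; an explicit BOOT_ID is accepted for testing.
  local boot_id="${2:-$(cat /proc/sys/kernel/random/boot_id)}"

  # Gate: this boot is already recorded, so do nothing.
  if [ -f "$log_file" ] && grep -qF -- "$boot_id" "$log_file"; then
    return 0
  fi

  printf '%s boot_id=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$boot_id" >> "$log_file"
}
```

With a gate like this, restarting the service several times within one boot adds at most one line, while each real reboot adds a new one.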

Caveat seen at first boot on this host: the loops were initially started by hand during staging (not by the service), so the first boot did NOT log to REBOOTS.md and the service showed linked/not-enabled. Enabling wantedBy (above) is what wires the automatic path.
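The difference between linked and enabled is easy to miss when reading `systemctl is-enabled` output. The sketch below shows one way to turn that state word into a yes/no answer; the function names `starts_on_boot` and `check_unit` are hypothetical and not part of the repo.

```shell
#!/usr/bin/env bash
# Sketch only: interprets the state word printed by `systemctl is-enabled`.
# Function names are illustrative.

# starts_on_boot STATE -> exit 0 if the unit's own install rules pull it in at boot.
starts_on_boot() {
  case "$1" in
    enabled) return 0 ;;  # a .wants/.requires symlink exists (what wantedBy produces)
    *)       return 1 ;;  # linked, disabled, static, masked, empty, ...
  esac
}

# check_unit UNIT: asks systemd for the state and prints a verdict.
check_unit() {
  local unit="$1" state
  # is-enabled exits non-zero for states such as "linked"; keep the printed word anyway.
  state="$(systemctl is-enabled "$unit" 2>/dev/null || true)"
  if starts_on_boot "$state"; then
    echo "$unit: $state (will start on boot)"
  else
    echo "$unit: ${state:-unknown} (will NOT start on boot by itself)"
    return 1
  fi
}
```

On the host this would be invoked as `check_unit cc-ci-loops.service`. A unit in another state can still be started if a different unit depends on it, so the verdict covers only the unit's own install rules.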

Status of the migration

✅ Hetzner cpx22 provisioned + converged from the flake (terraform + nixos-infect + one rebuild).
✅ On the tailnet (cc-ci-orchestrator-1) and ssh-able on the public IP.
✅ Builder + Adversary loops + watchdog running; phase sequence auto-advancing (watchdog on per-phase ## DONE).
✅ cc-ci-loops.service enabled → reboot-resilient.
◻︎ Old Incus orchestrator VM (100.116.55.106) — keep as cold standby a few days, then delete.
◻︎ Rotate the tailnet name once the old cc-ci-orchestrator peer is gone (this box is …-1).
