orchestrator-hetzner: enable reboot-resilience + record migration

Now that the workspace is staged on the Hetzner cpx22 (server 134487234, public
168.119.126.100, tailnet cc-ci-orchestrator-1 @ 100.84.190.30):

- configuration.nix: enable cc-ci-loops.service (wantedBy multi-user.target) so the
  loops + watchdog auto-resume on boot; wire reboot-log.sh as ExecStartPre so reboots
  auto-log to REBOOTS.md (boot_id-gated).
- plan-orchestrator-hetzner-migration.md: full migration record.
- REBOOTS.md / AGENTS.md: point the orchestrator host at Hetzner; first auto-logged
  reboot line.
- launch-orchestrator.sh: default session id -> the Hetzner orchestrator session.
- flake.lock: pin inputs.

Verified: nixos-rebuild switch applied; systemctl is-enabled cc-ci-loops.service =
enabled; ExecStartPre logged this boot to REBOOTS.md; loops healthy on phase 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
autonomic-bot
2026-05-31 03:54:17 +00:00
parent e89f384c24
commit 21e7a79f50
6 changed files with 148 additions and 9 deletions

@@ -32,8 +32,12 @@ this session). Include the current phase and the reboot count. Steps on startup:
 Reboot resilience is handled by **`cc-ci-loops.service`** (system unit): on boot it logs the reboot
 to `REBOOTS.md` (boot_id-gated) and runs `launch.sh start` with `RESUME_PHASE=1`, so the loops +
 watchdog auto-resume the saved phase. The orchestrator session itself is NOT auto-started — the
-operator reconnects to it (that's why the startup notification matters). The VM migration is
-complete; see `cc-ci-plan/plan-orchestrator-migration.md` (historical record).
+operator reconnects to it (that's why the startup notification matters). The orchestrator now runs on
+a **Hetzner `cpx22`** cloud server (`cc-ci-orchestrator-1`, tailnet `100.84.190.30`, public
+`168.119.126.100`, flake host `cc-ci-orchestrator-hetzner`) — see
+`cc-ci-plan/plan-orchestrator-hetzner-migration.md`. The earlier Pi→Incus-VM move is the historical
+`cc-ci-plan/plan-orchestrator-migration.md`. Rebuild this host with
+`nixos-rebuild switch --flake .#cc-ci-orchestrator-hetzner` from `/srv/cc-ci-orch`.
 ## Keep the orchestrator open, under remote-control

View File

@@ -1,8 +1,12 @@
 # Reboot log — cc-ci orchestrator VM
-**Note:** the orchestrator Pi (`raspberrypi`) was decommissioned 2026-05-31. All agents now run on
-the `cc-ci-orchestrator` NixOS VM (tailnet `100.116.55.106`). The three Pi reboot entries below are
-historical. Entries from 2026-05-31 onward are VM reboots.
+**Note:** the orchestrator Pi (`raspberrypi`) was decommissioned 2026-05-31. The agents then ran on
+the `cc-ci-orchestrator` Incus NixOS VM (tailnet `100.116.55.106`), and **as of 2026-05-31 run on a
+Hetzner `cpx22` cloud server** (`cc-ci-orchestrator-1`, tailnet `100.84.190.30`, public
+`168.119.126.100`, flake host `cc-ci-orchestrator-hetzner`, Hetzner server 134487234) — see
+`plan-orchestrator-hetzner-migration.md`. The three Pi reboot entries below are historical; the
+2026-05-30 entry is an Incus-VM reboot. Hetzner-host reboots are logged from now on (auto-logged once
+`cc-ci-loops.service` is enabled — wired at the Hetzner cutover).
 One line per genuine reboot of the orchestrator host, appended automatically by
 `reboot-log.sh` (ExecStartPre of `cc-ci-loops.service`, boot_id-gated so manual service restarts are
@@ -17,3 +21,4 @@ restarts the loops on boot. Count the lines below to see how often it's happenin
 manually relaunched at phase 2; this is what prompted adding `cc-ci-loops.service` +
 auto-logging. Auto-logging is live from the next reboot onward.
 - 2026-05-30 17:03:05 BST — reboot detected; loops auto-started by systemd (resuming phase index 6). boot_id=f565f752-0463-42db-b787-9e0db35a5e3f
+- 2026-05-31 03:38:29 UTC — reboot detected; loops auto-started by systemd (resuming phase index 5). boot_id=51c17fc3-8391-4109-bce2-413fbee6f26d

@@ -36,7 +36,7 @@ CLAUDE_FLAGS="${CLAUDE_FLAGS:---dangerously-skip-permissions}"
 REMOTE_CONTROL="${REMOTE_CONTROL:-1}"
 LOG_DIR="${LOG_DIR:-/srv/cc-ci/.cc-ci-logs}"
 ID_FILE="${ORCH_ID_FILE:-$LOG_DIR/.orchestrator-session-id}"
-DEFAULT_ID="34a80a99-b37e-4809-b8da-ccc9fafe785e" # the orchestrator session as of 2026-05-28
+DEFAULT_ID="c746050a-af11-409d-87ba-c05268e2e5d1" # the orchestrator session as of 2026-05-31 (Hetzner)
 # Startup nudge injected as the resumed session's first turn, so an AUTO-launched orchestrator (e.g.
 # cc-ci-loops.service ExecStartPost after a reboot) actually RUNS its AGENTS.md startup routine —
 # announce itself + report reboots — instead of resuming silently and waiting. Set empty to disable.

@@ -0,0 +1,80 @@
# Plan/record — migrate the ORCHESTRATOR off the Incus VM onto a Hetzner cloud server
**Status:** COMPLETE (2026-05-31). The orchestrator (Builder/Adversary loops + watchdog + this
supervising session) now runs on a dedicated **Hetzner** cloud server, declared by the
`cc-ci-orchestrator-hetzner` flake host. Kept as a historical record.
**Why:** the previous orchestrator host was the Incus VM `cc-ci-orchestrator` on b1
(`100.116.55.106`, 2 GB / 2 vCPU, see [[plan-orchestrator-migration]] — the earlier Pi→Incus move).
A dedicated Hetzner box gives dedicated vCPU + NVMe and decouples the orchestrator from b1's hardware.
This is the orchestrator analogue of the cc-ci **server** move in [[plan-migrate-cc-ci-to-hetzner]]
(that one moves the *CI server*; this one moves the *orchestrator that drives the loops*).
> **Note on naming:** this migration was carried out directly via `terraform/` + the
> `cc-ci-orchestrator-hetzner` flake host. It is **not** the same as `plan-migrate-cc-ci-to-hetzner.md`
> (the cc-ci CI server → Hetzner `cpx32`) nor `plan-orchestrator-migration.md` (Pi → Incus VM). All
> three are distinct moves; only this file records the orchestrator → Hetzner step.
---
## The new host (facts)
| | |
|---|---|
| Provider / type | **Hetzner Cloud `cpx22`** — AMD **2 vCPU / 4 GB**, dedicated vCPU, NVMe |
| Location | `nbg1` (cpx11/cpx21 are retired there — hence `cpx22`) |
| Hetzner server ID | **134487234** |
| Public IPv4 | **168.119.126.100** (IPv6 disabled) |
| Tailnet | **`cc-ci-orchestrator-1`** @ **100.84.190.30** (`taila4a0bf.ts.net`); joins via `/etc/ts-auth-key` |
| OS | `debian-12` image → **nixos-infect** → NixOS, converged by the flake |
| Flake host | **`nixosConfigurations.cc-ci-orchestrator-hetzner`** (`flake.nix` → `nix/hosts/cc-ci-orchestrator-hetzner/{configuration,hardware}.nix`) |
| Workspace | `/srv/cc-ci-orch` (this repo); `/srv/cc-ci` is a **symlink** to it. Loop clones: `/srv/cc-ci/cc-ci`, `/srv/cc-ci/cc-ci-adv` |
The login keys (root `authorizedKeys`) and swap (4 GB disk swap — 4 GB RAM is tight for 3+ claude
sessions) are declared in `configuration.nix`.
## How it was provisioned (reproducible)
The whole box is reproducible from `terraform/` + one `nixos-rebuild`:
1. **`terraform apply`** (`terraform/main.tf`): `hcloud_server` `cpx22` from `debian-12` in `nbg1`,
`user_data = user-data.sh` runs **nixos-infect** on first boot (Debian→NixOS, reboot).
2. **Stage 2** (`terraform/README.md`): SSH in, capture the nixos-infect hardware config
(→ `nix/hosts/cc-ci-orchestrator-hetzner/hardware.nix`), then converge:
```bash
# on the server, from the repo root (/srv/cc-ci-orch)
nixos-rebuild switch --flake .#cc-ci-orchestrator-hetzner
```
3. Stage credentials (not in git, placed once): `/etc/ts-auth-key` (tailnet join), the loops'
`~/.ssh/cc-ci-root-ed25519` + `.testenv`, and the sops master age key. `claude auth login`
(device code) is the one interactive step so the loops can run `--remote-control`.
4. Stage the workspace: clone this repo to `/srv/cc-ci-orch` (symlink `/srv/cc-ci`), the Builder /
Adversary clones, `cc-ci-secrets`, `references/`; copy `.cc-ci-logs/.phase-idx` (resume point).
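
The symlink layout from step 4 can be sketched as a small idempotent helper. This is a minimal sketch, not the repo's actual staging procedure: the function name `link_workspace` is hypothetical, and only the two paths (`/srv/cc-ci-orch`, `/srv/cc-ci`) and the `.cc-ci-logs/.phase-idx` resume file come from the steps above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): point a stable path at the
# staged workspace and report whether the resume point was copied over.
set -euo pipefail

link_workspace() {
  local workspace="$1" link="$2"
  if [[ ! -d "$workspace" ]]; then
    echo "workspace $workspace does not exist; clone the repo first" >&2
    return 1
  fi
  # -n replaces an existing symlink instead of descending into its target,
  # so re-running the helper is safe.
  ln -sfn "$workspace" "$link"
  if [[ -f "$link/.cc-ci-logs/.phase-idx" ]]; then
    echo "resume point present: $(cat "$link/.cc-ci-logs/.phase-idx")"
  else
    echo "no .phase-idx found under $link/.cc-ci-logs" >&2
  fi
}

# Example (as root on the new host):
#   link_workspace /srv/cc-ci-orch /srv/cc-ci
```
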
**Commit trail:** `0103f36` (terraform + flake host, initial `cpx11`) → `17951b8` (fix → `cpx22`,
add lock) → `c44b967` (real cpx22 hardware config from nixos-infect, server 134487234). Plus the
close-out commit below (root keys, drop tailscale `--ssh`, enable the loops service, this doc).
## Reboot-resilience (the point of running on a managed host)
`configuration.nix` declares **`systemd.services.cc-ci-loops`** — a oneshot that runs
`launch.sh start` with `RESUME_PHASE=1` after `network-online`/`tailscaled`, bringing the loops +
watchdog back on boot. It was authored **disabled** ("defined but NOT enabled until workspace is
staged") with `wantedBy` commented out. **Close-out (2026-05-31):** the workspace is staged and the
loops are running, so `wantedBy = [ "multi-user.target" ]` was uncommented and `nixos-rebuild switch`
re-run → `systemctl is-enabled cc-ci-loops.service` = **enabled**. A reboot is now a non-event:
systemd resumes the saved phase. (`reboot-log.sh`, the ExecStartPre, appends to
[[REBOOTS.md]] boot_id-gated.)
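
The boot_id gating mentioned above can be illustrated with a short sketch. This is not the repo's `reboot-log.sh`, whose contents are not shown in this commit; the function name `log_reboot_once`, the state file, and the simplified log-line format are assumptions. The one fixed piece is `/proc/sys/kernel/random/boot_id`, which the Linux kernel regenerates on every boot.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of boot_id-gated logging; not the repo's reboot-log.sh.
set -euo pipefail

log_reboot_once() {
  # $1: file holding the current boot ID (the kernel's, by default)
  # $2: state file remembering the last boot ID that was logged (assumed)
  # $3: log file to append to (REBOOTS.md in the real setup)
  local boot_id_file="${1:-/proc/sys/kernel/random/boot_id}"
  local state_file="$2" log_file="$3"
  local boot_id last=""
  boot_id="$(cat "$boot_id_file")"
  if [[ -f "$state_file" ]]; then
    last="$(cat "$state_file")"
  fi
  if [[ "$boot_id" == "$last" ]]; then
    return 0  # same boot: this is a manual service restart, so log nothing
  fi
  printf -- '- %s reboot detected; boot_id=%s\n' \
    "$(date -u '+%Y-%m-%d %H:%M:%S UTC')" "$boot_id" >> "$log_file"
  printf '%s\n' "$boot_id" > "$state_file"
}
```

Because the stored ID only changes when the kernel boots, restarting the service by hand on a running host appends nothing.
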
> **Caveat seen at first boot on this host:** the loops were initially started *by hand* during
> staging (not by the service), so the first boot did NOT log to `REBOOTS.md` and the service showed
> `linked`/not-enabled. Enabling `wantedBy` (above) is what wires the automatic path.
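
The `linked` versus `enabled` distinction in the caveat can be checked after each rebuild. A minimal sketch, assuming only that `systemctl is-enabled` prints the unit-file state; the function name `verify_unit_enabled` and the `SYSTEMCTL` override (present so the check can be exercised without systemd) are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical post-rebuild check; not part of the repo.
set -euo pipefail

verify_unit_enabled() {
  local unit="${1:-cc-ci-loops.service}"
  local systemctl_cmd="${SYSTEMCTL:-systemctl}"
  local state
  # is-enabled exits non-zero for states such as "linked" or "disabled",
  # so capture the printed state and ignore the exit code.
  state="$("$systemctl_cmd" is-enabled "$unit" 2>/dev/null || true)"
  if [[ "$state" == "enabled" ]]; then
    echo "ok: $unit is enabled"
  else
    echo "FAIL: $unit is '${state:-unknown}' (expected enabled)" >&2
    return 1
  fi
}

# Example, after `nixos-rebuild switch`:
#   verify_unit_enabled cc-ci-loops.service
```
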
## Status of the migration
- ✅ Hetzner `cpx22` provisioned + converged from the flake (terraform + nixos-infect + one rebuild).
- ✅ On the tailnet (`cc-ci-orchestrator-1`) and ssh-able on the public IP.
- ✅ Loops + Adversary + watchdog running; phase sequence auto-advancing (watchdog on per-phase `## DONE`).
- ✅ `cc-ci-loops.service` **enabled** → reboot-resilient.
- ◻︎ Old Incus orchestrator VM (`100.116.55.106`) — keep as cold standby a few days, then delete.
- ◻︎ Rotate the tailnet name once the old `cc-ci-orchestrator` peer is gone (this box is `…-1`).

flake.lock generated Normal file
@@ -0,0 +1,49 @@
{
"nodes": {
"nixpkgs": {
"locked": {
"lastModified": 1751274312,
"narHash": "sha256-/bVBlRpECLVzjV19t5KMdMFWSwKLtb5RyXdjz3LJT+g=",
"owner": "NixOS",
"repo": "nixpkgs",
"rev": "50ab793786d9de88ee30ec4e4c24fb4236fc2674",
"type": "github"
},
"original": {
"owner": "NixOS",
"repo": "nixpkgs",
"rev": "50ab793786d9de88ee30ec4e4c24fb4236fc2674",
"type": "github"
}
},
"root": {
"inputs": {
"nixpkgs": "nixpkgs",
"sops-nix": "sops-nix"
}
},
"sops-nix": {
"inputs": {
"nixpkgs": [
"nixpkgs"
]
},
"locked": {
"lastModified": 1750119275,
"narHash": "sha256-Rr7Pooz9zQbhdVxux16h7URa6mA80Pb/G07T4lHvh0M=",
"owner": "Mic92",
"repo": "sops-nix",
"rev": "77c423a03b9b2b79709ea2cb63336312e78b72e2",
"type": "github"
},
"original": {
"owner": "Mic92",
"repo": "sops-nix",
"rev": "77c423a03b9b2b79709ea2cb63336312e78b72e2",
"type": "github"
}
}
},
"root": "root",
"version": 7
}

@@ -114,17 +114,18 @@ SSHCFG
 '';
 };
-# cc-ci-loops supervisor — defined but NOT enabled until workspace is staged.
-# Enable by adding wantedBy after staging (Stage 2e) for reboot-resilience.
+# cc-ci-loops supervisor — workspace staged 2026-05-31, so ENABLED for reboot-resilience.
 systemd.services.cc-ci-loops = {
 description = "cc-ci Builder/Adversary loops + watchdog (launch.sh start)";
-# wantedBy = [ "multi-user.target" ]; # uncomment after workspace is staged
+wantedBy = [ "multi-user.target" ]; # enabled after workspace staged (Hetzner cutover)
 after = [ "network-online.target" "tailscaled.service" "claude-install.service" ];
 wants = [ "network-online.target" ];
 serviceConfig = {
 Type = "oneshot"; RemainAfterExit = true;
 User = "loops"; Group = "users";
 WorkingDirectory = "/srv/cc-ci";
+# Append one line to REBOOTS.md per genuine reboot (boot_id-gated; not on manual restart).
+ExecStartPre = "${pkgs.bash}/bin/bash /srv/cc-ci/cc-ci-plan/reboot-log.sh";
 };
 environment = { RESUME_PHASE = "1"; HOME = "/home/loops"; };
 path = [ pkgs.bash pkgs.tmux pkgs.git pkgs.python3 pkgs.openssh pkgs.nettools ];