orchestrator-hetzner: enable reboot-resilience + record migration

Now that the workspace is staged on the Hetzner cpx22 (server 134487234, public 168.119.126.100,
tailnet cc-ci-orchestrator-1 @ 100.84.190.30):

- configuration.nix: enable cc-ci-loops.service (wantedBy multi-user.target) so the loops +
  watchdog auto-resume on boot; wire reboot-log.sh as ExecStartPre so reboots auto-log to
  REBOOTS.md (boot_id-gated).
- plan-orchestrator-hetzner-migration.md: full migration record.
- REBOOTS.md / AGENTS.md: point the orchestrator host at Hetzner; first auto-logged reboot line.
- launch-orchestrator.sh: default session id -> the Hetzner orchestrator session.
- flake.lock: pin inputs.

Verified: nixos-rebuild switch applied; systemctl is-enabled cc-ci-loops.service = enabled;
ExecStartPre logged this boot to REBOOTS.md; loops healthy on phase 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 03:54:17 +00:00
parent e89f384c24
commit 21e7a79f50
6 changed files with 148 additions and 9 deletions
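The two checks listed under "Verified" in the commit message can be repeated by hand. The following is a minimal sketch, not a script from this repo: the function names are hypothetical, and the `REBOOTS.md` path is assumed from the `/srv/cc-ci` symlink and the `cc-ci-plan/` directory that appear in the diff.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repo) that re-check what the commit message verifies.

UNIT="${UNIT:-cc-ci-loops.service}"                            # unit name taken from the commit
REBOOT_LOG="${REBOOT_LOG:-/srv/cc-ci/cc-ci-plan/REBOOTS.md}"   # assumed path, next to reboot-log.sh

# Succeeds only if systemd reports the unit as exactly "enabled"
# (a unit without an install symlink may report e.g. "linked" or "disabled").
unit_is_enabled() {
  [ "$(systemctl is-enabled "$1" 2>/dev/null)" = "enabled" ]
}

# Succeeds only if the log file contains an entry tagged with the given boot_id.
boot_is_logged() {
  grep -qF "boot_id=$1" "$2"
}
```

On the host, one possible invocation is `unit_is_enabled "$UNIT" && boot_is_logged "$(cat /proc/sys/kernel/random/boot_id)" "$REBOOT_LOG"`, which exits 0 only when the service is enabled and the current boot already has a line in the log.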
--- a/AGENTS.md
+++ b/AGENTS.md
@ -32,8 +32,12 @@ this session). Include the current phase and the reboot count. Steps on startup:
 Reboot resilience is handled by **`cc-ci-loops.service`** (system unit): on boot it logs the reboot
 to `REBOOTS.md` (boot_id-gated) and runs `launch.sh start` with `RESUME_PHASE=1`, so the loops +
 watchdog auto-resume the saved phase. The orchestrator session itself is NOT auto-started — the
-operator reconnects to it (that's why the startup notification matters). The VM migration is
+operator reconnects to it (that's why the startup notification matters). The orchestrator now runs on
-complete; see `cc-ci-plan/plan-orchestrator-migration.md` (historical record).
+a **Hetzner `cpx22`** cloud server (`cc-ci-orchestrator-1`, tailnet `100.84.190.30`, public
 `168.119.126.100`, flake host `cc-ci-orchestrator-hetzner`) — see
 `cc-ci-plan/plan-orchestrator-hetzner-migration.md`. The earlier Pi→Incus-VM move is the historical
 `cc-ci-plan/plan-orchestrator-migration.md`. Rebuild this host with
 `nixos-rebuild switch --flake .#cc-ci-orchestrator-hetzner` from `/srv/cc-ci-orch`.
 ## Keep the orchestrator open, under remote-control
--- a/cc-ci-plan/REBOOTS.md
+++ b/cc-ci-plan/REBOOTS.md
@ -1,8 +1,12 @@
 # Reboot log — cc-ci orchestrator VM
-**Note:** the orchestrator Pi (`raspberrypi`) was decommissioned 2026-05-31. All agents now run on
+**Note:** the orchestrator Pi (`raspberrypi`) was decommissioned 2026-05-31. The agents then ran on
-the `cc-ci-orchestrator` NixOS VM (tailnet `100.116.55.106`). The three Pi reboot entries below are
+the `cc-ci-orchestrator` Incus NixOS VM (tailnet `100.116.55.106`), and **as of 2026-05-31 run on a
-historical. Entries from 2026-05-31 onward are VM reboots.
+Hetzner `cpx22` cloud server** (`cc-ci-orchestrator-1`, tailnet `100.84.190.30`, public
 `168.119.126.100`, flake host `cc-ci-orchestrator-hetzner`, Hetzner server 134487234) — see
 `plan-orchestrator-hetzner-migration.md`. The three Pi reboot entries below are historical; the
 2026-05-30 entry is an Incus-VM reboot. Hetzner-host reboots are logged from now on (auto-logged once
 `cc-ci-loops.service` is enabled — wired at the Hetzner cutover).
 One line per genuine reboot of the orchestrator host, appended automatically by
 `reboot-log.sh` (ExecStartPre of `cc-ci-loops.service`, boot_id-gated so manual service restarts are
@ -17,3 +21,4 @@ restarts the loops on boot. Count the lines below to see how often it's happenin
  manually relaunched at phase 2; this is what prompted adding `cc-ci-loops.service` +
  auto-logging. Auto-logging is live from the next reboot onward.
 - 2026-05-30 17:03:05 BST — reboot detected; loops auto-started by systemd (resuming phase index 6). boot_id=f565f752-0463-42db-b787-9e0db35a5e3f
 - 2026-05-31 03:38:29 UTC — reboot detected; loops auto-started by systemd (resuming phase index 5). boot_id=51c17fc3-8391-4109-bce2-413fbee6f26d
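The "boot_id-gated" behaviour described for `reboot-log.sh` can be illustrated as follows. The real script is not shown in this diff, so this is only a sketch of one way to get that behaviour: the function name, the state file, and the exact log line format are assumptions. It relies on the kernel's per-boot UUID in `/proc/sys/kernel/random/boot_id`, which changes on every boot and stays constant across service restarts.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: the real reboot-log.sh is not shown here.

# log_reboot_once LOG_FILE STATE_FILE [BOOT_ID_FILE]
# Appends one line to LOG_FILE the first time it runs in a given boot. Later runs in the
# same boot (for example a manual `systemctl restart`) see the same boot_id and do nothing.
log_reboot_once() {
  local log_file="$1" state_file="$2"
  local boot_id_file="${3:-/proc/sys/kernel/random/boot_id}"  # kernel's per-boot UUID
  local current last=""

  current="$(cat "$boot_id_file")"
  if [ -f "$state_file" ]; then
    last="$(cat "$state_file")"          # boot_id recorded by the previous run, if any
  fi

  if [ "$current" = "$last" ]; then
    return 0                             # same boot: already logged
  fi

  # Hypothetical line format, loosely modelled on the entries in REBOOTS.md.
  printf -- '- %s reboot detected; boot_id=%s\n' \
    "$(date -u '+%Y-%m-%d %H:%M:%S UTC')" "$current" >> "$log_file"
  printf '%s\n' "$current" > "$state_file"
}
```

A state file is used here for simplicity. Searching the log itself for the current `boot_id=` would avoid the extra file, at the cost of coupling the gate to the log format.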
--- a/cc-ci-plan/launch-orchestrator.sh
+++ b/cc-ci-plan/launch-orchestrator.sh
@ -36,7 +36,7 @@ CLAUDE_FLAGS="${CLAUDE_FLAGS:---dangerously-skip-permissions}"
 REMOTE_CONTROL="${REMOTE_CONTROL:-1}"
 LOG_DIR="${LOG_DIR:-/srv/cc-ci/.cc-ci-logs}"
 ID_FILE="${ORCH_ID_FILE:-$LOG_DIR/.orchestrator-session-id}"
-DEFAULT_ID="34a80a99-b37e-4809-b8da-ccc9fafe785e"    # the orchestrator session as of 2026-05-28
+DEFAULT_ID="c746050a-af11-409d-87ba-c05268e2e5d1"    # the orchestrator session as of 2026-05-31 (Hetzner)
 # Startup nudge injected as the resumed session's first turn, so an AUTO-launched orchestrator (e.g.
 # cc-ci-loops.service ExecStartPost after a reboot) actually RUNS its AGENTS.md startup routine —
 # announce itself + report reboots — instead of resuming silently and waiting. Set empty to disable.
--- a/cc-ci-plan/plan-orchestrator-hetzner-migration.md
+++ b/cc-ci-plan/plan-orchestrator-hetzner-migration.md
@ -0,0 +1,80 @@
 # Plan/record — migrate the ORCHESTRATOR off the Incus VM onto a Hetzner cloud server
 **Status:** COMPLETE (2026-05-31). The orchestrator (Builder/Adversary loops + watchdog + this
 supervising session) now runs on a dedicated **Hetzner** cloud server, declared by the
 `cc-ci-orchestrator-hetzner` flake host. Kept as a historical record.
 **Why:** the previous orchestrator host was the Incus VM `cc-ci-orchestrator` on b1
 (`100.116.55.106`, 2 GB / 2 vCPU, see [[plan-orchestrator-migration]] — the earlier Pi→Incus move).
 A dedicated Hetzner box gives dedicated vCPU + NVMe and decouples the orchestrator from b1's hardware.
 This is the orchestrator analogue of the cc-ci **server** move in [[plan-migrate-cc-ci-to-hetzner]]
 (that one moves the *CI server*; this one moves the *orchestrator that drives the loops*).
 > **Note on naming:** this migration was carried out directly via `terraform/` + the
 > `cc-ci-orchestrator-hetzner` flake host. It is **not** the same as `plan-migrate-cc-ci-to-hetzner.md`
 > (the cc-ci CI server → Hetzner `cpx32`) nor `plan-orchestrator-migration.md` (Pi → Incus VM). All
 > three are distinct moves; only this file records the orchestrator → Hetzner step.
 ---
 ## The new host (facts)
 | | |
 |---|---|
 | Provider / type | **Hetzner Cloud `cpx22`** — AMD **2 vCPU / 4 GB**, dedicated vCPU, NVMe |
 | Location | `nbg1` (cpx11/cpx21 are retired there — hence `cpx22`) |
 | Hetzner server ID | **134487234** |
 | Public IPv4 | **168.119.126.100** (IPv6 disabled) |
 | Tailnet | **`cc-ci-orchestrator-1`** @ **100.84.190.30** (`taila4a0bf.ts.net`); joins via `/etc/ts-auth-key` |
 | OS | `debian-12` image → **nixos-infect** → NixOS, converged by the flake |
 | Flake host | **`nixosConfigurations.cc-ci-orchestrator-hetzner`** (`flake.nix` → `nix/hosts/cc-ci-orchestrator-hetzner/{configuration,hardware}.nix`) |
 | Workspace | `/srv/cc-ci-orch` (this repo); `/srv/cc-ci` is a **symlink** to it. Loop clones: `/srv/cc-ci/cc-ci`, `/srv/cc-ci/cc-ci-adv` |
 The login keys (root `authorizedKeys`) and swap (4 GB disk swap — 4 GB RAM is tight for 3+ claude
 sessions) are declared in `configuration.nix`.
 ## How it was provisioned (reproducible)
 The whole box is reproducible from `terraform/` + one `nixos-rebuild`:
 1. **`terraform apply`** (`terraform/main.tf`): `hcloud_server` `cpx22` from `debian-12` in `nbg1`,
   `user_data = user-data.sh` runs **nixos-infect** on first boot (Debian→NixOS, reboot).
 2. **Stage 2** (`terraform/README.md`): SSH in, capture the nixos-infect hardware config
   (→ `nix/hosts/cc-ci-orchestrator-hetzner/hardware.nix`), then converge:
   ```bash
   # on the server, from the repo root (/srv/cc-ci-orch)
   nixos-rebuild switch --flake .#cc-ci-orchestrator-hetzner
   ```
 3. Stage credentials (not in git, placed once): `/etc/ts-auth-key` (tailnet join), the loops'
   `~/.ssh/cc-ci-root-ed25519` + `.testenv`, and the sops master age key. `claude auth login`
   (device code) is the one interactive step so the loops can run `--remote-control`.
 4. Stage the workspace: clone this repo to `/srv/cc-ci-orch` (symlink `/srv/cc-ci`), the Builder /
   Adversary clones, `cc-ci-secrets`, `references/`; copy `.cc-ci-logs/.phase-idx` (resume point).
 **Commit trail:** `0103f36` (terraform + flake host, initial `cpx11`) → `17951b8` (fix → `cpx22`,
 add lock) → `c44b967` (real cpx22 hardware config from nixos-infect, server 134487234). Plus the
 close-out commit below (root keys, drop tailscale `--ssh`, enable the loops service, this doc).
 ## Reboot-resilience (the point of running on a managed host)
 `configuration.nix` declares **`systemd.services.cc-ci-loops`** — a oneshot that runs
 `launch.sh start` with `RESUME_PHASE=1` after `network-online`/`tailscaled`, bringing the loops +
 watchdog back on boot. It was authored **disabled** ("defined but NOT enabled until workspace is
 staged") with `wantedBy` commented out. **Close-out (2026-05-31):** the workspace is staged and the
 loops are running, so `wantedBy = [ "multi-user.target" ]` was uncommented and `nixos-rebuild switch`
 re-run → `systemctl is-enabled cc-ci-loops.service` = **enabled**. A reboot is now a non-event:
 systemd resumes the saved phase. (`reboot-log.sh`, the ExecStartPre, appends to
 [[REBOOTS.md]] boot_id-gated.)
 > **Caveat seen at first boot on this host:** the loops were initially started *by hand* during
 > staging (not by the service), so the first boot did NOT log to `REBOOTS.md` and the service showed
 > `linked`/not-enabled. Enabling `wantedBy` (above) is what wires the automatic path.
 ## Status of the migration
 - ✅ Hetzner `cpx22` provisioned + converged from the flake (terraform + nixos-infect + one rebuild).
 - ✅ On the tailnet (`cc-ci-orchestrator-1`) and ssh-able on the public IP.
 - ✅ Builder + Adversary loops + watchdog running; phase sequence auto-advancing (the watchdog advances on each phase's `## DONE`).
 - ✅ `cc-ci-loops.service` **enabled** → reboot-resilient.
 - ◻︎ Old Incus orchestrator VM (`100.116.55.106`) — keep as cold standby a few days, then delete.
 - ◻︎ Rotate the tailnet name once the old `cc-ci-orchestrator` peer is gone (this box is `…-1`).
--- a/flake.lock
+++ b/flake.lock
@ -0,0 +1,49 @@
 {
  "nodes": {
    "nixpkgs": {
      "locked": {
        "lastModified": 1751274312,
        "narHash": "sha256-/bVBlRpECLVzjV19t5KMdMFWSwKLtb5RyXdjz3LJT+g=",
        "owner": "NixOS",
        "repo": "nixpkgs",
        "rev": "50ab793786d9de88ee30ec4e4c24fb4236fc2674",
        "type": "github"
      },
      "original": {
        "owner": "NixOS",
        "repo": "nixpkgs",
        "rev": "50ab793786d9de88ee30ec4e4c24fb4236fc2674",
        "type": "github"
      }
    },
    "root": {
      "inputs": {
        "nixpkgs": "nixpkgs",
        "sops-nix": "sops-nix"
      }
    },
    "sops-nix": {
      "inputs": {
        "nixpkgs": [
          "nixpkgs"
        ]
      },
      "locked": {
        "lastModified": 1750119275,
        "narHash": "sha256-Rr7Pooz9zQbhdVxux16h7URa6mA80Pb/G07T4lHvh0M=",
        "owner": "Mic92",
        "repo": "sops-nix",
        "rev": "77c423a03b9b2b79709ea2cb63336312e78b72e2",
        "type": "github"
      },
      "original": {
        "owner": "Mic92",
        "repo": "sops-nix",
        "rev": "77c423a03b9b2b79709ea2cb63336312e78b72e2",
        "type": "github"
      }
    }
  },
  "root": "root",
  "version": 7
 }
--- a/nix/hosts/cc-ci-orchestrator-hetzner/configuration.nix
+++ b/nix/hosts/cc-ci-orchestrator-hetzner/configuration.nix
@ -114,17 +114,18 @@ SSHCFG
    '';
  };
-  # cc-ci-loops supervisor — defined but NOT enabled until workspace is staged.
+  # cc-ci-loops supervisor — workspace staged 2026-05-31, so ENABLED for reboot-resilience.
  # Enable by adding wantedBy after staging (Stage 2e) for reboot-resilience.
  systemd.services.cc-ci-loops = {
    description = "cc-ci Builder/Adversary loops + watchdog (launch.sh start)";
-    # wantedBy = [ "multi-user.target" ];  # uncomment after workspace is staged
+    wantedBy = [ "multi-user.target" ];  # enabled after workspace staged (Hetzner cutover)
    after = [ "network-online.target" "tailscaled.service" "claude-install.service" ];
    wants = [ "network-online.target" ];
    serviceConfig = {
      Type = "oneshot"; RemainAfterExit = true;
      User = "loops"; Group = "users";
      WorkingDirectory = "/srv/cc-ci";
      # Append one line to REBOOTS.md per genuine reboot (boot_id-gated; not on manual restart).
      ExecStartPre = "${pkgs.bash}/bin/bash /srv/cc-ci/cc-ci-plan/reboot-log.sh";
    };
    environment = { RESUME_PHASE = "1"; HOME = "/home/loops"; };
    path = [ pkgs.bash pkgs.tmux pkgs.git pkgs.python3 pkgs.openssh pkgs.nettools ];