cc-ci-orchestrator/cc-ci-plan/plan-orchestrator-migration.md

# Plan — migrate the orchestrator off the Pi onto a dedicated NixOS Incus VM

**Goal:** move everything that drives the cc-ci loops (the Builder/Adversary loops, the watchdog,
the SOCKS proxy, the orchestrator session itself) off the Raspberry Pi and onto a new, dedicated,
**reboot-resilient NixOS VM** on b1 — declared in a new git repo **`cc-ci-orchestrator`**. Finish by
relocating this orchestrator session there too.

**Why:** the Pi has rebooted twice today, each time silently killing the tmux loops + watchdog
(they don't survive reboot, nothing auto-restarts them). A NixOS VM lets us declare the whole rig
(claude CLI, proxy, loop supervisor) as systemd services that come back on boot — turning a reboot
into a non-event. It also consolidates the orchestrator next to the infra it manages.

**Status:** COMPLETE (2026-05-31). All agents run on the VM; Pi fully decommissioned. Kept as a historical record.

**Phase A ✅ COMPLETE (2026-05-30):** VM `cc-ci-orchestrator` (**2 GB / 2 vCPU / 30 GB**,
`incus-base-vm`, NixOS 24.11) created via the Incus API + booted; **on the tailnet at
`100.116.55.106`**; **ssh works** (`ssh cc-ci-orchestrator` through the :1055 proxy — `cc-ci-root`
pubkey added via `incus exec`). Reproducible Terraform record at
`incus-terraform-nix-vm-creator/projects/cc-ci-orchestrator/` (note: this instance was created through
the API rather than by Terraform, so expect Terraform drift; see PROVENANCE.txt).
- **TS-key finding:** the VM-creator's `.test.env` reusable key is **REVOKED** ("API key does not
  exist"). The **`/srv/cc-ci/.testenv` `TS_AUTH_KEY` is valid** — used it to join, and persisted it into
  the VM's `/etc/ts-auth-key`. So the plan's "operator provides a fresh TS key" item is **resolved** (no
  new key needed); housekeeping: revoke/rotate the dead key in `.test.env`.
- **Sizing watch:** 2 GB ≈ 1.7 GiB usable; fine idle (284 MiB) but tight for 3 concurrent claude
  sessions (Pi OOM lesson). Phase B will declare a **swapfile**; bump to 4 GB pre-cutover if needed.
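
One way to watch the sizing risk is to compare available memory plus free swap against a per-session budget. The sketch below is illustrative only: the function names are hypothetical, and the 512 MiB per-session default is an assumption, not a figure measured on the Pi or the VM.

```shell
# mem_headroom_mib [FILE]: available memory plus free swap, in MiB.
# FILE is in /proc/meminfo format and defaults to /proc/meminfo.
mem_headroom_mib() {
  local file="${1:-/proc/meminfo}"
  # MemAvailable and SwapFree are reported in kB; sum them and convert to MiB.
  awk '/^MemAvailable:|^SwapFree:/ { kb += $2 } END { printf "%d\n", kb / 1024 }' "$file"
}

# enough_for_sessions SESSIONS [PER_SESSION_MIB] [FILE]
# Succeeds if the headroom covers SESSIONS concurrent sessions.
# The 512 MiB default budget is a placeholder assumption.
enough_for_sessions() {
  local sessions="$1" per_session_mib="${2:-512}" file="${3:-/proc/meminfo}"
  local have need
  have="$(mem_headroom_mib "$file")"
  need=$(( sessions * per_session_mib ))
  [ "$have" -ge "$need" ]
}
```

For example, `enough_for_sessions 3 || echo "consider the 4 GB bump"` would flag a shortfall before cutover, given a realistic per-session figure.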

**Next — Phase B:** the `cc-ci-orchestrator` NixOS-config git repo (SOCKS proxy + loop-supervisor boot
service + claude CLI + sops secrets). Then C (stage workspace), claude auth (operator), D/E (cutover).

---

## 0. Current footprint (what has to move)

On the Pi (`raspberrypi`, aarch64), workspace `/srv/cc-ci` (itself the
`cc-ci-orchestrator` git repo — formerly `cc-ci-autonomous-orchestrator`):

| Item | What | Move strategy |
|---|---|---|
| `cc-ci-plan/` | loop code: `launch.sh`, `plan*.md`, `prompts/`, `kickoff.md` | in git (this repo) → clone on VM |
| `cc-ci/`, `cc-ci-adv/` | Builder + Adversary working clones (~13M each) | **re-clone from git.autonomic.zone** on the VM (cleaner than copying) |
| `.cc-ci-logs/` | watchdog/loop logs + `.phase-idx` | copy `.phase-idx` (the resume point); logs start fresh |
| `cc-ci-secrets/` | sops-encrypted secrets repo | in git → clone |
| `references/` | recipe-maintainer corpus (read-only parity source) | clone/rsync from `/srv/recipe-maintainer` |
| **`.testenv`** | TS auth key, Gitea bot creds | **out-of-band copy** (gitignored, never in git) |
| **`~/.ssh/cc-ci-root-ed25519`** | root SSH key to cc-ci | **out-of-band copy** |
| **`.sops/master-age.txt`** | master recovery age key | **out-of-band copy** |
| **Incus mTLS certs** (`/srv/incus-terraform-nix-vm-creator/terraform-secrets/`) | `terraform.{crt,key}`, `vm_ssh_key` | **out-of-band copy** — so the VM can itself manage VMs |
| `cc-ci-tailscaled.service` | userspace SOCKS proxy :1055 | **re-declare as NixOS** (see §2) |
| **claude CLI + auth** | `~/.local/bin/claude` v2.1.154 + `~/.claude.json` | install on VM + **operator `claude auth login`** (§3 Phase C, §5) |
| this orchestrator session | the supervising claude conversation | **operator-assisted cutover** (§3 Phase E) |

Two hard human-in-the-loop steps, called out explicitly: **claude auth on the new VM** (device-code
login, can't be scripted) and the **final session cutover** (the operator connects to the new
orchestrator session). Everything else I can do.

## 1. Target VM spec

- **Host/API:** b1 Incus, `https://100.117.251.31:8443`, project `terraform-ci`, mTLS certs (have).
- **Name:** `cc-ci-orchestrator` (tailnet hostname too).
- **Resources:** **2 GB RAM, 2 vCPU, 30 GB disk**. The `dir` storage backend needs a reboot to resize,
  so the disk is sized at create time to avoid growing it later. b1 has ample headroom (only
  cc-nix-test @8GB running).
- **Image:** the existing imported NixOS base VM image (`incus-base-vm`) — already ships tailscale,
  openssh, git/jq/curl, flakes, cloud-init.
- **Tailnet:** joins via a fresh `TS_AUTH_KEY` (operator provides, or reuse the existing reusable key in
  `terraform-secrets/.test.env`). MagicDNS name `cc-ci-orchestrator.taila4a0bf.ts.net`.
- **Bootstrap:** cloud-init writes the `cc-ci-orchestrator` flake config, then runs `nixos-rebuild switch`.
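
For reference, the spec above maps onto a single CLI call. This is a rough sketch, not how the VM was actually created (that went through the Incus API). It assumes an `incus` client whose current remote is b1 and that `incus-base-vm` resolves as an image alias there. The function name is hypothetical, and the cloud-init and Tailscale key wiring is deliberately left out.

```shell
# create_orchestrator_vm: launch the VM with the sizes from the spec.
# Assumptions: the current incus remote is b1, and `incus-base-vm` is an image alias.
create_orchestrator_vm() {
  incus launch incus-base-vm cc-ci-orchestrator \
    --vm \
    --project terraform-ci \
    -c limits.cpu=2 \
    -c limits.memory=2GB \
    -d root,size=30GB
}
```

Setting the root disk size with `-d root,size=30GB` at launch matches the intent to size the disk once rather than grow it later.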

## 2. The new `cc-ci-orchestrator` git repo (NixOS config)

A new **private** repo on `git.autonomic.zone/recipe-maintainers/cc-ci-orchestrator` (bot is org
admin). It is the NixOS config for this VM — the orchestrator's equivalent of what `cc-ci` is for the
test server. Contents:

- `flake.nix` + `hosts/cc-ci-orchestrator/configuration.nix` — the VM's NixOS config.
- **Packages:** `claude-code` (CLI), `git`, `tmux`, `python3`, `jq`, `openssh`, `nodejs` (claude
  runtime), `coreutils`, `netcat-openbsd` (`nc` for the proxy ProxyCommand; `nettools` does not ship `nc`).
- **`services.cc-ci-tailscaled`** — the userspace tailscaled SOCKS proxy on :1055, as a NixOS
  systemd service (port to NixOS from the Pi's `cc-ci-tailscaled.service`). This is the path to b1 +
  cc-ci.
- **`services.cc-ci-orchestrator`** — a systemd service that runs `launch.sh start` with
  `RESUME_PHASE=1` **on boot** (after the proxy + network are up), as the workspace user. **This is
  the reboot-resilience fix** — the loops + watchdog come back automatically after any reboot.
- **Secrets via sops-nix** (like cc-ci): the out-of-band secrets (`.testenv`, ssh key, incus certs)
  are sops-encrypted into the repo, decrypted at activation to their runtime paths. The **master age
  key** is the one irreducible out-of-band bootstrap secret placed on the VM once.
- A declared `~/.ssh/config` entry for `cc-ci` (user `root`, `ProxyCommand` via the :1055 proxy).
- **Excluded from git:** claude's own auth (`~/.claude.json`). That is per-user login state, set up
  once interactively (§3, Phase C), not committed.
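
Ordering the loop service after the proxy unit does not necessarily mean port 1055 is already accepting connections when `launch.sh` runs. A small wrapper that polls the port first is one way to close that gap. The sketch below is an assumption about how such a wrapper could look. The function names and the workspace-path argument are hypothetical, and it relies on an `nc` that supports `-z`.

```shell
# wait_for_port HOST PORT [TRIES]: poll once per second until the TCP port accepts connections.
wait_for_port() {
  local host="$1" port="$2" tries="${3:-30}"
  while [ "$tries" -gt 0 ]; do
    if nc -z "$host" "$port" 2>/dev/null; then
      return 0
    fi
    tries=$(( tries - 1 ))
    sleep 1
  done
  return 1
}

# start_loops WORKSPACE: start the loops in resume mode once the SOCKS proxy answers.
# WORKSPACE is the checkout containing cc-ci-plan/launch.sh (its path on the VM is an assumption).
start_loops() {
  local workspace="$1"
  wait_for_port 127.0.0.1 1055 || { echo "proxy on :1055 not reachable" >&2; return 1; }
  ( cd "$workspace/cc-ci-plan" && RESUME_PHASE=1 ./launch.sh start )
}
```

The systemd service would then call something like `start_loops /srv/cc-ci` instead of invoking `launch.sh` directly.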

## 3. Execution phases

### Phase A — provision the VM (reversible; safe to do while Pi loops keep running)
1. Create `cc-ci-orchestrator` VM via the Incus API (2 GB / 2 vCPU / 30 GB, NixOS base image, TS auth
   key in cloud-init). Wait for tailnet join + ssh.
2. Verify: `ssh` in, `tailscale status`, `nixos-rebuild` available, can reach b1 API + cc-ci through
   its own proxy once configured.
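
The Phase A checks can be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the ssh alias `cc-ci-orchestrator` is taken to resolve as it does from the Pi, and `--insecure` assumes the Incus API presents a certificate that curl does not trust by default.

```shell
# verify_vm HOST: run the basic Phase A checks over ssh and print the first failure.
verify_vm() {
  local host="$1"
  ssh "$host" true || { echo "FAIL ssh"; return 1; }
  ssh "$host" tailscale status >/dev/null || { echo "FAIL tailscale"; return 1; }
  ssh "$host" command -v nixos-rebuild >/dev/null || { echo "FAIL nixos-rebuild"; return 1; }
  echo "OK $host"
}

# b1_api_reachable CERT KEY: query the Incus API root through the local SOCKS proxy.
# Run this on the VM once its own :1055 proxy is configured.
b1_api_reachable() {
  curl --silent --fail --insecure \
    --socks5-hostname 127.0.0.1:1055 \
    --cert "$1" --key "$2" \
    https://100.117.251.31:8443/1.0 >/dev/null
}
```

Typical use would be `verify_vm cc-ci-orchestrator` from the Pi, then `b1_api_reachable terraform.crt terraform.key` on the VM itself.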

### Phase B — author + apply the `cc-ci-orchestrator` repo
3. Create the private git repo; author the flake/config (§2); commit/push.
4. Place the master age key on the VM; sops-encrypt the out-of-band secrets into the repo.
5. `nixos-rebuild switch` on the VM → proxy service up, packages present, services defined (loop
   supervisor **not yet started** — or started in a dry mode).

### Phase C — stage the workspace (no cutover yet)
6. On the VM: clone `cc-ci-orchestrator` (the loop code), clone the Builder/Adversary
   working repos fresh from git.autonomic.zone, clone `cc-ci-secrets`, rsync `references/`.
7. Copy `.phase-idx` (resume point = phase 2) so the VM watchdog resumes the right phase.
8. **Operator step:** `claude auth login` on the VM (device code) so the loops can run
   `--remote-control --dangerously-skip-permissions`. Verify with a throwaway interactive claude.

### Phase D — cutover (the only disruptive moment; pick a clean point)
9. **Quiesce the Pi:** stop the Pi loops + watchdog (`launch.sh stop`); confirm both loops are at a
   safe point (no half-written commit; `git status` clean in both clones, last work pushed).
10. **Start on the VM:** enable + start the `cc-ci-orchestrator` systemd service → `launch.sh start`
    (RESUME_PHASE=1) brings up Builder + Adversary + watchdog on the VM, resuming phase 2 from the
    repo state. Verify all three sessions + a handoff + public health.
11. **Decommission the Pi loops:** disable the Pi's `cc-ci-tailscaled` and leave the workspace in place
    as a read-only fallback, with no loops running. (Keep the Pi as a cold standby for a few days before
    deleting anything.)
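
The "safe point" test in step 9 can be made mechanical. A minimal sketch, with hypothetical function names: a clone counts as quiesced when `git status --porcelain` prints nothing and no commits sit ahead of the upstream branch. It compares against the last-fetched upstream ref, so it does not prove the remote itself is current.

```shell
# clone_is_quiesced DIR: succeed only if the working tree is clean and nothing is unpushed.
clone_is_quiesced() {
  local dir="$1"
  # Any output from --porcelain means modified, staged, or untracked files.
  [ -z "$(git -C "$dir" status --porcelain)" ] || return 1
  # Commits on HEAD that are missing from the upstream branch are unpushed work.
  # With no upstream configured the count is empty, which is treated as not quiesced.
  [ "$(git -C "$dir" rev-list --count '@{upstream}..HEAD' 2>/dev/null)" = "0" ]
}

# all_clones_quiesced DIR...: stop at the first clone that is not at a safe point.
all_clones_quiesced() {
  local dir
  for dir in "$@"; do
    clone_is_quiesced "$dir" || { echo "not quiesced: $dir" >&2; return 1; }
  done
}
```

On the Pi this would be run after `launch.sh stop`, for example `all_clones_quiesced /srv/cc-ci/cc-ci /srv/cc-ci/cc-ci-adv`.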

### Phase E — move the orchestrator session (operator-assisted)
12. On the VM, start the orchestrator session: `claude --remote-control 'autonomous-orchestrator'
    --dangerously-skip-permissions` in a tmux session, seeded with AGENTS.md + this plan so it picks
    up the supervising role. The **operator connects** to it (claude.ai/code) — this is the
    "move myself" step; a session can't transplant itself across machines, so it's a fresh
    orchestrator session on the VM with full context from the repo.
13. This Pi-side orchestrator session hands off (writes a short state note) and goes idle/ends.

## 4. Risks & mitigations
- **claude auth (human step):** unavoidable device-code login on the VM. Mitigation: do it in Phase
  C, well before cutover; verify before quiescing the Pi.
- **Loops mid-work at cutover:** pick a quiet point (between gate claims / after a push); the loops
  re-orient from git on restart anyway, so worst case is a re-run of an in-flight iteration.
- **Secrets sprawl:** out-of-band secrets are copied once, then sops-managed in the new repo; never
  committed in plaintext (same discipline as cc-ci). The master age key is the sole bootstrap secret.
- **Self-move gap:** between Pi-session-ends and VM-session-connected, there's no live orchestrator.
  The watchdog (now a boot service) keeps the loops alive independently, so this gap is safe.
- **Rollback:** until the Pi workspace is deleted, reverting = stop VM service, `launch.sh start` on
  the Pi again. Keep the Pi intact until the VM has run clean through at least one reboot + one gate
  handshake.
- **Reboot-resilience proof:** before trusting the VM, reboot it once and confirm the loops +
  watchdog + proxy all come back via systemd (the whole point of the move).
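
The reboot-resilience proof can be reduced to a post-reboot check that the relevant units are active. The sketch below uses a hypothetical function name, and the unit name `cc-ci-orchestrator.service` is an assumption derived from the service names in §2.

```shell
# units_all_active UNIT...: succeed only if every named systemd unit is active.
# Reports each inactive unit on stderr rather than stopping at the first one.
units_all_active() {
  local unit rc=0
  for unit in "$@"; do
    if ! systemctl is-active --quiet "$unit"; then
      echo "not active: $unit" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

After rebooting the VM, `units_all_active cc-ci-tailscaled.service cc-ci-orchestrator.service` covers the proxy and the loop supervisor. Confirming the Builder, Adversary, and watchdog sessions themselves would still need a separate check.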

## 5. Operator-assisted steps (the only things I can't fully do)
1. Provide a fresh `TS_AUTH_KEY` for the VM (or confirm reuse of the one in `terraform-secrets`).
2. `claude auth login` on the VM (device code).
3. Connect to the new orchestrator session on the VM at cutover (Phase E).

Everything else (VM create, repo author, NixOS config, secret migration, workspace staging, the
loop cutover) I can drive.