The orchestrator Pi is retired (2026-05-31). All agents now run on the cc-ci-orchestrator VM (NixOS, loops user, /srv/cc-ci). The VM is a direct tailnet peer to cc-ci: no SOCKS proxy, no userspace tailscaled, no ProxyCommand.

Updated across all affected files:

AGENTS.md
- Remove Pi from reboot description; migration complete (not "parked")
- cc-ci access: direct ssh, not via proxy

kickoff.md
- Prerequisites: direct tailnet peer, not proxy
- Host deps: NixOS (not apt)
- Fallback/Incus: b1 reachable directly, no --proxy curl flag

plan.md §1 + §1.5
- §1 bootstrap: direct SSH, check tailscale status (not restart proxy)
- §1.5 intro: "VM" not "sandbox host"; no proxy
- Credentials table: remove TS_AUTH_KEY row; update cc-ci SSH row
- Replace "Tailscale connection (proxy)" subsection with direct-peer description

plan-orchestrator-migration.md
- Mark COMPLETE (2026-05-31); historical record only

plan-phase1c-full-reproducibility.md
- Incus access: direct, not via SOCKS proxy

prompts/builder.md + prompts/adversary.md
- cc-ci access language only: direct ssh, no proxy restart instructions
- adversary: *.ci.commoninternet.net via plain curl, no proxy flag

REBOOTS.md
- Retitle for VM; note Pi retired; Pi entries marked historical

systemd/cc-ci-loops.service
- User/Group/HOME/PATH: notplants → loops
- Remove cc-ci-tailscaled.service dependency (no proxy on VM)
- Add note about nix/configuration.nix as the authoritative VM declaration

test-e2e-testme-acceptance.md
- tailscale status: no --socket flag
- ssh to throwaway: no ProxyCommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Plan — migrate the orchestrator off the Pi onto a dedicated NixOS Incus VM
Goal: move everything that drives the cc-ci loops (the Builder/Adversary loops, the watchdog,
the SOCKS proxy, the orchestrator session itself) off the Raspberry Pi and onto a new, dedicated,
reboot-resilient NixOS VM on b1 — declared in a new git repo cc-ci-orchestrator. Finish by
relocating this orchestrator session there too.
Why: the Pi has rebooted twice today, each time silently killing the tmux loops + watchdog (they don't survive reboot, nothing auto-restarts them). A NixOS VM lets us declare the whole rig (claude CLI, proxy, loop supervisor) as systemd services that come back on boot — turning a reboot into a non-event. It also consolidates the orchestrator next to the infra it manages.
Status: COMPLETE (2026-05-31). All agents run on the VM; Pi fully decommissioned. Kept as a historical record.
Phase A ✅ COMPLETE (2026-05-30): VM cc-ci-orchestrator (2 GB / 2 vCPU / 30 GB,
incus-base-vm, NixOS 24.11) created via the Incus API + booted; on the tailnet at
100.116.55.106; ssh works (ssh cc-ci-orchestrator through the :1055 proxy — cc-ci-root
pubkey added via incus exec). Reproducible Terraform record at
incus-terraform-nix-vm-creator/projects/cc-ci-orchestrator/ (note: this instance was API-created, so
expect Terraform drift; see PROVENANCE.txt).
- TS-key finding: the VM-creator's .test.env reusable key is REVOKED ("API key does not exist"). The /srv/cc-ci/.testenv TS_AUTH_KEY is valid: used it to join, and persisted it into the VM's /etc/ts-auth-key. So the plan's "operator provides a fresh TS key" item is resolved (no new key needed); housekeeping: revoke/rotate the dead key in .test.env.
- Sizing watch: 2 GB ≈ 1.7 GiB usable; fine idle (284 MiB) but tight for 3 concurrent claude sessions (Pi OOM lesson). Phase B will declare a swapfile; bump to 4 GB pre-cutover if needed.
Next — Phase B: the cc-ci-orchestrator NixOS-config git repo (SOCKS proxy + loop-supervisor boot
service + claude CLI + sops secrets). Then C (stage workspace), claude auth (operator), D/E (cutover).
0. Current footprint (what has to move)
On the Pi (raspberrypi, aarch64), workspace /srv/cc-ci (itself the
cc-ci-orchestrator git repo — formerly cc-ci-autonomous-orchestrator):
| Item | What | Move strategy |
|---|---|---|
| cc-ci-plan/ | loop code: launch.sh, plan*.md, prompts/, kickoff.md | in git (this repo) → clone on VM |
| cc-ci/, cc-ci-adv/ | Builder + Adversary working clones (~13M each) | re-clone from git.autonomic.zone on the VM (cleaner than copying) |
| .cc-ci-logs/ | watchdog/loop logs + .phase-idx | copy .phase-idx (the resume point); logs start fresh |
| cc-ci-secrets/ | sops-encrypted secrets repo | in git → clone |
| references/ | recipe-maintainer corpus (read-only parity source) | clone/rsync from /srv/recipe-maintainer |
| .testenv | TS auth key, Gitea bot creds | out-of-band copy (gitignored, never in git) |
| ~/.ssh/cc-ci-root-ed25519 | root SSH key to cc-ci | out-of-band copy |
| .sops/master-age.txt | master recovery age key | out-of-band copy |
| Incus mTLS certs (/srv/incus-terraform-nix-vm-creator/terraform-secrets/) | terraform.{crt,key}, vm_ssh_key | out-of-band copy, so the VM can itself manage VMs |
| cc-ci-tailscaled.service | userspace SOCKS proxy :1055 | re-declare as NixOS (see §3) |
| claude CLI + auth | ~/.local/bin/claude v2.1.154 + ~/.claude.json | install on VM + operator claude auth login (§4) |
| this orchestrator session | the supervising claude conversation | operator-assisted cutover (§6) |
Two hard human-in-the-loop steps, called out explicitly: claude auth on the new VM (device-code login, can't be scripted) and the final session cutover (the operator connects to the new orchestrator session). Everything else I can do.
1. Target VM spec
- Host/API: b1 Incus, https://100.117.251.31:8443, project terraform-ci, mTLS certs (have).
- Name: cc-ci-orchestrator (tailnet hostname too).
- Resources: 2 GB RAM, 2 vCPU, 30 GB disk (dir backend → resize needs a reboot; size at create time so no later grow). b1 has ample headroom (only cc-nix-test @8GB running).
- Image: the existing imported NixOS base VM image (incus-base-vm), which already ships tailscale, openssh, git/jq/curl, flakes, cloud-init.
- Tailnet: joins via a fresh TS_AUTH_KEY (operator provides, or reuse the keyed approach in terraform-secrets/.test.env). MagicDNS name cc-ci-orchestrator.taila4a0bf.ts.net.
- Bootstrap: cloud-init writes the cc-ci-orchestrator flake config + nixos-rebuild switch.
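A quick way to confirm the Host/API entry above is usable is to list the instances in the project over the Incus REST API with the mTLS client certificate. The sketch below assumes the certificate files live in the terraform-secrets/ directory named in the footprint table, and that the server certificate is self-signed (hence `--insecure`); the function and variable names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: query the b1 Incus API for instances in the terraform-ci project.
# Endpoint and project come from the spec; CERT_DIR is an assumed location.

INCUS_URL="${INCUS_URL:-https://100.117.251.31:8443}"
INCUS_PROJECT="${INCUS_PROJECT:-terraform-ci}"
CERT_DIR="${CERT_DIR:-/srv/incus-terraform-nix-vm-creator/terraform-secrets}"

# Build the instances URL for the configured project.
incus_instances_url() {
  printf '%s/1.0/instances?project=%s\n' "$INCUS_URL" "$INCUS_PROJECT"
}

# Fetch the instance list as JSON, authenticating with the client cert.
# --insecure is an assumption: it skips verification of the server cert.
incus_list_instances() {
  curl --silent --fail --insecure \
    --cert "$CERT_DIR/terraform.crt" \
    --key "$CERT_DIR/terraform.key" \
    "$(incus_instances_url)"
}
```

Running `incus_list_instances | jq .` from a host that can reach b1 should return a JSON document whose metadata lists the instance paths; a non-zero exit status from curl points at a reachability or certificate problem.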
2. The new cc-ci-orchestrator git repo (NixOS config)
A new private repo on git.autonomic.zone/recipe-maintainers/cc-ci-orchestrator (bot is org
admin). It is the NixOS config for this VM — the orchestrator's equivalent of what cc-ci is for the
test server. Contents:
- flake.nix + hosts/cc-ci-orchestrator/configuration.nix: the VM's NixOS config.
- Packages: claude-code (CLI), git, tmux, python3, jq, openssh, nodejs (claude runtime), coreutils, nettools (nc for the proxy ProxyCommand).
- services.cc-ci-tailscaled: the userspace tailscaled SOCKS proxy on :1055, as a NixOS systemd service (port to NixOS from the Pi's cc-ci-tailscaled.service). This is the path to b1 + cc-ci.
- services.cc-ci-orchestrator: a systemd service that runs launch.sh start with RESUME_PHASE=1 on boot (after the proxy + network are up), as the workspace user. This is the reboot-resilience fix: the loops + watchdog come back automatically after any reboot.
- Secrets via sops-nix (like cc-ci): the out-of-band secrets (.testenv, ssh key, incus certs) are sops-encrypted into the repo, decrypted at activation to their runtime paths. The master age key is the one irreducible out-of-band bootstrap secret placed on the VM once.
- ~/.ssh/config for cc-ci (root, ProxyCommand via :1055) declared.
- Excluded from git: claude's own auth (~/.claude.json). That's per-user login state, set up once interactively (§4), not committed.
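The ordering the loop-supervisor service depends on (proxy up first, then `launch.sh start` with `RESUME_PHASE=1`) could be enforced by a small wrapper that the systemd unit executes. This is a minimal sketch, not the actual service script: the function names, the 60-second timeout, and the `cc-ci-plan/launch.sh` location under the workspace are assumptions, and the port probe relies on bash's `/dev/tcp` redirection.

```bash
#!/usr/bin/env bash
# Sketch of a boot wrapper for the loop supervisor (names are hypothetical).

# wait_for_port HOST PORT [TIMEOUT_SECONDS]
# Polls once per second until a TCP connect succeeds or the timeout expires.
wait_for_port() {
  local host="$1" port="$2" timeout="${3:-60}" waited=0
  # The subshell opens and immediately drops the connection.
  until (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
    waited=$((waited + 1))
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# start_loops [WORKSPACE]
# Refuses to start the loops until the SOCKS proxy on :1055 accepts connections.
start_loops() {
  local workspace="${1:-/srv/cc-ci}"
  if ! wait_for_port 127.0.0.1 1055 60; then
    echo "SOCKS proxy on :1055 did not come up; not starting loops" >&2
    return 1
  fi
  # Assumed path: launch.sh lives in cc-ci-plan/ inside the workspace.
  RESUME_PHASE=1 "$workspace/cc-ci-plan/launch.sh" start
}
```

Declaring the dependency in the unit itself (ordering the service after the proxy unit) covers process start order, while a probe like this covers the gap between the proxy process starting and the port actually listening.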
3. Execution phases
Phase A — provision the VM (reversible; safe to do while Pi loops keep running)
- Create cc-ci-orchestrator VM via the Incus API (2 GB / 2 vCPU / 30 GB, NixOS base image, TS auth key in cloud-init). Wait for tailnet join + ssh.
- Verify: ssh in, tailscale status, nixos-rebuild available, can reach b1 API + cc-ci through its own proxy once configured.
Phase B — author + apply the cc-ci-orchestrator repo
- Create the private git repo; author the flake/config (§2); commit/push.
- Place the master age key on the VM; sops-encrypt the out-of-band secrets into the repo.
- nixos-rebuild switch on the VM → proxy service up, packages present, services defined (loop supervisor not yet started, or started in a dry mode).
Phase C — stage the workspace (no cutover yet)
- On the VM: clone cc-ci-orchestrator (the loop code), clone the Builder/Adversary working repos fresh from git.autonomic.zone, clone cc-ci-secrets, rsync references/.
- Copy .phase-idx (resume point = phase 2) so the VM watchdog resumes the right phase.
- Operator step: claude auth login on the VM (device code) so the loops can run --remote-control --dangerously-skip-permissions. Verify with a throwaway interactive claude.
Phase D — cutover (the only disruptive moment; pick a clean point)
- Quiesce the Pi: stop the Pi loops + watchdog (launch.sh stop); confirm both loops are at a safe point (no half-written commit; git status clean in both clones, last work pushed).
- Start on the VM: enable + start the cc-ci-orchestrator systemd service → launch.sh start (RESUME_PHASE=1) brings up Builder + Adversary + watchdog on the VM, resuming phase 2 from the repo state. Verify all three sessions + a handoff + public health.
- Decommission the Pi loops: disable the Pi's cc-ci-tailscaled + leave the workspace in place (read-only fallback) but not running loops. (Keep the Pi as a cold standby for a few days before deleting anything.)
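The "safe point" condition in the quiesce step (clean `git status` in both clones, last work pushed) can be checked mechanically before stopping anything. The sketch below is one way to express it; the function name is hypothetical, and it compares against the locally known upstream ref, so a `git fetch` beforehand is assumed if the remote may have moved.

```bash
#!/usr/bin/env bash
# Sketch: decide whether a working clone is at a safe cutover point.

# clone_is_quiescent DIR
# Succeeds only if DIR has no uncommitted or untracked changes and its HEAD
# equals its upstream tracking branch (nothing left unpushed).
clone_is_quiescent() {
  local dir="$1" head upstream
  # Any output from --porcelain means the working tree is not clean.
  if [ -n "$(git -C "$dir" status --porcelain)" ]; then
    return 1
  fi
  head="$(git -C "$dir" rev-parse HEAD 2>/dev/null)" || return 1
  # '@{u}' resolves to the upstream branch; it fails if none is configured.
  upstream="$(git -C "$dir" rev-parse '@{u}' 2>/dev/null)" || return 1
  [ "$head" = "$upstream" ]
}
```

On the Pi this would be called once per clone, for example `clone_is_quiescent /srv/cc-ci/cc-ci && clone_is_quiescent /srv/cc-ci/cc-ci-adv`, with the clone paths assumed from the footprint table.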
Phase E — move the orchestrator session (operator-assisted)
- On the VM, start the orchestrator session: claude --remote-control 'autonomous-orchestrator' --dangerously-skip-permissions in a tmux session, seeded with AGENTS.md + this plan so it picks up the supervising role. The operator connects to it (claude.ai/code). This is the "move myself" step; a session can't transplant itself across machines, so it's a fresh orchestrator session on the VM with full context from the repo.
- This Pi-side orchestrator session hands off (writes a short state note) and goes idle/ends.
4. Risks & mitigations
- claude auth (human step): unavoidable device-code login on the VM. Mitigation: do it in Phase C, well before cutover; verify before quiescing the Pi.
- Loops mid-work at cutover: pick a quiet point (between gate claims / after a push); the loops re-orient from git on restart anyway, so worst case is a re-run of an in-flight iteration.
- Secrets sprawl: out-of-band secrets are copied once, then sops-managed in the new repo; never committed in plaintext (same discipline as cc-ci). The master age key is the sole bootstrap secret.
- Self-move gap: between Pi-session-ends and VM-session-connected, there's no live orchestrator. The watchdog (now a boot service) keeps the loops alive independently, so this gap is safe.
- Rollback: until the Pi workspace is deleted, reverting = stop VM service, launch.sh start on the Pi again. Keep the Pi intact until the VM has run clean through at least one reboot + one gate handshake.
- Reboot-resilience proof: before trusting the VM, reboot it once and confirm the loops + watchdog + proxy all come back via systemd (the whole point of the move).
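The reboot-resilience proof above amounts to rebooting the VM and then asking systemd whether the expected units are running. A minimal sketch of that check follows; the helper name is hypothetical, and the unit names in the usage note are assumed from the services.* entries in §2.

```bash
#!/usr/bin/env bash
# Sketch: report which of the given systemd units are active after a reboot.

# units_all_active UNIT...
# Prints one status line per unit; returns non-zero if any unit is not active.
units_all_active() {
  local unit failed=0
  for unit in "$@"; do
    if systemctl is-active --quiet "$unit"; then
      echo "ok   $unit"
    else
      echo "DOWN $unit"
      failed=1
    fi
  done
  return "$failed"
}
```

After the test reboot, something like `units_all_active cc-ci-tailscaled.service cc-ci-orchestrator.service` would give a quick pass/fail signal. It only shows that the services are up, so the tmux sessions and a gate handshake still need to be verified separately.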
5. Operator-assisted steps (the only things I can't fully do)
- Provide a fresh TS_AUTH_KEY for the VM (or confirm reuse of the one in terraform-secrets).
- claude auth login on the VM (device code).
- Connect to the new orchestrator session on the VM at cutover (Phase E).
Everything else (VM create, repo author, NixOS config, secret migration, workspace staging, the loop cutover) I can drive.