The orchestrator Pi is retired (2026-05-31). All agents now run on the cc-ci-orchestrator VM (NixOS, loops user, /srv/cc-ci). The VM is a direct tailnet peer to cc-ci: no SOCKS proxy, no userspace tailscaled, no ProxyCommand.

Updated across all affected files:

AGENTS.md
- Remove Pi from reboot description; migration complete (not "parked")
- cc-ci access: direct ssh, not via proxy

kickoff.md
- Prerequisites: direct tailnet peer, not proxy
- Host deps: NixOS (not apt)
- Fallback/Incus: b1 reachable directly, no --proxy curl flag

plan.md §1 + §1.5
- §1 bootstrap: direct SSH, check tailscale status (not restart proxy)
- §1.5 intro: "VM" not "sandbox host"; no proxy
- Credentials table: remove TS_AUTH_KEY row; update cc-ci SSH row
- Replace "Tailscale connection (proxy)" subsection with direct-peer description

plan-orchestrator-migration.md
- Mark COMPLETE (2026-05-31); historical record only

plan-phase1c-full-reproducibility.md
- Incus access: direct, not via SOCKS proxy

prompts/builder.md + prompts/adversary.md
- cc-ci access language only: direct ssh, no proxy restart instructions
- adversary: *.ci.commoninternet.net via plain curl, no proxy flag

REBOOTS.md
- Retitle for VM; note Pi retired; Pi entries marked historical

systemd/cc-ci-loops.service
- User/Group/HOME/PATH: notplants → loops
- Remove cc-ci-tailscaled.service dependency (no proxy on VM)
- Add note about nix/configuration.nix as the authoritative VM declaration

test-e2e-testme-acceptance.md
- tailscale status: no --socket flag
- ssh to throwaway: no ProxyCommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

# Plan: migrate the orchestrator off the Pi onto a dedicated NixOS Incus VM

**Goal:** move everything that drives the cc-ci loops (the Builder/Adversary loops, the watchdog,
the SOCKS proxy, the orchestrator session itself) off the Raspberry Pi and onto a new, dedicated,
**reboot-resilient NixOS VM** on b1, declared in a new git repo **`cc-ci-orchestrator`**. Finish by
relocating this orchestrator session there too.

**Why:** the Pi has rebooted twice today, each time silently killing the tmux loops + watchdog
(they don't survive a reboot, and nothing auto-restarts them). A NixOS VM lets us declare the whole rig
(claude CLI, proxy, loop supervisor) as systemd services that come back on boot, turning a reboot
into a non-event. It also consolidates the orchestrator next to the infra it manages.

**Status:** COMPLETE (2026-05-31). All agents run on the VM; Pi fully decommissioned. Kept as a historical record.

**Phase A ✅ COMPLETE (2026-05-30):** VM `cc-ci-orchestrator` (**2 GB / 2 vCPU / 30 GB**,
`incus-base-vm`, NixOS 24.11) created via the Incus API and booted; **on the tailnet at
`100.116.55.106`**; **ssh works** (`ssh cc-ci-orchestrator` through the :1055 proxy; `cc-ci-root`
pubkey added via `incus exec`). Reproducible Terraform record at
`incus-terraform-nix-vm-creator/projects/cc-ci-orchestrator/` (note: this instance was API-created, so
expect Terraform drift; see PROVENANCE.txt).
- **TS-key finding:** the VM-creator's `.test.env` reusable key is **REVOKED** ("API key does not
exist"). The **`/srv/cc-ci/.testenv` `TS_AUTH_KEY` is valid**: it was used to join and was persisted into
the VM's `/etc/ts-auth-key`. So the plan's "operator provides a fresh TS key" item is **resolved** (no
new key needed); housekeeping: revoke/rotate the dead key in `.test.env`.
- **Sizing watch:** 2 GB ≈ 1.7 GiB usable; fine when idle (284 MiB) but tight for 3 concurrent claude
sessions (Pi OOM lesson). Phase B will declare a **swapfile**; bump to 4 GB pre-cutover if needed.

**Next (Phase B):** the `cc-ci-orchestrator` NixOS-config git repo (SOCKS proxy + loop-supervisor boot
service + claude CLI + sops secrets). Then C (stage workspace), claude auth (operator), D/E (cutover).

---

## 0. Current footprint (what has to move)

On the Pi (`raspberrypi`, aarch64), workspace `/srv/cc-ci` (itself the
`cc-ci-orchestrator` git repo, formerly `cc-ci-autonomous-orchestrator`):

| Item | What | Move strategy |
|---|---|---|
| `cc-ci-plan/` | loop code: `launch.sh`, `plan*.md`, `prompts/`, `kickoff.md` | in git (this repo) → clone on VM |
| `cc-ci/`, `cc-ci-adv/` | Builder + Adversary working clones (~13M each) | **re-clone from git.autonomic.zone** on the VM (cleaner than copying) |
| `.cc-ci-logs/` | watchdog/loop logs + `.phase-idx` | copy `.phase-idx` (the resume point); logs start fresh |
| `cc-ci-secrets/` | sops-encrypted secrets repo | in git → clone |
| `references/` | recipe-maintainer corpus (read-only parity source) | clone/rsync from `/srv/recipe-maintainer` |
| **`.testenv`** | TS auth key, Gitea bot creds | **out-of-band copy** (gitignored, never in git) |
| **`~/.ssh/cc-ci-root-ed25519`** | root SSH key to cc-ci | **out-of-band copy** |
| **`.sops/master-age.txt`** | master recovery age key | **out-of-band copy** |
| **Incus mTLS certs** (`/srv/incus-terraform-nix-vm-creator/terraform-secrets/`) | `terraform.{crt,key}`, `vm_ssh_key` | **out-of-band copy**, so the VM can itself manage VMs |
| `cc-ci-tailscaled.service` | userspace SOCKS proxy :1055 | **re-declare as NixOS** (see §2) |
| **claude CLI + auth** | `~/.local/bin/claude` v2.1.154 + `~/.claude.json` | install on VM + **operator `claude auth login`** (§3, Phase C) |
| this orchestrator session | the supervising claude conversation | **operator-assisted cutover** (§3, Phase E) |

Two hard human-in-the-loop steps, called out explicitly: **claude auth on the new VM** (device-code
login, can't be scripted) and the **final session cutover** (the operator connects to the new
orchestrator session). Everything else I can do.

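The out-of-band rows in the table can be spot-checked after copying. The following is a minimal sketch, assuming POSIX `sh` on the VM. `check_oob_files` is a hypothetical helper (not part of `launch.sh` or this repo) that reports secrets that are missing or readable by group/other. The relative paths in the usage line come from the table; the workspace root is an assumption.

```shell
#!/bin/sh
# Sketch only: check_oob_files is a hypothetical helper, not part of the repo.

# check_oob_files ROOT REL_PATH...
# Prints one line per problem. Returns 0 only if every file exists and
# carries no group/other permission bits.
check_oob_files() {
    root=$1
    shift
    rc=0
    for rel in "$@"; do
        f="$root/$rel"
        if [ ! -f "$f" ]; then
            echo "missing: $rel"
            rc=1
            continue
        fi
        # Characters 5-10 of the ls mode string are the group and other bits.
        perms=$(ls -ld "$f" | cut -c5-10)
        if [ "$perms" != "------" ]; then
            echo "too open: $rel"
            rc=1
        fi
    done
    return "$rc"
}
```

Something like `check_oob_files /srv/cc-ci .testenv .sops/master-age.txt` would print nothing and return 0 once both files are in place with mode 600.
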
## 1. Target VM spec

- **Host/API:** b1 Incus, `https://100.117.251.31:8443`, project `terraform-ci`, mTLS certs (already on hand).
- **Name:** `cc-ci-orchestrator` (tailnet hostname too).
- **Resources:** **2 GB RAM, 2 vCPU, 30 GB disk** (dir backend → resize needs a reboot, so size at
create time and avoid growing later). b1 has ample headroom (only cc-nix-test, at 8 GB, is running).
- **Image:** the existing imported NixOS base VM image (`incus-base-vm`), which already ships tailscale,
openssh, git/jq/curl, flakes, cloud-init.
- **Tailnet:** joins via a fresh `TS_AUTH_KEY` (operator provides, or reuse the keyed approach in
`terraform-secrets/.test.env`). MagicDNS name `cc-ci-orchestrator.taila4a0bf.ts.net`.
- **Bootstrap:** cloud-init writes the `cc-ci-orchestrator` flake config, then runs `nixos-rebuild switch`.

## 2. The new `cc-ci-orchestrator` git repo (NixOS config)

A new **private** repo on `git.autonomic.zone/recipe-maintainers/cc-ci-orchestrator` (bot is org
admin). It is the NixOS config for this VM: the orchestrator's equivalent of what `cc-ci` is for the
test server. Contents:

- `flake.nix` + `hosts/cc-ci-orchestrator/configuration.nix`: the VM's NixOS config.
- **Packages:** `claude-code` (CLI), `git`, `tmux`, `python3`, `jq`, `openssh`, `nodejs` (claude
runtime), `coreutils`, `nettools` (`nc` for the proxy ProxyCommand).
- **`services.cc-ci-tailscaled`**: the userspace tailscaled SOCKS proxy on :1055, as a NixOS
systemd service (ported to NixOS from the Pi's `cc-ci-tailscaled.service`). This is the path to b1 +
cc-ci.
- **`services.cc-ci-orchestrator`**: a systemd service that runs `launch.sh start` with
`RESUME_PHASE=1` **on boot** (after the proxy + network are up), as the workspace user. **This is
the reboot-resilience fix**: the loops + watchdog come back automatically after any reboot.
- **Secrets via sops-nix** (like cc-ci): the out-of-band secrets (`.testenv`, ssh key, incus certs)
are sops-encrypted into the repo and decrypted at activation to their runtime paths. The **master age
key** is the one irreducible out-of-band bootstrap secret, placed on the VM once.
- A declared `~/.ssh/config` entry for `cc-ci` (root, ProxyCommand via :1055).
- **Excluded from git:** claude's own auth (`~/.claude.json`). That is per-user login state, set up
once interactively (§3, Phase C), not committed.

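systemd ordering (`After=`/`Wants=`) in the Nix config is the natural place to express "after the proxy + network are up". As a belt-and-braces sketch in shell, the start command could additionally wait on a readiness probe before launching. `wait_until` and `start_loops_when_ready` are hypothetical helpers, and the probe command is whatever check the operator trusts (an assumption here); only `launch.sh start` and `RESUME_PHASE=1` come from the plan.

```shell
#!/bin/sh
# Sketch only: both functions are hypothetical, not part of launch.sh.

# wait_until LIMIT_SECONDS CMD [ARGS...]
# Re-runs CMD once per second until it succeeds or LIMIT_SECONDS have elapsed.
wait_until() {
    limit=$1
    shift
    waited=0
    until "$@"; do
        if [ "$waited" -ge "$limit" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# start_loops_when_ready LAUNCHER LIMIT_SECONDS PROBE [ARGS...]
# Waits for PROBE to succeed, then runs "LAUNCHER start" with RESUME_PHASE=1.
start_loops_when_ready() {
    launcher=$1
    max_wait=$2
    shift 2
    if ! wait_until "$max_wait" "$@"; then
        echo "dependency not ready after ${max_wait}s" >&2
        return 1
    fi
    RESUME_PHASE=1 "$launcher" start
}
```

A call might look like `start_loops_when_ready ./launch.sh 60 some_probe_command`, where the probe is any command that exits 0 once the proxy is usable.
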
## 3. Execution phases

### Phase A: provision the VM (reversible; safe to do while Pi loops keep running)
1. Create `cc-ci-orchestrator` VM via the Incus API (2 GB / 2 vCPU / 30 GB, NixOS base image, TS auth
key in cloud-init). Wait for tailnet join + ssh.
2. Verify: `ssh` in, `tailscale status`, `nixos-rebuild` available, can reach b1 API + cc-ci through
its own proxy once configured.

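The tool-availability part of step 2 could be scripted roughly as follows. This is a sketch assuming POSIX `sh`; `missing_cmds` is a hypothetical helper name, and the command list in the usage line is an assumption drawn from the tools the plan mentions.

```shell
#!/bin/sh
# Sketch only: missing_cmds is a hypothetical helper.

# missing_cmds CMD...
# Prints each command that is not found on PATH; returns 1 if any are missing.
missing_cmds() {
    rc=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            rc=1
        fi
    done
    return "$rc"
}
```

Run on the VM as, for example, `missing_cmds tailscale nixos-rebuild git tmux`; an empty result with exit status 0 means every listed tool is present. Reachability of the b1 API and cc-ci still needs a separate manual check.
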
### Phase B: author + apply the `cc-ci-orchestrator` repo
3. Create the private git repo; author the flake/config (§2); commit/push.
4. Place the master age key on the VM; sops-encrypt the out-of-band secrets into the repo.
5. `nixos-rebuild switch` on the VM → proxy service up, packages present, services defined (loop
supervisor **not yet started**, or started in a dry mode).

### Phase C: stage the workspace (no cutover yet)
6. On the VM: clone `cc-ci-orchestrator` (the loop code), clone the Builder/Adversary
working repos fresh from git.autonomic.zone, clone `cc-ci-secrets`, rsync `references/`.
7. Copy `.phase-idx` (resume point = phase 2) so the VM watchdog resumes the right phase.
8. **Operator step:** `claude auth login` on the VM (device code) so the loops can run
`--remote-control --dangerously-skip-permissions`. Verify with a throwaway interactive claude.

### Phase D: cutover (the only disruptive moment; pick a clean point)
9. **Quiesce the Pi:** stop the Pi loops + watchdog (`launch.sh stop`); confirm both loops are at a
safe point (no half-written commit; `git status` clean in both clones, last work pushed).
10. **Start on the VM:** enable + start the `cc-ci-orchestrator` systemd service → `launch.sh start`
(RESUME_PHASE=1) brings up Builder + Adversary + watchdog on the VM, resuming phase 2 from the
repo state. Verify all three sessions + a handoff + public health.
11. **Decommission the Pi loops:** disable the Pi's `cc-ci-tailscaled` and leave the workspace in place
as a read-only fallback, with no loops running. (Keep the Pi as a cold standby for a few days before
deleting anything.)

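The "safe point" test in step 9 can be sketched as a small check run against each working clone. `clone_is_quiesced` is a hypothetical helper; it assumes each clone's branch tracks an upstream, and it compares against the last-fetched upstream ref, so it does not contact the remote itself.

```shell
#!/bin/sh
# Sketch only: clone_is_quiesced is a hypothetical helper.

# clone_is_quiesced DIR
# Succeeds only if DIR has no uncommitted or untracked changes and no local
# commits that are absent from its upstream branch.
clone_is_quiesced() {
    dir=$1
    # Any porcelain output means modified, staged, or untracked files.
    if [ -n "$(git -C "$dir" status --porcelain)" ]; then
        echo "dirty working tree: $dir"
        return 1
    fi
    # Count commits on HEAD that the upstream branch does not have.
    if ! ahead=$(git -C "$dir" rev-list --count '@{upstream}..HEAD' 2>/dev/null); then
        echo "no upstream configured: $dir"
        return 1
    fi
    if [ "$ahead" -ne 0 ]; then
        echo "unpushed commits ($ahead): $dir"
        return 1
    fi
    return 0
}
```

On the Pi this might be run as `clone_is_quiesced /srv/cc-ci/cc-ci && clone_is_quiesced /srv/cc-ci/cc-ci-adv` after `launch.sh stop` (paths assumed from the workspace layout in §0).
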
### Phase E: move the orchestrator session (operator-assisted)
12. On the VM, start the orchestrator session:
`claude --remote-control 'autonomous-orchestrator' --dangerously-skip-permissions` in a tmux session,
seeded with AGENTS.md + this plan so it picks up the supervising role. The **operator connects** to it
(claude.ai/code). This is the "move myself" step; a session can't transplant itself across machines,
so it's a fresh orchestrator session on the VM with full context from the repo.
13. This Pi-side orchestrator session hands off (writes a short state note) and goes idle/ends.

## 4. Risks & mitigations
- **claude auth (human step):** unavoidable device-code login on the VM. Mitigation: do it in Phase
C, well before cutover; verify before quiescing the Pi.
- **Loops mid-work at cutover:** pick a quiet point (between gate claims / after a push); the loops
re-orient from git on restart anyway, so the worst case is a re-run of an in-flight iteration.
- **Secrets sprawl:** out-of-band secrets are copied once, then sops-managed in the new repo; never
committed in plaintext (same discipline as cc-ci). The master age key is the sole bootstrap secret.
- **Self-move gap:** between the Pi session ending and the VM session connecting, there is no live
orchestrator. The watchdog (now a boot service) keeps the loops alive independently, so this gap is safe.
- **Rollback:** until the Pi workspace is deleted, reverting means stopping the VM service and running
`launch.sh start` on the Pi again. Keep the Pi intact until the VM has run clean through at least one
reboot + one gate handshake.
- **Reboot-resilience proof:** before trusting the VM, reboot it once and confirm the loops +
watchdog + proxy all come back via systemd (the whole point of the move).

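The reboot-resilience proof can be partly automated by asking systemd which units came back. This is a minimal sketch: `inactive_units` is a hypothetical helper, and the unit names in the usage line are assumptions derived from the service names in §2. A unit reporting active does not by itself prove the tmux loops inside it are healthy, so a look at the sessions is still worthwhile.

```shell
#!/bin/sh
# Sketch only: inactive_units is a hypothetical helper.

# inactive_units UNIT...
# Prints each unit that systemd does not report as active; returns 1 if any
# unit was printed.
inactive_units() {
    rc=0
    for unit in "$@"; do
        if ! systemctl is-active --quiet "$unit"; then
            echo "$unit"
            rc=1
        fi
    done
    return "$rc"
}
```

After rebooting the VM, `inactive_units cc-ci-orchestrator.service cc-ci-tailscaled.service` printing nothing would indicate both units restarted on boot.
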
## 5. Operator-assisted steps (the only things I can't fully do)
1. Provide a fresh `TS_AUTH_KEY` for the VM (or confirm reuse of the one in `terraform-secrets`).
2. `claude auth login` on the VM (device code).
3. Connect to the new orchestrator session on the VM at cutover (Phase E).

Everything else (VM create, repo author, NixOS config, secret migration, workspace staging, the
loop cutover) I can drive.