
autonomic-bot 01874821f2 decommission Pi: update all docs for VM-only setup

The orchestrator Pi is retired (2026-05-31). All agents now run on the
cc-ci-orchestrator VM (NixOS, loops user, /srv/cc-ci). The VM is a
direct tailnet peer to cc-ci — no SOCKS proxy, no userspace tailscaled,
no ProxyCommand. Updated across all affected files:

AGENTS.md
  - Remove Pi from reboot description; migration complete (not "parked")
  - cc-ci access: direct ssh, not via proxy

kickoff.md
  - Prerequisites: direct tailnet peer, not proxy
  - Host deps: NixOS (not apt)
  - Fallback/Incus: b1 reachable directly, no --proxy curl flag

plan.md §1 + §1.5
  - §1 bootstrap: direct SSH, check tailscale status (not restart proxy)
  - §1.5 intro: "VM" not "sandbox host"; no proxy
  - Credentials table: remove TS_AUTH_KEY row; update cc-ci SSH row
  - Replace "Tailscale connection (proxy)" subsection with direct-peer description

plan-orchestrator-migration.md
  - Mark COMPLETE (2026-05-31); historical record only

plan-phase1c-full-reproducibility.md
  - Incus access: direct, not via SOCKS proxy

prompts/builder.md + prompts/adversary.md
  - cc-ci access language only: direct ssh, no proxy restart instructions
  - adversary: *.ci.commoninternet.net via plain curl, no proxy flag

REBOOTS.md
  - Retitle for VM; note Pi retired; Pi entries marked historical

systemd/cc-ci-loops.service
  - User/Group/HOME/PATH: notplants → loops
  - Remove cc-ci-tailscaled.service dependency (no proxy on VM)
  - Add note about nix/configuration.nix as the authoritative VM declaration

test-e2e-testme-acceptance.md
  - tailscale status: no --socket flag
  - ssh to throwaway: no ProxyCommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-31 00:16:37 +00:00

10 KiB


Plan — migrate the orchestrator off the Pi onto a dedicated NixOS Incus VM

Goal: move everything that drives the cc-ci loops (the Builder/Adversary loops, the watchdog, the SOCKS proxy, the orchestrator session itself) off the Raspberry Pi and onto a new, dedicated, reboot-resilient NixOS VM on b1 — declared in a new git repo cc-ci-orchestrator. Finish by relocating this orchestrator session there too.

Why: the Pi has rebooted twice today, each time silently killing the tmux loops + watchdog (they don't survive reboot, nothing auto-restarts them). A NixOS VM lets us declare the whole rig (claude CLI, proxy, loop supervisor) as systemd services that come back on boot — turning a reboot into a non-event. It also consolidates the orchestrator next to the infra it manages.

Status: COMPLETE (2026-05-31). All agents run on the VM; Pi fully decommissioned. Kept as a historical record.

Phase A ✅ COMPLETE (2026-05-30): VM cc-ci-orchestrator (2 GB / 2 vCPU / 30 GB, incus-base-vm, NixOS 24.11) created via the Incus API and booted; on the tailnet at 100.116.55.106; ssh works (ssh cc-ci-orchestrator through the :1055 proxy; the cc-ci-root pubkey was added via incus exec). Reproducible Terraform record at incus-terraform-nix-vm-creator/projects/cc-ci-orchestrator/ (note: this instance was created through the API rather than by Terraform, so expect Terraform state drift; see PROVENANCE.txt).

TS-key finding: the VM-creator's .test.env reusable key is REVOKED ("API key does not exist"). The TS_AUTH_KEY in /srv/cc-ci/.testenv is valid; I used it to join the tailnet and persisted it into the VM's /etc/ts-auth-key. So the plan's "operator provides a fresh TS key" item is resolved (no new key needed). Housekeeping: revoke/rotate the dead key in .test.env.
Sizing watch: 2 GB ≈ 1.7 GiB usable; fine at idle (284 MiB) but tight for 3 concurrent claude sessions (Pi OOM lesson). Phase B will declare a swapfile; bump to 4 GB pre-cutover if needed.
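A rough way to keep an eye on that headroom before adding sessions is sketched below. It assumes a Linux /proc/meminfo; the function names and the 512 MiB per-session budget are hypothetical placeholders, not measured figures from this plan.

```shell
# Sketch only: report available memory and compare it with a per-session budget.
# The 512 MiB default is an assumption for illustration, not a measured value.

# Print MemAvailable in MiB, read from a meminfo-style file (default /proc/meminfo).
mem_available_mib() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "$meminfo"
}

# Succeed if <available MiB> covers <sessions> x <MiB per session>.
has_headroom() {
  local available_mib="$1"
  local sessions="$2"
  local per_session_mib="${3:-512}"
  [ "$available_mib" -ge $(( sessions * per_session_mib )) ]
}
```

For example, `has_headroom "$(mem_available_mib)" 3` would answer whether three sessions fit under the assumed budget; a failure is the cue to add swap or bump the VM to 4 GB.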

Next — Phase B: the cc-ci-orchestrator NixOS-config git repo (SOCKS proxy + loop-supervisor boot service + claude CLI + sops secrets). Then C (stage workspace), claude auth (operator), D/E (cutover).

0. Current footprint (what has to move)

On the Pi (raspberrypi, aarch64), workspace /srv/cc-ci (itself the cc-ci-orchestrator git repo — formerly cc-ci-autonomous-orchestrator):

| Item | What | Move strategy |
|---|---|---|
| `cc-ci-plan/` | loop code: `launch.sh`, `plan*.md`, `prompts/`, `kickoff.md` | in git (this repo) → clone on VM |
| `cc-ci/`, `cc-ci-adv/` | Builder + Adversary working clones (~13M each) | re-clone from git.autonomic.zone on the VM (cleaner than copying) |
| `.cc-ci-logs/` | watchdog/loop logs + `.phase-idx` | copy `.phase-idx` (the resume point); logs start fresh |
| `cc-ci-secrets/` | sops-encrypted secrets repo | in git → clone |
| `references/` | recipe-maintainer corpus (read-only parity source) | clone/rsync from `/srv/recipe-maintainer` |
| `.testenv` | TS auth key, Gitea bot creds | out-of-band copy (gitignored, never in git) |
| `~/.ssh/cc-ci-root-ed25519` | root SSH key to cc-ci | out-of-band copy |
| `.sops/master-age.txt` | master recovery age key | out-of-band copy |
| Incus mTLS certs (`/srv/incus-terraform-nix-vm-creator/terraform-secrets/`) | `terraform.{crt,key}`, `vm_ssh_key` | out-of-band copy, so the VM can itself manage VMs |
| `cc-ci-tailscaled.service` | userspace SOCKS proxy :1055 | re-declare as NixOS (see §3) |
| claude CLI + auth | `~/.local/bin/claude` v2.1.154 + `~/.claude.json` | install on VM + operator `claude auth login` (§4) |
| this orchestrator session | the supervising claude conversation | operator-assisted cutover (§6) |

Two hard human-in-the-loop steps, called out explicitly: claude auth on the new VM (device-code login, can't be scripted) and the final session cutover (the operator connects to the new orchestrator session). Everything else I can do.

1. Target VM spec

Host/API: b1 Incus, https://100.117.251.31:8443, project terraform-ci, mTLS certs (have).
Name: cc-ci-orchestrator (tailnet hostname too).
Resources: 2 GB RAM, 2 vCPU, 30 GB disk (dir backend, so a resize needs a reboot; size the disk at create time so it never has to grow later). b1 has ample headroom (only cc-nix-test, at 8 GB, is running).
Image: the existing imported NixOS base VM image (incus-base-vm) — already ships tailscale, openssh, git/jq/curl, flakes, cloud-init.
Tailnet: joins via a fresh TS_AUTH_KEY (operator provides, or reuse the keyed approach in terraform-secrets/.test.env). MagicDNS name cc-ci-orchestrator.taila4a0bf.ts.net.
Bootstrap: cloud-init writes the cc-ci-orchestrator flake config + nixos-rebuild switch.

2. The new `cc-ci-orchestrator` git repo (NixOS config)

A new private repo on git.autonomic.zone/recipe-maintainers/cc-ci-orchestrator (bot is org admin). It is the NixOS config for this VM — the orchestrator's equivalent of what cc-ci is for the test server. Contents:

flake.nix + hosts/cc-ci-orchestrator/configuration.nix — the VM's NixOS config.
Packages: claude-code (CLI), git, tmux, python3, jq, openssh, nodejs (claude runtime), coreutils, netcat (nc for the proxy ProxyCommand).
services.cc-ci-tailscaled — the userspace tailscaled SOCKS proxy on :1055, as a NixOS systemd service (port to NixOS from the Pi's cc-ci-tailscaled.service). This is the path to b1 + cc-ci.
services.cc-ci-orchestrator — a systemd service that runs launch.sh start with RESUME_PHASE=1 on boot (after the proxy + network are up), as the workspace user. This is the reboot-resilience fix — the loops + watchdog come back automatically after any reboot.
Secrets via sops-nix (like cc-ci): the out-of-band secrets (.testenv, ssh key, incus certs) are sops-encrypted into the repo, decrypted at activation to their runtime paths. The master age key is the one irreducible out-of-band bootstrap secret placed on the VM once.
~/.ssh/config for cc-ci (root, ProxyCommand via :1055) declared.
Excluded from git: claude's own auth (~/.claude.json) — that's per-user login state, set up once interactively (§4), not committed.
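As a command-line equivalent of the declared ~/.ssh/config entry, the proxied connection could look roughly like the sketch below. It assumes an OpenBSD-style nc (the variant that accepts -X 5 -x host:port for SOCKS5) and that the proxy listens on localhost; the function name proxy_ssh is hypothetical.

```shell
# Sketch only: ssh to a tailnet host as root through the userspace tailscaled
# SOCKS5 proxy on localhost:1055, using the cc-ci root key.
# Assumes an OpenBSD-style nc; other netcat variants use different flags.
proxy_ssh() {
  local host="$1"
  shift
  ssh -i "$HOME/.ssh/cc-ci-root-ed25519" \
      -o "ProxyCommand=nc -X 5 -x localhost:1055 %h %p" \
      "root@${host}" "$@"
}
```

With this in place, `proxy_ssh cc-ci uptime` would run a single command on cc-ci through the proxy. Declaring the same options in ~/.ssh/config keeps plain `ssh cc-ci` working for the loops.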

3. Execution phases

Phase A — provision the VM (reversible; safe to do while Pi loops keep running)

Create cc-ci-orchestrator VM via the Incus API (2 GB / 2 vCPU / 30 GB, NixOS base image, TS auth key in cloud-init). Wait for tailnet join + ssh.
Verify: ssh in, tailscale status, nixos-rebuild available, can reach b1 API + cc-ci through its own proxy once configured.
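The first three verification checks could be bundled roughly as follows. This is a sketch: it assumes the ssh alias cc-ci-orchestrator already resolves, and that `tailscale status` exits non-zero when the node is not connected. The function name verify_vm is hypothetical, and the b1 API and cc-ci reachability checks are left out because they depend on the proxy configured in Phase B.

```shell
# Sketch only: Phase A checks, run from a host that can already ssh to the VM.
verify_vm() {
  local vm="${1:-cc-ci-orchestrator}"
  # 1. Can we log in at all?
  ssh "$vm" true || { echo "ssh to $vm failed" >&2; return 1; }
  # 2. Is the VM's tailscale client up?
  ssh "$vm" tailscale status >/dev/null || { echo "tailscale is not up on $vm" >&2; return 1; }
  # 3. Is nixos-rebuild on the remote PATH?
  ssh "$vm" command -v nixos-rebuild >/dev/null || { echo "nixos-rebuild missing on $vm" >&2; return 1; }
  echo "phase A ok"
}
```

Running `verify_vm` with no argument checks the default VM name; the first failing check is reported on stderr and the function returns non-zero.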

Phase B — author + apply the `cc-ci-orchestrator` repo

Create the private git repo; author the flake/config (§2); commit/push.
Place the master age key on the VM; sops-encrypt the out-of-band secrets into the repo.
nixos-rebuild switch on the VM → proxy service up, packages present, services defined (loop supervisor not yet started — or started in a dry mode).

Phase C — stage the workspace (no cutover yet)

On the VM: clone cc-ci-orchestrator (the loop code), clone the Builder/Adversary working repos fresh from git.autonomic.zone, clone cc-ci-secrets, rsync references/.
Copy .phase-idx (resume point = phase 2) so the VM watchdog resumes the right phase.
Operator step: claude auth login on the VM (device code) so the loops can run --remote-control --dangerously-skip-permissions. Verify with a throwaway interactive claude.

Phase D — cutover (the only disruptive moment; pick a clean point)

Quiesce the Pi: stop the Pi loops + watchdog (launch.sh stop); confirm both loops are at a safe point (no half-written commit; git status clean in both clones, last work pushed).
Start on the VM: enable + start the cc-ci-orchestrator systemd service → launch.sh start (RESUME_PHASE=1) brings up Builder + Adversary + watchdog on the VM, resuming phase 2 from the repo state. Verify all three sessions + a handoff + public health.
Decommission the Pi loops: disable the Pi's cc-ci-tailscaled and leave the workspace in place as a read-only fallback, with no loops running. (Keep the Pi as a cold standby for a few days before deleting anything.)
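The "safe point" test in the quiesce step can be made mechanical. The sketch below treats a clone as quiescent when its working tree is clean and it has no commits ahead of its upstream branch. The function name clone_is_quiescent is hypothetical, and the check compares against the last-known upstream ref without fetching.

```shell
# Sketch only: succeed if the clone at <dir> has no uncommitted changes
# and no local commits that its upstream branch lacks.
clone_is_quiescent() {
  local dir="$1"
  # Any output from --porcelain means modified, staged, or untracked files.
  [ -z "$(git -C "$dir" status --porcelain)" ] || return 1
  # Count commits on HEAD that the upstream branch does not have yet.
  # With no upstream configured the count is empty, so the check fails.
  [ "$(git -C "$dir" rev-list --count '@{u}..HEAD' 2>/dev/null)" = "0" ]
}
```

After `launch.sh stop`, something like `clone_is_quiescent /srv/cc-ci/cc-ci && clone_is_quiescent /srv/cc-ci/cc-ci-adv` would gate the start on the VM (paths assumed from the footprint in §0).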

Phase E — move the orchestrator session (operator-assisted)

On the VM, start the orchestrator session: claude --remote-control 'autonomous-orchestrator' --dangerously-skip-permissions in a tmux session, seeded with AGENTS.md + this plan so it picks up the supervising role. The operator connects to it (claude.ai/code) — this is the "move myself" step; a session can't transplant itself across machines, so it's a fresh orchestrator session on the VM with full context from the repo.
This Pi-side orchestrator session hands off (writes a short state note) and goes idle/ends.

4. Risks & mitigations

claude auth (human step): unavoidable device-code login on the VM. Mitigation: do it in Phase C, well before cutover; verify before quiescing the Pi.
Loops mid-work at cutover: pick a quiet point (between gate claims / after a push); the loops re-orient from git on restart anyway, so worst case is a re-run of an in-flight iteration.
Secrets sprawl: out-of-band secrets are copied once, then sops-managed in the new repo; never committed in plaintext (same discipline as cc-ci). The master age key is the sole bootstrap secret.
Self-move gap: between Pi-session-ends and VM-session-connected, there's no live orchestrator. The watchdog (now a boot service) keeps the loops alive independently, so this gap is safe.
Rollback: until the Pi workspace is deleted, reverting means stopping the VM service and running launch.sh start on the Pi again. Keep the Pi intact until the VM has run clean through at least one reboot + one gate handshake.
Reboot-resilience proof: before trusting the VM, reboot it once and confirm the loops + watchdog + proxy all come back via systemd (the whole point of the move).
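The reboot-resilience proof can be reduced to one check run after the VM comes back. The sketch below uses `systemctl is-active`; the function name units_all_active is hypothetical, and the unit names in the usage note are assumed from the service names in §2 rather than confirmed.

```shell
# Sketch only: succeed if every named systemd unit is active;
# report each inactive unit on stderr.
units_all_active() {
  local unit
  local failed=0
  for unit in "$@"; do
    if ! systemctl is-active --quiet "$unit"; then
      echo "not active: $unit" >&2
      failed=1
    fi
  done
  [ "$failed" -eq 0 ]
}
```

After a test reboot, `units_all_active cc-ci-tailscaled.service cc-ci-orchestrator.service` would confirm the proxy and loop supervisor returned. The tmux sessions themselves still need a separate look, since an active unit only shows that the supervisor started.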

5. Operator-assisted steps (the only things I can't fully do)

Provide a fresh TS_AUTH_KEY for the VM (or confirm reuse of the one in terraform-secrets).
claude auth login on the VM (device code).
Connect to the new orchestrator session on the VM at cutover (Phase E).

Everything else (VM create, repo author, NixOS config, secret migration, workspace staging, the loop cutover) I can drive.
