# Plan: migrate the orchestrator off the Pi onto a dedicated NixOS Incus VM

**Goal:** move everything that drives the cc-ci loops (the Builder/Adversary loops, the watchdog,
the SOCKS proxy, the orchestrator session itself) off the Raspberry Pi and onto a new, dedicated,
**reboot-resilient NixOS VM** on b1, declared in a new git repo **`cc-ci-orchestrator`**. Finish by
relocating this orchestrator session there too.

**Why:** the Pi has rebooted twice today, each time silently killing the tmux loops + watchdog
(they don't survive a reboot, and nothing auto-restarts them). A NixOS VM lets us declare the whole
rig (claude CLI, proxy, loop supervisor) as systemd services that come back on boot, turning a
reboot into a non-event. It also consolidates the orchestrator next to the infra it manages.

**Status:** DRAFT, awaiting operator go-ahead before any infra creation / cutover.

---

## 0. Current footprint (what has to move)

On the Pi (`raspberrypi`, aarch64), workspace `/srv/cc-ci` (itself the
`cc-ci-autonomous-orchestrator` git repo):

| Item | What | Move strategy |
|---|---|---|
| `cc-ci-plan/` | loop code: `launch.sh`, `plan*.md`, `prompts/`, `kickoff.md` | in git (this repo) → clone on VM |
| `cc-ci/`, `cc-ci-adv/` | Builder + Adversary working clones (~13M each) | **re-clone from git.autonomic.zone** on the VM (cleaner than copying) |
| `.cc-ci-logs/` | watchdog/loop logs + `.phase-idx` | copy `.phase-idx` (the resume point); logs start fresh |
| `cc-ci-secrets/` | sops-encrypted secrets repo | in git → clone |
| `references/` | recipe-maintainer corpus (read-only parity source) | clone/rsync from `/srv/recipe-maintainer` |
| **`.testenv`** | TS auth key, Gitea bot creds | **out-of-band copy** (gitignored, never in git) |
| **`~/.ssh/cc-ci-root-ed25519`** | root SSH key to cc-ci | **out-of-band copy** |
| **`.sops/master-age.txt`** | master recovery age key | **out-of-band copy** |
| **Incus mTLS certs** (`/srv/incus-terraform-nix-vm-creator/terraform-secrets/`) | `terraform.{crt,key}`, `vm_ssh_key` | **out-of-band copy**, so the VM can itself manage VMs |
| `cc-ci-tailscaled.service` | userspace SOCKS proxy :1055 | **re-declare as NixOS** (see §2) |
| **claude CLI + auth** | `~/.local/bin/claude` v2.1.154 + `~/.claude.json` | install on VM + **operator `claude auth login`** (§3, Phase C) |
| this orchestrator session | the supervising claude conversation | **operator-assisted cutover** (§5) |

Two hard human-in-the-loop steps, called out explicitly: **claude auth on the new VM** (device-code
login, can't be scripted) and the **final session cutover** (the operator connects to the new
orchestrator session). Everything else I can do.

## 1. Target VM spec

- **Host/API:** b1 Incus, `https://100.117.251.31:8443`, project `terraform-ci`, mTLS certs (have).
- **Name:** `cc-ci-orchestrator` (tailnet hostname too).
- **Resources:** **2 GB RAM, 2 vCPU, 30 GB disk** (with the dir backend a resize needs a reboot, so
  size the disk at create time and avoid growing it later). b1 has ample headroom (only
  cc-nix-test, at 8 GB, is running).
- **Image:** the existing imported NixOS base VM image (`incus-base-vm`), which already ships
  tailscale, openssh, git/jq/curl, flakes, cloud-init.
- **Tailnet:** joins via a fresh `TS_AUTH_KEY` (operator provides, or reuse the keyed approach in
  `terraform-secrets/.test.env`). MagicDNS name `cc-ci-orchestrator.taila4a0bf.ts.net`.
- **Bootstrap:** cloud-init writes the `cc-ci-orchestrator` flake config + `nixos-rebuild switch`.

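As a rough illustration of the spec above, the same VM could be expressed as a single `incus launch` invocation. This is only a sketch: the plan itself creates the VM through the Incus API, the remote name `b1` is a hypothetical CLI remote pointing at `https://100.117.251.31:8443`, and it assumes `incus-base-vm` is usable as an image alias on that remote. The cloud-init payload carrying the TS auth key is deliberately left out.

```shell
#!/bin/sh
# Sketch only: prints (and optionally runs) an `incus launch` command that
# matches the target VM spec. REMOTE and the image alias are assumptions.
set -eu

VM_NAME="cc-ci-orchestrator"
PROJECT="terraform-ci"
IMAGE="incus-base-vm"   # assumed to exist as an image alias on the remote
REMOTE="b1"             # hypothetical CLI remote name for the b1 Incus API

# Emit the argument list for `incus`; sizes use binary units (GiB).
launch_args() {
  echo "launch ${REMOTE}:${IMAGE} ${REMOTE}:${VM_NAME}" \
       "--vm --project ${PROJECT}" \
       "-c limits.cpu=2 -c limits.memory=2GiB" \
       "-d root,size=30GiB"
}

if [ "${APPLY:-0}" = "1" ]; then
  # Word splitting is intentional: no argument contains whitespace.
  # shellcheck disable=SC2046
  incus $(launch_args)
else
  # Dry run by default: show the command instead of executing it.
  echo "incus $(launch_args)"
fi
```

Run without `APPLY=1`, the script only prints the command, which makes it easy to review the sizing before anything is created on b1.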
## 2. The new `cc-ci-orchestrator` git repo (NixOS config)

A new **private** repo on `git.autonomic.zone/recipe-maintainers/cc-ci-orchestrator` (bot is org
admin). It is the NixOS config for this VM, the orchestrator's equivalent of what `cc-ci` is for the
test server. Contents:

- `flake.nix` + `hosts/cc-ci-orchestrator/configuration.nix`: the VM's NixOS config.
- **Packages:** `claude-code` (CLI), `git`, `tmux`, `python3`, `jq`, `openssh`, `nodejs` (claude
  runtime), `coreutils`, `nettools` (`nc` for the proxy ProxyCommand).
- **`services.cc-ci-tailscaled`**: the userspace tailscaled SOCKS proxy on :1055, as a NixOS
  systemd service (ported to NixOS from the Pi's `cc-ci-tailscaled.service`). This is the path to
  b1 + cc-ci.
- **`services.cc-ci-orchestrator`**: a systemd service that runs `launch.sh start` with
  `RESUME_PHASE=1` **on boot** (after the proxy + network are up), as the workspace user. **This is
  the reboot-resilience fix**: the loops + watchdog come back automatically after any reboot.
- **Secrets via sops-nix** (like cc-ci): the out-of-band secrets (`.testenv`, ssh key, incus certs)
  are sops-encrypted into the repo and decrypted at activation to their runtime paths. The **master
  age key** is the one irreducible out-of-band bootstrap secret, placed on the VM once.
- `~/.ssh/config` entry for `cc-ci` (root, ProxyCommand via :1055), declared in the config.
- **Excluded from git:** claude's own auth (`~/.claude.json`). That is per-user login state, set up
  once interactively (§3, Phase C), not committed.

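The "after the proxy is up" ordering for `services.cc-ci-orchestrator` could be enforced inside the service's start script as well as through systemd ordering. The sketch below is one hedged way to do that, not the repo's actual script: the `WORKSPACE` path assumes the VM mirrors the Pi layout, and using `nc -z` as the port probe assumes a netcat build that supports `-z`.

```shell
#!/bin/sh
# Sketch only: wait for the SOCKS proxy, then start the loops in resume mode.
# WORKSPACE and the `nc -z` probe are assumptions, not taken from the repo.
set -eu

WORKSPACE="${WORKSPACE:-/srv/cc-ci}"   # assumed: same layout as on the Pi
PROXY_HOST="127.0.0.1"
PROXY_PORT="1055"

# wait_for <attempts> <command...>: retry the command once per second until
# it succeeds; return 1 after <attempts> failed tries.
wait_for() {
  attempts="$1"
  shift
  while ! "$@" >/dev/null 2>&1; do
    attempts=$((attempts - 1))
    if [ "$attempts" -le 0 ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Only act when explicitly asked, so loading this file has no side effects.
if [ "${1:-}" = "start" ]; then
  if ! wait_for 60 nc -z "$PROXY_HOST" "$PROXY_PORT"; then
    echo "SOCKS proxy on :$PROXY_PORT did not come up" >&2
    exit 1
  fi
  cd "$WORKSPACE/cc-ci-plan"
  RESUME_PHASE=1 exec ./launch.sh start
fi
```

Failing loudly when the proxy never appears lets systemd record the unit as failed instead of starting loops that cannot reach b1 or cc-ci.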
## 3. Execution phases

### Phase A: provision the VM (reversible; safe to do while Pi loops keep running)
1. Create `cc-ci-orchestrator` VM via the Incus API (2 GB / 2 vCPU / 30 GB, NixOS base image, TS auth
   key in cloud-init). Wait for tailnet join + ssh.
2. Verify: `ssh` in, `tailscale status`, `nixos-rebuild` available, can reach b1 API + cc-ci through
   its own proxy once configured.

### Phase B: author + apply the `cc-ci-orchestrator` repo
3. Create the private git repo; author the flake/config (§2); commit/push.
4. Place the master age key on the VM; sops-encrypt the out-of-band secrets into the repo.
5. `nixos-rebuild switch` on the VM → proxy service up, packages present, services defined (loop
   supervisor **not yet started**, or started in a dry mode).

### Phase C: stage the workspace (no cutover yet)
6. On the VM: clone `cc-ci-autonomous-orchestrator` (the loop code), clone the Builder/Adversary
   working repos fresh from git.autonomic.zone, clone `cc-ci-secrets`, rsync `references/`.
7. Copy `.phase-idx` (resume point = phase 2) so the VM watchdog resumes the right phase.
8. **Operator step:** `claude auth login` on the VM (device code) so the loops can run
   `--remote-control --dangerously-skip-permissions`. Verify with a throwaway interactive claude.

### Phase D: cutover (the only disruptive moment; pick a clean point)
9. **Quiesce the Pi:** stop the Pi loops + watchdog (`launch.sh stop`); confirm both loops are at a
   safe point (no half-written commit; `git status` clean in both clones, last work pushed).
10. **Start on the VM:** enable + start the `cc-ci-orchestrator` systemd service → `launch.sh start`
    (RESUME_PHASE=1) brings up Builder + Adversary + watchdog on the VM, resuming phase 2 from the
    repo state. Verify all three sessions + a handoff + public health.
11. **Decommission the Pi loops:** disable the Pi's `cc-ci-tailscaled` and leave the workspace in
    place (read-only fallback) with no loops running. (Keep the Pi as a cold standby for a few days
    before deleting anything.)

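The "safe point" check in step 9 can be made mechanical. The sketch below is an assumption about how one might script it, not part of `launch.sh`: it treats a clone as quiesced when `git status --porcelain` is empty and the current branch has no commits ahead of its upstream. It compares against the local remote-tracking ref and does not fetch.

```shell
#!/bin/sh
# Sketch only: report whether a working clone is clean and fully pushed.
set -eu

# clone_is_quiesced <dir>: returns 0 if the worktree is clean and nothing is
# waiting to be pushed, 1 otherwise (with a reason on stderr).
clone_is_quiesced() {
  dir="$1"
  # Any uncommitted or untracked change counts as "not clean".
  if [ -n "$(git -C "$dir" status --porcelain)" ]; then
    echo "$dir: working tree not clean" >&2
    return 1
  fi
  # Count commits on HEAD that the upstream branch does not have yet.
  if ! ahead="$(git -C "$dir" rev-list --count '@{upstream}..HEAD' 2>/dev/null)"; then
    echo "$dir: no upstream configured" >&2
    return 1
  fi
  if [ "$ahead" -ne 0 ]; then
    echo "$dir: $ahead unpushed commit(s)" >&2
    return 1
  fi
  return 0
}

# Check every directory given on the command line; do nothing without args.
if [ "$#" -gt 0 ]; then
  for d in "$@"; do
    clone_is_quiesced "$d" || exit 1
  done
  echo "all clones quiesced"
fi
```

On the Pi this would be called with the two working clones, for example `/srv/cc-ci/cc-ci` and `/srv/cc-ci/cc-ci-adv`, after `launch.sh stop` and before starting the service on the VM.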
### Phase E: move the orchestrator session (operator-assisted)
12. On the VM, start the orchestrator session:
    `claude --remote-control 'autonomous-orchestrator' --dangerously-skip-permissions` in a tmux
    session, seeded with AGENTS.md + this plan so it picks up the supervising role. The **operator
    connects** to it (claude.ai/code). This is the "move myself" step; a session can't transplant
    itself across machines, so it's a fresh orchestrator session on the VM with full context from
    the repo.
13. This Pi-side orchestrator session hands off (writes a short state note) and goes idle/ends.

## 4. Risks & mitigations
- **claude auth (human step):** unavoidable device-code login on the VM. Mitigation: do it in Phase
  C, well before cutover; verify before quiescing the Pi.
- **Loops mid-work at cutover:** pick a quiet point (between gate claims / after a push); the loops
  re-orient from git on restart anyway, so worst case is a re-run of an in-flight iteration.
- **Secrets sprawl:** out-of-band secrets are copied once, then sops-managed in the new repo; never
  committed in plaintext (same discipline as cc-ci). The master age key is the sole bootstrap secret.
- **Self-move gap:** between Pi-session-ends and VM-session-connected, there's no live orchestrator.
  The watchdog (now a boot service) keeps the loops alive independently, so this gap is safe.
- **Rollback:** until the Pi workspace is deleted, reverting = stop VM service, `launch.sh start` on
  the Pi again. Keep the Pi intact until the VM has run clean through at least one reboot + one gate
  handshake.
- **Reboot-resilience proof:** before trusting the VM, reboot it once and confirm the loops +
  watchdog + proxy all come back via systemd (the whole point of the move).

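The reboot-resilience proof in the last bullet can be sketched as a two-step check: save the kernel boot id before rebooting, then afterwards confirm that the id changed and that the units are active. The unit names below are assumed to match the `services.*` names from §2, and `STATE_FILE` is a hypothetical path. An active unit does not by itself prove the tmux loops are healthy, so this complements a manual look at the sessions.

```shell
#!/bin/sh
# Sketch only: prove a real reboot happened and the units came back.
# Unit names and STATE_FILE are assumptions for illustration.
set -eu

UNITS="cc-ci-tailscaled.service cc-ci-orchestrator.service"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
STATE_FILE="${STATE_FILE:-/var/tmp/cc-ci-boot-id.before}"

# rebooted_since <saved_boot_id> <current_boot_id>: true if the ids differ.
rebooted_since() {
  [ -n "$1" ] && [ "$1" != "$2" ]
}

# Print the name of every unit that is not currently active.
inactive_units() {
  for unit in $UNITS; do
    if ! "$SYSTEMCTL" is-active --quiet "$unit"; then
      echo "$unit"
    fi
  done
}

case "${1:-}" in
  save)
    # Run before the reboot.
    cat /proc/sys/kernel/random/boot_id > "$STATE_FILE"
    ;;
  verify)
    # Run after the reboot.
    now="$(cat /proc/sys/kernel/random/boot_id)"
    rebooted_since "$(cat "$STATE_FILE")" "$now" || { echo "no reboot detected" >&2; exit 1; }
    down="$(inactive_units)"
    [ -z "$down" ] || { echo "not active: $down" >&2; exit 1; }
    echo "reboot survived"
    ;;
esac
```

Usage would be `save` before the reboot and `verify` once the VM is reachable again; a non-zero exit from `verify` means the VM is not yet safe to trust.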
## 5. Operator-assisted steps (the only things I can't fully do)
1. Provide a fresh `TS_AUTH_KEY` for the VM (or confirm reuse of the one in `terraform-secrets`).
2. `claude auth login` on the VM (device code).
3. Connect to the new orchestrator session on the VM at cutover (Phase E).

Everything else (VM create, repo author, NixOS config, secret migration, workspace staging, the
loop cutover) I can drive.