
autonomic-bot 239dfd8e26 Watchdog handoff signalling: ping the waiting loop on gate-claim / verdict (kill double-idle)

launch.sh watchdog now runs a fast (~30s) handoff_check alongside the heavy (300s) restart/DONE
check: when the Builder writes a CLAIMED gate it pings the Adversary to verify now; when the
Adversary updates REVIEW.md it pings the Builder to proceed (edge-triggered, reads local clones).
So a pending handoff resolves in <~30s instead of a whole idle interval. Pacing revised: the
Adversary may idle freely when nothing's pending (no pointless re-verify/busy-poll) and is woken
by the watchdog; Builder waits on the ping + a fallback ~2-4m self-poll. kickoff documents the
new "handoff signalling" role.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-27 06:15:25 +01:00


cc-ci — Kickoff & Launch

Everything needed to start the autonomous cc-ci build loop. The substance lives in plan.md; this file explains how to launch and supervise the two agents.

Folder contents

cc-ci-plan/
├── plan.md             # THE plan — single source of truth (read this in full)
├── brief.md            # original one-page brief (context only; superseded by plan.md)
├── kickoff.md          # this file — how to launch & supervise
├── launch.sh           # starts both loops + watchdog, stops on ## DONE
└── prompts/
    ├── builder.md      # Builder loop prompt (fed to claude by launch.sh)
    └── adversary.md    # Adversary loop prompt

Note: /srv/cc-ci/cc-ci-plan/ (this folder) is the planning + launch material. The actual CI project — NixOS config, runner, tests — lives in a separate git repo the Builder creates at git.autonomic.zone/recipe-maintainers/cc-ci, cloned to /srv/cc-ci/cc-ci (Builder) and /srv/cc-ci/cc-ci-adv (Adversary). Don't confuse the two.

Model: two independent loops (plan §6 / §6.1)

Builder — builds the CI server; owns code + STATUS.md/JOURNAL.md/DECISIONS.md + the ## Build backlog section of BACKLOG.md.
Adversary — independently disbelieves and re-verifies; owns REVIEW.md + ## Adversary findings. Holds veto over ## DONE.

They run as two separate processes and coordinate only through the git repo. Single-writer file ownership keeps concurrent pushes merge-clean.

Three layers of "looping", and why you want all three

Concern	Mechanism	Who provides it
Iteration — keep doing one unit of work, then wake again	`/loop` self-paced (ScheduleWakeup), per plan §7 pacing	each agent, in-session
Resilience — restart a loop whose process/sandbox died; stop all on `## DONE`	`launch.sh` watchdog (tmux + git poll)	this script
Handoff signalling — wake the waiting loop the moment its counterpart hands off	watchdog `handoff_check` (~30 s): Builder writes a `CLAIMED` gate → ping Adversary to verify; Adversary updates `REVIEW.md` → ping Builder to proceed	this script

/loop alone is bound to its process: if the sandbox restarts, that loop is gone until something relaunches it. The watchdog is that something. It also closes the double-idle gap: instead of a pending gate/verdict sitting until the other loop's next scheduled wake, the watchdog pings the waiting loop within ~30 s — so the Adversary can idle freely when nothing's pending (no busy-polling or pointless re-verifying) yet still start verifying right after the Builder parks at a gate. Use all three.
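The edge-triggered behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual launch.sh code: the helper names changed_since_last, ping_session and handoff_check_once, the STATE_DIR variable, the use of tmux send-keys as the ping, and REVIEW.md sitting at the root of the Adversary clone are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an edge-triggered handoff check (not the real launch.sh).

# Assumed scratch directory holding the last-seen fingerprint of each watched file.
STATE_DIR="${STATE_DIR:-${TMPDIR:-/tmp}/cc-ci-handoff-state}"

# changed_since_last NAME FILE
# Returns 0 only when FILE's content differs from the previous call for NAME.
# The first call only records a baseline and returns 1, so nothing is pinged
# before a change has actually been observed.
changed_since_last() {
  local name="$1" file="$2" now prev
  mkdir -p "$STATE_DIR"
  if [ -f "$file" ]; then now="$(cksum < "$file")"; else now="missing"; fi
  if [ ! -f "$STATE_DIR/$name" ]; then
    printf '%s\n' "$now" > "$STATE_DIR/$name"
    return 1
  fi
  prev="$(cat "$STATE_DIR/$name")"
  printf '%s\n' "$now" > "$STATE_DIR/$name"
  [ "$now" != "$prev" ]
}

# Assumed ping mechanism: type a message into the waiting loop's tmux session.
ping_session() {
  tmux send-keys -t "$1" "$2" Enter
}

# One pass of the check; a watchdog would call this roughly every 30 s.
handoff_check_once() {
  if changed_since_last review /srv/cc-ci/cc-ci-adv/REVIEW.md; then
    ping_session cc-ci-builder "REVIEW.md updated: proceed"
  fi
}
```

Because the check fires on a change of state rather than on the state itself, an unchanged REVIEW.md never re-pings the Builder, however often the check runs.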

Launch

cd /srv/cc-ci/cc-ci-plan

# Optional but recommended once the repo exists, so the watchdog can detect ## DONE:
export CC_CI_REPO=https://git.autonomic.zone/recipe-maintainers/cc-ci.git

./launch.sh start        # starts cc-ci-builder + cc-ci-adv + cc-ci-watchdog (tmux sessions)
./launch.sh status       # session + DONE state
./launch.sh logs builder # tail a loop;  also: logs adversary | logs watchdog
tmux attach -t cc-ci-builder   # watch a loop live locally (detach: Ctrl-b d)
./launch.sh stop         # stop everything

launch.sh is idempotent: re-running start won't duplicate a live session. Each agent runs as an interactive claude in tmux, with the kickoff prompt passed as a positional arg rather than piped (piping forces print mode and breaks /loop). With REMOTE_CONTROL=1 (default) each agent is launched with --remote-control, so you can watch and steer both loops from claude.ai/code (or the Claude mobile app), not just via tmux attach. The box must be logged into the claude.ai account (check with claude auth status); set REMOTE_CONTROL=0 to skip the remote surface. The watchdog (default every 300s) restarts any dead session and, once STATUS.md shows ## DONE, kills the loops and exits. Note that a network outage longer than ~10 minutes will exit the claude process; the watchdog then brings it back as a fresh remote-control session.
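An idempotent start of this kind is commonly built from tmux has-session followed by tmux new-session. The sketch below shows that pattern only; ensure_session is a hypothetical helper, not a function known to exist in launch.sh, and the session name and command in the example are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating an idempotent tmux start (not the real launch.sh).

# ensure_session NAME COMMAND
# Starts COMMAND in a detached tmux session called NAME unless one already exists.
# The "=" prefix asks tmux for an exact session-name match instead of a prefix match.
ensure_session() {
  local name="$1" cmd="$2"
  if tmux has-session -t "=$name" 2>/dev/null; then
    echo "$name: already running"
  else
    tmux new-session -d -s "$name" "$cmd"
    echo "$name: started"
  fi
}

# Example (placeholder session and command):
#   ensure_session cc-ci-demo "sleep 600"   # first call starts the session
#   ensure_session cc-ci-demo "sleep 600"   # second call finds it and does nothing
```

The exact-match prefix matters when session names share a stem, since a plain -t target may also match by prefix.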

Prerequisites the sessions inherit from your shell: SSH (root) to cc-ci via the Tailscale proxy (§1.5), Gitea bot creds, and git.autonomic.zone access. Plus preconfigured operator inputs the loop depends on (plan §4.0/§4.4): the wildcard *.ci.commoninternet.net DNS record pointing at a gateway that TLS-passthroughs to cc-ci, and the pre-issued wildcard cert at /var/lib/ci-certs/live/ on cc-ci. The operator owns the DNS record + gateway + cert issuance/renewal; the agent builds Traefik (file provider → that cert) + routing on cc-ci and does no ACME. If any prerequisite is absent, the Builder parks at STATUS.md ## Blocked (plan §1/§9) rather than improvise.

Host deps: launch.sh needs tmux (and claude) — tmux is installed on this sandbox host (3.5a). On a fresh host: sudo apt-get install -y tmux. The script's *_DIR defaults now point at /srv/cc-ci/... (Builder clone /srv/cc-ci/cc-ci, Adversary /srv/cc-ci/cc-ci-adv); override the *_DIR env vars only if your layout differs.
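A quick way to confirm the host deps before launching is a small preflight check. The sketch below is an assumption about how one might do it; missing_deps and preflight are hypothetical names, and launch.sh may or may not perform a similar check itself.

```shell
#!/usr/bin/env bash
# Hypothetical preflight check for the host deps named above.

# missing_deps CMD...
# Prints each command that is not found on PATH, one per line.
missing_deps() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# preflight: returns non-zero and reports on stderr if tmux or claude is missing.
preflight() {
  local missing
  missing="$(missing_deps tmux claude)"
  if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are joined onto one line.
    echo "missing host deps:" $missing >&2
    return 1
  fi
}
```

Running preflight && ./launch.sh start would then refuse to launch on a host that lacks either tool.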

Optional: a cloud-side `/schedule` watchdog

launch.sh's watchdog is itself a local process — if the whole host goes down it stops too. For belt-and-suspenders durability, also create a /schedule routine (a remote agent that fires on a cron and re-orients from the repo). From inside a Claude session:

/schedule every 2 hours: read /srv/cc-ci/cc-ci-plan/plan.md §7 and the cc-ci repo STATUS.md; if the
Builder/Adversary loops are not making progress (or launch.sh is not running), restart them via
/srv/cc-ci/cc-ci-plan/launch.sh start; stop when STATUS.md says ## DONE.

This complements the local watchdog: scheduled runs are fresh, independent agents, so they survive process/context death that would take the in-session /loop and the local watchdog with it.

Fallback: restart/recreate the cc-ci VM (orchestrator only)

This is primarily an escape hatch for you, the supervising orchestrator. The loops normally reconfigure cc-ci only from inside (via Nix); power-cycling or recreating the VM shouldn't be their default move — but it's not forbidden if one gets genuinely stuck. Reach for this when cc-ci itself is wedged at a level that can't be fixed from inside (won't boot, disk full, swarm/Docker corrupted, unreachable even after a proxy restart): use the Incus skill to power-cycle or rebuild the VM, then re-bootstrap.

cc-nix-test (the cc-ci server, tailnet 100.90.116.4) is a NixOS Incus VM on host b1 (100.117.251.31:8443, Incus project terraform-ci). Skill + Terraform live at /srv/incus-terraform-nix-vm-creator/ (skills/incus-terraform/SKILL.md); read that for full usage.

Access: b1 is on the same cc-ci tailnet, so reach the Incus API through the existing cc-ci-tailscaled SOCKS proxy (127.0.0.1:1055) with the mTLS certs in that repo's terraform-secrets/ — no second tailscaled needed. Quick check:

CRT=/srv/incus-terraform-nix-vm-creator/terraform-secrets/terraform.crt
KEY=/srv/incus-terraform-nix-vm-creator/terraform-secrets/terraform.key
# Quote the URL so the shell does not treat "?" as a glob character.
curl --proxy socks5h://localhost:1055 --cert "$CRT" --key "$KEY" -k -s \
  "https://100.117.251.31:8443/1.0/instances/cc-nix-test/state?project=terraform-ci"

Soft restart (keeps the disk; preferred): PUT .../1.0/instances/cc-nix-test/state?project=terraform-ci with {"action":"restart"} (or "stop" / "start").
Full recreate (last resort): the Terraform module in /srv/incus-terraform-nix-vm-creator/projects/ (terraform apply with -var incus_remote_address=100.117.251.31 -var incus_project=terraform-ci -var ts_auth_key=$TSKEY). ⚠ Recreating wipes the VM disk — you must then re-apply the cc-ci preconditions: the pre-issued TLS cert into /var/lib/ci-certs/live/ and the cc-ci-root-ed25519 pubkey into root's authorized_keys (see the access notes), and the loops re-run §1 Bootstrap. Prefer a soft restart; only recreate if the VM is truly unrecoverable.

(Project cap: keep total RAM across terraform-ci instances under 10 GB — check before recreating.)
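The soft restart can be wrapped in a small helper that reuses the proxy, certs, and endpoint from the quick check. This is a sketch under stated assumptions, not a tested tool: state_url, state_payload, set_state and the METHOD variable are hypothetical names. METHOD defaults to PUT, the verb the Incus REST API reference lists for changing instance state; verify that against your Incus version and override it if needed.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the instance-state endpoint (a sketch only).

INCUS_URL="https://100.117.251.31:8443"
SECRETS=/srv/incus-terraform-nix-vm-creator/terraform-secrets

# state_url INSTANCE PROJECT: build the instance-state URL.
state_url() {
  printf '%s/1.0/instances/%s/state?project=%s\n' "$INCUS_URL" "$1" "$2"
}

# state_payload ACTION: JSON body; only restart, stop and start are accepted here.
state_payload() {
  case "$1" in
    restart|stop|start) printf '{"action":"%s"}\n' "$1" ;;
    *) echo "unsupported action: $1" >&2; return 1 ;;
  esac
}

# set_state ACTION: send the request through the SOCKS proxy with the mTLS client cert.
# -k skips server-certificate verification, as in the quick check.
set_state() {
  local body
  body="$(state_payload "$1")" || return 1
  curl --proxy socks5h://localhost:1055 \
       --cert "$SECRETS/terraform.crt" --key "$SECRETS/terraform.key" -k -s \
       -X "${METHOD:-PUT}" -H 'Content-Type: application/json' \
       -d "$body" "$(state_url cc-nix-test terraform-ci)"
}
```

With these definitions loaded, set_state restart performs the soft restart; rejecting unknown actions up front avoids sending a malformed request to the API.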

Manual launch (no script)

If you'd rather not use launch.sh, start each agent interactively yourself (same result, no supervision/restart), passing the prompt as a positional argument so the session stays interactive and remote-controllable:

claude --remote-control 'cc-ci-builder' --dangerously-skip-permissions "$(cat prompts/builder.md)"
claude --remote-control 'cc-ci-adv'     --dangerously-skip-permissions "$(cat prompts/adversary.md)"

Do not pipe the prompt (cat prompts/builder.md | claude …) — that forces print/headless mode, which breaks /loop and remote control.
