refactor: rewrite launchers as Python; add orchestrator JOURNAL.md
Bash scripts are now one-liner wrappers: exec python3 <script>.py "$@"
All logic lives in the Python scripts (pure stdlib, no deps).

launch.py — loops + watchdog:
  Full port of launch.sh: phase sequencing, start/stop/status/logs/watchdog,
  handoff signalling, stall detection, heal_session, heal_orchestrator.
  Cleaner structure: config block → helpers → phase/kickoff/agent/healing/handoff/watchdog/main.
  LOOP_BACKEND + LOOP_MODEL switches throughout.

launch-orchestrator.py — orchestrator session:
  claude path: --resume <id> preserved (conversation survives reboots).
  opencode path: run --attach --title (no --resume; STARTUP_PROMPT orients
  the new session; reads JOURNAL.md for context).
  STARTUP_PROMPT updated to reference JOURNAL.md on startup.

launch-upgrader.py — one-shot upgrade job:
  LOOP_BACKEND / LOOP_MODEL take precedence over UPGRADER_BACKEND / UPGRADER_MODEL.
  Both claude and opencode paths supported.

cc-ci-plan/JOURNAL.md — new orchestrator handoff file:
  Persistent across conversation resets. Documents the handoff format and
  carries the current session's summary: migration complete, phase 5 in
  progress (V3/V7 PASS), phase 4 deferred, open items for next session.

AGENTS.md: step 1 on startup = read JOURNAL.md; step 5 = append on handoff.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
20
AGENTS.md
@@ -16,18 +16,18 @@ project (NixOS config, test runner, recipe tests) lives in a **separate** repo t
 The two loops coordinate **only** through the cc-ci git repo (see `plan.md` §6.1). The orchestrator
 watches from outside.
 
-## On startup: announce yourself + report reboots
+## On startup: read the journal, announce yourself, report reboots
 
-**Every time you (the orchestrator) start or resume, send a `PushNotification`** that you are online —
-the operator wants to know the supervising session is back (especially after a reboot, which kills
-this session). Include the current phase and the reboot count. Steps on startup:
-1. Read `cc-ci-plan/REBOOTS.md` (count the `## Reboots` entries) and `cc-ci-plan/launch.sh status`
-   (current phase + whether the loops/watchdog are running).
-2. `PushNotification` (proactive), e.g.: *"cc-ci orchestrator online — phase 2, loops+watchdog
+**Every time you (the orchestrator) start or resume:**
+1. **Read `cc-ci-plan/JOURNAL.md`** — the most recent `## Session` entry is where the previous
+   session left off. This is the persistent handoff record; read it before anything else.
+2. Read `cc-ci-plan/REBOOTS.md` (count entries) and run `cc-ci-plan/launch.sh status`
+   (current phase + whether loops/watchdog are running).
+3. **`PushNotification`** (proactive): *"cc-ci orchestrator online — phase X, loops+watchdog
    running; N reboots logged (last <date>)."*
-3. If a reboot happened while you were away (a new line in REBOOTS.md since you last looked, or the
-   loops are down), check that `cc-ci-loops.service` brought the loops back; if not, relaunch with
-   `RESUME_PHASE=1 cc-ci-plan/launch.sh start`.
+4. If loops are down, relaunch: `RESUME_PHASE=1 cc-ci-plan/launch.sh start`.
+5. **On handoff / end of session:** append a `## Session` block to `JOURNAL.md` summarising
+   what happened, current state, and open items (see format in that file).
 
 Reboot resilience is handled by **`cc-ci-loops.service`** (system unit): on boot it logs the reboot
 to `REBOOTS.md` (boot_id-gated) and runs `launch.sh start` with `RESUME_PHASE=1`, so the loops +
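Review note: startup step 2 (count reboot entries, then report them in the notification) could be sketched as below. The layout of `REBOOTS.md` is not shown in this commit, so the sketch assumes one `- ` list item per reboot under a `## Reboots` heading; `count_reboots`, `online_message`, and the exact message wording are hypothetical names for illustration, not part of the repo.

```python
def count_reboots(reboots_md: str) -> int:
    """Count list items under the '## Reboots' heading (assumed file layout)."""
    in_section = False
    count = 0
    for line in reboots_md.splitlines():
        if line.startswith("## "):
            # Entering or leaving the section we care about.
            in_section = line.strip() == "## Reboots"
        elif in_section and line.lstrip().startswith("- "):
            count += 1
    return count


def online_message(phase: str, loops_running: bool, reboots: int, last: str) -> str:
    """Build an illustrative 'orchestrator online' notification text."""
    state = "running" if loops_running else "DOWN"
    return (f"cc-ci orchestrator online: phase {phase}, loops+watchdog {state}; "
            f"{reboots} reboots logged (last {last}).")
```

Keeping the count in a pure function makes it easy to check against a fixture file before wiring it into the real startup routine.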
82
cc-ci-plan/JOURNAL.md
Normal file
@@ -0,0 +1,82 @@
+# Orchestrator journal
+
+This file is the **persistent handoff record** for the cc-ci orchestrator. Every orchestrator
+session (whether Claude or opencode) reads this on startup and appends to it when handing off or
+when something noteworthy happens. It survives conversation resets — it is the memory that
+`--resume` can't provide for opencode, and a more readable supplement for Claude sessions.
+
+**On startup:** read this file before doing anything else. The most recent `## Session` entry
+is where the previous session left off. Carry that context forward.
+
+**On handoff / end of session:** append a `## Session` block (see format below) summarising
+what happened, the current state, and anything the next session needs to know.
+
+**On significant events mid-session:** append a `### Event` sub-entry (no need to wait for
+handoff).
+
+---
+
+## Format
+
+```markdown
+## Session YYYY-MM-DD HH:MM UTC — <backend> <model>
+**Left off:** <one sentence — what was the last thing done>
+**Phase / loop state:** <phase X [N/11], loops RUNNING/stopped, cc-ci healthy/issue>
+**Open items:** <bullet list of anything the next session needs to act on, or "none">
+**Notes:** <anything surprising, a decision made, a known blocker, etc.>
+
+### Event HH:MM — <short label>
+<brief note>
+```
+
+---
+
+## Session 2026-05-31 ~04:00 UTC — Claude Sonnet 4.6
+
+**Left off:** Completed the orchestrator → Hetzner migration (cpx22, server 134487234, public
+`168.119.126.100`, tailnet `cc-ci-orchestrator-1` @ `100.84.190.30`). The old Incus VM
+(`100.116.55.106`) is still on the tailnet — cold standby, not yet deleted.
+
+**Phase / loop state:** Phases 1c–1e, 2w, 2pc, 2, 2b, 3 all DONE. Phase 5 [11/11]
+(upgrade-flow verify) in progress — loops running, actively verifying the `!testme`
+end-to-end flow on the new Hetzner cc-ci server.
+
+**Open items:**
+- Phase 5 is in progress — loops need to finish V1–V9 and write `## DONE` to STATUS-5.md.
+- Phase 4 (final review/polish) was deliberately **skipped** this session — it is queued
+  at idx 9 in PHASE_IDX_FILE. Resume it after the weekly Opus credits reset.
+- Phase 6 (reconcile-only over all 18 recipe mirrors) and Phase 7 (full upgrade on n8n +
+  ghost + matrix-synapse) are planned but not yet started — run them after Phase 5 DONE.
+- Old Incus orchestrator VM (`cc-ci-orchestrator`, `100.116.55.106`) is still running —
+  stop it via the b1 Incus API once happy with the Hetzner box. mTLS certs at
+  `/srv/incus-terraform-nix-vm-creator/terraform-secrets/`.
+- DNS: `oc.commoninternet.net` A record → `100.84.190.30` still needs adding (operator step).
+
+**Notes:**
+- `cc-ci-loops.service` is **enabled** and wired with `reboot-log.sh` ExecStartPre — a reboot
+  is a non-event; loops + watchdog auto-resume via RESUME_PHASE=1.
+- The cc-ci **server** also moved to Hetzner (server 134485294, `ssh cc-ci` →
+  `100.95.31.88`). It has authenticated Docker Hub pulls and 150 GB disk — the old OOM /
+  disk-starvation / rate-limit issues are gone.
+- All recipe mirrors currently reconcile correctly; no stale open PRs observed.
+- `opencode` v1.15.13 installed at `/home/loops/.local/bin/opencode`. Tinfoil API key is in
+  `.testenv` as `TINFOIL_API_KEY`. Backend switch: `LOOP_BACKEND=opencode
+  LOOP_MODEL=tinfoil/deepseek-v4-pro RESUME_PHASE=1 cc-ci-plan/launch.sh start`.
+- Launcher scripts rewritten to Python (`launch.py`, `launch-orchestrator.py`,
+  `launch-upgrader.py`); bash wrappers are now one-liners that `exec python3 <script> "$@"`.
+
+### Event 03:13 — migrated from old Incus VM to Hetzner
+Loops were started manually during staging (not by the service); first systemd-managed
+boot was later this session. `cc-ci-loops.service` now enabled.
+
+### Event 05:23 — phase 3 (results-UX) completed
+All R1–R8 Adversary-verified, no VETO. Watchdog auto-advanced to phase 4.
+
+### Event 13:22 — phase 4 paused, jumped to phase 5
+Operator deferred phase 4 (weekly Opus credits exhausted). Phase idx manually set to 10
+(phase 5). Loops restarted on Sonnet.
+
+### Event 17:29 — loops stopped pending restart on different model
+Operator paused loops to reconfigure backend (opencode/tinfoil exploration). Phase 5
+[11/11] was in progress — loops had verified V1/V2/V3/V7 (custom-html-tiny upgrade GREEN).
+Phase idx = 10 (phase 5), loops stopped, watchdog stopped.
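Review note: "the most recent `## Session` entry is where the previous session left off" can be made concrete with a small parser. This is a minimal sketch, not code from the repo: `latest_session` and `field` are hypothetical helpers, and the sketch assumes that session headings always start a line with `## Session ` and that the only other such heading is the template inside the fenced Format block, which it skips.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example fence-safe


def latest_session(journal_md: str) -> str:
    """Return the last real '## Session' block, ignoring headings inside code fences."""
    lines = journal_md.splitlines()
    in_fence = False
    start = None
    for i, line in enumerate(lines):
        if line.startswith(FENCE):
            in_fence = not in_fence
        elif not in_fence and line.startswith("## Session "):
            start = i  # keep overwriting so the last heading wins
    return "\n".join(lines[start:]).strip() if start is not None else ""


def field(block: str, name: str) -> str:
    """Return the first line of a '**Name:**' field, or '' if it is absent."""
    m = re.search(rf"^\*\*{re.escape(name)}:\*\*[ \t]*(.*)$", block, re.MULTILINE)
    return m.group(1).strip() if m else ""
```

Skipping fenced lines matters here because the journal documents its own format with a `## Session YYYY-MM-DD ...` template; a naive "last heading" search would return that template for a journal that has no real entries yet.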
189
cc-ci-plan/launch-orchestrator.py
Normal file
@@ -0,0 +1,189 @@
+#!/usr/bin/env python3
+"""
+cc-ci orchestrator launcher — start/resume the orchestrator session in tmux.
+
+The orchestrator is the long-lived supervisory session: it watches the Builder/Adversary
+loops, reads their logs/STATUS, edits the plan/prompts, restarts stuck loops, and owns
+the VM-level fallback. It is SEPARATE from the loops that launch.py manages.
+
+Usage:
+  launch-orchestrator.py start    resume the persistent session (default)
+  launch-orchestrator.py fresh    start a NEW session (no --resume)
+  launch-orchestrator.py stop     kill the tmux session (conversation persists on disk)
+  launch-orchestrator.py status   show session state
+  launch-orchestrator.py attach   tmux attach to the session
+
+Env:
+  LOOP_BACKEND    claude (default) | opencode
+  LOOP_MODEL      model flag, e.g. "sonnet" or "tinfoil/deepseek-v4-pro"
+
+  claude backend:
+    CLAUDE_BIN           claude
+    REMOTE_CONTROL       1 (viewable at claude.ai/code)
+    ORCH_SESSION_ID      override the resume id (else read from $ID_FILE)
+    ORCH_ID_FILE         $LOG_DIR/.orchestrator-session-id
+    ORCH_STARTUP_PROMPT  startup nudge injected as the first turn after --resume
+
+  opencode backend:
+    OPENCODE_BIN         /home/loops/.local/bin/opencode
+    OPENCODE_SERVER      http://127.0.0.1:4096
+    (no --resume equivalent; STARTUP_PROMPT is sent as the initial message;
+    the session title in the web UI is the SESSION name)
+"""
+
+import os, sys, subprocess
+from datetime import datetime
+from pathlib import Path
+
+# ── config ────────────────────────────────────────────────────────────────────
+
+SESSION = os.environ.get("ORCH_SESSION", "cc-ci-orchestrator-vm")
+WORKDIR = os.environ.get("ORCH_DIR", "/srv/cc-ci")
+LOG_DIR = os.environ.get("LOG_DIR", "/srv/cc-ci/.cc-ci-logs")
+
+BACKEND = os.environ.get("LOOP_BACKEND", "claude")
+LOOP_MODEL = os.environ.get("LOOP_MODEL", "")
+
+# claude-specific
+CLAUDE_BIN = os.environ.get("CLAUDE_BIN", "claude")
+CLAUDE_FLAGS = os.environ.get("CLAUDE_FLAGS", "--dangerously-skip-permissions")
+REMOTE_CONTROL = os.environ.get("REMOTE_CONTROL", "1") == "1"
+DEFAULT_ID = "c746050a-af11-409d-87ba-c05268e2e5d1"
+ID_FILE = os.environ.get("ORCH_ID_FILE", f"{LOG_DIR}/.orchestrator-session-id")
+STARTUP_PROMPT = os.environ.get("ORCH_STARTUP_PROMPT", (
+    "STARTUP (auto-launch): you are the cc-ci orchestrator, just (re)launched, likely after a "
+    "reboot. Do your AGENTS.md On-startup routine NOW: read cc-ci-plan/REBOOTS.md and run "
+    "cc-ci-plan/launch.py status, then send a proactive PushNotification that you are online "
+    "with the current phase and reboot count, and confirm cc-ci-loops.service brought the loops "
+    "+ watchdog back (relaunch with RESUME_PHASE=1 cc-ci-plan/launch.py start if not). "
+    "Also read cc-ci-plan/JOURNAL.md for recent context before resuming supervision."
+))
+
+# opencode-specific
+OPENCODE_BIN = os.environ.get("OPENCODE_BIN", "/home/loops/.local/bin/opencode")
+OPENCODE_SERVER = os.environ.get("OPENCODE_SERVER", "http://127.0.0.1:4096")
+
+# ── helpers ───────────────────────────────────────────────────────────────────
+
+def log(msg):
+    ts = datetime.now().strftime("%H:%M:%S")
+    print(f"[orchestrator {ts}] {msg}", flush=True)
+
+def die(msg):
+    log(f"ERROR: {msg}")
+    sys.exit(1)
+
+def session_alive():
+    return subprocess.run(
+        ["tmux", "has-session", "-t", SESSION], capture_output=True
+    ).returncode == 0
+
+def resume_id():
+    sid = os.environ.get("ORCH_SESSION_ID")
+    if sid:
+        return sid
+    try:
+        v = Path(ID_FILE).read_text().strip()
+        return v or DEFAULT_ID
+    except FileNotFoundError:
+        return DEFAULT_ID
+
+# ── launch ────────────────────────────────────────────────────────────────────
+
+def start(mode="resume"):
+    import shutil
+    if not shutil.which("tmux"):
+        die("tmux not found")
+    Path(LOG_DIR).mkdir(parents=True, exist_ok=True)
+
+    if session_alive():
+        log(f"{SESSION} already running — leaving it (use 'stop' first to relaunch)")
+        return
+
+    model_flag = f"--model '{LOOP_MODEL}'" if LOOP_MODEL else ""
+
+    if BACKEND == "claude":
+        if not shutil.which(CLAUDE_BIN):
+            die(f"claude CLI not found — set CLAUDE_BIN (currently: {CLAUDE_BIN})")
+        if not Path(ID_FILE).exists():
+            Path(ID_FILE).write_text(DEFAULT_ID)
+
+        rc = f"--remote-control '{SESSION}'" if REMOTE_CONTROL else ""
+        resume = f"--resume '{resume_id()}'" if mode == "resume" else ""
+        prompt = f"'{STARTUP_PROMPT}'" if STARTUP_PROMPT else ""
+        cmd = f"{CLAUDE_BIN} {resume} {rc} {model_flag} {CLAUDE_FLAGS} {prompt}"
+        detail = f"resume={resume_id()}" if mode == "resume" else "fresh"
+        log(f"starting {SESSION} (backend=claude, {detail}, model={LOOP_MODEL or 'default'})")
+
+    elif BACKEND == "opencode":
+        if not Path(OPENCODE_BIN).exists():
+            die(f"opencode not found at {OPENCODE_BIN}")
+        # No --resume equivalent in opencode; STARTUP_PROMPT orients the new session.
+        # The session title in the web UI identifies it as the orchestrator.
+        prompt = STARTUP_PROMPT or (
+            "You are the cc-ci orchestrator. Read /srv/cc-ci/AGENTS.md and "
+            "cc-ci-plan/JOURNAL.md for context, then resume supervising the loops."
+        )
+        cmd = (
+            f"set -a; . /srv/cc-ci/.testenv; set +a; "
+            f"{OPENCODE_BIN} {model_flag} run --attach '{OPENCODE_SERVER}' "
+            f"--title '{SESSION}' '{prompt}'"
+        )
+        log(f"starting {SESSION} (backend=opencode, model={LOOP_MODEL or 'default'})")
+        log(f" visible at http://oc.commoninternet.net (tailnet only)")
+    else:
+        die(f"unknown LOOP_BACKEND '{BACKEND}' — use 'claude' or 'opencode'")
+
+    subprocess.run(["tmux", "new-session", "-d", "-s", SESSION, "-c", WORKDIR, cmd])
+    subprocess.run(["tmux", "pipe-pane", "-o", "-t", SESSION,
+                    f"cat >> '{LOG_DIR}/{SESSION}.log'"])
+    log(f"started. attach: tmux attach -t {SESSION}")
+
+# ── main ──────────────────────────────────────────────────────────────────────
+
+def main():
+    cmd = sys.argv[1] if len(sys.argv) > 1 else "start"
+
+    if cmd == "start":
+        start("resume")
+    elif cmd == "fresh":
+        start("fresh")
+    elif cmd == "stop":
+        if session_alive():
+            log(f"killing {SESSION}")
+            subprocess.run(["tmux", "kill-session", "-t", SESSION])
+        else:
+            log(f"{SESSION} not running")
+    elif cmd == "status":
+        state = "RUNNING" if session_alive() else "stopped"
+        log(f"{SESSION}: {state}")
+        subprocess.run(
+            f"ps -eo pid,etime,args | grep '[r]emote-control {SESSION}' || true",
+            shell=True)
+        if BACKEND == "claude":
+            log(f"resume id: {resume_id()} (file: {ID_FILE})")
+        log(f"backend: {BACKEND} model: {LOOP_MODEL or '<default>'}")
+    elif cmd == "attach":
+        os.execvp("tmux", ["tmux", "attach", "-t", SESSION])
+    else:
+        backend_note = (
+            "claude: --resume preserves conversation across reboots; viewable at claude.ai/code\n"
+            "  opencode: fresh session each launch (no --resume); viewable at http://oc.commoninternet.net"
+        )
+        print(f"""cc-ci orchestrator launcher
+
+  launch-orchestrator.py start    resume the persistent session (default)
+  launch-orchestrator.py fresh    start a new session (no --resume)
+  launch-orchestrator.py stop     kill the tmux session
+  launch-orchestrator.py status   show session state
+  launch-orchestrator.py attach   tmux attach
+
+  Backend: {BACKEND} (LOOP_BACKEND env var)
+  Model: {LOOP_MODEL or '<backend default>'} (LOOP_MODEL env var)
+  Session: {SESSION} cwd={WORKDIR}
+  {backend_note}
+""")
+
+
+if __name__ == "__main__":
+    main()
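Review note: the command string handed to tmux is assembled by wrapping values in single quotes, so a prompt containing a single quote (for example via `ORCH_STARTUP_PROMPT`) would break the shell command; the old bash script documented the same restriction. A possible hardening, sketched under the assumption that the claude flags are used exactly as in the file above (`--resume`, `--remote-control`, `--model`), is to build an argv list and let `shlex` do the quoting. `build_claude_cmd` is a hypothetical helper, not part of this commit.

```python
import shlex


def build_claude_cmd(claude_bin, flags, prompt, resume_id=None, remote_name=None, model=None):
    """Return a shell-safe command string for `tmux new-session ... <cmd>`."""
    argv = [claude_bin]
    if resume_id:
        argv += ["--resume", resume_id]
    if remote_name:
        argv += ["--remote-control", remote_name]
    if model:
        argv += ["--model", model]
    argv += shlex.split(flags)   # CLAUDE_FLAGS is a whitespace-separated flag string
    if prompt:
        argv.append(prompt)      # positional prompt; any quotes inside it are escaped by join
    return shlex.join(argv)      # quotes every argument for a POSIX shell
```

Because `shlex.join` quotes each argument individually, the "must contain no single quotes" caveat on the startup prompt would no longer apply.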
@@ -1,118 +1,3 @@
 #!/usr/bin/env bash
-#
-# launch-orchestrator.sh — start/resume the cc-ci ORCHESTRATOR session in tmux under remote-control.
-#
-# The orchestrator (see /srv/cc-ci/AGENTS.md) is the long-lived SUPERVISORY session: it watches the
-# Builder/Adversary loops, reads their logs/STATUS, edits the plan/prompts, restarts stuck loops, and
-# owns the VM-level fallback. It is SEPARATE from the loops that launch.sh manages — this script only
-# brings the orchestrator back (e.g. after a reboot, which kills the tmux server and every session in
-# it). The conversation itself survives on disk across exits/reboots; remote-control only stays
-# connected while the process is alive, so recovery = relaunch the process and re-attach by --resume.
-#
-# Naming: tmux session AND remote-control name are both "cc-ci-orchestrator-vm" (the -vm suffix
-# distinguishes it from the repo name cc-ci-orchestrator); the loop sessions are cc-ci-builder /
-# cc-ci-adv / cc-ci-watchdog.
-#
-# Usage:
-#   ./launch-orchestrator.sh start    # resume the persistent orchestrator session (DEFAULT)
-#   ./launch-orchestrator.sh fresh    # start a NEW orchestrator session (no --resume)
-#   ./launch-orchestrator.sh status   # show tmux + remote-control state
-#   ./launch-orchestrator.sh attach   # tmux attach to the session (Ctrl-b d to detach)
-#   ./launch-orchestrator.sh stop     # kill the tmux session (conversation persists on disk)
-#
-# The persistent session id is read from $ID_FILE (seeded on first run with DEFAULT_ID). A Claude
-# session keeps the SAME id across --resume, so this stays valid across reboots. To point the script
-# at a different session, edit that file or export ORCH_SESSION_ID.
-
-set -euo pipefail
-
-# ----- config -------------------------------------------------------------
-SESSION="${ORCH_SESSION:-cc-ci-orchestrator-vm}" # tmux session name == remote-control name
-WORKDIR="${ORCH_DIR:-/srv/cc-ci}" # orchestrator cwd (its claude project dir)
-CLAUDE_BIN="${CLAUDE_BIN:-claude}"
-CLAUDE_FLAGS="${CLAUDE_FLAGS:---dangerously-skip-permissions}"
-# REMOTE_CONTROL=1 → --remote-control session, viewable/steerable at claude.ai/code. Needs the box
-# logged into the claude.ai account. =0 for a plain local interactive session.
-REMOTE_CONTROL="${REMOTE_CONTROL:-1}"
-LOG_DIR="${LOG_DIR:-/srv/cc-ci/.cc-ci-logs}"
-ID_FILE="${ORCH_ID_FILE:-$LOG_DIR/.orchestrator-session-id}"
-DEFAULT_ID="c746050a-af11-409d-87ba-c05268e2e5d1" # the orchestrator session as of 2026-05-31 (Hetzner)
-# Startup nudge injected as the resumed session's first turn, so an AUTO-launched orchestrator (e.g.
-# cc-ci-loops.service ExecStartPost after a reboot) actually RUNS its AGENTS.md startup routine —
-# announce itself + report reboots — instead of resuming silently and waiting. Set empty to disable.
-# Must contain NO single quotes (it is single-quoted into the tmux command).
-STARTUP_PROMPT="${ORCH_STARTUP_PROMPT-STARTUP (auto-launch): you are the cc-ci orchestrator, just (re)launched, likely after a reboot. Do your AGENTS.md On-startup routine NOW: read cc-ci-plan/REBOOTS.md and run cc-ci-plan/launch.sh status, then send a proactive PushNotification that you are online with the current phase and reboot count, and confirm cc-ci-loops.service brought the loops + watchdog back (relaunch with RESUME_PHASE=1 cc-ci-plan/launch.sh start if not). Then resume supervising.}"
-# --------------------------------------------------------------------------
-
-log() { printf '[orchestrator %(%H:%M:%S)T] %s\n' -1 "$*"; }
-die() { log "ERROR: $*"; exit 1; }
-session_alive() { tmux has-session -t "$SESSION" 2>/dev/null; }
-
-preflight() {
-  command -v tmux >/dev/null 2>&1 || die "missing dependency: tmux"
-  command -v "$CLAUDE_BIN" >/dev/null 2>&1 || die "claude CLI not found (set CLAUDE_BIN)"
-  [[ -d "$WORKDIR" ]] || die "workdir not found: $WORKDIR"
-  mkdir -p "$LOG_DIR"
-  [[ -f "$ID_FILE" ]] || echo "$DEFAULT_ID" > "$ID_FILE"
-}
-
-resume_id() { echo "${ORCH_SESSION_ID:-$(cat "$ID_FILE" 2>/dev/null || echo "$DEFAULT_ID")}"; }
-
-# Launch claude in a detached tmux session. $1=resume ("resume"|"fresh").
-start() {
-  local mode="${1:-resume}"
-  preflight
-  if session_alive; then
-    log "$SESSION already running — leaving it (use '$0 stop' first to relaunch)"
-    return 0
-  fi
-  local rc="" resume="" id=""
-  [[ "$REMOTE_CONTROL" == "1" ]] && rc="--remote-control '$SESSION'"
-  if [[ "$mode" == "resume" ]]; then
-    id="$(resume_id)"
-    [[ -n "$id" ]] && resume="--resume '$id'"
-    log "starting $SESSION (resume=$id, cwd=$WORKDIR, rc=$REMOTE_CONTROL)"
-  else
-    log "starting $SESSION FRESH (no resume, cwd=$WORKDIR, rc=$REMOTE_CONTROL)"
-  fi
-  # Startup nudge as a POSITIONAL prompt (not stdin — stdin would force print mode and break
-  # remote-control). On --resume this appends as the session's next turn, triggering the AGENTS.md
-  # startup routine (announce + report reboots). Empty STARTUP_PROMPT => clean resume, no nudge.
-  local prompt_arg=""
-  [[ -n "$STARTUP_PROMPT" ]] && prompt_arg="'$STARTUP_PROMPT'"
-  tmux new-session -d -s "$SESSION" -c "$WORKDIR" \
-    "$CLAUDE_BIN $resume $rc $CLAUDE_FLAGS $prompt_arg"
-  tmux pipe-pane -o -t "$SESSION" "cat >> '$LOG_DIR/$SESSION.log'"
-  log "started. status: $0 status | attach: tmux attach -t $SESSION"
-}
-
-case "${1:-start}" in
-  start) start resume ;;
-  fresh) start fresh ;;
-  stop)
-    if session_alive; then log "killing $SESSION"; tmux kill-session -t "$SESSION" || true; else log "$SESSION not running"; fi
-    ;;
-  status)
-    if session_alive; then
-      log "$SESSION: RUNNING"
-      ps -eo pid,etime,args | grep "[r]emote-control $SESSION" || true
-    else
-      log "$SESSION: stopped"
-    fi
-    log "resume id: $(cat "$ID_FILE" 2>/dev/null || echo "$DEFAULT_ID") (file: $ID_FILE)"
-    ;;
-  attach) exec tmux attach -t "$SESSION" ;;
-  *)
-    cat <<EOF
-cc-ci orchestrator launcher
-
-  $0 start    resume the persistent orchestrator session in tmux + remote-control (default)
-  $0 fresh    start a NEW orchestrator session (no --resume)
-  $0 status   show tmux + remote-control state and the resume id
-  $0 attach   tmux attach to the session
-  $0 stop     kill the tmux session (conversation persists on disk)
-
-Env: SESSION=$SESSION WORKDIR=$WORKDIR REMOTE_CONTROL=$REMOTE_CONTROL CLAUDE_BIN=$CLAUDE_BIN
-EOF
-    ;;
-esac
+# Thin wrapper — delegates everything to launch-orchestrator.py in the same directory.
+exec python3 "$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")/launch-orchestrator.py" "$@"
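Review note: the resume id is resolved in a fixed order (the `ORCH_SESSION_ID` override, then the id file, then the built-in default). One difference from the removed bash version is that the Python port also falls back to the default when the id file exists but is empty. The sketch below restates that order as a pure function so it can be checked in isolation; `resolve_resume_id` and its parameters are hypothetical names, with the environment passed in as a plain mapping.

```python
from pathlib import Path


def resolve_resume_id(env, id_file, default_id):
    """Pick the session id: env override, then id file contents, then the default."""
    override = env.get("ORCH_SESSION_ID")
    if override:
        return override
    try:
        # An empty or whitespace-only file falls through to the default.
        return Path(id_file).read_text().strip() or default_id
    except FileNotFoundError:
        return default_id
```

Passing `env` explicitly (instead of reading `os.environ` inside the function) is a design choice for testability only; in the launcher the mapping would simply be `os.environ`.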
198
cc-ci-plan/launch-upgrader.py
Normal file
@ -0,0 +1,198 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
cc-ci upgrader launcher — one-shot weekly recipe-upgrade job agent.
|
||||
|
||||
The upgrader runs /upgrade-all to completion, then stops and stays idle so the
|
||||
run + summary remain viewable in the web UI. The next weekly run starts a fresh
|
||||
session (start clears any idle/finished session).
|
||||
|
||||
Usage:
|
||||
launch-upgrader.py start use-or-create: leave an in-flight run alone, else start fresh
|
||||
launch-upgrader.py fresh always kill any existing session and start fresh
|
||||
launch-upgrader.py stop kill the session
|
||||
launch-upgrader.py status show session state
|
||||
launch-upgrader.py attach tmux attach to the session
|
||||
|
||||
Env:
|
||||
LOOP_BACKEND claude (default) | opencode — also accepts UPGRADER_BACKEND
|
||||
LOOP_MODEL model flag (overrides UPGRADER_MODEL)
|
||||
UPGRADER_MODEL sonnet (default for claude) | tinfoil/deepseek-v4-pro (opencode example)
|
||||
UPGRADER_ARGS extra args passed to /upgrade-all (e.g. "n8n ghost", "--dry-run")
|
||||
|
||||
claude backend:
|
||||
CLAUDE_BIN, CLAUDE_FLAGS, REMOTE_CONTROL
|
||||
opencode backend:
|
||||
OPENCODE_BIN, OPENCODE_SERVER
|
||||
"""
|
||||
|
||||
import os, sys, subprocess, re
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
|
||||
# ── config ────────────────────────────────────────────────────────────────────
|
||||
|
||||
SESSION = os.environ.get("UPGRADER_SESSION", "cc-ci-upgrader")
|
||||
WORKDIR = os.environ.get("UPGRADER_DIR", "/srv/cc-ci")
|
||||
LOG_DIR = os.environ.get("LOG_DIR", "/srv/cc-ci/.cc-ci-logs")
|
||||
|
||||
# LOOP_BACKEND / LOOP_MODEL take precedence (unified control from the operator).
|
||||
BACKEND = os.environ.get("LOOP_BACKEND", os.environ.get("UPGRADER_BACKEND", "claude"))
|
||||
MODEL = os.environ.get("LOOP_MODEL", os.environ.get("UPGRADER_MODEL", "sonnet"))
|
||||
|
||||
CLAUDE_BIN = os.environ.get("CLAUDE_BIN", "claude")
|
||||
CLAUDE_FLAGS = os.environ.get("CLAUDE_FLAGS", "--dangerously-skip-permissions")
|
||||
REMOTE_CONTROL = os.environ.get("REMOTE_CONTROL", "1") == "1"
|
||||
|
||||
OPENCODE_BIN = os.environ.get("OPENCODE_BIN", "/home/loops/.local/bin/opencode")
|
||||
OPENCODE_SERVER = os.environ.get("OPENCODE_SERVER", "http://127.0.0.1:4096")
|
||||
|
||||
UPGRADER_ARGS = os.environ.get("UPGRADER_ARGS", "")
|
||||
|
||||
# ── helpers ───────────────────────────────────────────────────────────────────
|
||||
|
||||
def log(msg):
|
||||
ts = datetime.now().strftime("%H:%M:%S")
|
||||
print(f"[upgrader {ts}] {msg}", flush=True)
|
||||
|
||||
def die(msg):
|
||||
log(f"ERROR: {msg}")
|
||||
sys.exit(1)
|
||||
|
||||
def session_alive():
|
||||
return subprocess.run(
|
||||
["tmux", "has-session", "-t", SESSION], capture_output=True
|
||||
).returncode == 0
|
||||
|
||||
def session_busy():
|
||||
"""True while a turn is actively in flight (not idle/finished/wedged)."""
|
||||
r = subprocess.run(["tmux", "capture-pane", "-pt", SESSION],
|
||||
capture_output=True, text=True)
|
||||
pane = r.stdout if r.returncode == 0 else ""
|
||||
return bool(re.search(r"esc to interrupt|⠋|⠙|⠹|⠸|⠼|⠴|⠦|⠧|⠇|⠏|Running tool", pane))
|
||||
|
||||
def kill_session():
|
||||
subprocess.run(["tmux", "kill-session", "-t", SESSION], capture_output=True)
|
||||
|
||||
# ── kickoff prompt ────────────────────────────────────────────────────────────
|
||||
|
||||
def build_kickoff():
|
||||
args_note = f" with arguments: {UPGRADER_ARGS}" if UPGRADER_ARGS else ""
|
||||
return f"""\
|
||||
*** cc-ci UPGRADER — weekly recipe-upgrade job ***
|
||||
You are the cc-ci Upgrader: a ONE-SHOT job agent, NOT a perpetual loop. Run the
|
||||
recipe-upgrade sequence to completion, then STOP. Your cwd is {WORKDIR}; reach the CI
|
||||
server with `ssh cc-ci`; creds are in {WORKDIR}/.testenv; skills in {WORKDIR}/.claude/skills/.
|
||||
|
||||
DO THIS:
|
||||
1. Invoke the /upgrade-all skill in DEFAULT mode{args_note}
|
||||
(read {WORKDIR}/.claude/skills/upgrade-all/SKILL.md for the full procedure). It surveys
|
||||
every enrolled recipe and, for each upgradeable one, runs /recipe-upgrade in DEFAULT
|
||||
mode — recipe PR only, verified by posting `!testme` on the PR (results visible in the
|
||||
PR, iterate up to 3x). A genuinely stale test gets an explanatory PR COMMENT, never a
|
||||
test edit.
|
||||
2. Process recipes via per-recipe SUBAGENTS so your own context stays light. If your
|
||||
context usage climbs (~80%), run /compact before continuing.
|
||||
3. Write + push the weekly summary (the PR list is the actionable output for the operator).
|
||||
4. WHEN THE RUN IS COMPLETE: STOP. Print the final summary (lead with the PR list) and an
|
||||
`UPGRADE RUN COMPLETE` line, then go idle. Do NOT loop, do NOT re-run, and do NOT kill
|
||||
your own session — leave it up so the operator can review the output in the web UI.
|
||||
Next week's run starts a fresh session (the launcher clears this idle one).
|
||||
|
||||
GUARDRAILS: NEVER merge any PR. NEVER weaken a test. DEFAULT mode only — do NOT pass
|
||||
--with-tests (updating cc-ci tests is the operator's per-recipe opt-in). Single-writer:
|
||||
dedicated branches + separate clones, never push main, never touch the build loops'
|
||||
/cc-ci /cc-ci-adv clones. The shared Swarm is stateful — go sequentially.
|
||||
"""
|
||||
|
||||
# ── launch ────────────────────────────────────────────────────────────────────
|
||||
|
||||
def start(mode="use-or-create"):
|
||||
import shutil
|
||||
if not shutil.which("tmux"):
|
||||
die("tmux not found")
|
||||
Path(LOG_DIR).mkdir(parents=True, exist_ok=True)
|
||||
|
||||
if session_alive():
|
||||
if mode == "use-or-create" and session_busy():
|
||||
log(f"{SESSION} already running a job (busy) — leaving it")
|
||||
return
|
||||
log(f"{SESSION} exists but idle/stale (or fresh requested) — killing it first")
|
||||
kill_session()
|
||||
import time; time.sleep(1)
|
||||
|
||||
kf = Path(LOG_DIR) / f".kickoff-{SESSION}.txt"
|
||||
kf.write_text(build_kickoff())
|
||||
|
||||
model_flag = f"--model '{MODEL}'" if MODEL else ""
|
||||
log(f"starting {SESSION} (backend={BACKEND}, model={MODEL}, args='{UPGRADER_ARGS or '<none>'}')")
|
||||
|
||||
if BACKEND == "claude":
|
||||
if not shutil.which(CLAUDE_BIN):
|
||||
die(f"claude CLI not found — set CLAUDE_BIN (currently: {CLAUDE_BIN})")
|
||||
rc = f"--remote-control '{SESSION}'" if REMOTE_CONTROL else ""
|
||||
cmd = f"{CLAUDE_BIN} {rc} {model_flag} {CLAUDE_FLAGS} \"$(cat '{kf}')\""
|
||||
|
||||
elif BACKEND == "opencode":
|
||||
if not Path(OPENCODE_BIN).exists():
|
||||
die(f"opencode not found at {OPENCODE_BIN}")
|
||||
cmd = (
|
||||
f"set -a; . /srv/cc-ci/.testenv; set +a; "
|
||||
f"{OPENCODE_BIN} {model_flag} run --attach '{OPENCODE_SERVER}' "
|
||||
f"--title '{SESSION}' \"$(cat '{kf}')\""
|
||||
)
|
||||
log(f" visible at http://oc.commoninternet.net (tailnet only)")
|
||||
else:
|
||||
die(f"unknown LOOP_BACKEND '{BACKEND}' — use 'claude' or 'opencode'")
|
||||
|
||||
subprocess.run(["tmux", "new-session", "-d", "-s", SESSION, "-c", WORKDIR, cmd])
|
||||
subprocess.run(["tmux", "pipe-pane", "-o", "-t", SESSION,
|
||||
f"cat >> '{LOG_DIR}/{SESSION}.log'"])
|
||||
log(f"started. attach: tmux attach -t {SESSION} log: {LOG_DIR}/{SESSION}.log")
|
||||
|
||||
# ── main ──────────────────────────────────────────────────────────────────────

def main():
    cmd = sys.argv[1] if len(sys.argv) > 1 else "start"

    if cmd == "start":
        start("use-or-create")
    elif cmd == "fresh":
        start("fresh")
    elif cmd == "stop":
        if session_alive():
            log(f"killing {SESSION}")
            kill_session()
        else:
            log(f"{SESSION} not running")
    elif cmd == "status":
        if session_alive():
            busy = "busy" if session_busy() else "idle/finishing"
            log(f"{SESSION}: RUNNING ({busy})")
            subprocess.run(
                f"ps -eo pid,etime,args | grep '[r]emote-control {SESSION}' || true",
                shell=True)
        else:
            log(f"{SESSION}: stopped")
        log(f"backend: {BACKEND} model: {MODEL} args: '{UPGRADER_ARGS or '<none>'}'")
    elif cmd == "attach":
        os.execvp("tmux", ["tmux", "attach", "-t", SESSION])
    else:
        print(f"""cc-ci upgrader launcher — one-shot weekly recipe-upgrade job

  launch-upgrader.py start    use-or-create (leave busy run alone, else start fresh)
  launch-upgrader.py fresh    always kill existing + start fresh
  launch-upgrader.py stop     kill the session
  launch-upgrader.py status   show session state
  launch-upgrader.py attach   tmux attach

  Backend: {BACKEND} (LOOP_BACKEND or UPGRADER_BACKEND env var)
  Model: {MODEL} (LOOP_MODEL or UPGRADER_MODEL env var)
  Args: {UPGRADER_ARGS or '<none>'} (UPGRADER_ARGS env var, passed to /upgrade-all)

  claude: viewable at claude.ai/code
  opencode: viewable at http://oc.commoninternet.net server={OPENCODE_SERVER}
""")


if __name__ == "__main__":
    main()
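
The use-or-create rule behind `start` and `fresh` can be summarised as a small pure function. This is an illustrative sketch only: `Action` and `decide` are hypothetical names that do not appear in launch-upgrader.py, and the real launcher acts on tmux directly instead of returning a value.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for what the launcher does next."""
    START = "no session exists: start one"
    LEAVE = "a run is in flight: leave it alone"
    RESTART = "session is idle/stale (or fresh requested): kill it, then start"


def decide(mode: str, alive: bool, busy: bool) -> Action:
    """Map (mode, session alive?, session busy?) to the launcher's next step."""
    if not alive:
        return Action.START
    # Only the default mode respects an in-flight run; "fresh" always restarts.
    if mode == "use-or-create" and busy:
        return Action.LEAVE
    return Action.RESTART


for mode, alive, busy in [
    ("use-or-create", False, False),
    ("use-or-create", True, True),
    ("use-or-create", True, False),
    ("fresh", True, True),
]:
    print(f"{mode:14} alive={alive!s:5} busy={busy!s:5} -> {decide(mode, alive, busy).name}")
```

Keeping the decision separate from the tmux calls would make the weekly cron behaviour easy to check without a live session.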
@ -1,151 +1,3 @@
#!/usr/bin/env bash
#
# launch-upgrader.sh — spin up the cc-ci UPGRADER agent in tmux under remote-control.
#
# The Upgrader is a ONE-SHOT job agent (not a perpetual loop like the Builder/Adversary): it runs the
# weekly recipe-upgrade sequence — the /upgrade-all skill in DEFAULT mode — to completion, then STOPS
# and stays idle (it does NOT self-terminate) so the run + summary remain viewable/steerable at
# claude.ai/code exactly like the Builder, instead of being buried in headless cron output. The next
# weekly run starts a fresh session: `start` leaves an in-flight run alone but clears a finished/idle
# (or wedged) session and starts clean. The weekly cron (Sat 03:00 UTC, once cc-ci is built — see
# [[cc-ci-upgrade-all-cron]]) invokes `launch-upgrader.sh start`.
#
# Naming: tmux session AND remote-control name are both "cc-ci-upgrader" (matching
# cc-ci-builder / cc-ci-adv / cc-ci-watchdog / cc-ci-orchestrator).
#
# Usage:
#   ./launch-upgrader.sh start    # use-or-create: if a run is actively in flight leave it,
#                                 # else (no session / idle-stale) kill any stale + start fresh
#   ./launch-upgrader.sh fresh    # always kill any existing + start a fresh run
#   ./launch-upgrader.sh status | attach | stop
#
# Env:
#   UPGRADER_ARGS=""   passthrough args to /upgrade-all (e.g. "--dry-run", "ghost n8n"); default none
#                      = full default fleet run. NEVER pass --with-tests here (the cron must not
#                      auto-edit tests; that's the operator's per-recipe opt-in).
set -euo pipefail

SESSION="${UPGRADER_SESSION:-cc-ci-upgrader}"   # tmux session name == remote-control name
WORKDIR="${UPGRADER_DIR:-/srv/cc-ci}"           # cwd: where .claude/skills/ + .testenv live

# Backend selection — mirrors launch.sh. LOOP_BACKEND overrides for consistency.
UPGRADER_BACKEND="${LOOP_BACKEND:-${UPGRADER_BACKEND:-claude}}"   # "claude" or "opencode"
# Model: LOOP_MODEL > UPGRADER_MODEL > backend default (sonnet for claude, provider/model for opencode).
UPGRADER_MODEL="${LOOP_MODEL:-${UPGRADER_MODEL:-sonnet}}"

CLAUDE_BIN="${CLAUDE_BIN:-claude}"
CLAUDE_FLAGS="${CLAUDE_FLAGS:---dangerously-skip-permissions}"
OPENCODE_BIN="${OPENCODE_BIN:-/home/loops/.local/bin/opencode}"
OPENCODE_SERVER="${OPENCODE_SERVER:-http://127.0.0.1:4096}"
REMOTE_CONTROL="${REMOTE_CONTROL:-1}"           # 1 => --remote-control / opencode web
LOG_DIR="${LOG_DIR:-/srv/cc-ci/.cc-ci-logs}"
UPGRADER_ARGS="${UPGRADER_ARGS:-}"

log() { printf '[upgrader %(%H:%M:%S)T] %s\n' -1 "$*"; }
die() { log "ERROR: $*"; exit 1; }
session_alive() { tmux has-session -t "$SESSION" 2>/dev/null; }
# "actively working" = claude shows interrupt hint; opencode shows spinner/Running tool.
session_busy() { tmux capture-pane -pt "$SESSION" 2>/dev/null | grep -qE 'esc to interrupt|⠋|⠙|⠹|⠸|⠼|⠴|⠦|⠧|⠇|⠏|Running tool'; }

preflight() {
  command -v tmux >/dev/null 2>&1 || die "missing dependency: tmux"
  case "$UPGRADER_BACKEND" in
    claude)   command -v "$CLAUDE_BIN" >/dev/null 2>&1 || die "claude CLI not found (set CLAUDE_BIN)" ;;
    opencode) command -v "$OPENCODE_BIN" >/dev/null 2>&1 || die "opencode not found (set OPENCODE_BIN)"
              [[ -n "$OPENCODE_HOST" ]] || die "could not detect tailscale IP for OPENCODE_HOST" ;;
    *) die "unknown UPGRADER_BACKEND '$UPGRADER_BACKEND' — use 'claude' or 'opencode'" ;;
  esac
  [[ -d "$WORKDIR" ]] || die "workdir not found: $WORKDIR"
  [[ -d "$WORKDIR/.claude/skills/upgrade-all" ]] || die "upgrade-all skill not found under $WORKDIR/.claude/skills"
  mkdir -p "$LOG_DIR"
}

write_kickoff() {
  local kf="$LOG_DIR/.kickoff-$SESSION.txt"
  cat > "$kf" <<KICK
*** cc-ci UPGRADER — weekly recipe-upgrade job ***
You are the cc-ci Upgrader: a ONE-SHOT job agent, NOT a perpetual loop. Run the recipe-upgrade
sequence to completion, then STOP. Your cwd is ${WORKDIR}; reach the CI server with \`ssh cc-ci\`;
creds are in ${WORKDIR}/.testenv; the skills live in ${WORKDIR}/.claude/skills/.

DO THIS:
1. Invoke the **/upgrade-all** skill in DEFAULT mode${UPGRADER_ARGS:+ with arguments: ${UPGRADER_ARGS}}
   (read ${WORKDIR}/.claude/skills/upgrade-all/SKILL.md for the full procedure). It surveys every
   enrolled recipe and, for each upgradeable one, runs /recipe-upgrade in DEFAULT mode — recipe PR
   only, verified by posting \`!testme\` on the PR (results visible in the PR, iterate up to 3x). A
   genuinely stale test gets an explanatory PR COMMENT, never a test edit.
2. Process recipes via per-recipe SUBAGENTS (as the skill specifies) so your own context stays light.
   If your context usage climbs (~80%), run /compact before continuing.
3. Write + push the weekly summary (the PR list is the actionable output for the operator).
4. WHEN THE RUN IS COMPLETE: STOP. Print the final summary (lead with the PR list) and an
   \`UPGRADE RUN COMPLETE\` line, then go idle. Do NOT loop, do NOT re-run, and do NOT kill your own
   session — leave it up so the operator can review your output + the summary in the web UI
   (claude.ai/code). Next week's run starts a fresh session (the launcher clears this idle one).

GUARDRAILS: NEVER merge any PR. NEVER weaken a test. DEFAULT mode only — do NOT pass --with-tests
(updating cc-ci tests is the operator's per-recipe opt-in). Single-writer: dedicated branches +
separate clones, never push main, never touch the build loops' /cc-ci /cc-ci-adv clones. The shared
Swarm is stateful — go sequentially and tear down what you deploy.
KICK
  echo "$kf"
}

start() {
  local mode="${1:-use-or-create}"
  preflight
  if session_alive; then
    if [[ "$mode" == "use-or-create" ]] && session_busy; then
      log "$SESSION already running a job (busy) — leaving it"; return 0
    fi
    log "$SESSION exists but idle/stale (or fresh requested) — killing it first"
    tmux kill-session -t "$SESSION" 2>/dev/null || true; sleep 1
  fi
  local kf
  kf="$(write_kickoff)"
  log "starting $SESSION (backend=$UPGRADER_BACKEND, model=$UPGRADER_MODEL, args='${UPGRADER_ARGS:-<none>}')"
  case "$UPGRADER_BACKEND" in
    claude)
      local rc=""
      [[ "$REMOTE_CONTROL" == "1" ]] && rc="--remote-control '$SESSION'"
      tmux new-session -d -s "$SESSION" -c "$WORKDIR" \
        "$CLAUDE_BIN $rc --model '$UPGRADER_MODEL' $CLAUDE_FLAGS \"\$(cat '$kf')\""
      ;;
    opencode)
      tmux new-session -d -s "$SESSION" -c "$WORKDIR" \
        "set -a; . /srv/cc-ci/.testenv; set +a; $OPENCODE_BIN --model '$UPGRADER_MODEL' run --attach '$OPENCODE_SERVER' --title '$SESSION' \"\$(cat '$kf')\""
      log "$SESSION visible in web UI at http://oc.commoninternet.net (tailnet only)"
      ;;
  esac
  tmux pipe-pane -o -t "$SESSION" "cat >> '$LOG_DIR/$SESSION.log'"
  log "started. status: $0 status | attach: tmux attach -t $SESSION | log: $LOG_DIR/$SESSION.log"
}

case "${1:-start}" in
  start)  start use-or-create ;;
  fresh)  start fresh ;;
  stop)   if session_alive; then log "killing $SESSION"; tmux kill-session -t "$SESSION" || true; else log "$SESSION not running"; fi ;;
  status)
    if session_alive; then
      log "$SESSION: RUNNING $(session_busy && echo '(busy)' || echo '(idle/finishing)')"
      ps -eo pid,etime,args | grep "[r]emote-control $SESSION" || true
    else log "$SESSION: stopped"; fi ;;
  attach) exec tmux attach -t "$SESSION" ;;
  *)
    cat <<EOF
cc-ci upgrader launcher — one-shot weekly recipe-upgrade job agent (remote-control)

  $0 start    use-or-create: leave an in-flight run alone, else (re)start fresh (DEFAULT; what the cron calls)
  $0 fresh    always kill any existing + start a fresh run
  $0 status   show tmux + remote-control state
  $0 attach   tmux attach to the session
  $0 stop     kill the session

Env: UPGRADER_BACKEND=$UPGRADER_BACKEND UPGRADER_MODEL=$UPGRADER_MODEL UPGRADER_ARGS='${UPGRADER_ARGS:-<none>}'
  claude: CLAUDE_BIN=$CLAUDE_BIN REMOTE_CONTROL=$REMOTE_CONTROL
  opencode: OPENCODE_BIN=$OPENCODE_BIN OPENCODE_SERVER=$OPENCODE_SERVER web=http://oc.commoninternet.net
  (LOOP_BACKEND / LOOP_MODEL override UPGRADER_BACKEND / UPGRADER_MODEL for unified control)
The agent runs /upgrade-all (DEFAULT mode) to completion, then STOPS and stays idle (viewable in the
web UI). It does NOT self-terminate; the next weekly `start` clears the idle session and runs fresh.
EOF
    ;;
esac
# Thin wrapper — delegates everything to launch-upgrader.py in the same directory.
exec python3 "$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")/launch-upgrader.py" "$@"
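
The backend and model are resolved with one precedence rule: the `LOOP_*` variable first, then the `UPGRADER_*` variable, then a default. The following is a minimal Python sketch of that rule, mirroring the removed bash expression `${LOOP_BACKEND:-${UPGRADER_BACKEND:-claude}}`, in which an empty value counts as unset. `resolve` is a hypothetical helper for illustration, not a function from this commit.

```python
def resolve(env, loop_key, own_key, default):
    """Return env[loop_key], else env[own_key], else default.

    Empty strings are treated as unset, matching bash's ${VAR:-default}.
    """
    return env.get(loop_key) or env.get(own_key) or default


# Example environment: only UPGRADER_BACKEND is set; LOOP_MODEL is set but empty.
env = {"UPGRADER_BACKEND": "opencode", "LOOP_MODEL": ""}

backend = resolve(env, "LOOP_BACKEND", "UPGRADER_BACKEND", "claude")
model = resolve(env, "LOOP_MODEL", "UPGRADER_MODEL", "sonnet")
print(backend, model)  # opencode sonnet
```

In real use `env` would be `os.environ`; a plain dict is used here so the example runs without touching the process environment.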
582
cc-ci-plan/launch.py
Normal file
@ -0,0 +1,582 @@
#!/usr/bin/env python3
"""
cc-ci loop launcher — phase-aware Builder/Adversary loops + watchdog.

Usage:
  launch.py start                              start loops + watchdog (resets to phase 0 unless RESUME_PHASE=1)
  launch.py stop                               stop loops + watchdog
  launch.py status                             show phase + session state
  launch.py watchdog                           run the watchdog in the foreground (called by start_watchdog)
  launch.py logs builder|adversary|watchdog    tail a log

Env (all optional — defaults shown):
  LOOP_BACKEND      claude (default) | opencode
  LOOP_MODEL        model flag, e.g. "sonnet" (claude) or "tinfoil/deepseek-v4-pro" (opencode)
  RESUME_PHASE      1 = keep current phase index on start (default resets to 0)

  CLAUDE_BIN        claude
  OPENCODE_BIN      /home/loops/.local/bin/opencode
  OPENCODE_SERVER   http://127.0.0.1:4096

  PLAN_DIR          /srv/cc-ci/cc-ci-plan
  BUILDER_DIR       /srv/cc-ci/cc-ci
  ADV_DIR           /srv/cc-ci/cc-ci-adv
  LOG_DIR           /srv/cc-ci/.cc-ci-logs
  PHASES_SPEC       semicolon-separated "id|planfile|statusfile" entries
  PHASE_IDX_FILE    $LOG_DIR/.phase-idx
  WATCH_INTERVAL    300 (seconds between heavy checks: phase DONE / heal sessions)
  SIGNAL_INTERVAL   30 (seconds between handoff / stall checks)
  STALL_IDLE        300 (idle seconds without a WAITING-UNTIL before reboot)
  STALL_GRACE       180 (seconds past a WAITING-UNTIL before reboot)
"""

import hashlib, os, re, subprocess, sys, time
from datetime import datetime, timezone
from pathlib import Path

# ── config ────────────────────────────────────────────────────────────────────

PLAN_DIR = os.environ.get("PLAN_DIR", "/srv/cc-ci/cc-ci-plan")
BUILDER_DIR = os.environ.get("BUILDER_DIR", "/srv/cc-ci/cc-ci")
ADV_DIR = os.environ.get("ADV_DIR", "/srv/cc-ci/cc-ci-adv")
LOG_DIR = os.environ.get("LOG_DIR", "/srv/cc-ci/.cc-ci-logs")

BACKEND = os.environ.get("LOOP_BACKEND", "claude")
LOOP_MODEL = os.environ.get("LOOP_MODEL", "")
REMOTE_CONTROL = os.environ.get("REMOTE_CONTROL", "1") == "1"

CLAUDE_BIN = os.environ.get("CLAUDE_BIN", "claude")
CLAUDE_FLAGS = os.environ.get("CLAUDE_FLAGS", "")
if os.getuid() == 0:
    os.environ.setdefault("CLAUDE_DANGEROUSLY_SKIP_PERMISSIONS", "1")
else:
    CLAUDE_FLAGS = os.environ.get("CLAUDE_FLAGS", "--dangerously-skip-permissions")

OPENCODE_BIN = os.environ.get("OPENCODE_BIN", "/home/loops/.local/bin/opencode")
OPENCODE_SERVER = os.environ.get("OPENCODE_SERVER", "http://127.0.0.1:4096")

ORCH_SESSION = os.environ.get("ORCH_SESSION", "cc-ci-orchestrator-vm")
ORCH_LAUNCHER = os.environ.get("ORCH_LAUNCHER", f"{PLAN_DIR}/launch-orchestrator.sh")
WATCH_ORCHESTRATOR = os.environ.get("WATCH_ORCHESTRATOR", "1") == "1"

BUILDER_SESSION = "cc-ci-builder"
ADV_SESSION = "cc-ci-adv"
WATCHDOG_SESSION = "cc-ci-watchdog"

WATCH_INTERVAL = int(os.environ.get("WATCH_INTERVAL", 300))
SIGNAL_INTERVAL = int(os.environ.get("SIGNAL_INTERVAL", 30))
STALL_IDLE = int(os.environ.get("STALL_IDLE", 300))
STALL_GRACE = int(os.environ.get("STALL_GRACE", 180))

PHASES_SPEC = os.environ.get("PHASES_SPEC", ";".join([
    "1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md",
    "1b|plan-phase1b-review-lint.md|STATUS-1b.md",
    "1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md",
    "1e|plan-phase1e-harness-corrections.md|STATUS-1e.md",
    "2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md",
    "2pc|plan-phase2pc-image-cache.md|STATUS-2pc.md",
    "2|plan-phase2-recipe-tests.md|STATUS-2.md",
    "2b|plan-phase2b-test-performance.md|STATUS-2b.md",
    "3|plan-phase3-results-ux.md|STATUS-3.md",
    "4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md",
    "5|plan-phase5-verify-upgrade-flow.md|STATUS-5.md",
]))
PHASES = [p.split("|") for p in PHASES_SPEC.split(";")]
PHASE_IDX_FILE = os.environ.get("PHASE_IDX_FILE", f"{LOG_DIR}/.phase-idx")

# Regex patterns for session-state detection
ACTIVE_RE = re.compile(r"esc to interrupt|⠋|⠙|⠹|⠸|⠼|⠴|⠦|⠧|⠇|⠏|Running tool")
LIMIT_RE = re.compile(r"spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)", re.I)
FATAL_RE = re.compile(r"redacted_thinking|blocks cannot be modified|cannot be modified", re.I)

# ── logging ───────────────────────────────────────────────────────────────────

def log(msg):
    ts = datetime.now().strftime("%H:%M:%S")
    print(f"[launch {ts}] {msg}", flush=True)

def die(msg):
    log(f"ERROR: {msg}")
    sys.exit(1)

# ── tmux helpers ──────────────────────────────────────────────────────────────

def session_alive(name):
    return subprocess.run(
        ["tmux", "has-session", "-t", name],
        capture_output=True
    ).returncode == 0

def kill_session(name):
    subprocess.run(["tmux", "kill-session", "-t", name], capture_output=True)

def capture_pane(name, lines=40):
    r = subprocess.run(["tmux", "capture-pane", "-pt", name], capture_output=True, text=True)
    return "\n".join(r.stdout.splitlines()[-lines:]) if r.returncode == 0 else ""

def pipe_to_log(session, log_path):
    subprocess.run(["tmux", "pipe-pane", "-o", "-t", session, f"cat >> '{log_path}'"])

def ping_session(session, msg):
    """Type a message into a tmux session and submit it, retrying Enter until accepted."""
    if not session_alive(session):
        return
    prefix = msg[:28]
    subprocess.run(["tmux", "send-keys", "-t", session, "-l", "--", msg], capture_output=True)
    time.sleep(0.5)
    for _ in range(5):
        subprocess.run(["tmux", "send-keys", "-t", session, "Enter"], capture_output=True)
        time.sleep(1)
        if prefix not in capture_pane(session, 4):
            return  # message was accepted
        subprocess.run(["tmux", "send-keys", "-t", session, "C-m"], capture_output=True)
        time.sleep(0.5)

# ── phase helpers ─────────────────────────────────────────────────────────────

def cur_idx():
    try:
        v = Path(PHASE_IDX_FILE).read_text().strip()
        return int(v) if v.isdigit() else 0
    except FileNotFoundError:
        return 0

def phase_id(idx): return PHASES[idx][0]
def phase_plan(idx): return PHASES[idx][1]
def phase_status(idx): return PHASES[idx][2]
def all_ids(): return " ".join(p[0] for p in PHASES)

def resolve_state(repo_dir, basename):
    """Return the path to a loop-state file — machine-docs/ if present, else repo root."""
    p = Path(repo_dir) / "machine-docs" / basename
    return p if p.exists() else Path(repo_dir) / basename

def phase_done(status_basename):
    path = resolve_state(BUILDER_DIR, status_basename)
    try:
        # Use a context manager so the status file is closed promptly.
        with path.open() as f:
            return any(line.startswith("## DONE") for line in f)
    except FileNotFoundError:
        return False

# ── kickoff prompt ────────────────────────────────────────────────────────────

def build_kickoff(role, idx):
    pid, plan, status = phase_id(idx), phase_plan(idx), phase_status(idx)
    preamble = (
        f"*** cc-ci SUB-PHASE {pid} ***\n"
        f"SINGLE SOURCE OF TRUTH for THIS phase: /srv/cc-ci/cc-ci-plan/{plan} — read it in full "
        f"now; it defines this phase's mission and Definition of Done.\n"
        f"The general loop protocol still applies and lives in /srv/cc-ci/cc-ci-plan/plan.md "
        f"(§6.1 coordination, §7 pacing, §9 guardrails) — read those sections too.\n"
        f"Track loop state in PHASE-NAMESPACED files in your repo clone: {status}, "
        f"BACKLOG-{pid}.md, REVIEW-{pid}.md, JOURNAL-{pid}.md. DECISIONS.md is shared (append).\n"
        f'"Done" for this phase = the Builder writes "## DONE" to {status} ONLY after every '
        f"Definition-of-Done item is Adversary-verified with a fresh PASS in REVIEW-{pid}.md "
        f"(handshake per §6.1).\n"
        f"The repo's Phase-1 STATUS.md / BACKLOG.md / REVIEW.md are HISTORY from the completed "
        f"Phase 1 — do NOT use them as your state; use the phase-namespaced files above.\n"
        f'Wherever the standing rules below say "plan.md"/"STATUS.md"/"BACKLOG.md"/"REVIEW.md", '
        f"substitute the phase plan and these phase-namespaced files.\n\n"
        f"=== standing role & rules ===\n"
    )
    role_prompt = (Path(PLAN_DIR) / "prompts" / f"{role}.md").read_text()
    return preamble + role_prompt

# ── agent launch ──────────────────────────────────────────────────────────────

def start_agent(role, session, workdir):
    if session_alive(session):
        log(f"{session} already running — leaving it")
        return

    Path(workdir).mkdir(parents=True, exist_ok=True)
    Path(LOG_DIR).mkdir(parents=True, exist_ok=True)

    idx = cur_idx()
    pid, plan = phase_id(idx), phase_plan(idx)

    kf = Path(LOG_DIR) / f".kickoff-{session}.txt"
    kf.write_text(build_kickoff(role, idx))

    model_flag = f"--model '{LOOP_MODEL}'" if LOOP_MODEL else ""

    if BACKEND == "claude":
        rc = f"--remote-control '{session}'" if REMOTE_CONTROL else ""
        cmd = f"{CLAUDE_BIN} {rc} {model_flag} {CLAUDE_FLAGS} \"$(cat '{kf}')\""
        log(f"starting {session} (backend=claude, phase={pid}, plan={plan}, model={LOOP_MODEL or 'default'})")
    elif BACKEND == "opencode":
        cmd = (
            f"set -a; . /srv/cc-ci/.testenv; set +a; "
            f"{OPENCODE_BIN} {model_flag} run --attach '{OPENCODE_SERVER}' "
            f"--title '{session}' \"$(cat '{kf}')\""
        )
        log(f"starting {session} (backend=opencode, phase={pid}, model={LOOP_MODEL or 'default'})")
        log(f" visible at http://oc.commoninternet.net (tailnet only)")
    else:
        die(f"unknown BACKEND '{BACKEND}' — set LOOP_BACKEND=claude or LOOP_BACKEND=opencode")

    subprocess.run(["tmux", "new-session", "-d", "-s", session, "-c", workdir, cmd])
    pipe_to_log(session, f"{LOG_DIR}/{session}.log")

def start_loops():
    start_agent("builder", BUILDER_SESSION, BUILDER_DIR)
    start_agent("adversary", ADV_SESSION, ADV_DIR)

def stop_loops():
    for s in (BUILDER_SESSION, ADV_SESSION):
        if session_alive(s):
            log(f"killing {s}")
            kill_session(s)

# ── session healing ───────────────────────────────────────────────────────────

def heal_session(role, session, workdir):
    """Restart a dead session; kill+restart a FATAL-wedged one; nudge a limit-stalled one."""
    if not session_alive(session):
        log(f"{role} ({session}) gone — restarting (phase {phase_id(cur_idx())})")
        start_agent(role, session, workdir)
        return

    pane = capture_pane(session, 25)
    if ACTIVE_RE.search(pane):
        return  # actively working — leave it alone

    if FATAL_RE.search(pane):
        log(f"FATAL session-state error on {role} ({session}) — kill + restart fresh")
        kill_session(session)
        start_agent(role, session, workdir)
        return

    if LIMIT_RE.search(pane):
        log(f"limit-stall on {role} ({session}) — nudging to resume")
        ping_session(session,
            "watchdog: the usage/spend limit appears lifted — RESUME your loop now. "
            "Pull latest, re-read your phase STATUS/REVIEW files, and continue from where you "
            "stopped; re-arm your loop pacing.")

# ── stall detection ───────────────────────────────────────────────────────────

_idle_since: dict[str, float] = {}

def _parse_waiting_until(pane):
    """Extract the epoch timestamp from a WAITING-UNTIL marker, or None."""
    m = re.search(r"WAITING-UNTIL:\s*(\S+)", pane)
    if not m:
        return None
    try:
        ts = m.group(1)
        dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        return dt.timestamp()
    except Exception:
        return None

def stall_check_one(role, session, workdir):
    if not session_alive(session):
        _idle_since[session] = 0.0
        return

    now = time.time()
    pane = capture_pane(session, 40)

    if ACTIVE_RE.search(pane):
        _idle_since[session] = 0.0
        return

    since = _idle_since.get(session) or now
    _idle_since[session] = since
    idle = now - since

    until = _parse_waiting_until(pane)
    if until is not None:
        # Declared wait: only reboot once STALL_GRACE seconds past the stated time.
        # Never reboot before — that races with the healthy self-wake.
        if now <= until + STALL_GRACE:
            return
        reason = f"past its WAITING-UNTIL by {int(now - until)}s — self-wake did not fire"
    else:
        if idle < STALL_IDLE:
            return
        reason = f"idle {int(idle)}s with no WAITING-UNTIL marker"

    log(f"stall: {role} ({session}) {reason} — kill + reboot")
    kill_session(session)
    start_agent(role, session, workdir)
    _idle_since[session] = 0.0

def stall_check():
    stall_check_one("builder", BUILDER_SESSION, BUILDER_DIR)
    stall_check_one("adversary", ADV_SESSION, ADV_DIR)

# ── orchestrator healing ──────────────────────────────────────────────────────

def orchestrator_alive():
    """
    True if an orchestrator process is running anywhere.
    Conflict-safety: never launch a second orchestrator resuming the same session
    (double-resume causes "thinking blocks cannot be modified" crashes).
    """
    for line in subprocess.run("pgrep -x claude || true", shell=True,
                               capture_output=True, text=True).stdout.splitlines():
        pid = line.strip()
        if not pid:
            continue
        try:
            cmdline = Path(f"/proc/{pid}/cmdline").read_bytes().decode(errors="replace").replace("\0", " ")
            # Skip the loop sessions and the upgrader — they're not the orchestrator.
            if re.search(r"--remote-control\s+'?cc-ci-(builder|adv|upgrader)'?", cmdline):
                continue
            return True
        except Exception:
            pass
    return session_alive(ORCH_SESSION)

def heal_orchestrator():
    if not WATCH_ORCHESTRATOR:
        return
    if not Path(ORCH_LAUNCHER).is_file():
        return

    if orchestrator_alive():
        if session_alive(ORCH_SESSION):
            pane = capture_pane(ORCH_SESSION, 25)
            if ACTIVE_RE.search(pane):
                return
            if FATAL_RE.search(pane):
                log(f"FATAL session-state error on orchestrator ({ORCH_SESSION}) — kill + restart")
                kill_session(ORCH_SESSION)
                subprocess.run([ORCH_LAUNCHER, "start"], capture_output=True)
        return

    log(f"orchestrator not running — restarting via {ORCH_LAUNCHER}")
    subprocess.run([ORCH_LAUNCHER, "start"], capture_output=True)

# ── handoff signalling ────────────────────────────────────────────────────────

_last_sha = ""
_adv_inbox_seen = ""
_builder_inbox_seen = ""

def handoff_reset():
    global _last_sha, _adv_inbox_seen, _builder_inbox_seen
    _last_sha = _adv_inbox_seen = _builder_inbox_seen = ""

def _fetch_origin():
    subprocess.run(f"git -C {BUILDER_DIR!r} fetch -q origin", shell=True, capture_output=True)

def _show_pushed(path):
    """Read a file from origin/main (machine-docs/ first, then repo root)."""
    for loc in (f"origin/main:machine-docs/{path}", f"origin/main:{path}"):
        r = subprocess.run(
            f"git -C {BUILDER_DIR!r} show {loc!r}",
            shell=True, capture_output=True, text=True)
        if r.returncode == 0:
            return r.stdout
    return ""

def handoff_check():
    global _last_sha, _adv_inbox_seen, _builder_inbox_seen

    _fetch_origin()
    r = subprocess.run(
        f"git -C {BUILDER_DIR!r} rev-parse origin/main",
        shell=True, capture_output=True, text=True)
    head = r.stdout.strip()

    if head:
        if not _last_sha:
            _last_sha = head  # baseline silently on first tick
        elif head != _last_sha:
            subjects = subprocess.run(
                f"git -C {BUILDER_DIR!r} log --format=%s {_last_sha}..origin/main",
                shell=True, capture_output=True, text=True).stdout
            if re.search(r"^claim", subjects, re.MULTILINE | re.IGNORECASE):
                log("handoff: new claim(...) commit → pinging Adversary")
                ping_session(ADV_SESSION,
                    "watchdog ping: the Builder pushed a gate CLAIM (claim(...) commit). "
                    "Pull and verify the claimed gate now.")
            if re.search(r"^review", subjects, re.MULTILINE | re.IGNORECASE):
                log("handoff: new review(...) commit → pinging Builder")
                ping_session(BUILDER_SESSION,
                    "watchdog ping: the Adversary pushed a verdict/finding (review(...) commit). "
                    "Pull REVIEW and act — proceed if it PASSes your gate, address it if it's a finding.")
            _last_sha = head

    adv_inbox = _show_pushed("ADVERSARY-INBOX.md")
    builder_inbox = _show_pushed("BUILDER-INBOX.md")

    def md5(s): return hashlib.md5(s.encode()).hexdigest()

    if adv_inbox:
        h = md5(adv_inbox)
        if h != _adv_inbox_seen:
            log("handoff: ADVERSARY-INBOX.md changed → pinging Adversary")
            ping_session(ADV_SESSION,
                "watchdog ping: the Builder pushed machine-docs/ADVERSARY-INBOX.md — "
                "pull, read it, act, then delete the file (commit + push) to mark it consumed.")
            _adv_inbox_seen = h
    else:
        _adv_inbox_seen = ""

    if builder_inbox:
        h = md5(builder_inbox)
        if h != _builder_inbox_seen:
            log("handoff: BUILDER-INBOX.md changed → pinging Builder")
            ping_session(BUILDER_SESSION,
                "watchdog ping: the Adversary pushed machine-docs/BUILDER-INBOX.md — "
                "pull, read it, act, then delete the file (commit + push) to mark it consumed.")
            _builder_inbox_seen = h
    else:
        _builder_inbox_seen = ""

# ── watchdog loop ─────────────────────────────────────────────────────────────

def watchdog_loop():
    idx = cur_idx()
    log(f"watchdog up — phase={phase_id(idx)} [{idx+1}/{len(PHASES)}] "
        f"seq='{all_ids()}' signal={SIGNAL_INTERVAL}s heavy={WATCH_INTERVAL}s")

    elapsed = WATCH_INTERVAL  # force a heavy check on the first tick
    while True:
        handoff_check()
        stall_check()

        if elapsed >= WATCH_INTERVAL:
            elapsed = 0
            idx = cur_idx()
            pid = phase_id(idx)
            status = phase_status(idx)

            if phase_done(status):
                next_idx = idx + 1
                if next_idx < len(PHASES):
                    log(f"PHASE {pid} DONE — auto-transitioning to {phase_id(next_idx)}")
                    stop_loops()
                    Path(PHASE_IDX_FILE).write_text(str(next_idx))
                    handoff_reset()
                    start_loops()
                else:
                    log(f"PHASE SEQUENCE COMPLETE (last phase {pid} DONE) — stopping loops")
                    stop_loops()
                    ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                    Path(LOG_DIR, "SEQUENCE-COMPLETE").write_text(
                        f"cc-ci phase sequence complete {ts}. Phases: {all_ids()}. "
                        f"Loops stopped; entire build finished.\n")
                    log("watchdog exiting.")
                    return
            else:
                heal_session("builder", BUILDER_SESSION, BUILDER_DIR)
                heal_session("adversary", ADV_SESSION, ADV_DIR)
                heal_orchestrator()

        time.sleep(SIGNAL_INTERVAL)
        elapsed += SIGNAL_INTERVAL

def start_watchdog():
    if session_alive(WATCHDOG_SESSION):
        log("watchdog already running")
        return
    log("starting watchdog")
    script = Path(__file__).resolve()
    subprocess.run([
        "tmux", "new-session", "-d", "-s", WATCHDOG_SESSION, "-c", PLAN_DIR,
        f"exec >>'{LOG_DIR}/watchdog.log' 2>&1; python3 '{script}' watchdog"
    ])

# ── preflight ─────────────────────────────────────────────────────────────────

def preflight():
    import shutil
    if not shutil.which("tmux"):
        die("tmux not found")
    if BACKEND == "claude":
        if not shutil.which(CLAUDE_BIN):
            die(f"claude CLI not found — set CLAUDE_BIN (currently: {CLAUDE_BIN})")
    elif BACKEND == "opencode":
        if not Path(OPENCODE_BIN).exists():
            die(f"opencode not found at {OPENCODE_BIN}")
    else:
        die(f"unknown LOOP_BACKEND '{BACKEND}' — use 'claude' or 'opencode'")

    for phase in PHASES:
        plan = Path(PLAN_DIR) / phase[1]
        if not plan.exists():
            die(f"missing phase plan: {plan}")
    for prompt_file in ("builder.md", "adversary.md"):
        if not (Path(PLAN_DIR) / "prompts" / prompt_file).exists():
            die(f"missing {PLAN_DIR}/prompts/{prompt_file}")
    Path(LOG_DIR).mkdir(parents=True, exist_ok=True)

# ── status ────────────────────────────────────────────────────────────────────

def cmd_status():
    idx = cur_idx()
    pid = phase_id(idx)
    print(f" phase: {pid} [{idx+1}/{len(PHASES)}] plan={phase_plan(idx)} status={phase_status(idx)}")
    for s in (BUILDER_SESSION, ADV_SESSION, WATCHDOG_SESSION):
        state = "RUNNING" if session_alive(s) else "stopped"
        print(f" {s}: {state}")
    done_str = "## DONE" if phase_done(phase_status(idx)) else "in progress"
    print(f" phase {pid}: {done_str}")
    seq = Path(LOG_DIR) / "SEQUENCE-COMPLETE"
    if seq.exists():
        print(f" >>> {seq.read_text().strip()}")

# ── main ──────────────────────────────────────────────────────────────────────

def main():
    cmd = sys.argv[1] if len(sys.argv) > 1 else ""

    if cmd == "start":
        preflight()
        stop_loops()
        if os.environ.get("RESUME_PHASE") != "1":
            Path(PHASE_IDX_FILE).write_text("0")
        seq = Path(LOG_DIR) / "SEQUENCE-COMPLETE"
        if seq.exists():
            seq.unlink()
        start_loops()
        start_watchdog()
        log(f"started at phase {phase_id(cur_idx())}.")

    elif cmd == "watchdog":
        preflight()
        watchdog_loop()

    elif cmd == "status":
        cmd_status()

    elif cmd == "stop":
        stop_loops()
        if session_alive(WATCHDOG_SESSION):
            log(f"killing {WATCHDOG_SESSION}")
            kill_session(WATCHDOG_SESSION)
        log("stopped.")

    elif cmd == "logs":
        sub = sys.argv[2] if len(sys.argv) > 2 else ""
        log_files = {
            "builder": f"{LOG_DIR}/{BUILDER_SESSION}.log",
            "adversary": f"{LOG_DIR}/{ADV_SESSION}.log",
            "watchdog": f"{LOG_DIR}/watchdog.log",
        }
        if sub not in log_files:
            die("usage: launch.py logs builder|adversary|watchdog")
        os.execvp("tail", ["tail", "-f", log_files[sub]])

    else:
        print(f"""cc-ci loop launcher (phase-aware)

  launch.py start                              start loops + watchdog (RESUME_PHASE=1 to keep current phase)
  launch.py stop                               stop loops + watchdog
  launch.py status                             show phase + session state
  launch.py logs builder|adversary|watchdog    tail a log
  launch.py watchdog                           run watchdog in foreground

Backend: {BACKEND} Model: {LOOP_MODEL or '<default>'}
Phase sequence ({len(PHASES)} phases, auto-advance on ## DONE, stop after last):
  {all_ids()}
""")


if __name__ == "__main__":
    main()
@ -1,505 +1,3 @@
#!/usr/bin/env bash
#
# launch.sh — start and supervise the two cc-ci autonomous loops + a phase-aware watchdog.
#
# Model (see plan.md §6 / §6.1): two INDEPENDENT Claude Code sessions —
#   • Builder (tmux session: cc-ci-builder) working clone /srv/cc-ci/cc-ci
#   • Adversary (tmux session: cc-ci-adv) working clone /srv/cc-ci/cc-ci-adv
# coordinating only through the git repo on git.autonomic.zone.
#
# PHASES: the watchdog runs an ordered sequence of sub-phases (default: 1c → 1b → 1d → 1e → 2w → 2 → 2b → 3 → 4;
# 2w = warm-canonical/--quick, interjected; Phase 2 pauses for it then resumes).
# Each phase has its own plan + phase-namespaced loop-state files (STATUS-<id>.md etc.). When a phase's
# STATUS-<id>.md shows "## DONE", the watchdog AUTO-TRANSITIONS to the next phase; after the LAST
# phase (4, final review/polish/cleanup) it STOPS the loops and exits (end of the whole build).
#
# Four jobs: ITERATION (each agent's /loop), RESILIENCE (restart a dead loop), HANDOFF SIGNALLING
# (ping the waiting loop the moment its counterpart hands off), PHASE SEQUENCING (this file).
#
# Usage:
#   ./launch.sh start                              # start the sequence at phase 0 + watchdog (stops/relaunches loops)
#   ./launch.sh watchdog                           # run only the supervision loop in the foreground
#   ./launch.sh status                             # show phase + session + DONE state
#   ./launch.sh logs builder|adversary|watchdog    # tail a session/log
#   ./launch.sh stop                               # stop both loops + watchdog

set -euo pipefail

# Absolute path to this script, so the watchdog re-invokes it correctly regardless of cwd.
SELF="$(readlink -f "${BASH_SOURCE[0]}")"

# ----- config -------------------------------------------------------------
PLAN_DIR="${PLAN_DIR:-/srv/cc-ci/cc-ci-plan}"

# ----- backend selection ------------------------------------------------------
# LOOP_BACKEND: "claude" (default) or "opencode" (tinfoil/opencode web, tailscale-only).
# LOOP_MODEL: model to pass to the backend.
#   claude: e.g. "sonnet", "opus" (--model flag); empty = use CLI default.
#   opencode: "provider/model" e.g. "tinfoil/deepseek-v4-pro".
LOOP_BACKEND="${LOOP_BACKEND:-claude}"
LOOP_MODEL="${LOOP_MODEL:-}"

CLAUDE_BIN="${CLAUDE_BIN:-claude}"
OPENCODE_BIN="${OPENCODE_BIN:-/home/loops/.local/bin/opencode}"
# opencode web server listens on localhost (nginx proxies it at oc.commoninternet.net).
# One shared server hosts all sessions; agents attach with --attach.
OPENCODE_SERVER="${OPENCODE_SERVER:-http://127.0.0.1:4096}"
OPENCODE_PORT="${OPENCODE_PORT:-4096}"

# --dangerously-skip-permissions cannot be passed as a FLAG when running as root (claude blocks it).
# Use the env var form instead; detect root and switch automatically.
if [ "$(id -u)" = "0" ]; then
  export CLAUDE_DANGEROUSLY_SKIP_PERMISSIONS=1
  CLAUDE_FLAGS="${CLAUDE_FLAGS:-}"
else
  CLAUDE_FLAGS="${CLAUDE_FLAGS:---dangerously-skip-permissions}"
fi
# REMOTE_CONTROL=1 → interactive --remote-control sessions (viewable at claude.ai/code), required
# for /loop. The box must be logged into the claude.ai account. =0 for plain interactive.
# For opencode backend this controls whether to start the opencode web server.
REMOTE_CONTROL="${REMOTE_CONTROL:-1}"

BUILDER_DIR="${BUILDER_DIR:-/srv/cc-ci/cc-ci}" # Builder's repo clone
ADV_DIR="${ADV_DIR:-/srv/cc-ci/cc-ci-adv}" # Adversary's repo clone
LOG_DIR="${LOG_DIR:-/srv/cc-ci/.cc-ci-logs}"

WATCH_INTERVAL="${WATCH_INTERVAL:-300}"     # seconds between HEAVY checks (phase DONE / restart dead loops)
SIGNAL_INTERVAL="${SIGNAL_INTERVAL:-30}"    # seconds between HANDOFF checks (ping the waiting loop)
STALL_IDLE="${STALL_IDLE:-300}"             # NO-marker case: seconds a loop may sit idle (turn ended
                                            # without declaring a wait) before the watchdog reboots it
STALL_GRACE="${STALL_GRACE:-180}"           # marker case: seconds PAST a loop's WAITING-UNTIL before
                                            # reboot. The real ScheduleWakeup fires AT the stated time;
                                            # grace covers wake+start latency + marker/scheduler skew so
                                            # the watchdog never RACES (pre-empts) a healthy self-wake.

BUILDER_SESSION="cc-ci-builder"
ADV_SESSION="cc-ci-adv"
WATCHDOG_SESSION="cc-ci-watchdog"
# Orchestrator (supervisory session) — the watchdog keeps it alive too, via launch-orchestrator.sh.
ORCH_SESSION="${ORCH_SESSION:-cc-ci-orchestrator-vm}"
ORCH_LAUNCHER="${ORCH_LAUNCHER:-$PLAN_DIR/launch-orchestrator.sh}"
# Watchdog supervision of the orchestrator can be disabled (=0) if you run the orchestrator yourself
# and don't want it auto-(re)launched.
WATCH_ORCHESTRATOR="${WATCH_ORCHESTRATOR:-1}"

# Ordered phase sequence: each entry "id|planfile|statusbasename". The watchdog runs them in order,
# auto-transitions on the phase's "## DONE" (in BUILDER_DIR/<statusbasename>), and STOPS after the
# last one (manual gate). Override PHASES_SPEC (semicolon-separated) to change the sequence.
PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2pc|plan-phase2pc-image-cache.md|STATUS-2pc.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md;5|plan-phase5-verify-upgrade-flow.md|STATUS-5.md}"
IFS=';' read -r -a PHASES <<< "$PHASES_SPEC"
PHASE_IDX_FILE="${PHASE_IDX_FILE:-$LOG_DIR/.phase-idx}"
# --------------------------------------------------------------------------

log() { printf '[launch %(%H:%M:%S)T] %s\n' -1 "$*"; }
die() { log "ERROR: $*"; exit 1; }
need() { command -v "$1" >/dev/null 2>&1 || die "missing dependency: $1"; }

# ----- phase helpers ------------------------------------------------------
cur_idx() { local i; i="$(cat "$PHASE_IDX_FILE" 2>/dev/null || echo 0)"; [[ "$i" =~ ^[0-9]+$ ]] || i=0; echo "$i"; }
phase_id() { echo "${PHASES[$1]}" | cut -d'|' -f1; }
phase_plan() { echo "${PHASES[$1]}" | cut -d'|' -f2; }
phase_status() { echo "${PHASES[$1]}" | cut -d'|' -f3; }
phase_review() { echo "REVIEW-$(phase_id "$1").md"; }
# Loop-state files may sit at the repo root OR under machine-docs/ (the 1b RL6 move). Prefer
# machine-docs/ if present, else root — so the watchdog survives the move whenever it happens.
resolve_state() { local dir="$1" base="$2"; if [[ -f "$dir/machine-docs/$base" ]]; then echo "$dir/machine-docs/$base"; else echo "$dir/$base"; fi; }
phase_done() { grep -qE '^##[[:space:]]+DONE' "$(resolve_state "$BUILDER_DIR" "$1")" 2>/dev/null; } # $1 = status basename (read locally)
all_ids() { local p; for p in "${PHASES[@]}"; do printf '%s ' "$(echo "$p" | cut -d'|' -f1)"; done; }

preflight() {
  need tmux
  case "$LOOP_BACKEND" in
    claude) command -v "$CLAUDE_BIN" >/dev/null 2>&1 || die "claude CLI not found (set CLAUDE_BIN)" ;;
    opencode) command -v "$OPENCODE_BIN" >/dev/null 2>&1 || die "opencode not found at $OPENCODE_BIN; install from https://opencode.ai" ;;
    *) die "unknown LOOP_BACKEND '$LOOP_BACKEND' — use 'claude' or 'opencode'" ;;
  esac
  local p plan
  for p in "${PHASES[@]}"; do
    plan="$(echo "$p" | cut -d'|' -f2)"
    [[ -f "$PLAN_DIR/$plan" ]] || die "missing phase plan $PLAN_DIR/$plan"
  done
  [[ -f "$PLAN_DIR/prompts/builder.md" ]] || die "missing $PLAN_DIR/prompts/builder.md"
  [[ -f "$PLAN_DIR/prompts/adversary.md" ]] || die "missing $PLAN_DIR/prompts/adversary.md"
  mkdir -p "$LOG_DIR"
}

session_alive() { tmux has-session -t "$1" 2>/dev/null; }

# Build the per-session kickoff (phase preamble + base role prompt) and launch the agent.
# role ∈ {builder, adversary}.
# Backend "claude": prompt passed as positional arg via $(cat kf) — never stdin (piping breaks /loop).
# Backend "opencode": opencode serves a web UI on OPENCODE_HOST:OPENCODE_PORT (tailnet-only);
# each session gets a dedicated port offset (builder=+0, adversary=+1) so they don't collide.
# The kickoff prompt is passed via `opencode run <message>` in a detached tmux session; the web
# UI is accessible at http://OPENCODE_HOST:PORT for observation (like --remote-control).
start_agent() {
  local role="$1" session="$2" workdir="$3"
  if session_alive "$session"; then log "$session already running — leaving it"; return 0; fi
  mkdir -p "$workdir"
  local idx pid plan status kf
  idx="$(cur_idx)"; pid="$(phase_id "$idx")"; plan="$(phase_plan "$idx")"; status="$(phase_status "$idx")"
  kf="$LOG_DIR/.kickoff-$session.txt"
  {
    cat <<PREAMBLE
*** cc-ci SUB-PHASE ${pid} ***
SINGLE SOURCE OF TRUTH for THIS phase: /srv/cc-ci/cc-ci-plan/${plan} — read it in full now; it defines this phase's mission and Definition of Done.
The general loop protocol still applies and lives in /srv/cc-ci/cc-ci-plan/plan.md (§6.1 coordination, §7 pacing, §9 guardrails) — read those sections too.
Track loop state in PHASE-NAMESPACED files in your repo clone: ${status}, BACKLOG-${pid}.md, REVIEW-${pid}.md, JOURNAL-${pid}.md. DECISIONS.md is shared (append).
"Done" for this phase = the Builder writes "## DONE" to ${status} ONLY after every Definition-of-Done item is Adversary-verified with a fresh PASS in REVIEW-${pid}.md (handshake per §6.1).
The repo's Phase-1 STATUS.md / BACKLOG.md / REVIEW.md are HISTORY from the completed Phase 1 — do NOT use them as your state; use the phase-namespaced files above.
Wherever the standing rules below say "plan.md"/"STATUS.md"/"BACKLOG.md"/"REVIEW.md", substitute the phase plan and these phase-namespaced files.

=== standing role & rules ===
PREAMBLE
    cat "$PLAN_DIR/prompts/$role.md"
  } > "$kf"

  local model_flag=""
  [[ -n "$LOOP_MODEL" ]] && model_flag="--model '$LOOP_MODEL'"

  case "$LOOP_BACKEND" in
    claude)
      local rc=""
      [[ "$REMOTE_CONTROL" == "1" ]] && rc="--remote-control '$session'"
      log "starting $session (backend=claude, phase=$pid, model=${LOOP_MODEL:-default}, cwd=$workdir)"
      tmux new-session -d -s "$session" -c "$workdir" \
        "$CLAUDE_BIN $rc $model_flag $CLAUDE_FLAGS \"\$(cat '$kf')\""
      ;;
    opencode)
      # One shared opencode web server (opencode-web.service or manually started) hosts all sessions.
      # Each agent attaches to it as a named session visible in the web UI at oc.commoninternet.net.
      log "starting $session (backend=opencode, phase=$pid, model=${LOOP_MODEL:-default}, server=$OPENCODE_SERVER)"
      tmux new-session -d -s "$session" -c "$workdir" \
        "set -a; . /srv/cc-ci/.testenv; set +a; $OPENCODE_BIN $model_flag run --attach '$OPENCODE_SERVER' --title '$session' \"\$(cat '$kf')\""
      log "$session visible in web UI at http://oc.commoninternet.net (tailnet only)"
      ;;
  esac
  tmux pipe-pane -o -t "$session" "cat >> '$LOG_DIR/$session.log'"
}

start_loops() {
  start_agent builder "$BUILDER_SESSION" "$BUILDER_DIR"
  start_agent adversary "$ADV_SESSION" "$ADV_DIR"
}

stop_loops() {
  local s
  for s in "$BUILDER_SESSION" "$ADV_SESSION"; do
    if session_alive "$s"; then log "killing $s"; tmux kill-session -t "$s" || true; fi
  done
}

# Wake a loop by typing a message into its tmux session and SUBMITTING it. A single Enter after a
# long `send-keys -l` is often swallowed while the TUI ingests the paste (text left unsent in the
# input box), so retry Enter/C-m until the message's leading text is no longer in the input box.
ping_session() {
  local s="$1" msg="$2" prefix i
  session_alive "$s" || return 0
  prefix="${msg:0:28}"
  tmux send-keys -t "$s" -l -- "$msg" 2>/dev/null || return 0
  sleep 0.5
  for i in 1 2 3 4 5; do
    tmux send-keys -t "$s" Enter 2>/dev/null
    sleep 1
    tmux capture-pane -pt "$s" 2>/dev/null | tail -4 | grep -qF -- "$prefix" || return 0 # submitted
    tmux send-keys -t "$s" C-m 2>/dev/null; sleep 0.5
  done
}

# A loop can stall ALIVE on a usage/spend-limit notice: the claude process stays up (so the
# dead-session restart never fires) but makes no progress, and the /loop self-pacing is dead because
# the limit interrupted the turn that would have scheduled the next tick. Detect that signature
# (limit text present + no active-turn marker) and re-nudge it each heavy tick — once the limit resets
# the next nudge lands and the loop resumes. Gated on the limit text so we NEVER nudge a loop that is
# just legitimately idle-waiting on a handoff.
LIMIT_RE='spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)'
# FATAL = an unrecoverable session-state API error that recurs on EVERY turn (so the session stays
# alive but wedged — a nudge can't fix it; only a fresh session can). The confirmed case: the
# "thinking/redacted_thinking blocks ... cannot be modified" 400 that has hit the Adversary
# repeatedly (interrupted-mid-thinking corrupts the replayed history). Kill + restart fresh; the loop
# re-orients from the repo. Matched conservatively so it never fires on transient/working states.
FATAL_RE='redacted_thinking|blocks cannot be modified|cannot be modified'

# Heal one loop session: dead -> restart; wedged on a FATAL error -> kill + restart fresh; stalled on
# a usage limit -> nudge. No-op while actively working ("esc to interrupt" on screen).
heal_session() {
  local role="$1" s="$2" dir="$3" pane
  if ! session_alive "$s"; then
    log "$role ($s) gone — restarting (phase $(phase_id "$(cur_idx)"))"
    start_agent "$role" "$s" "$dir"; return 0
  fi
  pane="$(tmux capture-pane -pt "$s" 2>/dev/null | tail -25 || true)"
  # "esc to interrupt" = claude actively working; "running" or spinner chars = opencode actively working
  printf '%s\n' "$pane" | grep -qE 'esc to interrupt|⠋|⠙|⠹|⠸|⠼|⠴|⠦|⠧|⠇|⠏|Running tool' && return 0
  if printf '%s\n' "$pane" | grep -qiE "$FATAL_RE"; then
    log "FATAL session-state error on $role ($s) — kill + restart fresh (re-orients from repo)"
    tmux kill-session -t "$s" 2>/dev/null || true
    start_agent "$role" "$s" "$dir"; return 0
  fi
  if printf '%s\n' "$pane" | grep -qiE "$LIMIT_RE"; then
    log "limit-stall detected on $role ($s) — re-nudging to resume"
    ping_session "$s" "watchdog: the usage/spend limit appears lifted — RESUME your loop now. Pull latest, re-read your phase STATUS/REVIEW files, and continue from where you stopped; re-arm your loop pacing."
  fi
}

# --- Idle-wedge detection (complements heal_session's dead/FATAL/limit cases) ----------------------
# A loop can sit ALIVE but wedged — e.g. garbled output at the context limit — showing none of the
# heal_session signals (not dead, no FATAL string, no limit notice). The loops therefore DECLARE every
# wait with a final-line marker `WAITING-UNTIL: <ISO-8601 UTC>` and cap each wait at 10 min (plan §7).
# A healthy idle loop ALWAYS has a current marker as its last message; a wedge does not (or has one
# whose time has already passed). So: reboot a loop that has been idle (no "esc to interrupt") for
# >= STALL_IDLE seconds AND (has no WAITING-UNTIL marker OR is now past the time that marker named).
# Runs every signal tick (30 s) for fine resolution; rebooting is safe — the loop re-orients from
# git + its phase STATUS/REVIEW files.
declare -A _wd_idle_since # session -> epoch first seen idle this stretch (0/unset = working)

_parse_waiting_until() { # arg1 = pane text; echoes epoch seconds of the last marker, or nothing
  local line ts
  line="$(printf '%s\n' "$1" | grep -oE 'WAITING-UNTIL:[[:space:]]*[0-9][0-9T:Z+-]+' | tail -1)"
  [[ -n "$line" ]] || return 0
  ts="$(printf '%s' "${line#WAITING-UNTIL:}" | tr -d '[:space:]')"
  date -u -d "$ts" +%s 2>/dev/null || true
}

stall_check_one() {
  local role="$1" s="$2" dir="$3" pane now until idle since reason
  session_alive "$s" || { _wd_idle_since[$s]=0; return 0; } # dead => heal_session handles it
  now="$(printf '%(%s)T' -1)"
  pane="$(tmux capture-pane -pt "$s" 2>/dev/null | tail -40 || true)"
  if printf '%s\n' "$pane" | grep -q 'esc to interrupt'; then
    _wd_idle_since[$s]=0; return 0 # actively working — not idle
  fi
  since="${_wd_idle_since[$s]:-0}"
  if [[ "$since" == 0 ]]; then since="$now"; _wd_idle_since[$s]="$now"; fi
  idle=$(( now - since ))
  until="$(_parse_waiting_until "$pane")"
  if [[ -n "$until" ]]; then
    # Declared wait: the loop's own ScheduleWakeup fires AT 'until'. Reboot ONLY once we are
    # STALL_GRACE seconds PAST it — i.e. the self-wake genuinely failed. Never reboot before/at
    # 'until' (that races and pre-empts the healthy wake — the original false-reboot bug).
    (( now > until + STALL_GRACE )) || return 0
    reason="past its WAITING-UNTIL by $(( now - until ))s — self-wake did not fire"
  else
    # No declared wait: a turn ended without scheduling/declaring. Treat as a wedge once idle a while.
    (( idle >= STALL_IDLE )) || return 0
    reason="idle ${idle}s with no WAITING-UNTIL marker"
  fi
  log "stall: $role ($s) $reason — kill + reboot (re-orients from repo)"
  tmux kill-session -t "$s" 2>/dev/null || true
  start_agent "$role" "$s" "$dir"
  _wd_idle_since[$s]=0
}

stall_check() {
  stall_check_one builder "$BUILDER_SESSION" "$BUILDER_DIR"
  stall_check_one adversary "$ADV_SESSION" "$ADV_DIR"
}

# Is an orchestrator process alive ANYWHERE? Conflict-safety: we must NEVER launch a second
# orchestrator that resumes the same conversation while one is already running (that double-resume is
# the likely cause of the "thinking blocks cannot be modified" crashes). The orchestrator may be
# running as a managed tmux session (cc-ci-orchestrator) OR as a plain terminal session the operator
# started by hand (no flags). So: alive iff any `claude` process exists that is NOT one of the two
# loop sessions (identified by their --remote-control name), or the managed tmux session exists.
orchestrator_alive() {
  local pid args
  for pid in $(pgrep -x claude 2>/dev/null); do
    args="$(tr '\0' ' ' < "/proc/$pid/cmdline" 2>/dev/null || true)"
    # skip the loops + the one-shot upgrader job (matched by remote-control session NAME, not a
    # stray path mention) — none of these is the orchestrator.
    printf '%s' "$args" | grep -qE -- "--remote-control +'?cc-ci-(builder|adv|upgrader)'?" && continue
    return 0 # a non-loop claude process => orchestrator (or operator) is alive
  done
  tmux has-session -t "$ORCH_SESSION" 2>/dev/null && return 0
  return 1
}

# Keep the orchestrator alive: restart it (via launch-orchestrator.sh, which resumes its session) ONLY
# when none is running; if it's the managed tmux session and wedged on a FATAL error, kill+restart.
heal_orchestrator() {
  [[ "$WATCH_ORCHESTRATOR" == "1" ]] || return 0
  [[ -x "$ORCH_LAUNCHER" ]] || return 0
  if orchestrator_alive; then
    if tmux has-session -t "$ORCH_SESSION" 2>/dev/null; then
      local pane; pane="$(tmux capture-pane -pt "$ORCH_SESSION" 2>/dev/null | tail -25 || true)"
      printf '%s\n' "$pane" | grep -qE 'esc to interrupt|⠋|⠙|⠹|⠸|⠼|⠴|⠦|⠧|⠇|⠏|Running tool' && return 0
      if printf '%s\n' "$pane" | grep -qiE "$FATAL_RE"; then
        log "FATAL session-state error on orchestrator ($ORCH_SESSION) — kill + restart fresh"
        tmux kill-session -t "$ORCH_SESSION" 2>/dev/null || true
        "$ORCH_LAUNCHER" start >/dev/null 2>&1 || true
      fi
    fi
    return 0
  fi
  log "orchestrator not running anywhere — restarting via $ORCH_LAUNCHER"
  "$ORCH_LAUNCHER" start >/dev/null 2>&1 || true
}

# Detect handoffs against the PUSHED origin/main — i.e. exactly what the RECEIVER will pull — NOT the
# writer's local working tree. (Reading the working tree fired on a claim/verdict the writer hadn't
# pushed yet; the receiver then pulled a stale remote, saw "no formal gate", and a clarifying
# inbox round-trip ensued. Mirroring origin/main eliminates that race.) origin/main is the shared
# branch, so all four files are read from one clone's origin/main after a single best-effort fetch.
_wd_fetch_origin() { git -C "$1" fetch -q origin 2>/dev/null || true; }
_wd_show_pushed() { git -C "$1" show "origin/main:machine-docs/$2" 2>/dev/null || git -C "$1" show "origin/main:$2" 2>/dev/null || true; }

_wd_last_sha=""; _wd_adv_inbox_seen=""; _wd_builder_inbox_seen=""
handoff_reset() { _wd_last_sha=""; _wd_adv_inbox_seen=""; _wd_builder_inbox_seen=""; } # call on phase transition
# Signal handoffs off the loops' CONVENTIONAL COMMIT PREFIXES on origin/main — NOT by parsing
# free-form markdown prose (brittle). The loops consistently prefix every gate claim `claim(...)`
# and every verdict/finding `review(...)`. So: a new `claim(` commit pushed => ping the Adversary;
# a new `review(` commit => ping the Builder. Edge-triggered on the origin/main SHA (append-only —
# the loops never force-push), so it can't double-fire or mis-route. INBOX files are detected
# separately (which file changed routes the ping). All reads are of the PUSHED state (what the
# receiver pulls).
handoff_check() {
  local head subjects adv_inbox builder_inbox h
  _wd_fetch_origin "$BUILDER_DIR"
  head="$(git -C "$BUILDER_DIR" rev-parse origin/main 2>/dev/null || true)"
  if [[ -n "$head" ]]; then
    if [[ -z "$_wd_last_sha" ]]; then
      _wd_last_sha="$head" # baseline silently on first observation / restart
    elif [[ "$head" != "$_wd_last_sha" ]]; then
      subjects="$(git -C "$BUILDER_DIR" log --format='%s' "${_wd_last_sha}..origin/main" 2>/dev/null || true)"
      if printf '%s\n' "$subjects" | grep -qiE '^claim'; then
        log "handoff: new claim(...) commit on origin/main -> pinging Adversary"
        ping_session "$ADV_SESSION" "watchdog ping: the Builder pushed a gate CLAIM (claim(...) commit). Pull and verify the claimed gate now."
      fi
      if printf '%s\n' "$subjects" | grep -qiE '^review'; then
        log "handoff: new review(...) commit on origin/main -> pinging Builder"
        ping_session "$BUILDER_SESSION" "watchdog ping: the Adversary pushed a verdict/finding (review(...) commit). Pull REVIEW and act — proceed if it PASSes your gate, address it if it's a finding."
      fi
      _wd_last_sha="$head"
    fi
  fi

  adv_inbox="$(_wd_show_pushed "$BUILDER_DIR" "ADVERSARY-INBOX.md")"
  builder_inbox="$(_wd_show_pushed "$BUILDER_DIR" "BUILDER-INBOX.md")"

  # INBOX side-channel (§6.1), detected on the pushed state. Receiver deletes after consuming =>
  # absent on origin/main => re-arm so the next write re-pings.
  if [[ -n "$adv_inbox" ]]; then
    h="$(printf '%s' "$adv_inbox" | md5sum | awk '{print $1}')"
    if [[ "$h" != "$_wd_adv_inbox_seen" ]]; then
      log "handoff: ADVERSARY-INBOX.md new/changed (pushed) -> pinging Adversary"
      ping_session "$ADV_SESSION" "watchdog ping: the Builder pushed machine-docs/ADVERSARY-INBOX.md — pull, read it, act, then delete the file (commit + push) to mark it consumed."
      _wd_adv_inbox_seen="$h"
    fi
  else
    _wd_adv_inbox_seen=""
  fi
  if [[ -n "$builder_inbox" ]]; then
    h="$(printf '%s' "$builder_inbox" | md5sum | awk '{print $1}')"
    if [[ "$h" != "$_wd_builder_inbox_seen" ]]; then
      log "handoff: BUILDER-INBOX.md new/changed (pushed) -> pinging Builder"
      ping_session "$BUILDER_SESSION" "watchdog ping: the Adversary pushed machine-docs/BUILDER-INBOX.md — pull, read it, act, then delete the file (commit + push) to mark it consumed."
      _wd_builder_inbox_seen="$h"
    fi
  else
    _wd_builder_inbox_seen=""
  fi
}

watchdog_loop() {
  local idx pid status next
  idx="$(cur_idx)"; pid="$(phase_id "$idx")"
  log "watchdog up (phase=$pid [$((idx+1))/${#PHASES[@]}], seq='$(all_ids)', signal=${SIGNAL_INTERVAL}s, heavy=${WATCH_INTERVAL}s)"
  local elapsed="$WATCH_INTERVAL"
  while true; do
    handoff_check
    stall_check
    if (( elapsed >= WATCH_INTERVAL )); then
      elapsed=0
      idx="$(cur_idx)"; pid="$(phase_id "$idx")"; status="$(phase_status "$idx")"
      if phase_done "$status"; then
        next=$((idx + 1))
        if (( next < ${#PHASES[@]} )); then
          log "PHASE $pid DONE (## DONE in $status) — auto-transitioning to $(phase_id "$next")."
          stop_loops
          echo "$next" > "$PHASE_IDX_FILE"
          handoff_reset
          start_loops
        else
          log "PHASE SEQUENCE COMPLETE (last phase $pid DONE). Stopping loops — entire build (1c→3) finished."
          stop_loops
          printf 'cc-ci phase sequence complete %(%F %T)T. Phases: %s. Loops stopped; entire build finished.\n' -1 "$(all_ids)" > "$LOG_DIR/SEQUENCE-COMPLETE"
          log "watchdog exiting."
          exit 0
        fi
      else
        heal_session builder "$BUILDER_SESSION" "$BUILDER_DIR"
        heal_session adversary "$ADV_SESSION" "$ADV_DIR"
        heal_orchestrator
      fi
    fi
    sleep "$SIGNAL_INTERVAL"
    elapsed=$(( elapsed + SIGNAL_INTERVAL ))
  done
}

start_watchdog() {
  if session_alive "$WATCHDOG_SESSION"; then log "watchdog already running"; return 0; fi
  log "starting watchdog"
  tmux new-session -d -s "$WATCHDOG_SESSION" -c "$PLAN_DIR" \
    "exec >>'$LOG_DIR/watchdog.log' 2>&1; '$SELF' watchdog"
}

cmd_status() {
  local idx pid; idx="$(cur_idx)"; pid="$(phase_id "$idx")"
  echo " phase: $pid [$((idx+1))/${#PHASES[@]}] plan=$(phase_plan "$idx") status=$(phase_status "$idx")"
  local s
  for s in "$BUILDER_SESSION" "$ADV_SESSION" "$WATCHDOG_SESSION"; do
    if session_alive "$s"; then echo " $s: RUNNING"; else echo " $s: stopped"; fi
  done
  if phase_done "$(phase_status "$idx")"; then echo " phase $pid: ## DONE"; else echo " phase $pid: in progress"; fi
  [[ -f "$LOG_DIR/SEQUENCE-COMPLETE" ]] && echo " >>> $(cat "$LOG_DIR/SEQUENCE-COMPLETE")"
}

case "${1:-}" in
  start)
    preflight
    # Fresh sequence: stop any running loops, reset to phase 0 (unless RESUME_PHASE=1 keeps the idx).
    stop_loops
    if [[ "${RESUME_PHASE:-}" != "1" ]]; then echo 0 > "$PHASE_IDX_FILE"; fi
    rm -f "$LOG_DIR/SEQUENCE-COMPLETE"
    start_loops
    start_watchdog
    log "started at phase $(phase_id "$(cur_idx)"). status: ./launch.sh status | attach: tmux attach -t $BUILDER_SESSION"
    ;;
  watchdog) preflight; watchdog_loop ;;
  status) cmd_status ;;
  logs)
    case "${2:-}" in
      builder) tail -f "$LOG_DIR/$BUILDER_SESSION.log" ;;
      adversary) tail -f "$LOG_DIR/$ADV_SESSION.log" ;;
      watchdog) tail -f "$LOG_DIR/watchdog.log" ;;
      *) die "usage: $0 logs builder|adversary|watchdog" ;;
    esac
    ;;
  stop)
    stop_loops
    if session_alive "$WATCHDOG_SESSION"; then log "killing $WATCHDOG_SESSION"; tmux kill-session -t "$WATCHDOG_SESSION" || true; fi
    log "stopped."
    ;;
  *)
    cat <<EOF
cc-ci loop launcher (phase-aware)

  $0 start                              start the phase sequence at phase 0 + watchdog (stops any running loops first)
  $0 status                             show phase + session + DONE state
  $0 logs builder|adversary|watchdog    tail a log
  $0 stop                               stop both loops + watchdog
  $0 watchdog                           run supervision loop in foreground

Phase sequence (auto-transition on per-phase ## DONE; STOP after the last = manual gate):
  $(all_ids)
Env: LOOP_BACKEND=$LOOP_BACKEND LOOP_MODEL=${LOOP_MODEL:-<default>}
  claude: CLAUDE_BIN=$CLAUDE_BIN REMOTE_CONTROL=$REMOTE_CONTROL
  opencode: OPENCODE_BIN=$OPENCODE_BIN OPENCODE_SERVER=$OPENCODE_SERVER
    (one shared server; each session attaches with --title; web UI: http://oc.commoninternet.net)
  WATCH_INTERVAL=${WATCH_INTERVAL}s SIGNAL_INTERVAL=${SIGNAL_INTERVAL}s
  PHASES_SPEC='$PHASES_SPEC'
  RESUME_PHASE=1 to keep the current phase index instead of resetting to 0.
EOF
    ;;
esac
# Thin wrapper — delegates everything to launch.py in the same directory.
exec python3 "$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")/launch.py" "$@"
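
The idle-wedge rule documented in the removed launch.sh comments (reboot only once a loop is more than STALL_GRACE seconds past its declared WAITING-UNTIL time, or has been idle for at least STALL_IDLE seconds with no marker at all) can be sketched in Python roughly as follows. This is an illustrative sketch, not the code in launch.py: the function names `parse_waiting_until` and `should_reboot` are hypothetical, the defaults mirror the 300 s / 180 s values from the bash config block, and a marker without a timezone is assumed to be UTC (as `date -u -d` treated it).

```python
import re
from datetime import datetime, timezone

STALL_IDLE = 300   # seconds idle with no marker before a reboot (bash default)
STALL_GRACE = 180  # seconds past WAITING-UNTIL before a reboot (bash default)

# Same character class as the grep pattern in the removed _parse_waiting_until.
_MARKER = re.compile(r"WAITING-UNTIL:\s*([0-9][0-9T:Z+-]+)")


def parse_waiting_until(pane):
    """Return epoch seconds of the LAST marker in the pane text, or None."""
    matches = _MARKER.findall(pane)
    if not matches:
        return None
    raw = matches[-1]
    if raw.endswith("Z"):
        raw = raw[:-1] + "+00:00"  # older fromisoformat() rejects a bare "Z"
    try:
        parsed = datetime.fromisoformat(raw)
    except ValueError:
        return None  # unparseable marker is treated like no marker
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive = UTC
    return int(parsed.timestamp())


def should_reboot(now, idle_since, pane):
    """Decide whether an idle (not actively working) session should be rebooted."""
    until = parse_waiting_until(pane)
    if until is not None:
        # Declared wait: never pre-empt the loop's own wake-up; reboot only
        # once the grace period after the stated time has fully elapsed.
        return now > until + STALL_GRACE
    # No declared wait: treat as wedged once idle long enough.
    return now - idle_since >= STALL_IDLE
```

The caller is assumed to have already ruled out the "actively working" case (the "esc to interrupt" check) and to track `idle_since` per session, as the bash version did with `_wd_idle_since`.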