weekly-run: pre-reclaim stale cc-ci images + hourly glm-5.2 supervisor

Root-cause fix for the 2026-07-03 run stalling: the cc-ci host disk filled to
100% (ENOSPC) mid-run (Wave 6, lasuite-drive), the agent stopped to reclaim
space, and nothing resumed it — the log-idle/429 watchdog only covers opencode-go
usage-limit stalls, not an environmental wedge.

- launch-upgrader.py: step-0 prereclaim_cc_ci() prunes STALE cc-ci docker images
  (unused AND older than a week, so this week's likely-reused images stay) before
  each weekly run. Best-effort; env-tunable (UPGRADER_PRERECLAIM*).
- launch-supervisor.py (new): hourly glm-5.2 orchestrator wake-up. Cheap
  deterministic gate — no-ops (zero tokens) when the run is complete or
  progressing; only when a run stalled/died before completing does it launch a
  short-lived glm-5.2 agent to diagnose + drive it to a clean DONE. Progress is
  judged by live run-proc + log mtime (session_busy() is claude-tuned and misreads
  a headless opencode run as idle).
- configuration.nix: cc-ci-upgrade-supervisor service + hourly timer (:07).
- upgrade-all SKILL §0: note the stale-image reclaim for manual runs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WxbpH3DquKzoSTSwGvGuET
autonomic-bot
2026-07-04 04:33:05 +00:00
parent 52e7c954a3
commit 1bd156e7e6
4 changed files with 238 additions and 0 deletions


@ -81,6 +81,16 @@ fi
remains as belt-and-suspenders even after the /16 fix: it fires on the exact error signature and restarts
docker to reclaim leaked endpoints if VIP exhaustion ever recurs despite the larger subnet.)
Then **reclaim STALE docker images so the run can't fill the disk mid-flight.** A full run deploys
~16 recipes; their images accumulate week over week and can run the cc-ci root FS to 100% (ENOSPC),
which killed the 2026-07-03 run mid-way (lasuite-drive, Wave 6). Clear only **stale** images —
unused by any container AND older than a week — so this week's likely-reused images are kept:
```
ssh cc-ci 'docker image prune -af --filter until=168h 2>&1 | tail -1; df -h / | tail -1'
```
(When the run is launched via `launch-upgrader.py` this is done automatically as step 0 — the
`prereclaim_cc_ci()` pre-step — so you only run it by hand for a manual `/upgrade-all`.)
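To make the "stale" rule concrete, the selection can be sketched in Python. This only illustrates the filter logic described above and is not how Docker implements it: the `Image` record and the `stale_images` helper are hypothetical names, and the sketch assumes the `until` filter is compared against the image's creation timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Image:
    """Hypothetical stand-in for one image on the cc-ci host."""
    name: str
    created: datetime  # image creation time (assumed to be what `until` compares)
    in_use: bool       # True if at least one container references the image


def stale_images(images, now, max_age=timedelta(hours=168)):
    """Return the names of images that are unused AND older than max_age."""
    cutoff = now - max_age
    return [img.name for img in images if not img.in_use and img.created < cutoff]


now = datetime(2026, 7, 4, tzinfo=timezone.utc)
images = [
    Image("old-unused", now - timedelta(days=20), in_use=False),   # stale: pruned
    Image("old-in-use", now - timedelta(days=20), in_use=True),    # kept: a container uses it
    Image("fresh-unused", now - timedelta(days=2), in_use=False),  # kept: newer than a week
]
print(stale_images(images, now))
```

Both conditions must hold for an image to be evicted, which is why an old image that a container still uses, and an unused image from this week, both survive.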
## 1. Build the candidate list
Enrolled recipes = the cc-ci `tests/<recipe>/` dirs (same set `ci-test-review` sweeps), **MINUS any
recipe tagged `external` in `cc-ci-plan/used-recipes.md`** — recipes cc-ci uses/tests but does NOT


@ -0,0 +1,159 @@
#!/usr/bin/env python3
"""
cc-ci weekly-run SUPERVISOR — hourly glm-5.2 orchestrator wake-up.
Fired hourly by a systemd timer. It is a CHEAP deterministic GATE first: if this week's
/upgrade-all run is already complete, or is actively progressing, it does NOTHING and spends
ZERO model tokens. Only when the run has STALLED or died before completing — e.g. the host
disk-full crash on 2026-07-03 that the log-idle/429 watchdog does NOT cover — does it launch a
short-lived glm-5.2 opencode agent that DIAGNOSES the blockage (disk, wedged deploy, dead
session, a stuck recipe) and DRIVES the run to completion (resume the upgrader, ensure the
summary + public report land). One-shot per fire; the next hour re-checks and no-ops if healthy.
This is the intelligent complement to launch-upgrader.py's watchdog: the watchdog only handles
opencode-go usage-limit (429) stalls (wait-out + `--continue`); the supervisor handles everything
else that can wedge a weekly run, using a real model instead of a fixed heuristic.
Usage:
  launch-supervisor.py [check]   default — the timer entrypoint (gate; may spawn the agent)
  launch-supervisor.py force     skip the gate; always launch the supervisor agent
  launch-supervisor.py status    show what the gate currently sees
  launch-supervisor.py stop      kill the supervisor agent session
"""
import os, re, sys, time, subprocess, importlib.util
from datetime import datetime
from pathlib import Path
# ── reuse launch-upgrader's server/session/completion helpers (single source of truth) ──────────
_HERE = os.path.dirname(os.path.realpath(__file__))
os.environ.setdefault("UPGRADER_SESSION", "cc-ci-upgrader") # the run we supervise
_spec = importlib.util.spec_from_file_location("launch_upgrader", os.path.join(_HERE, "launch-upgrader.py"))
lu = importlib.util.module_from_spec(_spec); _spec.loader.exec_module(lu)
# ── config ──────────────────────────────────────────────────────────────────────────────────────
SUP_SESSION = os.environ.get("SUPERVISOR_SESSION", "cc-ci-supervisor")
WORKDIR = os.environ.get("UPGRADER_DIR", "/srv/cc-ci")
LOG_DIR = os.environ.get("LOG_DIR", "/srv/cc-ci/.cc-ci-logs")
MODEL = os.environ.get("SUPERVISOR_MODEL", "opencode-go/glm-5.2")
OPENCODE_BIN = lu.OPENCODE_BIN
OPENCODE_SERVER = lu.OPENCODE_SERVER
OPENCODE_SHARE = os.environ.get("OPENCODE_SHARE", "1") == "1"
# Don't auto-resurrect a run whose session is older than this — a genuinely abandoned run should not
# be dragged back to life days later; the operator will look. Covers the Thu-night → weekend window.
WINDOW_HOURS = float(os.environ.get("SUPERVISOR_WINDOW_HOURS", "96"))
def log(m): print(f"[supervisor {datetime.now():%H:%M:%S}] {m}", flush=True)
def _sh(c): return subprocess.run(c, capture_output=True, text=True)
# ── gate helpers ────────────────────────────────────────────────────────────────────────────────
def _session_created_ms(sid):
    rows = lu._server_get("/session") or []
    rows = rows if isinstance(rows, list) else rows.get("data", [])
    for s in rows:
        if s.get("id") == sid:
            return (s.get("time") or {}).get("created")
    return None
def _sup_alive(): return _sh(["tmux", "has-session", "-t", SUP_SESSION]).returncode == 0
def _sup_busy():
    r = _sh(["tmux", "capture-pane", "-pt", SUP_SESSION])
    return bool(re.search(r"esc to interrupt|⠋|⠙|⠹|⠸|⠼|⠴|⠦|⠧|⠇|⠏|Running tool", r.stdout)) if r.returncode == 0 else False
def _sup_kill(): _sh(["tmux", "kill-session", "-t", SUP_SESSION])
# ── the supervisor agent ─────────────────────────────────────────────────────────────────────────
def build_kickoff(sid, reason):
    return f"""\
*** cc-ci WEEKLY-RUN SUPERVISOR — one-shot, glm-5.2 ***
You are the hourly SUPERVISOR for the weekly cc-ci /upgrade-all run. A gate has determined the run
is INCOMPLETE and not currently progressing ({reason}). Your job: get this week's run to a clean
DONE — published report + summary — then STOP. You are NOT a perpetual loop.
Your cwd is {WORKDIR}; reach the CI server with `ssh cc-ci`; creds in {WORKDIR}/.testenv; skills in
{WORKDIR}/.claude/skills/. The stalled upgrader opencode session is {sid} (title "cc-ci-upgrader").
DO THIS, in order — stop as soon as the run is healthy again:
1. ENVIRONMENT FIRST. Check the CI server disk: `ssh cc-ci 'df -h / | tail -1'`. If root is > 85%
used, reclaim STALE images (unused AND older than a week, so this week's are kept):
`ssh cc-ci 'docker image prune -af --filter until=168h 2>&1 | tail -1; df -h / | tail -1'`.
Also glance for other infra wedges (a hung deploy, proxy VIP exhaustion — see upgrade-all §0).
2. ASSESS the run. Read the upgrader session's recent output (opencode server {OPENCODE_SERVER},
`GET /session/{sid}/message`) and the open recipe PRs to see which enrolled recipes already have
a PR this week and which remain. Do NOT redo any recipe that already has a PR.
3. DRIVE TO COMPLETION. Prefer to RESUME the existing run (context preserved) once the environment
is healthy: `python3 {WORKDIR}/cc-ci-plan/launch-upgrader.py resume`. Then CONFIRM it actually
restarted and is progressing (a fresh `opencode run … -s {sid} --continue` proc + the session
advancing). If the session is truly gone/unresumable, drive the remaining recipes yourself the
/upgrade-all way (per-recipe /recipe-upgrade DEFAULT-mode subagents, !testme verify), then make
sure the weekly summary is written to {WORKDIR}/.cc-ci-logs/upgrades/ and launch the public
report: `python3 {WORKDIR}/cc-ci-plan/launch-report.py fresh`.
4. If on inspection the run is actually FINE (progressing) or already COMPLETE, do NOTHING.
5. Print `SUPERVISOR DONE` and go idle. Do NOT loop.
GUARDRAILS: NEVER merge a PR. NEVER weaken a test. DEFAULT mode only. Single-writer on the shared
Swarm — don't pile concurrent deploys past DRONE_RUNNER_CAPACITY. Handing back to the resumed run
is preferred over doing the recipe work yourself — avoid two writers at once.
"""
def spawn_supervisor(sid, reason):
    Path(LOG_DIR).mkdir(parents=True, exist_ok=True)
    if _sup_alive():
        _sup_kill(); time.sleep(1)
    kf = Path(LOG_DIR) / f".kickoff-{SUP_SESSION}.txt"
    kf.write_text(build_kickoff(sid, reason))
    share = "--share" if OPENCODE_SHARE else ""
    cmd = (f"set -a; . {WORKDIR}/.testenv; set +a; {OPENCODE_BIN} run --model '{MODEL}' {share} "
           f"--attach '{OPENCODE_SERVER}' --title '{SUP_SESSION}' --dir {WORKDIR} \"$(cat '{kf}')\"")
    _sh(["tmux", "new-session", "-d", "-s", SUP_SESSION, "-c", WORKDIR, cmd])
    _sh(["tmux", "pipe-pane", "-o", "-t", SUP_SESSION, f"cat >> '{LOG_DIR}/{SUP_SESSION}.log'"])
    log(f"launched glm-5.2 supervisor (tmux '{SUP_SESSION}', model={MODEL}) — {reason}")
# ── gate ─────────────────────────────────────────────────────────────────────────────────────────
def _gate():
    """Return (should_spawn, sid, reason). Cheap — no model tokens."""
    sid = lu._session_id()
    if not sid:
        return False, None, "no cc-ci-upgrader session exists — nothing to supervise"
    if lu._completed():
        return False, sid, "weekly run COMPLETE (DONE marker present) — nothing to do"
    created = _session_created_ms(sid)
    age_h = (time.time() * 1000 - created) / 3.6e6 if created else 0.0
    if created and age_h > WINDOW_HOURS:
        return False, sid, f"incomplete run is {age_h:.0f}h old (> {WINDOW_HOURS:.0f}h window) — not auto-resurrecting"
    # "Progressing" for an opencode run is NOT session_busy() (its pane regex is claude-tuned and
    # misreads a headless `opencode run` as idle). Trust the run PROCESS + the session log's mtime:
    # a live `opencode run … -s <sid> --attach` proc, or a log touched within the stall window.
    pids = lu._run_pids(sid)
    idle = lu._log_idle_min()
    if pids or (idle is not None and idle < lu.STALL_MIN):
        via = f"{len(pids)} live run proc(s)" if pids else f"log idle {idle:.0f}m < {lu.STALL_MIN:.0f}m"
        return False, sid, f"upgrader run progressing ({via}) — leaving it"
    if _sup_alive() and _sup_busy():
        return False, sid, "a supervisor agent is already working — skip"
    idle_s = f"{idle:.0f}m" if idle is not None else "unknown"
    return True, sid, f"run INCOMPLETE + not progressing (log idle {idle_s}, age {age_h:.0f}h)"
def check(force=False):
    if force:
        sid = lu._session_id()
        spawn_supervisor(sid, "forced"); return
    should, sid, reason = _gate()
    log(reason)
    if should:
        spawn_supervisor(sid, reason)
# ── main ─────────────────────────────────────────────────────────────────────────────────────────
def main():
    cmd = sys.argv[1] if len(sys.argv) > 1 else "check"
    if cmd == "check": check()
    elif cmd == "force": check(force=True)
    elif cmd == "stop": _sup_kill(); log(f"{SUP_SESSION} stopped")
    elif cmd == "status":
        should, sid, reason = _gate()
        # Separate the verdict from the reason so the line does not read "skipweekly run ...".
        log(f"gate: would {'SPAWN' if should else 'skip'}: {reason}")
        log(f"supervisor session: {'RUNNING '+('(busy)' if _sup_busy() else '(idle)') if _sup_alive() else 'stopped'}")
        log(f"model: {MODEL} window: {WINDOW_HOURS:.0f}h")
    else:
        print(__doc__); sys.exit(2)
if __name__ == "__main__":
    main()
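
The order in which the gate rules out a spawn can be read in isolation as a pure function. The following is a simplified sketch rather than the script's code: `gate_decision` and its parameters are hypothetical names, the reasons are shortened, and the default `stall_min` of 30 minutes is an assumed placeholder because the value of `lu.STALL_MIN` is not shown in this diff.

```python
def gate_decision(has_session, completed, age_h, live_procs, idle_min,
                  sup_busy, window_h=96.0, stall_min=30.0):
    """Return (should_spawn, reason), checking conditions in the gate's order.

    stall_min=30.0 is an assumption for illustration only; window_h mirrors the
    SUPERVISOR_WINDOW_HOURS default of 96.
    """
    if not has_session:
        return False, "no session"
    if completed:
        return False, "complete"
    if age_h is not None and age_h > window_h:
        return False, "too old"
    # Progress = a live run process, or a log touched within the stall window.
    if live_procs > 0 or (idle_min is not None and idle_min < stall_min):
        return False, "progressing"
    if sup_busy:
        return False, "supervisor already working"
    return True, "stalled"


# A run with no live process and an unknown log age is treated as stalled.
print(gate_decision(True, False, 5.0, 0, None, False))
```

Every early return is a zero-token outcome; only the final line leads to a model being launched.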


@ -58,6 +58,15 @@ OPENCODE_SHARE = os.environ.get("OPENCODE_SHARE", "1") == "1"
UPGRADER_ARGS = os.environ.get("UPGRADER_ARGS", "")
# First step of the weekly run: reclaim STALE docker images on the cc-ci server BEFORE the run so a
# heavy run can't fill the disk mid-flight (root cause of the 2026-07-03 stall — 100% ENOSPC killed
# lasuite-drive + wedged the run). "Stale" = unused by any container AND older than PRERECLAIM_UNTIL,
# so recently-built/pulled images (the ones this week's tests will reuse) are KEPT — we only evict
# leftovers from prior weeks. Best-effort; never fails the run.
PRERECLAIM = os.environ.get("UPGRADER_PRERECLAIM", "1") == "1"
PRERECLAIM_UNTIL = os.environ.get("UPGRADER_PRERECLAIM_UNTIL", "168h") # 7d: older than one run ago
PRERECLAIM_HOST = os.environ.get("UPGRADER_PRERECLAIM_HOST", "cc-ci")
# ── helpers ───────────────────────────────────────────────────────────────────
def log(msg):
@ -83,6 +92,27 @@ def session_busy():
def kill_session():
    subprocess.run(["tmux", "kill-session", "-t", SESSION], capture_output=True)
def prereclaim_cc_ci():
    """Weekly-run step 0: prune STALE (unused AND older than PRERECLAIM_UNTIL) docker images on the
    cc-ci server so the run has disk headroom. Keeps recent images (reused this week); only clears
    prior-weeks' leftovers. Best-effort — a reclaim failure must never abort the run."""
    if not PRERECLAIM:
        return
    filt = f"--filter until={PRERECLAIM_UNTIL}"
    remote = (f"docker image prune -af {filt} 2>&1 | tail -1; "
              f"docker builder prune -af {filt} >/dev/null 2>&1 || true; "
              f"df -h / | tail -1")
    log(f" step 0: pre-reclaim stale docker images on {PRERECLAIM_HOST} (unused & >{PRERECLAIM_UNTIL})")
    try:
        r = subprocess.run(["ssh", "-o", "ConnectTimeout=15", PRERECLAIM_HOST, remote],
                           capture_output=True, text=True, timeout=900)
        out = (r.stdout or r.stderr or "").strip()
        for ln in out.splitlines():
            if ln.strip():
                log(f" {ln.strip()}")
    except Exception as e:
        log(f" pre-reclaim skipped (non-fatal): {e}")
# ── kickoff prompt ────────────────────────────────────────────────────────────
def build_kickoff():
@ -130,6 +160,11 @@ def start(mode="use-or-create"):
        kill_session()
        import time; time.sleep(1)
    # Step 0 of the weekly run: clear STALE cc-ci docker images so a heavy run can't run the disk
    # out mid-flight (root cause of the 2026-07-03 stall). Only for the actual upgrade run.
    if SESSION == "cc-ci-upgrader":
        prereclaim_cc_ci()
    kf = Path(LOG_DIR) / f".kickoff-{SESSION}.txt"
    kf.write_text(build_kickoff())
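
The "best-effort, never fails the run" contract of the pre-reclaim step can be shown on its own. This is a minimal sketch under stated assumptions, not the code from `launch-upgrader.py`: `best_effort` is a hypothetical helper, and it runs a local command instead of `ssh` so that it can be tried anywhere.

```python
import subprocess
import sys


def best_effort(cmd, timeout=30):
    """Run cmd and return its non-empty output lines.

    Any failure (missing binary, timeout, ...) is turned into a single
    'skipped' line instead of an exception, so the caller can always continue.
    """
    try:
        r = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        out = (r.stdout or r.stderr or "").strip()
        return [ln.strip() for ln in out.splitlines() if ln.strip()]
    except Exception as e:  # deliberately broad: this step must never abort the run
        return [f"skipped (non-fatal): {e}"]


ok = best_effort([sys.executable, "-c", "print('reclaimed')"])
missing = best_effort(["definitely-not-a-real-binary-xyz"])
print(ok)
print(missing)
```

A non-zero exit status is not treated as an error here either: whatever the command printed is passed through, and the caller proceeds regardless.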


@ -224,4 +224,38 @@ SSHCFG
      Persistent = true; # if the box was down at the scheduled time, run once on next boot
    };
  };
  # Hourly SUPERVISOR — a glm-5.2 orchestrator wake-up that keeps the weekly run on track. The
  # log-idle/429 watchdog only handles opencode-go usage-limit stalls; it does NOT cover a host
  # disk-full crash (which killed the 2026-07-03 run) or any other environmental wedge. This is a
  # CHEAP deterministic gate: if the weekly run is complete or actively progressing it does NOTHING
  # (zero model tokens). Only when a run has stalled/died before completing does it launch a
  # short-lived glm-5.2 agent that diagnoses the blockage and drives the run to a clean DONE.
  systemd.services.cc-ci-upgrade-supervisor = {
    description = "cc-ci hourly weekly-run supervisor (glm-5.2 drives a stalled /upgrade-all to completion)";
    after = [ "network-online.target" "tailscaled.service" ];
    wants = [ "network-online.target" ];
    serviceConfig = {
      Type = "oneshot"; # launch-supervisor.py check: gate now, spawn the agent into tmux, return
      User = "loops"; Group = "users";
      WorkingDirectory = "/srv/cc-ci";
      # Shares the weekly run's optional override file (e.g. SUPERVISOR_MODEL=…); "-" = optional.
      EnvironmentFile = "-/srv/cc-ci/upgrader.env";
    };
    environment = { HOME = "/home/loops"; };
    path = [ pkgs.bash pkgs.tmux pkgs.git pkgs.python3 pkgs.openssh pkgs.nettools ];
    script = ''
      export PATH="/home/loops/.local/bin:$PATH"
      python3 /srv/cc-ci/cc-ci-plan/launch-supervisor.py check >> /srv/cc-ci/.cc-ci-logs/supervisor-cron.log 2>&1
    '';
  };
  systemd.timers.cc-ci-upgrade-supervisor = {
    description = "Hourly trigger for cc-ci-upgrade-supervisor (weekly-run health check + drive)";
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnCalendar = "*-*-* *:07:00"; # every hour at :07 (offset from the weekly :00 fire)
      Persistent = false; # a missed hourly check is moot — the next hour re-checks
    };
  };
}
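
For intuition about the `*-*-* *:07:00` schedule, the next fire time can be approximated in a few lines of Python. This is a rough sketch only: `next_fire` is a hypothetical helper, it works on naive local datetimes, and it ignores systemd details such as timer accuracy and daylight-saving transitions.

```python
from datetime import datetime, timedelta


def next_fire(now, minute=7):
    """Return the next wall-clock time at hh:<minute>:00 strictly after `now`."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(hours=1)
    return candidate


# Relative to this commit's timestamp (04:33:05), the next check is at 05:07.
print(next_fire(datetime(2026, 7, 4, 4, 33, 5)))
```

Because `Persistent = false`, a check missed while the box was down is simply dropped; the following hourly fire covers it.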