weekly-run: pre-reclaim stale cc-ci images + hourly glm-5.2 supervisor

Root-cause fix for the 2026-07-03 run stalling: the cc-ci host disk filled to 100% (ENOSPC) mid-run (Wave 6, lasuite-drive), the agent stopped to reclaim space, and nothing resumed it. The log-idle/429 watchdog only covers opencode-go usage-limit stalls, not an environmental wedge.

- launch-upgrader.py: step-0 prereclaim_cc_ci() prunes STALE cc-ci docker images (unused AND older than a week, so this week's likely-reused images stay) before each weekly run. Best-effort; env-tunable (UPGRADER_PRERECLAIM*).
- launch-supervisor.py (new): hourly glm-5.2 orchestrator wake-up. Cheap deterministic gate: it no-ops (zero tokens) when the run is complete or progressing; only when a run stalled/died before completing does it launch a short-lived glm-5.2 agent to diagnose + drive it to a clean DONE. Progress is judged by live run-proc + log mtime (session_busy() is claude-tuned and misreads a headless opencode run as idle).
- configuration.nix: cc-ci-upgrade-supervisor service + hourly timer (:07).
- upgrade-all SKILL §0: note the stale-image reclaim for manual runs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WxbpH3DquKzoSTSwGvGuET
2026-07-04 04:33:05 +00:00
parent 52e7c954a3
commit 1bd156e7e6
4 changed files with 238 additions and 0 deletions
--- a/.claude/skills/upgrade-all/SKILL.md
+++ b/.claude/skills/upgrade-all/SKILL.md
@@ -81,6 +81,16 @@ fi
 remains as belt-and-suspenders even after the /16 fix: it fires on the exact error signature and restarts
 docker to reclaim leaked endpoints if VIP exhaustion ever recurs despite the larger subnet.)

+Then **reclaim STALE docker images so the run can't fill the disk mid-flight.** A full run deploys
+~16 recipes; their images accumulate week over week and can run the cc-ci root FS to 100% (ENOSPC),
+which killed the 2026-07-03 run mid-way (lasuite-drive, Wave 6). Clear only **stale** images —
+unused by any container AND older than a week — so this week's likely-reused images are kept:
+```
+ssh cc-ci 'docker image prune -af --filter until=168h 2>&1 | tail -1; df -h / | tail -1'
+```
+(When the run is launched via `launch-upgrader.py` this is done automatically as step 0 — the
+`prereclaim_cc_ci()` pre-step — so you only run it by hand for a manual `/upgrade-all`.)
+
 ## 1. Build the candidate list
 Enrolled recipes = the cc-ci `tests/<recipe>/` dirs (same set `ci-test-review` sweeps), **MINUS any
 recipe tagged `external` in `cc-ci-plan/used-recipes.md`** — recipes cc-ci uses/tests but does NOT
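
The "stale" rule added to SKILL §0 in the hunk above (unused by any container AND older than a week) can be sketched as a small predicate. This is an illustration only: `ImageInfo`, `is_stale`, and the sample data are hypothetical names that do not exist in the repo, and the sketch assumes the age is measured from the image's creation time, which is what docker's `until` prune filter compares against.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ImageInfo:
    """Hypothetical record for one image on the host (not a docker API type)."""
    name: str
    created: datetime   # image creation time (assumed basis for the age check)
    in_use: bool        # True if any container references the image


def is_stale(img: ImageInfo, now: datetime,
             max_age: timedelta = timedelta(hours=168)) -> bool:
    """Stale = unused AND created more than max_age ago (168h mirrors until=168h)."""
    return (not img.in_use) and (now - img.created) > max_age


# Made-up sample data, only to exercise the predicate.
now = datetime(2026, 7, 4, tzinfo=timezone.utc)
images = [
    ImageInfo("old-unused", now - timedelta(days=20), in_use=False),
    ImageInfo("old-in-use", now - timedelta(days=20), in_use=True),
    ImageInfo("fresh-unused", now - timedelta(days=2), in_use=False),
]
stale_names = [img.name for img in images if is_stale(img, now)]
print(stale_names)
```

Only the image that is both unused and older than the cutoff is selected; an old image still referenced by a container, or a fresh unused one, is kept.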
--- a/cc-ci-plan/launch-supervisor.py
+++ b/cc-ci-plan/launch-supervisor.py
@@ -0,0 +1,159 @@
+#!/usr/bin/env python3
+"""
+cc-ci weekly-run SUPERVISOR — hourly glm-5.2 orchestrator wake-up.
+
+Fired hourly by a systemd timer. It is a CHEAP deterministic GATE first: if this week's
+/upgrade-all run is already complete, or is actively progressing, it does NOTHING and spends
+ZERO model tokens. Only when the run has STALLED or died before completing — e.g. the host
+disk-full crash on 2026-07-03 that the log-idle/429 watchdog does NOT cover — does it launch a
+short-lived glm-5.2 opencode agent that DIAGNOSES the blockage (disk, wedged deploy, dead
+session, a stuck recipe) and DRIVES the run to completion (resume the upgrader, ensure the
+summary + public report land). One-shot per fire; the next hour re-checks and no-ops if healthy.
+
+This is the intelligent complement to launch-upgrader.py's watchdog: the watchdog only handles
+opencode-go usage-limit (429) stalls (wait-out + `--continue`); the supervisor handles everything
+else that can wedge a weekly run, using a real model instead of a fixed heuristic.
+
+Usage:
+  launch-supervisor.py [check]   default — the timer entrypoint (gate; may spawn the agent)
+  launch-supervisor.py force     skip the gate; always launch the supervisor agent
+  launch-supervisor.py status    show what the gate currently sees
+  launch-supervisor.py stop      kill the supervisor agent session
+"""
+import os, re, sys, time, subprocess, importlib.util
+from datetime import datetime
+from pathlib import Path
+
+# ── reuse launch-upgrader's server/session/completion helpers (single source of truth) ──────────
+_HERE = os.path.dirname(os.path.realpath(__file__))
+os.environ.setdefault("UPGRADER_SESSION", "cc-ci-upgrader")   # the run we supervise
+_spec = importlib.util.spec_from_file_location("launch_upgrader", os.path.join(_HERE, "launch-upgrader.py"))
+lu = importlib.util.module_from_spec(_spec); _spec.loader.exec_module(lu)
+
+# ── config ──────────────────────────────────────────────────────────────────────────────────────
+SUP_SESSION   = os.environ.get("SUPERVISOR_SESSION", "cc-ci-supervisor")
+WORKDIR       = os.environ.get("UPGRADER_DIR", "/srv/cc-ci")
+LOG_DIR       = os.environ.get("LOG_DIR",      "/srv/cc-ci/.cc-ci-logs")
+MODEL         = os.environ.get("SUPERVISOR_MODEL", "opencode-go/glm-5.2")
+OPENCODE_BIN  = lu.OPENCODE_BIN
+OPENCODE_SERVER = lu.OPENCODE_SERVER
+OPENCODE_SHARE  = os.environ.get("OPENCODE_SHARE", "1") == "1"
+# Don't auto-resurrect a run whose session is older than this — a genuinely abandoned run should not
+# be dragged back to life days later; the operator will look. Covers the Thu-night → weekend window.
+WINDOW_HOURS  = float(os.environ.get("SUPERVISOR_WINDOW_HOURS", "96"))
+
+def log(m): print(f"[supervisor {datetime.now():%H:%M:%S}] {m}", flush=True)
+def _sh(c): return subprocess.run(c, capture_output=True, text=True)
+
+# ── gate helpers ────────────────────────────────────────────────────────────────────────────────
+def _session_created_ms(sid):
+    rows = lu._server_get("/session") or []
+    rows = rows if isinstance(rows, list) else rows.get("data", [])
+    for s in rows:
+        if s.get("id") == sid:
+            return (s.get("time") or {}).get("created")
+    return None
+
+def _sup_alive(): return _sh(["tmux", "has-session", "-t", SUP_SESSION]).returncode == 0
+def _sup_busy():
+    r = _sh(["tmux", "capture-pane", "-pt", SUP_SESSION])
+    return bool(re.search(r"esc to interrupt|⠋|⠙|⠹|⠸|⠼|⠴|⠦|⠧|⠇|⠏|Running tool", r.stdout)) if r.returncode == 0 else False
+def _sup_kill(): _sh(["tmux", "kill-session", "-t", SUP_SESSION])
+
+# ── the supervisor agent ─────────────────────────────────────────────────────────────────────────
+def build_kickoff(sid, reason):
+    return f"""\
+*** cc-ci WEEKLY-RUN SUPERVISOR — one-shot, glm-5.2 ***
+You are the hourly SUPERVISOR for the weekly cc-ci /upgrade-all run. A gate has determined the run
+is INCOMPLETE and not currently progressing ({reason}). Your job: get this week's run to a clean
+DONE — published report + summary — then STOP. You are NOT a perpetual loop.
+
+Your cwd is {WORKDIR}; reach the CI server with `ssh cc-ci`; creds in {WORKDIR}/.testenv; skills in
+{WORKDIR}/.claude/skills/. The stalled upgrader opencode session is {sid} (title "cc-ci-upgrader").
+
+DO THIS, in order — stop as soon as the run is healthy again:
+1. ENVIRONMENT FIRST. Check the CI server disk: `ssh cc-ci 'df -h / | tail -1'`. If root is > 85%
+   used, reclaim STALE images (unused AND older than a week, so this week's are kept):
+   `ssh cc-ci 'docker image prune -af --filter until=168h 2>&1 | tail -1; df -h / | tail -1'`.
+   Also glance for other infra wedges (a hung deploy, proxy VIP exhaustion — see upgrade-all §0).
+2. ASSESS the run. Read the upgrader session's recent output (opencode server {OPENCODE_SERVER},
+   `GET /session/{sid}/message`) and the open recipe PRs to see which enrolled recipes already have
+   a PR this week and which remain. Do NOT redo any recipe that already has a PR.
+3. DRIVE TO COMPLETION. Prefer to RESUME the existing run (context preserved) once the environment
+   is healthy: `python3 {WORKDIR}/cc-ci-plan/launch-upgrader.py resume`. Then CONFIRM it actually
+   restarted and is progressing (a fresh `opencode run … -s {sid} --continue` proc + the session
+   advancing). If the session is truly gone/unresumable, drive the remaining recipes yourself the
+   /upgrade-all way (per-recipe /recipe-upgrade DEFAULT-mode subagents, !testme verify), then make
+   sure the weekly summary is written to {WORKDIR}/.cc-ci-logs/upgrades/ and launch the public
+   report: `python3 {WORKDIR}/cc-ci-plan/launch-report.py fresh`.
+4. If on inspection the run is actually FINE (progressing) or already COMPLETE, do NOTHING.
+5. Print `SUPERVISOR DONE` and go idle. Do NOT loop.
+
+GUARDRAILS: NEVER merge a PR. NEVER weaken a test. DEFAULT mode only. Single-writer on the shared
+Swarm — don't pile concurrent deploys past DRONE_RUNNER_CAPACITY. Handing back to the resumed run
+is preferred over doing the recipe work yourself — avoid two writers at once.
+"""
+
+def spawn_supervisor(sid, reason):
+    Path(LOG_DIR).mkdir(parents=True, exist_ok=True)
+    if _sup_alive():
+        _sup_kill(); time.sleep(1)
+    kf = Path(LOG_DIR) / f".kickoff-{SUP_SESSION}.txt"
+    kf.write_text(build_kickoff(sid, reason))
+    share = "--share" if OPENCODE_SHARE else ""
+    cmd = (f"set -a; . {WORKDIR}/.testenv; set +a; {OPENCODE_BIN} run --model '{MODEL}' {share} "
+           f"--attach '{OPENCODE_SERVER}' --title '{SUP_SESSION}' --dir {WORKDIR} \"$(cat '{kf}')\"")
+    _sh(["tmux", "new-session", "-d", "-s", SUP_SESSION, "-c", WORKDIR, cmd])
+    _sh(["tmux", "pipe-pane", "-o", "-t", SUP_SESSION, f"cat >> '{LOG_DIR}/{SUP_SESSION}.log'"])
+    log(f"launched glm-5.2 supervisor (tmux '{SUP_SESSION}', model={MODEL}) — {reason}")
+
+# ── gate ─────────────────────────────────────────────────────────────────────────────────────────
+def _gate():
+    """Return (should_spawn, sid, reason). Cheap — no model tokens."""
+    sid = lu._session_id()
+    if not sid:
+        return False, None, "no cc-ci-upgrader session exists — nothing to supervise"
+    if lu._completed():
+        return False, sid, "weekly run COMPLETE (DONE marker present) — nothing to do"
+    created = _session_created_ms(sid)
+    age_h = (time.time() * 1000 - created) / 3.6e6 if created else 0.0
+    if created and age_h > WINDOW_HOURS:
+        return False, sid, f"incomplete run is {age_h:.0f}h old (> {WINDOW_HOURS:.0f}h window) — not auto-resurrecting"
+    # "Progressing" for an opencode run is NOT session_busy() (its pane regex is claude-tuned and
+    # misreads a headless `opencode run` as idle). Trust the run PROCESS + the session log's mtime:
+    # a live `opencode run … -s <sid> --attach` proc, or a log touched within the stall window.
+    pids = lu._run_pids(sid)
+    idle = lu._log_idle_min()
+    if pids or (idle is not None and idle < lu.STALL_MIN):
+        via = f"{len(pids)} live run proc(s)" if pids else f"log idle {idle:.0f}m < {lu.STALL_MIN:.0f}m"
+        return False, sid, f"upgrader run progressing ({via}) — leaving it"
+    if _sup_alive() and _sup_busy():
+        return False, sid, "a supervisor agent is already working — skip"
+    idle_s = f"{idle:.0f}m" if idle is not None else "unknown"
+    return True, sid, f"run INCOMPLETE + not progressing (log idle {idle_s}, age {age_h:.0f}h)"
+
+def check(force=False):
+    if force:
+        sid = lu._session_id()
+        spawn_supervisor(sid, "forced"); return
+    should, sid, reason = _gate()
+    log(reason)
+    if should:
+        spawn_supervisor(sid, reason)
+
+# ── main ─────────────────────────────────────────────────────────────────────────────────────────
+def main():
+    cmd = sys.argv[1] if len(sys.argv) > 1 else "check"
+    if cmd == "check":    check()
+    elif cmd == "force":  check(force=True)
+    elif cmd == "stop":   _sup_kill(); log(f"{SUP_SESSION} stopped")
+    elif cmd == "status":
+        should, sid, reason = _gate()
+        log(f"gate: would {'SPAWN' if should else 'skip'} — {reason}")
+        log(f"supervisor session: {'RUNNING '+('(busy)' if _sup_busy() else '(idle)') if _sup_alive() else 'stopped'}")
+        log(f"model: {MODEL}  window: {WINDOW_HOURS:.0f}h")
+    else:
+        print(__doc__); sys.exit(2)
+
+if __name__ == "__main__":
+    main()
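
The check order in `_gate()` above can be exercised without a server, tmux, or the `launch_upgrader` module by restating it as a pure function. The sketch below is a hedged restatement, not code from the commit: `gate_decision` and its parameters are hypothetical names, the reasons are shortened, and the default `stall_min=30.0` is an assumed placeholder because the real threshold lives in `lu.STALL_MIN`, which this diff does not show.

```python
from typing import Optional, Tuple


def gate_decision(
    has_session: bool,
    completed: bool,
    age_hours: Optional[float],      # None when the creation time is unknown
    live_procs: int,
    log_idle_min: Optional[float],   # None when the log idle time is unknown
    supervisor_busy: bool,
    window_hours: float = 96.0,      # matches the SUPERVISOR_WINDOW_HOURS default
    stall_min: float = 30.0,         # ASSUMED value; the real one is lu.STALL_MIN
) -> Tuple[bool, str]:
    """Return (should_spawn, reason) using the same check order as _gate()."""
    if not has_session:
        return False, "no session"
    if completed:
        return False, "complete"
    if age_hours is not None and age_hours > window_hours:
        return False, "too old"
    if live_procs > 0 or (log_idle_min is not None and log_idle_min < stall_min):
        return False, "progressing"
    if supervisor_busy:
        return False, "supervisor busy"
    return True, "stalled"


# A 20h-old incomplete run with no process and a long-idle log: spawn.
print(gate_decision(True, False, 20.0, 0, 240.0, False))
# Same run, but one live run process: leave it alone.
print(gate_decision(True, False, 20.0, 1, 240.0, False))
# Same run, but older than the 96h window: do not auto-resurrect.
print(gate_decision(True, False, 120.0, 0, 240.0, False))
```

Keeping the decision separate from the probes that feed it (session lookup, process scan, log mtime) is one way to unit-test the ordering, for example to confirm that a completed run short-circuits before any staleness check.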
--- a/cc-ci-plan/launch-upgrader.py
+++ b/cc-ci-plan/launch-upgrader.py
@@ -58,6 +58,15 @@ OPENCODE_SHARE = os.environ.get("OPENCODE_SHARE", "1") == "1"

 UPGRADER_ARGS = os.environ.get("UPGRADER_ARGS", "")

+# First step of the weekly run: reclaim STALE docker images on the cc-ci server BEFORE the run so a
+# heavy run can't fill the disk mid-flight (root cause of the 2026-07-03 stall — 100% ENOSPC killed
+# lasuite-drive + wedged the run). "Stale" = unused by any container AND created more than
+# PRERECLAIM_UNTIL ago (docker's `until` filter keys on image creation time, not pull time), so
+# recently built images are KEPT and older unused ones are evicted. Best-effort; never fails the run.
+PRERECLAIM       = os.environ.get("UPGRADER_PRERECLAIM", "1") == "1"
+PRERECLAIM_UNTIL = os.environ.get("UPGRADER_PRERECLAIM_UNTIL", "168h")  # 7d: older than one run ago
+PRERECLAIM_HOST  = os.environ.get("UPGRADER_PRERECLAIM_HOST", "cc-ci")
+
 # ── helpers ───────────────────────────────────────────────────────────────────

 def log(msg):
@@ -83,6 +92,27 @@ def session_busy():
 def kill_session():
    subprocess.run(["tmux", "kill-session", "-t", SESSION], capture_output=True)

+def prereclaim_cc_ci():
+    """Weekly-run step 0: prune STALE (unused AND older than PRERECLAIM_UNTIL) docker images on the
+    cc-ci server so the run has disk headroom. Keeps recent images (reused this week); only clears
+    prior-weeks' leftovers. Best-effort — a reclaim failure must never abort the run."""
+    if not PRERECLAIM:
+        return
+    filt = f"--filter until={PRERECLAIM_UNTIL}"
+    remote = (f"docker image prune -af {filt} 2>&1 | tail -1; "
+              f"docker builder prune -af {filt} >/dev/null 2>&1 || true; "
+              f"df -h / | tail -1")
+    log(f"  step 0: pre-reclaim stale docker images on {PRERECLAIM_HOST} (unused & >{PRERECLAIM_UNTIL})")
+    try:
+        r = subprocess.run(["ssh", "-o", "ConnectTimeout=15", PRERECLAIM_HOST, remote],
+                           capture_output=True, text=True, timeout=900)
+        out = (r.stdout or r.stderr or "").strip()
+        for ln in out.splitlines():
+            if ln.strip():
+                log(f"    {ln.strip()}")
+    except Exception as e:
+        log(f"    pre-reclaim skipped (non-fatal): {e}")
+
 # ── kickoff prompt ────────────────────────────────────────────────────────────

 def build_kickoff():
@@ -130,6 +160,11 @@ def start(mode="use-or-create"):
        kill_session()
        import time; time.sleep(1)

+    # Step 0 of the weekly run: clear STALE cc-ci docker images so a heavy run can't run the disk
+    # out mid-flight (root cause of the 2026-07-03 stall). Only for the actual upgrade run.
+    if SESSION == "cc-ci-upgrader":
+        prereclaim_cc_ci()
+
    kf = Path(LOG_DIR) / f".kickoff-{SESSION}.txt"
    kf.write_text(build_kickoff())

--- a/nix/hosts/cc-ci-orchestrator-hetzner/configuration.nix
+++ b/nix/hosts/cc-ci-orchestrator-hetzner/configuration.nix
@@ -224,4 +224,38 @@ SSHCFG
      Persistent = true;  # if the box was down at the scheduled time, run once on next boot
    };
  };
+
+  # Hourly SUPERVISOR — a glm-5.2 orchestrator wake-up that keeps the weekly run on track. The
+  # log-idle/429 watchdog only handles opencode-go usage-limit stalls; it does NOT cover a host
+  # disk-full crash (which killed the 2026-07-03 run) or any other environmental wedge. This is a
+  # CHEAP deterministic gate: if the weekly run is complete or actively progressing it does NOTHING
+  # (zero model tokens). Only when a run has stalled/died before completing does it launch a
+  # short-lived glm-5.2 agent that diagnoses the blockage and drives the run to a clean DONE.
+  systemd.services.cc-ci-upgrade-supervisor = {
+    description = "cc-ci hourly weekly-run supervisor (glm-5.2 — drives a stalled /upgrade-all to completion)";
+    after = [ "network-online.target" "tailscaled.service" ];
+    wants = [ "network-online.target" ];
+    serviceConfig = {
+      Type = "oneshot";  # launch-supervisor.py check: gate now, spawn the agent into tmux, return
+      User = "loops"; Group = "users";
+      WorkingDirectory = "/srv/cc-ci";
+      # Shares the weekly run's optional override file (e.g. SUPERVISOR_MODEL=…); "-" = optional.
+      EnvironmentFile = "-/srv/cc-ci/upgrader.env";
+    };
+    environment = { HOME = "/home/loops"; };
+    path = [ pkgs.bash pkgs.tmux pkgs.git pkgs.python3 pkgs.openssh pkgs.nettools ];
+    script = ''
+      export PATH="/home/loops/.local/bin:$PATH"
+      python3 /srv/cc-ci/cc-ci-plan/launch-supervisor.py check >> /srv/cc-ci/.cc-ci-logs/supervisor-cron.log 2>&1
+    '';
+  };
+
+  systemd.timers.cc-ci-upgrade-supervisor = {
+    description = "Hourly trigger for cc-ci-upgrade-supervisor (weekly-run health check + drive)";
+    wantedBy = [ "timers.target" ];
+    timerConfig = {
+      OnCalendar = "*-*-* *:07:00";  # every hour at :07 (offset from the weekly :00 fire)
+      Persistent = false;            # a missed hourly check is moot — the next hour re-checks
+    };
+  };
 }
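
To reason about when the hourly timer above will next run, the `OnCalendar = "*-*-* *:07:00"` schedule can be approximated in a few lines. This is a simplified model, not how systemd computes it: `next_fire` is a hypothetical helper, and it ignores time zones, DST, and any timer accuracy or randomized delay settings.

```python
from datetime import datetime, timedelta


def next_fire(now: datetime, minute: int = 7) -> datetime:
    """Next time strictly after `now` whose minute is `minute` and second is 0."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(hours=1)
    return candidate


# The commit timestamp from the header, 04:33:05, lands on the 05:07 slot.
print(next_fire(datetime(2026, 7, 4, 4, 33, 5)))
# A time just before :07 lands on the same hour's slot.
print(next_fire(datetime(2026, 7, 4, 4, 3, 0)))
```

With `Persistent = false`, a slot missed while the box was down is simply skipped, so the longest gap between checks is the downtime plus up to one hour.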