Live-testing the resume path surfaced two gaps: (1) an `opencode run` proc
EXITS when the model ends its turn, so a long /upgrade-all run's process dies
repeatedly before the whole run completes — and the log mtime freezes on death,
so the watchdog's log-idle>15min signal is both too slow and unreliable. (2) A
resumed run had no watchdog, so nothing re-continued it.
- watchdog(): detect PROC-DEATH (no live `opencode run` proc for the session +
not completed) and resume promptly, in addition to log-idle. Guarded by
MAX_RESUMES (default 20) so a no-progress loop (e.g. disk-full) eventually hands
off to the supervisor/operator instead of spinning forever.
- resume(): auto-spawn a watchdog if none is alive (skips when the watchdog itself
called resume — it lives in {SESSION}-watchdog — so no duplicate).
- launch-supervisor.py gate: defer while the per-run watchdog is alive (it is the
single writer for prompt-recovery). The supervisor only takes over once the
watchdog gives up (MAX_RESUMES) — i.e. a wedge a bare resume can't fix. Removes
the supervisor/watchdog double-resume race.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WxbpH3DquKzoSTSwGvGuET
Root-cause fix for the 2026-07-03 run stalling: the cc-ci host disk filled to
100% (ENOSPC) mid-run (Wave 6, lasuite-drive), the agent stopped to reclaim
space, and nothing resumed it — the log-idle/429 watchdog only covers opencode-go
usage-limit stalls, not an environmental wedge.
- launch-upgrader.py: step-0 prereclaim_cc_ci() prunes STALE cc-ci docker images
(unused AND older than a week, so this week's likely-reused images stay) before
each weekly run. Best-effort; env-tunable (UPGRADER_PRERECLAIM*).
- launch-supervisor.py (new): hourly glm-5.2 orchestrator wake-up. Cheap
deterministic gate — no-ops (zero tokens) when the run is complete or
progressing; only when a run stalled/died before completing does it launch a
short-lived glm-5.2 agent to diagnose + drive it to a clean DONE. Progress is
judged by live run-proc + log mtime (session_busy() is claude-tuned and misreads
a headless opencode run as idle).
- configuration.nix: cc-ci-upgrade-supervisor service + hourly timer (:07).
- upgrade-all SKILL §0: note the stale-image reclaim for manual runs.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WxbpH3DquKzoSTSwGvGuET