Two gaps for the scheduled Thursday glm-5.2 run:
1. Survival: the watchdog was a Popen child of the Type=oneshot service, which
systemd's cgroup cleanup kills on exit. Spawn it under the persistent tmux
server instead (_spawn_watchdog), like the run sessions — survives the oneshot.
2. The report runs on glm-5.2 sharing the same opencode-go budget the upgrade run
drains, so it can 429-stall with no recovery. launch-report.py now spawns the
SAME watchdog pointed at the cc-ci-report session (generic via UPGRADER_SESSION/
_MODEL/_DONE_MARKER/_RESUME_FILE), with a report-specific resume prompt.
Also: _run_pids() is now scoped to the managed session (title or -s <sid>) so the
report watchdog can't kill the idle upgrader process and vice-versa; resume() adds
--dir and honors a custom resume prompt file.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_completed() grepped the log for UPGRADE RUN COMPLETE, but the kickoff/resume
PROMPT (a user message) contains that string verbatim, so it false-positived
'done' while the run was still going. Check the model's ASSISTANT message output
via the web server API instead (log grep only as an offline, prompt-excluding
fallback).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Subcommand, function, env (UPGRADER_WATCHDOG), and log file renamed; behavior
unchanged. Only the opencode upgrader 'start' auto-spawns it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The opencode-go subscription's rolling usage-limit (429) ends the 'opencode run'
agent loop mid-run; it does NOT self-resume. Add:
- resume: continue the SAME session (context preserved) via 'opencode run -s <id>
--continue' — finds the session from the web server, kills the idle proc safely
(via /proc scan, never pkill -f self-match), relaunches in the tmux session.
- babysit: poll the session log; on a stall (>15min idle) wait out any 429
retry-after then auto-resume. Spawned automatically by an opencode 'start'.
So a usage-limit pause now self-heals instead of needing a manual nudge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Weekly upgrade run now defaults backend=opencode, model=opencode-go/glm-5.2 with
no env set. Model default tracks backend (claude override → sonnet). Override via
LOOP_BACKEND/LOOP_MODEL or /srv/cc-ci/upgrader.env.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cc-ci-upgrade-all now reads an optional EnvironmentFile so the weekly run can
switch backend/model (e.g. LOOP_BACKEND=opencode LOOP_MODEL=opencode-go/glm-5.2)
without a rebuild. Absent file → claude/sonnet (unchanged). Built+switched on
cc-ci-orchestrator-hetzner, host verified healthy.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The opencode backend emitted 'opencode --model X run ...' but -m/--model is a
flag on the run subcommand, so the model was being ignored. Move it after run.
Add OPENCODE_SHARE (default on): attach the session to the shared opencode web
server (oc.commoninternet.net) AND create a public --share link for monitoring.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bash scripts are now one-liner wrappers: exec python3 <script>.py "$@"
All logic lives in the Python scripts (pure stdlib, no deps).
launch.py — loops + watchdog:
Full port of launch.sh: phase sequencing, start/stop/status/logs/watchdog,
handoff signalling, stall detection, heal_session, heal_orchestrator.
Cleaner structure: config block → helpers → phase/kickoff/agent/healing/
handoff/watchdog/main. LOOP_BACKEND + LOOP_MODEL switches throughout.
launch-orchestrator.py — orchestrator session:
claude path: --resume <id> preserved (conversation survives reboots).
opencode path: run --attach --title (no --resume; STARTUP_PROMPT orients
the new session; reads JOURNAL.md for context).
STARTUP_PROMPT updated to reference JOURNAL.md on startup.
launch-upgrader.py — one-shot upgrade job:
LOOP_BACKEND / LOOP_MODEL take precedence over UPGRADER_BACKEND / UPGRADER_MODEL.
Both claude and opencode paths supported.
cc-ci-plan/JOURNAL.md — new orchestrator handoff file:
Persistent across conversation resets. Documents the handoff format and
carries the current session's summary: migration complete, phase 5 in
progress (V3/V7 PASS), phase 4 deferred, open items for next session.
AGENTS.md: step 1 on startup = read JOURNAL.md; step 5 = append on handoff.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>