Commit Graph

8 Commits

Author SHA1 Message Date
f94be45f9c watchdog: cover all parts of the weekly run + survive the systemd oneshot
Two gaps for the scheduled Thursday glm-5.2 run:
1. Survival: the watchdog was a Popen child of the Type=oneshot service, which
   systemd's cgroup cleanup kills on exit. Spawn it under the persistent tmux
   server instead (_spawn_watchdog), like the run sessions — survives the oneshot.
2. The report also runs on glm-5.2 and shares the opencode-go budget that the
   upgrade run drains, so it can stall on a 429 with no recovery. launch-report.py
   now spawns the SAME watchdog pointed at the cc-ci-report session (made generic
   via UPGRADER_SESSION, UPGRADER_MODEL, UPGRADER_DONE_MARKER and
   UPGRADER_RESUME_FILE), with a report-specific resume prompt.

Also: _run_pids() is now scoped to the managed session (title or -s <sid>) so the
report watchdog can't kill the idle upgrader process and vice-versa; resume() adds
--dir and honors a custom resume prompt file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 02:42:50 +00:00
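
The two mechanisms in f94be45f9c can be sketched roughly as below. This is not the code from launch-upgrader.py: the function names, the session name used in the usage note, and the shape of the argv lists are assumptions made for illustration. The first helper only builds a `tmux new-session -d` command line, so the watchdog would be started by the tmux server rather than by the oneshot service. The second decides whether an `opencode run` command line belongs to the managed session, by title or by `-s <sid>`.

```python
import shlex


def tmux_spawn_argv(session, command):
    """Build the argv that starts `command` in a new detached tmux session.

    Assumption: a tmux server is already running outside the oneshot unit,
    so the spawned process is parented by that server.
    """
    return ["tmux", "new-session", "-d", "-s", session, shlex.join(command)]


def belongs_to_session(argv, title, session_id=None):
    """Return True if an `opencode run` argv targets the managed session.

    A match is either the session title appearing as an argument, or the
    pair `-s <session_id>` appearing anywhere in the command line.
    """
    if "run" not in argv:
        return False
    if title in argv:
        return True
    if session_id is None:
        return False
    # Look at each adjacent (flag, value) pair for `-s <session_id>`.
    return any(
        flag == "-s" and value == session_id
        for flag, value in zip(argv, argv[1:])
    )
```

For example, `tmux_spawn_argv("cc-ci-watchdog", ["python3", "launch-upgrader.py", "watchdog"])` yields a command that could be passed to `subprocess.run`. Scoping the match this way is what would keep a report watchdog from selecting the upgrader's process.
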
5a6c62e36c launch-upgrader: fix false completion detection (prompt contains the marker)
_completed() grepped the log for UPGRADE RUN COMPLETE, but the kickoff/resume
PROMPT (a user message) contains that string verbatim, so it reported 'done'
(a false positive) while the run was still going. Check the model's ASSISTANT
message output via the web server API instead (log grep remains only as an
offline fallback that excludes the prompt).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 01:42:06 +00:00
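
The fix in 5a6c62e36c amounts to filtering by message role before looking for the marker. A minimal sketch, assuming the session's messages have already been fetched and reduced to dicts with `role` and `text` keys (that shape is hypothetical, not the web server's actual response format):

```python
def completed(messages, marker="UPGRADE RUN COMPLETE"):
    """Return True only if an assistant message contains the marker.

    User messages are skipped, so a prompt that quotes the marker
    verbatim cannot trigger completion on its own.
    """
    return any(
        message.get("role") == "assistant" and marker in message.get("text", "")
        for message in messages
    )
```

A session holding only the kickoff prompt would therefore not count as complete, even though the prompt contains the marker string.
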
6f9cbc1a56 launch-upgrader: rename babysit -> watchdog (match agents.py convention)
Subcommand, function, env (UPGRADER_WATCHDOG), and log file renamed; behavior
unchanged. Only the opencode upgrader 'start' auto-spawns it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 01:33:07 +00:00
28ef7e44ab launch-upgrader: add stall-detect + auto-resume watchdog (opencode-go limit)
The opencode-go subscription's rolling usage-limit (429) ends the 'opencode run'
agent loop mid-run; it does NOT self-resume. Add:
- resume: continue the SAME session (context preserved) via 'opencode run -s <id>
  --continue' — finds the session from the web server, kills the idle proc safely
  (via /proc scan, never pkill -f self-match), relaunches in the tmux session.
- babysit: poll the session log; on a stall (>15min idle) wait out any 429
  retry-after then auto-resume. Spawned automatically by an opencode 'start'.

So a usage-limit pause now self-heals instead of needing a manual nudge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 01:26:24 +00:00
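
The stall rule described in 28ef7e44ab (more than 15 minutes idle, then wait out any 429 retry-after before resuming) can be sketched as two pure functions. The names and the use of plain seconds are assumptions for illustration; how the real watchdog reads the log timestamp and the retry-after value is not shown in the commit message.

```python
STALL_SECONDS = 15 * 60  # idle threshold from the commit message


def is_stalled(last_activity, now, threshold=STALL_SECONDS):
    """True once the session log has been idle for longer than the threshold."""
    return now - last_activity > threshold


def resume_delay(retry_after, elapsed_since_429):
    """Seconds still to wait before resuming.

    `retry_after` is the server-advised wait in seconds, or None when no
    429 was seen. The result is never negative.
    """
    if retry_after is None:
        return 0.0
    return max(0.0, float(retry_after) - elapsed_since_429)
```

A polling loop would call `is_stalled` on each tick and, once it returns True, sleep for `resume_delay(...)` seconds before issuing the resume.
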
5351ec2e40 launch-upgrader: default to opencode-go/glm-5.2 when unset
Weekly upgrade run now defaults backend=opencode, model=opencode-go/glm-5.2 with
no env set. Model default tracks backend (claude override → sonnet). Override via
LOOP_BACKEND/LOOP_MODEL or /srv/cc-ci/upgrader.env.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 20:23:26 +00:00
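
The default resolution described in 5351ec2e40 might look like the sketch below. LOOP_BACKEND, LOOP_MODEL, and the default values come from the commit message; the function name and the lookup table are assumptions, and only the two backends named in this log are handled.

```python
DEFAULT_BACKEND = "opencode"

# Per-backend model defaults, so the model default tracks the backend.
DEFAULT_MODELS = {
    "opencode": "opencode-go/glm-5.2",
    "claude": "sonnet",
}


def resolve_backend_and_model(env):
    """Pick backend and model from an environment mapping, with defaults.

    Raises KeyError for a backend that has no entry in DEFAULT_MODELS
    and no explicit LOOP_MODEL.
    """
    backend = env.get("LOOP_BACKEND") or DEFAULT_BACKEND
    model = env.get("LOOP_MODEL") or DEFAULT_MODELS[backend]
    return backend, model
```

Passing `os.environ` as `env` would reproduce the "no env set" case when neither variable is defined.
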
1443ccaea5 weekly upgrade: optional backend/model via /srv/cc-ci/upgrader.env
cc-ci-upgrade-all now reads an optional EnvironmentFile so the weekly run can
switch backend/model (e.g. LOOP_BACKEND=opencode LOOP_MODEL=opencode-go/glm-5.2)
without a rebuild. Absent file → claude/sonnet (unchanged). Built+switched on
cc-ci-orchestrator-hetzner, host verified healthy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 20:21:16 +00:00
ec18c98af6 launch-upgrader: fix opencode --model placement + add web-attach/--share
The opencode backend emitted 'opencode --model X run ...' but -m/--model is a
flag on the run subcommand, so the model was being ignored. Move it after run.
Add OPENCODE_SHARE (default on): attach the session to the shared opencode web
server (oc.commoninternet.net) AND create a public --share link for monitoring.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 20:14:27 +00:00
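
The placement bug fixed in ec18c98af6 comes down to argv ordering: the model flag has to follow the `run` subcommand. A small sketch of a builder that keeps that order (the function name and the `share` parameter are illustrative, and the real launcher passes more arguments than shown here):

```python
def opencode_run_argv(model, prompt, share=True):
    """Build an `opencode run` command line with --model after `run`."""
    argv = ["opencode", "run", "--model", model]
    if share:
        # Flag named in the commit message for creating a public link.
        argv.append("--share")
    argv.append(prompt)
    return argv
```

Building the command as a list rather than a shell string also avoids quoting problems when the prompt contains spaces.
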
bca51071bd refactor: rewrite launchers as Python; add orchestrator JOURNAL.md
Bash scripts are now one-liner wrappers: exec python3 <script>.py "$@"
All logic lives in the Python scripts (pure stdlib, no deps).

launch.py — loops + watchdog:
  Full port of launch.sh: phase sequencing, start/stop/status/logs/watchdog,
  handoff signalling, stall detection, heal_session, heal_orchestrator.
  Cleaner structure: config block → helpers → phase/kickoff/agent/healing/
  handoff/watchdog/main. LOOP_BACKEND + LOOP_MODEL switches throughout.

launch-orchestrator.py — orchestrator session:
  claude path: --resume <id> preserved (conversation survives reboots).
  opencode path: run --attach --title (no --resume; STARTUP_PROMPT orients
  the new session; reads JOURNAL.md for context).
  STARTUP_PROMPT updated to reference JOURNAL.md on startup.

launch-upgrader.py — one-shot upgrade job:
  LOOP_BACKEND / LOOP_MODEL take precedence over UPGRADER_BACKEND / UPGRADER_MODEL.
  Both claude and opencode paths supported.

cc-ci-plan/JOURNAL.md — new orchestrator handoff file:
  Persistent across conversation resets. Documents the handoff format and
  carries the current session's summary: migration complete, phase 5 in
  progress (V3/V7 PASS), phase 4 deferred, open items for next session.

AGENTS.md: step 1 on startup = read JOURNAL.md; step 5 = append on handoff.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 17:50:09 +00:00
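
The precedence rule from bca51071bd (LOOP_BACKEND / LOOP_MODEL win over UPGRADER_BACKEND / UPGRADER_MODEL) can be expressed as a "first non-empty variable" lookup. This is a sketch under that reading of the commit message; the helper name is hypothetical.

```python
def first_set(env, *names, default=None):
    """Return the value of the first variable in `names` that is set and non-empty."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return default
```

For instance, `first_set(os.environ, "LOOP_BACKEND", "UPGRADER_BACKEND", default="claude")` would prefer the LOOP_ variable and fall back to the UPGRADER_ one.
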