watchdog: reboot idle-wedged loops via self-reported WAITING-UNTIL markers

The builder wedged at the context limit (garbled output) — alive but matching
none of heal_session's signals (dead/FATAL/limit), so the watchdog left it
stuck. Fix: loops now declare every wait, and the watchdog reboots a wait that
never resumes.

- plan.md §7 + both prompts: cap every wait at 10 min (chunk longer waits);
  before going idle, the loop's FINAL line must be `WAITING-UNTIL: <ISO8601 UTC>`
  (the resume time, matching its ScheduleWakeup); run /compact proactively at
  ~80% context to avoid wedging near the limit.
- launch.sh: new stall_check (runs every 30s signal tick) — reboots a loop idle
  >= STALL_IDLE (300s) when it has NO current WAITING-UNTIL marker as its last
  message OR is past the time the marker named; a healthy paced wait (marker
  present, before its time) is left alone. Complements heal_session's
  dead/FATAL/limit cases. Reboot is safe — loops re-orient from git + STATUS.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-05-29 19:05:29 +01:00
parent 62b7af7a97
commit e8c4330ce3
4 changed files with 80 additions and 3 deletions

View File

@ -42,6 +42,8 @@ LOG_DIR="${LOG_DIR:-/srv/cc-ci/.cc-ci-logs}"
WATCH_INTERVAL="${WATCH_INTERVAL:-300}" # seconds between HEAVY checks (phase DONE / restart dead loops)
SIGNAL_INTERVAL="${SIGNAL_INTERVAL:-30}" # seconds between HANDOFF checks (ping the waiting loop)
STALL_IDLE="${STALL_IDLE:-300}" # seconds a loop may sit idle past its WAITING-UNTIL marker
# (or with no marker at all) before the watchdog reboots it
BUILDER_SESSION="cc-ci-builder"
ADV_SESSION="cc-ci-adv"
@ -178,6 +180,52 @@ heal_session() {
fi
}
# --- Idle-wedge detection (complements heal_session's dead/FATAL/limit cases) ----------------------
# A loop can sit ALIVE but wedged — e.g. garbled output at the context limit — showing none of the
# heal_session signals (not dead, no FATAL string, no limit notice). The loops therefore DECLARE every
# wait with a final-line marker `WAITING-UNTIL: <ISO-8601 UTC>` and cap each wait at 10 min (plan §7).
# A healthy idle loop ALWAYS has a current marker as its last message; a wedge does not (or has one
# whose time has already passed). So: reboot a loop that has been idle (no "esc to interrupt") for
# >= STALL_IDLE seconds AND (has no WAITING-UNTIL marker OR is now past the time that marker named).
# Runs every signal tick (30 s) for fine resolution; rebooting is safe — the loop re-orients from
# git + its phase STATUS/REVIEW files.
declare -A _wd_idle_since # session -> epoch first seen idle this stretch (0/unset = working)
_parse_waiting_until() { # arg1 = pane text; echoes epoch seconds of the last marker, or nothing
local line ts
line="$(printf '%s\n' "$1" | grep -oE 'WAITING-UNTIL:[[:space:]]*[0-9][0-9T:Z+-]+' | tail -1)"
[[ -n "$line" ]] || return 0
ts="$(printf '%s' "${line#WAITING-UNTIL:}" | tr -d '[:space:]')"
date -u -d "$ts" +%s 2>/dev/null || true
}
stall_check_one() {
local role="$1" s="$2" dir="$3" pane now until idle since
session_alive "$s" || { _wd_idle_since[$s]=0; return 0; } # dead => heal_session handles it
now="$(printf '%(%s)T' -1)"
pane="$(tmux capture-pane -pt "$s" 2>/dev/null | tail -40 || true)"
if printf '%s\n' "$pane" | grep -q 'esc to interrupt'; then
_wd_idle_since[$s]=0; return 0 # actively working — not idle
fi
since="${_wd_idle_since[$s]:-0}"
if [[ "$since" == 0 ]]; then since="$now"; _wd_idle_since[$s]="$now"; fi
idle=$(( now - since ))
(( idle >= STALL_IDLE )) || return 0
until="$(_parse_waiting_until "$pane")"
if [[ -n "$until" ]] && (( now < until )); then
return 0 # legitimately waiting, before its time
fi
log "stall: $role ($s) idle ${idle}s, $([[ -n "$until" ]] && echo "past its WAITING-UNTIL" || echo "no WAITING-UNTIL marker") — kill + reboot (re-orients from repo)"
tmux kill-session -t "$s" 2>/dev/null || true
start_agent "$role" "$s" "$dir"
_wd_idle_since[$s]=0
}
stall_check() {
stall_check_one builder "$BUILDER_SESSION" "$BUILDER_DIR"
stall_check_one adversary "$ADV_SESSION" "$ADV_DIR"
}
# Is an orchestrator process alive ANYWHERE? Conflict-safety: we must NEVER launch a second
# orchestrator that resumes the same conversation while one is already running (that double-resume is
# the likely cause of the "thinking blocks cannot be modified" crashes). The orchestrator may be
@ -289,6 +337,7 @@ watchdog_loop() {
local elapsed="$WATCH_INTERVAL"
while true; do
handoff_check
stall_check
if (( elapsed >= WATCH_INTERVAL )); then
elapsed=0
idx="$(cur_idx)"; pid="$(phase_id "$idx")"; status="$(phase_status "$idx")"

View File

@ -724,7 +724,9 @@ the *specific* thing. Three cases:
while blocked and trust the ping — but keep a **fallback self-poll on a modest cadence (~24 min)**
in case a ping is missed (a dead session is restarted by the watchdog and re-orients from the repo
anyway). The goal: a pending handoff resolves in well under a minute, not a whole idle interval.
3. **Genuinely idle, nothing pending from either loop** → sleep ~1015 min, then re-orient.
3. **Genuinely idle, nothing pending from either loop** → sleep in chunks of **at most 10 min**, then
re-wake and re-orient; if still nothing, sleep another ≤10 min. **Never a single wait > 10 min**
(600 s) — see the liveness rule below.
Notes: **The Adversary may idle freely when nothing is pending — it should NOT pointlessly re-verify
or busy-poll to look busy.** It gets woken by the watchdog the instant the Builder claims a gate, so
@ -733,6 +735,22 @@ spinning. **The Builder** should prefer keeping an unblocked backlog item in han
*fully* blocked on a gate; only hit case 2 when everything is genuinely gated behind the pending
verification — and then rely on the watchdog ping (+ fallback poll) rather than a long idle.
**Liveness marker & max-wait (the watchdog ENFORCES this).** Every wait is capped at **10 minutes**;
to wait longer, wake at 10 min, re-check, and wait again. **Immediately before going idle for any
wait, your FINAL output line MUST be exactly:**
WAITING-UNTIL: <ISO-8601 UTC>
— the moment you intend to resume (≤10 min out, matching your `ScheduleWakeup`). Compute it from the
clock, e.g. `date -u -d '+10 min' +%FT%TZ`. The watchdog uses this to tell a healthy wait from a
wedge: if it sees a loop **idle ≥5 min with no current `WAITING-UNTIL` marker as its last message, OR
idle past the time the marker named, it kills + reboots that loop** (which then re-orients from git +
its STATUS/REVIEW files). So always leave a fresh marker before sleeping, and never overrun it.
**Proactive compaction.** If your context usage climbs high (≳80%), run `/compact` *before*
continuing — your state lives in git + the phase STATUS/REVIEW files, so compaction is lossless for
the loop and prevents wedging (garbled output, failed tool calls) near the context limit.
**Anti-drift guards.**
- Cap retries: if an approach fails 3× the same way, stop, write the dead-end in `DECISIONS.md`,
and try a different approach or mark blocked. No thrashing.

View File

@ -1,6 +1,11 @@
You are the Adversary agent for cc-ci — one of two independent loops. Your job is to DISBELIEVE the Builder. Read /srv/cc-ci/cc-ci-plan/plan.md in full, especially §2, §6, §6.1, and §9.
Start a self-paced loop now: invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Pace yourself: when a gate is CLAIMED (or the watchdog pings you that one is), verify it promptly — that is top priority. But when nothing is pending you may IDLE freely (sleep ~1015m); you do NOT need to busy-poll or pointlessly re-verify to look busy. The watchdog pings you the instant the Builder claims a gate, so "start verifying soon after the Builder waits" is handled by that signal — you don't have to spin. (Re-verify a D-gate only when its last PASS is genuinely stale >24h, or run a break-it probe when you choose to — not as idle-filler.) Poll ~4m only while actively watching a CLAIMED gate's run or a build in flight. Keep running independent break-it probes even when no gate is pending. Stop only when STATUS.md says ## DONE and you have logged a fresh PASS for every D1D10.
Start a self-paced loop now: invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Pace yourself: when a gate is CLAIMED (or the watchdog pings you that one is), verify it promptly — that is top priority. But when nothing is pending you may IDLE freely (sleep in chunks of **≤10 min** — never a single wait >10 min); you do NOT need to busy-poll or pointlessly re-verify to look busy. The watchdog pings you the instant the Builder claims a gate, so "start verifying soon after the Builder waits" is handled by that signal — you don't have to spin. (Re-verify a D-gate only when its last PASS is genuinely stale >24h, or run a break-it probe when you choose to — not as idle-filler.) Poll ~4m only while actively watching a CLAIMED gate's run or a build in flight. Keep running independent break-it probes even when no gate is pending. Stop only when STATUS.md says ## DONE and you have logged a fresh PASS for every D1D10.
LIVENESS PROTOCOL (the watchdog ENFORCES this — see plan.md §7):
- **Cap every wait at 10 minutes.** To wait longer, wake at 10 min, re-check, then wait again. Never a single ScheduleWakeup > 600 s. (This replaces any longer "fallback" idle — no 30m sleeps.)
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` — the time you will resume (≤10 min out, matching your ScheduleWakeup). Compute it from the clock (`date -u -d '+10 min' +%FT%TZ`). If the watchdog sees you idle ≥5 min with no current marker as your last line, OR idle past the time it names, it kills + reboots you (you resume cleanly from git + your REVIEW/STATUS files).
- **Compact proactively.** If context usage climbs high (≳80%), run `/compact` before continuing — your loop state is in git + REVIEW/STATUS, so compaction is lossless and prevents wedging at the context limit.
Credentials/access: §1.5 is the authoritative map. Provided creds are in /srv/cc-ci/.testenv and ~/.ssh; reach cc-ci with `ssh cc-ci` (root, via the userspace-tailscaled SOCKS proxy on 127.0.0.1:1055), and hit the dashboard / *.ci.commoninternet.net through that proxy (`curl --proxy socks5h://localhost:1055 ...`). If the proxy is down, restart it per §1.5. Verify from a COLD START but you may rely on this shared access path.

View File

@ -2,7 +2,12 @@ You are the Builder agent for the cc-ci project — one of two independent loops
Single source of truth: /srv/cc-ci/cc-ci-plan/plan.md. Read it in full now, then begin at §1 Bootstrap. The original brief /srv/cc-ci/cc-ci-plan/brief.md is context only — do not edit it.
Start a self-paced loop now: invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Each iteration = one unit of work (see §7). Pace per §7 (three cases): (1) build/deploy/rebuild/e2e/heavy-test in flight → **poll every ~5 min, NEVER a single big ScheduleWakeup matching the expected runtime** (catch failures at minute 4 of a 25-min e2e, not at minute 25); the cache-warm 5-min poll is cheap, the long blackout is not; (2) parked at a CLAIMED gate awaiting the Adversary with no other unblocked work → the watchdog will PING you the moment the Adversary updates REVIEW.md OR writes a BUILDER-INBOX.md, so you may wait, but keep a fallback self-poll ~24m in case a ping is missed; (3) genuinely idle, nothing pending → sleep ~1015m. Prefer keeping an unblocked backlog item in hand so you rarely hit case 2. Stop the loop only when STATUS.md says ## DONE.
Start a self-paced loop now: invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Each iteration = one unit of work (see §7). Pace per §7 (three cases): (1) build/deploy/rebuild/e2e/heavy-test in flight → **poll every ~5 min, NEVER a single big ScheduleWakeup matching the expected runtime** (catch failures at minute 4 of a 25-min e2e, not at minute 25); the cache-warm 5-min poll is cheap, the long blackout is not; (2) parked at a CLAIMED gate awaiting the Adversary with no other unblocked work → the watchdog will PING you the moment the Adversary updates REVIEW.md OR writes a BUILDER-INBOX.md, so you may wait, but keep a fallback self-poll ~24m in case a ping is missed; (3) genuinely idle, nothing pending → sleep in chunks of **≤10 min** (never a single wait >10 min). Prefer keeping an unblocked backlog item in hand so you rarely hit case 2. Stop the loop only when STATUS.md says ## DONE.
LIVENESS PROTOCOL (the watchdog ENFORCES this — see plan.md §7):
- **Cap every wait at 10 minutes.** To wait longer, wake at 10 min, re-check, then wait again. Never a single ScheduleWakeup > 600 s.
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` — the time you will resume (≤10 min out, matching your ScheduleWakeup). Compute it from the clock (`date -u -d '+10 min' +%FT%TZ`). If the watchdog sees you idle ≥5 min with no current marker as your last line, OR idle past the time it names, it kills + reboots you (you resume cleanly from git + your STATUS/REVIEW files).
- **Compact proactively.** If context usage climbs high (≳80%), run `/compact` before continuing — your loop state is in git + phase STATUS/REVIEW, so compaction is lossless and prevents wedging (garbled output / failed tool calls) at the context limit.
You run as a SEPARATE process from the Adversary loop and coordinate ONLY through the git repo per §6.1:
- git pull --rebase before every edit; make the smallest change; commit; git push. Never --force.