watchdog: STALL_GRACE so stall_check never races a loop's own ScheduleWakeup

Root cause of the adversary "overrun": stall_check rebooted the instant
now >= WAITING-UNTIL (zero grace), but the loop's own ScheduleWakeup fires AT
that stated time — and the runtime scheduled it ~40s later than the marker
(date-vs-scheduler skew). So the watchdog pre-empted a HEALTHY self-wake by
~37s; the loop wasn't wedged, it was killed just before it woke. That was the
single false reboot at 18:55Z.

Fix: split the two cases cleanly.
- Marker present: reboot only when now > WAITING-UNTIL + STALL_GRACE (180s) —
  covers wake+start latency + marker/scheduler skew, so the watchdog only fires
  if the self-wake GENUINELY failed.
- No marker: unchanged — reboot when idle >= STALL_IDLE (300s).

Verified post-fix: adversary self-woke on time and re-paced (WAITING-UNTIL
19:19:30Z); no new stall reboots.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-05-29 20:12:46 +01:00
parent e8c4330ce3
commit c7da03fa6c

View File

@ -42,8 +42,12 @@ LOG_DIR="${LOG_DIR:-/srv/cc-ci/.cc-ci-logs}"
WATCH_INTERVAL="${WATCH_INTERVAL:-300}" # seconds between HEAVY checks (phase DONE / restart dead loops)
SIGNAL_INTERVAL="${SIGNAL_INTERVAL:-30}" # seconds between HANDOFF checks (ping the waiting loop)
STALL_IDLE="${STALL_IDLE:-300}" # seconds a loop may sit idle past its WAITING-UNTIL marker
# (or with no marker at all) before the watchdog reboots it
STALL_IDLE="${STALL_IDLE:-300}" # NO-marker case: seconds a loop may sit idle (turn ended
# without declaring a wait) before the watchdog reboots it
STALL_GRACE="${STALL_GRACE:-180}" # marker case: seconds PAST a loop's WAITING-UNTIL before
# reboot. The real ScheduleWakeup fires AT the stated time;
# grace covers wake+start latency + marker/scheduler skew so
# the watchdog never RACES (pre-empts) a healthy self-wake.
BUILDER_SESSION="cc-ci-builder"
ADV_SESSION="cc-ci-adv"
@ -200,7 +204,7 @@ _parse_waiting_until() { # arg1 = pane text; echoes epoch seconds of the last
}
stall_check_one() {
local role="$1" s="$2" dir="$3" pane now until idle since
local role="$1" s="$2" dir="$3" pane now until idle since reason
session_alive "$s" || { _wd_idle_since[$s]=0; return 0; } # dead => heal_session handles it
now="$(printf '%(%s)T' -1)"
pane="$(tmux capture-pane -pt "$s" 2>/dev/null | tail -40 || true)"
@ -210,12 +214,19 @@ stall_check_one() {
since="${_wd_idle_since[$s]:-0}"
if [[ "$since" == 0 ]]; then since="$now"; _wd_idle_since[$s]="$now"; fi
idle=$(( now - since ))
(( idle >= STALL_IDLE )) || return 0
until="$(_parse_waiting_until "$pane")"
if [[ -n "$until" ]] && (( now < until )); then
return 0 # legitimately waiting, before its time
if [[ -n "$until" ]]; then
# Declared wait: the loop's own ScheduleWakeup fires AT 'until'. Reboot ONLY once we are
# STALL_GRACE seconds PAST it — i.e. the self-wake genuinely failed. Never reboot before/at
# 'until' (that races and pre-empts the healthy wake — the original false-reboot bug).
(( now > until + STALL_GRACE )) || return 0
reason="past its WAITING-UNTIL by $(( now - until ))s — self-wake did not fire"
else
# No declared wait: a turn ended without scheduling/declaring. Treat as a wedge once idle a while.
(( idle >= STALL_IDLE )) || return 0
reason="idle ${idle}s with no WAITING-UNTIL marker"
fi
log "stall: $role ($s) idle ${idle}s, $([[ -n "$until" ]] && echo "past its WAITING-UNTIL" || echo "no WAITING-UNTIL marker") — kill + reboot (re-orients from repo)"
log "stall: $role ($s) $reason — kill + reboot (re-orients from repo)"
tmux kill-session -t "$s" 2>/dev/null || true
start_agent "$role" "$s" "$dir"
_wd_idle_since[$s]=0