The per-gate functions were inserted immediately after token_phase_flush's log
line, which split the function: its trailing re-baseline block (the
'if next_phase_id is not None: ...' that re-seeds the per-phase baseline for the
next phase, or finalizes when None) was orphaned onto the end of gate_token_check,
where next_phase_id is undefined. The watchdog therefore crashed with a NameError on
the first tick after every start. Move that block back into token_phase_flush (where
next_phase_id, cur and sf are in scope) and end gate_token_check at its log line.
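A minimal sketch of the corrected layout. It assumes hypothetical signatures: `sf` is a persisted state dict, `cur` maps each agent to its cumulative token count, and `log` is any callable. Only the scoping mirrors the fix. The record shape is illustrative.

```python
from typing import Callable, Optional


def token_phase_flush(sf: dict, cur: dict[str, int], phase_id: str,
                      next_phase_id: Optional[str],
                      log: Callable[[str], None]) -> dict:
    """Record the finished phase's token delta, then re-baseline (sketch)."""
    base = sf.get("baseline", {})
    per_agent = {a: n - base.get(a, 0) for a, n in cur.items()}
    record = {"phase_id": phase_id, "agents": per_agent,
              "total": sum(per_agent.values())}
    log(f"tokens {phase_id}: {record['total']}")
    # Re-baseline block: it belongs HERE, where next_phase_id/cur/sf are in scope.
    if next_phase_id is not None:
        sf["baseline"] = dict(cur)      # seed the next phase's baseline
        sf["phase_id"] = next_phase_id
    else:
        sf["finalized"] = True          # last phase: nothing left to seed
    return record


def gate_token_check(sf: dict, cur: dict[str, int],
                     log: Callable[[str], None]) -> None:
    """Per-gate check (sketch): ends at its log line, with no phase re-baselining."""
    log(f"gate check: {sum(cur.values())} tokens so far")
```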
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWTdUq2bsic7JZGqJp3nD6
Two watchdog/metrics improvements to the loop machine:
1) Token-logging granularity is configurable via [watchdog].token_granularity:
'gate' (default) or 'phase'. In 'gate' mode, tokens are attributed to each
claimed gate, in addition to the per-phase rollup, and appended to
token-log.jsonl tagged phase_id='<phase>:<label>'. A gate is any 'claim(<label>)'
commit on the work repo's origin/main (e.g. claim(D1-D5) or claim(feat:multi-file);
a leading 'feat:' is stripped from the label). A gate ends when the most recently
claimed label changes, and the in-flight gate is also flushed when the phase ends.
'phase' mode keeps the original per-phase-only behaviour.
2) Phase auto-advance is now evaluated on EVERY signal tick instead of only the
heavy tick, so a completed phase advances within signal_interval of its
'## DONE' landing rather than idling up to heavy_interval. Healing stays on the
heavy cadence.
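A minimal sketch of the two cadences, assuming a driving loop (not shown) that calls `tick(time.monotonic())` once every signal_interval seconds. `WatchdogCadence`, `advance` and `heal` are hypothetical names.

```python
import math
from typing import Callable


class WatchdogCadence:
    """Sketch: phase auto-advance on every signal tick, healing on the heavy cadence."""

    def __init__(self, heavy_interval: float,
                 advance: Callable[[], None], heal: Callable[[], None]):
        self.heavy_interval = heavy_interval
        self.advance = advance
        self.heal = heal
        self.last_heavy = -math.inf     # the first tick always runs the heavy work

    def tick(self, now: float) -> None:
        # Cheap check on EVERY signal tick, so a '## DONE' is acted on within
        # one signal_interval instead of waiting for the next heavy tick.
        self.advance()
        # Healing stays on the slower heavy cadence.
        if now - self.last_heavy >= self.heavy_interval:
            self.heal()
            self.last_heavy = now
```

With a 30 s signal interval and a 300 s heavy interval, ten minutes of ticks run the advance check 20 times but healing only twice.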
Note: gate-boundary detection assumes the loop's 'claim(<label>)' commit convention.
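A minimal sketch of the boundary detection. It assumes commit subjects arrive newest first (the order `git log --format=%s origin/main` prints them) and that a claim subject starts with `claim(`. `latest_claim_label`, `gate_boundary` and `gate_phase_id` are hypothetical names.

```python
import re
from typing import Iterable, Optional

# Assumption: a claim commit's subject starts with "claim(<label>)".
CLAIM_RE = re.compile(r"^claim\((?P<label>[^)]+)\)")


def latest_claim_label(subjects: Iterable[str]) -> Optional[str]:
    """Return the most recently claimed label (subjects are newest first)."""
    for subject in subjects:
        m = CLAIM_RE.match(subject)
        if m:
            label = m.group("label")
            if label.startswith("feat:"):
                label = label[len("feat:"):]    # claim(feat:multi-file) -> multi-file
            return label
    return None


def gate_phase_id(phase: str, label: str) -> str:
    """Tag written to token-log.jsonl for a per-gate record."""
    return f"{phase}:{label}"


def gate_boundary(prev_label: Optional[str], subjects: Iterable[str]) -> Optional[str]:
    """Return the gate to flush if the latest claimed label changed, else None."""
    cur = latest_claim_label(subjects)
    if prev_label is not None and cur != prev_label:
        return prev_label
    return None
```

When the phase ends, the in-flight gate (the current label) would be flushed unconditionally, independent of this check.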
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UWTdUq2bsic7JZGqJp3nD6
Multi-reviewer setups (e.g. a correctness adversary plus a readability adversary) can
now have the watchdog ping ALL reviewers on a claim, each in its own session with its
own submit key. The reviewer setting may be a list; a bare string still works (single
agent). The pinging logic lives in a _ping_agents() helper.
Coarsest review cadence: the Builder self-certifies the build phases and the
Adversary does ONE comprehensive cold verification of the whole accumulated build
in a final `review` phase (versus per-phase review in orig and per-gate review in
lean). It uses the full original prompts plus a DEFERRED REVIEW CADENCE override, so
verification cadence is the only variable it isolates. This is the cheapest
coordination; the trade-off is that the independent check arrives late (risk of late
rework, plus self-certification drift on the build phases). The README spells it out.
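As a rough illustration of the coordination cost, the sketch below counts the independent Adversary verdicts a run receives under each cadence. The cadence strings and the gates-per-phase input are illustrative labels, not config values.

```python
from typing import Sequence


def independent_verdicts(cadence: str, gates_per_phase: Sequence[int]) -> int:
    """Count Adversary verdicts a run receives under each review cadence (sketch)."""
    if cadence == "per-phase":      # orig: one review per build phase
        return len(gates_per_phase)
    if cadence == "per-gate":       # lean: one review per claimed gate
        return sum(gates_per_phase)
    if cadence == "deferred":       # this variant: one final `review` phase
        return 1 if gates_per_phase else 0
    raise ValueError(f"unknown cadence: {cadence}")
```

For two build phases with 3 and 2 gates, that is 2, 5 and 1 verdicts respectively. Fewer handoffs are cheaper, but under the deferred cadence any defect is caught only at the end.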
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Rebase stateless and lean onto the original prompts so that "stateless vs
builder-adversary" and "lean vs stateless" isolate context hygiene and review
granularity respectively, WITHOUT the confound of the minimal prompts' reduced
testing pressure (which we found cuts ~25% of test methods). stateless = orig +
context hygiene; lean = orig + context hygiene + per-gate review. min stays the
pure minimal-prompt variant (isolating prompt verbosity vs orig).
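The variant matrix above can be read as sets of factors layered on orig. A comparison is clean when exactly one factor differs. The factor names below are illustrative labels, not config keys.

```python
# Variant matrix as factor sets on top of orig (names are illustrative).
VARIANTS: dict[str, frozenset[str]] = {
    "builder-adversary": frozenset(),                       # orig
    "stateless": frozenset({"context_hygiene"}),
    "lean": frozenset({"context_hygiene", "per_gate_review"}),
    "min": frozenset({"minimal_prompts"}),
}


def isolated_factors(a: str, b: str) -> frozenset[str]:
    """Factors that differ between two variants; a clean comparison has exactly one."""
    return VARIANTS[a] ^ VARIANTS[b]
```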
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A single Builder that builds AND self-verifies (same DoD rigor), with NO
independent Adversary and no claim/review handoff. It serves as the control for
measuring what the AI adversary costs (its tokens, ~half of a loop-pair run) and
what it buys (independent cold verification vs self-certification).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When [watchdog].log_tokens (or [loop].log_tokens) is true, the watchdog records,
for each phase, how many tokens each agent used (and the total) and how long the
phase took, and appends the record to <log_dir>/token-log.jsonl. Tokens are summed
from each agent's session transcript, attributed by working dir. View with
`agents.py tokens`. A baseline snapshot is taken at phase start and the delta is
recorded when the phase advances or completes, which keeps the accounting robust
across watchdog restarts. Validated: the transcript sum exactly matches an
independent external collector.
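A minimal sketch of the baseline/delta accounting. It assumes each transcript line is a JSON object whose `message.usage` carries Anthropic-style token counters, and that every usage object should be counted exactly once. The function names are hypothetical. Because the transcripts stay on disk, recomputing the cumulative sums after a restart yields the same totals, so only the baseline snapshot needs persisting.

```python
import json
from typing import Iterable

# Assumed usage fields in each transcript entry's message.usage object.
USAGE_KEYS = ("input_tokens", "output_tokens",
              "cache_creation_input_tokens", "cache_read_input_tokens")


def transcript_tokens(lines: Iterable[str]) -> int:
    """Sum token usage over one agent's JSONL session transcript (sketch)."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        msg = entry.get("message") if isinstance(entry, dict) else None
        usage = msg.get("usage") if isinstance(msg, dict) else None
        if isinstance(usage, dict):
            total += sum(int(usage.get(k) or 0) for k in USAGE_KEYS)
    return total


def phase_record(phase_id: str, baseline: dict[str, int], current: dict[str, int],
                 started: float, ended: float) -> dict:
    """Delta since the phase-start snapshot: per agent, total, and duration."""
    agents = {a: current[a] - baseline.get(a, 0) for a in current}
    return {"phase_id": phase_id, "agents": agents,
            "total": sum(agents.values()), "seconds": ended - started}


def append_record(path: str, record: dict) -> None:
    """Append one JSON object per line, as in token-log.jsonl."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```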
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Isolates the two effects conflated in builder-adversary-stateless: keeps all the
CONTEXT HYGIENE (compact/diffs/lean loads) but ENFORCES full per-gate review
granularity (one claim per gate, one independent verdict per gate, no batching).
This tests whether stateless's token saving reflects real efficiency or reduced scrutiny.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Same pattern and AI-as-adversary verification as builder-adversary-min, but the
role prompts add CONTEXT HYGIENE: /compact at every checkpoint (lossless, since state
is on disk), read diffs not trees, spill bulk output to files, and have the adversary
load only {plan, STATUS, diff}. Loop agents are not resumed, so each phase gets a
fresh session. This targets cache reads (the dominant cost in a long loop) without
changing what the agents do or how they verify.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Same topology/behaviour as builder-adversary (loop pair, phase machine,
claim()/review() handoff, machine-docs coordination, cold verification) but the
role + kickoff prompts are compressed to a minimal token count, keeping every
load-bearing rule. Config and plans are unchanged. The separate
agent-orchestrator-benchmark repo runs a head-to-head token comparison.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A sketch backlog of further examples, each teaching a distinct orchestration
topology (anthill/stigmergy, kitchen line/pipeline, incident room/blackboard,
senate/debate, baton/mutex+failover, immune system/reactive, evolution chamber,
plus ATC and day-night extras). Not implemented; ideas only.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Based on @ponder.ooo's "snake pit agent orchestrator" idea (bsky 2026-05-28) and
Claude's metaphor-mapping elaboration: agents are snakes, tasks are food tossed
into a shared pit; snakes devour/digest/regurgitate/excrete.
A worker-pool-over-a-shared-queue topology (contrast the builder-adversary phase
machine):
- pit/ is a filesystem queue; snakes claim by atomic mv (no two eat the same food)
- species = specialized agents: keeper (zookeeper), planner (regurgitation IS
task decomposition), snake-1..3 (worker pool), cleanup (scavenger + coprophagy)
- no [loop] phase machine; persistent agents self-pace via /loop
- README carries the full bio→compute mapping table from the thread image
Verified: `agents.py status --config agents.toml` lists all 6 agents + service.
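A minimal sketch of the claim-by-rename idea. It assumes the pit and every snake's "belly" directory sit on the same filesystem, where rename(2), and hence `mv`, is atomic. The directory layout and the names `toss_food` and `claim` are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def toss_food(pit: Path, name: str, task: str) -> Path:
    """Drop a task file into the shared pit (the queue)."""
    pit.mkdir(parents=True, exist_ok=True)
    food = pit / name
    food.write_text(task, encoding="utf-8")
    return food


def claim(pit: Path, belly: Path) -> Optional[Path]:
    """Claim one task by atomic rename; return None if the pit is empty (sketch)."""
    belly.mkdir(parents=True, exist_ok=True)
    for food in sorted(pit.iterdir()):
        target = belly / food.name
        try:
            # Within one filesystem the rename is atomic: exactly one snake
            # wins, and every other snake finds the source already gone.
            os.rename(food, target)
            return target
        except FileNotFoundError:
            continue    # another snake ate it first; try the next item
    return None
```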
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A self-contained examples/builder-adversary/ that distills the cc-ci production
loop pair into a tiny, fully-local task (build a `wc` CLI in two phases):
- agents.toml: builder + adversary loops, persistent orchestrator, on_complete
reporter, cleanlogs service; phase machine with a per-phase model override
- prompts/: kickoff template + builder/adversary roles carrying the load-bearing
protocol (claim()/review() handoff, machine-docs file-location rule,
WHAT+HOW+EXPECTED+WHERE=STATUS / WHY=JOURNAL anti-anchoring, WAITING-UNTIL liveness)
- plans/: two phase plans (wc, json) each with a cold-verifiable Definition of Done
- README: how to run, the work-repo two-clone isolation model, how to adapt
Verified: `agents.py status --config agents.toml` parses and lists all agents.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Unit tests (no agents/tmux): config load + defaults merge, kickoff-template
assembly, phase machine (advance/idempotent-complete/append-resumes), limit
reset-banner parsing, WAITING-UNTIL/stall parsing, claude+opencode activity
detectors. Live smokes bring a throwaway project up THROUGH agents.py on each
real backend in an isolated sandbox (unique prefix, opencode on a non-4096
port), verify attach + status + down, and clean up. tests/run.sh runs unit
always + smokes when backends present; README documents it.
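The phase-machine behaviours the unit tests cover (advance, idempotent complete, append-resumes) can be sketched as below. This is an illustration under assumed semantics, not the code in agents.py. `PhaseMachine` and its `completions` counter (how often completion work such as an on_complete reporter would fire) are hypothetical.

```python
from typing import Optional


class PhaseMachine:
    """Sketch of advance / idempotent-complete / append-resumes semantics."""

    def __init__(self, phases: list[str]):
        self.phases = list(phases)
        self.index = 0
        self.complete = False
        self.completions = 0    # times completion work would have fired

    @property
    def current(self) -> Optional[str]:
        return None if self.complete else self.phases[self.index]

    def advance(self) -> Optional[str]:
        """Move to the next phase, or complete after the last one."""
        if self.complete:
            return None         # idempotent: completing twice is a no-op
        if self.index + 1 < len(self.phases):
            self.index += 1
            return self.phases[self.index]
        self.complete = True
        self.completions += 1
        return None

    def append(self, phase: str) -> None:
        """Appending a phase after completion resumes at the new phase."""
        self.phases.append(phase)
        if self.complete:
            self.complete = False
            self.index = len(self.phases) - 1
```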
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>