A run that hits a usage-limit pause has inflated duration (idle wait) but an accurate token total. analyze.py now scans each run's watchdog log for 'limit hit', flags it LIMIT in the raw table, and excludes it from the tokens/sec stat (token total, tok/LOC, tok/commit unaffected). Caught because campaign run r2 hit the limit ~00:40 and recovered at the 00:50 reset — watchdog handled it.
agent-orchestrator-benchmark
Benchmarks for the agent-orchestrator
harness — vendored here as the engine/ submodule, pinned at a ref that ships the example variants
being compared.
What it measures
A head-to-head between two example variants in the engine:
builder-adversary— the original Builder/Adversary loop-pair prompts.builder-adversary-min— the same pattern with the role + kickoff prompts compressed to minimal tokens.
The benchmark confirms each variant independently succeeds on the same task (no shared context) and clocks the tokens each uses.
Run
git submodule update --init # fetch the vendored engine (first time)
./run-bench.sh # writes RESULTS.md
Needs claude on PATH and python/timeout. Both variants run on Sonnet
(claude-sonnet-4-6) for Builder and Adversary.
How it works
run-bench.sh assembles exactly the prompt the harness would send a loop agent (the variant's
kickoff.md with {phase_id}/{plan}/{status}/{role} substituted, then the role prompt), then drives
one Builder pass and one Adversary pass as separate headless claude -p sessions — fresh
context each, so the two variants (and the two roles) share no context. The Builder builds and
commits in its own repo; the Adversary cold-verifies from its own clone. The script then re-runs
the task's Definition-of-Done check itself and reads the Adversary's verdict, and tallies tokens from
claude -p --output-format json.
The test problem is plans/roman.md — an integer→Roman-numeral CLI with a stdlib
unittest suite (deterministic, fully local, cold-verifiable, and not present in either example).
Caveats
- This is a controlled single pass per variant (N=1; expect run-to-run variance), not the full
self-paced watchdog loop. It measures task effectiveness + prompt token cost, not the live
loop / handoff / liveness machinery (that needs a real
engine/agents.py uprun). - Each
claude -pcall carries a fixed ~24k-token cached system-prompt/tool overhead, and most tokens come from the agentic work itself — so the prompt-size difference is a small slice of the total.RESULTS.mdreports the static prompt size separately so the minimisation is visible.
Layout
engine/ agent-orchestrator, vendored as a submodule (the variants live in engine/examples/)
plans/roman.md the test problem (single source of truth + Definition of Done)
run-bench.sh the runner
RESULTS.md generated by run-bench.sh