recipe-maintainers/agent-orchestrator-benchmark


mfowler 25a77f5d3c fix: flag usage-limit-affected runs; correct tok/sec

A run that hits a usage-limit pause has inflated duration (idle wait) but an
accurate token total. analyze.py now scans each run's watchdog log for 'limit
hit', flags it LIMIT in the raw table, and excludes it from the tokens/sec stat
(token total, tok/LOC, tok/commit unaffected). Caught because campaign run r2
hit the limit ~00:40 and recovered at the 00:50 reset — watchdog handled it.

2026-06-15 01:29:54 +00:00
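
The flagging rule described in this commit can be sketched roughly as follows. This is an illustration only, not the contents of analyze.py: the `Run` record, its field names, and the helper names are hypothetical, and the watchdog log is assumed to be already read into a string. Only the 'limit hit' marker and the rule "exclude flagged runs from tokens/sec, keep them in token totals" come from the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional

# Marker string named in the commit message.
LIMIT_MARKER = "limit hit"


@dataclass
class Run:
    """Hypothetical per-run record; the real analyze.py may store this differently."""
    name: str
    tokens: int
    duration_s: float
    watchdog_log: str  # assumed: full log text, already loaded


def hit_usage_limit(run: Run) -> bool:
    """A run is flagged LIMIT if its watchdog log mentions the marker."""
    return LIMIT_MARKER in run.watchdog_log


def mean_tokens_per_sec(runs: List[Run]) -> Optional[float]:
    """Average tokens/sec over unflagged runs only.

    Flagged runs include idle wait time in their duration, so their
    rate would be misleadingly low.
    """
    rates = [
        r.tokens / r.duration_s
        for r in runs
        if not hit_usage_limit(r) and r.duration_s > 0
    ]
    return sum(rates) / len(rates) if rates else None


def total_tokens(runs: List[Run]) -> int:
    """Token totals stay accurate for flagged runs, so every run counts."""
    return sum(r.tokens for r in runs)
```

With one clean run and one flagged run, the flagged run still contributes to `total_tokens` but is left out of `mean_tokens_per_sec`.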

engine @ e0425e6108

feat: add builder-adversary-lean variant; runner takes variant args

2026-06-14 21:43:11 +00:00

plans

feat: multi-phase calculator problem + full-harness benchmark runner

2026-06-14 20:40:14 +00:00

.gitignore

feat: campaign mode — repeat each variant N times, aggregate distributions

2026-06-14 22:19:10 +00:00

.gitmodules

feat: agent-orchestrator-benchmark — prompt token comparison harness

2026-06-14 20:20:05 +00:00

analyze.py

fix: flag usage-limit-affected runs; correct tok/sec

2026-06-15 01:29:54 +00:00

README.md

feat: agent-orchestrator-benchmark — prompt token comparison harness

2026-06-14 20:20:05 +00:00

RESULTS-harness.md

results: 4-way + the variance finding (N=1 is not enough)

2026-06-14 22:06:21 +00:00

RESULTS.md

feat: multi-phase calculator problem + full-harness benchmark runner

2026-06-14 20:40:14 +00:00

run-bench.sh

feat: agent-orchestrator-benchmark — prompt token comparison harness

2026-06-14 20:20:05 +00:00

run-harness-bench.sh

feat: keep run repos + record commits/LOC per run

2026-06-15 00:13:08 +00:00

README.md

agent-orchestrator-benchmark

Benchmarks for the agent-orchestrator harness — vendored here as the engine/ submodule, pinned at a ref that ships the example variants being compared.

What it measures

A head-to-head between two example variants in the engine:

builder-adversary — the original Builder/Adversary loop-pair prompts.
builder-adversary-min — the same pattern with the role + kickoff prompts compressed to minimal tokens.

The benchmark checks that each variant succeeds independently on the same task (no shared context) and records how many tokens each one uses.

Run

git submodule update --init      # fetch the vendored engine (first time)
./run-bench.sh                   # writes RESULTS.md

Requires the claude CLI on PATH, plus python and timeout. Both variants run on Sonnet (claude-sonnet-4-6) for both the Builder and the Adversary.

How it works

run-bench.sh assembles exactly the prompt the harness would send a loop agent: the variant's kickoff.md with {phase_id}/{plan}/{status}/{role} substituted, followed by the role prompt. It then drives one Builder pass and one Adversary pass as separate headless claude -p sessions, each with a fresh context, so the two variants (and the two roles) share no context. The Builder builds and commits in its own repo; the Adversary cold-verifies from its own clone. Finally, the script re-runs the task's Definition-of-Done check itself, reads the Adversary's verdict, and tallies tokens from claude -p --output-format json.
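
As a rough illustration of the prompt assembly and token tally described above, here is a minimal Python sketch. It is not the code in run-bench.sh. The function names are hypothetical, the blank-line separator between kickoff and role prompt is a guess, and the `usage` field names are an assumption about the JSON that claude -p --output-format json emits; check them against real output before relying on them.

```python
import json
import re

# The four placeholders the README says are substituted into kickoff.md.
_PLACEHOLDER = re.compile(r"\{(phase_id|plan|status|role)\}")


def assemble_prompt(kickoff_template: str, role_prompt: str, *,
                    phase_id: str, plan: str, status: str, role: str) -> str:
    """Fill the kickoff template, then append the role prompt.

    A single regex pass is used so that braces appearing inside a
    substituted value (for example in the plan text) are not expanded again.
    """
    values = {"phase_id": phase_id, "plan": plan, "status": status, "role": role}
    kickoff = _PLACEHOLDER.sub(lambda m: values[m.group(1)], kickoff_template)
    return kickoff + "\n\n" + role_prompt  # separator is an assumption


def tally_tokens(result_json: str) -> int:
    """Sum token counts from one session's JSON result.

    Assumed shape: a top-level "usage" object with the fields below.
    Missing fields count as zero.
    """
    usage = json.loads(result_json).get("usage", {})
    fields = (
        "input_tokens",
        "output_tokens",
        "cache_creation_input_tokens",
        "cache_read_input_tokens",
    )
    return sum(int(usage.get(f) or 0) for f in fields)
```

Summing `tally_tokens` over the Builder and Adversary results would give a per-variant total under these assumptions.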

The test problem is plans/roman.md — an integer→Roman-numeral CLI with a stdlib unittest suite (deterministic, fully local, cold-verifiable, and not present in either example).

Caveats

This is a controlled single pass per variant (N=1; expect run-to-run variance), not the full self-paced watchdog loop. It measures task effectiveness and prompt token cost, not the live loop, handoff, or liveness machinery (exercising those requires a real engine/agents.py up run).
Each claude -p call carries a fixed ~24k-token cached system-prompt/tool overhead, and most tokens come from the agentic work itself — so the prompt-size difference is a small slice of the total. RESULTS.md reports the static prompt size separately so the minimisation is visible.
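
To make the "small slice" point concrete, the arithmetic can be sketched as below. The function name and the numbers in the usage note are hypothetical and purely illustrative; they are not measurements from RESULTS.md.

```python
def saving_share(prompt_tokens_a: int, prompt_tokens_b: int, total_tokens: int) -> float:
    """Fraction of a run's total tokens explained by the static prompt-size difference.

    prompt_tokens_a / prompt_tokens_b: static prompt sizes of the two variants.
    total_tokens: all tokens in one run (fixed overhead + agentic work + prompt).
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return abs(prompt_tokens_a - prompt_tokens_b) / total_tokens
```

For instance, with made-up prompt sizes of 1,200 and 800 tokens and a made-up run total of 80,000 tokens, `saving_share(1200, 800, 80000)` works out to half a percent, which is why the static prompt size is worth reporting separately.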

Layout

engine/            agent-orchestrator, vendored as a submodule (the variants live in engine/examples/)
plans/roman.md     the test problem (single source of truth + Definition of Done)
run-bench.sh       the runner
RESULTS.md         generated by run-bench.sh