agent-orchestrator-benchmark
Benchmarks for the agent-orchestrator
Builder/Adversary loop — measuring what actually drives token cost: prompt design, context
discipline, verification cadence, and whether there's an independent adversary at all. The engine is
vendored as the engine/ submodule, pinned at a ref that ships the example variants being compared.
→ Findings
See FINDINGS.md for the synthesis (current as of 2026-06-16). The one-line
takeaway:
What the AI adversary costs is set by whether it verifies at all (~4.7× a solo builder), not by how often it verifies (per-gate ≈ per-phase ≈ per-build, all ~13M tokens). The only clean way to cut that cost without dropping verification is context hygiene (−22%).
Headline (median tokens, N=5 per variant, all on Sonnet):
| variant | adversary verifies… | median tokens | vs orig |
|---|---|---|---|
| builder-solo | never (self-certifies) | 2.77M | −79% |
| builder-adversary-min | per phase (minimal prompts) | 9.77M | −25% |
| builder-adversary-stateless | per phase (+context hygiene) | 10.12M | −22% |
| builder-adversary (orig) | per phase | 13.04M | — |
| builder-adversary-deferred | once, after whole build | 12.89M | −1% |
| builder-adversary-lean | per gate | 13.41M | +3% |
Full per-variant stats, efficiency ratios (tokens/LOC, tokens/sec, tokens/commit), correlations, and
the raw per-run table are in RESULTS-campaign.md; raw rows in
RESULTS-campaign.md.data.
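As a quick check on the headline table, the "vs orig" column and the ~4.7× figure can be re-derived from the medians alone. The sketch below hard-codes the medians from the table; the function names are illustrative and are not taken from analyze.py.

```python
# Median tokens (in millions), copied from the headline table.
MEDIAN_TOKENS_M = {
    "builder-solo": 2.77,
    "builder-adversary-min": 9.77,
    "builder-adversary-stateless": 10.12,
    "builder-adversary": 13.04,
    "builder-adversary-deferred": 12.89,
    "builder-adversary-lean": 13.41,
}

BASELINE = "builder-adversary"


def pct_vs_orig(variant: str) -> int:
    """Rounded percent change in median tokens relative to the baseline."""
    base = MEDIAN_TOKENS_M[BASELINE]
    return round((MEDIAN_TOKENS_M[variant] - base) / base * 100)


def adversary_multiplier() -> float:
    """How many times more tokens the baseline loop pair uses than the solo builder."""
    return MEDIAN_TOKENS_M[BASELINE] / MEDIAN_TOKENS_M["builder-solo"]


if __name__ == "__main__":
    for name in MEDIAN_TOKENS_M:
        if name != BASELINE:
            print(f"{name:30s} {pct_vs_orig(name):+d}%")
    print(f"adversary vs solo: {adversary_multiplier():.1f}x")
```

The three per-phase, per-gate, and per-build variants built on the original prompts land within a few percent of each other, while the solo control sits far below all of them.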
The variants (engine examples being compared)
All live in engine/examples/. They share one task and differ in one dimension each:
| variant | what changes |
|---|---|
| builder-adversary | the original full prompts; Adversary verifies per phase (the baseline) |
| builder-adversary-min | prompts compressed to minimal tokens |
| builder-adversary-stateless | orig + context hygiene (compact per checkpoint, read diffs not trees, lean loads) |
| builder-adversary-lean | orig + context hygiene + per-gate review (one claim/verdict per gate) |
| builder-adversary-deferred | orig; Adversary verifies once, in a final comprehensive review phase |
| builder-solo | no Adversary; a single Builder that self-certifies |
(stateless/lean/deferred are built on the full original prompts, so each isolates its one change without the minimal-prompt confound.)
The task
Build a 3-phase Python calculator — lexer → parser → evaluator (plans/calc/{lex,parse,eval}.md),
each phase with 4–6 cold-verifiable Definition-of-Done gates (deferred adds a comprehensive
plans/calc/review.md). It's deliberately offline and deterministic so it stresses the protocol,
not infrastructure, and the deliverable is behaviorally identical across variants (verified on a
24-expression probe) — so the comparison is like-for-like.
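One way to picture the behavioral-equivalence check is a probe that feeds the same expressions to every variant's deliverable and reports any disagreement. The sketch below is an assumption about the shape of such a check, not the probe used in the campaign: `behavioral_mismatches`, the stand-in evaluators, and the four-expression `PROBE` are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def behavioral_mismatches(
    implementations: Dict[str, Callable[[str], object]],
    probe: Iterable[str],
) -> List[Tuple[str, Dict[str, str]]]:
    """Return (expression, per-implementation results) wherever results differ."""
    mismatches = []
    for expr in probe:
        results = {}
        for name, evaluate in implementations.items():
            try:
                results[name] = repr(evaluate(expr))
            except Exception as exc:
                # Errors are part of observable behavior, so compare them too.
                results[name] = f"error:{type(exc).__name__}"
        if len(set(results.values())) > 1:
            mismatches.append((expr, results))
    return mismatches


# Hypothetical stand-ins that only handle "a+b+c" integer sums.
def sum_builtin(expr: str) -> int:
    return sum(int(term) for term in expr.split("+"))


def sum_loop(expr: str) -> int:
    total = 0
    for term in expr.split("+"):
        total += int(term)
    return total


def sum_buggy(expr: str) -> int:
    # Deliberately wrong: drops the last term.
    return sum(int(term) for term in expr.split("+")[:-1])


# Illustrative probe; the campaign's probe has 24 expressions.
PROBE = ["1", "1+2", "10+20+30", "1+"]

if __name__ == "__main__":
    print(behavioral_mismatches({"a": sum_builtin, "b": sum_loop}, PROBE))
    print(len(behavioral_mismatches({"a": sum_builtin, "c": sum_buggy}, PROBE)))
```

Comparing error types as well as values keeps malformed inputs inside the like-for-like comparison.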
How it works (the real harness, N=5)
Each variant is run autonomously to completion by the real harness — engine/agents.py up brings
up the Builder + Adversary loop pair + watchdog, which work through the phase machine to
SEQUENCE-COMPLETE exactly as in production. There's no simulation: the agents self-pace via /loop,
coordinate through git (claim(/review( commits + the watchdog handoff), and the watchdog heals
stalls and rides out usage limits.
run-harness-bench.sh orchestrates the campaign:
- For each variant × repeat, it stands up a fresh shared bare repo + two clones (Builder and Adversary each get their own, for genuine cold verification), pre-trusts the work dirs, generates an agents.toml pointing at that variant's prompts (and a 4-phase config for deferred), and runs agents.py up.
- It polls for SEQUENCE-COMPLETE (per-run timeout), then tears the loop down.
- It re-runs the task's Definition-of-Done itself (cold, in the Adversary's clone) to confirm success, and tallies tokens per loop from the Claude Code session transcripts.
- One row per run is appended to RESULTS-campaign.md.data immediately (so partial results survive an interruption). Each run's git repo is kept under /tmp/ao-campaign-* for later analysis.
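The append-as-you-go behavior in the last step is what lets a long campaign survive an interruption. The runner itself is a shell script, so the Python below is only a sketch of the idea: the column names in `FIELDS` and the example values are hypothetical and do not reflect the real data file's layout.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical column order; the real data file's columns may differ.
FIELDS = ("variant", "run", "ok", "flags", "tokens", "seconds")


def append_row(data_file: Path, row: dict) -> None:
    """Append one tab-separated row and force it to disk before returning."""
    line = "\t".join(str(row[name]) for name in FIELDS) + "\n"
    with data_file.open("a", encoding="utf-8") as fh:
        fh.write(line)
        fh.flush()
        os.fsync(fh.fileno())  # a crash after this call cannot lose the row


if __name__ == "__main__":
    path = Path(tempfile.mkdtemp()) / "campaign.data"
    # Placeholder values for illustration only, not measured results.
    append_row(path, {"variant": "demo", "run": 1, "ok": "YES",
                      "flags": "", "tokens": 1000, "seconds": 60.0})
    print(path.read_text(encoding="utf-8"), end="")
```

Because the file is opened in append mode on every call, rows from earlier runs are never rewritten, which matches the note that the data file must be cleared manually for a fresh campaign.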
run-solo-bench.sh does the same for the single-builder builder-solo control.
analyze.py reads the data file and (re)generates RESULTS-campaign.md — per-variant token
distributions, the efficiency ratios, correlations, and the full raw table.
Run it yourself
git submodule update --init # fetch the vendored engine
# one variant, 5 runs, 45-min per-run timeout:
BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-harness-bench.sh builder-adversary
# the solo control:
BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-solo-bench.sh
python3 analyze.py # regenerate RESULTS-campaign.md
Needs claude on PATH (authenticated), plus python, tmux, git, timeout. run-harness-bench.sh
with no arguments runs all four loop-pair variants; pass variant names to run a subset. The data file
is append-mode (clear it manually for a fresh campaign).
Methodology & caveats
- N matters. A single full-loop run is highly nondeterministic — the same variant varied ±55% run-to-run early on, which is why everything here is N=5. (An early single-run "context hygiene halves tokens" claim did not reproduce; the stable figure is −22%.)
- Excluded runs: a few real failures (a wedge, a usage-limit/timeout collision) are in the raw data as NO and excluded from stats; they were superseded by clean re-runs. LIMIT-flagged runs (a usage-limit pause inflates duration, not tokens) are kept for token totals but excluded from tokens/sec.
- Scope: one task, one model (Sonnet), one harness. Relative findings should generalize; absolute numbers are task-specific. The adversary's quality value isn't measured here, because the task is too well-specified to make self-certification fail.
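The exclusion rules above can be read as two filters applied before aggregation. The sketch below assumes a simple in-memory record per run; the `Run` fields, the "YES" success value, and the example rows are hypothetical, and this is not the code in analyze.py.

```python
from statistics import median
from typing import List, NamedTuple, Optional


class Run(NamedTuple):
    variant: str
    ok: str         # "NO" marks a failed run; "YES" is an assumed success value
    limit: bool     # True if the run was LIMIT-flagged
    tokens: int
    seconds: float


def median_tokens(runs: List[Run], variant: str) -> Optional[float]:
    """Median tokens over successful runs; LIMIT-flagged runs still count."""
    values = [r.tokens for r in runs if r.variant == variant and r.ok != "NO"]
    return median(values) if values else None


def median_tokens_per_sec(runs: List[Run], variant: str) -> Optional[float]:
    """Median throughput; drops failed runs and LIMIT-flagged runs."""
    values = [
        r.tokens / r.seconds
        for r in runs
        if r.variant == variant and r.ok != "NO" and not r.limit and r.seconds > 0
    ]
    return median(values) if values else None


if __name__ == "__main__":
    # Placeholder rows for illustration only, not measured results.
    runs = [
        Run("demo", "YES", False, 100, 10.0),
        Run("demo", "YES", True, 200, 100.0),   # paused on a usage limit
        Run("demo", "NO", False, 999, 5.0),     # failed run
        Run("demo", "YES", False, 300, 20.0),
    ]
    print(median_tokens(runs, "demo"))
    print(median_tokens_per_sec(runs, "demo"))
```

Keeping LIMIT-flagged runs in the token median but out of the throughput median follows from the note that a pause inflates duration, not tokens.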
Layout
FINDINGS.md the synthesis — start here (current as of 2026-06-16)
RESULTS-campaign.md full-harness campaign analysis (stats + ratios + raw table) ← canonical
RESULTS-campaign.md.data raw per-run rows (TSV)
analyze.py aggregates the data file -> RESULTS-campaign.md
run-harness-bench.sh the full-harness campaign runner (loop pair, N runs/variant)
run-solo-bench.sh the builder-solo control runner
plans/calc/{lex,parse,eval,review}.md the calculator task
engine/ agent-orchestrator, vendored as a submodule (variants in engine/examples/)
# earlier / superseded exploratory runs (kept for history):
run-bench.sh first experiment: headless single-pass, 2 variants, roman-numeral task
plans/roman.md that experiment's task
RESULTS.md its results (N=1, single-pass — superseded by the campaign)
RESULTS-harness.md early 3-variant full-harness run (superseded by RESULTS-campaign.md)