agent-orchestrator-benchmark

Benchmarks for the agent-orchestrator Builder/Adversary loop — measuring what actually drives token cost: prompt design, context discipline, verification cadence, and whether there's an independent adversary at all. The engine is vendored as the engine/ submodule, pinned at a ref that ships the example variants being compared.

→ Findings

See FINDINGS.md for the synthesis (current as of 2026-06-16). The one-line takeaway:

What the AI adversary costs is set by whether it verifies at all (~4.7× a solo builder), not by how often it verifies (per-gate ≈ per-phase ≈ per-build, all ~13M tokens). The only clean way to cut that cost without dropping verification is context hygiene (−22%).

Headline (median tokens, N=5 per variant, all on Sonnet):

| variant | adversary verifies… | median tokens | vs orig |
|---|---|---|---|
| `builder-solo` | never (self-certifies) | 2.77M | −79% |
| `builder-adversary-min` | per phase (minimal prompts) | 9.77M | −25% |
| `builder-adversary-stateless` | per phase (+context hygiene) | 10.12M | −22% |
| `builder-adversary` (orig) | per phase | 13.04M | n/a (baseline) |
| `builder-adversary-deferred` | once, after whole build | 12.89M | −1% |
| `builder-adversary-lean` | per gate | 13.41M | +3% |
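The "vs orig" column and the ~4.7× figure follow directly from the medians in the table. A minimal sketch of that arithmetic, using only the numbers printed above (the dictionary and function names are illustrative, not part of analyze.py):

```python
# Median tokens (in millions), copied from the headline table.
MEDIANS_M = {
    "builder-solo": 2.77,
    "builder-adversary-min": 9.77,
    "builder-adversary-stateless": 10.12,
    "builder-adversary": 13.04,  # the baseline ("orig")
    "builder-adversary-deferred": 12.89,
    "builder-adversary-lean": 13.41,
}


def vs_orig_percent(variant, baseline="builder-adversary"):
    """Percent change in median tokens relative to the baseline, rounded."""
    return round((MEDIANS_M[variant] / MEDIANS_M[baseline] - 1) * 100)


for name in MEDIANS_M:
    print(f"{name:30s} {vs_orig_percent(name):+d}%")

# Cost of having an adversary at all: baseline median over solo median.
adversary_multiple = MEDIANS_M["builder-adversary"] / MEDIANS_M["builder-solo"]
print(f"adversary vs solo: {adversary_multiple:.1f}x")
```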

Full per-variant stats, efficiency ratios (tokens/LOC, tokens/sec, tokens/commit), correlations, and the raw per-run table are in RESULTS-campaign.md; raw rows in RESULTS-campaign.md.data.

The variants (engine examples being compared)

All live in engine/examples/. They share one task and differ in one dimension each:

| variant | what changes |
|---|---|
| `builder-adversary` | the original full prompts; Adversary verifies per phase (the baseline) |
| `builder-adversary-min` | prompts compressed to minimal tokens |
| `builder-adversary-stateless` | orig + context hygiene (compact per checkpoint, read diffs not trees, lean loads) |
| `builder-adversary-lean` | orig + context hygiene + per-gate review (one claim/verdict per gate) |
| `builder-adversary-deferred` | orig; Adversary verifies once, in a final comprehensive `review` phase |
| `builder-solo` | no Adversary: a single Builder that self-certifies |

(stateless/lean/deferred are built on the full original prompts, so each isolates its one change without the minimal-prompt confound.)

The task

Build a 3-phase Python calculator — lexer → parser → evaluator (plans/calc/{lex,parse,eval}.md), each phase with 4–6 cold-verifiable Definition-of-Done gates (deferred adds a comprehensive plans/calc/review.md). It's deliberately offline and deterministic so it stresses the protocol, not infrastructure, and the deliverable is behaviorally identical across variants (verified on a 24-expression probe) — so the comparison is like-for-like.

How it works (the real harness, N=5)

Each variant is run autonomously to completion by the real harness: `engine/agents.py up` starts the Builder + Adversary loop pair plus the watchdog, which work through the phase machine to SEQUENCE-COMPLETE exactly as in production. There's no simulation: the agents self-pace via /loop, coordinate through git (`claim(` / `review(` commits plus the watchdog handoff), and the watchdog heals stalls and rides out usage limits.

run-harness-bench.sh orchestrates the campaign:

For each variant × repeat, it stands up a fresh shared bare repo + two clones (Builder and Adversary each get their own, for genuine cold verification), pre-trusts the work dirs, generates an agents.toml pointing at that variant's prompts (and a 4-phase config for deferred), and runs agents.py up.
It polls for SEQUENCE-COMPLETE (per-run timeout), then tears the loop down.
It re-runs the task's Definition-of-Done itself (cold, in the Adversary's clone) to confirm success, and tallies tokens per loop from the Claude Code session transcripts.
One row per run is appended to RESULTS-campaign.md.data immediately (so partial results survive an interruption). Each run's git repo is kept under /tmp/ao-campaign-* for later analysis.
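The poll-with-timeout and append-immediately steps can be sketched as follows. This is an illustration in Python only: the actual runner is a shell script, and the helper names, the completion check, and the row fields here are all assumptions rather than the script's real interface.

```python
import os
import time


def poll_until(is_done, timeout_s, interval_s=1.0):
    """Call is_done() until it returns True or timeout_s elapses.

    is_done is a hypothetical callable standing in for whatever check
    detects SEQUENCE-COMPLETE. Returns True on completion, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_done():
            return True
        time.sleep(interval_s)
    return False


def append_row(data_path, fields):
    """Append one tab-separated row and force it to disk before returning,
    so a later interruption cannot lose rows that were already recorded."""
    line = "\t".join(str(f) for f in fields) + "\n"
    with open(data_path, "a", encoding="utf-8") as fh:
        fh.write(line)
        fh.flush()
        os.fsync(fh.fileno())
```

Writing each row as soon as its run finishes, rather than collecting rows and writing once at the end, is what lets partial results survive an interrupted campaign.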

run-solo-bench.sh does the same for the single-builder builder-solo control. analyze.py reads the data file and (re)generates RESULTS-campaign.md — per-variant token distributions, the efficiency ratios, correlations, and the full raw table.
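As a rough picture of the aggregation step, the sketch below groups TSV rows by variant and takes the median token count. It is not the code in analyze.py: the column names `variant` and `tokens`, and the presence of a header row, are assumptions, and the sample rows are placeholders rather than campaign results.

```python
import csv
import io
import statistics
from collections import defaultdict


def median_tokens_by_variant(tsv_text):
    """Group rows by the (assumed) 'variant' column and return the median
    of the (assumed) 'tokens' column for each variant."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    tokens = defaultdict(list)
    for row in reader:
        tokens[row["variant"]].append(int(row["tokens"]))
    return {variant: statistics.median(vals) for variant, vals in tokens.items()}


# Placeholder rows for illustration only; not real measurements.
SAMPLE_TSV = (
    "variant\ttokens\n"
    "example-a\t100\n"
    "example-a\t300\n"
    "example-a\t200\n"
    "example-b\t50\n"
)
print(median_tokens_by_variant(SAMPLE_TSV))
```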

Run it yourself

git submodule update --init                                   # fetch the vendored engine
# one variant, 5 runs, 45-min per-run timeout:
BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-harness-bench.sh builder-adversary
# the solo control:
BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-solo-bench.sh
python3 analyze.py                                            # regenerate RESULTS-campaign.md

Needs claude on PATH (authenticated), plus python3, tmux, git, timeout. run-harness-bench.sh with no arguments runs all four loop-pair variants; pass variant names to run a subset. The data file is opened in append mode, so clear it manually before starting a fresh campaign.

Methodology & caveats

N matters. A single full-loop run is highly nondeterministic — the same variant varied ±55% run-to-run early on, which is why everything here is N=5. (An early single-run "context hygiene halves tokens" claim did not reproduce; the stable figure is −22%.)
Excluded runs: a few real failures (a wedge, a usage-limit/timeout collision) are recorded in the raw data as NO and excluded from the stats; they were superseded by clean re-runs. LIMIT-flagged runs (a usage-limit pause inflates duration, not tokens) are kept for token totals but excluded from tokens/sec.
Scope: one task, one model (Sonnet), one harness. Relative findings should generalize; absolute numbers are task-specific. The adversary's quality value isn't measured here — the task is too well-specified to make self-certification fail.
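The exclusion rules above amount to two nested filters. A minimal sketch, assuming each run is a record with hypothetical fields `ok` (False for a NO run), `limit` (True for a LIMIT-flagged run), `tokens`, and `seconds`; the real data file's columns may differ, and the sample values are placeholders:

```python
from statistics import median


def summarize(runs):
    """Apply the exclusion rules to one variant's runs.

    Failed (NO) runs are dropped everywhere. LIMIT-flagged runs still count
    toward token totals but are left out of tokens/sec, because the pause
    inflates their duration.
    """
    counted = [r for r in runs if r["ok"]]
    timed = [r for r in counted if not r["limit"]]
    return {
        "n": len(counted),
        "median_tokens": median(r["tokens"] for r in counted),
        "median_tokens_per_sec": median(r["tokens"] / r["seconds"] for r in timed),
    }


# Placeholder runs for illustration only; not real measurements.
runs = [
    {"ok": True, "limit": False, "tokens": 1000, "seconds": 10},
    {"ok": True, "limit": True, "tokens": 1200, "seconds": 600},
    {"ok": True, "limit": False, "tokens": 800, "seconds": 10},
    {"ok": False, "limit": False, "tokens": 50, "seconds": 5},
]
print(summarize(runs))
```

Note that `median` raises `statistics.StatisticsError` on an empty sequence, so a variant with no successful runs would need separate handling.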

Layout

FINDINGS.md                  the synthesis — start here (current as of 2026-06-16)
RESULTS-campaign.md          full-harness campaign analysis (stats + ratios + raw table)  ← canonical
RESULTS-campaign.md.data     raw per-run rows (TSV)
analyze.py                   aggregates the data file -> RESULTS-campaign.md
run-harness-bench.sh         the full-harness campaign runner (loop pair, N runs/variant)
run-solo-bench.sh            the builder-solo control runner
plans/calc/{lex,parse,eval,review}.md   the calculator task
engine/                      agent-orchestrator, vendored as a submodule (variants in engine/examples/)

# earlier / superseded exploratory runs (kept for history):
run-bench.sh                 first experiment: headless single-pass, 2 variants, roman-numeral task
plans/roman.md               that experiment's task
RESULTS.md                   its results (N=1, single-pass — superseded by the campaign)
RESULTS-harness.md           early 3-variant full-harness run (superseded by RESULTS-campaign.md)
