# agent-orchestrator-benchmark

Benchmarks for the [`agent-orchestrator`](https://git.autonomic.zone/recipe-maintainers/agent-orchestrator)
Builder/Adversary loop — measuring **what actually drives token cost**: prompt design, context
discipline, verification cadence, and whether there's an independent adversary at all.

The engine is vendored as the `engine/` submodule, pinned at a ref that ships the example variants
being compared.

## → Findings

**See [`FINDINGS.md`](FINDINGS.md)** for the synthesis (current as of **2026-06-16**). The one-line takeaway:

> What the AI adversary costs is set by **whether it verifies at all** (~4.7× a solo builder), **not
> by how often** it verifies (per-gate ≈ per-phase ≈ per-build, all ~13M tokens). The only clean way
> to cut that cost without dropping verification is **context hygiene (−22%)**.

Headline (median tokens, N=5 per variant, all on Sonnet):

| variant | adversary verifies… | median tokens | vs orig |
|---|---|--:|--:|
| `builder-solo` | never (self-certifies) | 2.77M | −79% |
| `builder-adversary-min` | per phase *(minimal prompts)* | 9.77M | −25% |
| `builder-adversary-stateless` | per phase *(+context hygiene)* | 10.12M | −22% |
| `builder-adversary` (orig) | per **phase** | 13.04M | — |
| `builder-adversary-deferred` | once, after **whole build** | 12.89M | −1% |
| `builder-adversary-lean` | per **gate** | 13.41M | +3% |

Full per-variant stats, efficiency ratios (tokens/LOC, tokens/sec, tokens/commit), correlations, and
the raw per-run table are in [`RESULTS-campaign.md`](RESULTS-campaign.md); raw rows in
`RESULTS-campaign.md.data`.

## The variants (engine examples being compared)

All live in `engine/examples/`. They share one task and differ in one dimension each:

| variant | what changes |
|---|---|
| `builder-adversary` | the original full prompts; Adversary verifies **per phase** (the baseline) |
| `builder-adversary-min` | prompts compressed to minimal tokens |
| `builder-adversary-stateless` | orig + **context hygiene** (compact per checkpoint, read diffs not trees, lean loads) |
| `builder-adversary-lean` | orig + context hygiene + **per-gate** review (one claim/verdict per gate) |
| `builder-adversary-deferred` | orig; Adversary verifies **once**, in a final comprehensive `review` phase |
| `builder-solo` | **no Adversary** — a single Builder that self-certifies |

(stateless/lean/deferred are built on the *full original* prompts, so each isolates its one change
without the minimal-prompt confound.)

## The task

Build a 3-phase Python calculator — lexer → parser → evaluator
(`plans/calc/{lex,parse,eval}.md`), each phase with 4–6 cold-verifiable Definition-of-Done gates
(`deferred` adds a comprehensive `plans/calc/review.md`). It's deliberately offline and
deterministic so it stresses the *protocol*, not infrastructure, and the deliverable is
behaviorally identical across variants (verified on a 24-expression probe) — so the comparison is
like-for-like.

## How it works (the real harness, N=5)

Each variant is run **autonomously to completion** by the real harness — `engine/agents.py up`
brings up the Builder + Adversary loop pair + watchdog, which work through the phase machine to
`SEQUENCE-COMPLETE` exactly as in production. There's no simulation: the agents self-pace via
`/loop`, coordinate through git (`claim(`/`review(` commits + the watchdog handoff), and the
watchdog heals stalls and rides out usage limits.

`run-harness-bench.sh` orchestrates the campaign:

1. For each variant × repeat, it stands up a fresh **shared bare repo + two clones** (Builder and
   Adversary each get their own, for genuine cold verification), pre-trusts the work dirs,
   generates an `agents.toml` pointing at that variant's prompts (and a 4-phase config for
   `deferred`), and runs `agents.py up`.
2. It polls for `SEQUENCE-COMPLETE` (per-run timeout), then tears the loop down.
3. It re-runs the task's Definition-of-Done itself (cold, in the Adversary's clone) to confirm
   success, and **tallies tokens per loop from the Claude Code session transcripts**.
4. One row per run is appended to `RESULTS-campaign.md.data` immediately (so partial results
   survive an interruption). Each run's git repo is kept under `/tmp/ao-campaign-*` for later
   analysis.

`run-solo-bench.sh` does the same for the single-builder `builder-solo` control.

`analyze.py` reads the data file and (re)generates `RESULTS-campaign.md` — per-variant token
distributions, the efficiency ratios, correlations, and the full raw table.

### Run it yourself

```bash
git submodule update --init          # fetch the vendored engine

# one variant, 5 runs, 45-min per-run timeout:
BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-harness-bench.sh builder-adversary

# the solo control:
BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-solo-bench.sh

python3 analyze.py                   # regenerate RESULTS-campaign.md
```

Needs `claude` on `PATH` (authenticated), plus `python`, `tmux`, `git`, `timeout`.
`run-harness-bench.sh` with no arguments runs all four loop-pair variants; pass variant names to
run a subset. The data file is **append-mode** (clear it manually for a fresh campaign).

## Methodology & caveats

- **N matters.** A single full-loop run is highly nondeterministic — the *same* variant varied
  ±55% run-to-run early on, which is why everything here is **N=5**. (An early single-run
  "context hygiene halves tokens" claim did **not** reproduce; the stable figure is −22%.)
- **Excluded runs:** a few real failures (a wedge, a usage-limit/timeout collision) are in the raw
  data as `NO` and excluded from stats; superseded by clean re-runs. `LIMIT`-flagged runs (a
  usage-limit *pause* inflates duration, not tokens) are kept for token totals but excluded from
  `tokens/sec`.
- **Scope:** one task, one model (Sonnet), one harness. *Relative* findings should generalize;
  absolute numbers are task-specific. The adversary's *quality* value isn't measured here — the
  task is too well-specified to make self-certification fail.

## Layout

```
FINDINGS.md                             the synthesis — start here (current as of 2026-06-16)
RESULTS-campaign.md                     full-harness campaign analysis (stats + ratios + raw table)  ← canonical
RESULTS-campaign.md.data                raw per-run rows (TSV)
analyze.py                              aggregates the data file -> RESULTS-campaign.md
run-harness-bench.sh                    the full-harness campaign runner (loop pair, N runs/variant)
run-solo-bench.sh                       the builder-solo control runner
plans/calc/{lex,parse,eval,review}.md   the calculator task
engine/                                 agent-orchestrator, vendored as a submodule (variants in engine/examples/)

# earlier / superseded exploratory runs (kept for history):
run-bench.sh                            first experiment: headless single-pass, 2 variants, roman-numeral task
plans/roman.md                          that experiment's task
RESULTS.md                              its results (N=1, single-pass — superseded by the campaign)
RESULTS-harness.md                      early 3-variant full-harness run (superseded by RESULTS-campaign.md)
```
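
To sanity-check the headline table in the Findings section, the "vs orig" column and the ~4.7× figure can be recomputed from the published medians alone. The snippet below is a minimal sketch, not part of `analyze.py`: the names `MEDIANS_M`, `vs_orig`, and `adversary_multiplier` are hypothetical, and the numbers are copied by hand from the table rather than read from `RESULTS-campaign.md.data`.

```python
# Median tokens in millions, copied by hand from the headline table.
# (Illustrative only; this is not how analyze.py loads its data.)
MEDIANS_M = {
    "builder-solo": 2.77,
    "builder-adversary-min": 9.77,
    "builder-adversary-stateless": 10.12,
    "builder-adversary": 13.04,  # "orig", the baseline
    "builder-adversary-deferred": 12.89,
    "builder-adversary-lean": 13.41,
}
BASELINE = "builder-adversary"


def vs_orig(variant: str) -> int:
    """Percent change in median tokens relative to the baseline, rounded."""
    base = MEDIANS_M[BASELINE]
    return round((MEDIANS_M[variant] - base) / base * 100)


def adversary_multiplier() -> float:
    """How many times more tokens the baseline uses than the solo builder."""
    return MEDIANS_M[BASELINE] / MEDIANS_M["builder-solo"]


if __name__ == "__main__":
    for name in MEDIANS_M:
        print(f"{name:30s} {MEDIANS_M[name]:6.2f}M  {vs_orig(name):+d}%")
    print(f"baseline / solo = {adversary_multiplier():.1f}x")
```

Working from the rounded medians is enough to reproduce the table's percentages; for anything finer-grained, the per-run rows in `RESULTS-campaign.md.data` are the source of truth.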