# agent-orchestrator-benchmark Benchmarks for the [`agent-orchestrator`](https://git.autonomic.zone/recipe-maintainers/agent-orchestrator) harness — vendored here as the `engine/` submodule, pinned at a ref that ships the example variants being compared. ## What it measures A head-to-head between two example variants in the engine: - **`builder-adversary`** — the original Builder/Adversary loop-pair prompts. - **`builder-adversary-min`** — the same pattern with the role + kickoff prompts compressed to minimal tokens. The benchmark confirms each variant **independently succeeds** on the same task (no shared context) and **clocks the tokens** each uses. ## Run ```bash git submodule update --init # fetch the vendored engine (first time) ./run-bench.sh # writes RESULTS.md ``` Needs `claude` on `PATH` and `python`/`timeout`. Both variants run on **Sonnet** (`claude-sonnet-4-6`) for Builder and Adversary. ## How it works `run-bench.sh` assembles exactly the prompt the harness would send a loop agent (the variant's `kickoff.md` with `{phase_id}/{plan}/{status}/{role}` substituted, then the role prompt), then drives one **Builder** pass and one **Adversary** pass as separate headless `claude -p` sessions — fresh context each, so the two variants (and the two roles) share no context. The Builder builds and commits in its own repo; the Adversary cold-verifies from its **own clone**. The script then re-runs the task's Definition-of-Done check itself and reads the Adversary's verdict, and tallies tokens from `claude -p --output-format json`. The test problem is [`plans/roman.md`](plans/roman.md) — an integer→Roman-numeral CLI with a stdlib `unittest` suite (deterministic, fully local, cold-verifiable, and not present in either example). ### Caveats - This is a **controlled single pass** per variant (N=1; expect run-to-run variance), not the full self-paced watchdog loop. It measures task effectiveness + prompt token cost, **not** the live loop / handoff / liveness machinery (that needs a real `engine/agents.py up` run). - Each `claude -p` call carries a fixed ~24k-token cached system-prompt/tool overhead, and most tokens come from the agentic work itself — so the prompt-size difference is a small slice of the total. `RESULTS.md` reports the static prompt size separately so the minimisation is visible. ## Layout ``` engine/ agent-orchestrator, vendored as a submodule (the variants live in engine/examples/) plans/roman.md the test problem (single source of truth + Definition of Done) run-bench.sh the runner RESULTS.md generated by run-bench.sh ```