feat: agent-orchestrator-benchmark — prompt token comparison harness

A standalone repo (engine vendored as a submodule at the examples commit) that runs a head-to-head between the builder-adversary and builder-adversary-min example variants: same task, independent headless runs, both on Sonnet, with token counts. Includes the roman-numeral test problem and run-bench.sh. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:20:05 +00:00
commit 27df2c7b55
6 changed files with 257 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -0,0 +1,57 @@
+# agent-orchestrator-benchmark
+
+Benchmarks for the [`agent-orchestrator`](https://git.autonomic.zone/recipe-maintainers/agent-orchestrator)
+harness — vendored here as the `engine/` submodule, pinned at a ref that ships the example variants
+being compared.
+
+## What it measures
+
+A head-to-head between two example variants in the engine:
+
+- **`builder-adversary`** — the original Builder/Adversary loop-pair prompts.
+- **`builder-adversary-min`** — the same pattern with the role + kickoff prompts compressed to
+  minimal tokens.
+
+The benchmark confirms each variant **independently succeeds** on the same task (no shared context)
+and **clocks the tokens** each uses.
+
+## Run
+
+```bash
+git submodule update --init      # fetch the vendored engine (first time)
+./run-bench.sh                   # writes RESULTS.md
+```
+
+Needs `claude` on `PATH` and `python`/`timeout`. Both variants run on **Sonnet**
+(`claude-sonnet-4-6`) for Builder and Adversary.
+
+## How it works
+
+`run-bench.sh` assembles exactly the prompt the harness would send a loop agent (the variant's
+`kickoff.md` with `{phase_id}/{plan}/{status}/{role}` substituted, then the role prompt), then drives
+one **Builder** pass and one **Adversary** pass as separate headless `claude -p` sessions — fresh
+context each, so the two variants (and the two roles) share no context. The Builder builds and
+commits in its own repo; the Adversary cold-verifies from its **own clone**. The script then re-runs
+the task's Definition-of-Done check itself and reads the Adversary's verdict, and tallies tokens from
+`claude -p --output-format json`.
+
+The test problem is [`plans/roman.md`](plans/roman.md) — an integer→Roman-numeral CLI with a stdlib
+`unittest` suite (deterministic, fully local, cold-verifiable, and not present in either example).
+
+### Caveats
+
+- This is a **controlled single pass** per variant (N=1; expect run-to-run variance), not the full
+  self-paced watchdog loop. It measures task effectiveness + prompt token cost, **not** the live
+  loop / handoff / liveness machinery (that needs a real `engine/agents.py up` run).
+- Each `claude -p` call carries a fixed ~24k-token cached system-prompt/tool overhead, and most
+  tokens come from the agentic work itself — so the prompt-size difference is a small slice of the
+  total. `RESULTS.md` reports the static prompt size separately so the minimisation is visible.
+
+## Layout
+
+```
+engine/            agent-orchestrator, vendored as a submodule (the variants live in engine/examples/)
+plans/roman.md     the test problem (single source of truth + Definition of Done)
+run-bench.sh       the runner
+RESULTS.md         generated by run-bench.sh
+```