recipe-maintainers/agent-orchestrator-benchmark

mfowler 0fa3d726a5 results: full-harness 3-way (orig/min/stateless) on the calculator

All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full
adversary verification (calc correct, tests OK). Tokens:
  orig       10,756,363
  min        10,119,225  (-5.9%)
  stateless   5,964,829  (-44.5%)
Prompt-prose minimization barely moved tokens; context hygiene (stateless)
nearly halved them, driven by ~48% lower cache_read. Quality held.

Also fix the runner's success check: it grepped the word "veto" and matched
"No veto" → false failures; now matches the "## VETO" marker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-14 21:35:47 +00:00
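
The success-check fix described in the commit message above can be illustrated with a minimal Python sketch. This is not the runner's code (the runner is a shell script that uses grep); the function names and the sample verdict strings are hypothetical, and only the "No veto" phrase and the "## VETO" marker come from the commit message.

```python
import re

def naive_has_veto(verdict: str) -> bool:
    # Buggy check: any occurrence of the word counts, so "No veto" is a hit.
    return re.search(r"veto", verdict, flags=re.IGNORECASE) is not None

def marker_has_veto(verdict: str) -> bool:
    # Fixed check: only a line that starts with the "## VETO" marker counts.
    return re.search(r"^## VETO\b", verdict, flags=re.MULTILINE) is not None

# Hypothetical Adversary verdicts, for illustration only.
passing = "Verified from a cold clone. No veto."
failing = "## VETO\nTests fail on the cold clone."

print(naive_has_veto(passing), marker_has_veto(passing))
print(naive_has_veto(failing), marker_has_veto(failing))
```

The naive check flags the passing verdict as a veto, which is the false failure the commit describes; the marker check flags only the verdict that carries the heading.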

engine @ 985d33dd51

feat: add stateless variant, pre-trust work dirs, loop over 3 variants

2026-06-14 20:52:29 +00:00

plans

feat: multi-phase calculator problem + full-harness benchmark runner

2026-06-14 20:40:14 +00:00

.gitignore

chore: gitignore the runner's transient .tmp file

2026-06-14 20:40:26 +00:00

.gitmodules

feat: agent-orchestrator-benchmark — prompt token comparison harness

2026-06-14 20:20:05 +00:00

README.md

feat: agent-orchestrator-benchmark — prompt token comparison harness

2026-06-14 20:20:05 +00:00

RESULTS-harness.md

results: full-harness 3-way (orig/min/stateless) on the calculator

2026-06-14 21:35:47 +00:00

RESULTS.md

feat: multi-phase calculator problem + full-harness benchmark runner

2026-06-14 20:40:14 +00:00

run-bench.sh

feat: agent-orchestrator-benchmark — prompt token comparison harness

2026-06-14 20:20:05 +00:00

run-harness-bench.sh

results: full-harness 3-way (orig/min/stateless) on the calculator

2026-06-14 21:35:47 +00:00

README.md

agent-orchestrator-benchmark

Benchmarks for the agent-orchestrator harness — vendored here as the engine/ submodule, pinned at a ref that ships the example variants being compared.

What it measures

A head-to-head between two example variants in the engine:

builder-adversary — the original Builder/Adversary loop-pair prompts.
builder-adversary-min — the same pattern with the role + kickoff prompts compressed to minimal tokens.

The benchmark confirms that each variant succeeds on the same task independently (no shared context) and records how many tokens each one uses.

Run

git submodule update --init      # fetch the vendored engine (first time)
./run-bench.sh                   # writes RESULTS.md

Requires claude on PATH, plus python and timeout. Both variants run Builder and Adversary on Sonnet (claude-sonnet-4-6).

How it works

run-bench.sh assembles exactly the prompt the harness would send a loop agent: the variant's kickoff.md with {phase_id}/{plan}/{status}/{role} substituted, followed by the role prompt. It then drives one Builder pass and one Adversary pass as separate headless claude -p sessions, each with a fresh context, so neither the two variants nor the two roles share any context. The Builder builds and commits in its own repo; the Adversary cold-verifies from its own clone. Finally, the script re-runs the task's Definition-of-Done check itself, reads the Adversary's verdict, and tallies tokens from the claude -p --output-format json output.
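
The prompt-assembly step can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in run-bench.sh: the function name assemble_prompt, the sample template text, the sample values, and the blank-line separator between kickoff and role prompt are all hypothetical. Only the four placeholder names and the kickoff-then-role ordering come from the description above.

```python
import re

PLACEHOLDERS = ("phase_id", "plan", "status", "role")

def assemble_prompt(kickoff_template: str, role_prompt: str, values: dict) -> str:
    """Fill the kickoff placeholders in one pass, then append the role prompt."""
    pattern = re.compile(r"\{(" + "|".join(PLACEHOLDERS) + r")\}")
    # A single-pass substitution means a substituted value that happens to
    # contain "{status}" (for example) is not expanded a second time.
    kickoff = pattern.sub(lambda m: values[m.group(1)], kickoff_template)
    # Assumed separator: one blank line between kickoff and role prompt.
    return kickoff + "\n\n" + role_prompt

# Hypothetical template and values, for illustration only.
kickoff = "You are the {role} for phase {phase_id}. Plan: {plan}. Status: {status}."
values = {
    "phase_id": "P1",
    "plan": "plans/roman.md",
    "status": "not started",
    "role": "Builder",
}
prompt = assemble_prompt(kickoff, "Build the CLI, run the tests, then commit.", values)
print(prompt)
```

A regular-expression substitution is used instead of str.format so that any other braces in a template are left untouched.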

The test problem is plans/roman.md — an integer→Roman-numeral CLI with a stdlib unittest suite (deterministic, fully local, cold-verifiable, and not present in either example).

Caveats

This is a controlled single pass per variant (N=1; expect run-to-run variance), not the full self-paced watchdog loop. It measures task effectiveness and prompt token cost, not the live loop, handoff, or liveness machinery; measuring those needs a real engine/agents.py up run.
Each claude -p call carries a fixed ~24k-token cached system-prompt/tool overhead, and most tokens come from the agentic work itself — so the prompt-size difference is a small slice of the total. RESULTS.md reports the static prompt size separately so the minimisation is visible.
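
To see why the cached overhead dominates, it helps to tally usage per category rather than as one number. The sketch below sums usage fields across several JSON results. The field names (input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens, nested under a usage key) are an assumption about the shape of claude -p --output-format json output and should be checked against a real run; the function name tally_usage and the sample payloads are hypothetical and are not measured results.

```python
import json

# Assumed usage field names; verify against real claude -p JSON output.
USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)

def tally_usage(raw_outputs):
    """Sum each usage field across the JSON documents of several sessions."""
    totals = dict.fromkeys(USAGE_FIELDS, 0)
    for raw in raw_outputs:
        usage = json.loads(raw).get("usage", {})
        for field in USAGE_FIELDS:
            totals[field] += usage.get(field, 0)
    totals["total"] = sum(totals[field] for field in USAGE_FIELDS)
    return totals

# Hypothetical payloads for one Builder and one Adversary session.
builder = json.dumps({"usage": {"input_tokens": 10, "output_tokens": 200,
                                "cache_read_input_tokens": 24000}})
adversary = json.dumps({"usage": {"input_tokens": 8, "output_tokens": 150,
                                  "cache_read_input_tokens": 24000}})
totals = tally_usage([builder, adversary])
print(totals)
```

Keeping cache reads in their own column makes it visible how much of a total is fixed per-session overhead and how much is attributable to the prompt or the agentic work.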

Layout

engine/            agent-orchestrator, vendored as a submodule (the variants live in engine/examples/)
plans/roman.md     the test problem (single source of truth + Definition of Done)
run-bench.sh       the runner
RESULTS.md         generated by run-bench.sh