recipe-maintainers/agent-orchestrator-benchmark


mfowler 8c3f38dbf4 feat: multi-phase calculator problem + full-harness benchmark runner

- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per
  phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial
  edge cases (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real agents.py up Builder/Adversary loop pair
  + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and
  clocks tokens from the session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token
  effect; cache-read of the working context dominates)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-14 20:40:14 +00:00

2.1 KiB


Benchmark results — original vs minimal prompts

Engine pinned at: 737ef81. Task: plans/roman.md (integer → Roman numeral). Model: claude-sonnet-4-6 for Builder and Adversary in both versions. Runs are independent (separate headless claude -p sessions, no shared context). Methodology + caveats: see run-bench.sh header and the note below.

Static prompt size (chars: kickoff + role, what gets sent each kickoff)

version	builder prompt	adversary prompt
builder-adversary (orig)	6389	5811
builder-adversary-min	1751	1644
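
The size reduction implied by the table can be recomputed from the character counts. This is a minimal sketch in Python; the dictionary layout and the helper name `reduction_pct` are illustrative and not part of the benchmark scripts.

```python
# Character counts copied from the static prompt size table above.
PROMPT_CHARS = {
    "builder-adversary": {"builder": 6389, "adversary": 5811},
    "builder-adversary-min": {"builder": 1751, "adversary": 1644},
}


def reduction_pct(orig: int, minimal: int) -> float:
    """Percentage by which the minimal prompt is smaller than the original."""
    return 100.0 * (orig - minimal) / orig


# Hypothetical helper usage: compare each role's prompt across the two versions.
for role in ("builder", "adversary"):
    orig = PROMPT_CHARS["builder-adversary"][role]
    minimal = PROMPT_CHARS["builder-adversary-min"][role]
    print(f"{role}: {orig} -> {minimal} chars ({reduction_pct(orig, minimal):.1f}% smaller)")
```

With these counts, both prompts shrink by roughly 72%.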

Per-run tokens & cost

builder-adversary

success: YES (tests=yes, cli=yes, adversary-verdict=PASS)
builder: in=21 out=4007 cache_create=14460 cache_read=526213 → 544701 tok, $0.3073279, turns=21
adversary: in=14 out=3245 cache_create=14930 cache_read=331897 → 350086 tok, $0.2402281, turns=17
total: 894787 tokens, $0.5476

builder-adversary-min

success: YES (tests=yes, cli=yes, adversary-verdict=PASS)
builder: in=20 out=4257 cache_create=13183 cache_read=477142 → 494602 tok, $0.2874066, turns=18
adversary: in=16 out=4545 cache_create=14792 cache_read=378787 → 398140 tok, $0.2718171, turns=16
total: 892742 tokens, $0.5592
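
Each per-agent total above is the sum of its four token fields (in, out, cache_create, cache_read). The following sketch checks that against the reported figures. The function name `check_line` and the regular expressions are assumptions made for illustration; this is not how run-bench.sh itself computes the totals.

```python
import re

# The four fields that make up a per-agent token total (turns is excluded).
TOKEN_FIELDS = ("in", "out", "cache_create", "cache_read")

# Token portion of the per-run lines, copied from the results above.
LINES = [
    "builder: in=21 out=4007 cache_create=14460 cache_read=526213 → 544701 tok",
    "adversary: in=14 out=3245 cache_create=14930 cache_read=331897 → 350086 tok",
    "builder: in=20 out=4257 cache_create=13183 cache_read=477142 → 494602 tok",
    "adversary: in=16 out=4545 cache_create=14792 cache_read=378787 → 398140 tok",
]


def check_line(line: str) -> tuple[int, int]:
    """Return (recomputed_total, reported_total) for one per-agent line."""
    counts = {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}
    recomputed = sum(counts[field] for field in TOKEN_FIELDS)
    reported = int(re.search(r"(\d+) tok", line).group(1))
    return recomputed, reported


for line in LINES:
    recomputed, reported = check_line(line)
    status = "ok" if recomputed == reported else "MISMATCH"
    print(f"{status}: recomputed={recomputed} reported={reported}")
```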

Summary

version	success	total tokens	total cost
builder-adversary (orig)	YES	894787	$0.5476
builder-adversary-min	YES	892742	$0.5592

Note: each claude -p call carries a fixed ~24k-token cached Claude Code system-prompt and tool-schema overhead, and most tokens come from the agentic work itself (reading the plan, writing and running code, tool results). The role/kickoff prompt is a small slice of that, so the headline token totals are close: 892742 vs 894787, about 0.2% apart. The minimisation shows up in the static prompt size above and in the smaller cache-creation counts (builder 13183 vs 14460, adversary 14792 vs 14930). The minimal version did not cost less in this run ($0.5592 vs $0.5476): it produced more output tokens (8802 vs 7252 across both agents), and with N=1 that difference may simply be run-to-run variance.

This bench is a single controlled pass per version (N=1; expect run-to-run variance). It exercises task effectiveness and prompt cost, NOT the live watchdog loop or handoff machinery, which needs a full agents.py up run.
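
To make the note's point concrete, the share of cache reads in each run and the relative change between versions can be derived from the figures reported above. This is a sketch only; the `RUNS` layout and the helper names `cache_read_share` and `pct_change` are illustrative assumptions, not part of the benchmark tooling.

```python
# Figures copied from the per-run lines and the summary table above.
RUNS = {
    "builder-adversary": {
        "cache_read": 526213 + 331897,  # builder + adversary
        "output": 4007 + 3245,
        "total_tokens": 894787,
        "cost_usd": 0.5476,
    },
    "builder-adversary-min": {
        "cache_read": 477142 + 378787,
        "output": 4257 + 4545,
        "total_tokens": 892742,
        "cost_usd": 0.5592,
    },
}


def cache_read_share(run: dict) -> float:
    """Fraction of a run's total tokens that were cache reads."""
    return run["cache_read"] / run["total_tokens"]


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


orig = RUNS["builder-adversary"]
mini = RUNS["builder-adversary-min"]

for name, run in RUNS.items():
    print(f"{name}: cache-read share = {cache_read_share(run):.1%}")

print(f"total tokens change: {pct_change(orig['total_tokens'], mini['total_tokens']):+.2f}%")
print(f"output tokens change: {pct_change(orig['output'], mini['output']):+.1f}%")
print(f"cost change: {pct_change(orig['cost_usd'], mini['cost_usd']):+.1f}%")
```

Cache reads account for roughly 96% of the tokens in both runs, which is why a prompt that is about 72% smaller barely moves the total.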

Work dirs for this run: /tmp/ao-benchmark.CwQFWF
