recipe-maintainers/agent-orchestrator-benchmark

Files

mfowler 0fa3d726a5 results: full-harness 3-way (orig/min/stateless) on the calculator

All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full
adversary verification (calc correct, tests OK). Tokens:
  orig       10,756,363
  min        10,119,225  (-5.9%)
  stateless   5,964,829  (-44.5%)
Prompt-prose minimization barely moved tokens; context hygiene (stateless)
nearly halved them, driven by ~48% lower cache_read. Quality held.

Also fix the runner's success check: it grepped the word "veto" and matched
"No veto" → false failures; now matches the "## VETO" marker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-14 21:35:47 +00:00

4.1 KiB

Raw Blame History

Full-harness benchmark — prompt variants

Real agents.py up Builder/Adversary loop pair + watchdog, run autonomously through the multi-phase calculator (plans/calc/{lex,parse,eval}.md — 3 phases, 4–6 gates each) to SEQUENCE-COMPLETE. Engine pinned at 985d33d. Both loops on claude-sonnet-4-6. Per-variant timeout 3000s. Tokens summed from the Claude Code session transcripts of each loop's clone. N=1 per variant (the autonomous loop is nondeterministic — number of review rounds varies).

Variants

builder-adversary — original prompts.
builder-adversary-min — same rules, prose compressed to minimal tokens.
builder-adversary-stateless — min + context hygiene (compact at each checkpoint, read diffs not trees, spill bulk to files, adversary loads only {plan, STATUS, diff}); loop sessions non-resumed → fresh context per phase. Same AI-as-adversary verification.

All three succeeded

Every variant completed all three phases with the Adversary cold-verifying every gate (no veto), and the final calculator is correct in each:

version	sequence-complete	unittest	`2+3*4`	`(2+3)*4`	`7/2`	result
builder-adversary	yes	OK	14	20	3.5	PASS
builder-adversary-min	yes	OK	14	20	3.5	PASS
builder-adversary-stateless	yes	OK	14	20	3.5	PASS

(The original Adversary even filed a non-blocking advisory on lex — genuine adversarial review, not rubber-stamping.)

Static prompt size (chars: kickoff + role)

version	builder	adversary
builder-adversary	6389	5811
builder-adversary-min	1751	1644
builder-adversary-stateless	2430	2218

Tokens (from session transcripts)

version	builder loop	adversary loop	total	vs orig
builder-adversary	5,557,356	5,199,007	10,756,363	—
builder-adversary-min	5,350,953	4,768,272	10,119,225	−5.9%
builder-adversary-stateless	2,834,505	3,130,324	5,964,829	−44.5%

Breakdown (the dominant term is cache_read — re-reading the conversation each turn):

version	role	input	output	cache_create	cache_read
builder-adversary	builder	199	87,704	256,054	5,213,399
builder-adversary	adversary	181	66,540	183,189	4,949,097
builder-adversary-min	builder	216	54,381	254,524	5,041,832
builder-adversary-min	adversary	213	58,838	202,916	4,506,305
builder-adversary-stateless	builder	124	26,998	113,026	2,694,357
builder-adversary-stateless	adversary	141	31,342	122,360	2,976,481

Findings

Prompt-prose minimization barely moves tokens (−5.9%). min cut the prompt to ~⅓ the size but saved almost nothing — because the role/kickoff prompt is a tiny, cached slice. Worth keeping for readability; not a token lever.
Context hygiene nearly halves tokens (−44.5%), quality intact. stateless produced the same correct, fully-verified calculator while cutting total tokens ~45%. The saving is dominated by cache_read falling ~48% (builder 5.21M→2.69M) — exactly the "don't carry/reload context you don't need" lever. It also cut output tokens (builder 87.7k→27.0k), i.e. less redundant regeneration across turns.
The cost is the conversation, not the prompt. cache_read ≫ everything else in all three. Any real efficiency work should target carried/reloaded context (compaction cadence, fresh sessions per unit of work, diff-not-tree reads), not prompt wording.

N=1 caveat: the autonomous loop is nondeterministic, so some of the gap is run-to-run variance (review-round count, retries). The ~45% reduction is large and matches the mechanism (cache_read roughly halved), but repeating the run a few times would tighten the estimate.

Run dirs: /tmp/ao-harness-YIrsUp. (A prior auto-generated version mislabeled orig/stateless as failed — a bug in the harness success check grepping the word "veto" and matching "No veto"; fixed to match the ## VETO marker. Functionally all three passed, as verified above.)

4.1 KiB Raw Blame History Unescape Escape