Files
agent-orchestrator-benchmark/RESULTS-harness.md
mfowler 0fa3d726a5 results: full-harness 3-way (orig/min/stateless) on the calculator
All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full
adversary verification (calc correct, tests OK). Tokens:
  orig       10,756,363
  min        10,119,225  (-5.9%)
  stateless   5,964,829  (-44.5%)
Prompt-prose minimization barely moved tokens; context hygiene (stateless)
nearly halved them, driven by ~48% lower cache_read. Quality held.

Also fix the runner's success check: it grepped the word "veto" and matched
"No veto" → false failures; now matches the "## VETO" marker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 21:35:47 +00:00

4.1 KiB
Raw Blame History

Full-harness benchmark — prompt variants

Real agents.py up Builder/Adversary loop pair + watchdog, run autonomously through the multi-phase calculator (plans/calc/{lex,parse,eval}.md — 3 phases, 46 gates each) to SEQUENCE-COMPLETE. Engine pinned at 985d33d. Both loops on claude-sonnet-4-6. Per-variant timeout 3000s. Tokens summed from the Claude Code session transcripts of each loop's clone. N=1 per variant (the autonomous loop is nondeterministic — number of review rounds varies).

Variants

  • builder-adversary — original prompts.
  • builder-adversary-min — same rules, prose compressed to minimal tokens.
  • builder-adversary-stateless — min + context hygiene (compact at each checkpoint, read diffs not trees, spill bulk to files, adversary loads only {plan, STATUS, diff}); loop sessions non-resumed → fresh context per phase. Same AI-as-adversary verification.

All three succeeded

Every variant completed all three phases with the Adversary cold-verifying every gate (no veto), and the final calculator is correct in each:

version sequence-complete unittest 2+3*4 (2+3)*4 7/2 result
builder-adversary yes OK 14 20 3.5 PASS
builder-adversary-min yes OK 14 20 3.5 PASS
builder-adversary-stateless yes OK 14 20 3.5 PASS

(The original Adversary even filed a non-blocking advisory on lex — genuine adversarial review, not rubber-stamping.)

Static prompt size (chars: kickoff + role)

version builder adversary
builder-adversary 6389 5811
builder-adversary-min 1751 1644
builder-adversary-stateless 2430 2218

Tokens (from session transcripts)

version builder loop adversary loop total vs orig
builder-adversary 5,557,356 5,199,007 10,756,363
builder-adversary-min 5,350,953 4,768,272 10,119,225 5.9%
builder-adversary-stateless 2,834,505 3,130,324 5,964,829 44.5%

Breakdown (the dominant term is cache_read — re-reading the conversation each turn):

version role input output cache_create cache_read
builder-adversary builder 199 87,704 256,054 5,213,399
builder-adversary adversary 181 66,540 183,189 4,949,097
builder-adversary-min builder 216 54,381 254,524 5,041,832
builder-adversary-min adversary 213 58,838 202,916 4,506,305
builder-adversary-stateless builder 124 26,998 113,026 2,694,357
builder-adversary-stateless adversary 141 31,342 122,360 2,976,481

Findings

  1. Prompt-prose minimization barely moves tokens (5.9%). min cut the prompt to ~⅓ the size but saved almost nothing — because the role/kickoff prompt is a tiny, cached slice. Worth keeping for readability; not a token lever.
  2. Context hygiene nearly halves tokens (44.5%), quality intact. stateless produced the same correct, fully-verified calculator while cutting total tokens ~45%. The saving is dominated by cache_read falling ~48% (builder 5.21M→2.69M) — exactly the "don't carry/reload context you don't need" lever. It also cut output tokens (builder 87.7k→27.0k), i.e. less redundant regeneration across turns.
  3. The cost is the conversation, not the prompt. cache_read ≫ everything else in all three. Any real efficiency work should target carried/reloaded context (compaction cadence, fresh sessions per unit of work, diff-not-tree reads), not prompt wording.

N=1 caveat: the autonomous loop is nondeterministic, so some of the gap is run-to-run variance (review-round count, retries). The ~45% reduction is large and matches the mechanism (cache_read roughly halved), but repeating the run a few times would tighten the estimate.

Run dirs: /tmp/ao-harness-YIrsUp. (A prior auto-generated version mislabeled orig/stateless as failed — a bug in the harness success check grepping the word "veto" and matching "No veto"; fixed to match the ## VETO marker. Functionally all three passed, as verified above.)