All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full adversary verification (calc correct, tests OK).

Tokens:
- orig 10,756,363
- min 10,119,225 (-5.9%)
- stateless 5,964,829 (-44.5%)

Prompt-prose minimization barely moved tokens; context hygiene (stateless) nearly halved them, driven by ~48% lower cache_read. Quality held.

Also fix the runner's success check: it grepped the word "veto" and matched "No veto" → false failures; now matches the "## VETO" marker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Full-harness benchmark — prompt variants
A real `agents.py up` Builder/Adversary loop pair plus watchdog, run autonomously through the
multi-phase calculator (plans/calc/{lex,parse,eval}.md: 3 phases, 4–6 gates each) to
SEQUENCE-COMPLETE. Engine pinned at 985d33d. Both loops ran on claude-sonnet-4-6. Per-variant
timeout: 3000s. Tokens are summed from the Claude Code session transcripts of each loop's clone.
N=1 per variant (the autonomous loop is nondeterministic: the number of review rounds varies).
Variants
- builder-adversary — original prompts.
- builder-adversary-min — same rules, prose compressed to minimal tokens.
- builder-adversary-stateless — min + context hygiene (compact at each checkpoint, read diffs not trees, spill bulk to files, adversary loads only {plan, STATUS, diff}); loop sessions non-resumed → fresh context per phase. Same AI-as-adversary verification.
All three succeeded
Every variant completed all three phases with the Adversary cold-verifying every gate (no veto), and the final calculator is correct in each:
| version | sequence-complete | unittest | 2+3*4 | (2+3)*4 | 7/2 | result |
|---|---|---|---|---|---|---|
| builder-adversary | yes | OK | 14 | 20 | 3.5 | PASS |
| builder-adversary-min | yes | OK | 14 | 20 | 3.5 | PASS |
| builder-adversary-stateless | yes | OK | 14 | 20 | 3.5 | PASS |
(The original Adversary even filed a non-blocking advisory on lex — genuine adversarial review, not
rubber-stamping.)
Static prompt size (chars: kickoff + role)
| version | builder | adversary |
|---|---|---|
| builder-adversary | 6389 | 5811 |
| builder-adversary-min | 1751 | 1644 |
| builder-adversary-stateless | 2430 | 2218 |
Tokens (from session transcripts)
| version | builder loop | adversary loop | total | vs orig |
|---|---|---|---|---|
| builder-adversary | 5,557,356 | 5,199,007 | 10,756,363 | — |
| builder-adversary-min | 5,350,953 | 4,768,272 | 10,119,225 | −5.9% |
| builder-adversary-stateless | 2,834,505 | 3,130,324 | 5,964,829 | −44.5% |
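
The "vs orig" column can be recomputed from the per-loop totals in the table above. The snippet below is a small sketch of that arithmetic; the names `LOOP_TOKENS` and `pct_vs_baseline` are illustrative and not part of the harness.

```python
# Per-loop token totals copied from the table above: (builder loop, adversary loop).
LOOP_TOKENS = {
    "builder-adversary": (5_557_356, 5_199_007),
    "builder-adversary-min": (5_350_953, 4_768_272),
    "builder-adversary-stateless": (2_834_505, 3_130_324),
}


def pct_vs_baseline(tokens, baseline="builder-adversary"):
    """Return each variant's total as a percent change against the baseline total."""
    base_total = sum(tokens[baseline])
    return {
        name: round(100.0 * (sum(pair) - base_total) / base_total, 1)
        for name, pair in tokens.items()
    }


deltas = pct_vs_baseline(LOOP_TOKENS)
for name, delta in deltas.items():
    print(f"{name}: total={sum(LOOP_TOKENS[name]):,} vs orig={delta:+.1f}%")
```
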
Breakdown (the dominant term is cache_read — re-reading the conversation each turn):
| version | role | input | output | cache_create | cache_read |
|---|---|---|---|---|---|
| builder-adversary | builder | 199 | 87,704 | 256,054 | 5,213,399 |
| builder-adversary | adversary | 181 | 66,540 | 183,189 | 4,949,097 |
| builder-adversary-min | builder | 216 | 54,381 | 254,524 | 5,041,832 |
| builder-adversary-min | adversary | 213 | 58,838 | 202,916 | 4,506,305 |
| builder-adversary-stateless | builder | 124 | 26,998 | 113,026 | 2,694,357 |
| builder-adversary-stateless | adversary | 141 | 31,342 | 122,360 | 2,976,481 |
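
As a rough illustration of how the four columns above could be summed from a transcript, here is a minimal sketch. It assumes each transcript is JSON Lines and that assistant records carry a `message.usage` object; that layout is an assumption, not something the harness documents here. The usage field names are the ones the Anthropic Messages API reports. `sum_usage` and the sample records are hypothetical.

```python
import json

# Column name -> usage field. Field names follow the Anthropic Messages API usage
# object; their location under record["message"]["usage"] is an assumption.
FIELDS = {
    "input": "input_tokens",
    "output": "output_tokens",
    "cache_create": "cache_creation_input_tokens",
    "cache_read": "cache_read_input_tokens",
}


def sum_usage(lines):
    """Sum token usage over an iterable of JSONL transcript lines."""
    totals = {column: 0 for column in FIELDS}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        message = record.get("message") if isinstance(record, dict) else None
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue  # records without usage (e.g. user turns) contribute nothing
        for column, field in FIELDS.items():
            totals[column] += int(usage.get(field) or 0)
    totals["total"] = sum(totals[column] for column in FIELDS)
    return totals


# Synthetic records for illustration only; these are not data from the benchmark.
sample = [
    json.dumps({"message": {"usage": {"input_tokens": 3, "output_tokens": 120,
                                      "cache_creation_input_tokens": 900,
                                      "cache_read_input_tokens": 15000}}}),
    json.dumps({"message": {"content": "no usage on this record"}}),
    "not json",
    json.dumps({"message": {"usage": {"input_tokens": 2, "output_tokens": 80,
                                      "cache_read_input_tokens": 16020}}}),
]
print(sum_usage(sample))
```

With a real transcript, the same function could be fed an open file object, since iterating a text file yields its lines.
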
Findings
- Prompt-prose minimization barely moves tokens (−5.9%).
  `min` cut the static prompt to about 28% of its original size (12,200 → 3,395 chars across both roles) but saved almost nothing, because the role/kickoff prompt is a tiny, cached slice. Worth keeping for readability; not a token lever.
- Context hygiene nearly halves tokens (−44.5%), quality intact.
  `stateless` produced the same correct, fully verified calculator while cutting total tokens ~45%. The saving is dominated by cache_read, which fell ~48% for the builder (5.21M→2.69M) and ~44% across both loops (10.16M→5.67M): exactly the "don't carry/reload context you don't need" lever. It also cut output tokens (builder 87.7k→27.0k), i.e. less redundant regeneration across turns.
- The cost is the conversation, not the prompt.
  cache_read ≫ everything else in all three. Any real efficiency work should target carried/reloaded context (compaction cadence, fresh sessions per unit of work, diff-not-tree reads), not prompt wording.
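
To see why carried context dominates, consider a deliberately simplified toy model (an assumption for illustration, not a measurement): every turn appends a fixed number of tokens, each turn re-reads everything already in its session from cache, and context resets at session boundaries. Under that model cumulative cache_read grows quadratically with session length, so splitting the same number of turns into fresh sessions cuts it sharply. The function name and the numbers are hypothetical.

```python
def cumulative_cache_read(turns, tokens_per_turn, sessions=1):
    """Toy model: total tokens re-read from cache over a run.

    Turn i of a session re-reads the i earlier turns' tokens, so a session of
    n turns re-reads tokens_per_turn * n * (n - 1) / 2 tokens in total.
    Turns are split as evenly as possible across sessions.
    """
    base, extra = divmod(turns, sessions)
    total = 0
    for s in range(sessions):
        n = base + (1 if s < extra else 0)
        total += tokens_per_turn * n * (n - 1) // 2
    return total


# Hypothetical run: 90 turns of 1,000 tokens each.
one_session = cumulative_cache_read(90, 1_000, sessions=1)
three_sessions = cumulative_cache_read(90, 1_000, sessions=3)
print(one_session, three_sessions, round(three_sessions / one_session, 3))
```

The model ignores the fixed context every session must reload (prompts, plan, files), so it is not expected to reproduce the measured figures; it only shows the shape of the effect.
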
N=1 caveat: the autonomous loop is nondeterministic, so some of the gap is run-to-run variance (review-round count, retries). The ~45% reduction is large and matches the mechanism (cache_read roughly halved), but repeating the run a few times would tighten the estimate.
Run dirs: /tmp/ao-harness-YIrsUp. (A prior auto-generated version mislabeled orig/stateless as
failed — a bug in the harness success check grepping the word "veto" and matching "No veto"; fixed to
match the ## VETO marker. Functionally all three passed, as verified above.)
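
The difference between the two checks can be sketched as follows. This is an illustration of the idea only, assuming the marker appears at the start of a line; the function names are hypothetical and the harness's actual check may be implemented differently.

```python
import re

# Match the marker only at the start of a line (assumed placement of "## VETO").
VETO_MARKER = re.compile(r"^## VETO\b", re.MULTILINE)


def naive_has_veto(report):
    """Old behaviour in spirit: any occurrence of the word counts as a veto."""
    return "veto" in report


def has_veto(report):
    """Fixed behaviour in spirit: only an explicit marker line counts."""
    return VETO_MARKER.search(report) is not None


passing_report = "Gate 3 verified against the plan. No veto.\n"
vetoed_report = "Review of gate 2\n## VETO\nTokenizer drops trailing operators.\n"

print(naive_has_veto(passing_report), has_veto(passing_report))
print(naive_has_veto(vetoed_report), has_veto(vetoed_report))
```
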