- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per
phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial
edge cases (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real Builder/Adversary loop pair (via agents.py up)
+ watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and
clocks tokens from the session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token
effect; cache-read of the working context dominates)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Benchmark results — original vs minimal prompts
Engine pinned at: 737ef81. Task: plans/roman.md (integer → Roman numeral). Model: claude-sonnet-4-6 for Builder and Adversary in both versions. Runs are independent (separate headless claude -p sessions, no shared context). Methodology + caveats: see run-bench.sh header and the note below.
Static prompt size (chars: kickoff + role, what gets sent each kickoff)
| version | builder prompt | adversary prompt |
|---|---|---|
| builder-adversary (orig) | 6389 | 5811 |
| builder-adversary-min | 1751 | 1644 |
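As a rough illustration of how much the minimal prompts shrink, the character counts in the table can be turned into a percentage reduction. This is only a sketch over the figures above; the helper name `reduction_pct` and the dictionary layout are assumptions made for the example, not part of the bench scripts.

```python
# Static prompt sizes in characters, copied from the table above.
prompt_chars = {
    "builder": {"orig": 6389, "min": 1751},
    "adversary": {"orig": 5811, "min": 1644},
}


def reduction_pct(orig: int, minimal: int) -> float:
    """Percentage by which the minimal prompt is shorter than the original."""
    return 100.0 * (orig - minimal) / orig


for role, sizes in prompt_chars.items():
    pct = reduction_pct(sizes["orig"], sizes["min"])
    print(f"{role}: {pct:.1f}% smaller")
```

Both prompts come out a little over 70% shorter in characters.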
Per-run tokens & cost
builder-adversary
- success: YES (tests=yes, cli=yes, adversary-verdict=PASS)
- builder: in=21 out=4007 cache_create=14460 cache_read=526213 → 544701 tok, $0.3073279, turns=21
- adversary: in=14 out=3245 cache_create=14930 cache_read=331897 → 350086 tok, $0.2402281, turns=17
- total: 894787 tokens, $0.5476
builder-adversary-min
- success: YES (tests=yes, cli=yes, adversary-verdict=PASS)
- builder: in=20 out=4257 cache_create=13183 cache_read=477142 → 494602 tok, $0.2874066, turns=18
- adversary: in=16 out=4545 cache_create=14792 cache_read=378787 → 398140 tok, $0.2718171, turns=16
- total: 892742 tokens, $0.5592
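The per-role token figures above appear to be the plain sum of the four counters (input, output, cache creation, cache read). The sketch below re-derives those totals and the share contributed by cache reads. The `Usage` class and its field names are hypothetical, chosen for this example only; the numbers are copied from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Token counters for one role in one run (hypothetical container)."""
    input: int
    output: int
    cache_create: int
    cache_read: int

    @property
    def total(self) -> int:
        # Sum of all four counters, matching the "→ N tok" figures above.
        return self.input + self.output + self.cache_create + self.cache_read

    @property
    def cache_read_share(self) -> float:
        # Fraction of the total that is cache-read tokens.
        return self.cache_read / self.total


runs = {
    ("builder-adversary", "builder"): Usage(21, 4007, 14460, 526213),
    ("builder-adversary", "adversary"): Usage(14, 3245, 14930, 331897),
    ("builder-adversary-min", "builder"): Usage(20, 4257, 13183, 477142),
    ("builder-adversary-min", "adversary"): Usage(16, 4545, 14792, 378787),
}

for (version, role), usage in runs.items():
    print(f"{version:22s} {role:9s} total={usage.total:7d} "
          f"cache_read={usage.cache_read_share:.1%}")
```

In all four rows cache reads make up well over 90% of the total.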
Summary
| version | success | total tokens | total cost |
|---|---|---|---|
| builder-adversary (orig) | YES | 894787 | $0.5476 |
| builder-adversary-min | YES | 892742 | $0.5592 |
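To put the two summary rows side by side, the relative change from the original to the minimal version can be computed directly from the table. This is a minimal sketch; `relative_change_pct` and the `summary` dictionary are illustrative names, and the inputs are the rounded figures shown above.

```python
# Totals copied from the summary table above.
summary = {
    "builder-adversary (orig)": {"tokens": 894787, "cost_usd": 0.5476},
    "builder-adversary-min": {"tokens": 892742, "cost_usd": 0.5592},
}


def relative_change_pct(baseline: float, variant: float) -> float:
    """Signed percentage change of variant relative to baseline."""
    return 100.0 * (variant - baseline) / baseline


orig = summary["builder-adversary (orig)"]
mini = summary["builder-adversary-min"]

token_delta = relative_change_pct(orig["tokens"], mini["tokens"])
cost_delta = relative_change_pct(orig["cost_usd"], mini["cost_usd"])

print(f"tokens: {token_delta:+.2f}%")
print(f"cost:   {cost_delta:+.2f}%")
```

Tokens drop by roughly 0.2% while cost rises by roughly 2%. With a single pass per version, both differences are small enough that run-to-run variance could plausibly account for them.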
Note: each claude -p call carries a fixed ~24k-token cached Claude Code system-prompt + tool-schema overhead, and most tokens come from the agentic work itself (reading the plan, writing/running code, tool results). The role/kickoff prompt is a small slice, so the headline token totals are close (894787 vs 892742, about 0.2% apart); the minimisation shows up in the static prompt size above and in the (smaller) input/cache-creation portion. This bench is a single controlled pass per version (N=1; expect run-to-run variance). It exercises task effectiveness + prompt cost, NOT the live watchdog loop / handoff machinery (that needs a full agents.py up run).
Work dirs for this run: /tmp/ao-benchmark.CwQFWF