# Benchmark results: original vs minimal prompts

Engine pinned at: `737ef81`. Task: `plans/roman.md` (integer → Roman numeral). Model: **claude-sonnet-4-6** for Builder and Adversary in both versions. Runs are independent (separate headless `claude -p` sessions, no shared context). Methodology + caveats: see `run-bench.sh` header and the note below.

## Static prompt size (chars: kickoff + role, what gets sent each kickoff)

| version | builder prompt | adversary prompt |
|---|--:|--:|
| builder-adversary (orig) | 6389 | 5811 |
| builder-adversary-min | 1751 | 1644 |

## Per-run tokens & cost

### builder-adversary

- **success:** YES (tests=yes, cli=yes, adversary-verdict=PASS)
- **builder:** in=21 out=4007 cache_create=14460 cache_read=526213 → 544701 tok, $0.3073279, turns=21
- **adversary:** in=14 out=3245 cache_create=14930 cache_read=331897 → 350086 tok, $0.2402281, turns=17
- **total:** 894787 tokens, $0.5476

### builder-adversary-min

- **success:** YES (tests=yes, cli=yes, adversary-verdict=PASS)
- **builder:** in=20 out=4257 cache_create=13183 cache_read=477142 → 494602 tok, $0.2874066, turns=18
- **adversary:** in=16 out=4545 cache_create=14792 cache_read=378787 → 398140 tok, $0.2718171, turns=16
- **total:** 892742 tokens, $0.5592

## Summary

| version | success | total tokens | total cost |
|---|:--:|--:|--:|
| builder-adversary (orig) | YES | 894787 | $0.5476 |
| builder-adversary-min | YES | 892742 | $0.5592 |

> Note: each `claude -p` call carries a fixed ~24k-token cached Claude Code system-prompt +
> tool-schema overhead, and most tokens come from the agentic work itself (reading the plan,
> writing/running code, tool results). The role/kickoff prompt is a small slice, so the
> headline token totals are close; the minimisation shows up in the static prompt size above
> and the (smaller) input/cache-creation portion.
>
> This bench is a single controlled pass per version (N=1; expect run-to-run variance); it
> exercises task effectiveness + prompt cost, NOT the live watchdog loop / handoff machinery
> (that needs a full `agents.py up` run).

_Work dirs for this run: `/tmp/ao-benchmark.CwQFWF`_
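
The totals and the size of the gap between versions can be re-derived from the per-run fields reported above. The following is a minimal Python sketch, assuming each role's token total is the plain sum of its four counters (`in`, `out`, `cache_create`, `cache_read`), which matches the reported figures. The names `RUNS`, `PROMPT_CHARS`, `version_totals` and `pct_change` are illustrative and are not part of `run-bench.sh`.

```python
# Hypothetical helper (not part of run-bench.sh): re-derives the reported
# totals and the relative change between the two prompt versions.

TOKEN_FIELDS = ("in", "out", "cache_create", "cache_read")

# Per-role usage copied from the "Per-run tokens & cost" section.
RUNS = {
    "builder-adversary": {
        "builder": {"in": 21, "out": 4007, "cache_create": 14460,
                    "cache_read": 526213, "cost": 0.3073279},
        "adversary": {"in": 14, "out": 3245, "cache_create": 14930,
                      "cache_read": 331897, "cost": 0.2402281},
    },
    "builder-adversary-min": {
        "builder": {"in": 20, "out": 4257, "cache_create": 13183,
                    "cache_read": 477142, "cost": 0.2874066},
        "adversary": {"in": 16, "out": 4545, "cache_create": 14792,
                      "cache_read": 378787, "cost": 0.2718171},
    },
}

# Static prompt sizes in characters (builder, adversary).
PROMPT_CHARS = {
    "builder-adversary": (6389, 5811),
    "builder-adversary-min": (1751, 1644),
}


def role_tokens(usage):
    """Sum the four token counters for one role."""
    return sum(usage[field] for field in TOKEN_FIELDS)


def version_totals(roles):
    """Return (total tokens, total cost rounded to 4 decimals) for a version."""
    tokens = sum(role_tokens(usage) for usage in roles.values())
    cost = sum(usage["cost"] for usage in roles.values())
    return tokens, round(cost, 4)


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


for name, roles in RUNS.items():
    print(name, version_totals(roles))

orig_tokens, orig_cost = version_totals(RUNS["builder-adversary"])
min_tokens, min_cost = version_totals(RUNS["builder-adversary-min"])
print(f"tokens: {pct_change(orig_tokens, min_tokens):+.2f}%")
print(f"cost:   {pct_change(orig_cost, min_cost):+.2f}%")

for role, old, new in zip(("builder", "adversary"),
                          PROMPT_CHARS["builder-adversary"],
                          PROMPT_CHARS["builder-adversary-min"]):
    print(f"{role} prompt chars: {pct_change(old, new):+.2f}%")
```

Run against these figures, the static prompts shrink by roughly 70% while total tokens and cost move by only a few percent at most. With N=1 per version, differences of that size may well fall within run-to-run variance.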