Files
agent-orchestrator-benchmark/RESULTS.md
mfowler 8c3f38dbf4 feat: multi-phase calculator problem + full-harness benchmark runner
- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per
  phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial
  edge cases (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real agents.py up Builder/Adversary loop pair
  + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and
  clocks tokens from the session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token
  effect; cache-read of the working context dominates)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:40:14 +00:00

45 lines
2.1 KiB
Markdown

# Benchmark results — original vs minimal prompts
Engine pinned at: `737ef81`. Task:
`plans/roman.md` (integer → Roman numeral). Model: **claude-sonnet-4-6** for Builder and Adversary in
both versions. Runs are independent (separate headless `claude -p` sessions, no shared
context). Methodology + caveats: see `run-bench.sh` header and the note below.
## Static prompt size (chars: kickoff + role, what gets sent each kickoff)
| version | builder prompt | adversary prompt |
|---|--:|--:|
| builder-adversary (orig) | 6389 | 5811 |
| builder-adversary-min | 1751 | 1644 |
## Per-run tokens & cost
### builder-adversary
- **success:** YES (tests=yes, cli=yes, adversary-verdict=PASS)
- **builder:** in=21 out=4007 cache_create=14460 cache_read=526213 → 544701 tok, $0.3073279, turns=21
- **adversary:** in=14 out=3245 cache_create=14930 cache_read=331897 → 350086 tok, $0.24022810000000003, turns=17
- **total:** 894787 tokens, $0.5476
### builder-adversary-min
- **success:** YES (tests=yes, cli=yes, adversary-verdict=PASS)
- **builder:** in=20 out=4257 cache_create=13183 cache_read=477142 → 494602 tok, $0.28740659999999996, turns=18
- **adversary:** in=16 out=4545 cache_create=14792 cache_read=378787 → 398140 tok, $0.2718171000000001, turns=16
- **total:** 892742 tokens, $0.5592
## Summary
| version | success | total tokens | total cost |
|---|:--:|--:|--:|
| builder-adversary (orig) | YES | 894787 | $0.5476 |
| builder-adversary-min | YES | 892742 | $0.5592 |
> Note: each `claude -p` call carries a fixed ~24k-token cached Claude Code system-prompt +
> tool-schema overhead, and most tokens come from the agentic work itself (reading the plan,
> writing/running code, tool results). The role/kickoff prompt is a small slice — so the
> headline token totals are close; the minimisation shows up in the static prompt size above
> and the (smaller) input/cache-creation portion. This bench is a single controlled pass per
> version (N=1; expect run-to-run variance); it exercises task effectiveness + prompt cost,
> NOT the live watchdog loop / handoff machinery (that needs a full `agents.py up` run).
_Work dirs for this run: `/tmp/ao-benchmark.CwQFWF`_