agent-orchestrator-benchmark/RESULTS.md

# Benchmark results — original vs minimal prompts

Engine pinned at: `737ef81`. Task:
`plans/roman.md` (integer → Roman numeral). Model: **claude-sonnet-4-6** for Builder and Adversary in
both versions. Runs are independent (separate headless `claude -p` sessions, no shared
context). Methodology + caveats: see `run-bench.sh` header and the note below.

## Static prompt size (chars: kickoff + role, what gets sent each kickoff)

| version | builder prompt | adversary prompt |
|---|--:|--:|
| builder-adversary (orig) | 6389 | 5811 |
| builder-adversary-min    | 1751 | 1644 |

## Per-run tokens & cost

### builder-adversary
- **success:** YES  (tests=yes, cli=yes, adversary-verdict=PASS)
- **builder:** in=21 out=4007 cache_create=14460 cache_read=526213 → 544701 tok, $0.3073279, turns=21
- **adversary:** in=14 out=3245 cache_create=14930 cache_read=331897 → 350086 tok, $0.24022810000000003, turns=17
- **total:** 894787 tokens, $0.5476

### builder-adversary-min
- **success:** YES  (tests=yes, cli=yes, adversary-verdict=PASS)
- **builder:** in=20 out=4257 cache_create=13183 cache_read=477142 → 494602 tok, $0.28740659999999996, turns=18
- **adversary:** in=16 out=4545 cache_create=14792 cache_read=378787 → 398140 tok, $0.2718171000000001, turns=16
- **total:** 892742 tokens, $0.5592

## Summary

| version | success | total tokens | total cost |
|---|:--:|--:|--:|
| builder-adversary (orig) | YES | 894787 | $0.5476 |
| builder-adversary-min    | YES | 892742 | $0.5592 |

> Note: each `claude -p` call carries a fixed ~24k-token cached Claude Code system-prompt +
> tool-schema overhead, and most tokens come from the agentic work itself (reading the plan,
> writing/running code, tool results). The role/kickoff prompt is a small slice — so the
> headline token totals are close; the minimisation shows up in the static prompt size above
> and the (smaller) input/cache-creation portion. This bench is a single controlled pass per
> version (N=1; expect run-to-run variance); it exercises task effectiveness + prompt cost,
> NOT the live watchdog loop / handoff machinery (that needs a full `agents.py up` run).

_Work dirs for this run: `/tmp/ao-benchmark.CwQFWF`_