Files
agent-orchestrator-benchmark/RESULTS-harness.md
mfowler 0fa3d726a5 results: full-harness 3-way (orig/min/stateless) on the calculator
All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full
adversary verification (calc correct, tests OK). Tokens:
  orig       10,756,363
  min        10,119,225  (-5.9%)
  stateless   5,964,829  (-44.5%)
Prompt-prose minimization barely moved tokens; context hygiene (stateless)
nearly halved them, driven by ~48% lower cache_read. Quality held.

Also fix the runner's success check: it grepped the word "veto" and matched
"No veto" → false failures; now matches the "## VETO" marker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 21:35:47 +00:00

79 lines
4.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Full-harness benchmark — prompt variants
Real `agents.py up` Builder/Adversary loop pair + watchdog, run **autonomously** through the
multi-phase calculator (`plans/calc/{lex,parse,eval}.md` — 3 phases, 46 gates each) to
`SEQUENCE-COMPLETE`. Engine pinned at `985d33d`. Both loops on **claude-sonnet-4-6**. Per-variant
timeout 3000s. Tokens summed from the Claude Code session transcripts of each loop's clone. **N=1**
per variant (the autonomous loop is nondeterministic — number of review rounds varies).
## Variants
- **builder-adversary** — original prompts.
- **builder-adversary-min** — same rules, prose compressed to minimal tokens.
- **builder-adversary-stateless** — min + **context hygiene** (compact at each checkpoint, read diffs
not trees, spill bulk to files, adversary loads only {plan, STATUS, diff}); loop sessions
non-resumed → fresh context per phase. Same AI-as-adversary verification.
## All three succeeded
Every variant completed all three phases with the Adversary cold-verifying every gate (no veto), and
the final calculator is correct in each:
| version | sequence-complete | unittest | `2+3*4` | `(2+3)*4` | `7/2` | result |
|---|:--:|:--:|:--:|:--:|:--:|:--:|
| builder-adversary | yes | OK | 14 | 20 | 3.5 | PASS |
| builder-adversary-min | yes | OK | 14 | 20 | 3.5 | PASS |
| builder-adversary-stateless | yes | OK | 14 | 20 | 3.5 | PASS |
(The original Adversary even filed a non-blocking advisory on `lex` — genuine adversarial review, not
rubber-stamping.)
## Static prompt size (chars: kickoff + role)
| version | builder | adversary |
|---|--:|--:|
| builder-adversary | 6389 | 5811 |
| builder-adversary-min | 1751 | 1644 |
| builder-adversary-stateless | 2430 | 2218 |
## Tokens (from session transcripts)
| version | builder loop | adversary loop | **total** | vs orig |
|---|--:|--:|--:|--:|
| builder-adversary | 5,557,356 | 5,199,007 | **10,756,363** | — |
| builder-adversary-min | 5,350,953 | 4,768,272 | **10,119,225** | 5.9% |
| builder-adversary-stateless | 2,834,505 | 3,130,324 | **5,964,829** | **44.5%** |
Breakdown (the dominant term is **cache_read** — re-reading the conversation each turn):
| version | role | input | output | cache_create | cache_read |
|---|---|--:|--:|--:|--:|
| builder-adversary | builder | 199 | 87,704 | 256,054 | 5,213,399 |
| builder-adversary | adversary | 181 | 66,540 | 183,189 | 4,949,097 |
| builder-adversary-min | builder | 216 | 54,381 | 254,524 | 5,041,832 |
| builder-adversary-min | adversary | 213 | 58,838 | 202,916 | 4,506,305 |
| builder-adversary-stateless | builder | 124 | 26,998 | 113,026 | 2,694,357 |
| builder-adversary-stateless | adversary | 141 | 31,342 | 122,360 | 2,976,481 |
## Findings
1. **Prompt-prose minimization barely moves tokens (5.9%).** `min` cut the prompt to ~⅓ the size
but saved almost nothing — because the role/kickoff prompt is a tiny, cached slice. Worth keeping
for readability; not a token lever.
2. **Context hygiene nearly halves tokens (44.5%), quality intact.** `stateless` produced the same
correct, fully-verified calculator while cutting total tokens ~45%. The saving is dominated by
**cache_read** falling ~48% (builder 5.21M→2.69M) — exactly the "don't carry/reload context you
don't need" lever. It also cut output tokens (builder 87.7k→27.0k), i.e. less redundant
regeneration across turns.
3. **The cost is the conversation, not the prompt.** cache_read ≫ everything else in all three. Any
real efficiency work should target carried/reloaded context (compaction cadence, fresh sessions
per unit of work, diff-not-tree reads), not prompt wording.
> N=1 caveat: the autonomous loop is nondeterministic, so some of the gap is run-to-run variance
> (review-round count, retries). The ~45% reduction is large and matches the mechanism (cache_read
> roughly halved), but repeating the run a few times would tighten the estimate.
_Run dirs: `/tmp/ao-harness-YIrsUp`. (A prior auto-generated version mislabeled `orig`/`stateless` as
failed — a bug in the harness success check grepping the word "veto" and matching "No veto"; fixed to
match the `## VETO` marker. Functionally all three passed, as verified above.)_