agent-orchestrator-benchmark/RESULTS-harness.md
mfowler b46dca003c results: 4-way + the variance finding (N=1 is not enough)
Combined run 1 (orig/min/stateless) + run 2 (lean/stateless). Key result: the
SAME stateless variant used 5.96M tokens in run 1 and 9.27M in run 2 (±55%) —
nondeterministic iteration count dominates every between-variant gap. So:
- prose minimization ~-6% (small, same-invocation)
- lean (full per-gate review) ~= stateless (batched): full review is ~free
- the earlier "-45% from context hygiene" is NOT reproducible — mostly noise
Honest conclusion: need >=5 runs/variant to resolve the context-hygiene effect;
log_tokens now makes that easy to collect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 22:06:21 +00:00


# Full-harness benchmark — prompt variants (calculator, Sonnet)
Real `agents.py up` Builder/Adversary loop pair + watchdog, run **autonomously** through the
multi-phase calculator (`plans/calc/{lex,parse,eval}.md` — 3 phases, 46 gates each) to
`SEQUENCE-COMPLETE`. Both loops on **claude-sonnet-4-6**. Tokens summed from each loop's Claude Code
session transcripts.
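The token-summing step could look roughly like the sketch below. It is not the harness's actual code: it assumes each transcript is a JSONL file whose records may carry a `message.usage` object with integer counters whose names end in `_tokens`. The function name `sum_transcript_tokens` and the directory layout are hypothetical, and the report does not say which counters its totals include.

```python
import json
from pathlib import Path


def sum_transcript_tokens(transcript_dir):
    """Sum every integer `*_tokens` counter found in *.jsonl files in a directory.

    Assumption: records look like {"message": {"usage": {"input_tokens": 10, ...}}}.
    Records without a usage object are skipped.
    """
    total = 0
    for path in sorted(Path(transcript_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                message = record.get("message")
                usage = message.get("usage") if isinstance(message, dict) else None
                if not isinstance(usage, dict):
                    continue
                total += sum(
                    value
                    for key, value in usage.items()
                    if key.endswith("_tokens") and isinstance(value, int)
                )
    return total
```

Calling it once per loop (Builder, Adversary) and adding the two results would give a per-variant total comparable to the tables in this report, provided the assumed record shape matches the real transcripts.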
## ⚠️ Headline: run-to-run variance dominates (N=1 is not enough)
The **same** `stateless` variant, with identical prompts, was run twice:

| run | stateless total tokens |
|---|--:|
| run 1 | **5,964,829** |
| run 2 | **9,266,808** |

That's a **55% swing** (run 2 used 1.55× the tokens of run 1) from nondeterminism alone: how many
review rounds and retries the autonomous loop happens to do. It is **larger than every difference
between variants below.** So treat all single-run deltas as suggestive at best. The honest
conclusion is that **N=1 cannot resolve the variant effects**; you'd need several runs per variant
and a comparison of medians.
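The 55% figure is the increase from run 1 to run 2. A short check of that arithmetic, using only the two totals in the table above, also shows how the same gap looks when expressed as a symmetric spread about the mean of the two runs (variable names are illustrative):

```python
# Same-variant totals from the table above (stateless, run 1 and run 2).
run1 = 5_964_829
run2 = 9_266_808

ratio = run2 / run1                               # how many times larger run 2 is
increase_pct = (ratio - 1) * 100                  # increase relative to run 1
mean = (run1 + run2) / 2
half_range_pct = (run2 - run1) / 2 / mean * 100   # symmetric spread about the mean

print(f"run 2 / run 1     = {ratio:.2f}x")
print(f"increase vs run 1 = +{increase_pct:.1f}%")
print(f"spread about mean = +/-{half_range_pct:.1f}%")
```

Either way of stating it, two samples only bound the spread from below; they say nothing about the shape of the distribution.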
## All data points
| run | variant | builder | adversary | **total** |
|---|---|--:|--:|--:|
| 1 | builder-adversary (orig) | 5,557,356 | 5,199,007 | **10,756,363** |
| 1 | builder-adversary-min | 5,350,953 | 4,768,272 | **10,119,225** |
| 1 | builder-adversary-stateless | 2,834,505 | 3,130,324 | **5,964,829** |
| 2 | builder-adversary-lean | 4,050,402 | 5,086,052 | **9,136,454** |
| 2 | builder-adversary-stateless | 4,606,579 | 4,660,229 | **9,266,808** |

All five **succeeded**: each built a correct calculator (`2+3*4→14`, `(2+3)*4→20`, `7/2→3.5`) with
full test suites green, every gate Adversary-verified, and no veto. (Engine: run 1 @ `985d33d`,
run 2 @ `e0425e6`. The orig/min/stateless prompts are byte-identical across both; run 2 only adds
`lean`.)
## What we can and can't say
**Valid (same-invocation) comparisons:**

- **Run 1, prose minimization:** orig 10.76M vs min 10.12M → **-5.9%**. Small; consistent with "the
  prompt is a tiny cached slice." Probably real but minor.
- **Run 2, full per-gate review vs batched, both with context hygiene:** lean 9.14M vs stateless
  9.27M → **essentially tied (-1.4%)**. So **enforcing one claim + one independent verdict per gate
  did NOT cost more tokens** than letting the loop batch. That answers the question directly: you
  can keep full review granularity without a token penalty (in this run).

**NOT supported:**

- The earlier "context hygiene halves tokens (-45%)" claim from run 1 is **not reproducible**:
  stateless's *own* second run (9.27M) lands right next to orig/min/lean. The -45% was mostly a
  lucky low-iteration run, not the context discipline. Context hygiene may still help, but this
  benchmark can't prove it at N=1.
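The percentages quoted in this section can be re-derived from the totals in the "All data points" table. The sketch below does that; the helper name `pct_delta` and the short dictionary keys are illustrative, not part of the harness.

```python
# Totals copied from the "All data points" table, keyed by (run, variant).
totals = {
    (1, "orig"): 10_756_363,
    (1, "min"): 10_119_225,
    (1, "stateless"): 5_964_829,
    (2, "lean"): 9_136_454,
    (2, "stateless"): 9_266_808,
}


def pct_delta(new, base):
    """Signed percentage change of `new` relative to `base`."""
    return (new - base) / base * 100


comparisons = {
    "run 1: min vs orig": pct_delta(totals[(1, "min")], totals[(1, "orig")]),
    "run 2: lean vs stateless": pct_delta(totals[(2, "lean")], totals[(2, "stateless")]),
    "run 1: stateless vs orig": pct_delta(totals[(1, "stateless")], totals[(1, "orig")]),
}

for label, delta in comparisons.items():
    print(f"{label}: {delta:+.1f}%")
```

The first two are same-invocation comparisons; the third is the earlier context-hygiene claim that the second stateless run failed to reproduce.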
## Findings
1. **The dominant variable is nondeterministic iteration count, not the prompt variant.** The 55%
   same-variant swing is larger than any between-variant gap.
2. **Prose size barely matters** (-6%, run 1): keep minimal prompts for readability, not tokens.
3. **Full per-gate review is ~free vs batching** (run 2): granular adversarial scrutiny didn't
   raise the bill, so prefer it for quality.
4. **To actually measure context hygiene you need a campaign:** ≥5 runs per variant, comparing
   medians / distributions. The new `log_tokens` harness feature makes that cheap to collect.

_Run dirs: run 1 `/tmp/ao-harness-YIrsUp`, run 2 `/tmp/ao-harness-TMDfvk`. N=1 per (run, variant)._
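As a rough sketch of the campaign analysis proposed in finding 4, the snippet below groups totals by variant, reports the median and range, and flags variants that are still below the 5-run threshold. It uses only the totals already reported here; the function name `summarize` and the input shape are hypothetical, and how `log_tokens` actually stores its data is not described in this file.

```python
from statistics import median

MIN_RUNS = 5  # threshold taken from finding 4


def summarize(totals_by_variant, min_runs=MIN_RUNS):
    """Return per-variant n, median, min, max, and whether n reaches min_runs."""
    summary = {}
    for variant, totals in totals_by_variant.items():
        summary[variant] = {
            "n": len(totals),
            "median": median(totals),
            "min": min(totals),
            "max": max(totals),
            "enough_runs": len(totals) >= min_runs,
        }
    return summary


# Every total observed so far, pooled across run 1 and run 2.
observed = {
    "orig": [10_756_363],
    "min": [10_119_225],
    "lean": [9_136_454],
    "stateless": [5_964_829, 9_266_808],
}

for variant, stats in summarize(observed).items():
    status = "ok" if stats["enough_runs"] else f"need {MIN_RUNS - stats['n']} more"
    print(f"{variant:<10} n={stats['n']} median={stats['median']:,.0f} "
          f"range={stats['min']:,}..{stats['max']:,} ({status})")
```

With the current data every variant is flagged as under-sampled, which is the point of the finding: the medians printed here are placeholders until more runs are collected.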