Combined run 1 (orig/min/stateless) + run 2 (lean/stateless).

Key result: the SAME stateless variant used 5.96M tokens in run 1 and 9.27M in run 2 (±55%); nondeterministic iteration count dominates every between-variant gap. So:
- prose minimization ~-6% (small, same-invocation)
- lean (full per-gate review) ~= stateless (batched): full review is ~free
- the earlier "-45% from context hygiene" is NOT reproducible, mostly noise

Honest conclusion: need >=5 runs/variant to resolve the context-hygiene effect; log_tokens now makes that easy to collect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Full-harness benchmark — prompt variants (calculator, Sonnet)
A real `agents.py up` Builder/Adversary loop pair plus watchdog ran autonomously through the
multi-phase calculator (plans/calc/{lex,parse,eval}.md; 3 phases, 4–6 gates each) to
SEQUENCE-COMPLETE. Both loops ran on claude-sonnet-4-6. Token counts are summed from each loop's
Claude Code session transcripts.
⚠️ Headline: run-to-run variance dominates (N=1 is not enough)
The same stateless variant, identical prompts, was run twice:
| run | stateless total tokens |
|---|---|
| run 1 | 5,964,829 |
| run 2 | 9,266,808 |
That is a swing of roughly 55% (run 2 used about 1.55× the tokens of run 1) from nondeterminism alone: how many review rounds and retries the autonomous loop happens to do. It is larger than every difference between variants below. So treat all single-run deltas as suggestive at best. The honest conclusion is that N=1 cannot resolve the variant effects; that would take several runs per variant and a comparison of medians.
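The swing can be reproduced directly from the two totals in the table. The sketch below is only an arithmetic check; the helper name `relative_change_pct` is made up for illustration and is not part of the harness. It also shows that the figure depends on which run is taken as the baseline.

```python
# Totals copied from the stateless table above.
run1_total = 5_964_829
run2_total = 9_266_808


def relative_change_pct(baseline: int, other: int) -> float:
    """Return the change from baseline to other, as a percentage of baseline."""
    return (other - baseline) / baseline * 100.0


up_pct = relative_change_pct(run1_total, run2_total)    # run 2 relative to run 1
down_pct = relative_change_pct(run2_total, run1_total)  # run 1 relative to run 2

print(f"run 2 vs run 1: {up_pct:+.1f}%")    # +55.4%
print(f"run 1 vs run 2: {down_pct:+.1f}%")  # -35.6%
```

Measured from run 1, the second run is about 55% larger; measured from run 2, the first is about 36% smaller. Either way the spread is large compared with the single-digit deltas between variants.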
All data points
| run | variant | builder | adversary | total |
|---|---|---|---|---|
| 1 | builder-adversary (orig) | 5,557,356 | 5,199,007 | 10,756,363 |
| 1 | builder-adversary-min | 5,350,953 | 4,768,272 | 10,119,225 |
| 1 | builder-adversary-stateless | 2,834,505 | 3,130,324 | 5,964,829 |
| 2 | builder-adversary-lean | 4,050,402 | 5,086,052 | 9,136,454 |
| 2 | builder-adversary-stateless | 4,606,579 | 4,660,229 | 9,266,808 |
All five succeeded: built a correct calculator (2+3*4→14, (2+3)*4→20, 7/2→3.5), full test
suites green, every gate Adversary-verified, no veto. (Engine: run 1 @ 985d33d, run 2 @ e0425e6
— the orig/min/stateless prompts are byte-identical across both; run 2 only adds lean.)
What we can and can't say
Valid (same-invocation) comparisons:
- Run 1, prose minimization: orig 10.76M vs min 10.12M → −5.9%. Small; consistent with "the prompt is a tiny cached slice." Probably real but minor.
- Run 2, full per-gate review vs batched, both with context hygiene: lean 9.14M vs stateless 9.27M → essentially tied (−1.4%). Enforcing one claim plus one independent verdict per gate did NOT cost more tokens than letting the loop batch. In this run, full review granularity carried no token penalty.
NOT supported:
- The earlier "context hygiene halves tokens (−45%)" claim from run 1 is not reproducible: stateless's own second run (9.27M) lands right next to orig/min/lean. The −45% was mostly a lucky low-iteration run, not the context discipline. Context hygiene may still help, but this benchmark can't prove it at N=1.
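The three percentages discussed above follow from the totals in the data table. A minimal sketch that recomputes them; the dictionary layout and the `delta_pct` helper are illustrative only, not harness code:

```python
# Total tokens per (run, variant), copied from the "All data points" table.
totals = {
    (1, "orig"): 10_756_363,
    (1, "min"): 10_119_225,
    (1, "stateless"): 5_964_829,
    (2, "lean"): 9_136_454,
    (2, "stateless"): 9_266_808,
}


def delta_pct(baseline: int, candidate: int) -> float:
    """Percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


# Same-invocation comparisons (both variants ran in the same run).
min_vs_orig = delta_pct(totals[(1, "orig")], totals[(1, "min")])
lean_vs_stateless = delta_pct(totals[(2, "stateless")], totals[(2, "lean")])
# The run 1 figure that did not reproduce in run 2.
stateless_vs_orig = delta_pct(totals[(1, "orig")], totals[(1, "stateless")])

print(f"min vs orig (run 1):       {min_vs_orig:+.1f}%")        # -5.9%
print(f"lean vs stateless (run 2): {lean_vs_stateless:+.1f}%")  # -1.4%
print(f"stateless vs orig (run 1): {stateless_vs_orig:+.1f}%")  # -44.5%
```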
Findings
- The dominant variable is nondeterministic iteration count, not the prompt variant. ±55% same-variant variance > any between-variant gap.
- Prose size barely matters (−6%, run 1) — keep minimal prompts for readability, not tokens.
- Full per-gate review is ~free vs batching (run 2) — granular adversarial scrutiny didn't raise the bill, so prefer it for quality.
- To actually measure context hygiene you need a campaign: ≥5 runs per variant, compare medians / distributions. The new `log_tokens` harness feature makes that cheap to collect.
Run dirs: run 1 /tmp/ao-harness-YIrsUp, run 2 /tmp/ao-harness-TMDfvk. N=1 per (run, variant).
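The median comparison proposed in the findings could be summarized along these lines. This is a sketch under assumptions: the `(variant, total_tokens)` record shape and the `median_by_variant` helper are hypothetical and do not reflect the actual output format of `log_tokens`. The input is only the five totals reported above, so every variant is flagged as under-sampled.

```python
from collections import defaultdict
from statistics import median

MIN_RUNS = 5  # threshold suggested in the findings


def median_by_variant(records):
    """Group (variant, total_tokens) pairs and return {variant: (n_runs, median)}."""
    grouped = defaultdict(list)
    for variant, total in records:
        grouped[variant].append(total)
    return {variant: (len(vals), median(vals)) for variant, vals in grouped.items()}


# The five totals from the "All data points" table; no additional runs are assumed.
records = [
    ("orig", 10_756_363),
    ("min", 10_119_225),
    ("stateless", 5_964_829),
    ("lean", 9_136_454),
    ("stateless", 9_266_808),
]

summary = median_by_variant(records)
for variant, (n_runs, med) in sorted(summary.items()):
    note = "" if n_runs >= MIN_RUNS else f"  (only {n_runs} run(s), need >= {MIN_RUNS})"
    print(f"{variant:<10} median={med:>12,.1f}{note}")
```

With more runs per variant, the same summary would let the medians be compared directly instead of relying on single-run deltas.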