Combined run 1 (orig/min/stateless) + run 2 (lean/stateless).

Key result: the SAME stateless variant used 5.96M tokens in run 1 and 9.27M in run 2 (+55%).
Nondeterministic iteration count dominates the between-variant gaps. So:

- prose minimization ~-6% (small, same-invocation)
- lean (full per-gate review) ~= stateless (batched): full review is ~free
- the earlier "-45% from context hygiene" is NOT reproducible; mostly noise

Honest conclusion: need >=5 runs/variant to resolve the context-hygiene effect; log_tokens now
makes that easy to collect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Full-harness benchmark — prompt variants (calculator, Sonnet)
Each variant runs the real `agents.py up` Builder/Adversary loop pair plus watchdog, **autonomously**,
through the multi-phase calculator (`plans/calc/{lex,parse,eval}.md`: 3 phases, 4–6 gates each)
until `SEQUENCE-COMPLETE`. Both loops use **claude-sonnet-4-6**. Token counts are summed from each
loop's Claude Code session transcripts.
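As a rough illustration of how a per-loop total might be derived, the sketch below sums usage counters over transcript lines. It assumes each transcript is a JSONL file whose entries may carry a `message.usage` object with `input_tokens`, `output_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`. That layout, the choice of which counters to include, and the helper name `sum_transcript_tokens` are assumptions for illustration, not the harness's actual `log_tokens` implementation.

```python
import json
from typing import Iterable

# Assumed usage counters; the real harness may count a different set.
USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def sum_transcript_tokens(lines: Iterable[str]) -> int:
    """Sum the assumed usage counters over JSONL transcript lines."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or corrupt lines
        if not isinstance(entry, dict):
            continue
        message = entry.get("message")
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue  # entries without usage data contribute nothing
        total += sum(int(usage.get(field) or 0) for field in USAGE_FIELDS)
    return total
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, and per-loop totals can be added together to get a variant total.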
## ⚠️ Headline: run-to-run variance dominates (N=1 is not enough)
The **same** `stateless` variant, with identical prompts, was run twice:

| run | stateless total tokens |
|---|--:|
| run 1 | **5,964,829** |
| run 2 | **9,266,808** |

Run 2 used **55% more tokens** than run 1 (equivalently, about ±22% around the two-run mean) from
nondeterminism alone: how many review rounds and retries the autonomous loop happens to do. That
swing dwarfs the two same-invocation gaps reported below (−5.9% and −1.4%) and is comparable in
size to the −45% gap first attributed to context hygiene. Treat all single-run deltas as suggestive
at best: **N=1 cannot resolve the variant effects**. Resolving them takes several runs per variant,
compared by median.
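How large the swing looks depends on the baseline it is measured against. The short sketch below computes it three ways from the two `stateless` totals in the table above; the function name `spread` and the key names are illustrative only.

```python
def spread(a: int, b: int) -> dict:
    """Express the gap between two run totals against three baselines."""
    low, high = sorted((a, b))
    mean = (low + high) / 2
    return {
        # how much more the expensive run cost than the cheap one
        "high_vs_low_pct": (high - low) / low * 100,
        # how much less the cheap run cost than the expensive one
        "low_vs_high_pct": (low - high) / high * 100,
        # symmetric deviation of either run from the two-run mean
        "half_range_vs_mean_pct": (high - low) / 2 / mean * 100,
    }


s = spread(5_964_829, 9_266_808)
for name, value in s.items():
    print(f"{name}: {value:+.1f}%")
# high_vs_low_pct: +55.4%
# low_vs_high_pct: -35.6%
# half_range_vs_mean_pct: +21.7%
```

With only two samples none of these is a variance estimate; they only bound how far a single run can drift.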
## All data points
| run | variant | builder | adversary | **total** |
|---|---|--:|--:|--:|
| 1 | builder-adversary (orig) | 5,557,356 | 5,199,007 | **10,756,363** |
| 1 | builder-adversary-min | 5,350,953 | 4,768,272 | **10,119,225** |
| 1 | builder-adversary-stateless | 2,834,505 | 3,130,324 | **5,964,829** |
| 2 | builder-adversary-lean | 4,050,402 | 5,086,052 | **9,136,454** |
| 2 | builder-adversary-stateless | 4,606,579 | 4,660,229 | **9,266,808** |

All five runs **succeeded**: each built a correct calculator (`2+3*4→14`, `(2+3)*4→20`,
`7/2→3.5`) with full test suites green, every gate Adversary-verified, and no veto. Engine commits:
run 1 at `985d33d`, run 2 at `e0425e6`. The orig/min/stateless prompts are byte-identical across
both commits; run 2 only adds `lean`.
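A quick consistency check on the table: each reported total should equal builder plus adversary tokens. The sketch below re-enters the five rows by hand and lists any row where that does not hold; `mismatched_rows` is an illustrative helper, not part of the harness.

```python
# (run, variant, builder, adversary, reported_total), copied from the table.
ROWS = [
    (1, "builder-adversary (orig)", 5_557_356, 5_199_007, 10_756_363),
    (1, "builder-adversary-min", 5_350_953, 4_768_272, 10_119_225),
    (1, "builder-adversary-stateless", 2_834_505, 3_130_324, 5_964_829),
    (2, "builder-adversary-lean", 4_050_402, 5_086_052, 9_136_454),
    (2, "builder-adversary-stateless", 4_606_579, 4_660_229, 9_266_808),
]


def mismatched_rows(rows):
    """Return (run, variant) for rows whose total != builder + adversary."""
    return [
        (run, variant)
        for run, variant, builder, adversary, total in rows
        if builder + adversary != total
    ]


print(mismatched_rows(ROWS))  # [] means every total adds up
```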
## What we can and can't say
**Valid (same-invocation) comparisons:**
- **Run 1, prose minimization:** orig 10.76M vs min 10.12M → **−5.9%**. Small, and consistent with
  "the prompt is a tiny cached slice." Probably real but minor.
- **Run 2, full per-gate review vs batched, both with context hygiene:** lean 9.14M vs stateless
  9.27M → **essentially tied (−1.4%)**. Enforcing one claim plus one independent verdict per gate
  did **not** cost more tokens than letting the loop batch. In this run, at least, full review
  granularity carried no token penalty.

**NOT supported:**
- The earlier "context hygiene halves tokens (−45%)" claim from run 1 is **not reproducible**:
  stateless's *own* second run (9.27M) lands within 14% of orig, min, and lean. The −45% most
  likely reflected a lucky low-iteration run rather than context discipline. Context hygiene may
  still help, but this benchmark cannot show it at N=1.
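The percentages quoted above are signed changes relative to a baseline. The sketch below recomputes them from the table totals; the short variant keys and the helper name `pct_change` are illustrative.

```python
def pct_change(baseline: int, candidate: int) -> float:
    """Signed percent change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


# Totals keyed by (run, variant), copied from the data table.
TOTALS = {
    (1, "orig"): 10_756_363,
    (1, "min"): 10_119_225,
    (1, "stateless"): 5_964_829,
    (2, "lean"): 9_136_454,
    (2, "stateless"): 9_266_808,
}

# (label, baseline key, candidate key)
COMPARISONS = [
    ("run 1: min vs orig", (1, "orig"), (1, "min")),
    ("run 2: lean vs stateless", (2, "stateless"), (2, "lean")),
    ("run 1: stateless vs orig", (1, "orig"), (1, "stateless")),
]

for label, base_key, cand_key in COMPARISONS:
    delta = pct_change(TOTALS[base_key], TOTALS[cand_key])
    print(f"{label}: {delta:+.1f}%")
# run 1: min vs orig: -5.9%
# run 2: lean vs stateless: -1.4%
# run 1: stateless vs orig: -44.5%
```

The third line is the gap behind the earlier "−45%" claim; it is included for reference, not as a supported result.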
## Findings
1. **The dominant variable is nondeterministic iteration count, not the prompt variant.** The 55%
   same-variant spread is on the order of the largest between-variant gap (−45%) and far exceeds
   the others.
2. **Prose size barely matters** (−6%, run 1): keep minimal prompts for readability, not tokens.
3. **Full per-gate review is ~free vs batching** (run 2): granular adversarial scrutiny didn't
   raise the bill, so prefer it for quality.
4. **To actually measure context hygiene you need a campaign:** ≥5 runs per variant, comparing
   medians and distributions. The new `log_tokens` harness feature makes that cheap to collect.

_Run dirs: run 1 `/tmp/ao-harness-YIrsUp`, run 2 `/tmp/ao-harness-TMDfvk`. N=1 per (run, variant)._
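The campaign proposed in finding 4 could be summarized along these lines: group totals by variant, report the median and range, and flag variants that have not yet reached five runs. The input format (a list of `(variant, total)` pairs) is a hypothetical stand-in for whatever `log_tokens` actually emits; the samples are the five totals from this report.

```python
from collections import defaultdict
from statistics import median

MIN_RUNS = 5  # threshold suggested in finding 4


def summarize(samples):
    """Group (variant, total_tokens) pairs and summarize each variant."""
    by_variant = defaultdict(list)
    for variant, total in samples:
        by_variant[variant].append(total)
    return {
        variant: {
            "n": len(totals),
            "median": median(totals),
            "min": min(totals),
            "max": max(totals),
            "enough_runs": len(totals) >= MIN_RUNS,
        }
        for variant, totals in by_variant.items()
    }


# The five data points collected so far (runs 1 and 2).
SAMPLES = [
    ("orig", 10_756_363),
    ("min", 10_119_225),
    ("stateless", 5_964_829),
    ("lean", 9_136_454),
    ("stateless", 9_266_808),
]

summary = summarize(SAMPLES)
for variant, stats in summary.items():
    print(variant, stats)
```

With the current data every variant reports `enough_runs: False`, which matches the conclusion that the existing runs cannot rank the variants.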