Combined run 1 (orig/min/stateless) + run 2 (lean/stateless). Key result: the
SAME stateless variant used 5.96M tokens in run 1 and 9.27M in run 2 (±55%) —
nondeterministic iteration count dominates every between-variant gap. So:
- prose minimization ~-6% (small, same-invocation)
- lean (full per-gate review) ~= stateless (batched): full review is ~free
- the earlier "-45% from context hygiene" is NOT reproducible — mostly noise
Honest conclusion: need >=5 runs/variant to resolve the context-hygiene effect;
log_tokens now makes that easy to collect.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full
adversary verification (calc correct, tests OK). Tokens:
orig 10,756,363
min 10,119,225 (-5.9%)
stateless 5,964,829 (-44.5%)
Prompt-prose minimization barely moved tokens; context hygiene (stateless)
nearly halved them, driven by ~48% lower cache_read. Quality held.
Also fix the runner's success check: it grepped the word "veto" and matched
"No veto" → false failures; now matches the "## VETO" marker.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>