The runner no longer wipes RESULTS-campaign.md.data on start, so incremental
reruns append to it. Bump the benchmark engine submodule to c6c7ce8 so the
stateless and lean variants use the full original prompts. Existing orig and min
(r1-4) data is preserved; min is rerun once and stateless/lean ×5 with the new
prompts.
Per request: stop deleting each run's git repo after tallying; keep work/,
work-adv/, and origin.git under the run root so differences can be analysed.
Record the commit count (origin rev-list) and calc/*.py LOC in each data row.
Aggregation now reports per-variant median commits and LOC, the correlations of
tokens with duration, commits, and LOC over successful runs, and the full raw
table.
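The aggregation described above can be sketched roughly as follows. This is a
minimal Python illustration, not the runner's actual code: the row layout
(variant, ok, tokens, duration_s, commits, loc), the sample values, and the
helper names are all assumptions made for the example.

```python
from math import sqrt
from statistics import mean, median

# Hypothetical rows: (variant, ok, tokens, duration_s, commits, loc).
# The numbers are invented for illustration; they are not benchmark results.
ROWS = [
    ("orig", True, 1000, 100, 10, 200),
    ("orig", True, 1200, 130, 12, 220),
    ("min", True, 900, 90, 9, 190),
    ("min", False, 400, 40, 3, 50),  # failed run: excluded everywhere below
    ("stateless", True, 600, 70, 8, 180),
]


def pearson(xs, ys):
    """Pearson correlation; None when either series has zero variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)


def per_variant_medians(rows):
    """Median commits and LOC per variant, over successful runs only."""
    out = {}
    for variant in sorted({r[0] for r in rows}):
        ok = [r for r in rows if r[0] == variant and r[1]]
        if ok:
            out[variant] = {
                "commits": median(r[4] for r in ok),
                "loc": median(r[5] for r in ok),
            }
    return out


def token_correlations(rows):
    """Correlate tokens with duration, commits and LOC over successful runs."""
    ok = [r for r in rows if r[1]]
    tokens = [r[2] for r in ok]
    columns = (("duration", 3), ("commits", 4), ("loc", 5))
    return {name: pearson(tokens, [r[i] for r in ok]) for name, i in columns}
```

Calling per_variant_medians(ROWS) and token_correlations(ROWS) gives the two
summaries; failed runs are filtered out first so a truncated run does not skew
either one.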
run-harness-bench.sh now loops VARIANTS × BENCH_REPEATS (default 5), writes each
run's row to RESULTS-campaign.md.data immediately (survives interruption), and
aggregates per-variant median/mean/min/max/stdev + median duration into
RESULTS-campaign.md. Frees each run's repo/transcripts after tallying.
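The write-each-row-immediately pattern and the per-variant summary could look
something like the sketch below. The real runner is a shell script; this Python
version only illustrates the idea, and the tab-separated row format and the
function names are assumptions.

```python
import os
import statistics


def append_row(path, variant, run, tokens, duration_s):
    """Append one run's row and force it to disk so it survives interruption."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{variant}\t{run}\t{tokens}\t{duration_s}\n")
        fh.flush()
        os.fsync(fh.fileno())


def aggregate(path):
    """Summarise tokens per variant, plus the median duration."""
    by_variant = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            variant, _run, tokens, duration = line.rstrip("\n").split("\t")
            by_variant.setdefault(variant, []).append((int(tokens), float(duration)))

    summary = {}
    for variant, runs in by_variant.items():
        toks = [t for t, _ in runs]
        summary[variant] = {
            "median": statistics.median(toks),
            "mean": statistics.mean(toks),
            "min": min(toks),
            "max": max(toks),
            # stdev needs at least two samples; report 0.0 for a single run
            "stdev": statistics.stdev(toks) if len(toks) > 1 else 0.0,
            "median_duration": statistics.median(d for _, d in runs),
        }
    return summary
```

Because rows are appended rather than held in memory, aggregate() can be run
on whatever rows exist after an interrupted campaign.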
All three variants completed the 3-phase calculator to SEQUENCE-COMPLETE with
full adversary verification (calc correct, tests OK). Total tokens (change vs
orig):
orig       10,756,363
min        10,119,225 (-5.9%)
stateless   5,964,829 (-44.5%)
Prompt-prose minimization barely moved tokens; context hygiene (stateless)
nearly halved them, driven by ~48% lower cache_read. Quality held.
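The percentages above are changes relative to the orig total. A short check,
using only the token totals stated in this message (the helper name is
illustrative):

```python
# Token totals as reported above.
TOKENS = {"orig": 10_756_363, "min": 10_119_225, "stateless": 5_964_829}


def pct_change(value, baseline):
    """Percentage change of value relative to baseline, to one decimal place."""
    return round(100.0 * (value - baseline) / baseline, 1)


for name in ("min", "stateless"):
    print(name, pct_change(TOKENS[name], TOKENS["orig"]))
```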
Also fix the runner's success check: it grepped for the word "veto", which also
matched "No veto" and reported false failures; it now matches the "## VETO"
marker.
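The difference between the two checks can be illustrated as below. This is a
sketch in Python rather than the runner's shell, and it assumes the "## VETO"
marker sits at the start of a line; the function names are hypothetical.

```python
import re

# Assumption: the marker appears at the start of a line, like a heading.
VETO_MARKER = re.compile(r"^## VETO\b", re.MULTILINE)


def naive_has_veto(text):
    """Old behaviour: any occurrence of the word counts, including 'No veto'."""
    return "veto" in text


def marker_has_veto(text):
    """New behaviour: only the explicit '## VETO' marker counts."""
    return VETO_MARKER.search(text) is not None


clean_report = "All gates passed. No veto."
vetoed_report = "Gate 2 review\n## VETO\nparser mishandles unary minus"

print(naive_has_veto(clean_report), marker_has_veto(clean_report))
print(naive_has_veto(vetoed_report), marker_has_veto(vetoed_report))
```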
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- bump engine submodule to 985d33d (adds builder-adversary-stateless example)
- run-harness-bench.sh: pre-trust each work dir in ~/.claude.json so interactive
claude (tmux) skips the workspace-trust dialog (--dangerously-skip-permissions
only skips it for redirected/headless output)
- run-harness-bench.sh: benchmark all three variants
- (fixes from this session: bare repo default branch → main; unique session
prefix per run)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per
phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial
edge cases (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real "agents.py up" Builder/Adversary loop
pair + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE,
and counts tokens from the session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token
effect; cache-read of the working context dominates)
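For orientation, the three calculator phases in plans/calc (tokenizer →
recursive-descent parser → evaluator) could be sketched as below. This is not
the code the agents produced and not the plans' actual gates; the grammar, the
tuple-based AST, and all names are assumptions, chosen to exercise the listed
edge cases (precedence, associativity, unary minus, division by zero).

```python
import re


def lex(src):
    """Phase 1: split the source into ("num", float) and ("op", str) tokens."""
    tokens = []
    for m in re.finditer(r"[0-9]+(?:\.[0-9]+)?|\S", src):
        text = m.group()
        if text[0] in "0123456789":
            tokens.append(("num", float(text)))
        elif text in "+-*/()":
            tokens.append(("op", text))
        else:
            raise SyntaxError(f"unexpected character {text!r}")
    return tokens


class Parser:
    """Phase 2: recursive descent; each grammar rule is one method."""

    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("eof", None)

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse(self):
        node = self.expr()
        if self.peek()[0] != "eof":
            raise SyntaxError("trailing input")
        return node

    def expr(self):  # expr := term (('+' | '-') term)*, left-associative
        node = self.term()
        while self.peek() in (("op", "+"), ("op", "-")):
            node = (self.take()[1], node, self.term())
        return node

    def term(self):  # term := unary (('*' | '/') unary)*, binds tighter than expr
        node = self.unary()
        while self.peek() in (("op", "*"), ("op", "/")):
            node = (self.take()[1], node, self.unary())
        return node

    def unary(self):  # unary := '-' unary | primary
        if self.peek() == ("op", "-"):
            self.take()
            return ("neg", self.unary())
        return self.primary()

    def primary(self):  # primary := NUMBER | '(' expr ')'
        kind, value = self.take()
        if kind == "num":
            return ("num", value)
        if (kind, value) == ("op", "("):
            node = self.expr()
            if self.take() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError("expected number or '('")


def evaluate(node):
    """Phase 3: walk the AST produced by the parser."""
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "neg":
        return -evaluate(node[1])
    left, right = evaluate(node[1]), evaluate(node[2])
    if kind == "+":
        return left + right
    if kind == "-":
        return left - right
    if kind == "*":
        return left * right
    if right == 0:
        raise ZeroDivisionError("division by zero")
    return left / right


def calc(src):
    return evaluate(Parser(lex(src)).parse())
```

For example, calc("1 + 2 * 3") exercises precedence and calc("8 - 3 - 2")
exercises left associativity.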
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>