Capstone summary of the Builder/Adversary prompt + verification-cadence study:
- adversary EXISTENCE costs ~4.7x (solo 2.8M vs ~13M); cadence is ~token-neutral
- context hygiene is the one clean -22% win; minimal prompts -25% but test less
- deferred review saves nothing: the one comprehensive pass is expensive and
  comes late
- cost tracks process, not product (correlations: tokens~duration 0.83,
  tokens~commits 0.79, tokens~LOC -0.04)
All results now in-repo: FINDINGS.md + RESULTS-campaign.md + raw .data + runners.
(deferred variant is at N=3; finalizing to N=5.)
Runner no longer wipes RESULTS-campaign.md.data on start (incremental reruns
append). Bump benchmark engine submodule to c6c7ce8 so stateless/lean use the
full original prompts. orig + min (r1-4) data preserved; rerunning min once +
stateless/lean ×5 with the new prompts.
run-solo-bench.sh runs the builder-solo variant (single builder, self-verify,
no adversary) 5× on the same calculator and appends rows to the shared campaign
data file (adversary col = 0). Separate script so the live campaign runner is
untouched. analyze.py limit-detection now also covers the solo run layout.
Engine example builder-solo committed at a0f7652; benchmark engine to be
re-pinned to it before running solo (after the main campaign completes).
A run that hits a usage-limit pause has inflated duration (idle wait) but an
accurate token total. analyze.py now scans each run's watchdog log for 'limit
hit', flags it LIMIT in the raw table, and excludes it from the tokens/sec stat
(token total, tok/LOC, tok/commit unaffected). Caught because campaign run r2
hit the limit ~00:40 and recovered at the 00:50 reset — watchdog handled it.
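The limit handling described above could look roughly like the sketch below. This is not the actual analyze.py code: the function names, the dict keys, and treating a missing log as "no pause" are all assumptions; only the 'limit hit' phrase and the exclusion from the tokens/sec stat come from this message.

```python
from pathlib import Path

LIMIT_MARKER = "limit hit"  # phrase quoted in the commit message above


def hit_usage_limit(watchdog_log):
    """True if the run's watchdog log records a usage-limit pause."""
    path = Path(watchdog_log)
    if not path.is_file():
        return False  # assumption: a missing log means no pause was seen
    return LIMIT_MARKER in path.read_text(errors="replace")


def tokens_per_sec_samples(runs):
    """Tokens/sec per run, leaving out runs flagged as LIMIT.

    `runs` is a hypothetical list of dicts with the keys
    tokens, duration_s and limit (bool).
    """
    return [
        r["tokens"] / r["duration_s"]
        for r in runs
        if not r["limit"] and r["duration_s"] > 0
    ]
```

Token totals, tok/LOC and tok/commit would still be computed over every run, since only the duration is distorted by the idle wait.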
Standalone analysis over RESULTS-campaign.md.data (safe: independent of the live
runner). Adds the normalised efficiency ratios per run with min/median/max per
variant, alongside the token distributions, commit/LOC medians, correlations,
and full raw table. Run: python3 analyze.py (regenerates RESULTS-campaign.md).
Orig baseline (5 runs): tokens/LOC ~25k–34k, tokens/sec ~11.3k–14.0k.
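A minimal sketch of how the per-variant min/median/max of a normalised ratio such as tokens/LOC might be computed. The row layout (dicts with variant, tokens, loc) and the function name are hypothetical, not taken from analyze.py.

```python
from collections import defaultdict
from statistics import median


def ratio_summary(rows, numerator, denominator):
    """Return {variant: (min, median, max)} of numerator/denominator per run.

    Rows with a non-positive denominator are skipped so that a run with
    zero LOC or zero commits cannot divide by zero.
    """
    by_variant = defaultdict(list)
    for row in rows:
        if row[denominator] > 0:
            by_variant[row["variant"]].append(row[numerator] / row[denominator])
    return {
        variant: (min(vals), median(vals), max(vals))
        for variant, vals in by_variant.items()
    }
```

Calling `ratio_summary(rows, "tokens", "loc")` would give the tokens/LOC range per variant; the same helper covers tokens/commit by swapping the denominator key.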
Per request: stop deleting each run's git repo after tallying — keep work/,
work-adv/, origin.git under the run root so differences can be analysed. Record
commit count (origin rev-list) and calc/*.py LOC in each data row; aggregation
now reports per-variant median commits/LOC and tokens~{duration,commits,LOC}
correlations over successful runs, plus the full raw table.
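The message does not say which correlation coefficient is reported; the sketch below assumes plain Pearson correlation over the successful runs, with a hypothetical helper name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumed metric)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one series is constant
    return sxy / sqrt(sxx * syy)
```

With per-run lists of tokens, durations, commits and LOC, `pearson(tokens, durations)` and its two siblings would yield the three tokens~X figures.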
run-harness-bench.sh now loops VARIANTS × BENCH_REPEATS (default 5), writes each
run's row to RESULTS-campaign.md.data immediately (survives interruption), and
aggregates per-variant median/mean/min/max/stdev + median duration into
RESULTS-campaign.md. Frees each run's repo/transcripts after tallying.
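The write-immediately-then-aggregate pattern above can be sketched as follows. The real runner is a shell script and its data-file columns are not shown here, so the tab-separated variant/tokens/duration layout and both function names are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def append_row(data_path, variant, tokens, duration_s):
    """Append one run's row right away so an interrupted campaign keeps it."""
    with open(data_path, "a", encoding="utf-8") as fh:
        fh.write(f"{variant}\t{tokens}\t{duration_s}\n")


def aggregate(data_path):
    """Per-variant token stats plus median duration from the data file."""
    tokens = defaultdict(list)
    durations = defaultdict(list)
    with open(data_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            variant, tok, dur = line.rstrip("\n").split("\t")
            tokens[variant].append(int(tok))
            durations[variant].append(float(dur))
    summary = {}
    for variant, vals in tokens.items():
        summary[variant] = {
            "median": median(vals),
            "mean": mean(vals),
            "min": min(vals),
            "max": max(vals),
            # sample stdev needs at least two runs
            "stdev": stdev(vals) if len(vals) > 1 else 0.0,
            "median_duration_s": median(durations[variant]),
        }
    return summary
```

Because every row is appended as soon as a run finishes, the aggregate can be regenerated from whatever rows exist at any point.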
Combined run 1 (orig/min/stateless) + run 2 (lean/stateless). Key result: the
SAME stateless variant used 5.96M tokens in run 1 and 9.27M in run 2 (+55%):
nondeterministic iteration count dominates every between-variant gap. So:
- prose minimization ~-6% (small, same-invocation)
- lean (full per-gate review) ~= stateless (batched): full review is ~free
- the earlier "-45% from context hygiene" is NOT reproducible — mostly noise
Honest conclusion: need >=5 runs/variant to resolve the context-hygiene effect;
log_tokens now makes that easy to collect.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full
adversary verification (calc correct, tests OK). Tokens:
orig 10,756,363
min 10,119,225 (-5.9%)
stateless 5,964,829 (-44.5%)
Prompt-prose minimization barely moved tokens; context hygiene (stateless)
nearly halved them, driven by ~48% lower cache_read. Quality held.
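The percentages above follow directly from the token totals, relative to orig. A small check, where the helper name is just for illustration:

```python
# Token totals as reported in this message.
TOKENS = {
    "orig": 10_756_363,
    "min": 10_119_225,
    "stateless": 5_964_829,
}


def pct_change(value, baseline):
    """Signed percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


for name, tokens in TOKENS.items():
    delta = pct_change(tokens, TOKENS["orig"])
    print(f"{name:<10} {tokens:>12,} ({delta:+.1f}%)")
```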
Also fix the runner's success check: it grepped the word "veto" and matched
"No veto" → false failures; now matches the "## VETO" marker.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
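The success-check fix in the commit above (matching the "## VETO" marker rather than the bare word) can be illustrated like this. The real runner is a shell script using grep; this Python version, its function names, and the assumption that the marker sits at the start of a line are illustrative only.

```python
import re

# Assumption: the marker appears at the start of a line as a heading.
VETO_MARKER = re.compile(r"^## VETO\b", re.MULTILINE)


def naive_veto_check(text):
    """The old behaviour: any occurrence of the word counts as a veto."""
    return "veto" in text


def marker_veto_check(text):
    """The fixed behaviour: only an explicit '## VETO' heading counts."""
    return VETO_MARKER.search(text) is not None
```

A report saying "No veto" trips the naive check but not the marker check, which is the false failure the fix removes.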
- bump engine submodule to 985d33d (adds builder-adversary-stateless example)
- run-harness-bench.sh: pre-trust each work dir in ~/.claude.json so interactive
claude (tmux) skips the workspace-trust dialog (--dangerously-skip-permissions
only skips it for redirected/headless output); benchmark all three variants
- (fixes from this session: bare repo default branch → main; unique session
prefix per run)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per
phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial
edge cases (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real 'agents.py up' Builder/Adversary loop pair
+ watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and
clocks tokens from the session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token
effect; cache-read of the working context dominates)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A standalone repo (engine vendored as a submodule at the examples commit) that
runs a head-to-head between the builder-adversary and builder-adversary-min
example variants: same task, independent headless runs, both on Sonnet, with
token counts. Includes the roman-numeral test problem and run-bench.sh.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>