Capstone summary of the Builder/Adversary prompt + verification-cadence study:
- adversary EXISTENCE costs ~4.7x (solo 2.8M vs ~13M); cadence is ~token-neutral
- context hygiene is the one clean -22% win; minimal prompts -25% but test less
- deferred review saves nothing (the one comprehensive pass is expensive) + late
- cost is process not product (tokens~duration 0.83, ~commits 0.79, ~LOC -0.04)
All results now in-repo: FINDINGS.md + RESULTS-campaign.md + raw .data + runners.
(deferred N=3, finalizing to N=5.)
run-harness-bench.sh now loops VARIANTS × BENCH_REPEATS (default 5), writes each
run's row to RESULTS-campaign.md.data immediately (survives interruption), and
aggregates per-variant median/mean/min/max/stdev + median duration into
RESULTS-campaign.md. Frees each run's repo/transcripts after tallying.
A standalone repo (engine vendored as a submodule at the examples commit) that
runs a head-to-head between the builder-adversary and builder-adversary-min
example variants: same task, independent headless runs, both on Sonnet, with
token counts. Includes the roman-numeral test problem and run-bench.sh.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>