Commit Graph

5 Commits

Author SHA1 Message Date
64bc360fc0 chore: gitignore the runner's regenerated .data.hdr artifact 2026-06-16 03:13:23 +00:00
3bf3316572 docs: FINDINGS.md — benchmark synthesis; track raw results data
Capstone summary of the Builder/Adversary prompt + verification-cadence study:
- adversary EXISTENCE costs ~4.7x (solo 2.8M vs ~13M); cadence is ~token-neutral
- context hygiene is the one clean -22% win; minimal prompts -25% but test less
- deferred review saves nothing (the one comprehensive pass is expensive) and comes late
- cost is process not product (tokens~duration 0.83, ~commits 0.79, ~LOC -0.04)
All results now in-repo: FINDINGS.md + RESULTS-campaign.md + raw .data + runners.
(deferred N=3, finalizing to N=5.)
2026-06-16 01:53:34 +00:00
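
The figures in "tokens~duration 0.83, ~commits 0.79, ~LOC -0.04" read as correlation coefficients between per-run token counts and other per-run measures. The commit does not say which coefficient was used; the sketch below assumes Pearson's r. The function name `pearson` and the sample lists in the usage note are hypothetical and are not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumption: the commit message does not name the statistic; Pearson's r
    is used here only as the most common reading of "x~y 0.83".
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)
```

Called as `pearson(tokens_per_run, duration_per_run)`, a value near 1 would mean token cost tracks run duration closely, while a value near 0 (as reported for LOC) would mean the two are essentially unrelated.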
37032ee363 feat: campaign mode — repeat each variant N times, aggregate distributions
run-harness-bench.sh now loops VARIANTS × BENCH_REPEATS (default 5), writes each
run's row to RESULTS-campaign.md.data immediately (survives interruption), and
aggregates per-variant median/mean/min/max/stdev + median duration into
RESULTS-campaign.md. Frees each run's repo/transcripts after tallying.
2026-06-14 22:19:10 +00:00
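
The campaign pattern described above (append each run's row as soon as it finishes, then aggregate per variant) can be sketched as follows. This is not the code of run-harness-bench.sh, which is a shell script. It is a minimal Python illustration under stated assumptions: the tab-separated row layout (variant, tokens, duration in seconds) and the names `append_row` and `aggregate` are hypothetical.

```python
import statistics
from pathlib import Path


def append_row(data_path, variant, tokens, duration_s):
    """Append one run's result immediately, so completed runs survive an interruption.

    Assumption: one tab-separated row per run (variant, tokens, duration_s).
    The real .data layout is not shown in the commit message.
    """
    with open(data_path, "a", encoding="utf-8") as fh:
        fh.write(f"{variant}\t{tokens}\t{duration_s}\n")


def aggregate(data_path):
    """Group rows by variant and compute the summary statistics named in the commit."""
    by_variant = {}
    for line in Path(data_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        variant, tokens, duration_s = line.split("\t")
        by_variant.setdefault(variant, []).append((int(tokens), float(duration_s)))

    summary = {}
    for variant, rows in by_variant.items():
        tokens = [t for t, _ in rows]
        durations = [d for _, d in rows]
        summary[variant] = {
            "n": len(rows),
            "median": statistics.median(tokens),
            "mean": statistics.mean(tokens),
            "min": min(tokens),
            "max": max(tokens),
            # Sample standard deviation needs at least two runs.
            "stdev": statistics.stdev(tokens) if len(tokens) > 1 else 0.0,
            "median_duration_s": statistics.median(durations),
        }
    return summary
```

Because rows are appended one at a time, `aggregate` can be rerun on a partially completed campaign and will summarise whatever runs have finished so far.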
11eda4a8b1 chore: gitignore the runner's transient .tmp file 2026-06-14 20:40:26 +00:00
27df2c7b55 feat: agent-orchestrator-benchmark — prompt token comparison harness
A standalone repo (engine vendored as a submodule at the examples commit) that
runs a head-to-head between the builder-adversary and builder-adversary-min
example variants: same task, independent headless runs, both on Sonnet, with
token counts. Includes the roman-numeral test problem and run-bench.sh.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:20:05 +00:00