agent-orchestrator-benchmark

recipe-maintainers/agent-orchestrator-benchmark

Author	SHA1	Message	Date
mfowler	64bc360fc0	chore: gitignore the runner's regenerated .data.hdr artifact	2026-06-16 03:13:23 +00:00
mfowler	3bf3316572	docs: FINDINGS.md — benchmark synthesis; track raw results data. Capstone summary of the Builder/Adversary prompt + verification-cadence study: (1) adversary EXISTENCE costs ~4.7x (solo 2.8M vs ~13M); cadence is ~token-neutral. (2) context hygiene is the one clean -22% win; minimal prompts -25% but test less. (3) deferred review saves nothing (the one comprehensive pass is expensive) + late. (4) cost is process not product (tokens~duration 0.83, ~commits 0.79, ~LOC -0.04). All results now in-repo: FINDINGS.md + RESULTS-campaign.md + raw .data + runners. (deferred N=3, finalizing to N=5.)	2026-06-16 01:53:34 +00:00
mfowler	37032ee363	feat: campaign mode — repeat each variant N times, aggregate distributions. run-harness-bench.sh now loops VARIANTS × BENCH_REPEATS (default 5), writes each run's row to RESULTS-campaign.md.data immediately (survives interruption), and aggregates per-variant median/mean/min/max/stdev + median duration into RESULTS-campaign.md. Frees each run's repo/transcripts after tallying.	2026-06-14 22:19:10 +00:00
mfowler	11eda4a8b1	chore: gitignore the runner's transient .tmp file	2026-06-14 20:40:26 +00:00
mfowler	27df2c7b55	feat: agent-orchestrator-benchmark — prompt token comparison harness. A standalone repo (engine vendored as a submodule at the examples commit) that runs a head-to-head between the builder-adversary and builder-adversary-min example variants: same task, independent headless runs, both on Sonnet, with token counts. Includes the roman-numeral test problem and run-bench.sh. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 20:20:05 +00:00

5 Commits
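
The campaign-mode commit (37032ee363) describes aggregating each variant's repeated runs into median/mean/min/max/stdev of token counts plus a median duration. The following is a minimal sketch of that kind of aggregation, not the repository's actual implementation: the row layout `(variant, tokens, duration_seconds)`, the function name `aggregate`, and the use of sample standard deviation are all assumptions, since the log does not show the `.data` format or say which stdev is computed.

```python
import statistics
from collections import defaultdict


def aggregate(rows):
    """Summarize per-variant token counts and durations.

    rows: iterable of (variant, tokens, duration_seconds) tuples.
    This layout is hypothetical; the real .data row format is not shown
    in the commit log.
    """
    by_variant = defaultdict(list)
    for variant, tokens, duration in rows:
        by_variant[variant].append((tokens, duration))

    summary = {}
    for variant, runs in by_variant.items():
        tokens = [t for t, _ in runs]
        durations = [d for _, d in runs]
        summary[variant] = {
            "n": len(runs),
            "median": statistics.median(tokens),
            "mean": statistics.mean(tokens),
            "min": min(tokens),
            "max": max(tokens),
            # Sample stdev needs at least two runs; report 0.0 otherwise.
            # (Assumption: the runner may use a different estimator.)
            "stdev": statistics.stdev(tokens) if len(tokens) > 1 else 0.0,
            "median_duration": statistics.median(durations),
        }
    return summary
```

Because each run's row is appended as soon as the run finishes, a summary like this can be recomputed from whatever rows exist after an interruption; variants with fewer than the intended number of repeats simply report a smaller `n`.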