recipe-maintainers/agent-orchestrator-benchmark

Author	SHA1	Message	Date
mfowler	5f9805173a	fix: veto check matches all-caps '## VETO <reason>', not '## Veto log' header	2026-06-15 06:22:40 +00:00
mfowler	583fc2a0dc	chore: append-mode data file; engine -> c6c7ce8 (orig-based stateless/lean) Runner no longer wipes RESULTS-campaign.md.data on start (incremental reruns append). Bump benchmark engine submodule to c6c7ce8 so stateless/lean use the full original prompts. orig + min (r1-4) data preserved; rerunning min once + stateless/lean ×5 with the new prompts.	2026-06-15 03:19:09 +00:00
mfowler	dbe9ef9c72	feat: keep run repos + record commits/LOC per run Per request: stop deleting each run's git repo after tallying — keep work/, work-adv/, origin.git under the run root so differences can be analysed. Record commit count (origin rev-list) and calc/*.py LOC in each data row; aggregation now reports per-variant median commits/LOC and tokens~{duration,commits,LOC} correlations over successful runs, plus the full raw table.	2026-06-15 00:13:08 +00:00
mfowler	37032ee363	feat: campaign mode — repeat each variant N times, aggregate distributions run-harness-bench.sh now loops VARIANTS × BENCH_REPEATS (default 5), writes each run's row to RESULTS-campaign.md.data immediately (survives interruption), and aggregates per-variant median/mean/min/max/stdev + median duration into RESULTS-campaign.md. Frees each run's repo/transcripts after tallying.	2026-06-14 22:19:10 +00:00
mfowler	cca5c895b2	feat: add builder-adversary-lean variant; runner takes variant args - bump engine submodule to e0425e6 (adds builder-adversary-lean: context hygiene + enforced per-gate review) - run-harness-bench.sh: accept variant names as CLI args to run a subset	2026-06-14 21:43:11 +00:00
mfowler	0fa3d726a5	results: full-harness 3-way (orig/min/stateless) on the calculator All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full adversary verification (calc correct, tests OK). Tokens: orig 10,756,363 min 10,119,225 (-5.9%) stateless 5,964,829 (-44.5%) Prompt-prose minimization barely moved tokens; context hygiene (stateless) nearly halved them, driven by ~48% lower cache_read. Quality held. Also fix the runner's success check: it grepped the word "veto" and matched "No veto" → false failures; now matches the "## VETO" marker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 21:35:47 +00:00
mfowler	a1b59e1bc5	feat: add stateless variant, pre-trust work dirs, loop over 3 variants - bump engine submodule to 985d33d (adds builder-adversary-stateless example) - run-harness-bench.sh: pre-trust each work dir in ~/.claude.json so interactive claude (tmux) skips the workspace-trust dialog (--dangerously-skip-permissions only skips it for redirected/headless output); benchmark all three variants - (fixes from this session: bare repo default branch → main; unique session prefix per run) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 20:52:29 +00:00
mfowler	8c3f38dbf4	feat: multi-phase calculator problem + full-harness benchmark runner - plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial edge cases (precedence/associativity/unary/div-zero) - run-harness-bench.sh: stands up a real agents.py up Builder/Adversary loop pair + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and clocks tokens from the session transcripts (AI-as-adversary kept intact) - RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token effect; cache-read of the working context dominates) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 20:40:14 +00:00
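
Two of the commits above (0fa3d726a5 and 5f9805173a) tighten the runner's success check: first from grepping the word "veto" (which matched "No veto") to the "## VETO" marker, then to the all-caps '## VETO <reason>' form so the '## Veto log' header no longer counts. A minimal sketch of that matching rule follows. It is written in Python for illustration only; the actual check lives in run-harness-bench.sh and may be implemented differently. The function name `has_veto` and the exact pattern are assumptions, not taken from the repository.

```python
import re

# Assumed marker shape: a line that starts with "## VETO" in capitals,
# followed by at least one space or tab and a non-empty reason.
# No IGNORECASE flag, so the "## Veto log" header does not match.
VETO_MARKER = re.compile(r"^## VETO[ \t]+\S", re.MULTILINE)


def has_veto(text: str) -> bool:
    """Return True if any line in the text is an all-caps veto marker with a reason."""
    return VETO_MARKER.search(text) is not None


if __name__ == "__main__":
    clean_log = "## Veto log\nNo veto raised in this phase.\n"
    vetoed_log = "## Veto log\n## VETO tests fail on unary minus\n"
    print(has_veto(clean_log))   # header and the phrase "No veto" are ignored
    print(has_veto(vetoed_log))  # explicit marker line is detected
```

Anchoring the pattern to the start of a line and keeping it case-sensitive is what separates the marker from both the log header and incidental prose mentions of a veto.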

8 Commits
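
The token reductions quoted in commit 0fa3d726a5 can be re-derived from the three totals it reports. The sketch below only re-does that arithmetic; the helper name `pct_change` and the dictionary layout are illustrative assumptions and do not come from the benchmark's own aggregation code.

```python
# Token totals as reported in commit 0fa3d726a5.
TOKENS = {
    "orig": 10_756_363,
    "min": 10_119_225,
    "stateless": 5_964_829,
}


def pct_change(baseline: int, value: int) -> float:
    """Signed percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100


if __name__ == "__main__":
    baseline = TOKENS["orig"]
    for variant in ("min", "stateless"):
        change = pct_change(baseline, TOKENS[variant])
        print(f"{variant}: {change:+.1f}%")
    # prints:
    # min: -5.9%
    # stateless: -44.5%
```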