recipe-maintainers/agent-orchestrator-benchmark

Author	SHA1	Message	Date
mfowler	29b89140e7	results: 4-variant campaign complete (5/5 each); analysis with ratios	2026-06-15 06:40:34 +00:00
mfowler	5f9805173a	fix: veto check matches all-caps '## VETO <reason>', not '## Veto log' header	2026-06-15 06:22:40 +00:00
mfowler	583fc2a0dc	chore: append-mode data file; engine -> c6c7ce8 (orig-based stateless/lean) Runner no longer wipes RESULTS-campaign.md.data on start (incremental reruns append). Bump benchmark engine submodule to c6c7ce8 so stateless/lean use the full original prompts. orig + min (r1-4) data preserved; rerunning min once + stateless/lean ×5 with the new prompts.	2026-06-15 03:19:09 +00:00
mfowler	fc0608ede1	feat: builder-solo control runner (run after campaign) + limit-detect for it run-solo-bench.sh runs the builder-solo variant (single builder, self-verify, no adversary) 5× on the same calculator and appends rows to the shared campaign data file (adversary col = 0). Separate script so the live campaign runner is untouched. analyze.py limit-detection now also covers the solo run layout. Engine example builder-solo committed at a0f7652; benchmark engine to be re-pinned to it before running solo (after the main campaign completes).	2026-06-15 02:36:58 +00:00
mfowler	25a77f5d3c	fix: flag usage-limit-affected runs; correct tok/sec A run that hits a usage-limit pause has inflated duration (idle wait) but an accurate token total. analyze.py now scans each run's watchdog log for 'limit hit', flags it LIMIT in the raw table, and excludes it from the tokens/sec stat (token total, tok/LOC, tok/commit unaffected). Caught because campaign run r2 hit the limit ~00:40 and recovered at the 00:50 reset — watchdog handled it.	2026-06-15 01:29:54 +00:00
mfowler	33eeb3ce6b	feat: analyze.py — efficiency ratios (tokens/LOC, tokens/sec, tokens/commit) Standalone analysis over RESULTS-campaign.md.data (safe: independent of the live runner). Adds the normalised efficiency ratios per run with min/median/max per variant, alongside the token distributions, commit/LOC medians, correlations, and full raw table. Run: python3 analyze.py (regenerates RESULTS-campaign.md). Orig baseline (5 runs): tokens/LOC ~25k–34k, tokens/sec ~11.3k–14.0k.	2026-06-15 00:15:46 +00:00
mfowler	dbe9ef9c72	feat: keep run repos + record commits/LOC per run Per request: stop deleting each run's git repo after tallying — keep work/, work-adv/, origin.git under the run root so differences can be analysed. Record commit count (origin rev-list) and calc/*.py LOC in each data row; aggregation now reports per-variant median commits/LOC and tokens~{duration,commits,LOC} correlations over successful runs, plus the full raw table.	2026-06-15 00:13:08 +00:00
mfowler	37032ee363	feat: campaign mode — repeat each variant N times, aggregate distributions run-harness-bench.sh now loops VARIANTS × BENCH_REPEATS (default 5), writes each run's row to RESULTS-campaign.md.data immediately (survives interruption), and aggregates per-variant median/mean/min/max/stdev + median duration into RESULTS-campaign.md. Frees each run's repo/transcripts after tallying.	2026-06-14 22:19:10 +00:00
mfowler	b46dca003c	results: 4-way + the variance finding (N=1 is not enough) Combined run 1 (orig/min/stateless) + run 2 (lean/stateless). Key result: the SAME stateless variant used 5.96M tokens in run 1 and 9.27M in run 2 (±55%) — nondeterministic iteration count dominates every between-variant gap. So: - prose minimization ~-6% (small, same-invocation) - lean (full per-gate review) ~= stateless (batched): full review is ~free - the earlier "-45% from context hygiene" is NOT reproducible — mostly noise Honest conclusion: need >=5 runs/variant to resolve the context-hygiene effect; log_tokens now makes that easy to collect. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 22:06:21 +00:00
mfowler	cca5c895b2	feat: add builder-adversary-lean variant; runner takes variant args - bump engine submodule to e0425e6 (adds builder-adversary-lean: context hygiene + enforced per-gate review) - run-harness-bench.sh: accept variant names as CLI args to run a subset	2026-06-14 21:43:11 +00:00
mfowler	0fa3d726a5	results: full-harness 3-way (orig/min/stateless) on the calculator All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full adversary verification (calc correct, tests OK). Tokens: orig 10,756,363 min 10,119,225 (-5.9%) stateless 5,964,829 (-44.5%) Prompt-prose minimization barely moved tokens; context hygiene (stateless) nearly halved them, driven by ~48% lower cache_read. Quality held. Also fix the runner's success check: it grepped the word "veto" and matched "No veto" → false failures; now matches the "## VETO" marker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 21:35:47 +00:00
mfowler	a1b59e1bc5	feat: add stateless variant, pre-trust work dirs, loop over 3 variants - bump engine submodule to 985d33d (adds builder-adversary-stateless example) - run-harness-bench.sh: pre-trust each work dir in ~/.claude.json so interactive claude (tmux) skips the workspace-trust dialog (--dangerously-skip-permissions only skips it for redirected/headless output); benchmark all three variants - (fixes from this session: bare repo default branch → main; unique session prefix per run) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 20:52:29 +00:00
mfowler	11eda4a8b1	chore: gitignore the runner's transient .tmp file	2026-06-14 20:40:26 +00:00
mfowler	8c3f38dbf4	feat: multi-phase calculator problem + full-harness benchmark runner - plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial edge cases (precedence/associativity/unary/div-zero) - run-harness-bench.sh: stands up a real agents.py up Builder/Adversary loop pair + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and clocks tokens from the session transcripts (AI-as-adversary kept intact) - RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token effect; cache-read of the working context dominates) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 20:40:14 +00:00
mfowler	27df2c7b55	feat: agent-orchestrator-benchmark — prompt token comparison harness A standalone repo (engine vendored as a submodule at the examples commit) that runs a head-to-head between the builder-adversary and builder-adversary-min example variants: same task, independent headless runs, both on Sonnet, with token counts. Includes the roman-numeral test problem and run-bench.sh. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 20:20:05 +00:00
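
The veto-check fixes in 0fa3d726a5 and 5f9805173a describe two false positives: grepping the bare word "veto" matched "No veto", and a later check matched the '## Veto log' header. A minimal sketch of a check that avoids both is below. The function name `has_veto` and the exact regular expression are assumptions for illustration; the log only states that the check should match the all-caps '## VETO <reason>' marker.

```python
import re

# Hypothetical pattern (not taken from the runner): a line that starts with
# the all-caps marker "## VETO", followed by spaces/tabs and a non-empty reason.
# The match is case-sensitive, so the "## Veto log" header does not qualify.
VETO_RE = re.compile(r"^## VETO[ \t]+\S", re.MULTILINE)


def has_veto(text: str) -> bool:
    """Return True only if the text contains a '## VETO <reason>' line."""
    return VETO_RE.search(text) is not None


if __name__ == "__main__":
    clean = "## Veto log\nNo veto\n"
    vetoed = "## Veto log\n## VETO tests fail on unary minus\n"
    print(has_veto(clean), has_veto(vetoed))
```

Anchoring on the start of the line and keeping the match case-sensitive is what separates the marker from prose that merely mentions a veto.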

15 Commits
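
The percentages quoted in 0fa3d726a5 and b46dca003c can be reproduced from the token totals given in those messages. The sketch below is only a worked-arithmetic check; the helper name `pct_change` is hypothetical and is not part of analyze.py.

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Token totals from the 3-way result (commit 0fa3d726a5).
ORIG = 10_756_363
MIN = 10_119_225
STATELESS = 5_964_829

min_delta = round(pct_change(ORIG, MIN), 1)
stateless_delta = round(pct_change(ORIG, STATELESS), 1)

# Same stateless variant, run 1 vs run 2 (commit b46dca003c), in millions of tokens.
stateless_spread = pct_change(5.96, 9.27)

if __name__ == "__main__":
    print(min_delta)         # -5.9
    print(stateless_delta)   # -44.5
    print(stateless_spread)  # roughly 55.5, i.e. the ~55% run-to-run gap
```

The run-to-run gap for a single variant (about 55%) is larger than the 44.5% gap first attributed to context hygiene, which is the basis for the N=1 caveat in b46dca003c.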