agent-orchestrator-benchmark

4 Commits 1 Branch 0 Tags

Author	SHA1	Message	Date
mfowler	a1b59e1bc5	feat: add stateless variant, pre-trust work dirs, loop over 3 variants - bump engine submodule to 985d33d (adds builder-adversary-stateless example) - run-harness-bench.sh: pre-trust each work dir in ~/.claude.json so interactive claude (tmux) skips the workspace-trust dialog (--dangerously-skip-permissions only skips it for redirected/headless output); benchmark all three variants - (fixes from this session: bare repo default branch → main; unique session prefix per run) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 20:52:29 +00:00
mfowler	11eda4a8b1	chore: gitignore the runner's transient .tmp file	2026-06-14 20:40:26 +00:00
mfowler	8c3f38dbf4	feat: multi-phase calculator problem + full-harness benchmark runner - plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial edge cases (precedence/associativity/unary/div-zero) - run-harness-bench.sh: stands up a real agents.py up Builder/Adversary loop pair + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and clocks tokens from the session transcripts (AI-as-adversary kept intact) - RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token effect; cache-read of the working context dominates) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 20:40:14 +00:00
mfowler	27df2c7b55	feat: agent-orchestrator-benchmark — prompt token comparison harness. A standalone repo (engine vendored as a submodule at the examples commit) that runs a head-to-head between the builder-adversary and builder-adversary-min example variants: same task, independent headless runs, both on Sonnet, with token counts. Includes the roman-numeral test problem and run-bench.sh. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 20:20:05 +00:00