agent-orchestrator-benchmark

recipe-maintainers/agent-orchestrator-benchmark


mfowler 8c3f38dbf4 feat: multi-phase calculator problem + full-harness benchmark runner

- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per
  phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial
  edge cases (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real agents.py up Builder/Adversary loop pair
  + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and
  clocks tokens from the session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token
  effect; cache-read of the working context dominates)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-14 20:40:14 +00:00

| Name | Last commit | Date |
|---|---|---|
| calc | feat: multi-phase calculator problem + full-harness benchmark runner | 2026-06-14 20:40:14 +00:00 |
| roman.md | feat: agent-orchestrator-benchmark — prompt token comparison harness | 2026-06-14 20:20:05 +00:00 |