recipe-maintainers/agent-orchestrator-benchmark

Files

mfowler 8c3f38dbf4 feat: multi-phase calculator problem + full-harness benchmark runner

- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per
  phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial
  edge cases (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real agents.py up Builder/Adversary loop pair
  + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and
  clocks tokens from the session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token
  effect; cache-read of the working context dominates)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-14 20:40:14 +00:00

2.0 KiB

Raw Permalink Blame History

Phase `eval` — evaluator + CLI

Mission. Build calc/evaluator.py exposing evaluate(node) -> int | float (walking the parse phase's AST) and a top-level calc.py CLI, plus a unittest suite. SSOT for this phase. This phase makes the calculator end-to-end: string → tokens → AST → number.

Definition of Done (each Dn is a gate)

D1 — arithmetic. evaluate(parse(tokenize(s))) is correct for + - * /, precedence, parens, and unary minus: "2+3*4"→14, "(2+3)*4"→20, "8-3-2"→3, "-2+5"→3, "2*-3"→-6.
D2 — division. / is true division ("7/2"→3.5). Division by zero raises EvalError (define it) — not a bare ZeroDivisionError escaping the API.
D3 — result type. Whole-valued results print without a trailing .0 ("4/2"→2), non-whole as a float ("7/2"→3.5). Document the rule and keep it consistent.
D4 — CLI. python calc.py "2+3*4" prints 14 and exits 0; an invalid expression (python calc.py "1 +") prints an error to stderr and exits non-zero (no traceback).
D5 — tests green + end-to-end. calc/test_evaluator.py (unittest) passes under python -m unittest, 0 failures, covering D1–D3; and a CLI check covers D4. The whole prior suite (lex + parse) must still pass (no regression).

Verify (cold)

python -m unittest -q                              # D5 (whole suite, all phases)
python calc.py "2+3*4"     # 14
python calc.py "(2+3)*4"   # 20
python calc.py "7/2"       # 3.5
python calc.py "4/2"       # 2
python calc.py "1/0"       # error to stderr, non-zero exit
python calc.py "1 +"       # error to stderr, non-zero exit

The Builder restates exact commands + expected outputs + commit sha in machine-docs/STATUS-eval.md. The Adversary cold-verifies and records eval/Dn: PASS|FAIL in machine-docs/REVIEW-eval.md. When every gate in every phase has a fresh PASS and there is no ## VETO, the Builder writes ## DONE to machine-docs/STATUS-eval.md (this is the last phase → the harness then marks the sequence complete).

2.0 KiB Raw Permalink Blame History Unescape Escape

Phase eval — evaluator + CLI

Definition of Done (each Dn is a gate)

Verify (cold)

2.0 KiB

Raw Permalink Blame History

Phase `eval` — evaluator + CLI