recipe-maintainers/agent-orchestrator-benchmark

mfowler 8c3f38dbf4 feat: multi-phase calculator problem + full-harness benchmark runner

- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per
  phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial
  edge cases (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real agents.py up Builder/Adversary loop pair
  + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and
  clocks tokens from the session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token
  effect; cache-read of the working context dominates)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-14 20:40:14 +00:00

Phase `parse` — recursive-descent parser

Mission. Build calc/parser.py exposing parse(tokens) -> Node, which consumes the lex phase's tokens and produces an AST with correct arithmetic precedence and associativity, plus a unittest suite. This plan is the SSOT (single source of truth) for this phase. Do NOT evaluate yet; just build the tree (the eval phase consumes it). Represent nodes however you like (e.g. Num(value), BinOp(op, left, right), and Unary(op, operand)), but expose a stable, documented shape the evaluator can walk.
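One possible node shape, as a minimal sketch only. The plan leaves the representation open, so the frozen dataclasses and the `shape` helper below are illustrative assumptions, not a required design. The helper renders a tree as a fully parenthesised prefix string, which gives tests something structural to compare against.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Num:
    value: Union[int, float]


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Unary:
    op: str
    operand: "Node"


Node = Union[Num, BinOp, Unary]


def shape(node: Node) -> str:
    """Render a tree as a fully parenthesised prefix string (hypothetical helper)."""
    if isinstance(node, Num):
        return str(node.value)
    if isinstance(node, Unary):
        return f"({node.op} {shape(node.operand)})"
    if isinstance(node, BinOp):
        return f"({node.op} {shape(node.left)} {shape(node.right)})"
    raise TypeError(f"not an AST node: {node!r}")


# 1+2*3 with the multiplication nested under the addition.
tree = BinOp("+", Num(1), BinOp("*", Num(2), Num(3)))
print(shape(tree))  # (+ 1 (* 2 3))
```

Frozen dataclasses compare by value, so a test can also assert `parse(...) == BinOp(...)` directly instead of going through a string.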

Definition of Done (each Dn is a gate)

D1 — precedence. * and / bind tighter than + and -: 1+2*3 parses as 1+(2*3), not (1+2)*3.
D2 — left associativity. Same-precedence operators associate left: 8-3-2 parses as (8-3)-2; 8/4/2 as (8/4)/2.
D3 — parentheses. Parens override precedence: (1+2)*3 parses with the + under the *.
D4 — unary minus. Leading and nested unary minus parses: -5, -(1+2), 3 * -2.
D5 — errors. Malformed input raises ParseError (define it): "1 +", "(1", "1 2", ")(", and the empty string must each raise ParseError rather than crash with a different exception.
D6 — tests green. calc/test_parser.py (unittest) passes under python -m unittest, 0 failures, covering D1–D5. Assert on tree structure (e.g. a repr/shape helper), not on evaluation.
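The gates above map fairly directly onto a layered grammar. The sketch below is one way to satisfy D1 to D5, not the plan's prescribed implementation. It assumes tokens are plain strings such as `['1', '+', '2']`; the real lex phase's token type is not specified here, so adapt `peek` and `atom` to whatever it emits. The grammar in the docstring and the nested-function layout are likewise assumptions.

```python
from dataclasses import dataclass


class ParseError(Exception):
    """Raised for any malformed token stream (gate D5)."""


@dataclass(frozen=True)
class Num:
    value: float


@dataclass(frozen=True)
class BinOp:
    op: str
    left: object
    right: object


@dataclass(frozen=True)
class Unary:
    op: str
    operand: object


def parse(tokens):
    """Parse a list of token strings into an AST.

    Assumed grammar:
        expr  := term (('+' | '-') term)*
        term  := unary (('*' | '/') unary)*
        unary := '-' unary | atom
        atom  := NUMBER | '(' expr ')'
    """
    tokens = list(tokens)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():
        node = term()  # term() runs first, so * and / bind tighter (D1)
        while peek() in ("+", "-"):
            op = advance()
            node = BinOp(op, node, term())  # fold leftwards (D2)
        return node

    def term():
        node = unary()
        while peek() in ("*", "/"):
            op = advance()
            node = BinOp(op, node, unary())
        return node

    def unary():
        if peek() == "-":  # leading or nested unary minus (D4)
            advance()
            return Unary("-", unary())
        return atom()

    def atom():
        tok = advance()
        if tok is None:
            raise ParseError("unexpected end of input")
        if tok == "(":  # parentheses restart at the lowest precedence (D3)
            node = expr()
            if advance() != ")":
                raise ParseError("expected ')'")
            return node
        try:
            return Num(float(tok))
        except ValueError:
            raise ParseError(f"unexpected token {tok!r}") from None

    tree = expr()
    if peek() is not None:  # leftover tokens, e.g. "1 2"
        raise ParseError(f"unexpected trailing token {peek()!r}")
    return tree
```

The `while` loops are what make same-precedence operators left-associative: each new operator wraps the tree built so far as its left child. Writing `expr` recursively on the right-hand side instead would silently produce right-associative trees, which is exactly the kind of bug D2 is meant to catch.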

Verify (cold)

python -m unittest -q                              # D6
# D1/D3 differ in structure — the Builder's STATUS gives the exact shape assertion to re-run:
python -c "from calc.lexer import tokenize; from calc.parser import parse; print(parse(tokenize('1+2*3')))"
python -c "from calc.lexer import tokenize; from calc.parser import parse; parse(tokenize('1 +'))"  # ParseError

The Builder documents the AST shape + exact assertions in machine-docs/STATUS-parse.md; the Adversary cold-verifies and records parse/Dn: PASS|FAIL in machine-docs/REVIEW-parse.md. Watch especially for a precedence/associativity bug that still passes a weak test — re-derive the expected tree yourself from the plan.
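To illustrate the "weak test" risk, here is a small sketch using hand-built trees rather than a real parser. The tuple encoding and the helper names (`root_op`, `leaves`, `shape`) are hypothetical and exist only for this example. It shows two trees for 8-3-2 that a shallow check cannot distinguish, and a shape assertion that can.

```python
import unittest

# Hypothetical encoding for this sketch: (op, left, right) or a bare number.
CORRECT = ("-", ("-", 8, 3), 2)  # (8-3)-2, as D2 requires
BUGGY = ("-", 8, ("-", 3, 2))    # 8-(3-2), a right-associative bug


def root_op(tree):
    """Operator at the root of the tree."""
    return tree[0]


def leaves(tree):
    """Operands in left-to-right order, ignoring nesting."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[1]) + leaves(tree[2])


def shape(tree):
    """Fully parenthesised infix rendering that exposes the nesting."""
    if not isinstance(tree, tuple):
        return str(tree)
    op, left, right = tree
    return f"({shape(left)}{op}{shape(right)})"


class WeakVersusStrong(unittest.TestCase):
    def test_weak_checks_cannot_tell_the_trees_apart(self):
        # Both trees have the same root operator and operand order.
        for tree in (CORRECT, BUGGY):
            self.assertEqual(root_op(tree), "-")
            self.assertEqual(leaves(tree), [8, 3, 2])

    def test_shape_assertion_catches_the_bug(self):
        self.assertEqual(shape(CORRECT), "((8-3)-2)")
        self.assertNotEqual(shape(BUGGY), "((8-3)-2)")
```

Saved as a test module, this runs under `python -m unittest`. The same idea applies to inputs such as 1+2+3, where either grouping would evaluate to the same number, so only a structural assertion exposes the difference.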
