feat: multi-phase calculator problem + full-harness benchmark runner

- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial edge cases (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real agents.py up Builder/Adversary loop pair + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and clocks tokens from the session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token effect; cache-read of the working context dominates)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:40:14 +00:00
parent 27df2c7b55
commit 8c3f38dbf4
6 changed files with 365 additions and 0 deletions
--- a/plans/calc/parse.md
+++ b/plans/calc/parse.md
@@ -0,0 +1,34 @@
+# Phase `parse` — recursive-descent parser
+
+**Mission.** Build `calc/parser.py` exposing `parse(tokens) -> Node` (consuming the `lex` phase's
+tokens) that produces an **AST** with correct arithmetic precedence and associativity, plus a
+`unittest` suite. SSOT for this phase. Do NOT evaluate yet — just build the tree (the `eval` phase
+consumes it). Represent nodes however you like (e.g. `Num(value)` and `BinOp(op, left, right)`,
+`Unary(op, operand)`), but expose a stable, documented shape the evaluator can walk.
+
+## Definition of Done (each Dn is a gate)
+
+- **D1 — precedence.** `*` and `/` bind tighter than `+` and `-`: `1+2*3` parses as `1+(2*3)`, not
+  `(1+2)*3`.
+- **D2 — left associativity.** Same-precedence operators associate left: `8-3-2` parses as
+  `(8-3)-2`; `8/4/2` as `(8/4)/2`.
+- **D3 — parentheses.** Parens override precedence: `(1+2)*3` parses with the `+` under the `*`.
+- **D4 — unary minus.** Leading and nested unary minus parses: `-5`, `-(1+2)`, `3 * -2`.
+- **D5 — errors.** Malformed input raises `ParseError` (define it): `"1 +"`, `"(1"`, `"1 2"`, `")("`,
+  and the empty string each raise (not crash with a different exception).
+- **D6 — tests green.** `calc/test_parser.py` (`unittest`) passes under `python -m unittest`, 0
+  failures, covering D1–D5. Assert on tree structure (e.g. a `repr`/shape helper), not on evaluation.
+
+## Verify (cold)
+
+```bash
+python -m unittest -q                              # D6
+# D1/D3 differ in structure — the Builder's STATUS gives the exact shape assertion to re-run:
+python -c "from calc.lexer import tokenize; from calc.parser import parse; print(parse(tokenize('1+2*3')))"
+python -c "from calc.lexer import tokenize; from calc.parser import parse; parse(tokenize('1 +'))"  # ParseError
+```
+
+The Builder documents the AST shape + exact assertions in `machine-docs/STATUS-parse.md`; the
+Adversary cold-verifies and records `parse/Dn: PASS|FAIL` in `machine-docs/REVIEW-parse.md`. Watch
+especially for a precedence/associativity bug that still passes a weak test — re-derive the expected
+tree yourself from the plan.
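For orientation, gates D1–D5 above could be met by a parser along the following lines. This is a minimal sketch, not the committed implementation: the token shape (floats for numbers, one-character strings for operators and parentheses) and the stand-in `tokenize` are assumptions, since the `lex` phase's actual output is not shown in this diff. The node names follow the examples suggested in the plan.

```python
import re
from dataclasses import dataclass


class ParseError(Exception):
    """Raised for any malformed token sequence (gate D5)."""


@dataclass(frozen=True)
class Num:
    value: float


@dataclass(frozen=True)
class Unary:
    op: str
    operand: object


@dataclass(frozen=True)
class BinOp:
    op: str
    left: object
    right: object


def tokenize(text):
    # Hypothetical stand-in for the lex phase: numbers become floats,
    # every other non-space character becomes a one-character string.
    pairs = re.findall(r"\s*(?:(\d+(?:\.\d+)?)|(\S))", text)
    return [float(number) if number else symbol for number, symbol in pairs]


def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():  # expr := term (('+' | '-') term)*
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = BinOp(op, node, term())  # folding leftwards gives D2
        return node

    def term():  # term := unary (('*' | '/') unary)*
        node = unary()
        while peek() in ("*", "/"):
            op = advance()
            node = BinOp(op, node, unary())
        return node

    def unary():  # unary := '-' unary | atom
        if peek() == "-":
            advance()
            return Unary("-", unary())
        return atom()

    def atom():  # atom := NUMBER | '(' expr ')'
        tok = advance()
        if isinstance(tok, float):
            return Num(tok)
        if tok == "(":
            node = expr()
            if advance() != ")":
                raise ParseError("expected ')'")
            return node
        raise ParseError(f"unexpected token: {tok!r}")

    tree = expr()
    if pos != len(tokens):
        raise ParseError(f"unexpected trailing token: {tokens[pos]!r}")
    return tree
```

Precedence (D1) falls out of the grammar layering: `expr` calls `term`, so `*` and `/` are grouped before `+` and `-` see their operands. Because the nodes are frozen dataclasses, they compare structurally, which is one way to assert on tree shape without evaluating anything, as D6 requires.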