feat: multi-phase calculator problem + full-harness benchmark runner

- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per
  phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial
  edge cases (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real `agents.py up` Builder/Adversary loop pair
  + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and
  clocks tokens from the session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token
  effect; cache-read of the working context dominates)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:40:14 +00:00
parent 27df2c7b55
commit 8c3f38dbf4
6 changed files with 365 additions and 0 deletions

plans/calc/parse.md Normal file

@ -0,0 +1,34 @@
# Phase `parse` — recursive-descent parser
**Mission.** Build `calc/parser.py` exposing `parse(tokens) -> Node` (consuming the `lex` phase's
tokens) that produces an **AST** with correct arithmetic precedence and associativity, plus a
`unittest` suite. SSOT for this phase. Do NOT evaluate yet — just build the tree (the `eval` phase
consumes it). Represent nodes however you like (e.g. `Num(value)` and `BinOp(op, left, right)`,
`Unary(op, operand)`), but expose a stable, documented shape the evaluator can walk.
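One possible node shape, sketched under the assumption that frozen dataclasses are acceptable (the plan leaves the representation open). The `shape` helper is a hypothetical name, not something the plan requires; it renders a tree as a fully parenthesised string so that structure can be compared in tests.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Num:
    value: float


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Unary:
    op: str
    operand: "Node"


Node = Union[Num, BinOp, Unary]


def shape(node: Node) -> str:
    """Render a tree as a fully parenthesised string (hypothetical helper)."""
    if isinstance(node, Num):
        return f"{node.value:g}"
    if isinstance(node, Unary):
        return f"({node.op}{shape(node.operand)})"
    return f"({shape(node.left)}{node.op}{shape(node.right)})"


tree = BinOp("+", Num(1), BinOp("*", Num(2), Num(3)))
print(shape(tree))  # (1+(2*3))
```

Frozen dataclasses give value equality for free, so a test can compare a parsed tree directly against a hand-built one.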
## Definition of Done (each Dn is a gate)
- **D1 — precedence.** `*` and `/` bind tighter than `+` and `-`: `1+2*3` parses as `1+(2*3)`, not
`(1+2)*3`.
- **D2 — left associativity.** Same-precedence operators associate left: `8-3-2` parses as
`(8-3)-2`; `8/4/2` as `(8/4)/2`.
- **D3 — parentheses.** Parens override precedence: `(1+2)*3` parses with the `+` under the `*`.
- **D4 — unary minus.** Leading and nested unary minus parses: `-5`, `-(1+2)`, `3 * -2`.
- **D5 — errors.** Malformed input raises `ParseError` (define it): `"1 +"`, `"(1"`, `"1 2"`, `")("`,
and the empty string each raise `ParseError` rather than crashing with a different exception.
- **D6 — tests green.** `calc/test_parser.py` (`unittest`) passes under `python -m unittest`, 0
failures, covering D1-D5. Assert on tree structure (e.g. a `repr`/shape helper), not on evaluation.
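A minimal sketch of how gates D1 through D5 might be met with one function per precedence level. It is not the required implementation. Two assumptions are made for illustration: tokens are plain strings, and the `tokenize` below is a stand-in for the `lex` phase's real lexer, whose token shape this plan does not specify.

```python
import re
from dataclasses import dataclass


class ParseError(Exception):
    """Raised for any malformed token stream (D5)."""


@dataclass(frozen=True)
class Num:
    value: float


@dataclass(frozen=True)
class BinOp:
    op: str
    left: object
    right: object


@dataclass(frozen=True)
class Unary:
    op: str
    operand: object


def tokenize(src):
    # Stand-in lexer (assumption): the real one comes from the `lex` phase.
    tokens = re.findall(r"\d+(?:\.\d+)?|[-+*/()]", src)
    if "".join(tokens) != "".join(src.split()):
        raise ValueError(f"unexpected character in {src!r}")
    return tokens


def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():  # expr := term (('+' | '-') term)*
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = BinOp(op, node, term())  # fold left: (a-b)-c (D2)
        return node

    def term():  # term := unary (('*' | '/') unary)*  (binds tighter, D1)
        node = unary()
        while peek() in ("*", "/"):
            op = advance()
            node = BinOp(op, node, unary())
        return node

    def unary():  # unary := '-' unary | atom  (D4)
        if peek() == "-":
            advance()
            return Unary("-", unary())
        return atom()

    def atom():  # atom := NUMBER | '(' expr ')'  (D3)
        tok = advance()
        if tok is None:
            raise ParseError("unexpected end of input")
        if tok == "(":
            node = expr()
            if advance() != ")":
                raise ParseError("expected ')'")
            return node
        if tok[0].isdigit():
            return Num(float(tok))
        raise ParseError(f"unexpected token {tok!r}")

    tree = expr()
    if peek() is not None:  # leftover tokens, e.g. "1 2"
        raise ParseError(f"unexpected trailing token {peek()!r}")
    return tree
```

Left associativity comes from the `while` loops that fold each new operand into the tree built so far. Writing `expr` recursively on its right operand instead would produce `8-(3-2)`.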
## Verify (cold)
```bash
python -m unittest -q # D6
# D1/D3 differ in structure — the Builder's STATUS gives the exact shape assertion to re-run:
python -c "from calc.lexer import tokenize; from calc.parser import parse; print(parse(tokenize('1+2*3')))"
python -c "from calc.lexer import tokenize; from calc.parser import parse; parse(tokenize('1 +'))" # ParseError
```
The Builder documents the AST shape + exact assertions in `machine-docs/STATUS-parse.md`; the
Adversary cold-verifies and records `parse/Dn: PASS|FAIL` in `machine-docs/REVIEW-parse.md`. Watch
especially for a precedence/associativity bug that still passes a weak test — re-derive the expected
tree yourself from the plan.
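To illustrate the kind of weak test the Adversary should watch for, the sketch below builds two candidate trees for `8-3-2` by hand and runs two checks on them. The node classes and check names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Num:
    value: float


@dataclass(frozen=True)
class BinOp:
    op: str
    left: object
    right: object


# Two candidate trees for `8-3-2`.
left_assoc = BinOp("-", BinOp("-", Num(8), Num(3)), Num(2))   # (8-3)-2, what D2 requires
right_assoc = BinOp("-", Num(8), BinOp("-", Num(3), Num(2)))  # 8-(3-2), the bug


def weak_check(tree):
    # Only looks at the root, so it cannot tell the two trees apart.
    return isinstance(tree, BinOp) and tree.op == "-"


def strong_check(tree):
    # Compares the whole structure against the tree derived by hand from the plan.
    return tree == left_assoc


print(weak_check(left_assoc), weak_check(right_assoc))      # True True
print(strong_check(left_assoc), strong_check(right_assoc))  # True False
```

A root-only assertion accepts the right-associative tree, so a suite built from such checks could go green while D2 is violated. Comparing the full structure catches it.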