feat: multi-phase calculator problem + full-harness benchmark runner

- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial edge cases (precedence/associativity/unary/div-zero) - run-harness-bench.sh: stands up a real agents.py up Builder/Adversary loop pair + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and clocks tokens from the session transcripts (AI-as-adversary kept intact) - RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token effect; cache-read of the working context dominates) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:40:14 +00:00
parent 27df2c7b55
commit 8c3f38dbf4
6 changed files with 365 additions and 0 deletions
--- a/plans/calc/eval.md
+++ b/plans/calc/eval.md
@ -0,0 +1,36 @@
+# Phase `eval` — evaluator + CLI
+
+**Mission.** Build `calc/evaluator.py` exposing `evaluate(node) -> int | float` (walking the `parse`
+phase's AST) and a top-level `calc.py` CLI, plus a `unittest` suite. SSOT for this phase. This phase
+makes the calculator end-to-end: string → tokens → AST → number.
+
+## Definition of Done (each Dn is a gate)
+
+- **D1 — arithmetic.** `evaluate(parse(tokenize(s)))` is correct for `+ - * /`, precedence, parens,
+  and unary minus: `"2+3*4"`→14, `"(2+3)*4"`→20, `"8-3-2"`→3, `"-2+5"`→3, `"2*-3"`→-6.
+- **D2 — division.** `/` is true division (`"7/2"`→3.5). Division by zero raises `EvalError` (define
+  it) — not a bare `ZeroDivisionError` escaping the API.
+- **D3 — result type.** Whole-valued results print without a trailing `.0` (`"4/2"`→`2`), non-whole
+  as a float (`"7/2"`→`3.5`). Document the rule and keep it consistent.
+- **D4 — CLI.** `python calc.py "2+3*4"` prints `14` and exits 0; an invalid expression
+  (`python calc.py "1 +"`) prints an error to stderr and exits non-zero (no traceback).
+- **D5 — tests green + end-to-end.** `calc/test_evaluator.py` (`unittest`) passes under
+  `python -m unittest`, 0 failures, covering D1–D3; and a CLI check covers D4. The whole prior suite
+  (lex + parse) must still pass (no regression).
+
+## Verify (cold)
+
+```bash
+python -m unittest -q                              # D5 (whole suite, all phases)
+python calc.py "2+3*4"     # 14
+python calc.py "(2+3)*4"   # 20
+python calc.py "7/2"       # 3.5
+python calc.py "4/2"       # 2
+python calc.py "1/0"       # error to stderr, non-zero exit
+python calc.py "1 +"       # error to stderr, non-zero exit
+```
+
+The Builder restates exact commands + expected outputs + commit sha in `machine-docs/STATUS-eval.md`.
+The Adversary cold-verifies and records `eval/Dn: PASS|FAIL` in `machine-docs/REVIEW-eval.md`. When
+every gate in every phase has a fresh PASS and there is no `## VETO`, the Builder writes `## DONE` to
+`machine-docs/STATUS-eval.md` (this is the last phase → the harness then marks the sequence complete).
--- a/plans/calc/lex.md
+++ b/plans/calc/lex.md
@ -0,0 +1,34 @@
+# Phase `lex` — tokenizer
+
+**Mission.** Start a Python arithmetic calculator. In this phase build the **lexer**: `calc/lexer.py`
+exposing `tokenize(src: str) -> list[Token]`, plus a `unittest` suite. Pure stdlib. This file is the
+single source of truth for the phase. (Later phases add the parser and evaluator — design the Token
+type so they can consume it.)
+
+A `Token` has at least a `kind` and a `value`. Kinds: `NUMBER`, `PLUS`, `MINUS`, `STAR`, `SLASH`,
+`LPAREN`, `RPAREN`, and `EOF` as the final token.
+
+## Definition of Done (each Dn is a gate: Builder claims it, Adversary cold-verifies)
+
+- **D1 — numbers.** Integers (`42`) and floats (`3.14`, `.5`, `10.`) tokenize to one `NUMBER` token
+  whose value is the numeric value (int or float). `tokenize("42")` → `[NUMBER(42), EOF]`.
+- **D2 — operators & parens.** `+ - * / ( )` each tokenize to the right kind; `tokenize("1+2*3")`
+  yields `NUMBER PLUS NUMBER STAR NUMBER EOF`.
+- **D3 — whitespace & errors.** Spaces/tabs between tokens are skipped; an invalid character (e.g.
+  `@`, `$`, a letter) raises `LexError` (define it in the module) with the offending character and
+  its position in the message.
+- **D4 — tests green.** `calc/test_lexer.py` (`unittest`) passes under `python -m unittest`, 0
+  failures, covering D1–D3 including: `"  12  +  3 "`, `"3.5*(1-2)"`, and that `"1 @ 2"` raises
+  `LexError`.
+
+## Verify (cold)
+
+```bash
+python -m unittest -q                              # D4
+python -c "from calc.lexer import tokenize; print([(t.kind,t.value) for t in tokenize('3.5*(1-2)')])"
+python -c "from calc.lexer import tokenize; tokenize('1 @ 2')"   # must raise LexError
+```
+
+The Builder restates the exact commands + expected token lists + commit sha in
+`machine-docs/STATUS-lex.md`; the Adversary re-runs from its own clone and records
+`lex/Dn: PASS|FAIL` in `machine-docs/REVIEW-lex.md`.
--- a/plans/calc/parse.md
+++ b/plans/calc/parse.md
@ -0,0 +1,34 @@
+# Phase `parse` — recursive-descent parser
+
+**Mission.** Build `calc/parser.py` exposing `parse(tokens) -> Node` (consuming the `lex` phase's
+tokens) that produces an **AST** with correct arithmetic precedence and associativity, plus a
+`unittest` suite. SSOT for this phase. Do NOT evaluate yet — just build the tree (the `eval` phase
+consumes it). Represent nodes however you like (e.g. `Num(value)` and `BinOp(op, left, right)`,
+`Unary(op, operand)`), but expose a stable, documented shape the evaluator can walk.
+
+## Definition of Done (each Dn is a gate)
+
+- **D1 — precedence.** `*` and `/` bind tighter than `+` and `-`: `1+2*3` parses as `1+(2*3)`, not
+  `(1+2)*3`.
+- **D2 — left associativity.** Same-precedence operators associate left: `8-3-2` parses as
+  `(8-3)-2`; `8/4/2` as `(8/4)/2`.
+- **D3 — parentheses.** Parens override precedence: `(1+2)*3` parses with the `+` under the `*`.
+- **D4 — unary minus.** Leading and nested unary minus parses: `-5`, `-(1+2)`, `3 * -2`.
+- **D5 — errors.** Malformed input raises `ParseError` (define it): `"1 +"`, `"(1"`, `"1 2"`, `")("`,
+  and the empty string each raise (not crash with a different exception).
+- **D6 — tests green.** `calc/test_parser.py` (`unittest`) passes under `python -m unittest`, 0
+  failures, covering D1–D5. Assert on tree structure (e.g. a `repr`/shape helper), not on evaluation.
+
+## Verify (cold)
+
+```bash
+python -m unittest -q                              # D6
+# D1/D3 differ in structure — the Builder's STATUS gives the exact shape assertion to re-run:
+python -c "from calc.lexer import tokenize; from calc.parser import parse; print(parse(tokenize('1+2*3')))"
+python -c "from calc.lexer import tokenize; from calc.parser import parse; parse(tokenize('1 +'))"  # ParseError
+```
+
+The Builder documents the AST shape + exact assertions in `machine-docs/STATUS-parse.md`; the
+Adversary cold-verifies and records `parse/Dn: PASS|FAIL` in `machine-docs/REVIEW-parse.md`. Watch
+especially for a precedence/associativity bug that still passes a weak test — re-derive the expected
+tree yourself from the plan.