feat: multi-phase calculator problem + full-harness benchmark runner

- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per
  phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial
  edge cases (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real agents.py up Builder/Adversary loop pair
  + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and
  clocks tokens from the session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token
  effect; cache-read of the working context dominates)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-14 20:40:14 +00:00
parent 27df2c7b55
commit 8c3f38dbf4
6 changed files with 365 additions and 0 deletions

36
plans/calc/eval.md Normal file
View File

@ -0,0 +1,36 @@
# Phase `eval` — evaluator + CLI
**Mission.** Build `calc/evaluator.py` exposing `evaluate(node) -> int | float` (walking the `parse`
phase's AST) and a top-level `calc.py` CLI, plus a `unittest` suite. SSOT for this phase. This phase
makes the calculator end-to-end: string → tokens → AST → number.
## Definition of Done (each Dn is a gate)
- **D1 — arithmetic.** `evaluate(parse(tokenize(s)))` is correct for `+ - * /`, precedence, parens,
and unary minus: `"2+3*4"`→14, `"(2+3)*4"`→20, `"8-3-2"`→3, `"-2+5"`→3, `"2*-3"`→-6.
- **D2 — division.** `/` is true division (`"7/2"`→3.5). Division by zero raises `EvalError` (define
it) — not a bare `ZeroDivisionError` escaping the API.
- **D3 — result type.** Whole-valued results print without a trailing `.0` (`"4/2"``2`), non-whole
as a float (`"7/2"``3.5`). Document the rule and keep it consistent.
- **D4 — CLI.** `python calc.py "2+3*4"` prints `14` and exits 0; an invalid expression
(`python calc.py "1 +"`) prints an error to stderr and exits non-zero (no traceback).
- **D5 — tests green + end-to-end.** `calc/test_evaluator.py` (`unittest`) passes under
`python -m unittest`, 0 failures, covering D1D3; and a CLI check covers D4. The whole prior suite
(lex + parse) must still pass (no regression).
## Verify (cold)
```bash
python -m unittest -q # D5 (whole suite, all phases)
python calc.py "2+3*4" # 14
python calc.py "(2+3)*4" # 20
python calc.py "7/2" # 3.5
python calc.py "4/2" # 2
python calc.py "1/0" # error to stderr, non-zero exit
python calc.py "1 +" # error to stderr, non-zero exit
```
The Builder restates exact commands + expected outputs + commit sha in `machine-docs/STATUS-eval.md`.
The Adversary cold-verifies and records `eval/Dn: PASS|FAIL` in `machine-docs/REVIEW-eval.md`. When
every gate in every phase has a fresh PASS and there is no `## VETO`, the Builder writes `## DONE` to
`machine-docs/STATUS-eval.md` (this is the last phase → the harness then marks the sequence complete).

34
plans/calc/lex.md Normal file
View File

@ -0,0 +1,34 @@
# Phase `lex` — tokenizer
**Mission.** Start a Python arithmetic calculator. In this phase build the **lexer**: `calc/lexer.py`
exposing `tokenize(src: str) -> list[Token]`, plus a `unittest` suite. Pure stdlib. This file is the
single source of truth for the phase. (Later phases add the parser and evaluator — design the Token
type so they can consume it.)
A `Token` has at least a `kind` and a `value`. Kinds: `NUMBER`, `PLUS`, `MINUS`, `STAR`, `SLASH`,
`LPAREN`, `RPAREN`, and `EOF` as the final token.
## Definition of Done (each Dn is a gate: Builder claims it, Adversary cold-verifies)
- **D1 — numbers.** Integers (`42`) and floats (`3.14`, `.5`, `10.`) tokenize to one `NUMBER` token
whose value is the numeric value (int or float). `tokenize("42")``[NUMBER(42), EOF]`.
- **D2 — operators & parens.** `+ - * / ( )` each tokenize to the right kind; `tokenize("1+2*3")`
yields `NUMBER PLUS NUMBER STAR NUMBER EOF`.
- **D3 — whitespace & errors.** Spaces/tabs between tokens are skipped; an invalid character (e.g.
`@`, `$`, a letter) raises `LexError` (define it in the module) with the offending character and
its position in the message.
- **D4 — tests green.** `calc/test_lexer.py` (`unittest`) passes under `python -m unittest`, 0
failures, covering D1D3 including: `" 12 + 3 "`, `"3.5*(1-2)"`, and that `"1 @ 2"` raises
`LexError`.
## Verify (cold)
```bash
python -m unittest -q # D4
python -c "from calc.lexer import tokenize; print([(t.kind,t.value) for t in tokenize('3.5*(1-2)')])"
python -c "from calc.lexer import tokenize; tokenize('1 @ 2')" # must raise LexError
```
The Builder restates the exact commands + expected token lists + commit sha in
`machine-docs/STATUS-lex.md`; the Adversary re-runs from its own clone and records
`lex/Dn: PASS|FAIL` in `machine-docs/REVIEW-lex.md`.

34
plans/calc/parse.md Normal file
View File

@ -0,0 +1,34 @@
# Phase `parse` — recursive-descent parser
**Mission.** Build `calc/parser.py` exposing `parse(tokens) -> Node` (consuming the `lex` phase's
tokens) that produces an **AST** with correct arithmetic precedence and associativity, plus a
`unittest` suite. SSOT for this phase. Do NOT evaluate yet — just build the tree (the `eval` phase
consumes it). Represent nodes however you like (e.g. `Num(value)` and `BinOp(op, left, right)`,
`Unary(op, operand)`), but expose a stable, documented shape the evaluator can walk.
## Definition of Done (each Dn is a gate)
- **D1 — precedence.** `*` and `/` bind tighter than `+` and `-`: `1+2*3` parses as `1+(2*3)`, not
`(1+2)*3`.
- **D2 — left associativity.** Same-precedence operators associate left: `8-3-2` parses as
`(8-3)-2`; `8/4/2` as `(8/4)/2`.
- **D3 — parentheses.** Parens override precedence: `(1+2)*3` parses with the `+` under the `*`.
- **D4 — unary minus.** Leading and nested unary minus parses: `-5`, `-(1+2)`, `3 * -2`.
- **D5 — errors.** Malformed input raises `ParseError` (define it): `"1 +"`, `"(1"`, `"1 2"`, `")("`,
and the empty string each raise (not crash with a different exception).
- **D6 — tests green.** `calc/test_parser.py` (`unittest`) passes under `python -m unittest`, 0
failures, covering D1D5. Assert on tree structure (e.g. a `repr`/shape helper), not on evaluation.
## Verify (cold)
```bash
python -m unittest -q # D6
# D1/D3 differ in structure — the Builder's STATUS gives the exact shape assertion to re-run:
python -c "from calc.lexer import tokenize; from calc.parser import parse; print(parse(tokenize('1+2*3')))"
python -c "from calc.lexer import tokenize; from calc.parser import parse; parse(tokenize('1 +'))" # ParseError
```
The Builder documents the AST shape + exact assertions in `machine-docs/STATUS-parse.md`; the
Adversary cold-verifies and records `parse/Dn: PASS|FAIL` in `machine-docs/REVIEW-parse.md`. Watch
especially for a precedence/associativity bug that still passes a weak test — re-derive the expected
tree yourself from the plan.