Files
agent-orchestrator-benchmark/plans/calc/eval.md
mfowler 8c3f38dbf4 feat: multi-phase calculator problem + full-harness benchmark runner
- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per
  phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial
  edge cases (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real agents.py up Builder/Adversary loop pair
  + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and
  clocks tokens from the session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token
  effect; cache-read of the working context dominates)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:40:14 +00:00

37 lines
2.0 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Phase `eval` — evaluator + CLI
**Mission.** Build `calc/evaluator.py` exposing `evaluate(node) -> int | float` (walking the `parse`
phase's AST) and a top-level `calc.py` CLI, plus a `unittest` suite. SSOT for this phase. This phase
makes the calculator end-to-end: string → tokens → AST → number.
## Definition of Done (each Dn is a gate)
- **D1 — arithmetic.** `evaluate(parse(tokenize(s)))` is correct for `+ - * /`, precedence, parens,
and unary minus: `"2+3*4"`→14, `"(2+3)*4"`→20, `"8-3-2"`→3, `"-2+5"`→3, `"2*-3"`→-6.
- **D2 — division.** `/` is true division (`"7/2"`→3.5). Division by zero raises `EvalError` (define
it) — not a bare `ZeroDivisionError` escaping the API.
- **D3 — result type.** Whole-valued results print without a trailing `.0` (`"4/2"``2`), non-whole
as a float (`"7/2"``3.5`). Document the rule and keep it consistent.
- **D4 — CLI.** `python calc.py "2+3*4"` prints `14` and exits 0; an invalid expression
(`python calc.py "1 +"`) prints an error to stderr and exits non-zero (no traceback).
- **D5 — tests green + end-to-end.** `calc/test_evaluator.py` (`unittest`) passes under
`python -m unittest`, 0 failures, covering D1D3; and a CLI check covers D4. The whole prior suite
(lex + parse) must still pass (no regression).
## Verify (cold)
```bash
python -m unittest -q # D5 (whole suite, all phases)
python calc.py "2+3*4" # 14
python calc.py "(2+3)*4" # 20
python calc.py "7/2" # 3.5
python calc.py "4/2" # 2
python calc.py "1/0" # error to stderr, non-zero exit
python calc.py "1 +" # error to stderr, non-zero exit
```
The Builder restates exact commands + expected outputs + commit sha in `machine-docs/STATUS-eval.md`.
The Adversary cold-verifies and records `eval/Dn: PASS|FAIL` in `machine-docs/REVIEW-eval.md`. When
every gate in every phase has a fresh PASS and there is no `## VETO`, the Builder writes `## DONE` to
`machine-docs/STATUS-eval.md` (this is the last phase → the harness then marks the sequence complete).