REVIEW-eval — Adversary Verdicts

Legend

  • PASS @<timestamp> — gate accepted, evidence below
  • FAIL — repro steps below, Builder must fix

D1 — arithmetic

PASS @2026-06-15T00:54Z

Cold run — all 5 DoD-mandated cases:

'2+3*4'    -> 14   expected 14   OK
'(2+3)*4'  -> 20   expected 20   OK
'8-3-2'    -> 3    expected 3    OK
'-2+5'     -> 3    expected 3    OK
'2*-3'     -> -6   expected -6   OK

Extra break-it probes (all correct):

'2+3+4'       -> 9    OK  (left-assoc addition)
'10-2-3'      -> 5    OK  (left-assoc subtraction)
'2*3*4'       -> 24   OK  (left-assoc multiplication)
'--5'         -> 5    OK  (double unary minus)
'(-3)*(-2)'   -> 6    OK  (unary in parens)
'(1+2)*(3+4)' -> 21   OK  (nested parens)
'0*100'       -> 0    OK

python -m unittest calc.test_evaluator.TestArithmetic -q — 0 failures.
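
The probes above exercise precedence, left associativity, and unary minus. As an illustration only, here is a minimal recursive-descent sketch that would satisfy them. It is not the Builder's code: `EvalError` is named after the error type in this report, while `tokenize`, `evaluate`, and the grammar are assumptions, and division is left out on purpose.

```python
import re


class EvalError(Exception):
    """Hypothetical error type, named after the one mentioned in this report."""


# Integer literals plus the D1 operators and parentheses (assumed token set).
TOKEN = re.compile(r"\s*(\d+|[-+*()])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise EvalError(f"unexpected character at {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():  # expr := term (('+' | '-') term)*, loop gives left associativity
        value = term()
        while peek() in ("+", "-"):
            if take() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():  # term := unary ('*' unary)*, binds tighter than expr
        value = unary()
        while peek() == "*":
            take()
            value *= unary()
        return value

    def unary():  # unary := '-' unary | atom, recursion handles '--5'
        if peek() == "-":
            take()
            return -unary()
        return atom()

    def atom():  # atom := INTEGER | '(' expr ')'
        tok = take()
        if tok == "(":
            value = expr()
            if take() != ")":
                raise EvalError("expected ')'")
            return value
        if tok is None or not tok.isdigit():
            raise EvalError(f"unexpected token {tok!r}")
        return int(tok)

    result = expr()
    if peek() is not None:
        raise EvalError(f"unexpected token {peek()!r}")
    return result
```

Placing unary minus below `term` in the grammar is what lets `2*-3` parse, and the `while` loops (rather than right recursion) are what make `8-3-2` come out as 3 instead of 7.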


D2 — division

PASS @2026-06-15T00:54Z

Cold run:

'7/2'      -> 3.5        OK  (true division)
'1/0'      -> EvalError: division by zero  OK  (not ZeroDivisionError)
'5/(3-3)'  -> EvalError: division by zero  OK  (dynamic zero denominator)

Implementation: explicit if right == 0: raise EvalError(...) at calc/evaluator.py:18-21, so ZeroDivisionError cannot escape the API boundary.

python -m unittest calc.test_evaluator.TestDivision -q — 0 failures.
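
The guard described above can be sketched as follows. This is a standalone illustration, not the code at calc/evaluator.py: the `divide` helper and its signature are assumptions, and `EvalError` is redefined here only so the snippet runs on its own.

```python
class EvalError(Exception):
    """Hypothetical stand-in for the evaluator's domain error."""


def divide(left, right):
    # Check the denominator before dividing, so callers only ever see
    # EvalError and never Python's built-in ZeroDivisionError.
    if right == 0:
        raise EvalError("division by zero")
    return left / right  # true division: 7 / 2 gives 3.5


try:
    divide(5, 3 - 3)  # denominator becomes zero only at evaluation time
except EvalError as exc:
    message = str(exc)
```

Checking before the division, rather than catching ZeroDivisionError afterwards, keeps the error path explicit and makes the API boundary easy to audit.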


D3 — result type

PASS @2026-06-15T00:54Z

Cold run — CLI output (stdout only, no stderr):

'4/2'    -> '2'                      OK  (whole float -> int display)
'9/3'    -> '3'                      OK  (whole float -> int display)
'0/5'    -> '0'                      OK  (zero result -> int display)
'7/2'    -> '3.5'                    OK  (non-whole)
'1/3'    -> '0.3333333333333333'     OK  (non-whole)
'22/7'   -> '3.142857142857143'      OK  (non-whole)

Rule confirmed: _fmt() in calc.py calls value.is_integer() on floats; whole → cast to int for display.

python -m unittest calc.test_evaluator.TestResultType -q — 0 failures.
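
A minimal sketch of the display rule confirmed above, assuming the formatter receives either an int or a float. The name `fmt` is a stand-in for the report's `_fmt()`, whose actual body is not reproduced here.

```python
def fmt(value):
    # Whole-valued floats (2.0, 0.0) are shown as integers; everything
    # else falls through to Python's default string conversion.
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


shown = [fmt(4 / 2), fmt(0 / 5), fmt(7 / 2), fmt(14)]
```

Because the check is on the value rather than on the operator, any expression that happens to produce a whole float is displayed without a trailing `.0`.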


D4 — CLI

PASS @2026-06-15T00:54Z

Cold run — all DoD cases:

python calc.py "2+3*4"    -> stdout='14'   stderr=''                  exit=0  OK
python calc.py "(2+3)*4"  -> stdout='20'   stderr=''                  exit=0  OK
python calc.py "7/2"      -> stdout='3.5'  stderr=''                  exit=0  OK
python calc.py "4/2"      -> stdout='2'    stderr=''                  exit=0  OK
python calc.py "1/0"      -> stdout=''     stderr='error: division by zero'  exit=1  OK
python calc.py "1 +"      -> stdout=''     stderr='error: unexpected token ...'  exit=1  OK

Additional probes:

  • No-arg: stderr='usage: calc.py ', exit=1 OK
  • Empty string "": stderr='error: empty expression', exit=1 OK
  • No traceback in any error case (grepped for "Traceback" — not found) OK
  • Errors go to stderr, stdout is empty on error (verified via redirect) OK
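
The stdout/stderr/exit-code contract probed above could be structured roughly like this. It is a sketch under assumptions, not the contents of calc.py: `main` takes its evaluator and streams as parameters so it can run standalone, `fake_evaluate` is a hypothetical stand-in, and the usage wording is a placeholder because the report does not show the full message.

```python
import io


class EvalError(Exception):
    """Hypothetical stand-in for the evaluator's domain error."""


def main(argv, evaluate, out, err):
    """Return the process exit code; EvalError is mapped to exit code 1."""
    if len(argv) != 2:
        # Placeholder wording; the real usage text is not shown in this report.
        print("usage: calc.py EXPRESSION", file=err)
        return 1
    try:
        value = evaluate(argv[1])
    except EvalError as exc:
        # Error text goes to stderr only, with no traceback.
        print(f"error: {exc}", file=err)
        return 1
    print(value, file=out)
    return 0


def fake_evaluate(text):
    # Stand-in evaluator covering just the two cases used below.
    if text == "1/0":
        raise EvalError("division by zero")
    return 14


out, err = io.StringIO(), io.StringIO()
code = main(["calc.py", "1/0"], fake_evaluate, out, err)
```

Returning the exit code from `main` instead of calling `sys.exit` inside it makes the contract testable in-process; a real entry point would pass `sys.argv`, `sys.stdout`, and `sys.stderr` and hand the result to `sys.exit`.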

D5 — tests green + end-to-end

PASS @2026-06-15T00:54Z

Cold run:

$ python -m unittest -q
----------------------------------------------------------------------
Ran 68 tests in 0.210s

OK

Exit code 0. 68/68 pass (24 lex + 22 parse + 22 eval, including 6 CLI subprocess tests).

No regression in prior lex/parse tests.
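
For readers unfamiliar with CLI subprocess tests, the general pattern looks like the sketch below. It is not taken from the Builder's test suite: `run_cli` is a hypothetical helper, and an inline stand-in script replaces calc.py so the snippet runs anywhere.

```python
import subprocess
import sys


def run_cli(cmd):
    """Run cmd and return (stdout, stderr, exit code), whitespace-stripped."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.stdout.strip(), proc.stderr.strip(), proc.returncode


# Stand-in for `python calc.py "1/0"`: writes an error to stderr, exits with 1.
STAND_IN = 'import sys; sys.stderr.write("error: division by zero\\n"); sys.exit(1)'

stdout, stderr, code = run_cli([sys.executable, "-c", STAND_IN])
```

Capturing both streams separately is what allows a test to assert that stdout stays empty on error and that stderr carries no traceback.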


Summary

Gate                  Verdict
D1 — arithmetic       PASS
D2 — division         PASS
D3 — result type      PASS
D4 — CLI              PASS
D5 — tests green      PASS

All gates PASS. No findings. Builder may write "## DONE" to STATUS-eval.md.