mfowler bb85aa9f11 artifacts: add calculators/ — the 30 built calculators (5/variant) + machine-docs + git logs

2026-06-16 15:39:42 +00:00

REVIEW-eval — Adversary Verdicts

Phase: eval
Plan SSOT: /home/loops/project-orchestrator/projects/agent-orchestrator-benchmark/plans/calc/eval.md

Gates

D1 — arithmetic: PASS @2026-06-15T01:12:53Z
D2 — division / EvalError: PASS @2026-06-15T01:12:53Z
D3 — result type (no trailing .0): PASS @2026-06-15T01:12:53Z
D4 — CLI: PASS @2026-06-15T01:12:53Z
D5 — tests green + end-to-end: PASS @2026-06-15T01:12:53Z

Verdicts

D1 — arithmetic: PASS @2026-06-15T01:12:53Z

Cold-verified from work-adv clone (commit after pull: 070dc92).

Evidence (all outputs match expected):

python calc.py "2+3*4" → 14 exit 0 ✓
python calc.py "(2+3)*4" → 20 exit 0 ✓
python calc.py "8-3-2" → 3 exit 0 ✓
python calc.py "-2+5" → 3 exit 0 ✓
python calc.py "2*-3" → -6 exit 0 ✓
python calc.py "--5" → 5 exit 0 ✓ (double unary)
python calc.py "3-3" → 0 exit 0 ✓
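
The D1 cases exercise precedence, left associativity, parentheses, and stacked unary minus. As a rough illustration of why those inputs produce those outputs, here is a minimal recursive-descent sketch. It is not the code under review: `tokenize`, `Parser`, and `evaluate` are hypothetical names, and division is deliberately left out because D2 covers it.

```python
import re

# Hypothetical token pattern: decimals, integers, and the operators D1 covers.
TOKEN_RE = re.compile(r"\s*(\d+\.\d+|\d+|[-+*()])")


def tokenize(text):
    """Split an expression into number and operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Illustrative grammar: expr -> term -> unary -> atom."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # Folding into a running value makes "8-3-2" group as (8-3)-2.
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        # "*" is handled one level below "+", so it binds tighter.
        value = self.unary()
        while self.peek() == "*":
            self.take()
            value = value * self.unary()
        return value

    def unary(self):
        # Recursing here is what lets "--5" and "2*-3" parse.
        if self.peek() == "-":
            self.take()
            return -self.unary()
        return self.atom()

    def atom(self):
        token = self.take()
        if token is None:
            raise ValueError("unexpected end of input")
        if token == "(":
            value = self.expr()
            if self.take() != ")":
                raise ValueError("expected ')'")
            return value
        if token[0].isdigit():
            return float(token) if "." in token else int(token)
        raise ValueError(f"unexpected token {token!r}")


def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token {parser.peek()!r}")
    return value
```

Under these assumptions, `evaluate("8-3-2")` returns 3 and `evaluate("--5")` returns 5, matching the evidence above.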

D2 — division / EvalError: PASS @2026-06-15T01:12:53Z

Evidence:

python calc.py "7/2" → 3.5 exit 0 ✓ (true division)
1/0 raises EvalError("division by zero"), NOT bare ZeroDivisionError ✓
5/(3-3) also raises EvalError ✓
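
One way the D2 behaviour could be obtained is to catch the built-in exception at the division site and re-raise a domain error. The sketch below assumes that approach. The `EvalError` name and its message come from the evidence above, while `divide` is a hypothetical helper and the real evaluator may be structured differently.

```python
class EvalError(Exception):
    """Domain error for evaluation failures (definition assumed for illustration)."""


def divide(lhs, rhs):
    """True division that never leaks a bare ZeroDivisionError."""
    try:
        return lhs / rhs
    except ZeroDivisionError as exc:
        # Chain the original exception so the cause stays available for debugging.
        raise EvalError("division by zero") from exc
```

Because `EvalError` derives from `Exception` rather than `ZeroDivisionError`, a caller that catches only `ZeroDivisionError` would not see it, which is the distinction the check above draws.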

D3 — result type: PASS @2026-06-15T01:12:53Z

Evidence (types confirmed via Python isinstance check):

4/2 → int(2) (not float(2.0)) ✓
7/2 → float(3.5) ✓
2+3*4 → int(14) ✓
0.0/1 → int(0) (whole-float coercion works for zero) ✓
1.5+1.5 → 3 exit 0 (coerces 3.0 → int) ✓
Rule documented in evaluator.py docstring ✓
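
The coercion rule checked in D3 can be sketched in a few lines. This is an assumed formulation, not a copy of the docstring in evaluator.py, and `normalize` is a hypothetical name: a float with no fractional part becomes an int, and everything else passes through unchanged.

```python
def normalize(value):
    """Collapse whole-valued floats to int so results print without a trailing .0."""
    # float.is_integer() is False for inf and nan, so those pass through untouched.
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value
```

With this rule `normalize(4 / 2)` is the int 2, `normalize(7 / 2)` stays the float 3.5, and `normalize(0.0 / 1)` is the int 0.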

D4 — CLI: PASS @2026-06-15T01:12:53Z

Evidence:

python calc.py "2+3*4" → stdout 14, exit 0 ✓
python calc.py "1 +" → stderr error, exit 1, no "Traceback" ✓
python calc.py "1/0" → stderr error, exit 1, no "Traceback" ✓
python calc.py (no args) → stderr usage msg, exit 1 ✓
Error output confirmed routed to stderr: with stdout suppressed, the message still appears and the command still exits 1 ✓
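
The D4 contract (result on stdout, errors on stderr, exit code 1, no traceback) could be met by an entry point shaped roughly like the sketch below. Everything here is an assumption for illustration: `main` takes the evaluator as a parameter, `toy_evaluate` is a stand-in that only understands `a+b`, and the usage and error wording is invented rather than taken from calc.py.

```python
import sys


def toy_evaluate(text):
    """Stand-in evaluator (hypothetical): handles only 'a+b' with integers."""
    left, sep, right = text.partition("+")
    if not sep or not left.strip() or not right.strip():
        raise ValueError("malformed expression")
    return int(left) + int(right)


def main(argv, evaluate):
    """Return the process exit code instead of calling sys.exit, which eases testing."""
    if len(argv) != 2:
        # Usage text is illustrative only.
        print("usage: python calc.py EXPRESSION", file=sys.stderr)
        return 1
    try:
        result = evaluate(argv[1])
    except ValueError as exc:
        # Catching the error here is what keeps a traceback off the terminal;
        # a real CLI would catch its own error types instead of ValueError.
        print(f"error: {exc}", file=sys.stderr)
        return 1
    print(result)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv, toy_evaluate))
```

Returning the code from `main` and passing it to `sys.exit` only under the `__main__` guard lets a test call `main` directly and inspect the captured streams.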

D5 — tests green + end-to-end: PASS @2026-06-15T01:12:53Z

Evidence:

python -m unittest -q → Ran 68 tests in ...s / OK ✓
Breakdown: 18 lex + 26 parse + 24 eval = 68 total ✓
Prior 44 tests (lex + parse) still pass — no regression ✓
python -m unittest calc.test_lexer calc.test_parser -q → 44 tests OK ✓

Adversary findings

None. No defects found. No VETO.