results: full-harness 3-way (orig/min/stateless) on the calculator

All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full adversary verification (calc correct, tests OK). Tokens: orig 10,756,363 min 10,119,225 (-5.9%) stateless 5,964,829 (-44.5%) Prompt-prose minimization barely moved tokens; context hygiene (stateless) nearly halved them, driven by ~48% lower cache_read. Quality held. Also fix the runner's success check: it grepped the word "veto" and matched "No veto" → false failures; now matches the "## VETO" marker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 21:35:47 +00:00
parent a1b59e1bc5
commit 0fa3d726a5
2 changed files with 80 additions and 1 deletions
--- a/run-harness-bench.sh
+++ b/run-harness-bench.sh
@ -169,7 +169,8 @@ run_variant() {
  local out; out="$( cd "$run/work-adv" && python calc.py '2+3*4' 2>/dev/null )"
  [ "$out" = "14" ] && cli=yes
  local reviews; reviews="$(grep -rhoiE '(lex|parse|eval)/D[0-9]+:?\s*PASS' "$run/work-adv/machine-docs/" 2>/dev/null | sort -u | wc -l)"
-  local veto="no"; grep -rqi 'VETO' "$run/work-adv/machine-docs/" 2>/dev/null && veto=yes
+  # a standing veto is the "## VETO" marker (per the prompts) — NOT the word "veto" (matches "No veto")
+  local veto="no"; grep -rqiE '##[[:space:]]*VETO' "$run/work-adv/machine-docs/" 2>/dev/null && veto=yes
  SUM_PHASES[$v]="$(cat "$run/.ao-state/state/phase-idx" 2>/dev/null || echo '?')"

  local success=NO