results: full-harness 3-way (orig/min/stateless) on the calculator

All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full
adversary verification (calc correct, tests OK). Tokens:
  orig       10,756,363
  min        10,119,225  (-5.9%)
  stateless   5,964,829  (-44.5%)
Prompt-prose minimization (min) barely moved tokens; context hygiene (stateless)
cut them by 44.5%, driven by ~48% lower cache_read. Quality held.
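
The percentages in the token table can be rechecked from the raw counts. A minimal sketch, assuming `awk` is available; `pct_change` is a hypothetical helper name and is not part of the runner:

```shell
# pct_change BASE VALUE: signed percent change of VALUE relative to BASE,
# rounded to one decimal place. Hypothetical helper, for illustration only.
pct_change() {
  awk -v base="$1" -v val="$2" 'BEGIN { printf "%.1f\n", (val - base) / base * 100 }'
}

pct_change 10756363 10119225   # min vs orig       → -5.9
pct_change 10756363 5964829    # stateless vs orig → -44.5
```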

Also fix the runner's success check: it grepped the word "veto" and matched
"No veto" → false failures; now matches the "## VETO" marker.

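The false failure can be reproduced outside the harness. The sketch below runs the old and new grep patterns from the diff against a scratch directory; the directory, the file names, and the `has_veto_old` / `has_veto_new` wrappers are made up for illustration and do not exist in the runner:

```shell
# Scratch directory standing in for machine-docs/ (hypothetical content).
docs="$(mktemp -d)"
printf 'Review: lex/D1 PASS\nNo veto from the adversary.\n' > "$docs/clean.md"

# Old check: any case-insensitive occurrence of "veto" counts as a veto.
has_veto_old() { grep -rqi 'VETO' "$1" 2>/dev/null; }
# New check: only the "## VETO" marker counts.
has_veto_new() { grep -rqiE '##[[:space:]]*VETO' "$1" 2>/dev/null; }

has_veto_old "$docs" && echo "old: veto" || echo "old: no veto"   # → old: veto (false positive on "No veto")
has_veto_new "$docs" && echo "new: veto" || echo "new: no veto"   # → new: no veto

# A real standing veto is still detected by the new check.
printf '## VETO\nparse/D2 fails on nested parens\n' > "$docs/vetoed.md"
has_veto_new "$docs" && echo "new: veto" || echo "new: no veto"   # → new: veto
```

Both patterns stay case-insensitive, so a marker written as "## veto" would also count; whether that is desired depends on how strictly the prompts specify the marker.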
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 21:35:47 +00:00
parent a1b59e1bc5
commit 0fa3d726a5
2 changed files with 80 additions and 1 deletions


@@ -169,7 +169,8 @@ run_variant() {
local out; out="$( cd "$run/work-adv" && python calc.py '2+3*4' 2>/dev/null )"
[ "$out" = "14" ] && cli=yes
local reviews; reviews="$(grep -rhoiE '(lex|parse|eval)/D[0-9]+:?\s*PASS' "$run/work-adv/machine-docs/" 2>/dev/null | sort -u | wc -l)"
-local veto="no"; grep -rqi 'VETO' "$run/work-adv/machine-docs/" 2>/dev/null && veto=yes
+# a standing veto is the "## VETO" marker (per the prompts) — NOT the word "veto" (matches "No veto")
+local veto="no"; grep -rqiE '##[[:space:]]*VETO' "$run/work-adv/machine-docs/" 2>/dev/null && veto=yes
SUM_PHASES[$v]="$(cat "$run/.ao-state/state/phase-idx" 2>/dev/null || echo '?')"
local success=NO