From 0fa3d726a51c09320d7e50a5af053ac2baded8d3 Mon Sep 17 00:00:00 2001
From: mfowler <mfowler.email@protonmail.com>
Date: Sun, 14 Jun 2026 21:35:47 +0000
Subject: [PATCH] results: full-harness 3-way (orig/min/stateless) on the
 calculator
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full
adversary verification (calc correct, tests OK). Tokens:
  orig       10,756,363
  min        10,119,225  (-5.9%)
  stateless   5,964,829  (-44.5%)
Prompt-prose minimization barely moved tokens; context hygiene (stateless)
nearly halved them, driven by ~48% lower cache_read. Quality held.

Also fix the runner's success check: it grepped the word "veto" and matched
"No veto" → false failures; now matches the "## VETO" marker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 RESULTS-harness.md   | 78 ++++++++++++++++++++++++++++++++++++++++++++
 run-harness-bench.sh |  3 +-
 2 files changed, 80 insertions(+), 1 deletion(-)
 create mode 100644 RESULTS-harness.md

diff --git a/RESULTS-harness.md b/RESULTS-harness.md
new file mode 100644
index 0000000..b0e9e96
--- /dev/null
+++ b/RESULTS-harness.md
@@ -0,0 +1,78 @@
+# Full-harness benchmark — prompt variants
+
+Real `agents.py up` Builder/Adversary loop pair + watchdog, run **autonomously** through the
+multi-phase calculator (`plans/calc/{lex,parse,eval}.md` — 3 phases, 4–6 gates each) to
+`SEQUENCE-COMPLETE`. Engine pinned at `985d33d`. Both loops on **claude-sonnet-4-6**. Per-variant
+timeout 3000s. Tokens summed from the Claude Code session transcripts of each loop's clone. **N=1**
+per variant (the autonomous loop is nondeterministic — number of review rounds varies).
+
+## Variants
+
+- **builder-adversary** — original prompts.
+- **builder-adversary-min** — same rules, prose compressed to minimal tokens.
+- **builder-adversary-stateless** — min + **context hygiene** (compact at each checkpoint, read diffs
+  not trees, spill bulk to files, adversary loads only {plan, STATUS, diff}); loop sessions
+  non-resumed → fresh context per phase. Same AI-as-adversary verification.
+
+## All three succeeded
+
+Every variant completed all three phases with the Adversary cold-verifying every gate (no veto), and
+the final calculator is correct in each:
+
+| version | sequence-complete | unittest | `2+3*4` | `(2+3)*4` | `7/2` | result |
+|---|:--:|:--:|:--:|:--:|:--:|:--:|
+| builder-adversary | yes | OK | 14 | 20 | 3.5 | PASS |
+| builder-adversary-min | yes | OK | 14 | 20 | 3.5 | PASS |
+| builder-adversary-stateless | yes | OK | 14 | 20 | 3.5 | PASS |
+
+(The original Adversary even filed a non-blocking advisory on `lex` — genuine adversarial review, not
+rubber-stamping.)
+
+## Static prompt size (chars: kickoff + role)
+
+| version | builder | adversary |
+|---|--:|--:|
+| builder-adversary | 6389 | 5811 |
+| builder-adversary-min | 1751 | 1644 |
+| builder-adversary-stateless | 2430 | 2218 |
+
+## Tokens (from session transcripts)
+
+| version | builder loop | adversary loop | **total** | vs orig |
+|---|--:|--:|--:|--:|
+| builder-adversary | 5,557,356 | 5,199,007 | **10,756,363** | — |
+| builder-adversary-min | 5,350,953 | 4,768,272 | **10,119,225** | −5.9% |
+| builder-adversary-stateless | 2,834,505 | 3,130,324 | **5,964,829** | **−44.5%** |
+
+Breakdown (the dominant term is **cache_read** — re-reading the conversation each turn):
+
+| version | role | input | output | cache_create | cache_read |
+|---|---|--:|--:|--:|--:|
+| builder-adversary | builder | 199 | 87,704 | 256,054 | 5,213,399 |
+| builder-adversary | adversary | 181 | 66,540 | 183,189 | 4,949,097 |
+| builder-adversary-min | builder | 216 | 54,381 | 254,524 | 5,041,832 |
+| builder-adversary-min | adversary | 213 | 58,838 | 202,916 | 4,506,305 |
+| builder-adversary-stateless | builder | 124 | 26,998 | 113,026 | 2,694,357 |
+| builder-adversary-stateless | adversary | 141 | 31,342 | 122,360 | 2,976,481 |
+
+## Findings
+
+1. **Prompt-prose minimization barely moves tokens (−5.9%).** `min` cut the prompt to ~⅓ the size
+   but saved almost nothing — because the role/kickoff prompt is a tiny, cached slice. Worth keeping
+   for readability; not a token lever.
+2. **Context hygiene nearly halves tokens (−44.5%), quality intact.** `stateless` produced the same
+   correct, fully-verified calculator while cutting total tokens ~45%. The saving is dominated by
+   **cache_read** falling ~48% (builder 5.21M→2.69M) — exactly the "don't carry/reload context you
+   don't need" lever. It also cut output tokens (builder 87.7k→27.0k), i.e. less redundant
+   regeneration across turns.
+3. **The cost is the conversation, not the prompt.** cache_read ≫ everything else in all three. Any
+   real efficiency work should target carried/reloaded context (compaction cadence, fresh sessions
+   per unit of work, diff-not-tree reads), not prompt wording.
+
+> N=1 caveat: the autonomous loop is nondeterministic, so some of the gap is run-to-run variance
+> (review-round count, retries). The ~45% reduction is large and matches the mechanism (cache_read
+> roughly halved), but repeating the run a few times would tighten the estimate.
+
+_Run dirs: `/tmp/ao-harness-YIrsUp`. (A prior auto-generated version mislabeled `orig`/`stateless` as
+failed — a bug in the harness success check grepping the word "veto" and matching "No veto"; fixed to
+match the `## VETO` marker. Functionally all three passed, as verified above.)_
diff --git a/run-harness-bench.sh b/run-harness-bench.sh
index 1cf8d48..790e21b 100755
--- a/run-harness-bench.sh
+++ b/run-harness-bench.sh
@@ -169,7 +169,8 @@ run_variant() {
   local out; out="$( cd "$run/work-adv" && python calc.py '2+3*4' 2>/dev/null )"
   [ "$out" = "14" ] && cli=yes
   local reviews; reviews="$(grep -rhoiE '(lex|parse|eval)/D[0-9]+:?\s*PASS' "$run/work-adv/machine-docs/" 2>/dev/null | sort -u | wc -l)"
-  local veto="no"; grep -rqi 'VETO' "$run/work-adv/machine-docs/" 2>/dev/null && veto=yes
+  # a standing veto is the "## VETO" marker (per the prompts) — NOT the word "veto" (matches "No veto")
+  local veto="no"; grep -rqiE '##[[:space:]]*VETO' "$run/work-adv/machine-docs/" 2>/dev/null && veto=yes
   SUM_PHASES[$v]="$(cat "$run/.ao-state/state/phase-idx" 2>/dev/null || echo '?')"
 
   local success=NO