From b46dca003caf896ff7ae03f15585a2dd81f3591d Mon Sep 17 00:00:00 2001
From: mfowler <mfowler.email@protonmail.com>
Date: Sun, 14 Jun 2026 22:06:21 +0000
Subject: [PATCH] results: 4-way + the variance finding (N=1 is not enough)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Combined run 1 (orig/min/stateless) + run 2 (lean/stateless). Key result: the
SAME stateless variant used 5.96M tokens in run 1 and 9.27M in run 2 (+55%).
That run-to-run swing exceeds every between-variant gap outside run 1's low stateless total. So:
- prose minimization ~-6% (small, same-invocation)
- lean (full per-gate review) ~= stateless (batched): full review is ~free
- the earlier "-45% from context hygiene" is NOT reproducible — mostly noise
Honest conclusion: need >=5 runs/variant to resolve the context-hygiene effect;
log_tokens now makes that easy to collect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 RESULTS-harness.md | 106 +++++++++++++++++++--------------------------
 1 file changed, 45 insertions(+), 61 deletions(-)
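
A minimal sketch for re-deriving the percentages quoted in this patch from the
token totals in its "All data points" table. The helper names (`pct_change`,
`midpoint_spread`, `median_by_variant`) and the `TOTALS` layout are assumptions
made for illustration; they are not part of the harness or of `log_tokens`.

```python
from collections import defaultdict
from statistics import median

# Totals copied from the "All data points" table; keys are (run, variant).
TOTALS = {
    (1, "orig"): 10_756_363,
    (1, "min"): 10_119_225,
    (1, "stateless"): 5_964_829,
    (2, "lean"): 9_136_454,
    (2, "stateless"): 9_266_808,
}


def pct_change(new: int, base: int) -> float:
    """Signed percentage change of `new` relative to `base`."""
    return 100.0 * (new - base) / base


def midpoint_spread(a: int, b: int) -> float:
    """Half the range of two values, as a percentage of their midpoint."""
    return 100.0 * abs(a - b) / (a + b)


def median_by_variant(totals: dict) -> dict:
    """Median total per variant across all runs (hypothetical campaign helper)."""
    grouped = defaultdict(list)
    for (_run, variant), tokens in totals.items():
        grouped[variant].append(tokens)
    return {variant: median(values) for variant, values in grouped.items()}


if __name__ == "__main__":
    s1, s2 = TOTALS[(1, "stateless")], TOTALS[(2, "stateless")]
    print(f"min vs orig (run 1):        {pct_change(TOTALS[(1, 'min')], TOTALS[(1, 'orig')]):+.1f}%")
    print(f"lean vs stateless (run 2):  {pct_change(TOTALS[(2, 'lean')], s2):+.1f}%")
    print(f"stateless run 2 vs run 1:   {pct_change(s2, s1):+.1f}%")
    print(f"stateless midpoint spread:  ±{midpoint_spread(s1, s2):.1f}%")
    print(median_by_variant(TOTALS))
```

With only one or two points per variant the medians are close to trivial; the
helper is there for the case where more runs per variant have been collected.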

diff --git a/RESULTS-harness.md b/RESULTS-harness.md
index b0e9e96..273a0ba 100644
--- a/RESULTS-harness.md
+++ b/RESULTS-harness.md
@@ -1,78 +1,62 @@
-# Full-harness benchmark — prompt variants
+# Full-harness benchmark — prompt variants (calculator, Sonnet)
 
 Real `agents.py up` Builder/Adversary loop pair + watchdog, run **autonomously** through the
 multi-phase calculator (`plans/calc/{lex,parse,eval}.md` — 3 phases, 4–6 gates each) to
-`SEQUENCE-COMPLETE`. Engine pinned at `985d33d`. Both loops on **claude-sonnet-4-6**. Per-variant
-timeout 3000s. Tokens summed from the Claude Code session transcripts of each loop's clone. **N=1**
-per variant (the autonomous loop is nondeterministic — number of review rounds varies).
+`SEQUENCE-COMPLETE`. Both loops on **claude-sonnet-4-6**. Tokens summed from each loop's Claude Code
+session transcripts.
 
-## Variants
+## ⚠️ Headline: run-to-run variance dominates (N=1 is not enough)
 
-- **builder-adversary** — original prompts.
-- **builder-adversary-min** — same rules, prose compressed to minimal tokens.
-- **builder-adversary-stateless** — min + **context hygiene** (compact at each checkpoint, read diffs
-  not trees, spill bulk to files, adversary loads only {plan, STATUS, diff}); loop sessions
-  non-resumed → fresh context per phase. Same AI-as-adversary verification.
+The **same** `stateless` variant, identical prompts, was run twice:
 
-## All three succeeded
+| run | stateless total tokens |
+|---|--:|
+| run 1 | **5,964,829** |
+| run 2 | **9,266,808** |
 
-Every variant completed all three phases with the Adversary cold-verifying every gate (no veto), and
-the final calculator is correct in each:
+Run 2 used **55% more tokens** than run 1 (about ±22% around the midpoint) from nondeterminism alone:
+how many review rounds and retries the autonomous loop happens to do. That 3.30M swing is **larger
+than every between-variant gap below** that does not involve run 1's low stateless total, the very
+result it calls into question. Treat single-run deltas as suggestive at best: **N=1 cannot resolve the variant effects**; that takes several runs per variant, compared by median.
 
-| version | sequence-complete | unittest | `2+3*4` | `(2+3)*4` | `7/2` | result |
-|---|:--:|:--:|:--:|:--:|:--:|:--:|
-| builder-adversary | yes | OK | 14 | 20 | 3.5 | PASS |
-| builder-adversary-min | yes | OK | 14 | 20 | 3.5 | PASS |
-| builder-adversary-stateless | yes | OK | 14 | 20 | 3.5 | PASS |
+## All data points
 
-(The original Adversary even filed a non-blocking advisory on `lex` — genuine adversarial review, not
-rubber-stamping.)
+| run | variant | builder | adversary | **total** |
+|---|---|--:|--:|--:|
+| 1 | builder-adversary (orig) | 5,557,356 | 5,199,007 | **10,756,363** |
+| 1 | builder-adversary-min | 5,350,953 | 4,768,272 | **10,119,225** |
+| 1 | builder-adversary-stateless | 2,834,505 | 3,130,324 | **5,964,829** |
+| 2 | builder-adversary-lean | 4,050,402 | 5,086,052 | **9,136,454** |
+| 2 | builder-adversary-stateless | 4,606,579 | 4,660,229 | **9,266,808** |
 
-## Static prompt size (chars: kickoff + role)
+All five **succeeded**: built a correct calculator (`2+3*4→14`, `(2+3)*4→20`, `7/2→3.5`), full test
+suites green, every gate Adversary-verified, no veto. (Engine: run 1 @ `985d33d`, run 2 @ `e0425e6`
+— the orig/min/stateless prompts are byte-identical across both; run 2 only adds `lean`.)
 
-| version | builder | adversary |
-|---|--:|--:|
-| builder-adversary | 6389 | 5811 |
-| builder-adversary-min | 1751 | 1644 |
-| builder-adversary-stateless | 2430 | 2218 |
+## What we can and can't say
 
-## Tokens (from session transcripts)
+**Valid (same-invocation) comparisons:**
+- **Run 1 — prose minimization:** orig 10.76M vs min 10.12M → **−5.9%**. Small; consistent with "the
+  prompt is a tiny cached slice." Probably real but minor.
+- **Run 2 — full per-gate review vs batched, both with context hygiene:** lean 9.14M vs stateless
+  9.27M → **essentially tied (−1.4%)**. So **enforcing one claim + one independent verdict per gate
+  did NOT cost more tokens** than letting the loop batch: full review granularity carried no token
+  penalty in this run, though a 1.4% gap is far inside the run-to-run noise.
 
-| version | builder loop | adversary loop | **total** | vs orig |
-|---|--:|--:|--:|--:|
-| builder-adversary | 5,557,356 | 5,199,007 | **10,756,363** | — |
-| builder-adversary-min | 5,350,953 | 4,768,272 | **10,119,225** | −5.9% |
-| builder-adversary-stateless | 2,834,505 | 3,130,324 | **5,964,829** | **−44.5%** |
-
-Breakdown (the dominant term is **cache_read** — re-reading the conversation each turn):
-
-| version | role | input | output | cache_create | cache_read |
-|---|---|--:|--:|--:|--:|
-| builder-adversary | builder | 199 | 87,704 | 256,054 | 5,213,399 |
-| builder-adversary | adversary | 181 | 66,540 | 183,189 | 4,949,097 |
-| builder-adversary-min | builder | 216 | 54,381 | 254,524 | 5,041,832 |
-| builder-adversary-min | adversary | 213 | 58,838 | 202,916 | 4,506,305 |
-| builder-adversary-stateless | builder | 124 | 26,998 | 113,026 | 2,694,357 |
-| builder-adversary-stateless | adversary | 141 | 31,342 | 122,360 | 2,976,481 |
+**NOT supported:**
+- The earlier "context hygiene halves tokens (−45%)" claim from run 1 is **not reproducible**:
+  stateless's *own* second run (9.27M) lands within 14% of orig, min, and lean (9.14M to 10.76M). The
+  −45% looks mostly like a lucky low-iteration run, not the context discipline. Context hygiene may
+  still help, but this benchmark can't prove it at N=1.
 
 ## Findings
 
-1. **Prompt-prose minimization barely moves tokens (−5.9%).** `min` cut the prompt to ~⅓ the size
-   but saved almost nothing — because the role/kickoff prompt is a tiny, cached slice. Worth keeping
-   for readability; not a token lever.
-2. **Context hygiene nearly halves tokens (−44.5%), quality intact.** `stateless` produced the same
-   correct, fully-verified calculator while cutting total tokens ~45%. The saving is dominated by
-   **cache_read** falling ~48% (builder 5.21M→2.69M) — exactly the "don't carry/reload context you
-   don't need" lever. It also cut output tokens (builder 87.7k→27.0k), i.e. less redundant
-   regeneration across turns.
-3. **The cost is the conversation, not the prompt.** cache_read ≫ everything else in all three. Any
-   real efficiency work should target carried/reloaded context (compaction cadence, fresh sessions
-   per unit of work, diff-not-tree reads), not prompt wording.
+1. **The dominant variable is nondeterministic iteration count, not the prompt variant.** The same
+   variant's run-to-run swing (+55%) exceeds every gap between variants not involving run 1's low stateless total.
+2. **Prose size barely matters** (−6%, run 1) — keep minimal prompts for readability, not tokens.
+3. **Full per-gate review is ~free vs batching** (run 2) — granular adversarial scrutiny didn't
+   raise the bill, so prefer it for quality.
+4. **To actually measure context hygiene you need a campaign:** ≥5 runs per variant, compare medians
+   / distributions. The new `log_tokens` harness feature makes that cheap to collect.
 
-> N=1 caveat: the autonomous loop is nondeterministic, so some of the gap is run-to-run variance
-> (review-round count, retries). The ~45% reduction is large and matches the mechanism (cache_read
-> roughly halved), but repeating the run a few times would tighten the estimate.
-
-_Run dirs: `/tmp/ao-harness-YIrsUp`. (A prior auto-generated version mislabeled `orig`/`stateless` as
-failed — a bug in the harness success check grepping the word "veto" and matching "No veto"; fixed to
-match the `## VETO` marker. Functionally all three passed, as verified above.)_
+_Run dirs: run 1 `/tmp/ao-harness-YIrsUp`, run 2 `/tmp/ao-harness-TMDfvk`. N=1 per (run, variant)._