docs: FINDINGS.md — benchmark synthesis; track raw results data

Capstone summary of the Builder/Adversary prompt + verification-cadence study:
- adversary EXISTENCE costs ~4.7x (solo 2.8M vs ~13M); cadence is ~token-neutral
- context hygiene is the one clean -22% win; minimal prompts -25% but test less
- deferred review saves nothing (the one comprehensive pass is expensive) + late
- cost is process not product (tokens~duration 0.83, ~commits 0.65, ~LOC +0.01)

All results now in-repo: FINDINGS.md + RESULTS-campaign.md + raw .data + runners.
(deferred N=3, finalizing to N=5.)
2026-06-16 01:53:34 +00:00
parent 819000417b
commit 3bf3316572
4 changed files with 164 additions and 9 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -4,5 +4,3 @@ __pycache__/
 *.pyc
 *.tmp
 RESULTS-harness.md.tmp
-RESULTS-campaign.md.data
-RESULTS-campaign.md.data.hdr
--- a/FINDINGS.md
+++ b/FINDINGS.md
@@ -0,0 +1,120 @@
+# Findings — Builder/Adversary prompt & verification-cadence benchmark
+
+A controlled study of what actually drives **token cost** in the agent-orchestrator Builder/Adversary
+loop, on a fixed, well-specified task.
+
+- **Task:** build a 3-phase Python calculator (lexer → parser → evaluator), each phase with 4–6
+  cold-verifiable Definition-of-Done gates. Deliberately offline and deterministic so it stresses the
+  *protocol*, not infrastructure.
+- **How:** each variant run autonomously to `SEQUENCE-COMPLETE` via the real harness (`agents.py up`
+  — Builder + Adversary loop pair + watchdog), **5 runs each** (N=5; `deferred` N=3, finalizing).
+  Both loops on **claude-sonnet-4-6**. Tokens summed from each loop's Claude Code session
+  transcripts. The deliverable is behaviorally identical across all variants (verified on a
+  24-expression probe), so this compares like-for-like.
+- **Full data:** [`RESULTS-campaign.md`](RESULTS-campaign.md) (analysis), `RESULTS-campaign.md.data`
+  (raw per-run rows). Every run's git repo is preserved under `/tmp/ao-campaign-*` and `/tmp/ao-solo-*`.
+
+## The variants
+
+| variant | what changes | engine example |
+|---|---|---|
+| `builder-adversary` (orig) | the original full prompts; Adversary verifies **per phase** | `examples/builder-adversary` |
+| `builder-adversary-min` | prompts compressed to minimal tokens | `examples/builder-adversary-min` |
+| `builder-adversary-stateless` | orig + **context hygiene** (compact per checkpoint, read diffs not trees, lean loads) | `examples/builder-adversary-stateless` |
+| `builder-adversary-lean` | orig + context hygiene + **per-gate** review (one claim/verdict per gate) | `examples/builder-adversary-lean` |
+| `builder-adversary-deferred` | orig; Adversary verifies **once, after the whole build** (a final comprehensive `review` phase) | `examples/builder-adversary-deferred` |
+| `builder-solo` | **no Adversary** — a single Builder that self-certifies | `examples/builder-solo` |
+
+(stateless/lean/deferred are all built on the *full original* prompts, so they isolate their one
+change without the minimal-prompt confound.)
+
+## Headline results — median tokens (5 runs each; deferred N=3)
+
+| variant | adversary verifies… | median tokens | vs orig | commits | LOC |
+|---|---|--:|--:|--:|--:|
+| **builder-solo** | never (self-certify) | **2.77M** | −79% | 5 | 426 |
+| **min** | per phase *(minimal prompts)* | 9.77M | −25% | 15 | 367 |
+| **stateless** | per phase *(+context hygiene)* | 10.12M | −22% | 14 | 400 |
+| **orig** | per **phase** | 13.04M | — | 14 | 449 |
+| **deferred** | once, after **whole build** | 13.37M | +3% | 10 | 425 |
+| **lean** | per **gate** | 13.41M | +3% | 28 | 390 |
+
+## The two big findings
+
+### 1. The adversary's *existence* costs ~4.7× — its *cadence* barely matters.
+All three review-cadence variants land near **13M tokens regardless of how the review is chunked**:
+per-gate (`lean`, 28 commits), per-phase (`orig`, 14), or one deferred pass (`deferred`, 10).
+`builder-solo` (no adversary) is **2.77M**. So the dominant cost is **whether an independent cold
+re-verification happens at all**, not how it's scheduled. The verification *work* is roughly
+conserved; chunking it finer or coarser mostly changes the commit/handshake count — which is itself
+nearly token-neutral.
+
+### 2. Deferred review was the surprise — and the loser.
+Hypothesis: deferring to one pass would be cheapest (fewest handshakes ≈ solo-build + one review).
+It wasn't — `deferred` ≈ `orig`. Handshakes *did* drop (10 commits), but the **single comprehensive
+review is itself expensive** (the Adversary re-verifies the whole calculator + cross-feature probes
+in one shot), so total tokens stayed put. It also carries the downside that the independent check
+**arrives late**, which adds late-rework risk plus self-certification drift on the build phases. Worst
+of both worlds for this task.
+
+## The levers, ranked
+
+1. **Drop the adversary → ~−79%** — but you lose all independent verification. On this clean,
+   well-specified task `solo` produced correct calculators, so the adversary bought no *measured*
+   quality here — but it is **insurance against self-certification rubber-stamping a bug**, whose
+   value shows on ambiguous/underspecified work this benchmark can't stress.
+2. **Context hygiene → −22%** — the **only clean win**: same review effort (same commits/LOC as
+   orig), just less context carried and reloaded each turn. (`stateless` vs `orig`.)
+3. **Minimal prompts → −25%, but not free** — ~⅓ of the saving comes from the agents writing **~25%
+   fewer tests** (the compressed prompts drop the emphatic "try to break it / paste the output /
+   a red test is information" language that drives thorough testing). Same features, thinner test
+   suite.
+4. **Review cadence → ~0%** — per-gate / per-phase / per-build are interchangeable on *cost*; choose
+   for **quality and latency**, not tokens: finer = earlier defect-catching at slight overhead;
+   coarser = late but holistic (better at cross-feature bugs).
+
+## Why: cost is *process*, not *product*
+
+Pooled across all 28 successful runs:
+
+| tokens vs | Pearson r |
+|---|--:|
+| duration | **+0.83** |
+| commits (review rounds) | **+0.65** |
+| LOC (code shipped) | **+0.01** |
+
+Token cost tracks how long the loop **runs and verifies**, and is **uncorrelated with how much code
+ships**. The deliverable (LOC, behavior) is near-constant across variants; the cost variance is all
+process intensity.
+
+## Methodology notes & caveats
+
+- **N matters.** A single full-loop run is wildly nondeterministic: the *same* variant varied **±55%**
+  run-to-run early on, which is why this is N=5. (An early single-run claim of "context hygiene halves
+  tokens" did **not** reproduce — the real, stable figure is −22%.)
+- **Variance source:** number of review rounds / retries, not output size.
+- **Real failures excluded** (2 of the 27 runs across the first five variants, both loop-pair): a
+  wedge and a usage-limit/timeout collision; superseded by clean re-runs. A failed fourth `deferred`
+  run recorded in the raw data is also excluded. `LIMIT`-flagged runs (a usage-limit *pause* inflates
+  duration without adding tokens) are kept for token totals but excluded from `tokens/sec`.
+- **Scope:** one task, one model (Sonnet), one harness. The *relative* findings should generalize;
+  absolute numbers are task-specific. The adversary's quality value is **not** measured here (the task
+  is too well-specified to stress it).
+
+## Practical guidance
+
+- **Want to cut tokens without losing the independent check?** Use **context hygiene** (the `stateless`
+  pattern). It's the only free lunch.
+- **Don't pay for minimal prompts with test coverage** — keep the emphatic testing language unless you
+  genuinely want less testing.
+- **Pick review cadence for the work, not the bill:** per-gate to catch regressions early in long
+  phases; per-phase as a sane default; deferred only when features are independent and cheap to fix
+  late (it saves nothing and checks late).
+- **`solo` is ~5× cheaper** — reasonable for low-stakes / well-specified work, but you're trusting the
+  builder to grade its own homework.
+
+---
+_Artifacts in this repo: `FINDINGS.md` (this summary), `RESULTS-campaign.md` (per-variant stats +
+ratios + full raw table), `RESULTS-campaign.md.data` (raw rows), `analyze.py` (regenerates the
+analysis), `run-harness-bench.sh` / `run-solo-bench.sh` (the runners), `plans/calc/` (the task), and
+the six variants under `engine/examples/` (five `builder-adversary*` plus `builder-solo`). `deferred` is N=3 here and is being finalized
+to N=5; its median is stable (spread 1.19×)._
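
The headline medians and "vs orig" deltas in FINDINGS.md can be re-derived from the successful-run totals in `RESULTS-campaign.md.data`. The following is a minimal Python sketch, not the repository's `analyze.py` (whose contents are not shown in this commit). The `TOTALS` dict and the `headline` helper are hypothetical names; the numbers are copied from the raw rows marked `YES`.

```python
from statistics import median

# Total tokens per successful run, copied from RESULTS-campaign.md.data
# (rows marked YES only). The dict layout and names are illustrative.
TOTALS = {
    "builder-solo": [2417528, 2747457, 2837115, 2773634, 2948467],
    "builder-adversary-min": [9135718, 11386415, 9316676, 9973813, 9772581],
    "builder-adversary-stateless": [10457577, 9992834, 10094430, 10122375, 13009792],
    "builder-adversary": [11117474, 13579616, 14960414, 13037683, 11903098],
    "builder-adversary-deferred": [13366800, 12888082, 15334242],
    "builder-adversary-lean": [12962701, 13409349, 13815595, 12101355, 13793914],
}


def headline(totals, baseline="builder-adversary"):
    """Return {variant: (median_tokens, percent_change_vs_baseline)}."""
    base = median(totals[baseline])
    return {
        name: (median(runs), round((median(runs) / base - 1) * 100))
        for name, runs in totals.items()
    }


table = headline(TOTALS)
for name, (med, pct) in table.items():
    print(f"{name:30s} {med / 1e6:6.2f}M {pct:+d}%")

# Ratio behind the "~4.7x" claim: orig median over solo median.
solo_ratio = table["builder-adversary"][0] / table["builder-solo"][0]
print(f"orig / solo = {solo_ratio:.1f}x")
```

Taking the median rather than the mean keeps a single expensive run (for example the 13.0M `stateless` outlier) from moving the headline figure.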
--- a/RESULTS-campaign.md
+++ b/RESULTS-campaign.md
@@ -1,6 +1,6 @@
 # Full-harness benchmark — campaign analysis

-Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops on Sonnet. Tokens summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank `calc/*.py` lines (code + tests). 25 successful runs of 27 total.
+Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops on Sonnet. Tokens summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank `calc/*.py` lines (code + tests). 28 successful runs of 30 total.

 ## Per-variant total tokens (successful runs)

@@ -11,6 +11,7 @@ Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase c
 | builder-adversary-stateless | 5/5 | 10,122,375 | 10,735,401 | 9,992,834 | 13,009,792 | 1.30x |
 | builder-adversary-lean | 5/6 | 13,409,349 | 13,216,582 | 12,101,355 | 13,815,595 | 1.14x |
 | builder-solo | 5/5 | 2,773,634 | 2,744,840 | 2,417,528 | 2,948,467 | 1.22x |
+| builder-adversary-deferred | 3/3 | 13,366,800 | 13,863,041 | 12,888,082 | 15,334,242 | 1.19x |

 ## Efficiency ratios — min / median / max (successful runs)

@@ -18,12 +19,13 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration

 | variant | tokens / LOC | tokens / sec | tokens / commit |
 |---|--:|--:|--:|
-| builder-adversary | 23,857 / 30,391 / 32,665 | 8,083 / 11,670 / 13,852 | 793,540 / 935,026 / 1,044,586 |
+| builder-adversary | 23,857 / 30,391 / 32,665 | 11,581 / 12,226 / 13,852 | 793,540 / 935,026 / 1,044,586 |
 | builder-adversary-min | 23,527 / 25,252 / 32,814 | 8,173 / 14,807 / 15,814 | 582,292 / 669,789 / 712,415 |
 | builder-adversary-stateless | 21,802 / 27,799 / 30,861 | 10,544 / 11,620 / 12,755 | 697,172 / 765,282 / 778,644 |
 | builder-adversary-lean | 28,077 / 33,238 / 38,966 | 12,416 / 13,523 / 14,403 | 432,191 / 478,905 / 575,650 |
 | builder-solo | 6,029 / 6,611 / 6,969 | 6,542 / 6,715 / 7,020 | 392,494 / 483,506 / 737,117 |
-| **all** | 6,029 / 28,077 / 38,966 | 6,542 / 11,712 / 15,814 | 392,494 / 697,172 / 1,044,586 |
+| builder-adversary-deferred | 31,451 / 33,827 / 34,537 | 12,170 / 13,105 / 15,343 | 1,277,854 / 1,336,680 / 1,841,155 |
+| **all** | 6,029 / 29,024 / 38,966 | 6,542 / 12,170 / 15,814 | 392,494 / 705,228 / 1,841,155 |

 ## Per-variant medians (commits / LOC / duration)

@@ -34,21 +36,22 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration
 | builder-adversary-stateless | 14 | 400 | 900 |
 | builder-adversary-lean | 28 | 390 | 960 |
 | builder-solo | 5 | 426 | 420 |
+| builder-adversary-deferred | 10 | 425 | 1020 |

-## Correlations with total tokens (pooled, n=25)
+## Correlations with total tokens (pooled, n=28)

 | tokens vs | Pearson r |
 |---|--:|
 | duration | +0.83 |
-| commits | +0.79 |
-| LOC | -0.04 |
+| commits | +0.65 |
+| LOC | +0.01 |

 ## All runs (raw)

 | variant | rep | ok | limit | total | dur(s) | commits | LOC | tok/LOC | tok/sec | tok/commit |
 |---|:--:|:--:|:--:|--:|--:|--:|--:|--:|--:|--:|
 | builder-adversary | 1 | YES |  | 11,117,474 | 960 | 14 | 466 | 23,857 | 11,581 | 794,105 |
-| builder-adversary | 2 | YES |  | 13,579,616 | 1680 | 13 | 449 | 30,244 | 8,083 | 1,044,586 |
+| builder-adversary | 2 | YES | LIMIT | 13,579,616 | 1680 | 13 | 449 | 30,244 | 8,083 | 1,044,586 |
 | builder-adversary | 3 | YES |  | 14,960,414 | 1080 | 16 | 458 | 32,665 | 13,852 | 935,026 |
 | builder-adversary | 4 | YES |  | 13,037,683 | 1020 | 13 | 429 | 30,391 | 12,782 | 1,002,899 |
 | builder-adversary | 5 | YES |  | 11,903,098 | 1020 | 15 | 381 | 31,242 | 11,670 | 793,540 |
@@ -74,5 +77,8 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration
 | builder-solo | 3 | YES |  | 2,837,115 | 420 | 6 | 426 | 6,660 | 6,755 | 472,852 |
 | builder-solo | 4 | YES |  | 2,773,634 | 420 | 5 | 398 | 6,969 | 6,604 | 554,727 |
 | builder-solo | 5 | YES |  | 2,948,467 | 420 | 4 | 446 | 6,611 | 7,020 | 737,117 |
+| builder-adversary-deferred | 1 | YES |  | 13,366,800 | 1020 | 10 | 425 | 31,451 | 13,105 | 1,336,680 |
+| builder-adversary-deferred | 2 | YES |  | 12,888,082 | 840 | 7 | 381 | 33,827 | 15,343 | 1,841,155 |
+| builder-adversary-deferred | 3 | YES |  | 15,334,242 | 1260 | 12 | 444 | 34,537 | 12,170 | 1,277,854 |

 _Stats over successful runs. `LIMIT` = the run hit a usage-limit pause (duration/tok-sec distorted, token total fine). Repos kept under the run root for analysis._
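
The `tokens / sec` change for `builder-adversary` in this diff (median 11,670 to 12,226) follows from flagging rep 2 as `LIMIT`. Below is a small sketch of that exclusion rule, using the five raw rows above. The tuple layout and the `tok_per_sec_stats` helper are illustrative and are not taken from `analyze.py`.

```python
from statistics import median

# (rep, limit_flag, total_tokens, duration_s) for builder-adversary,
# copied from the "All runs (raw)" table. The tuple layout is illustrative.
RUNS = [
    (1, False, 11117474, 960),
    (2, True, 13579616, 1680),  # usage-limit pause: duration is inflated
    (3, False, 14960414, 1080),
    (4, False, 13037683, 1020),
    (5, False, 11903098, 1020),
]


def tok_per_sec_stats(runs, exclude_limit=True):
    """Return (min, median, max) tokens/sec, rounded to whole tokens."""
    rates = [
        total / dur
        for _rep, limit, total, dur in runs
        if not (exclude_limit and limit)
    ]
    return tuple(round(f(rates)) for f in (min, median, max))


print(tok_per_sec_stats(RUNS, exclude_limit=False))  # (8083, 11670, 13852)
print(tok_per_sec_stats(RUNS))                       # (11581, 12226, 13852)
```

The first tuple matches the removed table row and the second matches the added one. Token totals are unaffected by the flag, which is why the run stays in the per-variant token statistics.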
--- a/RESULTS-campaign.md.data
+++ b/RESULTS-campaign.md.data
@@ -0,0 +1,31 @@
+builder-adversary	1	YES	6118255	4999219	11117474	960	14	466
+builder-adversary	2	YES	7058221	6521395	13579616	1680	13	449
+builder-adversary	3	YES	7057033	7903381	14960414	1080	16	458
+builder-adversary	4	YES	6723564	6314119	13037683	1020	13	429
+builder-adversary	5	YES	6177117	5725981	11903098	1020	15	381
+builder-adversary-min	1	YES	4608722	4526996	9135718	780	15	367
+builder-adversary-min	2	YES	5692897	5693518	11386415	720	17	347
+builder-adversary-min	3	YES	5225139	4091537	9316676	1140	16	396
+builder-adversary-min	4	YES	4996985	4976828	9973813	660	14	347
+builder-adversary-min	5	NO	2479575	1213596	3693171	1800	4	128
+builder-adversary-min	1	YES	4508074	5264507	9772581	660	14	387
+builder-adversary-stateless	1	YES	5439341	5018236	10457577	900	15	400
+builder-adversary-stateless	2	YES	4958232	5034602	9992834	840	13	341
+builder-adversary-stateless	3	YES	5035212	5059218	10094430	900	14	463
+builder-adversary-stateless	4	YES	4736715	5385660	10122375	960	13	328
+builder-adversary-stateless	5	YES	7083535	5926257	13009792	1020	17	468
+builder-adversary-lean	1	YES	6605782	6356919	12962701	900	28	390
+builder-adversary-lean	2	YES	7290398	6118951	13409349	1080	28	451
+builder-adversary-lean	3	NO	3476619	3041803	6518422	1800	11	259
+builder-adversary-lean	4	YES	6208552	7607043	13815595	960	24	378
+builder-adversary-lean	5	YES	5959024	6142331	12101355	960	28	431
+builder-adversary-lean	1	YES	8030199	5763715	13793914	1020	25	354
+builder-solo	1	YES	2417528	0	2417528	360	5	401
+builder-solo	2	YES	2747457	0	2747457	420	7	450
+builder-solo	3	YES	2837115	0	2837115	420	6	426
+builder-solo	4	YES	2773634	0	2773634	420	5	398
+builder-solo	5	YES	2948467	0	2948467	420	4	446
+builder-adversary-deferred	1	YES	7375559	5991241	13366800	1020	10	425
+builder-adversary-deferred	2	YES	7803666	5084416	12888082	840	7	381
+builder-adversary-deferred	3	YES	8034336	7299906	15334242	1260	12	444
+builder-adversary-deferred	4	NO	4589433	3947296	8536729	2700	6	146
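
The raw rows can be pooled to re-check the correlation table. This is a sketch under stated assumptions. The column meanings (variant, rep, ok, builder tokens, adversary tokens, total tokens, duration in seconds, commits, LOC) are inferred from the data: the fourth and fifth fields sum to the sixth, and the fifth is 0 for `builder-solo`. The header file `RESULTS-campaign.md.data.hdr` is not shown here, and `parse_rows` and `pearson` are hypothetical helpers rather than the contents of `analyze.py`.

```python
from math import sqrt

# Rows copied from RESULTS-campaign.md.data (whitespace-separated here).
RAW = """
builder-adversary 1 YES 6118255 4999219 11117474 960 14 466
builder-adversary 2 YES 7058221 6521395 13579616 1680 13 449
builder-adversary 3 YES 7057033 7903381 14960414 1080 16 458
builder-adversary 4 YES 6723564 6314119 13037683 1020 13 429
builder-adversary 5 YES 6177117 5725981 11903098 1020 15 381
builder-adversary-min 1 YES 4608722 4526996 9135718 780 15 367
builder-adversary-min 2 YES 5692897 5693518 11386415 720 17 347
builder-adversary-min 3 YES 5225139 4091537 9316676 1140 16 396
builder-adversary-min 4 YES 4996985 4976828 9973813 660 14 347
builder-adversary-min 5 NO 2479575 1213596 3693171 1800 4 128
builder-adversary-min 1 YES 4508074 5264507 9772581 660 14 387
builder-adversary-stateless 1 YES 5439341 5018236 10457577 900 15 400
builder-adversary-stateless 2 YES 4958232 5034602 9992834 840 13 341
builder-adversary-stateless 3 YES 5035212 5059218 10094430 900 14 463
builder-adversary-stateless 4 YES 4736715 5385660 10122375 960 13 328
builder-adversary-stateless 5 YES 7083535 5926257 13009792 1020 17 468
builder-adversary-lean 1 YES 6605782 6356919 12962701 900 28 390
builder-adversary-lean 2 YES 7290398 6118951 13409349 1080 28 451
builder-adversary-lean 3 NO 3476619 3041803 6518422 1800 11 259
builder-adversary-lean 4 YES 6208552 7607043 13815595 960 24 378
builder-adversary-lean 5 YES 5959024 6142331 12101355 960 28 431
builder-adversary-lean 1 YES 8030199 5763715 13793914 1020 25 354
builder-solo 1 YES 2417528 0 2417528 360 5 401
builder-solo 2 YES 2747457 0 2747457 420 7 450
builder-solo 3 YES 2837115 0 2837115 420 6 426
builder-solo 4 YES 2773634 0 2773634 420 5 398
builder-solo 5 YES 2948467 0 2948467 420 4 446
builder-adversary-deferred 1 YES 7375559 5991241 13366800 1020 10 425
builder-adversary-deferred 2 YES 7803666 5084416 12888082 840 7 381
builder-adversary-deferred 3 YES 8034336 7299906 15334242 1260 12 444
builder-adversary-deferred 4 NO 4589433 3947296 8536729 2700 6 146
"""

# Assumed names for the six numeric columns (inferred, see lead-in).
FIELDS = ("builder", "adversary", "total", "duration", "commits", "loc")


def parse_rows(raw):
    """Parse raw rows into dicts; numeric columns become ints."""
    rows = []
    for line in raw.strip().splitlines():
        variant, rep, ok, *nums = line.split()
        row = {"variant": variant, "rep": int(rep), "ok": ok == "YES"}
        row.update(zip(FIELDS, map(int, nums)))
        rows.append(row)
    return rows


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rows = parse_rows(RAW)
good = [r for r in rows if r["ok"]]  # failed runs are excluded
tokens = [r["total"] for r in good]
corr = {
    key: pearson(tokens, [r[key] for r in good])
    for key in ("duration", "commits", "loc")
}

print(f"{len(good)} successful of {len(rows)} rows")
for key, r in corr.items():
    print(f"tokens vs {key:8s} r = {r:+.2f}")
```

With these 31 rows the sketch should report 28 successful runs and correlations close to the +0.83 / +0.65 / +0.01 listed in `RESULTS-campaign.md`.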