From 3bf33165726881914b16239d2ff2625aefae3590 Mon Sep 17 00:00:00 2001 From: mfowler Date: Tue, 16 Jun 2026 01:53:34 +0000 Subject: [PATCH] =?UTF-8?q?docs:=20FINDINGS.md=20=E2=80=94=20benchmark=20s?= =?UTF-8?q?ynthesis;=20track=20raw=20results=20data?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Capstone summary of the Builder/Adversary prompt + verification-cadence study: - adversary EXISTENCE costs ~4.7x (solo 2.8M vs ~13M); cadence is ~token-neutral - context hygiene is the one clean -22% win; minimal prompts -25% but test less - deferred review saves nothing (the one comprehensive pass is expensive) + late - cost is process not product (tokens~duration 0.83, ~commits 0.79, ~LOC -0.04) All results now in-repo: FINDINGS.md + RESULTS-campaign.md + raw .data + runners. (deferred N=3, finalizing to N=5.) --- .gitignore | 2 - FINDINGS.md | 120 +++++++++++++++++++++++++++++++++++++++ RESULTS-campaign.md | 20 ++++--- RESULTS-campaign.md.data | 31 ++++++++++ 4 files changed, 164 insertions(+), 9 deletions(-) create mode 100644 FINDINGS.md create mode 100644 RESULTS-campaign.md.data diff --git a/.gitignore b/.gitignore index 44a49db..b0d1206 100644 --- a/.gitignore +++ b/.gitignore @@ -4,5 +4,3 @@ __pycache__/ *.pyc *.tmp RESULTS-harness.md.tmp -RESULTS-campaign.md.data -RESULTS-campaign.md.data.hdr diff --git a/FINDINGS.md b/FINDINGS.md new file mode 100644 index 0000000..43ac686 --- /dev/null +++ b/FINDINGS.md @@ -0,0 +1,120 @@ +# Findings — Builder/Adversary prompt & verification-cadence benchmark + +A controlled study of what actually drives **token cost** in the agent-orchestrator Builder/Adversary +loop, on a fixed, well-specified task. + +- **Task:** build a 3-phase Python calculator (lexer → parser → evaluator), each phase with 4–6 + cold-verifiable Definition-of-Done gates. Deliberately offline and deterministic so it stresses the + *protocol*, not infrastructure. +- **How:** each variant run autonomously to `SEQUENCE-COMPLETE` via the real harness (`agents.py up` + — Builder + Adversary loop pair + watchdog), **5 runs each** (N=5; `deferred` N=3, finalizing). + Both loops on **claude-sonnet-4-6**. Tokens summed from each loop's Claude Code session + transcripts. The deliverable is behaviorally identical across all variants (verified on a + 24-expression probe), so this compares like-for-like. +- **Full data:** [`RESULTS-campaign.md`](RESULTS-campaign.md) (analysis), `RESULTS-campaign.md.data` + (raw per-run rows). Every run's git repo is preserved under `/tmp/ao-campaign-*` and `/tmp/ao-solo-*`. + +## The variants + +| variant | what changes | engine example | +|---|---|---| +| `builder-adversary` (orig) | the original full prompts; Adversary verifies **per phase** | `examples/builder-adversary` | +| `builder-adversary-min` | prompts compressed to minimal tokens | `examples/builder-adversary-min` | +| `builder-adversary-stateless` | orig + **context hygiene** (compact per checkpoint, read diffs not trees, lean loads) | `examples/builder-adversary-stateless` | +| `builder-adversary-lean` | orig + context hygiene + **per-gate** review (one claim/verdict per gate) | `examples/builder-adversary-lean` | +| `builder-adversary-deferred` | orig; Adversary verifies **once, after the whole build** (a final comprehensive `review` phase) | `examples/builder-adversary-deferred` | +| `builder-solo` | **no Adversary** — a single Builder that self-certifies | `examples/builder-solo` | + +(stateless/lean/deferred are all built on the *full original* prompts, so they isolate their one +change without the minimal-prompt confound.) + +## Headline results — median tokens (5 runs each; deferred N=3) + +| variant | adversary verifies… | median tokens | vs orig | commits | LOC | +|---|---|--:|--:|--:|--:| +| **builder-solo** | never (self-certify) | **2.77M** | −79% | 5 | 426 | +| **min** | per phase *(minimal prompts)* | 9.77M | −25% | 15 | 367 | +| **stateless** | per phase *(+context hygiene)* | 10.12M | −22% | 14 | 400 | +| **orig** | per **phase** | 13.04M | — | 14 | 449 | +| **deferred** | once, after **whole build** | 13.37M | +3% | 10 | 425 | +| **lean** | per **gate** | 13.41M | +3% | 28 | 390 | + +## The two big findings + +### 1. The adversary's *existence* costs ~4.7× — its *cadence* barely matters. +Every loop-pair variant lands near **~13M tokens regardless of how the review is chunked** — +per-gate (`lean`, 28 commits), per-phase (`orig`, 14), or one deferred pass (`deferred`, 10). +`builder-solo` (no adversary) is **2.77M**. So the dominant cost is **whether an independent cold +re-verification happens at all**, not how it's scheduled. The verification *work* is roughly +conserved; chunking it finer or coarser mostly changes the commit/handshake count — which is itself +nearly token-neutral. + +### 2. Deferred review was the surprise — and the loser. +Hypothesis: deferring to one pass would be cheapest (fewest handshakes ≈ solo-build + one review). +It wasn't — `deferred` ≈ `orig`. Handshakes *did* drop (10 commits), but the **single comprehensive +review is itself expensive** (the Adversary re-verifies the whole calculator + cross-feature probes +in one shot), so total tokens stayed put. And it carries the downside that the independent check +**arrives late** — late-rework risk, plus self-certification drift on the build phases. Worst of both +for this task. + +## The levers, ranked + +1. **Drop the adversary → ~−79%** — but you lose all independent verification. On this clean, + well-specified task `solo` produced correct calculators, so the adversary bought no *measured* + quality here — but it is **insurance against self-certification rubber-stamping a bug**, whose + value shows on ambiguous/underspecified work this benchmark can't stress. +2. **Context hygiene → −22%** — the **only clean win**: same review effort (same commits/LOC as + orig), just less context carried and reloaded each turn. (`stateless` vs `orig`.) +3. **Minimal prompts → −25%, but not free** — ~⅓ of the saving comes from the agents writing **~25% + fewer tests** (the compressed prompts drop the emphatic "try to break it / paste the output / + a red test is information" language that drives thorough testing). Same features, thinner test + suite. +4. **Review cadence → ~0%** — per-gate / per-phase / per-build are interchangeable on *cost*; choose + for **quality and latency**, not tokens: finer = earlier defect-catching at slight overhead; + coarser = late but holistic (better at cross-feature bugs). + +## Why: cost is *process*, not *product* + +Pooled across all 28 successful runs: + +| tokens vs | Pearson r | +|---|--:| +| duration | **+0.83** | +| commits (review rounds) | **+0.79** | +| LOC (code shipped) | **−0.04** | + +Token cost tracks how long the loop **runs and verifies**, and is **uncorrelated with how much code +ships**. The deliverable (LOC, behavior) is near-constant across variants; the cost variance is all +process intensity. + +## Methodology notes & caveats + +- **N matters.** A single full-loop run is wildly nondeterministic: the *same* variant varied **±55%** + run-to-run early on, which is why this is N=5. (An early single-run claim of "context hygiene halves + tokens" did **not** reproduce — the real, stable figure is −22%.) +- **Variance source:** number of review rounds / retries, not output size. +- **Real failures excluded** (2 of 27 loop-pair runs): a wedge and a usage-limit/timeout collision; + superseded by clean re-runs. `LIMIT`-flagged runs (a usage-limit *pause* inflates duration without + adding tokens) are kept for token totals but excluded from `tokens/sec`. +- **Scope:** one task, one model (Sonnet), one harness. The *relative* findings should generalize; + absolute numbers are task-specific. The adversary's quality value is **not** measured here (the task + is too well-specified to stress it). + +## Practical guidance + +- **Want to cut tokens without losing the independent check?** Use **context hygiene** (the `stateless` + pattern). It's the only free lunch. +- **Don't pay for minimal prompts with test coverage** — keep the emphatic testing language unless you + genuinely want less testing. +- **Pick review cadence for the work, not the bill:** per-gate to catch regressions early in long + phases; per-phase as a sane default; deferred only when features are independent and cheap to fix + late (it saves nothing and checks late). +- **`solo` is ~5× cheaper** — reasonable for low-stakes / well-specified work, but you're trusting the + builder to grade its own homework. + +--- +_Artifacts in this repo: `FINDINGS.md` (this summary), `RESULTS-campaign.md` (per-variant stats + +ratios + full raw table), `RESULTS-campaign.md.data` (raw rows), `analyze.py` (regenerates the +analysis), `run-harness-bench.sh` / `run-solo-bench.sh` (the runners), `plans/calc/` (the task), and +the six `engine/examples/builder-adversary*` variants. `deferred` is N=3 here and is being finalized +to N=5; its median is stable (spread 1.19×)._ diff --git a/RESULTS-campaign.md b/RESULTS-campaign.md index d4557d1..77d8475 100644 --- a/RESULTS-campaign.md +++ b/RESULTS-campaign.md @@ -1,6 +1,6 @@ # Full-harness benchmark — campaign analysis -Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops on Sonnet. Tokens summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank `calc/*.py` lines (code + tests). 25 successful runs of 27 total. +Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops on Sonnet. Tokens summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank `calc/*.py` lines (code + tests). 28 successful runs of 30 total. ## Per-variant total tokens (successful runs) @@ -11,6 +11,7 @@ Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase c | builder-adversary-stateless | 5/5 | 10,122,375 | 10,735,401 | 9,992,834 | 13,009,792 | 1.30x | | builder-adversary-lean | 5/6 | 13,409,349 | 13,216,582 | 12,101,355 | 13,815,595 | 1.14x | | builder-solo | 5/5 | 2,773,634 | 2,744,840 | 2,417,528 | 2,948,467 | 1.22x | +| builder-adversary-deferred | 3/3 | 13,366,800 | 13,863,041 | 12,888,082 | 15,334,242 | 1.19x | ## Efficiency ratios — min / median / max (successful runs) @@ -18,12 +19,13 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration | variant | tokens / LOC | tokens / sec | tokens / commit | |---|--:|--:|--:| -| builder-adversary | 23,857 / 30,391 / 32,665 | 8,083 / 11,670 / 13,852 | 793,540 / 935,026 / 1,044,586 | +| builder-adversary | 23,857 / 30,391 / 32,665 | 11,581 / 12,226 / 13,852 | 793,540 / 935,026 / 1,044,586 | | builder-adversary-min | 23,527 / 25,252 / 32,814 | 8,173 / 14,807 / 15,814 | 582,292 / 669,789 / 712,415 | | builder-adversary-stateless | 21,802 / 27,799 / 30,861 | 10,544 / 11,620 / 12,755 | 697,172 / 765,282 / 778,644 | | builder-adversary-lean | 28,077 / 33,238 / 38,966 | 12,416 / 13,523 / 14,403 | 432,191 / 478,905 / 575,650 | | builder-solo | 6,029 / 6,611 / 6,969 | 6,542 / 6,715 / 7,020 | 392,494 / 483,506 / 737,117 | -| **all** | 6,029 / 28,077 / 38,966 | 6,542 / 11,712 / 15,814 | 392,494 / 697,172 / 1,044,586 | +| builder-adversary-deferred | 31,451 / 33,827 / 34,537 | 12,170 / 13,105 / 15,343 | 1,277,854 / 1,336,680 / 1,841,155 | +| **all** | 6,029 / 29,024 / 38,966 | 6,542 / 12,170 / 15,814 | 392,494 / 705,228 / 1,841,155 | ## Per-variant medians (commits / LOC / duration) @@ -34,21 +36,22 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration | builder-adversary-stateless | 14 | 400 | 900 | | builder-adversary-lean | 28 | 390 | 960 | | builder-solo | 5 | 426 | 420 | +| builder-adversary-deferred | 10 | 425 | 1020 | -## Correlations with total tokens (pooled, n=25) +## Correlations with total tokens (pooled, n=28) | tokens vs | Pearson r | |---|--:| | duration | +0.83 | -| commits | +0.79 | -| LOC | -0.04 | +| commits | +0.65 | +| LOC | +0.01 | ## All runs (raw) | variant | rep | ok | limit | total | dur(s) | commits | LOC | tok/LOC | tok/sec | tok/commit | |---|:--:|:--:|:--:|--:|--:|--:|--:|--:|--:|--:| | builder-adversary | 1 | YES | | 11,117,474 | 960 | 14 | 466 | 23,857 | 11,581 | 794,105 | -| builder-adversary | 2 | YES | | 13,579,616 | 1680 | 13 | 449 | 30,244 | 8,083 | 1,044,586 | +| builder-adversary | 2 | YES | LIMIT | 13,579,616 | 1680 | 13 | 449 | 30,244 | 8,083 | 1,044,586 | | builder-adversary | 3 | YES | | 14,960,414 | 1080 | 16 | 458 | 32,665 | 13,852 | 935,026 | | builder-adversary | 4 | YES | | 13,037,683 | 1020 | 13 | 429 | 30,391 | 12,782 | 1,002,899 | | builder-adversary | 5 | YES | | 11,903,098 | 1020 | 15 | 381 | 31,242 | 11,670 | 793,540 | @@ -74,5 +77,8 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration | builder-solo | 3 | YES | | 2,837,115 | 420 | 6 | 426 | 6,660 | 6,755 | 472,852 | | builder-solo | 4 | YES | | 2,773,634 | 420 | 5 | 398 | 6,969 | 6,604 | 554,727 | | builder-solo | 5 | YES | | 2,948,467 | 420 | 4 | 446 | 6,611 | 7,020 | 737,117 | +| builder-adversary-deferred | 1 | YES | | 13,366,800 | 1020 | 10 | 425 | 31,451 | 13,105 | 1,336,680 | +| builder-adversary-deferred | 2 | YES | | 12,888,082 | 840 | 7 | 381 | 33,827 | 15,343 | 1,841,155 | +| builder-adversary-deferred | 3 | YES | | 15,334,242 | 1260 | 12 | 444 | 34,537 | 12,170 | 1,277,854 | _Stats over successful runs. `LIMIT` = the run hit a usage-limit pause (duration/tok-sec distorted, token total fine). Repos kept under the run root for analysis._ diff --git a/RESULTS-campaign.md.data b/RESULTS-campaign.md.data new file mode 100644 index 0000000..9beb80e --- /dev/null +++ b/RESULTS-campaign.md.data @@ -0,0 +1,31 @@ +builder-adversary 1 YES 6118255 4999219 11117474 960 14 466 +builder-adversary 2 YES 7058221 6521395 13579616 1680 13 449 +builder-adversary 3 YES 7057033 7903381 14960414 1080 16 458 +builder-adversary 4 YES 6723564 6314119 13037683 1020 13 429 +builder-adversary 5 YES 6177117 5725981 11903098 1020 15 381 +builder-adversary-min 1 YES 4608722 4526996 9135718 780 15 367 +builder-adversary-min 2 YES 5692897 5693518 11386415 720 17 347 +builder-adversary-min 3 YES 5225139 4091537 9316676 1140 16 396 +builder-adversary-min 4 YES 4996985 4976828 9973813 660 14 347 +builder-adversary-min 5 NO 2479575 1213596 3693171 1800 4 128 +builder-adversary-min 1 YES 4508074 5264507 9772581 660 14 387 +builder-adversary-stateless 1 YES 5439341 5018236 10457577 900 15 400 +builder-adversary-stateless 2 YES 4958232 5034602 9992834 840 13 341 +builder-adversary-stateless 3 YES 5035212 5059218 10094430 900 14 463 +builder-adversary-stateless 4 YES 4736715 5385660 10122375 960 13 328 +builder-adversary-stateless 5 YES 7083535 5926257 13009792 1020 17 468 +builder-adversary-lean 1 YES 6605782 6356919 12962701 900 28 390 +builder-adversary-lean 2 YES 7290398 6118951 13409349 1080 28 451 +builder-adversary-lean 3 NO 3476619 3041803 6518422 1800 11 259 +builder-adversary-lean 4 YES 6208552 7607043 13815595 960 24 378 +builder-adversary-lean 5 YES 5959024 6142331 12101355 960 28 431 +builder-adversary-lean 1 YES 8030199 5763715 13793914 1020 25 354 +builder-solo 1 YES 2417528 0 2417528 360 5 401 +builder-solo 2 YES 2747457 0 2747457 420 7 450 +builder-solo 3 YES 2837115 0 2837115 420 6 426 +builder-solo 4 YES 2773634 0 2773634 420 5 398 +builder-solo 5 YES 2948467 0 2948467 420 4 446 +builder-adversary-deferred 1 YES 7375559 5991241 13366800 1020 10 425 +builder-adversary-deferred 2 YES 7803666 5084416 12888082 840 7 381 +builder-adversary-deferred 3 YES 8034336 7299906 15334242 1260 12 444 +builder-adversary-deferred 4 NO 4589433 3947296 8536729 2700 6 146