docs: FINDINGS.md — benchmark synthesis; track raw results data
Capstone summary of the Builder/Adversary prompt + verification-cadence study:
- adversary EXISTENCE costs ~4.7x (solo 2.8M vs ~13M); cadence is ~token-neutral
- context hygiene is the one clean -22% win; minimal prompts -25% but test less
- deferred review saves nothing (the one comprehensive pass is expensive) + late
- cost is process not product (tokens~duration 0.83, ~commits 0.79, ~LOC -0.04)
All results now in-repo: FINDINGS.md + RESULTS-campaign.md + raw .data + runners.
(deferred N=3, finalizing to N=5.)
2
.gitignore
vendored
@@ -4,5 +4,3 @@ __pycache__/
*.pyc
*.tmp
RESULTS-harness.md.tmp
-RESULTS-campaign.md.data
-RESULTS-campaign.md.data.hdr
120
FINDINGS.md
Normal file
@@ -0,0 +1,120 @@
# Findings: Builder/Adversary prompt & verification-cadence benchmark

A controlled study of what actually drives **token cost** in the agent-orchestrator Builder/Adversary
loop, on a fixed, well-specified task.

- **Task:** build a 3-phase Python calculator (lexer → parser → evaluator), each phase with 4–6
  cold-verifiable Definition-of-Done gates. Deliberately offline and deterministic so it stresses the
  *protocol*, not infrastructure.
- **How:** each variant was run autonomously to `SEQUENCE-COMPLETE` via the real harness
  (`agents.py up`: Builder + Adversary loop pair + watchdog), **5 runs each** (N=5; `deferred` N=3,
  finalizing). Both loops ran on **claude-sonnet-4-6**. Tokens are summed from each loop's Claude Code
  session transcripts. The deliverable is behaviorally identical across all variants (verified on a
  24-expression probe), so this compares like-for-like.
- **Full data:** [`RESULTS-campaign.md`](RESULTS-campaign.md) (analysis), `RESULTS-campaign.md.data`
  (raw per-run rows). Every run's git repo is preserved under `/tmp/ao-campaign-*` and `/tmp/ao-solo-*`.

## The variants

| variant | what changes | engine example |
|---|---|---|
| `builder-adversary` (orig) | the original full prompts; Adversary verifies **per phase** | `examples/builder-adversary` |
| `builder-adversary-min` | prompts compressed to minimal tokens | `examples/builder-adversary-min` |
| `builder-adversary-stateless` | orig + **context hygiene** (compact per checkpoint, read diffs not trees, lean loads) | `examples/builder-adversary-stateless` |
| `builder-adversary-lean` | orig + context hygiene + **per-gate** review (one claim/verdict per gate) | `examples/builder-adversary-lean` |
| `builder-adversary-deferred` | orig; Adversary verifies **once, after the whole build** (a final comprehensive `review` phase) | `examples/builder-adversary-deferred` |
| `builder-solo` | **no Adversary**: a single Builder that self-certifies | `examples/builder-solo` |

(stateless/lean/deferred are all built on the *full original* prompts, so they isolate their one
change without the minimal-prompt confound.)

## Headline results: median tokens (5 runs each; deferred N=3)

| variant | adversary verifies… | median tokens | vs orig | commits | LOC |
|---|---|--:|--:|--:|--:|
| **builder-solo** | never (self-certify) | **2.77M** | −79% | 5 | 426 |
| **min** | per phase *(minimal prompts)* | 9.77M | −25% | 15 | 367 |
| **stateless** | per phase *(+context hygiene)* | 10.12M | −22% | 14 | 400 |
| **orig** | per **phase** | 13.04M | baseline | 14 | 449 |
| **deferred** | once, after **whole build** | 13.37M | +3% | 10 | 425 |
| **lean** | per **gate** | 13.41M | +3% | 28 | 390 |

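The "vs orig" column and the ~4.7× figure can be re-derived from the medians. The sketch below is illustrative only, not the repo's `analyze.py`; the short variant keys and the helper name `pct_vs_baseline` are assumptions, and the exact medians are taken from the successful rows in `RESULTS-campaign.md.data`.

```python
# Median total tokens per variant (successful runs only), taken from the raw rows
# in RESULTS-campaign.md.data. The short keys are labels chosen for this sketch.
MEDIAN_TOKENS = {
    "builder-solo": 2_773_634,
    "min": 9_772_581,
    "stateless": 10_122_375,
    "orig": 13_037_683,
    "deferred": 13_366_800,
    "lean": 13_409_349,
}


def pct_vs_baseline(tokens: int, baseline: int) -> int:
    """Signed difference from the baseline, rounded to a whole percent."""
    return round(100 * (tokens - baseline) / baseline)


baseline = MEDIAN_TOKENS["orig"]
deltas = {name: pct_vs_baseline(tok, baseline) for name, tok in MEDIAN_TOKENS.items()}

# How many times more tokens the per-phase loop pair uses than the solo Builder.
adversary_multiplier = round(baseline / MEDIAN_TOKENS["builder-solo"], 1)

for name, delta in deltas.items():
    print(f"{name:13s} {MEDIAN_TOKENS[name] / 1e6:6.2f}M  {delta:+d}%")
print(f"orig / solo = {adversary_multiplier}x")
```

Rounding to whole percents is why `deferred` (+2.5%) and `lean` (+2.9%) both show as +3%.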
## The two big findings

### 1. The adversary's *existence* costs ~4.7×; its *cadence* barely matters.
The three cadence variants all land near **~13M tokens regardless of how the review is chunked**:
per-gate (`lean`, 28 commits), per-phase (`orig`, 14), or one deferred pass (`deferred`, 10).
`builder-solo` (no adversary) is **2.77M**. So the dominant cost is **whether an independent cold
re-verification happens at all**, not how it's scheduled. The verification *work* is roughly
conserved; chunking it finer or coarser mostly changes the commit/handshake count, which is itself
nearly token-neutral.

### 2. Deferred review was the surprise, and the loser.
Hypothesis: deferring to one pass would be cheapest (fewest handshakes ≈ solo-build + one review).
It wasn't: `deferred` ≈ `orig`. Handshakes *did* drop (10 commits), but the **single comprehensive
review is itself expensive** (the Adversary re-verifies the whole calculator + cross-feature probes
in one shot), so total tokens stayed put. It also carries the downside that the independent check
**arrives late**: late-rework risk, plus self-certification drift on the build phases. Worst of both
worlds for this task.

## The levers, ranked

1. **Drop the adversary → ~−79%**, but you lose all independent verification. On this clean,
   well-specified task `solo` produced correct calculators, so the adversary bought no *measured*
   quality here. It is still **insurance against self-certification rubber-stamping a bug**, whose
   value shows on ambiguous/underspecified work this benchmark can't stress.
2. **Context hygiene → −22%**: the **only clean win**. Same review effort (same median commit count
   as orig, comparable LOC), just less context carried and reloaded each turn. (`stateless` vs `orig`.)
3. **Minimal prompts → −25%, but not free**: ~⅓ of the saving comes from the agents writing **~25%
   fewer tests** (the compressed prompts drop the emphatic "try to break it / paste the output /
   a red test is information" language that drives thorough testing). Same features, thinner test
   suite.
4. **Review cadence → ~0%**: per-gate / per-phase / per-build are interchangeable on *cost*; choose
   for **quality and latency**, not tokens. Finer = earlier defect-catching at slight overhead;
   coarser = late but holistic (better at cross-feature bugs).

## Why: cost is *process*, not *product*

Pooled across all 28 successful runs:

| tokens vs | Pearson r |
|---|--:|
| duration | **+0.83** |
| commits (review rounds) | **+0.79** |
| LOC (code shipped) | **−0.04** |

Token cost tracks how long the loop **runs and verifies**, and is **uncorrelated with how much code
ships**. The deliverable (LOC, behavior) is near-constant across variants; the cost variance is all
process intensity.

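For readers who want to recompute a correlation like the ones in the table, here is a minimal sketch of Pearson's r in plain Python. The function name `pearson_r` and the `SAMPLE` list are assumptions made for illustration. `SAMPLE` holds only the five `builder-adversary` rows from `RESULTS-campaign.md.data`, so its values will not match the pooled figures, which need all successful rows.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# (total tokens, duration in seconds, commits, LOC) for the five builder-adversary
# runs; a small subset used here purely as example input.
SAMPLE = [
    (11_117_474, 960, 14, 466),
    (13_579_616, 1680, 13, 449),
    (14_960_414, 1080, 16, 458),
    (13_037_683, 1020, 13, 429),
    (11_903_098, 1020, 15, 381),
]

tokens = [row[0] for row in SAMPLE]
for label, column in (("duration", 1), ("commits", 2), ("LOC", 3)):
    r = pearson_r(tokens, [row[column] for row in SAMPLE])
    print(f"tokens vs {label}: r = {r:+.2f}")
```

With only five points the coefficients are noisy; the pooled table is the figure to rely on.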
## Methodology notes & caveats

- **N matters.** A single full-loop run is wildly nondeterministic: the *same* variant varied **±55%**
  run-to-run early on, which is why this is N=5. (An early single-run claim of "context hygiene halves
  tokens" did **not** reproduce; the real, stable figure is −22%.)
- **Variance source:** number of review rounds / retries, not output size.
- **Real failures excluded** (2 of 27 loop-pair runs): a wedge and a usage-limit/timeout collision;
  superseded by clean re-runs. `LIMIT`-flagged runs (a usage-limit *pause* inflates duration without
  adding tokens) are kept for token totals but excluded from `tokens/sec`.
- **Scope:** one task, one model (Sonnet), one harness. The *relative* findings should generalize;
  absolute numbers are task-specific. The adversary's quality value is **not** measured here (the task
  is too well-specified to stress it).

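The `LIMIT` rule can be illustrated with the five `builder-adversary` runs, where rep 2 hit a usage-limit pause. This is a sketch under assumptions: the `Run` dataclass and its field names are invented for the example, and it is not necessarily how `analyze.py` is written.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Run:
    rep: int
    total_tokens: int
    duration_s: int
    limit: bool = False  # True when the run hit a usage-limit pause


# The five builder-adversary runs; rep 2 carries the LIMIT flag.
RUNS = [
    Run(1, 11_117_474, 960),
    Run(2, 13_579_616, 1680, limit=True),
    Run(3, 14_960_414, 1080),
    Run(4, 13_037_683, 1020),
    Run(5, 11_903_098, 1020),
]

# Token totals keep every successful run, including the LIMIT-flagged one.
median_tokens = median(r.total_tokens for r in RUNS)

# tokens/sec drops LIMIT-flagged runs, because the pause inflates their duration.
tokens_per_sec = [r.total_tokens / r.duration_s for r in RUNS if not r.limit]

print(f"median tokens: {median_tokens:,}")
print(
    "tokens/sec min / median / max: "
    f"{min(tokens_per_sec):,.0f} / {median(tokens_per_sec):,.0f} / {max(tokens_per_sec):,.0f}"
)
```

Leaving rep 2 in the rate would pull the minimum down to about 8,083 tokens/sec, which measures the pause rather than the loop.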
## Practical guidance

- **Want to cut tokens without losing the independent check?** Use **context hygiene** (the `stateless`
  pattern). It's the only free lunch.
- **Don't pay for minimal prompts with test coverage**: keep the emphatic testing language unless you
  genuinely want less testing.
- **Pick review cadence for the work, not the bill:** per-gate to catch regressions early in long
  phases; per-phase as a sane default; deferred only when features are independent and cheap to fix
  late (it saves nothing and checks late).
- **`solo` is ~5× cheaper**: reasonable for low-stakes / well-specified work, but you're trusting the
  builder to grade its own homework.

---
_Artifacts in this repo: `FINDINGS.md` (this summary), `RESULTS-campaign.md` (per-variant stats +
ratios + full raw table), `RESULTS-campaign.md.data` (raw rows), `analyze.py` (regenerates the
analysis), `run-harness-bench.sh` / `run-solo-bench.sh` (the runners), `plans/calc/` (the task), and
the six `engine/examples/builder-adversary*` variants. `deferred` is N=3 here and is being finalized
to N=5; its median is stable (spread 1.19×)._
RESULTS-campaign.md
@@ -1,6 +1,6 @@
# Full-harness benchmark — campaign analysis

-Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops on Sonnet. Tokens summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank `calc/*.py` lines (code + tests). 25 successful runs of 27 total.
+Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops on Sonnet. Tokens summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank `calc/*.py` lines (code + tests). 28 successful runs of 30 total.

## Per-variant total tokens (successful runs)

@@ -11,6 +11,7 @@ Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase c
| builder-adversary-stateless | 5/5 | 10,122,375 | 10,735,401 | 9,992,834 | 13,009,792 | 1.30x |
| builder-adversary-lean | 5/6 | 13,409,349 | 13,216,582 | 12,101,355 | 13,815,595 | 1.14x |
| builder-solo | 5/5 | 2,773,634 | 2,744,840 | 2,417,528 | 2,948,467 | 1.22x |
+| builder-adversary-deferred | 3/3 | 13,366,800 | 13,863,041 | 12,888,082 | 15,334,242 | 1.19x |

## Efficiency ratios — min / median / max (successful runs)
@@ -18,12 +19,13 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration

| variant | tokens / LOC | tokens / sec | tokens / commit |
|---|--:|--:|--:|
-| builder-adversary | 23,857 / 30,391 / 32,665 | 8,083 / 11,670 / 13,852 | 793,540 / 935,026 / 1,044,586 |
+| builder-adversary | 23,857 / 30,391 / 32,665 | 11,581 / 12,226 / 13,852 | 793,540 / 935,026 / 1,044,586 |
| builder-adversary-min | 23,527 / 25,252 / 32,814 | 8,173 / 14,807 / 15,814 | 582,292 / 669,789 / 712,415 |
| builder-adversary-stateless | 21,802 / 27,799 / 30,861 | 10,544 / 11,620 / 12,755 | 697,172 / 765,282 / 778,644 |
| builder-adversary-lean | 28,077 / 33,238 / 38,966 | 12,416 / 13,523 / 14,403 | 432,191 / 478,905 / 575,650 |
| builder-solo | 6,029 / 6,611 / 6,969 | 6,542 / 6,715 / 7,020 | 392,494 / 483,506 / 737,117 |
-| **all** | 6,029 / 28,077 / 38,966 | 6,542 / 11,712 / 15,814 | 392,494 / 697,172 / 1,044,586 |
+| builder-adversary-deferred | 31,451 / 33,827 / 34,537 | 12,170 / 13,105 / 15,343 | 1,277,854 / 1,336,680 / 1,841,155 |
+| **all** | 6,029 / 29,024 / 38,966 | 6,542 / 12,170 / 15,814 | 392,494 / 705,228 / 1,841,155 |

## Per-variant medians (commits / LOC / duration)

@@ -34,21 +36,22 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration
| builder-adversary-stateless | 14 | 400 | 900 |
| builder-adversary-lean | 28 | 390 | 960 |
| builder-solo | 5 | 426 | 420 |
+| builder-adversary-deferred | 10 | 425 | 1020 |

-## Correlations with total tokens (pooled, n=25)
+## Correlations with total tokens (pooled, n=28)

| tokens vs | Pearson r |
|---|--:|
| duration | +0.83 |
-| commits | +0.79 |
-| LOC | -0.04 |
+| commits | +0.65 |
+| LOC | +0.01 |

## All runs (raw)

| variant | rep | ok | limit | total | dur(s) | commits | LOC | tok/LOC | tok/sec | tok/commit |
|---|:--:|:--:|:--:|--:|--:|--:|--:|--:|--:|--:|
| builder-adversary | 1 | YES | | 11,117,474 | 960 | 14 | 466 | 23,857 | 11,581 | 794,105 |
-| builder-adversary | 2 | YES | | 13,579,616 | 1680 | 13 | 449 | 30,244 | 8,083 | 1,044,586 |
+| builder-adversary | 2 | YES | LIMIT | 13,579,616 | 1680 | 13 | 449 | 30,244 | 8,083 | 1,044,586 |
| builder-adversary | 3 | YES | | 14,960,414 | 1080 | 16 | 458 | 32,665 | 13,852 | 935,026 |
| builder-adversary | 4 | YES | | 13,037,683 | 1020 | 13 | 429 | 30,391 | 12,782 | 1,002,899 |
| builder-adversary | 5 | YES | | 11,903,098 | 1020 | 15 | 381 | 31,242 | 11,670 | 793,540 |
@@ -74,5 +77,8 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration
| builder-solo | 3 | YES | | 2,837,115 | 420 | 6 | 426 | 6,660 | 6,755 | 472,852 |
| builder-solo | 4 | YES | | 2,773,634 | 420 | 5 | 398 | 6,969 | 6,604 | 554,727 |
| builder-solo | 5 | YES | | 2,948,467 | 420 | 4 | 446 | 6,611 | 7,020 | 737,117 |
+| builder-adversary-deferred | 1 | YES | | 13,366,800 | 1020 | 10 | 425 | 31,451 | 13,105 | 1,336,680 |
+| builder-adversary-deferred | 2 | YES | | 12,888,082 | 840 | 7 | 381 | 33,827 | 15,343 | 1,841,155 |
+| builder-adversary-deferred | 3 | YES | | 15,334,242 | 1260 | 12 | 444 | 34,537 | 12,170 | 1,277,854 |

_Stats over successful runs. `LIMIT` = the run hit a usage-limit pause (duration/tok-sec distorted, token total fine). Repos kept under the run root for analysis._
31
RESULTS-campaign.md.data
Normal file
@@ -0,0 +1,31 @@
builder-adversary 1 YES 6118255 4999219 11117474 960 14 466
builder-adversary 2 YES 7058221 6521395 13579616 1680 13 449
builder-adversary 3 YES 7057033 7903381 14960414 1080 16 458
builder-adversary 4 YES 6723564 6314119 13037683 1020 13 429
builder-adversary 5 YES 6177117 5725981 11903098 1020 15 381
builder-adversary-min 1 YES 4608722 4526996 9135718 780 15 367
builder-adversary-min 2 YES 5692897 5693518 11386415 720 17 347
builder-adversary-min 3 YES 5225139 4091537 9316676 1140 16 396
builder-adversary-min 4 YES 4996985 4976828 9973813 660 14 347
builder-adversary-min 5 NO 2479575 1213596 3693171 1800 4 128
builder-adversary-min 1 YES 4508074 5264507 9772581 660 14 387
builder-adversary-stateless 1 YES 5439341 5018236 10457577 900 15 400
builder-adversary-stateless 2 YES 4958232 5034602 9992834 840 13 341
builder-adversary-stateless 3 YES 5035212 5059218 10094430 900 14 463
builder-adversary-stateless 4 YES 4736715 5385660 10122375 960 13 328
builder-adversary-stateless 5 YES 7083535 5926257 13009792 1020 17 468
builder-adversary-lean 1 YES 6605782 6356919 12962701 900 28 390
builder-adversary-lean 2 YES 7290398 6118951 13409349 1080 28 451
builder-adversary-lean 3 NO 3476619 3041803 6518422 1800 11 259
builder-adversary-lean 4 YES 6208552 7607043 13815595 960 24 378
builder-adversary-lean 5 YES 5959024 6142331 12101355 960 28 431
builder-adversary-lean 1 YES 8030199 5763715 13793914 1020 25 354
builder-solo 1 YES 2417528 0 2417528 360 5 401
builder-solo 2 YES 2747457 0 2747457 420 7 450
builder-solo 3 YES 2837115 0 2837115 420 6 426
builder-solo 4 YES 2773634 0 2773634 420 5 398
builder-solo 5 YES 2948467 0 2948467 420 4 446
builder-adversary-deferred 1 YES 7375559 5991241 13366800 1020 10 425
builder-adversary-deferred 2 YES 7803666 5084416 12888082 840 7 381
builder-adversary-deferred 3 YES 8034336 7299906 15334242 1260 12 444
builder-adversary-deferred 4 NO 4589433 3947296 8536729 2700 6 146
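The raw rows have no header line. The sketch below assumes a column order of variant, rep, ok, Builder tokens, Adversary tokens, total tokens, duration (s), commits, LOC. That order is inferred from the data (column 6 equals column 4 plus column 5, and `builder-solo` has 0 in column 5) and is not documented in the commit. The names `FIELDS`, `parse_rows` and `summarize` are made up for the example, which recomputes the median and max/min spread for two variants.

```python
from collections import defaultdict
from statistics import median

# A subset of RESULTS-campaign.md.data, embedded so the sketch is self-contained.
RAW = """\
builder-solo 1 YES 2417528 0 2417528 360 5 401
builder-solo 2 YES 2747457 0 2747457 420 7 450
builder-solo 3 YES 2837115 0 2837115 420 6 426
builder-solo 4 YES 2773634 0 2773634 420 5 398
builder-solo 5 YES 2948467 0 2948467 420 4 446
builder-adversary-deferred 1 YES 7375559 5991241 13366800 1020 10 425
builder-adversary-deferred 2 YES 7803666 5084416 12888082 840 7 381
builder-adversary-deferred 3 YES 8034336 7299906 15334242 1260 12 444
builder-adversary-deferred 4 NO 4589433 3947296 8536729 2700 6 146
"""

# Assumed column order (inferred from the data, not stated in the repo).
FIELDS = (
    "variant", "rep", "ok", "builder_tokens", "adversary_tokens",
    "total_tokens", "duration_s", "commits", "loc",
)


def parse_rows(text):
    """Split whitespace-separated rows into dicts, converting numeric columns to int."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip blank or malformed lines
        row = dict(zip(FIELDS, parts))
        for key in ("rep",) + FIELDS[3:]:
            row[key] = int(row[key])
        rows.append(row)
    return rows


def summarize(rows):
    """Per-variant run count, median total tokens and max/min spread over successful runs."""
    totals = defaultdict(list)
    for row in rows:
        if row["ok"] == "YES":
            totals[row["variant"]].append(row["total_tokens"])
    return {
        variant: {
            "n": len(values),
            "median": median(values),
            "spread": round(max(values) / min(values), 2),
        }
        for variant, values in totals.items()
    }


rows = parse_rows(RAW)
summary = summarize(rows)
for variant, stats in summary.items():
    print(f"{variant}: n={stats['n']} median={stats['median']:,} spread={stats['spread']}x")
```

Filtering on `ok == "YES"` is what keeps the failed `deferred` rep 4 out of the N=3 median.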