# Findings — Builder/Adversary prompt & verification-cadence benchmark A controlled study of what actually drives **token cost** in the agent-orchestrator Builder/Adversary loop, on a fixed, well-specified task. - **Task:** build a 3-phase Python calculator (lexer → parser → evaluator), each phase with 4–6 cold-verifiable Definition-of-Done gates. Deliberately offline and deterministic so it stresses the *protocol*, not infrastructure. - **How:** each variant run autonomously to `SEQUENCE-COMPLETE` via the real harness (`agents.py up` — Builder + Adversary loop pair + watchdog), **5 runs each** (N=5). Both loops on **claude-sonnet-4-6**. Tokens summed from each loop's Claude Code session transcripts. The deliverable is behaviorally identical across all variants (verified on a 24-expression probe), so this compares like-for-like. - **Full data:** [`RESULTS-campaign.md`](RESULTS-campaign.md) (analysis), `RESULTS-campaign.md.data` (raw per-run rows). Every run's git repo is preserved under `/tmp/ao-campaign-*` and `/tmp/ao-solo-*`. ## The variants | variant | what changes | engine example | |---|---|---| | `builder-adversary` (orig) | the original full prompts; Adversary verifies **per phase** | `examples/builder-adversary` | | `builder-adversary-min` | prompts compressed to minimal tokens | `examples/builder-adversary-min` | | `builder-adversary-stateless` | orig + **context hygiene** (compact per checkpoint, read diffs not trees, lean loads) | `examples/builder-adversary-stateless` | | `builder-adversary-lean` | orig + context hygiene + **per-gate** review (one claim/verdict per gate) | `examples/builder-adversary-lean` | | `builder-adversary-deferred` | orig; Adversary verifies **once, after the whole build** (a final comprehensive `review` phase) | `examples/builder-adversary-deferred` | | `builder-solo` | **no Adversary** — a single Builder that self-certifies | `examples/builder-solo` | (stateless/lean/deferred are all built on the *full original* prompts, so they isolate their one change without the minimal-prompt confound.) ## Headline results — median tokens (5 runs each) | variant | adversary verifies… | median tokens | vs orig | commits | LOC | |---|---|--:|--:|--:|--:| | **builder-solo** | never (self-certify) | **2.77M** | −79% | 5 | 426 | | **min** | per phase *(minimal prompts)* | 9.77M | −25% | 15 | 367 | | **stateless** | per phase *(+context hygiene)* | 10.12M | −22% | 14 | 400 | | **orig** | per **phase** | 13.04M | — | 14 | 449 | | **deferred** | once, after **whole build** | 12.89M | −1% | 12 | 425 | | **lean** | per **gate** | 13.41M | +3% | 28 | 390 | ## The two big findings ### 1. The adversary's *existence* costs ~4.7× — its *cadence* barely matters. Every loop-pair variant lands near **~13M tokens regardless of how the review is chunked** — per-gate (`lean`, 28 commits), per-phase (`orig`, 14), or one deferred pass (`deferred`, 10). `builder-solo` (no adversary) is **2.77M**. So the dominant cost is **whether an independent cold re-verification happens at all**, not how it's scheduled. The verification *work* is roughly conserved; chunking it finer or coarser mostly changes the commit/handshake count — which is itself nearly token-neutral. ### 2. Deferred review was the surprise — and the loser. Hypothesis: deferring to one pass would be cheapest (fewest handshakes ≈ solo-build + one review). It wasn't — `deferred` ≈ `orig`. Handshakes *did* drop (10 commits), but the **single comprehensive review is itself expensive** (the Adversary re-verifies the whole calculator + cross-feature probes in one shot), so total tokens stayed put. And it carries the downside that the independent check **arrives late** — late-rework risk, plus self-certification drift on the build phases. Worst of both for this task. ## The levers, ranked 1. **Drop the adversary → ~−79%** — but you lose all independent verification. On this clean, well-specified task `solo` produced correct calculators, so the adversary bought no *measured* quality here — but it is **insurance against self-certification rubber-stamping a bug**, whose value shows on ambiguous/underspecified work this benchmark can't stress. 2. **Context hygiene → −22%** — the **only clean win**: same review effort (same commits/LOC as orig), just less context carried and reloaded each turn. (`stateless` vs `orig`.) 3. **Minimal prompts → −25%, but not free** — ~⅓ of the saving comes from the agents writing **~25% fewer tests** (the compressed prompts drop the emphatic "try to break it / paste the output / a red test is information" language that drives thorough testing). Same features, thinner test suite. 4. **Review cadence → ~0%** — per-gate / per-phase / per-build are interchangeable on *cost*; choose for **quality and latency**, not tokens: finer = earlier defect-catching at slight overhead; coarser = late but holistic (better at cross-feature bugs). ## Why: cost is *process*, not *product* Pooled across all 28 successful runs: | tokens vs | Pearson r | |---|--:| | duration | **+0.83** | | commits (review rounds) | **+0.79** | | LOC (code shipped) | **−0.04** | Token cost tracks how long the loop **runs and verifies**, and is **uncorrelated with how much code ships**. The deliverable (LOC, behavior) is near-constant across variants; the cost variance is all process intensity. ## Methodology notes & caveats - **N matters.** A single full-loop run is wildly nondeterministic: the *same* variant varied **±55%** run-to-run early on, which is why this is N=5. (An early single-run claim of "context hygiene halves tokens" did **not** reproduce — the real, stable figure is −22%.) - **Variance source:** number of review rounds / retries, not output size. - **Real failures excluded** (2 of 27 loop-pair runs): a wedge and a usage-limit/timeout collision; superseded by clean re-runs. `LIMIT`-flagged runs (a usage-limit *pause* inflates duration without adding tokens) are kept for token totals but excluded from `tokens/sec`. - **Scope:** one task, one model (Sonnet), one harness. The *relative* findings should generalize; absolute numbers are task-specific. The adversary's quality value is **not** measured here (the task is too well-specified to stress it). ## Practical guidance - **Want to cut tokens without losing the independent check?** Use **context hygiene** (the `stateless` pattern). It's the only free lunch. - **Don't pay for minimal prompts with test coverage** — keep the emphatic testing language unless you genuinely want less testing. - **Pick review cadence for the work, not the bill:** per-gate to catch regressions early in long phases; per-phase as a sane default; deferred only when features are independent and cheap to fix late (it saves nothing and checks late). - **`solo` is ~5× cheaper** — reasonable for low-stakes / well-specified work, but you're trusting the builder to grade its own homework. --- _Artifacts in this repo: `FINDINGS.md` (this summary), `RESULTS-campaign.md` (per-variant stats + ratios + full raw table), `RESULTS-campaign.md.data` (raw rows), `analyze.py` (regenerates the analysis), `run-harness-bench.sh` / `run-solo-bench.sh` (the runners), `plans/calc/` (the task), and the six `engine/examples/builder-adversary*` variants. All variants are N=5 (two `deferred` and the `min`/`lean` wedge/limit failures excluded; see RESULTS for the raw rows)._