Findings — Builder/Adversary prompt & verification-cadence benchmark

A controlled study of what actually drives token cost in the agent-orchestrator Builder/Adversary loop, on a fixed, well-specified task.

Task: build a 3-phase Python calculator (lexer → parser → evaluator), each phase with 4–6 cold-verifiable Definition-of-Done gates. Deliberately offline and deterministic so it stresses the protocol, not infrastructure.
How: each variant run autonomously to SEQUENCE-COMPLETE via the real harness (agents.py up — Builder + Adversary loop pair + watchdog), 5 runs each (N=5). Both loops on claude-sonnet-4-6. Tokens summed from each loop's Claude Code session transcripts. The deliverable is behaviorally identical across all variants (verified on a 24-expression probe), so this compares like-for-like.
Full data: RESULTS-campaign.md (analysis), RESULTS-campaign.md.data (raw per-run rows). Every run's git repo is preserved under /tmp/ao-campaign-* and /tmp/ao-solo-*.

The variants

variant	what changes	engine example
`builder-adversary` (orig)	the original full prompts; Adversary verifies per phase	`examples/builder-adversary`
`builder-adversary-min`	prompts compressed to minimal tokens	`examples/builder-adversary-min`
`builder-adversary-stateless`	orig + context hygiene (compact per checkpoint, read diffs not trees, lean loads)	`examples/builder-adversary-stateless`
`builder-adversary-lean`	orig + context hygiene + per-gate review (one claim/verdict per gate)	`examples/builder-adversary-lean`
`builder-adversary-deferred`	orig; Adversary verifies once, after the whole build (a final comprehensive `review` phase)	`examples/builder-adversary-deferred`
`builder-solo`	no Adversary — a single Builder that self-certifies	`examples/builder-solo`

(stateless/lean/deferred are all built on the full original prompts, so they isolate their one change without the minimal-prompt confound.)

Headline results — median tokens (5 runs each)

variant	adversary verifies…	median tokens	vs orig	commits	LOC
builder-solo	never (self-certify)	2.77M	−79%	5	426
min	per phase (minimal prompts)	9.77M	−25%	15	367
stateless	per phase (+context hygiene)	10.12M	−22%	14	400
orig	per phase	13.04M	—	14	449
deferred	once, after whole build	12.89M	−1%	12	425
lean	per gate	13.41M	+3%	28	390

The two big findings

1. The adversary's existence costs ~4.7× — its cadence barely matters.

Every loop-pair variant lands near ~13M tokens regardless of how the review is chunked — per-gate (lean, 28 commits), per-phase (orig, 14), or one deferred pass (deferred, 10). builder-solo (no adversary) is 2.77M. So the dominant cost is whether an independent cold re-verification happens at all, not how it's scheduled. The verification work is roughly conserved; chunking it finer or coarser mostly changes the commit/handshake count — which is itself nearly token-neutral.

2. Deferred review was the surprise — and the loser.

Hypothesis: deferring to one pass would be cheapest (fewest handshakes ≈ solo-build + one review). It wasn't — deferred ≈ orig. Handshakes did drop (10 commits), but the single comprehensive review is itself expensive (the Adversary re-verifies the whole calculator + cross-feature probes in one shot), so total tokens stayed put. And it carries the downside that the independent check arrives late — late-rework risk, plus self-certification drift on the build phases. Worst of both for this task.

The levers, ranked

Drop the adversary → ~−79% — but you lose all independent verification. On this clean, well-specified task solo produced correct calculators, so the adversary bought no measured quality here — but it is insurance against self-certification rubber-stamping a bug, whose value shows on ambiguous/underspecified work this benchmark can't stress.
Context hygiene → −22% — the only clean win: same review effort (same commits/LOC as orig), just less context carried and reloaded each turn. (stateless vs orig.)
Minimal prompts → −25%, but not free — ~⅓ of the saving comes from the agents writing ~25% fewer tests (the compressed prompts drop the emphatic "try to break it / paste the output / a red test is information" language that drives thorough testing). Same features, thinner test suite.
Review cadence → ~0% — per-gate / per-phase / per-build are interchangeable on cost; choose for quality and latency, not tokens: finer = earlier defect-catching at slight overhead; coarser = late but holistic (better at cross-feature bugs).

Why: cost is process, not product

Pooled across all 28 successful runs:

tokens vs	Pearson r
duration	+0.83
commits (review rounds)	+0.79
LOC (code shipped)	−0.04

Token cost tracks how long the loop runs and verifies, and is uncorrelated with how much code ships. The deliverable (LOC, behavior) is near-constant across variants; the cost variance is all process intensity.

Methodology notes & caveats

N matters. A single full-loop run is wildly nondeterministic: the same variant varied ±55% run-to-run early on, which is why this is N=5. (An early single-run claim of "context hygiene halves tokens" did not reproduce — the real, stable figure is −22%.)
Variance source: number of review rounds / retries, not output size.
Real failures excluded (2 of 27 loop-pair runs): a wedge and a usage-limit/timeout collision; superseded by clean re-runs. LIMIT-flagged runs (a usage-limit pause inflates duration without adding tokens) are kept for token totals but excluded from tokens/sec.
Scope: one task, one model (Sonnet), one harness. The relative findings should generalize; absolute numbers are task-specific. The adversary's quality value is not measured here (the task is too well-specified to stress it).

Practical guidance

Want to cut tokens without losing the independent check? Use context hygiene (the stateless pattern). It's the only free lunch.
Don't pay for minimal prompts with test coverage — keep the emphatic testing language unless you genuinely want less testing.
Pick review cadence for the work, not the bill: per-gate to catch regressions early in long phases; per-phase as a sane default; deferred only when features are independent and cheap to fix late (it saves nothing and checks late).
solo is ~5× cheaper — reasonable for low-stakes / well-specified work, but you're trusting the builder to grade its own homework.

Artifacts in this repo: FINDINGS.md (this summary), RESULTS-campaign.md (per-variant stats + ratios + full raw table), RESULTS-campaign.md.data (raw rows), analyze.py (regenerates the analysis), run-harness-bench.sh / run-solo-bench.sh (the runners), plans/calc/ (the task), and the six engine/examples/builder-adversary* variants. All variants are N=5 (two deferred and the min/lean wedge/limit failures excluded; see RESULTS for the raw rows).

7.3 KiB Raw Permalink Blame History Unescape Escape