Findings — Builder/Adversary prompt & verification-cadence benchmark
A controlled study of what actually drives token cost in the agent-orchestrator Builder/Adversary loop, on a fixed, well-specified task.
- Task: build a 3-phase Python calculator (lexer → parser → evaluator), each phase with 4–6 cold-verifiable Definition-of-Done gates. Deliberately offline and deterministic so it stresses the protocol, not infrastructure.
- How: each variant was run autonomously to SEQUENCE-COMPLETE via the real harness (agents.py up: Builder + Adversary loop pair + watchdog), 5 runs each (N=5). Both loops ran on claude-sonnet-4-6. Tokens are summed from each loop's Claude Code session transcripts. The deliverable is behaviorally identical across all variants (verified on a 24-expression probe), so this compares like-for-like.
- Full data: RESULTS-campaign.md (analysis), RESULTS-campaign.md.data (raw per-run rows). Every run's git repo is preserved under /tmp/ao-campaign-* and /tmp/ao-solo-*.
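The like-for-like check mentioned above can be sketched as a small probe that feeds the same expressions to each variant's calculator and reports any expression on which the outcomes differ. This is only an illustration: `reference_eval`, `truncating_eval`, `disagreements`, and the five expressions in `PROBE` are hypothetical stand-ins, not the benchmarked calculators or the actual 24-expression probe.

```python
import ast
import operator
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for one variant's calculator (not the benchmarked code).
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def reference_eval(expr: str) -> float:
    """Evaluate +, -, *, / and unary minus over numeric literals."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported syntax in {expr!r}")
    return walk(ast.parse(expr, mode="eval"))

def truncating_eval(expr: str) -> float:
    """Deliberately divergent stand-in: truncates every result to an int."""
    return int(reference_eval(expr))

def outcome(evaluate: Callable[[str], float], expr: str) -> Tuple[str, object]:
    """Capture either the value or the error type, so failures are comparable too."""
    try:
        return ("ok", evaluate(expr))
    except Exception as exc:
        return ("error", type(exc).__name__)

def disagreements(evaluators: Dict[str, Callable[[str], float]],
                  expressions: List[str]) -> List[str]:
    """Return the expressions on which the variants do not all agree."""
    differing = []
    for expr in expressions:
        outcomes = {outcome(fn, expr) for fn in evaluators.values()}
        if len(outcomes) > 1:
            differing.append(expr)
    return differing

# Hypothetical probe expressions, including two that should fail identically.
PROBE = ["1 + 2 * 3", "-(4 - 6)", "7 / 2", "1 / 0", "(1 + 2"]
```

Comparing error types as well as values matters here: two calculators that both reject `1 / 0` count as agreeing, while one that silently returns a number does not.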
The variants
| variant | what changes | engine example |
|---|---|---|
| builder-adversary (orig) | the original full prompts; Adversary verifies per phase | examples/builder-adversary |
| builder-adversary-min | prompts compressed to minimal tokens | examples/builder-adversary-min |
| builder-adversary-stateless | orig + context hygiene (compact per checkpoint, read diffs not trees, lean loads) | examples/builder-adversary-stateless |
| builder-adversary-lean | orig + context hygiene + per-gate review (one claim/verdict per gate) | examples/builder-adversary-lean |
| builder-adversary-deferred | orig; Adversary verifies once, after the whole build (a final comprehensive review phase) | examples/builder-adversary-deferred |
| builder-solo | no Adversary; a single Builder that self-certifies | examples/builder-solo |
(stateless/lean/deferred are all built on the full original prompts, so they isolate their one change without the minimal-prompt confound.)
Headline results — median tokens (5 runs each)
| variant | adversary verifies… | median tokens | vs orig | commits | LOC |
|---|---|---|---|---|---|
| builder-solo | never (self-certify) | 2.77M | −79% | 5 | 426 |
| min | per phase (minimal prompts) | 9.77M | −25% | 15 | 367 |
| stateless | per phase (+context hygiene) | 10.12M | −22% | 14 | 400 |
| orig | per phase | 13.04M | — | 14 | 449 |
| deferred | once, after whole build | 12.89M | −1% | 12 | 425 |
| lean | per gate | 13.41M | +3% | 28 | 390 |
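The "vs orig" column can be re-derived from the median token figures in the table. The following is a minimal sketch of that arithmetic, not the logic in analyze.py; the names `MEDIAN_TOKENS_M` and `pct_vs_baseline` are chosen here for illustration.

```python
# Median tokens per variant, in millions, copied from the table above.
MEDIAN_TOKENS_M = {
    "builder-solo": 2.77,
    "min": 9.77,
    "stateless": 10.12,
    "orig": 13.04,
    "deferred": 12.89,
    "lean": 13.41,
}

def pct_vs_baseline(value: float, baseline: float) -> int:
    """Relative change against the baseline, rounded to a whole percent."""
    return round((value - baseline) / baseline * 100)

baseline = MEDIAN_TOKENS_M["orig"]
vs_orig = {name: pct_vs_baseline(tokens, baseline)
           for name, tokens in MEDIAN_TOKENS_M.items()}

# Cost multiple of the per-phase loop pair over the solo builder.
adversary_multiple = round(baseline / MEDIAN_TOKENS_M["builder-solo"], 1)
```

With these inputs `vs_orig` reproduces the table's percentages (−79, −25, −22, 0, −1, +3) and `adversary_multiple` comes out at 4.7.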
The two big findings
1. The adversary's existence costs ~4.7× — its cadence barely matters.
The three cadence variants all land near 13M tokens regardless of how the review is chunked:
per-gate (lean, 28 commits), per-phase (orig, 14), or one deferred pass (deferred, 10).
builder-solo (no adversary) is 2.77M. So the dominant cost is whether an independent cold
re-verification happens at all, not how it's scheduled. The verification work is roughly
conserved; chunking it finer or coarser mostly changes the commit/handshake count — which is itself
nearly token-neutral.
2. Deferred review was the surprise — and the loser.
Hypothesis: deferring to one pass would be cheapest (fewest handshakes ≈ solo-build + one review).
It wasn't — deferred ≈ orig. Handshakes did drop (10 commits), but the single comprehensive
review is itself expensive (the Adversary re-verifies the whole calculator + cross-feature probes
in one shot), so total tokens stayed put. And it carries the downside that the independent check
arrives late — late-rework risk, plus self-certification drift on the build phases. Worst of both
for this task.
The levers, ranked
- Drop the adversary → ~−79%, but you lose all independent verification. On this clean, well-specified task solo produced correct calculators, so the adversary bought no measured quality here. It is insurance against self-certification rubber-stamping a bug, and its value shows on ambiguous/underspecified work that this benchmark can't stress.
- Context hygiene → −22%: the only clean win. Same review effort (comparable commits/LOC to orig), just less context carried and reloaded each turn. (stateless vs orig.)
- Minimal prompts → −25%, but not free: ~⅓ of the saving comes from the agents writing ~25% fewer tests (the compressed prompts drop the emphatic "try to break it / paste the output / a red test is information" language that drives thorough testing). Same features, thinner test suite.
- Review cadence → ~0% — per-gate / per-phase / per-build are interchangeable on cost; choose for quality and latency, not tokens: finer = earlier defect-catching at slight overhead; coarser = late but holistic (better at cross-feature bugs).
Why: cost is process, not product
Pooled across all 28 successful runs:
| tokens vs | Pearson r |
|---|---|
| duration | +0.83 |
| commits (review rounds) | +0.79 |
| LOC (code shipped) | −0.04 |
Token cost tracks how long the loop runs and verifies, and is uncorrelated with how much code ships. The deliverable (LOC, behavior) is near-constant across variants; the cost variance is all process intensity.
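For readers who want to recompute the correlations from the raw rows, a Pearson r over per-run columns can be sketched as below. The row field names (`tokens_m`, `duration_s`, `commits`, `loc`) and the sample rows are assumptions made for illustration; they are made-up placeholders, not benchmark data, and analyze.py may compute this differently.

```python
import math
from typing import Dict, List, Sequence

def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)

def correlate_with_tokens(rows: List[Dict[str, float]],
                          columns: Sequence[str]) -> Dict[str, float]:
    """Correlate each named column against the tokens column."""
    tokens = [row["tokens_m"] for row in rows]
    return {col: pearson_r(tokens, [row[col] for row in rows]) for col in columns}

# Made-up placeholder rows (NOT benchmark data); field names are hypothetical.
placeholder_rows = [
    {"tokens_m": 3.0, "duration_s": 900, "commits": 5, "loc": 420},
    {"tokens_m": 10.0, "duration_s": 3100, "commits": 14, "loc": 380},
    {"tokens_m": 13.0, "duration_s": 4000, "commits": 15, "loc": 450},
    {"tokens_m": 14.0, "duration_s": 3800, "commits": 27, "loc": 400},
]
result = correlate_with_tokens(placeholder_rows, ["duration_s", "commits", "loc"])
```

Pooling all runs into one list before correlating, as described above, measures the relationship across variants rather than within any single one.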
Methodology notes & caveats
- N matters. A single full-loop run is wildly nondeterministic: the same variant varied ±55% run-to-run early on, which is why this is N=5. (An early single-run claim of "context hygiene halves tokens" did not reproduce — the real, stable figure is −22%.)
- Variance source: number of review rounds / retries, not output size.
- Real failures excluded (2 of 27 loop-pair runs): a wedge and a usage-limit/timeout collision; superseded by clean re-runs. LIMIT-flagged runs (a usage-limit pause inflates duration without adding tokens) are kept for token totals but excluded from tokens/sec.
- Scope: one task, one model (Sonnet), one harness. The relative findings should generalize; absolute numbers are task-specific. The adversary's quality value is not measured here (the task is too well-specified to stress it).
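The aggregation rules in these notes (median over the N runs, LIMIT-flagged runs kept for token totals but dropped from throughput) can be sketched as follows. The row shape, the `limit_flagged` field, and the numbers are assumptions for illustration only; the placeholder rows are made up and are not benchmark data.

```python
from statistics import median
from typing import Dict, List

# Made-up placeholder runs for one variant (NOT benchmark data).
# "limit_flagged" is a hypothetical name for the LIMIT marker.
runs: List[Dict] = [
    {"tokens": 10_000_000, "duration_s": 2000, "limit_flagged": False},
    {"tokens": 12_000_000, "duration_s": 3000, "limit_flagged": False},
    {"tokens": 14_000_000, "duration_s": 7000, "limit_flagged": True},
    {"tokens": 16_000_000, "duration_s": 4000, "limit_flagged": False},
    {"tokens": 18_000_000, "duration_s": 3000, "limit_flagged": False},
]

def median_tokens(rows: List[Dict]) -> float:
    """Token totals use every run, including LIMIT-flagged ones."""
    return median(row["tokens"] for row in rows)

def median_tokens_per_sec(rows: List[Dict]) -> float:
    """Throughput skips LIMIT-flagged runs, whose duration includes a pause."""
    rates = [row["tokens"] / row["duration_s"]
             for row in rows if not row["limit_flagged"]]
    return median(rates)

def max_deviation_pct(rows: List[Dict]) -> float:
    """Largest run-to-run deviation from the median, as a percentage."""
    mid = median_tokens(rows)
    return max(abs(row["tokens"] - mid) for row in rows) / mid * 100
```

On the placeholder rows the flagged run still counts toward `median_tokens`, but leaving it out of `median_tokens_per_sec` keeps its paused duration from dragging the throughput figure down.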
Practical guidance
- Want to cut tokens without losing the independent check? Use context hygiene (the stateless pattern). It's the only free lunch.
- Don't pay for minimal prompts with test coverage: keep the emphatic testing language unless you genuinely want less testing.
- Pick review cadence for the work, not the bill: per-gate to catch regressions early in long phases; per-phase as a sane default; deferred only when features are independent and cheap to fix late (it saves nothing and checks late).
- solo is ~5× cheaper: reasonable for low-stakes / well-specified work, but you're trusting the builder to grade its own homework.
Artifacts in this repo: FINDINGS.md (this summary), RESULTS-campaign.md (per-variant stats +
ratios + full raw table), RESULTS-campaign.md.data (raw rows), analyze.py (regenerates the
analysis), run-harness-bench.sh / run-solo-bench.sh (the runners), plans/calc/ (the task), and
the six engine/examples/builder-adversary* variants. All variants are N=5 (two deferred and the
min/lean wedge/limit failures excluded; see RESULTS for the raw rows).