Findings — Builder/Adversary prompt & verification-cadence benchmark

A controlled study of what actually drives token cost in the agent-orchestrator Builder/Adversary loop, on a fixed, well-specified task.

  • Task: build a 3-phase Python calculator (lexer → parser → evaluator), each phase with 46 cold-verifiable Definition-of-Done gates. Deliberately offline and deterministic so it stresses the protocol, not infrastructure.
  • How: each variant run autonomously to SEQUENCE-COMPLETE via the real harness (agents.py up — Builder + Adversary loop pair + watchdog), 5 runs each (N=5). Both loops on claude-sonnet-4-6. Tokens summed from each loop's Claude Code session transcripts. The deliverable is behaviorally identical across all variants (verified on a 24-expression probe), so this compares like-for-like.
  • Full data: RESULTS-campaign.md (analysis), RESULTS-campaign.md.data (raw per-run rows). Every run's git repo is preserved under /tmp/ao-campaign-* and /tmp/ao-solo-*.
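
As a rough illustration of the token accounting described above, the sketch below sums usage counters across JSONL transcript files. The record layout (a `message.usage` object carrying `input_tokens`, `output_tokens`, `cache_creation_input_tokens` and `cache_read_input_tokens`) and the helper name `sum_transcript_tokens` are assumptions made for this example; the benchmark's own scripts may read different fields or count differently.

```python
import json
from pathlib import Path

# Assumed usage counters; the real transcript schema may differ.
USAGE_KEYS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def sum_transcript_tokens(paths):
    """Sum the assumed usage counters over every JSONL record in `paths`."""
    total = 0
    for path in paths:
        with Path(path).open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip truncated or partial lines
                if not isinstance(record, dict):
                    continue
                message = record.get("message")
                usage = message.get("usage") if isinstance(message, dict) else None
                if not isinstance(usage, dict):
                    continue  # records without usage data contribute nothing
                total += sum(int(usage.get(key) or 0) for key in USAGE_KEYS)
    return total
```

A per-run total would then be the sum over the Builder's and the Adversary's transcript files for that run. Whether cached-token counters belong in the total is a design choice this sketch makes arbitrarily.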

The variants

| variant | what changes | engine example |
|---|---|---|
| builder-adversary (orig) | the original full prompts; Adversary verifies per phase | examples/builder-adversary |
| builder-adversary-min | prompts compressed to minimal tokens | examples/builder-adversary-min |
| builder-adversary-stateless | orig + context hygiene (compact per checkpoint, read diffs not trees, lean loads) | examples/builder-adversary-stateless |
| builder-adversary-lean | orig + context hygiene + per-gate review (one claim/verdict per gate) | examples/builder-adversary-lean |
| builder-adversary-deferred | orig; Adversary verifies once, after the whole build (a final comprehensive review phase) | examples/builder-adversary-deferred |
| builder-solo | no Adversary: a single Builder that self-certifies | examples/builder-solo |

(stateless/lean/deferred are all built on the full original prompts, so they isolate their one change without the minimal-prompt confound.)

Headline results — median tokens (5 runs each)

| variant | adversary verifies… | median tokens | vs orig | commits | LOC |
|---|---|---|---|---|---|
| builder-solo | never (self-certify) | 2.77M | −79% | 5 | 426 |
| min | per phase (minimal prompts) | 9.77M | −25% | 15 | 367 |
| stateless | per phase (+context hygiene) | 10.12M | −22% | 14 | 400 |
| orig | per phase | 13.04M | (baseline) | 14 | 449 |
| deferred | once, after whole build | 12.89M | −1% | 12 | 425 |
| lean | per gate | 13.41M | +3% | 28 | 390 |
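
The "vs orig" column and the solo cost ratio can be rechecked from the medians alone. The sketch below does that arithmetic; the dictionary simply copies the median token figures from the table, and the helper name `percent_vs_baseline` is made up for this example.

```python
# Median tokens per variant, in millions, copied from the headline table.
MEDIAN_TOKENS_M = {
    "builder-solo": 2.77,
    "min": 9.77,
    "stateless": 10.12,
    "orig": 13.04,
    "deferred": 12.89,
    "lean": 13.41,
}


def percent_vs_baseline(value, baseline):
    """Signed percentage change of `value` relative to `baseline`, as an integer."""
    return round((value / baseline - 1.0) * 100)


baseline = MEDIAN_TOKENS_M["orig"]
vs_orig = {
    name: percent_vs_baseline(tokens, baseline)
    for name, tokens in MEDIAN_TOKENS_M.items()
}
# How many times more the full loop pair costs than the solo Builder.
adversary_multiplier = baseline / MEDIAN_TOKENS_M["builder-solo"]

for name, change in vs_orig.items():
    print(f"{name:12s} {change:+d}%")
print(f"orig / solo = {adversary_multiplier:.1f}x")
```

Run as-is, this gives −79%, −25%, −22%, −1% and +3% for solo, min, stateless, deferred and lean, and an orig/solo ratio of about 4.7×.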

The two big findings

1. The adversary's existence costs ~4.7× — its cadence barely matters.

The three full-prompt cadence variants all land near 13M tokens regardless of how the review is chunked: per-gate (lean, 28 commits), per-phase (orig, 14), or one deferred pass (deferred, 10). builder-solo (no adversary) is 2.77M. So the dominant cost is whether an independent cold re-verification happens at all, not how it's scheduled. The verification work is roughly conserved; chunking it finer or coarser mostly changes the commit/handshake count, which is itself nearly token-neutral.

2. Deferred review was the surprise — and the loser.

Hypothesis: deferring to one pass would be cheapest (fewest handshakes ≈ solo-build + one review). It wasn't: deferred ≈ orig. Handshakes did drop (10 commits), but the single comprehensive review is itself expensive (the Adversary re-verifies the whole calculator + cross-feature probes in one shot), so total tokens stayed put. And it carries the downside that the independent check arrives late: late-rework risk, plus self-certification drift on the build phases. Worst of both for this task.

The levers, ranked

  1. Drop the adversary → about −79%, but you lose all independent verification. On this clean, well-specified task solo produced correct calculators, so the adversary bought no measured quality here. It is still insurance against self-certification rubber-stamping a bug, and its value shows on ambiguous/underspecified work this benchmark can't stress.
  2. Context hygiene → −22%: the only clean win. Same review effort (same commits/LOC as orig), just less context carried and reloaded each turn. (stateless vs orig.)
  3. Minimal prompts → −25%, but not free: ~⅓ of the saving comes from the agents writing ~25% fewer tests (the compressed prompts drop the emphatic "try to break it / paste the output / a red test is information" language that drives thorough testing). Same features, thinner test suite.
  4. Review cadence → ~0% — per-gate / per-phase / per-build are interchangeable on cost; choose for quality and latency, not tokens: finer = earlier defect-catching at slight overhead; coarser = late but holistic (better at cross-feature bugs).

Why: cost is process, not product

Pooled across all 28 successful runs:

| tokens vs | Pearson r |
|---|---|
| duration | +0.83 |
| commits (review rounds) | +0.79 |
| LOC (code shipped) | −0.04 |

Token cost tracks how long the loop runs and verifies, and is uncorrelated with how much code ships. The deliverable (LOC, behavior) is near-constant across variants; the cost variance is all process intensity.
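
For readers who want to recompute the correlations from the raw rows, here is a minimal sketch of Pearson's r in plain Python. The function name `pearson_r` and the toy inputs are illustrative only: the toy lists are not benchmark data, and loading the real per-run rows is left out because their file format is not shown here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy numbers for illustration only; these are NOT benchmark data.
toy_tokens = [1.0, 2.0, 3.0, 4.0]
toy_commits = [2.0, 4.0, 6.0, 8.0]  # rises in lockstep with tokens
toy_loc = [5.0, 3.0, 3.0, 5.0]      # no linear relationship with tokens

print(pearson_r(toy_tokens, toy_commits))  # perfectly correlated toy case
print(pearson_r(toy_tokens, toy_loc))      # uncorrelated toy case
```

The two toy cases mirror the shape of the finding: a process measure that moves with tokens gives r near +1, while an output measure that does not move with tokens gives r near 0.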

Methodology notes & caveats

  • N matters. A single full-loop run is wildly nondeterministic: the same variant varied ±55% run-to-run early on, which is why this is N=5. (An early single-run claim of "context hygiene halves tokens" did not reproduce; the real, stable figure is a 22% reduction.)
  • Variance source: number of review rounds / retries, not output size.
  • Real failures excluded (2 of 27 loop-pair runs): a wedge and a usage-limit/timeout collision; superseded by clean re-runs. LIMIT-flagged runs (a usage-limit pause inflates duration without adding tokens) are kept for token totals but excluded from tokens/sec.
  • Scope: one task, one model (Sonnet), one harness. The relative findings should generalize; absolute numbers are task-specific. The adversary's quality value is not measured here (the task is too well-specified to stress it).
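
The exclusion rule for LIMIT-flagged runs can be sketched as follows. The `Run` record and its field names are hypothetical, chosen only to make the rule concrete: every retained run counts toward the token median, while throughput is computed from unflagged runs only, since a usage-limit pause stretches duration without adding tokens.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    """One benchmark run; this record layout is hypothetical."""
    tokens: int
    duration_s: float
    limit_flagged: bool = False  # a usage-limit pause inflated the duration


def summarize(runs):
    """Median tokens over all runs; median tokens/sec over unflagged runs only."""
    if not runs:
        raise ValueError("no runs to summarize")
    rates = [r.tokens / r.duration_s for r in runs if not r.limit_flagged]
    return {
        "median_tokens": median([r.tokens for r in runs]),
        "median_tokens_per_sec": median(rates) if rates else None,
    }
```

For example, `summarize([Run(100, 10.0), Run(200, 10.0), Run(300, 1000.0, limit_flagged=True)])` keeps all three runs in the token median but drops the third from the rate, so the paused run cannot drag tokens/sec down.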

Practical guidance

  • Want to cut tokens without losing the independent check? Use context hygiene (the stateless pattern). It's the only free lunch.
  • Don't pay for minimal prompts with test coverage — keep the emphatic testing language unless you genuinely want less testing.
  • Pick review cadence for the work, not the bill: per-gate to catch regressions early in long phases; per-phase as a sane default; deferred only when features are independent and cheap to fix late (it saves nothing and checks late).
  • solo is ~5× cheaper — reasonable for low-stakes / well-specified work, but you're trusting the builder to grade its own homework.

Artifacts in this repo: FINDINGS.md (this summary), RESULTS-campaign.md (per-variant stats + ratios + full raw table), RESULTS-campaign.md.data (raw rows), analyze.py (regenerates the analysis), run-harness-bench.sh / run-solo-bench.sh (the runners), plans/calc/ (the task), and the six engine/examples/ variant directories (builder-adversary* and builder-solo). All variants are N=5 (two deferred and the min/lean wedge/limit failures excluded; see RESULTS for the raw rows).