agent-orchestrator-benchmark/FINDINGS.md

# Findings — Builder/Adversary prompt & verification-cadence benchmark

A controlled study of what actually drives **token cost** in the agent-orchestrator Builder/Adversary
loop, on a fixed, well-specified task.

- **Task:** build a 3-phase Python calculator (lexer → parser → evaluator), each phase with 4–6
  cold-verifiable Definition-of-Done gates. Deliberately offline and deterministic so it stresses the
  *protocol*, not infrastructure.
- **How:** each variant run autonomously to `SEQUENCE-COMPLETE` via the real harness (`agents.py up`
  — Builder + Adversary loop pair + watchdog), **5 runs each** (N=5).
  Both loops on **claude-sonnet-4-6**. Tokens summed from each loop's Claude Code session
  transcripts. The deliverable is behaviorally identical across all variants (verified on a
  24-expression probe), so this compares like-for-like.
- **Full data:** [`RESULTS-campaign.md`](RESULTS-campaign.md) (analysis), `RESULTS-campaign.md.data`
  (raw per-run rows). Every run's git repo is preserved under `/tmp/ao-campaign-*` and `/tmp/ao-solo-*`.

## The variants

| variant | what changes | engine example |
|---|---|---|
| `builder-adversary` (orig) | the original full prompts; Adversary verifies **per phase** | `examples/builder-adversary` |
| `builder-adversary-min` | prompts compressed to minimal tokens | `examples/builder-adversary-min` |
| `builder-adversary-stateless` | orig + **context hygiene** (compact per checkpoint, read diffs not trees, lean loads) | `examples/builder-adversary-stateless` |
| `builder-adversary-lean` | orig + context hygiene + **per-gate** review (one claim/verdict per gate) | `examples/builder-adversary-lean` |
| `builder-adversary-deferred` | orig; Adversary verifies **once, after the whole build** (a final comprehensive `review` phase) | `examples/builder-adversary-deferred` |
| `builder-solo` | **no Adversary** — a single Builder that self-certifies | `examples/builder-solo` |

(stateless/lean/deferred are all built on the *full original* prompts, so they isolate their one
change without the minimal-prompt confound.)

## Headline results — median tokens (5 runs each)

| variant | adversary verifies… | median tokens | vs orig | commits | LOC |
|---|---|--:|--:|--:|--:|
| **builder-solo** | never (self-certify) | **2.77M** | −79% | 5 | 426 |
| **min** | per phase *(minimal prompts)* | 9.77M | −25% | 15 | 367 |
| **stateless** | per phase *(+context hygiene)* | 10.12M | −22% | 14 | 400 |
| **orig** | per **phase** | 13.04M | — | 14 | 449 |
| **deferred** | once, after **whole build** | 12.89M | −1% | 12 | 425 |
| **lean** | per **gate** | 13.41M | +3% | 28 | 390 |

## The two big findings

### 1. The adversary's *existence* costs ~4.7× — its *cadence* barely matters.
Every loop-pair variant lands near **~13M tokens regardless of how the review is chunked** —
per-gate (`lean`, 28 commits), per-phase (`orig`, 14), or one deferred pass (`deferred`, 10).
`builder-solo` (no adversary) is **2.77M**. So the dominant cost is **whether an independent cold
re-verification happens at all**, not how it's scheduled. The verification *work* is roughly
conserved; chunking it finer or coarser mostly changes the commit/handshake count — which is itself
nearly token-neutral.

### 2. Deferred review was the surprise — and the loser.
Hypothesis: deferring to one pass would be cheapest (fewest handshakes ≈ solo-build + one review).
It wasn't — `deferred` ≈ `orig`. Handshakes *did* drop (10 commits), but the **single comprehensive
review is itself expensive** (the Adversary re-verifies the whole calculator + cross-feature probes
in one shot), so total tokens stayed put. And it carries the downside that the independent check
**arrives late** — late-rework risk, plus self-certification drift on the build phases. Worst of both
for this task.

## The levers, ranked

1. **Drop the adversary → ~−79%** — but you lose all independent verification. On this clean,
   well-specified task `solo` produced correct calculators, so the adversary bought no *measured*
   quality here — but it is **insurance against self-certification rubber-stamping a bug**, whose
   value shows on ambiguous/underspecified work this benchmark can't stress.
2. **Context hygiene → −22%** — the **only clean win**: same review effort (same commits/LOC as
   orig), just less context carried and reloaded each turn. (`stateless` vs `orig`.)
3. **Minimal prompts → −25%, but not free** — ~⅓ of the saving comes from the agents writing **~25%
   fewer tests** (the compressed prompts drop the emphatic "try to break it / paste the output /
   a red test is information" language that drives thorough testing). Same features, thinner test
   suite.
4. **Review cadence → ~0%** — per-gate / per-phase / per-build are interchangeable on *cost*; choose
   for **quality and latency**, not tokens: finer = earlier defect-catching at slight overhead;
   coarser = late but holistic (better at cross-feature bugs).

## Why: cost is *process*, not *product*

Pooled across all 28 successful runs:

| tokens vs | Pearson r |
|---|--:|
| duration | **+0.83** |
| commits (review rounds) | **+0.79** |
| LOC (code shipped) | **−0.04** |

Token cost tracks how long the loop **runs and verifies**, and is **uncorrelated with how much code
ships**. The deliverable (LOC, behavior) is near-constant across variants; the cost variance is all
process intensity.

## Methodology notes & caveats

- **N matters.** A single full-loop run is wildly nondeterministic: the *same* variant varied **±55%**
  run-to-run early on, which is why this is N=5. (An early single-run claim of "context hygiene halves
  tokens" did **not** reproduce — the real, stable figure is −22%.)
- **Variance source:** number of review rounds / retries, not output size.
- **Real failures excluded** (2 of 27 loop-pair runs): a wedge and a usage-limit/timeout collision;
  superseded by clean re-runs. `LIMIT`-flagged runs (a usage-limit *pause* inflates duration without
  adding tokens) are kept for token totals but excluded from `tokens/sec`.
- **Scope:** one task, one model (Sonnet), one harness. The *relative* findings should generalize;
  absolute numbers are task-specific. The adversary's quality value is **not** measured here (the task
  is too well-specified to stress it).

## Practical guidance

- **Want to cut tokens without losing the independent check?** Use **context hygiene** (the `stateless`
  pattern). It's the only free lunch.
- **Don't pay for minimal prompts with test coverage** — keep the emphatic testing language unless you
  genuinely want less testing.
- **Pick review cadence for the work, not the bill:** per-gate to catch regressions early in long
  phases; per-phase as a sane default; deferred only when features are independent and cheap to fix
  late (it saves nothing and checks late).
- **`solo` is ~5× cheaper** — reasonable for low-stakes / well-specified work, but you're trusting the
  builder to grade its own homework.

---
_Artifacts in this repo: `FINDINGS.md` (this summary), `RESULTS-campaign.md` (per-variant stats +
ratios + full raw table), `RESULTS-campaign.md.data` (raw rows), `analyze.py` (regenerates the
analysis), `run-harness-bench.sh` / `run-solo-bench.sh` (the runners), `plans/calc/` (the task), and
the six `engine/examples/builder-adversary*` variants. All variants are N=5 (two `deferred` and the
`min`/`lean` wedge/limit failures excluded; see RESULTS for the raw rows)._