Files

121 lines
7.3 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Findings — Builder/Adversary prompt & verification-cadence benchmark
A controlled study of what actually drives **token cost** in the agent-orchestrator Builder/Adversary
loop, on a fixed, well-specified task.
- **Task:** build a 3-phase Python calculator (lexer → parser → evaluator), each phase with 46
cold-verifiable Definition-of-Done gates. Deliberately offline and deterministic so it stresses the
*protocol*, not infrastructure.
- **How:** each variant run autonomously to `SEQUENCE-COMPLETE` via the real harness (`agents.py up`
— Builder + Adversary loop pair + watchdog), **5 runs each** (N=5).
Both loops on **claude-sonnet-4-6**. Tokens summed from each loop's Claude Code session
transcripts. The deliverable is behaviorally identical across all variants (verified on a
24-expression probe), so this compares like-for-like.
- **Full data:** [`RESULTS-campaign.md`](RESULTS-campaign.md) (analysis), `RESULTS-campaign.md.data`
(raw per-run rows). Every run's git repo is preserved under `/tmp/ao-campaign-*` and `/tmp/ao-solo-*`.
## The variants
| variant | what changes | engine example |
|---|---|---|
| `builder-adversary` (orig) | the original full prompts; Adversary verifies **per phase** | `examples/builder-adversary` |
| `builder-adversary-min` | prompts compressed to minimal tokens | `examples/builder-adversary-min` |
| `builder-adversary-stateless` | orig + **context hygiene** (compact per checkpoint, read diffs not trees, lean loads) | `examples/builder-adversary-stateless` |
| `builder-adversary-lean` | orig + context hygiene + **per-gate** review (one claim/verdict per gate) | `examples/builder-adversary-lean` |
| `builder-adversary-deferred` | orig; Adversary verifies **once, after the whole build** (a final comprehensive `review` phase) | `examples/builder-adversary-deferred` |
| `builder-solo` | **no Adversary** — a single Builder that self-certifies | `examples/builder-solo` |
(stateless/lean/deferred are all built on the *full original* prompts, so they isolate their one
change without the minimal-prompt confound.)
## Headline results — median tokens (5 runs each)
| variant | adversary verifies… | median tokens | vs orig | commits | LOC |
|---|---|--:|--:|--:|--:|
| **builder-solo** | never (self-certify) | **2.77M** | 79% | 5 | 426 |
| **min** | per phase *(minimal prompts)* | 9.77M | 25% | 15 | 367 |
| **stateless** | per phase *(+context hygiene)* | 10.12M | 22% | 14 | 400 |
| **orig** | per **phase** | 13.04M | — | 14 | 449 |
| **deferred** | once, after **whole build** | 12.89M | 1% | 12 | 425 |
| **lean** | per **gate** | 13.41M | +3% | 28 | 390 |
## The two big findings
### 1. The adversary's *existence* costs ~4.7× — its *cadence* barely matters.
Every loop-pair variant lands near **~13M tokens regardless of how the review is chunked** —
per-gate (`lean`, 28 commits), per-phase (`orig`, 14), or one deferred pass (`deferred`, 10).
`builder-solo` (no adversary) is **2.77M**. So the dominant cost is **whether an independent cold
re-verification happens at all**, not how it's scheduled. The verification *work* is roughly
conserved; chunking it finer or coarser mostly changes the commit/handshake count — which is itself
nearly token-neutral.
### 2. Deferred review was the surprise — and the loser.
Hypothesis: deferring to one pass would be cheapest (fewest handshakes ≈ solo-build + one review).
It wasn't — `deferred``orig`. Handshakes *did* drop (10 commits), but the **single comprehensive
review is itself expensive** (the Adversary re-verifies the whole calculator + cross-feature probes
in one shot), so total tokens stayed put. And it carries the downside that the independent check
**arrives late** — late-rework risk, plus self-certification drift on the build phases. Worst of both
for this task.
## The levers, ranked
1. **Drop the adversary → ~79%** — but you lose all independent verification. On this clean,
well-specified task `solo` produced correct calculators, so the adversary bought no *measured*
quality here — but it is **insurance against self-certification rubber-stamping a bug**, whose
value shows on ambiguous/underspecified work this benchmark can't stress.
2. **Context hygiene → 22%** — the **only clean win**: same review effort (same commits/LOC as
orig), just less context carried and reloaded each turn. (`stateless` vs `orig`.)
3. **Minimal prompts → 25%, but not free** — ~⅓ of the saving comes from the agents writing **~25%
fewer tests** (the compressed prompts drop the emphatic "try to break it / paste the output /
a red test is information" language that drives thorough testing). Same features, thinner test
suite.
4. **Review cadence → ~0%** — per-gate / per-phase / per-build are interchangeable on *cost*; choose
for **quality and latency**, not tokens: finer = earlier defect-catching at slight overhead;
coarser = late but holistic (better at cross-feature bugs).
## Why: cost is *process*, not *product*
Pooled across all 28 successful runs:
| tokens vs | Pearson r |
|---|--:|
| duration | **+0.83** |
| commits (review rounds) | **+0.79** |
| LOC (code shipped) | **0.04** |
Token cost tracks how long the loop **runs and verifies**, and is **uncorrelated with how much code
ships**. The deliverable (LOC, behavior) is near-constant across variants; the cost variance is all
process intensity.
## Methodology notes & caveats
- **N matters.** A single full-loop run is wildly nondeterministic: the *same* variant varied **±55%**
run-to-run early on, which is why this is N=5. (An early single-run claim of "context hygiene halves
tokens" did **not** reproduce — the real, stable figure is 22%.)
- **Variance source:** number of review rounds / retries, not output size.
- **Real failures excluded** (2 of 27 loop-pair runs): a wedge and a usage-limit/timeout collision;
superseded by clean re-runs. `LIMIT`-flagged runs (a usage-limit *pause* inflates duration without
adding tokens) are kept for token totals but excluded from `tokens/sec`.
- **Scope:** one task, one model (Sonnet), one harness. The *relative* findings should generalize;
absolute numbers are task-specific. The adversary's quality value is **not** measured here (the task
is too well-specified to stress it).
## Practical guidance
- **Want to cut tokens without losing the independent check?** Use **context hygiene** (the `stateless`
pattern). It's the only free lunch.
- **Don't pay for minimal prompts with test coverage** — keep the emphatic testing language unless you
genuinely want less testing.
- **Pick review cadence for the work, not the bill:** per-gate to catch regressions early in long
phases; per-phase as a sane default; deferred only when features are independent and cheap to fix
late (it saves nothing and checks late).
- **`solo` is ~5× cheaper** — reasonable for low-stakes / well-specified work, but you're trusting the
builder to grade its own homework.
---
_Artifacts in this repo: `FINDINGS.md` (this summary), `RESULTS-campaign.md` (per-variant stats +
ratios + full raw table), `RESULTS-campaign.md.data` (raw rows), `analyze.py` (regenerates the
analysis), `run-harness-bench.sh` / `run-solo-bench.sh` (the runners), `plans/calc/` (the task), and
the six `engine/examples/builder-adversary*` variants. All variants are N=5 (two `deferred` and the
`min`/`lean` wedge/limit failures excluded; see RESULTS for the raw rows)._