
# agent-orchestrator-benchmark

Benchmarks for the [`agent-orchestrator`](https://git.autonomic.zone/recipe-maintainers/agent-orchestrator)
Builder/Adversary loop — measuring **what actually drives token cost**: prompt design, context
discipline, verification cadence, and whether there's an independent adversary at all. The engine is
vendored as the `engine/` submodule, pinned at a ref that ships the example variants being compared.

## → Findings

**See [`FINDINGS.md`](FINDINGS.md)** for the synthesis (current as of **2026-06-16**). The one-line
takeaway:

> What the AI adversary costs is set by **whether it verifies at all** (~4.7× a solo builder), **not
> by how often** it verifies (per-gate ≈ per-phase ≈ per-build, all ~13M tokens). The only clean way
> to cut that cost without dropping verification is **context hygiene (−22%)**.

Headline (median tokens, N=5 per variant, all on Sonnet):

| variant | adversary verifies… | median tokens | vs orig |
|---|---|--:|--:|
| `builder-solo` | never (self-certifies) | 2.77M | −79% |
| `builder-adversary-min` | per phase *(minimal prompts)* | 9.77M | −25% |
| `builder-adversary-stateless` | per phase *(+context hygiene)* | 10.12M | −22% |
| `builder-adversary` (orig) | per **phase** | 13.04M | — |
| `builder-adversary-deferred` | once, after **whole build** | 12.89M | −1% |
| `builder-adversary-lean` | per **gate** | 13.41M | +3% |

Full per-variant stats, efficiency ratios (tokens/LOC, tokens/sec, tokens/commit), correlations, and
the raw per-run table are in [`RESULTS-campaign.md`](RESULTS-campaign.md); raw rows in
`RESULTS-campaign.md.data`.
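
The "vs orig" column and the ~4.7× figure follow directly from the medians in the table. As a quick sanity check, the arithmetic can be reproduced as below; the function names are illustrative and not part of `analyze.py`.

```python
# Median tokens (in millions), copied from the headline table above.
MEDIANS_M = {
    "builder-solo": 2.77,
    "builder-adversary-min": 9.77,
    "builder-adversary-stateless": 10.12,
    "builder-adversary": 13.04,  # the baseline ("orig")
    "builder-adversary-deferred": 12.89,
    "builder-adversary-lean": 13.41,
}


def vs_orig(variant: str, baseline: str = "builder-adversary") -> int:
    """Percentage change in median tokens relative to the baseline, rounded."""
    return round((MEDIANS_M[variant] / MEDIANS_M[baseline] - 1) * 100)


def adversary_multiplier() -> float:
    """How many times more tokens the baseline pair uses than the solo builder."""
    return round(MEDIANS_M["builder-adversary"] / MEDIANS_M["builder-solo"], 1)


if __name__ == "__main__":
    for name in MEDIANS_M:
        print(f"{name:30s} {vs_orig(name):+d}%")
    print(f"adversary vs solo: {adversary_multiplier()}x")
```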

## The variants (engine examples being compared)

All live in `engine/examples/`. They share one task and differ in one dimension each:

| variant | what changes |
|---|---|
| `builder-adversary` | the original full prompts; Adversary verifies **per phase** (the baseline) |
| `builder-adversary-min` | prompts compressed to minimal tokens |
| `builder-adversary-stateless` | orig + **context hygiene** (compact per checkpoint, read diffs not trees, lean loads) |
| `builder-adversary-lean` | orig + context hygiene + **per-gate** review (one claim/verdict per gate) |
| `builder-adversary-deferred` | orig; Adversary verifies **once**, in a final comprehensive `review` phase |
| `builder-solo` | **no Adversary** — a single Builder that self-certifies |

(stateless/lean/deferred are built on the *full original* prompts, so each isolates its one change
without the minimal-prompt confound.)

## The task

Build a 3-phase Python calculator — lexer → parser → evaluator (`plans/calc/{lex,parse,eval}.md`),
each phase with 4–6 cold-verifiable Definition-of-Done gates (`deferred` adds a comprehensive
`plans/calc/review.md`). It's deliberately offline and deterministic so it stresses the *protocol*,
not infrastructure, and the deliverable is behaviorally identical across variants (verified on a
24-expression probe) — so the comparison is like-for-like.

## How it works (the real harness, N=5)

Each variant is run **autonomously to completion** by the real harness — `engine/agents.py up` brings
up the Builder + Adversary loop pair + watchdog, which work through the phase machine to
`SEQUENCE-COMPLETE` exactly as in production. There's no simulation: the agents self-pace via `/loop`,
coordinate through git (`claim(`/`review(` commits + the watchdog handoff), and the watchdog heals
stalls and rides out usage limits.
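
To see the git handshake in a finished run, one option is to count the coordination commits in its history. The sketch below assumes that the `claim(` and `review(` markers appear at the start of the commit subject line, which this README does not spell out; the function name is hypothetical and the engine may track the handshake differently.

```python
from collections import Counter
from typing import Iterable


def count_handshake_commits(subjects: Iterable[str]) -> dict[str, int]:
    """Count commit subjects by handshake kind.

    ASSUMPTION: Builder claims start with "claim(" and Adversary verdicts
    start with "review(". Everything else is counted as "other".
    """
    counts = Counter({"claim": 0, "review": 0, "other": 0})
    for subject in subjects:
        subject = subject.strip()
        if subject.startswith("claim("):
            counts["claim"] += 1
        elif subject.startswith("review("):
            counts["review"] += 1
        else:
            counts["other"] += 1
    return dict(counts)
```

The subject lines can be fed in from `git log --format=%s` run inside one of the kept `/tmp/ao-campaign-*` repos.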

`run-harness-bench.sh` orchestrates the campaign:

1. For each variant × repeat, it stands up a fresh **shared bare repo + two clones** (Builder and
   Adversary each get their own, for genuine cold verification), pre-trusts the work dirs, generates
   an `agents.toml` pointing at that variant's prompts (and a 4-phase config for `deferred`), and runs
   `agents.py up`.
2. It polls for `SEQUENCE-COMPLETE` (per-run timeout), then tears the loop down.
3. It re-runs the task's Definition-of-Done itself (cold, in the Adversary's clone) to confirm
   success, and **tallies tokens per loop from the Claude Code session transcripts**.
4. One row per run is appended to `RESULTS-campaign.md.data` immediately (so partial results survive
   an interruption). Each run's git repo is kept under `/tmp/ao-campaign-*` for later analysis.
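
The token tally in step 3 can be pictured roughly as follows. This is a sketch, not the script's actual code: it assumes each transcript is a JSON Lines file in which assistant turns carry a `message.usage` object with the counters named below, and whether the real tally includes cache reads is not stated here. The function name `tally_tokens` is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable

# ASSUMPTION: these are the usage counters present on each assistant turn.
# Adjust the list to match the transcript schema you actually have.
USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def tally_tokens(lines: Iterable[str]) -> Counter:
    """Sum usage counters over the lines of one session transcript."""
    totals: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a partially written final line
        message = record.get("message") if isinstance(record, dict) else None
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue  # user turns and tool results carry no usage block
        for field in USAGE_FIELDS:
            value = usage.get(field)
            if isinstance(value, int):
                totals[field] += value
    totals["total"] = sum(totals[field] for field in USAGE_FIELDS)
    return totals
```

Calling it once per loop (Builder, Adversary) with the open transcript file gives a per-loop breakdown that can be summed into the run's row.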

`run-solo-bench.sh` does the same for the single-builder `builder-solo` control.
`analyze.py` reads the data file and (re)generates `RESULTS-campaign.md` — per-variant token
distributions, the efficiency ratios, correlations, and the full raw table.
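
The core of that aggregation is a group-by-variant median. A minimal sketch is shown below, assuming a TSV with a header row and columns named `variant` and `tokens`; both column names are hypothetical, so check the data file for the real layout before reusing this.

```python
import csv
import io
from collections import defaultdict
from statistics import median


def median_tokens_by_variant(tsv_text: str) -> dict[str, float]:
    """Group rows by variant and return the median token count for each.

    ASSUMPTION: the text has a header row with `variant` and `tokens` columns.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    tokens_by_variant: dict[str, list[int]] = defaultdict(list)
    for row in reader:
        tokens_by_variant[row["variant"]].append(int(row["tokens"]))
    return {name: median(values) for name, values in tokens_by_variant.items()}
```

The median is used rather than the mean because individual runs vary widely, and one outlier run should not move the headline figure.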

### Run it yourself

```bash
git submodule update --init                                   # fetch the vendored engine
# one variant, 5 runs, 45-min per-run timeout:
BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-harness-bench.sh builder-adversary
# the solo control:
BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-solo-bench.sh
python3 analyze.py                                            # regenerate RESULTS-campaign.md
```

Needs `claude` on `PATH` (authenticated), plus `python3`, `tmux`, `git`, and `timeout`. `run-harness-bench.sh`
with no arguments runs all the loop-pair variants; pass variant names to run a subset. The data file
is **append-mode** (clear it manually for a fresh campaign).

## Methodology & caveats

- **N matters.** A single full-loop run is highly nondeterministic — the *same* variant varied ±55%
  run-to-run early on, which is why everything here is **N=5**. (An early single-run "context hygiene
  halves tokens" claim did **not** reproduce; the stable figure is −22%.)
- **Excluded runs:** a few real failures (a wedge, a usage-limit/timeout collision) are recorded in the
  raw data as `NO`, excluded from stats, and superseded by clean re-runs. `LIMIT`-flagged runs (a
  usage-limit *pause* inflates duration, not tokens) are kept for token totals but excluded from
  `tokens/sec`.
- **Scope:** one task, one model (Sonnet), one harness. *Relative* findings should generalize;
  absolute numbers are task-specific. The adversary's *quality* value isn't measured here — the task
  is too well-specified to make self-certification fail.
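
The exclusion rules above amount to two different filters, one for token statistics and one for the rate. A sketch under assumed field names follows; the `Run` record and its fields are hypothetical stand-ins for the real data-file columns.

```python
from dataclasses import dataclass


@dataclass
class Run:
    variant: str
    succeeded: bool   # False corresponds to a `NO` row
    limit_hit: bool   # True corresponds to a `LIMIT`-flagged row
    tokens: int
    seconds: float


def tokens_for_stats(runs: list[Run]) -> list[int]:
    """Token totals: drop failed runs, keep LIMIT-flagged ones."""
    return [run.tokens for run in runs if run.succeeded]


def rates_for_stats(runs: list[Run]) -> list[float]:
    """tokens/sec: drop failed runs and LIMIT-flagged ones, whose duration
    includes the usage-limit pause."""
    return [
        run.tokens / run.seconds
        for run in runs
        if run.succeeded and not run.limit_hit
    ]
```

Keeping the two filters separate means a paused run still contributes to the token medians without distorting the throughput figures.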

## Layout

```
FINDINGS.md                  the synthesis — start here (current as of 2026-06-16)
RESULTS-campaign.md          full-harness campaign analysis (stats + ratios + raw table)  ← canonical
RESULTS-campaign.md.data     raw per-run rows (TSV)
analyze.py                   aggregates the data file -> RESULTS-campaign.md
run-harness-bench.sh         the full-harness campaign runner (loop pair, N runs/variant)
run-solo-bench.sh            the builder-solo control runner
plans/calc/{lex,parse,eval,review}.md   the calculator task
engine/                      agent-orchestrator, vendored as a submodule (variants in engine/examples/)

# earlier / superseded exploratory runs (kept for history):
run-bench.sh                 first experiment: headless single-pass, 2 variants, roman-numeral task
plans/roman.md               that experiment's task
RESULTS.md                   its results (N=1, single-pass — superseded by the campaign)
RESULTS-harness.md           early 3-variant full-harness run (superseded by RESULTS-campaign.md)
```