# agent-orchestrator-benchmark
Benchmarks for the [`agent-orchestrator`](https://git.autonomic.zone/recipe-maintainers/agent-orchestrator)
Builder/Adversary loop — measuring **what actually drives token cost**: prompt design, context
discipline, verification cadence, and whether there's an independent adversary at all. The engine is
vendored as the `engine/` submodule, pinned at a ref that ships the example variants being compared.
## → Findings
**See [`FINDINGS.md`](FINDINGS.md)** for the synthesis (current as of **2026-06-16**). The one-line
takeaway:
> What the AI adversary costs is set by **whether it verifies at all** (~4.7× a solo builder), **not
> by how often** it verifies (per-gate ≈ per-phase ≈ per-build, all ~13M tokens). The only clean way
> to cut that cost without dropping verification is **context hygiene (-22%)**.
Headline (median tokens, N=5 per variant, all on Sonnet):

| variant | adversary verifies… | median tokens | vs orig |
|---|---|--:|--:|
| `builder-solo` | never (self-certifies) | 2.77M | -79% |
| `builder-adversary-min` | per phase *(minimal prompts)* | 9.77M | -25% |
| `builder-adversary-stateless` | per phase *(+context hygiene)* | 10.12M | -22% |
| `builder-adversary` (orig) | per **phase** | 13.04M | — |
| `builder-adversary-deferred` | once, after **whole build** | 12.89M | -1% |
| `builder-adversary-lean` | per **gate** | 13.41M | +3% |

Full per-variant stats, efficiency ratios (tokens/LOC, tokens/sec, tokens/commit), correlations, and
the raw per-run table are in [`RESULTS-campaign.md`](RESULTS-campaign.md); raw rows in
`RESULTS-campaign.md.data`.
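The "vs orig" column is the relative change in median tokens against the `builder-adversary` baseline, and the ~4.7× figure is the baseline median divided by the solo median. A quick check of that arithmetic, using only the medians from the headline table (the helper name `vs_orig` is illustrative and is not taken from `analyze.py`):

```python
# Median tokens (in millions), copied from the headline table.
MEDIANS_M = {
    "builder-solo": 2.77,
    "builder-adversary-min": 9.77,
    "builder-adversary-stateless": 10.12,
    "builder-adversary": 13.04,
    "builder-adversary-deferred": 12.89,
    "builder-adversary-lean": 13.41,
}
BASELINE = "builder-adversary"


def vs_orig(variant: str) -> int:
    """Percent change in median tokens relative to the baseline, rounded."""
    base = MEDIANS_M[BASELINE]
    return round((MEDIANS_M[variant] - base) / base * 100)


for name in MEDIANS_M:
    if name != BASELINE:
        print(f"{name:30s} {vs_orig(name):+d}%")

# Cost of having an adversary at all: baseline median over solo median.
ratio = MEDIANS_M[BASELINE] / MEDIANS_M["builder-solo"]
print(f"adversary / solo = {ratio:.1f}x")
```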
## The variants (engine examples being compared)
All live in `engine/examples/`. They share one task and differ in one dimension each:

| variant | what changes |
|---|---|
| `builder-adversary` | the original full prompts; Adversary verifies **per phase** (the baseline) |
| `builder-adversary-min` | prompts compressed to minimal tokens |
| `builder-adversary-stateless` | orig + **context hygiene** (compact per checkpoint, read diffs not trees, lean loads) |
| `builder-adversary-lean` | orig + context hygiene + **per-gate** review (one claim/verdict per gate) |
| `builder-adversary-deferred` | orig; Adversary verifies **once**, in a final comprehensive `review` phase |
| `builder-solo` | **no Adversary**: a single Builder that self-certifies |

(stateless/lean/deferred are built on the *full original* prompts, so each isolates its one change
without the minimal-prompt confound.)
## The task
Build a 3-phase Python calculator: lexer → parser → evaluator (`plans/calc/{lex,parse,eval}.md`),
each phase with 4-6 cold-verifiable Definition-of-Done gates (`deferred` adds a comprehensive
`plans/calc/review.md`). It's deliberately offline and deterministic so it stresses the *protocol*,
not infrastructure, and the deliverable is behaviorally identical across variants (verified on a
24-expression probe), so the comparison is like-for-like.
## How it works (the real harness, N=5)
Each variant is run **autonomously to completion** by the real harness — `engine/agents.py up` brings
up the Builder + Adversary loop pair + watchdog, which work through the phase machine to
`SEQUENCE-COMPLETE` exactly as in production. There's no simulation: the agents self-pace via `/loop`,
coordinate through git (`claim(`/`review(` commits + the watchdog handoff), and the watchdog heals
stalls and rides out usage limits.
`run-harness-bench.sh` orchestrates the campaign:
1. For each variant × repeat, it stands up a fresh **shared bare repo + two clones** (Builder and
Adversary each get their own, for genuine cold verification), pre-trusts the work dirs, generates
an `agents.toml` pointing at that variant's prompts (and a 4-phase config for `deferred`), and runs
`agents.py up`.
2. It polls for `SEQUENCE-COMPLETE` (per-run timeout), then tears the loop down.
3. It re-runs the task's Definition-of-Done itself (cold, in the Adversary's clone) to confirm
success, and **tallies tokens per loop from the Claude Code session transcripts**.
4. One row per run is appended to `RESULTS-campaign.md.data` immediately (so partial results survive
an interruption). Each run's git repo is kept under `/tmp/ao-campaign-*` for later analysis.

`run-solo-bench.sh` does the same for the single-builder `builder-solo` control.
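Steps 2 and 4 of the campaign loop (poll with a per-run timeout, then append a row straight away) can be sketched in Python as below. This is a minimal illustration and not the code in `run-harness-bench.sh`: the README does not say how the script detects `SEQUENCE-COMPLETE`, so the `done` callable is a stand-in, and `wait_for`, `append_row`, and the example row fields are hypothetical names.

```python
import time
from typing import Callable, Sequence


def wait_for(done: Callable[[], bool], timeout_s: float, interval_s: float = 5.0) -> bool:
    """Poll `done()` until it returns True or `timeout_s` seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while True:
        if done():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # per-run timeout hit; the caller tears the loop down
        time.sleep(min(interval_s, remaining))


def append_row(path: str, fields: Sequence[object]) -> None:
    """Append one tab-separated row and close the file immediately,
    so rows written before an interruption are not lost."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write("\t".join(str(f) for f in fields) + "\n")
```

A caller would pass its own completion check, for example `wait_for(sequence_complete, timeout_s=2700)`, and then write the run's row with `append_row(...)` whether or not the run finished in time.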
`analyze.py` reads the data file and (re)generates `RESULTS-campaign.md` — per-variant token
distributions, the efficiency ratios, correlations, and the full raw table.
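The core of that aggregation is a group-by-variant median. A minimal sketch, assuming the TSV has a header row with `variant` and `tokens` columns (both column names are assumptions for illustration; the real schema is whatever the runner scripts write, and `analyze.py` computes considerably more than this):

```python
import csv
import io
from collections import defaultdict
from statistics import median


def median_tokens(tsv_text: str) -> dict:
    """Median of the `tokens` column per `variant` (hypothetical column names)."""
    by_variant = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        by_variant[row["variant"]].append(int(row["tokens"]))
    return {name: median(values) for name, values in by_variant.items()}


# Made-up rows, only to show the shape of the input.
sample = (
    "variant\ttokens\n"
    "a\t100\n"
    "a\t300\n"
    "a\t200\n"
    "b\t50\n"
    "b\t70\n"
)
print(median_tokens(sample))
```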
### Run it yourself
```bash
git submodule update --init # fetch the vendored engine
# one variant, 5 runs, 45-min per-run timeout:
BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-harness-bench.sh builder-adversary
# the solo control:
BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-solo-bench.sh
python3 analyze.py # regenerate RESULTS-campaign.md
```
Needs `claude` on `PATH` (authenticated), plus `python`, `tmux`, `git`, `timeout`. `run-harness-bench.sh`
with no arguments runs all four loop-pair variants; pass variant names to run a subset. The data file
is **append-mode** (clear it manually for a fresh campaign).
## Methodology & caveats
- **N matters.** A single full-loop run is highly nondeterministic — the *same* variant varied ±55%
run-to-run early on, which is why everything here is **N=5**. (An early single-run "context hygiene
halves tokens" claim did **not** reproduce; the stable figure is -22%.)
- **Excluded runs:** a few real failures (a wedge, a usage-limit/timeout collision) are in the raw
data as `NO` and excluded from stats; clean re-runs supersede them. `LIMIT`-flagged runs (a
usage-limit *pause* inflates duration, not tokens) are kept for token totals but excluded from
`tokens/sec`.
- **Scope:** one task, one model (Sonnet), one harness. *Relative* findings should generalize;
absolute numbers are task-specific. The adversary's *quality* value isn't measured here — the task
is too well-specified to make self-certification fail.
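The exclusion rules above amount to two different filters over the same rows. A minimal sketch, assuming each row carries a `status` field plus numeric `tokens` and `seconds` fields (the field names and the `OK` value are assumptions; only the `NO` and `LIMIT` flags are named in this README, and the rows below are made up):

```python
from statistics import median


def rows_for_token_stats(rows):
    """Drop failed runs (`NO`); keep `LIMIT` runs, whose token totals are still valid."""
    return [r for r in rows if r["status"] != "NO"]


def rows_for_rate_stats(rows):
    """For tokens/sec, also drop `LIMIT` runs: a usage-limit pause inflates duration."""
    return [r for r in rows_for_token_stats(rows) if r["status"] != "LIMIT"]


runs = [
    {"status": "OK", "tokens": 1000, "seconds": 100},
    {"status": "LIMIT", "tokens": 1200, "seconds": 900},
    {"status": "NO", "tokens": 300, "seconds": 50},
    {"status": "OK", "tokens": 1400, "seconds": 140},
]

token_median = median(r["tokens"] for r in rows_for_token_stats(runs))
rate_median = median(r["tokens"] / r["seconds"] for r in rows_for_rate_stats(runs))
print(token_median, rate_median)
```

Keeping the two filters separate means a paused run still contributes to the token medians without dragging down the throughput figure.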
## Layout
```
FINDINGS.md                             the synthesis; start here (current as of 2026-06-16)
RESULTS-campaign.md                     full-harness campaign analysis (stats + ratios + raw table)  ← canonical
RESULTS-campaign.md.data                raw per-run rows (TSV)
analyze.py                              aggregates the data file -> RESULTS-campaign.md
run-harness-bench.sh                    the full-harness campaign runner (loop pair, N runs/variant)
run-solo-bench.sh                       the builder-solo control runner
plans/calc/{lex,parse,eval,review}.md   the calculator task
engine/                                 agent-orchestrator, vendored as a submodule (variants in engine/examples/)

# earlier / superseded exploratory runs (kept for history):
run-bench.sh                            first experiment: headless single-pass, 2 variants, roman-numeral task
plans/roman.md                          that experiment's task
RESULTS.md                              its results (N=1, single-pass; superseded by the campaign)
RESULTS-harness.md                      early 3-variant full-harness run (superseded by RESULTS-campaign.md)
```