From 2d6dab7e22dcab368aa0c8183c54f865d4080579 Mon Sep 17 00:00:00 2001 From: mfowler Date: Tue, 16 Jun 2026 02:33:14 +0000 Subject: [PATCH] =?UTF-8?q?docs:=20rewrite=20README=20=E2=80=94=20full=20s?= =?UTF-8?q?tudy,=20how=20it=20works,=20points=20to=20FINDINGS=20(2026-06-1?= =?UTF-8?q?6)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- README.md | 144 ++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 106 insertions(+), 38 deletions(-) diff --git a/README.md b/README.md index dc80185..fc2e2e1 100644 --- a/README.md +++ b/README.md @@ -1,57 +1,125 @@ # agent-orchestrator-benchmark Benchmarks for the [`agent-orchestrator`](https://git.autonomic.zone/recipe-maintainers/agent-orchestrator) -harness — vendored here as the `engine/` submodule, pinned at a ref that ships the example variants -being compared. +Builder/Adversary loop — measuring **what actually drives token cost**: prompt design, context +discipline, verification cadence, and whether there's an independent adversary at all. The engine is +vendored as the `engine/` submodule, pinned at a ref that ships the example variants being compared. -## What it measures +## → Findings -A head-to-head between two example variants in the engine: +**See [`FINDINGS.md`](FINDINGS.md)** for the synthesis (current as of **2026-06-16**). The one-line +takeaway: -- **`builder-adversary`** — the original Builder/Adversary loop-pair prompts. -- **`builder-adversary-min`** — the same pattern with the role + kickoff prompts compressed to - minimal tokens. +> What the AI adversary costs is set by **whether it verifies at all** (~4.7× a solo builder), **not +> by how often** it verifies (per-gate ≈ per-phase ≈ per-build, all ~13M tokens). The only clean way +> to cut that cost without dropping verification is **context hygiene (−22%)**. -The benchmark confirms each variant **independently succeeds** on the same task (no shared context) -and **clocks the tokens** each uses. +Headline (median tokens, N=5 per variant, all on Sonnet): -## Run +| variant | adversary verifies… | median tokens | vs orig | +|---|---|--:|--:| +| `builder-solo` | never (self-certifies) | 2.77M | −79% | +| `builder-adversary-min` | per phase *(minimal prompts)* | 9.77M | −25% | +| `builder-adversary-stateless` | per phase *(+context hygiene)* | 10.12M | −22% | +| `builder-adversary` (orig) | per **phase** | 13.04M | — | +| `builder-adversary-deferred` | once, after **whole build** | 12.89M | −1% | +| `builder-adversary-lean` | per **gate** | 13.41M | +3% | + +Full per-variant stats, efficiency ratios (tokens/LOC, tokens/sec, tokens/commit), correlations, and +the raw per-run table are in [`RESULTS-campaign.md`](RESULTS-campaign.md); raw rows in +`RESULTS-campaign.md.data`. + +## The variants (engine examples being compared) + +All live in `engine/examples/`. They share one task and differ in one dimension each: + +| variant | what changes | +|---|---| +| `builder-adversary` | the original full prompts; Adversary verifies **per phase** (the baseline) | +| `builder-adversary-min` | prompts compressed to minimal tokens | +| `builder-adversary-stateless` | orig + **context hygiene** (compact per checkpoint, read diffs not trees, lean loads) | +| `builder-adversary-lean` | orig + context hygiene + **per-gate** review (one claim/verdict per gate) | +| `builder-adversary-deferred` | orig; Adversary verifies **once**, in a final comprehensive `review` phase | +| `builder-solo` | **no Adversary** — a single Builder that self-certifies | + +(stateless/lean/deferred are built on the *full original* prompts, so each isolates its one change +without the minimal-prompt confound.) + +## The task + +Build a 3-phase Python calculator — lexer → parser → evaluator (`plans/calc/{lex,parse,eval}.md`), +each phase with 4–6 cold-verifiable Definition-of-Done gates (`deferred` adds a comprehensive +`plans/calc/review.md`). It's deliberately offline and deterministic so it stresses the *protocol*, +not infrastructure, and the deliverable is behaviorally identical across variants (verified on a +24-expression probe) — so the comparison is like-for-like. + +## How it works (the real harness, N=5) + +Each variant is run **autonomously to completion** by the real harness — `engine/agents.py up` brings +up the Builder + Adversary loop pair + watchdog, which work through the phase machine to +`SEQUENCE-COMPLETE` exactly as in production. There's no simulation: the agents self-pace via `/loop`, +coordinate through git (`claim(`/`review(` commits + the watchdog handoff), and the watchdog heals +stalls and rides out usage limits. + +`run-harness-bench.sh` orchestrates the campaign: + +1. For each variant × repeat, it stands up a fresh **shared bare repo + two clones** (Builder and + Adversary each get their own, for genuine cold verification), pre-trusts the work dirs, generates + an `agents.toml` pointing at that variant's prompts (and a 4-phase config for `deferred`), and runs + `agents.py up`. +2. It polls for `SEQUENCE-COMPLETE` (per-run timeout), then tears the loop down. +3. It re-runs the task's Definition-of-Done itself (cold, in the Adversary's clone) to confirm + success, and **tallies tokens per loop from the Claude Code session transcripts**. +4. One row per run is appended to `RESULTS-campaign.md.data` immediately (so partial results survive + an interruption). Each run's git repo is kept under `/tmp/ao-campaign-*` for later analysis. + +`run-solo-bench.sh` does the same for the single-builder `builder-solo` control. +`analyze.py` reads the data file and (re)generates `RESULTS-campaign.md` — per-variant token +distributions, the efficiency ratios, correlations, and the full raw table. + +### Run it yourself ```bash -git submodule update --init # fetch the vendored engine (first time) -./run-bench.sh # writes RESULTS.md +git submodule update --init # fetch the vendored engine +# one variant, 5 runs, 45-min per-run timeout: +BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-harness-bench.sh builder-adversary +# the solo control: +BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-solo-bench.sh +python3 analyze.py # regenerate RESULTS-campaign.md ``` -Needs `claude` on `PATH` and `python`/`timeout`. Both variants run on **Sonnet** -(`claude-sonnet-4-6`) for Builder and Adversary. +Needs `claude` on `PATH` (authenticated), plus `python`, `tmux`, `git`, `timeout`. `run-harness-bench.sh` +with no arguments runs all four loop-pair variants; pass variant names to run a subset. The data file +is **append-mode** (clear it manually for a fresh campaign). -## How it works +## Methodology & caveats -`run-bench.sh` assembles exactly the prompt the harness would send a loop agent (the variant's -`kickoff.md` with `{phase_id}/{plan}/{status}/{role}` substituted, then the role prompt), then drives -one **Builder** pass and one **Adversary** pass as separate headless `claude -p` sessions — fresh -context each, so the two variants (and the two roles) share no context. The Builder builds and -commits in its own repo; the Adversary cold-verifies from its **own clone**. The script then re-runs -the task's Definition-of-Done check itself and reads the Adversary's verdict, and tallies tokens from -`claude -p --output-format json`. - -The test problem is [`plans/roman.md`](plans/roman.md) — an integer→Roman-numeral CLI with a stdlib -`unittest` suite (deterministic, fully local, cold-verifiable, and not present in either example). - -### Caveats - -- This is a **controlled single pass** per variant (N=1; expect run-to-run variance), not the full - self-paced watchdog loop. It measures task effectiveness + prompt token cost, **not** the live - loop / handoff / liveness machinery (that needs a real `engine/agents.py up` run). -- Each `claude -p` call carries a fixed ~24k-token cached system-prompt/tool overhead, and most - tokens come from the agentic work itself — so the prompt-size difference is a small slice of the - total. `RESULTS.md` reports the static prompt size separately so the minimisation is visible. +- **N matters.** A single full-loop run is highly nondeterministic — the *same* variant varied ±55% + run-to-run early on, which is why everything here is **N=5**. (An early single-run "context hygiene + halves tokens" claim did **not** reproduce; the stable figure is −22%.) +- **Excluded runs:** a few real failures (a wedge, a usage-limit/timeout collision) are in the raw + data as `NO` and excluded from stats; superseded by clean re-runs. `LIMIT`-flagged runs (a + usage-limit *pause* inflates duration, not tokens) are kept for token totals but excluded from + `tokens/sec`. +- **Scope:** one task, one model (Sonnet), one harness. *Relative* findings should generalize; + absolute numbers are task-specific. The adversary's *quality* value isn't measured here — the task + is too well-specified to make self-certification fail. ## Layout ``` -engine/ agent-orchestrator, vendored as a submodule (the variants live in engine/examples/) -plans/roman.md the test problem (single source of truth + Definition of Done) -run-bench.sh the runner -RESULTS.md generated by run-bench.sh +FINDINGS.md the synthesis — start here (current as of 2026-06-16) +RESULTS-campaign.md full-harness campaign analysis (stats + ratios + raw table) ← canonical +RESULTS-campaign.md.data raw per-run rows (TSV) +analyze.py aggregates the data file -> RESULTS-campaign.md +run-harness-bench.sh the full-harness campaign runner (loop pair, N runs/variant) +run-solo-bench.sh the builder-solo control runner +plans/calc/{lex,parse,eval,review}.md the calculator task +engine/ agent-orchestrator, vendored as a submodule (variants in engine/examples/) + +# earlier / superseded exploratory runs (kept for history): +run-bench.sh first experiment: headless single-pass, 2 variants, roman-numeral task +plans/roman.md that experiment's task +RESULTS.md its results (N=1, single-pass — superseded by the campaign) +RESULTS-harness.md early 3-variant full-harness run (superseded by RESULTS-campaign.md) ```