docs: rewrite README — full study, how it works, points to FINDINGS (2026-06-16)

2026-06-16 02:33:14 +00:00
parent 8e6290e7e0
commit 2d6dab7e22
1 changed files with 106 additions and 38 deletions
--- a/README.md
+++ b/README.md
@ -1,57 +1,125 @@
 # agent-orchestrator-benchmark

 Benchmarks for the [`agent-orchestrator`](https://git.autonomic.zone/recipe-maintainers/agent-orchestrator)
-harness — vendored here as the `engine/` submodule, pinned at a ref that ships the example variants
-being compared.
+Builder/Adversary loop — measuring **what actually drives token cost**: prompt design, context
+discipline, verification cadence, and whether there's an independent adversary at all. The engine is
+vendored as the `engine/` submodule, pinned at a ref that ships the example variants being compared.

-## What it measures
+## → Findings

-A head-to-head between two example variants in the engine:
+**See [`FINDINGS.md`](FINDINGS.md)** for the synthesis (current as of **2026-06-16**). The one-line
+takeaway:

- **`builder-adversary`** — the original Builder/Adversary loop-pair prompts.
- **`builder-adversary-min`** — the same pattern with the role + kickoff prompts compressed to
-  minimal tokens.
+> What the AI adversary costs is set by **whether it verifies at all** (~4.7× a solo builder), **not
+> by how often** it verifies (per-gate ≈ per-phase ≈ per-build, all ~13M tokens). The only clean way
+> to cut that cost without dropping verification is **context hygiene (−22%)**.

-The benchmark confirms each variant **independently succeeds** on the same task (no shared context)
-and **clocks the tokens** each uses.
+Headline (median tokens, N=5 per variant, all on Sonnet):

-## Run
+| variant | adversary verifies… | median tokens | vs orig |
+|---|---|--:|--:|
+| `builder-solo` | never (self-certifies) | 2.77M | −79% |
+| `builder-adversary-min` | per phase *(minimal prompts)* | 9.77M | −25% |
+| `builder-adversary-stateless` | per phase *(+context hygiene)* | 10.12M | −22% |
+| `builder-adversary` (orig) | per **phase** | 13.04M | — |
+| `builder-adversary-deferred` | once, after **whole build** | 12.89M | −1% |
+| `builder-adversary-lean` | per **gate** | 13.41M | +3% |
+
+Full per-variant stats, efficiency ratios (tokens/LOC, tokens/sec, tokens/commit), correlations, and
+the raw per-run table are in [`RESULTS-campaign.md`](RESULTS-campaign.md); raw rows in
+`RESULTS-campaign.md.data`.
+
+## The variants (engine examples being compared)
+
+All live in `engine/examples/`. They share one task and differ in one dimension each:
+
+| variant | what changes |
+|---|---|
+| `builder-adversary` | the original full prompts; Adversary verifies **per phase** (the baseline) |
+| `builder-adversary-min` | prompts compressed to minimal tokens |
+| `builder-adversary-stateless` | orig + **context hygiene** (compact per checkpoint, read diffs not trees, lean loads) |
+| `builder-adversary-lean` | orig + context hygiene + **per-gate** review (one claim/verdict per gate) |
+| `builder-adversary-deferred` | orig; Adversary verifies **once**, in a final comprehensive `review` phase |
+| `builder-solo` | **no Adversary** — a single Builder that self-certifies |
+
+(stateless/lean/deferred are built on the *full original* prompts, so each isolates its one change
+without the minimal-prompt confound.)
+
+## The task
+
+Build a 3-phase Python calculator — lexer → parser → evaluator (`plans/calc/{lex,parse,eval}.md`),
+each phase with 4–6 cold-verifiable Definition-of-Done gates (`deferred` adds a comprehensive
+`plans/calc/review.md`). It's deliberately offline and deterministic so it stresses the *protocol*,
+not infrastructure, and the deliverable is behaviorally identical across variants (verified on a
+24-expression probe) — so the comparison is like-for-like.
+
+## How it works (the real harness, N=5)
+
+Each variant is run **autonomously to completion** by the real harness — `engine/agents.py up` brings
+up the Builder + Adversary loop pair + watchdog, which work through the phase machine to
+`SEQUENCE-COMPLETE` exactly as in production. There's no simulation: the agents self-pace via `/loop`,
+coordinate through git (`claim(`/`review(` commits + the watchdog handoff), and the watchdog heals
+stalls and rides out usage limits.
+
+`run-harness-bench.sh` orchestrates the campaign:
+
+1. For each variant × repeat, it stands up a fresh **shared bare repo + two clones** (Builder and
+   Adversary each get their own, for genuine cold verification), pre-trusts the work dirs, generates
+   an `agents.toml` pointing at that variant's prompts (and a 4-phase config for `deferred`), and runs
+   `agents.py up`.
+2. It polls for `SEQUENCE-COMPLETE` (per-run timeout), then tears the loop down.
+3. It re-runs the task's Definition-of-Done itself (cold, in the Adversary's clone) to confirm
+   success, and **tallies tokens per loop from the Claude Code session transcripts**.
+4. One row per run is appended to `RESULTS-campaign.md.data` immediately (so partial results survive
+   an interruption). Each run's git repo is kept under `/tmp/ao-campaign-*` for later analysis.
+
+`run-solo-bench.sh` does the same for the single-builder `builder-solo` control.
+`analyze.py` reads the data file and (re)generates `RESULTS-campaign.md` — per-variant token
+distributions, the efficiency ratios, correlations, and the full raw table.
+
+### Run it yourself

 ```bash
-git submodule update --init      # fetch the vendored engine (first time)
-./run-bench.sh                   # writes RESULTS.md
+git submodule update --init                                   # fetch the vendored engine
+# one variant, 5 runs, 45-min per-run timeout:
+BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-harness-bench.sh builder-adversary
+# the solo control:
+BENCH_REPEATS=5 BENCH_TIMEOUT=2700 ./run-solo-bench.sh
+python3 analyze.py                                            # regenerate RESULTS-campaign.md
 ```

-Needs `claude` on `PATH` and `python`/`timeout`. Both variants run on **Sonnet**
-(`claude-sonnet-4-6`) for Builder and Adversary.
+Needs `claude` on `PATH` (authenticated), plus `python`, `tmux`, `git`, `timeout`. `run-harness-bench.sh`
+with no arguments runs all four loop-pair variants; pass variant names to run a subset. The data file
+is **append-mode** (clear it manually for a fresh campaign).

-## How it works
+## Methodology & caveats

-`run-bench.sh` assembles exactly the prompt the harness would send a loop agent (the variant's
-`kickoff.md` with `{phase_id}/{plan}/{status}/{role}` substituted, then the role prompt), then drives
-one **Builder** pass and one **Adversary** pass as separate headless `claude -p` sessions — fresh
-context each, so the two variants (and the two roles) share no context. The Builder builds and
-commits in its own repo; the Adversary cold-verifies from its **own clone**. The script then re-runs
-the task's Definition-of-Done check itself and reads the Adversary's verdict, and tallies tokens from
-`claude -p --output-format json`.
-
-The test problem is [`plans/roman.md`](plans/roman.md) — an integer→Roman-numeral CLI with a stdlib
-`unittest` suite (deterministic, fully local, cold-verifiable, and not present in either example).
-
-### Caveats
-
- This is a **controlled single pass** per variant (N=1; expect run-to-run variance), not the full
-  self-paced watchdog loop. It measures task effectiveness + prompt token cost, **not** the live
-  loop / handoff / liveness machinery (that needs a real `engine/agents.py up` run).
- Each `claude -p` call carries a fixed ~24k-token cached system-prompt/tool overhead, and most
-  tokens come from the agentic work itself — so the prompt-size difference is a small slice of the
-  total. `RESULTS.md` reports the static prompt size separately so the minimisation is visible.
+- **N matters.** A single full-loop run is highly nondeterministic — the *same* variant varied ±55%
+  run-to-run early on, which is why everything here is **N=5**. (An early single-run "context hygiene
+  halves tokens" claim did **not** reproduce; the stable figure is −22%.)
+- **Excluded runs:** a few real failures (a wedge, a usage-limit/timeout collision) are in the raw
+  data as `NO` and excluded from stats; superseded by clean re-runs. `LIMIT`-flagged runs (a
+  usage-limit *pause* inflates duration, not tokens) are kept for token totals but excluded from
+  `tokens/sec`.
+- **Scope:** one task, one model (Sonnet), one harness. *Relative* findings should generalize;
+  absolute numbers are task-specific. The adversary's *quality* value isn't measured here — the task
+  is too well-specified to make self-certification fail.

 ## Layout

 ```
-engine/            agent-orchestrator, vendored as a submodule (the variants live in engine/examples/)
-plans/roman.md     the test problem (single source of truth + Definition of Done)
-run-bench.sh       the runner
-RESULTS.md         generated by run-bench.sh
+FINDINGS.md                  the synthesis — start here (current as of 2026-06-16)
+RESULTS-campaign.md          full-harness campaign analysis (stats + ratios + raw table)  ← canonical
+RESULTS-campaign.md.data     raw per-run rows (TSV)
+analyze.py                   aggregates the data file -> RESULTS-campaign.md
+run-harness-bench.sh         the full-harness campaign runner (loop pair, N runs/variant)
+run-solo-bench.sh            the builder-solo control runner
+plans/calc/{lex,parse,eval,review}.md   the calculator task
+engine/                      agent-orchestrator, vendored as a submodule (variants in engine/examples/)
+
+# earlier / superseded exploratory runs (kept for history):
+run-bench.sh                 first experiment: headless single-pass, 2 variants, roman-numeral task
+plans/roman.md               that experiment's task
+RESULTS.md                   its results (N=1, single-pass — superseded by the campaign)
+RESULTS-harness.md           early 3-variant full-harness run (superseded by RESULTS-campaign.md)
 ```