Standalone analysis over RESULTS-campaign.md.data (safe: independent of the live runner). Adds the normalised efficiency ratios per run with min/median/max per variant, alongside the token distributions, commit/LOC medians, correlations, and full raw table. Run: python3 analyze.py (regenerates RESULTS-campaign.md). Orig baseline (5 runs): tokens/LOC ~25k–34k, tokens/sec ~11.3k–14.0k.
agent-orchestrator-benchmark
Benchmarks for the agent-orchestrator
harness — vendored here as the engine/ submodule, pinned at a ref that ships the example variants
being compared.
What it measures
A head-to-head between two example variants in the engine:
- builder-adversary: the original Builder/Adversary loop-pair prompts.
- builder-adversary-min: the same pattern with the role and kickoff prompts compressed to minimal tokens.
The benchmark confirms each variant independently succeeds on the same task (no shared context) and clocks the tokens each uses.
Run
git submodule update --init # fetch the vendored engine (first time)
./run-bench.sh # writes RESULTS.md
Requires claude on PATH, plus python and timeout. Both variants run on Sonnet
(claude-sonnet-4-6) for Builder and Adversary.
How it works
run-bench.sh assembles exactly the prompt the harness would send a loop agent (the variant's
kickoff.md with {phase_id}/{plan}/{status}/{role} substituted, then the role prompt), then drives
one Builder pass and one Adversary pass as separate headless claude -p sessions — fresh
context each, so the two variants (and the two roles) share no context. The Builder builds and
commits in its own repo; the Adversary cold-verifies from its own clone. The script then re-runs
the task's Definition-of-Done check itself and reads the Adversary's verdict, and tallies tokens from
claude -p --output-format json.
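The prompt assembly and token tally described above can be sketched in Python as follows. This is a minimal illustration, not the code in run-bench.sh: the function names, the blank-line separator between kickoff and role prompt, and the usage key names read from the JSON result are all assumptions and should be checked against the actual script and the actual claude output.

```python
import json
import re

# Placeholders the kickoff template is described as containing.
_PLACEHOLDER = re.compile(r"\{(phase_id|plan|status|role)\}")


def assemble_prompt(kickoff_text: str, role_prompt: str, fields: dict) -> str:
    """Fill the kickoff placeholders in one pass, then append the role prompt.

    A single regex pass leaves any other braces in the markdown untouched
    and never re-substitutes inside an already inserted value.
    """
    def fill(match: re.Match) -> str:
        # Unknown or missing fields are left as-is rather than raising.
        return fields.get(match.group(1), match.group(0))

    body = _PLACEHOLDER.sub(fill, kickoff_text)
    return body + "\n\n" + role_prompt  # separator is an assumption


def tally_tokens(result_json: str) -> int:
    """Sum the token counters of one JSON result (key names are assumed)."""
    usage = json.loads(result_json).get("usage", {})
    keys = (
        "input_tokens",
        "output_tokens",
        "cache_creation_input_tokens",
        "cache_read_input_tokens",
    )
    return sum(int(usage.get(key, 0)) for key in keys)
```

Summing the Builder and Adversary tallies for a variant would then give that variant's total under these assumptions.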
The test problem is plans/roman.md — an integer→Roman-numeral CLI with a stdlib
unittest suite (deterministic, fully local, cold-verifiable, and not present in either example).
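For orientation, a task of this shape might look like the sketch below. It is not the contents of plans/roman.md or either variant's output: the function name to_roman, the 1 to 3999 input range, and the chosen test cases are assumptions made for illustration.

```python
import unittest

# Value/symbol pairs in descending order, including the subtractive forms.
_NUMERALS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n: int) -> str:
    """Convert an integer to a Roman numeral (assumed range: 1..3999)."""
    if not 1 <= n <= 3999:
        raise ValueError("expected an integer between 1 and 3999")
    parts = []
    for value, symbol in _NUMERALS:
        count, n = divmod(n, value)  # how many times this symbol fits
        parts.append(symbol * count)
    return "".join(parts)


class ToRomanTest(unittest.TestCase):
    def test_known_values(self):
        self.assertEqual(to_roman(4), "IV")
        self.assertEqual(to_roman(1994), "MCMXCIV")
        self.assertEqual(to_roman(3999), "MMMCMXCIX")

    def test_out_of_range(self):
        with self.assertRaises(ValueError):
            to_roman(0)
```

A suite like this runs with python3 -m unittest and needs no network or external packages, which is what makes the problem deterministic and cold-verifiable.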
Caveats
- This is a controlled single pass per variant (N=1; expect run-to-run variance), not the full
self-paced watchdog loop. It measures task effectiveness and prompt token cost, not the live
loop / handoff / liveness machinery (that needs a real engine/agents.py up run).
- Each claude -p call carries a fixed ~24k-token cached system-prompt/tool overhead, and most
tokens come from the agentic work itself, so the prompt-size difference is a small slice of the
total. RESULTS.md reports the static prompt size separately so the minimisation is visible.
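The arithmetic behind the second caveat can be made concrete with a small sketch. Only the ~24k overhead comes from the text above; the function name, the simple additive model, and the example figures are hypothetical and are not measured results.

```python
FIXED_OVERHEAD_TOKENS = 24_000  # approximate per-call overhead quoted above


def prompt_share(prompt_tokens: int, work_tokens: int) -> float:
    """Fraction of one call's total tokens contributed by the static prompt.

    Assumes a simple additive model: total = overhead + prompt + agentic work.
    """
    total = FIXED_OVERHEAD_TOKENS + prompt_tokens + work_tokens
    return prompt_tokens / total


# Hypothetical figures: a 1,000-token prompt and 75,000 tokens of agentic work.
share = prompt_share(1_000, 75_000)
print(f"{share:.1%}")  # prints 1.0%
```

Under this model, even halving the prompt moves the total by well under one percent, which is why the static prompt size is worth reporting on its own.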
Layout
engine/ agent-orchestrator, vendored as a submodule (the variants live in engine/examples/)
plans/roman.md the test problem (single source of truth + Definition of Done)
run-bench.sh the runner
RESULTS.md generated by run-bench.sh