commit 27df2c7b55f41977d462e003bf78f6036ef67003 Author: mfowler Date: Sun Jun 14 20:20:05 2026 +0000 feat: agent-orchestrator-benchmark — prompt token comparison harness A standalone repo (engine vendored as a submodule at the examples commit) that runs a head-to-head between the builder-adversary and builder-adversary-min example variants: same task, independent headless runs, both on Sonnet, with token counts. Includes the roman-numeral test problem and run-bench.sh. Co-Authored-By: Claude Opus 4.8 diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..d481f10 --- /dev/null +++ b/.gitignore @@ -0,0 +1,4 @@ +# runtime +.ao-state/ +__pycache__/ +*.pyc diff --git a/.gitmodules b/.gitmodules new file mode 100644 index 0000000..a48e6a3 --- /dev/null +++ b/.gitmodules @@ -0,0 +1,3 @@ +[submodule "engine"] + path = engine + url = https://git.autonomic.zone/recipe-maintainers/agent-orchestrator.git diff --git a/README.md b/README.md new file mode 100644 index 0000000..dc80185 --- /dev/null +++ b/README.md @@ -0,0 +1,57 @@ +# agent-orchestrator-benchmark + +Benchmarks for the [`agent-orchestrator`](https://git.autonomic.zone/recipe-maintainers/agent-orchestrator) +harness — vendored here as the `engine/` submodule, pinned at a ref that ships the example variants +being compared. + +## What it measures + +A head-to-head between two example variants in the engine: + +- **`builder-adversary`** — the original Builder/Adversary loop-pair prompts. +- **`builder-adversary-min`** — the same pattern with the role + kickoff prompts compressed to + minimal tokens. + +The benchmark confirms each variant **independently succeeds** on the same task (no shared context) +and **clocks the tokens** each uses. + +## Run + +```bash +git submodule update --init # fetch the vendored engine (first time) +./run-bench.sh # writes RESULTS.md +``` + +Needs `claude` on `PATH` and `python`/`timeout`. Both variants run on **Sonnet** +(`claude-sonnet-4-6`) for Builder and Adversary. + +## How it works + +`run-bench.sh` assembles exactly the prompt the harness would send a loop agent (the variant's +`kickoff.md` with `{phase_id}/{plan}/{status}/{role}` substituted, then the role prompt), then drives +one **Builder** pass and one **Adversary** pass as separate headless `claude -p` sessions — fresh +context each, so the two variants (and the two roles) share no context. The Builder builds and +commits in its own repo; the Adversary cold-verifies from its **own clone**. The script then re-runs +the task's Definition-of-Done check itself and reads the Adversary's verdict, and tallies tokens from +`claude -p --output-format json`. + +The test problem is [`plans/roman.md`](plans/roman.md) — an integer→Roman-numeral CLI with a stdlib +`unittest` suite (deterministic, fully local, cold-verifiable, and not present in either example). + +### Caveats + +- This is a **controlled single pass** per variant (N=1; expect run-to-run variance), not the full + self-paced watchdog loop. It measures task effectiveness + prompt token cost, **not** the live + loop / handoff / liveness machinery (that needs a real `engine/agents.py up` run). +- Each `claude -p` call carries a fixed ~24k-token cached system-prompt/tool overhead, and most + tokens come from the agentic work itself — so the prompt-size difference is a small slice of the + total. `RESULTS.md` reports the static prompt size separately so the minimisation is visible. + +## Layout + +``` +engine/ agent-orchestrator, vendored as a submodule (the variants live in engine/examples/) +plans/roman.md the test problem (single source of truth + Definition of Done) +run-bench.sh the runner +RESULTS.md generated by run-bench.sh +``` diff --git a/engine b/engine new file mode 160000 index 0000000..737ef81 --- /dev/null +++ b/engine @@ -0,0 +1 @@ +Subproject commit 737ef810666a619a4555c20bac27ebcdc734d3b1 diff --git a/plans/roman.md b/plans/roman.md new file mode 100644 index 0000000..60ae314 --- /dev/null +++ b/plans/roman.md @@ -0,0 +1,28 @@ +# Phase `roman` — integer → Roman numeral + +**Mission.** In the work repo, implement `roman.py` plus a `test_roman.py` unittest suite. Pure +stdlib, no dependencies. This file is the single source of truth for the phase. + +## Definition of Done + +- **D1 — `to_roman(n)`.** Returns the Roman-numeral string for an int `1 ≤ n ≤ 3999` + (e.g. `to_roman(1994) == "MCMXCIV"`). +- **D2 — validation.** `to_roman` raises `ValueError` for `n < 1`, `n > 3999`, or a non-int. +- **D3 — CLI.** `python roman.py 1994` prints `MCMXCIV` (and exits 0); a bad argument exits non-zero. +- **D4 — tests green.** `test_roman.py` (stdlib `unittest`) passes under `python -m unittest`, with + **0 failures**, covering at least: `1→I, 4→IV, 9→IX, 40→XL, 90→XC, 400→CD, 900→CM, + 1994→MCMXCIV, 3888→MMMDCCCLXXXVIII, 3999→MMMCMXCIX`, and `ValueError` on `0`, `4000`, and `"x"`. + +## How to verify (cold) + +From a fresh clone of the work repo: + +```bash +python -m unittest -q # D4: must report OK (0 failures) +python roman.py 1994 # D3: expect MCMXCIV +python roman.py 3888 # expect MMMDCCCLXXXVIII +``` + +Expected outputs are above. The Builder restates the exact commands + expected outputs + commit sha +in `machine-docs/STATUS-roman.md`; the Adversary re-runs them from its own clone and records +`roman: PASS`/`FAIL` in `machine-docs/REVIEW-roman.md`. diff --git a/run-bench.sh b/run-bench.sh new file mode 100755 index 0000000..6bd52c3 --- /dev/null +++ b/run-bench.sh @@ -0,0 +1,164 @@ +#!/usr/bin/env bash +# agent-orchestrator-benchmark — head-to-head prompt comparison. +# +# Compares two example variants that ship in the vendored engine (engine/examples/): +# • builder-adversary (original prompts) +# • builder-adversary-min (minimal prompts) +# on the same task, INDEPENDENTLY (no shared context), both on Sonnet. Confirms each succeeds and +# clocks the tokens each uses. +# +# Methodology (controlled, reproducible — not the full self-paced watchdog loop): +# For each version we assemble EXACTLY the prompt the harness would send a loop agent +# (kickoff_template with {phase_id}/{plan}/{status}/{role} substituted, then the role prompt), +# and drive one Builder pass then one Adversary pass as separate headless `claude -p` sessions +# (fresh context each → "no shared context"). The Builder builds+commits in its own repo; the +# Adversary cold-verifies from its OWN clone. We then independently re-run the DoD check and read +# the Adversary's verdict. Tokens come from `claude -p --output-format json`. +# +# A short identical DRIVER NOTE is appended to BOTH versions' prompts so the agents finish in one +# headless session (no /loop, no waiting). Being identical, it doesn't bias the comparison. +# +# Usage: ./run-bench.sh (writes RESULTS.md here; work dirs under a temp dir) +set -u + +BENCH_DIR="$(cd "$(dirname "$0")" && pwd)" +ENGINE_EX="$BENCH_DIR/engine/examples" +PLAN_SRC="$BENCH_DIR/plans/roman.md" +MODEL="claude-sonnet-4-6" +RUNROOT="$(mktemp -d /tmp/ao-benchmark.XXXXXX)" +RESULTS="$BENCH_DIR/RESULTS.md" +GIT_ID=(-c user.email=bench@example.com -c user.name=bench) + +DRIVER_NOTE='[HEADLESS BENCH — single session, NO loop] Do the ENTIRE phase now, in THIS one session: implement the code and tests, run them, then `git add -A && git commit -m "..."`. Builder: also write machine-docs/STATUS-roman.md and make a claim commit (prefix `claim(`). Adversary: cold-verify and write machine-docs/REVIEW-roman.md with `roman: PASS` or `roman: FAIL`, then commit (prefix `review(`). Do NOT invoke /loop, do NOT ScheduleWakeup, do NOT wait — finish and exit. No git remote exists; commit locally only (do not push).' + +# assemble_prompt -> stdout +assemble_prompt() { + local version="$1" role="$2" planrel="$3" + sed -e "s/{phase_id}/roman/g" \ + -e "s#{plan}#$planrel#g" \ + -e "s/{status}/STATUS-roman.md/g" \ + -e "s/{role}/$role/g" \ + "$ENGINE_EX/$version/prompts/kickoff.md" + printf '\n' + cat "$ENGINE_EX/$version/prompts/$role.md" + printf '\n\n%s\n' "$DRIVER_NOTE" +} + +# usage_of -> "in out cache_create cache_read cost turns" +usage_of() { + python3 - "$1" <<'PY' +import json,sys +try: + d=json.load(open(sys.argv[1])) +except Exception: + print("0 0 0 0 0 0"); sys.exit() +u=d.get("usage",{}) or {} +print(u.get("input_tokens",0), u.get("output_tokens",0), + u.get("cache_creation_input_tokens",0), u.get("cache_read_input_tokens",0), + d.get("total_cost_usd",0), d.get("num_turns",0)) +PY +} + +declare -A SUM_TOK SUM_COST SUM_OK + +run_version() { + local version="$1" + local W="$RUNROOT/$version" + local A="$RUNROOT/$version-adv" + echo "===== $version =====" + + # --- Builder: its own work repo --- + mkdir -p "$W/plans"; cp "$PLAN_SRC" "$W/plans/roman.md" + git "${GIT_ID[@]}" -C "$W" init -q + git "${GIT_ID[@]}" -C "$W" add -A && git "${GIT_ID[@]}" -C "$W" commit -q -m "chore: seed plan" + echo "[$version] builder running…" + ( cd "$W" && timeout 600 claude -p "$(assemble_prompt "$version" builder plans/roman.md)" \ + --output-format json --model "$MODEL" --dangerously-skip-permissions ) > "$RUNROOT/$version.builder.json" 2>"$RUNROOT/$version.builder.err" + + # --- Adversary: its OWN clone, cold --- + git "${GIT_ID[@]}" clone -q "$W" "$A" + echo "[$version] adversary running…" + ( cd "$A" && timeout 600 claude -p "$(assemble_prompt "$version" adversary plans/roman.md)" \ + --output-format json --model "$MODEL" --dangerously-skip-permissions ) > "$RUNROOT/$version.adv.json" 2>"$RUNROOT/$version.adv.err" + + # --- independent success check (re-run DoD ourselves, in the adversary's clone) --- + local tests_ok="no" cli_ok="no" verdict="missing" + ( cd "$A" && python -m unittest -q ) >"$RUNROOT/$version.unittest.txt" 2>&1 && tests_ok="yes" + local out; out="$( cd "$A" && python roman.py 1994 2>/dev/null )" + [ "$out" = "MCMXCIV" ] && cli_ok="yes" + if [ -f "$A/machine-docs/REVIEW-roman.md" ]; then + if grep -qiE 'roman:?\s*PASS' "$A/machine-docs/REVIEW-roman.md" && ! grep -qi 'VETO' "$A/machine-docs/REVIEW-roman.md"; then + verdict="PASS"; else verdict="FAIL/none"; fi + fi + local success="NO" + [ "$tests_ok" = yes ] && [ "$cli_ok" = yes ] && [ "$verdict" = PASS ] && success="YES" + + # --- tally tokens --- + read -r bi bo bcc bcr bcost bturns <<<"$(usage_of "$RUNROOT/$version.builder.json")" + read -r ai ao acc acr acost aturns <<<"$(usage_of "$RUNROOT/$version.adv.json")" + local btok=$((bi+bo+bcc+bcr)) atok=$((ai+ao+acc+acr)) + local vtok=$((btok+atok)) + local vcost; vcost=$(python3 -c "print(round(${bcost:-0}+${acost:-0},4))") + SUM_TOK[$version]=$vtok; SUM_COST[$version]=$vcost; SUM_OK[$version]=$success + + { + echo "### $version" + echo "- **success:** $success (tests=$tests_ok, cli=$cli_ok, adversary-verdict=$verdict)" + echo "- **builder:** in=$bi out=$bo cache_create=$bcc cache_read=$bcr → ${btok} tok, \$${bcost}, turns=$bturns" + echo "- **adversary:** in=$ai out=$ao cache_create=$acc cache_read=$acr → ${atok} tok, \$${acost}, turns=$aturns" + echo "- **total:** ${vtok} tokens, \$${vcost}" + echo + } >>"$RESULTS.tmp" +} + +# static prompt size (chars: kickoff + role, what gets sent each kickoff) +prompt_chars() { cat "$ENGINE_EX/$1/prompts/kickoff.md" "$ENGINE_EX/$1/prompts/$2.md" | wc -c | tr -d ' '; } + +: >"$RESULTS.tmp" +run_version builder-adversary +run_version builder-adversary-min + +# ---- write RESULTS.md ---- +{ + echo "# Benchmark results — original vs minimal prompts" + echo + echo "Engine pinned at: \`$(git -C "$BENCH_DIR/engine" rev-parse --short HEAD)\`. Task:" + echo "\`plans/roman.md\` (integer → Roman numeral). Model: **$MODEL** for Builder and Adversary in" + echo "both versions. Runs are independent (separate headless \`claude -p\` sessions, no shared" + echo "context). Methodology + caveats: see \`run-bench.sh\` header and the note below." + echo + echo "## Static prompt size (chars: kickoff + role, what gets sent each kickoff)" + echo + echo "| version | builder prompt | adversary prompt |" + echo "|---|--:|--:|" + echo "| builder-adversary (orig) | $(prompt_chars builder-adversary builder) | $(prompt_chars builder-adversary adversary) |" + echo "| builder-adversary-min | $(prompt_chars builder-adversary-min builder) | $(prompt_chars builder-adversary-min adversary) |" + echo + echo "## Per-run tokens & cost" + echo + cat "$RESULTS.tmp" + echo "## Summary" + echo + echo "| version | success | total tokens | total cost |" + echo "|---|:--:|--:|--:|" + echo "| builder-adversary (orig) | ${SUM_OK[builder-adversary]} | ${SUM_TOK[builder-adversary]} | \$${SUM_COST[builder-adversary]} |" + echo "| builder-adversary-min | ${SUM_OK[builder-adversary-min]} | ${SUM_TOK[builder-adversary-min]} | \$${SUM_COST[builder-adversary-min]} |" + echo + echo "> Note: each \`claude -p\` call carries a fixed ~24k-token cached Claude Code system-prompt +" + echo "> tool-schema overhead, and most tokens come from the agentic work itself (reading the plan," + echo "> writing/running code, tool results). The role/kickoff prompt is a small slice — so the" + echo "> headline token totals are close; the minimisation shows up in the static prompt size above" + echo "> and the (smaller) input/cache-creation portion. This bench is a single controlled pass per" + echo "> version (N=1; expect run-to-run variance); it exercises task effectiveness + prompt cost," + echo "> NOT the live watchdog loop / handoff machinery (that needs a full \`agents.py up\` run)." + echo + echo "_Work dirs for this run: \`$RUNROOT\`_" +} >"$RESULTS" +rm -f "$RESULTS.tmp" + +echo +echo "===== DONE =====" +echo "orig: success=${SUM_OK[builder-adversary]} tokens=${SUM_TOK[builder-adversary]} cost=\$${SUM_COST[builder-adversary]}" +echo "min : success=${SUM_OK[builder-adversary-min]} tokens=${SUM_TOK[builder-adversary-min]} cost=\$${SUM_COST[builder-adversary-min]}" +echo "Results: $RESULTS" +echo "Run dirs: $RUNROOT"