diff --git a/RESULTS-harness.md.tmp b/RESULTS-harness.md.tmp
new file mode 100644
index 0000000..e69de29
diff --git a/RESULTS.md b/RESULTS.md
new file mode 100644
index 0000000..498dbbb
--- /dev/null
+++ b/RESULTS.md
@@ -0,0 +1,44 @@
+# Benchmark results — original vs minimal prompts
+
+Engine pinned at: `737ef81`. Task:
+`plans/roman.md` (integer → Roman numeral). Model: **claude-sonnet-4-6** for Builder and Adversary in
+both versions. Runs are independent (separate headless `claude -p` sessions, no shared
+context). Methodology + caveats: see `run-bench.sh` header and the note below.
+
+## Static prompt size (chars: kickoff + role, what gets sent each kickoff)
+
+| version | builder prompt | adversary prompt |
+|---|--:|--:|
+| builder-adversary (orig) | 6389 | 5811 |
+| builder-adversary-min | 1751 | 1644 |
+
+## Per-run tokens & cost
+
+### builder-adversary
+- **success:** YES (tests=yes, cli=yes, adversary-verdict=PASS)
+- **builder:** in=21 out=4007 cache_create=14460 cache_read=526213 → 544701 tok, $0.3073279, turns=21
+- **adversary:** in=14 out=3245 cache_create=14930 cache_read=331897 → 350086 tok, $0.24022810000000003, turns=17
+- **total:** 894787 tokens, $0.5476
+
+### builder-adversary-min
+- **success:** YES (tests=yes, cli=yes, adversary-verdict=PASS)
+- **builder:** in=20 out=4257 cache_create=13183 cache_read=477142 → 494602 tok, $0.28740659999999996, turns=18
+- **adversary:** in=16 out=4545 cache_create=14792 cache_read=378787 → 398140 tok, $0.2718171000000001, turns=16
+- **total:** 892742 tokens, $0.5592
+
+## Summary
+
+| version | success | total tokens | total cost |
+|---|:--:|--:|--:|
+| builder-adversary (orig) | YES | 894787 | $0.5476 |
+| builder-adversary-min | YES | 892742 | $0.5592 |
+
+> Note: each `claude -p` call carries a fixed ~24k-token cached Claude Code system-prompt +
+> tool-schema overhead, and most tokens come from the agentic work itself (reading the plan,
+> writing/running code, tool results). The role/kickoff prompt is a small slice — so the
+> headline token totals are close; the minimisation shows up in the static prompt size above
+> and the (smaller) input/cache-creation portion. This bench is a single controlled pass per
+> version (N=1; expect run-to-run variance); it exercises task effectiveness + prompt cost,
+> NOT the live watchdog loop / handoff machinery (that needs a full `agents.py up` run).
+
+_Work dirs for this run: `/tmp/ao-benchmark.CwQFWF`_
diff --git a/plans/calc/eval.md b/plans/calc/eval.md
new file mode 100644
index 0000000..c841aca
--- /dev/null
+++ b/plans/calc/eval.md
@@ -0,0 +1,36 @@
+# Phase `eval` — evaluator + CLI
+
+**Mission.** Build `calc/evaluator.py` exposing `evaluate(node) -> int | float` (walking the `parse`
+phase's AST) and a top-level `calc.py` CLI, plus a `unittest` suite. SSOT for this phase. This phase
+makes the calculator end-to-end: string → tokens → AST → number.
+
+## Definition of Done (each Dn is a gate)
+
+- **D1 — arithmetic.** `evaluate(parse(tokenize(s)))` is correct for `+ - * /`, precedence, parens,
+  and unary minus: `"2+3*4"`→14, `"(2+3)*4"`→20, `"8-3-2"`→3, `"-2+5"`→3, `"2*-3"`→-6.
+- **D2 — division.** `/` is true division (`"7/2"`→3.5). Division by zero raises `EvalError` (define
+  it) — not a bare `ZeroDivisionError` escaping the API.
+- **D3 — result type.** Whole-valued results print without a trailing `.0` (`"4/2"`→`2`), non-whole
+  as a float (`"7/2"`→`3.5`). Document the rule and keep it consistent.
+- **D4 — CLI.** `python calc.py "2+3*4"` prints `14` and exits 0; an invalid expression
+  (`python calc.py "1 +"`) prints an error to stderr and exits non-zero (no traceback).
+- **D5 — tests green + end-to-end.** `calc/test_evaluator.py` (`unittest`) passes under
+  `python -m unittest`, 0 failures, covering D1–D3; and a CLI check covers D4. The whole prior suite
+  (lex + parse) must still pass (no regression).
+
+## Verify (cold)
+
+```bash
+python -m unittest -q            # D5 (whole suite, all phases)
+python calc.py "2+3*4"           # 14
+python calc.py "(2+3)*4"         # 20
+python calc.py "7/2"             # 3.5
+python calc.py "4/2"             # 2
+python calc.py "1/0"             # error to stderr, non-zero exit
+python calc.py "1 +"             # error to stderr, non-zero exit
+```
+
+The Builder restates exact commands + expected outputs + commit sha in `machine-docs/STATUS-eval.md`.
+The Adversary cold-verifies and records `eval/Dn: PASS|FAIL` in `machine-docs/REVIEW-eval.md`. When
+every gate in every phase has a fresh PASS and there is no `## VETO`, the Builder writes `## DONE` to
+`machine-docs/STATUS-eval.md` (this is the last phase → the harness then marks the sequence complete).
diff --git a/plans/calc/lex.md b/plans/calc/lex.md
new file mode 100644
index 0000000..4cb7acb
--- /dev/null
+++ b/plans/calc/lex.md
@@ -0,0 +1,34 @@
+# Phase `lex` — tokenizer
+
+**Mission.** Start a Python arithmetic calculator. In this phase build the **lexer**: `calc/lexer.py`
+exposing `tokenize(src: str) -> list[Token]`, plus a `unittest` suite. Pure stdlib. This file is the
+single source of truth for the phase. (Later phases add the parser and evaluator — design the Token
+type so they can consume it.)
+
+A `Token` has at least a `kind` and a `value`. Kinds: `NUMBER`, `PLUS`, `MINUS`, `STAR`, `SLASH`,
+`LPAREN`, `RPAREN`, and `EOF` as the final token.
+
+## Definition of Done (each Dn is a gate: Builder claims it, Adversary cold-verifies)
+
+- **D1 — numbers.** Integers (`42`) and floats (`3.14`, `.5`, `10.`) tokenize to one `NUMBER` token
+  whose value is the numeric value (int or float). `tokenize("42")` → `[NUMBER(42), EOF]`.
+- **D2 — operators & parens.** `+ - * / ( )` each tokenize to the right kind; `tokenize("1+2*3")`
+  yields `NUMBER PLUS NUMBER STAR NUMBER EOF`.
+- **D3 — whitespace & errors.** Spaces/tabs between tokens are skipped; an invalid character (e.g.
+  `@`, `$`, a letter) raises `LexError` (define it in the module) with the offending character and
+  its position in the message.
+- **D4 — tests green.** `calc/test_lexer.py` (`unittest`) passes under `python -m unittest`, 0
+  failures, covering D1–D3 including: `" 12 + 3 "`, `"3.5*(1-2)"`, and that `"1 @ 2"` raises
+  `LexError`.
+
+## Verify (cold)
+
+```bash
+python -m unittest -q   # D4
+python -c "from calc.lexer import tokenize; print([(t.kind,t.value) for t in tokenize('3.5*(1-2)')])"
+python -c "from calc.lexer import tokenize; tokenize('1 @ 2')"   # must raise LexError
+```
+
+The Builder restates the exact commands + expected token lists + commit sha in
+`machine-docs/STATUS-lex.md`; the Adversary re-runs from its own clone and records
+`lex/Dn: PASS|FAIL` in `machine-docs/REVIEW-lex.md`.
diff --git a/plans/calc/parse.md b/plans/calc/parse.md
new file mode 100644
index 0000000..1c8d2d9
--- /dev/null
+++ b/plans/calc/parse.md
@@ -0,0 +1,34 @@
+# Phase `parse` — recursive-descent parser
+
+**Mission.** Build `calc/parser.py` exposing `parse(tokens) -> Node` (consuming the `lex` phase's
+tokens) that produces an **AST** with correct arithmetic precedence and associativity, plus a
+`unittest` suite. SSOT for this phase. Do NOT evaluate yet — just build the tree (the `eval` phase
+consumes it). Represent nodes however you like (e.g. `Num(value)` and `BinOp(op, left, right)`,
+`Unary(op, operand)`), but expose a stable, documented shape the evaluator can walk.
+
+## Definition of Done (each Dn is a gate)
+
+- **D1 — precedence.** `*` and `/` bind tighter than `+` and `-`: `1+2*3` parses as `1+(2*3)`, not
+  `(1+2)*3`.
+- **D2 — left associativity.** Same-precedence operators associate left: `8-3-2` parses as
+  `(8-3)-2`; `8/4/2` as `(8/4)/2`.
+- **D3 — parentheses.** Parens override precedence: `(1+2)*3` parses with the `+` under the `*`.
+- **D4 — unary minus.** Leading and nested unary minus parses: `-5`, `-(1+2)`, `3 * -2`.
+- **D5 — errors.** Malformed input raises `ParseError` (define it): `"1 +"`, `"(1"`, `"1 2"`, `")("`,
+  and the empty string each raise (not crash with a different exception).
+- **D6 — tests green.** `calc/test_parser.py` (`unittest`) passes under `python -m unittest`, 0
+  failures, covering D1–D5. Assert on tree structure (e.g. a `repr`/shape helper), not on evaluation.
+
+## Verify (cold)
+
+```bash
+python -m unittest -q   # D6
+# D1/D3 differ in structure — the Builder's STATUS gives the exact shape assertion to re-run:
+python -c "from calc.lexer import tokenize; from calc.parser import parse; print(parse(tokenize('1+2*3')))"
+python -c "from calc.lexer import tokenize; from calc.parser import parse; parse(tokenize('1 +'))"   # ParseError
+```
+
+The Builder documents the AST shape + exact assertions in `machine-docs/STATUS-parse.md`; the
+Adversary cold-verifies and records `parse/Dn: PASS|FAIL` in `machine-docs/REVIEW-parse.md`. Watch
+especially for a precedence/associativity bug that still passes a weak test — re-derive the expected
+tree yourself from the plan.
diff --git a/run-harness-bench.sh b/run-harness-bench.sh
new file mode 100755
index 0000000..cf18bb2
--- /dev/null
+++ b/run-harness-bench.sh
@@ -0,0 +1,217 @@
+#!/usr/bin/env bash
+# run-harness-bench.sh — FULL harness benchmark (real agents.py up loop, not headless single-pass).
+#
+# For each variant (builder-adversary, builder-adversary-min) this stands up a real Builder/Adversary
+# loop pair + watchdog over a shared work repo and lets them run autonomously through the multi-phase
+# calculator (plans/calc/{lex,parse,eval}.md) to SEQUENCE-COMPLETE. Both loops on Sonnet. Then it
+# clocks the tokens each loop used (summed from the Claude Code session transcripts) and re-runs the
+# final Definition-of-Done itself.
+#
+# Long, autonomous, nondeterministic (N=1). Per-variant wall-clock timeout below. Usage: ./run-harness-bench.sh
+set -u
+
+BENCH_DIR="$(cd "$(dirname "$0")" && pwd)"
+ENGINE="$BENCH_DIR/engine"
+PLANS="$BENCH_DIR/plans/calc"
+AGENTS_PY="$ENGINE/agents.py"
+MODEL="claude-sonnet-4-6"
+RUNROOT="$(mktemp -d /tmp/ao-harness-XXXXXX)"  # no dot → clean transcript-dir mapping
+RESULTS="$BENCH_DIR/RESULTS-harness.md"
+TIMEOUT="${BENCH_TIMEOUT:-3000}"  # seconds per variant
+POLL=60
+GIT_ID=(-c user.email=bench@example.com -c user.name=bench)
+
+log() { echo "[$(date -u +%H:%M:%S)] $*"; }
+
+# transcript token sum for an agent's working dir -> "in out cache_create cache_read"
+collect_tokens() {
+  python3 - "$1" <<'PY'
+import json,sys,os,glob
+wd=sys.argv[1].rstrip('/')
+name=wd.replace('/','-').replace('.','-')  # '/tmp/x' -> '-tmp-x'
+tdir=os.path.expanduser("~/.claude/projects/"+name)
+ti=to=tcc=tcr=0
+for f in glob.glob(tdir+"/*.jsonl"):
+    for line in open(f, errors="ignore"):
+        try: o=json.loads(line)
+        except Exception: continue
+        if o.get("type")=="assistant":
+            u=(o.get("message",{}) or {}).get("usage",{}) or {}
+            ti+=u.get("input_tokens",0) or 0; to+=u.get("output_tokens",0) or 0
+            tcc+=u.get("cache_creation_input_tokens",0) or 0; tcr+=u.get("cache_read_input_tokens",0) or 0
+print(ti,to,tcc,tcr)
+PY
+}
+
+gen_config() { #
+  local v="$1" run="$2" prefix="$3"
+  cat > "$run/agents.toml" < README.md \
+      && : > machine-docs/.gitkeep \
+      && git "${GIT_ID[@]}" add -A && git "${GIT_ID[@]}" commit -q -m "chore: seed work repo" \
+      && git "${GIT_ID[@]}" remote add origin "$run/origin.git" && git "${GIT_ID[@]}" push -q -u origin main )
+  git "${GIT_ID[@]}" clone -q "$run/origin.git" "$run/work"
+  git "${GIT_ID[@]}" clone -q "$run/origin.git" "$run/work-adv"
+  for c in work work-adv; do ( cd "$run/$c" && git config user.email bench@example.com && git config user.name bench ); done
+
+  gen_config "$v" "$run" "$prefix"
+
+  log "[$v] agents.py up …"
+  python3 "$AGENTS_PY" up --config "$run/agents.toml" >"$run/up.log" 2>&1
+  log "[$v] up done; status:"; python3 "$AGENTS_PY" status --config "$run/agents.toml" 2>&1 | sed 's/^/ /'
+
+  # poll for SEQUENCE-COMPLETE or timeout
+  local marker="$run/.ao-state/SEQUENCE-COMPLETE" t=0 done="no"
+  while [ $t -lt "$TIMEOUT" ]; do
+    if [ -f "$marker" ]; then done="yes"; break; fi
+    sleep "$POLL"; t=$((t+POLL))
+    local idx commits
+    idx="$(cat "$run/.ao-state/state/phase-idx" 2>/dev/null || echo '?')"
+    commits="$(git -C "$run/origin.git" rev-list --count main 2>/dev/null || echo '?')"
+    log "[$v] t=${t}s phase-idx=$idx origin-commits=$commits"
+  done
+  log "[$v] loop finished (sequence-complete=$done after ${t}s); tearing down"
+  python3 "$AGENTS_PY" down --config "$run/agents.toml" >"$run/down.log" 2>&1
+
+  # final DoD check from the adversary's clone (pull latest first)
+  ( cd "$run/work-adv" && git "${GIT_ID[@]}" pull -q --no-rebase origin main 2>/dev/null )
+  local tests=no cli=no
+  ( cd "$run/work-adv" && python -m unittest -q ) >"$run/final-unittest.txt" 2>&1 && tests=yes
+  local out; out="$( cd "$run/work-adv" && python calc.py '2+3*4' 2>/dev/null )"
+  [ "$out" = "14" ] && cli=yes
+  local reviews; reviews="$(grep -rhoiE '(lex|parse|eval)/D[0-9]+:?\s*PASS' "$run/work-adv/machine-docs/" 2>/dev/null | sort -u | wc -l)"
+  local veto="no"; grep -rqi 'VETO' "$run/work-adv/machine-docs/" 2>/dev/null && veto=yes
+  SUM_PHASES[$v]="$(cat "$run/.ao-state/state/phase-idx" 2>/dev/null || echo '?')"
+
+  local success=NO
+  [ "$done" = yes ] && [ "$tests" = yes ] && [ "$cli" = yes ] && [ "$veto" = no ] && success=YES
+  SUM_OK[$v]=$success
+
+  # tokens
+  read -r bi bo bcc bcr <<<"$(collect_tokens "$run/work")"
+  read -r ai ao acc acr <<<"$(collect_tokens "$run/work-adv")"
+  local btok=$((bi+bo+bcc+bcr)) atok=$((ai+ao+acc+acr)) vtok=$(( bi+bo+bcc+bcr + ai+ao+acc+acr ))
+  SUM_TOK[$v]=$vtok
+
+  {
+    echo "### $v"
+    echo "- **success:** $success (sequence-complete=$done, tests=$tests, cli('2+3*4'→'$out')=$cli, gates-passed=$reviews, veto=$veto, final phase-idx=${SUM_PHASES[$v]})"
+    echo "- **builder loop:** in=$bi out=$bo cache_create=$bcc cache_read=$bcr → **${btok}** tok"
+    echo "- **adversary loop:** in=$ai out=$ao cache_create=$acc cache_read=$acr → **${atok}** tok"
+    echo "- **total:** **${vtok}** tokens"
+    echo
+  } >>"$RESULTS.tmp"
+  log "[$v] DONE success=$success tokens=$vtok gates-passed=$reviews"
+}
+
+prompt_chars() { cat "$ENGINE/examples/$1/prompts/kickoff.md" "$ENGINE/examples/$1/prompts/$2.md" | wc -c | tr -d ' '; }
+
+: >"$RESULTS.tmp"
+run_variant builder-adversary
+run_variant builder-adversary-min
+
+{
+  echo "# Full-harness benchmark — original vs minimal prompts"
+  echo
+  echo "Real \`agents.py up\` Builder/Adversary loop pair + watchdog, run autonomously through the"
+  echo "multi-phase calculator (\`plans/calc/{lex,parse,eval}.md\`) to SEQUENCE-COMPLETE. Engine pinned"
+  echo "at \`$(git -C "$ENGINE" rev-parse --short HEAD)\`. Both loops on **$MODEL**. Per-variant timeout"
+  echo "${TIMEOUT}s. Tokens summed from the Claude Code session transcripts of each loop's clone."
+  echo
+  echo "## Static prompt size (chars: kickoff + role)"
+  echo "| version | builder | adversary |"
+  echo "|---|--:|--:|"
+  echo "| builder-adversary (orig) | $(prompt_chars builder-adversary builder) | $(prompt_chars builder-adversary adversary) |"
+  echo "| builder-adversary-min | $(prompt_chars builder-adversary-min builder) | $(prompt_chars builder-adversary-min adversary) |"
+  echo
+  echo "## Per-variant"
+  echo
+  cat "$RESULTS.tmp"
+  echo "## Summary"
+  echo "| version | success | total tokens |"
+  echo "|---|:--:|--:|"
+  echo "| builder-adversary (orig) | ${SUM_OK[builder-adversary]:-?} | ${SUM_TOK[builder-adversary]:-?} |"
+  echo "| builder-adversary-min | ${SUM_OK[builder-adversary-min]:-?} | ${SUM_TOK[builder-adversary-min]:-?} |"
+  echo
+  echo "_N=1 per variant; the autonomous loop is nondeterministic (number of review rounds varies)._"
+  echo "_Run dirs: \`$RUNROOT\`_"
+} >"$RESULTS"
+rm -f "$RESULTS.tmp"
+
+echo; echo "===== ALL DONE ====="
+echo "orig: success=${SUM_OK[builder-adversary]:-?} tokens=${SUM_TOK[builder-adversary]:-?}"
+echo "min : success=${SUM_OK[builder-adversary-min]:-?} tokens=${SUM_TOK[builder-adversary-min]:-?}"
+echo "Results: $RESULTS"
+echo "Run dirs: $RUNROOT"
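
The token accounting done by `collect_tokens` in `run-harness-bench.sh` is easier to check in isolation. What follows is a minimal standalone sketch of the same summation, not the script's actual code path. The names `transcript_dir`, `sum_usage`, `demo`, and the `root` parameter are hypothetical. The transcript layout (one JSON record per line under `~/.claude/projects/<mangled-path>/`, with counters under `message.usage` on `assistant` records) is assumed from what the script itself reads, not from any documented format. The demo writes synthetic records to a temporary directory, so it never touches real transcripts.

```python
import json
import os
import tempfile
from pathlib import Path

# The four counters the harness script sums per loop.
USAGE_KEYS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def transcript_dir(workdir: str, root: str = "~/.claude/projects") -> Path:
    """Map a working dir to a transcript dir as the script does: '/' and '.' become '-'."""
    name = workdir.rstrip("/").replace("/", "-").replace(".", "-")
    return Path(os.path.expanduser(root)) / name


def sum_usage(tdir: Path) -> dict:
    """Sum the usage counters over every 'assistant' record in tdir/*.jsonl."""
    totals = dict.fromkeys(USAGE_KEYS, 0)
    for path in sorted(tdir.glob("*.jsonl")):
        with open(path, errors="ignore") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except ValueError:  # skip truncated or non-JSON lines
                    continue
                if not isinstance(record, dict) or record.get("type") != "assistant":
                    continue
                usage = (record.get("message") or {}).get("usage") or {}
                for key in USAGE_KEYS:
                    totals[key] += usage.get(key) or 0  # missing or null counts as 0
    return totals


def demo() -> dict:
    """Build a throwaway transcript dir with synthetic records and sum it."""
    with tempfile.TemporaryDirectory() as root:
        tdir = transcript_dir("/tmp/ao-harness-demo/work", root=root)
        tdir.mkdir(parents=True)
        lines = [
            json.dumps({"type": "assistant", "message": {"usage": {
                "input_tokens": 3, "output_tokens": 10,
                "cache_creation_input_tokens": 100,
                "cache_read_input_tokens": 1000}}}),
            # Non-assistant records are ignored even if they carry usage.
            json.dumps({"type": "user", "message": {"usage": {"input_tokens": 999}}}),
            "not json",
            json.dumps({"type": "assistant", "message": {"usage": {
                "input_tokens": 2, "output_tokens": 5,
                "cache_creation_input_tokens": None}}}),
        ]
        (tdir / "session.jsonl").write_text("\n".join(lines) + "\n")
        return sum_usage(tdir)


if __name__ == "__main__":
    print(demo())
```

Running the module prints the four summed counters for the synthetic transcript. Keeping the path mangling in its own function makes that rule (both `/` and `.` become `-`) easy to test separately from the summation.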