feat: multi-phase calculator problem + full-harness benchmark runner

- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per phase
  (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial edge cases
  (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real `agents.py up` Builder/Adversary loop pair + watchdog
  over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and clocks tokens from the
  session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token effect;
  cache-read of the working context dominates)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:40:14 +00:00
parent 27df2c7b55
commit 8c3f38dbf4
6 changed files with 365 additions and 0 deletions
--- a/RESULTS-harness.md.tmp
+++ b/RESULTS-harness.md.tmp
--- a/RESULTS.md
+++ b/RESULTS.md
@@ -0,0 +1,44 @@
+# Benchmark results — original vs minimal prompts
+
+Engine pinned at: `737ef81`. Task:
+`plans/roman.md` (integer → Roman numeral). Model: **claude-sonnet-4-6** for Builder and Adversary in
+both versions. Runs are independent (separate headless `claude -p` sessions, no shared
+context). Methodology + caveats: see `run-bench.sh` header and the note below.
+
+## Static prompt size (chars: kickoff + role, what gets sent each kickoff)
+
+| version | builder prompt | adversary prompt |
+|---|--:|--:|
+| builder-adversary (orig) | 6389 | 5811 |
+| builder-adversary-min    | 1751 | 1644 |
+
+## Per-run tokens & cost
+
+### builder-adversary
+- **success:** YES  (tests=yes, cli=yes, adversary-verdict=PASS)
+- **builder:** in=21 out=4007 cache_create=14460 cache_read=526213 → 544701 tok, $0.3073279, turns=21
+- **adversary:** in=14 out=3245 cache_create=14930 cache_read=331897 → 350086 tok, $0.2402281, turns=17
+- **total:** 894787 tokens, $0.5476
+
+### builder-adversary-min
+- **success:** YES  (tests=yes, cli=yes, adversary-verdict=PASS)
+- **builder:** in=20 out=4257 cache_create=13183 cache_read=477142 → 494602 tok, $0.2874066, turns=18
+- **adversary:** in=16 out=4545 cache_create=14792 cache_read=378787 → 398140 tok, $0.2718171, turns=16
+- **total:** 892742 tokens, $0.5592
+
+## Summary
+
+| version | success | total tokens | total cost |
+|---|:--:|--:|--:|
+| builder-adversary (orig) | YES | 894787 | $0.5476 |
+| builder-adversary-min    | YES | 892742 | $0.5592 |
+
+> Note: each `claude -p` call carries a fixed ~24k-token cached Claude Code system-prompt +
+> tool-schema overhead, and most tokens come from the agentic work itself (reading the plan,
+> writing/running code, tool results). The role/kickoff prompt is a small slice — so the
+> headline token totals are close; the minimisation shows up in the static prompt size above
+> and the (smaller) input/cache-creation portion. This bench is a single controlled pass per
+> version (N=1; expect run-to-run variance); it exercises task effectiveness + prompt cost,
+> NOT the live watchdog loop / handoff machinery (that needs a full `agents.py up` run).
+
+_Work dirs for this run: `/tmp/ao-benchmark.CwQFWF`_
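Note on `RESULTS.md`: the per-run totals and the claim that cache reads dominate can be rechecked from the reported figures alone. The sketch below copies the four usage lines from the file and recomputes the sums; the helper names (`loop_total`, `version_total`, `cache_read_share`) are illustrative and not part of the benchmark scripts. With these numbers, `cache_read` comes out between roughly 94.8% and 96.6% of each loop's tokens.

```python
# Usage figures copied from the per-run lines of RESULTS.md.
RUNS = {
    "builder-adversary": {
        "builder": {"in": 21, "out": 4007, "cache_create": 14460, "cache_read": 526213},
        "adversary": {"in": 14, "out": 3245, "cache_create": 14930, "cache_read": 331897},
    },
    "builder-adversary-min": {
        "builder": {"in": 20, "out": 4257, "cache_create": 13183, "cache_read": 477142},
        "adversary": {"in": 16, "out": 4545, "cache_create": 14792, "cache_read": 378787},
    },
}


def loop_total(usage):
    """Sum of all four token counters for one loop (builder or adversary)."""
    return sum(usage.values())


def version_total(version):
    """Builder + adversary tokens for one prompt version."""
    return sum(loop_total(u) for u in RUNS[version].values())


def cache_read_share(usage):
    """Fraction of a loop's tokens that were cache reads."""
    return usage["cache_read"] / loop_total(usage)


for version, loops in RUNS.items():
    for role, usage in loops.items():
        print(f"{version:22s} {role:9s} total={loop_total(usage)} "
              f"cache_read={cache_read_share(usage):.1%}")
    print(f"{version:22s} {'both':9s} total={version_total(version)}")
```

The two version totals differ by 2045 tokens, which is why the static prompt-size reduction is barely visible in the headline numbers.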
--- a/plans/calc/eval.md
+++ b/plans/calc/eval.md
@@ -0,0 +1,36 @@
+# Phase `eval` — evaluator + CLI
+
+**Mission.** Build `calc/evaluator.py` exposing `evaluate(node) -> int | float` (walking the `parse`
+phase's AST) and a top-level `calc.py` CLI, plus a `unittest` suite. This file is the single source of
+truth for the phase. It makes the calculator end-to-end: string → tokens → AST → number.
+
+## Definition of Done (each Dn is a gate)
+
+- **D1 — arithmetic.** `evaluate(parse(tokenize(s)))` is correct for `+ - * /`, precedence, parens,
+  and unary minus: `"2+3*4"`→14, `"(2+3)*4"`→20, `"8-3-2"`→3, `"-2+5"`→3, `"2*-3"`→-6.
+- **D2 — division.** `/` is true division (`"7/2"`→3.5). Division by zero raises `EvalError` (define
+  it) — not a bare `ZeroDivisionError` escaping the API.
+- **D3 — result type.** Whole-valued results print without a trailing `.0` (`"4/2"`→`2`), non-whole
+  as a float (`"7/2"`→`3.5`). Document the rule and keep it consistent.
+- **D4 — CLI.** `python calc.py "2+3*4"` prints `14` and exits 0; an invalid expression
+  (`python calc.py "1 +"`) prints an error to stderr and exits non-zero (no traceback).
+- **D5 — tests green + end-to-end.** `calc/test_evaluator.py` (`unittest`) passes under
+  `python -m unittest`, 0 failures, covering D1–D3; and a CLI check covers D4. The whole prior suite
+  (lex + parse) must still pass (no regression).
+
+## Verify (cold)
+
+```bash
+python -m unittest -q                              # D5 (whole suite, all phases)
+python calc.py "2+3*4"     # 14
+python calc.py "(2+3)*4"   # 20
+python calc.py "7/2"       # 3.5
+python calc.py "4/2"       # 2
+python calc.py "1/0"       # error to stderr, non-zero exit
+python calc.py "1 +"       # error to stderr, non-zero exit
+```
+
+The Builder restates exact commands + expected outputs + commit sha in `machine-docs/STATUS-eval.md`.
+The Adversary cold-verifies and records `eval/Dn: PASS|FAIL` in `machine-docs/REVIEW-eval.md`. When
+every gate in every phase has a fresh PASS and there is no `## VETO`, the Builder writes `## DONE` to
+`machine-docs/STATUS-eval.md` (this is the last phase → the harness then marks the sequence complete).
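Note on `plans/calc/eval.md`: gates D2 and D3 are the easiest to get subtly wrong, so here is a minimal sketch of one way to satisfy them. The plan leaves the AST shape to the Builder; the tuple shape used here (`("num", v)`, `("neg", operand)`, `("bin", op, left, right)`) and the `format_result` helper are assumptions made for illustration only.

```python
class EvalError(Exception):
    """Raised for evaluation failures such as division by zero."""


def evaluate(node):
    """Walk a tuple-shaped AST (hypothetical shape) and return an int or float."""
    tag = node[0]
    if tag == "num":
        return node[1]
    if tag == "neg":
        return -evaluate(node[1])
    if tag == "bin":
        _, op, left, right = node
        a, b = evaluate(left), evaluate(right)
        if op == "+":
            return a + b
        if op == "-":
            return a - b
        if op == "*":
            return a * b
        if op == "/":
            if b == 0:
                # D2: surface EvalError instead of a bare ZeroDivisionError.
                raise EvalError("division by zero")
            return a / b  # true division
    raise EvalError(f"unknown node: {node!r}")


def format_result(value):
    """D3: whole-valued floats print without a trailing '.0'."""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


tree = ("bin", "+", ("num", 2), ("bin", "*", ("num", 3), ("num", 4)))
print(format_result(evaluate(tree)))                               # 2+3*4
print(format_result(evaluate(("bin", "/", ("num", 4), ("num", 2)))))  # 4/2
print(format_result(evaluate(("bin", "/", ("num", 7), ("num", 2)))))  # 7/2
```

Keeping the formatting rule in one helper, separate from `evaluate`, lets the API return plain numbers while the CLI alone decides how they print.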
--- a/plans/calc/lex.md
+++ b/plans/calc/lex.md
@@ -0,0 +1,34 @@
+# Phase `lex` — tokenizer
+
+**Mission.** Start a Python arithmetic calculator. In this phase build the **lexer**: `calc/lexer.py`
+exposing `tokenize(src: str) -> list[Token]`, plus a `unittest` suite. Pure stdlib. This file is the
+single source of truth for the phase. (Later phases add the parser and evaluator — design the Token
+type so they can consume it.)
+
+A `Token` has at least a `kind` and a `value`. Kinds: `NUMBER`, `PLUS`, `MINUS`, `STAR`, `SLASH`,
+`LPAREN`, `RPAREN`, and `EOF` as the final token.
+
+## Definition of Done (each Dn is a gate: Builder claims it, Adversary cold-verifies)
+
+- **D1 — numbers.** Integers (`42`) and floats (`3.14`, `.5`, `10.`) tokenize to one `NUMBER` token
+  whose value is the numeric value (int or float). `tokenize("42")` → `[NUMBER(42), EOF]`.
+- **D2 — operators & parens.** `+ - * / ( )` each tokenize to the right kind; `tokenize("1+2*3")`
+  yields `NUMBER PLUS NUMBER STAR NUMBER EOF`.
+- **D3 — whitespace & errors.** Spaces/tabs between tokens are skipped; an invalid character (e.g.
+  `@`, `$`, a letter) raises `LexError` (define it in the module) with the offending character and
+  its position in the message.
+- **D4 — tests green.** `calc/test_lexer.py` (`unittest`) passes under `python -m unittest`, 0
+  failures, covering D1–D3 including: `"  12  +  3 "`, `"3.5*(1-2)"`, and that `"1 @ 2"` raises
+  `LexError`.
+
+## Verify (cold)
+
+```bash
+python -m unittest -q                              # D4
+python -c "from calc.lexer import tokenize; print([(t.kind,t.value) for t in tokenize('3.5*(1-2)')])"
+python -c "from calc.lexer import tokenize; tokenize('1 @ 2')"   # must raise LexError
+```
+
+The Builder restates the exact commands + expected token lists + commit sha in
+`machine-docs/STATUS-lex.md`; the Adversary re-runs from its own clone and records
+`lex/Dn: PASS|FAIL` in `machine-docs/REVIEW-lex.md`.
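Note on `plans/calc/lex.md`: as a rough illustration of gates D1 to D3, a hand-written scanner along the following lines would produce the token stream the plan describes. This is a sketch under assumptions, not the Builder's implementation: the `Token` dataclass layout, the choice to store the operator character as `value`, and the error wording are all hypothetical.

```python
from dataclasses import dataclass


class LexError(Exception):
    """Raised when the input contains a character the lexer cannot tokenize."""


@dataclass(frozen=True)
class Token:
    kind: str
    value: object = None


# Single-character tokens and their kinds (kind names taken from the plan).
_SINGLE = {"+": "PLUS", "-": "MINUS", "*": "STAR", "/": "SLASH",
           "(": "LPAREN", ")": "RPAREN"}
_DIGITS = "0123456789"


def tokenize(src: str) -> list[Token]:
    tokens = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch in " \t":
            i += 1  # D3: skip spaces and tabs
        elif ch in _SINGLE:
            tokens.append(Token(_SINGLE[ch], ch))
            i += 1
        elif ch in _DIGITS or ch == ".":
            start = i
            while i < len(src) and (src[i] in _DIGITS or src[i] == "."):
                i += 1
            text = src[start:i]
            if text == "." or text.count(".") > 1:
                raise LexError(f"malformed number {text!r} at position {start}")
            # D1: the value is numeric, int or float depending on the literal.
            tokens.append(Token("NUMBER", float(text) if "." in text else int(text)))
        else:
            # D3: report the offending character and its position.
            raise LexError(f"unexpected character {ch!r} at position {i}")
    tokens.append(Token("EOF"))
    return tokens


print([(t.kind, t.value) for t in tokenize("3.5*(1-2)")])
```

Forms such as `.5` and `10.` are accepted because Python's `float()` parses both; a stricter lexer would need its own rule for them.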
--- a/plans/calc/parse.md
+++ b/plans/calc/parse.md
@@ -0,0 +1,34 @@
+# Phase `parse` — recursive-descent parser
+
+**Mission.** Build `calc/parser.py` exposing `parse(tokens) -> Node` (consuming the `lex` phase's
+tokens) that produces an **AST** with correct arithmetic precedence and associativity, plus a
+`unittest` suite. This file is the single source of truth for the phase. Do NOT evaluate yet; just
+build the tree (the `eval` phase consumes it). Represent nodes however you like (e.g. `Num(value)`,
+`BinOp(op, left, right)`, `Unary(op, operand)`), but expose a stable, documented shape the evaluator can walk.
+
+## Definition of Done (each Dn is a gate)
+
+- **D1 — precedence.** `*` and `/` bind tighter than `+` and `-`: `1+2*3` parses as `1+(2*3)`, not
+  `(1+2)*3`.
+- **D2 — left associativity.** Same-precedence operators associate left: `8-3-2` parses as
+  `(8-3)-2`; `8/4/2` as `(8/4)/2`.
+- **D3 — parentheses.** Parens override precedence: `(1+2)*3` parses with the `+` under the `*`.
+- **D4 — unary minus.** Leading and nested unary minus parses: `-5`, `-(1+2)`, `3 * -2`.
+- **D5 — errors.** Malformed input raises `ParseError` (define it): `"1 +"`, `"(1"`, `"1 2"`, `")("`,
+  and the empty string each raise (not crash with a different exception).
+- **D6 — tests green.** `calc/test_parser.py` (`unittest`) passes under `python -m unittest`, 0
+  failures, covering D1–D5. Assert on tree structure (e.g. a `repr`/shape helper), not on evaluation.
+
+## Verify (cold)
+
+```bash
+python -m unittest -q                              # D6
+# D1/D3 differ in structure — the Builder's STATUS gives the exact shape assertion to re-run:
+python -c "from calc.lexer import tokenize; from calc.parser import parse; print(parse(tokenize('1+2*3')))"
+python -c "from calc.lexer import tokenize; from calc.parser import parse; parse(tokenize('1 +'))"  # ParseError
+```
+
+The Builder documents the AST shape + exact assertions in `machine-docs/STATUS-parse.md`; the
+Adversary cold-verifies and records `parse/Dn: PASS|FAIL` in `machine-docs/REVIEW-parse.md`. Watch
+especially for a precedence/associativity bug that still passes a weak test — re-derive the expected
+tree yourself from the plan.
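Note on `plans/calc/parse.md`: the precedence and associativity gates fall out of the grammar layering in a recursive-descent parser, which the sketch below tries to make concrete. It assumes tokens arrive as `(kind, value)` pairs ending in `("EOF", None)` and returns plain tuples as nodes; both shapes are hypothetical stand-ins for whatever the Builder documents.

```python
class ParseError(Exception):
    """Raised for malformed token streams."""


def parse(tokens):
    """Grammar: expr := term (('+'|'-') term)*
                term := unary (('*'|'/') unary)*
                unary := '-' unary | primary
                primary := NUMBER | '(' expr ')'"""
    pos = 0

    def peek():
        return tokens[pos][0]

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():
        node = term()
        # Looping (rather than recursing on the right) gives left associativity.
        while peek() in ("PLUS", "MINUS"):
            op = advance()[1]
            node = ("bin", op, node, term())
        return node

    def term():
        node = unary()
        # '*' and '/' sit one level below '+' and '-', so they bind tighter.
        while peek() in ("STAR", "SLASH"):
            op = advance()[1]
            node = ("bin", op, node, unary())
        return node

    def unary():
        if peek() == "MINUS":
            advance()
            return ("neg", unary())
        return primary()

    def primary():
        kind, value = tokens[pos]
        if kind == "NUMBER":
            advance()
            return ("num", value)
        if kind == "LPAREN":
            advance()
            node = expr()
            if peek() != "RPAREN":
                raise ParseError(f"expected ')' at token {pos}, got {peek()}")
            advance()
            return node
        raise ParseError(f"unexpected {kind} at token {pos}")

    tree = expr()
    if peek() != "EOF":
        raise ParseError(f"unexpected {peek()} at token {pos}")
    return tree


# 8-3-2 should nest as (8-3)-2.
toks = [("NUMBER", 8), ("MINUS", "-"), ("NUMBER", 3),
        ("MINUS", "-"), ("NUMBER", 2), ("EOF", None)]
print(parse(toks))
```

An Adversary re-deriving the expected tree can compare against the nesting directly: for `8-3-2` the inner `bin` node must be the left child of the outer one.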
--- a/run-harness-bench.sh
+++ b/run-harness-bench.sh
@@ -0,0 +1,217 @@
+#!/usr/bin/env bash
+# run-harness-bench.sh — FULL harness benchmark (real agents.py up loop, not headless single-pass).
+#
+# For each variant (builder-adversary, builder-adversary-min) this stands up a real Builder/Adversary
+# loop pair + watchdog over a shared work repo and lets them run autonomously through the multi-phase
+# calculator (plans/calc/{lex,parse,eval}.md) to SEQUENCE-COMPLETE. Both loops on Sonnet. Then it
+# clocks the tokens each loop used (summed from the Claude Code session transcripts) and re-runs the
+# final Definition-of-Done itself.
+#
+# Long, autonomous, nondeterministic (N=1). Per-variant wall-clock timeout below. Usage: ./run-harness-bench.sh
+set -u
+
+BENCH_DIR="$(cd "$(dirname "$0")" && pwd)"
+ENGINE="$BENCH_DIR/engine"
+PLANS="$BENCH_DIR/plans/calc"
+AGENTS_PY="$ENGINE/agents.py"
+MODEL="claude-sonnet-4-6"
+RUNROOT="$(mktemp -d /tmp/ao-harness-XXXXXX)"   # no dot → clean transcript-dir mapping
+RESULTS="$BENCH_DIR/RESULTS-harness.md"
+TIMEOUT="${BENCH_TIMEOUT:-3000}"                # seconds per variant
+POLL=60
+GIT_ID=(-c user.email=bench@example.com -c user.name=bench)
+
+log() { echo "[$(date -u +%H:%M:%S)] $*"; }
+
+# transcript token sum for an agent's working dir -> "in out cache_create cache_read"
+collect_tokens() {
+  python3 - "$1" <<'PY'
+import json,sys,os,glob
+wd=sys.argv[1].rstrip('/')
+name=wd.replace('/','-').replace('.','-')        # '/tmp/x' -> '-tmp-x'
+tdir=os.path.expanduser("~/.claude/projects/"+name)
+ti=to=tcc=tcr=0
+for f in glob.glob(tdir+"/*.jsonl"):
+    for line in open(f, errors="ignore"):
+        try: o=json.loads(line)
+        except Exception: continue
+        if o.get("type")=="assistant":
+            u=(o.get("message",{}) or {}).get("usage",{}) or {}
+            ti+=u.get("input_tokens",0) or 0; to+=u.get("output_tokens",0) or 0
+            tcc+=u.get("cache_creation_input_tokens",0) or 0; tcr+=u.get("cache_read_input_tokens",0) or 0
+print(ti,to,tcc,tcr)
+PY
+}
+
+gen_config() {  # <variant> <run> <prefix>
+  local v="$1" run="$2" prefix="$3"
+  cat > "$run/agents.toml" <<EOF
+[watchdog]
+signal_interval = 15
+heavy_interval  = 60
+limit_probe_fallback = 300
+limit_reset_slack = 45
+stall_grace = 180
+
+[defaults]
+session_prefix = "$prefix"
+log_dir = "$run/.ao-state"
+backend = "claude"
+model = "$MODEL"
+watch = "heal"
+
+[backend.claude]
+bin = "claude"
+flags = "--dangerously-skip-permissions"
+remote_control = true
+supports_resume = true
+prompt_delivery = "arg"
+process_name = "claude"
+submit_key = "Enter"
+stall_idle = 300
+active_re = "esc to interrupt|Running tool|⠇|⠙|· \\\\d+"
+limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
+fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
+
+[[agent]]
+name = "builder"
+kind = "loop"
+role = "builder"
+dir = "$run/work"
+watch = "heal+stall"
+
+[[agent]]
+name = "adversary"
+session = "${prefix}adv"
+kind = "loop"
+role = "adversary"
+dir = "$run/work-adv"
+watch = "heal+stall"
+
+[loop]
+state_file = "phase-idx"
+resume_phase = true
+auto_advance = true
+done_marker = "## DONE"
+kickoff_template = "$ENGINE/examples/$v/prompts/kickoff.md"
+roles_dir = "$ENGINE/examples/$v/prompts"
+handoff = { repo = "$run/work", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
+phases = [
+  { id = "lex",   plan = "$PLANS/lex.md",   status = "STATUS-lex.md" },
+  { id = "parse", plan = "$PLANS/parse.md", status = "STATUS-parse.md" },
+  { id = "eval",  plan = "$PLANS/eval.md",  status = "STATUS-eval.md" },
+]
+EOF
+}
+
+declare -A SUM_TOK SUM_OK SUM_PHASES
+
+run_variant() {
+  local v="$1"
+  local run="$RUNROOT/$v"
+  local rtag; rtag=$(basename "$RUNROOT"); rtag=${rtag##*-}        # mktemp suffix → unique per invocation
+  local prefix="b${rtag}$(echo "$v" | cksum | cut -c1-2)-"          # unique per (run, variant); avoids tmux collisions
+  mkdir -p "$run"
+  log "===== $v  (run dir: $run, prefix: $prefix) ====="
+
+  # shared bare 'origin' + two clones
+  git "${GIT_ID[@]}" init -q --bare "$run/origin.git"
+  git -C "$run/origin.git" symbolic-ref HEAD refs/heads/main   # so clones check out 'main' (we push main, not master)
+  git "${GIT_ID[@]}" init -q -b main "$run/seed"
+  ( cd "$run/seed" && mkdir -p machine-docs && echo "# calc work repo" > README.md \
+      && : > machine-docs/.gitkeep \
+      && git "${GIT_ID[@]}" add -A && git "${GIT_ID[@]}" commit -q -m "chore: seed work repo" \
+      && git "${GIT_ID[@]}" remote add origin "$run/origin.git" && git "${GIT_ID[@]}" push -q -u origin main )
+  git "${GIT_ID[@]}" clone -q "$run/origin.git" "$run/work"
+  git "${GIT_ID[@]}" clone -q "$run/origin.git" "$run/work-adv"
+  for c in work work-adv; do ( cd "$run/$c" && git config user.email bench@example.com && git config user.name bench ); done
+
+  gen_config "$v" "$run" "$prefix"
+
+  log "[$v] agents.py up …"
+  python3 "$AGENTS_PY" up --config "$run/agents.toml" >"$run/up.log" 2>&1
+  log "[$v] up done; status:"; python3 "$AGENTS_PY" status --config "$run/agents.toml" 2>&1 | sed 's/^/    /'
+
+  # poll for SEQUENCE-COMPLETE or timeout
+  local marker="$run/.ao-state/SEQUENCE-COMPLETE" t=0 done="no"
+  while [ $t -lt "$TIMEOUT" ]; do
+    if [ -f "$marker" ]; then done="yes"; break; fi
+    sleep "$POLL"; t=$((t+POLL))
+    local idx commits
+    idx="$(cat "$run/.ao-state/state/phase-idx" 2>/dev/null || echo '?')"
+    commits="$(git -C "$run/origin.git" rev-list --count main 2>/dev/null || echo '?')"
+    log "[$v] t=${t}s phase-idx=$idx origin-commits=$commits"
+  done
+  log "[$v] loop finished (sequence-complete=$done after ${t}s); tearing down"
+  python3 "$AGENTS_PY" down --config "$run/agents.toml" >"$run/down.log" 2>&1
+
+  # final DoD check from the adversary's clone (pull latest first)
+  ( cd "$run/work-adv" && git "${GIT_ID[@]}" pull -q --no-rebase origin main 2>/dev/null )
+  local tests=no cli=no
+  ( cd "$run/work-adv" && python3 -m unittest -q ) >"$run/final-unittest.txt" 2>&1 && tests=yes
+  local out; out="$( cd "$run/work-adv" && python3 calc.py '2+3*4' 2>/dev/null )"
+  [ "$out" = "14" ] && cli=yes
+  local reviews; reviews="$(grep -rhoiE '(lex|parse|eval)/D[0-9]+:?\s*PASS' "$run/work-adv/machine-docs/" 2>/dev/null | sort -u | wc -l)"
+  local veto="no"; grep -rqE '^## VETO' "$run/work-adv/machine-docs/" 2>/dev/null && veto=yes   # match the '## VETO' heading named in the plans, not any mention of the word
+  SUM_PHASES[$v]="$(cat "$run/.ao-state/state/phase-idx" 2>/dev/null || echo '?')"
+
+  local success=NO
+  [ "$done" = yes ] && [ "$tests" = yes ] && [ "$cli" = yes ] && [ "$veto" = no ] && success=YES
+  SUM_OK[$v]=$success
+
+  # tokens
+  read -r bi bo bcc bcr <<<"$(collect_tokens "$run/work")"
+  read -r ai ao acc acr <<<"$(collect_tokens "$run/work-adv")"
+  local btok=$((bi+bo+bcc+bcr)) atok=$((ai+ao+acc+acr)) vtok=$(( bi+bo+bcc+bcr + ai+ao+acc+acr ))
+  SUM_TOK[$v]=$vtok
+
+  {
+    echo "### $v"
+    echo "- **success:** $success  (sequence-complete=$done, tests=$tests, cli('2+3*4'→'$out')=$cli, gates-passed=$reviews, veto=$veto, final phase-idx=${SUM_PHASES[$v]})"
+    echo "- **builder loop:** in=$bi out=$bo cache_create=$bcc cache_read=$bcr → **${btok}** tok"
+    echo "- **adversary loop:** in=$ai out=$ao cache_create=$acc cache_read=$acr → **${atok}** tok"
+    echo "- **total:** **${vtok}** tokens"
+    echo
+  } >>"$RESULTS.tmp"
+  log "[$v] DONE success=$success tokens=$vtok gates-passed=$reviews"
+}
+
+prompt_chars() { cat "$ENGINE/examples/$1/prompts/kickoff.md" "$ENGINE/examples/$1/prompts/$2.md" | wc -c | tr -d ' '; }
+
+: >"$RESULTS.tmp"
+run_variant builder-adversary
+run_variant builder-adversary-min
+
+{
+  echo "# Full-harness benchmark — original vs minimal prompts"
+  echo
+  echo "Real \`agents.py up\` Builder/Adversary loop pair + watchdog, run autonomously through the"
+  echo "multi-phase calculator (\`plans/calc/{lex,parse,eval}.md\`) to SEQUENCE-COMPLETE. Engine pinned"
+  echo "at \`$(git -C "$ENGINE" rev-parse --short HEAD)\`. Both loops on **$MODEL**. Per-variant timeout"
+  echo "${TIMEOUT}s. Tokens summed from the Claude Code session transcripts of each loop's clone."
+  echo
+  echo "## Static prompt size (chars: kickoff + role)"
+  echo "| version | builder | adversary |"
+  echo "|---|--:|--:|"
+  echo "| builder-adversary (orig) | $(prompt_chars builder-adversary builder) | $(prompt_chars builder-adversary adversary) |"
+  echo "| builder-adversary-min    | $(prompt_chars builder-adversary-min builder) | $(prompt_chars builder-adversary-min adversary) |"
+  echo
+  echo "## Per-variant"
+  echo
+  cat "$RESULTS.tmp"
+  echo "## Summary"
+  echo "| version | success | total tokens |"
+  echo "|---|:--:|--:|"
+  echo "| builder-adversary (orig) | ${SUM_OK[builder-adversary]:-?} | ${SUM_TOK[builder-adversary]:-?} |"
+  echo "| builder-adversary-min    | ${SUM_OK[builder-adversary-min]:-?} | ${SUM_TOK[builder-adversary-min]:-?} |"
+  echo
+  echo "_N=1 per variant; the autonomous loop is nondeterministic (number of review rounds varies)._"
+  echo "_Run dirs: \`$RUNROOT\`_"
+} >"$RESULTS"
+rm -f "$RESULTS.tmp"
+
+echo; echo "===== ALL DONE ====="
+echo "orig: success=${SUM_OK[builder-adversary]:-?} tokens=${SUM_TOK[builder-adversary]:-?}"
+echo "min : success=${SUM_OK[builder-adversary-min]:-?} tokens=${SUM_TOK[builder-adversary-min]:-?}"
+echo "Results: $RESULTS"
+echo "Run dirs: $RUNROOT"
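Note on `run-harness-bench.sh`: `collect_tokens` locates each loop's transcripts by flattening the working directory path into a single directory name under `~/.claude/projects/`. The sketch below isolates that mapping exactly as the script writes it; that Claude Code stores transcripts under this naming is the script's assumption, and the function name `transcript_dir` is illustrative. Because both `/` and `.` become `-`, two different paths can flatten to the same name, which is one plausible reading of the "no dot → clean transcript-dir mapping" comment on `RUNROOT`.

```python
import os


def transcript_dir(workdir: str) -> str:
    """Mirror the path flattening used inside collect_tokens."""
    name = workdir.rstrip("/").replace("/", "-").replace(".", "-")
    return os.path.expanduser("~/.claude/projects/" + name)


# The example from the script's own comment: '/tmp/x' -> '-tmp-x'.
print(os.path.basename(transcript_dir("/tmp/x")))

# A dotted path and a dashed path flatten to the same directory name,
# so their transcripts would be indistinguishable after the mapping.
dotted = transcript_dir("/tmp/ao-benchmark.CwQFWF/work")
dashed = transcript_dir("/tmp/ao-benchmark-CwQFWF/work")
print(dotted == dashed)
```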