Commit Graph

7 Commits

Author SHA1 Message Date
b46dca003c results: 4-way + the variance finding (N=1 is not enough)
Combined run 1 (orig/min/stateless) + run 2 (lean/stateless). Key result: the
SAME stateless variant used 5.96M tokens in run 1 and 9.27M in run 2 (±55%) —
nondeterministic iteration count dominates every between-variant gap. So:
- prose minimization ~-6% (small, same-invocation)
- lean (full per-gate review) ~= stateless (batched): full review is ~free
- the earlier "-45% from context hygiene" is NOT reproducible — mostly noise
Honest conclusion: need >=5 runs/variant to resolve the context-hygiene effect;
log_tokens now makes that easy to collect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 22:06:21 +00:00
cca5c895b2 feat: add builder-adversary-lean variant; runner takes variant args
- bump engine submodule to e0425e6 (adds builder-adversary-lean: context hygiene
  + enforced per-gate review)
- run-harness-bench.sh: accept variant names as CLI args to run a subset
2026-06-14 21:43:11 +00:00
0fa3d726a5 results: full-harness 3-way (orig/min/stateless) on the calculator
All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full
adversary verification (calc correct, tests OK). Tokens:
  orig       10,756,363
  min        10,119,225  (-5.9%)
  stateless   5,964,829  (-44.5%)
Prompt-prose minimization barely moved tokens; context hygiene (stateless)
nearly halved them, driven by ~48% lower cache_read. Quality held.

Also fix the runner's success check: it grepped the word "veto" and matched
"No veto" → false failures; now matches the "## VETO" marker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 21:35:47 +00:00
a1b59e1bc5 feat: add stateless variant, pre-trust work dirs, loop over 3 variants
- bump engine submodule to 985d33d (adds builder-adversary-stateless example)
- run-harness-bench.sh: pre-trust each work dir in ~/.claude.json so interactive
  claude (tmux) skips the workspace-trust dialog (--dangerously-skip-permissions
  only skips it for redirected/headless output); benchmark all three variants
- (fixes from this session: bare repo default branch → main; unique session
  prefix per run)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:52:29 +00:00
11eda4a8b1 chore: gitignore the runner's transient .tmp file 2026-06-14 20:40:26 +00:00
8c3f38dbf4 feat: multi-phase calculator problem + full-harness benchmark runner
- plans/calc/{lex,parse,eval}.md: a 3-phase calculator with multiple gates per
  phase (tokenizer → recursive-descent parser → evaluator+CLI), rich adversarial
  edge cases (precedence/associativity/unary/div-zero)
- run-harness-bench.sh: stands up a real agents.py up Builder/Adversary loop pair
  + watchdog over a shared work repo per variant, runs to SEQUENCE-COMPLETE, and
  clocks tokens from the session transcripts (AI-as-adversary kept intact)
- RESULTS.md: baseline single-pass roman-numeral run (prompt size had ~0 token
  effect; cache-read of the working context dominates)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:40:14 +00:00
27df2c7b55 feat: agent-orchestrator-benchmark — prompt token comparison harness
A standalone repo (engine vendored as a submodule at the examples commit) that
runs a head-to-head between the builder-adversary and builder-adversary-min
example variants: same task, independent headless runs, both on Sonnet, with
token counts. Includes the roman-numeral test problem and run-bench.sh.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:20:05 +00:00