diff --git a/FINDINGS.md b/FINDINGS.md
index 43ac686..23914e1 100644
--- a/FINDINGS.md
+++ b/FINDINGS.md
@@ -7,7 +7,7 @@ loop, on a fixed, well-specified task.
   cold-verifiable Definition-of-Done gates. Deliberately offline and deterministic so it stresses the
   *protocol*, not infrastructure.
 - **How:** each variant run autonomously to `SEQUENCE-COMPLETE` via the real harness (`agents.py up`
-  — Builder + Adversary loop pair + watchdog), **5 runs each** (N=5; `deferred` N=3, finalizing).
+  — Builder + Adversary loop pair + watchdog), **5 runs each** (N=5).
   Both loops on **claude-sonnet-4-6**. Tokens summed from each loop's Claude Code session transcripts.
   The deliverable is behaviorally identical across all variants (verified on a 24-expression probe),
   so this compares like-for-like.
@@ -28,7 +28,7 @@ loop, on a fixed, well-specified task.
 (stateless/lean/deferred are all built on the *full original* prompts, so they isolate their one
 change without the minimal-prompt confound.)

-## Headline results — median tokens (5 runs each; deferred N=3)
+## Headline results — median tokens (5 runs each)

 | variant | adversary verifies… | median tokens | vs orig | commits | LOC |
 |---|---|--:|--:|--:|--:|
@@ -36,7 +36,7 @@ change without the minimal-prompt confound.)
 | **min** | per phase *(minimal prompts)* | 9.77M | −25% | 15 | 367 |
 | **stateless** | per phase *(+context hygiene)* | 10.12M | −22% | 14 | 400 |
 | **orig** | per **phase** | 13.04M | — | 14 | 449 |
-| **deferred** | once, after **whole build** | 13.37M | +3% | 10 | 425 |
+| **deferred** | once, after **whole build** | 12.89M | −1% | 12 | 425 |
 | **lean** | per **gate** | 13.41M | +3% | 28 | 390 |

 ## The two big findings
@@ -116,5 +116,5 @@ process intensity.
 _Artifacts in this repo: `FINDINGS.md` (this summary), `RESULTS-campaign.md` (per-variant stats + ratios +
 full raw table), `RESULTS-campaign.md.data` (raw rows), `analyze.py` (regenerates the analysis),
 `run-harness-bench.sh` / `run-solo-bench.sh` (the runners), `plans/calc/` (the task), and
-the six `engine/examples/builder-adversary*` variants. `deferred` is N=3 here and is being finalized
-to N=5; its median is stable (spread 1.19×)._
+the six `engine/examples/builder-adversary*` variants. All variants are N=5 (two `deferred` and the
+`min`/`lean` wedge/limit failures excluded; see RESULTS for the raw rows)._
diff --git a/RESULTS-campaign.md b/RESULTS-campaign.md
index 77d8475..dcdf226 100644
--- a/RESULTS-campaign.md
+++ b/RESULTS-campaign.md
@@ -1,6 +1,6 @@
 # Full-harness benchmark — campaign analysis

-Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops on Sonnet. Tokens summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank `calc/*.py` lines (code + tests). 28 successful runs of 30 total.
+Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops on Sonnet. Tokens summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank `calc/*.py` lines (code + tests). 30 successful runs of 33 total.
 ## Per-variant total tokens (successful runs)

@@ -11,7 +11,7 @@ Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase c
 | builder-adversary-stateless | 5/5 | 10,122,375 | 10,735,401 | 9,992,834 | 13,009,792 | 1.30x |
 | builder-adversary-lean | 5/6 | 13,409,349 | 13,216,582 | 12,101,355 | 13,815,595 | 1.14x |
 | builder-solo | 5/5 | 2,773,634 | 2,744,840 | 2,417,528 | 2,948,467 | 1.22x |
-| builder-adversary-deferred | 3/3 | 13,366,800 | 13,863,041 | 12,888,082 | 15,334,242 | 1.19x |
+| builder-adversary-deferred | 5/6 | 12,888,082 | 12,410,637 | 9,610,923 | 15,334,242 | 1.60x |

 ## Efficiency ratios — min / median / max (successful runs)

@@ -24,8 +24,8 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration
 | builder-adversary-stateless | 21,802 / 27,799 / 30,861 | 10,544 / 11,620 / 12,755 | 697,172 / 765,282 / 778,644 |
 | builder-adversary-lean | 28,077 / 33,238 / 38,966 | 12,416 / 13,523 / 14,403 | 432,191 / 478,905 / 575,650 |
 | builder-solo | 6,029 / 6,611 / 6,969 | 6,542 / 6,715 / 7,020 | 392,494 / 483,506 / 737,117 |
-| builder-adversary-deferred | 31,451 / 33,827 / 34,537 | 12,170 / 13,105 / 15,343 | 1,277,854 / 1,336,680 / 1,841,155 |
-| **all** | 6,029 / 29,024 / 38,966 | 6,542 / 12,170 / 15,814 | 392,494 / 705,228 / 1,841,155 |
+| builder-adversary-deferred | 20,107 / 31,451 / 34,537 | 12,170 / 13,349 / 15,343 | 739,302 / 1,277,854 / 1,841,155 |
+| **all** | 6,029 / 28,410 / 38,966 | 6,542 / 12,416 / 15,814 | 392,494 / 716,723 / 1,841,155 |

 ## Per-variant medians (commits / LOC / duration)

@@ -36,15 +36,15 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration
 | builder-adversary-stateless | 14 | 400 | 900 |
 | builder-adversary-lean | 28 | 390 | 960 |
 | builder-solo | 5 | 426 | 420 |
-| builder-adversary-deferred | 10 | 425 | 1020 |
+| builder-adversary-deferred | 12 | 425 | 840 |

-## Correlations with total tokens (pooled, n=28)
+## Correlations with total tokens (pooled, n=30)
 | tokens vs | Pearson r |
 |---|--:|
 | duration | +0.83 |
-| commits | +0.65 |
-| LOC | +0.01 |
+| commits | +0.64 |
+| LOC | -0.00 |

 ## All runs (raw)

@@ -80,5 +80,8 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration
 | builder-adversary-deferred | 1 | YES | | 13,366,800 | 1020 | 10 | 425 | 31,451 | 13,105 | 1,336,680 |
 | builder-adversary-deferred | 2 | YES | | 12,888,082 | 840 | 7 | 381 | 33,827 | 15,343 | 1,841,155 |
 | builder-adversary-deferred | 3 | YES | | 15,334,242 | 1260 | 12 | 444 | 34,537 | 12,170 | 1,277,854 |
+| builder-adversary-deferred | 4 | NO | | 8,536,729 | 2700 | 6 | 146 | 58,471 | 3,162 | 1,422,788 |
+| builder-adversary-deferred | 5 | YES | | 9,610,923 | 720 | 13 | 478 | 20,107 | 13,349 | 739,302 |
+| builder-adversary-deferred | 1 | YES | | 10,853,138 | 780 | 12 | 421 | 25,779 | 13,914 | 904,428 |

 _Stats over successful runs. `LIMIT` = the run hit a usage-limit pause (duration/tok-sec distorted, token total fine). Repos kept under the run root for analysis._
diff --git a/RESULTS-campaign.md.data b/RESULTS-campaign.md.data
index 9beb80e..15fc486 100644
--- a/RESULTS-campaign.md.data
+++ b/RESULTS-campaign.md.data
@@ -29,3 +29,5 @@ builder-adversary-deferred 1 YES 7375559 5991241 13366800 1020 10 425
 builder-adversary-deferred 2 YES 7803666 5084416 12888082 840 7 381
 builder-adversary-deferred 3 YES 8034336 7299906 15334242 1260 12 444
 builder-adversary-deferred 4 NO 4589433 3947296 8536729 2700 6 146
+builder-adversary-deferred 5 YES 5294341 4316582 9610923 720 13 478
+builder-adversary-deferred 1 YES 5968601 4884537 10853138 780 12 421
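Reviewer note: the updated `deferred` summary row can be cross-checked against the six raw rows in the `.data` hunk. The sketch below is not the repo's `analyze.py`; it is a minimal standalone recomputation. The column layout (two per-loop token counts, then total, duration in seconds, commits, LOC) is an assumption inferred from the values, and the names `parse_rows` and `summarize` are hypothetical.

```python
from statistics import mean, median

# Rows copied verbatim from the RESULTS-campaign.md.data hunk.
# ASSUMED column layout (the diff does not name the columns):
# variant, run, ok, loop_a_tokens, loop_b_tokens, total_tokens, duration_s, commits, loc
RAW_ROWS = """\
builder-adversary-deferred 1 YES 7375559 5991241 13366800 1020 10 425
builder-adversary-deferred 2 YES 7803666 5084416 12888082 840 7 381
builder-adversary-deferred 3 YES 8034336 7299906 15334242 1260 12 444
builder-adversary-deferred 4 NO 4589433 3947296 8536729 2700 6 146
builder-adversary-deferred 5 YES 5294341 4316582 9610923 720 13 478
builder-adversary-deferred 1 YES 5968601 4884537 10853138 780 12 421
"""


def parse_rows(raw):
    """Parse whitespace-separated rows into dicts (hypothetical helper)."""
    rows = []
    for line in raw.strip().splitlines():
        variant, run, ok, loop_a, loop_b, total, duration, commits, loc = line.split()
        row = {
            "variant": variant,
            "run": int(run),
            "ok": ok == "YES",
            "total": int(total),
            "duration": int(duration),
            "commits": int(commits),
            "loc": int(loc),
        }
        # Sanity check: the two per-loop counts should add up to the total.
        assert int(loop_a) + int(loop_b) == row["total"], line
        rows.append(row)
    return rows


def summarize(rows):
    """Summary statistics over successful runs only (hypothetical helper)."""
    ok_rows = [r for r in rows if r["ok"]]
    totals = [r["total"] for r in ok_rows]
    return {
        "successful": f"{len(ok_rows)}/{len(rows)}",
        "median_tokens": median(totals),
        "mean_tokens": round(mean(totals)),
        "min_tokens": min(totals),
        "max_tokens": max(totals),
        "spread": round(max(totals) / min(totals), 2),  # max / min
        "median_commits": median(r["commits"] for r in ok_rows),
        "median_loc": median(r["loc"] for r in ok_rows),
        "median_duration": median(r["duration"] for r in ok_rows),
    }


summary = summarize(parse_rows(RAW_ROWS))
for key, value in summary.items():
    print(f"{key}: {value}")
```

Under these assumptions the recomputed values agree with the new `deferred` row: 5/6 successful, median 12,888,082, mean 12,410,637, spread 1.60x, and medians of 12 commits, 425 LOC, and 840 seconds.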