diff --git a/RESULTS-campaign.md b/RESULTS-campaign.md
index 8849974..d4557d1 100644
--- a/RESULTS-campaign.md
+++ b/RESULTS-campaign.md
@@ -1,6 +1,6 @@
 # Full-harness benchmark — campaign analysis
 
-Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops on Sonnet. Tokens summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank `calc/*.py` lines (code + tests). 20 successful runs of 22 total.
+Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops on Sonnet. Tokens summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank `calc/*.py` lines (code + tests). 25 successful runs of 27 total.
 
 ## Per-variant total tokens (successful runs)
 
@@ -10,6 +10,7 @@ Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase c
 | builder-adversary-min | 5/6 | 9,772,581 | 9,917,040 | 9,135,718 | 11,386,415 | 1.25x |
 | builder-adversary-stateless | 5/5 | 10,122,375 | 10,735,401 | 9,992,834 | 13,009,792 | 1.30x |
 | builder-adversary-lean | 5/6 | 13,409,349 | 13,216,582 | 12,101,355 | 13,815,595 | 1.14x |
+| builder-solo | 5/5 | 2,773,634 | 2,744,840 | 2,417,528 | 2,948,467 | 1.22x |
 
 ## Efficiency ratios — min / median / max (successful runs)
 
@@ -21,7 +22,8 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration
 | builder-adversary-min | 23,527 / 25,252 / 32,814 | 8,173 / 14,807 / 15,814 | 582,292 / 669,789 / 712,415 |
 | builder-adversary-stateless | 21,802 / 27,799 / 30,861 | 10,544 / 11,620 / 12,755 | 697,172 / 765,282 / 778,644 |
 | builder-adversary-lean | 28,077 / 33,238 / 38,966 | 12,416 / 13,523 / 14,403 | 432,191 / 478,905 / 575,650 |
-| **all** | 21,802 / 29,518 / 38,966 | 8,083 / 12,511 / 15,814 | 432,191 / 705,228 / 1,044,586 |
+| builder-solo | 6,029 / 6,611 / 6,969 | 6,542 / 6,715 / 7,020 | 392,494 / 483,506 / 737,117 |
+| **all** | 6,029 / 28,077 / 38,966 | 6,542 / 11,712 / 15,814 | 392,494 / 697,172 / 1,044,586 |
 
 ## Per-variant medians (commits / LOC / duration)
 
@@ -31,14 +33,15 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration
 | builder-adversary-min | 15 | 367 | 720 |
 | builder-adversary-stateless | 14 | 400 | 900 |
 | builder-adversary-lean | 28 | 390 | 960 |
+| builder-solo | 5 | 426 | 420 |
 
-## Correlations with total tokens (pooled, n=20)
+## Correlations with total tokens (pooled, n=25)
 
 | tokens vs | Pearson r |
 |---|--:|
-| duration | +0.51 |
-| commits | +0.50 |
-| LOC | +0.41 |
+| duration | +0.83 |
+| commits | +0.79 |
+| LOC | -0.04 |
 
 ## All runs (raw)
 
@@ -66,5 +69,10 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration
 | builder-adversary-lean | 4 | YES | | 13,815,595 | 960 | 24 | 378 | 36,549 | 14,391 | 575,650 |
 | builder-adversary-lean | 5 | YES | | 12,101,355 | 960 | 28 | 431 | 28,077 | 12,606 | 432,191 |
 | builder-adversary-lean | 1 | YES | | 13,793,914 | 1020 | 25 | 354 | 38,966 | 13,523 | 551,757 |
+| builder-solo | 1 | YES | | 2,417,528 | 360 | 5 | 401 | 6,029 | 6,715 | 483,506 |
+| builder-solo | 2 | YES | | 2,747,457 | 420 | 7 | 450 | 6,105 | 6,542 | 392,494 |
+| builder-solo | 3 | YES | | 2,837,115 | 420 | 6 | 426 | 6,660 | 6,755 | 472,852 |
+| builder-solo | 4 | YES | | 2,773,634 | 420 | 5 | 398 | 6,969 | 6,604 | 554,727 |
+| builder-solo | 5 | YES | | 2,948,467 | 420 | 4 | 446 | 6,611 | 7,020 | 737,117 |
 
 _Stats over successful runs. `LIMIT` = the run hit a usage-limit pause (duration/tok-sec distorted, token total fine). Repos kept under the run root for analysis._
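
Review note: the new `builder-solo` summary rows can be cross-checked against the five raw rows added in the last hunk. The sketch below is only a sanity check, not the script that generated the file. It assumes the summary columns are median, mean, min, max, and max/min of total tokens, and that the three efficiency ratios are tokens/LOC, tokens/sec, and tokens/commit. The table headers are not visible in this diff, so those readings are inferred from the arithmetic. The names `SOLO_RUNS` and `min_med_max` are hypothetical.

```python
from statistics import mean, median

# builder-solo raw rows copied from the diff: (run, tokens, duration, commits, loc)
SOLO_RUNS = [
    (1, 2_417_528, 360, 5, 401),
    (2, 2_747_457, 420, 7, 450),
    (3, 2_837_115, 420, 6, 426),
    (4, 2_773_634, 420, 5, 398),
    (5, 2_948_467, 420, 4, 446),
]


def min_med_max(values):
    """Return (min, median, max), each rounded to the nearest integer."""
    return tuple(round(v) for v in (min(values), median(values), max(values)))


tokens = [t for _, t, _, _, _ in SOLO_RUNS]

summary = {
    # Assumed column meanings for the per-variant token table.
    "median_tokens": median(tokens),
    "mean_tokens": round(mean(tokens)),
    "max_over_min": round(max(tokens) / min(tokens), 2),
    # Assumed definitions of the three efficiency ratios.
    "tokens_per_loc": min_med_max([t / loc for _, t, _, _, loc in SOLO_RUNS]),
    "tokens_per_sec": min_med_max([t / d for _, t, d, _, _ in SOLO_RUNS]),
    "tokens_per_commit": min_med_max([t / c for _, t, _, c, _ in SOLO_RUNS]),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Under those assumptions the recomputed values agree with the `builder-solo` rows in the token, efficiency, and median tables. The pooled Pearson correlations cannot be rechecked the same way, because the diff shows only 8 of the 25 successful runs.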