results: full 5-variant campaign complete (incl. builder-solo control)
@@ -1,6 +1,6 @@
# Full-harness benchmark — campaign analysis

-Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops on Sonnet. Tokens summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank `calc/*.py` lines (code + tests). 20 successful runs of 22 total.
+Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops on Sonnet. Tokens summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank `calc/*.py` lines (code + tests). 25 successful runs of 27 total.

## Per-variant total tokens (successful runs)

@@ -10,6 +10,7 @@ Real `agents.py up` Builder/Adversary loop pair + watchdog through the 3-phase c
| builder-adversary-min | 5/6 | 9,772,581 | 9,917,040 | 9,135,718 | 11,386,415 | 1.25x |
| builder-adversary-stateless | 5/5 | 10,122,375 | 10,735,401 | 9,992,834 | 13,009,792 | 1.30x |
| builder-adversary-lean | 5/6 | 13,409,349 | 13,216,582 | 12,101,355 | 13,815,595 | 1.14x |
+| builder-solo | 5/5 | 2,773,634 | 2,744,840 | 2,417,528 | 2,948,467 | 1.22x |

## Efficiency ratios — min / median / max (successful runs)

@@ -21,7 +22,8 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration
| builder-adversary-min | 23,527 / 25,252 / 32,814 | 8,173 / 14,807 / 15,814 | 582,292 / 669,789 / 712,415 |
| builder-adversary-stateless | 21,802 / 27,799 / 30,861 | 10,544 / 11,620 / 12,755 | 697,172 / 765,282 / 778,644 |
| builder-adversary-lean | 28,077 / 33,238 / 38,966 | 12,416 / 13,523 / 14,403 | 432,191 / 478,905 / 575,650 |
-| **all** | 21,802 / 29,518 / 38,966 | 8,083 / 12,511 / 15,814 | 432,191 / 705,228 / 1,044,586 |
+| builder-solo | 6,029 / 6,611 / 6,969 | 6,542 / 6,715 / 7,020 | 392,494 / 483,506 / 737,117 |
+| **all** | 6,029 / 28,077 / 38,966 | 6,542 / 11,712 / 15,814 | 392,494 / 697,172 / 1,044,586 |

## Per-variant medians (commits / LOC / duration)

@@ -31,14 +33,15 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration
| builder-adversary-min | 15 | 367 | 720 |
| builder-adversary-stateless | 14 | 400 | 900 |
| builder-adversary-lean | 28 | 390 | 960 |
+| builder-solo | 5 | 426 | 420 |

-## Correlations with total tokens (pooled, n=20)
+## Correlations with total tokens (pooled, n=25)

| tokens vs | Pearson r |
|---|--:|
-| duration | +0.51 |
-| commits | +0.50 |
-| LOC | +0.41 |
+| duration | +0.83 |
+| commits | +0.79 |
+| LOC | -0.04 |

## All runs (raw)

@@ -66,5 +69,10 @@ tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration
| builder-adversary-lean | 4 | YES | | 13,815,595 | 960 | 24 | 378 | 36,549 | 14,391 | 575,650 |
| builder-adversary-lean | 5 | YES | | 12,101,355 | 960 | 28 | 431 | 28,077 | 12,606 | 432,191 |
| builder-adversary-lean | 1 | YES | | 13,793,914 | 1020 | 25 | 354 | 38,966 | 13,523 | 551,757 |
+| builder-solo | 1 | YES | | 2,417,528 | 360 | 5 | 401 | 6,029 | 6,715 | 483,506 |
+| builder-solo | 2 | YES | | 2,747,457 | 420 | 7 | 450 | 6,105 | 6,542 | 392,494 |
+| builder-solo | 3 | YES | | 2,837,115 | 420 | 6 | 426 | 6,660 | 6,755 | 472,852 |
+| builder-solo | 4 | YES | | 2,773,634 | 420 | 5 | 398 | 6,969 | 6,604 | 554,727 |
+| builder-solo | 5 | YES | | 2,948,467 | 420 | 4 | 446 | 6,611 | 7,020 | 737,117 |

_Stats over successful runs. `LIMIT` = the run hit a usage-limit pause (duration/tok-sec distorted, token total fine). Repos kept under the run root for analysis._
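
As a rough cross-check, the builder-solo summary rows can be recomputed from its five raw rows. This is a minimal sketch, not the campaign's own tooling. It assumes the unlabeled summary columns are median, mean, min, max and max/min spread, that the three ratio columns are tokens/LOC, tokens/sec and tokens/commit, and that duration is in seconds (the table header rows fall outside the hunks shown). The names `SOLO_RUNS`, `summarize`, `min_med_max` and `pearson` are hypothetical. The pooled n=25 correlations need all 25 raw rows, which are not visible here, so the Pearson helper is only exercised on the builder-solo rows.

```python
from math import sqrt
from statistics import mean, median

# builder-solo rows copied from the raw table: (run, tokens, duration, commits, LOC).
# Assumption: duration is in seconds, since tokens / duration matches the tok/sec column.
SOLO_RUNS = [
    (1, 2_417_528, 360, 5, 401),
    (2, 2_747_457, 420, 7, 450),
    (3, 2_837_115, 420, 6, 426),
    (4, 2_773_634, 420, 5, 398),
    (5, 2_948_467, 420, 4, 446),
]


def summarize(values):
    """Return (median, mean, min, max, max/min spread) for one column."""
    lo, hi = min(values), max(values)
    return median(values), mean(values), lo, hi, hi / lo


def min_med_max(values):
    """Return the min / median / max triple used in the efficiency table."""
    return min(values), median(values), max(values)


def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


tokens = [t for _, t, _, _, _ in SOLO_RUNS]
durations = [d for _, _, d, _, _ in SOLO_RUNS]

med, avg, lo, hi, spread = summarize(tokens)
print(f"tokens: median={med:,} mean={avg:,.0f} min={lo:,} max={hi:,} spread={spread:.2f}x")

# Per-run ratios, rounded to whole tokens as in the raw table.
per_loc = [round(t / loc) for _, t, _, _, loc in SOLO_RUNS]
per_sec = [round(t / d) for _, t, d, _, _ in SOLO_RUNS]
per_commit = [round(t / c) for _, t, _, c, _ in SOLO_RUNS]
for name, ratios in (("tok/LOC", per_loc), ("tok/sec", per_sec), ("tok/commit", per_commit)):
    print(name, min_med_max(ratios))

# Within-variant correlation only; this is not the pooled figure from the table.
r = pearson(tokens, durations)
print(f"builder-solo tokens vs duration: r={r:+.2f}")
```

Under those assumptions the first printed line matches the builder-solo token row (median 2,773,634, mean 2,744,840, spread 1.22x), and the three ratio triples match its efficiency row.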