# Full-harness benchmark — campaign analysis

Each run drives a real `agents.py up` Builder/Adversary loop pair plus watchdog through the 3-phase calculator until SEQUENCE-COMPLETE. Both loops run on Sonnet. Tokens are summed from each loop's session transcript; commits is the work-repo commit count; LOC is the number of non-blank lines in `calc/*.py` (code + tests). 25 of 27 runs succeeded.

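
The LOC definition above can be sketched as follows. This is a minimal illustration, not the harness's actual counter: `count_nonblank_lines`, `count_calc_loc`, and the `repo_root` argument are hypothetical names, and the sketch assumes "non-blank" means a line with at least one non-whitespace character.

```python
from pathlib import Path


def count_nonblank_lines(text: str) -> int:
    """Count lines that still contain something after stripping whitespace."""
    return sum(1 for line in text.splitlines() if line.strip())


def count_calc_loc(repo_root: str) -> int:
    """Sum non-blank lines over calc/*.py (code and tests) under repo_root.

    Hypothetical helper: the real harness may count differently.
    """
    total = 0
    for path in sorted(Path(repo_root).glob("calc/*.py")):
        total += count_nonblank_lines(path.read_text(encoding="utf-8", errors="replace"))
    return total
```

Comment-only lines count as LOC under this assumption, since only whitespace-only lines are dropped.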
## Per-variant total tokens (successful runs)

| variant | runs (ok/total) | median | mean | min | max | spread |
|---|:--:|--:|--:|--:|--:|--:|
| builder-adversary | 5/5 | 13,037,683 | 12,919,657 | 11,117,474 | 14,960,414 | 1.35x |
| builder-adversary-min | 5/6 | 9,772,581 | 9,917,040 | 9,135,718 | 11,386,415 | 1.25x |
| builder-adversary-stateless | 5/5 | 10,122,375 | 10,735,401 | 9,992,834 | 13,009,792 | 1.30x |
| builder-adversary-lean | 5/6 | 13,409,349 | 13,216,582 | 12,101,355 | 13,815,595 | 1.14x |
| builder-solo | 5/5 | 2,773,634 | 2,744,840 | 2,417,528 | 2,948,467 | 1.22x |

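
As a worked check, the summary columns can be recomputed from the raw per-run totals. The sketch below uses the five successful `builder-adversary` totals from the raw table and assumes the spread column is max/min, which matches the table for every variant; `summarize` is a hypothetical helper name.

```python
from statistics import mean, median

# Total tokens for the five successful builder-adversary runs (raw table).
totals = [11_117_474, 13_579_616, 14_960_414, 13_037_683, 11_903_098]


def summarize(values):
    """Return median, mean, min, max, and max/min spread for one variant."""
    lo, hi = min(values), max(values)
    return {
        "median": median(values),
        "mean": mean(values),
        "min": lo,
        "max": hi,
        "spread": hi / lo,  # assumption: spread is max divided by min
    }


stats = summarize(totals)
print(f"{stats['median']:,} {stats['mean']:,.0f} {stats['spread']:.2f}x")
# prints: 13,037,683 12,919,657 1.35x
```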
## Efficiency ratios — min / median / max (successful runs)

Tokens/sec excludes runs flagged `LIMIT`: a usage-limit pause inflates duration without adding tokens, so including such a run would understate the true rate. Tokens/LOC and tokens/commit are unaffected and include all successful runs. Each cell reads min / median / max.

| variant | tokens / LOC | tokens / sec | tokens / commit |
|---|--:|--:|--:|
| builder-adversary | 23,857 / 30,391 / 32,665 | 8,083 / 11,670 / 13,852 | 793,540 / 935,026 / 1,044,586 |
| builder-adversary-min | 23,527 / 25,252 / 32,814 | 8,173 / 14,807 / 15,814 | 582,292 / 669,789 / 712,415 |
| builder-adversary-stateless | 21,802 / 27,799 / 30,861 | 10,544 / 11,620 / 12,755 | 697,172 / 765,282 / 778,644 |
| builder-adversary-lean | 28,077 / 33,238 / 38,966 | 12,416 / 13,523 / 14,403 | 432,191 / 478,905 / 575,650 |
| builder-solo | 6,029 / 6,611 / 6,969 | 6,542 / 6,715 / 7,020 | 392,494 / 483,506 / 737,117 |
| **all** | 6,029 / 28,077 / 38,966 | 6,542 / 11,712 / 15,814 | 392,494 / 697,172 / 1,044,586 |

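
The filtering rule for the ratios can be sketched as below, using the six `builder-adversary-lean` rows from the raw table. The `Run` dataclass and `min_median_max` helper are hypothetical names for illustration, not the campaign's actual analysis script.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    ok: bool        # run reached SEQUENCE-COMPLETE
    limit: bool     # run hit a usage-limit pause
    tokens: int
    duration_s: int
    commits: int
    loc: int


# builder-adversary-lean rows, copied from the raw table.
runs = [
    Run(True, False, 12_962_701, 900, 28, 390),
    Run(True, False, 13_409_349, 1080, 28, 451),
    Run(False, True, 6_518_422, 1800, 11, 259),
    Run(True, False, 13_815_595, 960, 24, 378),
    Run(True, False, 12_101_355, 960, 28, 431),
    Run(True, False, 13_793_914, 1020, 25, 354),
]


def min_median_max(values):
    """Return the (min, median, max) triple shown in each table cell."""
    return min(values), median(values), max(values)


ok_runs = [r for r in runs if r.ok]
tok_per_loc = [r.tokens / r.loc for r in ok_runs]
tok_per_commit = [r.tokens / r.commits for r in ok_runs]
# tokens/sec additionally drops LIMIT-flagged runs, whose duration is inflated.
tok_per_sec = [r.tokens / r.duration_s for r in ok_runs if not r.limit]

for name, vals in [
    ("tokens/LOC", tok_per_loc),
    ("tokens/sec", tok_per_sec),
    ("tokens/commit", tok_per_commit),
]:
    print(name, " / ".join(f"{v:,.0f}" for v in min_median_max(vals)))
```

In this variant the only `LIMIT`-flagged run also failed, so the extra tokens/sec filter changes nothing here; it only matters for a successful run that was paused.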
## Per-variant medians (commits / LOC / duration)

| variant | median commits | median LOC | median duration (s) |
|---|--:|--:|--:|
| builder-adversary | 14 | 449 | 1020 |
| builder-adversary-min | 15 | 367 | 720 |
| builder-adversary-stateless | 14 | 400 | 900 |
| builder-adversary-lean | 28 | 390 | 960 |
| builder-solo | 5 | 426 | 420 |

## Correlations with total tokens (pooled, n=25)

| tokens vs | Pearson r |
|---|--:|
| duration | +0.83 |
| commits | +0.79 |
| LOC | -0.04 |

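
For reference, Pearson r for two columns can be computed as in the sketch below. `pearson_r` is a hypothetical helper written from the standard definition. The sample data is only the five `builder-solo` runs, so its result is not comparable to the pooled n=25 values in the table.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return cov / sqrt(var_x * var_y)


# Illustration only: builder-solo total tokens vs. commits (raw table).
solo_tokens = [2_417_528, 2_747_457, 2_837_115, 2_773_634, 2_948_467]
solo_commits = [5, 7, 6, 5, 4]
print(f"r = {pearson_r(solo_tokens, solo_commits):+.2f}")
```

To reproduce the pooled figures, the same function would be applied to the total-token column against duration, commits, and LOC across all 25 successful rows.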
## All runs (raw)

| variant | rep | ok | limit | total tokens | dur (s) | commits | LOC | tok/LOC | tok/sec | tok/commit |
|---|:--:|:--:|:--:|--:|--:|--:|--:|--:|--:|--:|
| builder-adversary | 1 | YES | | 11,117,474 | 960 | 14 | 466 | 23,857 | 11,581 | 794,105 |
| builder-adversary | 2 | YES | | 13,579,616 | 1680 | 13 | 449 | 30,244 | 8,083 | 1,044,586 |
| builder-adversary | 3 | YES | | 14,960,414 | 1080 | 16 | 458 | 32,665 | 13,852 | 935,026 |
| builder-adversary | 4 | YES | | 13,037,683 | 1020 | 13 | 429 | 30,391 | 12,782 | 1,002,899 |
| builder-adversary | 5 | YES | | 11,903,098 | 1020 | 15 | 381 | 31,242 | 11,670 | 793,540 |
| builder-adversary-min | 1 | YES | | 9,135,718 | 780 | 15 | 367 | 24,893 | 11,712 | 609,048 |
| builder-adversary-min | 2 | YES | | 11,386,415 | 720 | 17 | 347 | 32,814 | 15,814 | 669,789 |
| builder-adversary-min | 3 | YES | | 9,316,676 | 1140 | 16 | 396 | 23,527 | 8,173 | 582,292 |
| builder-adversary-min | 4 | YES | | 9,973,813 | 660 | 14 | 347 | 28,743 | 15,112 | 712,415 |
| builder-adversary-min | 5 | NO | | 3,693,171 | 1800 | 4 | 128 | 28,853 | 2,052 | 923,293 |
| builder-adversary-min | 1 | YES | | 9,772,581 | 660 | 14 | 387 | 25,252 | 14,807 | 698,042 |
| builder-adversary-stateless | 1 | YES | | 10,457,577 | 900 | 15 | 400 | 26,144 | 11,620 | 697,172 |
| builder-adversary-stateless | 2 | YES | | 9,992,834 | 840 | 13 | 341 | 29,304 | 11,896 | 768,680 |
| builder-adversary-stateless | 3 | YES | | 10,094,430 | 900 | 14 | 463 | 21,802 | 11,216 | 721,031 |
| builder-adversary-stateless | 4 | YES | | 10,122,375 | 960 | 13 | 328 | 30,861 | 10,544 | 778,644 |
| builder-adversary-stateless | 5 | YES | | 13,009,792 | 1020 | 17 | 468 | 27,799 | 12,755 | 765,282 |
| builder-adversary-lean | 1 | YES | | 12,962,701 | 900 | 28 | 390 | 33,238 | 14,403 | 462,954 |
| builder-adversary-lean | 2 | YES | | 13,409,349 | 1080 | 28 | 451 | 29,732 | 12,416 | 478,905 |
| builder-adversary-lean | 3 | NO | LIMIT | 6,518,422 | 1800 | 11 | 259 | 25,168 | 3,621 | 592,584 |
| builder-adversary-lean | 4 | YES | | 13,815,595 | 960 | 24 | 378 | 36,549 | 14,391 | 575,650 |
| builder-adversary-lean | 5 | YES | | 12,101,355 | 960 | 28 | 431 | 28,077 | 12,606 | 432,191 |
| builder-adversary-lean | 1 | YES | | 13,793,914 | 1020 | 25 | 354 | 38,966 | 13,523 | 551,757 |
| builder-solo | 1 | YES | | 2,417,528 | 360 | 5 | 401 | 6,029 | 6,715 | 483,506 |
| builder-solo | 2 | YES | | 2,747,457 | 420 | 7 | 450 | 6,105 | 6,542 | 392,494 |
| builder-solo | 3 | YES | | 2,837,115 | 420 | 6 | 426 | 6,660 | 6,755 | 472,852 |
| builder-solo | 4 | YES | | 2,773,634 | 420 | 5 | 398 | 6,969 | 6,604 | 554,727 |
| builder-solo | 5 | YES | | 2,948,467 | 420 | 4 | 446 | 6,611 | 7,020 | 737,117 |

_Stats over successful runs. `LIMIT` = the run hit a usage-limit pause (duration/tok-sec distorted, token total fine). Repos kept under the run root for analysis._