Full-harness benchmark — campaign analysis
Each run drives the real agents.py up harness (Builder/Adversary loop pair + watchdog) through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops run on Sonnet. Tokens are summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank calc/*.py lines (code + tests). 30 of 33 runs succeeded.
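As an illustration of the LOC metric described above, here is a minimal sketch of counting non-blank lines. The function name `count_nonblank_loc` and the non-recursive `calc/*.py` glob are assumptions; the harness's actual counting code is not shown in this report.

```python
from pathlib import Path


def count_nonblank_loc(repo_root: str, pattern: str = "calc/*.py") -> int:
    """Count non-blank lines in files matching `pattern` under `repo_root`.

    Assumption: a line is "blank" if it is empty or whitespace-only, and the
    glob is not recursive (files directly inside calc/ only).
    """
    total = 0
    for path in sorted(Path(repo_root).glob(pattern)):
        with path.open(encoding="utf-8", errors="replace") as fh:
            total += sum(1 for line in fh if line.strip())
    return total
```

Comment lines count as code under this definition, since only whitespace-only lines are skipped.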
Per-variant total tokens (successful runs)
| variant | runs(ok) | median | mean | min | max | spread |
|---|---|---|---|---|---|---|
| builder-adversary | 5/5 | 13,037,683 | 12,919,657 | 11,117,474 | 14,960,414 | 1.35x |
| builder-adversary-min | 5/6 | 9,772,581 | 9,917,040 | 9,135,718 | 11,386,415 | 1.25x |
| builder-adversary-stateless | 5/5 | 10,122,375 | 10,735,401 | 9,992,834 | 13,009,792 | 1.30x |
| builder-adversary-lean | 5/6 | 13,409,349 | 13,216,582 | 12,101,355 | 13,815,595 | 1.14x |
| builder-solo | 5/5 | 2,773,634 | 2,744,840 | 2,417,528 | 2,948,467 | 1.22x |
| builder-adversary-deferred | 5/6 | 12,888,082 | 12,410,637 | 9,610,923 | 15,334,242 | 1.60x |
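The columns above can be reproduced from the raw run totals. The sketch below assumes that spread means max / min, which matches the table values; the helper name `summarize` is hypothetical. The input is the five successful builder-adversary totals from the raw table.

```python
from statistics import mean, median


def summarize(totals):
    """Return median, mean, min, max and spread (assumed to be max / min)."""
    lo, hi = min(totals), max(totals)
    return {
        "median": median(totals),
        "mean": round(mean(totals)),
        "min": lo,
        "max": hi,
        "spread": round(hi / lo, 2),
    }


# Total tokens for the five successful builder-adversary runs (raw table).
builder_adversary = [11_117_474, 13_579_616, 14_960_414, 13_037_683, 11_903_098]
stats = summarize(builder_adversary)
print(stats)
```

For this variant the result agrees with the table row: median 13,037,683, mean 12,919,657, spread 1.35x.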
Efficiency ratios — min / median / max (successful runs)
tokens/sec excludes runs flagged LIMIT: a usage-limit pause inflates duration without adding tokens, so including those runs would understate the true rate. tokens/LOC and tokens/commit are unaffected and include all successful runs.
| variant | tokens / LOC | tokens / sec | tokens / commit |
|---|---|---|---|
| builder-adversary | 23,857 / 30,391 / 32,665 | 11,581 / 12,226 / 13,852 | 793,540 / 935,026 / 1,044,586 |
| builder-adversary-min | 23,527 / 25,252 / 32,814 | 8,173 / 14,807 / 15,814 | 582,292 / 669,789 / 712,415 |
| builder-adversary-stateless | 21,802 / 27,799 / 30,861 | 10,544 / 11,620 / 12,755 | 697,172 / 765,282 / 778,644 |
| builder-adversary-lean | 28,077 / 33,238 / 38,966 | 12,416 / 13,523 / 14,403 | 432,191 / 478,905 / 575,650 |
| builder-solo | 6,029 / 6,611 / 6,969 | 6,542 / 6,715 / 7,020 | 392,494 / 483,506 / 737,117 |
| builder-adversary-deferred | 20,107 / 31,451 / 34,537 | 12,170 / 13,349 / 15,343 | 739,302 / 1,277,854 / 1,841,155 |
| all | 6,029 / 28,410 / 38,966 | 6,542 / 12,416 / 15,814 | 392,494 / 716,723 / 1,841,155 |
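The LIMIT exclusion for tokens/sec can be sketched as follows. This is an illustrative reconstruction, not the report's own script; the tuple layout and the function name `tokens_per_sec_summary` are assumptions. The data are the five builder-adversary runs from the raw table, where rep 2 is flagged LIMIT.

```python
from statistics import median

# (total_tokens, duration_s, hit_limit) for the builder-adversary runs.
runs = [
    (11_117_474, 960, False),
    (13_579_616, 1680, True),   # LIMIT: the pause inflates duration
    (14_960_414, 1080, False),
    (13_037_683, 1020, False),
    (11_903_098, 1020, False),
]


def tokens_per_sec_summary(runs):
    """Return (min, median, max) tokens/sec over runs not flagged LIMIT."""
    rates = [tokens / dur for tokens, dur, limit in runs if not limit]
    return min(rates), median(rates), max(rates)


lo, mid, hi = tokens_per_sec_summary(runs)
print(round(lo), round(mid), round(hi))
```

With four runs remaining, the median is the mean of the two middle rates, which is why 12,226 does not appear as any single run's tok/sec in the raw table.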
Per-variant medians (commits / LOC / duration)
| variant | median commits | median LOC | median dur(s) |
|---|---|---|---|
| builder-adversary | 14 | 449 | 1020 |
| builder-adversary-min | 15 | 367 | 720 |
| builder-adversary-stateless | 14 | 400 | 900 |
| builder-adversary-lean | 28 | 390 | 960 |
| builder-solo | 5 | 426 | 420 |
| builder-adversary-deferred | 12 | 425 | 840 |
Correlations with total tokens (pooled, n=30)
| tokens vs | Pearson r |
|---|---|
| duration | +0.83 |
| commits | +0.64 |
| LOC | -0.00 |
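For reference, a minimal sketch of the Pearson coefficient using only the standard library. The function name `pearson_r` is hypothetical and the report's own tooling may compute it differently; to reproduce the pooled figures, pass the 30 successful runs' totals as `xs` and the matching duration, commit, or LOC values as `ys`.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```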
All runs (raw)
| variant | rep | ok | limit | total | dur(s) | commits | LOC | tok/LOC | tok/sec | tok/commit |
|---|---|---|---|---|---|---|---|---|---|---|
| builder-adversary | 1 | YES | | 11,117,474 | 960 | 14 | 466 | 23,857 | 11,581 | 794,105 |
| builder-adversary | 2 | YES | LIMIT | 13,579,616 | 1680 | 13 | 449 | 30,244 | 8,083 | 1,044,586 |
| builder-adversary | 3 | YES | | 14,960,414 | 1080 | 16 | 458 | 32,665 | 13,852 | 935,026 |
| builder-adversary | 4 | YES | | 13,037,683 | 1020 | 13 | 429 | 30,391 | 12,782 | 1,002,899 |
| builder-adversary | 5 | YES | | 11,903,098 | 1020 | 15 | 381 | 31,242 | 11,670 | 793,540 |
| builder-adversary-min | 1 | YES | | 9,135,718 | 780 | 15 | 367 | 24,893 | 11,712 | 609,048 |
| builder-adversary-min | 2 | YES | | 11,386,415 | 720 | 17 | 347 | 32,814 | 15,814 | 669,789 |
| builder-adversary-min | 3 | YES | | 9,316,676 | 1140 | 16 | 396 | 23,527 | 8,173 | 582,292 |
| builder-adversary-min | 4 | YES | | 9,973,813 | 660 | 14 | 347 | 28,743 | 15,112 | 712,415 |
| builder-adversary-min | 5 | NO | | 3,693,171 | 1800 | 4 | 128 | 28,853 | 2,052 | 923,293 |
| builder-adversary-min | 1 | YES | | 9,772,581 | 660 | 14 | 387 | 25,252 | 14,807 | 698,042 |
| builder-adversary-stateless | 1 | YES | | 10,457,577 | 900 | 15 | 400 | 26,144 | 11,620 | 697,172 |
| builder-adversary-stateless | 2 | YES | | 9,992,834 | 840 | 13 | 341 | 29,304 | 11,896 | 768,680 |
| builder-adversary-stateless | 3 | YES | | 10,094,430 | 900 | 14 | 463 | 21,802 | 11,216 | 721,031 |
| builder-adversary-stateless | 4 | YES | | 10,122,375 | 960 | 13 | 328 | 30,861 | 10,544 | 778,644 |
| builder-adversary-stateless | 5 | YES | | 13,009,792 | 1020 | 17 | 468 | 27,799 | 12,755 | 765,282 |
| builder-adversary-lean | 1 | YES | | 12,962,701 | 900 | 28 | 390 | 33,238 | 14,403 | 462,954 |
| builder-adversary-lean | 2 | YES | | 13,409,349 | 1080 | 28 | 451 | 29,732 | 12,416 | 478,905 |
| builder-adversary-lean | 3 | NO | LIMIT | 6,518,422 | 1800 | 11 | 259 | 25,168 | 3,621 | 592,584 |
| builder-adversary-lean | 4 | YES | | 13,815,595 | 960 | 24 | 378 | 36,549 | 14,391 | 575,650 |
| builder-adversary-lean | 5 | YES | | 12,101,355 | 960 | 28 | 431 | 28,077 | 12,606 | 432,191 |
| builder-adversary-lean | 1 | YES | | 13,793,914 | 1020 | 25 | 354 | 38,966 | 13,523 | 551,757 |
| builder-solo | 1 | YES | | 2,417,528 | 360 | 5 | 401 | 6,029 | 6,715 | 483,506 |
| builder-solo | 2 | YES | | 2,747,457 | 420 | 7 | 450 | 6,105 | 6,542 | 392,494 |
| builder-solo | 3 | YES | | 2,837,115 | 420 | 6 | 426 | 6,660 | 6,755 | 472,852 |
| builder-solo | 4 | YES | | 2,773,634 | 420 | 5 | 398 | 6,969 | 6,604 | 554,727 |
| builder-solo | 5 | YES | | 2,948,467 | 420 | 4 | 446 | 6,611 | 7,020 | 737,117 |
| builder-adversary-deferred | 1 | YES | | 13,366,800 | 1020 | 10 | 425 | 31,451 | 13,105 | 1,336,680 |
| builder-adversary-deferred | 2 | YES | | 12,888,082 | 840 | 7 | 381 | 33,827 | 15,343 | 1,841,155 |
| builder-adversary-deferred | 3 | YES | | 15,334,242 | 1260 | 12 | 444 | 34,537 | 12,170 | 1,277,854 |
| builder-adversary-deferred | 4 | NO | | 8,536,729 | 2700 | 6 | 146 | 58,471 | 3,162 | 1,422,788 |
| builder-adversary-deferred | 5 | YES | | 9,610,923 | 720 | 13 | 478 | 20,107 | 13,349 | 739,302 |
| builder-adversary-deferred | 1 | YES | | 10,853,138 | 780 | 12 | 421 | 25,779 | 13,914 | 904,428 |
Stats over successful runs. LIMIT = the run hit a usage-limit pause (duration/tok-sec distorted, token total fine). Repos kept under the run root for analysis.