agent-orchestrator-benchmark/RESULTS-campaign.md

Full-harness benchmark — campaign analysis

Each run uses the real agents.py up to launch the Builder/Adversary loop pair plus watchdog and drives the 3-phase calculator task to SEQUENCE-COMPLETE. Both loops run on Sonnet. Tokens are summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank calc/*.py lines (code + tests). 30 of 33 runs succeeded.
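The LOC metric described above can be sketched as follows. This is a minimal illustration, not the harness's actual counter: the function name `count_loc` and the `repo_root` argument are hypothetical, and it assumes "non-blank" means any line with at least one non-whitespace character.

```python
from pathlib import Path


def count_loc(repo_root: str) -> int:
    """Count non-blank lines across calc/*.py (code and tests alike).

    Hypothetical helper: assumes the work repo keeps its Python files
    directly under calc/ and that whitespace-only lines count as blank.
    """
    total = 0
    for path in sorted(Path(repo_root).glob("calc/*.py")):
        with path.open(encoding="utf-8") as fh:
            total += sum(1 for line in fh if line.strip())
    return total


if __name__ == "__main__":
    # Run from the root of a work repo; prints 0 if calc/ is absent.
    print(count_loc("."))
```

The glob is deliberately non-recursive, matching the calc/*.py pattern in the description; files in subdirectories would not be counted.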

Per-variant total tokens (successful runs)

| variant | runs (ok/total) | median | mean | min | max | spread |
|---|---:|---:|---:|---:|---:|---:|
| builder-adversary | 5/5 | 13,037,683 | 12,919,657 | 11,117,474 | 14,960,414 | 1.35x |
| builder-adversary-min | 5/6 | 9,772,581 | 9,917,040 | 9,135,718 | 11,386,415 | 1.25x |
| builder-adversary-stateless | 5/5 | 10,122,375 | 10,735,401 | 9,992,834 | 13,009,792 | 1.30x |
| builder-adversary-lean | 5/6 | 13,409,349 | 13,216,582 | 12,101,355 | 13,815,595 | 1.14x |
| builder-solo | 5/5 | 2,773,634 | 2,744,840 | 2,417,528 | 2,948,467 | 1.22x |
| builder-adversary-deferred | 5/6 | 12,888,082 | 12,410,637 | 9,610,923 | 15,334,242 | 1.60x |
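As a worked check, the builder-solo row can be reproduced from the five successful builder-solo totals in the raw table. The sketch assumes spread means max / min, which is an inference from the published numbers rather than something the report states; `summarize` is a hypothetical helper name.

```python
from statistics import mean, median

# Total tokens for the five successful builder-solo runs (raw table).
solo_totals = [2_417_528, 2_747_457, 2_837_115, 2_773_634, 2_948_467]


def summarize(totals):
    """Return median, mean, min, max and spread for one variant.

    Assumption: spread is max / min, rounded to two decimals.
    """
    lo, hi = min(totals), max(totals)
    return {
        "median": median(totals),
        "mean": round(mean(totals)),
        "min": lo,
        "max": hi,
        "spread": round(hi / lo, 2),
    }


solo_summary = summarize(solo_totals)
print(solo_summary)
# median 2773634, mean 2744840, min 2417528, max 2948467, spread 1.22
```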

Efficiency ratios — min / median / max (successful runs)

tokens/sec excludes runs flagged LIMIT (a usage-limit pause inflates duration without adding tokens, so it would understate the true rate); tokens/LOC and tokens/commit are unaffected and include all successful runs.

| variant | tokens / LOC | tokens / sec | tokens / commit |
|---|---:|---:|---:|
| builder-adversary | 23,857 / 30,391 / 32,665 | 11,581 / 12,226 / 13,852 | 793,540 / 935,026 / 1,044,586 |
| builder-adversary-min | 23,527 / 25,252 / 32,814 | 8,173 / 14,807 / 15,814 | 582,292 / 669,789 / 712,415 |
| builder-adversary-stateless | 21,802 / 27,799 / 30,861 | 10,544 / 11,620 / 12,755 | 697,172 / 765,282 / 778,644 |
| builder-adversary-lean | 28,077 / 33,238 / 38,966 | 12,416 / 13,523 / 14,403 | 432,191 / 478,905 / 575,650 |
| builder-solo | 6,029 / 6,611 / 6,969 | 6,542 / 6,715 / 7,020 | 392,494 / 483,506 / 737,117 |
| builder-adversary-deferred | 20,107 / 31,451 / 34,537 | 12,170 / 13,349 / 15,343 | 739,302 / 1,277,854 / 1,841,155 |
| all | 6,029 / 28,410 / 38,966 | 6,542 / 12,416 / 15,814 | 392,494 / 716,723 / 1,841,155 |
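The LIMIT exclusion rule can be illustrated with the five builder-adversary runs from the raw table. This is a sketch under the assumption that each ratio is computed per run and then summarized as min / median / max; the tuple layout and the `min_med_max` helper are hypothetical.

```python
from statistics import median

# (total_tokens, duration_s, commits, loc, hit_limit) for the five
# builder-adversary runs, copied from the raw table.
runs = [
    (11_117_474, 960, 14, 466, False),
    (13_579_616, 1680, 13, 449, True),   # flagged LIMIT
    (14_960_414, 1080, 16, 458, False),
    (13_037_683, 1020, 13, 429, False),
    (11_903_098, 1020, 15, 381, False),
]


def min_med_max(values):
    """Summarize per-run ratios as rounded (min, median, max)."""
    return round(min(values)), round(median(values)), round(max(values))


# tokens/LOC and tokens/commit use every successful run.
tok_per_loc = [t / loc for t, _, _, loc, _ in runs]
tok_per_commit = [t / c for t, _, c, _, _ in runs]
# tokens/sec drops LIMIT runs, whose duration includes the pause.
tok_per_sec = [t / d for t, d, _, _, lim in runs if not lim]

print(min_med_max(tok_per_loc))     # (23857, 30391, 32665)
print(min_med_max(tok_per_sec))     # (11581, 12226, 13852)
print(min_med_max(tok_per_commit))  # (793540, 935026, 1044586)
```

With the LIMIT run left in, its 8,083 tokens/sec would become the minimum and drag the median down, which is the distortion the exclusion avoids.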

Per-variant medians (commits / LOC / duration)

| variant | median commits | median LOC | median duration (s) |
|---|---:|---:|---:|
| builder-adversary | 14 | 449 | 1020 |
| builder-adversary-min | 15 | 367 | 720 |
| builder-adversary-stateless | 14 | 400 | 900 |
| builder-adversary-lean | 28 | 390 | 960 |
| builder-solo | 5 | 426 | 420 |
| builder-adversary-deferred | 12 | 425 | 840 |

Correlations with total tokens (pooled, n=30)

| tokens vs | Pearson r |
|---|---:|
| duration | +0.83 |
| commits | +0.64 |
| LOC | -0.00 |
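For reference, a Pearson coefficient of this kind can be computed as below. This is a generic sketch of the standard formula, not the campaign's analysis script; `pearson_r` is a hypothetical name, and the sample uses only the first listed run of each variant, so its value need not match the pooled n=30 figures.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# First listed run of each variant (raw table): total tokens and duration.
tokens = [11_117_474, 9_135_718, 10_457_577, 12_962_701, 2_417_528, 13_366_800]
durations = [960, 780, 900, 900, 360, 1020]

print(f"{pearson_r(tokens, durations):+.2f}")
```

Feeding all 30 successful runs into the same function, once per column, would be the natural way to reproduce the pooled values.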

All runs (raw)

| variant | rep | ok | limit | total | dur (s) | commits | LOC | tok/LOC | tok/sec | tok/commit |
|---|---:|---|---|---:|---:|---:|---:|---:|---:|---:|
| builder-adversary | 1 | YES | | 11,117,474 | 960 | 14 | 466 | 23,857 | 11,581 | 794,105 |
| builder-adversary | 2 | YES | LIMIT | 13,579,616 | 1680 | 13 | 449 | 30,244 | 8,083 | 1,044,586 |
| builder-adversary | 3 | YES | | 14,960,414 | 1080 | 16 | 458 | 32,665 | 13,852 | 935,026 |
| builder-adversary | 4 | YES | | 13,037,683 | 1020 | 13 | 429 | 30,391 | 12,782 | 1,002,899 |
| builder-adversary | 5 | YES | | 11,903,098 | 1020 | 15 | 381 | 31,242 | 11,670 | 793,540 |
| builder-adversary-min | 1 | YES | | 9,135,718 | 780 | 15 | 367 | 24,893 | 11,712 | 609,048 |
| builder-adversary-min | 2 | YES | | 11,386,415 | 720 | 17 | 347 | 32,814 | 15,814 | 669,789 |
| builder-adversary-min | 3 | YES | | 9,316,676 | 1140 | 16 | 396 | 23,527 | 8,173 | 582,292 |
| builder-adversary-min | 4 | YES | | 9,973,813 | 660 | 14 | 347 | 28,743 | 15,112 | 712,415 |
| builder-adversary-min | 5 | NO | | 3,693,171 | 1800 | 4 | 128 | 28,853 | 2,052 | 923,293 |
| builder-adversary-min | 1 | YES | | 9,772,581 | 660 | 14 | 387 | 25,252 | 14,807 | 698,042 |
| builder-adversary-stateless | 1 | YES | | 10,457,577 | 900 | 15 | 400 | 26,144 | 11,620 | 697,172 |
| builder-adversary-stateless | 2 | YES | | 9,992,834 | 840 | 13 | 341 | 29,304 | 11,896 | 768,680 |
| builder-adversary-stateless | 3 | YES | | 10,094,430 | 900 | 14 | 463 | 21,802 | 11,216 | 721,031 |
| builder-adversary-stateless | 4 | YES | | 10,122,375 | 960 | 13 | 328 | 30,861 | 10,544 | 778,644 |
| builder-adversary-stateless | 5 | YES | | 13,009,792 | 1020 | 17 | 468 | 27,799 | 12,755 | 765,282 |
| builder-adversary-lean | 1 | YES | | 12,962,701 | 900 | 28 | 390 | 33,238 | 14,403 | 462,954 |
| builder-adversary-lean | 2 | YES | | 13,409,349 | 1080 | 28 | 451 | 29,732 | 12,416 | 478,905 |
| builder-adversary-lean | 3 | NO | LIMIT | 6,518,422 | 1800 | 11 | 259 | 25,168 | 3,621 | 592,584 |
| builder-adversary-lean | 4 | YES | | 13,815,595 | 960 | 24 | 378 | 36,549 | 14,391 | 575,650 |
| builder-adversary-lean | 5 | YES | | 12,101,355 | 960 | 28 | 431 | 28,077 | 12,606 | 432,191 |
| builder-adversary-lean | 1 | YES | | 13,793,914 | 1020 | 25 | 354 | 38,966 | 13,523 | 551,757 |
| builder-solo | 1 | YES | | 2,417,528 | 360 | 5 | 401 | 6,029 | 6,715 | 483,506 |
| builder-solo | 2 | YES | | 2,747,457 | 420 | 7 | 450 | 6,105 | 6,542 | 392,494 |
| builder-solo | 3 | YES | | 2,837,115 | 420 | 6 | 426 | 6,660 | 6,755 | 472,852 |
| builder-solo | 4 | YES | | 2,773,634 | 420 | 5 | 398 | 6,969 | 6,604 | 554,727 |
| builder-solo | 5 | YES | | 2,948,467 | 420 | 4 | 446 | 6,611 | 7,020 | 737,117 |
| builder-adversary-deferred | 1 | YES | | 13,366,800 | 1020 | 10 | 425 | 31,451 | 13,105 | 1,336,680 |
| builder-adversary-deferred | 2 | YES | | 12,888,082 | 840 | 7 | 381 | 33,827 | 15,343 | 1,841,155 |
| builder-adversary-deferred | 3 | YES | | 15,334,242 | 1260 | 12 | 444 | 34,537 | 12,170 | 1,277,854 |
| builder-adversary-deferred | 4 | NO | | 8,536,729 | 2700 | 6 | 146 | 58,471 | 3,162 | 1,422,788 |
| builder-adversary-deferred | 5 | YES | | 9,610,923 | 720 | 13 | 478 | 20,107 | 13,349 | 739,302 |
| builder-adversary-deferred | 1 | YES | | 10,853,138 | 780 | 12 | 421 | 25,779 | 13,914 | 904,428 |

All summary statistics are computed over successful runs only (ok = YES). LIMIT = the run hit a usage-limit pause, so its duration and tok/sec are distorted while its token total is unaffected. Work repos are kept under the run root for analysis.