Full-harness benchmark — campaign analysis

Each run brings up the real stack with agents.py up (Builder/Adversary loop pair plus watchdog) and drives it through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops run on Sonnet. Tokens are summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank calc/*.py lines (code + tests). 25 of 27 runs succeeded.
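The LOC definition above can be sketched as follows. This is a minimal illustration, not the harness's actual counter: the function name `count_nonblank_loc` is hypothetical, and it assumes "non-blank" means a line with at least one non-whitespace character and that only files directly under calc/ are counted, as the `calc/*.py` pattern suggests.

```python
from pathlib import Path


def count_nonblank_loc(calc_dir):
    """Count non-blank lines across the *.py files directly under calc_dir.

    Assumption: a line counts if it contains any non-whitespace character,
    so comments and docstrings are included along with code and tests.
    """
    total = 0
    for path in sorted(Path(calc_dir).glob("*.py")):
        with path.open(encoding="utf-8") as handle:
            total += sum(1 for line in handle if line.strip())
    return total
```

Calling `count_nonblank_loc("calc")` from a work-repo root would give one run's LOC figure under these assumptions.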

Per-variant total tokens (successful runs)

| variant | runs (ok) | median | mean | min | max | spread |
|---|---:|---:|---:|---:|---:|---:|
| builder-adversary | 5/5 | 13,037,683 | 12,919,657 | 11,117,474 | 14,960,414 | 1.35x |
| builder-adversary-min | 5/6 | 9,772,581 | 9,917,040 | 9,135,718 | 11,386,415 | 1.25x |
| builder-adversary-stateless | 5/5 | 10,122,375 | 10,735,401 | 9,992,834 | 13,009,792 | 1.30x |
| builder-adversary-lean | 5/6 | 13,409,349 | 13,216,582 | 12,101,355 | 13,815,595 | 1.14x |
| builder-solo | 5/5 | 2,773,634 | 2,744,840 | 2,417,528 | 2,948,467 | 1.22x |
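As a worked check of one row, the sketch below recomputes the builder-solo summary from that variant's five raw totals. The helper name `summarize` is hypothetical, and "spread" is assumed to mean max / min, which is the reading that reproduces the tabulated values.

```python
from statistics import mean, median


def summarize(totals):
    """Summarize one variant's token totals (successful runs only)."""
    ordered = sorted(totals)
    return {
        "median": median(ordered),
        "mean": round(mean(ordered)),
        "min": ordered[0],
        "max": ordered[-1],
        # Assumption: spread is the max/min ratio, rounded to 2 decimals.
        "spread": round(ordered[-1] / ordered[0], 2),
    }


# Raw builder-solo totals, copied from the "All runs (raw)" table.
solo_totals = [2_417_528, 2_747_457, 2_837_115, 2_773_634, 2_948_467]
print(summarize(solo_totals))
```

The same helper applied to any other variant's successful totals should reproduce its row, provided failed runs are filtered out first.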

Efficiency ratios — min / median / max (successful runs)

tokens/sec excludes runs flagged LIMIT (a usage-limit pause inflates duration without adding tokens, so it would understate the true rate); tokens/LOC and tokens/commit are unaffected and include all successful runs.

| variant | tokens / LOC | tokens / sec | tokens / commit |
|---|---|---|---|
| builder-adversary | 23,857 / 30,391 / 32,665 | 8,083 / 11,670 / 13,852 | 793,540 / 935,026 / 1,044,586 |
| builder-adversary-min | 23,527 / 25,252 / 32,814 | 8,173 / 14,807 / 15,814 | 582,292 / 669,789 / 712,415 |
| builder-adversary-stateless | 21,802 / 27,799 / 30,861 | 10,544 / 11,620 / 12,755 | 697,172 / 765,282 / 778,644 |
| builder-adversary-lean | 28,077 / 33,238 / 38,966 | 12,416 / 13,523 / 14,403 | 432,191 / 478,905 / 575,650 |
| builder-solo | 6,029 / 6,611 / 6,969 | 6,542 / 6,715 / 7,020 | 392,494 / 483,506 / 737,117 |
| all | 6,029 / 28,077 / 38,966 | 6,542 / 11,712 / 15,814 | 392,494 / 697,172 / 1,044,586 |
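The filtering rule for these ratios can be sketched as below, using the six builder-adversary-lean runs from the raw table. The `Run` dataclass and the function names are hypothetical; the sketch assumes each per-run ratio is rounded to a whole number before min / median / max are taken.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    ok: bool          # run reached SEQUENCE-COMPLETE
    limit: bool       # run hit a usage-limit pause
    tokens: int
    duration_s: int
    commits: int
    loc: int


def min_median_max(values):
    ordered = sorted(values)
    return ordered[0], round(median(ordered)), ordered[-1]


def efficiency(runs):
    ok_runs = [r for r in runs if r.ok]
    # tokens/sec additionally drops LIMIT runs, whose duration is inflated.
    timed = [r for r in ok_runs if not r.limit]
    return {
        "tokens_per_loc": min_median_max(round(r.tokens / r.loc) for r in ok_runs),
        "tokens_per_sec": min_median_max(round(r.tokens / r.duration_s) for r in timed),
        "tokens_per_commit": min_median_max(round(r.tokens / r.commits) for r in ok_runs),
    }


# builder-adversary-lean rows, copied from the "All runs (raw)" table.
lean_runs = [
    Run(True, False, 12_962_701, 900, 28, 390),
    Run(True, False, 13_409_349, 1080, 28, 451),
    Run(False, True, 6_518_422, 1800, 11, 259),
    Run(True, False, 13_815_595, 960, 24, 378),
    Run(True, False, 12_101_355, 960, 28, 431),
    Run(True, False, 13_793_914, 1020, 25, 354),
]
print(efficiency(lean_runs))
```

In this campaign the only LIMIT run also failed, so the extra tokens/sec filter removes nothing beyond what the success filter already removes.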

Per-variant medians (commits / LOC / duration)

| variant | median commits | median LOC | median dur (s) |
|---|---:|---:|---:|
| builder-adversary | 14 | 449 | 1020 |
| builder-adversary-min | 15 | 367 | 720 |
| builder-adversary-stateless | 14 | 400 | 900 |
| builder-adversary-lean | 28 | 390 | 960 |
| builder-solo | 5 | 426 | 420 |

Correlations with total tokens (pooled, n=25)

| tokens vs | Pearson r |
|---|---:|
| duration | +0.83 |
| commits | +0.79 |
| LOC | -0.04 |
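For readers who want to recompute these coefficients, here is a minimal sketch of a pooled Pearson r over the 25 successful runs. It assumes the standard sample correlation formula; the report does not state how its values were produced, so the printed numbers are meant to be compared against the table, not taken as its source. `pearson_r` and `RUNS` are hypothetical names.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (tokens, duration_s, commits, LOC) for each successful run,
# copied from the "All runs (raw)" table.
RUNS = [
    (11_117_474, 960, 14, 466), (13_579_616, 1680, 13, 449),
    (14_960_414, 1080, 16, 458), (13_037_683, 1020, 13, 429),
    (11_903_098, 1020, 15, 381),
    (9_135_718, 780, 15, 367), (11_386_415, 720, 17, 347),
    (9_316_676, 1140, 16, 396), (9_973_813, 660, 14, 347),
    (9_772_581, 660, 14, 387),
    (10_457_577, 900, 15, 400), (9_992_834, 840, 13, 341),
    (10_094_430, 900, 14, 463), (10_122_375, 960, 13, 328),
    (13_009_792, 1020, 17, 468),
    (12_962_701, 900, 28, 390), (13_409_349, 1080, 28, 451),
    (13_815_595, 960, 24, 378), (12_101_355, 960, 28, 431),
    (13_793_914, 1020, 25, 354),
    (2_417_528, 360, 5, 401), (2_747_457, 420, 7, 450),
    (2_837_115, 420, 6, 426), (2_773_634, 420, 5, 398),
    (2_948_467, 420, 4, 446),
]

tokens = [run[0] for run in RUNS]
for label, column in (("duration", 1), ("commits", 2), ("LOC", 3)):
    values = [run[column] for run in RUNS]
    print(f"{label}: {pearson_r(tokens, values):+.2f}")
```

Pooling mixes the low-token builder-solo runs with the much larger loop-pair runs, so a strong pooled correlation can reflect differences between variants as much as a trend within any one of them.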

All runs (raw)

| variant | rep | ok | limit | total tokens | dur (s) | commits | LOC | tok/LOC | tok/sec | tok/commit |
|---|---:|---|---|---:|---:|---:|---:|---:|---:|---:|
| builder-adversary | 1 | YES | | 11,117,474 | 960 | 14 | 466 | 23,857 | 11,581 | 794,105 |
| builder-adversary | 2 | YES | | 13,579,616 | 1680 | 13 | 449 | 30,244 | 8,083 | 1,044,586 |
| builder-adversary | 3 | YES | | 14,960,414 | 1080 | 16 | 458 | 32,665 | 13,852 | 935,026 |
| builder-adversary | 4 | YES | | 13,037,683 | 1020 | 13 | 429 | 30,391 | 12,782 | 1,002,899 |
| builder-adversary | 5 | YES | | 11,903,098 | 1020 | 15 | 381 | 31,242 | 11,670 | 793,540 |
| builder-adversary-min | 1 | YES | | 9,135,718 | 780 | 15 | 367 | 24,893 | 11,712 | 609,048 |
| builder-adversary-min | 2 | YES | | 11,386,415 | 720 | 17 | 347 | 32,814 | 15,814 | 669,789 |
| builder-adversary-min | 3 | YES | | 9,316,676 | 1140 | 16 | 396 | 23,527 | 8,173 | 582,292 |
| builder-adversary-min | 4 | YES | | 9,973,813 | 660 | 14 | 347 | 28,743 | 15,112 | 712,415 |
| builder-adversary-min | 5 | NO | | 3,693,171 | 1800 | 4 | 128 | 28,853 | 2,052 | 923,293 |
| builder-adversary-min | 1 | YES | | 9,772,581 | 660 | 14 | 387 | 25,252 | 14,807 | 698,042 |
| builder-adversary-stateless | 1 | YES | | 10,457,577 | 900 | 15 | 400 | 26,144 | 11,620 | 697,172 |
| builder-adversary-stateless | 2 | YES | | 9,992,834 | 840 | 13 | 341 | 29,304 | 11,896 | 768,680 |
| builder-adversary-stateless | 3 | YES | | 10,094,430 | 900 | 14 | 463 | 21,802 | 11,216 | 721,031 |
| builder-adversary-stateless | 4 | YES | | 10,122,375 | 960 | 13 | 328 | 30,861 | 10,544 | 778,644 |
| builder-adversary-stateless | 5 | YES | | 13,009,792 | 1020 | 17 | 468 | 27,799 | 12,755 | 765,282 |
| builder-adversary-lean | 1 | YES | | 12,962,701 | 900 | 28 | 390 | 33,238 | 14,403 | 462,954 |
| builder-adversary-lean | 2 | YES | | 13,409,349 | 1080 | 28 | 451 | 29,732 | 12,416 | 478,905 |
| builder-adversary-lean | 3 | NO | LIMIT | 6,518,422 | 1800 | 11 | 259 | 25,168 | 3,621 | 592,584 |
| builder-adversary-lean | 4 | YES | | 13,815,595 | 960 | 24 | 378 | 36,549 | 14,391 | 575,650 |
| builder-adversary-lean | 5 | YES | | 12,101,355 | 960 | 28 | 431 | 28,077 | 12,606 | 432,191 |
| builder-adversary-lean | 1 | YES | | 13,793,914 | 1020 | 25 | 354 | 38,966 | 13,523 | 551,757 |
| builder-solo | 1 | YES | | 2,417,528 | 360 | 5 | 401 | 6,029 | 6,715 | 483,506 |
| builder-solo | 2 | YES | | 2,747,457 | 420 | 7 | 450 | 6,105 | 6,542 | 392,494 |
| builder-solo | 3 | YES | | 2,837,115 | 420 | 6 | 426 | 6,660 | 6,755 | 472,852 |
| builder-solo | 4 | YES | | 2,773,634 | 420 | 5 | 398 | 6,969 | 6,604 | 554,727 |
| builder-solo | 5 | YES | | 2,948,467 | 420 | 4 | 446 | 6,611 | 7,020 | 737,117 |

Stats over successful runs. LIMIT = the run hit a usage-limit pause (duration/tok-sec distorted, token total fine). Repos kept under the run root for analysis.