results: 4-variant campaign complete (5/5 each); analysis with ratios
RESULTS-campaign.md
# Full-harness benchmark — campaign analysis

Each run drives a real `agents.py up` Builder/Adversary loop pair plus watchdog through the 3-phase calculator to SEQUENCE-COMPLETE. Both loops run on Sonnet. Tokens are summed from each loop's session transcript; commits = work-repo commit count; LOC = non-blank `calc/*.py` lines (code + tests). 20 of 22 runs succeeded.
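
The LOC metric above (non-blank `calc/*.py` lines) can be sketched as follows. This is a minimal illustration, not the harness's actual counter: the function name `count_nonblank_loc` is hypothetical, and it assumes "non-blank" means a line with at least one non-whitespace character, so comment-only lines still count.

```python
from pathlib import Path


def count_nonblank_loc(directory: str, pattern: str = "*.py") -> int:
    """Count lines with at least one non-whitespace character.

    Hypothetical helper: the real harness may count differently
    (for example, it might recurse into subdirectories).
    """
    total = 0
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            # line.strip() is empty for blank or whitespace-only lines
            total += sum(1 for line in handle if line.strip())
    return total
```

Calling `count_nonblank_loc("calc")` from a work-repo root would give a figure comparable to the LOC column, under the assumptions stated above.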

## Per-variant total tokens (successful runs)

| variant | runs (ok/total) | median | mean | min | max | spread (max/min) |
|---|:--:|--:|--:|--:|--:|--:|
| builder-adversary | 5/5 | 13,037,683 | 12,919,657 | 11,117,474 | 14,960,414 | 1.35x |
| builder-adversary-min | 5/6 | 9,772,581 | 9,917,040 | 9,135,718 | 11,386,415 | 1.25x |
| builder-adversary-stateless | 5/5 | 10,122,375 | 10,735,401 | 9,992,834 | 13,009,792 | 1.30x |
| builder-adversary-lean | 5/6 | 13,409,349 | 13,216,582 | 12,101,355 | 13,815,595 | 1.14x |
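
As a cross-check on how one row of this table can be derived, the sketch below recomputes the `builder-adversary` summary from its five successful totals in the raw-runs table. The helper name `summarize` is hypothetical, and treating spread as max divided by min is an assumption that happens to reproduce the reported 1.35x.

```python
from statistics import mean, median

# Total tokens for the five successful builder-adversary runs (raw-runs table).
totals = [11_117_474, 13_579_616, 14_960_414, 13_037_683, 11_903_098]


def summarize(values):
    """Return median, mean, min, max and spread (assumed to be max / min)."""
    lo, hi = min(values), max(values)
    return {
        "median": median(values),
        "mean": round(mean(values)),
        "min": lo,
        "max": hi,
        "spread": round(hi / lo, 2),
    }


summary = summarize(totals)
print(summary)
```

The result matches the table row: median 13,037,683, mean 12,919,657, spread 1.35.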

## Efficiency ratios — min / median / max (successful runs)

tokens/sec excludes runs flagged `LIMIT` (a usage-limit pause inflates duration without adding tokens, so it would understate the true rate); tokens/LOC and tokens/commit are unaffected and include all successful runs.

| variant | tokens / LOC | tokens / sec | tokens / commit |
|---|--:|--:|--:|
| builder-adversary | 23,857 / 30,391 / 32,665 | 8,083 / 11,670 / 13,852 | 793,540 / 935,026 / 1,044,586 |
| builder-adversary-min | 23,527 / 25,252 / 32,814 | 8,173 / 14,807 / 15,814 | 582,292 / 669,789 / 712,415 |
| builder-adversary-stateless | 21,802 / 27,799 / 30,861 | 10,544 / 11,620 / 12,755 | 697,172 / 765,282 / 778,644 |
| builder-adversary-lean | 28,077 / 33,238 / 38,966 | 12,416 / 13,523 / 14,403 | 432,191 / 478,905 / 575,650 |
| **all** | 21,802 / 29,518 / 38,966 | 8,083 / 12,511 / 15,814 | 432,191 / 705,228 / 1,044,586 |
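
The ratio arithmetic and the `LIMIT` exclusion can be illustrated with the five successful `builder-adversary-stateless` runs from the raw-runs table. This is a sketch only: the `Run` dataclass and its field names are hypothetical, and rounding each ratio to the nearest integer is an assumption that reproduces the reported figures.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    total_tokens: int
    duration_s: int
    commits: int
    loc: int
    limit: bool = False  # True when the run hit a usage-limit pause


def min_median_max(values):
    return min(values), median(values), max(values)


# Successful builder-adversary-stateless runs (values from the raw-runs table).
runs = [
    Run(10_457_577, 900, 15, 400),
    Run(9_992_834, 840, 13, 341),
    Run(10_094_430, 900, 14, 463),
    Run(10_122_375, 960, 13, 328),
    Run(13_009_792, 1020, 17, 468),
]

tok_per_loc = min_median_max([round(r.total_tokens / r.loc) for r in runs])
tok_per_commit = min_median_max([round(r.total_tokens / r.commits) for r in runs])
# Only tokens/sec drops LIMIT-flagged runs, since a pause distorts duration.
tok_per_sec = min_median_max(
    [round(r.total_tokens / r.duration_s) for r in runs if not r.limit]
)

print(tok_per_loc, tok_per_sec, tok_per_commit)
```

None of these five runs is flagged `LIMIT`, so the filter is a no-op here; it is shown only to make the exclusion rule explicit.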

## Per-variant medians (commits / LOC / duration)

| variant | median commits | median LOC | median duration (s) |
|---|--:|--:|--:|
| builder-adversary | 14 | 449 | 1020 |
| builder-adversary-min | 15 | 367 | 720 |
| builder-adversary-stateless | 14 | 400 | 900 |
| builder-adversary-lean | 28 | 390 | 960 |

## Correlations with total tokens (pooled, n=20)

| tokens vs | Pearson r |
|---|--:|
| duration | +0.51 |
| commits | +0.50 |
| LOC | +0.41 |
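
For reference, a pooled Pearson coefficient like the ones above can be computed as sketched below. The `pearson_r` helper is hypothetical and written out by hand rather than taken from the harness; the data are the (total tokens, duration) pairs of the 20 successful runs in the raw-runs table, which is assumed to be the pooling used for the duration row.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (total tokens, duration in seconds) for the 20 successful runs.
pairs = [
    (11_117_474, 960), (13_579_616, 1680), (14_960_414, 1080),
    (13_037_683, 1020), (11_903_098, 1020),
    (9_135_718, 780), (11_386_415, 720), (9_316_676, 1140),
    (9_973_813, 660), (9_772_581, 660),
    (10_457_577, 900), (9_992_834, 840), (10_094_430, 900),
    (10_122_375, 960), (13_009_792, 1020),
    (12_962_701, 900), (13_409_349, 1080), (13_815_595, 960),
    (12_101_355, 960), (13_793_914, 1020),
]

tokens = [t for t, _ in pairs]
durations = [d for _, d in pairs]
r_duration = pearson_r(tokens, durations)
print(f"tokens vs duration: r = {r_duration:+.2f}")
```

With n=20, coefficients around 0.4 to 0.5 indicate only a moderate linear association, so these values are best read as descriptive rather than predictive.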

## All runs (raw)

| variant | rep | ok | limit | total | dur (s) | commits | LOC | tok/LOC | tok/sec | tok/commit |
|---|:--:|:--:|:--:|--:|--:|--:|--:|--:|--:|--:|
| builder-adversary | 1 | YES | | 11,117,474 | 960 | 14 | 466 | 23,857 | 11,581 | 794,105 |
| builder-adversary | 2 | YES | | 13,579,616 | 1680 | 13 | 449 | 30,244 | 8,083 | 1,044,586 |
| builder-adversary | 3 | YES | | 14,960,414 | 1080 | 16 | 458 | 32,665 | 13,852 | 935,026 |
| builder-adversary | 4 | YES | | 13,037,683 | 1020 | 13 | 429 | 30,391 | 12,782 | 1,002,899 |
| builder-adversary | 5 | YES | | 11,903,098 | 1020 | 15 | 381 | 31,242 | 11,670 | 793,540 |
| builder-adversary-min | 1 | YES | | 9,135,718 | 780 | 15 | 367 | 24,893 | 11,712 | 609,048 |
| builder-adversary-min | 2 | YES | | 11,386,415 | 720 | 17 | 347 | 32,814 | 15,814 | 669,789 |
| builder-adversary-min | 3 | YES | | 9,316,676 | 1140 | 16 | 396 | 23,527 | 8,173 | 582,292 |
| builder-adversary-min | 4 | YES | | 9,973,813 | 660 | 14 | 347 | 28,743 | 15,112 | 712,415 |
| builder-adversary-min | 5 | NO | | 3,693,171 | 1800 | 4 | 128 | 28,853 | 2,052 | 923,293 |
| builder-adversary-min | 1 | YES | | 9,772,581 | 660 | 14 | 387 | 25,252 | 14,807 | 698,042 |
| builder-adversary-stateless | 1 | YES | | 10,457,577 | 900 | 15 | 400 | 26,144 | 11,620 | 697,172 |
| builder-adversary-stateless | 2 | YES | | 9,992,834 | 840 | 13 | 341 | 29,304 | 11,896 | 768,680 |
| builder-adversary-stateless | 3 | YES | | 10,094,430 | 900 | 14 | 463 | 21,802 | 11,216 | 721,031 |
| builder-adversary-stateless | 4 | YES | | 10,122,375 | 960 | 13 | 328 | 30,861 | 10,544 | 778,644 |
| builder-adversary-stateless | 5 | YES | | 13,009,792 | 1020 | 17 | 468 | 27,799 | 12,755 | 765,282 |
| builder-adversary-lean | 1 | YES | | 12,962,701 | 900 | 28 | 390 | 33,238 | 14,403 | 462,954 |
| builder-adversary-lean | 2 | YES | | 13,409,349 | 1080 | 28 | 451 | 29,732 | 12,416 | 478,905 |
| builder-adversary-lean | 3 | NO | LIMIT | 6,518,422 | 1800 | 11 | 259 | 25,168 | 3,621 | 592,584 |
| builder-adversary-lean | 4 | YES | | 13,815,595 | 960 | 24 | 378 | 36,549 | 14,391 | 575,650 |
| builder-adversary-lean | 5 | YES | | 12,101,355 | 960 | 28 | 431 | 28,077 | 12,606 | 432,191 |
| builder-adversary-lean | 1 | YES | | 13,793,914 | 1020 | 25 | 354 | 38,966 | 13,523 | 551,757 |

_Stats over successful runs. `LIMIT` = the run hit a usage-limit pause (duration/tok-sec distorted, token total fine). Repos kept under the run root for analysis._
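
To recompute any of the statistics from the raw table, the rows first need to be parsed and filtered to successful runs. The sketch below shows one way to do that; the field names in `HEADER` and the `parse_row` helper are hypothetical, chosen only to mirror the table's columns.

```python
# Hypothetical field names mirroring the raw table's columns.
HEADER = ["variant", "rep", "ok", "limit", "total", "dur_s",
          "commits", "loc", "tok_loc", "tok_sec", "tok_commit"]
INT_FIELDS = ["rep", "total", "dur_s", "commits", "loc",
              "tok_loc", "tok_sec", "tok_commit"]


def parse_row(line):
    """Parse one Markdown table row into a dict with typed values."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != len(HEADER):
        raise ValueError(f"expected {len(HEADER)} cells, got {len(cells)}")
    row = dict(zip(HEADER, cells))
    for field in INT_FIELDS:
        row[field] = int(row[field].replace(",", ""))  # drop thousands separators
    row["ok"] = row["ok"] == "YES"
    row["limit"] = row["limit"] == "LIMIT"
    return row


sample = """
| builder-adversary-lean | 2 | YES | | 13,409,349 | 1080 | 28 | 451 | 29,732 | 12,416 | 478,905 |
| builder-adversary-lean | 3 | NO | LIMIT | 6,518,422 | 1800 | 11 | 259 | 25,168 | 3,621 | 592,584 |
| builder-adversary-lean | 4 | YES | | 13,815,595 | 960 | 24 | 378 | 36,549 | 14,391 | 575,650 |
"""

rows = [parse_row(line) for line in sample.splitlines() if line.strip()]
successful = [row for row in rows if row["ok"]]
print(len(rows), len(successful))
```

Filtering on `ok` first and on `limit` only for the tokens/sec ratio keeps the selection consistent with the note above.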