agent-orchestrator-benchmark

recipe-maintainers/agent-orchestrator-benchmark

Commit Graph

Author	SHA1	Message	Date

mfowler

fc0608ede1

feat: builder-solo control runner (run after campaign) + limit-detect for it

run-solo-bench.sh runs the builder-solo variant (single builder, self-verify,
no adversary) 5× on the same calculator and appends rows to the shared campaign
data file (adversary col = 0). Separate script so the live campaign runner is
untouched. analyze.py limit-detection now also covers the solo run layout.
Engine example builder-solo committed at a0f7652; benchmark engine to be
re-pinned to it before running solo (after the main campaign completes).

2026-06-15 02:36:58 +00:00
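
The appending step described in this commit could look roughly like the sketch below. It is written in Python for illustration only; the real run-solo-bench.sh is a shell script and is not shown. The column names, the tab-separated layout, and the helper `append_solo_row` are all assumptions, since the actual format of the shared campaign data file does not appear in the log. The only detail taken from the commit message is that solo rows carry an adversary value of 0.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical column order: the real layout of the campaign data file is not shown.
COLUMNS = ["run", "variant", "adversary", "tokens"]


def append_solo_row(data_file: Path, run: str, tokens: int) -> None:
    """Append one builder-solo row; the adversary column is 0 (no adversary)."""
    row = {"run": run, "variant": "builder-solo", "adversary": 0, "tokens": tokens}
    with data_file.open("a", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        writer.writerow(row)


# Demo against a throwaway file so existing campaign data is never touched.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "campaign.data"
    data_file.write_text("r1\torig\t1\t1000\n")  # made-up pre-existing campaign row
    for i in range(1, 6):  # five solo runs, as in the commit message
        append_solo_row(data_file, f"solo{i}", 500 * i)  # made-up token counts
    rows = data_file.read_text().splitlines()
```

Opening the file in append mode leaves the earlier campaign rows untouched, which matches the commit's stated goal of not disturbing the live runner.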

mfowler

25a77f5d3c

fix: flag usage-limit-affected runs; correct tok/sec

A run that hits a usage-limit pause has inflated duration (idle wait) but an
accurate token total. analyze.py now scans each run's watchdog log for 'limit
hit', flags it LIMIT in the raw table, and excludes it from the tokens/sec stat
(token total, tok/LOC, tok/commit unaffected). Caught because campaign run r2
hit the limit ~00:40 and recovered at the 00:50 reset — watchdog handled it.

2026-06-15 01:29:54 +00:00

mfowler

33eeb3ce6b

feat: analyze.py — efficiency ratios (tokens/LOC, tokens/sec, tokens/commit)

Standalone analysis over RESULTS-campaign.md.data (safe: independent of the live
runner). Adds the normalised efficiency ratios per run with min/median/max per
variant, alongside the token distributions, commit/LOC medians, correlations,
and full raw table. Run: python3 analyze.py (regenerates RESULTS-campaign.md).

Orig baseline (5 runs): tokens/LOC ~25k–34k, tokens/sec ~11.3k–14.0k.

2026-06-15 00:15:46 +00:00
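
The efficiency ratios named in this commit can be sketched as follows. This is an illustrative reconstruction, not the analyze.py source: the record fields (`variant`, `tokens`, `loc`, `seconds`, `commits`) and function names are assumptions, and the sample numbers are invented for the demo rather than taken from the campaign. The sketch also assumes every run has nonzero LOC, duration, and commit count.

```python
from collections import defaultdict
from statistics import median


def efficiency_ratios(run: dict) -> dict:
    """Normalise a run's token total by LOC, wall-clock seconds, and commits."""
    return {
        "tok_per_loc": run["tokens"] / run["loc"],
        "tok_per_sec": run["tokens"] / run["seconds"],
        "tok_per_commit": run["tokens"] / run["commits"],
    }


def summarize(runs: list) -> dict:
    """Return {variant: {ratio_name: (min, median, max)}}."""
    by_variant = defaultdict(list)
    for run in runs:
        by_variant[run["variant"]].append(efficiency_ratios(run))

    summary = {}
    for variant, ratio_rows in by_variant.items():
        summary[variant] = {}
        for name in ("tok_per_loc", "tok_per_sec", "tok_per_commit"):
            values = [row[name] for row in ratio_rows]
            summary[variant][name] = (min(values), median(values), max(values))
    return summary


# Invented sample data, chosen only so the arithmetic is easy to follow.
runs = [
    {"variant": "demo", "tokens": 1200, "loc": 40, "seconds": 60, "commits": 4},
    {"variant": "demo", "tokens": 2400, "loc": 60, "seconds": 80, "commits": 6},
    {"variant": "demo", "tokens": 600, "loc": 30, "seconds": 60, "commits": 3},
]
summary = summarize(runs)
```

With a small number of runs per variant, min/median/max shows the full spread without assuming anything about the distribution.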

3 Commits