agent-orchestrator-benchmark

recipe-maintainers/agent-orchestrator-benchmark

Commit Graph

Author

SHA1

Message

Date

mfowler

b46dca003c

results: 4-way + the variance finding (N=1 is not enough)

Combined run 1 (orig/min/stateless) + run 2 (lean/stateless). Key result: the
SAME stateless variant used 5.96M tokens in run 1 and 9.27M in run 2 (about 55%
more): nondeterministic iteration count dominates every between-variant gap. So:
- prose minimization ~-6% (small, same-invocation)
- lean (full per-gate review) ~= stateless (batched): full review is ~free
- the earlier "-45% from context hygiene" is NOT reproducible — mostly noise
Honest conclusion: need >=5 runs/variant to resolve the context-hygiene effect;
log_tokens now makes that easy to collect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-14 22:06:21 +00:00
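
The comparison behind this commit's conclusion can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not part of the repository. It uses the token totals quoted in the two commit messages; the run 2 stateless total appears only rounded ("9.27M"), so that value is approximate. The names `RUN1`, `RUN2_STATELESS_APPROX`, and `pct_change` are hypothetical.

```python
# Token totals quoted in the commit messages (run 1 is exact; the run 2
# stateless figure is only reported rounded, so it is approximate here).
RUN1 = {"orig": 10_756_363, "min": 10_119_225, "stateless": 5_964_829}
RUN2_STATELESS_APPROX = 9_270_000


def pct_change(baseline: int, value: int) -> float:
    """Relative change of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Gap between variants within run 1, each measured against orig.
between_variant = {
    name: pct_change(RUN1["orig"], tokens)
    for name, tokens in RUN1.items()
    if name != "orig"
}

# Gap for the SAME variant (stateless) between run 1 and run 2.
same_variant = pct_change(RUN1["stateless"], RUN2_STATELESS_APPROX)

for name, delta in between_variant.items():
    print(f"{name} vs orig (run 1): {delta:+.1f}%")
print(f"stateless run 2 vs run 1: {same_variant:+.1f}%")
```

With a single run per variant, the run-to-run change for one variant is larger in magnitude than the largest between-variant gap, which is why the message asks for at least 5 runs per variant before drawing a conclusion.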

mfowler

0fa3d726a5

results: full-harness 3-way (orig/min/stateless) on the calculator

All three completed the 3-phase calculator to SEQUENCE-COMPLETE with full
adversary verification (calc correct, tests OK). Tokens:
  orig       10,756,363
  min        10,119,225  (-5.9%)
  stateless   5,964,829  (-44.5%)
Prompt-prose minimization barely moved tokens; context hygiene (stateless)
nearly halved them, driven by ~48% lower cache_read. Quality held.

Also fix the runner's success check: it grepped the word "veto" and matched
"No veto" → false failures; now matches the "## VETO" marker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-14 21:35:47 +00:00
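
The success-check bug described in this commit is easy to reproduce. The following is a minimal sketch, not the runner's actual code: the function names and sample outputs are hypothetical, and it assumes the "## VETO" marker appears at the start of a line.

```python
import re

# Assumption: the veto marker is a heading at the start of its own line.
VETO_MARKER = re.compile(r"^## VETO\b", re.MULTILINE)


def naive_has_veto(output: str) -> bool:
    """Old behavior as described: any occurrence of the word 'veto'."""
    return "veto" in output


def marker_has_veto(output: str) -> bool:
    """Fixed behavior as described: only the explicit '## VETO' marker counts."""
    return VETO_MARKER.search(output) is not None


# Hypothetical adversary outputs for illustration.
passing = "Review complete. No veto.\nSEQUENCE-COMPLETE"
vetoed = "## VETO\nTests fail in phase 2."

print(naive_has_veto(passing))   # substring match flags a passing run
print(marker_has_veto(passing))  # marker match does not
print(marker_has_veto(vetoed))   # a real veto is still detected
```

Matching a structured marker rather than a bare word keeps incidental prose such as "No veto" from being counted as a failure.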

2 Commits