calculators/ — the artifacts the benchmark built

In every benchmark run, a Builder/Adversary loop pair (or a solo Builder) built a Python calculator to the spec in ../plans/calc/. This folder preserves the calculators they actually produced: the 5 canonical successful runs for each of the 6 variants, which are the N=5 samples the analysis is based on. The wedged, limit, and superseded runs are not included. That makes 30 calculators in all.

Layout

calculators/<variant>/run-NN/
  calc.py                  the CLI entry point
  calc/                    lexer.py, parser.py, evaluator.py + test_*.py (the built calculator)
  machine-docs/            the loop's coordination artifacts for this run:
                             STATUS-<phase>.md   (Builder's claims: WHAT/HOW/EXPECTED/WHERE)
                             REVIEW-<phase>.md   (Adversary's verdicts + findings)
                             JOURNAL-<phase>.md  (Builder's reasoning — kept out of STATUS)
                             BACKLOG/DECISIONS.md
  GIT-LOG.txt              the run's commit history — the claim()/review() handshake
  SOURCE.txt               the original /tmp run path

<variant> is one of the six: builder-adversary, builder-adversary-min, builder-adversary-stateless, builder-adversary-lean, builder-adversary-deferred, builder-solo.
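
The layout above can be checked mechanically. The following is a minimal sketch, not part of the benchmark tooling: the helper names `find_runs` and `missing_entries` are hypothetical, and it assumes it is run from inside calculators/ and that each run directory holds exactly the top-level entries listed in the layout.

```python
from pathlib import Path

# The six variant names listed in this README.
VARIANTS = [
    "builder-adversary",
    "builder-adversary-min",
    "builder-adversary-stateless",
    "builder-adversary-lean",
    "builder-adversary-deferred",
    "builder-solo",
]

# Top-level entries the layout says every run directory contains.
EXPECTED_ENTRIES = ["calc.py", "calc", "machine-docs", "GIT-LOG.txt", "SOURCE.txt"]


def find_runs(root: Path) -> list[Path]:
    """Return every <variant>/run-NN directory under root, in variant order."""
    runs: list[Path] = []
    for variant in VARIANTS:
        candidates = sorted((root / variant).glob("run-*"))
        runs.extend(p for p in candidates if p.is_dir())
    return runs


def missing_entries(run_dir: Path) -> list[str]:
    """List the expected files or directories absent from one run directory."""
    return [name for name in EXPECTED_ENTRIES if not (run_dir / name).exists()]


if __name__ == "__main__":
    # Assumption: the current working directory is calculators/.
    runs = find_runs(Path("."))
    print(f"{len(runs)} runs found")
    for run in runs:
        gaps = missing_entries(run)
        if gaps:
            print(f"{run}: missing {', '.join(gaps)}")
```

If the folder matches this README, the script should report 30 runs and print no missing entries.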

These are working-tree snapshots, not nested git repos, since nested repos would confuse the parent repo. The commit history that shows how each calculator was built, meaning the per-gate or per-phase claim()/review() exchange, is captured in each GIT-LOG.txt. Compare, say, a builder-adversary-lean log (per-gate, ~28 commits) against a builder-adversary-deferred log (one comprehensive review at the end) to see the cadence difference in action.
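
One rough way to compare cadence is to count handshake commits per log. This sketch assumes, without confirmation from the logs themselves, that each commit occupies one line of GIT-LOG.txt and that its subject contains the literal text `claim(` or `review(`. The function name `handshake_counts` is hypothetical.

```python
import re
import sys
from collections import Counter
from pathlib import Path

# Assumption: handshake commits are recognizable by "claim(" or "review("
# appearing as a word in the commit line. Adjust if the logs differ.
HANDSHAKE = re.compile(r"\b(claim|review)\(")


def handshake_counts(log_text: str) -> Counter:
    """Count log lines that mention a claim( or review( commit."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = HANDSHAKE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_handshake.py <variant>/run-NN/GIT-LOG.txt ...
    for arg in sys.argv[1:]:
        counts = handshake_counts(Path(arg).read_text(encoding="utf-8"))
        print(f"{arg}: claim={counts['claim']} review={counts['review']}")
```

Running it on one -lean log and one -deferred log should make the per-gate versus end-of-run review pattern visible as a difference in the review count.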

What they're good for

  • Inspect the deliverable each variant produced. All 30 are verified to be behaviorally identical, but code and test style and volume vary; for example, -min runs have leaner test suites.
  • Read the actual review exchange in machine-docs/REVIEW-*.md + GIT-LOG.txt — the Adversary's cold verdicts, findings, and the Builder's STATUS hand-offs.

Run any of them:

cd calculators/builder-adversary/run-01
python -m unittest -q        # tests pass
python calc.py "2+3*4"       # 14
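
To repeat the calc.py command above across every snapshot and see whether the outputs agree, something like the following could be used. It is a sketch under assumptions: it is run from inside calculators/, each calc.py prints its result to stdout, and the helper names `run_calc` and `outputs_by_run` are hypothetical. It spot-checks a single expression and is not the verification the benchmark itself performed.

```python
import subprocess
import sys
from pathlib import Path


def run_calc(run_dir: Path, expression: str) -> str:
    """Run one snapshot's calc.py on an expression; return stripped stdout."""
    result = subprocess.run(
        [sys.executable, "calc.py", expression],
        cwd=run_dir,          # run inside the snapshot so its calc/ package resolves
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout.strip()


def outputs_by_run(root: Path, expression: str) -> dict[str, str]:
    """Map each <variant>/run-NN path (relative to root) to its output."""
    runs = sorted(p.parent for p in root.glob("*/run-*/calc.py"))
    return {run.relative_to(root).as_posix(): run_calc(run, expression) for run in runs}


if __name__ == "__main__":
    # Assumption: the current working directory is calculators/.
    results = outputs_by_run(Path("."), "2+3*4")
    distinct = sorted(set(results.values()))
    print(f"{len(results)} runs, {len(distinct)} distinct output(s): {distinct}")
```

A single distinct output across all runs is consistent with the snapshots being behaviorally identical for that expression; more than one points at a run worth inspecting.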

See ../FINDINGS.md for what the benchmark concluded and ../RESULTS-campaign.md for the per-run numbers.