# calculators/ — the artifacts the benchmark built
Every benchmark run had a Builder/Adversary loop pair (or a solo Builder) build a Python calculator
to the spec in [`../plans/calc/`](../plans/calc/). This folder preserves the **actual calculators
they produced**: the 5 canonical successful runs per variant (the N=5 the analysis is based on; the
wedged/limit/superseded runs are not included). With 6 variants, that is 30 calculators in all.
## Layout
```
calculators/<variant>/run-NN/
  calc.py                  the CLI entry point
  calc/                    lexer.py, parser.py, evaluator.py + test_*.py (the built calculator)
  machine-docs/            the loop's coordination artifacts for this run:
    STATUS-<phase>.md      (Builder's claims: WHAT/HOW/EXPECTED/WHERE)
    REVIEW-<phase>.md      (Adversary's verdicts + findings)
    JOURNAL-<phase>.md     (Builder's reasoning, kept out of STATUS)
    BACKLOG/DECISIONS.md
  GIT-LOG.txt              the run's commit history: the claim()/review() handshake
  SOURCE.txt               the original /tmp run path
```
`<variant>` is one of the six: `builder-adversary`, `builder-adversary-min`,
`builder-adversary-stateless`, `builder-adversary-lean`, `builder-adversary-deferred`, `builder-solo`.

These are **working-tree snapshots** (not nested git repos, which would confuse the parent repo). The
commit history that shows *how* each was built — the per-gate/per-phase `claim(`/`review(` exchange —
is captured in each `GIT-LOG.txt`. Compare, say, a `builder-adversary-lean` log (per-gate, ~28
commits) against a `builder-adversary-deferred` log (one comprehensive review at the end) to see the
cadence difference in action.
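
To put a rough number on that cadence difference, the handshake markers in a log can be counted. The following is a minimal sketch, not part of the benchmark tooling: `count_handshake` is a hypothetical helper, and the sample text is invented for illustration. It assumes only that the `claim(` and `review(` markers appear literally in the log lines; the real `GIT-LOG.txt` formatting may differ.

```python
from collections import Counter


def count_handshake(log_text: str) -> Counter:
    """Count log lines that mention each handshake marker.

    Assumption: the markers `claim(` and `review(` appear literally in the
    commit subjects recorded in the log text.
    """
    counts = Counter()
    for line in log_text.splitlines():
        for marker in ("claim(", "review("):
            if marker in line:
                counts[marker] += 1
    return counts


# Invented excerpt for illustration only; not copied from any real run.
sample = """\
a1b2c3d claim(lexer): tokenizer ready
d4e5f6a review(lexer): verdict recorded
0718293 claim(parser): precedence handled
"""

counts = count_handshake(sample)
print(f"claims={counts['claim(']} reviews={counts['review(']}")
```

Feeding it the text of a `builder-adversary-lean` log and a `builder-adversary-deferred` log should make the per-gate versus end-of-run review pattern visible as a difference in the `review(` count.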
## What they're good for
- **Inspect the deliverable** each variant produced (all behaviorally identical — verified — but the
code/test style and volume vary; e.g. `-min` runs have leaner test suites).
- **Read the actual review exchange** in `machine-docs/REVIEW-*.md` + `GIT-LOG.txt` — the Adversary's
cold verdicts, findings, and the Builder's STATUS hand-offs.

Run any of them:
```bash
cd calculators/builder-adversary/run-01
python -m unittest -q # tests pass
python calc.py "2+3*4" # 14
```
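
The same check can be repeated across every snapshot to spot-check that the variants agree on a given expression. This is a sketch under stated assumptions: `run_calc` and `compare_runs` are hypothetical helpers (not shipped with the repo), each `calc.py` is assumed to print its result to stdout and exit with status 0, and the script is assumed to be started from the directory that contains `calculators/`.

```python
import subprocess
import sys
from pathlib import Path


def run_calc(run_dir: Path, expression: str) -> str:
    """Run one snapshot's calc.py on an expression; return its trimmed stdout."""
    result = subprocess.run(
        [sys.executable, "calc.py", expression],
        cwd=run_dir,          # mirror `cd <run dir>` from the shell example
        capture_output=True,
        text=True,
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()


def compare_runs(root: Path, expression: str) -> dict:
    """Map '<variant>/run-NN' to calc.py's output for every run under root."""
    outputs = {}
    for calc in sorted(root.glob("*/run-*/calc.py")):
        run_dir = calc.parent
        key = f"{run_dir.parent.name}/{run_dir.name}"
        outputs[key] = run_calc(run_dir, expression)
    return outputs


if __name__ == "__main__":
    results = compare_runs(Path("calculators"), "2+3*4")
    for name, output in results.items():
        print(f"{name}: {output}")
    if len(set(results.values())) > 1:
        print("runs disagree on this expression")
```

A single expression is only a smoke test; it complements the per-run `unittest` suites but is no substitute for them.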
See [`../FINDINGS.md`](../FINDINGS.md) for what the benchmark concluded and
[`../RESULTS-campaign.md`](../RESULTS-campaign.md) for the per-run numbers.