docs: finalize deferred at N=5 (median 12.89M, ~tied with orig)
FINDINGS.md | 10 +++++-----
@@ -7,7 +7,7 @@ loop, on a fixed, well-specified task.
 cold-verifiable Definition-of-Done gates. Deliberately offline and deterministic so it stresses the
 *protocol*, not infrastructure.
 - **How:** each variant run autonomously to `SEQUENCE-COMPLETE` via the real harness (`agents.py up`
-— Builder + Adversary loop pair + watchdog), **5 runs each** (N=5; `deferred` N=3, finalizing).
+— Builder + Adversary loop pair + watchdog), **5 runs each** (N=5).
 Both loops on **claude-sonnet-4-6**. Tokens summed from each loop's Claude Code session
 transcripts. The deliverable is behaviorally identical across all variants (verified on a
 24-expression probe), so this compares like-for-like.
@@ -28,7 +28,7 @@ loop, on a fixed, well-specified task.
 (stateless/lean/deferred are all built on the *full original* prompts, so they isolate their one
 change without the minimal-prompt confound.)
 
-## Headline results — median tokens (5 runs each; deferred N=3)
+## Headline results — median tokens (5 runs each)
 
 | variant | adversary verifies… | median tokens | vs orig | commits | LOC |
 |---|---|--:|--:|--:|--:|
@@ -36,7 +36,7 @@ change without the minimal-prompt confound.)
 | **min** | per phase *(minimal prompts)* | 9.77M | −25% | 15 | 367 |
 | **stateless** | per phase *(+context hygiene)* | 10.12M | −22% | 14 | 400 |
 | **orig** | per **phase** | 13.04M | — | 14 | 449 |
-| **deferred** | once, after **whole build** | 13.37M | +3% | 10 | 425 |
+| **deferred** | once, after **whole build** | 12.89M | −1% | 12 | 425 |
 | **lean** | per **gate** | 13.41M | +3% | 28 | 390 |
 
 ## The two big findings
@@ -116,5 +116,5 @@ process intensity.
 _Artifacts in this repo: `FINDINGS.md` (this summary), `RESULTS-campaign.md` (per-variant stats +
 ratios + full raw table), `RESULTS-campaign.md.data` (raw rows), `analyze.py` (regenerates the
 analysis), `run-harness-bench.sh` / `run-solo-bench.sh` (the runners), `plans/calc/` (the task), and
-the six `engine/examples/builder-adversary*` variants. `deferred` is N=3 here and is being finalized
-to N=5; its median is stable (spread 1.19×)._
+the six `engine/examples/builder-adversary*` variants. All variants are N=5 (two `deferred` and the
+`min`/`lean` wedge/limit failures excluded; see RESULTS for the raw rows)._
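The `vs orig` column in the headline table appears to be each variant's median token count relative to the `orig` median, rounded to a whole percent. A minimal sketch that reproduces the column from the medians shown in this diff follows. The `vs_orig` helper is hypothetical and is not taken from `analyze.py`, the rounding rule is an assumption, and only the rows visible in the hunk are included.

```python
# Median tokens (in millions) copied from the headline table rows visible in this diff.
MEDIANS_M = {
    "min": 9.77,
    "stateless": 10.12,
    "orig": 13.04,
    "deferred": 12.89,  # updated N=5 value; the previous N=3 value was 13.37
    "lean": 13.41,
}


def vs_orig(variant: str, medians: dict[str, float] = MEDIANS_M) -> int:
    """Percent change of a variant's median relative to orig.

    Rounding to the nearest whole percent is an assumed convention.
    """
    return round((medians[variant] / medians["orig"] - 1.0) * 100)


if __name__ == "__main__":
    for name in MEDIANS_M:
        print(f"{name:>9}: {vs_orig(name):+d}%")
```

Under these assumptions the updated `deferred` median of 12.89M works out to about 1% below `orig`, while the earlier 13.37M value works out to about 3% above it, consistent with the change from `+3%` to `−1%` in the table row.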