So that "stateless vs builder-adversary" and "lean vs stateless" isolate context hygiene / review granularity WITHOUT the confound of the minimal prompts' reduced testing pressure (which we found cuts ~25% of test methods). stateless = orig + context hygiene; lean = orig + context hygiene + per-gate review. min stays the pure minimal-prompt variant (isolates verbosity vs orig). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Builder/Adversary example: context-lean ("stateless") variant

Same pattern, same **AI-as-adversary** verification, and same gates as
[`../builder-adversary`](../builder-adversary/) and
[`../builder-adversary-min`](../builder-adversary-min/), but the role prompts add a **context
hygiene** discipline so that each loop carries and reloads as little conversation as possible.
Nothing changes about *what* the agents do or *how* they verify; the only difference is how much
context they drag from turn to turn.

## Why

In a long autonomous loop the dominant token cost is **cache-read**: every turn re-sends the
conversation so far. The unchanged prefix is billed as cache-read at ~10% of the input price, but
it is billed *every turn*, so cost ≈ context length × turns. The role prose is a rounding error
against that. The win comes from keeping the conversation short and not carrying it where it isn't
needed.

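To make the "cost ≈ context length × turns" estimate concrete, here is a toy model in Python. It is only a sketch: the linear growth per turn, the 2,000 tokens added per turn, the 100-turn loop, and the 1,000-token post-compaction summary are all illustrative assumptions, not measurements from this repo.

```python
from typing import Optional


def cache_read_tokens(
    tokens_per_turn: int,
    turns: int,
    compact_every: Optional[int] = None,
    summary_tokens: int = 0,
) -> int:
    """Total prefix tokens re-read over a loop, under a toy linear-growth model."""
    context = 0  # tokens currently held in the conversation
    total = 0    # cumulative tokens re-sent as the cached prefix
    for turn in range(1, turns + 1):
        total += context            # this turn re-sends the existing prefix
        context += tokens_per_turn  # then appends its own new tokens
        if compact_every and turn % compact_every == 0:
            # Assumption: compaction replaces the history with a short summary.
            context = summary_tokens
    return total


if __name__ == "__main__":
    uncompacted = cache_read_tokens(tokens_per_turn=2_000, turns=100)
    compacted = cache_read_tokens(
        tokens_per_turn=2_000, turns=100, compact_every=10, summary_tokens=1_000
    )
    print(f"no compaction:      {uncompacted:,} prefix tokens re-read")
    print(f"compact every 10:   {compacted:,} prefix tokens re-read")
```

Under these assumed numbers the uncompacted loop re-reads 9,900,000 prefix tokens and the compacted one 990,000. Without compaction the total grows quadratically with the number of turns, which is why shortening the conversation matters far more than trimming the role prose.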
This protocol already makes that safe: the **durable state is on disk** (git, the plan, and
STATUS/REVIEW/JOURNAL), so the conversation is disposable scratch. These prompts exploit that:

- **Compact at every checkpoint.** After each gate is committed (Builder) or each verdict is written
  (Adversary), run `/compact`. This is lossless here, because the agent reloads from git +
  STATUS/REVIEW.
- **Read diffs, not trees.** Use `git diff <last-sha>..HEAD` and read only the touched files; never
  re-read the whole repo.
- **Spill bulk to files.** Long build/test/verification output goes to a file; read back only the
  slice you need instead of dumping it into context.
- **Adversary loads only {plan, STATUS, diff}** per gate: full cold AI judgment, tiny footprint.

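The "read diffs, not trees" and "spill bulk to files" habits can be sketched as two small Python helpers. This is an illustration of the idea, not code from this repo: the function names `run_and_spill` and `touched_files` and the log file names are hypothetical, and the agents themselves may simply use shell redirection instead.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_spill(cmd, log_path, tail_lines=20):
    """Run cmd, write ALL of its output to log_path, return (exit code, last lines)."""
    log_path = Path(log_path)
    with log_path.open("w") as log:
        # stdout and stderr both go to the file, never into the caller's context.
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    lines = log_path.read_text().splitlines()
    return proc.returncode, lines[-tail_lines:]


def touched_files(last_sha, repo="."):
    """List files changed since last_sha instead of walking the whole tree."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{last_sha}..HEAD"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Stand-in for a noisy build: 1000 lines of output, of which we read back 3.
    noisy = [sys.executable, "-c", "for i in range(1000): print('line', i)"]
    log_file = Path(tempfile.mkdtemp()) / "build.log"
    code, tail = run_and_spill(noisy, log_file, tail_lines=3)
    print(code, tail)
```

The full output stays on disk for later inspection, while only the requested tail enters the conversation. `touched_files` needs to run inside a git checkout with a valid `last_sha`.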
## Config note

Run the loop agents **non-resumed**. This is the default in this `agents.toml`, where loop agents
don't set `resume = true`. Each time the watchdog restarts a loop (notably at every phase advance),
it then starts a *fresh* session rather than carrying the prior phase's whole conversation forward.
Shrinking within a phase is done by `/compact`, per the prompts described above.

> A natural future engine lever (not yet implemented) would be a watchdog policy that **recycles a
> loop's session after each checkpoint commit** (claim/review), giving fresh context *per gate*
> rather than per phase: the same idea, enforced by the harness instead of the prompt.

## Compared

The **`agent-orchestrator-benchmark`** repo runs this variant head-to-head against
`builder-adversary` and `builder-adversary-min` on the same multi-phase task (all on Sonnet) and
reports tokens per loop, to quantify how much the context discipline saves while keeping identical
gate outcomes.

```bash
python3 ../../agents.py status --config agents.toml
python3 ../../agents.py up     --config agents.toml   # needs `claude` on PATH
```

> **Prompt base:** these prompts are the **full original** `builder-adversary` prompts plus the additions above, NOT the minimal ones, so that comparing this variant to `builder-adversary` isolates its specific change (context hygiene / review granularity) without the minimal-prompt testing-pressure drop.