Files
mfowler c6c7ce8640 change: base stateless + lean on the FULL original prompts (not minimal)
So that "stateless vs builder-adversary" and "lean vs stateless" isolate context
hygiene / review granularity WITHOUT the confound of the minimal prompts' reduced
testing pressure (which we found cuts ~25% of test methods). stateless = orig +
context hygiene; lean = orig + context hygiene + per-gate review. min stays the
pure minimal-prompt variant (isolates verbosity vs orig).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 03:17:47 +00:00

52 lines
3.0 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Builder/Adversary example — context-lean ("stateless") variant
Same pattern, same **AI-as-adversary** verification, same gates as
[`../builder-adversary`](../builder-adversary/) and
[`../builder-adversary-min`](../builder-adversary-min/) — but the role prompts add a **context
hygiene** discipline so each loop carries and reloads as little conversation as possible. Nothing
about *what* the agents do or *how* they verify changes; only how much context they drag from turn to
turn.
## Why
In a long autonomous loop the dominant token cost is **cache-read**: every turn re-sends the
conversation so far (the unchanged prefix is billed as cache-read, ~10% of input price, but it's
billed *every turn*). So cost ≈ context length × turns. The role prose is a rounding error against
that. The win is keeping the conversation short and not carrying it where it isn't needed.
This protocol already makes that safe: the **durable state is on disk** (git + the plan +
STATUS/REVIEW/JOURNAL), so the conversation is disposable scratch. These prompts exploit that:
- **Compact at every checkpoint.** After each gate is committed (Builder) or each verdict is written
(Adversary), run `/compact` — lossless here, because the agent reloads from git + STATUS/REVIEW.
- **Read diffs, not trees.** `git diff <last-sha>..HEAD` and only the touched files — never re-read
the whole repo.
- **Spill bulk to files.** Long build/test/verification output goes to a file; read back only the
slice you need, instead of dumping it into context.
- **Adversary loads only {plan, STATUS, diff}** per gate — full cold AI judgment, tiny footprint.
## Config note
Run the loop agents **non-resumed** (the default in this `agents.toml` — loop agents don't set
`resume = true`), so each time the watchdog restarts a loop (notably at every phase advance) it
starts a *fresh* session rather than carrying the prior phase's whole conversation forward. The
in-phase shrinking is done by `/compact` per the prompts above.
> A natural future engine lever (not yet implemented) would be a watchdog policy that **recycles a
> loop's session after each checkpoint commit** (claim/review), giving fresh context *per gate*
> rather than per phase — the same idea, enforced by the harness instead of the prompt.
## Compared
The **`agent-orchestrator-benchmark`** repo runs this variant head-to-head against
`builder-adversary` and `builder-adversary-min` on the same multi-phase task (all on Sonnet),
reporting tokens per loop — to quantify how much the context discipline saves while keeping identical
gate outcomes.
```bash
python3 ../../agents.py status --config agents.toml
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
```
> **Prompt base:** these prompts are the **full original** `builder-adversary` prompts plus the additions above — NOT the minimal ones — so that comparing this variant to `builder-adversary` isolates its specific change (context hygiene / review granularity) without the minimal-prompt testing-pressure drop.