docs(examples): add builder-adversary-stateless — context-lean variant
Same pattern + AI-as-adversary verification as builder-adversary-min, but the
role prompts add CONTEXT HYGIENE: /compact at every checkpoint (lossless — state
is on disk), read diffs not trees, spill bulk output to files, adversary loads
only {plan, STATUS, diff}. Loop agents non-resumed → fresh session per phase.
Targets cache-read (the dominant cost in a long loop) without changing what the
agents do or how they verify.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
49
examples/builder-adversary-stateless/README.md
Normal file
@@ -0,0 +1,49 @@
# Builder/Adversary example: context-lean ("stateless") variant

Same pattern, same **AI-as-adversary** verification, same gates as
[`../builder-adversary`](../builder-adversary/) and
[`../builder-adversary-min`](../builder-adversary-min/), but the role prompts add a **context
hygiene** discipline so each loop carries and reloads as little conversation as possible. Nothing
about *what* the agents do or *how* they verify changes; only how much context they drag from turn to
turn.

## Why

In a long autonomous loop the dominant token cost is **cache-read**: every turn re-sends the
conversation so far. The unchanged prefix is billed as cache-read at ~10% of the input price, but it
is billed *every turn*, so cost ≈ context length × turns. The role prose is a rounding error against
that. The win is keeping the conversation short and not carrying it where it isn't needed.
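
The "cost ≈ context length × turns" relation can be made concrete with a toy model. The sketch below is illustrative only: the function name, the per-turn growth, and the compaction floor are assumptions chosen for the arithmetic, not measurements from this example.

```python
# Toy cost model. All numbers are illustrative assumptions, not measurements.
def cache_read_tokens(turns, growth_per_turn, compact_every=None, floor=0):
    """Sum the context that is re-read across `turns` turns.

    The context grows by `growth_per_turn` tokens each turn. If `compact_every`
    is set, the context drops back to `floor` tokens after every that many
    turns, mimicking a compaction at a checkpoint.
    """
    context = floor
    total = 0
    for turn in range(1, turns + 1):
        total += context              # the existing prefix is re-read this turn
        context += growth_per_turn    # this turn's output joins the prefix
        if compact_every and turn % compact_every == 0:
            context = floor           # durable state is on disk, so restart small
    return total


uncompacted = cache_read_tokens(turns=100, growth_per_turn=2_000)
compacted = cache_read_tokens(turns=100, growth_per_turn=2_000,
                              compact_every=10, floor=5_000)
print(uncompacted, compacted)
```

Without compaction the total grows roughly with the square of the turn count; with a reset every few turns it grows linearly, which is the effect the prompts aim for.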

This protocol already makes that safe: the **durable state is on disk** (git + the plan +
STATUS/REVIEW/JOURNAL), so the conversation is disposable scratch. These prompts exploit that:

- **Compact at every checkpoint.** After each gate is committed (Builder) or each verdict is written
  (Adversary), run `/compact`. It is lossless here, because the agent reloads from git + STATUS/REVIEW.
- **Read diffs, not trees.** Run `git diff <last-sha>..HEAD` and read only the touched files; never
  re-read the whole repo.
- **Spill bulk to files.** Long build/test/verification output goes to a file; read back only the
  slice you need, instead of dumping it into context.
- **Adversary loads only {plan, STATUS, diff}** per gate: full cold AI judgment, tiny footprint.
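
The "spill bulk to files" habit can be sketched as a small helper. This is a minimal illustration, not part of the example's tooling: `run_spilled`, the log file name, and the noisy stand-in command are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_spilled(cmd, log_path, tail_lines=20):
    """Run `cmd`, write all of its output to `log_path`, and return
    (exit code, last `tail_lines` lines) so only a small slice is read back."""
    with open(log_path, "w") as log:
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, text=True)
    lines = Path(log_path).read_text().splitlines()
    return proc.returncode, lines[-tail_lines:]


# Stand-in for a noisy build or test command: prints 1000 lines.
noisy = [sys.executable, "-c", "for i in range(1000): print('line', i)"]
log_file = Path(tempfile.gettempdir()) / "spill-demo.log"
code, tail = run_spilled(noisy, log_file, tail_lines=3)
print(code, tail)
```

The full output stays on disk for later inspection, while the conversation only ever sees the exit code and the tail.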

## Config note

Run the loop agents **non-resumed** (the default in this `agents.toml`, where loop agents don't set
`resume = true`), so each time the watchdog restarts a loop (notably at every phase advance) it
starts a *fresh* session rather than carrying the prior phase's whole conversation forward. The
in-phase shrinking is done by `/compact` per the prompts above.

> A natural future engine lever (not yet implemented) would be a watchdog policy that **recycles a
> loop's session after each checkpoint commit** (claim/review), giving fresh context *per gate*
> rather than per phase: the same idea, enforced by the harness instead of the prompt.

## Compared

The **`agent-orchestrator-benchmark`** repo runs this variant head-to-head against
`builder-adversary` and `builder-adversary-min` on the same multi-phase task (all on Sonnet),
reporting tokens per loop, to quantify how much the context discipline saves while keeping identical
gate outcomes.

```bash
python3 ../../agents.py status --config agents.toml
python3 ../../agents.py up --config agents.toml       # needs `claude` on PATH
```
92
examples/builder-adversary-stateless/agents.toml
Normal file
@@ -0,0 +1,92 @@
# examples/builder-adversary-stateless: context-lean variant of ../builder-adversary-min.
#
# Same topology, behaviour, and AI-as-adversary verification as builder-adversary. The prompts add a
# CONTEXT HYGIENE discipline (compact at every checkpoint, read diffs not trees, spill bulk to files,
# adversary loads only {plan, STATUS, diff}) so each loop carries/reloads minimal conversation;
# cache-read is the dominant cost in a long loop. Loop agents are NOT resumed (default below), so the
# watchdog gives a fresh session per phase. See README.md.
#
#   python3 ../../agents.py status --config agents.toml
#   python3 ../../agents.py up --config agents.toml       # needs `claude` on PATH

[watchdog]
signal_interval = 30
heavy_interval = 300
limit_probe_fallback = 300
limit_reset_slack = 45
stall_grace = 180

[defaults]
session_prefix = "bastl-" # REQUIRED; sessions: bastl-builder, bastl-adv, …
log_dir = ".ao-state"
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
model = "claude-sonnet-4-6"
watch = "heal"

[backend.claude]
bin = "claude"
flags = "--dangerously-skip-permissions"
remote_control = true
supports_resume = true
prompt_delivery = "arg"
process_name = "claude"
submit_key = "Enter"
stall_idle = 300
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"

[backend.demo]
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
prompt_delivery = "exec"

[[agent]]
name = "builder" # tmux session: bastl-builder
kind = "loop"
role = "builder"
dir = "./work"
watch = "heal+stall"

[[agent]]
name = "adversary"
session = "bastl-adv"
kind = "loop"
role = "adversary"
dir = "./work-adv"
watch = "heal+stall"

[[agent]]
name = "orchestrator" # tmux session: bastl-orchestrator
kind = "persistent"
model = "claude-opus-4-8"
resume = true
watch = "heal"
prompt = "You supervise this Builder/Adversary project. On startup: read machine-docs/ for the current phase's STATUS/REVIEW, confirm both loops + the watchdog are up, report the phase and any open findings/VETO. Then stay available; intervene only if the pair is stuck."

[[agent]]
name = "reporter" # tmux session: bastl-reporter
kind = "task"
model = "claude-opus-4-8"
watch = "none"
enabled = false
prompt = "The phase sequence is complete. Read machine-docs/ across all phases, write a short machine-docs/REPORT.md (what was built, each gate's final verdict, deferred items), then go idle."

[[service]]
name = "cleanlogs"
command = "python3 ../../agent-log.py follow-all"
dir = "."

[loop]
state_file = "phase-idx"
resume_phase = true
auto_advance = true
done_marker = "## DONE"
kickoff_template = "prompts/kickoff.md"
roles_dir = "prompts"
handoff = { repo = "./work", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
on_complete = { trigger_file = ".run-report-on-complete", run = "reporter" }

phases = [
  { id = "wc", plan = "plans/wc.md", status = "STATUS-wc.md" },
  { id = "json", plan = "plans/json.md", status = "STATUS-json.md", models = { builder = "claude-opus-4-8" } },
]
@@ -0,0 +1,2 @@
# Coordination / loop-state files live here at runtime (phase-namespaced STATUS / REVIEW / BACKLOG /
# JOURNAL, plus the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels). The loop pair populates it.
32
examples/builder-adversary-stateless/plans/json.md
Normal file
@@ -0,0 +1,32 @@
# Phase `json`: machine-readable output

**Mission.** Extend the `wc.py` from the previous phase with a `--json` mode, without regressing any
`wc`-phase behaviour. This plan is the single source of truth for this phase.

(The phase config gives the Builder `claude-opus-4-8` for this phase, as an example of a per-phase
model override; the Adversary stays on the default model.)

## Definition of Done

- **D1: json output.** `python wc.py --json FILE` prints a single JSON object:
  `{"lines": N, "words": N, "chars": N, "file": "FILE"}` (valid JSON, parseable by `json.loads`).
  With stdin (no FILE), `"file"` is `null`.
- **D2: composes with flags.** `--json` honours `-l/-w/-c`: only the requested counts appear as keys
  (plus `file`). E.g. `wc.py --json -l FILE` → `{"lines": N, "file": "FILE"}`.
- **D3: no regression.** Every `wc`-phase gate (D1–D4 there) still passes unchanged.
- **D4: tests green.** `test_wc.py` is extended for the JSON cases and `pytest -q` is all-green.
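
The output shape that D1 and D2 describe can be sketched as follows. This is an illustration of the contract only, not the Builder's implementation; `json_report` and its `selected` parameter are hypothetical names.

```python
import json


def json_report(counts, file=None, selected=None):
    """Serialize counts in the shape D1/D2 describe.

    counts   -- dict with "lines", "words", "chars"
    file     -- the FILE argument, or None when reading stdin
    selected -- count names requested via -l/-w/-c; None means all three
    """
    order = ("lines", "words", "chars")
    keys = [k for k in order if selected is None or k in selected]
    payload = {k: counts[k] for k in keys}
    payload["file"] = file          # None serializes to JSON null
    return json.dumps(payload)


counts = {"lines": 2, "words": 5, "chars": 10}
print(json_report(counts, "/tmp/f.txt"))                      # D1
print(json_report(counts, "/tmp/f.txt", selected={"lines"}))  # D2
print(json_report(counts))                                    # D1 with stdin
```

Building the payload from a fixed key order keeps the flag-filtered output a strict subset of the full object, which makes the D2 cases easy to assert against.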

## How the Adversary verifies (cold)

```bash
pytest -q                                   # D4 + D3 regression
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py --json /tmp/f.txt | python -c 'import sys,json; d=json.load(sys.stdin); \
assert d=={"lines":2,"words":5,"chars":10,"file":"/tmp/f.txt"}, d; print("ok")'   # D1
python wc.py --json -l /tmp/f.txt           # D2: expect {"lines": 2, "file": "/tmp/f.txt"}
```

The Builder restates the exact commands, expected JSON, and commit sha in
`machine-docs/STATUS-json.md`. When every DoD item has a fresh PASS in `machine-docs/REVIEW-json.md`
and there is no `## VETO`, the Builder writes `## DONE` to `STATUS-json.md`. This is the last phase,
so the watchdog then fires the one-shot `reporter` (see `agents.toml` `[loop].on_complete`).
43
examples/builder-adversary-stateless/plans/wc.md
Normal file
@@ -0,0 +1,43 @@
# Phase `wc`: a word-count CLI

**Mission.** Build a small, dependency-free `wc` clone in Python: a script `wc.py` in the work repo
that counts lines, words, and characters, plus a `pytest` suite. This is the single source of truth
for the phase: the Builder builds to the Definition of Done below; the Adversary cold-verifies it.

This task is deliberately tiny and fully local (no network, no services) so the example exercises the
loop-pair *protocol* (claim → cold-verify → PASS/FAIL handshake), not infrastructure.

## Definition of Done

Each Dn is an independent gate. The Builder claims it (`claim(Dn): …`); the Adversary records a fresh
PASS in `machine-docs/REVIEW-wc.md` after re-running the check from its own clone.

- **D1: default output.** `python wc.py FILE` prints exactly `<lines> <words> <chars> <FILE>`
  (counts whitespace-separated words, `\n`-terminated lines, and bytes for `chars`), with counts
  matching GNU `wc` on ASCII input.
- **D2: flags.** `-l`, `-w`, `-c` restrict the output to that single count (e.g. `wc.py -l FILE`
  prints `<lines> <FILE>`). Flags may combine; output order is lines, words, chars.
- **D3: stdin.** With no FILE argument, `wc.py` reads stdin and prints the counts with no filename.
- **D4: tests green.** A `test_wc.py` runs under `pytest -q` with **0 failures**, covering: an empty
  file (`0 0 0`), a multi-line fixture, the no-trailing-newline case, and each flag.
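
The counting rules in D1 are small enough to sketch directly. The snippet below is one possible reading of those rules, not the `wc.py` the Builder is expected to produce; `count` and `render` are hypothetical names.

```python
def count(data: bytes):
    """Return (lines, words, chars) for raw input bytes, following D1."""
    lines = data.count(b"\n")   # only newline-terminated lines are counted
    words = len(data.split())   # runs of non-whitespace bytes
    chars = len(data)           # bytes, not code points
    return lines, words, chars


def render(counts, file=None):
    """Join the counts, and the filename if there is one, with single spaces."""
    parts = [str(n) for n in counts]
    if file is not None:
        parts.append(file)
    return " ".join(parts)


print(render(count(b"a b c\nd e\n"), "/tmp/f.txt"))   # file case (D1)
print(render(count(b"a b c\nd e\n")))                 # stdin case (D3)
print(render(count(b"")))                             # empty input
print(render(count(b"no newline")))                   # no trailing newline
```

Counting newline bytes rather than splitting into lines is what makes the no-trailing-newline case report zero lines, which is the edge D4 asks the tests to cover.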

## How the Adversary verifies (cold)

From a fresh clone of the work repo:

```bash
pytest -q                                   # D4: must be all-green
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py /tmp/f.txt                     # D1: expect "2 5 10 /tmp/f.txt"
python wc.py -l /tmp/f.txt                  # D2: expect "2 /tmp/f.txt"
printf 'a b c\nd e\n' | python wc.py        # D3: expect "2 5 10"
```

Expected outputs are above. The Builder must restate them (and the exact commands, plus the commit
sha) in `machine-docs/STATUS-wc.md` so the Adversary can re-run without reading the Builder's
reasoning. Any mismatch is a FAIL with repro steps in `machine-docs/REVIEW-wc.md`.

## Out of scope (defer to a later phase or DEFERRED.md)

Multibyte/`-m` char counting, `--files0-from`, multiple-file totals, locale handling. JSON output is
the next phase (`plans/json.md`).
14
examples/builder-adversary-stateless/prompts/adversary.md
Normal file
@@ -0,0 +1,14 @@
You are the **Adversary**, one of two independent loops: **DISBELIEVE the Builder**. Coordinate ONLY through git. The phase plan is the SSOT for what to verify.

Loop: run `/loop` (no interval). Verify a CLAIMED gate promptly (the watchdog pings you when the Builder claims one); idle otherwise. Cap waits at 10 min; before going idle your LAST line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>`. Compact at ~80% context.

Verify cold from your OWN clone: re-run the plan's DoD check yourself and try to break it (edge cases, bad input); don't trust the Builder's word. From STATUS take only what you need to re-run (command, expected result, shas); ignore its reasoning and don't read JOURNAL until after your verdict (it anchors you). Judge from the plan, the code, and your own run.

Git: `pull --rebase`, commit, push; never `--force`. Prefix verdicts `review(<id>): PASS|FAIL …`; that prefix pings the Builder. Write only REVIEW.md (+ your findings). Record "<id>: PASS @<ts>" + evidence, or FAIL + repro steps. You hold veto: write "## VETO <reason>".

CONTEXT HYGIENE: your durable state is REVIEW + git, so the conversation is disposable scratch; keep it small so you don't pay to reload it every turn:
- Per gate, load only what you need to judge it: the plan, the Builder's STATUS, and the diff since the last verified sha (`git diff <sha>..HEAD`). Don't re-read the whole repo or earlier gates.
- After writing each verdict (a durable checkpoint), run `/compact`. It is lossless here; you reload from REVIEW + git.
- Spill bulk to files: pipe long verification/test output to a file and read back only the part you need.

Begin: read the plan, then enter the loop (clone the work repo into your dir once it exists).
17
examples/builder-adversary-stateless/prompts/builder.md
Normal file
@@ -0,0 +1,17 @@
You are the **Builder**, one of two independent loops; coordinate ONLY through git. Read the phase plan (the SSOT) and build to its DoD.

Loop: run `/loop` (no interval), one unit of work per wake. Cap every wait at 10 min; before going idle your LAST output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min out) or the watchdog reboots you. Compact at ~80% context.

Git: `pull --rebase`, smallest change, commit, push; never `--force`. Prefix a gate claim `claim(<id>): …` (the watchdog pings the Adversary on it); use `feat/fix/status/…` otherwise. Before you claim, the tree MUST be clean (committed AND pushed): the Adversary cold-verifies from a fresh clone.

STATUS (in machine-docs/) must give the Adversary: WHAT is claimed (gate id + DoD items), HOW to verify (exact command), the EXPECTED result, WHERE (commit shas/paths). Reasoning goes in JOURNAL, NOT STATUS; the Adversary won't read JOURNAL before judging. Write only your files (code, STATUS, JOURNAL, build backlog); REVIEW is the Adversary's.

Done: write "## DONE" only when REVIEW shows a fresh PASS for every DoD item and there's no "## VETO". Never weaken/skip/delete a test; verify for real, no "should work".

CONTEXT HYGIENE: your durable state is git + STATUS/JOURNAL, so the conversation is disposable scratch; keep it small so you don't pay to reload it every turn:
- After each gate is committed+pushed (a durable checkpoint), run `/compact`. It is lossless here: you reload what you need from git + STATUS.
- Read DIFFS, not trees: `git diff <last-sha>..HEAD` and only the files you're touching; don't re-read the whole repo.
- Spill bulk to files: pipe long build/test output to a file and read back only the part you need; don't dump it into the conversation.
- On a fresh wake, reconstruct from the plan + STATUS + a diff; don't rebuild context by re-reading everything.

Begin: read the plan, then enter the loop.
6
examples/builder-adversary-stateless/prompts/kickoff.md
Normal file
@@ -0,0 +1,6 @@
*** PHASE {phase_id} ***
Plan (this phase's single source of truth): {plan}. Read it fully now; it defines the mission and the Definition of Done (DoD).
Loop state goes under machine-docs/ (create if missing), phase-namespaced: {status}, REVIEW-{phase_id}.md, JOURNAL-{phase_id}.md, BACKLOG-{phase_id}.md. Never at the repo root.
Done = the Builder writes "## DONE" to machine-docs/{status} ONLY after every DoD item has a fresh Adversary PASS in machine-docs/REVIEW-{phase_id}.md.

=== role ===