Compare commits
12 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| 781db071dd | |||
| 90375f004e | |||
| c6c7ce8640 | |||
| a0f7652e9e | |||
| 924874aafa | |||
| e0425e6108 | |||
| 985d33dd51 | |||
| 737ef81066 | |||
| 11843f41a4 | |||
| e4453dcfdd | |||
| 7f237a522c | |||
| cdcece9a9a |
89
README.md
89
README.md
@ -16,7 +16,9 @@ agents.py the driver + watchdog (pure Python stdlib; needs python >=
|
||||
agent-log.py render claude JSONL transcripts into clean, greppable logs
|
||||
agents.example.toml a self-contained 2-agent example project
|
||||
prompts/ generic role + kickoff templates (builder / adversary / kickoff)
|
||||
examples/ runnable example projects — the Builder/Adversary variant family, snakepit, …
|
||||
smoke.sh bring the example up + tear it down in an isolated sandbox, then clean up
|
||||
tests/ the test suite — unit tests + isolated live backend smokes + a runner
|
||||
flake.nix/.lock a Nix devShell with the runtime deps (python311, tmux, git)
|
||||
```
|
||||
|
||||
@ -48,6 +50,42 @@ python3 agents.py --config agents.toml phase show # where the loop phase mach
|
||||
|
||||
---
|
||||
|
||||
## Examples
|
||||
|
||||
`examples/` holds runnable example projects — copy one, point `agents.py` at its `agents.toml`, and
|
||||
go. The headline set is a family of **Builder/Adversary** variants that build the *same* task but each
|
||||
differ in one dimension — useful both as templates and as a study of the pattern:
|
||||
|
||||
- **`builder-adversary`** — the canonical loop pair: a Builder that builds and an Adversary that
|
||||
cold-verifies every claim, coordinating only through git (`claim(`/`review(` commits + the watchdog
|
||||
handoff). **Start here.**
|
||||
- **`builder-adversary-min`** — the same pattern with the prompts compressed to minimal tokens.
|
||||
- **`builder-adversary-stateless`** — `builder-adversary` + **context hygiene** (compact at each
|
||||
checkpoint, read diffs not trees, lean loads) to minimise carried/reloaded context.
|
||||
- **`builder-adversary-lean`** — context hygiene + **per-gate** review (one claim/verdict per gate).
|
||||
- **`builder-adversary-deferred`** — the Adversary verifies **once**, after the whole build, in a
|
||||
final comprehensive `review` phase (vs per-phase / per-gate).
|
||||
- **`builder-solo`** — a single Builder that self-certifies, with **no Adversary** (the control).
|
||||
- **`snakepit`** — a different topology entirely: a pool of identical worker "snakes" pulling tasks
|
||||
from a shared filesystem queue, plus cleanup specialists. (`examples/IDEAS.md` sketches more.)
|
||||
|
||||
Each example has its own `README.md`. Run one by hand:
|
||||
|
||||
```bash
|
||||
cd examples/builder-adversary
|
||||
python3 ../../agents.py status --config agents.toml # read-only
|
||||
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
|
||||
```
|
||||
|
||||
**Benchmark.** The separate
|
||||
[`agent-orchestrator-benchmark`](https://git.autonomic.zone/recipe-maintainers/agent-orchestrator-benchmark)
|
||||
repo runs these Builder/Adversary variants head-to-head (N=5, real `agents.py up` runs) to measure
|
||||
what drives token cost. Short version: an independent adversary costs **~4.7×** a solo builder, but
|
||||
the review *cadence* (per-gate / per-phase / deferred) is **nearly token-neutral**, and **context
|
||||
hygiene** is the one clean **~−22%** win. See that repo's `FINDINGS.md`.
|
||||
|
||||
---
|
||||
|
||||
## The config: `agents.toml`
|
||||
|
||||
Five section types: `[watchdog]`, `[backend.<name>]`, `[defaults]`, `[[agent]]` / `[[service]]`,
|
||||
@ -62,6 +100,23 @@ heavy_interval = 300 # seconds between heal + phase-advance checks
|
||||
limit_probe_fallback = 300 # re-probe cadence for a usage-limited agent when reset time is unparsable
|
||||
limit_reset_slack = 45 # seconds to wait past a parsed reset before probing
|
||||
stall_grace = 180 # seconds of slack past a WAITING-UNTIL marker before a stall reboot
|
||||
log_tokens = false # opt-in: record per-phase token + time usage (see below)
|
||||
```
|
||||
|
||||
**Per-phase token + time logging (`log_tokens`).** Set `log_tokens = true` (under `[watchdog]` or
|
||||
`[loop]`) and the watchdog records, for **each phase**, how many tokens **each agent** used and how
|
||||
long the phase took — appended as one JSON object per phase to `<log_dir>/token-log.jsonl`. Tokens
|
||||
are summed from each agent's Claude Code session transcript and attributed **by working dir**, so
|
||||
give each agent its own `dir` (the Builder/Adversary loop pair already uses separate clones) for
|
||||
accurate per-agent numbers. The watchdog snapshots a baseline when a phase starts and writes the
|
||||
delta (per agent, and the total) when the phase advances or the sequence completes — robust across
|
||||
watchdog restarts. Pretty-print it with `agents.py tokens`:
|
||||
|
||||
```
|
||||
phase dur(s) builder adversary TOTAL
|
||||
-----------------------------------------------------
|
||||
lex 372.0 3,910,118 3,221,447 7,131,565
|
||||
parse 410.5 ...
|
||||
```
|
||||
|
||||
### `[defaults]` — inherited by every agent
|
||||
@ -239,6 +294,7 @@ agents.py status table of every agent: kind, backend, model, w
|
||||
agents.py watchdog the supervisor loop (what the <prefix>watchdog session runs)
|
||||
agents.py logs <name> tail that session's log
|
||||
agents.py phase [show|next|set N] inspect / move the loop phase index
|
||||
agents.py tokens per-phase token + time report (when [watchdog].log_tokens = true)
|
||||
agents.py selftest regression-test the backend activity detector (needs no config)
|
||||
agents.py init [dir] scaffold a starter agents.toml + prompts/ in a project dir
|
||||
--config PATH use a specific config (default: ./agents.toml)
|
||||
@ -315,6 +371,39 @@ documents this in its banner.
|
||||
|
||||
---
|
||||
|
||||
## Testing
|
||||
|
||||
The `tests/` directory holds the harness's own test suite. One runner drives everything:
|
||||
|
||||
```bash
|
||||
nix develop -c ./tests/run.sh # unit tests always; live backend smokes when available
|
||||
# or just: ./tests/run.sh # (python3 + tmux must be on PATH)
|
||||
```
|
||||
|
||||
What it runs:
|
||||
|
||||
- **Unit tests** (`tests/test_unit.py`) — pure logic, **no agents spawned, no live tmux sessions**.
|
||||
Cover config load + defaults merge, kickoff-template assembly, the phase machine (advance on the
|
||||
done marker, idempotent sequence-complete, append-a-phase resumes), usage-limit reset-banner
|
||||
parsing, `WAITING-UNTIL` / stall parsing, and the per-backend activity detectors (claude +
|
||||
opencode footers). Always run; a failure fails the suite. Run them alone with
|
||||
`python3 -m unittest discover -s tests` (or `python3 tests/test_unit.py`).
|
||||
- **Live backend smokes** (`tests/smoke_claude.sh`, `tests/smoke_opencode.sh`) — each brings a
|
||||
throwaway scratch project up **through `agents.py`** on a real backend, in a fully isolated
|
||||
sandbox (its own unique `session_prefix`, a temp `log_dir`, and — for opencode — a dedicated
|
||||
server on a non-default port `AOTEST_OC_PORT`, default `4097`), confirms the session attaches and
|
||||
`status` reports it RUNNING, then `down`s it and cleans up (no leftover sessions, port freed).
|
||||
Each **SKIPs gracefully** (exit 0) when its backend's binary or creds are unavailable. Useful env:
|
||||
`CLAUDE_BIN` / `OPENCODE_BIN`, `AOTEST_MODEL`, `AOTEST_OC_PORT`, `AOTEST_OC_CREDS`.
|
||||
- **Isolation sanity** — after the live runs, the runner asserts no `aotest-*` tmux sessions leaked
|
||||
and reports that any live sessions are untouched.
|
||||
|
||||
The smokes are safe by construction: a unique per-run session prefix (never `cc-ci-` or any real
|
||||
project's), a dedicated opencode port (never `4096`), and a cleanup trap that fires on success,
|
||||
failure, and Ctrl+C.
|
||||
|
||||
---
|
||||
|
||||
## Adding things
|
||||
|
||||
- **Add an agent** — add an `[[agent]]` block; `agents.py up <name>`. No code change.
|
||||
|
||||
131
agents.py
131
agents.py
@ -14,6 +14,7 @@ Usage:
|
||||
agents.py watchdog the supervisor loop (reads the config every tick)
|
||||
agents.py logs <name> tail an agent's session log
|
||||
agents.py phase [set N|next|show] inspect / move the loop phase
|
||||
agents.py tokens per-phase token + time report (needs [watchdog].log_tokens = true)
|
||||
agents.py selftest backend activity-detector regression checks (no config needed)
|
||||
agents.py init [dir] scaffold a starter agents.toml + prompts/ in a project dir
|
||||
|
||||
@ -663,6 +664,99 @@ def start_loops(cfg):
|
||||
for a in loop_agents(cfg):
|
||||
start_agent(cfg, a)
|
||||
|
||||
# ── optional per-phase token + time logging (log_tokens) ──────────────────────────
|
||||
# When [watchdog].log_tokens (or [loop].log_tokens) is true, the watchdog records, for each phase,
|
||||
# how many tokens each agent used and how long the phase took, appended to <log_dir>/token-log.jsonl.
|
||||
# Tokens are summed from each agent's Claude Code session transcript, attributed by working dir — so
|
||||
# give each agent its OWN dir for accurate per-agent numbers (the Builder/Adversary loop pair already
|
||||
# uses separate clones). View with: agents.py tokens.
|
||||
|
||||
def log_tokens_enabled(cfg):
|
||||
return bool(cfg.get("watchdog", {}).get("log_tokens") or cfg.get("loop", {}).get("log_tokens"))
|
||||
|
||||
def _transcript_dir(workdir):
|
||||
name = str(workdir).rstrip("/").replace("/", "-").replace(".", "-")
|
||||
return Path(os.path.expanduser("~/.claude/projects")) / name
|
||||
|
||||
def _sum_tokens(workdir):
|
||||
t = {"input": 0, "output": 0, "cache_create": 0, "cache_read": 0}
|
||||
d = _transcript_dir(workdir)
|
||||
if d.is_dir():
|
||||
for f in d.glob("*.jsonl"):
|
||||
try:
|
||||
for line in f.open(errors="ignore"):
|
||||
try:
|
||||
o = json.loads(line)
|
||||
except Exception:
|
||||
continue
|
||||
if o.get("type") == "assistant":
|
||||
u = (o.get("message", {}) or {}).get("usage", {}) or {}
|
||||
t["input"] += u.get("input_tokens", 0) or 0
|
||||
t["output"] += u.get("output_tokens", 0) or 0
|
||||
t["cache_create"] += u.get("cache_creation_input_tokens", 0) or 0
|
||||
t["cache_read"] += u.get("cache_read_input_tokens", 0) or 0
|
||||
except OSError:
|
||||
continue
|
||||
t["total"] = t["input"] + t["output"] + t["cache_create"] + t["cache_read"]
|
||||
return t
|
||||
|
||||
def _token_cumulative(cfg):
|
||||
"""Cumulative tokens per agent so far, summed from each agent's transcript dir."""
|
||||
return {a["name"]: _sum_tokens(a["dir"]) for a in cfg["agents"].values()}
|
||||
|
||||
_TOKEN_KEYS = ("input", "output", "cache_create", "cache_read", "total")
|
||||
def _token_state_path(cfg): return Path(cfg["state_dir"]) / "token-phase.json"
|
||||
def _token_log_path(cfg): return Path(cfg["log_dir"]) / "token-log.jsonl"
|
||||
def _tok_delta(cur, base): return {k: cur.get(k, 0) - base.get(k, 0) for k in _TOKEN_KEYS}
|
||||
|
||||
def token_phase_begin(cfg, phase_id):
|
||||
"""Set the baseline (cumulative tokens + start time) for the phase now starting. Idempotent
|
||||
across watchdog restarts: keeps the original baseline if already tracking this phase."""
|
||||
if not log_tokens_enabled(cfg):
|
||||
return
|
||||
sf = _token_state_path(cfg)
|
||||
try:
|
||||
if json.loads(sf.read_text()).get("phase_id") == phase_id:
|
||||
return
|
||||
except Exception:
|
||||
pass
|
||||
sf.write_text(json.dumps({"phase_id": phase_id,
|
||||
"started": datetime.now().isoformat(timespec="seconds"),
|
||||
"baseline": _token_cumulative(cfg)}))
|
||||
|
||||
def token_phase_flush(cfg, next_phase_id):
|
||||
"""Close the current phase: append its per-agent + total token deltas and duration to the
|
||||
token-log, then re-baseline for next_phase_id (or finalize tracking if None)."""
|
||||
if not log_tokens_enabled(cfg):
|
||||
return
|
||||
sf = _token_state_path(cfg)
|
||||
try:
|
||||
st = json.loads(sf.read_text())
|
||||
except Exception:
|
||||
return
|
||||
cur = _token_cumulative(cfg)
|
||||
base = st.get("baseline", {})
|
||||
started = st.get("started")
|
||||
try:
|
||||
dur = round((datetime.now() - datetime.fromisoformat(started)).total_seconds(), 1)
|
||||
except Exception:
|
||||
dur = None
|
||||
per_agent = {n: _tok_delta(cur.get(n, {}), base.get(n, {})) for n in cur}
|
||||
total = {k: sum(per_agent[n][k] for n in per_agent) for k in _TOKEN_KEYS}
|
||||
rec = {"phase_id": st.get("phase_id"), "started": started,
|
||||
"ended": datetime.now().isoformat(timespec="seconds"), "duration_s": dur,
|
||||
"agents": per_agent, "total": total}
|
||||
with _token_log_path(cfg).open("a") as fh:
|
||||
fh.write(json.dumps(rec) + "\n")
|
||||
parts = ", ".join(f"{n}={per_agent[n]['total']:,}" for n in per_agent)
|
||||
log(f"[log_tokens] phase {rec['phase_id']}: {total['total']:,} tok in {dur}s ({parts})")
|
||||
if next_phase_id is not None:
|
||||
sf.write_text(json.dumps({"phase_id": next_phase_id,
|
||||
"started": datetime.now().isoformat(timespec="seconds"),
|
||||
"baseline": cur}))
|
||||
else:
|
||||
sf.unlink(missing_ok=True)
|
||||
|
||||
def phase_advance_check(cfg):
|
||||
"""On heavy tick: if the current phase is DONE, advance (or finish the sequence).
|
||||
|
||||
@ -681,6 +775,7 @@ def phase_advance_check(cfg):
|
||||
nxt = idx + 1
|
||||
if nxt < len(ps):
|
||||
log(f"PHASE {ph['id']} DONE — auto-transitioning to {ps[nxt]['id']}")
|
||||
token_phase_flush(cfg, ps[nxt]["id"])
|
||||
stop_loops(cfg)
|
||||
Path(phase_idx_file(cfg)).write_text(str(nxt))
|
||||
if marker.exists():
|
||||
@ -692,6 +787,7 @@ def phase_advance_check(cfg):
|
||||
if marker.exists():
|
||||
return False # already handled — idempotent (no re-log, no re-stop)
|
||||
log(f"PHASE SEQUENCE COMPLETE (last phase {ph['id']} DONE) — stopping loops")
|
||||
token_phase_flush(cfg, None)
|
||||
stop_loops(cfg)
|
||||
ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
|
||||
marker.write_text(f"phase sequence complete {ts}. Loops stopped; build finished.\n")
|
||||
@ -721,6 +817,9 @@ def watchdog_loop(cfg_path):
|
||||
f"signal={sig}s heavy={heavy}s, watching: {[a['name'] for a in watched(cfg)]}")
|
||||
elapsed = heavy # force a heavy check on first tick
|
||||
wake_elapsed = {a["name"]: 0 for a in cfg["agents"].values() if a.get("wake")}
|
||||
if log_tokens_enabled(cfg):
|
||||
token_phase_begin(cfg, cur_phase(cfg).get("id"))
|
||||
log("[log_tokens] enabled — per-phase token + time logging to token-log.jsonl")
|
||||
while True:
|
||||
cfg = load_config(cfg_path) # re-read every tick: config is authoritative, no env drift
|
||||
has_loops = bool(loop_agents(cfg))
|
||||
@ -840,6 +939,37 @@ def cmd_phase(cfg, args):
|
||||
Path(phase_idx_file(cfg)).write_text(str(int(args[1])))
|
||||
print(f"phase idx now {cur_idx(cfg)} ({cur_phase(cfg).get('id')})")
|
||||
|
||||
def cmd_tokens(cfg):
|
||||
"""Pretty-print <log_dir>/token-log.jsonl: per-phase tokens by agent + total + duration."""
|
||||
p = _token_log_path(cfg)
|
||||
if not p.exists():
|
||||
print(f"no token log at {p}\n(set [watchdog].log_tokens = true and run the loop)"); return
|
||||
recs = []
|
||||
for line in p.read_text().splitlines():
|
||||
try: recs.append(json.loads(line))
|
||||
except Exception: pass
|
||||
if not recs:
|
||||
print("token log is empty"); return
|
||||
names = []
|
||||
for r in recs:
|
||||
for n in r.get("agents", {}):
|
||||
if n not in names: names.append(n)
|
||||
w = max([7] + [len(n) for n in names])
|
||||
hdr = f"{'phase':<10} {'dur(s)':>8} " + " ".join(f"{n:>{w}}" for n in names) + f" {'TOTAL':>13}"
|
||||
print(hdr); print("-" * len(hdr))
|
||||
grand = {n: 0 for n in names}
|
||||
durtot = 0.0
|
||||
for r in recs:
|
||||
ag = r.get("agents", {})
|
||||
cells = " ".join(f"{ag.get(n,{}).get('total',0):>{w},}" for n in names)
|
||||
print(f"{str(r.get('phase_id')):<10} {str(r.get('duration_s')):>8} {cells} "
|
||||
f"{r.get('total',{}).get('total',0):>13,}")
|
||||
for n in names: grand[n] += ag.get(n,{}).get("total",0)
|
||||
durtot += r.get("duration_s") or 0
|
||||
print("-" * len(hdr))
|
||||
cells = " ".join(f"{grand[n]:>{w},}" for n in names)
|
||||
print(f"{'TOTAL':<10} {durtot:>8.0f} {cells} {sum(grand.values()):>13,}")
|
||||
|
||||
def cmd_selftest():
|
||||
"""Self-contained regression checks for the footer-UI activity detector. Needs no config."""
|
||||
backend = {
|
||||
@ -917,6 +1047,7 @@ def main():
|
||||
elif cmd == "status": cmd_status(cfg)
|
||||
elif cmd == "watchdog": watchdog_loop(cfg_path)
|
||||
elif cmd == "phase": cmd_phase(cfg, rest)
|
||||
elif cmd == "tokens": cmd_tokens(cfg)
|
||||
elif cmd == "logs":
|
||||
if not rest:
|
||||
die("usage: agents.py logs <name>")
|
||||
|
||||
93
examples/IDEAS.md
Normal file
93
examples/IDEAS.md
Normal file
@ -0,0 +1,93 @@
|
||||
# Example ideas — creative multi-agent topologies
|
||||
|
||||
A backlog of *example* projects for `examples/`, each chosen to teach a **different orchestration
|
||||
topology** on the same harness. Nothing here is implemented yet — these are sketches.
|
||||
|
||||
Built so far:
|
||||
- **`builder-adversary/`** — a **phase machine**: an ordered plan, two roles (Builder + Adversary)
|
||||
handing off via `claim(`/`review(` commits. (The cc-ci pattern.)
|
||||
- **`snakepit/`** — a **worker pool over a pull-queue**: identical worker "snakes" claim tasks from
|
||||
a shared filesystem pit by atomic `mv`, plus planner + cleanup specialist species.
|
||||
|
||||
Each idea below lists: the metaphor, the topology it teaches, the star harness primitive, and what
|
||||
makes it distinct from what we already have.
|
||||
|
||||
---
|
||||
|
||||
## Strong candidates
|
||||
|
||||
### 🐜 Anthill (stigmergy)
|
||||
Ants coordinate with *no direct messaging*: they lay pheromone trails, others follow the strong
|
||||
ones, and trails **evaporate** over time. Agents drop weighted "trail" files toward promising
|
||||
solutions/paths; a `[[service]]` slowly decays them.
|
||||
- **Teaches:** indirect coordination through a decaying shared environment (opposite of snakepit's
|
||||
explicit claim).
|
||||
- **Star primitive:** a background **service** as the evaporation clock; emergent routing with zero
|
||||
agent-to-agent chat.
|
||||
|
||||
### 🍳 The Line (kitchen brigade)
|
||||
A restaurant pass: prep → sauté → plating → **expo**. A ticket (order) flows station to station; the
|
||||
expo bounces a bad plate back down the line. Many tickets in flight at once.
|
||||
- **Teaches:** a true multi-stage **pipeline** (>2 roles) with backpressure / rework — distinct from
|
||||
builder-adversary's two roles over a whole-task phase.
|
||||
- **Star primitive:** chained `handoff` inboxes + per-station commit prefixes (`fire(`, `plate(`,
|
||||
`expo(`).
|
||||
|
||||
### 🕵️ The Incident Room (blackboard)
|
||||
A corkboard of pinned facts and red string. Specialist detectives (forensics, alibi, motive,
|
||||
witnesses) each watch the board and pin a new deduction *only when their preconditions appear*; a
|
||||
lead declares the case closed.
|
||||
- **Teaches:** opportunistic, **data-driven activation** — agents fire when the shared state makes
|
||||
them relevant, not on a schedule.
|
||||
- **Star primitive:** a shared blackboard file + watchdog pings on board changes; no fixed order.
|
||||
|
||||
### ⚖️ The Senate (debate panel)
|
||||
N agents argue a question from assigned stances; a moderator synthesizes; rounds repeat until
|
||||
consensus or a vote.
|
||||
- **Teaches:** structured **multi-round deliberation** with diverse "minds."
|
||||
- **Star primitive:** the **phase machine** where each phase = one debate round, plus **per-phase
|
||||
model overrides** to give each seat a genuinely different model; `on_complete` writes the verdict.
|
||||
|
||||
### 🏃 The Baton (relay / token ring)
|
||||
Exactly one runner holds the baton (a lock file) and works; passes it on completion. Drop the baton
|
||||
(crash) and the next runner picks it up.
|
||||
- **Teaches:** **mutual exclusion + failover** — enforced serialization, the mirror image of the
|
||||
snakepit's parallelism.
|
||||
- **Star primitive:** `watch = "heal"` + the watchdog reaping a dead holder so the baton never gets
|
||||
stuck.
|
||||
|
||||
### 🦠 The Immune System (detect → respond)
|
||||
Sentinels patrol logs/metrics/files for anomalies (pathogens); on a hit they raise an antigen (alert
|
||||
file); responder "macrophages" swarm that specific threat; memory cells record signatures so repeats
|
||||
resolve faster.
|
||||
- **Teaches:** an **event-driven monitoring/reactive** topology with escalation.
|
||||
- **Star primitive:** a watcher **service** emitting alerts + reactive agents woken by inbox pings.
|
||||
- **Bonus:** genuinely *useful* — a self-healing "watch my repo/CI" tool wearing a fun costume.
|
||||
|
||||
### 🧬 The Evolution Chamber (genetic algorithm)
|
||||
A population of candidate solutions; breeder agents mutate/crossbreed; a selector culls by fitness;
|
||||
generations advance until fitness plateaus.
|
||||
- **Teaches:** **population-based iterative search** with a fitness gate.
|
||||
- **Star primitive:** phase machine where each phase = one generation; `done_marker` trips when
|
||||
fitness stops improving.
|
||||
|
||||
---
|
||||
|
||||
## Quick extras (less fleshed out)
|
||||
|
||||
- **🗼 Air Traffic Control** — many workers contend for *one* scarce runway (a single deploy/build
|
||||
slot); a controller grants timed landing slots. Teaches centralized **scarce-resource
|
||||
arbitration** (snakepit has plentiful work; here the *resource* is the bottleneck).
|
||||
- **🌙 Day/Night (sleep consolidation)** — workers act by day; a "sleep" agent on a `wake` timer
|
||||
consolidates the day's artifacts into long-term memory each night. Teaches **scheduled batch
|
||||
consolidation** (the "memory builder / coprophagy" idea as its own example).
|
||||
|
||||
---
|
||||
|
||||
## Suggested next trio
|
||||
|
||||
If picking three that cover the most new ground: **The Line** (pipeline), **The Incident Room**
|
||||
(blackboard), and **The Immune System** (reactive monitoring — and actually useful).
|
||||
|
||||
Each should follow the snakepit shape: a README with the metaphor→compute mapping, an `agents.toml`,
|
||||
role prompts, and a tiny runnable task.
|
||||
48
examples/builder-adversary-deferred/README.md
Normal file
48
examples/builder-adversary-deferred/README.md
Normal file
@ -0,0 +1,48 @@
|
||||
# Builder/Adversary example — deferred review (verify after a long segment)
|
||||
|
||||
The coarsest point on the **review-cadence spectrum**. Same pattern, same full original prompts as
|
||||
`../builder-adversary` — only *when* the Adversary verifies changes:
|
||||
|
||||
| variant | the Adversary verifies… | handshakes (calculator task) |
|
||||
|---|---|--:|
|
||||
| `builder-adversary-lean` | per **gate** | ~12 claim/verify round-trips |
|
||||
| `builder-adversary` (orig) | per **phase** | ~3 |
|
||||
| **`builder-adversary-deferred`** | **once, after the whole build** | **1** |
|
||||
|
||||
## How it works
|
||||
|
||||
The Builder **self-certifies** the build phases (`wc`, then `json`) — builds to each phase's DoD, runs
|
||||
its own tests until green, writes `## DONE`, and advances *without* waiting for the Adversary. The
|
||||
Adversary stays out of the build. Only in the final **`review` phase** does it do **one comprehensive
|
||||
cold-verification of the entire accumulated calculator** (`plans/review.md`): re-run every DoD item
|
||||
from every phase from a fresh clone, plus cross-feature break-it probes, file all findings at once,
|
||||
re-verify after fixes, then PASS. That single pass is the only adversary gate in the run.
|
||||
|
||||
## The trade-off
|
||||
|
||||
- **Cheapest coordination.** One handshake instead of 3–12 — no per-gate/per-phase round-trips, the
|
||||
Builder isn't interrupted mid-build. (The benchmark showed coordination round-trips are a real
|
||||
token cost; deferring to one pass minimises them.)
|
||||
- **But the independent check arrives late.** Two risks the per-gate/per-phase cadences guard
|
||||
against:
|
||||
- **Late discovery / rework.** If the Builder built phase 2 on a wrong assumption from phase 1, an
|
||||
early adversary would have caught it at gate 1; here it surfaces only at the end, after more work
|
||||
was piled on the flaw — potentially a larger, costlier fix.
|
||||
- **Self-certification drift.** The build phases are self-certified, so a bug the Builder
|
||||
rubber-stamps survives until the final review. The comprehensive pass is the only safety net, so
|
||||
it must be thorough.
|
||||
- **Better at cross-feature bugs.** Because it verifies the whole system at once, it's positioned to
|
||||
catch *interactions* (e.g. `--json` × every flag) that a per-gate view, looking at one item at a
|
||||
time, can miss.
|
||||
|
||||
So `deferred` trades *early, incremental* assurance for *minimal coordination + one holistic pass*.
|
||||
It suits work where features are independent and cheap to fix late; it's risky where early decisions
|
||||
constrain later ones.
|
||||
|
||||
```bash
|
||||
python3 ../../agents.py status --config agents.toml
|
||||
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
|
||||
```
|
||||
|
||||
> **Prompt base:** the full original `builder-adversary` prompts + a DEFERRED REVIEW CADENCE override
|
||||
> — so comparing this to `builder-adversary`/`lean` isolates *only* the verification cadence.
|
||||
79
examples/builder-adversary-deferred/agents.toml
Normal file
79
examples/builder-adversary-deferred/agents.toml
Normal file
@ -0,0 +1,79 @@
|
||||
# examples/builder-adversary-deferred — Adversary verifies ONCE, after a long segment of building.
|
||||
#
|
||||
# Same pattern + full original prompts as ../builder-adversary, but the REVIEW CADENCE is coarsest:
|
||||
# • lean = the Adversary verifies per gate (finest)
|
||||
# • orig = the Adversary verifies per phase (medium)
|
||||
# • deferred = the Adversary verifies ONCE, comprehensively, after the whole build (coarsest)
|
||||
# The Builder SELF-CERTIFIES the build phases (wc, json) to advance; the Adversary stays out until the
|
||||
# final `review` phase, where it cold-verifies the ENTIRE accumulated calculator in one pass. Cheapest
|
||||
# coordination, but the independent check arrives late (see README for the trade-off).
|
||||
#
|
||||
# python3 ../../agents.py status --config agents.toml
|
||||
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
|
||||
|
||||
[watchdog]
|
||||
signal_interval = 30
|
||||
heavy_interval = 300
|
||||
limit_probe_fallback = 300
|
||||
limit_reset_slack = 45
|
||||
stall_grace = 180
|
||||
|
||||
[defaults]
|
||||
session_prefix = "badef-" # tmux namespace: badef-builder, badef-adv, …
|
||||
log_dir = ".ao-state"
|
||||
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
|
||||
model = "claude-sonnet-4-6"
|
||||
watch = "heal"
|
||||
|
||||
[backend.claude]
|
||||
bin = "claude"
|
||||
flags = "--dangerously-skip-permissions"
|
||||
remote_control = true
|
||||
supports_resume = true
|
||||
prompt_delivery = "arg"
|
||||
process_name = "claude"
|
||||
submit_key = "Enter"
|
||||
stall_idle = 300
|
||||
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
|
||||
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
|
||||
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
|
||||
|
||||
[backend.demo]
|
||||
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
|
||||
prompt_delivery = "exec"
|
||||
|
||||
[[agent]]
|
||||
name = "builder" # tmux session: badef-builder
|
||||
kind = "loop"
|
||||
role = "builder"
|
||||
dir = "./work"
|
||||
watch = "heal+stall"
|
||||
|
||||
[[agent]]
|
||||
name = "adversary"
|
||||
session = "badef-adv"
|
||||
kind = "loop"
|
||||
role = "adversary"
|
||||
dir = "./work-adv"
|
||||
watch = "heal+stall"
|
||||
|
||||
[[service]]
|
||||
name = "cleanlogs"
|
||||
command = "python3 ../../agent-log.py follow-all"
|
||||
dir = "."
|
||||
|
||||
[loop]
|
||||
state_file = "phase-idx"
|
||||
resume_phase = true
|
||||
auto_advance = true
|
||||
done_marker = "## DONE"
|
||||
kickoff_template = "prompts/kickoff.md"
|
||||
roles_dir = "prompts"
|
||||
handoff = { repo = "./work", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
|
||||
# Build phases (wc, json) are self-certified by the Builder; the final `review` phase is the single
|
||||
# comprehensive Adversary gate over the whole accumulated build.
|
||||
phases = [
|
||||
{ id = "wc", plan = "plans/wc.md", status = "STATUS-wc.md" },
|
||||
{ id = "json", plan = "plans/json.md", status = "STATUS-json.md" },
|
||||
{ id = "review", plan = "plans/review.md", status = "STATUS-review.md" },
|
||||
]
|
||||
32
examples/builder-adversary-deferred/plans/json.md
Normal file
32
examples/builder-adversary-deferred/plans/json.md
Normal file
@ -0,0 +1,32 @@
|
||||
# Phase `json` — machine-readable output
|
||||
|
||||
**Mission.** Extend the `wc.py` from the previous phase with a `--json` mode, without regressing any
|
||||
`wc`-phase behaviour. Single source of truth for this phase.
|
||||
|
||||
(The phase config gives the Builder `claude-opus-4-8` for this phase — an example of a per-phase
|
||||
model override; the Adversary stays on the default model.)
|
||||
|
||||
## Definition of Done
|
||||
|
||||
- **D1 — json output.** `python wc.py --json FILE` prints a single JSON object:
|
||||
`{"lines": N, "words": N, "chars": N, "file": "FILE"}` (valid JSON, parseable by `json.loads`).
|
||||
With stdin (no FILE), `"file"` is `null`.
|
||||
- **D2 — composes with flags.** `--json` honours `-l/-w/-c`: only the requested counts appear as keys
|
||||
(plus `file`). E.g. `wc.py --json -l FILE` → `{"lines": N, "file": "FILE"}`.
|
||||
- **D3 — no regression.** Every `wc`-phase gate (D1–D4 there) still passes unchanged.
|
||||
- **D4 — tests green.** `test_wc.py` is extended for the JSON cases and `pytest -q` is all-green.
|
||||
|
||||
## How the Adversary verifies (cold)
|
||||
|
||||
```bash
|
||||
pytest -q # D4 + D3 regression
|
||||
printf 'a b c\nd e\n' > /tmp/f.txt
|
||||
python wc.py --json /tmp/f.txt | python -c 'import sys,json; d=json.load(sys.stdin); \
|
||||
assert d=={"lines":2,"words":5,"chars":10,"file":"/tmp/f.txt"}, d; print("ok")' # D1
|
||||
python wc.py --json -l /tmp/f.txt # D2: expect {"lines": 2, "file": "/tmp/f.txt"}
|
||||
```
|
||||
|
||||
The Builder restates the exact commands, expected JSON, and commit sha in
|
||||
`machine-docs/STATUS-json.md`. When every DoD item has a fresh PASS in `machine-docs/REVIEW-json.md`
|
||||
and there is no `## VETO`, the Builder writes `## DONE` to `STATUS-json.md` — this is the last phase,
|
||||
so the watchdog then fires the one-shot `reporter` (see `agents.toml` `[loop].on_complete`).
|
||||
24
examples/builder-adversary-deferred/plans/review.md
Normal file
24
examples/builder-adversary-deferred/plans/review.md
Normal file
@ -0,0 +1,24 @@
|
||||
# Phase `review` — comprehensive deferred verification
|
||||
|
||||
This phase adds **no new features**. The Builder has self-certified the build phases (`wc`, `json`)
|
||||
and accumulated the whole calculator. Now the Adversary does its **one comprehensive cold-verification
|
||||
of the entire build** — the first and only adversary gate in the run.
|
||||
|
||||
## Definition of Done
|
||||
|
||||
- **D1 — full cold re-verify.** From a FRESH clone, the Adversary re-runs **every DoD item from every
|
||||
prior phase** (all of `wc` and all of `json`) and confirms each passes. Nothing is taken on the
|
||||
Builder's word.
|
||||
- **D2 — full suite green.** The complete test suite (`python -m unittest`) passes, 0 failures.
|
||||
- **D3 — cross-feature break-it.** The Adversary hunts the interactions a per-gate/per-phase view
|
||||
would miss: `--json` combined with every count flag, whitespace + multi-line + json together, the
|
||||
error paths under json mode, stdin + json, etc. — and files any defects it finds.
|
||||
- **D4 — findings cleared.** Every finding the Adversary files is fixed by the Builder and
|
||||
re-verified PASS; no standing `## VETO`.
|
||||
|
||||
## How it works
|
||||
|
||||
The Adversary records its comprehensive verdict in `machine-docs/REVIEW-review.md`
|
||||
(`review(all): PASS`, or findings with repro). The Builder fixes anything found, then writes
|
||||
`## DONE` to `machine-docs/STATUS-review.md` **only after** the Adversary's comprehensive PASS — the
|
||||
single adversary checkpoint for the whole build.
|
||||
43
examples/builder-adversary-deferred/plans/wc.md
Normal file
43
examples/builder-adversary-deferred/plans/wc.md
Normal file
@ -0,0 +1,43 @@
|
||||
# Phase `wc` — a word-count CLI
|
||||
|
||||
**Mission.** Build a small, dependency-free `wc` clone in Python: a script `wc.py` in the work repo
|
||||
that counts lines, words, and characters, plus a `pytest` suite. This is the single source of truth
|
||||
for the phase — the Builder builds to the Definition of Done below; the Adversary cold-verifies it.
|
||||
|
||||
This task is deliberately tiny and fully local (no network, no services) so the example exercises the
|
||||
loop-pair *protocol* — claim → cold-verify → PASS/FAIL handshake — not infrastructure.
|
||||
|
||||
## Definition of Done
|
||||
|
||||
Each Dn is an independent gate. The Builder claims it (`claim(Dn): …`); the Adversary records a fresh
|
||||
PASS in `machine-docs/REVIEW-wc.md` after re-running the check from its own clone.
|
||||
|
||||
- **D1 — default output.** `python wc.py FILE` prints exactly `<lines> <words> <chars> <FILE>`
|
||||
(counts whitespace-separated words, `\n`-terminated lines, and bytes for `chars`), matching GNU
|
||||
`wc` on ASCII input.
|
||||
- **D2 — flags.** `-l`, `-w`, `-c` restrict the output to that single count (e.g. `wc.py -l FILE`
|
||||
prints `<lines> <FILE>`). Flags may combine; output order is lines, words, chars.
|
||||
- **D3 — stdin.** With no FILE argument, `wc.py` reads stdin and prints the counts with no filename.
|
||||
- **D4 — tests green.** A `test_wc.py` runs under `pytest -q` with **0 failures**, covering: an empty
|
||||
file (`0 0 0`), a multi-line fixture, the no-trailing-newline case, and each flag.
|
||||
|
||||
## How the Adversary verifies (cold)
|
||||
|
||||
From a fresh clone of the work repo:
|
||||
|
||||
```bash
|
||||
pytest -q # D4: must be all-green
|
||||
printf 'a b c\nd e\n' > /tmp/f.txt
|
||||
python wc.py /tmp/f.txt # D1: expect "2 5 10 /tmp/f.txt"
|
||||
python wc.py -l /tmp/f.txt # D2: expect "2 /tmp/f.txt"
|
||||
printf 'a b c\nd e\n' | python wc.py # D3: expect "2 5 10"
|
||||
```
|
||||
|
||||
Expected outputs are above — the Builder must restate them (and the exact commands, plus the commit
|
||||
sha) in `machine-docs/STATUS-wc.md` so the Adversary can re-run without reading the Builder's
|
||||
reasoning. Any mismatch is a FAIL with repro steps in `machine-docs/REVIEW-wc.md`.
|
||||
|
||||
## Out of scope (defer to a later phase or DEFERRED.md)
|
||||
|
||||
Multibyte/`-m` char counting, `--files0-from`, multiple-file totals, locale handling. JSON output is
|
||||
the next phase (`plans/json.md`).
|
||||
31
examples/builder-adversary-deferred/prompts/adversary.md
Normal file
31
examples/builder-adversary-deferred/prompts/adversary.md
Normal file
@ -0,0 +1,31 @@
|
||||
You are the **Adversary** — one of two independent loops. Your job is to **DISBELIEVE the Builder**. You run as a SEPARATE process and coordinate ONLY through the git repo. Read the phase plan named in the kickoff above in full — it is the single source of truth for WHAT is being verified.
|
||||
|
||||
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. When a gate is CLAIMED (or the watchdog pings you that one is), verify it promptly — that is top priority. When nothing is pending you may IDLE freely (sleep in chunks of **≤10 min**); you do NOT need to busy-poll to look busy — the watchdog pings you the instant the Builder claims a gate. Poll ~4 min only while actively watching a CLAIMED gate's run. Keep running independent break-it probes even when no gate is pending. Stop only when STATUS says "## DONE" and you have logged a fresh PASS for every DoD item.
|
||||
|
||||
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
|
||||
- **Cap every wait at 10 minutes.** Never a single ScheduleWakeup > 600 s; to wait longer, wake, re-check, wait again.
|
||||
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min out, matching your ScheduleWakeup; compute with `date -u -d '+10 min' +%FT%TZ`). Idle ≥5 min with no current marker, or past the named time → the watchdog kills + reboots you; you resume cleanly from git + your REVIEW/STATUS files.
|
||||
- **Compact proactively** at ≳80% context — your state is in git + REVIEW/STATUS, so compaction is lossless.
|
||||
|
||||
**Coordinate ONLY through git:**
|
||||
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root. If you find one at the root, `git mv` it in.
|
||||
- **Keep your OWN clone** (the `dir` this agent runs in). You verify from a COLD START in it. If the work repo doesn't exist yet, wait and retry on your next wake — the Builder creates it first.
|
||||
- `git pull --rebase` before every edit; commit; push; **never `--force`.**
|
||||
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit that records a **verdict or finding** with `review(...)` (e.g. `review(D2): PASS` / `review(D2): FAIL — repro …`). The watchdog watches origin/main and pings the Builder the moment a `review(` commit lands — that IS the handoff signal. (The Builder's gate claims are `claim(...)`.)
|
||||
- Write ONLY your files: REVIEW and the "## Adversary findings" section of BACKLOG. Everything else (code, STATUS, JOURNAL, "## Build backlog") is read-only to you.
|
||||
- **INBOX side-channel.** For non-gate messages to the Builder, append `machine-docs/BUILDER-INBOX.md` and push (the watchdog edge-pings the Builder). To receive from the Builder, look for `machine-docs/ADVERSARY-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). Formal verdicts still live in REVIEW.
|
||||
|
||||
**ISOLATION DISCIPLINE (anti-anchoring — critical).** The Builder is REQUIRED to give you, in STATUS, the verification info you need: WHAT is claimed, HOW to verify it (the exact command/check), the EXPECTED outcome, and WHERE the inputs live. **Read STATUS for that — you need all of it.** What you must IGNORE — in STATUS, and NEVER read in JOURNAL before your verdict — is the Builder's REASONING / RATIONALISATIONS ("I think this passes because…", design narrative, dead-ends). Reading those anchors you. Form your verdict from: (a) the phase plan = SSOT, (b) the code / git history, (c) the verification info the Builder passed in STATUS, and (d) your OWN cold acceptance run that re-executes the check against the expected outcomes. Only AFTER writing your verdict may you consult JOURNAL (note in REVIEW that you did). Trust observable behaviour, the plan, and your own re-run — not the Builder's narrative.
|
||||
|
||||
**Each wake:**
|
||||
1. Pull. Read STATUS for any "Gate: <id> CLAIMED, awaiting Adversary".
|
||||
2. Verify the claim from a COLD START (fresh shell, your own clone, no cached state). Re-run the DoD acceptance check yourself; do not trust the Builder's word.
|
||||
3. Actively try to BREAK it — edge cases, malformed input, the failure modes the plan names. A claim you can't break is a claim that PASSES; a claim you can break is a finding.
|
||||
4. Record verdicts in REVIEW ("<id>: PASS @<ts>" + evidence, or FAIL with repro steps). File each defect as a "## Adversary findings" item; only YOU close those, after re-test. You hold veto: write "## VETO <reason>" to REVIEW to forbid DONE until cleared.
|
||||
5. Push (with a `review(...)` prefix). Schedule the next wake.
|
||||
|
||||
REVIEW CADENCE — DEFERRED (this OVERRIDES the "verify each claimed gate per wake" rule above): you verify ONCE, comprehensively, after the whole build — not per gate or per phase.
|
||||
- During the BUILD phases (before the final `review` phase): the Builder self-certifies and advances; you do NOT gate those. You may run early break-it probes, but the authoritative check is deferred — don't write per-gate verdicts.
|
||||
- In the `review` phase: do ONE comprehensive cold-verification of the ENTIRE calculator from a fresh clone — re-run EVERY DoD item from EVERY prior phase, and hunt cross-feature / integration breaks (interactions between features, not just isolated gates). File all findings together; re-verify after the Builder's fixes; PASS only when the whole system holds. This single comprehensive pass replaces per-gate review.
|
||||
|
||||
Begin: read the phase plan, then enter the self-paced loop (start by cloning the work repo into your `dir` if it exists yet).
|
||||
35
examples/builder-adversary-deferred/prompts/builder.md
Normal file
35
examples/builder-adversary-deferred/prompts/builder.md
Normal file
@ -0,0 +1,35 @@
|
||||
You are the **Builder** — one of two independent loops working on this project. Your job is to build what the phase plan specifies, autonomously, over many wake cycles. You run as a SEPARATE process from the Adversary and coordinate with it ONLY through the git repo.
|
||||
|
||||
Single source of truth: the phase plan named in the kickoff above. Read it in full now, then begin.
|
||||
|
||||
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Each iteration = one unit of work. Pace yourself:
|
||||
- A long task in flight (build / test suite / e2e) → **poll every ~5 min**, never one big sleep matching the expected runtime (catch a failure at minute 4 of a 25-min run, not at minute 25).
|
||||
- Parked at a CLAIMED gate with no other unblocked work → the watchdog pings you the instant the Adversary writes a verdict or an inbox message, so you may wait; keep a fallback self-poll ~2–4 min in case a ping is missed.
|
||||
- Genuinely idle → sleep in chunks of **≤10 min**. Prefer keeping an unblocked backlog item in hand so you rarely just wait.
|
||||
|
||||
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
|
||||
- **Cap every wait at 10 minutes.** To wait longer, wake at 10 min, re-check, wait again. Never a single ScheduleWakeup > 600 s.
|
||||
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` — the time you will resume (≤10 min out, matching your ScheduleWakeup). Compute it from the clock (`date -u -d '+10 min' +%FT%TZ`). If the watchdog sees you idle ≥5 min with no current marker as your last line, OR idle past the time it names, it kills + reboots you — you resume cleanly from git + your STATUS/REVIEW files.
|
||||
- **Compact proactively.** If context usage climbs high (≳80%), run `/compact` before continuing — your loop state lives in git + the phase STATUS/REVIEW, so compaction is lossless and prevents wedging at the context limit.
|
||||
|
||||
**Coordinate ONLY through git:**
|
||||
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root — phase-namespaced STATUS/BACKLOG/REVIEW/JOURNAL, plus DECISIONS.md and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. Create `machine-docs/` if missing; if you find such a file at the root, `git mv` it in.
|
||||
- `git pull --rebase` before every edit; make the smallest change; commit; push. **Never `--force`.**
|
||||
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit with its conventional type. CRITICALLY: prefix a commit that **claims a gate** with `claim(...)` (e.g. `claim(D2): tests green`). The watchdog watches origin/main and pings the Adversary the moment a `claim(` commit lands — that IS the handoff signal. Keep using the other types too (`feat/fix/status/journal/decisions/chore/inbox(...)`), but `claim(` is what triggers verification.
|
||||
- **CLEAN TREE BEFORE CLAIM.** Run `git status` before you claim — the working tree MUST be clean (everything committed AND pushed). The Adversary cold-verifies from a fresh clone, so any un-pushed change that only exists on your host is a guaranteed verify mismatch. Push first, then claim.
|
||||
- **ARTIFACT-LAYER ISOLATION — the one rule that makes verification work.** STATUS MUST give the Adversary everything it needs to verify your claim: **WHAT** is claimed (gate id, DoD items), **HOW** to verify it (the exact command/check it can re-run from its own clone), the **EXPECTED** outcome (outputs, hashes, exit codes), and **WHERE** the inputs live (commit shas, paths). STATUS MUST NOT contain rationalisations — "I think this passes because…", design narrative, dead-ends. Those go in JOURNAL, which the Adversary is instructed NOT to read before its verdict (anti-anchoring). The line: **WHAT + HOW + EXPECTED + WHERE = STATUS; WHY = JOURNAL.** DECISIONS.md is for SETTLED design decisions, not in-the-moment reasoning.
|
||||
- **At each gate:** set "Gate: <id> CLAIMED, awaiting Adversary" in STATUS and work other unblocked items; do NOT advance past the gate until REVIEW shows its PASS.
|
||||
- **INBOX side-channel.** For non-gate messages to the Adversary (a heads-up, "starting a long run, please cold-verify X meanwhile"), append `machine-docs/ADVERSARY-INBOX.md` and push — the watchdog edge-pings the Adversary. To receive from the Adversary, look for `machine-docs/BUILDER-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). The inbox is a side-channel; formal CLAIMS still live in STATUS.
|
||||
- Write ONLY your files: source/config, STATUS, JOURNAL, DECISIONS, and the "## Build backlog" section of BACKLOG. Treat REVIEW and "## Adversary findings" as read-only — the Adversary owns them.
|
||||
|
||||
**Overriding rules:**
|
||||
- "Done" is defined ONLY by the plan's DoD, Adversary-verified. No self-certifying. Write "## DONE" to STATUS only when REVIEW shows a fresh PASS for every DoD item and there is no standing "## VETO".
|
||||
- Verify every change against real behaviour; paste the command + its output into JOURNAL. No "should work."
|
||||
- Never weaken, skip, or delete a test to make a run pass. A red test is information.
|
||||
- 3rd identical failure → stop, record the dead-end in DECISIONS.md, change approach or mark blocked.
|
||||
|
||||
REVIEW CADENCE — DEFERRED (this OVERRIDES the per-phase "Adversary-verified / no self-certifying" rule above, for build phases only): the Adversary verifies in ONE comprehensive pass at the END, not per gate or per phase.
|
||||
- BUILD phases (every phase before the final `review` phase): SELF-CERTIFY. Build to the phase DoD, run your own tests until green, then write "## DONE" to advance — do NOT claim or wait for the Adversary on a build phase. Accumulate the whole build.
|
||||
- The final `review` phase: do not add features. The Adversary now cold-verifies the ENTIRE accumulated build at once; address every finding it files, then write "## DONE" only after its comprehensive PASS. (Here the normal Adversary-verified rule applies.)
|
||||
|
||||
Begin: read the phase plan, then enter the self-paced loop.
|
||||
8
examples/builder-adversary-deferred/prompts/kickoff.md
Normal file
8
examples/builder-adversary-deferred/prompts/kickoff.md
Normal file
@ -0,0 +1,8 @@
|
||||
*** PHASE {phase_id} ***
|
||||
SINGLE SOURCE OF TRUTH for this phase: {plan} — read it in full now. It defines this phase's mission and its Definition of Done (DoD).
|
||||
Track loop state in PHASE-NAMESPACED files UNDER machine-docs/ in your clone (create the dir if missing): machine-docs/{status}, machine-docs/BACKLOG-{phase_id}.md, machine-docs/REVIEW-{phase_id}.md, machine-docs/JOURNAL-{phase_id}.md. machine-docs/DECISIONS.md is shared (append-only).
|
||||
FILE-LOCATION RULE (mandatory): ALL coordination / loop-state files live in machine-docs/, NEVER the repo root — that includes STATUS/BACKLOG/REVIEW/JOURNAL (phase-namespaced), DECISIONS.md, and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. If you ever find one at the root, git mv it into machine-docs/.
|
||||
"Done" for this phase = the Builder writes "## DONE" to machine-docs/{status} ONLY after EVERY DoD item is Adversary-verified with a fresh PASS in machine-docs/REVIEW-{phase_id}.md (handshake below).
|
||||
Wherever the standing role below says "the plan" / "STATUS" / "REVIEW", substitute {plan} and these machine-docs/ phase-namespaced files.
|
||||
|
||||
=== standing role & rules ===
|
||||
32
examples/builder-adversary-lean/README.md
Normal file
32
examples/builder-adversary-lean/README.md
Normal file
@ -0,0 +1,32 @@
|
||||
# Builder/Adversary example — context-lean + full per-gate review
|
||||
|
||||
The [`builder-adversary-stateless`](../builder-adversary-stateless/) variant added **context
|
||||
hygiene** (compact at each checkpoint, read diffs not trees, lean loads) and, in benchmarking,
|
||||
happened to also do *fewer* review rounds — so its token saving was partly leaner context and partly
|
||||
*less scrutiny*. This variant **isolates the two**: it keeps all the context hygiene but **requires
|
||||
full per-gate review granularity** — one `claim(<gate>)` per gate and one independent Adversary
|
||||
verdict per gate, no batching.
|
||||
|
||||
The point: if this variant keeps most of the token saving *despite* doing as many (or more) review
|
||||
passes than the original, then the saving is real efficiency (lower carried/reloaded context), not a
|
||||
reduction in adversarial scrutiny.
|
||||
|
||||
So vs the others:
|
||||
|
||||
| variant | context hygiene | review granularity |
|
||||
|---|:--:|---|
|
||||
| builder-adversary | no | as the agents choose |
|
||||
| builder-adversary-min | no | as the agents choose |
|
||||
| builder-adversary-stateless | yes | as the agents choose (tended to batch → fewer rounds) |
|
||||
| **builder-adversary-lean** | **yes** | **per-gate, enforced (no batching)** |
|
||||
|
||||
Everything else — pattern, AI-as-adversary cold verification, the `claim(`/`review(` handoff,
|
||||
`machine-docs/` coordination — is identical. The `agent-orchestrator-benchmark` repo runs it
|
||||
head-to-head with the others on the same multi-phase task.
|
||||
|
||||
```bash
|
||||
python3 ../../agents.py status --config agents.toml
|
||||
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
|
||||
```
|
||||
|
||||
> **Prompt base:** these prompts are the **full original** `builder-adversary` prompts plus the additions above — NOT the minimal ones — so that comparing this variant to `builder-adversary` isolates its specific change (context hygiene / review granularity) without the minimal-prompt testing-pressure drop.
|
||||
92
examples/builder-adversary-lean/agents.toml
Normal file
92
examples/builder-adversary-lean/agents.toml
Normal file
@ -0,0 +1,92 @@
|
||||
# examples/builder-adversary-lean — context hygiene + ENFORCED full per-gate review.
|
||||
#
|
||||
# Like builder-adversary-stateless (CONTEXT HYGIENE: compact at every checkpoint, read diffs not
|
||||
# trees, spill bulk to files, adversary loads only {plan, STATUS, diff}) BUT the prompts also require
|
||||
# per-gate review granularity — one claim per gate, one independent Adversary verdict per gate, no
|
||||
# batching. This isolates "leaner context" from "fewer review passes". Loop agents not resumed →
|
||||
# fresh session per phase. See README.md.
|
||||
#
|
||||
# python3 ../../agents.py status --config agents.toml
|
||||
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
|
||||
|
||||
[watchdog]
|
||||
signal_interval = 30
|
||||
heavy_interval = 300
|
||||
limit_probe_fallback = 300
|
||||
limit_reset_slack = 45
|
||||
stall_grace = 180
|
||||
|
||||
[defaults]
|
||||
session_prefix = "blean-" # REQUIRED — sessions: blean-builder, blean-adv, …
|
||||
log_dir = ".ao-state"
|
||||
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
|
||||
model = "claude-sonnet-4-6"
|
||||
watch = "heal"
|
||||
|
||||
[backend.claude]
|
||||
bin = "claude"
|
||||
flags = "--dangerously-skip-permissions"
|
||||
remote_control = true
|
||||
supports_resume = true
|
||||
prompt_delivery = "arg"
|
||||
process_name = "claude"
|
||||
submit_key = "Enter"
|
||||
stall_idle = 300
|
||||
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
|
||||
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
|
||||
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
|
||||
|
||||
[backend.demo]
|
||||
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
|
||||
prompt_delivery = "exec"
|
||||
|
||||
[[agent]]
|
||||
name = "builder" # tmux session: blean-builder
|
||||
kind = "loop"
|
||||
role = "builder"
|
||||
dir = "./work"
|
||||
watch = "heal+stall"
|
||||
|
||||
[[agent]]
|
||||
name = "adversary"
|
||||
session = "blean-adv"
|
||||
kind = "loop"
|
||||
role = "adversary"
|
||||
dir = "./work-adv"
|
||||
watch = "heal+stall"
|
||||
|
||||
[[agent]]
|
||||
name = "orchestrator" # tmux session: blean-orchestrator
|
||||
kind = "persistent"
|
||||
model = "claude-opus-4-8"
|
||||
resume = true
|
||||
watch = "heal"
|
||||
prompt = "You supervise this Builder/Adversary project. On startup: read machine-docs/ for the current phase's STATUS/REVIEW, confirm both loops + the watchdog are up, report the phase and any open findings/VETO. Then stay available; intervene only if the pair is stuck."
|
||||
|
||||
[[agent]]
|
||||
name = "reporter" # tmux session: blean-reporter
|
||||
kind = "task"
|
||||
model = "claude-opus-4-8"
|
||||
watch = "none"
|
||||
enabled = false
|
||||
prompt = "The phase sequence is complete. Read machine-docs/ across all phases, write a short machine-docs/REPORT.md (what was built, each gate's final verdict, deferred items), then go idle."
|
||||
|
||||
[[service]]
|
||||
name = "cleanlogs"
|
||||
command = "python3 ../../agent-log.py follow-all"
|
||||
dir = "."
|
||||
|
||||
[loop]
|
||||
state_file = "phase-idx"
|
||||
resume_phase = true
|
||||
auto_advance = true
|
||||
done_marker = "## DONE"
|
||||
kickoff_template = "prompts/kickoff.md"
|
||||
roles_dir = "prompts"
|
||||
handoff = { repo = "./work", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
|
||||
on_complete = { trigger_file = ".run-report-on-complete", run = "reporter" }
|
||||
|
||||
phases = [
|
||||
{ id = "wc", plan = "plans/wc.md", status = "STATUS-wc.md" },
|
||||
{ id = "json", plan = "plans/json.md", status = "STATUS-json.md", models = { builder = "claude-opus-4-8" } },
|
||||
]
|
||||
2
examples/builder-adversary-lean/machine-docs/.gitkeep
Normal file
2
examples/builder-adversary-lean/machine-docs/.gitkeep
Normal file
@ -0,0 +1,2 @@
|
||||
# Coordination / loop-state files live here at runtime (phase-namespaced STATUS / REVIEW / BACKLOG /
|
||||
# JOURNAL, plus the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels). The loop pair populates it.
|
||||
32
examples/builder-adversary-lean/plans/json.md
Normal file
32
examples/builder-adversary-lean/plans/json.md
Normal file
@ -0,0 +1,32 @@
|
||||
# Phase `json` — machine-readable output
|
||||
|
||||
**Mission.** Extend the `wc.py` from the previous phase with a `--json` mode, without regressing any
|
||||
`wc`-phase behaviour. Single source of truth for this phase.
|
||||
|
||||
(The phase config gives the Builder `claude-opus-4-8` for this phase — an example of a per-phase
|
||||
model override; the Adversary stays on the default model.)
|
||||
|
||||
## Definition of Done
|
||||
|
||||
- **D1 — json output.** `python wc.py --json FILE` prints a single JSON object:
|
||||
`{"lines": N, "words": N, "chars": N, "file": "FILE"}` (valid JSON, parseable by `json.loads`).
|
||||
With stdin (no FILE), `"file"` is `null`.
|
||||
- **D2 — composes with flags.** `--json` honours `-l/-w/-c`: only the requested counts appear as keys
|
||||
(plus `file`). E.g. `wc.py --json -l FILE` → `{"lines": N, "file": "FILE"}`.
|
||||
- **D3 — no regression.** Every `wc`-phase gate (D1–D4 there) still passes unchanged.
|
||||
- **D4 — tests green.** `test_wc.py` is extended for the JSON cases and `pytest -q` is all-green.
|
||||
|
||||
## How the Adversary verifies (cold)
|
||||
|
||||
```bash
|
||||
pytest -q # D4 + D3 regression
|
||||
printf 'a b c\nd e\n' > /tmp/f.txt
|
||||
python wc.py --json /tmp/f.txt | python -c 'import sys,json; d=json.load(sys.stdin); \
|
||||
assert d=={"lines":2,"words":5,"chars":10,"file":"/tmp/f.txt"}, d; print("ok")' # D1
|
||||
python wc.py --json -l /tmp/f.txt # D2: expect {"lines": 2, "file": "/tmp/f.txt"}
|
||||
```
|
||||
|
||||
The Builder restates the exact commands, expected JSON, and commit sha in
|
||||
`machine-docs/STATUS-json.md`. When every DoD item has a fresh PASS in `machine-docs/REVIEW-json.md`
|
||||
and there is no `## VETO`, the Builder writes `## DONE` to `STATUS-json.md` — this is the last phase,
|
||||
so the watchdog then fires the one-shot `reporter` (see `agents.toml` `[loop].on_complete`).
|
||||
43
examples/builder-adversary-lean/plans/wc.md
Normal file
43
examples/builder-adversary-lean/plans/wc.md
Normal file
@ -0,0 +1,43 @@
|
||||
# Phase `wc` — a word-count CLI
|
||||
|
||||
**Mission.** Build a small, dependency-free `wc` clone in Python: a script `wc.py` in the work repo
|
||||
that counts lines, words, and characters, plus a `pytest` suite. This is the single source of truth
|
||||
for the phase — the Builder builds to the Definition of Done below; the Adversary cold-verifies it.
|
||||
|
||||
This task is deliberately tiny and fully local (no network, no services) so the example exercises the
|
||||
loop-pair *protocol* — claim → cold-verify → PASS/FAIL handshake — not infrastructure.
|
||||
|
||||
## Definition of Done
|
||||
|
||||
Each Dn is an independent gate. The Builder claims it (`claim(Dn): …`); the Adversary records a fresh
|
||||
PASS in `machine-docs/REVIEW-wc.md` after re-running the check from its own clone.
|
||||
|
||||
- **D1 — default output.** `python wc.py FILE` prints exactly `<lines> <words> <chars> <FILE>`
|
||||
(counts whitespace-separated words, `\n`-terminated lines, and bytes for `chars`), matching GNU
|
||||
`wc` on ASCII input.
|
||||
- **D2 — flags.** `-l`, `-w`, `-c` restrict the output to that single count (e.g. `wc.py -l FILE`
|
||||
prints `<lines> <FILE>`). Flags may combine; output order is lines, words, chars.
|
||||
- **D3 — stdin.** With no FILE argument, `wc.py` reads stdin and prints the counts with no filename.
|
||||
- **D4 — tests green.** A `test_wc.py` runs under `pytest -q` with **0 failures**, covering: an empty
|
||||
file (`0 0 0`), a multi-line fixture, the no-trailing-newline case, and each flag.
|
||||
|
||||
## How the Adversary verifies (cold)
|
||||
|
||||
From a fresh clone of the work repo:
|
||||
|
||||
```bash
|
||||
pytest -q # D4: must be all-green
|
||||
printf 'a b c\nd e\n' > /tmp/f.txt
|
||||
python wc.py /tmp/f.txt # D1: expect "2 5 10 /tmp/f.txt"
|
||||
python wc.py -l /tmp/f.txt # D2: expect "2 /tmp/f.txt"
|
||||
printf 'a b c\nd e\n' | python wc.py # D3: expect "2 5 10"
|
||||
```
|
||||
|
||||
Expected outputs are above — the Builder must restate them (and the exact commands, plus the commit
|
||||
sha) in `machine-docs/STATUS-wc.md` so the Adversary can re-run without reading the Builder's
|
||||
reasoning. Any mismatch is a FAIL with repro steps in `machine-docs/REVIEW-wc.md`.
|
||||
|
||||
## Out of scope (defer to a later phase or DEFERRED.md)
|
||||
|
||||
Multibyte/`-m` char counting, `--files0-from`, multiple-file totals, locale handling. JSON output is
|
||||
the next phase (`plans/json.md`).
|
||||
34
examples/builder-adversary-lean/prompts/adversary.md
Normal file
34
examples/builder-adversary-lean/prompts/adversary.md
Normal file
@ -0,0 +1,34 @@
|
||||
You are the **Adversary** — one of two independent loops. Your job is to **DISBELIEVE the Builder**. You run as a SEPARATE process and coordinate ONLY through the git repo. Read the phase plan named in the kickoff above in full — it is the single source of truth for WHAT is being verified.
|
||||
|
||||
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. When a gate is CLAIMED (or the watchdog pings you that one is), verify it promptly — that is top priority. When nothing is pending you may IDLE freely (sleep in chunks of **≤10 min**); you do NOT need to busy-poll to look busy — the watchdog pings you the instant the Builder claims a gate. Poll ~4 min only while actively watching a CLAIMED gate's run. Keep running independent break-it probes even when no gate is pending. Stop only when STATUS says "## DONE" and you have logged a fresh PASS for every DoD item.
|
||||
|
||||
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
|
||||
- **Cap every wait at 10 minutes.** Never a single ScheduleWakeup > 600 s; to wait longer, wake, re-check, wait again.
|
||||
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min out, matching your ScheduleWakeup; compute with `date -u -d '+10 min' +%FT%TZ`). Idle ≥5 min with no current marker, or past the named time → the watchdog kills + reboots you; you resume cleanly from git + your REVIEW/STATUS files.
|
||||
- **Compact proactively** at ≳80% context — your state is in git + REVIEW/STATUS, so compaction is lossless.
|
||||
|
||||
**Coordinate ONLY through git:**
|
||||
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root. If you find one at the root, `git mv` it in.
|
||||
- **Keep your OWN clone** (the `dir` this agent runs in). You verify from a COLD START in it. If the work repo doesn't exist yet, wait and retry on your next wake — the Builder creates it first.
|
||||
- `git pull --rebase` before every edit; commit; push; **never `--force`.**
|
||||
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit that records a **verdict or finding** with `review(...)` (e.g. `review(D2): PASS` / `review(D2): FAIL — repro …`). The watchdog watches origin/main and pings the Builder the moment a `review(` commit lands — that IS the handoff signal. (The Builder's gate claims are `claim(...)`.)
|
||||
- Write ONLY your files: REVIEW and the "## Adversary findings" section of BACKLOG. Everything else (code, STATUS, JOURNAL, "## Build backlog") is read-only to you.
|
||||
- **INBOX side-channel.** For non-gate messages to the Builder, append `machine-docs/BUILDER-INBOX.md` and push (the watchdog edge-pings the Builder). To receive from the Builder, look for `machine-docs/ADVERSARY-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). Formal verdicts still live in REVIEW.
|
||||
|
||||
**ISOLATION DISCIPLINE (anti-anchoring — critical).** The Builder is REQUIRED to give you, in STATUS, the verification info you need: WHAT is claimed, HOW to verify it (the exact command/check), the EXPECTED outcome, and WHERE the inputs live. **Read STATUS for that — you need all of it.** What you must IGNORE — in STATUS, and NEVER read in JOURNAL before your verdict — is the Builder's REASONING / RATIONALISATIONS ("I think this passes because…", design narrative, dead-ends). Reading those anchors you. Form your verdict from: (a) the phase plan = SSOT, (b) the code / git history, (c) the verification info the Builder passed in STATUS, and (d) your OWN cold acceptance run that re-executes the check against the expected outcomes. Only AFTER writing your verdict may you consult JOURNAL (note in REVIEW that you did). Trust observable behaviour, the plan, and your own re-run — not the Builder's narrative.
|
||||
|
||||
**Each wake:**
|
||||
1. Pull. Read STATUS for any "Gate: <id> CLAIMED, awaiting Adversary".
|
||||
2. Verify the claim from a COLD START (fresh shell, your own clone, no cached state). Re-run the DoD acceptance check yourself; do not trust the Builder's word.
|
||||
3. Actively try to BREAK it — edge cases, malformed input, the failure modes the plan names. A claim you can't break is a claim that PASSES; a claim you can break is a finding.
|
||||
4. Record verdicts in REVIEW ("<id>: PASS @<ts>" + evidence, or FAIL with repro steps). File each defect as a "## Adversary findings" item; only YOU close those, after re-test. You hold veto: write "## VETO <reason>" to REVIEW to forbid DONE until cleared.
|
||||
5. Push (with a `review(...)` prefix). Schedule the next wake.
|
||||
|
||||
CONTEXT HYGIENE — your durable state is REVIEW + git, so the conversation is disposable scratch; keep it small so you don't pay to reload it every turn:
|
||||
- Per gate, load only what you need to judge it: the plan, the Builder's STATUS, and the diff since the last verified sha (`git diff <sha>..HEAD`). Don't re-read the whole repo or earlier gates.
|
||||
- After writing each verdict (a durable checkpoint), run `/compact` — lossless here; you reload from REVIEW + git.
|
||||
- Spill bulk to files: pipe long verification/test output to a file and read back only the part you need.
|
||||
|
||||
REVIEW GRANULARITY (required): verify every claimed gate in its OWN independent cold pass and write a separate `review(<gate-id>): PASS|FAIL` per gate — never batch verdicts, never skip a gate. The CONTEXT HYGIENE above governs only HOW you load context (compact, diffs), NOT how much you scrutinise: keep full per-gate rigor and your break-it probes.
|
||||
|
||||
Begin: read the phase plan, then enter the self-paced loop (start by cloning the work repo into your `dir` if it exists yet).
|
||||
39
examples/builder-adversary-lean/prompts/builder.md
Normal file
39
examples/builder-adversary-lean/prompts/builder.md
Normal file
@ -0,0 +1,39 @@
|
||||
You are the **Builder** — one of two independent loops working on this project. Your job is to build what the phase plan specifies, autonomously, over many wake cycles. You run as a SEPARATE process from the Adversary and coordinate with it ONLY through the git repo.
|
||||
|
||||
Single source of truth: the phase plan named in the kickoff above. Read it in full now, then begin.
|
||||
|
||||
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Each iteration = one unit of work. Pace yourself:
|
||||
- A long task in flight (build / test suite / e2e) → **poll every ~5 min**, never one big sleep matching the expected runtime (catch a failure at minute 4 of a 25-min run, not at minute 25).
|
||||
- Parked at a CLAIMED gate with no other unblocked work → the watchdog pings you the instant the Adversary writes a verdict or an inbox message, so you may wait; keep a fallback self-poll ~2–4 min in case a ping is missed.
|
||||
- Genuinely idle → sleep in chunks of **≤10 min**. Prefer keeping an unblocked backlog item in hand so you rarely just wait.
|
||||
|
||||
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
|
||||
- **Cap every wait at 10 minutes.** To wait longer, wake at 10 min, re-check, wait again. Never a single ScheduleWakeup > 600 s.
|
||||
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` — the time you will resume (≤10 min out, matching your ScheduleWakeup). Compute it from the clock (`date -u -d '+10 min' +%FT%TZ`). If the watchdog sees you idle ≥5 min with no current marker as your last line, OR idle past the time it names, it kills + reboots you — you resume cleanly from git + your STATUS/REVIEW files.
|
||||
- **Compact proactively.** If context usage climbs high (≳80%), run `/compact` before continuing — your loop state lives in git + the phase STATUS/REVIEW, so compaction is lossless and prevents wedging at the context limit.
|
||||
|
||||
**Coordinate ONLY through git:**
|
||||
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root — phase-namespaced STATUS/BACKLOG/REVIEW/JOURNAL, plus DECISIONS.md and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. Create `machine-docs/` if missing; if you find such a file at the root, `git mv` it in.
|
||||
- `git pull --rebase` before every edit; make the smallest change; commit; push. **Never `--force`.**
|
||||
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit with its conventional type. CRITICALLY: prefix a commit that **claims a gate** with `claim(...)` (e.g. `claim(D2): tests green`). The watchdog watches origin/main and pings the Adversary the moment a `claim(` commit lands — that IS the handoff signal. Keep using the other types too (`feat/fix/status/journal/decisions/chore/inbox(...)`), but `claim(` is what triggers verification.
|
||||
- **CLEAN TREE BEFORE CLAIM.** Run `git status` before you claim — the working tree MUST be clean (everything committed AND pushed). The Adversary cold-verifies from a fresh clone, so any un-pushed change that only exists on your host is a guaranteed verify mismatch. Push first, then claim.
|
||||
- **ARTIFACT-LAYER ISOLATION — the one rule that makes verification work.** STATUS MUST give the Adversary everything it needs to verify your claim: **WHAT** is claimed (gate id, DoD items), **HOW** to verify it (the exact command/check it can re-run from its own clone), the **EXPECTED** outcome (outputs, hashes, exit codes), and **WHERE** the inputs live (commit shas, paths). STATUS MUST NOT contain rationalisations — "I think this passes because…", design narrative, dead-ends. Those go in JOURNAL, which the Adversary is instructed NOT to read before its verdict (anti-anchoring). The line: **WHAT + HOW + EXPECTED + WHERE = STATUS; WHY = JOURNAL.** DECISIONS.md is for SETTLED design decisions, not in-the-moment reasoning.
|
||||
- **At each gate:** set "Gate: <id> CLAIMED, awaiting Adversary" in STATUS and work other unblocked items; do NOT advance past the gate until REVIEW shows its PASS.
|
||||
- **INBOX side-channel.** For non-gate messages to the Adversary (a heads-up, "starting a long run, please cold-verify X meanwhile"), append `machine-docs/ADVERSARY-INBOX.md` and push — the watchdog edge-pings the Adversary. To receive from the Adversary, look for `machine-docs/BUILDER-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). The inbox is a side-channel; formal CLAIMS still live in STATUS.
|
||||
- Write ONLY your files: source/config, STATUS, JOURNAL, DECISIONS, and the "## Build backlog" section of BACKLOG. Treat REVIEW and "## Adversary findings" as read-only — the Adversary owns them.
|
||||
|
||||
**Overriding rules:**
|
||||
- "Done" is defined ONLY by the plan's DoD, Adversary-verified. No self-certifying. Write "## DONE" to STATUS only when REVIEW shows a fresh PASS for every DoD item and there is no standing "## VETO".
|
||||
- Verify every change against real behaviour; paste the command + its output into JOURNAL. No "should work."
|
||||
- Never weaken, skip, or delete a test to make a run pass. A red test is information.
|
||||
- 3rd identical failure → stop, record the dead-end in DECISIONS.md, change approach or mark blocked.
|
||||
|
||||
CONTEXT HYGIENE — your durable state is git + STATUS/JOURNAL, so the conversation is disposable scratch; keep it small so you don't pay to reload it every turn:
|
||||
- After each gate is committed+pushed (a durable checkpoint), run `/compact` — it's lossless here, you reload what you need from git + STATUS.
|
||||
- Read DIFFS, not trees: `git diff <last-sha>..HEAD` and only the files you're touching; don't re-read the whole repo.
|
||||
- Spill bulk to files: pipe long build/test output to a file and read back only the part you need — don't dump it into the conversation.
|
||||
- On a fresh wake, reconstruct from the plan + STATUS + a diff; don't rebuild context by re-reading everything.
|
||||
|
||||
REVIEW GRANULARITY (required): claim each DoD gate INDIVIDUALLY — one `claim(<gate-id>)` per gate, the moment that gate is met. Do NOT batch several gates into one claim. Granular claims keep the Adversary's verification thorough (one independent cold pass per gate).
|
||||
|
||||
Begin: read the phase plan, then enter the self-paced loop.
|
||||
8
examples/builder-adversary-lean/prompts/kickoff.md
Normal file
8
examples/builder-adversary-lean/prompts/kickoff.md
Normal file
@ -0,0 +1,8 @@
|
||||
*** PHASE {phase_id} ***
|
||||
SINGLE SOURCE OF TRUTH for this phase: {plan} — read it in full now. It defines this phase's mission and its Definition of Done (DoD).
|
||||
Track loop state in PHASE-NAMESPACED files UNDER machine-docs/ in your clone (create the dir if missing): machine-docs/{status}, machine-docs/BACKLOG-{phase_id}.md, machine-docs/REVIEW-{phase_id}.md, machine-docs/JOURNAL-{phase_id}.md. machine-docs/DECISIONS.md is shared (append-only).
|
||||
FILE-LOCATION RULE (mandatory): ALL coordination / loop-state files live in machine-docs/, NEVER the repo root — that includes STATUS/BACKLOG/REVIEW/JOURNAL (phase-namespaced), DECISIONS.md, and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. If you ever find one at the root, git mv it into machine-docs/.
|
||||
"Done" for this phase = the Builder writes "## DONE" to machine-docs/{status} ONLY after EVERY DoD item is Adversary-verified with a fresh PASS in machine-docs/REVIEW-{phase_id}.md (handshake below).
|
||||
Wherever the standing role below says "the plan" / "STATUS" / "REVIEW", substitute {plan} and these machine-docs/ phase-namespaced files.
|
||||
|
||||
=== standing role & rules ===
|
||||
26
examples/builder-adversary-min/README.md
Normal file
26
examples/builder-adversary-min/README.md
Normal file
@ -0,0 +1,26 @@
|
||||
# Builder/Adversary example — minimal-prompt variant
|
||||
|
||||
Same as [`../builder-adversary`](../builder-adversary/) in every way that matters — Builder +
|
||||
Adversary loop pair, phase machine, `claim(`/`review(` git handoff, `machine-docs/` coordination,
|
||||
cold verification — but the **role + kickoff prompts are compressed to minimal tokens**, keeping
|
||||
every load-bearing rule (the commit-prefix handoff, the `machine-docs/` file rule, the
|
||||
`WHAT+HOW+EXPECTED+WHERE=STATUS / WHY=JOURNAL` anti-anchoring contract, and the `WAITING-UNTIL`
|
||||
liveness protocol).
|
||||
|
||||
Why: the prompts are sent to the agents on every kickoff, so trimming them trims tokens. Config and
|
||||
plans are unchanged from the original (they aren't part of the prompt). See the original's README for
|
||||
the full explanation of the pattern, how to run it, and the work-repo isolation model — the commands
|
||||
are identical, just `--config` this directory's `agents.toml`.
|
||||
|
||||
```bash
|
||||
python3 ../../agents.py status --config agents.toml
|
||||
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
|
||||
```
|
||||
|
||||
## How small?
|
||||
|
||||
`prompts/builder.md` and `prompts/adversary.md` here are roughly **half to a third** the size of the
|
||||
originals, with the same rules stated tersely. The separate **`agent-orchestrator-benchmark`** repo
|
||||
runs a head-to-head: the same task built independently by this variant and the original (both on
|
||||
Sonnet), with token counts for each — confirming the minimal prompts still get the job done and
|
||||
quantifying the savings.
|
||||
91
examples/builder-adversary-min/agents.toml
Normal file
91
examples/builder-adversary-min/agents.toml
Normal file
@ -0,0 +1,91 @@
|
||||
# examples/builder-adversary-min — minimal-prompt variant of ../builder-adversary.
|
||||
#
|
||||
# Same topology and behaviour as builder-adversary (Builder + Adversary loop pair, phase machine,
|
||||
# claim()/review() git handoff, machine-docs/ coordination). The ONLY difference is that the role +
|
||||
# kickoff prompts in prompts/ are compressed to minimal tokens while keeping every load-bearing rule.
|
||||
# Config/comments are unchanged — they aren't sent to the agents, so they don't affect token cost.
|
||||
#
|
||||
# python3 ../../agents.py status --config agents.toml
|
||||
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
|
||||
|
||||
[watchdog]
|
||||
signal_interval = 30
|
||||
heavy_interval = 300
|
||||
limit_probe_fallback = 300
|
||||
limit_reset_slack = 45
|
||||
stall_grace = 180
|
||||
|
||||
[defaults]
|
||||
session_prefix = "bamin-" # REQUIRED — sessions: bamin-builder, bamin-adv, …
|
||||
log_dir = ".ao-state"
|
||||
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
|
||||
model = "claude-sonnet-4-6"
|
||||
watch = "heal"
|
||||
|
||||
[backend.claude]
|
||||
bin = "claude"
|
||||
flags = "--dangerously-skip-permissions"
|
||||
remote_control = true
|
||||
supports_resume = true
|
||||
prompt_delivery = "arg"
|
||||
process_name = "claude"
|
||||
submit_key = "Enter"
|
||||
stall_idle = 300
|
||||
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
|
||||
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
|
||||
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
|
||||
|
||||
[backend.demo]
|
||||
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
|
||||
prompt_delivery = "exec"
|
||||
|
||||
[[agent]]
|
||||
name = "builder" # tmux session: bamin-builder
|
||||
kind = "loop"
|
||||
role = "builder"
|
||||
dir = "./work"
|
||||
watch = "heal+stall"
|
||||
|
||||
[[agent]]
|
||||
name = "adversary"
|
||||
session = "bamin-adv"
|
||||
kind = "loop"
|
||||
role = "adversary"
|
||||
dir = "./work-adv"
|
||||
watch = "heal+stall"
|
||||
|
||||
[[agent]]
|
||||
name = "orchestrator" # tmux session: bamin-orchestrator
|
||||
kind = "persistent"
|
||||
model = "claude-opus-4-8"
|
||||
resume = true
|
||||
watch = "heal"
|
||||
prompt = "You supervise this Builder/Adversary project. On startup: read machine-docs/ for the current phase's STATUS/REVIEW, confirm both loops + the watchdog are up, report the phase and any open findings/VETO. Then stay available; intervene only if the pair is stuck."
|
||||
|
||||
[[agent]]
|
||||
name = "reporter" # tmux session: bamin-reporter
|
||||
kind = "task"
|
||||
model = "claude-opus-4-8"
|
||||
watch = "none"
|
||||
enabled = false
|
||||
prompt = "The phase sequence is complete. Read machine-docs/ across all phases, write a short machine-docs/REPORT.md (what was built, each gate's final verdict, deferred items), then go idle."
|
||||
|
||||
[[service]]
|
||||
name = "cleanlogs"
|
||||
command = "python3 ../../agent-log.py follow-all"
|
||||
dir = "."
|
||||
|
||||
[loop]
|
||||
state_file = "phase-idx"
|
||||
resume_phase = true
|
||||
auto_advance = true
|
||||
done_marker = "## DONE"
|
||||
kickoff_template = "prompts/kickoff.md"
|
||||
roles_dir = "prompts"
|
||||
handoff = { repo = "./work", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
|
||||
on_complete = { trigger_file = ".run-report-on-complete", run = "reporter" }
|
||||
|
||||
phases = [
|
||||
{ id = "wc", plan = "plans/wc.md", status = "STATUS-wc.md" },
|
||||
{ id = "json", plan = "plans/json.md", status = "STATUS-json.md", models = { builder = "claude-opus-4-8" } },
|
||||
]
|
||||
2
examples/builder-adversary-min/machine-docs/.gitkeep
Normal file
2
examples/builder-adversary-min/machine-docs/.gitkeep
Normal file
@ -0,0 +1,2 @@
|
||||
# Coordination / loop-state files live here at runtime (phase-namespaced STATUS / REVIEW / BACKLOG /
|
||||
# JOURNAL, plus the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels). The loop pair populates it.
|
||||
32
examples/builder-adversary-min/plans/json.md
Normal file
32
examples/builder-adversary-min/plans/json.md
Normal file
@ -0,0 +1,32 @@
|
||||
# Phase `json` — machine-readable output
|
||||
|
||||
**Mission.** Extend the `wc.py` from the previous phase with a `--json` mode, without regressing any
|
||||
`wc`-phase behaviour. Single source of truth for this phase.
|
||||
|
||||
(The phase config gives the Builder `claude-opus-4-8` for this phase — an example of a per-phase
|
||||
model override; the Adversary stays on the default model.)
|
||||
|
||||
## Definition of Done
|
||||
|
||||
- **D1 — json output.** `python wc.py --json FILE` prints a single JSON object:
|
||||
`{"lines": N, "words": N, "chars": N, "file": "FILE"}` (valid JSON, parseable by `json.loads`).
|
||||
With stdin (no FILE), `"file"` is `null`.
|
||||
- **D2 — composes with flags.** `--json` honours `-l/-w/-c`: only the requested counts appear as keys
|
||||
(plus `file`). E.g. `wc.py --json -l FILE` → `{"lines": N, "file": "FILE"}`.
|
||||
- **D3 — no regression.** Every `wc`-phase gate (D1–D4 there) still passes unchanged.
|
||||
- **D4 — tests green.** `test_wc.py` is extended for the JSON cases and `pytest -q` is all-green.
|
||||
|
||||
## How the Adversary verifies (cold)
|
||||
|
||||
```bash
|
||||
pytest -q # D4 + D3 regression
|
||||
printf 'a b c\nd e\n' > /tmp/f.txt
|
||||
python wc.py --json /tmp/f.txt | python -c 'import sys,json; d=json.load(sys.stdin); \
|
||||
assert d=={"lines":2,"words":5,"chars":10,"file":"/tmp/f.txt"}, d; print("ok")' # D1
|
||||
python wc.py --json -l /tmp/f.txt # D2: expect {"lines": 2, "file": "/tmp/f.txt"}
|
||||
```
|
||||
|
||||
The Builder restates the exact commands, expected JSON, and commit sha in
|
||||
`machine-docs/STATUS-json.md`. When every DoD item has a fresh PASS in `machine-docs/REVIEW-json.md`
|
||||
and there is no `## VETO`, the Builder writes `## DONE` to `STATUS-json.md` — this is the last phase,
|
||||
so the watchdog then fires the one-shot `reporter` (see `agents.toml` `[loop].on_complete`).
|
||||
43
examples/builder-adversary-min/plans/wc.md
Normal file
43
examples/builder-adversary-min/plans/wc.md
Normal file
@ -0,0 +1,43 @@
|
||||
# Phase `wc` — a word-count CLI
|
||||
|
||||
**Mission.** Build a small, dependency-free `wc` clone in Python: a script `wc.py` in the work repo
|
||||
that counts lines, words, and characters, plus a `pytest` suite. This is the single source of truth
|
||||
for the phase — the Builder builds to the Definition of Done below; the Adversary cold-verifies it.
|
||||
|
||||
This task is deliberately tiny and fully local (no network, no services) so the example exercises the
|
||||
loop-pair *protocol* — claim → cold-verify → PASS/FAIL handshake — not infrastructure.
|
||||
|
||||
## Definition of Done
|
||||
|
||||
Each Dn is an independent gate. The Builder claims it (`claim(Dn): …`); the Adversary records a fresh
|
||||
PASS in `machine-docs/REVIEW-wc.md` after re-running the check from its own clone.
|
||||
|
||||
- **D1 — default output.** `python wc.py FILE` prints exactly `<lines> <words> <chars> <FILE>`
|
||||
(counts whitespace-separated words, `\n`-terminated lines, and bytes for `chars`), matching GNU
|
||||
`wc` on ASCII input.
|
||||
- **D2 — flags.** `-l`, `-w`, `-c` restrict the output to that single count (e.g. `wc.py -l FILE`
|
||||
prints `<lines> <FILE>`). Flags may combine; output order is lines, words, chars.
|
||||
- **D3 — stdin.** With no FILE argument, `wc.py` reads stdin and prints the counts with no filename.
|
||||
- **D4 — tests green.** A `test_wc.py` runs under `pytest -q` with **0 failures**, covering: an empty
|
||||
file (`0 0 0`), a multi-line fixture, the no-trailing-newline case, and each flag.
|
||||
|
||||
## How the Adversary verifies (cold)
|
||||
|
||||
From a fresh clone of the work repo:
|
||||
|
||||
```bash
|
||||
pytest -q # D4: must be all-green
|
||||
printf 'a b c\nd e\n' > /tmp/f.txt
|
||||
python wc.py /tmp/f.txt # D1: expect "2 5 10 /tmp/f.txt"
|
||||
python wc.py -l /tmp/f.txt # D2: expect "2 /tmp/f.txt"
|
||||
printf 'a b c\nd e\n' | python wc.py # D3: expect "2 5 10"
|
||||
```
|
||||
|
||||
Expected outputs are above — the Builder must restate them (and the exact commands, plus the commit
|
||||
sha) in `machine-docs/STATUS-wc.md` so the Adversary can re-run without reading the Builder's
|
||||
reasoning. Any mismatch is a FAIL with repro steps in `machine-docs/REVIEW-wc.md`.
|
||||
|
||||
## Out of scope (defer to a later phase or DEFERRED.md)
|
||||
|
||||
Multibyte/`-m` char counting, `--files0-from`, multiple-file totals, locale handling. JSON output is
|
||||
the next phase (`plans/json.md`).
|
||||
9
examples/builder-adversary-min/prompts/adversary.md
Normal file
9
examples/builder-adversary-min/prompts/adversary.md
Normal file
@ -0,0 +1,9 @@
|
||||
You are the **Adversary**, one of two independent loops: **DISBELIEVE the Builder**. Coordinate ONLY through git. The phase plan is the SSOT for what to verify.
|
||||
|
||||
Loop: run `/loop` (no interval). Verify a CLAIMED gate promptly (the watchdog pings you when the Builder claims one); idle otherwise. Cap waits at 10 min; before going idle your LAST line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>`. Compact at ~80%.
|
||||
|
||||
Verify cold from your OWN clone: re-run the plan's DoD check yourself and try to break it (edge cases, bad input) — don't trust the Builder's word. From STATUS take only what you need to re-run (command, expected result, shas); ignore its reasoning and don't read JOURNAL until after your verdict (it anchors you). Judge from the plan, the code, and your own run.
|
||||
|
||||
Git: `pull --rebase`, commit, push; never `--force`. Prefix verdicts `review(<id>): PASS|FAIL …` — pings the Builder. Write only REVIEW.md (+ your findings). Record "<id>: PASS @<ts>" + evidence, or FAIL + repro steps. You hold veto: write "## VETO <reason>".
|
||||
|
||||
Begin: read the plan, then enter the loop (clone the work repo into your dir if it exists yet).
|
||||
11
examples/builder-adversary-min/prompts/builder.md
Normal file
11
examples/builder-adversary-min/prompts/builder.md
Normal file
@ -0,0 +1,11 @@
|
||||
You are the **Builder**, one of two independent loops; coordinate ONLY through git. Read the phase plan (the SSOT) and build to its DoD.
|
||||
|
||||
Loop: run `/loop` (no interval), one unit of work per wake. Cap every wait at 10 min; before going idle your LAST output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min out) or the watchdog reboots you. Compact at ~80% context.
|
||||
|
||||
Git: `pull --rebase`, smallest change, commit, push; never `--force`. Prefix a gate claim `claim(<id>): …` — the watchdog pings the Adversary on it; use `feat/fix/status/…` otherwise. Before you claim, the tree MUST be clean (committed AND pushed): the Adversary cold-verifies from a fresh clone.
|
||||
|
||||
STATUS (in machine-docs/) must give the Adversary: WHAT is claimed (gate id + DoD items), HOW to verify (exact command), the EXPECTED result, WHERE (commit shas/paths). Reasoning goes in JOURNAL, NOT STATUS — the Adversary won't read JOURNAL before judging. Write only your files (code, STATUS, JOURNAL, build backlog); REVIEW is the Adversary's.
|
||||
|
||||
Done: write "## DONE" only when REVIEW shows a fresh PASS for every DoD item and there's no "## VETO". Never weaken/skip/delete a test; verify for real, no "should work".
|
||||
|
||||
Begin: read the plan, then enter the loop.
|
||||
6
examples/builder-adversary-min/prompts/kickoff.md
Normal file
6
examples/builder-adversary-min/prompts/kickoff.md
Normal file
@ -0,0 +1,6 @@
|
||||
*** PHASE {phase_id} ***
|
||||
Plan (this phase's single source of truth): {plan} — read it fully now; it defines the mission and the Definition of Done (DoD).
|
||||
Loop state goes under machine-docs/ (create if missing), phase-namespaced: {status}, REVIEW-{phase_id}.md, JOURNAL-{phase_id}.md, BACKLOG-{phase_id}.md. Never at the repo root.
|
||||
Done = the Builder writes "## DONE" to machine-docs/{status} ONLY after every DoD item has a fresh Adversary PASS in machine-docs/REVIEW-{phase_id}.md.
|
||||
|
||||
=== role ===
|
||||
51
examples/builder-adversary-stateless/README.md
Normal file
51
examples/builder-adversary-stateless/README.md
Normal file
@ -0,0 +1,51 @@
|
||||
# Builder/Adversary example — context-lean ("stateless") variant
|
||||
|
||||
Same pattern, same **AI-as-adversary** verification, same gates as
|
||||
[`../builder-adversary`](../builder-adversary/) and
|
||||
[`../builder-adversary-min`](../builder-adversary-min/) — but the role prompts add a **context
|
||||
hygiene** discipline so each loop carries and reloads as little conversation as possible. Nothing
|
||||
about *what* the agents do or *how* they verify changes; only how much context they drag from turn to
|
||||
turn.
|
||||
|
||||
## Why
|
||||
|
||||
In a long autonomous loop the dominant token cost is **cache-read**: every turn re-sends the
|
||||
conversation so far (the unchanged prefix is billed as cache-read, ~10% of input price, but it's
|
||||
billed *every turn*). So cost ≈ context length × turns. The role prose is a rounding error against
|
||||
that. The win is keeping the conversation short and not carrying it where it isn't needed.
|
||||
|
||||
This protocol already makes that safe: the **durable state is on disk** (git + the plan +
|
||||
STATUS/REVIEW/JOURNAL), so the conversation is disposable scratch. These prompts exploit that:
|
||||
|
||||
- **Compact at every checkpoint.** After each gate is committed (Builder) or each verdict is written
|
||||
(Adversary), run `/compact` — lossless here, because the agent reloads from git + STATUS/REVIEW.
|
||||
- **Read diffs, not trees.** `git diff <last-sha>..HEAD` and only the touched files — never re-read
|
||||
the whole repo.
|
||||
- **Spill bulk to files.** Long build/test/verification output goes to a file; read back only the
|
||||
slice you need, instead of dumping it into context.
|
||||
- **Adversary loads only {plan, STATUS, diff}** per gate — full cold AI judgment, tiny footprint.
|
||||
|
||||
## Config note
|
||||
|
||||
Run the loop agents **non-resumed** (the default in this `agents.toml` — loop agents don't set
|
||||
`resume = true`), so each time the watchdog restarts a loop (notably at every phase advance) it
|
||||
starts a *fresh* session rather than carrying the prior phase's whole conversation forward. The
|
||||
in-phase shrinking is done by `/compact` per the prompts above.
|
||||
|
||||
> A natural future engine lever (not yet implemented) would be a watchdog policy that **recycles a
|
||||
> loop's session after each checkpoint commit** (claim/review), giving fresh context *per gate*
|
||||
> rather than per phase — the same idea, enforced by the harness instead of the prompt.
|
||||
|
||||
## Compared
|
||||
|
||||
The **`agent-orchestrator-benchmark`** repo runs this variant head-to-head against
|
||||
`builder-adversary` and `builder-adversary-min` on the same multi-phase task (all on Sonnet),
|
||||
reporting tokens per loop — to quantify how much the context discipline saves while keeping identical
|
||||
gate outcomes.
|
||||
|
||||
```bash
|
||||
python3 ../../agents.py status --config agents.toml
|
||||
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
|
||||
```
|
||||
|
||||
> **Prompt base:** these prompts are the **full original** `builder-adversary` prompts plus the additions above — NOT the minimal ones — so that comparing this variant to `builder-adversary` isolates its specific change (context hygiene / review granularity) without the minimal-prompt testing-pressure drop.
|
||||
92
examples/builder-adversary-stateless/agents.toml
Normal file
92
examples/builder-adversary-stateless/agents.toml
Normal file
@ -0,0 +1,92 @@
|
||||
# examples/builder-adversary-stateless — context-lean variant of ../builder-adversary (FULL original prompts + context hygiene).
|
||||
#
|
||||
# Same topology, behaviour, and AI-as-adversary verification as builder-adversary. The prompts add a
|
||||
# CONTEXT HYGIENE discipline (compact at every checkpoint, read diffs not trees, spill bulk to files,
|
||||
# adversary loads only {plan, STATUS, diff}) so each loop carries/reloads minimal conversation —
|
||||
# cache-read is the dominant cost in a long loop. Loop agents are NOT resumed (default below), so the
|
||||
# watchdog gives a fresh session per phase. See README.md.
|
||||
#
|
||||
# python3 ../../agents.py status --config agents.toml
|
||||
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
|
||||
|
||||
[watchdog]
|
||||
signal_interval = 30
|
||||
heavy_interval = 300
|
||||
limit_probe_fallback = 300
|
||||
limit_reset_slack = 45
|
||||
stall_grace = 180
|
||||
|
||||
[defaults]
|
||||
session_prefix = "bastl-" # REQUIRED — sessions: bastl-builder, bastl-adv, …
|
||||
log_dir = ".ao-state"
|
||||
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
|
||||
model = "claude-sonnet-4-6"
|
||||
watch = "heal"
|
||||
|
||||
[backend.claude]
|
||||
bin = "claude"
|
||||
flags = "--dangerously-skip-permissions"
|
||||
remote_control = true
|
||||
supports_resume = true
|
||||
prompt_delivery = "arg"
|
||||
process_name = "claude"
|
||||
submit_key = "Enter"
|
||||
stall_idle = 300
|
||||
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
|
||||
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
|
||||
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
|
||||
|
||||
[backend.demo]
|
||||
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
|
||||
prompt_delivery = "exec"
|
||||
|
||||
[[agent]]
|
||||
name = "builder" # tmux session: bastl-builder
|
||||
kind = "loop"
|
||||
role = "builder"
|
||||
dir = "./work"
|
||||
watch = "heal+stall"
|
||||
|
||||
[[agent]]
|
||||
name = "adversary"
|
||||
session = "bastl-adv"
|
||||
kind = "loop"
|
||||
role = "adversary"
|
||||
dir = "./work-adv"
|
||||
watch = "heal+stall"
|
||||
|
||||
[[agent]]
|
||||
name = "orchestrator" # tmux session: bastl-orchestrator
|
||||
kind = "persistent"
|
||||
model = "claude-opus-4-8"
|
||||
resume = true
|
||||
watch = "heal"
|
||||
prompt = "You supervise this Builder/Adversary project. On startup: read machine-docs/ for the current phase's STATUS/REVIEW, confirm both loops + the watchdog are up, report the phase and any open findings/VETO. Then stay available; intervene only if the pair is stuck."
|
||||
|
||||
[[agent]]
|
||||
name = "reporter" # tmux session: bastl-reporter
|
||||
kind = "task"
|
||||
model = "claude-opus-4-8"
|
||||
watch = "none"
|
||||
enabled = false
|
||||
prompt = "The phase sequence is complete. Read machine-docs/ across all phases, write a short machine-docs/REPORT.md (what was built, each gate's final verdict, deferred items), then go idle."
|
||||
|
||||
[[service]]
|
||||
name = "cleanlogs"
|
||||
command = "python3 ../../agent-log.py follow-all"
|
||||
dir = "."
|
||||
|
||||
[loop]
|
||||
state_file = "phase-idx"
|
||||
resume_phase = true
|
||||
auto_advance = true
|
||||
done_marker = "## DONE"
|
||||
kickoff_template = "prompts/kickoff.md"
|
||||
roles_dir = "prompts"
|
||||
handoff = { repo = "./work", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
|
||||
on_complete = { trigger_file = ".run-report-on-complete", run = "reporter" }
|
||||
|
||||
phases = [
|
||||
{ id = "wc", plan = "plans/wc.md", status = "STATUS-wc.md" },
|
||||
{ id = "json", plan = "plans/json.md", status = "STATUS-json.md", models = { builder = "claude-opus-4-8" } },
|
||||
]
|
||||
@ -0,0 +1,2 @@
|
||||
# Coordination / loop-state files live here at runtime (phase-namespaced STATUS / REVIEW / BACKLOG /
|
||||
# JOURNAL, plus the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels). The loop pair populates it.
|
||||
32
examples/builder-adversary-stateless/plans/json.md
Normal file
32
examples/builder-adversary-stateless/plans/json.md
Normal file
@ -0,0 +1,32 @@
|
||||
# Phase `json` — machine-readable output
|
||||
|
||||
**Mission.** Extend the `wc.py` from the previous phase with a `--json` mode, without regressing any
|
||||
`wc`-phase behaviour. Single source of truth for this phase.
|
||||
|
||||
(The phase config gives the Builder `claude-opus-4-8` for this phase — an example of a per-phase
|
||||
model override; the Adversary stays on the default model.)
|
||||
|
||||
## Definition of Done
|
||||
|
||||
- **D1 — json output.** `python wc.py --json FILE` prints a single JSON object:
|
||||
`{"lines": N, "words": N, "chars": N, "file": "FILE"}` (valid JSON, parseable by `json.loads`).
|
||||
With stdin (no FILE), `"file"` is `null`.
|
||||
- **D2 — composes with flags.** `--json` honours `-l/-w/-c`: only the requested counts appear as keys
|
||||
(plus `file`). E.g. `wc.py --json -l FILE` → `{"lines": N, "file": "FILE"}`.
|
||||
- **D3 — no regression.** Every `wc`-phase gate (D1–D4 there) still passes unchanged.
|
||||
- **D4 — tests green.** `test_wc.py` is extended for the JSON cases and `pytest -q` is all-green.
|
||||
|
||||
## How the Adversary verifies (cold)
|
||||
|
||||
```bash
|
||||
pytest -q # D4 + D3 regression
|
||||
printf 'a b c\nd e\n' > /tmp/f.txt
|
||||
python wc.py --json /tmp/f.txt | python -c 'import sys,json; d=json.load(sys.stdin); \
|
||||
assert d=={"lines":2,"words":5,"chars":10,"file":"/tmp/f.txt"}, d; print("ok")' # D1
|
||||
python wc.py --json -l /tmp/f.txt # D2: expect {"lines": 2, "file": "/tmp/f.txt"}
|
||||
```
|
||||
|
||||
The Builder restates the exact commands, expected JSON, and commit sha in
|
||||
`machine-docs/STATUS-json.md`. When every DoD item has a fresh PASS in `machine-docs/REVIEW-json.md`
|
||||
and there is no `## VETO`, the Builder writes `## DONE` to `STATUS-json.md` — this is the last phase,
|
||||
so the watchdog then fires the one-shot `reporter` (see `agents.toml` `[loop].on_complete`).
|
||||
43
examples/builder-adversary-stateless/plans/wc.md
Normal file
43
examples/builder-adversary-stateless/plans/wc.md
Normal file
@ -0,0 +1,43 @@
|
||||
# Phase `wc` — a word-count CLI
|
||||
|
||||
**Mission.** Build a small, dependency-free `wc` clone in Python: a script `wc.py` in the work repo
|
||||
that counts lines, words, and characters, plus a `pytest` suite. This is the single source of truth
|
||||
for the phase — the Builder builds to the Definition of Done below; the Adversary cold-verifies it.
|
||||
|
||||
This task is deliberately tiny and fully local (no network, no services) so the example exercises the
|
||||
loop-pair *protocol* — claim → cold-verify → PASS/FAIL handshake — not infrastructure.
|
||||
|
||||
## Definition of Done
|
||||
|
||||
Each Dn is an independent gate. The Builder claims it (`claim(Dn): …`); the Adversary records a fresh
|
||||
PASS in `machine-docs/REVIEW-wc.md` after re-running the check from its own clone.
|
||||
|
||||
- **D1 — default output.** `python wc.py FILE` prints exactly `<lines> <words> <chars> <FILE>`
|
||||
(counts whitespace-separated words, `\n`-terminated lines, and bytes for `chars`), matching GNU
|
||||
`wc` on ASCII input.
|
||||
- **D2 — flags.** `-l`, `-w`, `-c` restrict the output to that single count (e.g. `wc.py -l FILE`
|
||||
prints `<lines> <FILE>`). Flags may combine; output order is lines, words, chars.
|
||||
- **D3 — stdin.** With no FILE argument, `wc.py` reads stdin and prints the counts with no filename.
|
||||
- **D4 — tests green.** A `test_wc.py` runs under `pytest -q` with **0 failures**, covering: an empty
|
||||
file (`0 0 0`), a multi-line fixture, the no-trailing-newline case, and each flag.
|
||||
|
||||
## How the Adversary verifies (cold)
|
||||
|
||||
From a fresh clone of the work repo:
|
||||
|
||||
```bash
|
||||
pytest -q # D4: must be all-green
|
||||
printf 'a b c\nd e\n' > /tmp/f.txt
|
||||
python wc.py /tmp/f.txt # D1: expect "2 5 10 /tmp/f.txt"
|
||||
python wc.py -l /tmp/f.txt # D2: expect "2 /tmp/f.txt"
|
||||
printf 'a b c\nd e\n' | python wc.py # D3: expect "2 5 10"
|
||||
```
|
||||
|
||||
Expected outputs are above — the Builder must restate them (and the exact commands, plus the commit
|
||||
sha) in `machine-docs/STATUS-wc.md` so the Adversary can re-run without reading the Builder's
|
||||
reasoning. Any mismatch is a FAIL with repro steps in `machine-docs/REVIEW-wc.md`.
|
||||
|
||||
## Out of scope (defer to a later phase or DEFERRED.md)
|
||||
|
||||
Multibyte/`-m` char counting, `--files0-from`, multiple-file totals, locale handling. JSON output is
|
||||
the next phase (`plans/json.md`).
|
||||
32
examples/builder-adversary-stateless/prompts/adversary.md
Normal file
32
examples/builder-adversary-stateless/prompts/adversary.md
Normal file
@ -0,0 +1,32 @@
|
||||
You are the **Adversary** — one of two independent loops. Your job is to **DISBELIEVE the Builder**. You run as a SEPARATE process and coordinate ONLY through the git repo. Read the phase plan named in the kickoff above in full — it is the single source of truth for WHAT is being verified.
|
||||
|
||||
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. When a gate is CLAIMED (or the watchdog pings you that one is), verify it promptly — that is top priority. When nothing is pending you may IDLE freely (sleep in chunks of **≤10 min**); you do NOT need to busy-poll to look busy — the watchdog pings you the instant the Builder claims a gate. Poll ~4 min only while actively watching a CLAIMED gate's run. Keep running independent break-it probes even when no gate is pending. Stop only when STATUS says "## DONE" and you have logged a fresh PASS for every DoD item.
|
||||
|
||||
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
|
||||
- **Cap every wait at 10 minutes.** Never a single ScheduleWakeup > 600 s; to wait longer, wake, re-check, wait again.
|
||||
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min out, matching your ScheduleWakeup; compute with `date -u -d '+10 min' +%FT%TZ`). Idle ≥5 min with no current marker, or past the named time → the watchdog kills + reboots you; you resume cleanly from git + your REVIEW/STATUS files.
|
||||
- **Compact proactively** at ≳80% context — your state is in git + REVIEW/STATUS, so compaction is lossless.
|
||||
|
||||
**Coordinate ONLY through git:**
|
||||
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root. If you find one at the root, `git mv` it in.
|
||||
- **Keep your OWN clone** (the `dir` this agent runs in). You verify from a COLD START in it. If the work repo doesn't exist yet, wait and retry on your next wake — the Builder creates it first.
|
||||
- `git pull --rebase` before every edit; commit; push; **never `--force`.**
|
||||
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit that records a **verdict or finding** with `review(...)` (e.g. `review(D2): PASS` / `review(D2): FAIL — repro …`). The watchdog watches origin/main and pings the Builder the moment a `review(` commit lands — that IS the handoff signal. (The Builder's gate claims are `claim(...)`.)
|
||||
- Write ONLY your files: REVIEW and the "## Adversary findings" section of BACKLOG. Everything else (code, STATUS, JOURNAL, "## Build backlog") is read-only to you.
|
||||
- **INBOX side-channel.** For non-gate messages to the Builder, append `machine-docs/BUILDER-INBOX.md` and push (the watchdog edge-pings the Builder). To receive from the Builder, look for `machine-docs/ADVERSARY-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). Formal verdicts still live in REVIEW.
|
||||
|
||||
**ISOLATION DISCIPLINE (anti-anchoring — critical).** The Builder is REQUIRED to give you, in STATUS, the verification info you need: WHAT is claimed, HOW to verify it (the exact command/check), the EXPECTED outcome, and WHERE the inputs live. **Read STATUS for that — you need all of it.** What you must IGNORE — in STATUS, and NEVER read in JOURNAL before your verdict — is the Builder's REASONING / RATIONALISATIONS ("I think this passes because…", design narrative, dead-ends). Reading those anchors you. Form your verdict from: (a) the phase plan = SSOT, (b) the code / git history, (c) the verification info the Builder passed in STATUS, and (d) your OWN cold acceptance run that re-executes the check against the expected outcomes. Only AFTER writing your verdict may you consult JOURNAL (note in REVIEW that you did). Trust observable behaviour, the plan, and your own re-run — not the Builder's narrative.
|
||||
|
||||
**Each wake:**
|
||||
1. Pull. Read STATUS for any "Gate: <id> CLAIMED, awaiting Adversary".
|
||||
2. Verify the claim from a COLD START (fresh shell, your own clone, no cached state). Re-run the DoD acceptance check yourself; do not trust the Builder's word.
|
||||
3. Actively try to BREAK it — edge cases, malformed input, the failure modes the plan names. A claim you can't break is a claim that PASSES; a claim you can break is a finding.
|
||||
4. Record verdicts in REVIEW ("<id>: PASS @<ts>" + evidence, or FAIL with repro steps). File each defect as a "## Adversary findings" item; only YOU close those, after re-test. You hold veto: write "## VETO <reason>" to REVIEW to forbid DONE until cleared.
|
||||
5. Push (with a `review(...)` prefix). Schedule the next wake.
|
||||
|
||||
CONTEXT HYGIENE — your durable state is REVIEW + git, so the conversation is disposable scratch; keep it small so you don't pay to reload it every turn:
|
||||
- Per gate, load only what you need to judge it: the plan, the Builder's STATUS, and the diff since the last verified sha (`git diff <sha>..HEAD`). Don't re-read the whole repo or earlier gates.
|
||||
- After writing each verdict (a durable checkpoint), run `/compact` — lossless here; you reload from REVIEW + git.
|
||||
- Spill bulk to files: pipe long verification/test output to a file and read back only the part you need.
|
||||
|
||||
Begin: read the phase plan, then enter the self-paced loop (start by cloning the work repo into your `dir` if it exists yet).
|
||||
37
examples/builder-adversary-stateless/prompts/builder.md
Normal file
37
examples/builder-adversary-stateless/prompts/builder.md
Normal file
@ -0,0 +1,37 @@
|
||||
You are the **Builder** — one of two independent loops working on this project. Your job is to build what the phase plan specifies, autonomously, over many wake cycles. You run as a SEPARATE process from the Adversary and coordinate with it ONLY through the git repo.
|
||||
|
||||
Single source of truth: the phase plan named in the kickoff above. Read it in full now, then begin.
|
||||
|
||||
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Each iteration = one unit of work. Pace yourself:
|
||||
- A long task in flight (build / test suite / e2e) → **poll every ~5 min**, never one big sleep matching the expected runtime (catch a failure at minute 4 of a 25-min run, not at minute 25).
|
||||
- Parked at a CLAIMED gate with no other unblocked work → the watchdog pings you the instant the Adversary writes a verdict or an inbox message, so you may wait; keep a fallback self-poll ~2–4 min in case a ping is missed.
|
||||
- Genuinely idle → sleep in chunks of **≤10 min**. Prefer keeping an unblocked backlog item in hand so you rarely just wait.
|
||||
|
||||
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
|
||||
- **Cap every wait at 10 minutes.** To wait longer, wake at 10 min, re-check, wait again. Never a single ScheduleWakeup > 600 s.
|
||||
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` — the time you will resume (≤10 min out, matching your ScheduleWakeup). Compute it from the clock (`date -u -d '+10 min' +%FT%TZ`). If the watchdog sees you idle ≥5 min with no current marker as your last line, OR idle past the time it names, it kills + reboots you — you resume cleanly from git + your STATUS/REVIEW files.
|
||||
- **Compact proactively.** If context usage climbs high (≳80%), run `/compact` before continuing — your loop state lives in git + the phase STATUS/REVIEW, so compaction is lossless and prevents wedging at the context limit.
|
||||
|
||||
**Coordinate ONLY through git:**
|
||||
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root — phase-namespaced STATUS/BACKLOG/REVIEW/JOURNAL, plus DECISIONS.md and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. Create `machine-docs/` if missing; if you find such a file at the root, `git mv` it in.
|
||||
- `git pull --rebase` before every edit; make the smallest change; commit; push. **Never `--force`.**
|
||||
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit with its conventional type. CRITICALLY: prefix a commit that **claims a gate** with `claim(...)` (e.g. `claim(D2): tests green`). The watchdog watches origin/main and pings the Adversary the moment a `claim(` commit lands — that IS the handoff signal. Keep using the other types too (`feat/fix/status/journal/decisions/chore/inbox(...)`), but `claim(` is what triggers verification.
|
||||
- **CLEAN TREE BEFORE CLAIM.** Run `git status` before you claim — the working tree MUST be clean (everything committed AND pushed). The Adversary cold-verifies from a fresh clone, so any un-pushed change that only exists on your host is a guaranteed verify mismatch. Push first, then claim.
|
||||
- **ARTIFACT-LAYER ISOLATION — the one rule that makes verification work.** STATUS MUST give the Adversary everything it needs to verify your claim: **WHAT** is claimed (gate id, DoD items), **HOW** to verify it (the exact command/check it can re-run from its own clone), the **EXPECTED** outcome (outputs, hashes, exit codes), and **WHERE** the inputs live (commit shas, paths). STATUS MUST NOT contain rationalisations — "I think this passes because…", design narrative, dead-ends. Those go in JOURNAL, which the Adversary is instructed NOT to read before its verdict (anti-anchoring). The line: **WHAT + HOW + EXPECTED + WHERE = STATUS; WHY = JOURNAL.** DECISIONS.md is for SETTLED design decisions, not in-the-moment reasoning.
|
||||
- **At each gate:** set "Gate: <id> CLAIMED, awaiting Adversary" in STATUS and work other unblocked items; do NOT advance past the gate until REVIEW shows its PASS.
|
||||
- **INBOX side-channel.** For non-gate messages to the Adversary (a heads-up, "starting a long run, please cold-verify X meanwhile"), append `machine-docs/ADVERSARY-INBOX.md` and push — the watchdog edge-pings the Adversary. To receive from the Adversary, look for `machine-docs/BUILDER-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). The inbox is a side-channel; formal CLAIMS still live in STATUS.
|
||||
- Write ONLY your files: source/config, STATUS, JOURNAL, DECISIONS, and the "## Build backlog" section of BACKLOG. Treat REVIEW and "## Adversary findings" as read-only — the Adversary owns them.
|
||||
|
||||
**Overriding rules:**
|
||||
- "Done" is defined ONLY by the plan's DoD, Adversary-verified. No self-certifying. Write "## DONE" to STATUS only when REVIEW shows a fresh PASS for every DoD item and there is no standing "## VETO".
|
||||
- Verify every change against real behaviour; paste the command + its output into JOURNAL. No "should work."
|
||||
- Never weaken, skip, or delete a test to make a run pass. A red test is information.
|
||||
- 3rd identical failure → stop, record the dead-end in DECISIONS.md, change approach or mark blocked.
|
||||
|
||||
CONTEXT HYGIENE — your durable state is git + STATUS/JOURNAL, so the conversation is disposable scratch; keep it small so you don't pay to reload it every turn:
|
||||
- After each gate is committed+pushed (a durable checkpoint), run `/compact` — it's lossless here, you reload what you need from git + STATUS.
|
||||
- Read DIFFS, not trees: `git diff <last-sha>..HEAD` and only the files you're touching; don't re-read the whole repo.
|
||||
- Spill bulk to files: pipe long build/test output to a file and read back only the part you need — don't dump it into the conversation.
|
||||
- On a fresh wake, reconstruct from the plan + STATUS + a diff; don't rebuild context by re-reading everything.
|
||||
|
||||
Begin: read the phase plan, then enter the self-paced loop.
|
||||
8
examples/builder-adversary-stateless/prompts/kickoff.md
Normal file
8
examples/builder-adversary-stateless/prompts/kickoff.md
Normal file
@ -0,0 +1,8 @@
|
||||
*** PHASE {phase_id} ***
|
||||
SINGLE SOURCE OF TRUTH for this phase: {plan} — read it in full now. It defines this phase's mission and its Definition of Done (DoD).
|
||||
Track loop state in PHASE-NAMESPACED files UNDER machine-docs/ in your clone (create the dir if missing): machine-docs/{status}, machine-docs/BACKLOG-{phase_id}.md, machine-docs/REVIEW-{phase_id}.md, machine-docs/JOURNAL-{phase_id}.md. machine-docs/DECISIONS.md is shared (append-only).
|
||||
FILE-LOCATION RULE (mandatory): ALL coordination / loop-state files live in machine-docs/, NEVER the repo root — that includes STATUS/BACKLOG/REVIEW/JOURNAL (phase-namespaced), DECISIONS.md, and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. If you ever find one at the root, git mv it into machine-docs/.
|
||||
"Done" for this phase = the Builder writes "## DONE" to machine-docs/{status} ONLY after EVERY DoD item is Adversary-verified with a fresh PASS in machine-docs/REVIEW-{phase_id}.md (handshake below).
|
||||
Wherever the standing role below says "the plan" / "STATUS" / "REVIEW", substitute {plan} and these machine-docs/ phase-namespaced files.
|
||||
|
||||
=== standing role & rules ===
|
||||
85
examples/builder-adversary/README.md
Normal file
85
examples/builder-adversary/README.md
Normal file
@ -0,0 +1,85 @@
|
||||
# Builder/Adversary example
|
||||
|
||||
A complete, self-contained instance of the **Builder/Adversary loop pair** — the pattern
|
||||
[cc-ci](https://git.autonomic.zone) runs in production, distilled to a tiny, fully-local task so you
|
||||
can read it end-to-end and run it without any infrastructure.
|
||||
|
||||
Two AI loops work the same plan but never trust each other; they coordinate **only through a git
|
||||
repo**:
|
||||
|
||||
- **Builder** (`prompts/builder.md`) — builds to the phase plan's Definition of Done, and *claims*
|
||||
each gate with a `claim(...)`-prefixed commit when it believes a DoD item is met.
|
||||
- **Adversary** (`prompts/adversary.md`) — *disbelieves* the Builder, cold-verifies every claim from
|
||||
its **own clone**, and records PASS/FAIL with a `review(...)`-prefixed commit. Holds veto.
|
||||
- **Orchestrator** (persistent) supervises; **Reporter** (one-shot) writes a summary when the phase
|
||||
sequence finishes.
|
||||
|
||||
The watchdog keeps the loops alive, paces them, and turns those commit prefixes into the handoff:
|
||||
a `claim(` commit pings the Adversary, a `review(` commit pings the Builder.
|
||||
|
||||
## Files
|
||||
|
||||
```
|
||||
agents.toml the whole project: backends, the 4 agents + a service, the phase machine
|
||||
prompts/
|
||||
kickoff.md per-phase preamble (slots {phase_id}/{plan}/{status}/{role})
|
||||
builder.md Builder role + loop protocol
|
||||
adversary.md Adversary role + anti-anchoring verification discipline
|
||||
plans/
|
||||
wc.md phase 1 — build a `wc` CLI (the single source of truth for that phase)
|
||||
json.md phase 2 — add `--json` (shows a per-phase model override)
|
||||
machine-docs/ where the loops write STATUS / REVIEW / BACKLOG / JOURNAL at runtime
|
||||
```
|
||||
|
||||
## The task
|
||||
|
||||
Build a small `wc` clone (`wc.py` + a `pytest` suite) in the **work repo**, in two phases. It is
|
||||
deliberately trivial and offline — the point is to exercise the *protocol* (claim → cold-verify →
|
||||
PASS/FAIL → advance), not to build anything hard. See `plans/wc.md` and `plans/json.md` for the
|
||||
Definitions of Done.
|
||||
|
||||
## Run it
|
||||
|
||||
Needs `claude` on `PATH` (the loops are real agents). From this directory:
|
||||
|
||||
```bash
|
||||
python3 ../../agents.py status --config agents.toml # read-only: what would run
|
||||
python3 ../../agents.py up --config agents.toml # start builder + adversary + orchestrator + watchdog
|
||||
python3 ../../agents.py logs builder --config agents.toml
|
||||
python3 ../../agents.py phase show --config agents.toml
|
||||
python3 ../../agents.py down --config agents.toml # stop everything
|
||||
```
|
||||
|
||||
To watch the **mechanics** without an agent CLI, set `defaults.backend = "demo"` in `agents.toml`
|
||||
(the demo backend just idles) and run `up` / `status` / `down` — sessions start and the watchdog
|
||||
ticks, but no real work happens. The repo's top-level `./smoke.sh` shows this end-to-end for the
|
||||
sibling `agents.example.toml`.
|
||||
|
||||
## The work repo (and isolation)
|
||||
|
||||
The loops build in a **work repo** — `handoff.repo` in `agents.toml`, here `./work`. For this
|
||||
quick start both loops can share it, but the pattern's real strength is **cold verification**: give
|
||||
each loop its **own clone of the same remote** so the Adversary verifies from a genuinely
|
||||
independent checkout (exactly what cc-ci does with separate `cc-ci` / `cc-ci-adv` clones).
|
||||
|
||||
To set that up:
|
||||
|
||||
1. Create the work repo with a remote both loops can push/pull (any git host, or a bare repo on the
|
||||
same box). Put `machine-docs/` in it.
|
||||
2. Clone it twice: into `./work` (Builder's `dir`) and `./work-adv` (Adversary's `dir`).
|
||||
3. Point `handoff.repo` at the Builder's clone (`./work`).
|
||||
|
||||
The watchdog then watches that repo's `origin/main` for `claim(`/`review(` commits and the two
|
||||
`*-INBOX.md` files, and pings the right loop on each.
|
||||
|
||||
## How to adapt it
|
||||
|
||||
- **Different task** → rewrite `plans/*.md` (each is one phase's source of truth + DoD) and adjust
|
||||
the `[loop].phases` list. Nothing else needs to change.
|
||||
- **More/fewer phases** → add or remove entries in `[loop].phases`; the watchdog advances when a
|
||||
phase's `status` file contains `## DONE`.
|
||||
- **Per-phase models** → `models = { builder = "...", adversary = "..." }` on a phase (see `json`).
|
||||
- **A periodic supervisor nudge** → uncomment the `wake = { ... }` line on the `orchestrator` agent.
|
||||
|
||||
This example carries **no** project-orchestrator/fleet metadata — like any project, it can be run by
|
||||
hand and has no idea a fleet exists. See the repo root `README.md` for the full harness reference.
|
||||
125
examples/builder-adversary/agents.toml
Normal file
125
examples/builder-adversary/agents.toml
Normal file
@ -0,0 +1,125 @@
|
||||
# examples/builder-adversary — a Builder/Adversary loop pair (the cc-ci pattern, generic).
|
||||
#
|
||||
# Two independent agent loops that coordinate ONLY through a git repo:
|
||||
# • Builder — does the work, claims each gate when it believes a Definition-of-Done item is met.
|
||||
# • Adversary — DISBELIEVES the Builder; cold-verifies every claim from its own clone, PASS/FAIL.
|
||||
# A persistent Orchestrator supervises; a one-shot Reporter runs on completion. The watchdog keeps
|
||||
# them alive, paced, and signals the handoff (claim(…) → ping Adversary, review(…) → ping Builder).
|
||||
#
|
||||
# This is the same shape cc-ci runs in production, stripped to a small self-contained task: build a
|
||||
# `wc` CLI (see plans/). Nothing here is project-orchestrator/fleet aware — it is a plain project.
|
||||
#
|
||||
# Run it by hand (status starts nothing):
|
||||
# python3 ../../agents.py status --config agents.toml
|
||||
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
|
||||
# python3 ../../agents.py down --config agents.toml
|
||||
# To exercise the mechanics with no agent CLI, set defaults.backend = "demo" (idles, no real work).
|
||||
|
||||
# ─────────────────────────── global watchdog cadence ───────────────────────────
|
||||
[watchdog]
|
||||
signal_interval = 30 # s between handoff / stall / limit checks (light)
|
||||
heavy_interval = 300 # s between heal / phase-advance checks
|
||||
limit_probe_fallback = 300 # flat probe cadence when a reset time can't be parsed
|
||||
limit_reset_slack = 45 # s past a parsed reset before probing
|
||||
stall_grace = 180 # s of slack past a WAITING-UNTIL marker before a stall reboot
|
||||
|
||||
# ─────────────────────────── defaults inherited by every agent ───────────────────────────
|
||||
[defaults]
|
||||
session_prefix = "ba-" # REQUIRED — tmux namespace (sessions: ba-builder, ba-adv, …)
|
||||
log_dir = ".ao-state" # REQUIRED — logs + state/, resolved relative to this file
|
||||
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
|
||||
model = "claude-sonnet-4-6"
|
||||
watch = "heal" # none | heal | heal+stall
|
||||
|
||||
# ─────────────────────────── backends (declared as data) ───────────────────────────
|
||||
[backend.claude]
|
||||
bin = "claude"
|
||||
flags = "--dangerously-skip-permissions"
|
||||
remote_control = true
|
||||
supports_resume = true
|
||||
prompt_delivery = "arg" # full prompt passed as a CLI argument
|
||||
process_name = "claude" # enables backend-mismatch healing
|
||||
submit_key = "Enter"
|
||||
stall_idle = 300
|
||||
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
|
||||
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
|
||||
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
|
||||
|
||||
[backend.demo] # dependency-free: a shell that just idles (no real work)
|
||||
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
|
||||
prompt_delivery = "exec"
|
||||
|
||||
# ─────────────────────────── agents ───────────────────────────
|
||||
# The loop pair is the star. The work repo (handoff.repo, below) is what they build in; for TRUE
|
||||
# cold-verification give each loop its OWN clone of that repo (see README "Isolation"). Here both
|
||||
# default to ./work for a single-host quick start.
|
||||
|
||||
[[agent]]
|
||||
name = "builder" # tmux session: ba-builder
|
||||
kind = "loop" # kickoff = prompts/kickoff.md (per phase) + prompts/builder.md
|
||||
role = "builder"
|
||||
dir = "./work" # the Builder's working clone of the work repo
|
||||
watch = "heal+stall" # restart if dead/wedged AND if idle past stall_idle (respects WAITING-UNTIL)
|
||||
|
||||
[[agent]]
|
||||
name = "adversary"
|
||||
session = "ba-adv" # abbreviated session name (handy in logs / remote-control)
|
||||
kind = "loop"
|
||||
role = "adversary"
|
||||
dir = "./work-adv" # the Adversary's SEPARATE clone — it verifies from a cold start
|
||||
watch = "heal+stall"
|
||||
|
||||
[[agent]]
|
||||
name = "orchestrator" # tmux session: ba-orchestrator
|
||||
kind = "persistent"
|
||||
model = "claude-opus-4-8"
|
||||
resume = true # claude --resume <state/orchestrator.id>
|
||||
watch = "heal" # keep it alive/healed; never stall-reboot a persistent supervisor
|
||||
prompt = """
|
||||
You supervise this Builder/Adversary project. On startup: read machine-docs/ (the current phase's \
|
||||
STATUS / REVIEW / JOURNAL) to see where the loop pair is, confirm both loops and the watchdog are \
|
||||
up, and report the current phase and any open Adversary findings or VETO. Then stay available; \
|
||||
intervene only if the pair is stuck (repeated FAIL on the same gate, a stall the watchdog can't \
|
||||
clear, or an operator request)."""
|
||||
# A periodic nudge is optional — uncomment to have the watchdog wake it on a timer:
|
||||
# wake = { interval = 3600, prompt_file = "prompts/supervise.md" }
|
||||
|
||||
[[agent]]
|
||||
name = "reporter" # tmux session: ba-reporter
|
||||
kind = "task" # one-shot: runs to completion, then idles
|
||||
model = "claude-opus-4-8"
|
||||
watch = "none"
|
||||
enabled = false # not started by a bare `up`; fired by [loop].on_complete below
|
||||
prompt = """
|
||||
The phase sequence is complete. Read machine-docs/ across all phases and write a short \
|
||||
machine-docs/REPORT.md summarising what was built, every gate's final Adversary verdict, and any \
|
||||
deferred items. Then go idle."""
|
||||
|
||||
# Non-AI helper service (tail + render the loop transcripts). Started by `up`, killed by `down`.
|
||||
[[service]]
|
||||
name = "cleanlogs" # tmux session: ba-cleanlogs
|
||||
command = "python3 ../../agent-log.py follow-all"
|
||||
dir = "."
|
||||
|
||||
# ─────────────────────────── the phase machine (kind="loop" agents) ───────────────────────────
|
||||
[loop]
|
||||
state_file = "phase-idx" # under <log_dir>/state/
|
||||
resume_phase = true # keep the current index across restarts (don't reset to 0)
|
||||
auto_advance = true # advance when the phase's status file shows the done_marker
|
||||
done_marker = "## DONE"
|
||||
kickoff_template = "prompts/kickoff.md" # phase preamble; slots {phase_id}/{plan}/{status}/{role}
|
||||
roles_dir = "prompts" # role prompt = prompts/<role>.md
|
||||
|
||||
# Handoff: the watchdog watches the work repo's origin/main and the two inbox files, and pings the
|
||||
# other loop on the matching signal. claim(…) commits → ping Adversary; review(…) → ping Builder.
|
||||
handoff = { repo = "./work", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
|
||||
|
||||
# When the last phase completes, fire the one-shot reporter (its trigger file under <log_dir>).
|
||||
on_complete = { trigger_file = ".run-report-on-complete", run = "reporter" }
|
||||
|
||||
# Phase sequence. Each plan is this phase's single source of truth; status is where the Builder
|
||||
# writes "## DONE". The second phase shows a per-phase model override (Builder on opus for it).
|
||||
phases = [
|
||||
{ id = "wc", plan = "plans/wc.md", status = "STATUS-wc.md" },
|
||||
{ id = "json", plan = "plans/json.md", status = "STATUS-json.md", models = { builder = "claude-opus-4-8" } },
|
||||
]
|
||||
3
examples/builder-adversary/machine-docs/.gitkeep
Normal file
3
examples/builder-adversary/machine-docs/.gitkeep
Normal file
@ -0,0 +1,3 @@
|
||||
# Coordination / loop-state files live here at runtime (phase-namespaced STATUS / REVIEW / BACKLOG /
|
||||
# JOURNAL, shared DECISIONS.md, and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels).
|
||||
# This .gitkeep just ensures the directory exists; the loop pair populates it. See ../README.md.
|
||||
32
examples/builder-adversary/plans/json.md
Normal file
32
examples/builder-adversary/plans/json.md
Normal file
@ -0,0 +1,32 @@
|
||||
# Phase `json` — machine-readable output
|
||||
|
||||
**Mission.** Extend the `wc.py` from the previous phase with a `--json` mode, without regressing any
|
||||
`wc`-phase behaviour. Single source of truth for this phase.
|
||||
|
||||
(The phase config gives the Builder `claude-opus-4-8` for this phase — an example of a per-phase
|
||||
model override; the Adversary stays on the default model.)
|
||||
|
||||
## Definition of Done
|
||||
|
||||
- **D1 — json output.** `python wc.py --json FILE` prints a single JSON object:
|
||||
`{"lines": N, "words": N, "chars": N, "file": "FILE"}` (valid JSON, parseable by `json.loads`).
|
||||
With stdin (no FILE), `"file"` is `null`.
|
||||
- **D2 — composes with flags.** `--json` honours `-l/-w/-c`: only the requested counts appear as keys
|
||||
(plus `file`). E.g. `wc.py --json -l FILE` → `{"lines": N, "file": "FILE"}`.
|
||||
- **D3 — no regression.** Every `wc`-phase gate (D1–D4 there) still passes unchanged.
|
||||
- **D4 — tests green.** `test_wc.py` is extended for the JSON cases and `pytest -q` is all-green.
|
||||
|
||||
## How the Adversary verifies (cold)
|
||||
|
||||
```bash
|
||||
pytest -q # D4 + D3 regression
|
||||
printf 'a b c\nd e\n' > /tmp/f.txt
|
||||
python wc.py --json /tmp/f.txt | python -c 'import sys,json; d=json.load(sys.stdin); \
|
||||
assert d=={"lines":2,"words":5,"chars":10,"file":"/tmp/f.txt"}, d; print("ok")' # D1
|
||||
python wc.py --json -l /tmp/f.txt # D2: expect {"lines": 2, "file": "/tmp/f.txt"}
|
||||
```
|
||||
|
||||
The Builder restates the exact commands, expected JSON, and commit sha in
|
||||
`machine-docs/STATUS-json.md`. When every DoD item has a fresh PASS in `machine-docs/REVIEW-json.md`
|
||||
and there is no `## VETO`, the Builder writes `## DONE` to `STATUS-json.md` — this is the last phase,
|
||||
so the watchdog then fires the one-shot `reporter` (see `agents.toml` `[loop].on_complete`).
|
||||
43
examples/builder-adversary/plans/wc.md
Normal file
43
examples/builder-adversary/plans/wc.md
Normal file
@ -0,0 +1,43 @@
|
||||
# Phase `wc` — a word-count CLI
|
||||
|
||||
**Mission.** Build a small, dependency-free `wc` clone in Python: a script `wc.py` in the work repo
|
||||
that counts lines, words, and characters, plus a `pytest` suite. This is the single source of truth
|
||||
for the phase — the Builder builds to the Definition of Done below; the Adversary cold-verifies it.
|
||||
|
||||
This task is deliberately tiny and fully local (no network, no services) so the example exercises the
|
||||
loop-pair *protocol* — claim → cold-verify → PASS/FAIL handshake — not infrastructure.
|
||||
|
||||
## Definition of Done
|
||||
|
||||
Each Dn is an independent gate. The Builder claims it (`claim(Dn): …`); the Adversary records a fresh
|
||||
PASS in `machine-docs/REVIEW-wc.md` after re-running the check from its own clone.
|
||||
|
||||
- **D1 — default output.** `python wc.py FILE` prints exactly `<lines> <words> <chars> <FILE>`
|
||||
(counts whitespace-separated words, `\n`-terminated lines, and bytes for `chars`), matching GNU
|
||||
`wc` on ASCII input.
|
||||
- **D2 — flags.** `-l`, `-w`, `-c` restrict the output to that single count (e.g. `wc.py -l FILE`
|
||||
prints `<lines> <FILE>`). Flags may combine; output order is lines, words, chars.
|
||||
- **D3 — stdin.** With no FILE argument, `wc.py` reads stdin and prints the counts with no filename.
|
||||
- **D4 — tests green.** A `test_wc.py` runs under `pytest -q` with **0 failures**, covering: an empty
|
||||
file (`0 0 0`), a multi-line fixture, the no-trailing-newline case, and each flag.
|
||||
|
||||
## How the Adversary verifies (cold)
|
||||
|
||||
From a fresh clone of the work repo:
|
||||
|
||||
```bash
|
||||
pytest -q # D4: must be all-green
|
||||
printf 'a b c\nd e\n' > /tmp/f.txt
|
||||
python wc.py /tmp/f.txt # D1: expect "2 5 10 /tmp/f.txt"
|
||||
python wc.py -l /tmp/f.txt # D2: expect "2 /tmp/f.txt"
|
||||
printf 'a b c\nd e\n' | python wc.py # D3: expect "2 5 10"
|
||||
```
|
||||
|
||||
Expected outputs are above — the Builder must restate them (and the exact commands, plus the commit
|
||||
sha) in `machine-docs/STATUS-wc.md` so the Adversary can re-run without reading the Builder's
|
||||
reasoning. Any mismatch is a FAIL with repro steps in `machine-docs/REVIEW-wc.md`.
|
||||
|
||||
## Out of scope (defer to a later phase or DEFERRED.md)
|
||||
|
||||
Multibyte/`-m` char counting, `--files0-from`, multiple-file totals, locale handling. JSON output is
|
||||
the next phase (`plans/json.md`).
|
||||
27
examples/builder-adversary/prompts/adversary.md
Normal file
27
examples/builder-adversary/prompts/adversary.md
Normal file
@ -0,0 +1,27 @@
|
||||
You are the **Adversary** — one of two independent loops. Your job is to **DISBELIEVE the Builder**. You run as a SEPARATE process and coordinate ONLY through the git repo. Read the phase plan named in the kickoff above in full — it is the single source of truth for WHAT is being verified.
|
||||
|
||||
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. When a gate is CLAIMED (or the watchdog pings you that one is), verify it promptly — that is top priority. When nothing is pending you may IDLE freely (sleep in chunks of **≤10 min**); you do NOT need to busy-poll to look busy — the watchdog pings you the instant the Builder claims a gate. Poll ~4 min only while actively watching a CLAIMED gate's run. Keep running independent break-it probes even when no gate is pending. Stop only when STATUS says "## DONE" and you have logged a fresh PASS for every DoD item.
|
||||
|
||||
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
|
||||
- **Cap every wait at 10 minutes.** Never a single ScheduleWakeup > 600 s; to wait longer, wake, re-check, wait again.
|
||||
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min out, matching your ScheduleWakeup; compute with `date -u -d '+10 min' +%FT%TZ`). Idle ≥5 min with no current marker, or past the named time → the watchdog kills + reboots you; you resume cleanly from git + your REVIEW/STATUS files.
|
||||
- **Compact proactively** at ≳80% context — your state is in git + REVIEW/STATUS, so compaction is lossless.
|
||||
|
||||
**Coordinate ONLY through git:**
|
||||
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root. If you find one at the root, `git mv` it in.
|
||||
- **Keep your OWN clone** (the `dir` this agent runs in). You verify from a COLD START in it. If the work repo doesn't exist yet, wait and retry on your next wake — the Builder creates it first.
|
||||
- `git pull --rebase` before every edit; commit; push; **never `--force`.**
|
||||
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit that records a **verdict or finding** with `review(...)` (e.g. `review(D2): PASS` / `review(D2): FAIL — repro …`). The watchdog watches origin/main and pings the Builder the moment a `review(` commit lands — that IS the handoff signal. (The Builder's gate claims are `claim(...)`.)
|
||||
- Write ONLY your files: REVIEW and the "## Adversary findings" section of BACKLOG. Everything else (code, STATUS, JOURNAL, "## Build backlog") is read-only to you.
|
||||
- **INBOX side-channel.** For non-gate messages to the Builder, append `machine-docs/BUILDER-INBOX.md` and push (the watchdog edge-pings the Builder). To receive from the Builder, look for `machine-docs/ADVERSARY-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). Formal verdicts still live in REVIEW.
|
||||
|
||||
**ISOLATION DISCIPLINE (anti-anchoring — critical).** The Builder is REQUIRED to give you, in STATUS, the verification info you need: WHAT is claimed, HOW to verify it (the exact command/check), the EXPECTED outcome, and WHERE the inputs live. **Read STATUS for that — you need all of it.** What you must IGNORE — in STATUS, and NEVER read in JOURNAL before your verdict — is the Builder's REASONING / RATIONALISATIONS ("I think this passes because…", design narrative, dead-ends). Reading those anchors you. Form your verdict from: (a) the phase plan = SSOT, (b) the code / git history, (c) the verification info the Builder passed in STATUS, and (d) your OWN cold acceptance run that re-executes the check against the expected outcomes. Only AFTER writing your verdict may you consult JOURNAL (note in REVIEW that you did). Trust observable behaviour, the plan, and your own re-run — not the Builder's narrative.
|
||||
|
||||
**Each wake:**
|
||||
1. Pull. Read STATUS for any "Gate: <id> CLAIMED, awaiting Adversary".
|
||||
2. Verify the claim from a COLD START (fresh shell, your own clone, no cached state). Re-run the DoD acceptance check yourself; do not trust the Builder's word.
|
||||
3. Actively try to BREAK it — edge cases, malformed input, the failure modes the plan names. A claim you can't break is a claim that PASSES; a claim you can break is a finding.
|
||||
4. Record verdicts in REVIEW ("<id>: PASS @<ts>" + evidence, or FAIL with repro steps). File each defect as a "## Adversary findings" item; only YOU close those, after re-test. You hold veto: write "## VETO <reason>" to REVIEW to forbid DONE until cleared.
|
||||
5. Push (with a `review(...)` prefix). Schedule the next wake.
|
||||
|
||||
Begin: read the phase plan, then enter the self-paced loop (start by cloning the work repo into your `dir` if it exists yet).
|
||||
31
examples/builder-adversary/prompts/builder.md
Normal file
31
examples/builder-adversary/prompts/builder.md
Normal file
@ -0,0 +1,31 @@
|
||||
You are the **Builder** — one of two independent loops working on this project. Your job is to build what the phase plan specifies, autonomously, over many wake cycles. You run as a SEPARATE process from the Adversary and coordinate with it ONLY through the git repo.
|
||||
|
||||
Single source of truth: the phase plan named in the kickoff above. Read it in full now, then begin.
|
||||
|
||||
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Each iteration = one unit of work. Pace yourself:
|
||||
- A long task in flight (build / test suite / e2e) → **poll every ~5 min**, never one big sleep matching the expected runtime (catch a failure at minute 4 of a 25-min run, not at minute 25).
|
||||
- Parked at a CLAIMED gate with no other unblocked work → the watchdog pings you the instant the Adversary writes a verdict or an inbox message, so you may wait; keep a fallback self-poll ~2–4 min in case a ping is missed.
|
||||
- Genuinely idle → sleep in chunks of **≤10 min**. Prefer keeping an unblocked backlog item in hand so you rarely just wait.
|
||||
|
||||
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
|
||||
- **Cap every wait at 10 minutes.** To wait longer, wake at 10 min, re-check, wait again. Never a single ScheduleWakeup > 600 s.
|
||||
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` — the time you will resume (≤10 min out, matching your ScheduleWakeup). Compute it from the clock (`date -u -d '+10 min' +%FT%TZ`). If the watchdog sees you idle ≥5 min with no current marker as your last line, OR idle past the time it names, it kills + reboots you — you resume cleanly from git + your STATUS/REVIEW files.
|
||||
- **Compact proactively.** If context usage climbs high (≳80%), run `/compact` before continuing — your loop state lives in git + the phase STATUS/REVIEW, so compaction is lossless and prevents wedging at the context limit.
|
||||
|
||||
**Coordinate ONLY through git:**
|
||||
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root — phase-namespaced STATUS/BACKLOG/REVIEW/JOURNAL, plus DECISIONS.md and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. Create `machine-docs/` if missing; if you find such a file at the root, `git mv` it in.
|
||||
- `git pull --rebase` before every edit; make the smallest change; commit; push. **Never `--force`.**
|
||||
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit with its conventional type. CRITICALLY: prefix a commit that **claims a gate** with `claim(...)` (e.g. `claim(D2): tests green`). The watchdog watches origin/main and pings the Adversary the moment a `claim(` commit lands — that IS the handoff signal. Keep using the other types too (`feat/fix/status/journal/decisions/chore/inbox(...)`), but `claim(` is what triggers verification.
|
||||
- **CLEAN TREE BEFORE CLAIM.** Run `git status` before you claim — the working tree MUST be clean (everything committed AND pushed). The Adversary cold-verifies from a fresh clone, so any un-pushed change that only exists on your host is a guaranteed verify mismatch. Push first, then claim.
|
||||
- **ARTIFACT-LAYER ISOLATION — the one rule that makes verification work.** STATUS MUST give the Adversary everything it needs to verify your claim: **WHAT** is claimed (gate id, DoD items), **HOW** to verify it (the exact command/check it can re-run from its own clone), the **EXPECTED** outcome (outputs, hashes, exit codes), and **WHERE** the inputs live (commit shas, paths). STATUS MUST NOT contain rationalisations — "I think this passes because…", design narrative, dead-ends. Those go in JOURNAL, which the Adversary is instructed NOT to read before its verdict (anti-anchoring). The line: **WHAT + HOW + EXPECTED + WHERE = STATUS; WHY = JOURNAL.** DECISIONS.md is for SETTLED design decisions, not in-the-moment reasoning.
|
||||
- **At each gate:** set "Gate: <id> CLAIMED, awaiting Adversary" in STATUS and work other unblocked items; do NOT advance past the gate until REVIEW shows its PASS.
|
||||
- **INBOX side-channel.** For non-gate messages to the Adversary (a heads-up, "starting a long run, please cold-verify X meanwhile"), append `machine-docs/ADVERSARY-INBOX.md` and push — the watchdog edge-pings the Adversary. To receive from the Adversary, look for `machine-docs/BUILDER-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). The inbox is a side-channel; formal CLAIMS still live in STATUS.
|
||||
- Write ONLY your files: source/config, STATUS, JOURNAL, DECISIONS, and the "## Build backlog" section of BACKLOG. Treat REVIEW and "## Adversary findings" as read-only — the Adversary owns them.
|
||||
|
||||
**Overriding rules:**
|
||||
- "Done" is defined ONLY by the plan's DoD, Adversary-verified. No self-certifying. Write "## DONE" to STATUS only when REVIEW shows a fresh PASS for every DoD item and there is no standing "## VETO".
|
||||
- Verify every change against real behaviour; paste the command + its output into JOURNAL. No "should work."
|
||||
- Never weaken, skip, or delete a test to make a run pass. A red test is information.
|
||||
- 3rd identical failure → stop, record the dead-end in DECISIONS.md, change approach or mark blocked.
|
||||
|
||||
Begin: read the phase plan, then enter the self-paced loop.
|
||||
8
examples/builder-adversary/prompts/kickoff.md
Normal file
8
examples/builder-adversary/prompts/kickoff.md
Normal file
@ -0,0 +1,8 @@
|
||||
*** PHASE {phase_id} ***
|
||||
SINGLE SOURCE OF TRUTH for this phase: {plan} — read it in full now. It defines this phase's mission and its Definition of Done (DoD).
|
||||
Track loop state in PHASE-NAMESPACED files UNDER machine-docs/ in your clone (create the dir if missing): machine-docs/{status}, machine-docs/BACKLOG-{phase_id}.md, machine-docs/REVIEW-{phase_id}.md, machine-docs/JOURNAL-{phase_id}.md. machine-docs/DECISIONS.md is shared (append-only).
|
||||
FILE-LOCATION RULE (mandatory): ALL coordination / loop-state files live in machine-docs/, NEVER the repo root — that includes STATUS/BACKLOG/REVIEW/JOURNAL (phase-namespaced), DECISIONS.md, and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. If you ever find one at the root, git mv it into machine-docs/.
|
||||
"Done" for this phase = the Builder writes "## DONE" to machine-docs/{status} ONLY after EVERY DoD item is Adversary-verified with a fresh PASS in machine-docs/REVIEW-{phase_id}.md (handshake below).
|
||||
Wherever the standing role below says "the plan" / "STATUS" / "REVIEW", substitute {plan} and these machine-docs/ phase-namespaced files.
|
||||
|
||||
=== standing role & rules ===
|
||||
27
examples/builder-solo/README.md
Normal file
27
examples/builder-solo/README.md
Normal file
@ -0,0 +1,27 @@
|
||||
# Builder-solo example — no Adversary (self-verification baseline)
|
||||
|
||||
A single **Builder** agent, same task spec as [`../builder-adversary`](../builder-adversary/), but
|
||||
with **no Adversary**: the Builder builds *and* verifies its own work, then self-certifies `## DONE`.
|
||||
No `claim(`/`review(` handoff — there's nothing to hand off to.
|
||||
|
||||
This is the **control** for the AI-as-adversary design. Comparing it against `builder-adversary` on
|
||||
the same task answers two things:
|
||||
|
||||
- **Cost:** how much of a run's tokens is the independent Adversary? (In the loop-pair runs the
|
||||
Adversary is ~45–53% of the total — this variant removes that.)
|
||||
- **Quality:** does an independent cold verifier catch things a self-checking builder misses? Self-
|
||||
certification has an obvious failure mode — the same agent that wrote the bug decides whether it's
|
||||
a bug. This variant measures what you give up by dropping the second pair of eyes.
|
||||
|
||||
The Builder's role prompt keeps the same verification *rigor* (run every DoD check, try to break it,
|
||||
paste observed output, no self-rubber-stamping) — the only thing removed is the **independent**
|
||||
adversary. So the comparison is "independent verification vs self-verification," not "verification vs
|
||||
none."
|
||||
|
||||
```bash
|
||||
python3 ../../agents.py status --config agents.toml
|
||||
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
|
||||
```
|
||||
|
||||
The `agent-orchestrator-benchmark` repo runs this head-to-head with the other variants on the same
|
||||
multi-phase task and reports tokens + the efficiency ratios.
|
||||
68
examples/builder-solo/agents.toml
Normal file
68
examples/builder-solo/agents.toml
Normal file
@ -0,0 +1,68 @@
|
||||
# examples/builder-solo — a single Builder, NO Adversary (self-verification baseline).
|
||||
#
|
||||
# Same pattern + same task spec as ../builder-adversary, but there is only ONE agent: the Builder
|
||||
# builds AND verifies its own work, then self-certifies "## DONE". This is the control for measuring
|
||||
# what the independent AI Adversary actually costs (its tokens) and buys (independent cold
|
||||
# verification). No claim/review handoff — nothing to hand off to.
|
||||
#
|
||||
# python3 ../../agents.py status --config agents.toml
|
||||
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
|
||||
|
||||
[watchdog]
|
||||
signal_interval = 30
|
||||
heavy_interval = 300
|
||||
limit_probe_fallback = 300
|
||||
limit_reset_slack = 45
|
||||
stall_grace = 180
|
||||
|
||||
[defaults]
|
||||
session_prefix = "solo-"
|
||||
log_dir = ".ao-state"
|
||||
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
|
||||
model = "claude-sonnet-4-6"
|
||||
watch = "heal"
|
||||
|
||||
[backend.claude]
|
||||
bin = "claude"
|
||||
flags = "--dangerously-skip-permissions"
|
||||
remote_control = true
|
||||
supports_resume = true
|
||||
prompt_delivery = "arg"
|
||||
process_name = "claude"
|
||||
submit_key = "Enter"
|
||||
stall_idle = 300
|
||||
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
|
||||
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
|
||||
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
|
||||
|
||||
[backend.demo]
|
||||
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
|
||||
prompt_delivery = "exec"
|
||||
|
||||
# The lone builder — builds and self-verifies.
|
||||
[[agent]]
|
||||
name = "builder" # tmux session: solo-builder
|
||||
kind = "loop"
|
||||
role = "builder" # kickoff = prompts/kickoff.md (per phase) + prompts/builder.md
|
||||
dir = "./work"
|
||||
watch = "heal+stall"
|
||||
|
||||
[[service]]
|
||||
name = "cleanlogs"
|
||||
command = "python3 ../../agent-log.py follow-all"
|
||||
dir = "."
|
||||
|
||||
# Phase machine. No handoff (single agent); the watchdog auto-advances when the builder writes
|
||||
# "## DONE" to the phase status file (read from handoff.repo's state_subdir).
|
||||
[loop]
|
||||
state_file = "phase-idx"
|
||||
resume_phase = true
|
||||
auto_advance = true
|
||||
done_marker = "## DONE"
|
||||
kickoff_template = "prompts/kickoff.md"
|
||||
roles_dir = "prompts"
|
||||
handoff = { repo = "./work", state_subdir = "machine-docs" }
|
||||
phases = [
|
||||
{ id = "wc", plan = "plans/wc.md", status = "STATUS-wc.md" },
|
||||
{ id = "json", plan = "plans/json.md", status = "STATUS-json.md" },
|
||||
]
|
||||
0
examples/builder-solo/machine-docs/.gitkeep
Normal file
0
examples/builder-solo/machine-docs/.gitkeep
Normal file
32
examples/builder-solo/plans/json.md
Normal file
32
examples/builder-solo/plans/json.md
Normal file
@ -0,0 +1,32 @@
|
||||
# Phase `json` — machine-readable output
|
||||
|
||||
**Mission.** Extend the `wc.py` from the previous phase with a `--json` mode, without regressing any
|
||||
`wc`-phase behaviour. Single source of truth for this phase.
|
||||
|
||||
(The phase config gives the Builder `claude-opus-4-8` for this phase — an example of a per-phase
|
||||
model override; the Adversary stays on the default model.)
|
||||
|
||||
## Definition of Done
|
||||
|
||||
- **D1 — json output.** `python wc.py --json FILE` prints a single JSON object:
|
||||
`{"lines": N, "words": N, "chars": N, "file": "FILE"}` (valid JSON, parseable by `json.loads`).
|
||||
With stdin (no FILE), `"file"` is `null`.
|
||||
- **D2 — composes with flags.** `--json` honours `-l/-w/-c`: only the requested counts appear as keys
|
||||
(plus `file`). E.g. `wc.py --json -l FILE` → `{"lines": N, "file": "FILE"}`.
|
||||
- **D3 — no regression.** Every `wc`-phase gate (D1–D4 there) still passes unchanged.
|
||||
- **D4 — tests green.** `test_wc.py` is extended for the JSON cases and `pytest -q` is all-green.
|
||||
|
||||
## How the Adversary verifies (cold)
|
||||
|
||||
```bash
|
||||
pytest -q # D4 + D3 regression
|
||||
printf 'a b c\nd e\n' > /tmp/f.txt
|
||||
python wc.py --json /tmp/f.txt | python -c 'import sys,json; d=json.load(sys.stdin); \
|
||||
assert d=={"lines":2,"words":5,"chars":10,"file":"/tmp/f.txt"}, d; print("ok")' # D1
|
||||
python wc.py --json -l /tmp/f.txt # D2: expect {"lines": 2, "file": "/tmp/f.txt"}
|
||||
```
|
||||
|
||||
The Builder restates the exact commands, expected JSON, and commit sha in
|
||||
`machine-docs/STATUS-json.md`. When every DoD item has a fresh PASS in `machine-docs/REVIEW-json.md`
|
||||
and there is no `## VETO`, the Builder writes `## DONE` to `STATUS-json.md` — this is the last phase,
|
||||
so the watchdog then fires the one-shot `reporter` (see `agents.toml` `[loop].on_complete`).
|
||||
43
examples/builder-solo/plans/wc.md
Normal file
43
examples/builder-solo/plans/wc.md
Normal file
@ -0,0 +1,43 @@
|
||||
# Phase `wc` — a word-count CLI
|
||||
|
||||
**Mission.** Build a small, dependency-free `wc` clone in Python: a script `wc.py` in the work repo
|
||||
that counts lines, words, and characters, plus a `pytest` suite. This is the single source of truth
|
||||
for the phase — the Builder builds to the Definition of Done below; the Adversary cold-verifies it.
|
||||
|
||||
This task is deliberately tiny and fully local (no network, no services) so the example exercises the
|
||||
loop-pair *protocol* — claim → cold-verify → PASS/FAIL handshake — not infrastructure.
|
||||
|
||||
## Definition of Done
|
||||
|
||||
Each Dn is an independent gate. The Builder claims it (`claim(Dn): …`); the Adversary records a fresh
|
||||
PASS in `machine-docs/REVIEW-wc.md` after re-running the check from its own clone.
|
||||
|
||||
- **D1 — default output.** `python wc.py FILE` prints exactly `<lines> <words> <chars> <FILE>`
|
||||
(counts whitespace-separated words, `\n`-terminated lines, and bytes for `chars`), matching GNU
|
||||
`wc` on ASCII input.
|
||||
- **D2 — flags.** `-l`, `-w`, `-c` restrict the output to that single count (e.g. `wc.py -l FILE`
|
||||
prints `<lines> <FILE>`). Flags may combine; output order is lines, words, chars.
|
||||
- **D3 — stdin.** With no FILE argument, `wc.py` reads stdin and prints the counts with no filename.
|
||||
- **D4 — tests green.** A `test_wc.py` runs under `pytest -q` with **0 failures**, covering: an empty
|
||||
file (`0 0 0`), a multi-line fixture, the no-trailing-newline case, and each flag.
|
||||
|
||||
## How the Adversary verifies (cold)
|
||||
|
||||
From a fresh clone of the work repo:
|
||||
|
||||
```bash
|
||||
pytest -q # D4: must be all-green
|
||||
printf 'a b c\nd e\n' > /tmp/f.txt
|
||||
python wc.py /tmp/f.txt # D1: expect "2 5 10 /tmp/f.txt"
|
||||
python wc.py -l /tmp/f.txt # D2: expect "2 /tmp/f.txt"
|
||||
printf 'a b c\nd e\n' | python wc.py # D3: expect "2 5 10"
|
||||
```
|
||||
|
||||
Expected outputs are above — the Builder must restate them (and the exact commands, plus the commit
|
||||
sha) in `machine-docs/STATUS-wc.md` so the Adversary can re-run without reading the Builder's
|
||||
reasoning. Any mismatch is a FAIL with repro steps in `machine-docs/REVIEW-wc.md`.
|
||||
|
||||
## Out of scope (defer to a later phase or DEFERRED.md)
|
||||
|
||||
Multibyte/`-m` char counting, `--files0-from`, multiple-file totals, locale handling. JSON output is
|
||||
the next phase (`plans/json.md`).
|
||||
15
examples/builder-solo/prompts/builder.md
Normal file
15
examples/builder-solo/prompts/builder.md
Normal file
@ -0,0 +1,15 @@
|
||||
You are the **Builder** — and the ONLY agent. There is no Adversary. You build to the plan's DoD **and verify your own work** before certifying it done. Read the phase plan (the SSOT) and build to its DoD.
|
||||
|
||||
Loop: run `/loop` (no interval), one unit of work per wake. Liveness (watchdog-enforced): cap every wait at 10 min; before going idle your LAST output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>`; compact at ~80% context.
|
||||
|
||||
Git: `pull --rebase`, smallest change, commit, push; never `--force`. Prefix commits conventionally (`feat/fix/test/status/…`).
|
||||
|
||||
**SELF-VERIFICATION (this replaces the Adversary — do it rigorously; do NOT rubber-stamp yourself):**
|
||||
- For each DoD gate, RUN the exact check the plan specifies (its command + expected output) from a clean state and confirm it passes. Don't assume — execute it and read the actual output.
|
||||
- Actively try to BREAK your own work: edge cases, malformed input, the failure modes the plan names. A gate you can break is not done.
|
||||
- Record it in `machine-docs/{status}` (or STATUS for the phase): per gate, WHAT it is, the exact command, the EXPECTED result, and the OBSERVED result (paste the real output).
|
||||
- Never weaken, skip, or delete a test to make a run pass. A red test is information.
|
||||
|
||||
Done: write "## DONE" to the phase status file ONLY after every DoD gate has a real, observed PASS from your own verification and you have no outstanding self-found defect.
|
||||
|
||||
Begin: read the plan, then enter the loop.
|
||||
7
examples/builder-solo/prompts/kickoff.md
Normal file
7
examples/builder-solo/prompts/kickoff.md
Normal file
@ -0,0 +1,7 @@
|
||||
*** PHASE {phase_id} ***
|
||||
Plan (this phase's single source of truth): {plan} — read it fully now; it defines the mission and the Definition of Done (DoD).
|
||||
You are the ONLY agent — there is no separate Adversary. You BUILD and you VERIFY YOUR OWN WORK.
|
||||
Track state under machine-docs/ (create if missing): {status} and JOURNAL-{phase_id}.md.
|
||||
Done = you write "## DONE" to machine-docs/{status} ONLY after every DoD item passes your own observed verification (run the checks, paste the output).
|
||||
|
||||
=== role ===
|
||||
84
examples/snakepit/README.md
Normal file
84
examples/snakepit/README.md
Normal file
@ -0,0 +1,84 @@
|
||||
# 🐍 Snake pit
|
||||
|
||||
> the "snake pit" agent orchestrator. each agent is a snake. you toss food (tasks) into the pit.
|
||||
> agents can devour tasks, gradually digest them, regurgitate them whole or in broken / digested
|
||||
> parts, excrete waste (chat logs, debug traces, &c), &c. obviously some specialist agents are on
|
||||
> cleanup duty
|
||||
>
|
||||
> — [@ponder.ooo](https://bsky.app/profile/ponder.ooo/post/3mmwue5bot22u), 2026-05-28
|
||||
|
||||
An agent-orchestrator example built on that idea. Where the sibling `builder-adversary` example is a
|
||||
**phase machine** (an ordered plan, two roles handing off), the snake pit is a **worker pool over a
|
||||
shared queue**: identical workers pull tasks from a pit, plus specialist species for planning and
|
||||
cleanup. Same harness, completely different topology — that's the point of having both.
|
||||
|
||||
## The core metaphor mapping
|
||||
|
||||
(From Claude running with the idea — the image in the thread.)
|
||||
|
||||
| bio | compute |
|
||||
|---|---|
|
||||
| snake species | agent specialization / system prompt |
|
||||
| hunger | priority / availability |
|
||||
| smell | task routing (tag match or embedding sim) |
|
||||
| fighting | contention resolution |
|
||||
| swallowing | task intake + context loading |
|
||||
| digestion | LLM calls / tool use |
|
||||
| regurgitate whole | re-queue (rejection / timeout) |
|
||||
| regurgitate partial | subtask decomposition |
|
||||
| excrete | artifact emission (logs, traces, results) |
|
||||
| waste heap | artifact store |
|
||||
| coprophagy | meta-agents consuming others' artifacts (log summariser, memory builder) |
|
||||
| scavengers | housekeeping agents on the waste heap |
|
||||
| snake death | crash / OOM / timeout → reap |
|
||||
|
||||
**The key insight: *regurgitation IS task decomposition*** — a planner snake swallows a big task and
|
||||
regurgitates it as smaller food the worker snakes can each digest.
|
||||
|
||||
## How it maps onto agent-orchestrator
|
||||
|
||||
- **The pit = a filesystem queue** (`pit/`). Snakes coordinate ONLY through it and claim work by
|
||||
**atomic `mv`**, so two snakes never devour the same food. Full layout + protocol: `pit/README.md`.
|
||||
- **Snake species = agents with different prompts** (the "agent specialization" row):
|
||||
- **keeper** (zookeeper, persistent) — tosses food in, keeps the pit healthy, reports.
|
||||
- **planner** (persistent) — *regurgitation = decomposition*: eats big food, regurgitates smaller
|
||||
food for the workers (`prompts/planner.md`).
|
||||
- **snake-1..3** (persistent worker pool) — devour → digest → regurgitate → excrete
|
||||
(`prompts/snake.md`). Scale the pool by copying a block.
|
||||
- **cleanup** (persistent) — the **scavenger** on the waste heap; also does light **coprophagy**
|
||||
(composts logs into a digest) and reaps food abandoned by a snake that died
|
||||
(`prompts/cleanup.md`).
|
||||
- **hunger / smell / fighting** — emergent from the loop: an idle snake naps (low hunger), picks the
|
||||
food it can do (smell), and the atomic-`mv` claim resolves contention (fighting).
|
||||
- **snake death = crash / timeout → reap** — the watchdog heals a dead snake (`watch = "heal"`); the
|
||||
cleanup snake reclaims whatever food it died holding.
|
||||
|
||||
## Run it
|
||||
|
||||
Needs `claude` on `PATH`. From this directory:
|
||||
|
||||
```bash
|
||||
python3 ../../agents.py status --config agents.toml # read-only: what would run
|
||||
python3 ../../agents.py up --config agents.toml # keeper + planner + 3 snakes + cleanup + watchdog
|
||||
python3 ../../agents.py logs snake-1 --config agents.toml
|
||||
python3 ../../agents.py down --config agents.toml
|
||||
```
|
||||
|
||||
A sample piece of food (`pit/food/food-0001-reverse-string.md`) is already in the pit, so the snakes
|
||||
have something to eat on first `up`. Toss more by writing `pit/food/food-<id>-<slug>.md` (schema in
|
||||
`pit/README.md`) — or ask the keeper to.
|
||||
|
||||
To watch the **mechanics** without an agent CLI, set `defaults.backend = "demo"` in `agents.toml`
|
||||
(the demo backend just idles) and run `up` / `status` / `down`.
|
||||
|
||||
## Extending it
|
||||
|
||||
- **More workers** → copy a `snake-N` block in `agents.toml`.
|
||||
- **A new species** → add an `[[agent]]` with its own `prompts/<species>.md` (e.g. a **coprophagy**
|
||||
meta-agent that builds long-term memory from the waste heap, distinct from the scavenger).
|
||||
- **Smarter routing** ("smell") → give food `tags:` and have snakes prefer matching tags.
|
||||
- **Real coordination across hosts** → back the pit with a git repo instead of a local dir and use
|
||||
the watchdog's `handoff` inbox pings (see the `builder-adversary` example).
|
||||
|
||||
This example carries **no** project-orchestrator/fleet metadata — like any project it can be run by
|
||||
hand and has no idea a fleet exists.
|
||||
126
examples/snakepit/agents.toml
Normal file
126
examples/snakepit/agents.toml
Normal file
@ -0,0 +1,126 @@
|
||||
# examples/snakepit — the "snake pit" agent orchestrator.
|
||||
#
|
||||
# Based on @ponder.ooo's idea (bsky, 2026-05-28): "each agent is a snake. you toss food (tasks) into
|
||||
# the pit. agents can devour tasks, gradually digest them, regurgitate them whole or in broken /
|
||||
# digested parts, excrete waste (chat logs, debug traces, &c). obviously some specialist agents are
|
||||
# on cleanup duty."
|
||||
#
|
||||
# Mapped onto agent-orchestrator, this is a WORKER-POOL-OVER-A-SHARED-QUEUE topology — quite unlike
|
||||
# the sibling builder-adversary phase machine:
|
||||
# • The PIT (./pit/) is a filesystem queue. Snakes claim work by ATOMIC `mv` (mv within one
|
||||
# filesystem is atomic, so two snakes never devour the same food).
|
||||
# • SNAKES (snake-1..3) are identical persistent workers, each running a self-paced /loop:
|
||||
# devour → digest → regurgitate (whole result, or broken-up sub-tasks back into the pit) →
|
||||
# excrete waste (logs).
|
||||
# • CLEANUP is the specialist on cleanup duty: sweeps waste, reclaims food abandoned by a snake
|
||||
# that choked or died.
|
||||
# • KEEPER (the zookeeper) tosses food in and keeps the pit healthy.
|
||||
# There is no [loop] phase machine here — no kind="loop" agents. See pit/README.md for the protocol.
|
||||
#
|
||||
# Run it by hand (status starts nothing):
|
||||
# python3 ../../agents.py status --config agents.toml
|
||||
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
|
||||
# python3 ../../agents.py down --config agents.toml
|
||||
# Mechanics-only (no agent CLI): set defaults.backend = "demo".
|
||||
|
||||
# ─────────────────────────── global watchdog cadence ───────────────────────────
|
||||
[watchdog]
|
||||
signal_interval = 30
|
||||
heavy_interval = 300
|
||||
limit_probe_fallback = 300
|
||||
limit_reset_slack = 45
|
||||
stall_grace = 180
|
||||
|
||||
# ─────────────────────────── defaults inherited by every agent ───────────────────────────
|
||||
[defaults]
|
||||
session_prefix = "snakepit-" # REQUIRED — sessions: snakepit-snake-1, snakepit-keeper, …
|
||||
log_dir = ".ao-state" # REQUIRED — logs + state/, resolved relative to this file
|
||||
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
|
||||
model = "claude-sonnet-4-6"
|
||||
watch = "heal" # keep every snake alive/healed; they self-pace and nap when the pit is empty
|
||||
|
||||
# ─────────────────────────── backends (declared as data) ───────────────────────────
|
||||
[backend.claude]
|
||||
bin = "claude"
|
||||
flags = "--dangerously-skip-permissions"
|
||||
remote_control = true
|
||||
supports_resume = true
|
||||
prompt_delivery = "arg"
|
||||
process_name = "claude"
|
||||
submit_key = "Enter"
|
||||
stall_idle = 300
|
||||
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
|
||||
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
|
||||
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
|
||||
|
||||
[backend.demo] # dependency-free: a shell that just idles
|
||||
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
|
||||
prompt_delivery = "exec"
|
||||
|
||||
# ─────────────────────────── the keeper (zookeeper / supervisor) ───────────────────────────
|
||||
[[agent]]
|
||||
name = "keeper" # tmux session: snakepit-keeper
|
||||
kind = "persistent"
|
||||
model = "claude-opus-4-8"
|
||||
resume = true
|
||||
watch = "heal"
|
||||
prompt = """
|
||||
You are the KEEPER of the snake pit (its zookeeper). On startup: read pit/README.md for the pit \
|
||||
protocol, then report the pit's state — counts of food waiting (pit/food/), in digestion \
|
||||
(pit/claimed/), regurgitated whole (pit/done/), scraps tossed back (pit/scraps/), and waste \
|
||||
(pit/waste/). Your job: (1) toss food into the pit — when an operator gives you a task, write it as \
|
||||
pit/food/food-<id>-<slug>.md per the schema in pit/README.md; (2) keep the pit healthy — watch \
|
||||
throughput, flag food stuck in pit/claimed/ for too long (a snake may have choked), and make sure \
|
||||
the snakes are fed. Stay available; report when asked."""
|
||||
# Optional periodic survey of the pit (uncomment to have the watchdog wake the keeper on a timer):
|
||||
# wake = { interval = 1800, prompt_file = "prompts/keeper-survey.md" }
|
||||
|
||||
# ─────────────────────────── the planner (a different snake species) ───────────────────────────
|
||||
# "snake species = agent specialization / system prompt." The key insight from the thread:
|
||||
# regurgitation IS task decomposition — a planner snake swallows a big task and regurgitates it as
|
||||
# smaller food the worker snakes can digest.
|
||||
[[agent]]
|
||||
name = "planner" # tmux session: snakepit-planner
|
||||
kind = "persistent"
|
||||
model = "claude-opus-4-8"
|
||||
resume = true
|
||||
watch = "heal"
|
||||
prompt = "You are the PLANNER snake — a species that eats only BIG food (tasks tagged `big: true`, or any food too large to digest in one sitting). Read prompts/planner.md and pit/README.md, then loop: devour big food from pit/food/, and regurgitate it IN PARTS — a set of smaller, self-contained food-* items tossed back into pit/food/ for the worker snakes — then remove the big item. Regurgitation IS task decomposition."
|
||||
|
||||
# ─────────────────────────── the snakes (identical worker pool) ───────────────────────────
|
||||
# Three persistent workers sharing one role (prompts/snake.md); each knows its own snake-id from its
|
||||
# inline prompt and uses it to claim food. Add more snakes by copying a block and bumping the id.
|
||||
[[agent]]
|
||||
name = "snake-1" # tmux session: snakepit-snake-1
|
||||
kind = "persistent"
|
||||
resume = true
|
||||
watch = "heal"
|
||||
prompt = "You are 🐍 snake-1, a worker snake in the pit; your snake-id is `snake-1`. Read prompts/snake.md for your full role and the pit protocol, then begin your self-paced loop — devour food from pit/food/, digest it, regurgitate the result, excrete your waste."
|
||||
|
||||
[[agent]]
|
||||
name = "snake-2"
|
||||
kind = "persistent"
|
||||
resume = true
|
||||
watch = "heal"
|
||||
prompt = "You are 🐍 snake-2, a worker snake in the pit; your snake-id is `snake-2`. Read prompts/snake.md for your full role and the pit protocol, then begin your self-paced loop — devour food from pit/food/, digest it, regurgitate the result, excrete your waste."
|
||||
|
||||
[[agent]]
|
||||
name = "snake-3"
|
||||
kind = "persistent"
|
||||
resume = true
|
||||
watch = "heal"
|
||||
prompt = "You are 🐍 snake-3, a worker snake in the pit; your snake-id is `snake-3`. Read prompts/snake.md for your full role and the pit protocol, then begin your self-paced loop — devour food from pit/food/, digest it, regurgitate the result, excrete your waste."
|
||||
|
||||
# ─────────────────────────── cleanup duty (specialist) ───────────────────────────
|
||||
[[agent]]
|
||||
name = "cleanup" # tmux session: snakepit-cleanup
|
||||
kind = "persistent"
|
||||
resume = true
|
||||
watch = "heal"
|
||||
prompt = "You are the CLEANUP snake — a specialist on cleanup duty in the pit. Read prompts/cleanup.md for your full role, then begin your self-paced loop: sweep waste from pit/waste/, and reclaim food abandoned in pit/claimed/ by a snake that choked or died (toss it back to pit/food/)."
|
||||
|
||||
# Non-AI helper: render the snakes' tmux transcripts into clean logs.
|
||||
[[service]]
|
||||
name = "cleanlogs" # tmux session: snakepit-cleanlogs
|
||||
command = "python3 ../../agent-log.py follow-all"
|
||||
dir = "."
|
||||
49
examples/snakepit/pit/README.md
Normal file
49
examples/snakepit/pit/README.md
Normal file
@ -0,0 +1,49 @@
|
||||
# The pit — a filesystem task queue
|
||||
|
||||
The pit is just directories. Snakes coordinate entirely through atomic `mv` between them — moving a
|
||||
file within one filesystem is atomic, so two snakes can never devour the same food.
|
||||
|
||||
```
|
||||
pit/
|
||||
food/ the queue: tasks waiting to be eaten (food-<id>-<slug>.md)
|
||||
claimed/ in digestion: a snake is working this one (<snake-id>.food-<id>-<slug>.md)
|
||||
done/ regurgitated WHOLE: a finished result (food-<id>-<slug>.result.md)
|
||||
scraps/ regurgitated in PARTS: notes/leftovers (anything; informational)
|
||||
waste/ excreted waste: chat logs, debug traces (<snake-id>-<ts>.log)
|
||||
```
|
||||
|
||||
> Sub-tasks ("broken / digested parts") are regurgitated back into **`food/`** as new food items, so
|
||||
> any snake can devour them. `scraps/` is for non-actionable leftovers a snake wants to keep around.
|
||||
|
||||
## Food schema (`pit/food/food-<id>-<slug>.md`)
|
||||
|
||||
```markdown
|
||||
# food-0007-reverse-string
|
||||
- **task:** Implement a `reverse(s)` function in scraps/reverse.py and a test that proves it.
|
||||
- **done-when:** `python -m pytest scraps/test_reverse.py -q` is green.
|
||||
- **tossed-by:** keeper # or another snake, if this is a regurgitated sub-task
|
||||
```
|
||||
|
||||
Keep food small and self-contained — one unit a snake can digest in a sitting. If a task is too big,
|
||||
a snake regurgitates it as several smaller food items.
|
||||
|
||||
## The eating protocol (snakes)
|
||||
|
||||
1. **Devour** — atomically claim one item:
|
||||
`mv pit/food/food-0007-reverse-string.md pit/claimed/snake-2.food-0007-reverse-string.md`
|
||||
If the `mv` fails, another snake beat you to it — pick a different one.
|
||||
2. **Digest** — do the work described in the food.
|
||||
3. **Regurgitate** — *whole*: write the result to `pit/done/food-0007-reverse-string.result.md`
|
||||
and `git`-free remove the claimed file. *In parts*: if it decomposes, write new `food-*` items
|
||||
into `pit/food/` for other snakes, and note that in your result.
|
||||
4. **Excrete** — drop your working log/trace as `pit/waste/snake-2-<ts>.log`; don't let it pile up
|
||||
in the workspace.
|
||||
5. **Choke?** On the 3rd identical failure, regurgitate the food back to `pit/food/` (or leave it in
|
||||
`claimed/` past the cleanup timeout) with a note in `scraps/`, so another snake or the keeper
|
||||
takes it.
|
||||
|
||||
## Cleanup duty
|
||||
|
||||
The cleanup snake sweeps `waste/` (summarise then prune old logs) and **reclaims** food left in
|
||||
`claimed/` longer than the abandonment timeout — a sign the snake choked or died — by moving it back
|
||||
to `food/` so a healthy snake can devour it.
|
||||
1
examples/snakepit/pit/claimed/.gitkeep
Normal file
1
examples/snakepit/pit/claimed/.gitkeep
Normal file
@ -0,0 +1 @@
|
||||
# in digestion — food a snake has devoured: <snake-id>.food-<id>-<slug>.md (see ../README.md)
|
||||
1
examples/snakepit/pit/done/.gitkeep
Normal file
1
examples/snakepit/pit/done/.gitkeep
Normal file
@ -0,0 +1 @@
|
||||
# regurgitated whole — finished results: food-<id>-<slug>.result.md (and planner *.plan.md). See ../README.md
|
||||
9
examples/snakepit/pit/food/food-0001-reverse-string.md
Normal file
9
examples/snakepit/pit/food/food-0001-reverse-string.md
Normal file
@ -0,0 +1,9 @@
|
||||
# food-0001-reverse-string
|
||||
- **task:** Implement a `reverse(s)` function in `pit/scraps/reverse.py` and a pytest that proves it
|
||||
(empty string, ASCII, and a unicode string round-trip: `reverse(reverse(s)) == s`).
|
||||
- **done-when:** `python -m pytest pit/scraps/test_reverse.py -q` is green.
|
||||
- **tossed-by:** keeper
|
||||
|
||||
<!-- A sample piece of food so the pit isn't empty on first `up`. Snakes devour it per
|
||||
pit/README.md: mv it into pit/claimed/<snake-id>.food-0001-reverse-string.md, digest, then
|
||||
write pit/done/food-0001-reverse-string.result.md. The keeper tosses real food the same way. -->
|
||||
1
examples/snakepit/pit/scraps/.gitkeep
Normal file
1
examples/snakepit/pit/scraps/.gitkeep
Normal file
@ -0,0 +1 @@
|
||||
# regurgitated in parts — non-actionable leftovers, stuck-notes, reclaims. See ../README.md
|
||||
1
examples/snakepit/pit/waste/.gitkeep
Normal file
1
examples/snakepit/pit/waste/.gitkeep
Normal file
@ -0,0 +1 @@
|
||||
# excreted waste — snake logs/traces: <snake-id>-<ts>.log; cleanup composts these. See ../README.md
|
||||
31
examples/snakepit/prompts/cleanup.md
Normal file
31
examples/snakepit/prompts/cleanup.md
Normal file
@ -0,0 +1,31 @@
|
||||
You are the **cleanup snake** — a specialist on cleanup duty in the pit. The worker snakes make a
|
||||
mess (that's fine, that's digestion); your job is to keep the pit from filling up with waste and to
|
||||
rescue food that got stuck. Read `pit/README.md` for the layout and protocol.
|
||||
|
||||
You coordinate ONLY through the pit (the filesystem). Self-paced `/loop`, no interval.
|
||||
|
||||
**Each iteration:**
|
||||
|
||||
1. **Sweep waste** — in `pit/waste/`, the snakes drop `<snake-id>-<ts>.log` traces. Roll them up:
|
||||
append a one-line digest of each to `pit/waste/COMPOST.md` (what snake, when, what it worked on),
|
||||
then delete logs older than ~30 min. Never delete a log you haven't composted. Keep `COMPOST.md`
|
||||
itself trimmed (summarise + truncate if it grows large).
|
||||
2. **Reclaim abandoned food** — scan `pit/claimed/`. A claim file (`<snake-id>.food-*`) whose mtime
|
||||
is older than the **abandonment timeout (~15 min)** means that snake choked or died mid-digest.
|
||||
Move it back to `pit/food/` (strip the `<snake-id>.` prefix) so a healthy snake re-devours it, and
|
||||
note the reclaim in `pit/scraps/reclaims.md`. Use mtime to judge age:
|
||||
`find pit/claimed -type f -mmin +15`.
|
||||
3. **Tidy** — prune empty/stale scraps, and if `pit/done/` grows large, move finished results into
|
||||
`pit/done/archive/`. Don't touch `pit/food/` items that are fresh, and never delete a result.
|
||||
|
||||
You are conservative: when unsure whether something is truly abandoned or just slow, leave it and
|
||||
re-check next pass. Better a late reclaim than stealing food from a snake that's still digesting.
|
||||
|
||||
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
|
||||
- **Cap every nap at 10 minutes** (never a single ScheduleWakeup > 600 s).
|
||||
- **Declare every nap.** FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min
|
||||
out; `date -u -d '+10 min' +%FT%TZ`). Idle past it → the watchdog reboots you; your state is the
|
||||
pit on disk.
|
||||
- **Compact proactively** at ≳80% context.
|
||||
|
||||
Begin: read `pit/README.md`, then enter your cleanup loop.
|
||||
34
examples/snakepit/prompts/planner.md
Normal file
34
examples/snakepit/prompts/planner.md
Normal file
@ -0,0 +1,34 @@
|
||||
You are the **planner** snake — a specialist species. The worker snakes digest small, self-contained
|
||||
food; you exist for the food too big to swallow whole. Your whole job is the thread's key insight:
|
||||
**regurgitation IS task decomposition.** You swallow a big task and regurgitate it as a set of
|
||||
smaller food items the worker snakes can each digest in a sitting.
|
||||
|
||||
Read `pit/README.md` for the layout, the food schema, and the eating protocol. You coordinate ONLY
|
||||
through the pit; you claim by atomic `mv`.
|
||||
|
||||
**Self-paced loop** (`/loop`, no interval). Each iteration:
|
||||
|
||||
1. **Find big food** — scan `pit/food/` for items tagged `big: true`, or any food whose `task` is
|
||||
clearly more than one sitting. Ignore small food — that's the workers' meal.
|
||||
2. **Devour it** — atomically claim it: `mv pit/food/<f> pit/claimed/planner.<f>`.
|
||||
3. **Regurgitate in parts** — decompose it into the smallest self-contained food items that still
|
||||
make sense, each with a real `done-when`. Write them into `pit/food/` as new `food-<id>-<slug>.md`
|
||||
(use `tossed-by: planner`, and reference the parent id so results can be traced). If sub-tasks
|
||||
have an order, say so in each food's body ("needs food-0012 done first") — workers respect it.
|
||||
4. **Record the plan** — write `pit/done/<parent-id>.plan.md` listing the children you tossed and
|
||||
how they add up to the parent's `done-when`, then remove the parent from `pit/claimed/`.
|
||||
5. **Excrete** your planning trace to `pit/waste/planner-<ts>.log`.
|
||||
|
||||
Keep decomposition shallow and honest: if a "big" task is actually small, just toss it back to
|
||||
`pit/food/` unchanged for a worker (don't manufacture busywork). If you can't decompose it (genuinely
|
||||
atomic but huge), note that in `pit/scraps/<id>-needs-keeper.md` and toss it back — the keeper
|
||||
decides.
|
||||
|
||||
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
|
||||
- **Cap every nap at 10 minutes** (never a single ScheduleWakeup > 600 s).
|
||||
- **Declare every nap.** FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min
|
||||
out; `date -u -d '+10 min' +%FT%TZ`). Idle past it → the watchdog reboots you; your state is the
|
||||
pit on disk.
|
||||
- **Compact proactively** at ≳80% context.
|
||||
|
||||
Begin: read `pit/README.md`, then loop — hunt for big food, decompose, regurgitate.
|
||||
39
examples/snakepit/prompts/snake.md
Normal file
39
examples/snakepit/prompts/snake.md
Normal file
@ -0,0 +1,39 @@
|
||||
You are a 🐍 **snake** in the pit — one worker in a pool of identical snakes. Your snake-id was given
|
||||
in your startup line (e.g. `snake-2`); use it in every claim and every log. Read `pit/README.md` now
|
||||
for the pit layout and the eating protocol — it is the source of truth for how to coordinate.
|
||||
|
||||
You do not talk to the other snakes. You coordinate ONLY through the pit (the filesystem), and you
|
||||
claim work by **atomic `mv`** so two snakes never devour the same food.
|
||||
|
||||
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup.
|
||||
Each iteration is one feeding:
|
||||
|
||||
1. **Look** in `pit/food/` for food. If it's empty, you're not hungry-out-of-luck — just nap (see
|
||||
liveness) and check again; the keeper will toss more in.
|
||||
2. **Devour** — atomically claim ONE item:
|
||||
`mv pit/food/<f> pit/claimed/<your-id>.<f>`. If the `mv` fails, another snake got it; pick
|
||||
another. Claim exactly one at a time — don't hoard the pit.
|
||||
3. **Digest** — do the work the food describes (its `done-when` is your acceptance check). Run it;
|
||||
don't assume. Keep a running trace as you go.
|
||||
4. **Regurgitate** —
|
||||
- *whole*: write the finished result to `pit/done/<id>.result.md` (state what you did and how to
|
||||
verify `done-when` passes), then remove the file from `pit/claimed/`.
|
||||
- *in parts*: if the task is too big to digest in one sitting, break it into smaller `food-*`
|
||||
items, toss them into `pit/food/` for other snakes, and say so in your result.
|
||||
5. **Excrete** — write your working log / debug trace to `pit/waste/<your-id>-<ts>.log` (`ts` from
|
||||
`date -u +%Y%m%dT%H%M%SZ`). Keep your workspace clean; the cleanup snake handles the waste pile.
|
||||
|
||||
**If you choke** (3rd identical failure on one food): stop forcing it. Regurgitate the food back to
|
||||
`pit/food/` with a short note in `pit/scraps/<id>-stuck.md` explaining where you got stuck, so a
|
||||
fresh snake or the keeper can take it. Don't thrash.
|
||||
|
||||
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
|
||||
- **Cap every nap at 10 minutes.** Never a single ScheduleWakeup > 600 s; to wait longer, wake,
|
||||
re-check the pit, nap again.
|
||||
- **Declare every nap.** Immediately before going idle, your FINAL output line MUST be exactly
|
||||
`WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min out, matching your ScheduleWakeup; compute with
|
||||
`date -u -d '+10 min' +%FT%TZ`). Idle ≥5 min with no current marker, or past the named time → the
|
||||
watchdog reboots you; you resume cleanly (your state is the pit on disk, not your memory).
|
||||
- **Compact proactively** at ≳80% context — your state lives in the pit, so compaction is lossless.
|
||||
|
||||
Begin: read `pit/README.md`, then enter your feeding loop. If the pit is empty, nap and check again.
|
||||
75
tests/run.sh
Executable file
75
tests/run.sh
Executable file
@ -0,0 +1,75 @@
|
||||
#!/usr/bin/env bash
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
# agent-orchestrator test runner.
|
||||
#
|
||||
# • UNIT tests — always run (pure logic, no agents spawned). A failure fails the suite.
|
||||
# • CLAUDE smoke — live, run when the `claude` CLI is available; SKIPs otherwise.
|
||||
# • OPENCODE smoke — live, run when `opencode` + creds are available; SKIPs otherwise.
|
||||
# • ISOLATION sanity — after the live runs: assert no leftover aotest-* tmux sessions, and that
|
||||
# the live cc-ci-* sessions are untouched.
|
||||
#
|
||||
# Run inside the devShell: nix develop -c ./tests/run.sh
|
||||
# or simply: ./tests/run.sh (python3 + tmux must be on PATH)
|
||||
#
|
||||
# Exit: 0 = all run tests passed (skips are OK); 1 = a unit test or a live smoke FAILED, or a
|
||||
# leftover aotest-* session was found.
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
set -uo pipefail
|
||||
|
||||
HERE="$(cd "$(dirname "$0")" && pwd)"
|
||||
REPO="$(cd "$HERE/.." && pwd)"
|
||||
RC=0
|
||||
UNIT=FAIL CLAUDE=SKIP OPENCODE=SKIP ISO=PASS
|
||||
|
||||
echo "######################################################################"
|
||||
echo "# agent-orchestrator test suite"
|
||||
echo "######################################################################"
|
||||
|
||||
# ── unit tests (always) ───────────────────────────────────────────────────────────
|
||||
echo; echo ">>> UNIT TESTS"
|
||||
if python3 -m unittest discover -s "$HERE" -p 'test_*.py' -v; then
|
||||
UNIT=PASS
|
||||
else
|
||||
UNIT=FAIL; RC=1
|
||||
fi
|
||||
|
||||
# helper: run a smoke script, classify its result from its output
|
||||
run_smoke() {
|
||||
local label="$1" script="$2"; shift 2
|
||||
echo; echo ">>> ${label} SMOKE"
|
||||
local out
|
||||
out="$(bash "$script" 2>&1)"; local rc=$?
|
||||
echo "$out"
|
||||
if echo "$out" | grep -q "BACKEND SMOKE: PASS"; then echo "PASS"; return 0; fi
|
||||
if [ "$rc" -eq 0 ] && echo "$out" | grep -qE "^SKIP:"; then echo "SKIP"; return 2; fi
|
||||
echo "FAIL"; return 1
|
||||
}
|
||||
|
||||
# ── live smoke tests (when backends available) ──────────────────────────────────────
|
||||
run_smoke "CLAUDE" "$HERE/smoke_claude.sh"; case $? in 0) CLAUDE=PASS;; 2) CLAUDE=SKIP;; *) CLAUDE=FAIL; RC=1;; esac
|
||||
run_smoke "OPENCODE" "$HERE/smoke_opencode.sh"; case $? in 0) OPENCODE=PASS;; 2) OPENCODE=SKIP;; *) OPENCODE=FAIL; RC=1;; esac
|
||||
|
||||
# ── isolation sanity ────────────────────────────────────────────────────────────────
|
||||
echo; echo ">>> ISOLATION SANITY"
|
||||
if command -v tmux >/dev/null 2>&1; then
|
||||
leftover="$(tmux ls 2>/dev/null | sed 's/:.*//' | grep '^aotest-' || true)"
|
||||
if [ -n "$leftover" ]; then
|
||||
echo " FAIL: leftover aotest-* sessions: $leftover"; ISO=FAIL; RC=1
|
||||
else
|
||||
echo " PASS: no leftover aotest-* tmux sessions"
|
||||
fi
|
||||
intact=""
|
||||
for s in cc-ci-orchestrator cc-ci-watchdog cc-ci-assistant3; do
|
||||
tmux has-session -t "=$s" 2>/dev/null && intact="$intact $s"
|
||||
done
|
||||
echo " info: live cc-ci sessions present:${intact:- (none — not a cc-ci host)}"
|
||||
else
|
||||
echo " (tmux not on PATH — isolation sanity skipped)"
|
||||
fi
|
||||
|
||||
# ── summary ─────────────────────────────────────────────────────────────────────────
|
||||
echo; echo "######################################################################"
|
||||
echo "# SUMMARY: unit=$UNIT claude=$CLAUDE opencode=$OPENCODE isolation=$ISO"
|
||||
echo "######################################################################"
|
||||
[ "$RC" -eq 0 ] && echo "ALL RUN TESTS PASSED (skips are OK)" || echo "SUITE FAILED"
|
||||
exit "$RC"
|
||||
124
tests/smoke_claude.sh
Executable file
124
tests/smoke_claude.sh
Executable file
@ -0,0 +1,124 @@
|
||||
#!/usr/bin/env bash
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
# Isolated LIVE smoke of the CLAUDE backend, driven entirely through the harness.
|
||||
#
|
||||
# Brings a throwaway scratch project (its OWN session_prefix "aotest-c-<pid>-" and a temporary
|
||||
# log_dir) up through `agents.py up`, on the real `claude` CLI:
|
||||
# • the harness builds the claude launch command (arg delivery + remote-control + model flag),
|
||||
# • the agent attaches in tmux (claude TUI alive, not an instant crash),
|
||||
# • `agents.py status` reports it RUNNING,
|
||||
# • `agents.py down` tears it down cleanly — no leftover sessions.
|
||||
#
|
||||
# SAFE BY CONSTRUCTION — never touches the live cc-ci-* sessions:
|
||||
# • a unique per-run session prefix (NOT "cc-ci-")
|
||||
# • cleans up everything it creates on exit (even on Ctrl+C / error).
|
||||
#
|
||||
# Usage: bash tests/smoke_claude.sh
|
||||
# Env: CLAUDE_BIN (default: `claude` on PATH, else ~/.local/bin/claude)
|
||||
# AOTEST_MODEL (default: claude-haiku-4-5 — a cheap model for the trivial probe)
|
||||
# Exit: 0 = PASS or SKIP (claude unavailable); 1 = FAIL.
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
set -uo pipefail
|
||||
|
||||
HERE="$(cd "$(dirname "$0")" && pwd)"
|
||||
REPO="$(cd "$HERE/.." && pwd)"
|
||||
CLAUDE_BIN="${CLAUDE_BIN:-$(command -v claude 2>/dev/null || echo "$HOME/.local/bin/claude")}"
|
||||
MODEL="${AOTEST_MODEL:-claude-haiku-4-5}"
|
||||
PREFIX="aotest-c-$$-"
|
||||
SANDBOX="$(mktemp -d)"
|
||||
CFG="$SANDBOX/agents.toml"
|
||||
FAILED=0
|
||||
|
||||
pass(){ echo " PASS: $*"; }
|
||||
fail(){ echo " FAIL: $*"; FAILED=1; }
|
||||
|
||||
cleanup(){
|
||||
local rc=$?
|
||||
python3 "$REPO/agents.py" --config "$CFG" down probe >/dev/null 2>&1 || true
|
||||
if command -v tmux >/dev/null 2>&1; then
|
||||
tmux ls 2>/dev/null | sed 's/:.*//' | grep "^${PREFIX}" | while read -r s; do
|
||||
tmux kill-session -t "=$s" 2>/dev/null || true
|
||||
done || true
|
||||
fi
|
||||
rm -rf "$SANDBOX"
|
||||
exit "$rc"
|
||||
}
|
||||
trap cleanup EXIT INT TERM
|
||||
|
||||
echo "=== claude backend smoke (isolated: prefix=${PREFIX}) ==="
|
||||
|
||||
# 0 — preconditions (SKIP, not FAIL, when claude/tmux can't run here)
|
||||
command -v tmux >/dev/null 2>&1 || { echo "SKIP: tmux not on PATH (run inside 'nix develop')"; exit 0; }
|
||||
[ -x "$CLAUDE_BIN" ] || command -v "$CLAUDE_BIN" >/dev/null 2>&1 \
|
||||
|| { echo "SKIP: claude binary not found ($CLAUDE_BIN)"; exit 0; }
|
||||
|
||||
# 1 — isolated sandbox config (unique prefix + temp log_dir; one trivial persistent probe)
|
||||
cat > "$CFG" <<EOF
|
||||
[defaults]
|
||||
project_dir = "$REPO"
|
||||
session_prefix = "$PREFIX"
|
||||
log_dir = "$SANDBOX/state"
|
||||
backend = "claude"
|
||||
model = "$MODEL"
|
||||
watch = "none"
|
||||
|
||||
[backend.claude]
|
||||
bin = "$CLAUDE_BIN"
|
||||
flags = "--dangerously-skip-permissions"
|
||||
remote_control = true
|
||||
supports_resume = true
|
||||
prompt_delivery = "arg"
|
||||
process_name = "claude"
|
||||
submit_key = "Enter"
|
||||
stall_idle = 300
|
||||
active_re = "esc to interrupt|Running tool|bypass permissions"
|
||||
limit_re = "usage limit|limit reached"
|
||||
|
||||
[[agent]]
|
||||
name = "probe"
|
||||
kind = "persistent"
|
||||
prompt = "You are a harness self-test. Reply with the single word READY and then wait silently. Do nothing else."
|
||||
EOF
|
||||
|
||||
# 2 — bring the probe up THROUGH the harness
|
||||
if ! python3 "$REPO/agents.py" --config "$CFG" up probe; then
|
||||
fail "agents.py up probe errored"; echo "=== RESULT: FAIL ==="; exit 1
|
||||
fi
|
||||
|
||||
# 3 — session created?
|
||||
sleep 6
|
||||
if tmux has-session -t "=${PREFIX}probe" 2>/dev/null; then
|
||||
cmd=$(tmux display-message -p -t "=${PREFIX}probe:" '#{pane_current_command}' 2>/dev/null)
|
||||
pass "session ${PREFIX}probe created via agents.py (pane command: ${cmd})"
|
||||
else
|
||||
fail "${PREFIX}probe session was not created"; echo "=== RESULT: FAIL ==="; exit 1
|
||||
fi
|
||||
|
||||
# 4 — claude actually attached (TUI alive), not an instant crash
|
||||
sleep 6
|
||||
cmd=$(tmux display-message -p -t "=${PREFIX}probe:" '#{pane_current_command}' 2>/dev/null)
|
||||
pane=$(tmux capture-pane -p -t "=${PREFIX}probe:" 2>/dev/null)
|
||||
if [ "$cmd" = "claude" ] || echo "$pane" | grep -qiE "esc to interrupt|bypass permissions|READY|claude|❯|welcome"; then
|
||||
pass "claude TUI attached + alive (driven entirely by agents.py)"
|
||||
else
|
||||
fail "no claude TUI in pane (cmd=${cmd}); tail: $(echo "$pane" | grep -vE '^\s*$' | tail -3)"
|
||||
fi
|
||||
|
||||
# 5 — status reports it RUNNING
|
||||
if python3 "$REPO/agents.py" --config "$CFG" status | grep -E '^\s*probe\b' | grep -q RUNNING; then
|
||||
pass "agents.py status reports probe RUNNING"
|
||||
else
|
||||
fail "agents.py status did not report probe RUNNING"
|
||||
fi
|
||||
|
||||
# 6 — lifecycle: down removes it cleanly
|
||||
python3 "$REPO/agents.py" --config "$CFG" down probe >/dev/null 2>&1
|
||||
sleep 2
|
||||
if tmux has-session -t "=${PREFIX}probe" 2>/dev/null; then
|
||||
fail "${PREFIX}probe still alive after agents.py down"
|
||||
else
|
||||
pass "agents.py down cleanly removed the session"
|
||||
fi
|
||||
|
||||
if [ "$FAILED" = 0 ]; then echo "=== CLAUDE BACKEND SMOKE: PASS ==="; exit 0
|
||||
else echo "=== CLAUDE BACKEND SMOKE: FAIL ==="; exit 1; fi
|
||||
156
tests/smoke_opencode.sh
Executable file
156
tests/smoke_opencode.sh
Executable file
@ -0,0 +1,156 @@
|
||||
#!/usr/bin/env bash
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
# Isolated LIVE smoke of the OPENCODE backend, driven entirely through the harness.
|
||||
#
|
||||
# Generalizes the cc-ci `test-opencode.sh` isolation pattern onto the agent-orchestrator harness:
|
||||
# stands up a DEDICATED opencode server on its own port (≠ 4096), then brings a throwaway scratch
|
||||
# project up through `agents.py up` on the opencode backend:
|
||||
# • the harness builds the opencode attach command + the post-connect bootstrap ping,
|
||||
# • the agent attaches to the server (opencode TUI alive),
|
||||
# • `agents.py status` reports it RUNNING,
|
||||
# • `agents.py down` tears it down cleanly — server killed, no leftover sessions, port freed.
|
||||
#
|
||||
# SAFE BY CONSTRUCTION — never touches the live cc-ci-* sessions or the live opencode server:
|
||||
# • a unique per-run session prefix (NOT "cc-ci-")
|
||||
# • its OWN opencode server on AOTEST_OC_PORT (default 4097, never 4096)
|
||||
# • cleans up everything it creates on exit (even on Ctrl+C / error).
|
||||
#
|
||||
# Usage: bash tests/smoke_opencode.sh
|
||||
# Env: OPENCODE_BIN (default: `opencode` on PATH, else ~/.local/bin/opencode)
|
||||
# AOTEST_OC_PORT (default 4097 — MUST differ from the live 4096)
|
||||
# AOTEST_OC_CREDS (default /srv/cc-ci/.testenv — sourced as the backend preamble)
|
||||
# AOTEST_MODEL (default: opencode's own configured default)
|
||||
# Exit: 0 = PASS or SKIP (opencode / creds / server unavailable); 1 = FAIL.
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
set -uo pipefail
|
||||
|
||||
HERE="$(cd "$(dirname "$0")" && pwd)"
|
||||
REPO="$(cd "$HERE/.." && pwd)"
|
||||
OCBIN="${OPENCODE_BIN:-$(command -v opencode 2>/dev/null || echo "$HOME/.local/bin/opencode")}"
|
||||
PORT="${AOTEST_OC_PORT:-4097}"
|
||||
SERVER="http://127.0.0.1:${PORT}"
|
||||
CREDS="${AOTEST_OC_CREDS:-/srv/cc-ci/.testenv}"
|
||||
MODEL="${AOTEST_MODEL:-}"
|
||||
PREFIX="aotest-o-$$-"
|
||||
SANDBOX="$(mktemp -d)"
|
||||
CFG="$SANDBOX/agents.toml"
|
||||
SRVLOG="$SANDBOX/server.log"
|
||||
SERVER_PID=""
|
||||
FAILED=0
|
||||
|
||||
pass(){ echo " PASS: $*"; }
|
||||
fail(){ echo " FAIL: $*"; FAILED=1; }
|
||||
|
||||
cleanup(){
|
||||
local rc=$?
|
||||
python3 "$REPO/agents.py" --config "$CFG" down probe >/dev/null 2>&1 || true
|
||||
if command -v tmux >/dev/null 2>&1; then
|
||||
tmux ls 2>/dev/null | sed 's/:.*//' | grep "^${PREFIX}" | while read -r s; do
|
||||
tmux kill-session -t "=$s" 2>/dev/null || true
|
||||
done || true
|
||||
fi
|
||||
# kill the server subshell AND the opencode serve child it forked (the subshell is not the
|
||||
# listener — target the listener by our unique port so the port is actually freed).
|
||||
[ -n "$SERVER_PID" ] && kill "$SERVER_PID" 2>/dev/null || true
|
||||
pkill -f "opencode serve.*--port ${PORT}\b" 2>/dev/null || true
|
||||
for _ in 1 2 3 4 5; do
|
||||
ss -ltn 2>/dev/null | grep -q ":${PORT} " || break
|
||||
sleep 1
|
||||
done
|
||||
rm -rf "$SANDBOX"
|
||||
exit "$rc"
|
||||
}
|
||||
trap cleanup EXIT INT TERM
|
||||
|
||||
echo "=== opencode backend smoke (isolated: prefix=${PREFIX} port=${PORT}) ==="
|
||||
|
||||
# 0 — preconditions (SKIP, not FAIL, when the environment can't run opencode)
|
||||
command -v tmux >/dev/null 2>&1 || { echo "SKIP: tmux not on PATH (run inside 'nix develop')"; exit 0; }
|
||||
[ "$PORT" != "4096" ] || { echo "FAIL: refusing port 4096 (the live cc-ci opencode port)"; exit 1; }
|
||||
[ -x "$OCBIN" ] || command -v "$OCBIN" >/dev/null 2>&1 \
|
||||
|| { echo "SKIP: opencode binary not found ($OCBIN)"; exit 0; }
|
||||
[ -f "$CREDS" ] || { echo "SKIP: opencode creds file missing ($CREDS)"; exit 0; }
|
||||
|
||||
# 1 — isolated sandbox config (unique prefix + temp log_dir + dedicated server)
|
||||
cat > "$CFG" <<EOF
|
||||
[defaults]
|
||||
project_dir = "$REPO"
|
||||
session_prefix = "$PREFIX"
|
||||
log_dir = "$SANDBOX/state"
|
||||
backend = "opencode"
|
||||
model = "$MODEL"
|
||||
watch = "none"
|
||||
|
||||
[backend.opencode]
|
||||
bin = "$OCBIN"
|
||||
attach = "{bin} attach {server} --dir {dir}"
|
||||
server = "$SERVER"
|
||||
supports_resume = false
|
||||
prompt_delivery = "ping"
|
||||
process_name = "opencode"
|
||||
footer_ui = true
|
||||
log_grace = 180
|
||||
connect_delay = 12
|
||||
submit_key = "C-m"
|
||||
preamble = "set -a; . $CREDS; set +a"
|
||||
stall_idle = 900
|
||||
active_re = "esc interrupt|thinking|inferring|running tool|tool call|preparing patch|reading|searching|working"
|
||||
limit_re = "usage limit|limit reached"
|
||||
|
||||
[[agent]]
|
||||
name = "probe"
|
||||
kind = "persistent"
|
||||
prompt = "You are a harness self-test. Reply with the single word READY and then wait silently. Do nothing else."
|
||||
EOF
|
||||
|
||||
# 2 — bring up a dedicated opencode server on our own port
|
||||
( set -a; . "$CREDS"; set +a; NO_COLOR=1 "$OCBIN" serve --hostname 127.0.0.1 --port "$PORT" ) >"$SRVLOG" 2>&1 &
|
||||
SERVER_PID=$!
|
||||
for _ in $(seq 1 30); do ss -ltn 2>/dev/null | grep -q ":${PORT} " && break; sleep 1; done
|
||||
if ! ss -ltn 2>/dev/null | grep -q ":${PORT} "; then
|
||||
echo "SKIP: opencode server did not come up on :${PORT} (see ${SRVLOG})"; exit 0
|
||||
fi
|
||||
pass "dedicated opencode server listening on :${PORT}"
|
||||
|
||||
# 3 — bring the probe up THROUGH the harness (attaches to OUR server)
|
||||
if ! python3 "$REPO/agents.py" --config "$CFG" up probe; then
|
||||
fail "agents.py up probe errored"; echo "=== RESULT: FAIL ==="; exit 1
|
||||
fi
|
||||
|
||||
# 4 — session created?
|
||||
sleep 4
|
||||
if tmux has-session -t "=${PREFIX}probe" 2>/dev/null; then
|
||||
cmd=$(tmux display-message -p -t "=${PREFIX}probe:" '#{pane_current_command}' 2>/dev/null)
|
||||
pass "session ${PREFIX}probe created via agents.py (pane command: ${cmd})"
|
||||
else
|
||||
fail "${PREFIX}probe session was not created"; echo "=== RESULT: FAIL ==="; exit 1
|
||||
fi
|
||||
|
||||
# 5 — opencode TUI attached + alive, not an instant crash
|
||||
sleep 12
|
||||
pane=$(tmux capture-pane -p -t "=${PREFIX}probe:" 2>/dev/null)
|
||||
if echo "$pane" | grep -qiE "opencode|build ·|gpt|claude|READY|esc interrupt|ctrl\+p|ctrl\+"; then
|
||||
pass "opencode TUI attached + alive (driven entirely by agents.py)"
|
||||
else
|
||||
fail "no opencode TUI/response in pane; tail: $(echo "$pane" | grep -vE '^\s*$' | tail -3)"
|
||||
echo " (server log tail:) $(tail -3 "$SRVLOG" 2>/dev/null)"
|
||||
fi
|
||||
|
||||
# 6 — status reports it RUNNING
|
||||
if python3 "$REPO/agents.py" --config "$CFG" status | grep -E '^\s*probe\b' | grep -q RUNNING; then
|
||||
pass "agents.py status reports probe RUNNING"
|
||||
else
|
||||
fail "agents.py status did not report probe RUNNING"
|
||||
fi
|
||||
|
||||
# 7 — lifecycle: down removes it cleanly
|
||||
python3 "$REPO/agents.py" --config "$CFG" down probe >/dev/null 2>&1
|
||||
sleep 2
|
||||
if tmux has-session -t "=${PREFIX}probe" 2>/dev/null; then
|
||||
fail "${PREFIX}probe still alive after agents.py down"
|
||||
else
|
||||
pass "agents.py down cleanly removed the session"
|
||||
fi
|
||||
|
||||
if [ "$FAILED" = 0 ]; then echo "=== OPENCODE BACKEND SMOKE: PASS ==="; exit 0
|
||||
else echo "=== OPENCODE BACKEND SMOKE: FAIL ==="; exit 1; fi
|
||||
526
tests/test_unit.py
Executable file
526
tests/test_unit.py
Executable file
@ -0,0 +1,526 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Unit tests for the agent-orchestrator harness (agents.py).
|
||||
|
||||
Pure-logic tests — NO agent CLIs spawned, NO live tmux sessions created. Every test builds a
|
||||
throwaway config + fixture files in a tempdir and exercises the harness functions directly.
|
||||
The one function that would spawn sessions (phase_advance_check → start/stop_loops) is tested
|
||||
with those two hooks monkeypatched to recorders, so the phase-machine *logic* is covered without
|
||||
launching anything.
|
||||
|
||||
Run: python3 -m unittest tests.test_unit (from repo root)
|
||||
or python3 tests/test_unit.py
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
import textwrap
|
||||
import tempfile
|
||||
import shutil
|
||||
import unittest
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parent.parent
|
||||
sys.path.insert(0, str(REPO_ROOT))
|
||||
import agents # noqa: E402
|
||||
|
||||
|
||||
# ── shared fixture config ────────────────────────────────────────────────────────
|
||||
|
||||
BASE_TOML = r"""
|
||||
[watchdog]
|
||||
signal_interval = 30
|
||||
heavy_interval = 300
|
||||
limit_probe_fallback = 300
|
||||
limit_reset_slack = 45
|
||||
stall_grace = 180
|
||||
|
||||
[defaults]
|
||||
session_prefix = "aotest-ut-"
|
||||
log_dir = "state"
|
||||
backend = "claude"
|
||||
model = "claude-sonnet-4-6"
|
||||
watch = "none"
|
||||
|
||||
[backend.claude]
|
||||
bin = "claude"
|
||||
flags = "--dangerously-skip-permissions"
|
||||
remote_control = true
|
||||
supports_resume = true
|
||||
prompt_delivery = "arg"
|
||||
process_name = "claude"
|
||||
submit_key = "Enter"
|
||||
stall_idle = 300
|
||||
active_re = "esc to interrupt|Running tool|\\u00b7 \\d+"
|
||||
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
|
||||
fatal_re = "redacted_thinking|blocks cannot be modified"
|
||||
|
||||
[backend.opencode]
|
||||
bin = "opencode"
|
||||
attach = "{bin} attach {server} --dir {dir}"
|
||||
server = "http://127.0.0.1:4096"
|
||||
supports_resume = false
|
||||
prompt_delivery = "ping"
|
||||
process_name = "opencode"
|
||||
footer_ui = true
|
||||
log_grace = 180
|
||||
connect_delay = 12
|
||||
submit_key = "C-m"
|
||||
stall_idle = 900
|
||||
active_re = "esc interrupt|thinking|inferring|running tool|tool call|preparing patch|reading|searching"
|
||||
limit_re = "usage limit|limit reached"
|
||||
|
||||
[backend.demo]
|
||||
bin = "echo up; exec sleep 100000"
|
||||
prompt_delivery = "exec"
|
||||
|
||||
[[agent]]
|
||||
name = "builder"
|
||||
kind = "loop"
|
||||
role = "builder"
|
||||
backend = "demo"
|
||||
|
||||
[[agent]]
|
||||
name = "adversary"
|
||||
kind = "loop"
|
||||
role = "adversary"
|
||||
backend = "demo"
|
||||
|
||||
[[agent]]
|
||||
name = "cl"
|
||||
kind = "persistent"
|
||||
backend = "claude"
|
||||
prompt = "hi"
|
||||
|
||||
[[agent]]
|
||||
name = "oc"
|
||||
kind = "persistent"
|
||||
backend = "opencode"
|
||||
prompt = "hi"
|
||||
|
||||
[[agent]]
|
||||
name = "custom"
|
||||
kind = "persistent"
|
||||
session = "explicit-session"
|
||||
model = "override-model"
|
||||
dir = "/abs/somewhere"
|
||||
backend = "demo"
|
||||
prompt = "x"
|
||||
|
||||
[[service]]
|
||||
name = "svc"
|
||||
command = "sleep 1"
|
||||
|
||||
[loop]
|
||||
state_file = "phase-idx"
|
||||
resume_phase = true
|
||||
auto_advance = true
|
||||
done_marker = "## DONE"
|
||||
kickoff_template = "prompts/kickoff.md"
|
||||
roles_dir = "prompts"
|
||||
handoff = { repo = ".", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], state_subdir = "machine-docs" }
|
||||
phases = [
|
||||
{ id = "p1", plan = "PLAN1.md", status = "STATUS-p1.md" },
|
||||
{ id = "p2", plan = "PLAN2.md", status = "STATUS-p2.md", models = { builder = "opus-x" } },
|
||||
]
|
||||
"""
|
||||
|
||||
KICKOFF_TMPL = "*** PROJECT PHASE: {phase_id} ***\nPLAN: {plan}\nSTATUS: {status}\nROLE: {role}\n---\n"
|
||||
BUILDER_PROMPT = "You are the **Builder** agent. (builder role body marker)\n"
|
||||
ADVERSARY_PROMPT = "You are the **Adversary** agent. (adversary role body marker)\n"
|
||||
|
||||
|
||||
def _make_project(tmp, toml=BASE_TOML):
|
||||
"""Write a self-contained project (config + prompts + machine-docs) into tmp; return cfg path."""
|
||||
root = Path(tmp)
|
||||
(root / "prompts").mkdir(parents=True, exist_ok=True)
|
||||
(root / "machine-docs").mkdir(parents=True, exist_ok=True)
|
||||
(root / "prompts" / "kickoff.md").write_text(KICKOFF_TMPL)
|
||||
(root / "prompts" / "builder.md").write_text(BUILDER_PROMPT)
|
||||
(root / "prompts" / "adversary.md").write_text(ADVERSARY_PROMPT)
|
||||
cfg_path = root / "agents.toml"
|
||||
cfg_path.write_text(toml)
|
||||
return cfg_path
|
||||
|
||||
|
||||
# ── config loading + defaults merge ────────────────────────────────────────────────
|
||||
|
||||
class TestConfigLoad(unittest.TestCase):
|
||||
def setUp(self):
|
||||
self.tmp = tempfile.mkdtemp(prefix="aotest-ut-")
|
||||
self.cfg_path = _make_project(self.tmp)
|
||||
self.cfg = agents.load_config(self.cfg_path)
|
||||
|
||||
def tearDown(self):
|
||||
shutil.rmtree(self.tmp, ignore_errors=True)
|
||||
|
||||
def test_defaults_merge_into_agents(self):
|
||||
b = self.cfg["agents"]["builder"]
|
||||
self.assertEqual(b["session_prefix"], "aotest-ut-")
|
||||
self.assertEqual(b["watch"], "none") # from defaults
|
||||
self.assertEqual(b["kind"], "loop") # explicit
|
||||
|
||||
def test_session_name_defaults_to_prefix_plus_name(self):
|
||||
self.assertEqual(self.cfg["agents"]["builder"]["session"], "aotest-ut-builder")
|
||||
|
||||
def test_explicit_session_overrides_prefix(self):
|
||||
self.assertEqual(self.cfg["agents"]["custom"]["session"], "explicit-session")
|
||||
|
||||
def test_per_agent_override_wins_over_default(self):
|
||||
# default model is claude-sonnet-4-6; custom overrides
|
||||
self.assertEqual(self.cfg["agents"]["custom"]["model"], "override-model")
|
||||
self.assertEqual(self.cfg["agents"]["builder"]["model"], "claude-sonnet-4-6")
|
||||
|
||||
def test_relative_dir_resolved_against_project_root(self):
|
||||
# builder has no dir → defaults dir "." → project_dir
|
||||
self.assertEqual(self.cfg["agents"]["builder"]["dir"], self.cfg["project_dir"])
|
||||
|
||||
def test_absolute_dir_kept(self):
|
||||
self.assertEqual(self.cfg["agents"]["custom"]["dir"], "/abs/somewhere")
|
||||
|
||||
def test_log_dir_and_state_dir_resolved(self):
|
||||
self.assertEqual(self.cfg["log_dir"], str(Path(self.cfg["project_dir"]) / "state"))
|
||||
self.assertEqual(self.cfg["state_dir"], os.path.join(self.cfg["log_dir"], "state"))
|
||||
self.assertTrue(Path(self.cfg["state_dir"]).is_dir()) # created on load
|
||||
|
||||
def test_service_session_named(self):
|
||||
self.assertIn("svc", self.cfg["services"])
|
||||
self.assertEqual(self.cfg["services"]["svc"]["session"], "aotest-ut-svc")
|
||||
|
||||
def test_backend_of_resolves(self):
|
||||
b = agents.backend_of(self.cfg, self.cfg["agents"]["cl"])
|
||||
self.assertEqual(b["prompt_delivery"], "arg")
|
||||
self.assertEqual(b["submit_key"], "Enter")
|
||||
|
||||
def test_backend_of_unknown_dies(self):
|
||||
a = dict(self.cfg["agents"]["cl"]); a["backend"] = "nope"
|
||||
with self.assertRaises(SystemExit):
|
||||
agents.backend_of(self.cfg, a)
|
||||
|
||||
def test_missing_session_prefix_dies(self):
|
||||
bad = self.tmp + "/bad1"
|
||||
p = _make_project(bad, toml='[defaults]\nlog_dir = "state"\n')
|
||||
with self.assertRaises(SystemExit):
|
||||
agents.load_config(p)
|
||||
|
||||
def test_missing_log_dir_dies(self):
|
||||
bad = self.tmp + "/bad2"
|
||||
p = _make_project(bad, toml='[defaults]\nsession_prefix = "x-"\n')
|
||||
with self.assertRaises(SystemExit):
|
||||
agents.load_config(p)
|
||||
|
||||
def test_env_override_model_single_invocation(self):
|
||||
os.environ["AGENT_MODEL_cl"] = "env-only-model"
|
||||
try:
|
||||
cfg2 = agents.load_config(self.cfg_path)
|
||||
self.assertEqual(cfg2["agents"]["cl"]["model"], "env-only-model")
|
||||
finally:
|
||||
del os.environ["AGENT_MODEL_cl"]
|
||||
# without the env var the file value stands again
|
||||
cfg3 = agents.load_config(self.cfg_path)
|
||||
self.assertEqual(cfg3["agents"]["cl"]["model"], "claude-sonnet-4-6")
|
||||
|
||||
|
||||
class TestExampleConfig(unittest.TestCase):
|
||||
"""The SHIPPED agents.example.toml must parse and define the documented shape."""
|
||||
def test_example_config_loads(self):
|
||||
ex = REPO_ROOT / "agents.example.toml"
|
||||
self.assertTrue(ex.exists(), "agents.example.toml missing from repo")
|
||||
cfg = agents.load_config(ex)
|
||||
self.assertIn("builder", cfg["agents"])
|
||||
self.assertIn("adversary", cfg["agents"])
|
||||
for be in ("demo", "claude", "opencode"):
|
||||
self.assertIn(be, cfg["backends"], f"backend {be} missing from example")
|
||||
self.assertEqual(len(agents.phases(cfg)), 2)
|
||||
|
||||
|
||||
# ── kickoff-template assembly ──────────────────────────────────────────────────────
|
||||
|
||||
class TestKickoff(unittest.TestCase):
|
||||
def setUp(self):
|
||||
self.tmp = tempfile.mkdtemp(prefix="aotest-ut-")
|
||||
self.cfg = agents.load_config(_make_project(self.tmp))
|
||||
|
||||
def tearDown(self):
|
||||
shutil.rmtree(self.tmp, ignore_errors=True)
|
||||
|
||||
def test_kickoff_renders_slots_and_appends_role(self):
|
||||
out = agents.build_loop_kickoff(self.cfg, self.cfg["agents"]["builder"])
|
||||
self.assertIn("PROJECT PHASE: p1", out) # phase_id slot filled (phase idx 0)
|
||||
self.assertIn("PLAN: PLAN1.md", out)
|
||||
self.assertIn("STATUS: STATUS-p1.md", out)
|
||||
self.assertIn("ROLE: builder", out)
|
||||
self.assertIn("builder role body marker", out) # role prompt appended
|
||||
self.assertNotIn("{phase_id}", out) # no unrendered slot
|
||||
self.assertNotIn("{role}", out)
|
||||
|
||||
def test_kickoff_picks_correct_role_prompt(self):
|
||||
out = agents.build_loop_kickoff(self.cfg, self.cfg["agents"]["adversary"])
|
||||
self.assertIn("adversary role body marker", out)
|
||||
self.assertNotIn("builder role body marker", out)
|
||||
|
||||
def test_agent_prompt_loop_returns_kickoff(self):
|
||||
out = agents.agent_prompt(self.cfg, self.cfg["agents"]["builder"])
|
||||
self.assertIn("PROJECT PHASE: p1", out)
|
||||
|
||||
def test_agent_prompt_persistent_returns_inline_prompt(self):
|
||||
out = agents.agent_prompt(self.cfg, self.cfg["agents"]["cl"])
|
||||
self.assertEqual(out, "hi")
|
||||
|
||||
def test_role_model_phase_override(self):
|
||||
# phase p2 overrides builder model to opus-x; advance index to 1
|
||||
Path(agents.phase_idx_file(self.cfg)).write_text("1")
|
||||
self.assertEqual(agents.role_model(self.cfg, self.cfg["agents"]["builder"]), "opus-x")
|
||||
# adversary has no override → its configured/default model
|
||||
self.assertEqual(agents.role_model(self.cfg, self.cfg["agents"]["adversary"]),
|
||||
"claude-sonnet-4-6")
|
||||
|
||||
|
||||
# ── phase machine ──────────────────────────────────────────────────────────────────
|
||||
|
||||
class TestPhaseMachine(unittest.TestCase):
|
||||
def setUp(self):
|
||||
self.tmp = tempfile.mkdtemp(prefix="aotest-ut-")
|
||||
self.cfg = agents.load_config(_make_project(self.tmp))
|
||||
self.md = Path(self.cfg["project_dir"]) / "machine-docs"
|
||||
# monkeypatch the session-spawning hooks so the machine logic runs without tmux
|
||||
self._orig = (agents.stop_loops, agents.start_loops, agents.handoff_reset)
|
||||
self.calls = []
|
||||
agents.stop_loops = lambda cfg: self.calls.append("stop")
|
||||
agents.start_loops = lambda cfg: self.calls.append("start")
|
||||
agents.handoff_reset = lambda: self.calls.append("reset")
|
||||
|
||||
def tearDown(self):
|
||||
agents.stop_loops, agents.start_loops, agents.handoff_reset = self._orig
|
||||
shutil.rmtree(self.tmp, ignore_errors=True)
|
||||
|
||||
def _status(self, basename, text):
|
||||
(self.md / basename).write_text(text)
|
||||
|
||||
def test_phase_done_detects_marker(self):
|
||||
self._status("STATUS-p1.md", "header\n## DONE\nall verified PASS\n")
|
||||
self.assertTrue(agents.phase_done(self.cfg, "STATUS-p1.md"))
|
||||
|
||||
def test_phase_done_rejects_placeholder_body(self):
|
||||
self._status("STATUS-p1.md", "## DONE\nnot yet — written here only when complete\n")
|
||||
self.assertFalse(agents.phase_done(self.cfg, "STATUS-p1.md"))
|
||||
|
||||
def test_phase_done_false_when_no_marker(self):
|
||||
self._status("STATUS-p1.md", "## In progress\nworking\n")
|
||||
self.assertFalse(agents.phase_done(self.cfg, "STATUS-p1.md"))
|
||||
|
||||
def test_phase_done_false_when_file_missing(self):
|
||||
self.assertFalse(agents.phase_done(self.cfg, "STATUS-nope.md"))
|
||||
|
||||
def test_cur_idx_reads_state_file(self):
|
||||
Path(agents.phase_idx_file(self.cfg)).write_text("1")
|
||||
self.assertEqual(agents.cur_idx(self.cfg), 1)
|
||||
|
||||
def test_advance_on_done(self):
|
||||
Path(agents.phase_idx_file(self.cfg)).write_text("0")
|
||||
self._status("STATUS-p1.md", "## DONE\nverified\n")
|
||||
advanced = agents.phase_advance_check(self.cfg)
|
||||
self.assertTrue(advanced)
|
||||
self.assertEqual(agents.cur_idx(self.cfg), 1) # moved to p2
|
||||
self.assertIn("stop", self.calls)
|
||||
self.assertIn("start", self.calls)
|
||||
|
||||
def test_no_advance_when_not_done(self):
|
||||
Path(agents.phase_idx_file(self.cfg)).write_text("0")
|
||||
self._status("STATUS-p1.md", "## In progress\n")
|
||||
self.assertFalse(agents.phase_advance_check(self.cfg))
|
||||
self.assertEqual(agents.cur_idx(self.cfg), 0)
|
||||
self.assertEqual(self.calls, [])
|
||||
|
||||
def test_sequence_complete_idempotent(self):
|
||||
Path(agents.phase_idx_file(self.cfg)).write_text("1") # last phase
|
||||
self._status("STATUS-p2.md", "## DONE\nverified\n")
|
||||
marker = Path(self.cfg["log_dir"]) / "SEQUENCE-COMPLETE"
|
||||
# first call: completes the sequence
|
||||
self.assertTrue(agents.phase_advance_check(self.cfg))
|
||||
self.assertTrue(marker.exists())
|
||||
self.assertEqual(self.calls.count("stop"), 1)
|
||||
# second call: idempotent — no re-stop, returns False
|
||||
self.assertFalse(agents.phase_advance_check(self.cfg))
|
||||
self.assertEqual(self.calls.count("stop"), 1)
|
||||
|
||||
def test_append_phase_clears_marker_and_resumes(self):
|
||||
# simulate "sequence already complete", then a 3rd phase appended to the config
|
||||
Path(agents.phase_idx_file(self.cfg)).write_text("1")
|
||||
self._status("STATUS-p2.md", "## DONE\nverified\n")
|
||||
marker = Path(self.cfg["log_dir"]) / "SEQUENCE-COMPLETE"
|
||||
marker.write_text("stale completion\n")
|
||||
self.cfg["loop"]["phases"].append(
|
||||
{"id": "p3", "plan": "PLAN3.md", "status": "STATUS-p3.md"})
|
||||
advanced = agents.phase_advance_check(self.cfg)
|
||||
self.assertTrue(advanced)
|
||||
self.assertEqual(agents.cur_idx(self.cfg), 2) # resumed onto p3
|
||||
self.assertFalse(marker.exists()) # stale marker cleared
|
||||
self.assertIn("start", self.calls)
|
||||
|
||||
def test_custom_done_marker(self):
|
||||
self.cfg["loop"]["done_marker"] = "## SHIPPED"
|
||||
self._status("STATUS-p1.md", "## SHIPPED\nverified\n")
|
||||
self.assertTrue(agents.phase_done(self.cfg, "STATUS-p1.md"))
|
||||
self.assertFalse(agents.phase_done(self.cfg, "STATUS-p2.md"))
|
||||
|
||||
|
||||
# ── usage-limit banner reset parsing ───────────────────────────────────────────────
|
||||
|
||||
class TestLimitParsing(unittest.TestCase):
|
||||
def setUp(self):
|
||||
self.tmp = tempfile.mkdtemp(prefix="aotest-ut-")
|
||||
self.cfg = agents.load_config(_make_project(self.tmp))
|
||||
|
||||
def tearDown(self):
|
||||
shutil.rmtree(self.tmp, ignore_errors=True)
|
||||
|
||||
def test_parse_reset_pm(self):
|
||||
ep = agents._parse_reset_epoch("You've hit your limit · resets at 10pm")
|
||||
self.assertIsNotNone(ep)
|
||||
self.assertEqual(datetime.fromtimestamp(ep).hour, 22)
|
||||
|
||||
def test_parse_reset_am_with_minutes(self):
|
||||
ep = agents._parse_reset_epoch("resets 3:30am")
|
||||
self.assertIsNotNone(ep)
|
||||
dt = datetime.fromtimestamp(ep)
|
||||
self.assertEqual((dt.hour, dt.minute), (3, 30))
|
||||
|
||||
def test_parse_reset_12am_is_midnight(self):
|
||||
ep = agents._parse_reset_epoch("resets at 12am")
|
||||
self.assertEqual(datetime.fromtimestamp(ep).hour, 0)
|
||||
|
||||
def test_parse_reset_invalid_hour_none(self):
|
||||
self.assertIsNone(agents._parse_reset_epoch("resets at 25"))
|
||||
|
||||
def test_parse_reset_no_match_none(self):
|
||||
self.assertIsNone(agents._parse_reset_epoch("everything is fine here"))
|
||||
|
||||
def test_parse_reset_picks_last_match(self):
|
||||
ep = agents._parse_reset_epoch("resets at 9am ... actually resets at 11am")
|
||||
self.assertEqual(datetime.fromtimestamp(ep).hour, 11)
|
||||
|
||||
def test_next_limit_until_unparsable_fallback(self):
|
||||
now = time.time()
|
||||
until, parsed = agents._next_limit_until(self.cfg, "limit reached, no time given", now)
|
||||
self.assertFalse(parsed)
|
||||
self.assertEqual(int(until), int(now + 300)) # limit_probe_fallback
|
||||
|
||||
def test_next_limit_until_within_window_uses_banner(self):
|
||||
now = time.time()
|
||||
t = datetime.now() + timedelta(hours=2)
|
||||
h12 = t.hour % 12 or 12
|
||||
ampm = "am" if t.hour < 12 else "pm"
|
||||
banner = f"weekly limit · resets at {h12}:{t.minute:02d}{ampm}"
|
||||
until, parsed = agents._next_limit_until(self.cfg, banner, now)
|
||||
self.assertTrue(parsed)
|
||||
self.assertGreater(until, now)
|
||||
self.assertLessEqual(until - now, 6 * 3600 + 60) # within 6h window (+slack)
|
||||
|
||||
def test_next_limit_until_far_future_falls_back(self):
|
||||
now = time.time()
|
||||
t = datetime.now() + timedelta(hours=7) # > 6h window
|
||||
h12 = t.hour % 12 or 12
|
||||
ampm = "am" if t.hour < 12 else "pm"
|
||||
banner = f"limit · resets at {h12}:{t.minute:02d}{ampm}"
|
||||
until, parsed = agents._next_limit_until(self.cfg, banner, now)
|
||||
self.assertFalse(parsed)
|
||||
self.assertEqual(int(until), int(now + 300))
|
||||
|
||||
|
||||
# ── stall / WAITING-UNTIL parsing ──────────────────────────────────────────────────
|
||||
|
||||
class TestWaitingUntil(unittest.TestCase):
|
||||
def setUp(self):
|
||||
self.tmp = tempfile.mkdtemp(prefix="aotest-ut-")
|
||||
self.cfg = agents.load_config(_make_project(self.tmp))
|
||||
self.claude_agent = self.cfg["agents"]["cl"] # non-footer backend
|
||||
self.oc_agent = self.cfg["agents"]["oc"] # footer_ui backend
|
||||
|
||||
def tearDown(self):
|
||||
shutil.rmtree(self.tmp, ignore_errors=True)
|
||||
|
||||
def test_non_footer_finds_marker_anywhere(self):
|
||||
pane = "blah blah\nWAITING-UNTIL: 2030-06-13T12:00:00Z\nmore output after\n"
|
||||
ep = agents._parse_waiting_until(self.cfg, self.claude_agent, pane)
|
||||
self.assertIsNotNone(ep)
|
||||
self.assertEqual(ep, datetime.fromisoformat("2030-06-13T12:00:00+00:00").timestamp())
|
||||
|
||||
def test_non_footer_none_without_marker(self):
|
||||
self.assertIsNone(agents._parse_waiting_until(
|
||||
self.cfg, self.claude_agent, "just working, no marker"))
|
||||
|
||||
def test_footer_requires_marker_as_last_line(self):
|
||||
# marker present but NOT the last non-empty line → ignored for a footer UI
|
||||
pane = "WAITING-UNTIL: 2030-06-13T12:00:00Z\n ▣ Build · GPT · 2m 19s\n"
|
||||
self.assertIsNone(agents._parse_waiting_until(self.cfg, self.oc_agent, pane))
|
||||
|
||||
def test_footer_honors_marker_when_last_line(self):
|
||||
pane = "some work\nWAITING-UNTIL: 2030-06-13T12:00:00Z\n\n"
|
||||
ep = agents._parse_waiting_until(self.cfg, self.oc_agent, pane)
|
||||
self.assertIsNotNone(ep)
|
||||
|
||||
def test_bad_timestamp_none(self):
|
||||
self.assertIsNone(agents._parse_waiting_until(
|
||||
self.cfg, self.claude_agent, "WAITING-UNTIL: not-a-time"))
|
||||
|
||||
|
||||
# ── backend activity detectors (claude + opencode footers) ──────────────────────────
|
||||
|
||||
class TestActivityDetection(unittest.TestCase):
|
||||
def setUp(self):
|
||||
self.tmp = tempfile.mkdtemp(prefix="aotest-ut-")
|
||||
self.cfg = agents.load_config(_make_project(self.tmp))
|
||||
self.claude_agent = self.cfg["agents"]["cl"]
|
||||
self.oc_agent = self.cfg["agents"]["oc"]
|
||||
|
||||
def tearDown(self):
|
||||
shutil.rmtree(self.tmp, ignore_errors=True)
|
||||
|
||||
# claude: non-footer, active_re matched anywhere in the pane
|
||||
def test_claude_active_esc_to_interrupt(self):
|
||||
self.assertTrue(agents.pane_active(
|
||||
self.cfg, self.claude_agent, "thinking...\n esc to interrupt", use_log=False))
|
||||
|
||||
def test_claude_active_running_tool(self):
|
||||
self.assertTrue(agents.pane_active(
|
||||
self.cfg, self.claude_agent, "Running tool: Bash", use_log=False))
|
||||
|
||||
def test_claude_active_spinner_dot_count(self):
|
||||
self.assertTrue(agents.pane_active(
|
||||
self.cfg, self.claude_agent, "Compiling · 137 tokens", use_log=False))
|
||||
|
||||
def test_claude_idle_is_not_active(self):
|
||||
self.assertFalse(agents.pane_active(
|
||||
self.cfg, self.claude_agent, "Done.\n> ", use_log=False))
|
||||
|
||||
# opencode: footer_ui — only the bottom rows count as activity
|
||||
def test_opencode_active_footer(self):
|
||||
pane = "~ Preparing patch...\n ⬝⬝■ esc interrupt 137.6K\n"
|
||||
self.assertTrue(agents.pane_active(self.cfg, self.oc_agent, pane, use_log=False))
|
||||
|
||||
def test_opencode_idle_footer_not_active(self):
|
||||
pane = " ▣ Build · GPT-5.4 · 2m 19s\n 178.4K (17%) ctrl+p commands\n"
|
||||
self.assertFalse(agents.pane_active(self.cfg, self.oc_agent, pane, use_log=False))
|
||||
|
||||
def test_opencode_active_only_at_top_is_ignored(self):
|
||||
# active marker far above the bottom 10 lines → a footer UI ignores it
|
||||
pane = "running tool now\n" + "\n".join(f"line {i}" for i in range(20)) + \
|
||||
"\n ▣ Build · GPT · idle\n"
|
||||
self.assertFalse(agents.pane_active(self.cfg, self.oc_agent, pane, use_log=False))
|
||||
|
||||
def test_opencode_log_grace_fallback(self):
|
||||
# idle footer, but a freshly-touched session log within the grace window → active
|
||||
idle = " ▣ Build · GPT · idle\n 178K (17%) ctrl+p\n"
|
||||
logp = agents._session_log_path(self.cfg, self.oc_agent["session"])
|
||||
logp.parent.mkdir(parents=True, exist_ok=True)
|
||||
logp.write_text("recent activity\n") # mtime = now
|
||||
self.assertTrue(agents.pane_active(self.cfg, self.oc_agent, idle, use_log=True))
|
||||
# remove the log → no fallback → idle footer reads as not active
|
||||
logp.unlink()
|
||||
self.assertFalse(agents.pane_active(self.cfg, self.oc_agent, idle, use_log=True))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main(verbosity=2)
|
||||
Reference in New Issue
Block a user