12 Commits
v0.1.0 ... main

Author SHA1 Message Date
781db071dd docs(readme): add Examples section (Builder/Adversary variants, snakepit) + benchmark note 2026-06-16 02:35:40 +00:00
90375f004e docs(examples): add builder-adversary-deferred — verify after a long segment
Coarsest review cadence: the Builder self-certifies the build phases and the
Adversary does ONE comprehensive cold-verification of the whole accumulated build
in a final `review` phase (vs orig per-phase, lean per-gate). Full original
prompts + a DEFERRED REVIEW CADENCE override, so it isolates verification cadence.
Cheapest coordination; the trade-off is the independent check arrives late (late
rework risk + self-certification drift on build phases). README spells it out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:02:44 +00:00
c6c7ce8640 change: base stateless + lean on the FULL original prompts (not minimal)
So that "stateless vs builder-adversary" and "lean vs stateless" isolate context
hygiene / review granularity WITHOUT the confound of the minimal prompts' reduced
testing pressure (which we found cuts ~25% of test methods). stateless = orig +
context hygiene; lean = orig + context hygiene + per-gate review. min stays the
pure minimal-prompt variant (isolates verbosity vs orig).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 03:17:47 +00:00
a0f7652e9e docs(examples): add builder-solo — single builder, no adversary (control)
A single Builder that builds AND self-verifies (same DoD rigor), with NO
independent Adversary and no claim/review handoff. The control for measuring
what the AI adversary costs (its tokens, ~half of a loop-pair run) and buys
(independent cold verification vs self-certification).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 02:34:50 +00:00
924874aafa feat: optional log_tokens — per-phase token + time accounting
When [watchdog].log_tokens (or [loop].log_tokens) is true, the watchdog records
for each phase how many tokens each agent used (and the total) and how long the
phase took, appended to <log_dir>/token-log.jsonl. Tokens are summed from each
agent's session transcript, attributed by working dir. View with `agents.py
tokens`. Baseline snapshot at phase start + delta at phase advance/complete;
robust across watchdog restarts. Validated: the transcript sum matches an
independent external collector exactly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 21:48:17 +00:00
e0425e6108 docs(examples): add builder-adversary-lean — context hygiene + per-gate review
Isolates the two effects conflated in builder-adversary-stateless: keeps all the
CONTEXT HYGIENE (compact/diffs/lean loads) but ENFORCES full per-gate review
granularity (one claim per gate, one independent verdict per gate, no batching).
Tests whether the token saving is real efficiency vs reduced scrutiny.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 21:42:12 +00:00
985d33dd51 docs(examples): add builder-adversary-stateless — context-lean variant
Same pattern + AI-as-adversary verification as builder-adversary-min, but the
role prompts add CONTEXT HYGIENE: /compact at every checkpoint (lossless — state
is on disk), read diffs not trees, spill bulk output to files, adversary loads
only {plan, STATUS, diff}. Loop agents non-resumed → fresh session per phase.
Targets cache-read (the dominant cost in a long loop) without changing what the
agents do or how they verify.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:47:58 +00:00
737ef81066 docs(examples): add builder-adversary-min — minimal-prompt variant
Same topology/behaviour as builder-adversary (loop pair, phase machine,
claim()/review() handoff, machine-docs coordination, cold verification) but the
role + kickoff prompts are compressed to minimal tokens, keeping every
load-bearing rule. Config and plans are unchanged. The separate
agent-orchestrator-benchmark repo runs a head-to-head token comparison.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:18:33 +00:00
11843f41a4 docs(examples): add IDEAS.md — backlog of creative example topologies
A sketch backlog of further examples, each teaching a distinct orchestration
topology (anthill/stigmergy, kitchen line/pipeline, incident room/blackboard,
senate/debate, baton/mutex+failover, immune system/reactive, evolution chamber,
plus ATC and day-night extras). Not implemented — ideas only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 18:13:48 +00:00
e4453dcfdd docs(examples): add the "snake pit" worker-pool example
Based on @ponder.ooo's "snake pit agent orchestrator" idea (bsky 2026-05-28) and
Claude's metaphor-mapping elaboration: agents are snakes, tasks are food tossed
into a shared pit; snakes devour/digest/regurgitate/excrete.

A worker-pool-over-a-shared-queue topology (contrast the builder-adversary phase
machine):
- pit/ is a filesystem queue; snakes claim by atomic mv (no two eat the same food)
- species = specialized agents: keeper (zookeeper), planner (regurgitation IS
  task decomposition), snake-1..3 (worker pool), cleanup (scavenger + coprophagy)
- no [loop] phase machine; persistent agents self-pace via /loop
- README carries the full bio→compute mapping table from the thread image

Verified: `agents.py status --config agents.toml` lists all 6 agents + service.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 17:50:42 +00:00
7f237a522c docs(examples): add a Builder/Adversary loop-pair example (the cc-ci pattern)
A self-contained examples/builder-adversary/ that distills the cc-ci production
loop pair into a tiny, fully-local task (build a `wc` CLI in two phases):

- agents.toml: builder + adversary loops, persistent orchestrator, on_complete
  reporter, cleanlogs service; phase machine with a per-phase model override
- prompts/: kickoff template + builder/adversary roles carrying the load-bearing
  protocol (claim()/review() handoff, machine-docs file-location rule,
  WHAT+HOW+EXPECTED+WHERE=STATUS / WHY=JOURNAL anti-anchoring, WAITING-UNTIL liveness)
- plans/: two phase plans (wc, json) each with a cold-verifiable Definition of Done
- README: how to run, the work-repo two-clone isolation model, how to adapt

Verified: `agents.py status --config agents.toml` parses and lists all agents.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 17:50:42 +00:00
cdcece9a9a test: add tests/ — unit suite + isolated live claude/opencode smokes + runner
Unit tests (no agents/tmux): config load + defaults merge, kickoff-template
assembly, phase machine (advance/idempotent-complete/append-resumes), limit
reset-banner parsing, WAITING-UNTIL/stall parsing, claude+opencode activity
detectors. Live smokes bring a throwaway project up THROUGH agents.py on each
real backend in an isolated sandbox (unique prefix, opencode on a non-4096
port), verify attach + status + down, and clean up. tests/run.sh runs unit
always + smokes when backends present; README documents it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 18:55:34 +00:00
66 changed files with 3215 additions and 0 deletions


@ -16,7 +16,9 @@ agents.py the driver + watchdog (pure Python stdlib; needs python >=
agent-log.py render claude JSONL transcripts into clean, greppable logs
agents.example.toml a self-contained 2-agent example project
prompts/ generic role + kickoff templates (builder / adversary / kickoff)
examples/ runnable example projects — the Builder/Adversary variant family, snakepit, …
smoke.sh bring the example up + tear it down in an isolated sandbox, then clean up
tests/ the test suite — unit tests + isolated live backend smokes + a runner
flake.nix/.lock a Nix devShell with the runtime deps (python311, tmux, git)
```
@ -48,6 +50,42 @@ python3 agents.py --config agents.toml phase show # where the loop phase mach
---
## Examples
`examples/` holds runnable example projects — copy one, point `agents.py` at its `agents.toml`, and
go. The headline set is a family of **Builder/Adversary** variants that all build the *same* task, each
differing in one dimension, which makes them useful both as templates and as a study of the pattern:
- **`builder-adversary`** — the canonical loop pair: a Builder that builds and an Adversary that
cold-verifies every claim, coordinating only through git (`claim(`/`review(` commits + the watchdog
handoff). **Start here.**
- **`builder-adversary-min`** — the same pattern with the prompts compressed to minimal tokens.
- **`builder-adversary-stateless`** — `builder-adversary` + **context hygiene** (compact at each
checkpoint, read diffs not trees, lean loads) to minimise carried/reloaded context.
- **`builder-adversary-lean`** — context hygiene + **per-gate** review (one claim/verdict per gate).
- **`builder-adversary-deferred`** — the Adversary verifies **once**, after the whole build, in a
final comprehensive `review` phase (vs per-phase / per-gate).
- **`builder-solo`** — a single Builder that self-certifies, with **no Adversary** (the control).
- **`snakepit`** — a different topology entirely: a pool of identical worker "snakes" pulling tasks
from a shared filesystem queue, plus cleanup specialists. (`examples/IDEAS.md` sketches more.)
Each example has its own `README.md`. Run one by hand:
```bash
cd examples/builder-adversary
python3 ../../agents.py status --config agents.toml # read-only
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
```
**Benchmark.** The separate
[`agent-orchestrator-benchmark`](https://git.autonomic.zone/recipe-maintainers/agent-orchestrator-benchmark)
repo runs these Builder/Adversary variants head-to-head (N=5, real `agents.py up` runs) to measure
what drives token cost. Short version: an independent adversary costs **~4.7×** a solo builder, but
the review *cadence* (per-gate / per-phase / deferred) is **nearly token-neutral**, and **context
hygiene** is the one clean **~22%** win. See that repo's `FINDINGS.md`.
---
## The config: `agents.toml`
Five section types: `[watchdog]`, `[backend.<name>]`, `[defaults]`, `[[agent]]` / `[[service]]`,
@ -62,6 +100,23 @@ heavy_interval = 300 # seconds between heal + phase-advance checks
limit_probe_fallback = 300 # re-probe cadence for a usage-limited agent when reset time is unparsable
limit_reset_slack = 45 # seconds to wait past a parsed reset before probing
stall_grace = 180 # seconds of slack past a WAITING-UNTIL marker before a stall reboot
log_tokens = false # opt-in: record per-phase token + time usage (see below)
```
**Per-phase token + time logging (`log_tokens`).** Set `log_tokens = true` (under `[watchdog]` or
`[loop]`) and the watchdog records, for **each phase**, how many tokens **each agent** used and how
long the phase took — appended as one JSON object per phase to `<log_dir>/token-log.jsonl`. Tokens
are summed from each agent's Claude Code session transcript and attributed **by working dir**, so
give each agent its own `dir` (the Builder/Adversary loop pair already uses separate clones) for
accurate per-agent numbers. The watchdog snapshots a baseline when a phase starts and writes the
delta (per agent, and the total) when the phase advances or the sequence completes — robust across
watchdog restarts. Pretty-print it with `agents.py tokens`:
```
phase dur(s) builder adversary TOTAL
-----------------------------------------------------
lex 372.0 3,910,118 3,221,447 7,131,565
parse 410.5 ...
```
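To consume the log from a script rather than through `agents.py tokens`, something like the sketch below may be enough. It assumes only the record shape the watchdog writes (`duration_s`, and `agents.<name>.total` per phase); the helper name `summarize_token_log` and the `.ao-state` path are hypothetical and not part of `agents.py`.

```python
import json
from collections import defaultdict
from pathlib import Path


def summarize_token_log(path):
    """Return (per-agent token totals, total seconds) summed over all logged phases."""
    per_agent = defaultdict(int)
    seconds = 0.0
    for line in Path(path).read_text().splitlines():
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip a partially written or corrupt line
        if not isinstance(rec, dict):
            continue
        for name, usage in rec.get("agents", {}).items():
            per_agent[name] += usage.get("total", 0)
        seconds += rec.get("duration_s") or 0  # duration_s may be null
    return dict(per_agent), seconds


if __name__ == "__main__":
    # Hypothetical location: <log_dir>/token-log.jsonl with log_dir = ".ao-state".
    log = Path(".ao-state/token-log.jsonl")
    if log.exists():
        print(summarize_token_log(log))
```

Because each phase is one self-contained JSON line, the file can be read while the watchdog is still appending to it; a half-written last line is simply skipped.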
### `[defaults]` — inherited by every agent
@ -239,6 +294,7 @@ agents.py status table of every agent: kind, backend, model, w
agents.py watchdog the supervisor loop (what the <prefix>watchdog session runs)
agents.py logs <name> tail that session's log
agents.py phase [show|next|set N] inspect / move the loop phase index
agents.py tokens per-phase token + time report (when [watchdog].log_tokens = true)
agents.py selftest regression-test the backend activity detector (needs no config)
agents.py init [dir] scaffold a starter agents.toml + prompts/ in a project dir
--config PATH use a specific config (default: ./agents.toml)
@ -315,6 +371,39 @@ documents this in its banner.
---
## Testing
The `tests/` directory holds the harness's own test suite. One runner drives everything:
```bash
nix develop -c ./tests/run.sh # unit tests always; live backend smokes when available
# or just: ./tests/run.sh # (python3 + tmux must be on PATH)
```
What it runs:
- **Unit tests** (`tests/test_unit.py`) — pure logic, **no agents spawned, no live tmux sessions**.
Cover config load + defaults merge, kickoff-template assembly, the phase machine (advance on the
done marker, idempotent sequence-complete, append-a-phase resumes), usage-limit reset-banner
parsing, `WAITING-UNTIL` / stall parsing, and the per-backend activity detectors (claude +
opencode footers). Always run; a failure fails the suite. Run them alone with
`python3 -m unittest discover -s tests` (or `python3 tests/test_unit.py`).
- **Live backend smokes** (`tests/smoke_claude.sh`, `tests/smoke_opencode.sh`) — each brings a
throwaway scratch project up **through `agents.py`** on a real backend, in a fully isolated
sandbox (its own unique `session_prefix`, a temp `log_dir`, and — for opencode — a dedicated
server on a non-default port `AOTEST_OC_PORT`, default `4097`), confirms the session attaches and
`status` reports it RUNNING, then `down`s it and cleans up (no leftover sessions, port freed).
Each **SKIPs gracefully** (exit 0) when its backend's binary or creds are unavailable. Useful env:
`CLAUDE_BIN` / `OPENCODE_BIN`, `AOTEST_MODEL`, `AOTEST_OC_PORT`, `AOTEST_OC_CREDS`.
- **Isolation sanity** — after the live runs, the runner asserts no `aotest-*` tmux sessions leaked
and reports that any live sessions are untouched.
The smokes are safe by construction: a unique per-run session prefix (never `cc-ci-` or any real
project's), a dedicated opencode port (never `4096`), and a cleanup trap that fires on success,
failure, and Ctrl+C.
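The same "safe by construction" idea can be sketched in Python. This is an illustrative analogue of the shell cleanup trap, not the code in `tests/`: `run_isolated`, `body`, and `teardown` are hypothetical names, and only the `aotest-` prefix comes from the text above.

```python
import uuid


def run_isolated(body, teardown):
    """Run `body` under a unique session prefix and always call `teardown` afterwards."""
    prefix = f"aotest-{uuid.uuid4().hex[:8]}-"  # unique per run, never a real project's prefix
    try:
        return body(prefix)
    finally:
        # Runs on success, on an exception, and on KeyboardInterrupt (Ctrl+C).
        teardown(prefix)
```

A caller would pass a `body` that brings the scratch project up and a `teardown` that kills every session starting with the given prefix.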
---
## Adding things
- **Add an agent** — add an `[[agent]]` block; `agents.py up <name>`. No code change.

agents.py

@ -14,6 +14,7 @@ Usage:
agents.py watchdog the supervisor loop (reads the config every tick)
agents.py logs <name> tail an agent's session log
agents.py phase [set N|next|show] inspect / move the loop phase
agents.py tokens per-phase token + time report (needs [watchdog].log_tokens = true)
agents.py selftest backend activity-detector regression checks (no config needed)
agents.py init [dir] scaffold a starter agents.toml + prompts/ in a project dir
@ -663,6 +664,99 @@ def start_loops(cfg):
for a in loop_agents(cfg):
start_agent(cfg, a)
# ── optional per-phase token + time logging (log_tokens) ──────────────────────────
# When [watchdog].log_tokens (or [loop].log_tokens) is true, the watchdog records, for each phase,
# how many tokens each agent used and how long the phase took, appended to <log_dir>/token-log.jsonl.
# Tokens are summed from each agent's Claude Code session transcript, attributed by working dir — so
# give each agent its OWN dir for accurate per-agent numbers (the Builder/Adversary loop pair already
# uses separate clones). View with: agents.py tokens.
def log_tokens_enabled(cfg):
    return bool(cfg.get("watchdog", {}).get("log_tokens") or cfg.get("loop", {}).get("log_tokens"))

def _transcript_dir(workdir):
    name = str(workdir).rstrip("/").replace("/", "-").replace(".", "-")
    return Path(os.path.expanduser("~/.claude/projects")) / name

def _sum_tokens(workdir):
    t = {"input": 0, "output": 0, "cache_create": 0, "cache_read": 0}
    d = _transcript_dir(workdir)
    if d.is_dir():
        for f in d.glob("*.jsonl"):
            try:
                for line in f.open(errors="ignore"):
                    try:
                        o = json.loads(line)
                    except Exception:
                        continue
                    if o.get("type") == "assistant":
                        u = (o.get("message", {}) or {}).get("usage", {}) or {}
                        t["input"] += u.get("input_tokens", 0) or 0
                        t["output"] += u.get("output_tokens", 0) or 0
                        t["cache_create"] += u.get("cache_creation_input_tokens", 0) or 0
                        t["cache_read"] += u.get("cache_read_input_tokens", 0) or 0
            except OSError:
                continue
    t["total"] = t["input"] + t["output"] + t["cache_create"] + t["cache_read"]
    return t

def _token_cumulative(cfg):
    """Cumulative tokens per agent so far, summed from each agent's transcript dir."""
    return {a["name"]: _sum_tokens(a["dir"]) for a in cfg["agents"].values()}

_TOKEN_KEYS = ("input", "output", "cache_create", "cache_read", "total")
def _token_state_path(cfg): return Path(cfg["state_dir"]) / "token-phase.json"
def _token_log_path(cfg): return Path(cfg["log_dir"]) / "token-log.jsonl"
def _tok_delta(cur, base): return {k: cur.get(k, 0) - base.get(k, 0) for k in _TOKEN_KEYS}

def token_phase_begin(cfg, phase_id):
    """Set the baseline (cumulative tokens + start time) for the phase now starting. Idempotent
    across watchdog restarts: keeps the original baseline if already tracking this phase."""
    if not log_tokens_enabled(cfg):
        return
    sf = _token_state_path(cfg)
    try:
        if json.loads(sf.read_text()).get("phase_id") == phase_id:
            return
    except Exception:
        pass
    sf.write_text(json.dumps({"phase_id": phase_id,
                              "started": datetime.now().isoformat(timespec="seconds"),
                              "baseline": _token_cumulative(cfg)}))

def token_phase_flush(cfg, next_phase_id):
    """Close the current phase: append its per-agent + total token deltas and duration to the
    token-log, then re-baseline for next_phase_id (or finalize tracking if None)."""
    if not log_tokens_enabled(cfg):
        return
    sf = _token_state_path(cfg)
    try:
        st = json.loads(sf.read_text())
    except Exception:
        return
    cur = _token_cumulative(cfg)
    base = st.get("baseline", {})
    started = st.get("started")
    try:
        dur = round((datetime.now() - datetime.fromisoformat(started)).total_seconds(), 1)
    except Exception:
        dur = None
    per_agent = {n: _tok_delta(cur.get(n, {}), base.get(n, {})) for n in cur}
    total = {k: sum(per_agent[n][k] for n in per_agent) for k in _TOKEN_KEYS}
    rec = {"phase_id": st.get("phase_id"), "started": started,
           "ended": datetime.now().isoformat(timespec="seconds"), "duration_s": dur,
           "agents": per_agent, "total": total}
    with _token_log_path(cfg).open("a") as fh:
        fh.write(json.dumps(rec) + "\n")
    parts = ", ".join(f"{n}={per_agent[n]['total']:,}" for n in per_agent)
    log(f"[log_tokens] phase {rec['phase_id']}: {total['total']:,} tok in {dur}s ({parts})")
    if next_phase_id is not None:
        sf.write_text(json.dumps({"phase_id": next_phase_id,
                                  "started": datetime.now().isoformat(timespec="seconds"),
                                  "baseline": cur}))
    else:
        sf.unlink(missing_ok=True)
def phase_advance_check(cfg):
"""On heavy tick: if the current phase is DONE, advance (or finish the sequence).
@ -681,6 +775,7 @@ def phase_advance_check(cfg):
nxt = idx + 1
if nxt < len(ps):
log(f"PHASE {ph['id']} DONE — auto-transitioning to {ps[nxt]['id']}")
token_phase_flush(cfg, ps[nxt]["id"])
stop_loops(cfg)
Path(phase_idx_file(cfg)).write_text(str(nxt))
if marker.exists():
@ -692,6 +787,7 @@ def phase_advance_check(cfg):
if marker.exists():
return False # already handled — idempotent (no re-log, no re-stop)
log(f"PHASE SEQUENCE COMPLETE (last phase {ph['id']} DONE) — stopping loops")
token_phase_flush(cfg, None)
stop_loops(cfg)
ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
marker.write_text(f"phase sequence complete {ts}. Loops stopped; build finished.\n")
@ -721,6 +817,9 @@ def watchdog_loop(cfg_path):
f"signal={sig}s heavy={heavy}s, watching: {[a['name'] for a in watched(cfg)]}")
elapsed = heavy # force a heavy check on first tick
wake_elapsed = {a["name"]: 0 for a in cfg["agents"].values() if a.get("wake")}
if log_tokens_enabled(cfg):
token_phase_begin(cfg, cur_phase(cfg).get("id"))
log("[log_tokens] enabled — per-phase token + time logging to token-log.jsonl")
while True:
cfg = load_config(cfg_path) # re-read every tick: config is authoritative, no env drift
has_loops = bool(loop_agents(cfg))
@ -840,6 +939,37 @@ def cmd_phase(cfg, args):
Path(phase_idx_file(cfg)).write_text(str(int(args[1])))
print(f"phase idx now {cur_idx(cfg)} ({cur_phase(cfg).get('id')})")
def cmd_tokens(cfg):
    """Pretty-print <log_dir>/token-log.jsonl: per-phase tokens by agent + total + duration."""
    p = _token_log_path(cfg)
    if not p.exists():
        print(f"no token log at {p}\n(set [watchdog].log_tokens = true and run the loop)"); return
    recs = []
    for line in p.read_text().splitlines():
        try: recs.append(json.loads(line))
        except Exception: pass
    if not recs:
        print("token log is empty"); return
    names = []
    for r in recs:
        for n in r.get("agents", {}):
            if n not in names: names.append(n)
    w = max([7] + [len(n) for n in names])
    hdr = f"{'phase':<10} {'dur(s)':>8} " + " ".join(f"{n:>{w}}" for n in names) + f" {'TOTAL':>13}"
    print(hdr); print("-" * len(hdr))
    grand = {n: 0 for n in names}
    durtot = 0.0
    for r in recs:
        ag = r.get("agents", {})
        cells = " ".join(f"{ag.get(n,{}).get('total',0):>{w},}" for n in names)
        print(f"{str(r.get('phase_id')):<10} {str(r.get('duration_s')):>8} {cells} "
              f"{r.get('total',{}).get('total',0):>13,}")
        for n in names: grand[n] += ag.get(n,{}).get("total",0)
        durtot += r.get("duration_s") or 0
    print("-" * len(hdr))
    cells = " ".join(f"{grand[n]:>{w},}" for n in names)
    print(f"{'TOTAL':<10} {durtot:>8.0f} {cells} {sum(grand.values()):>13,}")
def cmd_selftest():
"""Self-contained regression checks for the footer-UI activity detector. Needs no config."""
backend = {
@ -917,6 +1047,7 @@ def main():
elif cmd == "status": cmd_status(cfg)
elif cmd == "watchdog": watchdog_loop(cfg_path)
elif cmd == "phase": cmd_phase(cfg, rest)
elif cmd == "tokens": cmd_tokens(cfg)
elif cmd == "logs":
if not rest:
die("usage: agents.py logs <name>")

examples/IDEAS.md

@ -0,0 +1,93 @@
# Example ideas — creative multi-agent topologies
A backlog of *example* projects for `examples/`, each chosen to teach a **different orchestration
topology** on the same harness. Nothing here is implemented yet — these are sketches.
Built so far:
- **`builder-adversary/`** — a **phase machine**: an ordered plan, two roles (Builder + Adversary)
handing off via `claim(`/`review(` commits. (The cc-ci pattern.)
- **`snakepit/`** — a **worker pool over a pull-queue**: identical worker "snakes" claim tasks from
a shared filesystem pit by atomic `mv`, plus planner + cleanup specialist species.
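The "claim by atomic `mv`" step of the snakepit can be sketched in a few lines of Python. This is a minimal illustration, assuming the pit and the claimed directory sit on the same filesystem (where a POSIX rename is atomic); the function name `claim` and the `pit/` and `claimed/` layout are hypothetical, not the example's actual files.

```python
import os
import tempfile
from pathlib import Path


def claim(task: Path, claimed_dir: Path, worker: str):
    """Try to claim `task` by renaming it into `claimed_dir`; return the new path or None."""
    target = claimed_dir / f"{worker}--{task.name}"
    try:
        os.rename(task, target)  # atomic on POSIX when both paths share a filesystem
    except FileNotFoundError:
        return None  # another worker renamed it first
    return target


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    pit, claimed = root / "pit", root / "claimed"
    pit.mkdir()
    claimed.mkdir()
    (pit / "task-1.md").write_text("count words\n")
    print(claim(pit / "task-1.md", claimed, "snake-1"))  # a path under claimed/
    print(claim(pit / "task-1.md", claimed, "snake-2"))  # None: already eaten
```

Only one rename of a given source path can succeed, so the loser sees `FileNotFoundError` and moves on to other food without any lock.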
Each idea below lists: the metaphor, the topology it teaches, the star harness primitive, and what
makes it distinct from what we already have.
---
## Strong candidates
### 🐜 Anthill (stigmergy)
Ants coordinate with *no direct messaging*: they lay pheromone trails, others follow the strong
ones, and trails **evaporate** over time. Agents drop weighted "trail" files toward promising
solutions/paths; a `[[service]]` slowly decays them.
- **Teaches:** indirect coordination through a decaying shared environment (opposite of snakepit's
explicit claim).
- **Star primitive:** a background **service** as the evaporation clock; emergent routing with zero
agent-to-agent chat.
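One way the evaporation clock might look, as a rough sketch only: each trail is assumed to be a small JSON file with a `weight` field, and the decay `rate` and `floor` values are invented for illustration. A `[[service]]` command could call `evaporate` on a timer.

```python
import json
import tempfile
from pathlib import Path


def evaporate(trail_dir: Path, rate: float = 0.1, floor: float = 0.05) -> int:
    """Decay every trail's weight by `rate`; delete trails below `floor`. Returns survivors."""
    alive = 0
    for f in trail_dir.glob("*.json"):
        trail = json.loads(f.read_text())
        trail["weight"] *= (1.0 - rate)
        if trail["weight"] < floor:
            f.unlink()  # the trail has evaporated
        else:
            f.write_text(json.dumps(trail))
            alive += 1
    return alive


if __name__ == "__main__":
    d = Path(tempfile.mkdtemp())
    (d / "strong.json").write_text(json.dumps({"weight": 1.0}))
    (d / "faint.json").write_text(json.dumps({"weight": 0.05}))
    print(evaporate(d))  # the faint trail drops below the floor and is removed
```

Agents would reinforce a trail by adding to its weight; routing then emerges from which files survive the decay.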
### 🍳 The Line (kitchen brigade)
A restaurant pass: prep → sauté → plating → **expo**. A ticket (order) flows station to station; the
expo bounces a bad plate back down the line. Many tickets in flight at once.
- **Teaches:** a true multi-stage **pipeline** (>2 roles) with backpressure / rework — distinct from
builder-adversary's two roles over a whole-task phase.
- **Star primitive:** chained `handoff` inboxes + per-station commit prefixes (`fire(`, `plate(`,
`expo(`).
### 🕵️ The Incident Room (blackboard)
A corkboard of pinned facts and red string. Specialist detectives (forensics, alibi, motive,
witnesses) each watch the board and pin a new deduction *only when their preconditions appear*; a
lead declares the case closed.
- **Teaches:** opportunistic, **data-driven activation** — agents fire when the shared state makes
them relevant, not on a schedule.
- **Star primitive:** a shared blackboard file + watchdog pings on board changes; no fixed order.
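The activation rule can be illustrated with a toy, in-memory blackboard. This is a sketch under simplifying assumptions: facts are plain strings, each specialist is a `(preconditions, deduction)` pair, and `run_blackboard` is a hypothetical name. In the real example the board would be a shared file and the watchdog would ping agents on changes.

```python
def run_blackboard(facts, specialists):
    """Fire any specialist whose preconditions are on the board until nothing new is pinned."""
    board = set(facts)
    changed = True
    while changed:
        changed = False
        for needs, deduction in specialists:
            # A specialist acts only when all of its preconditions have appeared.
            if needs <= board and deduction not in board:
                board.add(deduction)
                changed = True
    return board


if __name__ == "__main__":
    specialists = [
        ({"alibi-broken", "motive"}, "suspect-named"),  # the lead detective
        ({"fingerprint"}, "alibi-broken"),              # forensics
    ]
    print(sorted(run_blackboard({"fingerprint", "motive"}, specialists)))
```

Note that the list order of the specialists does not matter: the lead only fires after forensics has pinned its deduction, which is the point of data-driven activation.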
### ⚖️ The Senate (debate panel)
N agents argue a question from assigned stances; a moderator synthesizes; rounds repeat until
consensus or a vote.
- **Teaches:** structured **multi-round deliberation** with diverse "minds."
- **Star primitive:** the **phase machine** where each phase = one debate round, plus **per-phase
model overrides** to give each seat a genuinely different model; `on_complete` writes the verdict.
### 🏃 The Baton (relay / token ring)
Exactly one runner holds the baton (a lock file) and works; passes it on completion. Drop the baton
(crash) and the next runner picks it up.
- **Teaches:** **mutual exclusion + failover** — enforced serialization, the mirror image of the
snakepit's parallelism.
- **Star primitive:** `watch = "heal"` + the watchdog reaping a dead holder so the baton never gets
stuck.
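The baton itself could be as small as an exclusively created lock file. The sketch below assumes a local POSIX filesystem; `take_baton` and `pass_baton` are hypothetical helpers, and failover (the watchdog reaping a dead holder) is left as simply deleting the stale lock.

```python
import os
import tempfile
from pathlib import Path


def take_baton(lock: Path, runner: str) -> bool:
    """Atomically create the lock file; False if another runner already holds it."""
    try:
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fh:
        fh.write(runner)  # record the holder so a reaper can tell who dropped it
    return True


def pass_baton(lock: Path, runner: str) -> None:
    """Release the baton, but only if `runner` is the current holder."""
    if lock.exists() and lock.read_text() == runner:
        lock.unlink()


if __name__ == "__main__":
    lock = Path(tempfile.mkdtemp()) / "baton.lock"
    print(take_baton(lock, "runner-1"))  # True
    print(take_baton(lock, "runner-2"))  # False: runner-1 still holds it
    pass_baton(lock, "runner-1")
    print(take_baton(lock, "runner-2"))  # True
```

`O_CREAT | O_EXCL` makes creation fail if the file already exists, which is what gives exactly one holder at a time.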
### 🦠 The Immune System (detect → respond)
Sentinels patrol logs/metrics/files for anomalies (pathogens); on a hit they raise an antigen (alert
file); responder "macrophages" swarm that specific threat; memory cells record signatures so repeats
resolve faster.
- **Teaches:** an **event-driven monitoring/reactive** topology with escalation.
- **Star primitive:** a watcher **service** emitting alerts + reactive agents woken by inbox pings.
- **Bonus:** genuinely *useful* — a self-healing "watch my repo/CI" tool wearing a fun costume.
### 🧬 The Evolution Chamber (genetic algorithm)
A population of candidate solutions; breeder agents mutate/crossbreed; a selector culls by fitness;
generations advance until fitness plateaus.
- **Teaches:** **population-based iterative search** with a fitness gate.
- **Star primitive:** phase machine where each phase = one generation; `done_marker` trips when
fitness stops improving.
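"Fitness stops improving" needs a concrete test before a `done_marker` can trip on it. One possible definition is sketched below; the `patience` and `min_gain` parameters and the name `has_plateaued` are assumptions for illustration, not anything the harness provides.

```python
def has_plateaued(best_per_generation, patience=3, min_gain=1e-3):
    """True when the last `patience` generations improved the best fitness by < `min_gain`."""
    if len(best_per_generation) <= patience:
        return False  # not enough history to judge
    before = max(best_per_generation[:-patience])
    recent = max(best_per_generation[-patience:])
    return recent - before < min_gain


if __name__ == "__main__":
    print(has_plateaued([0.1, 0.2, 0.3, 0.4, 0.5]))       # still climbing
    print(has_plateaued([0.1, 0.5, 0.8, 0.8, 0.8, 0.8]))  # flat for three generations
```

A selector agent could write the done marker for the phase once this returns true for the recorded per-generation best scores.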
---
## Quick extras (less fleshed out)
- **🗼 Air Traffic Control** — many workers contend for *one* scarce runway (a single deploy/build
slot); a controller grants timed landing slots. Teaches centralized **scarce-resource
arbitration** (snakepit has plentiful work; here the *resource* is the bottleneck).
- **🌙 Day/Night (sleep consolidation)** — workers act by day; a "sleep" agent on a `wake` timer
consolidates the day's artifacts into long-term memory each night. Teaches **scheduled batch
consolidation** (the "memory builder / coprophagy" idea as its own example).
---
## Suggested next trio
If picking three that cover the most new ground: **The Line** (pipeline), **The Incident Room**
(blackboard), and **The Immune System** (reactive monitoring — and actually useful).
Each should follow the snakepit shape: a README with the metaphor→compute mapping, an `agents.toml`,
role prompts, and a tiny runnable task.


@ -0,0 +1,48 @@
# Builder/Adversary example — deferred review (verify after a long segment)
The coarsest point on the **review-cadence spectrum**. Same pattern, same full original prompts as
`../builder-adversary` — only *when* the Adversary verifies changes:
| variant | the Adversary verifies… | handshakes (calculator task) |
|---|---|--:|
| `builder-adversary-lean` | per **gate** | ~12 claim/verify round-trips |
| `builder-adversary` (orig) | per **phase** | ~3 |
| **`builder-adversary-deferred`** | **once, after the whole build** | **1** |
## How it works
The Builder **self-certifies** the build phases (`wc`, then `json`) — builds to each phase's DoD, runs
its own tests until green, writes `## DONE`, and advances *without* waiting for the Adversary. The
Adversary stays out of the build. Only in the final **`review` phase** does it do **one comprehensive
cold-verification of the entire accumulated calculator** (`plans/review.md`): re-run every DoD item
from every phase from a fresh clone, plus cross-feature break-it probes, file all findings at once,
re-verify after fixes, then PASS. That single pass is the only adversary gate in the run.
## The trade-off
- **Cheapest coordination.** One handshake instead of 3 to 12: no per-gate/per-phase round-trips, and the
Builder isn't interrupted mid-build. (The benchmark showed coordination round-trips are a real
token cost; deferring to one pass minimises them.)
- **But the independent check arrives late.** Two risks the per-gate/per-phase cadences guard
against:
- **Late discovery / rework.** If the Builder built phase 2 on a wrong assumption from phase 1, an
early adversary would have caught it at gate 1; here it surfaces only at the end, after more work
was piled on the flaw — potentially a larger, costlier fix.
- **Self-certification drift.** The build phases are self-certified, so a bug the Builder
rubber-stamps survives until the final review. The comprehensive pass is the only safety net, so
it must be thorough.
- **Better at cross-feature bugs.** Because it verifies the whole system at once, it's positioned to
catch *interactions* (e.g. `--json` × every flag) that a per-gate view, looking at one item at a
time, can miss.
So `deferred` trades *early, incremental* assurance for *minimal coordination + one holistic pass*.
It suits work where features are independent and cheap to fix late; it's risky where early decisions
constrain later ones.
```bash
python3 ../../agents.py status --config agents.toml
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
```
> **Prompt base:** the full original `builder-adversary` prompts + a DEFERRED REVIEW CADENCE override
> — so comparing this to `builder-adversary`/`lean` isolates *only* the verification cadence.


@ -0,0 +1,79 @@
# examples/builder-adversary-deferred — Adversary verifies ONCE, after a long segment of building.
#
# Same pattern + full original prompts as ../builder-adversary, but the REVIEW CADENCE is coarsest:
# • lean = the Adversary verifies per gate (finest)
# • orig = the Adversary verifies per phase (medium)
# • deferred = the Adversary verifies ONCE, comprehensively, after the whole build (coarsest)
# The Builder SELF-CERTIFIES the build phases (wc, json) to advance; the Adversary stays out until the
# final `review` phase, where it cold-verifies the ENTIRE accumulated calculator in one pass. Cheapest
# coordination, but the independent check arrives late (see README for the trade-off).
#
# python3 ../../agents.py status --config agents.toml
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
[watchdog]
signal_interval = 30
heavy_interval = 300
limit_probe_fallback = 300
limit_reset_slack = 45
stall_grace = 180
[defaults]
session_prefix = "badef-" # tmux namespace: badef-builder, badef-adv, …
log_dir = ".ao-state"
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
model = "claude-sonnet-4-6"
watch = "heal"
[backend.claude]
bin = "claude"
flags = "--dangerously-skip-permissions"
remote_control = true
supports_resume = true
prompt_delivery = "arg"
process_name = "claude"
submit_key = "Enter"
stall_idle = 300
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
[backend.demo]
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
prompt_delivery = "exec"
[[agent]]
name = "builder" # tmux session: badef-builder
kind = "loop"
role = "builder"
dir = "./work"
watch = "heal+stall"
[[agent]]
name = "adversary"
session = "badef-adv"
kind = "loop"
role = "adversary"
dir = "./work-adv"
watch = "heal+stall"
[[service]]
name = "cleanlogs"
command = "python3 ../../agent-log.py follow-all"
dir = "."
[loop]
state_file = "phase-idx"
resume_phase = true
auto_advance = true
done_marker = "## DONE"
kickoff_template = "prompts/kickoff.md"
roles_dir = "prompts"
handoff = { repo = "./work", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
# Build phases (wc, json) are self-certified by the Builder; the final `review` phase is the single
# comprehensive Adversary gate over the whole accumulated build.
phases = [
{ id = "wc", plan = "plans/wc.md", status = "STATUS-wc.md" },
{ id = "json", plan = "plans/json.md", status = "STATUS-json.md" },
{ id = "review", plan = "plans/review.md", status = "STATUS-review.md" },
]


@ -0,0 +1,32 @@
# Phase `json` — machine-readable output
**Mission.** Extend the `wc.py` from the previous phase with a `--json` mode, without regressing any
`wc`-phase behaviour. Single source of truth for this phase.
(In this example the phase config sets no per-phase model override, so both the Builder and the
Adversary run on the default model for this phase.)
## Definition of Done
- **D1 — json output.** `python wc.py --json FILE` prints a single JSON object:
`{"lines": N, "words": N, "chars": N, "file": "FILE"}` (valid JSON, parseable by `json.loads`).
With stdin (no FILE), `"file"` is `null`.
- **D2 — composes with flags.** `--json` honours `-l/-w/-c`: only the requested counts appear as keys
(plus `file`). E.g. `wc.py --json -l FILE` → `{"lines": N, "file": "FILE"}`.
- **D3 — no regression.** Every `wc`-phase gate (D1–D4 there) still passes unchanged.
- **D4 — tests green.** `test_wc.py` is extended for the JSON cases and `pytest -q` is all-green.
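A minimal sketch of how D1 and D2 could compose, assuming the counts are already available as a dict. The helper name `to_json` and its parameters are hypothetical and are not part of the plan; only the key names and the `null` file for stdin come from the DoD above.

```python
import json

def to_json(counts, selected=None, filename=None):
    """Serialise counts as one JSON object (hypothetical helper, not from the plan).

    counts   -- dict with "lines", "words", "chars"
    selected -- subset of those keys requested via -l/-w/-c, or None for all
    filename -- path string, or None when reading stdin
    """
    order = ("lines", "words", "chars")
    keys = [k for k in order if not selected or k in selected]
    payload = {k: counts[k] for k in keys}
    payload["file"] = filename  # None serialises to JSON null (stdin case)
    return json.dumps(payload)

counts = {"lines": 2, "words": 5, "chars": 10}
print(to_json(counts, filename="/tmp/f.txt"))                       # D1
print(to_json(counts, selected={"lines"}, filename="/tmp/f.txt"))   # D2
print(to_json(counts))                                              # stdin: "file" is null
```

Keeping `file` as the last key and always present mirrors the D2 wording ("only the requested counts appear as keys, plus `file`").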
## How the Adversary verifies (cold)
```bash
pytest -q # D4 + D3 regression
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py --json /tmp/f.txt | python -c 'import sys,json; d=json.load(sys.stdin); \
assert d=={"lines":2,"words":5,"chars":10,"file":"/tmp/f.txt"}, d; print("ok")' # D1
python wc.py --json -l /tmp/f.txt # D2: expect {"lines": 2, "file": "/tmp/f.txt"}
```
The Builder restates the exact commands, expected JSON, and commit sha in
`machine-docs/STATUS-json.md`. When every DoD item has a fresh PASS in `machine-docs/REVIEW-json.md`
and there is no `## VETO`, the Builder writes `## DONE` to `STATUS-json.md`. In this example `json`
is followed by the final `review` phase, so the watchdog then advances to `review` (see `agents.toml` `[loop].phases`).


@ -0,0 +1,24 @@
# Phase `review` — comprehensive deferred verification
This phase adds **no new features**. The Builder has self-certified the build phases (`wc`, `json`)
and accumulated the whole `wc` tool. Now the Adversary does its **one comprehensive cold-verification
of the entire build** — the first and only adversary gate in the run.
## Definition of Done
- **D1 — full cold re-verify.** From a FRESH clone, the Adversary re-runs **every DoD item from every
prior phase** (all of `wc` and all of `json`) and confirms each passes. Nothing is taken on the
Builder's word.
- **D2 — full suite green.** The complete test suite (`pytest -q`) passes, 0 failures.
- **D3 — cross-feature break-it.** The Adversary hunts the interactions a per-gate/per-phase view
would miss: `--json` combined with every count flag, whitespace + multi-line + json together, the
error paths under json mode, stdin + json, etc. — and files any defects it finds.
- **D4 — findings cleared.** Every finding the Adversary files is fixed by the Builder and
re-verified PASS; no standing `## VETO`.
## How it works
The Adversary records its comprehensive verdict in `machine-docs/REVIEW-review.md`
(`review(all): PASS`, or findings with repro). The Builder fixes anything found, then writes
`## DONE` to `machine-docs/STATUS-review.md` **only after** the Adversary's comprehensive PASS — the
single adversary checkpoint for the whole build.
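The cross-feature hunt in D3 can be made systematic by enumerating the flag combinations up front. The sketch below is one possible way to do that; the function name `probe_matrix` and the fixture path are assumptions for illustration, and it only builds the command lines (running them and judging the output is left to the Adversary).

```python
from itertools import combinations

COUNT_FLAGS = ("-l", "-w", "-c")

def probe_matrix(path="/tmp/f.txt"):
    """Yield argv lists: --json with every subset of count flags, for FILE and stdin.

    Hypothetical helper; `path` is an assumed fixture location.
    """
    for r in range(len(COUNT_FLAGS) + 1):
        for flags in combinations(COUNT_FLAGS, r):
            for source in ([path], []):  # [] means "read stdin"
                yield ["python", "wc.py", "--json", *flags, *source]

probes = list(probe_matrix())
print(len(probes))  # 8 flag subsets x 2 input sources = 16
for argv in probes[:4]:
    print(" ".join(argv))
```

Each generated command can then be paired with an expected JSON object derived from the `wc` and `json` phase DoDs.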


@ -0,0 +1,43 @@
# Phase `wc` — a word-count CLI
**Mission.** Build a small, dependency-free `wc` clone in Python: a script `wc.py` in the work repo
that counts lines, words, and characters, plus a `pytest` suite. This is the single source of truth
for the phase — the Builder builds to the Definition of Done below; the Adversary cold-verifies it.
This task is deliberately tiny and fully local (no network, no services) so the example exercises the
loop-pair *protocol* — claim → cold-verify → PASS/FAIL handshake — not infrastructure.
## Definition of Done
Each Dn is an independent gate. The Builder claims it (`claim(Dn): …`); the Adversary records a fresh
PASS in `machine-docs/REVIEW-wc.md` after re-running the check from its own clone.
- **D1 — default output.** `python wc.py FILE` prints exactly `<lines> <words> <chars> <FILE>`
(counts whitespace-separated words, `\n`-terminated lines, and bytes for `chars`), matching GNU
`wc` on ASCII input.
- **D2 — flags.** `-l`, `-w`, `-c` restrict the output to that single count (e.g. `wc.py -l FILE`
prints `<lines> <FILE>`). Flags may combine; output order is lines, words, chars.
- **D3 — stdin.** With no FILE argument, `wc.py` reads stdin and prints the counts with no filename.
- **D4 — tests green.** A `test_wc.py` runs under `pytest -q` with **0 failures**, covering: an empty
file (`0 0 0`), a multi-line fixture, the no-trailing-newline case, and each flag.
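For orientation, the counting rules in D1 can be sketched in a few lines. This is an illustrative sketch, not the Builder's required implementation; the function name `count` is an assumption. It counts newline bytes for lines (so a file without a trailing newline reports one line fewer than it visually has, as GNU `wc` does), ASCII-whitespace-separated tokens for words, and bytes for chars.

```python
def count(data: bytes):
    """Return (lines, words, chars) for raw input bytes (hypothetical helper)."""
    lines = data.count(b"\n")   # number of newline terminators, not visual lines
    words = len(data.split())   # bytes.split() with no args splits on ASCII whitespace
    chars = len(data)           # byte count, matching `wc -c`
    return lines, words, chars

print(count(b"a b c\nd e\n"))  # the fixture used in the verification commands
print(count(b""))              # empty file
print(count(b"a b"))           # no trailing newline
```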
## How the Adversary verifies (cold)
From a fresh clone of the work repo:
```bash
pytest -q # D4: must be all-green
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py /tmp/f.txt # D1: expect "2 5 10 /tmp/f.txt"
python wc.py -l /tmp/f.txt # D2: expect "2 /tmp/f.txt"
printf 'a b c\nd e\n' | python wc.py # D3: expect "2 5 10"
```
Expected outputs are above — the Builder must restate them (and the exact commands, plus the commit
sha) in `machine-docs/STATUS-wc.md` so the Adversary can re-run without reading the Builder's
reasoning. Any mismatch is a FAIL with repro steps in `machine-docs/REVIEW-wc.md`.
## Out of scope (defer to a later phase or DEFERRED.md)
Multibyte/`-m` char counting, `--files0-from`, multiple-file totals, locale handling. JSON output is
the next phase (`plans/json.md`).


@ -0,0 +1,31 @@
You are the **Adversary** — one of two independent loops. Your job is to **DISBELIEVE the Builder**. You run as a SEPARATE process and coordinate ONLY through the git repo. Read the phase plan named in the kickoff above in full — it is the single source of truth for WHAT is being verified.
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. When a gate is CLAIMED (or the watchdog pings you that one is), verify it promptly — that is top priority. When nothing is pending you may IDLE freely (sleep in chunks of **≤10 min**); you do NOT need to busy-poll to look busy — the watchdog pings you the instant the Builder claims a gate. Poll ~4 min only while actively watching a CLAIMED gate's run. Keep running independent break-it probes even when no gate is pending. Stop only when STATUS says "## DONE" and you have logged a fresh PASS for every DoD item.
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
- **Cap every wait at 10 minutes.** Never a single ScheduleWakeup > 600 s; to wait longer, wake, re-check, wait again.
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min out, matching your ScheduleWakeup; compute with `date -u -d '+10 min' +%FT%TZ`). Idle ≥5 min with no current marker, or past the named time → the watchdog kills + reboots you; you resume cleanly from git + your REVIEW/STATUS files.
- **Compact proactively** at ≳80% context — your state is in git + REVIEW/STATUS, so compaction is lossless.
**Coordinate ONLY through git:**
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root. If you find one at the root, `git mv` it in.
- **Keep your OWN clone** (the `dir` this agent runs in). You verify from a COLD START in it. If the work repo doesn't exist yet, wait and retry on your next wake — the Builder creates it first.
- `git pull --rebase` before every edit; commit; push; **never `--force`.**
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit that records a **verdict or finding** with `review(...)` (e.g. `review(D2): PASS` / `review(D2): FAIL — repro …`). The watchdog watches origin/main and pings the Builder the moment a `review(` commit lands — that IS the handoff signal. (The Builder's gate claims are `claim(...)`.)
- Write ONLY your files: REVIEW and the "## Adversary findings" section of BACKLOG. Everything else (code, STATUS, JOURNAL, "## Build backlog") is read-only to you.
- **INBOX side-channel.** For non-gate messages to the Builder, append `machine-docs/BUILDER-INBOX.md` and push (the watchdog edge-pings the Builder). To receive from the Builder, look for `machine-docs/ADVERSARY-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). Formal verdicts still live in REVIEW.
**ISOLATION DISCIPLINE (anti-anchoring — critical).** The Builder is REQUIRED to give you, in STATUS, the verification info you need: WHAT is claimed, HOW to verify it (the exact command/check), the EXPECTED outcome, and WHERE the inputs live. **Read STATUS for that — you need all of it.** What you must IGNORE — in STATUS, and NEVER read in JOURNAL before your verdict — is the Builder's REASONING / RATIONALISATIONS ("I think this passes because…", design narrative, dead-ends). Reading those anchors you. Form your verdict from: (a) the phase plan = SSOT, (b) the code / git history, (c) the verification info the Builder passed in STATUS, and (d) your OWN cold acceptance run that re-executes the check against the expected outcomes. Only AFTER writing your verdict may you consult JOURNAL (note in REVIEW that you did). Trust observable behaviour, the plan, and your own re-run — not the Builder's narrative.
**Each wake:**
1. Pull. Read STATUS for any "Gate: <id> CLAIMED, awaiting Adversary".
2. Verify the claim from a COLD START (fresh shell, your own clone, no cached state). Re-run the DoD acceptance check yourself; do not trust the Builder's word.
3. Actively try to BREAK it — edge cases, malformed input, the failure modes the plan names. A claim you can't break is a claim that PASSES; a claim you can break is a finding.
4. Record verdicts in REVIEW ("<id>: PASS @<ts>" + evidence, or FAIL with repro steps). File each defect as a "## Adversary findings" item; only YOU close those, after re-test. You hold veto: write "## VETO <reason>" to REVIEW to forbid DONE until cleared.
5. Push (with a `review(...)` prefix). Schedule the next wake.
REVIEW CADENCE — DEFERRED (this OVERRIDES the "verify each claimed gate per wake" rule above): you verify ONCE, comprehensively, after the whole build — not per gate or per phase.
- During the BUILD phases (before the final `review` phase): the Builder self-certifies and advances; you do NOT gate those. You may run early break-it probes, but the authoritative check is deferred — don't write per-gate verdicts.
- In the `review` phase: do ONE comprehensive cold-verification of the ENTIRE accumulated build from a fresh clone — re-run EVERY DoD item from EVERY prior phase, and hunt cross-feature / integration breaks (interactions between features, not just isolated gates). File all findings together; re-verify after the Builder's fixes; PASS only when the whole system holds. This single comprehensive pass replaces per-gate review.
Begin: read the phase plan, then enter the self-paced loop (start by cloning the work repo into your `dir` if it already exists).


@ -0,0 +1,35 @@
You are the **Builder** — one of two independent loops working on this project. Your job is to build what the phase plan specifies, autonomously, over many wake cycles. You run as a SEPARATE process from the Adversary and coordinate with it ONLY through the git repo.
Single source of truth: the phase plan named in the kickoff above. Read it in full now, then begin.
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Each iteration = one unit of work. Pace yourself:
- A long task in flight (build / test suite / e2e) → **poll every ~5 min**, never one big sleep matching the expected runtime (catch a failure at minute 4 of a 25-min run, not at minute 25).
- Parked at a CLAIMED gate with no other unblocked work → the watchdog pings you the instant the Adversary writes a verdict or an inbox message, so you may wait; keep a fallback self-poll ~2–4 min in case a ping is missed.
- Genuinely idle → sleep in chunks of **≤10 min**. Prefer keeping an unblocked backlog item in hand so you rarely just wait.
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
- **Cap every wait at 10 minutes.** To wait longer, wake at 10 min, re-check, wait again. Never a single ScheduleWakeup > 600 s.
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` — the time you will resume (≤10 min out, matching your ScheduleWakeup). Compute it from the clock (`date -u -d '+10 min' +%FT%TZ`). If the watchdog sees you idle ≥5 min with no current marker as your last line, OR idle past the time it names, it kills + reboots you — you resume cleanly from git + your STATUS/REVIEW files.
- **Compact proactively.** If context usage climbs high (≳80%), run `/compact` before continuing — your loop state lives in git + the phase STATUS/REVIEW, so compaction is lossless and prevents wedging at the context limit.
**Coordinate ONLY through git:**
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root — phase-namespaced STATUS/BACKLOG/REVIEW/JOURNAL, plus DECISIONS.md and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. Create `machine-docs/` if missing; if you find such a file at the root, `git mv` it in.
- `git pull --rebase` before every edit; make the smallest change; commit; push. **Never `--force`.**
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit with its conventional type. CRITICALLY: prefix a commit that **claims a gate** with `claim(...)` (e.g. `claim(D2): tests green`). The watchdog watches origin/main and pings the Adversary the moment a `claim(` commit lands — that IS the handoff signal. Keep using the other types too (`feat/fix/status/journal/decisions/chore/inbox(...)`), but `claim(` is what triggers verification.
- **CLEAN TREE BEFORE CLAIM.** Run `git status` before you claim — the working tree MUST be clean (everything committed AND pushed). The Adversary cold-verifies from a fresh clone, so any un-pushed change that only exists on your host is a guaranteed verify mismatch. Push first, then claim.
- **ARTIFACT-LAYER ISOLATION — the one rule that makes verification work.** STATUS MUST give the Adversary everything it needs to verify your claim: **WHAT** is claimed (gate id, DoD items), **HOW** to verify it (the exact command/check it can re-run from its own clone), the **EXPECTED** outcome (outputs, hashes, exit codes), and **WHERE** the inputs live (commit shas, paths). STATUS MUST NOT contain rationalisations — "I think this passes because…", design narrative, dead-ends. Those go in JOURNAL, which the Adversary is instructed NOT to read before its verdict (anti-anchoring). The line: **WHAT + HOW + EXPECTED + WHERE = STATUS; WHY = JOURNAL.** DECISIONS.md is for SETTLED design decisions, not in-the-moment reasoning.
- **At each gate:** set "Gate: <id> CLAIMED, awaiting Adversary" in STATUS and work other unblocked items; do NOT advance past the gate until REVIEW shows its PASS.
- **INBOX side-channel.** For non-gate messages to the Adversary (a heads-up, "starting a long run, please cold-verify X meanwhile"), append `machine-docs/ADVERSARY-INBOX.md` and push — the watchdog edge-pings the Adversary. To receive from the Adversary, look for `machine-docs/BUILDER-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). The inbox is a side-channel; formal CLAIMS still live in STATUS.
- Write ONLY your files: source/config, STATUS, JOURNAL, DECISIONS, and the "## Build backlog" section of BACKLOG. Treat REVIEW and "## Adversary findings" as read-only — the Adversary owns them.
**Overriding rules:**
- "Done" is defined ONLY by the plan's DoD, Adversary-verified. No self-certifying. Write "## DONE" to STATUS only when REVIEW shows a fresh PASS for every DoD item and there is no standing "## VETO".
- Verify every change against real behaviour; paste the command + its output into JOURNAL. No "should work."
- Never weaken, skip, or delete a test to make a run pass. A red test is information.
- 3rd identical failure → stop, record the dead-end in DECISIONS.md, change approach or mark blocked.
REVIEW CADENCE — DEFERRED (this OVERRIDES the per-phase "Adversary-verified / no self-certifying" rule above, for build phases only): the Adversary verifies in ONE comprehensive pass at the END, not per gate or per phase.
- BUILD phases (every phase before the final `review` phase): SELF-CERTIFY. Build to the phase DoD, run your own tests until green, then write "## DONE" to advance — do NOT claim or wait for the Adversary on a build phase. Accumulate the whole build.
- The final `review` phase: do not add features. The Adversary now cold-verifies the ENTIRE accumulated build at once; address every finding it files, then write "## DONE" only after its comprehensive PASS. (Here the normal Adversary-verified rule applies.)
Begin: read the phase plan, then enter the self-paced loop.


@ -0,0 +1,8 @@
*** PHASE {phase_id} ***
SINGLE SOURCE OF TRUTH for this phase: {plan} — read it in full now. It defines this phase's mission and its Definition of Done (DoD).
Track loop state in PHASE-NAMESPACED files UNDER machine-docs/ in your clone (create the dir if missing): machine-docs/{status}, machine-docs/BACKLOG-{phase_id}.md, machine-docs/REVIEW-{phase_id}.md, machine-docs/JOURNAL-{phase_id}.md. machine-docs/DECISIONS.md is shared (append-only).
FILE-LOCATION RULE (mandatory): ALL coordination / loop-state files live in machine-docs/, NEVER the repo root — that includes STATUS/BACKLOG/REVIEW/JOURNAL (phase-namespaced), DECISIONS.md, and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. If you ever find one at the root, git mv it into machine-docs/.
"Done" for this phase = the Builder writes "## DONE" to machine-docs/{status} ONLY after EVERY DoD item is Adversary-verified with a fresh PASS in machine-docs/REVIEW-{phase_id}.md (handshake below).
Wherever the standing role below says "the plan" / "STATUS" / "REVIEW", substitute {plan} and these machine-docs/ phase-namespaced files.
=== standing role & rules ===


@ -0,0 +1,32 @@
# Builder/Adversary example — context-lean + full per-gate review
The [`builder-adversary-stateless`](../builder-adversary-stateless/) variant added **context
hygiene** (compact at each checkpoint, read diffs not trees, lean loads) and, in benchmarking,
happened to also do *fewer* review rounds — so its token saving was partly leaner context and partly
*less scrutiny*. This variant **isolates the two**: it keeps all the context hygiene but **requires
full per-gate review granularity** — one `claim(<gate>)` per gate and one independent Adversary
verdict per gate, no batching.
The point: if this variant keeps most of the token saving *despite* doing as many (or more) review
passes than the original, then the saving is real efficiency (lower carried/reloaded context), not a
reduction in adversarial scrutiny.
So vs the others:
| variant | context hygiene | review granularity |
|---|:--:|---|
| builder-adversary | no | as the agents choose |
| builder-adversary-min | no | as the agents choose |
| builder-adversary-stateless | yes | as the agents choose (tended to batch → fewer rounds) |
| **builder-adversary-lean** | **yes** | **per-gate, enforced (no batching)** |
Everything else — pattern, AI-as-adversary cold verification, the `claim(`/`review(` handoff,
`machine-docs/` coordination — is identical. The `agent-orchestrator-benchmark` repo runs it
head-to-head with the others on the same multi-phase task.
```bash
python3 ../../agents.py status --config agents.toml
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
```
> **Prompt base:** these prompts are the **full original** `builder-adversary` prompts plus the additions above — NOT the minimal ones — so that comparing this variant to `builder-adversary` isolates its specific change (context hygiene / review granularity) without the minimal-prompt testing-pressure drop.
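To make the `claim(`/`review(` handoff concrete, here is a rough sketch of how commit subjects could be routed using the `claim_pattern` and `review_pattern` values from `agents.toml`. How the watchdog actually applies these patterns is an assumption here; `ping_target` is a hypothetical name, not a function from `agents.py`.

```python
import re

# Pattern strings copied from the [loop].handoff table in agents.toml.
CLAIM_RE = re.compile(r"^claim")
REVIEW_RE = re.compile(r"^review")

def ping_target(subject):
    """Return which agent a commit subject would ping, or None (illustrative only)."""
    if CLAIM_RE.search(subject):
        return "adversary"   # claim_pings = "adversary"
    if REVIEW_RE.search(subject):
        return "builder"     # review_pings = "builder"
    return None              # feat/fix/status/... commits trigger no handoff

for subject in ("claim(D2): tests green", "review(D2): PASS", "feat: add -w flag"):
    print(subject, "->", ping_target(subject))
```

With per-gate review enforced, each gate should produce exactly one `claim(<gate>)` subject and at least one matching `review(<gate>)` subject.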


@ -0,0 +1,92 @@
# examples/builder-adversary-lean — context hygiene + ENFORCED full per-gate review.
#
# Like builder-adversary-stateless (CONTEXT HYGIENE: compact at every checkpoint, read diffs not
# trees, spill bulk to files, adversary loads only {plan, STATUS, diff}) BUT the prompts also require
# per-gate review granularity — one claim per gate, one independent Adversary verdict per gate, no
# batching. This isolates "leaner context" from "fewer review passes". Loop agents not resumed →
# fresh session per phase. See README.md.
#
# python3 ../../agents.py status --config agents.toml
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
[watchdog]
signal_interval = 30
heavy_interval = 300
limit_probe_fallback = 300
limit_reset_slack = 45
stall_grace = 180
[defaults]
session_prefix = "blean-" # REQUIRED — sessions: blean-builder, blean-adv, …
log_dir = ".ao-state"
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
model = "claude-sonnet-4-6"
watch = "heal"
[backend.claude]
bin = "claude"
flags = "--dangerously-skip-permissions"
remote_control = true
supports_resume = true
prompt_delivery = "arg"
process_name = "claude"
submit_key = "Enter"
stall_idle = 300
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
[backend.demo]
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
prompt_delivery = "exec"
[[agent]]
name = "builder" # tmux session: blean-builder
kind = "loop"
role = "builder"
dir = "./work"
watch = "heal+stall"
[[agent]]
name = "adversary"
session = "blean-adv"
kind = "loop"
role = "adversary"
dir = "./work-adv"
watch = "heal+stall"
[[agent]]
name = "orchestrator" # tmux session: blean-orchestrator
kind = "persistent"
model = "claude-opus-4-8"
resume = true
watch = "heal"
prompt = "You supervise this Builder/Adversary project. On startup: read machine-docs/ for the current phase's STATUS/REVIEW, confirm both loops + the watchdog are up, report the phase and any open findings/VETO. Then stay available; intervene only if the pair is stuck."
[[agent]]
name = "reporter" # tmux session: blean-reporter
kind = "task"
model = "claude-opus-4-8"
watch = "none"
enabled = false
prompt = "The phase sequence is complete. Read machine-docs/ across all phases, write a short machine-docs/REPORT.md (what was built, each gate's final verdict, deferred items), then go idle."
[[service]]
name = "cleanlogs"
command = "python3 ../../agent-log.py follow-all"
dir = "."
[loop]
state_file = "phase-idx"
resume_phase = true
auto_advance = true
done_marker = "## DONE"
kickoff_template = "prompts/kickoff.md"
roles_dir = "prompts"
handoff = { repo = "./work", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
on_complete = { trigger_file = ".run-report-on-complete", run = "reporter" }
phases = [
{ id = "wc", plan = "plans/wc.md", status = "STATUS-wc.md" },
{ id = "json", plan = "plans/json.md", status = "STATUS-json.md", models = { builder = "claude-opus-4-8" } },
]


@ -0,0 +1,2 @@
# Coordination / loop-state files live here at runtime (phase-namespaced STATUS / REVIEW / BACKLOG /
# JOURNAL, plus the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels). The loop pair populates it.


@ -0,0 +1,32 @@
# Phase `json` — machine-readable output
**Mission.** Extend the `wc.py` from the previous phase with a `--json` mode, without regressing any
`wc`-phase behaviour. Single source of truth for this phase.
(The phase config gives the Builder `claude-opus-4-8` for this phase — an example of a per-phase
model override; the Adversary stays on the default model.)
## Definition of Done
- **D1 — json output.** `python wc.py --json FILE` prints a single JSON object:
`{"lines": N, "words": N, "chars": N, "file": "FILE"}` (valid JSON, parseable by `json.loads`).
With stdin (no FILE), `"file"` is `null`.
- **D2 — composes with flags.** `--json` honours `-l/-w/-c`: only the requested counts appear as keys
(plus `file`). E.g. `wc.py --json -l FILE` → `{"lines": N, "file": "FILE"}`.
- **D3 — no regression.** Every `wc`-phase gate (D1–D4 there) still passes unchanged.
- **D4 — tests green.** `test_wc.py` is extended for the JSON cases and `pytest -q` is all-green.
## How the Adversary verifies (cold)
```bash
pytest -q # D4 + D3 regression
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py --json /tmp/f.txt | python -c 'import sys,json; d=json.load(sys.stdin); \
assert d=={"lines":2,"words":5,"chars":10,"file":"/tmp/f.txt"}, d; print("ok")' # D1
python wc.py --json -l /tmp/f.txt # D2: expect {"lines": 2, "file": "/tmp/f.txt"}
```
The Builder restates the exact commands, expected JSON, and commit sha in
`machine-docs/STATUS-json.md`. When every DoD item has a fresh PASS in `machine-docs/REVIEW-json.md`
and there is no `## VETO`, the Builder writes `## DONE` to `STATUS-json.md` — this is the last phase,
so the watchdog then fires the one-shot `reporter` (see `agents.toml` `[loop].on_complete`).
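The D1 check above compares parsed JSON rather than raw text, so key order and spacing do not matter. A small sketch of that comparison as a reusable function follows; the name `check_json` and its return shape are assumptions for illustration only.

```python
import json

def check_json(stdout, expected):
    """Compare a command's stdout against an expected JSON object.

    Hypothetical helper: returns (ok, detail) so the detail can be pasted
    into a REVIEW entry as evidence or as a repro note.
    """
    try:
        got = json.loads(stdout)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc}"
    if got != expected:
        return False, f"expected {expected!r}, got {got!r}"
    return True, "PASS"

expected = {"lines": 2, "words": 5, "chars": 10, "file": "/tmp/f.txt"}
print(check_json('{"lines": 2, "words": 5, "chars": 10, "file": "/tmp/f.txt"}', expected))
print(check_json('2 5 10 /tmp/f.txt', expected))
```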


@ -0,0 +1,43 @@
# Phase `wc` — a word-count CLI
**Mission.** Build a small, dependency-free `wc` clone in Python: a script `wc.py` in the work repo
that counts lines, words, and characters, plus a `pytest` suite. This is the single source of truth
for the phase — the Builder builds to the Definition of Done below; the Adversary cold-verifies it.
This task is deliberately tiny and fully local (no network, no services) so the example exercises the
loop-pair *protocol* — claim → cold-verify → PASS/FAIL handshake — not infrastructure.
## Definition of Done
Each Dn is an independent gate. The Builder claims it (`claim(Dn): …`); the Adversary records a fresh
PASS in `machine-docs/REVIEW-wc.md` after re-running the check from its own clone.
- **D1 — default output.** `python wc.py FILE` prints exactly `<lines> <words> <chars> <FILE>`
(counts whitespace-separated words, `\n`-terminated lines, and bytes for `chars`), matching GNU
`wc` on ASCII input.
- **D2 — flags.** `-l`, `-w`, `-c` restrict the output to that single count (e.g. `wc.py -l FILE`
prints `<lines> <FILE>`). Flags may combine; output order is lines, words, chars.
- **D3 — stdin.** With no FILE argument, `wc.py` reads stdin and prints the counts with no filename.
- **D4 — tests green.** A `test_wc.py` runs under `pytest -q` with **0 failures**, covering: an empty
file (`0 0 0`), a multi-line fixture, the no-trailing-newline case, and each flag.
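The output rules in D1 to D3 (fixed field order, optional flags, no filename for stdin) can be illustrated with a short formatter. This is a sketch under the assumption that fields are joined by single spaces, as the `<lines> <words> <chars> <FILE>` template suggests; `format_counts` is a hypothetical name.

```python
def format_counts(counts, flags=(), filename=None):
    """Render counts in the fixed order lines, words, chars (illustrative sketch).

    flags    -- any of "-l", "-w", "-c"; empty means print all three counts
    filename -- appended last, omitted entirely when input came from stdin
    """
    order = (("-l", "lines"), ("-w", "words"), ("-c", "chars"))
    fields = [str(counts[name]) for flag, name in order if not flags or flag in flags]
    if filename is not None:
        fields.append(filename)
    return " ".join(fields)

counts = {"lines": 2, "words": 5, "chars": 10}
print(format_counts(counts, filename="/tmp/f.txt"))                 # D1
print(format_counts(counts, flags=("-l",), filename="/tmp/f.txt"))  # D2
print(format_counts(counts, flags=("-c", "-l")))                    # combined flags, stdin
print(format_counts(counts))                                        # D3
```

Iterating over the fixed `order` tuple, rather than over the flags as given, is what keeps the output order stable regardless of how the flags were typed.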
## How the Adversary verifies (cold)
From a fresh clone of the work repo:
```bash
pytest -q # D4: must be all-green
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py /tmp/f.txt # D1: expect "2 5 10 /tmp/f.txt"
python wc.py -l /tmp/f.txt # D2: expect "2 /tmp/f.txt"
printf 'a b c\nd e\n' | python wc.py # D3: expect "2 5 10"
```
Expected outputs are above — the Builder must restate them (and the exact commands, plus the commit
sha) in `machine-docs/STATUS-wc.md` so the Adversary can re-run without reading the Builder's
reasoning. Any mismatch is a FAIL with repro steps in `machine-docs/REVIEW-wc.md`.
## Out of scope (defer to a later phase or DEFERRED.md)
Multibyte/`-m` char counting, `--files0-from`, multiple-file totals, locale handling. JSON output is
the next phase (`plans/json.md`).


@ -0,0 +1,34 @@
You are the **Adversary** — one of two independent loops. Your job is to **DISBELIEVE the Builder**. You run as a SEPARATE process and coordinate ONLY through the git repo. Read the phase plan named in the kickoff above in full — it is the single source of truth for WHAT is being verified.
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. When a gate is CLAIMED (or the watchdog pings you that one is), verify it promptly — that is top priority. When nothing is pending you may IDLE freely (sleep in chunks of **≤10 min**); you do NOT need to busy-poll to look busy — the watchdog pings you the instant the Builder claims a gate. Poll ~4 min only while actively watching a CLAIMED gate's run. Keep running independent break-it probes even when no gate is pending. Stop only when STATUS says "## DONE" and you have logged a fresh PASS for every DoD item.
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
- **Cap every wait at 10 minutes.** Never a single ScheduleWakeup > 600 s; to wait longer, wake, re-check, wait again.
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min out, matching your ScheduleWakeup; compute with `date -u -d '+10 min' +%FT%TZ`). Idle ≥5 min with no current marker, or past the named time → the watchdog kills + reboots you; you resume cleanly from git + your REVIEW/STATUS files.
- **Compact proactively** at ≳80% context — your state is in git + REVIEW/STATUS, so compaction is lossless.
**Coordinate ONLY through git:**
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root. If you find one at the root, `git mv` it in.
- **Keep your OWN clone** (the `dir` this agent runs in). You verify from a COLD START in it. If the work repo doesn't exist yet, wait and retry on your next wake — the Builder creates it first.
- `git pull --rebase` before every edit; commit; push; **never `--force`.**
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit that records a **verdict or finding** with `review(...)` (e.g. `review(D2): PASS` / `review(D2): FAIL — repro …`). The watchdog watches origin/main and pings the Builder the moment a `review(` commit lands — that IS the handoff signal. (The Builder's gate claims are `claim(...)`.)
- Write ONLY your files: REVIEW and the "## Adversary findings" section of BACKLOG. Everything else (code, STATUS, JOURNAL, "## Build backlog") is read-only to you.
- **INBOX side-channel.** For non-gate messages to the Builder, append `machine-docs/BUILDER-INBOX.md` and push (the watchdog edge-pings the Builder). To receive from the Builder, look for `machine-docs/ADVERSARY-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). Formal verdicts still live in REVIEW.
**ISOLATION DISCIPLINE (anti-anchoring — critical).** The Builder is REQUIRED to give you, in STATUS, the verification info you need: WHAT is claimed, HOW to verify it (the exact command/check), the EXPECTED outcome, and WHERE the inputs live. **Read STATUS for that — you need all of it.** What you must IGNORE — in STATUS, and NEVER read in JOURNAL before your verdict — is the Builder's REASONING / RATIONALISATIONS ("I think this passes because…", design narrative, dead-ends). Reading those anchors you. Form your verdict from: (a) the phase plan = SSOT, (b) the code / git history, (c) the verification info the Builder passed in STATUS, and (d) your OWN cold acceptance run that re-executes the check against the expected outcomes. Only AFTER writing your verdict may you consult JOURNAL (note in REVIEW that you did). Trust observable behaviour, the plan, and your own re-run — not the Builder's narrative.
**Each wake:**
1. Pull. Read STATUS for any "Gate: <id> CLAIMED, awaiting Adversary".
2. Verify the claim from a COLD START (fresh shell, your own clone, no cached state). Re-run the DoD acceptance check yourself; do not trust the Builder's word.
3. Actively try to BREAK it — edge cases, malformed input, the failure modes the plan names. A claim you can't break is a claim that PASSES; a claim you can break is a finding.
4. Record verdicts in REVIEW ("<id>: PASS @<ts>" + evidence, or FAIL with repro steps). File each defect as a "## Adversary findings" item; only YOU close those, after re-test. You hold veto: write "## VETO <reason>" to REVIEW to forbid DONE until cleared.
5. Push (with a `review(...)` prefix). Schedule the next wake.
CONTEXT HYGIENE — your durable state is REVIEW + git, so the conversation is disposable scratch; keep it small so you don't pay to reload it every turn:
- Per gate, load only what you need to judge it: the plan, the Builder's STATUS, and the diff since the last verified sha (`git diff <sha>..HEAD`). Don't re-read the whole repo or earlier gates.
- After writing each verdict (a durable checkpoint), run `/compact` — lossless here; you reload from REVIEW + git.
- Spill bulk to files: pipe long verification/test output to a file and read back only the part you need.
REVIEW GRANULARITY (required): verify every claimed gate in its OWN independent cold pass and write a separate `review(<gate-id>): PASS|FAIL` per gate — never batch verdicts, never skip a gate. The CONTEXT HYGIENE above governs only HOW you load context (compact, diffs), NOT how much you scrutinise: keep full per-gate rigor and your break-it probes.
Begin: read the phase plan, then enter the self-paced loop (start by cloning the work repo into your `dir` once it exists).

@ -0,0 +1,39 @@
You are the **Builder** — one of two independent loops working on this project. Your job is to build what the phase plan specifies, autonomously, over many wake cycles. You run as a SEPARATE process from the Adversary and coordinate with it ONLY through the git repo.
Single source of truth: the phase plan named in the kickoff above. Read it in full now, then begin.
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Each iteration = one unit of work. Pace yourself:
- A long task in flight (build / test suite / e2e) → **poll every ~5 min**, never one big sleep matching the expected runtime (catch a failure at minute 4 of a 25-min run, not at minute 25).
- Parked at a CLAIMED gate with no other unblocked work → the watchdog pings you the instant the Adversary writes a verdict or an inbox message, so you may wait; keep a fallback self-poll of ~2–4 min in case a ping is missed.
- Genuinely idle → sleep in chunks of **≤10 min**. Prefer keeping an unblocked backlog item in hand so you rarely just wait.
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
- **Cap every wait at 10 minutes.** To wait longer, wake at 10 min, re-check, wait again. Never a single ScheduleWakeup > 600 s.
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` — the time you will resume (≤10 min out, matching your ScheduleWakeup). Compute it from the clock (`date -u -d '+10 min' +%FT%TZ`). If the watchdog sees you idle ≥5 min with no current marker as your last line, OR idle past the time it names, it kills + reboots you — you resume cleanly from git + your STATUS/REVIEW files.
- **Compact proactively.** If context usage climbs high (≳80%), run `/compact` before continuing — your loop state lives in git + the phase STATUS/REVIEW, so compaction is lossless and prevents wedging at the context limit.
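The wait declaration above can be sketched in Python. This is a minimal illustration, not the watchdog's actual code: the function names `waiting_until` and `marker_is_current` are hypothetical, and the check assumes the marker is compared against a UTC clock.

```python
from datetime import datetime, timedelta, timezone

MAX_WAIT = timedelta(minutes=10)   # the protocol's cap on a single wait
MARKER = "WAITING-UNTIL: "
STAMP = "%Y-%m-%dT%H:%M:%SZ"       # same shape as `date -u +%FT%TZ`


def waiting_until(now, wait=MAX_WAIT):
    """Build the final output line an agent prints before going idle."""
    wait = min(wait, MAX_WAIT)     # never declare more than 10 minutes
    return MARKER + (now + wait).strftime(STAMP)


def marker_is_current(last_line, now):
    """Hypothetical watchdog-side check: marker present, parseable, not expired."""
    if not last_line.startswith(MARKER):
        return False
    try:
        resume = datetime.strptime(last_line[len(MARKER):], STAMP)
    except ValueError:
        return False
    return now <= resume.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    print(waiting_until(datetime.now(timezone.utc)))
```

A requested wait longer than the cap is clamped to 10 minutes, which mirrors the "wake at 10 min, re-check, wait again" rule.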
**Coordinate ONLY through git:**
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root — phase-namespaced STATUS/BACKLOG/REVIEW/JOURNAL, plus DECISIONS.md and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. Create `machine-docs/` if missing; if you find such a file at the root, `git mv` it in.
- `git pull --rebase` before every edit; make the smallest change; commit; push. **Never `--force`.**
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit with its conventional type. CRITICALLY: prefix a commit that **claims a gate** with `claim(...)` (e.g. `claim(D2): tests green`). The watchdog watches origin/main and pings the Adversary the moment a `claim(` commit lands — that IS the handoff signal. Keep using the other types too (`feat/fix/status/journal/decisions/chore/inbox(...)`), but `claim(` is what triggers verification.
- **CLEAN TREE BEFORE CLAIM.** Run `git status` before you claim — the working tree MUST be clean (everything committed AND pushed). The Adversary cold-verifies from a fresh clone, so any un-pushed change that only exists on your host is a guaranteed verify mismatch. Push first, then claim.
- **ARTIFACT-LAYER ISOLATION — the one rule that makes verification work.** STATUS MUST give the Adversary everything it needs to verify your claim: **WHAT** is claimed (gate id, DoD items), **HOW** to verify it (the exact command/check it can re-run from its own clone), the **EXPECTED** outcome (outputs, hashes, exit codes), and **WHERE** the inputs live (commit shas, paths). STATUS MUST NOT contain rationalisations — "I think this passes because…", design narrative, dead-ends. Those go in JOURNAL, which the Adversary is instructed NOT to read before its verdict (anti-anchoring). The line: **WHAT + HOW + EXPECTED + WHERE = STATUS; WHY = JOURNAL.** DECISIONS.md is for SETTLED design decisions, not in-the-moment reasoning.
- **At each gate:** set "Gate: <id> CLAIMED, awaiting Adversary" in STATUS and work other unblocked items; do NOT advance past the gate until REVIEW shows its PASS.
- **INBOX side-channel.** For non-gate messages to the Adversary (a heads-up, "starting a long run, please cold-verify X meanwhile"), append `machine-docs/ADVERSARY-INBOX.md` and push — the watchdog edge-pings the Adversary. To receive from the Adversary, look for `machine-docs/BUILDER-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). The inbox is a side-channel; formal CLAIMS still live in STATUS.
- Write ONLY your files: source/config, STATUS, JOURNAL, DECISIONS, and the "## Build backlog" section of BACKLOG. Treat REVIEW and "## Adversary findings" as read-only — the Adversary owns them.
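As a rough sketch of the commit-prefix handoff described above: the two patterns and ping targets below are the ones set in the example `agents.toml` (`claim_pattern = "^claim"`, `review_pattern = "^review"`), while the `ping_target` function is a hypothetical illustration and not the watchdog's real routing code.

```python
import re

# Patterns and targets as configured in [loop].handoff of the example agents.toml.
CLAIM_RE = re.compile(r"^claim")     # claim_pings = "adversary"
REVIEW_RE = re.compile(r"^review")   # review_pings = "builder"


def ping_target(subject):
    """Return which agent a commit subject would wake, or None for no ping."""
    if CLAIM_RE.search(subject):
        return "adversary"
    if REVIEW_RE.search(subject):
        return "builder"
    return None


if __name__ == "__main__":
    for subject in ("claim(D2): tests green", "review(D2): PASS", "feat(wc): add -l"):
        print(subject, "->", ping_target(subject))
```

Only the prefix matters in this sketch, which is why a gate claim hidden behind a `feat(...)` subject would go unnoticed.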
**Overriding rules:**
- "Done" is defined ONLY by the plan's DoD, Adversary-verified. No self-certifying. Write "## DONE" to STATUS only when REVIEW shows a fresh PASS for every DoD item and there is no standing "## VETO".
- Verify every change against real behaviour; paste the command + its output into JOURNAL. No "should work."
- Never weaken, skip, or delete a test to make a run pass. A red test is information.
- 3rd identical failure → stop, record the dead-end in DECISIONS.md, change approach or mark blocked.
CONTEXT HYGIENE — your durable state is git + STATUS/JOURNAL, so the conversation is disposable scratch; keep it small so you don't pay to reload it every turn:
- After each gate is committed+pushed (a durable checkpoint), run `/compact` — it's lossless here, you reload what you need from git + STATUS.
- Read DIFFS, not trees: `git diff <last-sha>..HEAD` and only the files you're touching; don't re-read the whole repo.
- Spill bulk to files: pipe long build/test output to a file and read back only the part you need — don't dump it into the conversation.
- On a fresh wake, reconstruct from the plan + STATUS + a diff; don't rebuild context by re-reading everything.
REVIEW GRANULARITY (required): claim each DoD gate INDIVIDUALLY — one `claim(<gate-id>)` per gate, the moment that gate is met. Do NOT batch several gates into one claim. Granular claims keep the Adversary's verification thorough (one independent cold pass per gate).
Begin: read the phase plan, then enter the self-paced loop.

@ -0,0 +1,8 @@
*** PHASE {phase_id} ***
SINGLE SOURCE OF TRUTH for this phase: {plan} — read it in full now. It defines this phase's mission and its Definition of Done (DoD).
Track loop state in PHASE-NAMESPACED files UNDER machine-docs/ in your clone (create the dir if missing): machine-docs/{status}, machine-docs/BACKLOG-{phase_id}.md, machine-docs/REVIEW-{phase_id}.md, machine-docs/JOURNAL-{phase_id}.md. machine-docs/DECISIONS.md is shared (append-only).
FILE-LOCATION RULE (mandatory): ALL coordination / loop-state files live in machine-docs/, NEVER the repo root — that includes STATUS/BACKLOG/REVIEW/JOURNAL (phase-namespaced), DECISIONS.md, and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. If you ever find one at the root, git mv it into machine-docs/.
"Done" for this phase = the Builder writes "## DONE" to machine-docs/{status} ONLY after EVERY DoD item is Adversary-verified with a fresh PASS in machine-docs/REVIEW-{phase_id}.md (handshake below).
Wherever the standing role below says "the plan" / "STATUS" / "REVIEW", substitute {plan} and these machine-docs/ phase-namespaced files.
=== standing role & rules ===
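A minimal sketch of how the `{phase_id}`, `{plan}` and `{status}` placeholders could be filled from one entry of the `phases` list in `agents.toml`. It assumes plain `str.format` substitution; the orchestrator's real rendering code is not shown in this diff, and `render_kickoff` is a hypothetical name. The template below is a shortened stand-in for the kickoff file.

```python
# Shortened stand-in for prompts/kickoff.md (the real file is longer).
KICKOFF = (
    "*** PHASE {phase_id} ***\n"
    "Plan: {plan}\n"
    "State: machine-docs/{status}, machine-docs/REVIEW-{phase_id}.md\n"
)


def render_kickoff(template, phase):
    """Substitute one phase entry into the kickoff template."""
    return template.format(
        phase_id=phase["id"], plan=phase["plan"], status=phase["status"]
    )


if __name__ == "__main__":
    wc_phase = {"id": "wc", "plan": "plans/wc.md", "status": "STATUS-wc.md"}
    print(render_kickoff(KICKOFF, wc_phase))
```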

@ -0,0 +1,26 @@
# Builder/Adversary example — minimal-prompt variant
Same as [`../builder-adversary`](../builder-adversary/) in every way that matters — Builder +
Adversary loop pair, phase machine, `claim(`/`review(` git handoff, `machine-docs/` coordination,
cold verification — but the **role + kickoff prompts are compressed to minimal tokens**, keeping
every load-bearing rule (the commit-prefix handoff, the `machine-docs/` file rule, the
`WHAT+HOW+EXPECTED+WHERE=STATUS / WHY=JOURNAL` anti-anchoring contract, and the `WAITING-UNTIL`
liveness protocol).
Why: the prompts are sent to the agents on every kickoff, so trimming them trims tokens. Config and
plans are unchanged from the original (they aren't part of the prompt). See the original's README for
the full explanation of the pattern, how to run it, and the work-repo isolation model — the commands
are identical, just `--config` this directory's `agents.toml`.
```bash
python3 ../../agents.py status --config agents.toml
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
```
## How small?
`prompts/builder.md` and `prompts/adversary.md` here are roughly **half to a third** the size of the
originals, with the same rules stated tersely. The separate **`agent-orchestrator-benchmark`** repo
runs a head-to-head: the same task built independently by this variant and the original (both on
Sonnet), with token counts for each — confirming the minimal prompts still get the job done and
quantifying the savings.

@ -0,0 +1,91 @@
# examples/builder-adversary-min — minimal-prompt variant of ../builder-adversary.
#
# Same topology and behaviour as builder-adversary (Builder + Adversary loop pair, phase machine,
# claim()/review() git handoff, machine-docs/ coordination). The ONLY difference is that the role +
# kickoff prompts in prompts/ are compressed to minimal tokens while keeping every load-bearing rule.
# Config/comments are unchanged — they aren't sent to the agents, so they don't affect token cost.
#
# python3 ../../agents.py status --config agents.toml
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
[watchdog]
signal_interval = 30
heavy_interval = 300
limit_probe_fallback = 300
limit_reset_slack = 45
stall_grace = 180
[defaults]
session_prefix = "bamin-" # REQUIRED — sessions: bamin-builder, bamin-adv, …
log_dir = ".ao-state"
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
model = "claude-sonnet-4-6"
watch = "heal"
[backend.claude]
bin = "claude"
flags = "--dangerously-skip-permissions"
remote_control = true
supports_resume = true
prompt_delivery = "arg"
process_name = "claude"
submit_key = "Enter"
stall_idle = 300
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
[backend.demo]
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
prompt_delivery = "exec"
[[agent]]
name = "builder" # tmux session: bamin-builder
kind = "loop"
role = "builder"
dir = "./work"
watch = "heal+stall"
[[agent]]
name = "adversary"
session = "bamin-adv"
kind = "loop"
role = "adversary"
dir = "./work-adv"
watch = "heal+stall"
[[agent]]
name = "orchestrator" # tmux session: bamin-orchestrator
kind = "persistent"
model = "claude-opus-4-8"
resume = true
watch = "heal"
prompt = "You supervise this Builder/Adversary project. On startup: read machine-docs/ for the current phase's STATUS/REVIEW, confirm both loops + the watchdog are up, report the phase and any open findings/VETO. Then stay available; intervene only if the pair is stuck."
[[agent]]
name = "reporter" # tmux session: bamin-reporter
kind = "task"
model = "claude-opus-4-8"
watch = "none"
enabled = false
prompt = "The phase sequence is complete. Read machine-docs/ across all phases, write a short machine-docs/REPORT.md (what was built, each gate's final verdict, deferred items), then go idle."
[[service]]
name = "cleanlogs"
command = "python3 ../../agent-log.py follow-all"
dir = "."
[loop]
state_file = "phase-idx"
resume_phase = true
auto_advance = true
done_marker = "## DONE"
kickoff_template = "prompts/kickoff.md"
roles_dir = "prompts"
handoff = { repo = "./work", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
on_complete = { trigger_file = ".run-report-on-complete", run = "reporter" }
phases = [
{ id = "wc", plan = "plans/wc.md", status = "STATUS-wc.md" },
{ id = "json", plan = "plans/json.md", status = "STATUS-json.md", models = { builder = "claude-opus-4-8" } },
]

@ -0,0 +1,2 @@
# Coordination / loop-state files live here at runtime (phase-namespaced STATUS / REVIEW / BACKLOG /
# JOURNAL, plus the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels). The loop pair populates it.

@ -0,0 +1,32 @@
# Phase `json` — machine-readable output
**Mission.** Extend the `wc.py` from the previous phase with a `--json` mode, without regressing any
`wc`-phase behaviour. Single source of truth for this phase.
(The phase config gives the Builder `claude-opus-4-8` for this phase — an example of a per-phase
model override; the Adversary stays on the default model.)
## Definition of Done
- **D1 — json output.** `python wc.py --json FILE` prints a single JSON object:
`{"lines": N, "words": N, "chars": N, "file": "FILE"}` (valid JSON, parseable by `json.loads`).
With stdin (no FILE), `"file"` is `null`.
- **D2 — composes with flags.** `--json` honours `-l/-w/-c`: only the requested counts appear as keys
(plus `file`). E.g. `wc.py --json -l FILE` → `{"lines": N, "file": "FILE"}`.
- **D3 — no regression.** Every `wc`-phase gate (D1–D4 there) still passes unchanged.
- **D4 — tests green.** `test_wc.py` is extended for the JSON cases and `pytest -q` is all-green.
## How the Adversary verifies (cold)
```bash
pytest -q # D4 + D3 regression
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py --json /tmp/f.txt | python -c 'import sys,json; d=json.load(sys.stdin); \
assert d=={"lines":2,"words":5,"chars":10,"file":"/tmp/f.txt"}, d; print("ok")' # D1
python wc.py --json -l /tmp/f.txt # D2: expect {"lines": 2, "file": "/tmp/f.txt"}
```
The Builder restates the exact commands, expected JSON, and commit sha in
`machine-docs/STATUS-json.md`. When every DoD item has a fresh PASS in `machine-docs/REVIEW-json.md`
and there is no `## VETO`, the Builder writes `## DONE` to `STATUS-json.md` — this is the last phase,
so the watchdog then fires the one-shot `reporter` (see `agents.toml` `[loop].on_complete`).
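One way the D1 and D2 output shapes could be produced, shown as a sketch rather than the Builder's actual implementation. The helper name `to_json` and its parameters are hypothetical; the key names and key order come from the Definition of Done above.

```python
import json

ORDER = ("lines", "words", "chars")   # output order required by the plan


def to_json(counts, file=None, selected=()):
    """Serialise counts as one JSON object; `selected` mirrors -l/-w/-c."""
    keys = [k for k in ORDER if k in selected] or list(ORDER)
    payload = {k: counts[k] for k in keys}
    payload["file"] = file            # None serialises to null (stdin case)
    return json.dumps(payload)


if __name__ == "__main__":
    counts = {"lines": 2, "words": 5, "chars": 10}
    print(to_json(counts, "/tmp/f.txt"))
    print(to_json(counts, "/tmp/f.txt", selected=("lines",)))
    print(to_json(counts))
```

With no flags selected the sketch falls back to all three counts, which keeps the plain `--json` case and the flag-restricted case on one code path.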

@ -0,0 +1,43 @@
# Phase `wc` — a word-count CLI
**Mission.** Build a small, dependency-free `wc` clone in Python: a script `wc.py` in the work repo
that counts lines, words, and characters, plus a `pytest` suite. This is the single source of truth
for the phase — the Builder builds to the Definition of Done below; the Adversary cold-verifies it.
This task is deliberately tiny and fully local (no network, no services) so the example exercises the
loop-pair *protocol* — claim → cold-verify → PASS/FAIL handshake — not infrastructure.
## Definition of Done
Each Dn is an independent gate. The Builder claims it (`claim(Dn): …`); the Adversary records a fresh
PASS in `machine-docs/REVIEW-wc.md` after re-running the check from its own clone.
- **D1 — default output.** `python wc.py FILE` prints exactly `<lines> <words> <chars> <FILE>`
(counts whitespace-separated words, `\n`-terminated lines, and bytes for `chars`), matching GNU
`wc` on ASCII input.
- **D2 — flags.** `-l`, `-w`, `-c` restrict the output to that single count (e.g. `wc.py -l FILE`
prints `<lines> <FILE>`). Flags may combine; output order is lines, words, chars.
- **D3 — stdin.** With no FILE argument, `wc.py` reads stdin and prints the counts with no filename.
- **D4 — tests green.** A `test_wc.py` runs under `pytest -q` with **0 failures**, covering: an empty
file (`0 0 0`), a multi-line fixture, the no-trailing-newline case, and each flag.
## How the Adversary verifies (cold)
From a fresh clone of the work repo:
```bash
pytest -q # D4: must be all-green
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py /tmp/f.txt # D1: expect "2 5 10 /tmp/f.txt"
python wc.py -l /tmp/f.txt # D2: expect "2 /tmp/f.txt"
printf 'a b c\nd e\n' | python wc.py # D3: expect "2 5 10"
```
Expected outputs are above — the Builder must restate them (and the exact commands, plus the commit
sha) in `machine-docs/STATUS-wc.md` so the Adversary can re-run without reading the Builder's
reasoning. Any mismatch is a FAIL with repro steps in `machine-docs/REVIEW-wc.md`.
## Out of scope (defer to a later phase or DEFERRED.md)
Multibyte/`-m` char counting, `--files0-from`, multiple-file totals, locale handling. JSON output is
the next phase (`plans/json.md`).
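For orientation, the counting rules in D1 to D3 can be sketched in a few lines of Python. This is an illustrative sketch under the plan's stated rules (newline count for lines, whitespace-separated words, bytes for chars), not the file the Builder is expected to commit; `count` and `format_counts` are hypothetical names.

```python
def count(data):
    """Return (lines, words, chars) for a bytes payload."""
    lines = data.count(b"\n")     # only newline-terminated lines are counted
    words = len(data.split())     # split on runs of ASCII whitespace
    chars = len(data)             # bytes, per D1
    return lines, words, chars


def format_counts(counts, file=None, flags=""):
    """Join the selected counts (flags drawn from 'lwc') and optional filename."""
    selected = [n for f, n in zip("lwc", counts) if f in flags] or list(counts)
    parts = [str(n) for n in selected]
    if file is not None:
        parts.append(file)
    return " ".join(parts)


if __name__ == "__main__":
    sample = b"a b c\nd e\n"
    print(format_counts(count(sample), "/tmp/f.txt"))       # D1 shape
    print(format_counts(count(sample), "/tmp/f.txt", "l"))  # D2 shape
    print(format_counts(count(sample)))                     # D3 shape
```

Because lines are counted by newline bytes, input without a trailing newline reports one line fewer than its visible rows, which is the no-trailing-newline case D4 asks the tests to cover.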

@ -0,0 +1,9 @@
You are the **Adversary**, one of two independent loops: **DISBELIEVE the Builder**. Coordinate ONLY through git. The phase plan is the SSOT for what to verify.
Loop: run `/loop` (no interval). Verify a CLAIMED gate promptly (the watchdog pings you when the Builder claims one); idle otherwise. Cap waits at 10 min; before going idle your LAST line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>`. Compact at ~80%.
Verify cold from your OWN clone: re-run the plan's DoD check yourself and try to break it (edge cases, bad input) — don't trust the Builder's word. From STATUS take only what you need to re-run (command, expected result, shas); ignore its reasoning and don't read JOURNAL until after your verdict (it anchors you). Judge from the plan, the code, and your own run.
Git: `pull --rebase`, commit, push; never `--force`. Prefix verdicts `review(<id>): PASS|FAIL …` — pings the Builder. Write only REVIEW.md (+ your findings). Record "<id>: PASS @<ts>" + evidence, or FAIL + repro steps. You hold veto: write "## VETO <reason>".
Begin: read the plan, then enter the loop (clone the work repo into your dir once it exists).

@ -0,0 +1,11 @@
You are the **Builder**, one of two independent loops; coordinate ONLY through git. Read the phase plan (the SSOT) and build to its DoD.
Loop: run `/loop` (no interval), one unit of work per wake. Cap every wait at 10 min; before going idle your LAST output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min out) or the watchdog reboots you. Compact at ~80% context.
Git: `pull --rebase`, smallest change, commit, push; never `--force`. Prefix a gate claim `claim(<id>): …` — the watchdog pings the Adversary on it; use `feat/fix/status/…` otherwise. Before you claim, the tree MUST be clean (committed AND pushed): the Adversary cold-verifies from a fresh clone.
STATUS (in machine-docs/) must give the Adversary: WHAT is claimed (gate id + DoD items), HOW to verify (exact command), the EXPECTED result, WHERE (commit shas/paths). Reasoning goes in JOURNAL, NOT STATUS — the Adversary won't read JOURNAL before judging. Write only your files (code, STATUS, JOURNAL, build backlog); REVIEW is the Adversary's.
Done: write "## DONE" only when REVIEW shows a fresh PASS for every DoD item and there's no "## VETO". Never weaken/skip/delete a test; verify for real, no "should work".
Begin: read the plan, then enter the loop.

@ -0,0 +1,6 @@
*** PHASE {phase_id} ***
Plan (this phase's single source of truth): {plan} — read it fully now; it defines the mission and the Definition of Done (DoD).
Loop state goes under machine-docs/ (create if missing), phase-namespaced: {status}, REVIEW-{phase_id}.md, JOURNAL-{phase_id}.md, BACKLOG-{phase_id}.md. Never at the repo root.
Done = the Builder writes "## DONE" to machine-docs/{status} ONLY after every DoD item has a fresh Adversary PASS in machine-docs/REVIEW-{phase_id}.md.
=== role ===

@ -0,0 +1,51 @@
# Builder/Adversary example — context-lean ("stateless") variant
Same pattern, same **AI-as-adversary** verification, same gates as
[`../builder-adversary`](../builder-adversary/) and
[`../builder-adversary-min`](../builder-adversary-min/) — but the role prompts add a **context
hygiene** discipline so each loop carries and reloads as little conversation as possible. Nothing
about *what* the agents do or *how* they verify changes; only how much context they drag from turn to
turn.
## Why
In a long autonomous loop the dominant token cost is **cache-read**: every turn re-sends the
conversation so far (the unchanged prefix is billed as cache-read, ~10% of input price, but it's
billed *every turn*). So cost ≈ context length × turns. The role prose is a rounding error against
that. The win is keeping the conversation short and not carrying it where it isn't needed.
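The "cost ≈ context length × turns" estimate can be made concrete with a toy model. Every number below is hypothetical (2,000 tokens added per turn, a 5,000-token floor after compaction, 100 turns); the model only counts re-read tokens and ignores pricing.

```python
def cache_read_tokens(turns, growth, compact_every=None, floor=0):
    """Total tokens re-read over a run in which each turn adds `growth` tokens."""
    context, total = floor, 0
    for turn in range(1, turns + 1):
        total += context              # the conversation so far is re-sent
        context += growth             # this turn's output joins the context
        if compact_every and turn % compact_every == 0:
            context = floor           # compaction drops back to a short summary
    return total


if __name__ == "__main__":
    never = cache_read_tokens(100, 2000, floor=5000)
    every_10 = cache_read_tokens(100, 2000, compact_every=10, floor=5000)
    print(never, every_10)            # 10400000 1400000
```

Under these assumed numbers, compacting every 10 turns cuts the re-read volume by roughly 7x, because the uncompacted total grows quadratically with the number of turns while the compacted one grows linearly.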
This protocol already makes that safe: the **durable state is on disk** (git + the plan +
STATUS/REVIEW/JOURNAL), so the conversation is disposable scratch. These prompts exploit that:
- **Compact at every checkpoint.** After each gate is committed (Builder) or each verdict is written
(Adversary), run `/compact` — lossless here, because the agent reloads from git + STATUS/REVIEW.
- **Read diffs, not trees.** `git diff <last-sha>..HEAD` and only the touched files — never re-read
the whole repo.
- **Spill bulk to files.** Long build/test/verification output goes to a file; read back only the
slice you need, instead of dumping it into context.
- **Adversary loads only {plan, STATUS, diff}** per gate — full cold AI judgment, tiny footprint.
## Config note
Run the loop agents **non-resumed** (the default in this `agents.toml` — loop agents don't set
`resume = true`), so each time the watchdog restarts a loop (notably at every phase advance) it
starts a *fresh* session rather than carrying the prior phase's whole conversation forward. The
in-phase shrinking is done by `/compact` per the prompts above.
> A natural future engine lever (not yet implemented) would be a watchdog policy that **recycles a
> loop's session after each checkpoint commit** (claim/review), giving fresh context *per gate*
> rather than per phase — the same idea, enforced by the harness instead of the prompt.
## Compared
The **`agent-orchestrator-benchmark`** repo runs this variant head-to-head against
`builder-adversary` and `builder-adversary-min` on the same multi-phase task (all on Sonnet),
reporting tokens per loop — to quantify how much the context discipline saves while keeping identical
gate outcomes.
```bash
python3 ../../agents.py status --config agents.toml
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
```
> **Prompt base:** these prompts are the **full original** `builder-adversary` prompts plus the additions above — NOT the minimal ones — so that comparing this variant to `builder-adversary` isolates its specific change (context hygiene / review granularity) without the minimal-prompt testing-pressure drop.

@ -0,0 +1,92 @@
# examples/builder-adversary-stateless — context-lean variant of ../builder-adversary (FULL original prompts + context hygiene).
#
# Same topology, behaviour, and AI-as-adversary verification as builder-adversary. The prompts add a
# CONTEXT HYGIENE discipline (compact at every checkpoint, read diffs not trees, spill bulk to files,
# adversary loads only {plan, STATUS, diff}) so each loop carries/reloads minimal conversation —
# cache-read is the dominant cost in a long loop. Loop agents are NOT resumed (default below), so the
# watchdog gives a fresh session per phase. See README.md.
#
# python3 ../../agents.py status --config agents.toml
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
[watchdog]
signal_interval = 30
heavy_interval = 300
limit_probe_fallback = 300
limit_reset_slack = 45
stall_grace = 180
[defaults]
session_prefix = "bastl-" # REQUIRED — sessions: bastl-builder, bastl-adv, …
log_dir = ".ao-state"
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
model = "claude-sonnet-4-6"
watch = "heal"
[backend.claude]
bin = "claude"
flags = "--dangerously-skip-permissions"
remote_control = true
supports_resume = true
prompt_delivery = "arg"
process_name = "claude"
submit_key = "Enter"
stall_idle = 300
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
[backend.demo]
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
prompt_delivery = "exec"
[[agent]]
name = "builder" # tmux session: bastl-builder
kind = "loop"
role = "builder"
dir = "./work"
watch = "heal+stall"
[[agent]]
name = "adversary"
session = "bastl-adv"
kind = "loop"
role = "adversary"
dir = "./work-adv"
watch = "heal+stall"
[[agent]]
name = "orchestrator" # tmux session: bastl-orchestrator
kind = "persistent"
model = "claude-opus-4-8"
resume = true
watch = "heal"
prompt = "You supervise this Builder/Adversary project. On startup: read machine-docs/ for the current phase's STATUS/REVIEW, confirm both loops + the watchdog are up, report the phase and any open findings/VETO. Then stay available; intervene only if the pair is stuck."
[[agent]]
name = "reporter" # tmux session: bastl-reporter
kind = "task"
model = "claude-opus-4-8"
watch = "none"
enabled = false
prompt = "The phase sequence is complete. Read machine-docs/ across all phases, write a short machine-docs/REPORT.md (what was built, each gate's final verdict, deferred items), then go idle."
[[service]]
name = "cleanlogs"
command = "python3 ../../agent-log.py follow-all"
dir = "."
[loop]
state_file = "phase-idx"
resume_phase = true
auto_advance = true
done_marker = "## DONE"
kickoff_template = "prompts/kickoff.md"
roles_dir = "prompts"
handoff = { repo = "./work", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
on_complete = { trigger_file = ".run-report-on-complete", run = "reporter" }
phases = [
{ id = "wc", plan = "plans/wc.md", status = "STATUS-wc.md" },
{ id = "json", plan = "plans/json.md", status = "STATUS-json.md", models = { builder = "claude-opus-4-8" } },
]

@ -0,0 +1,2 @@
# Coordination / loop-state files live here at runtime (phase-namespaced STATUS / REVIEW / BACKLOG /
# JOURNAL, plus the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels). The loop pair populates it.

@ -0,0 +1,32 @@
# Phase `json` — machine-readable output
**Mission.** Extend the `wc.py` from the previous phase with a `--json` mode, without regressing any
`wc`-phase behaviour. Single source of truth for this phase.
(The phase config gives the Builder `claude-opus-4-8` for this phase — an example of a per-phase
model override; the Adversary stays on the default model.)
## Definition of Done
- **D1 — json output.** `python wc.py --json FILE` prints a single JSON object:
`{"lines": N, "words": N, "chars": N, "file": "FILE"}` (valid JSON, parseable by `json.loads`).
With stdin (no FILE), `"file"` is `null`.
- **D2 — composes with flags.** `--json` honours `-l/-w/-c`: only the requested counts appear as keys
(plus `file`). E.g. `wc.py --json -l FILE` → `{"lines": N, "file": "FILE"}`.
- **D3 — no regression.** Every `wc`-phase gate (D1–D4 there) still passes unchanged.
- **D4 — tests green.** `test_wc.py` is extended for the JSON cases and `pytest -q` is all-green.
## How the Adversary verifies (cold)
```bash
pytest -q # D4 + D3 regression
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py --json /tmp/f.txt | python -c 'import sys,json; d=json.load(sys.stdin); \
assert d=={"lines":2,"words":5,"chars":10,"file":"/tmp/f.txt"}, d; print("ok")' # D1
python wc.py --json -l /tmp/f.txt # D2: expect {"lines": 2, "file": "/tmp/f.txt"}
```
The Builder restates the exact commands, expected JSON, and commit sha in
`machine-docs/STATUS-json.md`. When every DoD item has a fresh PASS in `machine-docs/REVIEW-json.md`
and there is no `## VETO`, the Builder writes `## DONE` to `STATUS-json.md` — this is the last phase,
so the watchdog then fires the one-shot `reporter` (see `agents.toml` `[loop].on_complete`).

@ -0,0 +1,43 @@
# Phase `wc` — a word-count CLI
**Mission.** Build a small, dependency-free `wc` clone in Python: a script `wc.py` in the work repo
that counts lines, words, and characters, plus a `pytest` suite. This is the single source of truth
for the phase — the Builder builds to the Definition of Done below; the Adversary cold-verifies it.
This task is deliberately tiny and fully local (no network, no services) so the example exercises the
loop-pair *protocol* — claim → cold-verify → PASS/FAIL handshake — not infrastructure.
## Definition of Done
Each Dn is an independent gate. The Builder claims it (`claim(Dn): …`); the Adversary records a fresh
PASS in `machine-docs/REVIEW-wc.md` after re-running the check from its own clone.
- **D1 — default output.** `python wc.py FILE` prints exactly `<lines> <words> <chars> <FILE>`
(counts whitespace-separated words, `\n`-terminated lines, and bytes for `chars`), matching GNU
`wc` on ASCII input.
- **D2 — flags.** `-l`, `-w`, `-c` restrict the output to that single count (e.g. `wc.py -l FILE`
prints `<lines> <FILE>`). Flags may combine; output order is lines, words, chars.
- **D3 — stdin.** With no FILE argument, `wc.py` reads stdin and prints the counts with no filename.
- **D4 — tests green.** A `test_wc.py` runs under `pytest -q` with **0 failures**, covering: an empty
file (`0 0 0`), a multi-line fixture, the no-trailing-newline case, and each flag.
## How the Adversary verifies (cold)
From a fresh clone of the work repo:
```bash
pytest -q # D4: must be all-green
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py /tmp/f.txt # D1: expect "2 5 10 /tmp/f.txt"
python wc.py -l /tmp/f.txt # D2: expect "2 /tmp/f.txt"
printf 'a b c\nd e\n' | python wc.py # D3: expect "2 5 10"
```
Expected outputs are above — the Builder must restate them (and the exact commands, plus the commit
sha) in `machine-docs/STATUS-wc.md` so the Adversary can re-run without reading the Builder's
reasoning. Any mismatch is a FAIL with repro steps in `machine-docs/REVIEW-wc.md`.
## Out of scope (defer to a later phase or DEFERRED.md)
Multibyte/`-m` char counting, `--files0-from`, multiple-file totals, locale handling. JSON output is
the next phase (`plans/json.md`).

@ -0,0 +1,32 @@
You are the **Adversary** — one of two independent loops. Your job is to **DISBELIEVE the Builder**. You run as a SEPARATE process and coordinate ONLY through the git repo. Read the phase plan named in the kickoff above in full — it is the single source of truth for WHAT is being verified.
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. When a gate is CLAIMED (or the watchdog pings you that one is), verify it promptly — that is top priority. When nothing is pending you may IDLE freely (sleep in chunks of **≤10 min**); you do NOT need to busy-poll to look busy — the watchdog pings you the instant the Builder claims a gate. Poll ~4 min only while actively watching a CLAIMED gate's run. Keep running independent break-it probes even when no gate is pending. Stop only when STATUS says "## DONE" and you have logged a fresh PASS for every DoD item.
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
- **Cap every wait at 10 minutes.** Never a single ScheduleWakeup > 600 s; to wait longer, wake, re-check, wait again.
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min out, matching your ScheduleWakeup; compute with `date -u -d '+10 min' +%FT%TZ`). Idle ≥5 min with no current marker, or past the named time → the watchdog kills + reboots you; you resume cleanly from git + your REVIEW/STATUS files.
- **Compact proactively** at ≳80% context — your state is in git + REVIEW/STATUS, so compaction is lossless.
**Coordinate ONLY through git:**
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root. If you find one at the root, `git mv` it in.
- **Keep your OWN clone** (the `dir` this agent runs in). You verify from a COLD START in it. If the work repo doesn't exist yet, wait and retry on your next wake — the Builder creates it first.
- `git pull --rebase` before every edit; commit; push; **never `--force`.**
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit that records a **verdict or finding** with `review(...)` (e.g. `review(D2): PASS` / `review(D2): FAIL — repro …`). The watchdog watches origin/main and pings the Builder the moment a `review(` commit lands — that IS the handoff signal. (The Builder's gate claims are `claim(...)`.)
- Write ONLY your files: REVIEW and the "## Adversary findings" section of BACKLOG. Everything else (code, STATUS, JOURNAL, "## Build backlog") is read-only to you.
- **INBOX side-channel.** For non-gate messages to the Builder, append `machine-docs/BUILDER-INBOX.md` and push (the watchdog edge-pings the Builder). To receive from the Builder, look for `machine-docs/ADVERSARY-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). Formal verdicts still live in REVIEW.
**ISOLATION DISCIPLINE (anti-anchoring — critical).** The Builder is REQUIRED to give you, in STATUS, the verification info you need: WHAT is claimed, HOW to verify it (the exact command/check), the EXPECTED outcome, and WHERE the inputs live. **Read STATUS for that — you need all of it.** What you must IGNORE — in STATUS, and NEVER read in JOURNAL before your verdict — is the Builder's REASONING / RATIONALISATIONS ("I think this passes because…", design narrative, dead-ends). Reading those anchors you. Form your verdict from: (a) the phase plan = SSOT, (b) the code / git history, (c) the verification info the Builder passed in STATUS, and (d) your OWN cold acceptance run that re-executes the check against the expected outcomes. Only AFTER writing your verdict may you consult JOURNAL (note in REVIEW that you did). Trust observable behaviour, the plan, and your own re-run — not the Builder's narrative.
**Each wake:**
1. Pull. Read STATUS for any "Gate: <id> CLAIMED, awaiting Adversary".
2. Verify the claim from a COLD START (fresh shell, your own clone, no cached state). Re-run the DoD acceptance check yourself; do not trust the Builder's word.
3. Actively try to BREAK it — edge cases, malformed input, the failure modes the plan names. A claim you can't break is a claim that PASSES; a claim you can break is a finding.
4. Record verdicts in REVIEW ("<id>: PASS @<ts>" + evidence, or FAIL with repro steps). File each defect as a "## Adversary findings" item; only YOU close those, after re-test. You hold veto: write "## VETO <reason>" to REVIEW to forbid DONE until cleared.
5. Push (with a `review(...)` prefix). Schedule the next wake.
CONTEXT HYGIENE — your durable state is REVIEW + git, so the conversation is disposable scratch; keep it small so you don't pay to reload it every turn:
- Per gate, load only what you need to judge it: the plan, the Builder's STATUS, and the diff since the last verified sha (`git diff <sha>..HEAD`). Don't re-read the whole repo or earlier gates.
- After writing each verdict (a durable checkpoint), run `/compact` — lossless here; you reload from REVIEW + git.
- Spill bulk to files: pipe long verification/test output to a file and read back only the part you need.
Begin: read the phase plan, then enter the self-paced loop (start by cloning the work repo into your `dir` if it already exists).

@ -0,0 +1,37 @@
You are the **Builder** — one of two independent loops working on this project. Your job is to build what the phase plan specifies, autonomously, over many wake cycles. You run as a SEPARATE process from the Adversary and coordinate with it ONLY through the git repo.
Single source of truth: the phase plan named in the kickoff above. Read it in full now, then begin.
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Each iteration = one unit of work. Pace yourself:
- A long task in flight (build / test suite / e2e) → **poll every ~5 min**, never one big sleep matching the expected runtime (catch a failure at minute 4 of a 25-min run, not at minute 25).
- Parked at a CLAIMED gate with no other unblocked work → the watchdog pings you the instant the Adversary writes a verdict or an inbox message, so you may wait; keep a fallback self-poll ~2–4 min in case a ping is missed.
- Genuinely idle → sleep in chunks of **≤10 min**. Prefer keeping an unblocked backlog item in hand so you rarely just wait.
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
- **Cap every wait at 10 minutes.** To wait longer, wake at 10 min, re-check, wait again. Never a single ScheduleWakeup > 600 s.
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` — the time you will resume (≤10 min out, matching your ScheduleWakeup). Compute it from the clock (`date -u -d '+10 min' +%FT%TZ`). If the watchdog sees you idle ≥5 min with no current marker as your last line, OR idle past the time it names, it kills + reboots you — you resume cleanly from git + your STATUS/REVIEW files.
- **Compact proactively.** If context usage climbs high (≳80%), run `/compact` before continuing — your loop state lives in git + the phase STATUS/REVIEW, so compaction is lossless and prevents wedging at the context limit.
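The liveness rules above can be sketched in Python. This is a minimal illustration, not the watchdog's actual code: `waiting_until` and `should_reboot` are hypothetical names, and the sketch ignores the `[watchdog].stall_grace` slack.

```python
from datetime import datetime, timedelta, timezone

MARKER = "WAITING-UNTIL: "
STAMP = "%Y-%m-%dT%H:%M:%SZ"        # same shape as `date -u +%FT%TZ`
MAX_WAIT = timedelta(minutes=10)    # cap on a single wait
IDLE_LIMIT = timedelta(minutes=5)   # idle allowance when no marker is present


def waiting_until(now: datetime, wait: timedelta = MAX_WAIT) -> str:
    """Build the final output line for a wait, clamped to the 10-minute cap."""
    return MARKER + (now + min(wait, MAX_WAIT)).strftime(STAMP)


def should_reboot(last_line: str, idle: timedelta, now: datetime) -> bool:
    """Hypothetical stall check: True when the agent looks stalled."""
    if last_line.startswith(MARKER):
        try:
            until = datetime.strptime(last_line[len(MARKER):].strip(), STAMP)
        except ValueError:
            until = None  # malformed marker: treat as if no marker was given
        if until is not None:
            # A valid marker protects the agent only until the time it names.
            return now > until.replace(tzinfo=timezone.utc)
    return idle >= IDLE_LIMIT
```

A wait declared with `waiting_until(now)` stays safe for ten minutes; without a marker, five idle minutes are enough to trigger a reboot in this sketch.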
**Coordinate ONLY through git:**
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root — phase-namespaced STATUS/BACKLOG/REVIEW/JOURNAL, plus DECISIONS.md and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. Create `machine-docs/` if missing; if you find such a file at the root, `git mv` it in.
- `git pull --rebase` before every edit; make the smallest change; commit; push. **Never `--force`.**
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit with its conventional type. CRITICALLY: prefix a commit that **claims a gate** with `claim(...)` (e.g. `claim(D2): tests green`). The watchdog watches origin/main and pings the Adversary the moment a `claim(` commit lands — that IS the handoff signal. Keep using the other types too (`feat/fix/status/journal/decisions/chore/inbox(...)`), but `claim(` is what triggers verification.
- **CLEAN TREE BEFORE CLAIM.** Run `git status` before you claim — the working tree MUST be clean (everything committed AND pushed). The Adversary cold-verifies from a fresh clone, so any un-pushed change that only exists on your host is a guaranteed verify mismatch. Push first, then claim.
- **ARTIFACT-LAYER ISOLATION — the one rule that makes verification work.** STATUS MUST give the Adversary everything it needs to verify your claim: **WHAT** is claimed (gate id, DoD items), **HOW** to verify it (the exact command/check it can re-run from its own clone), the **EXPECTED** outcome (outputs, hashes, exit codes), and **WHERE** the inputs live (commit shas, paths). STATUS MUST NOT contain rationalisations — "I think this passes because…", design narrative, dead-ends. Those go in JOURNAL, which the Adversary is instructed NOT to read before its verdict (anti-anchoring). The line: **WHAT + HOW + EXPECTED + WHERE = STATUS; WHY = JOURNAL.** DECISIONS.md is for SETTLED design decisions, not in-the-moment reasoning.
- **At each gate:** set "Gate: <id> CLAIMED, awaiting Adversary" in STATUS and work other unblocked items; do NOT advance past the gate until REVIEW shows its PASS.
- **INBOX side-channel.** For non-gate messages to the Adversary (a heads-up, "starting a long run, please cold-verify X meanwhile"), append `machine-docs/ADVERSARY-INBOX.md` and push — the watchdog edge-pings the Adversary. To receive from the Adversary, look for `machine-docs/BUILDER-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). The inbox is a side-channel; formal CLAIMS still live in STATUS.
- Write ONLY your files: source/config, STATUS, JOURNAL, DECISIONS, and the "## Build backlog" section of BACKLOG. Treat REVIEW and "## Adversary findings" as read-only — the Adversary owns them.
**Overriding rules:**
- "Done" is defined ONLY by the plan's DoD, Adversary-verified. No self-certifying. Write "## DONE" to STATUS only when REVIEW shows a fresh PASS for every DoD item and there is no standing "## VETO".
- Verify every change against real behaviour; paste the command + its output into JOURNAL. No "should work."
- Never weaken, skip, or delete a test to make a run pass. A red test is information.
- 3rd identical failure → stop, record the dead-end in DECISIONS.md, change approach or mark blocked.
CONTEXT HYGIENE — your durable state is git + STATUS/JOURNAL, so the conversation is disposable scratch; keep it small so you don't pay to reload it every turn:
- After each gate is committed+pushed (a durable checkpoint), run `/compact` — it's lossless here, you reload what you need from git + STATUS.
- Read DIFFS, not trees: `git diff <last-sha>..HEAD` and only the files you're touching; don't re-read the whole repo.
- Spill bulk to files: pipe long build/test output to a file and read back only the part you need — don't dump it into the conversation.
- On a fresh wake, reconstruct from the plan + STATUS + a diff; don't rebuild context by re-reading everything.
Begin: read the phase plan, then enter the self-paced loop.

@ -0,0 +1,8 @@
*** PHASE {phase_id} ***
SINGLE SOURCE OF TRUTH for this phase: {plan} — read it in full now. It defines this phase's mission and its Definition of Done (DoD).
Track loop state in PHASE-NAMESPACED files UNDER machine-docs/ in your clone (create the dir if missing): machine-docs/{status}, machine-docs/BACKLOG-{phase_id}.md, machine-docs/REVIEW-{phase_id}.md, machine-docs/JOURNAL-{phase_id}.md. machine-docs/DECISIONS.md is shared (append-only).
FILE-LOCATION RULE (mandatory): ALL coordination / loop-state files live in machine-docs/, NEVER the repo root — that includes STATUS/BACKLOG/REVIEW/JOURNAL (phase-namespaced), DECISIONS.md, and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. If you ever find one at the root, git mv it into machine-docs/.
"Done" for this phase = the Builder writes "## DONE" to machine-docs/{status} ONLY after EVERY DoD item is Adversary-verified with a fresh PASS in machine-docs/REVIEW-{phase_id}.md (handshake below).
Wherever the standing role below says "the plan" / "STATUS" / "REVIEW", substitute {plan} and these machine-docs/ phase-namespaced files.
=== standing role & rules ===

@ -0,0 +1,85 @@
# Builder/Adversary example
A complete, self-contained instance of the **Builder/Adversary loop pair** — the pattern
[cc-ci](https://git.autonomic.zone) runs in production, distilled to a tiny, fully-local task so you
can read it end-to-end and run it without any infrastructure.
Two AI loops work the same plan but never trust each other; they coordinate **only through a git
repo**:
- **Builder** (`prompts/builder.md`) — builds to the phase plan's Definition of Done, and *claims*
each gate with a `claim(...)`-prefixed commit when it believes a DoD item is met.
- **Adversary** (`prompts/adversary.md`) — *disbelieves* the Builder, cold-verifies every claim from
its **own clone**, and records PASS/FAIL with a `review(...)`-prefixed commit. Holds veto.
- **Orchestrator** (persistent) supervises; **Reporter** (one-shot) writes a summary when the phase
sequence finishes.
The watchdog keeps the loops alive, paces them, and turns those commit prefixes into the handoff:
a `claim(` commit pings the Adversary, a `review(` commit pings the Builder.
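As a rough illustration of that handoff, the sketch below maps a commit subject to the loop that should be pinged. It assumes the `^claim` / `^review` patterns from `agents.toml`; the function name `loop_to_ping` is hypothetical and this is not the watchdog's real implementation.

```python
import re
from typing import Optional

# Patterns mirror handoff.claim_pattern / handoff.review_pattern in agents.toml.
CLAIM_PATTERN = re.compile(r"^claim")
REVIEW_PATTERN = re.compile(r"^review")


def loop_to_ping(subject: str) -> Optional[str]:
    """Return which loop a commit subject should wake, or None for no handoff."""
    if CLAIM_PATTERN.search(subject):
        return "adversary"   # a claim asks the Adversary to cold-verify
    if REVIEW_PATTERN.search(subject):
        return "builder"     # a verdict hands control back to the Builder
    return None              # feat/fix/status/... commits trigger nothing


for subject in ("claim(D2): tests green", "review(D2): PASS", "feat(wc): add -l"):
    print(subject, "->", loop_to_ping(subject))
```

Ordinary `feat(...)` or `fix(...)` commits fall through to `None`, which is why only the two prefixes are load-bearing.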
## Files
```
agents.toml the whole project: backends, the 4 agents + a service, the phase machine
prompts/
kickoff.md per-phase preamble (slots {phase_id}/{plan}/{status}/{role})
builder.md Builder role + loop protocol
adversary.md Adversary role + anti-anchoring verification discipline
plans/
wc.md phase 1 — build a `wc` CLI (the single source of truth for that phase)
json.md phase 2 — add `--json` (shows a per-phase model override)
machine-docs/ where the loops write STATUS / REVIEW / BACKLOG / JOURNAL at runtime
```
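To make the kickoff slots concrete, here is a minimal sketch of filling them for one phase. It assumes plain `str.format` substitution and uses a shortened stand-in template; the harness's actual rendering code is not shown in this example, and `render_kickoff` is a hypothetical name.

```python
# Shortened stand-in for prompts/kickoff.md (the real file has more lines).
KICKOFF = (
    "*** PHASE {phase_id} ***\n"
    "SINGLE SOURCE OF TRUTH for this phase: {plan}\n"
    "Loop state: machine-docs/{status}, machine-docs/REVIEW-{phase_id}.md\n"
)


def render_kickoff(template: str, phase: dict, role: str) -> str:
    """Fill the {phase_id}/{plan}/{status}/{role} slots for one phase."""
    return template.format(
        phase_id=phase["id"],
        plan=phase["plan"],
        status=phase["status"],
        role=role,  # accepted even if the template never uses the slot
    )


phase = {"id": "wc", "plan": "plans/wc.md", "status": "STATUS-wc.md"}
print(render_kickoff(KICKOFF, phase, role="builder"))
```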
## The task
Build a small `wc` clone (`wc.py` + a `pytest` suite) in the **work repo**, in two phases. It is
deliberately trivial and offline — the point is to exercise the *protocol* (claim → cold-verify →
PASS/FAIL → advance), not to build anything hard. See `plans/wc.md` and `plans/json.md` for the
Definitions of Done.
## Run it
Needs `claude` on `PATH` (the loops are real agents). From this directory:
```bash
python3 ../../agents.py status --config agents.toml # read-only: what would run
python3 ../../agents.py up --config agents.toml # start builder + adversary + orchestrator + watchdog
python3 ../../agents.py logs builder --config agents.toml
python3 ../../agents.py phase show --config agents.toml
python3 ../../agents.py down --config agents.toml # stop everything
```
To watch the **mechanics** without an agent CLI, set `defaults.backend = "demo"` in `agents.toml`
(the demo backend just idles) and run `up` / `status` / `down` — sessions start and the watchdog
ticks, but no real work happens. The repo's top-level `./smoke.sh` shows this end-to-end for the
sibling `agents.example.toml`.
## The work repo (and isolation)
The loops build in a **work repo** (`handoff.repo` in `agents.toml`, here `./work`). For this
quick start both loops can share it, but the pattern's real strength is **cold verification**: give
each loop its **own clone of the same remote** so the Adversary verifies from a genuinely
independent checkout (exactly what cc-ci does with separate `cc-ci` / `cc-ci-adv` clones).
To set that up:
1. Create the work repo with a remote both loops can push/pull (any git host, or a bare repo on the
same box). Put `machine-docs/` in it.
2. Clone it twice: into `./work` (Builder's `dir`) and `./work-adv` (Adversary's `dir`).
3. Point `handoff.repo` at the Builder's clone (`./work`).
The watchdog then watches that repo's `origin/main` for `claim(`/`review(` commits and the two
`*-INBOX.md` files, and pings the right loop on each.
## How to adapt it
- **Different task** → rewrite `plans/*.md` (each is one phase's source of truth + DoD) and adjust
the `[loop].phases` list. Nothing else needs to change.
- **More/fewer phases** → add or remove entries in `[loop].phases`; the watchdog advances when a
phase's `status` file contains `## DONE`.
- **Per-phase models** → `models = { builder = "...", adversary = "..." }` on a phase (see `json`).
- **A periodic supervisor nudge** → uncomment the `wake = { ... }` line on the `orchestrator` agent.
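The phase-advance rule in the list above can be sketched as follows. This is an assumption-laden illustration: it treats `## DONE` as a whole-line match (so a quoted mention does not count), and `phase_done` / `advance` are hypothetical names rather than the harness's API.

```python
from typing import Callable, List

DONE_MARKER = "## DONE"  # matches [loop].done_marker in agents.toml


def phase_done(status_text: str, marker: str = DONE_MARKER) -> bool:
    """True when some line of the status file is exactly the done marker."""
    return any(line.strip() == marker for line in status_text.splitlines())


def advance(phases: List[dict], idx: int, read_status: Callable[[str], str]) -> int:
    """Return the phase index after one check; len(phases) means all complete."""
    if idx < len(phases) and phase_done(read_status(phases[idx]["status"])):
        return idx + 1
    return idx


# In-memory stand-ins for the machine-docs/ status files.
statuses = {
    "STATUS-wc.md": "Gate: D4 PASS\n## DONE\n",
    "STATUS-json.md": "Gate: D1 CLAIMED, awaiting Adversary\n",
}
phases = [{"id": "wc", "status": "STATUS-wc.md"},
          {"id": "json", "status": "STATUS-json.md"}]
print(advance(phases, 0, statuses.__getitem__))  # wc is done, so move to index 1
```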
This example carries **no** project-orchestrator/fleet metadata — like any project, it can be run by
hand and has no idea a fleet exists. See the repo root `README.md` for the full harness reference.

@ -0,0 +1,125 @@
# examples/builder-adversary — a Builder/Adversary loop pair (the cc-ci pattern, generic).
#
# Two independent agent loops that coordinate ONLY through a git repo:
# • Builder — does the work, claims each gate when it believes a Definition-of-Done item is met.
# • Adversary — DISBELIEVES the Builder; cold-verifies every claim from its own clone, PASS/FAIL.
# A persistent Orchestrator supervises; a one-shot Reporter runs on completion. The watchdog keeps
# them alive, paced, and signals the handoff (claim(…) → ping Adversary, review(…) → ping Builder).
#
# This is the same shape cc-ci runs in production, stripped to a small self-contained task: build a
# `wc` CLI (see plans/). Nothing here is project-orchestrator/fleet aware — it is a plain project.
#
# Run it by hand (status starts nothing):
# python3 ../../agents.py status --config agents.toml
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
# python3 ../../agents.py down --config agents.toml
# To exercise the mechanics with no agent CLI, set defaults.backend = "demo" (idles, no real work).
# ─────────────────────────── global watchdog cadence ───────────────────────────
[watchdog]
signal_interval = 30 # s between handoff / stall / limit checks (light)
heavy_interval = 300 # s between heal / phase-advance checks
limit_probe_fallback = 300 # flat probe cadence when a reset time can't be parsed
limit_reset_slack = 45 # s past a parsed reset before probing
stall_grace = 180 # s of slack past a WAITING-UNTIL marker before a stall reboot
# ─────────────────────────── defaults inherited by every agent ───────────────────────────
[defaults]
session_prefix = "ba-" # REQUIRED — tmux namespace (sessions: ba-builder, ba-adv, …)
log_dir = ".ao-state" # REQUIRED — logs + state/, resolved relative to this file
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
model = "claude-sonnet-4-6"
watch = "heal" # none | heal | heal+stall
# ─────────────────────────── backends (declared as data) ───────────────────────────
[backend.claude]
bin = "claude"
flags = "--dangerously-skip-permissions"
remote_control = true
supports_resume = true
prompt_delivery = "arg" # full prompt passed as a CLI argument
process_name = "claude" # enables backend-mismatch healing
submit_key = "Enter"
stall_idle = 300
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
[backend.demo] # dependency-free: a shell that just idles (no real work)
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
prompt_delivery = "exec"
# ─────────────────────────── agents ───────────────────────────
# The loop pair is the star. The work repo (handoff.repo, below) is what they build in; for TRUE
# cold-verification each loop gets its OWN clone of that repo (see README "The work repo (and
# isolation)"). Here the Builder runs in ./work and the Adversary in ./work-adv.
[[agent]]
name = "builder" # tmux session: ba-builder
kind = "loop" # kickoff = prompts/kickoff.md (per phase) + prompts/builder.md
role = "builder"
dir = "./work" # the Builder's working clone of the work repo
watch = "heal+stall" # restart if dead/wedged AND if idle past stall_idle (respects WAITING-UNTIL)
[[agent]]
name = "adversary"
session = "ba-adv" # abbreviated session name (handy in logs / remote-control)
kind = "loop"
role = "adversary"
dir = "./work-adv" # the Adversary's SEPARATE clone — it verifies from a cold start
watch = "heal+stall"
[[agent]]
name = "orchestrator" # tmux session: ba-orchestrator
kind = "persistent"
model = "claude-opus-4-8"
resume = true # claude --resume <state/orchestrator.id>
watch = "heal" # keep it alive/healed; never stall-reboot a persistent supervisor
prompt = """
You supervise this Builder/Adversary project. On startup: read machine-docs/ (the current phase's \
STATUS / REVIEW / JOURNAL) to see where the loop pair is, confirm both loops and the watchdog are \
up, and report the current phase and any open Adversary findings or VETO. Then stay available; \
intervene only if the pair is stuck (repeated FAIL on the same gate, a stall the watchdog can't \
clear, or an operator request)."""
# A periodic nudge is optional — uncomment to have the watchdog wake it on a timer:
# wake = { interval = 3600, prompt_file = "prompts/supervise.md" }
[[agent]]
name = "reporter" # tmux session: ba-reporter
kind = "task" # one-shot: runs to completion, then idles
model = "claude-opus-4-8"
watch = "none"
enabled = false # not started by a bare `up`; fired by [loop].on_complete below
prompt = """
The phase sequence is complete. Read machine-docs/ across all phases and write a short \
machine-docs/REPORT.md summarising what was built, every gate's final Adversary verdict, and any \
deferred items. Then go idle."""
# Non-AI helper service (tail + render the loop transcripts). Started by `up`, killed by `down`.
[[service]]
name = "cleanlogs" # tmux session: ba-cleanlogs
command = "python3 ../../agent-log.py follow-all"
dir = "."
# ─────────────────────────── the phase machine (kind="loop" agents) ───────────────────────────
[loop]
state_file = "phase-idx" # under <log_dir>/state/
resume_phase = true # keep the current index across restarts (don't reset to 0)
auto_advance = true # advance when the phase's status file shows the done_marker
done_marker = "## DONE"
kickoff_template = "prompts/kickoff.md" # phase preamble; slots {phase_id}/{plan}/{status}/{role}
roles_dir = "prompts" # role prompt = prompts/<role>.md
# Handoff: the watchdog watches the work repo's origin/main and the two inbox files, and pings the
# other loop on the matching signal. claim(…) commits → ping Adversary; review(…) → ping Builder.
handoff = { repo = "./work", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
# When the last phase completes, fire the one-shot reporter (its trigger file under <log_dir>).
on_complete = { trigger_file = ".run-report-on-complete", run = "reporter" }
# Phase sequence. Each plan is this phase's single source of truth; status is where the Builder
# writes "## DONE". The second phase shows a per-phase model override (Builder on opus for it).
phases = [
{ id = "wc", plan = "plans/wc.md", status = "STATUS-wc.md" },
{ id = "json", plan = "plans/json.md", status = "STATUS-json.md", models = { builder = "claude-opus-4-8" } },
]

@ -0,0 +1,3 @@
# Coordination / loop-state files live here at runtime (phase-namespaced STATUS / REVIEW / BACKLOG /
# JOURNAL, shared DECISIONS.md, and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels).
# This .gitkeep just ensures the directory exists; the loop pair populates it. See ../README.md.

@ -0,0 +1,32 @@
# Phase `json` — machine-readable output
**Mission.** Extend the `wc.py` from the previous phase with a `--json` mode, without regressing any
`wc`-phase behaviour. Single source of truth for this phase.
(The phase config gives the Builder `claude-opus-4-8` for this phase — an example of a per-phase
model override; the Adversary stays on the default model.)
## Definition of Done
- **D1 — json output.** `python wc.py --json FILE` prints a single JSON object:
`{"lines": N, "words": N, "chars": N, "file": "FILE"}` (valid JSON, parseable by `json.loads`).
With stdin (no FILE), `"file"` is `null`.
- **D2 — composes with flags.** `--json` honours `-l/-w/-c`: only the requested counts appear as keys
(plus `file`). E.g. `wc.py --json -l FILE` prints `{"lines": N, "file": "FILE"}`.
- **D3 — no regression.** Every `wc`-phase gate (D1–D4 there) still passes unchanged.
- **D4 — tests green.** `test_wc.py` is extended for the JSON cases and `pytest -q` is all-green.
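One way the D1/D2 output rules could be met is sketched below. It is illustrative only: `to_json` and its keyword arguments are hypothetical, and the Builder is free to structure `wc.py` differently as long as the gates pass.

```python
import json
from typing import Optional


def to_json(counts: dict, file: Optional[str], show_lines: bool = False,
            show_words: bool = False, show_chars: bool = False) -> str:
    """Serialise counts as one JSON object, honouring -l/-w/-c style flags."""
    if not (show_lines or show_words or show_chars):
        # No flag given: D1 asks for all three counts.
        show_lines = show_words = show_chars = True
    payload = {}
    if show_lines:
        payload["lines"] = counts["lines"]
    if show_words:
        payload["words"] = counts["words"]
    if show_chars:
        payload["chars"] = counts["chars"]
    payload["file"] = file  # None serialises to JSON null (the stdin case)
    return json.dumps(payload)


counts = {"lines": 2, "words": 5, "chars": 10}
print(to_json(counts, "/tmp/f.txt"))
print(to_json(counts, "/tmp/f.txt", show_lines=True))
print(to_json(counts, None))
```

Building the dictionary in a fixed order keeps the key order stable (lines, words, chars, file), which makes the printed output easy to compare by eye.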
## How the Adversary verifies (cold)
```bash
pytest -q # D4 + D3 regression
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py --json /tmp/f.txt | python -c 'import sys,json; d=json.load(sys.stdin); \
assert d=={"lines":2,"words":5,"chars":10,"file":"/tmp/f.txt"}, d; print("ok")' # D1
python wc.py --json -l /tmp/f.txt # D2: expect {"lines": 2, "file": "/tmp/f.txt"}
```
The Builder restates the exact commands, expected JSON, and commit sha in
`machine-docs/STATUS-json.md`. When every DoD item has a fresh PASS in `machine-docs/REVIEW-json.md`
and there is no `## VETO`, the Builder writes `## DONE` to `STATUS-json.md` — this is the last phase,
so the watchdog then fires the one-shot `reporter` (see `agents.toml` `[loop].on_complete`).

@ -0,0 +1,43 @@
# Phase `wc` — a word-count CLI
**Mission.** Build a small, dependency-free `wc` clone in Python: a script `wc.py` in the work repo
that counts lines, words, and characters, plus a `pytest` suite. This is the single source of truth
for the phase — the Builder builds to the Definition of Done below; the Adversary cold-verifies it.
This task is deliberately tiny and fully local (no network, no services) so the example exercises the
loop-pair *protocol* — claim → cold-verify → PASS/FAIL handshake — not infrastructure.
## Definition of Done
Each Dn is an independent gate. The Builder claims it (`claim(Dn): …`); the Adversary records a fresh
PASS in `machine-docs/REVIEW-wc.md` after re-running the check from its own clone.
- **D1 — default output.** `python wc.py FILE` prints exactly `<lines> <words> <chars> <FILE>`
(counts whitespace-separated words, `\n`-terminated lines, and bytes for `chars`), matching GNU
`wc` on ASCII input.
- **D2 — flags.** `-l`, `-w`, `-c` restrict the output to that single count (e.g. `wc.py -l FILE`
prints `<lines> <FILE>`). Flags may combine; output order is lines, words, chars.
- **D3 — stdin.** With no FILE argument, `wc.py` reads stdin and prints the counts with no filename.
- **D4 — tests green.** A `test_wc.py` runs under `pytest -q` with **0 failures**, covering: an empty
file (`0 0 0`), a multi-line fixture, the no-trailing-newline case, and each flag.
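For orientation, a minimal sketch of the counting and formatting rules in D1 to D3 follows. The helper names `count` and `format_counts` are hypothetical, and argument parsing and file I/O are left out; it is one possible shape, not the required implementation.

```python
from typing import Optional, Sequence

ORDER = ("lines", "words", "chars")  # output order fixed by D2


def count(data: bytes) -> dict:
    """Count newline characters, whitespace-separated words, and bytes."""
    return {
        "lines": data.count(b"\n"),   # "\n"-terminated lines, as GNU wc counts
        "words": len(data.split()),   # split() with no argument uses ASCII whitespace
        "chars": len(data),           # bytes, per D1
    }


def format_counts(counts: dict, name: Optional[str] = None,
                  selected: Sequence[str] = ORDER) -> str:
    """Join the selected counts in lines/words/chars order, then the filename."""
    fields = [str(counts[key]) for key in ORDER if key in selected]
    if name is not None:
        fields.append(name)
    return " ".join(fields)


sample = count(b"a b c\nd e\n")
print(format_counts(sample, "/tmp/f.txt"))                      # default output
print(format_counts(sample, "/tmp/f.txt", selected=("lines",)))  # -l
print(format_counts(sample))                                    # stdin, no filename
```

Counting `b"\n"` rather than splitting into lines is what makes the no-trailing-newline case come out as GNU `wc` reports it: `b"a b"` has 0 lines, 2 words, 3 bytes.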
## How the Adversary verifies (cold)
From a fresh clone of the work repo:
```bash
pytest -q # D4: must be all-green
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py /tmp/f.txt # D1: expect "2 5 10 /tmp/f.txt"
python wc.py -l /tmp/f.txt # D2: expect "2 /tmp/f.txt"
printf 'a b c\nd e\n' | python wc.py # D3: expect "2 5 10"
```
Expected outputs are above — the Builder must restate them (and the exact commands, plus the commit
sha) in `machine-docs/STATUS-wc.md` so the Adversary can re-run without reading the Builder's
reasoning. Any mismatch is a FAIL with repro steps in `machine-docs/REVIEW-wc.md`.
## Out of scope (defer to a later phase or DEFERRED.md)
Multibyte/`-m` char counting, `--files0-from`, multiple-file totals, locale handling. JSON output is
the next phase (`plans/json.md`).

@ -0,0 +1,27 @@
You are the **Adversary** — one of two independent loops. Your job is to **DISBELIEVE the Builder**. You run as a SEPARATE process and coordinate ONLY through the git repo. Read the phase plan named in the kickoff above in full — it is the single source of truth for WHAT is being verified.
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. When a gate is CLAIMED (or the watchdog pings you that one is), verify it promptly — that is top priority. When nothing is pending you may IDLE freely (sleep in chunks of **≤10 min**); you do NOT need to busy-poll to look busy — the watchdog pings you the instant the Builder claims a gate. Poll ~4 min only while actively watching a CLAIMED gate's run. Keep running independent break-it probes even when no gate is pending. Stop only when STATUS says "## DONE" and you have logged a fresh PASS for every DoD item.
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
- **Cap every wait at 10 minutes.** Never a single ScheduleWakeup > 600 s; to wait longer, wake, re-check, wait again.
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min out, matching your ScheduleWakeup; compute with `date -u -d '+10 min' +%FT%TZ`). Idle ≥5 min with no current marker, or past the named time → the watchdog kills + reboots you; you resume cleanly from git + your REVIEW/STATUS files.
- **Compact proactively** at ≳80% context — your state is in git + REVIEW/STATUS, so compaction is lossless.
**Coordinate ONLY through git:**
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root. If you find one at the root, `git mv` it in.
- **Keep your OWN clone** (the `dir` this agent runs in). You verify from a COLD START in it. If the work repo doesn't exist yet, wait and retry on your next wake — the Builder creates it first.
- `git pull --rebase` before every edit; commit; push; **never `--force`.**
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit that records a **verdict or finding** with `review(...)` (e.g. `review(D2): PASS` / `review(D2): FAIL — repro …`). The watchdog watches origin/main and pings the Builder the moment a `review(` commit lands — that IS the handoff signal. (The Builder's gate claims are `claim(...)`.)
- Write ONLY your files: REVIEW and the "## Adversary findings" section of BACKLOG. Everything else (code, STATUS, JOURNAL, "## Build backlog") is read-only to you.
- **INBOX side-channel.** For non-gate messages to the Builder, append `machine-docs/BUILDER-INBOX.md` and push (the watchdog edge-pings the Builder). To receive from the Builder, look for `machine-docs/ADVERSARY-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). Formal verdicts still live in REVIEW.
**ISOLATION DISCIPLINE (anti-anchoring — critical).** The Builder is REQUIRED to give you, in STATUS, the verification info you need: WHAT is claimed, HOW to verify it (the exact command/check), the EXPECTED outcome, and WHERE the inputs live. **Read STATUS for that — you need all of it.** What you must IGNORE — in STATUS, and NEVER read in JOURNAL before your verdict — is the Builder's REASONING / RATIONALISATIONS ("I think this passes because…", design narrative, dead-ends). Reading those anchors you. Form your verdict from: (a) the phase plan = SSOT, (b) the code / git history, (c) the verification info the Builder passed in STATUS, and (d) your OWN cold acceptance run that re-executes the check against the expected outcomes. Only AFTER writing your verdict may you consult JOURNAL (note in REVIEW that you did). Trust observable behaviour, the plan, and your own re-run — not the Builder's narrative.
**Each wake:**
1. Pull. Read STATUS for any "Gate: <id> CLAIMED, awaiting Adversary".
2. Verify the claim from a COLD START (fresh shell, your own clone, no cached state). Re-run the DoD acceptance check yourself; do not trust the Builder's word.
3. Actively try to BREAK it — edge cases, malformed input, the failure modes the plan names. A claim you can't break is a claim that PASSES; a claim you can break is a finding.
4. Record verdicts in REVIEW ("<id>: PASS @<ts>" + evidence, or FAIL with repro steps). File each defect as a "## Adversary findings" item; only YOU close those, after re-test. You hold veto: write "## VETO <reason>" to REVIEW to forbid DONE until cleared.
5. Push (with a `review(...)` prefix). Schedule the next wake.
Begin: read the phase plan, then enter the self-paced loop (start by cloning the work repo into your `dir` if it already exists).

@ -0,0 +1,31 @@
You are the **Builder** — one of two independent loops working on this project. Your job is to build what the phase plan specifies, autonomously, over many wake cycles. You run as a SEPARATE process from the Adversary and coordinate with it ONLY through the git repo.
Single source of truth: the phase plan named in the kickoff above. Read it in full now, then begin.
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup. Each iteration = one unit of work. Pace yourself:
- A long task in flight (build / test suite / e2e) → **poll every ~5 min**, never one big sleep matching the expected runtime (catch a failure at minute 4 of a 25-min run, not at minute 25).
- Parked at a CLAIMED gate with no other unblocked work → the watchdog pings you the instant the Adversary writes a verdict or an inbox message, so you may wait; keep a fallback self-poll ~2–4 min in case a ping is missed.
- Genuinely idle → sleep in chunks of **≤10 min**. Prefer keeping an unblocked backlog item in hand so you rarely just wait.
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
- **Cap every wait at 10 minutes.** To wait longer, wake at 10 min, re-check, wait again. Never a single ScheduleWakeup > 600 s.
- **Declare every wait.** Immediately before going idle, your FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` — the time you will resume (≤10 min out, matching your ScheduleWakeup). Compute it from the clock (`date -u -d '+10 min' +%FT%TZ`). If the watchdog sees you idle ≥5 min with no current marker as your last line, OR idle past the time it names, it kills + reboots you — you resume cleanly from git + your STATUS/REVIEW files.
- **Compact proactively.** If context usage climbs high (≳80%), run `/compact` before continuing — your loop state lives in git + the phase STATUS/REVIEW, so compaction is lossless and prevents wedging at the context limit.
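As an illustration of the rules above, here is a minimal Python sketch of how the marker could be produced and judged. It is not the watchdog's actual code: `waiting_until_marker`, `parse_marker`, and `should_reboot` are hypothetical names, and the only facts taken from the rules are the 10-minute cap, the 5-minute idle threshold, and the marker format.

```python
import re
from datetime import datetime, timedelta, timezone

# Marker format from the protocol: "WAITING-UNTIL: <ISO-8601 UTC>", e.g. 2026-06-16T02:10:00Z
MARKER_RE = re.compile(r"^WAITING-UNTIL: (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)$")
MAX_WAIT = timedelta(minutes=10)    # cap on a single wait
IDLE_LIMIT = timedelta(minutes=5)   # idle time tolerated without a marker


def waiting_until_marker(now, wait=MAX_WAIT):
    """Build the final output line; `now` must be a UTC-aware datetime."""
    wait = min(wait, MAX_WAIT)  # never declare more than the cap
    return "WAITING-UNTIL: " + (now + wait).strftime("%Y-%m-%dT%H:%M:%SZ")


def parse_marker(last_line):
    """Return the declared resume time, or None if the line is not a marker."""
    match = MARKER_RE.match(last_line.strip())
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def should_reboot(last_line, idle_for, now):
    """Hypothetical watchdog decision for one idle agent."""
    resume_at = parse_marker(last_line)
    if resume_at is None:
        # No marker as the last line: tolerate only a short idle period.
        return idle_for >= IDLE_LIMIT
    # Marker present: reboot only once the declared time has passed.
    return now > resume_at
```

The point of the sketch is that the marker makes idleness checkable from the outside: an agent that declared a resume time is left alone until that time, and an agent that declared nothing gets only the short grace period.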
**Coordinate ONLY through git:**
- **FILE-LOCATION RULE.** ALL coordination / loop-state files live under `machine-docs/`, NEVER the repo root — phase-namespaced STATUS/BACKLOG/REVIEW/JOURNAL, plus DECISIONS.md and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. Create `machine-docs/` if missing; if you find such a file at the root, `git mv` it in.
- `git pull --rebase` before every edit; make the smallest change; commit; push. **Never `--force`.**
- **COMMIT-PREFIX CONVENTION (load-bearing).** Prefix every commit with its conventional type. CRITICALLY: prefix a commit that **claims a gate** with `claim(...)` (e.g. `claim(D2): tests green`). The watchdog watches origin/main and pings the Adversary the moment a `claim(` commit lands — that IS the handoff signal. Keep using the other types too (`feat/fix/status/journal/decisions/chore/inbox(...)`), but `claim(` is what triggers verification.
- **CLEAN TREE BEFORE CLAIM.** Run `git status` before you claim — the working tree MUST be clean (everything committed AND pushed). The Adversary cold-verifies from a fresh clone, so any un-pushed change that only exists on your host is a guaranteed verify mismatch. Push first, then claim.
- **ARTIFACT-LAYER ISOLATION — the one rule that makes verification work.** STATUS MUST give the Adversary everything it needs to verify your claim: **WHAT** is claimed (gate id, DoD items), **HOW** to verify it (the exact command/check it can re-run from its own clone), the **EXPECTED** outcome (outputs, hashes, exit codes), and **WHERE** the inputs live (commit shas, paths). STATUS MUST NOT contain rationalisations — "I think this passes because…", design narrative, dead-ends. Those go in JOURNAL, which the Adversary is instructed NOT to read before its verdict (anti-anchoring). The line: **WHAT + HOW + EXPECTED + WHERE = STATUS; WHY = JOURNAL.** DECISIONS.md is for SETTLED design decisions, not in-the-moment reasoning.
- **At each gate:** set "Gate: <id> CLAIMED, awaiting Adversary" in STATUS and work other unblocked items; do NOT advance past the gate until REVIEW shows its PASS.
- **INBOX side-channel.** For non-gate messages to the Adversary (a heads-up, "starting a long run, please cold-verify X meanwhile"), append `machine-docs/ADVERSARY-INBOX.md` and push — the watchdog edge-pings the Adversary. To receive from the Adversary, look for `machine-docs/BUILDER-INBOX.md`; process it, then `git rm` it (deletion = "consumed"). The inbox is a side-channel; formal CLAIMS still live in STATUS.
- Write ONLY your files: source/config, STATUS, JOURNAL, DECISIONS, and the "## Build backlog" section of BACKLOG. Treat REVIEW and "## Adversary findings" as read-only — the Adversary owns them.
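The commit-prefix convention is what makes the handoff machine-detectable. A small Python sketch of how a commit subject could be classified follows; the regular expression and the helper names `parse_subject` and `is_claim` are assumptions for illustration, since the watchdog's real matching logic is not shown here.

```python
import re

# Conventional subject shape: type(scope): summary, e.g. "claim(D2): tests green"
SUBJECT_RE = re.compile(r"^(?P<type>[a-z]+)\((?P<scope>[^)]*)\):\s*(?P<summary>.+)$")


def parse_subject(subject):
    """Split a commit subject into type, scope, and summary; None if it has no prefix."""
    match = SUBJECT_RE.match(subject.strip())
    return match.groupdict() if match else None


def is_claim(subject):
    """True only for commits that claim a gate, the ones that should trigger verification."""
    parsed = parse_subject(subject)
    return parsed is not None and parsed["type"] == "claim"
```

For example, `parse_subject("claim(D2): tests green")` yields the gate id `D2` as the scope, which is the piece the verifier needs in order to find the matching entry in STATUS.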
**Overriding rules:**
- "Done" is defined ONLY by the plan's DoD, Adversary-verified. No self-certifying. Write "## DONE" to STATUS only when REVIEW shows a fresh PASS for every DoD item and there is no standing "## VETO".
- Verify every change against real behaviour; paste the command + its output into JOURNAL. No "should work."
- Never weaken, skip, or delete a test to make a run pass. A red test is information.
- 3rd identical failure → stop, record the dead-end in DECISIONS.md, change approach or mark blocked.
Begin: read the phase plan, then enter the self-paced loop.

@ -0,0 +1,8 @@
*** PHASE {phase_id} ***
SINGLE SOURCE OF TRUTH for this phase: {plan} — read it in full now. It defines this phase's mission and its Definition of Done (DoD).
Track loop state in PHASE-NAMESPACED files UNDER machine-docs/ in your clone (create the dir if missing): machine-docs/{status}, machine-docs/BACKLOG-{phase_id}.md, machine-docs/REVIEW-{phase_id}.md, machine-docs/JOURNAL-{phase_id}.md. machine-docs/DECISIONS.md is shared (append-only).
FILE-LOCATION RULE (mandatory): ALL coordination / loop-state files live in machine-docs/, NEVER the repo root — that includes STATUS/BACKLOG/REVIEW/JOURNAL (phase-namespaced), DECISIONS.md, and the ADVERSARY-INBOX.md / BUILDER-INBOX.md side-channels. If you ever find one at the root, git mv it into machine-docs/.
"Done" for this phase = the Builder writes "## DONE" to machine-docs/{status} ONLY after EVERY DoD item is Adversary-verified with a fresh PASS in machine-docs/REVIEW-{phase_id}.md (handshake below).
Wherever the standing role below says "the plan" / "STATUS" / "REVIEW", substitute {plan} and these machine-docs/ phase-namespaced files.
=== standing role & rules ===

@ -0,0 +1,27 @@
# Builder-solo example — no Adversary (self-verification baseline)
A single **Builder** agent, same task spec as [`../builder-adversary`](../builder-adversary/), but
with **no Adversary**: the Builder builds *and* verifies its own work, then self-certifies `## DONE`.
No `claim(`/`review(` handoff — there's nothing to hand off to.
This is the **control** for the AI-as-adversary design. Comparing it against `builder-adversary` on
the same task answers two things:
- **Cost:** how much of a run's tokens is the independent Adversary? (In the loop-pair runs the
Adversary is ~45–53% of the total; this variant removes that.)
- **Quality:** does an independent cold verifier catch things a self-checking builder misses? Self-
certification has an obvious failure mode — the same agent that wrote the bug decides whether it's
a bug. This variant measures what you give up by dropping the second pair of eyes.
The Builder's role prompt keeps the same verification *rigor* (run every DoD check, try to break it,
paste observed output, no self-rubber-stamping) — the only thing removed is the **independent**
adversary. So the comparison is "independent verification vs self-verification," not "verification vs
none."
```bash
python3 ../../agents.py status --config agents.toml
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
```
The `agent-orchestrator-benchmark` repo runs this head-to-head with the other variants on the same
multi-phase task and reports tokens + the efficiency ratios.

@ -0,0 +1,68 @@
# examples/builder-solo — a single Builder, NO Adversary (self-verification baseline).
#
# Same pattern + same task spec as ../builder-adversary, but there is only ONE agent: the Builder
# builds AND verifies its own work, then self-certifies "## DONE". This is the control for measuring
# what the independent AI Adversary actually costs (its tokens) and buys (independent cold
# verification). No claim/review handoff — nothing to hand off to.
#
# python3 ../../agents.py status --config agents.toml
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
[watchdog]
signal_interval = 30
heavy_interval = 300
limit_probe_fallback = 300
limit_reset_slack = 45
stall_grace = 180
[defaults]
session_prefix = "solo-"
log_dir = ".ao-state"
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
model = "claude-sonnet-4-6"
watch = "heal"
[backend.claude]
bin = "claude"
flags = "--dangerously-skip-permissions"
remote_control = true
supports_resume = true
prompt_delivery = "arg"
process_name = "claude"
submit_key = "Enter"
stall_idle = 300
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
[backend.demo]
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
prompt_delivery = "exec"
# The lone builder — builds and self-verifies.
[[agent]]
name = "builder" # tmux session: solo-builder
kind = "loop"
role = "builder" # kickoff = prompts/kickoff.md (per phase) + prompts/builder.md
dir = "./work"
watch = "heal+stall"
[[service]]
name = "cleanlogs"
command = "python3 ../../agent-log.py follow-all"
dir = "."
# Phase machine. No handoff (single agent); the watchdog auto-advances when the builder writes
# "## DONE" to the phase status file (read from handoff.repo's state_subdir).
[loop]
state_file = "phase-idx"
resume_phase = true
auto_advance = true
done_marker = "## DONE"
kickoff_template = "prompts/kickoff.md"
roles_dir = "prompts"
handoff = { repo = "./work", state_subdir = "machine-docs" }
phases = [
{ id = "wc", plan = "plans/wc.md", status = "STATUS-wc.md" },
{ id = "json", plan = "plans/json.md", status = "STATUS-json.md" },
]

@ -0,0 +1,32 @@
# Phase `json` — machine-readable output
**Mission.** Extend the `wc.py` from the previous phase with a `--json` mode, without regressing any
`wc`-phase behaviour. Single source of truth for this phase.
(The phase config gives the Builder `claude-opus-4-8` for this phase — an example of a per-phase
model override; the Adversary stays on the default model.)
## Definition of Done
- **D1 — json output.** `python wc.py --json FILE` prints a single JSON object:
`{"lines": N, "words": N, "chars": N, "file": "FILE"}` (valid JSON, parseable by `json.loads`).
With stdin (no FILE), `"file"` is `null`.
- **D2 — composes with flags.** `--json` honours `-l/-w/-c`: only the requested counts appear as keys
(plus `file`). E.g. `wc.py --json -l FILE` → `{"lines": N, "file": "FILE"}`.
- **D3 — no regression.** Every `wc`-phase gate (D1–D4 there) still passes unchanged.
- **D4 — tests green.** `test_wc.py` is extended for the JSON cases and `pytest -q` is all-green.
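One way D1 and D2 could fit together is sketched below in Python. This is an illustration, not the required implementation: `build_payload` and its arguments are hypothetical names, and the only behaviour taken from the DoD is the key set, the key order (lines, words, chars), and `null` for stdin.

```python
import json

ORDER = ("lines", "words", "chars")  # output order stated in the wc phase


def build_payload(counts, selected, file):
    """Serialise the counts as one JSON object.

    counts:   dict with all three counts
    selected: set of keys requested via -l/-w/-c (empty set means "all")
    file:     the FILE argument, or None when reading stdin
    """
    keys = [k for k in ORDER if k in selected] if selected else list(ORDER)
    payload = {k: counts[k] for k in keys}
    payload["file"] = file  # None is serialised as JSON null
    return json.dumps(payload)
```

With counts of 2 lines, 5 words, and 10 chars, `build_payload(counts, {"lines"}, "/tmp/f.txt")` produces `{"lines": 2, "file": "/tmp/f.txt"}`, which is the shape D2 asks for.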
## How the Adversary verifies (cold)
```bash
pytest -q # D4 + D3 regression
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py --json /tmp/f.txt | python -c 'import sys,json; d=json.load(sys.stdin); \
assert d=={"lines":2,"words":5,"chars":10,"file":"/tmp/f.txt"}, d; print("ok")' # D1
python wc.py --json -l /tmp/f.txt # D2: expect {"lines": 2, "file": "/tmp/f.txt"}
```
The Builder restates the exact commands, expected JSON, and commit sha in
`machine-docs/STATUS-json.md`. When every DoD item has a fresh PASS in `machine-docs/REVIEW-json.md`
and there is no `## VETO`, the Builder writes `## DONE` to `STATUS-json.md` — this is the last phase,
so the watchdog then fires the one-shot `reporter` (see `agents.toml` `[loop].on_complete`).

@ -0,0 +1,43 @@
# Phase `wc` — a word-count CLI
**Mission.** Build a small, dependency-free `wc` clone in Python: a script `wc.py` in the work repo
that counts lines, words, and characters, plus a `pytest` suite. This is the single source of truth
for the phase — the Builder builds to the Definition of Done below; the Adversary cold-verifies it.
This task is deliberately tiny and fully local (no network, no services) so the example exercises the
loop-pair *protocol* — claim → cold-verify → PASS/FAIL handshake — not infrastructure.
## Definition of Done
Each Dn is an independent gate. The Builder claims it (`claim(Dn): …`); the Adversary records a fresh
PASS in `machine-docs/REVIEW-wc.md` after re-running the check from its own clone.
- **D1 — default output.** `python wc.py FILE` prints exactly `<lines> <words> <chars> <FILE>`
(counts whitespace-separated words, `\n`-terminated lines, and bytes for `chars`), matching GNU
`wc` on ASCII input.
- **D2 — flags.** `-l`, `-w`, `-c` restrict the output to that single count (e.g. `wc.py -l FILE`
prints `<lines> <FILE>`). Flags may combine; output order is lines, words, chars.
- **D3 — stdin.** With no FILE argument, `wc.py` reads stdin and prints the counts with no filename.
- **D4 — tests green.** A `test_wc.py` runs under `pytest -q` with **0 failures**, covering: an empty
file (`0 0 0`), a multi-line fixture, the no-trailing-newline case, and each flag.
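The counting rules in D1 can be sketched in a few lines of Python. This is a minimal illustration of the definitions (newline count, whitespace-separated words, byte length), with hypothetical helper names `count` and `format_line`; it is not the `wc.py` the Builder is expected to deliver.

```python
def count(data: bytes):
    """Return (lines, words, chars) for raw input bytes."""
    lines = data.count(b"\n")   # only "\n"-terminated lines are counted
    words = len(data.split())   # split on runs of ASCII whitespace
    chars = len(data)           # "chars" means bytes in this phase
    return lines, words, chars


def format_line(counts, name=None):
    """Render the default output; the filename is omitted for stdin."""
    fields = [str(n) for n in counts]
    if name is not None:
        fields.append(name)
    return " ".join(fields)
```

Note the no-trailing-newline case from D4: `count(b"a b")` gives 0 lines, 2 words, 3 chars, because the final line is not `\n`-terminated.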
## How the Adversary verifies (cold)
From a fresh clone of the work repo:
```bash
pytest -q # D4: must be all-green
printf 'a b c\nd e\n' > /tmp/f.txt
python wc.py /tmp/f.txt # D1: expect "2 5 10 /tmp/f.txt"
python wc.py -l /tmp/f.txt # D2: expect "2 /tmp/f.txt"
printf 'a b c\nd e\n' | python wc.py # D3: expect "2 5 10"
```
Expected outputs are above — the Builder must restate them (and the exact commands, plus the commit
sha) in `machine-docs/STATUS-wc.md` so the Adversary can re-run without reading the Builder's
reasoning. Any mismatch is a FAIL with repro steps in `machine-docs/REVIEW-wc.md`.
## Out of scope (defer to a later phase or DEFERRED.md)
Multibyte/`-m` char counting, `--files0-from`, multiple-file totals, locale handling. JSON output is
the next phase (`plans/json.md`).

@ -0,0 +1,15 @@
You are the **Builder** — and the ONLY agent. There is no Adversary. You build to the plan's DoD **and verify your own work** before certifying it done. Read the phase plan (the SSOT) and build to its DoD.
Loop: run `/loop` (no interval), one unit of work per wake. Liveness (watchdog-enforced): cap every wait at 10 min; before going idle your LAST output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>`; compact at ~80% context.
Git: `pull --rebase`, smallest change, commit, push; never `--force`. Prefix commits conventionally (`feat/fix/test/status/…`).
**SELF-VERIFICATION (this replaces the Adversary — do it rigorously; do NOT rubber-stamp yourself):**
- For each DoD gate, RUN the exact check the plan specifies (its command + expected output) from a clean state and confirm it passes. Don't assume — execute it and read the actual output.
- Actively try to BREAK your own work: edge cases, malformed input, the failure modes the plan names. A gate you can break is not done.
- Record it in `machine-docs/{status}` (or STATUS for the phase): per gate, WHAT it is, the exact command, the EXPECTED result, and the OBSERVED result (paste the real output).
- Never weaken, skip, or delete a test to make a run pass. A red test is information.
Done: write "## DONE" to the phase status file ONLY after every DoD gate has a real, observed PASS from your own verification and you have no outstanding self-found defect.
Begin: read the plan, then enter the loop.

@ -0,0 +1,7 @@
*** PHASE {phase_id} ***
Plan (this phase's single source of truth): {plan} — read it fully now; it defines the mission and the Definition of Done (DoD).
You are the ONLY agent — there is no separate Adversary. You BUILD and you VERIFY YOUR OWN WORK.
Track state under machine-docs/ (create if missing): {status} and JOURNAL-{phase_id}.md.
Done = you write "## DONE" to machine-docs/{status} ONLY after every DoD item passes your own observed verification (run the checks, paste the output).
=== role ===

@ -0,0 +1,84 @@
# 🐍 Snake pit
> the "snake pit" agent orchestrator. each agent is a snake. you toss food (tasks) into the pit.
> agents can devour tasks, gradually digest them, regurgitate them whole or in broken / digested
> parts, excrete waste (chat logs, debug traces, &c), &c. obviously some specialist agents are on
> cleanup duty
>
> — [@ponder.ooo](https://bsky.app/profile/ponder.ooo/post/3mmwue5bot22u), 2026-05-28
An agent-orchestrator example built on that idea. Where the sibling `builder-adversary` example is a
**phase machine** (an ordered plan, two roles handing off), the snake pit is a **worker pool over a
shared queue**: identical workers pull tasks from a pit, plus specialist species for planning and
cleanup. Same harness, completely different topology — that's the point of having both.
## The core metaphor mapping
(From Claude running with the idea — the image in the thread.)
| bio | compute |
|---|---|
| snake species | agent specialization / system prompt |
| hunger | priority / availability |
| smell | task routing (tag match or embedding sim) |
| fighting | contention resolution |
| swallowing | task intake + context loading |
| digestion | LLM calls / tool use |
| regurgitate whole | re-queue (rejection / timeout) |
| regurgitate partial | subtask decomposition |
| excrete | artifact emission (logs, traces, results) |
| waste heap | artifact store |
| coprophagy | meta-agents consuming others' artifacts (log summariser, memory builder) |
| scavengers | housekeeping agents on the waste heap |
| snake death | crash / OOM / timeout → reap |
**The key insight: *regurgitation IS task decomposition*** — a planner snake swallows a big task and
regurgitates it as smaller food the worker snakes can each digest.
## How it maps onto agent-orchestrator
- **The pit = a filesystem queue** (`pit/`). Snakes coordinate ONLY through it and claim work by
**atomic `mv`**, so two snakes never devour the same food. Full layout + protocol: `pit/README.md`.
- **Snake species = agents with different prompts** (the "agent specialization" row):
- **keeper** (zookeeper, persistent) — tosses food in, keeps the pit healthy, reports.
- **planner** (persistent) — *regurgitation = decomposition*: eats big food, regurgitates smaller
food for the workers (`prompts/planner.md`).
- **snake-1..3** (persistent worker pool) — devour → digest → regurgitate → excrete
(`prompts/snake.md`). Scale the pool by copying a block.
- **cleanup** (persistent) — the **scavenger** on the waste heap; also does light **coprophagy**
(composts logs into a digest) and reaps food abandoned by a snake that died
(`prompts/cleanup.md`).
- **hunger / smell / fighting** — emergent from the loop: an idle snake naps (low hunger), picks the
food it can do (smell), and the atomic-`mv` claim resolves contention (fighting).
- **snake death = crash / timeout → reap** — the watchdog heals a dead snake (`watch = "heal"`); the
cleanup snake reclaims whatever food it died holding.
## Run it
Needs `claude` on `PATH`. From this directory:
```bash
python3 ../../agents.py status --config agents.toml # read-only: what would run
python3 ../../agents.py up --config agents.toml # keeper + planner + 3 snakes + cleanup + watchdog
python3 ../../agents.py logs snake-1 --config agents.toml
python3 ../../agents.py down --config agents.toml
```
A sample piece of food (`pit/food/food-0001-reverse-string.md`) is already in the pit, so the snakes
have something to eat on first `up`. Toss more by writing `pit/food/food-<id>-<slug>.md` (schema in
`pit/README.md`) — or ask the keeper to.
To watch the **mechanics** without an agent CLI, set `defaults.backend = "demo"` in `agents.toml`
(the demo backend just idles) and run `up` / `status` / `down`.
## Extending it
- **More workers** → copy a `snake-N` block in `agents.toml`.
- **A new species** → add an `[[agent]]` with its own `prompts/<species>.md` (e.g. a **coprophagy**
meta-agent that builds long-term memory from the waste heap, distinct from the scavenger).
- **Smarter routing** ("smell") → give food `tags:` and have snakes prefer matching tags.
- **Real coordination across hosts** → back the pit with a git repo instead of a local dir and use
the watchdog's `handoff` inbox pings (see the `builder-adversary` example).
This example carries **no** project-orchestrator/fleet metadata — like any project it can be run by
hand and has no idea a fleet exists.

@ -0,0 +1,126 @@
# examples/snakepit — the "snake pit" agent orchestrator.
#
# Based on @ponder.ooo's idea (bsky, 2026-05-28): "each agent is a snake. you toss food (tasks) into
# the pit. agents can devour tasks, gradually digest them, regurgitate them whole or in broken /
# digested parts, excrete waste (chat logs, debug traces, &c). obviously some specialist agents are
# on cleanup duty."
#
# Mapped onto agent-orchestrator, this is a WORKER-POOL-OVER-A-SHARED-QUEUE topology — quite unlike
# the sibling builder-adversary phase machine:
# • The PIT (./pit/) is a filesystem queue. Snakes claim work by ATOMIC `mv` (mv within one
# filesystem is atomic, so two snakes never devour the same food).
# • SNAKES (snake-1..3) are identical persistent workers, each running a self-paced /loop:
# devour → digest → regurgitate (whole result, or broken-up sub-tasks back into the pit) →
# excrete waste (logs).
# • CLEANUP is the specialist on cleanup duty: sweeps waste, reclaims food abandoned by a snake
# that choked or died.
# • KEEPER (the zookeeper) tosses food in and keeps the pit healthy.
# There is no [loop] phase machine here — no kind="loop" agents. See pit/README.md for the protocol.
#
# Run it by hand (status starts nothing):
# python3 ../../agents.py status --config agents.toml
# python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
# python3 ../../agents.py down --config agents.toml
# Mechanics-only (no agent CLI): set defaults.backend = "demo".
# ─────────────────────────── global watchdog cadence ───────────────────────────
[watchdog]
signal_interval = 30
heavy_interval = 300
limit_probe_fallback = 300
limit_reset_slack = 45
stall_grace = 180
# ─────────────────────────── defaults inherited by every agent ───────────────────────────
[defaults]
session_prefix = "snakepit-" # REQUIRED — sessions: snakepit-snake-1, snakepit-keeper, …
log_dir = ".ao-state" # REQUIRED — logs + state/, resolved relative to this file
backend = "claude" # set to "demo" for a dependency-free mechanics-only run
model = "claude-sonnet-4-6"
watch = "heal" # keep every snake alive/healed; they self-pace and nap when the pit is empty
# ─────────────────────────── backends (declared as data) ───────────────────────────
[backend.claude]
bin = "claude"
flags = "--dangerously-skip-permissions"
remote_control = true
supports_resume = true
prompt_delivery = "arg"
process_name = "claude"
submit_key = "Enter"
stall_idle = 300
active_re = "esc to interrupt|Running tool|⠇|⠙|· \\d+"
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
fatal_re = "redacted_thinking|blocks cannot be modified|cannot be modified"
[backend.demo] # dependency-free: a shell that just idles
bin = "echo '[demo] {session} up (kickoff: {kickoff})'; exec sleep 1000000"
prompt_delivery = "exec"
# ─────────────────────────── the keeper (zookeeper / supervisor) ───────────────────────────
[[agent]]
name = "keeper" # tmux session: snakepit-keeper
kind = "persistent"
model = "claude-opus-4-8"
resume = true
watch = "heal"
prompt = """
You are the KEEPER of the snake pit (its zookeeper). On startup: read pit/README.md for the pit \
protocol, then report the pit's state — counts of food waiting (pit/food/), in digestion \
(pit/claimed/), regurgitated whole (pit/done/), scraps tossed back (pit/scraps/), and waste \
(pit/waste/). Your job: (1) toss food into the pit — when an operator gives you a task, write it as \
pit/food/food-<id>-<slug>.md per the schema in pit/README.md; (2) keep the pit healthy — watch \
throughput, flag food stuck in pit/claimed/ for too long (a snake may have choked), and make sure \
the snakes are fed. Stay available; report when asked."""
# Optional periodic survey of the pit (uncomment to have the watchdog wake the keeper on a timer):
# wake = { interval = 1800, prompt_file = "prompts/keeper-survey.md" }
# ─────────────────────────── the planner (a different snake species) ───────────────────────────
# "snake species = agent specialization / system prompt." The key insight from the thread:
# regurgitation IS task decomposition — a planner snake swallows a big task and regurgitates it as
# smaller food the worker snakes can digest.
[[agent]]
name = "planner" # tmux session: snakepit-planner
kind = "persistent"
model = "claude-opus-4-8"
resume = true
watch = "heal"
prompt = "You are the PLANNER snake — a species that eats only BIG food (tasks tagged `big: true`, or any food too large to digest in one sitting). Read prompts/planner.md and pit/README.md, then loop: devour big food from pit/food/, and regurgitate it IN PARTS — a set of smaller, self-contained food-* items tossed back into pit/food/ for the worker snakes — then remove the big item. Regurgitation IS task decomposition."
# ─────────────────────────── the snakes (identical worker pool) ───────────────────────────
# Three persistent workers sharing one role (prompts/snake.md); each knows its own snake-id from its
# inline prompt and uses it to claim food. Add more snakes by copying a block and bumping the id.
[[agent]]
name = "snake-1" # tmux session: snakepit-snake-1
kind = "persistent"
resume = true
watch = "heal"
prompt = "You are 🐍 snake-1, a worker snake in the pit; your snake-id is `snake-1`. Read prompts/snake.md for your full role and the pit protocol, then begin your self-paced loop — devour food from pit/food/, digest it, regurgitate the result, excrete your waste."
[[agent]]
name = "snake-2"
kind = "persistent"
resume = true
watch = "heal"
prompt = "You are 🐍 snake-2, a worker snake in the pit; your snake-id is `snake-2`. Read prompts/snake.md for your full role and the pit protocol, then begin your self-paced loop — devour food from pit/food/, digest it, regurgitate the result, excrete your waste."
[[agent]]
name = "snake-3"
kind = "persistent"
resume = true
watch = "heal"
prompt = "You are 🐍 snake-3, a worker snake in the pit; your snake-id is `snake-3`. Read prompts/snake.md for your full role and the pit protocol, then begin your self-paced loop — devour food from pit/food/, digest it, regurgitate the result, excrete your waste."
# ─────────────────────────── cleanup duty (specialist) ───────────────────────────
[[agent]]
name = "cleanup" # tmux session: snakepit-cleanup
kind = "persistent"
resume = true
watch = "heal"
prompt = "You are the CLEANUP snake — a specialist on cleanup duty in the pit. Read prompts/cleanup.md for your full role, then begin your self-paced loop: sweep waste from pit/waste/, and reclaim food abandoned in pit/claimed/ by a snake that choked or died (toss it back to pit/food/)."
# Non-AI helper: render the snakes' tmux transcripts into clean logs.
[[service]]
name = "cleanlogs" # tmux session: snakepit-cleanlogs
command = "python3 ../../agent-log.py follow-all"
dir = "."

@ -0,0 +1,49 @@
# The pit — a filesystem task queue
The pit is just directories. Snakes coordinate entirely through atomic `mv` between them — moving a
file within one filesystem is atomic, so two snakes can never devour the same food.
```
pit/
food/ the queue: tasks waiting to be eaten (food-<id>-<slug>.md)
claimed/ in digestion: a snake is working this one (<snake-id>.food-<id>-<slug>.md)
done/ regurgitated WHOLE: a finished result (food-<id>-<slug>.result.md)
scraps/ regurgitated in PARTS: notes/leftovers (anything; informational)
waste/ excreted waste: chat logs, debug traces (<snake-id>-<ts>.log)
```
> Sub-tasks ("broken / digested parts") are regurgitated back into **`food/`** as new food items, so
> any snake can devour them. `scraps/` is for non-actionable leftovers a snake wants to keep around.
## Food schema (`pit/food/food-<id>-<slug>.md`)
```markdown
# food-0007-reverse-string
- **task:** Implement a `reverse(s)` function in scraps/reverse.py and a test that proves it.
- **done-when:** `python -m pytest scraps/test_reverse.py -q` is green.
- **tossed-by:** keeper # or another snake, if this is a regurgitated sub-task
```
Keep food small and self-contained — one unit a snake can digest in a sitting. If a task is too big,
a snake regurgitates it as several smaller food items.
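For tooling that needs to read food files, the schema above is simple enough to parse with a few lines of Python. The sketch below is an assumption-laden illustration: `parse_food` is a hypothetical helper, it assumes each field starts with `- **key:**`, and it assumes wrapped continuation lines are indented by at least two spaces. Values are kept verbatim.

```python
import re

# One field per bullet: "- **task:** ...", "- **done-when:** ...", "- **tossed-by:** ..."
FIELD_RE = re.compile(r"^- \*\*(?P<key>[a-z-]+):\*\*\s*(?P<value>.*)$")


def parse_food(text):
    """Parse one food file into a dict with an "id" plus its bullet fields."""
    food = {"id": None}
    current = None  # key of the field a continuation line would extend
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("# ") and food["id"] is None:
            food["id"] = line[2:].strip()  # the heading carries the food id
            continue
        match = FIELD_RE.match(line)
        if match:
            current = match.group("key")
            food[current] = match.group("value").strip()
        elif current and line.startswith("  ") and line.strip():
            food[current] += " " + line.strip()  # indented continuation line
        else:
            current = None  # blank line, comment, or anything else ends the field
    return food
```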
## The eating protocol (snakes)
1. **Devour** — atomically claim one item:
`mv pit/food/food-0007-reverse-string.md pit/claimed/snake-2.food-0007-reverse-string.md`
If the `mv` fails, another snake beat you to it — pick a different one.
2. **Digest** — do the work described in the food.
3. **Regurgitate** *whole*: write the result to `pit/done/food-0007-reverse-string.result.md`
and remove the claimed file with a plain `rm` (no `git` involved). *In parts*: if it decomposes,
write new `food-*` items into `pit/food/` for other snakes, and note that in your result.
4. **Excrete** — drop your working log/trace as `pit/waste/snake-2-<ts>.log`; don't let it pile up
in the workspace.
5. **Choke?** On the 3rd identical failure, regurgitate the food back to `pit/food/` (or leave it in
`claimed/` past the cleanup timeout) with a note in `scraps/`, so another snake or the keeper
takes it.
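The devour step can be sketched in Python with `os.rename`, which maps to the same atomic rename that `mv` uses within one filesystem on POSIX systems. The function name `devour` is hypothetical, and the sketch assumes `food/` and `claimed/` already exist. One detail is an assumption of this sketch rather than part of the protocol: a rename keeps the file's old mtime, so the claim is touched afterwards to make any age check on `claimed/` measure time since the claim.

```python
import os
from pathlib import Path


def devour(pit, snake_id):
    """Try to claim one food item; return the claimed path, or None if the pit is empty."""
    food_dir = Path(pit) / "food"
    claimed_dir = Path(pit) / "claimed"
    for food in sorted(food_dir.glob("food-*.md")):
        target = claimed_dir / f"{snake_id}.{food.name}"
        try:
            os.rename(food, target)  # atomic: exactly one snake can win this
        except FileNotFoundError:
            continue  # another snake beat us to it; try the next item
        # Assumption of this sketch: refresh mtime so the claim's age starts now.
        os.utime(target, None)
        return target
    return None
```

A losing snake sees `FileNotFoundError` because the source path no longer exists, which is the programmatic form of "if the `mv` fails, pick a different one".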
## Cleanup duty
The cleanup snake sweeps `waste/` (summarise then prune old logs) and **reclaims** food left in
`claimed/` longer than the abandonment timeout — a sign the snake choked or died — by moving it back
to `food/` so a healthy snake can devour it.
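A reclaim pass could look like the following Python sketch. It is an illustration only: `reclaim_abandoned` is a hypothetical name, the 15-minute default is an assumed value for the abandonment timeout, and age is judged by the claim file's mtime.

```python
import os
import time
from pathlib import Path

ABANDON_AFTER_S = 15 * 60  # assumed abandonment timeout, in seconds


def reclaim_abandoned(pit, now=None, timeout_s=ABANDON_AFTER_S):
    """Move stale claims back to food/; return the restored file names."""
    now = time.time() if now is None else now
    pit = Path(pit)
    reclaimed = []
    for claim in sorted((pit / "claimed").glob("*.food-*")):
        try:
            if now - claim.stat().st_mtime <= timeout_s:
                continue  # still fresh: the snake may just be slow
            # "<snake-id>.food-<id>-<slug>.md" -> "food-<id>-<slug>.md"
            original = claim.name.split(".", 1)[1]
            os.rename(claim, pit / "food" / original)
        except FileNotFoundError:
            continue  # the snake finished and removed its claim in the meantime
        reclaimed.append(original)
    return reclaimed
```

Because the move back is also a single rename, a reclaimed item reappears in `food/` atomically and any healthy snake can claim it by the normal protocol.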

@ -0,0 +1 @@
# in digestion — food a snake has devoured: <snake-id>.food-<id>-<slug>.md (see ../README.md)

@ -0,0 +1 @@
# regurgitated whole — finished results: food-<id>-<slug>.result.md (and planner *.plan.md). See ../README.md

@ -0,0 +1,9 @@
# food-0001-reverse-string
- **task:** Implement a `reverse(s)` function in `pit/scraps/reverse.py` and a pytest that proves it
(empty string, ASCII, and a unicode string round-trip: `reverse(reverse(s)) == s`).
- **done-when:** `python -m pytest pit/scraps/test_reverse.py -q` is green.
- **tossed-by:** keeper
<!-- A sample piece of food so the pit isn't empty on first `up`. Snakes devour it per
pit/README.md: mv it into pit/claimed/<snake-id>.food-0001-reverse-string.md, digest, then
write pit/done/food-0001-reverse-string.result.md. The keeper tosses real food the same way. -->

@ -0,0 +1 @@
# regurgitated in parts — non-actionable leftovers, stuck-notes, reclaims. See ../README.md

@ -0,0 +1 @@
# excreted waste — snake logs/traces: <snake-id>-<ts>.log; cleanup composts these. See ../README.md

@ -0,0 +1,31 @@
You are the **cleanup snake** — a specialist on cleanup duty in the pit. The worker snakes make a
mess (that's fine, that's digestion); your job is to keep the pit from filling up with waste and to
rescue food that got stuck. Read `pit/README.md` for the layout and protocol.
You coordinate ONLY through the pit (the filesystem). Self-paced `/loop`, no interval.
**Each iteration:**
1. **Sweep waste** — in `pit/waste/`, the snakes drop `<snake-id>-<ts>.log` traces. Roll them up:
append a one-line digest of each to `pit/waste/COMPOST.md` (what snake, when, what it worked on),
then delete logs older than ~30 min. Never delete a log you haven't composted. Keep `COMPOST.md`
itself trimmed (summarise + truncate if it grows large).
2. **Reclaim abandoned food** — scan `pit/claimed/`. A claim file (`<snake-id>.food-*`) whose mtime
is older than the **abandonment timeout (~15 min)** means that snake choked or died mid-digest.
Move it back to `pit/food/` (strip the `<snake-id>.` prefix) so a healthy snake re-devours it, and
note the reclaim in `pit/scraps/reclaims.md`. Use mtime to judge age:
`find pit/claimed -type f -mmin +15`.
3. **Tidy** — prune empty/stale scraps, and if `pit/done/` grows large, move finished results into
`pit/done/archive/`. Don't touch `pit/food/` items that are fresh, and never delete a result.
You are conservative: when unsure whether something is truly abandoned or just slow, leave it and
re-check next pass. Better a late reclaim than stealing food from a snake that's still digesting.
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
- **Cap every nap at 10 minutes** (never a single ScheduleWakeup > 600 s).
- **Declare every nap.** FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min
out; `date -u -d '+10 min' +%FT%TZ`). Idle past it → the watchdog reboots you; your state is the
pit on disk.
- **Compact proactively** at ≳80% context.
Begin: read `pit/README.md`, then enter your cleanup loop.

@ -0,0 +1,34 @@
You are the **planner** snake — a specialist species. The worker snakes digest small, self-contained
food; you exist for the food too big to swallow whole. Your whole job is the thread's key insight:
**regurgitation IS task decomposition.** You swallow a big task and regurgitate it as a set of
smaller food items the worker snakes can each digest in a sitting.
Read `pit/README.md` for the layout, the food schema, and the eating protocol. You coordinate ONLY
through the pit; you claim by atomic `mv`.
**Self-paced loop** (`/loop`, no interval). Each iteration:
1. **Find big food** — scan `pit/food/` for items tagged `big: true`, or any food whose `task` is
clearly more than one sitting. Ignore small food — that's the workers' meal.
2. **Devour it** — atomically claim it: `mv pit/food/<f> pit/claimed/planner.<f>`.
3. **Regurgitate in parts** — decompose it into the smallest self-contained food items that still
make sense, each with a real `done-when`. Write them into `pit/food/` as new `food-<id>-<slug>.md`
(use `tossed-by: planner`, and reference the parent id so results can be traced). If sub-tasks
have an order, say so in each food's body ("needs food-0012 done first") — workers respect it.
4. **Record the plan** — write `pit/done/<parent-id>.plan.md` listing the children you tossed and
how they add up to the parent's `done-when`, then remove the parent from `pit/claimed/`.
5. **Excrete** your planning trace to `pit/waste/planner-<ts>.log`.
Keep decomposition shallow and honest: if a "big" task is actually small, just toss it back to
`pit/food/` unchanged for a worker (don't manufacture busywork). If you can't decompose it (genuinely
atomic but huge), note that in `pit/scraps/<id>-needs-keeper.md` and toss it back — the keeper
decides.
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
- **Cap every nap at 10 minutes** (never a single ScheduleWakeup > 600 s).
- **Declare every nap.** FINAL output line MUST be exactly `WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min
out; `date -u -d '+10 min' +%FT%TZ`). Idle past it → the watchdog reboots you; your state is the
pit on disk.
- **Compact proactively** at ≳80% context.
Begin: read `pit/README.md`, then loop — hunt for big food, decompose, regurgitate.
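The nap declaration in the liveness protocol can be sketched as follows. The helper name `waiting_until_line` is hypothetical; the output format is meant to match what `date -u -d '+10 min' +%FT%TZ` prints, and the 600 s cap is the one stated above.

```python
from datetime import datetime, timedelta, timezone

MAX_NAP_SECONDS = 600  # the 10-minute cap from the liveness protocol


def waiting_until_line(nap_seconds, now=None):
    """Return the final output line for a nap, clamping the nap to the 10-minute cap."""
    now = datetime.now(timezone.utc) if now is None else now
    nap = max(0, min(int(nap_seconds), MAX_NAP_SECONDS))
    wake = now + timedelta(seconds=nap)
    # Same shape as `date -u +%FT%TZ`: YYYY-MM-DDTHH:MM:SSZ
    return "WAITING-UNTIL: " + wake.strftime("%Y-%m-%dT%H:%M:%SZ")
```

Clamping inside the helper means a request for a longer nap still produces a marker that is at most 10 minutes out, so the declared time and the actual wake-up cannot drift apart.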


@ -0,0 +1,39 @@
You are a 🐍 **snake** in the pit — one worker in a pool of identical snakes. Your snake-id was given
in your startup line (e.g. `snake-2`); use it in every claim and every log. Read `pit/README.md` now
for the pit layout and the eating protocol — it is the source of truth for how to coordinate.
You do not talk to the other snakes. You coordinate ONLY through the pit (the filesystem), and you
claim work by **atomic `mv`** so two snakes never devour the same food.
**Self-paced loop.** Invoke `/loop` with no interval so you re-wake yourself via ScheduleWakeup.
Each iteration is one feeding:
1. **Look** in `pit/food/` for food. If it's empty, you're not out of luck: just nap (see
liveness) and check again; the keeper will toss more in.
2. **Devour** — atomically claim ONE item:
`mv pit/food/<f> pit/claimed/<your-id>.<f>`. If the `mv` fails, another snake got it; pick
another. Claim exactly one at a time — don't hoard the pit.
3. **Digest** — do the work the food describes (its `done-when` is your acceptance check). Run it;
don't assume. Keep a running trace as you go.
4. **Regurgitate**
- *whole*: write the finished result to `pit/done/<id>.result.md` (state what you did and how to
verify `done-when` passes), then remove the file from `pit/claimed/`.
- *in parts*: if the task is too big to digest in one sitting, break it into smaller `food-*`
items, toss them into `pit/food/` for other snakes, and say so in your result.
5. **Excrete** — write your working log / debug trace to `pit/waste/<your-id>-<ts>.log` (`ts` from
`date -u +%Y%m%dT%H%M%SZ`). Keep your workspace clean; the cleanup snake handles the waste pile.
**If you choke** (3rd identical failure on one food): stop forcing it. Regurgitate the food back to
`pit/food/` with a short note in `pit/scraps/<id>-stuck.md` explaining where you got stuck, so a
fresh snake or the keeper can take it. Don't thrash.
**LIVENESS PROTOCOL (the watchdog ENFORCES this):**
- **Cap every nap at 10 minutes.** Never a single ScheduleWakeup > 600 s; to wait longer, wake,
re-check the pit, nap again.
- **Declare every nap.** Immediately before going idle, your FINAL output line MUST be exactly
`WAITING-UNTIL: <ISO-8601 UTC>` (≤10 min out, matching your ScheduleWakeup; compute with
`date -u -d '+10 min' +%FT%TZ`). Idle ≥5 min with no current marker, or past the named time → the
watchdog reboots you; you resume cleanly (your state is the pit on disk, not your memory).
- **Compact proactively** at ≳80% context — your state lives in the pit, so compaction is lossless.
Begin: read `pit/README.md`, then enter your feeding loop. If the pit is empty, nap and check again.
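The "claim by atomic `mv`" rule can be illustrated with a short Python sketch. The function name `claim_one` is an assumption made for illustration, and it presumes `pit/food/` and `pit/claimed/` live on the same filesystem, which is what makes the rename atomic.

```python
import os
from pathlib import Path


def claim_one(pit, snake_id):
    """Try to claim a single food item; return the claimed path, or None if nothing is left."""
    pit = Path(pit)
    claimed_dir = pit / "claimed"
    claimed_dir.mkdir(exist_ok=True)
    for food in sorted((pit / "food").glob("food-*")):
        target = claimed_dir / f"{snake_id}.{food.name}"
        try:
            os.rename(food, target)  # atomic within one filesystem
        except FileNotFoundError:
            continue  # another snake moved it first; try the next item
        return target  # claim exactly one item per feeding
    return None
```

If two snakes race for the same file, only one rename can succeed; the loser sees the source vanish and moves on, which is the "if the `mv` fails, pick another" behaviour described in step 2.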

75
tests/run.sh Executable file

@ -0,0 +1,75 @@
#!/usr/bin/env bash
# ─────────────────────────────────────────────────────────────────────────────
# agent-orchestrator test runner.
#
# • UNIT tests — always run (pure logic, no agents spawned). A failure fails the suite.
# • CLAUDE smoke — live, run when the `claude` CLI is available; SKIPs otherwise.
# • OPENCODE smoke — live, run when `opencode` + creds are available; SKIPs otherwise.
# • ISOLATION sanity — after the live runs: assert no leftover aotest-* tmux sessions, and that
# the live cc-ci-* sessions are untouched.
#
# Run inside the devShell: nix develop -c ./tests/run.sh
# or simply: ./tests/run.sh (python3 + tmux must be on PATH)
#
# Exit: 0 = all run tests passed (skips are OK); 1 = a unit test or a live smoke FAILED, or a
# leftover aotest-* session was found.
# ─────────────────────────────────────────────────────────────────────────────
set -uo pipefail
HERE="$(cd "$(dirname "$0")" && pwd)"
REPO="$(cd "$HERE/.." && pwd)"
RC=0
UNIT=FAIL CLAUDE=SKIP OPENCODE=SKIP ISO=PASS
echo "######################################################################"
echo "# agent-orchestrator test suite"
echo "######################################################################"
# ── unit tests (always) ───────────────────────────────────────────────────────────
echo; echo ">>> UNIT TESTS"
if python3 -m unittest discover -s "$HERE" -p 'test_*.py' -v; then
UNIT=PASS
else
UNIT=FAIL; RC=1
fi
# helper: run a smoke script, classify its result from its output
run_smoke() {
local label="$1" script="$2"; shift 2
echo; echo ">>> ${label} SMOKE"
local out
out="$(bash "$script" 2>&1)"; local rc=$?
echo "$out"
if echo "$out" | grep -q "BACKEND SMOKE: PASS"; then echo "PASS"; return 0; fi
if [ "$rc" -eq 0 ] && echo "$out" | grep -qE "^SKIP:"; then echo "SKIP"; return 2; fi
echo "FAIL"; return 1
}
# ── live smoke tests (when backends available) ──────────────────────────────────────
run_smoke "CLAUDE" "$HERE/smoke_claude.sh"; case $? in 0) CLAUDE=PASS;; 2) CLAUDE=SKIP;; *) CLAUDE=FAIL; RC=1;; esac
run_smoke "OPENCODE" "$HERE/smoke_opencode.sh"; case $? in 0) OPENCODE=PASS;; 2) OPENCODE=SKIP;; *) OPENCODE=FAIL; RC=1;; esac
# ── isolation sanity ────────────────────────────────────────────────────────────────
echo; echo ">>> ISOLATION SANITY"
if command -v tmux >/dev/null 2>&1; then
leftover="$(tmux ls 2>/dev/null | sed 's/:.*//' | grep '^aotest-' || true)"
if [ -n "$leftover" ]; then
echo " FAIL: leftover aotest-* sessions: $leftover"; ISO=FAIL; RC=1
else
echo " PASS: no leftover aotest-* tmux sessions"
fi
intact=""
for s in cc-ci-orchestrator cc-ci-watchdog cc-ci-assistant3; do
tmux has-session -t "=$s" 2>/dev/null && intact="$intact $s"
done
echo " info: live cc-ci sessions present:${intact:- (none — not a cc-ci host)}"
else
echo " (tmux not on PATH — isolation sanity skipped)"
fi
# ── summary ─────────────────────────────────────────────────────────────────────────
echo; echo "######################################################################"
echo "# SUMMARY: unit=$UNIT claude=$CLAUDE opencode=$OPENCODE isolation=$ISO"
echo "######################################################################"
[ "$RC" -eq 0 ] && echo "ALL RUN TESTS PASSED (skips are OK)" || echo "SUITE FAILED"
exit "$RC"
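The decision order inside `run_smoke` is easy to misread, so here is the same classification as a small Python sketch. The function name `classify_smoke` is hypothetical; the marker strings are the ones the script greps for.

```python
import re


def classify_smoke(output, returncode):
    """Mirror run_smoke's decision order: PASS marker first, then a clean SKIP, else FAIL."""
    if "BACKEND SMOKE: PASS" in output:
        return "PASS"
    # A SKIP only counts when the script also exited 0 and a line starts with "SKIP:".
    if returncode == 0 and re.search(r"^SKIP:", output, flags=re.MULTILINE):
        return "SKIP"
    return "FAIL"
```

Note that PASS is decided by the output marker alone, while SKIP additionally requires a zero exit status; a script that prints `SKIP:` but exits non-zero is classified as FAIL.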

124
tests/smoke_claude.sh Executable file

@ -0,0 +1,124 @@
#!/usr/bin/env bash
# ─────────────────────────────────────────────────────────────────────────────
# Isolated LIVE smoke of the CLAUDE backend, driven entirely through the harness.
#
# Brings a throwaway scratch project (its OWN session_prefix "aotest-c-<pid>-" and a temporary
# log_dir) up through `agents.py up`, on the real `claude` CLI:
# • the harness builds the claude launch command (arg delivery + remote-control + model flag),
# • the agent attaches in tmux (claude TUI alive, not an instant crash),
# • `agents.py status` reports it RUNNING,
# • `agents.py down` tears it down cleanly — no leftover sessions.
#
# SAFE BY CONSTRUCTION — never touches the live cc-ci-* sessions:
# • a unique per-run session prefix (NOT "cc-ci-")
# • cleans up everything it creates on exit (even on Ctrl+C / error).
#
# Usage: bash tests/smoke_claude.sh
# Env: CLAUDE_BIN (default: `claude` on PATH, else ~/.local/bin/claude)
# AOTEST_MODEL (default: claude-haiku-4-5 — a cheap model for the trivial probe)
# Exit: 0 = PASS or SKIP (claude unavailable); 1 = FAIL.
# ─────────────────────────────────────────────────────────────────────────────
set -uo pipefail
HERE="$(cd "$(dirname "$0")" && pwd)"
REPO="$(cd "$HERE/.." && pwd)"
CLAUDE_BIN="${CLAUDE_BIN:-$(command -v claude 2>/dev/null || echo "$HOME/.local/bin/claude")}"
MODEL="${AOTEST_MODEL:-claude-haiku-4-5}"
PREFIX="aotest-c-$$-"
SANDBOX="$(mktemp -d)"
CFG="$SANDBOX/agents.toml"
FAILED=0
pass(){ echo " PASS: $*"; }
fail(){ echo " FAIL: $*"; FAILED=1; }
cleanup(){
local rc=$?
python3 "$REPO/agents.py" --config "$CFG" down probe >/dev/null 2>&1 || true
if command -v tmux >/dev/null 2>&1; then
tmux ls 2>/dev/null | sed 's/:.*//' | grep "^${PREFIX}" | while read -r s; do
tmux kill-session -t "=$s" 2>/dev/null || true
done || true
fi
rm -rf "$SANDBOX"
exit "$rc"
}
trap cleanup EXIT INT TERM
echo "=== claude backend smoke (isolated: prefix=${PREFIX}) ==="
# 0 — preconditions (SKIP, not FAIL, when claude/tmux can't run here)
command -v tmux >/dev/null 2>&1 || { echo "SKIP: tmux not on PATH (run inside 'nix develop')"; exit 0; }
[ -x "$CLAUDE_BIN" ] || command -v "$CLAUDE_BIN" >/dev/null 2>&1 \
|| { echo "SKIP: claude binary not found ($CLAUDE_BIN)"; exit 0; }
# 1 — isolated sandbox config (unique prefix + temp log_dir; one trivial persistent probe)
cat > "$CFG" <<EOF
[defaults]
project_dir = "$REPO"
session_prefix = "$PREFIX"
log_dir = "$SANDBOX/state"
backend = "claude"
model = "$MODEL"
watch = "none"
[backend.claude]
bin = "$CLAUDE_BIN"
flags = "--dangerously-skip-permissions"
remote_control = true
supports_resume = true
prompt_delivery = "arg"
process_name = "claude"
submit_key = "Enter"
stall_idle = 300
active_re = "esc to interrupt|Running tool|bypass permissions"
limit_re = "usage limit|limit reached"
[[agent]]
name = "probe"
kind = "persistent"
prompt = "You are a harness self-test. Reply with the single word READY and then wait silently. Do nothing else."
EOF
# 2 — bring the probe up THROUGH the harness
if ! python3 "$REPO/agents.py" --config "$CFG" up probe; then
fail "agents.py up probe errored"; echo "=== RESULT: FAIL ==="; exit 1
fi
# 3 — session created?
sleep 6
if tmux has-session -t "=${PREFIX}probe" 2>/dev/null; then
cmd=$(tmux display-message -p -t "=${PREFIX}probe:" '#{pane_current_command}' 2>/dev/null)
pass "session ${PREFIX}probe created via agents.py (pane command: ${cmd})"
else
fail "${PREFIX}probe session was not created"; echo "=== RESULT: FAIL ==="; exit 1
fi
# 4 — claude actually attached (TUI alive), not an instant crash
sleep 6
cmd=$(tmux display-message -p -t "=${PREFIX}probe:" '#{pane_current_command}' 2>/dev/null)
pane=$(tmux capture-pane -p -t "=${PREFIX}probe:" 2>/dev/null)
if [ "$cmd" = "claude" ] || echo "$pane" | grep -qiE "esc to interrupt|bypass permissions|READY|claude|welcome"; then
pass "claude TUI attached + alive (driven entirely by agents.py)"
else
fail "no claude TUI in pane (cmd=${cmd}); tail: $(echo "$pane" | grep -vE '^\s*$' | tail -3)"
fi
# 5 — status reports it RUNNING
if python3 "$REPO/agents.py" --config "$CFG" status | grep -E '^\s*probe\b' | grep -q RUNNING; then
pass "agents.py status reports probe RUNNING"
else
fail "agents.py status did not report probe RUNNING"
fi
# 6 — lifecycle: down removes it cleanly
python3 "$REPO/agents.py" --config "$CFG" down probe >/dev/null 2>&1
sleep 2
if tmux has-session -t "=${PREFIX}probe" 2>/dev/null; then
fail "${PREFIX}probe still alive after agents.py down"
else
pass "agents.py down cleanly removed the session"
fi
if [ "$FAILED" = 0 ]; then echo "=== CLAUDE BACKEND SMOKE: PASS ==="; exit 0
else echo "=== CLAUDE BACKEND SMOKE: FAIL ==="; exit 1; fi
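The cleanup handler above selects sessions to kill by piping `tmux ls` through `sed 's/:.*//'` and a prefix `grep`. A rough Python equivalent of that filter, with the hypothetical name `sessions_with_prefix`, shows why a unique per-run prefix keeps the live `cc-ci-*` sessions out of reach:

```python
def sessions_with_prefix(tmux_ls_output, prefix):
    """Return session names from `tmux ls` output that start with the given prefix."""
    names = []
    for line in tmux_ls_output.splitlines():
        name = line.split(":", 1)[0]  # same effect as sed 's/:.*//'
        if name.startswith(prefix):
            names.append(name)
    return names
```

Only names that begin with the run's own prefix are returned, so a session such as `cc-ci-watchdog` can never end up in the kill list.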

156
tests/smoke_opencode.sh Executable file

@ -0,0 +1,156 @@
#!/usr/bin/env bash
# ─────────────────────────────────────────────────────────────────────────────
# Isolated LIVE smoke of the OPENCODE backend, driven entirely through the harness.
#
# Generalizes the cc-ci `test-opencode.sh` isolation pattern onto the agent-orchestrator harness:
# stands up a DEDICATED opencode server on its own port (≠ 4096), then brings a throwaway scratch
# project up through `agents.py up` on the opencode backend:
# • the harness builds the opencode attach command + the post-connect bootstrap ping,
# • the agent attaches to the server (opencode TUI alive),
# • `agents.py status` reports it RUNNING,
# • `agents.py down` tears it down cleanly — server killed, no leftover sessions, port freed.
#
# SAFE BY CONSTRUCTION — never touches the live cc-ci-* sessions or the live opencode server:
# • a unique per-run session prefix (NOT "cc-ci-")
# • its OWN opencode server on AOTEST_OC_PORT (default 4097, never 4096)
# • cleans up everything it creates on exit (even on Ctrl+C / error).
#
# Usage: bash tests/smoke_opencode.sh
# Env: OPENCODE_BIN (default: `opencode` on PATH, else ~/.local/bin/opencode)
# AOTEST_OC_PORT (default 4097 — MUST differ from the live 4096)
# AOTEST_OC_CREDS (default /srv/cc-ci/.testenv — sourced as the backend preamble)
# AOTEST_MODEL (default: opencode's own configured default)
# Exit: 0 = PASS or SKIP (opencode / creds / server unavailable); 1 = FAIL.
# ─────────────────────────────────────────────────────────────────────────────
set -uo pipefail
HERE="$(cd "$(dirname "$0")" && pwd)"
REPO="$(cd "$HERE/.." && pwd)"
OCBIN="${OPENCODE_BIN:-$(command -v opencode 2>/dev/null || echo "$HOME/.local/bin/opencode")}"
PORT="${AOTEST_OC_PORT:-4097}"
SERVER="http://127.0.0.1:${PORT}"
CREDS="${AOTEST_OC_CREDS:-/srv/cc-ci/.testenv}"
MODEL="${AOTEST_MODEL:-}"
PREFIX="aotest-o-$$-"
SANDBOX="$(mktemp -d)"
CFG="$SANDBOX/agents.toml"
SRVLOG="$SANDBOX/server.log"
SERVER_PID=""
FAILED=0
pass(){ echo " PASS: $*"; }
fail(){ echo " FAIL: $*"; FAILED=1; }
cleanup(){
local rc=$?
python3 "$REPO/agents.py" --config "$CFG" down probe >/dev/null 2>&1 || true
if command -v tmux >/dev/null 2>&1; then
tmux ls 2>/dev/null | sed 's/:.*//' | grep "^${PREFIX}" | while read -r s; do
tmux kill-session -t "=$s" 2>/dev/null || true
done || true
fi
# kill the server subshell AND the opencode serve child it forked (the subshell is not the
# listener — target the listener by our unique port so the port is actually freed).
[ -n "$SERVER_PID" ] && kill "$SERVER_PID" 2>/dev/null || true
pkill -f "opencode serve.*--port ${PORT}\b" 2>/dev/null || true
for _ in 1 2 3 4 5; do
ss -ltn 2>/dev/null | grep -q ":${PORT} " || break
sleep 1
done
rm -rf "$SANDBOX"
exit "$rc"
}
trap cleanup EXIT INT TERM
echo "=== opencode backend smoke (isolated: prefix=${PREFIX} port=${PORT}) ==="
# 0 — preconditions (SKIP, not FAIL, when the environment can't run opencode)
command -v tmux >/dev/null 2>&1 || { echo "SKIP: tmux not on PATH (run inside 'nix develop')"; exit 0; }
[ "$PORT" != "4096" ] || { echo "FAIL: refusing port 4096 (the live cc-ci opencode port)"; exit 1; }
[ -x "$OCBIN" ] || command -v "$OCBIN" >/dev/null 2>&1 \
|| { echo "SKIP: opencode binary not found ($OCBIN)"; exit 0; }
[ -f "$CREDS" ] || { echo "SKIP: opencode creds file missing ($CREDS)"; exit 0; }
# 1 — isolated sandbox config (unique prefix + temp log_dir + dedicated server)
cat > "$CFG" <<EOF
[defaults]
project_dir = "$REPO"
session_prefix = "$PREFIX"
log_dir = "$SANDBOX/state"
backend = "opencode"
model = "$MODEL"
watch = "none"
[backend.opencode]
bin = "$OCBIN"
attach = "{bin} attach {server} --dir {dir}"
server = "$SERVER"
supports_resume = false
prompt_delivery = "ping"
process_name = "opencode"
footer_ui = true
log_grace = 180
connect_delay = 12
submit_key = "C-m"
preamble = "set -a; . $CREDS; set +a"
stall_idle = 900
active_re = "esc interrupt|thinking|inferring|running tool|tool call|preparing patch|reading|searching|working"
limit_re = "usage limit|limit reached"
[[agent]]
name = "probe"
kind = "persistent"
prompt = "You are a harness self-test. Reply with the single word READY and then wait silently. Do nothing else."
EOF
# 2 — bring up a dedicated opencode server on our own port
( set -a; . "$CREDS"; set +a; NO_COLOR=1 "$OCBIN" serve --hostname 127.0.0.1 --port "$PORT" ) >"$SRVLOG" 2>&1 &
SERVER_PID=$!
for _ in $(seq 1 30); do ss -ltn 2>/dev/null | grep -q ":${PORT} " && break; sleep 1; done
if ! ss -ltn 2>/dev/null | grep -q ":${PORT} "; then
echo "SKIP: opencode server did not come up on :${PORT} (see ${SRVLOG})"; exit 0
fi
pass "dedicated opencode server listening on :${PORT}"
# 3 — bring the probe up THROUGH the harness (attaches to OUR server)
if ! python3 "$REPO/agents.py" --config "$CFG" up probe; then
fail "agents.py up probe errored"; echo "=== RESULT: FAIL ==="; exit 1
fi
# 4 — session created?
sleep 4
if tmux has-session -t "=${PREFIX}probe" 2>/dev/null; then
cmd=$(tmux display-message -p -t "=${PREFIX}probe:" '#{pane_current_command}' 2>/dev/null)
pass "session ${PREFIX}probe created via agents.py (pane command: ${cmd})"
else
fail "${PREFIX}probe session was not created"; echo "=== RESULT: FAIL ==="; exit 1
fi
# 5 — opencode TUI attached + alive, not an instant crash
sleep 12
pane=$(tmux capture-pane -p -t "=${PREFIX}probe:" 2>/dev/null)
if echo "$pane" | grep -qiE "opencode|build ·|gpt|claude|READY|esc interrupt|ctrl\+p|ctrl\+"; then
pass "opencode TUI attached + alive (driven entirely by agents.py)"
else
fail "no opencode TUI/response in pane; tail: $(echo "$pane" | grep -vE '^\s*$' | tail -3)"
echo " (server log tail:) $(tail -3 "$SRVLOG" 2>/dev/null)"
fi
# 6 — status reports it RUNNING
if python3 "$REPO/agents.py" --config "$CFG" status | grep -E '^\s*probe\b' | grep -q RUNNING; then
pass "agents.py status reports probe RUNNING"
else
fail "agents.py status did not report probe RUNNING"
fi
# 7 — lifecycle: down removes it cleanly
python3 "$REPO/agents.py" --config "$CFG" down probe >/dev/null 2>&1
sleep 2
if tmux has-session -t "=${PREFIX}probe" 2>/dev/null; then
fail "${PREFIX}probe still alive after agents.py down"
else
pass "agents.py down cleanly removed the session"
fi
if [ "$FAILED" = 0 ]; then echo "=== OPENCODE BACKEND SMOKE: PASS ==="; exit 0
else echo "=== OPENCODE BACKEND SMOKE: FAIL ==="; exit 1; fi
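Step 2 of the script waits up to 30 seconds for the dedicated server to start listening. The sketch below shows the same bounded-polling idea in Python. It is not what the script does internally: the script inspects `ss -ltn` output, whereas this version attempts a TCP connect, and the name `wait_for_port` and its parameters are assumptions.

```python
import socket
import time


def wait_for_port(port, host="127.0.0.1", attempts=30, delay=1.0):
    """Return True once a TCP connect to host:port succeeds, False after `attempts` tries."""
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(delay)  # not listening yet; wait and retry
    return False
```

A `False` result corresponds to the script's `SKIP: opencode server did not come up` branch: the environment could not run the backend, which is treated as a skip and not as a test failure.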

526
tests/test_unit.py Executable file

@ -0,0 +1,526 @@
#!/usr/bin/env python3
"""Unit tests for the agent-orchestrator harness (agents.py).
Pure-logic tests — NO agent CLIs spawned, NO live tmux sessions created. Every test builds a
throwaway config + fixture files in a tempdir and exercises the harness functions directly.
The one function that would spawn sessions (phase_advance_check → start/stop_loops) is tested
with those two hooks monkeypatched to recorders, so the phase-machine *logic* is covered without
launching anything.
Run: python3 -m unittest tests.test_unit (from repo root)
or python3 tests/test_unit.py
"""
import os
import sys
import time
import textwrap
import tempfile
import shutil
import unittest
from datetime import datetime, timedelta
from pathlib import Path
REPO_ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO_ROOT))
import agents # noqa: E402
# ── shared fixture config ────────────────────────────────────────────────────────
BASE_TOML = r"""
[watchdog]
signal_interval = 30
heavy_interval = 300
limit_probe_fallback = 300
limit_reset_slack = 45
stall_grace = 180
[defaults]
session_prefix = "aotest-ut-"
log_dir = "state"
backend = "claude"
model = "claude-sonnet-4-6"
watch = "none"
[backend.claude]
bin = "claude"
flags = "--dangerously-skip-permissions"
remote_control = true
supports_resume = true
prompt_delivery = "arg"
process_name = "claude"
submit_key = "Enter"
stall_idle = 300
active_re = "esc to interrupt|Running tool|\\u00b7 \\d+"
limit_re = "spend limit|usage limit|limit reached|reached your .*limit|out of (credits|tokens)"
fatal_re = "redacted_thinking|blocks cannot be modified"
[backend.opencode]
bin = "opencode"
attach = "{bin} attach {server} --dir {dir}"
server = "http://127.0.0.1:4096"
supports_resume = false
prompt_delivery = "ping"
process_name = "opencode"
footer_ui = true
log_grace = 180
connect_delay = 12
submit_key = "C-m"
stall_idle = 900
active_re = "esc interrupt|thinking|inferring|running tool|tool call|preparing patch|reading|searching"
limit_re = "usage limit|limit reached"
[backend.demo]
bin = "echo up; exec sleep 100000"
prompt_delivery = "exec"
[[agent]]
name = "builder"
kind = "loop"
role = "builder"
backend = "demo"
[[agent]]
name = "adversary"
kind = "loop"
role = "adversary"
backend = "demo"
[[agent]]
name = "cl"
kind = "persistent"
backend = "claude"
prompt = "hi"
[[agent]]
name = "oc"
kind = "persistent"
backend = "opencode"
prompt = "hi"
[[agent]]
name = "custom"
kind = "persistent"
session = "explicit-session"
model = "override-model"
dir = "/abs/somewhere"
backend = "demo"
prompt = "x"
[[service]]
name = "svc"
command = "sleep 1"
[loop]
state_file = "phase-idx"
resume_phase = true
auto_advance = true
done_marker = "## DONE"
kickoff_template = "prompts/kickoff.md"
roles_dir = "prompts"
handoff = { repo = ".", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], state_subdir = "machine-docs" }
phases = [
{ id = "p1", plan = "PLAN1.md", status = "STATUS-p1.md" },
{ id = "p2", plan = "PLAN2.md", status = "STATUS-p2.md", models = { builder = "opus-x" } },
]
"""
KICKOFF_TMPL = "*** PROJECT PHASE: {phase_id} ***\nPLAN: {plan}\nSTATUS: {status}\nROLE: {role}\n---\n"
BUILDER_PROMPT = "You are the **Builder** agent. (builder role body marker)\n"
ADVERSARY_PROMPT = "You are the **Adversary** agent. (adversary role body marker)\n"
def _make_project(tmp, toml=BASE_TOML):
    """Write a self-contained project (config + prompts + machine-docs) into tmp; return cfg path."""
    root = Path(tmp)
    (root / "prompts").mkdir(parents=True, exist_ok=True)
    (root / "machine-docs").mkdir(parents=True, exist_ok=True)
    (root / "prompts" / "kickoff.md").write_text(KICKOFF_TMPL)
    (root / "prompts" / "builder.md").write_text(BUILDER_PROMPT)
    (root / "prompts" / "adversary.md").write_text(ADVERSARY_PROMPT)
    cfg_path = root / "agents.toml"
    cfg_path.write_text(toml)
    return cfg_path
# ── config loading + defaults merge ────────────────────────────────────────────────
class TestConfigLoad(unittest.TestCase):
    def setUp(self):
        self.tmp = tempfile.mkdtemp(prefix="aotest-ut-")
        self.cfg_path = _make_project(self.tmp)
        self.cfg = agents.load_config(self.cfg_path)

    def tearDown(self):
        shutil.rmtree(self.tmp, ignore_errors=True)

    def test_defaults_merge_into_agents(self):
        b = self.cfg["agents"]["builder"]
        self.assertEqual(b["session_prefix"], "aotest-ut-")
        self.assertEqual(b["watch"], "none")  # from defaults
        self.assertEqual(b["kind"], "loop")  # explicit

    def test_session_name_defaults_to_prefix_plus_name(self):
        self.assertEqual(self.cfg["agents"]["builder"]["session"], "aotest-ut-builder")

    def test_explicit_session_overrides_prefix(self):
        self.assertEqual(self.cfg["agents"]["custom"]["session"], "explicit-session")

    def test_per_agent_override_wins_over_default(self):
        # default model is claude-sonnet-4-6; custom overrides
        self.assertEqual(self.cfg["agents"]["custom"]["model"], "override-model")
        self.assertEqual(self.cfg["agents"]["builder"]["model"], "claude-sonnet-4-6")

    def test_relative_dir_resolved_against_project_root(self):
        # builder has no dir → defaults dir "." → project_dir
        self.assertEqual(self.cfg["agents"]["builder"]["dir"], self.cfg["project_dir"])

    def test_absolute_dir_kept(self):
        self.assertEqual(self.cfg["agents"]["custom"]["dir"], "/abs/somewhere")

    def test_log_dir_and_state_dir_resolved(self):
        self.assertEqual(self.cfg["log_dir"], str(Path(self.cfg["project_dir"]) / "state"))
        self.assertEqual(self.cfg["state_dir"], os.path.join(self.cfg["log_dir"], "state"))
        self.assertTrue(Path(self.cfg["state_dir"]).is_dir())  # created on load

    def test_service_session_named(self):
        self.assertIn("svc", self.cfg["services"])
        self.assertEqual(self.cfg["services"]["svc"]["session"], "aotest-ut-svc")

    def test_backend_of_resolves(self):
        b = agents.backend_of(self.cfg, self.cfg["agents"]["cl"])
        self.assertEqual(b["prompt_delivery"], "arg")
        self.assertEqual(b["submit_key"], "Enter")

    def test_backend_of_unknown_dies(self):
        a = dict(self.cfg["agents"]["cl"])
        a["backend"] = "nope"
        with self.assertRaises(SystemExit):
            agents.backend_of(self.cfg, a)

    def test_missing_session_prefix_dies(self):
        bad = self.tmp + "/bad1"
        p = _make_project(bad, toml='[defaults]\nlog_dir = "state"\n')
        with self.assertRaises(SystemExit):
            agents.load_config(p)

    def test_missing_log_dir_dies(self):
        bad = self.tmp + "/bad2"
        p = _make_project(bad, toml='[defaults]\nsession_prefix = "x-"\n')
        with self.assertRaises(SystemExit):
            agents.load_config(p)

    def test_env_override_model_single_invocation(self):
        os.environ["AGENT_MODEL_cl"] = "env-only-model"
        try:
            cfg2 = agents.load_config(self.cfg_path)
            self.assertEqual(cfg2["agents"]["cl"]["model"], "env-only-model")
        finally:
            del os.environ["AGENT_MODEL_cl"]
        # without the env var the file value stands again
        cfg3 = agents.load_config(self.cfg_path)
        self.assertEqual(cfg3["agents"]["cl"]["model"], "claude-sonnet-4-6")


class TestExampleConfig(unittest.TestCase):
    """The SHIPPED agents.example.toml must parse and define the documented shape."""

    def test_example_config_loads(self):
        ex = REPO_ROOT / "agents.example.toml"
        self.assertTrue(ex.exists(), "agents.example.toml missing from repo")
        cfg = agents.load_config(ex)
        self.assertIn("builder", cfg["agents"])
        self.assertIn("adversary", cfg["agents"])
        for be in ("demo", "claude", "opencode"):
            self.assertIn(be, cfg["backends"], f"backend {be} missing from example")
        self.assertEqual(len(agents.phases(cfg)), 2)
# ── kickoff-template assembly ──────────────────────────────────────────────────────
class TestKickoff(unittest.TestCase):
    def setUp(self):
        self.tmp = tempfile.mkdtemp(prefix="aotest-ut-")
        self.cfg = agents.load_config(_make_project(self.tmp))

    def tearDown(self):
        shutil.rmtree(self.tmp, ignore_errors=True)

    def test_kickoff_renders_slots_and_appends_role(self):
        out = agents.build_loop_kickoff(self.cfg, self.cfg["agents"]["builder"])
        self.assertIn("PROJECT PHASE: p1", out)  # phase_id slot filled (phase idx 0)
        self.assertIn("PLAN: PLAN1.md", out)
        self.assertIn("STATUS: STATUS-p1.md", out)
        self.assertIn("ROLE: builder", out)
        self.assertIn("builder role body marker", out)  # role prompt appended
        self.assertNotIn("{phase_id}", out)  # no unrendered slot
        self.assertNotIn("{role}", out)

    def test_kickoff_picks_correct_role_prompt(self):
        out = agents.build_loop_kickoff(self.cfg, self.cfg["agents"]["adversary"])
        self.assertIn("adversary role body marker", out)
        self.assertNotIn("builder role body marker", out)

    def test_agent_prompt_loop_returns_kickoff(self):
        out = agents.agent_prompt(self.cfg, self.cfg["agents"]["builder"])
        self.assertIn("PROJECT PHASE: p1", out)

    def test_agent_prompt_persistent_returns_inline_prompt(self):
        out = agents.agent_prompt(self.cfg, self.cfg["agents"]["cl"])
        self.assertEqual(out, "hi")

    def test_role_model_phase_override(self):
        # phase p2 overrides builder model to opus-x; advance index to 1
        Path(agents.phase_idx_file(self.cfg)).write_text("1")
        self.assertEqual(agents.role_model(self.cfg, self.cfg["agents"]["builder"]), "opus-x")
        # adversary has no override → its configured/default model
        self.assertEqual(agents.role_model(self.cfg, self.cfg["agents"]["adversary"]),
                         "claude-sonnet-4-6")
# ── phase machine ──────────────────────────────────────────────────────────────────
class TestPhaseMachine(unittest.TestCase):
    def setUp(self):
        self.tmp = tempfile.mkdtemp(prefix="aotest-ut-")
        self.cfg = agents.load_config(_make_project(self.tmp))
        self.md = Path(self.cfg["project_dir"]) / "machine-docs"
        # monkeypatch the session-spawning hooks so the machine logic runs without tmux
        self._orig = (agents.stop_loops, agents.start_loops, agents.handoff_reset)
        self.calls = []
        agents.stop_loops = lambda cfg: self.calls.append("stop")
        agents.start_loops = lambda cfg: self.calls.append("start")
        agents.handoff_reset = lambda: self.calls.append("reset")

    def tearDown(self):
        agents.stop_loops, agents.start_loops, agents.handoff_reset = self._orig
        shutil.rmtree(self.tmp, ignore_errors=True)

    def _status(self, basename, text):
        (self.md / basename).write_text(text)

    def test_phase_done_detects_marker(self):
        self._status("STATUS-p1.md", "header\n## DONE\nall verified PASS\n")
        self.assertTrue(agents.phase_done(self.cfg, "STATUS-p1.md"))

    def test_phase_done_rejects_placeholder_body(self):
        self._status("STATUS-p1.md", "## DONE\nnot yet — written here only when complete\n")
        self.assertFalse(agents.phase_done(self.cfg, "STATUS-p1.md"))

    def test_phase_done_false_when_no_marker(self):
        self._status("STATUS-p1.md", "## In progress\nworking\n")
        self.assertFalse(agents.phase_done(self.cfg, "STATUS-p1.md"))

    def test_phase_done_false_when_file_missing(self):
        self.assertFalse(agents.phase_done(self.cfg, "STATUS-nope.md"))

    def test_cur_idx_reads_state_file(self):
        Path(agents.phase_idx_file(self.cfg)).write_text("1")
        self.assertEqual(agents.cur_idx(self.cfg), 1)

    def test_advance_on_done(self):
        Path(agents.phase_idx_file(self.cfg)).write_text("0")
        self._status("STATUS-p1.md", "## DONE\nverified\n")
        advanced = agents.phase_advance_check(self.cfg)
        self.assertTrue(advanced)
        self.assertEqual(agents.cur_idx(self.cfg), 1)  # moved to p2
        self.assertIn("stop", self.calls)
        self.assertIn("start", self.calls)

    def test_no_advance_when_not_done(self):
        Path(agents.phase_idx_file(self.cfg)).write_text("0")
        self._status("STATUS-p1.md", "## In progress\n")
        self.assertFalse(agents.phase_advance_check(self.cfg))
        self.assertEqual(agents.cur_idx(self.cfg), 0)
        self.assertEqual(self.calls, [])

    def test_sequence_complete_idempotent(self):
        Path(agents.phase_idx_file(self.cfg)).write_text("1")  # last phase
        self._status("STATUS-p2.md", "## DONE\nverified\n")
        marker = Path(self.cfg["log_dir"]) / "SEQUENCE-COMPLETE"
        # first call: completes the sequence
        self.assertTrue(agents.phase_advance_check(self.cfg))
        self.assertTrue(marker.exists())
        self.assertEqual(self.calls.count("stop"), 1)
        # second call: idempotent, no re-stop, returns False
        self.assertFalse(agents.phase_advance_check(self.cfg))
        self.assertEqual(self.calls.count("stop"), 1)

    def test_append_phase_clears_marker_and_resumes(self):
        # simulate "sequence already complete", then a 3rd phase appended to the config
        Path(agents.phase_idx_file(self.cfg)).write_text("1")
        self._status("STATUS-p2.md", "## DONE\nverified\n")
        marker = Path(self.cfg["log_dir"]) / "SEQUENCE-COMPLETE"
        marker.write_text("stale completion\n")
        self.cfg["loop"]["phases"].append(
            {"id": "p3", "plan": "PLAN3.md", "status": "STATUS-p3.md"})
        advanced = agents.phase_advance_check(self.cfg)
        self.assertTrue(advanced)
        self.assertEqual(agents.cur_idx(self.cfg), 2)  # resumed onto p3
        self.assertFalse(marker.exists())  # stale marker cleared
        self.assertIn("start", self.calls)

    def test_custom_done_marker(self):
        self.cfg["loop"]["done_marker"] = "## SHIPPED"
        self._status("STATUS-p1.md", "## SHIPPED\nverified\n")
        self.assertTrue(agents.phase_done(self.cfg, "STATUS-p1.md"))
        self.assertFalse(agents.phase_done(self.cfg, "STATUS-p2.md"))
# ── usage-limit banner reset parsing ───────────────────────────────────────────────
class TestLimitParsing(unittest.TestCase):
    def setUp(self):
        self.tmp = tempfile.mkdtemp(prefix="aotest-ut-")
        self.cfg = agents.load_config(_make_project(self.tmp))

    def tearDown(self):
        shutil.rmtree(self.tmp, ignore_errors=True)

    def test_parse_reset_pm(self):
        ep = agents._parse_reset_epoch("You've hit your limit · resets at 10pm")
        self.assertIsNotNone(ep)
        self.assertEqual(datetime.fromtimestamp(ep).hour, 22)

    def test_parse_reset_am_with_minutes(self):
        ep = agents._parse_reset_epoch("resets 3:30am")
        self.assertIsNotNone(ep)
        dt = datetime.fromtimestamp(ep)
        self.assertEqual((dt.hour, dt.minute), (3, 30))

    def test_parse_reset_12am_is_midnight(self):
        ep = agents._parse_reset_epoch("resets at 12am")
        self.assertEqual(datetime.fromtimestamp(ep).hour, 0)

    def test_parse_reset_invalid_hour_none(self):
        self.assertIsNone(agents._parse_reset_epoch("resets at 25"))

    def test_parse_reset_no_match_none(self):
        self.assertIsNone(agents._parse_reset_epoch("everything is fine here"))

    def test_parse_reset_picks_last_match(self):
        ep = agents._parse_reset_epoch("resets at 9am ... actually resets at 11am")
        self.assertEqual(datetime.fromtimestamp(ep).hour, 11)

    def test_next_limit_until_unparsable_fallback(self):
        now = time.time()
        until, parsed = agents._next_limit_until(self.cfg, "limit reached, no time given", now)
        self.assertFalse(parsed)
        self.assertEqual(int(until), int(now + 300))  # limit_probe_fallback

    def test_next_limit_until_within_window_uses_banner(self):
        now = time.time()
        t = datetime.now() + timedelta(hours=2)
        h12 = t.hour % 12 or 12
        ampm = "am" if t.hour < 12 else "pm"
        banner = f"weekly limit · resets at {h12}:{t.minute:02d}{ampm}"
        until, parsed = agents._next_limit_until(self.cfg, banner, now)
        self.assertTrue(parsed)
        self.assertGreater(until, now)
        self.assertLessEqual(until - now, 6 * 3600 + 60)  # within 6h window (+slack)

    def test_next_limit_until_far_future_falls_back(self):
        now = time.time()
        t = datetime.now() + timedelta(hours=7)  # > 6h window
        h12 = t.hour % 12 or 12
        ampm = "am" if t.hour < 12 else "pm"
        banner = f"limit · resets at {h12}:{t.minute:02d}{ampm}"
        until, parsed = agents._next_limit_until(self.cfg, banner, now)
        self.assertFalse(parsed)
        self.assertEqual(int(until), int(now + 300))


# ── stall / WAITING-UNTIL parsing ──────────────────────────────────────────────────
class TestWaitingUntil(unittest.TestCase):
    def setUp(self):
        self.tmp = tempfile.mkdtemp(prefix="aotest-ut-")
        self.cfg = agents.load_config(_make_project(self.tmp))
        self.claude_agent = self.cfg["agents"]["cl"]  # non-footer backend
        self.oc_agent = self.cfg["agents"]["oc"]  # footer_ui backend

    def tearDown(self):
        shutil.rmtree(self.tmp, ignore_errors=True)

    def test_non_footer_finds_marker_anywhere(self):
        pane = "blah blah\nWAITING-UNTIL: 2030-06-13T12:00:00Z\nmore output after\n"
        ep = agents._parse_waiting_until(self.cfg, self.claude_agent, pane)
        self.assertIsNotNone(ep)
        self.assertEqual(ep, datetime.fromisoformat("2030-06-13T12:00:00+00:00").timestamp())

    def test_non_footer_none_without_marker(self):
        self.assertIsNone(agents._parse_waiting_until(
            self.cfg, self.claude_agent, "just working, no marker"))

    def test_footer_requires_marker_as_last_line(self):
        # marker present but NOT the last non-empty line → ignored for a footer UI
        pane = "WAITING-UNTIL: 2030-06-13T12:00:00Z\n ▣ Build · GPT · 2m 19s\n"
        self.assertIsNone(agents._parse_waiting_until(self.cfg, self.oc_agent, pane))

    def test_footer_honors_marker_when_last_line(self):
        pane = "some work\nWAITING-UNTIL: 2030-06-13T12:00:00Z\n\n"
        ep = agents._parse_waiting_until(self.cfg, self.oc_agent, pane)
        self.assertIsNotNone(ep)

    def test_bad_timestamp_none(self):
        self.assertIsNone(agents._parse_waiting_until(
            self.cfg, self.claude_agent, "WAITING-UNTIL: not-a-time"))


# ── backend activity detectors (claude + opencode footers) ──────────────────────────
class TestActivityDetection(unittest.TestCase):
    def setUp(self):
        self.tmp = tempfile.mkdtemp(prefix="aotest-ut-")
        self.cfg = agents.load_config(_make_project(self.tmp))
        self.claude_agent = self.cfg["agents"]["cl"]
        self.oc_agent = self.cfg["agents"]["oc"]

    def tearDown(self):
        shutil.rmtree(self.tmp, ignore_errors=True)

    # claude: non-footer, active_re matched anywhere in the pane
    def test_claude_active_esc_to_interrupt(self):
        self.assertTrue(agents.pane_active(
            self.cfg, self.claude_agent, "thinking...\n esc to interrupt", use_log=False))

    def test_claude_active_running_tool(self):
        self.assertTrue(agents.pane_active(
            self.cfg, self.claude_agent, "Running tool: Bash", use_log=False))

    def test_claude_active_spinner_dot_count(self):
        self.assertTrue(agents.pane_active(
            self.cfg, self.claude_agent, "Compiling · 137 tokens", use_log=False))

    def test_claude_idle_is_not_active(self):
        self.assertFalse(agents.pane_active(
            self.cfg, self.claude_agent, "Done.\n> ", use_log=False))

    # opencode: footer_ui, so only the bottom rows count as activity
    def test_opencode_active_footer(self):
        pane = "~ Preparing patch...\n ⬝⬝■ esc interrupt 137.6K\n"
        self.assertTrue(agents.pane_active(self.cfg, self.oc_agent, pane, use_log=False))

    def test_opencode_idle_footer_not_active(self):
        pane = " ▣ Build · GPT-5.4 · 2m 19s\n 178.4K (17%) ctrl+p commands\n"
        self.assertFalse(agents.pane_active(self.cfg, self.oc_agent, pane, use_log=False))

    def test_opencode_active_only_at_top_is_ignored(self):
        # active marker far above the bottom 10 lines → a footer UI ignores it
        pane = "running tool now\n" + "\n".join(f"line {i}" for i in range(20)) + \
            "\n ▣ Build · GPT · idle\n"
        self.assertFalse(agents.pane_active(self.cfg, self.oc_agent, pane, use_log=False))

    def test_opencode_log_grace_fallback(self):
        # idle footer, but a freshly-touched session log within the grace window → active
        idle = " ▣ Build · GPT · idle\n 178K (17%) ctrl+p\n"
        logp = agents._session_log_path(self.cfg, self.oc_agent["session"])
        logp.parent.mkdir(parents=True, exist_ok=True)
        logp.write_text("recent activity\n")  # mtime = now
        self.assertTrue(agents.pane_active(self.cfg, self.oc_agent, idle, use_log=True))
        # remove the log → no fallback → idle footer reads as not active
        logp.unlink()
        self.assertFalse(agents.pane_active(self.cfg, self.oc_agent, idle, use_log=True))


if __name__ == "__main__":
    unittest.main(verbosity=2)