# agent-orchestrator

A generic, reusable harness for running and supervising a fleet of AI-agent sessions in **tmux**.
One driver script + one declarative config (`agents.toml`) describe every agent — a Builder /
Adversary loop pair, a persistent supervisor, a one-shot task — and a **watchdog** keeps them
alive, healed, paced, and coordinated. The watchdog reads the same config every tick, so there is
never any env-vs-file drift.

Nothing about any particular project lives in this repo. Paths, the loop **kickoff preamble**, the
**handoff conventions**, and the **on-complete** hook are all supplied by the project's config and
prompt files. A project consumes this repo as a pinned **git submodule** (`engine/`) and keeps its
own config, prompts, state, and tmux namespace — total isolation between projects.

```
agents.py            the driver + watchdog (pure Python stdlib; needs python >= 3.11 for tomllib)
agent-log.py         render claude JSONL transcripts into clean, greppable logs
agents.example.toml  a self-contained 2-agent example project
prompts/             generic role + kickoff templates (builder / adversary / kickoff)
examples/            runnable example projects — the Builder/Adversary variant family, snakepit, …
smoke.sh             bring the example up + tear it down in an isolated sandbox, then clean up
tests/               the test suite — unit tests + isolated live backend smokes + a runner
flake.nix/.lock      a Nix devShell with the runtime deps (python311, tmux, git)
```

---

## Quick start

```bash
nix develop                                   # python311 + tmux + git on PATH (see "Nix" below)

python3 agents.py selftest                    # regression-test the activity detector (no config)
python3 agents.py status --config agents.example.toml   # one table: every agent + the phase
./smoke.sh                                    # prove up/down works end-to-end, isolated + clean

python3 agents.py init myproject              # scaffold a starter agents.toml + prompts/
```

`up` is **use-or-create**: an already-running session is left alone, never double-started.

```bash
python3 agents.py --config agents.toml up           # start all enabled agents + services + watchdog
python3 agents.py --config agents.toml up builder    # start just one agent (by name)
python3 agents.py --config agents.toml down          # stop everything
python3 agents.py --config agents.toml logs builder  # tail one session's log
python3 agents.py --config agents.toml phase show    # where the loop phase machine is
```

`--config` defaults to `./agents.toml`, falling back to one next to `agents.py`.
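
The use-or-create behaviour of `up` can be pictured with a short sketch. This is not the code in `agents.py`; the function names and the injectable `run` parameter are hypothetical, and only the two tmux subcommands (`has-session`, `new-session`) are real.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_tmux(argv: Sequence[str]) -> int:
    """Run a tmux command quietly and return its exit code."""
    result = subprocess.run(
        list(argv), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def use_or_create(session: str, command: str, run: Runner = run_tmux) -> str:
    """Return "used" if the session already exists, else create it."""
    # A leading "=" asks tmux for an exact session-name match.
    if run(["tmux", "has-session", "-t", f"={session}"]) == 0:
        return "used"  # already running: leave it alone, never double-start
    run(["tmux", "new-session", "-d", "-s", session, command])
    return "created"
```

Calling `use_or_create("myproj-builder", "sleep 60")` twice would create the session on the first call and leave it untouched on the second.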

---

## Examples

`examples/` holds runnable example projects — copy one, point `agents.py` at its `agents.toml`, and
go. The headline set is a family of **Builder/Adversary** variants that all build the *same* task,
each differing in one dimension, which makes them useful both as templates and as a study of the pattern:

- **`builder-adversary`** — the canonical loop pair: a Builder that builds and an Adversary that
  cold-verifies every claim, coordinating only through git (`claim(`/`review(` commits + the watchdog
  handoff). **Start here.**
- **`builder-adversary-min`** — the same pattern with the prompts compressed to minimal tokens.
- **`builder-adversary-stateless`** — `builder-adversary` + **context hygiene** (compact at each
  checkpoint, read diffs not trees, lean loads) to minimise carried/reloaded context.
- **`builder-adversary-lean`** — context hygiene + **per-gate** review (one claim/verdict per gate).
- **`builder-adversary-deferred`** — the Adversary verifies **once**, after the whole build, in a
  final comprehensive `review` phase (vs per-phase / per-gate).
- **`builder-solo`** — a single Builder that self-certifies, with **no Adversary** (the control).
- **`snakepit`** — a different topology entirely: a pool of identical worker "snakes" pulling tasks
  from a shared filesystem queue, plus cleanup specialists. (`examples/IDEAS.md` sketches more.)

Each example has its own `README.md`. Run one by hand:

```bash
cd examples/builder-adversary
python3 ../../agents.py status --config agents.toml      # read-only
python3 ../../agents.py up     --config agents.toml       # needs `claude` on PATH
```

**Benchmark.** The separate
[`agent-orchestrator-benchmark`](https://git.autonomic.zone/recipe-maintainers/agent-orchestrator-benchmark)
repo runs these Builder/Adversary variants head-to-head (N=5, real `agents.py up` runs) to measure
what drives token cost. Short version: an independent adversary costs **~4.7×** a solo builder, but
the review *cadence* (per-gate / per-phase / deferred) is **nearly token-neutral**, and **context
hygiene** is the one clean **~−22%** win. See that repo's `FINDINGS.md`.

---

## The config: `agents.toml`

Five section types: `[watchdog]`, `[backend.<name>]`, `[defaults]`, `[[agent]]` / `[[service]]`,
and `[loop]`. See `agents.example.toml` for a complete, runnable example.

### `[watchdog]` — global supervisor cadence

```toml
[watchdog]
signal_interval      = 30    # seconds between light checks (handoff / stall / limit)
heavy_interval       = 300   # seconds between heal + phase-advance checks
limit_probe_fallback = 300   # re-probe cadence for a usage-limited agent when reset time is unparsable
limit_reset_slack    = 45    # seconds to wait past a parsed reset before probing
stall_grace          = 180   # seconds of slack past a WAITING-UNTIL marker before a stall reboot
log_tokens           = false # opt-in: record per-phase token + time usage (see below)
```

**Per-phase token + time logging (`log_tokens`).** Set `log_tokens = true` (under `[watchdog]` or
`[loop]`) and the watchdog records, for **each phase**, how many tokens **each agent** used and how
long the phase took — appended as one JSON object per phase to `<log_dir>/token-log.jsonl`. Tokens
are summed from each agent's Claude Code session transcript and attributed **by working dir**, so
give each agent its own `dir` (the Builder/Adversary loop pair already uses separate clones) for
accurate per-agent numbers. The watchdog snapshots a baseline when a phase starts and writes the
delta (per agent, and the total) when the phase advances or the sequence completes — robust across
watchdog restarts. Pretty-print it with `agents.py tokens`:

```
phase        dur(s)   builder adversary         TOTAL
-----------------------------------------------------
lex           372.0 3,910,118 3,221,447     7,131,565
parse         410.5 ...
```
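
The baseline-and-delta bookkeeping described above can be sketched as follows. This is an illustration, not the watchdog's code: `phase_delta` is a hypothetical name, and the baseline figures in the usage note are invented so that the result matches the `lex` row of the table.

```python
def phase_delta(baseline: dict, current: dict) -> dict:
    """Per-agent tokens used since the phase baseline, plus a TOTAL entry."""
    agents = sorted(set(baseline) | set(current))
    # An agent missing from either snapshot counts as zero tokens there.
    delta = {a: current.get(a, 0) - baseline.get(a, 0) for a in agents}
    delta["TOTAL"] = sum(delta.values())
    return delta


baseline = {"builder": 1_000_000, "adversary": 500_000}   # snapshot at phase start
current = {"builder": 4_910_118, "adversary": 3_721_447}  # snapshot at phase advance
print(phase_delta(baseline, current))
# {'adversary': 3221447, 'builder': 3910118, 'TOTAL': 7131565}
```

Because the delta is computed from two persisted snapshots rather than a running counter, a watchdog restart in the middle of a phase does not lose the numbers.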

### `[defaults]` — inherited by every agent

```toml
[defaults]
session_prefix = "myproj-"   # REQUIRED: tmux namespace for this project. No implicit default.
log_dir        = ".ao-state" # REQUIRED: logs + state/. Relative paths resolve against the config dir.
backend        = "claude"
model          = "claude-sonnet-4-6"
dir            = "."         # default working dir for agents (relative → project dir)
watch          = "heal"      # none | heal | heal+stall
project_dir    = "."         # OPTIONAL: project root for resolving prompts/paths (default: config's dir)
```

`session_prefix` and `log_dir` are **required** — the harness has no project-specific fallbacks.
Every relative path (`log_dir`, an agent's `dir`, `handoff.repo`, prompt/template files) resolves
against `project_dir`, which defaults to the directory holding the config file. When the config
lives in a sandbox but the prompts live elsewhere (as `smoke.sh` does), set `project_dir`
explicitly.
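
A minimal sketch of that resolution rule, with hypothetical helper names. It assumes a relative `project_dir` is itself resolved against the config's directory, which the text above does not state.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_project_dir(config_path: PurePosixPath,
                        project_dir: Optional[str]) -> PurePosixPath:
    """Project root: the config's directory unless project_dir overrides it."""
    config_dir = config_path.parent
    if project_dir is None:
        return config_dir
    # Joining with an absolute path discards the left-hand side.
    return config_dir / project_dir


def resolve(path: str, project_root: PurePosixPath) -> PurePosixPath:
    """Resolve a config path (log_dir, an agent's dir, a prompt file)."""
    return project_root / path


root = resolve_project_dir(PurePosixPath("/sandbox/agents.toml"), None)
print(resolve(".ao-state", root))  # /sandbox/.ao-state
```

With `project_dir = "/repo"` the same config would resolve `prompts/kickoff.md` to `/repo/prompts/kickoff.md`, which is the sandbox case mentioned above.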

### `[backend.<name>]` — backends declared as data

A backend is fully described by config — no code change to add one. The one field that selects
behavior is `prompt_delivery`:

| `prompt_delivery` | how the kickoff reaches the agent | example |
|---|---|---|
| `"arg"`  | passed as a CLI argument (claude-style) | `claude … "$(cat kickoff)"` |
| `"ping"` | typed in after a TUI connects (opencode-style) | attach, wait, send-keys |
| `"exec"` | a plain command; the prompt is written to a file | generic / demo |

```toml
[backend.claude]
bin             = "claude"
flags           = "--dangerously-skip-permissions"
remote_control  = true          # add a --remote-control <session> flag
supports_resume = true          # honor an agent's resume=true
prompt_delivery = "arg"
process_name    = "claude"      # the pane process a healthy session runs (backend-mismatch healing)
submit_key      = "Enter"       # key to submit a typed message
stall_idle      = 300           # seconds idle before a heal+stall agent is rebooted
active_re = "esc to interrupt|Running tool|· \\d+"   # pane shows the agent is WORKING
limit_re  = "usage limit|limit reached|reached your .*limit"   # usage/rate-limit banner
fatal_re  = "redacted_thinking|cannot be modified"  # unrecoverable session state → kill + restart

[backend.opencode]              # a TUI backend
bin             = "opencode"
attach          = "{bin} attach {server} --dir {dir}"
server          = "http://127.0.0.1:4096"
prompt_delivery = "ping"
process_name    = "opencode"
footer_ui       = true          # a static footer lingers after a turn → only the bottom = activity
log_grace       = 180           # within this many seconds of a log write, treat as active
connect_delay   = 12            # seconds to wait for the TUI before typing
submit_key      = "C-m"
model_env       = true          # pass the model via OPENCODE_CONFIG_CONTENT
preamble        = "set -a; . ./.env; set +a"   # shell run before launch (e.g. load creds)
active_re = "esc interrupt|thinking|running tool|preparing patch"
limit_re  = "usage limit|limit reached"

[backend.demo]                  # a dependency-free backend for testing the harness mechanics
bin             = "echo '[demo] {session} up'; exec sleep 1000000"
prompt_delivery = "exec"        # {kickoff}=prompt file, {session}=session name, {model}=model
```

For an `"arg"` backend the flag *templates* are configurable (so you can point at a non-claude
CLI): `resume_flag` (default `--resume '{id}'`), `model_flag` (default `--model '{model}'`),
`remote_control_flag` (default `--remote-control '{session}'`). A backend that sets `process_name`
participates in backend-mismatch healing; one that doesn't (e.g. `demo`) never does.
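
To make the flag templates concrete, here is a sketch of how an `"arg"` backend's command line might be assembled. The defaults are the ones listed above; the function name, the order of the flags, and the quoting of the kickoff text are assumptions, not the behaviour of `agents.py`.

```python
import shlex
from typing import Optional

DEFAULT_FLAGS = {
    "resume_flag": "--resume '{id}'",
    "model_flag": "--model '{model}'",
    "remote_control_flag": "--remote-control '{session}'",
}


def build_arg_command(backend: dict, *, model: str, session: str,
                      kickoff: str, resume_id: Optional[str] = None) -> str:
    """Assemble a shell command for a prompt_delivery="arg" backend."""
    # Backend-level templates override the defaults.
    tmpl = {k: backend.get(k, v) for k, v in DEFAULT_FLAGS.items()}
    parts = [backend["bin"]]
    if backend.get("flags"):
        parts.append(backend["flags"])
    parts.append(tmpl["model_flag"].format(model=model))
    if backend.get("remote_control"):
        parts.append(tmpl["remote_control_flag"].format(session=session))
    if resume_id and backend.get("supports_resume"):
        parts.append(tmpl["resume_flag"].format(id=resume_id))
    parts.append(shlex.quote(kickoff))  # the kickoff travels as the last argument
    return " ".join(parts)
```

Pointing at a non-claude CLI would then only need, for example, `model_flag = "-m {model}"` in its `[backend.<name>]` block.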

### `[[agent]]` — one block per agent

```toml
[[agent]]
name    = "builder"            # tmux session defaults to <session_prefix><name>; override with session=
kind    = "loop"              # loop | persistent | task
backend = "claude"            # overrides defaults.backend
model   = "claude-opus-4-8"   # overrides defaults.model
dir     = "."                 # working dir (relative → project dir)
role    = "builder"           # loop agents only: role prompt = <roles_dir>/<role>.md
resume  = true                # (arg backends with supports_resume) --resume <state/<name>.id>
watch   = "heal+stall"        # none | heal | heal+stall
enabled = true                # false = not started by a bare `up`, not supervised
wake    = { interval = 3600, prompt_file = "prompts/supervise.md" }   # periodic nudge
prompt  = """inline startup text"""          # persistent/task agents; OR prompt_file = "path.md"
log_signature = "PROJECT PHASE"              # optional: disambiguate agents that share a dir (agent-log.py)
```

| kind | prompt source | typical `watch` |
|---|---|---|
| `loop` | auto-built: kickoff template + `<roles_dir>/<role>.md` | `heal+stall` |
| `persistent` | `prompt` / `prompt_file` (+ optional `resume`, `wake`) | `heal` |
| `task` | `prompt` (runs once, then idles) | `none`, `enabled=false` |

**`watch` policy:**

| value | behavior |
|---|---|
| `none` | ignored by the watchdog entirely |
| `heal` | restart if the session is dead, FATAL-wedged, or running the wrong backend; pause all healing while inside a usage-limit window; **never** reboot just for being idle |
| `heal+stall` | everything in `heal`, **plus** reboot if idle past `stall_idle` — respecting any `WAITING-UNTIL: <ISO-8601>` self-wake marker the agent prints as its last line |
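
The `heal+stall` decision can be sketched like this. The helper names and the exact marker regex are hypothetical; the thresholds are the `stall_idle` and `stall_grace` values from the examples above, and the sketch assumes a marker without a UTC offset means UTC.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

MARKER = re.compile(r"^WAITING-UNTIL:\s*(\S+)\s*$")


def parse_waiting_until(pane_text: str) -> Optional[datetime]:
    """Return the self-wake time if the last non-blank line is a marker."""
    lines = [ln.strip() for ln in pane_text.splitlines() if ln.strip()]
    if not lines:
        return None
    match = MARKER.match(lines[-1])
    if not match:
        return None
    try:
        ts = datetime.fromisoformat(match.group(1))
    except ValueError:
        return None  # unparsable timestamp: treat as no marker
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def should_reboot_for_stall(pane_text: str, idle_seconds: float, now: datetime,
                            stall_idle: int = 300, stall_grace: int = 180) -> bool:
    """True when a heal+stall agent has been idle too long with no valid excuse."""
    if idle_seconds < stall_idle:
        return False
    wake_at = parse_waiting_until(pane_text)
    if wake_at is None:
        return True
    # Give the agent stall_grace seconds past its own deadline.
    return now > wake_at + timedelta(seconds=stall_grace)
```

An agent that prints `WAITING-UNTIL: 2025-01-01T12:30:00+00:00` as its last line would be left alone until 12:33:00 UTC under these defaults.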

### `[[service]]` — non-AI helper processes

```toml
[[service]]
name    = "cleanlogs"
command = "python3 agent-log.py follow-all"
```

Started by a bare `up`, killed by `down`. Just a supervised command in a tmux session.

### `[loop]` — the phase state machine (governs `kind="loop"` agents)

```toml
[loop]
state_file       = "phase-idx"          # under <log_dir>/state/
resume_phase     = true                 # keep the phase index across restarts (don't reset to 0)
auto_advance     = true                 # advance when the current phase's status file says done_marker
done_marker      = "## DONE"
kickoff_template = "prompts/kickoff.md" # project preamble; slots {phase_id}/{plan}/{status}/{role}
roles_dir        = "prompts"            # role prompt = <roles_dir>/<role>.md
handoff = { repo = ".", claim_pings = "adversary", review_pings = "builder",
            inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"],
            claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
on_complete = { trigger_file = ".run-on-complete", run = "reporter" }   # run task agent on completion
phases = [
  { id = "p1", plan = "plans/p1.md", status = "STATUS-p1.md" },
  { id = "p2", plan = "plans/p2.md", status = "STATUS-p2.md", models = { builder = "claude-opus-4-8" } },
]
```

- **Kickoff template.** A loop agent's prompt is `kickoff_template` (with `{phase_id}`, `{plan}`,
  `{status}`, `{role}` substituted from the current phase) followed by `<roles_dir>/<role>.md`.
  Both are project files; this repo ships generic starters in `prompts/`. There is no built-in
  preamble text.
- **Per-phase model override.** A phase's `models = { builder = "...", adversary = "..." }`
  overrides those agents' model for just that phase (matched on the agent's `role`).
- **Auto-advance.** Each heavy tick, if the current phase's `status` file (looked up in
  `handoff.repo`'s `state_subdir/` then its root) contains a real `done_marker` — not a "Not
  yet…" placeholder — the watchdog stops the loops, bumps the phase index, and restarts them on
  the next phase. After the last phase it writes a `SEQUENCE-COMPLETE` marker under `log_dir` and
  stops the loops (idempotent — no churn). Appending a phase later clears the stale marker and
  resumes. On completion, an optional `on_complete.run` task agent fires if its `trigger_file`
  exists under `log_dir`.
- **Handoff signalling.** The watchdog watches `handoff.repo`'s `origin/main` for commits whose
  subject matches `claim_pattern` / `review_pattern`, and watches the two `inboxes` files. When a
  claim lands it pings the `claim_pings` agent; a review pings `review_pings`; an inbox change
  pings the relevant side. This is how the Builder and Adversary coordinate purely through git.
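
Two of the mechanisms above, kickoff assembly and handoff routing, are small enough to sketch. The function names are hypothetical, and the blank-line separator between template and role prompt and the use of `re.search` on the commit subject are assumptions rather than documented behaviour.

```python
import re
from typing import Optional


def build_kickoff(template: str, role_prompt: str, phase: dict, role: str) -> str:
    """Fill the template's slots from the current phase, then append the role prompt."""
    slots = {"phase_id": phase["id"], "plan": phase["plan"],
             "status": phase["status"], "role": role}
    # Plain replacement leaves any other braces in the template untouched.
    for name, value in slots.items():
        template = template.replace("{" + name + "}", value)
    return template.rstrip("\n") + "\n\n" + role_prompt


def route_commit(subject: str, handoff: dict) -> Optional[str]:
    """Return the agent to ping for a commit subject, or None if it is not a handoff."""
    if re.search(handoff["claim_pattern"], subject):
        return handoff["claim_pings"]
    if re.search(handoff["review_pattern"], subject):
        return handoff["review_pings"]
    return None


handoff = {"claim_pattern": "^claim", "review_pattern": "^review",
           "claim_pings": "adversary", "review_pings": "builder"}
print(route_commit("claim(p1): lexer passes", handoff))  # adversary
```

With the `[loop]` example above, a `review(p1): ...` commit would route to `builder`, and an ordinary commit would ping nobody.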

---

## Config vs state

- **Config** = `agents.toml` — declarative, version-controlled, the only source of truth.
- **State** = `<log_dir>/state/` — machine-written runtime only: `phase-idx` (current phase),
  `<name>.id` (resume id), `limited-<session>.json` (active usage-limit window),
  `kickoff-<session>.txt` (the exact prompt last sent). Git-ignore your `log_dir`.
- **Env** = a one-off override for a *single* invocation only: `AGENT_MODEL_<name>=…` /
  `AGENT_BACKEND_<name>=…`. The persisted watchdog ignores env and re-reads the file every tick —
  deliberately, so env-vs-file drift can never silently revert a backend.
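
The split between a one-off env override and the persisted watchdog can be sketched as a single lookup. `effective_model` and its `one_off` flag are hypothetical names; only the `AGENT_MODEL_<name>` variable comes from the text above, and per-phase `models` overrides are left out for brevity.

```python
from typing import Mapping


def effective_model(agent: dict, defaults: dict, env: Mapping[str, str],
                    *, one_off: bool) -> str:
    """Pick the model for an agent; env is honoured only for one-off invocations."""
    configured = agent.get("model", defaults.get("model"))
    if one_off:
        return env.get(f"AGENT_MODEL_{agent['name']}", configured)
    # The watchdog path ignores env, so the file always wins on the next tick.
    return configured


agent = {"name": "builder", "model": "claude-opus-4-8"}
env = {"AGENT_MODEL_builder": "claude-sonnet-4-6"}
print(effective_model(agent, {}, env, one_off=True))   # claude-sonnet-4-6
print(effective_model(agent, {}, env, one_off=False))  # claude-opus-4-8
```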

---

## The driver: verbs

The recommended (not required) verb set — an AI project-orchestrator can rely on these being
present, but a harness is free to add more:

```
agents.py up [name…]               start enabled agents (+ services + watchdog); use-or-create
agents.py down [name…]             stop agents/services/watchdog (all, or named)
agents.py status                   table of every agent: kind, backend, model, watch, state, phase
agents.py watchdog                 the supervisor loop (what the <prefix>watchdog session runs)
agents.py logs <name>              tail that session's log
agents.py phase [show|next|set N]  inspect / move the loop phase index
agents.py tokens                   per-phase token + time report (when [watchdog].log_tokens = true)
agents.py selftest                 regression-test the backend activity detector (needs no config)
agents.py init [dir]               scaffold a starter agents.toml + prompts/ in a project dir
  --config PATH                    use a specific config (default: ./agents.toml)
```

### The watchdog tick

`agents.py watchdog` runs as the `<prefix>watchdog` tmux session and **re-reads the config every
tick**. Each loop:

- **signal tick** (`signal_interval`): send handoff pings; run the usage-limit check for each
  watched agent and the stall check for each `heal+stall` agent; fire any due `wake`.
- **heavy tick** (`heavy_interval`): advance the loop phase if the current one is done; otherwise
  heal each watched agent per its `watch` policy. When the sequence is complete the finished loops
  stay stopped, but persistent agents stay supervised.

**Usage-limit handling:** when an agent prints a limit banner, the watchdog parses the reset time,
arms a quiet window (never rebooting a limited agent), and at the end sends one probe to resume it
— re-arming if the banner re-prints.
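
The timing of the quiet window follows from the two `[watchdog]` settings `limit_reset_slack` and `limit_probe_fallback`. A sketch under stated assumptions: times are plain epoch seconds, the function names are hypothetical, and banner parsing itself is out of scope.

```python
from typing import Optional


def next_probe_at(now: float, reset_at: Optional[float],
                  limit_reset_slack: int = 45,
                  limit_probe_fallback: int = 300) -> float:
    """When to send the single resume probe to a usage-limited agent."""
    if reset_at is None:
        # Reset time could not be parsed: fall back to a fixed re-probe cadence.
        return now + limit_probe_fallback
    return reset_at + limit_reset_slack


def in_quiet_window(now: float, probe_at: float) -> bool:
    """While True, the agent is neither healed nor rebooted."""
    return now < probe_at


print(next_probe_at(1000.0, 4600.0))  # 4645.0
print(next_probe_at(1000.0, None))    # 1300.0
```

If the banner re-prints after the probe, the same calculation runs again with the new reset time, which is the re-arming described above.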

---

## Driving the harness from an AI project-orchestrator

This harness is designed to be driven by an AI "project-orchestrator" (PO) that creates and runs
many projects, each pinning its own copy of this engine. The contract is intentionally **not
rigid** — the PO reads these docs and works out how to drive a project. What it can rely on:

1. **One config, one driver.** Everything the PO needs to know about a project's agents is in that
   project's `agents.toml`; everything it can *do* is a verb above. To inspect, `status`. To start
   or stop, `up` / `down`. To move the phase, `phase`.
2. **Isolation by `session_prefix`.** Two projects never collide as long as their `session_prefix`
   values differ. The PO assigns each project a unique prefix at creation.
3. **State is on disk, not in the PO.** Phase index, resume ids and limit windows live under the
   project's `log_dir`. The PO can restart a project (or the whole host) and the watchdog resumes
   from there.
4. **Knowledge is one-directional.** A project repo contains nothing about the PO or the fleet —
   it can be run by hand and would have no idea a PO exists. The PO's fleet registry is the only
   record of which projects exist and at what engine ref. This repo never reaches "up" toward a PO.
5. **Submodule pin = the engine version.** A project pins this repo at a tag (e.g. `v0.1.0`) as a
   submodule under `engine/`. Bumping is per-project and opt-in (`git submodule update --remote`);
   one project's bump can't break another.

A minimal project layout the PO scaffolds:

```
my-project/                 # its own repo; knows nothing about the PO
  agents.toml               # harness config (this schema)
  engine/                   # this repo as a pinned submodule
  prompts/                  # role prompts + kickoff template
  machine-docs/             # the loop pair's coordination files (STATUS/REVIEW/inboxes)
  .ao-state/                # runtime state + logs (gitignored)
  .env                      # project creds (never in git)
```

Run it by hand with `python3 engine/agents.py up --config agents.toml`.
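
From the PO's side, driving a project reduces to running the documented verbs inside the project directory. The sketch below is one possible shape, not a prescribed client: `project_command` and `run_verb` are hypothetical helpers, and placing `--config` before the verb mirrors the quick-start examples.

```python
import subprocess
from typing import List

VERBS = {"up", "down", "status", "watchdog", "logs", "phase",
         "tokens", "selftest", "init"}


def project_command(verb: str, *args: str,
                    engine: str = "engine/agents.py",
                    config: str = "agents.toml") -> List[str]:
    """Build the argv for one verb against a project's pinned engine."""
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb}")
    return ["python3", engine, "--config", config, verb, *args]


def run_verb(project_dir: str, verb: str, *args: str) -> subprocess.CompletedProcess:
    """Run a verb with the project directory as the working directory."""
    return subprocess.run(project_command(verb, *args), cwd=project_dir,
                          capture_output=True, text=True)


print(project_command("phase", "set", "2"))
# ['python3', 'engine/agents.py', '--config', 'agents.toml', 'phase', 'set', '2']
```

Because all state lives under the project's `log_dir`, the PO keeps nothing between calls beyond the project path.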

---

## Nix

A `flake.nix` provides a reproducible devShell with the runtime deps (`python311` for stdlib
`tomllib`, plus `tmux` and `git`):

```bash
nix develop                                   # enter the shell
nix develop -c python3 agents.py selftest      # or run one command in it
nix flake check                                # evaluate + build the devShell
```

The agent CLIs themselves (`claude`, `opencode`) are **external, non-Nix tools** — install them
per their own docs and make sure they are on `PATH` before launching live agents. The devShell
documents this in its banner.

---

## Testing

The `tests/` directory holds the harness's own test suite. One runner drives everything:

```bash
nix develop -c ./tests/run.sh      # unit tests always; live backend smokes when available
# or just:  ./tests/run.sh         # (python3 + tmux must be on PATH)
```

What it runs:

- **Unit tests** (`tests/test_unit.py`) — pure logic, **no agents spawned, no live tmux sessions**.
  Cover config load + defaults merge, kickoff-template assembly, the phase machine (advance on the
  done marker, idempotent sequence-complete, append-a-phase resumes), usage-limit reset-banner
  parsing, `WAITING-UNTIL` / stall parsing, and the per-backend activity detectors (claude +
  opencode footers). Always run; a failure fails the suite. Run them alone with
  `python3 -m unittest discover -s tests` (or `python3 tests/test_unit.py`).
- **Live backend smokes** (`tests/smoke_claude.sh`, `tests/smoke_opencode.sh`) — each brings a
  throwaway scratch project up **through `agents.py`** on a real backend, in a fully isolated
  sandbox (its own unique `session_prefix`, a temp `log_dir`, and — for opencode — a dedicated
  server on a non-default port `AOTEST_OC_PORT`, default `4097`), confirms the session attaches and
  `status` reports it RUNNING, then `down`s it and cleans up (no leftover sessions, port freed).
  Each **SKIPs gracefully** (exit 0) when its backend's binary or creds are unavailable. Useful env:
  `CLAUDE_BIN` / `OPENCODE_BIN`, `AOTEST_MODEL`, `AOTEST_OC_PORT`, `AOTEST_OC_CREDS`.
- **Isolation sanity** — after the live runs, the runner asserts no `aotest-*` tmux sessions leaked
  and reports that any live sessions are untouched.

The smokes are safe by construction: a unique per-run session prefix (never `cc-ci-` or any real
project's), a dedicated opencode port (never `4096`), and a cleanup trap that fires on success,
failure, and Ctrl+C.

---

## Adding things

- **Add an agent** — add an `[[agent]]` block; `agents.py up <name>`. No code change.
- **Add a backend** — add a `[backend.<name>]` block (`bin`, `prompt_delivery`, the regexes);
  point an agent at it with `backend = "<name>"`.
- **Add / append a phase** — add an entry to `[loop].phases`; the watchdog advances into it
  automatically (clearing a stale `SEQUENCE-COMPLETE` if the sequence had finished).
- **Change a model or backend** — edit the field (or a phase's `models = {}`), then
  `agents.py down <name> && agents.py up <name>`. The watchdog re-reads the file; it won't fight you.