# agent-orchestrator
A generic, reusable harness for running and supervising a fleet of AI-agent sessions in **tmux**.
One driver script + one declarative config (`agents.toml`) describe every agent — a Builder /
Adversary loop pair, a persistent supervisor, a one-shot task — and a **watchdog** keeps them
alive, healed, paced, and coordinated. The watchdog reads the same config every tick, so there is
never any env-vs-file drift.
Nothing about any particular project lives in this repo. Paths, the loop **kickoff preamble**, the
**handoff conventions**, and the **on-complete** hook are all supplied by the project's config and
prompt files. A project consumes this repo as a pinned **git submodule** (`engine/`) and keeps its
own config, prompts, state, and tmux namespace — total isolation between projects.
```
agents.py            the driver + watchdog (pure Python stdlib; needs python >= 3.11 for tomllib)
agent-log.py         render claude JSONL transcripts into clean, greppable logs
agents.example.toml  a self-contained 2-agent example project
prompts/             generic role + kickoff templates (builder / adversary / kickoff)
examples/            runnable example projects: the Builder/Adversary variant family, snakepit, …
smoke.sh             bring the example up + tear it down in an isolated sandbox, then clean up
tests/               the test suite: unit tests + isolated live backend smokes + a runner
flake.nix/.lock      a Nix devShell with the runtime deps (python311, tmux, git)
```
---
## Quick start
```bash
nix develop # python311 + tmux + git on PATH (see "Nix" below)
python3 agents.py selftest # regression-test the activity detector (no config)
python3 agents.py status --config agents.example.toml # one table: every agent + the phase
./smoke.sh # prove up/down works end-to-end, isolated + clean
python3 agents.py init myproject # scaffold a starter agents.toml + prompts/
```
`up` is **use-or-create**: an already-running session is left alone, never double-started.
```bash
python3 agents.py --config agents.toml up # start all enabled agents + services + watchdog
python3 agents.py --config agents.toml up builder # start just one agent (by name)
python3 agents.py --config agents.toml down # stop everything
python3 agents.py --config agents.toml logs builder # tail one session's log
python3 agents.py --config agents.toml phase show # where the loop phase machine is
```
`--config` defaults to `./agents.toml`, falling back to one next to `agents.py`.
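
The lookup order just described could be sketched as follows. This is an illustration under stated assumptions, not the code in `agents.py`: the helper name `find_config` and its parameters are hypothetical.

```python
from pathlib import Path


def find_config(explicit=None, *, cwd=".", script_dir="."):
    """Pick a config path (hypothetical helper).

    Order: an explicit --config value, then ./agents.toml in the working
    directory, then the agents.toml sitting next to the driver script.
    """
    if explicit is not None:
        return Path(explicit)
    for candidate in (Path(cwd) / "agents.toml", Path(script_dir) / "agents.toml"):
        if candidate.is_file():
            return candidate
    raise FileNotFoundError("no agents.toml in the working dir or next to the driver")
```

In a real driver `script_dir` would presumably be derived from the script's own location (for example `Path(__file__).parent`).
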
---
## Examples
`examples/` holds runnable example projects — copy one, point `agents.py` at its `agents.toml`, and
go. The headline set is a family of **Builder/Adversary** variants that all build the *same* task, each
differing in one dimension, which makes them useful both as templates and as a study of the pattern:
- **`builder-adversary`** — the canonical loop pair: a Builder that builds and an Adversary that
cold-verifies every claim, coordinating only through git (`claim(`/`review(` commits + the watchdog
handoff). **Start here.**
- **`builder-adversary-min`** — the same pattern with the prompts compressed to minimal tokens.
- **`builder-adversary-stateless`** — `builder-adversary` + **context hygiene** (compact at each
checkpoint, read diffs not trees, lean loads) to minimise carried/reloaded context.
- **`builder-adversary-lean`** — context hygiene + **per-gate** review (one claim/verdict per gate).
- **`builder-adversary-deferred`** — the Adversary verifies **once**, after the whole build, in a
final comprehensive `review` phase (vs per-phase / per-gate).
- **`builder-solo`** — a single Builder that self-certifies, with **no Adversary** (the control).
- **`snakepit`** — a different topology entirely: a pool of identical worker "snakes" pulling tasks
from a shared filesystem queue, plus cleanup specialists. (`examples/IDEAS.md` sketches more.)
Each example has its own `README.md`. Run one by hand:
```bash
cd examples/builder-adversary
python3 ../../agents.py status --config agents.toml # read-only
python3 ../../agents.py up --config agents.toml # needs `claude` on PATH
```
**Benchmark.** The separate
[`agent-orchestrator-benchmark`](https://git.autonomic.zone/recipe-maintainers/agent-orchestrator-benchmark)
repo runs these Builder/Adversary variants head-to-head (N=5, real `agents.py up` runs) to measure
what drives token cost. Short version: an independent adversary costs **~4.7×** a solo builder, but
the review *cadence* (per-gate / per-phase / deferred) is **nearly token-neutral**, and **context
hygiene** is the one clean **~22%** win. See that repo's `FINDINGS.md`.
---
## The config: `agents.toml`
Five section types: `[watchdog]`, `[backend.<name>]`, `[defaults]`, `[[agent]]` / `[[service]]`,
and `[loop]`. See `agents.example.toml` for a complete, runnable example.
### `[watchdog]` — global supervisor cadence
```toml
[watchdog]
signal_interval = 30 # seconds between light checks (handoff / stall / limit)
heavy_interval = 300 # seconds between heal + phase-advance checks
limit_probe_fallback = 300 # re-probe cadence for a usage-limited agent when reset time is unparsable
limit_reset_slack = 45 # seconds to wait past a parsed reset before probing
stall_grace = 180 # seconds of slack past a WAITING-UNTIL marker before a stall reboot
log_tokens = false # opt-in: record per-phase token + time usage (see below)
```
**Per-phase token + time logging (`log_tokens`).** Set `log_tokens = true` (under `[watchdog]` or
`[loop]`) and the watchdog records, for **each phase**, how many tokens **each agent** used and how
long the phase took — appended as one JSON object per phase to `<log_dir>/token-log.jsonl`. Tokens
are summed from each agent's Claude Code session transcript and attributed **by working dir**, so
give each agent its own `dir` (the Builder/Adversary loop pair already uses separate clones) for
accurate per-agent numbers. The watchdog snapshots a baseline when a phase starts and writes the
delta (per agent, and the total) when the phase advances or the sequence completes — robust across
watchdog restarts. Pretty-print it with `agents.py tokens`:
```
phase   dur(s)      builder    adversary        TOTAL
-----------------------------------------------------
lex      372.0    3,910,118    3,221,447    7,131,565
parse    410.5    ...
```
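
The baseline-and-delta bookkeeping described above can be sketched as below. This is a minimal illustration rather than the harness's own code: the function name `phase_delta` and the record fields (`phase`, `duration_s`, `agents`, `total`) are assumptions, and the real `token-log.jsonl` schema may differ.

```python
import json


def phase_delta(phase_id, baseline, current, started_at, ended_at):
    """Build one per-phase record from two cumulative token snapshots.

    `baseline` and `current` map agent name -> cumulative tokens.
    All field names in the returned record are hypothetical.
    """
    per_agent = {
        name: current.get(name, 0) - baseline.get(name, 0)
        for name in sorted(current)
    }
    return {
        "phase": phase_id,
        "duration_s": round(ended_at - started_at, 1),
        "agents": per_agent,
        "total": sum(per_agent.values()),
    }


record = phase_delta(
    "lex",
    baseline={"builder": 1_000, "adversary": 500},
    current={"builder": 4_000, "adversary": 2_500},
    started_at=0.0,
    ended_at=372.0,
)
line = json.dumps(record)  # one JSON object per phase, written as one line
```

Keeping the baseline snapshot on disk instead of in memory is one plausible way to get the restart robustness the text mentions, since a restarted watchdog could still compute the delta.
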
### `[defaults]` — inherited by every agent
```toml
[defaults]
session_prefix = "myproj-" # REQUIRED: tmux namespace for this project. No implicit default.
log_dir = ".ao-state" # REQUIRED: logs + state/. Relative paths resolve against the config dir.
backend = "claude"
model = "claude-sonnet-4-6"
dir = "." # default working dir for agents (relative → project dir)
watch = "heal" # none | heal | heal+stall
project_dir = "." # OPTIONAL: project root for resolving prompts/paths (default: config's dir)
```
`session_prefix` and `log_dir` are **required** — the harness has no project-specific fallbacks.
Every relative path (`log_dir`, an agent's `dir`, `handoff.repo`, prompt/template files) resolves
against `project_dir`, which defaults to the directory holding the config file. When the config
lives in a sandbox but the prompts live elsewhere (as `smoke.sh` does), set `project_dir`
explicitly.
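
A sketch of that resolution rule, assuming the `[defaults]` table has already been parsed into a dict. The function name `resolve_paths` and its return shape are illustrative only, and resolving a relative `project_dir` against the config's directory is an assumption.

```python
from pathlib import Path


def resolve_paths(config_path, defaults):
    """Resolve project_dir and log_dir from a parsed [defaults] table."""
    for key in ("session_prefix", "log_dir"):
        if key not in defaults:
            raise ValueError(f"[defaults].{key} is required")
    config_dir = Path(config_path).resolve().parent
    # project_dir defaults to the directory holding the config file.
    project_dir = (config_dir / defaults.get("project_dir", ".")).resolve()
    # Joining with an absolute log_dir keeps it absolute; a relative one
    # lands under project_dir.
    log_dir = (project_dir / defaults["log_dir"]).resolve()
    return project_dir, log_dir
```

With the config in a sandbox and `project_dir` pointing elsewhere, `log_dir = ".ao-state"` would land under the project root, not under the sandbox.
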
### `[backend.<name>]` — backends declared as data
A backend is fully described by config — no code change to add one. The one field that selects
behavior is `prompt_delivery`:
| `prompt_delivery` | how the kickoff reaches the agent | example |
|---|---|---|
| `"arg"` | passed as a CLI argument (claude-style) | `claude … "$(cat kickoff)"` |
| `"ping"` | typed in after a TUI connects (opencode-style) | attach, wait, send-keys |
| `"exec"` | a plain command; the prompt is written to a file | generic / demo |
```toml
[backend.claude]
bin = "claude"
flags = "--dangerously-skip-permissions"
remote_control = true # add a --remote-control <session> flag
supports_resume = true # honor an agent's resume=true
prompt_delivery = "arg"
process_name = "claude" # the pane process a healthy session runs (backend-mismatch healing)
submit_key = "Enter" # key to submit a typed message
stall_idle = 300 # seconds idle before a heal+stall agent is rebooted
active_re = "esc to interrupt|Running tool|· \\d+" # pane shows the agent is WORKING
limit_re = "usage limit|limit reached|reached your .*limit" # usage/rate-limit banner
fatal_re = "redacted_thinking|cannot be modified" # unrecoverable session state → kill + restart
[backend.opencode] # a TUI backend
bin = "opencode"
attach = "{bin} attach {server} --dir {dir}"
server = "http://127.0.0.1:4096"
prompt_delivery = "ping"
process_name = "opencode"
footer_ui = true # a static footer lingers after a turn → only the bottom = activity
log_grace = 180 # within this many seconds of a log write, treat as active
connect_delay = 12 # seconds to wait for the TUI before typing
submit_key = "C-m"
model_env = true # pass the model via OPENCODE_CONFIG_CONTENT
preamble = "set -a; . ./.env; set +a" # shell run before launch (e.g. load creds)
active_re = "esc interrupt|thinking|running tool|preparing patch"
limit_re = "usage limit|limit reached"
[backend.demo] # a dependency-free backend for testing the harness mechanics
bin = "echo '[demo] {session} up'; exec sleep 1000000"
prompt_delivery = "exec" # {kickoff}=prompt file, {session}=session name, {model}=model
```
For an `"arg"` backend the flag *templates* are configurable (so you can point at a non-claude
CLI): `resume_flag` (default `--resume '{id}'`), `model_flag` (default `--model '{model}'`),
`remote_control_flag` (default `--remote-control '{session}'`). A backend that sets `process_name`
participates in backend-mismatch healing; one that doesn't (e.g. `demo`) never does.
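
To make the flag templates concrete, here is one way a command line for an `"arg"` backend might be assembled. Treat it as a sketch: `build_arg_command` is a hypothetical name, the order of the flags is an assumption, and only the three template defaults come from the text above.

```python
import shlex

# Defaults as documented above; a backend block may override any of them.
DEFAULT_FLAGS = {
    "resume_flag": "--resume '{id}'",
    "model_flag": "--model '{model}'",
    "remote_control_flag": "--remote-control '{session}'",
}


def build_arg_command(backend, *, session, model, kickoff_text, resume_id=None):
    """Assemble a shell command line for an "arg" backend (illustrative)."""
    tpl = {**DEFAULT_FLAGS, **backend}
    parts = [backend["bin"], backend.get("flags", "")]
    parts.append(tpl["model_flag"].format(model=model))
    if backend.get("remote_control"):
        parts.append(tpl["remote_control_flag"].format(session=session))
    if backend.get("supports_resume") and resume_id:
        parts.append(tpl["resume_flag"].format(id=resume_id))
    # The kickoff prompt goes last, quoted as a single shell argument.
    parts.append(shlex.quote(kickoff_text))
    return " ".join(p for p in parts if p)
```

Note that the templates carry their own quotes, so values substituted into them (model, session, id) are assumed here to be free of single quotes.
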
### `[[agent]]` — one block per agent
```toml
[[agent]]
name = "builder" # tmux session defaults to <session_prefix><name>; override with session=
kind = "loop" # loop | persistent | task
backend = "claude" # overrides defaults.backend
model = "claude-opus-4-8" # overrides defaults.model
dir = "." # working dir (relative → project dir)
role = "builder" # loop agents only: role prompt = <roles_dir>/<role>.md
resume = true # (arg backends with supports_resume) --resume <state/<name>.id>
watch = "heal+stall" # none | heal | heal+stall
enabled = true # false = not started by a bare `up`, not supervised
wake = { interval = 3600, prompt_file = "prompts/supervise.md" } # periodic nudge
prompt = """inline startup text""" # persistent/task agents; OR prompt_file = "path.md"
log_signature = "PROJECT PHASE" # optional: disambiguate agents that share a dir (agent-log.py)
```
| kind | prompt source | typical `watch` |
|---|---|---|
| `loop` | auto-built: kickoff template + `prompts/<role>.md` | `heal+stall` |
| `persistent` | `prompt` / `prompt_file` (+ optional `resume`, `wake`) | `heal` |
| `task` | `prompt` (runs once, then idles) | `none`, `enabled=false` |
**`watch` policy:**
| value | behavior |
|---|---|
| `none` | ignored by the watchdog entirely |
| `heal` | restart if the session is dead, FATAL-wedged, or running the wrong backend; pause all healing while inside a usage-limit window; **never** reboot just for being idle |
| `heal+stall` | everything in `heal`, **plus** reboot if idle past `stall_idle` — respecting any `WAITING-UNTIL: <ISO-8601>` self-wake marker the agent prints as its last line |
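
The `heal+stall` decision can be pictured as the small function below. It is a sketch under assumptions: the name `should_reboot_for_stall` is hypothetical, how idle time is measured is left out, and treating a timestamp without a UTC offset as UTC is a guess. `stall_idle` and `stall_grace` are the config values documented in `[backend.<name>]` and `[watchdog]`.

```python
import re
from datetime import datetime, timedelta, timezone

WAITING_RE = re.compile(r"^WAITING-UNTIL:\s*(\S+)\s*$")


def should_reboot_for_stall(last_line, idle_seconds, now, *, stall_idle=300, stall_grace=180):
    """Return True when an idle heal+stall agent should be rebooted."""
    if idle_seconds < stall_idle:
        return False  # not idle long enough to count as stalled
    match = WAITING_RE.match(last_line.strip())
    if match:
        try:
            until = datetime.fromisoformat(match.group(1))
        except ValueError:
            until = None  # unparsable marker: fall through to a reboot
        if until is not None:
            if until.tzinfo is None:
                until = until.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
            if now <= until + timedelta(seconds=stall_grace):
                return False  # the agent said it is waiting; respect that
    return True
```

For example, an agent whose last line is `WAITING-UNTIL: 2025-01-01T12:30:00+00:00` would not be rebooted at 12:00 UTC however long it has been idle, but would be once 12:33 has passed.
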
### `[[service]]` — non-AI helper processes
```toml
[[service]]
name = "cleanlogs"
command = "python3 agent-log.py follow-all"
```
Started by a bare `up`, killed by `down`. Just a supervised command in a tmux session.
### `[loop]` — the phase state machine (governs `kind="loop"` agents)
```toml
[loop]
state_file = "phase-idx" # under <log_dir>/state/
resume_phase = true # keep the phase index across restarts (don't reset to 0)
auto_advance = true # advance when the current phase's status file says done_marker
done_marker = "## DONE"
kickoff_template = "prompts/kickoff.md" # project preamble; slots {phase_id}/{plan}/{status}/{role}
roles_dir = "prompts" # role prompt = <roles_dir>/<role>.md
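# handoff settings as dotted keys (a TOML 1.0 inline table may not span lines)
handoff.repo = "."
handoff.claim_pings = "adversary"
handoff.review_pings = "builder"
handoff.inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"]
handoff.claim_pattern = "^claim"
handoff.review_pattern = "^review"
handoff.state_subdir = "machine-docs"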
on_complete = { trigger_file = ".run-on-complete", run = "reporter" } # run task agent on completion
phases = [
{ id = "p1", plan = "plans/p1.md", status = "STATUS-p1.md" },
{ id = "p2", plan = "plans/p2.md", status = "STATUS-p2.md", models = { builder = "claude-opus-4-8" } },
]
```
- **Kickoff template.** A loop agent's prompt is `kickoff_template` (with `{phase_id}`, `{plan}`,
`{status}`, `{role}` substituted from the current phase) followed by `<roles_dir>/<role>.md`.
Both are project files; this repo ships generic starters in `prompts/`. There is no built-in
preamble text.
- **Per-phase model override.** A phase's `models = { builder = "...", adversary = "..." }`
overrides those agents' model for just that phase (matched on the agent's `role`).
- **Auto-advance.** Each heavy tick, if the current phase's `status` file (looked up in
`handoff.repo`'s `state_subdir/` then its root) contains a real `done_marker` — not a "Not
yet…" placeholder — the watchdog stops the loops, bumps the phase index, and restarts them on
the next phase. After the last phase it writes a `SEQUENCE-COMPLETE` marker under `log_dir` and
stops the loops (idempotent — no churn). Appending a phase later clears the stale marker and
resumes. On completion, an optional `on_complete.run` task agent fires if its `trigger_file`
exists under `log_dir`.
- **Handoff signalling.** The watchdog watches `handoff.repo`'s `origin/main` for commits whose
subject matches `claim_pattern` / `review_pattern`, and watches the two `inboxes` files. When a
claim lands it pings the `claim_pings` agent; a review pings `review_pings`; an inbox change
pings the relevant side. This is how the Builder and Adversary coordinate purely through git.
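
The commit-subject half of handoff signalling comes down to a small routing function. The sketch below assumes the patterns are applied with `re.search` to the commit subject line; the name `route_commit` is hypothetical, and fetching `origin/main` and watching the inbox files are left out.

```python
import re


def route_commit(subject, handoff):
    """Return the agent to ping for a commit subject, or None."""
    if re.search(handoff["claim_pattern"], subject):
        return handoff["claim_pings"]
    if re.search(handoff["review_pattern"], subject):
        return handoff["review_pings"]
    return None


# Values taken from the [loop] example above.
handoff = {
    "claim_pattern": "^claim",
    "review_pattern": "^review",
    "claim_pings": "adversary",
    "review_pings": "builder",
}
```

Under these assumptions a commit titled `claim(p1): lexer done` would wake the Adversary, a `review(p1): ...` commit would wake the Builder, and any other subject would ping nobody.
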
---
## Config vs state
- **Config** = `agents.toml` — declarative, version-controlled, the only source of truth.
- **State** = `<log_dir>/state/` — machine-written runtime only: `phase-idx` (current phase),
`<name>.id` (resume id), `limited-<session>.json` (active usage-limit window),
`kickoff-<session>.txt` (the exact prompt last sent). Git-ignore your `log_dir`.
- **Env** = a one-off override for a *single* invocation only: `AGENT_MODEL_<name>=…` /
`AGENT_BACKEND_<name>=…`. The persisted watchdog ignores env and re-reads the file every tick —
deliberately, so env-vs-file drift can never silently revert a backend.
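
The precedence between env and file might look like this in code. It is a sketch only: `effective_model` and the `one_off` flag are hypothetical names, and the exact casing of the agent name inside the variable is an assumption.

```python
import os


def effective_model(agent, defaults, *, one_off=False, env=None):
    """Resolve an agent's model.

    one_off=True models a single CLI invocation, where AGENT_MODEL_<name>
    may override the file. The watchdog path (one_off=False) only ever
    looks at the config, so an env var cannot leak into supervision.
    """
    env = os.environ if env is None else env
    if one_off:
        override = env.get(f"AGENT_MODEL_{agent['name']}")
        if override:
            return override
    return agent.get("model", defaults.get("model"))
```

The same shape would apply to `AGENT_BACKEND_<name>`.
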
---
## The driver: verbs
The recommended (not required) verb set — an AI project-orchestrator can rely on these being
present, but a harness is free to add more:
```
agents.py up [name…]               start enabled agents (+ services + watchdog); use-or-create
agents.py down [name…]             stop agents/services/watchdog (all, or named)
agents.py status                   table of every agent: kind, backend, model, watch, state, phase
agents.py watchdog                 the supervisor loop (what the <prefix>watchdog session runs)
agents.py logs <name>              tail that session's log
agents.py phase [show|next|set N]  inspect / move the loop phase index
agents.py tokens                   per-phase token + time report (when [watchdog].log_tokens = true)
agents.py selftest                 regression-test the backend activity detector (needs no config)
agents.py init [dir]               scaffold a starter agents.toml + prompts/ in a project dir
  --config PATH                    use a specific config (default: ./agents.toml)
```
### The watchdog tick
`agents.py watchdog` runs as the `<prefix>watchdog` tmux session and **re-reads the config every
tick**. Each pass through the loop runs whichever of these ticks is due:
- **signal tick** (`signal_interval`): handoff pings; for each watched agent the usage-limit check,
and for `heal+stall` agents the stall check; fire any due `wake`.
- **heavy tick** (`heavy_interval`): advance the loop phase if the current one is done; otherwise
heal each watched agent per its `watch` policy. When the sequence is complete the finished loops
stay stopped, but persistent agents stay supervised.
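
The two cadences can be sketched as a single loop that reloads the config on every pass. This is an illustration, not the watchdog's actual code: `watchdog_loop`, its callbacks, and the one-second sleep are assumptions, and the clock and sleep functions are injectable only so the sketch can be exercised without waiting.

```python
import time


def watchdog_loop(load_config, on_signal, on_heavy, *,
                  clock=time.monotonic, sleep=time.sleep, max_iterations=None):
    """Run signal and heavy ticks at their own intervals."""
    last_signal = last_heavy = clock()
    n = 0
    while max_iterations is None or n < max_iterations:
        cfg = load_config()  # re-read on every pass, so file edits take effect
        wd = cfg["watchdog"]
        now = clock()
        if now - last_signal >= wd["signal_interval"]:
            on_signal(cfg)  # handoff pings, limit checks, stall checks, wakes
            last_signal = now
        if now - last_heavy >= wd["heavy_interval"]:
            on_heavy(cfg)  # phase advance, otherwise healing
            last_heavy = now
        sleep(1)
        n += 1
```

Because the intervals are read from the freshly loaded config, changing `signal_interval` in the file would take effect on the next pass without restarting the loop.
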
**Usage-limit handling:** when an agent prints a limit banner, the watchdog parses the reset time,
arms a quiet window (never rebooting a limited agent), and at the end sends one probe to resume it
— re-arming if the banner re-prints.
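
The timing of that quiet window follows from two `[watchdog]` settings. A minimal sketch, assuming the banner parser yields either a timezone-aware `datetime` or `None`; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def next_probe_at(now, parsed_reset, *, limit_reset_slack=45, limit_probe_fallback=300):
    """When to send the next resume probe to a usage-limited agent."""
    if parsed_reset is None:
        # Reset time could not be parsed: re-probe on the fallback cadence.
        return now + timedelta(seconds=limit_probe_fallback)
    # Wait a little past the advertised reset before probing.
    return parsed_reset + timedelta(seconds=limit_reset_slack)


def in_quiet_window(now, probe_at):
    """True while the limited agent must be left alone (no heal, no reboot)."""
    return now < probe_at
```

If the banner re-prints after a probe, calling `next_probe_at` again with the newly parsed reset time would re-arm the window.
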
---
## Driving the harness from an AI project-orchestrator
This harness is designed to be driven by an AI "project-orchestrator" (PO) that creates and runs
many projects, each pinning its own copy of this engine. The contract is intentionally **not
rigid** — the PO reads these docs and works out how to drive a project. What it can rely on:
1. **One config, one driver.** Everything the PO needs to know about a project's agents is in that
project's `agents.toml`; everything it can *do* is a verb above. To inspect, `status`. To start
or stop, `up` / `down`. To move the phase, `phase`.
2. **Isolation by `session_prefix`.** Two projects never collide as long as their `session_prefix`
values differ. The PO assigns each project a unique prefix at creation.
3. **State is on disk, not in the PO.** Phase index, resume ids and limit windows live under the
project's `log_dir`. The PO can restart a project (or the whole host) and the watchdog resumes
from there.
4. **Knowledge is one-directional.** A project repo contains nothing about the PO or the fleet —
it can be run by hand and would have no idea a PO exists. The PO's fleet registry is the only
record of which projects exist and at what engine ref. This repo never reaches "up" toward a PO.
5. **Submodule pin = the engine version.** A project pins this repo at a tag (e.g. `v0.1.0`) as a
submodule under `engine/`. Bumping is per-project and opt-in (`git submodule update --remote`);
one project's bump can't break another.
A minimal project layout the PO scaffolds:
```
my-project/          # its own repo; knows nothing about the PO
  agents.toml        # harness config (this schema)
  engine/            # this repo as a pinned submodule
  prompts/           # role prompts + kickoff template
  machine-docs/      # the loop pair's coordination files (STATUS/REVIEW/inboxes)
  .ao-state/         # runtime state + logs (gitignored)
  .env               # project creds (never in git)
Run it by hand with `engine/agents.py up --config agents.toml`.
---
## Nix
A `flake.nix` provides a reproducible devShell with the runtime deps (`python311` for stdlib
`tomllib`, plus `tmux` and `git`):
```bash
nix develop # enter the shell
nix develop -c python3 agents.py selftest # or run one command in it
nix flake check # evaluate + build the devShell
```
The agent CLIs themselves (`claude`, `opencode`) are **external, non-Nix tools** — install them
per their own docs and make sure they are on `PATH` before launching live agents. The devShell
documents this in its banner.
---
## Testing
The `tests/` directory holds the harness's own test suite. One runner drives everything:
```bash
nix develop -c ./tests/run.sh # unit tests always; live backend smokes when available
# or just: ./tests/run.sh # (python3 + tmux must be on PATH)
```
What it runs:
- **Unit tests** (`tests/test_unit.py`) — pure logic, **no agents spawned, no live tmux sessions**.
Cover config load + defaults merge, kickoff-template assembly, the phase machine (advance on the
done marker, idempotent sequence-complete, append-a-phase resumes), usage-limit reset-banner
parsing, `WAITING-UNTIL` / stall parsing, and the per-backend activity detectors (claude +
opencode footers). Always run; a failure fails the suite. Run them alone with
`python3 -m unittest discover -s tests` (or `python3 tests/test_unit.py`).
- **Live backend smokes** (`tests/smoke_claude.sh`, `tests/smoke_opencode.sh`) — each brings a
throwaway scratch project up **through `agents.py`** on a real backend, in a fully isolated
sandbox (its own unique `session_prefix`, a temp `log_dir`, and — for opencode — a dedicated
server on a non-default port `AOTEST_OC_PORT`, default `4097`), confirms the session attaches and
`status` reports it RUNNING, then `down`s it and cleans up (no leftover sessions, port freed).
Each **SKIPs gracefully** (exit 0) when its backend's binary or creds are unavailable. Useful env:
`CLAUDE_BIN` / `OPENCODE_BIN`, `AOTEST_MODEL`, `AOTEST_OC_PORT`, `AOTEST_OC_CREDS`.
- **Isolation sanity** — after the live runs, the runner asserts no `aotest-*` tmux sessions leaked
and reports that any other live sessions (those of real projects) were left untouched.
The smokes are safe by construction: a unique per-run session prefix (never `cc-ci-` or any real
project's), a dedicated opencode port (never `4096`), and a cleanup trap that fires on success,
failure, and Ctrl+C.
---
## Adding things
- **Add an agent** — add an `[[agent]]` block; `agents.py up <name>`. No code change.
- **Add a backend** — add a `[backend.<name>]` block (`bin`, `prompt_delivery`, the regexes);
point an agent at it with `backend = "<name>"`.
- **Add / append a phase** — add an entry to `[loop].phases`; the watchdog advances into it
automatically (clearing a stale `SEQUENCE-COMPLETE` if the sequence had finished).
- **Change a model or backend** — edit the field (or a phase's `models = {}`), then
`agents.py down <name> && agents.py up <name>`. The watchdog re-reads the file; it won't fight you.