# agent-orchestrator

A generic, reusable harness for running and supervising a fleet of AI-agent sessions in **tmux**. One driver script + one declarative config (`agents.toml`) describe every agent — a Builder / Adversary loop pair, a persistent supervisor, a one-shot task — and a **watchdog** keeps them alive, healed, paced, and coordinated. The watchdog reads the same config every tick, so there is never any env-vs-file drift.

Nothing about any particular project lives in this repo. Paths, the loop **kickoff preamble**, the **handoff conventions**, and the **on-complete** hook are all supplied by the project's config and prompt files. A project consumes this repo as a pinned **git submodule** (`engine/`) and keeps its own config, prompts, state, and tmux namespace — total isolation between projects.

```
agents.py            the driver + watchdog (pure Python stdlib; needs python >= 3.11 for tomllib)
agent-log.py         render claude JSONL transcripts into clean, greppable logs
agents.example.toml  a self-contained 2-agent example project
prompts/             generic role + kickoff templates (builder / adversary / kickoff)
examples/            runnable example projects — the Builder/Adversary variant family, snakepit, …
smoke.sh             bring the example up + tear it down in an isolated sandbox, then clean up
tests/               the test suite — unit tests + isolated live backend smokes + a runner
flake.nix/.lock      a Nix devShell with the runtime deps (python311, tmux, git)
```

---

## Quick start

```bash
nix develop                                              # python311 + tmux + git on PATH (see "Nix" below)
python3 agents.py selftest                               # regression-test the activity detector (no config)
python3 agents.py status --config agents.example.toml    # one table: every agent + the phase
./smoke.sh                                               # prove up/down works end-to-end, isolated + clean
python3 agents.py init myproject                         # scaffold a starter agents.toml + prompts/
```

`up` is **use-or-create**: an already-running session is left alone, never double-started.
```bash
python3 agents.py --config agents.toml up             # start all enabled agents + services + watchdog
python3 agents.py --config agents.toml up builder     # start just one agent (by name)
python3 agents.py --config agents.toml down           # stop everything
python3 agents.py --config agents.toml logs builder   # tail one session's log
python3 agents.py --config agents.toml phase show     # where the loop phase machine is
```

`--config` defaults to `./agents.toml`, falling back to one next to `agents.py`.

---

## Examples

`examples/` holds runnable example projects — copy one, point `agents.py` at its `agents.toml`, and go. The headline set is a family of **Builder/Adversary** variants that build the *same* task but each differ in one dimension — useful both as templates and as a study of the pattern:

- **`builder-adversary`** — the canonical loop pair: a Builder that builds and an Adversary that cold-verifies every claim, coordinating only through git (`claim(`/`review(` commits + the watchdog handoff). **Start here.**
- **`builder-adversary-min`** — the same pattern with the prompts compressed to minimal tokens.
- **`builder-adversary-stateless`** — `builder-adversary` + **context hygiene** (compact at each checkpoint, read diffs not trees, lean loads) to minimise carried/reloaded context.
- **`builder-adversary-lean`** — context hygiene + **per-gate** review (one claim/verdict per gate).
- **`builder-adversary-deferred`** — the Adversary verifies **once**, after the whole build, in a final comprehensive `review` phase (vs per-phase / per-gate).
- **`builder-solo`** — a single Builder that self-certifies, with **no Adversary** (the control).
- **`snakepit`** — a different topology entirely: a pool of identical worker "snakes" pulling tasks from a shared filesystem queue, plus cleanup specialists.

(`examples/IDEAS.md` sketches more.) Each example has its own `README.md`.

Run one by hand:

```bash
cd examples/builder-adversary
python3 ../../agents.py status --config agents.toml   # read-only
python3 ../../agents.py up --config agents.toml       # needs `claude` on PATH
```

**Benchmark.** The separate [`agent-orchestrator-benchmark`](https://git.autonomic.zone/recipe-maintainers/agent-orchestrator-benchmark) repo runs these Builder/Adversary variants head-to-head (N=5, real `agents.py up` runs) to measure what drives token cost. Short version: an independent adversary costs **~4.7×** a solo builder, but the review *cadence* (per-gate / per-phase / deferred) is **nearly token-neutral**, and **context hygiene** is the one clean **~−22%** win. See that repo's `FINDINGS.md`.

---

## The config: `agents.toml`

Five section types: `[watchdog]`, `[backend.]`, `[defaults]`, `[[agent]]` / `[[service]]`, and `[loop]`. See `agents.example.toml` for a complete, runnable example.

### `[watchdog]` — global supervisor cadence

```toml
[watchdog]
signal_interval = 30        # seconds between light checks (handoff / stall / limit)
heavy_interval = 300        # seconds between heal + phase-advance checks
limit_probe_fallback = 300  # re-probe cadence for a usage-limited agent when reset time is unparsable
limit_reset_slack = 45      # seconds to wait past a parsed reset before probing
stall_grace = 180           # seconds of slack past a WAITING-UNTIL marker before a stall reboot
log_tokens = false          # opt-in: record per-phase token + time usage (see below)
```

**Per-phase token + time logging (`log_tokens`).** Set `log_tokens = true` (under `[watchdog]` or `[loop]`) and the watchdog records, for **each phase**, how many tokens **each agent** used and how long the phase took — appended as one JSON object per phase to `/token-log.jsonl`. Tokens are summed from each agent's Claude Code session transcript and attributed **by working dir**, so give each agent its own `dir` (the Builder/Adversary loop pair already uses separate clones) for accurate per-agent numbers.
The watchdog snapshots a baseline when a phase starts and writes the delta (per agent, and the total) when the phase advances or the sequence completes — robust across watchdog restarts. Pretty-print it with `agents.py tokens`:

```
phase     dur(s)    builder   adversary       TOTAL
-----------------------------------------------------
lex        372.0  3,910,118   3,221,447   7,131,565
parse      410.5  ...
```

### `[defaults]` — inherited by every agent

```toml
[defaults]
session_prefix = "myproj-"   # REQUIRED: tmux namespace for this project. No implicit default.
log_dir = ".ao-state"        # REQUIRED: logs + state/. Relative paths resolve against the config dir.
backend = "claude"
model = "claude-sonnet-4-6"
dir = "."                    # default working dir for agents (relative → project dir)
watch = "heal"               # none | heal | heal+stall
project_dir = "."            # OPTIONAL: project root for resolving prompts/paths (default: config's dir)
```

`session_prefix` and `log_dir` are **required** — the harness has no project-specific fallbacks. Every relative path (`log_dir`, an agent's `dir`, `handoff.repo`, prompt/template files) resolves against `project_dir`, which defaults to the directory holding the config file. When the config lives in a sandbox but the prompts live elsewhere (as `smoke.sh` does), set `project_dir` explicitly.

### `[backend.]` — backends declared as data

A backend is fully described by config — no code change to add one.
The one field that selects behavior is `prompt_delivery`:

| `prompt_delivery` | how the kickoff reaches the agent | example |
|---|---|---|
| `"arg"` | passed as a CLI argument (claude-style) | `claude … "$(cat kickoff)"` |
| `"ping"` | typed in after a TUI connects (opencode-style) | attach, wait, send-keys |
| `"exec"` | a plain command; the prompt is written to a file | generic / demo |

```toml
[backend.claude]
bin = "claude"
flags = "--dangerously-skip-permissions"
remote_control = true        # add a --remote-control flag
supports_resume = true       # honor an agent's resume=true
prompt_delivery = "arg"
process_name = "claude"      # the pane process a healthy session runs (backend-mismatch healing)
submit_key = "Enter"         # key to submit a typed message
stall_idle = 300             # seconds idle before a heal+stall agent is rebooted
active_re = "esc to interrupt|Running tool|· \\d+"            # pane shows the agent is WORKING
limit_re = "usage limit|limit reached|reached your .*limit"   # usage/rate-limit banner
fatal_re = "redacted_thinking|cannot be modified"             # unrecoverable session state → kill + restart

[backend.opencode]           # a TUI backend
bin = "opencode"
attach = "{bin} attach {server} --dir {dir}"
server = "http://127.0.0.1:4096"
prompt_delivery = "ping"
process_name = "opencode"
footer_ui = true             # a static footer lingers after a turn → only the bottom = activity
log_grace = 180              # within this many seconds of a log write, treat as active
connect_delay = 12           # seconds to wait for the TUI before typing
submit_key = "C-m"
model_env = true             # pass the model via OPENCODE_CONFIG_CONTENT
preamble = "set -a; . ./.env; set +a"   # shell run before launch (e.g. load creds)
active_re = "esc interrupt|thinking|running tool|preparing patch"
limit_re = "usage limit|limit reached"

[backend.demo]               # a dependency-free backend for testing the harness mechanics
bin = "echo '[demo] {session} up'; exec sleep 1000000"
prompt_delivery = "exec"     # {kickoff}=prompt file, {session}=session name, {model}=model
```

For an `"arg"` backend the flag *templates* are configurable (so you can point at a non-claude CLI): `resume_flag` (default `--resume '{id}'`), `model_flag` (default `--model '{model}'`), `remote_control_flag` (default `--remote-control '{session}'`). A backend that sets `process_name` participates in backend-mismatch healing; one that doesn't (e.g. `demo`) never does.

### `[[agent]]` — one block per agent

```toml
[[agent]]
name = "builder"             # tmux session defaults to ; override with session=
kind = "loop"                # loop | persistent | task
backend = "claude"           # overrides defaults.backend
model = "claude-opus-4-8"    # overrides defaults.model
dir = "."                    # working dir (relative → project dir)
role = "builder"             # loop agents only: role prompt = /.md
resume = true                # (arg backends with supports_resume) --resume .id>
watch = "heal+stall"         # none | heal | heal+stall
enabled = true               # false = not started by a bare `up`, not supervised
wake = { interval = 3600, prompt_file = "prompts/supervise.md" }   # periodic nudge
prompt = """inline startup text"""   # persistent/task agents; OR prompt_file = "path.md"
log_signature = "PROJECT PHASE"      # optional: disambiguate agents that share a dir (agent-log.py)
```

| kind | prompt source | typical `watch` |
|---|---|---|
| `loop` | auto-built: kickoff template + `prompts/.md` | `heal+stall` |
| `persistent` | `prompt` / `prompt_file` (+ optional `resume`, `wake`) | `heal` |
| `task` | `prompt` (runs once, then idles) | `none`, `enabled=false` |

**`watch` policy:**

| value | behavior |
|---|---|
| `none` | ignored by the watchdog entirely |
| `heal` | restart if the session is dead, FATAL-wedged, or running the wrong backend; pause all healing while inside a usage-limit window; **never** reboot just for being idle |
| `heal+stall` | everything in `heal`, **plus** reboot if idle past `stall_idle` — respecting any `WAITING-UNTIL: ` self-wake marker the agent prints as its last line |

### `[[service]]` — non-AI helper processes

```toml
[[service]]
name = "cleanlogs"
command = "python3 agent-log.py follow-all"
```

Started by a bare `up`, killed by `down`. Just a supervised command in a tmux session.

### `[loop]` — the phase state machine (governs `kind="loop"` agents)

```toml
[loop]
state_file = "phase-idx"     # under /state/
resume_phase = true          # keep the phase index across restarts (don't reset to 0)
auto_advance = true          # advance when the current phase's status file says done_marker
done_marker = "## DONE"
kickoff_template = "prompts/kickoff.md"   # project preamble; slots {phase_id}/{plan}/{status}/{role}
roles_dir = "prompts"        # role prompt = /.md
handoff = { repo = ".", claim_pings = "adversary", review_pings = "builder", inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"], claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
on_complete = { trigger_file = ".run-on-complete", run = "reporter" }   # run task agent on completion
phases = [
  { id = "p1", plan = "plans/p1.md", status = "STATUS-p1.md" },
  { id = "p2", plan = "plans/p2.md", status = "STATUS-p2.md", models = { builder = "claude-opus-4-8" } },
]
```

- **Kickoff template.** A loop agent's prompt is `kickoff_template` (with `{phase_id}`, `{plan}`, `{status}`, `{role}` substituted from the current phase) followed by `/.md`. Both are project files; this repo ships generic starters in `prompts/`. There is no built-in preamble text.
- **Per-phase model override.** A phase's `models = { builder = "...", adversary = "..." }` overrides those agents' model for just that phase (matched on the agent's `role`).
- **Auto-advance.** Each heavy tick, if the current phase's `status` file (looked up in `handoff.repo`'s `state_subdir/` then its root) contains a real `done_marker` — not a "Not yet…" placeholder — the watchdog stops the loops, bumps the phase index, and restarts them on the next phase. After the last phase it writes a `SEQUENCE-COMPLETE` marker under `log_dir` and stops the loops (idempotent — no churn). Appending a phase later clears the stale marker and resumes. On completion, an optional `on_complete.run` task agent fires if its `trigger_file` exists under `log_dir`.
- **Handoff signalling.** The watchdog watches `handoff.repo`'s `origin/main` for commits whose subject matches `claim_pattern` / `review_pattern`, and watches the two `inboxes` files. When a claim lands it pings the `claim_pings` agent; a review pings `review_pings`; an inbox change pings the relevant side. This is how the Builder and Adversary coordinate purely through git.

---

## Config vs state

- **Config** = `agents.toml` — declarative, version-controlled, the only source of truth.
- **State** = `/state/` — machine-written runtime only: `phase-idx` (current phase), `.id` (resume id), `limited-.json` (active usage-limit window), `kickoff-.txt` (the exact prompt last sent). Git-ignore your `log_dir`.
- **Env** = a one-off override for a *single* invocation only: `AGENT_MODEL_=…` / `AGENT_BACKEND_=…`. The persisted watchdog ignores env and re-reads the file every tick — deliberately, so env-vs-file drift can never silently revert a backend.

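
As a reading aid for how the config layers combine, here is a minimal sketch of model resolution for one agent in one phase. It assumes the precedence implied by the comments in the config sections: a phase's `models` entry (matched on the agent's `role`) wins over the agent's own `model`, which wins over `[defaults].model`. The helper name `effective_model` and the dict layout are hypothetical; this is not the harness's actual code, and the one-off env override is deliberately left out.

```python
def effective_model(cfg, agent_name, phase_idx):
    """Resolve an agent's model for one phase (hypothetical helper).

    Assumed precedence: phase `models` override (matched on role)
    > the agent's own `model` > `[defaults].model`.
    """
    agent = next(a for a in cfg["agent"] if a["name"] == agent_name)
    phase = cfg["loop"]["phases"][phase_idx]
    # A phase without a `models` table, or an agent without a role, yields None here.
    override = phase.get("models", {}).get(agent.get("role"))
    return override or agent.get("model") or cfg["defaults"]["model"]


# A dict shaped like a parsed agents.toml (values are illustrative only).
cfg = {
    "defaults": {"model": "claude-sonnet-4-6"},
    "agent": [
        {"name": "builder", "role": "builder"},
        {"name": "adversary", "role": "adversary", "model": "claude-opus-4-8"},
    ],
    "loop": {"phases": [
        {"id": "p1"},
        {"id": "p2", "models": {"builder": "claude-opus-4-8"}},
    ]},
}

for idx, phase in enumerate(cfg["loop"]["phases"]):
    print(phase["id"],
          effective_model(cfg, "builder", idx),
          effective_model(cfg, "adversary", idx))
```

In this sketch the builder inherits the default model in `p1` and picks up the phase override in `p2`, while the adversary keeps its own `model` in both phases.
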

---

## The driver: verbs

The recommended (not required) verb set — an AI project-orchestrator can rely on these being present, but a harness is free to add more:

```
agents.py up [name…]              start enabled agents (+ services + watchdog); use-or-create
agents.py down [name…]            stop agents/services/watchdog (all, or named)
agents.py status                  table of every agent: kind, backend, model, watch, state, phase
agents.py watchdog                the supervisor loop (what the watchdog session runs)
agents.py logs                    tail that session's log
agents.py phase [show|next|set N] inspect / move the loop phase index
agents.py tokens                  per-phase token + time report (when [watchdog].log_tokens = true)
agents.py selftest                regression-test the backend activity detector (needs no config)
agents.py init [dir]              scaffold a starter agents.toml + prompts/ in a project dir
  --config PATH                   use a specific config (default: ./agents.toml)
```

### The watchdog tick

`agents.py watchdog` runs as the `watchdog` tmux session and **re-reads the config every tick**. Each loop:

- **signal tick** (`signal_interval`): handoff pings; for each watched agent the usage-limit check, and for `heal+stall` agents the stall check; fire any due `wake`.
- **heavy tick** (`heavy_interval`): advance the loop phase if the current one is done; otherwise heal each watched agent per its `watch` policy. When the sequence is complete the finished loops stay stopped, but persistent agents stay supervised.

**Usage-limit handling:** when an agent prints a limit banner, the watchdog parses the reset time, arms a quiet window (never rebooting a limited agent), and at the end sends one probe to resume it — re-arming if the banner re-prints.

---

## Driving the harness from an AI project-orchestrator

This harness is designed to be driven by an AI "project-orchestrator" (PO) that creates and runs many projects, each pinning its own copy of this engine. The contract is intentionally **not rigid** — the PO reads these docs and works out how to drive a project.
What it can rely on:

1. **One config, one driver.** Everything the PO needs to know about a project's agents is in that project's `agents.toml`; everything it can *do* is a verb above. To inspect, `status`. To start or stop, `up` / `down`. To move the phase, `phase`.
2. **Isolation by `session_prefix`.** Two projects never collide as long as their `session_prefix` values differ. The PO assigns each project a unique prefix at creation.
3. **State is on disk, not in the PO.** Phase index, resume ids and limit windows live under the project's `log_dir`. The PO can restart a project (or the whole host) and the watchdog resumes from there.
4. **Knowledge is one-directional.** A project repo contains nothing about the PO or the fleet — it can be run by hand and would have no idea a PO exists. The PO's fleet registry is the only record of which projects exist and at what engine ref. This repo never reaches "up" toward a PO.
5. **Submodule pin = the engine version.** A project pins this repo at a tag (e.g. `v0.1.0`) as a submodule under `engine/`. Bumping is per-project and opt-in (`git submodule update --remote`); one project's bump can't break another.

A minimal project layout the PO scaffolds:

```
my-project/          # its own repo; knows nothing about the PO
  agents.toml        # harness config (this schema)
  engine/            # this repo as a pinned submodule
  prompts/           # role prompts + kickoff template
  machine-docs/      # the loop pair's coordination files (STATUS/REVIEW/inboxes)
  .ao-state/         # runtime state + logs (gitignored)
  .env               # project creds (never in git)
```

Run it by hand with `engine/agents.py up --config agents.toml`.

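
The isolation rule in the list above (each project needs its own `session_prefix`) is easy for a PO to check before scaffolding a new project. The following is a minimal sketch under stated assumptions: the fleet registry is modelled as a plain dict of project name to prefix, and `prefix_collisions` is a hypothetical helper, not something this repo ships.

```python
from collections import defaultdict


def prefix_collisions(fleet):
    """Return {prefix: [projects...]} for every session_prefix shared by 2+ projects.

    `fleet` maps a project name to its session_prefix (a hypothetical
    in-memory stand-in for whatever registry the PO really keeps).
    """
    by_prefix = defaultdict(list)
    for project, prefix in fleet.items():
        by_prefix[prefix].append(project)
    return {p: sorted(names) for p, names in by_prefix.items() if len(names) > 1}


fleet = {"alpha": "alpha-", "beta": "beta-", "gamma": "alpha-"}
print(prefix_collisions(fleet))  # {'alpha-': ['alpha', 'gamma']}
```

An empty result means every project has a distinct prefix. The check only compares prefixes for equality, which is all the isolation rule above requires.
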

---

## Nix

A `flake.nix` provides a reproducible devShell with the runtime deps (`python311` for stdlib `tomllib`, plus `tmux` and `git`):

```bash
nix develop                                  # enter the shell
nix develop -c python3 agents.py selftest    # or run one command in it
nix flake check                              # evaluate + build the devShell
```

The agent CLIs themselves (`claude`, `opencode`) are **external, non-Nix tools** — install them per their own docs and make sure they are on `PATH` before launching live agents. The devShell documents this in its banner.

---

## Testing

The `tests/` directory holds the harness's own test suite. One runner drives everything:

```bash
nix develop -c ./tests/run.sh   # unit tests always; live backend smokes when available
# or just:
./tests/run.sh                  # (python3 + tmux must be on PATH)
```

What it runs:

- **Unit tests** (`tests/test_unit.py`) — pure logic, **no agents spawned, no live tmux sessions**. Cover config load + defaults merge, kickoff-template assembly, the phase machine (advance on the done marker, idempotent sequence-complete, append-a-phase resumes), usage-limit reset-banner parsing, `WAITING-UNTIL` / stall parsing, and the per-backend activity detectors (claude + opencode footers). Always run; a failure fails the suite. Run them alone with `python3 -m unittest discover -s tests` (or `python3 tests/test_unit.py`).
- **Live backend smokes** (`tests/smoke_claude.sh`, `tests/smoke_opencode.sh`) — each brings a throwaway scratch project up **through `agents.py`** on a real backend, in a fully isolated sandbox (its own unique `session_prefix`, a temp `log_dir`, and — for opencode — a dedicated server on a non-default port `AOTEST_OC_PORT`, default `4097`), confirms the session attaches and `status` reports it RUNNING, then `down`s it and cleans up (no leftover sessions, port freed). Each **SKIPs gracefully** (exit 0) when its backend's binary or creds are unavailable. Useful env: `CLAUDE_BIN` / `OPENCODE_BIN`, `AOTEST_MODEL`, `AOTEST_OC_PORT`, `AOTEST_OC_CREDS`.
- **Isolation sanity** — after the live runs, the runner asserts no `aotest-*` tmux sessions leaked and reports that any live sessions are untouched.

The smokes are safe by construction: a unique per-run session prefix (never `cc-ci-` or any real project's), a dedicated opencode port (never `4096`), and a cleanup trap that fires on success, failure, and Ctrl+C.

---

## Adding things

- **Add an agent** — add an `[[agent]]` block; `agents.py up `. No code change.
- **Add a backend** — add a `[backend.]` block (`bin`, `prompt_delivery`, the regexes); point an agent at it with `backend = ""`.
- **Add / append a phase** — add an entry to `[loop].phases`; the watchdog advances into it automatically (clearing a stale `SEQUENCE-COMPLETE` if the sequence had finished).
- **Change a model or backend** — edit the field (or a phase's `models = {}`), then `agents.py down && agents.py up `. The watchdog re-reads the file; it won't fight you.
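
The append-a-phase behaviour in the list above can be pictured with a toy model of the phase index. This is a sketch under assumptions, not the harness's implementation: the `PhaseMachine` class is hypothetical, "sequence complete" is modelled as the index having moved past the last phase, and the on-disk `SEQUENCE-COMPLETE` marker is represented only by the `complete` property.

```python
class PhaseMachine:
    """Toy model of the loop phase index (hypothetical, for illustration only)."""

    def __init__(self, phases):
        self.phases = list(phases)
        self.idx = 0

    @property
    def complete(self):
        # Stands in for the SEQUENCE-COMPLETE marker: nothing left to run.
        return self.idx >= len(self.phases)

    def advance(self):
        # Idempotent once complete: repeated calls do not move the index.
        if not self.complete:
            self.idx += 1
        return self.complete

    def append(self, phase_id):
        # Appending a phase makes `complete` false again without touching idx.
        self.phases.append(phase_id)


m = PhaseMachine(["p1", "p2"])
m.advance()
m.advance()
print(m.complete)                   # True
m.advance()                         # no churn: index stays put
print(m.idx)                        # 2
m.append("p3")
print(m.complete, m.phases[m.idx])  # False p3
```

Because the index is never reset, the newly appended phase is the next one to run, which matches the "advances into it automatically" wording above.
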