Unit tests (no agents/tmux): config load + defaults merge, kickoff-template assembly, phase machine (advance/idempotent-complete/append-resumes), limit reset-banner parsing, WAITING-UNTIL/stall parsing, claude+opencode activity detectors.
Live smokes bring a throwaway project up THROUGH agents.py on each real backend in an isolated sandbox (unique prefix, opencode on a non-4096 port), verify attach + status + down, and clean up.
tests/run.sh runs unit always + smokes when backends present; README documents it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
agent-orchestrator
A generic, reusable harness for running and supervising a fleet of AI-agent sessions in tmux.
One driver script + one declarative config (agents.toml) describe every agent — a Builder /
Adversary loop pair, a persistent supervisor, a one-shot task — and a watchdog keeps them
alive, healed, paced, and coordinated. The watchdog reads the same config every tick, so there is
never any env-vs-file drift.
Nothing about any particular project lives in this repo. Paths, the loop kickoff preamble, the
handoff conventions, and the on-complete hook are all supplied by the project's config and
prompt files. A project consumes this repo as a pinned git submodule (engine/) and keeps its
own config, prompts, state, and tmux namespace — total isolation between projects.
agents.py            the driver + watchdog (pure Python stdlib; needs python >= 3.11 for tomllib)
agent-log.py         render claude JSONL transcripts into clean, greppable logs
agents.example.toml  a self-contained 2-agent example project
prompts/             generic role + kickoff templates (builder / adversary / kickoff)
smoke.sh             bring the example up + tear it down in an isolated sandbox, then clean up
tests/               the test suite: unit tests + isolated live backend smokes + a runner
flake.nix/.lock      a Nix devShell with the runtime deps (python311, tmux, git)
Quick start
nix develop # python311 + tmux + git on PATH (see "Nix" below)
python3 agents.py selftest # regression-test the activity detector (no config)
python3 agents.py status --config agents.example.toml # one table: every agent + the phase
./smoke.sh # prove up/down works end-to-end, isolated + clean
python3 agents.py init myproject # scaffold a starter agents.toml + prompts/
up is use-or-create: an already-running session is left alone, never double-started.
python3 agents.py --config agents.toml up # start all enabled agents + services + watchdog
python3 agents.py --config agents.toml up builder # start just one agent (by name)
python3 agents.py --config agents.toml down # stop everything
python3 agents.py --config agents.toml logs builder # tail one session's log
python3 agents.py --config agents.toml phase show # where the loop phase machine is
--config defaults to ./agents.toml, falling back to one next to agents.py.
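The lookup order just described can be sketched in a few lines. This is an illustration only: the function name resolve_config and its parameters are hypothetical and not taken from agents.py.

```python
from pathlib import Path
from typing import Optional


def resolve_config(explicit: Optional[str], cwd: Path, script_dir: Path) -> Path:
    """Pick the config file (hypothetical helper, not the harness's own code)."""
    if explicit is not None:
        # An explicit --config always wins.
        return Path(explicit)
    local = cwd / "agents.toml"
    if local.is_file():
        # Default: ./agents.toml in the current directory.
        return local
    # Fallback: the agents.toml that sits next to agents.py.
    return script_dir / "agents.toml"
```

Passing cwd and script_dir in as arguments keeps the sketch testable without changing the process's working directory.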
The config: agents.toml
Five section types: [watchdog], [backend.<name>], [defaults], [[agent]] / [[service]],
and [loop]. See agents.example.toml for a complete, runnable example.
[watchdog] — global supervisor cadence
[watchdog]
signal_interval = 30 # seconds between light checks (handoff / stall / limit)
heavy_interval = 300 # seconds between heal + phase-advance checks
limit_probe_fallback = 300 # re-probe cadence for a usage-limited agent when reset time is unparsable
limit_reset_slack = 45 # seconds to wait past a parsed reset before probing
stall_grace = 180 # seconds of slack past a WAITING-UNTIL marker before a stall reboot
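As a rough illustration of how limit_reset_slack and limit_probe_fallback might combine, the sketch below picks the time of the next resume probe for a usage-limited agent. Only the two setting names and their values come from the config above; the function next_probe and its signature are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_probe(now: datetime, parsed_reset: Optional[datetime],
               limit_reset_slack: int = 45, limit_probe_fallback: int = 300) -> datetime:
    """Return when to probe a usage-limited agent (hypothetical helper)."""
    if parsed_reset is None:
        # The banner's reset time was unparsable: re-probe on a fixed cadence.
        return now + timedelta(seconds=limit_probe_fallback)
    # The reset time is known: wait a little past it before probing.
    return parsed_reset + timedelta(seconds=limit_reset_slack)


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
reset = datetime(2025, 1, 1, 13, 0, tzinfo=timezone.utc)
print(next_probe(now, reset))  # 45 seconds after the parsed reset
print(next_probe(now, None))   # 300 seconds from now
```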
[defaults] — inherited by every agent
[defaults]
session_prefix = "myproj-" # REQUIRED: tmux namespace for this project. No implicit default.
log_dir = ".ao-state" # REQUIRED: logs + state/. Relative paths resolve against the config dir.
backend = "claude"
model = "claude-sonnet-4-6"
dir = "." # default working dir for agents (relative → project dir)
watch = "heal" # none | heal | heal+stall
project_dir = "." # OPTIONAL: project root for resolving prompts/paths (default: config's dir)
session_prefix and log_dir are required — the harness has no project-specific fallbacks.
Every relative path (log_dir, an agent's dir, handoff.repo, prompt/template files) resolves
against project_dir, which defaults to the directory holding the config file. When the config
lives in a sandbox but the prompts live elsewhere (as smoke.sh does), set project_dir
explicitly.
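The resolution rule above can be sketched as follows. The helper names project_root and resolve_path are hypothetical, and the sketch assumes that a relative project_dir is itself interpreted relative to the config file's directory, which the text does not state.

```python
from pathlib import Path


def project_root(config_path: Path, defaults: dict) -> Path:
    """project_dir from [defaults], else the directory holding the config file."""
    config_dir = config_path.resolve().parent
    # Assumption: a relative project_dir is taken relative to the config dir.
    return (config_dir / defaults.get("project_dir", ".")).resolve()


def resolve_path(root: Path, value: str) -> Path:
    """Absolute paths pass through; relative ones hang off the project root."""
    p = Path(value)
    return p if p.is_absolute() else root / p


# A config in a sandbox whose prompts live in a sibling directory.
root = project_root(Path("/srv/sandbox/agents.toml"), {"project_dir": "../myproj"})
print(resolve_path(root, "prompts/kickoff.md"))
```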
[backend.<name>] — backends declared as data
A backend is fully described by config — no code change to add one. The one field that selects
behavior is prompt_delivery:
| prompt_delivery | how the kickoff reaches the agent | example |
|---|---|---|
| "arg" | passed as a CLI argument (claude-style) | claude … "$(cat kickoff)" |
| "ping" | typed in after a TUI connects (opencode-style) | attach, wait, send-keys |
| "exec" | a plain command; the prompt is written to a file | generic / demo |
[backend.claude]
bin = "claude"
flags = "--dangerously-skip-permissions"
remote_control = true # add a --remote-control <session> flag
supports_resume = true # honor an agent's resume=true
prompt_delivery = "arg"
process_name = "claude" # the pane process a healthy session runs (backend-mismatch healing)
submit_key = "Enter" # key to submit a typed message
stall_idle = 300 # seconds idle before a heal+stall agent is rebooted
active_re = "esc to interrupt|Running tool|· \\d+" # pane shows the agent is WORKING
limit_re = "usage limit|limit reached|reached your .*limit" # usage/rate-limit banner
fatal_re = "redacted_thinking|cannot be modified" # unrecoverable session state → kill + restart
[backend.opencode] # a TUI backend
bin = "opencode"
attach = "{bin} attach {server} --dir {dir}"
server = "http://127.0.0.1:4096"
prompt_delivery = "ping"
process_name = "opencode"
footer_ui = true # a static footer lingers after a turn → only the bottom = activity
log_grace = 180 # within this many seconds of a log write, treat as active
connect_delay = 12 # seconds to wait for the TUI before typing
submit_key = "C-m"
model_env = true # pass the model via OPENCODE_CONFIG_CONTENT
preamble = "set -a; . ./.env; set +a" # shell run before launch (e.g. load creds)
active_re = "esc interrupt|thinking|running tool|preparing patch"
limit_re = "usage limit|limit reached"
[backend.demo] # a dependency-free backend for testing the harness mechanics
bin = "echo '[demo] {session} up'; exec sleep 1000000"
prompt_delivery = "exec" # {kickoff}=prompt file, {session}=session name, {model}=model
For an "arg" backend the flag templates are configurable (so you can point at a non-claude
CLI): resume_flag (default --resume '{id}'), model_flag (default --model '{model}'),
remote_control_flag (default --remote-control '{session}'). A backend that sets process_name
participates in backend-mismatch healing; one that doesn't (e.g. demo) never does.
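To make the flag templates concrete, here is a minimal sketch of how an "arg" backend's launch command could be assembled from them. The three template names and their defaults come from the paragraph above; the function build_arg_command, the order of the flags, and the omission of the kickoff argument are assumptions made for illustration.

```python
from typing import Optional

# Defaults as documented above; a backend block may override any of them.
DEFAULT_FLAGS = {
    "resume_flag": "--resume '{id}'",
    "model_flag": "--model '{model}'",
    "remote_control_flag": "--remote-control '{session}'",
}


def build_arg_command(backend: dict, *, model: str, session: str,
                      resume_id: Optional[str] = None) -> str:
    """Render bin + flags for an "arg" backend (hypothetical helper)."""
    tpl = {**DEFAULT_FLAGS, **{k: backend[k] for k in DEFAULT_FLAGS if k in backend}}
    parts = [backend["bin"]]
    if backend.get("flags"):
        parts.append(backend["flags"])
    parts.append(tpl["model_flag"].format(model=model))
    if backend.get("remote_control"):
        parts.append(tpl["remote_control_flag"].format(session=session))
    # Resume only when the backend supports it and a stored id exists.
    if resume_id and backend.get("supports_resume"):
        parts.append(tpl["resume_flag"].format(id=resume_id))
    return " ".join(parts)


claude = {"bin": "claude", "flags": "--dangerously-skip-permissions",
          "remote_control": True, "supports_resume": True}
print(build_arg_command(claude, model="claude-sonnet-4-6",
                        session="myproj-builder", resume_id="abc"))
```

Pointing at a different CLI then only means overriding a template in the backend block, for example model_flag = "-m {model}".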
[[agent]] — one block per agent
[[agent]]
name = "builder" # tmux session defaults to <session_prefix><name>; override with session=
kind = "loop" # loop | persistent | task
backend = "claude" # overrides defaults.backend
model = "claude-opus-4-8" # overrides defaults.model
dir = "." # working dir (relative → project dir)
role = "builder" # loop agents only: role prompt = <roles_dir>/<role>.md
resume = true # (arg backends with supports_resume) --resume <state/<name>.id>
watch = "heal+stall" # none | heal | heal+stall
enabled = true # false = not started by a bare `up`, not supervised
wake = { interval = 3600, prompt_file = "prompts/supervise.md" } # periodic nudge
prompt = """inline startup text""" # persistent/task agents; OR prompt_file = "path.md"
log_signature = "PROJECT PHASE" # optional: disambiguate agents that share a dir (agent-log.py)
| kind | prompt source | typical watch |
|---|---|---|
| loop | auto-built: kickoff template + prompts/<role>.md | heal+stall |
| persistent | prompt / prompt_file (+ optional resume, wake) | heal |
| task | prompt (runs once, then idles) | none, enabled=false |
watch policy:
| value | behavior |
|---|---|
| none | ignored by the watchdog entirely |
| heal | restart if the session is dead, FATAL-wedged, or running the wrong backend; pause all healing while inside a usage-limit window; never reboot just for being idle |
| heal+stall | everything in heal, plus reboot if idle past stall_idle, respecting any WAITING-UNTIL: <ISO-8601> self-wake marker the agent prints as its last line |
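The heal+stall rule can be sketched as a single decision function. The names should_reboot and waiting_until are hypothetical, the regular expression is a guess at the marker's shape, and treating a timestamp without an offset as UTC is an assumption; only stall_idle, stall_grace and the WAITING-UNTIL marker itself come from this document.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

WAITING_RE = re.compile(r"WAITING-UNTIL:\s*(\S+)")


def waiting_until(pane_text: str) -> Optional[datetime]:
    """Parse a self-wake marker from the last non-blank line, if any."""
    lines = [ln for ln in pane_text.splitlines() if ln.strip()]
    if not lines:
        return None
    m = WAITING_RE.search(lines[-1])
    if not m:
        return None
    try:
        ts = datetime.fromisoformat(m.group(1))
    except ValueError:
        return None
    # Assumption: a timestamp without an offset is treated as UTC.
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def should_reboot(pane_text: str, idle_seconds: float, now: datetime,
                  stall_idle: int = 300, stall_grace: int = 180) -> bool:
    """True when a heal+stall agent looks stalled (hypothetical helper)."""
    if idle_seconds < stall_idle:
        return False
    wake = waiting_until(pane_text)
    # An agent that announced a wake time gets until that time plus the grace.
    if wake is not None and now <= wake + timedelta(seconds=stall_grace):
        return False
    return True
```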
[[service]] — non-AI helper processes
[[service]]
name = "cleanlogs"
command = "python3 agent-log.py follow-all"
Started by a bare up, killed by down. Just a supervised command in a tmux session.
[loop] — the phase state machine (governs kind="loop" agents)
[loop]
state_file = "phase-idx" # under <log_dir>/state/
resume_phase = true # keep the phase index across restarts (don't reset to 0)
auto_advance = true # advance when the current phase's status file says done_marker
done_marker = "## DONE"
kickoff_template = "prompts/kickoff.md" # project preamble; slots {phase_id}/{plan}/{status}/{role}
roles_dir = "prompts" # role prompt = <roles_dir>/<role>.md
handoff = { repo = ".", claim_pings = "adversary", review_pings = "builder",
inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"],
claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
on_complete = { trigger_file = ".run-on-complete", run = "reporter" } # run task agent on completion
phases = [
{ id = "p1", plan = "plans/p1.md", status = "STATUS-p1.md" },
{ id = "p2", plan = "plans/p2.md", status = "STATUS-p2.md", models = { builder = "claude-opus-4-8" } },
]
- Kickoff template. A loop agent's prompt is kickoff_template (with {phase_id}, {plan}, {status}, {role} substituted from the current phase) followed by <roles_dir>/<role>.md. Both are project files; this repo ships generic starters in prompts/. There is no built-in preamble text.
- Per-phase model override. A phase's models = { builder = "...", adversary = "..." } overrides those agents' model for just that phase (matched on the agent's role).
- Auto-advance. Each heavy tick, if the current phase's status file (looked up in handoff.repo's state_subdir/ then its root) contains a real done_marker (not a "Not yet…" placeholder), the watchdog stops the loops, bumps the phase index, and restarts them on the next phase. After the last phase it writes a SEQUENCE-COMPLETE marker under log_dir and stops the loops (idempotent, no churn). Appending a phase later clears the stale marker and resumes. On completion, an optional on_complete.run task agent fires if its trigger_file exists under log_dir.
- Handoff signalling. The watchdog watches handoff.repo's origin/main for commits whose subject matches claim_pattern / review_pattern, and watches the two inboxes files. When a claim lands it pings the claim_pings agent; a review pings review_pings; an inbox change pings the relevant side. This is how the Builder and Adversary coordinate purely through git.
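The auto-advance behaviour (advance, idempotent completion, resume after an appended phase) can be sketched as a small state step. This is an illustration under assumptions: the function step_phase and its return strings are hypothetical, the phase index is assumed to be a 0-based integer stored as text, and detecting a real done_marker is left outside the sketch and passed in as phase_done.

```python
from pathlib import Path


def step_phase(log_dir: Path, n_phases: int, phase_done: bool) -> str:
    """One auto-advance decision per heavy tick (hypothetical helper)."""
    state_dir = log_dir / "state"
    state_dir.mkdir(parents=True, exist_ok=True)
    idx_file = state_dir / "phase-idx"
    complete = log_dir / "SEQUENCE-COMPLETE"

    idx = int(idx_file.read_text()) if idx_file.exists() else 0

    if idx >= n_phases:
        # Past the last phase: keep the marker, change nothing else.
        complete.touch()
        return "complete"
    if complete.exists():
        # A phase was appended after the sequence finished: clear the stale marker.
        complete.unlink()
        return "resumed"
    if not phase_done:
        return "running"

    idx += 1
    idx_file.write_text(str(idx))
    if idx >= n_phases:
        complete.touch()
        return "complete"
    return "advanced"
```

Calling it repeatedly after the last phase leaves the index and the marker unchanged, which is the "no churn" property described in the Auto-advance bullet.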
Config vs state
- Config = agents.toml (declarative, version-controlled, the only source of truth).
- State = <log_dir>/state/ (machine-written runtime only): phase-idx (current phase), <name>.id (resume id), limited-<session>.json (active usage-limit window), kickoff-<session>.txt (the exact prompt last sent). Git-ignore your log_dir.
- Env = a one-off override for a single invocation only: AGENT_MODEL_<name>=… / AGENT_BACKEND_<name>=…. The persisted watchdog ignores env and re-reads the file every tick, deliberately, so env-vs-file drift can never silently revert a backend.
The driver: verbs
The recommended (not required) verb set — an AI project-orchestrator can rely on these being present, but a harness is free to add more:
agents.py up [name…]                 start enabled agents (+ services + watchdog); use-or-create
agents.py down [name…]               stop agents/services/watchdog (all, or named)
agents.py status                     table of every agent: kind, backend, model, watch, state, phase
agents.py watchdog                   the supervisor loop (what the <prefix>watchdog session runs)
agents.py logs <name>                tail that session's log
agents.py phase [show|next|set N]    inspect / move the loop phase index
agents.py selftest                   regression-test the backend activity detector (needs no config)
agents.py init [dir]                 scaffold a starter agents.toml + prompts/ in a project dir
--config PATH                        use a specific config (default: ./agents.toml)
The watchdog tick
agents.py watchdog runs as the <prefix>watchdog tmux session and re-reads the config every
tick. Each loop:
- signal tick (signal_interval): handoff pings; for each watched agent the usage-limit check, and for heal+stall agents the stall check; fire any due wake.
- heavy tick (heavy_interval): advance the loop phase if the current one is done; otherwise heal each watched agent per its watch policy. When the sequence is complete the finished loops stay stopped, but persistent agents stay supervised.
Usage-limit handling: when an agent prints a limit banner, the watchdog parses the reset time, arms a quiet window (never rebooting a limited agent), and at the end sends one probe to resume it — re-arming if the banner re-prints.
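The two cadences can be pictured as one loop that re-reads the config on every pass. The sketch below is a simplified model, not the watchdog's code: run_watchdog, its callback parameters and max_loops are hypothetical, and it assumes the loop sleeps for signal_interval between passes and runs a heavy tick on the very first pass.

```python
import time
from typing import Callable, Optional


def run_watchdog(load_config: Callable[[], dict],
                 signal_tick: Callable[[dict], None],
                 heavy_tick: Callable[[dict], None],
                 *, clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep,
                 max_loops: Optional[int] = None) -> None:
    """Run light checks every pass and heavy checks when due (hypothetical)."""
    last_heavy = float("-inf")  # assumption: the first pass is also a heavy tick
    loops = 0
    while max_loops is None or loops < max_loops:
        cfg = load_config()  # re-read the file every tick, never cache it
        wd = cfg["watchdog"]
        signal_tick(cfg)     # handoff pings, limit + stall checks, due wakes
        now = clock()
        if now - last_heavy >= wd["heavy_interval"]:
            heavy_tick(cfg)  # phase advance, otherwise heal per watch policy
            last_heavy = now
        sleep(wd["signal_interval"])
        loops += 1
```

Injecting clock and sleep lets the loop be driven by a fake clock in a unit test, so no real time passes and no tmux session is needed.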
Driving the harness from an AI project-orchestrator
This harness is designed to be driven by an AI "project-orchestrator" (PO) that creates and runs many projects, each pinning its own copy of this engine. The contract is intentionally not rigid — the PO reads these docs and works out how to drive a project. What it can rely on:
- One config, one driver. Everything the PO needs to know about a project's agents is in that project's agents.toml; everything it can do is a verb above. To inspect, status. To start or stop, up / down. To move the phase, phase.
- Isolation by session_prefix. Two projects never collide as long as their session_prefix values differ. The PO assigns each project a unique prefix at creation.
- State is on disk, not in the PO. Phase index, resume ids and limit windows live under the project's log_dir. The PO can restart a project (or the whole host) and the watchdog resumes from there.
- Knowledge is one-directional. A project repo contains nothing about the PO or the fleet; it can be run by hand and would have no idea a PO exists. The PO's fleet registry is the only record of which projects exist and at what engine ref. This repo never reaches "up" toward a PO.
- Submodule pin = the engine version. A project pins this repo at a tag (e.g. v0.1.0) as a submodule under engine/. Bumping is per-project and opt-in (git submodule update --remote); one project's bump can't break another.
A minimal project layout the PO scaffolds:
my-project/        # its own repo; knows nothing about the PO
  agents.toml      # harness config (this schema)
  engine/          # this repo as a pinned submodule
  prompts/         # role prompts + kickoff template
  machine-docs/    # the loop pair's coordination files (STATUS/REVIEW/inboxes)
  .ao-state/       # runtime state + logs (gitignored)
  .env             # project creds (never in git)
Run it by hand with engine/agents.py up --config agents.toml.
Nix
A flake.nix provides a reproducible devShell with the runtime deps (python311 for stdlib
tomllib, plus tmux and git):
nix develop # enter the shell
nix develop -c python3 agents.py selftest # or run one command in it
nix flake check # evaluate + build the devShell
The agent CLIs themselves (claude, opencode) are external, non-Nix tools — install them
per their own docs and make sure they are on PATH before launching live agents. The devShell
documents this in its banner.
Testing
The tests/ directory holds the harness's own test suite. One runner drives everything:
nix develop -c ./tests/run.sh # unit tests always; live backend smokes when available
# or just: ./tests/run.sh # (python3 + tmux must be on PATH)
What it runs:
- Unit tests (tests/test_unit.py): pure logic, no agents spawned, no live tmux sessions. They cover config load + defaults merge, kickoff-template assembly, the phase machine (advance on the done marker, idempotent sequence-complete, append-a-phase resumes), usage-limit reset-banner parsing, WAITING-UNTIL / stall parsing, and the per-backend activity detectors (claude + opencode footers). They always run; a failure fails the suite. Run them alone with python3 -m unittest discover -s tests (or python3 tests/test_unit.py).
- Live backend smokes (tests/smoke_claude.sh, tests/smoke_opencode.sh): each brings a throwaway scratch project up through agents.py on a real backend, in a fully isolated sandbox (its own unique session_prefix, a temp log_dir, and, for opencode, a dedicated server on a non-default port AOTEST_OC_PORT, default 4097), confirms the session attaches and status reports it RUNNING, then downs it and cleans up (no leftover sessions, port freed). Each SKIPs gracefully (exit 0) when its backend's binary or creds are unavailable. Useful env: CLAUDE_BIN / OPENCODE_BIN, AOTEST_MODEL, AOTEST_OC_PORT, AOTEST_OC_CREDS.
- Isolation sanity: after the live runs, the runner asserts no aotest-* tmux sessions leaked and reports that any live sessions are untouched.
The smokes are safe by construction: a unique per-run session prefix (never cc-ci- or any real
project's), a dedicated opencode port (never 4096), and a cleanup trap that fires on success,
failure, and Ctrl+C.
Adding things
- Add an agent: add an [[agent]] block; agents.py up <name>. No code change.
- Add a backend: add a [backend.<name>] block (bin, prompt_delivery, the regexes); point an agent at it with backend = "<name>".
- Add / append a phase: add an entry to [loop].phases; the watchdog advances into it automatically (clearing a stale SEQUENCE-COMPLETE if the sequence had finished).
- Change a model or backend: edit the field (or a phase's models = {}), then agents.py down <name> && agents.py up <name>. The watchdog re-reads the file; it won't fight you.