agent-orchestrator

A generic, reusable harness for running and supervising a fleet of AI-agent sessions in tmux. One driver script + one declarative config (agents.toml) describe every agent — a Builder / Adversary loop pair, a persistent supervisor, a one-shot task — and a watchdog keeps them alive, healed, paced, and coordinated. The watchdog reads the same config every tick, so there is never any env-vs-file drift.

Nothing about any particular project lives in this repo. Paths, the loop kickoff preamble, the handoff conventions, and the on-complete hook are all supplied by the project's config and prompt files. A project consumes this repo as a pinned git submodule (engine/) and keeps its own config, prompts, state, and tmux namespace — total isolation between projects.

agents.py            the driver + watchdog (pure Python stdlib; needs python >= 3.11 for tomllib)
agent-log.py         render claude JSONL transcripts into clean, greppable logs
agents.example.toml  a self-contained 2-agent example project
prompts/             generic role + kickoff templates (builder / adversary / kickoff)
examples/            runnable example projects — the Builder/Adversary variant family, snakepit, …
smoke.sh             bring the example up + tear it down in an isolated sandbox, then clean up
tests/               the test suite — unit tests + isolated live backend smokes + a runner
flake.nix/.lock      a Nix devShell with the runtime deps (python311, tmux, git)

Quick start

nix develop                                   # python311 + tmux + git on PATH (see "Nix" below)

python3 agents.py selftest                    # regression-test the activity detector (no config)
python3 agents.py status --config agents.example.toml   # one table: every agent + the phase
./smoke.sh                                    # prove up/down works end-to-end, isolated + clean

python3 agents.py init myproject              # scaffold a starter agents.toml + prompts/

up is use-or-create: an already-running session is left alone, never double-started.

python3 agents.py --config agents.toml up           # start all enabled agents + services + watchdog
python3 agents.py --config agents.toml up builder    # start just one agent (by name)
python3 agents.py --config agents.toml down          # stop everything
python3 agents.py --config agents.toml logs builder  # tail one session's log
python3 agents.py --config agents.toml phase show    # where the loop phase machine is

--config defaults to ./agents.toml, falling back to one next to agents.py.
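The lookup order described above can be sketched as follows. This is a minimal illustration, not the code in agents.py: the helper name `resolve_config` and its parameters are hypothetical.

```python
import tempfile
from pathlib import Path

def resolve_config(explicit, cwd, script_dir):
    """Return the config path to use (hypothetical helper, not from agents.py)."""
    if explicit is not None:                  # an explicit --config always wins
        return Path(explicit)
    local = Path(cwd) / "agents.toml"         # first default: ./agents.toml
    if local.is_file():
        return local
    return Path(script_dir) / "agents.toml"   # fallback: next to agents.py

# Demo in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    project, engine = Path(tmp, "project"), Path(tmp, "engine")
    project.mkdir()
    engine.mkdir()
    before = resolve_config(None, project, engine)   # no local config yet
    (project / "agents.toml").write_text("")
    after = resolve_config(None, project, engine)    # local config now wins
```

Here `before` points at the copy next to the driver and `after` at the project's own file, which is the behavior the default is meant to give a project that vendors the engine as a submodule.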


Examples

examples/ holds runnable example projects — copy one, point agents.py at its agents.toml, and go. The headline set is a family of Builder/Adversary variants that all build the same task, each differing in one dimension — useful both as templates and as a study of the pattern:

  • builder-adversary — the canonical loop pair: a Builder that builds and an Adversary that cold-verifies every claim, coordinating only through git (commits whose subjects start with claim( or review(, plus the watchdog handoff). Start here.
  • builder-adversary-min — the same pattern with the prompts compressed to minimal tokens.
  • builder-adversary-stateless — builder-adversary + context hygiene (compact at each checkpoint, read diffs not trees, lean loads) to minimise carried/reloaded context.
  • builder-adversary-lean — context hygiene + per-gate review (one claim/verdict per gate).
  • builder-adversary-deferred — the Adversary verifies once, after the whole build, in a final comprehensive review phase (vs per-phase / per-gate).
  • builder-solo — a single Builder that self-certifies, with no Adversary (the control).
  • snakepit — a different topology entirely: a pool of identical worker "snakes" pulling tasks from a shared filesystem queue, plus cleanup specialists. (examples/IDEAS.md sketches more.)

Each example has its own README.md. Run one by hand:

cd examples/builder-adversary
python3 ../../agents.py status --config agents.toml      # read-only
python3 ../../agents.py up     --config agents.toml       # needs `claude` on PATH

Benchmark. The separate agent-orchestrator-benchmark repo runs these Builder/Adversary variants head-to-head (N=5, real agents.py up runs) to measure what drives token cost. Short version: an independent adversary costs ~4.7× a solo builder, but the review cadence (per-gate / per-phase / deferred) is nearly token-neutral, and context hygiene is the one clean ~22% win. See that repo's FINDINGS.md.


The config: agents.toml

Five section types: [watchdog], [backend.<name>], [defaults], [[agent]] / [[service]], and [loop]. See agents.example.toml for a complete, runnable example.

[watchdog] — global supervisor cadence

[watchdog]
signal_interval      = 30    # seconds between light checks (handoff / stall / limit)
heavy_interval       = 300   # seconds between heal + phase-advance checks
limit_probe_fallback = 300   # re-probe cadence for a usage-limited agent when reset time is unparsable
limit_reset_slack    = 45    # seconds to wait past a parsed reset before probing
stall_grace          = 180   # seconds of slack past a WAITING-UNTIL marker before a stall reboot
log_tokens           = false # opt-in: record per-phase token + time usage (see below)

Per-phase token + time logging (log_tokens). Set log_tokens = true (under [watchdog] or [loop]) and the watchdog records, for each phase, how many tokens each agent used and how long the phase took — appended as one JSON object per phase to <log_dir>/token-log.jsonl. Tokens are summed from each agent's Claude Code session transcript and attributed by working dir, so give each agent its own dir (the Builder/Adversary loop pair already uses separate clones) for accurate per-agent numbers. The watchdog snapshots a baseline when a phase starts and writes the delta (per agent, and the total) when the phase advances or the sequence completes — robust across watchdog restarts. Pretty-print it with agents.py tokens:

phase        dur(s)   builder adversary         TOTAL
-----------------------------------------------------
lex           372.0 3,910,118 3,221,447     7,131,565
parse         410.5 ...
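The baseline-then-delta accounting described above can be sketched in a few lines. The record field names (`phase`, `dur_s`, `agents`, `total`) are assumptions made for illustration; the actual schema of token-log.jsonl is not shown in this README. The numbers reuse the lex row from the sample report.

```python
import json

def phase_delta(baseline, current):
    """Tokens used per agent since the phase-start snapshot, plus the total."""
    per_agent = {name: current[name] - baseline.get(name, 0) for name in current}
    return {"agents": per_agent, "total": sum(per_agent.values())}

# Cumulative transcript totals at phase start and at phase advance
# (hypothetical figures chosen to reproduce the lex row above).
baseline = {"builder": 1_000_000, "adversary": 500_000}
current = {"builder": 4_910_118, "adversary": 3_721_447}

delta = phase_delta(baseline, current)

# One JSON object per phase, appended as a single line.
record = json.dumps({"phase": "lex", "dur_s": 372.0, **delta})
```

Storing the baseline on disk rather than in memory is what would make this robust across watchdog restarts: a restarted watchdog can still subtract the saved snapshot from the current totals.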

[defaults] — inherited by every agent

[defaults]
session_prefix = "myproj-"   # REQUIRED: tmux namespace for this project. No implicit default.
log_dir        = ".ao-state" # REQUIRED: logs + state/. Relative paths resolve against the config dir.
backend        = "claude"
model          = "claude-sonnet-4-6"
dir            = "."         # default working dir for agents (relative → project dir)
watch          = "heal"      # none | heal | heal+stall
project_dir    = "."         # OPTIONAL: project root for resolving prompts/paths (default: config's dir)

session_prefix and log_dir are required — the harness has no project-specific fallbacks. Every relative path (log_dir, an agent's dir, handoff.repo, prompt/template files) resolves against project_dir, which defaults to the directory holding the config file. When the config lives in a sandbox but the prompts live elsewhere (as smoke.sh does), set project_dir explicitly.
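A rough sketch of how inheritance and path resolution could fit together, under the rules stated above: agent keys override [defaults], the two required keys have no fallback, and relative paths resolve against project_dir. The helper `effective_agent` and the `clones/builder` path are hypothetical; the real merge logic in agents.py may differ.

```python
from pathlib import Path

REQUIRED = ("session_prefix", "log_dir")

def effective_agent(defaults, agent, config_path):
    """Merge [defaults] under one [[agent]] block and resolve its paths."""
    missing = [key for key in REQUIRED if key not in defaults]
    if missing:
        raise ValueError(f"[defaults] is missing required keys: {missing}")
    # project_dir defaults to the directory that holds the config file.
    project_dir = Path(config_path).parent / defaults.get("project_dir", ".")
    merged = {**defaults, **agent}            # agent keys win over defaults
    for key in ("dir", "log_dir"):
        # Joining an absolute path replaces the base, so absolute values pass through.
        merged[key] = project_dir / merged.get(key, ".")
    return merged

defaults = {"session_prefix": "myproj-", "log_dir": ".ao-state",
            "model": "claude-sonnet-4-6", "dir": "."}
builder = {"name": "builder", "model": "claude-opus-4-8", "dir": "clones/builder"}
merged = effective_agent(defaults, builder, "/srv/myproj/agents.toml")
```

With this input the builder keeps its own model, inherits the session prefix, and gets `/srv/myproj/clones/builder` as its working dir.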

[backend.<name>] — backends declared as data

A backend is fully described by config — no code change to add one. The one field that selects behavior is prompt_delivery:

prompt_delivery  how the kickoff reaches the agent                 example
"arg"            passed as a CLI argument (claude-style)           claude … "$(cat kickoff)"
"ping"           typed in after a TUI connects (opencode-style)    attach, wait, send-keys
"exec"           a plain command; the prompt is written to a file  generic / demo

[backend.claude]
bin             = "claude"
flags           = "--dangerously-skip-permissions"
remote_control  = true          # add a --remote-control <session> flag
supports_resume = true          # honor an agent's resume=true
prompt_delivery = "arg"
process_name    = "claude"      # the pane process a healthy session runs (backend-mismatch healing)
submit_key      = "Enter"       # key to submit a typed message
stall_idle      = 300           # seconds idle before a heal+stall agent is rebooted
active_re = "esc to interrupt|Running tool|· \\d+"   # pane shows the agent is WORKING
limit_re  = "usage limit|limit reached|reached your .*limit"   # usage/rate-limit banner
fatal_re  = "redacted_thinking|cannot be modified"  # unrecoverable session state → kill + restart

[backend.opencode]              # a TUI backend
bin             = "opencode"
attach          = "{bin} attach {server} --dir {dir}"
server          = "http://127.0.0.1:4096"
prompt_delivery = "ping"
process_name    = "opencode"
footer_ui       = true          # a static footer lingers after a turn → only the bottom = activity
log_grace       = 180           # within this many seconds of a log write, treat as active
connect_delay   = 12            # seconds to wait for the TUI before typing
submit_key      = "C-m"
model_env       = true          # pass the model via OPENCODE_CONFIG_CONTENT
preamble        = "set -a; . ./.env; set +a"   # shell run before launch (e.g. load creds)
active_re = "esc interrupt|thinking|running tool|preparing patch"
limit_re  = "usage limit|limit reached"

[backend.demo]                  # a dependency-free backend for testing the harness mechanics
bin             = "echo '[demo] {session} up'; exec sleep 1000000"
prompt_delivery = "exec"        # {kickoff}=prompt file, {session}=session name, {model}=model

For an "arg" backend the flag templates are configurable (so you can point at a non-claude CLI): resume_flag (default --resume '{id}'), model_flag (default --model '{model}'), remote_control_flag (default --remote-control '{session}'). A backend that sets process_name participates in backend-mismatch healing; one that doesn't (e.g. demo) never does.
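As an illustration of how the flag templates might combine into a launch command, here is a sketch that uses the documented defaults. The function `build_arg_command`, the order of the flags, and passing the kickoff text directly (rather than via `$(cat kickoff)`) are all assumptions, not the harness's actual behavior.

```python
import shlex

# Defaults as documented above; a backend block may override any of them.
DEFAULT_FLAGS = {
    "resume_flag": "--resume '{id}'",
    "model_flag": "--model '{model}'",
    "remote_control_flag": "--remote-control '{session}'",
}

def build_arg_command(backend, *, model, session, kickoff, resume_id=None):
    """Assemble a shell command line for a prompt_delivery = "arg" backend."""
    tpl = {**DEFAULT_FLAGS, **backend}
    parts = [backend["bin"], backend.get("flags", "")]
    parts.append(tpl["model_flag"].format(model=model))
    if backend.get("remote_control"):
        parts.append(tpl["remote_control_flag"].format(session=session))
    if resume_id and backend.get("supports_resume"):
        parts.append(tpl["resume_flag"].format(id=resume_id))
    parts.append(shlex.quote(kickoff))        # the prompt is the final argument
    return " ".join(p for p in parts if p)

claude = {"bin": "claude", "flags": "--dangerously-skip-permissions",
          "remote_control": True, "supports_resume": True}
cmd = build_arg_command(claude, model="claude-opus-4-8",
                        session="myproj-builder", kickoff="hello world")
```

Pointing at a different CLI then only needs different templates in the backend block, for example a hypothetical `model_flag = "-m {model}"`.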

[[agent]] — one block per agent

[[agent]]
name    = "builder"            # tmux session defaults to <session_prefix><name>; override with session=
kind    = "loop"              # loop | persistent | task
backend = "claude"            # overrides defaults.backend
model   = "claude-opus-4-8"   # overrides defaults.model
dir     = "."                 # working dir (relative → project dir)
role    = "builder"           # loop agents only: role prompt = <roles_dir>/<role>.md
resume  = true                # (arg backends with supports_resume) --resume <state/<name>.id>
watch   = "heal+stall"        # none | heal | heal+stall
enabled = true                # false = not started by a bare `up`, not supervised
wake    = { interval = 3600, prompt_file = "prompts/supervise.md" }   # periodic nudge
prompt  = """inline startup text"""          # persistent/task agents; OR prompt_file = "path.md"
log_signature = "PROJECT PHASE"              # optional: disambiguate agents that share a dir (agent-log.py)

kind        prompt source                                      typical watch
loop        auto-built: kickoff template + prompts/<role>.md   heal+stall
persistent  prompt / prompt_file (+ optional resume, wake)     heal
task        prompt (runs once, then idles)                     none, enabled=false

watch policy:

value       behavior
none        ignored by the watchdog entirely
heal        restart if the session is dead, FATAL-wedged, or running the wrong backend; pause all healing while inside a usage-limit window; never reboot just for being idle
heal+stall  everything in heal, plus reboot if idle past stall_idle — respecting any WAITING-UNTIL: <ISO-8601> self-wake marker the agent prints as its last line
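The three policies can be read as a single decision function. The sketch below is one plausible reading, not the watchdog's code: `healthy` stands in for "alive, not FATAL-wedged, right backend", a timestamp without a timezone is assumed to be UTC, and how stall_idle and stall_grace combine is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone

WAITING_RE = re.compile(r"WAITING-UNTIL:\s*(\S+)\s*$")

def waiting_until(last_line):
    """Parse a WAITING-UNTIL: <ISO-8601> marker, or return None."""
    match = WAITING_RE.search(last_line)
    if not match:
        return None
    try:
        ts = datetime.fromisoformat(match.group(1))
    except ValueError:
        return None
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)

def should_reboot(watch, *, healthy, in_limit_window, idle_s,
                  stall_idle, stall_grace, last_line, now):
    if watch == "none" or in_limit_window:
        return False                  # ignored, or healing paused by a limit
    if not healthy:
        return True                   # heal: dead, wedged, or wrong backend
    if watch != "heal+stall" or idle_s <= stall_idle:
        return False                  # plain heal never reboots for idleness
    wake = waiting_until(last_line)
    if wake is not None and now <= wake + timedelta(seconds=stall_grace):
        return False                  # the agent asked to sleep; respect it
    return True

noon = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
marker = "WAITING-UNTIL: 2025-01-01T12:30:00+00:00"
common = dict(in_limit_window=False, stall_idle=300, stall_grace=180)
```

The marker lets a loop agent that is deliberately waiting (say, for a long build) avoid being mistaken for a stalled one, while stall_grace bounds how long an overdue wake-up is tolerated.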

[[service]] — non-AI helper processes

[[service]]
name    = "cleanlogs"
command = "python3 agent-log.py follow-all"

Started by a bare up, killed by down. Just a supervised command in a tmux session.

[loop] — the phase state machine (governs kind="loop" agents)

[loop]
state_file       = "phase-idx"          # under <log_dir>/state/
resume_phase     = true                 # keep the phase index across restarts (don't reset to 0)
auto_advance     = true                 # advance when the current phase's status file says done_marker
done_marker      = "## DONE"
kickoff_template = "prompts/kickoff.md" # project preamble; slots {phase_id}/{plan}/{status}/{role}
roles_dir        = "prompts"            # role prompt = <roles_dir>/<role>.md
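# handoff uses dotted keys: a TOML 1.0 inline table may not span several lines
handoff.repo           = "."
handoff.claim_pings    = "adversary"
handoff.review_pings   = "builder"
handoff.inboxes        = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"]
handoff.claim_pattern  = "^claim"
handoff.review_pattern = "^review"
handoff.state_subdir   = "machine-docs"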
on_complete = { trigger_file = ".run-on-complete", run = "reporter" }   # run task agent on completion
phases = [
  { id = "p1", plan = "plans/p1.md", status = "STATUS-p1.md" },
  { id = "p2", plan = "plans/p2.md", status = "STATUS-p2.md", models = { builder = "claude-opus-4-8" } },
]
  • Kickoff template. A loop agent's prompt is kickoff_template (with {phase_id}, {plan}, {status}, {role} substituted from the current phase) followed by <roles_dir>/<role>.md. Both are project files; this repo ships generic starters in prompts/. There is no built-in preamble text.
  • Per-phase model override. A phase's models = { builder = "...", adversary = "..." } overrides those agents' model for just that phase (matched on the agent's role).
  • Auto-advance. Each heavy tick, if the current phase's status file (looked up in handoff.repo's state_subdir/ then its root) contains a real done_marker — not a "Not yet…" placeholder — the watchdog stops the loops, bumps the phase index, and restarts them on the next phase. After the last phase it writes a SEQUENCE-COMPLETE marker under log_dir and stops the loops (idempotent — no churn). Appending a phase later clears the stale marker and resumes. On completion, an optional on_complete.run task agent fires if its trigger_file exists under log_dir.
  • Handoff signalling. The watchdog watches handoff.repo's origin/main for commits whose subject matches claim_pattern / review_pattern, and watches the two files listed in inboxes. When a claim lands it pings the claim_pings agent; a review pings review_pings; an inbox change pings the relevant side. This is how the Builder and Adversary coordinate purely through git.
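The commit-matching half of handoff signalling comes down to testing each new commit subject against the two patterns. A minimal sketch, assuming the patterns are applied with `re.search` and that the claim pattern is checked first; the helper `handoff_target` and the sample subjects are hypothetical.

```python
import re

HANDOFF = {
    "claim_pings": "adversary", "review_pings": "builder",
    "claim_pattern": "^claim", "review_pattern": "^review",
}

def handoff_target(subject, handoff=HANDOFF):
    """Return the agent to ping for a commit subject, or None."""
    if re.search(handoff["claim_pattern"], subject):
        return handoff["claim_pings"]      # Builder claimed: wake the Adversary
    if re.search(handoff["review_pattern"], subject):
        return handoff["review_pings"]     # Adversary reviewed: wake the Builder
    return None                            # ordinary commit, nobody is pinged

subjects = ["claim(p1): lexer passes all fixtures",
            "review(p1): PASS",
            "fix typo in README"]
targets = [handoff_target(s) for s in subjects]
```

Subjects for new commits could be gathered with something like `git log --format=%s` over the range since the last seen commit on origin/main; that detail is also an assumption.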

Config vs state

  • Config = agents.toml — declarative, version-controlled, the only source of truth.
  • State = <log_dir>/state/ — machine-written runtime only: phase-idx (current phase), <name>.id (resume id), limited-<session>.json (active usage-limit window), kickoff-<session>.txt (the exact prompt last sent). Git-ignore your log_dir.
  • Env = a one-off override for a single invocation only: AGENT_MODEL_<name>=… / AGENT_BACKEND_<name>=…. The persisted watchdog ignores env and re-reads the file every tick — deliberately, so env-vs-file drift can never silently revert a backend.
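The one-off env override can be pictured as a thin layer applied only by the invoking command, never by the watchdog. In this sketch the helper `one_off_overrides` is hypothetical, and using the agent name verbatim in the variable name (no upper-casing or escaping) is an assumption.

```python
def one_off_overrides(agent, environ):
    """Apply AGENT_MODEL_<name> / AGENT_BACKEND_<name> to a copy of one agent."""
    result = dict(agent)                   # never mutate the loaded config
    for field in ("model", "backend"):
        value = environ.get(f"AGENT_{field.upper()}_{agent['name']}")
        if value:
            result[field] = value
    return result

builder = {"name": "builder", "model": "claude-opus-4-8", "backend": "claude"}
env = {"AGENT_MODEL_builder": "claude-sonnet-4-6"}

launched_as = one_off_overrides(builder, env)   # what a single `up` would start
watchdog_view = dict(builder)                   # the watchdog reads the file only
```

Since the watchdog view never includes the override, a later heal restarts the agent with the model from agents.toml; a change that should persist belongs in the file.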

The driver: verbs

The recommended (not required) verb set — an AI project-orchestrator can rely on these being present, but a harness is free to add more:

agents.py up [name…]               start enabled agents (+ services + watchdog); use-or-create
agents.py down [name…]             stop agents/services/watchdog (all, or named)
agents.py status                   table of every agent: kind, backend, model, watch, state, phase
agents.py watchdog                 the supervisor loop (what the <prefix>watchdog session runs)
agents.py logs <name>              tail that session's log
agents.py phase [show|next|set N]  inspect / move the loop phase index
agents.py tokens                   per-phase token + time report (when [watchdog].log_tokens = true)
agents.py selftest                 regression-test the backend activity detector (needs no config)
agents.py init [dir]               scaffold a starter agents.toml + prompts/ in a project dir
  --config PATH                    use a specific config (default: ./agents.toml)

The watchdog tick

agents.py watchdog runs as the <prefix>watchdog tmux session and re-reads the config every tick. Each loop:

  • signal tick (signal_interval): handoff pings; for each watched agent the usage-limit check, and for heal+stall agents the stall check; fire any due wake.
  • heavy tick (heavy_interval): advance the loop phase if the current one is done; otherwise heal each watched agent per its watch policy. When the sequence is complete the finished loops stay stopped, but persistent agents stay supervised.
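The two cadences can be modelled as one loop that checks which ticks are due. The sketch below simulates time with plain integers so it runs instantly; a real loop would presumably sleep, use a monotonic clock, and re-read the config before each tick. Both function names are hypothetical.

```python
def due_ticks(now, last_signal, last_heavy, signal_interval=30, heavy_interval=300):
    """Return which tick kinds are due at time `now` (seconds)."""
    due = []
    if now - last_signal >= signal_interval:
        due.append("signal")   # handoff pings, limit + stall checks, wakes
    if now - last_heavy >= heavy_interval:
        due.append("heavy")    # phase advance, otherwise heal
    return due

def simulate(seconds, signal_interval=30, heavy_interval=300):
    """Count how many ticks of each kind fire over a simulated run."""
    last = {"signal": 0, "heavy": 0}
    counts = {"signal": 0, "heavy": 0}
    for now in range(1, seconds + 1):
        for kind in due_ticks(now, last["signal"], last["heavy"],
                              signal_interval, heavy_interval):
            last[kind] = now
            counts[kind] += 1
    return counts

ten_minutes = simulate(600)   # default cadences from [watchdog]
```

With the default intervals, ten minutes yields 20 signal ticks and 2 heavy ticks, which is why cheap checks go on the signal tick and restarts on the heavy one.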

Usage-limit handling: when an agent prints a limit banner, the watchdog parses the reset time, arms a quiet window (never rebooting a limited agent), and at the end sends one probe to resume it — re-arming if the banner re-prints.
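The timing of the probe follows from two [watchdog] settings: limit_reset_slack when the reset time was parsed, limit_probe_fallback when it was not. A small sketch of that rule, with a hypothetical function name; parsing the banner itself is left out because the banner format is backend-specific.

```python
from datetime import datetime, timedelta, timezone

def next_probe(now, parsed_reset, limit_reset_slack=45, limit_probe_fallback=300):
    """When to send the single resume probe to a usage-limited agent."""
    if parsed_reset is None:
        # Reset time unparsable: just try again after the fallback cadence.
        return now + timedelta(seconds=limit_probe_fallback)
    # Otherwise wait a little past the announced reset before probing.
    return parsed_reset + timedelta(seconds=limit_reset_slack)

now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
reset = datetime(2025, 1, 1, 15, 0, tzinfo=timezone.utc)

known = next_probe(now, reset)     # 15:00:45
unknown = next_probe(now, None)    # 12:05:00
```

If the banner re-prints after the probe, the same calculation would run again with the newly parsed reset time, which matches the re-arming behavior described above.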


Driving the harness from an AI project-orchestrator

This harness is designed to be driven by an AI "project-orchestrator" (PO) that creates and runs many projects, each pinning its own copy of this engine. The contract is intentionally not rigid — the PO reads these docs and works out how to drive a project. What it can rely on:

  1. One config, one driver. Everything the PO needs to know about a project's agents is in that project's agents.toml; everything it can do is a verb above. To inspect, status. To start or stop, up / down. To move the phase, phase.
  2. Isolation by session_prefix. Two projects never collide as long as their session_prefix values differ. The PO assigns each project a unique prefix at creation.
  3. State is on disk, not in the PO. Phase index, resume ids and limit windows live under the project's log_dir. The PO can restart a project (or the whole host) and the watchdog resumes from there.
  4. Knowledge is one-directional. A project repo contains nothing about the PO or the fleet — it can be run by hand and would have no idea a PO exists. The PO's fleet registry is the only record of which projects exist and at what engine ref. This repo never reaches "up" toward a PO.
  5. Submodule pin = the engine version. A project pins this repo at a tag (e.g. v0.1.0) as a submodule under engine/. Bumping is per-project and opt-in (git submodule update --remote); one project's bump can't break another.

A minimal project layout the PO scaffolds:

my-project/                 # its own repo; knows nothing about the PO
  agents.toml               # harness config (this schema)
  engine/                   # this repo as a pinned submodule
  prompts/                  # role prompts + kickoff template
  machine-docs/             # the loop pair's coordination files (STATUS/REVIEW/inboxes)
  .ao-state/                # runtime state + logs (gitignored)
  .env                      # project creds (never in git)

Run it by hand with engine/agents.py up --config agents.toml.


Nix

A flake.nix provides a reproducible devShell with the runtime deps (python311 for stdlib tomllib, plus tmux and git):

nix develop                                   # enter the shell
nix develop -c python3 agents.py selftest      # or run one command in it
nix flake check                                # evaluate + build the devShell

The agent CLIs themselves (claude, opencode) are external, non-Nix tools — install them per their own docs and make sure they are on PATH before launching live agents. The devShell documents this in its banner.


Testing

The tests/ directory holds the harness's own test suite. One runner drives everything:

nix develop -c ./tests/run.sh      # unit tests always; live backend smokes when available
# or just:  ./tests/run.sh         # (python3 + tmux must be on PATH)

What it runs:

  • Unit tests (tests/test_unit.py) — pure logic, no agents spawned, no live tmux sessions. Cover config load + defaults merge, kickoff-template assembly, the phase machine (advance on the done marker, idempotent sequence-complete, append-a-phase resumes), usage-limit reset-banner parsing, WAITING-UNTIL / stall parsing, and the per-backend activity detectors (claude + opencode footers). Always run; a failure fails the suite. Run them alone with python3 -m unittest discover -s tests (or python3 tests/test_unit.py).
  • Live backend smokes (tests/smoke_claude.sh, tests/smoke_opencode.sh) — each brings a throwaway scratch project up through agents.py on a real backend, in a fully isolated sandbox (its own unique session_prefix, a temp log_dir, and — for opencode — a dedicated server on a non-default port AOTEST_OC_PORT, default 4097), confirms the session attaches and status reports it RUNNING, then downs it and cleans up (no leftover sessions, port freed). Each SKIPs gracefully (exit 0) when its backend's binary or creds are unavailable. Useful env: CLAUDE_BIN / OPENCODE_BIN, AOTEST_MODEL, AOTEST_OC_PORT, AOTEST_OC_CREDS.
  • Isolation sanity — after the live runs, the runner asserts no aotest-* tmux sessions leaked and reports that any live sessions are untouched.

The smokes are safe by construction: a unique per-run session prefix (never cc-ci- or any real project's), a dedicated opencode port (never 4096), and a cleanup trap that fires on success, failure, and Ctrl+C.


Adding things

  • Add an agent — add an [[agent]] block; agents.py up <name>. No code change.
  • Add a backend — add a [backend.<name>] block (bin, prompt_delivery, the regexes); point an agent at it with backend = "<name>".
  • Add / append a phase — add an entry to [loop].phases; the watchdog advances into it automatically (clearing a stale SEQUENCE-COMPLETE if the sequence had finished).
  • Change a model or backend — edit the field (or a phase's models = {}), then agents.py down <name> && agents.py up <name>. The watchdog re-reads the file; it won't fight you.