# agent-orchestrator

A generic, reusable harness for running and supervising a fleet of AI-agent sessions in **tmux**.
One driver script + one declarative config (`agents.toml`) describe every agent — a Builder /
Adversary loop pair, a persistent supervisor, a one-shot task — and a **watchdog** keeps them
alive, healed, paced, and coordinated. The watchdog reads the same config every tick, so there is
never any env-vs-file drift.

Nothing about any particular project lives in this repo. Paths, the loop **kickoff preamble**, the
**handoff conventions**, and the **on-complete** hook are all supplied by the project's config and
prompt files. A project consumes this repo as a pinned **git submodule** (`engine/`) and keeps its
own config, prompts, state, and tmux namespace — total isolation between projects.

```
agents.py            the driver + watchdog (pure Python stdlib; needs python >= 3.11 for tomllib)
agent-log.py         render claude JSONL transcripts into clean, greppable logs
agents.example.toml  a self-contained 2-agent example project
prompts/             generic role + kickoff templates (builder / adversary / kickoff)
smoke.sh             bring the example up + tear it down in an isolated sandbox, then clean up
flake.nix/.lock      a Nix devShell with the runtime deps (python311, tmux, git)
```

---

## Quick start

```bash
nix develop                                   # python311 + tmux + git on PATH (see "Nix" below)

python3 agents.py selftest                    # regression-test the activity detector (no config)
python3 agents.py status --config agents.example.toml   # one table: every agent + the phase
./smoke.sh                                    # prove up/down works end-to-end, isolated + clean

python3 agents.py init myproject              # scaffold a starter agents.toml + prompts/
```

`up` is **use-or-create**: an already-running session is left alone, never double-started.

```bash
python3 agents.py --config agents.toml up           # start all enabled agents + services + watchdog
python3 agents.py --config agents.toml up builder    # start just one agent (by name)
python3 agents.py --config agents.toml down          # stop everything
python3 agents.py --config agents.toml logs builder  # tail one session's log
python3 agents.py --config agents.toml phase show    # where the loop phase machine is
```

`--config` defaults to `./agents.toml`, falling back to one next to `agents.py`.
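
The lookup order described above can be sketched as follows. This is a minimal illustration, not the code in `agents.py`: the function name `resolve_config` and the choice of `FileNotFoundError` are assumptions.

```python
from pathlib import Path
from typing import Optional


def resolve_config(explicit: Optional[str], cwd: Path, script_dir: Path) -> Path:
    """Return the config path: --config if given, else ./agents.toml, else beside agents.py."""
    if explicit is not None:
        return Path(explicit)
    # Lookup order taken from the sentence above; the error type is an assumption.
    for candidate in (cwd / "agents.toml", script_dir / "agents.toml"):
        if candidate.is_file():
            return candidate
    raise FileNotFoundError("no agents.toml in the working directory or next to agents.py")
```

A caller would pass `Path.cwd()` and the directory that holds `agents.py` as the last two arguments.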

---

## The config: `agents.toml`

Five section types: `[watchdog]`, `[backend.<name>]`, `[defaults]`, `[[agent]]` / `[[service]]`,
and `[loop]`. See `agents.example.toml` for a complete, runnable example.

### `[watchdog]` — global supervisor cadence

```toml
[watchdog]
signal_interval      = 30    # seconds between light checks (handoff / stall / limit)
heavy_interval       = 300   # seconds between heal + phase-advance checks
limit_probe_fallback = 300   # re-probe cadence for a usage-limited agent when reset time is unparsable
limit_reset_slack    = 45    # seconds to wait past a parsed reset before probing
stall_grace          = 180   # seconds of slack past a WAITING-UNTIL marker before a stall reboot
```

### `[defaults]` — inherited by every agent

```toml
[defaults]
session_prefix = "myproj-"   # REQUIRED: tmux namespace for this project. No implicit default.
log_dir        = ".ao-state" # REQUIRED: logs + state/. Relative paths resolve against the config dir.
backend        = "claude"
model          = "claude-sonnet-4-6"
dir            = "."         # default working dir for agents (relative → project dir)
watch          = "heal"      # none | heal | heal+stall
project_dir    = "."         # OPTIONAL: project root for resolving prompts/paths (default: config's dir)
```

`session_prefix` and `log_dir` are **required** — the harness has no project-specific fallbacks.
Every relative path (`log_dir`, an agent's `dir`, `handoff.repo`, prompt/template files) resolves
against `project_dir`, which defaults to the directory holding the config file. When the config
lives in a sandbox but the prompts live elsewhere (as `smoke.sh` does), set `project_dir`
explicitly.
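
As a rough sketch of that resolution rule, assuming a relative `project_dir` is itself anchored at the config file's directory (the README does not say so explicitly) and using hypothetical helper names:

```python
import os.path


def project_root(config_path: str, defaults: dict) -> str:
    """Project root: [defaults].project_dir, or the directory holding the config file."""
    config_dir = os.path.dirname(os.path.abspath(config_path))
    # Assumption: a relative project_dir is resolved against the config's directory.
    return os.path.normpath(os.path.join(config_dir, defaults.get("project_dir", ".")))


def resolve(root: str, value: str) -> str:
    """Anchor a relative path at the project root; absolute paths pass through unchanged."""
    return os.path.normpath(os.path.join(root, value))
```

With the config in `/srv/sandbox/agents.toml` and `project_dir = "../proj"`, `log_dir = ".ao-state"` would land in `/srv/proj/.ao-state` under these assumptions.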

### `[backend.<name>]` — backends declared as data

A backend is fully described by config — no code change to add one. The one field that selects
behavior is `prompt_delivery`:

| `prompt_delivery` | how the kickoff reaches the agent | example |
|---|---|---|
| `"arg"`  | passed as a CLI argument (claude-style) | `claude … "$(cat kickoff)"` |
| `"ping"` | typed in after a TUI connects (opencode-style) | attach, wait, send-keys |
| `"exec"` | a plain command; the prompt is written to a file | generic / demo |

```toml
[backend.claude]
bin             = "claude"
flags           = "--dangerously-skip-permissions"
remote_control  = true          # add a --remote-control <session> flag
supports_resume = true          # honor an agent's resume=true
prompt_delivery = "arg"
process_name    = "claude"      # the pane process a healthy session runs (backend-mismatch healing)
submit_key      = "Enter"       # key to submit a typed message
stall_idle      = 300           # seconds idle before a heal+stall agent is rebooted
active_re = "esc to interrupt|Running tool|· \\d+"   # pane shows the agent is WORKING
limit_re  = "usage limit|limit reached|reached your .*limit"   # usage/rate-limit banner
fatal_re  = "redacted_thinking|cannot be modified"  # unrecoverable session state → kill + restart

[backend.opencode]              # a TUI backend
bin             = "opencode"
attach          = "{bin} attach {server} --dir {dir}"
server          = "http://127.0.0.1:4096"
prompt_delivery = "ping"
process_name    = "opencode"
footer_ui       = true          # a static footer lingers after a turn → only the bottom = activity
log_grace       = 180           # within this many seconds of a log write, treat as active
connect_delay   = 12            # seconds to wait for the TUI before typing
submit_key      = "C-m"
model_env       = true          # pass the model via OPENCODE_CONFIG_CONTENT
preamble        = "set -a; . ./.env; set +a"   # shell run before launch (e.g. load creds)
active_re = "esc interrupt|thinking|running tool|preparing patch"
limit_re  = "usage limit|limit reached"

[backend.demo]                  # a dependency-free backend for testing the harness mechanics
bin             = "echo '[demo] {session} up'; exec sleep 1000000"
prompt_delivery = "exec"        # {kickoff}=prompt file, {session}=session name, {model}=model
```
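
To illustrate how the three regexes might be applied to captured pane text, here is a minimal sketch. The precedence (fatal, then limit, then active), the case-sensitive matching, and the name `classify_pane` are assumptions; the real detector also accounts for `footer_ui` and `log_grace`, which this sketch ignores.

```python
import re

# Patterns copied from the [backend.claude] example; the TOML "\\d" is the regex \d.
BACKEND = {
    "active_re": r"esc to interrupt|Running tool|· \d+",
    "limit_re": r"usage limit|limit reached|reached your .*limit",
    "fatal_re": r"redacted_thinking|cannot be modified",
}


def classify_pane(text: str, backend: dict) -> str:
    """Return 'fatal', 'limited', 'active' or 'idle' for a pane's visible text."""
    # Assumed precedence: an unrecoverable state wins over a limit banner,
    # which wins over an activity indicator.
    checks = (("fatal", "fatal_re"), ("limited", "limit_re"), ("active", "active_re"))
    for state, key in checks:
        pattern = backend.get(key)  # a backend may omit a regex (e.g. demo)
        if pattern and re.search(pattern, text):
            return state
    return "idle"
```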

For an `"arg"` backend the flag *templates* are configurable (so you can point at a non-claude
CLI): `resume_flag` (default `--resume '{id}'`), `model_flag` (default `--model '{model}'`),
`remote_control_flag` (default `--remote-control '{session}'`). A backend that sets `process_name`
participates in backend-mismatch healing; one that doesn't (e.g. `demo`) never does.
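
A sketch of how those templates could expand into a launch command. The default templates come from the paragraph above; the flag order, the helper name `build_command`, and the quoting of the kickoff text are assumptions.

```python
import shlex
from typing import Optional

DEFAULT_FLAGS = {
    "resume_flag": "--resume '{id}'",
    "model_flag": "--model '{model}'",
    "remote_control_flag": "--remote-control '{session}'",
}


def build_command(backend: dict, session: str, model: str,
                  resume_id: Optional[str], kickoff: str) -> str:
    """Assemble the shell command for an "arg" backend from its flag templates."""
    cfg = {**DEFAULT_FLAGS, **backend}  # backend-supplied templates override the defaults
    parts = [cfg["bin"], cfg.get("flags", "")]
    parts.append(cfg["model_flag"].format(model=model))
    if cfg.get("remote_control"):
        parts.append(cfg["remote_control_flag"].format(session=session))
    if cfg.get("supports_resume") and resume_id:
        parts.append(cfg["resume_flag"].format(id=resume_id))
    parts.append(shlex.quote(kickoff))  # the prompt travels as one CLI argument
    return " ".join(p for p in parts if p)
```

Pointing at a different CLI is then a matter of overriding a template in the backend block, for example `model_flag = "-m {model}"`.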

### `[[agent]]` — one block per agent

```toml
[[agent]]
name    = "builder"            # tmux session defaults to <session_prefix><name>; override with session=
kind    = "loop"              # loop | persistent | task
backend = "claude"            # overrides defaults.backend
model   = "claude-opus-4-8"   # overrides defaults.model
dir     = "."                 # working dir (relative → project dir)
role    = "builder"           # loop agents only: role prompt = <roles_dir>/<role>.md
resume  = true                # (arg backends with supports_resume) --resume <state/<name>.id>
watch   = "heal+stall"        # none | heal | heal+stall
enabled = true                # false = not started by a bare `up`, not supervised
wake    = { interval = 3600, prompt_file = "prompts/supervise.md" }   # periodic nudge
prompt  = """inline startup text"""          # persistent/task agents; OR prompt_file = "path.md"
log_signature = "PROJECT PHASE"              # optional: disambiguate agents that share a dir (agent-log.py)
```

| kind | prompt source | typical `watch` |
|---|---|---|
| `loop` | auto-built: kickoff template + `prompts/<role>.md` | `heal+stall` |
| `persistent` | `prompt` / `prompt_file` (+ optional `resume`, `wake`) | `heal` |
| `task` | `prompt` / `prompt_file` (runs once, then idles) | `none`, `enabled=false` |

**`watch` policy:**

| value | behavior |
|---|---|
| `none` | ignored by the watchdog entirely |
| `heal` | restart if the session is dead, FATAL-wedged, or running the wrong backend; pause all healing while inside a usage-limit window; **never** reboot just for being idle |
| `heal+stall` | everything in `heal`, **plus** reboot if idle past `stall_idle` — respecting any `WAITING-UNTIL: <ISO-8601>` self-wake marker the agent prints as its last line |
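
The `heal+stall` decision can be sketched roughly as below, combining `stall_idle` with the `stall_grace` slack from `[watchdog]`. The function names are hypothetical, and treating a timestamp without an offset as UTC is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

WAITING_RE = re.compile(r"^WAITING-UNTIL:\s*(\S+)$")


def parse_waiting_until(last_line: str) -> Optional[datetime]:
    """Parse a 'WAITING-UNTIL: <ISO-8601>' marker; return None if absent or malformed."""
    match = WAITING_RE.match(last_line.strip())
    if not match:
        return None
    try:
        ts = datetime.fromisoformat(match.group(1))  # a trailing "Z" needs Python >= 3.11
    except ValueError:
        return None
    # Assumption: a timestamp without an offset is interpreted as UTC.
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def should_stall_reboot(idle_s: float, last_line: str, now: datetime,
                        stall_idle: int = 300, stall_grace: int = 180) -> bool:
    """True when an idle heal+stall agent should be rebooted."""
    if idle_s < stall_idle:
        return False
    wake_at = parse_waiting_until(last_line)
    if wake_at is None:
        return True  # idle past stall_idle with no self-wake marker
    # Honor the marker, plus stall_grace seconds of slack.
    return now > wake_at + timedelta(seconds=stall_grace)
```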

### `[[service]]` — non-AI helper processes

```toml
[[service]]
name    = "cleanlogs"
command = "python3 agent-log.py follow-all"
```

Started by a bare `up`, killed by `down`. Just a supervised command in a tmux session.

### `[loop]` — the phase state machine (governs `kind="loop"` agents)

```toml
[loop]
state_file       = "phase-idx"          # under <log_dir>/state/
resume_phase     = true                 # keep the phase index across restarts (don't reset to 0)
auto_advance     = true                 # advance when the current phase's status file says done_marker
done_marker      = "## DONE"
kickoff_template = "prompts/kickoff.md" # project preamble; slots {phase_id}/{plan}/{status}/{role}
roles_dir        = "prompts"            # role prompt = <roles_dir>/<role>.md
handoff = { repo = ".", claim_pings = "adversary", review_pings = "builder",
            inboxes = ["ADVERSARY-INBOX.md", "BUILDER-INBOX.md"],
            claim_pattern = "^claim", review_pattern = "^review", state_subdir = "machine-docs" }
on_complete = { trigger_file = ".run-on-complete", run = "reporter" }   # run task agent on completion
phases = [
  { id = "p1", plan = "plans/p1.md", status = "STATUS-p1.md" },
  { id = "p2", plan = "plans/p2.md", status = "STATUS-p2.md", models = { builder = "claude-opus-4-8" } },
]
```

- **Kickoff template.** A loop agent's prompt is `kickoff_template` (with `{phase_id}`, `{plan}`,
  `{status}`, `{role}` substituted from the current phase) followed by `<roles_dir>/<role>.md`.
  Both are project files; this repo ships generic starters in `prompts/`. There is no built-in
  preamble text.
- **Per-phase model override.** A phase's `models = { builder = "...", adversary = "..." }`
  overrides those agents' model for just that phase (matched on the agent's `role`).
- **Auto-advance.** Each heavy tick, if the current phase's `status` file (looked up in
  `handoff.repo`'s `state_subdir/` then its root) contains a real `done_marker` — not a "Not
  yet…" placeholder — the watchdog stops the loops, bumps the phase index, and restarts them on
  the next phase. After the last phase it writes a `SEQUENCE-COMPLETE` marker under `log_dir` and
  stops the loops (idempotent — no churn). Appending a phase later clears the stale marker and
  resumes. On completion, an optional `on_complete.run` task agent fires if its `trigger_file`
  exists under `log_dir`.
- **Handoff signalling.** The watchdog watches `handoff.repo`'s `origin/main` for commits whose
  subject matches `claim_pattern` / `review_pattern`, and watches the two `inboxes` files. When a
  claim lands it pings the `claim_pings` agent; a review pings `review_pings`; an inbox change
  pings the relevant side. This is how the Builder and Adversary coordinate purely through git.

---

## Config vs state

- **Config** = `agents.toml` — declarative, version-controlled, the only source of truth.
- **State** = `<log_dir>/state/` — machine-written runtime only: `phase-idx` (current phase),
  `<name>.id` (resume id), `limited-<session>.json` (active usage-limit window),
  `kickoff-<session>.txt` (the exact prompt last sent). Git-ignore your `log_dir`.
- **Env** = a one-off override for a *single* invocation only: `AGENT_MODEL_<name>=…` /
  `AGENT_BACKEND_<name>=…`. The long-running watchdog ignores env and re-reads the file every tick.
  This is deliberate, so that env-vs-file drift can never silently revert a backend.
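
A minimal sketch of the env override, assuming the agent name is used verbatim in the variable name (the README does not specify any case folding). The helper `apply_env_overrides` is hypothetical.

```python
def apply_env_overrides(agent: dict, env: dict) -> dict:
    """Return a copy of the agent with AGENT_MODEL_<name> / AGENT_BACKEND_<name> applied."""
    name = agent["name"]  # assumption: the name appears verbatim in the variable
    merged = dict(agent)  # never mutate the parsed config
    for field in ("model", "backend"):
        value = env.get(f"AGENT_{field.upper()}_{name}")
        if value:
            merged[field] = value
    return merged
```

A one-off `up` would pass `os.environ` as `env`; a watchdog that must ignore env would simply skip this step and use the agent as parsed from the file.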

---

## The driver: verbs

The recommended (not required) verb set — an AI project-orchestrator can rely on these being
present, but a harness is free to add more:

```
agents.py up [name…]               start enabled agents (+ services + watchdog); use-or-create
agents.py down [name…]             stop agents/services/watchdog (all, or named)
agents.py status                   table of every agent: kind, backend, model, watch, state, phase
agents.py watchdog                 the supervisor loop (what the <prefix>watchdog session runs)
agents.py logs <name>              tail that session's log
agents.py phase [show|next|set N]  inspect / move the loop phase index
agents.py selftest                 regression-test the backend activity detector (needs no config)
agents.py init [dir]               scaffold a starter agents.toml + prompts/ in a project dir
  --config PATH                    use a specific config (default: ./agents.toml)
```

### The watchdog tick

`agents.py watchdog` runs as the `<prefix>watchdog` tmux session and **re-reads the config every
tick**. Each loop:

- **signal tick** (`signal_interval`): deliver handoff pings; run the usage-limit check for each
  watched agent and the stall check for `heal+stall` agents; fire any due `wake`.
- **heavy tick** (`heavy_interval`): advance the loop phase if the current one is done; otherwise
  heal each watched agent per its `watch` policy. When the sequence is complete the finished loops
  stay stopped, but persistent agents stay supervised.
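
The two cadences can be sketched as a single loop that re-reads the config on every pass. This is an illustration only: `run_watchdog`, its callback parameters, and the one-second sleep are assumptions, and the real loop does more per tick.

```python
import time


def run_watchdog(load_config, on_signal, on_heavy,
                 clock=time.monotonic, sleep=time.sleep, max_loops=None):
    """Run signal and heavy ticks on their own intervals, reloading config each pass."""
    last_signal = last_heavy = float("-inf")  # both ticks fire on the first pass
    loops = 0
    while max_loops is None or loops < max_loops:
        cfg = load_config()  # re-read every pass, so edits apply without a restart
        wd = cfg["watchdog"]
        now = clock()
        if now - last_signal >= wd["signal_interval"]:
            on_signal(cfg)
            last_signal = now
        if now - last_heavy >= wd["heavy_interval"]:
            on_heavy(cfg)
            last_heavy = now
        sleep(1.0)  # assumed polling granularity
        loops += 1
```

Injecting `clock` and `sleep` keeps the loop testable without waiting in real time.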

**Usage-limit handling:** when an agent prints a limit banner, the watchdog parses the reset time,
arms a quiet window (never rebooting a limited agent), and at the end sends one probe to resume it
— re-arming if the banner re-prints.
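
The timing of the quiet window follows from the two `[watchdog]` knobs `limit_reset_slack` and `limit_probe_fallback`. A minimal sketch, with hypothetical function names and with banner parsing left out because it is backend-specific:

```python
from typing import Optional


def next_probe_at(now: float, reset_at: Optional[float],
                  limit_reset_slack: int = 45, limit_probe_fallback: int = 300) -> float:
    """When to send the resume probe, as a Unix timestamp."""
    if reset_at is None:
        # The reset time could not be parsed: fall back to a fixed re-probe cadence.
        return now + limit_probe_fallback
    return reset_at + limit_reset_slack  # wait a little past the advertised reset


def in_quiet_window(now: float, probe_at: float) -> bool:
    """While True, the agent is left alone: no heal, no stall reboot, no probe."""
    return now < probe_at
```

If the banner is still visible after the probe, calling `next_probe_at` again with the newly parsed reset time re-arms the window.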

---

## Driving the harness from an AI project-orchestrator

This harness is designed to be driven by an AI "project-orchestrator" (PO) that creates and runs
many projects, each pinning its own copy of this engine. The contract is intentionally **not
rigid** — the PO reads these docs and works out how to drive a project. What it can rely on:

1. **One config, one driver.** Everything the PO needs to know about a project's agents is in that
   project's `agents.toml`; everything it can *do* is a verb above. To inspect, `status`. To start
   or stop, `up` / `down`. To move the phase, `phase`.
2. **Isolation by `session_prefix`.** Two projects never collide as long as their `session_prefix`
   values differ. The PO assigns each project a unique prefix at creation.
3. **State is on disk, not in the PO.** Phase index, resume ids and limit windows live under the
   project's `log_dir`. The PO can restart a project (or the whole host) and the watchdog resumes
   from there.
4. **Knowledge is one-directional.** A project repo contains nothing about the PO or the fleet —
   it can be run by hand and would have no idea a PO exists. The PO's fleet registry is the only
   record of which projects exist and at what engine ref. This repo never reaches "up" toward a PO.
5. **Submodule pin = the engine version.** A project pins this repo at a tag (e.g. `v0.1.0`) as a
   submodule under `engine/`. Bumping is per-project and opt-in (`git submodule update --remote`);
   one project's bump can't break another.
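
As one possible shape for point 1, a PO written in Python could shell out to the driver like this. The helpers `driver_argv` and `run_verb` are hypothetical, and the layout (`engine/agents.py` and `agents.toml` under the project root) is the one described in this section.

```python
import subprocess
import sys
from pathlib import Path


def driver_argv(project: Path, verb: str, *args: str) -> list:
    """Build the command line for one driver verb in one project."""
    return [
        sys.executable,
        str(project / "engine" / "agents.py"),
        "--config", str(project / "agents.toml"),
        verb, *args,
    ]


def run_verb(project: Path, verb: str, *args: str) -> subprocess.CompletedProcess:
    """Run a verb from the project root and capture its output as text."""
    return subprocess.run(driver_argv(project, verb, *args), cwd=project,
                          capture_output=True, text=True, check=False)
```

For example, `run_verb(Path("/srv/my-project"), "status").stdout` would hold the status table, and the PO keeps no state of its own beyond the project path.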

A minimal project layout the PO scaffolds:

```
my-project/                 # its own repo; knows nothing about the PO
  agents.toml               # harness config (this schema)
  engine/                   # this repo as a pinned submodule
  prompts/                  # role prompts + kickoff template
  machine-docs/             # the loop pair's coordination files (STATUS/REVIEW/inboxes)
  .ao-state/                # runtime state + logs (gitignored)
  .env                      # project creds (never in git)
```

Run it by hand with `python3 engine/agents.py up --config agents.toml`.

---

## Nix

A `flake.nix` provides a reproducible devShell with the runtime deps (`python311` for stdlib
`tomllib`, plus `tmux` and `git`):

```bash
nix develop                                   # enter the shell
nix develop -c python3 agents.py selftest      # or run one command in it
nix flake check                                # evaluate + build the devShell
```

The agent CLIs themselves (`claude`, `opencode`) are **external, non-Nix tools** — install them
per their own docs and make sure they are on `PATH` before launching live agents. The devShell
documents this in its banner.

---

## Adding things

- **Add an agent** — add an `[[agent]]` block; `agents.py up <name>`. No code change.
- **Add a backend** — add a `[backend.<name>]` block (`bin`, `prompt_delivery`, the regexes);
  point an agent at it with `backend = "<name>"`.
- **Add / append a phase** — add an entry to `[loop].phases`; the watchdog advances into it
  automatically (clearing a stale `SEQUENCE-COMPLETE` if the sequence had finished).
- **Change a model or backend** — edit the field (or a phase's `models = {}`), then
  `agents.py down <name> && agents.py up <name>`. The watchdog re-reads the file; it won't fight you.