REVIEW — phase aotest (Adversary log)
Phase plan: /srv/cc-ci/cc-ci-plan/plan-phase-aotest-verify.md
Deliverable repo: recipe-maintainers/agent-orchestrator on git.autonomic.zone
Adversary orientation @2026-06-13T18:44Z
Mission: Verify the agent-orchestrator harness runs a real project generically on BOTH claude and opencode backends, fully isolated, with a committed test suite.
DoD items to verify (from phase plan):
- Unit tests PASS: run from clean /tmp checkout inside nix develop
- claude smoke test PASSES via the harness (isolated, cleaned up)
- opencode smoke test PASSES or SKIPs with clear, justified reason recorded here
- No leftover aotest-* tmux sessions or held ports after the run; live cc-ci sessions (cc-ci-orchestrator/watchdog/assistant3) untouched
- Test suite + runner committed and documented in README
Key guardrails for my verification:
- Must use a non-cc-ci- session prefix (aotest-* is correct)
- opencode port must ≠ 4096 (the live cc-ci port)
- Do NOT touch live launch system: /srv/cc-ci/cc-ci-plan/agents.py, agents.toml, cc-ci-plan/state/, or running tmux sessions
- Verify from COLD START: fresh shell, /tmp checkout, no cached state
Repo state at orientation: v0.1.0 (commit 289ef07) — no tests/ dir present yet.
Awaiting Builder to push the aotest deliverable.
Code orientation @2026-06-13T18:44Z (from clean /tmp/ao-adv-check clone):
Key functions the unit tests MUST exercise (from reading agents.py 929 lines):
- load_config: session_prefix required → hard die; log_dir required → hard die; defaults merge; project_dir resolution; agents inherit defaults; services inherit defaults
- build_loop_kickoff: reads [loop].kickoff_template, fills {phase_id}/{plan}/{status}/{role}, then appends <roles_dir>/<role>.md. No project text in code; must test slot substitution.
- phase_done: reads status_basename from handoff_repo(cfg), looks for done_marker line; skips DONE_PLACEHOLDER_RE lines. Must test: file absent → False, no marker → False, marker present → True, placeholder line → False.
- phase_advance_check: auto-advance on DONE marker; idempotent when SEQUENCE-COMPLETE exists; appending a phase clears SEQUENCE-COMPLETE marker and resumes.
- _parse_reset_epoch: AM/PM handling (12pm=12:00, 12am=00:00), 24h format, invalid hour/minute returns None, no match returns None. Takes the LAST match.
- _parse_waiting_until: footer_ui branch uses last non-empty line only; non-footer scans whole pane. ISO-8601 with Z suffix. Invalid format returns None.
- pane_active: claude backend uses active_re match; opencode uses footer_ui branch (only last line of 3 matters); limit banner + idle = not active (tested in selftest).
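To pin down what I expect from the phase_done cases above, here is a minimal sketch of the four outcomes. It is not the code in agents.py: the function name phase_done_sketch, its path-based signature, and the placeholder pattern are all my assumptions, and the real DONE_PLACEHOLDER_RE may match something different.

```python
import re
from pathlib import Path

# Assumption: placeholder lines carry an angle-bracket slot or the word
# "placeholder"/"TODO". The real DONE_PLACEHOLDER_RE in agents.py may differ.
PLACEHOLDER_RE_SKETCH = re.compile(r"<[^>]*>|placeholder|TODO", re.IGNORECASE)


def phase_done_sketch(status_path: Path, done_marker: str = "## DONE") -> bool:
    """Return True only when a real (non-placeholder) done-marker line exists."""
    if not status_path.is_file():
        return False  # file absent -> phase is not done
    for line in status_path.read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        if not stripped.startswith(done_marker):
            continue  # not a marker line
        if PLACEHOLDER_RE_SKETCH.search(stripped):
            continue  # template text left in the file, not a real verdict
        return True
    return False
```

The point of the placeholder skip is that a freshly templated status file must not advance the phase on its own; only a marker line the Builder actually filled in should count.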
Live smoke isolation requirements (DoD verification):
- claude smoke: session prefix must be aotest- (NOT cc-ci-), isolated log dir under /tmp
- opencode smoke: port must ≠ 4096 (live cc-ci port is 4096), own server, own prefix
- Post-run: tmux ls | grep aotest → zero results; live sessions intact
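The isolation requirements above reduce to two mechanical checks, sketched here as a pre-flight guard. The function and constant names are hypothetical and do not come from agents.py or the smoke scripts; only the aotest- prefix and the 4096 port are taken from this log.

```python
from typing import List, Optional

TEST_PREFIX = "aotest-"      # required prefix for test sessions (from this log)
LIVE_OPENCODE_PORT = 4096    # port held by the live cc-ci opencode server


def check_isolation(session_prefix: str, opencode_port: Optional[int] = None) -> List[str]:
    """Return guardrail violations; an empty list means the run is safe to start."""
    problems = []
    if not session_prefix.startswith(TEST_PREFIX):
        problems.append(f"prefix {session_prefix!r} does not start with {TEST_PREFIX!r}")
    if opencode_port is not None and opencode_port == LIVE_OPENCODE_PORT:
        problems.append(f"port {opencode_port} is the live cc-ci opencode port")
    return problems
```

Returning a list rather than raising on the first problem lets a runner report every violation at once before refusing to start.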
Specific break-it checks I will run:
- tmux ls | grep aotest before AND after: no leakage
- ss -ltn | grep 4096: opencode test must NOT use this port
- Check cc-ci sessions: cc-ci-orchestrator, cc-ci-watchdog, cc-ci-assistant3 still present
- Try to interrupt the live smoke mid-run (if isolatable) — cleanup still fires
- Unit test edge cases:
  - load_config with missing session_prefix → expect die()
  - load_config with missing log_dir → expect die()
  - phase_done with ## DONE followed only by placeholder → expect False
  - _parse_reset_epoch("resets Jun 16, 12pm") → 12:00 (NOT 24:00 which is invalid)
  - _parse_reset_epoch("resets Jun 16, 12am") → 00:00 (not 12:00)
  - _parse_waiting_until with footer_ui=True: only last non-empty line checked
- Confirm selftest (DoD-3 of aoeng) still passes after any test infrastructure changes
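The 12am/12pm edge cases are the classic place for an off-by-twelve bug, so here is the conversion rule I am checking against, as a standalone sketch. The helper name to_24h is hypothetical; the real _parse_reset_epoch also parses the banner text and builds an epoch, which this does not attempt.

```python
from typing import Optional, Tuple


def to_24h(hour: int, minute: int, meridiem: Optional[str]) -> Optional[Tuple[int, int]]:
    """Convert a parsed clock time to 24h (hour, minute); None if invalid."""
    if not 0 <= minute <= 59:
        return None
    if meridiem is None:
        # Already 24h format: accept 0..23 only.
        return (hour, minute) if 0 <= hour <= 23 else None
    tag = meridiem.lower()
    if tag not in ("am", "pm") or not 1 <= hour <= 12:
        return None
    if tag == "am":
        # 12am is midnight, every other AM hour is unchanged.
        return (0 if hour == 12 else hour, minute)
    # 12pm is noon, every other PM hour shifts by twelve.
    return (12 if hour == 12 else hour + 12, minute)
```

A naive `hour + 12` for PM turns 12pm into 24:00, and leaving AM untouched turns 12am into 12:00; both special cases have to be handled explicitly.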
Verdicts
ALL DoD items: PASS @2026-06-13T19:00Z
Cold verification from clean /tmp/ao-adv-check clone (fresh git clone before pulling the
Builder's STATUS — verdict formed independently). Commit verified: cdcece9a9ac64b458103194025f2c22ba830ce15.
rm -rf /tmp/ao-adv-check
git clone https://...@git.autonomic.zone/recipe-maintainers/agent-orchestrator.git /tmp/ao-adv-check
git -C /tmp/ao-adv-check rev-parse HEAD
# → cdcece9a9ac64b458103194025f2c22ba830ce15 ✓ matches claimed commit
DoD-1 — Unit tests PASS (clean /tmp, nix develop): PASS
cd /tmp/ao-adv-check && nix develop -c python3 -m unittest discover -s tests -p 'test_*.py' -v
Ran 51 tests in 0.062s
OK
51 tests, rc=0. Coverage confirmed:
- TestConfigLoad (12 tests): session_prefix required die, log_dir required die, defaults merge, explicit session override, per-agent override wins, relative/absolute dir resolution, log_dir resolved, state_dir created, service session named, backend_of resolves, backend_of unknown dies, env AGENT_MODEL override single-invocation
- TestExampleConfig (1 test): shipped agents.example.toml loads with expected shape
- TestKickoff (5 tests): slot fill ({phase_id}/{plan}/{status}/{role}), correct role prompt appended, no unrendered slots, agent_prompt dispatches correctly, role_model phase override
- TestPhaseMachine (8 tests): phase_done detects marker, rejects placeholder, false when no marker, false when file missing; cur_idx reads state file; advance on DONE; sequence-complete idempotent (no re-stop on 2nd call); append-phase clears SEQUENCE-COMPLETE and resumes; custom done_marker respected
- TestLimitParsing (8 tests): PM, AM+minutes, 12am=midnight, invalid hour=None, no match=None, picks last match, unparsable fallback, within-6h window uses banner, >6h falls back
- TestWaitingUntil (5 tests): non-footer finds marker anywhere, non-footer None without marker, footer ignores marker not in last line, footer honors marker as last line, bad timestamp=None
- TestActivityDetection (8 tests): claude active_re (esc to interrupt, Running tool, spinner), claude idle not active; opencode active footer, idle footer, active-only-at-top ignored, log_grace fallback via mtime
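The footer versus non-footer distinction that TestWaitingUntil covers can be illustrated with a small sketch. The marker text "waiting until <timestamp>" and the function name are my assumptions for illustration; I have not copied the pattern agents.py actually matches.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Assumption: hypothetical marker wording; the real pattern may differ.
WAITING_RE_SKETCH = re.compile(r"waiting until (\S+)", re.IGNORECASE)


def parse_waiting_until_sketch(pane_text: str, footer_ui: bool) -> Optional[datetime]:
    """Find an ISO-8601 'Z' timestamp after the marker, honoring footer_ui."""
    lines = [ln for ln in pane_text.splitlines() if ln.strip()]
    if not lines:
        return None
    # footer_ui: only the last non-empty line counts; otherwise scan the whole pane.
    haystack = lines[-1] if footer_ui else "\n".join(lines)
    match = WAITING_RE_SKETCH.search(haystack)
    if match is None:
        return None
    try:
        parsed = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None  # marker present but timestamp malformed
    return parsed.replace(tzinfo=timezone.utc)
```

Restricting the footer branch to the last non-empty line matters because a footer-style TUI can leave stale marker text in scrollback above a footer that no longer shows it.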
DoD-2 — claude smoke PASSES via harness: PASS
cd /tmp/ao-adv-check && nix develop -c bash tests/smoke_claude.sh
=== claude backend smoke (isolated: prefix=aotest-c-681472-) ===
[agents] starting aotest-c-681472-probe (claude, kind=persistent, model=claude-haiku-4-5)
PASS: session aotest-c-681472-probe created via agents.py (pane command: claude)
PASS: claude TUI attached + alive (driven entirely by agents.py)
PASS: agents.py status reports probe RUNNING
PASS: agents.py down cleanly removed the session
=== CLAUDE BACKEND SMOKE: PASS ===
Confirmed: isolated prefix aotest-c-<pid>- (not cc-ci-), temp sandbox log_dir, pane command
is claude (TUI alive), status RUNNING, down cleans up. Cleanup trap on EXIT/INT/TERM.
DoD-3 — opencode smoke PASSES via harness (dedicated port ≠ 4096): PASS
cd /tmp/ao-adv-check && nix develop -c bash tests/smoke_opencode.sh
=== opencode backend smoke (isolated: prefix=aotest-o-681566- port=4097) ===
PASS: dedicated opencode server listening on :4097
[agents] starting aotest-o-681566-probe (opencode, kind=persistent, model=default)
PASS: session aotest-o-681566-probe created via agents.py (pane command: opencode)
PASS: opencode TUI attached + alive (driven entirely by agents.py)
PASS: agents.py status reports probe RUNNING
PASS: agents.py down cleanly removed the session
=== OPENCODE BACKEND SMOKE: PASS ===
Confirmed: dedicated server on :4097 (script has hardcoded guard refusing 4096); isolated
prefix aotest-o-<pid>-; TUI attached; cleanup kills server AND does pkill -f "opencode serve.*--port ${PORT}" + waits for port to free.
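The "waits for port to free" step can be expressed as a bounded poll. This is a Python sketch of the idea only; the smoke script itself is bash and I have not transcribed its implementation. The function names, the timeout, and the poll interval are my assumptions.

```python
import socket
import time


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


def wait_for_port_free(port: int, timeout: float = 5.0,
                       interval: float = 0.1, host: str = "127.0.0.1") -> bool:
    """Poll until nothing listens on the port; False if it is still held at timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if not port_in_use(port, host):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Bounding the wait keeps a stuck server from hanging the cleanup path; a False return is the signal to escalate, for example to the pkill fallback noted above.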
DoD-4 — No leftover aotest-* sessions or ports; cc-ci sessions intact: PASS
Post-run isolation check (after full suite via run.sh):
tmux ls | grep '^aotest-'
# → (no output) ✓
ss -ltn | grep ':4097 '
# → (no output) ✓
tmux ls | grep -E 'cc-ci-orchestrator|cc-ci-watchdog|cc-ci-assistant3'
# → cc-ci-assistant3, cc-ci-orchestrator, cc-ci-watchdog ✓
run.sh isolation sanity block output:
>>> ISOLATION SANITY
PASS: no leftover aotest-* tmux sessions
info: live cc-ci sessions present: cc-ci-orchestrator cc-ci-watchdog cc-ci-assistant3
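The two greps above amount to one classification over the session list: which aotest-* sessions leaked, and which live sessions went missing. A sketch of that check follows; the function names are hypothetical, the parsing assumes the default `tmux ls` line shape of "name: N windows (...)", and this is not how run.sh implements its sanity block.

```python
from typing import List, Sequence, Tuple

LIVE_SESSIONS = ("cc-ci-orchestrator", "cc-ci-watchdog", "cc-ci-assistant3")


def session_names(tmux_ls_output: str) -> List[str]:
    """Extract session names from default `tmux ls` output lines."""
    return [line.split(":", 1)[0].strip()
            for line in tmux_ls_output.splitlines() if line.strip()]


def isolation_report(tmux_ls_output: str, test_prefix: str = "aotest-",
                     live: Sequence[str] = LIVE_SESSIONS) -> Tuple[List[str], List[str]]:
    """Return (leftover test sessions, missing live sessions)."""
    names = session_names(tmux_ls_output)
    leftovers = [name for name in names if name.startswith(test_prefix)]
    missing_live = [name for name in live if name not in names]
    return leftovers, missing_live
```

A clean run yields two empty lists. Feeding it the output of `tmux ls` captured before and after the suite gives the same before/after comparison as the manual checks.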
DoD-5 — Test suite + runner committed and documented: PASS
Files at commit cdcece9:
- tests/test_unit.py: 51-test stdlib unittest suite ✓
- tests/smoke_claude.sh: isolated live claude smoke ✓
- tests/smoke_opencode.sh: isolated live opencode smoke ✓
- tests/run.sh: runner (unit always, live smokes when available, isolation sanity) ✓
README ## Testing section (lines ~321–351):
- Documents nix develop -c ./tests/run.sh as the canonical invocation ✓
- Explains what each layer covers (unit vs live vs isolation) ✓
- Documents skip conditions (backend bin/creds absent) ✓
- Documents useful env vars (CLAUDE_BIN, AOTEST_MODEL, AOTEST_OC_PORT, AOTEST_OC_CREDS) ✓
- Notes safety by construction (non-cc-ci prefix, non-4096 port, cleanup trap) ✓
Full suite summary (run.sh output)
SUMMARY: unit=PASS claude=PASS opencode=PASS isolation=PASS
ALL RUN TESTS PASSED (skips are OK)
rc=0. Verified at commit cdcece9, clean /tmp clone, nix develop (Python 3.11.11, tmux 3.5a).
No findings. No veto. Phase aotest is DONE.
All 5 DoD items PASS at 2026-06-13T19:00Z on commit cdcece9.