Recent phases wrote STATUS/BACKLOG/REVIEW/JOURNAL to the repo ROOT because
build_kickoff + plan.md's tree used bare filenames, even though the loops'
AGENTS.md + INBOX/DECISIONS/DEFERRED conventions already said machine-docs/.
Make machine-docs/ the single mandated home everywhere: build_kickoff now
emits machine-docs/ paths + an explicit FILE-LOCATION RULE; both loop prompts
and plan.md (tree + seed step) updated; orchestrator AGENTS.md documents +
enforces it. resolve_state/INBOX handoff already read machine-docs/ first.
When LOG_DIR/.run-upgrade-on-complete exists, the watchdog runs
'launch-upgrader.py start' the moment the last phase reaches ## DONE, then
consumes the flag. Lets the operator replace a scheduled weekly cron run with
'run as soon as the current phase queue finishes' — used tonight: the
cc-ci-upgrade-all.timer was stopped (stamp forwarded past tonight's slot) and
this flag set instead.
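
The flag handling described above could look roughly like the sketch below.
Only the flag name comes from this commit; the function name, its parameters
and the injected launch callable are hypothetical, chosen so the logic can be
exercised without starting a real upgrader.

```python
from pathlib import Path
from typing import Callable

FLAG_NAME = ".run-upgrade-on-complete"  # flag file name from the commit text


def maybe_run_upgrade(log_dir: Path, all_phases_done: bool,
                      launch: Callable[[], None]) -> bool:
    """Launch the upgrader once, when the queue is finished and the flag is set.

    `launch` is a hypothetical hook; in a real watchdog it might spawn
    `launch-upgrader.py start` via subprocess.
    """
    flag = log_dir / FLAG_NAME
    if not (all_phases_done and flag.exists()):
        return False
    launch()
    flag.unlink()  # consume the flag so later ticks do not relaunch
    return True
```

Consuming the flag after the launch means a failed launch leaves the flag in
place for the next tick, which is one possible reading of "then consumes".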
A Builder scaffolded 'STATUS-mailu.md' with a '## DONE / Not yet. Written
here only when ...' placeholder section; phase_done's startswith('## DONE')
matched it and auto-advanced past mailu without any of its work being done
(no recipe PR, no claim, no review). Harden phase_done: a '## DONE' heading
counts only when its first non-empty body line is not a placeholder/negation
(Not yet / pending / TBD / when all / <...> etc). Verified against all shipped
STATUS files (real DONEs still detected; mailu placeholder rejected).
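
A minimal sketch of the hardened check, assuming the placeholder vocabulary is
exactly the examples listed above; the real pattern list is not shown here.
Treating a '## DONE' heading with an empty body as "not done" is also an
assumption of this sketch.

```python
import re

# Assumed placeholder/negation vocabulary, taken from the examples in the text.
PLACEHOLDER_RE = re.compile(
    r"^\s*(not yet|pending|tbd|when all|<[^>]*>)", re.IGNORECASE
)


def phase_done(status_text: str) -> bool:
    """True only if a '## DONE' section starts with a non-placeholder line."""
    lines = status_text.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith("## DONE"):
            continue
        for body in lines[i + 1:]:
            if body.startswith("## "):
                break  # next heading reached: this DONE section had no body
            if body.strip():
                if not PLACEHOLDER_RE.match(body):
                    return True
                break  # first non-empty body line is a placeholder
    return False
```

With this shape, a scaffolded section whose first body line reads 'Not yet.
Written here only when ...' is rejected, while a section that opens with a
real completion note is still detected.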
A per-phase .loop-model-<phase> file lets a single phase pin a different
model; it is read fresh on each role_model call, so a phase transition flips
it automatically with no watchdog bounce. Operator
wants builder on opus for the complex dstamp phase, reverting to sonnet from
mailu on: .loop-model-dstamp=opus while base .loop-model stays sonnet.
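
The lookup order could be sketched as follows. The signature, the directory
argument and the 'sonnet' fallback are assumptions for illustration; the
actual role_model may also take the agent role into account.

```python
from pathlib import Path

DEFAULT_MODEL = "sonnet"  # assumed fallback when no file is present


def role_model(state_dir: Path, phase: str) -> str:
    """Resolve the model for `phase`, re-reading the files on every call."""
    # Phase-specific pin first, then the base file.
    for name in (f".loop-model-{phase}", ".loop-model"):
        path = state_dir / name
        if path.is_file():
            value = path.read_text().strip()
            if value:
                return value
    return DEFAULT_MODEL
```

Because nothing is cached, moving from dstamp to mailu changes the result as
soon as the phase name passed in changes.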
The tick whose probe resumed a session was continuing into stall logic with
its pre-resume pane capture; a 4h-old WAITING-UNTIL in that stale data got
the freshly-resumed adversary kill+rebooted (05:52). Treat probe-resume as
handled-this-tick; the next 30s tick sees the live session.
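
The control-flow change can be illustrated with a small sketch. All three
callables are hypothetical stand-ins for the watchdog's own capture, probe and
stall routines.

```python
from typing import Callable, List


def watchdog_tick(capture_pane: Callable[[], List[str]],
                  probe_resumed: Callable[[], bool],
                  stall_logic: Callable[[List[str]], None]) -> str:
    """One tick: a probe-resume ends the tick before stall logic runs."""
    pane = capture_pane()  # snapshot taken before the probe runs
    if probe_resumed():
        # The snapshot predates the resume, so it must not reach stall logic.
        return "resumed"
    stall_logic(pane)
    return "checked"
```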
Night-watch findings (monthly-spend-limit window, ~01:49-04:45):
- probe text said 'usage limit' which matches LIMIT_RE, so a submitted probe
  kept limited_now true forever -> reworded to 'quota window' with a CAUTION
  note (nudge text must never match LIMIT_RE)
- dedupe scanned all 40 captured lines, so once a probe scrolled into the
  conversation no further probe ever fired (builder/adv frozen at nudges=1,
  orchestrator probes degraded to hourly riding the wake scroll) -> dedupe
  now only checks the bottom 8 lines (input area)
Core invariant HELD: zero kill+reboots during the limit window.
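
Both fixes can be sketched together. The LIMIT_RE pattern and the probe
wording below are placeholders, since neither is quoted in full here; only the
8-line window and the "must never match" rule come from the findings above.

```python
import re
from typing import List

# Placeholder pattern; the real LIMIT_RE is not shown in this log.
LIMIT_RE = re.compile(r"usage limit", re.IGNORECASE)
# Hypothetical probe wording built around the 'quota window' phrasing.
PROBE_TEXT = "quota window check: continue if you can"
INPUT_AREA_LINES = 8  # only the bottom of the pane counts as the input area

# CAUTION: nudge text must never match LIMIT_RE, or it keeps the state limited.
assert not LIMIT_RE.search(PROBE_TEXT)


def probe_already_visible(pane_lines: List[str]) -> bool:
    """Dedupe against the input area only, not the scrolled conversation."""
    tail = pane_lines[-INPUT_AREA_LINES:]
    return any(PROBE_TEXT in line for line in tail)
```

A probe that has scrolled up into the conversation no longer blocks the next
one, while a probe still sitting unsent in the input area does.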
plan(lvl5): operator addition - the top-corner level badge (card, dashboard
pill, badge SVG) shows only the level number+color, zero capping info; the
inline per-rung table keeps intentional-skip/unverified detail.
Operator request: the hourly supervision prompt should land regardless of
limit state, as a fallback that keeps things on track if the limit-state
machinery ever breaks. If the limit is genuinely still in force the wake is
harmless (the banner just re-prints and limit_tick re-arms); once it lifts,
the queued wake doubles as a resume nudge.
Replace the blind every-300s 'limit appears lifted' nudge (claude) and the
opencode-only _maybe_nudge_limit with one unified limit_tick state machine:
- parse the reset time from the limit banner (last match wins; stale banners
  whose time already passed fall back rather than waiting ~a day)
- arm a quiet window until reset+45s; parse failure -> flat 5-minute probe
  loop (operator-specified; not exponential backoff)
- while armed, suppress ALL healing: a limit-stalled session is NEVER
  kill+rebooted (this was the conc-phase churn: claude limit stalls fell
  through to the generic idle reboot, losing the banner and re-hitting
  the limit fresh)
- at window end send ONE nudge as a self-verifying probe: spinner clears
  the state; a re-printed banner re-arms from the fresh reset time
- dedupe: never stack a probe while our own text is visible in the pane
- state persisted per session in LOG_DIR (.limited-<session>) so watchdog
  restarts keep the window
- orchestrator gets the same treatment: limit_tick in heal_orchestrator,
  a per-signal-tick orch_limit_check, and hourly wakes deferred during
  limit windows
- loud WARNING at 3 probes, then continue flat probes forever
Also rename the orchestrator session default cc-ci-orchestrator-vm ->
cc-ci-orchestrator (launch.py ORCH_SESSION, launch-orchestrator.py SESSION,
docs/scripts references).
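
The window-arming step might be sketched as below. The banner wording matched
by RESET_RE ("resets 4:45am") is a guess made for illustration, as are the
function and constant names; only the rules (last match wins, reset+45s, flat
5-minute fallback for parse failures and stale times) come from the list above.

```python
import re
from datetime import datetime, timedelta

# Hypothetical banner shape, e.g. "resets 4:45am" or "reset at 3pm".
RESET_RE = re.compile(
    r"resets? (?:at )?(\d{1,2})(?::(\d{2}))?\s*(am|pm)", re.IGNORECASE
)
GRACE = timedelta(seconds=45)
FALLBACK = timedelta(minutes=5)


def quiet_until(pane_text: str, now: datetime) -> datetime:
    """Return the time before which no probe or healing should happen."""
    matches = RESET_RE.findall(pane_text)
    if not matches:
        return now + FALLBACK  # parse failure: flat 5-minute probe loop
    hour_s, minute_s, ampm = matches[-1]  # last match wins
    hour = int(hour_s) % 12 + (12 if ampm.lower() == "pm" else 0)
    reset = now.replace(hour=hour, minute=int(minute_s or 0),
                        second=0, microsecond=0)
    if reset <= now:
        # Stale banner (this sketch also lands here for resets past midnight):
        # fall back instead of waiting about a day.
        return now + FALLBACK
    return reset + GRACE
```

Persisting the returned timestamp per session, for example in a
.limited-<session> file, would let a restarted watchdog keep the same window.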
The raw 'tmux pipe-pane' logs are TUI-escape soup (the 191MB builder log).
agent-log.py renders Claude's own JSONL transcript into a clean one-event-
per-line <agent>.clean.log — read-only on a file the agent writes anyway, so
zero agent slowdown and zero extra tokens. Resolves each agent's transcript
(disambiguating the shared project dir by kickoff signature; tracks restarts).
'follow-all' runs as the cc-ci-cleanlogs session, wired into launch.py start
so it comes up with the loops. render/tail subcommands for ad-hoc use.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mirror the .loop-backend pattern: env wins, else the persisted file, else
the default build sequence. Without this, a custom single-phase run was
invisible to bare 'launch.py status' and would NOT survive a reboot (the
service has no PHASES_SPEC env). Now the current phase set is durable.
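
The precedence rule can be sketched in a few lines. The state-file argument
and the default value are placeholders; the name of the persisted file is not
given in this commit.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_PHASES_SPEC = "default"  # stand-in for the default build sequence


def resolve_phases_spec(state_file: Path,
                        env: Optional[Mapping[str, str]] = None) -> str:
    """Env wins, else the persisted file, else the default."""
    env = os.environ if env is None else env
    value = env.get("PHASES_SPEC", "").strip()
    if value:
        return value
    if state_file.is_file():
        value = state_file.read_text().strip()
        if value:
            return value
    return DEFAULT_PHASES_SPEC
```

A service started without PHASES_SPEC in its environment then still sees the
custom spec, as long as the launcher wrote it to the state file earlier.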
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The watchdog is spawned into the existing tmux server and didn't reliably
inherit a custom PHASES_SPEC — it would fall back to the default 11-phase
spec and mis-detect completion. Forward PHASES_SPEC/PHASE_IDX_FILE/
LOOP_BACKEND/LOOP_MODEL explicitly in the watchdog command so custom
single-phase runs (like the mirror-enroll plan) work end-to-end. Also make
the mirror-enroll plan's live-host-deploy step an explicit claim-and-wait
operator gate for the loops.
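
One way to forward the variables explicitly is to build them into the command
string itself, as sketched below. The helper name and the exact command layout
are assumptions; only the four variable names and the 'watchdog' subcommand
come from these commits.

```python
import shlex
from typing import Mapping

FORWARDED = ("PHASES_SPEC", "PHASE_IDX_FILE", "LOOP_BACKEND", "LOOP_MODEL")


def watchdog_command(env: Mapping[str, str], script: str = "launch.py") -> str:
    """Build a shell command that carries the settings with it.

    A process spawned into an existing tmux server may not inherit the
    caller's environment, so each set variable is passed via `env`.
    """
    assignments = [
        f"{name}={shlex.quote(env[name])}" for name in FORWARDED if env.get(name)
    ]
    return " ".join(
        ["env", *assignments, "python3", shlex.quote(script), "watchdog"]
    )
```

Unset or empty variables are left out, so the watchdog's own defaults still
apply to them.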
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The standalone ai-progress-monitor.sh waker pinged a hardcoded
orchestrator session every 15m. Move that into the watchdog loop:
ORCH_WAKE_INTERVAL (default 3600s) types the supervision prompt into
the live orchestrator session, retrying each tick until it lands so a
busy or briefly-absent orchestrator is never interrupted and no hour is
skipped. Delete the now-redundant waker script; the prompt file is now
driven by the watchdog. Reboot-safe by inheritance (the watchdog is
started by cc-ci-loops.service).
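
The retry-until-it-lands behaviour could be modelled as a small scheduler that
the watchdog loop calls on every tick. The class and the try_send callable are
hypothetical; only ORCH_WAKE_INTERVAL and its 3600s default come from the text.

```python
import time
from typing import Callable

ORCH_WAKE_INTERVAL = 3600  # seconds, default from the commit text


class WakeScheduler:
    """Tracks when the next supervision prompt is due."""

    def __init__(self, interval: float = ORCH_WAKE_INTERVAL,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interval = interval
        self.clock = clock
        self.last_sent = clock()

    def tick(self, try_send: Callable[[], bool]) -> bool:
        """Call once per watchdog tick; True when the prompt landed."""
        if self.clock() - self.last_sent < self.interval:
            return False  # not due yet
        if not try_send():
            return False  # busy or absent: stay due and retry next tick
        self.last_sent = self.clock()
        return True
```

Here try_send is expected to return False without typing anything when the
orchestrator is busy or missing, so the wake stays pending rather than being
dropped for that hour.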
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1. API key: opencode doesn't support env: substitution in apiKey — write
actual key value to ~/.config/opencode/opencode.jsonc at setup time
(file is not committed to git; key sourced from .testenv).
2. Permission system: add permission:"allow" to opencode config (equivalent
to --dangerously-skip-permissions) to avoid interactive prompts.
3. Submit key: opencode TUI uses Enter (return) to submit; Ctrl+S not
needed. ping_session already uses Enter — keep as is.
4. Startup timing: bump opencode TUI init wait from 4s to 8s so the TUI
is fully connected to the server before bootstrap is sent.
5. Backend persistence: LOOP_BACKEND/LOOP_MODEL written to .loop-backend /
.loop-model so the watchdog uses them when restarting dead sessions.
All tested: both builder and adversary sessions alive, deepseek-v4-pro
processing kickoffs via tinfoil inference.tinfoil.sh, no API/permission
errors.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three fixes discovered during first live run:
- inference host is inference.tinfoil.sh not api.tinfoil.sh (control plane
  only serves /v1/models, not /v1/chat/completions)
- opencode run exits after one turn; switch to opencode attach for the
  persistent TUI, then ping_session sends the kickoff prompt
- NO_COLOR=1 suppresses the first-run interactive theme picker
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bash scripts are now one-liner wrappers: exec python3 <script>.py "$@"
All logic lives in the Python scripts (pure stdlib, no deps).
launch.py — loops + watchdog:
Full port of launch.sh: phase sequencing, start/stop/status/logs/watchdog,
handoff signalling, stall detection, heal_session, heal_orchestrator.
Cleaner structure: config block → helpers → phase/kickoff/agent/healing/
handoff/watchdog/main. LOOP_BACKEND + LOOP_MODEL switches throughout.
launch-orchestrator.py — orchestrator session:
claude path: --resume <id> preserved (conversation survives reboots).
opencode path: run --attach --title (no --resume; STARTUP_PROMPT orients
the new session; reads JOURNAL.md for context).
STARTUP_PROMPT updated to reference JOURNAL.md on startup.
launch-upgrader.py — one-shot upgrade job:
LOOP_BACKEND / LOOP_MODEL take precedence over UPGRADER_BACKEND / UPGRADER_MODEL.
Both claude and opencode paths supported.
cc-ci-plan/JOURNAL.md — new orchestrator handoff file:
Persistent across conversation resets. Documents the handoff format and
carries the current session's summary: migration complete, phase 5 in
progress (V3/V7 PASS), phase 4 deferred, open items for next session.
AGENTS.md: step 1 on startup = read JOURNAL.md; step 5 = append on handoff.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>