# Orchestrator journal

This file is the **persistent handoff record** for the cc-ci orchestrator. Every orchestrator
session (whether Claude or opencode) reads this on startup and appends to it when handing off or
when something noteworthy happens. It survives conversation resets — it is the memory that
`--resume` can't provide for opencode, and a more readable supplement for Claude sessions.

**On startup:** read this file before doing anything else. The most recent `## Session` entry
is where the previous session left off. Carry that context forward.

**On handoff / end of session:** append a `## Session` block (see format below) summarising
what happened, the current state, and anything the next session needs to know.

**On significant events mid-session:** append a `### Event` sub-entry (no need to wait for
handoff).

---

## Format

```markdown
## Session YYYY-MM-DD HH:MM UTC — <backend> <model>
**Left off:** <one sentence — what was the last thing done>
**Phase / loop state:** <phase X [N/11], loops RUNNING/stopped, cc-ci healthy/issue>
**Open items:** <bullet list of anything the next session needs to act on, or "none">
**Notes:** <anything surprising, a decision made, a known blocker, etc.>

### Event HH:MM — <short label>
<brief note>
```

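A minimal sketch of how a session could render a `## Session` block in the format above before appending it to this file. The function names, their parameters, and the blank lines placed between fields (mirroring the entries below) are assumptions for illustration; nothing like this is known to exist in the launchers.

```python
from datetime import datetime, timezone


def format_session_entry(when, backend, model, left_off, phase_state, open_items, notes):
    """Render one `## Session` block following the journal format."""
    # Headings in this journal are stamped in UTC, minute precision.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    if open_items:
        items = "**Open items:**\n" + "\n".join(f"- {item}" for item in open_items)
    else:
        items = "**Open items:** none"
    blocks = [
        # \u2014 is the dash character the heading format uses.
        f"## Session {stamp} \u2014 {backend} {model}",
        f"**Left off:** {left_off}",
        f"**Phase / loop state:** {phase_state}",
        items,
        f"**Notes:** {notes}",
    ]
    return "\n\n".join(blocks) + "\n"


def append_session_entry(journal_path, entry):
    """Append an entry, separated from earlier text by a blank line."""
    with open(journal_path, "a", encoding="utf-8") as handle:
        handle.write("\n" + entry)
```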
---

## Session 2026-05-31 ~18:30 UTC — Claude Sonnet 4.6

**Left off:** Got opencode/deepseek-v4-pro working as the loop backend. Both builder and
adversary are actively running on `tinfoil/deepseek-v4-pro` (via `inference.tinfoil.sh`).
Phase 5 [11/11] in progress. The operator is debugging the opencode web UI visibility and
wants to continue orchestrating from opencode itself.

**Phase / loop state:**
- Phase **5 [11/11]** (`plan-phase5-verify-upgrade-flow.md`), in progress
- Latest product-repo commit: `de635ad` — `status(5): V3 DONE (custom-html-tiny upgrade GREEN); V7 DONE; A5-1/A5-2 fixed`
- Loops **RUNNING** on opencode/deepseek-v4-pro, actively processing (32–62K tokens in flight)
- Watchdog **RUNNING**, backend persisted to `.loop-backend` / `.loop-model` files

**Open items for next session:**
- Phase 5 loops need to finish V1–V9 and write `## DONE` to STATUS-5.md. They were at V3+V7 PASS before the backend switch. After completing phase 5, phase 6 (reconcile-only over all 18 recipe mirrors) and phase 7 (full upgrade on n8n + ghost + matrix-synapse) still need running.
- Phase 4 (final review/polish) was deliberately **deferred** — run it after weekly Opus credits reset. Phase idx currently at 10 (phase 5). To run phase 4 later: set idx to 9, start with `LOOP_BACKEND=claude RESUME_PHASE=1 cc-ci-plan/launch.sh start`.
- **Restart loops after reading this** — the current sessions are mid-processing. `cc-ci-plan/launch.sh status` will show state; if sessions are stalled, run `LOOP_BACKEND=opencode LOOP_MODEL=tinfoil/deepseek-v4-pro RESUME_PHASE=1 cc-ci-plan/launch.sh start`.
- DNS: `oc.commoninternet.net A 100.84.190.30` still needs adding (operator step). Web UI reachable directly at `http://100.84.190.30` in the meantime.
- Old Incus orchestrator VM (`cc-ci-orchestrator`, `100.116.55.106`) still cold standby — stop + delete when confident in Hetzner.

**Notes — opencode/tinfoil setup (critical for next session):**
- **Backend files:** `LOOP_BACKEND=opencode` and `LOOP_MODEL=tinfoil/deepseek-v4-pro` are persisted in `/srv/cc-ci/.cc-ci-logs/.loop-backend` and `.loop-model`. The watchdog reads these to restart dead sessions with the right backend.
- **API key:** stored in `/srv/cc-ci/.testenv` as `TINFOIL_API_KEY`. Written directly (not via `env:`) into `~/.config/opencode/opencode.jsonc` — opencode doesn't do env substitution in apiKey. The config also has `"permission": "allow"` (all tool calls auto-approved).
- **Inference URL:** `https://inference.tinfoil.sh/v1` (NOT `api.tinfoil.sh` — that's the control plane only). Fixed in both `.testenv` and `opencode.jsonc`.
- **Opencode web server:** `opencode-web.service` runs `opencode serve --hostname 127.0.0.1 --port 4096`. Nginx proxies `oc.commoninternet.net → localhost:4096` on the Tailscale IP. Sessions from the plain `opencode` TUI DO appear in the shared server's DB (they auto-connect via IPC), so the web UI should show them once DNS is set.
- **Launch command for opencode loops:** `LOOP_BACKEND=opencode LOOP_MODEL=tinfoil/deepseek-v4-pro RESUME_PHASE=1 cc-ci-plan/launch.sh start`
- **Launch command for claude loops (fallback):** `LOOP_BACKEND=claude LOOP_MODEL=sonnet RESUME_PHASE=1 cc-ci-plan/launch.sh start`
- **Launchers rewritten to Python:** `launch.py`, `launch-orchestrator.py`, `launch-upgrader.py` (bash wrappers are one-liners). All committed to `recipe-maintainers/cc-ci-orchestrator` (HEAD: `3412100`).
- **Opencode binary:** `/home/loops/.local/bin/opencode` v1.15.13. Re-install if missing: `curl -sL https://github.com/anomalyco/opencode/releases/download/v1.15.13/opencode-linux-x64.tar.gz | tar -xz -C /home/loops/.local/bin opencode`
- **Known opencode quirk:** the loop bootstrap message (pointing to the kickoff file) is sent via `ping_session` with `submit_key="Enter"`. The TUI needs ~8s to connect before the message is sent. If a session seems stuck at the blank prompt, manually send the message from `.cc-ci-logs/.kickoff-cc-ci-builder.txt` (or the adv equivalent), then press Enter.
- **Orchestrator in opencode:** `LOOP_BACKEND=opencode LOOP_MODEL=tinfoil/deepseek-v4-pro cc-ci-plan/launch-orchestrator.sh fresh` — no `--resume` (opencode doesn't support it); reads this JOURNAL.md as startup context.

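The backend-files note above can be pictured roughly as follows. This is a sketch under assumptions, not the watchdog's actual code: it assumes each file holds a single plain-text value, and the helper names and the `claude`/`sonnet` fallback defaults are hypothetical (chosen only because the journal lists that pair as the fallback launch command).

```python
from pathlib import Path


def read_persisted_backend(log_dir, default_backend="claude", default_model="sonnet"):
    """Return (backend, model) from .loop-backend / .loop-model in log_dir.

    Missing or empty files fall back to the given defaults (an assumption).
    """
    log_dir = Path(log_dir)

    def read_or(name, fallback):
        path = log_dir / name
        if path.is_file():
            value = path.read_text(encoding="utf-8").strip()
            if value:
                return value
        return fallback

    return read_or(".loop-backend", default_backend), read_or(".loop-model", default_model)


def launch_env(backend, model, resume=True):
    """Build the environment variables used by the launch commands in this journal."""
    env = {"LOOP_BACKEND": backend, "LOOP_MODEL": model}
    if resume:
        env["RESUME_PHASE"] = "1"
    return env
```

With the files described above in place, `read_persisted_backend("/srv/cc-ci/.cc-ci-logs")` would yield the pair to pass to `launch_env` before restarting a dead session.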
### Event 04:13 — migrated orchestrator to Hetzner cpx22
`cc-ci-loops.service` enabled, reboot-resilient. cc-ci server also Hetzner (server 134485294, `ssh cc-ci` → `100.95.31.88`).

### Event 13:22 — phase 4 paused, phase 5 started
Weekly Opus credits exhausted mid-session. Switched to Sonnet. Phase idx manually set to 10 (phase 5).

### Event 17:29 — loops stopped to switch backends

### Event 18:20 — opencode/deepseek loops running
After 7 bug fixes (wrong inference host, opencode run exits, --dir exits, env: not substituted in apiKey, permission prompts, submit key, timing), both loops now running on `tinfoil/deepseek-v4-pro` via the shared opencode-web.service.

---

## Session 2026-06-01 14:13 UTC — OpenCode GPT-5.4

**Left off:** Completed the assistant-owned phase 6 mirror reconcile pass and phase 7 targeted recipe-upgrade pass, wrote the operator summary, and dropped a `phase6-phase7.done` marker.

**Phase / loop state:**
- Builder/Adversary loops still on phase **5 [11/11]** separately from this assistant work.
- Assistant phase 6/7 summary/result file: `cc-ci-plan/phase6-phase7-summary-2026-06-01.md`
- Assistant phase 6/7 completion marker: `cc-ci-plan/phase6-phase7.done`

**Open items:**
- Bridge enrollment does **not** match the full phase-2 18-recipe set. Repo/live poll set = `custom-html`, `custom-html-tiny`, `cryptpad`, `hedgedoc`, `keycloak`, `lasuite-docs`, `lasuite-meet`, `matrix-synapse`, `n8n` (+ `cc-ci`). Missing vs phase-2 set: `bluesky-pds`, `discourse`, `ghost`, `immich`, `lasuite-drive`, `mailu`, `mattermost-lts`, `mumble`, `plausible`, `uptime-kuma`. Extra: `hedgedoc`.
- `ghost` phase-7 PR is open but not CI-triggerable until bridge enrollment includes `recipe-maintainers/ghost`.
- Review whether these recipes are still intended to be enrolled even though they have no mirror repo: `lasuite-drive`, `mailu`, `mumble`, `uptime-kuma`.

**Notes:**
- Phase 6 reconciled all 18 enrolled recipes from scratch clones. Stale mirror PRs auto-closed on `lasuite-docs` (#1/#2/#3) and `keycloak` (#1). Four enrolled recipes currently have no mirror repo.
- Phase 7 outcomes: `n8n` stable PR `#3` went GREEN on build `61`; `matrix-synapse` existing PR `#1` re-ran and failed on build `53`; `ghost` PR `#2` opened successfully but verification is blocked by bridge enrollment mismatch.
- The bridge service rolled during verification; earlier `!testme` comments posted before or during the restart were swallowed as pre-existing by the poller startup pass. A clean re-run on stable `n8n` after the rollout confirmed the live path.

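The bridge-enrollment mismatch in the open items above is a plain set difference, and the numbers can be cross-checked: the poll set minus the one extra recipe, plus the ten missing ones, should give the 18-recipe phase-2 set. A small sketch using only the recipe names listed in this entry (the `enrollment_diff` helper is hypothetical, and `cc-ci` is left out because the entry lists it separately from the recipes):

```python
def enrollment_diff(expected, actual):
    """Return (missing, extra): recipes expected but not polled, and polled but not expected."""
    return sorted(expected - actual), sorted(actual - expected)


# Recipes the bridge currently polls, as recorded in this entry.
polled = {
    "custom-html", "custom-html-tiny", "cryptpad", "hedgedoc", "keycloak",
    "lasuite-docs", "lasuite-meet", "matrix-synapse", "n8n",
}
# Recipes recorded as missing versus the phase-2 set, and the one extra.
missing_recorded = {
    "bluesky-pds", "discourse", "ghost", "immich", "lasuite-drive",
    "mailu", "mattermost-lts", "mumble", "plausible", "uptime-kuma",
}
extra_recorded = {"hedgedoc"}

# Reconstruct the phase-2 set from those three lists and confirm it has 18 members.
phase2 = (polled - extra_recorded) | missing_recorded
assert len(phase2) == 18

missing, extra = enrollment_diff(phase2, polled)
print("missing:", missing)
print("extra:", extra)
```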
---

## Session 2026-05-31 ~04:00 UTC — Claude Sonnet 4.6

**Left off:** Completed the orchestrator → Hetzner migration (cpx22, server 134487234, public
`168.119.126.100`, tailnet `cc-ci-orchestrator-1` @ `100.84.190.30`). The old Incus VM
(`100.116.55.106`) is still on the tailnet — cold standby, not yet deleted.

**Phase / loop state:** Phases 1c–1e, 2w, 2pc, 2, 2b, 3 all DONE. Phase 5 [11/11]
(upgrade-flow verify) in progress — loops running, actively verifying the `!testme`
end-to-end flow on the new Hetzner cc-ci server.

**Open items:**
- Phase 5 is in progress — loops need to finish V1–V9 and write `## DONE` to STATUS-5.md.
- Phase 4 (final review/polish) was deliberately **skipped** this session — it is queued
  at idx 9 in PHASE_IDX_FILE. Resume it after the weekly Opus credits reset.
- Phase 6 (reconcile-only over all 18 recipe mirrors) and Phase 7 (full upgrade on n8n +
  ghost + matrix-synapse) are planned but not yet started — run them after Phase 5 DONE.
- Old Incus orchestrator VM (`cc-ci-orchestrator`, `100.116.55.106`) is still running —
  stop it via the b1 Incus API once happy with the Hetzner box. mTLS certs at
  `/srv/incus-terraform-nix-vm-creator/terraform-secrets/`.
- DNS: `oc.commoninternet.net` A record → `100.84.190.30` still needs adding (operator step).

**Notes:**
- `cc-ci-loops.service` is **enabled** and wired with `reboot-log.sh` ExecStartPre — a reboot
  is a non-event; loops + watchdog auto-resume via RESUME_PHASE=1.
- The cc-ci **server** also moved to Hetzner (server 134485294, `ssh cc-ci` →
  `100.95.31.88`). It has authenticated Docker Hub pulls and 150 GB disk — the old OOM /
  disk-starvation / rate-limit issues are gone.
- All recipe mirrors currently reconcile correctly; no stale open PRs observed.
- `opencode` v1.15.13 installed at `/home/loops/.local/bin/opencode`. Tinfoil API key is in
  `.testenv` as `TINFOIL_API_KEY`. Backend switch: `LOOP_BACKEND=opencode
  LOOP_MODEL=tinfoil/deepseek-v4-pro RESUME_PHASE=1 cc-ci-plan/launch.sh start`.
- Launcher scripts rewritten to Python (`launch.py`, `launch-orchestrator.py`,
  `launch-upgrader.py`); bash wrappers are now one-liners that `exec python3 <script> "$@"`.

### Event 03:13 — migrated from old Incus VM to Hetzner
Loops were started manually during staging (not by the service); first systemd-managed
boot was later this session. `cc-ci-loops.service` now enabled.

### Event 05:23 — phase 3 (results-UX) completed
All R1–R8 Adversary-verified, no VETO. Watchdog auto-advanced to phase 4.

### Event 13:22 — phase 4 paused, jumped to phase 5
Operator deferred phase 4 (weekly Opus credits exhausted). Phase idx manually set to 10
(phase 5). Loops restarted on Sonnet.

### Event 17:29 — loops stopped pending restart on different model
Operator paused loops to reconfigure backend (opencode/tinfoil exploration). Phase 5
[11/11] was in progress — loops had verified V1/V2/V3/V7 (custom-html-tiny upgrade GREEN).
Phase idx = 10 (phase 5), loops stopped, watchdog stopped.

---

## Session 2026-06-01 03:34 UTC — OpenCode GPT-5.4

**Left off:** Fixed opencode web visibility for the Builder/Adversary loop sessions by switching
the loop launcher from plain TUI startup to `opencode attach` against the shared web server, and
patched the orchestrator launcher the same way for the next session.

**Phase / loop state:**
- Phase **5 [11/11]** (`plan-phase5-verify-upgrade-flow.md`), still in progress
- Loops **RUNNING** on opencode with OpenAI `gpt-5.4`
- Watchdog **RUNNING**
- `opencode-web.service` **RUNNING** and nginx still serving `http://oc.commoninternet.net`

**Open items:**
- Start a fresh orchestrator session in opencode if desired; this current conversation cannot be
  resumed as an opencode session, only handed off.
- If you want the orchestrator tmux session to move from Claude to opencode, use
  `LOOP_BACKEND=opencode LOOP_MODEL=openai/gpt-5.4 ORCH_SESSION=cc-ci-orchestrator-oc cc-ci-plan/launch-orchestrator.sh fresh`
  or stop/recreate `cc-ci-orchestrator-vm` explicitly.
- Phase 5 work itself is still unfinished; loops should continue from current state.
- Phase 4 remains deferred; phases 6 and 7 still remain after phase 5 completes.

**Notes:**
- The key fix for web visibility was **`opencode attach http://127.0.0.1:4096 --dir ...`**.
  Plain `opencode` TUI sessions were inconsistently recorded and often did not show in the web UI.
- The path choice was much less important than attach mode. We tested both symlinked and real repo
  paths. Attach mode was the real fix.
- One attached loop initially hit `python3: not found` because tool execution started flowing
  through the shared `opencode-web.service` environment. Fixed by broadening the service PATH at
  runtime and in `nix/hosts/cc-ci-orchestrator-hetzner/configuration.nix`.
- Current launcher state: `cc-ci-plan/launch.py` uses attach mode for opencode loops; `cc-ci-plan/launch-orchestrator.py`
  is patched to use attach mode for opencode orchestrator sessions too.
- A runtime systemd override was applied at `/run/systemd/system/opencode-web.service.d/override.conf`.
  Persist the final service environment with `nixos-rebuild` when convenient.

### Event 13:46 — recovered cc-ci from emergency mode via Hetzner rescue
`cc-ci` stopped booting cleanly after a `nixos-rebuild test --flake path:/root/builder-clone#cc-ci`
activation. Hetzner rescue + VNC console showed emergency mode; mounted journal showed `/boot` waiting on
`/dev/disk/by-label/ESP`. The immediate repair was restoring the missing FAT label on `/dev/sda15`
(`fatlabel /dev/sda15 ESP`) and rebooting normally. Follow-up investigation item: determine why the
wrong boot layout was activated and prevent future use of `#cc-ci` on the Hetzner server when the
correct host target is `#cc-ci-hetzner`.

### Event 18:53 — scheduled supervision pass
Checked Builder, Adversary, and Assistant live state. `ssh cc-ci hostname` still returns `nixos` after
the corrected Hetzner rebuild. Builder is active on a fresh matrix-synapse rerun under the restored
bridge path; Adversary was nudged to re-orient to that live state; Assistant remains idle after
finishing phase 6/7 and recording the bridge-enrollment mismatch against the full 18-recipe phase-2 set.

### Event 16:34 — progress monitor nudged stalled phase-5 workers
`launch.py status` showed builder, adversary, and watchdog running; `ssh cc-ci hostname` succeeded (`nixos`).
Assistant session was present and already idle after its completed phase 6/7 pass (`phase6-phase7.done` exists).
Builder was still blocked on a model usage-limit retry and adversary was parked past `WAITING-UNTIL 2026-06-01T14:24:51Z`, so both received tmux nudges to re-read the live phase-5 status and continue from current evidence.

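The "parked past `WAITING-UNTIL`" check in this event amounts to parsing the marker's timestamp and comparing it to the current time. A minimal sketch, assuming the marker appears verbatim in whatever text the monitor inspects (for example a worker's status output) and always uses the `YYYY-MM-DDTHH:MM:SSZ` form seen here; the function names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches markers such as: WAITING-UNTIL 2026-06-01T14:24:51Z
WAITING_RE = re.compile(r"WAITING-UNTIL (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z")


def waiting_until(text):
    """Return the last WAITING-UNTIL timestamp in text as an aware UTC datetime, or None."""
    matches = WAITING_RE.findall(text)
    if not matches:
        return None
    parsed = datetime.strptime(matches[-1], "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def needs_nudge(text, now):
    """True when a declared wait has already expired, so the worker is parked past it."""
    deadline = waiting_until(text)
    return deadline is not None and now >= deadline
```

For the situation recorded above, `needs_nudge("WAITING-UNTIL 2026-06-01T14:24:51Z", now)` is true for any `now` at or after 14:24:51 UTC, which includes the 16:34 pass.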
### Event 19:04 — progress monitor rechecked phase-5 workers
`launch.py status` still shows phase 5 [11/11] in progress with builder, adversary, and watchdog running; `ssh cc-ci hostname` still succeeds (`nixos`).
`STATUS-5.md` still lacks `## DONE`, so phase 5 remains open, while `cc-ci-plan/phase6-phase7.done` confirms the assistant-owned phase 6/7 work is finished and the assistant remains idle.
Builder is active on the current V5 frontier; adversary's declared `WAITING-UNTIL 2026-06-01T19:03:38Z` had just expired, so it was nudged to re-read the live phase-5 status and continue from current evidence.

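The completion test used here (phase 5 stays open until `STATUS-5.md` contains `## DONE`) can be sketched as below. It assumes the marker is a heading on a line of its own, so that prose merely mentioning `## DONE` does not count; the real watchdog may match differently, and both function names are hypothetical.

```python
from pathlib import Path


def has_done_marker(status_text):
    """True if any line is exactly the `## DONE` heading (surrounding whitespace ignored)."""
    return any(line.strip() == "## DONE" for line in status_text.splitlines())


def phase_is_done(status_path):
    """Read a status file such as STATUS-5.md; a missing file counts as not done."""
    path = Path(status_path)
    if not path.is_file():
        return False
    return has_done_marker(path.read_text(encoding="utf-8"))
```

Matching the whole line rather than a substring is a deliberate choice in this sketch: a status file that says "write `## DONE` when finished" should not be read as finished.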
### Event 19:08 — operator directed simulated stale-test path
Operator clarified that V5/V6 should not depend on discovering a naturally occurring stale-test recipe. Builder and adversary were both nudged to switch to a simulated/seeded stale-test case on an enrolled sandbox candidate, then verify the two intended behaviors: DEFAULT comment-only and `--with-tests` opening/verifying the paired cc-ci test PR.

### Event 21:46 — backend reverted to claude, waker folded into watchdog, boot service fixed (Claude Sonnet 4.6)
Operator was out of Claude credits and had run the loops on opencode (deepseek-v4-pro, then gpt-5.4); now reverted to claude.
- **Backend → claude/sonnet.** Closed all opencode sessions (`cc-ci-orchestrator-oc`, `cc-ci-assistant`) and stopped `opencode serve`; restarted builder+adv via `RESUME_PHASE=1 LOOP_BACKEND=claude LOOP_MODEL=sonnet launch.py start`. `.loop-backend`=claude, `.loop-model`=sonnet. Restarted the watchdog too so it dropped its stale opencode-backend memory.
- **Waker → watchdog.** Retired the standalone `ai-progress-monitor.sh`/`cc-ci-orchestrator-waker` (it pinged the dead `-oc` session every 15m). The watchdog now wakes the orchestrator session for an hourly supervision pass (`ORCH_WAKE_INTERVAL`=3600s, prompt = `ai-progress-monitor-prompt.txt`), retrying each tick until the orchestrator is idle so it never interrupts/skips. Reboot-safe (watchdog is started by `cc-ci-loops.service`).
- **Boot fix.** `cc-ci-loops.service` had been failing on every boot (`claude CLI not found`) because the systemd `path` lacked `/home/loops/.local/bin`; loops were started by hand. Fixed in the flake (`CLAUDE_BIN` abs path + PATH export), `nixos-rebuild switch` applied — service now starts the loops cleanly on boot. Verified: clean start log, no error, phase 5 RUNNING.
- **Note:** the rebuild restarted `opencode-web.service` (still `wantedBy multi-user.target` in the flake) — idle serve, harmless to the claude loops, but it will keep returning on every rebuild/reboot until disabled in the flake.

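The wake behaviour described under "Waker → watchdog" (hourly pass, retried on each tick until the orchestrator is idle, so a pass is neither skipped nor delivered mid-task) can be modelled as a small state machine. This is an illustrative sketch, not the watchdog's code: the class, its `tick` method, and the use of plain seconds for time are assumptions; only the 3600 s interval comes from the note.

```python
ORCH_WAKE_INTERVAL = 3600  # seconds, as recorded in the note above


class OrchestratorWaker:
    """Decide, once per watchdog tick, whether to send the supervision prompt."""

    def __init__(self, start, interval=ORCH_WAKE_INTERVAL):
        self.interval = interval
        self.last_wake = start  # time of the last delivered wake (or watchdog start)

    def tick(self, now, orchestrator_idle):
        """Return True if the wake prompt should be sent on this tick."""
        due = now - self.last_wake >= self.interval
        if due and orchestrator_idle:
            # Only a delivered wake resets the timer; a busy orchestrator
            # leaves the wake pending so the next tick retries it.
            self.last_wake = now
            return True
        return False
```

Because the timer resets only on delivery, a busy orchestrator delays the pass instead of dropping it, which is the "never interrupts/skips" property the note describes.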
### Event 23:23 — BUILD COMPLETE (all phases done) + weekly-upgrade cron cutover to a NixOS timer
Phase 5 reached `## DONE` and the watchdog wrote SEQUENCE-COMPLETE at 23:23:43Z: **the entire cc-ci build is finished** (phases 1c 1b 1d 1e 2w 2pc 2 2b 3 4 5). All V1–V9 + §4 cron Adversary-verified PASS, no VETOs, no open findings. The watchdog auto-stopped the loops and exited (so the in-watchdog hourly orchestrator wake is also gone now — by design; the build is done). Only `cc-ci-orchestrator-vm` remains up.
- **§4 cron — how the loops left it vs. final state.** During verification the loops swapped the busybox-crond-in-tmux for a `CronCreate` job (weekly id `8dd9aed3`, Mon 23:04 UTC) and disabled busybox crond. But CronCreate is **in-memory + session-scoped**: when the Builder session ended at sequence-complete, that weekly job evaporated (confirmed: `CronList` from this session shows none). That fragility is exactly what the operator asked to fix.
- **Final mechanism = reboot-safe NixOS systemd timer.** Activated `cc-ci-upgrade-all.{service,timer}` (committed earlier as `ee58027`): **OnCalendar Sun 02:00 UTC, Persistent=true**, timer-triggered only (service not `wantedBy multi-user.target`). `nixos-rebuild switch` applied — only ADDED the two units, did NOT bounce anything (loops were already stopped). `systemctl list-timers` → next run **Sun 2026-06-07 02:00:00 UTC**. Retired the leftovers: busybox crond already gone, removed the inert `/home/loops/.cc-ci-crontabs/loops`.
- **Operator-requested schedule change:** weekly upgrade moved from Mon 23:04 UTC (the phase-5 test schedule) to **Sun 02:00 UTC**.
- **Stale note:** `cc-ci/machine-docs/DECISIONS.md` still records "§4 weekly cron: CronCreate" — now superseded by the NixOS timer. Left to the operator/next loop run to amend (cc-ci product repo, loops' single-writer domain).
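
As a cross-check on the `systemctl list-timers` output above, the next weekly firing can be recomputed by hand. The sketch below is a simplified stand-in for a weekly `OnCalendar` schedule, not systemd's calendar parser, and it does not model `Persistent=true`; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def next_weekly_run(now, weekday=6, hour=2, minute=0):
    """Next occurrence strictly after `now` of the given weekday (Mon=0 .. Sun=6) at hour:minute UTC."""
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # Already at or past this week's slot, so move to the following week.
        candidate += timedelta(days=7)
    return candidate


# SEQUENCE-COMPLETE time from this event: Monday 2026-06-01 23:23:43 UTC.
done_at = datetime(2026, 6, 1, 23, 23, 43, tzinfo=timezone.utc)
print(next_weekly_run(done_at).isoformat())  # 2026-06-07T02:00:00+00:00
```

The result agrees with the recorded next run of Sun 2026-06-07 02:00:00 UTC.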