Orchestrator journal
This file is the persistent handoff record for the cc-ci orchestrator. Every orchestrator
session (whether Claude or opencode) reads this on startup and appends to it when handing off or
when something noteworthy happens. It survives conversation resets — it is the memory that
--resume can't provide for opencode, and a more readable supplement for Claude sessions.
On startup: read this file before doing anything else. The most recent ## Session entry
is where the previous session left off. Carry that context forward.
On handoff / end of session: append a ## Session block (see format below) summarising
what happened, the current state, and anything the next session needs to know.
On significant events mid-session: append a ### Event sub-entry (no need to wait for
handoff).
Format
## Session YYYY-MM-DD HH:MM UTC — <backend> <model>
**Left off:** <one sentence — what was the last thing done>
**Phase / loop state:** <phase X [N/11], loops RUNNING/stopped, cc-ci healthy/issue>
**Open items:** <bullet list of anything the next session needs to act on, or "none">
**Notes:** <anything surprising, a decision made, a known blocker, etc.>
### Event HH:MM — <short label>
<brief note>
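A minimal sketch of how a session could render and append a block in the format above. The helper names `render_session_block` and `append_session` are hypothetical (they are not part of the existing launchers), and the journal path is whatever the caller passes in.

```python
from datetime import datetime, timezone
from pathlib import Path

# The journal's heading format separates the timestamp from the backend with
# this dash character (U+2014).
HEADING_DASH = "\u2014"


def render_session_block(backend, model, left_off, state, open_items, notes, now=None):
    """Render one '## Session' entry following the journal's Format section."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M UTC")
    if open_items:
        items = "\n" + "\n".join(f"- {item}" for item in open_items)
    else:
        items = " none"
    return (
        f"## Session {stamp} {HEADING_DASH} {backend} {model}\n"
        f"**Left off:** {left_off}\n"
        f"**Phase / loop state:** {state}\n"
        f"**Open items:**{items}\n"
        f"**Notes:** {notes}\n"
    )


def append_session(journal_path, block):
    """Append a rendered block to the journal, never rewriting earlier entries."""
    path = Path(journal_path)
    with path.open("a", encoding="utf-8") as fh:
        fh.write("\n" + block)
```

Opening the file in append mode keeps the record strictly additive, which matches the "append, don't edit" convention described above.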
Session 2026-05-31 ~18:30 UTC — Claude Sonnet 4.6
Left off: Got opencode/deepseek-v4-pro working as the loop backend. Both builder and
adversary are actively running on tinfoil/deepseek-v4-pro (via inference.tinfoil.sh).
Phase 5 [11/11] in progress. The operator is debugging the opencode web UI visibility and
wants to continue orchestrating from opencode itself.
Phase / loop state:
- Phase 5 [11/11] (plan-phase5-verify-upgrade-flow.md), in progress
- Latest product-repo commit: de635ad, "status(5): V3 DONE (custom-html-tiny upgrade GREEN); V7 DONE; A5-1/A5-2 fixed"
- Loops RUNNING on opencode/deepseek-v4-pro, actively processing (32–62K tokens in flight)
- Watchdog RUNNING, backend persisted to the .loop-backend / .loop-model files
Open items for next session:
- Phase 5 loops need to finish V1–V9 and write ## DONE to STATUS-5.md. They were at V3+V7 PASS before the backend switch. After completing phase 5, phase 6 (reconcile-only over all 18 recipe mirrors) and phase 7 (full upgrade on n8n + ghost + matrix-synapse) still need running.
- Phase 4 (final review/polish) was deliberately deferred; run it after weekly Opus credits reset. Phase idx currently at 10 (phase 5). To run phase 4 later: set idx to 9, start with LOOP_BACKEND=claude RESUME_PHASE=1 cc-ci-plan/launch.sh start.
- Restart loops after reading this: the current sessions are mid-processing. cc-ci-plan/launch.sh status will show state; if sessions are stalled, LOOP_BACKEND=opencode LOOP_MODEL=tinfoil/deepseek-v4-pro RESUME_PHASE=1 cc-ci-plan/launch.sh start.
- DNS: oc.commoninternet.net A 100.84.190.30 still needs adding (operator step). Web UI reachable directly at http://100.84.190.30 in the meantime.
- Old Incus orchestrator VM (cc-ci-orchestrator, 100.116.55.106) still cold standby; stop + delete when confident in Hetzner.
Notes — opencode/tinfoil setup (critical for next session):
- Backend files: LOOP_BACKEND=opencode and LOOP_MODEL=tinfoil/deepseek-v4-pro are persisted in /srv/cc-ci/.cc-ci-logs/.loop-backend and .loop-model. The watchdog reads these to restart dead sessions with the right backend.
- API key: stored in /srv/cc-ci/.testenv as TINFOIL_API_KEY. Written directly (not via env:) into ~/.config/opencode/opencode.jsonc, because opencode doesn't do env substitution in apiKey. The config also has "permission": "allow" (all tool calls auto-approved).
- Inference URL: https://inference.tinfoil.sh/v1 (NOT api.tinfoil.sh, which is the control plane only). Fixed in both .testenv and opencode.jsonc.
- Opencode web server: opencode-web.service runs opencode serve --hostname 127.0.0.1 --port 4096. Nginx proxies oc.commoninternet.net → localhost:4096 on tailscale IP. Sessions from the plain opencode TUI DO appear in the shared server's DB (they auto-connect via IPC), so the web UI should show them once DNS is set.
- Launch command for opencode loops: LOOP_BACKEND=opencode LOOP_MODEL=tinfoil/deepseek-v4-pro RESUME_PHASE=1 cc-ci-plan/launch.sh start
- Launch command for claude loops (fallback): LOOP_BACKEND=claude LOOP_MODEL=sonnet RESUME_PHASE=1 cc-ci-plan/launch.sh start
- Launchers rewritten to Python: launch.py, launch-orchestrator.py, launch-upgrader.py (bash wrappers are one-liners). All committed to recipe-maintainers/cc-ci-orchestrator (HEAD: 3412100).
- Opencode binary: /home/loops/.local/bin/opencode v1.15.13. Re-install if missing: curl -sL https://github.com/anomalyco/opencode/releases/download/v1.15.13/opencode-linux-x64.tar.gz | tar -xz -C /home/loops/.local/bin opencode
- Known opencode quirk: the loop bootstrap message (pointing to the kickoff file) is sent via ping_session with submit_key="Enter". The TUI needs ~8s to connect before the message is sent. If a session seems stuck at the blank prompt, manually send the message from .cc-ci-logs/.kickoff-cc-ci-builder.txt (or adv), then press Enter.
- Orchestrator in opencode: LOOP_BACKEND=opencode LOOP_MODEL=tinfoil/deepseek-v4-pro cc-ci-plan/launch-orchestrator.sh fresh. No --resume (opencode doesn't support it); reads this JOURNAL.md as startup context.
Event 04:13 — migrated orchestrator to Hetzner cpx22
cc-ci-loops.service enabled, reboot-resilient. cc-ci server also Hetzner (server 134485294, ssh cc-ci → 100.95.31.88).
Event 13:22 — phase 4 paused, phase 5 started
Weekly Opus credits exhausted mid-session. Switched to Sonnet. Phase idx manually set to 10 (phase 5).
Event 17:29 — loops stopped to switch backends
Event 18:20 — opencode/deepseek loops running
After 7 bug fixes (wrong inference host, opencode run exits, --dir exits, env: not substituted in apiKey, permission prompts, submit key, timing), both loops now running on tinfoil/deepseek-v4-pro via the shared opencode-web.service.
Session 2026-06-01 14:13 UTC — OpenCode GPT-5.4
Left off: Completed the assistant-owned phase 6 mirror reconcile pass and phase 7 targeted recipe-upgrade pass, wrote the operator summary, and dropped a phase6-phase7.done marker.
Phase / loop state:
- Builder/Adversary loops still on phase 5 [11/11] separately from this assistant work.
- Assistant phase 6 summary/result file: cc-ci-plan/phase6-phase7-summary-2026-06-01.md
- Assistant phase 6/7 completion marker: cc-ci-plan/phase6-phase7.done
Open items:
- Bridge enrollment does not match the full phase-2 18-recipe set. Repo/live poll set = custom-html, custom-html-tiny, cryptpad, hedgedoc, keycloak, lasuite-docs, lasuite-meet, matrix-synapse, n8n (+ cc-ci). Missing vs phase-2 set: bluesky-pds, discourse, ghost, immich, lasuite-drive, mailu, mattermost-lts, mumble, plausible, uptime-kuma. Extra: hedgedoc.
- ghost phase-7 PR is open but not CI-triggerable until bridge enrollment includes recipe-maintainers/ghost.
- Review whether recipes still intended to be enrolled without mirrors: lasuite-drive, mailu, mumble, uptime-kuma.
Notes:
- Phase 6 reconciled all 18 enrolled recipes from scratch clones. Stale mirror PRs auto-closed on lasuite-docs (#1/#2/#3) and keycloak (#1). Four enrolled recipes currently have no mirror repo.
- Phase 7 outcomes: n8n stable PR #3 went GREEN on build 61; matrix-synapse existing PR #1 re-ran and failed on build 53; ghost PR #2 opened successfully but verification is blocked by bridge enrollment mismatch.
- The bridge service rolled during verification; earlier !testme comments posted before/during the restart were swallowed as pre-existing by the poller startup pass. A clean re-run on stable n8n after the rollout confirmed the live path.
Session 2026-05-31 ~04:00 UTC — Claude Sonnet 4.6
Left off: Completed the orchestrator → Hetzner migration (cpx22, server 134487234, public
168.119.126.100, tailnet cc-ci-orchestrator-1 @ 100.84.190.30). The old Incus VM
(100.116.55.106) is still on the tailnet — cold standby, not yet deleted.
Phase / loop state: Phases 1c–1e, 2w, 2pc, 2, 2b, 3 all DONE. Phase 5 [11/11]
(upgrade-flow verify) in progress — loops running, actively verifying the !testme
end-to-end flow on the new Hetzner cc-ci server.
Open items:
- Phase 5 is in progress: loops need to finish V1–V9 and write ## DONE to STATUS-5.md.
- Phase 4 (final review/polish) was deliberately skipped this session; it is queued at idx 9 in PHASE_IDX_FILE. Resume it after the weekly Opus credits reset.
- Phase 6 (reconcile-only over all 18 recipe mirrors) and Phase 7 (full upgrade on n8n + ghost + matrix-synapse) are planned but not yet started; run them after Phase 5 DONE.
- Old Incus orchestrator VM (cc-ci-orchestrator, 100.116.55.106) is still running; stop it via the b1 Incus API once happy with the Hetzner box. mTLS certs at /srv/incus-terraform-nix-vm-creator/terraform-secrets/.
- DNS: oc.commoninternet.net A record → 100.84.190.30 still needs adding (operator step).
Notes:
- cc-ci-loops.service is enabled and wired with reboot-log.sh ExecStartPre, so a reboot is a non-event; loops + watchdog auto-resume via RESUME_PHASE=1.
- The cc-ci server also moved to Hetzner (server 134485294, ssh cc-ci → 100.95.31.88). It has authenticated Docker Hub pulls and 150 GB disk; the old OOM / disk-starvation / rate-limit issues are gone.
- All recipe mirrors currently reconcile correctly; no stale open PRs observed.
- opencode v1.15.13 installed at /home/loops/.local/bin/opencode. Tinfoil API key is in .testenv as TINFOIL_API_KEY. Backend switch: LOOP_BACKEND=opencode LOOP_MODEL=tinfoil/deepseek-v4-pro RESUME_PHASE=1 cc-ci-plan/launch.sh start.
- Launcher scripts rewritten to Python (launch.py, launch-orchestrator.py, launch-upgrader.py); bash wrappers are now one-liners that exec python3 <script> "$@".
Event 03:13 — migrated from old Incus VM to Hetzner
Loops were started manually during staging (not by the service); first systemd-managed
boot was later this session. cc-ci-loops.service now enabled.
Event 05:23 — phase 3 (results-UX) completed
All R1–R8 Adversary-verified, no VETO. Watchdog auto-advanced to phase 4.
Event 13:22 — phase 4 paused, jumped to phase 5
Operator deferred phase 4 (weekly Opus credits exhausted). Phase idx manually set to 10 (phase 5). Loops restarted on Sonnet.
Event 17:29 — loops stopped pending restart on different model
Operator paused loops to reconfigure backend (opencode/tinfoil exploration). Phase 5 [11/11] was in progress — loops had verified V1/V2/V3/V7 (custom-html-tiny upgrade GREEN). Phase idx = 10 (phase 5), loops stopped, watchdog stopped.
Session 2026-06-01 03:34 UTC — OpenCode GPT-5.4
Left off: Fixed opencode web visibility for the Builder/Adversary loop sessions by switching
the loop launcher from plain TUI startup to opencode attach against the shared web server, and
patched the orchestrator launcher the same way for the next session.
Phase / loop state:
- Phase 5 [11/11] (plan-phase5-verify-upgrade-flow.md), still in progress
- Loops RUNNING on opencode with OpenAI gpt-5.4
- Watchdog RUNNING
- opencode-web.service RUNNING and nginx still serving http://oc.commoninternet.net
Open items:
- Start a fresh orchestrator session in opencode if desired; this current conversation cannot be resumed as an opencode session, only handed off.
- If you want the orchestrator tmux session to move from Claude to opencode, use LOOP_BACKEND=opencode LOOP_MODEL=openai/gpt-5.4 ORCH_SESSION=cc-ci-orchestrator-oc cc-ci-plan/launch-orchestrator.sh fresh or stop/recreate cc-ci-orchestrator-vm explicitly.
- Phase 5 work itself is still unfinished; loops should continue from current state.
- Phase 4 remains deferred; phases 6 and 7 still remain after phase 5 completes.
Notes:
- The key fix for web visibility was opencode attach http://127.0.0.1:4096 --dir .... Plain opencode TUI sessions were inconsistently recorded and often did not show in the web UI.
- The path choice was much less important than attach mode. We tested both symlinked and real repo paths. Attach mode was the real fix.
- One attached loop initially hit python3: not found because tool execution started flowing through the shared opencode-web.service environment. Fixed by broadening the service PATH at runtime and in nix/hosts/cc-ci-orchestrator-hetzner/configuration.nix.
- Current launcher state: cc-ci-plan/launch.py uses attach mode for opencode loops; cc-ci-plan/launch-orchestrator.py is patched to use attach mode for opencode orchestrator sessions too.
- A runtime systemd override was applied at /run/systemd/system/opencode-web.service.d/override.conf. Persist the final service environment with nixos-rebuild when convenient.
Event 13:46 — recovered cc-ci from emergency mode via Hetzner rescue
cc-ci stopped booting cleanly after a nixos-rebuild test --flake path:/root/builder-clone#cc-ci
activation. Hetzner rescue + VNC console showed emergency mode; mounted journal showed /boot waiting on
/dev/disk/by-label/ESP. The immediate repair was restoring the missing FAT label on /dev/sda15
(fatlabel /dev/sda15 ESP) and rebooting normally. Follow-up investigation item: determine why the
wrong boot layout was activated and prevent future use of #cc-ci on the Hetzner server when the
correct host target is #cc-ci-hetzner.
Event 18:53 — scheduled supervision pass
Checked Builder, Adversary, and Assistant live state. ssh cc-ci hostname still returns nixos after
the corrected Hetzner rebuild. Builder is active on a fresh matrix-synapse rerun under the restored
bridge path; Adversary was nudged to re-orient to that live state; Assistant remains idle after
finishing phase 6/7 and recording the bridge-enrollment mismatch against the full 18-recipe phase-2 set.
Event 16:34 — progress monitor nudged stalled phase-5 workers
launch.py status showed builder, adversary, and watchdog running; ssh cc-ci hostname succeeded (nixos).
Assistant session was present and already idle after its completed phase 6/7 pass (phase6-phase7.done exists).
Builder was still blocked on a model usage-limit retry and adversary was parked past WAITING-UNTIL 2026-06-01T14:24:51Z, so both received tmux nudges to re-read the live phase-5 status and continue from current evidence.
Event 19:04 — progress monitor rechecked phase-5 workers
launch.py status still shows phase 5 [11/11] in progress with builder, adversary, and watchdog running; ssh cc-ci hostname still succeeds (nixos).
STATUS-5.md still lacks ## DONE, so phase 5 remains open, while cc-ci-plan/phase6-phase7.done confirms the assistant-owned phase 6/7 work is finished and the assistant remains idle.
Builder is active on the current V5 frontier; adversary's declared WAITING-UNTIL 2026-06-01T19:03:38Z had just expired, so it was nudged to re-read the live phase-5 status and continue from current evidence.
Event 19:08 — operator directed simulated stale-test path
Operator clarified that V5/V6 should not depend on discovering a naturally occurring stale-test recipe. Builder and adversary were both nudged to switch to a simulated/seeded stale-test case on an enrolled sandbox candidate, then verify the two intended behaviors: DEFAULT comment-only and --with-tests opening/verifying the paired cc-ci test PR.
Event 21:46 — backend reverted to claude, waker folded into watchdog, boot service fixed (Claude Sonnet 4.6)
Operator was out of Claude credits and had run the loops on opencode (deepseek-v4-pro, then gpt-5.4); now reverted to claude.
- Backend → claude/sonnet. Closed all opencode sessions (cc-ci-orchestrator-oc, cc-ci-assistant) and stopped opencode serve; restarted builder+adv via RESUME_PHASE=1 LOOP_BACKEND=claude LOOP_MODEL=sonnet launch.py start. .loop-backend=claude, .loop-model=sonnet. Restarted the watchdog too so it dropped its stale opencode-backend memory.
- Waker → watchdog. Retired the standalone ai-progress-monitor.sh / cc-ci-orchestrator-waker (it pinged the dead -oc session every 15m). The watchdog now wakes the orchestrator session for an hourly supervision pass (ORCH_WAKE_INTERVAL=3600s, prompt = ai-progress-monitor-prompt.txt), retrying each tick until the orchestrator is idle so it never interrupts/skips. Reboot-safe (watchdog is started by cc-ci-loops.service).
- Boot fix. cc-ci-loops.service had been failing on every boot (claude CLI not found) because the systemd path lacked /home/loops/.local/bin; loops were started by hand. Fixed in the flake (CLAUDE_BIN abs path + PATH export), nixos-rebuild switch applied; the service now starts the loops cleanly on boot. Verified: clean start log, no error, phase 5 RUNNING.
- Note: the rebuild restarted opencode-web.service (still wantedBy multi-user.target in the flake). It is an idle serve, harmless to the claude loops, but it will keep returning on every rebuild/reboot until disabled in the flake.
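The wake behaviour described above (a fixed interval, with a retry on every tick while the orchestrator is busy) can be sketched as a small scheduler. This is an illustration only: the class name `WakeScheduler` and its `tick` method are hypothetical, and the real watchdog's internals are not shown in this journal.

```python
class WakeScheduler:
    """Decide, on each watchdog tick, whether to send the supervision prompt."""

    def __init__(self, interval_s=3600, start=0.0):
        self.interval_s = interval_s
        self.next_due = start + interval_s

    def tick(self, now, orchestrator_idle):
        """Return True when the wake prompt should be sent on this tick."""
        if now < self.next_due:
            return False  # not due yet
        if not orchestrator_idle:
            # Due but busy: leave next_due untouched so the next tick retries,
            # which means a wake is delayed but never skipped.
            return False
        self.next_due = now + self.interval_s
        return True
```

Scheduling the next wake from the delivery time, not from the original due time, keeps passes at least one interval apart even after a long busy stretch.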
Event 23:23 — BUILD COMPLETE (all phases done) + weekly-upgrade cron cutover to a NixOS timer
Phase 5 reached ## DONE and the watchdog wrote SEQUENCE-COMPLETE at 23:23:43Z: the entire cc-ci build is finished (phases 1c 1b 1d 1e 2w 2pc 2 2b 3 4 5). All V1–V9 + §4 cron Adversary-verified PASS, no VETOs, no open findings. The watchdog auto-stopped the loops and exited (so the in-watchdog hourly orchestrator wake is also gone now — by design; the build is done). Only cc-ci-orchestrator-vm remains up.
- §4 cron: how the loops left it vs. final state. During verification the loops swapped the busybox-crond-in-tmux for a CronCreate job (weekly id 8dd9aed3, Mon 23:04 UTC) and disabled busybox crond. But CronCreate is in-memory + session-scoped: when the Builder session ended at sequence-complete, that weekly job evaporated (confirmed: CronList from this session shows none). That fragility is exactly what the operator asked to fix.
- Final mechanism = reboot-safe NixOS systemd timer. Activated cc-ci-upgrade-all.{service,timer} (committed earlier as ee58027): OnCalendar Sun 02:00 UTC, Persistent=true, timer-triggered only (service not wantedBy multi-user.target). nixos-rebuild switch applied; it only ADDED the two units and did NOT bounce anything (loops were already stopped). systemctl list-timers → next run Sun 2026-06-07 02:00:00 UTC. Retired the leftovers: busybox crond already gone, removed the inert /home/loops/.cc-ci-crontabs/loops.
- Operator-requested schedule change: weekly upgrade moved from Mon 23:04 UTC (the phase-5 test schedule) to Sun 02:00 UTC.
- Stale note: cc-ci/machine-docs/DECISIONS.md still records "§4 weekly cron: CronCreate", now superseded by the NixOS timer. Left to the operator/next loop run to amend (cc-ci product repo, loops' single-writer domain).
Event 2026-06-02 03:42 — post-build work: mirror+regression phases DONE; overnight /upgrade-all running
After the cc-ci build completed (2026-06-01), the operator drove a sequence of post-build phases via the loops:
- mirror phase DONE (01:16Z): all recipes mirrored + enrolled (created mirrors for lasuite-drive/mailu/mumble; enrolled the 9 missing; loops did the live-host nixos-rebuild #cc-ci + !testme verification themselves after the deploy gate was removed; #cc-ci is safe since be4f451).
- regression phase DONE (03:42:07Z): entire 2-phase sequence (mirror→regression) complete; loops + watchdog stopped/exited. Shipped tests/regression/ (cc-ci PR #5, NOT merged): 7 canaries, all Adversary cold-verified. good-simple (custom-html-tiny) GREEN, good-significant (lasuite-docs) GREEN (5 tiers + clean teardown + no secret leak), bad-false-green RED, and 4 per-tier REDs (bad-install/upgrade/backup/restore, each RED at the intended tier with prior tiers passing; dedicated fixture recipes custom-html-bkp-bad / custom-html-rst-bad). 3 @canary + 4 @canary_fast; README documents the milestone-only cadence (not per-commit).
- PR consolidation (Assistant, one-shot): every recipe mirror reduced to ≤1 open PR (custom-html #2→#1 @1.13.0; ghost #2→#1 = backup+upgrade). Verified one-open-PR-per-recipe across all mirrors.
- Overnight run in flight: cc-ci-overnight (overnight-run.sh) gated on assistant-done + usage-reset + loops-idle, then launched the weekly /upgrade-all at 03:40Z (DEFAULT, never merges). It will write /srv/cc-ci/.cc-ci-logs/overnight-report-<date>.md and ping this session to deliver the operator's morning PushNotification (held until then, so no overnight ping). The build-complete + regression-shipped headline will be folded into that morning notification.
- State: watchdog/loops stopped by design (sequence complete → hourly wake stops too); the overnight runner + the weekly Sun-02:00 timer are the only live automation. Recipe-upgrade PR behavior was also reworked this session: never close unmerged PRs; extend an existing open upgrade PR by commit-on-top (no force-push) instead of a parallel PR; only close merged-upstream PRs.
Event 2026-06-02 11:40 — OVERNIGHT /upgrade-all COMPLETE (full run, all 18 recipes)
The overnight runner finished and pinged the orchestrator; morning report at /srv/cc-ci/.cc-ci-logs/overnight-report-2026-06-02.md (+ upgrades/upgrade-all-2026-06-02.md). Considered 18 · GREEN !testme: 10 · stale-test (commented): 2 · failed: 2 · skipped: 4. Nothing merged. It followed through on all recipes (the original 7 + the 7 recovered from the abra-auth issue + the rest).
- GREEN (10): cryptpad, keycloak, lasuite-meet, mailu, n8n (⚠ pg volume path change), custom-html, custom-html-tiny, uptime-kuma, lasuite-docs, ghost (⚠ supersedes its open ci/mysql-backup PR#1).
- Stale-test → operator --with-tests (2): matrix-synapse (test_upgrade_preserves_data, ci_marker lost across pgautoupgrade 17→18), discourse (test_create_topic_roundtrip, Discourse 3.5.0 flipped allow_uncategorized_topics default).
- Failed, pre-existing recipe bugs (2): mattermost-lts (test_restore_returns_state after 3 !testme; backup/restore bug, see ci/pg-restore PR#1), plausible (ClickHouse IPv6 + GHCR move after 3 !testme; see ci/clickhouse-backup-resilient PR#1).
- Skipped (4): bluesky-pds / mumble / lasuite-drive up-to-date; immich skipped because abra can't parse tag+digest image refs (explanatory comment left on PR#1).
- abra-auth issue: confirmed = go-git needs creds embedded in origin URL (ignores insteadOf/.netrc); recovered all 8 at runtime; skills now fixed (this session). TTY script wrapper was a separate, also-correct fix.
- Operator follow-ups: (a) a few recipes now have 2 open PRs, an upgrade PR alongside a prior backup-fix PR (discourse #2+#1, ghost #3+#1, mattermost-lts #2+#1, plausible #2+#1); reconcile/close the superseded ones; (b) re-run matrix-synapse + discourse with --with-tests to refresh the stale tests; (c) mattermost-lts + plausible failures are pre-existing recipe bugs to investigate.
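For the abra-auth finding above (credentials must live in the origin URL itself because insteadOf/.netrc are ignored), a minimal sketch of building such a URL follows. The function name `embed_credentials` and the example host, user and token are hypothetical; this is not the code used in the skills.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def embed_credentials(url, user, token):
    """Return an http(s) remote URL with user:token embedded in the authority."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("only http(s) remotes can carry embedded credentials")
    host = parts.hostname or ""
    if ":" in host:
        host = f"[{host}]"  # restore brackets around an IPv6 literal
    if parts.port:
        host = f"{host}:{parts.port}"
    # Percent-encode both fields so characters such as '/' or '@' in a token
    # cannot be mistaken for URL structure. Any credentials already present
    # in the input URL are replaced.
    netloc = f"{quote(user, safe='')}:{quote(token, safe='')}@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

The result could then be applied to a checkout with `git remote set-url origin <url>`. Keep in mind that a token stored this way is readable in the repository's `.git/config`.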
Event 2026-06-02 ~17:30 — bridge: one comment per !testme (deployed)
Operator wanted each !testme to get its OWN comment (edited in place to that run's result), instead of
the old "reuse/edit one marker comment in place" (which made re-runs on an unchanged head invisible).
Changed bridge/bridge.py process_testme: always post_comment a fresh ⏳ placeholder per run;
watch_and_reflect still edits THAT run's cid to ✅/❌. cc-ci repo commit a78ec2d (pushed).
- Deployed to the live cc-ci server. ⚠️ Deploy-path gap: the documented /root/cc-ci is GONE. Deployed via /root/builder-clone (the harness's host clone, which has the real remote + the secrets submodule): git pull origin main → git submodule update --init secrets → nixos-rebuild switch --flake '/root/builder-clone?submodules=1#cc-ci'. Diff was bridge-only (no other nix/ changes), so only the bridge image rolled (content-hash tag 3761c42→4482ce9). Verified: new image Running, poller watching all 20 repos. Follow-up: establish a clean canonical deploy checkout for the cc-ci server (not the harness's builder-clone).
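The behaviour change in this event can be illustrated with an in-memory stand-in for the forge. The function names `process_testme`, `post_comment` and `watch_and_reflect` come from the entry above, but their signatures here, the `FakeForge` class, `edit_comment`, and the comment wording are all assumptions made for the sketch; the real bridge/bridge.py is not reproduced.

```python
import itertools


class FakeForge:
    """Hypothetical in-memory forge: stores comments by id."""

    def __init__(self):
        self.comments = {}
        self._ids = itertools.count(1)

    def post_comment(self, pr, body):
        cid = next(self._ids)
        self.comments[cid] = (pr, body)
        return cid

    def edit_comment(self, cid, body):
        pr, _ = self.comments[cid]
        self.comments[cid] = (pr, body)


def process_testme(forge, pr, run_id):
    """Always post a fresh placeholder, so every run has its own comment."""
    return forge.post_comment(pr, f"⏳ run {run_id}: in progress")


def watch_and_reflect(forge, cid, run_id, passed):
    """Edit only the comment that belongs to this run."""
    mark = "✅" if passed else "❌"
    outcome = "passed" if passed else "failed"
    forge.edit_comment(cid, f"{mark} run {run_id}: {outcome}")
```

Because each run carries its own comment id, a re-run on an unchanged head leaves a visible second comment instead of silently overwriting the first.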
Event 2026-06-02 ~23:05 — /recipe-report skill + report.ci.commoninternet.net SHIPPED
Built + deployed the weekly public "Recipe Report" (plan: cc-ci-plan/plan-recipe-report-skill.md).
- Serving: nix/modules/reports.nix (nginx:alpine static server, traefik Host(report.ci.commoninternet.net) + wildcard TLS, serves /var/lib/cc-ci-reports). cc-ci repo f5a6f71, deployed via builder-clone. Live.
- Generator: cc-ci-plan/recipe-report.py (survey/render/publish) + skill .claude/skills/recipe-report/ and cc-ci-plan/launch-report.py (own cc-ci-report agent, REPORT_MODEL default opus, separate from the sonnet upgrader). upgrade-all's closing step launches it. orchestrator repo c7301a9.
- Page: title "The Recipe Report" / "Week of "; ① Needs attention (PRs to merge + errors) · Routine · comprehensive table (all recipes, CI shown as level/number+LINK, no images). Index lists all weeks.
- First report (opus-generated) LIVE: https://report.ci.commoninternet.net/week-2026-06-02.html (10 green PRs, 2 failed, matrix-synapse stale-test; 21-recipe table). From next weekly /upgrade-all it auto-publishes.
- Note: still deploying the cc-ci server via /root/builder-clone (the deploy-path gap remains).
Event 2026-06-02 ~23:16 — Recipe Report v2: newspaper front page (CVE-led editorial)
Reworked the report to a newspaper layout: masthead + opus editorial LEAD (overall fleet state + what
to focus on) + a 🔒 Security Bulletin of critical-CVE upgrades FIRST, then needs-attention/routine, then
the comprehensive table ("the full wire"). survey now feeds opus each recipe's upgrade_notes_md
(breaking-change/CVE analysis). orchestrator 6cf5913. First v2 (opus) live + verified — it led with
the nginx 1.29→1.31 CVE batch (custom-html, cryptpad) and even noted live state past the morning summary.
Event 2026-06-09 ~19:50 — Orchestrator handover (assistant session): concurrent-CI fixes + immich/plausible drive
Operator promoted the cc-ci-assistant session (immich upgrade one-shot) to ORCHESTRATOR: "work on these
fixes to concurrent runs, then drive immich and plausible to green; autonomous; track in this repo."
Immich (PR recipe-maintainers/immich#2, head a92b28d): upgrade to
1.7.0+v2.7.5 (postgres pin HELD at 14-vectorchord0.4.3-pgvectors0.2.0@sha256:bcf63357… — what
immich-server v2.7.5 pins; abra FATA'd on tag+digest so surveyed upstream directly, registry persisted
at cc-ci-plan/upstream/immich.md) + backup/restore fix: pg_dump --clean --if-exists no-DROP restore
(DROP DATABASE PANICs pgvecto.rs → postgres signal 6 — confirmed in CI 225 logs + dev) + immich-docs
search_path sed. Verified GREEN end-to-end in dev via real abra backup/restore path; dev-immich torn
down, zero leakage. 6 !testme runs RED so far; 229/230 root cause (drone sqlite log extraction):
/pg_backup.sh: No such file or directory — the harness chaos-deployed a tree WITHOUT the config,
suspected shared-checkout race (my repro scripts flipped ~/.abra/recipes/immich during the builds).
Queue findings (operator: "queue is getting blocked"): build 231 (plausible !testme) was doomed —
cc-ci main lacks assistant3's UPGRADE_BASE_VERSION=3.0.1 pin (branch test/plausible-upgrade-base-3.0.1;
its push build 233 failed LINT, not content); canceled 231+232 (232=immich; drone cancel LEAKED the
python child — killed by hand; its immi-ad3e33 orphan reaped manually). Push-build lint has been RED
since ≥ build 209 (repo-wide format drift + shellcheck + statix + 17 ruff errors) — nothing can land
green. Parallel-CI unsafety confirmed in .drone.yml on main: CCCI_JANITOR_MAX_AGE=0 (a starting
build reaps ANY in-flight run app), concurrency.limit=1 vs DRONE_RUNNER_CAPACITY=2 (live since 18:35),
shared HOME=/root + shared ~/.abra/recipes/ checkout — all annotated "safe because capacity=1".
Plan in flight: (1) lint-green commit (subagent on /home/loops/work/cc-ci-fix); (2) concurrency
safety: per-recipe flock in run_recipe_ci.py + janitor pidfile/age scoping + concurrency.limit=2 +
comment updates; (3) merge plausible pin; (4) re-!testme immich alone → green; (5) plausible green is
assistant3's lane (its verify: upgrade/backup tiers PASSED, restore post-hook failed gzip: /postgres.dump.gz: No such file — pre-hook never produced the dump in the snapshot) — coordinating via
tmux, not duplicating. Siblings: cc-ci-assistant3 (plausible), cc-ci-upgrader (told to review plausible
failure). Memories moved INTO this repo at memory/ (542ed0a) — auto-memory path is a symlink now.
Event 2026-06-09 ~21:10 — Concurrent-CI fixes LANDED on cc-ci main (build 236 green)
Orchestrator (this session) landed the queue/concurrency work on cc-ci main, first green push build
since the lint drift began (~209): 9a77725 style: repo-wide lint pass (118 files — ruff format/fix,
shfmt, nixpkgs-fmt/statix/deadnix, yamllint, lasuite-docs quoting; lint PASS + 138 unit tests);
c0df77d fix(harness): concurrent-run safety — per-recipe flock /run/lock/cc-ci-recipe-<recipe>.lock
taken in main() BEFORE fetch_recipe (kernel auto-release, no stale-lock mode; same-recipe runs
serialise, different recipes parallel) + active-run registry /run/cc-ci-active/<domain> pidfiles with
three-way janitor (alive=never reap / dead=reap now / unknown=age fallback 2h; pid-reuse guarded via
/proc cmdline match on run_recipe_ci) + .drone.yml: concurrency.limit 1→2, CCCI_JANITOR_MAX_AGE=0
REMOVED, stale capacity=1 comments rewritten; c828f6c merge of assistant3's
test/plausible-upgrade-base-3.0.1 pin (UPGRADE_BASE_VERSION=3.0.1+v2.0.0). Branch push build 234 green,
main push build 236 green. /root/builder-clone fast-forwarded to c828f6c. assistant3 notified via tmux
(plausible !testme unblocked; restore-hook gzip failure is their lane). Next: immich PR #2 !testme
re-triggered alone (checkout parked clean at a92b28d) — polling to verdict.
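A minimal sketch of the two mechanisms described in this event: the per-recipe flock and the three-way janitor decision. The lock path pattern and the run_recipe_ci cmdline match come from the entry above; the function names `recipe_lock` and `janitor_verdict`, their parameters, and the treatment of a reused pid as "dead" are assumptions, not the code landed in c0df77d.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def recipe_lock(recipe, lock_dir="/run/lock"):
    """Hold an exclusive flock for one recipe; other recipes are unaffected."""
    path = os.path.join(lock_dir, f"cc-ci-recipe-{recipe}.lock")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks while another run holds it
        yield path
    finally:
        # Closing the descriptor drops the lock; the kernel does the same if
        # the process dies, so there is no stale-lock state to clean up.
        os.close(fd)


def janitor_verdict(pid, age_s, max_age_s=2 * 3600, proc_root="/proc"):
    """Three-way decision for one registered run: 'keep' or 'reap'."""
    if pid is None:
        # Unknown owner: fall back to the age threshold.
        return "reap" if age_s > max_age_s else "keep"
    try:
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as fh:
            cmdline = fh.read()
    except FileNotFoundError:
        return "reap"  # process is dead
    except OSError:
        return "reap" if age_s > max_age_s else "keep"  # cannot tell
    if b"run_recipe_ci" in cmdline:
        return "keep"  # alive and still the harness: never reap
    return "reap"  # pid was reused by an unrelated process
```

Taking the lock before any checkout work is what serialises same-recipe runs, while runs for different recipes use different lock files and proceed in parallel.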
Event 2026-06-09 ~23:20 — Two harness convergence fixes landed; immich on run 3 (build 245)
Immich !testme run 2 (build 238) RED but PROGRESS: install/upgrade/custom PASS (checkout race gone),
backup CRASHED — backupbot exec'd the db pre-hook into a container swarm killed seconds earlier: the
chaos redeploy changes the db image (pgvecto.rs→vectorchord pin) and registers a stop-first rolling
update that hadn't STARTED when the N/N convergence check passed (old task still 1/1). → 68ef0f8
fix(harness): services_converged() also requires swarm UpdateStatus settled + bounded settle-wait in
backup_app(). Run 3 (build 241) then HUNG 22min in the restore tier: the app service's UpdateStatus
was 'paused' (swarm default update-failure-action after one task flicker during restore) — a state
that persists FOREVER; my check treated it as in-flight. Killed 241 (cancel leaks the python child —
killed by hand; immi-ad3e33 undeployed+rm'd, registry entry cleared, zero leakage verified). →
e6d55b5 fix(harness): only 'updating'/'rollback_started' block convergence; 'paused' + N/N is
settled (health asserts still gate). Both branch builds green (239, 243); main ff'd; builder-clone
updated. Plausible build 237 (assistant3, head 4cab6b5): install/upgrade/backup PASS — their
gzip/dump-path fix WORKS, marker restore test PASSED; remaining: app 502s after restore
(test_restore_healthy + custom tier) + restore hook needs pg_restore --if-exists; diagnosed +
relayed via tmux. Concurrency machinery observed working live: parallel immich+plausible runs held
per-recipe locks, registered pidfiles, plausible's teardown unregistered cleanly. Immich run 4 =
build 245 (custom, running) with both fixes live — monitor armed.
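The convergence rule that came out of the two fixes above can be written down compactly. This is a sketch under assumptions: the function `services_converged` is named in the entry, but the per-service dict shape (`running`, `desired`, `update_state`) and the helper `service_converged` are hypothetical.

```python
# Only these swarm UpdateStatus states mean an update is still moving.
# 'paused' persists indefinitely, so it must not block convergence.
IN_FLIGHT_STATES = {"updating", "rollback_started"}


def service_converged(running, desired, update_state=None):
    """True when replicas are N/N and no rolling update is in flight."""
    if running != desired:
        return False
    return update_state not in IN_FLIGHT_STATES


def services_converged(services):
    """Apply the rule to every service of a stack."""
    return all(
        service_converged(s["running"], s["desired"], s.get("update_state"))
        for s in services
    )
```

The first fix corresponds to the `update_state` check (a pending update with the old task still 1/1 no longer passes); the second corresponds to leaving 'paused' out of the in-flight set so a paused update with N/N replicas counts as settled, with the health asserts still gating the result.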
Event 2026-06-10 ~00:05 — IMMICH PR #2 GREEN (build 245, level=4); cc-ci PR #9 merged; plausible go-ahead
Immich PR #2 (head a92b28d) verdict: GREEN. Build 245, ALL tiers pass (install/upgrade/backup/restore/custom), level=4, deploy-count 1/1, PR-side VERDICT=GREEN via testme poller, zero leakage (no stacks/volumes, active-run registry empty). The 1.7.0+v2.7.5 upgrade + no-DROP pg backup/restore is fully CI-verified; merge decision is the operator's. Run history: 6 RED pre-handover → 238 RED (backup 409, convergence gap) → 241 hung (paused-update flag) → 245 GREEN.
cc-ci PR #9 merged (157d06d, push build 246 green): assistant3's one-flag test fix (psql -q in plausible _register_site; command tags polluted the compared output; assertion unchanged, reviewed: no gate weakening). Caught an UNSUBMITTED "PR #9 is merged, go ahead" sitting in assistant3's tmux prompt BEFORE it was true: verified state=open first, merged it, synced builder-clone, then submitted the go-ahead.
Plausible !testme on PR #3 (head 270c840, incl. their 502-after-restore + --if-exists fixes) is now assistant3's trigger; cc-ci main has every prerequisite landed.
Open: plausible verdict (#11), final session wrap.
Session 2026-06-09/10 — Orchestrator: concurrent-CI fixed, immich + plausible BOTH GREEN
Mission complete. Operator's brief: "work on these fixes to concurrent runs, then drive immich and plausible to green." Final state:
- immich PR #2 (head a92b28d): GREEN — build 245, all tiers, level=4 (1.7.0+v2.7.5, vectorchord db pin, no-DROP pg backup/restore).
- plausible PR #3 (head 270c840): GREEN — build 247 (assistant3's lane; their gzip dump-path, 502-after-restore and --if-exists fixes + my merged pin/test-fix/harness prerequisites), all tiers, level=4.
- Both verified zero-leakage (no stacks/volumes, /run/cc-ci-active empty). Merge decisions left to the operator per the standing never-merge-recipe-PRs rule.
- cc-ci main is healthy: lint gate green since 9a77725; concurrent runs safe (c0df77d flock + registry; 68ef0f8 + e6d55b5 convergence); plausible pin c828f6c; PR #9 psql -q merge 157d06d. Builds 234–247: every push build green; parallel custom runs exercised the locking live.
- Memories added: swarm-updatestatus-convergence-gotchas (+ earlier session set). Open items for a future session: drone cancel still leaks the python child (kill by hand; maybe trap/pgroup fix in the runner step); recipe-mirrors org still private (PR-STATUS column dark — operator flip); operator to review/merge the two green recipe PRs.
Session 2026-06-10 — Orchestrator: concurrency restructure DONE (phase conc)
Operator approved the simplification plan (concurrency-restructure-full-plan.md); ran it through
the Builder (fable) / Adversary (opus, via new ADV_MODEL launcher support e0c9f23) loops + watchdog.
Phase conc ## DONE 08:56 UTC, M1+M2 both Adversary-PASS, no open veto. cc-ci main now: per-app-domain
kernel flock (registry/pidfiles/recipe-flock DELETED), flock-probe janitor, per-run ABRA_DIR
(servers/ symlinked), PDEATHSIG+setsid+60-min-deadline lifetime chain, single capacity knob,
tests/concurrency suite (21 real-kernel cases, outside the unit gate), docs/concurrency.md rewritten.
Live verification earned two real catches: wrapper exit-code poisoning under set -e (e1c4198,
false-RED, adversary 4-path matrix proved no false-GREEN) and CONC-A1 (domain-keyed deploy-count
file in shared /tmp raced outside the lock — pre-existing, masked by the old recipe flock; fixed
per-run + mutation-proven test; VETO lifted after 290/291 both green). Also fixed this session:
orchestrator identity — watchdog was supervising the stale June-1 session; renamed mine to
cc-ci-orchestrator-vm, repointed .orchestrator-session-id (old → .bak). Loops survived a
limit-stall window 07:51–08:03 via watchdog kill/reboot/nudge — resilience layer worked as designed.
Open for operator: review/merge immich PR#2 + plausible PR#3 (still green, unmerged); stale
session cc-ci-orchestrator-stale can be killed; recipe-mirrors org still private.
2026-06-11 ~01:15 — phase shot queued; limit-system night watch
- Operator requested a follow-on phase: audit + repair the per-recipe CI screenshot
(badge/card) across ALL enrolled recipes. Plan written:
cc-ci-plan/plan-phase-shot-screenshots.md, queued AFTER rcust in .phases-spec
(rcust;shot) — watchdog auto-advances on rcust `## DONE`.
- Pre-audit evidence (last ~120 runs): plausible screenshot=null on every run; immich/lasuite-meet/cryptpad (+flaky n8n) produce byte-identical ~4.8KB PNGs = suspected blank SPA frames; ghost/mattermost/discourse/etc healthy.
- Hourly wakes tonight: TEMPORARY line added to ai-progress-monitor-prompt.txt — verify the new limit-wait system (d6e1a70/2e1ab8d) on each wake; remove the line 06-11 daytime.
- Orchestrator renamed cc-ci-orchestrator (was -vm); stale Jun01 squatter killed; watchdog bounced twice tonight (limit patch, then hourly-wake-during-limit fallback).
2026-06-11 ~01:35 — phase lvl5 queued after shot
- Operator: extend the level ladder — L5 = `abra recipe lint` passes on the tested ref (PR head), after the existing four rungs. Plan: cc-ci-plan/plan-phase-lvl5-lint-rung.md.
- Key design hazards captured: abra.py:109-114 (pinned deploy lints + FATAs R014 from the CI mirror-origin repoint; chaos/PR path skips lint today) — rung must lint recipe content, not mirror plumbing; verdict-neutral; conservative capping; old artifacts render.
- .phases-spec now rcust;shot;lvl5 (idx=1, shot active); watchdog bounce to load it.
2026-06-11 ~01:50 — lvl5 plan amended: de-capping folded in (operator decision)
- Operator: remove the "capping" notion entirely. Explicit Q&A settled semantics: level = highest PASSED rung where everything below is pass-or-N/A — N/A rungs are skipped (no longer stop the climb), a real FAIL still blocks. cap/cap_reason/capped deleted from code+schema+card+dashboard+docs; rung table is the sole detail carrier.
- Deliberate override of Phase-3 "N/A caps" stance — to be recorded in DECISIONS.md by the loops. Before/after level table for all recipes required so the Adversary can attribute every level shift to the rule change.
2026-06-11 ~02:00 — lvl5 refinement: intentional vs unintentional N/A
- Operator: an N/A rung only skips if it's an INTENTIONAL skip (declared/structural: not backup-capable, no upgrade target). UNINTENTIONAL N/A (infra error, missing tool, aborted tier = unverified) blocks — the level cannot be above an unverified rung. Statuses now {pass, fail, skip, unver}; unclassifiable N/A defaults to unver.
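The level rule settled in the two entries above can be sketched in a few lines. This is a minimal illustration and not the harness code: the function name `compute_level` and the list-of-strings input are assumptions made for the example. Only the four statuses {pass, fail, skip, unver} and the blocking rule come from the journal.

```python
from typing import Sequence

# Statuses named in the journal: pass, fail, skip (intentional N/A),
# unver (unintentional N/A, i.e. unverified).
CLIMBABLE = {"pass", "skip"}


def compute_level(rungs: Sequence[str]) -> int:
    """Return the highest passed rung (1-indexed) with nothing blocking below it.

    rungs[0] is rung 1. Any status outside CLIMBABLE (fail, unver, or an
    unclassifiable value) stops the climb, matching "defaults to unver".
    """
    level = 0
    for index, status in enumerate(rungs, start=1):
        if status not in CLIMBABLE:
            break  # fail / unver / unknown: nothing above this rung counts
        if status == "pass":
            level = index  # a skipped rung is climbed over but never *is* the level
    return level


print(compute_level(["pass", "pass", "skip", "pass", "pass"]))   # intentional skip
print(compute_level(["pass", "pass", "unver", "pass", "pass"]))  # unverified rung
```

Under these assumptions the first call reaches rung 5 because the intentional skip is climbed over, while the second stops at rung 2 because the level cannot sit above an unverified rung.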
2026-06-11 ~04:50 — night-watch findings: limit system held core invariant; 2 bugs fixed
- ~01:49-01:51 all three sessions hit a MONTHLY SPEND limit ("You've hit your monthly spend limit. /usage-credits to adjust") — no reset time exists, so "unparsable → flat 5-min probe" was CORRECT behavior. Zero kill+reboots during the limit window (the old system's churn bug is confirmed gone — last stall reboots were 23:26-23:50, old code).
- Bug 1: probe text contained "usage limit" → matched LIMIT_RE → self-sustaining window. Reworded to "quota window" (must never match LIMIT_RE).
- Bug 2: probe dedupe checked the whole 40-line pane → once the submitted probe scrolled into the conversation, all further probes were suppressed (builder/adv stuck at nudges=1; orchestrator probes degraded to hourly, riding the wake's scroll). Dedupe now checks only the bottom 8 lines (the input area).
- shot phase: M1 PASSed (ae10b55, 19/19 matrix) + builder landed the harness capture fix (ce50f64) BEFORE the limit hit. Loops resume via watchdog probe after this bounce.
- lvl5 plan: operator addition — top badge (card corner/dashboard pill/SVG) shows ONLY the level, no capping info; inline rung table keeps intentional-skip detail.
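The two probe bugs above come down to two small checks, sketched here under stated assumptions. The real `LIMIT_RE` and probe wording are not shown in the journal, so the pattern, `PROBE_TEXT`, and both function names are hypothetical. The 8-line input area and the "probe text must never match LIMIT_RE" rule are from the entry.

```python
import re

# Hypothetical stand-in for the watchdog's limit detector.
LIMIT_RE = re.compile(r"usage limit|spend limit", re.IGNORECASE)

# Hypothetical probe wording; says "quota window" so it cannot self-match.
PROBE_TEXT = "Probe: has the quota window reopened? Reply with one line."

INPUT_AREA_LINES = 8  # bottom of the pane, where an unsent probe would sit


def probe_is_safe(probe: str) -> bool:
    """Bug 1 guard: a probe that matches LIMIT_RE keeps the limit window open."""
    return LIMIT_RE.search(probe) is None


def probe_already_pending(pane_text: str, probe: str = PROBE_TEXT) -> bool:
    """Bug 2 guard: dedupe against the input area only, not the scrollback."""
    bottom = pane_text.splitlines()[-INPUT_AREA_LINES:]
    return any(probe in line for line in bottom)


# A probe that scrolled into the conversation must not suppress the next one.
scrolled = "\n".join([PROBE_TEXT] + ["agent output"] * 39)
print(probe_already_pending(scrolled))
```

Checking `probe_is_safe(PROBE_TEXT)` at startup would be one way to catch a future rewording that reintroduces the self-sustaining window.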
2026-06-11 ~11:35 — phase bsky queued after lvl5
- Operator: fix whatever is wrong with the bluesky-pds recipe, then its screenshot. Plan: cc-ci-plan/plan-phase-bsky-fix.md. Known: upstream image breakage under the pinned tag (Cannot find module /app/index.js, Node v24), proven harness/ref-neutral in rcust M2; DEFERRED carries the re-pin follow-up. Deliverable = green recipe-mirror PR (operator merges) + verified screenshot on the PR runs; DEFERRED entries closed.
- .phases-spec now rcust;shot;lvl5;bsky (idx still lvl5); watchdog bounce to load.
2026-06-11 ~12:05 — four more phases queued + DEFERRED housekeeping (operator)
- Queue now: bsky (in progress, idx 3) → dstamp (discourse abra-stamp drift dig) → mailu (backupbot labels recipe PR) → kuma (uptime-kuma create-a-monitor test) → drone (gitea-dep enrollment; P0 host /etc/timezone deploy is MINE — nixos-rebuild switch on cc-ci host with committed 3bde76f, do it before/when phase drone starts or when STATUS-drone flags BLOCKED).
- DEFERRED.md housekept (cc-ci 823023a): closed plausible-enrollment, discourse-bitnami, immich-pgdump (PR#2 merge-pending), plausible-Q4.7b (PR#3 merge-pending); re-entered mailu/drone/kuma/dstamp/bsky with plan pointers. Operator confirmed immich PR#2 + plausible PR#3 are good — merges pending on operator.
- ORCHESTRATOR ACTION ITEM: deploy /etc/timezone host fix for phase drone P0.
2026-06-11 ~18:20 — FALSE phase completion: mailu skipped, fixed + re-queued
- The mailu builder (sonnet) scaffolded STATUS-mailu.md with a PLACEHOLDER "## DONE" section (line 47, body "Not yet. Written here only when all DoD items have Adversary PASS"). phase_done()'s startswith("## DONE") matched it → watchdog auto-advanced mailu→kuma at 18:01 with ZERO mailu work done (no recipe PR, no claim, no review).
- Fix (cc-ci-orch `4275adc`): phase_done now ignores a "## DONE" whose first non-empty body line is a placeholder/negation (Not yet / pending / TBD / when all / <...>). Verified against all shipped STATUS files: real DONEs still detected, mailu placeholder rejected.
- kuma (running, real work) had NO STATUS file yet → no second misfire; the fix protects it + drone going forward.
- mailu RE-QUEUED after kuma: spec now rcust;shot;lvl5;bsky;dstamp;mailu(skipped);kuma;mailu(redo);drone — 9 phases, idx6=kuma. Watchdog bounced to load fix+spec. Builder correctly sonnet at kuma (per-phase override still working).
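The placeholder guard described in this entry might look roughly like the sketch below. It is not the shipped `phase_done` from `4275adc`: the regex, the start-of-line matching, and the handling of an empty section are assumptions. Only the `startswith("## DONE")` trigger and the list of placeholder phrases come from the journal.

```python
import re

# Placeholder/negation openers listed in the journal; anchoring them at the
# start of the line is an assumption of this sketch.
PLACEHOLDER_RE = re.compile(r"^(not yet|pending|tbd|when all|<[^>]*>)", re.IGNORECASE)


def phase_done(status_text: str) -> bool:
    """True if the text has a '## DONE' section whose body is not a placeholder."""
    lines = status_text.splitlines()
    for index, line in enumerate(lines):
        if not line.startswith("## DONE"):
            continue
        body = ""
        for candidate in lines[index + 1:]:
            stripped = candidate.strip()
            if stripped.startswith("#"):
                break  # next heading reached: this section has no body
            if stripped:
                body = stripped  # first non-empty body line decides
                break
        if PLACEHOLDER_RE.match(body):
            continue  # scaffolded section, keep looking for a real one
        return True
    return False


scaffold = "## DONE\nNot yet. Written here only when all DoD items have Adversary PASS\n"
print(phase_done(scaffold))
```

With this shape the mailu scaffold is rejected, while a header followed by real evidence (for example an Adversary-PASS line) still counts as done.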
2026-06-11 ~20:50 — weekly upgrade: skip tonight, queue after phases (operator)
- Operator: cancel tonight's weekly /upgrade-all cron, run it once after the current phase queue (…mailu→drone) finishes instead.
- cc-ci-upgrade-all.timer STOPPED (can't `disable` — /etc/systemd read-only). Stamp forwarded to 06-12 03:00 so a reboot/rebuild tonight schedules NEXT run 06-19, NOT a catch-up of tonight's 02:00 slot. GOTCHA: `systemctl start` on this timer fires the service IMMEDIATELY (Persistent=true) — it did, launched an unwanted upgrader run at 20:47 which I killed. DO NOT `start` the timer to re-arm; let a host reboot/nixos-rebuild reactivate it (the drone P0 rebuild will), forward-stamp prevents catch-up.
- Post-phase run wired: watchdog hook (launch.py `3fa3178`) launches launch-upgrader.py start when the LAST phase hits ## DONE, gated by flag /srv/cc-ci/.cc-ci-logs/.run-upgrade-on-complete (set now, consumed once). Upgrader inherits sonnet. So when `drone` completes, /upgrade-all auto-starts.
2026-06-11 ~20:58 — coordination files → machine-docs/ + memory committed (operator)
- Operator: recent phases wrote STATUS/BACKLOG/REVIEW/JOURNAL to the cc-ci repo ROOT.
Root cause: build_kickoff + plan.md tree used bare filenames (older phases + INBOX/
DECISIONS/DEFERRED already used machine-docs/). Fixed everywhere: build_kickoff emits
machine-docs/ paths + explicit FILE-LOCATION RULE; prompts/builder+adversary, plan.md
(tree+seed), loops AGENTS.md, orchestrator AGENTS.md all updated (cc-ci-orch `e144354`).
- Moved 32 root files → machine-docs/ in cc-ci repo (85a7813, all git-detected renames, no content change). Both clones synced; loops restarted with new kickoff (verified kickoff → machine-docs/STATUS-mailu.md); watchdog bounced. resolve_state/INBOX already read machine-docs/ first so phase_done unaffected.
- Memory notes committed+pushed (cc-ci-orch `c33b21f`) per AGENTS.md 'memory lives in repo'.
2026-06-11 ~22:05 — phase cfold queued after drone (+ recipe CI sweep)
- Operator: collapse custom-test folders functional/ + playwright/ → one custom/ folder (the split is purely organizational — verified: discovery.py globs both with no branching, same tier/rung/fixtures/failure semantics). Plan: plan-phase-cfold-custom-folder.md. M2 = full !testme recipe sweep proving no recipe's custom tests silently dropped + levels unchanged (the operator-required sweep).
- .phases-spec now …drone;cfold (10 phases). cfold is the new LAST phase, so the .run-upgrade-on-complete hook fires /upgrade-all AFTER cfold — correct order (folder change swept-green before the weekly upgrade runs). Watchdog bounced to load it.
2026-06-11 ~22:55 — drone DONE → upgrade fired; cfold PAUSED to serialize
- drone completed 22:31 → watchdog hit sequence-complete, fired the queued /upgrade-all (cc-ci-upgrader, weekly run) per the operator's earlier request. Upgrade running now.
- I'd queued cfold ~22:52; the bounced watchdog auto-advanced into cfold, making it CONCURRENT with the upgrade. They conflict (both real-CI; cfold edits the harness the upgrade's !testme uses; upgrade version-bumps confound cfold's baseline). PAUSED cfold: stopped its loops + the watchdog; phase-idx preserved at 9. Upgrade left running.
- RESUME cfold (restart watchdog → phase-idx 9) once /upgrade-all is confirmed DONE. See memory cfold-paused-pending-upgrade. Will action on supervision wakes.
2026-06-12 ~00:30 — unstuck the weekly upgrade (wedged on discourse)
- /upgrade-all froze ~2h on discourse: its iteration-2 !testme chaos deploy disc-50cc8a had app+sidekiq stuck 0/1 in Swarm 'New' state 24min (db/redis up, box 5.8Gi free) — transient scheduler wedge, NOT a recipe defect (discourse L5 @build #450 ~5h prior). drone build waited on it; testme-on-pr.sh blocked polling; agent frozen.
- Fix: docker stack rm disc-50cc8a (freed box+build); Esc-interrupted the upgrader; nudged it with the diagnosis → "one clean discourse retry then move on regardless; comment+skip if it re-wedges". Agent recovered, now checking build state before retry. Rest of queue (ghost/immich/keycloak/lasuite-*/mailu/matrix-synapse) still ahead. cfold still paused.
2026-06-12 ~03:30 — ROOT CAUSE: proxy overlay VIP exhaustion (not "tired box")
- Empirically verified from dockerd logs: the shared `proxy` overlay (10.0.1.0/24 = 254 VIPs, joined by every recipe deploy) exhausted its IP pool. Endpoint-GC race on concurrent stack rm (key modified/network proxy remove failed, 45×) leaked IPs over 11 days of dockerd uptime → 13× `could not find an available IP while allocating VIP` from 22:53 → tasks stuck in Swarm `New` → discourse + ghost deploys wedged (looked like recipe failures; were infra). 02:50 docker restart rebuilt the allocator → cleared.
- FIXES: (a) upgrade-all Step 0 now prunes leaked overlays + restarts docker if VIP-failures are in the journal (per-run safety net, committed). (b) DURABLE: enlarge proxy to /16 in swarm.nix — runbook plan-proxy-vip-exhaustion-fix.md + memory proxy-vip-exhaustion-runbook, orchestrator to execute in a maintenance window AFTER the current upgrade (recreating proxy disrupts routing). (c) ghost PR debug: plan-ghostpr-debug-fix.md + memory ghost-pr-debug.
- NOT switching the upgrade to sequential (operator: concurrency is fine; the leak is the issue). Duplicate ghost subagent from the interrupt churn — told the upgrader to TaskStop one.
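For scale, the address budget of the current /24 versus the planned /16 can be checked with the standard-library `ipaddress` module. This is only arithmetic: the /16 prefix `10.0.0.0/16` is a hypothetical placeholder (the journal does not say which range swarm.nix will use), and the engine reserves some addresses for itself, so the numbers are upper bounds.

```python
import ipaddress


def usable_addresses(cidr: str) -> int:
    """Host addresses in a subnet, excluding the network and broadcast addresses."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - 2


current = usable_addresses("10.0.1.0/24")   # the proxy overlay named in the entry
enlarged = usable_addresses("10.0.0.0/16")  # hypothetical /16, for comparison only

print(f"/24 pool: {current} addresses")
print(f"/16 pool: {enlarged} addresses ({enlarged // current}x)")
```

The larger pool buys headroom against slow leaks, but it does not remove the endpoint-GC leak itself, which is why the per-run prune in Step 0 stays as a safety net.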
2026-06-12 15:43 UTC — opencode web restored + OpenAI launcher added
- Re-enabled `opencode-web.service` in `nix/hosts/cc-ci-orchestrator-hetzner/configuration.nix` (`wantedBy = [ "multi-user.target" ]`) and persisted the broader PATH that the old runtime override had been providing. `nixos-rebuild switch --flake .#cc-ci-orchestrator-hetzner` brought the service back and it now passes `curl http://127.0.0.1:4096/global/health`.
- Added `cc-ci-plan/launch-opencode.sh` as a separate entrypoint that keeps the default Claude `launch-orchestrator.sh` untouched. It rebuilds the host flake if `opencode-web` is not enabled, starts the service if needed, then launches the orchestrator with `LOOP_BACKEND=opencode`, `LOOP_MODEL=openai/gpt-5.4`, and default session name `cc-ci-orchestrator-oc`.
- Patched `launch-orchestrator.py` so opencode launches can force the requested model even though `opencode attach` has no `--model` flag: it now injects `OPENCODE_CONFIG_CONTENT` for the session. Verified live: `cc-ci-orchestrator-oc` tmux session running on `backend=opencode model=openai/gpt-5.4`, visible through the shared web server.
2026-06-12 ~15:55 — OpenCode GPT-5.4 loops resumed; durable proxy phases queued
- Operator switched orchestration from Claude to OpenCode/GPT-5.4 and requested the remaining work be made explicit: durable Swarm proxy fix, post-proxy verification, and ghost re-evaluation.
- Added phase plans: `plan-phase-pvfix-swarm-proxy.md`, `plan-phase-pvcheck-post-proxy-verification.md`, `plan-phase-ghost-reeval.md`.
- Persisted phase queue is now: `rcust;shot;lvl5;bsky;dstamp;mailu;kuma;mailu;drone;cfold;pvfix;pvcheck;ghost`, with idx still `9` (cfold) so the loops finish the already-started custom-folder phase before the proxy and ghost follow-ups.
- Replaced stale Claude `cc-ci-orchestrator` tmux session (was parked on a weekly-limit banner) with an OpenCode session using `openai/gpt-5.4`; builder/adversary restarted with `LOOP_BACKEND=opencode LOOP_MODEL=openai/gpt-5.4 ADV_MODEL=openai/gpt-5.4 RESUME_PHASE=1`.
- Watchdog bug fixed in `launch.py`: it now treats only the configured `ORCH_SESSION` tmux session as orchestrator liveness and restarts it if the pane command does not match the expected backend. This prevents stale Claude one-shot/report sessions from masking a missing OpenCode orchestrator.
- Verified live tmux mapping: `cc-ci-builder`, `cc-ci-adv`, and `cc-ci-orchestrator` are all `opencode`; `cc-ci-watchdog` is running. The watchdog will hourly-wake `cc-ci-orchestrator` via the existing `ORCH_WAKE_INTERVAL=3600` path and will apply the existing limit-window nudge/restart handling.
2026-06-12 ~16:05 — queued GPT-5.5 post-cfold no-loss review
- Operator requested one additional pass at the end of `cfold`: GPT-5.5 reviews the cfold implementation and confirms no custom test/coverage/assertion/fixture behavior was lost.
- Added `plan-phase-cf55-gpt55-cfold-review.md` and inserted `cf55` immediately after `cfold`, before `pvfix`. Persisted queue is now `rcust;shot;lvl5;bsky;dstamp;mailu;kuma;mailu;drone;cfold;cf55;pvfix;pvcheck;ghost`.
- Set per-phase model overrides in `.cc-ci-logs`: `.loop-model-cf55=openai/gpt-5.5` and `.loop-model-adv-cf55=openai/gpt-5.5`. Current `cfold` loops stay on OpenCode GPT-5.4; when cfold writes real `## DONE`, watchdog should auto-transition to `cf55` and start both loops on GPT-5.5.
- Bounced only `cc-ci-watchdog` so it loaded the 14-phase queue without interrupting the active builder and adversary sessions. Verified `launch.sh status`: `cfold [10/14]`, builder/adv/watchdog RUNNING; watched sessions are still `opencode` for builder/adv/orchestrator.
2026-06-12 ~21:45 — OpenCode watchdog idle detection fixed; stalled cfold loops recovered
- Operator correctly observed the OpenCode loops had stalled after cfold M1. Root cause: watchdog activity detection reused the Claude-oriented `ACTIVE_RE`, whose tokens (`▣`, `Build ·`) also appear in OpenCode's static completed-turn footer. Finished OpenCode turns therefore looked active forever, so the idle-stall branch never fired.
- Fixed `launch.py` with an OpenCode-specific classifier: Claude still uses the old `ACTIVE_RE` path; only `BACKEND=opencode` uses `opencode_pane_active()`, which checks the bottom prompt/status area plus recent log mtime and ignores the static completed footer.
- Added `launch.py selftest-opencode-activity` regression test. It proves: completed OpenCode footer = idle, live `esc interrupt` footer = active, and a limit banner plus completed footer is not active. `python3 -m py_compile cc-ci-plan/launch.py` and the selftest both passed.
- Restarted `cc-ci-watchdog` on the patched code. Builder was already stuck in a stale input mode after a manual nudge landed as a shell command, so restarted only `cc-ci-builder` via `start_agent`; it came back on OpenCode GPT-5.4 and is actively planning the M2 sweep. Adversary accepted its nudge and wrote `WAITING-UNTIL: 2026-06-12T21:55:28Z` while awaiting Builder's formal M2 claim.
2026-06-13 ~05:15 — queued cf48 (Opus 4.8 cfold review, second model)
- Operator: add a second independent post-cfold coverage-loss review by Opus 4.8 (cross-check of the cf55 GPT-5.5 review). Plan: plan-phase-cf48-opus-cfold-review.md — same 7 review categories as cf55, independent verdict + a cf55-vs-cf48 agreement note.
- Inserted cf48 after cf55: queue = ...cfold;cf55;cf48;pvfix;pvcheck;ghost (15 phases, idx 10=cf55).
- Per-phase model overrides: .loop-model-cf48 = .loop-model-adv-cf48 = claude-opus-4-8 (claude backend — current .loop-backend=claude, so it'll run on opus; orchestrator must keep backend=claude when cf48 starts since per-phase backend isn't overridable).
- Also this session: relaunched after a session restart (NOT a host reboot, 13d uptime); loops+watchdog were stopped in a claude-sonnet limit window on cf55 → restarted via RESUME_PHASE=1 launch.py start.
2026-06-13 ~05:27 — launch system unified (agents.toml + agents.py); cutover done
- Replaced the 5 bespoke launchers + ~15 dotfiles with ONE config (cc-ci-plan/agents.toml) + ONE driver (cc-ci-plan/agents.py: up/down/status/watchdog/logs/phase). Design + behavior mapping in cc-ci-plan/plan-unified-launch.md.
- launch.py and launch-orchestrator.sh are now COMPATIBILITY SHIMS → agents.py (originals at `*.orig`). So `launch.py start|status|stop|watchdog|logs`, the systemd boot chain (cc-ci-loops-start → launch.sh → launch.py start), and your startup routine all drive the new system transparently — no behavior change for you.
- Config is the single source of truth; the watchdog re-reads it every tick (no more env-vs-file drift, which had caused the opencode-revert bug earlier today). Backend/model/prompt/watch policy per agent live in agents.toml. To change a model or backend: edit agents.toml.
- State (phase index, resume ids, limit windows) now under .cc-ci-logs/state/. Phase machine unchanged; de-duped the doubled `mailu` entry (cf55 was idx 10 → now idx 9; current phase pvfix = idx 10). All agents respawned via the new system and confirmed working on pvfix.
- The new watchdog tmux session (cc-ci-watchdog) runs `agents.py watchdog`. Same heal/limit/stall/handoff/phase-advance/wake behavior, lifted verbatim, now config-driven.
2026-06-13 ~05:30 — startup: unified agents.toml live; re-added dropped cf48
- Session relaunch (NOT a host reboot, 13d uptime). Supervision UP: unified `agents.py watchdog` (--config agents.toml) + builder/adv on claude-sonnet + orchestrator on claude-opus-4-8. Phase pvfix [proxy /16 fix] in progress; cf55 confirmed ## DONE (advance was legit).
- The launch-system unification (agents.toml + agents.py) was deployed in the gap. It was transcribed from .phases-spec BEFORE I added cf48 (05:15), so cf48 (the operator's opus cfold review) was DROPPED. Re-added it to agents.toml — appended AFTER ghost (the system is already past cf55/on pvfix, so inserting before pvfix would shift the live phase index). agents.py re-reads config every tick, so no watchdog bounce needed. cf48 runs as the last phase, opus 4.8, claude backend.
2026-06-13 ~12:40 — queued pxgate (deploy-proxy health-gate circular-dep D8 fix)
- Operator: fix the A1 circular dependency (deploy-proxy health-gates on ci.commoninternet.net = dashboard, but dashboard is After=deploy-proxy → fresh-boot deadlock → proxy fails at 900s).
- Plan plan-phase-pxgate-proxy-healthgate.md: re-target the traefik health gate to a
dashboard-independent traefik-self endpoint (/ping or api/version), keep rollback semantics;
M1 = fix + controlled repro (loops), M2 = from-scratch cold-boot proof (orchestrator owns the
live nixos-rebuild). Appended pxgate to agents.toml (idx 14); cleared SEQUENCE-COMPLETE +
`phase set 14` + started loops → resumes the build for this one phase.
2026-06-13 ~13:50 — pxgate M2: orchestrator nixos-rebuilt the cc-ci host (operator-authorized)
- Operator OK'd the live rebuild (no CI running). Deployed the pxgate fix (deploy-proxy health probe → traefik.ci.commoninternet.net/api/version) via nixos-rebuild switch --flake .#cc-ci.
- Debugged 3 issues to get the build green: (1) git "not owned by current user" → chown root; (2) sops `secrets/secrets.yaml does not exist` → copied operator-held secret from /etc/cc-ci/secrets/; (3) git flake excludes untracked files → dropped .git (plain path flake). Procedure saved as memory cc-ci-host-rebuild-procedure (the host had NO self-service rebuild path; last rebuild 05-31).
- Result: deploy-proxy active on the dashboard-independent probe (verified in the running nix-store warm_reconcile.py), all 9 infra services 1/1, no failed units, endpoints 200. Cycle broken by construction. Only proxy/keycloak/sweep units rebuilt (nixpkgs pinned). Pushed BUILDER-INBOX with M2 evidence to unblock the loops. From-scratch reboot proof offered to operator, not yet done.
2026-06-13 ~13:52 — pxgate cold-boot proof PASSED (real reboot)
- Loops already wrote pxgate ## DONE (M1+M2 PASS) at 13:47 off my nixos-rebuild + static evidence.
- Operator then asked for the real cold-boot proof → I `systemctl reboot`ed the cc-ci host (new boot_id 2f77b915). Result: deploy-proxy reached active 13:51:03, 11s BEFORE deploy-dashboard (13:51:14) — the old 15-min deadlock is gone. deploy-proxy journal: traefik noop-healthy via /api/version → Finished. No failed units; 9/9 infra; endpoints 200. D8 circular-dep DEFINITIVELY fixed. Build sequence complete again (pxgate was 15/15).
2026-06-15 ~18:30 — recipe-upgrade workflow: abra version bump + release-notes links in PR
- Operator: (1) upgrade PR comments must link the relevant upstream release notes; (2) stop hand-editing the coop-cloud version label — use `abra recipe release <recipe> -x|-y|-z --dry-run` to COMPUTE the a.b.c+x.y.z (semver per flag + app image tag), apply on the PR branch, never run non-dry-run (it publishes upstream). Verified the syntax + dry-run behavior on the live abra first.
- Changed cc-ci recipe-upgrade SKILL.md (§2 bump + §3 PR body) and recipe-maintainer (submodule, pushed to notplants/recipe-maintainer main): recipe-upgrade-apply, recipe-new-tag, recipe-create-pr, recipe-upgrade-plan, recipe-upstream. Release-notes URLs come from the per-recipe upstream registry (cc-ci-plan/upstream/.md ; recipe-info//upstream.md). Also earlier: used-recipes.md + uptime-kuma marked external (excluded from weekly upgrade).
2026-06-15 ~22:55 — gtea phase DONE: gitea now fully-tested + LFS PR #1 verified; SEQUENCE-COMPLETE
- Phase `gtea` (queued this session, builder+adversary on sonnet) completed. M1 Adversary-PASS @20:32Z; M2 Adversary-PASS @22:10Z; claim+`## DONE` at 778720c. Watchdog wrote SEQUENCE-COMPLETE (gtea was 20/20) and stopped the loops. Queue now drained → hourly wake stops; new work = queue a new phase in agents.toml.
- All 6 M2 DoD verified in REAL CI: (1) gitea `main` full 5-tier suite green — build #684 L5, LFS test correctly SKIPs on main; (2) LFS roundtrip green on PR #1 — build #695 L5, `test_lfs_roundtrip` PASS (18s), via a new UPGRADE_SECRET_PREP hook that pre-creates the correct 43-char `lfs_jwt_secret`; (3) drone dep path UNAFFECTED — build #692 drone L5 (the central §2 constraint held); (4) ruff lint green; (5) unit 53/53 incl `test_gitea_dep` 10/10 throughout; (6) no secret leak in any artifact.
- New `tests/gitea/`: recipe_meta.py (dual-role dep+recipe-under-test), ops.py, 4 lifecycle overlays, custom/{test_health,test_git_push,test_admin_api,test_lfs_roundtrip}, PARITY.md (ports upstream health_check+git_push). Harness fixes: UPGRADE_EXTRA_ENV/UPGRADE_SECRET_PREP for the opt-in LFS overlay-secret, head_ref uses git SHA not branch name, STACK_NAME derived from domain.
- Builder self-deployed git-lfs 3.6.1 to the cc-ci host via nixos-rebuild (operator OK with loops doing
host rebuilds — "recommendation not rule"; see memory). Host verified healthy after. The commented-out `.env.sample` `length=43` spec in PR #1 was the LFS-secret root cause; fixed harness-side (an operator uncomments it when enabling LFS, so NOT a PR defect — no PR comment needed). PR #1 NOT merged (operator's call); the suite confirms its LFS feature works end-to-end.
2026-06-17 ~18:00 — upgrade-base / canonical / config sequence DONE: regall→samever→canon→dash→settings→nixenv; SEQUENCE-COMPLETE
- A 6-phase sequence (operator-driven over 2026-06-16/17) all M1+M2 Adversary-verified PASS, no VETO; watchdog wrote SEQUENCE-COMPLETE and stopped the loops. Host verified healthy throughout + after the final runtime-env switch (systemctl --failed empty, services active, endpoints 200, git-lfs 3.6.1, 38G free).
- regall (sonnet): full all-recipe regression after the prevb dynamic-base change — 21/21 GREEN, no prevb-caused regressions (plausible's red was a pre-existing recipe bug, fixed via its PR#3).
- samever (opus): when the last-green canonical == the PR head version, the resolver steps back to the newest published release tag < head instead of a same-version no-op (design A; canonical-history = design B deferred to IDEAS). Proven: step-back base<head, version-bump path + discourse#4 unaffected.
- canon (opus): made the HOLLOW nightly sweep real + proven (it fired green but only custom-html was enrolled and ZERO canonicals had ever promoted). Now: all-21-enrolled (keycloak + 5 others are recorded §2.B exceptions), mirror-sync to upstream, promote only to tagged releases, trigger on a new release tag (operator refinements), skip-when-no-new-tag, run-twice determinism (15 skip / exceptions run), UPGRADE_BASE_VERSION retired (plausible on dynamic base 3.0.1), AI-free runtime, weekly timer. ~7 real defects caught+fixed (false PASS-label, broad promote failure, master-vs-main, cold-dep deadlock, concurrent sweeps, live-keycloak footgun, DEFECT-3 env-parity).
- dash (opus): per-recipe history page now sourced from local /var/lib/cc-ci-runs (432 runs) instead of the latest-100 Drone-builds slice — full history (handles mixed numeric/named ids).
- settings (opus): minimal CI-server /etc/cc-ci/settings.toml + SKIP_CANONICALS_FOR_UPGRADE (default false, false here) + always-on release-tag-first no-canonical fallback (newest release tag < head → main-tip only as last resort). Proven live: keycloak(no canon)→release tag, gitea(canon)→last-green, flag-true bypasses canonical to release-tag path.
- nixenv (opus): single-sourced the harness runtime env (ccciPyEnv+ccciRuntimeTools+cc-ci-run in packages.nix) referenced by cc-ci-run, the sweep timer (now execs cc-ci-run), and BOTH host systemPackages — root-cause fix for DEFECT-3 drift; removed the DEFECT-3 PATH patch; cc-ci host gained git-lfs/openssl. Live parity proven on BOTH the real timer fire AND the Drone path (#871): gitea test_lfs_roundtrip green from the shared env, zero missing-tool signatures.
- Orchestrator notes: wrote samever's ## DONE marker once (Builder was opus-quota-blocked; work was Adversary-cleared); nudged regall's bold-wrapped marker fix; queued every phase + the design refinements live. Queue now drained → hourly wake stops; new work = queue a new phase in agents.toml.
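The base-selection order described by samever and settings above can be summarised as a small decision function. This is a sketch under assumptions, not the resolver itself: the function names, the tuple comparison on the recipe part of an a.b.c+x.y.z tag, and the `"main"` return value for the main-tip fallback are all illustrative. The tags other than 1.7.0+v2.7.5 are made up for the example.

```python
from typing import Optional, Sequence, Tuple


def recipe_version(tag: str) -> Tuple[int, ...]:
    """Parse the recipe part (before '+') of an a.b.c+x.y.z tag for comparison."""
    return tuple(int(part) for part in tag.split("+", 1)[0].split("."))


def resolve_upgrade_base(
    head: str,
    canonical: Optional[str],
    release_tags: Sequence[str],
    skip_canonicals: bool = False,  # mirrors SKIP_CANONICALS_FOR_UPGRADE
) -> str:
    """Pick the version to upgrade FROM when testing `head`."""
    head_version = recipe_version(head)
    # 1. Last-green canonical, unless bypassed or it would be a same-version no-op.
    if canonical and not skip_canonicals and recipe_version(canonical) != head_version:
        return canonical
    # 2. Step back to the newest published release tag older than head.
    older = [tag for tag in release_tags if recipe_version(tag) < head_version]
    if older:
        return max(older, key=recipe_version)
    # 3. Last resort: main tip (hypothetical sentinel value).
    return "main"


tags = ["1.5.0+v2.6.0", "1.6.0+v2.7.0", "1.7.0+v2.7.5"]
print(resolve_upgrade_base("1.7.0+v2.7.5", "1.7.0+v2.7.5", tags))
```

In the printed case the canonical equals the head, so the sketch steps back to the newest older release tag instead of upgrading a version onto itself.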
2026-06-18 ~07:10 — redfix DONE: all 6 canon-sweep failures fixed + verified; SEQUENCE-COMPLETE
- Phase `redfix` (opus) M1+M2 fresh Adversary PASS, no VETO. Investigated all 6 canon-sweep failures in ISOLATION (flake vs genuine), then fixed each via a recipe PR or harness improvement — none left as a standing exception. Host verified healthy after (0 failed, services active, live keycloak SSO 302 undisturbed, 36G free).
- The six (operator: fix all, recipe PR or harness):
  - mattermost-lts — recipe PR: postgres dump + `backupbot.restore.post-hook` (immich pattern); restore now round-trips. (genuine recipe defect, not the canon "load race")
  - discourse — cc-ci overlay-scope fix (the `test_upgrade.py` overlay asserted an unreleased official-image migration); Adversary FAILed the first claim (F-redfix-1: dangling image-less sidekiq in compose.smtpauth.yml → R011 lint regression + broke smtp-auth), Builder fixed, re-verified level=5. (canon's "timeout" root-cause was WRONG — no timeout)
  - keycloak — harness: collision-free `canonical_domain` (`warm-canon-<r>`) for live-warm providers, then enrolled; promotes without disturbing the live OIDC service.
  - mumble — harness: handshake readiness/retry stabilization (it was a LOAD FLAKE — operator's recollection was right; 2× green in isolation).
  - bluesky-pds — recipe PR: reference the app svc as `${STACK_NAME}_app` (operator-directed; the established pattern, cf. matrix-synapse) instead of the bare `app` that collided cross-stack on the shared proxy. Dropped the earlier app→pds rename + coupled cc-ci exec-ref change (cleaner, recipe-only).
  - gitea — recipe PR: render `app.ini` into the writable `config:/etc/gitea` volume so the 3.5.3→3.6.0 warm advance can persist the JWT secret (was crashing on the read-only config mount). v1 broke the wizard transition (reverted); rework verified chaos-deploy green.
- Orchestrator notes: restarted the Builder once to shed a 692k-token context that was trapping it in the opus usage limit (operator-authorized; loops resume only via a `{name}.id` file, none present → fresh session; re-oriented from STATUS/journals via a nudge). Relayed the operator's `${STACK_NAME}_app` bluesky guidance. 4 recipe PRs + 2 harness fixes; nothing merged (operator reviews/merges).
- Queue drained again → hourly wake stops.