Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges: the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety net to Step 0 (network prune + docker restart when VIP-allocation failures are logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix, maintenance window) and for debugging/fixing the ghost PR afterward.
# Orchestrator journal

This file is the **persistent handoff record** for the cc-ci orchestrator. Every orchestrator
session (whether Claude or opencode) reads this on startup and appends to it when handing off or
when something noteworthy happens. It survives conversation resets — it is the memory that
`--resume` can't provide for opencode, and a more readable supplement for Claude sessions.

**On startup:** read this file before doing anything else. The most recent `## Session` entry
is where the previous session left off. Carry that context forward.

**On handoff / end of session:** append a `## Session` block (see format below) summarising
what happened, the current state, and anything the next session needs to know.

**On significant events mid-session:** append a `### Event` sub-entry (no need to wait for
handoff).

---

## Format

```markdown
## Session YYYY-MM-DD HH:MM UTC — <backend> <model>
**Left off:** <one sentence — what was the last thing done>
**Phase / loop state:** <phase X [N/11], loops RUNNING/stopped, cc-ci healthy/issue>
**Open items:** <bullet list of anything the next session needs to act on, or "none">
**Notes:** <anything surprising, a decision made, a known blocker, etc.>

### Event HH:MM — <short label>
<brief note>
```
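
A minimal sketch of how a session could render and append a `## Session` block in the format above. The helper names `format_session_block` and `append_session` are hypothetical (nothing in this repo is known to define them), and the blank lines between fields follow the existing entries rather than the compact template.

```python
from datetime import datetime, timezone


def format_session_block(when, backend, model, left_off, state, open_items, notes):
    """Render one `## Session` entry; all parameter names are illustrative."""
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    if open_items:
        # One bullet per open item, starting on the line after the label.
        items = "\n" + "\n".join(f"- {item}" for item in open_items)
    else:
        items = " none"
    return "\n".join([
        f"## Session {stamp} — {backend} {model}",
        "",
        f"**Left off:** {left_off}",
        "",
        f"**Phase / loop state:** {state}",
        "",
        f"**Open items:**{items}",
        "",
        f"**Notes:** {notes}",
        "",
    ])


def append_session(journal_path, block):
    """Append the block to the journal, separated by a blank line."""
    with open(journal_path, "a", encoding="utf-8") as fh:
        fh.write("\n" + block)
```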
---

## Session 2026-05-31 ~18:30 UTC — Claude Sonnet 4.6

**Left off:** Got opencode/deepseek-v4-pro working as the loop backend. Both builder and
adversary are actively running on `tinfoil/deepseek-v4-pro` (via `inference.tinfoil.sh`).
Phase 5 [11/11] in progress. The operator is debugging the opencode web UI visibility and
wants to continue orchestrating from opencode itself.

**Phase / loop state:**
- Phase **5 [11/11]** (`plan-phase5-verify-upgrade-flow.md`), in progress
- Latest product-repo commit: `de635ad` — `status(5): V3 DONE (custom-html-tiny upgrade GREEN); V7 DONE; A5-1/A5-2 fixed`
- Loops **RUNNING** on opencode/deepseek-v4-pro, actively processing (32–62K tokens in flight)
- Watchdog **RUNNING**, backend persisted to `.loop-backend` / `.loop-model` files

**Open items for next session:**
- Phase 5 loops need to finish V1–V9 and write `## DONE` to STATUS-5.md. They were at V3+V7 PASS before the backend switch. After phase 5 completes, phase 6 (reconcile-only over all 18 recipe mirrors) and phase 7 (full upgrade on n8n + ghost + matrix-synapse) still need to run.
- Phase 4 (final review/polish) was deliberately **deferred** — run it after weekly Opus credits reset. Phase idx currently at 10 (phase 5). To run phase 4 later: set idx to 9, start with `LOOP_BACKEND=claude RESUME_PHASE=1 cc-ci-plan/launch.sh start`.
- **Restart loops after reading this** — the current sessions are mid-processing. `cc-ci-plan/launch.sh status` will show state; if sessions are stalled, run `LOOP_BACKEND=opencode LOOP_MODEL=tinfoil/deepseek-v4-pro RESUME_PHASE=1 cc-ci-plan/launch.sh start`.
- DNS: `oc.commoninternet.net A 100.84.190.30` still needs adding (operator step). Web UI reachable directly at `http://100.84.190.30` in the meantime.
- Old Incus orchestrator VM (`cc-ci-orchestrator`, `100.116.55.106`) still cold standby — stop + delete when confident in Hetzner.

**Notes — opencode/tinfoil setup (critical for next session):**
- **Backend files:** `LOOP_BACKEND=opencode` and `LOOP_MODEL=tinfoil/deepseek-v4-pro` are persisted in `/srv/cc-ci/.cc-ci-logs/.loop-backend` and `.loop-model`. The watchdog reads these to restart dead sessions with the right backend.
- **API key:** stored in `/srv/cc-ci/.testenv` as `TINFOIL_API_KEY`. Written directly (not via `env:`) into `~/.config/opencode/opencode.jsonc` — opencode doesn't do env substitution in apiKey. The config also has `"permission": "allow"` (all tool calls auto-approved).
- **Inference URL:** `https://inference.tinfoil.sh/v1` (NOT `api.tinfoil.sh` — that's the control plane only). Fixed in both `.testenv` and `opencode.jsonc`.
- **Opencode web server:** `opencode-web.service` runs `opencode serve --hostname 127.0.0.1 --port 4096`. Nginx proxies `oc.commoninternet.net → localhost:4096` on tailscale IP. Sessions from the plain `opencode` TUI DO appear in the shared server's DB (they auto-connect via IPC), so the web UI should show them once DNS is set.
- **Launch command for opencode loops:** `LOOP_BACKEND=opencode LOOP_MODEL=tinfoil/deepseek-v4-pro RESUME_PHASE=1 cc-ci-plan/launch.sh start`
- **Launch command for claude loops (fallback):** `LOOP_BACKEND=claude LOOP_MODEL=sonnet RESUME_PHASE=1 cc-ci-plan/launch.sh start`
- **Launchers rewritten to Python:** `launch.py`, `launch-orchestrator.py`, `launch-upgrader.py` (bash wrappers are one-liners). All committed to `recipe-maintainers/cc-ci-orchestrator` (HEAD: `3412100`).
- **Opencode binary:** `/home/loops/.local/bin/opencode` v1.15.13. Re-install if missing: `curl -sL https://github.com/anomalyco/opencode/releases/download/v1.15.13/opencode-linux-x64.tar.gz | tar -xz -C /home/loops/.local/bin opencode`
- **Known opencode quirk:** the loop bootstrap message (pointing to the kickoff file) is sent via `ping_session` with `submit_key="Enter"`. The TUI needs ~8s to connect before the message is sent. If a session seems stuck at the blank prompt, manually send the message from `.cc-ci-logs/.kickoff-cc-ci-builder.txt` (or the adv equivalent), then press Enter.
- **Orchestrator in opencode:** `LOOP_BACKEND=opencode LOOP_MODEL=tinfoil/deepseek-v4-pro cc-ci-plan/launch-orchestrator.sh fresh` — no `--resume` (opencode doesn't support it); reads this JOURNAL.md as startup context.

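
As an illustration of the "backend files" note above, this is roughly how a restart path could pick up the persisted backend. It is a sketch only: `read_persisted_backend` and `restart_env` are hypothetical names, and falling back to `claude` / `sonnet` when a file is missing is an assumption borrowed from the fallback launch command, not confirmed watchdog behaviour.

```python
from pathlib import Path


def read_persisted_backend(log_dir, default_backend="claude", default_model="sonnet"):
    """Return (backend, model) from the .loop-backend / .loop-model files."""

    def read(name, default):
        try:
            value = (Path(log_dir) / name).read_text(encoding="utf-8").strip()
        except FileNotFoundError:
            return default  # assumption: missing file means "use the fallback"
        return value or default

    return read(".loop-backend", default_backend), read(".loop-model", default_model)


def restart_env(log_dir):
    """Environment variables a restart would pass to the launcher."""
    backend, model = read_persisted_backend(log_dir)
    return {"LOOP_BACKEND": backend, "LOOP_MODEL": model, "RESUME_PHASE": "1"}
```

With both files present, `restart_env("/srv/cc-ci/.cc-ci-logs")` would yield the same three variables used in the launch commands listed above.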
### Event 04:13 — migrated orchestrator to Hetzner cpx22

`cc-ci-loops.service` enabled, reboot-resilient. cc-ci server also Hetzner (server 134485294, `ssh cc-ci` → `100.95.31.88`).

### Event 13:22 — phase 4 paused, phase 5 started

Weekly Opus credits exhausted mid-session. Switched to Sonnet. Phase idx manually set to 10 (phase 5).

### Event 17:29 — loops stopped to switch backends

### Event 18:20 — opencode/deepseek loops running

After 7 bug fixes (wrong inference host, opencode run exits, --dir exits, env: not substituted in apiKey, permission prompts, submit key, timing), both loops now running on `tinfoil/deepseek-v4-pro` via the shared opencode-web.service.

---

## Session 2026-06-01 14:13 UTC — OpenCode GPT-5.4

**Left off:** Completed the assistant-owned phase 6 mirror reconcile pass and phase 7 targeted recipe-upgrade pass, wrote the operator summary, and dropped a `phase6-phase7.done` marker.

**Phase / loop state:**
- Builder/Adversary loops still on phase **5 [11/11]** separately from this assistant work.
- Assistant phase 6 summary/result file: `cc-ci-plan/phase6-phase7-summary-2026-06-01.md`
- Assistant phase 6/7 completion marker: `cc-ci-plan/phase6-phase7.done`

**Open items:**
- Bridge enrollment does **not** match the full phase-2 18-recipe set. Repo/live poll set = `custom-html`, `custom-html-tiny`, `cryptpad`, `hedgedoc`, `keycloak`, `lasuite-docs`, `lasuite-meet`, `matrix-synapse`, `n8n` (+ `cc-ci`). Missing vs phase-2 set: `bluesky-pds`, `discourse`, `ghost`, `immich`, `lasuite-drive`, `mailu`, `mattermost-lts`, `mumble`, `plausible`, `uptime-kuma`. Extra: `hedgedoc`.
- `ghost` phase-7 PR is open but not CI-triggerable until bridge enrollment includes `recipe-maintainers/ghost`.
- Review whether recipes still intended to be enrolled without mirrors: `lasuite-drive`, `mailu`, `mumble`, `uptime-kuma`.

**Notes:**
- Phase 6 reconciled all 18 enrolled recipes from scratch clones. Stale mirror PRs auto-closed on `lasuite-docs` (#1/#2/#3) and `keycloak` (#1). Four enrolled recipes currently have no mirror repo.
- Phase 7 outcomes: `n8n` stable PR `#3` went GREEN on build `61`; `matrix-synapse` existing PR `#1` re-ran and failed on build `53`; `ghost` PR `#2` opened successfully but verification is blocked by bridge enrollment mismatch.
- The bridge service rolled during verification; earlier `!testme` comments posted before or during the restart were swallowed as pre-existing by the poller startup pass. A clean re-run on stable `n8n` after the rollout confirmed the live path.

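
The enrollment mismatch in the open items above is a plain set difference. The sketch below reconstructs the phase-2 set from the two lists in that bullet (the eight enrolled recipes other than `hedgedoc`, plus the ten missing ones), which is an assumption about what the phase-2 set contains; `enrollment_diff` is a hypothetical helper, and `cc-ci` itself is left out because the bullet lists it separately.

```python
# Recipes the bridge was actually polling (from the open item above, minus cc-ci).
POLL_SET = {
    "custom-html", "custom-html-tiny", "cryptpad", "hedgedoc", "keycloak",
    "lasuite-docs", "lasuite-meet", "matrix-synapse", "n8n",
}

# Assumed phase-2 set: reconstructed from the journal, not read from config.
PHASE2_SET = (POLL_SET - {"hedgedoc"}) | {
    "bluesky-pds", "discourse", "ghost", "immich", "lasuite-drive",
    "mailu", "mattermost-lts", "mumble", "plausible", "uptime-kuma",
}


def enrollment_diff(expected, enrolled):
    """Return (missing, extra) as sorted lists."""
    return sorted(expected - enrolled), sorted(enrolled - expected)


missing, extra = enrollment_diff(PHASE2_SET, POLL_SET)
print(len(PHASE2_SET), "expected;", len(missing), "missing; extra:", extra)
```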
---

## Session 2026-05-31 ~04:00 UTC — Claude Sonnet 4.6

**Left off:** Completed the orchestrator → Hetzner migration (cpx22, server 134487234, public
`168.119.126.100`, tailnet `cc-ci-orchestrator-1` @ `100.84.190.30`). The old Incus VM
(`100.116.55.106`) is still on the tailnet — cold standby, not yet deleted.

**Phase / loop state:** Phases 1c–1e, 2w, 2pc, 2, 2b, 3 all DONE. Phase 5 [11/11]
(upgrade-flow verify) in progress — loops running, actively verifying the `!testme`
end-to-end flow on the new Hetzner cc-ci server.

**Open items:**
- Phase 5 is in progress — loops need to finish V1–V9 and write `## DONE` to STATUS-5.md.
- Phase 4 (final review/polish) was deliberately **skipped** this session — it is queued
  at idx 9 in PHASE_IDX_FILE. Resume it after the weekly Opus credits reset.
- Phase 6 (reconcile-only over all 18 recipe mirrors) and Phase 7 (full upgrade on n8n +
  ghost + matrix-synapse) are planned but not yet started — run them after Phase 5 DONE.
- Old Incus orchestrator VM (`cc-ci-orchestrator`, `100.116.55.106`) is still running —
  stop it via the b1 Incus API once happy with the Hetzner box. mTLS certs at
  `/srv/incus-terraform-nix-vm-creator/terraform-secrets/`.
- DNS: `oc.commoninternet.net` A record → `100.84.190.30` still needs adding (operator step).

**Notes:**
- `cc-ci-loops.service` is **enabled** and wired with `reboot-log.sh` ExecStartPre — a reboot
  is a non-event; loops + watchdog auto-resume via RESUME_PHASE=1.
- The cc-ci **server** also moved to Hetzner (server 134485294, `ssh cc-ci` →
  `100.95.31.88`). It has authenticated Docker Hub pulls and 150 GB disk — the old OOM /
  disk-starvation / rate-limit issues are gone.
- All recipe mirrors currently reconcile correctly; no stale open PRs observed.
- `opencode` v1.15.13 installed at `/home/loops/.local/bin/opencode`. Tinfoil API key is in
  `.testenv` as `TINFOIL_API_KEY`. Backend switch: `LOOP_BACKEND=opencode
  LOOP_MODEL=tinfoil/deepseek-v4-pro RESUME_PHASE=1 cc-ci-plan/launch.sh start`.
- Launcher scripts rewritten to Python (`launch.py`, `launch-orchestrator.py`,
  `launch-upgrader.py`); bash wrappers are now one-liners that `exec python3 <script> "$@"`.

### Event 03:13 — migrated from old Incus VM to Hetzner

Loops were started manually during staging (not by the service); first systemd-managed
boot was later this session. `cc-ci-loops.service` now enabled.

### Event 05:23 — phase 3 (results-UX) completed

All R1–R8 Adversary-verified, no VETO. Watchdog auto-advanced to phase 4.

### Event 13:22 — phase 4 paused, jumped to phase 5

Operator deferred phase 4 (weekly Opus credits exhausted). Phase idx manually set to 10
(phase 5). Loops restarted on Sonnet.

### Event 17:29 — loops stopped pending restart on different model

Operator paused loops to reconfigure backend (opencode/tinfoil exploration). Phase 5
[11/11] was in progress — loops had verified V1/V2/V3/V7 (custom-html-tiny upgrade GREEN).
Phase idx = 10 (phase 5), loops stopped, watchdog stopped.

---

## Session 2026-06-01 03:34 UTC — OpenCode GPT-5.4

**Left off:** Fixed opencode web visibility for the Builder/Adversary loop sessions by switching
the loop launcher from plain TUI startup to `opencode attach` against the shared web server, and
patched the orchestrator launcher the same way for the next session.

**Phase / loop state:**
- Phase **5 [11/11]** (`plan-phase5-verify-upgrade-flow.md`), still in progress
- Loops **RUNNING** on opencode with OpenAI `gpt-5.4`
- Watchdog **RUNNING**
- `opencode-web.service` **RUNNING** and nginx still serving `http://oc.commoninternet.net`

**Open items:**
- Start a fresh orchestrator session in opencode if desired; this current conversation cannot be
  resumed as an opencode session, only handed off.
- If you want the orchestrator tmux session to move from Claude to opencode, use
  `LOOP_BACKEND=opencode LOOP_MODEL=openai/gpt-5.4 ORCH_SESSION=cc-ci-orchestrator-oc cc-ci-plan/launch-orchestrator.sh fresh`
  or stop/recreate `cc-ci-orchestrator-vm` explicitly.
- Phase 5 work itself is still unfinished; loops should continue from current state.
- Phase 4 remains deferred; phases 6 and 7 still remain after phase 5 completes.

**Notes:**
- The key fix for web visibility was **`opencode attach http://127.0.0.1:4096 --dir ...`**.
  Plain `opencode` TUI sessions were inconsistently recorded and often did not show in the web UI.
- The path choice was much less important than attach mode. We tested both symlinked and real repo
  paths. Attach mode was the real fix.
- One attached loop initially hit `python3: not found` because tool execution started flowing
  through the shared `opencode-web.service` environment. Fixed by broadening the service PATH at
  runtime and in `nix/hosts/cc-ci-orchestrator-hetzner/configuration.nix`.
- Current launcher state: `cc-ci-plan/launch.py` uses attach mode for opencode loops; `cc-ci-plan/launch-orchestrator.py`
  is patched to use attach mode for opencode orchestrator sessions too.
- A runtime systemd override was applied at `/run/systemd/system/opencode-web.service.d/override.conf`.
  Persist the final service environment with `nixos-rebuild` when convenient.

### Event 13:46 — recovered cc-ci from emergency mode via Hetzner rescue

`cc-ci` stopped booting cleanly after a `nixos-rebuild test --flake path:/root/builder-clone#cc-ci`
activation. Hetzner rescue + VNC console showed emergency mode; mounted journal showed `/boot` waiting on
`/dev/disk/by-label/ESP`. The immediate repair was restoring the missing FAT label on `/dev/sda15`
(`fatlabel /dev/sda15 ESP`) and rebooting normally. Follow-up investigation item: determine why the
wrong boot layout was activated and prevent future use of `#cc-ci` on the Hetzner server when the
correct host target is `#cc-ci-hetzner`.

### Event 18:53 — scheduled supervision pass

Checked Builder, Adversary, and Assistant live state. `ssh cc-ci hostname` still returns `nixos` after
the corrected Hetzner rebuild. Builder is active on a fresh matrix-synapse rerun under the restored
bridge path; Adversary was nudged to re-orient to that live state; Assistant remains idle after
finishing phase 6/7 and recording the bridge-enrollment mismatch against the full 18-recipe phase-2 set.

### Event 16:34 — progress monitor nudged stalled phase-5 workers

`launch.py status` showed builder, adversary, and watchdog running; `ssh cc-ci hostname` succeeded (`nixos`).
Assistant session was present and already idle after its completed phase 6/7 pass (`phase6-phase7.done` exists).
Builder was still blocked on a model usage-limit retry and adversary was parked past `WAITING-UNTIL 2026-06-01T14:24:51Z`, so both received tmux nudges to re-read the live phase-5 status and continue from current evidence.

### Event 19:04 — progress monitor rechecked phase-5 workers

`launch.py status` still shows phase 5 [11/11] in progress with builder, adversary, and watchdog running; `ssh cc-ci hostname` still succeeds (`nixos`).
`STATUS-5.md` still lacks `## DONE`, so phase 5 remains open, while `cc-ci-plan/phase6-phase7.done` confirms the assistant-owned phase 6/7 work is finished and the assistant remains idle.
Builder is active on the current V5 frontier; adversary's declared `WAITING-UNTIL 2026-06-01T19:03:38Z` had just expired, so it was nudged to re-read the live phase-5 status and continue from current evidence.

### Event 19:08 — operator directed simulated stale-test path

Operator clarified that V5/V6 should not depend on discovering a naturally occurring stale-test recipe. Builder and adversary were both nudged to switch to a simulated/seeded stale-test case on an enrolled sandbox candidate, then verify the two intended behaviors: DEFAULT comment-only and `--with-tests` opening/verifying the paired cc-ci test PR.

### Event 21:46 — backend reverted to claude, waker folded into watchdog, boot service fixed (Claude Sonnet 4.6)

Operator was out of Claude credits and had run the loops on opencode (deepseek-v4-pro, then gpt-5.4); now reverted to claude.
- **Backend → claude/sonnet.** Closed all opencode sessions (`cc-ci-orchestrator-oc`, `cc-ci-assistant`) and stopped `opencode serve`; restarted builder+adv via `RESUME_PHASE=1 LOOP_BACKEND=claude LOOP_MODEL=sonnet launch.py start`. `.loop-backend`=claude, `.loop-model`=sonnet. Restarted the watchdog too so it dropped its stale opencode-backend memory.
- **Waker → watchdog.** Retired the standalone `ai-progress-monitor.sh`/`cc-ci-orchestrator-waker` (it pinged the dead `-oc` session every 15m). The watchdog now wakes the orchestrator session for an hourly supervision pass (`ORCH_WAKE_INTERVAL`=3600s, prompt = `ai-progress-monitor-prompt.txt`), retrying each tick until the orchestrator is idle so it never interrupts/skips. Reboot-safe (watchdog is started by `cc-ci-loops.service`).
- **Boot fix.** `cc-ci-loops.service` had been failing on every boot (`claude CLI not found`) because the systemd `path` lacked `/home/loops/.local/bin`; loops were started by hand. Fixed in the flake (`CLAUDE_BIN` abs path + PATH export), `nixos-rebuild switch` applied — service now starts the loops cleanly on boot. Verified: clean start log, no error, phase 5 RUNNING.
- **Note:** the rebuild restarted `opencode-web.service` (still `wantedBy multi-user.target` in the flake) — idle serve, harmless to the claude loops, but it will keep returning on every rebuild/reboot until disabled in the flake.

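
The "retry each tick until idle" wake behaviour described above could be sketched as follows. Only `ORCH_WAKE_INTERVAL` and its 3600s value come from this entry; the `OrchestratorWaker` class and the `is_idle` / `send_prompt` callbacks are hypothetical stand-ins for however the watchdog actually inspects and pings the tmux session.

```python
import time

ORCH_WAKE_INTERVAL = 3600  # seconds, as recorded in this entry


class OrchestratorWaker:
    """Wake the orchestrator once per interval, but only while it is idle."""

    def __init__(self, is_idle, send_prompt, interval=ORCH_WAKE_INTERVAL,
                 clock=time.monotonic):
        self._is_idle = is_idle          # callable -> bool (hypothetical)
        self._send_prompt = send_prompt  # callable that delivers the prompt
        self._interval = interval
        self._clock = clock
        self._next_due = clock() + interval

    def tick(self):
        """Call once per watchdog tick; returns True if a wake was sent."""
        if self._clock() < self._next_due:
            return False
        if not self._is_idle():
            # Busy: the wake stays due and is retried on the next tick,
            # so it neither interrupts the session nor gets skipped.
            return False
        self._send_prompt()
        self._next_due = self._clock() + self._interval
        return True
```

Keeping the wake "due" rather than rescheduling it while the orchestrator is busy is what gives the never-interrupt, never-skip property in this sketch.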
### Event 23:23 — BUILD COMPLETE (all phases done) + weekly-upgrade cron cutover to a NixOS timer

Phase 5 reached `## DONE` and the watchdog wrote SEQUENCE-COMPLETE at 23:23:43Z: **the entire cc-ci build is finished** (phases 1c 1b 1d 1e 2w 2pc 2 2b 3 4 5). All V1–V9 + §4 cron Adversary-verified PASS, no VETOs, no open findings. The watchdog auto-stopped the loops and exited (so the in-watchdog hourly orchestrator wake is also gone now — by design; the build is done). Only `cc-ci-orchestrator-vm` remains up.
- **§4 cron — how the loops left it vs. final state.** During verification the loops swapped the busybox-crond-in-tmux for a `CronCreate` job (weekly id `8dd9aed3`, Mon 23:04 UTC) and disabled busybox crond. But CronCreate is **in-memory + session-scoped**: when the Builder session ended at sequence-complete, that weekly job evaporated (confirmed: `CronList` from this session shows none). That fragility is exactly what the operator asked to fix.
- **Final mechanism = reboot-safe NixOS systemd timer.** Activated `cc-ci-upgrade-all.{service,timer}` (committed earlier as `ee58027`): **OnCalendar Sun 02:00 UTC, Persistent=true**, timer-triggered only (service not `wantedBy multi-user.target`). `nixos-rebuild switch` applied — only ADDED the two units, did NOT bounce anything (loops were already stopped). `systemctl list-timers` → next run **Sun 2026-06-07 02:00:00 UTC**. Retired the leftovers: busybox crond already gone, removed the inert `/home/loops/.cc-ci-crontabs/loops`.
- **Operator-requested schedule change:** weekly upgrade moved from Mon 23:04 UTC (the phase-5 test schedule) to **Sun 02:00 UTC**.
- **Stale note:** `cc-ci/machine-docs/DECISIONS.md` still records "§4 weekly cron: CronCreate" — now superseded by the NixOS timer. Left to the operator/next loop run to amend (cc-ci product repo, loops' single-writer domain).

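
To sanity-check the "next run" figure above without the server at hand, the weekly schedule can be approximated in a few lines. This is not how systemd evaluates `OnCalendar` and it does not model `Persistent=true` catch-up runs; `next_weekly_run` is a hypothetical helper that only reproduces the date arithmetic.

```python
from datetime import datetime, timedelta, timezone


def next_weekly_run(now, weekday, hour, minute=0):
    """Next occurrence strictly after `now` (UTC).

    `weekday` uses datetime.weekday() numbering: Monday=0 ... Sunday=6.
    """
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


# Sequence-complete timestamp from this entry: 2026-06-01 23:23:43Z (a Monday).
build_done = datetime(2026, 6, 1, 23, 23, 43, tzinfo=timezone.utc)
print(next_weekly_run(build_done, weekday=6, hour=2).isoformat())
# → 2026-06-07T02:00:00+00:00
```

The result agrees with the `systemctl list-timers` output quoted in the entry.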
### Event 2026-06-02 03:42 — post-build work: mirror+regression phases DONE; overnight /upgrade-all running

After the cc-ci build completed (2026-06-01), the operator drove a sequence of post-build phases via the loops:
- **`mirror` phase DONE** (01:16Z): all recipes mirrored + enrolled (created mirrors for lasuite-drive/mailu/mumble; enrolled the 9 missing; loops did the live-host `nixos-rebuild #cc-ci` + `!testme` verification themselves after the deploy gate was removed — `#cc-ci` is safe since `be4f451`).
- **`regression` phase DONE** (03:42:07Z) — **entire 2-phase sequence (mirror→regression) complete; loops + watchdog stopped/exited.** Shipped `tests/regression/` (cc-ci PR#5, NOT merged): **7 canaries, all Adversary cold-verified** — good-simple (custom-html-tiny) GREEN, good-significant (lasuite-docs) GREEN (5 tiers + clean teardown + no secret leak), bad-false-green RED, and 4 per-tier REDs (bad-install/upgrade/backup/restore, each RED at the intended tier with prior tiers passing; dedicated fixture recipes custom-html-bkp-bad / custom-html-rst-bad). 3 `@canary` + 4 `@canary_fast`; README documents the milestone-only cadence (not per-commit).
- **PR consolidation (Assistant, one-shot):** every recipe mirror reduced to ≤1 open PR (custom-html #2→#1 @1.13.0; ghost #2→#1 = backup+upgrade). Verified one-open-PR-per-recipe across all mirrors.
- **Overnight run in flight:** `cc-ci-overnight` (overnight-run.sh) gated on assistant-done + usage-reset + loops-idle, then launched the weekly **`/upgrade-all`** at 03:40Z (DEFAULT, never merges). It will write `/srv/cc-ci/.cc-ci-logs/overnight-report-<date>.md` and ping this session to deliver the operator's **morning** PushNotification (held until then — no overnight ping). The build-complete + regression-shipped headline will be folded into that morning notification.
- **State:** watchdog/loops stopped by design (sequence complete → hourly wake stops too); the overnight runner + the weekly Sun-02:00 timer are the only live automation. Recipe-upgrade PR behavior was also reworked this session: never close unmerged PRs; extend an existing open upgrade PR by commit-on-top (no force-push) instead of a parallel PR; only close merged-upstream PRs.

### Event 2026-06-02 11:40 — OVERNIGHT /upgrade-all COMPLETE (full run, all 18 recipes)

The overnight runner finished and pinged the orchestrator; morning report at `/srv/cc-ci/.cc-ci-logs/overnight-report-2026-06-02.md` (+ `upgrades/upgrade-all-2026-06-02.md`). **Considered 18 · GREEN !testme: 10 · stale-test (commented): 2 · failed: 2 · skipped: 4. Nothing merged.** It followed through on all recipes (the original 7 + the 7 recovered from the abra-auth issue + the rest).
- **GREEN (10):** cryptpad, keycloak, lasuite-meet, mailu, n8n (⚠ pg volume path change), custom-html, custom-html-tiny, uptime-kuma, lasuite-docs, ghost (⚠ supersedes its open ci/mysql-backup PR#1).
- **Stale-test → operator `--with-tests` (2):** matrix-synapse (`test_upgrade_preserves_data`, ci_marker lost across pgautoupgrade 17→18), discourse (`test_create_topic_roundtrip`, Discourse 3.5.0 flipped `allow_uncategorized_topics` default).
- **Failed — pre-existing recipe bugs (2):** mattermost-lts (`test_restore_returns_state` after 3 !testme; backup/restore bug, see ci/pg-restore PR#1), plausible (ClickHouse IPv6 + GHCR move after 3 !testme; see ci/clickhouse-backup-resilient PR#1).
- **Skipped (4):** bluesky-pds / mumble / lasuite-drive up-to-date; immich — abra can't parse tag+digest image refs (explanatory comment left on PR#1).
- **abra-auth issue:** confirmed = go-git needs creds embedded in `origin` URL (ignores insteadOf/.netrc); recovered all 8 at runtime; skills now fixed (this session). TTY `script` wrapper was a separate, also-correct fix.
- **Operator follow-ups:** (a) a few recipes now have 2 open PRs — an upgrade PR alongside a prior backup-fix PR (discourse #2+#1, ghost #3+#1, mattermost-lts #2+#1, plausible #2+#1) — reconcile/close the superseded ones; (b) re-run matrix-synapse + discourse with `--with-tests` to refresh the stale tests; (c) mattermost-lts + plausible failures are pre-existing recipe bugs to investigate.

### Event 2026-06-02 ~17:30 — bridge: one comment per !testme (deployed)

Operator wanted each `!testme` to get its OWN comment (edited in place to that run's result), instead of
the old "reuse/edit one marker comment in place" (which made re-runs on an unchanged head invisible).
Changed `bridge/bridge.py` `process_testme`: always `post_comment` a fresh ⏳ placeholder per run;
`watch_and_reflect` still edits THAT run's `cid` to ✅/❌. cc-ci repo commit `a78ec2d` (pushed).
- **Deployed to the live cc-ci server.** ⚠️ Deploy-path gap: the documented `/root/cc-ci` is GONE.
  Deployed via `/root/builder-clone` (the harness's host clone — has the real remote + the `secrets`
  submodule): `git pull origin main` → `git submodule update --init secrets` →
  `nixos-rebuild switch --flake '/root/builder-clone?submodules=1#cc-ci'`. Diff was bridge-only (no
  other nix/ changes), so only the bridge image rolled (content-hash tag 3761c42→4482ce9). Verified:
  new image Running, poller watching all 20 repos. **Follow-up:** establish a clean canonical deploy
  checkout for the cc-ci server (not the harness's builder-clone).

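
The per-run comment behaviour described at the top of this entry can be illustrated with an in-memory stand-in for the forge. The function names `process_testme`, `watch_and_reflect`, `post_comment`, and the `cid` handle come from the entry; their signatures here, the `FakeForge` class, `edit_comment`, and the comment wording are assumptions made for the sketch, not the real `bridge/bridge.py` code.

```python
import itertools


class FakeForge:
    """In-memory stand-in for the forge's PR comment API (hypothetical)."""

    def __init__(self):
        self.comments = {}
        self._ids = itertools.count(1)

    def post_comment(self, pr, body):
        cid = next(self._ids)
        self.comments[cid] = {"pr": pr, "body": body}
        return cid

    def edit_comment(self, cid, body):
        self.comments[cid]["body"] = body


def process_testme(forge, pr, head_sha):
    """Always post a fresh placeholder, even if the head is unchanged."""
    return forge.post_comment(pr, f"⏳ !testme running on {head_sha}")


def watch_and_reflect(forge, cid, passed):
    """Edit only the comment that belongs to this run."""
    forge.edit_comment(cid, "✅ passed" if passed else "❌ failed")


forge = FakeForge()
first = process_testme(forge, pr=2, head_sha="a92b28d")
second = process_testme(forge, pr=2, head_sha="a92b28d")  # re-run, same head
watch_and_reflect(forge, first, passed=False)
watch_and_reflect(forge, second, passed=True)
```

After the two runs the forge holds two separate comments on the same PR, one red and one green, which is the visibility the old single-marker approach lacked.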
### Event 2026-06-02 ~23:05 — /recipe-report skill + report.ci.commoninternet.net SHIPPED

Built + deployed the weekly public "Recipe Report" (plan: cc-ci-plan/plan-recipe-report-skill.md).
- **Serving:** nix/modules/reports.nix (nginx:alpine static server, traefik Host(report.ci.commoninternet.net) +
  wildcard TLS, serves /var/lib/cc-ci-reports). cc-ci repo `f5a6f71`, deployed via builder-clone. Live.
- **Generator:** `cc-ci-plan/recipe-report.py` (survey/render/publish) + skill `.claude/skills/recipe-report/` +
  `cc-ci-plan/launch-report.py` (own cc-ci-report agent, **REPORT_MODEL default opus** — separate from
  the sonnet upgrader). upgrade-all's closing step launches it. orchestrator repo `c7301a9`.
- **Page:** title "The Recipe Report" / "Week of <date>"; ① Needs attention (PRs to merge + errors) ·
  Routine · comprehensive table (all recipes, CI shown as level/number+LINK, no images). Index lists all weeks.
- **First report (opus-generated) LIVE:** https://report.ci.commoninternet.net/week-2026-06-02.html
  (10 green PRs, 2 failed, matrix-synapse stale-test; 21-recipe table). From next weekly /upgrade-all it
  auto-publishes.
- **Note:** still deploying the cc-ci server via /root/builder-clone (the deploy-path gap remains).

### Event 2026-06-02 ~23:16 — Recipe Report v2: newspaper front page (CVE-led editorial)

Reworked the report to a newspaper layout: masthead + opus editorial LEAD (overall fleet state + what
to focus on) + a 🔒 Security Bulletin of critical-CVE upgrades FIRST, then needs-attention/routine, then
the comprehensive table ("the full wire"). survey now feeds opus each recipe's upgrade_notes_md
(breaking-change/CVE analysis). orchestrator `6cf5913`. First v2 (opus) live + verified — it led with
the nginx 1.29→1.31 CVE batch (custom-html, cryptpad) and even noted live state past the morning summary.

### Event 2026-06-09 ~19:50 — Orchestrator handover (assistant session): concurrent-CI fixes + immich/plausible drive

Operator promoted the cc-ci-assistant session (immich upgrade one-shot) to ORCHESTRATOR: "work on these
fixes to concurrent runs, then drive immich and plausible to green; autonomous; track in this repo."

**Immich (PR https://git.autonomic.zone/recipe-maintainers/immich/pulls/2, head a92b28d):** upgrade to
1.7.0+v2.7.5 (postgres pin HELD at 14-vectorchord0.4.3-pgvectors0.2.0@sha256:bcf63357… — what
immich-server v2.7.5 pins; abra FATA'd on tag+digest so surveyed upstream directly, registry persisted
at cc-ci-plan/upstream/immich.md) + backup/restore fix: `pg_dump --clean --if-exists` no-DROP restore
(**DROP DATABASE PANICs pgvecto.rs** → postgres signal 6 — confirmed in CI 225 logs + dev) + immich-docs
search_path sed. **Verified GREEN end-to-end in dev via real abra backup/restore path**; dev-immich torn
down, zero leakage. 6 !testme runs RED so far; 229/230 root cause (drone sqlite log extraction):
`/pg_backup.sh: No such file or directory` — the harness chaos-deployed a tree WITHOUT the config,
suspected shared-checkout race (my repro scripts flipped ~/.abra/recipes/immich during the builds).

**Queue findings (operator: "queue is getting blocked"):** build 231 (plausible !testme) was doomed —
cc-ci main lacks assistant3's UPGRADE_BASE_VERSION=3.0.1 pin (branch test/plausible-upgrade-base-3.0.1;
its push build 233 failed LINT, not content); canceled 231+232 (232=immich; drone cancel LEAKED the
python child — killed by hand; its immi-ad3e33 orphan reaped manually). **Push-build lint has been RED
since ≥ build 209** (repo-wide format drift + shellcheck + statix + 17 ruff errors) — nothing can land
green. **Parallel-CI unsafety confirmed in .drone.yml on main:** CCCI_JANITOR_MAX_AGE=0 (a starting
build reaps ANY in-flight run app), concurrency.limit=1 vs DRONE_RUNNER_CAPACITY=2 (live since 18:35),
shared HOME=/root + shared ~/.abra/recipes/<recipe> checkout — all annotated "safe because capacity=1".

**Plan in flight:** (1) lint-green commit (subagent on /home/loops/work/cc-ci-fix); (2) concurrency
safety: per-recipe flock in run_recipe_ci.py + janitor pidfile/age scoping + concurrency.limit=2 +
comment updates; (3) merge plausible pin; (4) re-!testme immich alone → green; (5) plausible green is
assistant3's lane (its verify: upgrade/backup tiers PASSED, restore post-hook failed `gzip:
/postgres.dump.gz: No such file` — pre-hook never produced the dump in the snapshot) — coordinating via
tmux, not duplicating. Siblings: cc-ci-assistant3 (plausible), cc-ci-upgrader (told to review plausible
failure). Memories moved INTO this repo at memory/ (542ed0a) — auto-memory path is a symlink now.

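
For step (2) of the plan, a per-recipe `flock` is one way to stop two runs from mutating the same shared recipe checkout. The sketch below is an assumption about the shape of that fix, not the code in `run_recipe_ci.py`: `recipe_lock`, the `lock_dir` parameter, and the lock file naming are placeholders.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def recipe_lock(recipe, lock_dir):
    """Hold an exclusive advisory lock for one recipe while the body runs.

    Runs for the same recipe queue up behind each other; different recipes
    use different lock files and proceed in parallel.
    """
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, f"{recipe}.lock")  # placeholder naming
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield path
    finally:
        # Closing the descriptor releases the lock; the kernel does the
        # same if the process dies, so no stale-lock cleanup is needed.
        os.close(fd)
```

A caller would wrap everything that touches the checkout, for example `with recipe_lock("immich", "/tmp/ci-locks"): ...`, taking the lock before fetching the recipe rather than after.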
### Event 2026-06-09 ~21:10 — Concurrent-CI fixes LANDED on cc-ci main (build 236 green)
Orchestrator (this session) landed the queue/concurrency work on cc-ci main, first green push build
since the lint drift began (~209): `9a77725` style: repo-wide lint pass (118 files — ruff format/fix,
shfmt, nixpkgs-fmt/statix/deadnix, yamllint, lasuite-docs quoting; lint PASS + 138 unit tests);
`c0df77d` fix(harness): concurrent-run safety — per-recipe flock `/run/lock/cc-ci-recipe-<recipe>.lock`
taken in main() BEFORE fetch_recipe (kernel auto-release, no stale-lock mode; same-recipe runs
serialise, different recipes parallel) + active-run registry `/run/cc-ci-active/<domain>` pidfiles with
three-way janitor (alive=never reap / dead=reap now / unknown=age fallback 2h; pid-reuse guarded via
/proc cmdline match on run_recipe_ci) + .drone.yml: concurrency.limit 1→2, CCCI_JANITOR_MAX_AGE=0
REMOVED, stale capacity=1 comments rewritten; `c828f6c` merge of assistant3's
test/plausible-upgrade-base-3.0.1 pin (UPGRADE_BASE_VERSION=3.0.1+v2.0.0). Branch push build 234 green,
main push build 236 green. /root/builder-clone fast-forwarded to c828f6c. assistant3 notified via tmux
(plausible !testme unblocked; restore-hook gzip failure is their lane). Next: immich PR #2 !testme
re-triggered alone (checkout parked clean at a92b28d) — polling to verdict.

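The three-way janitor rule from the `c0df77d` entry can be sketched roughly as follows. This is a minimal illustration, not the code in run_recipe_ci.py: the function names, the `read_cmdline` injection point and the constants are hypothetical, and only the alive/dead/unknown policy, the 2h fallback and the /proc cmdline match come from the entry above.

```python
from typing import Callable, Optional

AGE_FALLBACK_SECONDS = 2 * 60 * 60  # "unknown" entries are reaped only after 2h
RUNNER_MARKER = "run_recipe_ci"      # substring expected in the owner's cmdline


def _proc_cmdline(pid: int) -> Optional[str]:
    """Read /proc/<pid>/cmdline; return None when the process is gone."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as fh:
            return fh.read().replace(b"\0", b" ").decode(errors="replace")
    except (FileNotFoundError, ProcessLookupError):
        return None


def owner_state(pid: Optional[int],
                read_cmdline: Callable[[int], Optional[str]] = _proc_cmdline) -> str:
    """Classify the pid recorded in a pidfile as 'alive', 'dead' or 'unknown'."""
    if pid is None:
        return "unknown"  # pidfile missing or unparsable (assumed meaning)
    cmdline = read_cmdline(pid)
    if cmdline is None:
        return "dead"     # no such process
    # A live pid whose cmdline is not the runner means the pid was reused.
    return "alive" if RUNNER_MARKER in cmdline else "dead"


def should_reap(state: str, age_seconds: float) -> bool:
    """alive = never reap, dead = reap now, unknown = age fallback."""
    if state == "alive":
        return False
    if state == "dead":
        return True
    return age_seconds > AGE_FALLBACK_SECONDS
```

Keeping the classification separate from the reap decision makes the pid-reuse guard testable without touching /proc.
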
### Event 2026-06-09 ~23:20 — Two harness convergence fixes landed; immich on run 4 (build 245)
Immich !testme run 2 (build 238) RED but PROGRESS: install/upgrade/custom PASS (checkout race gone),
backup CRASHED — backupbot exec'd the db pre-hook into a container swarm killed seconds earlier: the
chaos redeploy changes the db image (pgvecto.rs→vectorchord pin) and registers a stop-first rolling
update that hadn't STARTED when the N/N convergence check passed (old task still 1/1). → `68ef0f8`
fix(harness): services_converged() also requires swarm UpdateStatus settled + bounded settle-wait in
backup_app(). Run 3 (build 241) then HUNG 22min in the restore tier: the app service's UpdateStatus
was 'paused' (swarm default update-failure-action after one task flicker during restore) — a state
that persists FOREVER; my check treated it as in-flight. Killed 241 (cancel leaks the python child —
killed by hand; immi-ad3e33 undeployed+rm'd, registry entry cleared, zero leakage verified). →
`e6d55b5` fix(harness): only 'updating'/'rollback_started' block convergence; 'paused' + N/N is
settled (health asserts still gate). Both branch builds green (239, 243); main ff'd; builder-clone
updated. **Plausible build 237 (assistant3, head 4cab6b5): install/upgrade/backup PASS — their
gzip/dump-path fix WORKS, marker restore test PASSED; remaining: app 502s after restore
(test_restore_healthy + custom tier) + restore hook needs pg_restore --if-exists; diagnosed +
relayed via tmux.** Concurrency machinery observed working live: parallel immich+plausible runs held
per-recipe locks, registered pidfiles, plausible's teardown unregistered cleanly. Immich run 4 =
build 245 (custom, running) with both fixes live — monitor armed.

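The convergence rule after `68ef0f8` + `e6d55b5` can be illustrated with a small sketch. The dict shape (`running`, `desired`, `update_state`) is hypothetical and stands in for whatever the harness reads from the swarm service inspect output; only the rule itself (N/N replicas, and only 'updating' or 'rollback_started' count as in-flight) is taken from the entry above.

```python
from typing import Iterable, Mapping, Optional

# Swarm UpdateStatus states that mean an update is still in flight.
# 'paused' is deliberately absent: it can persist forever, so with N/N
# replicas it is treated as settled and the health asserts do the gating.
BLOCKING_UPDATE_STATES = {"updating", "rollback_started"}


def service_settled(running: int, desired: int, update_state: Optional[str]) -> bool:
    """True when replicas are N/N and no update is in flight."""
    if running != desired:
        return False
    return update_state not in BLOCKING_UPDATE_STATES


def services_converged(services: Iterable[Mapping]) -> bool:
    """True when every service in the stack is settled."""
    return all(
        service_settled(s["running"], s["desired"], s.get("update_state"))
        for s in services
    )
```

A service with no UpdateStatus at all (never updated) passes the state check, which matches a fresh deploy.
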
### Event 2026-06-10 ~00:05 — IMMICH PR #2 GREEN (build 245, level=4); cc-ci PR #9 merged; plausible go-ahead
**Immich PR #2 (head a92b28d) verdict: GREEN** — build 245, ALL tiers pass
(install/upgrade/backup/restore/custom), level=4, deploy-count 1/1, PR-side VERDICT=GREEN via testme
poller, zero leakage (no stacks/volumes, active-run registry empty). The 1.7.0+v2.7.5 upgrade +
no-DROP pg backup/restore is fully CI-verified; merge decision is the operator's. Run history: 6 RED
pre-handover → 238 RED (backup 409, convergence gap) → 241 hung (paused-update flag) → 245 GREEN.
**cc-ci PR #9 merged** (157d06d, push build 246 green): assistant3's one-flag test fix (psql -q in
plausible _register_site — command tags polluted the compared output; assertion unchanged, reviewed:
no gate weakening). Caught an UNSUBMITTED "PR #9 is merged, go ahead" sitting in assistant3's tmux
prompt BEFORE it was true — verified state=open first, merged it, synced builder-clone, then submitted
the go-ahead. Plausible !testme on PR #3 (head 270c840, incl. their 502-after-restore + --if-exists
fixes) is now assistant3's trigger; cc-ci main has every prerequisite landed. Open: plausible verdict
(#11), final session wrap.

## Session 2026-06-09/10 — Orchestrator: concurrent-CI fixed, immich + plausible BOTH GREEN
**Mission complete.** Operator's brief: "work on these fixes to concurrent runs, then drive immich
and plausible to green." Final state:
- **immich PR #2 (head a92b28d): GREEN** — build 245, all tiers, level=4 (1.7.0+v2.7.5, vectorchord
db pin, no-DROP pg backup/restore).
- **plausible PR #3 (head 270c840): GREEN** — build 247 (assistant3's lane; their gzip dump-path,
502-after-restore and --if-exists fixes + my merged pin/test-fix/harness prerequisites), all
tiers, level=4.
- Both verified zero-leakage (no stacks/volumes, /run/cc-ci-active empty). Merge decisions left to
the operator per the standing never-merge-recipe-PRs rule.
- **cc-ci main is healthy:** lint gate green since 9a77725; concurrent runs safe (c0df77d flock +
registry; 68ef0f8 + e6d55b5 convergence); plausible pin c828f6c; PR #9 psql -q merge 157d06d.
Builds 234–247: every push build green; parallel custom runs exercised the locking live.
- Memories added: swarm-updatestatus-convergence-gotchas (+ earlier session set). Open items for a
future session: drone cancel still leaks the python child (kill by hand; maybe trap/pgroup fix in
the runner step); recipe-mirrors org still private (PR-STATUS column dark — operator flip);
operator to review/merge the two green recipe PRs.

## Session 2026-06-10 — Orchestrator: concurrency restructure DONE (phase conc)

Operator approved the simplification plan (concurrency-restructure-full-plan.md); ran it through
the Builder (fable) / Adversary (opus, via new ADV_MODEL launcher support e0c9f23) loops + watchdog.
Phase conc ## DONE 08:56 UTC, M1+M2 both Adversary-PASS, no open veto. cc-ci main now: per-app-domain
kernel flock (registry/pidfiles/recipe-flock DELETED), flock-probe janitor, per-run ABRA_DIR
(servers/ symlinked), PDEATHSIG+setsid+60-min-deadline lifetime chain, single capacity knob,
tests/concurrency suite (21 real-kernel cases, outside the unit gate), docs/concurrency.md rewritten.
Live verification earned two real catches: wrapper exit-code poisoning under set -e (e1c4198,
false-RED, adversary 4-path matrix proved no false-GREEN) and CONC-A1 (domain-keyed deploy-count
file in shared /tmp raced outside the lock — pre-existing, masked by the old recipe flock; fixed
per-run + mutation-proven test; VETO lifted after 290/291 both green). Also fixed this session:
orchestrator identity — watchdog was supervising the stale June-1 session; renamed mine to
cc-ci-orchestrator-vm, repointed .orchestrator-session-id (old → .bak). Loops survived a
limit-stall window 07:51–08:03 via watchdog kill/reboot/nudge — resilience layer worked as designed.
Open for operator: review/merge immich PR#2 + plausible PR#3 (still green, unmerged); stale
session cc-ci-orchestrator-stale can be killed; recipe-mirrors org still private.

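A per-domain kernel flock with a flock-probe janitor might look something like the sketch below. It assumes Linux `flock(2)` semantics (the lock belongs to the open file description and is released by the kernel when the holder exits); the function names and the lock path used in any caller are hypothetical, not the names used in the cc-ci harness.

```python
import fcntl
import os
from typing import Optional


def try_lock(path: str) -> Optional[int]:
    """Take an exclusive, non-blocking flock on path.

    Returns the open fd on success (keep it open for the whole run) or
    None when another open file description already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def is_run_active(path: str) -> bool:
    """Janitor probe: a run is active exactly when its lock cannot be taken."""
    fd = try_lock(path)
    if fd is None:
        return True
    os.close(fd)  # closing the fd releases the probe's own lock
    return False
```

Because the kernel drops the lock when the holding process dies, the probe needs no pidfile, registry or age heuristic: a stale lock file simply probes as inactive.
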
## 2026-06-11 ~01:15 — phase `shot` queued; limit-system night watch
- Operator requested a follow-on phase: audit + repair the per-recipe CI screenshot
(badge/card) across ALL enrolled recipes. Plan written:
cc-ci-plan/plan-phase-shot-screenshots.md, queued AFTER rcust in .phases-spec
(rcust;shot) — watchdog auto-advances on rcust `## DONE`.
- Pre-audit evidence (last ~120 runs): plausible screenshot=null on every run;
immich/lasuite-meet/cryptpad (+flaky n8n) produce byte-identical ~4.8KB PNGs =
suspected blank SPA frames; ghost/mattermost/discourse/etc healthy.
- Hourly wakes tonight: TEMPORARY line added to ai-progress-monitor-prompt.txt — verify
the new limit-wait system (d6e1a70/2e1ab8d) on each wake; remove the line 06-11 daytime.
- Orchestrator renamed cc-ci-orchestrator (was -vm); stale Jun01 squatter killed;
watchdog bounced twice tonight (limit patch, then hourly-wake-during-limit fallback).

## 2026-06-11 ~01:35 — phase `lvl5` queued after `shot`
- Operator: extend the level ladder — L5 = `abra recipe lint` passes on the tested ref
(PR head), after the existing four rungs. Plan: cc-ci-plan/plan-phase-lvl5-lint-rung.md.
- Key design hazards captured: abra.py:109-114 (pinned deploy lints + FATAs R014 from the
CI mirror-origin repoint; chaos/PR path skips lint today) — rung must lint recipe
content, not mirror plumbing; verdict-neutral; conservative capping; old artifacts render.
- .phases-spec now rcust;shot;lvl5 (idx=1, shot active); watchdog bounce to load it.

## 2026-06-11 ~01:50 — lvl5 plan amended: de-capping folded in (operator decision)
- Operator: remove the "capping" notion entirely. Explicit Q&A settled semantics:
level = highest PASSED rung where everything below is pass-or-N/A — N/A rungs are
skipped (no longer stop the climb), a real FAIL still blocks. cap/cap_reason/capped
deleted from code+schema+card+dashboard+docs; rung table is the sole detail carrier.
- Deliberate override of Phase-3 "N/A caps" stance — to be recorded in DECISIONS.md by
the loops. Before/after level table for all recipes required so the Adversary can
attribute every level shift to the rule change.

## 2026-06-11 ~02:00 — lvl5 refinement: intentional vs unintentional N/A
- Operator: an N/A rung only skips if it's an INTENTIONAL skip (declared/structural:
not backup-capable, no upgrade target). UNINTENTIONAL N/A (infra error, missing tool,
aborted tier = unverified) blocks — the level cannot be above an unverified rung.
Statuses now {pass, fail, skip, unver}; unclassifiable N/A defaults to unver.

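The refined level rule can be expressed as a short walk up the rung list. This is a sketch of the semantics as recorded above, with a hypothetical function name and a plain list of status strings standing in for the real rung table.

```python
from typing import Sequence


def compute_level(statuses: Sequence[str]) -> int:
    """Return the highest passed rung with only pass/skip rungs below it.

    statuses[i] is the status of rung i+1: 'pass', 'fail', 'skip'
    (intentional N/A) or 'unver' (unintentional N/A). Any other value is
    treated like 'unver', so it blocks the climb.
    """
    level = 0
    for rung, status in enumerate(statuses, start=1):
        if status == "pass":
            level = rung
        elif status == "skip":
            continue  # intentional skip: neither counts nor blocks
        else:
            break     # fail, unver or unclassifiable: stop climbing
    return level
```

For example, a recipe that is not backup-capable (rungs 3 and 4 as `skip`) can still reach level 5 if lint passes, while an aborted tier (`unver`) holds the level at the last pass below it.
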
## 2026-06-11 ~04:50 — night-watch findings: limit system held core invariant; 2 bugs fixed
- ~01:49-01:51 all three sessions hit a MONTHLY SPEND limit ("You've hit your monthly
spend limit. /usage-credits to adjust") — no reset time exists, so "unparsable → flat
5-min probe" was CORRECT behavior. Zero kill+reboots during the limit window (the old
system's churn bug is confirmed gone — last stall reboots were 23:26-23:50, old code).
- Bug 1: probe text contained "usage limit" → matched LIMIT_RE → self-sustaining window.
Reworded to "quota window" (must never match LIMIT_RE).
- Bug 2: probe dedupe checked the whole 40-line pane → once the submitted probe scrolled
into the conversation, all further probes were suppressed (builder/adv stuck at
nudges=1; orchestrator probes degraded to hourly, riding the wake's scroll). Dedupe now
checks only the bottom 8 lines (the input area).
- shot phase: M1 PASSed (ae10b55, 19/19 matrix) + builder landed the harness capture fix
(ce50f64) BEFORE the limit hit. Loops resume via watchdog probe after this bounce.
- lvl5 plan: operator addition — top badge (card corner/dashboard pill/SVG) shows ONLY
the level, no capping info; inline rung table keeps intentional-skip detail.

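Both probe bugs reduce to two small checks, sketched below. The actual LIMIT_RE pattern and probe wording are not recorded in this journal, so the regex and `PROBE_TEXT` here are stand-ins chosen only to reproduce the two failure modes; the 8-line input area comes from the entry above.

```python
import re
from typing import Sequence

# Hypothetical stand-in for the watchdog's limit detector.
LIMIT_RE = re.compile(r"usage limit|spend limit", re.IGNORECASE)

# Hypothetical probe wording; the invariant is that it never matches LIMIT_RE.
PROBE_TEXT = "Has the quota window reopened? Reply OK to continue."

INPUT_AREA_LINES = 8  # only the bottom of the pane counts as the input area


def probe_is_safe(text: str) -> bool:
    """Bug 1 guard: a probe must not re-trigger the limit detector."""
    return LIMIT_RE.search(text) is None


def probe_already_pending(pane_lines: Sequence[str], probe: str = PROBE_TEXT) -> bool:
    """Bug 2 guard: dedupe against the input area, not the whole pane."""
    return any(probe in line for line in pane_lines[-INPUT_AREA_LINES:])
```

A probe that has scrolled up into the conversation no longer suppresses new probes, because only the last eight lines are inspected.
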
## 2026-06-11 ~11:35 — phase `bsky` queued after `lvl5`
- Operator: fix whatever is wrong with the bluesky-pds recipe, then its screenshot.
Plan: cc-ci-plan/plan-phase-bsky-fix.md. Known: upstream image breakage under the
pinned tag (Cannot find module /app/index.js, Node v24), proven harness/ref-neutral
in rcust M2; DEFERRED carries the re-pin follow-up. Deliverable = green recipe-mirror
PR (operator merges) + verified screenshot on the PR runs; DEFERRED entries closed.
- .phases-spec now rcust;shot;lvl5;bsky (idx still lvl5); watchdog bounce to load.

## 2026-06-11 ~12:05 — four more phases queued + DEFERRED housekeeping (operator)
- Queue now: bsky (in progress, idx 3) → dstamp (discourse abra-stamp drift dig) →
mailu (backupbot labels recipe PR) → kuma (uptime-kuma create-a-monitor test) →
drone (gitea-dep enrollment; P0 host /etc/timezone deploy is MINE — nixos-rebuild
switch on cc-ci host with committed 3bde76f, do it before/when phase drone starts or
when STATUS-drone flags BLOCKED).
- DEFERRED.md housekept (cc-ci 823023a): closed plausible-enrollment, discourse-bitnami,
immich-pgdump (PR#2 merge-pending), plausible-Q4.7b (PR#3 merge-pending); re-entered
mailu/drone/kuma/dstamp/bsky with plan pointers. Operator confirmed immich PR#2 +
plausible PR#3 are good — merges pending on operator.
- ORCHESTRATOR ACTION ITEM: deploy /etc/timezone host fix for phase drone P0.

## 2026-06-11 ~18:20 — FALSE phase completion: mailu skipped, fixed + re-queued
- The mailu builder (sonnet) scaffolded STATUS-mailu.md with a PLACEHOLDER "## DONE"
section (line 47, body "Not yet. Written here only when all DoD items have Adversary
PASS"). phase_done()'s startswith("## DONE") matched it → watchdog auto-advanced
mailu→kuma at 18:01 with ZERO mailu work done (no recipe PR, no claim, no review).
- Fix (cc-ci-orch 4275adc): phase_done now ignores a "## DONE" whose first non-empty body
line is a placeholder/negation (Not yet / pending / TBD / when all / <...>). Verified
against all shipped STATUS files: real DONEs still detected, mailu placeholder rejected.
- kuma (running, real work) had NO STATUS file yet → no second misfire; the fix protects
it + drone going forward.
- mailu RE-QUEUED after kuma: spec now rcust;shot;lvl5;bsky;dstamp;mailu(skipped);kuma;
mailu(redo);drone — 9 phases, idx6=kuma. Watchdog bounced to load fix+spec. Builder
correctly sonnet at kuma (per-phase override still working).

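A placeholder-aware `phase_done` could be sketched as below. This is an illustration of the rule described in the fix, not the code in cc-ci-orch 4275adc: the regex is an assumption built from the listed phrases (Not yet / pending / TBD / when all / <...>), and taking the whole STATUS file as a string is a simplification.

```python
import re

# Assumed placeholder/negation openers, taken from the phrases named in the fix.
PLACEHOLDER_RE = re.compile(r"^\s*(not yet|pending|tbd|when all|<[^>]*>)", re.IGNORECASE)


def phase_done(status_text: str) -> bool:
    """True when the STATUS text has a '## DONE' section that is not a placeholder."""
    lines = status_text.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith("## DONE"):
            continue
        # First non-empty line after the heading decides whether it is real.
        body = next((l for l in lines[i + 1:] if l.strip()), "")
        if PLACEHOLDER_RE.match(body):
            continue  # scaffolded section, keep looking for a real one
        return True
    return False
```

The scaffolded mailu section is rejected, while a heading such as `## DONE 08:56 UTC` followed by real notes is still detected.
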
## 2026-06-11 ~20:50 — weekly upgrade: skip tonight, queue after phases (operator)
- Operator: cancel tonight's weekly /upgrade-all cron, run it once after the current
phase queue (…mailu→drone) finishes instead.
- cc-ci-upgrade-all.timer STOPPED (can't `disable` — /etc/systemd read-only). Stamp
forwarded to 06-12 03:00 so a reboot/rebuild tonight schedules NEXT run 06-19, NOT a
catch-up of tonight's 02:00 slot. GOTCHA: `systemctl start` on this timer fires the
service IMMEDIATELY (Persistent=true) — it did, launched an unwanted upgrader run at
20:47 which I killed. DO NOT `start` the timer to re-arm; let a host reboot/nixos-rebuild
reactivate it (the drone P0 rebuild will), forward-stamp prevents catch-up.
- Post-phase run wired: watchdog hook (launch.py 3fa3178) launches launch-upgrader.py
start when the LAST phase hits ## DONE, gated by flag
/srv/cc-ci/.cc-ci-logs/.run-upgrade-on-complete (set now, consumed once). Upgrader inherits
sonnet. So when `drone` completes, /upgrade-all auto-starts.

## 2026-06-11 ~20:58 — coordination files → machine-docs/ + memory committed (operator)
- Operator: recent phases wrote STATUS/BACKLOG/REVIEW/JOURNAL to the cc-ci repo ROOT.
Root cause: build_kickoff + plan.md tree used bare filenames (older phases +
INBOX/DECISIONS/DEFERRED already used machine-docs/). Fixed everywhere: build_kickoff emits
machine-docs/ paths + explicit FILE-LOCATION RULE; prompts/builder+adversary, plan.md
(tree+seed), loops AGENTS.md, orchestrator AGENTS.md all updated (cc-ci-orch e144354).
- Moved 32 root files → machine-docs/ in cc-ci repo (85a7813, all git-detected renames,
no content change). Both clones synced; loops restarted with new kickoff (verified
kickoff → machine-docs/STATUS-mailu.md); watchdog bounced. resolve_state/INBOX already
read machine-docs/ first so phase_done unaffected.
- Memory notes committed+pushed (cc-ci-orch c33b21f) per AGENTS.md 'memory lives in repo'.

## 2026-06-11 ~22:05 — phase `cfold` queued after drone (+ recipe CI sweep)
- Operator: collapse custom-test folders functional/ + playwright/ → one custom/ folder
(the split is purely organizational — verified: discovery.py globs both with no
branching, same tier/rung/fixtures/failure semantics). Plan:
plan-phase-cfold-custom-folder.md. M2 = full !testme recipe sweep proving no recipe's
custom tests silently dropped + levels unchanged (the operator-required sweep).
- .phases-spec now …drone;cfold (10 phases). cfold is the new LAST phase, so the
.run-upgrade-on-complete hook fires /upgrade-all AFTER cfold — correct order (folder
change swept-green before the weekly upgrade runs). Watchdog bounced to load it.

## 2026-06-11 ~22:55 — drone DONE → upgrade fired; cfold PAUSED to serialize
- drone completed 22:31 → watchdog hit sequence-complete, fired the queued /upgrade-all
(cc-ci-upgrader, weekly run) per the operator's earlier request. Upgrade running now.
- I'd queued cfold ~22:52; the bounced watchdog auto-advanced into cfold, making it
CONCURRENT with the upgrade. They conflict (both real-CI; cfold edits the harness the
upgrade's !testme uses; upgrade version-bumps confound cfold's baseline). PAUSED cfold:
stopped its loops + the watchdog; phase-idx preserved at 9. Upgrade left running.
- RESUME cfold (restart watchdog → phase-idx 9) once /upgrade-all is confirmed DONE. See
memory [[cfold-paused-pending-upgrade]]. Will action on supervision wakes.

## 2026-06-12 ~00:30 — unstuck the weekly upgrade (wedged on discourse)
- /upgrade-all froze ~2h on discourse: its iteration-2 !testme chaos deploy disc-50cc8a
had app+sidekiq stuck 0/1 in Swarm 'New' state 24min (db/redis up, box 5.8Gi free) —
transient scheduler wedge, NOT a recipe defect (discourse L5 @build #450 ~5h prior).
drone build waited on it; testme-on-pr.sh blocked polling; agent frozen.
- Fix: docker stack rm disc-50cc8a (freed box+build); Esc-interrupted the upgrader; nudged
it with the diagnosis → "one clean discourse retry then move on regardless; comment+skip
if it re-wedges". Agent recovered, now checking build state before retry. Rest of queue
(ghost/immich/keycloak/lasuite-*/mailu/matrix-synapse) still ahead. cfold still paused.

## 2026-06-12 ~03:30 — ROOT CAUSE: proxy overlay VIP exhaustion (not "tired box")
- Empirically verified from dockerd logs: the shared `proxy` overlay (10.0.1.0/24 = 254 VIPs,
joined by every recipe deploy) exhausted its IP pool. Endpoint-GC race on concurrent stack rm
(`key modified`/`network proxy remove failed`, 45×) leaked IPs over 11 days of dockerd uptime →
13× `could not find an available IP while allocating VIP` from 22:53 → tasks stuck in Swarm `New`
→ discourse + ghost deploys wedged (looked like recipe failures; were infra). 02:50 docker
restart rebuilt the allocator → cleared.
- FIXES: (a) upgrade-all Step 0 now prunes leaked overlays + restarts docker if VIP-failures are in
the journal (per-run safety net, committed). (b) DURABLE: enlarge proxy to /16 in swarm.nix —
runbook plan-proxy-vip-exhaustion-fix.md + memory [[proxy-vip-exhaustion-runbook]], orchestrator
to execute in a maintenance window AFTER the current upgrade (recreating proxy disrupts routing).
(c) ghost PR debug: plan-ghostpr-debug-fix.md + memory [[ghost-pr-debug]].
- NOT switching the upgrade to sequential (operator: concurrency is fine; the leak is the issue).
Duplicate ghost subagent from the interrupt churn — told the upgrader to TaskStop one.
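
The pool arithmetic and the Step 0 trigger condition can be checked with a few lines of Python. This is a sketch under stated assumptions: the `10.0.0.0/16` range is a hypothetical example (the journal only says "/16"), the helper names are invented here, and the log lines are expected to come from the dockerd journal by whatever means Step 0 already uses.

```python
import ipaddress
from typing import Iterable

# Message quoted from the dockerd logs in the entry above.
VIP_FAILURE = "could not find an available IP while allocating VIP"


def usable_addresses(cidr: str) -> int:
    """Addresses in the subnet minus network and broadcast, as counted above."""
    return ipaddress.ip_network(cidr).num_addresses - 2


def count_vip_failures(log_lines: Iterable[str]) -> int:
    """Number of log lines that report a failed VIP allocation."""
    return sum(1 for line in log_lines if VIP_FAILURE in line)


def needs_recovery(log_lines: Iterable[str]) -> bool:
    """Step 0 style gate: act only when at least one failure was logged."""
    return count_vip_failures(log_lines) > 0
```

`usable_addresses("10.0.1.0/24")` gives the 254 figure quoted in the entry, and a /16 raises that to 65534, which is why the leak rate seen over 11 days would take far longer to exhaust the larger pool.
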