230 Commits

Author SHA1 Message Date
4f31abc0c7 upstream(mailu): update Redis standing note — operator approved 8.8.x jump 2026-06-12 03:48:52 +00:00
d3a9455eb3 upstream(lasuite-meet): document livekit v1.13.1 TURN-auth note + redis 8.8.0 2026-06-12 03:43:29 +00:00
ca02a0dd6f upgrade-all: proxy VIP-exhaustion guard in Step 0; runbooks for proxy /16 enlarge + ghost PR debug
Root-caused (empirically, dockerd logs) the discourse/ghost deploy wedges:
the shared proxy overlay (/24=254 VIPs) exhausts as concurrent stack rm leaks
endpoints over many days -> tasks stuck in Swarm 'New'. Add a per-run safety
net to Step 0 (network prune + docker restart when VIP-allocation failures are
logged). Plans + memory for the durable fix (enlarge proxy to /16 in swarm.nix,
maintenance window) and for debugging/fixing the ghost PR afterward.
2026-06-12 03:30:00 +00:00
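
The address arithmetic behind the /24 versus /16 sizing can be checked with a short sketch. The `10.0.0.0` prefixes are placeholders (the real proxy subnet is not given in the log), and the count is an upper bound, since it ignores any addresses the overlay reserves for itself.

```python
import ipaddress


def usable_addresses(cidr: str) -> int:
    """Count host addresses in an IPv4 subnet, excluding network and broadcast."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2


# Placeholder prefixes: only the prefix length matters for the comparison.
small = usable_addresses("10.0.0.0/24")  # current proxy overlay size
large = usable_addresses("10.0.0.0/16")  # proposed enlarged size
print(small, large)
```

Under these assumptions a /16 offers 65534 addresses against 254 for a /24, which is why leaked endpoints accumulating over many days would stop exhausting the pool.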
7ce898e0e4 upstream(immich): document concurrent app+db restart update_config fix 2026-06-12 03:26:37 +00:00
28b9431035 upstream(immich): note pgvectors0.3.0 bump in PR #2 + new digest (2026-06-12) 2026-06-12 02:50:45 +00:00
2c5e08f78c upgrade-all: simplify to a rolling pool, alphabetical (drop waves + heavy/light)
Per operator: just work through recipes alphabetically keeping CAP (=
DRONE_RUNNER_CAPACITY=2) subagents running at once, starting the next the moment
one finishes (rolling pool via run_in_background). Removes the wave-barrier and
the heavy/light classification entirely — simpler and no slot ever idles.
2026-06-12 01:58:22 +00:00
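
A rolling pool of this kind can be sketched with a bounded thread pool. This is an illustration only: the actual skill drives subagents via `run_in_background`, and `run_rolling_pool` and `run_one` are hypothetical names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_rolling_pool(recipes, run_one, cap=2):
    """Run run_one(recipe) for every recipe, alphabetically, at most `cap` at once.

    The executor's queue is FIFO, so a freed slot immediately takes the next
    recipe in alphabetical order; there is no wave barrier.
    Returns {recipe: result} in alphabetical order.
    """
    ordered = sorted(recipes)
    with ThreadPoolExecutor(max_workers=cap) as pool:
        futures = {name: pool.submit(run_one, name) for name in ordered}
        # result() blocks until that recipe is finished (and re-raises its error).
        return {name: fut.result() for name, fut in futures.items()}
```

With `cap=2` this mirrors `DRONE_RUNNER_CAPACITY=2`: two recipes are always in flight while work remains, and the third starts the moment either of the first two finishes.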
894d829313 upgrade-all: at the tail, fill slots with two heavies rather than serialize
Per operator: always fill all CAP slots. Heavy/light alternation only spreads
heavies across waves while a light is available; once only heavies remain, run
two-per-wave (capacity is the tuned ceiling) instead of one-per-wave.
2026-06-12 01:55:29 +00:00
f744c79e2d upgrade-all: alternate heavy/light per wave (not heaviest-first)
Host memory is the binding limit, so never schedule two HEAVY recipes in the
same capacity wave — pair each heavy (discourse/immich/matrix-synapse/
lasuite-drive/mattermost-lts/ghost) with a light one to bound peak memory while
keeping both slots busy. Heaviest-first could co-schedule two heavies and OOM/
wedge the box (the disc-50cc8a 'New'-state wedge). For CAP>2 cap heavies at
~CAP/2; if only heavies remain, run one-per-wave.
2026-06-12 01:47:22 +00:00
a45517b432 upgrade-all: default to concurrency-bounded (DRONE_RUNNER_CAPACITY) subagents
Now that the 2026-06-10 concurrency restructure makes concurrent recipe runs
safe (per-run trees, app-domain locks, isolation), default /upgrade-all to run
up to DRONE_RUNNER_CAPACITY (the drone runner's slots, currently 2) recipe
subagents at a time instead of strictly sequential — using all available
concurrency without oversubscribing. Query the live capacity from
'systemctl show drone-runner-exec' (fallback 2); process recipes in waves of
CAP (emit CAP Agent calls per message, await, next wave). Flags: --capacity N,
--sequential (CAP=1, old default — use when the build loops share the box),
--parallel (unbounded). Applies to the NEXT run; the in-flight run is unaffected.
2026-06-12 01:39:44 +00:00
1eb720e95a journal: unstuck weekly upgrade wedged on discourse Swarm scheduling hiccup 2026-06-12 00:31:29 +00:00
a1cceef3d4 ops: pause cfold until /upgrade-all finishes (serialize — they conflict on CI); journal+memory 2026-06-11 22:56:27 +00:00
af2b2e8156 plan: phase 'cfold' — collapse functional/+playwright/ into custom/ + full !testme recipe sweep (queued after drone)
The functional/playwright split is purely organizational (discovery globs both
with no branching; same custom tier -> L4 rung, same fixtures, same failure
semantics). Migrate all custom tests to one custom/ folder; M1 proves coverage
identical before/after (no silent drops), M2 is a full real-CI !testme sweep
across all recipes confirming levels unchanged. cfold becomes the last phase so
the queued /upgrade-all fires after it (folder change verified before upgrade).
2026-06-11 22:52:45 +00:00
79134a94e8 memory: drop drone P0 host-deploy note — /etc/timezone present on cc-ci, prerequisite satisfied (drone phase deploying gitea fine) 2026-06-11 21:55:16 +00:00
34fc68d4b8 journal: coordination files moved to machine-docs/; memory committed 2026-06-11 20:57:57 +00:00
c33b21fe8d memory: commit session notes (drone P0, weekly-upgrade-queued, mailu/index updates)
Per AGENTS.md 'Agent memory lives in memory/ (in this repo)' — memory notes
must be committed + pushed like any repo change, not left only in the local
~/.claude symlink target.
2026-06-11 20:56:24 +00:00
e144354668 loops: mandate machine-docs/ for ALL coordination files (kickoff/prompts/plan/AGENTS)
Recent phases wrote STATUS/BACKLOG/REVIEW/JOURNAL to the repo ROOT because
build_kickoff + plan.md's tree used bare filenames, even though the loops'
AGENTS.md + INBOX/DECISIONS/DEFERRED conventions already said machine-docs/.
Make machine-docs/ the single mandated home everywhere: build_kickoff now
emits machine-docs/ paths + an explicit FILE-LOCATION RULE; both loop prompts
and plan.md (tree + seed step) updated; orchestrator AGENTS.md documents +
enforces it. resolve_state/INBOX handoff already read machine-docs/ first.
2026-06-11 20:56:24 +00:00
23b5fc4753 journal: weekly upgrade skipped tonight, queued after phase queue via watchdog hook 2026-06-11 20:50:25 +00:00
3fa3178546 watchdog: one-shot /upgrade-all trigger on phase-sequence completion
When LOG_DIR/.run-upgrade-on-complete exists, the watchdog launches
launch-upgrader.py start the moment the last phase reaches ## DONE (then
consumes the flag). Lets the operator replace a scheduled weekly cron run with
'run as soon as the current phase queue finishes' — used tonight: the
cc-ci-upgrade-all.timer was stopped (stamp forwarded past tonight's slot) and
this flag set instead.
2026-06-11 20:49:54 +00:00
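
The one-shot flag behaviour can be sketched as follows. Only the flag filename comes from the commit message; `consume_upgrade_flag` is a hypothetical helper, and this variant removes the flag before the caller launches the upgrader (the commit describes launching first, then consuming), which is a deliberate simplification so that a crash cannot fire the trigger twice.

```python
from pathlib import Path

FLAG_NAME = ".run-upgrade-on-complete"  # filename taken from the commit message


def consume_upgrade_flag(log_dir) -> bool:
    """Return True exactly once per flag file: remove it and report it was set."""
    flag = Path(log_dir) / FLAG_NAME
    try:
        flag.unlink()
    except FileNotFoundError:
        return False  # no one-shot request pending
    return True
```

A watchdog tick would call this only after the last phase reaches `## DONE`, and start the upgrader when it returns `True`.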
0005ce81af journal: mailu false-completion incident + fix + re-queue 2026-06-11 18:20:54 +00:00
4275adc4a5 watchdog: phase_done ignores placeholder '## DONE' sections (skipped mailu)
A Builder scaffolded 'STATUS-mailu.md' with a '## DONE / Not yet. Written
here only when ...' placeholder section; phase_done's startswith('## DONE')
matched it and auto-advanced past mailu without any of its work being done
(no recipe PR, no claim, no review). Harden phase_done: a '## DONE' heading
counts only when its first non-empty body line is not a placeholder/negation
(Not yet / pending / TBD / when all / <...> etc). Verified against all shipped
STATUS files (real DONEs still detected; mailu placeholder rejected).
2026-06-11 18:20:21 +00:00
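
The hardened check described above might look roughly like this. It is a sketch, not the shipped `phase_done`: `PLACEHOLDER_RE` is a hypothetical pattern built from the examples in the commit message, and treating an empty `## DONE` section as not done is an assumption.

```python
import re

# Hypothetical placeholder/negation patterns, modeled on the commit's examples.
PLACEHOLDER_RE = re.compile(r"^\s*(not yet|pending|tbd|when all|<.*>)", re.IGNORECASE)


def phase_done(status_text: str) -> bool:
    """True only if a '## DONE' heading is followed by a real (non-placeholder) body."""
    lines = status_text.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith("## DONE"):
            continue
        # First non-empty line after the heading, if any.
        body = next((b for b in lines[i + 1:] if b.strip()), None)
        if body is None or body.startswith("#"):
            continue  # assumption: an empty section does not count as done
        if not PLACEHOLDER_RE.match(body):
            return True
    return False
```

The scaffolded mailu file, whose section began with "Not yet. Written here only when ...", would be rejected by this check, while a section opening with a concrete completion note would still be detected.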
211b4e231c launch: per-phase model override (.loop-model[-adv]-<pid>)
Lets a single phase pin a different model, read fresh each role_model call so
a phase transition flips it automatically with no watchdog bounce. Operator
wants builder on opus for the complex dstamp phase, reverting to sonnet from
mailu on: .loop-model-dstamp=opus while base .loop-model stays sonnet.
2026-06-11 16:15:18 +00:00
5c260d225c launch-orchestrator: persisted .orch-model file (ORCH_MODEL > LOOP_MODEL > file)
Operator switching models near weekly limits: loops -> sonnet, orchestrator
-> opus. Dotfiles updated (.loop-model/.loop-model-adv=sonnet,
.orch-model=opus) so watchdog restarts keep the choice.
2026-06-11 16:03:29 +00:00
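
The `ORCH_MODEL > LOOP_MODEL > file` precedence can be sketched as below. The helper name `orch_model` and the assumption that `.orch-model` lives in a directory passed as `log_dir` are illustrative, not taken from launch-orchestrator.py.

```python
import os
from pathlib import Path


def orch_model(log_dir, env=None, default=None):
    """Resolve the orchestrator model: ORCH_MODEL, then LOOP_MODEL, then the dotfile."""
    env = os.environ if env is None else env
    for var in ("ORCH_MODEL", "LOOP_MODEL"):
        value = env.get(var, "").strip()
        if value:
            return value
    path = Path(log_dir) / ".orch-model"  # assumed location of the persisted file
    if path.is_file():
        value = path.read_text().strip()
        if value:
            return value
    return default
```

Because the file is read on every call, a watchdog restart with no environment overrides would pick up the persisted choice again.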
327b9f4efe plan: phases dstamp, mailu, kuma, drone (queued after bsky) + journal
- dstamp: attribute + fix the discourse abra-stamp drift (env change 06-05→
  06-10, harness-neutral, currently pinning discourse at L1); blast-radius
  sweep; HC1 keeps its teeth
- mailu: backupbot v2 labels recipe PR, restore proven on real seeded mail,
  backup rung earned instead of skipped (operator approved re-entry)
- kuma: uptime-kuma first-run wizard + create-a-monitor functional test
  (Socket.IO or Playwright, real probe evidence, flake-checked)
- drone: gitea-dep enrollment, maximal subset per Phase-2 scoping;
  P0 /etc/timezone host deploy is orchestrator-owned (3bde76f committed)
2026-06-11 11:43:03 +00:00
f395247da4 docs(bsky): upstream registry for bluesky-pds — :0.4 is a moving tag now tracking main (0.5.1/node24/index.ts); exact tags through 0.4.219 keep classic index.js layout 2026-06-11 11:35:56 +00:00
c89cd6366b plan: phase 'bsky' — fix bluesky-pds recipe + its screenshot (queued after lvl5)
Root-cause the upstream image breakage (Cannot find module /app/index.js,
Node v24 under the pinned tag — proven harness/ref-neutral in rcust M2),
research upstream releases (persist to cc-ci-plan/upstream/bluesky-pds.md),
fix via recipe-mirror PR (NEVER merge — operator does), prove full lifecycle
green incl. the new L5 lint rung via !testme at PR head, then verify a real
credential-free screenshot on those runs (hook only if needed). Close both
DEFERRED bluesky entries; crisp operator handoff in STATUS-bsky.md.
2026-06-11 11:30:49 +00:00
0900c439d4 wake prompt: remove temporary limit-system night-watch line (condition met)
The 2026-06-11 night watch is over: the limit-wait system was verified
end-to-end on a real monthly-spend-limit window (hit -> hold without reboots
-> flat probes -> prompt resume on lift), and the three bugs it surfaced are
fixed (5ea17fc, 969eb60). Standing supervision continues without the extra
check.
2026-06-11 06:55:08 +00:00
969eb60df1 watchdog: probe-resumed tick returns True — don't evaluate stale pane after resume
The tick whose probe resumed a session was continuing into stall logic with
its pre-resume pane capture; a 4h-old WAITING-UNTIL in that stale data got
the freshly-resumed adversary kill+rebooted (05:52). Treat probe-resume as
handled-this-tick; the next 30s tick sees the live session.
2026-06-11 05:53:44 +00:00
5ea17fca21 watchdog: fix limit-probe self-match + scrollback dedupe wedge; plan(lvl5): badge shows level only
Night-watch findings (monthly-spend-limit window, ~01:49-04:45):
- probe text said 'usage limit' which matches LIMIT_RE, so a submitted probe
  kept limited_now true forever -> reworded to 'quota window' with a CAUTION
  note (nudge text must never match LIMIT_RE)
- dedupe scanned all 40 captured lines, so once a probe scrolled into the
  conversation no further probe ever fired (builder/adv frozen at nudges=1,
  orchestrator probes degraded to hourly riding the wake scroll) -> dedupe
  now only checks the bottom 8 lines (input area)
Core invariant HELD: zero kill+reboots during the limit window.

plan(lvl5): operator addition - the top-corner level badge (card, dashboard
pill, badge SVG) shows only the level number+color, zero capping info; the
inline per-rung table keeps intentional-skip/unverified detail.
2026-06-11 05:52:26 +00:00
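
The two night-watch fixes in the watchdog commit above can be illustrated together. The `LIMIT_RE` pattern and the probe wording here are stand-ins (the real ones are not shown in the log); only the "bottom 8 lines" window and the rule that the nudge text must never match `LIMIT_RE` come from the commit message.

```python
import re

# Stand-in for the watchdog's LIMIT_RE; the real pattern is not shown in the log.
LIMIT_RE = re.compile(r"usage limit", re.IGNORECASE)
# Hypothetical probe wording that avoids the phrase LIMIT_RE looks for.
PROBE_TEXT = "quota window check: continue if the window has reopened"
INPUT_AREA_LINES = 8  # only the bottom of the pane counts as the input area

# Guard: a submitted probe must never itself look like a limit banner.
assert not LIMIT_RE.search(PROBE_TEXT)


def probe_already_pending(pane_lines, probe_text=PROBE_TEXT):
    """True if our own probe is still visible in the input area of the pane."""
    return any(probe_text in line for line in pane_lines[-INPUT_AREA_LINES:])
```

A probe that has scrolled up into the conversation no longer suppresses the next one, because only the last eight captured lines are inspected.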
76aa104dbd plan(lvl5): N/A split — intentional skip climbs, unintentional (unverified) blocks
Operator refinement: only declared/structural skips (not backup-capable, no
previous version) let the climb continue; a rung that should have run but
didn't (infra error, abra missing, tier abort, timeout) blocks the level at
the last verified rung. Every N/A source in derive_rungs gets an explicit
classification (DECISIONS.md, adversary-reviewed); unclassifiable defaults to
unverified. Unit tests + one synthesized tier-abort run prove the rule.
2026-06-11 01:47:26 +00:00
1f7fc7eb39 plan(lvl5): fold in de-capping — level = highest passed rung, N/A skips, fail blocks
Operator decision (explicit Q&A 2026-06-11): remove cap/cap_reason/capped
entirely. New formula: level = max i with rung_i==pass and all j<i in
{pass,na}. N/A no longer stops the climb (the confusing part — e.g.
non-backup-capable recipes were stuck at L2); a real FAIL still blocks.
Per-rung table + verdict carry the completeness story. Added: de-cap
implementation reqs, both-schema rendering, before/after level table for all
recipes, N/A-skip proof run, bad-canary designed-levels re-derivation under
the new formula.
2026-06-11 01:45:54 +00:00
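
The de-capped formula stated above (level = max i with rung_i == pass and all j < i in {pass, na}) can be written as a short sketch. `level_from_rungs` is a hypothetical name, and returning 0 when no rung passes is an assumption.

```python
def level_from_rungs(rungs):
    """Derive the level from rung states ordered L1..Ln.

    Each state is 'pass', 'fail', or 'na'. An 'na' rung is skipped and the
    climb continues; any other non-pass state blocks at the last passed rung.
    """
    level = 0  # assumption: no passed rung means level 0
    for i, state in enumerate(rungs, start=1):
        if state == "pass":
            level = i
        elif state != "na":
            break  # a real failure (or any unknown state) stops the climb
    return level
```

For example, `["pass", "pass", "na", "pass"]` yields 4, whereas a recipe with `["pass", "fail", "pass"]` stays at 1. Any state outside `pass` and `na` blocks the climb, so an "unverified" marker would also stop at the last passed rung.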
0aab78d3a2 plan: phase 'lvl5' — L5 level rung: abra recipe lint passes on the PR (queued after shot)
New top rung after install/upgrade/backup-restore/functional: lint the exact
recipe ref under test; gap-caps per ladder semantics; verdict-neutral and
time-bounded; mirror-origin R014 plumbing must not pollute recipe lint results
(abra.py:109-114); all consumers (results/card/dashboard/badge/docs/tests)
updated; old artifacts still render. M1 = adversary-cold-verified implementation
pre-merge; M2 = real-CI proof incl. a genuine L5, a genuine lint-capped L4, and
2 drone-path runs. Recipe lint failures -> mirror PRs or DEFERRED, never merged.
2026-06-11 01:39:27 +00:00
7c042c2f2a plan: phase 'shot' — recipe screenshot audit & repair (queued after rcust)
Audit every enrolled recipe's CI badge/card screenshot, diagnose defects
(plausible null-every-run; ~4.8KB blank-frame SPAs: immich/lasuite-meet/
cryptpad/flaky n8n), fix via harness default-wait improvement first, per-recipe
SCREENSHOT hooks second; M1 audit matrix + M2 visually-verified PNGs on fresh
real-CI runs (>=2 !testme). Cosmetics-never-block and secret-safety guardrails
binding. Also: temporary hourly-wake instruction to verify the new limit-wait
system tonight; journal entry.
2026-06-11 01:17:32 +00:00
2e1ab8d384 watchdog: hourly orchestrator wake fires even during a limit window
Operator request: the hourly supervision prompt should land regardless of
limit state, as a fallback that keeps things on track if the limit-state
machinery ever breaks. If the limit is genuinely still in force the wake is
harmless (the banner just re-prints and limit_tick re-arms); once it lifts,
the queued wake doubles as a resume nudge.
2026-06-11 01:00:29 +00:00
d6e1a704da watchdog: parse limit-reset time, never reboot limit-stalled sessions; rename orch session
Replace the blind every-300s 'limit appears lifted' nudge (claude) and the
opencode-only _maybe_nudge_limit with one unified limit_tick state machine:

- parse the reset time from the limit banner (last match wins; stale banners
  whose time already passed fall back rather than waiting ~a day)
- arm a quiet window until reset+45s; parse failure -> flat 5-minute probe
  loop (operator-specified; not exponential backoff)
- while armed, suppress ALL healing: a limit-stalled session is NEVER
  kill+rebooted (this was the conc-phase churn: claude limit stalls fell
  through to the generic idle reboot, losing the banner and re-hitting
  the limit fresh)
- at window end send ONE nudge as a self-verifying probe: spinner clears
  the state; a re-printed banner re-arms from the fresh reset time
- dedupe: never stack a probe while our own text is visible in the pane
- state persisted per session in LOG_DIR (.limited-<session>) so watchdog
  restarts keep the window
- orchestrator gets the same treatment: limit_tick in heal_orchestrator,
  a per-signal-tick orch_limit_check, and hourly wakes deferred during
  limit windows
- loud WARNING at 3 probes, then continue flat probes forever

Also rename the orchestrator session default cc-ci-orchestrator-vm ->
cc-ci-orchestrator (launch.py ORCH_SESSION, launch-orchestrator.py SESSION,
docs/scripts references).
2026-06-11 00:55:07 +00:00
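
The arming rule of the limit_tick state machine (parse the reset time, last match wins, stale banners fall back, parse failure gives a flat 5-minute probe) can be sketched as a pure function. The banner wording matched by `RESET_RE` is hypothetical, since the log does not show the real banner format, and `quiet_until` is an illustrative name.

```python
import re
from datetime import datetime, timedelta

# Hypothetical banner format such as "... resets at 04:45".
RESET_RE = re.compile(r"resets? (?:at )?(\d{1,2}):(\d{2})")
GRACE = timedelta(seconds=45)       # quiet window extends to reset + 45s
FLAT_PROBE = timedelta(minutes=5)   # flat fallback, not exponential backoff


def quiet_until(pane_text: str, now: datetime) -> datetime:
    """Return the time until which healing and nudges stay suppressed."""
    matches = RESET_RE.findall(pane_text)
    if matches:
        hour, minute = map(int, matches[-1])  # last match wins
        if hour < 24 and minute < 60:
            reset = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
            if reset > now:
                return reset + GRACE
            # Stale banner: its time already passed today, so do not wait ~a day.
    return now + FLAT_PROBE
```

At the returned time the watchdog would send a single probe; if the banner is re-printed, calling the function again on the fresh pane re-arms the window from the new reset time.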
aefaf17757 plan: recipe-customization restructure — full builder/adversary plan (P1-P6 + real-CI regression sweep gate) 2026-06-10 16:28:09 +00:00
a6e177e286 journal: phase conc DONE — concurrency restructure landed, M1+M2 adversary-verified 2026-06-10 13:58:25 +00:00
e0c9f23391 feat(launch): ADV_MODEL — per-role model override for the Adversary loop 2026-06-10 04:03:35 +00:00
a1b4943da1 plan: adapt concurrency restructure to builder/adversary loop protocol (gates M1/M2, phase-namespaced state) 2026-06-10 03:54:31 +00:00
520fb18461 plan: full concurrency restructure implementation plan (builder/adversary handoff) 2026-06-10 03:48:14 +00:00
0d169c2a20 plan: concurrency restructure — flock-probe janitor, per-run ABRA_DIR, lock-lifetime chain 2026-06-10 03:41:05 +00:00
335ea1d7c1 journal: session wrap — concurrent CI fixed, immich (245) + plausible (247) both GREEN 2026-06-09 23:18:32 +00:00
08706c665e memory: swarm UpdateStatus convergence gotchas (builds 238/241) 2026-06-09 23:14:18 +00:00
926e4553b7 journal: immich PR #2 GREEN (build 245, level=4); cc-ci PR #9 merged; plausible unblocked 2026-06-09 23:13:56 +00:00
e3e0a9ee80 journal: two harness convergence fixes (UpdateStatus settle + paused-is-settled); immich build 245 in flight 2026-06-09 23:08:59 +00:00
1580738c97 journal: concurrent-CI fixes landed on cc-ci main (build 236 green) 2026-06-09 22:02:08 +00:00
ec3e0c35dd journal: orchestrator handover — concurrent-CI fixes + immich/plausible drive 2026-06-09 19:45:21 +00:00
542ed0afe3 memory: move agent memory into repo (memory/), note in AGENTS.md
Persistent agent memories now live in memory/ in this repo; the Claude
auto-memory path is symlinked here so future memories land in the repo
and get committed like any other change.
2026-06-09 19:25:20 +00:00
330378d30d ideas: fail-fast on crash-looping deploy + don't let one wedged run starve the CI queue
After a live incident: plausible build 220 (ClickHouse exit-1 crash-loop) held the
single serial runner for its full 1200s DEPLOY_TIMEOUT, starving immich PR-2's
queued builds for ~12min until manually torn down. Logs the two fixes (fail-fast
on crash-loop; head-of-line blocking on the serial runner) + the interim
mitigations (step-2b dev loop for debugging; SIGINT to free a wedged run).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 16:29:30 +00:00
a2c1cb550a upstream(immich): release-notes sources + DB-pin & VectorChord backup/restore notes 2026-06-09 15:49:20 +00:00
c60fc6d056 change(cleanup): reap dev deploys at start+end of /upgrade-all instead of a timer
Per operator: drop the hourly cc-ci-reap-dev-deploys systemd timer; instead run the
dev-* reaper at the START (Step 0, alongside the orphan sweep) and END (new step 4b)
of each /upgrade-all run, with THRESHOLD=0 (the run is quiescent then, so clear all
dev-* unconditionally). The reaper keeps its safe default (4h) for ad-hoc use.
Step-2b mandatory teardown is unchanged (primary mechanism); this is the backstop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 15:47:16 +00:00