Operator request: the hourly supervision prompt should land regardless of
limit state, as a fallback that keeps things on track if the limit-state
machinery ever breaks. If the limit is genuinely still in force the wake is
harmless (the banner just re-prints and limit_tick re-arms); once it lifts,
the queued wake doubles as a resume nudge.
Replace the blind every-300s 'limit appears lifted' nudge (claude) and the
opencode-only _maybe_nudge_limit with one unified limit_tick state machine:
- parse the reset time from the limit banner (last match wins; stale banners
  whose time already passed fall back rather than waiting ~a day)
- arm a quiet window until reset+45s; parse failure -> flat 5-minute probe
  loop (operator-specified; not exponential backoff)
- while armed, suppress ALL healing: a limit-stalled session is NEVER
  kill+rebooted (this was the conc-phase churn: claude limit stalls fell
  through to the generic idle reboot, losing the banner and re-hitting
  the limit fresh)
- at window end send ONE nudge as a self-verifying probe: spinner clears
  the state; a re-printed banner re-arms from the fresh reset time
- dedupe: never stack a probe while our own text is visible in the pane
- state persisted per session in LOG_DIR (.limited-<session>) so watchdog
  restarts keep the window
- orchestrator gets the same treatment: limit_tick in heal_orchestrator,
  a per-signal-tick orch_limit_check, and hourly wakes deferred during
  limit windows
- loud WARNING at 3 probes, then continue flat probes forever
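
The arm-or-probe decision in the list above could be sketched roughly as
follows. This is an illustration only, not the actual limit_tick code: the
banner wording ("resets at HH:MM"), the regex, and the function name are
assumptions, and treating a stale banner the same as a parse failure is this
sketch's reading of "fall back".

```python
import re
from datetime import datetime, timedelta

RESET_GRACE = timedelta(seconds=45)    # quiet window ends at reset+45s
PROBE_INTERVAL = timedelta(minutes=5)  # flat probe loop, no backoff

# Hypothetical banner shape; the real limit banner may read differently.
_RESET_RE = re.compile(r"resets at (\d{1,2}):(\d{2})")


def next_probe_time(pane_text: str, now: datetime) -> datetime:
    """Return the time at which the next single probe nudge is due."""
    matches = _RESET_RE.findall(pane_text)
    if matches:
        hour, minute = (int(part) for part in matches[-1])  # last match wins
        if hour < 24 and minute < 60:
            reset = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
            if reset > now:
                return reset + RESET_GRACE
    # Stale banner (time already passed) or parse failure: flat 5-minute probe.
    return now + PROBE_INTERVAL
```

With two banners in the pane, only the later one is used; a banner whose time
has already passed yields a probe five minutes out instead of a wait until the
same clock time tomorrow.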
Also rename the orchestrator session default cc-ci-orchestrator-vm ->
cc-ci-orchestrator (launch.py ORCH_SESSION, launch-orchestrator.py SESSION,
docs/scripts references).
Persistent agent memories now live in memory/ in this repo; the Claude
auto-memory path is symlinked here so future memories land in the repo
and get committed like any other change.
After a live incident: plausible build 220 (ClickHouse exit-1 crash-loop) held the
single serial runner for its full 1200s DEPLOY_TIMEOUT, starving immich PR-2's
queued builds for ~12min until manually torn down. Logs the two fixes (fail-fast
on crash-loop; head-of-line blocking on the serial runner) + the interim
mitigations (step-2b dev loop for debugging; SIGINT to free a wedged run).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per operator: drop the hourly cc-ci-reap-dev-deploys systemd timer; instead run the
dev-* reaper at the START (Step 0, alongside the orphan sweep) and END (new step 4b)
of each /upgrade-all run, with THRESHOLD=0 (the run is quiescent then, so clear all
dev-* unconditionally). The reaper keeps its safe default (4h) for ad-hoc use.
Step-2b mandatory teardown is unchanged (primary mechanism); this is the backstop.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- /recipe-upgrade step 2b: teardown is now MANDATORY on every exit path (finally),
  with a verify-no-leak check; tear down even on failure before reporting.
- reap-dev-deploys.sh: safe, age-gated backstop that removes only idle dev-* stacks
  (never CI per-run stacks, warm-*, infra; an active dev loop stays fresh).
- orchestrator: hourly cc-ci-reap-dev-deploys systemd timer runs it against cc-ci,
  bounding any leaked dev deploy from a crashed/abandoned loop.
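
The age-gated selection rule can be illustrated with a small sketch. The real
reaper is a shell script; the Python below only models which stacks would
qualify, and the input shape (stack name mapped to last-activity time) and the
function name are hypothetical.

```python
from datetime import datetime, timedelta


def reapable(stacks: dict[str, datetime], now: datetime,
             threshold: timedelta) -> list[str]:
    """Names of dev-* stacks idle for at least `threshold`.

    Anything not prefixed dev- (CI per-run stacks, warm-*, infra) is never
    selected, regardless of age.
    """
    return sorted(
        name for name, last_active in stacks.items()
        if name.startswith("dev-") and now - last_active >= threshold
    )
```

A threshold of zero selects every dev-* stack, while a larger threshold leaves
a recently active dev loop alone.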
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Absolute, mode-gated rule reinforced in /recipe-upgrade (Guardrails + the new
step-2b direct-deploy loop where the upgrader has cc-ci host access) and noted as
the interim safeguard in IDEAS.md until the deploy loop moves to isolated infra.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The step-2b direct deploy-and-inspect runs on the cc-ci server's own swarm today, so
the upgrader holds write access to the host that owns the tests + CI verdict. That is
a trust hole: the upgrader could tamper with the tests. Parked idea: a dedicated
throwaway test server with scoped creds, so the upgrader can deploy+inspect but not
modify the gate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The upgrader now deploys the WIP recipe directly on cc-ci (abra app deploy --chaos
under a dev-<recipe> domain on the local swarm) and inspects live logs
(docker service logs) to SEE what the upgrade does, before/alongside the !testme
CI gate. ADDITIONAL to — not a replacement for — the 3-attempt !testme verification;
it front-loads diagnosis so fewer CI attempts are spent on basics. Always torn down
(orphan-sweep is the backstop). /upgrade-all dispatch references the new step 2b.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
abra hard-FATAs on image refs with both a tag and a digest (immich:
postgres:14-vectorchord...@sha256:..., valkey:9@sha256:...), aborting the whole
recipe survey so immich was silently dropped. Per operator: don't normalize the
recipe; catch the failure and check the upstream registry directly.
- /upgrade-all box item 4: a tag+digest parse FATA does NOT mean the image is
  not-fetchable. Use abra for the images it parses; for the rest, list upstream tags
  (Docker Hub / ghcr / buildx imagetools) and judge availability (match the variant
  the app supports, not blindly the max). Upgradeable if abra OR the direct check
  finds a newer tag.
- /recipe-upgrade implement: hand-bump tag+digest pins (abra can't), and re-resolve
  + re-append the digest for the new tag so the pin is preserved (never drop it).
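
For illustration, a tag+digest reference can be split and re-pinned with plain
string handling. This is a sketch, not what the skill runs: the helper names
are hypothetical, and the new digest is assumed to have been resolved
separately against the upstream registry.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    name: str
    tag: Optional[str]
    digest: Optional[str]


def split_image_ref(ref: str) -> ImageRef:
    """Split 'name[:tag][@digest]' into its parts."""
    name_and_tag, _, digest = ref.partition("@")
    # A ':' marks a tag only if it comes after the last '/';
    # otherwise it belongs to a registry host:port.
    last_slash = name_and_tag.rfind("/")
    colon = name_and_tag.rfind(":")
    if colon > last_slash:
        name, tag = name_and_tag[:colon], name_and_tag[colon + 1:]
    else:
        name, tag = name_and_tag, None
    return ImageRef(name, tag, digest or None)


def with_new_tag(ref: str, new_tag: str, new_digest: str) -> str:
    """Bump the tag and re-append the digest resolved for the new tag."""
    parsed = split_image_ref(ref)
    return f"{parsed.name}:{new_tag}@{new_digest}"
```

Keeping the digest as an explicit argument means the pin can only be rewritten
with a freshly resolved value, never silently dropped.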
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Rename the table's Status column -> TESTS (the CI/test verdict, unchanged
content). Add a new STATUS column showing the PR's LIVE state, fetched
client-side: 'open' vs a ✓ for any not-open state (merged or closed). The cell
is a JS hook (data-repo/data-pr) derived from existing recipe+pr fields; an
inline, dependency-free, CSP-safe script GETs the same-origin /pr/<recipe>/<n>
proxy (cc-ci nix/modules/reports.nix) on load and every 30s, and degrades to a
muted '?' if the proxy/repo is unreachable. Blank cell when a row has no PR.
Doc + SKILL updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Documents the end-to-end workflow used to land the intentional-skips/4-rung-ladder
feature: explore harness → branch a local cc-ci clone → implement + unit-verify
cold on cc-ci → live full-stage check → open PR (never push main) → independent
adversary verdict → squash-merge on PASS → deploy via /root/builder-clone rebuild.
Includes the adversary-verify-pr6.md plan as a reusable template.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds sweep-orphans.sh (safe-by-allowlist: removes orphan test stacks, standalone
debug containers >30m old, leaked dangling volumes, and reparented docker-run
wrappers; spares infra + warm-* canonicals and their retained volumes) and wires
it as Step 0 of /upgrade-all so a prior run's leaked stack/container/process can't
contend for the shared Swarm or skew the survey. Idempotent; no-op when clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New page order: short lead -> the full wire table (sorted by priority-to-address,
CVE recipes first, new CVEs count column) -> Addendum (bullets of real special
issues, omitted if clean) -> Security Bulletin -> per-recipe "What changed".
- recipe-report.py: _table() gains a CVEs column + recipe-name linking; new
  _changes() helper; render() reordered; docstring SPEC SHAPE updated
  (cve/addendum/changes added, needs_attention/routine removed).
- recipe-report/SKILL.md + example-spec.json: new procedure, spec shape, and
  gold-standard template (2026-06-05, new format).
- launch-report.py: kickoff text reflects the new priority-ordered structure.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A stale cc-ci-report session (left over from a prior week's run, now idle) made this
week's launch-report.py 'start' (use-or-create) leave it in place, so a fresh report
never ran. Fix: upgrade-all step 6 now calls 'fresh', and 'start' only leaves a session
alone when it is actively busy producing a report; an idle/leftover session is killed +
restarted.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The reconcile that's supposed to make the mirror main == upstream main was fetching origin/main —
but origin is the cc-ci MIRROR, so it synced the mirror to itself (a no-op) and never pulled real
upstream. Fix: fetch coopcloud explicitly (git.coopcloud.tech/coop-cloud/<recipe>, default branch
main OR master) via an 'upstream' remote and force-sync the mirror main + tags from it. Every recipe
has a coopcloud correspondent; none are forked. Also reorder the skill so the reconcile runs BEFORE
the upgrade check, so the check sees the real current recipe. Verified by divergence test (diverged a
mirror, reconcile snapped it back to coopcloud HEAD).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Close the two gaps vs recipe-maintainer's recipe-upgrade-plan:
- Per-recipe release-notes registry at cc-ci-plan/upstream/<recipe>.md (discover the source repo +
  releases/changelog URL for each image once, persist+commit, reuse) — fetch release notes FROM those
  URLs instead of rediscovering ad-hoc each run. Format doc + cryptpad seed included.
- Explicitly read the recipe's README for shipped upgrade/migration notes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_stories() now auto-links whole-word recipe mentions in story titles + bodies to their mirror
repos (same single-pass linkify as the lead); explicit PR/build links are untouched.
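
A single-pass, whole-word linkify of this kind might look like the sketch
below. It is not the code in recipe-report.py: the function signature, the
HTML output form, and the name-to-URL mapping are assumptions, and it expects
plain text without existing links (it makes no attempt to skip them).

```python
import re


def linkify(text: str, recipes: dict[str, str]) -> str:
    """Wrap whole-word recipe names in <a> tags using one regex pass.

    `recipes` maps recipe name -> mirror repo URL (hypothetical shape).
    """
    if not recipes:
        return text
    # Longest names first so a longer recipe name wins over a shorter prefix.
    names = sorted(recipes, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(n) for n in names) + r")(?![\w-])"
    )
    return pattern.sub(
        lambda m: f'<a href="{recipes[m.group(1)]}">{m.group(1)}</a>', text
    )
```

Because all names go into one alternation and the substitution runs once, the
inserted markup is never rescanned, so a recipe name inside a freshly added
URL cannot be linked a second time.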
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>