After a live incident: plausible build 220 (ClickHouse exit-1 crash-loop) held the
single serial runner for its full 1200s DEPLOY_TIMEOUT, starving immich PR-2's
queued builds for ~12 min until the run was manually torn down. Logs the two
fixes needed (fail fast on a crash-loop; remove head-of-line blocking on the
serial runner) plus the interim mitigations (step-2b dev loop for debugging;
SIGINT to free a wedged run).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
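The fail-fast fix logged above could look roughly like the sketch below. It is
an illustration only, not the runner's actual code: wait_for_deploy, the two
probe callbacks, max_restarts, and poll_s are hypothetical names. Only the
1200s timeout comes from the incident note. The idea is to stop waiting as soon
as the restart count shows a crash-loop, instead of holding the runner until
the timeout expires.

```python
import time
from typing import Callable


def wait_for_deploy(
    get_restart_count: Callable[[], int],  # hypothetical probe: restarts seen so far
    is_healthy: Callable[[], bool],        # hypothetical probe: readiness check
    timeout_s: float = 1200.0,             # DEPLOY_TIMEOUT from the incident note
    max_restarts: int = 3,                 # assumed threshold, not from the source
    poll_s: float = 5.0,                   # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "healthy", "crash-loop", or "timeout"."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_healthy():
            return "healthy"
        # Fail fast: a service that keeps restarting will not recover by
        # waiting, so release the runner now rather than at the deadline.
        if get_restart_count() >= max_restarts:
            return "crash-loop"
        sleep(poll_s)
    return "timeout"
```

The clock and sleep parameters are injected so the loop can be exercised
without real waiting.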
Absolute, mode-gated rule reinforced in /recipe-upgrade (Guardrails + the new
step-2b direct-deploy loop where the upgrader has cc-ci host access) and noted as
the interim safeguard in IDEAS.md until the deploy loop moves to isolated infra.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The step-2b direct deploy-and-inspect runs on the cc-ci server's own swarm today, so
the upgrader holds write access to the host that owns the tests + CI verdict: a
trust hole (the upgrader could tamper with the tests). Parked idea: a dedicated
throwaway test server with scoped creds, so the upgrader can deploy+inspect but
not modify the gate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Operator (2026-05-30): the real deploy-speed bottleneck was hardware (cc-ci VM
was 2 vCPU on a 4-core host + disk-I/O-bound; RAM fine), now fixed directly
(bumped to 4 vCPU, made cc-nix-test the only running VM on b1). The 2b software
micro-optimizations are judged unlikely to help, so:
- IDEAS.md: parked the whole empirical-perf program (instrumentation, baseline,
attribution) + the optimization menu (image cache/prepull, readiness tuning,
warm-SSO start/stop, runner caching, concurrency sizing, resources, secret
overhead) under "Phase-2b empirical performance work", revisit only if
measurement later proves a specific software bottleneck.
- plan-phase2b: reduced to ONE goal — confirm (and fix if needed) that the
per-recipe test sequence already uses the minimum deploys (1 base shared by
install+functional+backup/restore, +1 for the upgrade tier, +1 per dep),
enforced by the existing DG4.1 deploy-count check, WITHOUT weakening any test.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
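The minimum-deploy arithmetic in the plan-phase2b entry above (1 base, +1 for
the upgrade tier, +1 per dep) can be written out as a small sketch. This is not
the DG4.1 check itself, whose implementation is not shown here: min_deploys and
within_deploy_budget are hypothetical names, and the has_upgrade_tier flag is
an assumption for recipes that have no upgrade tier.

```python
def min_deploys(num_deps: int, has_upgrade_tier: bool = True) -> int:
    """Minimum deploys for one recipe's test sequence.

    1 base deploy (shared by install + functional + backup/restore),
    +1 for the upgrade tier, +1 per dependency.
    """
    if num_deps < 0:
        raise ValueError("num_deps must be non-negative")
    return 1 + (1 if has_upgrade_tier else 0) + num_deps


def within_deploy_budget(observed: int, num_deps: int,
                         has_upgrade_tier: bool = True) -> bool:
    """True when a run used no more deploys than the minimum allows."""
    return observed <= min_deploys(num_deps, has_upgrade_tier)
```

For example, a recipe with two deps and an upgrade tier gets a budget of
1 + 1 + 2 = 4 deploys, so a run that deployed five times would fail the check.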
Operator (2026-05-29): on one host Docker's local image store already IS the cache; the churn was
over-pruning, not a missing cache. So 2pc = conservative prune policy + confirm local-store retention
+ daemon auth (PC1-3). Registry pull-through cache deferred to IDEAS with a concrete revisit
condition (multi-node, or measured cold-deploy bottleneck on recreate-surviving storage).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
First item: later, for environments where the CI server has repo-admin, consider an
opt-in (off-by-default) feature to auto-register + idempotently reconcile the issue_comment
webhook — preserving the read-only/polling default. Parked, out of current scope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>