Operator-directed (2026-05-29): front-load the two EVIDENCE-BASED image wins before grinding the remaining deploy-heavy recipes — Phase 2 pauses, 2pc runs, Phase 2 resumes (seq: …2w 2pc 2 2b 3 4). PC1: conservative prune (no reflexive `prune -af`, never mid-run, keep base images) — kills the documented prune→re-pull→rate-limit churn. PC2: local registry:2 pull-through cache for docker.io, PAT-authenticated, Nix-reconciled, daemon registry-mirror → transparent to abra/swarm; subsequent pulls (across recipes/runs/post-prune) are local → faster deploys + rate-limit gone. Bounded scope: these two only; concurrency/readiness-tuning stay in measurement-driven 2b. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
4.6 KiB
cc-ci Phase 2pc — Image pull-through cache + sane prune policy (front-loaded perf win)
Status: ACTIVE — a small interjection into Phase 2 (operator, 2026-05-29). Phase 2
(plan-phase2-recipe-tests.md) is PAUSED at its current progress (STATUS-2/BACKLOG-2 preserved);
the loops do this short phase, then Phase 2 resumes automatically where it left off.
Transition: auto — on ## DONE in machine-docs/STATUS-2pc.md the watchdog returns to Phase 2.
Why now (not in 2b): image handling is the demonstrated hot spot (Docker Hub rate-limit hit
twice; docker image prune -af wiped cached images mid-run → forced a full cold re-pull → rate
limit — see JOURNAL-2). These two fixes are evidence-based, not speculative, and the remaining
~dozen recipes are all deploy/pull-heavy, so front-loading compounds. The rest of perf stays
measurement-driven in Phase 2b.
Owner: Builder + Adversary loops. This file: /srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md
Phase order now: … 1e → 2w → 2(paused) → 2pc → 2(resume) → 2b → 3 → 4.
Definition of Done (PC1 + PC2; Adversary cold-verifies → machine-docs/REVIEW-2pc.md)
- PC1 — Conservative image-prune policy (easy win). Stop reflexive
docker image prune -af; never prune during a deploy/test run; keep base/in-use images; prune only truly-orphaned, old layers (age-gated) and only under real disk pressure (now bounded — host is 70 GB, ~43 G free). Removes the documented prune→re-pull→rate-limit churn. Wherever the harness/janitor prunes today, make it surgical (dangling + age threshold), not-af. Verify: a normal run no longer evicts cached base images; disk stays bounded without-af. - PC2 — Docker Hub pull-through cache. A local
registry:2in proxy/pull-through mode pointing atregistry-1.docker.io, authenticated with the Docker Hub PAT (thenptest2creds already in sops/.testenv), declared as a Nix-reconciled service (same idempotent pattern as proxy/keycloak — survives a D8 rebuild; the cache contents are runtime cache, not in the git closure). Configure the Docker daemon'sregistry-mirrorsto use it so alldocker.iopulls (abra/swarm included — transparent, no command change) go through the cache. First pull caches locally; every subsequent pull — across recipes, across runs, after a prune — is local. Verify: (a) 2nd deploy of an image pulls from the cache (not Docker Hub) — show via cache logs / a measured pull-time drop; (b) survives a prune (re-pull is local, not a Docker Hub hit); (c) a measured deploy speedup on a repeat/warm-cache deploy vs cold; (d) cache-miss pulls authenticate (per-account 200/6h), so the rate-limit pressure is gone. - PC3 — Bounded + documented. The cache has a disk cap / its own GC so it can't grow
unbounded (it's a cache — evict LRU/old). Scope note: this covers
docker.io(the rate-limited, most-shared registry);git.coopcloud.techimages are out of scope here (not rate-limited, fewer) — a follow-up if 2b shows it matters.docs/notes the cache + prune policy; deviations inDECISIONS.md.
When PC1–PC3 hold and are Adversary-verified, write ## DONE to machine-docs/STATUS-2pc.md →
watchdog auto-returns to Phase 2.
Guardrails / constraints
- Real pull path only — the daemon
registry-mirrorsmakes the cache transparent to abra/swarm; do NOT special-case pulls or bypass abra. (Consistent with the lasuite-drive "real abra commands" rule.) - Bounded scope — this is the TWO evidence-based image wins only. Concurrency (
MAX_TESTS>1), readiness-poll tuning, deploy dedup, etc. stay in measurement-driven Phase 2b — do NOT expand this interjection into general optimization. - Cache is cache, not source — the registry service is Nix-declared + reconciled (rebuildable); its stored layers are runtime cache, excluded from the D8 closure (re-warmed by pulls).
- Don't weaken any test (cardinal rule); the cache must not mask a genuinely-broken image pull.
Open decisions (log in machine-docs/DECISIONS.md)
- Cache storage location + size cap + eviction policy; where the
registry:2service is declared (newnix/modules/registry-cache.nix). - Daemon
registry-mirrorswiring (NixOSvirtualisation.docker.daemon.settings) + how the cache authenticates upstream with the PAT (sops secret → registry config). - The measured speedup target to report (cold vs warm-cache deploy delta on a representative recipe).