Operator (2026-05-29): on one host Docker's local image store already IS the cache; the churn was over-pruning, not a missing cache. So 2pc = conservative prune policy + confirm local-store retention + daemon auth (PC1-3). Registry pull-through cache deferred to IDEAS with a concrete revisit condition (multi-node, or measured cold-deploy bottleneck on recreate-surviving storage). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
65 lines
4.5 KiB
Markdown
65 lines
4.5 KiB
Markdown
# cc-ci Phase 2pc — Sane image-prune policy (retain Docker's local image cache)
|
||
|
||
**Status:** ACTIVE — a **small interjection into Phase 2** (operator, 2026-05-29). Phase 2
|
||
(`plan-phase2-recipe-tests.md`) is **PAUSED at its current progress** (STATUS-2/BACKLOG-2 preserved);
|
||
the loops do this short phase, then **Phase 2 resumes automatically**.
|
||
**Transition:** auto — on `## DONE` in `machine-docs/STATUS-2pc.md` the watchdog returns to Phase 2.
|
||
**Owner:** Builder + Adversary loops. **This file:** `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`
|
||
**Phase order:** … 1e → 2w → 2(paused) → **2pc** → 2(resume) → 2b → 3 → 4.
|
||
|
||
---
|
||
|
||
## Why (and why a separate registry cache is NOT in scope)
|
||
|
||
Image handling is the demonstrated hot spot: the Docker Hub rate-limit was hit twice, and
|
||
`docker image prune -af` (run to free disk) **wiped cached images mid-run → forced a full cold
|
||
re-pull of 12 images → rate limit** (JOURNAL-2).
|
||
|
||
But the root cause was **over-pruning, not lack of a cache.** On this **single host, Docker's own
|
||
local image store already IS the cache** — a pulled image stays, and re-deploys (cold tests, warm
|
||
canonical, reboots) reuse the local layers with no re-download; the daemon is PAT-authenticated, so
|
||
the residual per-deploy manifest checks sit comfortably under the 200/6h per-account budget. So
|
||
**keeping the local store (stop aggressive pruning) recovers ~all the benefit** a cache would give.
|
||
|
||
A separate `registry:2` **pull-through cache is deliberately OUT of scope** here — its distinctive
|
||
wins don't apply to a single authenticated, non-pruning host: multi-node fan-out (we have **one**
|
||
node), surviving prune/VM-rebuild on **separate** storage (ours would be co-located, lost on a
|
||
recreate anyway), and cache-miss authentication (the daemon is already authenticated). It would add a
|
||
registry service + daemon-mirror config + cache GC for marginal gain. **Deferred to IDEAS / Phase 2b**
|
||
with a concrete revisit condition (see Guardrails).
|
||
|
||
## Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-2pc.md`)
|
||
|
||
- [ ] **PC1 — Conservative prune policy.** Stop reflexive `docker image prune -af`. **Never prune
|
||
during a deploy/test run.** Keep base/in-use images. Prune only **truly-orphaned, old** layers
|
||
(dangling + age-gated) and only under genuine disk pressure (now bounded — host 70 GB, ~43 G
|
||
free). Wherever the harness/janitor/CI prunes today, make it **surgical**, not `-af`. The
|
||
per-run teardown must keep removing the run's app **volumes/secrets/services** (sacred) but
|
||
**must NOT remove images.**
|
||
- [ ] **PC2 — Local cache retained + authenticated (confirm).** Confirm the Docker daemon stays
|
||
**PAT-authenticated** for `docker.io` and that the **local image store is retained across runs,
|
||
teardowns, and reboots** — so a repeat deploy of a previously-pulled image **reuses local
|
||
layers (no re-download)** and makes at most an authenticated manifest check.
|
||
- [ ] **PC3 — Verified + documented.** **Adversary proof:** deploy a recipe, tear it down, redeploy
|
||
→ the redeploy **does not re-download image layers** (served from the local store; show via
|
||
`docker` events/pull output / a measured pull-time drop), and a normal run no longer evicts
|
||
cached base images while disk stays bounded **without** `-af`. `docs/` notes the prune policy;
|
||
deviations in `DECISIONS.md`.
|
||
|
||
When PC1–PC3 hold and are Adversary-verified, write `## DONE` to `machine-docs/STATUS-2pc.md` →
|
||
watchdog auto-returns to Phase 2.
|
||
|
||
## Guardrails / constraints
|
||
- **Bounded scope** — prune policy + confirm local-store retention/auth ONLY. Do NOT build a registry
|
||
pull-through cache here, and do NOT expand into concurrency/readiness-tuning/dedup (those are
|
||
measurement-driven Phase 2b).
|
||
- **Real pull path** — no special-casing pulls; abra/swarm pull through the normal authenticated
|
||
daemon.
|
||
- **Don't weaken any test** — the retained cache must not mask a genuinely-broken image (pinned
|
||
versions still resolve correctly; a real new digest still pulls).
|
||
- **Registry pull-through cache — DEFERRED (IDEAS / Phase 2b), revisit ONLY if:** (a) cc-ci ever goes
|
||
**multi-node**, OR (b) Phase-2b measurement shows **cold-cache / fresh-deploy pull time is a real
|
||
bottleneck** (e.g. D8 throwaway-rebuild or fresh-canonical seeding) **AND** the cache is hosted on
|
||
**recreate-surviving storage** (an Incus volume / a path on host b1, not the VM's ephemeral disk).
|
||
Otherwise it's complexity without payoff on a single host.
|