Files
cc-ci-orchestrator/cc-ci-plan/plan-phase2pc-image-cache.md
autonomic-bot f7971d949d 2pc: drop the pull-through registry cache — single host makes it marginal; keep PC1 prune-policy only
Operator (2026-05-29): on one host Docker's local image store already IS the cache; the churn was
over-pruning, not a missing cache. So 2pc = conservative prune policy + confirm local-store retention
+ daemon auth (PC1-3). Registry pull-through cache deferred to IDEAS with a concrete revisit
condition (multi-node, or measured cold-deploy bottleneck on recreate-surviving storage).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:24:56 +01:00

65 lines
4.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# cc-ci Phase 2pc — Sane image-prune policy (retain Docker's local image cache)
**Status:** ACTIVE — a **small interjection into Phase 2** (operator, 2026-05-29). Phase 2
(`plan-phase2-recipe-tests.md`) is **PAUSED at its current progress** (STATUS-2/BACKLOG-2 preserved);
the loops do this short phase, then **Phase 2 resumes automatically**.
**Transition:** auto — on `## DONE` in `machine-docs/STATUS-2pc.md` the watchdog returns to Phase 2.
**Owner:** Builder + Adversary loops. **This file:** `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`
**Phase order:** … 1e → 2w → 2(paused) → **2pc** → 2(resume) → 2b → 3 → 4.
---
## Why (and why a separate registry cache is NOT in scope)
Image handling is the demonstrated hot spot: the Docker Hub rate-limit was hit twice, and
`docker image prune -af` (run to free disk) **wiped cached images mid-run → forced a full cold
re-pull of 12 images → rate limit** (JOURNAL-2).
But the root cause was **over-pruning, not lack of a cache.** On this **single host, Docker's own
local image store already IS the cache** — a pulled image stays, and re-deploys (cold tests, warm
canonical, reboots) reuse the local layers with no re-download; the daemon is PAT-authenticated, so
the residual per-deploy manifest checks sit comfortably under the 200/6h per-account budget. So
**keeping the local store (stop aggressive pruning) recovers ~all the benefit** a cache would give.
A separate `registry:2` **pull-through cache is deliberately OUT of scope** here — its distinctive
wins don't apply to a single authenticated, non-pruning host: multi-node fan-out (we have **one**
node), surviving prune/VM-rebuild on **separate** storage (ours would be co-located, lost on a
recreate anyway), and cache-miss authentication (the daemon is already authenticated). It would add a
registry service + daemon-mirror config + cache GC for marginal gain. **Deferred to IDEAS / Phase 2b**
with a concrete revisit condition (see Guardrails).
## Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-2pc.md`)
- [ ] **PC1 — Conservative prune policy.** Stop reflexive `docker image prune -af`. **Never prune
during a deploy/test run.** Keep base/in-use images. Prune only **truly-orphaned, old** layers
(dangling + age-gated) and only under genuine disk pressure (now bounded — host 70 GB, ~43 G
free). Wherever the harness/janitor/CI prunes today, make it **surgical**, not `-af`. The
per-run teardown must keep removing the run's app **volumes/secrets/services** (sacred) but
**must NOT remove images.**
- [ ] **PC2 — Local cache retained + authenticated (confirm).** Confirm the Docker daemon stays
**PAT-authenticated** for `docker.io` and that the **local image store is retained across runs,
teardowns, and reboots** — so a repeat deploy of a previously-pulled image **reuses local
layers (no re-download)** and makes at most an authenticated manifest check.
- [ ] **PC3 — Verified + documented.** **Adversary proof:** deploy a recipe, tear it down, redeploy
→ the redeploy **does not re-download image layers** (served from the local store; show via
`docker` events/pull output / a measured pull-time drop), and a normal run no longer evicts
cached base images while disk stays bounded **without** `-af`. `docs/` notes the prune policy;
deviations in `DECISIONS.md`.
When PC1PC3 hold and are Adversary-verified, write `## DONE` to `machine-docs/STATUS-2pc.md`
watchdog auto-returns to Phase 2.
## Guardrails / constraints
- **Bounded scope** — prune policy + confirm local-store retention/auth ONLY. Do NOT build a registry
pull-through cache here, and do NOT expand into concurrency/readiness-tuning/dedup (those are
measurement-driven Phase 2b).
- **Real pull path** — no special-casing pulls; abra/swarm pull through the normal authenticated
daemon.
- **Don't weaken any test** — the retained cache must not mask a genuinely-broken image (pinned
versions still resolve correctly; a real new digest still pulls).
- **Registry pull-through cache — DEFERRED (IDEAS / Phase 2b), revisit ONLY if:** (a) cc-ci ever goes
**multi-node**, OR (b) Phase-2b measurement shows **cold-cache / fresh-deploy pull time is a real
bottleneck** (e.g. D8 throwaway-rebuild or fresh-canonical seeding) **AND** the cache is hosted on
**recreate-surviving storage** (an Incus volume / a path on host b1, not the VM's ephemeral disk).
Otherwise it's complexity without payoff on a single host.