Files

autonomic-bot f7971d949d 2pc: drop the pull-through registry cache — single host makes it marginal; keep PC1 prune-policy only

Operator (2026-05-29): on one host Docker's local image store already IS the cache; the churn was
over-pruning, not a missing cache. So 2pc = conservative prune policy + confirm local-store retention
+ daemon auth (PC1-3). Registry pull-through cache deferred to IDEAS with a concrete revisit
condition (multi-node, or measured cold-deploy bottleneck on recreate-surviving storage).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-05-29 09:24:56 +01:00

4.5 KiB

Raw Blame History

cc-ci Phase 2pc — Sane image-prune policy (retain Docker's local image cache)

Status: ACTIVE — a small interjection into Phase 2 (operator, 2026-05-29). Phase 2 (plan-phase2-recipe-tests.md) is PAUSED at its current progress (STATUS-2/BACKLOG-2 preserved); the loops do this short phase, then Phase 2 resumes automatically. Transition: auto — on ## DONE in machine-docs/STATUS-2pc.md the watchdog returns to Phase 2. Owner: Builder + Adversary loops. This file: /srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md Phase order: … 1e → 2w → 2(paused) → 2pc → 2(resume) → 2b → 3 → 4.

Why (and why a separate registry cache is NOT in scope)

Image handling is the demonstrated hot spot: the Docker Hub rate-limit was hit twice, and docker image prune -af (run to free disk) wiped cached images mid-run → forced a full cold re-pull of 12 images → rate limit (JOURNAL-2).

But the root cause was over-pruning, not lack of a cache. On this single host, Docker's own local image store already IS the cache — a pulled image stays, and re-deploys (cold tests, warm canonical, reboots) reuse the local layers with no re-download; the daemon is PAT-authenticated, so the residual per-deploy manifest checks sit comfortably under the 200/6h per-account budget. So keeping the local store (stop aggressive pruning) recovers ~all the benefit a cache would give.

A separate registry:2 pull-through cache is deliberately OUT of scope here — its distinctive wins don't apply to a single authenticated, non-pruning host: multi-node fan-out (we have one node), surviving prune/VM-rebuild on separate storage (ours would be co-located, lost on a recreate anyway), and cache-miss authentication (the daemon is already authenticated). It would add a registry service + daemon-mirror config + cache GC for marginal gain. Deferred to IDEAS / Phase 2b with a concrete revisit condition (see Guardrails).

Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-2pc.md`)

PC1 — Conservative prune policy. Stop reflexive docker image prune -af. Never prune during a deploy/test run. Keep base/in-use images. Prune only truly-orphaned, old layers (dangling + age-gated) and only under genuine disk pressure (now bounded — host 70 GB, ~43 G free). Wherever the harness/janitor/CI prunes today, make it surgical, not -af. The per-run teardown must keep removing the run's app volumes/secrets/services (sacred) but must NOT remove images.
PC2 — Local cache retained + authenticated (confirm). Confirm the Docker daemon stays PAT-authenticated for docker.io and that the local image store is retained across runs, teardowns, and reboots — so a repeat deploy of a previously-pulled image reuses local layers (no re-download) and makes at most an authenticated manifest check.
PC3 — Verified + documented. Adversary proof: deploy a recipe, tear it down, redeploy → the redeploy does not re-download image layers (served from the local store; show via docker events/pull output / a measured pull-time drop), and a normal run no longer evicts cached base images while disk stays bounded without -af. docs/ notes the prune policy; deviations in DECISIONS.md.

When PC1–PC3 hold and are Adversary-verified, write ## DONE to machine-docs/STATUS-2pc.md → watchdog auto-returns to Phase 2.

Guardrails / constraints

Bounded scope — prune policy + confirm local-store retention/auth ONLY. Do NOT build a registry pull-through cache here, and do NOT expand into concurrency/readiness-tuning/dedup (those are measurement-driven Phase 2b).
Real pull path — no special-casing pulls; abra/swarm pull through the normal authenticated daemon.
Don't weaken any test — the retained cache must not mask a genuinely-broken image (pinned versions still resolve correctly; a real new digest still pulls).
Registry pull-through cache — DEFERRED (IDEAS / Phase 2b), revisit ONLY if: (a) cc-ci ever goes multi-node, OR (b) Phase-2b measurement shows cold-cache / fresh-deploy pull time is a real bottleneck (e.g. D8 throwaway-rebuild or fresh-canonical seeding) AND the cache is hosted on recreate-surviving storage (an Incus volume / a path on host b1, not the VM's ephemeral disk). Otherwise it's complexity without payoff on a single host.

4.5 KiB Raw Blame History Unescape Escape

cc-ci Phase 2pc — Sane image-prune policy (retain Docker's local image cache)

Why (and why a separate registry cache is NOT in scope)

Definition of Done (Adversary cold-verifies → machine-docs/REVIEW-2pc.md)

Guardrails / constraints

4.5 KiB

Raw Blame History

Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-2pc.md`)