autonomic-bot 0352cb5607 plan: Phase 2pc — image pull-through cache + sane prune policy (front-loaded perf interjection)
Operator-directed (2026-05-29): front-load the two EVIDENCE-BASED image wins before grinding the
remaining deploy-heavy recipes — Phase 2 pauses, 2pc runs, Phase 2 resumes (seq: …2w 2pc 2 2b 3 4).
PC1: conservative prune (no reflexive `prune -af`, never mid-run, keep base images) — kills the
documented prune→re-pull→rate-limit churn. PC2: local registry:2 pull-through cache for docker.io,
PAT-authenticated, Nix-reconciled, daemon registry-mirror → transparent to abra/swarm; subsequent
pulls (across recipes/runs/post-prune) are local → faster deploys + rate-limit gone. Bounded scope:
these two only; concurrency/readiness-tuning stay in measurement-driven 2b.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:20:30 +01:00


cc-ci Phase 2pc — Image pull-through cache + sane prune policy (front-loaded perf win)

Status: ACTIVE. This is a small interjection into Phase 2 (operator, 2026-05-29). Phase 2 (plan-phase2-recipe-tests.md) is PAUSED at its current progress (STATUS-2/BACKLOG-2 preserved); the loops do this short phase, then Phase 2 resumes automatically where it left off.

Transition: auto. On ## DONE in machine-docs/STATUS-2pc.md the watchdog returns to Phase 2.

Why now (not in 2b): image handling is the demonstrated hot spot (Docker Hub rate limit hit twice; docker image prune -af wiped cached images mid-run, which forced a full cold re-pull and then a rate-limit hit; see JOURNAL-2). These two fixes are evidence-based, not speculative, and the remaining ~dozen recipes are all deploy/pull-heavy, so front-loading compounds. The rest of perf stays measurement-driven in Phase 2b.

Owner: Builder + Adversary loops.

This file: /srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md

Phase order now: … 1e → 2w → 2(paused) → 2pc → 2(resume) → 2b → 3 → 4.


Definition of Done (PC1–PC3; Adversary cold-verifies → machine-docs/REVIEW-2pc.md)

  • PC1 — Conservative image-prune policy (easy win). Stop reflexive docker image prune -af; never prune during a deploy/test run; keep base/in-use images; prune only truly-orphaned, old layers (age-gated) and only under real disk pressure (now bounded — host is 70 GB, ~43 G free). Removes the documented prune→re-pull→rate-limit churn. Wherever the harness/janitor prunes today, make it surgical (dangling + age threshold), not -af. Verify: a normal run no longer evicts cached base images; disk stays bounded without -af.
  • PC2 — Docker Hub pull-through cache. A local registry:2 in proxy/pull-through mode pointing at registry-1.docker.io, authenticated with the Docker Hub PAT (the nptest2 creds already in sops/.testenv), declared as a Nix-reconciled service (same idempotent pattern as proxy/keycloak — survives a D8 rebuild; the cache contents are runtime cache, not in the git closure). Configure the Docker daemon's registry-mirrors to use it so all docker.io pulls (abra/swarm included — transparent, no command change) go through the cache. First pull caches locally; every subsequent pull — across recipes, across runs, after a prune — is local. Verify: (a) 2nd deploy of an image pulls from the cache (not Docker Hub) — show via cache logs / a measured pull-time drop; (b) survives a prune (re-pull is local, not a Docker Hub hit); (c) a measured deploy speedup on a repeat/warm-cache deploy vs cold; (d) cache-miss pulls authenticate (per-account 200/6h), so the rate-limit pressure is gone.
  • PC3 — Bounded + documented. The cache has a disk cap / its own GC so it can't grow unbounded (it's a cache — evict LRU/old). Scope note: this covers docker.io (the rate-limited, most-shared registry); git.coopcloud.tech images are out of scope here (not rate-limited, fewer) — a follow-up if 2b shows it matters. docs/ notes the cache + prune policy; deviations in DECISIONS.md.
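
The PC1 rules above (never prune mid-run, only under real disk pressure, only dangling and age-gated, never in-use images) can be sketched as a pure selection function. This is a minimal illustration, not the harness's actual janitor: the `ImageInfo` type, the `select_prunable` name, the 72-hour age threshold, and the `min_free_bytes` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ImageInfo:
    """Hypothetical minimal view of one local image."""
    image_id: str
    created: datetime   # timezone-aware creation time
    dangling: bool      # untagged image (<none>:<none>)
    in_use: bool        # referenced by any container or service


def select_prunable(images, now, *, run_active, free_bytes,
                    min_free_bytes, min_age=timedelta(hours=72)):
    """Return IDs of images that may be removed under the PC1 policy.

    The thresholds are illustrative assumptions, not values from the plan.
    """
    if run_active:
        return []   # never prune during a deploy/test run
    if free_bytes >= min_free_bytes:
        return []   # no real disk pressure: keep everything cached
    cutoff = now - min_age
    return [
        img.image_id
        for img in images
        if img.dangling and not img.in_use and img.created < cutoff
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    images = [
        ImageInfo("old-dangling", now - timedelta(days=10), True, False),
        ImageInfo("base-image", now - timedelta(days=30), False, True),
        ImageInfo("fresh-dangling", now - timedelta(hours=1), True, False),
    ]
    print(select_prunable(images, now, run_active=False,
                          free_bytes=1, min_free_bytes=10))
```

On the command line, `docker image prune --filter "until=72h"` without `-a` covers the dangling and age-gated part of this policy; the mid-run and disk-pressure gates would still have to live in whatever calls it.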

When PC1–PC3 hold and are Adversary-verified, write ## DONE to machine-docs/STATUS-2pc.md → watchdog auto-returns to Phase 2.
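
How the watchdog actually detects the marker is not specified in this plan. As a sketch under that caveat, a check for a level-2 `## DONE` heading on its own line could look like the following; the `phase_is_done` name and the exact matching rule are assumptions.

```python
import re

# Matches a level-2 heading that consists only of the word DONE.
DONE_HEADING = re.compile(r"^##\s+DONE\s*$", re.MULTILINE)


def phase_is_done(status_text: str) -> bool:
    """Return True if the status document contains a `## DONE` heading line."""
    return DONE_HEADING.search(status_text) is not None


if __name__ == "__main__":
    print(phase_is_done("# STATUS-2pc\n\n## DONE\n"))
    print(phase_is_done("Not yet: waiting on ## DONE\n"))
```

Anchoring the match to the start of a line keeps a prose mention of the marker, such as the sentence above, from triggering the transition early.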

Guardrails / constraints

  • Real pull path only — the daemon registry-mirrors makes the cache transparent to abra/swarm; do NOT special-case pulls or bypass abra. (Consistent with the lasuite-drive "real abra commands" rule.)
  • Bounded scope — this is the TWO evidence-based image wins only. Concurrency (MAX_TESTS>1), readiness-poll tuning, deploy dedup, etc. stay in measurement-driven Phase 2b — do NOT expand this interjection into general optimization.
  • Cache is cache, not source — the registry service is Nix-declared + reconciled (rebuildable); its stored layers are runtime cache, excluded from the D8 closure (re-warmed by pulls).
  • Don't weaken any test (cardinal rule); the cache must not mask a genuinely-broken image pull.
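
To make the "transparent to abra/swarm" guardrail concrete, the sketch below builds the two pieces of configuration involved: the daemon's `registry-mirrors` setting and the environment overrides that put a `registry:2` container into proxy mode. The `REGISTRY_PROXY_*` variables follow the registry's documented environment-override convention for its `proxy` config section. The mirror URL `http://127.0.0.1:5000` and the function names are assumptions; the real wiring is expected to be declared in Nix and the PAT to come from sops, neither of which is shown here.

```python
import json

DOCKER_HUB_UPSTREAM = "https://registry-1.docker.io"


def daemon_settings(mirror_url: str) -> dict:
    """Docker daemon settings that send docker.io pulls to the mirror first."""
    return {"registry-mirrors": [mirror_url]}


def registry_proxy_env(username: str, password: str) -> dict:
    """Environment for a registry:2 container acting as a pull-through cache.

    The credentials are used by the cache for upstream pulls on a cache miss.
    """
    return {
        "REGISTRY_PROXY_REMOTEURL": DOCKER_HUB_UPSTREAM,
        "REGISTRY_PROXY_USERNAME": username,
        "REGISTRY_PROXY_PASSWORD": password,
    }


if __name__ == "__main__":
    # Assumed local address for the cache; not taken from the plan.
    print(json.dumps(daemon_settings("http://127.0.0.1:5000"), indent=2))
    env = registry_proxy_env("example-user", "example-token")
    # Print only the variable names so the secret never reaches logs.
    print(sorted(env))
```

Because `registry-mirrors` only applies to Docker Hub images, pulls from other registries such as git.coopcloud.tech bypass the cache entirely, which matches the scope note in PC3. If the mirror is unreachable the daemon falls back to Docker Hub, so the cache does not turn a working pull into a failure.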

Open decisions (log in machine-docs/DECISIONS.md)

  • Cache storage location + size cap + eviction policy; where the registry:2 service is declared (new nix/modules/registry-cache.nix).
  • Daemon registry-mirrors wiring (NixOS virtualisation.docker.daemon.settings) + how the cache authenticates upstream with the PAT (sops secret → registry config).
  • The measured speedup target to report (cold vs warm-cache deploy delta on a representative recipe).
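
One way to report the cold versus warm-cache delta is to take the median of several timed deploys for each condition and report both the absolute saving and the ratio. The sketch below assumes deploy durations in seconds are collected elsewhere; the `warm_cache_speedup` name and the choice of the median are assumptions, and the numbers in the usage example are made up for illustration, not measurements.

```python
from statistics import median


def warm_cache_speedup(cold_seconds, warm_seconds):
    """Summarise cold vs warm-cache deploy timings.

    Uses medians so a single slow outlier run does not dominate the result.
    """
    if not cold_seconds or not warm_seconds:
        raise ValueError("need at least one timing for each condition")
    cold = median(cold_seconds)
    warm = median(warm_seconds)
    if warm <= 0:
        raise ValueError("warm timings must be positive")
    return {
        "cold_s": cold,
        "warm_s": warm,
        "delta_s": cold - warm,
        "speedup": cold / warm,
    }


if __name__ == "__main__":
    # Illustrative numbers only.
    report = warm_cache_speedup([120.0, 100.0, 110.0], [20.0, 22.0, 24.0])
    print(report)
```

Reporting both `delta_s` and `speedup` is useful because a large ratio on a small image can hide a negligible absolute saving, and the plan's goal is faster deploys on a representative recipe.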