Files
cc-ci-orchestrator/cc-ci-plan/plan-phase2pc-image-cache.md
autonomic-bot 0352cb5607 plan: Phase 2pc — image pull-through cache + sane prune policy (front-loaded perf interjection)
Operator-directed (2026-05-29): front-load the two EVIDENCE-BASED image wins before grinding the
remaining deploy-heavy recipes — Phase 2 pauses, 2pc runs, Phase 2 resumes (seq: …2w 2pc 2 2b 3 4).
PC1: conservative prune (no reflexive `prune -af`, never mid-run, keep base images) — kills the
documented prune→re-pull→rate-limit churn. PC2: local registry:2 pull-through cache for docker.io,
PAT-authenticated, Nix-reconciled, daemon registry-mirror → transparent to abra/swarm; subsequent
pulls (across recipes/runs/post-prune) are local → faster deploys + rate-limit gone. Bounded scope:
these two only; concurrency/readiness-tuning stay in measurement-driven 2b.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:20:30 +01:00

62 lines
4.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# cc-ci Phase 2pc — Image pull-through cache + sane prune policy (front-loaded perf win)
**Status:** ACTIVE — a **small interjection into Phase 2** (operator, 2026-05-29). Phase 2
(`plan-phase2-recipe-tests.md`) is **PAUSED at its current progress** (STATUS-2/BACKLOG-2 preserved);
the loops do this short phase, then **Phase 2 resumes automatically** where it left off.
**Transition:** auto — on `## DONE` in `machine-docs/STATUS-2pc.md` the watchdog returns to Phase 2.
**Why now (not in 2b):** image handling is the **demonstrated** hot spot (Docker Hub rate-limit hit
twice; `docker image prune -af` wiped cached images mid-run → forced a full cold re-pull → rate
limit — see JOURNAL-2). These two fixes are **evidence-based, not speculative**, and the remaining
~dozen recipes are all deploy/pull-heavy, so front-loading compounds. The rest of perf stays
measurement-driven in Phase 2b.
**Owner:** Builder + Adversary loops. **This file:** `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`
**Phase order now:** … 1e → 2w → 2(paused) → **2pc** → 2(resume) → 2b → 3 → 4.
---
## Definition of Done (PC1 + PC2; Adversary cold-verifies → `machine-docs/REVIEW-2pc.md`)
- [ ] **PC1 — Conservative image-prune policy (easy win).** Stop reflexive `docker image prune -af`;
**never prune during a deploy/test run**; keep base/in-use images; prune only **truly-orphaned,
old** layers (age-gated) and only under real disk pressure (now bounded — host is 70 GB,
~43 G free). Removes the documented prune→re-pull→rate-limit churn. Wherever the harness/janitor
prunes today, make it surgical (dangling + age threshold), not `-af`. **Verify:** a normal run
no longer evicts cached base images; disk stays bounded without `-af`.
- [ ] **PC2 — Docker Hub pull-through cache.** A local **`registry:2` in proxy/pull-through mode**
pointing at `registry-1.docker.io`, **authenticated with the Docker Hub PAT** (the `nptest2`
creds already in sops/`.testenv`), declared as a **Nix-reconciled service** (same idempotent
pattern as proxy/keycloak — survives a D8 rebuild; the cache *contents* are runtime cache, not
in the git closure). Configure the Docker daemon's **`registry-mirrors`** to use it so all
`docker.io` pulls (abra/swarm included — transparent, no command change) go through the cache.
First pull caches locally; **every subsequent pull — across recipes, across runs, after a
prune — is local.** **Verify:** (a) 2nd deploy of an image pulls from the cache (not Docker
Hub) — show via cache logs / a measured pull-time drop; (b) survives a prune (re-pull is local,
not a Docker Hub hit); (c) a **measured deploy speedup** on a repeat/warm-cache deploy vs cold;
(d) cache-miss pulls authenticate (per-account 200/6h), so the rate-limit pressure is gone.
- [ ] **PC3 — Bounded + documented.** The cache has a disk cap / its own GC so it can't grow
unbounded (it's a cache — evict LRU/old). Scope note: this covers **`docker.io`** (the
rate-limited, most-shared registry); `git.coopcloud.tech` images are out of scope here (not
rate-limited, fewer) — a follow-up if 2b shows it matters. `docs/` notes the cache + prune
policy; deviations in `DECISIONS.md`.
When PC1PC3 hold and are Adversary-verified, write `## DONE` to `machine-docs/STATUS-2pc.md`
watchdog auto-returns to Phase 2.
## Guardrails / constraints
- **Real pull path only** — the daemon `registry-mirrors` makes the cache transparent to abra/swarm;
do NOT special-case pulls or bypass abra. (Consistent with the lasuite-drive "real abra commands"
rule.)
- **Bounded scope** — this is the TWO evidence-based image wins only. Concurrency (`MAX_TESTS>1`),
readiness-poll tuning, deploy dedup, etc. stay in **measurement-driven Phase 2b** — do NOT expand
this interjection into general optimization.
- **Cache is cache, not source** — the registry *service* is Nix-declared + reconciled (rebuildable);
its stored layers are runtime cache, excluded from the D8 closure (re-warmed by pulls).
- **Don't weaken any test** (cardinal rule); the cache must not mask a genuinely-broken image pull.
## Open decisions (log in machine-docs/DECISIONS.md)
- Cache storage location + size cap + eviction policy; where the `registry:2` service is declared
(new `nix/modules/registry-cache.nix`).
- Daemon `registry-mirrors` wiring (NixOS `virtualisation.docker.daemon.settings`) + how the cache
authenticates upstream with the PAT (sops secret → registry config).
- The measured speedup target to report (cold vs warm-cache deploy delta on a representative recipe).