plan: Phase 2pc — image pull-through cache + sane prune policy (front-loaded perf interjection)
Operator-directed (2026-05-29): front-load the two EVIDENCE-BASED image wins before grinding the remaining deploy-heavy recipes — Phase 2 pauses, 2pc runs, Phase 2 resumes (seq: …2w 2pc 2 2b 3 4). PC1: conservative prune (no reflexive `prune -af`, never mid-run, keep base images) — kills the documented prune→re-pull→rate-limit churn. PC2: local registry:2 pull-through cache for docker.io, PAT-authenticated, Nix-reconciled, daemon registry-mirror → transparent to abra/swarm; subsequent pulls (across recipes/runs/post-prune) are local → faster deploys + rate-limit gone. Bounded scope: these two only; concurrency/readiness-tuning stay in measurement-driven 2b. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@ -56,7 +56,7 @@ WATCH_ORCHESTRATOR="${WATCH_ORCHESTRATOR:-1}"
|
||||
# Ordered phase sequence: each entry "id|planfile|statusbasename". The watchdog runs them in order,
|
||||
# auto-transitions on the phase's "## DONE" (in BUILDER_DIR/<statusbasename>), and STOPS after the
|
||||
# last one (manual gate). Override PHASES_SPEC (semicolon-separated) to change the sequence.
|
||||
PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
|
||||
PHASES_SPEC="${PHASES_SPEC:-1c|plan-phase1c-full-reproducibility.md|STATUS-1c.md;1b|plan-phase1b-review-lint.md|STATUS-1b.md;1d|plan-phase1d-generic-test-suite.md|STATUS-1d.md;1e|plan-phase1e-harness-corrections.md|STATUS-1e.md;2w|plan-phase2w-warm-canonical-quick.md|STATUS-2w.md;2pc|plan-phase2pc-image-cache.md|STATUS-2pc.md;2|plan-phase2-recipe-tests.md|STATUS-2.md;2b|plan-phase2b-test-performance.md|STATUS-2b.md;3|plan-phase3-results-ux.md|STATUS-3.md;4|plan-phase4-final-review-polish-cleanup.md|STATUS-4.md}"
|
||||
IFS=';' read -r -a PHASES <<< "$PHASES_SPEC"
|
||||
PHASE_IDX_FILE="${PHASE_IDX_FILE:-$LOG_DIR/.phase-idx}"
|
||||
# --------------------------------------------------------------------------
|
||||
|
||||
61
cc-ci-plan/plan-phase2pc-image-cache.md
Normal file
61
cc-ci-plan/plan-phase2pc-image-cache.md
Normal file
@ -0,0 +1,61 @@
|
||||
# cc-ci Phase 2pc — Image pull-through cache + sane prune policy (front-loaded perf win)
|
||||
|
||||
**Status:** ACTIVE — a **small interjection into Phase 2** (operator, 2026-05-29). Phase 2
|
||||
(`plan-phase2-recipe-tests.md`) is **PAUSED at its current progress** (STATUS-2/BACKLOG-2 preserved);
|
||||
the loops do this short phase, then **Phase 2 resumes automatically** where it left off.
|
||||
**Transition:** auto — on `## DONE` in `machine-docs/STATUS-2pc.md` the watchdog returns to Phase 2.
|
||||
**Why now (not in 2b):** image handling is the **demonstrated** hot spot (Docker Hub rate-limit hit
|
||||
twice; `docker image prune -af` wiped cached images mid-run → forced a full cold re-pull → rate
|
||||
limit — see JOURNAL-2). These two fixes are **evidence-based, not speculative**, and the remaining
|
||||
~dozen recipes are all deploy/pull-heavy, so front-loading compounds. The rest of perf stays
|
||||
measurement-driven in Phase 2b.
|
||||
**Owner:** Builder + Adversary loops. **This file:** `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`
|
||||
**Phase order now:** … 1e → 2w → 2(paused) → **2pc** → 2(resume) → 2b → 3 → 4.
|
||||
|
||||
---
|
||||
|
||||
## Definition of Done (PC1 + PC2; Adversary cold-verifies → `machine-docs/REVIEW-2pc.md`)
|
||||
|
||||
- [ ] **PC1 — Conservative image-prune policy (easy win).** Stop reflexive `docker image prune -af`;
|
||||
**never prune during a deploy/test run**; keep base/in-use images; prune only **truly-orphaned,
|
||||
old** layers (age-gated) and only under real disk pressure (now bounded — host is 70 GB,
|
||||
~43 G free). Removes the documented prune→re-pull→rate-limit churn. Wherever the harness/janitor
|
||||
prunes today, make it surgical (dangling + age threshold), not `-af`. **Verify:** a normal run
|
||||
no longer evicts cached base images; disk stays bounded without `-af`.
|
||||
- [ ] **PC2 — Docker Hub pull-through cache.** A local **`registry:2` in proxy/pull-through mode**
|
||||
pointing at `registry-1.docker.io`, **authenticated with the Docker Hub PAT** (the `nptest2`
|
||||
creds already in sops/`.testenv`), declared as a **Nix-reconciled service** (same idempotent
|
||||
pattern as proxy/keycloak — survives a D8 rebuild; the cache *contents* are runtime cache, not
|
||||
in the git closure). Configure the Docker daemon's **`registry-mirrors`** to use it so all
|
||||
`docker.io` pulls (abra/swarm included — transparent, no command change) go through the cache.
|
||||
First pull caches locally; **every subsequent pull — across recipes, across runs, after a
|
||||
prune — is local.** **Verify:** (a) 2nd deploy of an image pulls from the cache (not Docker
|
||||
Hub) — show via cache logs / a measured pull-time drop; (b) survives a prune (re-pull is local,
|
||||
not a Docker Hub hit); (c) a **measured deploy speedup** on a repeat/warm-cache deploy vs cold;
|
||||
(d) cache-miss pulls authenticate (per-account 200/6h), so the rate-limit pressure is gone.
|
||||
- [ ] **PC3 — Bounded + documented.** The cache has a disk cap / its own GC so it can't grow
|
||||
unbounded (it's a cache — evict LRU/old). Scope note: this covers **`docker.io`** (the
|
||||
rate-limited, most-shared registry); `git.coopcloud.tech` images are out of scope here (not
|
||||
rate-limited, fewer) — a follow-up if 2b shows it matters. `docs/` notes the cache + prune
|
||||
policy; deviations in `DECISIONS.md`.
|
||||
|
||||
When PC1–PC3 hold and are Adversary-verified, write `## DONE` to `machine-docs/STATUS-2pc.md` →
|
||||
watchdog auto-returns to Phase 2.
|
||||
|
||||
## Guardrails / constraints
|
||||
- **Real pull path only** — the daemon `registry-mirrors` makes the cache transparent to abra/swarm;
|
||||
do NOT special-case pulls or bypass abra. (Consistent with the lasuite-drive "real abra commands"
|
||||
rule.)
|
||||
- **Bounded scope** — this is the TWO evidence-based image wins only. Concurrency (`MAX_TESTS>1`),
|
||||
readiness-poll tuning, deploy dedup, etc. stay in **measurement-driven Phase 2b** — do NOT expand
|
||||
this interjection into general optimization.
|
||||
- **Cache is cache, not source** — the registry *service* is Nix-declared + reconciled (rebuildable);
|
||||
its stored layers are runtime cache, excluded from the D8 closure (re-warmed by pulls).
|
||||
- **Don't weaken any test** (cardinal rule); the cache must not mask a genuinely-broken image pull.
|
||||
|
||||
## Open decisions (log in machine-docs/DECISIONS.md)
|
||||
- Cache storage location + size cap + eviction policy; where the `registry:2` service is declared
|
||||
(new `nix/modules/registry-cache.nix`).
|
||||
- Daemon `registry-mirrors` wiring (NixOS `virtualisation.docker.daemon.settings`) + how the cache
|
||||
authenticates upstream with the PAT (sops secret → registry config).
|
||||
- The measured speedup target to report (cold vs warm-cache deploy delta on a representative recipe).
|
||||
Reference in New Issue
Block a user