2pc: drop the pull-through registry cache — single host makes it marginal; keep PC1 prune-policy only
Operator (2026-05-29): on one host Docker's local image store already IS the cache; the churn was over-pruning, not a missing cache. So 2pc = conservative prune policy + confirm local-store retention + daemon auth (PC1-3). Registry pull-through cache deferred to IDEAS with a concrete revisit condition (multi-node, or measured cold-deploy bottleneck on recreate-surviving storage).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -30,6 +30,17 @@ item into the project `BACKLOG.md` as `[idea]` if/when it becomes relevant.
 brittle, or if maintainers strongly prefer normal coop-cloud workflow over the Nix layer — weigh
 that against how much we value full reproducibility (D8) + hands-off auto-updates. *Added:* 2026-05-29.
 
+- [ ] **Docker Hub pull-through registry cache (`registry:2` proxy) — deferred; single-host makes it marginal.**
+Considered as a Phase-2pc perf win, then **dropped (operator, 2026-05-29):** on a **single host**,
+Docker's own local image store already caches pulled images (re-deploys reuse local layers), so the
+prune-policy fix (Phase 2pc PC1) recovers ~all the benefit. A separate pull-through cache's
+distinctive wins don't apply here — multi-node fan-out (one node), surviving prune/VM-rebuild on
+*separate* storage (ours would be co-located, lost on recreate), cache-miss auth (daemon already
+PAT-authenticated). **Revisit ONLY if:** (a) cc-ci goes **multi-node**, OR (b) Phase-2b measurement
+shows **cold-cache / fresh-deploy pull time** (D8 throwaway-rebuild, fresh-canonical seeding) is a
+real bottleneck **AND** the cache lives on **recreate-surviving storage** (Incus volume / host-b1
+path, not the VM's ephemeral disk). Otherwise it's complexity without payoff. *Added:* 2026-05-29.
+
 - [ ] **Optional `--extra-tests` flag for heavy / operational tests (opt-in heavy suite).**
 Some recipe tests are "more than needed" for the default CI signal — state-management /
 long-running-instance / load / helper-script operational tests that don't fit the ephemeral
@@ -1,61 +1,64 @@
-# cc-ci Phase 2pc — Image pull-through cache + sane prune policy (front-loaded perf win)
+# cc-ci Phase 2pc — Sane image-prune policy (retain Docker's local image cache)
 
 **Status:** ACTIVE — a **small interjection into Phase 2** (operator, 2026-05-29). Phase 2
 (`plan-phase2-recipe-tests.md`) is **PAUSED at its current progress** (STATUS-2/BACKLOG-2 preserved);
-the loops do this short phase, then **Phase 2 resumes automatically** where it left off.
+the loops do this short phase, then **Phase 2 resumes automatically**.
 **Transition:** auto — on `## DONE` in `machine-docs/STATUS-2pc.md` the watchdog returns to Phase 2.
-**Why now (not in 2b):** image handling is the **demonstrated** hot spot (Docker Hub rate-limit hit
-twice; `docker image prune -af` wiped cached images mid-run → forced a full cold re-pull → rate
-limit — see JOURNAL-2). These two fixes are **evidence-based, not speculative**, and the remaining
-~dozen recipes are all deploy/pull-heavy, so front-loading compounds. The rest of perf stays
-measurement-driven in Phase 2b.
 **Owner:** Builder + Adversary loops. **This file:** `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`
-**Phase order now:** … 1e → 2w → 2(paused) → **2pc** → 2(resume) → 2b → 3 → 4.
+**Phase order:** … 1e → 2w → 2(paused) → **2pc** → 2(resume) → 2b → 3 → 4.
 
 ---
 
-## Definition of Done (PC1 + PC2; Adversary cold-verifies → `machine-docs/REVIEW-2pc.md`)
+## Why (and why a separate registry cache is NOT in scope)
 
-- [ ] **PC1 — Conservative image-prune policy (easy win).** Stop reflexive `docker image prune -af`;
-**never prune during a deploy/test run**; keep base/in-use images; prune only **truly-orphaned,
-old** layers (age-gated) and only under real disk pressure (now bounded — host is 70 GB,
-~43 G free). Removes the documented prune→re-pull→rate-limit churn. Wherever the harness/janitor
-prunes today, make it surgical (dangling + age threshold), not `-af`. **Verify:** a normal run
-no longer evicts cached base images; disk stays bounded without `-af`.
-- [ ] **PC2 — Docker Hub pull-through cache.** A local **`registry:2` in proxy/pull-through mode**
-pointing at `registry-1.docker.io`, **authenticated with the Docker Hub PAT** (the `nptest2`
-creds already in sops/`.testenv`), declared as a **Nix-reconciled service** (same idempotent
-pattern as proxy/keycloak — survives a D8 rebuild; the cache *contents* are runtime cache, not
-in the git closure). Configure the Docker daemon's **`registry-mirrors`** to use it so all
-`docker.io` pulls (abra/swarm included — transparent, no command change) go through the cache.
-First pull caches locally; **every subsequent pull — across recipes, across runs, after a
-prune — is local.** **Verify:** (a) 2nd deploy of an image pulls from the cache (not Docker
-Hub) — show via cache logs / a measured pull-time drop; (b) survives a prune (re-pull is local,
-not a Docker Hub hit); (c) a **measured deploy speedup** on a repeat/warm-cache deploy vs cold;
-(d) cache-miss pulls authenticate (per-account 200/6h), so the rate-limit pressure is gone.
-- [ ] **PC3 — Bounded + documented.** The cache has a disk cap / its own GC so it can't grow
-unbounded (it's a cache — evict LRU/old). Scope note: this covers **`docker.io`** (the
-rate-limited, most-shared registry); `git.coopcloud.tech` images are out of scope here (not
-rate-limited, fewer) — a follow-up if 2b shows it matters. `docs/` notes the cache + prune
-policy; deviations in `DECISIONS.md`.
+Image handling is the demonstrated hot spot: the Docker Hub rate-limit was hit twice, and
+`docker image prune -af` (run to free disk) **wiped cached images mid-run → forced a full cold
+re-pull of 12 images → rate limit** (JOURNAL-2).
+
+But the root cause was **over-pruning, not lack of a cache.** On this **single host, Docker's own
+local image store already IS the cache** — a pulled image stays, and re-deploys (cold tests, warm
+canonical, reboots) reuse the local layers with no re-download; the daemon is PAT-authenticated, so
+the residual per-deploy manifest checks sit comfortably under the 200/6h per-account budget. So
+**keeping the local store (stop aggressive pruning) recovers ~all the benefit** a cache would give.
+
+A separate `registry:2` **pull-through cache is deliberately OUT of scope** here — its distinctive
+wins don't apply to a single authenticated, non-pruning host: multi-node fan-out (we have **one**
+node), surviving prune/VM-rebuild on **separate** storage (ours would be co-located, lost on a
+recreate anyway), and cache-miss authentication (the daemon is already authenticated). It would add a
+registry service + daemon-mirror config + cache GC for marginal gain. **Deferred to IDEAS / Phase 2b**
+with a concrete revisit condition (see Guardrails).
+
+## Definition of Done (Adversary cold-verifies → `machine-docs/REVIEW-2pc.md`)
+
+- [ ] **PC1 — Conservative prune policy.** Stop reflexive `docker image prune -af`. **Never prune
+during a deploy/test run.** Keep base/in-use images. Prune only **truly-orphaned, old** layers
+(dangling + age-gated) and only under genuine disk pressure (now bounded — host 70 GB, ~43 G
+free). Wherever the harness/janitor/CI prunes today, make it **surgical**, not `-af`. The
+per-run teardown must keep removing the run's app **volumes/secrets/services** (sacred) but
+**must NOT remove images.**
+- [ ] **PC2 — Local cache retained + authenticated (confirm).** Confirm the Docker daemon stays
+**PAT-authenticated** for `docker.io` and that the **local image store is retained across runs,
+teardowns, and reboots** — so a repeat deploy of a previously-pulled image **reuses local
+layers (no re-download)** and makes at most an authenticated manifest check.
+- [ ] **PC3 — Verified + documented.** **Adversary proof:** deploy a recipe, tear it down, redeploy
+→ the redeploy **does not re-download image layers** (served from the local store; show via
+`docker` events/pull output / a measured pull-time drop), and a normal run no longer evicts
+cached base images while disk stays bounded **without** `-af`. `docs/` notes the prune policy;
+deviations in `DECISIONS.md`.
 
 When PC1–PC3 hold and are Adversary-verified, write `## DONE` to `machine-docs/STATUS-2pc.md` →
 watchdog auto-returns to Phase 2.
 
 ## Guardrails / constraints
-- **Real pull path only** — the daemon `registry-mirrors` makes the cache transparent to abra/swarm;
-do NOT special-case pulls or bypass abra. (Consistent with the lasuite-drive "real abra commands"
-rule.)
-- **Bounded scope** — this is the TWO evidence-based image wins only. Concurrency (`MAX_TESTS>1`),
-readiness-poll tuning, deploy dedup, etc. stay in **measurement-driven Phase 2b** — do NOT expand
-this interjection into general optimization.
-- **Cache is cache, not source** — the registry *service* is Nix-declared + reconciled (rebuildable);
-its stored layers are runtime cache, excluded from the D8 closure (re-warmed by pulls).
-- **Don't weaken any test** (cardinal rule); the cache must not mask a genuinely-broken image pull.
-
-## Open decisions (log in machine-docs/DECISIONS.md)
-- Cache storage location + size cap + eviction policy; where the `registry:2` service is declared
-(new `nix/modules/registry-cache.nix`).
-- Daemon `registry-mirrors` wiring (NixOS `virtualisation.docker.daemon.settings`) + how the cache
-authenticates upstream with the PAT (sops secret → registry config).
-- The measured speedup target to report (cold vs warm-cache deploy delta on a representative recipe).
+- **Bounded scope** — prune policy + confirm local-store retention/auth ONLY. Do NOT build a registry
+pull-through cache here, and do NOT expand into concurrency/readiness-tuning/dedup (those are
+measurement-driven Phase 2b).
+- **Real pull path** — no special-casing pulls; abra/swarm pull through the normal authenticated
+daemon.
+- **Don't weaken any test** — the retained cache must not mask a genuinely-broken image (pinned
+versions still resolve correctly; a real new digest still pulls).
+- **Registry pull-through cache — DEFERRED (IDEAS / Phase 2b), revisit ONLY if:** (a) cc-ci ever goes
+**multi-node**, OR (b) Phase-2b measurement shows **cold-cache / fresh-deploy pull time is a real
+bottleneck** (e.g. D8 throwaway-rebuild or fresh-canonical seeding) **AND** the cache is hosted on
+**recreate-surviving storage** (an Incus volume / a path on host b1, not the VM's ephemeral disk).
+Otherwise it's complexity without payoff on a single host.
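
The PC1 policy in the plan (never prune during a run, prune only dangling and age-gated images, and only under genuine disk pressure) could be sketched as a small gate in front of the prune call. This is a minimal illustration, not the harness's actual code: the function name, the `min_free_gb` threshold, and the `max_age_hours` age gate are all hypothetical values chosen for the example. It relies on two documented `docker image prune` behaviours: without `--all` only dangling images are removed, and `--filter until=<duration>` restricts removal to images older than the given duration.

```python
from typing import List, Optional


def surgical_prune_command(
    free_gb: float,
    run_active: bool,
    min_free_gb: float = 10.0,  # hypothetical disk-pressure threshold
    max_age_hours: int = 168,   # hypothetical age gate (7 days)
) -> Optional[List[str]]:
    """Return a docker prune argv, or None when pruning should be skipped."""
    # Never prune while a deploy/test run is in progress.
    if run_active:
        return None
    # Only prune under genuine disk pressure.
    if free_gb >= min_free_gb:
        return None
    # No --all: only dangling images are candidates, so base/in-use
    # images stay in the local store. The until filter adds the age gate.
    return [
        "docker", "image", "prune", "--force",
        "--filter", f"until={max_age_hours}h",
    ]
```

With the roughly 43 G free that the plan quotes, `surgical_prune_command(43.0, run_active=False)` returns `None`, so nothing is pruned at all; a caller would pass a non-`None` result to `subprocess.run`.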
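
For the PC3 proof (a redeploy "does not re-download image layers"), one of the evidence sources the plan lists is pull output. The sketch below classifies captured `docker pull` text, assuming the classic CLI progress format in which a reused layer is reported as `<id>: Already exists` and a fetched layer ends with `<id>: Pull complete`. Other image-store backends or output modes may print differently, so treat the matched strings as an assumption to confirm on the host. The function names are hypothetical.

```python
from typing import Dict


def summarize_pull(output: str) -> Dict[str, int]:
    """Count reused vs. downloaded layers in captured `docker pull` output."""
    counts = {"reused": 0, "downloaded": 0}
    for raw in output.splitlines():
        line = raw.strip()
        if line.endswith(": Already exists"):
            counts["reused"] += 1       # layer served from the local store
        elif line.endswith(": Pull complete"):
            counts["downloaded"] += 1   # layer fetched from the registry
    return counts


def redeploy_reused_local_layers(output: str) -> bool:
    """True when the pull fetched no layers from the registry."""
    return summarize_pull(output)["downloaded"] == 0
```

A fully cached image typically produces no per-layer lines at all, only the digest and an "Image is up to date" status, which this check also treats as a pass. Because swarm services pull through the daemon rather than the CLI, a measured pull-time drop or `docker events` remains the complementary evidence the plan mentions.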