note(2pc): realign REVIEW-2pc to narrowed scope — registry pull-through cache DROPPED per operator; 2pc is now prune-policy only (PC1 surgical prune + teardown must NOT remove images, PC2 confirm PAT-auth+local-store retention, PC3 deploy/teardown/redeploy reuses local layers). Break-it checklist updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:25:55 +01:00
parent 863bbac4de
commit e42753c17c

@@ -1,38 +1,53 @@
-# REVIEW-2pc — Adversary verdicts for Phase 2pc (image pull-through cache + sane prune)
+# REVIEW-2pc — Adversary verdicts for Phase 2pc (sane image-prune policy)
 SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`. DoD = PC1 + PC2 + PC3,
 each Adversary cold-verified here before Builder may write `## DONE` to STATUS-2pc.md.
+**SCOPE CORRECTION (operator, 2026-05-29):** the registry pull-through cache (old PC2)
+is **DROPPED / deferred to IDEAS** — single authenticated non-pruning host ⇒ Docker's own
+local image store already IS the cache. Phase 2pc is now **prune-policy only**.
 ## Status: AWAITING CLAIM
-Phase 2pc opened 2026-05-29 (operator interjection into paused Phase 2). As of this
-file's creation the Builder has **not** bootstrapped 2pc (no STATUS-2pc.md, no `claim(2pc…)`).
-No gate is claimed → no verdict yet. Watching origin/main; will cold-verify on first claim.
+Builder has not yet bootstrapped 2pc (no STATUS-2pc.md, no `claim(2pc…)`). No gate
+claimed → no verdict yet. Watching origin/main; cold-verify on first claim.
+## DoD (narrowed scope)
+- **PC1 — Conservative prune policy.** No reflexive `docker image prune -af`. NEVER prune
+during a deploy/test run. Keep base/in-use images. Prune only dangling + age-gated old
+layers, only under genuine disk pressure. Per-run teardown still removes the run's
+**volumes/secrets/services** (sacred) but **must NOT remove images.**
+- **PC2 — Local cache retained + authenticated (confirm).** Daemon stays PAT-authenticated
+for `docker.io`; local image store retained across runs, teardowns, reboots → repeat
+deploy reuses local layers (no re-download), at most an authenticated manifest check.
+- **PC3 — Verified + documented.** Adversary proof: deploy → teardown → redeploy does NOT
+re-download layers (via `docker` events/pull output / measured pull-time drop); normal run
+doesn't evict cached base images; disk bounded WITHOUT `-af`. docs/ notes policy;
+deviations in DECISIONS.md.
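The PC3 proof above relies on reading pull output to tell a layer download from a local reuse. A minimal sketch of that check follows, assuming the captured text is plain `docker pull` output with the classic per-layer status lines (`Pulling fs layer`, `Pull complete`, `Already exists`) and the final `Status:` line. The function name `classify_pull` and the returned dictionary keys are hypothetical, and pulls performed by swarm tasks during a deploy may not produce this output at all, so treat this as one possible signal rather than the phase's actual verification method.

```python
import re

# Per-layer lines printed by `docker pull`, e.g. "a1b2c3d4e5f6: Pull complete".
# Assumption: layers are shown as 12-character hex short IDs.
LAYER_LINE = re.compile(r"^(?P<layer>[0-9a-f]{12}): (?P<state>.+)$")

# Layer states that imply bytes were fetched over the network.
DOWNLOAD_STATES = (
    "Pulling fs layer",
    "Downloading",
    "Download complete",
    "Extracting",
    "Pull complete",
)
# Layer state that implies the layer came from the local image store.
LOCAL_STATE = "Already exists"


def classify_pull(output: str) -> dict:
    """Split the layers in captured `docker pull` text into downloaded vs reused."""
    downloaded, reused = set(), set()
    for raw in output.splitlines():
        match = LAYER_LINE.match(raw.strip())
        if not match:
            continue  # header, Digest: and Status: lines are not layer lines
        layer, state = match.group("layer"), match.group("state")
        if state.startswith(DOWNLOAD_STATES):
            downloaded.add(layer)
        elif state.startswith(LOCAL_STATE):
            reused.add(layer)
    return {
        "downloaded": sorted(downloaded),
        "reused": sorted(reused - downloaded),
        # Docker prints this status when nothing newer had to be fetched.
        "up_to_date": "Status: Image is up to date for" in output,
    }
```

Under these assumptions, a warm redeploy would be expected to yield an empty `downloaded` list, while a genuine cold pull of the same image should list every layer there.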
 ## Pre-claim baseline recon (read-only; NOT a verdict — just what "before" looks like)
-- **PC1 / prune.** Current prune policy lives in `nix/modules/swarm.nix:15-19`:
-`autoPrune = { enable=true; dates="daily"; flags=["--all" "--filter" "until=24h"]; }`.
-- `--all` removes *any* image not used by a container in 24h → would evict warm base
-images between runs (PC1's complaint). No `--volumes` (correct — warm volumes survive).
-- The destructive `docker image prune -af` churn cited in JOURNAL-2 was a **manual**
-operator action mid-deploy (JOURNAL-2:507,690-693), not this systemd unit.
-- PC1 acceptance to run cold: confirm (a) no reflexive `-af` remains in harness/janitor
-code paths; (b) prune never fires during an active deploy/run; (c) a normal run does
-NOT evict cached base images; (d) disk stays bounded without `-af`.
-- **PC2 / pull-through cache.** Does NOT exist yet — no `registry:2`, `registry-mirrors`,
-`registry-1.docker.io`, or pull-through config anywhere in repo (`nix/`, `runner/`).
-Expect a new `nix/modules/registry-cache.nix` (Nix-reconciled service) + daemon
-`virtualisation.docker.daemon.settings.registry-mirrors` + sops PAT (nptest2) for upstream auth.
-- PC2 acceptance to run cold: (a) 2nd deploy of an image pulls from cache not Docker Hub
-(cache logs / measured pull-time drop); (b) survives a prune (re-pull is local, not a
-Hub hit); (c) measured cold-vs-warm deploy speedup; (d) cache-miss pulls authenticate.
-- **PC3 / bounded+documented.** Cache must have disk cap / own GC (LRU/old eviction);
-scope = docker.io only; docs/ notes cache+prune policy; deviations in DECISIONS.md.
+- **autoPrune** (`nix/modules/swarm.nix:15-19`): `flags = ["--all" "--filter" "until=24h"]`,
+no `--volumes`. `--all` evicts *any* image unused for 24h → would drop warm base images
+between runs (exactly PC1's complaint). The destructive `docker image prune -af` cited in
+JOURNAL-2 (507, 690-693) was a **manual** operator action mid-deploy, NOT this systemd unit.
+→ PC1 must (a) tighten autoPrune off `--all` toward dangling-only/age-gated, AND (b) ensure
+no `-af` exists in any harness/janitor/teardown code path.
+- **Teardown image-removal grep target:** DECISIONS.md:708 documents a manual cleanup recipe
+ending `docker image prune -f`. Must confirm the *automated* per-run teardown
+(run_recipe_ci.py / harness) does NOT `docker rmi` / `image prune` the run's images.
+- **No registry cache** exists (confirmed) and per scope correction none should be built.
-## Break-it probes to run once PC2 lands (anti-anchoring checklist)
-- Cache must NOT mask a genuinely-broken image pull (cardinal rule — don't weaken a test).
-Probe: request a nonexistent/garbage tag through the mirror → must still FAIL, not serve stale.
-- registry-mirrors must be transparent to abra/swarm — verify a real abra deploy's pulls
-traverse the cache with NO command change / no pull special-casing.
-- Cache survives a D8-style rebuild as a service (Nix-reconciled), but its contents are NOT
-in the git closure (re-warmed by pulls) — verify both halves.
-- PAT secret must not leak into published logs / dashboard / world-readable registry config.
+## Break-it probes to run once PC1 claimed (anti-anchoring checklist)
+1. **Teardown must NOT remove images.** Deploy a recipe, capture `docker images` digest set,
+run the real teardown, re-check: the recipe's image layers must STILL be present locally.
+2. **Redeploy reuses local layers (PC3 core).** After teardown, redeploy the SAME recipe and
+confirm via `docker events` / pull output there is NO layer download (only a manifest
+check, or fully local). Measure the pull-time delta vs a genuine cold pull.
+3. **No mid-run prune.** Grep all code paths; confirm nothing prunes images while a
+deploy/test is active (the JOURNAL-2 landmine). autoPrune is daily/off-run only.
+4. **Cache must NOT mask a broken image (cardinal rule).** A pinned version still resolves to
+the correct digest; a genuinely-new/changed digest still triggers a real pull — the
+retained store must not serve a stale image for a recipe that actually changed.
+5. **Disk stays bounded without `-af`.** Confirm the surgical policy + disk-pressure trigger
+actually reclaims under pressure (don't trade rate-limit churn for a full disk).
+6. **PAT auth intact + not leaked.** Daemon still authenticated to docker.io (under 200/6h);
+PAT not exposed in published logs / dashboard / world-readable config.
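Probes 1 and 3 in the checklist above come down to two mechanical checks: grep the code paths for image-removing commands, and compare the set of local image IDs before and after teardown. The sketch below illustrates both under stated assumptions. The names `find_image_removals`, `scan_tree` and `missing_after_teardown`, the pattern list, and the file suffixes are all hypothetical and are not taken from the repository. The ID sets would have to be captured separately, for example from `docker images --no-trunc --quiet`.

```python
import re
from pathlib import Path

# Assumed set of commands that remove images; extend as needed.
FORBIDDEN = [
    re.compile(r"docker\s+image\s+prune\b"),
    re.compile(r"docker\s+system\s+prune\b"),
    re.compile(r"docker\s+rmi\b"),
    re.compile(r"docker\s+image\s+(rm|remove)\b"),
]


def find_image_removals(text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each line invoking an image removal."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in FORBIDDEN):
            hits.append((lineno, line.strip()))
    return hits


def scan_tree(root: str, suffixes=(".py", ".sh", ".nix")) -> dict[str, list[tuple[int, str]]]:
    """Scan files under root and map each offending path to its hits."""
    report = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            hits = find_image_removals(path.read_text(errors="ignore"))
            if hits:
                report[str(path)] = hits
    return report


def missing_after_teardown(before: set[str], after: set[str]) -> set[str]:
    """Image IDs present before teardown but gone afterwards (should be empty)."""
    return before - after
```

A text scan of this kind only catches commands written as a single string. Calls assembled from argument lists (for instance `["docker", "rmi", ...]`) would slip through, so a clean report is supporting evidence, not proof. Any hit still needs a manual read to tell a documented manual recipe from an automated teardown step.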