cc-ci/machine-docs/REVIEW-2pc.md

REVIEW-2pc — Adversary verdicts for Phase 2pc (sane image-prune policy)

SSOT: /srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md. DoD = PC1 + PC2 + PC3, each Adversary cold-verified here before Builder may write ## DONE to STATUS-2pc.md.

SCOPE CORRECTION (operator, 2026-05-29): the registry pull-through cache (old PC2) is DROPPED / deferred to IDEAS — single authenticated non-pruning host ⇒ Docker's own local image store already IS the cache. Phase 2pc is now prune-policy only.

Status: AWAITING CLAIM

Builder has not yet bootstrapped 2pc (no STATUS-2pc.md, no claim(2pc…)). No gate claimed → no verdict yet. Watching origin/main; cold-verify on first claim.

DoD (narrowed scope)

  • PC1 — Conservative prune policy. No reflexive docker image prune -af. NEVER prune during a deploy/test run. Keep base/in-use images. Prune only dangling + age-gated old layers, only under genuine disk pressure. Per-run teardown still removes the run's volumes/secrets/services (sacred) but must NOT remove images.
  • PC2 — Local cache retained + authenticated (confirm). Daemon stays PAT-authenticated for docker.io; local image store retained across runs, teardowns, reboots → repeat deploy reuses local layers (no re-download), at most an authenticated manifest check.
  • PC3 — Verified + documented. Adversary proof: deploy → teardown → redeploy does NOT re-download layers (via docker events/pull output / measured pull-time drop); normal run doesn't evict cached base images; disk bounded WITHOUT -af. docs/ notes policy; deviations in DECISIONS.md.
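
As a rough illustration of the PC1 gating order (not the Builder's implementation), the decision can be sketched as a pure function that returns either nothing or a surgical prune command. The names `HostState` and `plan_prune`, the 0.85 pressure threshold and the 168h age gate are all hypothetical placeholders; only the `docker image prune` flags are real CLI flags.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostState:
    disk_used_fraction: float  # e.g. 0.91 means 91% of the Docker data disk is used
    run_active: bool           # True while any deploy/test run is in progress


def plan_prune(state: HostState,
               pressure_threshold: float = 0.85,   # hypothetical threshold
               min_age: str = "168h") -> Optional[List[str]]:  # hypothetical age gate
    """Return a docker argv to run, or None when nothing should be pruned."""
    if state.run_active:
        return None  # PC1: never prune during a deploy/test run
    if state.disk_used_fraction < pressure_threshold:
        return None  # PC1: prune only under genuine disk pressure
    # Without --all, `docker image prune` removes dangling images only;
    # the `until` filter further restricts removal by image creation time.
    return ["docker", "image", "prune", "--force", "--filter", f"until={min_age}"]
```

Keeping the decision separate from execution means the Adversary can check the policy (no `--all`, no action while a run is active) without touching a live daemon.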

Pre-claim baseline recon (read-only; NOT a verdict — just what "before" looks like)

  • autoPrune (nix/modules/swarm.nix:15-19): flags = ["--all" "--filter" "until=24h"], no --volumes. With --all, the prune removes every image not referenced by a container, and the until filter gates on image creation time (older than 24h), not on last use → warm base images sitting idle between runs would be dropped (exactly PC1's complaint). The destructive docker image prune -af cited in JOURNAL-2 (507, 690-693) was a manual operator action mid-deploy, NOT this systemd unit. → PC1 must (a) move autoPrune off --all to dangling-only/age-gated pruning, AND (b) ensure no -af exists in any harness/janitor/teardown code path.
  • Teardown image-removal grep target: DECISIONS.md:708 documents a manual cleanup recipe ending docker image prune -f. Must confirm the automated per-run teardown (run_recipe_ci.py / harness) does NOT docker rmi / image prune the run's images.
  • No registry cache exists (confirmed) and per scope correction none should be built.
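
The grep targets above could be mechanised along these lines. This is a minimal sketch, assuming the code paths are available as text; `scan_text` and its labels are hypothetical helpers, not part of the harness. It tokenises each line so that both shell strings and argv lists (e.g. in a `subprocess` call) are caught, but it will miss commands assembled dynamically or invoked via an absolute path, so it supplements a manual read rather than replacing it.

```python
import re
from typing import List, Tuple

# Split a line into command-like tokens; quotes, brackets and commas act as separators.
TOKEN = re.compile(r"[A-Za-z0-9_.=:/-]+")


def is_all_flag(token: str) -> bool:
    """True for --all or a short-flag cluster containing 'a' (e.g. -a, -af)."""
    if token == "--all":
        return True
    return token.startswith("-") and not token.startswith("--") and "a" in token[1:]


def scan_line(line: str) -> List[str]:
    """Label every image-removing docker invocation found on one line."""
    tokens = TOKEN.findall(line)
    findings = []
    for i, tok in enumerate(tokens):
        if tok != "docker":
            continue
        rest = tokens[i + 1:]
        if rest[:2] == ["image", "prune"]:
            if any(is_all_flag(t) for t in rest[2:]):
                findings.append("image prune --all")
            else:
                findings.append("image prune (dangling-only)")
        elif rest[:2] == ["system", "prune"]:
            findings.append("system prune")
        elif rest[:1] == ["rmi"] or rest[:2] == ["image", "rm"]:
            findings.append("image removal")
    return findings


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line number, label) pairs for every finding in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for finding in scan_line(line):
            hits.append((lineno, finding))
    return hits
```

Any hit inside a per-run teardown path would need explaining, since PC1 says teardown must not remove images at all; a dangling-only hit in a documented manual recipe is a weaker signal than an `--all` hit in automated code.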

Break-it probes to run once PC1 claimed (anti-anchoring checklist)

  1. Teardown must NOT remove images. Deploy a recipe, capture docker images digest set, run the real teardown, re-check: the recipe's image layers must STILL be present locally.
  2. Redeploy reuses local layers (PC3 core). After teardown, redeploy the SAME recipe and confirm via docker events / pull output there is NO layer download (only a manifest check, or fully local). Measure the pull-time delta vs a genuine cold pull.
  3. No mid-run prune. Grep all code paths; confirm nothing prunes images while a deploy/test is active (the JOURNAL-2 landmine). autoPrune is daily/off-run only.
  4. Cache must NOT mask a broken image (cardinal rule). A pinned version still resolves to the correct digest; a genuinely-new/changed digest still triggers a real pull — the retained store must not serve a stale image for a recipe that actually changed.
  5. Disk stays bounded without -af. Confirm the surgical policy + disk-pressure trigger actually reclaims under pressure (don't trade rate-limit churn for a full disk).
  6. PAT auth intact + not leaked. Daemon still authenticated to docker.io (pull rate stays under 200/6h); PAT not exposed in published logs / dashboard / world-readable config.
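
Probes 1 and 2 reduce to two small comparisons, sketched below under stated assumptions. The helper names are hypothetical. `local_image_ids` shells out to the real `docker images` CLI and therefore needs a reachable daemon; `classify_pull_output` assumes the classic per-layer lines printed by a plain `docker pull` ("Already exists", "Pull complete"), which may differ across Docker versions or image-store backends, and a Swarm service deploy may need `docker events` instead.

```python
import subprocess
from typing import Dict, List, Set


def local_image_ids() -> Set[str]:
    """Snapshot the IDs of all images in the local store (requires the Docker CLI)."""
    out = subprocess.run(
        ["docker", "images", "--all", "--no-trunc", "--quiet"],
        check=True, capture_output=True, text=True,
    ).stdout
    return set(out.split())


def evicted(before: Set[str], after: Set[str]) -> Set[str]:
    """Images present before teardown but missing afterwards (probe 1 expects none)."""
    return before - after


def classify_pull_output(output: str) -> Dict[str, List[str]]:
    """Group layer IDs from `docker pull` output by whether they were fetched or reused."""
    result: Dict[str, List[str]] = {"downloaded": [], "reused": []}
    for line in output.splitlines():
        layer, sep, status = line.strip().partition(": ")
        if not sep:
            continue  # not a "<id>: <status>" line
        if status == "Pull complete":
            result["downloaded"].append(layer)
        elif status == "Already exists":
            result["reused"].append(layer)
    return result
```

For probe 1, take `local_image_ids()` before and after the real teardown and expect `evicted(...)` to be empty for the recipe's images. For probe 2, an empty "downloaded" list on the redeploy pull is the pass condition, compared against a genuine cold pull where it should be non-empty.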