REVIEW-2pc — Adversary verdicts for Phase 2pc (sane image-prune policy)
SSOT: /srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md. DoD = PC1 + PC2 + PC3,
each Adversary cold-verified here before Builder may write ## DONE to STATUS-2pc.md.
SCOPE CORRECTION (operator, 2026-05-29): the registry pull-through cache (old PC2) is DROPPED / deferred to IDEAS — single authenticated non-pruning host ⇒ Docker's own local image store already IS the cache. Phase 2pc is now prune-policy only.
Status: PASS @2026-05-29 (gate 2pc re-claim 9e73ebd) — PC1+PC2+PC3 cold-verified; F2pc-1 CLEARED
Verdict: PASS. Builder reconciled the git≠host drift (F2pc-1) via b9bbd25 (rename
committed units docker-prune→ci-docker-prune; NixOS reserves docker-prune). Re-verified
cold:
- git == deploy source: `git show HEAD:nix/modules/docker-prune.nix` and `swarm.nix` are byte-identical to the host's `/root/cc-ci` copies (diff clean). Committed units now `systemd.services.ci-docker-prune` / `.timer` (docker-prune.nix:56,67) = what runs live.
- live: `ci-docker-prune.timer` enabled+active (daily 00:00); old `docker-prune.timer` not-found. PC1 no-op @<80% (`docker images` 18→18 unchanged). PC3 redis re-confirm: cold `Downloaded newer` → warm `Image is up to date` (local reuse, manifest-only).
- All PC1/PC2/PC3 substance from the prior pass still holds (below). A from-git rebuild now reproduces the verified system, and STATUS-2pc's `ci-docker-prune.timer` verify commands match.
F2pc-1 → CLOSED (Adversary, this verdict): git==host==ci-docker-prune, confirmed by
byte-diff + live unit state.
Scope note on PC1 pressure branch: I verified the no-op (<80%) gate live and the ≥80% code
path by read — it runs docker {container,image,builder} prune -f --filter until=24h. Crucially
image prune without --all removes only dangling+old layers and cannot evict tagged
base/in-use images (docker contract) — the cardinal "keep the cache" property is structural, not
incidental. I did not fill the 64G disk to fire the ≥80% branch live (disproportionate); I
rely on that code-read + Builder probe-5 evidence (2.34 GB dangling reclaimed, tagged images
kept). The behavior I could break-test (no-op, teardown-keeps-images, bogus-tag-fails,
cold→warm reuse) is all GREEN.
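As an illustration of the pressure branch described in the scope note, here is a minimal Python sketch of the disk gate and the commands it would issue. The function names and the way disk usage is measured (`shutil.disk_usage`) are assumptions for illustration, not the module's actual script; only the 80% threshold and the `docker {container,image,builder} prune -f --filter until=24h` invocations come from the verdict text.

```python
import shutil
from typing import List

# The three prune targets named in the verdict.
PRUNE_TARGETS = ("container", "image", "builder")


def root_used_percent(path: str = "/") -> float:
    """Percentage of the filesystem at `path` in use (assumed measurement)."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def prune_commands(used_percent: float, threshold: float = 80.0) -> List[List[str]]:
    """Return the prune invocations for a given disk usage.

    Below the threshold this is the no-op branch: nothing is pruned and the
    local image cache is kept. At or above it, each target is pruned with an
    age filter only; --all and --volumes are never added.
    """
    if used_percent < threshold:
        return []
    return [
        ["docker", target, "prune", "-f", "--filter", "until=24h"]
        for target in PRUNE_TARGETS
    ]
```

Calling `prune_commands(root_used_percent())` on a host at ~30% usage returns an empty list, which corresponds to the no-op observed live.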
(superseded) FAIL @2026-05-29 (gate 2pc claim de6103d) — substance GREEN, git ≠ verified host
Verdict: FAIL — PC1/PC2/PC3 behavior is verified-GREEN on the live host, but the committed code does not match the deployed-and-"verified" artifact, so the claim is not reproducible from git (D8 contract violated). One blocking defect → F2pc-1 below. Fix is a one-shot reconciliation, not a redo.
What I cold-verified live (all GREEN on host — substance is sound)
- PC1 prune logic (`nix/modules/docker-prune.nix`): triple-gated (≥80% on `/`, no run-app stack `^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_`, no converging service), prunes `container|image|builder prune -f --filter until=24h` only; never `--all`, never `--volumes`. Ran the service live @ ~27–31% on `/`: printed "keeping local image cache, nothing to do", `docker images` count 17→17 unchanged. ✓
- PC1 teardown keeps images: `grep -rnE 'rmi|image rm|image prune|images -q' runner/ tests/conftest.py` → only comments, no image removal. Live: after `docker service rm` the redis image (`487efc061638`) stayed present. ✓
- PC1 autoPrune removed: committed `swarm.nix` no longer sets `autoPrune` (left default off); daemon `enable=true` only. A fresh rebuild creates no autoPrune unit. ✓
- PC2 PAT-auth + retention: `docker info` → `Username: nptest2`; `/root/.docker/config.json` → `/run/secrets/rendered/docker-config.json` (sops, symlink); `auths` has `https://index.docker.io/v1/`. No registry mirrors (cache correctly dropped). ✓
- PC3 cold→teardown→warm (live, redis:7-alpine, real daemon = abra/swarm pull path): COLD = 7 layers "Pull complete" / "Downloaded newer"; service up 1/1 → `service rm`; image retained; WARM re-pull = "Image is up to date" (no layer download, manifest-only). ✓
- Break-it (cardinal rule): `docker pull redis:<bogus-tag>` → `manifest unknown` error. Retained store does not mask a broken/changed image. ✓
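The triple gate from the PC1 prune-logic item can be sketched as a single predicate. This is a hedged illustration, not the committed script: the regex is copied from the verdict, but matching it against service names and passing the converging-service count in as a number are assumptions, and `should_prune` is a hypothetical name.

```python
import re
from typing import Iterable

# Regex quoted in the verdict for run-app stacks.
RUN_STACK_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_")


def should_prune(
    used_percent: float,
    service_names: Iterable[str],
    converging_services: int,
    threshold: float = 80.0,
) -> bool:
    """Return True only when all three gates allow a prune."""
    # Gate 1: genuine disk pressure.
    if used_percent < threshold:
        return False
    # Gate 2: no run-app stack present (assumption: checked via service names).
    if any(RUN_STACK_RE.match(name) for name in service_names):
        return False
    # Gate 3: no service still converging (assumption: count supplied by caller).
    if converging_services > 0:
        return False
    return True
```

Any single failing gate keeps the local image cache, which is why a prune can never fire in the middle of a deploy or test run under this model.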
Why FAIL anyway — F2pc-1 (blocking): committed code ≠ verified host
- origin/main HEAD `de6103d` (= the `claim(2pc)` commit) defines the units as `systemd.services.docker-prune` / `systemd.timers.docker-prune` (nix/modules/docker-prune.nix:56,67).
- The live, "verified" host runs `ci-docker-prune.service` / `ci-docker-prune.timer` (enabled+active, next daily 00:00), built from uncommitted source in `/root/cc-ci` (`/root/cc-ci` is not even a git repo; its module has `systemd.services.ci-docker-prune`).
- Consequences: (1) the artifact the Builder "deployed+verified" was never committed, so git does not reproduce the verified system (a D8/fresh rebuild yields `docker-prune.*`, a different unit name than what was verified); (2) STATUS-2pc's own HOW-to-verify commands reference `ci-docker-prune.timer`, which a from-git rebuild will report `not-found` → a cold verifier following STATUS against a git-built host gets a false FAIL.
- This is a reproducibility/integrity defect, not a behavioral one. The script body is the same (`cc-ci-docker-prune`); only the systemd unit wrapper name diverges.
- To clear: make git == the deployed host. Either commit the `ci-docker-prune` naming actually deployed (push `/root/cc-ci`'s `docker-prune.nix`), OR rename the module's units back to `docker-prune`, `nixos-rebuild switch`, and update STATUS-2pc verify commands to match. Then I re-verify `git rev` builds the exact `ci-docker-prune`/`docker-prune` units STATUS documents. (Also confirm the stale `docker-prune.service` [linked,ignored] leftover is harmless / GC'd on next rebuild.)
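The git == host check behind F2pc-1 amounts to a byte comparison per file. A minimal sketch follows, assuming the committed bytes come from `git show HEAD:<path>` and the deployed bytes are read from the host copy; `committed_bytes` and `drifted_paths` are hypothetical helper names, not part of the repo.

```python
import subprocess
from typing import Dict, List


def committed_bytes(repo: str, path: str) -> bytes:
    """Bytes of `path` at HEAD in the git checkout at `repo`."""
    result = subprocess.run(
        ["git", "-C", repo, "show", f"HEAD:{path}"],
        check=True,
        capture_output=True,
    )
    return result.stdout


def drifted_paths(committed: Dict[str, bytes], deployed: Dict[str, bytes]) -> List[str]:
    """Paths whose committed content differs from, or is absent in, the deployed copy."""
    drift = []
    for path in sorted(set(committed) | set(deployed)):
        # A missing entry on either side counts as drift.
        if committed.get(path) != deployed.get(path):
            drift.append(path)
    return drift
```

An empty result for the prune and swarm modules is the "diff clean" condition; any listed path means a from-git rebuild would not reproduce what was verified.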
Did NOT read JOURNAL-2pc before this verdict (anti-anchoring). Verdict formed from plan + committed code + my own cold re-run on cc-ci.
DoD (narrowed scope)
- PC1: Conservative prune policy. No reflexive `docker image prune -af`. NEVER prune during a deploy/test run. Keep base/in-use images. Prune only dangling + age-gated old layers, only under genuine disk pressure. Per-run teardown still removes the run's volumes/secrets/services (sacred) but must NOT remove images.
- PC2: Local cache retained + authenticated (confirm). Daemon stays PAT-authenticated for `docker.io`; local image store retained across runs, teardowns, reboots → repeat deploy reuses local layers (no re-download), at most an authenticated manifest check.
- PC3: Verified + documented. Adversary proof: deploy → teardown → redeploy does NOT re-download layers (via `docker events` / pull output / measured pull-time drop); normal run doesn't evict cached base images; disk bounded WITHOUT `-af`. docs/ notes policy; deviations in DECISIONS.md.
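For the PC3 proof via pull output, a small classifier can separate a cold pull from a warm one. This is a sketch under the assumption that the captured text is standard `docker pull` output, keyed on the status phrases quoted in this review ("Pull complete", "Downloaded newer", "Image is up to date"); `classify_pull` is a hypothetical helper.

```python
def classify_pull(output: str) -> str:
    """Classify captured `docker pull` output as 'cold', 'warm', or 'unknown'."""
    # Count per-layer completion lines; any of these means layers were downloaded.
    layers_pulled = sum(
        1 for line in output.splitlines() if line.strip().endswith("Pull complete")
    )
    if layers_pulled > 0 or "Downloaded newer image" in output:
        return "cold"
    if "Image is up to date" in output:
        return "warm"
    # Errors (for example a bogus tag) or empty output land here.
    return "unknown"
```

A deploy → teardown → redeploy cycle satisfies PC3 in this model when the first pull classifies as `cold` and the repeat as `warm`.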
Pre-claim baseline recon (read-only; NOT a verdict — just what "before" looks like)
- autoPrune (`nix/modules/swarm.nix:15-19`): `flags = ["--all" "--filter" "until=24h"]`, no `--volumes`. `--all` evicts any image not in use by a container once it is older than 24h (the `until` filter keys on creation time) → would drop warm base images between runs (exactly PC1's complaint). The destructive `docker image prune -af` cited in JOURNAL-2 (507, 690-693) was a manual operator action mid-deploy, NOT this systemd unit. → PC1 must (a) tighten autoPrune off `--all` toward dangling-only/age-gated, AND (b) ensure no `-af` exists in any harness/janitor/teardown code path.
- Teardown image-removal grep target: DECISIONS.md:708 documents a manual cleanup recipe ending `docker image prune -f`. Must confirm the automated per-run teardown (run_recipe_ci.py / harness) does NOT `docker rmi` / `image prune` the run's images.
- No registry cache exists (confirmed) and per scope correction none should be built.
Break-it probes to run once PC1 claimed (anti-anchoring checklist)
- Teardown must NOT remove images. Deploy a recipe, capture `docker images` digest set, run the real teardown, re-check: the recipe's image layers must STILL be present locally.
- Redeploy reuses local layers (PC3 core). After teardown, redeploy the SAME recipe and confirm via `docker events` / pull output there is NO layer download (only a manifest check, or fully local). Measure the pull-time delta vs a genuine cold pull.
- No mid-run prune. Grep all code paths; confirm nothing prunes images while a deploy/test is active (the JOURNAL-2 landmine). autoPrune is daily/off-run only.
- Cache must NOT mask a broken image (cardinal rule). A pinned version still resolves to the correct digest; a genuinely-new/changed digest still triggers a real pull; the retained store must not serve a stale image for a recipe that actually changed.
- Disk stays bounded without `-af`. Confirm the surgical policy + disk-pressure trigger actually reclaims under pressure (don't trade rate-limit churn for a full disk).
- PAT auth intact + not leaked. Daemon still authenticated to docker.io (under 200/6h); PAT not exposed in published logs / dashboard / world-readable config.
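The first probe (teardown must not remove images) reduces to a set difference over two image listings. A minimal sketch, assuming each listing line has the form `repository:tag image-id`, as could be produced with `docker images --format '{{.Repository}}:{{.Tag}} {{.ID}}'`; the helper names are hypothetical.

```python
from typing import List, Set, Tuple


def parse_images(listing: str) -> Set[Tuple[str, str]]:
    """Parse 'repository:tag image-id' lines into a set of (ref, id) pairs."""
    images = set()
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            images.add((parts[0], parts[1]))
    return images


def evicted_by_teardown(before: str, after: str) -> List[Tuple[str, str]]:
    """Images present before teardown but missing afterwards, sorted."""
    return sorted(parse_images(before) - parse_images(after))
```

The probe passes when `evicted_by_teardown` returns an empty list for listings captured immediately before and after the real teardown.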