REVIEW-2pc — Adversary verdicts for Phase 2pc (sane image-prune policy)

SSOT: /srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md. DoD = PC1 + PC2 + PC3, each Adversary cold-verified here before Builder may write ## DONE to STATUS-2pc.md.

SCOPE CORRECTION (operator, 2026-05-29): the registry pull-through cache (old PC2) is DROPPED / deferred to IDEAS — single authenticated non-pruning host ⇒ Docker's own local image store already IS the cache. Phase 2pc is now prune-policy only.

Status: PASS @2026-05-29 (gate 2pc re-claim 9e73ebd) — PC1+PC2+PC3 cold-verified; F2pc-1 CLEARED

Verdict: PASS. Builder reconciled the git≠host drift (F2pc-1) via b9bbd25 (rename committed units docker-prune → ci-docker-prune; NixOS reserves docker-prune). Re-verified cold:

  • git == deploy source: git show HEAD:nix/modules/docker-prune.nix and swarm.nix are byte-identical to the host's /root/cc-ci copies (diff clean). Committed units now systemd.services.ci-docker-prune / .timer (docker-prune.nix:56,67) = what runs live.
  • live: ci-docker-prune.timer enabled+active (daily 00:00); old docker-prune.timer not-found. PC1 no-op @<80% (docker images 18→18 unchanged). PC3 redis re-confirm: cold Downloaded newer → warm Image is up to date (local reuse, manifest-only).
  • All PC1/PC2/PC3 substance from the prior pass still holds (below). A from-git rebuild now reproduces the verified system, and STATUS-2pc's ci-docker-prune.timer verify commands match.

F2pc-1 → CLOSED (Adversary, this verdict): git==host==ci-docker-prune, confirmed by byte-diff + live unit state.

Scope note on PC1 pressure branch: I verified the no-op (<80%) gate live and the ≥80% code path by read — it runs docker {container,image,builder} prune -f --filter until=24h. Crucially image prune without --all removes only dangling+old layers and cannot evict tagged base/in-use images (docker contract) — the cardinal "keep the cache" property is structural, not incidental. I did not fill the 64G disk to fire the ≥80% branch live (disproportionate); I rely on that code-read + Builder probe-5 evidence (2.34 GB dangling reclaimed, tagged images kept). The behavior I could break-test (no-op, teardown-keeps-images, bogus-tag-fails, cold→warm reuse) is all GREEN.


(superseded) FAIL @2026-05-29 (gate 2pc claim de6103d) — substance GREEN, git ≠ verified host

Verdict: FAIL — PC1/PC2/PC3 behavior is verified-GREEN on the live host, but the committed code does not match the deployed-and-"verified" artifact, so the claim is not reproducible from git (D8 contract violated). One blocking defect → F2pc-1 below. Fix is a one-shot reconciliation, not a redo.

What I cold-verified live (all GREEN on host — substance is sound)

  • PC1 prune logic (nix/modules/docker-prune.nix): triple-gated (≥80% /, no run-app stack ^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_, no converging service), prunes container|image|builder prune -f --filter until=24h only; never --all, never --volumes. Ran the service live @ ~27–31% /: printed "keeping local image cache, nothing to do", docker images count 17→17 unchanged. ✓
  • PC1 teardown keeps images: grep -rnE 'rmi|image rm|image prune|images -q' runner/ tests/conftest.py → only comments, no image removal. Live: after docker service rm the redis image (487efc061638) stayed present. ✓
  • PC1 autoPrune removed: committed swarm.nix no longer sets autoPrune (left default off); daemon enable=true only. A fresh rebuild creates no autoPrune unit. ✓
  • PC2 PAT-auth + retention: docker info → Username: nptest2; /root/.docker/config.json → /run/secrets/rendered/docker-config.json (sops, symlink); auths has https://index.docker.io/v1/. No registry mirrors (cache correctly dropped). ✓
  • PC3 cold→teardown→warm (live, redis:7-alpine, real daemon = abra/swarm pull path): COLD = 7 layers "Pull complete" / "Downloaded newer"; service up 1/1 → service rm; image retained; WARM re-pull = "Image is up to date" (no layer download, manifest-only). ✓
  • Break-it (cardinal rule): docker pull redis:<bogus-tag> → manifest unknown error. Retained store does not mask a broken/changed image. ✓
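
A minimal sketch of the PC1 triple gate as I read it, not the committed cc-ci-docker-prune script. The function names (disk_usage_pct, should_prune, run_prune) and the way inputs are passed in are assumptions for illustration; only the 80% threshold, the stack-name pattern, and the three age-gated prune commands come from the bullets above. How the real script detects a converging service is not shown in this review, so it is taken here as a plain count.

```shell
#!/bin/sh
# Hypothetical sketch of the PC1 gate. Helper names are illustrative only.

# Run-app stack name pattern, copied from the PC1 bullet above.
RUN_STACK_RE='^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_'

# Print the usage of / as a bare integer percentage (POSIX df output).
disk_usage_pct() {
  df -P / | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# should_prune USAGE_PCT NAMES CONVERGING
#   USAGE_PCT   integer disk usage of / in percent
#   NAMES       newline-separated service names currently present
#   CONVERGING  number of services still converging (assumed input)
# Prints "prune" or a "skip: ..." reason; returns 0 only when all gates pass.
should_prune() {
  if [ "$1" -lt 80 ]; then
    echo "skip: disk below threshold"; return 1
  fi
  if printf '%s\n' "$2" | grep -Eq "$RUN_STACK_RE"; then
    echo "skip: run-app stack present"; return 1
  fi
  if [ "$3" -gt 0 ]; then
    echo "skip: service converging"; return 1
  fi
  echo "prune"
}

# Pressure branch as read from the review: age-gated, never --all, never --volumes.
# Defined but not invoked here.
run_prune() {
  docker container prune -f --filter until=24h
  docker image prune -f --filter until=24h
  docker builder prune -f --filter until=24h
}
```

Under these assumptions, should_prune 27 "" 0 prints "skip: disk below threshold", which is the no-op branch observed live. Keeping the decision separate from run_prune lets the gate be exercised without a docker daemon.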

Why FAIL anyway — F2pc-1 (blocking): committed code ≠ verified host

  • origin/main HEAD de6103d (= the claim(2pc) commit) defines the units as systemd.services.docker-prune / systemd.timers.docker-prune (nix/modules/docker-prune.nix:56,67).
  • The live, "verified" host runs ci-docker-prune.service / ci-docker-prune.timer (enabled+active, next daily 00:00), built from uncommitted source in /root/cc-ci (/root/cc-ci is not even a git repo; its module has systemd.services.ci-docker-prune).
  • Consequences: (1) the artifact the Builder "deployed+verified" was never committed — git does not reproduce the verified system (a D8/fresh rebuild yields docker-prune.*, a different unit name than what was verified); (2) STATUS-2pc's own HOW-to-verify commands reference ci-docker-prune.timer, which a from-git rebuild will report not-found → a cold verifier following STATUS against a git-built host gets a false FAIL.
  • This is a reproducibility/integrity defect, not a behavioral one. The script body is the same (cc-ci-docker-prune); only the systemd unit wrapper name diverges.
  • To clear: make git == the deployed host. Either commit the ci-docker-prune naming actually deployed (push /root/cc-ci's docker-prune.nix), OR rename the module's units back to docker-prune, nixos-rebuild switch, and update STATUS-2pc verify commands to match. Then I re-verify that the git rev builds exactly the units (ci-docker-prune or docker-prune) that STATUS documents. (Also confirm the stale docker-prune.service [linked,ignored] leftover is harmless / GC'd on next rebuild.)
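
The git-vs-host byte comparison behind F2pc-1 can be sketched as below. This is an illustration under assumptions, not the command set actually run: drift_check is a hypothetical helper, and it assumes the host copies sit under a root directory with the same relative paths as the repo.

```shell
#!/bin/sh
# Hypothetical helper: compare committed blobs at HEAD with deployed copies.

# drift_check REPO HOST_ROOT PATH...
# Prints "OK <path>" when HEAD:<path> is byte-identical to HOST_ROOT/<path>,
# "DRIFT <path>" otherwise. Returns non-zero if any path drifted.
# Caveat: a path missing from HEAD compares as empty input, so it only
# reports OK if the host file is also empty.
drift_check() {
  repo=$1; host_root=$2; shift 2
  status=0
  for path in "$@"; do
    if git -C "$repo" show "HEAD:$path" 2>/dev/null | cmp -s - "$host_root/$path"; then
      echo "OK $path"
    else
      echo "DRIFT $path"
      status=1
    fi
  done
  return "$status"
}
```

A call such as drift_check <repo-checkout> /root/cc-ci nix/modules/docker-prune.nix nix/modules/swarm.nix would cover the two modules named in this review; the checkout location is left as a placeholder because the review does not state it.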

Did NOT read JOURNAL-2pc before this verdict (anti-anchoring). Verdict formed from plan + committed code + my own cold re-run on cc-ci.

DoD (narrowed scope)

  • PC1 — Conservative prune policy. No reflexive docker image prune -af. NEVER prune during a deploy/test run. Keep base/in-use images. Prune only dangling + age-gated old layers, only under genuine disk pressure. Per-run teardown still removes the run's volumes/secrets/services (sacred) but must NOT remove images.
  • PC2 — Local cache retained + authenticated (confirm). Daemon stays PAT-authenticated for docker.io; local image store retained across runs, teardowns, reboots → repeat deploy reuses local layers (no re-download), at most an authenticated manifest check.
  • PC3 — Verified + documented. Adversary proof: deploy → teardown → redeploy does NOT re-download layers (via docker events/pull output / measured pull-time drop); normal run doesn't evict cached base images; disk bounded WITHOUT -af. docs/ notes policy; deviations in DECISIONS.md.

Pre-claim baseline recon (read-only; NOT a verdict — just what "before" looks like)

  • autoPrune (nix/modules/swarm.nix:15-19): flags = ["--all" "--filter" "until=24h"], no --volumes. --all evicts any image unused for 24h → would drop warm base images between runs (exactly PC1's complaint). The destructive docker image prune -af cited in JOURNAL-2 (507, 690-693) was a manual operator action mid-deploy, NOT this systemd unit. → PC1 must (a) tighten autoPrune off --all toward dangling-only/age-gated, AND (b) ensure no -af exists in any harness/janitor/teardown code path.
  • Teardown image-removal grep target: DECISIONS.md:708 documents a manual cleanup recipe ending docker image prune -f. Must confirm the automated per-run teardown (run_recipe_ci.py / harness) does NOT docker rmi / image prune the run's images.
  • No registry cache exists (confirmed) and per scope correction none should be built.
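
The teardown grep used in this review (pattern 'rmi|image rm|image prune|images -q') can be wrapped so that comment-only hits are filtered out. A rough sketch, with scan_image_removal as a hypothetical name; it treats only full-line # comments as comments and assumes no ':' in file paths.

```shell
#!/bin/sh
# Hypothetical helper: list non-comment lines that could remove images.

# scan_image_removal PATH...
# Recursively greps the given paths with the review's pattern and drops
# hits whose matched line starts with '#'. Prints nothing when clean.
# The pattern is broad ("rmi" also matches inside unrelated words), so
# remaining hits still need a human read.
scan_image_removal() {
  grep -rnHE 'rmi|image rm|image prune|images -q' "$@" \
    | grep -vE '^[^:]*:[0-9]+:[[:space:]]*#' || [ $? -eq 1 ]
}
```

Run against runner/ and tests/conftest.py, empty output would correspond to the "only comments, no image removal" finding.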

Break-it probes to run once PC1 claimed (anti-anchoring checklist)

  1. Teardown must NOT remove images. Deploy a recipe, capture docker images digest set, run the real teardown, re-check: the recipe's image layers must STILL be present locally.
  2. Redeploy reuses local layers (PC3 core). After teardown, redeploy the SAME recipe and confirm via docker events / pull output there is NO layer download (only a manifest check, or fully local). Measure the pull-time delta vs a genuine cold pull.
  3. No mid-run prune. Grep all code paths; confirm nothing prunes images while a deploy/test is active (the JOURNAL-2 landmine). autoPrune is daily/off-run only.
  4. Cache must NOT mask a broken image (cardinal rule). A pinned version still resolves to the correct digest; a genuinely-new/changed digest still triggers a real pull — the retained store must not serve a stale image for a recipe that actually changed.
  5. Disk stays bounded without -af. Confirm the surgical policy + disk-pressure trigger actually reclaims under pressure (don't trade rate-limit churn for a full disk).
  6. PAT auth intact + not leaked. Daemon still authenticated to docker.io (under 200/6h); PAT not exposed in published logs / dashboard / world-readable config.
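
Probes 1 and 2 can be partly mechanised. The sketch below is an assumption about how one might do it, not the harness used for this verdict: classify_pull and lost_images are hypothetical helpers. The two status strings it looks for are the ones docker pull prints ("Downloaded newer image", "Image is up to date").

```shell
#!/bin/sh
# Hypothetical helpers for the teardown-keeps-images and cold/warm probes.

# classify_pull: read `docker pull` output on stdin.
# Prints "cold" if layers were downloaded, "warm" if the local image was
# reused, "unknown" otherwise (e.g. a failed pull).
classify_pull() {
  out=$(cat)
  case $out in
    *"Downloaded newer image"*) echo cold ;;
    *"Image is up to date"*)    echo warm ;;
    *)                          echo unknown ;;
  esac
}

# lost_images BEFORE AFTER
# Each file holds one image ID per line. Prints IDs present in BEFORE
# but missing from AFTER; prints nothing if every image survived.
lost_images() {
  grep -Fvx -f "$2" "$1" || [ $? -eq 1 ]
}
```

One way to feed it: capture docker images -q --no-trunc into a file before and after the real teardown, then run lost_images on the two files; any output is an image the teardown removed. Piping the redeploy's docker pull output into classify_pull should print "warm" if the local store was reused.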