claim(2pc): PC1 conservative prune deployed+verified; PC2/PC3 local-store cache confirmed

ci-docker-prune (gated surgical prune) live on cc-ci: old autoPrune --all gone, new timer enabled (daily), no-ops below 80% disk keeping the local image cache, never --all/--volumes. Daemon stays PAT-authenticated (nptest2); /var/lib/docker retained across rebuild.

PC3 proof: redis:7-alpine deploy->teardown(service rm, image retained)->redeploy = "Image is up to date", no layer re-download (cold 5303ms -> warm 674ms).

Docs: runbook "Image cache & prune policy", warm.md, DECISIONS Phase-2pc, IDEAS (registry pull-through cache deferred + revisit trigger).

Gate 2pc CLAIMED, awaiting Adversary cold-verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:42:36 +01:00
parent 16d177e73a
commit de6103d41d
5 changed files with 185 additions and 22 deletions
--- a/machine-docs/STATUS-2pc.md
+++ b/machine-docs/STATUS-2pc.md
@@ -1,22 +1,93 @@
 # STATUS — Phase 2pc (sane image-prune policy)

 **SSOT:** `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`
-**Scope (operator correction 2026-05-29):** PC1 conservative prune + PC2/PC3 confirm-and-verify
-local-store retention/auth. **Registry pull-through cache DROPPED** (deferred to IDEAS / Phase 2b).
+**Scope (operator correction 2026-05-29):** PC1 conservative prune + PC2/PC3 confirm+verify
+local-store retention/auth. **Registry pull-through cache DROPPED** (deferred → `cc-ci-plan/IDEAS.md`
+ DECISIONS Phase-2pc; no registry code was written).

-## Phase: PC1 implemented, deploy+verify in flight (NOT yet claimed)
+## Gate: 2pc — CLAIMED, awaiting Adversary

-In flight: build the new prune module onto cc-ci via `nixos-rebuild switch`, then run the
-deploy→teardown→redeploy layer-reuse proof. Gate will be CLAIMED once verified on the real host.
+All of PC1/PC2/PC3 implemented, deployed to cc-ci, and Builder-verified on the real host. Commit
+sha for this claim: see `claim(2pc)` HEAD. WHAT / HOW / EXPECTED / WHERE below.

-## What changed (the diff)
+---

- `nix/modules/swarm.nix` — removed `virtualisation.docker.autoPrune` (it ran
-  `docker system prune --force --all --filter until=24h` daily; `--all` evicts every image not used
-  by a *running* container → wiped cached recipe base images → cold re-pull → Hub rate-limit churn).
- `nix/modules/docker-prune.nix` (NEW) — daily `systemd.timer` + oneshot `systemd.service`
-  `docker-prune` running a surgical, triple-gated prune. Imported in `nix/hosts/cc-ci/configuration.nix`.
- Teardown (`runner/harness/lifecycle.py::teardown_app`) UNCHANGED — already removes only
-  services/volumes/secrets/.env, never images (PC1 teardown requirement already held).
+### PC1 — Conservative prune policy

-(Verification context — WHAT/HOW/EXPECTED/WHERE — will be filled in here at gate-claim time.)
+**WHAT.** Removed the daily `docker system prune --all` and replaced it with a surgical, triple-gated
+prune that keeps Docker's local image store (the cache) warm.
+- **WHERE.** `nix/modules/docker-prune.nix` (NEW, unit `ci-docker-prune` service+timer);
+  `nix/modules/swarm.nix` (`virtualisation.docker.autoPrune` block removed, left OFF=default);
+  `nix/hosts/cc-ci/configuration.nix` (imports `docker-prune.nix`). Deployed via
+  `nixos-rebuild switch --flake path:/root/cc-ci#cc-ci`.
+- The prune **no-ops unless ALL** hold: (1) `/` usage ≥ 80%, (2) no run-app stack live
+  (`<=4char>-<6hex>_ci_commoninternet_net_*`), (3) no swarm service converging (unmet replicas).
+  When it runs: `docker {container,image,builder} prune -f --filter until=24h` — **dangling+old only,
+  never `--all`, never `--volumes`.**
+- Teardown unchanged: `runner/harness/lifecycle.py::teardown_app` removes services/volumes/secrets/
+  .env and **no images** (`grep -n 'rmi\|image rm\|image prune' runner/ tests/conftest.py` = empty).
+
+**HOW to verify (cold, Adversary's own checks):**
+```sh
+ssh cc-ci 'systemctl is-enabled docker-prune.timer'                    # EXPECT: not-found (autoPrune gone)
+ssh cc-ci 'systemctl is-enabled ci-docker-prune.timer; systemctl is-active ci-docker-prune.timer'
+ssh cc-ci 'systemctl list-timers ci-docker-prune.timer --no-pager'     # EXPECT: enabled/active, NEXT daily 00:00
+ssh cc-ci 'systemctl start ci-docker-prune.service; \
+           journalctl -u ci-docker-prune.service -n 3 --no-pager'      # EXPECT (disk<80%): "keeping local image cache, nothing to do"
+ssh cc-ci 'docker images -q | wc -l'                                   # EXPECT: unchanged before==after the manual run
+# source-read the gates + flags (no --all, no --volumes):
+grep -nE "until=24h|--all|--volumes|prune" nix/modules/docker-prune.nix
+grep -n "autoPrune" nix/modules/swarm.nix                              # EXPECT: only a comment, no enable=true
+```
+**EXPECTED:** old timer not-found; `ci-docker-prune.timer` enabled+active (daily); manual run below
+80% prints the no-op line and removes nothing; module flags are `--filter until=24h` only (never
+`--all`/`--volumes`); swarm.nix has no live autoPrune.
+
+### PC2 — Local cache retained + authenticated (confirm)
+
+**WHAT.** Daemon stays PAT-authenticated; `/var/lib/docker` local image store persists across
+runs/teardowns/reboots; no code change (sops `dockerhub_auth` → `/root/.docker/config.json` in
+`nix/modules/secrets.nix`, unchanged).
+**HOW / EXPECTED:**
+```sh
+ssh cc-ci 'docker info 2>/dev/null | grep Username'        # EXPECT: Username: nptest2
+ssh cc-ci 'ls -l /root/.docker/config.json'                # EXPECT: -> /run/secrets/rendered/docker-config.json (0600)
+ssh cc-ci 'docker images | wc -l'                          # EXPECT: many recipe images retained (was 21 leaf images)
+```
+
+### PC3 — Deploy → teardown → redeploy reuses local layers (no re-download)
+
+**WHAT.** A previously-pulled image is retained through teardown and a redeploy reuses local layers;
+only an authenticated manifest check remains. Builder-proven with a real swarm deploy/teardown/
+redeploy on `redis:7-alpine` (docker.io through the authenticated daemon — same pull path abra/swarm
+use).
+**HOW (Adversary, reproducible):**
+```sh
+ssh cc-ci 'bash -s' <<'PROOF'
+IMG=redis:7-alpine; docker rmi -f "$IMG" >/dev/null 2>&1 || true
+t0=$(date +%s%N); docker pull "$IMG" 2>&1 | grep -E "Pull complete|Downloaded|Already exists|up to date"; t1=$(date +%s%N)
+echo COLD_MS=$(((t1-t0)/1000000))
+docker service create --name pc3 --replicas 1 "$IMG" sleep 120 >/dev/null 2>&1; docker service ls --filter name=pc3 --format '{{.Replicas}}'
+docker service rm pc3 >/dev/null 2>&1
+echo retained: $(docker images redis:7-alpine --format '{{.ID}}')
+t2=$(date +%s%N); docker pull "$IMG" 2>&1 | grep -E "Pull complete|Downloaded|Already exists|up to date"; t3=$(date +%s%N)
+echo WARM_MS=$(((t3-t2)/1000000)); docker rmi -f "$IMG" >/dev/null 2>&1
+PROOF
+```
+**EXPECTED:** COLD pull shows layer "Pull complete" lines (download) — Builder saw 6 layers,
+COLD_MS≈5303; after `service rm` the image ID is still listed (retained); WARM pull shows
+`Image is up to date` (no layer download), WARM_MS≈674 (≈8× faster, manifest-only). Confirms the
+local store is the cache, survives teardown, and a redeploy needs no Docker-Hub layer download.
+Optional fuller proof: a real recipe cycle
+`RECIPE=custom-html-tiny PR=0 STAGES=install cc-ci-run runner/run_recipe_ci.py` run twice — the 2nd
+deploy shows no image-layer download.
+
+---
+
+## DoD checklist (Builder view — Adversary owns the verdict in REVIEW-2pc.md)
+- [x] **PC1** — autoPrune `--all` removed; surgical gated `ci-docker-prune` deployed; teardown keeps images.
+- [x] **PC2** — daemon PAT-authenticated (nptest2); local store retained across rebuild.
+- [x] **PC3** — deploy→teardown→redeploy reuses local layers (no re-download), measured; disk bounded
+      (31%) without `-af`. Documented (runbook/warm/DECISIONS/IDEAS).
+
+## Not blocked. No standing blockers.
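
Reviewer sketch (not part of the commit): the PC1 section in the diff above describes a prune that no-ops unless three gates all hold. The script below is one way that decision could be written, offered only as an illustration. It is not the contents of `nix/modules/docker-prune.nix`; the function names, the noop messages, the `--run` switch, and the regex reading of `<=4char>-<6hex>_ci_commoninternet_net_*` are all assumptions, and `df --output` assumes GNU coreutils.

```sh
#!/bin/sh
# Hypothetical sketch of the PC1 triple gate; NOT the real docker-prune.nix script.
THRESHOLD=80  # percent usage of / at which pruning becomes eligible (from the STATUS text)

# Gate 1 input: usage of / as a bare integer percent (GNU df assumed).
disk_pct() { df --output=pcent / | tail -n 1 | tr -dc '0-9'; }

# Gate 2 input: number of services that belong to a live run-app stack.
# The regex is an assumed reading of "<=4char>-<6hex>_ci_commoninternet_net_*".
live_stacks() {
  docker service ls --format '{{.Name}}' |
    grep -cE '^.{1,4}-[0-9a-f]{6}_ci_commoninternet_net_' || true
}

# Gate 3 input: number of services whose running replicas differ from desired.
converging() {
  docker service ls --format '{{.Replicas}}' |
    awk -F'[/ ]' '$1 != $2 { n++ } END { print n + 0 }'
}

# Pure decision: $1 = disk percent, $2 = live stack services, $3 = converging services.
# Prints "prune" only when all three gates hold, otherwise a noop reason.
decide() {
  if [ "$1" -lt "$THRESHOLD" ]; then echo "noop: disk below threshold"; return 0; fi
  if [ "$2" -gt 0 ]; then echo "noop: run-app stack live"; return 0; fi
  if [ "$3" -gt 0 ]; then echo "noop: service converging"; return 0; fi
  echo "prune"
}

# Docker is only touched when the script is invoked with --run.
if [ "${1:-}" = "--run" ]; then
  if [ "$(decide "$(disk_pct)" "$(live_stacks)" "$(converging)")" = "prune" ]; then
    # Dangling and old objects only: no --all, no --volumes.
    for kind in container image builder; do
      docker "$kind" prune -f --filter until=24h
    done
  fi
fi
```

Keeping `decide` free of Docker calls means the gate logic can be checked with plain numbers (for example `decide 31 0 0`) on any machine, while the three probes stay thin wrappers around `df` and `docker service ls`.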