feat(2pc): PC1 conservative prune — drop autoPrune --all, add gated surgical docker-prune

Removes virtualisation.docker.autoPrune (daily `docker system prune --all` evicted in-use base images → cold re-pull → Hub rate-limit churn, JOURNAL-2).

Adds modules/docker-prune.nix: daily timer + oneshot that prunes only dangling+until=24h, gated on disk pressure (>=80%) AND no run-app live AND no swarm service converging; never --all, never --volumes.

Teardown unchanged (never removes images). Registry pull-through cache dropped per operator scope correction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:29:54 +01:00
parent e42753c17c
commit 16d177e73a
6 changed files with 179 additions and 12 deletions
--- a/machine-docs/BACKLOG-2pc.md
+++ b/machine-docs/BACKLOG-2pc.md
@@ -0,0 +1,26 @@
+# BACKLOG — Phase 2pc (sane image-prune policy)
+
+SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`.
+Scope (post operator correction 2026-05-29): **PC1 prune policy + confirm local-store
+retention/auth ONLY.** The registry:2 pull-through cache is **dropped** (deferred to IDEAS /
+Phase 2b — revisit only if multi-node OR a measured cold-deploy bottleneck on recreate-surviving
+storage).
+
+## Build backlog
+
+- [ ] **PC1 — Conservative prune policy.** Remove `virtualisation.docker.autoPrune` (`--all` evicts
+      in-use base images → forced cold re-pull → rate-limit). Replace with a surgical, gated prune:
+      dangling + `until=24h` only, NEVER `--all`/`--volumes`; gated on (a) genuine disk pressure
+      (`/` ≥ 80%), (b) no run-app stack live, (c) no swarm service converging (mid-pull). Teardown
+      already removes only services/volumes/secrets/.env — NOT images (verified) — keep it that way.
+- [ ] **PC2 — Confirm local cache retained + authenticated.** Daemon stays PAT-authenticated
+      (`docker info` Username=nptest2, sops `dockerhub_auth` → `/root/.docker/config.json`); local
+      image store `/var/lib/docker` persists across runs/teardowns/reboots. No code change expected —
+      confirm + document.
+- [ ] **PC3 — Verify + document.** Deploy → teardown → redeploy reuses local layers (no
+      re-download); disk bounded without `-af`. Update `docs/runbook.md` + `docs/` prune note;
+      record the policy + the dropped-registry-cache deviation in `DECISIONS.md`.
+
+## Adversary findings
+
+(Adversary owns this section.)
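
One way the PC3 layer-reuse check above could be approximated: snapshot the local image store before teardown and again after redeploy, then confirm that no image reference disappeared or changed ID. This is only a sketch, not the harness's actual verification. The helper names `parse_image_ls` and `evicted_or_changed` are hypothetical, and the input is assumed to come from `docker image ls --no-trunc --format '{{.Repository}}:{{.Tag}} {{.ID}}'`.

```python
def parse_image_ls(text):
    """Map `repo:tag` -> image ID from lines shaped like `<repo:tag> <id>`.

    Assumed input: output of
    `docker image ls --no-trunc --format '{{.Repository}}:{{.Tag}} {{.ID}}'`.
    Lines that do not have exactly two fields are ignored.
    """
    images = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            images[parts[0]] = parts[1]
    return images


def evicted_or_changed(before, after):
    """Return refs that existed before teardown but are missing, or have a
    different image ID, in the snapshot taken after redeploy."""
    return sorted(
        ref for ref, image_id in before.items() if after.get(ref) != image_id
    )
```

An empty result suggests the redeploy could be served from the local store. It is a proxy only, since it does not observe network traffic or Hub manifest checks.
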
--- a/machine-docs/JOURNAL-2pc.md
+++ b/machine-docs/JOURNAL-2pc.md
@@ -0,0 +1,47 @@
+# JOURNAL — Phase 2pc (sane image-prune policy)
+
+Append-only reasoning log. Facts/verification for the Adversary live in STATUS-2pc.md.
+
+## 2026-05-29 — Orientation + scope correction
+
+Read SSOT `plan-phase2pc-image-cache.md` + plan.md §6.1/§7/§9. Operator issued a **scope
+correction** mid-orientation: **drop the registry:2 pull-through cache.** Rationale (operator):
+single host → Docker's own local image store already IS the cache; re-deploys reuse local layers
+with no re-download; the daemon is PAT-authenticated so residual manifest checks sit under 200/6h.
+The churn was caused by **over-pruning** (`docker image prune -af` wiping the store), not a missing
+cache. A separate registry pays off only with multi-node or separately survivable storage; we have neither.
+**I had not yet written any registry code** (still orienting) → nothing to revert.
+
+Phase 2pc is now **PC1 (prune policy) + PC2/PC3 (confirm + verify local-store retention/auth).**
+
+### Findings from orientation (why the fix is one module)
+
+- The ONLY automated image pruner in the whole repo is
+  `virtualisation.docker.autoPrune = { flags = ["--all" "--filter" "until=24h"]; }` in
+  `nix/modules/swarm.nix`. NixOS renders this as `docker system prune --force --all --filter until=24h`
+  daily. `--all` removes every image **not used by a running container** — between runs there are no
+  test apps running, so it evicts the cached recipe base images → cold re-pull on the next run. That
+  is exactly the prune→re-pull→rate-limit churn documented in JOURNAL-2 (lines 507/542/690-693).
+- `runner/harness/lifecycle.py::teardown_app` removes services (abra undeploy / `docker stack rm`),
+  volumes, secrets, and the `.env` — and **no images** (`grep` for `rmi`/`image rm`/`image prune` in
+  `runner/` + `tests/conftest.py` is empty). So PC1's "teardown must NOT remove images" already holds.
+- `janitor`, `warm_reconcile.py`, `nightly-sweep.nix`, `drone*.nix`, `.drone.yml` — none prune images.
+- Daemon is already PAT-authenticated: `docker info` → `Username: nptest2`; sops `dockerhub_auth`
+  (base64 `nptest2:<PAT>`) → `sops.templates."docker-config.json"` → `/root/.docker/config.json`
+  (`nix/modules/secrets.nix`). PC2 needs no change — confirm + document.
+- Disk on cc-ci: `/` is 64G, 19G used (31%), **43G free**, i.e. bounded; aggressive `--all` is
+  unnecessary, which is the whole premise.
+
+### PC1 design
+
+Replace `autoPrune` with a dedicated `nix/modules/docker-prune.nix`: a daily `systemd.timer` +
+oneshot `systemd.service` running a surgical, **triple-gated** prune:
+1. **Disk-pressure gate** — do nothing unless `/` usage ≥ 80% (Docker's local store IS our cache;
+   keep it warm; reclaim only under genuine pressure).
+2. **No-run gate** — skip if any run-app stack (`<=4char>-<6hex>_ci_commoninternet_net_*`) is live
+   (mid-pull layers can look prunable; "never prune mid-run").
+3. **No-converge gate** — skip if any swarm service has unmet replicas (a deploy/pull in flight,
+   incl. infra warm redeploys).
+When all gates pass: `docker {container,image,builder} prune -f --filter until=24h` — dangling +
+age-gated only. NEVER `--all` (keeps tagged base/in-use images), NEVER `--volumes` (warm canonical
+data, per swarm.nix's existing comment).
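
The triple gate in the PC1 design above can be sketched as a pure decision function. This is an illustration only, not the contents of `docker-prune.nix`. The names `should_prune`, `parse_services` and `RUN_APP_RE` are hypothetical, the regex is one assumed reading of `<=4char>-<6hex>_ci_commoninternet_net_*`, and the service list is assumed to come from `docker service ls --format '{{.Name}} {{.Replicas}}'`.

```python
import re
import shutil

DISK_THRESHOLD_PCT = 80  # PC1 design: act only when `/` usage is >= 80%

# Assumed regex reading of `<=4char>-<6hex>_ci_commoninternet_net_*`.
RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_")

# Dangling + age-gated only: no `--all`, no `--volumes`.
PRUNE_COMMANDS = [
    ["docker", kind, "prune", "-f", "--filter", "until=24h"]
    for kind in ("container", "image", "builder")
]


def disk_usage_pct(path="/"):
    """Used space as a percentage of total (may differ slightly from df's Use%)."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def parse_services(text):
    """Parse lines shaped like `<name> <running>/<desired>` into tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or "/" not in parts[1]:
            continue  # skip blank or unexpected lines
        running, desired = parts[1].split("/", 1)
        rows.append((parts[0], int(running), int(desired)))
    return rows


def should_prune(disk_pct, services, threshold=DISK_THRESHOLD_PCT):
    """Apply the three gates in order; return (decision, reason)."""
    if disk_pct < threshold:
        return False, "disk below threshold"
    if any(RUN_APP_RE.match(name) for name, _, _ in services):
        return False, "run-app stack live"
    if any(running < desired for _, running, desired in services):
        return False, "service converging"
    return True, "all gates passed"
```

A caller would feed `disk_usage_pct("/")` and the parsed `docker service ls` output into `should_prune`, and run the entries of `PRUNE_COMMANDS` only when the decision is `True`. Keeping the decision separate from the Docker calls makes each gate testable without a daemon.
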
--- a/machine-docs/STATUS-2pc.md
+++ b/machine-docs/STATUS-2pc.md
@@ -0,0 +1,22 @@
+# STATUS — Phase 2pc (sane image-prune policy)
+
+**SSOT:** `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`
+**Scope (operator correction 2026-05-29):** PC1 conservative prune + PC2/PC3 confirm-and-verify
+local-store retention/auth. **Registry pull-through cache DROPPED** (deferred to IDEAS / Phase 2b).
+
+## Phase: PC1 implemented, deploy+verify in flight (NOT yet claimed)
+
+In flight: build the new prune module onto cc-ci via `nixos-rebuild switch`, then run the
+deploy→teardown→redeploy layer-reuse proof. Gate will be CLAIMED once verified on the real host.
+
+## What changed (the diff)
+
+- `nix/modules/swarm.nix` — removed `virtualisation.docker.autoPrune` (it ran
+  `docker system prune --force --all --filter until=24h` daily; `--all` evicts every image not used
+  by a *running* container → wiped cached recipe base images → cold re-pull → Hub rate-limit churn).
+- `nix/modules/docker-prune.nix` (NEW) — daily `systemd.timer` + oneshot `systemd.service`
+  `docker-prune` running a surgical, triple-gated prune. Imported in `nix/hosts/cc-ci/configuration.nix`.
+- Teardown (`runner/harness/lifecycle.py::teardown_app`) UNCHANGED — already removes only
+  services/volumes/secrets/.env, never images (PC1 teardown requirement already held).
+
+(Verification context — WHAT/HOW/EXPECTED/WHERE — will be filled in here at gate-claim time.)