claim(2pc): PC1 conservative prune deployed+verified; PC2/PC3 local-store cache confirmed

ci-docker-prune (gated surgical prune) live on cc-ci: old autoPrune --all gone, new timer enabled (daily), no-ops below 80% disk keeping the local image cache, never --all/--volumes. Daemon stays PAT-authenticated (nptest2); /var/lib/docker retained across rebuild. PC3 proof: redis:7-alpine deploy->teardown(service rm, image retained)->redeploy = "Image is up to date", no layer re-download (cold 5303ms -> warm 674ms). Docs: runbook "Image cache & prune policy", warm.md, DECISIONS Phase-2pc, IDEAS (registry pull-through cache deferred + revisit trigger). Gate 2pc CLAIMED, awaiting Adversary cold-verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:42:36 +01:00
parent 16d177e73a
commit de6103d41d
5 changed files with 185 additions and 22 deletions
--- a/machine-docs/DECISIONS.md
+++ b/machine-docs/DECISIONS.md
@@ -724,3 +724,29 @@ Standing policy for all Phase-2 (and later) recipe OIDC/SSO testing:
 Consequences: DEFERRED #9 (authentik enrollment) re-entry trigger narrowed to "a recipe requires
 authentik"; F2-7 (authentik backend) is not a DONE blocker. plan-sso-dep-testing.md §6 updated by the
 orchestrator to match.
+
+## Phase 2pc — image-prune policy; local store IS the cache; registry pull-through DROPPED (2026-05-29) — SETTLED
+Decision (PC1): removed `virtualisation.docker.autoPrune` (it ran `docker system prune --force --all
+--filter until=24h` daily). The `--all` evicts every image not used by a *running* container —
+between runs no test apps run, so it wiped the cached recipe base images → cold re-pull → Docker-Hub
+rate-limit churn (JOURNAL-2 507/542/690-693). Replaced with `nix/modules/docker-prune.nix`: the
+`ci-docker-prune` daily timer + oneshot, a **surgical triple-gated** prune that no-ops unless ALL of
+(1) `/` ≥ 80%, (2) no run-app stack live, (3) no swarm service converging; and when it runs prunes
+only **dangling images + stopped containers + dangling build cache, `until=24h`** — never `--all`
+(keeps tagged base/in-use images), never `--volumes` (warm canonical data). Teardown
+(`lifecycle.teardown_app`) already removes only services/volumes/secrets/.env, never images — kept.
+Why: on this **single host Docker's own local image store IS the cache** — a pulled image stays and
+redeploys reuse local layers with no re-download (proven: redis:7-alpine cold pull 5303ms w/ 6 layer
+downloads → after `service rm` teardown the image is retained → warm redeploy "Image is up to date"
+674ms, no bytes); the PAT-authenticated daemon (200/6h) makes the residual warm-deploy manifest check
+free of rate-limit pressure. So *keeping* the store recovers ~all the benefit a cache would give.
+
+Decision (registry pull-through cache): **DROPPED here, deferred to IDEAS / Phase 2b** (operator
+scope correction 2026-05-29, mid-phase). A `registry:2` pull-through cache's distinctive wins —
+multi-node fan-out, surviving prune/VM-rebuild on *separate* storage, cache-miss authentication —
+**don't apply** to a single authenticated non-pruning host (one node; co-located cache lost on a
+recreate anyway; daemon already authenticated). It would add a registry service + daemon-mirror
+config + cache GC for marginal gain. **Revisit ONLY if** (a) cc-ci goes multi-node, OR (b) Phase-2b
+measurement shows cold-deploy pull time is a real bottleneck AND the cache can live on
+recreate-surviving storage (Incus volume / host b1 path, not the VM's ephemeral disk). No registry
+code was written (caught during orientation) — nothing to revert.
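
Reviewer sketch (not part of the commit): the triple gate described under Decision (PC1) can be modelled roughly as below. This is a minimal illustration in Python, not the contents of `nix/modules/docker-prune.nix`. The names `HostState`, `should_prune` and `prune_commands` are hypothetical, and the exact Docker commands the real oneshot runs are an assumption inferred from the phrase "dangling images + stopped containers + dangling build cache, `until=24h`".

```python
from dataclasses import dataclass

# Gate (1) in the decision text: prune only when "/" is at or above 80% used.
DISK_THRESHOLD_PCT = 80


@dataclass(frozen=True)
class HostState:
    """Hypothetical snapshot of the three conditions the gate inspects."""
    root_disk_used_pct: float       # usage of "/" in percent
    run_app_stack_live: bool        # a run-app stack is currently deployed
    swarm_service_converging: bool  # some swarm service is still converging


def should_prune(state: HostState) -> bool:
    """Return True only when ALL three gates pass; otherwise the timer no-ops."""
    return (
        state.root_disk_used_pct >= DISK_THRESHOLD_PCT
        and not state.run_app_stack_live
        and not state.swarm_service_converging
    )


def prune_commands() -> list[list[str]]:
    """Assumed command set: conservative prunes only, never --all or --volumes.

    Without --all, `docker image prune` removes dangling images only and
    `docker builder prune` removes dangling build cache only, so tagged
    base images stay in the local store.
    """
    age_filter = ["--filter", "until=24h"]
    return [
        ["docker", "container", "prune", "--force", *age_filter],
        ["docker", "image", "prune", "--force", *age_filter],
        ["docker", "builder", "prune", "--force", *age_filter],
    ]


if __name__ == "__main__":
    state = HostState(root_disk_used_pct=62.0,
                      run_app_stack_live=False,
                      swarm_service_converging=False)
    if should_prune(state):
        for cmd in prune_commands():
            print(" ".join(cmd))
    else:
        print("gate closed: keeping local image cache")
```

With disk usage below the threshold, as in the usage example, the gate stays closed and nothing is pruned, which matches the "no-ops below 80% disk" behaviour claimed in the commit message.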