cc-ci/machine-docs/JOURNAL-2pc.md
autonomic-bot 16d177e73a feat(2pc): PC1 conservative prune — drop autoPrune --all, add gated surgical docker-prune
Removes virtualisation.docker.autoPrune (daily `docker system prune --all` evicted in-use base
images → cold re-pull → Hub rate-limit churn, JOURNAL-2). Adds modules/docker-prune.nix: daily
timer + oneshot that prunes only dangling+until=24h, gated on disk pressure (>=80%) AND no run-app
live AND no swarm service converging; never --all, never --volumes. Teardown unchanged (never
removes images). Registry pull-through cache dropped per operator scope correction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:30:07 +01:00


JOURNAL — Phase 2pc (sane image-prune policy)

Append-only reasoning log. Facts/verification for the Adversary live in STATUS-2pc.md.

2026-05-29 — Orientation + scope correction

Read SSOT plan-phase2pc-image-cache.md + plan.md §6.1/§7/§9. Operator issued a scope correction mid-orientation: drop the registry:2 pull-through cache. Rationale (operator): single host → Docker's own local image store already IS the cache; re-deploys reuse local layers with no re-download; the daemon is PAT-authenticated so residual manifest checks sit under 200/6h. The churn was caused by over-pruning (docker image prune -af wiping the store), not a missing cache. A separate registry only pays off multi-node / separate-survivable storage, which we are not. I had not yet written any registry code (still orienting) → nothing to revert.

Phase 2pc is now PC1 (prune policy) + PC2/PC3 (confirm + verify local-store retention/auth).

Findings from orientation (why the fix is one module)

  • The ONLY automated image pruner in the whole repo is virtualisation.docker.autoPrune = { flags = ["--all" "--filter" "until=24h"]; } in nix/modules/swarm.nix. NixOS renders this as docker system prune --force --all --filter until=24h daily. --all removes every image not used by a running container — between runs there are no test apps running, so it evicts the cached recipe base images → cold re-pull on the next run. That is exactly the prune→re-pull→rate-limit churn documented in JOURNAL-2 (lines 507/542/690-693).
  • runner/harness/lifecycle.py::teardown_app removes services (abra undeploy / docker stack rm), volumes, secrets, and the .env — and no images (grep for rmi/image rm/image prune in runner/ + tests/conftest.py is empty). So PC1's "teardown must NOT remove images" already holds.
  • janitor, warm_reconcile.py, nightly-sweep.nix, drone*.nix, .drone.yml — none prune images.
  • Daemon is already PAT-authenticated: docker info → Username: nptest2; sops dockerhub_auth (base64 nptest2:<PAT>) → sops.templates."docker-config.json" → /root/.docker/config.json (nix/modules/secrets.nix). PC2 needs no change: confirm + document.
  • Disk on cc-ci: / is 64G, 19G used, 43G free (31%) — bounded; aggressive --all is unnecessary, which is the whole premise.
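
To make the first finding concrete, here is a minimal sketch of which images each policy would select. It is a simplified model, not Docker's actual implementation: the `Image` fields, the `pruned` helper, and the image names are all hypothetical, and "in use" is reduced to a single boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    name: str
    tagged: bool       # False means dangling (no tag points at it)
    in_use: bool       # referenced by at least one existing container
    age_hours: float   # time since the image was created


def pruned(images, *, all_unused, until_hours):
    """Return the names a prune would remove under this simplified model."""
    removed = []
    for img in images:
        # In-use images and images younger than the age filter always survive.
        if img.in_use or img.age_hours < until_hours:
            continue
        # Without --all only dangling images qualify; with --all any unused one does.
        if all_unused or not img.tagged:
            removed.append(img.name)
    return removed


# Hypothetical store contents between CI runs: no test apps are running.
store = [
    Image("recipe-base:1.0", tagged=True, in_use=False, age_hours=720),
    Image("infra-proxy:2.3", tagged=True, in_use=True, age_hours=720),
    Image("<none>@old-build", tagged=False, in_use=False, age_hours=48),
    Image("<none>@fresh-build", tagged=False, in_use=False, age_hours=2),
]

old_policy = pruned(store, all_unused=True, until_hours=24)   # --all + until=24h
new_policy = pruned(store, all_unused=False, until_hours=24)  # dangling + until=24h

print(old_policy)  # ['recipe-base:1.0', '<none>@old-build']
print(new_policy)  # ['<none>@old-build']
```

Under the old flags the tagged but idle recipe base image is evicted, which is what forces the cold re-pull; the dangling-only policy leaves it in the local store.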

PC1 design

Replace autoPrune with a dedicated nix/modules/docker-prune.nix: a daily systemd.timer + oneshot systemd.service running a surgical, triple-gated prune:

  1. Disk-pressure gate — do nothing unless / usage ≥ 80% (Docker's local store IS our cache; keep it warm; reclaim only under genuine pressure).
  2. No-run gate — skip if any run-app stack (<=4char>-<6hex>_ci_commoninternet_net_*) is live (mid-pull layers can look prunable; "never prune mid-run").
  3. No-converge gate — skip if any swarm service has unmet replicas (a deploy/pull in flight, incl. infra warm redeploys).

When all gates pass: docker {container,image,builder} prune -f --filter until=24h, i.e. dangling + age-gated only. NEVER --all (keeps tagged base/in-use images), NEVER --volumes (warm canonical data, per swarm.nix's existing comment).
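
The three gates can be sketched as pure decision logic, separate from whatever the systemd oneshot actually runs. This is an illustration under stated assumptions, not the module's code: the function names and reason strings are hypothetical, the regex reads "<=4char>" as one to four lowercase alphanumerics, and "unmet replicas" is taken to mean running < desired.

```python
import math
import re

DISK_THRESHOLD_PCT = 80  # disk-pressure gate: act only at or above this usage

# Assumption: "<=4char>" means 1-4 lowercase alphanumerics; the real pattern may differ.
RUN_APP_STACK = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_")


def usage_percent(used, available):
    """Used share of the filesystem as a whole percentage, rounded up."""
    return math.ceil(100 * used / (used + available))


def prune_decision(disk_pct, service_names, replicas):
    """Return (should_prune, reason).

    service_names: names of the swarm services currently present.
    replicas: mapping of service name -> (running, desired).
    """
    if disk_pct < DISK_THRESHOLD_PCT:
        return False, "disk below threshold"
    if any(RUN_APP_STACK.match(name) for name in service_names):
        return False, "run-app stack live"
    if any(running < desired for running, desired in replicas.values()):
        return False, "service converging"
    return True, "all gates passed"


def prune_commands():
    """Commands named in the design: age-gated, never --all, never --volumes."""
    return [
        ["docker", kind, "prune", "-f", "--filter", "until=24h"]
        for kind in ("container", "image", "builder")
    ]
```

With the figures recorded during orientation, `usage_percent(19, 43)` gives 31, so `prune_decision(31, [], {})` stops at the first gate and nothing is pruned. Checking disk pressure first means the common case exits before any swarm state is consulted.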