ci-docker-prune (gated surgical prune) live on cc-ci: old autoPrune --all gone, new timer enabled (daily), no-ops below 80% disk keeping the local image cache, never --all/--volumes. Daemon stays PAT-authenticated (nptest2); /var/lib/docker retained across rebuild. PC3 proof: redis:7-alpine deploy->teardown(service rm, image retained)->redeploy = "Image is up to date", no layer re-download (cold 5303ms -> warm 674ms). Docs: runbook "Image cache & prune policy", warm.md, DECISIONS Phase-2pc, IDEAS (registry pull-through cache deferred + revisit trigger). Gate 2pc CLAIMED, awaiting Adversary cold-verify. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
JOURNAL — Phase 2pc (sane image-prune policy)
Append-only reasoning log. Facts/verification for the Adversary live in STATUS-2pc.md.
2026-05-29 — Orientation + scope correction
Read SSOT plan-phase2pc-image-cache.md + plan.md §6.1/§7/§9. Operator issued a scope
correction mid-orientation: drop the registry:2 pull-through cache. Rationale (operator):
single host → Docker's own local image store already IS the cache; re-deploys reuse local layers
with no re-download; the daemon is PAT-authenticated, so residual manifest checks stay under the 200/6h rate limit.
The churn was caused by over-pruning (docker image prune -af wiping the store), not a missing
cache. A separate registry only pays off for multi-node setups or separately survivable storage, and this host is neither.
I had not yet written any registry code (still orienting) → nothing to revert.
Phase 2pc is now PC1 (prune policy) + PC2/PC3 (confirm + verify local-store retention/auth).
Findings from orientation (why the fix is one module)
- The ONLY automated image pruner in the whole repo is virtualisation.docker.autoPrune = { flags = ["--all" "--filter" "until=24h"]; } in nix/modules/swarm.nix. NixOS renders this as docker system prune --force --all --filter until=24h daily. --all removes every image not used by a running container; between runs there are no test apps running, so it evicts the cached recipe base images → cold re-pull on the next run. That is exactly the prune→re-pull→rate-limit churn documented in JOURNAL-2 (lines 507/542/690-693).
- runner/harness/lifecycle.py::teardown_app removes services (abra undeploy / docker stack rm), volumes, secrets, and the .env, and no images (grep for rmi / image rm / image prune in runner/ + tests/conftest.py is empty). So PC1's "teardown must NOT remove images" already holds.
- janitor, warm_reconcile.py, nightly-sweep.nix, drone*.nix, .drone.yml: none prune images.
- Daemon is already PAT-authenticated: docker info → Username: nptest2; sops dockerhub_auth (base64 nptest2:<PAT>) → sops.templates."docker-config.json" → /root/.docker/config.json (nix/modules/secrets.nix). PC2 needs no change: confirm + document.
- Disk on cc-ci: / is 64G, 19G used, 43G free (31%). Bounded; aggressive --all is unnecessary, which is the whole premise.
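The teardown audit in the second finding can be repeated with a small helper. This is a minimal sketch, assuming GNU grep; the function name find_image_removals is hypothetical and is not part of the repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every line that looks like an image removal.
# Prints matches as file:line:text and returns 0 if any are found;
# prints nothing and returns 1 when the searched paths are clean.
find_image_removals() {
  # -r recurse, -n line numbers, -E extended regex, -w whole-word matches only
  grep -rnEw 'rmi|image rm|image prune' -- "$@"
}
```

Run from the repo root as find_image_removals runner/ tests/conftest.py; empty output corresponds to the "grep is empty" result recorded above.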
PC1 design
Replace autoPrune with a dedicated nix/modules/docker-prune.nix: a daily systemd.timer +
oneshot systemd.service running a surgical, triple-gated prune:
- Disk-pressure gate: do nothing unless / usage ≥ 80% (Docker's local store IS our cache; keep it warm; reclaim only under genuine pressure).
- No-run gate: skip if any run-app stack (<=4char>-<6hex>_ci_commoninternet_net_*) is live (mid-pull layers can look prunable; "never prune mid-run").
- No-converge gate: skip if any swarm service has unmet replicas (a deploy/pull in flight, incl. infra warm redeploys).
When all gates pass:
docker {container,image,builder} prune -f --filter until=24h: dangling + age-gated only. NEVER --all (keeps tagged base/in-use images), NEVER --volumes (warm canonical data, per swarm.nix's existing comment).
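The three gates and the final prune can be sketched as a shell script. This is an illustration under stated assumptions, not the contents of nix/modules/docker-prune.nix: the function names, the log wording, and the character class used for the "<=4char" part of the stack name are all guesses, and the disk check assumes GNU df.

```shell
#!/usr/bin/env bash
# Sketch only: function names, the name regex, and the log wording are
# assumptions, not the contents of nix/modules/docker-prune.nix.

THRESHOLD=80   # percent of / in use below which the prune is a no-op

# Gate 1: succeed only when usage ($1, integer percent) reaches the threshold.
disk_pressure() {
  [ "$1" -ge "$THRESHOLD" ]
}

# Gate 2: succeed when stdin (one service name per line) contains a run-app
# service. The character class for "<=4char" is a guess.
run_app_live() {
  grep -Eq '^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_'
}

# Gate 3: succeed when stdin (one "running/desired" value per line, as printed
# by `docker service ls --format '{{.Replicas}}'`) has an unconverged service.
has_unmet_replicas() {
  awk -F'[/ ]' '$1 != $2 { unmet = 1 } END { exit (unmet ? 0 : 1) }'
}

main() {
  local pct
  # GNU df: print only the use% column for /, then strip everything but digits.
  pct=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
  if ! disk_pressure "$pct"; then
    echo "prune: / at ${pct}% (below ${THRESHOLD}%), keeping local image cache"
    return 0
  fi
  if docker service ls --format '{{.Name}}' | run_app_live; then
    echo "prune: run-app stack is live, skipping"
    return 0
  fi
  if docker service ls --format '{{.Replicas}}' | has_unmet_replicas; then
    echo "prune: a service is still converging, skipping"
    return 0
  fi
  # Dangling and older than 24h only; never --all, never --volumes.
  docker container prune -f --filter until=24h
  docker image prune -f --filter until=24h
  docker builder prune -f --filter until=24h
}

# Run the prune only when asked to, so the gates can be sourced and tested.
if [ "${1:-}" = "--run" ]; then main; fi
```

Keeping each gate as a small function that reads plain text makes it checkable without a Docker daemon: for example, printf '1/1\n0/1\n' | has_unmet_replicas succeeds, which would cause the prune to be skipped.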
2026-05-29 — Implemented + deployed + verified on cc-ci
Implementation. nix/modules/docker-prune.nix (NEW) + swarm.nix (dropped autoPrune block) +
configuration.nix import. Unit renamed docker-prune → ci-docker-prune because the NixOS
docker module reserves systemd.services.docker-prune (build conflict caught by nixos-rebuild build: "conflicting definition values for systemd.services.docker-prune.description"). Renamed,
rebuilt clean.
Deploy. Synced the 3 changed nix files to /root/cc-ci (tar over ssh; isolated change — host
tree otherwise unchanged), nixos-rebuild build (clean, shellcheck on the writeShellApplication
passed), then systemd-run --unit=ccci-sw ... nixos-rebuild switch path:/root/cc-ci#cc-ci. Switch
finished (22.5s CPU), systemctl is-system-running → running.
Verification (real host).
- Old NixOS docker-prune.timer → is-enabled = not-found (autoPrune gone). ci-docker-prune.timer → enabled + active; list-timers NEXT = Sat 2026-05-30 00:00 UTC (daily).
- Manual systemctl start ci-docker-prune.service at / = 31%: log → "docker-prune: / at 31% (< 80%) — keeping local image cache, nothing to do". No images removed (21 → 21). Gate works.
- PC2: docker info | grep Username → nptest2 (PAT auth retained after rebuild). /var/lib/docker persistent (21 recipe images retained across the rebuild).
- PC3 layer-reuse proof (real swarm deploy→teardown→redeploy, redis:7-alpine, docker.io via authed daemon):

    COLD pull: 897d... Already exists; c14c.. f546.. a300.. 941e.. 4f4f.. 677c.. Pull complete (6 downloaded)
               Status: Downloaded newer image for redis:7-alpine   COLD_PULL_MS=5303
    service create pc3b -> 1/1
    service rm pc3b -> retained_after_teardown: redis:7-alpine 487efc061638 (image REMAINS)
    WARM pull: Status: Image is up to date for redis:7-alpine   WARM_PULL_MS=674 (no bytes)
    redeploy create pc3b -> redeploy_ok (reused local layers)

  Cold 5303ms (6 layer downloads) → warm 674ms (authenticated manifest check only, 0 layers re-downloaded). The alpine base layer 897d... showed "Already exists" even on the cold pull = cross-image base-layer reuse, a bonus cache win. Teardown (service rm) retained the image; this matches teardown_app (no rmi).
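The cold/warm comparison in the PC3 proof can be reproduced with a few helpers. This is a minimal sketch, not the script used for the run above: the function names are hypothetical, and timed_pull assumes GNU date (for %3N) and a reachable Docker daemon. The two classifiers only read docker pull output from stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a cold-vs-warm pull comparison.

# Count layers that were actually downloaded in `docker pull` output (stdin).
# Layers reported as "Already exists" are not counted.
layers_downloaded() {
  grep -c 'Pull complete' || true   # grep -c prints 0 but exits 1 on no match
}

# Succeed when the pull output (stdin) shows that nothing had to be fetched.
is_warm_pull() {
  grep -q '^Status: Image is up to date'
}

# Pull image $1, echo the pull log, then print the elapsed wall time in ms.
timed_pull() {
  local start end
  start=$(date +%s%3N)
  docker pull "$1"
  end=$(date +%s%3N)
  echo "PULL_MS=$((end - start))"
}
```

Piping a saved pull log through layers_downloaded should give 6 for a cold pull like the one recorded above and 0 for the warm one, where is_warm_pull also succeeds.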
Docs/decisions. docs/runbook.md (new "Image cache & prune policy" + updated rate-limit note),
docs/warm.md (autoPrune→ci-docker-prune), DECISIONS.md (Phase-2pc entry), cc-ci-plan/IDEAS.md
(deferred registry cache + revisit trigger). Gate claimed.