Removes virtualisation.docker.autoPrune (daily `docker system prune --all` evicted in-use base images → cold re-pull → Hub rate-limit churn, JOURNAL-2). Adds modules/docker-prune.nix: daily timer + oneshot that prunes only dangling+until=24h, gated on disk pressure (>=80%) AND no run-app live AND no swarm service converging; never --all, never --volumes. Teardown unchanged (never removes images). Registry pull-through cache dropped per operator scope correction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# JOURNAL — Phase 2pc (sane image-prune policy)

Append-only reasoning log. Facts/verification for the Adversary live in `STATUS-2pc.md`.

## 2026-05-29 — Orientation + scope correction

Read SSOT `plan-phase2pc-image-cache.md` + plan.md §6.1/§7/§9. Operator issued a **scope
correction** mid-orientation: **drop the registry:2 pull-through cache.** Rationale (operator):
on a single host, Docker's own local image store already IS the cache; re-deploys reuse local layers
with no re-download; the daemon is PAT-authenticated, so residual manifest checks stay under the
200/6h Hub rate limit. The churn was caused by **over-pruning** (`docker image prune -af` wiping the
store), not by a missing cache. A separate registry only pays off with multiple nodes or separately
survivable storage, and we have neither.
**I had not yet written any registry code** (still orienting) → nothing to revert.

Phase 2pc is now **PC1 (prune policy) + PC2/PC3 (confirm + verify local-store retention/auth).**

### Findings from orientation (why the fix is one module)

- The ONLY automated image pruner in the whole repo is
  `virtualisation.docker.autoPrune = { flags = ["--all" "--filter" "until=24h"]; }` in
  `nix/modules/swarm.nix`. NixOS renders this as `docker system prune --force --all --filter until=24h`,
  run daily. `--all` removes every image **not used by a running container**. Between runs no
  test apps are running, so it evicts the cached recipe base images → cold re-pull on the next run. That
  is exactly the prune→re-pull→rate-limit churn documented in JOURNAL-2 (lines 507/542/690-693).
- `runner/harness/lifecycle.py::teardown_app` removes services (abra undeploy / `docker stack rm`),
  volumes, secrets, and the `.env`, but **no images** (a `grep` for `rmi`/`image rm`/`image prune` in
  `runner/` + `tests/conftest.py` returns nothing). So PC1's "teardown must NOT remove images" already holds.
- `janitor`, `warm_reconcile.py`, `nightly-sweep.nix`, `drone*.nix`, `.drone.yml`: none prune images.
- Daemon is already PAT-authenticated: `docker info` → `Username: nptest2`; sops `dockerhub_auth`
  (base64 `nptest2:<PAT>`) → `sops.templates."docker-config.json"` → `/root/.docker/config.json`
  (`nix/modules/secrets.nix`). PC2 needs no code change, only confirmation + documentation.
- Disk on cc-ci: `/` is 64G: 19G used (31%), **43G free**. Usage is bounded, so aggressive `--all` is
  unnecessary, which is the whole premise.

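To make the eviction concrete, here is a toy model of the selection rule. It is a sketch, not Docker's implementation: the `Image` fields and the sample store are hypothetical, and the only rules encoded are the ones this journal relies on (`--all` widens the candidates from dangling images to every image not used by a running container; `until=24h` spares anything younger than 24 hours).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    name: str         # hypothetical label, for illustration only
    tagged: bool      # False means the image is dangling
    in_use: bool      # referenced by a running container
    age_hours: float  # hours since the image was created


def prunable(images, *, all_unused, until_hours=24.0):
    """Return names a prune would select under this simplified model."""
    selected = []
    for img in images:
        # In-use images and images younger than the `until` cutoff are kept.
        if img.in_use or img.age_hours < until_hours:
            continue
        # Without --all only dangling images qualify; with --all every
        # unused image does, tagged or not.
        if all_unused or not img.tagged:
            selected.append(img.name)
    return selected


# Hypothetical store between two CI runs: no test app is running.
store = [
    Image("recipe-base", tagged=True, in_use=False, age_hours=72),
    Image("dangling-layer", tagged=False, in_use=False, age_hours=72),
    Image("fresh-dangling", tagged=False, in_use=False, age_hours=2),
    Image("infra", tagged=True, in_use=True, age_hours=500),
]

with_all = prunable(store, all_unused=True)
without_all = prunable(store, all_unused=False)
```

In this model `with_all` contains `recipe-base` (the cached base image the next run needs), while `without_all` selects only `dangling-layer`. Note that the age is measured from image creation, so a base image built weeks ago counts as old even if it was pulled an hour ago.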
### PC1 design

Replace `autoPrune` with a dedicated `nix/modules/docker-prune.nix`: a daily `systemd.timer` +
oneshot `systemd.service` running a surgical, **triple-gated** prune:
1. **Disk-pressure gate**: do nothing unless `/` usage is ≥ 80% (Docker's local store IS our cache;
   keep it warm; reclaim only under genuine pressure).
2. **No-run gate**: skip if any run-app stack (`<=4char>-<6hex>_ci_commoninternet_net_*`) is live
   (mid-pull layers can look prunable; "never prune mid-run").
3. **No-converge gate**: skip if any swarm service has unmet replicas (a deploy/pull in flight,
   incl. infra warm redeploys).

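The three gates can be sketched as pure functions. This is an illustration under stated assumptions, not the contents of `docker-prune.nix`: the function names are hypothetical, the character class for the `<=4char>` prefix is a guess, and the replica strings are assumed to come from the `{{.Replicas}}` column of `docker service ls` (values such as `1/1` or `0/1`).

```python
import re

DISK_THRESHOLD_PCT = 80.0

# Assumed shape of a run-app service name: <=4 chars, '-', 6 hex digits,
# then the fixed domain suffix. The [a-z0-9] class is an assumption.
RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_")


def used_percent(used, free):
    """df-style usage: used / (used + available), as a percentage."""
    return 100.0 * used / (used + free)


def disk_gate_open(used_pct, threshold=DISK_THRESHOLD_PCT):
    """Gate 1: prune only under genuine disk pressure."""
    return used_pct >= threshold


def run_app_live(service_names):
    """Gate 2: True if any service belongs to a run-app stack."""
    return any(RUN_APP_RE.match(name) for name in service_names)


def converging(replica_fields):
    """Gate 3: True if any service reports fewer running than desired."""
    for field in replica_fields:
        m = re.match(r"\s*(\d+)/(\d+)", field)
        if m and int(m.group(1)) < int(m.group(2)):
            return True
    return False


def should_prune(used_pct, service_names, replica_fields):
    """All three gates must pass before any prune command runs."""
    return (
        disk_gate_open(used_pct)
        and not run_app_live(service_names)
        and not converging(replica_fields)
    )
```

With the disk figures from the findings (19G used, 43G free), `used_percent(19, 43)` is about 31, so `should_prune` returns `False` at the first gate regardless of swarm state. On a live host the percentage could be fed from `shutil.disk_usage("/")`.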

When all gates pass: `docker {container,image,builder} prune -f --filter until=24h`, i.e. dangling +
age-gated only. NEVER `--all` (keeps tagged base/in-use images), NEVER `--volumes` (warm canonical
data, per swarm.nix's existing comment).
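A minimal sketch of how the brace expansion above could be spelled out with a guard against the forbidden flags. The helper name and the guard are hypothetical; only the three `docker ... prune -f --filter until=24h` invocations come from the design.

```python
PRUNE_TARGETS = ("container", "image", "builder")
FORBIDDEN_FLAGS = frozenset({"--all", "-a", "--volumes"})


def prune_commands(until="24h", extra_flags=()):
    """Build one argv list per prune target, refusing forbidden flags."""
    commands = []
    for target in PRUNE_TARGETS:
        cmd = ["docker", target, "prune", "-f", "--filter", f"until={until}"]
        cmd.extend(extra_flags)
        bad = FORBIDDEN_FLAGS.intersection(cmd)
        if bad:
            # Fail loudly rather than ever widening the prune.
            raise ValueError(f"forbidden prune flag(s): {sorted(bad)}")
        commands.append(cmd)
    return commands


commands = prune_commands()
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the gates have passed. Building argv lists rather than a shell string keeps the flag check exact.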