feat(2pc): PC1 conservative prune — drop autoPrune --all, add gated surgical docker-prune
Removes virtualisation.docker.autoPrune (daily `docker system prune --all` evicted in-use base images → cold re-pull → Hub rate-limit churn, JOURNAL-2).
Adds modules/docker-prune.nix: daily timer + oneshot that prunes only dangling+until=24h, gated on disk pressure (>=80%) AND no run-app live AND no swarm service converging; never --all, never --volumes.
Teardown unchanged (never removes images).
Registry pull-through cache dropped per operator scope correction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
26
machine-docs/BACKLOG-2pc.md
Normal file
@@ -0,0 +1,26 @@
# BACKLOG — Phase 2pc (sane image-prune policy)

SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`.
Scope (post operator correction 2026-05-29): **PC1 prune policy + confirm local-store
retention/auth ONLY.** The registry:2 pull-through cache is **dropped** (deferred to IDEAS /
Phase 2b — revisit only if multi-node OR a measured cold-deploy bottleneck on recreate-surviving
storage).

## Build backlog

- [ ] **PC1 — Conservative prune policy.** Remove `virtualisation.docker.autoPrune` (`--all` evicts
  in-use base images → forced cold re-pull → rate-limit). Replace with a surgical, gated prune:
  dangling + `until=24h` only, NEVER `--all`/`--volumes`; gated on (a) genuine disk pressure
  (`/` ≥ 80%), (b) no run-app stack live, (c) no swarm service converging (mid-pull). Teardown
  already removes only services/volumes/secrets/.env — NOT images (verified) — keep it that way.
- [ ] **PC2 — Confirm local cache retained + authenticated.** Daemon stays PAT-authenticated
  (`docker info` Username=nptest2, sops `dockerhub_auth` → `/root/.docker/config.json`); local
  image store `/var/lib/docker` persists across runs/teardowns/reboots. No code change expected —
  confirm + document.
- [ ] **PC3 — Verify + document.** Deploy → teardown → redeploy reuses local layers (no
  re-download); disk bounded without `-af`. Update `docs/runbook.md` + `docs/` prune note;
  record the policy + the dropped-registry-cache deviation in `DECISIONS.md`.

## Adversary findings

(Adversary owns this section.)
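PC2 asks for confirmation that the daemon stays authenticated. One way to check which account a stored credential belongs to, without ever printing the PAT, is to decode only the username half of the `auth` value. The snippet below is a minimal sketch: `auth_username` is a hypothetical helper (not part of this repo), and it assumes the `auth` field is base64 of `user:secret`, which is how `dockerhub_auth` is described in this commit.

```shell
# Hypothetical helper (an assumption for illustration, not repo code):
# print the username half of a Docker config.json "auth" value,
# which is assumed to be base64("user:secret").
auth_username() {
  # Decode, then keep only the text before the first ':' so the secret is never echoed.
  printf '%s' "$1" | base64 -d | cut -d: -f1
}

# Example with a dummy credential; the secret here is a placeholder, not a real token.
sample="$(printf '%s' 'nptest2:not-a-real-pat' | base64)"
auth_username "$sample"
```

On the host, the value passed in would be the `auth` string stored in `/root/.docker/config.json`; how that string is extracted (for example with a JSON tool) is left open here.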
47
machine-docs/JOURNAL-2pc.md
Normal file
@@ -0,0 +1,47 @@
# JOURNAL — Phase 2pc (sane image-prune policy)

Append-only reasoning log. Facts/verification for the Adversary live in STATUS-2pc.md.

## 2026-05-29 — Orientation + scope correction

Read SSOT `plan-phase2pc-image-cache.md` + plan.md §6.1/§7/§9. Operator issued a **scope
correction** mid-orientation: **drop the registry:2 pull-through cache.** Rationale (operator):
single host → Docker's own local image store already IS the cache; re-deploys reuse local layers
with no re-download; the daemon is PAT-authenticated so residual manifest checks sit under 200/6h.
The churn was caused by **over-pruning** (`docker image prune -af` wiping the store), not a missing
cache. A separate registry only pays off multi-node / separate-survivable storage, which we are not.
**I had not yet written any registry code** (still orienting) → nothing to revert.

Phase 2pc is now **PC1 (prune policy) + PC2/PC3 (confirm + verify local-store retention/auth).**

### Findings from orientation (why the fix is one module)

- The ONLY automated image pruner in the whole repo is
  `virtualisation.docker.autoPrune = { flags = ["--all" "--filter" "until=24h"]; }` in
  `nix/modules/swarm.nix`. NixOS renders this as `docker system prune --force --all --filter until=24h`
  daily. `--all` removes every image **not used by a running container** — between runs there are no
  test apps running, so it evicts the cached recipe base images → cold re-pull on the next run. That
  is exactly the prune→re-pull→rate-limit churn documented in JOURNAL-2 (lines 507/542/690-693).
- `runner/harness/lifecycle.py::teardown_app` removes services (abra undeploy / `docker stack rm`),
  volumes, secrets, and the `.env` — and **no images** (`grep` for `rmi`/`image rm`/`image prune` in
  `runner/` + `tests/conftest.py` is empty). So PC1's "teardown must NOT remove images" already holds.
- `janitor`, `warm_reconcile.py`, `nightly-sweep.nix`, `drone*.nix`, `.drone.yml` — none prune images.
- Daemon is already PAT-authenticated: `docker info` → `Username: nptest2`; sops `dockerhub_auth`
  (base64 `nptest2:<PAT>`) → `sops.templates."docker-config.json"` → `/root/.docker/config.json`
  (`nix/modules/secrets.nix`). PC2 needs no change — confirm + document.
- Disk on cc-ci: `/` is 64G, **19G used (31%), 43G free** — bounded; aggressive `--all` is
  unnecessary, which is the whole premise.

### PC1 design

Replace `autoPrune` with a dedicated `nix/modules/docker-prune.nix`: a daily `systemd.timer` +
oneshot `systemd.service` running a surgical, **triple-gated** prune:
1. **Disk-pressure gate** — do nothing unless `/` usage ≥ 80% (Docker's local store IS our cache;
   keep it warm; reclaim only under genuine pressure).
2. **No-run gate** — skip if any run-app stack (`<=4char>-<6hex>_ci_commoninternet_net_*`) is live
   (mid-pull layers can look prunable; "never prune mid-run").
3. **No-converge gate** — skip if any swarm service has unmet replicas (a deploy/pull in flight,
   incl. infra warm redeploys).
When all gates pass: `docker {container,image,builder} prune -f --filter until=24h` — dangling +
age-gated only. NEVER `--all` (keeps tagged base/in-use images), NEVER `--volumes` (warm canonical
data, per swarm.nix's existing comment).
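The three gates in the PC1 design can be exercised offline, without touching Docker, by treating them as a pure decision over captured text. The following is a minimal sketch under stated assumptions: `prune_decision` and its `skip:*` labels are hypothetical names introduced here for illustration; the regex and the `awk` comparison are copied from the module in this commit, and the inputs stand in for `df` and `docker service ls` output.

```shell
# Offline sketch of the gate order (hypothetical function, not repo code).
#   $1 = "/" usage percent (integer)   $2 = threshold percent
#   $3 = newline-separated service names
#   $4 = newline-separated Replicas values ("running/desired")
prune_decision() {
  pd_used="$1"; pd_thresh="$2"; pd_names="$3"; pd_replicas="$4"
  # Gate 1: below the threshold the local image store is left untouched.
  if [ "$pd_used" -lt "$pd_thresh" ]; then echo "skip:no-pressure"; return 0; fi
  # Gate 2: any run-app stack name means a run is in flight.
  if printf '%s\n' "$pd_names" \
      | grep -qE '^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_'; then
    echo "skip:run-live"; return 0
  fi
  # Gate 3: count services whose running count differs from the desired count.
  pd_conv="$(printf '%s\n' "$pd_replicas" \
      | awk -F/ '{ if (($1+0) != ($2+0)) c++ } END { print c+0 }')"
  if [ "$pd_conv" -gt 0 ]; then echo "skip:converging"; return 0; fi
  echo "prune"
}

# Illustrative inputs only; the service names below are made up.
prune_decision 31 80 "traefik_app" "1/1"
prune_decision 85 80 "ab12-0a1b2c_ci_commoninternet_net_app" "1/1"
prune_decision 85 80 "traefik_app" "0/1"
prune_decision 85 80 "traefik_app" "1/1"
```

Keeping the decision separate from the `docker` calls makes the ordering easy to check: the disk gate runs first, so on a host at 31% usage the swarm is never even queried.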
22
machine-docs/STATUS-2pc.md
Normal file
@@ -0,0 +1,22 @@
# STATUS — Phase 2pc (sane image-prune policy)

**SSOT:** `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`
**Scope (operator correction 2026-05-29):** PC1 conservative prune + PC2/PC3 confirm-and-verify
local-store retention/auth. **Registry pull-through cache DROPPED** (deferred to IDEAS / Phase 2b).

## Phase: PC1 implemented, deploy+verify in flight (NOT yet claimed)

In flight: build the new prune module onto cc-ci via `nixos-rebuild switch`, then run the
deploy→teardown→redeploy layer-reuse proof. Gate will be CLAIMED once verified on the real host.

## What changed (the diff)

- `nix/modules/swarm.nix` — removed `virtualisation.docker.autoPrune` (it ran
  `docker system prune --force --all --filter until=24h` daily; `--all` evicts every image not used
  by a *running* container → wiped cached recipe base images → cold re-pull → Hub rate-limit churn).
- `nix/modules/docker-prune.nix` (NEW) — daily `systemd.timer` + oneshot `systemd.service`
  `docker-prune` running a surgical, triple-gated prune. Imported in `nix/hosts/cc-ci/configuration.nix`.
- Teardown (`runner/harness/lifecycle.py::teardown_app`) UNCHANGED — already removes only
  services/volumes/secrets/.env, never images (PC1 teardown requirement already held).

(Verification context — WHAT/HOW/EXPECTED/WHERE — will be filled in here at gate-claim time.)
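For the deploy→teardown→redeploy layer-reuse proof, one possible check is to snapshot the local image list before and after the cycle and confirm that nothing disappeared. This is only a sketch of that idea, not the verification the author ran: `evicted_images` is a hypothetical helper, and the snapshot file names are assumptions. The snapshots could be taken with `docker image ls --no-trunc --format '{{.Repository}}:{{.Tag}} {{.ID}}'`.

```shell
# Hypothetical helper (not repo code): print image lines that were present in the
# "before" snapshot ($1) but are missing from the "after" snapshot ($2).
evicted_images() {
  # -F fixed strings, -x whole-line match, -v invert:
  # keep the "before" lines that have no identical line in "after".
  # grep exits 1 when it prints nothing, which is the good case here, so mask it.
  grep -Fvx -f "$2" "$1" || true
}

# Assumed usage on the host:
#   docker image ls --no-trunc --format '{{.Repository}}:{{.Tag}} {{.ID}}' > before.txt
#   ... deploy, teardown, redeploy ...
#   docker image ls --no-trunc --format '{{.Repository}}:{{.Tag}} {{.ID}}' > after.txt
#   evicted_images before.txt after.txt
```

Empty output suggests that every image tag and ID survived the cycle. It does not by itself show that no layer was re-downloaded, so it would complement, not replace, watching the pull output during the redeploy.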
nix/hosts/cc-ci/configuration.nix
@@ -8,6 +8,7 @@
     ../../modules/packages.nix
     ../../modules/secrets.nix
     ../../modules/swarm.nix
+    ../../modules/docker-prune.nix
     ../../modules/abra.nix
     ../../modules/proxy.nix
     ../../modules/drone.nix
75
nix/modules/docker-prune.nix
Normal file
@@ -0,0 +1,75 @@
# Conservative, surgical Docker prune (Phase 2pc / PC1).
#
# REPLACES `virtualisation.docker.autoPrune` (which ran `docker system prune --force --all
# --filter until=24h` daily). The `--all` removed every image NOT used by a *running* container —
# between CI runs no test apps run, so it evicted the cached recipe base images and forced a cold
# re-pull on the next run → the prune->re-pull->Docker-Hub-rate-limit churn documented in JOURNAL-2.
#
# On this SINGLE host, Docker's own local image store IS the cache (re-deploys reuse local layers,
# no re-download; the daemon is PAT-authenticated). So we keep that store warm and only reclaim disk
# under GENUINE pressure, and even then SURGICALLY:
#   - dangling images + stopped containers + dangling build cache, age-gated (until=24h) — NEVER
#     `--all` (would evict tagged base/in-use images), NEVER `--volumes` (warm canonical data — see
#     swarm.nix's existing comment; warm volumes are reaped only by the warm reconcilers).
# and only when nothing is in flight:
#   - skip if any run-app stack is live (mid-pull layers can look prunable — "never prune mid-run");
#   - skip if any swarm service has unmet replicas (a deploy/pull is converging, incl. warm redeploys).
{ pkgs, ... }:
let
  # `/` usage % at/above which a surgical prune is permitted. Below this: keep the cache, no-op.
  threshold = 80;
  prune = pkgs.writeShellApplication {
    name = "cc-ci-docker-prune";
    runtimeInputs = with pkgs; [ docker coreutils gnugrep gawk ];
    text = ''
      THRESH=${toString threshold}
      used="$(df --output=pcent / | tail -1 | tr -dc '0-9')"
      : "''${used:=0}"
      if [ "$used" -lt "$THRESH" ]; then
        echo "docker-prune: / at ''${used}% (< ''${THRESH}%) — keeping local image cache, nothing to do"
        exit 0
      fi
      # NEVER prune mid-run: a live run-app stack means a deploy/test is in flight (mid-pull layers
      # can look prunable). Run-app services: <=4char>-<6hex>_ci_commoninternet_net_* (lifecycle.py).
      if docker service ls --format '{{.Name}}' \
          | grep -qE '^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_'; then
        echo "docker-prune: a run-app stack is live — skipping (never prune mid-run)"
        exit 0
      fi
      # NEVER prune while ANY swarm service is converging (unmet replicas => a pull/deploy in flight,
      # including infra warm redeploys). Replicas field is "running/desired" e.g. 1/1.
      converging="$(docker service ls --format '{{.Replicas}}' \
          | awk -F/ '{ if (($1+0) != ($2+0)) c++ } END { print c+0 }')"
      if [ "$converging" -gt 0 ]; then
        echo "docker-prune: $converging service(s) converging (deploy/pull in flight) — skipping"
        exit 0
      fi
      echo "docker-prune: / at ''${used}% (>= ''${THRESH}%) — surgical prune (dangling + until=24h; NEVER --all/--volumes)"
      docker container prune -f --filter until=24h || true
      docker image prune -f --filter until=24h || true
      docker builder prune -f --filter until=24h || true
      df -h /
    '';
  };
in
{
  systemd.services.docker-prune = {
    description = "Surgical disk-pressure-gated Docker prune (dangling+old only; never --all/--volumes; never mid-run)";
    after = [ "docker.service" ];
    requires = [ "docker.service" ];
    path = [ pkgs.docker ];
    serviceConfig = {
      Type = "oneshot";
      ExecStart = "${prune}/bin/cc-ci-docker-prune";
    };
  };

  systemd.timers.docker-prune = {
    description = "Daily timer for the surgical Docker prune";
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnCalendar = "daily";
      Persistent = true;
    };
  };
}
nix/modules/swarm.nix
@@ -5,18 +5,14 @@
 {
   virtualisation.docker = {
     enable = true;
-    # Reclaim disk from churning per-run images (cc-ci root is ~28 GiB). Prune images/containers/
-    # networks/build-cache older than 24h — but NEVER volumes:
-    # (1) `--volumes` is incompatible with `--filter until=` (docker errors → the unit failed daily,
-    #     degrading the system and never actually pruning — that's why disk crept to 96%); and
-    # (2) Phase 2w keeps DATA-WARM canonical volumes that are UNDEPLOYED (no container), so
-    #     `prune --volumes` would DELETE the warm known-good data. Warm volumes are pruned
-    #     deliberately by the warm reconcilers (WC8), never by this blanket sweep.
-    autoPrune = {
-      enable = true;
-      dates = "daily";
-      flags = [ "--all" "--filter" "until=24h" ];
-    };
+    # Image pruning is handled by modules/docker-prune.nix (Phase 2pc / PC1), NOT by
+    # `virtualisation.docker.autoPrune`. The old autoPrune ran `docker system prune --all` daily;
+    # `--all` evicts every image not used by a *running* container — between runs that wiped the
+    # cached recipe base images and forced a cold re-pull → the Docker-Hub-rate-limit churn in
+    # JOURNAL-2. The replacement keeps Docker's local store warm (it IS our cache on this single
+    # host) and prunes only dangling+old layers, gated on genuine disk pressure and nothing in
+    # flight. NEVER --volumes either: Phase-2w keeps DATA-WARM undeployed canonical volumes, reaped
+    # only by the warm reconcilers. autoPrune left OFF (the default) on purpose.
   };
 
   environment.systemPackages = [ pkgs.docker ];