cc-ci/machine-docs/JOURNAL-2pc.md
autonomic-bot 9e73ebda3d claim(2pc): re-claim — F2pc-1 resolved (git==host==ci-docker-prune via b9bbd25)
Adversary FAILed claim de6103d because that commit still named the units docker-prune while the
host runs ci-docker-prune; the rename was committed in b9bbd25 (its endorsed fix) which is in the
current pushed HEAD. git now defines the same ci-docker-prune units STATUS documents and the host
runs. Behavior was already cold-verified GREEN. Inert NixOS-builtin docker-prune.service
(inactive/linked, no timer) is unchanged by this and reproduces identically from git.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:50:39 +01:00

JOURNAL — Phase 2pc (sane image-prune policy)

Append-only reasoning log. Facts/verification for the Adversary live in STATUS-2pc.md.

2026-05-29 — Orientation + scope correction

Read SSOT plan-phase2pc-image-cache.md + plan.md §6.1/§7/§9. Operator issued a scope correction mid-orientation: drop the registry:2 pull-through cache. Rationale (operator): single host → Docker's own local image store already IS the cache; re-deploys reuse local layers with no re-download; the daemon is PAT-authenticated so residual manifest checks stay under the 200/6h rate limit. The churn was caused by over-pruning (docker image prune -af wiping the store), not by a missing cache. A separate registry only pays off with multiple nodes or separately survivable storage, and we have neither. I had not yet written any registry code (still orienting) → nothing to revert.

Phase 2pc is now PC1 (prune policy) + PC2/PC3 (confirm + verify local-store retention/auth).

Findings from orientation (why the fix is one module)

  • The ONLY automated image pruner in the whole repo is virtualisation.docker.autoPrune = { flags = ["--all" "--filter" "until=24h"]; } in nix/modules/swarm.nix. NixOS renders this as docker system prune --force --all --filter until=24h daily. --all removes every image not used by a running container — between runs there are no test apps running, so it evicts the cached recipe base images → cold re-pull on the next run. That is exactly the prune→re-pull→rate-limit churn documented in JOURNAL-2 (lines 507/542/690-693).
  • runner/harness/lifecycle.py::teardown_app removes services (abra undeploy / docker stack rm), volumes, secrets, and the .env — and no images (grep for rmi/image rm/image prune in runner/ + tests/conftest.py is empty). So PC1's "teardown must NOT remove images" already holds.
  • janitor, warm_reconcile.py, nightly-sweep.nix, drone*.nix, .drone.yml — none prune images.
  • Daemon is already PAT-authenticated: docker info → Username: nptest2; sops dockerhub_auth (base64 nptest2:<PAT>) → sops.templates."docker-config.json" → /root/.docker/config.json (nix/modules/secrets.nix). PC2 needs no change: confirm + document.
  • Disk on cc-ci: / is 64G, 19G used, 43G free (31%) — bounded; aggressive --all is unnecessary, which is the whole premise.
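The "teardown removes no images" finding rests on an empty grep. A minimal sketch of that audit follows; the function name and the exact pattern are assumptions for illustration, not the command that was actually run.

```shell
#!/bin/sh
# Hypothetical helper: list any line under the given paths that mentions an
# image-removing docker command. -w keeps "rmi" from matching inside other
# words (for example "permission").
find_image_removal() {
    grep -rnwE 'rmi|image rm|image prune' "$@"
}

# Intended use (paths taken from the finding above):
#   find_image_removal runner/ tests/conftest.py || echo "no image removal found"
```

grep exits non-zero when nothing matches, so an empty result is the passing case here.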

PC1 design

Replace autoPrune with a dedicated nix/modules/docker-prune.nix: a daily systemd.timer + oneshot systemd.service running a surgical, triple-gated prune:

  1. Disk-pressure gate — do nothing unless / usage ≥ 80% (Docker's local store IS our cache; keep it warm; reclaim only under genuine pressure).
  2. No-run gate — skip if any run-app stack (<=4char>-<6hex>_ci_commoninternet_net_*) is live (mid-pull layers can look prunable; "never prune mid-run").
  3. No-converge gate — skip if any swarm service has unmet replicas (a deploy/pull in flight, incl. infra warm redeploys).

When all gates pass: docker {container,image,builder} prune -f --filter until=24h (dangling + age-gated only). NEVER --all (keeps tagged base/in-use images), NEVER --volumes (warm canonical data, per swarm.nix's existing comment).
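The gate order above can be sketched as a small POSIX shell script. This is an illustration under stated assumptions, not the contents of docker-prune.nix: every function name, the THRESHOLD_PCT variable, the log wording, and the regex used for the run-app stack pattern are hypothetical.

```shell
#!/bin/sh
# Sketch only: names, messages and the stack-name regex are assumptions.
THRESHOLD_PCT=80

# Gate 1 input: usage of / as an integer percentage (column 5 of `df -P`).
root_usage_pct() {
    df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Gate 2: succeeds if any service name on stdin looks like a run-app stack,
# i.e. <up to 4 chars>-<6 hex>_ci_commoninternet_net_...
has_live_run_stack() {
    grep -Eq '^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_'
}

# Gate 3: succeeds if any "running/desired" replica pair on stdin is unmet.
has_unmet_replicas() {
    awk -F/ '{ split($2, d, " "); if ($1 + 0 != d[1] + 0) found = 1 }
             END { exit !found }'
}

# Pure decision step: prints "prune" or a "skip: ..." reason.
prune_decision() {
    usage_pct="$1"; services="$2"; replicas="$3"
    if [ "$usage_pct" -lt "$THRESHOLD_PCT" ]; then
        echo "skip: / at ${usage_pct}% (< ${THRESHOLD_PCT}%)"
    elif printf '%s\n' "$services" | has_live_run_stack; then
        echo "skip: run-app stack live"
    elif printf '%s\n' "$replicas" | has_unmet_replicas; then
        echo "skip: swarm not converged"
    else
        echo "prune"
    fi
}

# Collect live facts, decide, and prune only when every gate passes.
run_prune() {
    decision=$(prune_decision "$(root_usage_pct)" \
        "$(docker service ls --format '{{.Name}}')" \
        "$(docker service ls --format '{{.Replicas}}')")
    echo "ci-docker-prune: $decision"
    [ "$decision" = "prune" ] || return 0
    for kind in container image builder; do
        docker "$kind" prune -f --filter until=24h   # never --all, never --volumes
    done
}
```

Keeping the decision in a function that takes plain strings means the gate logic can be exercised without a Docker daemon; only run_prune touches the host.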

2026-05-29 — Implemented + deployed + verified on cc-ci

Implementation. nix/modules/docker-prune.nix (NEW) + swarm.nix (dropped autoPrune block) + configuration.nix import. Unit renamed docker-prune → ci-docker-prune because the NixOS docker module reserves systemd.services.docker-prune (build conflict caught by nixos-rebuild build: "conflicting definition values for systemd.services.docker-prune.description"). Renamed, rebuilt clean.

Deploy. Synced the 3 changed nix files to /root/cc-ci (tar over ssh; isolated change, host tree otherwise unchanged), nixos-rebuild build (clean, shellcheck on the writeShellApplication passed), then systemd-run --unit=ccci-sw ... nixos-rebuild switch path:/root/cc-ci#cc-ci. Switch finished (22.5s CPU), systemctl is-system-running → running.

Verification (real host).

  • Old NixOS docker-prune.timer → is-enabled = not-found (autoPrune gone). ci-docker-prune.timer → enabled + active; list-timers NEXT = Sat 2026-05-30 00:00 UTC (daily).
  • Manual systemctl start ci-docker-prune.service at /=31%: log → docker-prune: / at 31% (< 80%) — keeping local image cache, nothing to do. No images removed (21 → 21). Gate works.
  • PC2: docker info | grep Username → nptest2 (PAT auth retained after rebuild). /var/lib/docker persistent (21 recipe images retained across the rebuild).
  • PC3 layer-reuse proof (real swarm deploy→teardown→redeploy, redis:7-alpine, docker.io via authed daemon):
    COLD pull: 897d... Already exists; c14c.. f546.. a300.. 941e.. 4f4f.. 677c.. Pull complete  (6 downloaded)
               Status: Downloaded newer image for redis:7-alpine        COLD_PULL_MS=5303
    service create pc3b -> 1/1
    service rm pc3b      -> retained_after_teardown: redis:7-alpine 487efc061638   (image REMAINS)
    WARM pull: Status: Image is up to date for redis:7-alpine          WARM_PULL_MS=674   (no bytes)
    redeploy create pc3b -> redeploy_ok (reused local layers)
    
    Cold 5303ms (6 layer downloads) → warm 674ms (authenticated manifest check only, 0 layers re-downloaded). The alpine base layer 897d... showed "Already exists" even on the cold pull = cross-image base-layer reuse, a bonus cache win. Teardown (service rm) retained the image — matches teardown_app (no rmi).
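The cold/warm distinction in the log above can be read mechanically from docker pull output. A minimal sketch, assuming non-TTY output with one status line per layer; the function name pull_summary is hypothetical and this is not the script that produced the numbers above.

```shell
#!/bin/sh
# Hypothetical helper: summarise `docker pull` output read from stdin.
# "Pull complete" lines are layers that were downloaded; "Already exists"
# lines are layers reused from the local store; an "Image is up to date"
# status means nothing was downloaded at all (warm pull).
pull_summary() {
    awk '
        /: Pull complete$/             { downloaded++ }
        /: Already exists$/            { reused++ }
        /^Status: Image is up to date/ { warm = 1 }
        END {
            printf "%s downloaded=%d reused=%d\n",
                   (warm ? "warm" : "cold"), downloaded, reused
        }
    '
}

# Intended use:
#   docker pull redis:7-alpine | pull_summary
```

A warm pull reporting downloaded=0 is the property PC3 is after; the elapsed-time difference is a consequence of it.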

Docs/decisions. docs/runbook.md (new "Image cache & prune policy" + updated rate-limit note), docs/warm.md (autoPrune→ci-docker-prune), DECISIONS.md (Phase-2pc entry), cc-ci-plan/IDEAS.md (deferred registry cache + revisit trigger). Gate claimed.

2026-05-29 — Probe-5 evidence: surgical prune reclaims, keeps tagged/recent

Ran the exact active-path command the gated unit uses (docker image prune -f --filter until=24h + container/builder variants) on the host to demonstrate surgical reclaim (the daily timer only reaches this under ≥80% disk, but the command's effect is the same):

  • all images 23→17, dangling 10→4 (the 4 remaining are <24h old — the until=24h age gate kept them), 2.341 GB reclaimed, disk 31%→27% (19G→17G used).
  • ALL tagged/in-use images survived (keycloak:26.6.2, mariadb:12.2, nginx:1.30.0, redis:8.6.3, …) — no --all, so nothing tagged or container-referenced was touched. Confirms: disk stays bounded WITHOUT -af; the policy reclaims real space from old orphaned layers while keeping the warm cache intact.
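One way to check the "all tagged images survived" claim is to diff the tagged-image list taken before and after the prune. The sketch below is an assumption about how such a check could look; tagged_images and lost_images are hypothetical names, and this is not the procedure recorded above.

```shell
#!/bin/sh
# Hypothetical helper: tagged images as repo:tag, one per line, sorted.
# Dangling entries show up as "<none>" and are dropped.
tagged_images() {
    docker images --format '{{.Repository}}:{{.Tag}}' | grep -v '<none>' | sort -u
}

# Hypothetical helper: print every entry of the "before" list ($1) that is
# missing from the "after" list ($2). Empty output means nothing was lost.
lost_images() {
    before="$1"; after="$2"
    printf '%s\n' "$before" | while IFS= read -r img; do
        [ -z "$img" ] && continue
        printf '%s\n' "$after" | grep -Fxq -- "$img" || printf '%s\n' "$img"
    done
}

# Intended use:
#   before=$(tagged_images)
#   docker image prune -f --filter until=24h
#   after=$(tagged_images)
#   lost_images "$before" "$after"    # expected: no output
```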

2026-05-29 — F2pc-1 (committed≠host) resolution + claim discipline

Adversary FAILed gate 2pc on F2pc-1: at claim commit de6103d the committed docker-prune.nix still named units docker-prune while the verified host runs ci-docker-prune → git wouldn't reproduce the verified system (D8). Root cause: I renamed the units locally (sed) + synced to host + verified, but the rename rode in a SEPARATE commit (b9bbd25) pushed AFTER the claim commit, and the Adversary cold-verified the claim commit's tree. Behavior was GREEN; only the artifact lagged.

b9bbd25 already committed the rename (git == host == ci-docker-prune), which is the Adversary's own endorsed fix. Confirmed current HEAD: grep systemd.(services|timers) → ci-docker-prune; host module matches; host runs ci-docker-prune.timer enabled+active; builtin docker-prune.service inactive/linked (inert NixOS default, never triggered with autoPrune off). Re-claimed.

Lesson (now a standing rule, orchestrator): before ANY gate claim, git status must be clean — everything committed AND pushed — because the Adversary cold-verifies from a fresh clone. A fix built locally but uncommitted (or trailing the claim commit) is a guaranteed cold-build mismatch. The claim commit must be the LAST thing, with the verified artifact already in it.
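The standing rule could be enforced with a pre-claim check along these lines. This is a sketch under assumptions: the function names are hypothetical, and it presumes the branch has an upstream configured so that @{u} resolves.

```shell
#!/bin/sh
# Hypothetical helper: decide claim readiness from two already-collected facts,
# the `git status --porcelain` text ($1) and the count of commits ahead of the
# upstream ($2). Prints a verdict and returns non-zero when not ready.
claim_ready() {
    porcelain="$1"; ahead="$2"
    if [ -n "$porcelain" ]; then
        echo "not ready: uncommitted changes"
        return 1
    fi
    if [ "$ahead" -ne 0 ]; then
        echo "not ready: $ahead unpushed commit(s)"
        return 1
    fi
    echo "ready"
}

# Collect the facts from the current repository and apply the rule.
check_claim() {
    git fetch --quiet &&
    claim_ready "$(git status --porcelain)" \
                "$(git rev-list --count '@{u}..HEAD')"
}
```

Running check_claim immediately before writing the claim commit, and again after pushing it, covers both halves of the rule: nothing uncommitted, nothing trailing the claim.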