
autonomic-bot d389dd516b status(2pc): ## DONE — Adversary PASS for PC1+PC2+PC3, F2pc-1 closed, no VETO

Phase 2pc complete: conservative surgical gated prune (ci-docker-prune) live + reproducible from
git, local Docker store retained as the cache (PAT-authenticated, layer reuse proven), registry
pull-through cache deferred to IDEAS. Adversary review(2pc) 486d162 PASS @2026-05-29. Watchdog
auto-returns to Phase 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-05-29 09:53:30 +01:00


STATUS — Phase 2pc (sane image-prune policy)

SSOT: /srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md

Scope (operator correction 2026-05-29): PC1 conservative prune + PC2/PC3 confirm+verify local-store retention/auth. Registry pull-through cache DROPPED (deferred → cc-ci-plan/IDEAS.md and DECISIONS Phase-2pc; no registry code was written).

DONE

Phase 2pc complete. Adversary PASS @2026-05-29 for PC1+PC2+PC3 (REVIEW-2pc.md, review(2pc) commit 486d162, gate re-claim 9e73ebd); F2pc-1 CLOSED; no standing VETO. git==host (ci-docker-prune, reproducible from a fresh clone). Watchdog auto-returns to Phase 2.

Gate: 2pc — PASSED (was RE-CLAIMED; F2pc-1 resolved)

All of PC1/PC2/PC3 implemented, deployed to cc-ci, and Builder-verified on the real host. WHAT / HOW / EXPECTED / WHERE below.

F2pc-1 (committed code ≠ deployed host) — RESOLVED. The Adversary cold-verified the behavior GREEN but FAILed the gate because it verified the stale claim commit de6103d, whose docker-prune.nix still named the units docker-prune while the host runs ci-docker-prune. That rename was already committed in b9bbd25 (landed before the verdict) — which is exactly the Adversary's endorsed fix ("commit the deployed ci-docker-prune naming"). Current pushed HEAD now has git == host == ci-docker-prune:

# committed git defines the SAME units STATUS documents + the host runs:
grep -nE 'systemd\.(services|timers)\.' nix/modules/docker-prune.nix   # EXPECT: ci-docker-prune (services+timers), introduced by b9bbd25
git log --oneline -1 -- nix/modules/docker-prune.nix                   # EXPECT: b9bbd25 rename commit
ssh cc-ci 'systemctl is-active ci-docker-prune.timer'                  # EXPECT: active (matches a from-git rebuild)

The NixOS-builtin docker-prune.service is inactive/linked (and docker-prune.timer is not-found): that unit is defined by the NixOS docker module whenever Docker is enabled, has no timer and no wantedBy with autoPrune off, so it never runs — it is not a leftover of this change and a fresh from-git rebuild produces the identical inert unit. The unit name is determined literally by the attribute in docker-prune.nix, so a from-git build yields ci-docker-prune.*.

(Claim discipline now followed: working tree committed + pushed + git status clean before this claim.)

PC1 — Conservative prune policy

WHAT. Removed the daily docker system prune --all and replaced it with a surgical, triple-gated prune that keeps Docker's local image store (the cache) warm.

WHERE. nix/modules/docker-prune.nix (NEW, unit ci-docker-prune service+timer); nix/modules/swarm.nix (virtualisation.docker.autoPrune block removed, left OFF=default); nix/hosts/cc-ci/configuration.nix (imports docker-prune.nix). Deployed via nixos-rebuild switch --flake path:/root/cc-ci#cc-ci.
The prune no-ops unless ALL hold: (1) / usage ≥ 80%, (2) no run-app stack live (<=4char>-<6hex>_ci_commoninternet_net_*), (3) no swarm service converging (unmet replicas). When it runs: docker {container,image,builder} prune -f --filter until=24h — dangling+old only, never --all, never --volumes.
Teardown unchanged: runner/harness/lifecycle.py::teardown_app removes services/volumes/secrets/.env and no images (grep -n 'rmi\|image rm\|image prune' runner/ tests/conftest.py = empty).
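
The three gates above can be read as a pure decision over three observations. The following is a minimal sketch under stated assumptions, not the script shipped in docker-prune.nix: the function names `count_unmet` and `prune_decision` and the no-op messages are hypothetical, and only the 80% threshold and the "all three must hold" rule come from this document. `count_unmet` assumes replica pairs in the `running/desired` form printed by `docker service ls --format '{{.Replicas}}'`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the triple gate; NOT the contents of docker-prune.nix.

# Read "running/desired" pairs on stdin (one per service) and print how many
# services have running != desired, i.e. are still converging.
count_unmet() {
  awk '{ split($1, r, "/"); if (r[1] != r[2]) n++ } END { print n + 0 }'
}

# prune_decision DISK_PCT LIVE_STACKS UNMET_SERVICES
# Prints "prune" only when every gate passes; otherwise prints why it no-ops.
prune_decision() {
  if [ "$1" -lt 80 ]; then echo "noop: disk $1% < 80%"; return; fi
  if [ "$2" -gt 0 ]; then echo "noop: run-app stack live"; return; fi
  if [ "$3" -gt 0 ]; then echo "noop: swarm service converging"; return; fi
  echo "prune"
}

# Worked examples with made-up observations:
prune_decision 27 0 0                                        # disk gate fails
prune_decision 85 0 "$(printf '1/1\n0/1\n' | count_unmet)"   # convergence gate fails
prune_decision 85 0 "$(printf '1/1\n2/2\n' | count_unmet)"   # all gates pass
```

Keeping the decision separate from the Docker calls makes each gate checkable without touching the daemon; on the host the three inputs would come from `df`, a stack-name match, and `docker service ls`.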

HOW to verify (cold, Adversary's own checks):

ssh cc-ci 'systemctl is-enabled docker-prune.timer'                    # EXPECT: not-found (autoPrune gone)
ssh cc-ci 'systemctl is-enabled ci-docker-prune.timer; systemctl is-active ci-docker-prune.timer'
ssh cc-ci 'systemctl list-timers ci-docker-prune.timer --no-pager'     # EXPECT: enabled/active, NEXT daily 00:00
ssh cc-ci 'systemctl start ci-docker-prune.service; \
           journalctl -u ci-docker-prune.service -n 3 --no-pager'      # EXPECT (disk<80%): "keeping local image cache, nothing to do"
ssh cc-ci 'docker images -q | wc -l'                                   # EXPECT: unchanged before==after the manual run
# source-read the gates + flags (no --all, no --volumes):
grep -nE "until=24h|--all|--volumes|prune" nix/modules/docker-prune.nix
grep -n "autoPrune" nix/modules/swarm.nix                              # EXPECT: only a comment, no enable=true
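
The "unchanged before==after" check compares counts only, so a removal that coincides with a new pull would go unnoticed. A stricter variant compares the image ID sets. This is a sketch, assuming two snapshot files captured with `docker images -q`; the helper name `removed_ids` and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list image IDs present in BEFORE but missing from AFTER.
# Empty output means the manual prune run removed nothing.
removed_ids() {
  sort -u "$1" | grep -vxF -f "$2" || true   # grep exits 1 when nothing is left
}

# Local demonstration with made-up IDs:
tmp=$(mktemp -d)
printf 'aaa111\nbbb222\nccc333\n' > "$tmp/before"
printf 'aaa111\nccc333\n'         > "$tmp/after"
removed_ids "$tmp/before" "$tmp/after"   # prints the one ID that disappeared

# Against the host, the snapshots would be taken around the manual run:
#   ssh cc-ci 'docker images -q' > before
#   ssh cc-ci 'systemctl start ci-docker-prune.service'
#   ssh cc-ci 'docker images -q' > after
```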

Active-path evidence (Builder ran the exact prune command; gate reaches it only ≥80% disk): docker image prune -f --filter until=24h reclaimed 2.341 GB (images 23→17, dangling 10→4 — the 4 kept are <24h, proving the age gate), disk 31%→27%, and every tagged/in-use image survived (keycloak/mariadb/nginx/redis). Disk bounded without -af.

EXPECTED: old timer not-found; ci-docker-prune.timer enabled+active (daily); manual run below 80% prints the no-op line and removes nothing; module flags are --filter until=24h only (never --all/--volumes); swarm.nix has no live autoPrune.

PC2 — Local cache retained + authenticated (confirm)

WHAT. Daemon stays PAT-authenticated; /var/lib/docker local image store persists across runs/teardowns/reboots; no code change (sops dockerhub_auth → /root/.docker/config.json in nix/modules/secrets.nix, unchanged). HOW / EXPECTED:

ssh cc-ci 'docker info 2>/dev/null | grep Username'        # EXPECT: Username: nptest2
ssh cc-ci 'ls -l /root/.docker/config.json'                # EXPECT: -> /run/secrets/rendered/docker-config.json (0600)
ssh cc-ci 'docker images | wc -l'                          # EXPECT: many recipe images retained (was 21 leaf images)
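
Note that `ls -l` on a symlink prints the link's own permission bits, not the target's, so the 0600 expectation is easier to confirm by dereferencing. A minimal sketch, assuming GNU coreutils `stat` on the host; the function name `check_secret_link` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check_secret_link LINK EXPECTED_TARGET
# Passes only if LINK is a symlink to EXPECTED_TARGET and the target is mode 600.
check_secret_link() {
  [ -L "$1" ] || { echo "not a symlink"; return 1; }
  [ "$(readlink "$1")" = "$2" ] || { echo "unexpected target"; return 1; }
  # -L dereferences the link so %a reports the target's mode (GNU stat).
  [ "$(stat -L -c '%a' "$1")" = "600" ] || { echo "target mode is not 0600"; return 1; }
  echo "ok"
}

# On the host this would be called as:
#   check_secret_link /root/.docker/config.json /run/secrets/rendered/docker-config.json
```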

PC3 — Deploy → teardown → redeploy reuses local layers (no re-download)

WHAT. A previously-pulled image is retained through teardown and a redeploy reuses local layers; only an authenticated manifest check remains. Builder-proven with a real swarm deploy/teardown/redeploy on redis:7-alpine (docker.io through the authenticated daemon, the same pull path abra/swarm use). HOW (Adversary, reproducible):

ssh cc-ci 'bash -s' <<'PROOF'
IMG=redis:7-alpine; docker rmi -f "$IMG" >/dev/null 2>&1 || true
t0=$(date +%s%N); docker pull "$IMG" 2>&1 | grep -E "Pull complete|Downloaded|Already exists|up to date"; t1=$(date +%s%N)
echo COLD_MS=$(((t1-t0)/1000000))
docker service create --name pc3 --replicas 1 "$IMG" sleep 120 >/dev/null 2>&1; docker service ls --filter name=pc3 --format '{{.Replicas}}'
docker service rm pc3 >/dev/null 2>&1
echo retained: $(docker images "$IMG" --format '{{.ID}}')
t2=$(date +%s%N); docker pull "$IMG" 2>&1 | grep -E "Pull complete|Downloaded|Already exists|up to date"; t3=$(date +%s%N)
echo WARM_MS=$(((t3-t2)/1000000)); docker rmi -f "$IMG" >/dev/null 2>&1
PROOF

EXPECTED: COLD pull shows layer "Pull complete" lines (download) — Builder saw 6 layers, COLD_MS≈5303; after service rm the image ID is still listed (retained); WARM pull shows Image is up to date (no layer download), WARM_MS≈674 (≈8× faster, manifest-only). Confirms the local store is the cache, survives teardown, and a redeploy needs no Docker-Hub layer download. Optional fuller proof: a real recipe cycle RECIPE=custom-html-tiny PR=0 STAGES=install cc-ci-run runner/run_recipe_ci.py run twice — the 2nd deploy shows no image-layer download.
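
The cold/warm distinction and the "≈8×" figure can be checked mechanically from the proof output. The sketch below is illustrative only: `classify_pull` and `speedup` are hypothetical helper names, and the classification relies on the same `docker pull` status strings the proof script greps for.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading the PC3 proof output.

# Classify `docker pull` output on stdin: "cold" if any layer was downloaded,
# "warm" if the daemon reported the local image as current.
classify_pull() {
  out=$(cat)
  case $out in
    *"Pull complete"*) echo "cold" ;;
    *"up to date"*)    echo "warm" ;;
    *)                 echo "unknown" ;;
  esac
}

# speedup COLD_MS WARM_MS: ratio to one decimal place.
speedup() {
  awk -v c="$1" -v w="$2" 'BEGIN { printf "%.1f\n", c / w }'
}

# Using the Builder's measured values from this document:
speedup 5303 674   # prints 7.9, i.e. the "≈8× faster" above
printf 'Status: Image is up to date for redis:7-alpine\n' | classify_pull
```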

DoD checklist (Builder view — Adversary owns the verdict in REVIEW-2pc.md)

PC1 — autoPrune --all removed; surgical gated ci-docker-prune deployed; teardown keeps images.
PC2 — daemon PAT-authenticated (nptest2); local store retained across rebuild.
PC3 — deploy→teardown→redeploy reuses local layers (no re-download), measured; disk bounded (31%) without -af. Documented (runbook/warm/DECISIONS/IDEAS).
