cc-ci/machine-docs/STATUS-2pc.md
autonomic-bot d389dd516b status(2pc): ## DONE — Adversary PASS for PC1+PC2+PC3, F2pc-1 closed, no VETO
Phase 2pc complete: conservative surgical gated prune (ci-docker-prune) live + reproducible from
git, local Docker store retained as the cache (PAT-authenticated, layer reuse proven), registry
pull-through cache deferred to IDEAS. Adversary review(2pc) 486d162 PASS @2026-05-29. Watchdog
auto-returns to Phase 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:53:30 +01:00


# STATUS — Phase 2pc (sane image-prune policy)
**SSOT:** `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`
**Scope (operator correction 2026-05-29):** PC1 conservative prune + PC2/PC3 confirm+verify
local-store retention/auth. **Registry pull-through cache DROPPED** (deferred → `cc-ci-plan/IDEAS.md`
+ DECISIONS Phase-2pc; no registry code was written).
## DONE
Phase 2pc complete. **Adversary PASS @2026-05-29** for PC1+PC2+PC3 (REVIEW-2pc.md, `review(2pc)`
commit `486d162`, gate re-claim `9e73ebd`); **F2pc-1 CLOSED**; no standing VETO. git==host
(`ci-docker-prune`, reproducible from a fresh clone). Watchdog auto-returns to Phase 2.
## Gate: 2pc — PASSED (was RE-CLAIMED; F2pc-1 resolved)
All of PC1/PC2/PC3 implemented, deployed to cc-ci, and Builder-verified on the real host. WHAT / HOW
/ EXPECTED / WHERE below.
**F2pc-1 (committed code ≠ deployed host) — RESOLVED.** The Adversary cold-verified the *behavior*
GREEN but FAILed the gate because it verified the **stale claim commit `de6103d`**, whose
`docker-prune.nix` still named the units `docker-prune` while the host runs `ci-docker-prune`. That
rename was already committed in **`b9bbd25`** (landed before the verdict) — which is exactly the
Adversary's endorsed fix ("commit the deployed ci-docker-prune naming"). **Current pushed HEAD now
has git == host == `ci-docker-prune`:**
```sh
# committed git defines the SAME units STATUS documents + the host runs:
grep -nE 'systemd\.(services|timers)\.' nix/modules/docker-prune.nix # EXPECT: ci-docker-prune (services+timers), introduced by b9bbd25
git log --oneline -1 -- nix/modules/docker-prune.nix # EXPECT: b9bbd25 rename commit
ssh cc-ci 'systemctl is-active ci-docker-prune.timer' # EXPECT: active (matches a from-git rebuild)
```
The NixOS-builtin `docker-prune.service` is `inactive`/`linked` (and `docker-prune.timer` is
`not-found`): that unit is defined by the NixOS docker module whenever Docker is enabled, has **no
timer and no `wantedBy`** with autoPrune off, so it **never runs** — it is not a leftover of this
change and a fresh from-git rebuild produces the identical inert unit. The unit name is determined
literally by the attribute in `docker-prune.nix`, so a from-git build yields `ci-docker-prune.*`.
(Claim discipline now followed: working tree committed + pushed + `git status` clean before this claim.)
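The git == host comparison above could be automated along these lines. This is a minimal sketch, not part of the repo: `units_in_module` and `check_host` are hypothetical helper names, and it assumes the module declares its units with unquoted dotted attributes (`systemd.services.<name>`, `systemd.timers.<name>`), as the `grep` check above implies.

```sh
# Sketch only: assumes unquoted attribute names such as
# `systemd.services.ci-docker-prune` in the module source.

# Read Nix source on stdin, print the systemd unit names it defines.
units_in_module() {
  grep -oE 'systemd\.(services|timers)\.[A-Za-z0-9_-]+' |
    sed -E 's/^systemd\.services\.(.*)$/\1.service/; s/^systemd\.timers\.(.*)$/\1.timer/' |
    sort -u
}

# Check each unit on the host; `systemctl cat` fails for unknown units.
# ssh -n keeps ssh from swallowing the loop's stdin.
check_host() {  # usage: check_host <ssh-host> < module.nix
  units_in_module | while read -r unit; do
    if ssh -n "$1" "systemctl cat '$unit'" >/dev/null 2>&1; then
      echo "ok      $unit"
    else
      echo "MISSING $unit"
    fi
  done
}
```

Usage from a fresh clone would be `check_host cc-ci < nix/modules/docker-prune.nix`; any `MISSING` line means committed code and deployed host have drifted again.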
---
### PC1 — Conservative prune policy
**WHAT.** Removed the daily `docker system prune --all` and replaced it with a surgical, triple-gated
prune that keeps Docker's local image store (the cache) warm.
- **WHERE.** `nix/modules/docker-prune.nix` (NEW, unit `ci-docker-prune` service+timer);
`nix/modules/swarm.nix` (`virtualisation.docker.autoPrune` block removed, left OFF=default);
`nix/hosts/cc-ci/configuration.nix` (imports `docker-prune.nix`). Deployed via
`nixos-rebuild switch --flake path:/root/cc-ci#cc-ci`.
- The prune **no-ops unless ALL** hold: (1) `/` usage ≥ 80%, (2) no run-app stack live
(`<=4char>-<6hex>_ci_commoninternet_net_*`), (3) no swarm service converging (unmet replicas).
When it runs: `docker {container,image,builder} prune -f --filter until=24h` — **dangling+old only,
never `--all`, never `--volumes`.**
- Teardown unchanged: `runner/harness/lifecycle.py::teardown_app` removes services/volumes/secrets/
.env and **no images** (`grep -rn 'rmi\|image rm\|image prune' runner/ tests/conftest.py` = empty).
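The triple gate described above can be sketched as a small shell script. This is a hypothetical illustration, not the deployed module (that lives in `nix/modules/docker-prune.nix` and may differ in detail). The function names, the use of `docker service ls` as the source of names and replica counts, and the regex rendering of the run-app stack pattern are all assumptions; the no-op message and the prune flags are the ones quoted in this document.

```sh
# Hypothetical sketch of the triple gate; not the deployed implementation.
THRESHOLD=80   # gate 1: prune only when / usage is at or above this percentage

# Gate 1 input: integer usage percentage of the root filesystem.
root_usage_pct() {
  df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Gate 2: succeeds if any service name on stdin looks like a run-app stack
# (assumption: the pattern is matched against `docker service ls` names).
run_app_live() {
  grep -Eq '^.{1,4}-[0-9a-f]{6}_ci_commoninternet_net_'
}

# Gate 3: succeeds if any "running/desired" replica pair on stdin is unequal.
converging() {
  awk '{ split($1, r, "/"); if (r[1] != r[2]) found = 1 } END { exit !found }'
}

# Decide from collected inputs: <usage %> <service names> <replica pairs>.
should_prune() {
  [ "$1" -ge "$THRESHOLD" ] || return 1
  printf '%s\n' "$2" | run_app_live && return 1
  printf '%s\n' "$3" | converging && return 1
  return 0
}

# Wire the gates to the real host state; call this on the Docker host only.
main() {
  names=$(docker service ls --format '{{.Name}}')
  replicas=$(docker service ls --format '{{.Replicas}}')
  if should_prune "$(root_usage_pct)" "$names" "$replicas"; then
    for kind in container image builder; do
      docker "$kind" prune -f --filter until=24h   # never --all, never --volumes
    done
  else
    echo "keeping local image cache, nothing to do"
  fi
}
```

Keeping the decision in `should_prune`, separate from the Docker calls in `main`, lets the gate logic be exercised with canned inputs without touching a live daemon.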
**HOW to verify (cold, Adversary's own checks):**
```sh
ssh cc-ci 'systemctl is-enabled docker-prune.timer' # EXPECT: not-found (autoPrune gone)
ssh cc-ci 'systemctl is-enabled ci-docker-prune.timer; systemctl is-active ci-docker-prune.timer' # EXPECT: enabled, then active
ssh cc-ci 'systemctl list-timers ci-docker-prune.timer --no-pager' # EXPECT: NEXT daily 00:00
ssh cc-ci 'systemctl start ci-docker-prune.service; \
journalctl -u ci-docker-prune.service -n 3 --no-pager' # EXPECT (disk<80%): "keeping local image cache, nothing to do"
ssh cc-ci 'docker images -q | wc -l' # EXPECT: unchanged before==after the manual run
# source-read the gates + flags (no --all, no --volumes):
grep -nE "until=24h|--all|--volumes|prune" nix/modules/docker-prune.nix
grep -n "autoPrune" nix/modules/swarm.nix # EXPECT: only a comment, no enable=true
```
**Active-path evidence (Builder ran the exact prune command; gate reaches it only ≥80% disk):** `docker image prune -f --filter until=24h` reclaimed **2.341 GB** (images 23→17, dangling 10→4 — the 4 kept are <24h, proving the age gate), disk 31%→27%, and **every tagged/in-use image survived** (keycloak/mariadb/nginx/redis). Disk bounded without `-af`.
**EXPECTED:** old timer not-found; `ci-docker-prune.timer` enabled+active (daily); manual run below
80% prints the no-op line and removes nothing; module flags are `--filter until=24h` only (never
`--all`/`--volumes`); swarm.nix has no live autoPrune.
### PC2 — Local cache retained + authenticated (confirm)
**WHAT.** Daemon stays PAT-authenticated; `/var/lib/docker` local image store persists across
runs/teardowns/reboots; no code change (sops `dockerhub_auth` → `/root/.docker/config.json` in
`nix/modules/secrets.nix`, unchanged).
**HOW / EXPECTED:**
```sh
ssh cc-ci 'docker info 2>/dev/null | grep Username' # EXPECT: Username: nptest2
ssh cc-ci 'ls -l /root/.docker/config.json' # EXPECT: -> /run/secrets/rendered/docker-config.json (0600)
ssh cc-ci 'docker images | wc -l' # EXPECT: many recipe images retained (was 21 leaf images)
```
### PC3 — Deploy → teardown → redeploy reuses local layers (no re-download)
**WHAT.** A previously-pulled image is retained through teardown and a redeploy reuses local layers;
only an authenticated manifest check remains. Builder-proven with a real swarm deploy/teardown/
redeploy on `redis:7-alpine` (docker.io through the authenticated daemon, the same pull path
abra/swarm use).
**HOW (Adversary, reproducible):**
```sh
ssh cc-ci 'bash -s' <<'PROOF'
IMG=redis:7-alpine; docker rmi -f "$IMG" >/dev/null 2>&1 || true
t0=$(date +%s%N); docker pull "$IMG" 2>&1 | grep -E "Pull complete|Downloaded|Already exists|up to date"; t1=$(date +%s%N)
echo COLD_MS=$(((t1-t0)/1000000))
docker service create --name pc3 --replicas 1 "$IMG" sleep 120 >/dev/null 2>&1; docker service ls --filter name=pc3 --format '{{.Replicas}}'
docker service rm pc3 >/dev/null 2>&1
echo retained: $(docker images redis:7-alpine --format '{{.ID}}')
t2=$(date +%s%N); docker pull "$IMG" 2>&1 | grep -E "Pull complete|Downloaded|Already exists|up to date"; t3=$(date +%s%N)
echo WARM_MS=$(((t3-t2)/1000000)); docker rmi -f "$IMG" >/dev/null 2>&1
PROOF
```
**EXPECTED:** COLD pull shows layer "Pull complete" lines (download); Builder saw 6 layers,
COLD_MS≈5303. After `service rm` the image ID is still listed (retained). WARM pull shows
`Image is up to date` (no layer download), WARM_MS≈674 (≈8× faster, manifest-only). This confirms
the local store is the cache, survives teardown, and a redeploy needs no Docker Hub layer download.
Optional fuller proof: a real recipe cycle
`RECIPE=custom-html-tiny PR=0 STAGES=install cc-ci-run runner/run_recipe_ci.py` run twice; the 2nd
deploy shows no image-layer download.
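For an automated verdict on the PROOF output, a sketch like the following could classify each pull and compute the cold/warm ratio. `classify_pull` and `speedup` are hypothetical helper names, not part of the repo; the matched strings are the same ones the PROOF script greps for.

```sh
# Sketch only: classify `docker pull` output read from stdin.
classify_pull() {
  out=$(cat)
  if printf '%s\n' "$out" | grep -q 'Pull complete'; then
    echo download          # at least one layer was fetched
  elif printf '%s\n' "$out" | grep -q 'up to date'; then
    echo manifest-only     # local layers reused, only the manifest was checked
  else
    echo unknown
  fi
}

# Rounded integer ratio of cold to warm pull time (milliseconds).
speedup() {  # usage: speedup <cold_ms> <warm_ms>
  echo $(( ($1 + $2 / 2) / $2 ))
}
```

With the Builder's measurements, `speedup 5303 674` prints `8`, matching the ≈8× figure above.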
---
## DoD checklist (Builder view — Adversary owns the verdict in REVIEW-2pc.md)
- [x] **PC1** autoPrune `--all` removed; surgical gated `ci-docker-prune` deployed; teardown keeps images.
- [x] **PC2** daemon PAT-authenticated (nptest2); local store retained across rebuild.
- [x] **PC3** deploy→teardown→redeploy reuses local layers (no re-download), measured; disk bounded
(31%) without `-af`. Documented (runbook/warm/DECISIONS/IDEAS).
## Not blocked. No standing blockers.