claim(2pc): PC1 conservative prune deployed+verified; PC2/PC3 local-store cache confirmed

ci-docker-prune (gated surgical prune) live on cc-ci: old autoPrune --all gone, new timer
enabled (daily), no-ops below 80% disk keeping the local image cache, never --all/--volumes.
Daemon stays PAT-authenticated (nptest2); /var/lib/docker retained across rebuild. PC3 proof:
redis:7-alpine deploy->teardown(service rm, image retained)->redeploy = "Image is up to date",
no layer re-download (cold 5303ms -> warm 674ms). Docs: runbook "Image cache & prune policy",
warm.md, DECISIONS Phase-2pc, IDEAS (registry pull-through cache deferred + revisit trigger).
Gate 2pc CLAIMED, awaiting Adversary cold-verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:42:36 +01:00
parent 16d177e73a
commit de6103d41d
5 changed files with 185 additions and 22 deletions


@ -24,10 +24,10 @@ curl -s -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 \
convergence `TIMEOUT` (default 300s). Bump `TIMEOUT` via the recipe's `recipe_meta.py` `EXTRA_ENV`
(lasuite-docs uses 900). Verify the stack converges manually: `docker stack services <stack>`.
- **`toomanyrequests: unauthenticated pull rate limit`** (task Rejected "No such image"): Docker Hub
anonymous rate limit — the A1 registry-creds finding. Provide Docker Hub creds (sops `secrets/`,
wire into the docker daemon). Do **not** `docker image prune -af` mid-breadth — it evicts cached
images and forces re-pulls that hit the limit. Check disk first: `df -h /` (heavy recipes need
headroom; prune only `dangling` between runs or rely on the daily autoprune).
anonymous rate limit. The daemon is now PAT-authenticated (sops `dockerhub_auth` →
`/root/.docker/config.json`; `docker info` Username=nptest2; 200/6h per-account). Do **not**
`docker image prune -af` — it evicts cached base/in-use images and forces re-pulls that burn the
limit. See **Image cache & prune policy** below. Check disk first: `df -h /`.
- **`authentication required: Unauthorized` fetching recipe tags:** an abra command tried to fetch
from the private mirror origin. All recipe-touching harness calls pass `-C -o` (chaos+offline);
`recipe_versions`/upgrade use the upstream tags fetched read-only at clone time. If you see this,
@ -54,6 +54,31 @@ abra app volume remove "$D" -f -n; abra app secret remove "$D" --all -n; abra ap
```
Confirm clean: `docker service ls | grep <prefix>` returns nothing.
## Image cache & prune policy
On this **single host, Docker's own local image store IS the cache** — a pulled image stays, and
re-deploys (cold tests, warm canonical, reboots) reuse the local layers with no re-download; the
daemon is PAT-authenticated so a warm redeploy makes at most one authenticated manifest check.
Teardown removes the run's services/volumes/secrets/.env but **never images** — so the next deploy
of the same recipe is local. (No separate `registry:2` pull-through cache: it only pays off
multi-node / separate-survivable storage, neither of which we have — see DECISIONS Phase-2pc.)
Pruning is the **`ci-docker-prune`** unit (`nix/modules/docker-prune.nix`), a daily timer that is
**surgical and triple-gated** — it does **nothing** unless ALL hold: (1) `/` usage ≥ 80% (genuine
disk pressure), (2) no run-app stack live (never prune mid-run), (3) no swarm service converging
(no deploy/pull in flight). When it does run it prunes only **dangling images + stopped containers +
dangling build cache, age-gated `until=24h`** — **never `--all`** (keeps tagged base/in-use images),
**never `--volumes`** (warm canonical data). The old `virtualisation.docker.autoPrune --all` was
removed — its daily `--all` evicted cached recipe base images → cold re-pull → Hub rate-limit churn.
```sh
ssh cc-ci 'systemctl list-timers ci-docker-prune.timer --no-pager; \
systemctl start ci-docker-prune.service; \
journalctl -u ci-docker-prune.service -n 3 --no-pager' # below 80% -> no-op, keeps cache
```
Reclaim manually under real pressure (still surgical, never `-af`):
`ssh cc-ci 'docker image prune -f --filter until=24h'` (dangling only).
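The gate order described above can be sketched as a pure decision function. This is a minimal illustration, not the contents of `nix/modules/docker-prune.nix`: the name `should_prune`, its three numeric arguments, and the skip messages are assumptions; only the 80% threshold and the three conditions come from the policy text.

```sh
#!/bin/sh
# Illustrative sketch of the triple gate (hypothetical helper, not the deployed unit).
THRESHOLD=80   # documented disk-pressure threshold for /

# should_prune DISK_PCT LIVE_STACKS CONVERGING_SERVICES
# Prints "prune" only when all three gates hold, otherwise the reason for the no-op.
should_prune() {
  pct=$1 live=$2 converging=$3
  if [ "$pct" -lt "$THRESHOLD" ]; then
    echo "skip: / at ${pct}% (< ${THRESHOLD}%)"
  elif [ "$live" -gt 0 ]; then
    echo "skip: ${live} run-app stack(s) live"
  elif [ "$converging" -gt 0 ]; then
    echo "skip: ${converging} service(s) converging"
  else
    echo "prune"
  fi
}

should_prune 31 0 0   # disk below threshold
should_prune 85 1 0   # pressure, but a run is live
should_prune 85 0 0   # all gates pass
```

Checking disk first keeps the common case (no pressure) free of any Docker API calls.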
## Re-running / triggering by hand
- Re-comment `!testme` on the PR (distinct comment id → re-runs; deduped per comment).


@ -85,10 +85,12 @@ back cleanly to a full cold run (the PR is still tested).
- **Serialize:** `DRONE_RUNNER_CAPACITY = MAX_TESTS` (default 1); the nightly sweep is serial and
skips if a `run_recipe_ci.py` is active. At most MAX_TESTS apps are ever live at once.
- **Warm keycloak shared safely** via per-run namespaced realms (above); orphan realms reaped.
- **Disk** (warm is the budget, not RAM): `virtualisation.docker.autoPrune` prunes
images/containers/networks/build-cache older than 24h but **never `--volumes`** (so data-warm
canonical volumes survive). Each canonical = one data volume + one snapshot (small; the keycloak DB
snapshot ~300M dominates). `canonical.prune_stale()` (run nightly) drops warm data for
- **Disk** (warm is the budget, not RAM): the `ci-docker-prune` unit (`nix/modules/docker-prune.nix`,
Phase-2pc) prunes only **dangling** images/containers/build-cache (`until=24h`), and only under
genuine disk pressure (`/` ≥ 80%) with nothing in flight — **never `--all`** (keeps cached base/
in-use images warm; the local store IS the cache on this single host) and **never `--volumes`** (so
data-warm canonical volumes survive). Each canonical = one data volume + one snapshot (small; the
keycloak DB snapshot ~300M dominates). `canonical.prune_stale()` (run nightly) drops warm data for
**de-enrolled** canonicals. Monitor with `df -h /` (the nightly logs it).
- **Cold teardown stays sacred:** a cold per-run app's volumes/secrets are always deleted at run end
(or janitor-reaped); promote re-seeds the canonical separately (never reuses a per-run volume).


@ -724,3 +724,29 @@ Standing policy for all Phase-2 (and later) recipe OIDC/SSO testing:
Consequences: DEFERRED #9 (authentik enrollment) re-entry trigger narrowed to "a recipe requires
authentik"; F2-7 (authentik backend) is not a DONE blocker. plan-sso-dep-testing.md §6 updated by the
orchestrator to match.
## Phase 2pc — image-prune policy; local store IS the cache; registry pull-through DROPPED (2026-05-29) — SETTLED
Decision (PC1): removed `virtualisation.docker.autoPrune` (it ran `docker system prune --force --all
--filter until=24h` daily). The `--all` evicts every image not used by a *running* container —
between runs no test apps run, so it wiped the cached recipe base images → cold re-pull → Docker-Hub
rate-limit churn (JOURNAL-2 507/542/690-693). Replaced with `nix/modules/docker-prune.nix`: the
`ci-docker-prune` daily timer + oneshot, a **surgical triple-gated** prune that no-ops unless ALL of
(1) `/` ≥ 80%, (2) no run-app stack live, (3) no swarm service converging; and when it runs prunes
only **dangling images + stopped containers + dangling build cache, `until=24h`** — never `--all`
(keeps tagged base/in-use images), never `--volumes` (warm canonical data). Teardown
(`lifecycle.teardown_app`) already removes only services/volumes/secrets/.env, never images — kept.
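One way gate (3) could be evaluated is by reading the `running/desired` column that `docker service ls --format '{{.Replicas}}'` prints. The sketch below is an assumption about how "converging" might be detected (running below desired); the helper name `count_converging` is hypothetical and the deployed module may check this differently.

```sh
#!/bin/sh
# Hypothetical helper: count services whose running replicas are below desired.
# Input: one "running/desired" value per line on stdin, e.g. "0/1" or "3/3".
count_converging() {
  awk -F/ '{ if ($1 + 0 < $2 + 0) n++ } END { print n + 0 }'
}

# Sample input standing in for: docker service ls --format '{{.Replicas}}'
printf '1/1\n0/1\n3/3\n' | count_converging
```

A result of 0 would let the gate pass; anything else means a deploy or pull may be in flight.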
Why: on this **single host Docker's own local image store IS the cache** — a pulled image stays and
redeploys reuse local layers with no re-download (proven: redis:7-alpine cold pull 5303ms w/ 6 layer
downloads → after `service rm` teardown the image is retained → warm redeploy "Image is up to date"
674ms, no bytes); the PAT-authenticated daemon (200/6h) makes the residual warm-deploy manifest check
free of rate-limit pressure. So *keeping* the store recovers ~all the benefit a cache would give.
Decision (registry pull-through cache): **DROPPED here, deferred to IDEAS / Phase 2b** (operator
scope correction 2026-05-29, mid-phase). A `registry:2` pull-through cache's distinctive wins —
multi-node fan-out, surviving prune/VM-rebuild on *separate* storage, cache-miss authentication —
**don't apply** to a single authenticated non-pruning host (one node; co-located cache lost on a
recreate anyway; daemon already authenticated). It would add a registry service + daemon-mirror
config + cache GC for marginal gain. **Revisit ONLY if** (a) cc-ci goes multi-node, OR (b) Phase-2b
measurement shows cold-deploy pull time is a real bottleneck AND the cache can live on
recreate-surviving storage (Incus volume / host b1 path, not the VM's ephemeral disk). No registry
code was written (caught during orientation) — nothing to revert.


@ -45,3 +45,42 @@ oneshot `systemd.service` running a surgical, **triple-gated** prune:
When all gates pass: `docker {container,image,builder} prune -f --filter until=24h` — dangling +
age-gated only. NEVER `--all` (keeps tagged base/in-use images), NEVER `--volumes` (warm canonical
data, per swarm.nix's existing comment).
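Expanded, the brace shorthand above is three separate prune calls. A minimal sketch follows; the function name `surgical_prune` and the `DOCKER` override are illustrative assumptions, added so the commands can be printed instead of executed.

```sh
#!/bin/sh
# Illustrative expansion of `docker {container,image,builder} prune -f --filter until=24h`.
# DOCKER can be overridden (e.g. DOCKER=echo) to print the commands as a dry run.
DOCKER=${DOCKER:-docker}

surgical_prune() {
  # Without -a/--all, `image prune` touches dangling images only;
  # --volumes is never passed, so named volumes are left alone.
  for target in container image builder; do
    "$DOCKER" "$target" prune -f --filter until=24h
  done
}

# Dry run: DOCKER=echo surgical_prune
```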
## 2026-05-29 — Implemented + deployed + verified on cc-ci
**Implementation.** `nix/modules/docker-prune.nix` (NEW) + `swarm.nix` (dropped autoPrune block) +
`configuration.nix` import. Unit renamed `docker-prune` → **`ci-docker-prune`** because the NixOS
docker module reserves `systemd.services.docker-prune` (build conflict caught by `nixos-rebuild
build`: "conflicting definition values for systemd.services.docker-prune.description"). Renamed,
rebuilt clean.
**Deploy.** Synced the 3 changed nix files to `/root/cc-ci` (tar over ssh; isolated change — host
tree otherwise unchanged), `nixos-rebuild build` (clean, shellcheck on the writeShellApplication
passed), then `systemd-run --unit=ccci-sw ... nixos-rebuild switch path:/root/cc-ci#cc-ci`. Switch
finished (22.5s CPU), `systemctl is-system-running` → `running`.
**Verification (real host).**
- Old NixOS `docker-prune.timer` → `is-enabled` = **not-found** (autoPrune gone). `ci-docker-prune.timer`
→ enabled + active; `list-timers` NEXT = Sat 2026-05-30 00:00 UTC (daily).
- Manual `systemctl start ci-docker-prune.service` at `/`=31%: log →
`docker-prune: / at 31% (< 80%) — keeping local image cache, nothing to do`. No images removed
(21 → 21). Gate works.
- PC2: `docker info | grep Username` → `nptest2` (PAT auth retained after rebuild). `/var/lib/docker`
persistent (21 recipe images retained across the rebuild).
- PC3 layer-reuse proof (real swarm deploy→teardown→redeploy, redis:7-alpine, docker.io via authed daemon):
```
COLD pull: 897d... Already exists; c14c.. f546.. a300.. 941e.. 4f4f.. 677c.. Pull complete (6 downloaded)
Status: Downloaded newer image for redis:7-alpine COLD_PULL_MS=5303
service create pc3b -> 1/1
service rm pc3b -> retained_after_teardown: redis:7-alpine 487efc061638 (image REMAINS)
WARM pull: Status: Image is up to date for redis:7-alpine WARM_PULL_MS=674 (no bytes)
redeploy create pc3b -> redeploy_ok (reused local layers)
```
Cold 5303ms (6 layer downloads) → warm 674ms (authenticated manifest check only, 0 layers
re-downloaded). The alpine base layer `897d...` showed "Already exists" even on the cold pull =
cross-image base-layer reuse, a bonus cache win. Teardown (`service rm`) retained the image —
matches `teardown_app` (no rmi).
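The cold/warm comparison reduces to simple arithmetic on the two measured timings. A small sketch, where the helper name `speedup` is illustrative and the inputs are the figures recorded above:

```sh
#!/bin/sh
# speedup COLD_MS WARM_MS: prints the ratio and the absolute time saved.
speedup() {
  awk -v c="$1" -v w="$2" 'BEGIN { printf "speedup=%.1fx saved_ms=%d\n", c / w, c - w }'
}

speedup 5303 674   # prints: speedup=7.9x saved_ms=4629
```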
**Docs/decisions.** `docs/runbook.md` (new "Image cache & prune policy" + updated rate-limit note),
`docs/warm.md` (autoPrune→ci-docker-prune), `DECISIONS.md` (Phase-2pc entry), `cc-ci-plan/IDEAS.md`
(deferred registry cache + revisit trigger). Gate claimed.


@ -1,22 +1,93 @@
# STATUS — Phase 2pc (sane image-prune policy)
**SSOT:** `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`
**Scope (operator correction 2026-05-29):** PC1 conservative prune + PC2/PC3 confirm-and-verify
local-store retention/auth. **Registry pull-through cache DROPPED** (deferred to IDEAS / Phase 2b).
**Scope (operator correction 2026-05-29):** PC1 conservative prune + PC2/PC3 confirm+verify
local-store retention/auth. **Registry pull-through cache DROPPED** (deferred → `cc-ci-plan/IDEAS.md`
+ DECISIONS Phase-2pc; no registry code was written).
## Phase: PC1 implemented, deploy+verify in flight (NOT yet claimed)
## Gate: 2pc — CLAIMED, awaiting Adversary
In flight: build the new prune module onto cc-ci via `nixos-rebuild switch`, then run the
deploy→teardown→redeploy layer-reuse proof. Gate will be CLAIMED once verified on the real host.
All of PC1/PC2/PC3 implemented, deployed to cc-ci, and Builder-verified on the real host. Commit
sha for this claim: see `claim(2pc)` HEAD. WHAT / HOW / EXPECTED / WHERE below.
## What changed (the diff)
---
- `nix/modules/swarm.nix` — removed `virtualisation.docker.autoPrune` (it ran
`docker system prune --force --all --filter until=24h` daily; `--all` evicts every image not used
by a *running* container → wiped cached recipe base images → cold re-pull → Hub rate-limit churn).
- `nix/modules/docker-prune.nix` (NEW) — daily `systemd.timer` + oneshot `systemd.service`
`docker-prune` running a surgical, triple-gated prune. Imported in `nix/hosts/cc-ci/configuration.nix`.
- Teardown (`runner/harness/lifecycle.py::teardown_app`) UNCHANGED — already removes only
services/volumes/secrets/.env, never images (PC1 teardown requirement already held).
### PC1 — Conservative prune policy
(Verification context — WHAT/HOW/EXPECTED/WHERE — will be filled in here at gate-claim time.)
**WHAT.** Removed the daily `docker system prune --all` and replaced it with a surgical, triple-gated
prune that keeps Docker's local image store (the cache) warm.
- **WHERE.** `nix/modules/docker-prune.nix` (NEW, unit `ci-docker-prune` service+timer);
`nix/modules/swarm.nix` (`virtualisation.docker.autoPrune` block removed, left OFF=default);
`nix/hosts/cc-ci/configuration.nix` (imports `docker-prune.nix`). Deployed via
`nixos-rebuild switch --flake path:/root/cc-ci#cc-ci`.
- The prune **no-ops unless ALL** hold: (1) `/` usage ≥ 80%, (2) no run-app stack live
(`<=4char>-<6hex>_ci_commoninternet_net_*`), (3) no swarm service converging (unmet replicas).
When it runs: `docker {container,image,builder} prune -f --filter until=24h` — **dangling+old only,
never `--all`, never `--volumes`.**
- Teardown unchanged: `runner/harness/lifecycle.py::teardown_app` removes services/volumes/secrets/
.env and **no images** (`grep -rn 'rmi\|image rm\|image prune' runner/ tests/conftest.py` = empty).
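Gate (2) depends on recognising run-app stack names. A possible matcher for the `<=4char>-<6hex>_ci_commoninternet_net_*` shape is sketched below; the character class for the short prefix (lowercase alphanumerics) and the function name `is_run_app_stack` are assumptions, not taken from the module.

```sh
#!/bin/sh
# Hypothetical matcher: succeeds when NAME looks like a per-run CI stack resource,
# i.e. 1-4 prefix characters, a dash, 6 hex digits, then the CI domain suffix.
is_run_app_stack() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_'
}

if is_run_app_stack "abcd-1a2b3c_ci_commoninternet_net_app"; then
  echo "run-app stack live: skip prune"
fi
```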
**HOW to verify (cold, Adversary's own checks):**
```sh
ssh cc-ci 'systemctl is-enabled docker-prune.timer' # EXPECT: not-found (autoPrune gone)
ssh cc-ci 'systemctl is-enabled ci-docker-prune.timer; systemctl is-active ci-docker-prune.timer'
ssh cc-ci 'systemctl list-timers ci-docker-prune.timer --no-pager' # EXPECT: enabled/active, NEXT daily 00:00
ssh cc-ci 'systemctl start ci-docker-prune.service; \
journalctl -u ci-docker-prune.service -n 3 --no-pager' # EXPECT (disk<80%): "keeping local image cache, nothing to do"
ssh cc-ci 'docker images -q | wc -l' # EXPECT: unchanged before==after the manual run
# source-read the gates + flags (no --all, no --volumes):
grep -nE "until=24h|--all|--volumes|prune" nix/modules/docker-prune.nix
grep -n "autoPrune" nix/modules/swarm.nix # EXPECT: only a comment, no enable=true
```
**EXPECTED:** old timer not-found; `ci-docker-prune.timer` enabled+active (daily); manual run below
80% prints the no-op line and removes nothing; module flags are `--filter until=24h` only (never
`--all`/`--volumes`); swarm.nix has no live autoPrune.
### PC2 — Local cache retained + authenticated (confirm)
**WHAT.** Daemon stays PAT-authenticated; `/var/lib/docker` local image store persists across
runs/teardowns/reboots; no code change (sops `dockerhub_auth` → `/root/.docker/config.json` in
`nix/modules/secrets.nix`, unchanged).
**HOW / EXPECTED:**
```sh
ssh cc-ci 'docker info 2>/dev/null | grep Username' # EXPECT: Username: nptest2
ssh cc-ci 'ls -l /root/.docker/config.json' # EXPECT: -> /run/secrets/rendered/docker-config.json (0600)
ssh cc-ci 'docker images -q | wc -l'                 # EXPECT: many recipe images retained (was 21 leaf images)
```
### PC3 — Deploy → teardown → redeploy reuses local layers (no re-download)
**WHAT.** A previously-pulled image is retained through teardown and a redeploy reuses local layers;
only an authenticated manifest check remains. Builder-proven with a real swarm deploy/teardown/
redeploy on `redis:7-alpine` (docker.io through the authenticated daemon — same pull path abra/swarm
use).
**HOW (Adversary, reproducible):**
```sh
ssh cc-ci 'bash -s' <<'PROOF'
IMG=redis:7-alpine; docker rmi -f "$IMG" >/dev/null 2>&1 || true
t0=$(date +%s%N); docker pull "$IMG" 2>&1 | grep -E "Pull complete|Downloaded|Already exists|up to date"; t1=$(date +%s%N)
echo COLD_MS=$(((t1-t0)/1000000))
docker service create --name pc3 --replicas 1 "$IMG" sleep 120 >/dev/null 2>&1; docker service ls --filter name=pc3 --format '{{.Replicas}}'
docker service rm pc3 >/dev/null 2>&1
echo retained: $(docker images redis:7-alpine --format '{{.ID}}')
t2=$(date +%s%N); docker pull "$IMG" 2>&1 | grep -E "Pull complete|Downloaded|Already exists|up to date"; t3=$(date +%s%N)
echo WARM_MS=$(((t3-t2)/1000000)); docker rmi -f "$IMG" >/dev/null 2>&1
PROOF
```
**EXPECTED:** COLD pull shows layer "Pull complete" lines (download) — Builder saw 6 layers,
COLD_MS≈5303; after `service rm` the image ID is still listed (retained); WARM pull shows
`Image is up to date` (no layer download), WARM_MS≈674 (≈8× faster, manifest-only). Confirms the
local store is the cache, survives teardown, and a redeploy needs no Docker-Hub layer download.
Optional fuller proof: a real recipe cycle
`RECIPE=custom-html-tiny PR=0 STAGES=install cc-ci-run runner/run_recipe_ci.py` run twice — the 2nd
deploy shows no image-layer download.
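To make the EXPECTED check mechanical, the `docker pull` output can be classified by its status line and the number of `Pull complete` layer lines. This is a sketch only: the helper name `classify_pull` is hypothetical, and it relies solely on the status strings quoted in the proof above.

```sh
#!/bin/sh
# classify_pull: reads `docker pull` output on stdin and reports cold vs warm
# plus how many layers were actually downloaded.
classify_pull() {
  awk '
    /Pull complete/          { layers++ }
    /Downloaded newer image/ { status = "cold" }
    /Image is up to date/    { status = "warm" }
    END { printf "%s layers_downloaded=%d\n", (status ? status : "unknown"), layers }
  '
}

# Sample input standing in for a warm pull:
printf 'Status: Image is up to date for redis:7-alpine\n' | classify_pull
```

A warm result with `layers_downloaded=0` matches the expectation that a redeploy needs only the manifest check.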
---
## DoD checklist (Builder view — Adversary owns the verdict in REVIEW-2pc.md)
- [x] **PC1** — autoPrune `--all` removed; surgical gated `ci-docker-prune` deployed; teardown keeps images.
- [x] **PC2** — daemon PAT-authenticated (nptest2); local store retained across rebuild.
- [x] **PC3** — deploy→teardown→redeploy reuses local layers (no re-download), measured; disk bounded
(31%) without `-af`. Documented (runbook/warm/DECISIONS/IDEAS).
## Not blocked. No standing blockers.