Adversary FAILed claim de6103d because that commit still named the units docker-prune while the host runs ci-docker-prune; the rename was committed in b9bbd25 (its endorsed fix), which is in the current pushed HEAD. Git now defines the same ci-docker-prune units that STATUS documents and the host runs. Behavior was already cold-verified GREEN. The inert NixOS-builtin docker-prune.service (inactive/linked, no timer) is unchanged by this and reproduces identically from git.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# JOURNAL — Phase 2pc (sane image-prune policy)

Append-only reasoning log. Facts/verification for the Adversary live in STATUS-2pc.md.

## 2026-05-29 — Orientation + scope correction

Read SSOT `plan-phase2pc-image-cache.md` + plan.md §6.1/§7/§9. Operator issued a **scope
correction** mid-orientation: **drop the registry:2 pull-through cache.** Rationale (operator):
single host → Docker's own local image store already IS the cache; re-deploys reuse local layers
with no re-download; the daemon is PAT-authenticated, so residual manifest checks stay under the
200/6h rate limit. The churn was caused by **over-pruning** (`docker image prune -af` wiping the
store), not a missing cache. A separate registry only pays off with multi-node or separately
survivable storage, which we do not have.
**I had not yet written any registry code** (still orienting) → nothing to revert.

Phase 2pc is now **PC1 (prune policy) + PC2/PC3 (confirm + verify local-store retention/auth).**

### Findings from orientation (why the fix is one module)

- The ONLY automated image pruner in the whole repo is
  `virtualisation.docker.autoPrune = { flags = ["--all" "--filter" "until=24h"]; }` in
  `nix/modules/swarm.nix`. NixOS renders this as `docker system prune --force --all --filter until=24h`
  daily. `--all` removes every image **not used by a running container** — between runs there are no
  test apps running, so it evicts the cached recipe base images → cold re-pull on the next run. That
  is exactly the prune→re-pull→rate-limit churn documented in JOURNAL-2 (lines 507/542/690-693).
- `runner/harness/lifecycle.py::teardown_app` removes services (abra undeploy / `docker stack rm`),
  volumes, secrets, and the `.env` — and **no images** (`grep` for `rmi`/`image rm`/`image prune` in
  `runner/` + `tests/conftest.py` is empty). So PC1's "teardown must NOT remove images" already holds.
- `janitor`, `warm_reconcile.py`, `nightly-sweep.nix`, `drone*.nix`, `.drone.yml` — none prune images.
- Daemon is already PAT-authenticated: `docker info` → `Username: nptest2`; sops `dockerhub_auth`
  (base64 `nptest2:<PAT>`) → `sops.templates."docker-config.json"` → `/root/.docker/config.json`
  (`nix/modules/secrets.nix`). PC2 needs no change — confirm + document.
- Disk on cc-ci: `/` is 64G, **19G used (31%), 43G free** — bounded; aggressive `--all` is
  unnecessary, which is the whole premise.

### PC1 design

Replace `autoPrune` with a dedicated `nix/modules/docker-prune.nix`: a daily `systemd.timer` +
oneshot `systemd.service` running a surgical, **triple-gated** prune:

1. **Disk-pressure gate** — do nothing unless `/` usage ≥ 80% (Docker's local store IS our cache;
   keep it warm; reclaim only under genuine pressure).
2. **No-run gate** — skip if any run-app stack (`<=4char>-<6hex>_ci_commoninternet_net_*`) is live
   (mid-pull layers can look prunable; "never prune mid-run").
3. **No-converge gate** — skip if any swarm service has unmet replicas (a deploy/pull in flight,
   incl. infra warm redeploys).

When all gates pass: `docker {container,image,builder} prune -f --filter until=24h` — dangling +
age-gated only. NEVER `--all` (keeps tagged base/in-use images), NEVER `--volumes` (warm canonical
data, per swarm.nix's existing comment).

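The three gates and the final prune could be sketched in shell roughly as follows. This is a minimal illustration, not the contents of `docker-prune.nix`: the helper names (`disk_pressure`, `run_stack_live`, `unconverged`, `prune_if_safe`), the log messages, and the regex used for `<=4char>-<6hex>_ci_commoninternet_net_*` are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, messages, and the stack regex are assumptions.

THRESHOLD=80  # prune only when / usage is at or above this percentage
# Assumed reading of "<=4char>-<6hex>_ci_commoninternet_net_*".
RUN_STACK_RE='^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_'

# Gate 1: true when usage percentage $1 has reached threshold $2.
disk_pressure() { [ "$1" -ge "$2" ]; }

# Gate 2: reads service names on stdin; true if any belongs to a run-app stack.
run_stack_live() { grep -E "$RUN_STACK_RE" >/dev/null; }

# Gate 3: reads "running/desired" replica strings on stdin; true if any differ.
unconverged() {
  awk -F/ '{ split($2, want, " "); if ($1 + 0 != want[1] + 0) bad = 1 }
           END { if (bad) exit 0; exit 1 }'
}

prune_if_safe() {
  local pct
  pct=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
  if ! disk_pressure "$pct" "$THRESHOLD"; then
    echo "/ at ${pct}% (< ${THRESHOLD}%), keeping local image cache"
    return 0
  fi
  if docker service ls --format '{{.Name}}' | run_stack_live; then
    echo "run-app stack live, skipping prune"
    return 0
  fi
  if docker service ls --format '{{.Replicas}}' | unconverged; then
    echo "service replicas unmet, skipping prune"
    return 0
  fi
  # Dangling + age-gated only: no --all, no --volumes.
  docker container prune -f --filter until=24h
  docker image prune -f --filter until=24h
  docker builder prune -f --filter until=24h
}
```

Keeping each gate as a small function that reads its input from arguments or stdin means the gating logic can be exercised without a Docker daemon; only `prune_if_safe` touches the host.
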
## 2026-05-29 — Implemented + deployed + verified on cc-ci

**Implementation.** `nix/modules/docker-prune.nix` (NEW) + `swarm.nix` (dropped autoPrune block) +
`configuration.nix` import. Unit renamed `docker-prune` → **`ci-docker-prune`** because the NixOS
docker module reserves `systemd.services.docker-prune` (build conflict caught by `nixos-rebuild
build`: "conflicting definition values for systemd.services.docker-prune.description"). Renamed,
rebuilt clean.

**Deploy.** Synced the 3 changed nix files to `/root/cc-ci` (tar over ssh; isolated change — host
tree otherwise unchanged), `nixos-rebuild build` (clean, shellcheck on the writeShellApplication
passed), then `systemd-run --unit=ccci-sw ... nixos-rebuild switch path:/root/cc-ci#cc-ci`. Switch
finished (22.5s CPU), `systemctl is-system-running` → `running`.

**Verification (real host).**

- Old NixOS `docker-prune.timer` → `is-enabled` = **not-found** (autoPrune gone). `ci-docker-prune.timer`
  → enabled + active; `list-timers` NEXT = Sat 2026-05-30 00:00 UTC (daily).
- Manual `systemctl start ci-docker-prune.service` at `/`=31%: log →
  `docker-prune: / at 31% (< 80%) — keeping local image cache, nothing to do`. No images removed
  (21 → 21). Gate works.
- PC2: `docker info | grep Username` → `nptest2` (PAT auth retained after rebuild). `/var/lib/docker`
  persistent (21 recipe images retained across the rebuild).
- PC3 layer-reuse proof (real swarm deploy→teardown→redeploy, redis:7-alpine, docker.io via authed daemon):

  ```
  COLD pull: 897d... Already exists; c14c.. f546.. a300.. 941e.. 4f4f.. 677c.. Pull complete (6 downloaded)
  Status: Downloaded newer image for redis:7-alpine COLD_PULL_MS=5303
  service create pc3b -> 1/1
  service rm pc3b -> retained_after_teardown: redis:7-alpine 487efc061638 (image REMAINS)
  WARM pull: Status: Image is up to date for redis:7-alpine WARM_PULL_MS=674 (no bytes)
  redeploy create pc3b -> redeploy_ok (reused local layers)
  ```

  Cold 5303ms (6 layer downloads) → warm 674ms (authenticated manifest check only, 0 layers
  re-downloaded). The alpine base layer `897d...` showed "Already exists" even on the cold pull =
  cross-image base-layer reuse, a bonus cache win. Teardown (`service rm`) retained the image —
  matches `teardown_app` (no rmi).

**Docs/decisions.** `docs/runbook.md` (new "Image cache & prune policy" + updated rate-limit note),
`docs/warm.md` (autoPrune→ci-docker-prune), `DECISIONS.md` (Phase-2pc entry), `cc-ci-plan/IDEAS.md`
(deferred registry cache + revisit trigger). Gate claimed.

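The journal does not record how `COLD_PULL_MS` and `WARM_PULL_MS` in the PC3 proof were measured. One plausible way to collect such numbers is a small timing wrapper like the one below; `time_ms` and `IMAGE` are hypothetical names, and forcing a cold pull with `docker image rm` is an assumption about the procedure rather than a description of it.

```shell
#!/usr/bin/env bash
# Sketch only: time_ms and IMAGE are illustrative names, not from the harness.

# Run a command and print its wall-clock duration in milliseconds.
# Relies on GNU date's %N (nanoseconds).
time_ms() {
  local start end
  start=$(date +%s%N)
  "$@" >/dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# Example use on a host with Docker (commented so the block has no side effects):
#   IMAGE=redis:7-alpine
#   docker image rm "$IMAGE"                          # assumed way to force a cold pull
#   echo "COLD_PULL_MS=$(time_ms docker pull "$IMAGE")"
#   echo "WARM_PULL_MS=$(time_ms docker pull "$IMAGE")"
```

A large gap between the two values, with the second pull reporting "Image is up to date", is the signal that local layers were reused.
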
## 2026-05-29 — Probe-5 evidence: surgical prune reclaims, keeps tagged/recent

Ran the exact active-path command the gated unit uses (`docker image prune -f --filter until=24h`
+ container/builder variants) on the host to demonstrate surgical reclaim (the daily timer only
reaches this under ≥80% disk, but the command's effect is the same):

- all images 23→17, dangling 10→**4** (the 4 remaining are <24h old — the `until=24h` age gate kept
  them), **2.341 GB reclaimed**, disk 31%→27% (19G→17G used).
- ALL tagged/in-use images survived (keycloak:26.6.2, mariadb:12.2, nginx:1.30.0, redis:8.6.3, …) —
  no `--all`, so nothing tagged or container-referenced was touched.

Confirms: disk stays bounded WITHOUT `-af`; the policy reclaims real space from old orphaned layers
while keeping the warm cache intact.

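The "all tagged images survived" check can be made mechanical by snapshotting the tagged image list before and after the prune and diffing the two. The sketch below is one way to do that; `tagged_images`, `removed_tagged`, and the snapshot file names are hypothetical and not part of the repo.

```shell
#!/usr/bin/env bash
# Sketch only: function and file names are illustrative.

# Print every tagged image as repo:tag, skipping dangling "<none>" entries.
tagged_images() {
  docker images --format '{{.Repository}}:{{.Tag}}' | grep -v '<none>' | LC_ALL=C sort -u
}

# Print tagged images listed in snapshot $1 that are missing from snapshot $2.
# Both snapshots must be sorted, which tagged_images guarantees.
removed_tagged() { LC_ALL=C comm -23 "$1" "$2"; }

# Example use around a prune (commented so the block has no side effects):
#   tagged_images > before.txt
#   docker image prune -f --filter until=24h
#   tagged_images > after.txt
#   removed_tagged before.txt after.txt   # no output: every tagged image survived
```
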
## 2026-05-29 — F2pc-1 (committed≠host) resolution + claim discipline

Adversary FAILed gate 2pc on F2pc-1: at claim commit `de6103d` the committed `docker-prune.nix` still
named units `docker-prune` while the verified host runs `ci-docker-prune` → git wouldn't reproduce
the verified system (D8). Root cause: I renamed the units locally (sed) + synced to host + verified,
but the rename rode in a SEPARATE commit (`b9bbd25`) pushed AFTER the `claim(` commit — and the
Adversary cold-verified the claim commit's tree. Behavior was GREEN; only the artifact lagged.

`b9bbd25` already committed the rename (git == host == ci-docker-prune), which is the Adversary's own
endorsed fix. Confirmed current HEAD: `grep systemd.(services|timers)` → ci-docker-prune; host module
matches; host runs ci-docker-prune.timer enabled+active; builtin docker-prune.service inactive/linked
(inert NixOS default, never triggered with autoPrune off). Re-claimed.

**Lesson (now a standing rule, orchestrator):** before ANY gate claim, `git status` must be clean —
everything committed AND pushed — because the Adversary cold-verifies from a fresh clone. A fix built
locally but uncommitted (or trailing the claim commit) is a guaranteed cold-build mismatch. The claim
commit must be the LAST thing, with the verified artifact already in it.
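The standing rule lends itself to a scripted pre-claim check. The following is a minimal sketch under the assumption that the branch has an upstream configured; `ready_to_claim` and `preclaim_check` are hypothetical names and no such script is described in the journal.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative.

# ready_to_claim PORCELAIN_STATUS LOCAL_SHA UPSTREAM_SHA
# Succeeds only when the tree is clean and HEAD equals the pushed upstream commit.
ready_to_claim() {
  if [ -n "$1" ]; then
    echo "uncommitted changes present" >&2
    return 1
  fi
  if [ "$2" != "$3" ]; then
    echo "HEAD $2 differs from upstream $3 (unpushed or stale)" >&2
    return 1
  fi
}

# Gather the three inputs from the current repository and evaluate them.
preclaim_check() {
  git fetch --quiet || return 1
  ready_to_claim "$(git status --porcelain)" \
                 "$(git rev-parse HEAD)" \
                 "$(git rev-parse '@{u}')"
}
```

Running `preclaim_check` immediately before creating the claim commit's push would catch both failure modes named above: a fix that is still uncommitted and a fix that sits in a commit the remote does not yet have.
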