# REVIEW-2pc: Adversary verdicts for Phase 2pc (sane image-prune policy)

SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`. DoD = PC1 + PC2 + PC3,
each Adversary cold-verified here before Builder may write `## DONE` to STATUS-2pc.md.

**SCOPE CORRECTION (operator, 2026-05-29):** the registry pull-through cache (old PC2)
is **DROPPED / deferred to IDEAS**: single authenticated non-pruning host ⇒ Docker's own
local image store already IS the cache. Phase 2pc is now **prune-policy only**.
## Status: PASS @2026-05-29 (gate 2pc re-claim 9e73ebd) - PC1+PC2+PC3 cold-verified; F2pc-1 CLEARED

**Verdict: PASS.** Builder reconciled the git≠host drift (F2pc-1) via `b9bbd25` (rename
committed units `docker-prune`→`ci-docker-prune`; NixOS reserves `docker-prune`). Re-verified
cold:
- **git == deploy source**: `git show HEAD:nix/modules/docker-prune.nix` and `swarm.nix` are
  **byte-identical** to the host's `/root/cc-ci` copies (diff clean). Committed units now
  `systemd.services.ci-docker-prune` / `.timer` (`docker-prune.nix:56,67`) = what runs live.
- **live**: `ci-docker-prune.timer` enabled+active (daily 00:00); old `docker-prune.timer`
  `not-found`. PC1 no-op @<80% (`docker images` 18→18 unchanged). PC3 redis re-confirm: cold
  `Downloaded newer` → warm `Image is up to date` (local reuse, manifest-only).
- All PC1/PC2/PC3 substance from the prior pass still holds (below). A from-git rebuild now
  reproduces the verified system, and STATUS-2pc's `ci-docker-prune.timer` verify commands match.

**F2pc-1 → CLOSED** (Adversary, this verdict): git==host==`ci-docker-prune`, confirmed by
byte-diff + live unit state.

_Scope note on PC1 pressure branch:_ I verified the no-op (<80%) gate live and the ≥80% code
path by read: it runs `docker {container,image,builder} prune -f --filter until=24h`. Crucially
`image prune` **without `--all`** removes only dangling+old layers and **cannot** evict tagged
base/in-use images (docker contract), so the cardinal "keep the cache" property is structural, not
incidental. I did **not** fill the 64G disk to fire the ≥80% branch live (disproportionate); I
rely on that code-read + Builder probe-5 evidence (2.34 GB dangling reclaimed, tagged images
kept). The behavior I could break-test (no-op, teardown-keeps-images, bogus-tag-fails,
cold→warm reuse) is all GREEN.

---
### (superseded) FAIL @2026-05-29 (gate 2pc claim de6103d) - substance GREEN, git ≠ verified host

**Verdict: FAIL.** PC1/PC2/PC3 *behavior* is verified-GREEN on the live host, but the
**committed code does not match the deployed-and-"verified" artifact**, so the claim is not
reproducible from git (D8 contract violated). One blocking defect → **F2pc-1** below. Fix is
a one-shot reconciliation, not a redo.
### What I cold-verified live (all GREEN on host; substance is sound)

- **PC1 prune logic** (`nix/modules/docker-prune.nix`): triple-gated (≥80% `/`, no run-app
  stack `^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_`, no converging service), prunes
  `container|image|builder prune -f --filter until=24h` only, **never `--all`, never
  `--volumes`**. Ran the service live @ ~27–31% `/`: printed "keeping local image cache,
  nothing to do", `docker images` count **17→17 unchanged**. ✓
- **PC1 teardown keeps images**:
  `grep -rnE 'rmi|image rm|image prune|images -q' runner/ tests/conftest.py` → only comments,
  no image removal. Live: after `docker service rm` the
  redis image (487efc061638) **stayed present**. ✓
- **PC1 autoPrune removed**: committed `swarm.nix` no longer sets `autoPrune` (left default
  off); daemon `enable=true` only. A fresh rebuild creates no autoPrune unit. ✓
- **PC2 PAT-auth + retention**: `docker info` → `Username: nptest2`;
  `/root/.docker/config.json` → `/run/secrets/rendered/docker-config.json` (sops, symlink);
  `auths` has `https://index.docker.io/v1/`. **No registry mirrors** (cache correctly dropped). ✓
- **PC3 cold→teardown→warm** (live, redis:7-alpine, real daemon = abra/swarm pull path):
  COLD = 7 layers "Pull complete" / "Downloaded newer"; service up 1/1 → `service rm`;
  image **retained**; WARM re-pull = **"Image is up to date"** (no layer download,
  manifest-only). ✓
- **Break-it (cardinal rule)**: `docker pull redis:<bogus-tag>` → `manifest unknown` error.
  Retained store does **not** mask a broken/changed image. ✓
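
The triple gate and the prune invocations from the PC1 bullet above can be sketched roughly as follows. This is an illustration of the logic as described in this review, not the contents of `cc-ci-docker-prune`: the function names, parameters, and the idea of passing service names in as a list are all hypothetical. Only the 80% threshold, the stack-name regex, and the prune flags are taken from the text.

```python
import re

# Stack-name pattern quoted in this review for run-app stacks.
RUN_STACK_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_")

# Disk-pressure threshold (percent used on `/`) quoted in this review.
DISK_PRESSURE_PCT = 80


def should_prune(root_used_pct, service_names, converging_services):
    """Return True only when all three gates are open (hypothetical helper)."""
    if root_used_pct < DISK_PRESSURE_PCT:
        return False  # gate 1: no disk pressure, keep the local image cache
    if any(RUN_STACK_RE.match(name) for name in service_names):
        return False  # gate 2: a run-app stack is still deployed
    if converging_services:
        return False  # gate 3: at least one service is still converging
    return True


def prune_commands(until="24h"):
    """Build the three prune invocations; deliberately no --all, no --volumes."""
    return [
        ["docker", kind, "prune", "-f", "--filter", f"until={until}"]
        for kind in ("container", "image", "builder")
    ]
```

With these assumptions, `should_prune(31, [], [])` is `False`, which mirrors the live no-op observed at ~27–31% `/`.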
### Why FAIL anyway - F2pc-1 (blocking): committed code ≠ verified host

- origin/main HEAD **de6103d** (= the `claim(2pc)` commit) defines the units as
  `systemd.services.docker-prune` / `systemd.timers.docker-prune` (`nix/modules/docker-prune.nix:56,67`).
- The **live, "verified" host** runs **`ci-docker-prune.service` / `ci-docker-prune.timer`**
  (enabled+active, next daily 00:00), built from **uncommitted** source in `/root/cc-ci`
  (`/root/cc-ci` is not even a git repo; its module has `systemd.services.ci-docker-prune`).
- Consequences: (1) the artifact the Builder "deployed+verified" was **never committed**, so
  git does not reproduce the verified system (a D8/fresh rebuild yields `docker-prune.*`,
  a *different* unit name than what was verified); (2) **STATUS-2pc's own HOW-to-verify
  commands reference `ci-docker-prune.timer`**, which a from-git rebuild will report
  `not-found` → a cold verifier following STATUS against a git-built host gets a false FAIL.
- This is a reproducibility/integrity defect, not a behavioral one. The script body is the
  same (`cc-ci-docker-prune`); only the systemd unit wrapper name diverges.
- **To clear**: make git == the deployed host. Either commit the `ci-docker-prune` naming actually
  deployed (push `/root/cc-ci`'s `docker-prune.nix`), OR rename the module's units back to
  `docker-prune`, `nixos-rebuild switch`, and update STATUS-2pc verify commands to match.
  Then I re-verify `git rev` builds the exact `ci-docker-prune`/`docker-prune` units STATUS
  documents. (Also confirm the stale `docker-prune.service` [linked,ignored] leftover is
  harmless / GC'd on next rebuild.)

_Did NOT read JOURNAL-2pc before this verdict (anti-anchoring). Verdict formed from plan +
committed code + my own cold re-run on cc-ci._
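
The git-versus-host comparison that F2pc-1 hinges on can be sketched as below. This is a minimal sketch under stated assumptions, not the commands actually run for this verdict: `same_artifact` and `declared_units` are hypothetical helpers, and the inputs are assumed to be the committed blob (for example the output of `git show HEAD:nix/modules/docker-prune.nix`) and the bytes of the host's copy.

```python
import hashlib
import re


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 of a byte string."""
    return hashlib.sha256(data).hexdigest()


def same_artifact(committed: bytes, deployed: bytes) -> bool:
    """True only if the committed blob and the deployed file are byte-identical."""
    return sha256_hex(committed) == sha256_hex(deployed)


def declared_units(nix_source: str) -> set:
    """Collect names declared as systemd.services.<name> or systemd.timers.<name>."""
    pattern = r"systemd\.(?:services|timers)\.([A-Za-z0-9_-]+)"
    return set(re.findall(pattern, nix_source))
```

A mismatch in `declared_units` between the two sources is the specific symptom recorded in F2pc-1 (`docker-prune` in git, `ci-docker-prune` on the host); the byte comparison is the stricter check.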
## DoD (narrowed scope)

- **PC1: Conservative prune policy.** No reflexive `docker image prune -af`. NEVER prune
  during a deploy/test run. Keep base/in-use images. Prune only dangling + age-gated old
  layers, only under genuine disk pressure. Per-run teardown still removes the run's
  **volumes/secrets/services** (sacred) but **must NOT remove images.**
- **PC2: Local cache retained + authenticated (confirm).** Daemon stays PAT-authenticated
  for `docker.io`; local image store retained across runs, teardowns, reboots → repeat
  deploy reuses local layers (no re-download), at most an authenticated manifest check.
- **PC3: Verified + documented.** Adversary proof: deploy → teardown → redeploy does NOT
  re-download layers (via `docker` events/pull output / measured pull-time drop); normal run
  doesn't evict cached base images; disk bounded WITHOUT `-af`. docs/ notes policy;
  deviations in DECISIONS.md.
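
One way to turn the PC3 "pull output" evidence into a mechanical check is to classify captured `docker pull` output as cold or warm. The sketch below is illustrative only: `classify_pull` is a hypothetical helper, and it assumes the pull output has been captured as text. It keys on the status strings quoted elsewhere in this review ("Pull complete", "Downloaded newer", "Image is up to date").

```python
def classify_pull(output: str) -> str:
    """Classify captured `docker pull` text as 'cold', 'warm', or 'unknown'."""
    # Count per-layer completion lines; any of these means layers were downloaded.
    layers_pulled = sum(
        1 for line in output.splitlines() if line.rstrip().endswith("Pull complete")
    )
    if "Image is up to date" in output and layers_pulled == 0:
        return "warm"  # manifest check only, local layers reused
    if "Downloaded newer image" in output or layers_pulled > 0:
        return "cold"  # at least one layer came over the network
    return "unknown"  # e.g. an error such as `manifest unknown`
```

Returning `unknown` instead of guessing keeps a failed pull (the bogus-tag probe) from being mistaken for a warm hit.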
## Pre-claim baseline recon (read-only; NOT a verdict, just what "before" looks like)

- **autoPrune** (`nix/modules/swarm.nix:15-19`): `flags = ["--all" "--filter" "until=24h"]`,
  no `--volumes`. `--all` evicts *any* image unused for 24h → would drop warm base images
  between runs (exactly PC1's complaint). The destructive `docker image prune -af` cited in
  JOURNAL-2 (507, 690-693) was a **manual** operator action mid-deploy, NOT this systemd unit.
  → PC1 must (a) tighten autoPrune off `--all` toward dangling-only/age-gated, AND (b) ensure
  no `-af` exists in any harness/janitor/teardown code path.
- **Teardown image-removal grep target:** DECISIONS.md:708 documents a manual cleanup recipe
  ending `docker image prune -f`. Must confirm the *automated* per-run teardown
  (run_recipe_ci.py / harness) does NOT `docker rmi` / `image prune` the run's images.
- **No registry cache** exists (confirmed) and per scope correction none should be built.
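
The "no `--all`, no `-af`" audit described above can be sketched as a small flag check over prune command lines. This is an assumption-laden illustration, not part of the harness: `destructive_flags` is a hypothetical helper, and its short-flag heuristic is only meaningful for `docker ... prune` commands, where `-a` is the short form of `--all`.

```python
import shlex

# Long flags this review treats as destructive for the image cache.
DESTRUCTIVE_LONG = {"--all", "--volumes"}


def destructive_flags(command: str) -> set:
    """Return the destructive prune flags present in a docker prune command line."""
    found = set()
    for token in shlex.split(command):
        if token in DESTRUCTIVE_LONG:
            found.add(token)
        elif token.startswith("-") and not token.startswith("--") and "a" in token[1:]:
            # Short-flag cluster such as -a or -af implies --all.
            found.add("--all")
    return found
```

Applied to the baseline, the old autoPrune flags and the manual `docker image prune -af` both report `--all`, while the age-gated `prune -f --filter until=24h` form reports nothing.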
## Break-it probes to run once PC1 claimed (anti-anchoring checklist)

1. **Teardown must NOT remove images.** Deploy a recipe, capture `docker images` digest set,
   run the real teardown, re-check: the recipe's image layers must STILL be present locally.
2. **Redeploy reuses local layers (PC3 core).** After teardown, redeploy the SAME recipe and
   confirm via `docker events` / pull output there is NO layer download (only a manifest
   check, or fully local). Measure the pull-time delta vs a genuine cold pull.
3. **No mid-run prune.** Grep all code paths; confirm nothing prunes images while a
   deploy/test is active (the JOURNAL-2 landmine). autoPrune is daily/off-run only.
4. **Cache must NOT mask a broken image (cardinal rule).** A pinned version still resolves to
   the correct digest; a genuinely-new/changed digest still triggers a real pull. The
   retained store must not serve a stale image for a recipe that actually changed.
5. **Disk stays bounded without `-af`.** Confirm the surgical policy + disk-pressure trigger
   actually reclaims under pressure (don't trade rate-limit churn for a full disk).
6. **PAT auth intact + not leaked.** Daemon still authenticated to docker.io (under 200/6h);
   PAT not exposed in published logs / dashboard / world-readable config.
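
Probe 1 reduces to a set comparison of image IDs captured before and after teardown. The following is a minimal sketch, assuming the listings come from something like `docker images -q --no-trunc` (one ID per line); both helper names are hypothetical.

```python
def parse_image_ids(listing: str) -> set:
    """Parse one image ID per line into a set, ignoring blank lines."""
    return {line.strip() for line in listing.splitlines() if line.strip()}


def missing_after_teardown(before: set, after: set) -> set:
    """Image IDs that were present before teardown but are gone afterwards."""
    return before - after
```

The probe passes when `missing_after_teardown` returns an empty set; images that appear only in the "after" listing are not a failure, since the probe only asks that nothing was removed.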