# REVIEW-2pc — Adversary verdicts for Phase 2pc (sane image-prune policy)
SSOT: `/srv/cc-ci/cc-ci-plan/plan-phase2pc-image-cache.md`. DoD = PC1 + PC2 + PC3,
each Adversary cold-verified here before Builder may write `## DONE` to STATUS-2pc.md.
**SCOPE CORRECTION (operator, 2026-05-29):** the registry pull-through cache (old PC2)
is **DROPPED / deferred to IDEAS** — single authenticated non-pruning host ⇒ Docker's own
local image store already IS the cache. Phase 2pc is now **prune-policy only**.
## Status: PASS @2026-05-29 (gate 2pc re-claim 9e73ebd) — PC1+PC2+PC3 cold-verified; F2pc-1 CLEARED
**Verdict: PASS.** Builder reconciled the git≠host drift (F2pc-1) via `b9bbd25` (rename
committed units `docker-prune` → `ci-docker-prune`; NixOS reserves `docker-prune`). Re-verified
cold:
- **git == deploy source**: `git show HEAD:nix/modules/docker-prune.nix` and `swarm.nix` are
**byte-identical** to the host's `/root/cc-ci` copies (diff clean). Committed units now
`systemd.services.ci-docker-prune` / `.timer` (`docker-prune.nix:56,67`) = what runs live.
- **live**: `ci-docker-prune.timer` enabled+active (daily 00:00); old `docker-prune.timer`
`not-found`. PC1 no-op @<80% (`docker images` 18→18 unchanged). PC3 redis re-confirm: cold →
`Downloaded newer`, warm → `Image is up to date` (local reuse, manifest-only).
- All PC1/PC2/PC3 substance from the prior pass still holds (below). A from-git rebuild now
reproduces the verified system, and STATUS-2pc's `ci-docker-prune.timer` verify commands match.
**F2pc-1 → CLOSED** (Adversary, this verdict): git==host==`ci-docker-prune`, confirmed by
byte-diff + live unit state.
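
The cold/warm distinction used for the PC3 re-confirm can be checked mechanically from captured `docker pull` output. The following is a minimal sketch, not the harness's actual code: the function name `classify_pull` is hypothetical, and it assumes the output contains the status phrases quoted in this review ("Downloaded newer", "Image is up to date", "Pull complete").

```python
import re

# Status phrases as quoted in this review; matched loosely on purpose.
_COLD = re.compile(r"Downloaded newer image for\s+(\S+)")
_WARM = re.compile(r"Image is up to date for\s+(\S+)")


def classify_pull(output: str) -> str:
    """Return 'cold', 'warm', or 'unknown' for captured `docker pull` output."""
    # Any completed layer download means bytes moved, whatever the status says.
    layers_pulled = output.count("Pull complete")
    if _COLD.search(output) or layers_pulled:
        return "cold"
    if _WARM.search(output):
        return "warm"
    # Errors such as "manifest unknown" land here rather than passing as warm.
    return "unknown"
```

Treating anything unrecognized as `unknown` rather than `warm` keeps a failed pull from being counted as local reuse.
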

_Scope note on PC1 pressure branch:_ I verified the no-op (<80%) gate live and the ≥80% code
path by code-read: it runs `docker {container,image,builder} prune -f --filter until=24h`.
Crucially, `image prune` **without `--all`** removes only dangling+old layers and **cannot**
evict tagged base/in-use images (docker contract), so the cardinal "keep the cache" property is
structural, not incidental. I did **not** fill the 64G disk to fire the ≥80% branch live
(disproportionate); I rely on that code-read + Builder probe-5 evidence (2.34 GB dangling
reclaimed, tagged images kept). The behavior I could break-test (no-op, teardown-keeps-images,
bogus-tag-fails, cold→warm reuse) is all GREEN.

---
### (superseded) FAIL @2026-05-29 (gate 2pc claim de6103d) — substance GREEN, git ≠ verified host
**Verdict: FAIL.** PC1/PC2/PC3 *behavior* is verified-GREEN on the live host, but the
**committed code does not match the deployed-and-"verified" artifact**, so the claim is not
reproducible from git (D8 contract violated). One blocking defect, **F2pc-1**, is described
below. The fix is a one-shot reconciliation, not a redo.
### What I cold-verified live (all GREEN on host — substance is sound)
- **PC1 prune logic** (`nix/modules/docker-prune.nix`): triple-gated (≥80% `/`, no run-app
stack `^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_`, no converging service), prunes
`container|image|builder prune -f --filter until=24h` only; **never `--all`, never
`--volumes`**. Ran the service live @ ~27-31% `/`: printed "keeping local image cache,
nothing to do", `docker images` count **17→17 unchanged**.
- **PC1 teardown keeps images**: `grep -rnE 'rmi|image rm|image prune|images -q' runner/
tests/conftest.py` → only comments, no image removal. Live: after `docker service rm` the
redis image (487efc061638) **stayed present**. ✓
- **PC1 autoPrune removed**: committed `swarm.nix` no longer sets `autoPrune` (left default
off); daemon `enable=true` only. A fresh rebuild creates no autoPrune unit. ✓
- **PC2 PAT-auth + retention**: `docker info` → `Username: nptest2`;
`/root/.docker/config.json` → `/run/secrets/rendered/docker-config.json` (sops, symlink);
`auths` has `https://index.docker.io/v1/`. **No registry mirrors** (cache correctly dropped). ✓
- **PC3 cold→teardown→warm** (live, redis:7-alpine, real daemon = abra/swarm pull path):
COLD = 7 layers "Pull complete" / "Downloaded newer"; service up 1/1 → `service rm`;
image **retained**; WARM re-pull = **"Image is up to date"** (no layer download,
manifest-only). ✓
- **Break-it (cardinal rule)**: `docker pull redis:<bogus-tag>` → `manifest unknown` error.
Retained store does **not** mask a broken/changed image. ✓
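
The triple gate described for PC1 can be restated as a small decision function. This is a sketch in Python rather than the module's actual shell script: `should_prune`, `prune_commands` and the `names` parameter are hypothetical, and the review does not say which object names (services, containers or networks) the real script matches the pattern against.

```python
import re

# Run-app stack pattern, copied from the review text.
RUN_STACK = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}_ci_commoninternet_net_")
PRESSURE_PCT = 80  # prune only at or above this usage of `/`


def should_prune(root_used_pct: float, names: list[str], converging: bool) -> bool:
    """All three gates must pass before anything is pruned."""
    if root_used_pct < PRESSURE_PCT:
        return False  # no disk pressure: keep the local image cache
    if any(RUN_STACK.match(n) for n in names):
        return False  # a run-app stack is live: never prune mid-run
    return not converging  # a converging service also blocks the prune


def prune_commands() -> list[list[str]]:
    """The three age-gated prunes; deliberately no --all and no --volumes."""
    return [
        ["docker", kind, "prune", "-f", "--filter", "until=24h"]
        for kind in ("container", "image", "builder")
    ]
```

With the live reading reported above (well under 80% of `/`), `should_prune` returns `False` on the first gate, which corresponds to the observed "nothing to do" path.
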
### Why FAIL anyway — F2pc-1 (blocking): committed code ≠ verified host
- origin/main HEAD **de6103d** (= the `claim(2pc)` commit) defines the units as
`systemd.services.docker-prune` / `systemd.timers.docker-prune` (`nix/modules/docker-prune.nix:56,67`).
- The **live, "verified" host** runs **`ci-docker-prune.service` / `ci-docker-prune.timer`**
(enabled+active, next daily 00:00), built from **uncommitted** source in `/root/cc-ci`
(`/root/cc-ci` is not even a git repo; its module has `systemd.services.ci-docker-prune`).
- Consequences: (1) the artifact the Builder "deployed+verified" was **never committed** —
git does not reproduce the verified system (a D8/fresh rebuild yields `docker-prune.*`,
a *different* unit name than what was verified); (2) **STATUS-2pc's own HOW-to-verify
commands reference `ci-docker-prune.timer`**, which a from-git rebuild will report
`not-found` → a cold verifier following STATUS against a git-built host gets a false FAIL.
- This is a reproducibility/integrity defect, not a behavioral one. The script body is the
same (`cc-ci-docker-prune`); only the systemd unit wrapper name diverges.
- **To clear**: make git == the deployed host — commit the `ci-docker-prune` naming actually
deployed (push `/root/cc-ci`'s `docker-prune.nix`), OR rename the module's units back to
`docker-prune`, `nixos-rebuild switch`, and update STATUS-2pc verify commands to match.
Then I re-verify `git rev` builds the exact `ci-docker-prune`/`docker-prune` units STATUS
documents. (Also confirm the stale `docker-prune.service` [linked,ignored] leftover is
harmless / GC'd on next rebuild.)
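
The git-versus-host comparison that clears F2pc-1 could be scripted roughly as follows. This is a sketch under stated assumptions: `git_blob` and `drifted` are hypothetical helpers, and the location of a git checkout to read committed files from is not given in this review.

```python
import subprocess


def git_blob(repo: str, rev: str, path: str) -> bytes:
    """Bytes of `path` as committed at `rev` (what `git show rev:path` prints)."""
    return subprocess.run(
        ["git", "-C", repo, "show", f"{rev}:{path}"],
        check=True, capture_output=True,
    ).stdout


def drifted(committed: dict[str, bytes], deployed: dict[str, bytes]) -> list[str]:
    """Paths whose deployed bytes differ from the commit or are missing on the host."""
    return sorted(
        path for path, blob in committed.items()
        if deployed.get(path) != blob
    )
```

A caller would fill `committed` via `git_blob(...)` for `nix/modules/docker-prune.nix` and `swarm.nix`, fill `deployed` by reading the same relative paths under `/root/cc-ci`, and treat an empty result as "git == deploy source".
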
_Did NOT read JOURNAL-2pc before this verdict (anti-anchoring). Verdict formed from plan +
committed code + my own cold re-run on cc-ci._
## DoD (narrowed scope)
- **PC1 — Conservative prune policy.** No reflexive `docker image prune -af`. NEVER prune
during a deploy/test run. Keep base/in-use images. Prune only dangling + age-gated old
layers, only under genuine disk pressure. Per-run teardown still removes the run's
**volumes/secrets/services** (sacred) but **must NOT remove images.**
- **PC2 — Local cache retained + authenticated (confirm).** Daemon stays PAT-authenticated
for `docker.io`; local image store retained across runs, teardowns, reboots → repeat
deploy reuses local layers (no re-download), at most an authenticated manifest check.
- **PC3 — Verified + documented.** Adversary proof: deploy → teardown → redeploy does NOT
re-download layers (via `docker` events/pull output / measured pull-time drop); normal run
doesn't evict cached base images; disk bounded WITHOUT `-af`. docs/ notes policy;
deviations in DECISIONS.md.
## Pre-claim baseline recon (read-only; NOT a verdict — just what "before" looks like)
- **autoPrune** (`nix/modules/swarm.nix:15-19`): `flags = ["--all" "--filter" "until=24h"]`,
no `--volumes`. `--all` evicts *any* image unused for 24h → would drop warm base images
between runs (exactly PC1's complaint). The destructive `docker image prune -af` cited in
JOURNAL-2 (507, 690-693) was a **manual** operator action mid-deploy, NOT this systemd unit.
→ PC1 must (a) tighten autoPrune off `--all` toward dangling-only/age-gated, AND (b) ensure
no `-af` exists in any harness/janitor/teardown code path.
- **Teardown image-removal grep target:** DECISIONS.md:708 documents a manual cleanup recipe
ending `docker image prune -f`. Must confirm the *automated* per-run teardown
(run_recipe_ci.py / harness) does NOT `docker rmi` / `image prune` the run's images.
- **No registry cache** exists (confirmed) and per scope correction none should be built.
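
The teardown grep target above can be approximated in Python when comment-only lines should be filtered out automatically. A minimal sketch, assuming Python-style `#` comments in the scanned files; `image_removal_hits` is a hypothetical name, and the alternation is the one used by the `grep -rnE` check elsewhere in this review.

```python
import re

# Same alternation as the grep used in this review.
IMAGE_REMOVAL = re.compile(r"rmi|image rm|image prune|images -q")


def image_removal_hits(source: str) -> list[tuple[int, str]]:
    """(line number, text) for non-comment lines that mention image removal."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # comment-only line: documents a command, does not run it
        if IMAGE_REMOVAL.search(stripped):
            hits.append((lineno, stripped))
    return hits
```

The bare `rmi` alternative also matches unrelated words such as "permission", and trailing inline comments are not stripped, so any hit still needs a human read before it counts as a finding.
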
## Break-it probes to run once PC1 claimed (anti-anchoring checklist)
1. **Teardown must NOT remove images.** Deploy a recipe, capture `docker images` digest set,
run the real teardown, re-check: the recipe's image layers must STILL be present locally.
2. **Redeploy reuses local layers (PC3 core).** After teardown, redeploy the SAME recipe and
confirm via `docker events` / pull output there is NO layer download (only a manifest
check, or fully local). Measure the pull-time delta vs a genuine cold pull.
3. **No mid-run prune.** Grep all code paths; confirm nothing prunes images while a
deploy/test is active (the JOURNAL-2 landmine). autoPrune is daily/off-run only.
4. **Cache must NOT mask a broken image (cardinal rule).** A pinned version still resolves to
the correct digest; a genuinely-new/changed digest still triggers a real pull — the
retained store must not serve a stale image for a recipe that actually changed.
5. **Disk stays bounded without `-af`.** Confirm the surgical policy + disk-pressure trigger
actually reclaims under pressure (don't trade rate-limit churn for a full disk).
6. **PAT auth intact + not leaked.** Daemon still authenticated to docker.io (under 200/6h);
PAT not exposed in published logs / dashboard / world-readable config.
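
Probe 1 reduces to a set difference over the local image store taken before and after teardown. The sketch below is one way to do that, with two assumptions: `local_image_ids` and `evicted` are hypothetical helper names, and full image IDs stand in for the "digest set" the probe mentions.

```python
import subprocess


def local_image_ids() -> set[str]:
    """Full IDs of all images currently in the local store."""
    out = subprocess.run(
        ["docker", "images", "-q", "--no-trunc"],
        check=True, capture_output=True, text=True,
    ).stdout
    return set(out.split())


def evicted(before: set[str], after: set[str]) -> set[str]:
    """Images that were present before teardown and are gone afterwards."""
    return before - after
```

Calling `local_image_ids()` once after the deploy and once after the real teardown, then asserting that `evicted(before, after)` is empty, gives a pass/fail signal; images that appear only in `after` are ignored by design.
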