claim(2pc): PC1 conservative prune deployed+verified; PC2/PC3 local-store cache confirmed
ci-docker-prune (gated surgical prune) live on cc-ci: old autoPrune --all gone, new timer enabled (daily), no-ops below 80% disk keeping the local image cache, never --all/--volumes. Daemon stays PAT-authenticated (nptest2); /var/lib/docker retained across rebuild.

PC3 proof: redis:7-alpine deploy->teardown(service rm, image retained)->redeploy = "Image is up to date", no layer re-download (cold 5303ms -> warm 674ms).

Docs: runbook "Image cache & prune policy", warm.md, DECISIONS Phase-2pc, IDEAS (registry pull-through cache deferred + revisit trigger).

Gate 2pc CLAIMED, awaiting Adversary cold-verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -24,10 +24,10 @@ curl -s -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 \
 convergence `TIMEOUT` (default 300s). Bump `TIMEOUT` via the recipe's `recipe_meta.py` `EXTRA_ENV`
 (lasuite-docs uses 900). Verify the stack converges manually: `docker stack services <stack>`.
 - **`toomanyrequests: unauthenticated pull rate limit`** (task Rejected "No such image"): Docker Hub
-anonymous rate limit — the A1 registry-creds finding. Provide Docker Hub creds (sops `secrets/`,
-wire into the docker daemon). Do **not** `docker image prune -af` mid-breadth — it evicts cached
-images and forces re-pulls that hit the limit. Check disk first: `df -h /` (heavy recipes need
-headroom; prune only `dangling` between runs or rely on the daily autoprune).
+anonymous rate limit. The daemon is now PAT-authenticated (sops `dockerhub_auth` →
+`/root/.docker/config.json`; `docker info` Username=nptest2; 200/6h per-account). Do **not**
+`docker image prune -af` — it evicts cached base/in-use images and forces re-pulls that burn the
+limit. See **Image cache & prune policy** below. Check disk first: `df -h /`.
 - **`authentication required: Unauthorized` fetching recipe tags:** an abra command tried to fetch
 from the private mirror origin. All recipe-touching harness calls pass `-C -o` (chaos+offline);
 `recipe_versions`/upgrade use the upstream tags fetched read-only at clone time. If you see this,
@@ -54,6 +54,31 @@ abra app volume remove "$D" -f -n; abra app secret remove "$D" --all -n; abra ap
 ```
 Confirm clean: `docker service ls | grep <prefix>` returns nothing.
 
+## Image cache & prune policy
+
+On this **single host, Docker's own local image store IS the cache** — a pulled image stays, and
+re-deploys (cold tests, warm canonical, reboots) reuse the local layers with no re-download; the
+daemon is PAT-authenticated so a warm redeploy makes at most one authenticated manifest check.
+Teardown removes the run's services/volumes/secrets/.env but **never images** — so the next deploy
+of the same recipe is local. (No separate `registry:2` pull-through cache: it only pays off
+multi-node / separate-survivable storage, neither of which we have — see DECISIONS Phase-2pc.)
+
+Pruning is the **`ci-docker-prune`** unit (`nix/modules/docker-prune.nix`), a daily timer that is
+**surgical and triple-gated** — it does **nothing** unless ALL hold: (1) `/` usage ≥ 80% (genuine
+disk pressure), (2) no run-app stack live (never prune mid-run), (3) no swarm service converging
+(no deploy/pull in flight). When it does run it prunes only **dangling images + stopped containers +
+dangling build cache, age-gated `until=24h`** — **never `--all`** (keeps tagged base/in-use images),
+**never `--volumes`** (warm canonical data). The old `virtualisation.docker.autoPrune --all` was
+removed — its daily `--all` evicted cached recipe base images → cold re-pull → Hub rate-limit churn.
+
+```sh
+ssh cc-ci 'systemctl list-timers ci-docker-prune.timer --no-pager; \
+systemctl start ci-docker-prune.service; \
+journalctl -u ci-docker-prune.service -n 3 --no-pager' # below 80% -> no-op, keeps cache
+```
+Reclaim manually under real pressure (still surgical, never `-af`):
+`ssh cc-ci 'docker image prune -f --filter until=24h'` (dangling only).
+
 ## Re-running / triggering by hand
 
 - Re-comment `!testme` on the PR (distinct comment id → re-runs; deduped per comment).
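Reviewer note: the triple gate described in the "Image cache & prune policy" hunk above can be sketched in plain shell. This is an illustration only, not the contents of `nix/modules/docker-prune.nix` (which this diff does not show). The function names, the `PRUNE_THRESHOLD` variable, and the stack-name prefix argument are all hypothetical; the only assumption about the real unit is the gate order given in the prose.

```sh
#!/bin/sh
# Sketch of the triple gate. All names here are illustrative assumptions,
# not taken from the real ci-docker-prune unit.

# Decide whether a prune may run.
#   $1 = root filesystem usage in percent (integer)
#   $2 = number of live run-app stacks
#   $3 = number of swarm services still converging
# Prints "prune" or "skip: <reason>"; returns 0 only when every gate passes.
prune_decision() {
    usage=$1
    live_stacks=$2
    converging=$3
    threshold=${PRUNE_THRESHOLD:-80}   # hypothetical override; 80 is the documented gate

    if [ "$usage" -lt "$threshold" ]; then
        echo "skip: disk ${usage}% < ${threshold}%"
        return 1
    fi
    if [ "$live_stacks" -gt 0 ]; then
        echo "skip: ${live_stacks} run-app stack(s) live"
        return 1
    fi
    if [ "$converging" -gt 0 ]; then
        echo "skip: ${converging} service(s) converging"
        return 1
    fi
    echo "prune"
}

# Root filesystem usage as a bare integer (POSIX `df -P`, fifth column).
root_usage_pct() {
    df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Count stacks whose name starts with the given prefix.
# The prefix convention is an assumption; pass whatever marks a run-app stack.
live_run_stacks() {
    docker stack ls --format '{{.Name}}' | grep -c "^$1"
}

# Count services whose running replica count differs from the desired count,
# parsed from the "running/desired" Replicas column.
converging_services() {
    docker service ls --format '{{.Replicas}}' |
        awk -F/ '{ split($2, want, " "); if ($1 != want[1]) n++ } END { print n + 0 }'
}

# The surgical prune itself: dangling/stopped only, age-gated, no --all, no --volumes.
surgical_prune() {
    docker image prune -f --filter until=24h
    docker container prune -f --filter until=24h
    docker builder prune -f --filter until=24h
}
```

A caller could then combine them as `prune_decision "$(root_usage_pct)" "$(live_run_stacks "$PREFIX")" "$(converging_services)" && surgical_prune`. Keeping the decision in a pure function makes the no-op path (below 80% disk) checkable without touching the Docker daemon.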
10
docs/warm.md
@@ -85,10 +85,12 @@ back cleanly to a full cold run (the PR is still tested).
 - **Serialize:** `DRONE_RUNNER_CAPACITY = MAX_TESTS` (default 1); the nightly sweep is serial and
 skips if a `run_recipe_ci.py` is active. At most MAX_TESTS apps are ever live at once.
 - **Warm keycloak shared safely** via per-run namespaced realms (above); orphan realms reaped.
-- **Disk** (warm is the budget, not RAM): `virtualisation.docker.autoPrune` prunes
-images/containers/networks/build-cache older than 24h but **never `--volumes`** (so data-warm
-canonical volumes survive). Each canonical = one data volume + one snapshot (small; the keycloak DB
-snapshot ~300M dominates). `canonical.prune_stale()` (run nightly) drops warm data for
+- **Disk** (warm is the budget, not RAM): the `ci-docker-prune` unit (`nix/modules/docker-prune.nix`,
+Phase-2pc) prunes only **dangling** images/containers/build-cache (`until=24h`), and only under
+genuine disk pressure (`/` ≥ 80%) with nothing in flight — **never `--all`** (keeps cached base/
+in-use images warm; the local store IS the cache on this single host) and **never `--volumes`** (so
+data-warm canonical volumes survive). Each canonical = one data volume + one snapshot (small; the
+keycloak DB snapshot ~300M dominates). `canonical.prune_stale()` (run nightly) drops warm data for
 **de-enrolled** canonicals. Monitor with `df -h /` (the nightly logs it).
 - **Cold teardown stays sacred:** a cold per-run app's volumes/secrets are always deleted at run end
 (or janitor-reaped); promote re-seeds the canonical separately (never reuses a per-run volume).