cc-ci/docs/runbook.md
autonomic-bot de6103d41d claim(2pc): PC1 conservative prune deployed+verified; PC2/PC3 local-store cache confirmed
ci-docker-prune (gated surgical prune) live on cc-ci: old autoPrune --all gone, new timer
enabled (daily), no-ops below 80% disk keeping the local image cache, never --all/--volumes.
Daemon stays PAT-authenticated (nptest2); /var/lib/docker retained across rebuild. PC3 proof:
redis:7-alpine deploy->teardown(service rm, image retained)->redeploy = "Image is up to date",
no layer re-download (cold 5303ms -> warm 674ms). Docs: runbook "Image cache & prune policy",
warm.md, DECISIONS Phase-2pc, IDEAS (registry pull-through cache deferred + revisit trigger).
Gate 2pc CLAIMED, awaiting Adversary cold-verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:42:36 +01:00


Runbook — debugging a failed run

Where to look

  • Per-run logs: the PR comment links to the Drone build (drone.ci.commoninternet.net/...). Each stage (install / upgrade / backup / recipe-local) is a separate pytest invocation with its own reported result. Logs are live/tail-able while running.
  • Overview: ci.commoninternet.net — latest run per recipe + pass/fail/running badges.
  • Bridge: docker service logs ccci-bridge_app on the host — shows poll/trigger decisions, auth rejections, and outcome reflection.
  • Host: docker service ls / docker service ps <stack>_<svc> --no-trunc for a deploy that isn't converging; journalctl -u deploy-<x> for the reconcile oneshots.

Fetch a build's step log via the API:

DT=$(ssh cc-ci 'cat /run/secrets/bridge_drone_token')
curl -s -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 \
  https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds/<N>/logs/1/2

Common failure modes

  • FATA deploy timed out / services stuck "Preparing": images cold-pulling slower than abra's convergence TIMEOUT (default 300s). Bump TIMEOUT via the recipe's recipe_meta.py EXTRA_ENV (lasuite-docs uses 900). Verify the stack converges manually: docker stack services <stack>.
  • toomanyrequests: unauthenticated pull rate limit (task Rejected "No such image"): Docker Hub anonymous rate limit. The daemon is now PAT-authenticated (sops dockerhub_auth -> /root/.docker/config.json; docker info shows Username=nptest2; 200 pulls/6h per account). Do not run docker image prune -af: it evicts cached base/in-use images and forces re-pulls that burn the limit. See Image cache & prune policy below. Check disk first: df -h /.
  • authentication required: Unauthorized fetching recipe tags: an abra command tried to fetch from the private mirror origin. All recipe-touching harness calls pass -C -o (chaos+offline); recipe_versions/upgrade use the upstream tags fetched read-only at clone time. If you see this, a new abra call is missing -o.
  • upgrade stage SKIPPED ("no previous published version"): the recipe clone has no version tags. fetch_recipe read-only-fetches them from the public upstream (git.coopcloud.tech/coop-cloud/<r>); confirm the upstream has ≥2 tags (git ls-remote --tags).
  • health wait hangs / 502: the app isn't answering HEALTH_PATH yet. Slow apps (keycloak JVM + Liquibase, lasuite 9-service) just need time; raise DEPLOY_TIMEOUT/HTTP_TIMEOUT in recipe_meta.py. A persistent 502 with services 1/1 = wrong HEALTH_PATH (e.g. keycloak needs /realms/master, not /).
  • data-survival assertion fails: the marker wasn't in a backed-up volume / the DB hook didn't run. Check the recipe's backupbot.backup* labels; DB recipes use a pg_backup.sh pre/post-hook.
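
For the "stuck Preparing" and "health wait hangs" cases above, it can help to list only the services whose running replica count differs from the desired count. A minimal sketch, assuming the `{{.Name}} {{.Replicas}}` format of docker service ls; the function name unconverged is hypothetical and not part of the harness:

```shell
# Hypothetical helper (not part of cc-ci): read "<name> <running>/<desired>"
# lines on stdin and print the names whose replica counts do not match.
unconverged() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Usage on the host (assumes the docker CLI is available there):
#   docker service ls --format '{{.Name}} {{.Replicas}}' | unconverged
```

An empty result means every listed service reports running == desired, which points at the app (for example a wrong HEALTH_PATH) rather than at swarm convergence.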

Orphans / cleanup

Teardown is guaranteed (try/finally) and verified (_residual raises if anything is left). A SIGKILL'd/timed-out build can't run its own teardown; the run-start janitor reaps orphaned run apps before the next deploy. To reap manually, either immediately or after cancelling a stuck build:

ssh cc-ci 'export HOME=/root; D=<recipe[:4]>-<6hex>.ci.commoninternet.net
abra app undeploy "$D" -n; docker stack rm "$(echo $D | tr . _)"; sleep 6
abra app volume remove "$D" -f -n; abra app secret remove "$D" --all -n; abra app config remove "$D"'

Confirm clean: docker service ls | grep <prefix> returns nothing.
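
The cleanup command above derives the swarm stack name from the app domain by replacing dots with underscores. A small sketch of that mapping, which can also supply the grep prefix; the function name stack_name and the example domain are illustrative assumptions:

```shell
# Hypothetical helper: map an abra app domain to its swarm stack name,
# mirroring the `tr . _` step in the cleanup command.
stack_name() {
  printf '%s\n' "$1" | tr . _
}

# Example domain is made up for illustration.
stack_name "abcd-0123ab.ci.commoninternet.net"
# prints abcd-0123ab_ci_commoninternet_net
```

On the host this could be combined with the check above, e.g. docker service ls | grep "$(stack_name "$D")".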

Image cache & prune policy

On this single host, Docker's own local image store is the cache: a pulled image stays, and re-deploys (cold tests, warm canonical, reboots) reuse the local layers with no re-download. The daemon is PAT-authenticated, so a warm redeploy makes at most one authenticated manifest check. Teardown removes the run's services/volumes/secrets/.env but never images, so the next deploy of the same recipe is served locally. (There is no separate registry:2 pull-through cache: it only pays off with multiple nodes or separately survivable storage, neither of which we have; see DECISIONS Phase-2pc.)

Pruning is the ci-docker-prune unit (nix/modules/docker-prune.nix), a daily timer that is surgical and triple-gated. It does nothing unless ALL hold: (1) / usage ≥ 80% (genuine disk pressure), (2) no run-app stack live (never prune mid-run), (3) no swarm service converging (no deploy/pull in flight). When it does run, it prunes only dangling images + stopped containers + dangling build cache, age-gated with until=24h; never --all (keeps tagged base/in-use images), never --volumes (warm canonical data). The old virtualisation.docker.autoPrune --all was removed: its daily --all evicted cached recipe base images → cold re-pull → Hub rate-limit churn.

ssh cc-ci 'systemctl list-timers ci-docker-prune.timer --no-pager; \
           systemctl start ci-docker-prune.service; \
           journalctl -u ci-docker-prune.service -n 3 --no-pager'   # below 80% -> no-op, keeps cache

Reclaim manually under real pressure (still surgical, never -af): ssh cc-ci 'docker image prune -f --filter until=24h' (dangling only).
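
The triple gate described above can be sketched as a single predicate. This is an illustrative reconstruction only: the real logic lives in nix/modules/docker-prune.nix and may differ, and the function name should_prune and its argument order are assumptions.

```shell
# Hypothetical sketch of the triple gate (not the actual unit script).
# Args: <root-usage-percent> <live-run-app-stacks> <converging-services>
should_prune() {
  usage=$1 live=$2 converging=$3
  # Prune only under disk pressure, with no run in flight and no deploy converging.
  [ "$usage" -ge 80 ] && [ "$live" -eq 0 ] && [ "$converging" -eq 0 ]
}

# Example inputs: 85% disk, no live run stacks, nothing converging.
if should_prune 85 0 0; then echo "prune"; else echo "no-op"; fi
```

With GNU df, the first argument could be obtained via df --output=pcent / | tail -n 1 | tr -dc '0-9'; how the unit counts live run-app stacks and converging services is not shown here.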

Re-running / triggering by hand

  • Re-comment !testme on the PR (distinct comment id → re-runs; deduped per comment).
  • Or trigger the recipe-ci pipeline directly (same params the bridge sends):
    curl -s -H "Authorization: Bearer $DT" -X POST --proxy socks5h://localhost:1055 \
      "https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds?branch=main&RECIPE=<r>&PR=0"
    
  • Or run a stage on the host: cd /root/cc-ci && HOME=/root RECIPE=<r> PR=0 STAGES=install,upgrade,backup cc-ci-run runner/run_recipe_ci.py.

Cancelling a stuck build

curl -s -X DELETE -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 .../builds/<N>, then tear down manually (see Orphans / cleanup above), since a cancelled build skips its finalizer.