cc-ci/docs/runbook.md
autonomic-bot 8c286bff60
docs(prevb): update recipe-customization/testing/runbook for dynamic base + previous/ (drop stale recipe_versions[-2] model)
2026-06-17 00:46:03 +00:00

# Runbook — debugging a failed run
## Where to look
- **Per-run logs:** the PR comment links to the Drone build (`drone.ci.commoninternet.net/...`).
Each stage (install / upgrade / backup / recipe-local) is a separate pytest invocation with its
own reported result. Logs are live/tail-able while running.
- **Overview:** `ci.commoninternet.net` — latest run per recipe + pass/fail/running badges.
- **Bridge:** `docker service logs ccci-bridge_app` on the host — shows poll/trigger decisions,
auth rejections, and outcome reflection.
- **Host:** `docker service ls` / `docker service ps <stack>_<svc> --no-trunc` for a deploy that
isn't converging; `journalctl -u deploy-<x>` for the reconcile oneshots.

Fetch a build's step log via the API:
```sh
DT=$(ssh cc-ci 'cat /run/secrets/bridge_drone_token')
curl -s -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 \
https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds/<N>/logs/1/2
```
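
In Drone's log endpoint the trailing `1/2` is the stage number followed by the step number within the build. A minimal sketch that makes the three parameters explicit; the function name `drone_log_url` is hypothetical, and it only builds the URL without touching the network:

```sh
# Hypothetical helper: build the Drone step-log URL for this repo.
# Args: <build number> <stage number> <step number>
drone_log_url() {
  printf 'https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds/%s/logs/%s/%s\n' \
    "$1" "$2" "$3"
}

# Example: build 42 (an illustrative number), stage 1, step 2.
drone_log_url 42 1 2
```

The result can be handed to the same `curl` call as above, e.g. `curl -s -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 "$(drone_log_url <N> 1 2)"`.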
## Common failure modes
- **`FATA deploy timed out` / services stuck "Preparing":** images are cold-pulling more slowly than
abra's convergence `TIMEOUT` (default 300s) allows. Bump `TIMEOUT` via the recipe's `recipe_meta.py`
`EXTRA_ENV` (lasuite-docs uses 900). Manually verify that the stack converges: `docker stack services <stack>`.
- **`toomanyrequests: unauthenticated pull rate limit`** (task Rejected with "No such image"): the
Docker Hub anonymous rate limit. The daemon is now PAT-authenticated (sops `dockerhub_auth`,
`/root/.docker/config.json`; `docker info` shows Username=nptest2; 200 pulls per 6h per account).
Do **not** run `docker image prune -af`: it evicts cached base/in-use images and forces re-pulls
that burn the limit. See **Image cache & prune policy** below. Check disk first: `df -h /`.
- **`authentication required: Unauthorized` fetching recipe tags:** an abra command tried to fetch
from the private mirror origin. All recipe-touching harness calls pass `-C -o` (chaos+offline);
`recipe_versions`/upgrade use the upstream tags fetched read-only at clone time. If you see this,
a new abra call is missing `-o`.
- **upgrade stage SKIPPED:** the dynamic base resolved to `skip` (phase prevb) — no last-green warm
canonical AND no resolvable `main` tip, or `head == main tip` (no predecessor delta), or a declared
`EXPECTED_NA[upgrade]`. The run log prints the exact reason (`upgrade base: kind=skip … SKIP: <reason>`).
For a recipe that should upgrade from `main`, confirm the per-run clone has `origin/main` (or
`origin/master`) and that it differs from the PR head (`resolve_upgrade_base` in `run_recipe_ci.py`).
- **health wait hangs / 502:** the app isn't answering `HEALTH_PATH` yet. Slow apps (keycloak JVM +
Liquibase, lasuite 9-service) just need time; raise `DEPLOY_TIMEOUT`/`HTTP_TIMEOUT` in
`recipe_meta.py`. A persistent 502 while services show 1/1 means `HEALTH_PATH` is wrong (e.g. keycloak
needs `/realms/master`, not `/`).
- **data-survival assertion fails:** either the marker wasn't in a backed-up volume or the DB hook
didn't run. Check the recipe's `backupbot.backup*` labels; DB recipes use a `pg_backup.sh` pre/post-hook.
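
For quick triage, the log signatures listed above can be matched mechanically. The following is only a sketch: the function name `classify_failure` and the short labels it prints are hypothetical, and the match strings are the ones quoted in this runbook.

```sh
# Hypothetical triage helper: map one log line to a failure mode above.
# The match strings come from this runbook; the labels are illustrative.
classify_failure() {
  case "$1" in
    *"deploy timed out"*)         echo deploy-timeout ;;
    *toomanyrequests*)            echo hub-rate-limit ;;
    *"authentication required"*)  echo missing-offline-flag ;;
    *"upgrade base: kind=skip"*)  echo upgrade-skipped ;;
    *)                            echo unknown ;;
  esac
}

classify_failure 'FATA deploy timed out'
```

Health-wait and data-survival failures are left out on purpose, since this runbook does not quote a log line for them.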
## Orphans / cleanup
Teardown is guaranteed (`try/finally`) and verified (`_residual` raises if anything is left). A
SIGKILL'd or timed-out build can't run its own teardown, so the **run-start janitor** reaps orphaned
run apps before the next deploy. To reap manually, either right away or after cancelling a stuck build:
```sh
ssh cc-ci 'export HOME=/root; D=<recipe[:4]>-<6hex>.ci.commoninternet.net
abra app undeploy "$D" -n; docker stack rm "$(echo "$D" | tr . _)"; sleep 6
abra app volume remove "$D" -f -n; abra app secret remove "$D" --all -n; abra app config remove "$D"'
```
Confirm clean: `docker service ls | grep <prefix>` returns nothing.
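
The `tr . _` step in the teardown command derives the swarm stack name from the app domain. A minimal sketch of that mapping on its own; the function name `stack_name` and the example domain are made up for illustration:

```sh
# Derive the swarm stack name from a run-app domain by replacing every
# dot with an underscore, as the `tr . _` step in the teardown does.
stack_name() {
  printf '%s\n' "$1" | tr . _
}

# Illustrative domain; "demo" and "a1b2c3" are made-up prefix/hex values.
stack_name demo-a1b2c3.ci.commoninternet.net
# prints: demo-a1b2c3_ci_commoninternet_net
```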
## Image cache & prune policy
On this **single host, Docker's own local image store IS the cache** — a pulled image stays, and
re-deploys (cold tests, warm canonical, reboots) reuse the local layers with no re-download; the
daemon is PAT-authenticated so a warm redeploy makes at most one authenticated manifest check.
Teardown removes the run's services/volumes/secrets/.env but **never images** — so the next deploy
of the same recipe is local. (No separate `registry:2` pull-through cache: it only pays off with
multiple nodes or separately survivable storage, neither of which we have; see DECISIONS Phase-2pc.)
Pruning is the **`ci-docker-prune`** unit (`nix/modules/docker-prune.nix`), a daily timer that is
**surgical and triple-gated** — it does **nothing** unless ALL hold: (1) `/` usage ≥ 80% (genuine
disk pressure), (2) no run-app stack live (never prune mid-run), (3) no swarm service converging
(no deploy/pull in flight). When it does run it prunes only **dangling images + stopped containers +
dangling build cache, age-gated `until=24h`** — **never `--all`** (keeps tagged base/in-use images),
**never `--volumes`** (warm canonical data). The old `virtualisation.docker.autoPrune --all` was
removed — its daily `--all` evicted cached recipe base images → cold re-pull → Hub rate-limit churn.
```sh
ssh cc-ci 'systemctl list-timers ci-docker-prune.timer --no-pager; \
systemctl start ci-docker-prune.service; \
journalctl -u ci-docker-prune.service -n 3 --no-pager' # below 80% -> no-op, keeps cache
```
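
The triple gate reduces to a single boolean: prune only when all three conditions hold. The sketch below is not the code in `nix/modules/docker-prune.nix`; the function name `should_prune` is hypothetical, and it takes the three measurements as plain numbers so the decision can be checked in isolation.

```sh
# Sketch of the triple gate. Inputs are plain numbers; how the real unit
# measures them is not shown here.
#   $1 = "/" usage in percent
#   $2 = number of live run-app stacks
#   $3 = number of swarm services still converging
should_prune() {
  [ "$1" -ge 80 ] && [ "$2" -eq 0 ] && [ "$3" -eq 0 ]
}

if should_prune 85 0 0; then echo prune; else echo no-op; fi   # disk pressure, host idle
if should_prune 60 0 0; then echo prune; else echo no-op; fi   # below 80%
if should_prune 85 1 0; then echo prune; else echo no-op; fi   # a run is live
```

Only the first call prints `prune`; a single failing gate is enough to keep the cache untouched.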
Reclaim manually under real pressure (still surgical, never `-af`):
`ssh cc-ci 'docker image prune -f --filter until=24h'` (dangling only).
## Re-running / triggering by hand
- Re-comment `!testme` on the PR (distinct comment id → re-runs; deduped per comment).
- Or trigger the recipe-ci pipeline directly (same params the bridge sends):
```sh
curl -s -H "Authorization: Bearer $DT" -X POST --proxy socks5h://localhost:1055 \
"https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds?branch=main&RECIPE=<r>&PR=0"
```
- Or run the stages directly on the host: `cd /root/cc-ci && HOME=/root RECIPE=<r> PR=0 STAGES=install,upgrade,backup cc-ci-run runner/run_recipe_ci.py`.
## Cancelling a stuck build
`curl -s -X DELETE -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 .../builds/<N>`,
then tear down manually (see **Orphans / cleanup** above), since a cancelled build skips its finalizer.