architecture.md: components, the !testme flow, network/TLS, resource safety, enrollment. runbook.md: where to look, common failure modes (timeout/rate-limit/auth/skip/health/data), orphan cleanup, re-trigger, cancel. Completes the D9 doc set (README+install+enroll+secrets+arch+runbook). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Runbook — debugging a failed run

## Where to look

- **Per-run logs:** the PR comment links to the Drone build (`drone.ci.commoninternet.net/...`).
  Each stage (install / upgrade / backup / recipe-local) is a separate pytest invocation with its
  own reported result. Logs can be tailed live while the build is running.
- **Overview:** `ci.commoninternet.net` shows the latest run per recipe, with pass/fail/running badges.
- **Bridge:** `docker service logs ccci-bridge_app` on the host shows poll/trigger decisions,
  auth rejections, and outcome reflection.
- **Host:** `docker service ls` / `docker service ps <stack>_<svc> --no-trunc` for a deploy that
  isn't converging; `journalctl -u deploy-<x>` for the reconcile oneshots.

Fetch a build's step log via the API:
```sh
DT=$(ssh cc-ci 'cat /run/secrets/bridge_drone_token')
curl -s -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 \
  "https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds/<N>/logs/1/2"
```

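When several steps of a build need inspecting, the log URL can be parameterised. The sketch below is only an illustration: `DRONE_API`, `drone_log_url` and `drone_step_log` are hypothetical names, not part of cc-ci, and the defaults of `1` and `2` simply mirror the `logs/1/2` path above. In the Drone API those two path segments are the stage number and the step number.

```sh
# Hypothetical helpers; nothing here ships with cc-ci.
DRONE_API="https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci"

# Print the log URL for build $1, stage $2 (default 1), step $3 (default 2).
drone_log_url() {
  printf '%s/builds/%s/logs/%s/%s\n' "$DRONE_API" "$1" "${2:-1}" "${3:-2}"
}

# Fetch that log through the same SOCKS proxy; expects the token in $DT.
# -f makes curl exit non-zero on an HTTP error instead of printing the error body.
drone_step_log() {
  curl -fsS -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 \
    "$(drone_log_url "$@")"
}
```

With `DT` set as above, `drone_step_log <N>` fetches the same log as the raw `curl` call, and `drone_step_log <N> 1 3` would fetch the next step of the same stage.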
## Common failure modes

- **`FATA deploy timed out` / services stuck "Preparing":** images are cold-pulling more slowly than
  abra's convergence `TIMEOUT` (default 300s) allows. Bump `TIMEOUT` via `EXTRA_ENV` in the recipe's
  `recipe_meta.py` (lasuite-docs uses 900). Verify manually that the stack converges:
  `docker stack services <stack>`.
- **`toomanyrequests: unauthenticated pull rate limit`** (task Rejected "No such image"): Docker Hub
  anonymous rate limit (the A1 registry-creds finding). Provide Docker Hub creds (sops `secrets/`,
  wired into the docker daemon). Do **not** run `docker image prune -af` mid-breadth: it evicts cached
  images and forces re-pulls that hit the limit. Check disk first with `df -h /` (heavy recipes need
  headroom; prune only `dangling` images between runs or rely on the daily autoprune).
- **`authentication required: Unauthorized` fetching recipe tags:** an abra command tried to fetch
  from the private mirror origin. All recipe-touching harness calls pass `-C -o` (chaos + offline);
  `recipe_versions`/upgrade use the upstream tags fetched read-only at clone time. If you see this
  error, a new abra call is missing `-o`.
- **upgrade stage SKIPPED ("no previous published version"):** the recipe clone has no version tags.
  `fetch_recipe` fetches them read-only from the public upstream (`git.coopcloud.tech/coop-cloud/<r>`);
  confirm the upstream has ≥2 tags (`git ls-remote --tags`).
- **health wait hangs / 502:** the app isn't answering `HEALTH_PATH` yet. Slow apps (keycloak JVM +
  Liquibase, lasuite 9-service) just need time; raise `DEPLOY_TIMEOUT`/`HTTP_TIMEOUT` in
  `recipe_meta.py`. A persistent 502 with services at 1/1 means `HEALTH_PATH` is wrong (e.g. keycloak
  needs `/realms/master`, not `/`).
- **data-survival assertion fails:** the marker wasn't in a backed-up volume, or the DB hook didn't run.
  Check the recipe's `backupbot.backup*` labels; DB recipes use a `pg_backup.sh` pre/post-hook.

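Four of the failure modes above leave a distinctive string in the log, so a saved log can be triaged mechanically before reading it in full. The following is a minimal sketch under that assumption: `classify_failure` and `triage_log` are hypothetical names, the category labels are made up for illustration, and the health and data-survival modes are not covered because the list gives no fixed log string for them.

```sh
# Map one log line to the failure mode it most likely belongs to.
# Match strings are taken from the list above; the labels are illustrative.
classify_failure() {
  case "$1" in
    *"FATA deploy timed out"*)                 echo timeout ;;
    *"toomanyrequests"*)                       echo rate-limit ;;
    *"authentication required: Unauthorized"*) echo auth ;;
    *"no previous published version"*)         echo skip ;;
    *)                                         echo unknown ;;
  esac
}

# Read a log on stdin and print each known failure mode it contains, once.
triage_log() {
  while IFS= read -r line; do
    classify_failure "$line"
  done | grep -v '^unknown$' | sort -u
}
```

A log saved to a file can then be checked with `triage_log < build.log` (the file name is only an example). Empty output means none of the four strings appeared, which points towards the health or data-survival cases.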
## Orphans / cleanup

Teardown is guaranteed (`try/finally`) and verified (`_residual` raises if anything is left). A
SIGKILL'd or timed-out build can't run its own teardown, so the **run-start janitor** reaps orphaned
run apps before the next deploy. To reap immediately, or after cancelling a stuck build, run the
cleanup by hand:
```sh
ssh cc-ci 'export HOME=/root; D=<recipe[:4]>-<6hex>.ci.commoninternet.net
abra app undeploy "$D" -n; docker stack rm "$(echo $D | tr . _)"; sleep 6
abra app volume remove "$D" -f -n; abra app secret remove "$D" --all -n; abra app config remove "$D"'
```
Confirm the host is clean: `docker service ls | grep <prefix>` should return nothing.

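Because the manual reap removes volumes and secrets, it can help to validate the domain and print the commands before running anything. This is a sketch, not part of cc-ci: `is_run_domain`, `stack_name` and `reap_plan` are hypothetical names, and the regular expression assumes the `<recipe[:4]>-<6hex>` prefix consists of lowercase letters, digits or dashes followed by lowercase hex digits.

```sh
# Succeed only if $1 looks like a per-run app domain: up to four recipe
# characters, a dash, six hex digits, then the CI domain suffix (assumed shape).
is_run_domain() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9-]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$'
}

# Swarm stack name for a domain: dots become underscores, as in the command above.
stack_name() {
  printf '%s\n' "$1" | tr . _
}

# Print the reap commands for review instead of executing them.
reap_plan() {
  is_run_domain "$1" || { echo "refusing: $1 is not a run domain" >&2; return 1; }
  cat <<EOF
abra app undeploy "$1" -n
docker stack rm "$(stack_name "$1")"
sleep 6
abra app volume remove "$1" -f -n
abra app secret remove "$1" --all -n
abra app config remove "$1"
EOF
}
```

`reap_plan` only prints the commands. Once the output looks right, it can be run on the host, for example by pasting it into an `ssh cc-ci` session with `HOME=/root` exported.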
## Re-running / triggering by hand

- Re-comment `!testme` on the PR. Runs are deduped per comment id, so a new comment triggers a new run.
- Or trigger the recipe-ci pipeline directly (same params the bridge sends):
```sh
curl -s -H "Authorization: Bearer $DT" -X POST --proxy socks5h://localhost:1055 \
  "https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds?branch=main&RECIPE=<r>&PR=0"
```
- Or run the stages directly on the host: `cd /root/cc-ci && HOME=/root RECIPE=<r> PR=0 STAGES=install,upgrade,backup cc-ci-run runner/run_recipe_ci.py`.

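The direct trigger reuses `$DT` from "Where to look", which is easy to forget in a fresh shell. A small wrapper can fail early in that case. This is a sketch under stated assumptions: `trigger_url`, `require_token` and `trigger_recipe_ci` are hypothetical names, and `RECIPE` and `PR` are the only parameters taken from the command above.

```sh
# Build the trigger URL for recipe $1 and PR number $2 (default 0, as above).
trigger_url() {
  printf '%s/builds?branch=main&RECIPE=%s&PR=%s\n' \
    "https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci" "$1" "${2:-0}"
}

# Return non-zero with a hint when the Drone token is not in $DT.
require_token() {
  [ -n "${DT:-}" ] || { echo "DT is empty; read the token from the host first" >&2; return 1; }
}

# POST the trigger through the SOCKS proxy, refusing to run without a token.
trigger_recipe_ci() {
  require_token || return 1
  curl -fsS -X POST -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 \
    "$(trigger_url "$@")"
}
```

`trigger_recipe_ci <r>` then sends the same request as the raw `curl` call, and `trigger_recipe_ci <r> <pr>` sets `PR` to a specific pull request number.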
## Cancelling a stuck build

Send a `DELETE` for the build:
`curl -s -X DELETE -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 .../builds/<N>`,
then tear down manually (see "Orphans / cleanup" above), since a cancelled build skips its finalizer.