architecture.md: components, the !testme flow, network/TLS, resource safety, enrollment. runbook.md: where to look, common failure modes (timeout/rate-limit/auth/skip/health/data), orphan cleanup, re-trigger, cancel. Completes the D9 doc set (README+install+enroll+secrets+arch+runbook). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runbook — debugging a failed run
Where to look
- Per-run logs: the PR comment links to the Drone build (drone.ci.commoninternet.net/...). Each stage (install / upgrade / backup / recipe-local) is a separate pytest invocation with its own reported result. Logs are live/tail-able while running.
- Overview: ci.commoninternet.net shows the latest run per recipe plus pass/fail/running badges.
- Bridge: docker service logs ccci-bridge_app (on the host) shows poll/trigger decisions, auth rejections, and outcome reflection.
- Host: docker service ls / docker service ps <stack>_<svc> --no-trunc for a deploy that isn't converging; journalctl -u deploy-<x> for the reconcile oneshots.
Fetch a build's step log via the API:
DT=$(ssh cc-ci 'cat /run/secrets/bridge_drone_token')
curl -s -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 \
https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds/<N>/logs/1/2
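The log URL above can be built from a build number rather than edited by hand. The following is a minimal sketch, not part of the cc-ci tooling: `drone_log_url`, `DRONE_BASE` and `DRONE_REPO` are hypothetical names, and the defaults of `1` and `2` for the last two path segments simply mirror the `/logs/1/2` shown above (assumed here to be the stage and step numbers).

```shell
# Hypothetical helper: print the Drone log URL for one step of one build.
# DRONE_BASE and DRONE_REPO are assumed override variables; their defaults
# are the host and repo used elsewhere in this runbook.
drone_log_url() {
  local build="$1" stage="${2:-1}" step="${3:-2}"
  local base="${DRONE_BASE:-https://drone.ci.commoninternet.net}"
  local repo="${DRONE_REPO:-recipe-maintainers/cc-ci}"
  local n
  # Each path component must be a non-empty string of digits.
  for n in "$build" "$stage" "$step"; do
    case "$n" in
      ''|*[!0-9]*)
        echo "usage: drone_log_url <build> [stage] [step]" >&2
        return 2
        ;;
    esac
  done
  printf '%s/api/repos/%s/builds/%s/logs/%s/%s\n' \
    "$base" "$repo" "$build" "$stage" "$step"
}
```

With `N` set to the build number, it would slot into the same curl call: `curl -s -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 "$(drone_log_url "$N")"`.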
Common failure modes
- FATA deploy timed out / services stuck "Preparing": images cold-pulling slower than abra's convergence TIMEOUT (default 300s). Bump TIMEOUT via the recipe's recipe_meta.py EXTRA_ENV (lasuite-docs uses 900). Verify the stack converges manually: docker stack services <stack>.
- toomanyrequests: unauthenticated pull rate limit (task Rejected "No such image"): Docker Hub anonymous rate limit (the A1 registry-creds finding). Provide Docker Hub creds (sops secrets/, wire into the docker daemon). Do not run docker image prune -af mid-breadth: it evicts cached images and forces re-pulls that hit the limit. Check disk first: df -h / (heavy recipes need headroom; prune only dangling between runs or rely on the daily autoprune).
- authentication required: Unauthorized fetching recipe tags: an abra command tried to fetch from the private mirror origin. All recipe-touching harness calls pass -C -o (chaos+offline); recipe_versions/upgrade use the upstream tags fetched read-only at clone time. If you see this, a new abra call is missing -o.
- upgrade stage SKIPPED ("no previous published version"): the recipe clone has no version tags. fetch_recipe read-only-fetches them from the public upstream (git.coopcloud.tech/coop-cloud/<r>); confirm the upstream has ≥2 tags (git ls-remote --tags).
- health wait hangs / 502: the app isn't answering HEALTH_PATH yet. Slow apps (keycloak JVM + Liquibase, lasuite 9-service) just need time; raise DEPLOY_TIMEOUT/HTTP_TIMEOUT in recipe_meta.py. A persistent 502 with services 1/1 = wrong HEALTH_PATH (e.g. keycloak needs /realms/master, not /).
- data-survival assertion fails: the marker wasn't in a backed-up volume / the DB hook didn't run. Check the recipe's backupbot.backup* labels; DB recipes use a pg_backup.sh pre/post-hook.
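The signatures above lend themselves to a quick first-pass triage of a log excerpt. The sketch below is illustrative only: `classify_failure` and its output labels are hypothetical, it matches only the literal strings quoted in this section, and the bare `502` match is deliberately loose. The data-survival failure is left out because its exact log text is not given here.

```shell
# Hypothetical triage helper: read a log excerpt on stdin and print a
# label for the first known failure signature found in it.
classify_failure() {
  local log
  log="$(cat)"
  case "$log" in
    *"toomanyrequests"*)               echo rate-limit ;;
    *"authentication required"*)       echo auth ;;
    *"deploy timed out"*)              echo timeout ;;
    *"no previous published version"*) echo upgrade-skipped ;;
    # Loose match: any "502" in the excerpt is treated as a health failure.
    *"502"*)                           echo health ;;
    *)                                 echo unknown ;;
  esac
}
```

A step log fetched through the API could be piped straight into it. The label is only a pointer to the matching entry in this section, not a diagnosis.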
Orphans / cleanup
Teardown is guaranteed (try/finally) and verified (_residual raises if anything is left). A
SIGKILL'd/timed-out build can't run its own teardown — the run-start janitor reaps orphaned run
apps before the next deploy. To reap now, or after cancelling a stuck build, manually:
ssh cc-ci 'export HOME=/root; D=<recipe[:4]>-<6hex>.ci.commoninternet.net
abra app undeploy "$D" -n; docker stack rm "$(echo $D | tr . _)"; sleep 6
abra app volume remove "$D" -f -n; abra app secret remove "$D" --all -n; abra app config remove "$D"'
Confirm clean: docker service ls | grep <prefix> returns nothing.
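Because the manual reap removes volumes and secrets, it may be worth checking the domain before running it. The following is a sketch under stated assumptions: `is_run_domain` and `stack_name` are hypothetical helpers, and the pattern assumes the recipe prefix is one to four lowercase letters, digits or hyphens and that the hex suffix is lowercase. `stack_name` applies the same dot-to-underscore mapping as the `tr . _` in the teardown above.

```shell
# Hypothetical guard: succeed only if the argument looks like a run-app
# domain of the form <recipe[:4]>-<6hex>.ci.commoninternet.net.
# The allowed prefix characters and lowercase hex are assumptions.
is_run_domain() {
  printf '%s\n' "$1" |
    grep -Eq '^[a-z0-9-]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$'
}

# Derive the swarm stack name from the domain by replacing dots with
# underscores, as the manual teardown does.
stack_name() {
  printf '%s\n' "$1" | tr . _
}
```

For a made-up domain such as `lasu-a1b2c3.ci.commoninternet.net`, `stack_name` prints `lasu-a1b2c3_ci_commoninternet_net`, while `is_run_domain drone.ci.commoninternet.net` fails, so a wrapper could refuse to reap anything that is not a run app.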
Re-running / triggering by hand
- Re-comment !testme on the PR (distinct comment id → re-runs; deduped per comment).
- Or trigger the recipe-ci pipeline directly (same params the bridge sends):
  curl -s -H "Authorization: Bearer $DT" -X POST --proxy socks5h://localhost:1055 \
    "https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds?branch=main&RECIPE=<r>&PR=0"
- Or run a stage on the host:
  cd /root/cc-ci && HOME=/root RECIPE=<r> PR=0 STAGES=install,upgrade,backup cc-ci-run runner/run_recipe_ci.py
Cancelling a stuck build
Send a DELETE to the build's API endpoint:
curl -s -X DELETE -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 .../builds/<N>
Then tear down manually (see Orphans / cleanup above), since a cancelled build skips its finalizer.