cc-ci/docs/runbook.md
autonomic-bot 9b58fd0dfb
M9/D9: add architecture.md + runbook.md — docs set complete
architecture.md: components, the !testme flow, network/TLS, resource safety, enrollment.
runbook.md: where to look, common failure modes (timeout/rate-limit/auth/skip/health/data), orphan
cleanup, re-trigger, cancel. Completes the D9 doc set (README+install+enroll+secrets+arch+runbook).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 10:34:37 +01:00


Runbook — debugging a failed run

Where to look

  • Per-run logs: the PR comment links to the Drone build (drone.ci.commoninternet.net/...). Each stage (install / upgrade / backup / recipe-local) is a separate pytest invocation with its own reported result. Logs are live/tail-able while running.
  • Overview: ci.commoninternet.net — latest run per recipe + pass/fail/running badges.
  • Bridge: docker service logs ccci-bridge_app on the host — shows poll/trigger decisions, auth rejections, and outcome reflection.
  • Host: docker service ls, or docker service ps <stack>_<svc> --no-trunc for a deploy that isn't converging; journalctl -u deploy-<x> for the reconcile oneshots.

Fetch a build's step log via the API:

DT=$(ssh cc-ci 'cat /run/secrets/bridge_drone_token')
curl -s -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 \
  https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds/<N>/logs/1/2
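When several stages need checking, the same call can be wrapped in a small helper. This is a minimal sketch, not part of cc-ci: the function names drone_log_url and drone_log are made up for illustration, the base URL and repo slug are copied from the command above, and the two trailing path segments are treated as Drone's stage and step numbers (defaulting to the 1/2 used above).

```shell
# Base URL and repo slug copied from the runbook command above.
DRONE_BASE="https://drone.ci.commoninternet.net"
DRONE_REPO="recipe-maintainers/cc-ci"

# Print the log URL for a build. Hypothetical helper, not part of cc-ci.
#   $1 = build number, $2 = stage number (default 1), $3 = step number (default 2)
drone_log_url() {
  local build stage step
  build="$1"
  stage="${2:-1}"
  step="${3:-2}"
  case "$build" in
    ''|*[!0-9]*)
      echo "build number must be numeric: '$build'" >&2
      return 2
      ;;
  esac
  printf '%s/api/repos/%s/builds/%s/logs/%s/%s\n' \
    "$DRONE_BASE" "$DRONE_REPO" "$build" "$stage" "$step"
}

# Fetch the log. Assumes DT holds the Drone token and the SOCKS proxy
# on localhost:1055 is reachable, exactly as in the command above.
drone_log() {
  local url
  url=$(drone_log_url "$@") || return
  curl -sf -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 "$url"
}
```

With DT set as above, drone_log <N> fetches the default step and drone_log <N> 1 3 would fetch the next one; the -f flag makes curl exit non-zero on an HTTP error instead of printing the error body.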

Common failure modes

  • FATA deploy timed out / services stuck "Preparing": images cold-pulling slower than abra's convergence TIMEOUT (default 300s). Bump TIMEOUT via the recipe's recipe_meta.py EXTRA_ENV (lasuite-docs uses 900). Verify the stack converges manually: docker stack services <stack>.
  • toomanyrequests: unauthenticated pull rate limit (task Rejected "No such image"): Docker Hub anonymous rate limit (the A1 registry-creds finding). Provide Docker Hub creds (sops secrets/, wire into the docker daemon). Do not run docker image prune -af mid-breadth: it evicts cached images and forces re-pulls that hit the limit. Check disk first with df -h / (heavy recipes need headroom; prune only dangling images between runs, or rely on the daily autoprune).
  • authentication required: Unauthorized fetching recipe tags: an abra command tried to fetch from the private mirror origin. All recipe-touching harness calls pass -C -o (chaos+offline); recipe_versions/upgrade use the upstream tags fetched read-only at clone time. If you see this, a new abra call is missing -o.
  • upgrade stage SKIPPED ("no previous published version"): the recipe clone has no version tags. fetch_recipe read-only-fetches them from the public upstream (git.coopcloud.tech/coop-cloud/<r>); confirm the upstream has ≥2 tags (git ls-remote --tags).
  • health wait hangs / 502: the app isn't answering HEALTH_PATH yet. Slow apps (keycloak JVM + Liquibase, lasuite 9-service) just need time; raise DEPLOY_TIMEOUT/HTTP_TIMEOUT in recipe_meta.py. A persistent 502 while services show 1/1 means a wrong HEALTH_PATH (e.g. keycloak needs /realms/master, not /).
  • data-survival assertion fails: the marker wasn't in a backed-up volume / the DB hook didn't run. Check the recipe's backupbot.backup* labels; DB recipes use a pg_backup.sh pre/post-hook.
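The log signatures in this list lend themselves to a quick first-pass triage. The following is a rough sketch, assuming a saved step log is piped in on stdin; classify_failure is a hypothetical helper, and it only covers the four modes whose log strings are quoted above (the 502 and data-survival cases have no fixed string given here, so they are left out).

```shell
# Read a build log on stdin and print one hint per recognised signature.
# Hypothetical helper: the match strings are the ones quoted in this runbook,
# the hints paraphrase its advice. Returns 0 if anything matched, 1 otherwise.
classify_failure() {
  local log found pattern hint
  log=$(cat)
  found=1
  # Each table row is "<fixed string to look for>|<hint to print>".
  while IFS='|' read -r pattern hint; do
    if printf '%s\n' "$log" | grep -qF -- "$pattern"; then
      printf '%s\n' "$hint"
      found=0
    fi
  done <<'EOF'
deploy timed out|timeout: raise TIMEOUT via EXTRA_ENV in recipe_meta.py
toomanyrequests|rate-limit: provide Docker Hub creds, do not prune cached images
authentication required|auth: an abra call is probably missing -o
no previous published version|skip: check that the upstream has at least 2 tags
EOF
  return "$found"
}
```

A non-zero exit means none of the known strings appeared, which is the cue to read the log by hand rather than a sign that the run was healthy.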

Orphans / cleanup

Teardown is guaranteed (try/finally) and verified (_residual raises if anything is left). A SIGKILL'd or timed-out build can't run its own teardown, so the run-start janitor reaps orphaned run apps before the next deploy. To reap manually, either right away or after cancelling a stuck build:

ssh cc-ci 'export HOME=/root; D=<recipe[:4]>-<6hex>.ci.commoninternet.net
abra app undeploy "$D" -n; docker stack rm "$(echo "$D" | tr . _)"; sleep 6
abra app volume remove "$D" -f -n; abra app secret remove "$D" --all -n; abra app config remove "$D"'

Confirm clean: docker service ls | grep <prefix> returns nothing.
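The confirm-clean check can be scripted so that it fails loudly instead of relying on an eyeballed empty grep. This is only a sketch: stack_name and residual_services are illustrative names, the stack name is derived with the same tr . _ substitution as the cleanup command above, and services are assumed to be named <stack>_<svc>.

```shell
# Map an app domain to its swarm stack name, mirroring the `tr . _`
# step in the cleanup command above. Hypothetical helper.
stack_name() {
  printf '%s\n' "$1" | tr . _
}

# Read service names on stdin (one per line) and print those that belong
# to the stack for domain $1. Returns 0 when nothing is left, 1 otherwise.
residual_services() {
  local stack left
  stack=$(stack_name "$1")
  # Prefix match on "<stack>_", assuming services are named <stack>_<svc>.
  left=$(awk -v p="${stack}_" 'index($0, p) == 1')
  if [ -n "$left" ]; then
    printf '%s\n' "$left"
    return 1
  fi
}
```

On the host this could be fed from docker service ls --format '{{.Name}}' | residual_services "$D", with a non-zero exit meaning the teardown above needs another pass.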

Re-running / triggering by hand

  • Re-comment !testme on the PR (distinct comment id → re-runs; deduped per comment).
  • Or trigger the recipe-ci pipeline directly (same params the bridge sends):
    curl -s -H "Authorization: Bearer $DT" -X POST --proxy socks5h://localhost:1055 \
      "https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds?branch=main&RECIPE=<r>&PR=0"
    
  • Or run selected stages directly on the host: cd /root/cc-ci && HOME=/root RECIPE=<r> PR=0 STAGES=install,upgrade,backup cc-ci-run runner/run_recipe_ci.py

Cancelling a stuck build

curl -s -X DELETE -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 .../builds/<N>, then tear down manually (see Orphans / cleanup above), since a cancelled build skips its finalizer.