M9/D9: add architecture.md + runbook.md — docs set complete
All checks were successful
continuous-integration/drone/push Build is passing
architecture.md: components, the !testme flow, network/TLS, resource safety, enrollment. runbook.md: where to look, common failure modes (timeout/rate-limit/auth/skip/health/data), orphan cleanup, re-trigger, cancel. Completes the D9 doc set (README+install+enroll+secrets+arch+runbook). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
58
docs/architecture.md
Normal file
@@ -0,0 +1,58 @@
# Architecture

cc-ci turns a `!testme` PR comment into a real end-to-end deploy + test of a Co-op Cloud recipe and
reports the result back. Everything on the `cc-ci` host is declared in this repo's NixOS flake.

## Components

| Component | Where | Role |
|---|---|---|
| **comment-bridge** | `bridge/bridge.py`, `modules/bridge.nix` (swarm svc, `ci.commoninternet.net/hook`) | Polls enrolled repos for `!testme` (primary, read-only) + optional admin webhook; authorizes the commenter (org membership); triggers a parameterized Drone build; posts/edits the PR comment with the run link + final pass/fail. |
| **Drone server** | `modules/drone.nix`: coop-cloud `drone` recipe via abra (`drone.ci.commoninternet.net`, Gitea SSO) | CI engine. Holds the `recipe-ci` (custom-event) and `self-test` (push) pipelines (`.drone.yml`). |
| **Drone exec runner** | `modules/drone-runner.nix` (host systemd service) | Runs pipeline steps **on the host** so they can drive `abra`/Docker. `DRONE_RUNNER_CAPACITY=1` (MAX_TESTS) caps concurrent builds; the rest queue natively. |
| **harness** | `runner/run_recipe_ci.py` + `runner/harness/` + `tests/` | Orchestrates per run: fetch recipe at the PR head → install → upgrade → backup/restore → recipe-local (D4) → guaranteed teardown. pytest + Playwright via the Nix `cc-ci-run` env. |
| **swarm + traefik** | `modules/swarm.nix`, `modules/proxy.nix`; coop-cloud `traefik` recipe via abra | Single-node Docker Swarm + `proxy` overlay; traefik terminates TLS with the pre-issued wildcard cert (file provider, **no ACME**). The real deploy target for recipes-under-test. |
| **backup-bot-two** | `modules/backupbot.nix` | restic-based volume/DB backups; `abra app backup/restore` drive it. |
| **dashboard** | `dashboard/dashboard.py`, `modules/dashboard.nix` (`ci.commoninternet.net`) | YunoHost-CI-like overview: latest run per recipe + status badges + run links; `/badge/<recipe>.svg`. |
| **secrets** | `modules/secrets.nix` + `secrets/secrets.yaml` (sops-nix) | Infra secrets, decrypted at activation via the host SSH key as the age identity. See `secrets.md`. |

All swarm infra (traefik, drone, bridge, dashboard, backupbot) is brought up by **idempotent-reconcile
systemd oneshots** that converge on every activation/boot (no run-once sentinels), so a from-scratch
install is `git clone` + `nixos-rebuild switch` + the operator preconditions (`install.md`).

## The `!testme` flow

```
PR comment "!testme"
  │ (poll ≤30s, read-only; or optional admin webhook → /hook, HMAC-verified)
  ▼ comment-bridge: exact-match "!testme"? · commenter ∈ recipe-maintainers org? · resolve PR head
  ▼ Drone API: create build (event=custom, params RECIPE/REF/PR/SRC)
  ▼ recipe-ci pipeline (exec runner, on host): cc-ci-run runner/run_recipe_ci.py
  │ fetch recipe@PR-head (mirror clone + upstream version tags) → install → upgrade → backup
  │ → recipe-local (D4) → ALWAYS teardown (undeploy+volumes+secrets, verified)
  ▼ bridge watcher polls the build → edits the PR comment to ✅ passed / ❌ <status>
  ▼ dashboard reflects latest-per-recipe status + badges
```
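
The gating step at the top of the flow can be sketched roughly as follows. This is an illustration only: the function names, the whitespace trimming before the exact-match test, and the use of SHA-256 for the webhook HMAC are assumptions, and the real logic in `bridge/bridge.py` may differ.

```python
import hashlib
import hmac


def should_trigger(comment_body: str, commenter: str, org_members: set[str]) -> bool:
    """Return True only for an exact `!testme` comment written by an org member."""
    # Assumption: surrounding whitespace is ignored; anything else must match exactly.
    if comment_body.strip() != "!testme":
        return False
    return commenter in org_members


def webhook_signature_ok(secret: bytes, payload: bytes, signature_hex: str) -> bool:
    """Check an HMAC signature over the raw webhook body (SHA-256 assumed)."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected, signature_hex)
```

Keeping both checks as pure functions means the polling path and the webhook path could share the same decision, with only the transport differing.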

## Network & TLS (see install.md §domain)

`*.ci.commoninternet.net` (and bare `ci.commoninternet.net`) resolve to an operator **gateway** that
**TLS-passthroughs** by SNI to cc-ci. cc-ci's traefik terminates TLS with the **pre-issued wildcard
cert** at `/var/lib/ci-certs/live/` (no ACME, no DNS token on the box). Each run gets a unique short
subdomain `<recipe[:4]>-<6hex>.ci.commoninternet.net` (covered by the wildcard) so concurrent/serial
runs never collide; it's torn down at run end.
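
A minimal sketch of how a per-run subdomain of the form `<recipe[:4]>-<6hex>` could be generated is shown below. The helper name `run_domain` and the use of Python's `secrets` module are assumptions for illustration; the harness may derive the suffix differently.

```python
import secrets

BASE_DOMAIN = "ci.commoninternet.net"


def run_domain(recipe: str) -> str:
    """Build a short, unique per-run domain covered by the wildcard cert."""
    # token_hex(3) yields 3 random bytes rendered as 6 hex characters.
    return f"{recipe[:4]}-{secrets.token_hex(3)}.{BASE_DOMAIN}"
```

Because the label sits one level below `ci.commoninternet.net`, it stays inside what a `*.ci.commoninternet.net` wildcard covers.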

## Resource safety (§4.2/§4.3)

- **MAX_TESTS=1** (runner capacity) → at most one test app live; Drone queues the rest.
- **Per-build timeout 60m** (Drone repo timeout) → a hung build is killed, freeing the slot.
- **Guaranteed teardown** (`try/finally`) + a **run-start janitor** that reaps orphaned `*-`-scheme
  apps (backstop for a SIGKILL'd build). `CCCI_JANITOR_MAX_AGE=0` in the recipe-ci pipeline (safe at
  capacity=1).
- Heavy recipes pull many images; keep registry creds configured + adequate disk (see `runbook.md`).
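
The teardown guarantee and the janitor backstop described above can be illustrated with the sketch below. The names `run_stages` and `orphans_to_reap`, and the reading of `CCCI_JANITOR_MAX_AGE` as an age threshold in seconds, are assumptions rather than the harness's actual API.

```python
from typing import Callable, Iterable


def run_stages(stages: Iterable[Callable[[], None]], teardown: Callable[[], None]) -> None:
    """Run stages in order; teardown runs whether they pass or raise."""
    try:
        for stage in stages:
            stage()
    finally:
        # Runs on success and on any Python exception, but not after SIGKILL,
        # which is why a janitor is still needed at the start of the next run.
        teardown()


def orphans_to_reap(started_at: dict[str, float], now: float, max_age: float) -> list[str]:
    """Return app names at least max_age seconds old (0 reaps every leftover app)."""
    return sorted(name for name, t in started_at.items() if now - t >= max_age)
```

With runner capacity 1, no other run can be live when the janitor starts, so a threshold of 0 cannot reap an app that is still in use.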

## Enrolling a recipe (D5, see enroll-recipe.md)

Add `tests/<recipe>/` (recipe_meta.py + test_install/upgrade/backup.py) + the repo to the bridge
`POLL_REPOS`. Per-recipe quirks go in `recipe_meta.py` (HEALTH_PATH/timeouts, `EXTRA_ENV` for e.g.
cryptpad's SANDBOX_DOMAIN or lasuite's TIMEOUT), with **no shared-harness edits**.
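
For orientation, a `recipe_meta.py` might look roughly like the sketch below. Only the setting names (`HEALTH_PATH`, `DEPLOY_TIMEOUT`, `HTTP_TIMEOUT`, `EXTRA_ENV`) come from these docs; the file path, the numeric values, the units, and the shape of `EXTRA_ENV` as a plain dict are assumptions.

```python
# tests/keycloak/recipe_meta.py (hypothetical example; values are illustrative)

# Path the harness polls to decide the app is up.
HEALTH_PATH = "/realms/master"

# How long to wait for deploy convergence and for HTTP health (seconds assumed).
DEPLOY_TIMEOUT = 900
HTTP_TIMEOUT = 600

# Extra environment passed to the app's deploy; a dict of strings is assumed here.
EXTRA_ENV = {"TIMEOUT": "900"}
```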
70
docs/runbook.md
Normal file
@@ -0,0 +1,70 @@
# Runbook: debugging a failed run

## Where to look

- **Per-run logs:** the PR comment links to the Drone build (`drone.ci.commoninternet.net/...`).
  Each stage (install / upgrade / backup / recipe-local) is a separate pytest invocation with its
  own reported result. Logs are live/tail-able while running.
- **Overview:** `ci.commoninternet.net` shows the latest run per recipe + pass/fail/running badges.
- **Bridge:** `docker service logs ccci-bridge_app` on the host shows poll/trigger decisions,
  auth rejections, and outcome reflection.
- **Host:** `docker service ls` / `docker service ps <stack>_<svc> --no-trunc` for a deploy that
  isn't converging; `journalctl -u deploy-<x>` for the reconcile oneshots.

Fetch a build's step log via the API:

```sh
# Read the Drone token from the host, then fetch the log through the SOCKS proxy.
# Replace <N> with the build number; the trailing 1/2 selects stage 1, step 2.
DT=$(ssh cc-ci 'cat /run/secrets/bridge_drone_token')
curl -s -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 \
  "https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds/<N>/logs/1/2"
```

## Common failure modes

- **`FATA deploy timed out` / services stuck "Preparing":** images are cold-pulling slower than abra's
  convergence `TIMEOUT` (default 300s). Bump `TIMEOUT` via the recipe's `recipe_meta.py` `EXTRA_ENV`
  (lasuite-docs uses 900). Verify the stack converges manually: `docker stack services <stack>`.
- **`toomanyrequests: unauthenticated pull rate limit`** (task Rejected "No such image"): Docker Hub
  anonymous rate limit (the A1 registry-creds finding). Provide Docker Hub creds (sops `secrets/`,
  wire into the docker daemon). Do **not** `docker image prune -af` mid-breadth: it evicts cached
  images and forces re-pulls that hit the limit. Check disk first: `df -h /` (heavy recipes need
  headroom; prune only `dangling` between runs or rely on the daily autoprune).
- **`authentication required: Unauthorized` fetching recipe tags:** an abra command tried to fetch
  from the private mirror origin. All recipe-touching harness calls pass `-C -o` (chaos+offline);
  `recipe_versions`/upgrade use the upstream tags fetched read-only at clone time. If you see this,
  a new abra call is missing `-o`.
- **upgrade stage SKIPPED ("no previous published version"):** the recipe clone has no version tags.
  `fetch_recipe` read-only-fetches them from the public upstream (`git.coopcloud.tech/coop-cloud/<r>`);
  confirm the upstream has ≥2 tags (`git ls-remote --tags`).
- **health wait hangs / 502:** the app isn't answering `HEALTH_PATH` yet. Slow apps (keycloak JVM +
  Liquibase, lasuite 9-service) just need time; raise `DEPLOY_TIMEOUT`/`HTTP_TIMEOUT` in
  `recipe_meta.py`. A persistent 502 with services at 1/1 means a wrong `HEALTH_PATH` (e.g. keycloak
  needs `/realms/master`, not `/`).
- **data-survival assertion fails:** the marker wasn't in a backed-up volume, or the DB hook didn't run.
  Check the recipe's `backupbot.backup*` labels; DB recipes use a `pg_backup.sh` pre/post-hook.
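
To make the health-wait failure mode concrete, here is a minimal polling sketch. It assumes the harness treats HTTP 200 on `HEALTH_PATH` as healthy and gives up after a timeout; the function name, its parameters, and the injected `probe` callable are hypothetical.

```python
import time
from typing import Callable


def wait_healthy(
    probe: Callable[[], int],
    timeout: float,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it returns 200 or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if probe() == 200:  # assumed success criterion
            return True
        if clock() >= deadline:
            return False  # e.g. a persistent 502 from a wrong HEALTH_PATH
        sleep(interval)
```

A slow app eventually flips from 502 to 200 and passes once the timeout is raised, whereas a wrong `HEALTH_PATH` never does, no matter how long the wait.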

## Orphans / cleanup

Teardown is guaranteed (`try/finally`) and verified (`_residual` raises if anything is left). A
SIGKILL'd/timed-out build can't run its own teardown, so the **run-start janitor** reaps orphaned run
apps before the next deploy. To reap manually right now, or after cancelling a stuck build:

```sh
ssh cc-ci 'export HOME=/root; D=<recipe[:4]>-<6hex>.ci.commoninternet.net
abra app undeploy "$D" -n; docker stack rm "$(echo $D | tr . _)"; sleep 6
abra app volume remove "$D" -f -n; abra app secret remove "$D" --all -n; abra app config remove "$D"'
```

Confirm clean: `docker service ls | grep <prefix>` returns nothing.
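
The domain-to-stack mapping used by `tr . _` above, and the "nothing left" check, can be sketched in Python as follows. The helper names are hypothetical and this is not the harness's `_residual` implementation; it only mirrors the naming convention visible in the commands above.

```python
def stack_name(domain: str) -> str:
    """Mirror `echo $D | tr . _`: dots in the app domain become underscores."""
    return domain.replace(".", "_")


def residual_services(service_names: list[str], domain: str) -> list[str]:
    """Return services that still belong to the run's stack (`<stack>_<svc>`)."""
    prefix = stack_name(domain) + "_"
    return [name for name in service_names if name.startswith(prefix)]
```

An empty result corresponds to `docker service ls | grep <prefix>` printing nothing.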

## Re-running / triggering by hand

- Re-comment `!testme` on the PR (distinct comment id → re-runs; deduped per comment).
- Or trigger the recipe-ci pipeline directly (same params the bridge sends):
  ```sh
  curl -s -H "Authorization: Bearer $DT" -X POST --proxy socks5h://localhost:1055 \
    "https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds?branch=main&RECIPE=<r>&PR=0"
  ```
- Or run a stage on the host: `cd /root/cc-ci && HOME=/root RECIPE=<r> PR=0 STAGES=install,upgrade,backup cc-ci-run runner/run_recipe_ci.py`.
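
When scripting the direct trigger, the query string from the curl example above can be assembled programmatically. The sketch below only builds the URL; the helper name `trigger_url` is hypothetical, and sending the POST with the bearer token and proxy is left to the caller.

```python
from urllib.parse import urlencode

BUILDS_URL = "https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds"


def trigger_url(recipe: str, pr: int = 0, branch: str = "main") -> str:
    """Build the build-creation URL with the pipeline parameters as query args."""
    # urlencode escapes values, so unusual recipe names cannot break the query string.
    query = urlencode({"branch": branch, "RECIPE": recipe, "PR": pr})
    return f"{BUILDS_URL}?{query}"
```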

## Cancelling a stuck build

`curl -s -X DELETE -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 .../builds/<N>`,
then tear down manually (see above), since a cancelled build skips its finalizer.