diff --git a/docs/architecture.md b/docs/architecture.md
new file mode 100644
index 0000000..943a65c
--- /dev/null
+++ b/docs/architecture.md
@@ -0,0 +1,58 @@
+# Architecture
+
+cc-ci turns a `!testme` PR comment into a real end-to-end deploy + test of a Co-op Cloud recipe and
+reports the result back. Everything on the `cc-ci` host is declared in this repo's NixOS flake.
+
+## Components
+
+| Component | Where | Role |
+|---|---|---|
+| **comment-bridge** | `bridge/bridge.py`, `modules/bridge.nix` (swarm svc, `ci.commoninternet.net/hook`) | Polls enrolled repos for `!testme` (primary, read-only) + optional admin webhook; authorizes the commenter (org membership); triggers a parameterized Drone build; posts/edits the PR comment with the run link + final pass/fail. |
+| **Drone server** | `modules/drone.nix`: coop-cloud `drone` recipe via abra (`drone.ci.commoninternet.net`, Gitea SSO) | CI engine. Holds the `recipe-ci` (custom-event) and `self-test` (push) pipelines (`.drone.yml`). |
+| **Drone exec runner** | `modules/drone-runner.nix`: host systemd service | Runs pipeline steps **on the host** so they can drive `abra`/Docker. `DRONE_RUNNER_CAPACITY=1` (MAX_TESTS) caps concurrent builds; the rest queue natively. |
+| **harness** | `runner/run_recipe_ci.py` + `runner/harness/` + `tests/` | Orchestrates per run: fetch recipe at the PR head → install → upgrade → backup/restore → recipe-local (D4) → guaranteed teardown. pytest + Playwright via the Nix `cc-ci-run` env. |
+| **swarm + traefik** | `modules/swarm.nix`, `modules/proxy.nix`: coop-cloud `traefik` recipe via abra | Single-node Docker Swarm + `proxy` overlay; traefik terminates TLS with the pre-issued wildcard cert (file provider, **no ACME**). The real deploy target for recipes-under-test. |
+| **backup-bot-two** | `modules/backupbot.nix` | restic-based volume/DB backups; `abra app backup/restore` drive it. |
+| **dashboard** | `dashboard/dashboard.py`, `modules/dashboard.nix` (`ci.commoninternet.net`) | YunoHost-CI-like overview: latest run per recipe + status badges + run links; `/badge/.svg`. |
+| **secrets** | `modules/secrets.nix` + `secrets/secrets.yaml` (sops-nix) | Infra secrets, decrypted at activation via the host SSH key as the age identity. See `secrets.md`. |
+
+All swarm infra (traefik, drone, bridge, dashboard, backupbot) is brought up by **idempotent-reconcile
+systemd oneshots** that converge on every activation/boot (no run-once sentinels), so a from-scratch
+install is `git clone` + `nixos-rebuild switch` + the operator preconditions (`install.md`).
+
+## The `!testme` flow
+
+```
+PR comment "!testme"
+   │  (poll ≤30s, read-only; or optional admin webhook → /hook, HMAC-verified)
+   ▼ comment-bridge: exact-match "!testme"? · commenter ∈ recipe-maintainers org? · resolve PR head
+   ▼ Drone API: create build (event=custom, params RECIPE/REF/PR/SRC)
+   ▼ recipe-ci pipeline (exec runner, on host): cc-ci-run runner/run_recipe_ci.py
+   │     fetch recipe@PR-head (mirror clone + upstream version tags) → install → upgrade → backup
+   │     → recipe-local (D4) → ALWAYS teardown (undeploy+volumes+secrets, verified)
+   ▼ bridge watcher polls the build → edits the PR comment to ✅ passed / ❌
+   ▼ dashboard reflects latest-per-recipe status + badges
+```
+
+## Network & TLS (see install.md §domain)
+
+`*.ci.commoninternet.net` (and bare `ci.commoninternet.net`) resolve to an operator **gateway** that
+**TLS-passthroughs** by SNI to cc-ci. cc-ci's traefik terminates TLS with the **pre-issued wildcard
+cert** at `/var/lib/ci-certs/live/` (no ACME, no DNS token on the box). Each run gets a unique short
+subdomain `-<6hex>.ci.commoninternet.net` (covered by the wildcard) so concurrent/serial
+runs never collide; it's torn down at run end.
+
+## Resource safety (§4.2/§4.3)
+
+- **MAX_TESTS=1** (runner capacity) → at most one test app live; Drone queues the rest.
+- **Per-build timeout 60m** (Drone repo timeout) → a hung build is killed, freeing the slot.
+- **Guaranteed teardown** (`try/finally`) + a **run-start janitor** that reaps orphaned `*-`-scheme
+  apps (backstop for a SIGKILL'd build). `CCCI_JANITOR_MAX_AGE=0` in the recipe-ci pipeline (safe at
+  capacity=1).
+- Heavy recipes pull many images; keep registry creds configured + adequate disk (see `runbook.md`).
+
+## Enrolling a recipe (D5, see enroll-recipe.md)
+
+Add `tests//` (recipe_meta.py + test_install/upgrade/backup.py) + the repo to the bridge
+`POLL_REPOS`. Per-recipe quirks go in `recipe_meta.py` (HEALTH_PATH/timeouts, `EXTRA_ENV` for e.g.
+cryptpad's SANDBOX_DOMAIN or lasuite's TIMEOUT), with **no shared-harness edits**.
diff --git a/docs/runbook.md b/docs/runbook.md
new file mode 100644
index 0000000..1bf466c
--- /dev/null
+++ b/docs/runbook.md
@@ -0,0 +1,70 @@
+# Runbook: debugging a failed run
+
+## Where to look
+
+- **Per-run logs:** the PR comment links to the Drone build (`drone.ci.commoninternet.net/...`).
+  Each stage (install / upgrade / backup / recipe-local) is a separate pytest invocation with its
+  own reported result. Logs are live/tail-able while running.
+- **Overview:** `ci.commoninternet.net` shows the latest run per recipe + pass/fail/running badges.
+- **Bridge:** `docker service logs ccci-bridge_app` on the host shows poll/trigger decisions,
+  auth rejections, and outcome reflection.
+- **Host:** `docker service ls` / `docker service ps _ --no-trunc` for a deploy that
+  isn't converging; `journalctl -u deploy-` for the reconcile oneshots.
+
+Fetch a build's step log via the API:
+```sh
+DT=$(ssh cc-ci 'cat /run/secrets/bridge_drone_token')
+curl -s -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 \
+  https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds//logs/1/2
+```
+
+## Common failure modes
+
+- **`FATA deploy timed out` / services stuck "Preparing":** images cold-pulling slower than abra's
+  convergence `TIMEOUT` (default 300s). Bump `TIMEOUT` via the recipe's `recipe_meta.py` `EXTRA_ENV`
+  (lasuite-docs uses 900). Verify the stack converges manually: `docker stack services `.
+- **`toomanyrequests: unauthenticated pull rate limit`** (task Rejected "No such image"): Docker Hub
+  anonymous rate limit (the A1 registry-creds finding). Provide Docker Hub creds (sops `secrets/`,
+  wire into the docker daemon). Do **not** run `docker image prune -af` mid-breadth: it evicts cached
+  images and forces re-pulls that hit the limit. Check disk first: `df -h /` (heavy recipes need
+  headroom; prune only `dangling` between runs or rely on the daily autoprune).
+- **`authentication required: Unauthorized` fetching recipe tags:** an abra command tried to fetch
+  from the private mirror origin. All recipe-touching harness calls pass `-C -o` (chaos+offline);
+  `recipe_versions`/upgrade use the upstream tags fetched read-only at clone time. If you see this,
+  a new abra call is missing `-o`.
+- **upgrade stage SKIPPED ("no previous published version"):** the recipe clone has no version tags.
+  `fetch_recipe` read-only-fetches them from the public upstream (`git.coopcloud.tech/coop-cloud/`);
+  confirm the upstream has ≥2 tags (`git ls-remote --tags`).
+- **health wait hangs / 502:** the app isn't answering `HEALTH_PATH` yet. Slow apps (keycloak JVM +
+  Liquibase, lasuite 9-service) just need time; raise `DEPLOY_TIMEOUT`/`HTTP_TIMEOUT` in
+  `recipe_meta.py`. A persistent 502 with services at 1/1 means a wrong `HEALTH_PATH` (e.g. keycloak
+  needs `/realms/master`, not `/`).
+- **data-survival assertion fails:** the marker wasn't in a backed-up volume / the DB hook didn't run.
+  Check the recipe's `backupbot.backup*` labels; DB recipes use a `pg_backup.sh` pre/post-hook.
+
+## Orphans / cleanup
+
+Teardown is guaranteed (`try/finally`) and verified (`_residual` raises if anything is left). A
+SIGKILL'd/timed-out build can't run its own teardown, so the **run-start janitor** reaps orphaned run
+apps before the next deploy. To reap manually, either right away or after cancelling a stuck build:
+```sh
+ssh cc-ci 'export HOME=/root; D=-<6hex>.ci.commoninternet.net
+abra app undeploy "$D" -n; docker stack rm "$(echo "$D" | tr . _)"; sleep 6
+abra app volume remove "$D" -f -n; abra app secret remove "$D" --all -n; abra app config remove "$D"'
+```
+Confirm clean: `docker service ls | grep ` returns nothing.
+
+## Re-running / triggering by hand
+
+- Re-comment `!testme` on the PR (distinct comment id → re-runs; deduped per comment).
+- Or trigger the recipe-ci pipeline directly (same params the bridge sends):
+  ```sh
+  curl -s -H "Authorization: Bearer $DT" -X POST --proxy socks5h://localhost:1055 \
+    "https://drone.ci.commoninternet.net/api/repos/recipe-maintainers/cc-ci/builds?branch=main&RECIPE=&PR=0"
+  ```
+- Or run a stage on the host: `cd /root/cc-ci && HOME=/root RECIPE= PR=0 STAGES=install,upgrade,backup cc-ci-run runner/run_recipe_ci.py`.
+
+## Cancelling a stuck build
+
+`curl -s -X DELETE -H "Authorization: Bearer $DT" --proxy socks5h://localhost:1055 .../builds/`,
+then tear down manually (above) since a cancelled build skips its finalizer.
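The manual teardown sequence from the runbook can also be expressed as data, which makes it easier to review or log before running anything. The following is only a minimal sketch and not part of the harness: the helper names (`stack_name`, `teardown_commands`, `as_shell`) and the example domain are hypothetical, while the command order and flags are copied from the "Orphans / cleanup" snippet.

```python
import shlex
from typing import List


def stack_name(domain: str) -> str:
    """Docker stack name for a run domain: dots become underscores (the `tr . _` step)."""
    return domain.replace(".", "_")


def teardown_commands(domain: str) -> List[List[str]]:
    """Return the runbook's manual teardown commands for one run app, in order."""
    return [
        ["abra", "app", "undeploy", domain, "-n"],
        ["docker", "stack", "rm", stack_name(domain)],
        ["sleep", "6"],  # same pause the runbook uses before removing volumes
        ["abra", "app", "volume", "remove", domain, "-f", "-n"],
        ["abra", "app", "secret", "remove", domain, "--all", "-n"],
        ["abra", "app", "config", "remove", domain],
    ]


def as_shell(commands: List[List[str]]) -> str:
    """Join the commands into one `;`-separated, shell-quoted string."""
    return "; ".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    # Hypothetical run domain, used for illustration only.
    print(as_shell(teardown_commands("example-abc123.ci.commoninternet.net")))
```

The sketch only builds and prints the commands; it does not execute them, so the operator can check the output before pasting it into an `ssh cc-ci` session.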