# Architecture
cc-ci turns a `!testme` PR comment into a real end-to-end deploy + test of a Co-op Cloud recipe and
reports the result back. Everything on the `cc-ci` host is declared in this repo's NixOS flake.
## Components
| Component | Where | Role |
|---|---|---|
| **comment-bridge** | `bridge/bridge.py`, `modules/bridge.nix` (swarm svc, `ci.commoninternet.net/hook`) | Polls enrolled repos for `!testme` (primary, read-only) + optional admin webhook; authorizes the commenter (org membership); triggers a parameterized Drone build; posts/edits the PR comment with the run link + final pass/fail. |
| **Drone server** | `modules/drone.nix` — coop-cloud `drone` recipe via abra (`drone.ci.commoninternet.net`, Gitea SSO) | CI engine. Holds the `recipe-ci` (custom-event) and `self-test` (push) pipelines (`.drone.yml`). |
| **Drone exec runner** | `modules/drone-runner.nix` — host systemd service | Runs pipeline steps **on the host** so they can drive `abra`/Docker. `DRONE_RUNNER_CAPACITY=1` (MAX_TESTS) caps concurrent builds; the rest queue natively. |
| **harness** | `runner/run_recipe_ci.py` + `runner/harness/` + `tests/` | Orchestrates per run: fetch recipe at the PR head → install → upgrade → backup/restore → recipe-local (D4) → guaranteed teardown. pytest + Playwright via the Nix `cc-ci-run` env. |
| **swarm + traefik** | `modules/swarm.nix`, `modules/proxy.nix` — coop-cloud `traefik` recipe via abra | Single-node Docker Swarm + `proxy` overlay; traefik terminates TLS with the wildcard cert (**sops-decrypted from git** to `/var/lib/ci-certs/live`, file provider, **no ACME**). The real deploy target for recipes-under-test. |
| **backup-bot-two** | `modules/backupbot.nix` | restic-based volume/DB backups; `abra app backup/restore` drive it. |
| **dashboard** | `dashboard/dashboard.py`, `modules/dashboard.nix` (`ci.commoninternet.net`) | YunoHost-CI-like overview: latest run per recipe + status badges + run links; `/badge/<recipe>.svg`. |
| **secrets** | `modules/secrets.nix` + `secrets/` = **`cc-ci-secrets` submodule** (sops-nix) | **Phase-1c secrets model:** ALL secrets incl. the **wildcard TLS cert+key are sops-encrypted in git** in the private `cc-ci-secrets` repo, mounted as a **git submodule** at `secrets/` (the base `cc-ci` repo holds **no** secret material). Decrypted at activation by the **bootstrap age key** at `/var/lib/sops-nix/key.txt` (`sops.age.keyFile`) — cc-ci's host-derived age identity, or the **off-box recovery key on a fresh/cloned host** whose SSH key isn't a recipient; the host SSH key is also offered (`sops.age.sshKeyPaths`). The cert is decrypted to `/var/lib/ci-certs/live/` (no out-of-band file drop). This **one** age key is the only secret not in git. See `secrets.md`. |
All swarm infra (traefik, drone, bridge, dashboard, backupbot) is brought up by **idempotent-reconcile
systemd oneshots** that converge on every activation/boot (no run-once sentinels). They are **serialized**
(proxy→drone→bridge→dashboard→backupbot) so that a single switch converges on a blank host. A
from-scratch install is therefore `git clone --recursive` + provisioning the one bootstrap age key +
`nixos-rebuild switch` + the external DNS/gateway (`install.md`). **Phase-1c verified this on a real
throwaway VM (D8): blank host + the two repos + the age key → a fully-converged cc-ci that serves a
real `!testme` run end-to-end over the public domain.**
## The `!testme` flow
```
PR comment "!testme"
│ (poll ≤30s, read-only; or optional admin webhook → /hook, HMAC-verified)
▼ comment-bridge: exact-match "!testme"? · commenter ∈ recipe-maintainers org? · resolve PR head
▼ Drone API: create build (event=custom, params RECIPE/REF/PR/SRC)
▼ recipe-ci pipeline (exec runner, on host): cc-ci-run runner/run_recipe_ci.py
│ fetch recipe@PR-head (mirror clone + upstream version tags) → install → upgrade → backup/restore
│ → recipe-local (D4) → ALWAYS teardown (undeploy+volumes+secrets, verified)
▼ bridge watcher polls the build → edits the PR comment to ✅ passed / ❌ <status>
▼ dashboard reflects latest-per-recipe status + badges
```
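
The bridge's gate checks in the flow above can be sketched roughly as follows. This is a minimal illustration, not the code in `bridge/bridge.py`: the function names are hypothetical, "exact match" is assumed to mean the whole comment body after trimming whitespace, and the webhook signature is assumed to be a hex-encoded HMAC-SHA256 of the raw request body.

```python
import hashlib
import hmac

TRIGGER = "!testme"


def is_trigger(comment_body: str) -> bool:
    """True only if the comment is exactly the trigger (whitespace-trimmed; assumed semantics)."""
    return comment_body.strip() == TRIGGER


def is_authorized(commenter: str, org_members: set[str]) -> bool:
    """True if the commenter belongs to the maintainers org (membership list supplied by the caller)."""
    return commenter in org_members


def verify_signature(secret: bytes, payload: bytes, signature_hex: str) -> bool:
    """Check a webhook body against its signature (assumed: hex HMAC-SHA256 of the raw body)."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking the match position.
    return hmac.compare_digest(expected, signature_hex)
```

In the polling path only the first two checks apply; the signature check matters only when the optional admin webhook is enabled.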
## Network & TLS (see install.md §domain)
`*.ci.commoninternet.net` (and bare `ci.commoninternet.net`) resolve to an operator **gateway** that
**TLS-passthroughs** by SNI to cc-ci. cc-ci's traefik terminates TLS with the **wildcard cert
sops-decrypted from git** (`cc-ci-secrets`) to `/var/lib/ci-certs/live/` (no ACME, no DNS token on the
box; operator re-issues + re-commits to rotate). Each run gets a unique short
subdomain `<recipe[:4]>-<6hex>.ci.commoninternet.net` (covered by the wildcard) so concurrent/serial
runs never collide; it's torn down at run end.
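
A per-run name of the form `<recipe[:4]>-<6hex>` could be generated along these lines. This is a sketch under assumptions: `run_domain` is a hypothetical helper name, and using the `secrets` module for the random suffix is an illustrative choice, not necessarily what the harness does.

```python
import secrets

BASE_DOMAIN = "ci.commoninternet.net"


def run_domain(recipe: str) -> str:
    """Build a per-run hostname: first four chars of the recipe name plus six random hex chars."""
    suffix = secrets.token_hex(3)  # 3 random bytes -> 6 hex characters
    return f"{recipe[:4]}-{suffix}.{BASE_DOMAIN}"
```

Because the result is a single label under `ci.commoninternet.net`, it stays within what the wildcard cert covers.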
## Resource safety (§4.2/§4.3)
- **MAX_TESTS=1** (runner capacity) → at most one test app live; Drone queues the rest.
- **Per-build timeout 60m** (Drone repo timeout) → a hung build is killed, freeing the slot.
- **Guaranteed teardown** (`try/finally`) + a **run-start janitor** that reaps orphaned `*-`-scheme
apps (backstop for a SIGKILL'd build). `CCCI_JANITOR_MAX_AGE=0` in the recipe-ci pipeline (safe at
capacity=1).
- Heavy recipes pull many images; keep registry creds configured + adequate disk (see `runbook.md`).
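
The teardown guarantee and the janitor backstop can be pictured with the sketch below. The function names and the `(name, started_at)` app records are hypothetical, and `max_age` is assumed to be an age threshold, so that `0` means "reap every leftover app", which is only safe when no other run can be live.

```python
from typing import Callable, Iterable


def reap_orphans(
    apps: Iterable[tuple[str, float]],
    max_age: float,
    now: float,
    undeploy: Callable[[str], None],
) -> list[str]:
    """Undeploy every leftover app at least max_age old; return the names reaped."""
    reaped = []
    for name, started_at in apps:
        if now - started_at >= max_age:
            undeploy(name)
            reaped.append(name)
    return reaped


def run_with_teardown(
    stages: Iterable[Callable[[], None]],
    teardown: Callable[[], None],
) -> bool:
    """Run stages in order; teardown always runs, even when a stage raises."""
    try:
        for stage in stages:
            stage()
        return True
    except Exception:
        return False
    finally:
        teardown()
```

`try/finally` covers ordinary failures and exceptions, but not a SIGKILL, which is why the janitor runs at the start of the next build.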
## Enrolling a recipe (D5, see enroll-recipe.md)
Add `tests/<recipe>/` (recipe_meta.py + test_install/upgrade/backup.py) + the repo to the bridge
`POLL_REPOS`. Per-recipe quirks go in `recipe_meta.py` (HEALTH_PATH/timeouts, `EXTRA_ENV` for e.g.
cryptpad's SANDBOX_DOMAIN or lasuite's TIMEOUT) — **no shared-harness edits**.
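
As an illustration, a `recipe_meta.py` might look something like the sketch below. Only `HEALTH_PATH`, `EXTRA_ENV` and `SANDBOX_DOMAIN` come from the text above; `DEPLOY_TIMEOUT`, `merged_env` and every value shown are hypothetical, so check `enroll-recipe.md` for the names the harness actually reads.

```python
# tests/<recipe>/recipe_meta.py (illustrative values only)

HEALTH_PATH = "/"        # path probed for readiness (hypothetical value)
DEPLOY_TIMEOUT = 600     # hypothetical name for a per-recipe timeout, in seconds

# Extra variables for the app's environment (hypothetical value).
EXTRA_ENV = {"SANDBOX_DOMAIN": "sandbox.example.invalid"}


def merged_env(base_env: dict[str, str], extra_env: dict[str, str]) -> dict[str, str]:
    """Hypothetical helper: overlay per-recipe overrides on the harness defaults."""
    return {**base_env, **extra_env}
```

Keeping overrides in per-recipe data like this is what lets a new recipe be enrolled without touching the shared harness.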