diff --git a/cc-ci-plan/plan.md b/cc-ci-plan/plan.md
index b7b0a7b..67b7d4b 100644
--- a/cc-ci-plan/plan.md
+++ b/cc-ci-plan/plan.md
@@ -45,9 +45,10 @@ Do these in order. Each step is idempotent; re-running is safe.
 resolves, e.g. `getent hosts probe-$RANDOM.ci.commoninternet.net` returns the **gateway's** IP
 (not cc-ci's — the gateway TLS-passthroughs to cc-ci, so do not expect cc-ci's address; and use
 `getent`, not `dig`, since this host's resolver is Tailscale-only — see §1.5).
- Traefik is *not* up yet — you configure it (file provider → the pre-issued cert at
- `/var/lib/ci-certs/live/`, **no ACME**); the DNS record + gateway passthrough + cert are the
- preconditions, and full end-to-end HTTPS reachability is proven at M1, not now.
+ Traefik is *not* up yet — you deploy it at M1 (the real `coop-cloud/traefik` recipe via abra,
+ wildcard/file-provider mode → the pre-issued cert at `/var/lib/ci-certs/live/`, **no ACME**);
+ the DNS record + gateway passthrough + cert are the preconditions, and full end-to-end HTTPS
+ reachability is proven at M1, not now.
 If the wildcard does not resolve at all, that's a `## Blocked` item (operator fixes DNS/gateway).
 - If any check fails, write the failure to `STATUS.md` under `## Blocked` and stop — a human must
 fix access. Do **not** try to work around missing access.
@@ -125,11 +126,11 @@ without the auth key.

 - **Wildcard TLS cert — PROVIDED, not a token.** The operator has pre-issued the wildcard SAN cert
 (`*.ci.commoninternet.net` + `ci.commoninternet.net`) and placed it on cc-ci at
- `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` (§4.0). The agent points Traefik's file
- provider at those paths and runs **no ACME** for this domain. **Do not request or expect a
- `commoninternet.net` DNS token** — issuance/renewal is handled out-of-band by the operator (LE
- 90-day cert; next renewal ~2026-08-24). A missing/expired cert is a finding for the operator, not
- an agent re-issue.
+ `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` (§4.0). The agent feeds these into the
+ `coop-cloud/traefik` recipe as its `ssl_cert`/`ssl_key` swarm secrets (wildcard/file-provider
+ mode) and runs **no ACME** for this domain. **Do not request or expect a `commoninternet.net` DNS
+ token** — issuance/renewal is handled out-of-band by the operator (LE 90-day cert; next renewal
+ ~2026-08-24). A missing/expired cert is a finding for the operator, not an agent re-issue.
 - **Registry pull credentials** (e.g. Docker Hub) — *recommended* to avoid anonymous pull-rate
 limits breaking deploys under load. Treat a rate-limit failure traced to this as a finding, then
 request creds. Store sops-encrypted in `secrets/`.
@@ -213,7 +214,8 @@ cc-ci/
 ├── modules/
 │ ├── drone.nix # Drone server + runner (exec/docker)
 │ ├── comment-bridge.nix # !testme webhook listener service
-│ ├── swarm.nix # Docker + single-node swarm + Traefik for test apps
+│ ├── swarm.nix # Docker + single-node swarm + `proxy` net; deploys the
+│ │ # coop-cloud/traefik recipe via abra (wildcard/file-provider, §4.2)
 │ ├── dashboard.nix # results overview site
 │ └── secrets.nix # sops-nix / agenix wiring
 ├── secrets/ # sops-encrypted (*.enc / *.age); see §4.4
@@ -287,13 +289,16 @@ Two DNS zones, deliberately separated — do **not** conflate them:
 SAN cert covering **`*.ci.commoninternet.net` + `ci.commoninternet.net`** (issued via Let's
 Encrypt DNS-01 against Gandi, by the operator, using a token the agent never sees) lives on cc-ci:
 - `/var/lib/ci-certs/live/fullchain.pem` (leaf+intermediate) and `…/privkey.pem`.
- - The agent configures **Traefik's file provider** (`tls.certificates`, `certFile`/`keyFile`
- pointing at those paths) to serve it, and runs **no ACME resolver** for this domain. One cert
- covers every per-run subdomain, so spinning up an app domain needs no cert work at all.
- - **Renewal is a manual operator task** (LE 90-day cert): the operator re-issues out-of-band and
- drops the new files at the same paths (Traefik file provider hot-reloads). The agent must **not**
- attempt ACME/DNS-01 for `commoninternet.net` and must **not** expect a DNS token — a missing/
- expired cert is an operator action, surfaced as a finding, not something the agent re-issues.
+ - **Traefik is the real `coop-cloud/traefik` recipe, deployed via abra** (for e2e fidelity — see
+ §4.2), run in its **wildcard / file-provider mode** (`WILDCARDS_ENABLED=1` + `compose.wildcard.yml`).
+ The pre-issued cert is supplied as the recipe's `ssl_cert`/`ssl_key` **swarm secrets** (sourced
+ from the files above); the recipe's file provider then serves it under `tls.certificates`. **No
+ ACME resolver / no DNS provider** is enabled — only the cert+key reach cc-ci, never the DNS token.
+ One cert covers every per-run subdomain (matched by SNI), so a new app domain needs no cert work.
+ - **Renewal is a manual operator task** (LE 90-day cert): the operator re-issues out-of-band, then
+ updates the `ssl_cert`/`ssl_key` secret (bump its version) and redeploys traefik. The agent must
+ **not** attempt ACME/DNS-01 for `commoninternet.net` and must **not** expect a DNS token — a
+ missing/expired cert is an operator action surfaced as a finding, not something the agent re-issues.
 (Rationale for choosing a wildcard cert over per-subdomain: a wildcard is reused for every churning
 run subdomain and sidesteps LE's 50-certs/week-per-domain limit; only DNS-01 can mint a wildcard.
 We keep that DNS-01 issuance with the operator rather than handing the agent the zone token.)
@@ -345,11 +350,16 @@ Bridge posts/updates a Gitea PR comment with the run URL and (on completion) pas
 - Drone server connects to Gitea via OAuth app (Gitea → Settings → Applications).
 Runner is the **exec runner** (or a privileged docker runner) running **on cc-ci itself**, because
 tests must drive `abra` to deploy real recipes onto a real swarm.
-- cc-ci doubles as the **deploy target**: single-node Docker Swarm + Traefik, abra installed,
- serving the `*.ci.commoninternet.net` wildcard, TLS terminated on cc-ci's Traefik using the
- **pre-issued static wildcard cert** at `/var/lib/ci-certs/live/` (§4.0). The operator preconfigures
- the wildcard DNS record (→ gateway), the gateway's TLS-passthrough to cc-ci, and the cert itself
- (§4.4); the agent configures Traefik (file provider → that cert) and swarm on top — **no ACME**.
+- cc-ci doubles as the **deploy target**: single-node Docker Swarm + abra, with the reverse proxy
+ provided by the **real `coop-cloud/traefik` recipe deployed via abra** (not a hand-rolled Traefik
+ — chosen for **end-to-end fidelity**: test apps route through the exact proxy a real Co-op Cloud
+ host uses — `web`/`web-secure` entrypoints, the `proxy` overlay, the swarm provider). TLS
+ terminates on it using the **pre-issued static wildcard cert** (§4.0): run the recipe in
+ **wildcard/file-provider mode** (`WILDCARDS_ENABLED=1` + `compose.wildcard.yml`) and supply the
+ cert as the recipe's `ssl_cert`/`ssl_key` swarm secrets from `/var/lib/ci-certs/live/`. The
+ operator preconfigures the wildcard DNS (→ gateway), the gateway's TLS-passthrough, and the cert
+ itself (§4.4); the agent deploys the traefik recipe + swarm on top — **no ACME, no DNS token on
+ cc-ci**. Make the `abra app new/deploy traefik` steps reproducible (scripted/Nix-invoked) for D8.
 - Each CI run gets an isolated app domain `-pr-.ci.commoninternet.net` (§4.0) so concurrent
 runs don't collide. Teardown removes app, secrets, and volumes.
 - Consider a concurrency cap (1–2 deploys at a time) to avoid resource thrash; document it.
@@ -462,10 +472,12 @@ verify the acceptance check before the next milestone starts). Seed `BACKLOG.md`
 - **M0 — Foundations.** Repo created; flake builds; `nixos-rebuild` (or deploy-rs) applies a
 no-op-then-base config to cc-ci; sops decrypts a test secret on the host.
 *Accept:* `ssh cc-ci 'systemctl is-system-running'` healthy after a rebuild from the repo.
-- **M1 — Swarm + abra target.** Docker + single-node swarm + Traefik up; wildcard DNS + TLS;
- abra can deploy and tear down a trivial recipe by hand.
- *Accept:* a recipe deployed via abra is reachable over HTTPS at `*.ci.commoninternet.net`, then
- fully torn down leaving no volumes.
+- **M1 — Swarm + abra target.** Docker + single-node swarm + `proxy` network; the **`coop-cloud/traefik`
+ recipe deployed via abra** (wildcard/file-provider mode, serving the pre-issued cert — §4.0/§4.2,
+ not a custom Traefik); abra can deploy and tear down a trivial recipe by hand.
+ *Accept:* a recipe deployed via abra is reachable over HTTPS (valid wildcard cert) on the
+ `web-secure` entrypoint at `*.ci.commoninternet.net`, then fully torn down leaving no volumes; the
+ proxy is verifiably the traefik recipe and **no DNS/ACME token is present on cc-ci**.
 - **M2 — Drone online.** Drone server+runner via Nix, OAuth to Gitea; a hello-world `.drone.yml`
 in cc-ci runs green; logs visible in Drone UI.
 *Accept:* push to cc-ci triggers a visible green Drone build.
@@ -616,12 +628,13 @@ iterations spinning on a build that takes minutes.
 - Webhook scope: per-repo vs org-level Gitea webhook. (Default: per-repo via enroll script.)
 - Drone runner type: exec vs privileged docker. (Default: exec, since it must drive host abra.)
 - Secret tool: sops-nix vs agenix. (Default: sops-nix for multi-recipient + yaml ergonomics.)
-- Wildcard TLS: **SETTLED — operator pre-issues a wildcard cert; the agent serves it statically, no
- token** (§4.0). The operator issued a wildcard SAN cert (`*.ci.commoninternet.net` +
- `ci.commoninternet.net`) via LE DNS-01/Gandi out-of-band and placed it at
- `/var/lib/ci-certs/live/`; the agent configures Traefik's file provider to serve it and runs no
- ACME for this domain. Chosen so the DNS-editing token never enters the repo/agent. **Manual
- renewal** every ~90 days (next ~2026-08-24) — operator re-issues and replaces the files in place.
+- Reverse proxy / Wildcard TLS: **SETTLED — deploy the real `coop-cloud/traefik` recipe via abra
+ (for e2e fidelity), in wildcard/file-provider mode, serving the operator's pre-issued wildcard
+ cert; no ACME, no token** (§4.0/§4.2). Supersedes the original plan's hand-rolled
+ `modules/traefik.nix`. The operator issued the wildcard SAN cert (`*.ci.commoninternet.net` +
+ `ci.commoninternet.net`) via LE DNS-01/Gandi out-of-band into `/var/lib/ci-certs/live/`; the agent
+ feeds it as the recipe's `ssl_cert`/`ssl_key` swarm secrets so the DNS-editing token never reaches
+ cc-ci. **Manual renewal** ~90 days (next ~2026-08-24): re-issue → update the secret → redeploy.
 - Proof recipe set (D10 — six, category-spanning). Default candidates, all previously verified
 deployable: `hedgedoc`, `cryptpad`, `keycloak`, `authentik`, `lasuite-docs`/`lasuite-drive`,
 `matrix-synapse`, `immich`, `bluesky-pds`. Lock the final six early so M4–M6.5 build toward them.