plan: use the real coop-cloud/traefik recipe via abra (e2e fidelity), not a custom Traefik
Supersedes the original modules/traefik.nix hand-rolled proxy. cc-ci now deploys the coop-cloud/traefik recipe via abra in wildcard/file-provider mode, serving the operator's pre-issued wildcard cert as the recipe's ssl_cert/ssl_key swarm secrets — canonical web/web-secure + proxy/swarm conventions every recipe expects, no ACME, DNS token never on cc-ci.

Updated §1, §1.5, §3, §4.0, §4.2, §5 (M1), §8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -45,9 +45,10 @@ Do these in order. Each step is idempotent; re-running is safe.
 resolves, e.g. `getent hosts probe-$RANDOM.ci.commoninternet.net` returns the **gateway's** IP
 (not cc-ci's — the gateway TLS-passthroughs to cc-ci, so do not expect cc-ci's address; and use
 `getent`, not `dig`, since this host's resolver is Tailscale-only — see §1.5).
-Traefik is *not* up yet — you configure it (file provider → the pre-issued cert at
-`/var/lib/ci-certs/live/`, **no ACME**); the DNS record + gateway passthrough + cert are the
-preconditions, and full end-to-end HTTPS reachability is proven at M1, not now.
+Traefik is *not* up yet — you deploy it at M1 (the real `coop-cloud/traefik` recipe via abra,
+wildcard/file-provider mode → the pre-issued cert at `/var/lib/ci-certs/live/`, **no ACME**);
+the DNS record + gateway passthrough + cert are the preconditions, and full end-to-end HTTPS
+reachability is proven at M1, not now.
 If the wildcard does not resolve at all, that's a `## Blocked` item (operator fixes DNS/gateway).
 - If any check fails, write the failure to `STATUS.md` under `## Blocked` and stop — a human must fix access. Do **not** try to work around missing access.
 
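The wildcard-resolution precondition in the hunk above can be sketched as a small check. This is only an illustration, not part of the commit: `resolve_ipv4`, `classify_wildcard`, and the gateway address `203.0.113.10` (a documentation-range IP) are hypothetical names and values. `socket.getaddrinfo` goes through the host's system resolver rather than querying a DNS server directly, which is the property the plan wants from `getent` as opposed to `dig`.

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the system resolver gives for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET,
                                   type=socket.SOCK_STREAM)
    except socket.gaierror:
        return set()  # name does not resolve at all
    return {info[4][0] for info in infos}


def classify_wildcard(addresses, gateway_ip):
    """Map a resolution result onto the outcomes the plan describes.

    "blocked"    -> wildcard does not resolve (operator fixes DNS/gateway)
    "ok"         -> resolves to the gateway, as expected with TLS passthrough
    "unexpected" -> resolves, but not to the gateway (assumed to be worth a finding)
    """
    if not addresses:
        return "blocked"
    if gateway_ip in addresses:
        return "ok"
    return "unexpected"


if __name__ == "__main__":
    # Hypothetical probe name and gateway IP, for illustration only.
    found = resolve_ipv4("probe-12345.ci.commoninternet.net")
    print(classify_wildcard(found, "203.0.113.10"))
```

A "blocked" result corresponds to writing the failure under `## Blocked` in `STATUS.md` and stopping, rather than working around it.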
@@ -125,11 +126,11 @@ without the auth key.
 
 - **Wildcard TLS cert — PROVIDED, not a token.** The operator has pre-issued the wildcard SAN cert
 (`*.ci.commoninternet.net` + `ci.commoninternet.net`) and placed it on cc-ci at
-`/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` (§4.0). The agent points Traefik's file
-provider at those paths and runs **no ACME** for this domain. **Do not request or expect a
-`commoninternet.net` DNS token** — issuance/renewal is handled out-of-band by the operator (LE
-90-day cert; next renewal ~2026-08-24). A missing/expired cert is a finding for the operator, not
-an agent re-issue.
+`/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` (§4.0). The agent feeds these into the
+`coop-cloud/traefik` recipe as its `ssl_cert`/`ssl_key` swarm secrets (wildcard/file-provider
+mode) and runs **no ACME** for this domain. **Do not request or expect a `commoninternet.net` DNS
+token** — issuance/renewal is handled out-of-band by the operator (LE 90-day cert; next renewal
+~2026-08-24). A missing/expired cert is a finding for the operator, not an agent re-issue.
 - **Registry pull credentials** (e.g. Docker Hub) — *recommended* to avoid anonymous pull-rate
 limits breaking deploys under load. Treat a rate-limit failure traced to this as a finding, then
 request creds. Store sops-encrypted in `secrets/`.
@@ -213,7 +214,8 @@ cc-ci/
 ├── modules/
 │ ├── drone.nix # Drone server + runner (exec/docker)
 │ ├── comment-bridge.nix # !testme webhook listener service
-│ ├── swarm.nix # Docker + single-node swarm + Traefik for test apps
+│ ├── swarm.nix # Docker + single-node swarm + `proxy` net; deploys the
+│ │ # coop-cloud/traefik recipe via abra (wildcard/file-provider, §4.2)
 │ ├── dashboard.nix # results overview site
 │ └── secrets.nix # sops-nix / agenix wiring
 ├── secrets/ # sops-encrypted (*.enc / *.age); see §4.4
@@ -287,13 +289,16 @@ Two DNS zones, deliberately separated — do **not** conflate them:
 SAN cert covering **`*.ci.commoninternet.net` + `ci.commoninternet.net`** (issued via Let's
 Encrypt DNS-01 against Gandi, by the operator, using a token the agent never sees) lives on cc-ci:
 - `/var/lib/ci-certs/live/fullchain.pem` (leaf+intermediate) and `…/privkey.pem`.
-- The agent configures **Traefik's file provider** (`tls.certificates`, `certFile`/`keyFile`
-pointing at those paths) to serve it, and runs **no ACME resolver** for this domain. One cert
-covers every per-run subdomain, so spinning up an app domain needs no cert work at all.
-- **Renewal is a manual operator task** (LE 90-day cert): the operator re-issues out-of-band and
-drops the new files at the same paths (Traefik file provider hot-reloads). The agent must **not**
-attempt ACME/DNS-01 for `commoninternet.net` and must **not** expect a DNS token — a missing/
-expired cert is an operator action, surfaced as a finding, not something the agent re-issues.
+- **Traefik is the real `coop-cloud/traefik` recipe, deployed via abra** (for e2e fidelity — see
+§4.2), run in its **wildcard / file-provider mode** (`WILDCARDS_ENABLED=1` + `compose.wildcard.yml`).
+The pre-issued cert is supplied as the recipe's `ssl_cert`/`ssl_key` **swarm secrets** (sourced
+from the files above); the recipe's file provider then serves it under `tls.certificates`. **No
+ACME resolver / no DNS provider** is enabled — only the cert+key reach cc-ci, never the DNS token.
+One cert covers every per-run subdomain (matched by SNI), so a new app domain needs no cert work.
+- **Renewal is a manual operator task** (LE 90-day cert): the operator re-issues out-of-band, then
+updates the `ssl_cert`/`ssl_key` secret (bump its version) and redeploys traefik. The agent must
+**not** attempt ACME/DNS-01 for `commoninternet.net` and must **not** expect a DNS token — a
+missing/expired cert is an operator action surfaced as a finding, not something the agent re-issues.
 (Rationale for choosing a wildcard cert over per-subdomain: a wildcard is reused for every churning
 run subdomain and sidesteps LE's 50-certs/week-per-domain limit; only DNS-01 can mint a wildcard.
 We keep that DNS-01 issuance with the operator rather than handing the agent the zone token.)
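The claim in the hunk above that one cert covers every per-run subdomain rests on how wildcard SANs match: `*` stands for exactly one leftmost label. A minimal sketch of that rule follows. It is a simplification (no partial-label wildcards), and `wildcard_covers` and `cert_covers` are hypothetical helper names, not anything from the recipe or from Traefik.

```python
# The two SAN entries named in the plan.
SANS = ["*.ci.commoninternet.net", "ci.commoninternet.net"]


def wildcard_covers(hostname, san):
    """True if a single SAN entry covers hostname.

    A leading "*" matches exactly one non-empty leftmost label;
    every other label must match exactly (case-insensitively).
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    san_labels = san.lower().rstrip(".").split(".")
    if san_labels[0] != "*":
        return host_labels == san_labels
    return (
        len(host_labels) == len(san_labels)
        and host_labels[0] != ""
        and host_labels[1:] == san_labels[1:]
    )


def cert_covers(hostname, sans=SANS):
    """True if any SAN entry of the cert covers hostname."""
    return any(wildcard_covers(hostname, san) for san in sans)


if __name__ == "__main__":
    print(cert_covers("hedgedoc-pr12-abc1234.ci.commoninternet.net"))  # one label deep
    print(cert_covers("a.b.ci.commoninternet.net"))                    # two labels deep
```

The practical consequence is that per-run app domains must sit exactly one label below `ci.commoninternet.net`; a nested name such as `a.b.ci.commoninternet.net` would not be covered by this cert.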
@@ -345,11 +350,16 @@ Bridge posts/updates a Gitea PR comment with the run URL and (on completion) pas
 - Drone server connects to Gitea via OAuth app (Gitea → Settings → Applications). Runner is the
 **exec runner** (or a privileged docker runner) running **on cc-ci itself**, because tests must
 drive `abra` to deploy real recipes onto a real swarm.
-- cc-ci doubles as the **deploy target**: single-node Docker Swarm + Traefik, abra installed,
-serving the `*.ci.commoninternet.net` wildcard, TLS terminated on cc-ci's Traefik using the
-**pre-issued static wildcard cert** at `/var/lib/ci-certs/live/` (§4.0). The operator preconfigures
-the wildcard DNS record (→ gateway), the gateway's TLS-passthrough to cc-ci, and the cert itself
-(§4.4); the agent configures Traefik (file provider → that cert) and swarm on top — **no ACME**.
+- cc-ci doubles as the **deploy target**: single-node Docker Swarm + abra, with the reverse proxy
+provided by the **real `coop-cloud/traefik` recipe deployed via abra** (not a hand-rolled Traefik
+— chosen for **end-to-end fidelity**: test apps route through the exact proxy a real Co-op Cloud
+host uses — `web`/`web-secure` entrypoints, the `proxy` overlay, the swarm provider). TLS
+terminates on it using the **pre-issued static wildcard cert** (§4.0): run the recipe in
+**wildcard/file-provider mode** (`WILDCARDS_ENABLED=1` + `compose.wildcard.yml`) and supply the
+cert as the recipe's `ssl_cert`/`ssl_key` swarm secrets from `/var/lib/ci-certs/live/`. The
+operator preconfigures the wildcard DNS (→ gateway), the gateway's TLS-passthrough, and the cert
+itself (§4.4); the agent deploys the traefik recipe + swarm on top — **no ACME, no DNS token on
+cc-ci**. Make the `abra app new/deploy traefik` steps reproducible (scripted/Nix-invoked) for D8.
 - Each CI run gets an isolated app domain `<recipe>-pr<n>-<short-sha>.ci.commoninternet.net`
 (§4.0) so concurrent runs don't collide. Teardown removes app, secrets, and volumes.
 - Consider a concurrency cap (1–2 deploys at a time) to avoid resource thrash; document it.
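The per-run domain scheme `<recipe>-pr<n>-<short-sha>.ci.commoninternet.net` from the hunk above could be generated along these lines. This is a sketch under stated assumptions: the function name `app_domain`, the 7-character short SHA, and the validation rules are illustrative choices, not something the plan specifies. The 63-character cap is the standard limit on a single DNS label.

```python
import re

BASE_DOMAIN = "ci.commoninternet.net"

# A DNS label: lowercase letters, digits, and hyphens, not starting or ending with a hyphen.
_LABEL_RE = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def app_domain(recipe, pr_number, commit_sha, sha_len=7):
    """Build the isolated app domain for one CI run.

    sha_len=7 is an assumption; the plan only says "<short-sha>".
    """
    short_sha = commit_sha[:sha_len].lower()
    label = f"{recipe}-pr{pr_number}-{short_sha}"
    if len(label) > 63:
        raise ValueError(f"label too long for DNS ({len(label)} > 63): {label}")
    if not _LABEL_RE.fullmatch(label):
        raise ValueError(f"label is not a valid DNS label: {label}")
    return f"{label}.{BASE_DOMAIN}"


if __name__ == "__main__":
    print(app_domain("hedgedoc", 12, "abc1234def5678"))
```

Because the whole run identifier lives in one label, every generated name stays directly under `ci.commoninternet.net`, and two concurrent runs differ in PR number or commit and therefore get distinct domains.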
@@ -462,10 +472,12 @@ verify the acceptance check before the next milestone starts). Seed `BACKLOG.md`
 - **M0 — Foundations.** Repo created; flake builds; `nixos-rebuild` (or deploy-rs) applies a
 no-op-then-base config to cc-ci; sops decrypts a test secret on the host.
 *Accept:* `ssh cc-ci 'systemctl is-system-running'` healthy after a rebuild from the repo.
-- **M1 — Swarm + abra target.** Docker + single-node swarm + Traefik up; wildcard DNS + TLS;
-abra can deploy and tear down a trivial recipe by hand.
-*Accept:* a recipe deployed via abra is reachable over HTTPS at `*.ci.commoninternet.net`, then
-fully torn down leaving no volumes.
+- **M1 — Swarm + abra target.** Docker + single-node swarm + `proxy` network; the **`coop-cloud/traefik`
+recipe deployed via abra** (wildcard/file-provider mode, serving the pre-issued cert — §4.0/§4.2,
+not a custom Traefik); abra can deploy and tear down a trivial recipe by hand.
+*Accept:* a recipe deployed via abra is reachable over HTTPS (valid wildcard cert) on the
+`web-secure` entrypoint at `*.ci.commoninternet.net`, then fully torn down leaving no volumes; the
+proxy is verifiably the traefik recipe and **no DNS/ACME token is present on cc-ci**.
 - **M2 — Drone online.** Drone server+runner via Nix, OAuth to Gitea; a hello-world `.drone.yml`
 in cc-ci runs green; logs visible in Drone UI.
 *Accept:* push to cc-ci triggers a visible green Drone build.
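The "reachable over HTTPS (valid wildcard cert)" part of the M1 acceptance check above could be probed with the Python standard library, as in the sketch below. It is an assumption-laden illustration: `https_cert_ok` is a hypothetical helper, and it covers only the TLS handshake. It does not prove that the proxy is the traefik recipe, that the `web-secure` entrypoint is the one answering, or that no token is present on cc-ci; those need separate checks.

```python
import socket
import ssl


def https_cert_ok(hostname, port=443, timeout=10.0):
    """True if a TLS handshake to hostname succeeds with a trusted, matching cert.

    ssl.create_default_context() verifies both the certificate chain and the
    hostname, so an expired, untrusted, or mismatched cert yields False.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((hostname, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=hostname) as tls:
                return tls.version() is not None
    except (ssl.SSLError, OSError):
        # Covers verification failures, DNS failures, refused connections, timeouts.
        return False


if __name__ == "__main__":
    # Hypothetical per-run domain, for illustration only.
    print(https_cert_ok("hedgedoc-pr12-abc1234.ci.commoninternet.net"))
```

Setting `server_hostname` also sends the name as SNI, which is what lets the proxy select the wildcard cert for a per-run subdomain.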
@@ -616,12 +628,13 @@ iterations spinning on a build that takes minutes.
 - Webhook scope: per-repo vs org-level Gitea webhook. (Default: per-repo via enroll script.)
 - Drone runner type: exec vs privileged docker. (Default: exec, since it must drive host abra.)
 - Secret tool: sops-nix vs agenix. (Default: sops-nix for multi-recipient + yaml ergonomics.)
-- Wildcard TLS: **SETTLED — operator pre-issues a wildcard cert; the agent serves it statically, no
-token** (§4.0). The operator issued a wildcard SAN cert (`*.ci.commoninternet.net` +
-`ci.commoninternet.net`) via LE DNS-01/Gandi out-of-band and placed it at
-`/var/lib/ci-certs/live/`; the agent configures Traefik's file provider to serve it and runs no
-ACME for this domain. Chosen so the DNS-editing token never enters the repo/agent. **Manual
-renewal** every ~90 days (next ~2026-08-24) — operator re-issues and replaces the files in place.
+- Reverse proxy / Wildcard TLS: **SETTLED — deploy the real `coop-cloud/traefik` recipe via abra
+(for e2e fidelity), in wildcard/file-provider mode, serving the operator's pre-issued wildcard
+cert; no ACME, no token** (§4.0/§4.2). Supersedes the original plan's hand-rolled
+`modules/traefik.nix`. The operator issued the wildcard SAN cert (`*.ci.commoninternet.net` +
+`ci.commoninternet.net`) via LE DNS-01/Gandi out-of-band into `/var/lib/ci-certs/live/`; the agent
+feeds it as the recipe's `ssl_cert`/`ssl_key` swarm secrets so the DNS-editing token never reaches
+cc-ci. **Manual renewal** ~90 days (next ~2026-08-24): re-issue → update the secret → redeploy.
 - Proof recipe set (D10 — six, category-spanning). Default candidates, all previously verified
 deployable: `hedgedoc`, `cryptpad`, `keycloak`, `authentik`, `lasuite-docs`/`lasuite-drive`,
 `matrix-synapse`, `immich`, `bluesky-pds`. Lock the final six early so M4–M6.5 build toward them.
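Since renewal is manual and the agent may only surface a stale cert as a finding, a date check against the stated renewal target is one way to raise that finding early. The sketch below assumes the approximate date from the plan (2026-08-24); the 14-day warning window and the name `renewal_status` are hypothetical.

```python
from datetime import date

# Approximate next-renewal date quoted in the plan.
NEXT_RENEWAL = date(2026, 8, 24)


def renewal_status(today, renewal=NEXT_RENEWAL, warn_days=14):
    """Classify how close the manual renewal is.

    "ok"       -> more than warn_days remain
    "due-soon" -> within warn_days; worth surfacing to the operator
    "overdue"  -> renewal date has passed; an operator action, not an agent re-issue
    """
    remaining = (renewal - today).days
    if remaining < 0:
        return "overdue"
    if remaining <= warn_days:
        return "due-soon"
    return "ok"


if __name__ == "__main__":
    print(renewal_status(date.today()))
```

Whatever the status, the remedy stays with the operator: re-issue, update the secret, redeploy.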