diff --git a/BACKLOG.md b/BACKLOG.md index 34b780f..b54c222 100644 --- a/BACKLOG.md +++ b/BACKLOG.md @@ -16,11 +16,14 @@ Two single-writer sections (§6.1): Builder edits only `## Build backlog`; Adver ### M1 — Swarm + abra target - [x] Docker + single-node swarm via Nix (modules/swarm.nix: docker + swarm-init oneshot + `proxy` overlay net + daily autoprune). Verified: Swarm=active, proxy overlay present. -- [x] Traefik (file provider → /var/lib/ci-certs/live/) as a swarm stack on `proxy`; wildcard cert - served as default cert. Verified end-to-end: gateway 143.244.213.108:443 SNI-passthrough → - cc-ci Traefik terminates TLS w/ `CN=*.ci.commoninternet.net` (LE E8), HTTP 404 (no router yet). -- [ ] abra installed; deploy + tear down a trivial recipe by hand over HTTPS -- [ ] Gate: M1 — recipe reachable over HTTPS at *.ci.commoninternet.net, torn down clean +- [x] Proxy = real coop-cloud/traefik via abra (orchestrator decision, replaces custom traefik.nix): + wildcard/file-provider mode, pre-issued cert as ssl_cert/ssl_key swarm secrets, LETS_ENCRYPT_ENV + empty → no ACME. `scripts/deploy-proxy.sh` (idempotent). Verified E2E via gateway: wildcard cert + served, 0 ACME log lines. +- [x] abra installed (modules/abra.nix, pinned 0.13.0-beta); deployed custom-html by hand over HTTPS + (HTTP 200 nginx page via gateway) and tore it down clean (services/volumes/secrets/containers=0). +- [x] Gate: M1 — recipe reachable over HTTPS at *.ci.commoninternet.net, torn down clean → + CLAIMED 2026-05-26, awaiting Adversary. ### M2 — Drone online - [ ] Drone server + exec runner via Nix; Gitea OAuth app diff --git a/DECISIONS.md b/DECISIONS.md index 340c2dd..71c97a7 100644 --- a/DECISIONS.md +++ b/DECISIONS.md @@ -10,6 +10,28 @@ Architecture decisions and dead-ends. One line of rationale each. (§0, §8) - **Git credentials:** helper script in repo-local git config sources `/srv/cc-ci/.testenv` at call time — no secret values stored in `.git/config` or commits. 
+- **Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26, + overrides plan §3 `modules/traefik.nix`).** Instead of a hand-rolled Traefik we deploy the + canonical Co-op Cloud `traefik` recipe via abra in **wildcard / file-provider mode**, for + end-to-end fidelity (canonical `web`/`web-secure` entrypoints + proxy/swarm conventions every + recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO + DNS token on the box: + - `WILDCARDS_ENABLED=1` + append `compose.wildcard.yml`; the pre-issued cert is fed as the + `ssl_cert`/`ssl_key` swarm secrets (v1) via `abra app secret insert … -f` from + `/var/lib/ci-certs/live/{fullchain,privkey}.pem`. The file provider serves it (`tls.certificates`). + - `LETS_ENCRYPT_ENV=` **empty** on the traefik app *and* on every test app → the recipe's + `tls.certresolver=${LETS_ENCRYPT_ENV}` label resolves to no resolver → routers serve the + wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.) + - Reproducibility (D8): `scripts/deploy-proxy.sh` is idempotent (ensures local abra server, fetches + recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in + `docs/install.md`. The custom `modules/traefik.nix` was removed; `modules/swarm.nix` keeps swarm + init + `proxy` net + firewall 80/443. + - **Renewal (manual, ~90d):** operator re-issues the wildcard at the same paths, then + `abra app secret rm traefik.ci.commoninternet.net ssl_cert -n` + re-insert at a new version (bump + `SECRET_WILDCARD_CERT_VERSION`) and redeploy. (Documented in docs/secrets.md at M7.) + - **abra teardown syntax** (for harness, §4.3): `abra app undeploy -n`, + `abra app volume remove -f -n`, `abra app secret remove --all -n`. None take `--chaos`. 
+ ## Open (defaults from §8, to confirm as reality lands) - **Deploy mechanism — SETTLED (M0):** `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run *on diff --git a/JOURNAL.md b/JOURNAL.md index 44edc5b..f476fed 100644 --- a/JOURNAL.md +++ b/JOURNAL.md @@ -146,3 +146,43 @@ firewall 80/443 (gateway forwards over enp5s0). **Next:** install abra (M1 last task), `abra app new` a trivial recipe (custom-html) → deploy → reach over HTTPS at .ci.commoninternet.net → teardown leaving no volumes. That completes M1 → CLAIM M1 gate. + +## 2026-05-26 — M1: proxy pivot to real coop-cloud/traefik via abra; recipe deploy/teardown (M1 CLAIMED) + +**Orchestrator decision (mid-M1):** replace the hand-rolled Traefik with the canonical Co-op Cloud +`traefik` recipe deployed via abra, wildcard/file-provider mode, no ACME/token. Removed custom +`modules/traefik.nix`; moved firewall 80/443 into `modules/swarm.nix`. Recorded in DECISIONS.md. + +**Why the pivot also fixed a real bug:** my custom Traefik used entrypoint `websecure`; coop-cloud +recipes label `entrypoints=web-secure`. While chasing that I also hit a sharp **systemd-run gotcha**: +`systemd-run … nixos-rebuild switch --flake .#cc-ci` runs with cwd `/`, so `.#` → `/` → "could not +find a flake.nix"; the switch silently failed while a post-`--collect` `systemctl show` returned a +stale `Result=success`. Fix: always use the **absolute** flake path `/root/cc-ci#cc-ci`, and read the +result before resetting. (rebuild6/7 had silently not applied; rebuild2–5 used the absolute path.) + +**abra packaged** (modules/abra.nix): release binary 0.13.0-beta, pinned by sha256, autoPatchelf'd. +`abra --version` → `0.13.0-beta-06a57de`. 
+ +**scripts/deploy-proxy.sh** (idempotent, pure-bash — host has no python3): ensure local abra server, +fetch traefik, write wildcard/no-ACME env (`WILDCARDS_ENABLED=1`, `SECRET_WILDCARD_*_VERSION=v1`, +`COMPOSE_FILE=compose.yml:compose.wildcard.yml`, `LETS_ENCRYPT_ENV=` empty), insert cert secrets via +`abra app secret insert … -f` from /var/lib/ci-certs/live, deploy. Bugs fixed en route: multi-line +PEM must use `-f` (not arg); secret-presence must check `docker secret ls` (abra's recipe list always +shows the name with `created on server:false`). + +**Traefik deploy:** `abra app deploy` → `deploy succeeded 🟢` (traefik v3.6.15 + socket-proxy). +Verify: `docker service ls` → app+socket-proxy 1/1; via gateway `curl --resolve probe.*:443: +143.244.213.108` → `CN=*.ci.commoninternet.net` (LE E8); **0 ACME log lines**. + +**M1 gate (recipe over HTTPS + teardown):** +- `abra app new custom-html -s default -D cchtml1.ci.commoninternet.net -S -n` then set + `LETS_ENCRYPT_ENV=` and `abra app deploy -n -C` → `🟢` (nginx 1.29.0). +- `curl -ks --resolve cchtml1.ci.commoninternet.net:443:143.244.213.108 https://…/` → + `http_code=200 size=615`, served the nginx welcome page over HTTPS with the wildcard cert. +- Teardown: `abra app undeploy -n` → 🟢; `abra app volume remove -f -n` → "1 volumes removed"; + leak check → services 0 / volumes 0 / secrets 0 / containers 0. **Clean.** +- Correct teardown syntax confirmed: `secret remove --all -n` (not `--all-secrets`). + +**docs/install.md** seeded (flake apply + deploy-proxy + verify). M1 gate CLAIMED in STATUS.md. + +**Next:** M2 — Drone server + exec runner via Nix, Gitea OAuth app, hello-world .drone.yml green. diff --git a/STATUS.md b/STATUS.md index 276aa13..f79f43a 100644 --- a/STATUS.md +++ b/STATUS.md @@ -1,9 +1,8 @@ # STATUS — cc-ci Builder -**Phase:** M0 → M1. M0 complete & CLAIMED; starting M1 (swarm + Traefik + abra) while awaiting verdict. 
-**In-flight:** M1 — abra install + by-hand HTTPS deploy/teardown of a trivial recipe (M1 gate). -Swarm + Traefik (wildcard cert via gateway passthrough) both up and verified. -**Last updated:** 2026-05-26 (M1 Traefik up, HTTPS path proven) +**Phase:** M1 complete & CLAIMED → starting M2 (Drone). M0 PASS (Adversary @21:35Z). M1 awaiting verdict. +**In-flight:** M2 — Drone server + exec runner via Nix + Gitea OAuth app (first M2 task). +**Last updated:** 2026-05-26 (M1 claimed) ## Gates - **Gate: M0 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: flake rebuilds cc-ci from repo @@ -11,6 +10,13 @@ Swarm + Traefik (wildcard cert via gateway passthrough) both up and verified. `/run/secrets/test_secret` (0400 root, value = generated `cc-ci-m0-…`). Repro: clone repo, sync to host, `nixos-rebuild switch --flake .#cc-ci`, then `systemctl is-system-running` + check the secret. Per §6.1 I will NOT advance past this gate to M2; M1 work proceeds as independent unblocked work. + → **M0 PASS** logged by Adversary in REVIEW.md @2026-05-26T21:35Z (cold verify, leak probe clean). +- **Gate: M1 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: Docker single-node swarm + + `proxy` overlay; real coop-cloud/traefik via abra (wildcard/file-provider, no ACME); custom-html + deployed by hand → HTTP 200 over HTTPS via gateway at cchtml1.ci.commoninternet.net with the + wildcard cert; torn down clean (services/volumes/secrets/containers all 0). Repro: + `scripts/deploy-proxy.sh` + `abra app new/deploy/undeploy`. Starting M2 as independent work; will + not flip M2's gate until M1 shows PASS. ## Blocked - (none) diff --git a/docs/install.md b/docs/install.md new file mode 100644 index 0000000..5857a89 --- /dev/null +++ b/docs/install.md @@ -0,0 +1,54 @@ +# Installing cc-ci from scratch + +> WORK IN PROGRESS — grows with each milestone; the full from-scratch rebuild is verified at M9 (D8). + +cc-ci is declared as a NixOS flake (this repo) plus a reproducible proxy-deploy step. 
Target: +a NixOS 24.11 host reachable as `cc-ci` over SSH (root), with the operator preconditions in place. + +## Operator preconditions (class-A1, see DECISIONS.md / docs/baseline.md) + +- Wildcard TLS cert at `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` + (`*.ci.commoninternet.net` + `ci.commoninternet.net`). **Renewed out-of-band; never ACME here.** +- DNS: `*.ci.commoninternet.net` (+ bare) → the **gateway**, which TLS-passthroughs (SNI) to cc-ci. +- Firewall path: gateway reaches cc-ci on tcp/80+443 (opened by `modules/swarm.nix`). + +## 1. Apply the NixOS flake + +The flake (`flake.nix`, `hosts/cc-ci/`, `modules/`) declares: base host, sops-nix (decrypts via the +host SSH key), Docker + single-node Swarm + the `proxy` overlay (`modules/swarm.nix`), and abra +(`modules/abra.nix`). + +```sh +# materialise the repo on the host (the build runs on cc-ci itself — see DECISIONS.md deploy mech) +# e.g. git clone /root/cc-ci (or sync it) +nixos-rebuild switch --flake /root/cc-ci#cc-ci +# verify +systemctl is-system-running # -> running +docker info --format '{{.Swarm.LocalNodeState}}' # -> active +docker network ls | grep proxy # -> proxy ... overlay swarm +``` + +> Tip: when driving the switch over an SSH session that rides Tailscale, run it as a detached unit so +> it survives a momentary drop, and **use the absolute flake path** (systemd units run with cwd `/`): +> `systemd-run --unit=ccci-sw --property=Type=oneshot nixos-rebuild switch --flake /root/cc-ci#cc-ci` + +## 2. Deploy the reverse proxy (coop-cloud traefik, wildcard/file-provider, no ACME) + +```sh +bash /root/cc-ci/scripts/deploy-proxy.sh +``` + +This idempotently deploys the canonical Co-op Cloud `traefik` recipe via abra in wildcard mode, +serving the pre-issued cert as the `ssl_cert`/`ssl_key` swarm secrets, with `LETS_ENCRYPT_ENV` empty +so no ACME ever runs (see DECISIONS.md "Proxy: real coop-cloud/traefik via abra"). 
Verify: + +```sh +docker service ls | grep traefik # app + socket-proxy, 1/1 +# wildcard cert served end-to-end via the gateway: +curl -ksv --resolve probe.ci.commoninternet.net:443: https://probe.ci.commoninternet.net/ \ + 2>&1 | grep -E 'subject:|HTTP/' # -> CN=*.ci.commoninternet.net, HTTP 404 (no app router yet) +``` + +## 3. (later milestones) Drone, comment-bridge, dashboard, recipe enrollment + +See `docs/enroll-recipe.md` (D5), `docs/secrets.md` (D6), `docs/runbook.md`. Added as those land. diff --git a/hosts/cc-ci/configuration.nix b/hosts/cc-ci/configuration.nix index 22b900f..3719d59 100644 --- a/hosts/cc-ci/configuration.nix +++ b/hosts/cc-ci/configuration.nix @@ -7,7 +7,7 @@ ./hardware.nix ../../modules/secrets.nix ../../modules/swarm.nix - ../../modules/traefik.nix + ../../modules/abra.nix ]; # --- Tailscale (ACCESS-CRITICAL: do not break, this is the only route in) --- diff --git a/modules/abra.nix b/modules/abra.nix new file mode 100644 index 0000000..48bf657 --- /dev/null +++ b/modules/abra.nix @@ -0,0 +1,25 @@ +# abra — the Co-op Cloud CLI used by the harness to deploy/upgrade/backup recipes (M1+). +# Packaged from the upstream release binary, pinned by version + hash for reproducibility (D8). +{ pkgs, ... }: +let + abra = pkgs.stdenv.mkDerivation rec { + pname = "abra"; + version = "0.13.0-beta"; + src = pkgs.fetchurl { + url = "https://git.coopcloud.tech/toolshed/abra/releases/download/${version}/abra_${version}_linux_amd64.tar.gz"; + sha256 = "12csk6wp1pk9cspzqfl4a6h5jdz8p055sf0ggxw9k7ljhpd5qvc6"; + }; + # Tarball has files at the root (LICENSE, README.md, abra), no common subdir. 
+ sourceRoot = "."; + nativeBuildInputs = [ pkgs.autoPatchelfHook ]; + buildInputs = [ pkgs.stdenv.cc.cc.lib ]; + installPhase = '' + runHook preInstall + install -Dm755 abra "$out/bin/abra" + runHook postInstall + ''; + }; +in +{ + environment.systemPackages = [ abra ]; +} diff --git a/modules/swarm.nix b/modules/swarm.nix index 6986bd7..d36e676 100644 --- a/modules/swarm.nix +++ b/modules/swarm.nix @@ -15,6 +15,10 @@ environment.systemPackages = [ pkgs.docker ]; + # Gateway forwards 80/443 to cc-ci over the public interface (enp5s0); the coop-cloud + # traefik stack (deployed via abra, see docs/install.md) publishes these ports. + networking.firewall.allowedTCPPorts = [ 80 443 ]; + # Bring up a single-node swarm + the shared `proxy` overlay network. Idempotent: # safe to re-run every boot/rebuild. advertise-addr 127.0.0.1 is fine for a lone node. systemd.services.swarm-init = { diff --git a/modules/traefik.nix b/modules/traefik.nix deleted file mode 100644 index c700415..0000000 --- a/modules/traefik.nix +++ /dev/null @@ -1,96 +0,0 @@ -# Traefik for the test swarm (M1). Runs as a swarm service on the `proxy` overlay so it can -# reach recipe service VIPs (a host process couldn't). TLS terminates here using the operator's -# pre-issued wildcard cert via the file provider — NO ACME for commoninternet.net (§4.0). -# Recipe routers only need `traefik.enable=true` + a Host(...) rule + tls=true; the default -# certificate (the wildcard) is served for every *.ci.commoninternet.net host. -{ pkgs, ... }: -let - # Static config. Docker *Swarm* provider (v3) + file provider for the cert. 
- staticCfg = pkgs.writeText "traefik.yml" '' - entryPoints: - web: - address: ":80" - websecure: - address: ":443" - providers: - swarm: - endpoint: "unix:///var/run/docker.sock" - exposedByDefault: false - network: proxy - file: - directory: /etc/traefik/dynamic - watch: true - log: - level: INFO - accessLog: {} - api: - dashboard: false - ping: {} - ''; - - # Dynamic config: serve the pre-issued wildcard as the DEFAULT certificate, so any - # *.ci.commoninternet.net router with tls=true is covered without a cert resolver. - certsCfg = pkgs.writeText "certs.yml" '' - tls: - stores: - default: - defaultCertificate: - certFile: /var/lib/ci-certs/live/fullchain.pem - keyFile: /var/lib/ci-certs/live/privkey.pem - certificates: - - certFile: /var/lib/ci-certs/live/fullchain.pem - keyFile: /var/lib/ci-certs/live/privkey.pem - ''; - - stack = pkgs.writeText "traefik-stack.yml" '' - version: "3.8" - services: - traefik: - image: traefik:v3.3 - ports: - - target: 80 - published: 80 - mode: host - - target: 443 - published: 443 - mode: host - volumes: - - /var/run/docker.sock:/var/run/docker.sock:ro - - /var/lib/ci-certs/live:/var/lib/ci-certs/live:ro - - ${staticCfg}:/etc/traefik/traefik.yml:ro - - ${certsCfg}:/etc/traefik/dynamic/certs.yml:ro - networks: - - proxy - deploy: - mode: replicated - replicas: 1 - placement: - constraints: - - node.role == manager - restart_policy: - condition: any - networks: - proxy: - external: true - ''; -in -{ - # Gateway forwards 80/443 to cc-ci over the public interface (enp5s0), so open them. 
- networking.firewall.allowedTCPPorts = [ 80 443 ]; - - systemd.services.traefik-deploy = { - description = "Deploy the Traefik swarm stack"; - after = [ "swarm-init.service" ]; - requires = [ "swarm-init.service" ]; - wantedBy = [ "multi-user.target" ]; - path = [ pkgs.docker ]; - serviceConfig = { - Type = "oneshot"; - RemainAfterExit = true; - }; - script = '' - set -eu - docker stack deploy --detach=true -c ${stack} traefik - ''; - }; -} diff --git a/scripts/deploy-proxy.sh b/scripts/deploy-proxy.sh new file mode 100755 index 0000000..82ef704 --- /dev/null +++ b/scripts/deploy-proxy.sh @@ -0,0 +1,60 @@ +#!/usr/bin/env bash +# Reproducibly deploy the canonical Co-op Cloud `traefik` recipe as cc-ci's reverse proxy, +# in wildcard / file-provider mode — serving the operator's pre-issued wildcard cert, with +# NO ACME and NO DNS token on the box (see DECISIONS.md "Proxy: real coop-cloud/traefik"). +# +# Idempotent: safe to re-run. Run as root on cc-ci (abra drives the local Docker swarm). +# ssh cc-ci 'bash /root/cc-ci/scripts/deploy-proxy.sh' +# +# Prereqs (declared in the flake): docker + single-node swarm + `proxy` overlay (modules/swarm.nix), +# abra (modules/abra.nix), and the wildcard cert at /var/lib/ci-certs/live/ (operator-provided). +set -euo pipefail + +PROXY_DOMAIN="${PROXY_DOMAIN:-traefik.ci.commoninternet.net}" +CERT_DIR="${CERT_DIR:-/var/lib/ci-certs/live}" +ENV_FILE="$HOME/.abra/servers/default/${PROXY_DOMAIN}.env" + +export PATH=/run/current-system/sw/bin:"$PATH" + +echo "==> ensure local abra server" +abra server ls -m -n >/dev/null 2>&1 || abra server add --local -n || true + +echo "==> fetch traefik recipe" +abra recipe fetch traefik -n >/dev/null + +if [ ! -f "$ENV_FILE" ]; then + echo "==> create traefik app ($PROXY_DOMAIN)" + abra app new traefik -s default -D "$PROXY_DOMAIN" -n +fi + +echo "==> configure wildcard / no-ACME env" +# Set each var deterministically: drop any existing (commented or not) line, then append. 
+# Empty LETS_ENCRYPT_ENV => the traefik router uses no cert resolver => no ACME ever fires. +set_env() { + local key="$1" val="$2" + sed -i -E "/^[[:space:]]*#?[[:space:]]*${key}=/d" "$ENV_FILE" + printf '%s=%s\n' "$key" "$val" >> "$ENV_FILE" +} +set_env LETS_ENCRYPT_ENV "" +set_env WILDCARDS_ENABLED "1" +set_env SECRET_WILDCARD_CERT_VERSION "v1" +set_env SECRET_WILDCARD_KEY_VERSION "v1" +set_env COMPOSE_FILE '"compose.yml:compose.wildcard.yml"' +echo " env written: $ENV_FILE" + +echo "==> insert wildcard cert secrets (v1) from $CERT_DIR (idempotent)" +# Check the actual swarm secret (generated name ${STACK_NAME}__v1), not abra's +# recipe-defined list (which always shows the names with "created on server":"false"). +have_secret() { docker secret ls --format '{{.Name}}' | grep -q "_${1}_v1\$"; } +# Insert from file (-f) so the multi-line PEM is read verbatim, not arg-parsed. +if ! have_secret ssl_cert; then + abra app secret insert "$PROXY_DOMAIN" ssl_cert v1 "$CERT_DIR/fullchain.pem" -f -n +fi +if ! have_secret ssl_key; then + abra app secret insert "$PROXY_DOMAIN" ssl_key v1 "$CERT_DIR/privkey.pem" -f -n +fi + +echo "==> deploy traefik" +abra app deploy "$PROXY_DOMAIN" -n -C + +echo "==> done"
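
Reviewer note on `set_env` in `scripts/deploy-proxy.sh`: the helper is described as deterministic (drop any existing line for the key, commented or not, then append). A minimal sketch of how that could be checked in isolation, without abra or docker, follows. It assumes GNU `sed` (as the script itself does); the scratch file `DEMO_ENV`, its sample contents, and the extra `file` parameter are hypothetical and exist only for this demo.

```sh
#!/usr/bin/env bash
# Sketch: exercise the set_env logic from deploy-proxy.sh against a scratch file.
# DEMO_ENV and its sample contents are hypothetical; nothing here touches abra or docker.
set -euo pipefail

DEMO_ENV="$(mktemp)"
cat > "$DEMO_ENV" <<'EOF'
TYPE=traefik
#LETS_ENCRYPT_ENV=production
WILDCARDS_ENABLED=0
EOF

# Same sed/printf logic as the script's helper, parameterised on the file for the demo.
set_env() {
  local file="$1" key="$2" val="$3"
  # Delete any existing assignment of the key, whether commented out or active.
  sed -i -E "/^[[:space:]]*#?[[:space:]]*${key}=/d" "$file"
  # Append the single authoritative assignment.
  printf '%s=%s\n' "$key" "$val" >> "$file"
}

# Apply the same settings twice; the second pass must not leave duplicates.
for pass in 1 2; do
  set_env "$DEMO_ENV" LETS_ENCRYPT_ENV ""
  set_env "$DEMO_ENV" WILDCARDS_ENABLED "1"
done

cat "$DEMO_ENV"
```

After both passes the scratch file should hold exactly one line per key, with the commented `LETS_ENCRYPT_ENV` default gone. That is the property the "no ACME" guarantee relies on: a leftover duplicate or commented line would make the effective value depend on how the env file is parsed.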