# DECISIONS — cc-ci Builder

Architecture decisions and dead-ends. One line of rationale each. (§0, §8)

## Settled

- **Wildcard TLS:** operator pre-issues wildcard cert at `/var/lib/ci-certs/live/`; Traefik file provider serves it; **no ACME** for commoninternet.net. (Plan §4.0/§8 — fixed.)
- **Repo:** `git.autonomic.zone/recipe-maintainers/cc-ci`, private. Bot is org admin. (Bootstrap.)
- **Git credentials:** helper script in repo-local git config sources `/srv/cc-ci/.testenv` at call time — no secret values stored in `.git/config` or commits.
- **Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26, overrides plan §3 `modules/traefik.nix`).** Instead of a hand-rolled Traefik we deploy the canonical Co-op Cloud `traefik` recipe via abra in **wildcard / file-provider mode**, for end-to-end fidelity (canonical `web`/`web-secure` entrypoints + proxy/swarm conventions every recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO DNS token on the box:
  - `WILDCARDS_ENABLED=1` + append `compose.wildcard.yml`; the pre-issued cert is fed as the `ssl_cert`/`ssl_key` swarm secrets (v1) via `abra app secret insert … -f` from `/var/lib/ci-certs/live/{fullchain,privkey}.pem`. The file provider serves it (`tls.certificates`).
  - `LETS_ENCRYPT_ENV=` **empty** on the traefik app *and* on every test app → the recipe's `tls.certresolver=${LETS_ENCRYPT_ENV}` label resolves to no resolver → routers serve the wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
  - Reproducibility (D8): `scripts/deploy-proxy.sh` is idempotent (ensures local abra server, fetches recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in `docs/install.md`. The custom `modules/traefik.nix` was removed; `modules/swarm.nix` keeps swarm init + `proxy` net + firewall 80/443.
  - **Renewal (manual, ~90d):** operator re-issues the wildcard at the same paths, then `abra app secret rm traefik.ci.commoninternet.net ssl_cert -n` + re-insert at a new version (bump `SECRET_WILDCARD_CERT_VERSION`) and redeploy. (Documented in docs/secrets.md at M7.)
- **abra teardown syntax** (for harness, §4.3): `abra app undeploy -n`, `abra app volume remove -f -n`, `abra app secret remove --all -n`. None take `--chaos`.
- **Infra bring-up = idempotent-reconcile systemd oneshots — SETTLED (M2, orchestrator steer 2026-05-26).** Every piece of swarm infra that abra deploys (traefik `modules/proxy.nix`, Drone `modules/drone.nix`, later comment-bridge + dashboard) is a `systemd.services.` with `Type=oneshot` + `RemainAfterExit`, `after`/`requires` swarm-init + docker, `wants` network-online, `wantedBy` multi-user, embedding its script via **`pkgs.writeShellApplication`** (self-contained in the store, not a `/root/cc-ci` path). The script **reconciles** (inspect → converge → no-op if correct) on *every* activation/boot — **no run-once sentinel** — so it self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit) on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to `git clone` + `nixos-rebuild switch` + operator preconditions, no manual post-steps. The old `scripts/deploy-*.sh` were folded into these modules and removed. `pkgs.abra` is provided via an overlay (`modules/packages.nix`) so all modules share the one pinned build.
  - *Cert rotation note:* the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the wildcard means bumping `SECRET_WILDCARD_*_VERSION` (operator) so the next reconcile re-inserts. Documented in docs/secrets.md at M7.
- **Trigger: POLLING primary, webhook optional — SETTLED (orchestrator design change 2026-05-27, supersedes the earlier "keep webhook, do NOT pivot to polling" steer).** Hard constraint: the bot/server runs at **READ level, never repo-admin**, and **never self-registers a webhook**.
  - **Polling is PRIMARY and the source of truth for D1.** The bridge polls each enrolled repo's open PRs for new `!testme` comments every `POLL_INTERVAL` (30s ≤ 60s). Outbound (cc-ci → git.autonomic.zone, the reliably-working direction), needs only read+comment. On startup the first poll marks pre-existing comments seen so it doesn't fire on old comments.
  - **Webhook is an OPTIONAL push optimization.** The `/hook` endpoint stays live (HMAC-verified) so an *admin-registered* `issue_comment` webhook lowers latency, but the bridge never registers one. Manual registration is documented in `docs/enroll-recipe.md`. Both paths share an in-memory seen-set keyed by comment id → a comment seen by both fires at most once.
  - **Commenter authorization via org membership (read-level, no admin).** Allowed iff `GET /orgs/{owner}/members/{user}` → 204 (verified 2026-05-27: admits bot/trav/notplants, 404 for a non-member, works with bot read-level basic-auth) **or** the user is in the optional `AUTH_ALLOWLIST`. Replaces the earlier `/collaborators/{user}/permission` check, which needs repo-admin. Fail-closed on any error.
  - **Enrollment** = add the repo to the bridge `POLL_REPOS` csv + ensure `tests//` exists. No webhook required for CI to work. (Why root cause of the old webhook non-delivery doesn't matter: polling makes it irrelevant; the operator was whitelisting `ci.commoninternet.net` in Gitea's `ALLOWED_HOST_LIST`, but D1 no longer depends on that.)
- **Resource safety: bound live test apps — SETTLED (orchestrator design change 2026-05-27, plan §4.2/§4.3).** Do NOT keep multiple test apps deployed at once.
  Three layers, all configurable:
  - **MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1** (`modules/drone-runner.nix`, `maxTests` let-binding). Drone runs at most MAX_TESTS builds at once and **auto-queues the rest in its native pending queue** — no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
  - **Per-build TIMEOUT = 60 min** (`modules/drone.nix`, `buildTimeoutMinutes`; reconciled best-effort via `PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60}` using the bridge's Drone admin token, local `--resolve`, non-fatal). A build over the limit is cancelled by Drone → the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue once a test finishes OR times out".
  - **Teardown + janitor backstop.** Each build deploys → runs the 3 stages → undeploys (guaranteed `try/finally` in `conftest`/orchestrator). A SIGKILL'd/timed-out build can't run its own teardown, so the **run-start janitor** (`lifecycle.janitor`, called before every deploy in both fixtures + `run_recipe_ci`) reaps orphaned run apps as the backstop. At capacity=1 the CI path will set `CCCI_JANITOR_MAX_AGE=0` (reap any orphan immediately — safe with no concurrent runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default 2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
  - Optional `concurrency: {limit: 1}` in the recipe-CI `.drone.yml` is a redundant belt — primary mechanism is `DRONE_RUNNER_CAPACITY`. (Wired when the recipe-CI pipeline lands — see backlog.)

## Open (defaults from §8, to confirm as reality lands)

- **Deploy mechanism — SETTLED (M0):** `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run *on cc-ci itself*, with the repo materialised on the host at `/root/cc-ci`.
  Chosen over `--target-host`/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS proxy (slow/fragile). Atomic rollback preserved by Nix generations (`nixos-rebuild switch --rollback`). The switch is launched as a **detached transient systemd unit** (`systemd-run --unit=ccci-rebuild --collect`) so it survives a momentary ssh-over-tailscale drop during activation. For the build loop the host copy is synced from the sandbox clone via `tar | ssh` (rsync absent on host); source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo on a fresh host, then `nixos-rebuild switch --flake .#cc-ci`).
- **nixpkgs pin:** flake pins the exact rev cc-ci already ran (`50ab793…`) so the first rebuild is a true no-op-then-base. Bump deliberately, never drift.
- **Webhook scope:** default per-repo via enroll script.
- **CI engine: Drone (per plan) — kept, with a noted risk.** nixpkgs 24.11 has Drone **server** 2.24.0 but `drone-runner-exec` is **abandoned (unstable-2020-04-19)** — the only exec runner Drone ever shipped (upstream archived ~2021). The maintained fork **Woodpecker** (2.7.3, with NixOS modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific (D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern Drone server (RPC protocol stable). **Fallback:** if the exec runner proves incompatible/broken, pivot to Woodpecker (coop-cloud ships a `woodpecker` recipe too) and record it — like the traefik pivot. Re-evaluate at the M2 gate.
- **Drone deployment shape — SETTLED (M2):** mirror the traefik pattern. The **server** is the coop-cloud `drone` recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by traefik at `drone.ci.commoninternet.net`, `LETS_ENCRYPT_ENV` empty → wildcard cert, no ACME), with Gitea SSO (`compose.gitea.yml`).
  The **exec runner** runs as a Nix systemd service on the host (`modules/drone-runner.nix`) so it can drive host abra/swarm (plan §4.2). One generated `DRONE_RPC_SECRET` is shared: inserted as the server's `rpc_secret` swarm secret AND read by the runner from sops. Reproducible deploy: `scripts/deploy-drone.sh`.
  - Gitea OAuth app `cc-ci-drone` created under the bot (client_id `ab4cdb9d-ee96-4867-875f-87384505fc52`, redirect `https://drone.ci.commoninternet.net/login`); client_secret + rpc_secret stored sops-encrypted in `secrets/secrets.yaml` (A2 internal secrets).
- **Drone runner type:** exec (must drive host abra).
- **Secret tool — SETTLED (M0):** sops-nix. cc-ci decrypts at activation using its **ed25519 SSH host key** as the age identity (`sops.age.sshKeyPaths`), so no extra key file to manage on the box. Recipients in `/.sops.yaml`: the host age key (`age1h90ut…`, from ssh-to-age) + an off-box **master recovery key** (`age1cmk26t…`; private half only at `/srv/cc-ci/.sops/master-age.txt` on the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing plaintext into `secrets/.yaml` then `sops -e -i` (run inside the repo so `.sops.yaml` is found).
- **D10 recipe set:** lock six early. Candidates favouring already-mirrored: custom-html (simple), cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3), bluesky-pds (TLS-passthrough) — covers all five categories. Confirm during M4–M6.5.
- **Per-run app domain scheme — adapted (M4, deviates from plan §4.0).** Plan §4.0 wanted `-pr-.ci.commoninternet.net`, but Docker swarm config/secret names (`__`) must be ≤ 64 chars and abra derives `` from the domain (dots→`_`, hyphens kept). `.ci.commoninternet.net` alone is 22 chars, so long recipe names + config names overflow 64 (hit with `custom-html-pr0-m4demo…_nginx_default_conf_v6` = 66). New scheme: **`-<6hex(recipe|pr|ref)>.ci.commoninternet.net`** (e.g.
  `cust-e084bd`) — short, unique per run, collision-safe across recipes (full recipe in the hash). Human-readable recipe/PR/ref context lives in the Drone build params + the PR comment, not the (ephemeral) domain.
- **abra recipe checkout is volatile — harness uses chaos+offline + a tests/ snapshot (M6).** Many abra commands (`app ls`, `secret generate` without flags, version resolution) silently `git checkout ` in `~/.abra/recipes/`, discarding a PR branch's files. To test the *PR head code* (not a re-resolved tag): (1) `fetch_recipe` clones the mirror branch/ref (private → bot token via per-command `http.extraHeader`, never persisted/logged); (2) all harness abra calls that touch the recipe pass `-C` (chaos: use current checkout) `-o` (offline: no remote fetch); (3) recipe-shipped `tests/` (D4) are **snapshotted to a temp dir right after fetch**, since later abra commands still reset the checkout — the recipe-local stage runs from the snapshot.

## Risks

- **Disk — RESOLVED 2026-05-26.** Original 8.9 GiB root had only ~3.8 GiB free *and* a hard **inode** ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on inodes before bytes. Operator grew the VM to **28 GiB** (22 GiB free, 1.78M inodes / 1.21M free); the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown + periodic `docker image prune` to avoid regressing during M6.5 breadth.

## Dead-ends

- (none yet)
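
## Sketches (illustrative, non-normative)

The per-run app domain scheme recorded under Open (M4) can be sketched roughly as follows. This is a minimal illustration, not the harness's code. Several details are assumptions: the hash function (SHA-256), the 4-character recipe prefix (inferred only from the `cust-e084bd` example), the `<stack>_<name>_<version>` layout of the swarm object name, and the helper names `run_domain` and `swarm_object_name`. It is meant to show why the short form stays within the 64-character swarm limit.

```python
import hashlib

BASE_DOMAIN = "ci.commoninternet.net"
SWARM_NAME_MAX = 64  # swarm config/secret name limit cited in the decision


def run_domain(recipe: str, pr: str, ref: str, prefix_len: int = 4) -> str:
    """Build a short per-run domain: <prefix>-<6 hex>.<base domain>.

    Assumptions (not confirmed by the source): SHA-256 as the hash,
    '|' as the field separator, and the first `prefix_len` characters
    of the recipe name as the human-readable prefix.
    """
    # The full recipe name goes into the hash, so two recipes sharing a
    # prefix still get different suffixes (barring a 24-bit collision).
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()[:6]
    return f"{recipe[:prefix_len]}-{digest}.{BASE_DOMAIN}"


def swarm_object_name(domain: str, name: str, version: str) -> str:
    """Approximate the swarm object name derived from the domain.

    Hypothetical layout: dots in the domain become '_', hyphens are
    kept, then the config/secret name and version are appended.
    """
    stack = domain.replace(".", "_")
    return f"{stack}_{name}_{version}"


if __name__ == "__main__":
    domain = run_domain("custom-html", "0", "m4demo")
    full = swarm_object_name(domain, "nginx_default_conf", "v6")
    print(domain, len(full), len(full) <= SWARM_NAME_MAX)
```

With a 4-character prefix the domain is always 33 characters (4 + 1 + 6 + the 22-character suffix), so the `nginx_default_conf` / `v6` config from the overflow case comes to 55 characters under this layout, leaving headroom below 64.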