# DECISIONS — cc-ci Builder

Architecture decisions and dead-ends. One line of rationale each. (§0, §8)

## Settled

- **Wildcard TLS:** operator pre-issues wildcard cert at `/var/lib/ci-certs/live/`; Traefik file
  provider serves it; **no ACME** for commoninternet.net. (Plan §4.0/§8 — fixed.)
- **Repo:** `git.autonomic.zone/recipe-maintainers/cc-ci`, private. Bot is org admin. (Bootstrap.)
- **Git credentials:** helper script in repo-local git config sources `/srv/cc-ci/.testenv` at call
  time — no secret values stored in `.git/config` or commits.

- **Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26,
  overrides plan §3 `modules/traefik.nix`).** Instead of a hand-rolled Traefik we deploy the
  canonical Co-op Cloud `traefik` recipe via abra in **wildcard / file-provider mode**, for
  end-to-end fidelity (canonical `web`/`web-secure` entrypoints + proxy/swarm conventions every
  recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO
  DNS token on the box:
  - `WILDCARDS_ENABLED=1` + append `compose.wildcard.yml`; the pre-issued cert is fed as the
    `ssl_cert`/`ssl_key` swarm secrets (v1) via `abra app secret insert … -f` from
    `/var/lib/ci-certs/live/{fullchain,privkey}.pem`. The file provider serves it (`tls.certificates`).
  - `LETS_ENCRYPT_ENV=` **empty** on the traefik app *and* on every test app → the recipe's
    `tls.certresolver=${LETS_ENCRYPT_ENV}` label resolves to no resolver → routers serve the
    wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
  - Reproducibility (D8): `scripts/deploy-proxy.sh` is idempotent (ensures local abra server, fetches
    recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in
    `docs/install.md`. The custom `modules/traefik.nix` was removed; `modules/swarm.nix` keeps swarm
    init + `proxy` net + firewall 80/443.
  - **Renewal (manual, ~90d):** operator re-issues the wildcard at the same paths, then
    `abra app secret rm traefik.ci.commoninternet.net ssl_cert -n` + re-insert at a new version (bump
    `SECRET_WILDCARD_CERT_VERSION`) and redeploy. (Documented in docs/secrets.md at M7.)
  - **abra teardown syntax** (for harness, §4.3): `abra app undeploy <d> -n`,
    `abra app volume remove <d> -f -n`, `abra app secret remove <d> --all -n`. None take `--chaos`.
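
A minimal sketch of how the harness might assemble the teardown sequence above. The function names and the injectable `run` callable are hypothetical. Only the three abra command lines come from the notes. Running all three steps even when one fails is an assumption, not recorded behaviour.

```python
import subprocess
from typing import Callable, List, Sequence


def teardown_commands(domain: str) -> List[List[str]]:
    """Return the abra argv lists for tearing down one app, in order."""
    return [
        ["abra", "app", "undeploy", domain, "-n"],
        ["abra", "app", "volume", "remove", domain, "-f", "-n"],
        ["abra", "app", "secret", "remove", domain, "--all", "-n"],
    ]


def _run(argv: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def run_teardown(domain: str, run: Callable[[Sequence[str]], int] = _run) -> List[int]:
    """Run every teardown step (assumed best-effort) and collect the exit codes."""
    return [run(argv) for argv in teardown_commands(domain)]
```

Passing a fake `run` keeps the sequence testable without abra installed. None of the generated commands carries `--chaos`, matching the note above.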

- **Infra bring-up = idempotent-reconcile systemd oneshots — SETTLED (M2, orchestrator steer
  2026-05-26).** Every piece of swarm infra that abra deploys (traefik `modules/proxy.nix`, Drone
  `modules/drone.nix`, later comment-bridge + dashboard) is a `systemd.services.<x>` with
  `Type=oneshot` + `RemainAfterExit`, `after`/`requires` swarm-init + docker, `wants`
  network-online, `wantedBy` multi-user, embedding its script via **`pkgs.writeShellApplication`**
  (self-contained in the store, not a `/root/cc-ci` path). The script **reconciles** (inspect →
  converge → no-op if correct) on *every* activation/boot — **no run-once sentinel** — so it
  self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit)
  on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to
  `git clone` + `nixos-rebuild switch` + operator preconditions, no manual post-steps. The old
  `scripts/deploy-*.sh` were folded into these modules and removed. `pkgs.abra` is provided via an
  overlay (`modules/packages.nix`) so all modules share the one pinned build.
  - *Cert rotation note:* the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the
    wildcard means bumping `SECRET_WILDCARD_*_VERSION` (operator) so the next reconcile re-inserts.
    Documented in docs/secrets.md at M7.
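
The reconcile shape (inspect, converge, no-op if correct, fail visibly on a missing precondition) can be illustrated with a small language-neutral sketch. The real units embed shell scripts via `pkgs.writeShellApplication`. The Python below only models the control flow, and `Step`, `reconcile`, and `PreconditionError` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


class PreconditionError(RuntimeError):
    """Raised when an operator-supplied precondition (e.g. the cert) is missing."""


@dataclass
class Step:
    name: str
    is_converged: Callable[[], bool]  # inspect: is the live state already correct?
    converge: Callable[[], None]      # act: bring the live state to the desired one


def reconcile(steps: List[Step], precondition_ok: Callable[[], bool] = lambda: True) -> List[str]:
    """Run on every activation; return the names of the steps that had to act."""
    if not precondition_ok():
        # Failing loudly here corresponds to a failed systemd unit.
        raise PreconditionError("precondition missing")
    changed = []
    for step in steps:
        if step.is_converged():
            continue  # already correct: no-op
        step.converge()
        changed.append(step.name)
    return changed
```

Because no sentinel file records "already ran", a second call returns an empty list when nothing drifted and repeats only the steps whose inspection fails.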

- **Trigger: POLLING primary, webhook optional — SETTLED (orchestrator design change 2026-05-27,
  supersedes the earlier "keep webhook, do NOT pivot to polling" steer).** Hard constraint: the
  bot/server runs at **READ level, never repo-admin**, and **never self-registers a webhook**.
  - **Polling is PRIMARY and the source of truth for D1.** The bridge polls each enrolled repo's
    open PRs for new `!testme` comments every `POLL_INTERVAL` (30s, within the ≤ 60s bound). Outbound
    (cc-ci → git.autonomic.zone, the reliably-working direction), needs only read+comment. On
    startup the first poll marks pre-existing comments seen so it doesn't fire on old comments.
  - **Webhook is an OPTIONAL push optimization.** The `/hook` endpoint stays live (HMAC-verified)
    so an *admin-registered* `issue_comment` webhook lowers latency, but the bridge never registers
    one. Manual registration is documented in `docs/enroll-recipe.md`. Both paths share an
    in-memory seen-set keyed by comment id → a comment seen by both fires at most once.
  - **Commenter authorization via org membership (read-level, no admin).** Allowed iff
    `GET /orgs/{owner}/members/{user}` → 204 (verified 2026-05-27: admits bot/trav/notplants, 404
    for a non-member, works with bot read-level basic-auth) **or** the user is in the optional
    `AUTH_ALLOWLIST`. Replaces the earlier `/collaborators/{user}/permission` check, which needs
    repo-admin. Fail-closed on any error.
  - **Enrollment** = add the repo to the bridge `POLL_REPOS` csv + ensure `tests/<recipe>/` exists.
    No webhook is required for CI to work. (The root cause of the old webhook non-delivery no
    longer matters: polling makes it irrelevant. The operator was whitelisting
    `ci.commoninternet.net` in Gitea's `ALLOWED_HOST_LIST`, but D1 no longer depends on that.)
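
A minimal sketch of the two pieces of bridge logic described above: the shared seen-set and the fail-closed authorization check. `SeenSet`, `prime`, `claim`, and `is_authorized` are hypothetical names. The HTTP call itself is left out, so `membership_status` stands for the status code of `GET /orgs/{owner}/members/{user}`, or `None` if the request failed. Checking the allowlist before the membership result is an assumption.

```python
from typing import AbstractSet, Iterable, Optional, Set


class SeenSet:
    """In-memory set of comment ids shared by the poll path and the /hook path."""

    def __init__(self) -> None:
        self._seen: Set[int] = set()

    def prime(self, comment_ids: Iterable[int]) -> None:
        """First poll after startup: mark pre-existing comments seen without firing."""
        self._seen.update(comment_ids)

    def claim(self, comment_id: int) -> bool:
        """Return True only the first time an id is offered, whichever path offers it."""
        if comment_id in self._seen:
            return False
        self._seen.add(comment_id)
        return True


def is_authorized(
    user: str,
    membership_status: Optional[int],
    allowlist: AbstractSet[str] = frozenset(),
) -> bool:
    """Allow iff the user is allowlisted or the membership lookup returned 204."""
    if user in allowlist:
        return True
    # Anything other than 204 (404, 5xx, or None for a failed request) denies.
    return membership_status == 204
```

A comment delivered by the webhook and later found by the poller calls `claim` twice and fires once. Because the set is in memory only, `prime` has to run again on every startup.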

- **Resource safety: bound live test apps — SETTLED (orchestrator design change 2026-05-27,
  plan §4.2/§4.3).** Do NOT keep multiple test apps deployed at once. Three layers, all configurable:
  - **MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1** (`modules/drone-runner.nix`, `maxTests` let-binding).
    Drone runs at most MAX_TESTS builds at once and **auto-queues the rest in its native pending
    queue** — no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is
    never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
  - **Per-build TIMEOUT = 60 min** (`modules/drone.nix`, `buildTimeoutMinutes`; reconciled
    best-effort via `PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60}` using the bridge's
    Drone admin token, local `--resolve`, non-fatal). A build over the limit is cancelled by Drone →
    the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue
    once a test finishes OR times out".
  - **Teardown + janitor backstop.** Each build deploys → runs the 3 stages → undeploys
    (guaranteed `try/finally` in `conftest`/orchestrator). A SIGKILL'd/timed-out build can't run its
    own teardown, so the **run-start janitor** (`lifecycle.janitor`, called before every deploy in
    both fixtures + `run_recipe_ci`) reaps orphaned run apps as the backstop. At capacity=1 the CI
    path will set `CCCI_JANITOR_MAX_AGE=0` (reap any orphan immediately — safe with no concurrent
    runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default
    2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
  - Optional `concurrency: {limit: 1}` in the recipe-CI `.drone.yml` is a redundant belt — primary
    mechanism is `DRONE_RUNNER_CAPACITY`. (Wired when the recipe-CI pipeline lands — see backlog.)
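
The janitor's age rule can be sketched as a pure function. This is illustrative only: `orphans_to_reap` and the `deployed_at` mapping (run-app domain to deploy time in epoch seconds) are hypothetical, and how `lifecycle.janitor` actually discovers apps and their ages is not recorded here. The 2h default and the meaning of `CCCI_JANITOR_MAX_AGE=0` follow the text above.

```python
import time
from typing import Dict, List, Optional

DEFAULT_MAX_AGE = 2 * 60 * 60  # seconds: the age-based default (2h)


def orphans_to_reap(
    deployed_at: Dict[str, float],
    max_age: float = DEFAULT_MAX_AGE,
    now: Optional[float] = None,
) -> List[str]:
    """Return the run-app domains whose age is at least max_age seconds.

    max_age=0 selects every app found, which is only safe when no other
    run can be in flight (capacity=1).
    """
    now = time.time() if now is None else now
    return sorted(d for d, t in deployed_at.items() if now - t >= max_age)
```

With capacity above 1, a fresh concurrent run has an age well under the default, so the age-based rule leaves it alone.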

## Open (defaults from §8, to confirm as reality lands)

- **Deploy mechanism — SETTLED (M0):** `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run *on
  cc-ci itself*, with the repo materialised on the host at `/root/cc-ci`. Chosen over
  `--target-host`/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS
  proxy (slow/fragile). Atomic rollback preserved by Nix generations (`nixos-rebuild --rollback`).
  The switch is launched as a **detached transient systemd unit** (`systemd-run --unit=ccci-rebuild
  --collect`) so it survives a momentary ssh-over-tailscale drop during activation. For the build
  loop the host copy is synced from the sandbox clone via `tar | ssh` (rsync absent on host);
  source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo
  on a fresh host, then `nixos-rebuild switch --flake .#cc-ci`).
  - **nixpkgs pin:** flake pins the exact rev cc-ci already ran (`50ab793…`) so the first rebuild
    is a true no-op-then-base. Bump deliberately, never drift.
- **Webhook scope:** default per-repo via enroll script.
- **CI engine: Drone (per plan) — kept, with a noted risk.** nixpkgs 24.11 has Drone **server**
  2.24.0 but `drone-runner-exec` is **abandoned (unstable-2020-04-19)** — the only exec runner Drone
  ever shipped (upstream archived ~2021). The maintained fork **Woodpecker** (2.7.3, with NixOS
  modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific
  (D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern
  Drone server (RPC protocol stable). **Fallback:** if the exec runner proves incompatible/broken,
  pivot to Woodpecker (coop-cloud ships a `woodpecker` recipe too) and record it — like the traefik
  pivot. Re-evaluate at the M2 gate.
- **Drone deployment shape — SETTLED (M2):** mirror the traefik pattern. The **server** is the
  coop-cloud `drone` recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by
  traefik at `drone.ci.commoninternet.net`, `LETS_ENCRYPT_ENV` empty → wildcard cert, no ACME),
  with Gitea SSO (`compose.gitea.yml`). The **exec runner** runs as a Nix systemd service on the
  host (`modules/drone-runner.nix`) so it can drive host abra/swarm (plan §4.2). One generated
  `DRONE_RPC_SECRET` is shared: inserted as the server's `rpc_secret` swarm secret AND read by the
  runner from sops. Reproducible deploy: `scripts/deploy-drone.sh`.
  - Gitea OAuth app `cc-ci-drone` created under the bot (client_id
    `ab4cdb9d-ee96-4867-875f-87384505fc52`, redirect `https://drone.ci.commoninternet.net/login`);
    client_secret + rpc_secret stored sops-encrypted in `secrets/secrets.yaml` (A2 internal
    secrets).
- **Drone runner type:** exec (must drive host abra).
- **Secret tool — SETTLED (M0):** sops-nix. cc-ci decrypts at activation using its **ed25519 SSH
  host key** as the age identity (`sops.age.sshKeyPaths`), so no extra key file to manage on the box.
  Recipients in `/.sops.yaml`: the host age key (`age1h90ut…`, from ssh-to-age) + an off-box
  **master recovery key** (`age1cmk26t…`; private half only at `/srv/cc-ci/.sops/master-age.txt` on
  the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing
  plaintext into `secrets/<f>.yaml` then `sops -e -i` (run inside the repo so `.sops.yaml` is found).
- **D10 recipe set:** lock six early. Candidates favouring already-mirrored: custom-html (simple),
  cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3),
  bluesky-pds (TLS-passthrough) — covers all five categories. Confirm during M4–M6.5.

- **Per-run app domain scheme — adapted (M4, deviates from plan §4.0).** Plan §4.0 wanted
  `<recipe>-pr<n>-<short-sha>.ci.commoninternet.net`, but Docker swarm config/secret names
  (`<stackname>_<resource>_<version>`) must be ≤ 64 chars and abra derives `<stackname>` from the
  domain (dots→`_`, hyphens kept). `.ci.commoninternet.net` alone is 22 chars, so long recipe names
  + config names overflow 64 (hit with `custom-html-pr0-m4demo…_nginx_default_conf_v6` = 66). New
  scheme: **`<recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net`** (e.g. `cust-e084bd`) — short,
  unique per run, collision-safe across recipes (full recipe in the hash). Human-readable recipe/PR/
  ref context lives in the Drone build params + the PR comment, not the (ephemeral) domain.
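
A sketch of the naming scheme and the 64-character check. The hash function is not stated in these notes, so SHA-256 here is an assumption and the output will not necessarily reproduce `cust-e084bd`. The function names are hypothetical.

```python
import hashlib

BASE_DOMAIN = "ci.commoninternet.net"
SWARM_NAME_LIMIT = 64  # max length of <stackname>_<resource>_<version>


def run_domain(recipe: str, pr: str, ref: str) -> str:
    """Build <recipe[:4]>-<6hex(recipe|pr|ref)>.<base domain>."""
    # Assumption: SHA-256 over the "|"-joined fields; the real hash may differ.
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}.{BASE_DOMAIN}"


def stack_name(domain: str) -> str:
    """Mirror the derivation described above: dots become "_", hyphens are kept."""
    return domain.replace(".", "_")


def fits_swarm_limit(domain: str, resource: str, version: str) -> bool:
    """Check that the full swarm config/secret name stays within the limit."""
    return len(f"{stack_name(domain)}_{resource}_{version}") <= SWARM_NAME_LIMIT
```

Under this scheme every domain is 33 characters (4 + 1 + 6 + the 22-character suffix), so the stack name uses 33 of the 64 and leaves 31 for `_<resource>_<version>`.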

- **abra recipe checkout is volatile — harness uses chaos+offline + a tests/ snapshot (M6).** Many
  abra commands (`app ls`, `secret generate` without flags, version resolution) silently
  `git checkout <version-tag>` in `~/.abra/recipes/<recipe>`, discarding a PR branch's files. To
  test the *PR head code* (not a re-resolved tag): (1) `fetch_recipe` clones the mirror branch/ref
  (private → bot token via per-command `http.extraHeader`, never persisted/logged); (2) all harness
  abra calls that touch the recipe pass `-C` (chaos: use current checkout) `-o` (offline: no remote
  fetch); (3) recipe-shipped `tests/` (D4) are **snapshotted to a temp dir right after fetch**, since
  later abra commands still reset the checkout — the recipe-local stage runs from the snapshot.

## Risks

- **Disk — RESOLVED 2026-05-26.** Original 8.9 GiB root had only ~3.8 GiB free *and* a hard
  **inode** ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on
  inodes before bytes. Operator grew the VM to **28 GiB** (22 GiB free, 1.78M inodes / 1.21M free);
  the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown +
  periodic `docker image prune` to avoid regressing during M6.5 breadth.

## Dead-ends

- (none yet)