cc-ci/DECISIONS.md
autonomic-bot 72ff8e213d
resource safety: MAX_TESTS=capacity=1 + per-build 60m timeout (orchestrator design change)
Bound live test apps on the single 28GiB node. DRONE_RUNNER_CAPACITY=1 (MAX_TESTS)
caps concurrent builds; Drone auto-queues the rest natively. deploy-drone reconcile
sets the cc-ci repo build timeout to 60m (best-effort PATCH, non-fatal) so a hung
build is killed and frees its slot. Janitor remains the backstop for SIGKILL'd builds.

Verified on host: DRONE_RUNNER_CAPACITY=1; repo timeout=60 via Drone API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 02:53:29 +01:00

# DECISIONS — cc-ci Builder
Architecture decisions and dead-ends. One line of rationale each. (§0, §8)
## Settled
- **Wildcard TLS:** operator pre-issues wildcard cert at `/var/lib/ci-certs/live/`; Traefik file
provider serves it; **no ACME** for commoninternet.net. (Plan §4.0/§8 — fixed.)
- **Repo:** `git.autonomic.zone/recipe-maintainers/cc-ci`, private. Bot is org admin. (Bootstrap.)
- **Git credentials:** helper script in repo-local git config sources `/srv/cc-ci/.testenv` at call
time — no secret values stored in `.git/config` or commits.
- **Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26,
overrides plan §3 `modules/traefik.nix`).** Instead of a hand-rolled Traefik we deploy the
canonical Co-op Cloud `traefik` recipe via abra in **wildcard / file-provider mode**, for
end-to-end fidelity (canonical `web`/`web-secure` entrypoints + proxy/swarm conventions every
recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO
DNS token on the box:
- `WILDCARDS_ENABLED=1` + append `compose.wildcard.yml`; the pre-issued cert is fed as the
`ssl_cert`/`ssl_key` swarm secrets (v1) via `abra app secret insert … -f` from
`/var/lib/ci-certs/live/{fullchain,privkey}.pem`. The file provider serves it (`tls.certificates`).
- `LETS_ENCRYPT_ENV=` **empty** on the traefik app *and* on every test app → the recipe's
`tls.certresolver=${LETS_ENCRYPT_ENV}` label resolves to no resolver → routers serve the
wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
- Reproducibility (D8): `scripts/deploy-proxy.sh` is idempotent (ensures local abra server, fetches
recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in
`docs/install.md`. The custom `modules/traefik.nix` was removed; `modules/swarm.nix` keeps swarm
init + `proxy` net + firewall 80/443.
- **Renewal (manual, ~90d):** operator re-issues the wildcard at the same paths, then
`abra app secret rm traefik.ci.commoninternet.net ssl_cert -n` + re-insert at a new version (bump
`SECRET_WILDCARD_CERT_VERSION`) and redeploy. (Documented in docs/secrets.md at M7.)
- **abra teardown syntax** (for harness, §4.3): `abra app undeploy <d> -n`,
`abra app volume remove <d> -f -n`, `abra app secret remove <d> --all -n`. None take `--chaos`.
- **Infra bring-up = idempotent-reconcile systemd oneshots — SETTLED (M2, orchestrator steer
2026-05-26).** Every piece of swarm infra that abra deploys (traefik `modules/proxy.nix`, Drone
`modules/drone.nix`, later comment-bridge + dashboard) is a `systemd.services.<x>` with
`Type=oneshot` + `RemainAfterExit`, `after`/`requires` swarm-init + docker, `wants`
network-online, `wantedBy` multi-user, embedding its script via **`pkgs.writeShellApplication`**
(self-contained in the store, not a `/root/cc-ci` path). The script **reconciles** (inspect →
converge → no-op if correct) on *every* activation/boot — **no run-once sentinel** — so it
self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit)
on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to
`git clone` + `nixos-rebuild switch` + operator preconditions, no manual post-steps. The old
`scripts/deploy-*.sh` were folded into these modules and removed. `pkgs.abra` is provided via an
overlay (`modules/packages.nix`) so all modules share the one pinned build.
- *Cert rotation note:* the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the
wildcard means bumping `SECRET_WILDCARD_*_VERSION` (operator) so the next reconcile re-inserts.
Documented in docs/secrets.md at M7.
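
The inspect → converge → no-op loop described above can be sketched in Python. This is an illustration of the pattern only, not the modules' actual scripts (those are shell, embedded via `pkgs.writeShellApplication`); the `Resource` type and its `inspect`/`converge` callables are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Resource:
    """One piece of infra to reconcile (hypothetical shape, for illustration)."""
    name: str
    inspect: Callable[[], bool]    # True if already in the desired state
    converge: Callable[[], None]   # brings the resource to the desired state


def reconcile(resources: list[Resource]) -> list[str]:
    """Run on every activation and converge only what has drifted.

    There is no run-once sentinel: state is re-inspected each time, so a
    missing stack or secret is repaired on the next run. Returns the names
    of the resources that had to be converged.
    """
    changed = []
    for res in resources:
        if res.inspect():
            continue               # already correct: no-op
        res.converge()
        if not res.inspect():
            # Fail visibly, the way a failed systemd unit would.
            raise RuntimeError(f"{res.name}: still not converged")
        changed.append(res.name)
    return changed
```

Calling `reconcile` twice in a row on the same resources should converge the drifted ones on the first call and return an empty list on the second, which is the property that makes it safe to run on every boot.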
- **Trigger: POLLING primary, webhook optional — SETTLED (orchestrator design change 2026-05-27,
supersedes the earlier "keep webhook, do NOT pivot to polling" steer).** Hard constraint: the
bot/server runs at **READ level, never repo-admin**, and **never self-registers a webhook**.
- **Polling is PRIMARY and the source of truth for D1.** The bridge polls each enrolled repo's
open PRs for new `!testme` comments every `POLL_INTERVAL` (30s, within the ≤ 60s limit). Polling is
outbound (cc-ci → git.autonomic.zone, the reliably-working direction) and needs only read+comment. On
startup, the first poll marks pre-existing comments as seen so it does not fire on old comments.
- **Webhook is an OPTIONAL push optimization.** The `/hook` endpoint stays live (HMAC-verified)
so an *admin-registered* `issue_comment` webhook lowers latency, but the bridge never registers
one. Manual registration is documented in `docs/enroll-recipe.md`. Both paths share an
in-memory seen-set keyed by comment id → a comment seen by both fires at most once.
- **Commenter authorization via org membership (read-level, no admin).** Allowed iff
`GET /orgs/{owner}/members/{user}` → 204 (verified 2026-05-27: admits bot/trav/notplants, 404
for a non-member, works with bot read-level basic-auth) **or** the user is in the optional
`AUTH_ALLOWLIST`. Replaces the earlier `/collaborators/{user}/permission` check, which needs
repo-admin. Fail-closed on any error.
- **Enrollment** = add the repo to the bridge `POLL_REPOS` csv + ensure `tests/<recipe>/` exists.
No webhook is required for CI to work. (The root cause of the old webhook non-delivery no longer
matters because polling makes it irrelevant: the operator was whitelisting `ci.commoninternet.net` in
Gitea's `ALLOWED_HOST_LIST`, but D1 no longer depends on that.)
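
As a rough sketch of how the shared seen-set, the first-poll priming, and the fail-closed authorization could fit together: the class below is hypothetical (`CommentTrigger`, its method names, and the `{"id", "body", "user"}` comment shape are assumptions, not the bridge's real code), and the membership lookup is injected as a callable rather than calling the Gitea API directly. How the bridge actually matches `!testme` in a comment body is not stated above; a prefix match is assumed here.

```python
from typing import Callable, Iterable, Optional


class CommentTrigger:
    """Deduplicates `!testme` comments across the polling and webhook paths."""

    def __init__(self, is_org_member: Callable[[str], bool],
                 allowlist: Iterable[str] = ()):
        self._seen = set()                    # in-memory, keyed by comment id
        self._primed = False                  # True once the first poll has run
        self._is_org_member = is_org_member   # e.g. wraps the org-membership GET
        self._allowlist = set(allowlist)      # optional AUTH_ALLOWLIST

    def authorized(self, user: str) -> bool:
        """Org member or allowlisted; any lookup error denies (fail-closed)."""
        if user in self._allowlist:
            return True
        try:
            return bool(self._is_org_member(user))
        except Exception:
            return False

    def _fire_once(self, comment: dict) -> bool:
        cid = comment["id"]
        if cid in self._seen:
            return False                      # already handled by either path
        self._seen.add(cid)
        is_trigger = comment["body"].strip().startswith("!testme")  # assumed match rule
        return is_trigger and self.authorized(comment["user"])

    def on_poll(self, comments: list) -> list:
        """One polling pass; the first pass only records what already exists."""
        if not self._primed:
            self._seen.update(c["id"] for c in comments)
            self._primed = True
            return []
        return [c["id"] for c in comments if self._fire_once(c)]

    def on_webhook(self, comment: dict) -> Optional[int]:
        """One webhook delivery (assumed to be HMAC-verified by the caller)."""
        return comment["id"] if self._fire_once(comment) else None
```

Because both entry points go through `_fire_once`, a comment delivered by webhook and then picked up by the next poll yields a build id only once.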
- **Resource safety: bound live test apps — SETTLED (orchestrator design change 2026-05-27,
plan §4.2/§4.3).** Do NOT keep multiple test apps deployed at once. Three layers, all configurable:
- **MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1** (`modules/drone-runner.nix`, `maxTests` let-binding).
Drone runs at most MAX_TESTS builds at once and **auto-queues the rest in its native pending
queue** — no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is
never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
- **Per-build TIMEOUT = 60 min** (`modules/drone.nix`, `buildTimeoutMinutes`; reconciled
best-effort via `PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60}` using the bridge's
Drone admin token, local `--resolve`, non-fatal). A build over the limit is cancelled by Drone →
the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue
once a test finishes OR times out".
- **Teardown + janitor backstop.** Each build deploys → runs the 3 stages → undeploys
(guaranteed `try/finally` in `conftest`/orchestrator). A SIGKILL'd/timed-out build can't run its
own teardown, so the **run-start janitor** (`lifecycle.janitor`, called before every deploy in
both fixtures + `run_recipe_ci`) reaps orphaned run apps as the backstop. At capacity=1 the CI
path will set `CCCI_JANITOR_MAX_AGE=0` (reap any orphan immediately — safe with no concurrent
runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default
2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
- Optional `concurrency: {limit: 1}` in the recipe-CI `.drone.yml` is a redundant belt — primary
mechanism is `DRONE_RUNNER_CAPACITY`. (Wired when the recipe-CI pipeline lands — see backlog.)
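
The interaction between the teardown guarantee and the janitor's age threshold can be illustrated with a small Python sketch. The names `RunApp`, `orphans_to_reap`, and `run_build`, and the `deployed_at` field, are hypothetical; the real logic lives in `lifecycle.janitor` and the fixtures and may differ. Wrapping the deploy step inside the `try` is also an assumption made here so that a partial deploy is still torn down.

```python
from dataclasses import dataclass

DEFAULT_MAX_AGE_S = 2 * 60 * 60  # the 2h age-based default mentioned above


@dataclass(frozen=True)
class RunApp:
    domain: str
    deployed_at: float  # epoch seconds (hypothetical field)


def orphans_to_reap(apps, now, max_age_s=DEFAULT_MAX_AGE_S):
    """Return the run apps old enough to be treated as orphans.

    max_age_s=0 selects every app found, which is only safe when no other
    run can be in flight (runner capacity 1).
    """
    return [a for a in apps if now - a.deployed_at >= max_age_s]


def run_build(deploy, run_stages, undeploy, janitor):
    """Janitor first, then deploy and run; teardown is attempted even on failure."""
    janitor()            # backstop: reap leftovers from killed or timed-out builds
    try:
        deploy()
        return run_stages()
    finally:
        undeploy()       # never reached if the process itself is SIGKILL'd
```

With the default threshold only apps at least two hours old are selected, so a live concurrent run is left alone; with `max_age_s=0` everything found is selected, which matches the capacity=1 setting described above.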
## Open (defaults from §8, to confirm as reality lands)
- **Deploy mechanism — SETTLED (M0):** `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run *on
cc-ci itself*, with the repo materialised on the host at `/root/cc-ci`. Chosen over
`--target-host`/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS
proxy (slow/fragile). Atomic rollback preserved by Nix generations (`nixos-rebuild --rollback`).
The switch is launched as a **detached transient systemd unit** (`systemd-run --unit=ccci-rebuild
--collect`) so it survives a momentary ssh-over-tailscale drop during activation. For the build
loop the host copy is synced from the sandbox clone via `tar | ssh` (rsync absent on host);
source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo
on a fresh host, then `nixos-rebuild switch --flake .#cc-ci`).
- **nixpkgs pin:** flake pins the exact rev cc-ci already ran (`50ab793…`) so the first rebuild
is a true no-op-then-base. Bump deliberately, never drift.
- **Webhook scope:** default per-repo via enroll script.
- **CI engine: Drone (per plan) — kept, with a noted risk.** nixpkgs 24.11 has Drone **server**
2.24.0 but `drone-runner-exec` is **abandoned (unstable-2020-04-19)** — the only exec runner Drone
ever shipped (upstream archived ~2021). The maintained fork **Woodpecker** (2.7.3, with NixOS
modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific
(D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern
Drone server (RPC protocol stable). **Fallback:** if the exec runner proves incompatible/broken,
pivot to Woodpecker (coop-cloud ships a `woodpecker` recipe too) and record it — like the traefik
pivot. Re-evaluate at the M2 gate.
- **Drone deployment shape — SETTLED (M2):** mirror the traefik pattern. The **server** is the
coop-cloud `drone` recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by
traefik at `drone.ci.commoninternet.net`, `LETS_ENCRYPT_ENV` empty → wildcard cert, no ACME),
with Gitea SSO (`compose.gitea.yml`). The **exec runner** runs as a Nix systemd service on the
host (`modules/drone-runner.nix`) so it can drive host abra/swarm (plan §4.2). One generated
`DRONE_RPC_SECRET` is shared: inserted as the server's `rpc_secret` swarm secret AND read by the
runner from sops. Reproducible deploy: `scripts/deploy-drone.sh`.
- Gitea OAuth app `cc-ci-drone` created under the bot (client_id
`ab4cdb9d-ee96-4867-875f-87384505fc52`, redirect `https://drone.ci.commoninternet.net/login`);
client_secret + rpc_secret stored sops-encrypted in `secrets/secrets.yaml` (A2 internal secrets).
- **Drone runner type:** exec (must drive host abra).
- **Secret tool — SETTLED (M0):** sops-nix. cc-ci decrypts at activation using its **ed25519 SSH
host key** as the age identity (`sops.age.sshKeyPaths`), so no extra key file to manage on the box.
Recipients in `/.sops.yaml`: the host age key (`age1h90ut…`, from ssh-to-age) + an off-box
**master recovery key** (`age1cmk26t…`; private half only at `/srv/cc-ci/.sops/master-age.txt` on
the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing
plaintext into `secrets/<f>.yaml` then `sops -e -i` (run inside the repo so `.sops.yaml` is found).
- **D10 recipe set:** lock six early. Candidates favouring already-mirrored: custom-html (simple),
cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3),
bluesky-pds (TLS-passthrough); together these cover all five categories. Confirm during M4 to M6.5.
- **Per-run app domain scheme — adapted (M4, deviates from plan §4.0).** Plan §4.0 wanted
`<recipe>-pr<n>-<short-sha>.ci.commoninternet.net`, but Docker swarm config/secret names
(`<stackname>_<resource>_<version>`) must be ≤ 64 chars and abra derives `<stackname>` from the
domain (dots→`_`, hyphens kept). `.ci.commoninternet.net` alone is 22 chars, so long recipe names
+ config names overflow 64 (hit with `custom-html-pr0-m4demo…_nginx_default_conf_v6` = 66). New
scheme: **`<recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net`** (e.g. `cust-e084bd`) — short,
unique per run, collision-safe across recipes (full recipe in the hash). Human-readable
recipe/PR/ref context lives in the Drone build params + the PR comment, not the (ephemeral) domain.
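
A minimal Python sketch of the naming scheme and the 64-character check follows. The hash function is not named above, so SHA-256 is an assumption, as are the function names `run_domain`, `stack_name`, and `swarm_object_name`; the output will therefore not necessarily match the `cust-e084bd` example.

```python
import hashlib

BASE_DOMAIN = "ci.commoninternet.net"
MAX_SWARM_NAME = 64  # Docker swarm config/secret name limit cited above


def run_domain(recipe: str, pr: str, ref: str) -> str:
    """<recipe[:4]>-<6hex(recipe|pr|ref)>.<base>; SHA-256 is an assumed hash."""
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}.{BASE_DOMAIN}"


def stack_name(domain: str) -> str:
    """Stack name as described above: dots become underscores, hyphens stay."""
    return domain.replace(".", "_")


def swarm_object_name(domain: str, resource: str, version: str) -> str:
    """<stackname>_<resource>_<version>, rejected if it exceeds the swarm limit."""
    name = f"{stack_name(domain)}_{resource}_{version}"
    if len(name) > MAX_SWARM_NAME:
        raise ValueError(f"{name!r} is {len(name)} chars, limit is {MAX_SWARM_NAME}")
    return name
```

Under this scheme the stack-name prefix is a fixed 33 characters (11 for the label, 22 for the base domain including its leading separator), which leaves 31 characters for `_<resource>_<version>` regardless of how long the recipe name is.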
- **abra recipe checkout is volatile — harness uses chaos+offline + a tests/ snapshot (M6).** Many
abra commands (`app ls`, `secret generate` without flags, version resolution) silently
`git checkout <version-tag>` in `~/.abra/recipes/<recipe>`, discarding a PR branch's files. To
test the *PR head code* (not a re-resolved tag): (1) `fetch_recipe` clones the mirror branch/ref
(private → bot token via per-command `http.extraHeader`, never persisted/logged); (2) all harness
abra calls that touch the recipe pass `-C` (chaos: use current checkout) `-o` (offline: no remote
fetch); (3) recipe-shipped `tests/` (D4) are **snapshotted to a temp dir right after fetch**, since
later abra commands still reset the checkout — the recipe-local stage runs from the snapshot.
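
Step (3) above, snapshotting the recipe-shipped `tests/` right after the fetch, could look roughly like the following Python. The function name `snapshot_recipe_tests` and the temp-dir prefix are hypothetical, and the real harness may copy or clean up differently.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def snapshot_recipe_tests(recipe_dir: Path) -> Optional[Path]:
    """Copy <recipe_dir>/tests into a fresh temp dir.

    Returns the snapshot path, or None if the recipe ships no tests/.
    The snapshot is independent of the checkout, so a later
    `git checkout <version-tag>` inside recipe_dir cannot alter it.
    """
    src = recipe_dir / "tests"
    if not src.is_dir():
        return None
    dest = Path(tempfile.mkdtemp(prefix="recipe-tests-")) / "tests"
    shutil.copytree(src, dest)
    return dest
```

The recipe-local stage would then be pointed at the returned path rather than at `~/.abra/recipes/<recipe>/tests`, and the caller is responsible for removing the temp dir once the run ends.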
## Risks
- **Disk — RESOLVED 2026-05-26.** Original 8.9 GiB root had only ~3.8 GiB free *and* a hard
**inode** ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on
inodes before bytes. Operator grew the VM to **28 GiB** (22 GiB free, 1.78M inodes / 1.21M free);
the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown +
periodic `docker image prune` to avoid regressing during M6.5 breadth.
## Dead-ends
- (none yet)