# DECISIONS — cc-ci Builder
Architecture decisions and dead-ends. One line of rationale each. (§0, §8)
## Settled
- **Wildcard TLS:** operator pre-issues wildcard cert at `/var/lib/ci-certs/live/`; Traefik file
  provider serves it; **no ACME** for commoninternet.net. (Plan §4.0/§8: fixed.)
- **Repo:** `git.autonomic.zone/recipe-maintainers/cc-ci`, private. Bot is org admin. (Bootstrap.)
- **Git credentials:** helper script in repo-local git config sources `/srv/cc-ci/.testenv` at call
  time; no secret values stored in `.git/config` or commits.
- **Proxy: real coop-cloud/traefik via abra; SETTLED (M1, orchestrator decision 2026-05-26,
  overrides plan §3 `modules/traefik.nix`).** Instead of a hand-rolled Traefik we deploy the
  canonical Co-op Cloud `traefik` recipe via abra in **wildcard / file-provider mode**, for
  end-to-end fidelity (canonical `web`/`web-secure` entrypoints + proxy/swarm conventions every
  recipe expects; this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO
  DNS token on the box:
  - `WILDCARDS_ENABLED=1` + append `compose.wildcard.yml`; the pre-issued cert is fed as the
    `ssl_cert`/`ssl_key` swarm secrets (v1) via `abra app secret insert … -f` from
    `/var/lib/ci-certs/live/{fullchain,privkey}.pem`. The file provider serves it (`tls.certificates`).
  - `LETS_ENCRYPT_ENV=` **empty** on the traefik app *and* on every test app → the recipe's
    `tls.certresolver=${LETS_ENCRYPT_ENV}` label resolves to no resolver → routers serve the
    wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
  - Reproducibility (D8): `scripts/deploy-proxy.sh` is idempotent (ensures local abra server, fetches
    recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in
    `docs/install.md`. The custom `modules/traefik.nix` was removed; `modules/swarm.nix` keeps swarm
    init + `proxy` net + firewall 80/443.
- **Renewal (manual, ~90d):** operator re-issues the wildcard at the same paths, then
  `abra app secret rm traefik.ci.commoninternet.net ssl_cert -n` + re-insert at a new version (bump
  `SECRET_WILDCARD_CERT_VERSION`) and redeploy. (Documented in docs/secrets.md at M7.)
- **abra teardown syntax** (for harness, §4.3): `abra app undeploy <d> -n`,
  `abra app volume remove <d> -f -n`, `abra app secret remove <d> --all -n`. None take `--chaos`.
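The teardown sequence above could be expressed as argv lists for the harness. This is a minimal sketch, not the harness's actual code: `teardown_commands` is a hypothetical helper name, and it uses only the three command forms quoted above.

```python
def teardown_commands(domain: str) -> list[list[str]]:
    """Return the abra teardown commands for one app, in the order listed above.

    Hypothetical helper: the function name and return shape are assumptions;
    the flags are exactly those quoted in this document (none take --chaos).
    """
    return [
        ["abra", "app", "undeploy", domain, "-n"],
        ["abra", "app", "volume", "remove", domain, "-f", "-n"],
        ["abra", "app", "secret", "remove", domain, "--all", "-n"],
    ]
```

Each inner list can be passed directly to `subprocess.run(cmd, check=False)`, so one failed step does not prevent the later steps from running.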
- **Infra bring-up = idempotent-reconcile systemd oneshots; SETTLED (M2, orchestrator steer
  2026-05-26).** Every piece of swarm infra that abra deploys (traefik `modules/proxy.nix`, Drone
  `modules/drone.nix`, later comment-bridge + dashboard) is a `systemd.services.<x>` with
  `Type=oneshot` + `RemainAfterExit`, `after`/`requires` swarm-init + docker, `wants`
  network-online, `wantedBy` multi-user, embedding its script via **`pkgs.writeShellApplication`**
  (self-contained in the store, not a `/root/cc-ci` path). The script **reconciles** (inspect →
  converge → no-op if correct) on *every* activation/boot, with **no run-once sentinel**, so it
  self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit)
  on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to
  `git clone` + `nixos-rebuild switch` + operator preconditions, no manual post-steps. The old
  `scripts/deploy-*.sh` were folded into these modules and removed. `pkgs.abra` is provided via an
  overlay (`modules/packages.nix`) so all modules share the one pinned build.
  - *Cert rotation note:* the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the
    wildcard means bumping `SECRET_WILDCARD_*_VERSION` (operator) so the next reconcile re-inserts.
    Documented in docs/secrets.md at M7.
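The inspect → converge → no-op loop described above can be illustrated with a small planning function. This is a sketch under stated assumptions, not the code in the Nix modules: the actual reconcilers are shell scripts, and `plan_reconcile` plus its name-to-version state dictionaries are hypothetical.

```python
def plan_reconcile(desired: dict[str, str], observed: dict[str, str]) -> list[str]:
    """Compare desired and observed state; return the actions needed to converge.

    Hypothetical model: each resource (stack, secret) maps a name to a version.
    An empty result means the system is already correct, so the run is a no-op.
    """
    actions: list[str] = []
    for name in sorted(desired):
        if name not in observed:
            # Drift: the resource is gone (e.g. stack removed, secret missing).
            actions.append(f"create {name}")
        elif observed[name] != desired[name]:
            # A version bump (e.g. a rotated cert) triggers a re-insert.
            actions.append(f"update {name}")
    return actions
```

Because the plan is recomputed from live state on every run, no run-once sentinel is needed: a second run against a converged system returns an empty list.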
- **Trigger: POLLING primary, webhook optional; SETTLED (orchestrator design change 2026-05-27,
  supersedes the earlier "keep webhook, do NOT pivot to polling" steer).** Hard constraint: the
  bot/server runs at **READ level, never repo-admin**, and **never self-registers a webhook**.
  - **Polling is PRIMARY and the source of truth for D1.** The bridge polls each enrolled repo's
    open PRs for new `!testme` comments every `POLL_INTERVAL` (30s, within the ≤ 60s bound).
    Polling is outbound (cc-ci → git.autonomic.zone, the reliably-working direction) and needs only
    read+comment. On startup the first poll marks pre-existing comments seen so it doesn't fire on
    old comments.
  - **Webhook is an OPTIONAL push optimization.** The `/hook` endpoint stays live (HMAC-verified)
    so an *admin-registered* `issue_comment` webhook lowers latency, but the bridge never registers
    one. Manual registration is documented in `docs/enroll-recipe.md`. Both paths share an
    in-memory seen-set keyed by comment id → a comment seen by both fires at most once.
  - **Commenter authorization via org membership (read-level, no admin).** Allowed iff
    `GET /orgs/{owner}/members/{user}` → 204 (verified 2026-05-27: admits bot/trav/notplants, 404
    for a non-member, works with bot read-level basic-auth) **or** the user is in the optional
    `AUTH_ALLOWLIST`. Replaces the earlier `/collaborators/{user}/permission` check, which needs
    repo-admin. Fail-closed on any error.
  - **Enrollment** = add the repo to the bridge `POLL_REPOS` csv + ensure `tests/<recipe>/` exists.
    No webhook required for CI to work. (Why the root cause of the old webhook non-delivery doesn't
    matter: polling makes it irrelevant; the operator was whitelisting `ci.commoninternet.net` in
    Gitea's `ALLOWED_HOST_LIST`, but D1 no longer depends on that.)
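The shared seen-set behaviour described above (prime on the first poll, then fire at most once per comment id across both paths) could look roughly like this. It is a sketch, not the bridge's code: the `SeenSet` class and its method names are hypothetical, and the lock assumes the poller and the `/hook` handler may run on different threads.

```python
import threading


class SeenSet:
    """In-memory dedupe keyed by comment id, shared by the poller and the webhook."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()  # assumption: poller and /hook run concurrently

    def prime(self, comment_ids):
        """First poll after startup: mark pre-existing comments seen without firing."""
        with self._lock:
            self._seen.update(comment_ids)

    def should_fire(self, comment_id):
        """Return True only the first time an id is offered, whichever path offers it."""
        with self._lock:
            if comment_id in self._seen:
                return False
            self._seen.add(comment_id)
            return True
```

Since the set lives only in memory, a restart empties it, which is why the first poll has to prime it before any trigger is allowed to fire.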
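The commenter authorization rule for the trigger (org membership returns 204, or the user is allowlisted, fail-closed otherwise) can be sketched as a pure decision function. The names here are hypothetical: `is_authorized` and the injected `check_membership` callable stand in for whatever the bridge uses to call `GET /orgs/{owner}/members/{user}`.

```python
from typing import Callable, Collection


def is_authorized(
    user: str,
    check_membership: Callable[[str], int],
    allowlist: Collection[str] = (),
) -> bool:
    """Allow iff the user is allowlisted or the membership check returns HTTP 204.

    check_membership is a hypothetical callable returning the HTTP status code
    of the org-membership lookup. Any exception or other status denies access.
    """
    if user in allowlist:
        return True
    try:
        return check_membership(user) == 204
    except Exception:
        return False  # fail closed on any error
```

Injecting the HTTP call keeps the decision testable without a live Gitea instance.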
- **Resource safety: bound live test apps; SETTLED (orchestrator design change 2026-05-27,
  plan §4.2/§4.3).** Do NOT keep multiple test apps deployed at once. Three layers, all configurable:
  - **MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1** (`modules/drone-runner.nix`, `maxTests` let-binding).
    Drone runs at most MAX_TESTS builds at once and **auto-queues the rest in its native pending
    queue**; no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is
    never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
  - **Per-build TIMEOUT = 60 min** (`modules/drone.nix`, `buildTimeoutMinutes`; reconciled
    best-effort via `PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60}` using the bridge's
    Drone admin token, local `--resolve`, non-fatal). A build over the limit is cancelled by Drone →
    the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue
    once a test finishes OR times out".
  - **Teardown + janitor backstop.** Each build deploys → runs the 3 stages → undeploys
    (guaranteed `try/finally` in `conftest`/orchestrator). A SIGKILL'd/timed-out build can't run its
    own teardown, so the **run-start janitor** (`lifecycle.janitor`, called before every deploy in
    both fixtures + `run_recipe_ci`) reaps orphaned run apps as the backstop. At capacity=1 the CI
    path will set `CCCI_JANITOR_MAX_AGE=0` (reap any orphan immediately; safe with no concurrent
    runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default
    2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
  - Optional `concurrency: {limit: 1}` in the recipe-CI `.drone.yml` is a redundant belt; the primary
    mechanism is `DRONE_RUNNER_CAPACITY`. (Wired when the recipe-CI pipeline lands; see backlog.)
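The janitor's age rule from the resource-safety layers can be illustrated as a selection over app ages. This is a sketch only: `select_orphans` and its input shape are hypothetical, and the real `lifecycle.janitor` may discover apps and compute ages differently.

```python
DEFAULT_MAX_AGE_SECONDS = 2 * 60 * 60  # the 2h default mentioned above


def select_orphans(
    app_ages: dict[str, float],
    max_age_seconds: float = DEFAULT_MAX_AGE_SECONDS,
) -> list[str]:
    """Return run-app names old enough to reap, sorted for stable output.

    app_ages maps an app name to its age in seconds (hypothetical input shape).
    With max_age_seconds=0 every orphan qualifies, which is only safe when no
    other run can be live (capacity=1).
    """
    return sorted(name for name, age in app_ages.items() if age >= max_age_seconds)
```

With the default threshold a ten-minute-old app from a concurrent run is left alone, while a threshold of 0 reaps everything found before the new deploy starts.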
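The best-effort timeout reconcile for the cc-ci repo could be sketched with the standard library as follows. The endpoint path and `{"timeout": 60}` body come from the text above; the function names, the `base_url` parameter, and the bearer-token header are assumptions for illustration, and the actual reconcile runs inside the Nix module rather than in Python.

```python
import json
import urllib.request


def build_timeout_request(base_url, token, minutes=60, repo="recipe-maintainers/cc-ci"):
    """Build the PATCH request that sets the repo's build timeout (in minutes)."""
    return urllib.request.Request(
        f"{base_url}/api/repos/{repo}",
        data=json.dumps({"timeout": minutes}).encode(),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def apply_best_effort(request, opener=urllib.request.urlopen):
    """Send the request; return False instead of raising (non-fatal reconcile)."""
    try:
        with opener(request, timeout=10) as resp:
            return 200 <= resp.status < 300
    except OSError:  # urllib's URLError and HTTPError are OSError subclasses
        return False
```

Returning a boolean rather than raising mirrors the "non-fatal" intent: a failed PATCH leaves the previous timeout in place and does not fail the unit.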
## Open (defaults from §8, to confirm as reality lands)
- **Deploy mechanism (SETTLED, M0):** `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run *on
  cc-ci itself*, with the repo materialised on the host at `/root/cc-ci`. Chosen over
  `--target-host`/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS
  proxy (slow/fragile). Atomic rollback preserved by Nix generations (`nixos-rebuild --rollback`).
  The switch is launched as a **detached transient systemd unit**
  (`systemd-run --unit=ccci-rebuild --collect`) so it survives a momentary ssh-over-tailscale drop
  during activation. For the build loop the host copy is synced from the sandbox clone via
  `tar | ssh` (rsync absent on host); source of truth stays the git repo. D8/install.md will
  document the from-scratch path (clone repo on a fresh host, then
  `nixos-rebuild switch --flake .#cc-ci`).
- **nixpkgs pin:** flake pins the exact rev cc-ci already ran (`50ab793…`) so the first rebuild
  is a true no-op-then-base. Bump deliberately, never drift.
- **Webhook scope:** default per-repo via enroll script.
- **CI engine: Drone (per plan), kept with a noted risk.** nixpkgs 24.11 has Drone **server**
  2.24.0 but `drone-runner-exec` is **abandoned (unstable-2020-04-19)**, the only exec runner Drone
  ever shipped (upstream archived ~2021). The maintained fork **Woodpecker** (2.7.3, with NixOS
  modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific
  (D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern
  Drone server (RPC protocol stable). **Fallback:** if the exec runner proves incompatible/broken,
  pivot to Woodpecker (coop-cloud ships a `woodpecker` recipe too) and record it, as with the
  traefik pivot. Re-evaluate at the M2 gate.
- **Drone deployment shape (SETTLED, M2):** mirror the traefik pattern. The **server** is the
  coop-cloud `drone` recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by
  traefik at `drone.ci.commoninternet.net`, `LETS_ENCRYPT_ENV` empty → wildcard cert, no ACME),
  with Gitea SSO (`compose.gitea.yml`). The **exec runner** runs as a Nix systemd service on the
  host (`modules/drone-runner.nix`) so it can drive host abra/swarm (plan §4.2). One generated
  `DRONE_RPC_SECRET` is shared: inserted as the server's `rpc_secret` swarm secret AND read by the
  runner from sops. Reproducible deploy: `scripts/deploy-drone.sh`.
  - Gitea OAuth app `cc-ci-drone` created under the bot (client_id
    `ab4cdb9d-ee96-4867-875f-87384505fc52`, redirect `https://drone.ci.commoninternet.net/login`);
    client_secret + rpc_secret stored sops-encrypted in `secrets/secrets.yaml` (A2 internal secrets).
- **Drone runner type:** exec (must drive host abra).
- **Secret tool (SETTLED, M0):** sops-nix. cc-ci decrypts at activation using its **ed25519 SSH
  host key** as the age identity (`sops.age.sshKeyPaths`), so no extra key file to manage on the box.
  Recipients in `/.sops.yaml`: the host age key (`age1h90ut…`, from ssh-to-age) + an off-box
  **master recovery key** (`age1cmk26t…`; private half only at `/srv/cc-ci/.sops/master-age.txt` on
  the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing
  plaintext into `secrets/<f>.yaml` then `sops -e -i` (run inside the repo so `.sops.yaml` is found).
- **D10 recipe set:** lock six early. Candidates favouring already-mirrored: custom-html (simple),
  cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3),
  bluesky-pds (TLS-passthrough); covers all five categories. Confirm during M4–M6.5.
- **Per-run app domain scheme: adapted (M4, deviates from plan §4.0).** Plan §4.0 wanted
  `<recipe>-pr<n>-<short-sha>.ci.commoninternet.net`, but Docker swarm config/secret names
  (`<stackname>_<resource>_<version>`) must be ≤ 64 chars and abra derives `<stackname>` from the
  domain (dots→`_`, hyphens kept). `.ci.commoninternet.net` alone is 22 chars, so long recipe names
  + config names overflow 64 (hit with `custom-html-pr0-m4demo…_nginx_default_conf_v6` = 66). New
  scheme: **`<recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net`** (e.g. `cust-e084bd`): short,
  unique per run, collision-safe across recipes (full recipe in the hash). Human-readable
  recipe/PR/ref context lives in the Drone build params + the PR comment, not the (ephemeral) domain.
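A minimal sketch of the per-run domain scheme and the 64-character check follows. The hash function is an assumption (SHA-256 is used here; the document does not say which hash produces the six hex digits), and the helper names are hypothetical, so this sketch will not necessarily reproduce `cust-e084bd`.

```python
import hashlib

CI_SUFFIX = ".ci.commoninternet.net"  # 22 characters
SWARM_NAME_LIMIT = 64


def run_domain(recipe: str, pr: int, ref: str) -> str:
    """Build <recipe[:4]>-<6hex(recipe|pr|ref)> plus the CI suffix."""
    # Assumption: SHA-256 over "recipe|pr|ref"; the real harness may hash differently.
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}{CI_SUFFIX}"


def stack_name(domain: str) -> str:
    """Derive the stack name the way the text describes: dots become '_', hyphens kept."""
    return domain.replace(".", "_")


def resource_name_fits(domain: str, resource: str, version: str) -> bool:
    """Check that <stackname>_<resource>_<version> stays within the 64-char limit."""
    return len(f"{stack_name(domain)}_{resource}_{version}") <= SWARM_NAME_LIMIT
```

Under this scheme every domain is 33 characters long (4 + 1 + 6 + 22), which leaves 31 characters for the two separators, the resource name, and the version.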
- **abra recipe checkout is volatile; harness uses chaos+offline + a tests/ snapshot (M6).** Many
  abra commands (`app ls`, `secret generate` without flags, version resolution) silently
  `git checkout <version-tag>` in `~/.abra/recipes/<recipe>`, discarding a PR branch's files. To
  test the *PR head code* (not a re-resolved tag): (1) `fetch_recipe` clones the mirror branch/ref
  (private → bot token via per-command `http.extraHeader`, never persisted/logged); (2) all harness
  abra calls that touch the recipe pass `-C` (chaos: use current checkout) `-o` (offline: no remote
  fetch); (3) recipe-shipped `tests/` (D4) are **snapshotted to a temp dir right after fetch**, since
  later abra commands still reset the checkout; the recipe-local stage runs from the snapshot.
## Risks
- **Disk: RESOLVED 2026-05-26.** Original 8.9 GiB root had only ~3.8 GiB free *and* a hard
  **inode** ceiling (586k total, ~6k free); the flake's nixpkgs fetch (~50k files) hit ENOSPC on
  inodes before bytes. Operator grew the VM to **28 GiB** (22 GiB free, 1.78M inodes / 1.21M free);
  the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown +
  periodic `docker image prune` to avoid regressing during M6.5 breadth.
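Because the failure above was inode exhaustion rather than byte exhaustion, a headroom check needs to look at both. The following is a sketch for Unix hosts using `os.statvfs`; the function names and thresholds are hypothetical and not part of the cc-ci repo.

```python
import os


def fs_headroom(path="/"):
    """Report free bytes and free inodes for the filesystem holding `path` (Unix only)."""
    st = os.statvfs(path)
    return {
        "bytes_free": st.f_bavail * st.f_frsize,  # space available to unprivileged users
        "inodes_total": st.f_files,
        "inodes_free": st.f_ffree,
    }


def is_low(headroom, min_bytes, min_inodes):
    """True if either bytes or inodes fall below the given thresholds."""
    return headroom["bytes_free"] < min_bytes or headroom["inodes_free"] < min_inodes
```

With the original numbers (about 6k free inodes against a fetch of roughly 50k files), `is_low` flags the problem even though several GiB of space were still free.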
## Dead-ends
- (none yet)