# STATUS — cc-ci Builder

**Phase:** M0/M1/M2/M4/M5 PASS; M3 PASS (Adversary-verified); M6 CLAIMED (awaiting Adversary).
Bridge→Drone→harness integration DONE (recipe-ci pipeline). M6.5 underway: keycloak full 3-stage
GREEN through Drone (build #39). Next: enroll recipes 3–6 (remaining D10 categories), M7, M8.
**In-flight:** M6.5 breadth — cryptpad (recipe #3, stateful/no-DB) full 3-stage GREEN on host;
canonical Drone run = build #46 (polling). Fixed a real backup bug en route: `set_env` glued
`RESTIC_REPOSITORY` onto the end of a comment line, so backupbot had no restic repo; `set_env` is
now newline-safe. Next: recipes
4–6 (multi-service+S3 e.g. lasuite-docs, large-volume e.g. matrix/immich, TLS-passthrough e.g.
bluesky-pds). Pending: re-verify keycloak backup post-fix; full single-`!testme`-on-a-recipe-PR E2E.
**Last updated:** 2026-05-27 (M6.5: cryptpad 3-stage green on host; set_env/RESTIC backup fix)
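
The `set_env` failure mode noted under In-flight can be illustrated with a minimal Python sketch. The real `set_env` is not shown in this file, so the function below, its signature, and the example repo path are all hypothetical. It shows only the newline-safe idea: rebuild the file line by line, so a new `KEY=value` always lands on its own line even when the file ends in a comment with no trailing newline.

```python
from pathlib import Path


def set_env(env_file: Path, key: str, value: str) -> None:
    """Set KEY=value in a dotenv-style file (hypothetical helper, not the harness's code)."""
    text = env_file.read_text() if env_file.exists() else ""
    # splitlines() yields whole lines whether or not the file ends in "\n",
    # so a trailing comment without a newline stays a separate line.
    lines = text.splitlines()
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        if line.startswith(f"{key}="):
            lines[i] = new_line  # update an existing assignment in place
            replaced = True
    if not replaced:
        lines.append(new_line)
    # Always terminate the file with a newline so the next append is safe too.
    env_file.write_text("\n".join(lines) + "\n")
```

A naive `open(path, "a").write("KEY=value\n")` on a file whose last line is `# comment` with no newline would produce `# commentKEY=value`, which is the kind of glued line described above.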

## Gates
- **Gate: M0 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: flake rebuilds cc-ci from repo
  (`switch --flake /root/cc-ci#cc-ci`, gen healthy, no failed units); sops-nix decrypts
  `/run/secrets/test_secret` (0400 root, value = generated `cc-ci-m0-…`). Repro: clone repo, sync to
  host, `nixos-rebuild switch --flake .#cc-ci`, then `systemctl is-system-running` + check the secret.
  Per §6.1 I will NOT advance past this gate to M2; M1 work proceeds as independent unblocked work.
  → **M0 PASS** logged by Adversary in REVIEW.md @2026-05-26T21:35Z (cold verify, leak probe clean).
- **Gate: M1 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: Docker single-node swarm +
  `proxy` overlay; real coop-cloud/traefik via abra (wildcard/file-provider, no ACME); custom-html
  deployed by hand → HTTP 200 over HTTPS via gateway at cchtml1.ci.commoninternet.net with the
  wildcard cert; torn down clean (services/volumes/secrets/containers all 0). Repro:
  `scripts/deploy-proxy.sh` + `abra app new/deploy/undeploy`. Starting M2 as independent work; will
  not flip M2's gate until M1 shows PASS. → **M1 PASS** @2026-05-26T22:20Z.
- **Gate: M2 — CLAIMED, awaiting Adversary** (2026-05-26). Evidence: Drone server (coop-cloud recipe,
  reconcile oneshot, Gitea SSO) healthz 200 via gateway; exec runner polling (capacity=2). cc-ci repo
  activated (push webhook). Pushing `.drone.yml` triggered build #1 → **success** (clone + hello exec
  steps, exit 0; ran abra/docker on the host). Repro: `nixos-rebuild switch` + one-time
  `scripts/bootstrap-drone-oauth.sh`. Starting M3 as independent work; won't flip M3 gate until M2 PASS.
- **Gate: M3 — CLAIMED, awaiting Adversary** (2026-05-27). Trigger redesigned per orchestrator
  (plan §4.1): **polling is PRIMARY** (outbound, read-only, ≤30s), webhook optional/admin-registered;
  commenter auth via org membership (`GET /orgs/{owner}/members/{user}` 204, read-level) + optional
  allowlist — NOT the admin-requiring `/collaborators/{user}/permission`. Evidence: posted `!testme`
  on PR #1 (by bot, an org member) → poller fired in **6s** → Drone build **#26** for head
  `d397720a` → bridge posted the run-link comment back. Auth endpoint verified read-level:
  bot/trav/notplants → 204, non-member → 404. The old webhook-delivery blocker is **moot** (polling
  doesn't need the Gitea `ALLOWED_HOST_LIST` whitelist). Won't advance past this gate until REVIEW
  shows PASS;
  doing the bridge→Drone integration as independent work meanwhile.
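
The commenter-auth rule in the M3 gate can be sketched as a small pure function. This is not the bridge's code: `commenter_allowed`, the injected `get_status` callable (which would perform the HTTP GET and return the status code), and the `example-org` name are hypothetical. The source also does not say how the optional allowlist combines with membership; the sketch assumes a configured allowlist narrows the set of org members further.

```python
from typing import Callable, Iterable, Optional


def membership_path(owner: str, user: str) -> str:
    """Path of the read-level org-membership check named in the M3 gate."""
    return f"/orgs/{owner}/members/{user}"


def commenter_allowed(
    owner: str,
    user: str,
    get_status: Callable[[str], int],
    allowlist: Optional[Iterable[str]] = None,
) -> bool:
    """Return True if `user` may trigger a run with `!testme`.

    get_status: hypothetical callable that GETs the path and returns the HTTP status.
    allowlist: ASSUMPTION: when given, the user must also appear in it.
    """
    # 204 means "is a member"; anything else (e.g. 404) is treated as not authorized.
    if get_status(membership_path(owner, user)) != 204:
        return False
    if allowlist is None:
        return True
    return user in set(allowlist)
```

Injecting `get_status` keeps the decision logic testable without a live Gitea instance, for example with a dictionary of canned responses:

```python
statuses = {"/orgs/example-org/members/bot": 204}
fake_get = lambda path: statuses.get(path, 404)
commenter_allowed("example-org", "bot", fake_get)       # member
commenter_allowed("example-org", "stranger", fake_get)  # non-member
```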

## Resource safety (plan §4.2/§4.3 — orchestrator change 2026-05-27)
- **MAX_TESTS = DRONE_RUNNER_CAPACITY = 1** (`modules/drone-runner.nix`): ≤1 build at once, Drone
  auto-queues the rest natively. Verified `DRONE_RUNNER_CAPACITY=1` on the runner.
- **Per-build timeout = 60m** (`modules/drone.nix`, reconciled best-effort, non-fatal): a hung build
  is cancelled → frees its slot. Verified Drone repo `timeout: 60`.
- **Janitor backstop** for SIGKILL'd builds (reaps orphaned run apps at run-start). At capacity=1
  the recipe-CI pipeline will set `CCCI_JANITOR_MAX_AGE=0` (safe — no concurrent runs). See DECISIONS.
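
As a rough illustration of why `CCCI_JANITOR_MAX_AGE=0` is safe at capacity=1, the janitor's selection rule can be sketched as a predicate. The pattern is the hashed run-app scheme quoted in the A2 entry under Tracking; `should_reap` itself is hypothetical, and treating the age threshold as seconds is an assumption (the unit is not stated here).

```python
import re

# Pattern quoted from the A2 entry under Tracking (hashed run-app naming scheme).
RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$")


def should_reap(app_domain: str, age_seconds: float, max_age_seconds: float) -> bool:
    """Hypothetical janitor predicate: reap only run apps that are old enough.

    Names that do not match the run-app pattern (e.g. long-lived infrastructure
    apps) are never reaped, regardless of age.
    """
    return bool(RUN_APP_RE.match(app_domain)) and age_seconds >= max_age_seconds
```

With a threshold of 0, every app matching the run-app pattern is eligible at run-start. That is only safe when no other run can be live, which is what the capacity=1 cap provides.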

## Blocked
- (none) — M3 webhook blocker cleared by the polling-primary redesign (polling is
  read-only/outbound and needs no Gitea `ALLOWED_HOST_LIST` whitelist).

## Tracking (adversary findings I must address)
- **[adversary] A4 — concurrent same-recipe runs collide on shared `~/.abra/recipes/<recipe>`.**
  Root cause the finding names ("no Drone concurrency cap — runner capacity=2") is now **eliminated**:
  MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1 (resource-safety change). With ≤1 build at a time there is
  **no concurrent run** on this single node, so the shared-recipe-dir race cannot occur. Builder side
  addressed via the concurrency cap (per plan §4.2 "concurrency cap 1–2"); Adversary to re-test/close.
  (Per-run `ABRA_DIR`/HOME isolation would be belt-and-suspenders but is unnecessary at capacity=1.)
- **[adversary] A2 — janitor `-pr` filter dead.** Already fixed in code: `lifecycle.RUN_APP_RE` =
  `^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$` (the hashed scheme), plus a stack-name regex
  for `.env`-less orphans, gated on age. Awaiting Adversary kill-probe re-test.
- **[adversary] A3 — teardown unverified; `.env` removed before confirmed undeploy.** Already fixed:
  `lifecycle.teardown_app` undeploys → `docker stack rm` fallback if services remain → removes
  volumes/secrets while `.env` exists → drops `.env` LAST → then `_residual()` check raises
  `TeardownError` if anything is left. Awaiting Adversary kill-mid-run re-test.
- **[adversary] A1 — no-ACME hazard for test apps.** Acknowledged (valid). The harness (M4) MUST
  force `LETS_ENCRYPT_ENV=""` on every test-app deploy (already done in `scripts/deploy-proxy.sh` and
  the M1 manual custom-html deploy; `scripts/deploy-drone.sh` will too). Considering a structural
  belt-and-suspenders (drop the unused `certificatesResolvers` from cc-ci's traefik) — deferred,
  needs a recipe-config override. Will make the harness enforcement the primary fix; Adversary
  re-tests + closes after M4. → **Now enforced**: `harness.lifecycle.deploy_app` sets
  `LETS_ENCRYPT_ENV=""` on every test-app deploy (verified in the M4 custom-html run). Adversary can
  re-test + close A1.
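
The teardown ordering described in A3 can be sketched as follows. This is a minimal illustration, not the actual `lifecycle.teardown_app`: every step is passed in as a hypothetical callable so that only the ordering and the final residual check are shown, and the parameter names are assumptions.

```python
from typing import Callable, List


class TeardownError(RuntimeError):
    """Raised when resources remain after teardown."""


def teardown_app(
    undeploy: Callable[[], None],
    services_remain: Callable[[], bool],
    stack_rm: Callable[[], None],
    remove_volumes_and_secrets: Callable[[], None],
    remove_env: Callable[[], None],
    residual: Callable[[], List[str]],
) -> None:
    """Run the teardown steps in the order the A3 entry describes."""
    undeploy()
    if services_remain():
        stack_rm()  # fallback when undeploy left services behind
    remove_volumes_and_secrets()  # runs while the .env file still exists
    remove_env()  # the .env file is dropped last
    left = residual()
    if left:
        # Fail loudly rather than report a clean teardown.
        raise TeardownError(f"residual resources: {left}")
```

Keeping `.env` until the end matters because the earlier cleanup steps need it to identify the app; the residual check then turns a silent leak into a hard failure.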

## Notes
- **Disk RESOLVED:** operator grew the VM 8.9→**28 GiB** (22 GiB free) on 2026-05-26. Inodes
  1.78M total / 1.21M free (was ~6k free — old 8.9 GiB fs had only 586k inodes, which the flake's
  nixpkgs fetch exhausted). Both byte + inode pressure gone.
- M0 base config: flake at repo root pins nixpkgs to the exact rev cc-ci ran (50ab793) → first
  rebuild is no-op-then-base. Deployed via `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run as
  a detached transient systemd unit (survives ssh-over-tailscale drops). Gen 3 current, healthy.
- Open warning: incus module enables `systemd.network` while we set `networking.useDHCP=true`
  (scripted dhcpcd) — Nix warns both may manage interfaces. Inherited from baseline, networking is
  up; clean up later (pick networkd OR scripting). Tracked, non-blocking.