# BACKLOG — cc-ci

Two single-writer sections (§6.1): Builder edits only `## Build backlog`; Adversary edits only
`## Adversary findings`. To close an item, check its box in your own section.

## Build backlog

### M0 — Foundations
- [x] Author flake.nix (NixOS host cc-ci) + hosts/cc-ci/{configuration,hardware}.nix from baseline
- [x] Deploy mechanism decision + first rebuild from repo (DECISIONS.md) — switch --flake on host
- [x] sops-nix wiring: host age key (from ssh host key) + master recovery key; secrets/secrets.yaml;
      decrypt a test secret on host → /run/secrets/test_secret (0400 root) verified
- [x] Gate: M0 — `ssh cc-ci 'systemctl is-system-running'` healthy after rebuild from repo
      → CLAIMED 2026-05-26, awaiting Adversary (see STATUS.md)

### M1 — Swarm + abra target
- [x] Docker + single-node swarm via Nix (modules/swarm.nix: docker + swarm-init oneshot + `proxy`
      overlay net + daily autoprune). Verified: Swarm=active, proxy overlay present.
- [x] Proxy = real coop-cloud/traefik via abra (orchestrator decision, replaces custom traefik.nix):
      wildcard/file-provider mode, pre-issued cert as ssl_cert/ssl_key swarm secrets, LETS_ENCRYPT_ENV
      empty → no ACME. `scripts/deploy-proxy.sh` (idempotent). Verified E2E via gateway: wildcard cert
      served, 0 ACME log lines.
- [x] abra installed (modules/abra.nix, pinned 0.13.0-beta); deployed custom-html by hand over HTTPS
      (HTTP 200 nginx page via gateway) and tore it down clean (services/volumes/secrets/containers=0).
- [x] Gate: M1 — recipe reachable over HTTPS at *.ci.commoninternet.net, torn down clean →
      CLAIMED 2026-05-26, awaiting Adversary.

### M2 — Drone online
- [x] Drone server (coop-cloud recipe, reconcile oneshot) + exec runner via Nix; Gitea OAuth app.
      Server healthz 200 via gateway; runner polling (capacity=2, type=exec).
- [x] hello-world .drone.yml runs green; logs visible (Drone UI + API). Build #1 success: clone +
      hello (echo/whoami=root/abra 0.13.0-beta/swarm=active), both exit 0.
- [x] Gate: M2 — push to cc-ci triggers visible green build → CLAIMED 2026-05-26, awaiting Adversary.
      OAuth link via one-time `scripts/bootstrap-drone-oauth.sh` (documented in install.md §2).

### M3 — Comment bridge
- [ ] comment-bridge service: HMAC verify, !testme exact match, collaborator check, Drone API call
- [ ] PR comment posting with run link
- [ ] Gate: M3 — live demo on scratch PR; auth enforced
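
A minimal sketch of the first two M3 checks (HMAC verify and exact `!testme` match), assuming the webhook signature is a hex-encoded HMAC-SHA256 of the raw request body. The function names, and the choice to tolerate surrounding whitespace in the comment, are illustrative assumptions, not the bridge's actual design.

```python
import hashlib
import hmac

TRIGGER = "!testme"


def signature_valid(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the HMAC-SHA256 of body under secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison; encode so non-ASCII input cannot raise TypeError.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


def is_trigger(comment_body: str) -> bool:
    """Exact match on the trigger; only leading/trailing whitespace is ignored."""
    return comment_body.strip() == TRIGGER
```

The collaborator check and the Drone API call are left out because they depend on API details this backlog does not specify.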

### M4 — Harness + install stage
- [x] run_recipe_ci.py + conftest + harness (abra wrappers, lifecycle) + Nix python/playwright env
      (cc-ci-run); install stage for recipe #1 (custom-html) + Playwright assertion; guaranteed teardown
- [x] Gate: M4 — green install run, no orphaned app/volume → CLAIMED 2026-05-27, awaiting Adversary.
      Repro: `cd /root/cc-ci && RECIPE=custom-html PR=0 REF=m4demo cc-ci-run runner/run_recipe_ci.py`
      → 2 passed (http 200 + playwright); teardown leaves services/volumes/secrets/containers/env = 0.

### M5 — Upgrade + backup/restore stages
- [x] Add upgrade + backup/restore stages for recipe #1 (custom-html). backup-bot-two deployed as a
      reconcile oneshot (modules/backupbot.nix). Data marker served via nginx for assertions.
- [x] Gate: M5 — upgrade preserves data; backup→mutate→restore returns original → CLAIMED 2026-05-27.
      Full 3-stage run green: install(2)+upgrade(1)+backup(1) passed; teardown leaves 0 orphans, infra intact.

### M6 — Recipe-local tests + second recipe
- [x] D4 recipe-local discovery: recipe-shipped tests/ snapshotted post-fetch + run against the live
      app as a `recipe-local` stage (contract CCCI_BASE_URL/CCCI_APP_DOMAIN). Demo'd via mirror branch
      recipe-maintainers/custom-html@ci/d4-recipe-local → recipe-local test PASSED against live app.
- [x] Enroll DB-backed recipe #2 (keycloak + mariadb) via per-recipe tests/keycloak/ only (no harness
      surgery): install green (realm health + Playwright admin login). docs/enroll-recipe.md written.
- [x] Gate: M6 — both recipes green (custom-html 3-stage; keycloak install) + recipe-local merged →
      CLAIMED 2026-05-27. keycloak full 3-stage (DB data survival) folds into the M6.5 breadth ramp.

### M6.5 — Breadth ramp (recipes 3→6)
- [ ] Enroll recipes 3–6 covering remaining D10 categories, no harness surgery
- [ ] Gate: M6.5 — recipes 3–6 three-stage green

### M7 — Secrets hardening (D6)
- [ ] Full sops model, rotation doc, log redaction + leak test
- [ ] Gate: M7 — secret-grep finds nothing
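
One possible shape for the M7 leak test and log redaction, as a sketch only. It assumes the harness can hand over the decrypted secret values as a name-to-value mapping; `find_leaks` and `redact` are hypothetical helpers, not existing harness functions.

```python
def find_leaks(log_text: str, secrets: dict[str, str]) -> list[str]:
    """Return the names of secrets whose values appear verbatim in log_text."""
    return sorted(
        name for name, value in secrets.items() if value and value in log_text
    )


def redact(log_text: str, secrets: dict[str, str], mask: str = "***") -> str:
    """Replace every secret value in log_text with mask."""
    # Longest values first, so a secret that contains another is masked whole.
    for value in sorted((v for v in secrets.values() if v), key=len, reverse=True):
        log_text = log_text.replace(value, mask)
    return log_text
```

A gate check could then assert `find_leaks(build_log, secrets) == []` for each captured log. Verbatim matching does not catch encoded forms (for example base64), which would need a separate check.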

### M8 — Dashboard (D7)
- [ ] Overview page + badges + PR-comment outcome reflection
- [ ] Gate: M8 — overview matches reality; outcomes mirrored

### M9 — Reproducibility + docs (D8/D9)
- [ ] docs/install.md from-scratch rebuild; all docs complete
- [ ] Gate: M9 — Adversary rebuilds from docs on throwaway host

### M10 — Proof (D10)
- [ ] All six recipes green via real !testme PRs; flip STATUS to DONE

## Adversary findings
<!-- Adversary-only section. Builder must not edit below this line. -->

- [x] **[adversary] A1 — Test-app deploys can silently trigger ACME (no-ACME design hazard).**
      **CLOSED @2026-05-27T00:35Z** by Adversary re-test. `runner/harness/lifecycle.deploy_app`
      calls `abra.env_set(domain, "LETS_ENCRYPT_ENV", "")` before every deploy. Verified on a live
      harness app (`cust-c95a69`): env `LETS_ENCRYPT_ENV=` empty, no `certresolver` label, **0 ACME
      log lines**, and the served cert is the **wildcard** `CN=*.ci.commoninternet.net` (verify ok)
      — not a per-host ACME cert. No-ACME holds for harness deploys. (Structural belt-and-suspenders
      — dropping the unused `certificatesResolvers` from traefik — remains a nice-to-have, tracked
      under A3/M7, not required to close A1.)

- [ ] **[adversary] A2 — Janitor never reaps current-scheme orphans (dead `-pr` filter).**
      Found during M4 review. `harness.lifecycle.janitor()` only tears down apps where
      `"-pr" in name`, but per DECISIONS the harness now names apps `<recipe[:4]>-<6hex>` (e.g.
      `cust-c95a69`) — **no `-pr` substring**. So the run-start crash-recovery sweep (§4.3: "nuke
      any orphaned `*-pr*` apps") matches **nothing** and is effectively a no-op. The happy-path
      finalizer in `conftest.deployed_app` does work (observed: `cust-e084bd` from a prior run was
      torn down), but a run that crashes/reboots *before* the finalizer runs leaves an orphan that
      no later run will reap. *Fix:* match the actual naming (e.g. regex `^[a-z]{1,4}-[0-9a-f]{6}\.`
      or a dedicated CI label/prefix) and gate on age. *Re-test:* deploy a harness app, simulate a
      crash (kill the run before teardown), then start a new run and confirm janitor reaps the
      orphan. Adversary closes after re-test.
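
A sketch of the A2 fix (match the current naming, gate on age), using the regex suggested above. `is_reapable`, the one-hour default, and the way an app's creation time is obtained are assumptions for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Current harness naming: <recipe[:4]>-<6hex>.<domain>, e.g. cust-c95a69.<domain>
HARNESS_APP_RE = re.compile(r"^[a-z]{1,4}-[0-9a-f]{6}\.")


def is_reapable(
    domain: str,
    created_at: datetime,
    now: datetime,
    min_age: timedelta = timedelta(hours=1),
) -> bool:
    """True if domain looks like a harness app and is older than min_age."""
    if not HARNESS_APP_RE.match(domain):
        return False
    # Age gate: never reap an app that a concurrent run may still be using.
    return now - created_at >= min_age
```

With this pattern, a recipe whose first four characters include a digit or a hyphen would not match; a dedicated CI label or prefix avoids that limitation.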

- [ ] **[adversary] A3 — Teardown is unverified and best-effort; a failed teardown silently orphans the app while the run stays green.**
      Found during M4 review (to confirm empirically with a kill-mid-run probe). `lifecycle.teardown_app`
      runs every abra call with `check=False` and "never raises"; the conftest finalizer never
      asserts teardown succeeded. Worse, `abra.app_config_remove` deletes the app `.env`
      **unconditionally**, even if `abra.undeploy` failed first — leaving the swarm service+volume
      running but with no `.env`, so the app can no longer be managed/undeployed via abra (and a
      fixed janitor that shells out to `abra app undeploy` couldn't reap it either). Net: a partial teardown
      leaves a silent orphan while pytest still reports the run **green**, so the M4/D2 guarantee
      "no orphaned app/volume afterward" is not actually *verified* by the harness. *Fix:* assert
      post-teardown that the stack/services/volumes/secrets are gone (fail the run otherwise); only
      remove the `.env` after a confirmed undeploy, or undeploy-by-stack-name as a fallback that
      doesn't need the `.env`. *Re-test:* run install, kill the process mid-deploy, verify the next
      run (or janitor) leaves zero residual service/volume/secret. Adversary closes after re-test.
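
A sketch of the post-teardown assertion proposed for A3. It assumes resources can be found by the `com.docker.stack.namespace` label that `docker stack deploy` applies; secrets created outside the stack deploy may not carry that label and would need a name-based check instead. The helper names and the `stack` argument are hypothetical.

```python
import subprocess

RESOURCE_KINDS = ("service", "volume", "secret")


def docker_ids(kind: str, stack: str) -> list[str]:
    """List IDs/names of one resource kind labelled as belonging to stack."""
    result = subprocess.run(
        ["docker", kind, "ls", "-q",
         "--filter", f"label=com.docker.stack.namespace={stack}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def residual_resources(stack: str, list_ids=docker_ids) -> dict[str, list[str]]:
    """Return {kind: ids} for every resource kind that still has leftovers."""
    leftovers = {}
    for kind in RESOURCE_KINDS:
        ids = list_ids(kind, stack)
        if ids:
            leftovers[kind] = ids
    return leftovers


def assert_torn_down(stack: str, list_ids=docker_ids) -> None:
    """Fail the run if anything belonging to stack survived teardown."""
    leftovers = residual_resources(stack, list_ids)
    if leftovers:
        raise AssertionError(f"teardown left residue for {stack}: {leftovers}")
```

Calling `assert_torn_down` from the conftest finalizer would turn a partial teardown into a failed run. The `list_ids` parameter exists only so the check can be exercised without a Docker daemon.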

- [ ] **[adversary] A4 — Concurrent same-recipe runs collide on the shared recipe checkout.**
      Found by review (M6 verify); to confirm empirically. Per-run isolation is correct for the app
      **domain/volume/secret** (hashed `<recipe[:4]>-<6hex(recipe|pr|ref)>`), but the recipe *source
      checkout* is a single shared path `~/.abra/recipes/<recipe>`: `run_recipe_ci.fetch_recipe`
      does `rm -rf ~/.abra/recipes/<recipe>` then `git clone`+`checkout <ref>`, and abra itself
      re-checks-out the recipe to a version tag mid-deploy. There is **no per-run abra home
      (`ABRA_DIR`/`HOME`), no lock, and no Drone concurrency cap** (runner capacity=2). So two
      concurrent runs of the **same recipe at different refs** (e.g. `!testme` on two PRs of one
      recipe) race on that dir — one can deploy/test the other's code, or fail mid-fetch. (Benign
      when both want identical content, which is why an earlier accidental same-recipe overlap
      didn't visibly break — masking the bug.) This weakens the §6 "two concurrent runs don't
      collide" guarantee and matters for D10 (6 recipes via real PRs). *Repro:* start two runs of
      one recipe with different REFs simultaneously; check each deploys its own ref's code (add a
      per-ref marker) and neither errors mid-fetch. *Fix:* per-run abra home/recipe dir (e.g.
      `ABRA_DIR=$(mktemp -d)` or `~/.abra-runs/<app>`), or a per-recipe lock, or cap Drone to
      serialize same-recipe builds. Adversary confirms + closes after re-test.
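
A sketch of the first A4 fix option (a per-run abra home), assuming abra honours `ABRA_DIR` as the finding suggests. `isolated_abra_env` is a hypothetical helper; the harness would have to pass the returned environment to every abra subprocess in the run.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def isolated_abra_env(app_name: str, base_env=None):
    """Yield an environment dict whose ABRA_DIR is a private temp directory."""
    run_dir = tempfile.mkdtemp(prefix=f"abra-{app_name}-")
    env = dict(os.environ if base_env is None else base_env)
    env["ABRA_DIR"] = run_dir
    try:
        yield env
    finally:
        # Remove the per-run checkout even if the run failed.
        shutil.rmtree(run_dir, ignore_errors=True)
```

Because each run clones the recipe into its own directory, two runs at different refs no longer share a checkout. The cost is one extra clone per run; a per-recipe lock avoids that but serializes same-recipe builds.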