cc-ci/BACKLOG.md
2026-05-27 01:51:15 +01:00

# BACKLOG — cc-ci
Two single-writer sections (§6.1): Builder edits only `## Build backlog`; Adversary edits only
`## Adversary findings`. Closing an item = checking the box in your own section.
## Build backlog
### M0 — Foundations
- [x] Author flake.nix (NixOS host cc-ci) + hosts/cc-ci/{configuration,hardware}.nix from baseline
- [x] Deploy mechanism decision + first rebuild from repo (DECISIONS.md) — switch --flake on host
- [x] sops-nix wiring: host age key (from ssh host key) + master recovery key; secrets/secrets.yaml;
decrypt a test secret on host → /run/secrets/test_secret (0400 root) verified
- [x] Gate: M0 — `ssh cc-ci 'systemctl is-system-running'` healthy after rebuild from repo
→ CLAIMED 2026-05-26, awaiting Adversary (see STATUS.md)
### M1 — Swarm + abra target
- [x] Docker + single-node swarm via Nix (modules/swarm.nix: docker + swarm-init oneshot + `proxy`
overlay net + daily autoprune). Verified: Swarm=active, proxy overlay present.
- [x] Proxy = real coop-cloud/traefik via abra (orchestrator decision, replaces custom traefik.nix):
wildcard/file-provider mode, pre-issued cert as ssl_cert/ssl_key swarm secrets, LETS_ENCRYPT_ENV
empty → no ACME. `scripts/deploy-proxy.sh` (idempotent). Verified E2E via gateway: wildcard cert
served, 0 ACME log lines.
- [x] abra installed (modules/abra.nix, pinned 0.13.0-beta); deployed custom-html by hand over HTTPS
(HTTP 200 nginx page via gateway) and tore it down clean (services/volumes/secrets/containers=0).
- [x] Gate: M1 — recipe reachable over HTTPS at *.ci.commoninternet.net, torn down clean →
CLAIMED 2026-05-26, awaiting Adversary.
### M2 — Drone online
- [x] Drone server (coop-cloud recipe, reconcile oneshot) + exec runner via Nix; Gitea OAuth app.
Server healthz 200 via gateway; runner polling (capacity=2, type=exec).
- [x] hello-world .drone.yml runs green; logs visible (Drone UI + API). Build #1 success: clone +
hello (echo/whoami=root/abra 0.13.0-beta/swarm=active), both exit 0.
- [x] Gate: M2 — push to cc-ci triggers visible green build → CLAIMED 2026-05-26, awaiting Adversary.
OAuth link via one-time `scripts/bootstrap-drone-oauth.sh` (documented in install.md §2).
### M3 — Comment bridge
- [ ] comment-bridge service: HMAC verify, !testme exact match, collaborator check, Drone API call
- [ ] PR comment posting with run link
- [ ] Gate: M3 — live demo on scratch PR; auth enforced
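
The two checks named in the first M3 item (HMAC verify and `!testme` exact match) could look roughly like the sketch below. It assumes the forge signs the raw request body with HMAC-SHA256 and sends the hex digest in a header (Gitea documents an `X-Gitea-Signature` header for this, but that should be confirmed against the deployed version). The function names are hypothetical, and treating "exact match" as "the whole comment after trimming whitespace" is also an assumption.

```python
import hashlib
import hmac


def signature_ok(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the hex HMAC-SHA256 of body under secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = signature_hex.strip().lower()
    # compare_digest on bytes avoids leaking the mismatch position through timing
    # and does not raise on non-ASCII input.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


def is_testme(comment: str) -> bool:
    """Assumed rule: the entire comment, minus surrounding whitespace, is '!testme'."""
    return comment.strip() == "!testme"
```

The signature must be computed over the raw bytes as received, before any JSON parsing, since re-serialising the payload can change the bytes. The collaborator check and the Drone API call are left out because they depend on details not recorded here.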
### M4 — Harness + install stage
- [x] run_recipe_ci.py + conftest + harness (abra wrappers, lifecycle) + Nix python/playwright env
(cc-ci-run); install stage for recipe #1 (custom-html) + Playwright assertion; guaranteed teardown
- [x] Gate: M4 — green install run, no orphaned app/volume → CLAIMED 2026-05-27, awaiting Adversary.
Repro: `cd /root/cc-ci && RECIPE=custom-html PR=0 REF=m4demo cc-ci-run runner/run_recipe_ci.py`
→ 2 passed (http 200 + playwright); teardown leaves services/volumes/secrets/containers/env = 0.
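
The "= 0" condition in the M4 gate could be checked mechanically rather than by eye. The sketch below is one way to do that, not the harness's actual code. It assumes the app was deployed as a swarm stack, so its objects carry the `com.docker.stack.namespace` label, and that the caller knows the stack name. `residuals` and `assert_torn_down` are hypothetical names, and the command runner is injectable so the logic can be exercised without a docker daemon.

```python
import subprocess
from typing import Callable, Dict, List

Runner = Callable[[List[str]], str]

# `docker stack deploy` labels the objects it creates with this key.
STACK_LABEL = "com.docker.stack.namespace"


def _docker(argv: List[str]) -> str:
    """Default runner: execute a docker CLI command and return its stdout."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


def residuals(stack: str, run: Runner = _docker) -> Dict[str, List[str]]:
    """Return the IDs/names of swarm objects still labelled with this stack."""
    flt = ["--filter", f"label={STACK_LABEL}={stack}", "-q"]
    commands = {
        "services": ["docker", "service", "ls", *flt],
        "volumes": ["docker", "volume", "ls", *flt],
        "secrets": ["docker", "secret", "ls", *flt],
        "containers": ["docker", "ps", "-a", *flt],
    }
    return {kind: run(argv).split() for kind, argv in commands.items()}


def assert_torn_down(stack: str, run: Runner = _docker) -> None:
    """Raise AssertionError if any object of the stack survived teardown."""
    left = {kind: ids for kind, ids in residuals(stack, run).items() if ids}
    if left:
        raise AssertionError(f"teardown left residuals for {stack}: {left}")
```

Called from a test finalizer, a failure here would turn a partial teardown into a red run. The app `.env` file counted in the gate is not covered by this sketch and would need a separate filesystem check.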
### M5 — Upgrade + backup/restore stages
- [x] Add upgrade + backup/restore stages for recipe #1 (custom-html). backup-bot-two deployed as a
reconcile oneshot (modules/backupbot.nix). Data marker served via nginx for assertions.
- [x] Gate: M5 — upgrade preserves data; backup→mutate→restore returns original → CLAIMED 2026-05-27.
Full 3-stage run green: install(2)+upgrade(1)+backup(1) passed; teardown leaves 0 orphans, infra intact.
### M6 — Recipe-local tests + second recipe
- [x] D4 recipe-local discovery: recipe-shipped tests/ snapshotted post-fetch + run against the live
app as a `recipe-local` stage (contract CCCI_BASE_URL/CCCI_APP_DOMAIN). Demo'd via mirror branch
recipe-maintainers/custom-html@ci/d4-recipe-local → recipe-local test PASSED against live app.
- [x] Enroll DB-backed recipe #2 (keycloak + mariadb) via per-recipe tests/keycloak/ only (no harness
surgery): install green (realm health + Playwright admin login). docs/enroll-recipe.md written.
- [x] Gate: M6 — both recipes green (custom-html 3-stage; keycloak install) + recipe-local merged →
CLAIMED 2026-05-27. keycloak full 3-stage (DB data survival) folds into the M6.5 breadth ramp.
### M6.5 — Breadth ramp (recipes 3→6)
- [ ] Enroll recipes 3–6 covering remaining D10 categories, no harness surgery
- [ ] Gate: M6.5 — recipes 3–6 three-stage green
### M7 — Secrets hardening (D6)
- [ ] Full sops model, rotation doc, log redaction + leak test
- [ ] Gate: M7 — secret-grep finds nothing
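
A minimal sketch of what "log redaction + leak test" could mean in practice, assuming the set of secret values is available to the runner at run time. `redact`, `find_leaks`, and `MASK` are hypothetical names, not part of the existing harness. Matching is verbatim only, so encoded forms of a secret (base64, URL-encoded) would not be caught.

```python
from typing import Iterable, List

MASK = "[REDACTED]"


def redact(text: str, secrets: Iterable[str]) -> str:
    """Replace every occurrence of each non-empty secret value with a mask."""
    # Longest first, so a secret that contains another one is masked whole.
    for value in sorted({s for s in secrets if s}, key=len, reverse=True):
        text = text.replace(value, MASK)
    return text


def find_leaks(text: str, secrets: Iterable[str]) -> List[int]:
    """Return 1-based line numbers on which any secret value appears verbatim."""
    values = [s for s in secrets if s]
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if any(value in line for value in values)
    ]
```

The M7 gate ("secret-grep finds nothing") would then correspond to `find_leaks` returning an empty list for every captured log and artifact.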
### M8 — Dashboard (D7)
- [ ] Overview page + badges + PR-comment outcome reflection
- [ ] Gate: M8 — overview matches reality; outcomes mirrored
### M9 — Reproducibility + docs (D8/D9)
- [ ] docs/install.md from-scratch rebuild; all docs complete
- [ ] Gate: M9 — Adversary rebuilds from docs on throwaway host
### M10 — Proof (D10)
- [ ] All six recipes green via real !testme PRs; flip STATUS to DONE
## Adversary findings
<!-- Adversary-only section. Builder must not edit below this line. -->
- [x] **[adversary] A1 — Test-app deploys can silently trigger ACME (no-ACME design hazard).**
**CLOSED @2026-05-27T00:35Z** by Adversary re-test. `runner/harness/lifecycle.deploy_app`
calls `abra.env_set(domain, "LETS_ENCRYPT_ENV", "")` before every deploy. Verified on a live
harness app (`cust-c95a69`): env `LETS_ENCRYPT_ENV=` empty, no `certresolver` label, **0 ACME
log lines**, and the served cert is the **wildcard** `CN=*.ci.commoninternet.net` (verify ok)
— not a per-host ACME cert. No-ACME holds for harness deploys. (Structural belt-and-suspenders
— dropping the unused `certificatesResolvers` from traefik — remains a nice-to-have, tracked
under A3/M7, not required to close A1.)
- [ ] **[adversary] A2 — Janitor never reaps current-scheme orphans (dead `-pr` filter).**
Found during M4 review. `harness.lifecycle.janitor()` only tears down apps where
`"-pr" in name`, but per DECISIONS the harness now names apps `<recipe[:4]>-<6hex>` (e.g.
`cust-c95a69`) — **no `-pr` substring**. So the run-start crash-recovery sweep (§4.3: "nuke
any orphaned `*-pr*` apps") matches **nothing** and is effectively a no-op. The happy-path
finalizer in `conftest.deployed_app` does work (observed: `cust-e084bd` from a prior run was
torn down), but a run that crashes/reboots *before* the finalizer runs leaves an orphan that
no later run will reap. *Fix:* match the actual naming (e.g. regex `^[a-z]{1,4}-[0-9a-f]{6}\.`
or a dedicated CI label/prefix) and gate on age. *Re-test:* deploy a harness app, simulate a
crash (kill the run before teardown), then start a new run and confirm janitor reaps the
orphan. Adversary closes after re-test.
- [ ] **[adversary] A3 — Teardown is unverified/best-effort; a failure silently orphans + run stays green.**
Found during M4 review (to confirm empirically with a kill-mid-run probe). `lifecycle.teardown_app`
runs every abra call with `check=False` and "never raises"; the conftest finalizer never
asserts teardown succeeded. Worse, `abra.app_config_remove` deletes the app `.env`
**unconditionally**, even if `abra.undeploy` failed first — leaving the swarm service+volume
running but with no `.env`, so the app can no longer be managed/undeployed via abra (and a
fixed janitor that shells `abra app undeploy` couldn't reap it either). Net: a partial teardown
leaves a silent orphan while pytest still reports the run **green**, so the M4/D2 guarantee
"no orphaned app/volume afterward" is not actually *verified* by the harness. *Fix:* assert
post-teardown that the stack/services/volumes/secrets are gone (fail the run otherwise); only
remove the `.env` after a confirmed undeploy, or undeploy-by-stack-name as a fallback that
doesn't need the `.env`. *Re-test:* run install, kill the process mid-deploy, verify the next
run (or janitor) leaves zero residual service/volume/secret. Adversary closes after re-test.
- [ ] **[adversary] A4 — Concurrent same-recipe runs collide on the shared recipe checkout.**
Found by review (M6 verify); to confirm empirically. Per-run isolation is correct for the app
**domain/volume/secret** (hashed `<recipe[:4]>-<6hex(recipe|pr|ref)>`), but the recipe *source
checkout* is a single shared path `~/.abra/recipes/<recipe>`: `run_recipe_ci.fetch_recipe`
does `rm -rf ~/.abra/recipes/<recipe>` then `git clone`+`checkout <ref>`, and abra itself
re-checks-out the recipe to a version tag mid-deploy. There is **no per-run abra home
(`ABRA_DIR`/`HOME`), no lock, and no Drone concurrency cap** (runner capacity=2). So two
concurrent runs of the **same recipe at different refs** (e.g. `!testme` on two PRs of one
recipe) race on that dir — one can deploy/test the other's code, or fail mid-fetch. (Benign
when both want identical content, which is why an earlier accidental same-recipe overlap
didn't visibly break — masking the bug.) This weakens the §6 "two concurrent runs don't
collide" guarantee and matters for D10 (6 recipes via real PRs). *Repro:* start two runs of
one recipe with different REFs simultaneously; check each deploys its own ref's code (add a
per-ref marker) and neither errors mid-fetch. *Fix:* per-run abra home/recipe dir (e.g.
`ABRA_DIR=$(mktemp -d)` or `~/.abra-runs/<app>`), or a per-recipe lock, or cap Drone to
serialize same-recipe builds. Adversary confirms + closes after re-test.
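
The fix proposed in A2 (match the real naming and gate on age) can be sketched as a pure selection function. The regex is the one quoted in A2. The input shape (app domain mapped to a creation timestamp) and the name `reapable` are assumptions for illustration, and how the creation time is obtained is left open.

```python
import re
import time
from typing import Dict, List, Optional

# Pattern quoted in A2: up to four lowercase letters, a dash, six hex digits, a dot.
HARNESS_APP = re.compile(r"^[a-z]{1,4}-[0-9a-f]{6}\.")


def reapable(
    apps: Dict[str, float], min_age_s: float, now: Optional[float] = None
) -> List[str]:
    """Return harness-named app domains that are at least min_age_s seconds old.

    apps maps app domain -> creation time as a Unix timestamp (hypothetical shape).
    """
    now = time.time() if now is None else now
    return sorted(
        name
        for name, created in apps.items()
        if HARNESS_APP.match(name) and now - created >= min_age_s
    )
```

The age gate keeps the sweep from reaping an app that belongs to a run still in progress. A recipe whose first four characters include a digit or hyphen would not match `[a-z]{1,4}`, which is one argument for the dedicated label or prefix that A2 also suggests.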
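
Of the three fixes listed in A4, the per-recipe lock is the smallest change. A minimal sketch follows, assuming a Unix host (it uses `fcntl.flock`) and that all runs execute on the same machine. The default lock directory and the name `recipe_lock` are hypothetical.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator

# Hypothetical default location; any directory shared by all runs would do.
DEFAULT_LOCK_DIR = os.path.join(tempfile.gettempdir(), "cc-ci-locks")


@contextmanager
def recipe_lock(recipe: str, lock_dir: str = DEFAULT_LOCK_DIR) -> Iterator[None]:
    """Hold an exclusive advisory lock for one recipe until the block exits."""
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, f"{recipe}.lock")
    with open(path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks while another run holds it
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

Because A4 notes that abra re-checks-out the recipe mid-deploy, the lock would have to span the whole run (fetch through teardown), not only `fetch_recipe`. That serialises same-recipe runs, while runs of different recipes still proceed in parallel. A per-run `ABRA_DIR` avoids the serialisation at the cost of a fresh clone per run.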