Files
cc-ci/BACKLOG.md
autonomic-bot 7eb0dd3c77
All checks were successful
continuous-integration/drone/push Build is passing
M5: upgrade + backup/restore stages green (custom-html); backup-bot-two oneshot
3-stage run green (install/upgrade/backup), clean teardown. backupbot deployed
via reconcile oneshot; PTY (script) for abra backup/restore; -m for secret generate
(no value leak). M5 CLAIMED.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 00:53:16 +01:00

129 lines
8.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# BACKLOG — cc-ci
Two single-writer sections (§6.1): Builder edits only `## Build backlog`; Adversary edits only
`## Adversary findings`. Closing an item = checking the box in your own section.
## Build backlog
### M0 — Foundations
- [x] Author flake.nix (NixOS host cc-ci) + hosts/cc-ci/{configuration,hardware}.nix from baseline
- [x] Deploy mechanism decision + first rebuild from repo (DECISIONS.md) — switch --flake on host
- [x] sops-nix wiring: host age key (from ssh host key) + master recovery key; secrets/secrets.yaml;
decrypt a test secret on host → /run/secrets/test_secret (0400 root) verified
- [x] Gate: M0 — `ssh cc-ci 'systemctl is-system-running'` healthy after rebuild from repo
→ CLAIMED 2026-05-26, awaiting Adversary (see STATUS.md)
### M1 — Swarm + abra target
- [x] Docker + single-node swarm via Nix (modules/swarm.nix: docker + swarm-init oneshot + `proxy`
overlay net + daily autoprune). Verified: Swarm=active, proxy overlay present.
- [x] Proxy = real coop-cloud/traefik via abra (orchestrator decision, replaces custom traefik.nix):
wildcard/file-provider mode, pre-issued cert as ssl_cert/ssl_key swarm secrets, LETS_ENCRYPT_ENV
empty → no ACME. `scripts/deploy-proxy.sh` (idempotent). Verified E2E via gateway: wildcard cert
served, 0 ACME log lines.
- [x] abra installed (modules/abra.nix, pinned 0.13.0-beta); deployed custom-html by hand over HTTPS
(HTTP 200 nginx page via gateway) and tore it down clean (services/volumes/secrets/containers=0).
- [x] Gate: M1 — recipe reachable over HTTPS at *.ci.commoninternet.net, torn down clean →
CLAIMED 2026-05-26, awaiting Adversary.
### M2 — Drone online
- [x] Drone server (coop-cloud recipe, reconcile oneshot) + exec runner via Nix; Gitea OAuth app.
Server healthz 200 via gateway; runner polling (capacity=2, type=exec).
- [x] hello-world .drone.yml runs green; logs visible (Drone UI + API). Build #1 success: clone +
hello (echo/whoami=root/abra 0.13.0-beta/swarm=active), both exit 0.
- [x] Gate: M2 — push to cc-ci triggers visible green build → CLAIMED 2026-05-26, awaiting Adversary.
OAuth link via one-time `scripts/bootstrap-drone-oauth.sh` (documented in install.md §2).
### M3 — Comment bridge
- [ ] comment-bridge service: HMAC verify, !testme exact match, collaborator check, Drone API call
- [ ] PR comment posting with run link
- [ ] Gate: M3 — live demo on scratch PR; auth enforced
### M4 — Harness + install stage
- [x] run_recipe_ci.py + conftest + harness (abra wrappers, lifecycle) + Nix python/playwright env
(cc-ci-run); install stage for recipe #1 (custom-html) + Playwright assertion; guaranteed teardown
- [x] Gate: M4 — green install run, no orphaned app/volume → CLAIMED 2026-05-27, awaiting Adversary.
Repro: `cd /root/cc-ci && RECIPE=custom-html PR=0 REF=m4demo cc-ci-run runner/run_recipe_ci.py`
→ 2 passed (http 200 + playwright); teardown leaves services/volumes/secrets/containers/env = 0.
### M5 — Upgrade + backup/restore stages
- [x] Add upgrade + backup/restore stages for recipe #1 (custom-html). backup-bot-two deployed as a
reconcile oneshot (modules/backupbot.nix). Data marker served via nginx for assertions.
- [x] Gate: M5 — upgrade preserves data; backup→mutate→restore returns original → CLAIMED 2026-05-27.
Full 3-stage run green: install(2)+upgrade(1)+backup(1) passed; teardown leaves 0 orphans, infra intact.
### M6 — Recipe-local tests + second recipe
- [ ] Discover/run recipe-repo tests/; enroll DB-backed recipe #2
- [ ] Gate: M6 — both green; recipe-local tests merged
### M6.5 — Breadth ramp (recipes 3→6)
- [ ] Enroll recipes 36 covering remaining D10 categories, no harness surgery
- [ ] Gate: M6.5 — recipes 36 three-stage green
### M7 — Secrets hardening (D6)
- [ ] Full sops model, rotation doc, log redaction + leak test
- [ ] Gate: M7 — secret-grep finds nothing
### M8 — Dashboard (D7)
- [ ] Overview page + badges + PR-comment outcome reflection
- [ ] Gate: M8 — overview matches reality; outcomes mirrored
### M9 — Reproducibility + docs (D8/D9)
- [ ] docs/install.md from-scratch rebuild; all docs complete
- [ ] Gate: M9 — Adversary rebuilds from docs on throwaway host
### M10 — Proof (D10)
- [ ] All six recipes green via real !testme PRs; flip STATUS to DONE
## Adversary findings
<!-- Adversary-only section. Builder must not edit below this line. -->
- [x] **[adversary] A1 — Test-app deploys can silently trigger ACME (no-ACME design hazard).**
**CLOSED @2026-05-27T00:35Z** by Adversary re-test. `runner/harness/lifecycle.deploy_app`
calls `abra.env_set(domain, "LETS_ENCRYPT_ENV", "")` before every deploy. Verified on a live
harness app (`cust-c95a69`): env `LETS_ENCRYPT_ENV=` empty, no `certresolver` label, **0 ACME
log lines**, and the served cert is the **wildcard** `CN=*.ci.commoninternet.net` (verify ok)
— not a per-host ACME cert. No-ACME holds for harness deploys. (Structural belt-and-suspenders
— dropping the unused `certificatesResolvers` from traefik — remains a nice-to-have, tracked
under A3/M7, not required to close A1.)
- [ ] **[adversary] A2 — Janitor never reaps current-scheme orphans (dead `-pr` filter).**
Found during M4 review. `harness.lifecycle.janitor()` only tears down apps where
`"-pr" in name`, but per DECISIONS the harness now names apps `<recipe[:4]>-<6hex>` (e.g.
`cust-c95a69`) — **no `-pr` substring**. So the run-start crash-recovery sweep (§4.3: "nuke
any orphaned `*-pr*` apps") matches **nothing** and is effectively a no-op. The happy-path
finalizer in `conftest.deployed_app` does work (observed: `cust-e084bd` from a prior run was
torn down), but a run that crashes/reboots *before* the finalizer runs leaves an orphan that
no later run will reap. *Fix:* match the actual naming (e.g. regex `^[a-z]{1,4}-[0-9a-f]{6}\.`
or a dedicated CI label/prefix) and gate on age. *Re-test:* deploy a harness app, simulate a
crash (kill the run before teardown), then start a new run and confirm janitor reaps the
orphan. Adversary closes after re-test.
- [ ] **[adversary] A3 — Teardown is unverified/best-effort; a failure silently orphans + run stays green.**
Found during M4 review (to confirm empirically with a kill-mid-run probe). `lifecycle.teardown_app`
runs every abra call with `check=False` and "never raises"; the conftest finalizer never
asserts teardown succeeded. Worse, `abra.app_config_remove` deletes the app `.env`
**unconditionally**, even if `abra.undeploy` failed first — leaving the swarm service+volume
running but with no `.env`, so the app can no longer be managed/undeployed via abra (and a
fixed janitor that shells `abra app undeploy` couldn't reap it either). Net: a partial teardown
leaves a silent orphan while pytest still reports the run **green**, so the M4/D2 guarantee
"no orphaned app/volume afterward" is not actually *verified* by the harness. *Fix:* assert
post-teardown that the stack/services/volumes/secrets are gone (fail the run otherwise); only
remove the `.env` after a confirmed undeploy, or undeploy-by-stack-name as a fallback that
doesn't need the `.env`. *Re-test:* run install, kill the process mid-deploy, verify the next
run (or janitor) leaves zero residual service/volume/secret. Adversary closes after re-test.
Found during M1 verify (M1 still PASSes — proxy itself fires no ACME). cc-ci's traefik static
config (`/etc/traefik/traefik.yml`) defines `staging` + `production` HTTP-01 `certificatesResolvers`
(stock coop-cloud template). They're currently inert (no router references them; both
`*-acme.json` are 0 bytes; 0 ACME log lines) because the proxy runs `LETS_ENCRYPT_ENV=""`.
**But** the recipe default for test apps (e.g. `custom-html/.env.sample`) ships
`LETS_ENCRYPT_ENV=production`, which renders `traefik.http.routers.<app>.tls.certresolver=production`.
So if the harness (M4+) deploys a test app *without* forcing `LETS_ENCRYPT_ENV=""`, traefik
WILL attempt Let's Encrypt HTTP-01 for that app's domain — contradicting the "NO ACME" design,
hitting LE rate limits, and likely failing (HTTP-01 needs :80 reachable; gateway passes TLS).
*Repro:* `abra app new custom-html -D x.ci.commoninternet.net` (keep default env) → deploy →
`docker service inspect <app> ... | grep certresolver` shows `=production`.
*Fix:* harness must force `LETS_ENCRYPT_ENV=""` (or strip the certresolver label) on every
test-app deploy; and/or remove the unused `certificatesResolvers` from cc-ci's traefik so
no-ACME is structural. Re-test: deploy a test app via the harness and confirm 0 ACME log lines
+ served cert is the wildcard. Adversary closes after re-test.