cc-ci/BACKLOG.md
autonomic-bot 7eb0dd3c77
M5: upgrade + backup/restore stages green (custom-html); backup-bot-two oneshot
3-stage run green (install/upgrade/backup), clean teardown. backupbot deployed
via reconcile oneshot; PTY (script) for abra backup/restore; -m for secret generate
(no value leak). M5 CLAIMED.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 00:53:16 +01:00


BACKLOG — cc-ci

Two single-writer sections (§6.1): Builder edits only ## Build backlog; Adversary edits only ## Adversary findings. Closing an item = checking the box in your own section.

Build backlog

M0 — Foundations

  • Author flake.nix (NixOS host cc-ci) + hosts/cc-ci/{configuration,hardware}.nix from baseline
  • Deploy mechanism decision + first rebuild from repo (DECISIONS.md) — switch --flake on host
  • sops-nix wiring: host age key (from ssh host key) + master recovery key; secrets/secrets.yaml; decrypt a test secret on host → /run/secrets/test_secret (0400 root) verified
  • Gate: M0 — ssh cc-ci 'systemctl is-system-running' healthy after rebuild from repo → CLAIMED 2026-05-26, awaiting Adversary (see STATUS.md)

M1 — Swarm + abra target

  • Docker + single-node swarm via Nix (modules/swarm.nix: docker + swarm-init oneshot + proxy overlay net + daily autoprune). Verified: Swarm=active, proxy overlay present.
  • Proxy = real coop-cloud/traefik via abra (orchestrator decision, replaces custom traefik.nix): wildcard/file-provider mode, pre-issued cert as ssl_cert/ssl_key swarm secrets, LETS_ENCRYPT_ENV empty → no ACME. scripts/deploy-proxy.sh (idempotent). Verified E2E via gateway: wildcard cert served, 0 ACME log lines.
  • abra installed (modules/abra.nix, pinned 0.13.0-beta); deployed custom-html by hand over HTTPS (HTTP 200 nginx page via gateway) and tore it down clean (services/volumes/secrets/containers=0).
  • Gate: M1 — recipe reachable over HTTPS at *.ci.commoninternet.net, torn down clean → CLAIMED 2026-05-26, awaiting Adversary.

M2 — Drone online

  • Drone server (coop-cloud recipe, reconcile oneshot) + exec runner via Nix; Gitea OAuth app. Server healthz 200 via gateway; runner polling (capacity=2, type=exec).
  • hello-world .drone.yml runs green; logs visible (Drone UI + API). Build #1 success: clone + hello (echo/whoami=root/abra 0.13.0-beta/swarm=active), both exit 0.
  • Gate: M2 — push to cc-ci triggers visible green build → CLAIMED 2026-05-26, awaiting Adversary. OAuth link via one-time scripts/bootstrap-drone-oauth.sh (documented in install.md §2).

M3 — Comment bridge

  • comment-bridge service: HMAC verify, !testme exact match, collaborator check, Drone API call
  • PR comment posting with run link
  • Gate: M3 — live demo on scratch PR; auth enforced
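The first two checks of the comment-bridge item (HMAC verify and the exact `!testme` match) could look roughly like the sketch below. It assumes the webhook delivers a hex-encoded HMAC-SHA256 of the raw request body (as Gitea does in its `X-Gitea-Signature` header) and that surrounding whitespace in the comment is tolerated. The function names are hypothetical and not taken from the repo. The collaborator check and the Drone API call are left out.

```python
import hashlib
import hmac


def signature_is_valid(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the hex HMAC-SHA256 of body under secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = signature_hex.strip().lower()
    # compare_digest takes time independent of where the first mismatch occurs.
    return hmac.compare_digest(expected.encode(), received.encode())


def is_testme(comment_body: str) -> bool:
    """Exact match: the comment must be '!testme' and nothing else.

    Assumption: leading/trailing whitespace is ignored.
    """
    return comment_body.strip() == "!testme"
```

The signature is verified over the raw bytes before any JSON parsing, so a request with a bad signature is rejected without its payload being interpreted.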

M4 — Harness + install stage

  • run_recipe_ci.py + conftest + harness (abra wrappers, lifecycle) + Nix python/playwright env (cc-ci-run); install stage for recipe #1 (custom-html) + Playwright assertion; guaranteed teardown
  • Gate: M4 — green install run, no orphaned app/volume → CLAIMED 2026-05-27, awaiting Adversary. Repro: cd /root/cc-ci && RECIPE=custom-html PR=0 REF=m4demo cc-ci-run runner/run_recipe_ci.py → 2 passed (http 200 + playwright); teardown leaves services/volumes/secrets/containers/env = 0.
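The gate's "no orphaned app/volume" condition could be checked mechanically rather than by inspection. The sketch below is a pure function over resource-name listings and makes two assumptions: that swarm prefixes a stack's services, volumes and secrets with `<stack>_`, and that the stack name is the app domain with dots replaced by underscores. All function names are hypothetical. The listings would have to be collected separately, for example from `docker service ls --format '{{.Name}}'` and its volume/secret equivalents.

```python
from typing import Dict, Iterable, List


def stack_name(domain: str) -> str:
    """Assumed mapping from app domain to swarm stack name (dots -> underscores)."""
    return domain.replace(".", "_")


def residual_resources(
    domain: str, listings: Dict[str, Iterable[str]]
) -> Dict[str, List[str]]:
    """Return, per resource kind, the names that still carry the app's stack prefix."""
    prefix = stack_name(domain) + "_"
    leftovers: Dict[str, List[str]] = {}
    for kind, names in listings.items():
        hits = sorted(name for name in names if name.startswith(prefix))
        if hits:
            leftovers[kind] = hits
    return leftovers


def assert_clean(domain: str, listings: Dict[str, Iterable[str]]) -> None:
    """Raise AssertionError if any resource of the app's stack is still present."""
    leftovers = residual_resources(domain, listings)
    if leftovers:
        raise AssertionError(f"teardown left residue for {domain}: {leftovers}")
```

Called from the test finalizer after teardown, a check like this would turn a leftover service or volume into a failed run instead of a silent orphan.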

M5 — Upgrade + backup/restore stages

  • Add upgrade + backup/restore stages for recipe #1 (custom-html). backup-bot-two deployed as a reconcile oneshot (modules/backupbot.nix). Data marker served via nginx for assertions.
  • Gate: M5 — upgrade preserves data; backup→mutate→restore returns original → CLAIMED 2026-05-27. Full 3-stage run green: install(2)+upgrade(1)+backup(1) passed; teardown leaves 0 orphans, infra intact.

M6 — Recipe-local tests + second recipe

  • Discover/run recipe-repo tests/; enroll DB-backed recipe #2
  • Gate: M6 — both green; recipe-local tests merged

M6.5 — Breadth ramp (recipes 3→6)

  • Enroll recipes 3-6 covering remaining D10 categories, no harness surgery
  • Gate: M6.5 — recipes 3-6 three-stage green

M7 — Secrets hardening (D6)

  • Full sops model, rotation doc, log redaction + leak test
  • Gate: M7 — secret-grep finds nothing

M8 — Dashboard (D7)

  • Overview page + badges + PR-comment outcome reflection
  • Gate: M8 — overview matches reality; outcomes mirrored

M9 — Reproducibility + docs (D8/D9)

  • docs/install.md from-scratch rebuild; all docs complete
  • Gate: M9 — Adversary rebuilds from docs on throwaway host

M10 — Proof (D10)

  • All six recipes green via real !testme PRs; flip STATUS to DONE

Adversary findings

  • [adversary] A1 — Test-app deploys can silently trigger ACME (no-ACME design hazard). CLOSED @2026-05-27T00:35Z by Adversary re-test. runner/harness/lifecycle.deploy_app calls abra.env_set(domain, "LETS_ENCRYPT_ENV", "") before every deploy. Verified on a live harness app (cust-c95a69): env LETS_ENCRYPT_ENV= empty, no certresolver label, 0 ACME log lines, and the served cert is the wildcard CN=*.ci.commoninternet.net (verify ok) — not a per-host ACME cert. No-ACME holds for harness deploys. (Structural belt-and-suspenders — dropping the unused certificatesResolvers from traefik — remains a nice-to-have, tracked under A3/M7, not required to close A1.)

  • [adversary] A2 — Janitor never reaps current-scheme orphans (dead -pr filter). Found during M4 review. harness.lifecycle.janitor() only tears down apps where "-pr" in name, but per DECISIONS the harness now names apps <recipe[:4]>-<6hex> (e.g. cust-c95a69) — no -pr substring. So the run-start crash-recovery sweep (§4.3: "nuke any orphaned *-pr* apps") matches nothing and is effectively a no-op. The happy-path finalizer in conftest.deployed_app does work (observed: cust-e084bd from a prior run was torn down), but a run that crashes/reboots before the finalizer runs leaves an orphan that no later run will reap. Fix: match the actual naming (e.g. regex ^[a-z]{1,4}-[0-9a-f]{6}\. or a dedicated CI label/prefix) and gate on age. Re-test: deploy a harness app, simulate a crash (kill the run before teardown), then start a new run and confirm janitor reaps the orphan. Adversary closes after re-test.

  • [adversary] A3 — Teardown is unverified/best-effort; a failure silently orphans resources while the run stays green. Found during M4 review (to confirm empirically with a kill-mid-run probe). lifecycle.teardown_app runs every abra call with check=False and "never raises"; the conftest finalizer never asserts that teardown succeeded. Worse, abra.app_config_remove deletes the app .env unconditionally, even if abra.undeploy failed first. That leaves the swarm service and volume running with no .env, so the app can no longer be managed or undeployed via abra (and a fixed janitor that shells out to abra app undeploy couldn't reap it either). Net: a partial teardown leaves a silent orphan while pytest still reports the run green, so the M4/D2 guarantee "no orphaned app/volume afterward" is not actually verified by the harness. Fix: assert post-teardown that the stack/services/volumes/secrets are gone (fail the run otherwise); only remove the .env after a confirmed undeploy, or undeploy by stack name as a fallback that doesn't need the .env. Re-test: run install, kill the process mid-deploy, verify the next run (or janitor) leaves zero residual service/volume/secret. Adversary closes after re-test.

  • [adversary] A1 (original report; CLOSED, see above). Found during M1 verify (M1 still PASSes; the proxy itself fires no ACME). cc-ci's traefik static config (/etc/traefik/traefik.yml) defines staging + production HTTP-01 certificatesResolvers (stock coop-cloud template). They're currently inert (no router references them; both *-acme.json are 0 bytes; 0 ACME log lines) because the proxy runs LETS_ENCRYPT_ENV="". But the recipe default for test apps (e.g. custom-html/.env.sample) ships LETS_ENCRYPT_ENV=production, which renders traefik.http.routers.<app>.tls.certresolver=production. So if the harness (M4+) deploys a test app without forcing LETS_ENCRYPT_ENV="", traefik WILL attempt Let's Encrypt HTTP-01 for that app's domain, contradicting the "NO ACME" design, hitting LE rate limits, and likely failing (HTTP-01 needs :80 reachable; gateway passes TLS). Repro: abra app new custom-html -D x.ci.commoninternet.net (keep default env) → deploy → docker service inspect <app> ... | grep certresolver shows =production. Fix: harness must force LETS_ENCRYPT_ENV="" (or strip the certresolver label) on every test-app deploy; and/or remove the unused certificatesResolvers from cc-ci's traefik so no-ACME is structural. Re-test: deploy a test app via the harness and confirm 0 ACME log lines + served cert is the wildcard. Adversary closes after re-test.
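The fix proposed in A2 (match the actual app naming and gate on age) might be sketched as follows. The regex is the one quoted in the finding. The `min_age` threshold and the `reapable` helper are hypothetical, since the finding does not name a value or an interface. How the (domain, created-at) pairs are obtained is deliberately left out.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

# Pattern quoted in finding A2: <recipe[:4]>-<6hex> followed by the domain dot.
HARNESS_APP = re.compile(r"^[a-z]{1,4}-[0-9a-f]{6}\.")


def reapable(
    apps: Iterable[Tuple[str, datetime]], now: datetime, min_age: timedelta
) -> List[str]:
    """Return domains that look like harness apps and are at least min_age old."""
    return [
        domain
        for domain, created in apps
        if HARNESS_APP.match(domain) and now - created >= min_age
    ]


# Usage with made-up timestamps; only the old harness app is selected.
now = datetime(2026, 5, 27, 1, 0, tzinfo=timezone.utc)
apps = [
    ("cust-c95a69.ci.commoninternet.net", now - timedelta(hours=2)),
    ("cust-e084bd.ci.commoninternet.net", now - timedelta(minutes=5)),
    ("traefik.ci.commoninternet.net", now - timedelta(days=10)),
]
stale = reapable(apps, now, timedelta(hours=1))
```

The age gate keeps the janitor from reaping an app that a concurrent run has only just deployed, while the anchored pattern keeps it away from infrastructure apps such as the proxy.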