cc-ci/BACKLOG.md
autonomic-bot 537fd47818
M7/D6 gate CLAIMED: rotation doc + redaction; M6.5 PASS recorded
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 07:45:19 +01:00


BACKLOG — cc-ci

Two single-writer sections (§6.1): Builder edits only ## Build backlog; Adversary edits only ## Adversary findings. Closing an item = checking the box in your own section.

Build backlog

M0 — Foundations

  • Author flake.nix (NixOS host cc-ci) + hosts/cc-ci/{configuration,hardware}.nix from baseline
  • Deploy mechanism decision + first rebuild from repo (DECISIONS.md) — switch --flake on host
  • sops-nix wiring: host age key (from ssh host key) + master recovery key; secrets/secrets.yaml; decrypt a test secret on host → /run/secrets/test_secret (0400 root) verified
  • Gate: M0 — ssh cc-ci 'systemctl is-system-running' healthy after rebuild from repo → CLAIMED 2026-05-26, awaiting Adversary (see STATUS.md)
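
The `0400 root` check on the decrypted test secret could be scripted roughly as follows. This is a minimal sketch, not the repo's actual verification step; the function name `secret_is_locked_down` and its `expected_uid` parameter are hypothetical.

```python
import os
import stat


def secret_is_locked_down(path: str, expected_uid: int = 0) -> bool:
    """Return True if `path` is a regular file with mode 0400 owned by expected_uid.

    os.stat follows symlinks, so a symlinked secret path is checked at its target.
    """
    st = os.stat(path)
    return (
        stat.S_ISREG(st.st_mode)
        and stat.S_IMODE(st.st_mode) == 0o400
        and st.st_uid == expected_uid
    )
```

On the host this would be called as `secret_is_locked_down("/run/secrets/test_secret")`, with the default `expected_uid=0` standing for root.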

M1 — Swarm + abra target

  • Docker + single-node swarm via Nix (modules/swarm.nix: docker + swarm-init oneshot + proxy overlay net + daily autoprune). Verified: Swarm=active, proxy overlay present.
  • Proxy = real coop-cloud/traefik via abra (orchestrator decision, replaces custom traefik.nix): wildcard/file-provider mode, pre-issued cert as ssl_cert/ssl_key swarm secrets, LETS_ENCRYPT_ENV empty → no ACME. scripts/deploy-proxy.sh (idempotent). Verified E2E via gateway: wildcard cert served, 0 ACME log lines.
  • abra installed (modules/abra.nix, pinned 0.13.0-beta); deployed custom-html by hand over HTTPS (HTTP 200 nginx page via gateway) and tore it down clean (services/volumes/secrets/containers=0).
  • Gate: M1 — recipe reachable over HTTPS at *.ci.commoninternet.net, torn down clean → CLAIMED 2026-05-26, awaiting Adversary.

M2 — Drone online

  • Drone server (coop-cloud recipe, reconcile oneshot) + exec runner via Nix; Gitea OAuth app. Server healthz 200 via gateway; runner polling (capacity=2, type=exec).
  • hello-world .drone.yml runs green; logs visible (Drone UI + API). Build #1 success: clone + hello (echo/whoami=root/abra 0.13.0-beta/swarm=active), both exit 0.
  • Gate: M2 — push to cc-ci triggers visible green build → CLAIMED 2026-05-26, awaiting Adversary. OAuth link via one-time scripts/bootstrap-drone-oauth.sh (documented in install.md §2).

M3 — Comment bridge

  • comment-bridge service: polling PRIMARY (read-only, ≤30s) + optional admin webhook; !testme exact match; org-membership auth (GET /orgs/{owner}/members/{user} 204) + allowlist; Drone API
  • PR comment posting with run link
  • Gate: M3 — live demo on scratch PR; auth enforced → CLAIMED 2026-05-27. Posted !testme on PR #1 → poll fired in 6s → Drone build #26 for head d397720a → bridge commented run link back. Org-membership auth verified (bot/trav/notplants 204, non-member 404 at read level).
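
The trigger and auth rules above can be sketched as two small predicates. This is an illustration only, not the bridge's code: the function names are hypothetical, stripping surrounding whitespace before the exact match is an assumption, and so is treating the allowlist as an alternative to org membership (the backlog says only "org-membership auth + allowlist").

```python
from typing import Collection

TRIGGER = "!testme"


def is_trigger(comment_body: str) -> bool:
    """Exact match: the whole comment must be the trigger.

    Assumption: surrounding whitespace (e.g. a trailing newline) is ignored.
    """
    return comment_body.strip() == TRIGGER


def is_authorized(membership_status: int, user: str, allowlist: Collection[str]) -> bool:
    """Decide whether `user` may trigger a run.

    membership_status is the HTTP status of GET /orgs/{owner}/members/{user}:
    204 means member. Assumption: a non-member is still allowed if allowlisted.
    """
    return membership_status == 204 or user in allowlist
```

With these rules, a comment such as `please !testme` does not trigger a run, and a non-member who gets a 404 from the membership endpoint is rejected unless allowlisted.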

Bridge→Drone→harness integration (connects M3 trigger to M4/M5 recipe CI; blocks D2/D10 via !testme)

  • Add a recipe-CI pipeline to .drone.yml keyed on event=custom: runs cc-ci-run runner/run_recipe_ci.py STAGES=install,upgrade,backup, CCCI_JANITOR_MAX_AGE=0, concurrency:{limit:1}, HOME=/root. Self-test pipeline now event=push. (commits 9d51cb6+)
  • Verify a recipe build runs the full 3-stage CI through Drone (not self-test): build #33 → success, install/upgrade/backup all green, clean teardown (0 orphans). HOME + backup -C -o + clean-reclone fixes applied.
  • Full single-comment E2E: enroll a recipe in the bridge POLL_REPOS + open a recipe PR → !testme → full 3-stage CI + PR comment outcome (folds into M6.5/M10 breadth).

M4 — Harness + install stage

  • run_recipe_ci.py + conftest + harness (abra wrappers, lifecycle) + Nix python/playwright env (cc-ci-run); install stage for recipe #1 (custom-html) + Playwright assertion; guaranteed teardown
  • Gate: M4 — green install run, no orphaned app/volume → CLAIMED 2026-05-27, awaiting Adversary. Repro: cd /root/cc-ci && RECIPE=custom-html PR=0 REF=m4demo cc-ci-run runner/run_recipe_ci.py → 2 passed (http 200 + playwright); teardown leaves services/volumes/secrets/containers/env = 0.
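
The gate's "= 0" condition amounts to a post-teardown residual check. The sketch below shows one way to express it, assuming the caller has already listed resource names per kind (for example from `docker service ls`). `find_residual` and `assert_clean` are hypothetical names. The stack-name rule (dots replaced by underscores) is inferred from the `advx-aaaaaa_ci_commoninternet_net` example under finding A3, where `TeardownError` is also named.

```python
from typing import Dict, Iterable, List


class TeardownError(RuntimeError):
    """Raised when resources belonging to an app survive teardown."""


def stack_name(domain: str) -> str:
    # Assumption: the stack name is the app domain with dots replaced by underscores,
    # e.g. advx-aaaaaa.ci.commoninternet.net -> advx-aaaaaa_ci_commoninternet_net.
    return domain.replace(".", "_")


def find_residual(domain: str, resources: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Return, per resource kind, the names that still belong to the app's stack."""
    prefix = stack_name(domain)
    leftover = {
        kind: [n for n in names if n == prefix or n.startswith(prefix + "_")]
        for kind, names in resources.items()
    }
    # Drop kinds with nothing left so an empty dict means "clean".
    return {kind: names for kind, names in leftover.items() if names}


def assert_clean(domain: str, resources: Dict[str, Iterable[str]]) -> None:
    """Fail loudly instead of letting a partial teardown pass silently."""
    residual = find_residual(domain, resources)
    if residual:
        raise TeardownError(f"residual resources for {domain}: {residual}")
```

Calling `assert_clean` from the test finalizer would turn a leftover service or volume into a failed run rather than a green one.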

M5 — Upgrade + backup/restore stages

  • Add upgrade + backup/restore stages for recipe #1 (custom-html). backup-bot-two deployed as a reconcile oneshot (modules/backupbot.nix). Data marker served via nginx for assertions.
  • Gate: M5 — upgrade preserves data; backup→mutate→restore returns original → CLAIMED 2026-05-27. Full 3-stage run green: install(2)+upgrade(1)+backup(1) passed; teardown leaves 0 orphans, infra intact.
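
The backup→mutate→restore assertion follows a fixed sequence that can be sketched independently of any recipe. The helper below is hypothetical and takes the four operations as callables; in the real harness these would wrap abra and the data marker, which are not shown here. The in-memory `state` dict is a stand-in used only to make the example runnable.

```python
from typing import Callable


def check_backup_mutate_restore(
    read: Callable[[], str],
    write: Callable[[str], None],
    backup: Callable[[], None],
    restore: Callable[[], None],
) -> None:
    """Assert that restore brings back exactly the state captured by backup."""
    original = read()
    backup()
    write(original + "-mutated")
    # Guard against a vacuous pass: the mutation must be observable.
    assert read() != original, "mutation did not take effect"
    restore()
    assert read() == original, "restore did not return the original data"


# Usage with an in-memory stand-in for the app's data marker.
state = {"live": "marker-v1", "snapshot": None}
check_backup_mutate_restore(
    read=lambda: state["live"],
    write=lambda v: state.update(live=v),
    backup=lambda: state.update(snapshot=state["live"]),
    restore=lambda: state.update(live=state["snapshot"]),
)
```

The intermediate assertion matters: without it, a test whose mutation silently fails would pass even if restore did nothing.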

M6 — Recipe-local tests + second recipe

  • D4 recipe-local discovery: recipe-shipped tests/ snapshotted post-fetch + run against the live app as a recipe-local stage (contract CCCI_BASE_URL/CCCI_APP_DOMAIN). Demo'd via mirror branch recipe-maintainers/custom-html@ci/d4-recipe-local → recipe-local test PASSED against live app.
  • Enroll DB-backed recipe #2 (keycloak + mariadb) via per-recipe tests/keycloak/ only (no harness surgery): install green (realm health + Playwright admin login). docs/enroll-recipe.md written.
  • Gate: M6 — both recipes green (custom-html 3-stage; keycloak install) + recipe-local merged → CLAIMED 2026-05-27. keycloak full 3-stage (DB data survival) folds into the M6.5 breadth ramp.

M6.5 — Breadth ramp (recipes 3→6)

  • keycloak (SSO/DB-backed, recipe #2) full 3-stage green through the Drone recipe-ci pipeline: build #39 success (~31m): install 2✓ (realm health + Playwright admin login), upgrade 1✓ (test_upgrade_preserves_realm — DB data survives), backup 1✓ (test_backup_mutate_restore). Clean teardown (0 keyc services/volumes). Proves DB-backed data survival + integration path.
  • cryptpad (stateful/no-DB, recipe #3) full 3-stage green on host (cc-ci-run): install 2✓ (http + Playwright), upgrade 1✓ (marker in cryptpad_data survives), backup 1✓ (test_backup_mutate_restore). No harness surgery — added generic per-recipe EXTRA_ENV (handles cryptpad's SANDBOX_DOMAIN). Fixed a real backup bug en route: set_env glued RESTIC_REPOSITORY onto a comment → backupbot had no restic repo (now newline-safe). Drone canonical run = build #46 success (~6m, all 3 stages green, clean teardown).
  • matrix-synapse (DB+media/large-volume, recipe #4) full 3-stage green on host: install 2✓ (client API + versions JSON), upgrade 1✓ (postgres marker survives), backup 1✓ — exercises the recipe's pg_backup.sh DB-dump hook (not a plain volume copy). No harness surgery. Drone canonical run = build #51 success (~10.5m, all 3 stages green, clean teardown).
  • lasuite-docs (multi-service + S3/MinIO, recipe #5) full 3-stage green on host: install 2✓ (9-service stack converges + SPA + Playwright), upgrade 1✓ (postgres marker survives), backup 1✓ (pg_backup.sh hook). Fixed deploy timeout (cold-pull of ~9 images > abra 300s) via TIMEOUT=900 EXTRA_ENV; OIDC config-only so starts healthy w/ placeholder. Drone canonical run = build #57 success (all 3 stages green, clean teardown).
  • n8n (workflow automation, recipe #6 — bluesky-pds swapped out per DECISIONS) full 3-stage green on host: install 2✓ (/healthz + Playwright editor), upgrade 1✓ (marker in /home/node/.n8n survives), backup 1✓ (backupbot.backup.path file backup). Drone canonical run = build #63 success (~5.5m, all 3 stages green, clean teardown).
  • Re-verify keycloak backup post set_env fix (build #39 ran off an earlier backupbot deploy)
  • Gate: M6.5 — recipes 3→6 three-stage green → CLAIMED 2026-05-27. All 6 D10 recipes have a full 3-stage green run (host + canonical Drone): custom-html, keycloak(#39), cryptpad(#46), matrix-synapse(#51), lasuite-docs(#57), n8n(#63). All 5 categories covered; D5 no-harness-surgery held (per-recipe tests// + recipe_meta EXTRA_ENV only). Awaiting Adversary.
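
The `set_env` bug noted for cryptpad (an assignment glued onto a trailing comment) is the classic result of appending to a file whose last line has no newline. A newline-safe variant might look like the sketch below. It works on the file's text rather than on disk, it is not the harness's implementation, and the repository value in the usage note is a made-up placeholder.

```python
def set_env(text: str, key: str, value: str) -> str:
    """Return env-file text with `key` set to `value`.

    Replaces an existing `key=` assignment in place, otherwise appends one.
    """
    lines = text.splitlines()
    assignment = f"{key}={value}"
    for i, line in enumerate(lines):
        if line.startswith(f"{key}="):
            lines[i] = assignment
            break
    else:
        lines.append(assignment)
    # Rejoining line by line guarantees the assignment sits on its own line,
    # even if the original text ended in a comment without a trailing newline.
    return "\n".join(lines) + "\n"
```

For example, `set_env("# backups", "RESTIC_REPOSITORY", "/placeholder/repo")` yields the comment and the assignment on separate lines, whereas naive string concatenation would produce the single comment line `# backupsRESTIC_REPOSITORY=/placeholder/repo`.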

M7 — Secrets hardening (D6)

  • Full sops model + rotation doc (docs/secrets.md: 3 classes, decryption chain, rotation per class) + log redaction filter (run_recipe_ci masks /run/secrets/* values in stage output, live-streaming preserved). Adversary leak scans clean (baseline + recipe-CI logs).
  • Gate: M7 — secret-grep finds nothing → CLAIMED 2026-05-27. No-plaintext: harness never prints secrets, abra doesn't echo generated ones, reconciles redirect secret-gen to /dev/null, dashboard shows status only; redaction filter as belt-and-suspenders. Awaiting Adversary (re-grep published logs + dashboard; optionally follow a rotation procedure).
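
A redaction filter that keeps live streaming can be written as a generator that masks each line as it arrives. The sketch below assumes the secret values have already been read (for instance from the files under /run/secrets/) and that secrets do not span multiple output lines. The names `redact_stream` and `MASK` are hypothetical, not taken from `run_recipe_ci`.

```python
from typing import Iterable, Iterator

MASK = "***"


def redact_stream(lines: Iterable[str], secrets: Iterable[str]) -> Iterator[str]:
    """Yield each line as soon as it is read, with every known secret value masked."""
    # Longest first, so a secret that contains another secret is masked whole.
    # Empty values are skipped: replacing "" would insert the mask between every character.
    ordered = sorted({s for s in secrets if s}, key=len, reverse=True)
    for line in lines:
        for secret in ordered:
            line = line.replace(secret, MASK)
        yield line
```

Because it is a generator, wrapping a subprocess's stdout iterator with `redact_stream` does not buffer the stage output; each line is emitted as soon as it has been masked.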

M8 — Dashboard (D7)

  • Overview page + badges: dashboard/dashboard.py + modules/dashboard.nix — live at ci.commoninternet.net/, lists the 6 recipes w/ pass/fail/running badges + run links, plus /badge/.svg. Verified via gateway; /hook still routes to bridge. (content-hash image tag so the swarm service rolls on code change.)
  • PR-comment outcome reflection: bridge updates its run comment with final pass/fail (needs a Drone build-completion hook or bridge status poll). Currently posts start/run-link only.
  • [idea] give the bridge image the same content-hash tag (latent :latest no-roll issue)
  • Gate: M8 — overview matches reality; outcomes mirrored
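
The content-hash image tag used for the dashboard (and proposed for the bridge) can be derived by hashing the files that go into the image. The following is a rough sketch: `content_tag`, the 12-character length and the choice of SHA-256 are all assumptions, since the backlog does not say how the tag is computed.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def content_tag(paths: Iterable[str], length: int = 12) -> str:
    """Derive a short image tag from file contents, so any code change yields a new tag."""
    digest = hashlib.sha256()
    # Sort so the tag does not depend on the order in which files are listed.
    for path in sorted(paths):
        digest.update(path.encode())  # include the path as given, so renames change the tag
        digest.update(b"\0")
        digest.update(Path(path).read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()[:length]
```

Tagging the image as, say, `dashboard:<content_tag>` instead of `:latest` means the service spec changes whenever the code does, which is what makes the swarm service roll.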

M9 — Reproducibility + docs (D8/D9)

  • docs/install.md from-scratch rebuild; all docs complete
  • Gate: M9 — Adversary rebuilds from docs on throwaway host

M10 — Proof (D10)

  • All six recipes green via real !testme PRs; flip STATUS to DONE

Adversary findings

  • [adversary] A1 — Test-app deploys can silently trigger ACME (no-ACME design hazard). CLOSED @2026-05-27T00:35Z by Adversary re-test. runner/harness/lifecycle.deploy_app calls abra.env_set(domain, "LETS_ENCRYPT_ENV", "") before every deploy. Verified on a live harness app (cust-c95a69): env LETS_ENCRYPT_ENV= empty, no certresolver label, 0 ACME log lines, and the served cert is the wildcard CN=*.ci.commoninternet.net (verify ok) — not a per-host ACME cert. No-ACME holds for harness deploys. (Structural belt-and-suspenders — dropping the unused certificatesResolvers from traefik — remains a nice-to-have, tracked under A3/M7, not required to close A1.)

  • [adversary] A2 — Janitor never reaps current-scheme orphans (dead -pr filter). Found during M4 review. harness.lifecycle.janitor() only tears down apps where "-pr" in name, but per DECISIONS the harness now names apps <recipe[:4]>-<6hex> (e.g. cust-c95a69) — no -pr substring. So the run-start crash-recovery sweep (§4.3: "nuke any orphaned *-pr* apps") matches nothing and is effectively a no-op. The happy-path finalizer in conftest.deployed_app does work (observed: cust-e084bd from a prior run was torn down), but a run that crashes/reboots before the finalizer runs leaves an orphan that no later run will reap. Fix: match the actual naming (e.g. regex ^[a-z]{1,4}-[0-9a-f]{6}\. or a dedicated CI label/prefix) and gate on age. Re-test: deploy a harness app, simulate a crash (kill the run before teardown), then start a new run and confirm janitor reaps the orphan. Adversary closes after re-test. Re-test progress @2026-05-27T05:00Z (fix b7a2d70): the reaping mechanism is verified — janitor now matches the real naming via RUN_APP_RE (^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci…, matches cust-c95a69) AND reconstructs .env-gone orphans from orphaned service names (regex matches my synthetic advx-aaaaaa_ci_commoninternet_net_app), with an age gate to spare concurrent runs, then reaps via teardown_app (verified clean under A3). Still pending: one live janitor() end-to-end sweep — needs CCCI_JANITOR_MAX_AGE=0, which would also reap the Builder's live apps, so it must run on an idle host. Will close then.

  • [adversary] A3 — Teardown is unverified/best-effort; a failure silently orphans + run stays green. CLOSED @2026-05-27T05:00Z by Adversary re-test of the Builder's fix (commit b7a2d70). teardown_app now: undeploy → if the service persists, docker stack rm fallback (needs no .env) → remove volumes/secrets by stack name (retry loop) → drop .env LAST → verify _residual() and raise TeardownError if anything remains. Empirical worst-case test: I docker stack deploy-ed a synthetic orphan advx-aaaaaa_ci_commoninternet_net (service + volume + network, no .env — exactly the crash-orphan that defeated the old code), then called lifecycle.teardown_app("advx-aaaaaa.ci.commoninternet.net") → returned OK (verify passed) and afterwards services/volumes/networks = 0. So a .env-less orphan is fully reaped and teardown is now verified (would raise on residual). Original finding below. Found during M4 review (to confirm empirically with a kill-mid-run probe). lifecycle.teardown_app runs every abra call with check=False and "never raises"; the conftest finalizer never asserts teardown succeeded. Worse, abra.app_config_remove deletes the app .env unconditionally, even if abra.undeploy failed first — leaving the swarm service+volume running but with no .env, so the app can no longer be managed/undeployed via abra (and a fixed janitor that shells abra app undeploy couldn't reap it either). Net: a partial teardown leaves a silent orphan while pytest still reports the run green, so the M4/D2 guarantee "no orphaned app/volume afterward" is not actually verified by the harness. Fix: assert post-teardown that the stack/services/volumes/secrets are gone (fail the run otherwise); only remove the .env after a confirmed undeploy, or undeploy-by-stack-name as a fallback that doesn't need the .env. Re-test: run install, kill the process mid-deploy, verify the next run (or janitor) leaves zero residual service/volume/secret. Adversary closes after re-test.

  • [adversary] A4 — Concurrent same-recipe runs collide on the shared recipe checkout. CLOSED @2026-05-27T03:13Z — mitigated by the runtime concurrency cap. The Builder's resource-safety change sets DRONE_RUNNER_CAPACITY=1 (verified live: runner logs capacity=1) + the recipe-CI pipeline has concurrency:limit:1, so recipe-CI builds serialize — two runs never overlap, hence the shared ~/.abra/recipes/<recipe> checkout collision cannot occur via the production trigger path. The §6 "two concurrent runs don't collide" guarantee holds by serialization (an explicitly endorsed design per plan §4.2). Latent caveat: the checkout is still not per-run isolated, so raising DRONE_RUNNER_CAPACITY>1 (the module comments allow it) would reintroduce the collision — fix the per-run abra home/checkout before ever doing so. (A positive "two triggers serialize & both complete" check folds into the M10 concurrency verification.) Found by review (M6 verify); to confirm empirically. Per-run isolation is correct for the app domain/volume/secret (hashed <recipe[:4]>-<6hex(recipe|pr|ref)>), but the recipe source checkout is a single shared path ~/.abra/recipes/<recipe>: run_recipe_ci.fetch_recipe does rm -rf ~/.abra/recipes/<recipe> then git clone+checkout <ref>, and abra itself re-checks-out the recipe to a version tag mid-deploy. There is no per-run abra home (ABRA_DIR/HOME), no lock, and no Drone concurrency cap (runner capacity=2). So two concurrent runs of the same recipe at different refs (e.g. !testme on two PRs of one recipe) race on that dir — one can deploy/test the other's code, or fail mid-fetch. (Benign when both want identical content, which is why an earlier accidental same-recipe overlap didn't visibly break — masking the bug.) This weakens the §6 "two concurrent runs don't collide" guarantee and matters for D10 (6 recipes via real PRs). 
Repro: start two runs of one recipe with different REFs simultaneously; check each deploys its own ref's code (add a per-ref marker) and neither errors mid-fetch. Fix: per-run abra home/recipe dir (e.g. ABRA_DIR=$(mktemp -d) or ~/.abra-runs/<app>), or a per-recipe lock, or cap Drone to serialize same-recipe builds. Adversary confirms + closes after re-test.
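
The naming scheme and the janitor match discussed in A2 and A4 can be tied together in a short sketch. Several details are assumptions: the document does not say which hash produces the six hex characters (SHA-256 is used here), the regex is only the prefix of the `RUN_APP_RE` quoted in A2 (whose tail is elided there), and `reap_candidates` with its age dictionary is a hypothetical stand-in for however the janitor learns app ages.

```python
import hashlib
import re
from typing import Dict, List

# Assumption: prefix of the RUN_APP_RE quoted in A2; the original pattern continues further.
RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.")


def app_name(recipe: str, pr: str, ref: str) -> str:
    """Build <recipe[:4]>-<6hex> from the run's identity."""
    # Assumption: SHA-256 over "recipe|pr|ref"; the document does not name the hash.
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}"


def reap_candidates(app_ages: Dict[str, float], max_age: float) -> List[str]:
    """Return run-app domains that match the naming scheme and are at least max_age seconds old."""
    return sorted(
        domain
        for domain, age in app_ages.items()
        if RUN_APP_RE.match(domain) and age >= max_age
    )
```

The old filter, `"-pr" in name`, is false for a name like `cust-c95a69`, which is why the sweep matched nothing. With `max_age=0` every matching app qualifies, including apps from a run still in progress, which is the reason the end-to-end sweep needs an idle host.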