cc-ci/machine-docs/DECISIONS.md
autonomic-bot de6103d41d claim(2pc): PC1 conservative prune deployed+verified; PC2/PC3 local-store cache confirmed
ci-docker-prune (gated surgical prune) live on cc-ci: old autoPrune --all gone, new timer
enabled (daily), no-ops below 80% disk keeping the local image cache, never --all/--volumes.
Daemon stays PAT-authenticated (nptest2); /var/lib/docker retained across rebuild. PC3 proof:
redis:7-alpine deploy->teardown(service rm, image retained)->redeploy = "Image is up to date",
no layer re-download (cold 5303ms -> warm 674ms). Docs: runbook "Image cache & prune policy",
warm.md, DECISIONS Phase-2pc, IDEAS (registry pull-through cache deferred + revisit trigger).
Gate 2pc CLAIMED, awaiting Adversary cold-verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:42:36 +01:00

DECISIONS — cc-ci Builder

Architecture decisions and dead-ends. One line of rationale each. (§0, §8)

Settled

  • Wildcard TLS: operator pre-issues wildcard cert at /var/lib/ci-certs/live/; Traefik file provider serves it; no ACME for commoninternet.net. (Plan §4.0/§8 — fixed.)

  • Repo: git.autonomic.zone/recipe-maintainers/cc-ci, private. Bot is org admin. (Bootstrap.)

  • Git credentials: helper script in repo-local git config sources /srv/cc-ci/.testenv at call time — no secret values stored in .git/config or commits.

  • Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26, overrides plan §3 modules/traefik.nix). Instead of a hand-rolled Traefik we deploy the canonical Co-op Cloud traefik recipe via abra in wildcard / file-provider mode, for end-to-end fidelity (canonical web/web-secure entrypoints + proxy/swarm conventions every recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO DNS token on the box:

    • WILDCARDS_ENABLED=1 + append compose.wildcard.yml; the pre-issued cert is fed as the ssl_cert/ssl_key swarm secrets (v1) via abra app secret insert … -f from /var/lib/ci-certs/live/{fullchain,privkey}.pem. The file provider serves it (tls.certificates).
    • LETS_ENCRYPT_ENV= empty on the traefik app and on every test app → the recipe's tls.certresolver=${LETS_ENCRYPT_ENV} label resolves to no resolver → routers serve the wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
    • Reproducibility (D8): scripts/deploy-proxy.sh is idempotent (ensures local abra server, fetches recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in docs/install.md. The custom modules/traefik.nix was removed; modules/swarm.nix keeps swarm init + proxy net + firewall 80/443.
    • Renewal (manual, ~90d): operator re-issues the wildcard at the same paths, then abra app secret rm traefik.ci.commoninternet.net ssl_cert -n + re-insert at a new version (bump SECRET_WILDCARD_CERT_VERSION) and redeploy. (Documented in docs/secrets.md at M7.)
    • abra teardown syntax (for harness, §4.3): abra app undeploy <d> -n, abra app volume remove <d> -f -n, abra app secret remove <d> --all -n. None take --chaos.
  • Infra bring-up = idempotent-reconcile systemd oneshots — SETTLED (M2, orchestrator steer 2026-05-26). Every piece of swarm infra that abra deploys (traefik modules/proxy.nix, Drone modules/drone.nix, later comment-bridge + dashboard) is a systemd.services.<x> with Type=oneshot + RemainAfterExit, after/requires swarm-init + docker, wants network-online, wantedBy multi-user, embedding its script via pkgs.writeShellApplication (self-contained in the store, not a /root/cc-ci path). The script reconciles (inspect → converge → no-op if correct) on every activation/boot — no run-once sentinel — so it self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit) on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to git clone + nixos-rebuild switch + operator preconditions, no manual post-steps. The old scripts/deploy-*.sh were folded into these modules and removed. pkgs.abra is provided via an overlay (modules/packages.nix) so all modules share the one pinned build.

    • Cert rotation note: the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the wildcard means bumping SECRET_WILDCARD_*_VERSION (operator) so the next reconcile re-inserts. Documented in docs/secrets.md at M7.
  • Trigger: POLLING primary, webhook optional — SETTLED (orchestrator design change 2026-05-27, supersedes the earlier "keep webhook, do NOT pivot to polling" steer). Hard constraint: the bot/server runs at READ level, never repo-admin, and never self-registers a webhook.

    • Polling is PRIMARY and the source of truth for D1. The bridge polls each enrolled repo's open PRs for new !testme comments every POLL_INTERVAL (30s ≤ 60s). Outbound (cc-ci → git.autonomic.zone, the reliably-working direction), needs only read+comment. On startup the first poll marks pre-existing comments seen so it doesn't fire on old comments.
    • Webhook is an OPTIONAL push optimization. The /hook endpoint stays live (HMAC-verified) so an admin-registered issue_comment webhook lowers latency, but the bridge never registers one. Manual registration is documented in docs/enroll-recipe.md. Both paths share an in-memory seen-set keyed by comment id → a comment seen by both fires at most once.
    • Commenter authorization via org membership (read-level, no admin). Allowed iff GET /orgs/{owner}/members/{user} → 204 (verified 2026-05-27: admits bot/trav/notplants, 404 for a non-member, works with bot read-level basic-auth) or the user is in the optional AUTH_ALLOWLIST. Replaces the earlier /collaborators/{user}/permission check, which needs repo-admin. Fail-closed on any error.
    • Enrollment = add the repo to the bridge POLL_REPOS csv + ensure tests/<recipe>/ exists. No webhook required for CI to work. (The root cause of the old webhook non-delivery no longer matters because polling makes it irrelevant: the operator was whitelisting ci.commoninternet.net in Gitea's ALLOWED_HOST_LIST, but D1 no longer depends on that.)
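
The shared seen-set described above (polling and webhook both keyed by comment id, first poll only marking old comments) could look roughly like the following. The class and method names are hypothetical; the bridge's real data structure is not shown in this document.

```python
class SeenSet:
    """In-memory dedup of trigger comments, keyed by comment id (sketch)."""

    def __init__(self) -> None:
        self._seen: set[int] = set()
        self._primed = False  # becomes True after the first poll

    def poll(self, comment_ids: list[int]) -> list[int]:
        """Return the ids that should fire for this poll cycle."""
        fresh = []
        for cid in comment_ids:
            if cid not in self._seen:
                self._seen.add(cid)
                fresh.append(cid)
        if not self._primed:
            # Startup: pre-existing comments are marked seen but never fire.
            self._primed = True
            return []
        return fresh

    def webhook(self, comment_id: int) -> bool:
        """Return True if a webhook-delivered comment should fire."""
        if comment_id in self._seen:
            return False
        self._seen.add(comment_id)
        return True
```

Because both paths write to the same set before firing, a comment delivered by the webhook and later returned by a poll (or the reverse) triggers at most one build.
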
  • Resource safety: bound live test apps — SETTLED (orchestrator design change 2026-05-27, plan §4.2/§4.3). Do NOT keep multiple test apps deployed at once. Three layers, all configurable:

    • MAX_TESTS = DRONE_RUNNER_CAPACITY = 1 (modules/drone-runner.nix, maxTests let-binding). Drone runs at most MAX_TESTS builds at once and auto-queues the rest in its native pending queue — no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
    • Per-build TIMEOUT = 60 min (modules/drone.nix, buildTimeoutMinutes; reconciled best-effort via PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60} using the bridge's Drone admin token, local --resolve, non-fatal). A build over the limit is cancelled by Drone → the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue once a test finishes OR times out".
    • Teardown + janitor backstop. Each build deploys → runs the 3 stages → undeploys (guaranteed try/finally in conftest/orchestrator). A SIGKILL'd/timed-out build can't run its own teardown, so the run-start janitor (lifecycle.janitor, called before every deploy in both fixtures + run_recipe_ci) reaps orphaned run apps as the backstop. At capacity=1 the CI path will set CCCI_JANITOR_MAX_AGE=0 (reap any orphan immediately — safe with no concurrent runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default 2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
    • Optional concurrency: {limit: 1} in the recipe-CI .drone.yml is a redundant belt — primary mechanism is DRONE_RUNNER_CAPACITY. (Wired when the recipe-CI pipeline lands — see backlog.)
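
The janitor's age rule above can be illustrated with a small selection function. This is a sketch under stated assumptions: `RunApp`, its fields, and `orphans_to_reap` are hypothetical names, and the age threshold is assumed to be expressed in seconds (the document only gives `0` and a default of 2h).

```python
from dataclasses import dataclass

DEFAULT_MAX_AGE_SECONDS = 2 * 60 * 60  # the 2h default mentioned above


@dataclass(frozen=True)
class RunApp:
    domain: str
    created_at: float  # epoch seconds (assumed representation)


def orphans_to_reap(
    apps: list[RunApp],
    now: float,
    max_age_seconds: float = DEFAULT_MAX_AGE_SECONDS,
) -> list[RunApp]:
    """Select run apps old enough to be treated as orphans.

    With max_age_seconds == 0 every leftover run app is selected, which is
    only safe when no other run can be in flight (capacity == 1).
    """
    return [a for a in apps if now - a.created_at >= max_age_seconds]
```

With capacity above 1, a threshold of 0 would also select a live concurrent run, which is why the age-based default has to stay in that configuration.
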
  • D10 recipe #6: bluesky-pds (TLS-passthrough) SWAPPED → n8n — SETTLED (2026-05-27, plan §4.0 sanctions this swap-with-reason). bluesky-pds routes via a Traefik TCP router with tls.passthrough=true to an in-container caddy that terminates TLS itself and obtains its own cert via ACME. cc-ci's design is the opposite: the operator gateway passes wildcard TLS through to cc-ci's Traefik, which terminates it with the pre-issued static wildcard cert, and ACME is hard-forbidden for commoninternet.net (no DNS token on the box — §4.0/§9). Serving bluesky-pds would require either (a) ACME inside caddy (forbidden), or (b) injecting the wildcard cert into caddy + a per-host TCP-passthrough router on cc-ci Traefik (recipe-internal surgery + a bespoke proxy mode — not a clean shared-harness absorb). This is a genuine design conflict, not a harness gap. Per the plan's explicit allowance, bluesky-pds is a documented non-CI'd recipe (reason here), and n8n takes the 6th slot. The 5 required D10 categories are already covered by recipes 1–5 (simple=custom-html, single-DB+SSO=keycloak, stateful/no-DB=cryptpad, DB+media/large-volume=matrix-synapse, multi-service+S3/object-storage=lasuite-docs); n8n adds a 6th real deployable app (workflow automation) behind the normal terminate-at-Traefik path.

  • Docker Hub rate limit + mid-breadth prune — FINDING (2026-05-27). D10 real-!testme breadth runs exhausted Docker Hub's anonymous pull rate limit (lasuite-docs, 9 images, upgrade stage: toomanyrequests). Two lessons: (1) registry pull creds are an A1 operator input needed for reliable heavy-recipe deploys under load (request + sops-store + wire into docker daemon). (2) Don't docker image prune -af mid-breadth — it evicts cached recipe images and forces re-pulls that hit the limit. The first lasuite failure was disk pressure (90% full); pruning fixed disk but triggered re-pulls → rate limit. Better: rely on the daily autoprune, prune only dangling (not -a) between runs, or grow disk so heavy images stay cached. Net for D10: 5/6 recipes green via real !testme; lasuite-docs gated on the rate limit (transient ~hours; durable fix = creds).

Open (defaults from §8, to confirm as reality lands)

  • Deploy mechanism — SETTLED (M0): nixos-rebuild switch --flake /root/cc-ci#cc-ci run on cc-ci itself, with the repo materialised on the host at /root/cc-ci. Chosen over --target-host/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS proxy (slow/fragile). Atomic rollback preserved by Nix generations (nixos-rebuild --rollback). The switch is launched as a detached transient systemd unit (systemd-run --unit=ccci-rebuild --collect) so it survives a momentary ssh-over-tailscale drop during activation. For the build loop the host copy is synced from the sandbox clone via tar | ssh (rsync absent on host); source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo on a fresh host, then nixos-rebuild switch --flake .#cc-ci).

    • nixpkgs pin: flake pins the exact rev cc-ci already ran (50ab793…) so the first rebuild is a true no-op-then-base. Bump deliberately, never drift.
  • Webhook scope: default per-repo via enroll script.

  • CI engine: Drone (per plan) — kept, with a noted risk. nixpkgs 24.11 has Drone server 2.24.0 but drone-runner-exec is abandoned (unstable-2020-04-19) — the only exec runner Drone ever shipped (upstream archived ~2021). The maintained fork Woodpecker (2.7.3, with NixOS modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific (D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern Drone server (RPC protocol stable). Fallback: if the exec runner proves incompatible/broken, pivot to Woodpecker (coop-cloud ships a woodpecker recipe too) and record it — like the traefik pivot. Re-evaluate at the M2 gate.

  • Drone deployment shape — SETTLED (M2): mirror the traefik pattern. The server is the coop-cloud drone recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by traefik at drone.ci.commoninternet.net, LETS_ENCRYPT_ENV empty → wildcard cert, no ACME), with Gitea SSO (compose.gitea.yml). The exec runner runs as a Nix systemd service on the host (modules/drone-runner.nix) so it can drive host abra/swarm (plan §4.2). One generated DRONE_RPC_SECRET is shared: inserted as the server's rpc_secret swarm secret AND read by the runner from sops. Reproducible deploy: scripts/deploy-drone.sh.

    • Gitea OAuth app cc-ci-drone created under the bot (client_id ab4cdb9d-ee96-4867-875f-87384505fc52, redirect https://drone.ci.commoninternet.net/login); client_secret + rpc_secret stored sops-encrypted in secrets/secrets.yaml (A2 internal secrets).
  • Drone runner type: exec (must drive host abra).

  • Secret tool — SETTLED (M0): sops-nix. cc-ci decrypts at activation using its ed25519 SSH host key as the age identity (sops.age.sshKeyPaths), so no extra key file to manage on the box. Recipients in /.sops.yaml: the host age key (age1h90ut…, from ssh-to-age) + an off-box master recovery key (age1cmk26t…; private half only at /srv/cc-ci/.sops/master-age.txt on the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing plaintext into secrets/<f>.yaml then sops -e -i (run inside the repo so .sops.yaml is found).

  • D10 recipe set: lock six early. Candidates favouring already-mirrored: custom-html (simple), cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3), bluesky-pds (TLS-passthrough) — covers all five categories. Confirm during M4–M6.5.

  • Per-run app domain scheme — adapted (M4, deviates from plan §4.0). Plan §4.0 wanted <recipe>-pr<n>-<short-sha>.ci.commoninternet.net, but Docker swarm config/secret names (<stackname>_<resource>_<version>) must be ≤ 64 chars and abra derives <stackname> from the domain (dots→_, hyphens kept). .ci.commoninternet.net alone is 22 chars, so long recipe names + config names overflow 64 (hit with custom-html-pr0-m4demo…_nginx_default_conf_v6 = 66). New scheme: <recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net (e.g. cust-e084bd) — short, unique per run, collision-safe across recipes (full recipe in the hash). Human-readable recipe/PR/ref context lives in the Drone build params + the PR comment, not the (ephemeral) domain.
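
To make the length arithmetic of the per-run domain scheme concrete, here is a minimal sketch. The hash function (SHA-256) and the `|` separator are assumptions, since the document only writes `6hex(recipe|pr|ref)`; the function names are hypothetical, so the output will not necessarily reproduce the `cust-e084bd` example.

```python
import hashlib

BASE_DOMAIN = ".ci.commoninternet.net"  # 22 characters
SWARM_NAME_MAX = 64  # limit on <stackname>_<resource>_<version>


def run_domain(recipe: str, pr: str, ref: str) -> str:
    """<recipe[:4]>-<6hex>.ci.commoninternet.net; the hash input is assumed."""
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}{BASE_DOMAIN}"


def stack_name(domain: str) -> str:
    """abra-style stack name as described above: dots become '_', hyphens stay."""
    return domain.replace(".", "_")


def resource_name(domain: str, resource: str, version: str) -> str:
    """Swarm config/secret name for one resource of the app."""
    return f"{stack_name(domain)}_{resource}_{version}"
```

Under this scheme the stack name is at most 33 characters for any recipe name, leaving 31 characters for `_<resource>_<version>`.
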
  • abra recipe checkout is volatile — harness uses chaos+offline + a tests/ snapshot (M6). Many abra commands (app ls, secret generate without flags, version resolution) silently git checkout <version-tag> in ~/.abra/recipes/<recipe>, discarding a PR branch's files. To test the PR head code (not a re-resolved tag): (1) fetch_recipe clones the mirror branch/ref (private → bot token via per-command http.extraHeader, never persisted/logged); (2) all harness abra calls that touch the recipe pass -C (chaos: use current checkout) -o (offline: no remote fetch); (3) recipe-shipped tests/ (D4) are snapshotted to a temp dir right after fetch, since later abra commands still reset the checkout — the recipe-local stage runs from the snapshot.

Risks

  • Disk — RESOLVED 2026-05-26. Original 8.9 GiB root had only ~3.8 GiB free and a hard inode ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on inodes before bytes. Operator grew the VM to 28 GiB (22 GiB free, 1.78M inodes / 1.21M free); the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown + periodic docker image prune to avoid regressing during M6.5 breadth.

Dead-ends

  • (none yet)

Phase 1c (full reproducibility + genuine D8 live rebuild) — 2026-05-27

  • Secrets linkage = git SUBMODULE (deviates from plan §7 flake-input default). cc-ci-secrets is mounted as a submodule at cc-ci/secrets/ rather than a flake inputs.secrets. Rationale: a private flake input must be re-fetched at every nix eval, requiring the bot token persistently in nix config/netrc on cc-ci AND the throwaway VM (a token in the store/config = a 2nd out-of-band secret, which 1c forbids). A submodule makes secrets/secrets.yaml a plain path in the working tree → defaultSopsFile = ../secrets/secrets.yaml is unchanged (minimal diff, trivially byte-identical), and the only credential use is the one git clone --recursive at provisioning ("the two repos are given", Mission §1). Build invocation becomes nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci' so the submodule tree is included. (Revisit if ?submodules=1 proves unreliable on cc-ci's nix version.)
  • Bootstrap key for the throwaway VM = the existing RECOVERY (master) age key, via sops.age.keyFile. The recovery key (age1cmk26…, private at /srv/cc-ci/.sops/master-age.txt) is already a sops recipient, so a fresh host with a different ssh host key still decrypts every secret with no re-keying — this is exactly the §0 argument that defeats "host-key binding". Provisioned to the VM at a fixed path (the ONE out-of-band secret). cc-ci itself keeps decrypting via its host key (age.sshKeyPaths); secrets.nix will offer both identity sources. (Per-host re-encrypt is cleaner for a permanent new instance — documented as the alternative, not used for the throwaway test.)
  • Cert into git: wildcard cert+key become sops secrets in cc-ci-secrets, decrypted at activation back to /var/lib/ci-certs/live/{fullchain.pem,privkey.pem} via sops.secrets.<name>.path; proxy.nix keeps reading that path (now sops-sourced, not operator-drop).
  • cc-nix-test final sizing (C6) — SETTLED by operator 2026-05-27: PROMOTE the rebuilt VM. The freshly-rebuilt reproducible VM (the FINAL W5/C4-C5 clean-room throwaway) becomes the canonical cc-nix-test; the operator will repurpose it for a live real-traffic test through the public gateway.
  • C6 teardown OVERRIDE (operator, 2026-05-27): do NOT destroy the FINAL throwaway VM after W5/C4-C5 PASSes — keep it RUNNING; defer its C6 teardown until the operator explicitly says otherwise. This overrides the plan §5/§6 "destroy the throwaway" for that one VM only. All other cleanup proceeds normally (the Builder's first throwaway was already destroyed; RAM accounting holds).

Phase 1b — lint/format tooling (open decisions §6, settled W0)

  • Formatters/linters (RL1): Nix = nixpkgs-fmt (format) + statix (lints) + deadnix (dead code); Python = ruff (lint + format); Shell = shellcheck + shfmt -i 2 -ci; YAML = yamllint. Kept nixpkgs-fmt over alejandra because it was already the repo formatter and devshell tool (no extra churn / restyle of every .nix). All built from the already-pinned nixpkgs via a flake lint devshell (nix develop .#lint) so CI and local use byte-identical tool versions.
  • Lint entrypoint: scripts/lint.sh (check-only by default; --fix auto-applies). The .drone.yml push pipeline runs it via nix develop .#lint --command bash scripts/lint.sh.
  • ruff strictness: select = [E,F,W,I,UP,B,C4,SIM], ignore = [E501] (line length is the formatter's job; only un-splittable strings would trip it). line-length=100, target=py311.
  • Drone lint stage = FAIL (not warn). The codebase is green now, so enforce from here on — an unclean commit fails the lint step. (Resolves the §6 open question.)
  • Python type-checking (mypy/pyright): DEFERRED to IDEAS, not added in 1b. The harness is small and dynamically typed around abra/subprocess JSON; gradual typing is a larger effort than this bounded pass warrants. Revisit if Phase 2's 18-recipe ramp shows type bugs.
  • blocking vs advisory split (§3): treated as in the phase plan — tests-real, Nix-idempotent, no-footguns, no-secrets, log-redaction, harness-DRY = blocking; readability/docs/arch-drift = advisory unless a real plan deviation. Recorded per-finding in REVIEW-1b / BACKLOG-1b.
  • cc-ci self-CI push trigger: the lint stage lives in the event: push pipeline. The Gitea→Drone push webhook on this instance is flaky (last_status: None; documented §4.1) and predates 1b — recipe CI uses polling as primary, but cc-ci's own self-test/lint relies on the push webhook. The lint stage is correctly wired and proven green via the identical nix develop .#lint command; reliably auto-firing it on every push is tracked as a (pre-existing) infra item, not a 1b lint gap.

Phase 1b — repo layout (operator review items RL5/RL6, plan §7)

  • RL5 — all Nix code under nix/. Moved modules/ → nix/modules/ and hosts/ → nix/hosts/. flake.nix/flake.lock STAY at the repo root (entry point) so the build ref #cc-ci and nixos-rebuild --flake '…#cc-ci' are unchanged — only flake.nix's internal ./hosts/cc-ci/configuration.nix → ./nix/hosts/cc-ci/configuration.nix changed. Root-relative refs inside the moved modules were re-based ../X → ../../X (secrets.nix → ../../secrets/, bridge.nix → ../../bridge/, dashboard.nix → ../../dashboard/); configuration.nix's ../../modules/* imports are unchanged (both dirs moved under nix/, so the relative path still resolves). Toplevel is byte-identical (8i3jcad9…) before/after the move — store derivations are content-addressed on the copied file contents, and the module .nix files aren't part of the runtime closure, so relocating folders doesn't change the build. (The operator anticipated a hash change; in practice it's stable, which is even stronger for reproducibility.) Living docs (README, architecture/install/secrets/enroll) + the .drone.yml comment updated to nix/…; append-only history logs left as the record of what was true then.
  • RL6 — protocol files → machine-docs/: DEFERRED to the coordinated end of 1b. Will git mv STATUS*/REVIEW*/JOURNAL*/BACKLOG*/DECISIONS.md into machine-docs/ (README.md STAYS at root — operator decision, it's the human readme, not a protocol file). The live watchdog (launch.sh) reads STATUS-<id>.md/REVIEW-<id>.md at the repo root for handoffs/transition, so this is done LAST, in lockstep with the orchestrator updating launch.sh + restarting the watchdog — not unilaterally and not while a phase transition is pending. The Adversary likewise git mvs its own REVIEW files at the cutover (single-writer rule).

Phase 1b — recorded deviation: no tests/_template/ dir (enroll = copy an existing recipe)

Plan §3's repo layout lists a tests/_template/ "copy-to-add-a-recipe" dir. It was never created (pre-1b; not introduced or removed by 1b) — instead the documented enroll flow in docs/enroll-recipe.md is "copy an existing recipe's tree, e.g. tests/custom-html/…, then adjust recipe_meta.py + the per-recipe test files." This satisfies D5's "small, repeatable, documented operation with no harness surgery" the same way (a concrete recipe is a better starting template than an abstract skeleton that can drift). Recording per the Adversary's RL3 D5 advisory; not a blocker.

Phase 1d — generic test suite + layered overlays (design, 2026-05-27)

SSOT: cc-ci-plan/plan-phase1d-generic-test-suite.md. Resolves the §6 open decisions.

  • Tier model & op/assertion split (the core call). A run is a sequence of TIERS — install, upgrade, backup, restore, custom — each = generic default [overridden by a recipe overlay]. The lifecycle OP (deploy, upgrade, backup, restore) is owned by the shared harness (harness.generic helpers), NOT duplicated in every test file. A tier's test file (generic or overlay) carries the ASSERTIONS and calls the shared op helper. This keeps the op single-sourced (DRY, DG7) and makes deploy-once trivial: only the orchestrator deploys/tears-down.

  • Override (not additive) — Builder's call (plan §6, operator leaned override). For each lifecycle op exactly ONE assertion file runs, by precedence: repo-local tests/test_<op>.py > cc-ci tests/<recipe>/test_<op>.py > generic (tests/_generic/test_<op>.py). A present overlay REPLACES the generic for that op. Invariant: no overlay for an op ⇒ the generic runs (so any recipe is testable with zero config). Repo-local wins same-name collisions (upstream is authoritative, plan §2.5); cc-ci's overlay is the curated fallback until upstream adopts it. Extend-by-composition: an overlay may from harness import generic and call generic.assert_serving(...) / generic.do_upgrade(...) then add its own assertions — so "extend" needs no separate mechanism.

  • Custom (non-lifecycle) test_*.py: ALL discovered from BOTH locations run additively, opt-in (no override, no generic equivalent) — e.g. test_sso.py.

  • Deploy ONCE, mutate in place (operator requirement, DG4.1). The orchestrator deploys the app ONCE, runs all tiers against that single live deployment (install asserts; upgrade does abra app upgrade in place; backup/restore mutate in place; custom asserts), then ONE teardown in finally. No per-tier/per-overlay abra app new/deploy/undeploy. A CCCI_DEPLOY_COUNT counter in lifecycle.deploy_app is asserted == 1 per run (DG4.1 evidence).

  • Deployment-sharing scope & base version (§6 open). One deployment for the whole lifecycle. Base version deployed once = the previous published version when an upgrade tier will run and a previous exists (so upgrade goes previous→target in place), else the target (current/$REF). Recipe with only one published version ⇒ upgrade tier is a clean SKIP (nothing to upgrade from). Standalone generic-install demo (no PR) deploys current.

  • Fail handling across shared tiers (§6 open): install failing (app never serves) fail-fasts the run (later tiers can't meaningfully run on a dead deployment) and they report error/skip; upgrade/backup/restore failures are recorded per-op but do not abort the remaining independent tiers where they can still run. Teardown always runs.

  • Backup-capability detection (DG3, §6 open): auto — scan the recipe's compose*.yml for a backupbot.backup label (verified present in custom-html). recipe_meta.BACKUP_CAPABLE (bool) overrides the auto-detect. Not capable ⇒ backup+restore tiers are N/A (skip), not failures.
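
The backup-capability rule above (auto-detect from the compose files, with an explicit override) might be sketched as follows. This is a simplified illustration: it does a plain substring scan over compose file contents rather than parsing YAML, and the function name and signature are hypothetical.

```python
from typing import Iterable, Optional

BACKUP_LABEL = "backupbot.backup"


def backup_capable(
    compose_texts: Iterable[str],
    override: Optional[bool] = None,
) -> bool:
    """Decide whether the backup/restore tiers apply to a recipe.

    `override` stands in for recipe_meta.BACKUP_CAPABLE: when it is set
    (True or False) it wins over the auto-detect. Otherwise the recipe is
    capable if any compose*.yml content mentions the backupbot label.
    """
    if override is not None:
        return override
    return any(BACKUP_LABEL in text for text in compose_texts)
```

A recipe for which this returns `False` would have its backup and restore tiers reported as skip (N/A) rather than fail.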

  • Custom install-steps hook (DG5, §6 open): a shell hook — tests/<recipe>/install_steps.sh (cc-ci) or repo-local tests/install_steps.sh — run by the orchestrator during the install tier AFTER abra app new + env defaults but BEFORE abra app deploy, with env CCCI_APP_DOMAIN, CCCI_RECIPE, CCCI_APP_ENV (path to the app .env). Chosen over a fixture/declarative field as the simplest thing the harness runs uniformly (can abra app secret insert, set env, seed). Graceful rule: a recipe with NO hook still attempts the generic install; if it genuinely needs a step it FAILS the generic install (reported per-op) — that is correct, not a harness bug.

  • Per-op result vocabulary (Phase-3 feed): pass | fail | skip(N/A) | error. The orchestrator prints a per-op summary line per run (feeds DG6 + Phase-3 level).

  • Discovery layout: cc-ci overlays/custom/hook live in tests/<recipe>/; repo-local in the recipe repo's tests/ (snapshotted after fetch, per the existing volatile-checkout handling). Generic tier files live in tests/_generic/ (assertion-only, use the shared live-deployment fixtures).
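
The Phase 1d override precedence over this layout (repo-local > cc-ci overlay > generic) can be sketched as a pure function over a set of known file paths. The function name mirrors `resolve_op` from the text, but the signature and the `repo-local/` path prefix are hypothetical; Phase 1e later makes the generic assertions additive and gates the repo-local source behind an allowlist.

```python
def resolve_op(op: str, recipe: str, existing: set[str]) -> str:
    """Pick the single assertion file for a lifecycle op (Phase 1d model).

    `existing` is the set of test files known to exist; paths are relative
    and the 'repo-local/' prefix stands for the snapshotted recipe repo.
    """
    name = f"test_{op}.py"
    candidates = [
        f"repo-local/tests/{name}",  # upstream recipe repo wins collisions
        f"tests/{recipe}/{name}",    # cc-ci curated overlay
    ]
    for path in candidates:
        if path in existing:
            return path
    # No overlay for this op: the generic always runs.
    return f"tests/_generic/{name}"
```

The fall-through to `tests/_generic/` is what keeps any recipe testable with zero configuration.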


Phase 1e — generic-harness corrections (HC1–HC4)

Three operator-review corrections to the Phase-1d shared harness, settled here (plan §5).

  • HC2 — repo-local approval allowlist (form/location + workflow). PR-author-controlled code (install_steps.sh, repo-local test_*.py) runs on the CI host with /run/secrets/* present, so it is default-deny. Allowlist file: tests/repo-local-approved.txt (checked into the cc-ci repo, git-auditable). Format: one recipe name per line; # comments + blank lines ignored; a lone * is NOT a wildcard (no global opt-in — every recipe is explicit). Default: empty ⇒ no recipe trusts repo-local code. Discovery (resolve_op/custom_tests/install_steps) consults the repo-local source only when repo_local_approved(recipe) is true; otherwise precedence is cc-ci > generic only and repo-local is discovered-but-not-executed. Workflow: a cc-ci maintainer reviews a recipe's repo-local tests, then adds the recipe name to tests/repo-local-approved.txt in a cc-ci PR — a deliberate, reviewable act. The gate is centralized in discovery.py (one reader) so the unit tests pin it.
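
A minimal sketch of reading the allowlist format described above. The helper names and the two-argument form of `repo_local_approved` are hypothetical (the text only names `repo_local_approved(recipe)`), and treating trailing `# ...` text on a recipe line as a comment is an assumption beyond what the format states.

```python
def parse_allowlist(text: str) -> set[str]:
    """Parse tests/repo-local-approved.txt content into a set of recipe names."""
    approved: set[str] = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank or comment-only line
        if line == "*":
            continue  # a lone '*' is deliberately NOT a wildcard
        approved.add(line)
    return approved


def repo_local_approved(recipe: str, approved: set[str]) -> bool:
    """Default-deny: only explicitly listed recipes trust repo-local code."""
    return recipe in approved
```

An empty file parses to an empty set, so the default state trusts no recipe's repo-local code.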

  • HC3 — generic-by-default opt-out flag (name/granularity + recipe_meta). Generic assertions run additively alongside any overlay by default. Opt-out, in increasing specificity (any one skips): env CCCI_SKIP_GENERIC (truthy ⇒ skip generic for ALL ops), env CCCI_SKIP_GENERIC_<OP> (e.g. CCCI_SKIP_GENERIC_UPGRADE ⇒ skip generic for that op only), and declarative recipe_meta.SKIP_GENERIC = a list of op names (or ["all"]) so the opt-out is per-recipe and visible in git, not a hidden global. Truthy = 1/true/yes/on (case-insensitive). Op-vs-assertion split: a mutating op (upgrade/backup/restore) is performed once by the orchestrator (the harness owns the op); then the generic assertion file (unless opted out) and the overlay assertion file both evaluate the shared post-op state. Op results that an assertion needs (pre-upgrade identity, backup snapshot_id) are passed op→assertions via a run-scoped JSON state file at $CCCI_OP_STATE_FILE (read by harness.generic.op_state()); never logged. Overlays that need to seed pre-op state (data-continuity markers, the backup→restore mutation) ship an optional tests/<recipe>/ops.py with pre_install/pre_upgrade/pre_backup/pre_restore(domain, meta) callables the orchestrator runs before the op (repo-local ops.py is allowlist-gated like other repo-local code). Overlay test_<op>.py files are now assertion-only (they no longer call generic.do_*).
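
The three opt-out layers above could combine as in the sketch below. The environment variable names and truthy values come from the text; the function names and the way `recipe_meta.SKIP_GENERIC` is passed in are hypothetical.

```python
from typing import Mapping, Optional, Sequence

TRUTHY = {"1", "true", "yes", "on"}


def is_truthy(value: Optional[str]) -> bool:
    """Case-insensitive truthiness as described above."""
    return value is not None and value.strip().lower() in TRUTHY


def skip_generic(
    op: str,
    env: Mapping[str, str],
    meta_skip: Sequence[str] = (),
) -> bool:
    """True if the generic assertions for `op` should be skipped.

    `meta_skip` stands in for recipe_meta.SKIP_GENERIC (op names or ["all"]).
    Any one matching layer is enough to skip.
    """
    if is_truthy(env.get("CCCI_SKIP_GENERIC")):
        return True
    if is_truthy(env.get(f"CCCI_SKIP_GENERIC_{op.upper()}")):
        return True
    return "all" in meta_skip or op in meta_skip
```

In a real run `env` would be `os.environ`; taking it as a parameter keeps the rule easy to pin in unit tests.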

  • HC1 — DG4.1 deploy-count vs the in-place chaos upgrade. The upgrade tier now upgrades to the PR head (code under test), not a published tag: deploy the previous published version (base), re-checkout the PR head (recorded as the recipe repo HEAD right after fetch, before any version-tag checkout), then abra app deploy --chaos in place = the upgrade. The deploy-count guard counts abra app new installs only (_record_deploy() fires in deploy_app(), NOT in the chaos redeploy, which calls abra.deploy directly) — so a run is still deploy-count == 1 and the legitimate in-place chaos upgrade is not flagged. Moved assertion (adapted): prev→PR-head may not bump the coop-cloud version label, so assert_upgraded accepts ANY of: version-label change, image change, or a chaos label now present carrying the PR-head commit (a chaos deploy stamps coop-cloud.<stack>.chaos/.chaos-version) — the chaos label IS the proof PR-head was deployed. Non-PR !testme (no SRC/REF): "PR head" = the catalogue current checkout, so upgrade is prev→current — still a genuine move via chaos. (Exact chaos label name verified on the live abra during E2.)
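
The relaxed upgrade check above (any one of three signals counts) might be expressed as follows. `DeployIdentity` and `upgrade_evidence` are hypothetical names, and the sketch assumes the chaos label carries a possibly abbreviated commit hash that is a prefix of the PR-head commit; the exact label contents are not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeployIdentity:
    version_label: str                  # coop-cloud version label value
    image: str                          # image reference of the app service
    chaos_commit: Optional[str] = None  # commit from the chaos label, if any


def upgrade_evidence(
    before: DeployIdentity, after: DeployIdentity, pr_head: str
) -> list[str]:
    """List which of the accepted signals show the deployment moved."""
    evidence = []
    if after.version_label != before.version_label:
        evidence.append("version-label")
    if after.image != before.image:
        evidence.append("image")
    if after.chaos_commit and pr_head.startswith(after.chaos_commit):
        evidence.append("chaos-label")
    return evidence


def assert_upgraded(
    before: DeployIdentity, after: DeployIdentity, pr_head: str
) -> list[str]:
    """Fail unless at least one signal proves the upgrade happened."""
    evidence = upgrade_evidence(before, after, pr_head)
    assert evidence, "no version-label, image, or chaos-label change observed"
    return evidence
```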

Phase 2 — per-recipe test authoring (design, 2026-05-28)

Inherits the Phase 1d/1e shared-deployment + additive-overlay + op/assertion-split model. Phase 2 adds content, not infra, with a few small harness primitives ported from references/recipe-maintainer/utils/tests/helpers.py.

  • Per-recipe layout (per plan §4.1). The cc-ci tests/<recipe>/ dir continues to use the Phase-1d/1e overlays at the top level (test_install.py, test_upgrade.py, test_backup.py, test_restore.py, ops.py, recipe_meta.py, optional install_steps.sh). NEW Phase-2 subdirectories:
    • tests/<recipe>/functional/ — parity-port tests (one per recipe-maintainer tests/*.py) + ≥2 NEW recipe-specific functional tests (P2/P3). Each file is test_*.py (pytest-discoverable); each parity port carries a SOURCE = "recipe-info/<recipe>/tests/<file>" comment near the top so the audit trail is in the file, not just in PARITY.md.
    • tests/<recipe>/playwright/ — browser flows (P6) where the app's UX is a UI flow. Same test_*.py convention; each file imports playwright.sync_api.
    • tests/<recipe>/PARITY.md — required mapping table (P2) with one row per recipe-info parity test: | recipe-maintainer file | cc-ci file | what's verified | status |. A deliberate non-port is a documented row in DECISIONS.md (linked from PARITY.md), not a silent omission.
  • Discovery for the new subdirs. runner/harness/discovery.custom_tests recurses into tests/<recipe>/functional/ and tests/<recipe>/playwright/ (in addition to the top-level glob), so Phase-2 functional tests run as part of the custom stage automatically. Repo-local (HC2) gate still applies if the recipe is approved; otherwise only cc-ci's own functional/ + playwright/ run. The top-level test_install.py/etc. continue to drive the lifecycle overlays — the functional/ + playwright/ files are always custom-stage, never lifecycle (so they don't perform an op; they assert against the post-install live deployment).
  • Vendored helpers in runner/harness/. Capabilities ported from recipe-maintainer/utils/tests/helpers.py (cc-ci is self-contained at runtime and does NOT import recipe-maintainer's workspace, per plan §8 default):
    • harness.http: http_get(url, headers=, timeout=) -> (status, json_or_None), http_post(...), retry_http_get(url, timeout=, **), wait_for_http(url, label, max_wait=), assert_converges(fn, description, max_wait=, interval=). (lifecycle already has several variants, http_fetch/http_get/http_body; harness.http is the canonical Phase-2 HTTP API for tests, and the lifecycle.* helpers stay for infra-level checks.)
    • harness.abra_tty: script -qefc "abra …" /dev/null wrapper for the abra commands that require a TTY (backup/restore/secret/run/logs/lint), used by parity tests that drive abra directly. Lifecycle already exposes typed wrappers; this is for tests that need raw shell-abra.
    • harness.deps: dependency resolver primitive. Reads tests/<recipe>/recipe.toml (requires / test_requires), deploys each declared dep via the same lifecycle.deploy_app + wait_healthy path (so the dep is a real <dep[:4]>-<6hex>.ci.commoninternet.net on the same swarm), persists per-run, tears down with the parent in the orchestrator's finally. Heavy recipes are deployed sequentially; MAX_TESTS/node budget is the cap.
    • harness.sso — OIDC-flow primitive (Q2 deliverable). Given a deployed provider domain and a recipe-defined realm/client/test-user, performs the full "deploy provider → setup realm/client via admin API → obtain access token (password + client-credentials grants) → assert protected API call accepts it" assertion. Reusable by every SSO-dependent recipe (cryptpad, lasuite-*, immich, etc.). Setup scripts ported from recipe-info/<dep>/setup_<provider>_integration.py.
    • harness.data_integrity — backup data-integrity primitive: a recipe-aware "seed a marker → backup → mutate → restore → assert seeded marker survived" helper around lifecycle.exec_in_app / http_get (the recipe chooses the marker mechanism, the helper guarantees the pattern).
  • Run-scoped credentials for SSO/recipe-specific tests (plan §4.4 class-B). Generated secrets (realm/client/test-user passwords, API tokens) persist for the run via the existing runs/<app-name>/ mechanism (Phase 1d). Destroyed at teardown alongside abra secrets/volumes.
  • Recipe-versioned tests (anti-anchoring). Per plan §7.1, tests read versions/endpoints dynamically (the app's own discovery endpoints, env from live_app) — never hardcode published release values. Each functional test file declares the recipe-info SOURCE path it ports from so the Adversary can audit parity cold.
  • Heavy-recipe parking. Drone's MAX_TESTS=1 + per-build timeout already serialize runs; for Phase 2 we DO NOT lift it. Within a single run, the orchestrator deploys deps before the recipe-under-test sequentially (never concurrently) per plan §4.2.
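
The assert_converges(fn, description, max_wait=, interval=) primitive listed under harness.http could look roughly like the sketch below. Only the four-argument signature comes from the list above; the default values, the injectable clock/sleep parameters, and the choice to treat exceptions as "not yet converged" are assumptions made for illustration.

```python
import time


def assert_converges(fn, description, max_wait=60.0, interval=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call fn() until it returns a truthy value or max_wait seconds elapse."""
    deadline = clock() + max_wait
    last_error = None
    while True:
        try:
            result = fn()
            if result:
                return result
        except Exception as exc:  # transient failure (e.g. connection refused)
            last_error = exc
        if clock() >= deadline:
            detail = f" (last error: {last_error!r})" if last_error else ""
            raise AssertionError(
                f"{description} did not converge within {max_wait}s{detail}"
            )
        sleep(interval)
```

A test would pass a small probe, for example a lambda that fetches a URL and returns True once the status is 200, so the polling and timeout logic lives in one place rather than in each test file.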

Phase 2 Q3.4 — cryptpad create-pad deeper test deferral (2026-05-28)

Status: Deferred to Q3.4 follow-up (or Q5 catch-up), with Adversary sign-off pending per plan §7.1.

What's deferred: The "create-an-object + read-it-back" deep test for cryptpad — authenticate-and-create a real pad in the browser, type a uniquely-marked content string, reload the page (retaining the client-side encryption key in the URL fragment), assert the marker survives. This is the canonical create-and-read-back per plan §4.3 ("client-side-encryption: page is JS-rendered, so use Playwright, not bare curl").

Why deferred (the technical reason):

  • CryptPad's pad-creation client-side flow is version-specific. In the recipe under test (10.6.0+5.7.0), visiting /pad/ does NOT auto-inject a fragment-keyed pad URL; CryptPad requires the user to explicitly click a "new rich text" / "new pad" link from the landing page, AND those UI selectors (.cp-apps-grid a, [data-app='pad'], a[href*='/pad/']) are not stable across CryptPad versions.
  • Three attempted drafts during Q3.4 each failed cold on this:
    1. Type + reload + content-survives: contenteditable inside nested iframe with origin mismatch (SANDBOX_DOMAIN).
    2. Direct-/pad/-then-fragment: no fragment ever appeared on this version.
    3. Click-fallback for known app-launch selectors: none of the candidate selectors matched.

The maximal testable subset that IS shipped (P3 floor met):

  • tests/cryptpad/functional/test_health_check.py — parity HTTP 200.
  • tests/cryptpad/functional/test_spa_assets.py — CryptPad branding + canonical asset paths in served HTML. Catches the wedged-server-fallback-page failure mode.
  • tests/cryptpad/playwright/test_pad_create.py: Chromium renders the SPA, asserts brand + canonical asset references + zero non-filtered JavaScript console errors.

The Playwright test exercises the JS pipeline in a real browser (per §4.3 directive); the piece NOT exercised is the user-action-driven pad lifecycle. What's required to lift the deferral: pin a specific CryptPad app-launch contract (CryptPad's source has app-launch URL patterns like /pad/?new=1 on some versions) OR write a Playwright helper that walks the SPA's main menu via a stable accessibility tree (role-based selectors instead of CSS).

Adversary may file F2-N requesting full create-pad coverage; the answer above is the honest technical reason + the maximal subset. Logged here per plan §7.1.


Phase 2 — nested DOMAIN-derived subdomains flattened to single-label wildcard siblings

Decision (settled): When an enrolled recipe routes additional services on nested subdomains derived from DOMAIN (e.g. lasuite-drive MINIO_DOMAIN="minio.${DOMAIN}" + COLLABORA_DOMAIN="collabora.${DOMAIN}"; lasuite-meet LIVEKIT_DOMAIN="livekit.${DOMAIN}"), the recipe's recipe_meta.EXTRA_ENV(domain) MUST override those vars to a single-label sibling under the wildcard (minio-<domain>, collabora-<domain>, livekit-<domain>), NOT the recipe's default <svc>.<domain>.

Why: cc-ci's TLS cert is the operator's pre-issued wildcard *.ci.commoninternet.net (+ bare ci.commoninternet.net) — §4.0/§1.5, renewed out-of-band, no ACME. A wildcard matches exactly one label. The per-run app domain is already one label (lasuite-drive-pr<n>-<sha>.ci.commoninternet.net), so a nested minio.lasuite-drive-pr<n>-<sha>.ci.commoninternet.net is a 2-label name the wildcard does NOT cover → Traefik would serve an invalid cert on that router and the service is unreachable over HTTPS. Re-prefixing with a hyphen keeps it one label (minio-lasuite-drive-pr<n>-<sha> + .ci.commoninternet.net), covered by the same wildcard, routed by Traefik's swarm provider with no cert work and no gateway change (the gateway already passes the whole wildcard, §4.0). We must NOT mint per-host certs / ACME for these (class-A1 boundary, §9).

Scope: purely a per-recipe EXTRA_ENV concern (no shared-harness change). Recipes with no DOMAIN-derived nested subdomains (most) are unaffected.
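
As a sketch of the rule above, the following shows a lasuite-drive style EXTRA_ENV override together with a one-label wildcard check. The dict return type of EXTRA_ENV and the covered_by_wildcard helper are assumptions for illustration, not the harness's actual interface.

```python
WILDCARD_BASE = "ci.commoninternet.net"


def covered_by_wildcard(host, base=WILDCARD_BASE):
    """True if host is the bare base or exactly one label under it."""
    if host == base:
        return True
    suffix = "." + base
    if not host.endswith(suffix):
        return False
    label = host[: -len(suffix)]
    # A wildcard matches exactly one label, so the remainder may not contain a dot.
    return bool(label) and "." not in label


def EXTRA_ENV(domain):
    """Override DOMAIN-derived nested subdomains with single-label siblings."""
    return {
        "MINIO_DOMAIN": f"minio-{domain}",
        "COLLABORA_DOMAIN": f"collabora-{domain}",
    }
```

For a per-run domain such as lasuite-drive-pr<n>-<sha>.ci.commoninternet.net, the hyphen-prefixed values stay one label deep, while the recipe's default minio.<domain> form would fail the check.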

Phase 2 — services_converged treats a replicas: 0 one-shot as converged

Decision (settled): runner/harness/lifecycle.py::services_converged now considers a service converged when cur == want (desired replica count met), removing the prior or want == "0" rejection.

Why: lasuite-drive's minio-createbuckets is declared deploy: {mode: replicated, replicas: 0, restart_policy: {condition: none}} — an on-demand one-shot (scaled up manually only when buckets need (re)creating; it mc mb … then exit 0). docker stack services reports it 0/0. The old check rejected any want == "0" row, so the stack could never report converged → every deploy hung until deploy_timeout. A service AT its desired count (including 0/0) is converged; a service still spinning up shows 0/1 (cur != want) and is correctly not-yet-converged, so the HTTP readiness wait still gates real liveness. Safe for all currently-green recipes (their services are all N/N with N>0; the 0/0 case did not previously occur). Buckets/migrations that the one-shot performs are run on-demand in the recipe's setup_custom_tests.sh (post-deploy), not relied upon for generic-install convergence (the SPA at / serves 200 without them).
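
A simplified sketch of the cur == want rule, assuming the caller has already extracted the REPLICAS column of docker stack services as "cur/want" strings. The function name matches the one above, but this body and the decision to treat an empty service list as not converged are illustrative assumptions.

```python
def services_converged(replica_columns):
    """replica_columns: iterable of 'cur/want' strings, one per service."""
    rows = list(replica_columns)
    if not rows:
        return False  # assumption: no services listed yet means not converged
    for row in rows:
        cur, _, rest = row.strip().partition("/")
        # Tolerate trailing annotations after the desired count.
        want = rest.split()[0] if rest.split() else ""
        if not (cur.isdigit() and want.isdigit()):
            return False
        if int(cur) != int(want):
            return False  # e.g. 0/1: still spinning up
    return True
```

Under this rule a 0/0 one-shot no longer blocks convergence, while a 0/1 row still does.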

2026-05-28 — Docker Hub auth: declarative config.json via sops (rate-limit fix) — SETTLED

Context. Heavy Phase-2 recipe deploys exhausted Docker Hub's anonymous pull rate limit (100/6h per shared IP 68.14.43.142) → toomanyrequests blocked all new deploys. Operator provided a read-only Docker Hub PAT (Class A1 registry creds, plan §1.5): DOCKERHUB_USERNAME=nptest2 + DOCKERHUB_TOKEN in /srv/cc-ci/.testenv. Authenticated pulls = 200/6h per-account.

Decision. Wire it declaratively (survives a 1c NixOS rebuild), not just an imperative login:

  • Secret: secrets/secrets.yaml (cc-ci-secrets submodule, commit cdd5e0a) gains key dockerhub_auth = base64("nptest2:<PAT>") — i.e. the exact auth field docker config.json wants, so the nix template is a pure render (no runtime base64). sops-encrypted to host+master age recipients (edited on cc-ci using its ssh-host-key→age identity via nix shell nixpkgs#sops; plaintext shredded; PAT never committed plaintext nor exposed in process args/logs).
  • Render: nix/modules/secrets.nix adds sops.secrets.dockerhub_auth + a sops.templates."docker-config.json" that renders /root/.docker/config.json (0600, root) at activation. It becomes a symlink to /run/secrets/rendered/docker-config.json.
  • Why /root: the drone exec runner runs pipelines as User=root (drone-runner.nix), and manual deploys ssh in as root — so /root/.docker/config.json covers both the !testme CI path and manual ops. Single config, single user.
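
For reference, the shape of the stored value and of the rendered file can be sketched as follows. This is illustrative Python, not the nix template itself; the helper names are hypothetical, and the registry key shown is the one docker login conventionally writes for Docker Hub.

```python
import base64
import json


def docker_auth_field(username, token):
    """base64("user:token"), the exact value config.json expects in 'auth'."""
    return base64.b64encode(f"{username}:{token}".encode()).decode()


def render_docker_config(auth, registry="https://index.docker.io/v1/"):
    """Render a minimal config.json body from a precomputed auth value."""
    return json.dumps({"auths": {registry: {"auth": auth}}}, indent=2)
```

Because the secret already holds the encoded auth value, the render step is pure string substitution, which is what lets the template avoid any base64 work at activation time.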

Swarm-propagation question — RESOLVED empirically (no --with-registry-auth / pre-pull needed). The operator/Adversary flagged that a node docker login may NOT propagate to swarm SERVICE-task pulls. Tested on cc-ci with the authenticated config.json in place:

  • Account ratelimit baseline 197/200 (source = account hash b662dd8b-…, not the IP).
  • Deployed uncached n8nio/n8n:2.20.6 via abra (RECIPE=n8n STAGES=install). The swarm service task pulled it to 1/1 Running with no toomanyrequests.
  • Account counter dropped 197 → 196 (manager manifest resolution) → 195 (agent layer-manifest pull), source still the account hash. So abra's docker stack deploy propagates the cred to the swarm task pull on this single-node swarm — billed to the account, not the anon IP.
  • Corroborating: the earlier lasuite-drive deploy resolved 12 images with no toomanyrequests while anon budget was ≤4 — impossible anonymously → manager resolution is authenticated too.

So: declarative root config.json is sufficient end-to-end here; --with-registry-auth is not required (abra/SDK attaches it). Caveat (Phase 2b): 200/6h may still be tight for a full ~18-recipe sweep; the permanent structural fix is a registry pull-through cache authenticated with this same PAT.


Phase 2w — warm canonical + --quick (2026-05-28)

Stable-domain scheme for warm apps: warm-<recipe>.ci.commoninternet.net. Distinct from cold per-run <recipe[:4]>-<6hex> (naming.app_domain) so a warm app is never confused with a disposable cold run. Live-warm keycloak = warm-keycloak.ci.commoninternet.net; data-warm canonicals (W1) = warm-<recipe>.... Risk to watch: longer stack name vs swarm's 64-char config/secret limit — verified per-recipe on first deploy; shorten the scheme if any recipe's secret name overflows.

Realm is the per-run isolation unit on the shared live-warm keycloak (WC1). Instead of co-deploying a fresh keycloak per dependent run, dependents use the one live-warm keycloak and create a per-run namespaced realm+client+user, deleted at run teardown. Realm name = <parent_recipe>-<6hex> where 6hex is the parent's per-run domain label suffix — unique per (parent, pr, ref) so concurrent dependents never collide, and traceable for debugging. (Was realm=parent_recipe, which would collide across concurrent same-recipe runs.)
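
A small sketch of the realm naming rule, assuming the parent's per-run domain follows the cold <recipe[:4]>-<6hex> label scheme. The realm_name helper and its validation are hypothetical.

```python
HEX_DIGITS = set("0123456789abcdef")


def realm_name(parent_recipe, parent_domain):
    """Build <parent_recipe>-<6hex> from the parent's per-run domain label."""
    label = parent_domain.split(".", 1)[0]   # first DNS label of the domain
    suffix = label.rsplit("-", 1)[-1]        # trailing 6hex of that label
    if len(suffix) != 6 or not set(suffix) <= HEX_DIGITS:
        raise ValueError(f"no 6hex suffix in domain label: {label!r}")
    return f"{parent_recipe}-{suffix}"
```

Two concurrent runs of the same recipe get different per-run labels, so their realms differ, and the realm name can be traced back to the run's domain when debugging.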

Warm keycloak is declarative INFRA, not warm DATA. The live-warm keycloak service is brought up by a Nix systemd-oneshot reconciler (converges to deployed+healthy at the stable domain), exactly like the traefik recipe deploy, so it IS in the D8 reproducibility closure (re-warmable from scratch) and self-heals on activation/boot. Only warm volumes/snapshots (W1+) are cache, excluded from D8. The keycloak's realm data is ephemeral per-run, so there is nothing persistent to exclude.

Live-warm is an optimization layer with a cold fallback. If no warm keycloak is present (e.g. a from-scratch host before the reconciler has run, or the warm app is down), the keycloak dep path falls back to the existing cold co-deploy so dependent runs still work. The warm path is preferred when available.

Phase 2w — design update: unpinned warm/infra + health-gated rollback (2026-05-28/29)

Warm/infra apps (traefik + keycloak) auto-update to LATEST nightly, health-gated (operator). Supersedes the W0.3 pinned kcVersion. Keycloak is now unpinned like traefik: reconciler abra recipe fetch latest + chaos deploy; keep secret-generate-only-if-missing + health-wait. D8 holds because the recipe is fetched at activation (runtime), so the nix store closure is byte-identical regardless of which keycloak version is live.

Snapshot helper (WC3) — format + path. runner/harness/warmsnap.py. A snapshot is a raw tar of each docker volume belonging to the app's stack, taken while the app is undeployed (nothing writing → consistent). Stored under /var/lib/ci-warm/<recipe>/ as <recipe>.snapshot.tar + a <recipe>.meta.json (commit/version/timestamp/volume list). One last-good per app, replaced atomically (write to .tmp then rename). Restore: for each volume, clear _data and untar back. Docker volumes are stack-scoped (<stack>_<vol>); the helper enumerates them via docker volume ls filtered to the stack. Reused by WC1.1 (pre-upgrade snapshot of keycloak) and WC5 (promote-on-green-cold). Warm snapshots are cache, excluded from the D8 closure (WC8).
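
The "write to .tmp then rename" replacement can be sketched as below. The function names are hypothetical, and passing the tar as in-memory bytes is a simplification for brevity (a real snapshot would be streamed to the temporary file).

```python
import json
import os


def atomic_write_bytes(path, data):
    """Write to <path>.tmp, flush to disk, then rename over the target."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)  # atomic when tmp and target share a filesystem


def store_last_good(base_dir, recipe, tar_bytes, meta):
    """Replace the single last-good snapshot and its metadata for a recipe."""
    recipe_dir = os.path.join(base_dir, recipe)
    os.makedirs(recipe_dir, exist_ok=True)
    snap_path = os.path.join(recipe_dir, f"{recipe}.snapshot.tar")
    meta_path = os.path.join(recipe_dir, f"{recipe}.meta.json")
    atomic_write_bytes(snap_path, tar_bytes)
    atomic_write_bytes(meta_path, json.dumps(meta, sort_keys=True).encode())
    return snap_path, meta_path
```

Each file is replaced atomically on its own; the pair is not replaced as one unit, so a reader that needs both to agree would have to check the metadata against the tar it just read.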

Alert mechanism — sentinel files relayed by the Builder loop. The warm/infra reconciler is an autonomous bash systemd unit on cc-ci; it cannot call the agent's PushNotification tool. So a reconciler that rolls back (WC1.1) or holds a major/manual-migration upgrade (WC1.2) writes a JSON alert sentinel to /var/lib/ci-warm/alerts/<ts>-<app>-<reason>.json (fields: app, reason [rollback|held-major|held-manual-migration], from_version, to_version, release_notes, ts). The Builder loop, each wake, scans that dir; for each new alert it (a) issues PushNotification to the operator, (b) records it in STATUS-2w/JOURNAL-2w, (c) archives it to alerts/seen/. This bridges the autonomous reconciler to operator visibility (latency = next Builder wake; acceptable for an alert).
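
The sentinel relay can be sketched as two small functions: one for the writer side and one for the Builder-loop side. The field names and file-name pattern follow the description above; the function names, the integer-epoch ts format, and the notify callback (standing in for PushNotification plus the STATUS/JOURNAL entry) are assumptions. The actual reconciler is described as a bash unit, so the writer here is only a Python illustration of the file format.

```python
import json
import os
import time

ALERT_REASONS = ("rollback", "held-major", "held-manual-migration")


def write_alert(alerts_dir, app, reason, from_version, to_version,
                release_notes="", ts=None):
    """Write one <ts>-<app>-<reason>.json sentinel and return its path."""
    if reason not in ALERT_REASONS:
        raise ValueError(f"unknown alert reason: {reason}")
    ts = int(time.time()) if ts is None else ts
    os.makedirs(alerts_dir, exist_ok=True)
    payload = {"app": app, "reason": reason, "from_version": from_version,
               "to_version": to_version, "release_notes": release_notes,
               "ts": ts}
    path = os.path.join(alerts_dir, f"{ts}-{app}-{reason}.json")
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    os.replace(tmp, path)  # the scanner never sees a half-written sentinel
    return path


def relay_alerts(alerts_dir, notify):
    """Relay each unseen alert via notify(payload), then archive it to seen/."""
    seen_dir = os.path.join(alerts_dir, "seen")
    os.makedirs(seen_dir, exist_ok=True)
    relayed = []
    for name in sorted(os.listdir(alerts_dir)):
        if not name.endswith(".json"):
            continue  # skips the seen/ directory and in-flight .tmp files
        path = os.path.join(alerts_dir, name)
        with open(path, encoding="utf-8") as fh:
            payload = json.load(fh)
        notify(payload)
        os.replace(path, os.path.join(seen_dir, name))
        relayed.append(payload)
    return relayed
```

Archiving only after notify returns means a crash between the two steps re-sends the alert on the next wake rather than dropping it.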

Re-sequence: WC1.1's keycloak rollback needs the WC3 snapshot helper, so build that FIRST, then rewrite the reconciler ONCE into the unpinned + WC1.2-safety-gated + WC1.1-health-gated-rollback form (avoids reworking the reconciler twice). The W0.3 reconciler is INTERIM until then.

Phase 2w — W0.6 reconciler: version model + deploy-by-tag (2026-05-29)

Reconcile entrypoint in Python, packaged in the nix store. runner/warm_reconcile.py, invoked by the systemd unit as ${pyEnv}/bin/python3 ${../../runner}/warm_reconcile.py <app> (the runner/ dir is copied into the store → D8-clean, no dependence on the /root/cc-ci checkout). Reuses warmsnap/sso/abra/lifecycle so there is ONE snapshot impl (also used by the runner for WC5). Replaces the bash reconcile in warm-keycloak.nix.

"latest" = newest published version TAG, deployed pinned (not chaos-of-main). WC1.2's "major recipe-version bump" detection needs comparable versions, which chaos (deploy main HEAD) doesn't give. So the reconciler resolves latest = git tag | sort -V | tail -1 (valid coop-cloud version tags), records current = the app .env VERSION, and deploys the chosen tag pinned (abra app deploy <domain> <version> -o -n -f, after git checkout <tag>). "Auto-update to latest" is satisfied by converging to the newest tag; "chaos" in the operator note is read as "auto-deploy latest", and tag-pinning is the correct mechanism for a version-gated auto-update.

coop-cloud version format is <recipe-semver>+<app-version> (observed), not the plan's <upstream>+<recipe-semver>. Evidence: keycloak 10.7.1+26.6.2 → image keycloak:26.6.2; n8n 3.2.0+2.20.6 → image n8nio/n8n:2.20.6 (the post-+ part is the app image tag). So the recipe semver is the part BEFORE +. WC1.2's "major recipe bump = breaking" keys off the major (first) component of the pre-+ recipe semver (e.g. 3.x→4.0 = held). Secondary signal: scan the target's releaseNotes/<version>.md for manual-migration markers.
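
A sketch of the version model under the observed <recipe-semver>+<app-version> format. The helper names are hypothetical, and latest_tag orders by the recipe-semver tuple only, which approximates rather than reproduces sort -V (tags sharing a recipe semver are not ordered by their app-version part).

```python
import re

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\+(.+)$")


def parse_version(tag):
    """Split '<recipe-semver>+<app-version>' into ((major, minor, patch), app)."""
    match = _VERSION_RE.match(tag)
    if not match:
        raise ValueError(f"not a coop-cloud version tag: {tag!r}")
    return (int(match[1]), int(match[2]), int(match[3])), match[4]


def latest_tag(tags):
    """Newest valid tag by recipe semver, or None if no tag is valid."""
    valid = [t for t in tags if _VERSION_RE.match(t)]
    if not valid:
        return None
    return max(valid, key=lambda t: parse_version(t)[0])


def is_major_bump(current, target):
    """True when the recipe-semver major component increases (held upgrade)."""
    return parse_version(target)[0][0] > parse_version(current)[0][0]
```

Comparing integer tuples rather than strings matters here: a plain string sort would rank 9.x above 10.x.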

Scope order for W0.6: keycloak first (the W0 focus, stateful → snapshot path); apply the same health-gated + safety-gate pattern to traefik (stateless, version-rollback-only) afterward by migrating proxy.nix onto the shared reconcile entrypoint.

Phase 2w — W1 canonical registry design (WC2/WC3) (2026-05-29)

Enrollment is declarative per-recipe via recipe_meta.WARM_CANONICAL = True (consistent with how DEPS/EXTRA_ENV are declared — enrolling a recipe stays a tests/<recipe>/ change, D5). A recipe so flagged gets a DATA-WARM canonical. Prove the model on a couple of recipes (custom-html simplest: stateful, no external DB), NOT all (the nightly sweep populates the rest over time).

Stable domain warm-<recipe>.ci.commoninternet.net (already decided for keycloak; same scheme for canonicals). Distinct from cold <recipe[:4]>-<6hex>. Watch the swarm 64-char secret-name limit per recipe on first deploy.

Known-good state per canonical, under /var/lib/ci-warm/<recipe>/: last_good (version string, already written by warm_reconcile), snapshot/ (warmsnap, W0.5), and a small canonical.json registry record {recipe, domain, version, commit, status, ts}. The DATA VOLUME is retained while the app is undeployed (data-warm). These are cache (excluded from D8, WC8).

Data-warm lifecycle (new runner/harness/canonical.py): is_enrolled(recipe) (reads WARM_CANONICAL), canonical_domain(recipe), read/write_registry(recipe), deploy_canonical(recipe) (deploy warm-<recipe> at last_good, reattaching the retained volume → warm boot), undeploy_keep_volume(recipe) (undeploy, volume retained = idle data-warm), seed_canonical(recipe, version, commit) (record + snapshot; the volume becomes the canonical). LIVE-warm (keycloak, always up) vs DATA-warm (canonicals, undeployed when idle) both use warm-<recipe> + warmsnap.

W1 scope vs W3: W1 builds the registry + data-warm lifecycle and proves it (seed a custom-html canonical → undeploy keep volume → redeploy reattach → data survives; re-warmable from scratch). Automatic promote-on-green-cold (WC5) + nightly (WC6) are W3 — for W1 the canonical is seeded programmatically to prove the model; the cold-advances-canonical wiring comes later.

Phase 2w — W3 WC5 promote-on-green-cold mechanism (2026-05-29)

Promote = re-seed the canonical from a fresh deploy of the green-verified latest (NOT "keep the cold run's per-run volume"). Rationale: a cold run uses a fresh per-run domain <recipe>-<6hex> with a fresh volume (cold stays authoritative + fresh); its volume names are per-run-specific and differ from the canonical's warm-<recipe> volume names, so the per-run volume can't be directly reused as the canonical without a fragile name-remap. AND the cardinal guardrail "never lose the known-good" forbids touching the existing canonical until a new green one is ready.

So: on a run that is enrolled (recipe_meta.WARM_CANONICAL) + GREEN + COLD (not --quick) + on LATEST (no PR head, i.e. REF empty — the nightly/manual-latest run, NOT a PR !testme), AFTER the normal per-run teardown, the orchestrator PROMOTES: deploy warm-<recipe> at latest → wait healthy → undeploy → canonical.seed_canonical(version=latest, commit=head) (snapshot-while-undeployed + atomic registry/snapshot replace). The old known-good is replaced ATOMICALLY only on a green promote (a red run never reaches promote → known-good safe). The canonical's data = a clean install of the green-verified latest (a valid known-good baseline; --quick reattaches + upgrades it). Cost: one extra (canonical) deploy per promote — acceptable for cold/nightly (not latency-sensitive). The FIRST such green run SEEDS the canonical. --quick never promotes (proven W2). Only cold advances (WC5).

Promote gate predicate (unit-tested): is_enrolled(recipe) and overall==0 and not quick and not ref. (not ref = a catalogue-latest run, i.e. the nightly sweep or a manual RECIPE=<r> run — a PR !testme carries REF=PR-head and must NOT advance the canonical to a PR's code.)
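
The predicate is small enough to write out. In this sketch the enrollment lookup is passed in as a boolean rather than calling is_enrolled(recipe), and the function name should_promote is hypothetical.

```python
def should_promote(enrolled, overall, quick, ref):
    """Promote only for an enrolled, green, cold run on catalogue latest."""
    return bool(enrolled) and overall == 0 and not quick and not ref


# Cases mirroring the rules above:
NIGHTLY_GREEN = should_promote(True, 0, quick=False, ref="")
PR_TESTME_GREEN = should_promote(True, 0, quick=False, ref="pr-head")
QUICK_GREEN = should_promote(True, 0, quick=True, ref="")
NIGHTLY_RED = should_promote(True, 1, quick=False, ref="")
```

Only the first case evaluates to True; a PR run, a --quick run, and a red run all leave the existing known-good untouched.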

Phase 2 — heavy-recipe upgrade tier disk constraint (28GB host) — SETTLED finding @2026-05-29

The upgrade tier (HC1: prev published → PR-head via in-place abra app deploy --chaos) cannot complete for recipes whose successive releases bump multi-GB image tags, because the rolling update must hold BOTH versions on disk transiently. Proven on lasuite-drive: onlyoffice 9.2 → 9.3.1.2 (3.94GB each) + collabora two versions → ~10GB office images at once vs ~14GB docker headroom on the 28GB host → 99% → deploy fail. No harness fix is possible (the prev images are running, so they are neither dangling-prunable nor rmi-able when the new must be pulled). install/backup/restore/custom (single version) fit and pass. Resolution = grow the host disk (Class A1 operator input, DEFERRED.md 2026-05-29). Until then, heavy recipes are verified via their maximal testable subset (install+backup+restore+custom) with the upgrade tier flagged as a genuine env-level (disk) blocker per plan §7.1 (Adversary sign-off required). The cleanup runbook for an over-full host: pkill -f run_recipe_ci.py; docker stack rm <leftover>; remove its volumes+secrets; docker image prune -f.

SSO-provider policy (operator, 2026-05-29) — keycloak is the DEFAULT; authentik is NOT a DONE gate

Standing policy for all Phase-2 (and later) recipe OIDC/SSO testing:

  • keycloak is the default SSO provider. Default ALL recipe OIDC tests to keycloak (live-warm WC1).
  • Do NOT test authentik↔keycloak integration, and do NOT enroll authentik merely to "prove pluggability" / second-provider coverage. Phase-2 DONE is NOT gated on authentik.
  • Enroll authentik + add setup_authentik_realm (the provider-pluggable backend in runner/harness/sso.py) ONLY if a recipe genuinely REQUIRES authentik (cannot work under keycloak). If it works with keycloak, use keycloak.
  • cryptpad: its recipe-maintainer upstream SSO test uses authentik, but cc-ci tests cryptpad's OIDC under keycloak (equally valid). Same for any recipe whose upstream happens to use authentik but functions fine under keycloak.
  • The OIDC FLOW primitives (oidc_password_grant, assert_discovery_endpoint) are already provider-agnostic; only realm/client SETUP is provider-specific, and we only need the keycloak setup (setup_keycloak_realm) unless/until a recipe forces authentik. Consequences: DEFERRED #9 (authentik enrollment) re-entry trigger narrowed to "a recipe requires authentik"; F2-7 (authentik backend) is not a DONE blocker. plan-sso-dep-testing.md §6 updated by the orchestrator to match.

Phase 2pc — image-prune policy; local store IS the cache; registry pull-through DROPPED (2026-05-29) — SETTLED

Decision (PC1): removed virtualisation.docker.autoPrune (it ran docker system prune --force --all --filter until=24h daily). The --all evicts every image not used by a running container — between runs no test apps run, so it wiped the cached recipe base images → cold re-pull → Docker-Hub rate-limit churn (JOURNAL-2 507/542/690-693). Replaced with nix/modules/docker-prune.nix: the ci-docker-prune daily timer + oneshot, a surgical triple-gated prune that no-ops unless ALL of (1) / ≥ 80%, (2) no run-app stack live, (3) no swarm service converging; and when it runs prunes only dangling images + stopped containers + dangling build cache, until=24h — never --all (keeps tagged base/in-use images), never --volumes (warm canonical data). Teardown (lifecycle.teardown_app) already removes only services/volumes/secrets/.env, never images — kept. Why: on this single host Docker's own local image store IS the cache — a pulled image stays and redeploys reuse local layers with no re-download (proven: redis:7-alpine cold pull 5303ms w/ 6 layer downloads → after service rm teardown the image is retained → warm redeploy "Image is up to date" 674ms, no bytes); the PAT-authenticated daemon (200/6h) makes the residual warm-deploy manifest check free of rate-limit pressure. So keeping the store recovers ~all the benefit a cache would give.
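
The triple gate can be sketched as a pure predicate plus a disk-usage helper. This is an illustration in Python, not the nix/modules/docker-prune.nix unit; the function names are hypothetical, and used/total from shutil.disk_usage can differ slightly from the percentage df reports.

```python
import shutil


def root_usage_percent(path="/"):
    """Approximate filesystem usage of path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def should_prune(usage_percent, run_app_stack_live, service_converging,
                 threshold=80.0):
    """All three gates must hold, otherwise the timer run is a no-op."""
    return (usage_percent >= threshold
            and not run_app_stack_live
            and not service_converging)
```

Below the 80% threshold the predicate is False regardless of the other two inputs, which is what keeps the local image cache intact on a host with disk to spare.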

Decision (registry pull-through cache): DROPPED here, deferred to IDEAS / Phase 2b (operator scope correction 2026-05-29, mid-phase). A registry:2 pull-through cache's distinctive wins — multi-node fan-out, surviving prune/VM-rebuild on separate storage, cache-miss authentication — don't apply to a single authenticated non-pruning host (one node; co-located cache lost on a recreate anyway; daemon already authenticated). It would add a registry service + daemon-mirror config + cache GC for marginal gain. Revisit ONLY if (a) cc-ci goes multi-node, OR (b) Phase-2b measurement shows cold-deploy pull time is a real bottleneck AND the cache can live on recreate-surviving storage (Incus volume / host b1 path, not the VM's ephemeral disk). No registry code was written (caught during orientation) — nothing to revert.