DECISIONS — cc-ci Builder
Architecture decisions and dead-ends. One line of rationale each. (§0, §8)
Settled
- Wildcard TLS: operator pre-issues wildcard cert at /var/lib/ci-certs/live/; Traefik file provider serves it; no ACME for commoninternet.net. (Plan §4.0/§8: fixed.)
- Repo: git.autonomic.zone/recipe-maintainers/cc-ci, private. Bot is org admin. (Bootstrap.)
- Git credentials: helper script in repo-local git config sources /srv/cc-ci/.testenv at call time, so no secret values are stored in .git/config or commits.
- Proxy: real coop-cloud/traefik via abra. SETTLED (M1, orchestrator decision 2026-05-26, overrides plan §3 modules/traefik.nix). Instead of a hand-rolled Traefik we deploy the canonical Co-op Cloud traefik recipe via abra in wildcard / file-provider mode, for end-to-end fidelity (canonical web/web-secure entrypoints + proxy/swarm conventions every recipe expects; this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO DNS token on the box: WILDCARDS_ENABLED=1 + append compose.wildcard.yml; the pre-issued cert is fed as the ssl_cert/ssl_key swarm secrets (v1) via abra app secret insert … -f from /var/lib/ci-certs/live/{fullchain,privkey}.pem. The file provider serves it (tls.certificates). LETS_ENCRYPT_ENV= is empty on the traefik app and on every test app → the recipe's tls.certresolver=${LETS_ENCRYPT_ENV} label resolves to no resolver → routers serve the wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
  - Reproducibility (D8): scripts/deploy-proxy.sh is idempotent (ensures local abra server, fetches recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in docs/install.md. The custom modules/traefik.nix was removed; modules/swarm.nix keeps swarm init + proxy net + firewall 80/443.
  - Renewal (manual, ~90d): operator re-issues the wildcard at the same paths, then abra app secret rm traefik.ci.commoninternet.net ssl_cert -n + re-insert at a new version (bump SECRET_WILDCARD_CERT_VERSION) and redeploy. (Documented in docs/secrets.md at M7.)
  - abra teardown syntax (for harness, §4.3): abra app undeploy <d> -n, abra app volume remove <d> -f -n, abra app secret remove <d> --all -n. None take --chaos.
- Infra bring-up = idempotent-reconcile systemd oneshots. SETTLED (M2, orchestrator steer 2026-05-26). Every piece of swarm infra that abra deploys (traefik modules/proxy.nix, Drone modules/drone.nix, later comment-bridge + dashboard) is a systemd.services.<x> with Type=oneshot + RemainAfterExit, after/requires swarm-init + docker, wants network-online, wantedBy multi-user, embedding its script via pkgs.writeShellApplication (self-contained in the store, not a /root/cc-ci path). The script reconciles (inspect → converge → no-op if correct) on every activation/boot, with no run-once sentinel, so it self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit) on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to git clone + nixos-rebuild switch + operator preconditions, no manual post-steps. The old scripts/deploy-*.sh were folded into these modules and removed. pkgs.abra is provided via an overlay (modules/packages.nix) so all modules share the one pinned build.
  - Cert rotation note: the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the wildcard means bumping SECRET_WILDCARD_*_VERSION (operator) so the next reconcile re-inserts. Documented in docs/secrets.md at M7.
- Trigger: POLLING primary, webhook optional. SETTLED (orchestrator design change 2026-05-27, supersedes the earlier "keep webhook, do NOT pivot to polling" steer). Hard constraint: the bot/server runs at READ level, never repo-admin, and never self-registers a webhook.
  - Polling is PRIMARY and the source of truth for D1. The bridge polls each enrolled repo's open PRs for new !testme comments every POLL_INTERVAL (30s, i.e. ≤ 60s). Outbound (cc-ci → git.autonomic.zone, the reliably-working direction), needs only read+comment. On startup the first poll marks pre-existing comments seen so it doesn't fire on old comments.
  - Webhook is an OPTIONAL push optimization. The /hook endpoint stays live (HMAC-verified) so an admin-registered issue_comment webhook lowers latency, but the bridge never registers one. Manual registration is documented in docs/enroll-recipe.md. Both paths share an in-memory seen-set keyed by comment id → a comment seen by both fires at most once.
  - Commenter authorization via org membership (read-level, no admin). Allowed iff GET /orgs/{owner}/members/{user} → 204 (verified 2026-05-27: admits bot/trav/notplants, 404 for a non-member, works with bot read-level basic-auth) or the user is in the optional AUTH_ALLOWLIST. Replaces the earlier /collaborators/{user}/permission check, which needs repo-admin. Fail-closed on any error.
  - Enrollment = add the repo to the bridge POLL_REPOS csv + ensure tests/<recipe>/ exists. No webhook required for CI to work. (The root cause of the old webhook non-delivery no longer matters: polling makes it irrelevant; the operator was whitelisting ci.commoninternet.net in Gitea's ALLOWED_HOST_LIST, but D1 no longer depends on that.)
- Resource safety: bound live test apps. SETTLED (orchestrator design change 2026-05-27, plan §4.2/§4.3). Do NOT keep multiple test apps deployed at once. Three layers, all configurable:
  - MAX_TESTS = DRONE_RUNNER_CAPACITY = 1 (modules/drone-runner.nix, maxTests let-binding). Drone runs at most MAX_TESTS builds at once and auto-queues the rest in its native pending queue, so there is no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
  - Per-build TIMEOUT = 60 min (modules/drone.nix, buildTimeoutMinutes; reconciled best-effort via PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60} using the bridge's Drone admin token, local --resolve, non-fatal). A build over the limit is cancelled by Drone → the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue once a test finishes OR times out".
  - Teardown + janitor backstop. Each build deploys → runs the 3 stages → undeploys (guaranteed try/finally in conftest/orchestrator). A SIGKILL'd/timed-out build can't run its own teardown, so the run-start janitor (lifecycle.janitor, called before every deploy in both fixtures + run_recipe_ci) reaps orphaned run apps as the backstop. At capacity=1 the CI path will set CCCI_JANITOR_MAX_AGE=0 (reap any orphan immediately, which is safe with no concurrent runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default 2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
  - Optional concurrency: {limit: 1} in the recipe-CI .drone.yml is a redundant belt; the primary mechanism is DRONE_RUNNER_CAPACITY. (Wired when the recipe-CI pipeline lands; see backlog.)
- D10 recipe #6: bluesky-pds (TLS-passthrough) SWAPPED → n8n. SETTLED (2026-05-27, plan §4.0 sanctions this swap-with-reason). bluesky-pds routes via a Traefik TCP router with tls.passthrough=true to an in-container caddy that terminates TLS itself and obtains its own cert via ACME. cc-ci's design is the opposite: the operator gateway passes wildcard TLS through to cc-ci's Traefik, which terminates it with the pre-issued static wildcard cert, and ACME is hard-forbidden for commoninternet.net (no DNS token on the box; §4.0/§9). Serving bluesky-pds would require either (a) ACME inside caddy (forbidden), or (b) injecting the wildcard cert into caddy + a per-host TCP-passthrough router on cc-ci Traefik (recipe-internal surgery + a bespoke proxy mode, not a clean shared-harness absorb). This is a genuine design conflict, not a harness gap. Per the plan's explicit allowance, bluesky-pds is a documented non-CI'd recipe (reason here), and n8n takes the 6th slot. The 5 required D10 categories are already covered by recipes 1–5 (simple=custom-html, single-DB+SSO=keycloak, stateful/no-DB=cryptpad, DB+media/large-volume=matrix-synapse, multi-service+S3/object-storage=lasuite-docs); n8n adds a 6th real deployable app (workflow automation) behind the normal terminate-at-Traefik path.
- Docker Hub rate limit + mid-breadth prune. FINDING (2026-05-27). D10 real-!testme breadth runs exhausted Docker Hub's anonymous pull rate limit (lasuite-docs, 9 images, upgrade stage: toomanyrequests). Two lessons: (1) registry pull creds are an A1 operator input needed for reliable heavy-recipe deploys under load (request + sops-store + wire into docker daemon). (2) Don't docker image prune -af mid-breadth: it evicts cached recipe images and forces re-pulls that hit the limit. The first lasuite failure was disk pressure (90% full); pruning fixed disk but triggered re-pulls → rate limit. Better: rely on the daily autoprune, prune only dangling (not -a) between runs, or grow disk so heavy images stay cached. Net for D10: 5/6 recipes green via real !testme; lasuite-docs gated on the rate limit (transient ~hours; durable fix = creds).
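
The seen-set behaviour described under the Trigger decision (first poll primes, polling and /hook share one set keyed by comment id) can be sketched as follows. This is a minimal illustration, not the bridge's code: `SeenSet`, `poll_once`, and the `startswith("!testme")` match are hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class SeenSet:
    """In-memory dedup shared by the polling loop and the /hook endpoint (hypothetical)."""

    seen: set = field(default_factory=set)
    primed: bool = False

    def prime(self, comment_ids):
        """First poll after startup: mark pre-existing comments seen, fire none."""
        self.seen.update(comment_ids)
        self.primed = True

    def should_fire(self, comment_id):
        """Return True exactly once per comment id, whichever path sees it first."""
        if comment_id in self.seen:
            return False
        self.seen.add(comment_id)
        return True


def poll_once(state, comments):
    """comments: (comment_id, body) pairs from open PRs; returns the ids to trigger."""
    comments = list(comments)
    if not state.primed:
        state.prime(cid for cid, _ in comments)
        return []
    # Assumption: a trigger comment is one whose body starts with "!testme".
    return [
        cid
        for cid, body in comments
        if body.strip().startswith("!testme") and state.should_fire(cid)
    ]
```

A webhook handler would call `state.should_fire(comment_id)` on the same object, so a comment delivered by both paths triggers at most one build.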
Open (defaults from §8, to confirm as reality lands)
- Deploy mechanism. SETTLED (M0): nixos-rebuild switch --flake /root/cc-ci#cc-ci run on cc-ci itself, with the repo materialised on the host at /root/cc-ci. Chosen over --target-host/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS proxy (slow/fragile). Atomic rollback preserved by Nix generations (nixos-rebuild --rollback). The switch is launched as a detached transient systemd unit (systemd-run --unit=ccci-rebuild --collect) so it survives a momentary ssh-over-tailscale drop during activation. For the build loop the host copy is synced from the sandbox clone via tar | ssh (rsync absent on host); source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo on a fresh host, then nixos-rebuild switch --flake .#cc-ci).
  - nixpkgs pin: flake pins the exact rev cc-ci already ran (50ab793…) so the first rebuild is a true no-op-then-base. Bump deliberately, never drift.
- Webhook scope: default per-repo via enroll script.
- CI engine: Drone (per plan), kept, with a noted risk. nixpkgs 24.11 has Drone server 2.24.0 but drone-runner-exec is abandoned (unstable-2020-04-19); it is the only exec runner Drone ever shipped (upstream archived ~2021). The maintained fork Woodpecker (2.7.3, with NixOS modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific (D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern Drone server (RPC protocol stable). Fallback: if the exec runner proves incompatible/broken, pivot to Woodpecker (coop-cloud ships a woodpecker recipe too) and record it, like the traefik pivot. Re-evaluate at the M2 gate.
- Drone deployment shape. SETTLED (M2): mirror the traefik pattern. The server is the coop-cloud drone recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by traefik at drone.ci.commoninternet.net, LETS_ENCRYPT_ENV empty → wildcard cert, no ACME), with Gitea SSO (compose.gitea.yml). The exec runner runs as a Nix systemd service on the host (modules/drone-runner.nix) so it can drive host abra/swarm (plan §4.2). One generated DRONE_RPC_SECRET is shared: inserted as the server's rpc_secret swarm secret AND read by the runner from sops. Reproducible deploy: scripts/deploy-drone.sh.
  - Gitea OAuth app cc-ci-drone created under the bot (client_id ab4cdb9d-ee96-4867-875f-87384505fc52, redirect https://drone.ci.commoninternet.net/login); client_secret + rpc_secret stored sops-encrypted in secrets/secrets.yaml (A2 internal secrets).
- Drone runner type: exec (must drive host abra).
- Secret tool. SETTLED (M0): sops-nix. cc-ci decrypts at activation using its ed25519 SSH host key as the age identity (sops.age.sshKeyPaths), so no extra key file to manage on the box. Recipients in /.sops.yaml: the host age key (age1h90ut…, from ssh-to-age) + an off-box master recovery key (age1cmk26t…; private half only at /srv/cc-ci/.sops/master-age.txt on the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing plaintext into secrets/<f>.yaml then sops -e -i (run inside the repo so .sops.yaml is found).
- D10 recipe set: lock six early. Candidates favouring already-mirrored: custom-html (simple), cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3), bluesky-pds (TLS-passthrough); this covers all five categories. Confirm during M4–M6.5.
- Per-run app domain scheme: adapted (M4, deviates from plan §4.0). Plan §4.0 wanted <recipe>-pr<n>-<short-sha>.ci.commoninternet.net, but Docker swarm config/secret names (<stackname>_<resource>_<version>) must be ≤ 64 chars and abra derives <stackname> from the domain (dots→_, hyphens kept). .ci.commoninternet.net alone is 22 chars, so long recipe names + config names overflow 64 (hit with custom-html-pr0-m4demo…_nginx_default_conf_v6 = 66). New scheme: <recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net (e.g. cust-e084bd): short, unique per run, collision-safe across recipes (full recipe in the hash). Human-readable recipe/PR/ref context lives in the Drone build params + the PR comment, not the (ephemeral) domain.
- abra recipe checkout is volatile: harness uses chaos+offline + a tests/ snapshot (M6). Many abra commands (app ls, secret generate without flags, version resolution) silently git checkout <version-tag> in ~/.abra/recipes/<recipe>, discarding a PR branch's files. To test the PR head code (not a re-resolved tag): (1) fetch_recipe clones the mirror branch/ref (private → bot token via per-command http.extraHeader, never persisted/logged); (2) all harness abra calls that touch the recipe pass -C (chaos: use current checkout) -o (offline: no remote fetch); (3) recipe-shipped tests/ (D4) are snapshotted to a temp dir right after fetch, since later abra commands still reset the checkout, so the recipe-local stage runs from the snapshot.
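
The per-run domain scheme and its 64-character arithmetic can be checked with a short sketch. The hash function and the exact `recipe|pr|ref` encoding are not stated in this log, so SHA-256 over a `|`-joined string is an assumption here, and the helper names are hypothetical.

```python
import hashlib

SUFFIX = ".ci.commoninternet.net"
SWARM_NAME_LIMIT = 64  # Docker swarm config/secret name limit cited in this log


def run_domain(recipe, pr, ref):
    """Build <recipe[:4]>-<6hex(recipe|pr|ref)> plus the suffix.

    Assumption: SHA-256 over "recipe|pr|ref"; the real harness may hash differently.
    """
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}{SUFFIX}"


def stack_name(domain):
    """abra derives the stack name from the domain: dots become '_', hyphens are kept."""
    return domain.replace(".", "_")


def resource_name(domain, resource, version):
    """<stackname>_<resource>_<version>, the name that must fit the limit."""
    return f"{stack_name(domain)}_{resource}_{version}"


domain = run_domain("custom-html", 0, "abc123")
name = resource_name(domain, "nginx_default_conf", "v6")
assert len(name) <= SWARM_NAME_LIMIT
```

With a 4-character prefix and 6 hex digits the stack name is always 33 characters, which leaves 31 characters for `_<resource>_<version>` regardless of recipe name length.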
Risks
- Disk — RESOLVED 2026-05-26. Original 8.9 GiB root had only ~3.8 GiB free and a hard
inode ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on
inodes before bytes. Operator grew the VM to 28 GiB (22 GiB free, 1.78M inodes / 1.21M free);
the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown +
periodic docker image prune to avoid regressing during M6.5 breadth.
Dead-ends
- (none yet)
Phase 1c (full reproducibility + genuine D8 live rebuild) — 2026-05-27
- Secrets linkage = git SUBMODULE (deviates from plan §7 flake-input default). cc-ci-secrets is mounted as a submodule at cc-ci/secrets/ rather than a flake inputs.secrets. Rationale: a private flake input must be re-fetched at every nix eval, requiring the bot token persistently in nix config/netrc on cc-ci AND the throwaway VM (a token in the store/config = a 2nd out-of-band secret, which 1c forbids). A submodule makes secrets/secrets.yaml a plain path in the working tree → defaultSopsFile = ../secrets/secrets.yaml is unchanged (minimal diff, trivially byte-identical), and the only credential use is the one git clone --recursive at provisioning ("the two repos are given", Mission §1). Build invocation becomes nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci' so the submodule tree is included. (Revisit if ?submodules=1 proves unreliable on cc-ci's nix version.)
- Bootstrap key for the throwaway VM = the existing RECOVERY (master) age key, via sops.age.keyFile. The recovery key (age1cmk26…, private at /srv/cc-ci/.sops/master-age.txt) is already a sops recipient, so a fresh host with a different ssh host key still decrypts every secret with no re-keying; this is exactly the §0 argument that defeats "host-key binding". Provisioned to the VM at a fixed path (the ONE out-of-band secret). cc-ci itself keeps decrypting via its host key (age.sshKeyPaths); secrets.nix will offer both identity sources. (Per-host re-encrypt is cleaner for a permanent new instance; documented as the alternative, not used for the throwaway test.)
- Cert into git: wildcard cert+key become sops secrets in cc-ci-secrets, decrypted at activation back to /var/lib/ci-certs/live/{fullchain.pem,privkey.pem} via sops.secrets.<name>.path; proxy.nix keeps reading that path (now sops-sourced, not operator-drop).
- cc-nix-test final sizing (C6). SETTLED by operator 2026-05-27: PROMOTE the rebuilt VM. The freshly-rebuilt reproducible VM (the FINAL W5/C4-C5 clean-room throwaway) becomes the canonical cc-nix-test; the operator will repurpose it for a live real-traffic test through the public gateway.
- C6 teardown OVERRIDE (operator, 2026-05-27): do NOT destroy the FINAL throwaway VM after W5/C4-C5 PASSes. Keep it RUNNING and defer its C6 teardown until the operator explicitly says otherwise. This overrides the plan §5/§6 "destroy the throwaway" for that one VM only. All other cleanup proceeds normally (the Builder's first throwaway was already destroyed; RAM accounting holds).
Phase 1b — lint/format tooling (open decisions §6, settled W0)
- Formatters/linters (RL1): Nix = nixpkgs-fmt (format) + statix (lints) + deadnix (dead code); Python = ruff (lint + format); Shell = shellcheck + shfmt -i 2 -ci; YAML = yamllint. Kept nixpkgs-fmt over alejandra because it was already the repo formatter and devshell tool (no extra churn / restyle of every .nix). All built from the already-pinned nixpkgs via a flake lint devshell (nix develop .#lint) so CI and local use byte-identical tool versions.
- Lint entrypoint: scripts/lint.sh (check-only by default; --fix auto-applies). The .drone.yml push pipeline runs it via nix develop .#lint --command bash scripts/lint.sh.
- ruff strictness: select = [E,F,W,I,UP,B,C4,SIM], ignore = [E501] (line length is the formatter's job; only un-splittable strings would trip it). line-length=100, target=py311.
- Drone lint stage = FAIL (not warn). The codebase is green now, so enforce from here on: an unclean commit fails the lint step. (Resolves the §6 open question.)
- Python type-checking (mypy/pyright): DEFERRED to IDEAS, not added in 1b. The harness is small and dynamically typed around abra/subprocess JSON; gradual typing is a larger effort than this bounded pass warrants. Revisit if Phase 2's 18-recipe ramp shows type bugs.
- blocking vs advisory split (§3): treated as in the phase plan: tests-real, Nix-idempotent, no-footguns, no-secrets, log-redaction, harness-DRY = blocking; readability/docs/arch-drift = advisory unless a real plan deviation. Recorded per-finding in REVIEW-1b / BACKLOG-1b.
- cc-ci self-CI push trigger: the lint stage lives in the event: push pipeline. The Gitea→Drone push webhook on this instance is flaky (last_status: None; documented §4.1) and predates 1b. Recipe CI uses polling as primary, but cc-ci's own self-test/lint relies on the push webhook. The lint stage is correctly wired and proven green via the identical nix develop .#lint command; reliably auto-firing it on every push is tracked as a (pre-existing) infra item, not a 1b lint gap.
Phase 1b — repo layout (operator review items RL5/RL6, plan §7)
- RL5: all Nix code under nix/. Moved modules/ → nix/modules/ and hosts/ → nix/hosts/. flake.nix/flake.lock STAY at the repo root (entry point) so the build ref #cc-ci and nixos-rebuild --flake '…#cc-ci' are unchanged; only flake.nix's internal ./hosts/cc-ci/configuration.nix → ./nix/hosts/cc-ci/configuration.nix changed. Root-relative refs inside the moved modules were re-based ../X → ../../X (secrets.nix → ../../secrets/, bridge.nix → ../../bridge/, dashboard.nix → ../../dashboard/); configuration.nix's ../../modules/* imports are unchanged (both dirs moved under nix/, so the relative path still resolves). Toplevel is byte-identical (8i3jcad9…) before/after the move: store derivations are content-addressed on the copied file contents, and the module .nix files aren't part of the runtime closure, so relocating folders doesn't change the build. (The operator anticipated a hash change; in practice it's stable, which is even stronger for reproducibility.) Living docs (README, architecture/install/secrets/enroll) + the .drone.yml comment updated to nix/…; append-only history logs left as the record of what was true then.
- RL6 (protocol files → machine-docs/): DEFERRED to the coordinated end of 1b. Will git mv STATUS*/REVIEW*/JOURNAL*/BACKLOG*/DECISIONS.md into machine-docs/ (README.md STAYS at root: operator decision, it's the human readme, not a protocol file). The live watchdog (launch.sh) reads STATUS-<id>.md/REVIEW-<id>.md at the repo root for handoffs/transition, so this is done LAST, in lockstep with the orchestrator updating launch.sh + restarting the watchdog, not unilaterally and not while a phase transition is pending. The Adversary likewise git mvs its own REVIEW files at the cutover (single-writer rule).
Phase 1b — recorded deviation: no tests/_template/ dir (enroll = copy an existing recipe)
Plan §3's repo layout lists a tests/_template/ "copy-to-add-a-recipe" dir. It was never created
(pre-1b; not introduced or removed by 1b) — instead the documented enroll flow in
docs/enroll-recipe.md is "copy an existing recipe's tree, e.g. tests/custom-html/…, then adjust
recipe_meta.py + the per-recipe test files." This satisfies D5's "small, repeatable, documented
operation with no harness surgery" the same way (a concrete recipe is a better starting template than
an abstract skeleton that can drift). Recording per the Adversary's RL3 D5 advisory; not a blocker.
Phase 1d — generic test suite + layered overlays (design, 2026-05-27)
SSOT: cc-ci-plan/plan-phase1d-generic-test-suite.md. Resolves the §6 open decisions.
- Tier model & op/assertion split (the core call). A run is a sequence of TIERS (install, upgrade, backup, restore, custom), each = generic default [overridden by a recipe overlay]. The lifecycle OP (deploy, upgrade, backup, restore) is owned by the shared harness (harness.generic helpers), NOT duplicated in every test file. A tier's test file (generic or overlay) carries the ASSERTIONS and calls the shared op helper. This keeps the op single-sourced (DRY, DG7) and makes deploy-once trivial: only the orchestrator deploys/tears-down.
- Override (not additive): Builder's call (plan §6, operator leaned override). For each lifecycle op exactly ONE assertion file runs, by precedence: repo-local tests/test_<op>.py > cc-ci tests/<recipe>/test_<op>.py > generic (tests/_generic/test_<op>.py). A present overlay REPLACES the generic for that op. Invariant: no overlay for an op ⇒ the generic runs (so any recipe is testable with zero config). Repo-local wins same-name collisions (upstream is authoritative, plan §2.5); cc-ci's overlay is the curated fallback until upstream adopts it. Extend-by-composition: an overlay may from harness import generic and call generic.assert_serving(...)/generic.do_upgrade(...) then add its own assertions, so "extend" needs no separate mechanism.
- Custom (non-lifecycle) test_*.py: ALL discovered from BOTH locations run additively, opt-in (no override, no generic equivalent), e.g. test_sso.py.
Deploy ONCE, mutate in place (operator requirement, DG4.1). The orchestrator deploys the app ONCE, runs all tiers against that single live deployment (install asserts; upgrade does
abra app upgradein place; backup/restore mutate in place; custom asserts), then ONE teardown infinally. No per-tier/per-overlayabra app new/deploy/undeploy. ACCCI_DEPLOY_COUNTcounter inlifecycle.deploy_appis asserted == 1 per run (DG4.1 evidence). -
Deployment-sharing scope & base version (§6 open). One deployment for the whole lifecycle. Base version deployed once = the previous published version when an upgrade tier will run and a previous exists (so upgrade goes previous→target in place), else the target (current/$REF). Recipe with only one published version ⇒ upgrade tier is a clean SKIP (nothing to upgrade from). Standalone generic-install demo (no PR) deploys current.
-
Fail handling across shared tiers (§6 open): install failing (app never serves) fail-fasts the run (later tiers can't meaningfully run on a dead deployment) and they report error/skip; upgrade/backup/restore failures are recorded per-op but do not abort the remaining independent tiers where they can still run. Teardown always runs.
- Backup-capability detection (DG3, §6 open): auto. Scan the recipe's compose*.yml for a backupbot.backup label (verified present in custom-html). recipe_meta.BACKUP_CAPABLE (bool) overrides the auto-detect. Not capable ⇒ backup+restore tiers are N/A (skip), not failures.
- Custom install-steps hook (DG5, §6 open): a shell hook, tests/<recipe>/install_steps.sh (cc-ci) or repo-local tests/install_steps.sh, run by the orchestrator during the install tier AFTER abra app new + env defaults but BEFORE abra app deploy, with env CCCI_APP_DOMAIN, CCCI_RECIPE, CCCI_APP_ENV (path to the app .env). Chosen over a fixture/declarative field as the simplest thing the harness runs uniformly (can abra app secret insert, set env, seed). Graceful rule: a recipe with NO hook still attempts the generic install; if it genuinely needs a step it FAILS the generic install (reported per-op), and that is correct, not a harness bug.
- Per-op result vocabulary (Phase-3 feed): pass | fail | skip(N/A) | error. The orchestrator prints a per-op summary line per run (feeds DG6 + Phase-3 level).
- Discovery layout: cc-ci overlays/custom/hook live in tests/<recipe>/; repo-local in the recipe repo's tests/ (snapshotted after fetch, per the existing volatile-checkout handling). Generic tier files live in tests/_generic/ (assertion-only, use the shared live-deployment fixtures).
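
The Phase-1d override precedence (repo-local > cc-ci overlay > generic, exactly one file per op) can be sketched as below. `pick_assertion_file` and its parameters are hypothetical names for illustration, not the signature of the harness's own discovery code.

```python
from pathlib import Path


def pick_assertion_file(op, recipe, ccci_tests, repo_local_tests=None):
    """Return the single assertion file for a lifecycle op, by precedence.

    Order: repo-local tests/test_<op>.py, then cc-ci tests/<recipe>/test_<op>.py,
    then the generic tests/_generic/test_<op>.py.
    """
    name = f"test_{op}.py"
    candidates = []
    if repo_local_tests is not None:
        # Snapshot of the recipe repo's own tests/ dir, if one was taken.
        candidates.append(Path(repo_local_tests) / name)
    candidates.append(Path(ccci_tests) / recipe / name)
    candidates.append(Path(ccci_tests) / "_generic" / name)
    for path in candidates:
        if path.is_file():
            return path
    raise FileNotFoundError(f"no assertion file for op {op!r}")
```

Because the generic file is always the last candidate, a recipe with no overlay at all still resolves every op, which is the "testable with zero config" invariant.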
Phase 1e — generic-harness corrections (HC1–HC4)
Three operator-review corrections to the Phase-1d shared harness, settled here (plan §5).
- HC2: repo-local approval allowlist (form/location + workflow). PR-author-controlled code (install_steps.sh, repo-local test_*.py) runs on the CI host with /run/secrets/* present, so it is default-deny. Allowlist file: tests/repo-local-approved.txt (checked into the cc-ci repo, git-auditable). Format: one recipe name per line; # comments + blank lines ignored; a lone * is NOT a wildcard (no global opt-in; every recipe is explicit). Default: empty ⇒ no recipe trusts repo-local code. Discovery (resolve_op/custom_tests/install_steps) consults the repo-local source only when repo_local_approved(recipe) is true; otherwise precedence is cc-ci > generic only and repo-local is discovered-but-not-executed. Workflow: a cc-ci maintainer reviews a recipe's repo-local tests, then adds the recipe name to tests/repo-local-approved.txt in a cc-ci PR, a deliberate, reviewable act. The gate is centralized in discovery.py (one reader) so the unit tests pin it.
HC3 — generic-by-default opt-out flag (name/granularity + recipe_meta). Generic assertions run additively alongside any overlay by default. Opt-out, in increasing specificity (any one skips): env
CCCI_SKIP_GENERIC(truthy ⇒ skip generic for ALL ops), envCCCI_SKIP_GENERIC_<OP>(e.g.CCCI_SKIP_GENERIC_UPGRADE⇒ skip generic for that op only), and declarativerecipe_meta.SKIP_GENERIC= a list of op names (or["all"]) so the opt-out is per-recipe and visible in git, not a hidden global. Truthy =1/true/yes/on(case-insensitive). Op-vs-assertion split: a mutating op (upgrade/backup/restore) is performed once by the orchestrator (the harness owns the op); then the generic assertion file (unless opted out) and the overlay assertion file both evaluate the shared post-op state. Op results that an assertion needs (pre-upgrade identity, backup snapshot_id) are passed op→assertions via a run-scoped JSON state file at$CCCI_OP_STATE_FILE(read byharness.generic.op_state()); never logged. Overlays that need to seed pre-op state (data-continuity markers, the backup→restore mutation) ship an optionaltests/<recipe>/ops.pywithpre_install/pre_upgrade/pre_backup/pre_restore(domain, meta)callables the orchestrator runs before the op (repo-localops.pyis allowlist-gated like other repo-local code). Overlaytest_<op>.pyfiles are now assertion-only (they no longer callgeneric.do_*). -
HC1 — DG4.1 deploy-count vs the in-place chaos upgrade. The upgrade tier now upgrades to the PR head (code under test), not a published tag: deploy the previous published version (base), re-checkout the PR head (recorded as the recipe repo HEAD right after fetch, before any version-tag checkout), then
abra app deploy --chaosin place = the upgrade. The deploy-count guard countsabra app newinstalls only (_record_deploy()fires indeploy_app(), NOT in the chaos redeploy, which callsabra.deploydirectly) — so a run is still deploy-count == 1 and the legitimate in-place chaos upgrade is not flagged. Moved assertion (adapted): prev→PR-head may not bump the coop-cloud version label, soassert_upgradedaccepts ANY of: version-label change, image change, or a chaos label now present carrying the PR-head commit (a chaos deploy stampscoop-cloud.<stack>.chaos/.chaos-version) — the chaos label IS the proof PR-head was deployed. Non-PR!testme(no SRC/REF): "PR head" = the catalogue current checkout, so upgrade is prev→current — still a genuine move via chaos. (Exact chaos label name verified on the live abra during E2.)
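
The two HC2/HC3 gates reduce to small pure functions. The sketch below follows the stated file format and truthy values; the function names are hypothetical, and treating a trailing `# …` on a recipe line as a comment is an assumption (the log only says # comments are ignored).

```python
import os

TRUTHY = {"1", "true", "yes", "on"}


def parse_allowlist(text):
    """One recipe name per line; '#' comments and blank lines ignored.

    A lone '*' is dropped, so it can never act as a global opt-in.
    Assumption: text after '#' on a recipe line is also treated as a comment.
    """
    approved = set()
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name and name != "*":
            approved.add(name)
    return approved


def skip_generic(op, env=None, meta_skip=()):
    """True if any opt-out applies: global env, per-op env, or recipe_meta.SKIP_GENERIC."""
    env = os.environ if env is None else env

    def truthy(key):
        return env.get(key, "").strip().lower() in TRUTHY

    if truthy("CCCI_SKIP_GENERIC") or truthy(f"CCCI_SKIP_GENERIC_{op.upper()}"):
        return True
    return "all" in meta_skip or op in meta_skip
```

Passing `env` and `meta_skip` explicitly keeps both gates testable without touching the process environment or the allowlist file on disk.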
Phase 2 — per-recipe test authoring (design, 2026-05-28)
Inherits the Phase 1d/1e shared-deployment + additive-overlay + op/assertion-split model. Phase 2
adds content, not infra, with a few small harness primitives ported from
references/recipe-maintainer/utils/tests/helpers.py.
- Per-recipe layout (per plan §4.1). The cc-ci tests/<recipe>/ dir continues to use the Phase-1d/1e overlays at the top level (test_install.py, test_upgrade.py, test_backup.py, test_restore.py, ops.py, recipe_meta.py, optional install_steps.sh). NEW Phase-2 subdirectories:
  - tests/<recipe>/functional/: parity-port tests (one per recipe-maintainer tests/*.py) + ≥2 NEW recipe-specific functional tests (P2/P3). Each file is test_*.py (pytest-discoverable); each parity port carries a SOURCE = "recipe-info/<recipe>/tests/<file>" comment near the top so the audit trail is in the file, not just in PARITY.md.
  - tests/<recipe>/playwright/: browser flows (P6) where the app's UX is a UI flow. Same test_*.py convention; each file imports playwright.sync_api.
  - tests/<recipe>/PARITY.md: required mapping table (P2) with one row per recipe-info parity test: | recipe-maintainer file | cc-ci file | what's verified | status |. A deliberate non-port is a documented row in DECISIONS.md (linked from PARITY.md), not a silent omission.
- Discovery for the new subdirs. runner/harness/discovery.custom_tests recurses into tests/<recipe>/functional/ and tests/<recipe>/playwright/ (in addition to the top-level glob), so Phase-2 functional tests run as part of the custom stage automatically. Repo-local (HC2) gate still applies if the recipe is approved; otherwise only cc-ci's own functional/ + playwright/ run. The top-level test_install.py/etc. continue to drive the lifecycle overlays; the functional/ + playwright/ files are always custom-stage, never lifecycle (so they don't perform an op; they assert against the post-install live deployment).
- Vendored helpers in runner/harness/. Capabilities ported from recipe-maintainer/utils/tests/helpers.py (cc-ci is self-contained at runtime and does NOT import recipe-maintainer's workspace, per plan §8 default):
  - harness.http: http_get(url, headers=, timeout=) -> (status, json_or_None), http_post(...), retry_http_get(url, timeout=, **), wait_for_http(url, label, max_wait=), assert_converges(fn, description, max_wait=, interval=). (Several variants already exist: lifecycle.http_fetch/http_get/http_body; the harness.http module is the canonical Phase-2 HTTP API for tests; lifecycle.* helpers stay for infra-level checks.)
  - harness.abra_tty: script -qefc "abra …" /dev/null wrapper for the abra commands that require a TTY (backup/restore/secret/run/logs/lint), used by parity tests that drive abra directly. Lifecycle already exposes typed wrappers; this is for tests that need raw shell-abra.
  - harness.deps: dependency resolver primitive. Reads tests/<recipe>/recipe.toml (requires/test_requires), deploys each declared dep via the same lifecycle.deploy_app wait_healthy path (so the dep is a real <dep[:4]>-<6hex>.ci.commoninternet.net on the same swarm), persists per-run, tears down with the parent in the orchestrator's finally. Heavy recipes deploy sequentially; MAX_TESTS/node budget is the cap.
  - harness.sso: OIDC-flow primitive (Q2 deliverable). Given a deployed provider domain and a recipe-defined realm/client/test-user, performs the full "deploy provider → setup realm/client via admin API → obtain access token (password + client-credentials grants) → assert protected API call accepts it" assertion. Reusable by every SSO-dependent recipe (cryptpad, lasuite-*, immich, etc.). Setup scripts ported from recipe-info/<dep>/setup_<provider>_integration.py.
  - harness.data_integrity: backup data-integrity primitive: a recipe-aware "seed a marker → backup → mutate → restore → assert seeded marker survived" helper around lifecycle.exec_in_app/http_get (the recipe chooses the marker mechanism, the helper guarantees the pattern).
- Run-scoped credentials for SSO/recipe-specific tests (plan §4.4 class-B). Generated secrets
(realm/client/test-user passwords, API tokens) persist for the run via the existing
runs/<app-name>/mechanism (Phase 1d). Destroyed at teardown alongside abra secrets/volumes. - Recipe-versioned tests (anti-anchoring). Per plan §7.1, tests read versions/endpoints
dynamically (the app's own discovery endpoints, env from
live_app) — never hardcode published release values. Each functional test file declares the recipe-info SOURCE path it ports from so the Adversary can audit parity cold. - Heavy-recipe parking. Drone's
MAX_TESTS=1+ per-build timeout already serialize runs; for Phase 2 we DO NOT lift it. Within a single run, the orchestrator deploys deps before the recipe-under-test sequentially (never concurrently) per plan §4.2.
Phase 2 Q3.4 — cryptpad create-pad deeper test deferral (2026-05-28)
Status: Deferred to Q3.4 follow-up (or Q5 catch-up), with Adversary sign-off pending per plan §7.1.
What's deferred: The "create-an-object + read-it-back" deep test for cryptpad — authenticate-and-create a real pad in the browser, type a uniquely-marked content string, reload the page (retaining the client-side encryption key in the URL fragment), assert the marker survives. This is the canonical create-and-read-back per plan §4.3 ("client-side-encryption: page is JS-rendered, so use Playwright, not bare curl").
Why deferred (the technical reason):
- CryptPad's pad-creation client-side flow is version-specific. In the recipe under test (10.6.0+5.7.0), visiting /pad/ does NOT auto-inject a fragment-keyed pad URL; CryptPad requires the user to explicitly click a "new rich text" / "new pad" link from the landing page, AND those UI selectors (.cp-apps-grid a, [data-app='pad'], a[href*='/pad/']) are not stable across CryptPad versions.
- Three attempted drafts during Q3.4 each failed cold on this:
  - Type + reload + content-survives: contenteditable inside nested iframe with origin mismatch (SANDBOX_DOMAIN).
  - Direct-/pad/-then-fragment: no fragment ever appeared on this version.
  - Click-fallback for known app-launch selectors: none of the candidate selectors matched.
The maximal testable subset that IS shipped (P3 floor met):
- tests/cryptpad/functional/test_health_check.py: parity HTTP 200.
- tests/cryptpad/functional/test_spa_assets.py: CryptPad branding + canonical asset paths in served HTML. Catches the wedged-server-fallback-page failure mode.
- tests/cryptpad/playwright/test_pad_create.py: Chromium renders the SPA, asserts brand + canonical asset references + zero non-filtered JavaScript console errors.
The Playwright test exercises the JS pipeline in a real browser (per §4.3 directive); the
piece NOT exercised is the user-action-driven pad lifecycle. What's required to lift the
deferral: pin a specific CryptPad app-launch contract (CryptPad's source has app-launch
URL patterns like /pad/?new=1 on some versions) OR write a Playwright helper that walks the
SPA's main menu via a stable accessibility tree (role-based selectors instead of CSS).
Adversary may file F2-N requesting full create-pad coverage; the answer above is the honest technical reason + the maximal subset. Logged here per plan §7.1.
Phase 2 — nested DOMAIN-derived subdomains flattened to single-label wildcard siblings
Decision (settled): When an enrolled recipe routes additional services on nested subdomains
derived from DOMAIN (e.g. lasuite-drive MINIO_DOMAIN="minio.${DOMAIN}" +
COLLABORA_DOMAIN="collabora.${DOMAIN}"; lasuite-meet LIVEKIT_DOMAIN="livekit.${DOMAIN}"), the
recipe's recipe_meta.EXTRA_ENV(domain) MUST override those vars to a single-label sibling under
the wildcard — minio-<domain>, collabora-<domain>, livekit-<domain> — NOT the recipe's
default <svc>.<domain>.
Why: cc-ci's TLS cert is the operator's pre-issued wildcard *.ci.commoninternet.net (+ bare
ci.commoninternet.net) — §4.0/§1.5, renewed out-of-band, no ACME. A wildcard matches exactly one
label. The per-run app domain is already one label (lasuite-drive-pr<n>-<sha>.ci.commoninternet.net),
so a nested minio.lasuite-drive-pr<n>-<sha>.ci.commoninternet.net is a 2-label name the wildcard
does NOT cover → Traefik would serve an invalid cert on that router and the service is unreachable
over HTTPS. Re-prefixing with a hyphen keeps it one label (minio-lasuite-drive-pr<n>-<sha> +
.ci.commoninternet.net), covered by the same wildcard, routed by Traefik's swarm provider with no
cert work and no gateway change (the gateway already passes the whole wildcard, §4.0). We must NOT
mint per-host certs / ACME for these (class-A1 boundary, §9).
Scope: purely a per-recipe EXTRA_ENV concern (no shared-harness change). Recipes with no
DOMAIN-derived nested subdomains (most) are unaffected.
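The flattening rule can be sketched in a few lines of Python. This is a minimal illustration, not the shipped recipe_meta: the helper names flatten_subdomain and covered_by_wildcard are hypothetical, and the assumption that EXTRA_ENV(domain) returns a plain dict is ours.

```python
BASE = "ci.commoninternet.net"  # wildcard base named in this decision


def flatten_subdomain(service: str, domain: str) -> str:
    """Hypothetical helper: '<service>-<domain>' instead of '<service>.<domain>'."""
    return f"{service}-{domain}"


def covered_by_wildcard(host: str, base: str = BASE) -> bool:
    """True if host is the bare base or exactly one label under it."""
    if host == base:
        return True
    suffix = "." + base
    if not host.endswith(suffix):
        return False
    label = host[: -len(suffix)]
    return bool(label) and "." not in label


def EXTRA_ENV(domain: str) -> dict:
    """Sketch of a lasuite-drive style override (return type is an assumption)."""
    return {
        "MINIO_DOMAIN": flatten_subdomain("minio", domain),
        "COLLABORA_DOMAIN": flatten_subdomain("collabora", domain),
    }
```

With a per-run domain such as lasuite-drive-pr7-abc123.ci.commoninternet.net (an invented example), the hyphenated name passes covered_by_wildcard while the recipe default minio.<domain> does not.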
Phase 2 — services_converged treats a replicas: 0 one-shot as converged
Decision (settled): runner/harness/lifecycle.py::services_converged now considers a service
converged when cur == want (desired replica count met), removing the prior
or want == "0" rejection.
Why: lasuite-drive's minio-createbuckets is declared deploy: {mode: replicated, replicas: 0, restart_policy: {condition: none}} — an on-demand one-shot (scaled up manually only when buckets
need (re)creating; it mc mb … then exit 0). docker stack services reports it 0/0. The old
check rejected any want == "0" row, so the stack could never report converged → every deploy
hung until deploy_timeout. A service AT its desired count (including 0/0) is converged; a service
still spinning up shows 0/1 (cur != want) and is correctly not-yet-converged, so the HTTP
readiness wait still gates real liveness. Safe for all currently-green recipes (their services are
all N/N with N>0; the 0/0 case did not previously occur). Buckets/migrations that the one-shot
performs are run on-demand in the recipe's setup_custom_tests.sh (post-deploy), not relied upon for
generic-install convergence (the SPA at / serves 200 without them).
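A minimal sketch of the revised check, assuming the input is the REPLICAS column of docker stack services (one "cur/want" string per service). The real lifecycle.services_converged is not reproduced here; the input shape and the "empty stack is not converged" choice are assumptions.

```python
def services_converged(replica_columns) -> bool:
    """Converged when every service reports cur == want, including 0/0.

    replica_columns: iterable of strings such as "1/1", "0/0" or
    "1/1 (max 1 per node)" (assumed input shape).
    """
    rows = [r.split()[0] for r in replica_columns if r.strip()]
    if not rows:
        # Assumption: no services listed means the stack is not up yet.
        return False
    for row in rows:
        cur, sep, want = row.partition("/")
        if not sep or cur != want:
            return False
    return True
```

A 0/1 row still blocks convergence, so a service that is spinning up is never mistaken for an intentional replicas: 0 one-shot.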
2026-05-28 — Docker Hub auth: declarative config.json via sops (rate-limit fix) — SETTLED
Context. Heavy Phase-2 recipe deploys exhausted Docker Hub's anonymous pull rate limit
(100/6h per shared IP 68.14.43.142) → toomanyrequests blocked all new deploys. Operator
provided a read-only Docker Hub PAT (Class A1 registry creds, plan §1.5): DOCKERHUB_USERNAME=nptest2 and
DOCKERHUB_TOKEN in /srv/cc-ci/.testenv. Authenticated pulls = 200/6h per-account.
Decision. Wire it declaratively (survives a 1c NixOS rebuild), not just an imperative login:
- Secret: secrets/secrets.yaml (cc-ci-secrets submodule, commit cdd5e0a) gains key dockerhub_auth = base64("nptest2:<PAT>"), i.e. the exact auth field docker config.json wants, so the nix template is a pure render (no runtime base64). sops-encrypted to host+master age recipients (edited on cc-ci using its ssh-host-key→age identity via nix shell nixpkgs#sops; plaintext shredded; PAT never committed plaintext nor exposed in process args/logs).
- Render: nix/modules/secrets.nix adds sops.secrets.dockerhub_auth + a sops.templates."docker-config.json" that renders /root/.docker/config.json (0600, root) at activation. It becomes a symlink to /run/secrets/rendered/docker-config.json.
- Why /root: the drone exec runner runs pipelines as User=root (drone-runner.nix), and manual deploys ssh in as root, so /root/.docker/config.json covers both the !testme CI path and manual ops. Single config, single user.
Swarm-propagation question — RESOLVED empirically (no --with-registry-auth / pre-pull needed).
The operator/Adversary flagged that a node docker login may NOT propagate to swarm SERVICE-task
pulls. Tested on cc-ci with the authenticated config.json in place:
- Account ratelimit baseline 197/200 (source = account hash b662dd8b-…, not the IP).
- Deployed uncached n8nio/n8n:2.20.6 via abra (RECIPE=n8n STAGES=install). The swarm service task pulled it to 1/1 Running with no toomanyrequests.
- Account counter dropped 197 → 196 (manager manifest resolution) → 195 (agent layer-manifest pull), source still the account hash. So abra's docker stack deploy propagates the cred to the swarm task pull on this single-node swarm, billed to the account, not the anon IP.
- Corroborating: the earlier lasuite-drive deploy resolved 12 images with no toomanyrequests while anon budget was ≤4 (impossible anonymously), so manager resolution is authenticated too.
So: declarative root config.json is sufficient end-to-end here; --with-registry-auth is not
required (abra/SDK attaches it). Caveat (Phase 2b): 200/6h may still be tight for a full ~18-recipe
sweep; the permanent structural fix is a registry pull-through cache authenticated with this same PAT.
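For reference, the stored dockerhub_auth value and the rendered file have the following shape. This is a Python sketch of the encoding only; the actual render is done by the sops template in nix, and the function names here are illustrative.

```python
import base64
import json

DOCKERHUB_REGISTRY_KEY = "https://index.docker.io/v1/"  # Docker Hub's key in config.json


def dockerhub_auth_field(username: str, token: str) -> str:
    """base64("user:token"), the value docker expects in the "auth" field."""
    return base64.b64encode(f"{username}:{token}".encode()).decode()


def render_docker_config(auth: str) -> str:
    """Render a minimal config.json body carrying only the Hub credential."""
    return json.dumps({"auths": {DOCKERHUB_REGISTRY_KEY: {"auth": auth}}}, indent=2)
```

Because the secret already holds the encoded string, the template only has to substitute it into the auth field.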
Phase 2w — warm canonical + --quick (2026-05-28)
Stable-domain scheme for warm apps: warm-<recipe>.ci.commoninternet.net. Distinct from cold
per-run <recipe[:4]>-<6hex> (naming.app_domain) so a warm app is never confused with a disposable
cold run. Live-warm keycloak = warm-keycloak.ci.commoninternet.net; data-warm canonicals (W1) =
warm-<recipe>.... Risk to watch: longer stack name vs swarm's 64-char config/secret limit —
verified per-recipe on first deploy; shorten the scheme if any recipe's secret name overflows.
Realm is the per-run isolation unit on the shared live-warm keycloak (WC1). Instead of
co-deploying a fresh keycloak per dependent run, dependents use the one live-warm keycloak and create
a per-run namespaced realm+client+user, deleted at run teardown. Realm name =
<parent_recipe>-<6hex> where 6hex is the parent's per-run domain label suffix — unique per
(parent, pr, ref) so concurrent dependents never collide, and traceable for debugging. (Was
realm=parent_recipe, which would collide across concurrent same-recipe runs.)
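The two naming schemes and the realm derivation can be sketched as below. The function names are hypothetical (the source names only naming.app_domain), and the sketch assumes the 6hex is the text after the last hyphen of the first domain label.

```python
BASE = "ci.commoninternet.net"


def cold_domain(recipe: str, hex6: str, base: str = BASE) -> str:
    """Disposable per-run domain: <recipe[:4]>-<6hex>.<base>."""
    return f"{recipe[:4]}-{hex6}.{base}"


def warm_domain(recipe: str, base: str = BASE) -> str:
    """Stable warm domain: warm-<recipe>.<base>."""
    return f"warm-{recipe}.{base}"


def realm_name(parent_recipe: str, parent_domain: str) -> str:
    """Per-run realm: <parent_recipe>-<6hex>, 6hex taken from the parent's label suffix."""
    label = parent_domain.split(".", 1)[0]
    suffix = label.rsplit("-", 1)[-1]
    return f"{parent_recipe}-{suffix}"
```

Two concurrent runs of the same parent recipe get different 6hex suffixes and therefore different realms on the shared keycloak.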
Warm keycloak is declarative INFRA, not warm DATA. The live-warm keycloak service is brought up by a Nix systemd-oneshot reconciler (converges to deployed+healthy at the stable domain), exactly like the traefik recipe deploy, so it IS in the D8 reproducibility closure (re-warmable from scratch) and self-heals on activation/boot. Only warm volumes/snapshots (W1+) are cache, excluded from D8. The keycloak's realm data is ephemeral per-run, so there is nothing persistent to exclude.
Live-warm is an optimization layer with a cold fallback. If no warm keycloak is present (e.g. a from-scratch host before the reconciler has run, or the warm app is down), the keycloak dep path falls back to the existing cold co-deploy so dependent runs still work. The warm path is preferred when available.
Phase 2w — design update: unpinned warm/infra + health-gated rollback (2026-05-28/29)
Warm/infra apps (traefik + keycloak) auto-update to LATEST nightly, health-gated (operator).
Supersedes the W0.3 pinned kcVersion. Keycloak is now unpinned like traefik: reconciler abra recipe fetch latest + chaos deploy; keep secret-generate-only-if-missing + health-wait. D8 holds
because the recipe is fetched at activation (runtime), so the nix store closure is byte-identical
regardless of which keycloak version is live.
Snapshot helper (WC3) — format + path. runner/harness/warmsnap.py. A snapshot is a raw tar
of each docker volume belonging to the app's stack, taken while the app is undeployed (nothing
writing → consistent). Stored under /var/lib/ci-warm/<recipe>/ as <recipe>.snapshot.tar + a
<recipe>.meta.json (commit/version/timestamp/volume list). One last-good per app, replaced
atomically (write to .tmp then rename). Restore: for each volume, clear _data and untar
back. Docker volumes are stack-scoped (<stack>_<vol>); the helper enumerates them via
docker volume ls filtered to the stack. Reused by WC1.1 (pre-upgrade snapshot of keycloak) and WC5
(promote-on-green-cold). Warm snapshots are cache, excluded from the D8 closure (WC8).
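The "write to .tmp then rename" replacement can be sketched as follows. This is not warmsnap.py itself: it tars plain directories passed in by the caller instead of enumerating docker volumes, and the function name and meta fields beyond those listed above are assumptions.

```python
import json
import os
import tarfile
from pathlib import Path


def write_snapshot(volume_dirs: dict, dest_dir: Path, recipe: str, meta: dict) -> Path:
    """Tar each volume directory, then atomically replace the last-good snapshot.

    volume_dirs maps a volume name to its data directory (assumed input shape).
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    final_tar = dest_dir / f"{recipe}.snapshot.tar"
    tmp_tar = dest_dir / f"{recipe}.snapshot.tar.tmp"
    final_meta = dest_dir / f"{recipe}.meta.json"
    tmp_meta = dest_dir / f"{recipe}.meta.json.tmp"

    # Write the whole archive to a temp name first; readers never see a partial tar.
    with tarfile.open(str(tmp_tar), "w") as tar:
        for name in sorted(volume_dirs):
            tar.add(str(volume_dirs[name]), arcname=name)
    tmp_meta.write_text(json.dumps({**meta, "volumes": sorted(volume_dirs)}))

    # os.replace is an atomic rename when source and target share a filesystem.
    os.replace(tmp_tar, final_tar)
    os.replace(tmp_meta, final_meta)
    return final_tar
```

If the process dies mid-write, only the .tmp files are affected and the previous last-good snapshot stays intact.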
Alert mechanism — sentinel files relayed by the Builder loop. The warm/infra reconciler is an
autonomous bash systemd unit on cc-ci; it cannot call the agent's PushNotification tool. So a
reconciler that rolls back (WC1.1) or holds a major/manual-migration upgrade (WC1.2) writes a JSON
alert sentinel to /var/lib/ci-warm/alerts/<ts>-<app>-<reason>.json (fields: app, reason
[rollback|held-major|held-manual-migration], from_version, to_version, release_notes, ts). The
Builder loop, each wake, scans that dir; for each new alert it (a) issues PushNotification to the
operator, (b) records it in STATUS-2w/JOURNAL-2w, (c) archives it to alerts/seen/. This bridges the
autonomous reconciler to operator visibility (latency = next Builder wake; acceptable for an alert).
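The relay step on each Builder wake might look like the sketch below. The notify callback stands in for PushNotification plus the STATUS/JOURNAL write; relay_alerts is a hypothetical name, not the Builder's actual code.

```python
import json
import shutil
from pathlib import Path


def relay_alerts(alerts_dir: Path, notify) -> list:
    """Relay every unseen alert sentinel once, then archive it under seen/.

    notify is any callable taking the parsed alert dict (assumed interface).
    """
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = []
    # glob is non-recursive, so files already moved to seen/ are not revisited.
    for path in sorted(alerts_dir.glob("*.json")):
        alert = json.loads(path.read_text())
        notify(alert)
        shutil.move(str(path), str(seen / path.name))
        relayed.append(alert)
    return relayed
```

Archiving only after notify returns means a crash between the two steps re-sends the alert on the next wake, which errs on the side of a duplicate over a lost alert.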
Re-sequence: WC1.1's keycloak rollback needs the WC3 snapshot helper, so build that FIRST, then rewrite the reconciler ONCE into the unpinned + WC1.2-safety-gated + WC1.1-health-gated-rollback form (avoids reworking the reconciler twice). The W0.3 reconciler is INTERIM until then.
Phase 2w — W0.6 reconciler: version model + deploy-by-tag (2026-05-29)
Reconcile entrypoint in Python, packaged in the nix store. runner/warm_reconcile.py, invoked by
the systemd unit as ${pyEnv}/bin/python3 ${../../runner}/warm_reconcile.py <app> (the runner/ dir is
copied into the store → D8-clean, no dependence on the /root/cc-ci checkout). Reuses
warmsnap/sso/abra/lifecycle so there is ONE snapshot impl (also used by the runner for WC5). Replaces
the bash reconcile in warm-keycloak.nix.
"latest" = newest published version TAG, deployed pinned (not chaos-of-main). WC1.2's "major
recipe-version bump" detection needs comparable versions, which chaos (deploy main HEAD) doesn't give.
So the reconciler resolves latest = git tag | sort -V | tail -1 (valid coop-cloud version tags),
records current = the app .env VERSION, and deploys the chosen tag pinned (abra app deploy <domain> <version> -o -n -f, after git checkout <tag>). "Auto-update to latest" is satisfied by converging
to the newest tag; "chaos" in the operator note is read as "auto-deploy latest", and tag-pinning is
the correct mechanism for a version-gated auto-update.
coop-cloud version format is <recipe-semver>+<app-version> (observed), not the plan's
<upstream>+<recipe-semver>. Evidence: keycloak 10.7.1+26.6.2 → image keycloak:26.6.2; n8n
3.2.0+2.20.6 → image n8nio/n8n:2.20.6 (the post-+ part is the app image tag). So the recipe
semver is the part BEFORE +. WC1.2's "major recipe bump = breaking" keys off the major (first)
component of the pre-+ recipe semver (e.g. 3.x→4.0 = held). Secondary signal: scan the target's
releaseNotes/<version>.md for manual-migration markers.
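A small sketch of the version model, assuming tags always have the <recipe-semver>+<app-version> shape with a purely numeric recipe semver. The function names are illustrative, and newest_tag only approximates git tag | sort -V by ordering on the recipe semver.

```python
def split_version(tag: str) -> tuple:
    """Split '<recipe-semver>+<app-version>' into its two parts."""
    recipe, sep, app = tag.partition("+")
    if not sep or not recipe or not app:
        raise ValueError(f"not a coop-cloud version tag: {tag!r}")
    return recipe, app


def recipe_semver_key(tag: str) -> tuple:
    """Numeric sort key from the pre-'+' recipe semver (assumes numeric components)."""
    return tuple(int(part) for part in split_version(tag)[0].split("."))


def is_major_bump(current: str, target: str) -> bool:
    """WC1.2 signal: the first component of the recipe semver increased."""
    return recipe_semver_key(target)[0] > recipe_semver_key(current)[0]


def newest_tag(tags) -> str:
    """Pick the tag with the highest recipe semver."""
    return max(tags, key=recipe_semver_key)
```

For example, split_version("10.7.1+26.6.2") yields the recipe semver 10.7.1 and the app image tag 26.6.2, matching the keycloak evidence above.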
Scope order for W0.6: keycloak first (the W0 focus, stateful → snapshot path); apply the same health-gated + safety-gate pattern to traefik (stateless, version-rollback-only) afterward by migrating proxy.nix onto the shared reconcile entrypoint.
Phase 2w — W1 canonical registry design (WC2/WC3) (2026-05-29)
Enrollment is declarative per-recipe via recipe_meta.WARM_CANONICAL = True (consistent with how
DEPS/EXTRA_ENV are declared — enrolling a recipe stays a tests/<recipe>/ change, D5). A recipe so
flagged gets a DATA-WARM canonical. Prove the model on a couple of recipes (custom-html simplest:
stateful, no external DB), NOT all (the nightly sweep populates the rest over time).
Stable domain warm-<recipe>.ci.commoninternet.net (already decided for keycloak; same scheme for
canonicals). Distinct from cold <recipe[:4]>-<6hex>. Watch the swarm 64-char secret-name limit
per recipe on first deploy.
Known-good state per canonical, under /var/lib/ci-warm/<recipe>/: last_good (version string,
already written by warm_reconcile), snapshot/ (warmsnap, W0.5), and a small canonical.json
registry record {recipe, domain, version, commit, status, ts}. The DATA VOLUME is retained while
the app is undeployed (data-warm). These are cache (excluded from D8, WC8).
Data-warm lifecycle (new runner/harness/canonical.py): is_enrolled(recipe) (reads
WARM_CANONICAL), canonical_domain(recipe), read/write_registry(recipe), deploy_canonical(recipe)
(deploy warm-<recipe> at last_good, reattaching the retained volume → warm boot), undeploy_keep_volume(recipe) (undeploy, volume retained = idle data-warm), seed_canonical(recipe, version, commit)
(record + snapshot; the volume becomes the canonical). LIVE-warm (keycloak, always up) vs DATA-warm
(canonicals, undeployed when idle) both use warm-<recipe> + warmsnap.
W1 scope vs W3: W1 builds the registry + data-warm lifecycle and proves it (seed a custom-html canonical → undeploy keep volume → redeploy reattach → data survives; re-warmable from scratch). Automatic promote-on-green-cold (WC5) + nightly (WC6) are W3 — for W1 the canonical is seeded programmatically to prove the model; the cold-advances-canonical wiring comes later.
Phase 2w — W3 WC5 promote-on-green-cold mechanism (2026-05-29)
Promote = re-seed the canonical from a fresh deploy of the green-verified latest (NOT "keep the
cold run's per-run volume"). Rationale: a cold run uses a fresh per-run domain <recipe>-<6hex>
with a fresh volume (cold stays authoritative + fresh); its volume names are per-run-specific and
differ from the canonical's warm-<recipe> volume names, so the per-run volume can't be directly
reused as the canonical without a fragile name-remap. AND the cardinal guardrail "never lose the
known-good" forbids touching the existing canonical until a new green one is ready.
So: on a run that is enrolled (recipe_meta.WARM_CANONICAL) + GREEN + COLD (not --quick) + on LATEST
(no PR head, i.e. REF empty — the nightly/manual-latest run, NOT a PR !testme), AFTER the normal
per-run teardown, the orchestrator PROMOTES: deploy warm-<recipe> at latest → wait healthy →
undeploy → canonical.seed_canonical(version=latest, commit=head) (snapshot-while-undeployed +
atomic registry/snapshot replace). The old known-good is replaced ATOMICALLY only on a green promote
(a red run never reaches promote → known-good safe). The canonical's data = a clean install of the
green-verified latest (a valid known-good baseline; --quick reattaches + upgrades it). Cost: one extra
(canonical) deploy per promote — acceptable for cold/nightly (not latency-sensitive). The FIRST such
green run SEEDS the canonical. --quick never promotes (proven W2). Only cold advances (WC5).
Promote gate predicate (unit-tested): is_enrolled(recipe) and overall==0 and not quick and not ref.
(not ref = a catalogue-latest run, i.e. the nightly sweep or a manual RECIPE=<r> run — a PR
!testme carries REF=PR-head and must NOT advance the canonical to a PR's code.)
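Written out as a function, the predicate reads as below. The name should_promote and the argument types are assumptions; only the boolean expression comes from this decision.

```python
def should_promote(enrolled: bool, overall: int, quick: bool, ref: str) -> bool:
    """Promote only for an enrolled recipe, on a green, cold, catalogue-latest run.

    overall is the run's exit status (0 = green); ref is empty for a
    nightly/manual-latest run and set to the PR head for a !testme run.
    """
    return bool(enrolled and overall == 0 and not quick and not ref)
```

Each of the four conditions can veto the promote independently, which is what the unit tests for the gate need to cover.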
Phase 2 — heavy-recipe upgrade tier disk constraint (28GB host) — SETTLED finding @2026-05-29
The upgrade tier (HC1: prev published → PR-head via in-place abra app deploy --chaos) cannot
complete for recipes whose successive releases bump multi-GB image tags, because the rolling update
must hold BOTH versions on disk transiently. Proven on lasuite-drive: onlyoffice 9.2 → 9.3.1.2
(3.94GB each) + collabora two versions → ~10GB office images at once vs ~14GB docker headroom on the
28GB host → 99% → deploy fail. No harness fix is possible (the prev images are running, so they
are neither dangling-prunable nor rmi-able when the new must be pulled). install/backup/restore/
custom (single version) fit and pass. Resolution = grow the host disk (Class A1 operator input,
DEFERRED.md 2026-05-29). Until then, heavy recipes are verified via their maximal testable subset
(install+backup+restore+custom) with the upgrade tier flagged as a genuine env-level (disk) blocker
per plan §7.1 (Adversary sign-off required). The cleanup runbook for an over-full host: pkill -f run_recipe_ci.py; docker stack rm <leftover>; remove its volumes+secrets; docker image prune -f.
SSO-provider policy (operator, 2026-05-29) — keycloak is the DEFAULT; authentik is NOT a DONE gate
Standing policy for all Phase-2 (and later) recipe OIDC/SSO testing:
- keycloak is the default SSO provider. Default ALL recipe OIDC tests to keycloak (live-warm WC1).
- Do NOT test authentik↔keycloak integration, and do NOT enroll authentik merely to "prove pluggability" / second-provider coverage. Phase-2 DONE is NOT gated on authentik.
- Enroll authentik + add setup_authentik_realm (the provider-pluggable backend in runner/harness/sso.py) ONLY if a recipe genuinely REQUIRES authentik (cannot work under keycloak). If it works with keycloak, use keycloak.
- cryptpad: its recipe-maintainer upstream SSO test uses authentik, but cc-ci tests cryptpad's OIDC under keycloak (equally valid). Same for any recipe whose upstream happens to use authentik but functions fine under keycloak.
- The OIDC FLOW primitives (oidc_password_grant, assert_discovery_endpoint) are already provider-agnostic; only realm/client SETUP is provider-specific, and we only need the keycloak setup (setup_keycloak_realm) unless/until a recipe forces authentik.

Consequences: DEFERRED #9 (authentik enrollment) re-entry trigger narrowed to "a recipe requires authentik"; F2-7 (authentik backend) is not a DONE blocker. plan-sso-dep-testing.md §6 updated by the orchestrator to match.
Phase 2pc — image-prune policy; local store IS the cache; registry pull-through DROPPED (2026-05-29) — SETTLED
Decision (PC1): removed virtualisation.docker.autoPrune (it ran docker system prune --force --all --filter until=24h daily). The --all evicts every image not used by a running container —
between runs no test apps run, so it wiped the cached recipe base images → cold re-pull → Docker-Hub
rate-limit churn (JOURNAL-2 507/542/690-693). Replaced with nix/modules/docker-prune.nix: the
ci-docker-prune daily timer + oneshot, a surgical triple-gated prune that no-ops unless ALL of
(1) / ≥ 80%, (2) no run-app stack live, (3) no swarm service converging; and when it runs prunes
only dangling images + stopped containers + dangling build cache, until=24h — never --all
(keeps tagged base/in-use images), never --volumes (warm canonical data). Teardown
(lifecycle.teardown_app) already removes only services/volumes/secrets/.env, never images — kept.
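The triple gate reduces to a small predicate. The sketch below is in Python for illustration; the actual unit is a nix-defined timer + oneshot, the names are hypothetical, and how the two stack/service counts are obtained is left to the caller.

```python
import shutil


def root_usage_pct(path: str = "/") -> float:
    """Percentage of the filesystem at path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def should_prune(usage_pct: float, run_stacks_live: int, services_converging: int,
                 threshold: float = 80.0) -> bool:
    """Prune only when ALL three gates hold: disk pressure, no live run, nothing converging."""
    return (
        usage_pct >= threshold
        and run_stacks_live == 0
        and services_converging == 0
    )
```

On a healthy host the first gate fails and the daily timer is a no-op, so cached base images survive between runs.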
Why: on this single host Docker's own local image store IS the cache — a pulled image stays and
redeploys reuse local layers with no re-download (proven: redis:7-alpine cold pull 5303ms w/ 6 layer
downloads → after service rm teardown the image is retained → warm redeploy "Image is up to date"
674ms, no bytes); the PAT-authenticated daemon (200/6h) makes the residual warm-deploy manifest check
free of rate-limit pressure. So keeping the store recovers ~all the benefit a cache would give.
Decision (registry pull-through cache): DROPPED here, deferred to IDEAS / Phase 2b (operator
scope correction 2026-05-29, mid-phase). A registry:2 pull-through cache's distinctive wins —
multi-node fan-out, surviving prune/VM-rebuild on separate storage, cache-miss authentication —
don't apply to a single authenticated non-pruning host (one node; co-located cache lost on a
recreate anyway; daemon already authenticated). It would add a registry service + daemon-mirror
config + cache GC for marginal gain. Revisit ONLY if (a) cc-ci goes multi-node, OR (b) Phase-2b
measurement shows cold-deploy pull time is a real bottleneck AND the cache can live on
recreate-surviving storage (Incus volume / host b1 path, not the VM's ephemeral disk). No registry
code was written (caught during orientation) — nothing to revert.
2026-05-29 — Real-abra-only deploys; abra convergence by default; READY_PROBE only when abra doesn't fit (operator principle; plan.md §9)
Decision (operator, 2026-05-29). CI deploys/upgrades MUST use real abra commands — never
docker service update/docker service scale to surgically patch a stack into health (that would
test a non-abra path and can mask a broken deploy). Prefer abra's own convergence checks by
default. Only skip abra's convergence monitor (abra app deploy -c/--no-converge-checks) and
substitute a harness READY_PROBE when abra genuinely does not fit — e.g. its convergence window
is too short for a heavy app and it fails with "FATA deploy failed" on a deploy that DOES converge given time.
When you do skip abra convergence, the rules are:
- The deploy stays real abra (abra app deploy [-C] -c); only abra's waiting is replaced, not the deploy mechanism. docker stack deploy still applies the real spec.
- The harness replacement MUST be a genuinely STRICT readiness test: all swarm services N/N (lifecycle.wait_healthy → services_converged) + a real app-level check (the app HEALTH_PATH AND any recipe READY_PROBE, a live HTTP assertion on a real endpoint), bounded by a generous but finite deadline (recipe DEPLOY_TIMEOUT).
- It MUST RAISE on actual non-readiness, never a no-op that lets a failed deploy pass. Prove it has teeth with a negative test.
Applied: F2-12 lasuite-drive upgrade tier. abra's converge monitor FATA'd while the upgraded
collabora 25.04.9.4.1 healthcheck was still in start_period (jail/config init), though it
converges via swarm's healthcheck retries. Fix (e1147b5): upgrade chaos redeploy uses abra … -c;
generic.perform_upgrade then owns lifecycle.wait_healthy (services N/N + app HEALTH_PATH) +
lifecycle.wait_ready_probes (recipe READY_PROBE → collabora WOPI /hosting/discovery 200),
bounded by DEPLOY_TIMEOUT. Teeth proven by tests/unit/test_f212_upgrade_convergence.py (6506c4a,
5 P7-negative tests: the wait RAISES TimeoutError on stuck/never-serving convergence). The lone
docker service scale …minio-createbuckets is NOT a bypass — it triggers the recipe's own
replicas:0 one-shot (Adversary-confirmed). The Adversary still owns confirming "not a weakening" at
the Q3.2 cold-verify.
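The "strict, bounded, raises" shape can be sketched generically. This is not lifecycle.wait_healthy or wait_ready_probes; wait_until_ready and its checks mapping are hypothetical stand-ins for the services-N/N, HEALTH_PATH and READY_PROBE checks.

```python
import time


def wait_until_ready(checks: dict, deadline_s: float, interval_s: float = 1.0,
                     clock=time.monotonic, sleep=time.sleep) -> None:
    """Poll every check until all pass; raise TimeoutError at the deadline.

    checks maps a label to a zero-argument callable returning True when ready.
    """
    start = clock()
    while True:
        failing = [name for name, check in checks.items() if not check()]
        if not failing:
            return
        if clock() - start >= deadline_s:
            # Never return silently on non-readiness: the caller must see a failure.
            raise TimeoutError(f"not ready after {deadline_s}s: {', '.join(failing)}")
        sleep(interval_s)
```

A negative test then only needs a check that never passes and a short deadline, asserting that TimeoutError is raised and names the stuck check.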
2026-05-29 — READY_PROBE / abra -c policy (operator principle; Adversary-recorded)
Decision (operator, plan.md §9): deploys/upgrades use REAL abra commands — never docker service update/scale to mutate app state. PREFER abra's own convergence checks by default. Only
skip abra convergence (-c/--no-converge-checks) + use a harness READY_PROBE when abra genuinely
does not fit — e.g. its window is too short for a heavy app and it FATAs on a deploy that DOES converge
(F2-12: lasuite-drive new collabora 25.04.9.4.1 in healthcheck start_period). When skipping:
- the deploy stays real abra (only abra's waiting is replaced, not the deploy);
- the custom probe MUST be genuinely STRICT — all services N/N plus a real app-level check — and RAISE on actual non-readiness; never a no-op that masks a failed deploy;
- prove it has teeth with a negative test (cf. F2-12 P7-negative tests/unit/test_f212_upgrade_convergence.py).

Adversary status: the F2-12 lasuite-drive READY_PROBE was cold-verified non-weakening at the Q3.2 re-claim (REVIEW-2 "## Q3.2 … PASS @2026-05-29"): -c + owned wait_healthy (services N/N + HEALTH_PATH) + wait_ready_probes (collabora WOPI 200) all RAISE on stuck convergence (5 unit tests pass + code-read); upgrade tier GREEN on the Adversary's own cold run. This is the accepted pattern for future heavy recipes: the same teeth + negative-test requirement applies each time.
2026-05-29 — R014 lightweight upstream tags → chaos-base deploy (Q3.3 lasuite-meet)
Problem. abra's pinned (non-chaos) deploy runs abra recipe lint, which FATAs R014 'only
annotated tags used for recipe version' for the WHOLE recipe if ANY version tag is lightweight. Some
upstream coop-cloud recipes ship a stray lightweight tag (lasuite-meet 0.3.0+v1.16.0). This blocked
the upgrade tier's prev-version base deploy.
Rejected approach (origin-repoint). Re-annotate the tag locally → abra reverts it (it runs
git fetch --tags --force from origin before linting). Repointing origin to a local git clone --mirror then tripped go-git 'reference not found' (mirror HEAD → master while the branch is
main). Too fragile; abandoned.
Decision (chaos-base). Detect lightweight version tags (abra.has_lightweight_version_tags,
read-only). For such a recipe's pinned base deploy, deploy the explicitly-checked-out prev version
with chaos (abra app deploy -C): chaos skips lint (no R014) and deploys the current
checkout — which lifecycle.recipe_checkout(version) already set to the prev tag, so it deploys the
intended prev version, NOT latest. (F1d-2's hazard was a missing checkout; the explicit checkout
removes it.) Verified real by the Q3.3 upgrade crossover 0.2.0+v1.15.0→0.3.0+v1.16.0. No-op /
stays pinned-non-chaos for all-annotated recipes (most). The deeper fix is upstream (annotate the tag),
out of scope here.
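One read-only way to detect lightweight tags is to inspect the object type each tag ref points at: git reports "tag" for an annotated tag and "commit" for a lightweight one. The sketch below parses the output of git for-each-ref --format='%(objecttype) %(refname:short)' refs/tags; whether abra.has_lightweight_version_tags works this way is not stated here, so treat it as an assumption.

```python
def lightweight_tags(for_each_ref_output: str) -> list:
    """Return tag names whose ref points directly at a commit (lightweight tags).

    Expects lines of the form '<objecttype> <tagname>'.
    """
    found = []
    for line in for_each_ref_output.splitlines():
        objecttype, _, name = line.strip().partition(" ")
        if objecttype == "commit" and name:
            found.append(name)
    return found


def has_lightweight_version_tags(for_each_ref_output: str) -> bool:
    """True if at least one tag is lightweight (would trip lint rule R014)."""
    return bool(lightweight_tags(for_each_ref_output))
```

Parsing captured output keeps the check free of side effects on the recipe checkout, in line with the read-only requirement above.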
2026-05-29 — lasuite-meet webrtc media-relay = env-blocker non-port (§7.1); LiveKit token issuance shipped
lasuite-meet's webrtc-media.py/webrtc-relay.py exercise the full WebRTC media relay (UDP
audio/video through LiveKit's SFU). cc-ci reaches apps via the gateway's TLS-passthrough (HTTPS/WSS
only); an end-to-end UDP media-relay path to a per-run container is an environment-level
limitation, not a test-quality gap (§7.1 env-blocker exception). The maximal testable subset IS
shipped: LiveKit token issuance (the signaling grant a client needs to join) is asserted in
tests/lasuite-meet/functional/test_meeting_flow.py (create room → JWT token granting the room).
plausible: clickhouse-backup boot-download is an upstream robustness defect (2026-05-29)
Decision (settled): plausible's entrypoint.clickhouse.sh downloads a 22MB clickhouse-backup
tarball from GitHub at every container start with set -e and no retry/cache, BEFORE exec'ing
clickhouse-server. A failed download exits 1 (clickhouse never starts) → swarm crash-loops →
re-downloads 22MB each restart → triggers GitHub secondary rate-limiting → sustained crash-loop →
abra app deploy times out. The deploy converges only when GitHub answers the first wget (normal
single-deploy CI), but is not robust to a transient first-wget failure, and back-to-back heavy testing
exhausts the host IP's GitHub budget and induces the spiral.
This is an UPSTREAM RECIPE defect, not a cc-ci test/harness defect (same class as Q3.2b lasuite-drive collabora and immich's missing pg_dump hook). The cc-ci test content for plausible (event-tracking §4.3 tests, lifecycle overlays, /api/health readiness probe) is correct and the §4.3 functional tests are proven green.
The durable fix is a recipe PR hardening the clickhouse entrypoint: cache the binary on the
persistent /var/lib/clickhouse volume (skip-if-present so restarts don't re-download → no
amplification), retry-with-backoff, and set +e so a download failure never blocks clickhouse-server
start (DB must come up; backup degrades gracefully). IMPORTANT CONSTRAINT: the cc-ci install tier
deploys the PREVIOUS PUBLISHED version (recipe_checkout to the tag), so a recipe PR only fixes the
upgrade tier and FUTURE installs once released — it does NOT make this gate's install tier converge
under an active throttle. Therefore the gate's full-lifecycle green still depends on GitHub answering
the install-tier deploy's first download (achievable after a rate-limit cooldown with a single clean
run). The recipe PR is filed as a deferred robustness follow-up (Q4.7b), mirroring the Q3.2b/immich
pattern; Adversary/operator weigh whether it gates Phase-2 DONE.
mumble: TCP/voice recipe enrollment — mumbleweb HTTP readiness + host-ports + CHAOS_BASE_DEPLOY (2026-05-29)
Decision (settled): mumble is a non-HTTP TLS voice server (port 64738). To enroll it in cc-ci's
HTTP-readiness + on-host (cc-ci-run) test model, deploy it with the two upstream overlays
COMPOSE_FILE=compose.yml:compose.mumbleweb.yml:compose.host-ports.yml (recipe_meta.EXTRA_ENV):
- compose.mumbleweb.yml — the upstream mumble-web HTTP client, routed through Traefik on the app domain. Gives the generic harness a real HTTP serving/readiness signal (HEALTH_PATH "/") AND the web_client.py parity surface. Present in every published mumble version.
- compose.host-ports.yml — publishes 64738 (tcp+udp, mode:host) on the cc-ci host, so the on-host
protocol tests connect to 127.0.0.1:64738. The voice server has NO HTTP API and cc-ci's Traefik only
exposes 80/443 (no mumble TCP entrypoint; the gateway forwards 443 only, out of our control), so a host-published port is the reachable path. The proxy overlay is attachable (an ephemeral-container path was considered) but host-ports is simpler and needs no extra image.
Two enrollment hazards + their fixes:
- The upstream compose.host-ports.yml exists only from version 1.0.0+, but the upgrade tier's base deploy is the previous published version (0.2.0+), which predates it → COMPOSE_FILE fails to resolve on the base deploy. Fix: tests/mumble/install_steps.sh provides a cc-ci-owned identical compose.host-ports.yml to the recipe checkout when absent (no-op when the version ships it natively).
- That provided file is UNTRACKED in the older checkout → abra's PINNED base-deploy clean-tree check FATAs ("has locally unstaged changes"). Fix: new recipe_meta flag CHAOS_BASE_DEPLOY=True + harness support (lifecycle._recipe_meta_flag + a deploy_app branch) → the base deploy uses chaos (skips lint/clean-tree, deploys the EXPLICITLY-checked-out pinned version, not LATEST), mirroring the existing lightweight-tag chaos-base mechanism. HC1/deploy-count unaffected (upgrade still chaos-redeploys to PR-head; base chaos-version=prev-commit != head → real crossover).
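The two fixes can be illustrated with a small sketch. It is written in Python for consistency with the harness, although the actual provisioning step lives in a shell script (tests/mumble/install_steps.sh). The function names `ensure_overlay` and `base_deploy_mode`, their signatures, and the "chaos"/"pinned" return values are assumptions made for illustration and are not the harness's real API.

```python
import shutil
from pathlib import Path


def ensure_overlay(
    checkout: Path,
    provided: Path,
    name: str = "compose.host-ports.yml",
) -> bool:
    """Copy the cc-ci-owned overlay into the recipe checkout if it is absent.

    Returns True when a copy was made, False when the checked-out version
    already ships the file (the no-op case).
    """
    dest = checkout / name
    if dest.exists():
        return False
    shutil.copyfile(provided, dest)
    return True


def base_deploy_mode(recipe_meta: dict) -> str:
    """Pick the base-deploy mode from a recipe_meta-style mapping.

    Hypothetical helper: a copied overlay is untracked in the older
    checkout, so a recipe that needs one opts in to a chaos base deploy,
    which skips the clean-tree check.
    """
    return "chaos" if recipe_meta.get("CHAOS_BASE_DEPLOY") is True else "pinned"
```

The copy is conditional so that versions shipping the overlay natively are deployed with the upstream file untouched.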
P4 (backup data-integrity): mumble persists server state in /data/mumble-server.sqlite (the exact
file the recipe's backupbot hooks .backup/restore). ops.py seeds a ci_marker row there (using
PRAGMA busy_timeout to wait out the running murmur server's transient sqlite locks), then backs up,
drops the marker, restores, and asserts the row survived.
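The seed / backup / drop / restore / assert cycle can be sketched locally with Python's sqlite3 module. This is a stand-in, not the contents of ops.py: the real test goes through the recipe's backupbot hooks, whereas here `Connection.backup` and a file copy play the roles of backup and restore. The table layout, marker value, and all function names are assumptions.

```python
import shutil
import sqlite3
from pathlib import Path

MARKER = "cc-ci-p4"  # hypothetical marker value


def connect(db: Path, busy_ms: int = 5000) -> sqlite3.Connection:
    """Open the database and wait up to busy_ms on locks held by another writer."""
    conn = sqlite3.connect(db)
    conn.execute(f"PRAGMA busy_timeout = {int(busy_ms)}")
    return conn


def seed_marker(db: Path, value: str = MARKER) -> None:
    conn = connect(db)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS ci_marker (value TEXT NOT NULL)")
        conn.execute("INSERT INTO ci_marker (value) VALUES (?)", (value,))
        conn.commit()
    finally:
        conn.close()


def drop_marker(db: Path) -> None:
    """Simulate data loss by removing the marker table."""
    conn = connect(db)
    try:
        conn.execute("DROP TABLE IF EXISTS ci_marker")
        conn.commit()
    finally:
        conn.close()


def has_marker(db: Path, value: str = MARKER) -> bool:
    conn = connect(db)
    try:
        table = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'ci_marker'"
        ).fetchone()
        if table is None:
            return False
        row = conn.execute(
            "SELECT 1 FROM ci_marker WHERE value = ?", (value,)
        ).fetchone()
        return row is not None
    finally:
        conn.close()


def backup_db(db: Path, dest: Path) -> None:
    """Take a consistent copy using SQLite's online backup API."""
    src = connect(db)
    dst = sqlite3.connect(dest)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()


def restore_db(backup: Path, db: Path) -> None:
    """Replace the live file with the backup copy (no connection may be open)."""
    shutil.copyfile(backup, db)
```

The busy timeout matters only against a live server holding short write locks; with no concurrent writer it has no visible effect.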
mumble (cont.): recipe_checkout uses git checkout -f + backup labels are 1.0.0+ only (2026-05-29)
Two follow-on fixes from the first full mumble run:
- abra.recipe_checkout now force-checks-out (-f): the upgrade tier's git checkout <head_ref> aborted ("untracked working tree files would be overwritten") because install_steps left an UNTRACKED compose.host-ports.yml in the 0.2.0 base checkout that collides with the same path TRACKED in head_ref (1.0.0+). The version-pinning checkout must yield the exact ref tree; force is correct and robust to cc-ci-provided overlays. (General harness fix; benefits any recipe with a cc-ci overlay.)
- mumble's backupbot labels exist only from version 1.0.0+ (0.2.0 has none; "Backups introduced" was the 1.0.0 release). So backup/restore are only meaningful AFTER the upgrade tier moves the app to head_ref (1.0.0+). With the upgrade fixed, backup/restore run against the backup-aware version and P4 (sqlite ci_marker survival) holds. The base (0.2.0) backup-unaware state is expected, not a defect.
immich postgres backup recipe-PR (Phase 2 Q3.5 P4) — 2026-05-30
Decision: fix immich's P4 data-integrity gap with a recipe-PR (recipe-maintainers/immich#1),
not a §7.1 P4-N/A sign-off. The published immich recipe backs up NO database: backupbot.backup
sits only on the app service (whose sole data volume uploads is excluded), and the
database/postgres service had no backup label or pg_dump hook — so restoring a backup yields an
empty DB (total user-metadata loss). immich is the D10 large-volume/data category recipe; a P4-N/A
on its data path would be hollow (unlike mailu's mail-relay N/A). cc-ci exists to catch exactly this
class of bug, and the recipe mirror+PR flow (plan §0b/§4.1) is the sanctioned mechanism.
Fix shape (matrix-synapse convention): database-service deploy.labels
backupbot.backup.pre-hook=/pg_backup.sh backup + backupbot.backup.volumes.postgres.path=backup.sql +
backupbot.restore.post-hook=/pg_backup.sh restore; a configs:-mounted pg_backup.sh (top-level configs.pg_backup, versioned via abra.sh PG_BACKUP_VERSION=v1). The script: backup = pg_dump | gzip > /var/lib/postgresql/data/backup.sql; restore = terminate immich-server connections + DROP DATABASE … WITH (FORCE) + createdb + reimport (the matrix pg_hba "local trust" trick does NOT cover immich-server's networked connections, so FORCE-drop is required). The VectorChord/pgvecto.rs extensions (vchord, vector) + all tables round-trip cleanly: validated live, then proven green end-to-end via RECIPE=immich PR=1 (restore tier test_restore_returns_state PASS).