DECISIONS — cc-ci Builder
Architecture decisions and dead-ends. One line of rationale each. (§0, §8)
Settled
- Wildcard TLS: operator pre-issues wildcard cert at /var/lib/ci-certs/live/; Traefik file provider serves it; no ACME for commoninternet.net. (Plan §4.0/§8 — fixed.)
- Repo: git.autonomic.zone/recipe-maintainers/cc-ci, private. Bot is org admin. (Bootstrap.)
- Git credentials: helper script in repo-local git config sources /srv/cc-ci/.testenv at call time — no secret values stored in .git/config or commits.
- Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26, overrides plan §3 modules/traefik.nix). Instead of a hand-rolled Traefik we deploy the canonical Co-op Cloud traefik recipe via abra in wildcard / file-provider mode, for end-to-end fidelity (canonical web/web-secure entrypoints + proxy/swarm conventions every recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO DNS token on the box: WILDCARDS_ENABLED=1 + append compose.wildcard.yml; the pre-issued cert is fed as the ssl_cert/ssl_key swarm secrets (v1) via abra app secret insert … -f from /var/lib/ci-certs/live/{fullchain,privkey}.pem. The file provider serves it (tls.certificates). LETS_ENCRYPT_ENV= empty on the traefik app and on every test app → the recipe's tls.certresolver=${LETS_ENCRYPT_ENV} label resolves to no resolver → routers serve the wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
  - Reproducibility (D8): scripts/deploy-proxy.sh is idempotent (ensures local abra server, fetches recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in docs/install.md. The custom modules/traefik.nix was removed; modules/swarm.nix keeps swarm init + proxy net + firewall 80/443.
  - Renewal (manual, ~90d): operator re-issues the wildcard at the same paths, then abra app secret rm traefik.ci.commoninternet.net ssl_cert -n + re-insert at a new version (bump SECRET_WILDCARD_CERT_VERSION) and redeploy. (Documented in docs/secrets.md at M7.)
  - abra teardown syntax (for harness, §4.3): abra app undeploy <d> -n, abra app volume remove <d> -f -n, abra app secret remove <d> --all -n. None take --chaos.
- Infra bring-up = idempotent-reconcile systemd oneshots — SETTLED (M2, orchestrator steer 2026-05-26). Every piece of swarm infra that abra deploys (traefik modules/proxy.nix, Drone modules/drone.nix, later comment-bridge + dashboard) is a systemd.services.<x> with Type=oneshot + RemainAfterExit, after/requires swarm-init + docker, wants network-online, wantedBy multi-user, embedding its script via pkgs.writeShellApplication (self-contained in the store, not a /root/cc-ci path). The script reconciles (inspect → converge → no-op if correct) on every activation/boot — no run-once sentinel — so it self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit) on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to git clone + nixos-rebuild switch + operator preconditions, no manual post-steps. The old scripts/deploy-*.sh were folded into these modules and removed. pkgs.abra is provided via an overlay (modules/packages.nix) so all modules share the one pinned build.
  - Cert rotation note: the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the wildcard means bumping SECRET_WILDCARD_*_VERSION (operator) so the next reconcile re-inserts. Documented in docs/secrets.md at M7.
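The inspect → converge → no-op loop of the reconcile oneshots can be modelled in a few lines. This is an illustrative Python sketch only: the real scripts are shell embedded via pkgs.writeShellApplication, and the Step type and the dictionary standing in for swarm state are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One reconcilable piece of state (hypothetical model, not cc-ci code)."""
    name: str
    is_correct: Callable[[], bool]   # inspect: is the live state already right?
    converge: Callable[[], None]     # action that makes it right


def reconcile(steps: List[Step]) -> List[str]:
    """Run every step; return the names of the steps that had to act.

    There is no run-once sentinel: each call re-inspects live state, so a
    second call on a healthy system returns an empty list (a no-op).
    """
    changed = []
    for step in steps:
        if step.is_correct():
            continue                 # already converged, do nothing
        step.converge()
        if not step.is_correct():
            # Fail visibly, as a failed systemd unit would.
            raise RuntimeError(f"{step.name}: still not converged")
        changed.append(step.name)
    return changed


# Toy state standing in for the swarm: the secret exists, the stack is gone.
state = {"secret": True, "stack": False}
steps = [
    Step("ssl_cert secret", lambda: state["secret"],
         lambda: state.update(secret=True)),
    Step("traefik stack", lambda: state["stack"],
         lambda: state.update(stack=True)),
]
first_run = reconcile(steps)    # redeploys only the missing stack
second_run = reconcile(steps)   # nothing left to do
```

Because every activation starts from inspection rather than from a marker file, drift introduced between runs (for example a removed secret) is picked up on the next call.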
- Trigger: POLLING primary, webhook optional — SETTLED (orchestrator design change 2026-05-27, supersedes the earlier "keep webhook, do NOT pivot to polling" steer). Hard constraint: the bot/server runs at READ level, never repo-admin, and never self-registers a webhook.
  - Polling is PRIMARY and the source of truth for D1. The bridge polls each enrolled repo's open PRs for new !testme comments every POLL_INTERVAL (30s ≤ 60s). Outbound (cc-ci → git.autonomic.zone, the reliably-working direction), needs only read+comment. On startup the first poll marks pre-existing comments seen so it doesn't fire on old comments.
  - Webhook is an OPTIONAL push optimization. The /hook endpoint stays live (HMAC-verified) so an admin-registered issue_comment webhook lowers latency, but the bridge never registers one. Manual registration is documented in docs/enroll-recipe.md. Both paths share an in-memory seen-set keyed by comment id → a comment seen by both fires at most once.
  - Commenter authorization via org membership (read-level, no admin). Allowed iff GET /orgs/{owner}/members/{user} → 204 (verified 2026-05-27: admits bot/trav/notplants, 404 for a non-member, works with bot read-level basic-auth) or the user is in the optional AUTH_ALLOWLIST. Replaces the earlier /collaborators/{user}/permission check, which needs repo-admin. Fail-closed on any error.
  - Enrollment = add the repo to the bridge POLL_REPOS csv + ensure tests/<recipe>/ exists. No webhook required for CI to work. (Why root cause of the old webhook non-delivery doesn't matter: polling makes it irrelevant; the operator was whitelisting ci.commoninternet.net in Gitea's ALLOWED_HOST_LIST, but D1 no longer depends on that.)
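A minimal sketch of the two trigger rules above: the shared seen-set (with the first poll only priming it) and the fail-closed membership check. The class and function names are hypothetical, and member_status stands in for whatever wraps GET /orgs/{owner}/members/{user}; none of this is the bridge's actual code.

```python
from typing import AbstractSet, Callable, Iterable, List, Set


class SeenSet:
    """In-memory dedup keyed by comment id, shared by poll and webhook paths."""

    def __init__(self) -> None:
        self._seen: Set[int] = set()
        self._primed = False

    def poll(self, comment_ids: Iterable[int]) -> List[int]:
        """Return ids that should fire; the first poll only marks ids as seen."""
        ids = list(comment_ids)
        if not self._primed:
            self._seen.update(ids)      # pre-existing comments never fire
            self._primed = True
            return []
        return [cid for cid in ids if self.claim(cid)]

    def claim(self, comment_id: int) -> bool:
        """True exactly once per comment id (the webhook path calls this too)."""
        if comment_id in self._seen:
            return False
        self._seen.add(comment_id)
        return True


def is_authorized(user: str, member_status: Callable[[str], int],
                  allowlist: AbstractSet[str] = frozenset()) -> bool:
    """Allowed iff the membership check returns 204 or the user is allowlisted.

    member_status is a hypothetical callable returning the HTTP status of the
    org-membership request; any exception means deny (fail-closed).
    """
    if user in allowlist:
        return True
    try:
        return member_status(user) == 204
    except Exception:
        return False
```

Routing both paths through claim() is what gives the "fires at most once" property: whichever path sees a comment first wins, and the other gets False.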
- Resource safety: bound live test apps — SETTLED (orchestrator design change 2026-05-27, plan §4.2/§4.3). Do NOT keep multiple test apps deployed at once. Three layers, all configurable:
  - MAX_TESTS = DRONE_RUNNER_CAPACITY = 1 (modules/drone-runner.nix, maxTests let-binding). Drone runs at most MAX_TESTS builds at once and auto-queues the rest in its native pending queue — no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
  - Per-build TIMEOUT = 60 min (modules/drone.nix, buildTimeoutMinutes; reconciled best-effort via PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60} using the bridge's Drone admin token, local --resolve, non-fatal). A build over the limit is cancelled by Drone → the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue once a test finishes OR times out".
  - Teardown + janitor backstop. Each build deploys → runs the 3 stages → undeploys (guaranteed try/finally in conftest/orchestrator). A SIGKILL'd/timed-out build can't run its own teardown, so the run-start janitor (lifecycle.janitor, called before every deploy in both fixtures + run_recipe_ci) reaps orphaned run apps as the backstop. At capacity=1 the CI path will set CCCI_JANITOR_MAX_AGE=0 (reap any orphan immediately — safe with no concurrent runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default 2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
  - Optional concurrency: {limit: 1} in the recipe-CI .drone.yml is a redundant belt — primary mechanism is DRONE_RUNNER_CAPACITY. (Wired when the recipe-CI pipeline lands — see backlog.)
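The janitor's age rule is the part that changes meaning with runner capacity, so a small model may help. The sketch below is hypothetical (RunApp and orphans_to_reap are illustrative names, not lifecycle.janitor's interface); only the 2h default and the max-age-0 behaviour come from the text above.

```python
from dataclasses import dataclass
from typing import List

DEFAULT_MAX_AGE = 2 * 60 * 60  # seconds; the age-based default of 2h


@dataclass
class RunApp:
    domain: str
    started_at: float  # epoch seconds when the run app was created


def orphans_to_reap(apps: List[RunApp], now: float,
                    max_age: float = DEFAULT_MAX_AGE) -> List[str]:
    """Domains of leftover run apps old enough to be treated as orphans.

    max_age=0 reaps every leftover app immediately, which is only safe when
    no other run can be live (capacity = 1). With capacity > 1 keep an age
    threshold so a concurrent live run is not reaped.
    """
    return [app.domain for app in apps if now - app.started_at >= max_age]
```

With the default threshold a young app survives the janitor pass; with max_age=0 everything found at run start is reaped, which matches the capacity=1 setting of CCCI_JANITOR_MAX_AGE=0.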
- D10 recipe #6: bluesky-pds (TLS-passthrough) SWAPPED → n8n — SETTLED (2026-05-27, plan §4.0 sanctions this swap-with-reason). bluesky-pds routes via a Traefik TCP router with tls.passthrough=true to an in-container caddy that terminates TLS itself and obtains its own cert via ACME. cc-ci's design is the opposite: the operator gateway passes wildcard TLS through to cc-ci's Traefik, which terminates it with the pre-issued static wildcard cert, and ACME is hard-forbidden for commoninternet.net (no DNS token on the box — §4.0/§9). Serving bluesky-pds would require either (a) ACME inside caddy (forbidden), or (b) injecting the wildcard cert into caddy + a per-host TCP-passthrough router on cc-ci Traefik (recipe-internal surgery + a bespoke proxy mode — not a clean shared-harness absorb). This is a genuine design conflict, not a harness gap. Per the plan's explicit allowance, bluesky-pds is a documented non-CI'd recipe (reason here), and n8n takes the 6th slot. The 5 required D10 categories are already covered by recipes 1–5 (simple=custom-html, single-DB+SSO=keycloak, stateful/no-DB=cryptpad, DB+media/large-volume=matrix-synapse, multi-service+S3/object-storage=lasuite-docs); n8n adds a 6th real deployable app (workflow automation) behind the normal terminate-at-Traefik path.
- Docker Hub rate limit + mid-breadth prune — FINDING (2026-05-27). D10 real-!testme breadth runs exhausted Docker Hub's anonymous pull rate limit (lasuite-docs, 9 images, upgrade stage: toomanyrequests). Two lessons: (1) registry pull creds are an A1 operator input needed for reliable heavy-recipe deploys under load (request + sops-store + wire into docker daemon). (2) Don't docker image prune -af mid-breadth — it evicts cached recipe images and forces re-pulls that hit the limit. The first lasuite failure was disk pressure (90% full); pruning fixed disk but triggered re-pulls → rate limit. Better: rely on the daily autoprune, prune only dangling (not -a) between runs, or grow disk so heavy images stay cached. Net for D10: 5/6 recipes green via real !testme; lasuite-docs gated on the rate limit (transient ~hours; durable fix = creds).
Open (defaults from §8, to confirm as reality lands)
- Deploy mechanism — SETTLED (M0): nixos-rebuild switch --flake /root/cc-ci#cc-ci run on cc-ci itself, with the repo materialised on the host at /root/cc-ci. Chosen over --target-host/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS proxy (slow/fragile). Atomic rollback preserved by Nix generations (nixos-rebuild --rollback). The switch is launched as a detached transient systemd unit (systemd-run --unit=ccci-rebuild --collect) so it survives a momentary ssh-over-tailscale drop during activation. For the build loop the host copy is synced from the sandbox clone via tar | ssh (rsync absent on host); source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo on a fresh host, then nixos-rebuild switch --flake .#cc-ci).
  - nixpkgs pin: flake pins the exact rev cc-ci already ran (50ab793…) so the first rebuild is a true no-op-then-base. Bump deliberately, never drift.
- Webhook scope: default per-repo via enroll script.
- CI engine: Drone (per plan) — kept, with a noted risk. nixpkgs 24.11 has Drone server 2.24.0 but drone-runner-exec is abandoned (unstable-2020-04-19) — the only exec runner Drone ever shipped (upstream archived ~2021). The maintained fork Woodpecker (2.7.3, with NixOS modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific (D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern Drone server (RPC protocol stable). Fallback: if the exec runner proves incompatible/broken, pivot to Woodpecker (coop-cloud ships a woodpecker recipe too) and record it — like the traefik pivot. Re-evaluate at the M2 gate.
- Drone deployment shape — SETTLED (M2): mirror the traefik pattern. The server is the coop-cloud drone recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by traefik at drone.ci.commoninternet.net, LETS_ENCRYPT_ENV empty → wildcard cert, no ACME), with Gitea SSO (compose.gitea.yml). The exec runner runs as a Nix systemd service on the host (modules/drone-runner.nix) so it can drive host abra/swarm (plan §4.2). One generated DRONE_RPC_SECRET is shared: inserted as the server's rpc_secret swarm secret AND read by the runner from sops. Reproducible deploy: scripts/deploy-drone.sh.
  - Gitea OAuth app cc-ci-drone created under the bot (client_id ab4cdb9d-ee96-4867-875f-87384505fc52, redirect https://drone.ci.commoninternet.net/login); client_secret + rpc_secret stored sops-encrypted in secrets/secrets.yaml (A2 internal secrets).
- Drone runner type: exec (must drive host abra).
- Secret tool — SETTLED (M0): sops-nix. cc-ci decrypts at activation using its ed25519 SSH host key as the age identity (sops.age.sshKeyPaths), so no extra key file to manage on the box. Recipients in /.sops.yaml: the host age key (age1h90ut…, from ssh-to-age) + an off-box master recovery key (age1cmk26t…; private half only at /srv/cc-ci/.sops/master-age.txt on the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing plaintext into secrets/<f>.yaml then sops -e -i (run inside the repo so .sops.yaml is found).
- D10 recipe set: lock six early. Candidates favouring already-mirrored: custom-html (simple), cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3), bluesky-pds (TLS-passthrough) — covers all five categories. Confirm during M4–M6.5.
- Per-run app domain scheme — adapted (M4, deviates from plan §4.0). Plan §4.0 wanted <recipe>-pr<n>-<short-sha>.ci.commoninternet.net, but Docker swarm config/secret names (<stackname>_<resource>_<version>) must be ≤ 64 chars and abra derives <stackname> from the domain (dots→_, hyphens kept). .ci.commoninternet.net alone is 22 chars, so long recipe names + config names overflow 64 (hit with custom-html-pr0-m4demo…_nginx_default_conf_v6 = 66). New scheme: <recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net (e.g. cust-e084bd) — short, unique per run, collision-safe across recipes (full recipe in the hash). Human-readable recipe/PR/ref context lives in the Drone build params + the PR comment, not the (ephemeral) domain.
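The length arithmetic behind the new scheme can be checked with a short sketch. The function names are hypothetical, and SHA-256 is an assumption: the decision only specifies "6hex" over recipe|pr|ref, not the hash function, so the label produced here will not necessarily match the cust-e084bd example.

```python
import hashlib

BASE = ".ci.commoninternet.net"   # 22 characters, as noted above
SWARM_NAME_LIMIT = 64             # Docker swarm config/secret name limit


def run_domain(recipe: str, pr: str, ref: str) -> str:
    """<recipe[:4]>-<6hex(recipe|pr|ref)> under the CI base domain.

    SHA-256 is an assumption; the source only says "6hex".
    """
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}{BASE}"


def stack_name(domain: str) -> str:
    """Stack name as described above: dots become underscores, hyphens stay."""
    return domain.replace(".", "_")


def swarm_object_name(domain: str, resource: str, version: str) -> str:
    """<stackname>_<resource>_<version>, the name swarm has to accept."""
    return f"{stack_name(domain)}_{resource}_{version}"


domain = run_domain("custom-html", "0", "m4demo")
name = swarm_object_name(domain, "nginx_default_conf", "v6")
fits = len(name) <= SWARM_NAME_LIMIT
```

The label is always 11 characters (4 + 1 + 6), so the stack name is fixed at 33 characters and the resource and version suffixes get the remaining budget regardless of how long the recipe name is.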
- abra recipe checkout is volatile — harness uses chaos+offline + a tests/ snapshot (M6). Many abra commands (app ls, secret generate without flags, version resolution) silently git checkout <version-tag> in ~/.abra/recipes/<recipe>, discarding a PR branch's files. To test the PR head code (not a re-resolved tag): (1) fetch_recipe clones the mirror branch/ref (private → bot token via per-command http.extraHeader, never persisted/logged); (2) all harness abra calls that touch the recipe pass -C (chaos: use current checkout) -o (offline: no remote fetch); (3) recipe-shipped tests/ (D4) are snapshotted to a temp dir right after fetch, since later abra commands still reset the checkout — the recipe-local stage runs from the snapshot.
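Step (3), the tests/ snapshot, could look roughly like this. It is a sketch under assumptions: snapshot_tests is a hypothetical name, and the temp-dir prefix and cleanup responsibility are choices made for the example rather than the harness's documented behaviour.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def snapshot_tests(recipe_checkout: Path) -> Optional[Path]:
    """Copy <checkout>/tests to a fresh temp dir and return the copy's path.

    Returns None when the recipe ships no tests/ directory. The caller is
    responsible for removing the temp dir at teardown.
    """
    src = recipe_checkout / "tests"
    if not src.is_dir():
        return None
    dest = Path(tempfile.mkdtemp(prefix="ccci-tests-")) / "tests"
    shutil.copytree(src, dest)
    return dest
```

Taking the copy immediately after the fetch means a later checkout reset by abra cannot change what the recipe-local stage runs.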
Risks
- Disk — RESOLVED 2026-05-26. Original 8.9 GiB root had only ~3.8 GiB free and a hard
inode ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on
inodes before bytes. Operator grew the VM to 28 GiB (22 GiB free, 1.78M inodes / 1.21M free);
the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown +
periodic docker image prune to avoid regressing during M6.5 breadth.
Dead-ends
- (none yet)
Phase 1c (full reproducibility + genuine D8 live rebuild) — 2026-05-27
- Secrets linkage = git SUBMODULE (deviates from plan §7 flake-input default). cc-ci-secrets is mounted as a submodule at cc-ci/secrets/ rather than a flake inputs.secrets. Rationale: a private flake input must be re-fetched at every nix eval, requiring the bot token persistently in nix config/netrc on cc-ci AND the throwaway VM (a token in the store/config = a 2nd out-of-band secret, which 1c forbids). A submodule makes secrets/secrets.yaml a plain path in the working tree → defaultSopsFile = ../secrets/secrets.yaml is unchanged (minimal diff, trivially byte-identical), and the only credential use is the one git clone --recursive at provisioning ("the two repos are given", Mission §1). Build invocation becomes nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci' so the submodule tree is included. (Revisit if ?submodules=1 proves unreliable on cc-ci's nix version.)
- Bootstrap key for the throwaway VM = the existing RECOVERY (master) age key, via sops.age.keyFile. The recovery key (age1cmk26…, private at /srv/cc-ci/.sops/master-age.txt) is already a sops recipient, so a fresh host with a different ssh host key still decrypts every secret with no re-keying — this is exactly the §0 argument that defeats "host-key binding". Provisioned to the VM at a fixed path (the ONE out-of-band secret). cc-ci itself keeps decrypting via its host key (age.sshKeyPaths); secrets.nix will offer both identity sources. (Per-host re-encrypt is cleaner for a permanent new instance — documented as the alternative, not used for the throwaway test.)
- Cert into git: wildcard cert+key become sops secrets in cc-ci-secrets, decrypted at activation back to /var/lib/ci-certs/live/{fullchain.pem,privkey.pem} via sops.secrets.<name>.path; proxy.nix keeps reading that path (now sops-sourced, not operator-drop).
- cc-nix-test final sizing (C6) — SETTLED by operator 2026-05-27: PROMOTE the rebuilt VM. The freshly-rebuilt reproducible VM (the FINAL W5/C4-C5 clean-room throwaway) becomes the canonical cc-nix-test; the operator will repurpose it for a live real-traffic test through the public gateway.
- C6 teardown OVERRIDE (operator, 2026-05-27): do NOT destroy the FINAL throwaway VM after W5/C4-C5 PASSes — keep it RUNNING; defer its C6 teardown until the operator explicitly says otherwise. This overrides the plan §5/§6 "destroy the throwaway" for that one VM only. All other cleanup proceeds normally (the Builder's first throwaway was already destroyed; RAM accounting holds).
Phase 1b — lint/format tooling (open decisions §6, settled W0)
- Formatters/linters (RL1): Nix = nixpkgs-fmt (format) + statix (lints) + deadnix (dead code); Python = ruff (lint + format); Shell = shellcheck + shfmt -i 2 -ci; YAML = yamllint. Kept nixpkgs-fmt over alejandra because it was already the repo formatter and devshell tool (no extra churn / restyle of every .nix). All built from the already-pinned nixpkgs via a flake lint devshell (nix develop .#lint) so CI and local use byte-identical tool versions.
- Lint entrypoint: scripts/lint.sh (check-only by default; --fix auto-applies). The .drone.yml push pipeline runs it via nix develop .#lint --command bash scripts/lint.sh.
- ruff strictness: select = [E,F,W,I,UP,B,C4,SIM], ignore = [E501] (line length is the formatter's job; only un-splittable strings would trip it). line-length=100, target=py311.
- Drone lint stage = FAIL (not warn). The codebase is green now, so enforce from here on — an unclean commit fails the lint step. (Resolves the §6 open question.)
- Python type-checking (mypy/pyright): DEFERRED to IDEAS, not added in 1b. The harness is small and dynamically typed around abra/subprocess JSON; gradual typing is a larger effort than this bounded pass warrants. Revisit if Phase 2's 18-recipe ramp shows type bugs.
- blocking vs advisory split (§3): treated as in the phase plan — tests-real, Nix-idempotent, no-footguns, no-secrets, log-redaction, harness-DRY = blocking; readability/docs/arch-drift = advisory unless a real plan deviation. Recorded per-finding in REVIEW-1b / BACKLOG-1b.
- cc-ci self-CI push trigger: the lint stage lives in the event: push pipeline. The Gitea→Drone push webhook on this instance is flaky (last_status: None; documented §4.1) and predates 1b — recipe CI uses polling as primary, but cc-ci's own self-test/lint relies on the push webhook. The lint stage is correctly wired and proven green via the identical nix develop .#lint command; reliably auto-firing it on every push is tracked as a (pre-existing) infra item, not a 1b lint gap.
Phase 1b — repo layout (operator review items RL5/RL6, plan §7)
- RL5 — all Nix code under nix/. Moved modules/ → nix/modules/ and hosts/ → nix/hosts/. flake.nix/flake.lock STAY at the repo root (entry point) so the build ref #cc-ci and nixos-rebuild --flake '…#cc-ci' are unchanged — only flake.nix's internal ./hosts/cc-ci/configuration.nix → ./nix/hosts/cc-ci/configuration.nix changed. Root-relative refs inside the moved modules were re-based ../X → ../../X (secrets.nix → ../../secrets/, bridge.nix → ../../bridge/, dashboard.nix → ../../dashboard/); configuration.nix's ../../modules/* imports are unchanged (both dirs moved under nix/, so the relative path still resolves). Toplevel is byte-identical (8i3jcad9…) before/after the move — store derivations are content-addressed on the copied file contents, and the module .nix files aren't part of the runtime closure, so relocating folders doesn't change the build. (The operator anticipated a hash change; in practice it's stable, which is even stronger for reproducibility.) Living docs (README, architecture/install/secrets/enroll) + the .drone.yml comment updated to nix/…; append-only history logs left as the record of what was true then.
- RL6 — protocol files → machine-docs/: DEFERRED to the coordinated end of 1b. Will git mv STATUS*/REVIEW*/JOURNAL*/BACKLOG*/DECISIONS.md into machine-docs/ (README.md STAYS at root — operator decision, it's the human readme, not a protocol file). The live watchdog (launch.sh) reads STATUS-<id>.md/REVIEW-<id>.md at the repo root for handoffs/transition, so this is done LAST, in lockstep with the orchestrator updating launch.sh + restarting the watchdog — not unilaterally and not while a phase transition is pending. The Adversary likewise git mvs its own REVIEW files at the cutover (single-writer rule).
Phase 1b — recorded deviation: no tests/_template/ dir (enroll = copy an existing recipe)
Plan §3's repo layout lists a tests/_template/ "copy-to-add-a-recipe" dir. It was never created
(pre-1b; not introduced or removed by 1b) — instead the documented enroll flow in
docs/enroll-recipe.md is "copy an existing recipe's tree, e.g. tests/custom-html/…, then adjust
recipe_meta.py + the per-recipe test files." This satisfies D5's "small, repeatable, documented
operation with no harness surgery" the same way (a concrete recipe is a better starting template than
an abstract skeleton that can drift). Recording per the Adversary's RL3 D5 advisory; not a blocker.
Phase 1d — generic test suite + layered overlays (design, 2026-05-27)
SSOT: cc-ci-plan/plan-phase1d-generic-test-suite.md. Resolves the §6 open decisions.
- Tier model & op/assertion split (the core call). A run is a sequence of TIERS — install, upgrade, backup, restore, custom — each = generic default [overridden by a recipe overlay]. The lifecycle OP (deploy, upgrade, backup, restore) is owned by the shared harness (harness.generic helpers), NOT duplicated in every test file. A tier's test file (generic or overlay) carries the ASSERTIONS and calls the shared op helper. This keeps the op single-sourced (DRY, DG7) and makes deploy-once trivial: only the orchestrator deploys/tears-down.
- Override (not additive) — Builder's call (plan §6, operator leaned override). For each lifecycle op exactly ONE assertion file runs, by precedence: repo-local tests/test_<op>.py > cc-ci tests/<recipe>/test_<op>.py > generic (tests/_generic/test_<op>.py). A present overlay REPLACES the generic for that op. Invariant: no overlay for an op ⇒ the generic runs (so any recipe is testable with zero config). Repo-local wins same-name collisions (upstream is authoritative, plan §2.5); cc-ci's overlay is the curated fallback until upstream adopts it. Extend-by-composition: an overlay may from harness import generic and call generic.assert_serving(...) / generic.do_upgrade(...) then add its own assertions — so "extend" needs no separate mechanism.
- Custom (non-lifecycle) test_*.py: ALL discovered from BOTH locations run additively, opt-in (no override, no generic equivalent) — e.g. test_sso.py.
- Deploy ONCE, mutate in place (operator requirement, DG4.1). The orchestrator deploys the app ONCE, runs all tiers against that single live deployment (install asserts; upgrade does abra app upgrade in place; backup/restore mutate in place; custom asserts), then ONE teardown in finally. No per-tier/per-overlay abra app new/deploy/undeploy. A CCCI_DEPLOY_COUNT counter in lifecycle.deploy_app is asserted == 1 per run (DG4.1 evidence).
- Deployment-sharing scope & base version (§6 open). One deployment for the whole lifecycle. Base version deployed once = the previous published version when an upgrade tier will run and a previous exists (so upgrade goes previous→target in place), else the target (current/$REF). Recipe with only one published version ⇒ upgrade tier is a clean SKIP (nothing to upgrade from). Standalone generic-install demo (no PR) deploys current.
- Fail handling across shared tiers (§6 open): install failing (app never serves) fail-fasts the run (later tiers can't meaningfully run on a dead deployment) and they report error/skip; upgrade/backup/restore failures are recorded per-op but do not abort the remaining independent tiers where they can still run. Teardown always runs.
- Backup-capability detection (DG3, §6 open): auto — scan the recipe's compose*.yml for a backupbot.backup label (verified present in custom-html). recipe_meta.BACKUP_CAPABLE (bool) overrides the auto-detect. Not capable ⇒ backup+restore tiers are N/A (skip), not failures.
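A minimal sketch of the auto-detect with its override, for illustration. backup_capable is a hypothetical name, and the plain substring scan is an assumption made to keep the example dependency-free; the real detection may parse the YAML instead.

```python
from pathlib import Path
from typing import Optional


def backup_capable(recipe_dir: Path, override: Optional[bool] = None) -> bool:
    """True when the backup and restore tiers should run for this recipe.

    override models recipe_meta.BACKUP_CAPABLE; None means "not set".
    The scan is a substring match on compose*.yml (an assumption, no YAML
    parser is used here).
    """
    if override is not None:
        return override
    for compose in sorted(recipe_dir.glob("compose*.yml")):
        if "backupbot.backup" in compose.read_text(encoding="utf-8"):
            return True
    return False
```

A False result maps to the skip (N/A) outcome for the backup and restore tiers rather than to a failure.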
- Custom install-steps hook (DG5, §6 open): a shell hook — tests/<recipe>/install_steps.sh (cc-ci) or repo-local tests/install_steps.sh — run by the orchestrator during the install tier AFTER abra app new + env defaults but BEFORE abra app deploy, with env CCCI_APP_DOMAIN, CCCI_RECIPE, CCCI_APP_ENV (path to the app .env). Chosen over a fixture/declarative field as the simplest thing the harness runs uniformly (can abra app secret insert, set env, seed). Graceful rule: a recipe with NO hook still attempts the generic install; if it genuinely needs a step it FAILS the generic install (reported per-op) — that is correct, not a harness bug.
- Per-op result vocabulary (Phase-3 feed): pass | fail | skip(N/A) | error. The orchestrator prints a per-op summary line per run (feeds DG6 + Phase-3 level).
- Discovery layout: cc-ci overlays/custom/hook live in tests/<recipe>/; repo-local in the recipe repo's tests/ (snapshotted after fetch, per the existing volatile-checkout handling). Generic tier files live in tests/_generic/ (assertion-only, use the shared live-deployment fixtures).
Phase 1e — generic-harness corrections (HC1–HC4)
Three operator-review corrections to the Phase-1d shared harness, settled here (plan §5).
- HC2 — repo-local approval allowlist (form/location + workflow). PR-author-controlled code (install_steps.sh, repo-local test_*.py) runs on the CI host with /run/secrets/* present, so it is default-deny. Allowlist file: tests/repo-local-approved.txt (checked into the cc-ci repo, git-auditable). Format: one recipe name per line; # comments + blank lines ignored; a lone * is NOT a wildcard (no global opt-in — every recipe is explicit). Default: empty ⇒ no recipe trusts repo-local code. Discovery (resolve_op / custom_tests / install_steps) consults the repo-local source only when repo_local_approved(recipe) is true; otherwise precedence is cc-ci > generic only and repo-local is discovered-but-not-executed. Workflow: a cc-ci maintainer reviews a recipe's repo-local tests, then adds the recipe name to tests/repo-local-approved.txt in a cc-ci PR — a deliberate, reviewable act. The gate is centralized in discovery.py (one reader) so the unit tests pin it.
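The allowlist format and the gated precedence can be sketched as follows. Both function names are hypothetical (the source only names repo_local_approved and discovery.py), and two details are assumptions of the example: trailing # comments on a recipe line are stripped, and a lone * is dropped rather than kept as a literal name.

```python
from typing import List, Set


def parse_allowlist(text: str) -> Set[str]:
    """One recipe name per line; '#' comments and blank lines are ignored.

    A lone '*' is dropped rather than treated as a wildcard, so there is no
    global opt-in. Stripping trailing comments is an assumption.
    """
    names = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line and line != "*":
            names.add(line)
    return names


def sources_for(recipe: str, approved: Set[str]) -> List[str]:
    """Test sources that may be executed for this recipe, highest first.

    Repo-local is included only for approved recipes; everyone else gets
    cc-ci, then generic.
    """
    order = ["cc-ci", "generic"]
    if recipe in approved:
        order.insert(0, "repo-local")
    return order
```

An empty file yields an empty set, which gives the default-deny behaviour without any special case.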
HC3 — generic-by-default opt-out flag (name/granularity + recipe_meta). Generic assertions run additively alongside any overlay by default. Opt-out, in increasing specificity (any one skips): env
CCCI_SKIP_GENERIC(truthy ⇒ skip generic for ALL ops), envCCCI_SKIP_GENERIC_<OP>(e.g.CCCI_SKIP_GENERIC_UPGRADE⇒ skip generic for that op only), and declarativerecipe_meta.SKIP_GENERIC= a list of op names (or["all"]) so the opt-out is per-recipe and visible in git, not a hidden global. Truthy =1/true/yes/on(case-insensitive). Op-vs-assertion split: a mutating op (upgrade/backup/restore) is performed once by the orchestrator (the harness owns the op); then the generic assertion file (unless opted out) and the overlay assertion file both evaluate the shared post-op state. Op results that an assertion needs (pre-upgrade identity, backup snapshot_id) are passed op→assertions via a run-scoped JSON state file at$CCCI_OP_STATE_FILE(read byharness.generic.op_state()); never logged. Overlays that need to seed pre-op state (data-continuity markers, the backup→restore mutation) ship an optionaltests/<recipe>/ops.pywithpre_install/pre_upgrade/pre_backup/pre_restore(domain, meta)callables the orchestrator runs before the op (repo-localops.pyis allowlist-gated like other repo-local code). Overlaytest_<op>.pyfiles are now assertion-only (they no longer callgeneric.do_*). -
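The three opt-out switches combine with a simple "any one skips" rule, sketched below. skip_generic is a hypothetical name; the environment variable names, the truthy set, and the ["all"] value come from the text above, while stripping whitespace from values is an assumption.

```python
from typing import Iterable, Mapping

TRUTHY = {"1", "true", "yes", "on"}


def _truthy(value: str) -> bool:
    # Case-insensitive match against the documented truthy values.
    return value.strip().lower() in TRUTHY


def skip_generic(op: str, env: Mapping[str, str],
                 meta_skip: Iterable[str] = ()) -> bool:
    """True when the generic assertions for `op` should be skipped.

    meta_skip models recipe_meta.SKIP_GENERIC (a list of op names or
    ["all"]). Any one of the three switches is enough.
    """
    if _truthy(env.get("CCCI_SKIP_GENERIC", "")):
        return True
    if _truthy(env.get(f"CCCI_SKIP_GENERIC_{op.upper()}", "")):
        return True
    skip = {name.lower() for name in meta_skip}
    return "all" in skip or op.lower() in skip
```

Passing the environment as a mapping (os.environ in real use) keeps the function easy to pin in unit tests.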
HC1 — DG4.1 deploy-count vs the in-place chaos upgrade. The upgrade tier now upgrades to the PR head (code under test), not a published tag: deploy the previous published version (base), re-checkout the PR head (recorded as the recipe repo HEAD right after fetch, before any version-tag checkout), then
abra app deploy --chaosin place = the upgrade. The deploy-count guard countsabra app newinstalls only (_record_deploy()fires indeploy_app(), NOT in the chaos redeploy, which callsabra.deploydirectly) — so a run is still deploy-count == 1 and the legitimate in-place chaos upgrade is not flagged. Moved assertion (adapted): prev→PR-head may not bump the coop-cloud version label, soassert_upgradedaccepts ANY of: version-label change, image change, or a chaos label now present carrying the PR-head commit (a chaos deploy stampscoop-cloud.<stack>.chaos/.chaos-version) — the chaos label IS the proof PR-head was deployed. Non-PR!testme(no SRC/REF): "PR head" = the catalogue current checkout, so upgrade is prev→current — still a genuine move via chaos. (Exact chaos label name verified on the live abra during E2.)
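The "ANY of" acceptance rule can be written as one predicate. This is a sketch, not assert_upgraded itself: the Snapshot type is hypothetical, the label key follows the coop-cloud.<stack>.chaos-version pattern quoted above, and matching the stamped value against the PR head by commit prefix (short versus full hash) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Snapshot:
    """Hypothetical view of a deployed service at one point in time."""
    version_label: str
    image: str
    labels: Dict[str, str] = field(default_factory=dict)


def upgraded(before: Snapshot, after: Snapshot, stack: str, pr_head: str) -> bool:
    """Accept ANY of: version-label change, image change, chaos label with PR head.

    Prefix matching in either direction is an assumption, made so a short
    stamped hash can match a full PR-head hash and vice versa.
    """
    if after.version_label != before.version_label:
        return True
    if after.image != before.image:
        return True
    stamped = after.labels.get(f"coop-cloud.{stack}.chaos-version", "")
    return bool(stamped) and bool(pr_head) and (
        stamped.startswith(pr_head) or pr_head.startswith(stamped))
```

The third branch is what lets a prev→PR-head upgrade pass when neither the version label nor the image moved.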
Phase 2 — per-recipe test authoring (design, 2026-05-28)
Inherits the Phase 1d/1e shared-deployment + additive-overlay + op/assertion-split model. Phase 2
adds content, not infra, with a few small harness primitives ported from
references/recipe-maintainer/utils/tests/helpers.py.
- Per-recipe layout (per plan §4.1). The cc-ci tests/<recipe>/ dir continues to use the Phase-1d/1e overlays at the top level (test_install.py, test_upgrade.py, test_backup.py, test_restore.py, ops.py, recipe_meta.py, optional install_steps.sh). NEW Phase-2 subdirectories:
  - tests/<recipe>/functional/ — parity-port tests (one per recipe-maintainer tests/*.py) + ≥2 NEW recipe-specific functional tests (P2/P3). Each file is test_*.py (pytest-discoverable); each parity port carries a SOURCE = "recipe-info/<recipe>/tests/<file>" comment near the top so the audit trail is in the file, not just in PARITY.md.
  - tests/<recipe>/playwright/ — browser flows (P6) where the app's UX is a UI flow. Same test_*.py convention; each file imports playwright.sync_api.
  - tests/<recipe>/PARITY.md — required mapping table (P2) with one row per recipe-info parity test: | recipe-maintainer file | cc-ci file | what's verified | status |. A deliberate non-port is a documented row in DECISIONS.md (linked from PARITY.md), not a silent omission.
- Discovery for the new subdirs. runner/harness/discovery.custom_tests recurses into tests/<recipe>/functional/ and tests/<recipe>/playwright/ (in addition to the top-level glob), so Phase-2 functional tests run as part of the custom stage automatically. Repo-local (HC2) gate still applies if the recipe is approved; otherwise only cc-ci's own functional/ + playwright/ run. The top-level test_install.py/etc. continue to drive the lifecycle overlays — the functional/ + playwright/ files are always custom-stage, never lifecycle (so they don't perform an op; they assert against the post-install live deployment).
runner/harness/. Capabilities ported fromrecipe-maintainer/utils/tests/ helpers.py(cc-ci is self-contained at runtime — does NOT import recipe-maintainer's workspace, per plan §8 default):harness.http—http_get(url, headers=, timeout=) -> (status, json_or_None),http_post(...),retry_http_get(url, timeout=, **),wait_for_http(url, label, max_wait=),assert_converges(fn, description, max_wait=, interval=). (Several variants existlifecycle.http_fetch/http_get/http_bodyalready; the harness.http module is the canonical Phase-2 HTTP API for tests; lifecycle.* helpers stay for infra-level checks.)harness.abra_tty—script -qefc "abra …" /dev/nullwrapper for the abra commands that require a TTY (backup/restore/secret/run/logs/lint), used by parity tests that drive abra directly. Lifecycle already exposes typed wrappers — this is for tests that need raw shell-abra.harness.deps— dependency resolver primitive. Readstests/<recipe>/recipe.toml(requires/test_requires), deploys each declared dep via the samelifecycle.deploy_appwait_healthypath (so the dep is a real<dep[:4]>-<6hex>.ci.commoninternet.neton the same swarm), persists per-run, tears down with the parent in the orchestrator'sfinally. Heavy recipes sequence sequentially;MAX_TESTS/node budget is the cap.
harness.sso— OIDC-flow primitive (Q2 deliverable). Given a deployed provider domain and a recipe-defined realm/client/test-user, performs the full "deploy provider → setup realm/client via admin API → obtain access token (password + client-credentials grants) → assert protected API call accepts it" assertion. Reusable by every SSO-dependent recipe (cryptpad, lasuite-*, immich, etc.). Setup scripts ported fromrecipe-info/<dep>/setup_<provider>_integration.py.harness.data_integrity— backup data-integrity primitive: a recipe-aware "seed a marker → backup → mutate → restore → assert seeded marker survived" helper aroundlifecycle.exec_in_app/http_get(the recipe chooses the marker mechanism, the helper guarantees the pattern).
- Run-scoped credentials for SSO/recipe-specific tests (plan §4.4 class-B). Generated secrets (realm/client/test-user passwords, API tokens) persist for the run via the existing runs/<app-name>/ mechanism (Phase 1d). Destroyed at teardown alongside abra secrets/volumes.
- Recipe-versioned tests (anti-anchoring). Per plan §7.1, tests read versions/endpoints dynamically (the app's own discovery endpoints, env from live_app) and never hardcode published release values. Each functional test file declares the recipe-info SOURCE path it ports from so the Adversary can audit parity cold.
- Heavy-recipe parking. Drone's MAX_TESTS=1 + per-build timeout already serialize runs; for Phase 2 we DO NOT lift it. Within a single run, the orchestrator deploys deps before the recipe-under-test sequentially (never concurrently) per plan §4.2.
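The polling contract behind wait_for_http / assert_converges can be sketched as below. This is an illustration, not the harness code: only the assert_converges(fn, description, max_wait=, interval=) signature comes from the helper list above. The default values, the injectable sleep/clock parameters, and the choice to treat a raising probe as "not yet converged" are assumptions.

```python
import time
from typing import Callable


def assert_converges(
    fn: Callable[[], bool],
    description: str,
    max_wait: float = 60.0,  # assumed default, not taken from the harness
    interval: float = 2.0,   # assumed default, not taken from the harness
    sleep: Callable[[float], None] = time.sleep,  # hypothetical hook for tests
    clock: Callable[[], float] = time.monotonic,  # hypothetical hook for tests
) -> None:
    """Poll fn until it returns truthy; raise AssertionError once max_wait has elapsed."""
    deadline = clock() + max_wait
    last_error = None
    while True:
        try:
            if fn():
                return
        except Exception as exc:  # a raising probe counts as "not yet converged"
            last_error = exc
        if clock() >= deadline:
            raise AssertionError(
                f"{description} did not converge within {max_wait}s "
                f"(last error: {last_error!r})"
            )
        sleep(interval)
```

A wait_for_http(url, label, max_wait=) helper could then be a thin wrapper that passes a probe returning True on HTTP 200.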
Phase 2 Q3.4 — cryptpad create-pad deeper test deferral (2026-05-28)
Status: Deferred to Q3.4 follow-up (or Q5 catch-up), with Adversary sign-off pending per plan §7.1.
What's deferred: The "create-an-object + read-it-back" deep test for cryptpad — authenticate-and-create a real pad in the browser, type a uniquely-marked content string, reload the page (retaining the client-side encryption key in the URL fragment), assert the marker survives. This is the canonical create-and-read-back per plan §4.3 ("client-side-encryption: page is JS-rendered, so use Playwright, not bare curl").
Why deferred (the technical reason):
- CryptPad's pad-creation client-side flow is version-specific. In the recipe under test (10.6.0+5.7.0), visiting /pad/ does NOT auto-inject a fragment-keyed pad URL; CryptPad requires the user to explicitly click a "new rich text" / "new pad" link from the landing page, AND those UI selectors (.cp-apps-grid a, [data-app='pad'], a[href*='/pad/']) are not stable across CryptPad versions.
- Three attempted drafts during Q3.4 each failed cold on this:
  - Type + reload + content-survives: contenteditable inside nested iframe with origin mismatch (SANDBOX_DOMAIN).
  - Direct-/pad/-then-fragment: no fragment ever appeared on this version.
  - Click-fallback for known app-launch selectors: none of the candidate selectors matched.
The maximal testable subset that IS shipped (P3 floor met):
- tests/cryptpad/functional/test_health_check.py: parity HTTP 200.
- tests/cryptpad/functional/test_spa_assets.py: CryptPad branding + canonical asset paths in served HTML. Catches the wedged-server-fallback-page failure mode.
- tests/cryptpad/playwright/test_pad_create.py: Chromium renders the SPA, asserts brand + canonical asset references + zero non-filtered JavaScript console errors.
The Playwright test exercises the JS pipeline in a real browser (per §4.3 directive); the
piece NOT exercised is the user-action-driven pad lifecycle. What's required to lift the
deferral: pin a specific CryptPad app-launch contract (CryptPad's source has app-launch
URL patterns like /pad/?new=1 on some versions) OR write a Playwright helper that walks the
SPA's main menu via a stable accessibility tree (role-based selectors instead of CSS).
Adversary may file F2-N requesting full create-pad coverage; the answer above is the honest technical reason + the maximal subset. Logged here per plan §7.1.
Phase 2 — nested DOMAIN-derived subdomains flattened to single-label wildcard siblings
Decision (settled): When an enrolled recipe routes additional services on nested subdomains
derived from DOMAIN (e.g. lasuite-drive MINIO_DOMAIN="minio.${DOMAIN}" +
COLLABORA_DOMAIN="collabora.${DOMAIN}"; lasuite-meet LIVEKIT_DOMAIN="livekit.${DOMAIN}"), the
recipe's recipe_meta.EXTRA_ENV(domain) MUST override those vars to a single-label sibling under
the wildcard — minio-<domain>, collabora-<domain>, livekit-<domain> — NOT the recipe's
default <svc>.<domain>.
Why: cc-ci's TLS cert is the operator's pre-issued wildcard *.ci.commoninternet.net (+ bare
ci.commoninternet.net) — §4.0/§1.5, renewed out-of-band, no ACME. A wildcard matches exactly one
label. The per-run app domain is already one label (lasuite-drive-pr<n>-<sha>.ci.commoninternet.net),
so a nested minio.lasuite-drive-pr<n>-<sha>.ci.commoninternet.net is a 2-label name the wildcard
does NOT cover → Traefik would serve an invalid cert on that router and the service is unreachable
over HTTPS. Re-prefixing with a hyphen keeps it one label (minio-lasuite-drive-pr<n>-<sha> +
.ci.commoninternet.net), covered by the same wildcard, routed by Traefik's swarm provider with no
cert work and no gateway change (the gateway already passes the whole wildcard, §4.0). We must NOT
mint per-host certs / ACME for these (class-A1 boundary, §9).
Scope: purely a per-recipe EXTRA_ENV concern (no shared-harness change). Recipes with no
DOMAIN-derived nested subdomains (most) are unaffected.
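The one-label rule and the hyphen re-prefixing can be checked with a small sketch. The helper names covered_by_wildcard and sibling_domain are hypothetical, and the dict return shape of EXTRA_ENV is an assumption; only the variable names and the <svc>-<domain> scheme come from the decision above.

```python
WILDCARD_BASE = "ci.commoninternet.net"


def covered_by_wildcard(host: str, base: str = WILDCARD_BASE) -> bool:
    """True if host is the bare base or exactly one label under it (*.base)."""
    if host == base:
        return True
    suffix = "." + base
    if not host.endswith(suffix):
        return False
    label = host[: -len(suffix)]
    return bool(label) and "." not in label


def sibling_domain(service: str, domain: str) -> str:
    """Flatten the nested <svc>.<domain> into the single-label sibling <svc>-<domain>."""
    return f"{service}-{domain}"


def EXTRA_ENV(domain: str) -> dict:
    # Assumed return shape: a dict of env overrides for a lasuite-drive style recipe.
    return {
        "MINIO_DOMAIN": sibling_domain("minio", domain),
        "COLLABORA_DOMAIN": sibling_domain("collabora", domain),
    }
```

With an illustrative per-run domain, the nested default fails the check while the flattened sibling passes, which is the property the override relies on.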
Phase 2 — services_converged treats a replicas: 0 one-shot as converged
Decision (settled): runner/harness/lifecycle.py::services_converged now considers a service
converged when cur == want (desired replica count met), removing the prior
or want == "0" rejection.
Why: lasuite-drive's minio-createbuckets is declared deploy: {mode: replicated, replicas: 0, restart_policy: {condition: none}} — an on-demand one-shot (scaled up manually only when buckets
need (re)creating; it mc mb … then exit 0). docker stack services reports it 0/0. The old
check rejected any want == "0" row, so the stack could never report converged → every deploy
hung until deploy_timeout. A service AT its desired count (including 0/0) is converged; a service
still spinning up shows 0/1 (cur != want) and is correctly not-yet-converged, so the HTTP
readiness wait still gates real liveness. Safe for all currently-green recipes (their services are
all N/N with N>0; the 0/0 case did not previously occur). Buckets/migrations that the one-shot
performs are run on-demand in the recipe's setup_custom_tests.sh (post-deploy), not relied upon for
generic-install convergence (the SPA at / serves 200 without them).
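The convergence rule reads as follows when written over pre-parsed rows. This is a sketch of the predicate only, not the code in lifecycle.py: the input format (one "<name> <cur>/<want>" string per service, as docker stack services can print with a format template), the service names in the usage, and the choice to treat an empty listing as not converged are assumptions.

```python
def services_converged(rows: list[str]) -> bool:
    """Return True when every service row reports cur == want (0/0 included)."""
    if not rows:
        # Assumption: no services listed means the stack is not deployed yet.
        return False
    for row in rows:
        _name, replicas = row.split()[:2]
        cur, want = replicas.split("/")
        if cur != want:
            return False  # e.g. 0/1: still spinning up
    return True
```

A 0/0 one-shot such as minio-createbuckets no longer blocks the result, while any service at 0/1 still does.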
2026-05-28 — Docker Hub auth: declarative config.json via sops (rate-limit fix) — SETTLED
Context. Heavy Phase-2 recipe deploys exhausted Docker Hub's anonymous pull rate limit
(100/6h per shared IP 68.14.43.142) → toomanyrequests blocked all new deploys. Operator
provided a read-only Docker Hub PAT (Class A1 registry creds, plan §1.5): DOCKERHUB_USERNAME=nptest2
+ DOCKERHUB_TOKEN in /srv/cc-ci/.testenv. Authenticated pulls = 200/6h per-account.
Decision. Wire it declaratively (survives a 1c NixOS rebuild), not just an imperative login:
- Secret: secrets/secrets.yaml (cc-ci-secrets submodule, commit cdd5e0a) gains key dockerhub_auth = base64("nptest2:<PAT>"), i.e. the exact auth field docker config.json wants, so the nix template is a pure render (no runtime base64). sops-encrypted to host+master age recipients (edited on cc-ci using its ssh-host-key→age identity via nix shell nixpkgs#sops; plaintext shredded; PAT never committed plaintext nor exposed in process args/logs).
- Render: nix/modules/secrets.nix adds sops.secrets.dockerhub_auth + a sops.templates."docker-config.json" that renders /root/.docker/config.json (0600, root) at activation. It becomes a symlink to /run/secrets/rendered/docker-config.json.
- Why /root: the drone exec runner runs pipelines as User=root (drone-runner.nix), and manual deploys ssh in as root, so /root/.docker/config.json covers both the !testme CI path and manual ops. Single config, single user.
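For reference, the value stored under dockerhub_auth and the file it renders into can be reproduced as below. The function names are hypothetical and the token is a placeholder; the "https://index.docker.io/v1/" key is the entry docker login writes for Docker Hub. The real render is done by the sops template, not by Python.

```python
import base64
import json


def dockerhub_auth(username: str, token: str) -> str:
    """base64("<user>:<token>"), the exact 'auth' field docker config.json expects."""
    return base64.b64encode(f"{username}:{token}".encode()).decode()


def render_docker_config(auth: str) -> str:
    """Render a config.json body equivalent to what the template would produce."""
    config = {"auths": {"https://index.docker.io/v1/": {"auth": auth}}}
    return json.dumps(config, indent=2)
```

Because the secret already holds the encoded auth string, the template only needs to substitute it; no encoding happens at activation.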
Swarm-propagation question — RESOLVED empirically (no --with-registry-auth / pre-pull needed).
The operator/Adversary flagged that a node docker login may NOT propagate to swarm SERVICE-task
pulls. Tested on cc-ci with the authenticated config.json in place:
- Account ratelimit baseline 197/200 (source = account hash b662dd8b-…, not the IP).
- Deployed uncached n8nio/n8n:2.20.6 via abra (RECIPE=n8n STAGES=install). The swarm service task pulled it to 1/1 Running with no toomanyrequests.
- Account counter dropped 197 → 196 (manager manifest resolution) → 195 (agent layer-manifest pull), source still the account hash. So abra's docker stack deploy propagates the cred to the swarm task pull on this single-node swarm, billed to the account, not the anon IP.
- Corroborating: the earlier lasuite-drive deploy resolved 12 images with no toomanyrequests while anon budget was ≤4, which is impossible anonymously → manager resolution is authenticated too.
So: declarative root config.json is sufficient end-to-end here; --with-registry-auth is not
required (abra/SDK attaches it). Caveat (Phase 2b): 200/6h may still be tight for a full ~18-recipe
sweep; the permanent structural fix is a registry pull-through cache authenticated with this same PAT.
Phase 2w — warm canonical + --quick (2026-05-28)
Stable-domain scheme for warm apps: warm-<recipe>.ci.commoninternet.net. Distinct from cold
per-run <recipe[:4]>-<6hex> (naming.app_domain) so a warm app is never confused with a disposable
cold run. Live-warm keycloak = warm-keycloak.ci.commoninternet.net; data-warm canonicals (W1) =
warm-<recipe>.... Risk to watch: longer stack name vs swarm's 64-char config/secret limit —
verified per-recipe on first deploy; shorten the scheme if any recipe's secret name overflows.
Realm is the per-run isolation unit on the shared live-warm keycloak (WC1). Instead of
co-deploying a fresh keycloak per dependent run, dependents use the one live-warm keycloak and create
a per-run namespaced realm+client+user, deleted at run teardown. Realm name =
<parent_recipe>-<6hex> where 6hex is the parent's per-run domain label suffix — unique per
(parent, pr, ref) so concurrent dependents never collide, and traceable for debugging. (Was
realm=parent_recipe, which would collide across concurrent same-recipe runs.)
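The realm naming rule can be sketched as a pure function. The name realm_name and the validation step are hypothetical; the sketch assumes the parent's per-run domain has the cold form <prefix>-<6hex>.ci.commoninternet.net described above.

```python
import re


def realm_name(parent_recipe: str, parent_domain: str) -> str:
    """Build <parent_recipe>-<6hex> from the parent's per-run domain label suffix."""
    label = parent_domain.split(".", 1)[0]  # first DNS label, e.g. cryp-a1b2c3
    suffix = label.rsplit("-", 1)[-1]       # trailing per-run token
    if not re.fullmatch(r"[0-9a-f]{6}", suffix):
        raise ValueError(f"no 6-hex per-run suffix in {parent_domain!r}")
    return f"{parent_recipe}-{suffix}"
```

Two concurrent runs of the same parent recipe get different suffixes, so their realms do not collide on the shared keycloak.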
Warm keycloak is declarative INFRA, not warm DATA. The live-warm keycloak service is brought up by a Nix systemd-oneshot reconciler (converges to deployed+healthy at the stable domain), exactly like the traefik recipe deploy — so it IS in the D8 reproducibility closure (re-warmable from scratch) and self-heals on activation/boot. Only warm volumes/snapshots (W1+) are cache excluded from D8. The keycloak's realm data is ephemeral per-run, so nothing persistent to exclude.
Live-warm is an optimization layer with a cold fallback. If no warm keycloak is present (e.g. a from-scratch host before the reconciler has run, or the warm app is down), the keycloak dep path falls back to the existing cold co-deploy so dependent runs still work. The warm path is preferred when available.
Phase 2w — design update: unpinned warm/infra + health-gated rollback (2026-05-28/29)
Warm/infra apps (traefik + keycloak) auto-update to LATEST nightly, health-gated (operator).
Supersedes the W0.3 pinned kcVersion. Keycloak is now unpinned like traefik: reconciler abra recipe fetch latest + chaos deploy; keep secret-generate-only-if-missing + health-wait. D8 holds
because the recipe is fetched at activation (runtime), so the nix store closure is byte-identical
regardless of which keycloak version is live.
Snapshot helper (WC3) — format + path. runner/harness/warmsnap.py. A snapshot is a raw tar
of each docker volume belonging to the app's stack, taken while the app is undeployed (nothing
writing → consistent). Stored under /var/lib/ci-warm/<recipe>/ as <recipe>.snapshot.tar + a
<recipe>.meta.json (commit/version/timestamp/volume list). One last-good per app, replaced
atomically (write to .tmp then rename). Restore: for each volume, clear _data and untar
back. Docker volumes are stack-scoped (<stack>_<vol>); the helper enumerates them via
docker volume ls filtered to the stack. Reused by WC1.1 (pre-upgrade snapshot of keycloak) and WC5
(promote-on-green-cold). Warm snapshots are cache, excluded from the D8 closure (WC8).
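The write-to-.tmp-then-rename step can be sketched as below. This is not warmsnap.py: the function name, its parameters, and taking volume directories as plain paths (instead of enumerating docker volumes) are assumptions made so the example runs standalone.

```python
import json
import os
import tarfile
import time
from pathlib import Path


def write_snapshot(recipe: str, volume_dirs: dict[str, Path], store: Path,
                   version: str, commit: str) -> Path:
    """Tar each volume dir into <recipe>.snapshot.tar, replacing the last-good by rename."""
    target_dir = store / recipe
    target_dir.mkdir(parents=True, exist_ok=True)
    final_tar = target_dir / f"{recipe}.snapshot.tar"
    tmp_tar = target_dir / f"{recipe}.snapshot.tar.tmp"
    with tarfile.open(tmp_tar, "w") as tar:  # raw (uncompressed) tar
        for volume_name, path in sorted(volume_dirs.items()):
            tar.add(path, arcname=volume_name)
    meta = {
        "commit": commit,
        "version": version,
        "timestamp": int(time.time()),
        "volumes": sorted(volume_dirs),
    }
    tmp_meta = target_dir / f"{recipe}.meta.json.tmp"
    tmp_meta.write_text(json.dumps(meta))
    # Each rename is atomic on POSIX; the pair of renames is not jointly atomic.
    os.replace(tmp_tar, final_tar)
    os.replace(tmp_meta, target_dir / f"{recipe}.meta.json")
    return final_tar
```

A reader that opens the snapshot path therefore sees either the previous last-good or the complete new tar, never a partially written file.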
Alert mechanism — sentinel files relayed by the Builder loop. The warm/infra reconciler is an
autonomous bash systemd unit on cc-ci; it cannot call the agent's PushNotification tool. So a
reconciler that rolls back (WC1.1) or holds a major/manual-migration upgrade (WC1.2) writes a JSON
alert sentinel to /var/lib/ci-warm/alerts/<ts>-<app>-<reason>.json (fields: app, reason
[rollback|held-major|held-manual-migration], from_version, to_version, release_notes, ts). The
Builder loop, each wake, scans that dir; for each new alert it (a) issues PushNotification to the
operator, (b) records it in STATUS-2w/JOURNAL-2w, (c) archives it to alerts/seen/. This bridges the
autonomous reconciler to operator visibility (latency = next Builder wake; acceptable for an alert).
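Both halves of the sentinel relay can be sketched as below. The function names are hypothetical, notify stands in for the PushNotification call, the epoch-seconds ts is an assumption, and the STATUS-2w/JOURNAL-2w recording step is left out.

```python
import json
import time
from pathlib import Path

REASONS = {"rollback", "held-major", "held-manual-migration"}


def write_alert(alerts_dir: Path, app: str, reason: str, from_version: str,
                to_version: str, release_notes: str = "", ts=None) -> Path:
    """Reconciler side: write <ts>-<app>-<reason>.json via a temp file and rename."""
    if reason not in REASONS:
        raise ValueError(f"unknown reason: {reason}")
    ts = int(time.time()) if ts is None else ts  # assumed timestamp format
    alerts_dir.mkdir(parents=True, exist_ok=True)
    final = alerts_dir / f"{ts}-{app}-{reason}.json"
    tmp = alerts_dir / (final.name + ".tmp")  # not matched by the *.json scan
    tmp.write_text(json.dumps({
        "app": app, "reason": reason, "from_version": from_version,
        "to_version": to_version, "release_notes": release_notes, "ts": ts,
    }))
    tmp.rename(final)
    return final


def relay_alerts(alerts_dir: Path, notify) -> list:
    """Builder side: notify for each new alert, then archive it under seen/."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = []
    for path in sorted(alerts_dir.glob("*.json")):
        alert = json.loads(path.read_text())
        notify(alert)
        path.rename(seen / path.name)
        relayed.append(alert)
    return relayed
```

Archiving after notifying means an alert is relayed at least once; a crash between the two steps would re-send it on the next wake.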
Re-sequence: WC1.1's keycloak rollback needs the WC3 snapshot helper, so build that FIRST, then rewrite the reconciler ONCE into the unpinned + WC1.2-safety-gated + WC1.1-health-gated-rollback form (avoids reworking the reconciler twice). The W0.3 reconciler is INTERIM until then.
Phase 2w — W0.6 reconciler: version model + deploy-by-tag (2026-05-29)
Reconcile entrypoint in Python, packaged in the nix store. runner/warm_reconcile.py, invoked by
the systemd unit as ${pyEnv}/bin/python3 ${../../runner}/warm_reconcile.py <app> (the runner/ dir is
copied into the store → D8-clean, no dependence on the /root/cc-ci checkout). Reuses
warmsnap/sso/abra/lifecycle so there is ONE snapshot impl (also used by the runner for WC5). Replaces
the bash reconcile in warm-keycloak.nix.
"latest" = newest published version TAG, deployed pinned (not chaos-of-main). WC1.2's "major
recipe-version bump" detection needs comparable versions, which chaos (deploy main HEAD) doesn't give.
So the reconciler resolves latest = git tag | sort -V | tail -1 (valid coop-cloud version tags),
records current = the app .env VERSION, and deploys the chosen tag pinned (abra app deploy <domain> <version> -o -n -f, after git checkout <tag>). "Auto-update to latest" is satisfied by converging
to the newest tag; "chaos" in the operator note is read as "auto-deploy latest", and tag-pinning is
the correct mechanism for a version-gated auto-update.
coop-cloud version format is <recipe-semver>+<app-version> (observed), not the plan's
<upstream>+<recipe-semver>. Evidence: keycloak 10.7.1+26.6.2 → image keycloak:26.6.2; n8n
3.2.0+2.20.6 → image n8nio/n8n:2.20.6 (the post-+ part is the app image tag). So the recipe
semver is the part BEFORE +. WC1.2's "major recipe bump = breaking" keys off the major (first)
component of the pre-+ recipe semver (e.g. 3.x→4.0 = held). Secondary signal: scan the target's
releaseNotes/<version>.md for manual-migration markers.
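The tag selection and the major-bump gate can be sketched in Python as below. This approximates git tag | sort -V | tail -1 by comparing only the recipe semver as an integer tuple, and it assumes tags always have three numeric components before the +. The function names are hypothetical; apart from the keycloak and n8n tags quoted above, the tags in the usage note are made up.

```python
import re

TAG_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\+(.+)$")


def parse_tag(tag: str):
    """Split <recipe-semver>+<app-version> into ((major, minor, patch), app_version)."""
    m = TAG_RE.match(tag)
    if not m:
        raise ValueError(f"not a coop-cloud version tag: {tag!r}")
    return (int(m.group(1)), int(m.group(2)), int(m.group(3))), m.group(4)


def latest_tag(tags):
    """Newest valid tag by recipe semver; invalid tags are ignored."""
    valid = [t for t in tags if TAG_RE.match(t)]
    if not valid:
        raise ValueError("no valid version tags")
    return max(valid, key=lambda t: parse_tag(t)[0])


def is_major_bump(current: str, target: str) -> bool:
    """True when the recipe semver major increases (the WC1.2 'held' case)."""
    return parse_tag(target)[0][0] > parse_tag(current)[0][0]
```

For example, latest_tag(["3.2.0+2.20.6", "3.10.0+2.21.0"]) picks the 3.10.0 tag, which a plain string sort would rank below 3.2.0.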
Scope order for W0.6: keycloak first (the W0 focus, stateful → snapshot path); apply the same health-gated + safety-gate pattern to traefik (stateless, version-rollback-only) afterward by migrating proxy.nix onto the shared reconcile entrypoint.
Phase 2w — W1 canonical registry design (WC2/WC3) (2026-05-29)
Enrollment is declarative per-recipe via recipe_meta.WARM_CANONICAL = True (consistent with how
DEPS/EXTRA_ENV are declared — enrolling a recipe stays a tests/<recipe>/ change, D5). A recipe so
flagged gets a DATA-WARM canonical. Prove the model on a couple of recipes (custom-html simplest:
stateful, no external DB), NOT all (the nightly sweep populates the rest over time).
Stable domain warm-<recipe>.ci.commoninternet.net (already decided for keycloak; same scheme for
canonicals). Distinct from cold <recipe[:4]>-<6hex>. Watch the swarm 64-char secret-name limit
per recipe on first deploy.
Known-good state per canonical, under /var/lib/ci-warm/<recipe>/: last_good (version string,
already written by warm_reconcile), snapshot/ (warmsnap, W0.5), and a small canonical.json
registry record {recipe, domain, version, commit, status, ts}. The DATA VOLUME is retained while
the app is undeployed (data-warm). These are cache (excluded from D8, WC8).
Data-warm lifecycle (new runner/harness/canonical.py): is_enrolled(recipe) (reads
WARM_CANONICAL), canonical_domain(recipe), read/write_registry(recipe), deploy_canonical(recipe)
(deploy warm-<recipe> at last_good, reattaching the retained volume → warm boot), undeploy_keep_volume(recipe) (undeploy, volume retained = idle data-warm), seed_canonical(recipe, version, commit)
(record + snapshot; the volume becomes the canonical). LIVE-warm (keycloak, always up) vs DATA-warm
(canonicals, undeployed when idle) both use warm-<recipe> + warmsnap.
W1 scope vs W3: W1 builds the registry + data-warm lifecycle and proves it (seed a custom-html canonical → undeploy keep volume → redeploy reattach → data survives; re-warmable from scratch). Automatic promote-on-green-cold (WC5) + nightly (WC6) are W3 — for W1 the canonical is seeded programmatically to prove the model; the cold-advances-canonical wiring comes later.
Phase 2w — W3 WC5 promote-on-green-cold mechanism (2026-05-29)
Promote = re-seed the canonical from a fresh deploy of the green-verified latest (NOT "keep the
cold run's per-run volume"). Rationale: a cold run uses a fresh per-run domain <recipe>-<6hex>
with a fresh volume (cold stays authoritative + fresh); its volume names are per-run-specific and
differ from the canonical's warm-<recipe> volume names, so the per-run volume can't be directly
reused as the canonical without a fragile name-remap. AND the cardinal guardrail "never lose the
known-good" forbids touching the existing canonical until a new green one is ready.
So: on a run that is enrolled (recipe_meta.WARM_CANONICAL) + GREEN + COLD (not --quick) + on LATEST
(no PR head, i.e. REF empty — the nightly/manual-latest run, NOT a PR !testme), AFTER the normal
per-run teardown, the orchestrator PROMOTES: deploy warm-<recipe> at latest → wait healthy →
undeploy → canonical.seed_canonical(version=latest, commit=head) (snapshot-while-undeployed +
atomic registry/snapshot replace). The old known-good is replaced ATOMICALLY only on a green promote
(a red run never reaches promote → known-good safe). The canonical's data = a clean install of the
green-verified latest (a valid known-good baseline; --quick reattaches + upgrades it). Cost: one extra
(canonical) deploy per promote — acceptable for cold/nightly (not latency-sensitive). The FIRST such
green run SEEDS the canonical. --quick never promotes (proven W2). Only cold advances (WC5).
Promote gate predicate (unit-tested): is_enrolled(recipe) and overall==0 and not quick and not ref.
(not ref = a catalogue-latest run, i.e. the nightly sweep or a manual RECIPE=<r> run — a PR
!testme carries REF=PR-head and must NOT advance the canonical to a PR's code.)
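Written out as a function, the gate looks like this. The name should_promote and the argument types are assumptions; the boolean expression is the one stated above, with the is_enrolled(recipe) result passed in as a flag.

```python
def should_promote(enrolled: bool, overall: int, quick: bool, ref: str) -> bool:
    """Promote only for an enrolled, green, cold, catalogue-latest run."""
    return bool(enrolled and overall == 0 and not quick and not ref)
```

Flipping any single input away from the nightly-green case (not enrolled, non-zero overall, --quick, or a non-empty REF) makes the gate return False.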
Phase 2 — heavy-recipe upgrade tier disk constraint (28GB host) — SETTLED finding @2026-05-29
The upgrade tier (HC1: prev published → PR-head via in-place abra app deploy --chaos) cannot
complete for recipes whose successive releases bump multi-GB image tags, because the rolling update
must hold BOTH versions on disk transiently. Proven on lasuite-drive: onlyoffice 9.2 → 9.3.1.2
(3.94GB each) + collabora two versions → ~10GB office images at once vs ~14GB docker headroom on the
28GB host → 99% → deploy fail. No harness fix is possible (the prev images are running, so they
are neither dangling-prunable nor rmi-able when the new must be pulled). install/backup/restore/
custom (single version) fit and pass. Resolution = grow the host disk (Class A1 operator input,
DEFERRED.md 2026-05-29). Until then, heavy recipes are verified via their maximal testable subset
(install+backup+restore+custom) with the upgrade tier flagged as a genuine env-level (disk) blocker
per plan §7.1 (Adversary sign-off required). The cleanup runbook for an over-full host: pkill -f run_recipe_ci.py; docker stack rm <leftover>; remove its volumes+secrets; docker image prune -f.