cc-ci/machine-docs/DECISIONS.md

# DECISIONS — cc-ci Builder

Architecture decisions and dead-ends. One line of rationale each. (§0, §8)

## Settled

- **Wildcard TLS:** operator pre-issues wildcard cert at `/var/lib/ci-certs/live/`; Traefik file
  provider serves it; **no ACME** for commoninternet.net. (Plan §4.0/§8 — fixed.)
- **Repo:** `git.autonomic.zone/recipe-maintainers/cc-ci`, private. Bot is org admin. (Bootstrap.)
- **Git credentials:** helper script in repo-local git config sources `/srv/cc-ci/.testenv` at call
  time — no secret values stored in `.git/config` or commits.

- **Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26,
  overrides plan §3 `modules/traefik.nix`).** Instead of a hand-rolled Traefik we deploy the
  canonical Co-op Cloud `traefik` recipe via abra in **wildcard / file-provider mode**, for
  end-to-end fidelity (canonical `web`/`web-secure` entrypoints + proxy/swarm conventions every
  recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO
  DNS token on the box:
  - `WILDCARDS_ENABLED=1` + append `compose.wildcard.yml`; the pre-issued cert is fed as the
    `ssl_cert`/`ssl_key` swarm secrets (v1) via `abra app secret insert … -f` from
    `/var/lib/ci-certs/live/{fullchain,privkey}.pem`. The file provider serves it (`tls.certificates`).
  - `LETS_ENCRYPT_ENV=` **empty** on the traefik app *and* on every test app → the recipe's
    `tls.certresolver=${LETS_ENCRYPT_ENV}` label resolves to no resolver → routers serve the
    wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
  - Reproducibility (D8): `scripts/deploy-proxy.sh` is idempotent (ensures local abra server, fetches
    recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in
    `docs/install.md`. The custom `modules/traefik.nix` was removed; `modules/swarm.nix` keeps swarm
    init + `proxy` net + firewall 80/443.
  - **Renewal (manual, ~90d):** operator re-issues the wildcard at the same paths, then
    `abra app secret rm traefik.ci.commoninternet.net ssl_cert -n` + re-insert at a new version (bump
    `SECRET_WILDCARD_CERT_VERSION`) and redeploy. (Documented in docs/secrets.md at M7.)
  - **abra teardown syntax** (for harness, §4.3): `abra app undeploy <d> -n`,
    `abra app volume remove <d> -f -n`, `abra app secret remove <d> --all -n`. None take `--chaos`.

- **Infra bring-up = idempotent-reconcile systemd oneshots — SETTLED (M2, orchestrator steer
  2026-05-26).** Every piece of swarm infra that abra deploys (traefik `modules/proxy.nix`, Drone
  `modules/drone.nix`, later comment-bridge + dashboard) is a `systemd.services.<x>` with
  `Type=oneshot` + `RemainAfterExit`, `after`/`requires` swarm-init + docker, `wants`
  network-online, `wantedBy` multi-user, embedding its script via **`pkgs.writeShellApplication`**
  (self-contained in the store, not a `/root/cc-ci` path). The script **reconciles** (inspect →
  converge → no-op if correct) on *every* activation/boot — **no run-once sentinel** — so it
  self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit)
  on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to
  `git clone` + `nixos-rebuild switch` + operator preconditions, no manual post-steps. The old
  `scripts/deploy-*.sh` were folded into these modules and removed. `pkgs.abra` is provided via an
  overlay (`modules/packages.nix`) so all modules share the one pinned build.
  - *Cert rotation note:* the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the
    wildcard means bumping `SECRET_WILDCARD_*_VERSION` (operator) so the next reconcile re-inserts.
    Documented in docs/secrets.md at M7.

- **Trigger: POLLING primary, webhook optional — SETTLED (orchestrator design change 2026-05-27,
  supersedes the earlier "keep webhook, do NOT pivot to polling" steer).** Hard constraint: the
  bot/server runs at **READ level, never repo-admin**, and **never self-registers a webhook**.
  - **Polling is PRIMARY and the source of truth for D1.** The bridge polls each enrolled repo's
    open PRs for new `!testme` comments every `POLL_INTERVAL` (30s, within the ≤ 60s bound). Outbound
    (cc-ci → git.autonomic.zone, the reliably-working direction), needs only read+comment. On
    startup the first poll marks pre-existing comments seen so it doesn't fire on old comments.
  - **Webhook is an OPTIONAL push optimization.** The `/hook` endpoint stays live (HMAC-verified)
    so an *admin-registered* `issue_comment` webhook lowers latency, but the bridge never registers
    one. Manual registration is documented in `docs/enroll-recipe.md`. Both paths share an
    in-memory seen-set keyed by comment id → a comment seen by both fires at most once.
  - **Commenter authorization via org membership (read-level, no admin).** Allowed iff
    `GET /orgs/{owner}/members/{user}` → 204 (verified 2026-05-27: admits bot/trav/notplants, 404
    for a non-member, works with bot read-level basic-auth) **or** the user is in the optional
    `AUTH_ALLOWLIST`. Replaces the earlier `/collaborators/{user}/permission` check, which needs
    repo-admin. Fail-closed on any error.
  - **Enrollment** = add the repo to the bridge `POLL_REPOS` csv + ensure `tests/<recipe>/` exists.
    No webhook required for CI to work. (The root cause of the old webhook non-delivery no longer
    matters: polling makes it irrelevant. The operator was whitelisting `ci.commoninternet.net` in
    Gitea's `ALLOWED_HOST_LIST`, but D1 no longer depends on that.)

- **Resource safety: bound live test apps — SETTLED (orchestrator design change 2026-05-27,
  plan §4.2/§4.3).** Do NOT keep multiple test apps deployed at once. Three layers, all configurable:
  - **MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1** (`modules/drone-runner.nix`, `maxTests` let-binding).
    Drone runs at most MAX_TESTS builds at once and **auto-queues the rest in its native pending
    queue** — no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is
    never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
  - **Per-build TIMEOUT = 60 min** (`modules/drone.nix`, `buildTimeoutMinutes`; reconciled
    best-effort via `PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60}` using the bridge's
    Drone admin token, local `--resolve`, non-fatal). A build over the limit is cancelled by Drone →
    the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue
    once a test finishes OR times out".
  - **Teardown + janitor backstop.** Each build deploys → runs the 3 stages → undeploys
    (guaranteed `try/finally` in `conftest`/orchestrator). A SIGKILL'd/timed-out build can't run its
    own teardown, so the **run-start janitor** (`lifecycle.janitor`, called before every deploy in
    both fixtures + `run_recipe_ci`) reaps orphaned run apps as the backstop. At capacity=1 the CI
    path will set `CCCI_JANITOR_MAX_AGE=0` (reap any orphan immediately — safe with no concurrent
    runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default
    2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
  - Optional `concurrency: {limit: 1}` in the recipe-CI `.drone.yml` is a redundant belt — primary
    mechanism is `DRONE_RUNNER_CAPACITY`. (Wired when the recipe-CI pipeline lands — see backlog.)
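
The age rule behind the janitor backstop can be sketched as below. This is illustrative only: `orphans_to_reap` and its `apps` mapping (app domain to creation time in epoch seconds) are assumed shapes, not the real `lifecycle.janitor` interface.

```python
import time


def orphans_to_reap(apps, max_age_seconds, now=None):
    """Return the domains of run apps at least `max_age_seconds` old.

    With max_age_seconds=0 (the capacity=1 setting) every leftover app is
    reaped; with the 2h default only apps older than that are, so a live
    concurrent run is left alone.
    """
    now = time.time() if now is None else now
    return sorted(
        domain for domain, created in apps.items() if now - created >= max_age_seconds
    )
```

Passing `now` explicitly keeps the rule testable without touching the clock.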

- **D10 recipe #6: bluesky-pds (TLS-passthrough) SWAPPED → n8n — SETTLED (2026-05-27, plan §4.0
  sanctions this swap-with-reason).** bluesky-pds routes via a Traefik **TCP router with
  `tls.passthrough=true`** to an in-container **caddy** that terminates TLS itself and obtains its own
  cert via **ACME**. cc-ci's design is the opposite: the operator gateway passes wildcard TLS through
  to cc-ci's Traefik, which **terminates** it with the pre-issued static wildcard cert, and **ACME is
  hard-forbidden** for commoninternet.net (no DNS token on the box — §4.0/§9). Serving bluesky-pds
  would require either (a) ACME inside caddy (forbidden), or (b) injecting the wildcard cert into
  caddy + a per-host TCP-passthrough router on cc-ci Traefik (recipe-internal surgery + a bespoke
  proxy mode — not a clean shared-harness absorb). This is a genuine design conflict, not a harness
  gap. Per the plan's explicit allowance, **bluesky-pds is a documented non-CI'd recipe** (reason
  here), and **n8n** takes the 6th slot. The 5 required D10 categories are already covered by recipes
  1–5 (simple=custom-html, single-DB+SSO=keycloak, stateful/no-DB=cryptpad, DB+media/large-volume=
  matrix-synapse, multi-service+S3/object-storage=lasuite-docs); n8n adds a 6th real deployable app
  (workflow automation) behind the normal terminate-at-Traefik path.

- **Docker Hub rate limit + mid-breadth prune — FINDING (2026-05-27).** D10 real-`!testme` breadth
  runs exhausted Docker Hub's anonymous pull rate limit (lasuite-docs, 9 images, upgrade stage:
  `toomanyrequests`). Two lessons: (1) **registry pull creds are an A1 operator input** needed for
  reliable heavy-recipe deploys under load (request + sops-store + wire into docker daemon). (2)
  **Don't `docker image prune -af` mid-breadth** — it evicts cached recipe images and forces re-pulls
  that hit the limit. The first lasuite failure was disk pressure (90% full); pruning fixed disk but
  triggered re-pulls → rate limit. Better: rely on the daily autoprune, prune only `dangling`
  (not `-a`) between runs, or grow disk so heavy images stay cached. Net for D10: 5/6 recipes green
  via real !testme; lasuite-docs gated on the rate limit (transient ~hours; durable fix = creds).

## Open (defaults from §8, to confirm as reality lands)

- **Deploy mechanism — SETTLED (M0):** `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run *on
  cc-ci itself*, with the repo materialised on the host at `/root/cc-ci`. Chosen over
  `--target-host`/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS
  proxy (slow/fragile). Atomic rollback preserved by Nix generations (`nixos-rebuild --rollback`).
  The switch is launched as a **detached transient systemd unit** (`systemd-run --unit=ccci-rebuild
  --collect`) so it survives a momentary ssh-over-tailscale drop during activation. For the build
  loop the host copy is synced from the sandbox clone via `tar | ssh` (rsync absent on host);
  source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo
  on a fresh host, then `nixos-rebuild switch --flake .#cc-ci`).
  - **nixpkgs pin:** flake pins the exact rev cc-ci already ran (`50ab793…`) so the first rebuild
    is a true no-op-then-base. Bump deliberately, never drift.
- **Webhook scope:** default per-repo via enroll script.
- **CI engine: Drone (per plan) — kept, with a noted risk.** nixpkgs 24.11 has Drone **server**
  2.24.0 but `drone-runner-exec` is **abandoned (unstable-2020-04-19)** — the only exec runner Drone
  ever shipped (upstream archived ~2021). The maintained fork **Woodpecker** (2.7.3, with NixOS
  modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific
  (D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern
  Drone server (RPC protocol stable). **Fallback:** if the exec runner proves incompatible/broken,
  pivot to Woodpecker (coop-cloud ships a `woodpecker` recipe too) and record it — like the traefik
  pivot. Re-evaluate at the M2 gate.
- **Drone deployment shape — SETTLED (M2):** mirror the traefik pattern. The **server** is the
  coop-cloud `drone` recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by
  traefik at `drone.ci.commoninternet.net`, `LETS_ENCRYPT_ENV` empty → wildcard cert, no ACME),
  with Gitea SSO (`compose.gitea.yml`). The **exec runner** runs as a Nix systemd service on the
  host (`modules/drone-runner.nix`) so it can drive host abra/swarm (plan §4.2). One generated
  `DRONE_RPC_SECRET` is shared: inserted as the server's `rpc_secret` swarm secret AND read by the
  runner from sops. Reproducible deploy: `scripts/deploy-drone.sh`.
  - Gitea OAuth app `cc-ci-drone` created under the bot (client_id
    `ab4cdb9d-ee96-4867-875f-87384505fc52`, redirect `https://drone.ci.commoninternet.net/login`);
    client_secret + rpc_secret stored sops-encrypted in `secrets/secrets.yaml` (A2 internal secrets).
- **Drone runner type:** exec (must drive host abra).
- **Secret tool — SETTLED (M0):** sops-nix. cc-ci decrypts at activation using its **ed25519 SSH
  host key** as the age identity (`sops.age.sshKeyPaths`), so no extra key file to manage on the box.
  Recipients in `/.sops.yaml`: the host age key (`age1h90ut…`, from ssh-to-age) + an off-box
  **master recovery key** (`age1cmk26t…`; private half only at `/srv/cc-ci/.sops/master-age.txt` on
  the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing
  plaintext into `secrets/<f>.yaml` then `sops -e -i` (run inside the repo so `.sops.yaml` is found).
- **D10 recipe set:** lock six early. Candidates favouring already-mirrored: custom-html (simple),
  cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3),
  bluesky-pds (TLS-passthrough) — covers all five categories. Confirm during M4–M6.5.

- **Per-run app domain scheme — adapted (M4, deviates from plan §4.0).** Plan §4.0 wanted
  `<recipe>-pr<n>-<short-sha>.ci.commoninternet.net`, but Docker swarm config/secret names
  (`<stackname>_<resource>_<version>`) must be ≤ 64 chars and abra derives `<stackname>` from the
  domain (dots→`_`, hyphens kept). `.ci.commoninternet.net` alone is 22 chars, so long recipe names
  + config names overflow 64 (hit with `custom-html-pr0-m4demo…_nginx_default_conf_v6` = 66). New
  scheme: **`<recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net`** (e.g. `cust-e084bd`) — short,
  unique per run, collision-safe across recipes (full recipe in the hash). Human-readable recipe/PR/
  ref context lives in the Drone build params + the PR comment, not the (ephemeral) domain.
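
A sketch of the domain scheme and the 64-char check it exists to satisfy. The text only says `6hex(recipe|pr|ref)`; using SHA-256 over the `|`-joined string is an assumption, so the hex part produced here need not match real run domains. All function names are hypothetical.

```python
import hashlib

CI_SUFFIX = ".ci.commoninternet.net"
SWARM_NAME_LIMIT = 64


def run_domain(recipe: str, pr: str, ref: str) -> str:
    """Build `<recipe[:4]>-<6hex>.ci.commoninternet.net`."""
    # Assumption: SHA-256; the source does not name the hash function.
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}{CI_SUFFIX}"


def stack_name(domain: str) -> str:
    """Stack name as described in the text: dots become underscores, hyphens kept."""
    return domain.replace(".", "_")


def resource_name_fits(domain: str, resource: str, version: str) -> bool:
    """Check `<stackname>_<resource>_<version>` against the 64-char swarm limit."""
    return len(f"{stack_name(domain)}_{resource}_{version}") <= SWARM_NAME_LIMIT
```

The domain is always 33 chars (4 + 1 + 6 + 22), which leaves 30 chars for `_<resource>_<version>` regardless of recipe name length.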

- **abra recipe checkout is volatile — harness uses chaos+offline + a tests/ snapshot (M6).** Many
  abra commands (`app ls`, `secret generate` without flags, version resolution) silently
  `git checkout <version-tag>` in `~/.abra/recipes/<recipe>`, discarding a PR branch's files. To
  test the *PR head code* (not a re-resolved tag): (1) `fetch_recipe` clones the mirror branch/ref
  (private → bot token via per-command `http.extraHeader`, never persisted/logged); (2) all harness
  abra calls that touch the recipe pass `-C` (chaos: use current checkout) `-o` (offline: no remote
  fetch); (3) recipe-shipped `tests/` (D4) are **snapshotted to a temp dir right after fetch**, since
  later abra commands still reset the checkout — the recipe-local stage runs from the snapshot.

## Risks

- **Disk — RESOLVED 2026-05-26.** Original 8.9 GiB root had only ~3.8 GiB free *and* a hard
  **inode** ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on
  inodes before bytes. Operator grew the VM to **28 GiB** (22 GiB free, 1.78M inodes / 1.21M free);
  the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown +
  periodic `docker image prune` to avoid regressing during M6.5 breadth.

## Dead-ends
- (none yet)

## Phase 1c (full reproducibility + genuine D8 live rebuild) — 2026-05-27

- **Secrets linkage = git SUBMODULE (deviates from plan §7 flake-input default).** `cc-ci-secrets`
  is mounted as a submodule at `cc-ci/secrets/` rather than a flake `inputs.secrets`. Rationale: a
  private flake input must be re-fetched at **every nix eval**, requiring the bot token persistently
  in nix config/netrc on cc-ci AND the throwaway VM (a token in the store/config = a 2nd out-of-band
  secret, which 1c forbids). A submodule makes `secrets/secrets.yaml` a plain path in the working
  tree → `defaultSopsFile = ../secrets/secrets.yaml` is unchanged (minimal diff, trivially
  byte-identical), and the only credential use is the one `git clone --recursive` at provisioning
  ("the two repos are *given*", Mission §1). Build invocation becomes
  `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` so the submodule tree is
  included. (Revisit if `?submodules=1` proves unreliable on cc-ci's nix version.)
- **Bootstrap key for the throwaway VM = the existing RECOVERY (master) age key, via
  `sops.age.keyFile`.** The recovery key (`age1cmk26…`, private at `/srv/cc-ci/.sops/master-age.txt`)
  is already a sops recipient, so a fresh host with a *different* ssh host key still decrypts every
  secret with no re-keying — this is exactly the §0 argument that defeats "host-key binding".
  Provisioned to the VM at a fixed path (the ONE out-of-band secret). cc-ci itself keeps decrypting
  via its host key (`age.sshKeyPaths`); secrets.nix will offer both identity sources. (Per-host
  re-encrypt is cleaner for a *permanent* new instance — documented as the alternative, not used for
  the throwaway test.)
- **Cert into git:** wildcard cert+key become sops secrets in `cc-ci-secrets`, decrypted at
  activation back to `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` via
  `sops.secrets.<name>.path`; proxy.nix keeps reading that path (now sops-sourced, not operator-drop).
- **cc-nix-test final sizing (C6) — SETTLED by operator 2026-05-27: PROMOTE the rebuilt VM.** The
  freshly-rebuilt reproducible VM (the FINAL W5/C4-C5 clean-room throwaway) becomes the canonical
  cc-nix-test; the operator will repurpose it for a live real-traffic test through the public gateway.
- **C6 teardown OVERRIDE (operator, 2026-05-27):** do NOT destroy the FINAL throwaway VM after
  W5/C4-C5 PASSes — keep it RUNNING; defer its C6 teardown until the operator explicitly says
  otherwise. This overrides the plan §5/§6 "destroy the throwaway" for that one VM only. All other
  cleanup proceeds normally (the Builder's first throwaway was already destroyed; RAM accounting holds).

## Phase 1b — lint/format tooling (open decisions §6, settled W0)
- **Formatters/linters (RL1):** Nix = `nixpkgs-fmt` (format) + `statix` (lints) + `deadnix` (dead
  code); Python = `ruff` (lint + format); Shell = `shellcheck` + `shfmt -i 2 -ci`; YAML = `yamllint`.
  Kept `nixpkgs-fmt` over `alejandra` because it was already the repo `formatter` and devshell tool
  (no extra churn / restyle of every .nix). All built from the already-pinned nixpkgs via a flake
  `lint` devshell (`nix develop .#lint`) so CI and local use byte-identical tool versions.
- **Lint entrypoint:** `scripts/lint.sh` (check-only by default; `--fix` auto-applies). The
  `.drone.yml` push pipeline runs it via `nix develop .#lint --command bash scripts/lint.sh`.
- **ruff strictness:** `select = [E,F,W,I,UP,B,C4,SIM]`, `ignore = [E501]` (line length is the
  formatter's job; only un-splittable strings would trip it). `line-length=100`, `target=py311`.
- **Drone lint stage = FAIL (not warn).** The codebase is green now, so enforce from here on — an
  unclean commit fails the `lint` step. (Resolves the §6 open question.)
- **Python type-checking (mypy/pyright): DEFERRED to IDEAS**, not added in 1b. The harness is small
  and dynamically typed around `abra`/subprocess JSON; gradual typing is a larger effort than this
  bounded pass warrants. Revisit if Phase 2's 18-recipe ramp shows type bugs.
- **blocking vs advisory split (§3):** treated as in the phase plan — tests-real, Nix-idempotent,
  no-footguns, no-secrets, log-redaction, harness-DRY = blocking; readability/docs/arch-drift =
  advisory unless a real plan deviation. Recorded per-finding in REVIEW-1b / BACKLOG-1b.
- **cc-ci self-CI push trigger:** the lint stage lives in the `event: push` pipeline. The Gitea→Drone
  push webhook on this instance is flaky (`last_status: None`; documented §4.1) and predates 1b —
  recipe CI uses polling as primary, but cc-ci's *own* self-test/lint relies on the push webhook.
  The lint stage is correctly wired and proven green via the identical `nix develop .#lint` command;
  reliably auto-firing it on every push is tracked as a (pre-existing) infra item, not a 1b lint gap.

## Phase 1b — repo layout (operator review items RL5/RL6, plan §7)
- **RL5 — all Nix code under `nix/`.** Moved `modules/`→`nix/modules/` and `hosts/`→`nix/hosts/`.
  `flake.nix`/`flake.lock` STAY at the repo root (entry point) so the build ref `#cc-ci` and
  `nixos-rebuild --flake '…#cc-ci'` are unchanged — only `flake.nix`'s internal
  `./hosts/cc-ci/configuration.nix` → `./nix/hosts/cc-ci/configuration.nix` changed. Root-relative
  refs inside the moved modules were re-based `../X` → `../../X` (secrets.nix → `../../secrets/`,
  bridge.nix → `../../bridge/`, dashboard.nix → `../../dashboard/`); `configuration.nix`'s
  `../../modules/*` imports are unchanged (both dirs moved under `nix/`, so the relative path still
  resolves). **Toplevel is byte-identical (`8i3jcad9…`) before/after the move** — store derivations
  are content-addressed on the copied file *contents*, and the module `.nix` files aren't part of the
  runtime closure, so relocating folders doesn't change the build. (The operator anticipated a hash
  change; in practice it's stable, which is even stronger for reproducibility.) Living docs
  (README, architecture/install/secrets/enroll) + the `.drone.yml` comment updated to `nix/…`;
  append-only history logs left as the record of what was true then.
- **RL6 — protocol files → `machine-docs/`: DEFERRED to the coordinated end of 1b.** Will `git mv`
  `STATUS*/REVIEW*/JOURNAL*/BACKLOG*/DECISIONS.md` into `machine-docs/` (README.md STAYS at root —
  operator decision, it's the human readme, not a protocol file). The live watchdog (`launch.sh`)
  reads `STATUS-<id>.md`/`REVIEW-<id>.md` at the repo root for handoffs/transition, so this is done
  LAST, in lockstep with the orchestrator updating `launch.sh` + restarting the watchdog — not
  unilaterally and not while a phase transition is pending. The Adversary likewise `git mv`s its own
  REVIEW files at the cutover (single-writer rule).

## Phase 1b — recorded deviation: no `tests/_template/` dir (enroll = copy an existing recipe)
Plan §3's repo layout lists a `tests/_template/` "copy-to-add-a-recipe" dir. It was **never created**
(pre-1b; not introduced or removed by 1b) — instead the documented enroll flow in
`docs/enroll-recipe.md` is **"copy an existing recipe's tree, e.g. `tests/custom-html/…`, then adjust
`recipe_meta.py` + the per-recipe test files."** This satisfies D5's "small, repeatable, documented
operation with no harness surgery" the same way (a concrete recipe is a better starting template than
an abstract skeleton that can drift). Recording per the Adversary's RL3 D5 advisory; not a blocker.

## Phase 1d — generic test suite + layered overlays (design, 2026-05-27)

SSOT: `cc-ci-plan/plan-phase1d-generic-test-suite.md`. Resolves the §6 open decisions.

- **Tier model & op/assertion split (the core call).** A run is a sequence of TIERS — install,
  upgrade, backup, restore, custom — each = `generic default [overridden by a recipe overlay]`. The
  **lifecycle OP** (deploy, upgrade, backup, restore) is owned by the **shared harness**
  (`harness.generic` helpers), NOT duplicated in every test file. A tier's **test file** (generic or
  overlay) carries the ASSERTIONS and calls the shared op helper. This keeps the op single-sourced
  (DRY, DG7) and makes deploy-once trivial: only the orchestrator deploys/tears-down.

- **Override (not additive) — Builder's call (plan §6, operator leaned override).** For each
  lifecycle op exactly ONE assertion file runs, by precedence:
  **repo-local `tests/test_<op>.py` > cc-ci `tests/<recipe>/test_<op>.py` > generic
  (`tests/_generic/test_<op>.py`)**. A present overlay REPLACES the generic for that op. **Invariant:
  no overlay for an op ⇒ the generic runs** (so any recipe is testable with zero config). Repo-local
  wins same-name collisions (upstream is authoritative, plan §2.5); cc-ci's overlay is the curated
  fallback until upstream adopts it. **Extend-by-composition:** an overlay may
  `from harness import generic` and call `generic.assert_serving(...)` / `generic.do_upgrade(...)`
  then add its own assertions — so "extend" needs no separate mechanism.
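
The precedence rule can be sketched as a first-match lookup. This is a sketch under stated assumptions, not the real discovery code: the signature and the injectable `is_file` predicate are hypothetical, and `repo_local_dir` stands for the recipe repo's `tests/` directory.

```python
from pathlib import Path


def resolve_op(op, recipe, repo_local_dir, ccci_tests_dir, is_file=Path.is_file):
    """Pick the single assertion file for a lifecycle op, highest precedence first."""
    name = f"test_{op}.py"
    candidates = [
        Path(repo_local_dir) / name,               # repo-local: upstream is authoritative
        Path(ccci_tests_dir) / recipe / name,      # cc-ci curated overlay
        Path(ccci_tests_dir) / "_generic" / name,  # generic: runs when no overlay exists
    ]
    for path in candidates:
        if is_file(path):
            return path
    raise FileNotFoundError(f"no assertion file for op {op!r}")
```

Since the generic file is always the last candidate, a recipe with no overlay still resolves to a runnable test.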

- **Custom (non-lifecycle) `test_*.py`:** ALL discovered from BOTH locations run additively, opt-in
  (no override, no generic equivalent) — e.g. `test_sso.py`.

- **Deploy ONCE, mutate in place (operator requirement, DG4.1).** The orchestrator deploys the app
  ONCE, runs all tiers against that single live deployment (install asserts; upgrade does
  `abra app upgrade` in place; backup/restore mutate in place; custom asserts), then ONE teardown in
  `finally`. No per-tier/per-overlay `abra app new/deploy/undeploy`. A `CCCI_DEPLOY_COUNT` counter in
  `lifecycle.deploy_app` is asserted == 1 per run (DG4.1 evidence).

- **Deployment-sharing scope & base version (§6 open).** One deployment for the whole lifecycle.
  Base version deployed once = the **previous published version** when an upgrade tier will run and a
  previous exists (so upgrade goes previous→target in place), **else the target** (current/$REF).
  Recipe with only one published version ⇒ upgrade tier is a clean **SKIP** (nothing to upgrade
  from). Standalone generic-install demo (no PR) deploys current.
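
The base-version choice reduces to a small decision function. In this sketch `plan_base_version` is a hypothetical name, and `published` is assumed to list the recipe's published versions oldest first, with the last entry being the current one.

```python
def plan_base_version(published, target, upgrade_requested=True):
    """Return (version_to_deploy_once, upgrade_tier_outcome).

    Previous published version when an upgrade tier will run and a previous
    exists; otherwise the target, with the upgrade tier skipped.
    """
    has_previous = len(published) >= 2
    if upgrade_requested and has_previous:
        return published[-2], "run"
    return target, "skip"
```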

- **Fail handling across shared tiers (§6 open):** install failing (app never serves) **fail-fasts**
  the run (later tiers can't meaningfully run on a dead deployment) and they report **error/skip**;
  upgrade/backup/restore failures are recorded per-op but do not abort the remaining independent
  tiers where they can still run. Teardown always runs.

- **Backup-capability detection (DG3, §6 open):** auto — scan the recipe's `compose*.yml` for a
  `backupbot.backup` label (verified present in custom-html). `recipe_meta.BACKUP_CAPABLE` (bool)
  overrides the auto-detect. Not capable ⇒ backup+restore tiers are **N/A (skip)**, not failures.
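
A sketch of the auto-detect with its override. The plain substring scan is an assumption (the real check may parse the YAML), and the function names are hypothetical; `override` models `recipe_meta.BACKUP_CAPABLE`, with `None` meaning "not set".

```python
from pathlib import Path


def has_backup_label(compose_text: str) -> bool:
    """True if the compose text mentions a `backupbot.backup` label."""
    return "backupbot.backup" in compose_text


def backup_capable(recipe_dir, override=None):
    """Decide whether the backup and restore tiers apply to a recipe."""
    if override is not None:
        return bool(override)  # explicit recipe_meta value wins over auto-detect
    return any(
        has_backup_label(path.read_text(encoding="utf-8"))
        for path in sorted(Path(recipe_dir).glob("compose*.yml"))
    )
```

A `False` result maps to N/A (skip) for both tiers, never to a failure.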

- **Custom install-steps hook (DG5, §6 open):** a shell hook — `tests/<recipe>/install_steps.sh`
  (cc-ci) or repo-local `tests/install_steps.sh` — run by the orchestrator during the install tier
  AFTER `abra app new` + env defaults but BEFORE `abra app deploy`, with env `CCCI_APP_DOMAIN`,
  `CCCI_RECIPE`, `CCCI_APP_ENV` (path to the app .env). Chosen over a fixture/declarative field as the
  simplest thing the harness runs uniformly (can `abra app secret insert`, set env, seed). Graceful
  rule: a recipe with NO hook still attempts the generic install; if it genuinely needs a step it
  FAILS the generic install (reported per-op) — that is correct, not a harness bug.

- **Per-op result vocabulary (Phase-3 feed):** `pass | fail | skip(N/A) | error`. The orchestrator
  prints a per-op summary line per run (feeds DG6 + Phase-3 level).

- **Discovery layout:** cc-ci overlays/custom/hook live in `tests/<recipe>/`; repo-local in the
  recipe repo's `tests/` (snapshotted after fetch, per the existing volatile-checkout handling).
  Generic tier files live in `tests/_generic/` (assertion-only, use the shared live-deployment
  fixtures).

---

## Phase 1e — generic-harness corrections (HC1–HC4)

Three operator-review corrections to the Phase-1d shared harness, settled here (plan §5).

- **HC2 — repo-local approval allowlist (form/location + workflow).** PR-author-controlled code
  (`install_steps.sh`, repo-local `test_*.py`) runs on the CI host with `/run/secrets/*` present, so it
  is **default-deny**. Allowlist file: **`tests/repo-local-approved.txt`** (checked into the cc-ci
  repo, git-auditable). Format: one recipe name per line; `#` comments + blank lines ignored; a lone
  `*` is NOT a wildcard (no global opt-in — every recipe is explicit). **Default: empty ⇒ no recipe
  trusts repo-local code.** Discovery (`resolve_op`/`custom_tests`/`install_steps`) consults the
  repo-local source **only** when `repo_local_approved(recipe)` is true; otherwise precedence is
  **cc-ci > generic** only and repo-local is discovered-but-not-executed. **Workflow:** a cc-ci
  maintainer reviews a recipe's repo-local tests, then adds the recipe name to
  `tests/repo-local-approved.txt` in a cc-ci PR — a deliberate, reviewable act. The gate is centralized
  in `discovery.py` (one reader) so the unit tests pin it.
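
The file format above is small enough to sketch in full. `parse_allowlist` is a hypothetical helper name; `repo_local_approved` is named in the text, but its real signature is not shown there, so the two-argument form here is an assumption.

```python
def parse_allowlist(text):
    """Parse the content of `tests/repo-local-approved.txt` into recipe names."""
    approved = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        if line == "*":
            continue  # a lone `*` is NOT a wildcard: no global opt-in
        approved.add(line)
    return approved


def repo_local_approved(recipe, approved):
    """Default-deny: only an exact, explicitly listed recipe name is trusted."""
    return recipe in approved
```

An empty file parses to an empty set, so the default stays "no recipe trusts repo-local code".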

- **HC3 — generic-by-default opt-out flag (name/granularity + recipe_meta).** Generic assertions run
  **additively** alongside any overlay by default. Opt-out, in increasing specificity (any one skips):
  env **`CCCI_SKIP_GENERIC`** (truthy ⇒ skip generic for ALL ops), env
  **`CCCI_SKIP_GENERIC_<OP>`** (e.g. `CCCI_SKIP_GENERIC_UPGRADE` ⇒ skip generic for that op only), and
  declarative **`recipe_meta.SKIP_GENERIC`** = a list of op names (or `["all"]`) so the opt-out is
  per-recipe and visible in git, not a hidden global. Truthy = `1/true/yes/on` (case-insensitive).
  **Op-vs-assertion split:** a mutating op (upgrade/backup/restore) is performed **once by the
  orchestrator** (the harness owns the op); then the generic assertion file (unless opted out) and the
  overlay assertion file both evaluate the **shared post-op state**. Op results that an assertion needs
  (pre-upgrade identity, backup snapshot_id) are passed op→assertions via a run-scoped JSON state file
  at `$CCCI_OP_STATE_FILE` (read by `harness.generic.op_state()`); never logged. Overlays that need to
  **seed pre-op state** (data-continuity markers, the backup→restore mutation) ship an optional
  `tests/<recipe>/ops.py` with `pre_install/pre_upgrade/pre_backup/pre_restore(domain, meta)` callables
  the orchestrator runs **before** the op (repo-local `ops.py` is allowlist-gated like other repo-local
  code). Overlay `test_<op>.py` files are now **assertion-only** (they no longer call `generic.do_*`).
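
The three opt-out sources combine as a simple "any one skips" check. A minimal sketch: `skip_generic` and its parameters are hypothetical, `env` is any mapping such as `os.environ`, and `meta_skip` stands in for `recipe_meta.SKIP_GENERIC`.

```python
TRUTHY = {"1", "true", "yes", "on"}


def _truthy(value):
    """Case-insensitive truthiness as defined in the text; unset means false."""
    return value is not None and value.strip().lower() in TRUTHY


def skip_generic(op, env, meta_skip=()):
    """True if any opt-out applies to `op`: global env, per-op env, or recipe_meta."""
    if _truthy(env.get("CCCI_SKIP_GENERIC")):
        return True
    if _truthy(env.get(f"CCCI_SKIP_GENERIC_{op.upper()}")):
        return True
    return "all" in meta_skip or op in meta_skip
```

Only the assertion file is skipped; the op itself still runs once in the orchestrator.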

- **HC1 — DG4.1 deploy-count vs the in-place chaos upgrade.** The upgrade tier now upgrades to the
  **PR head** (code under test), not a published tag: deploy the previous published version (base),
  **re-checkout the PR head** (recorded as the recipe repo HEAD right after fetch, before any
  version-tag checkout), then **`abra app deploy --chaos`** in place = the upgrade. The deploy-count
  guard counts **`abra app new` installs only** (`_record_deploy()` fires in `deploy_app()`, NOT in the
  chaos redeploy, which calls `abra.deploy` directly) — so a run is still **deploy-count == 1** and the
  legitimate in-place chaos upgrade is not flagged. **Moved assertion (adapted):** prev→PR-head may not
  bump the coop-cloud version label, so `assert_upgraded` accepts ANY of: version-label change, image
  change, or a **chaos label** now present carrying the PR-head commit (a chaos deploy stamps
  `coop-cloud.<stack>.chaos`/`.chaos-version`) — the chaos label IS the proof PR-head was deployed.
  Non-PR `!testme` (no SRC/REF): "PR head" = the catalogue current checkout, so upgrade is prev→current
  — still a genuine move via chaos. (Exact chaos label name verified on the live abra during E2.)
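
The "accepts ANY of" rule can be sketched as collecting evidence and requiring at least one item. The dict keys (`version`, `image`, `chaos_version`) and the 8-char commit-prefix match are assumptions for illustration; the real label names are whatever abra stamps on the service.

```python
def upgrade_evidence(before, after, pr_head):
    """List the signals showing the deployment moved from `before` to `after`."""
    evidence = []
    if before.get("version") != after.get("version"):
        evidence.append("version-label")
    if before.get("image") != after.get("image"):
        evidence.append("image")
    chaos = after.get("chaos_version") or ""
    # Assumption: the chaos label embeds at least the first 8 chars of the commit.
    if pr_head and pr_head[:8] in chaos:
        evidence.append("chaos-label")
    return evidence


def assert_upgraded(before, after, pr_head):
    """Pass when any one signal is present; return the evidence for the report."""
    evidence = upgrade_evidence(before, after, pr_head)
    assert evidence, "no sign the deployment changed"
    return evidence
```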

## Phase 2 — per-recipe test authoring (design, 2026-05-28)
Inherits the Phase 1d/1e shared-deployment + additive-overlay + op/assertion-split model. Phase 2
adds **content**, not infra, with a few small harness primitives ported from
`references/recipe-maintainer/utils/tests/helpers.py`.

- **Per-recipe layout (per plan §4.1).** The cc-ci `tests/<recipe>/` dir continues to use the
  Phase-1d/1e overlays at the top level (`test_install.py`, `test_upgrade.py`, `test_backup.py`,
  `test_restore.py`, `ops.py`, `recipe_meta.py`, optional `install_steps.sh`). NEW Phase-2
  subdirectories:
  - `tests/<recipe>/functional/` — parity-port tests (one per recipe-maintainer `tests/*.py`) +
    ≥2 NEW recipe-specific functional tests (P2/P3). Each file is `test_*.py` (pytest-discoverable);
    each parity port carries a **`SOURCE = "recipe-info/<recipe>/tests/<file>"`** comment near the
    top so the audit trail is in the file, not just in PARITY.md.
  - `tests/<recipe>/playwright/` — browser flows (P6) where the app's UX is a UI flow. Same
    `test_*.py` convention; each file imports `playwright.sync_api`.
  - `tests/<recipe>/PARITY.md` — required mapping table (P2) with one row per
    recipe-info parity test: `| recipe-maintainer file | cc-ci file | what's verified | status |`.
    A deliberate non-port is a documented row in DECISIONS.md (linked from PARITY.md), not a silent
    omission.
- **Discovery for the new subdirs.** `runner/harness/discovery.custom_tests` recurses into
  `tests/<recipe>/functional/` and `tests/<recipe>/playwright/` (in addition to the top-level glob),
  so Phase-2 functional tests run as part of the **custom** stage automatically. Repo-local (HC2)
  gate still applies if the recipe is approved; otherwise only cc-ci's own functional/ + playwright/
  run. The top-level `test_install.py`/etc. continue to drive the lifecycle overlays — the
  `functional/` + `playwright/` files are **always custom-stage**, never lifecycle (so they don't
  perform an op; they assert against the post-install live deployment).
- **Vendored helpers in `runner/harness/`.** Capabilities ported from
  `recipe-maintainer/utils/tests/helpers.py` (cc-ci is self-contained at runtime and does NOT import
  recipe-maintainer's workspace, per plan §8 default):
  - `harness.http` — `http_get(url, headers=, timeout=) -> (status, json_or_None)`,
    `http_post(...)`, `retry_http_get(url, timeout=, **)`, `wait_for_http(url, label, max_wait=)`,
    `assert_converges(fn, description, max_wait=, interval=)`. (Several variants already exist as
    `lifecycle.http_fetch/http_get/http_body`; the `harness.http` module is the **canonical**
    Phase-2 HTTP API for tests, and the `lifecycle.*` helpers stay for infra-level checks.)
  - `harness.abra_tty` — `script -qefc "abra …" /dev/null` wrapper for the abra commands that
    require a TTY (backup/restore/secret/run/logs/lint), used by parity tests that drive abra
    directly. Lifecycle already exposes typed wrappers — this is for tests that need raw shell-abra.
  - `harness.deps` — dependency resolver primitive. Reads `tests/<recipe>/recipe.toml`
    (`requires` / `test_requires`), deploys each declared dep via the same `lifecycle.deploy_app`
    + `wait_healthy` path (so the dep is a real `<dep[:4]>-<6hex>.ci.commoninternet.net` on the
    same swarm), persists per-run, tears down with the parent in the orchestrator's `finally`.
    Heavy recipes deploy sequentially; `MAX_TESTS`/node budget is the cap.
  - `harness.sso` — OIDC-flow primitive (Q2 deliverable). Given a deployed provider domain and a
    recipe-defined realm/client/test-user, performs the full "deploy provider → setup realm/client
    via admin API → obtain access token (password + client-credentials grants) → assert protected
    API call accepts it" assertion. Reusable by every SSO-dependent recipe (cryptpad, lasuite-*,
    immich, etc.). Setup scripts ported from `recipe-info/<dep>/setup_<provider>_integration.py`.
  - `harness.data_integrity` — backup data-integrity primitive: a recipe-aware "seed a marker
    → backup → mutate → restore → assert seeded marker survived" helper around `lifecycle.exec_in_app`
    / `http_get` (the recipe chooses the marker mechanism, the helper guarantees the pattern).
- **Run-scoped credentials for SSO/recipe-specific tests** (plan §4.4 class-B). Generated secrets
  (realm/client/test-user passwords, API tokens) persist for the run via the existing
  `runs/<app-name>/` mechanism (Phase 1d). Destroyed at teardown alongside abra secrets/volumes.
- **Recipe-versioned tests (anti-anchoring).** Per plan §7.1, tests read versions/endpoints
  dynamically (the app's own discovery endpoints, env from `live_app`) — never hardcode published
  release values. Each functional test file declares the recipe-info SOURCE path it ports from so
  the Adversary can audit parity cold.
- **Heavy-recipe parking.** Drone's `MAX_TESTS=1` + per-build timeout already serialize runs; for
  Phase 2 we DO NOT lift that cap. Within a single run, the orchestrator deploys deps sequentially
  (never concurrently) before the recipe-under-test, per plan §4.2.
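
As an illustration of the `assert_converges(fn, description, max_wait=, interval=)` primitive listed above, here is a minimal polling sketch. Only the four named parameters come from this document; the default values and the injectable `sleep`/`clock` arguments are assumptions added so the example can be exercised without real waiting.

```python
import time


def assert_converges(fn, description, max_wait=60.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `fn` until it returns a truthy value or `max_wait` seconds elapse.

    `fn` may either return a falsy value or raise AssertionError to signal
    "not yet"; the last failure is included in the final error message.
    `sleep` and `clock` are hypothetical injection points for testing.
    """
    deadline = clock() + max_wait
    last_error = None
    while True:
        try:
            result = fn()
            if result:
                return result
            last_error = AssertionError(f"returned {result!r}")
        except AssertionError as exc:
            last_error = exc
        if clock() >= deadline:
            raise AssertionError(
                f"{description} did not converge within {max_wait}s: {last_error}"
            )
        sleep(interval)
```

A functional test would typically wrap an HTTP probe, for example `assert_converges(lambda: http_get(url)[0] == 200, "app serves 200")`, so that slow-starting services do not fail on the first request.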

## Phase 2 Q3.4 — cryptpad create-pad deeper test deferral (2026-05-28)

**Status:** Deferred to Q3.4 follow-up (or Q5 catch-up), with Adversary sign-off pending per
plan §7.1.

**What's deferred:** The "create-an-object + read-it-back" deep test for cryptpad —
authenticate-and-create a real pad in the browser, type a uniquely-marked content string, reload
the page (retaining the client-side encryption key in the URL fragment), assert the marker
survives. This is the canonical create-and-read-back per plan §4.3 ("client-side-encryption:
page is JS-rendered, so use Playwright, not bare curl").

**Why deferred (the technical reason):**
- CryptPad's pad-creation client-side flow is **version-specific**. In the recipe under test
  (10.6.0+5.7.0), visiting `/pad/` does NOT auto-inject a fragment-keyed pad URL; CryptPad
  requires the user to explicitly click a "new rich text" / "new pad" link from the landing
  page, AND those UI selectors (`.cp-apps-grid a`, `[data-app='pad']`, `a[href*='/pad/']`) are
  not stable across CryptPad versions.
- Three attempted drafts during Q3.4 each failed cold on this:
  1. Type + reload + content-survives: contenteditable inside nested iframe with origin
     mismatch (SANDBOX_DOMAIN).
  2. Direct-`/pad/`-then-fragment: no fragment ever appeared on this version.
  3. Click-fallback for known app-launch selectors: none of the candidate selectors matched.

**The maximal testable subset that IS shipped (P3 floor met):**
- `tests/cryptpad/functional/test_health_check.py` — parity HTTP 200.
- `tests/cryptpad/functional/test_spa_assets.py` — CryptPad branding + canonical asset paths
  in served HTML. Catches the wedged-server-fallback-page failure mode.
- `tests/cryptpad/playwright/test_pad_create.py` — Chromium renders the SPA, asserts brand
  + canonical asset references + zero non-filtered JavaScript console errors.

The Playwright test exercises the JS pipeline in a real browser (per §4.3 directive); the
piece NOT exercised is the user-action-driven pad lifecycle. **What's required to lift the
deferral:** pin a specific CryptPad app-launch contract (CryptPad's source has app-launch
URL patterns like `/pad/?new=1` on some versions) OR write a Playwright helper that walks the
SPA's main menu via a stable accessibility tree (role-based selectors instead of CSS).

Adversary may file F2-N requesting full create-pad coverage; the answer above is the
honest technical reason + the maximal subset. Logged here per plan §7.1.


---

## Phase 2 — nested DOMAIN-derived subdomains flattened to single-label wildcard siblings

**Decision (settled):** When an enrolled recipe routes additional services on **nested subdomains
derived from `DOMAIN`** (e.g. lasuite-drive `MINIO_DOMAIN="minio.${DOMAIN}"` +
`COLLABORA_DOMAIN="collabora.${DOMAIN}"`; lasuite-meet `LIVEKIT_DOMAIN="livekit.${DOMAIN}"`), the
recipe's `recipe_meta.EXTRA_ENV(domain)` MUST override those vars to a **single-label sibling under
the wildcard** — `minio-<domain>`, `collabora-<domain>`, `livekit-<domain>` — NOT the recipe's
default `<svc>.<domain>`.

**Why:** cc-ci's TLS cert is the operator's pre-issued wildcard `*.ci.commoninternet.net` (+ bare
`ci.commoninternet.net`) — §4.0/§1.5, renewed out-of-band, no ACME. A wildcard matches exactly **one**
label. The per-run app domain is already one label (`lasuite-drive-pr<n>-<sha>.ci.commoninternet.net`),
so a nested `minio.lasuite-drive-pr<n>-<sha>.ci.commoninternet.net` is a **2-label** name the wildcard
does NOT cover → Traefik would serve an invalid cert on that router and the service is unreachable
over HTTPS. Re-prefixing with a hyphen keeps it one label (`minio-lasuite-drive-pr<n>-<sha>` +
`.ci.commoninternet.net`), covered by the same wildcard, routed by Traefik's swarm provider with **no
cert work and no gateway change** (the gateway already passes the whole wildcard, §4.0). We must NOT
mint per-host certs / ACME for these (class-A1 boundary, §9).

**Scope:** purely a per-recipe `EXTRA_ENV` concern (no shared-harness change). Recipes with no
DOMAIN-derived nested subdomains (most) are unaffected.
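
A sketch of the override and of the one-label rule behind it. `EXTRA_ENV(domain)` is the hook named above; that it returns a plain dict, the `wildcard_covers` helper, and the example PR number and sha in the usage below are assumptions made for illustration.

```python
WILDCARD = "*.ci.commoninternet.net"


def wildcard_covers(wildcard: str, host: str) -> bool:
    """True if `host` is covered by a `*.suffix` wildcard (exactly one extra label)."""
    if not wildcard.startswith("*."):
        return wildcard == host
    suffix = wildcard[1:]                 # ".ci.commoninternet.net"
    if not host.endswith(suffix):
        return False
    label = host[: -len(suffix)]          # whatever the "*" would have to match
    return bool(label) and "." not in label


def EXTRA_ENV(domain: str) -> dict:
    """Flatten nested `<svc>.<domain>` names to single-label `<svc>-<domain>` siblings."""
    return {
        "MINIO_DOMAIN": f"minio-{domain}",
        "COLLABORA_DOMAIN": f"collabora-{domain}",
    }
```

For a per-run domain such as `lasuite-drive-pr12-abc123.ci.commoninternet.net` (made-up values), `wildcard_covers(WILDCARD, "minio." + domain)` is false, while every value returned by `EXTRA_ENV(domain)` is covered.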

## Phase 2 — `services_converged` treats a `replicas: 0` one-shot as converged

**Decision (settled):** `runner/harness/lifecycle.py::services_converged` now considers a service
converged when `cur == want` (desired replica count met), removing the prior
`or want == "0"` rejection.

**Why:** lasuite-drive's `minio-createbuckets` is declared `deploy: {mode: replicated, replicas: 0,
restart_policy: {condition: none}}` — an **on-demand one-shot** (scaled up manually only when buckets
need (re)creating; it `mc mb …` then `exit 0`). `docker stack services` reports it `0/0`. The old
check rejected any `want == "0"` row, so the stack could **never** report converged → every deploy
hung until `deploy_timeout`. A service AT its desired count (including 0/0) is converged; a service
still spinning up shows `0/1` (`cur != want`) and is correctly not-yet-converged, so the HTTP
readiness wait still gates real liveness. Safe for all currently-green recipes (their services are
all N/N with N>0; the `0/0` case did not previously occur). Buckets/migrations that the one-shot
performs are run on-demand in the recipe's `setup_custom_tests.sh` (post-deploy), not relied upon for
generic-install convergence (the SPA at `/` serves 200 without them).
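
The convergence rule can be sketched as follows. The real `services_converged` signature is not reproduced here; this version assumes the caller has already captured `<name> <cur>/<want>` rows (for instance from `docker stack services --format '{{.Name}} {{.Replicas}}'`), and treating an empty listing as not converged is also an assumption.

```python
def services_converged(rows: list) -> bool:
    """Return True when every service row shows cur == want, including 0/0.

    `rows` are strings like "stack_app 1/1" (an assumed, pre-captured format).
    """
    if not rows:
        return False  # assumption: no services listed means the stack is not up yet
    for row in rows:
        _name, replicas = row.split()[:2]
        cur, want = replicas.split("/")
        if cur != want:
            return False  # e.g. 0/1: still spinning up
    return True
```

With this rule `["drive_app 1/1", "drive_minio-createbuckets 0/0"]` is converged, whereas any `0/1` row keeps the stack in the not-yet-converged state.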

## 2026-05-28 — Docker Hub auth: declarative config.json via sops (rate-limit fix) — SETTLED

**Context.** Heavy Phase-2 recipe deploys exhausted Docker Hub's anonymous pull rate limit
(100/6h per shared IP `68.14.43.142`) → `toomanyrequests` blocked all new deploys. Operator
provided a read-only Docker Hub PAT (Class A1 registry creds, plan §1.5): `DOCKERHUB_USERNAME=nptest2`
+ `DOCKERHUB_TOKEN` in `/srv/cc-ci/.testenv`. Authenticated pulls = 200/6h **per-account**.

**Decision.** Wire it declaratively (survives a 1c NixOS rebuild), not just an imperative login:
- **Secret:** `secrets/secrets.yaml` (cc-ci-secrets submodule, commit `cdd5e0a`) gains key
  `dockerhub_auth` = `base64("nptest2:<PAT>")` — i.e. the exact `auth` field docker config.json
  wants, so the nix template is a pure render (no runtime base64). sops-encrypted to host+master age
  recipients (edited on cc-ci using its ssh-host-key→age identity via `nix shell nixpkgs#sops`;
  plaintext shredded; PAT never committed plaintext nor exposed in process args/logs).
- **Render:** `nix/modules/secrets.nix` adds `sops.secrets.dockerhub_auth` + a
  `sops.templates."docker-config.json"` that renders `/root/.docker/config.json` (0600, root) at
  activation. It becomes a symlink to `/run/secrets/rendered/docker-config.json`.
- **Why /root:** the drone exec runner runs pipelines as `User=root` (drone-runner.nix), and manual
  deploys ssh in as root — so `/root/.docker/config.json` covers both the `!testme` CI path and
  manual ops. Single config, single user.
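
For reference, the stored `dockerhub_auth` value and the rendered file have the shape sketched below. This Python is illustrative only (the actual render is the nix template); the function name and the placeholder credentials are assumptions, while `https://index.docker.io/v1/` is the key Docker uses for Docker Hub in `config.json`.

```python
import base64
import json


def docker_config_json(username: str, token: str,
                       registry: str = "https://index.docker.io/v1/") -> str:
    """Build a docker config.json body whose `auth` field is base64("user:token")."""
    auth = base64.b64encode(f"{username}:{token}".encode("utf-8")).decode("ascii")
    return json.dumps({"auths": {registry: {"auth": auth}}}, indent=2)


if __name__ == "__main__":
    # Placeholder credentials; never print a real PAT.
    print(docker_config_json("example-user", "example-token"))
```

Storing the already-encoded `auth` string in sops means the template only has to interpolate one value, which is why no base64 step is needed at activation time.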

**Swarm-propagation question — RESOLVED empirically (no `--with-registry-auth` / pre-pull needed).**
The operator/Adversary flagged that a node `docker login` may NOT propagate to swarm SERVICE-task
pulls. Tested on cc-ci with the authenticated config.json in place:
- Account ratelimit baseline 197/200 (source = account hash `b662dd8b-…`, not the IP).
- Deployed **uncached** `n8nio/n8n:2.20.6` via abra (`RECIPE=n8n STAGES=install`). The swarm service
  task pulled it to `1/1 Running` with **no `toomanyrequests`**.
- Account counter dropped 197 → 196 (manager manifest resolution) → **195** (agent layer-manifest
  pull), source still the account hash. So abra's `docker stack deploy` propagates the cred to the
  swarm task pull on this single-node swarm — billed to the account, not the anon IP.
- Corroborating: the earlier lasuite-drive deploy resolved **12** images with no `toomanyrequests`
  while anon budget was ≤4 — impossible anonymously → manager resolution is authenticated too.

So: declarative root `config.json` is sufficient end-to-end here; `--with-registry-auth` is not
required (abra/SDK attaches it). **Caveat (Phase 2b):** 200/6h may still be tight for a full ~18-recipe
sweep; the permanent structural fix is a registry pull-through cache authenticated with this same PAT.

---

## Phase 2w — warm canonical + `--quick` (2026-05-28)

**Stable-domain scheme for warm apps: `warm-<recipe>.ci.commoninternet.net`.** Distinct from cold
per-run `<recipe[:4]>-<6hex>` (naming.app_domain) so a warm app is never confused with a disposable
cold run. Live-warm keycloak = `warm-keycloak.ci.commoninternet.net`; data-warm canonicals (W1) =
`warm-<recipe>...`. Risk to watch: longer stack name vs swarm's 64-char config/secret limit —
verified per-recipe on first deploy; shorten the scheme if any recipe's secret name overflows.

**Realm is the per-run isolation unit on the shared live-warm keycloak (WC1).** Instead of
co-deploying a fresh keycloak per dependent run, dependents use the one live-warm keycloak and create
a **per-run namespaced realm+client+user**, deleted at run teardown. Realm name =
`<parent_recipe>-<6hex>` where 6hex is the parent's per-run domain label suffix — unique per
(parent, pr, ref) so concurrent dependents never collide, and traceable for debugging. (Was
`realm=parent_recipe`, which would collide across concurrent same-recipe runs.)
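
A minimal sketch of the realm naming, assuming the parent's per-run domain has the `<label>-<6hex>` first label described above. The helper name `realm_name` and the example domain are hypothetical.

```python
import re

_HEX6 = re.compile(r"^[0-9a-f]{6}$")


def realm_name(parent_recipe: str, parent_domain: str) -> str:
    """Derive `<parent_recipe>-<6hex>` from the parent's per-run domain."""
    first_label = parent_domain.split(".", 1)[0]     # e.g. "cryp-a1b2c3"
    suffix = first_label.rsplit("-", 1)[-1]          # e.g. "a1b2c3"
    if not _HEX6.match(suffix):
        raise ValueError(f"no 6-hex run suffix in {parent_domain!r}")
    return f"{parent_recipe}-{suffix}"
```

Two concurrent runs of the same parent recipe get different suffixes and therefore different realms, which is the collision the older `realm=parent_recipe` scheme could not avoid.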

**Warm keycloak is declarative INFRA, not warm DATA.** The live-warm keycloak service is brought up
by a Nix systemd-oneshot reconciler (converges to deployed+healthy at the stable domain), exactly
like the traefik recipe deploy — so it IS in the D8 reproducibility closure (re-warmable from
scratch) and self-heals on activation/boot. Only warm *volumes/snapshots* (W1+) are cache and
excluded from D8. The keycloak's realm data is ephemeral per-run, so there is nothing persistent to
exclude.

**Live-warm is an optimization layer with a cold fallback.** If no warm keycloak is present (e.g. a
from-scratch host before the reconciler has run, or the warm app is down), the keycloak dep path
falls back to the existing cold co-deploy so dependent runs still work. The warm path is preferred
when available.

## Phase 2w — design update: unpinned warm/infra + health-gated rollback (2026-05-28/29)

**Warm/infra apps (traefik + keycloak) auto-update to LATEST nightly, health-gated (operator).**
Supersedes the W0.3 pinned `kcVersion`. Keycloak is now unpinned like traefik: reconciler `abra
recipe fetch` latest + chaos deploy; keep secret-generate-only-if-missing + health-wait. D8 holds
because the recipe is fetched at *activation* (runtime), so the nix store closure is byte-identical
regardless of which keycloak version is live.

**Snapshot helper (WC3) — format + path.** `runner/harness/warmsnap.py`. A snapshot is a **raw tar
of each docker volume belonging to the app's stack**, taken **while the app is undeployed** (nothing
writing → consistent). Stored under `/var/lib/ci-warm/<recipe>/` as `<recipe>.snapshot.tar` + a
`<recipe>.meta.json` (commit/version/timestamp/volume list). **One last-good per app**, replaced
**atomically** (write to `.tmp` then `rename`). Restore: for each volume, clear `_data` and untar
back. Docker volumes are stack-scoped (`<stack>_<vol>`); the helper enumerates them via
`docker volume ls` filtered to the stack. Reused by WC1.1 (pre-upgrade snapshot of keycloak) and WC5
(promote-on-green-cold). Warm snapshots are **cache, excluded from the D8 closure** (WC8).
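
The atomic "write to `.tmp` then `rename`" replacement can be sketched as below for the metadata file. The helper names and the JSON key names are assumptions (the document only lists commit/version/timestamp/volume list); the tar itself would be replaced the same way.

```python
import json
import os
from pathlib import Path


def write_atomic(path: Path, data: bytes) -> None:
    """Replace `path` atomically: write a sibling .tmp file, fsync, then rename."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())       # make sure the bytes hit disk before the swap
    os.replace(tmp, path)           # atomic on POSIX when both are on one filesystem


def write_meta(snap_dir: Path, recipe: str, version: str, commit: str,
               volumes: list, timestamp: str) -> Path:
    """Write `<recipe>.meta.json` (key names are illustrative) next to the snapshot."""
    snap_dir.mkdir(parents=True, exist_ok=True)
    meta = {"version": version, "commit": commit,
            "timestamp": timestamp, "volumes": volumes}
    target = snap_dir / f"{recipe}.meta.json"
    write_atomic(target, json.dumps(meta, indent=2).encode("utf-8"))
    return target
```

Because the rename only happens after a complete write, a crash mid-snapshot leaves the previous last-good in place rather than a truncated file.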

**Alert mechanism — sentinel files relayed by the Builder loop.** The warm/infra reconciler is an
autonomous bash systemd unit on cc-ci; it cannot call the agent's `PushNotification` tool. So a
reconciler that rolls back (WC1.1) or holds a major/manual-migration upgrade (WC1.2) writes a JSON
**alert sentinel** to `/var/lib/ci-warm/alerts/<ts>-<app>-<reason>.json` (fields: app, reason
[rollback|held-major|held-manual-migration], from_version, to_version, release_notes, ts). The
Builder loop, each wake, scans that dir; for each new alert it (a) issues `PushNotification` to the
operator, (b) records it in STATUS-2w/JOURNAL-2w, (c) archives it to `alerts/seen/`. This bridges the
autonomous reconciler to operator visibility (latency = next Builder wake; acceptable for an alert).
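
A sketch of both halves of the sentinel relay, under stated assumptions: the real reconciler is a bash unit, the timestamp format is invented for the example, and `notify` is a hypothetical callback standing in for the agent's `PushNotification` call. The field names follow the list above.

```python
import json
import shutil
from pathlib import Path


def write_alert(alerts_dir: Path, app: str, reason: str, from_version: str,
                to_version: str, release_notes: str, ts: str) -> Path:
    """Reconciler side: drop `<ts>-<app>-<reason>.json` into the alerts dir."""
    alerts_dir.mkdir(parents=True, exist_ok=True)
    path = alerts_dir / f"{ts}-{app}-{reason}.json"
    path.write_text(json.dumps({
        "app": app, "reason": reason, "from_version": from_version,
        "to_version": to_version, "release_notes": release_notes, "ts": ts,
    }))
    return path


def relay_alerts(alerts_dir: Path, notify) -> list:
    """Builder side: notify for each new alert, then archive it under seen/."""
    seen = alerts_dir / "seen"
    seen.mkdir(parents=True, exist_ok=True)
    relayed = []
    for path in sorted(alerts_dir.glob("*.json")):   # non-recursive: skips seen/
        alert = json.loads(path.read_text())
        notify(alert)
        shutil.move(str(path), str(seen / path.name))
        relayed.append(alert)
    return relayed
```

Archiving only after `notify` returns means a crash between the two steps re-sends the alert on the next wake instead of dropping it.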

**Re-sequence:** WC1.1's keycloak rollback needs the WC3 snapshot helper, so build that FIRST, then
rewrite the reconciler ONCE into the unpinned + WC1.2-safety-gated + WC1.1-health-gated-rollback form
(avoids reworking the reconciler twice). The W0.3 reconciler is INTERIM until then.

## Phase 2w — W0.6 reconciler: version model + deploy-by-tag (2026-05-29)

**Reconcile entrypoint in Python, packaged in the nix store.** `runner/warm_reconcile.py`, invoked by
the systemd unit as `${pyEnv}/bin/python3 ${../../runner}/warm_reconcile.py <app>` (the runner/ dir is
copied into the store → D8-clean, no dependence on the /root/cc-ci checkout). Reuses
warmsnap/sso/abra/lifecycle so there is ONE snapshot impl (also used by the runner for WC5). Replaces
the bash reconcile in warm-keycloak.nix.

**"latest" = newest published version TAG, deployed pinned (not chaos-of-main).** WC1.2's "major
recipe-version bump" detection needs comparable versions, which chaos (deploy main HEAD) doesn't give.
So the reconciler resolves latest = `git tag | sort -V | tail -1` (valid coop-cloud version tags),
records current = the app .env `VERSION`, and deploys the chosen tag pinned (`abra app deploy <domain>
<version> -o -n -f`, after `git checkout <tag>`). "Auto-update to latest" is satisfied by converging
to the newest tag; "chaos" in the operator note is read as "auto-deploy latest", and tag-pinning is
the correct mechanism for a version-gated auto-update.

**coop-cloud version format is `<recipe-semver>+<app-version>` (observed), not the plan's
`<upstream>+<recipe-semver>`.** Evidence: keycloak `10.7.1+26.6.2` → image `keycloak:26.6.2`; n8n
`3.2.0+2.20.6` → image `n8nio/n8n:2.20.6` (the post-`+` part is the app image tag). So the **recipe
semver is the part BEFORE `+`**. WC1.2's "major recipe bump = breaking" keys off the major (first)
component of the pre-`+` recipe semver (e.g. 3.x→4.0 = held). Secondary signal: scan the target's
`releaseNotes/<version>.md` for manual-migration markers.
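
The version model above can be sketched as follows. The helper names are hypothetical, and the tag list in the usage note is made up apart from the two versions quoted above. Unlike `sort -V`, this sketch orders tags by the recipe semver only, which is an assumption that matters solely if two tags share a recipe semver.

```python
import re

# <recipe-semver>+<app-version>, e.g. "10.7.1+26.6.2"
_TAG_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\+(.+)$")


def parse_tag(tag: str):
    """Return ((major, minor, patch), app_version), or None if not a version tag."""
    m = _TAG_RE.match(tag)
    if m is None:
        return None
    return (int(m[1]), int(m[2]), int(m[3])), m[4]


def latest_tag(tags: list):
    """Pick the newest valid version tag by recipe semver (pre-`+` part)."""
    valid = [(parse_tag(t)[0], t) for t in tags if parse_tag(t) is not None]
    return max(valid)[1] if valid else None


def is_major_bump(current: str, target: str) -> bool:
    """True when the recipe semver's major component increases (WC1.2: hold)."""
    cur, tgt = parse_tag(current), parse_tag(target)
    if cur is None or tgt is None:
        raise ValueError("not a <recipe-semver>+<app-version> tag")
    return tgt[0][0] > cur[0][0]
```

For example, `latest_tag(["3.2.0+2.20.6", "3.10.0+2.21.0", "junk"])` picks `3.10.0+2.21.0` (numeric, not lexical, ordering), and `is_major_bump("3.2.0+2.20.6", "4.0.0+3.0.0")` is true, so that upgrade would be held.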

**Scope order for W0.6:** keycloak first (the W0 focus, stateful → snapshot path); apply the same
health-gated + safety-gate pattern to traefik (stateless, version-rollback-only) afterward by
migrating proxy.nix onto the shared reconcile entrypoint.

## Phase 2w — W1 canonical registry design (WC2/WC3) (2026-05-29)

**Enrollment is declarative per-recipe via `recipe_meta.WARM_CANONICAL = True`** (consistent with how
DEPS/EXTRA_ENV are declared — enrolling a recipe stays a `tests/<recipe>/` change, D5). A recipe so
flagged gets a DATA-WARM canonical. Prove the model on a couple of recipes (custom-html simplest:
stateful, no external DB), NOT all (the nightly sweep populates the rest over time).

**Stable domain `warm-<recipe>.ci.commoninternet.net`** (already decided for keycloak; same scheme for
canonicals). Distinct from cold `<recipe[:4]>-<6hex>`. Watch the swarm 64-char secret-name limit
per recipe on first deploy.

**Known-good state per canonical, under `/var/lib/ci-warm/<recipe>/`:** `last_good` (version string,
already written by warm_reconcile), `snapshot/` (warmsnap, W0.5), and a small `canonical.json`
registry record `{recipe, domain, version, commit, status, ts}`. The DATA VOLUME is retained while
the app is undeployed (data-warm). These are cache (excluded from D8, WC8).

**Data-warm lifecycle (new `runner/harness/canonical.py`):** `is_enrolled(recipe)` (reads
WARM_CANONICAL), `canonical_domain(recipe)`, `read/write_registry(recipe)`, `deploy_canonical(recipe)`
(deploy `warm-<recipe>` at last_good, reattaching the retained volume → warm boot),
`undeploy_keep_volume(recipe)` (undeploy, volume retained = idle data-warm),
`seed_canonical(recipe, version, commit)`
(record + snapshot; the volume becomes the canonical). LIVE-warm (keycloak, always up) vs DATA-warm
(canonicals, undeployed when idle) both use `warm-<recipe>` + warmsnap.

**W1 scope vs W3:** W1 builds the registry + data-warm lifecycle and proves it (seed a custom-html
canonical → undeploy keep volume → redeploy reattach → data survives; re-warmable from scratch).
**Automatic promote-on-green-cold (WC5) + nightly (WC6) are W3** — for W1 the canonical is seeded
programmatically to prove the model; the cold-advances-canonical wiring comes later.

## Phase 2w — W3 WC5 promote-on-green-cold mechanism (2026-05-29)

**Promote = re-seed the canonical from a fresh deploy of the green-verified latest (NOT "keep the
cold run's per-run volume").** Rationale: a cold run uses a fresh per-run domain `<recipe>-<6hex>`
with a fresh volume (cold stays authoritative + fresh); its volume names are per-run-specific and
differ from the canonical's `warm-<recipe>` volume names, so the per-run volume can't be directly
reused as the canonical without a fragile name-remap. AND the cardinal guardrail "never lose the
known-good" forbids touching the existing canonical until a new green one is ready.

So: on a run that is **enrolled (recipe_meta.WARM_CANONICAL) + GREEN + COLD (not --quick) + on LATEST
(no PR head, i.e. REF empty — the nightly/manual-latest run, NOT a PR `!testme`)**, AFTER the normal
per-run teardown, the orchestrator PROMOTES: deploy `warm-<recipe>` at latest → wait healthy →
undeploy → `canonical.seed_canonical(version=latest, commit=head)` (snapshot-while-undeployed +
atomic registry/snapshot replace). The old known-good is replaced ATOMICALLY only on a green promote
(a red run never reaches promote → known-good safe). The canonical's data = a clean install of the
green-verified latest (a valid known-good baseline; --quick reattaches + upgrades it). Cost: one extra
(canonical) deploy per promote — acceptable for cold/nightly (not latency-sensitive). The FIRST such
green run SEEDS the canonical. `--quick` never promotes (proven W2). Only cold advances (WC5).

Promote gate predicate (unit-tested): `is_enrolled(recipe) and overall==0 and not quick and not ref`.
(`not ref` = a catalogue-latest run, i.e. the nightly sweep or a manual `RECIPE=<r>` run — a PR
`!testme` carries REF=PR-head and must NOT advance the canonical to a PR's code.)
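
The gate predicate, written out as a standalone function for illustration. The function name and the choice to pass `enrolled` in as a boolean (instead of calling `is_enrolled(recipe)`) are assumptions that keep the sketch self-contained.

```python
def should_promote(enrolled: bool, overall: int, quick: bool, ref: str) -> bool:
    """Promote only for an enrolled recipe on a green, cold, catalogue-latest run.

    `overall` is the run's exit status (0 = green); `ref` is empty for a
    nightly/manual-latest run and carries the PR head for a PR `!testme`.
    """
    return bool(enrolled and overall == 0 and not quick and not ref)
```

Every failing condition short-circuits to "do not promote", so a red run, a `--quick` run, or a PR run leaves the existing known-good canonical untouched.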

## Phase 2 — heavy-recipe upgrade tier disk constraint (28GB host) — SETTLED finding @2026-05-29
The upgrade tier (HC1: prev published → PR-head via in-place `abra app deploy --chaos`) cannot
complete for recipes whose successive releases bump multi-GB image tags, because the rolling update
must hold BOTH versions on disk transiently. Proven on lasuite-drive: onlyoffice 9.2 → 9.3.1.2
(3.94GB each) + two collabora versions → ~10GB of office images at once vs ~14GB docker headroom on the
28GB host → disk at 99% → deploy fails. **No harness fix is possible** (the prev images are running, so
they are neither dangling-prunable nor `rmi`-able when the new ones must be pulled). install/backup/restore/
custom (single version) fit and pass. Resolution = grow the host disk (Class A1 operator input,
DEFERRED.md 2026-05-29). Until then, heavy recipes are verified via their maximal testable subset
(install+backup+restore+custom) with the upgrade tier flagged as a genuine env-level (disk) blocker
per plan §7.1 (Adversary sign-off required). The cleanup runbook for an over-full host: `pkill -f
run_recipe_ci.py`; `docker stack rm <leftover>`; remove its volumes+secrets; `docker image prune -f`.