cc-ci/machine-docs/DECISIONS.md
autonomic-bot f59d8e6996 feat(2): Q3.2 lasuite-drive base enrollment + nested-subdomain + replicas:0 harness fixes
- harness: services_converged treats replicas:0 one-shot (minio-createbuckets) as
  converged (cur==want); removes the want==0 rejection that hung deploys. DECISIONS.md.
- recipe_meta.EXTRA_ENV flattens MINIO_DOMAIN/COLLABORA_DOMAIN to single-label wildcard
  siblings (the *.ci.commoninternet.net cert covers one label only). DECISIONS.md.
- lifecycle overlays (install/upgrade/backup/restore) + ops.py postgres ci_marker
  data-integrity (db user/name=drive). Parity health_check functional test. PARITY.md.
- DEPS=[keycloak] + OIDC/WOPI/upload functional tests deferred to the SSO iteration
  (probe-before-assert: prove the ~10-service base deploy converges first).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 19:54:31 +01:00

# DECISIONS — cc-ci Builder
Architecture decisions and dead-ends. One line of rationale each. (§0, §8)
## Settled
- **Wildcard TLS:** operator pre-issues wildcard cert at `/var/lib/ci-certs/live/`; Traefik file
provider serves it; **no ACME** for commoninternet.net. (Plan §4.0/§8 — fixed.)
- **Repo:** `git.autonomic.zone/recipe-maintainers/cc-ci`, private. Bot is org admin. (Bootstrap.)
- **Git credentials:** helper script in repo-local git config sources `/srv/cc-ci/.testenv` at call
time — no secret values stored in `.git/config` or commits.
- **Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26,
overrides plan §3 `modules/traefik.nix`).** Instead of a hand-rolled Traefik we deploy the
canonical Co-op Cloud `traefik` recipe via abra in **wildcard / file-provider mode**, for
end-to-end fidelity (canonical `web`/`web-secure` entrypoints + proxy/swarm conventions every
recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO
DNS token on the box:
- `WILDCARDS_ENABLED=1` + append `compose.wildcard.yml`; the pre-issued cert is fed as the
`ssl_cert`/`ssl_key` swarm secrets (v1) via `abra app secret insert … -f` from
`/var/lib/ci-certs/live/{fullchain,privkey}.pem`. The file provider serves it (`tls.certificates`).
- `LETS_ENCRYPT_ENV=` **empty** on the traefik app *and* on every test app → the recipe's
`tls.certresolver=${LETS_ENCRYPT_ENV}` label resolves to no resolver → routers serve the
wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
- Reproducibility (D8): `scripts/deploy-proxy.sh` is idempotent (ensures local abra server, fetches
recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in
`docs/install.md`. The custom `modules/traefik.nix` was removed; `modules/swarm.nix` keeps swarm
init + `proxy` net + firewall 80/443.
- **Renewal (manual, ~90d):** operator re-issues the wildcard at the same paths, then
`abra app secret rm traefik.ci.commoninternet.net ssl_cert -n` + re-insert at a new version (bump
`SECRET_WILDCARD_CERT_VERSION`) and redeploy. (Documented in docs/secrets.md at M7.)
- **abra teardown syntax** (for harness, §4.3): `abra app undeploy <d> -n`,
`abra app volume remove <d> -f -n`, `abra app secret remove <d> --all -n`. None take `--chaos`.
- **Infra bring-up = idempotent-reconcile systemd oneshots — SETTLED (M2, orchestrator steer
2026-05-26).** Every piece of swarm infra that abra deploys (traefik `modules/proxy.nix`, Drone
`modules/drone.nix`, later comment-bridge + dashboard) is a `systemd.services.<x>` with
`Type=oneshot` + `RemainAfterExit`, `after`/`requires` swarm-init + docker, `wants`
network-online, `wantedBy` multi-user, embedding its script via **`pkgs.writeShellApplication`**
(self-contained in the store, not a `/root/cc-ci` path). The script **reconciles** (inspect →
converge → no-op if correct) on *every* activation/boot — **no run-once sentinel** — so it
self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit)
on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to
`git clone` + `nixos-rebuild switch` + operator preconditions, no manual post-steps. The old
`scripts/deploy-*.sh` were folded into these modules and removed. `pkgs.abra` is provided via an
overlay (`modules/packages.nix`) so all modules share the one pinned build.
- *Cert rotation note:* the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the
wildcard means bumping `SECRET_WILDCARD_*_VERSION` (operator) so the next reconcile re-inserts.
Documented in docs/secrets.md at M7.
- **Trigger: POLLING primary, webhook optional — SETTLED (orchestrator design change 2026-05-27,
supersedes the earlier "keep webhook, do NOT pivot to polling" steer).** Hard constraint: the
bot/server runs at **READ level, never repo-admin**, and **never self-registers a webhook**.
- **Polling is PRIMARY and the source of truth for D1.** The bridge polls each enrolled repo's
open PRs for new `!testme` comments every `POLL_INTERVAL` (30s, within the ≤ 60s bound). Outbound
(cc-ci → git.autonomic.zone, the reliably-working direction), needs only read+comment. On
startup the first poll marks pre-existing comments seen so it doesn't fire on old comments.
- **Webhook is an OPTIONAL push optimization.** The `/hook` endpoint stays live (HMAC-verified)
so an *admin-registered* `issue_comment` webhook lowers latency, but the bridge never registers
one. Manual registration is documented in `docs/enroll-recipe.md`. Both paths share an
in-memory seen-set keyed by comment id → a comment seen by both fires at most once.
- **Commenter authorization via org membership (read-level, no admin).** Allowed iff
`GET /orgs/{owner}/members/{user}` → 204 (verified 2026-05-27: admits bot/trav/notplants, 404
for a non-member, works with bot read-level basic-auth) **or** the user is in the optional
`AUTH_ALLOWLIST`. Replaces the earlier `/collaborators/{user}/permission` check, which needs
repo-admin. Fail-closed on any error.
- **Enrollment** = add the repo to the bridge `POLL_REPOS` csv + ensure `tests/<recipe>/` exists.
No webhook is required for CI to work. (The root cause of the old webhook non-delivery no longer
matters because polling makes it irrelevant; the operator was whitelisting `ci.commoninternet.net` in
Gitea's `ALLOWED_HOST_LIST`, but D1 no longer depends on that.)
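
A minimal sketch of the trigger rules above: the shared seen-set, the first-poll priming, and the fail-closed membership check. All names here (`SeenSet`, `should_fire`, `is_authorized`, the `member_status` callable) are hypothetical and are not the bridge's real API. Two assumptions: a trigger comment is one whose body starts with `!testme`, and `AUTH_ALLOWLIST` is consulted before the membership lookup.

```python
from dataclasses import dataclass, field


@dataclass
class SeenSet:
    """In-memory dedup of comment ids, shared by the poll and webhook paths."""

    seen: set[int] = field(default_factory=set)

    def prime(self, comment_ids):
        """First poll after startup: mark pre-existing comments seen, fire nothing."""
        self.seen.update(comment_ids)

    def should_fire(self, comment_id, body):
        """Return True at most once per `!testme` comment id."""
        # Assumption: a trigger is any comment whose body starts with "!testme".
        if not body.strip().startswith("!testme"):
            return False
        if comment_id in self.seen:
            return False
        self.seen.add(comment_id)
        return True


def is_authorized(user, member_status, allowlist=frozenset()):
    """Allow iff the user is allowlisted or the org-membership lookup returns 204.

    member_status: callable(user) -> HTTP status of GET /orgs/{owner}/members/{user}.
    Any exception from the lookup is treated as a denial (fail-closed).
    """
    if user in allowlist:
        return True
    try:
        return member_status(user) == 204
    except Exception:
        return False
```

Because both paths call `should_fire` on the same object, a comment delivered by the webhook and later seen by the poller starts one build, not two.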
- **Resource safety: bound live test apps — SETTLED (orchestrator design change 2026-05-27,
plan §4.2/§4.3).** Do NOT keep multiple test apps deployed at once. Three layers, all configurable:
- **MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1** (`modules/drone-runner.nix`, `maxTests` let-binding).
Drone runs at most MAX_TESTS builds at once and **auto-queues the rest in its native pending
queue** — no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is
never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
- **Per-build TIMEOUT = 60 min** (`modules/drone.nix`, `buildTimeoutMinutes`; reconciled
best-effort via `PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60}` using the bridge's
Drone admin token, local `--resolve`, non-fatal). A build over the limit is cancelled by Drone →
the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue
once a test finishes OR times out".
- **Teardown + janitor backstop.** Each build deploys → runs the 3 stages → undeploys
(guaranteed `try/finally` in `conftest`/orchestrator). A SIGKILL'd/timed-out build can't run its
own teardown, so the **run-start janitor** (`lifecycle.janitor`, called before every deploy in
both fixtures + `run_recipe_ci`) reaps orphaned run apps as the backstop. At capacity=1 the CI
path will set `CCCI_JANITOR_MAX_AGE=0` (reap any orphan immediately — safe with no concurrent
runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default
2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
- Optional `concurrency: {limit: 1}` in the recipe-CI `.drone.yml` is a redundant belt — primary
mechanism is `DRONE_RUNNER_CAPACITY`. (Wired when the recipe-CI pipeline lands — see backlog.)
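
A sketch of the age-based janitor rule above, not the real `lifecycle.janitor`. Assumptions: `CCCI_JANITOR_MAX_AGE` is expressed in seconds, and the caller supplies a mapping of run-app domain to creation time. Both function names are hypothetical.

```python
import os
import time

DEFAULT_MAX_AGE = 2 * 60 * 60  # the 2h default used when capacity > 1


def janitor_max_age_seconds(env=None):
    """Read CCCI_JANITOR_MAX_AGE (assumed to be seconds); fall back to 2h."""
    env = os.environ if env is None else env
    raw = env.get("CCCI_JANITOR_MAX_AGE")
    return DEFAULT_MAX_AGE if raw in (None, "") else int(raw)


def orphans_to_reap(apps, max_age, now=None):
    """Return the run-app domains at least `max_age` seconds old.

    apps: mapping of app domain -> creation time (epoch seconds).
    With max_age == 0 every leftover app is reaped, which is only safe
    when no concurrent run can exist (capacity == 1).
    """
    now = time.time() if now is None else now
    return sorted(d for d, created in apps.items() if now - created >= max_age)
```

At capacity=1 the pipeline would export `CCCI_JANITOR_MAX_AGE=0`, so a run that starts after a SIGKILL'd build reaps its orphan immediately instead of waiting out the 2h window.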
- **D10 recipe #6: bluesky-pds (TLS-passthrough) SWAPPED → n8n — SETTLED (2026-05-27, plan §4.0
sanctions this swap-with-reason).** bluesky-pds routes via a Traefik **TCP router with
`tls.passthrough=true`** to an in-container **caddy** that terminates TLS itself and obtains its own
cert via **ACME**. cc-ci's design is the opposite: the operator gateway passes wildcard TLS through
to cc-ci's Traefik, which **terminates** it with the pre-issued static wildcard cert, and **ACME is
hard-forbidden** for commoninternet.net (no DNS token on the box — §4.0/§9). Serving bluesky-pds
would require either (a) ACME inside caddy (forbidden), or (b) injecting the wildcard cert into
caddy + a per-host TCP-passthrough router on cc-ci Traefik (recipe-internal surgery + a bespoke
proxy mode — not a clean shared-harness absorb). This is a genuine design conflict, not a harness
gap. Per the plan's explicit allowance, **bluesky-pds is a documented non-CI'd recipe** (reason
here), and **n8n** takes the 6th slot. The 5 required D10 categories are already covered by recipes
1–5 (simple=custom-html, single-DB+SSO=keycloak, stateful/no-DB=cryptpad, DB+media/large-volume=
matrix-synapse, multi-service+S3/object-storage=lasuite-docs); n8n adds a 6th real deployable app
(workflow automation) behind the normal terminate-at-Traefik path.
- **Docker Hub rate limit + mid-breadth prune — FINDING (2026-05-27).** D10 real-`!testme` breadth
runs exhausted Docker Hub's anonymous pull rate limit (lasuite-docs, 9 images, upgrade stage:
`toomanyrequests`). Two lessons: (1) **registry pull creds are an A1 operator input** needed for
reliable heavy-recipe deploys under load (request + sops-store + wire into docker daemon). (2)
**Don't `docker image prune -af` mid-breadth** — it evicts cached recipe images and forces re-pulls
that hit the limit. The first lasuite failure was disk pressure (90% full); pruning fixed disk but
triggered re-pulls → rate limit. Better: rely on the daily autoprune, prune only `dangling`
(not `-a`) between runs, or grow disk so heavy images stay cached. Net for D10: 5/6 recipes green
via real !testme; lasuite-docs gated on the rate limit (transient ~hours; durable fix = creds).
## Open (defaults from §8, to confirm as reality lands)
- **Deploy mechanism — SETTLED (M0):** `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run *on
cc-ci itself*, with the repo materialised on the host at `/root/cc-ci`. Chosen over
`--target-host`/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS
proxy (slow/fragile). Atomic rollback preserved by Nix generations (`nixos-rebuild --rollback`).
The switch is launched as a **detached transient systemd unit** (`systemd-run --unit=ccci-rebuild
--collect`) so it survives a momentary ssh-over-tailscale drop during activation. For the build
loop the host copy is synced from the sandbox clone via `tar | ssh` (rsync absent on host);
source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo
on a fresh host, then `nixos-rebuild switch --flake .#cc-ci`).
- **nixpkgs pin:** flake pins the exact rev cc-ci already ran (`50ab793…`) so the first rebuild
is a true no-op-then-base. Bump deliberately, never drift.
- **Webhook scope:** default per-repo via enroll script.
- **CI engine: Drone (per plan) — kept, with a noted risk.** nixpkgs 24.11 has Drone **server**
2.24.0 but `drone-runner-exec` is **abandoned (unstable-2020-04-19)** — the only exec runner Drone
ever shipped (upstream archived ~2021). The maintained fork **Woodpecker** (2.7.3, with NixOS
modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific
(D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern
Drone server (RPC protocol stable). **Fallback:** if the exec runner proves incompatible/broken,
pivot to Woodpecker (coop-cloud ships a `woodpecker` recipe too) and record it — like the traefik
pivot. Re-evaluate at the M2 gate.
- **Drone deployment shape — SETTLED (M2):** mirror the traefik pattern. The **server** is the
coop-cloud `drone` recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by
traefik at `drone.ci.commoninternet.net`, `LETS_ENCRYPT_ENV` empty → wildcard cert, no ACME),
with Gitea SSO (`compose.gitea.yml`). The **exec runner** runs as a Nix systemd service on the
host (`modules/drone-runner.nix`) so it can drive host abra/swarm (plan §4.2). One generated
`DRONE_RPC_SECRET` is shared: inserted as the server's `rpc_secret` swarm secret AND read by the
runner from sops. Reproducible deploy: `scripts/deploy-drone.sh`.
- Gitea OAuth app `cc-ci-drone` created under the bot (client_id
`ab4cdb9d-ee96-4867-875f-87384505fc52`, redirect `https://drone.ci.commoninternet.net/login`); client_secret +
rpc_secret stored sops-encrypted in `secrets/secrets.yaml` (A2 internal secrets).
- **Drone runner type:** exec (must drive host abra).
- **Secret tool — SETTLED (M0):** sops-nix. cc-ci decrypts at activation using its **ed25519 SSH
host key** as the age identity (`sops.age.sshKeyPaths`), so no extra key file to manage on the box.
Recipients in `/.sops.yaml`: the host age key (`age1h90ut…`, from ssh-to-age) + an off-box
**master recovery key** (`age1cmk26t…`; private half only at `/srv/cc-ci/.sops/master-age.txt` on
the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing
plaintext into `secrets/<f>.yaml` then `sops -e -i` (run inside the repo so `.sops.yaml` is found).
- **D10 recipe set:** lock six early. Candidates favouring already-mirrored: custom-html (simple),
cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3),
bluesky-pds (TLS-passthrough); together these cover all five categories. Confirm during M4–M6.5.
- **Per-run app domain scheme — adapted (M4, deviates from plan §4.0).** Plan §4.0 wanted
`<recipe>-pr<n>-<short-sha>.ci.commoninternet.net`, but Docker swarm config/secret names
(`<stackname>_<resource>_<version>`) must be ≤ 64 chars and abra derives `<stackname>` from the
domain (dots→`_`, hyphens kept). `.ci.commoninternet.net` alone is 22 chars, so long recipe names
+ config names overflow 64 (hit with `custom-html-pr0-m4demo…_nginx_default_conf_v6` = 66). New
scheme: **`<recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net`** (e.g. `cust-e084bd`) — short,
unique per run, collision-safe across recipes (full recipe in the hash). Human-readable
recipe/PR/ref context lives in the Drone build params + the PR comment, not the (ephemeral) domain.
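
A sketch of the naming arithmetic above. The hash function is not stated here, so SHA-256 is an assumption, and this sketch will not reproduce the `cust-e084bd` example. `run_domain`, `stack_name` and `fits_swarm_limit` are hypothetical names.

```python
import hashlib

BASE_DOMAIN = "ci.commoninternet.net"
SWARM_NAME_LIMIT = 64  # max length of <stackname>_<resource>_<version>


def run_domain(recipe, pr, ref):
    """Build <recipe[:4]>-<6hex(recipe|pr|ref)>.<base domain>."""
    # Assumption: SHA-256 over "recipe|pr|ref"; any stable hash would do.
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}.{BASE_DOMAIN}"


def stack_name(domain):
    """Mirror the derivation described above: dots become '_', hyphens are kept."""
    return domain.replace(".", "_")


def fits_swarm_limit(domain, resource, version):
    """Check a swarm config/secret name against the 64-char ceiling."""
    return len(f"{stack_name(domain)}_{resource}_{version}") <= SWARM_NAME_LIMIT
```

The label is always 11 characters (4 + `-` + 6), so the stack name is 33 characters and leaves 31 for `_<resource>_<version>`; `nginx_default_conf` at `v6` comes to 55 in total.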
- **abra recipe checkout is volatile — harness uses chaos+offline + a tests/ snapshot (M6).** Many
abra commands (`app ls`, `secret generate` without flags, version resolution) silently
`git checkout <version-tag>` in `~/.abra/recipes/<recipe>`, discarding a PR branch's files. To
test the *PR head code* (not a re-resolved tag): (1) `fetch_recipe` clones the mirror branch/ref
(private → bot token via per-command `http.extraHeader`, never persisted/logged); (2) all harness
abra calls that touch the recipe pass `-C` (chaos: use current checkout) `-o` (offline: no remote
fetch); (3) recipe-shipped `tests/` (D4) are **snapshotted to a temp dir right after fetch**, since
later abra commands still reset the checkout — the recipe-local stage runs from the snapshot.
## Risks
- **Disk — RESOLVED 2026-05-26.** Original 8.9 GiB root had only ~3.8 GiB free *and* a hard
**inode** ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on
inodes before bytes. Operator grew the VM to **28 GiB** (22 GiB free, 1.78M inodes / 1.21M free);
the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown +
periodic `docker image prune` to avoid regressing during M6.5 breadth.
## Dead-ends
- (none yet)
## Phase 1c (full reproducibility + genuine D8 live rebuild) — 2026-05-27
- **Secrets linkage = git SUBMODULE (deviates from plan §7 flake-input default).** `cc-ci-secrets`
is mounted as a submodule at `cc-ci/secrets/` rather than a flake `inputs.secrets`. Rationale: a
private flake input must be re-fetched at **every nix eval**, requiring the bot token persistently
in nix config/netrc on cc-ci AND the throwaway VM (a token in the store/config = a 2nd out-of-band
secret, which 1c forbids). A submodule makes `secrets/secrets.yaml` a plain path in the working
tree → `defaultSopsFile = ../secrets/secrets.yaml` is unchanged (minimal diff, trivially
byte-identical), and the only credential use is the one `git clone --recursive` at provisioning
("the two repos are *given*", Mission §1). Build invocation becomes
`nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` so the submodule tree is
included. (Revisit if `?submodules=1` proves unreliable on cc-ci's nix version.)
- **Bootstrap key for the throwaway VM = the existing RECOVERY (master) age key, via
`sops.age.keyFile`.** The recovery key (`age1cmk26…`, private at `/srv/cc-ci/.sops/master-age.txt`)
is already a sops recipient, so a fresh host with a *different* ssh host key still decrypts every
secret with no re-keying — this is exactly the §0 argument that defeats "host-key binding".
Provisioned to the VM at a fixed path (the ONE out-of-band secret). cc-ci itself keeps decrypting
via its host key (`age.sshKeyPaths`); secrets.nix will offer both identity sources. (Per-host
re-encrypt is cleaner for a *permanent* new instance — documented as the alternative, not used for
the throwaway test.)
- **Cert into git:** wildcard cert+key become sops secrets in `cc-ci-secrets`, decrypted at
activation back to `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` via
`sops.secrets.<name>.path`; proxy.nix keeps reading that path (now sops-sourced, not operator-drop).
- **cc-nix-test final sizing (C6) — SETTLED by operator 2026-05-27: PROMOTE the rebuilt VM.** The
freshly-rebuilt reproducible VM (the FINAL W5/C4-C5 clean-room throwaway) becomes the canonical
cc-nix-test; the operator will repurpose it for a live real-traffic test through the public gateway.
- **C6 teardown OVERRIDE (operator, 2026-05-27):** do NOT destroy the FINAL throwaway VM after
W5/C4-C5 PASSes — keep it RUNNING; defer its C6 teardown until the operator explicitly says
otherwise. This overrides the plan §5/§6 "destroy the throwaway" for that one VM only. All other
cleanup proceeds normally (the Builder's first throwaway was already destroyed; RAM accounting holds).
## Phase 1b — lint/format tooling (open decisions §6, settled W0)
- **Formatters/linters (RL1):** Nix = `nixpkgs-fmt` (format) + `statix` (lints) + `deadnix` (dead
code); Python = `ruff` (lint + format); Shell = `shellcheck` + `shfmt -i 2 -ci`; YAML = `yamllint`.
Kept `nixpkgs-fmt` over `alejandra` because it was already the repo `formatter` and devshell tool
(no extra churn / restyle of every .nix). All built from the already-pinned nixpkgs via a flake
`lint` devshell (`nix develop .#lint`) so CI and local use byte-identical tool versions.
- **Lint entrypoint:** `scripts/lint.sh` (check-only by default; `--fix` auto-applies). The
`.drone.yml` push pipeline runs it via `nix develop .#lint --command bash scripts/lint.sh`.
- **ruff strictness:** `select = [E,F,W,I,UP,B,C4,SIM]`, `ignore = [E501]` (line length is the
formatter's job; only un-splittable strings would trip it). `line-length=100`, `target=py311`.
- **Drone lint stage = FAIL (not warn).** The codebase is green now, so enforce from here on — an
unclean commit fails the `lint` step. (Resolves the §6 open question.)
- **Python type-checking (mypy/pyright): DEFERRED to IDEAS**, not added in 1b. The harness is small
and dynamically typed around `abra`/subprocess JSON; gradual typing is a larger effort than this
bounded pass warrants. Revisit if Phase 2's 18-recipe ramp shows type bugs.
- **blocking vs advisory split (§3):** treated as in the phase plan — tests-real, Nix-idempotent,
no-footguns, no-secrets, log-redaction, harness-DRY = blocking; readability/docs/arch-drift =
advisory unless a real plan deviation. Recorded per-finding in REVIEW-1b / BACKLOG-1b.
- **cc-ci self-CI push trigger:** the lint stage lives in the `event: push` pipeline. The Gitea→Drone
push webhook on this instance is flaky (`last_status: None`; documented §4.1) and predates 1b —
recipe CI uses polling as primary, but cc-ci's *own* self-test/lint relies on the push webhook.
The lint stage is correctly wired and proven green via the identical `nix develop .#lint` command;
reliably auto-firing it on every push is tracked as a (pre-existing) infra item, not a 1b lint gap.
## Phase 1b — repo layout (operator review items RL5/RL6, plan §7)
- **RL5 — all Nix code under `nix/`.** Moved `modules/`→`nix/modules/` and `hosts/`→`nix/hosts/`.
`flake.nix`/`flake.lock` STAY at the repo root (entry point) so the build ref `#cc-ci` and
`nixos-rebuild --flake '…#cc-ci'` are unchanged — only `flake.nix`'s internal
`./hosts/cc-ci/configuration.nix` → `./nix/hosts/cc-ci/configuration.nix` changed. Root-relative
refs inside the moved modules were re-based `../X` → `../../X` (secrets.nix → `../../secrets/`,
bridge.nix → `../../bridge/`, dashboard.nix → `../../dashboard/`); `configuration.nix`'s
`../../modules/*` imports are unchanged (both dirs moved under `nix/`, so the relative path still
resolves). **Toplevel is byte-identical (`8i3jcad9…`) before/after the move** — store derivations
are content-addressed on the copied file *contents*, and the module `.nix` files aren't part of the
runtime closure, so relocating folders doesn't change the build. (The operator anticipated a hash
change; in practice it's stable, which is even stronger for reproducibility.) Living docs
(README, architecture/install/secrets/enroll) + the `.drone.yml` comment updated to `nix/…`;
append-only history logs left as the record of what was true then.
- **RL6 — protocol files → `machine-docs/`: DEFERRED to the coordinated end of 1b.** Will `git mv`
`STATUS*/REVIEW*/JOURNAL*/BACKLOG*/DECISIONS.md` into `machine-docs/` (README.md STAYS at root —
operator decision, it's the human readme, not a protocol file). The live watchdog (`launch.sh`)
reads `STATUS-<id>.md`/`REVIEW-<id>.md` at the repo root for handoffs/transition, so this is done
LAST, in lockstep with the orchestrator updating `launch.sh` + restarting the watchdog — not
unilaterally and not while a phase transition is pending. The Adversary likewise `git mv`s its own
REVIEW files at the cutover (single-writer rule).
## Phase 1b — recorded deviation: no `tests/_template/` dir (enroll = copy an existing recipe)
Plan §3's repo layout lists a `tests/_template/` "copy-to-add-a-recipe" dir. It was **never created**
(pre-1b; not introduced or removed by 1b) — instead the documented enroll flow in
`docs/enroll-recipe.md` is **"copy an existing recipe's tree, e.g. `tests/custom-html/…`, then adjust
`recipe_meta.py` + the per-recipe test files."** This satisfies D5's "small, repeatable, documented
operation with no harness surgery" the same way (a concrete recipe is a better starting template than
an abstract skeleton that can drift). Recording per the Adversary's RL3 D5 advisory; not a blocker.
## Phase 1d — generic test suite + layered overlays (design, 2026-05-27)
SSOT: `cc-ci-plan/plan-phase1d-generic-test-suite.md`. Resolves the §6 open decisions.
- **Tier model & op/assertion split (the core call).** A run is a sequence of TIERS — install,
upgrade, backup, restore, custom — each = `generic default [overridden by a recipe overlay]`. The
**lifecycle OP** (deploy, upgrade, backup, restore) is owned by the **shared harness**
(`harness.generic` helpers), NOT duplicated in every test file. A tier's **test file** (generic or
overlay) carries the ASSERTIONS and calls the shared op helper. This keeps the op single-sourced
(DRY, DG7) and makes deploy-once trivial: only the orchestrator deploys/tears-down.
- **Override (not additive) — Builder's call (plan §6, operator leaned override).** For each
lifecycle op exactly ONE assertion file runs, by precedence:
**repo-local `tests/test_<op>.py` > cc-ci `tests/<recipe>/test_<op>.py` > generic
(`tests/_generic/test_<op>.py`)**. A present overlay REPLACES the generic for that op. **Invariant:
no overlay for an op ⇒ the generic runs** (so any recipe is testable with zero config). Repo-local
wins same-name collisions (upstream is authoritative, plan §2.5); cc-ci's overlay is the curated
fallback until upstream adopts it. **Extend-by-composition:** an overlay may
`from harness import generic` and call `generic.assert_serving(...)` / `generic.do_upgrade(...)`
then add its own assertions — so "extend" needs no separate mechanism.
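
The Phase-1d override precedence above, sketched as a pure function. `pick_assertion_file` and the `repo-local/` path prefix are hypothetical; the real discovery code reads the filesystem and is not shown here.

```python
def pick_assertion_file(op, recipe, existing):
    """Return the single assertion file that runs for a lifecycle op.

    existing: set of relative paths found during discovery.
    Precedence: repo-local > cc-ci overlay > generic (always the fallback).
    """
    overlays = [
        f"repo-local/tests/test_{op}.py",  # hypothetical prefix for the recipe repo
        f"tests/{recipe}/test_{op}.py",    # cc-ci curated overlay
    ]
    for candidate in overlays:
        if candidate in existing:
            return candidate
    # Invariant: no overlay for an op means the generic runs.
    return f"tests/_generic/test_{op}.py"
```

A recipe with no files at all therefore resolves every op to `tests/_generic/`, which is what makes it testable with zero config.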
- **Custom (non-lifecycle) `test_*.py`:** ALL discovered from BOTH locations run additively, opt-in
(no override, no generic equivalent) — e.g. `test_sso.py`.
- **Deploy ONCE, mutate in place (operator requirement, DG4.1).** The orchestrator deploys the app
ONCE, runs all tiers against that single live deployment (install asserts; upgrade does
`abra app upgrade` in place; backup/restore mutate in place; custom asserts), then ONE teardown in
`finally`. No per-tier/per-overlay `abra app new/deploy/undeploy`. A `CCCI_DEPLOY_COUNT` counter in
`lifecycle.deploy_app` is asserted == 1 per run (DG4.1 evidence).
- **Deployment-sharing scope & base version (§6 open).** One deployment for the whole lifecycle.
Base version deployed once = the **previous published version** when an upgrade tier will run and a
previous exists (so upgrade goes previous→target in place), **else the target** (current/$REF).
Recipe with only one published version ⇒ upgrade tier is a clean **SKIP** (nothing to upgrade
from). Standalone generic-install demo (no PR) deploys current.
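
One possible reading of the base-version rule above, as a sketch. Assumptions: `published` is ordered oldest to newest, the newest entry corresponds to the target, and "previous" means the entry before it. `plan_base_version` is a hypothetical name.

```python
def plan_base_version(published, target, want_upgrade=True):
    """Pick the version to deploy once, and whether the upgrade tier runs.

    published: published versions, oldest -> newest (assumed ordering).
    Returns (base_version, "run" | "skip").
    """
    if want_upgrade and len(published) >= 2:
        # Deploy the previous published version so the upgrade moves
        # previous -> target in place.
        return published[-2], "run"
    # Zero or one published version: nothing to upgrade from, so deploy
    # the target and report the upgrade tier as a clean skip.
    return target, "skip"
```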
- **Fail handling across shared tiers (§6 open):** install failing (app never serves) **fail-fasts**
the run (later tiers can't meaningfully run on a dead deployment) and they report **error/skip**;
upgrade/backup/restore failures are recorded per-op but do not abort the remaining independent
tiers where they can still run. Teardown always runs.
- **Backup-capability detection (DG3, §6 open):** auto — scan the recipe's `compose*.yml` for a
`backupbot.backup` label (verified present in custom-html). `recipe_meta.BACKUP_CAPABLE` (bool)
overrides the auto-detect. Not capable ⇒ backup+restore tiers are **N/A (skip)**, not failures.
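
A sketch of the auto-detect with its override. It uses a plain substring scan rather than YAML parsing, which is cruder than whatever the harness really does; `text_declares_backup` and `backup_capable` are hypothetical names.

```python
from pathlib import Path

BACKUP_LABEL = "backupbot.backup"


def text_declares_backup(compose_text):
    """True if any non-comment line mentions a backupbot.backup label."""
    return any(
        BACKUP_LABEL in line
        for line in compose_text.splitlines()
        if not line.lstrip().startswith("#")
    )


def backup_capable(recipe_dir, meta_override=None):
    """recipe_meta.BACKUP_CAPABLE (if set) wins; otherwise scan compose*.yml."""
    if meta_override is not None:
        return bool(meta_override)
    return any(
        text_declares_backup(path.read_text(encoding="utf-8", errors="replace"))
        for path in sorted(Path(recipe_dir).glob("compose*.yml"))
    )
```

A `False` result maps to N/A (skip) for the backup and restore tiers, never to a failure.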
- **Custom install-steps hook (DG5, §6 open):** a shell hook — `tests/<recipe>/install_steps.sh`
(cc-ci) or repo-local `tests/install_steps.sh` — run by the orchestrator during the install tier
AFTER `abra app new` + env defaults but BEFORE `abra app deploy`, with env `CCCI_APP_DOMAIN`,
`CCCI_RECIPE`, `CCCI_APP_ENV` (path to the app .env). Chosen over a fixture/declarative field as the
simplest thing the harness runs uniformly (can `abra app secret insert`, set env, seed). Graceful
rule: a recipe with NO hook still attempts the generic install; if it genuinely needs a step it
FAILS the generic install (reported per-op) — that is correct, not a harness bug.
- **Per-op result vocabulary (Phase-3 feed):** `pass | fail | skip(N/A) | error`. The orchestrator
prints a per-op summary line per run (feeds DG6 + Phase-3 level).
- **Discovery layout:** cc-ci overlays/custom/hook live in `tests/<recipe>/`; repo-local in the
recipe repo's `tests/` (snapshotted after fetch, per the existing volatile-checkout handling).
Generic tier files live in `tests/_generic/` (assertion-only, use the shared live-deployment
fixtures).
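
The fail-handling and result-vocabulary rules above can be sketched as a small tier loop. This is not the orchestrator: `run_tiers` and `SkipTier` are hypothetical, and tiers blocked by a failed install are recorded here as `skip` (the text allows error or skip).

```python
class SkipTier(Exception):
    """Raised by a tier that is not applicable (N/A) for this recipe."""


def run_tiers(tiers, teardown):
    """Run ordered (name, check) tiers against one shared deployment.

    Returns {tier name: "pass" | "fail" | "skip" | "error"}.
    """
    results = {}
    install_ok = True
    try:
        for name, check in tiers:
            if not install_ok:
                # Install never served: later tiers cannot run meaningfully.
                results[name] = "skip"
                continue
            try:
                check()
                results[name] = "pass"
            except SkipTier:
                results[name] = "skip"
            except AssertionError:
                results[name] = "fail"
            except Exception:
                results[name] = "error"
            if name == "install" and results[name] != "pass":
                install_ok = False
    finally:
        teardown()  # the single teardown always runs
    return results
```

An upgrade failure is recorded and the loop moves on to backup and restore; only a failed install short-circuits the remaining tiers.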
---
## Phase 1e — generic-harness corrections (HC1–HC4)
Three operator-review corrections to the Phase-1d shared harness, settled here (plan §5).
- **HC2 — repo-local approval allowlist (form/location + workflow).** PR-author-controlled code
(`install_steps.sh`, repo-local `test_*.py`) runs on the CI host with `/run/secrets/*` present, so it
is **default-deny**. Allowlist file: **`tests/repo-local-approved.txt`** (checked into the cc-ci
repo, git-auditable). Format: one recipe name per line; `#` comments + blank lines ignored; a lone
`*` is NOT a wildcard (no global opt-in — every recipe is explicit). **Default: empty ⇒ no recipe
trusts repo-local code.** Discovery (`resolve_op`/`custom_tests`/`install_steps`) consults the
repo-local source **only** when `repo_local_approved(recipe)` is true; otherwise precedence is
**cc-ci > generic** only and repo-local is discovered-but-not-executed. **Workflow:** a cc-ci
maintainer reviews a recipe's repo-local tests, then adds the recipe name to
`tests/repo-local-approved.txt` in a cc-ci PR — a deliberate, reviewable act. The gate is centralized
in `discovery.py` (one reader) so the unit tests pin it.
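
A sketch of the allowlist format described in HC2. `parse_allowlist` is a hypothetical name, and `repo_local_approved` here takes the parsed set explicitly, which may differ from the real signature in `discovery.py`. Assumption: `#` comments are whole-line only.

```python
def parse_allowlist(text):
    """Parse tests/repo-local-approved.txt: one recipe name per line."""
    approved = set()
    for line in text.splitlines():
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue  # blank lines and comments are ignored
        if entry == "*":
            continue  # a lone '*' is NOT a wildcard: no global opt-in
        approved.add(entry)
    return approved


def repo_local_approved(recipe, approved):
    """Default-deny: only an explicitly listed recipe trusts repo-local code."""
    return recipe in approved
```

An empty file parses to an empty set, so no recipe runs PR-author-controlled code until a maintainer adds its name in a reviewed PR.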
- **HC3 — generic-by-default opt-out flag (name/granularity + recipe_meta).** Generic assertions run
**additively** alongside any overlay by default. Opt-out, in increasing specificity (any one skips):
env **`CCCI_SKIP_GENERIC`** (truthy ⇒ skip generic for ALL ops), env
**`CCCI_SKIP_GENERIC_<OP>`** (e.g. `CCCI_SKIP_GENERIC_UPGRADE` ⇒ skip generic for that op only), and
declarative **`recipe_meta.SKIP_GENERIC`** = a list of op names (or `["all"]`) so the opt-out is
per-recipe and visible in git, not a hidden global. Truthy = `1/true/yes/on` (case-insensitive).
**Op-vs-assertion split:** a mutating op (upgrade/backup/restore) is performed **once by the
orchestrator** (the harness owns the op); then the generic assertion file (unless opted out) and the
overlay assertion file both evaluate the **shared post-op state**. Op results that an assertion needs
(pre-upgrade identity, backup snapshot_id) are passed op→assertions via a run-scoped JSON state file
at `$CCCI_OP_STATE_FILE` (read by `harness.generic.op_state()`); never logged. Overlays that need to
**seed pre-op state** (data-continuity markers, the backup→restore mutation) ship an optional
`tests/<recipe>/ops.py` with `pre_install/pre_upgrade/pre_backup/pre_restore(domain, meta)` callables
the orchestrator runs **before** the op (repo-local `ops.py` is allowlist-gated like other repo-local
code). Overlay `test_<op>.py` files are now **assertion-only** (they no longer call `generic.do_*`).
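
The three HC3 opt-out layers, sketched as one predicate. `skip_generic` is a hypothetical helper name; the env variable names, the truthy set and the `["all"]` convention are the ones described above.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}


def _truthy(value):
    """Case-insensitive truthiness for env flags."""
    return value is not None and value.strip().lower() in TRUTHY


def skip_generic(op, env=None, meta_skip=()):
    """True if the generic assertions for `op` are opted out by any layer.

    meta_skip: the recipe's recipe_meta.SKIP_GENERIC list (op names or ["all"]).
    """
    env = os.environ if env is None else env
    if _truthy(env.get("CCCI_SKIP_GENERIC")):
        return True  # global: skip generic for ALL ops
    if _truthy(env.get(f"CCCI_SKIP_GENERIC_{op.upper()}")):
        return True  # per-op env flag
    return "all" in meta_skip or op in meta_skip  # declarative, per-recipe
```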
- **HC1 — DG4.1 deploy-count vs the in-place chaos upgrade.** The upgrade tier now upgrades to the
**PR head** (code under test), not a published tag: deploy the previous published version (base),
**re-checkout the PR head** (recorded as the recipe repo HEAD right after fetch, before any
version-tag checkout), then **`abra app deploy --chaos`** in place = the upgrade. The deploy-count
guard counts **`abra app new` installs only** (`_record_deploy()` fires in `deploy_app()`, NOT in the
chaos redeploy, which calls `abra.deploy` directly) — so a run is still **deploy-count == 1** and the
legitimate in-place chaos upgrade is not flagged. **Moved assertion (adapted):** prev→PR-head may not
bump the coop-cloud version label, so `assert_upgraded` accepts ANY of: version-label change, image
change, or a **chaos label** now present carrying the PR-head commit (a chaos deploy stamps
`coop-cloud.<stack>.chaos`/`.chaos-version`) — the chaos label IS the proof PR-head was deployed.
Non-PR `!testme` (no SRC/REF): "PR head" = the catalogue current checkout, so upgrade is prev→current
— still a genuine move via chaos. (Exact chaos label name verified on the live abra during E2.)
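
A sketch of the "ANY of" rule for the moved assertion. The dict keys (`version`, `image`, `chaos_version`) and the 7-character commit-prefix comparison are assumptions made for illustration; the real `assert_upgraded` and the exact chaos label format may differ.

```python
def upgrade_evidence(before, after, pr_head):
    """List which signals show the deployment moved to the code under test.

    before/after: dicts with optional keys "version", "image", "chaos_version"
    (hypothetical shape of the pre- and post-upgrade identity).
    """
    evidence = []
    if after.get("version") != before.get("version"):
        evidence.append("version-label")
    if after.get("image") != before.get("image"):
        evidence.append("image")
    chaos = after.get("chaos_version") or ""
    # Assumption: the chaos label begins with (a prefix of) the PR-head commit.
    if pr_head and len(chaos) >= 7 and pr_head.startswith(chaos[:7]):
        evidence.append("chaos-label")
    return evidence


def check_upgraded(before, after, pr_head):
    """Raise AssertionError unless at least one signal is present."""
    evidence = upgrade_evidence(before, after, pr_head)
    assert evidence, "no version-label, image, or chaos-label change after upgrade"
    return evidence
```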
## Phase 2 — per-recipe test authoring (design, 2026-05-28)
Inherits the Phase 1d/1e shared-deployment + additive-overlay + op/assertion-split model. Phase 2
adds **content**, not infra, with a few small harness primitives ported from
`references/recipe-maintainer/utils/tests/helpers.py`.
- **Per-recipe layout (per plan §4.1).** The cc-ci `tests/<recipe>/` dir continues to use the
Phase-1d/1e overlays at the top level (`test_install.py`, `test_upgrade.py`, `test_backup.py`,
`test_restore.py`, `ops.py`, `recipe_meta.py`, optional `install_steps.sh`). NEW Phase-2
subdirectories:
- `tests/<recipe>/functional/` — parity-port tests (one per recipe-maintainer `tests/*.py`) +
≥2 NEW recipe-specific functional tests (P2/P3). Each file is `test_*.py` (pytest-discoverable);
each parity port carries a **`SOURCE = "recipe-info/<recipe>/tests/<file>"`** comment near the
top so the audit trail is in the file, not just in PARITY.md.
- `tests/<recipe>/playwright/` — browser flows (P6) where the app's UX is a UI flow. Same
`test_*.py` convention; each file imports `playwright.sync_api`.
- `tests/<recipe>/PARITY.md` — required mapping table (P2) with one row per
recipe-info parity test: `| recipe-maintainer file | cc-ci file | what's verified | status |`.
A deliberate non-port is a documented row in DECISIONS.md (linked from PARITY.md), not a silent
omission.
- **Discovery for the new subdirs.** `runner/harness/discovery.custom_tests` recurses into
`tests/<recipe>/functional/` and `tests/<recipe>/playwright/` (in addition to the top-level glob),
so Phase-2 functional tests run as part of the **custom** stage automatically. The repo-local (HC2)
gate still applies: repo-local `functional/` + `playwright/` files run only if the recipe is approved;
otherwise only cc-ci's own run. The top-level `test_install.py`/etc. continue to drive the lifecycle overlays; the
`functional/` + `playwright/` files are **always custom-stage**, never lifecycle (so they don't
perform an op; they assert against the post-install live deployment).
- **Vendored helpers in `runner/harness/`.** Capabilities ported from
`recipe-maintainer/utils/tests/helpers.py` (cc-ci is self-contained at runtime and does NOT import
recipe-maintainer's workspace, per plan §8 default):
- `harness.http` — `http_get(url, headers=, timeout=) -> (status, json_or_None)`,
`http_post(...)`, `retry_http_get(url, timeout=, **)`, `wait_for_http(url, label, max_wait=)`,
`assert_converges(fn, description, max_wait=, interval=)`. (Several variants already exist in
`lifecycle` (`http_fetch`/`http_get`/`http_body`); the `harness.http` module is the **canonical**
Phase-2 HTTP API for tests, while the `lifecycle.*` helpers stay for infra-level checks.)
- `harness.abra_tty` — `script -qefc "abra …" /dev/null` wrapper for the abra commands that
require a TTY (backup/restore/secret/run/logs/lint), used by parity tests that drive abra
directly. Lifecycle already exposes typed wrappers — this is for tests that need raw shell-abra.
- `harness.deps` — dependency resolver primitive. Reads `tests/<recipe>/recipe.toml`
(`requires` / `test_requires`), deploys each declared dep via the same `lifecycle.deploy_app`
+ `wait_healthy` path (so the dep is a real `<dep[:4]>-<6hex>.ci.commoninternet.net` on the
same swarm), persists per-run, tears down with the parent in the orchestrator's `finally`.
Heavy recipes deploy one at a time; the `MAX_TESTS`/node budget is the cap.
- `harness.sso` — OIDC-flow primitive (Q2 deliverable). Given a deployed provider domain and a
recipe-defined realm/client/test-user, performs the full "deploy provider → setup realm/client
via admin API → obtain access token (password + client-credentials grants) → assert protected
API call accepts it" assertion. Reusable by every SSO-dependent recipe (cryptpad, lasuite-*,
immich, etc.). Setup scripts ported from `recipe-info/<dep>/setup_<provider>_integration.py`.
- `harness.data_integrity` — backup data-integrity primitive: a recipe-aware "seed a marker
→ backup → mutate → restore → assert seeded marker survived" helper around `lifecycle.exec_in_app`
/ `http_get` (the recipe chooses the marker mechanism, the helper guarantees the pattern).
- **Run-scoped credentials for SSO/recipe-specific tests** (plan §4.4 class-B). Generated secrets
(realm/client/test-user passwords, API tokens) persist for the run via the existing
`runs/<app-name>/` mechanism (Phase 1d). Destroyed at teardown alongside abra secrets/volumes.
- **Recipe-versioned tests (anti-anchoring).** Per plan §7.1, tests read versions/endpoints
dynamically (the app's own discovery endpoints, env from `live_app`) — never hardcode published
release values. Each functional test file declares the recipe-info SOURCE path it ports from so
the Adversary can audit parity cold.
- **Heavy-recipe parking.** Drone's `MAX_TESTS=1` + per-build timeout already serialize runs; Phase 2
does NOT lift that cap. Within a single run, the orchestrator deploys deps before the
recipe-under-test sequentially (never concurrently) per plan §4.2.
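
As an illustration of the `assert_converges(fn, description, max_wait=, interval=)` helper listed under `harness.http`, a minimal polling sketch might look like the following. Only the signature comes from the list above; the default values, the choice to swallow exceptions from `fn` until the deadline, and the error message wording are assumptions.

```python
import time


def assert_converges(fn, description, max_wait=60.0, interval=2.0):
    """Poll fn() until it returns a truthy value or max_wait seconds pass.

    Sketch only: defaults and exception handling are assumed, not taken
    from the real harness.
    """
    deadline = time.monotonic() + max_wait
    last_error = None
    while True:
        try:
            result = fn()
            if result:
                return result
            last_error = None
        except Exception as exc:  # keep polling; report the last failure
            last_error = exc
        if time.monotonic() >= deadline:
            break
        time.sleep(interval)
    detail = f" (last error: {last_error!r})" if last_error else ""
    raise AssertionError(
        f"{description} did not converge within {max_wait}s{detail}"
    )
```

A functional test could wrap an HTTP probe in a lambda and pass it as `fn`, so that a slow-starting service produces one clear assertion failure instead of a flaky first-request error.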
## Phase 2 Q3.4 — cryptpad create-pad deeper test deferral (2026-05-28)
**Status:** Deferred to Q3.4 follow-up (or Q5 catch-up), with Adversary sign-off pending per
plan §7.1.
**What's deferred:** The "create-an-object + read-it-back" deep test for cryptpad —
authenticate-and-create a real pad in the browser, type a uniquely-marked content string, reload
the page (retaining the client-side encryption key in the URL fragment), assert the marker
survives. This is the canonical create-and-read-back per plan §4.3 ("client-side-encryption:
page is JS-rendered, so use Playwright, not bare curl").
**Why deferred (the technical reason):**
- CryptPad's pad-creation client-side flow is **version-specific**. In the recipe under test
(10.6.0+5.7.0), visiting `/pad/` does NOT auto-inject a fragment-keyed pad URL; CryptPad
requires the user to explicitly click a "new rich text" / "new pad" link from the landing
page, AND those UI selectors (`.cp-apps-grid a`, `[data-app='pad']`, `a[href*='/pad/']`) are
not stable across CryptPad versions.
- Three attempted drafts during Q3.4 each failed cold on this:
1. Type + reload + content-survives: contenteditable inside nested iframe with origin
mismatch (SANDBOX_DOMAIN).
2. Direct-`/pad/`-then-fragment: no fragment ever appeared on this version.
3. Click-fallback for known app-launch selectors: none of the candidate selectors matched.
**The maximal testable subset that IS shipped (P3 floor met):**
- `tests/cryptpad/functional/test_health_check.py` — parity HTTP 200.
- `tests/cryptpad/functional/test_spa_assets.py` — CryptPad branding + canonical asset paths
in served HTML. Catches the wedged-server-fallback-page failure mode.
- `tests/cryptpad/playwright/test_pad_create.py` — Chromium renders the SPA, asserts brand
+ canonical asset references + zero non-filtered JavaScript console errors.
The Playwright test exercises the JS pipeline in a real browser (per §4.3 directive); the
piece NOT exercised is the user-action-driven pad lifecycle. **What's required to lift the
deferral:** pin a specific CryptPad app-launch contract (CryptPad's source has app-launch
URL patterns like `/pad/?new=1` on some versions) OR write a Playwright helper that walks the
SPA's main menu via a stable accessibility tree (role-based selectors instead of CSS).
Adversary may file F2-N requesting full create-pad coverage; the answer above is the
honest technical reason + the maximal subset. Logged here per plan §7.1.
---
## Phase 2 — nested DOMAIN-derived subdomains flattened to single-label wildcard siblings
**Decision (settled):** When an enrolled recipe routes additional services on **nested subdomains
derived from `DOMAIN`** (e.g. lasuite-drive `MINIO_DOMAIN="minio.${DOMAIN}"` +
`COLLABORA_DOMAIN="collabora.${DOMAIN}"`; lasuite-meet `LIVEKIT_DOMAIN="livekit.${DOMAIN}"`), the
recipe's `recipe_meta.EXTRA_ENV(domain)` MUST override those vars to a **single-label sibling under
the wildcard** — `minio-<domain>`, `collabora-<domain>`, `livekit-<domain>` — NOT the recipe's
default `<svc>.<domain>`.
**Why:** cc-ci's TLS cert is the operator's pre-issued wildcard `*.ci.commoninternet.net` (+ bare
`ci.commoninternet.net`) — §4.0/§1.5, renewed out-of-band, no ACME. A wildcard matches exactly **one**
label. The per-run app domain is already one label (`lasuite-drive-pr<n>-<sha>.ci.commoninternet.net`),
so a nested `minio.lasuite-drive-pr<n>-<sha>.ci.commoninternet.net` is a **2-label** name the wildcard
does NOT cover → Traefik would serve an invalid cert on that router and the service is unreachable
over HTTPS. Re-prefixing with a hyphen keeps it one label (`minio-lasuite-drive-pr<n>-<sha>` +
`.ci.commoninternet.net`), covered by the same wildcard, routed by Traefik's swarm provider with **no
cert work and no gateway change** (the gateway already passes the whole wildcard, §4.0). We must NOT
mint per-host certs / ACME for these (class-A1 boundary, §9).
**Scope:** purely a per-recipe `EXTRA_ENV` concern (no shared-harness change). Recipes with no
DOMAIN-derived nested subdomains (most) are unaffected.
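
A minimal sketch of the flattening rule, assuming `EXTRA_ENV` returns a plain dict of env overrides. The helper `covered_by_wildcard` is hypothetical and exists only to show the one-label constraint; it is not part of the harness.

```python
WILDCARD_BASE = "ci.commoninternet.net"


def EXTRA_ENV(domain):
    """Override nested <svc>.<domain> defaults with single-label siblings."""
    return {
        "MINIO_DOMAIN": f"minio-{domain}",
        "COLLABORA_DOMAIN": f"collabora-{domain}",
    }


def covered_by_wildcard(host, base=WILDCARD_BASE):
    """True if host is the bare base or exactly one label under *.base."""
    if host == base:
        return True
    suffix = "." + base
    if not host.endswith(suffix):
        return False
    prefix = host[: -len(suffix)]
    # A wildcard matches exactly one label, so the prefix may not contain a dot.
    return bool(prefix) and "." not in prefix
```

For a per-run domain such as `lasuite-drive-pr<n>-<sha>.ci.commoninternet.net`, every value returned by this `EXTRA_ENV` passes `covered_by_wildcard`, while the recipe's default `minio.<domain>` does not.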
## Phase 2 — `services_converged` treats a `replicas: 0` one-shot as converged
**Decision (settled):** `runner/harness/lifecycle.py::services_converged` now considers a service
converged when `cur == want` (desired replica count met), removing the prior
`or want == "0"` rejection.
**Why:** lasuite-drive's `minio-createbuckets` is declared `deploy: {mode: replicated, replicas: 0,
restart_policy: {condition: none}}`, an **on-demand one-shot** (scaled up manually only when buckets
need (re)creating; it runs `mc mb …` and then exits 0). `docker stack services` reports it `0/0`. The old
check rejected any `want == "0"` row, so the stack could **never** report converged → every deploy
hung until `deploy_timeout`. A service AT its desired count (including 0/0) is converged; a service
still spinning up shows `0/1` (`cur != want`) and is correctly not-yet-converged, so the HTTP
readiness wait still gates real liveness. Safe for all currently-green recipes (their services are
all N/N with N>0; the `0/0` case did not previously occur). Buckets/migrations that the one-shot
performs are run on-demand in the recipe's `setup_custom_tests.sh` (post-deploy), not relied upon for
generic-install convergence (the SPA at `/` serves 200 without them).
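
The convergence rule can be sketched as a pure function over `name cur/want` rows, for example the output of `docker stack services --format '{{.Name}} {{.Replicas}}'`. The name `replicas_converged` and the text-based input are assumptions for illustration; the real `services_converged` in `lifecycle.py` may obtain and parse its data differently.

```python
def replicas_converged(replica_lines):
    """True when every service row reports cur == want, including 0/0."""
    rows = [line.split() for line in replica_lines.splitlines() if line.strip()]
    if not rows:
        # No services listed yet: the stack has not converged.
        return False
    for parts in rows:
        if len(parts) < 2 or "/" not in parts[1]:
            return False  # unparseable row: treat as not converged
        cur, want = parts[1].split("/", 1)
        if cur != want:
            return False  # e.g. 0/1: still spinning up
    return True
```

Under this rule a `0/0` one-shot no longer blocks the stack, while a `0/1` row still does, so the HTTP readiness wait remains the gate for real liveness.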