DECISIONS — cc-ci Builder
Architecture decisions and dead-ends. One line of rationale each. (§0, §8)
Settled
- Wildcard TLS: operator pre-issues wildcard cert at /var/lib/ci-certs/live/; Traefik file provider serves it; no ACME for commoninternet.net. (Plan §4.0/§8, fixed.)
- Repo: git.autonomic.zone/recipe-maintainers/cc-ci, private. Bot is org admin. (Bootstrap.)
- Git credentials: helper script in repo-local git config sources /srv/cc-ci/.testenv at call time, so no secret values are stored in .git/config or commits.
- Proxy: real coop-cloud/traefik via abra. SETTLED (M1, orchestrator decision 2026-05-26, overrides plan §3 modules/traefik.nix). Instead of a hand-rolled Traefik we deploy the canonical Co-op Cloud traefik recipe via abra in wildcard / file-provider mode, for end-to-end fidelity (canonical web/web-secure entrypoints + proxy/swarm conventions every recipe expects; this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO DNS token on the box: WILDCARDS_ENABLED=1 + append compose.wildcard.yml; the pre-issued cert is fed as the ssl_cert/ssl_key swarm secrets (v1) via abra app secret insert … -f from /var/lib/ci-certs/live/{fullchain,privkey}.pem. The file provider serves it (tls.certificates). LETS_ENCRYPT_ENV is empty on the traefik app and on every test app → the recipe's tls.certresolver=${LETS_ENCRYPT_ENV} label resolves to no resolver → routers serve the wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
  - Reproducibility (D8): scripts/deploy-proxy.sh is idempotent (ensures local abra server, fetches recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in docs/install.md. The custom modules/traefik.nix was removed; modules/swarm.nix keeps swarm init + proxy net + firewall 80/443.
  - Renewal (manual, ~90d): operator re-issues the wildcard at the same paths, then abra app secret rm traefik.ci.commoninternet.net ssl_cert -n + re-insert at a new version (bump SECRET_WILDCARD_CERT_VERSION) and redeploy. (Documented in docs/secrets.md at M7.)
  - abra teardown syntax (for harness, §4.3): abra app undeploy <d> -n, abra app volume remove <d> -f -n, abra app secret remove <d> --all -n. None take --chaos.
- Infra bring-up = idempotent-reconcile systemd oneshots. SETTLED (M2, orchestrator steer 2026-05-26). Every piece of swarm infra that abra deploys (traefik modules/proxy.nix, Drone modules/drone.nix, later comment-bridge + dashboard) is a systemd.services.<x> with Type=oneshot + RemainAfterExit, after/requires swarm-init + docker, wants network-online, wantedBy multi-user, embedding its script via pkgs.writeShellApplication (self-contained in the store, not a /root/cc-ci path). The script reconciles (inspect → converge → no-op if correct) on every activation/boot, with no run-once sentinel, so it self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit) on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to git clone + nixos-rebuild switch + operator preconditions, no manual post-steps. The old scripts/deploy-*.sh were folded into these modules and removed. pkgs.abra is provided via an overlay (modules/packages.nix) so all modules share the one pinned build.
  - Cert rotation note: the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the wildcard means bumping SECRET_WILDCARD_*_VERSION (operator) so the next reconcile re-inserts. Documented in docs/secrets.md at M7.
- Trigger: POLLING primary, webhook optional. SETTLED (orchestrator design change 2026-05-27, supersedes the earlier "keep webhook, do NOT pivot to polling" steer). Hard constraint: the bot/server runs at READ level, never repo-admin, and never self-registers a webhook.
  - Polling is PRIMARY and the source of truth for D1. The bridge polls each enrolled repo's open PRs for new !testme comments every POLL_INTERVAL (30s ≤ 60s). Outbound (cc-ci → git.autonomic.zone, the reliably-working direction), needs only read+comment. On startup the first poll marks pre-existing comments seen so it doesn't fire on old comments.
  - Webhook is an OPTIONAL push optimization. The /hook endpoint stays live (HMAC-verified) so an admin-registered issue_comment webhook lowers latency, but the bridge never registers one. Manual registration is documented in docs/enroll-recipe.md. Both paths share an in-memory seen-set keyed by comment id → a comment seen by both fires at most once.
  - Commenter authorization via org membership (read-level, no admin). Allowed iff GET /orgs/{owner}/members/{user} → 204 (verified 2026-05-27: admits bot/trav/notplants, 404 for a non-member, works with bot read-level basic-auth) or the user is in the optional AUTH_ALLOWLIST. Replaces the earlier /collaborators/{user}/permission check, which needs repo-admin. Fail-closed on any error.
  - Enrollment = add the repo to the bridge POLL_REPOS csv + ensure tests/<recipe>/ exists. No webhook required for CI to work. (The root cause of the old webhook non-delivery no longer matters: polling makes it irrelevant. The operator was whitelisting ci.commoninternet.net in Gitea's ALLOWED_HOST_LIST, but D1 no longer depends on that.)
- Resource safety: bound live test apps. SETTLED (orchestrator design change 2026-05-27, plan §4.2/§4.3). Do NOT keep multiple test apps deployed at once. Three layers, all configurable:
  - MAX_TESTS = DRONE_RUNNER_CAPACITY = 1 (modules/drone-runner.nix, maxTests let-binding). Drone runs at most MAX_TESTS builds at once and auto-queues the rest in its native pending queue, so there is no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
  - Per-build TIMEOUT = 60 min (modules/drone.nix, buildTimeoutMinutes; reconciled best-effort via PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60} using the bridge's Drone admin token, local --resolve, non-fatal). A build over the limit is cancelled by Drone → the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue once a test finishes OR times out".
  - Teardown + janitor backstop. Each build deploys → runs the 3 stages → undeploys (guaranteed try/finally in conftest/orchestrator). A SIGKILL'd/timed-out build can't run its own teardown, so the run-start janitor (lifecycle.janitor, called before every deploy in both fixtures + run_recipe_ci) reaps orphaned run apps as the backstop. At capacity=1 the CI path will set CCCI_JANITOR_MAX_AGE=0 (reap any orphan immediately, which is safe with no concurrent runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default 2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
  - Optional concurrency: {limit: 1} in the recipe-CI .drone.yml is a redundant belt; the primary mechanism is DRONE_RUNNER_CAPACITY. (Wired when the recipe-CI pipeline lands; see backlog.)
- D10 recipe #6: bluesky-pds (TLS-passthrough) SWAPPED → n8n. SETTLED (2026-05-27, plan §4.0 sanctions this swap-with-reason). bluesky-pds routes via a Traefik TCP router with tls.passthrough=true to an in-container caddy that terminates TLS itself and obtains its own cert via ACME. cc-ci's design is the opposite: the operator gateway passes wildcard TLS through to cc-ci's Traefik, which terminates it with the pre-issued static wildcard cert, and ACME is hard-forbidden for commoninternet.net (no DNS token on the box, §4.0/§9). Serving bluesky-pds would require either (a) ACME inside caddy (forbidden), or (b) injecting the wildcard cert into caddy + a per-host TCP-passthrough router on cc-ci Traefik (recipe-internal surgery + a bespoke proxy mode, not a clean shared-harness absorb). This is a genuine design conflict, not a harness gap. Per the plan's explicit allowance, bluesky-pds is a documented non-CI'd recipe (reason here), and n8n takes the 6th slot. The 5 required D10 categories are already covered by recipes 1–5 (simple=custom-html, single-DB+SSO=keycloak, stateful/no-DB=cryptpad, DB+media/large-volume=matrix-synapse, multi-service+S3/object-storage=lasuite-docs); n8n adds a 6th real deployable app (workflow automation) behind the normal terminate-at-Traefik path.
- Docker Hub rate limit + mid-breadth prune. FINDING (2026-05-27). D10 real-!testme breadth runs exhausted Docker Hub's anonymous pull rate limit (lasuite-docs, 9 images, upgrade stage: toomanyrequests). Two lessons: (1) registry pull creds are an A1 operator input needed for reliable heavy-recipe deploys under load (request + sops-store + wire into docker daemon). (2) Don't docker image prune -af mid-breadth: it evicts cached recipe images and forces re-pulls that hit the limit. The first lasuite failure was disk pressure (90% full); pruning fixed disk but triggered re-pulls → rate limit. Better: rely on the daily autoprune, prune only dangling (not -a) between runs, or grow disk so heavy images stay cached. Net for D10: 5/6 recipes green via real !testme; lasuite-docs gated on the rate limit (transient ~hours; durable fix = creds).
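The trigger decision above (polling primary, webhook optional, one shared seen-set) can be sketched as follows. This is a minimal illustration, not the bridge's actual code: the names `SeenSet`, `first_poll` and `handle` are hypothetical, and matching the comment body by exact `!testme` after stripping whitespace is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SeenSet:
    """In-memory record of comment ids already handled (hypothetical name)."""

    _ids: set[int] = field(default_factory=set)

    def claim(self, comment_id: int) -> bool:
        """Return True only the first time a given comment id is offered."""
        if comment_id in self._ids:
            return False
        self._ids.add(comment_id)
        return True


def first_poll(seen: SeenSet, existing_ids: list[int]) -> None:
    """Startup pass: mark pre-existing comments as seen without firing."""
    for comment_id in existing_ids:
        seen.claim(comment_id)


def handle(seen: SeenSet, comment_id: int, body: str, fired: list[int]) -> None:
    """Shared entry point for both the poller and the /hook endpoint."""
    # Assumption: a trigger is a comment whose stripped body is exactly "!testme".
    if body.strip() == "!testme" and seen.claim(comment_id):
        fired.append(comment_id)


seen = SeenSet()
fired: list[int] = []
first_poll(seen, [101, 102])         # comments that existed before startup
handle(seen, 102, "!testme", fired)  # old comment: ignored
handle(seen, 103, "!testme", fired)  # new comment found by the poller
handle(seen, 103, "!testme", fired)  # same comment delivered by the webhook
```

Because both paths go through the same `claim` call, a comment delivered twice triggers one run at most, whichever path sees it first.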
Open (defaults from §8, to confirm as reality lands)
- Deploy mechanism. SETTLED (M0): nixos-rebuild switch --flake /root/cc-ci#cc-ci run on cc-ci itself, with the repo materialised on the host at /root/cc-ci. Chosen over --target-host/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS proxy (slow/fragile). Atomic rollback preserved by Nix generations (nixos-rebuild --rollback). The switch is launched as a detached transient systemd unit (systemd-run --unit=ccci-rebuild --collect) so it survives a momentary ssh-over-tailscale drop during activation. For the build loop the host copy is synced from the sandbox clone via tar | ssh (rsync absent on host); source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo on a fresh host, then nixos-rebuild switch --flake .#cc-ci).
  - nixpkgs pin: flake pins the exact rev cc-ci already ran (50ab793…) so the first rebuild is a true no-op-then-base. Bump deliberately, never drift.
- Webhook scope: default per-repo via enroll script.
- CI engine: Drone (per plan), kept with a noted risk. nixpkgs 24.11 has Drone server 2.24.0 but drone-runner-exec is abandoned (unstable-2020-04-19); it is the only exec runner Drone ever shipped (upstream archived ~2021). The maintained fork Woodpecker (2.7.3, with NixOS modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific (D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern Drone server (RPC protocol stable). Fallback: if the exec runner proves incompatible/broken, pivot to Woodpecker (coop-cloud ships a woodpecker recipe too) and record it, like the traefik pivot. Re-evaluate at the M2 gate.
- Drone deployment shape. SETTLED (M2): mirror the traefik pattern. The server is the coop-cloud drone recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by traefik at drone.ci.commoninternet.net, LETS_ENCRYPT_ENV empty → wildcard cert, no ACME), with Gitea SSO (compose.gitea.yml). The exec runner runs as a Nix systemd service on the host (modules/drone-runner.nix) so it can drive host abra/swarm (plan §4.2). One generated DRONE_RPC_SECRET is shared: inserted as the server's rpc_secret swarm secret AND read by the runner from sops. Reproducible deploy: scripts/deploy-drone.sh.
  - Gitea OAuth app cc-ci-drone created under the bot (client_id ab4cdb9d-ee96-4867-875f-87384505fc52, redirect https://drone.ci.commoninternet.net/login); client_secret + rpc_secret stored sops-encrypted in secrets/secrets.yaml (A2 internal secrets).
- Drone runner type: exec (must drive host abra).
- Secret tool. SETTLED (M0): sops-nix. cc-ci decrypts at activation using its ed25519 SSH host key as the age identity (sops.age.sshKeyPaths), so no extra key file to manage on the box. Recipients in /.sops.yaml: the host age key (age1h90ut…, from ssh-to-age) + an off-box master recovery key (age1cmk26t…; private half only at /srv/cc-ci/.sops/master-age.txt on the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing plaintext into secrets/<f>.yaml then sops -e -i (run inside the repo so .sops.yaml is found).
- D10 recipe set: lock six early. Candidates favouring already-mirrored: custom-html (simple), cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3), bluesky-pds (TLS-passthrough); covers all five categories. Confirm during M4–M6.5.
- Per-run app domain scheme: adapted (M4, deviates from plan §4.0). Plan §4.0 wanted <recipe>-pr<n>-<short-sha>.ci.commoninternet.net, but Docker swarm config/secret names (<stackname>_<resource>_<version>) must be ≤ 64 chars and abra derives <stackname> from the domain (dots→_, hyphens kept). .ci.commoninternet.net alone is 22 chars, so long recipe names + config names overflow 64 (hit with custom-html-pr0-m4demo…_nginx_default_conf_v6 = 66). New scheme: <recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net (e.g. cust-e084bd): short, unique per run, collision-safe across recipes (full recipe in the hash). Human-readable recipe/PR/ref context lives in the Drone build params + the PR comment, not the (ephemeral) domain.
- abra recipe checkout is volatile: harness uses chaos+offline + a tests/ snapshot (M6). Many abra commands (app ls, secret generate without flags, version resolution) silently git checkout <version-tag> in ~/.abra/recipes/<recipe>, discarding a PR branch's files. To test the PR head code (not a re-resolved tag): (1) fetch_recipe clones the mirror branch/ref (private → bot token via per-command http.extraHeader, never persisted/logged); (2) all harness abra calls that touch the recipe pass -C (chaos: use current checkout) -o (offline: no remote fetch); (3) recipe-shipped tests/ (D4) are snapshotted to a temp dir right after fetch, since later abra commands still reset the checkout, so the recipe-local stage runs from the snapshot.
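The per-run domain scheme in the list above, together with the 64-character swarm name limit that motivated it, can be sketched like this. The hash function (SHA-256) and the exact `recipe|pr|ref` input encoding are assumptions; the decision only specifies a 6-hex digest over those three values. All function names are hypothetical.

```python
import hashlib

CI_SUFFIX = ".ci.commoninternet.net"  # 22 characters
SWARM_NAME_LIMIT = 64


def run_domain(recipe: str, pr: str, ref: str) -> str:
    """Build <recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net."""
    # Assumption: SHA-256 over "recipe|pr|ref"; the real harness may hash differently.
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}{CI_SUFFIX}"


def stack_name(domain: str) -> str:
    """Derive the stack name the way the decision describes: dots become underscores, hyphens stay."""
    return domain.replace(".", "_")


def fits_swarm_limit(domain: str, resource: str, version: str) -> bool:
    """Check that <stackname>_<resource>_<version> stays within 64 characters."""
    return len(f"{stack_name(domain)}_{resource}_{version}") <= SWARM_NAME_LIMIT


domain = run_domain("custom-html", "0", "m4demo")
ok = fits_swarm_limit(domain, "nginx_default_conf", "v6")
```

With a 4-character prefix and a 6-character digest the domain is always 33 characters long, which leaves 31 characters for the `_<resource>_<version>` part of any config or secret name.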
Risks
- Disk — RESOLVED 2026-05-26. Original 8.9 GiB root had only ~3.8 GiB free and a hard
inode ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on
inodes before bytes. Operator grew the VM to 28 GiB (22 GiB free, 1.78M inodes / 1.21M free);
the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown +
  periodic docker image prune to avoid regressing during M6.5 breadth.
Dead-ends
- (none yet)
Phase 1c (full reproducibility + genuine D8 live rebuild) — 2026-05-27
- Secrets linkage = git SUBMODULE (deviates from plan §7 flake-input default). cc-ci-secrets is mounted as a submodule at cc-ci/secrets/ rather than a flake inputs.secrets. Rationale: a private flake input must be re-fetched at every nix eval, requiring the bot token persistently in nix config/netrc on cc-ci AND the throwaway VM (a token in the store/config = a 2nd out-of-band secret, which 1c forbids). A submodule makes secrets/secrets.yaml a plain path in the working tree → defaultSopsFile = ../secrets/secrets.yaml is unchanged (minimal diff, trivially byte-identical), and the only credential use is the one git clone --recursive at provisioning ("the two repos are given", Mission §1). Build invocation becomes nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci' so the submodule tree is included. (Revisit if ?submodules=1 proves unreliable on cc-ci's nix version.)
- Bootstrap key for the throwaway VM = the existing RECOVERY (master) age key, via sops.age.keyFile. The recovery key (age1cmk26…, private at /srv/cc-ci/.sops/master-age.txt) is already a sops recipient, so a fresh host with a different ssh host key still decrypts every secret with no re-keying; this is exactly the §0 argument that defeats "host-key binding". Provisioned to the VM at a fixed path (the ONE out-of-band secret). cc-ci itself keeps decrypting via its host key (age.sshKeyPaths); secrets.nix will offer both identity sources. (Per-host re-encrypt is cleaner for a permanent new instance; documented as the alternative, not used for the throwaway test.)
- Cert into git: wildcard cert+key become sops secrets in cc-ci-secrets, decrypted at activation back to /var/lib/ci-certs/live/{fullchain.pem,privkey.pem} via sops.secrets.<name>.path; proxy.nix keeps reading that path (now sops-sourced, not operator-drop).
- cc-nix-test final sizing (C6). SETTLED by operator 2026-05-27: PROMOTE the rebuilt VM. The freshly-rebuilt reproducible VM (the FINAL W5/C4-C5 clean-room throwaway) becomes the canonical cc-nix-test; the operator will repurpose it for a live real-traffic test through the public gateway.
- C6 teardown OVERRIDE (operator, 2026-05-27): do NOT destroy the FINAL throwaway VM after W5/C4-C5 PASSes — keep it RUNNING; defer its C6 teardown until the operator explicitly says otherwise. This overrides the plan §5/§6 "destroy the throwaway" for that one VM only. All other cleanup proceeds normally (the Builder's first throwaway was already destroyed; RAM accounting holds).
Phase 1b — lint/format tooling (open decisions §6, settled W0)
- Formatters/linters (RL1): Nix = nixpkgs-fmt (format) + statix (lints) + deadnix (dead code); Python = ruff (lint + format); Shell = shellcheck + shfmt -i 2 -ci; YAML = yamllint. Kept nixpkgs-fmt over alejandra because it was already the repo formatter and devshell tool (no extra churn / restyle of every .nix). All built from the already-pinned nixpkgs via a flake lint devshell (nix develop .#lint) so CI and local use byte-identical tool versions.
- Lint entrypoint: scripts/lint.sh (check-only by default; --fix auto-applies). The .drone.yml push pipeline runs it via nix develop .#lint --command bash scripts/lint.sh.
- ruff strictness: select = [E,F,W,I,UP,B,C4,SIM], ignore = [E501] (line length is the formatter's job; only un-splittable strings would trip it). line-length=100, target=py311.
- Drone lint stage = FAIL (not warn). The codebase is green now, so enforce from here on: an unclean commit fails the lint step. (Resolves the §6 open question.)
- Python type-checking (mypy/pyright): DEFERRED to IDEAS, not added in 1b. The harness is small and dynamically typed around abra/subprocess JSON; gradual typing is a larger effort than this bounded pass warrants. Revisit if Phase 2's 18-recipe ramp shows type bugs.
- blocking vs advisory split (§3): treated as in the phase plan: tests-real, Nix-idempotent, no-footguns, no-secrets, log-redaction, harness-DRY = blocking; readability/docs/arch-drift = advisory unless a real plan deviation. Recorded per-finding in REVIEW-1b / BACKLOG-1b.
- cc-ci self-CI push trigger: the lint stage lives in the event: push pipeline. The Gitea→Drone push webhook on this instance is flaky (last_status: None; documented §4.1) and predates 1b. Recipe CI uses polling as primary, but cc-ci's own self-test/lint relies on the push webhook. The lint stage is correctly wired and proven green via the identical nix develop .#lint command; reliably auto-firing it on every push is tracked as a (pre-existing) infra item, not a 1b lint gap.
Phase 1b — repo layout (operator review items RL5/RL6, plan §7)
- RL5: all Nix code under nix/. Moved modules/ → nix/modules/ and hosts/ → nix/hosts/. flake.nix/flake.lock STAY at the repo root (entry point) so the build ref #cc-ci and nixos-rebuild --flake '…#cc-ci' are unchanged; only flake.nix's internal ./hosts/cc-ci/configuration.nix → ./nix/hosts/cc-ci/configuration.nix changed. Root-relative refs inside the moved modules were re-based ../X → ../../X (secrets.nix → ../../secrets/, bridge.nix → ../../bridge/, dashboard.nix → ../../dashboard/); configuration.nix's ../../modules/* imports are unchanged (both dirs moved under nix/, so the relative path still resolves). Toplevel is byte-identical (8i3jcad9…) before/after the move: store derivations are content-addressed on the copied file contents, and the module .nix files aren't part of the runtime closure, so relocating folders doesn't change the build. (The operator anticipated a hash change; in practice it's stable, which is even stronger for reproducibility.) Living docs (README, architecture/install/secrets/enroll) + the .drone.yml comment updated to nix/…; append-only history logs left as the record of what was true then.
- RL6: protocol files → machine-docs/: DEFERRED to the coordinated end of 1b. Will git mv STATUS*/REVIEW*/JOURNAL*/BACKLOG*/DECISIONS.md into machine-docs/ (README.md STAYS at root: operator decision, it's the human readme, not a protocol file). The live watchdog (launch.sh) reads STATUS-<id>.md/REVIEW-<id>.md at the repo root for handoffs/transition, so this is done LAST, in lockstep with the orchestrator updating launch.sh + restarting the watchdog, not unilaterally and not while a phase transition is pending. The Adversary likewise git mvs its own REVIEW files at the cutover (single-writer rule).
Phase 1b — recorded deviation: no tests/_template/ dir (enroll = copy an existing recipe)
Plan §3's repo layout lists a tests/_template/ "copy-to-add-a-recipe" dir. It was never created
(pre-1b; not introduced or removed by 1b) — instead the documented enroll flow in
docs/enroll-recipe.md is "copy an existing recipe's tree, e.g. tests/custom-html/…, then adjust
recipe_meta.py + the per-recipe test files." This satisfies D5's "small, repeatable, documented
operation with no harness surgery" the same way (a concrete recipe is a better starting template than
an abstract skeleton that can drift). Recording per the Adversary's RL3 D5 advisory; not a blocker.
Phase 1d — generic test suite + layered overlays (design, 2026-05-27)
SSOT: cc-ci-plan/plan-phase1d-generic-test-suite.md. Resolves the §6 open decisions.
- Tier model & op/assertion split (the core call). A run is a sequence of TIERS (install, upgrade, backup, restore, custom), each = generic default [overridden by a recipe overlay]. The lifecycle OP (deploy, upgrade, backup, restore) is owned by the shared harness (harness.generic helpers), NOT duplicated in every test file. A tier's test file (generic or overlay) carries the ASSERTIONS and calls the shared op helper. This keeps the op single-sourced (DRY, DG7) and makes deploy-once trivial: only the orchestrator deploys/tears-down.
- Override (not additive): Builder's call (plan §6, operator leaned override). For each lifecycle op exactly ONE assertion file runs, by precedence: repo-local tests/test_<op>.py > cc-ci tests/<recipe>/test_<op>.py > generic (tests/_generic/test_<op>.py). A present overlay REPLACES the generic for that op. Invariant: no overlay for an op ⇒ the generic runs (so any recipe is testable with zero config). Repo-local wins same-name collisions (upstream is authoritative, plan §2.5); cc-ci's overlay is the curated fallback until upstream adopts it. Extend-by-composition: an overlay may from harness import generic and call generic.assert_serving(...) / generic.do_upgrade(...) then add its own assertions, so "extend" needs no separate mechanism.
- Custom (non-lifecycle) test_*.py: ALL discovered from BOTH locations run additively, opt-in (no override, no generic equivalent), e.g. test_sso.py.
- Deploy ONCE, mutate in place (operator requirement, DG4.1). The orchestrator deploys the app ONCE, runs all tiers against that single live deployment (install asserts; upgrade does abra app upgrade in place; backup/restore mutate in place; custom asserts), then ONE teardown in finally. No per-tier/per-overlay abra app new/deploy/undeploy. A CCCI_DEPLOY_COUNT counter in lifecycle.deploy_app is asserted == 1 per run (DG4.1 evidence).
- Deployment-sharing scope & base version (§6 open). One deployment for the whole lifecycle. Base version deployed once = the previous published version when an upgrade tier will run and a previous exists (so upgrade goes previous→target in place), else the target (current/$REF). Recipe with only one published version ⇒ upgrade tier is a clean SKIP (nothing to upgrade from). Standalone generic-install demo (no PR) deploys current.
- Fail handling across shared tiers (§6 open): install failing (app never serves) fail-fasts the run (later tiers can't meaningfully run on a dead deployment) and they report error/skip; upgrade/backup/restore failures are recorded per-op but do not abort the remaining independent tiers where they can still run. Teardown always runs.
- Backup-capability detection (DG3, §6 open): auto; scan the recipe's compose*.yml for a backupbot.backup label (verified present in custom-html). recipe_meta.BACKUP_CAPABLE (bool) overrides the auto-detect. Not capable ⇒ backup+restore tiers are N/A (skip), not failures.
- Custom install-steps hook (DG5, §6 open): a shell hook, tests/<recipe>/install_steps.sh (cc-ci) or repo-local tests/install_steps.sh, run by the orchestrator during the install tier AFTER abra app new + env defaults but BEFORE abra app deploy, with env CCCI_APP_DOMAIN, CCCI_RECIPE, CCCI_APP_ENV (path to the app .env). Chosen over a fixture/declarative field as the simplest thing the harness runs uniformly (can abra app secret insert, set env, seed). Graceful rule: a recipe with NO hook still attempts the generic install; if it genuinely needs a step it FAILS the generic install (reported per-op), which is correct, not a harness bug.
- Per-op result vocabulary (Phase-3 feed): pass | fail | skip(N/A) | error. The orchestrator prints a per-op summary line per run (feeds DG6 + Phase-3 level).
- Discovery layout: cc-ci overlays/custom/hook live in tests/<recipe>/; repo-local in the recipe repo's tests/ (snapshotted after fetch, per the existing volatile-checkout handling). Generic tier files live in tests/_generic/ (assertion-only, use the shared live-deployment fixtures).
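The backup-capability rule from the list above (auto-detect a backupbot.backup label, with recipe_meta.BACKUP_CAPABLE as the override) could look roughly like this. It is a sketch under assumptions: the function name and signature are hypothetical, and it uses a plain substring scan of the compose files instead of parsing the YAML.

```python
from pathlib import Path
from typing import Optional


def backup_capable(recipe_dir: Path, override: Optional[bool] = None) -> bool:
    """Decide whether the backup and restore tiers apply to a recipe.

    `override` stands in for recipe_meta.BACKUP_CAPABLE: when it is set,
    it wins over auto-detection in either direction.
    """
    if override is not None:
        return override
    # Assumption: a substring match is enough to spot the label in compose*.yml.
    return any(
        "backupbot.backup" in path.read_text(encoding="utf-8", errors="replace")
        for path in sorted(recipe_dir.glob("compose*.yml"))
    )
```

A recipe for which this returns False would have its backup and restore tiers reported as skip (N/A), not as failures.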
Phase 1e — generic-harness corrections (HC1–HC4)
Three operator-review corrections to the Phase-1d shared harness, settled here (plan §5).
- HC2: repo-local approval allowlist (form/location + workflow). PR-author-controlled code (install_steps.sh, repo-local test_*.py) runs on the CI host with /run/secrets/* present, so it is default-deny. Allowlist file: tests/repo-local-approved.txt (checked into the cc-ci repo, git-auditable). Format: one recipe name per line; # comments + blank lines ignored; a lone * is NOT a wildcard (no global opt-in; every recipe is explicit). Default: empty ⇒ no recipe trusts repo-local code. Discovery (resolve_op/custom_tests/install_steps) consults the repo-local source only when repo_local_approved(recipe) is true; otherwise precedence is cc-ci > generic only and repo-local is discovered-but-not-executed. Workflow: a cc-ci maintainer reviews a recipe's repo-local tests, then adds the recipe name to tests/repo-local-approved.txt in a cc-ci PR, which is a deliberate, reviewable act. The gate is centralized in discovery.py (one reader) so the unit tests pin it.
- HC3: generic-by-default opt-out flag (name/granularity + recipe_meta). Generic assertions run additively alongside any overlay by default. Opt-out, in increasing specificity (any one skips): env CCCI_SKIP_GENERIC (truthy ⇒ skip generic for ALL ops), env CCCI_SKIP_GENERIC_<OP> (e.g. CCCI_SKIP_GENERIC_UPGRADE ⇒ skip generic for that op only), and declarative recipe_meta.SKIP_GENERIC = a list of op names (or ["all"]) so the opt-out is per-recipe and visible in git, not a hidden global. Truthy = 1/true/yes/on (case-insensitive). Op-vs-assertion split: a mutating op (upgrade/backup/restore) is performed once by the orchestrator (the harness owns the op); then the generic assertion file (unless opted out) and the overlay assertion file both evaluate the shared post-op state. Op results that an assertion needs (pre-upgrade identity, backup snapshot_id) are passed op→assertions via a run-scoped JSON state file at $CCCI_OP_STATE_FILE (read by harness.generic.op_state()); never logged. Overlays that need to seed pre-op state (data-continuity markers, the backup→restore mutation) ship an optional tests/<recipe>/ops.py with pre_install/pre_upgrade/pre_backup/pre_restore(domain, meta) callables the orchestrator runs before the op (repo-local ops.py is allowlist-gated like other repo-local code). Overlay test_<op>.py files are now assertion-only (they no longer call generic.do_*).
- HC1: DG4.1 deploy-count vs the in-place chaos upgrade. The upgrade tier now upgrades to the PR head (code under test), not a published tag: deploy the previous published version (base), re-checkout the PR head (recorded as the recipe repo HEAD right after fetch, before any version-tag checkout), then abra app deploy --chaos in place = the upgrade. The deploy-count guard counts abra app new installs only (_record_deploy() fires in deploy_app(), NOT in the chaos redeploy, which calls abra.deploy directly), so a run is still deploy-count == 1 and the legitimate in-place chaos upgrade is not flagged. Moved assertion (adapted): prev→PR-head may not bump the coop-cloud version label, so assert_upgraded accepts ANY of: version-label change, image change, or a chaos label now present carrying the PR-head commit (a chaos deploy stamps coop-cloud.<stack>.chaos/.chaos-version); the chaos label IS the proof PR-head was deployed. Non-PR !testme (no SRC/REF): "PR head" = the catalogue current checkout, so upgrade is prev→current, still a genuine move via chaos. (Exact chaos label name verified on the live abra during E2.)
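The HC3 opt-out resolution described above (global env, per-op env, then recipe_meta.SKIP_GENERIC) can be sketched as a single predicate. The function name, its signature, and the whitespace stripping of env values are assumptions made for illustration; the env names and truthy values come from the decision text.

```python
import os
from typing import Iterable, Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


def _truthy(value: Optional[str]) -> bool:
    """Case-insensitive truthiness check for an env value."""
    return value is not None and value.strip().lower() in TRUTHY


def skip_generic(
    op: str,
    meta_skip: Iterable[str] = (),
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True when the generic assertions for `op` should be skipped.

    `meta_skip` stands in for recipe_meta.SKIP_GENERIC (op names or ["all"]).
    Any one of the three sources is enough to skip.
    """
    env = os.environ if env is None else env
    if _truthy(env.get("CCCI_SKIP_GENERIC")):
        return True
    if _truthy(env.get(f"CCCI_SKIP_GENERIC_{op.upper()}")):
        return True
    names = {name.lower() for name in meta_skip}
    return "all" in names or op.lower() in names
```

Passing `env` explicitly keeps the predicate easy to unit-test without touching the process environment.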
Phase 2 — per-recipe test authoring (design, 2026-05-28)
Inherits the Phase 1d/1e shared-deployment + additive-overlay + op/assertion-split model. Phase 2
adds content, not infra, with a few small harness primitives ported from
references/recipe-maintainer/utils/tests/helpers.py.
- Per-recipe layout (per plan §4.1). The cc-ci tests/<recipe>/ dir continues to use the Phase-1d/1e overlays at the top level (test_install.py, test_upgrade.py, test_backup.py, test_restore.py, ops.py, recipe_meta.py, optional install_steps.sh). NEW Phase-2 subdirectories:
  - tests/<recipe>/functional/: parity-port tests (one per recipe-maintainer tests/*.py) + ≥2 NEW recipe-specific functional tests (P2/P3). Each file is test_*.py (pytest-discoverable); each parity port carries a SOURCE = "recipe-info/<recipe>/tests/<file>" comment near the top so the audit trail is in the file, not just in PARITY.md.
  - tests/<recipe>/playwright/: browser flows (P6) where the app's UX is a UI flow. Same test_*.py convention; each file imports playwright.sync_api.
  - tests/<recipe>/PARITY.md: required mapping table (P2) with one row per recipe-info parity test: | recipe-maintainer file | cc-ci file | what's verified | status |. A deliberate non-port is a documented row in DECISIONS.md (linked from PARITY.md), not a silent omission.
- Discovery for the new subdirs. runner/harness/discovery.custom_tests recurses into tests/<recipe>/functional/ and tests/<recipe>/playwright/ (in addition to the top-level glob), so Phase-2 functional tests run as part of the custom stage automatically. Repo-local (HC2) gate still applies if the recipe is approved; otherwise only cc-ci's own functional/ + playwright/ run. The top-level test_install.py/etc. continue to drive the lifecycle overlays; the functional/ + playwright/ files are always custom-stage, never lifecycle (so they don't perform an op; they assert against the post-install live deployment).
- Vendored helpers in runner/harness/. Capabilities ported from recipe-maintainer/utils/tests/helpers.py (cc-ci is self-contained at runtime and does NOT import recipe-maintainer's workspace, per plan §8 default):
  - harness.http: http_get(url, headers=, timeout=) -> (status, json_or_None), http_post(...), retry_http_get(url, timeout=, **), wait_for_http(url, label, max_wait=), assert_converges(fn, description, max_wait=, interval=). (Several variants, lifecycle.http_fetch/http_get/http_body, already exist; the harness.http module is the canonical Phase-2 HTTP API for tests; lifecycle.* helpers stay for infra-level checks.)
  - harness.abra_tty: script -qefc "abra …" /dev/null wrapper for the abra commands that require a TTY (backup/restore/secret/run/logs/lint), used by parity tests that drive abra directly. Lifecycle already exposes typed wrappers; this is for tests that need raw shell-abra.
  - harness.deps: dependency resolver primitive. Reads tests/<recipe>/recipe.toml (requires/test_requires), deploys each declared dep via the same lifecycle.deploy_app wait_healthy path (so the dep is a real <dep[:4]>-<6hex>.ci.commoninternet.net on the same swarm), persists per-run, tears down with the parent in the orchestrator's finally. Heavy recipes deploy sequentially; MAX_TESTS/node budget is the cap.
  - harness.sso: OIDC-flow primitive (Q2 deliverable). Given a deployed provider domain and a recipe-defined realm/client/test-user, performs the full "deploy provider → setup realm/client via admin API → obtain access token (password + client-credentials grants) → assert protected API call accepts it" assertion. Reusable by every SSO-dependent recipe (cryptpad, lasuite-*, immich, etc.). Setup scripts ported from recipe-info/<dep>/setup_<provider>_integration.py.
  - harness.data_integrity: backup data-integrity primitive: a recipe-aware "seed a marker → backup → mutate → restore → assert seeded marker survived" helper around lifecycle.exec_in_app/http_get (the recipe chooses the marker mechanism, the helper guarantees the pattern).
- Run-scoped credentials for SSO/recipe-specific tests (plan §4.4 class-B). Generated secrets (realm/client/test-user passwords, API tokens) persist for the run via the existing runs/<app-name>/ mechanism (Phase 1d). Destroyed at teardown alongside abra secrets/volumes.
- Recipe-versioned tests (anti-anchoring). Per plan §7.1, tests read versions/endpoints dynamically (the app's own discovery endpoints, env from live_app) and never hardcode published release values. Each functional test file declares the recipe-info SOURCE path it ports from so the Adversary can audit parity cold.
- Heavy-recipe parking. Drone's MAX_TESTS=1 + per-build timeout already serialize runs; for Phase 2 we DO NOT lift that limit. Within a single run, the orchestrator deploys deps before the recipe-under-test sequentially (never concurrently) per plan §4.2.
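Of the harness.http helpers listed above, assert_converges is the one whose behaviour is least obvious from its signature. The following is a minimal sketch of one plausible reading: the parameter names follow the signature in the list, while the defaults, the "truthy result means converged" rule, and the exception handling are assumptions, not the vendored implementation.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def assert_converges(
    fn: Callable[[], T],
    description: str,
    max_wait: float = 60.0,
    interval: float = 2.0,
) -> T:
    """Call `fn` until it returns a truthy value or `max_wait` seconds elapse.

    Exceptions raised by `fn` are treated as "not converged yet" (assumption)
    and the last one is included in the final failure message.
    """
    deadline = time.monotonic() + max_wait
    last_error: Optional[BaseException] = None
    while True:
        try:
            result = fn()
            if result:
                return result
        except Exception as exc:  # keep polling; report the last error on timeout
            last_error = exc
        if time.monotonic() >= deadline:
            raise AssertionError(
                f"{description}: did not converge within {max_wait}s "
                f"(last error: {last_error!r})"
            )
        time.sleep(interval)
```

A test would typically wrap a small probe, for example a lambda that fetches an endpoint and returns True once the expected status appears, so that slow-starting apps do not produce flaky failures.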