# DECISIONS — cc-ci Builder

Architecture decisions and dead-ends. One line of rationale each. (§0, §8)

## Settled

- **Wildcard TLS:** operator pre-issues wildcard cert at `/var/lib/ci-certs/live/`; Traefik file
  provider serves it; **no ACME** for commoninternet.net. (Plan §4.0/§8 — fixed.)
- **Repo:** `git.autonomic.zone/recipe-maintainers/cc-ci`, private. Bot is org admin. (Bootstrap.)
- **Git credentials:** helper script in repo-local git config sources `/srv/cc-ci/.testenv` at call
  time — no secret values stored in `.git/config` or commits.

- **Proxy: real coop-cloud/traefik via abra — SETTLED (M1, orchestrator decision 2026-05-26,
  overrides plan §3 `modules/traefik.nix`).** Instead of a hand-rolled Traefik we deploy the
  canonical Co-op Cloud `traefik` recipe via abra in **wildcard / file-provider mode**, for
  end-to-end fidelity (canonical `web`/`web-secure` entrypoints + proxy/swarm conventions every
  recipe expects — this also fixed an entrypoint-name mismatch the custom build hit). NO ACME, NO
  DNS token on the box:
  - `WILDCARDS_ENABLED=1` + append `compose.wildcard.yml`; the pre-issued cert is fed as the
    `ssl_cert`/`ssl_key` swarm secrets (v1) via `abra app secret insert … -f` from
    `/var/lib/ci-certs/live/{fullchain,privkey}.pem`. The file provider serves it (`tls.certificates`).
  - `LETS_ENCRYPT_ENV=` **empty** on the traefik app *and* on every test app → the recipe's
    `tls.certresolver=${LETS_ENCRYPT_ENV}` label resolves to no resolver → routers serve the
    wildcard via SNI from the file provider, ACME never fires. (Verified: 0 ACME log lines.)
  - Reproducibility (D8): `scripts/deploy-proxy.sh` is idempotent (ensures local abra server, fetches
    recipe, writes the wildcard/no-ACME env, inserts cert secrets, deploys). Documented in
    `docs/install.md`. The custom `modules/traefik.nix` was removed; `modules/swarm.nix` keeps swarm
    init + `proxy` net + firewall 80/443.
  - **Renewal (manual, ~90d):** operator re-issues the wildcard at the same paths, then
    `abra app secret rm traefik.ci.commoninternet.net ssl_cert -n` + re-insert at a new version (bump
    `SECRET_WILDCARD_CERT_VERSION`) and redeploy. (Documented in docs/secrets.md at M7.)
- **abra teardown syntax** (for harness, §4.3): `abra app undeploy <d> -n`,
  `abra app volume remove <d> -f -n`, `abra app secret remove <d> --all -n`. None take `--chaos`.

- **Infra bring-up = idempotent-reconcile systemd oneshots — SETTLED (M2, orchestrator steer
  2026-05-26).** Every piece of swarm infra that abra deploys (traefik `modules/proxy.nix`, Drone
  `modules/drone.nix`, later comment-bridge + dashboard) is a `systemd.services.<x>` with
  `Type=oneshot` + `RemainAfterExit`, `after`/`requires` swarm-init + docker, `wants`
  network-online, `wantedBy` multi-user, embedding its script via **`pkgs.writeShellApplication`**
  (self-contained in the store, not a `/root/cc-ci` path). The script **reconciles** (inspect →
  converge → no-op if correct) on *every* activation/boot — **no run-once sentinel** — so it
  self-heals drift (stack gone → redeploy; secret missing → re-insert). Fails visibly (failed unit)
  on missing preconditions (e.g. cert absent). Result: a from-scratch install (D8) collapses to
  `git clone` + `nixos-rebuild switch` + operator preconditions, no manual post-steps. The old
  `scripts/deploy-*.sh` were folded into these modules and removed. `pkgs.abra` is provided via an
  overlay (`modules/packages.nix`) so all modules share the one pinned build.
  - *Cert rotation note:* the proxy reconcile inserts ssl_cert/ssl_key only if absent; rotating the
    wildcard means bumping `SECRET_WILDCARD_*_VERSION` (operator) so the next reconcile re-inserts.
    Documented in docs/secrets.md at M7.

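The inspect → converge → no-op loop described in the infra bring-up decision can be sketched in Python. This is an illustration of the pattern only: the real units embed shell scripts via `pkgs.writeShellApplication`, and `Step`, `PreconditionError`, and `reconcile` are hypothetical names that do not appear in the repo.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One piece of desired state (hypothetical structure, for illustration)."""
    name: str
    is_converged: Callable[[], bool]  # inspect: is the live state already correct?
    converge: Callable[[], None]      # act: bring the live state to the desired state


class PreconditionError(RuntimeError):
    """Raised when an operator-supplied input (e.g. the cert) is missing."""


def reconcile(steps: list[Step], preconditions: list[Callable[[], bool]]) -> list[str]:
    """Run on every activation; return the names of the steps that had to act."""
    if not all(check() for check in preconditions):
        # Fail visibly instead of deploying a half-configured stack.
        raise PreconditionError("missing precondition")
    changed = []
    for step in steps:
        if step.is_converged():
            continue          # already correct: no-op
        step.converge()       # drift detected: self-heal
        changed.append(step.name)
    return changed
```

Because there is no run-once sentinel, a second call with unchanged state returns an empty list, and a call after drift repairs only the drifted step.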
- **Trigger: POLLING primary, webhook optional — SETTLED (orchestrator design change 2026-05-27,
  supersedes the earlier "keep webhook, do NOT pivot to polling" steer).** Hard constraint: the
  bot/server runs at **READ level, never repo-admin**, and **never self-registers a webhook**.
  - **Polling is PRIMARY and the source of truth for D1.** The bridge polls each enrolled repo's
    open PRs for new `!testme` comments every `POLL_INTERVAL` (30s, i.e. ≤ 60s). Outbound
    (cc-ci → git.autonomic.zone, the reliably-working direction), needs only read+comment. On
    startup the first poll marks pre-existing comments seen so it doesn't fire on old comments.
  - **Webhook is an OPTIONAL push optimization.** The `/hook` endpoint stays live (HMAC-verified)
    so an *admin-registered* `issue_comment` webhook lowers latency, but the bridge never registers
    one. Manual registration is documented in `docs/enroll-recipe.md`. Both paths share an
    in-memory seen-set keyed by comment id → a comment seen by both fires at most once.
  - **Commenter authorization via org membership (read-level, no admin).** Allowed iff
    `GET /orgs/{owner}/members/{user}` → 204 (verified 2026-05-27: admits bot/trav/notplants, 404
    for a non-member, works with bot read-level basic-auth) **or** the user is in the optional
    `AUTH_ALLOWLIST`. Replaces the earlier `/collaborators/{user}/permission` check, which needs
    repo-admin. Fail-closed on any error.
  - **Enrollment** = add the repo to the bridge `POLL_REPOS` csv + ensure `tests/<recipe>/` exists.
    No webhook required for CI to work. (The root cause of the old webhook non-delivery no longer
    matters: polling makes it irrelevant. The operator was whitelisting `ci.commoninternet.net` in
    Gitea's `ALLOWED_HOST_LIST`, but D1 no longer depends on that.)

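A minimal sketch of the trigger rules above: a shared seen-set keyed by comment id, a first poll that only primes the set, and a fail-closed membership check. The class and function names are hypothetical, the `startswith("!testme")` match is an assumption about how the bridge recognises the command, and the HTTP lookup is injected as a callable rather than implemented.

```python
from typing import Callable, Iterable


class CommentTrigger:
    """Dedupes `!testme` comments seen by the poller and the optional webhook."""

    def __init__(self, is_authorized: Callable[[str], bool]) -> None:
        self._seen: set[int] = set()   # in-memory, keyed by comment id
        self._primed = False           # True once the first poll has run
        self._is_authorized = is_authorized

    def _accept(self, comment_id: int, user: str, body: str) -> bool:
        if comment_id in self._seen:
            return False               # already handled by the other path
        self._seen.add(comment_id)
        return body.strip().startswith("!testme") and self._is_authorized(user)

    def on_poll(self, comments: Iterable[tuple[int, str, str]]) -> list[int]:
        """Handle one poll result; return the comment ids that should start a run."""
        comments = list(comments)
        if not self._primed:
            # First poll after startup: mark pre-existing comments seen, fire nothing.
            self._seen.update(cid for cid, _, _ in comments)
            self._primed = True
            return []
        return [cid for cid, user, body in comments if self._accept(cid, user, body)]

    def on_webhook(self, comment_id: int, user: str, body: str) -> bool:
        """Handle one already-verified webhook delivery; True means start a run."""
        return self._accept(comment_id, user, body)


def make_authorizer(
    member_status: Callable[[str], int],
    allowlist: frozenset[str] = frozenset(),
) -> Callable[[str], bool]:
    """Allow iff the membership lookup returns 204 or the user is allowlisted."""

    def is_authorized(user: str) -> bool:
        if user in allowlist:
            return True
        try:
            return member_status(user) == 204
        except Exception:
            return False               # fail closed on any error

    return is_authorized
```

Here `member_status` stands in for the `GET /orgs/{owner}/members/{user}` call and returns its HTTP status code.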
- **Resource safety: bound live test apps — SETTLED (orchestrator design change 2026-05-27,
  plan §4.2/§4.3).** Do NOT keep multiple test apps deployed at once. Three layers, all configurable:
  - **MAX_TESTS = `DRONE_RUNNER_CAPACITY` = 1** (`modules/drone-runner.nix`, `maxTests` let-binding).
    Drone runs at most MAX_TESTS builds at once and **auto-queues the rest in its native pending
    queue** — no custom queue. Kept at 1 (single 28GiB node, heavy recipes). At capacity=1 there is
    never a concurrent in-flight run, so the bound "at most 1 test app live" holds exactly.
  - **Per-build TIMEOUT = 60 min** (`modules/drone.nix`, `buildTimeoutMinutes`; reconciled
    best-effort via `PATCH /api/repos/recipe-maintainers/cc-ci {"timeout":60}` using the bridge's
    Drone admin token, local `--resolve`, non-fatal). A build over the limit is cancelled by Drone →
    the exec runner kills it → the MAX_TESTS slot frees → the queue advances. Satisfies "continue
    once a test finishes OR times out".
  - **Teardown + janitor backstop.** Each build deploys → runs the 3 stages → undeploys
    (guaranteed `try/finally` in `conftest`/orchestrator). A SIGKILL'd/timed-out build can't run its
    own teardown, so the **run-start janitor** (`lifecycle.janitor`, called before every deploy in
    both fixtures + `run_recipe_ci`) reaps orphaned run apps as the backstop. At capacity=1 the CI
    path will set `CCCI_JANITOR_MAX_AGE=0` (reap any orphan immediately — safe with no concurrent
    runs) in the recipe-CI Drone pipeline; with capacity>1 the janitor MUST stay age-based (default
    2h) to avoid reaping a live concurrent run. Net: at most MAX_TESTS apps ever live.
  - Optional `concurrency: {limit: 1}` in the recipe-CI `.drone.yml` is a redundant belt — primary
    mechanism is `DRONE_RUNNER_CAPACITY`. (Wired when the recipe-CI pipeline lands — see backlog.)

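The janitor age rule above can be illustrated with a short sketch. It assumes the janitor knows each run app's start time, which the text does not specify; `janitor_max_age` and `orphans_to_reap` are hypothetical names, not the actual `lifecycle.janitor` API.

```python
from datetime import datetime, timedelta

DEFAULT_MAX_AGE = timedelta(hours=2)  # the age-based default named in the text


def janitor_max_age(runner_capacity: int) -> timedelta:
    """Zero is only safe when no other run can be in flight."""
    return timedelta(0) if runner_capacity == 1 else DEFAULT_MAX_AGE


def orphans_to_reap(
    apps: dict[str, datetime], now: datetime, max_age: timedelta
) -> list[str]:
    """Return run-app domains whose age is at least max_age, oldest first."""
    old = [(started, domain) for domain, started in apps.items() if now - started >= max_age]
    return [domain for _, domain in sorted(old)]
```

With capacity 1 every leftover run app is treated as an orphan; with a higher capacity a five-minute-old app is left alone because it may belong to a live concurrent run.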
- **D10 recipe #6: bluesky-pds (TLS-passthrough) SWAPPED → n8n — SETTLED (2026-05-27, plan §4.0
  sanctions this swap-with-reason).** bluesky-pds routes via a Traefik **TCP router with
  `tls.passthrough=true`** to an in-container **caddy** that terminates TLS itself and obtains its own
  cert via **ACME**. cc-ci's design is the opposite: the operator gateway passes wildcard TLS through
  to cc-ci's Traefik, which **terminates** it with the pre-issued static wildcard cert, and **ACME is
  hard-forbidden** for commoninternet.net (no DNS token on the box — §4.0/§9). Serving bluesky-pds
  would require either (a) ACME inside caddy (forbidden), or (b) injecting the wildcard cert into
  caddy + a per-host TCP-passthrough router on cc-ci Traefik (recipe-internal surgery + a bespoke
  proxy mode — not a clean shared-harness absorb). This is a genuine design conflict, not a harness
  gap. Per the plan's explicit allowance, **bluesky-pds is a documented non-CI'd recipe** (reason
  here), and **n8n** takes the 6th slot. The 5 required D10 categories are already covered by recipes
  1–5 (simple=custom-html, single-DB+SSO=keycloak, stateful/no-DB=cryptpad,
  DB+media/large-volume=matrix-synapse, multi-service+S3/object-storage=lasuite-docs); n8n adds a
  6th real deployable app (workflow automation) behind the normal terminate-at-Traefik path.

- **Docker Hub rate limit + mid-breadth prune — FINDING (2026-05-27).** D10 real-`!testme` breadth
  runs exhausted Docker Hub's anonymous pull rate limit (lasuite-docs, 9 images, upgrade stage:
  `toomanyrequests`). Two lessons: (1) **registry pull creds are an A1 operator input** needed for
  reliable heavy-recipe deploys under load (request + sops-store + wire into docker daemon). (2)
  **Don't `docker image prune -af` mid-breadth** — it evicts cached recipe images and forces re-pulls
  that hit the limit. The first lasuite failure was disk pressure (90% full); pruning fixed disk but
  triggered re-pulls → rate limit. Better: rely on the daily autoprune, prune only `dangling`
  (not `-a`) between runs, or grow disk so heavy images stay cached. Net for D10: 5/6 recipes green
  via real !testme; lasuite-docs gated on the rate limit (transient ~hours; durable fix = creds).

## Open (defaults from §8, to confirm as reality lands)

- **Deploy mechanism — SETTLED (M0):** `nixos-rebuild switch --flake /root/cc-ci#cc-ci` run *on
  cc-ci itself*, with the repo materialised on the host at `/root/cc-ci`. Chosen over
  `--target-host`/deploy-rs to avoid pushing large closures over the userspace-tailscaled SOCKS
  proxy (slow/fragile). Atomic rollback preserved by Nix generations (`nixos-rebuild --rollback`).
  The switch is launched as a **detached transient systemd unit** (`systemd-run --unit=ccci-rebuild
  --collect`) so it survives a momentary ssh-over-tailscale drop during activation. For the build
  loop the host copy is synced from the sandbox clone via `tar | ssh` (rsync absent on host);
  source of truth stays the git repo. D8/install.md will document the from-scratch path (clone repo
  on a fresh host, then `nixos-rebuild switch --flake .#cc-ci`).
- **nixpkgs pin:** flake pins the exact rev cc-ci already ran (`50ab793…`) so the first rebuild
  is a true no-op-then-base. Bump deliberately, never drift.
- **Webhook scope:** default per-repo via enroll script.
- **CI engine: Drone (per plan) — kept, with a noted risk.** nixpkgs 24.11 has Drone **server**
  2.24.0 but `drone-runner-exec` is **abandoned (unstable-2020-04-19)** — the only exec runner Drone
  ever shipped (upstream archived ~2021). The maintained fork **Woodpecker** (2.7.3, with NixOS
  modules) is the alternative. Decision: honor the plan (Drone) because the plan is Drone-specific
  (D7 "Drone's native UI", comment-bridge → Drone API). The 2020 exec runner pairs fine with modern
  Drone server (RPC protocol stable). **Fallback:** if the exec runner proves incompatible/broken,
  pivot to Woodpecker (coop-cloud ships a `woodpecker` recipe too) and record it — like the traefik
  pivot. Re-evaluate at the M2 gate.
- **Drone deployment shape — SETTLED (M2):** mirror the traefik pattern. The **server** is the
  coop-cloud `drone` recipe (drone/drone:2.26.0) deployed via abra (swarm-native, auto-routed by
  traefik at `drone.ci.commoninternet.net`, `LETS_ENCRYPT_ENV` empty → wildcard cert, no ACME),
  with Gitea SSO (`compose.gitea.yml`). The **exec runner** runs as a Nix systemd service on the
  host (`modules/drone-runner.nix`) so it can drive host abra/swarm (plan §4.2). One generated
  `DRONE_RPC_SECRET` is shared: inserted as the server's `rpc_secret` swarm secret AND read by the
  runner from sops. Reproducible deploy: `scripts/deploy-drone.sh`.
  - Gitea OAuth app `cc-ci-drone` created under the bot (client_id
    `ab4cdb9d-ee96-4867-875f-87384505fc52`, redirect `https://drone.ci.commoninternet.net/login`);
    client_secret + rpc_secret stored sops-encrypted in `secrets/secrets.yaml` (A2 internal secrets).
- **Drone runner type:** exec (must drive host abra).
- **Secret tool — SETTLED (M0):** sops-nix. cc-ci decrypts at activation using its **ed25519 SSH
  host key** as the age identity (`sops.age.sshKeyPaths`), so no extra key file to manage on the box.
  Recipients in `/.sops.yaml`: the host age key (`age1h90ut…`, from ssh-to-age) + an off-box
  **master recovery key** (`age1cmk26t…`; private half only at `/srv/cc-ci/.sops/master-age.txt` on
  the build host, never in the repo) for re-keying if cc-ci is lost. Encrypt new secrets by writing
  plaintext into `secrets/<f>.yaml` then `sops -e -i` (run inside the repo so `.sops.yaml` is found).
- **D10 recipe set:** lock six early. Candidates favouring already-mirrored: custom-html (simple),
  cryptpad (stateful no-DB), keycloak (SSO/DB), matrix-synapse (DB+media), lasuite-docs (multi+S3),
  bluesky-pds (TLS-passthrough) — covers all five categories. Confirm during M4–M6.5.

- **Per-run app domain scheme — adapted (M4, deviates from plan §4.0).** Plan §4.0 wanted
  `<recipe>-pr<n>-<short-sha>.ci.commoninternet.net`, but Docker swarm config/secret names
  (`<stackname>_<resource>_<version>`) must be ≤ 64 chars and abra derives `<stackname>` from the
  domain (dots→`_`, hyphens kept). `.ci.commoninternet.net` alone is 22 chars, so long recipe names
  + config names overflow 64 (hit with `custom-html-pr0-m4demo…_nginx_default_conf_v6` = 66). New
  scheme: **`<recipe[:4]>-<6hex(recipe|pr|ref)>.ci.commoninternet.net`** (e.g. `cust-e084bd`) — short,
  unique per run, collision-safe across recipes (full recipe in the hash). Human-readable recipe/PR/
  ref context lives in the Drone build params + the PR comment, not the (ephemeral) domain.

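The domain scheme and the 64-character check can be sketched as follows. The hash function is an assumption: the text specifies only "6 hex characters of recipe|pr|ref", so SHA-256 is used here for illustration and will not necessarily reproduce the `cust-e084bd` example. All function names are hypothetical.

```python
import hashlib

BASE = "ci.commoninternet.net"
SWARM_NAME_LIMIT = 64  # Docker swarm config/secret name limit cited in the text


def run_domain(recipe: str, pr: str, ref: str) -> str:
    """Build <recipe[:4]>-<6hex(recipe|pr|ref)>.<BASE>; SHA-256 is an assumed hash."""
    digest = hashlib.sha256(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}.{BASE}"


def stack_name(domain: str) -> str:
    """Stack name as described in the text: dots become underscores, hyphens stay."""
    return domain.replace(".", "_")


def fits_swarm_limit(domain: str, resource: str, version: str) -> bool:
    """Check <stackname>_<resource>_<version> against the 64-char limit."""
    return len(f"{stack_name(domain)}_{resource}_{version}") <= SWARM_NAME_LIMIT
```

The short label is always at most 11 characters, so the stack name is at most 33 characters and leaves 31 for `_<resource>_<version>`.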
- **abra recipe checkout is volatile — harness uses chaos+offline + a tests/ snapshot (M6).** Many
  abra commands (`app ls`, `secret generate` without flags, version resolution) silently
  `git checkout <version-tag>` in `~/.abra/recipes/<recipe>`, discarding a PR branch's files. To
  test the *PR head code* (not a re-resolved tag): (1) `fetch_recipe` clones the mirror branch/ref
  (private → bot token via per-command `http.extraHeader`, never persisted/logged); (2) all harness
  abra calls that touch the recipe pass `-C` (chaos: use current checkout) `-o` (offline: no remote
  fetch); (3) recipe-shipped `tests/` (D4) are **snapshotted to a temp dir right after fetch**, since
  later abra commands still reset the checkout — the recipe-local stage runs from the snapshot.

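Step (3) of the volatile-checkout handling, snapshotting the recipe's `tests/` right after fetch, might look like the sketch below. `snapshot_recipe_tests` and the temp-dir prefix are hypothetical; only the idea of copying `tests/` out of the checkout before any further abra call comes from the text.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def snapshot_recipe_tests(recipe_checkout: Path) -> Optional[Path]:
    """Copy <checkout>/tests to a fresh temp dir; return the copy, or None if absent."""
    src = recipe_checkout / "tests"
    if not src.is_dir():
        return None  # recipe ships no tests of its own
    dest = Path(tempfile.mkdtemp(prefix="ccci-tests-")) / "tests"
    shutil.copytree(src, dest)
    return dest
```

The recipe-local stage would then run from the returned path, so a later `git checkout <version-tag>` inside the recipe directory cannot remove the PR's test files.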
## Risks

- **Disk — RESOLVED 2026-05-26.** Original 8.9 GiB root had only ~3.8 GiB free *and* a hard
  **inode** ceiling (586k total, ~6k free) — the flake's nixpkgs fetch (~50k files) hit ENOSPC on
  inodes before bytes. Operator grew the VM to **28 GiB** (22 GiB free, 1.78M inodes / 1.21M free);
  the ext4 fs auto-resized (new block groups carry proportional inodes). Keep aggressive teardown +
  periodic `docker image prune` to avoid regressing during M6.5 breadth.

## Dead-ends
- (none yet)

## Phase 1c (full reproducibility + genuine D8 live rebuild) — 2026-05-27

- **Secrets linkage = git SUBMODULE (deviates from plan §7 flake-input default).** `cc-ci-secrets`
  is mounted as a submodule at `cc-ci/secrets/` rather than a flake `inputs.secrets`. Rationale: a
  private flake input must be re-fetched at **every nix eval**, requiring the bot token persistently
  in nix config/netrc on cc-ci AND the throwaway VM (a token in the store/config = a 2nd out-of-band
  secret, which 1c forbids). A submodule makes `secrets/secrets.yaml` a plain path in the working
  tree → `defaultSopsFile = ../secrets/secrets.yaml` is unchanged (minimal diff, trivially
  byte-identical), and the only credential use is the one `git clone --recursive` at provisioning
  ("the two repos are *given*", Mission §1). Build invocation becomes
  `nixos-rebuild switch --flake 'git+file:///root/cc-ci?submodules=1#cc-ci'` so the submodule tree is
  included. (Revisit if `?submodules=1` proves unreliable on cc-ci's nix version.)
- **Bootstrap key for the throwaway VM = the existing RECOVERY (master) age key, via
  `sops.age.keyFile`.** The recovery key (`age1cmk26…`, private at `/srv/cc-ci/.sops/master-age.txt`)
  is already a sops recipient, so a fresh host with a *different* ssh host key still decrypts every
  secret with no re-keying — this is exactly the §0 argument that defeats "host-key binding".
  Provisioned to the VM at a fixed path (the ONE out-of-band secret). cc-ci itself keeps decrypting
  via its host key (`age.sshKeyPaths`); secrets.nix will offer both identity sources. (Per-host
  re-encrypt is cleaner for a *permanent* new instance — documented as the alternative, not used for
  the throwaway test.)
- **Cert into git:** wildcard cert+key become sops secrets in `cc-ci-secrets`, decrypted at
  activation back to `/var/lib/ci-certs/live/{fullchain.pem,privkey.pem}` via
  `sops.secrets.<name>.path`; proxy.nix keeps reading that path (now sops-sourced, not operator-drop).
- **cc-nix-test final sizing (C6) — SETTLED by operator 2026-05-27: PROMOTE the rebuilt VM.** The
  freshly-rebuilt reproducible VM (the FINAL W5/C4-C5 clean-room throwaway) becomes the canonical
  cc-nix-test; the operator will repurpose it for a live real-traffic test through the public gateway.
- **C6 teardown OVERRIDE (operator, 2026-05-27):** do NOT destroy the FINAL throwaway VM after
  W5/C4-C5 PASSes — keep it RUNNING; defer its C6 teardown until the operator explicitly says
  otherwise. This overrides the plan §5/§6 "destroy the throwaway" for that one VM only. All other
  cleanup proceeds normally (the Builder's first throwaway was already destroyed; RAM accounting holds).

## Phase 1b — lint/format tooling (open decisions §6, settled W0)
- **Formatters/linters (RL1):** Nix = `nixpkgs-fmt` (format) + `statix` (lints) + `deadnix` (dead
  code); Python = `ruff` (lint + format); Shell = `shellcheck` + `shfmt -i 2 -ci`; YAML = `yamllint`.
  Kept `nixpkgs-fmt` over `alejandra` because it was already the repo `formatter` and devshell tool
  (no extra churn / restyle of every .nix). All built from the already-pinned nixpkgs via a flake
  `lint` devshell (`nix develop .#lint`) so CI and local use byte-identical tool versions.
- **Lint entrypoint:** `scripts/lint.sh` (check-only by default; `--fix` auto-applies). The
  `.drone.yml` push pipeline runs it via `nix develop .#lint --command bash scripts/lint.sh`.
- **ruff strictness:** `select = [E,F,W,I,UP,B,C4,SIM]`, `ignore = [E501]` (line length is the
  formatter's job; only un-splittable strings would trip it). `line-length=100`, `target=py311`.
- **Drone lint stage = FAIL (not warn).** The codebase is green now, so enforce from here on — an
  unclean commit fails the `lint` step. (Resolves the §6 open question.)
- **Python type-checking (mypy/pyright): DEFERRED to IDEAS**, not added in 1b. The harness is small
  and dynamically typed around `abra`/subprocess JSON; gradual typing is a larger effort than this
  bounded pass warrants. Revisit if Phase 2's 18-recipe ramp shows type bugs.
- **blocking vs advisory split (§3):** treated as in the phase plan — tests-real, Nix-idempotent,
  no-footguns, no-secrets, log-redaction, harness-DRY = blocking; readability/docs/arch-drift =
  advisory unless a real plan deviation. Recorded per-finding in REVIEW-1b / BACKLOG-1b.
- **cc-ci self-CI push trigger:** the lint stage lives in the `event: push` pipeline. The Gitea→Drone
  push webhook on this instance is flaky (`last_status: None`; documented §4.1) and predates 1b —
  recipe CI uses polling as primary, but cc-ci's *own* self-test/lint relies on the push webhook.
  The lint stage is correctly wired and proven green via the identical `nix develop .#lint` command;
  reliably auto-firing it on every push is tracked as a (pre-existing) infra item, not a 1b lint gap.

## Phase 1b — repo layout (operator review items RL5/RL6, plan §7)
- **RL5 — all Nix code under `nix/`.** Moved `modules/`→`nix/modules/` and `hosts/`→`nix/hosts/`.
  `flake.nix`/`flake.lock` STAY at the repo root (entry point) so the build ref `#cc-ci` and
  `nixos-rebuild --flake '…#cc-ci'` are unchanged — only `flake.nix`'s internal
  `./hosts/cc-ci/configuration.nix` → `./nix/hosts/cc-ci/configuration.nix` changed. Root-relative
  refs inside the moved modules were re-based `../X` → `../../X` (secrets.nix → `../../secrets/`,
  bridge.nix → `../../bridge/`, dashboard.nix → `../../dashboard/`); `configuration.nix`'s
  `../../modules/*` imports are unchanged (both dirs moved under `nix/`, so the relative path still
  resolves). **Toplevel is byte-identical (`8i3jcad9…`) before/after the move** — store derivations
  are content-addressed on the copied file *contents*, and the module `.nix` files aren't part of the
  runtime closure, so relocating folders doesn't change the build. (The operator anticipated a hash
  change; in practice it's stable, which is even stronger for reproducibility.) Living docs
  (README, architecture/install/secrets/enroll) + the `.drone.yml` comment updated to `nix/…`;
  append-only history logs left as the record of what was true then.
- **RL6 — protocol files → `machine-docs/`: DEFERRED to the coordinated end of 1b.** Will `git mv`
  `STATUS*/REVIEW*/JOURNAL*/BACKLOG*/DECISIONS.md` into `machine-docs/` (README.md STAYS at root —
  operator decision, it's the human readme, not a protocol file). The live watchdog (`launch.sh`)
  reads `STATUS-<id>.md`/`REVIEW-<id>.md` at the repo root for handoffs/transition, so this is done
  LAST, in lockstep with the orchestrator updating `launch.sh` + restarting the watchdog — not
  unilaterally and not while a phase transition is pending. The Adversary likewise `git mv`s its own
  REVIEW files at the cutover (single-writer rule).

## Phase 1b — recorded deviation: no `tests/_template/` dir (enroll = copy an existing recipe)
Plan §3's repo layout lists a `tests/_template/` "copy-to-add-a-recipe" dir. It was **never created**
(pre-1b; not introduced or removed by 1b) — instead the documented enroll flow in
`docs/enroll-recipe.md` is **"copy an existing recipe's tree, e.g. `tests/custom-html/…`, then adjust
`recipe_meta.py` + the per-recipe test files."** This satisfies D5's "small, repeatable, documented
operation with no harness surgery" the same way (a concrete recipe is a better starting template than
an abstract skeleton that can drift). Recording per the Adversary's RL3 D5 advisory; not a blocker.

## Phase 1d — generic test suite + layered overlays (design, 2026-05-27)

SSOT: `cc-ci-plan/plan-phase1d-generic-test-suite.md`. Resolves the §6 open decisions.

- **Tier model & op/assertion split (the core call).** A run is a sequence of TIERS — install,
  upgrade, backup, restore, custom — each = `generic default [overridden by a recipe overlay]`. The
  **lifecycle OP** (deploy, upgrade, backup, restore) is owned by the **shared harness**
  (`harness.generic` helpers), NOT duplicated in every test file. A tier's **test file** (generic or
  overlay) carries the ASSERTIONS and calls the shared op helper. This keeps the op single-sourced
  (DRY, DG7) and makes deploy-once trivial: only the orchestrator deploys/tears-down.

- **Override (not additive) — Builder's call (plan §6, operator leaned override).** For each
  lifecycle op exactly ONE assertion file runs, by precedence:
  **repo-local `tests/test_<op>.py` > cc-ci `tests/<recipe>/test_<op>.py` > generic
  (`tests/_generic/test_<op>.py`)**. A present overlay REPLACES the generic for that op. **Invariant:
  no overlay for an op ⇒ the generic runs** (so any recipe is testable with zero config). Repo-local
  wins same-name collisions (upstream is authoritative, plan §2.5); cc-ci's overlay is the curated
  fallback until upstream adopts it. **Extend-by-composition:** an overlay may
  `from harness import generic` and call `generic.assert_serving(...)` / `generic.do_upgrade(...)`
  then add its own assertions — so "extend" needs no separate mechanism.

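The override precedence can be expressed as a first-match lookup. This is a sketch under the assumption that the three locations are available as directories; `resolve_op_test` is a hypothetical helper name, not the harness's actual discovery function.

```python
from pathlib import Path
from typing import Optional


def resolve_op_test(
    op: str,
    repo_local: Optional[Path],  # snapshot of the recipe repo's tests/, if any
    ccci_recipe: Path,           # cc-ci tests/<recipe>/
    generic: Path,               # cc-ci tests/_generic/
) -> Path:
    """Return the single assertion file for a lifecycle op, by precedence."""
    name = f"test_{op}.py"
    for directory in (repo_local, ccci_recipe, generic):
        if directory is not None and (directory / name).is_file():
            return directory / name  # first match wins and replaces the rest
    raise FileNotFoundError(f"no generic {name}; every lifecycle op needs one")
```

Exactly one path is returned per op, which is what makes the rule "override" rather than "additive".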
- **Custom (non-lifecycle) `test_*.py`:** ALL discovered from BOTH locations run additively, opt-in
  (no override, no generic equivalent) — e.g. `test_sso.py`.

- **Deploy ONCE, mutate in place (operator requirement, DG4.1).** The orchestrator deploys the app
  ONCE, runs all tiers against that single live deployment (install asserts; upgrade does
  `abra app upgrade` in place; backup/restore mutate in place; custom asserts), then ONE teardown in
  `finally`. No per-tier/per-overlay `abra app new/deploy/undeploy`. A `CCCI_DEPLOY_COUNT` counter in
  `lifecycle.deploy_app` is asserted == 1 per run (DG4.1 evidence).

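The deploy-once shape can be sketched with a stand-in lifecycle object. `FakeLifecycle` and `run_against_single_deployment` are hypothetical; the real ops call abra, and the counter here only plays the role the text assigns to `CCCI_DEPLOY_COUNT`.

```python
from typing import Callable


class FakeLifecycle:
    """Stand-in for the shared lifecycle ops (hypothetical; the real ones call abra)."""

    def __init__(self) -> None:
        self.deploy_count = 0  # plays the role of CCCI_DEPLOY_COUNT
        self.teardowns = 0

    def deploy_app(self) -> None:
        self.deploy_count += 1

    def teardown(self) -> None:
        self.teardowns += 1


def run_against_single_deployment(
    lifecycle: FakeLifecycle, tiers: list[Callable[[], None]]
) -> None:
    """Deploy once, run every tier against that deployment, tear down exactly once."""
    try:
        lifecycle.deploy_app()
        for tier in tiers:
            tier()            # tiers assert or mutate in place; they never deploy
    finally:
        lifecycle.teardown()  # runs even if the deploy or a tier raises
    assert lifecycle.deploy_count == 1, "expected exactly one deploy per run"
```

Keeping the deploy and teardown calls in the orchestrator alone is what lets tier files stay assertion-only.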
- **Deployment-sharing scope & base version (§6 open).** One deployment for the whole lifecycle.
  Base version deployed once = the **previous published version** when an upgrade tier will run and a
  previous exists (so upgrade goes previous→target in place), **else the target** (current/$REF).
  Recipe with only one published version ⇒ upgrade tier is a clean **SKIP** (nothing to upgrade
  from). Standalone generic-install demo (no PR) deploys current.

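The base-version rule reduces to one decision. In this sketch `choose_base_version` is a hypothetical helper, and how `previous` is looked up (which published version counts as "previous") is left to the caller because the text does not spell it out.

```python
from typing import Optional


def choose_base_version(
    target: str, previous: Optional[str], upgrade_tier_enabled: bool
) -> tuple[str, bool]:
    """Return (version to deploy first, whether the upgrade tier runs)."""
    if upgrade_tier_enabled and previous is not None:
        # Deploy the previous version, then upgrade previous -> target in place.
        return previous, True
    # No previous version (or no upgrade tier): deploy the target, skip upgrade.
    return target, False
```

A recipe with a single published version passes `previous=None` and gets the clean skip described above.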
- **Fail handling across shared tiers (§6 open):** install failing (app never serves) **fail-fasts**
  the run (later tiers can't meaningfully run on a dead deployment) and they report **error/skip**;
  upgrade/backup/restore failures are recorded per-op but do not abort the remaining independent
  tiers where they can still run. Teardown always runs.

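A sketch of the fail-handling rule, using the per-op result values `pass | fail | skip | error` from this design. `run_ops` and `summary_line` are hypothetical names, and reporting later tiers as `skip` (rather than `error`) after an install failure is a choice made here for illustration; the text allows either.

```python
from enum import Enum
from typing import Callable


class Result(Enum):
    PASS = "pass"
    FAIL = "fail"
    SKIP = "skip"
    ERROR = "error"


def run_ops(ops: dict[str, Callable[[], None]]) -> dict[str, Result]:
    """Run ops in insertion order; 'install' failing short-circuits the rest."""
    results: dict[str, Result] = {}
    install_failed = False
    for name, op in ops.items():
        if install_failed:
            results[name] = Result.SKIP  # dead deployment: nothing meaningful to run
            continue
        try:
            op()
            results[name] = Result.PASS
        except AssertionError:
            results[name] = Result.FAIL   # an assertion about the app did not hold
        except Exception:
            results[name] = Result.ERROR  # the op itself broke
        if name == "install" and results[name] is not Result.PASS:
            install_failed = True
    return results


def summary_line(results: dict[str, Result]) -> str:
    """One per-op summary line per run."""
    return " ".join(f"{name}={result.value}" for name, result in results.items())
```

An upgrade failure is recorded and the loop continues, so backup and restore still get their own result.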
- **Backup-capability detection (DG3, §6 open):** auto — scan the recipe's `compose*.yml` for a
  `backupbot.backup` label (verified present in custom-html). `recipe_meta.BACKUP_CAPABLE` (bool)
  overrides the auto-detect. Not capable ⇒ backup+restore tiers are **N/A (skip)**, not failures.

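Backup-capability detection might be sketched as below. The plain substring scan is an assumption made to keep the example dependency-free (a real implementation could parse the YAML and inspect service labels), and `backup_capable` is a hypothetical name; `override` stands in for `recipe_meta.BACKUP_CAPABLE`.

```python
from pathlib import Path
from typing import Optional


def backup_capable(recipe_dir: Path, override: Optional[bool] = None) -> bool:
    """True if the recipe should get backup/restore tiers."""
    if override is not None:
        return override  # explicit recipe_meta setting wins over auto-detect
    # Auto-detect: any compose*.yml mentioning the backupbot.backup label.
    return any(
        "backupbot.backup" in compose.read_text(errors="ignore")
        for compose in sorted(recipe_dir.glob("compose*.yml"))
    )
```

A `False` result maps to an N/A skip for the backup and restore tiers, not a failure.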
- **Custom install-steps hook (DG5, §6 open):** a shell hook — `tests/<recipe>/install_steps.sh`
  (cc-ci) or repo-local `tests/install_steps.sh` — run by the orchestrator during the install tier
  AFTER `abra app new` + env defaults but BEFORE `abra app deploy`, with env `CCCI_APP_DOMAIN`,
  `CCCI_RECIPE`, `CCCI_APP_ENV` (path to the app .env). Chosen over a fixture/declarative field as the
  simplest thing the harness runs uniformly (can `abra app secret insert`, set env, seed). Graceful
  rule: a recipe with NO hook still attempts the generic install; if it genuinely needs a step it
  FAILS the generic install (reported per-op) — that is correct, not a harness bug.

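How the orchestrator could invoke the install-steps hook is sketched below. `run_install_hook` is a hypothetical name, running the hook with `sh` is an assumption (the text says only "a shell hook"), and which of the two hook locations wins when both exist is not stated in the text, so the caller passes a single resolved path.

```python
import os
import subprocess
from pathlib import Path
from typing import Optional


def run_install_hook(
    hook: Optional[Path], domain: str, recipe: str, app_env: Path
) -> bool:
    """Run the hook between `abra app new` and `abra app deploy`; True if it ran."""
    if hook is None or not hook.is_file():
        return False  # no hook: the generic install proceeds unchanged
    env = {
        **os.environ,
        "CCCI_APP_DOMAIN": domain,
        "CCCI_RECIPE": recipe,
        "CCCI_APP_ENV": str(app_env),  # path to the app .env
    }
    # check=True makes a failing hook raise CalledProcessError, failing the install op.
    subprocess.run(["sh", str(hook)], env=env, check=True)
    return True
```

A hook that appends a line to `"$CCCI_APP_ENV"` is enough to set a per-recipe env value before deploy.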
- **Per-op result vocabulary (Phase-3 feed):** `pass | fail | skip(N/A) | error`. The orchestrator
  prints a per-op summary line per run (feeds DG6 + Phase-3 level).

- **Discovery layout:** cc-ci overlays/custom/hook live in `tests/<recipe>/`; repo-local in the
  recipe repo's `tests/` (snapshotted after fetch, per the existing volatile-checkout handling).
  Generic tier files live in `tests/_generic/` (assertion-only, use the shared live-deployment
  fixtures).