cc-ci Phase 1d — Generic test suite + layered recipe overlays (Autonomous Build Plan)
Status: QUEUED — runs after Phase 1b (plan-phase1b-review-lint.md) and before Phase 2
(plan-phase2-recipe-tests.md). It is the test-architecture foundation Phase 2 builds on, so it
must precede it.
Transition: manual (operator kicks it off at the post-1b check-in).
Builds on: the post-1b codebase (the runner/harness, .drone.yml, the comment-bridge, the proof
recipes, the nix/ + machine-docs/ layout from 1b RL5/RL6).
Owner agents: same Builder + Adversary loops (plan.md §6/§7); Adversary cold-verifies.
This file's path: /srv/cc-ci/cc-ci-plan/plan-phase1d-generic-test-suite.md
Phase order: 1c → 1b → 1d → 2 → 2b → 3.
0. Why this phase
Today a recipe only gets tested if someone has authored tests for it. That doesn't scale to ~18+
recipes and gives nothing for a brand-new recipe. The operator's model (2026-05-27): every recipe
gets a generic lifecycle test suite for free, and recipe-specific tests layer on top rather than
being the only thing that runs. This makes !testme meaningful on any recipe immediately, turns
Phase 2 into "add overlays where they add value" instead of "write everything from scratch," and
gives Phase 3 a natural basis for a YunoHost-style level (how many tiers pass).
Core principle: the generic is the default for each lifecycle op; a recipe's own test_<op>.py
may override or extend that default. The exact additive-vs-override mechanism is the Builder's
design call (operator, 2026-05-27: override — e.g. a present test_install.py replaces the
generic install assertions — is a perfectly good option; additive is fine too; could even be
per-recipe opt-in). What's fixed and non-negotiable: (1) when a recipe defines no test for an
op, the generic runs (so any recipe is testable with zero config); (2) the harness owns a
single shared deployment that generic + recipe assertions reuse — no redeploy (§2.2); (3)
custom (non-lifecycle) tests are opt-in.
SUPERSEDED (2026-05-28) by Phase 1e HC3 (plan-phase1e-harness-corrections.md): the override default is replaced — the generic now runs by default alongside an overlay (additive), skipped only via an explicit opt-out. Also: repo-local PR code is gated to approved recipes (HC2), and the upgrade tier targets the PR head (HC1). Read 1e for the current behavior.
1. Definition of Done (Phase 1d exit condition)
Terminates when every item holds and the Adversary has independently cold-verified (logged in
machine-docs/REVIEW-1d.md):
- DG1 — Generic INSTALL test. A recipe-agnostic install test exists that, given only a recipe name and zero recipe-specific config, does abra app new (sane defaults / auto secrets) → deploy → polls to converged (all services running/healthy, no bare sleep) → asserts the app is actually serving (real HTTP(S) response on its <run>.ci.commoninternet.net domain through Traefik — not a Traefik 404/default cert, not health-only). Demonstrated green on ≥1 simple recipe (e.g. custom-html) that has no cc-ci or repo-local tests.
- DG2 — Generic UPGRADE test. Deploy the previous published version (the last release before the code under test), then upgrade to the code under test (PR head) via abra app deploy --chaos (chaos = the current checkout) — i.e. previous-release → PR-head, not previous → newest-published-tag. Assert services reconverge and the app still serves. OPERATOR CORRECTION (2026-05-28): the current 1d impl upgrades to the newest published tag and (because deploying the prev tag re-checks-out the recipe) never deploys the PR head — so a recipe PR's actual changes aren't exercised by the upgrade path. Fix: after deploying prev, restore the PR-head checkout (re-checkout the PR ref / re-snapshot) and deploy --chaos to it as the upgrade target. The "deployment actually moved" assertion (do_upgrade) still applies, but adapt it for prev→PR-head (a PR may not bump the version label — also accept an image/config change, or assert the running config now matches the PR head). For a non-PR !testme, "current checkout" = the catalogue current, so upgrade tests prev→current. (Data-continuity assertions remain recipe overlays — see §2.1.)
- DG3 — Generic BACKUP + RESTORE tests. For backup-capable recipes (declare backupbot / abra app backup support): run backup → assert a snapshot artifact is produced; then restore → assert restore completes and the app is healthy after. For recipes that declare no backup config, backup/restore are cleanly N/A (skipped) — not a failure.
- DG4 — Layering (override-or-extend; generic is the default). The harness composes a recipe's run from the generic default per op, overridden or extended by the recipe's test_install.py / test_upgrade.py / test_backup.py / test_restore.py when present (in cc-ci's tests dir or repo-local in the recipe's tests/) — the additive-vs-override mechanism is the Builder's design choice, recorded in machine-docs/DECISIONS.md. Invariant: if a recipe defines no test for an op, the generic runs. Arbitrary custom test_*.py run only if defined. Discovery + cc-ci-vs-repo-local precedence is implemented and settled in machine-docs/DECISIONS.md.
- DG4.1 — Overlays reuse the deployment (no redeploy). Overlay + custom assertions run against the same live deployment the generic tier brought up — one deploy per run, one teardown (§2.2), lifecycle ops mutating it in place. Verified: adding an overlay to a recipe causes no extra abra app new/deploy/undeploy beyond the shared run (assert via deploy-count / harness logs). Re-provisioning only where an op semantically demands it, and then explicitly.
- DG5 — Custom install-steps hook (with the "generic-anyway" rule). The harness supports defined per-recipe extra install steps (cc-ci or repo-local) that run before/around the generic install. A recipe with no customization still attempts the generic suite. Proven both ways: (a) a recipe needing a custom step fails the generic install as expected when the step is absent; (b) the same recipe passes once the custom step is added — demonstrating the hook + the graceful-generic-failure are both real.
- DG6 — !testme end-to-end on an unconfigured recipe. A !testme on a recipe with no cc-ci/repo-local customization runs the full generic suite through the real pipeline (bridge → Drone → deploy → assert → undeploy → report) and reports per-operation pass/fail/skip (install/upgrade/backup/restore).
- DG7 — Real, DRY, clean. No softened/skip/xfail/can't-fail assertions; generic logic lives in the shared harness (M6.5 — no per-recipe copy-paste); every run undeploys (teardown in finally), respects MAX_TESTS.
- DG8 — Documented + cold-verified. docs/ explains the generic suite, the overlay convention (file names + locations + precedence), and the custom-install-steps hook + how to add a recipe overlay. The Adversary re-runs the acceptance checks from a cold start within 24h.
When DG1–DG8 hold and are confirmed, write ## DONE to machine-docs/STATUS-1d.md.
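DG1's "polls to converged, no bare sleep" requirement can be illustrated with a deadline-bounded polling helper. This is a minimal sketch, not the harness's actual code: the name poll_until, its parameters, and the injectable clock/sleep hooks are all hypothetical and exist only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def poll_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Re-evaluate `check` until it returns True or `timeout_s` elapses.

    Returns True on convergence and False on timeout. Each wait is capped
    by the time remaining, so the call never overshoots the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        sleep(min(interval_s, remaining))
```

In a harness, check would wrap whatever "all services running/healthy" query is chosen. A False return then becomes a reported install failure rather than a hang.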
2. The layered test model (the core design)
For a given recipe, the harness assembles and runs tiers, each = generic [+ overlays]:
(gen = generic default; → = "overridden/extended by, if the recipe defines it" — mechanism is
the Builder's call, §2.2):
    INSTALL = gen_install → test_install.py (else gen)    ← always runs
    UPGRADE = gen_upgrade → test_upgrade.py (else gen)    ← always runs
    BACKUP  = gen_backup  → test_backup.py  (else gen)    ← if backup-capable, else N/A
    RESTORE = gen_restore → test_restore.py (else gen)    ← if backup-capable, else N/A
    CUSTOM  = recipe test_*.py (anything beyond the four) ← ONLY if defined
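The tier table above could be expressed as a small planning function. This is a sketch under stated assumptions: plan_tiers, its arguments, and the "generic" / "recipe" / "n/a" labels are hypothetical names, and backup capability is passed in as a flag because how it is detected is still an open decision (§6).

```python
LIFECYCLE_OPS = ("install", "upgrade", "backup", "restore")


def plan_tiers(recipe_tests: set[str], backup_capable: bool) -> dict[str, object]:
    """Decide what runs for each tier, given the recipe's test_*.py file names."""
    plan: dict[str, object] = {}
    for op in LIFECYCLE_OPS:
        if op in ("backup", "restore") and not backup_capable:
            plan[op] = "n/a"  # cleanly skipped, not a failure
        elif f"test_{op}.py" in recipe_tests:
            plan[op] = "recipe"  # recipe file overrides or extends the generic
        else:
            plan[op] = "generic"  # invariant: no recipe test => generic runs
    lifecycle_files = {f"test_{op}.py" for op in LIFECYCLE_OPS}
    # Anything beyond the four lifecycle names is a custom test: opt-in only.
    plan["custom"] = sorted(recipe_tests - lifecycle_files)
    return plan
```

For a recipe with no tests at all, every applicable lifecycle tier resolves to "generic" and the custom list is empty, which is the zero-config case the plan requires.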
2.1 Generic baseline suite (recipe-agnostic)
- install — abra app new <recipe> --domain <run>.ci.commoninternet.net with non-interactive defaults + auto-generated secrets; abra app deploy; poll to converged; assert a real HTTP(S) response from the app over its domain (status + that it's the app, not Traefik's fallback).
- upgrade — deploy a prior/pinned version, abra app upgrade to target, assert reconverge + still serving.
- backup / restore — only for recipes declaring backup config; verify the mechanism (backup produces an artifact; restore completes + app healthy). Honest limit: generic backup/restore can't assert app-specific data integrity without recipe knowledge — that's a recipe overlay (test_backup.py / test_restore.py seed a marker + assert it survives). State this in docs.
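The "it's the app, not Traefik's fallback" part of the install assertion might look like the classifier below. It is a rough sketch with labeled assumptions: classify_response and its return labels are hypothetical, the fallback body string assumes Traefik answers unrouted hosts with the plain-text "404 page not found", and treating 401/403 as "app" assumes a login wall counts as serving. The default-certificate check from DG1 is not covered here.

```python
# Assumption: body Traefik returns when no router matches the requested host.
TRAEFIK_FALLBACK_BODY = "404 page not found"


def classify_response(status: int, body: str) -> str:
    """Label an HTTP response by who most plausibly produced it."""
    if status == 404 and body.strip() == TRAEFIK_FALLBACK_BODY:
        return "traefik-fallback"  # no route for this domain (yet)
    if status in (502, 503, 504):
        return "backend-down"  # proxy is up, the app behind it is not
    if 200 <= status < 400 or status in (401, 403):
        return "app"  # success, redirect, or a login wall from the app
    return "unexpected"
```

A generic install assertion would then require the label "app" and fail with the label in the message otherwise, which keeps a routing failure distinguishable from an app failure in the per-op report.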
2.2 Layering — override or extend (Builder's call), always reuse the deployment
A recipe-defined test_install.py (etc.) either overrides the generic assertions for that op or
extends them — the Builder picks the mechanism (a present test_<op>.py replacing the generic is
a fine, simple model; or additive; or per-recipe opt-in). The invariant either way: if a recipe
defines nothing for an op, the generic runs. This guarantees every recipe is testable with zero
config, while letting a recipe with a poor generic fit supply its own.
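Both candidate mechanisms fit in a few lines, which may help when recording the choice in DECISIONS. This is a sketch only: compose_op, the Assertion alias, and the "override" / "extend" mode strings are hypothetical names, and the mode is deliberately a required argument because the plan leaves that decision open.

```python
from typing import Callable, Optional

Assertion = Callable[[], None]


def compose_op(generic: Assertion, recipe: Optional[Assertion], mode: str) -> list[Assertion]:
    """Return the assertions to run for one lifecycle op, in order."""
    if recipe is None:
        return [generic]  # invariant: nothing recipe-defined => generic runs
    if mode == "override":
        return [recipe]  # recipe test replaces the generic assertions
    if mode == "extend":
        return [generic, recipe]  # recipe test runs after the generic ones
    raise ValueError(f"unknown layering mode: {mode!r}")
```

Whichever mode is chosen, the first branch is the part that must not change: it is what makes every recipe testable with zero config.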
Reuse the deployment — do NOT redeploy per test (operator requirement, 2026-05-27). Overlay
assertions run against the same live deployment the generic tier already brought up — no extra
abra app new/deploy/undeploy per overlay. The target shape: one deploy per run, then the
lifecycle ops mutate that single deployment in sequence and all assertions (generic + overlays +
custom) run against it, then one teardown at the end:
    deploy ONCE
      → INSTALL  assertions: generic_install + test_install.py   (same live app)
      → UPGRADE  in place (abra app upgrade)
                 assertions: generic_upgrade + test_upgrade.py   (same app, upgraded)
      → BACKUP   (if capable) → generic_backup + test_backup.py
      → RESTORE  (if capable) → generic_restore + test_restore.py
      → CUSTOM   test_*.py                                        (same live app)
    teardown ONCE (in finally)
So a seed/marker written by test_backup.py is the same data test_restore.py checks; an overlay's
extra HTTP assertion hits the app the generic install already deployed. Tiers that intentionally
change state (UPGRADE, RESTORE) do so on the shared deployment in order. The only time a tier
re-provisions is when an op semantically requires it (e.g. a from-scratch restore-into-blank variant)
— and that must be explicit, not the default. This is also the main Phase-2b speed win.
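The deploy-once shape, together with the deploy-count check DG4.1 asks for, can be sketched as follows. Everything here is illustrative: CountingDeployment and run_lifecycle are hypothetical names, the deployment is a counting stand-in rather than a real abra call, and continuing after a failed tier is an assumption, since fail-fast vs continue-and-report is still an open decision (§6).

```python
from typing import Callable, Iterable


class CountingDeployment:
    """Hypothetical stand-in for the shared deployment; counts deploy calls."""

    def __init__(self) -> None:
        self.deploys = 0
        self.undeploys = 0

    def deploy(self) -> None:
        self.deploys += 1

    def undeploy(self) -> None:
        self.undeploys += 1


Tier = Callable[[CountingDeployment], None]


def run_lifecycle(dep: CountingDeployment, tiers: Iterable[tuple[str, Tier]]) -> dict[str, str]:
    """Deploy once, run every tier against that deployment, tear down once."""
    results: dict[str, str] = {}
    try:
        dep.deploy()
        for name, tier in tiers:
            try:
                tier(dep)  # tiers mutate and inspect the same live deployment
                results[name] = "pass"
            except AssertionError as exc:
                results[name] = f"fail: {exc}"  # assumption: continue and report
    finally:
        dep.undeploy()  # teardown runs even if deploy or a tier raised
    return results
```

Asserting that deploys and undeploys both equal 1 after a run with overlays is one way to express the "no extra deploy beyond the shared run" check.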
2.3 Custom tests
test_*.py that aren't one of the four lifecycle names run only when present for that recipe —
no generic equivalent, purely opt-in (e.g. test_sso.py, test_federation.py).
2.4 Custom install steps (and the graceful-generic rule)
Some recipes need extra setup the generic flow won't do (pre-seed a DB, set an env/secret, run a one-off command). The harness exposes a defined per-recipe install-steps hook (cc-ci or repo-local). Rules:
- If a recipe declares custom install steps → run them as part of the install tier.
- If a recipe has no customization defined anywhere → still attempt the generic suite. Recipes that genuinely need special steps will fail the generic install — and that's acceptable and expected; the failure is reported (per-op), and the fix is to add the custom step (Phase 2 work), not to special-case the harness.
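The two rules above, and the fail-without / pass-with proof DG5 asks for, can be sketched with plain callables. This is an assumption-laden illustration: run_install_tier and the Step alias are hypothetical, steps are modelled as Python callables even though the declaration format (shell hook, fixture, or declarative field) is still open, and steps run before the generic install rather than around it.

```python
from typing import Callable, Sequence

Step = Callable[[], None]


def run_install_tier(custom_steps: Sequence[Step], generic_install: Step) -> tuple[str, str]:
    """Run declared custom steps (if any), then the generic install.

    Returns (result, detail). A failure is reported, never special-cased.
    """
    try:
        for step in custom_steps:
            step()
        generic_install()  # attempted even when no custom steps are declared
    except Exception as exc:
        return ("fail", str(exc))
    return ("pass", "")
```

Calling it with an empty step list on a recipe that needs seeding yields a reported "fail". Adding the step and calling it again yields "pass", with no change to the generic install itself.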
2.5 Discovery + precedence
- Locations: cc-ci's test dir (e.g. tests/<recipe>/) and the recipe repo's tests/ (repo-local). The harness discovers overlays + custom-install-steps from both.
- Precedence (OPEN — settle in DECISIONS): proposed default — both layer; repo-local is the upstream-authoritative source and always runs, cc-ci's overlay runs in addition (and may pin/extend for the CI env). Define the rule for same-named collisions explicitly.
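Discovery under the proposed default (both locations layer, repo-local first) could be sketched as below. The function name discover_overlays and its return shape are hypothetical, and the same-name collision rule is intentionally not decided: colliding files are simply listed in order so the eventual rule can be applied on top.

```python
from pathlib import Path


def discover_overlays(cc_ci_dir: Path, repo_local_dir: Path) -> dict[str, list[Path]]:
    """Map each test_*.py file name to the files providing it, in run order."""
    found: dict[str, list[Path]] = {}
    # Proposed default: repo-local is upstream-authoritative and listed first;
    # cc-ci's overlay is listed after it and runs in addition.
    for root in (repo_local_dir, cc_ci_dir):
        if not root.is_dir():
            continue  # a missing location means "no overlays there"
        for path in sorted(root.glob("test_*.py")):
            found.setdefault(path.name, []).append(path)
    return found
```

A name that maps to two paths is exactly the collision case DECISIONS still has to settle.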
3. Milestones (bounded)
- G0 — Generic install. Implement generic_install in the shared harness; green on custom-html with no recipe config. Accept: DG1.
- G1 — Generic upgrade + backup/restore. Add generic_upgrade; add generic_backup/restore gated on backup-capability (clean N/A otherwise). Accept: DG2, DG3.
- G2 — Layering + discovery. Implement the generic+overlay composition and cc-ci/repo-local discovery + precedence; prove an overlay runs on top of generic. Accept: DG4.
- G3 — Custom install-steps hook + graceful-generic. Implement the hook; demonstrate the fail-without / pass-with proof on one recipe needing a step. Accept: DG5.
- G4 — !testme integration + per-op reporting + docs + cold verify. Accept: DG6, DG7, DG8; then flip machine-docs/STATUS-1d.md to ## DONE.
4. Guardrails
- Generic is the default; recipe tests override or extend it (Builder's mechanism) — but the generic always runs when a recipe defines no test for that op. Never let a recipe end up with no assertion for an op it should be tested on.
- Generic failure ≠ harness bug — a recipe needing custom steps failing the generic install is a correct, reported outcome; fix by adding the step (Phase 2), don't weaken/special-case the generic.
- Deploy once, reuse it — overlays run against the generic tier's live deployment; no per-test/per-overlay redeploy. One deploy + one teardown per run; re-provision only when an op truly needs it, explicitly. (Correctness and the main perf lever.)
- DRY — generic logic in the shared harness, not per-recipe (M6.5); overlays are thin.
- No weakened tests — real assertions on real app state; teardown always; honor MAX_TESTS.
- Bounded — build the architecture + prove it on a couple of recipes; the full per-recipe overlay authoring is Phase 2, not here.
5. Impact on later phases (reshapes the plan set)
- Phase 2 (plan-phase2-recipe-tests.md) changes from "author every test from scratch" to: every enrolled recipe gets the generic suite for free; Phase 2 = author the additive overlays (port recipe-maintainer tests as test_*.py overlays) + define custom install steps where a recipe fails generically. Update Phase 2 to reference 1d as its foundation.
- Phase 3 (plan-phase3-results-ux.md) — the YunoHost-style level maps cleanly onto the tiers: e.g. installs (generic) → +upgrade → +backup/restore → +custom assertions. The level is derived from which tiers pass.
- Phase 2b (perf) — the generic suite is the common hot path, so it's the prime target for the image-cache / readiness / dedup optimizations.
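One way to read "the level is derived from which tiers pass" is as a ladder that stops at the first tier that did not pass. The sketch below assumes that reading. LADDER, derive_level, and the "backup_restore" key are hypothetical, and treating a skipped (N/A) tier as the end of the ladder is a placeholder choice that Phase 3 may well change.

```python
LADDER = ("install", "upgrade", "backup_restore", "custom")


def derive_level(results: dict[str, str]) -> int:
    """Count consecutive passing tiers from the bottom of the ladder."""
    level = 0
    for tier in LADDER:
        if results.get(tier) != "pass":
            break  # assumption: a fail, skip, or missing tier ends the climb
        level += 1
    return level
```

Under this reading, a recipe that installs and upgrades cleanly but has no backup config sits at level 2.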
6. Open decisions (log in machine-docs/DECISIONS.md)
- Override vs extend (Builder's call). Does a present test_<op>.py replace the generic assertions for that op, add to them, or is it per-recipe opt-in? Operator leans: override is a good, simple model. Builder decides + documents — keeping the invariant "no recipe test for an op ⇒ generic runs" and the single-shared-deployment rule.
- cc-ci vs repo-local precedence for overlays + install-steps (§2.5) and same-name collision rule.
- Exact overlay file convention: fixed names (test_install.py …) + discovery dir layout (tests/<recipe>/ in cc-ci; tests/ in the recipe repo).
- How custom install steps are declared: a shell hook (install_steps.sh), a pytest fixture, or a declarative field — pick the simplest that the harness can run uniformly.
- Backup-capability detection: how the harness decides a recipe is backup-capable (backupbot labels present / abra app backup exit) to choose run-vs-N/A for DG3.
- Whether generic upgrade should always go previous→latest, or test the specific version-bump under !testme (PR-driven).
- Per-op result vocabulary (pass/fail/skip(N/A)/error) feeding the Phase-3 level.
- Deployment-sharing scope: confirm one-deploy-per-run for the whole lifecycle (install→upgrade→backup→restore→custom on a single deployment) vs per-tier deployments; and how a failed earlier tier (e.g. install) affects later tiers sharing that deployment (fail-fast vs continue-and-report).
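The per-op result vocabulary could be pinned down as an enum with a small report formatter. This is a sketch, not a settled design: OpResult, run_is_green, and format_report are hypothetical names, and the split between FAIL (an app assertion did not hold) and ERROR (the harness could not run the op) is one possible interpretation of the four words listed above.

```python
from enum import Enum


class OpResult(Enum):
    PASS = "pass"
    FAIL = "fail"    # assumption: an assertion about the app did not hold
    SKIP = "skip"    # N/A, e.g. backup on a recipe with no backup config
    ERROR = "error"  # assumption: the harness itself could not run the op


def run_is_green(results: dict[str, OpResult]) -> bool:
    """Green means nothing failed or errored and at least one op passed."""
    values = list(results.values())
    no_failures = all(r in (OpResult.PASS, OpResult.SKIP) for r in values)
    return no_failures and OpResult.PASS in values


def format_report(results: dict[str, OpResult]) -> str:
    """Render one 'op: result' line per operation, in insertion order."""
    return "\n".join(f"{op}: {r.value}" for op, r in results.items())
```

Requiring at least one PASS keeps an all-skipped run from reporting green, in line with the guardrail that a recipe should never end up with no assertion for an op it should be tested on.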