Compare commits

1 Commits

Author SHA1 Message Date
73427d6e36 feat(regression): add E2E canary regression suite (tests/regression/)
Some checks failed
continuous-integration/drone/push Build is failing
Seven canaries prove both halves of the server's job:
- GREEN: good apps are reported healthy (good-simple + good-significant)
- RED: broken apps are caught at intended tier (false-green guard + 4 per-tier)

Fixtures: custom-html-bkp-bad (backup tier RED) + custom-html-rst-bad (restore tier RED).
All 7 canaries verified on live server (see STATUS-regression.md for artifacts).

Not wired to per-commit CI — run on-demand: pytest -m canary tests/regression/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-02 03:33:24 +00:00
146 changed files with 952 additions and 3767 deletions

@@ -35,12 +35,10 @@ steps:
# the comment-bridge). Deploys the recipe at the PR head, runs install/upgrade/backup + any
# recipe-local tests via the shared harness, then guarantees teardown (plan §4.2/§4.3).
#
# Resource safety (plan §4.2/§4.3): DRONE_RUNNER_CAPACITY=2 (nix/modules/drone-runner.nix, the
# single concurrency knob) allows two recipe runs in parallel. Concurrent-run safety is enforced by
# the harness, not by serialisation: every run holds an exclusive flock on its app domain
# (/run/lock/cc-ci-app-<domain>.lock) for its whole process lifetime, the run-start janitor probes
# that lock to reap only orphans (held lock = live run, never touched), and recipe working trees
# are per-run ($ABRA_DIR/recipes — no shared checkout, no recipe lock). See docs/concurrency.md.
# Resource safety (plan §4.2/§4.3): MAX_TESTS=DRONE_RUNNER_CAPACITY=1 (nix/modules/drone-runner.nix) is
# the primary concurrency cap; concurrency.limit below is a redundant belt. CCCI_JANITOR_MAX_AGE=0
# makes the run-start janitor reap ANY orphaned run app before deploying — safe because capacity=1
# means no concurrent run exists (a SIGKILL'd/timed-out build leaves an orphan with no teardown).
kind: pipeline
type: exec
name: recipe-ci
@@ -53,37 +51,21 @@ trigger:
event:
- custom
# NB deliberately NO `concurrency.limit` here: DRONE_RUNNER_CAPACITY (nix/modules/drone-runner.nix
# maxTests) is the single concurrency knob (P4 — two knobs in two files drifted).
concurrency:
limit: 1
steps:
- name: ci
environment:
STAGES: install,upgrade,backup,restore,custom
# The exec runner points HOME at a per-build workspace; force it to /root so abra's server
# config is found via the per-run ABRA_DIR's servers/ symlink -> /root/.abra/servers.
# Recipe trees are PER-RUN ($ABRA_DIR/recipes, exported by run_recipe_ci before any abra
# call), so concurrent builds never share a recipe checkout; app .env files are per-domain
# in the shared canonical servers/ path, guarded by the app-domain flock.
CCCI_JANITOR_MAX_AGE: "0"
# The exec runner points HOME at a per-build workspace; force it to /root so abra finds its
# server config + recipes under /root/.abra (as the manual M4/M5 runs did). Safe: capacity=1
# means no concurrent build shares /root/.abra.
HOME: /root
commands:
# RECIPE/REF/PR/SRC (+ CCCI_QUICK for `!testme --quick`) are injected as env vars from the
# build's custom params. CCCI_QUICK=1 makes run_recipe_ci take the opt-in fast lane (WC7);
# absent => full cold (default). run_quick ignores STAGES (always upgrade+custom).
- 'echo "recipe-ci: RECIPE=$RECIPE REF=$REF PR=$PR SRC=$SRC stages=$STAGES quick=${CCCI_QUICK:-0}"'
# P1 lock-lifetime hardening: run the harness in its own session/process group (setsid) and
# forward a drone cancel (TERM to this step shell) to the WHOLE group, so the harness's
# SIGTERM handler runs its teardown funnel instead of being leaked (the exec runner kills
# only the step shell, not the tree). PDEATHSIG inside the harness backstops the case where
# this shell dies without the trap firing. The harness exit code is captured explicitly and
# the traps cleared before exiting: the runner shell is `set -e`, and an EXIT-trap kill of
# the already-gone process group returns ESRCH, which otherwise poisons a GREEN run's exit
# status to 1 (observed live, build 269: all tiers pass, step exit 1).
- |
setsid cc-ci-run runner/run_recipe_ci.py &
PID=$!
trap 'kill -TERM -- "-$PID" 2>/dev/null || true' TERM EXIT
rc=0
wait "$PID" || rc=$?
trap - TERM EXIT
exit "$rc"
- cc-ci-run runner/run_recipe_ci.py

@@ -1,22 +0,0 @@
# BACKLOG — sub-phase conc
## Build backlog
- [ ] P1 lock-lifetime hardening: prctl PDEATHSIG + ppid race check + SIGTERM handler →
teardown funnel + signal.alarm(3600) hard deadline; .drone.yml setsid/trap wrap;
PEP 446 comment on lock open()
- [ ] P2 flock-probe janitor: acquire_app_lock(domain) at register_run_app's call site;
janitor probes per-domain lockfiles (acquired→reap under probe lock, held→leave,
>120min mtime→warn); delete registry symbols
- [ ] P3 per-run ABRA_DIR: /var/lib/cc-ci-runs/<build>/abra with servers+catalogue symlinks,
fresh recipes/; fetch_recipe = plain clone; delete acquire_recipe_lock; route harness
recipe paths through ABRA_DIR
- [ ] P4 config cleanup: remove concurrency.limit from .drone.yml; maxTests is the single knob
- [ ] tests/concurrency suite (19 cases, real-kernel flock, explicit invocation only)
- [ ] P5 docs/concurrency.md rewrite to the new model
- [ ] M1 claim (branch complete, both suites + lint green)
- [ ] M2: merge to main after M1 PASS, push build green, live verification ad
## Adversary findings
(adversary-owned)

@@ -1,24 +0,0 @@
# JOURNAL — sub-phase conc (Builder, append-only)
## 2026-06-10 — bootstrap
Read concurrency-restructure-full-plan.md (SSOT) + plan.md §6.1/§7/§9. Oriented on the code:
- `runner/harness/lifecycle.py`: recipe flock (l.46), registry (l.65-97), deploy_app
registration (l.283), teardown unregister (l.723), three-way janitor (l.726).
- `runner/run_recipe_ci.py`: `acquire_recipe_lock` call site (l.843), `fetch_recipe` (l.140,
rm-rf + reclone of the shared tree), janitor call sites (l.600 quick, l.932 cold).
- `.drone.yml`: recipe-ci step runs `cc-ci-run runner/run_recipe_ci.py` bare (P1 wraps it),
`concurrency.limit: 2` (P4 removes).
- Greps for P3 fallout: `~/.abra/recipes` referenced in abra.py (recipe_checkout,
has_lightweight_version_tags, recipe_head_commit, recipe_versions), generic.py:28,
lifecycle.prepull_images, run_recipe_ci (fetch_recipe, snapshot_recipe_tests, comment),
warm_reconcile.py:202 (runs OUTSIDE per-run context — keeps default), and
tests/ghost+discourse install_steps.sh (`${HOME}/.abra/recipes/...` — these run INSIDE a
run and copy compose.ccci.yml into the deploy tree, so they must resolve the per-run dir).
- `~/.abra/servers/...` paths are unaffected by design (servers/ is symlinked to the canonical
/root/.abra/servers, so both resolutions land on the same file).
Working setup: state files on main in this clone; code on branch `restructure/concurrency`
via a git worktree at ../cc-ci-conc; test runs on the cc-ci host via /root/builder-clone
(`cc-ci-run -m pytest ...`, `nix develop .#lint`).

@@ -1,32 +0,0 @@
# REVIEW-conc.md — Adversary ledger, concurrency-restructure phase
Append-only. Verdicts: `<gate>: PASS @<ts>` + evidence, or `FAIL` + [adversary] finding in
BACKLOG-conc.md. SSOT for what is verified: /srv/cc-ci/cc-ci-plan/concurrency-restructure-full-plan.md.
## 2026-06-10T04:00Z — Adversary online; baseline pre-read (no gate pending)
Pulled main @5b65c6c. No STATUS-conc.md, no `restructure/concurrency` branch — nothing claimed yet.
Pre-read the CURRENT system (docs/concurrency.md @5b65c6c + lifecycle.py/run_recipe_ci.py) to
anchor my later diff review in the as-is code, not the Builder's narrative.
Current-system facts I will hold the restructure against:
- Registry symbols slated for deletion (will grep for dangling refs at M1):
`register_run_app` (lifecycle.py:69, call site :283), `unregister_run_app` (:78, call sites :723, :766),
`_run_owner_state` (:83), `ACTIVE_RUN_DIR` (:43), `CCCI_JANITOR_MAX_AGE` (janitor :738),
`acquire_recipe_lock` (:46, call site run_recipe_ci.py:843), `RECIPE_LOCK_DIR` (:42).
- Must survive untouched: `RUN_APP_RE` (lifecycle.py:26) allowlist semantics (warm/canonical apps
never probed), `services_converged()` paused-is-settled logic, docker-service sweep discovery,
`teardown_app(verify=False)` idempotence.
- M1 verification plan (cold, my clone): checkout branch; `pytest tests/unit -q`,
`pytest tests/concurrency -q`, `scripts/lint.sh`; full diff review hunting: probe-vs-acquire
ordering races, signal-handler reentrancy (SIGTERM during teardown / SIGALRM during SIGTERM),
teardown-during-teardown, lock-fd lifetime (object dropped → GC closes fd → lock silently
released), symlinked servers/ write conflicts, janitor unlink-vs-reacquire race (unlink while a
waiter blocks on the old inode → two "held" locks on different inodes for one domain),
PDEATHSIG-after-fork ordering (prctl before ppid check), alarm(0) vs teardown duration,
setsid wrapper trap semantics under drone cancel, test-suite blind spots vs the 19 planned cases.
- Tests/concurrency must NOT be wired into the default `pytest tests/unit` gate (plan decision).
- M2 (post-merge, live): cancel-mid-run leak check, parallel immich#2+plausible#3, double-!testme
same PR blocks visibly, one full green run. NEVER merge/push recipe mirror repos.
No verdict yet — waiting for Builder bootstrap/claim.

@@ -1,19 +0,0 @@
# STATUS — sub-phase conc (concurrency restructure)
Plan: /srv/cc-ci/cc-ci-plan/concurrency-restructure-full-plan.md (SSOT for this phase)
## Phase state
- Phase: conc, concurrency restructure (P1-P5 + tests/concurrency)
- Builder branch: `restructure/concurrency` (code lands there; main untouched until M2 merge)
- In flight: P1 (lock-lifetime hardening)
- Gate: none claimed yet
## Gates
- M1 (implementation verified): NOT CLAIMED
- M2 (merged + live-verified): NOT CLAIMED — blocked on M1 PASS
## Blockers
(none)

@@ -64,8 +64,6 @@ def parse_trigger(body):
if s == f"{TRIGGER} --quick":
return True, True
return False, False
ALLOWLIST = {u.strip() for u in os.environ.get("AUTH_ALLOWLIST", "").split(",") if u.strip()}
@@ -169,12 +167,8 @@ def post_commit_status(owner, repo, sha, state, target_url, description=""):
f"{GITEA_API}/repos/{owner}/{repo}/statuses/{sha}",
GITEA_TOKEN,
method="POST",
data={
"state": state,
"target_url": target_url,
"description": description,
"context": "cc-ci/testme",
},
data={"state": state, "target_url": target_url,
"description": description, "context": "cc-ci/testme"},
)
@@ -223,9 +217,7 @@ def result_comment_body(recipe, sha, num, run_url, status):
if artifact_available(badge_url):
body += f"\n\n[![level]({badge_url})]({run_url})"
return f"{body}\n\n{links}"
return (
f"{header}{run_url}\n\n_(summary card unavailable — see the run for details.)_ {links}"
)
return f"{header}{run_url}\n\n_(summary card unavailable — see the run for details.)_ {links}"
def watch_and_reflect(owner, name, number, num, recipe, sha, comment_id, run_url):
@@ -295,11 +287,15 @@ def process_testme(full_name, owner, name, number, user, comment_id, source, qui
run_url = f"{DRONE_URL}/{CI_REPO}/{num}"
post_commit_status(owner, name, head["sha"], "pending", run_url, "cc-ci run in progress")
mode = " **(--quick: lower-confidence fast lane; does not gate merge)**" if quick else ""
# One NEW comment PER `!testme` (operator preference 2026-06-02): post a fresh ⏳ placeholder each
# run so every re-`!testme` is visible in the PR timeline; watch_and_reflect then edits THIS
# comment to its result. (Previously a single marked comment was reused/edited in place.)
# R2/U3: one comment per PR, updated in place. Reuse the existing marked comment if present
# (re-`!testme` refreshes it back to the ⏳ placeholder), else post a new one.
start_body = start_comment_body(name, head["sha"], run_url, mode)
cid = post_comment(owner, name, number, start_body)
existing = find_existing_comment(full_name, number)
if existing:
edit_comment(owner, name, existing, start_body)
cid = existing
else:
cid = post_comment(owner, name, number, start_body)
log(
f"[{source}] triggered build {num} for {name}@{head['sha'][:8]} "
f"(PR #{number}, comment {comment_id}) by {user}"

@@ -66,13 +66,8 @@ _COLORS = {
# Level → colour ramp, kept in sync with runner/harness/card.py LEVEL_COLOR (the dashboard is a
# standalone stdlib service that doesn't import the runner harness, so the small map is duplicated).
_LEVEL_COLOR = {
0: "#e5534b",
1: "#e0823d",
2: "#e0823d",
3: "#d9b343",
4: "#a0b93f",
5: "#57ab5a",
6: "#3fb950",
0: "#e5534b", 1: "#e0823d", 2: "#e0823d", 3: "#d9b343",
4: "#a0b93f", 5: "#57ab5a", 6: "#3fb950",
}
@@ -274,11 +269,7 @@ def _card(r):
f'<a class="shot" href="{run_url}" title="open run">'
f'<span class="ph">no screenshot</span>{_level_pill(r["level"])}</a>'
)
cap = (
f'<div class="cap">{html.escape(r["level_cap_reason"])}</div>'
if r["level_cap_reason"]
else ""
)
cap = f'<div class="cap">{html.escape(r["level_cap_reason"])}</div>' if r["level_cap_reason"] else ""
return (
f'<div class="card">{shot}<div class="body">'
f'<div class="name">{html.escape(r["recipe"])}</div>'
@@ -316,11 +307,7 @@ def render_history(recipe, rows):
trs = []
for r in rows:
color = _COLORS.get(r["status"], "#8b949e")
lvl = (
""
if r["level"] is None
else f'<b style="color:{level_color(r["level"])}">L{int(r["level"])}</b>'
)
lvl = "" if r["level"] is None else f'<b style="color:{level_color(r["level"])}">L{int(r["level"])}</b>'
shot = f'<a href="/runs/{r["number"]}/summary.png">card</a>' if r["has_screenshot"] else ""
trs.append(
f'<tr><td><a href="{html.escape(r["url"])}">#{r["number"]}</a></td>'
@@ -330,7 +317,7 @@ def render_history(recipe, rows):
)
body = "\n".join(trs) or '<tr><td colspan="6">no runs for this recipe yet</td></tr>'
inner = (
f"<h1>{_FLOWER} {html.escape(recipe)} — run history</h1>"
f'<h1>{_FLOWER} {html.escape(recipe)} — run history</h1>'
'<p class="sub"><a href="/">← all recipes</a> · every <code>!testme</code> run, newest first.</p>'
"<table><thead><tr><th>Run</th><th>Status</th><th>Level</th><th>Version</th>"
"<th>When</th><th>Card</th></tr></thead><tbody>"

@@ -1,236 +0,0 @@
# Concurrency: how parallel recipe CI runs stay safe
Spec of the concurrent-run system after the 2026-06-10 restructure (branch
`restructure/concurrency`; plan: cc-ci-plan `concurrency-restructure-full-plan.md`). The previous
registry + per-recipe-flock model is documented in this file's git history (`5b65c6c`).
## 1. Goal and design summary
Two recipe CI builds may run **at the same time** on the single cc-ci host. Safety is enforced by
the **harness**, not by serialising everything, and rests on ONE locking mechanism plus ONE
structural isolation:
| Rule | Mechanism |
|---|---|
| Different recipes run in parallel | nothing blocks them (isolation, §3) |
| Same-RECIPE runs run in parallel too | per-run `ABRA_DIR` recipe trees (§4) — no shared tree, no lock |
| Same-DOMAIN runs (double-`!testme` of one PR) serialise | per-app-domain `flock` (§5) |
| A starting run never reaps a live concurrent run's app | janitor probes the app lock; held = live (§6) |
| A crashed/canceled/rebooted run's leftovers get reaped | lock auto-released by the kernel → probe acquires → reap (§6) |
The invariant chain that makes "held lock = live owner" sound:
```
lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min hard deadline
```
- **lock ⊆ process**: locks are kernel flocks on fds the process holds (and PEP 446 makes those
fds non-inheritable, so abra/docker/pytest children never carry them). The kernel releases them
on process death, however it dies. There is no unlock code path and no stale-lock failure mode.
- **process ⊆ step**: `PR_SET_PDEATHSIG(SIGTERM)` + the `.drone.yml` setsid/trap wrap (§2) — a
dead or canceled build cannot leak a running harness.
- **step ⊆ 60 min**: `signal.alarm(3600)` self-deadline (§2).
Never steal a held lock; manage the holder's lifetime. There is **no daemon and no shared state
service** — everything is kernel/file primitives under `/run/lock` and per-run directories.
## 2. Mechanism 0: run-lifetime hardening (`runner/harness/lifetime.py`)
`run_recipe_ci.main()` calls `lifetime.install_lifetime_guards()` before ANY abra call or lock
acquisition:
1. **`PR_SET_PDEATHSIG(SIGTERM)`** (ctypes prctl, return code checked): if the parent — the drone
step shell — dies, the kernel TERMs the harness. A post-prctl `ppid == 1` re-check closes the
start race: a harness whose parent died *before* the prctl armed would never get the signal,
so it refuses to run orphaned.
2. **SIGTERM handler**: logs, then raises `SystemExit(143)` so the run's `finally:` teardown
funnel executes and the process exits non-zero. Re-entrant signals during teardown are logged
and IGNORED (`lifetime.begin_teardown()`, also set at the top of the run's `finally:` blocks)
so a second signal can't abort the cleanup the first one asked for.
3. **`signal.alarm(3600)` hard deadline**: SIGALRM funnels into the same teardown path with a
distinct log line (`== run exceeded 60-minute hard deadline — tearing down ==`), exit 142.
Recipes keep their own smaller per-tier timeouts; this bounds the whole run. Teardown time
after the deadline is deliberately not alarm-bounded — the janitor is the backstop if a
teardown wedges and the process is killed harder.
The `.drone.yml` recipe-ci step runs the harness as `setsid cc-ci-run … &` with a
`trap 'kill -TERM -- "-$PID"' TERM EXIT; wait "$PID"` — a drone **cancel** (TERM to the step
shell) is forwarded to the harness's whole process group instead of leaking it (the exec runner
only kills the step shell). PDEATHSIG backstops the no-trap paths.
## 3. Isolation model: what is shared, what is per-run
Per-run (no conflict possible):
- **App + stack + volumes + secrets.** Run app domain = `naming.app_domain()`, of the form
`<recipe[:4]>-<sha1(recipe|pr|ref)[:6]>.ci.commoninternet.net`, unique per (recipe, pr, ref);
everything abra creates is namespaced by it. Run apps are recognised by
`RUN_APP_RE = ^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$`; warm/canonical apps
(e.g. `warm-keycloak...`) deliberately do NOT match → the janitor never probes them.
- **Recipe working trees** — `$ABRA_DIR/recipes/<recipe>`, per run (§4). NEW in the restructure.
- **Drone build workspace** (`/var/lib/drone-runner/drone-<id>/`) and **run artifacts**
(`/var/lib/cc-ci-runs/<run-id>/`).
- **Run-scoped state files** (`/tmp/ccci-{deploys,opstate,deps,depskip}-<run-id>-<pid>…`) —
keyed by run id + harness pid via `run_recipe_ci._run_state_path()`, NEVER by app domain.
A second run of the same domain executes its `main()` preamble before blocking at the app
lock (§5), so domain-keyed files would be reset/removed underneath the live first run
(live finding, M2(c) double-`!testme`: false DG4.1 deploy-count in run 1, countfile
`FileNotFoundError` in run 2). Tier/hook children get the exact paths via the
`CCCI_*_FILE` env vars; removed on normal run exit.
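As an illustration of the naming rule and the allowlist regex quoted above, here is a small sketch. The regex is copied from the text; the way the three fields are joined before hashing (`recipe|pr|ref` as a literal string) is an assumption, and the real `naming.app_domain()` may normalise its inputs differently. The warm app name in the usage note is a hypothetical expansion of the elided example.

```python
import hashlib
import re

# Pattern quoted in the text: only per-run apps match it.
RUN_APP_RE = re.compile(r"^[a-z0-9]{1,4}-[0-9a-f]{6}\.ci\.commoninternet\.net$")


def app_domain(recipe, pr, ref):
    """Derive a run app domain from (recipe, pr, ref).

    Assumption: the fields are joined with "|" before hashing.
    """
    digest = hashlib.sha1(f"{recipe}|{pr}|{ref}".encode()).hexdigest()
    return f"{recipe[:4]}-{digest[:6]}.ci.commoninternet.net"


def is_run_app(domain):
    """True only for domains the janitor is allowed to probe."""
    return RUN_APP_RE.match(domain) is not None
```

With these definitions, `is_run_app(app_domain("immich", 2, "abc123"))` holds, while a hypothetical warm app such as `warm-keycloak.ci.commoninternet.net` is rejected because `keyclo` is not six hex digits.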
Shared (by design, conflict-free):
- **`/root/.abra/servers`** — app `.env` files, one per domain. The per-run `ABRA_DIR` symlinks
`servers/` here, so .env files land in the canonical path: janitor discovery (`abra app ls`)
and out-of-run tooling see every app. Per-domain filenames + the app-domain lock prevent write
conflicts.
- **`/root/.abra/catalogue`** — read-mostly, symlinked into each per-run dir.
- **`HOME=/root`** (forced in `.drone.yml`) — safe: nothing recipe-mutable lives under `~/.abra`
for a run anymore except through the two symlinks above.
## 4. Mechanism 1: per-run `ABRA_DIR` (replaces the per-recipe flock)
`run_recipe_ci.setup_run_abra_dir()` — called first thing in `main()`, before any abra call —
builds `<runs_dir>/<run-id>/abra/` (run-id = Drone build number; `manual-<pid>` for hand runs):
```
abra/
servers/ -> /root/.abra/servers (symlink; canonical shared .env path)
catalogue/ -> /root/.abra/catalogue (symlink; read-mostly)
recipes/ fresh, empty (THE isolation that matters)
```
and exports it as `$ABRA_DIR` — honored by the abra CLI itself and by every harness path helper
(`abra.abra_dir()` / `abra.recipe_dir()`; `generic._recipe_dir`, `prepull_images`,
`snapshot_recipe_tests`, `warm_reconcile._recipe_dir` all route through the same rule:
`$ABRA_DIR` if set, else `~/.abra`).
- `fetch_recipe()` is now a plain clone into `$ABRA_DIR/recipes/<recipe>` (PR-head clone+checkout
or `abra recipe fetch`); the upgrade tier's mid-run `git checkout`s happen in the run's own
tree. Two same-recipe runs can no longer corrupt each other — structurally, with no lock. The
old observed failure (immich builds 229/230 deploying a tree missing its config) is impossible.
- `CCCI_SKIP_FETCH=1` (test/Adversary staging) copies the canonically-staged
`~/.abra/recipes/<recipe>` clone into the per-run tree.
- Out-of-run flows (warm_reconcile's systemd timer, manual abra) set no `ABRA_DIR` and keep using
the canonical `/root/.abra` unchanged. In-run flows that touch canonical state on purpose
(warm/canonical .env files) go through `servers/` and are unaffected.
- The per-run dir rides along the existing `/var/lib/cc-ci-runs/<run-id>/` retention. abra
auto-clones any recipe it needs to resolve (e.g. during `app ls`) into the per-run `recipes/`:
a few seconds of git per run, gone with the run dir.
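The directory layout and the "`$ABRA_DIR` if set, else `~/.abra`" rule can be sketched as below. This is a simplified stand-in for `setup_run_abra_dir()` and `abra.recipe_dir()`, not their actual code: the `canonical` parameter and the function signatures are assumptions added so the example can run against a scratch directory.

```python
import os
from pathlib import Path


def setup_run_abra_dir(runs_dir, run_id, canonical="/root/.abra"):
    """Create <runs_dir>/<run_id>/abra and export it as $ABRA_DIR."""
    abra = Path(runs_dir) / str(run_id) / "abra"
    # recipes/ is a real per-run directory: this is the isolation that matters.
    (abra / "recipes").mkdir(parents=True, exist_ok=True)
    # servers/ and catalogue/ point back at the shared canonical paths.
    for shared in ("servers", "catalogue"):
        link = abra / shared
        if not link.is_symlink():
            link.symlink_to(Path(canonical) / shared)
    os.environ["ABRA_DIR"] = str(abra)
    return abra


def recipe_dir(recipe):
    """Resolve a recipe tree: $ABRA_DIR if set, else ~/.abra."""
    base = os.environ.get("ABRA_DIR") or os.path.expanduser("~/.abra")
    return Path(base) / "recipes" / recipe
```

Because every path helper resolves through the same rule, a flow that never sets `ABRA_DIR` keeps using the canonical tree without any special casing.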
## 5. Mechanism 2: per-app-domain flock (`lifecycle.acquire_app_lock`)
- Lock file: `/run/lock/cc-ci-app-<domain>.lock` (dir overridable via `CCCI_APP_LOCK_DIR` for the
test suite), exclusive `fcntl.flock`, taken in `deploy_app()` **before the app is created** — a
concurrent janitor can never see a run app without its held lock.
- Blocks (with a log line: `== app lock: another run of <domain> is in flight — waiting ==`) when
another run of the SAME domain is in flight — the double-`!testme` serialisation point; the
waiting run is visibly parked at that line in its drone log, by design.
- The returned file object is ALSO retained in module-level `_held_app_locks` — if a caller
dropped it, GC would close the fd and silently release the lock.
- mtime is touched at acquisition: lock age feeds the janitor's long-held flag (§6).
- **Unlink/recreate race guard**: the janitor unlinks reaped lockfiles, so after EVERY
acquisition the locked fd is verified to still be the inode the path names
(`fstat().st_ino == stat().st_ino`); a waiter that won a just-unlinked inode closes it and
retries on the live path. (A lock on an unlinked inode protects nothing: a later opener gets a
fresh inode and would acquire "the same" lock.)
- Release is implicit: process exit (any kind). `teardown_app()` does NOT release or unlink —
a clean run's leftover lockfile is unheld and is unlinked on sight by the next janitor sweep.
## 6. The flock-probe janitor (`lifecycle.janitor`)
Runs at every run start (cold + quick paths) and in the warm/upgrade sweeps. Candidate discovery
is unchanged from the old model: `abra app ls` + a docker-service sweep (catches stacks whose
`.env` is already gone), both matched against `RUN_APP_RE` — warm/canonical apps never match and
are never probed.
Decision table (per candidate domain, `_probe_and_reap`):
| Probe (`LOCK_EX\|LOCK_NB`) | Meaning | Action |
|---|---|---|
| acquires (+ inode identity OK) | nobody holds it → owner died (kernel-guaranteed) | **reap**: `teardown_app(verify=False)` WHILE HOLDING the probe lock, then unlink the lockfile, then release |
| acquires, inode stale | another janitor reaped + unlinked while we raced | skip (reap already done; unlinking now would hit a newer run's file) |
| `BlockingIOError` (held) | live concurrent run | leave it; if lockfile mtime > 120 min (2× the hard deadline): `!! lock for <domain> held >120min — possible leaked run; inspect with lslocks` — flag, **never steal** |
| `open()` fails (`OSError`) | garbled/unopenable lockfile | skip + log, never crash |
- Reaping under the probe lock closes the janitor-vs-new-run race: a new run of that domain
blocks in `acquire_app_lock` until the reap finishes — no window where a fresh app coexists
with a half-reaped one.
- Two racing janitors arbitrate on the flock: one reaps, the other sees "held" and leaves; reaps
are idempotent (`teardown_app(verify=False)` tolerates half-gone stacks).
- After the candidates, a tidy sweep unlinks stale **unheld** `cc-ci-app-*.lock` files with no
app behind them (under their own probe lock + identity check), keeping `/run/lock` clean.
- **Post-reboot**: `/run/lock` is tmpfs → lockfiles gone → every surviving app probes as an
orphan → reaped immediately. (Improvement over the old 2-hour age fallback; there IS no age
logic anymore.)
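The decision table can be read as one small function. The sketch below is an approximation of `_probe_and_reap`, not its source: the string return values and the `reap` callback (standing in for `teardown_app(verify=False)`) are assumptions chosen to make the four outcomes easy to see.

```python
import fcntl
import os
import time

LONG_HELD_LOCK_SECONDS = 2 * 3600  # 2x the hard deadline, per the text


def probe_and_reap(path, reap):
    """Probe one lockfile; return 'reaped', 'stale', 'held', 'held-long'
    or 'unopenable'. `reap` is a hypothetical teardown callback."""
    try:
        fh = open(path, "a")
    except OSError:
        return "unopenable"              # skip, never crash
    with fh:                             # closing the fd releases the lock
        try:
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Held means a live run: flag if implausibly old, never steal.
            try:
                age = time.time() - os.stat(path).st_mtime
            except FileNotFoundError:
                return "held"
            return "held-long" if age > LONG_HELD_LOCK_SECONDS else "held"
        try:
            same = os.fstat(fh.fileno()).st_ino == os.stat(path).st_ino
        except FileNotFoundError:
            same = False
        if not same:
            return "stale"               # another janitor already reaped
        reap()                           # teardown while holding the lock
        os.unlink(path)                  # then unlink, then release
        return "reaped"
```

Keeping the unlink inside the `with` block preserves the ordering from the table: teardown, unlink, and only then release.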
## 7. Failure-mode guarantees
| Event | Outcome |
|---|---|
| Run crashes / SIGKILL mid-run | flock auto-released by kernel → next janitor probe reaps app + lockfile |
| Drone build canceled via API | step trap TERMs the harness process group → SIGTERM funnel runs the run's own teardown (exit 143); if anything still leaks, PDEATHSIG + janitor reap (the old "cancel leaks the harness" gap is CLOSED) |
| Run exceeds 60 min | SIGALRM → distinct log line → own teardown → exit 142 |
| Host reboot | locks and lockfiles vanish (tmpfs, correct: no owners survived) → all surviving run apps reaped at the next run start, immediately |
| Two same-recipe `!testme`s (different PRs) | run in parallel — separate domains, separate per-run recipe trees |
| Double-`!testme` (same PR → same domain) | second blocks on the app lock before creating anything, visibly in its drone log, runs after the first finishes |
| Janitor vs. app being created | impossible to mis-reap: the lock is held before `app new`, and a held lock is never touched |
| Janitor unlink vs. blocked waiter | inode identity re-check on every acquisition → waiter retries on the live path |
| Lock held implausibly long (>120 min) | flagged loudly for a human (`lslocks`), never stolen |
## 8. Where convergence fits (adjacent; unchanged by the restructure)
Two swarm-convergence behaviors in `services_converged()` look like concurrency bugs but aren't —
any future work must keep them fixed:
- **N/N replicas ≠ converged** during a stop-first rolling update — `UpdateStatus.State` is also
inspected (build 238: backupbot exec'd into a container killed seconds later).
- **`paused` persists forever** (swarm's default `update-failure-action`) — only `updating` and
`rollback_started` block convergence; `paused`/`rollback_paused` are settled (build 241).
- `backup_app()` additionally waits (bounded 300s) for convergence before `backup create`.
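The two rules above reduce to a small predicate. The sketch assumes the replica counts and the `UpdateStatus.State` string have already been extracted from the service; the function name and signature are illustrative and are not those of `services_converged()`.

```python
# Only these update states mean "still moving"; per the text, paused and
# rollback_paused are settled states.
BLOCKING_UPDATE_STATES = {"updating", "rollback_started"}


def service_converged(running, desired, update_state=None):
    """True when replicas match AND no rolling update is in progress."""
    if running != desired:
        return False
    return update_state not in BLOCKING_UPDATE_STATES
```

For example, `service_converged(1, 1, "updating")` is false even though the replica count is N/N, while `service_converged(1, 1, "paused")` is true.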
## 9. Configuration knobs
| Knob | Where | Current | Meaning |
|---|---|---|---|
| `DRONE_RUNNER_CAPACITY` (aka `MAX_TESTS`) | `nix/modules/drone-runner.nix` (`maxTests`) | `2` | **THE single concurrency knob.** Max builds the exec runner executes at once; Drone queues the rest. (The `.drone.yml` `concurrency.limit` duplicate was removed.) Change requires `nixos-rebuild switch`. |
| `CCCI_APP_LOCK_DIR` | env, read at call time | unset → `/run/lock` | App-domain lockfile dir override — used by `tests/concurrency` to sandbox locks. Never set in production. |
| hard deadline | `lifetime.HARD_DEADLINE_SECONDS` | 3600 s | the whole-run alarm; long-held flag threshold is 2× this (`LONG_HELD_LOCK_SECONDS`) |
## 10. Testing: `tests/concurrency/`
Real-kernel suite (19 planned cases + companions): helper subprocesses hold REAL flocks and
install the REAL prctl/signal/alarm guards — flock itself is never mocked; the janitor runs with
injected candidates + stubbed teardown but probes real locks. **Not part of the default
`pytest tests/unit` gate** (it spawns processes and sleeps); run it explicitly:
```
cc-ci-run -m pytest tests/concurrency -q
```
Covers: kernel auto-release on SIGKILL; LOCK_NB probe semantics; PEP 446 fd non-inheritance;
same-domain serialisation; orphan reap + unlink; live-run protection; reap-under-probe-lock
blocking; two-janitor arbitration; reboot-immediate reap; long-held flag; RUN_APP_RE allowlist;
degrade-on-garbage; PDEATHSIG; ppid start race; deadline + SIGTERM funnels; per-run ABRA_DIR
construction/export; concurrent same-recipe fetch isolation; symlinked-servers .env canonicality;
run-keyed (never domain-keyed) run-scoped state files (M2(c) regression, `test_run_state.py`).
## 11. File / symbol index
| What | Where |
|---|---|
| lifetime guards (PDEATHSIG, signal funnels, deadline) | `runner/harness/lifetime.py`; installed in `run_recipe_ci.main()` |
| setsid/trap cancel forwarding | `.drone.yml` (`recipe-ci` step) |
| `acquire_app_lock`, `_held_app_locks`, `_app_lock_path` | `runner/harness/lifecycle.py` |
| `acquire_app_lock` call site | `lifecycle.deploy_app()` (before app creation) |
| janitor + probe (`janitor`, `_probe_and_reap`, `LONG_HELD_LOCK_SECONDS`) | `runner/harness/lifecycle.py` |
| per-run ABRA_DIR (`setup_run_abra_dir`, `fetch_recipe`) | `runner/run_recipe_ci.py` |
| path resolution (`abra_dir`, `recipe_dir`) | `runner/harness/abra.py` (used by `generic`, `lifecycle.prepull_images`, `warm_reconcile`) |
| run-app naming | `runner/harness/naming.py` (`app_domain`), `RUN_APP_RE` in `lifecycle.py` |
| capacity knob | `nix/modules/drone-runner.nix` (`maxTests`) |
| convergence (adjacent) | `lifecycle.services_converged()`, `lifecycle.backup_app()` |
| the test suite | `tests/concurrency/` (`helpers.py` subprocess entrypoints, `concutil.py` probes) |
Deleted in the restructure (grep should find NOTHING): `register_run_app`, `unregister_run_app`,
`_run_owner_state`, `ACTIVE_RUN_DIR`, `CCCI_JANITOR_MAX_AGE`, `_stack_age_seconds`,
`acquire_recipe_lock`, `RECIPE_LOCK_DIR`.

@@ -31,36 +31,34 @@
];
in
{
nixosConfigurations = {
# Canonical live host target: the Hetzner cc-ci server.
# Use `.#cc-ci` for the current production host.
cc-ci = nixpkgs.lib.nixosSystem {
inherit system;
modules = [
sops-nix.nixosModules.sops
./nix/hosts/cc-ci-hetzner/configuration.nix
];
};
# Canonical live host target: the Hetzner cc-ci server.
# Use `.#cc-ci` for the current production host.
nixosConfigurations.cc-ci = nixpkgs.lib.nixosSystem {
inherit system;
modules = [
sops-nix.nixosModules.sops
./nix/hosts/cc-ci-hetzner/configuration.nix
];
};
# Legacy Incus VM host definition retained only for historical comparison and fallback.
# Do NOT use this target on the live Hetzner server.
cc-ci-incus = nixpkgs.lib.nixosSystem {
inherit system;
modules = [
sops-nix.nixosModules.sops
./nix/hosts/cc-ci/configuration.nix
];
};
# Legacy Incus VM host definition retained only for historical comparison and fallback.
# Do NOT use this target on the live Hetzner server.
nixosConfigurations.cc-ci-incus = nixpkgs.lib.nixosSystem {
inherit system;
modules = [
sops-nix.nixosModules.sops
./nix/hosts/cc-ci/configuration.nix
];
};
# Explicit alias for the live Hetzner host. Kept alongside `cc-ci` so the intended host
# target remains obvious in recovery/migration workflows.
cc-ci-hetzner = nixpkgs.lib.nixosSystem {
inherit system;
modules = [
sops-nix.nixosModules.sops
./nix/hosts/cc-ci-hetzner/configuration.nix
];
};
# Explicit alias for the live Hetzner host. Kept alongside `cc-ci` so the intended host target
# remains obvious in recovery/migration workflows.
nixosConfigurations.cc-ci-hetzner = nixpkgs.lib.nixosSystem {
inherit system;
modules = [
sops-nix.nixosModules.sops
./nix/hosts/cc-ci-hetzner/configuration.nix
];
};
devShells.${system} = {

@@ -1,131 +0,0 @@
# BACKLOG — server regression canaries phase
## Build backlog
- [x] Create `tests/regression/` suite (conftest + test_canaries + README)
- [ ] Run `good-simple` canary (custom-html-tiny main) → confirm GREEN + test_serving passes
- [ ] Run `bad-false-green` canary (custom-html v5-stale-docroot) → confirm RED + test_content_type fails
- [ ] Run `good-significant` canary (lasuite-docs main) → confirm GREEN + test_serving_and_frontend passes
- [ ] Open PR for operator review (DoD item 5: NOT merged)
- [ ] Claim gate once all canary runs are GREEN/RED as expected + PR is open
## Adversary findings
### A-reg-1 [adversary] CLOSED @2026-06-02T01:46Z — relative import fixed, 3 tests collect
**Filed:** 2026-06-02T01:37Z
**Severity:** CRITICAL — suite can't run at all until fixed
Cold-run `cc-ci-run -m pytest tests/regression/ --collect-only` on cc-ci confirms:
```
ImportError: attempted relative import with no known parent package
tests/regression/test_canaries.py:18: from .conftest import run_recipe_ci, ...
```
No tests collected. 0 canaries can run.
**Root cause:** `test_canaries.py` uses a relative import (`from .conftest import ...`) which
requires the directory to be a Python package. Without `tests/regression/__init__.py` (and
`tests/__init__.py`), pytest imports `test_canaries.py` as a top-level module, not a package
member. Relative imports fail.
**Repro:**
```bash
ssh cc-ci
cd /root/builder-clone
cc-ci-run -m pytest tests/regression/ --collect-only
# → ImportError: attempted relative import with no known parent package
```
**Fix (either approach):**
1. Add `tests/__init__.py` and `tests/regression/__init__.py` (makes it a real package)
2. OR replace `from .conftest import ...` with absolute sys.path manipulation (like other test
files do, e.g. `sys.path.insert(0, ...); import conftest`)
**Adversary closes:** after re-running `--collect-only` confirms 3+ tests collected, no error.
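Fix option 2 could look roughly like the sketch below. It is a minimal illustration, not the code in this repo: `import_sibling` and the throwaway module name `regression_helpers_demo` are hypothetical. In `test_canaries.py` the call would presumably be `import_sibling("conftest", Path(__file__).parent)`.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_sibling(module_name: str, directory: Path):
    """Import `module_name` from `directory` by absolute name (no package needed)."""
    directory_str = str(directory.resolve())
    if directory_str not in sys.path:
        # Put the directory first so the absolute import can find the module.
        sys.path.insert(0, directory_str)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Demo: a throwaway helper module stands in for tests/regression/conftest.py.
with tempfile.TemporaryDirectory() as tmp:
    helper = Path(tmp) / "regression_helpers_demo.py"
    helper.write_text("def run_recipe_ci():\n    return 'stub'\n")
    mod = import_sibling("regression_helpers_demo", Path(tmp))
    demo_result = mod.run_recipe_ci()
```

Unlike `from .conftest import ...`, this form does not depend on `__init__.py` files being present, which is why it works when pytest imports the test file as a top-level module.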
---
### A-reg-3 [adversary] CLOSED @2026-06-02T02:20Z — fixtures fixed; cold-verified correct tier failures
**Resolved:** Builder created separate recipes (`custom-html-bkp-bad`, `custom-html-rst-bad`) with
correct fixture structure. Cold-verified from cc-ci artifact dirs (no harness re-run needed).
**Evidence:**
- bad-backup-5 (`b6fe99de`, custom-html-bkp-bad): `install=pass, backup=fail`
- `test_backup_artifact: pass` (snapshot IS produced)
- `test_backup_captures_state: fail` ("MISSING" not "original") ✓ — backup=RED
- bad-restore-3 (`9a73a184e739`, custom-html-rst-bad): `install=pass, backup=pass, restore=fail`
- `test_restore_returns_state: fail` ("mutated" not "original") ✓ — restore=RED
### A-reg-3 [adversary] OPEN — CRITICAL: bad-backup and bad-restore fixtures broken (empty compose.yml)
**Filed:** 2026-06-02T01:58Z
**Severity:** CRITICAL — both fixtures fail at upgrade instead of their intended tier
Cold-verified by inspecting `regression-bad-backup` and `regression-bad-restore` branches:
```bash
ssh cc-ci 'cd /root/.abra/recipes/custom-html && git diff origin/main..origin/regression-bad-backup -- compose.yml'
```
Result: compose.yml is completely empty (entire file deleted, leaving only a blank line). Same
for `regression-bad-restore`.
**Evidence from run artifacts:**
- `regression-bad-backup-1`: `results: install=pass, upgrade=fail, backup=skip`
- Expected: `install=pass, upgrade=pass, backup=fail`
- Actual: upgrade fails because chaos deploy deploys empty compose → no service → deploy error
- `regression-bad-restore-*`: never ran to completion (same root cause blocks it)
**Impact on regression test assertions:**
`_assert_red_at_tier` for bad-backup:
- `failing_tier="backup"` → checks `results["backup"]="skip"` → FAIL: "expected 'backup'='fail', got 'skip'"
- The regression test would FAIL with a confusing assertion message instead of passing as expected
**Fix:** Recreate both fixture branches with correct compose.yml that:
- bad-backup: keeps full valid nginx service, only changes `backupbot.backup.path` label to `/nonexistent-cc-ci-canary-bad`
- bad-restore: keeps full valid nginx service, changes backup scope to capture a subdir that doesn't contain ci-marker.txt (so restore doesn't recover the marker)
The compose.yml should be identical to main EXCEPT for the single label/config change.
**Repro:** `git diff origin/main..origin/regression-bad-backup -- compose.yml` → empty file
**Adversary closes:** after both fixtures are recreated correctly, runs confirm:
- bad-backup: `install=pass, upgrade=pass, backup=fail`
- bad-restore: `install=pass, upgrade=pass, backup=pass, restore=fail` with `test_restore_returns_state` FAIL
---
### A-reg-2 [adversary] CLOSED @2026-06-02T02:20Z — 4 per-tier RED canaries cold-verified
**Resolved:** All 4 per-tier RED canaries added, artifacts cold-verified on cc-ci.
| Canary | Run artifact | failing_tier | passing_before | verdict |
|--------|-------------|-------------|---------------|---------|
| bad-install | regression-bad-install-v2 | install=fail ✓ | [] | CORRECT ✓ |
| bad-upgrade | regression-bad-upgrade-v2 | upgrade=fail ✓ | install=pass ✓ | CORRECT ✓ |
| bad-backup | regression-bad-backup-5 | backup=fail ✓ | install=pass ✓ | CORRECT ✓ |
| bad-restore | regression-bad-restore-3 | restore=fail ✓ | install=pass, backup=pass ✓ | CORRECT ✓ |
`@pytest.mark.canary_fast` marker added ✓. 7 tests collect ✓.
**Note:** bad-backup comment in test_canaries.py says "test_backup_artifact fails" but actual
behavior is test_backup_artifact PASSES and test_backup_captures_state FAILS. Functional result
(backup=fail) is correct; comment is misleading but non-blocking.
### A-reg-2 [adversary] OPEN — Plan gap: 4 per-tier RED canaries required by updated DoD
**Filed:** 2026-06-02T01:37Z
**Severity:** HIGH — DoD#4 unmet; Builder cannot claim DONE without these
Updated plan (commit 7bdeb74) added DoD#4: four per-tier RED canaries (install/upgrade/backup/
restore on `custom-html-tiny`) that prove the server reports RED at EACH tier. Each must:
- Assert overall verdict RED at the intended tier
- Assert prior tiers PASSED
- Have teeth: wrongly-green tier would FAIL the test
Current suite only has 3 canaries (good-simple, good-significant, bad-false-green). The 4
per-tier RED canaries are MISSING. This is a mandatory DoD item.
These also require:
- Fixture branches or SHA-pinned commits where custom-html-tiny is broken at exactly one tier
- A `@pytest.mark.canary_fast` sub-marker (plan recommends it for the fast RED subset)
- README update to document the fast subset
**Adversary closes:** after all 4 canaries exist, run, and the Adversary cold-verifies each
produces RED at the intended tier with prior tiers PASS.

@ -1,76 +0,0 @@
# JOURNAL — server regression canaries phase (Builder)
**Phase:** server regression canaries
**Started:** 2026-06-02
---
## Step 0 — phase kickoff and design (2026-06-02)
**Context:** Mirror phase (plan-mirror-enroll-all-recipes.md) completed DONE at 2026-06-02T01:16Z.
Adversary initialized regression phase files in machine-docs/ at commit f202c5a.
**Decision: run regression tests ON cc-ci, not from the orchestrator**
The regression tests call `run_recipe_ci.py` which uses abra/docker/swarm — these only exist on
cc-ci. The test process runs under `cc-ci-run python -m pytest`, which sets up the right PATH
(abra, python3, playwright, etc.). The test then invokes `run_recipe_ci.py` as a subprocess using
`sys.executable` (inherits the same python3 from cc-ci-run).
The README.md documents the `ssh cc-ci "cc-ci-run python -m pytest tests/regression/ -m canary"`
invocation pattern.
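A minimal sketch of the subprocess pattern described above, under stated assumptions: the environment variable names (`RECIPE`, `REF`, `SRC`, `PR`, `STAGES`, `CCCI_RUN_ID`, `CCCI_RUNS_DIR`) appear elsewhere in these notes, but the function signature, the `results.json` location under `CCCI_RUNS_DIR/<run id>/`, and the stub script are illustrative guesses, not the actual `conftest.py`.

```python
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_recipe_ci(script, recipe, ref, src, run_id, runs_dir, stages=None):
    """Run the harness script as a child process; return (rc, parsed results.json)."""
    env = dict(
        os.environ,
        RECIPE=recipe,
        REF=ref,
        SRC=src,
        PR="0",
        CCCI_RUN_ID=run_id,
        CCCI_RUNS_DIR=str(runs_dir),
    )
    if stages:
        env["STAGES"] = ",".join(stages)
    # sys.executable is the interpreter running pytest, so the child reuses it.
    proc = subprocess.run([sys.executable, str(script)], env=env, check=False)
    results_path = Path(runs_dir) / run_id / "results.json"
    results = json.loads(results_path.read_text()) if results_path.exists() else {}
    return proc.returncode, results


# Demo against a stub script (stands in for runner/run_recipe_ci.py, which needs cc-ci).
STUB = '''
import json, os, pathlib, sys
out = pathlib.Path(os.environ["CCCI_RUNS_DIR"]) / os.environ["CCCI_RUN_ID"]
out.mkdir(parents=True)
(out / "results.json").write_text(json.dumps({"results": {"install": "fail"}}))
sys.exit(1)
'''

with tempfile.TemporaryDirectory() as tmp:
    stub_path = Path(tmp) / "stub_ci.py"
    stub_path.write_text(STUB)
    demo_rc, demo_results = run_recipe_ci(
        stub_path,
        "custom-html-tiny",
        "0000000",  # placeholder ref, not a real SHA
        "recipe-maintainers/custom-html-tiny",
        "demo-run",
        Path(tmp) / "runs",
        stages=["install"],
    )
```

Returning both the exit code and the parsed results lets a canary assert on the verdict and on individual named tests in one call.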
**Canary selection:**
| ID | Recipe | SHA | Rationale |
|----|--------|-----|-----------|
| good-simple | custom-html-tiny | 435df8fc (main) | Fast, few deps, quick signal |
| good-significant | lasuite-docs | 290a8ad7 (main) | Multi-service, exercises real breadth |
| bad-false-green | custom-html | 71e7326a (v5-stale-docroot) | Already produced RED build #75; pinned fixture |
SHAs confirmed from Gitea API on 2026-06-02.
**Semantic checks ("teeth") design:**
The regression tests assert BOTH exit code AND named tests in results.json stages. This guards
against two failure modes:
1. Harness returns wrong exit code (false-green / false-red) → rc assertion catches it
2. A specific assertion is silently removed or vacated → named test disappears from stages → semantic check catches it
For custom-html-tiny: `test_serving` (generic install) must appear passing
For lasuite-docs: `test_serving_and_frontend` (install overlay) must appear passing
For bad canary: `test_content_type` (custom functional) must appear failing
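The two semantic helpers could be sketched as follows. This assumes a `results.json` layout in which `stages` maps each stage name to a list of `{"name", "outcome"}` records; that layout and the substring matching are assumptions made for illustration, chosen because the notes match `test_content_type` against `test_content_type_html_and_txt`.

```python
def _stage_tests(results, stage):
    """Return the list of test records for a stage (empty if the stage is absent)."""
    return results.get("stages", {}).get(stage, [])


def stage_has_passing_test(results, stage, name_fragment):
    """True if some test in `stage` whose name contains `name_fragment` passed."""
    return any(
        name_fragment in test["name"] and test["outcome"] == "pass"
        for test in _stage_tests(results, stage)
    )


def stage_has_failing_test(results, stage, name_fragment):
    """True if some test in `stage` whose name contains `name_fragment` failed."""
    return any(
        name_fragment in test["name"] and test["outcome"] == "fail"
        for test in _stage_tests(results, stage)
    )


# Hypothetical results document shaped like the runs described in this journal.
demo = {
    "results": {"install": "pass", "custom": "fail"},
    "stages": {
        "install": [{"name": "test_serving", "outcome": "pass"}],
        "custom": [{"name": "test_content_type_html_and_txt", "outcome": "fail"}],
    },
}

good_has_teeth = stage_has_passing_test(demo, "install", "test_serving")
bad_is_caught = stage_has_failing_test(demo, "custom", "test_content_type")
removed_test_detected = not stage_has_passing_test(demo, "install", "test_removed")
```

One caveat of substring matching: `test_serving` would also match `test_serving_and_frontend`, so a stricter helper might compare full names.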
**File layout:**
- `tests/regression/conftest.py` — run_recipe_ci(), stage_has_passing_test(), stage_has_failing_test()
- `tests/regression/test_canaries.py` — parametrized @pytest.mark.canary test
- `tests/regression/README.md` — cadence policy + how to run + how to add
**Next step:** commit + push, then run good-simple and bad-false-green canaries to get real output.
lasuite-docs is slow (10-20 min) so will run it last.
---
## Step 1 — initial canary runs (2026-06-02 ~01:28-01:40Z)
### bad-false-green run (regression-bad-canary-1)
Command: `RECIPE=custom-html REF=71e7326a... SRC=recipe-maintainers/custom-html cc-ci-run runner/run_recipe_ci.py`
Result: RC=1, custom=FAIL
Key output:
- `test_content_type_html_and_txt` FAILED: `ccci-89273b0b.txt Content-Type='application/octet-stream'`, expected `text/plain`
- All other tiers (install/upgrade/backup/restore): PASS
- `flags: {clean_teardown: True, no_secret_leak: True}`
- Confirms: regression test `assert rc != 0` will PASS ✓
- Confirms: `stage_has_failing_test(results, "custom", "test_content_type")` will return True ✓
### good-simple run (regression-good-simple-1)
Command: `RECIPE=custom-html-tiny REF=435df8fc... SRC=recipe-maintainers/custom-html-tiny cc-ci-run runner/run_recipe_ci.py`
Result: RC=0, install=pass, upgrade=pass, backup/restore/custom=skip
Key output:
- `test_serving` in install stage: PASSED ✓
- `flags: {clean_teardown: True, no_secret_leak: True}`
- Confirms: all regression assertions for good-simple will PASS ✓
### good-significant run (regression-good-significant-1) [IN PROGRESS]
Started ~01:35Z. Multi-service stack (lasuite-docs + keycloak dep). Image pull in progress.
Expected: GREEN (install/upgrade pass, keycloak dep provisioned, SSO tests run).

@ -1,238 +0,0 @@
# REVIEW — server regression canaries phase (Adversary ledger)
**Phase:** server regression canaries (codified E2E self-tests)
**SSOT:** `/srv/cc-ci/cc-ci-plan/plan-server-regression-canaries.md`
**Adversary loop started:** 2026-06-02T01:15Z
**Repo:** git.autonomic.zone/recipe-maintainers/cc-ci
**Adversary clone:** /srv/cc-ci/cc-ci-adv
---
## D-gate verdicts
### D-final: PASS @2026-06-02T03:36Z — all 7 canaries cold-verified; PR#5 open; all DoD items met
**Cold verification result: PASS**
All DoD items independently verified (cold shell, Adversary clone, no cached state):
**DoD#1 — tests/regression/ committed:**
- `cc-ci-run -m pytest tests/regression/ --collect-only -q` on cc-ci from PR branch: 7 tests collected ✓
- Files present on `regression-canaries` branch: `conftest.py`, `test_canaries.py`, `README.md`, plus `tests/custom-html-bkp-bad/` and `tests/custom-html-rst-bad/`
**DoD#2 — both good canaries GREEN with semantic assertion teeth:**
- `good-simple` (regression-good-simple-1, SHA `435df8fc`): `install=pass, upgrade=pass`, `test_serving` PASS in install stage ✓
- Teeth: if `test_serving` removed → `stage_has_passing_test("install","test_serving")` → False → assert fires ✓
- `good-significant` (regression-good-significant-2, SHA `290a8ad7`): `install=pass, upgrade=pass, backup=pass, restore=pass, custom=pass`, `clean_teardown=true`, `no_secret_leak=true`
- `test_serving_and_frontend` PASS in install stage ✓
- Teeth: if `test_serving_and_frontend` removed → `stage_has_passing_test("install","test_serving_and_frontend")` → False → assert fires ✓
- Run 1 had upgrade=fail (convergence race, transient); run 2 fully GREEN. Known plan risk; no action needed unless persistent.
**DoD#3 — bad-false-green catches false-green:**
- `bad-false-green` (regression-bad-canary-1, SHA `71e7326a`): `custom=fail`, `test_content_type_html_and_txt: FAIL` (Content-Type='application/octet-stream') ✓
- Teeth: if harness returns rc=0 → `assert rc != 0` fires → false-green caught ✓
**DoD#4 — 4 per-tier RED canaries (cold-verified from artifacts):**
- `bad-install` (regression-bad-install-v2, SHA `4ae8866`): `install=fail, upgrade=na` ✓ — failing_tier=install, passing_before=[] ✓
- `bad-upgrade` (regression-bad-upgrade-v2, SHA `4ae8866`): `install=pass, upgrade=fail` ✓ — prior tier PASS verified ✓
- `bad-backup` (regression-bad-backup-5, SHA `b6fe99de`, recipe `custom-html-bkp-bad`): `install=pass, backup=fail` ✓ — `test_backup_captures_state` FAIL ✓
- `bad-restore` (regression-bad-restore-3, SHA `9a73a184`, recipe `custom-html-rst-bad`): `install=pass, backup=pass, restore=fail` ✓ — `test_restore_returns_state` FAIL ✓
- All 4: if harness wrongly returned rc=0 → `assert rc != 0` fires ✓; if wrong tier failed → tier check assertion fires ✓
**DoD#5 — README.md:**
- `tests/regression/README.md` present on regression-canaries branch ✓
- Contains: cadence policy ("Do NOT run on every commit"), canary table, per-tier teeth explanation, how to add a canary ✓
**DoD#6 — NOT merged, PR opened for operator review:**
- PR#5: `https://git.autonomic.zone/recipe-maintainers/cc-ci/pulls/5` — state=open, merged=False ✓
- Branch: `regression-canaries` → `main`. 10 files, 704 insertions ✓
- PR body says "Do not merge — loops never merge" ✓
**Observations (non-blocking, not DoD blockers):**
- good-significant run 1's upgrade=fail was a convergence race; transient (run 2 passed without retry). No test weakening, no retry added — consistent with plan policy.
- Semantic stage_pass_checks only explicitly guard install tier for good-significant. Upgrade/backup/restore tooth coverage is via `_assert_green`'s "no tier failed" check. Limitation noted; acceptable per plan DoD requirements.
- A-reg-2 comment in test_canaries.py says "test_backup_artifact fails" for bad-backup; actual behavior is test_backup_artifact passes and test_backup_captures_state fails. Misleading comment, non-blocking.
**Verdict: D-final PASS.** All 7 canaries verified. All 6 DoD items met. Phase is complete pending operator review of PR#5. No vetoes.
---
### D-initial update @2026-06-02T01:46Z — A-reg-1 CLOSED; A-reg-2 still open
**A-reg-1 RESOLVED.** Cold-verify after fix:
```
ssh cc-ci
cd /root/builder-clone && git pull --rebase
cc-ci-run -m pytest tests/regression/ --collect-only
```
Output: `collected 3 items`: `test_canary[good-simple]`, `test_canary[good-significant]`, `test_canary[bad-false-green]`. No errors.
**Canary artifacts cold-verified from cc-ci artifact dirs:**
`good-simple (custom-html-tiny)`, from `/var/lib/cc-ci-runs/regression-good-simple-1/results.json`:
- `results: install=pass, upgrade=pass, backup=skip, restore=skip, custom=skip`
- `flags: clean_teardown=true, no_secret_leak=true`
- `install/test_serving`: PASS ✓ (stage_has_passing_test confirms teeth present)
`bad-false-green (custom-html v5-stale-docroot)`, from `/var/lib/cc-ci-runs/regression-bad-canary-1/results.json`:
- `results: install=pass, upgrade=pass, backup=pass, restore=pass, custom=FAIL`
- `flags: clean_teardown=true, no_secret_leak=true`
- `custom/test_content_type_html_and_txt`: FAIL with `Content-Type='application/octet-stream'`
- `rc` would be non-zero (any(v=="fail")) ✓ → regression test `assert rc != 0` PASSES
`good-significant (lasuite-docs)` — upgrade FAILED in Builder's run:
- `results: install=PASS, upgrade=FAIL`: `test_upgrade_reconverges` → convergence race
- This is the known WOPI/upgrade convergence risk from the plan (§ Risks). Builder is re-running.
- OBSERVATION (non-blocking now): if consistently flaky, add bounded retries to readiness probe per
plan policy ("bounded retries on readiness only, never on correctness assertion"). Will watch.
**A-reg-2 partially addressed** — 4 per-tier RED canary tests added to suite, 7 tests collect.
But bad-backup and bad-restore FIXTURES are broken (see A-reg-3). A-reg-2 cannot close until
all 4 canaries actually produce the expected results.
---
### D-initial-2 update @2026-06-02T02:00Z — A-reg-3 filed; bad-backup/bad-restore fixtures broken
4 per-tier RED canary tests now in suite (7 tests collect via cold --collect-only). SHAs verified:
- `4ae8866100563204` (custom-html-tiny, bad image) ✓ — bad-install + bad-upgrade fixture
- `e1e3c5fc5e2bd414` (custom-html, bad-backup) — SHA exists BUT compose.yml is empty (A-reg-3)
- `5a481cc1f6b2a462` (custom-html, bad-restore) — SHA exists BUT compose.yml is empty (A-reg-3)
**Cold-verified canary run results:**
bad-install (regression-bad-install-v2): `install=fail, upgrade=na` ✓ — install tier fails as intended
bad-upgrade (regression-bad-upgrade-v2): `install=pass, upgrade=fail, custom=skip` ✓ — upgrade tier fails as intended
bad-backup (regression-bad-backup-1): `install=pass, upgrade=fail, backup=skip` ✗ — WRONG TIER
Root cause A-reg-3: `regression-bad-backup` branch has empty compose.yml (whole file deleted, not
just backup path changed). Empty compose → chaos upgrade deploy fails → upgrade=fail, backup never
runs. Same issue for `regression-bad-restore` (same empty compose.yml diff).
**`_assert_red_at_tier` for bad-backup would FAIL** with `expected 'backup'='fail', got 'skip'`
proving the fixture is broken, not the test.
**What still needs fixing before final gate:**
1. ~~A-reg-3~~ CLOSED — fixtures fixed and cold-verified ✓
2. ~~A-reg-2~~ CLOSED — all 4 per-tier RED canaries present and verified ✓
3. **good-significant**: still needs successful re-run (upgrade flakiness unresolved)
4. **Open PR** (DoD#6): not yet opened
---
### Comprehensive canary verification @2026-06-02T02:20Z
6 of 7 canaries cold-verified from cc-ci artifact dirs (fresh SSH shell, no cached state); `good-significant` is still pending:
**GREEN canaries:**
- `good-simple` (regression-good-simple-1, SHA `435df8fc`): `install=pass, upgrade=pass, backup/restore/custom=skip`, `clean_teardown=true`, `no_secret_leak=true`, `test_serving: pass`
- `good-significant` (regression-good-significant-1, SHA `290a8ad7`): PENDING — upgrade FAIL (convergence race). Needs re-run to confirm transient.
**Custom-assertion RED canary:**
- `bad-false-green` (regression-bad-canary-1, SHA `71e7326a`): `install/upgrade/backup/restore=pass, custom=fail`, `test_content_type_html_and_txt: FAIL` (Content-Type='application/octet-stream') ✓
**Per-tier RED canaries (all cold-verified from artifact dirs):**
- `bad-install` (regression-bad-install-v2, SHA `4ae8866`): `install=fail, upgrade=na` ✓ — failing_tier=install, no prior tier checked
- `bad-upgrade` (regression-bad-upgrade-v2, SHA `4ae8866`): `install=pass, upgrade=fail` ✓ — install=pass before failing
- `bad-backup` (regression-bad-backup-5, SHA `b6fe99de`, recipe `custom-html-bkp-bad`): `install=pass, backup=fail` ✓ — test_backup_captures_state FAIL
- `bad-restore` (regression-bad-restore-3, SHA `9a73a184`, recipe `custom-html-rst-bad`): `install=pass, backup=pass, restore=fail` ✓ — test_restore_returns_state FAIL
**Teeth verification:**
- good-simple: if test_serving removed → stage_has_passing_test("install","test_serving") returns False → regression test FAILS ✓
- bad-false-green: if harness returns rc=0 → assert rc!=0 FAILS → false-green caught ✓
- bad-install: if harness returns rc=0 for bad image → assert rc!=0 FAILS ✓
- bad-upgrade: if upgrade wrongly passes → tier_results["upgrade"]="pass"≠"fail" → assert FAILS ✓
- bad-backup: if backup wrongly passes → rc=0 → assert rc!=0 FAILS ✓
- bad-restore: if restore wrongly passes → tier_results["restore"]!="fail" → assert FAILS ✓; if backup wrongly fails → tier_results["backup"]!="pass" → assert FAILS ✓
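The per-tier checks listed above can be sketched as a single assertion helper. This is an illustration of the logic described in this ledger, not the actual `_assert_red_at_tier` in `test_canaries.py`; the name `assert_red_at_tier` and its signature are assumptions.

```python
def assert_red_at_tier(rc, tier_results, failing_tier, passing_before=()):
    """Assert the run is RED, at exactly the intended tier, with prior tiers green."""
    # A wrongly green run (rc == 0) must fail the canary.
    assert rc != 0, "expected non-zero exit code for a RED canary, got 0"
    # Every tier before the intended failure must have passed.
    for tier in passing_before:
        got = tier_results.get(tier)
        assert got == "pass", f"expected {tier!r}='pass', got {got!r}"
    # The intended tier itself must be the one reported as failing.
    got = tier_results.get(failing_tier)
    assert got == "fail", f"expected {failing_tier!r}='fail', got {got!r}"


# Sound fixture (shape of regression-bad-backup-5): no exception is raised.
assert_red_at_tier(1, {"install": "pass", "backup": "fail"}, "backup", ["install"])

# Broken fixture (shape of regression-bad-backup-1): the wrong tier failed.
try:
    assert_red_at_tier(
        1,
        {"install": "pass", "upgrade": "fail", "backup": "skip"},
        "backup",
        ["install"],
    )
    broken_fixture_message = None
except AssertionError as exc:
    broken_fixture_message = str(exc)
```

The second call shows why the empty-compose fixtures were caught: the run was RED, but `backup` was `skip` rather than `fail`, so the tier check fires even though `rc != 0`.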
**DoD status:**
- DoD#1 (tests/regression/ committed): ✓
- DoD#2 (good canaries GREEN with semantic assertions): good-simple ✓; good-significant PENDING re-run
- DoD#3 (bad-false-green catches false-green): ✓ verified
- DoD#4 (4 per-tier RED canaries): ✓ all 4 verified
- DoD#5 (README.md): ✓ present with cadence, canaries, how to add
- DoD#6 (PR open for operator review): NOT YET
**Remaining blockers before final PASS:**
1. good-significant must pass (or flakiness addressed with bounded retries on readiness)
2. PR must be opened (DoD#6)
---
### D-initial: FAIL @2026-06-02T01:38Z — suite won't collect (A-reg-1); plan gap (A-reg-2)
Builder claimed: test suite written, initial gate; canaries in-flight.
**Cold verification result: FAIL — two blocking issues.**
**A-reg-1 (CRITICAL): Relative import fails, 0 tests collected.**
```
ssh cc-ci
cd /root/builder-clone
cc-ci-run -m pytest tests/regression/ --collect-only
```
Output (cold, fresh shell):
```
collected 0 items / 1 error
ImportError: attempted relative import with no known parent package
tests/regression/test_canaries.py:18: from .conftest import run_recipe_ci, ...
!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!
```
Root cause: `tests/regression/__init__.py` and `tests/__init__.py` missing. Fix: add them or
use absolute imports (as other test files in this repo do).
**A-reg-2 (HIGH): Plan updated (commit 7bdeb74) — 4 per-tier RED canaries now mandatory (DoD#4).**
Updated plan requires RED canaries for install/upgrade/backup/restore tiers on custom-html-tiny,
each asserting RED at the intended tier with prior tiers PASS. Current suite: 3 canaries only
(2 good + 1 bad-custom-assertion). All four are MISSING. Cannot claim DONE without them.
**Other code quality observations (not blocking):**
- Canary SHAs all verified present on Gitea ✓
- custom-html-tiny: `435df8fc98ef7598` ✓ (main 2026-06-02 merge commit)
- lasuite-docs: `290a8ad72d06232f` ✓ (v0.3.3+v5.1.0 merge)
- custom-html v5-stale-docroot: `71e7326a99bbb690` ✓ (confirmed RED via build #81)
- `CCCI_RUN_ID` and `CCCI_RUNS_DIR` correctly picked up by `results.py`
- `_assert_red` / `_assert_green` logic sound ✓
- README cadence policy complete ✓
**Verdict: FAIL. Standing issues: A-reg-1 (critical), A-reg-2 (high). Builder must fix both
before re-claiming this gate.**
---
## Adversary findings
*(See BACKLOG-regression.md § Adversary findings: A-reg-1, A-reg-2, A-reg-3)*
---
## Break-it probes log
*(Break-it probes will be recorded here as they are run)*
---
## Pre-orientation findings @01:17Z
**Known-bad fixture confirmed present and working:**
- Branch: `recipe-maintainers/custom-html:v5-stale-docroot` (SHA `71e7326a99bb`)
- Build #81 (run 3h ago): confirmed RED — `custom` stage FAIL; specifically:
- `test_content_type_html_and_txt`: FAIL — `ccci-e0d6e804.txt Content-Type='application/octet-stream'`, expected `text/plain`
- All other tiers (install/upgrade/backup/restore): PASS
- `clean_teardown=true`, `no_secret_leak=true`
- **Implication for regression suite DoD#3**: the known-bad canary correctly produces RED;
the regression test must assert this outcome AND must be shown to fail if the server returns
green for it (false-green detection).
**Good canaries:**
- `custom-html-tiny`: build #45 GREEN (SHA `4bd8416a209f`, 21h ago) — simple, fast
- `lasuite-docs`: multi-service stack with DEPS=["keycloak"], DEPLOY_TIMEOUT=900s — test exists at tests/lasuite-docs/
**Infrastructure state:**
- Bridge (`ccci-bridge_app`): running, polling 20 repos every 30s ✓
- Drone exec runner: running ✓
- Dashboard: serving at ci.commoninternet.net ✓
- Builder hasn't started regression phase: no STATUS-regression.md yet
**Notes:**
- Mirror phase (plan-mirror-enroll-all-recipes.md) completed DONE at 2026-06-02T01:16Z.
- This phase starts fresh: no STATUS-regression.md or tests/regression/ yet.
- Watching for Builder to create STATUS-regression.md and begin work.

@ -1,138 +0,0 @@
# STATUS — server regression canaries phase
**Phase:** server regression canaries (codified E2E self-tests)
**SSOT:** `/srv/cc-ci/cc-ci-plan/plan-server-regression-canaries.md`
**Builder loop started:** 2026-06-02
**Repo:** git.autonomic.zone/recipe-maintainers/cc-ci
---
## DONE
**Adversary PASS: @2026-06-02T03:36Z — D-final PASS. All 7 canaries verified. All 6 DoD items met. No vetoes.**
All DoD items Adversary-verified:
1. ✓ `tests/regression/` suite committed — 7 tests collected (DoD#1)
2. ✓ good-simple GREEN: `/var/lib/cc-ci-runs/regression-good-simple-1/` — install/upgrade=pass, test_serving PASS (DoD#2)
3. ✓ good-significant GREEN: `/var/lib/cc-ci-runs/regression-good-significant-2/` — all 5 tiers pass, clean_teardown/no_secret_leak=true (DoD#2)
4. ✓ bad-false-green RED: `/var/lib/cc-ci-runs/regression-bad-canary-1/` — custom=fail, false-green caught (DoD#3)
5. ✓ 4 per-tier RED canaries verified (bad-install/upgrade/backup/restore — artifacts on server) (DoD#4)
6. ✓ README.md: cadence, canaries, how to add (DoD#5)
7. ✓ PR#5 open for operator review: https://git.autonomic.zone/recipe-maintainers/cc-ci/pulls/5 (DoD#6)
**Phase complete. Loop stopped. PR#5 awaits operator review — do not merge.**
---
## What was built
```
tests/regression/
├── conftest.py — run_recipe_ci(), stage_has_{passing,failing}_test() helpers
├── test_canaries.py — 7 parametrized canaries (3 @canary + 4 @canary_fast)
└── README.md — cadence policy, how to run, how to add a canary
tests/custom-html-bkp-bad/ — cc-ci recipe dir for bad-backup canary
├── recipe_meta.py — BACKUP_CAPABLE=True
└── test_backup.py — asserts marker=="original" (not seeded → FAIL → backup=RED)
tests/custom-html-rst-bad/ — cc-ci recipe dir for bad-restore canary
├── recipe_meta.py — BACKUP_CAPABLE=True
├── ops.py — pre_restore writes "mutated" (no pre_backup)
└── test_restore.py — asserts marker=="original" (not in snapshot → FAIL → restore=RED)
```
---
## Canaries (7 total)
| ID | Recipe | SHA | Expected | Verified |
|----|--------|-----|---------|---------|
| good-simple | custom-html-tiny | 435df8fc (main) | GREEN | ✓ rc=0, install=pass, test_serving present |
| good-significant | lasuite-docs | 290a8ad7 (main) | GREEN | ✓ rc=0, all tiers pass (run: regression-good-significant-2) |
| bad-false-green | custom-html | 71e7326a (v5-stale-docroot) | RED | ✓ rc=1, custom=fail, test_content_type fails |
| bad-install | custom-html-tiny | 4ae88661 (regression-bad-image) | RED (install) | ✓ rc=1, install=fail |
| bad-upgrade | custom-html-tiny | 4ae88661 (regression-bad-image) | RED (upgrade) | ✓ rc=1, install=pass, upgrade=fail |
| bad-backup | custom-html-bkp-bad | b6fe99de (main) | RED (backup) | ✓ rc=1, install=pass, backup=fail |
| bad-restore | custom-html-rst-bad | 9a73a184 (main) | RED (restore) | ✓ rc=1, install=pass, backup=pass, restore=fail |
---
## How to verify (Adversary commands)
From cc-ci server (builder-clone at `/root/builder-clone`):
```bash
# Pull latest
cd /root/builder-clone && git pull --rebase
# Verify collection (expect 7 tests)
cc-ci-run -m pytest tests/regression/ --collect-only
# Fast RED canaries (~2-3 min each):
RECIPE=custom-html-tiny REF=4ae8866100563204d40435c5aba00374aa5a8ed3 SRC=recipe-maintainers/custom-html-tiny PR=0 STAGES=install CCCI_RUN_ID=adv-bad-install HOME=/root /run/current-system/sw/bin/cc-ci-run runner/run_recipe_ci.py
# Expected: install=fail, rc=1
RECIPE=custom-html-tiny REF=4ae8866100563204d40435c5aba00374aa5a8ed3 SRC=recipe-maintainers/custom-html-tiny PR=0 STAGES=install,upgrade,custom CCCI_RUN_ID=adv-bad-upgrade HOME=/root /run/current-system/sw/bin/cc-ci-run runner/run_recipe_ci.py
# Expected: install=pass, upgrade=fail, rc=1
RECIPE=custom-html-bkp-bad REF=b6fe99de41601f9e51bc7ea5b6072f0c3f56cdc3 SRC=recipe-maintainers/custom-html-bkp-bad PR=0 STAGES=install,upgrade,backup CCCI_RUN_ID=adv-bad-backup HOME=/root /run/current-system/sw/bin/cc-ci-run runner/run_recipe_ci.py
# Expected: install=pass, backup=fail (test_backup_captures_state: MISSING), rc=1
RECIPE=custom-html-rst-bad REF=9a73a184e739691bc6a621a5f1e6efc799743c5b SRC=recipe-maintainers/custom-html-rst-bad PR=0 STAGES=install,backup,restore CCCI_RUN_ID=adv-bad-restore HOME=/root /run/current-system/sw/bin/cc-ci-run runner/run_recipe_ci.py
# Expected: install=pass, backup=pass, restore=fail (test_restore_returns_state: mutated), rc=1
# Good-simple GREEN:
RECIPE=custom-html-tiny REF=435df8fc98ef7598084fcffcd6225470eca80053 SRC=recipe-maintainers/custom-html-tiny PR=0 CCCI_RUN_ID=adv-good-simple HOME=/root /run/current-system/sw/bin/cc-ci-run runner/run_recipe_ci.py
# Expected: install=pass, upgrade=pass, rc=0; stages.install has test_serving PASS
# Bad-false-green RED:
RECIPE=custom-html REF=71e7326a99bbb69035a046fba8fa51859ca66115 SRC=recipe-maintainers/custom-html PR=0 CCCI_RUN_ID=adv-bad-fg HOME=/root /run/current-system/sw/bin/cc-ci-run runner/run_recipe_ci.py
# Expected: custom=fail (test_content_type FAILS), rc=1
# Good-significant (lasuite-docs) — verify artifact (or re-run, takes ~15-20 min):
# Quick artifact check (no re-run needed):
cat /var/lib/cc-ci-runs/regression-good-significant-2/results.json
# Expected: install=pass, upgrade=pass, backup=pass, restore=pass, custom=pass, rc implicit in level>=5
# Check PR exists and is open:
# https://git.autonomic.zone/recipe-maintainers/cc-ci/pulls/5 — state=open, 10 files, 704 insertions
```
---
## Artifacts already on server
| Run ID | Recipe | Result |
|--------|--------|--------|
| regression-good-simple-1 | custom-html-tiny | GREEN ✓ |
| regression-good-significant-2 | lasuite-docs | GREEN ✓ (all tiers: install/upgrade/backup/restore/custom=pass) |
| regression-bad-canary-1 | custom-html v5-stale-docroot | RED ✓ |
| regression-bad-install-v2 | custom-html-tiny bad-image | RED (install=fail) ✓ |
| regression-bad-upgrade-v2 | custom-html-tiny bad-image | RED (upgrade=fail) ✓ |
| regression-bad-backup-5 | custom-html-bkp-bad | RED (backup=fail) ✓ |
| regression-bad-restore-3 | custom-html-rst-bad | RED (restore=fail) ✓ |
---
## good-significant run 2 full results (cold-readable on server)
`cat /var/lib/cc-ci-runs/regression-good-significant-2/results.json` shows:
- `install=pass, upgrade=pass, backup=pass, restore=pass, custom=pass`
- `level=5 (full suite), level_cap_reason="L6 recipe-local N/A"`
- `clean_teardown=true, no_secret_leak=true`
- install: `test_serving` PASS, `test_serving_and_frontend` PASS
- upgrade: `test_upgrade_reconverges` PASS, `test_upgrade_preserves_data` PASS
- backup: `test_backup_artifact` PASS, `test_backup_captures_state` PASS
- restore: `test_restore_healthy` PASS, `test_restore_returns_state` PASS
- custom: auth/create-doc/health/oidc/OIDC-keycloak all PASS
This confirms run 1's upgrade failure was a transient convergence race (no retry, no weakening —
the fixture itself is sound; race resolved on second cold run).
---
## PR
**PR#5: https://git.autonomic.zone/recipe-maintainers/cc-ci/pulls/5**
Branch `regression-canaries` → `main`. 10 files, 704 insertions. Open for operator review.
"Do not merge" — operator review only per DoD#6.

@ -7,7 +7,7 @@
# git clone --recursive https://git.autonomic.zone/recipe-maintainers/cc-ci.git /etc/cc-ci
# install -m600 <age-private-key> /var/lib/sops-nix/key.txt
# nixos-rebuild switch --flake /etc/cc-ci#cc-ci-hetzner
{ pkgs, ... }:
{ pkgs, lib, ... }:
{
imports = [
./hardware.nix
@ -22,7 +22,6 @@
../../modules/drone-runner.nix
../../modules/bridge.nix
../../modules/dashboard.nix
../../modules/reports.nix
../../modules/backupbot.nix
../../modules/harness.nix
../../modules/warm-keycloak.nix

@ -11,17 +11,13 @@
{
imports = [ (modulesPath + "/profiles/qemu-guest.nix") ];
boot = {
loader = {
efi.efiSysMountPoint = "/boot/efi";
grub = {
efiSupport = true;
efiInstallAsRemovable = true;
device = "nodev";
};
boot.loader = {
efi.efiSysMountPoint = "/boot/efi";
grub = {
efiSupport = true;
efiInstallAsRemovable = true;
device = "nodev";
};
initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "xen_blkfront" "vmw_pvscsi" ];
initrd.kernelModules = [ "nvme" ];
};
fileSystems."/boot/efi" = {
@ -29,6 +25,9 @@
fsType = "vfat";
};
boot.initrd.availableKernelModules = [ "ata_piix" "uhci_hcd" "xen_blkfront" "vmw_pvscsi" ];
boot.initrd.kernelModules = [ "nvme" ];
fileSystems."/" = {
device = "/dev/sda1";
fsType = "ext4";

@ -8,19 +8,14 @@
{ pkgs, config, lib, ... }:
let
# MAX_TESTS (plan §4.2/§4.3 resource safety): max CI builds the exec runner runs at once. Drone
# queues the rest in its native pending-build queue (no custom queue). THE SINGLE concurrency
# knob — nothing else caps recipe-ci parallelism (the .drone.yml concurrency.limit was removed:
# one knob, one place). Bounds how many test apps can be live at once.
#
# Raised to 2 (operator request 2026-06-09) so two recipes can be tested in parallel (e.g. immich
# and plausible under active development at once). Verified safe on the current node (Hetzner cpx22,
# ~7.6 GiB / 4 vCPU — NOTE: smaller than the original 28 GiB this was written for): a full immich CI
# stack measured ~1 GiB (server+ML+pg+redis) with multiple GiB free, so two concurrent recipes fit.
# Concurrent-run safety is the harness's job at ANY capacity (docs/concurrency.md): per-run
# ABRA_DIR recipe trees, per-app-domain flocks, and a flock-probe janitor that reaps a crashed
# build's orphan immediately (held lock = live run, never touched). Revert to "1" if OOM /
# disk-I/O contention is observed under load.
maxTests = "2";
# queues the rest in its native pending-build queue (no custom queue). THE concurrency cap that
# bounds how many test apps can be live at once — kept LOW (1) on this single 28GiB node since
# recipes are heavy (immich/matrix large volumes). With capacity=1 there is never a concurrent
# in-flight run, so the run-start janitor can safely reap *any* orphan (a SIGKILL'd build runs no
# teardown) and the "at most MAX_TESTS apps live" bound holds exactly. Raise to 2 only if the node
# is shown to handle two light recipes at once (then the janitor MUST stay age-based to avoid
# reaping a concurrent run — see DECISIONS.md "Resource safety").
maxTests = "1";
in
{
# Drone ships under the Polyform Small Business license (nixpkgs marks it unfree);

View File

@ -29,7 +29,7 @@ in
serviceConfig = {
Type = "oneshot";
# A full sweep across several recipes (each a cold deploy/test/teardown) is long; bound it.
TimeoutStartSec = "21600"; # 6h ceiling
TimeoutStartSec = "21600"; # 6h ceiling
ExecStart = "${sweep}/bin/cc-ci-nightly-sweep";
};
};
@ -39,7 +39,7 @@ in
wantedBy = [ "timers.target" ];
timerConfig = {
OnCalendar = "*-*-* 03:00:00";
Persistent = true; # catch up a missed nightly after downtime
Persistent = true; # catch up a missed nightly after downtime
RandomizedDelaySec = "600";
};
};

View File

@ -1,116 +0,0 @@
# Recipe Report static site (report.ci.commoninternet.net): a public nginx serving the weekly
# "Recipe Report" HTML pages written to /var/lib/cc-ci-reports by the /recipe-report skill. No app,
# no secrets — just static files behind traefik + the wildcard TLS (same pattern as dashboard.nix,
# but a plain nginx:alpine since there's nothing to render server-side). Content is updated by writing
# files into /var/lib/cc-ci-reports; nginx serves them live (no redeploy needed).
#
# It ALSO serves a same-origin realtime PR-status proxy at /pr/<recipe>/<n>: the report's STATUS
# column fetches it client-side to show each PR's live state (open vs. ✓). Same-origin means no
# dependency on the Gitea CORS allow-list; the recipe mirrors are public so no token is needed. The
# proxy is pinned to recipe-maintainers + a safe recipe-name charset and is read-only (GET/HEAD).
{ pkgs, ... }:
let
reportsDir = "/var/lib/cc-ci-reports";
# Custom nginx server: static report files + the /pr/<recipe>/<n> → Gitea-API proxy. Replaces the
# stock /etc/nginx/conf.d/default.conf (which the image's nginx.conf includes inside http{}).
nginxConf = pkgs.writeText "cc-ci-reports-default.conf" ''
server {
listen 80;
server_name _;
root /usr/share/nginx/html;
index index.html;
# Realtime PR-status proxy for the Recipe Report STATUS column.
# GET /pr/<recipe>/<n> -> the PUBLIC Gitea PR JSON ({state, merged, ...}). Same-origin from
# the browser's view, so no CORS dependency; unauthenticated, since the recipe mirrors are
# public. The repo owner is hard-pinned to recipe-maintainers and the recipe name to a
# slashless charset, so the proxied path can only ever address recipe-maintainers/<name>/pulls
# (it cannot be coerced to another org or path). Only safe read methods are allowed.
location ~ ^/pr/([a-z0-9._-]+)/([0-9]+)$ {
limit_except GET HEAD { deny all; }
resolver 127.0.0.11 ipv6=off valid=30s; # docker embedded DNS (forwards external names)
proxy_ssl_server_name on;
proxy_set_header Host git.autonomic.zone;
proxy_set_header Accept "application/json";
proxy_pass https://git.autonomic.zone/api/v1/repos/recipe-maintainers/$1/pulls/$2;
proxy_intercept_errors off;
proxy_connect_timeout 5s;
proxy_read_timeout 10s;
add_header Cache-Control "no-store" always; # always fetch live state, never cache in the browser
}
location / {
try_files $uri $uri/ =404;
}
}
'';
stack = pkgs.writeText "cc-ci-reports-stack.yml" ''
version: "3.8"
services:
app:
image: nginx:alpine
volumes:
- type: bind
source: ${reportsDir}
target: /usr/share/nginx/html
read_only: true
- type: bind
source: ${nginxConf}
target: /etc/nginx/conf.d/default.conf
read_only: true
networks:
- proxy
deploy:
replicas: 1
restart_policy:
condition: any
labels:
- "traefik.enable=true"
- "traefik.http.services.ccci-reports.loadbalancer.server.port=80"
- "traefik.http.routers.ccci-reports.rule=Host(`report.ci.commoninternet.net`)"
- "traefik.http.routers.ccci-reports.entrypoints=web-secure"
- "traefik.http.routers.ccci-reports.tls=true"
networks:
proxy:
external: true
'';
reconcile = pkgs.writeShellApplication {
name = "cc-ci-reconcile-reports";
runtimeInputs = with pkgs; [ docker coreutils ];
text = ''
mkdir -p ${reportsDir}
# Seed a placeholder index so the site serves something before the first report is generated.
if [ ! -f ${reportsDir}/index.html ]; then
cat > ${reportsDir}/index.html <<'HTML'
<!doctype html><html lang="en"><head><meta charset="utf-8">
<meta name="viewport" content="width=device-width,initial-scale=1">
<title>The Recipe Report</title>
<style>body{font:16px/1.5 system-ui,sans-serif;max-width:50rem;margin:3rem auto;padding:0 1rem;color:#222}</style>
</head><body><h1>🌻 The Recipe Report</h1>
<p>No reports yet the first one is generated after the weekly recipe-upgrade run.</p>
</body></html>
HTML
fi
docker stack deploy --detach=true -c ${stack} ccci-reports
'';
};
in
{
systemd.services.deploy-reports = {
description = "Reconcile the cc-ci Recipe Report static site (report.ci.commoninternet.net)";
# Ordering-only: chain after the dashboard (proxy→…→dashboard→reports) to avoid concurrent
# docker-init races on a fresh host.
after = [ "deploy-dashboard.service" "deploy-proxy.service" "swarm-init.service" "docker.service" "network-online.target" ];
requires = [ "swarm-init.service" "docker.service" ];
wants = [ "network-online.target" ];
wantedBy = [ "multi-user.target" ];
serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
ExecStart = "${reconcile}/bin/cc-ci-reconcile-reports";
};
};
}
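
The path pinning described in the nginx `location` comments above can be checked outside nginx. The following is a minimal Python sketch, not part of the change set: `upstream_path` and `UPSTREAM_PREFIX` are hypothetical names, and the sketch models only the regex, not nginx's own URI normalisation or the `limit_except` method filter.

```python
from __future__ import annotations

import re

# Same character classes as the nginx location regex; fullmatch() plays the role of ^...$.
PR_PATH = re.compile(r"/pr/([a-z0-9._-]+)/([0-9]+)")

# Hypothetical constant mirroring the hard-pinned owner in the proxy_pass target.
UPSTREAM_PREFIX = "/api/v1/repos/recipe-maintainers"


def upstream_path(request_path: str) -> str | None:
    """Return the upstream API path for a /pr/<recipe>/<n> request, or None if it does not match."""
    m = PR_PATH.fullmatch(request_path)
    if m is None:
        return None
    recipe, number = m.groups()
    # The recipe capture cannot contain "/", so the owner segment is always recipe-maintainers.
    return f"{UPSTREAM_PREFIX}/{recipe}/pulls/{number}"
```

Because the recipe capture group excludes `/`, a request such as `/pr/other-org/immich/12` does not match at all and falls through to the static `location /` block.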

View File

@ -10,7 +10,6 @@ Bakes in the known abra gotchas (re-verify per installed abra version, currently
from __future__ import annotations
import json
import os
import subprocess
ABRA = "abra"
@ -20,20 +19,6 @@ class AbraError(RuntimeError):
pass
def abra_dir() -> str:
"""abra's state dir, resolved the same way the abra CLI resolves it: $ABRA_DIR if set, else
~/.abra. Inside a CI run, run_recipe_ci exports a PER-RUN $ABRA_DIR (fresh recipes/, shared
servers/+catalogue/ symlinks) before any abra call, so every helper here and every abra
subprocess agree on the same tree; outside a run (warm_reconcile's systemd timer, manual use)
both fall back to the canonical /root/.abra."""
return os.environ.get("ABRA_DIR") or os.path.expanduser("~/.abra")
def recipe_dir(recipe: str) -> str:
"""The current ABRA_DIR's working tree for a recipe (per-run inside a CI run)."""
return os.path.join(abra_dir(), "recipes", recipe)
def _run_pty(
args: list[str], timeout: int = 900, check: bool = True
) -> subprocess.CompletedProcess:
@ -92,7 +77,9 @@ def recipe_checkout(recipe: str, version: str) -> None:
a chaos (`-C`) deploy ignores ENV VERSION and uses the current checkout — together that silently
deployed LATEST for a 'previous-version' base, making the upgrade a no-op (Adversary F1d-2). With
this checkout + a non-chaos deploy, a pinned deploy genuinely deploys that version."""
path = recipe_dir(recipe)
import os
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
# -f (force): the version-pinning checkout must yield the EXACT ref tree. Without it, a cc-ci
# install_steps-provided overlay (e.g. discourse's compose.ccci.yml, copied into the pinned base)
# is an UNTRACKED file that collides with the same path TRACKED in a later ref, and
@ -113,7 +100,9 @@ def has_lightweight_version_tags(recipe: str) -> bool:
'reference not found'.) The caller (deploy_app) uses this to fall back to a chaos base deploy
(which skips lint and deploys the explicitly-checked-out pinned version — see lifecycle.deploy_app).
Read-only: just `git tag` + `cat-file -t`; no fetch/mutation, so it can't trigger abra's revert."""
path = recipe_dir(recipe)
import os
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
tags = subprocess.run(
["git", "-C", path, "tag", "-l"], capture_output=True, text=True
).stdout.split()
@ -179,9 +168,7 @@ def secret_generate(domain: str, timeout: int = 300) -> None:
)
def deploy(
domain: str, chaos: bool = True, timeout: int = 900, no_converge_checks: bool = False
) -> None:
def deploy(domain: str, chaos: bool = True, timeout: int = 900, no_converge_checks: bool = False) -> None:
args = ["app", "deploy", domain, "-o", "-n"]
if chaos:
args.append("-C")
@ -216,10 +203,7 @@ def backup_create(domain: str, timeout: int = 900) -> str:
# remote and fails "authentication required: Unauthorized". Returns the captured output, whose
# restic JSON summary line carries the produced "snapshot_id" (the backup artifact, DG3) — note
# `abra app backup snapshots` needs a TTY and is awkward to script, so we read the create output.
out = (
_run_pty(["app", "backup", "create", domain, "-n", "-C", "-o"], timeout=timeout).stdout
or ""
)
out = _run_pty(["app", "backup", "create", domain, "-n", "-C", "-o"], timeout=timeout).stdout or ""
# Echo the backup output (incl. backupbot's pre-hook run / any "Failed to run command" or
# "Container ... not running" ERROR) into the run log. Backup is otherwise opaque: a pre-hook that
# fails to register/run leaves the DB dump out of the snapshot, surfacing only as a downstream
@ -242,7 +226,9 @@ def recipe_head_commit(recipe: str) -> str | None:
"""The current HEAD commit of the recipe checkout — captured right after fetch (the PR head, or
the catalogue current) so the upgrade tier can re-checkout it for the chaos redeploy after the
prev-tag base deploy reset the working tree (HC1)."""
path = recipe_dir(recipe)
import os
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
proc = subprocess.run(["git", "-C", path, "rev-parse", "HEAD"], capture_output=True, text=True)
out = proc.stdout.strip()
return out or None
@ -250,7 +236,10 @@ def recipe_head_commit(recipe: str) -> str | None:
def recipe_versions(recipe: str) -> list[str]:
"""Published versions of a recipe, oldest→newest (from the recipe git tags)."""
path = recipe_dir(recipe)
import os
import subprocess
path = os.path.expanduser(f"~/.abra/recipes/{recipe}")
proc = subprocess.run(
["git", "-C", path, "tag", "--sort=creatordate"], capture_output=True, text=True
)

View File

@ -13,15 +13,8 @@ from __future__ import annotations
import time
def goto_with_retry(
page,
url,
*,
deadline_seconds: int = 120,
accept_statuses=(200, 304),
goto_timeout_ms: int = 30_000,
wait_until: str = "domcontentloaded",
):
def goto_with_retry(page, url, *, deadline_seconds: int = 120, accept_statuses=(200, 304),
goto_timeout_ms: int = 30_000, wait_until: str = "domcontentloaded"):
"""Poll `page.goto(url)` until status is in `accept_statuses` OR the deadline expires.
Returns the final Playwright response. Raises AssertionError if the deadline expires without

View File

@ -55,9 +55,7 @@ def enrolled_recipes() -> list[str]:
out = []
try:
for name in sorted(os.listdir(tests_dir)):
if os.path.isfile(os.path.join(tests_dir, name, "recipe_meta.py")) and is_enrolled(
name
):
if os.path.isfile(os.path.join(tests_dir, name, "recipe_meta.py")) and is_enrolled(name):
out.append(name)
except OSError:
pass
@ -124,15 +122,11 @@ def deploy_canonical(recipe: str, timeout: int = 900) -> None:
abra.recipe_checkout(recipe, version)
r = subprocess.run(
["abra", "app", "deploy", domain, version, "-o", "-n", "-f"],
capture_output=True,
text=True,
timeout=timeout,
capture_output=True, text=True, timeout=timeout,
)
if r.returncode != 0:
raise RuntimeError(
f"deploy canonical {domain} {version} failed: "
f"{(r.stderr + ' ' + r.stdout).strip()[:300]}"
)
raise RuntimeError(f"deploy canonical {domain} {version} failed: "
f"{(r.stderr + ' ' + r.stdout).strip()[:300]}")
_set_status(recipe, "warm")

View File

@ -79,44 +79,10 @@ def render_badge_svg(label: str, message: str, color: str) -> str:
)
# Third-segment colours for the level badge: amber = an UNINTENTIONAL skip (a rung skipped but not
# in the recipe's intentional list — likely missing coverage) capped the climb; muted = an
# INTENTIONAL skip (declared in recipe_meta.EXPECTED_NA — nothing to fix). Font-safe text labels
# (no emoji) so the SVG renders anywhere.
GAP_COLOR = "#d29922"
EXPECT_COLOR = "#6e7681"
def level_badge_svg(level: int, cap_reason: str = "", cap_skip: str = "") -> str:
"""Per-recipe/-run LEVEL badge: 'cc-ci | level N' coloured by level (R6), with a THIRD segment
that differentiates *why* the climb stopped when a SKIP capped it (`cap_skip`):
- "unintentional" (a rung skipped but not in the recipe's intentional list): amber 'gap?'.
- "intentional" (a skip declared in recipe_meta.EXPECTED_NA): muted 'expected'.
- "" (clean cap / full climb / a real failure): no third segment (the level + card carry it).
The badge never inflates — it only annotates the cap the level already reflects."""
label, msg = "cc-ci", f"level {int(level)}"
lw, mw = _text_width(label), _text_width(msg)
third: tuple[str, str] | None = None
if cap_skip == "unintentional":
third = ("gap?", GAP_COLOR)
elif cap_skip == "intentional":
third = ("expected", EXPECT_COLOR)
if third is None:
return render_badge_svg(label, msg, level_color(level))
txt, tcolor = third
tw = _text_width(txt)
w = lw + mw + tw
return (
f'<svg xmlns="http://www.w3.org/2000/svg" width="{w}" height="20" role="img" '
f'aria-label="{html.escape(label)}: {html.escape(msg)} ({html.escape(txt)})">'
f'<rect width="{lw}" height="20" fill="#555"/>'
f'<rect x="{lw}" width="{mw}" height="20" fill="{level_color(level)}"/>'
f'<rect x="{lw + mw}" width="{tw}" height="20" fill="{tcolor}"/>'
f'<g fill="#fff" font-family="Verdana,Geneva,sans-serif" font-size="11">'
f'<text x="6" y="14">{html.escape(label)}</text>'
f'<text x="{lw + 6}" y="14">{html.escape(msg)}</text>'
f'<text x="{lw + mw + 6}" y="14">{html.escape(txt)}</text></g></svg>'
)
def level_badge_svg(level: int, cap_reason: str = "") -> str:
"""Per-recipe/-run LEVEL badge: 'cc-ci | level N'. Colour by level (R6)."""
msg = f"level {int(level)}"
return render_badge_svg("cc-ci", msg, level_color(level))
def _stage_rows(stages: list[dict]) -> str:
@ -141,45 +107,6 @@ def _stage_rows(stages: list[dict]) -> str:
return "\n".join(rows) or '<tr><td colspan="3">no stages</td></tr>'
# Friendly rung labels for the skip rows (the four essential rungs).
RUNG_LABEL = {
"install": "install",
"upgrade": "upgrade",
"backup_restore": "backup/restore",
"functional": "functional",
}
SKIP_GREEN = (
"#57ab5a" # muted green — an intentional skip reads like a pass (but labelled, never inflating)
)
def _skip_rows(skips: dict) -> str:
"""Render SKIPPED rungs as stage-like rows. An intentional (declared) skip looks like a pass row
but its status says 'INTENTIONAL SKIP' (muted green) with the declared reason on the line below;
an unintentional skip is amber 'UNINTENTIONAL SKIP' with a prompt to add a test or declare it."""
rows = []
for rung, reason in (skips.get("intentional") or {}).items():
rows.append(
f'<tr class="stage"><td colspan="2"><span class="mark" style="color:{SKIP_GREEN}">⊘</span>'
f"<b>{html.escape(RUNG_LABEL.get(rung, rung))}</b></td>"
f'<td class="st" style="color:{SKIP_GREEN}">intentional skip</td></tr>'
)
rows.append(
f'<tr class="skipreason"><td></td><td colspan="2">{html.escape(reason)}</td></tr>'
)
for rung in skips.get("unintentional") or []:
rows.append(
f'<tr class="stage"><td colspan="2"><span class="mark" style="color:{GAP_COLOR}">⊘</span>'
f"<b>{html.escape(RUNG_LABEL.get(rung, rung))}</b></td>"
f'<td class="st" style="color:{GAP_COLOR}">unintentional skip</td></tr>'
)
rows.append(
'<tr class="skipreason"><td></td><td colspan="2">not declared in EXPECTED_NA — add the '
"missing test/label, or declare the skip with a reason</td></tr>"
)
return "\n".join(rows)
def render_card_html(data: dict, screenshot_rel: str | None = "screenshot.png") -> str:
"""Build the summary-card HTML from a results.json dict. `screenshot_rel` is the relative path to
the screenshot PNG (same dir as the card) — omitted from the card if None / absent.
@ -189,9 +116,7 @@ def render_card_html(data: dict, screenshot_rel: str | None = "screenshot.png")
recipe = html.escape(str(data.get("recipe", "?")))
version = html.escape(str(data.get("version") or data.get("ref") or ""))
level = int(data.get("level", 0))
cap_reason = str(data.get("level_cap_reason") or "")
cap = html.escape(cap_reason)
sk = data.get("skips", {}) or {}
cap = html.escape(str(data.get("level_cap_reason") or ""))
color = level_color(level)
flags = data.get("flags", {}) or {}
flag_bits = []
@ -207,7 +132,7 @@ def render_card_html(data: dict, screenshot_rel: str | None = "screenshot.png")
if show_shot
else '<div class="shot noshot">no screenshot</div>'
)
rows = _stage_rows(data.get("stages", [])) + "\n" + _skip_rows(sk)
rows = _stage_rows(data.get("stages", []))
return f"""<!doctype html><html><head><meta charset="utf-8"><style>
*{{box-sizing:border-box}}
body{{margin:0;font-family:system-ui,-apple-system,Segoe UI,sans-serif;background:#0d1117;color:#c9d1d9}}
@ -232,7 +157,6 @@ tr.stage td{{padding-top:.5rem;border-bottom:1px solid #30363d}}
.test .tmark{{width:1.4rem;text-align:center}}
.test .tname{{color:#c9d1d9;font-family:ui-monospace,monospace;font-size:.8rem}}
.test .tms{{text-align:right;color:#8b949e;font-size:.74rem;width:5rem}}
tr.skipreason td{{color:#8b949e;font-size:.78rem;font-style:italic;padding-top:0;padding-bottom:.45rem;border-bottom:1px solid #21262d}}
.shot{{width:360px;flex:none;border:1px solid #30363d;border-radius:8px;overflow:hidden;background:#0d1117}}
.shot img{{width:100%;display:block}}
.shot.noshot{{display:flex;align-items:center;justify-content:center;height:225px;color:#8b949e;font-size:.85rem}}
@ -243,7 +167,7 @@ tr.skipreason td{{color:#8b949e;font-size:.78rem;font-style:italic;padding-top:0
<div class="hd">{FLOWER_SVG}
<div class="title"><h1>{recipe}</h1><span class="ver">{version}</span></div>
<div class="lvl"><span class="num">{level}</span><span class="lbl">level</span></div></div>
<div class="cap">{("<b>capped:</b> " + cap) if cap else "<b>full clean climb</b> — top level (4)"}</div>
<div class="cap">{("<b>capped:</b> " + cap) if cap else "<b>full clean climb</b> — top level (6)"}</div>
<div class="body"><div class="tbl"><table>{rows}</table></div>{shot_html}</div>
<div class="flags">{"".join(flag_bits)}</div>
</div></body></html>"""

View File

@ -28,7 +28,7 @@ from __future__ import annotations
import contextlib
import json
import os
from collections.abc import Iterable
from typing import Iterable
from . import lifecycle, naming
@ -36,7 +36,9 @@ from . import lifecycle, naming
def declared_deps(recipe: str) -> list[str]:
"""Read `DEPS` from `tests/<recipe>/recipe_meta.py` — a list of recipe names this recipe needs
deployed alongside it. Returns [] if none."""
path = os.path.join(os.path.dirname(__file__), "..", "..", "tests", recipe, "recipe_meta.py")
path = os.path.join(
os.path.dirname(__file__), "..", "..", "tests", recipe, "recipe_meta.py"
)
if not os.path.exists(path):
return []
ns: dict = {}

View File

@ -25,7 +25,7 @@ _BACKUPBOT_RE = re.compile(r"backupbot\.backup\b[^\n]*\btrue\b", re.IGNORECASE)
def _recipe_dir(recipe: str) -> str:
return abra.recipe_dir(recipe) # the per-run tree inside a CI run ($ABRA_DIR)
return os.path.expanduser(f"~/.abra/recipes/{recipe}")
def backup_capable(recipe: str, meta: dict | None = None) -> bool:
@ -222,11 +222,7 @@ def assert_restore_healthy(domain: str, meta: dict) -> None:
def perform_upgrade(
domain: str,
recipe: str,
head_ref: str | None,
deploy_timeout: int = 900,
meta: dict | None = None,
domain: str, recipe: str, head_ref: str | None, deploy_timeout: int = 900, meta: dict | None = None
) -> dict[str, str | None]:
"""Perform the UPGRADE op once, in place, to the PR-HEAD code under test (HC1): re-checkout the
PR head (the prev-tag base deploy reset the recipe working tree), then `abra app deploy --chaos`
@ -271,9 +267,7 @@ def perform_upgrade(
deploy_timeout=int(meta.get("DEPLOY_TIMEOUT", deploy_timeout)),
http_timeout=int(meta.get("HTTP_TIMEOUT", 300)),
)
lifecycle.wait_ready_probes(
meta, domain, timeout=int(meta.get("DEPLOY_TIMEOUT", deploy_timeout))
)
lifecycle.wait_ready_probes(meta, domain, timeout=int(meta.get("DEPLOY_TIMEOUT", deploy_timeout)))
after = lifecycle.deployed_identity(domain)
# Evidence (HC1): the chaos-version label = the deployed recipe commit; it should match the
# PR-head we checked out — proving the upgrade deployed the code under test, not a published tag.

View File

@ -73,7 +73,7 @@ def http_post(
`data` is JSON-encoded if content_type='application/json',
form-encoded if 'application/x-www-form-urlencoded' (the OIDC token endpoint form),
or sent raw bytes if data is already bytes."""
if isinstance(data, bytes | bytearray):
if isinstance(data, (bytes, bytearray)):
body: bytes | None = bytes(data)
elif content_type == "application/json" and data is not None:
body = json.dumps(data).encode()
@ -107,7 +107,7 @@ def http_request(
) -> tuple[int, object | None]:
"""Arbitrary-method HTTP (PUT/DELETE/PATCH) for parity tests that mutate. Same shape as
http_post (returns (status, json_or_None))."""
if isinstance(data, bytes | bytearray):
if isinstance(data, (bytes, bytearray)):
body: bytes | None = bytes(data)
elif content_type == "application/json" and data is not None:
body = json.dumps(data).encode()
@ -142,7 +142,7 @@ def post_with_headers(
"""Like http_post but ALSO returns the response headers as a dict — for APIs that hand back an
auth token in a response header rather than the body (e.g. mattermost login → `Token` header).
Returns (status, parsed_json_or_None, response_headers). status=0 + {} on transport failure."""
if isinstance(data, bytes | bytearray):
if isinstance(data, (bytes, bytearray)):
body: bytes | None = bytes(data)
elif content_type == "application/json" and data is not None:
body = json.dumps(data).encode()
@ -252,16 +252,13 @@ def retry_http_post(
) -> tuple[int, object | None]:
"""POST with retry until expect_fn(status, json) is truthy. Defaults to any 2xx."""
if expect_fn is None:
def expect_fn(s, _j): # noqa: ARG001
return 200 <= s < 300
result: list[tuple[int, object | None]] = [(0, None)]
def _check():
s, j = http_post(
url, data=data, headers=headers, content_type=content_type, timeout=timeout
)
s, j = http_post(url, data=data, headers=headers, content_type=content_type, timeout=timeout)
result[0] = (s, j)
return expect_fn(s, j)

View File

@ -5,39 +5,37 @@ YunoHost semantics: **a gap caps the level** — you only earn level L if every
PASS. The first rung that is not a clean PASS (a real FAIL *or* genuinely N/A for this recipe) stops
the climb; `cap_reason` records why. This is deliberately conservative: presentation must NEVER make
a run look greener than its tests (plan §6 cardinal guardrail), so an N/A rung caps just like a fail
— with a recorded reason so the level is *fair*, not inflated.
(the L5 example in §4.1 — "recipes with no integration surface cap at L4 by definition" — is exactly
this: N/A caps, with a recorded reason so the level is *fair*, not inflated).
The ladder is the FOUR essential rungs every recipe is held to:
The ladder (§4.1):
L0 — install failed / app never became healthy.
L1 — Installs: deploys + passes health/readiness.
L2 — Upgrades: previous published version → PR version, stays healthy, data intact.
L3 — Backup/restore: seeded data survives backup → wipe → restore.
L4 — Functional: recipe-specific functional tests pass.
Integration (SSO/OIDC + cross-app) and recipe-local (the recipe repo's own tests/) are **OPTIONAL**
capabilities — they are NOT part of the level ladder and never cap it. They still run when present
(and SSO is still enforced for the run VERDICT via the deps/SSO checks in run_recipe_ci.py), but a
recipe without an SSO surface or without repo-local tests is simply not penalised on the level.
L5 — Integration: SSO/OIDC + cross-app integration tests pass.
L6 — Recipe-local: the recipe repo's own tests/ (D4) pass and are merged.
This module is PURE (no I/O) so it is cheaply unit-testable and the Adversary can re-run the unit
test cold (`cc-ci-run -m pytest tests/unit/test_level.py -q`). The orchestrator
(`run_recipe_ci.py`) is responsible for translating its raw per-tier results into the rung-status
dict this function consumes; that mapping is documented in DECISIONS.md (Phase 3).
(`run_recipe_ci.py`) is responsible for translating its raw per-tier results + deps/SSO signals into
the rung-status dict this function consumes; that mapping is documented in DECISIONS.md (Phase 3).
Rung status vocabulary (each rung ∈ these three):
"pass" — the rung was exercised and passed.
"fail" — the rung was exercised and failed.
"na" — the rung does not apply to this recipe (e.g. only one published version → no upgrade;
not backup-capable). N/A is NOT a failure, but it DOES cap the climb (with a distinct
cap_reason) so the level never overstates what was actually verified.
not backup-capable; no SSO/integration surface; no recipe-local tests). N/A is NOT a
failure, but it DOES cap the climb (with a distinct cap_reason) so the level never
overstates what was actually verified.
"""
from __future__ import annotations
# The climbable rungs in ascending order. install (L1) is the foundation; L0 means install itself
# did not pass. Each later rung requires every earlier rung to be a clean PASS. These four are the
# ESSENTIAL rungs — integration/recipe-local are optional and deliberately NOT in this tuple.
RUNGS = ("install", "upgrade", "backup_restore", "functional")
# did not pass. Each later rung requires every earlier rung to be a clean PASS.
RUNGS = ("install", "upgrade", "backup_restore", "functional", "integration", "recipe_local")
# Human-readable label per rung level, for cap_reason + the summary card.
RUNG_LABEL = {
@ -45,20 +43,22 @@ RUNG_LABEL = {
2: "upgrade (prev published → PR)",
3: "backup/restore (data integrity)",
4: "functional (recipe-specific tests)",
5: "integration (SSO/OIDC + cross-app)",
6: "recipe-local (recipe repo tests/)",
}
VALID = {"pass", "fail", "na"}
def compute_level(rungs: dict[str, str]) -> tuple[int, str]:
"""Map a rung-status dict → (level 0..4, cap_reason).
"""Map a rung-status dict → (level 0..6, cap_reason).
`rungs` must contain a status in {"pass","fail","na"} for every name in RUNGS. The level is the
highest L such that rungs[1..L] are all "pass"; the first non-"pass" rung caps the climb. L0 is
returned when the install rung itself is not "pass" (install failed / never healthy).
cap_reason explains where the climb stopped:
- "" (empty) when the recipe earned the top rung (L4, full clean climb).
- "" (empty) when the recipe earned the top rung (L6, full clean climb).
- "L<k> <label> FAILED" when a rung was exercised and failed.
- "L<k> <label> N/A" when a rung does not apply to this recipe.
Returns the reason for the FIRST rung that stopped the climb (the binding constraint).
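
The "a gap caps the level" rule in the docstring above can be sketched in a few lines. This is an illustration under stated assumptions, not the module's actual body (which this diff does not show in full): `compute_level_sketch` is a hypothetical name, the rung order is passed in so the same code covers both the four-rung and six-rung ladders, and the rung name stands in for the human-readable label in `cap_reason`.

```python
from __future__ import annotations

VALID = {"pass", "fail", "na"}
FOUR_RUNGS = ("install", "upgrade", "backup_restore", "functional")
SIX_RUNGS = FOUR_RUNGS + ("integration", "recipe_local")


def compute_level_sketch(rungs: dict[str, str], order: tuple[str, ...]) -> tuple[int, str]:
    """Return (level, cap_reason): the number of leading "pass" rungs and why the climb stopped."""
    for level, name in enumerate(order, start=1):
        status = rungs[name]
        if status not in VALID:
            raise ValueError(f"invalid status {status!r} for rung {name!r}")
        if status != "pass":
            # The first non-pass rung is the binding constraint: a fail and an N/A both cap.
            suffix = "FAILED" if status == "fail" else "N/A"
            return level - 1, f"L{level} {name} {suffix}"
    return len(order), ""  # full clean climb: empty cap_reason
```

With the six-rung order, a recipe that passes the first four rungs but has no integration surface returns level 4 with an `N/A` reason; with the four-rung order the same results are a full clean climb.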

View File

@ -7,8 +7,7 @@ next run. Callers wrap deploy()/teardown() in try/finally (or a pytest finalizer
from __future__ import annotations
import contextlib
import fcntl
import glob
import datetime
import json
import os
import re
@ -18,7 +17,7 @@ import subprocess
import time
import urllib.request
from . import abra, lifetime
from . import abra
GATEWAY_IP = "143.244.213.108" # *.ci.commoninternet.net -> gateway (TLS passthrough to cc-ci)
# A run app domain is "<recipe[:4]>-<6hex>.ci.commoninternet.net" (see DECISIONS.md). Used by the
@ -30,68 +29,6 @@ class TeardownError(RuntimeError):
pass
# --- Concurrent-run safety (capacity=2) -------------------------------------------------------
# ONE mechanism, process-lifetime-scoped so SIGKILL can't leak a stale claim: every run holds an
# exclusive kernel flock on its app DOMAIN (/run/lock/cc-ci-app-<domain>.lock) for the whole run.
# A held lock implies a live owner — the kernel releases a flock when the holding process dies,
# however it dies. The janitor probes the lock (LOCK_NB) to tell a live concurrent run (held →
# leave it) from a crashed run's orphan (acquirable → reap it); it never inspects pids and never
# steals a held lock. Recipe-tree corruption between same-recipe runs is gone structurally (each
# run deploys from its own per-run ABRA_DIR — there is no shared recipe tree and no recipe lock),
# and same-domain runs (double-!testme of one PR) serialise on this app lock.
# See docs/concurrency.md.
# Acquired app-lock file objects are retained here for the REMAINING PROCESS LIFETIME: if the
# caller drops the returned file object, GC would close the fd and silently release the lock —
# this list is the lock's owner of record. Never cleared; release is process exit.
_held_app_locks: list = []
def _app_lock_dir() -> str:
"""The app-domain lockfile dir. /run/lock (tmpfs: a reboot clears locks AND lockfiles, so
post-reboot apps probe as orphans and are reaped immediately). Env-overridable so the
tests/concurrency suite (and its helper subprocesses) can use a sandbox dir."""
return os.environ.get("CCCI_APP_LOCK_DIR", "/run/lock")
def _app_lock_path(domain: str) -> str:
return os.path.join(_app_lock_dir(), f"cc-ci-app-{domain}.lock")
def acquire_app_lock(domain: str):
"""Take the per-app-domain exclusive lock; blocks (with a log line) if another run of the
same domain is in flight (double-!testme serialisation). Returns the open lock file, which is
ALSO retained in _held_app_locks so the flock lives exactly as long as the process.
Unlink/recreate race guard: the janitor unlinks a reaped orphan's lockfile while holding its
flock, so a waiter blocked on the OLD inode can win a lock no later opener can observe (a new
open() at the path creates a FRESH inode). After every acquisition, verify the locked fd is
still the file at the path (st_ino match); if not, drop it and retry on the live path."""
path = _app_lock_path(domain)
waited = False
while True:
# PEP 446: the fd is non-inheritable, so subprocess children never carry the lock.
f = open(path, "a") # noqa: SIM115 — deliberately held for the rest of the process
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
if not waited:
print(f"== app lock: another run of {domain} is in flight — waiting ==", flush=True)
waited = True
fcntl.flock(f, fcntl.LOCK_EX)
try:
if os.fstat(f.fileno()).st_ino == os.stat(path).st_ino:
break # we hold the lock on the inode the path names — done
except FileNotFoundError:
pass
f.close() # locked a stale (unlinked) inode — retry on the live path
os.utime(f.fileno()) # mtime = acquisition time = lock age (janitor's long-held flag)
_held_app_locks.append(f)
if waited:
print(f"== app lock: acquired {path} ==", flush=True)
return f
def _docker_names(kind: str, stack: str) -> list[str]:
"""docker <kind> ls names filtered to a stack (kind: service|volume|secret)."""
proc = subprocess.run(
@ -111,6 +48,31 @@ def _residual(domain: str) -> dict:
}
def _stack_age_seconds(stack: str) -> float | None:
"""Age of the stack's oldest service, or None if not present."""
svcs = _docker_names("service", stack)
if not svcs:
return None
oldest = None
for s in svcs:
p = subprocess.run(
["docker", "service", "inspect", s, "--format", "{{.CreatedAt}}"],
capture_output=True,
text=True,
)
ts = p.stdout.strip()
try:
# docker emits e.g. 2026-05-27 00:12:33.123 +0000 UTC -> take the leading 19 chars
dt = datetime.datetime.strptime(ts[:19], "%Y-%m-%d %H:%M:%S").replace(
tzinfo=datetime.UTC
)
except ValueError:
continue
age = (datetime.datetime.now(datetime.UTC) - dt).total_seconds()
oldest = age if oldest is None else max(oldest, age)
return oldest
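
The timestamp handling in `_stack_age_seconds` can be checked without docker by pulling the parsing into a pure function. A minimal sketch, assuming (as the comment above does) that the leading 19 characters of `CreatedAt` are a UTC wall-clock time; `created_age_seconds` is a hypothetical name.

```python
from __future__ import annotations

import datetime


def created_age_seconds(created_at: str, now: datetime.datetime) -> float | None:
    """Age in seconds of a docker-style CreatedAt string, or None if it cannot be parsed."""
    try:
        # Keep only "YYYY-MM-DD HH:MM:SS"; fractional seconds and the zone suffix are dropped.
        dt = datetime.datetime.strptime(created_at[:19], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    dt = dt.replace(tzinfo=datetime.timezone.utc)  # assumption: the timestamp is UTC
    return (now - dt).total_seconds()
```

Passing `now` in explicitly keeps the function deterministic, so a unit test does not depend on the clock.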
def _recipe_extra_env(recipe: str, domain: str) -> dict[str, str]:
"""Per-recipe extra .env keys, applied at every deploy (install + upgrade's old_app) so a recipe
with multi-domain / config needs is enrolled with NO shared-harness change (D5/M6.5). A recipe
@ -187,9 +149,9 @@ def prepull_images(recipe: str, domain: str) -> None:
app-INIT time (slow-init apps like collabora/immich still need their recipe healthcheck/READY_PROBE).
Best-effort on resolution failure (skip + let the deploy pull as usual); HARD-fails on a real
pull error (don't mask it)."""
recipe_dir = abra.recipe_dir(recipe) # per-run tree inside a CI run
# The app .env lives in the CANONICAL servers path (the per-run ABRA_DIR's servers/ is a
# symlink to it, so abra and this path agree on the same file).
import os
recipe_dir = os.path.expanduser(f"~/.abra/recipes/{recipe}")
env_path = os.path.expanduser(f"~/.abra/servers/default/{domain}.env")
if not os.path.isdir(recipe_dir) or not os.path.isfile(env_path):
print(f" prepull: recipe dir or .env missing for {recipe} — skipping", flush=True)
@ -199,8 +161,7 @@ def prepull_images(recipe: str, domain: str) -> None:
# --env-file supplies $VERSION-style interpolation so pinned tags resolve correctly.
cf = subprocess.run(
["bash", "-c", f'set -a; . "{env_path}"; printf "%s" "${{COMPOSE_FILE:-compose.yml}}"'],
capture_output=True,
text=True,
capture_output=True, text=True,
).stdout.strip()
files = [f for f in cf.split(":") if f] or ["compose.yml"]
args = ["docker", "compose", "--env-file", env_path]
@ -248,10 +209,6 @@ def deploy_app(
past the 900s default. abra's INTERNAL TIMEOUT (recipe's TIMEOUT env, default 300s) is set via
EXTRA_ENV; this is the Python subprocess wrapper's timeout so abra doesn't get SIGKILLed mid-deploy."""
_record_deploy()
# Lock BEFORE the app exists: a concurrent run's janitor must never see this app without a
# held app lock (it would probe it as an orphan and reap an in-flight deploy). Also the
# double-!testme serialisation point: a second run of the same domain blocks here.
acquire_app_lock(domain)
abra.app_config_remove(domain) # clear any stale .env from a prior crashed run
abra.app_new(recipe, domain, version=version, secrets=secrets)
# A pinned version must actually deploy that version: check the recipe out to the tag so the
@ -311,22 +268,18 @@ def _stack_name(domain: str) -> str:
def services_converged(domain: str) -> bool:
"""True when every service in the stack reports replicas N/N (N>0) AND no service is
mid-rolling-update (swarm UpdateStatus settled)."""
"""True when every service in the stack reports replicas N/N (N>0)."""
stack = _stack_name(domain)
proc = subprocess.run(
["docker", "stack", "services", stack, "--format", "{{.Name}} {{.Replicas}}"],
["docker", "stack", "services", stack, "--format", "{{.Replicas}}"],
capture_output=True,
text=True,
)
rows = [r for r in proc.stdout.split("\n") if r.strip()]
if not rows:
return False
names = []
for r in rows:
name, _, replicas = r.partition(" ")
names.append(name)
cur, _, want = replicas.partition("/")
cur, _, want = r.partition("/")
# A service at its DESIRED replica count is converged — including a `replicas: 0`
# on-demand one-shot (e.g. lasuite-drive's `minio-createbuckets`, which is scaled up
# manually only when buckets need (re)creating), which reports "0/0". The earlier
@ -335,34 +288,6 @@ def services_converged(domain: str) -> bool:
# still spinning up shows e.g. "0/1" (cur != want) and is correctly not-yet-converged.
if not want or cur != want:
return False
# N/N alone is NOT convergence during a stop-first rolling update: a chaos redeploy that changes
# a non-app service image (e.g. immich's db pin) registers the update immediately, but swarm may
# not have cycled that service's task yet — the OLD task still shows 1/1, then dies seconds later
# (immich CI 238: backupbot exec'd the db pre-hook into the just-killed container → 409). Require
# every service's UpdateStatus to be settled too, so the wait spans the whole rolling update.
proc = subprocess.run(
[
"docker",
"service",
"inspect",
*names,
"--format",
"{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}",
],
capture_output=True,
text=True,
)
if proc.returncode != 0:
return False # a service vanished mid-check — not settled
for state in proc.stdout.split("\n"):
# Only ACTIVE states block convergence. 'paused'/'rollback_paused' are terminal-without-
# intervention: swarm's default update-failure-action pauses the update on one task flicker
# and the flag then persists FOREVER (immich CI 241: app service 'paused' from a restart
# during restore, service back at 1/1 and healthy — the wait hung to its deadline). With
# N/N already required above, a paused update is settled for our purposes; the HTTP-health
# and tier assertions still gate whether the app actually works.
if state.strip() in ("updating", "rollback_started"):
return False
return True
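
The convergence rule in the variant of `services_converged` that also inspects `UpdateStatus` can be expressed as a pure function over the two docker outputs. This is a sketch under the assumption that rows look like `name cur/want` and that update states arrive one per line; `is_converged` is a hypothetical name.

```python
# Only these update states block convergence; paused states count as settled.
ACTIVE_UPDATE_STATES = ("updating", "rollback_started")


def is_converged(rows: list[str], update_states: list[str]) -> bool:
    """True when every row is at its desired replica count and no update is active."""
    if not rows:
        return False  # no services listed: the stack is not up
    for row in rows:
        _name, _, replicas = row.partition(" ")
        cur, _, want = replicas.partition("/")
        # "0/0" is a converged on-demand service; "0/1" is still starting.
        if not want or cur != want:
            return False
    return not any(s.strip() in ACTIVE_UPDATE_STATES for s in update_states)
```

Separating the parsing from the `subprocess.run` calls makes the `0/0`, `0/1` and `paused` cases straightforward to cover in unit tests.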
@ -490,9 +415,7 @@ def recipe_checkout_ref(recipe: str, ref: str) -> None:
abra.recipe_checkout(recipe, ref)
def chaos_redeploy(
domain: str, deploy_timeout: int = 900, no_converge_checks: bool = False
) -> None:
def chaos_redeploy(domain: str, deploy_timeout: int = 900, no_converge_checks: bool = False) -> None:
"""In-place `abra app deploy --chaos`: redeploy the running app at the CURRENT recipe checkout
(HC1: the PR-head code under test). This is the upgrade op, not a fresh install — it does NOT go
through deploy_app, so the deploy-count guard (DG4.1) is not incremented.
@ -575,16 +498,6 @@ def wait_ready_probes(meta: dict, domain: str, timeout: int = 600) -> None:
def backup_app(domain: str) -> str:
"""Create a backup; return the abra/restic output (carries the produced snapshot_id)."""
# Never back up a stack that is still converging/rolling-updating: backupbot resolves each
# service's hook container ONCE up front, so a task that cycles between that lookup and the
# pre-hook exec crashes the whole backup with a 409 (immich CI 238). Bounded wait — on timeout
# we still attempt the backup and let the tier's assertion deliver the verdict.
deadline = time.time() + 300
while time.time() < deadline and not services_converged(domain):
print(
f" backup: {domain} stack not settled yet — waiting before backup create", flush=True
)
time.sleep(5)
return abra.backup_create(domain)
@ -690,84 +603,17 @@ def teardown_app(domain: str, verify: bool = True) -> None:
residual = _residual(domain)
if any(residual.values()):
raise TeardownError(f"teardown left residual for {domain}: {residual}")
# No unregistration step: the app lock releases implicitly at process exit. The clean run's
# leftover lockfile (unheld) is unlinked on sight by the next janitor's stale-lockfile sweep.
# A lock held longer than 2x the 60-min hard deadline can only be a leaked run (the deadline
# bounds every healthy run). Flag it for a human — NEVER steal a held lock.
LONG_HELD_LOCK_SECONDS = 2 * lifetime.HARD_DEADLINE_SECONDS
def janitor(max_age_seconds: int | None = None) -> None:
"""Reap orphaned run apps from crashed/rebooted runs. Matches the real naming scheme and only
reaps apps older than max_age_seconds (so concurrent in-flight runs are never killed). Reaps via
docker primitives so it works even when the .env is gone (A2/A3). Default 2h, env-overridable
via CCCI_JANITOR_MAX_AGE (e.g. 0 to reap all matching orphans immediately)."""
import os
def _probe_and_reap(domain: str) -> None:
"""Probe one run app's lock; reap iff nobody holds it (kernel-guaranteed orphan).
Reaping happens WHILE HOLDING the probe lock, closing the janitor-vs-new-run race: a new run
of the same domain blocks in acquire_app_lock until the reap finishes, so a fresh app never
coexists with a half-reaped one. The lockfile is unlinked before release (still holding the
lock); a waiter that blocked on the unlinked inode re-checks identity and retries. Two racing
janitors arbitrate on the same flock: one reaps, the other sees 'held' and leaves —
teardown_app(verify=False) is idempotent either way."""
path = _app_lock_path(domain)
try:
# PEP 446: non-inheritable fd, same as acquire_app_lock.
f = open(path, "a") # noqa: SIM115 — closed in the finally below, lock released with it
except OSError as e:
print(f"!! janitor: cannot open lockfile {path} ({e}) — skipping {domain}", flush=True)
return
try:
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
# Held -> live run. Never steal; flag if it has been held implausibly long.
try:
held_for = time.time() - os.stat(path).st_mtime
except OSError:
held_for = 0
if held_for > LONG_HELD_LOCK_SECONDS:
print(
f"!! lock for {domain} held >{LONG_HELD_LOCK_SECONDS // 60}min — possible "
"leaked run; inspect with lslocks",
flush=True,
)
else:
print(
f" janitor: {domain} lock held — live concurrent run, leaving it", flush=True
)
return
# Acquired — but only the inode the PATH names counts (another janitor may have reaped
# and unlinked this inode while we raced; a lock on an unlinked inode protects nothing
# and unlinking the path now would delete a NEWER run's lockfile).
try:
if os.fstat(f.fileno()).st_ino != os.stat(path).st_ino:
return
except FileNotFoundError:
return
# Orphan: no live owner (the kernel released the lock when the owner died). Reap while
# holding the probe lock, then unlink the lockfile before releasing.
print(f" janitor: {domain} lock acquirable — orphan, reaping", flush=True)
with contextlib.suppress(Exception):
teardown_app(domain, verify=False)
with contextlib.suppress(OSError):
os.unlink(path)
finally:
f.close()
def janitor() -> None:
"""Reap orphaned run apps from crashed/rebooted runs; the kernel flock is the only liveness
oracle. For every candidate run app, probe its app-domain lock (LOCK_NB):
acquirable -> nobody holds it -> orphan -> reap under the probe lock + unlink lockfile
held -> live concurrent run -> leave it (warn if held >2x the hard deadline)
Candidate discovery is unchanged: `abra app ls` + a docker-service sweep (catches stacks
whose .env is already gone), both matched against RUN_APP_RE — warm/canonical apps never
match and are never probed. Post-reboot, /run/lock (tmpfs) is empty, so every surviving app
probes as an orphan and is reaped immediately (no age threshold). Stale lockfiles with no
app behind them are unlinked on sight. Degrades safely: an unreadable lockfile/dir is
skipped with a log line, never a crash. Reaps via docker primitives so it works even when
the .env is gone (A2/A3)."""
if max_age_seconds is None:
max_age_seconds = int(os.environ.get("CCCI_JANITOR_MAX_AGE", "7200"))
seen = set()
for app in abra.app_ls():
name = app.get("appName") or app.get("domain") or ""
@ -781,22 +627,9 @@ def janitor() -> None:
seen.add(f"{m.group(1)}.ci.commoninternet.net")
for name in seen:
_probe_and_reap(name)
# Tidy /run/lock: a clean run's leftover lockfile is unheld and appless — unlink it (under
# its own probe lock, with the same identity check as above).
with contextlib.suppress(OSError):
for path in glob.glob(os.path.join(_app_lock_dir(), "cc-ci-app-*.lock")):
domain = os.path.basename(path)[len("cc-ci-app-") : -len(".lock")]
if domain in seen:
continue # handled (or deliberately left) above
with contextlib.suppress(OSError):
f = open(path, "a") # noqa: SIM115 — closed below, lock released with it
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
if os.fstat(f.fileno()).st_ino == os.stat(path).st_ino:
os.unlink(path)
except (BlockingIOError, FileNotFoundError):
pass # held (live run pre-deploy) or already gone — leave it
finally:
f.close()
stack = _stack_name(name)
age = _stack_age_seconds(stack)
if age is not None and age < max_age_seconds:
continue # likely a concurrent in-flight run; leave it
with contextlib.suppress(Exception):
teardown_app(name, verify=False)
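
The lock-probing janitor relies on one primitive: a non-blocking `flock` attempt that distinguishes a held lock from a free one. A minimal sketch of that probe alone, assuming Linux `flock` semantics; `probe_lock` is a hypothetical name, and it does no reaping, inode check or unlinking.

```python
import fcntl


def probe_lock(path: str) -> str:
    """Return 'held' if another open file description holds the flock, else 'free'."""
    with open(path, "a") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "held"  # someone else owns it: treat as a live run
        return "free"  # we briefly held it; closing f releases it again
```

In the diff's version the reap happens while the probe lock is still held, so a real implementation would do its cleanup before leaving the `with` block rather than returning straight away.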


@ -1,95 +0,0 @@
"""Run-lifetime hardening (concurrency restructure P1).
The concurrency model's invariant chain is:
lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min hard deadline
Locks are kernel flocks released on process exit, so the only thing that needs managing is the
PROCESS lifetime. Three guards, installed at run startup (before any abra call) by
`install_lifetime_guards()`:
1. `PR_SET_PDEATHSIG(SIGTERM)`: if the parent (the drone step shell) dies — cancel, runner
crash, host shutdown of the step — the kernel delivers SIGTERM to the harness, so a dead
build can never leak a running harness that holds locks. Paired with a ppid==1 re-check
AFTER the prctl: a parent that died BEFORE the prctl took effect would never trigger the
death signal, so a harness that finds itself already reparented refuses to run.
2. SIGTERM handler: raise SystemExit so the run's `finally:` teardown funnel executes and the
process exits non-zero. Re-entrant deliveries during teardown are logged and IGNORED so a
second signal can't abort the cleanup the first one asked for (`begin_teardown()` guards
this; the run's own `finally:` blocks also call it so a signal landing mid-normal-teardown
can't abort that either).
3. `signal.alarm(3600)`: self-imposed hard deadline. SIGALRM funnels into the same teardown
path with a distinct log line. Teardown time after the deadline is not alarm-bounded —
interrupting a teardown buys nothing; the janitor (flock probe) is the backstop if a
teardown wedges and the process is killed harder.
"""
from __future__ import annotations
import ctypes
import os
import signal
import sys
HARD_DEADLINE_SECONDS = 60 * 60
_PR_SET_PDEATHSIG = 1 # linux/prctl.h
_state = {"tearing_down": False}
def begin_teardown() -> None:
"""Mark the teardown funnel as running. From here on SIGTERM/SIGALRM must NOT raise — it
would abort the very cleanup it asks for — so the handlers log and return instead. Called by
the handlers themselves before raising, and at the top of the run's `finally:` blocks."""
_state["tearing_down"] = True
def _funnel_handler(log_line: str, exit_code: int):
"""A signal handler that routes into the teardown funnel exactly once: log, then raise
SystemExit (propagates through the run's try/finally → teardown executes → non-zero exit).
While teardown is already running, further signals are logged and swallowed."""
def handler(signum: int, frame) -> None: # noqa: ARG001
print(log_line, flush=True)
if _state["tearing_down"]:
print(
f"== signal {signum} during teardown — ignored (teardown continues, "
"exit stays non-zero) ==",
flush=True,
)
return
begin_teardown()
raise SystemExit(exit_code)
return handler
def install_lifetime_guards(deadline_seconds: int = HARD_DEADLINE_SECONDS) -> None:
"""Install all three lifetime guards (see module docstring). Must run at harness startup,
before any abra call and before any lock is taken."""
libc = ctypes.CDLL("libc.so.6", use_errno=True)
if libc.prctl(_PR_SET_PDEATHSIG, signal.SIGTERM, 0, 0, 0) != 0:
err = ctypes.get_errno()
raise OSError(err, f"prctl(PR_SET_PDEATHSIG, SIGTERM) failed: {os.strerror(err)}")
# The prctl is armed now — but only fires for a parent death AFTER this point. If the parent
# already died, we are reparented (ppid 1) and would never get the signal: refuse to run, an
# orphaned harness would hold locks/apps with nothing managing its lifetime.
if os.getppid() == 1:
sys.exit("parent died before prctl(PR_SET_PDEATHSIG) — refusing to run orphaned")
signal.signal(
signal.SIGTERM,
_funnel_handler(
"== SIGTERM received (drone cancel / parent death) — tearing down ==",
128 + signal.SIGTERM,
),
)
minutes = deadline_seconds // 60
signal.signal(
signal.SIGALRM,
_funnel_handler(
f"== run exceeded {minutes}-minute hard deadline — tearing down ==",
128 + signal.SIGALRM,
),
)
signal.alarm(deadline_seconds)


@ -2,14 +2,7 @@
Turns a run's per-tier pytest outcomes into a single `results.json` artifact carrying, per the plan:
{ recipe, version, pr, ref, run_id, finished, stages:[{name,status,tests:[{name,status,ms}]}],
level, level_cap_reason, level_cap_rung, rungs,
skips:{intentional:{rung:reason}, unintentional:[rung]},
flags:{clean_teardown,no_secret_leak}, screenshot, summary_card }
`skips` splits the N/A (skipped) rungs by a simple rule: a skip is INTENTIONAL iff the recipe lists
it (with a reason) in `recipe_meta.EXPECTED_NA = {rung: reason}`; any rung skipped but not listed is
UNINTENTIONAL (a coverage gap to fill or declare). Skips still cap the level either way — the harness
never claims a rung it did not verify; this only labels *why* a skip happened.
level, level_cap_reason, rungs, flags:{clean_teardown,no_secret_leak}, screenshot, summary_card }
The per-test breakdown comes from JUnit XML emitted by each tier's pytest invocation (`--junitxml`),
parsed here with the stdlib (no new dep). The integer **level** is computed by harness.level from a
@ -134,24 +127,41 @@ def collect_stages(records: list[dict]) -> list[dict]:
return stages
def _has_repo_local(records: list[dict]) -> bool:
return any(r.get("source") == "repo-local" for r in records)
def _repo_local_passed(records: list[dict]) -> bool:
repo = [r for r in records if r.get("source") == "repo-local"]
return bool(repo) and all(r.get("rc", 1) == 0 for r in repo)
def derive_rungs(
results: dict[str, str],
*,
backup_capable: bool,
declared: list[str] | None,
deps_ready: bool,
sso_unverified: bool,
has_custom: bool,
has_repo_local: bool,
repo_local_passed: bool,
) -> dict[str, str]:
"""Translate the orchestrator's tier results into the rung-status dict harness.level consumes —
the FOUR essential rungs only. Conservative by design — never reports a rung 'pass' it can't
substantiate (cardinal guardrail: presentation never inflates).
"""Translate the orchestrator's tier results + deps/SSO signals into the rung-status dict
harness.level consumes. Documented in DECISIONS.md (Phase 3). Conservative by design — never
reports a rung 'pass' it can't substantiate (cardinal guardrail: presentation never inflates).
L1 install : install tier pass.
L2 upgrade : upgrade tier (skip → N/A: only one published version).
L3 backup/res : backup AND restore tiers pass (N/A if not backup-capable).
L4 functional : recipe-specific functional tests pass — the custom tier. N/A if none ran.
Integration (SSO/OIDC) and recipe-local are OPTIONAL and intentionally NOT rungs here — they
never cap the level (SSO is still enforced for the run VERDICT in run_recipe_ci.py).
L4 functional : the recipe-specific functional (non-deps) tests pass — the custom tier, minus
its SSO/integration tests. N/A if the recipe has no custom tests at all.
L5 integration: SSO/OIDC + cross-app. Applies ONLY if the recipe declares deps (else N/A — the
"no integration surface caps at L4" rule, §4.1). pass iff deps wired
(deps_ready) and not sso_unverified and the custom tier didn't fail.
L6 recipe-loc : the recipe repo's own tests/ (repo-local source) ran and passed (N/A if none).
"""
declared = declared or []
rungs: dict[str, str] = {}
rungs["install"] = level_mod.tier_to_rung(results.get("install"))
rungs["upgrade"] = level_mod.tier_to_rung(results.get("upgrade"))
@ -160,34 +170,36 @@ def derive_rungs(
)
custom = results.get("custom")
# Functional rung (L4): the non-deps custom tests.
if not has_custom or custom == "skip" or custom is None:
rungs["functional"] = "na"
elif custom == "fail":
# A custom test failed. With declared deps we cannot cheaply tell functional-vs-SSO apart, so
# conservatively fail the functional rung (caps at L3) — never inflate.
rungs["functional"] = "fail"
else: # custom == "pass"
rungs["functional"] = "pass"
# Integration rung (L5): only recipes with an SSO/integration surface (declared deps) can climb.
if not declared:
rungs["integration"] = "na"
elif sso_unverified or not deps_ready or custom == "fail":
# SSO not wired/verified, or a custom test failed → integration not verified.
rungs["integration"] = "fail"
elif custom == "pass":
rungs["integration"] = "pass"
else:
# declared deps but no custom tests ran — can't claim integration verified
rungs["integration"] = "na"
# Recipe-local rung (L6).
if not has_repo_local:
rungs["recipe_local"] = "na"
else:
rungs["recipe_local"] = "pass" if repo_local_passed else "fail"
return rungs
def skips(rungs: dict[str, str], expected_na: dict | None) -> dict:
"""Split the SKIPPED (N/A) rungs into intentional vs unintentional (operator model).
A recipe lists the rungs it intentionally skips, each with a reason, in
`recipe_meta.EXPECTED_NA = {rung: reason}`. The rule is dead simple: a skipped rung is
**intentional** iff it is in that list; any rung that is skipped and NOT in the list is
**unintentional** (a coverage gap someone should either fill or declare). N/A still caps the
level either way — the harness never claims a rung it did not verify — this only labels *why* a
skip happened. Returns:
{ "intentional": {rung: reason, ...}, # skipped AND declared in EXPECTED_NA
"unintentional": [rung, ...] } # skipped but NOT declared
"""
expected = {str(k): str(v) for k, v in (expected_na or {}).items()}
na = [r for r, st in rungs.items() if st == "na"]
intentional = {r: expected[r] for r in na if r in expected}
unintentional = sorted(r for r in na if r not in expected)
return {"intentional": intentional, "unintentional": unintentional}
def build_results(
*,
recipe: str,
@ -197,24 +209,30 @@ def build_results(
records: list[dict],
results: dict[str, str],
backup_capable: bool,
declared: list[str] | None,
deps_ready: bool,
sso_unverified: bool,
clean_teardown: bool,
no_secret_leak: bool,
finished_ts: float | None,
screenshot: str | None = None,
summary_card: str | None = None,
expected_na: dict | None = None,
) -> dict:
"""Assemble the full results.json dict (no I/O). `finished_ts` is passed in (the orchestrator
stamps it) so this stays pure and deterministic for unit tests. `expected_na` is the recipe's
declared intentional-skip map (recipe_meta.EXPECTED_NA) used to distinguish a deliberate skip from
accidentally-missing coverage."""
stamps it) so this stays pure and deterministic for unit tests."""
stages = collect_stages(records)
has_custom = any(r["tier"] == "custom" for r in records)
rungs = derive_rungs(results, backup_capable=backup_capable, has_custom=has_custom)
rungs = derive_rungs(
results,
backup_capable=backup_capable,
declared=declared,
deps_ready=deps_ready,
sso_unverified=sso_unverified,
has_custom=has_custom,
has_repo_local=_has_repo_local(records),
repo_local_passed=_repo_local_passed(records),
)
lvl, cap_reason = level_mod.compute_level(rungs)
# The rung that capped the climb (lowest non-pass), or None on a full climb — lets a consumer
# (card/badge) tell whether the cap was an intentional skip, an unintentional one, or a failure.
capped = level_mod.RUNGS[lvl] if cap_reason else None
return {
"schema": 1,
"run_id": run_id(),
@ -225,9 +243,7 @@ def build_results(
"finished": finished_ts,
"level": lvl,
"level_cap_reason": cap_reason,
"level_cap_rung": capped,
"rungs": rungs,
"skips": skips(rungs, expected_na),
"stages": stages,
"results": results,
"flags": {


@ -113,9 +113,7 @@ def _assert_undeployed(domain: str) -> None:
)
def snapshot(
recipe: str, domain: str, commit: str | None = None, version: str | None = None
) -> dict:
def snapshot(recipe: str, domain: str, commit: str | None = None, version: str | None = None) -> dict:
"""Take a last-known-good snapshot of every data volume of <domain>'s stack. The app MUST be
undeployed. Atomically replaces the prior last-good. Returns the written meta dict."""
_assert_undeployed(domain)
@ -171,9 +169,7 @@ def restore(recipe: str, domain: str) -> dict:
for vol in meta.get("volumes", []):
tar_path = os.path.join(volumes_dir(recipe), f"{vol}.tar")
if vol not in current:
raise SnapshotError(
f"snapshot volume {vol} absent from current stack {sorted(current)}"
)
raise SnapshotError(f"snapshot volume {vol} absent from current stack {sorted(current)}")
mp = _volume_mountpoint(vol)
# Clear the volume contents (incl. dotfiles) without removing the mountpoint itself.
r = _run(["sh", "-c", f'rm -rf -- "{mp}"/* "{mp}"/.[!.]* "{mp}"/..?* 2>/dev/null; true'])


@ -60,17 +60,14 @@ def sweep() -> int:
for r in recipes:
print(f"\n===== nightly: full-cold {r} (latest) =====", flush=True)
env = dict(os.environ, RECIPE=r)
env.pop("REF", None) # latest, not a PR head
env.pop("REF", None) # latest, not a PR head
env.pop("CCCI_QUICK", None)
env.pop("MODE", None)
rc = subprocess.run(
[sys.executable, os.path.join(_here(), "run_recipe_ci.py")], env=env
).returncode
results[r] = rc
print(
f"nightly: {r} rc={rc} ({'green→canonical refreshed' if rc == 0 else 'red'})",
flush=True,
)
print(f"nightly: {r} rc={rc} ({'green→canonical refreshed' if rc == 0 else 'red'})", flush=True)
# WC8 disk hygiene: drop warm data for de-enrolled canonicals; log the disk budget.
pruned = canonical.prune_stale()
if pruned:

View File

@ -44,26 +44,17 @@ sys.path.insert(0, os.path.join(ROOT, "runner"))
from harness import ( # noqa: E402
abra,
canonical,
card as card_mod,
deps as deps_mod,
discovery,
generic,
lifecycle,
lifetime,
naming,
results as results_mod,
screenshot as screenshot_mod,
warm,
warmsnap,
)
from harness import ( # noqa: E402
card as card_mod,
)
from harness import ( # noqa: E402
deps as deps_mod,
)
from harness import ( # noqa: E402
results as results_mod,
)
from harness import ( # noqa: E402
screenshot as screenshot_mod,
)
ALL_STAGES = ("install", "upgrade", "backup", "restore", "custom")
@ -138,73 +129,18 @@ def _gitea_token() -> str | None:
return tok or None
def _run_state_path(name: str) -> str:
"""Run-scoped state file in the tempdir, keyed by run id + harness pid — NEVER by app domain.
A second run of the SAME domain overlaps this process (its main() preamble executes before it
blocks at the app lock inside deploy_app), so domain-keyed files get reset/removed under the
live run: M2(c) double-!testme produced a false DG4.1 deploy-count=2 in run 1 and a countfile
FileNotFoundError crash in run 2. Children never re-derive these paths — they receive them
via the CCCI_*_FILE env vars, so the key only has to be unique per harness process."""
rid = results_mod.run_id()
return os.path.join(tempfile.gettempdir(), f"ccci-{name}-{rid}-{os.getpid()}")
def setup_run_abra_dir() -> str:
"""P3: build + export this run's PER-RUN ABRA_DIR — structural isolation of recipe trees.
`<runs_dir>/<run-id>/abra/` with:
servers/ -> symlink to the canonical ~/.abra/servers. App .env files land in the shared
canonical path, so janitor discovery (`abra app ls`) and env-based teardown
work unchanged from any process; per-domain filenames + the app-domain lock
prevent write conflicts.
catalogue/ -> symlink to the canonical ~/.abra/catalogue (read-mostly).
recipes/ fresh + empty — THE isolation that matters: each run clones and git-checkouts
its own recipe trees, so concurrent runs (same recipe included) can never
corrupt each other's deploy tree. Replaces the per-recipe flock.
Exported as $ABRA_DIR — honored by the abra CLI and by every harness path helper
(abra.abra_dir()) — BEFORE any abra call. Rides along the existing run-dir retention."""
canonical = os.path.expanduser("~/.abra")
rid = results_mod.run_id()
if rid == "manual":
rid = f"manual-{os.getpid()}" # two concurrent hand-runs must not share a tree
run_abra_dir = os.path.join(results_mod.runs_dir(), rid, "abra")
os.makedirs(os.path.join(run_abra_dir, "recipes"), exist_ok=True)
for shared in ("servers", "catalogue"):
link = os.path.join(run_abra_dir, shared)
if not os.path.islink(link):
os.symlink(os.path.join(canonical, shared), link)
os.environ["ABRA_DIR"] = run_abra_dir
print(
f"== per-run ABRA_DIR: {run_abra_dir} (servers/catalogue -> canonical; fresh recipes/) ==",
flush=True,
)
return run_abra_dir
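
The per-run directory layout described in `setup_run_abra_dir` can be tried against throwaway directories. A minimal sketch, assuming the canonical tree and the runs root are passed in explicitly; `build_run_dir` is a hypothetical name, and it does not export `ABRA_DIR` or touch `~/.abra`.

```python
import os


def build_run_dir(canonical: str, runs_root: str, run_id: str) -> str:
    """Create <runs_root>/<run_id>/abra with shared symlinks and a private recipes/."""
    run_dir = os.path.join(runs_root, run_id, "abra")
    # recipes/ is the private part: each run gets its own empty tree.
    os.makedirs(os.path.join(run_dir, "recipes"), exist_ok=True)
    for shared in ("servers", "catalogue"):
        link = os.path.join(run_dir, shared)
        if not os.path.islink(link):
            # Shared state stays in the canonical tree; the run only links to it.
            os.symlink(os.path.join(canonical, shared), link)
    return run_dir
```

The `islink` check makes the call idempotent, so rerunning it with the same run id does not fail on the existing links.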
def fetch_recipe(recipe: str, ref: str | None, src: str | None) -> None:
"""Make the recipe available at the code under test in THIS RUN's recipe tree
($ABRA_DIR/recipes/<recipe>): a plain clone — no locking needed, no rm-rf of any shared
state (the rm below only clears this run's own leftovers, e.g. a janitor-triggered
`abra app ls` auto-clone or a Drone build-number reuse). If SRC+REF point at the mirror PR,
"""Make the recipe available at the code under test. If SRC+REF point at the mirror PR,
clone it at that ref; otherwise fetch the catalogue copy. Private mirror repos need the bot
token — passed via a per-command http.extraHeader (not persisted in .git/config, not printed)."""
dest = abra.recipe_dir(recipe)
os.makedirs(os.path.dirname(dest), exist_ok=True)
# CCCI_SKIP_FETCH=1: use the locally STAGED recipe clone as-is (lets a test/Adversary stage a
# fake/broken ref — e.g. a simulated broken PR head for the --quick rollback proof — without it
# being clobbered by a re-fetch). Staging happens in the canonical ~/.abra/recipes/<recipe>;
# copy it into the per-run tree so the rest of the run reads the staged state. Never set in
# production CI.
recipes_dir = os.path.expanduser("~/.abra/recipes")
os.makedirs(recipes_dir, exist_ok=True)
dest = os.path.join(recipes_dir, recipe)
# CCCI_SKIP_FETCH=1: use the local recipe clone as-is (lets a test/Adversary stage a fake/broken
# ref — e.g. a simulated broken PR head for the --quick rollback proof — without it being clobbered
# by a re-fetch). Never set in production CI.
if os.environ.get("CCCI_SKIP_FETCH") == "1":
canonical = os.path.expanduser(f"~/.abra/recipes/{recipe}")
subprocess.run(["rm", "-rf", dest], check=False)
if os.path.isdir(canonical):
shutil.copytree(canonical, dest, symlinks=True)
print(
f"[fetch] CCCI_SKIP_FETCH=1 — using staged {recipe} clone as-is "
f"(copied {canonical} -> per-run tree)",
flush=True,
)
print(f"[fetch] CCCI_SKIP_FETCH=1 — using local {recipe} recipe clone as-is", flush=True)
return
if src and ref:
url = f"https://git.autonomic.zone/{src}.git"
@ -233,7 +169,7 @@ def fetch_recipe(recipe: str, ref: str | None, src: str | None) -> None:
def snapshot_recipe_tests(recipe: str) -> str | None:
"""Copy the recipe-shipped tests/ to a stable temp dir, immune to abra re-checking-out the
recipe to a version tag during the run. Returns the snapshot path, or None if no tests/."""
src = os.path.join(abra.recipe_dir(recipe), "tests")
src = os.path.expanduser(f"~/.abra/recipes/{recipe}/tests")
if not os.path.isdir(src):
return None
has_overlay = glob.glob(os.path.join(src, "test_*.py")) or os.path.isfile(
@ -264,7 +200,6 @@ def _load_meta(recipe: str) -> dict:
for k in list(meta) + [
"BACKUP_CAPABLE",
"SKIP_GENERIC",
"EXPECTED_NA",
"OIDC_AT_INSTALL",
"READY_PROBE",
"UPGRADE_BASE_VERSION",
@ -630,15 +565,15 @@ def run_quick(
flush=True,
)
statefile = _run_state_path("opstate") + ".json"
statefile = os.path.join(tempfile.gettempdir(), f"ccci-opstate-{domain}.json")
with open(statefile, "w") as f:
json.dump({}, f)
os.environ["CCCI_OP_STATE_FILE"] = statefile
depsfile = _run_state_path("deps") + ".json"
depsfile = os.path.join(tempfile.gettempdir(), f"ccci-deps-{domain}.json")
with open(depsfile, "w") as f:
json.dump({}, f)
os.environ["CCCI_DEPS_FILE"] = depsfile
skipfile = _run_state_path("depskip") + ".txt"
skipfile = os.path.join(tempfile.gettempdir(), f"ccci-depskip-{domain}.txt")
with contextlib.suppress(OSError):
os.remove(skipfile)
os.environ["CCCI_DEPS_SKIP_REPORT"] = skipfile
@ -714,8 +649,6 @@ def run_quick(
results["upgrade"] = "fail"
results["custom"] = "skip"
finally:
# Teardown funnel running: further SIGTERM/SIGALRM are logged + ignored (lifetime.py).
lifetime.begin_teardown()
# F2-11 skip count (read before deciding pass/fail)
requires_deps_skipped = 0
try:
@ -879,9 +812,6 @@ def promote_canonical(recipe: str, head_ref: str | None) -> None:
def main() -> int:
# P1 lock-lifetime hardening: PDEATHSIG + SIGTERM/SIGALRM teardown funnel + 60-min hard
# deadline, armed before ANY abra call or lock acquisition (see harness/lifetime.py).
lifetime.install_lifetime_guards()
recipe = os.environ.get("RECIPE")
if not recipe:
print("RECIPE env is required", file=sys.stderr)
@ -896,10 +826,6 @@ def main() -> int:
print(
f"== cc-ci run: recipe={recipe} ref={ref} pr={os.environ.get('PR', '0')} stages={sorted(stages)}"
)
# Concurrent-run safety is structural: this run's recipe trees live in its own ABRA_DIR
# (exported here, before ANY abra call), so no recipe-tree lock exists; same-DOMAIN runs
# serialise on the app-domain flock taken in deploy_app (see docs/concurrency.md).
setup_run_abra_dir()
fetch_recipe(recipe, ref, src)
# The PR-head commit the upgrade tier re-checks out for the chaos redeploy to the code under test
# (HC1). Prefer the explicit PR head sha ($REF) — robust + exact; fall back to the recipe checkout
@ -938,7 +864,7 @@ def main() -> int:
hook = discovery.install_steps(recipe, repo_local)
# Deploy-count guard (DG4.1): exactly one deploy_app() per run.
countfile = _run_state_path("deploys")
countfile = os.path.join(tempfile.gettempdir(), f"ccci-deploys-{domain}")
with open(countfile, "w") as f:
f.write("0")
os.environ["CCCI_DEPLOY_COUNT_FILE"] = countfile
@ -954,7 +880,7 @@ def main() -> int:
# Run-scoped op state (HC3): the orchestrator records op results (pre-upgrade identity, backup
# snapshot_id) here for the assertion tiers (generic + overlay) to read via generic.op_state().
statefile = _run_state_path("opstate") + ".json"
statefile = os.path.join(tempfile.gettempdir(), f"ccci-opstate-{domain}.json")
with open(statefile, "w") as f:
json.dump({}, f)
os.environ["CCCI_OP_STATE_FILE"] = statefile
@ -965,12 +891,12 @@ def main() -> int:
# cannot break the generic-tier signal. The `setup_custom_tests` step deploys each dep + runs
# `tests/<recipe>/setup_custom_tests.sh` to wire OIDC env via in-place redeploy.
# `$CCCI_DEPS_FILE` is written with the full creds dict the hook script needs (jq-readable).
depsfile = _run_state_path("deps") + ".json"
depsfile = os.path.join(tempfile.gettempdir(), f"ccci-deps-{domain}.json")
with open(depsfile, "w") as f:
json.dump({}, f)
os.environ["CCCI_DEPS_FILE"] = depsfile
# F2-11: conftest appends the count of requires_deps tests it skips (deps-not-ready) here.
skipfile = _run_state_path("depskip") + ".txt"
skipfile = os.path.join(tempfile.gettempdir(), f"ccci-depskip-{domain}.txt")
with contextlib.suppress(OSError):
os.remove(skipfile)
os.environ["CCCI_DEPS_SKIP_REPORT"] = skipfile
@ -1182,9 +1108,6 @@ def main() -> int:
if op in stages:
results[op] = "skip"
finally:
# From here the teardown funnel runs: a SIGTERM/SIGALRM landing now is logged + ignored
# (lifetime.py) so a second signal can't abort the cleanup the first one asked for.
lifetime.begin_teardown()
# Teardown the recipe under test FIRST, then deps in reverse declaration order.
# Parent verify=False (Phase 1d): keep as-is so a parent residual doesn't mask a tier
# failure. Dep teardown uses verify=True via teardown_deps (F2-5 fix); failures are
@ -1301,6 +1224,7 @@ def main() -> int:
# a failure here NEVER changes `overall` (R7 — cosmetics never block the pipeline). ----
data: dict | None = None
try:
sso_unverified = sso_dep_unverified(declared, deps_ready, requires_deps_skipped)
clean_teardown = (deploy_count == expected_deploy_count) and not dep_teardown_error
data = results_mod.build_results(
recipe=recipe,
@ -1310,11 +1234,13 @@ def main() -> int:
records=records,
results=results,
backup_capable=backup_cap,
declared=declared,
deps_ready=deps_ready,
sso_unverified=sso_unverified,
clean_teardown=clean_teardown,
no_secret_leak=True, # narrowed below by an actual scan of the serialised artifact
screenshot=screenshot_rel, # Phase 3 U1 (R4): relative PNG name iff capture succeeded
finished_ts=time.time(),
expected_na=meta.get("EXPECTED_NA"), # declared intentional-skip map (recipe_meta)
)
# Real (if narrow) leak check: no known infra-secret value may appear in the artifact (R7).
blob = json.dumps(data)
@ -1331,15 +1257,6 @@ def main() -> int:
f"{'' + data['level_cap_reason'] if data['level_cap_reason'] else ''})",
flush=True,
)
# Surface UNINTENTIONAL skips in the CI log (non-blocking, R7): a rung that was skipped (N/A)
# but is not in the recipe's intentional list — either add the missing coverage or declare it.
for rung in data.get("skips", {}).get("unintentional", []):
print(
f"⚠ coverage: rung '{rung}' was skipped (N/A) but is not declared intentional — add "
f"the missing test/label, or list it in tests/{recipe}/recipe_meta.py "
f"EXPECTED_NA = {{'{rung}': '<why>'}}.",
flush=True,
)
except Exception as e: # noqa: BLE001 — results assembly is cosmetic; never fail a run on it (R7)
print(
f"!! results.json assembly failed (non-fatal, verdict unaffected): {_scrub(str(e))}",
@ -1358,21 +1275,8 @@ def main() -> int:
with open(html_path, "w", encoding="utf-8") as f:
f.write(card_mod.render_card_html(data, screenshot_rel=data.get("screenshot")))
png = card_mod.render_card_png(html_path, os.path.join(run_artifact_dir, "summary.png"))
capped = data.get("level_cap_rung")
sk = data.get("skips", {})
cap_skip = (
"intentional"
if capped in (sk.get("intentional") or {})
else "unintentional"
if capped in (sk.get("unintentional") or [])
else ""
)
with open(os.path.join(run_artifact_dir, "badge.svg"), "w", encoding="utf-8") as f:
f.write(
card_mod.level_badge_svg(
data["level"], data.get("level_cap_reason", ""), cap_skip
)
)
f.write(card_mod.level_badge_svg(data["level"], data.get("level_cap_reason", "")))
print(
f"summary card {'rendered ' + png if png else '(PNG render unavailable)'} + "
f"badge.svg written into {run_artifact_dir}",


@ -43,16 +43,11 @@ def _traefik_setup(recipe: str, domain: str, version: str) -> None:
ssl_cert/ssl_key swarm secrets; NO ACME). Uses the proven abra.env_set (newline-safe, unlike the
bash set_env that bit keycloak)."""
cert_dir = "/var/lib/ci-certs/live"
if not (
os.path.isfile(f"{cert_dir}/fullchain.pem") and os.path.isfile(f"{cert_dir}/privkey.pem")
):
if not (os.path.isfile(f"{cert_dir}/fullchain.pem") and os.path.isfile(f"{cert_dir}/privkey.pem")):
raise RuntimeError(f"FATAL: wildcard cert missing at {cert_dir} (sops decrypt broken?)")
if not os.path.isfile(env_file(domain)):
_run(
["abra", "app", "new", recipe, "-s", "default", "-D", domain, version, "-o", "-n"],
timeout=120,
check=True,
)
_run(["abra", "app", "new", recipe, "-s", "default", "-D", domain, version, "-o", "-n"],
timeout=120, check=True)
abra.env_set(domain, "DOMAIN", domain)
abra.env_set(domain, "LETS_ENCRYPT_ENV", "")
abra.env_set(domain, "WILDCARDS_ENABLED", "1")
@ -66,39 +61,11 @@ def _traefik_setup(recipe: str, domain: str, version: str) -> None:
return any(s.endswith(f"_{name}_v1") for s in have)
if not _has("ssl_cert"):
_run(
[
"abra",
"app",
"secret",
"insert",
domain,
"ssl_cert",
"v1",
f"{cert_dir}/fullchain.pem",
"-f",
"-n",
],
timeout=120,
check=True,
)
_run(["abra", "app", "secret", "insert", domain, "ssl_cert", "v1",
f"{cert_dir}/fullchain.pem", "-f", "-n"], timeout=120, check=True)
if not _has("ssl_key"):
_run(
[
"abra",
"app",
"secret",
"insert",
domain,
"ssl_key",
"v1",
f"{cert_dir}/privkey.pem",
"-f",
"-n",
],
timeout=120,
check=True,
)
_run(["abra", "app", "secret", "insert", domain, "ssl_key", "v1",
f"{cert_dir}/privkey.pem", "-f", "-n"], timeout=120, check=True)
SPECS: dict[str, dict] = {
@ -199,13 +166,7 @@ def _run(cmd, timeout=120, check=False):
def _recipe_dir(recipe: str) -> str:
# Resolve like the abra CLI does: $ABRA_DIR (the per-run tree when imported by a CI run,
# e.g. promote_canonical) else the canonical ~/.abra (this module's own systemd-timer runs,
# which set no ABRA_DIR). Keeps fetch_recipe (an `abra` subprocess) and the git readers
# below pointed at the SAME tree in both contexts.
return os.path.join(
os.environ.get("ABRA_DIR") or os.path.expanduser("~/.abra"), "recipes", recipe
)
return os.path.expanduser(f"~/.abra/recipes/{recipe}")
def recipe_tags(recipe: str) -> list[str]:
@ -257,17 +218,8 @@ def health_code(spec: dict) -> int:
domain = spec.get("health_domain", spec["domain"])
r = _run(
[
"curl",
"-sk",
"-o",
"/dev/null",
"-w",
"%{http_code}",
"--max-time",
"10",
"--resolve",
f"{domain}:443:127.0.0.1",
f"https://{domain}{spec['health_path']}",
"curl", "-sk", "-o", "/dev/null", "-w", "%{http_code}", "--max-time", "10",
"--resolve", f"{domain}:443:127.0.0.1", f"https://{domain}{spec['health_path']}",
],
timeout=20,
)
@ -278,6 +230,7 @@ def health_code(spec: dict) -> int:
def wait_healthy(spec: dict, timeout: int | None = None) -> bool:
domain = spec["domain"]
deadline = time.time() + (timeout or spec["health_timeout"])
while time.time() < deadline:
if health_code(spec) in tuple(spec["health_ok"]):
@ -372,18 +325,15 @@ def ensure_server() -> None:
def ensure_app_config(recipe: str, domain: str, version: str) -> None:
if not os.path.isfile(env_file(domain)):
_run(
["abra", "app", "new", recipe, "-s", "default", "-D", domain, version, "-o", "-n"],
timeout=120,
check=True,
)
_run(["abra", "app", "new", recipe, "-s", "default", "-D", domain, version, "-o", "-n"],
timeout=120, check=True)
abra.env_set(domain, "DOMAIN", domain)
abra.env_set(domain, "LETS_ENCRYPT_ENV", "")
def ensure_secrets(domain: str) -> None:
stack = lifecycle._stack_name(domain) # noqa: SLF001
have = set(lifecycle._docker_names("secret", stack)) # noqa: SLF001
have = {n for n in lifecycle._docker_names("secret", stack)} # noqa: SLF001
if not any(n.endswith("_admin_password_v1") for n in have):
abra.secret_generate(domain)
@ -443,9 +393,8 @@ def reconcile(app: str) -> str:
write_alert(app, "held-major", current=current, latest=latest, release_notes=notes[:4000])
return f"held-major:{current}->{latest}"
if notes_flag_manual_migration(notes):
write_alert(
app, "held-manual-migration", current=current, latest=latest, release_notes=notes[:4000]
)
write_alert(app, "held-manual-migration", current=current, latest=latest,
release_notes=notes[:4000])
return f"held-manual-migration:{current}->{latest}"
# WC1.1 health-gated upgrade with rollback.
@ -479,14 +428,8 @@ def reconcile(app: str) -> str:
warmsnap.restore(recipe, domain)
deploy_version(recipe, domain, last_good, dt)
recovered = wait_healthy(spec)
write_alert(
app,
"rollback",
last_good=last_good,
attempted=latest,
recovered=recovered,
release_notes=notes[:2000],
)
write_alert(app, "rollback", last_good=last_good, attempted=latest, recovered=recovered,
release_notes=notes[:2000])
if not recovered:
raise RuntimeError(f"{app} rollback to {last_good} did not become healthy")
return f"rolled-back:{latest}->{last_good}"


@ -15,8 +15,7 @@ import shlex
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import http as harness_http # noqa: E402
from harness import lifecycle
from harness import http as harness_http, lifecycle # noqa: E402
PDS_HOST_LOCAL = "http://localhost:3000"
_PW = "ccci-P4-marker-pw-2026"


@ -27,7 +27,6 @@ CRUD). A wedged PDS subsystem fails AT its layer.
from __future__ import annotations
import contextlib
import os
import re
import secrets
@ -36,8 +35,7 @@ import sys
import uuid
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", "runner"))
from harness import http as harness_http # noqa: E402
from harness import lifecycle
from harness import http as harness_http, lifecycle # noqa: E402
PDS_HOST_LOCAL = "http://localhost:3000"
@ -60,18 +58,14 @@ def _goat_admin(domain: str, args: str) -> str:
return _in_container(domain, cmd)
def _xrpc_post(
domain: str, nsid: str, data: dict, token: str | None = None
) -> tuple[int, dict | None]:
def _xrpc_post(domain: str, nsid: str, data: dict, token: str | None = None) -> tuple[int, dict | None]:
headers = {}
if token:
headers["Authorization"] = f"Bearer {token}"
return harness_http.http_post(f"https://{domain}/xrpc/{nsid}", data=data, headers=headers)
def _xrpc_get(
domain: str, nsid: str, query: str, token: str | None = None
) -> tuple[int, dict | None]:
def _xrpc_get(domain: str, nsid: str, query: str, token: str | None = None) -> tuple[int, dict | None]:
headers = {}
if token:
headers["Authorization"] = f"Bearer {token}"
@ -88,9 +82,9 @@ def test_account_lifecycle_and_post_roundtrip(live_app):
# Step 1: PDS describe via goat — recipe self-identifies as did:web:<domain>
out = _in_container(domain, f"goat pds describe {PDS_HOST_LOCAL} 2>&1")
assert (
f"did:web:{domain}" in out
), f"goat pds describe did not contain expected DID 'did:web:{domain}'. Output:\n{out[:500]!r}"
assert f"did:web:{domain}" in out, (
f"goat pds describe did not contain expected DID 'did:web:{domain}'. Output:\n{out[:500]!r}"
)
# Step 2: Create account (UUID-suffixed handle = no run-to-run collision)
out = _goat_admin(
@ -133,9 +127,9 @@ def test_account_lifecycle_and_post_roundtrip(live_app):
assert s == 200, f"createRecord HTTP {s}: {body!r}"
record_uri = (body or {}).get("uri", "")
# URI format: at://<did>/app.bsky.feed.post/<rkey>
assert record_uri.startswith(
f"at://{new_did}/app.bsky.feed.post/"
), f"unexpected record uri: {record_uri!r}"
assert record_uri.startswith(f"at://{new_did}/app.bsky.feed.post/"), (
f"unexpected record uri: {record_uri!r}"
)
rkey = record_uri.rsplit("/", 1)[-1]
assert rkey, f"no rkey in uri: {record_uri!r}"
@ -148,13 +142,15 @@ def test_account_lifecycle_and_post_roundtrip(live_app):
)
assert s == 200, f"getRecord HTTP {s}: {body!r}"
record_value = (body or {}).get("value", {})
assert (
record_value.get("text") == marker
), f"post text did not round-trip: created={marker!r}, fetched={record_value.get('text')!r}"
assert record_value.get("text") == marker, (
f"post text did not round-trip: created={marker!r}, fetched={record_value.get('text')!r}"
)
assert record_value.get("$type") == "app.bsky.feed.post"
finally:
# Step 6: Best-effort cleanup. (The per-run domain teardown will discard the volume
# too, but we exercise the delete-account path because it's part of §4.3.)
if cleanup_did:
with contextlib.suppress(Exception):
try:
_goat_admin(domain, f"account delete {cleanup_did}")
except Exception: # noqa: BLE001
pass


@ -26,6 +26,6 @@ def test_describe_server_returns_atproto_envelope(live_app):
# At least one of these atproto-spec fields must be present
expected_any = ("availableUserDomains", "inviteCodeRequired", "links", "did")
present = [k for k in expected_any if k in body]
assert (
present
), f"describe-server missing all of {expected_any}; got keys: {sorted(body.keys())[:20]}"
assert present, (
f"describe-server missing all of {expected_any}; got keys: {sorted(body.keys())[:20]}"
)


@ -17,6 +17,6 @@ def test_pds_health_returns_version(live_app):
url = f"https://{live_app}/xrpc/_health"
status, body = harness_http.retry_http_get(url, expect_status=200, max_wait=60, interval=3)
assert status == 200, f"GET {url} HTTP {status} (expected 200)"
assert (
isinstance(body, dict) and isinstance(body.get("version"), str) and body["version"]
), f"GET {url} response is not the expected health envelope: {body!r}"
assert isinstance(body, dict) and isinstance(body.get("version"), str) and body["version"], (
f"GET {url} response is not the expected health envelope: {body!r}"
)


@ -30,6 +30,6 @@ def test_get_session_requires_auth(live_app):
f"body: {body!r}"
)
# The XRPC error envelope is JSON with an `error` field per the atproto spec.
assert isinstance(body, dict) and body.get(
"error"
), f"expected XRPC JSON error envelope; got: {body!r}"
assert isinstance(body, dict) and body.get("error"), (
f"expected XRPC JSON error envelope; got: {body!r}"
)


@ -22,12 +22,12 @@ echo " bluesky-pds install_steps: generating secp256k1 PLC rotation key..."
# same shape the PDS expects (32-byte hex). Equivalent for atproto PDS bootstrap.
KEY_HEX=$(cc-ci-run -c 'import secrets; print(secrets.token_bytes(32).hex())')
if [ -z "${KEY_HEX}" ] || [ "${#KEY_HEX}" != "64" ]; then
echo " install_steps: failed to generate PLC rotation key (KEY_HEX length=${#KEY_HEX})" >&2
exit 1
echo " install_steps: failed to generate PLC rotation key (KEY_HEX length=${#KEY_HEX})" >&2
exit 1
fi
# Insert via abra under TTY-wrap (`abra app secret insert` requires a TTY on this version).
# We DON'T log the key value — abra also doesn't print it.
script -qec "abra app secret insert ${CCCI_APP_DOMAIN} pds_plc_rotation_key v1 ${KEY_HEX} --no-input" /dev/null \
>/dev/null 2>&1
>/dev/null 2>&1
echo " bluesky-pds install_steps: PLC rotation key inserted (v1)."


@ -11,6 +11,6 @@ import _p4 # noqa: E402
def test_restore_returns_state(live_app):
assert _p4.account_exists(
live_app
), "restore did not bring back the seeded marker account (PDS data did not survive restore)"
assert _p4.account_exists(live_app), (
"restore did not bring back the seeded marker account (PDS data did not survive restore)"
)


@ -1,108 +0,0 @@
"""Shared utilities for the real-kernel concurrency suite (imported by the test modules; the
fixtures in conftest.py wrap these). No flock mocking anywhere — probes use real LOCK_NB."""
from __future__ import annotations
import contextlib
import fcntl
import os
import signal
import subprocess
import sys
import time
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import lifecycle # noqa: E402
HELPERS = os.path.join(os.path.dirname(__file__), "helpers.py")
DOMAIN = "test-abc123.ci.commoninternet.net" # matches RUN_APP_RE
class HelperPool:
"""Spawns helpers.py subprocesses and GUARANTEES their cleanup (incl. recorded grandchild
pids from `hold-with-child`/`wrapper` markers) — no leaked children in the test VM."""
def __init__(self, out_dir: str):
self.out_dir = out_dir
self.procs: list[subprocess.Popen] = []
self.extra_pids: list[int] = []
self._n = 0
def spawn(self, *args: str, env_extra: dict | None = None) -> tuple[subprocess.Popen, str]:
"""Start `helpers.py <args...>`; returns (proc, marker_file)."""
self._n += 1
out = os.path.join(self.out_dir, f"helper-{self._n}.out")
env = dict(os.environ, CCCI_HELPER_OUT=out, **(env_extra or {}))
p = subprocess.Popen( # noqa: S603
[sys.executable, HELPERS, *args],
env=env,
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT,
)
self.procs.append(p)
return p, out
def track_pid(self, pid: int) -> None:
self.extra_pids.append(pid)
def cleanup(self) -> None:
for p in self.procs:
if p.poll() is None:
p.kill()
with contextlib.suppress(subprocess.TimeoutExpired):
p.wait(timeout=10)
for pid in self.extra_pids:
with contextlib.suppress(OSError):
os.kill(pid, signal.SIGKILL)
def wait_marker(out: str, token: str, timeout: float = 15.0) -> str | None:
"""Poll a helper's marker file for a line containing `token`; returns the line or None."""
deadline = time.time() + timeout
while time.time() < deadline:
try:
with open(out) as f:
for line in f:
if token in line:
return line.strip()
except OSError:
pass
time.sleep(0.1)
return None
def lock_state(domain: str) -> str:
"""'held' | 'free' | 'absent' for the domain's lockfile, probed with a REAL LOCK_NB."""
path = lifecycle._app_lock_path(domain) # noqa: SLF001
if not os.path.exists(path):
return "absent"
with open(path, "a") as f:
try:
fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
return "free"
except BlockingIOError:
return "held"
def wait_lock_state(domain: str, want: str, timeout: float = 10.0) -> str:
"""Poll until lock_state(domain) == want (kernel release on process death is fast, but give
the scheduler room). Returns the final observed state."""
deadline = time.time() + timeout
state = lock_state(domain)
while state != want and time.time() < deadline:
time.sleep(0.1)
state = lock_state(domain)
return state
def pid_alive(pid: int) -> bool:
return os.path.exists(f"/proc/{pid}")
def wait_pid_gone(pid: int, timeout: float = 15.0) -> bool:
deadline = time.time() + timeout
while time.time() < deadline:
if not pid_alive(pid):
return True
time.sleep(0.1)
return False
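
The probe in `lock_state` relies on two kernel properties: a non-blocking `flock` fails immediately when another process holds the lock, and the lock is released as soon as the holder dies. A minimal, self-contained sketch of that behaviour follows. It assumes Linux/POSIX `flock` semantics; `probe`, `HOLDER` and `demo` are illustrative names, not part of the harness.

```python
import fcntl
import os
import subprocess
import sys
import tempfile


def probe(path: str) -> str:
    """Return 'absent', 'free' or 'held' using a real non-blocking flock."""
    if not os.path.exists(path):
        return "absent"
    with open(path, "a") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return "free"  # our probe lock is dropped when f is closed
        except BlockingIOError:
            return "held"


# Illustrative holder process: takes the lock, reports, then sleeps.
HOLDER = (
    "import fcntl, sys, time\n"
    "f = open(sys.argv[1], 'a')\n"
    "fcntl.flock(f, fcntl.LOCK_EX)\n"
    "print('ACQUIRED', flush=True)\n"
    "time.sleep(60)\n"
)


def demo() -> list[str]:
    states = []
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "app.lock")
        states.append(probe(path))  # no lockfile yet
        holder = subprocess.Popen(
            [sys.executable, "-c", HOLDER, path],
            stdout=subprocess.PIPE,
            text=True,
        )
        holder.stdout.readline()  # block until the holder reports ACQUIRED
        states.append(probe(path))  # live owner
        holder.kill()  # SIGKILL: no cleanup code runs in the holder
        holder.wait(timeout=10)
        holder.stdout.close()
        states.append(probe(path))  # kernel released the lock on death
    return states


if __name__ == "__main__":
    print(demo())
```

The lockfile itself survives the holder's death; only the lock is released, which is why a janitor built on this probe can tell an orphan (file present, lock free) from a live run (lock held).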


@ -1,34 +0,0 @@
"""Fixtures for the real-kernel concurrency suite (concurrency-restructure plan, 19 cases).
NOT part of the default `pytest tests/unit` gate — run explicitly with `pytest tests/concurrency
-q` (docs/concurrency.md). Locks live in a per-test tmp dir (CCCI_APP_LOCK_DIR); helper
subprocesses hold REAL flocks / install the REAL prctl+signal guards and are always reaped in
fixture finalizers (no leaked children in the test VM).
"""
from __future__ import annotations
import os
import sys
import pytest
sys.path.insert(0, os.path.dirname(__file__))
from concutil import HelperPool # noqa: E402
@pytest.fixture
def lock_dir(tmp_path, monkeypatch):
"""Sandbox lock dir, exported so BOTH this process's lifecycle calls and helper subprocesses
(which inherit os.environ) resolve their lockfiles here — never /run/lock."""
d = tmp_path / "locks"
d.mkdir()
monkeypatch.setenv("CCCI_APP_LOCK_DIR", str(d))
return str(d)
@pytest.fixture
def pool(tmp_path):
hp = HelperPool(str(tmp_path))
yield hp
hp.cleanup()


@ -1,149 +0,0 @@
#!/usr/bin/env python3
"""Subprocess helpers for tests/concurrency — REAL kernel locks and the REAL lifetime guards in
separate processes (flock/prctl are never mocked; tests assert on actual kernel behavior).
Invoked as: python3 helpers.py <command> <args...>
Env contract (set by the spawning test):
CCCI_APP_LOCK_DIR sandbox lock dir (never /run/lock in tests)
CCCI_HELPER_OUT marker file this helper APPENDS progress lines to (ACQUIRED/READY/...)
Commands:
hold <domain> acquire the app lock, mark `ACQUIRED <ts>`, sleep forever
hold-with-child <domain> acquire the lock, spawn a plain sleeping subprocess child, mark
`ACQUIRED <ts>` + `CHILD <pid>` (PEP 446: the child must NOT
inherit the lock fd), sleep forever
guarded <domain> <deadline> install the REAL lifetime guards (alarm=<deadline>s), acquire the
lock, mark `READY`; when the teardown funnel runs (`finally:`),
mark `TEARDOWN` before exiting
wrapper <domain> spawn `guarded <domain> 3600` as MY child, mark `WRAPPED <pid>`,
sleep — the test kills me to prove PDEATHSIG TERMs the child
orphan-probe wait (bounded) until reparented (ppid==1), then install the
guards; mark `REFUSED` if they exit (expected) or `GUARDS_OK`
fetch-checkout <recipe> <ref> run run_recipe_ci.fetch_recipe (the test sets CCCI_SKIP_FETCH=1
+ a per-"run" ABRA_DIR), git-checkout <ref>, mark
`RESULT <head> <data.txt content>`
"""
from __future__ import annotations
import os
import subprocess
import sys
import time
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "..", "runner"))
from harness import abra, lifecycle, lifetime # noqa: E402
OUT = os.environ.get("CCCI_HELPER_OUT")
def mark(line: str) -> None:
if OUT:
with open(OUT, "a") as f:
f.write(line + "\n")
f.flush()
print(line, flush=True)
def cmd_hold(domain: str) -> None:
lifecycle.acquire_app_lock(domain)
mark(f"ACQUIRED {time.time()}")
time.sleep(3600)
def cmd_hold_with_child(domain: str) -> None:
lifecycle.acquire_app_lock(domain)
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(3600)"])
mark(f"ACQUIRED {time.time()}")
mark(f"CHILD {child.pid}")
time.sleep(3600)
def cmd_guarded(domain: str, deadline: str) -> None:
lifetime.install_lifetime_guards(deadline_seconds=int(deadline))
lifecycle.acquire_app_lock(domain)
mark("READY")
try:
time.sleep(3600)
finally:
mark("TEARDOWN")
def cmd_wrapper(domain: str) -> None:
p = subprocess.Popen( # noqa: S603
[sys.executable, os.path.abspath(__file__), "guarded", domain, "3600"],
env=os.environ.copy(),
)
mark(f"WRAPPED {p.pid}")
time.sleep(3600)
def cmd_orphan_probe() -> None:
# Our spawner exits immediately after fork; wait (bounded) until we are reparented so the
# prctl is installed with the parent ALREADY dead — the exact race the ppid check closes.
for _ in range(200):
if os.getppid() == 1:
break
time.sleep(0.05)
else:
mark("NEVER_REPARENTED") # e.g. a subreaper environment — test will fail visibly
return
try:
lifetime.install_lifetime_guards()
except SystemExit:
mark("REFUSED")
raise
mark("GUARDS_OK")
def cmd_fetch_checkout(recipe: str, ref: str) -> None:
import run_recipe_ci
run_recipe_ci.fetch_recipe(recipe, None, None)
abra.recipe_checkout(recipe, ref)
head = abra.recipe_head_commit(recipe)
with open(os.path.join(abra.recipe_dir(recipe), "data.txt")) as f:
content = f.read().strip()
mark(f"RESULT {head} {content}")
def cmd_deploy_count_run(domain: str, gate: str) -> None:
"""Mirror the REAL run flow for the DG4.1 counter (CONC-A1 regression): countfile init
(main() preamble) → _record_deploy (deploy_app fires it BEFORE the app lock) → acquire
the app lock → wait for `gate` (file path; '' = no wait) → read + remove own countfile.
Two of these on the SAME domain must each see COUNT 1 and never lose their file."""
import run_recipe_ci
countfile = run_recipe_ci._run_state_path("deploys")
with open(countfile, "w") as f:
f.write("0")
os.environ["CCCI_DEPLOY_COUNT_FILE"] = countfile
lifecycle._record_deploy() # pre-lock, exactly like lifecycle.deploy_app()
mark("PRELOCK")
lifecycle.acquire_app_lock(domain)
mark("ACQUIRED")
if gate:
deadline = time.time() + 15
while not os.path.exists(gate) and time.time() < deadline:
time.sleep(0.05)
try:
with open(countfile) as f:
n = int(f.read().strip() or "0")
os.remove(countfile)
mark(f"COUNT {n}")
except FileNotFoundError:
mark("COUNT_FILE_MISSING")
if __name__ == "__main__":
cmd, *args = sys.argv[1:]
{
"hold": cmd_hold,
"hold-with-child": cmd_hold_with_child,
"guarded": cmd_guarded,
"wrapper": cmd_wrapper,
"orphan-probe": cmd_orphan_probe,
"fetch-checkout": cmd_fetch_checkout,
"deploy-count-run": cmd_deploy_count_run,
}[cmd](*args)
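
The `guarded` helper depends on a teardown funnel: the first SIGTERM/SIGALRM aborts the run body so that `finally:` executes, and any signal arriving once teardown has begun is logged and ignored. The source of `harness/lifetime.py` is not shown in this diff, so the sketch below only approximates that behaviour. `install_guards`, `begin_teardown` and `demo` are hypothetical names, the PDEATHSIG and hard-deadline parts are omitted, and the code must run in the main thread.

```python
import os
import signal
import time

_in_teardown = False  # hypothetical module state, flipped by begin_teardown()


def _on_signal(signum, frame):
    """First signal aborts the run body; later ones are logged and ignored."""
    if _in_teardown:
        print(f"signal {signum} ignored: teardown already running", flush=True)
        return
    raise SystemExit(128 + signum)


def install_guards() -> None:
    for sig in (signal.SIGTERM, signal.SIGALRM):
        signal.signal(sig, _on_signal)


def begin_teardown() -> None:
    global _in_teardown
    _in_teardown = True


def demo() -> list[str]:
    global _in_teardown
    events = []
    watched = (signal.SIGTERM, signal.SIGALRM)
    previous = {s: signal.getsignal(s) for s in watched}
    _in_teardown = False
    install_guards()
    try:
        try:
            events.append("READY")
            os.kill(os.getpid(), signal.SIGTERM)  # first signal: abort the body
            time.sleep(5)  # interrupted if the handler has not fired yet
            events.append("NOT_INTERRUPTED")
        finally:
            begin_teardown()
            os.kill(os.getpid(), signal.SIGTERM)  # second signal: must not abort cleanup
            time.sleep(0.1)  # give the handler a chance to run
            events.append("TEARDOWN")
    except SystemExit as e:
        events.append(f"EXIT {e.code}")
    finally:
        # Restore whatever handlers were installed before the demo.
        for s, h in previous.items():
            signal.signal(s, h if h is not None else signal.SIG_DFL)
    return events


if __name__ == "__main__":
    print(demo())
```

Raising `SystemExit` from the handler, rather than calling `os._exit`, is what lets ordinary `try`/`finally` blocks act as the cleanup path.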


@ -1,175 +0,0 @@
"""Per-run ABRA_DIR isolation (concurrency-restructure plan, cases 17-19). Real directories,
real symlinks, real git — abra itself is replaced by a recording stub where a CLI call is
involved (case 17), because these cases test OUR dir/env plumbing, not abra."""
from __future__ import annotations
import os
import stat
import subprocess
import sys
sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
import run_recipe_ci # noqa: E402
from concutil import wait_marker # noqa: E402
from harness import abra # noqa: E402
RECIPE = "fakerecipe"
def _git(cwd, *args):
subprocess.run(
["git", "-c", "user.email=t@t", "-c", "user.name=t", *args],
cwd=cwd,
check=True,
capture_output=True,
)
def _make_fake_home(tmp_path):
"""A fake $HOME with a canonical ~/.abra: servers/default + catalogue dirs, and a recipe git
repo with two tags whose data.txt differs (v1 -> 'one', v2 -> 'two', HEAD at v2)."""
home = tmp_path / "home"
(home / ".abra" / "servers" / "default").mkdir(parents=True)
(home / ".abra" / "catalogue").mkdir(parents=True)
repo = home / ".abra" / "recipes" / RECIPE
repo.mkdir(parents=True)
_git(repo, "init", "-q")
(repo / "data.txt").write_text("one\n")
_git(repo, "add", "data.txt")
_git(repo, "commit", "-qm", "v1")
_git(repo, "tag", "v1")
(repo / "data.txt").write_text("two\n")
_git(repo, "add", "data.txt")
_git(repo, "commit", "-qm", "v2")
_git(repo, "tag", "v2")
return home
def test_17_per_run_dir_built_and_exported_before_abra(tmp_path, monkeypatch):
"""Case 17: setup_run_abra_dir builds the per-run dir correctly (servers/catalogue symlinks
resolve to the canonical tree, recipes/ empty + writable) and $ABRA_DIR is exported before
the first abra call — proven by a stub `abra` on PATH that records the env it saw."""
home = _make_fake_home(tmp_path)
monkeypatch.setenv("HOME", str(home))
monkeypatch.setenv("CCCI_RUNS_DIR", str(tmp_path / "runs"))
monkeypatch.setenv("DRONE_BUILD_NUMBER", "777")
monkeypatch.setenv("ABRA_DIR", "sentinel-to-be-overwritten") # so monkeypatch restores it
d = run_recipe_ci.setup_run_abra_dir()
assert d == str(tmp_path / "runs" / "777" / "abra")
assert os.environ["ABRA_DIR"] == d
assert os.readlink(os.path.join(d, "servers")) == str(home / ".abra" / "servers")
assert os.readlink(os.path.join(d, "catalogue")) == str(home / ".abra" / "catalogue")
# symlinks RESOLVE (targets exist) and recipes/ is empty + writable
assert os.path.isdir(os.path.join(d, "servers", "default"))
assert os.path.isdir(os.path.join(d, "catalogue"))
assert os.listdir(os.path.join(d, "recipes")) == []
probe = os.path.join(d, "recipes", ".write-probe")
open(probe, "w").close()
os.remove(probe)
# idempotent re-entry (Drone build-number retry): must not raise on existing symlinks
assert run_recipe_ci.setup_run_abra_dir() == d
# stub abra records $ABRA_DIR at call time; fetch_recipe's catalogue branch invokes it
stub_dir = tmp_path / "bin"
stub_dir.mkdir()
log = tmp_path / "abra-env.log"
stub = stub_dir / "abra"
stub.write_text(f'#!/bin/sh\necho "$ABRA_DIR" >> {log}\nexit 0\n')
stub.chmod(stub.stat().st_mode | stat.S_IEXEC)
monkeypatch.setenv("PATH", f"{stub_dir}{os.pathsep}{os.environ['PATH']}")
monkeypatch.delenv("CCCI_SKIP_FETCH", raising=False)
run_recipe_ci.fetch_recipe(RECIPE, None, None)
assert log.read_text().strip() == d, "abra was called without the per-run ABRA_DIR exported"
def test_18_concurrent_same_recipe_fetch_no_cross_talk(tmp_path, monkeypatch, pool):
"""Case 18: two CONCURRENT fetch+checkout flows of the SAME recipe into different ABRA_DIRs
produce two correct, divergent trees (v1 vs v2) — the old shared-tree corruption scenario,
now structurally safe with no lock. The canonical staged clone is untouched."""
home = _make_fake_home(tmp_path)
canonical_repo = home / ".abra" / "recipes" / RECIPE
head_before = subprocess.run(
["git", "-C", canonical_repo, "rev-parse", "HEAD"], capture_output=True, text=True
).stdout.strip()
runs = {}
for name, ref in (("runA", "v1"), ("runB", "v2")):
abra_dir = tmp_path / name / "abra"
abra_dir.mkdir(parents=True)
_, out = pool.spawn(
"fetch-checkout",
RECIPE,
ref,
env_extra={
"HOME": str(home),
"ABRA_DIR": str(abra_dir),
"CCCI_SKIP_FETCH": "1",
},
)
runs[name] = (out, ref, abra_dir)
expect = {"v1": "one", "v2": "two"}
for name, (out, ref, abra_dir) in runs.items():
line = wait_marker(out, "RESULT", timeout=30)
assert line, f"{name} never produced a RESULT"
_, head, content = line.split()
assert content == expect[ref], f"{name}@{ref}: tree content {content!r}"
tree = abra_dir / "recipes" / RECIPE
assert (tree / "data.txt").read_text().strip() == expect[ref]
assert (
head
== subprocess.run(
["git", "-C", tree, "rev-parse", "HEAD"], capture_output=True, text=True
).stdout.strip()
)
# the two trees genuinely diverge AND the canonical staged clone is untouched
a = (runs["runA"][2] / "recipes" / RECIPE / "data.txt").read_text()
b = (runs["runB"][2] / "recipes" / RECIPE / "data.txt").read_text()
assert a != b
head_after = subprocess.run(
["git", "-C", canonical_repo, "rev-parse", "HEAD"], capture_output=True, text=True
).stdout.strip()
assert head_after == head_before, "canonical clone must not be touched by per-run fetches"


def test_19_env_written_through_servers_symlink_lands_canonical(tmp_path, monkeypatch):
    """Case 19: an app .env written through the per-run servers/ symlink (what abra does under
    $ABRA_DIR) lands in the CANONICAL shared path — so janitor discovery and every
    expanduser('~/.abra/servers/...') reader keep working unchanged."""
    home = _make_fake_home(tmp_path)
    monkeypatch.setenv("HOME", str(home))
    monkeypatch.setenv("CCCI_RUNS_DIR", str(tmp_path / "runs"))
    monkeypatch.setenv("DRONE_BUILD_NUMBER", "778")
    monkeypatch.setenv("ABRA_DIR", "sentinel-to-be-overwritten")
    d = run_recipe_ci.setup_run_abra_dir()
    domain = "test-abc123.ci.commoninternet.net"
    via_symlink = os.path.join(d, "servers", "default", f"{domain}.env")
    with open(via_symlink, "w") as f:
        f.write("TYPE=fakerecipe:1.0.0\nDOMAIN=placeholder\n")
    canonical = home / ".abra" / "servers" / "default" / f"{domain}.env"
    assert canonical.is_file(), ".env written via the symlink must land in the canonical path"
    # the canonical-path readers/writers (abra.env_get/env_set use ~/.abra) see the same file
    assert abra.env_get(domain, "TYPE") == "fakerecipe:1.0.0"
    abra.env_set(domain, "DOMAIN", domain)
    with open(via_symlink) as f:
        assert f"DOMAIN={domain}" in f.read()


def test_18b_run_id_manual_fallback_is_per_process(tmp_path, monkeypatch):
    """Companion to case 18: two concurrent MANUAL runs (no DRONE_BUILD_NUMBER) must not share an
    abra dir either — the manual fallback is pid-suffixed."""
    home = _make_fake_home(tmp_path)
    monkeypatch.setenv("HOME", str(home))
    monkeypatch.setenv("CCCI_RUNS_DIR", str(tmp_path / "runs"))
    monkeypatch.delenv("DRONE_BUILD_NUMBER", raising=False)
    monkeypatch.delenv("CCCI_APP_DOMAIN", raising=False)
    monkeypatch.delenv("CCCI_RUN_ID", raising=False)
    monkeypatch.setenv("ABRA_DIR", "sentinel-to-be-overwritten")
    d = run_recipe_ci.setup_run_abra_dir()
    assert f"manual-{os.getpid()}" in d
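
The write-through property that case 19 checks can be sketched in isolation. This is only a minimal illustration, assuming the per-run directory exposes `servers` as a symlink into the canonical tree; `make_run_dir` and the file names are hypothetical and are not the harness's `setup_run_abra_dir`.

```python
import os
import tempfile


def make_run_dir(canonical_root: str, runs_root: str, run_id: str) -> str:
    """Hypothetical helper: per-run dir whose `servers` entry is a symlink to the canonical one."""
    run_dir = os.path.join(runs_root, run_id)
    os.makedirs(run_dir, exist_ok=True)
    os.symlink(os.path.join(canonical_root, "servers"), os.path.join(run_dir, "servers"))
    return run_dir


base = tempfile.mkdtemp()
canonical_root = os.path.join(base, "home", ".abra")
os.makedirs(os.path.join(canonical_root, "servers", "default"))
run_dir = make_run_dir(canonical_root, os.path.join(base, "runs"), "778")

# Write through the per-run symlink...
via_symlink = os.path.join(run_dir, "servers", "default", "app.example.env")
with open(via_symlink, "w") as f:
    f.write("TYPE=fakerecipe:1.0.0\n")

# ...and read the same bytes back from the canonical path.
canonical_file = os.path.join(canonical_root, "servers", "default", "app.example.env")
with open(canonical_file) as f:
    canonical_text = f.read()
```

Because only `servers` is shared, anything else placed in the per-run directory stays private to that run while existing canonical-path readers keep working.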

@ -1,189 +0,0 @
"""Janitor / flock-probe semantics (concurrency-restructure plan, cases 5-12).

The janitor runs IN-PROCESS with its discovery monkeypatched (candidates injected via a stubbed
abra.app_ls + empty docker sweep) and teardown_app stubbed to record calls — but the LOCKS are
real kernel flocks, held by real helper subprocesses where a live owner is needed."""

from __future__ import annotations

import os
import sys
import threading
import time

sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))

from concutil import DOMAIN, lock_state, wait_marker  # noqa: E402
from harness import lifecycle  # noqa: E402


def _inject_candidates(monkeypatch, domains):
    """Point janitor discovery at exactly `domains`: abra lists them, docker sweep is empty.
    teardown_app is stubbed to a recorder; returns the calls list."""
    calls = []
    monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": d} for d in domains])
    monkeypatch.setattr(lifecycle, "_docker_names", lambda kind, stack: [])
    monkeypatch.setattr(lifecycle, "teardown_app", lambda d, verify=True: calls.append(d))
    return calls


def test_5_orphan_reaped_lockfile_unlinked(lock_dir, pool, monkeypatch):
    """Case 5: an orphan (lockfile exists, no holder — its run was SIGKILL'd) is reaped exactly
    once and its lockfile unlinked."""
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    p.kill()
    p.wait(timeout=10)
    calls = _inject_candidates(monkeypatch, [DOMAIN])
    lifecycle.janitor()
    assert calls == [DOMAIN], f"teardown calls: {calls} (expected exactly one)"
    assert lock_state(DOMAIN) == "absent", "reaped orphan's lockfile must be unlinked"


def test_6_live_run_never_reaped(lock_dir, pool, monkeypatch, capsys):
    """Case 6: a held lock (live helper) is never reaped and is logged as live."""
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    calls = _inject_candidates(monkeypatch, [DOMAIN])
    lifecycle.janitor()
    assert calls == []
    assert "live concurrent run" in capsys.readouterr().out
    assert lock_state(DOMAIN) == "held"


def test_7_new_run_blocks_until_reap_finishes(lock_dir, pool, monkeypatch):
    """Case 7: the janitor reaps WHILE HOLDING the probe lock, so a new run of the same domain
    blocks in acquire_app_lock until the reap completes — no window where a fresh app coexists
    with a half-reaped one."""
    # Make an orphan.
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    p.kill()
    p.wait(timeout=10)
    state = {"teardown_end": None, "acquirer_out": None}

    def slow_teardown(domain, verify=True):
        # While the janitor holds the probe lock mid-reap, a new run starts acquiring.
        _, aout = pool.spawn("hold", DOMAIN)
        state["acquirer_out"] = aout
        time.sleep(2.0)
        state["teardown_end"] = time.time()

    monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": DOMAIN}])
    monkeypatch.setattr(lifecycle, "_docker_names", lambda kind, stack: [])
    monkeypatch.setattr(lifecycle, "teardown_app", slow_teardown)
    lifecycle.janitor()
    line = wait_marker(state["acquirer_out"], "ACQUIRED", timeout=15)
    assert line, "new run never acquired after the reap"
    acquired_ts = float(line.split()[1])
    assert (
        acquired_ts >= state["teardown_end"]
    ), f"new run acquired at {acquired_ts} BEFORE the reap finished at {state['teardown_end']}"
    # The new run must hold a lock the next probe can SEE (fresh inode at the path).
    assert lock_state(DOMAIN) == "held"


def test_8_two_janitors_exactly_one_reaps(lock_dir, pool, monkeypatch):
    """Case 8: two concurrent janitors arbitrate on the probe flock — exactly one reaps (the
    other sees 'held' and leaves). Teardown is slowed so the runs genuinely overlap."""
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    p.kill()
    p.wait(timeout=10)
    calls = []
    calls_lock = threading.Lock()

    def slow_teardown(domain, verify=True):
        with calls_lock:
            calls.append(domain)
        time.sleep(2.0)

    monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": DOMAIN}])
    monkeypatch.setattr(lifecycle, "_docker_names", lambda kind, stack: [])
    monkeypatch.setattr(lifecycle, "teardown_app", slow_teardown)
    barrier = threading.Barrier(2)

    def run_janitor():
        barrier.wait()
        lifecycle.janitor()

    t1, t2 = threading.Thread(target=run_janitor), threading.Thread(target=run_janitor)
    t1.start(), t2.start()
    t1.join(timeout=30), t2.join(timeout=30)
    assert calls == [DOMAIN], f"expected exactly one reap, got {calls}"
    assert lock_state(DOMAIN) == "absent"


def test_9_reboot_lockfile_absent_reaped_immediately(lock_dir, monkeypatch):
    """Case 9: post-reboot simulation — the app exists but its lockfile is gone (/run/lock is
    tmpfs). The probe trivially acquires -> immediate reap, NO age threshold (improvement over
    the old 2h fallback)."""
    assert lock_state(DOMAIN) == "absent"
    calls = _inject_candidates(monkeypatch, [DOMAIN])
    t0 = time.time()
    lifecycle.janitor()
    assert calls == [DOMAIN]
    assert time.time() - t0 < 5, "reap must be immediate (no age wait)"


def test_10_long_held_lock_flagged_never_stolen(lock_dir, pool, monkeypatch, capsys):
    """Case 10: a lock held with mtime older than 120min is flagged as a possible leaked run —
    and NOT reaped (never steal a held lock)."""
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    path = lifecycle._app_lock_path(DOMAIN)  # noqa: SLF001
    backdate = time.time() - (130 * 60)
    os.utime(path, (backdate, backdate))
    calls = _inject_candidates(monkeypatch, [DOMAIN])
    lifecycle.janitor()
    assert calls == []
    out_text = capsys.readouterr().out
    assert "possible leaked run" in out_text and "lslocks" in out_text
    assert lock_state(DOMAIN) == "held"


def test_11_warm_canonical_names_never_probed(lock_dir, monkeypatch):
    """Case 11: RUN_APP_RE allowlist — warm/canonical-shaped names never become candidates, so
    they are never probed (no lockfile is even created for them) and never reaped."""
    warmish = [
        "warm-keycloak.ci.commoninternet.net",
        "keycloak.ci.commoninternet.net",
        "warm-hedgedoc.ci.commoninternet.net",
        "drone.ci.commoninternet.net",
    ]
    calls = []
    monkeypatch.setattr(lifecycle.abra, "app_ls", lambda: [{"appName": d} for d in warmish])
    monkeypatch.setattr(
        lifecycle,
        "_docker_names",
        lambda kind, stack: ["warm-keycloak_ci_commoninternet_net_app"]
        if kind == "service"
        else [],
    )
    monkeypatch.setattr(lifecycle, "teardown_app", lambda d, verify=True: calls.append(d))
    lifecycle.janitor()
    assert calls == []
    lockdir = os.environ["CCCI_APP_LOCK_DIR"]
    assert [
        f for f in os.listdir(lockdir) if f.startswith("cc-ci-app-")
    ] == [], "janitor must not create lockfiles for non-run-app names"


def test_12_degrades_safely_on_bad_lockfile_and_missing_dir(lock_dir, monkeypatch, capsys):
    """Case 12: a garbled/unopenable lockfile (here: a DIRECTORY at the lockfile path) is skipped
    with a log line; a missing lock dir doesn't crash the janitor either. Never a crash."""
    path = lifecycle._app_lock_path(DOMAIN)  # noqa: SLF001
    os.makedirs(path)  # open(path, "a") -> IsADirectoryError (an OSError)
    calls = _inject_candidates(monkeypatch, [DOMAIN])
    lifecycle.janitor()  # must not raise
    assert calls == []
    assert "skipping" in capsys.readouterr().out
    os.rmdir(path)
    monkeypatch.setenv("CCCI_APP_LOCK_DIR", os.path.join(os.environ["CCCI_APP_LOCK_DIR"], "gone"))
    lifecycle.janitor()  # missing dir: probe open fails -> skip; tidy glob -> empty. No crash.
    assert calls == []
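
The probe-then-reap rule exercised by cases 5, 6 and 12 can be sketched as follows. This is a minimal sketch under stated assumptions, not the real `lifecycle.janitor`: `reap_if_orphan` and its return strings are hypothetical, and the real janitor's logging, age flagging and candidate discovery are left out.

```python
import fcntl
import os
import tempfile


def reap_if_orphan(lock_path: str, teardown) -> str:
    """Hypothetical helper: probe the lock and reap only if nobody holds it.

    Returns 'live', 'reaped' or 'skipped'."""
    try:
        f = open(lock_path, "a")
    except OSError:
        return "skipped"  # unopenable lockfile (e.g. a directory): skip, never crash
    with f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "live"  # a held lock means a live run: never touch it
        # The probe now holds the lock, so a new run of the same app would block
        # in its own acquire until the teardown below has finished.
        teardown()
        os.unlink(lock_path)
        return "reaped"


lock_path = os.path.join(tempfile.mkdtemp(), "cc-ci-app-demo.lock")
calls = []

holder = open(lock_path, "a")
fcntl.flock(holder, fcntl.LOCK_EX)  # simulate a live run holding the lock
live_result = reap_if_orphan(lock_path, lambda: calls.append("teardown"))

holder.close()  # the "run" goes away; its lock is released with the descriptor
orphan_result = reap_if_orphan(lock_path, lambda: calls.append("teardown"))
```

The demo holds the lock from the same process through a second `open()`, which works because `flock` locks belong to the open file description rather than to the process.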

@ -1,82 +0,0 @
"""Lifetime hardening (concurrency-restructure plan, cases 13-16): the REAL prctl/signal/alarm
guards installed by helper subprocesses; tests assert teardown ran, exit was non-zero, and the
lock was released."""

from __future__ import annotations

import os
import signal
import sys

sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))

from concutil import (  # noqa: E402
    DOMAIN,
    wait_lock_state,
    wait_marker,
    wait_pid_gone,
)


def test_13_pdeathsig_parent_kill_terms_harness(lock_dir, pool):
    """Case 13: wrapper-parent spawns a guarded harness-child; the parent is SIGKILL'd (the
    harness gets no courtesy signal) -> the kernel's PDEATHSIG TERMs the child, its teardown
    funnel runs, it exits, and the lock is released."""
    p, out = pool.spawn("wrapper", DOMAIN)
    line = wait_marker(out, "WRAPPED")
    assert line, "wrapper never spawned its child"
    child_pid = int(line.split()[1])
    pool.track_pid(child_pid)
    assert wait_marker(out, "READY"), "guarded child never got ready"
    p.kill()  # parent dies WITHOUT signalling the child — only PDEATHSIG can save us
    p.wait(timeout=10)
    assert wait_pid_gone(child_pid), "guarded child must exit on parent death (PDEATHSIG)"
    assert wait_marker(out, "TEARDOWN", timeout=5), "teardown funnel did not run"
    assert wait_lock_state(DOMAIN, "free") == "free"


def test_14_already_orphaned_helper_refuses_to_run(lock_dir, pool):
    """Case 14 (ppid race): a helper whose parent died BEFORE the prctl was armed (it starts
    already reparented to pid 1) must refuse to run — PDEATHSIG would never fire for it."""
    # Spawn an intermediate parent that forks orphan-probe and exits immediately.
    import subprocess

    out = os.path.join(pool.out_dir, "orphan.out")
    intermediate = (
        "import subprocess, sys, os; "
        "subprocess.Popen([sys.executable, os.environ['CCCI_HELPERS'], 'orphan-probe']); "
    )
    env = dict(
        os.environ,
        CCCI_HELPER_OUT=out,
        CCCI_HELPERS=os.path.join(os.path.dirname(__file__), "helpers.py"),
    )
    subprocess.run([sys.executable, "-c", intermediate], env=env, timeout=15, check=True)
    line = wait_marker(out, "REFUSED", timeout=20)
    assert line, "orphaned helper did not refuse to run (or never reparented to pid 1)"


def test_15_deadline_alarm_fires_teardown_and_releases(lock_dir, pool):
    """Case 15: the self-deadline (alarm). A guarded helper with a 2s deadline tears down via
    the funnel (finally: ran), exits NON-zero, and its lock is released."""
    p, out = pool.spawn("guarded", DOMAIN, "2")
    assert wait_marker(out, "READY")
    rc = p.wait(timeout=20)
    assert rc != 0, f"deadline exit must be non-zero (got {rc})"
    assert rc == 128 + signal.SIGALRM, f"expected 142 (128+SIGALRM), got {rc}"
    assert wait_marker(out, "TEARDOWN", timeout=5), "teardown funnel did not run on deadline"
    assert wait_lock_state(DOMAIN, "free") == "free"


def test_16_sigterm_runs_teardown_funnel_and_releases(lock_dir, pool):
    """Case 16: SIGTERM (drone cancel path) -> the finally: teardown funnel runs, exit is
    non-zero, lock released."""
    p, out = pool.spawn("guarded", DOMAIN, "3600")
    assert wait_marker(out, "READY")
    p.send_signal(signal.SIGTERM)
    rc = p.wait(timeout=20)
    assert rc != 0, f"SIGTERM exit must be non-zero (got {rc})"
    assert rc == 128 + signal.SIGTERM, f"expected 143 (128+SIGTERM), got {rc}"
    assert wait_marker(out, "TEARDOWN", timeout=5), "teardown funnel did not run on SIGTERM"
    assert wait_lock_state(DOMAIN, "free") == "free"
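
The "teardown funnel" that cases 15 and 16 assert on can be illustrated with a small sketch: a signal handler turns SIGTERM into `SystemExit(128 + signum)`, so the `finally:` block still runs and the exit code stays non-zero. `run_guarded` and `_exit_on_signal` are hypothetical names, and the sketch catches `SystemExit` only so the demo can inspect the code without leaving the interpreter; the real helpers presumably let it propagate. The PDEATHSIG and alarm guards are not shown.

```python
import os
import signal
import time


def _exit_on_signal(signum, frame):
    # Convert the signal into an exception so try/finally unwinds normally.
    raise SystemExit(128 + signum)


def run_guarded(body, teardown):
    """Hypothetical guard: run body(); always run teardown(); return the would-be exit code."""
    previous = signal.signal(signal.SIGTERM, _exit_on_signal)
    try:
        body()
        return 0
    except SystemExit as exc:
        return exc.code
    finally:
        teardown()
        signal.signal(signal.SIGTERM, previous)


events = []


def body():
    os.kill(os.getpid(), signal.SIGTERM)  # simulate the drone cancel path
    time.sleep(5)  # not expected to complete: the handler raises first
    events.append("unreachable")


rc = run_guarded(body, lambda: events.append("TEARDOWN"))
```

A deadline could reuse the same handler for `SIGALRM` armed with `signal.alarm(seconds)`, which would give the `128 + SIGALRM` code that case 15 expects.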

@ -1,85 +0,0 @
"""Lock fundamentals (concurrency-restructure plan, cases 1-4). Real kernel flocks held by real
subprocesses — nothing mocked."""

from __future__ import annotations

import fcntl
import os
import sys
import time

sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))

from concutil import (  # noqa: E402
    DOMAIN,
    lock_state,
    wait_lock_state,
    wait_marker,
)
from harness import lifecycle  # noqa: E402


def test_1_sigkill_releases_lock(lock_dir, pool):
    """Case 1: acquire -> holder SIGKILL'd -> lock immediately acquirable (kernel auto-release).
    The exact property the old pidfile registry approximated with /proc checks."""
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED"), "holder never acquired"
    assert lock_state(DOMAIN) == "held"
    p.kill()
    p.wait(timeout=10)
    assert wait_lock_state(DOMAIN, "free") == "free"


def test_2_nb_probe_held_vs_unheld(lock_dir, pool):
    """Case 2: LOCK_NB probe raises BlockingIOError against a held lock; succeeds when unheld."""
    p, out = pool.spawn("hold", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    path = lifecycle._app_lock_path(DOMAIN)  # noqa: SLF001
    with open(path, "a") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            raise AssertionError("LOCK_NB succeeded against a held lock")
        except BlockingIOError:
            pass
    p.kill()
    p.wait(timeout=10)
    assert wait_lock_state(DOMAIN, "free") == "free"
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)  # must not raise now


def test_3_lock_fd_not_inherited_by_children(lock_dir, pool):
    """Case 3 (PEP 446): the holder spawns a subprocess child, the holder dies, the child lives —
    and the lock is STILL released (the child never inherited the lock fd). This is what makes
    'held lock == live HARNESS owner' sound even though runs spawn abra/docker/pytest children."""
    p, out = pool.spawn("hold-with-child", DOMAIN)
    assert wait_marker(out, "ACQUIRED")
    child_line = wait_marker(out, "CHILD")
    assert child_line, "holder never reported its child pid"
    child_pid = int(child_line.split()[1])
    pool.track_pid(child_pid)
    p.kill()
    p.wait(timeout=10)
    assert os.path.exists(f"/proc/{child_pid}"), "child should outlive the holder"
    assert (
        wait_lock_state(DOMAIN, "free") == "free"
    ), "lock must release on holder death even with a live child (PEP 446 non-inheritable fd)"


def test_4_second_acquire_blocks_until_first_exits(lock_dir, pool):
    """Case 4: a second same-domain acquire blocks until the first holder exits — the
    double-!testme serialisation property."""
    p1, out1 = pool.spawn("hold", DOMAIN)
    assert wait_marker(out1, "ACQUIRED")
    p2, out2 = pool.spawn("hold", DOMAIN)
    # p2 must NOT acquire while p1 holds.
    time.sleep(1.5)
    assert wait_marker(out2, "ACQUIRED", timeout=0.1) is None, "second acquire did not block"
    t_kill = time.time()
    p1.kill()
    p1.wait(timeout=10)
    line = wait_marker(out2, "ACQUIRED", timeout=15)
    assert line, "second acquire never completed after first holder exited"
    acquired_ts = float(line.split()[1])
    assert acquired_ts >= t_kill - 0.05, "second holder acquired before the first exited"
    assert lock_state(DOMAIN) == "held"
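
Two of the properties these cases lean on can be checked in a few lines: a non-blocking probe distinguishes held from free, and descriptors opened by Python are non-inheritable by default (PEP 446). The `lock_state` below is a hypothetical stand-in written for this sketch, assumed to behave roughly like the `concutil` helper of the same name, whose implementation is not shown in this diff.

```python
import fcntl
import os
import tempfile


def lock_state(path: str) -> str:
    """Hypothetical probe: 'absent' if no lockfile, 'held' if someone holds it, else 'free'."""
    if not os.path.exists(path):
        return "absent"
    with open(path, "a") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "held"
    # Closing the probe's descriptor released the lock the probe briefly took.
    return "free"


path = os.path.join(tempfile.mkdtemp(), "cc-ci-app-demo.lock")
state_before = lock_state(path)

holder = open(path, "a")
fcntl.flock(holder, fcntl.LOCK_EX)
state_held = lock_state(path)

# PEP 446: the descriptor is not inherited by exec'd children unless asked to be.
inheritable = os.get_inheritable(holder.fileno())

holder.close()
state_after = lock_state(path)
```

Non-inheritance is what lets the death of the holder release the lock even while its subprocess children are still running.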

@ -1,79 +0,0 @
"""Run-scoped state files — M2(c) live-verify regression (not one of the 19 plan cases).

The four CCCI state files (deploys countfile, opstate, deps, depskip) must be keyed by
run id + harness pid, NEVER by app domain: a second run of the SAME domain executes its
main() preamble (state-file init, deploy_app's _record_deploy) BEFORE it blocks at the
app lock, so domain-keyed files in the shared tempdir get reset/removed underneath the
live first run. Observed live (builds 279/281): false DG4.1 deploy-count=2 in run 1,
countfile FileNotFoundError crash in run 2. Children never re-derive these paths — they
receive them via the CCCI_*_FILE env vars, so per-process uniqueness is sufficient.
"""

from __future__ import annotations

import os
import sys
import tempfile

sys.path.insert(0, os.path.dirname(__file__))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))

import run_recipe_ci  # noqa: E402
from concutil import wait_marker  # noqa: E402

DOMAIN = "fake-abc123.ci.commoninternet.net"


def test_20_state_paths_keyed_by_run_and_pid_never_by_domain(monkeypatch):
    domain = "immi-ad3e33.ci.commoninternet.net"
    monkeypatch.setenv("CCCI_APP_DOMAIN", domain)
    monkeypatch.setenv("DRONE_BUILD_NUMBER", "279")
    p279 = run_recipe_ci._run_state_path("deploys")
    monkeypatch.setenv("DRONE_BUILD_NUMBER", "281")
    p281 = run_recipe_ci._run_state_path("deploys")
    # the double-!testme invariant: two runs (same domain) share NO state file
    assert p279 != p281
    # keyed by run id + pid, under the tempdir
    base = os.path.basename(p279)
    assert base == f"ccci-deploys-279-{os.getpid()}"
    assert os.path.dirname(p279) == tempfile.gettempdir()
    # the app domain must not appear in the path at all
    assert domain not in p279 and domain not in p281


def test_20c_same_domain_runs_each_keep_their_own_count(tmp_path, lock_dir, pool):
    """The live CONC-A1 interleaving, with REAL processes + the REAL lock and counter code:
    run A holds the app lock; run B (same domain) fires its pre-lock _record_deploy and
    blocks; A then reads its counter — must still be 1 (not polluted by B) — and removes
    its own file; B acquires and must find ITS file intact (no FileNotFoundError)."""
    gate = tmp_path / "gate"
    env_a = {"TMPDIR": str(tmp_path), "DRONE_BUILD_NUMBER": "9001"}
    env_b = {"TMPDIR": str(tmp_path), "DRONE_BUILD_NUMBER": "9002"}
    pa, out_a = pool.spawn("deploy-count-run", DOMAIN, str(gate), env_extra=env_a)
    assert wait_marker(out_a, "ACQUIRED")
    pb, out_b = pool.spawn("deploy-count-run", DOMAIN, "", env_extra=env_b)
    # B's main()-preamble + pre-lock increment have fired; B is now blocked on the app lock
    assert wait_marker(out_b, "PRELOCK")
    assert wait_marker(out_b, "ACQUIRED", timeout=1.0) is None  # still serialised behind A
    gate.touch()  # let A read its counter only AFTER B's pre-lock work landed
    line_a = wait_marker(out_a, "COUNT")
    assert line_a is not None and line_a.strip() == "COUNT 1", line_a  # not 2: B didn't pollute A
    pa.wait(timeout=15)
    line_b = wait_marker(out_b, "COUNT")
    assert (
        line_b is not None and line_b.strip() == "COUNT 1"
    ), line_b  # B's file survived A's remove
    pb.wait(timeout=15)


def test_20b_manual_runs_distinct_via_pid(monkeypatch):
    # no DRONE_BUILD_NUMBER and no domain/run-id env → run_id() falls back to "manual";
    # the pid suffix still separates two concurrent hand-runs of the same domain.
    for var in ("DRONE_BUILD_NUMBER", "CCCI_APP_DOMAIN", "CCCI_RUN_ID"):
        monkeypatch.delenv(var, raising=False)
    p = run_recipe_ci._run_state_path("opstate")
    assert os.path.basename(p) == f"ccci-opstate-manual-{os.getpid()}"
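
The naming scheme these tests pin down (`ccci-<kind>-<run id>-<pid>` under the tempdir) might look roughly like the sketch below. It is inferred from the assertions only: `run_id` and `run_state_path` are hypothetical, and the real `run_recipe_ci.run_id()` may consult other variables (such as `CCCI_RUN_ID`) in an order this diff does not show.

```python
import os
import tempfile


def run_id(env) -> str:
    """Hypothetical: use the CI build number when present, else fall back to 'manual'."""
    return env.get("DRONE_BUILD_NUMBER") or "manual"


def run_state_path(kind: str, env) -> str:
    """Key the state file by run id and harness pid, never by app domain."""
    name = f"ccci-{kind}-{run_id(env)}-{os.getpid()}"
    return os.path.join(tempfile.gettempdir(), name)


domain = "immi-ad3e33.ci.commoninternet.net"
p279 = run_state_path("deploys", {"DRONE_BUILD_NUMBER": "279", "CCCI_APP_DOMAIN": domain})
p281 = run_state_path("deploys", {"DRONE_BUILD_NUMBER": "281", "CCCI_APP_DOMAIN": domain})
p_manual = run_state_path("opstate", {})
```

Since the domain never enters the path, a second run of the same domain that executes its preamble before blocking on the app lock has no file in common with the first run.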

@ -13,8 +13,7 @@ import sys
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "runner"))
from harness import deps as deps_mod # noqa: E402
from harness import lifecycle, naming
from harness import deps as deps_mod, lifecycle, naming # noqa: E402
def _short(s: str, n: int = 8) -> str:

@ -26,7 +26,6 @@ Transient `net::ERR_NETWORK_CHANGED` is handled by the shared `goto_with_retry`
from __future__ import annotations
import contextlib
import os
import sys
import uuid
@ -40,11 +39,7 @@ def _open_pad(ctx, url):
bar once CryptPad has created/loaded the fragment-keyed pad (`#/2/pad/edit/<key>/`)."""
page = ctx.new_page()
harness_browser.goto_with_retry(
page,
url,
accept_statuses=(200,),
goto_timeout_ms=60_000,
wait_until="load",
page, url, accept_statuses=(200,), goto_timeout_ms=60_000, wait_until="load",
deadline_seconds=150,
)
pad_url = url
@ -58,15 +53,13 @@ def _open_pad(ctx, url):
pad_url = page.url
break
if i == 40:
with contextlib.suppress(Exception): # best-effort unstick
try:
harness_browser.goto_with_retry(
page,
url,
accept_statuses=(200,),
goto_timeout_ms=60_000,
wait_until="load",
deadline_seconds=120,
page, url, accept_statuses=(200,), goto_timeout_ms=60_000,
wait_until="load", deadline_seconds=120,
)
except Exception: # noqa: BLE001 — best-effort unstick
pass
return page, pad_url
@ -81,22 +74,18 @@ def _ckeditor_frame(page, deadline_polls=90, reload_at=22, reload_url=None):
if "ckeditor-inner" in f.url:
return f
if i == reload_at and reload_url is not None:
with contextlib.suppress(Exception): # reload is a best-effort unstick
try:
harness_browser.goto_with_retry(
page,
reload_url,
accept_statuses=(200,),
goto_timeout_ms=60_000,
wait_until="load",
deadline_seconds=120,
page, reload_url, accept_statuses=(200,), goto_timeout_ms=60_000,
wait_until="load", deadline_seconds=120,
)
except Exception: # noqa: BLE001 — reload is a best-effort unstick
pass
page.wait_for_timeout(2000)
return None
def _poll_any_frame_for_text(
page, needle, deadline_polls=120, reload_at=(20, 45, 75, 100), reload_url=None
):
def _poll_any_frame_for_text(page, needle, deadline_polls=120, reload_at=(20, 45, 75, 100), reload_url=None):
"""Robust read-back (F2-13): poll EVERY frame's body text for `needle`, returning True as soon as
it appears. The fresh cold-cache read-back context's deeply-nested CKEditor frame is slow/flaky to
*attach* by URL (the prior `_ckeditor_frame` wait timed out on the Adversary's cold run), but the
@ -112,15 +101,13 @@ def _poll_any_frame_for_text(
except Exception: # noqa: BLE001 — frame not ready / detached; keep polling
pass
if reload_url and i in reload_at:
with contextlib.suppress(Exception): # best-effort unstick
try:
harness_browser.goto_with_retry(
page,
reload_url,
accept_statuses=(200,),
goto_timeout_ms=60_000,
wait_until="load",
deadline_seconds=120,
page, reload_url, accept_statuses=(200,), goto_timeout_ms=60_000,
wait_until="load", deadline_seconds=120,
)
except Exception: # noqa: BLE001 — best-effort unstick
pass
page.wait_for_timeout(2000)
return False
@ -150,9 +137,9 @@ def test_cryptpad_pad_content_survives_fresh_session(live_app):
# --- session 1: create the pad + write the marker ---
ctx1 = browser.new_context(ignore_https_errors=True)
page, pad_url = _open_pad(ctx1, f"https://{live_app}/pad/")
assert (
"#/2/pad/edit/" in pad_url
), f"CryptPad did not create a fragment-keyed pad URL; got {pad_url!r}"
assert "#/2/pad/edit/" in pad_url, (
f"CryptPad did not create a fragment-keyed pad URL; got {pad_url!r}"
)
ck = _ckeditor_frame(page, reload_url=pad_url)
assert ck is not None, "CKEditor content frame never attached (pad editor not ready)"
_dismiss_store_modal(page)
@ -161,9 +148,9 @@ def test_cryptpad_pad_content_survives_fresh_session(live_app):
page.wait_for_timeout(1000)
body.type(marker, delay=40)
page.wait_for_timeout(12000) # let CryptPad encrypt + sync the update to the server
assert (
marker in ck.locator("body").inner_text()
), "marker not present in the editor after typing — type did not land"
assert marker in ck.locator("body").inner_text(), (
"marker not present in the editor after typing — type did not land"
)
ctx1.close()
# --- session 2: FRESH context (no shared storage/localStorage) reads the pad back by URL.

@ -51,9 +51,9 @@ def test_cryptpad_spa_renders_with_no_console_errors(live_app):
title = (page.title() or "").lower()
body = page.content()
blower = body.lower()
assert (
"cryptpad" in title or "cryptpad" in blower
), f"CryptPad SPA does not carry brand. title={title!r}, body excerpt: {body[:200]!r}"
assert "cryptpad" in title or "cryptpad" in blower, (
f"CryptPad SPA does not carry brand. title={title!r}, body excerpt: {body[:200]!r}"
)
# Canonical CryptPad asset references in the rendered DOM
canonical = ("/customize/", "/components/", "main.js", "/api/broadcast")

@ -8,8 +8,7 @@ import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import browser as harness_browser # noqa: E402
from harness import generic, lifecycle
from harness import browser as harness_browser, generic, lifecycle # noqa: E402
def test_serving_and_content(live_app, meta):

@ -20,9 +20,7 @@ def test_backup_captures_state(live_app):
Since custom-html-bkp-bad has no ops.py::pre_backup to seed the marker, this file does NOT
exist at backup time — exec_in_app returns empty or raises → assertion fails → backup tier RED.
This models a recipe that declares backup capability but omits the data-seeding hook."""
result = lifecycle.exec_in_app(
live_app, ["sh", "-c", f"cat {MARKER_PATH} 2>/dev/null || echo MISSING"]
).strip()
result = lifecycle.exec_in_app(live_app, ["sh", "-c", f"cat {MARKER_PATH} 2>/dev/null || echo MISSING"]).strip()
assert result == "original", (
f"backup did not capture the expected marker at {MARKER_PATH}: got {result!r}. "
"Expected 'original' (seeded by pre_backup). If the marker is 'MISSING', the pre_backup "

@ -1,87 +0,0 @
"""custom-html-tiny — recipe-specific functional test (static-web-server).

Proves the deployed static-web-server is *actually serving files from its `content` volume* with real
file-server semantics, not merely returning 200 from a Traefik fallback or a generic stub:

  1. exact-byte round-trip — write a uniquely-named file with random content into the served volume,
     fetch it over HTTPS, and assert the bytes come back verbatim. Non-vacuous: the content is random
     per run, so only a server that reads this file off the volume can pass.
  2. real 404 — a random non-existent path returns 404, proving directory/file semantics (a
     200-everything stub or mis-routed host would not 404).

The recipe's image (joseluisq/static-web-server) is shell-less (scratch-based) and its content volume
is seeded via the install_steps.sh host-mountpoint mechanism — so this test writes its probe file the
same way (resolve the swarm volume's mountpoint with `docker volume inspect`, write directly) rather
than `docker exec`-ing in a container that has no shell.

Runs in the custom tier against the shared post-install deployment (the `live_app` fixture is its
per-run domain). Mirrors install_steps.sh: the app's content volume is named `<stack>_content`, where
`stack` is the domain with dots replaced by underscores; HTTP_SUBDIR is empty, so the volume root is
served at `/`.
"""

from __future__ import annotations

import contextlib
import os
import ssl
import subprocess
import urllib.error
import urllib.request
import uuid


def _served_dir(domain: str) -> str:
    """Host mountpoint of the app's served `content` volume (same naming as install_steps.sh)."""
    vol = f"{domain.replace('.', '_')}_content"
    out = subprocess.run(
        ["docker", "volume", "inspect", vol, "--format", "{{.Mountpoint}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    mountpoint = out.stdout.strip()
    assert mountpoint, f"could not resolve mountpoint for volume {vol!r}"
    return mountpoint


def _get(url: str) -> tuple[int, bytes]:
    """GET the URL; return (status, body). A 4xx/5xx is returned, not raised (we assert on the code).
    TLS verification is relaxed: the served wildcard cert is validated separately by the infra check;
    here we care only about the app's response."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=20, context=ctx) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as e:
        return e.code, e.read()


def test_static_file_roundtrip_and_404(live_app):
    """Write a random file into the served volume → fetch it → bytes match; and a missing path 404s."""
    served = _served_dir(live_app)
    token = uuid.uuid4().hex
    name = f"ccci-probe-{token}.txt"
    body = f"cc-ci-functional-{token}\n".encode()
    path = os.path.join(served, name)
    with open(path, "wb") as fh:
        fh.write(body)
    try:
        status, got = _get(f"https://{live_app}/{name}")
        assert status == 200, f"served probe file returned {status} (expected 200)"
        assert got == body, (
            f"content round-trip mismatch: served {got!r}, wrote {body!r} "
            "(static-web-server not serving the content volume?)"
        )
        # A random non-existent path must 404 — proves real static-file semantics, distinguishing a
        # working server from a 200-everything stub or a mis-routed Traefik fallback.
        miss_status, _ = _get(f"https://{live_app}/ccci-missing-{uuid.uuid4().hex}.txt")
        assert (
            miss_status == 404
        ), f"missing path returned {miss_status} (expected 404 — generic 200-returner / mis-route?)"
    finally:
        with contextlib.suppress(OSError):
            os.remove(path)

@ -3,14 +3,3 @@
# (DG5) is detected quickly instead of waiting the default 300s HTTP timeout.
DEPLOY_TIMEOUT = 120
HTTP_TIMEOUT = 90
# Rungs this recipe INTENTIONALLY skips, each with a reason. Any essential rung skipped (N/A) and NOT
# listed here is reported as an *unintentional* skip (a coverage gap to fill or declare). A skip still
# caps the level either way — the harness never claims a rung it did not verify; this only records
# that the skip is deliberate. (The level ladder is the four essential rungs install/upgrade/
# backup_restore/functional; integration + recipe-local are optional and not leveled.)
# custom-html-tiny is a stateless static-web-server, so it has no backup surface:
EXPECTED_NA = {
"backup_restore": "stateless static file server: serves an ephemeral content volume seeded at "
"deploy, with no persistent/user data to back up or restore (no backupbot.backup label)",
}

@ -15,8 +15,7 @@ import sys
import uuid
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", "runner"))
from harness import http as harness_http # noqa: E402
from harness import lifecycle
from harness import http as harness_http, lifecycle # noqa: E402
def test_content_roundtrip(live_app):

@ -53,9 +53,9 @@ def test_content_type_html_and_txt(live_app):
ct_txt = h_txt.get("content-type", "")
# nginx default: "text/html" for .html and "text/plain" for .txt (may include "; charset=utf-8")
assert ct_html.startswith(
"text/html"
), f"{html_name} Content-Type={ct_html!r}, expected text/html (nginx MIME config broken?)"
assert ct_txt.startswith(
"text/plain"
), f"{txt_name} Content-Type={ct_txt!r}, expected text/plain (nginx MIME config broken?)"
assert ct_html.startswith("text/html"), (
f"{html_name} Content-Type={ct_html!r}, expected text/html (nginx MIME config broken?)"
)
assert ct_txt.startswith("text/plain"), (
f"{txt_name} Content-Type={ct_txt!r}, expected text/plain (nginx MIME config broken?)"
)

@ -9,8 +9,7 @@ import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import browser as harness_browser # noqa: E402
from harness import generic
from harness import browser as harness_browser, generic # noqa: E402
def test_serving_and_content(live_app, meta):

@ -53,7 +53,7 @@ def mint_admin(domain: str) -> tuple[str, str]:
cmd = (
"cd /opt/bitnami/discourse && "
"RUBY=$(command -v ruby || echo /opt/bitnami/ruby/bin/ruby) && "
f'RAILS_ENV=production "$RUBY" bin/rails runner "{_BOOTSTRAP_RB}"'
f"RAILS_ENV=production \"$RUBY\" bin/rails runner \"{_BOOTSTRAP_RB}\""
)
out = lifecycle.exec_in_app(domain, ["bash", "-c", cmd], service="app", timeout=240)
key = user = None
@ -63,9 +63,9 @@ def mint_admin(domain: str) -> tuple[str, str]:
key = line.split("=", 1)[1].strip()
elif line.startswith("CCCI_API_USER="):
user = line.split("=", 1)[1].strip()
assert (
key and user
), f"could not bootstrap discourse admin/API key; rails output tail:\n{out[-1000:]}"
assert key and user, (
f"could not bootstrap discourse admin/API key; rails output tail:\n{out[-1000:]}"
)
return key, user

@ -48,23 +48,21 @@ def test_create_topic_roundtrip(live_app):
headers=hdrs,
timeout=60,
)
assert status in (200, 201) and isinstance(
body, dict
), f"create topic failed: HTTP {status}, body={body!r}"
assert status in (200, 201) and isinstance(body, dict), (
f"create topic failed: HTTP {status}, body={body!r}"
)
topic_id = body.get("topic_id")
assert topic_id, f"create topic returned no topic_id: {body!r}"
# 4) Read the topic back and assert title + first-post body round-trip.
status, got = harness_http.http_get(f"{base}/t/{topic_id}.json", headers=hdrs, timeout=30)
assert status == 200 and isinstance(
got, dict
), f"read topic failed: HTTP {status}, body={got!r}"
assert (
got.get("title") == title
), f"topic title did not round-trip: sent {title!r}, got {got.get('title')!r}"
assert status == 200 and isinstance(got, dict), f"read topic failed: HTTP {status}, body={got!r}"
assert got.get("title") == title, (
f"topic title did not round-trip: sent {title!r}, got {got.get('title')!r}"
)
posts = (got.get("post_stream") or {}).get("posts") or []
assert posts, f"topic has no posts on read-back: {got!r}"
first_cooked = posts[0].get("cooked", "")
assert (
marker in first_cooked
), f"topic body did not round-trip: marker {marker!r} not in first post {first_cooked!r}"
assert marker in first_cooked, (
f"topic body did not round-trip: marker {marker!r} not in first post {first_cooked!r}"
)

View File

@ -20,12 +20,12 @@ def test_site_json_has_discourse_config(live_app):
status, body = harness_http.retry_http_get(
f"https://{live_app}/site.json", expect_status=200, max_wait=120, interval=5
)
assert status == 200 and isinstance(
body, dict
), f"GET /site.json failed: HTTP {status}, body type={type(body).__name__}"
assert status == 200 and isinstance(body, dict), (
f"GET /site.json failed: HTTP {status}, body type={type(body).__name__}"
)
# /site.json carries Discourse-specific structure — `categories` (a list) and `groups` are always
# present in a booted Discourse. A non-Discourse 200 (placeholder page) would not parse to this.
assert "categories" in body, f"/site.json missing 'categories' key: keys={list(body)[:20]}"
assert isinstance(
body["categories"], list
), f"/site.json 'categories' not a list: {type(body['categories']).__name__}"
assert isinstance(body["categories"], list), (
f"/site.json 'categories' not a list: {type(body['categories']).__name__}"
)

View File

@ -15,9 +15,7 @@ set -euo pipefail
: "${CCCI_RECIPE:?missing CCCI_RECIPE}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Resolve the recipe tree the way abra does: $ABRA_DIR (the per-run tree inside a CI run) else
# the canonical ~/.abra — the overlay must land in the tree this run actually deploys from.
RECIPE_DIR="${ABRA_DIR:-${HOME}/.abra}/recipes/${CCCI_RECIPE}"
RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}"
if [ ! -d "$RECIPE_DIR" ]; then
echo " discourse install_steps: recipe dir $RECIPE_DIR missing — cannot provide compose.ccci.yml" >&2
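
The two `RECIPE_DIR` lines in this hunk differ only in whether `$ABRA_DIR` is consulted before falling back to `~/.abra`. As a rough illustration, the same resolution rule can be written in Python; the function name `recipe_dir` and the sample paths are assumptions made for this sketch.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def recipe_dir(recipe: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Mirror `${ABRA_DIR:-${HOME}/.abra}/recipes/<recipe>`.

    The `:-` form falls back when ABRA_DIR is unset OR empty, which is
    what `env.get(...) or ...` does here. HOME is only read when the
    fallback is actually needed.
    """
    env = os.environ if env is None else env
    base = env.get("ABRA_DIR") or os.path.join(env["HOME"], ".abra")
    return Path(base) / "recipes" / recipe


# Hypothetical environments for illustration.
per_run = recipe_dir("discourse", {"HOME": "/home/ci", "ABRA_DIR": "/run/ci/abra"})
canonical = recipe_dir("discourse", {"HOME": "/home/ci"})
empty_var = recipe_dir("discourse", {"HOME": "/home/ci", "ABRA_DIR": ""})
print(per_run, canonical, empty_var)
```

With the `$HOME`-only spelling, a run that sets `ABRA_DIR` would look for the overlay in the canonical tree rather than in the tree named by that variable.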

View File

@ -15,7 +15,8 @@ from harness import lifecycle # noqa: E402
def _psql(domain, sql):
cmd = (
"PGPASSWORD=$(cat /run/secrets/db_password) " f'psql -U discourse -d discourse -tAc "{sql}"'
'PGPASSWORD=$(cat /run/secrets/db_password) '
f'psql -U discourse -d discourse -tAc "{sql}"'
)
return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
@ -41,7 +42,6 @@ def pre_backup(domain, meta):
def pre_restore(domain, meta):
# diverge from the backup so a successful restore is observable
_psql(domain, "DROP TABLE IF EXISTS ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in ("", "NULL"), (
"drop did not take"
)
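
The `_psql` change in this file only splits one line into two adjacent string literals. A small sketch, using hypothetical function names, confirms that both spellings produce the same shell command:

```python
def build_cmd_one_line(sql):
    # A plain literal followed by an f-string on the same line.
    return "PGPASSWORD=$(cat /run/secrets/db_password) " f'psql -U discourse -d discourse -tAc "{sql}"'


def build_cmd_two_lines(sql):
    # Adjacent literals inside parentheses are joined at compile time;
    # only the f-string part is interpolated.
    return (
        'PGPASSWORD=$(cat /run/secrets/db_password) '
        f'psql -U discourse -d discourse -tAc "{sql}"'
    )


query = "SELECT v FROM ci_marker;"
print(build_cmd_two_lines(query))
```

Note that the SQL is placed inside double quotes for the shell, so this pattern is only safe for statements without `"`, `$`, or backticks; the statements shown in this diff use single quotes only.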

View File

@ -6,9 +6,7 @@
# app is actually serving (the canonical "is discourse up" signal — NOT "/", which may redirect to setup).
HEALTH_PATH = "/srv/status"
HEALTH_OK = (200,)
DEPLOY_TIMEOUT = (
3600 # slow Rails cold boot (15-25min) on the 7-GiB single node; bumped 2400→3600 for
)
DEPLOY_TIMEOUT = 3600 # slow Rails cold boot (15-25min) on the 7-GiB single node; bumped 2400→3600 for
# headroom after full4's base deploy timed out at 2400s (RAM/CPU-constrained boot + image re-pull).
HTTP_TIMEOUT = 1200
@ -61,11 +59,7 @@ def BACKUP_VERIFY(domain):
try:
out = lifecycle.exec_in_app(
domain,
[
"sh",
"-c",
"gzip -t /var/lib/postgresql/data/backup.sql && wc -c < /var/lib/postgresql/data/backup.sql",
],
["sh", "-c", "gzip -t /var/lib/postgresql/data/backup.sql && wc -c < /var/lib/postgresql/data/backup.sql"],
service="db",
timeout=60,
).strip()
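
`BACKUP_VERIFY` runs `gzip -t` followed by `wc -c` inside the db container. For readers who want to see what that pair checks, here is a local Python approximation; `verify_gzip` and the temporary file names are hypothetical, and the real check runs via `lifecycle.exec_in_app`, not like this.

```python
import gzip
import os
import tempfile
import zlib


def verify_gzip(path):
    """Read the whole stream (which validates the CRC and length trailer),
    then return the compressed size in bytes, like `gzip -t f && wc -c < f`."""
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(1 << 20):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        raise ValueError(f"{path} is not a valid gzip stream: {exc}") from exc
    return os.path.getsize(path)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "backup.sql.gz")
    with gzip.open(good, "wb") as fh:
        fh.write(b"CREATE TABLE ci_marker (v text);\n" * 100)
    good_size = verify_gzip(good)

    # Simulate an interrupted dump by cutting the file in half.
    truncated = os.path.join(tmp, "truncated.sql.gz")
    with open(good, "rb") as src, open(truncated, "wb") as dst:
        dst.write(src.read()[: good_size // 2])
    try:
        verify_gzip(truncated)
        truncated_accepted = True
    except ValueError:
        truncated_accepted = False

print(good_size > 0, truncated_accepted)
```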

View File

@ -14,12 +14,13 @@ from harness import lifecycle # noqa: E402
def _psql(domain, sql):
cmd = (
"PGPASSWORD=$(cat /run/secrets/db_password) " f'psql -U discourse -d discourse -tAc "{sql}"'
'PGPASSWORD=$(cat /run/secrets/db_password) '
f'psql -U discourse -d discourse -tAc "{sql}"'
)
return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
def test_backup_captures_state(live_app):
assert (
_psql(live_app, "SELECT v FROM ci_marker;") == "original"
), "the seeded discourse postgres state was not present at backup time"
assert _psql(live_app, "SELECT v FROM ci_marker;") == "original", (
"the seeded discourse postgres state was not present at backup time"
)

View File

@ -14,12 +14,13 @@ from harness import lifecycle # noqa: E402
def _psql(domain, sql):
cmd = (
"PGPASSWORD=$(cat /run/secrets/db_password) " f'psql -U discourse -d discourse -tAc "{sql}"'
'PGPASSWORD=$(cat /run/secrets/db_password) '
f'psql -U discourse -d discourse -tAc "{sql}"'
)
return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
def test_restore_returns_state(live_app):
assert (
_psql(live_app, "SELECT v FROM ci_marker;") == "original"
), "restore did not return the pre-mutation discourse postgres state (data-integrity failure)"
assert _psql(live_app, "SELECT v FROM ci_marker;") == "original", (
"restore did not return the pre-mutation discourse postgres state (data-integrity failure)"
)

View File

@ -93,10 +93,9 @@ class GhostAdmin:
status, body = self.req(
"POST", "/session/", {"username": ADMIN_EMAIL, "password": ADMIN_PW}
)
assert status in (
200,
201,
), f"ghost admin session login failed: HTTP {status}, body={body!r}"
assert status in (200, 201), (
f"ghost admin session login failed: HTTP {status}, body={body!r}"
)
def create_post(self, title: str, html: str) -> dict:
status, body = self.req(

View File

@ -53,15 +53,13 @@ def test_ghost_admin_route_is_wired(live_app):
return None
status_body = harness_http.assert_converges(
_ready,
f"GET {url} returns Ghost admin (200) or setup redirect (302)",
max_wait=60,
interval=3,
_ready, f"GET {url} returns Ghost admin (200) or setup redirect (302)",
max_wait=60, interval=3,
)
status, body = status_body
assert status in (200, 302), f"unexpected status: {status}"
if status == 200:
# The admin SPA references /ghost-assets/ or contains "ghost" in title/body
assert (
"ghost" in body.lower()
), f"GET {url} 200 but body has no Ghost markers: {body[:200]!r}"
assert "ghost" in body.lower(), (
f"GET {url} 200 but body has no Ghost markers: {body[:200]!r}"
)

View File

@ -35,10 +35,10 @@ def test_content_api_settings_endpoint(live_app):
assert body is not None, f"GET {url} returned non-JSON body"
# On success: {"settings": {...}}. On error: {"errors": [...]}. Either shape is valid.
if status == 200:
assert (
isinstance(body, dict) and "settings" in body
), f"200 response missing 'settings' envelope: {body!r}"
assert isinstance(body, dict) and "settings" in body, (
f"200 response missing 'settings' envelope: {body!r}"
)
else:
assert isinstance(body, dict) and (
"errors" in body or "message" in body or body
), f"error response not a proper Ghost error envelope: {body!r}"
assert isinstance(body, dict) and ("errors" in body or "message" in body or body), (
f"error response not a proper Ghost error envelope: {body!r}"
)

View File

@ -43,17 +43,17 @@ def test_create_post_roundtrip(live_app):
title = f"ccci-marker-{uniq}"
marker = f"ccci-body-marker-{uniq}-roundtrip"
created = admin.create_post(title, f"<p>{marker}</p>")
assert (
created.get("title") == title
), f"created post title mismatch: sent {title!r}, got {created.get('title')!r}"
assert created.get("title") == title, (
f"created post title mismatch: sent {title!r}, got {created.get('title')!r}"
)
# 4) Read it back by id and assert the post survived the round-trip (title always returned;
# html returned because we requested ?formats=html).
got = admin.get_post(created["id"])
assert (
got.get("title") == title
), f"post title did not round-trip: sent {title!r}, got {got.get('title')!r}"
assert got.get("title") == title, (
f"post title did not round-trip: sent {title!r}, got {got.get('title')!r}"
)
html = got.get("html") or ""
assert (
marker in html
), f"post body did not round-trip: marker {marker!r} not in read-back html {html!r}"
assert marker in html, (
f"post body did not round-trip: marker {marker!r} not in read-back html {html!r}"
)

View File

@ -15,9 +15,7 @@ set -euo pipefail
: "${CCCI_RECIPE:?missing CCCI_RECIPE}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Resolve the recipe tree the way abra does: $ABRA_DIR (the per-run tree inside a CI run) else
# the canonical ~/.abra — the overlay must land in the tree this run actually deploys from.
RECIPE_DIR="${ABRA_DIR:-${HOME}/.abra}/recipes/${CCCI_RECIPE}"
RECIPE_DIR="${HOME}/.abra/recipes/${CCCI_RECIPE}"
if [ ! -d "$RECIPE_DIR" ]; then
echo " ghost install_steps: recipe dir $RECIPE_DIR missing — cannot provide compose.ccci.yml" >&2

View File

@ -22,7 +22,10 @@ from harness import lifecycle # noqa: E402
def _mysql(domain, sql):
cmd = 'MYSQL_PWD="$(cat /run/secrets/db_password)" ' f'mysql -u root -N -s ghost -e "{sql}"'
cmd = (
'MYSQL_PWD="$(cat /run/secrets/db_password)" '
f'mysql -u root -N -s ghost -e "{sql}"'
)
return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()

View File

@ -63,11 +63,7 @@ def BACKUP_VERIFY(domain):
try:
out = lifecycle.exec_in_app(
domain,
[
"sh",
"-c",
"gzip -t /var/lib/mysql/backup.sql.gz && wc -c < /var/lib/mysql/backup.sql.gz",
],
["sh", "-c", "gzip -t /var/lib/mysql/backup.sql.gz && wc -c < /var/lib/mysql/backup.sql.gz"],
service="db",
timeout=60,
).strip()

View File

@ -15,11 +15,14 @@ from harness import lifecycle # noqa: E402
def _mysql(domain, sql):
cmd = 'MYSQL_PWD="$(cat /run/secrets/db_password)" ' f'mysql -u root -N -s ghost -e "{sql}"'
cmd = (
'MYSQL_PWD="$(cat /run/secrets/db_password)" '
f'mysql -u root -N -s ghost -e "{sql}"'
)
return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
def test_backup_captures_state(live_app):
assert (
_mysql(live_app, "SELECT v FROM ci_marker;") == "original"
), "the seeded ghost MySQL marker was not present at backup time"
assert _mysql(live_app, "SELECT v FROM ci_marker;") == "original", (
"the seeded ghost MySQL marker was not present at backup time"
)

View File

@ -22,7 +22,10 @@ from harness import lifecycle # noqa: E402
def _mysql(domain, sql):
cmd = 'MYSQL_PWD="$(cat /run/secrets/db_password)" ' f'mysql -u root -N -s ghost -e "{sql}"'
cmd = (
'MYSQL_PWD="$(cat /run/secrets/db_password)" '
f'mysql -u root -N -s ghost -e "{sql}"'
)
return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()

View File

@ -14,11 +14,14 @@ from harness import lifecycle # noqa: E402
def _mysql(domain, sql):
cmd = 'MYSQL_PWD="$(cat /run/secrets/db_password)" ' f'mysql -u root -N -s ghost -e "{sql}"'
cmd = (
'MYSQL_PWD="$(cat /run/secrets/db_password)" '
f'mysql -u root -N -s ghost -e "{sql}"'
)
return lifecycle.exec_in_app(domain, ["sh", "-c", cmd], service="db").strip()
def test_upgrade_preserves_state(live_app):
assert (
_mysql(live_app, "SELECT v FROM ci_marker;") == "upgrade-survives"
), "the seeded ghost MySQL marker did not survive the upgrade redeploy (data loss on upgrade)"
assert _mysql(live_app, "SELECT v FROM ci_marker;") == "upgrade-survives", (
"the seeded ghost MySQL marker did not survive the upgrade redeploy (data loss on upgrade)"
)

View File

@ -14,6 +14,7 @@ import urllib.request
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", "runner"))
from harness import http as harness_http # noqa: E402
_CTX = ssl.create_default_context()
_CTX.check_hostname = False
_CTX.verify_mode = ssl.CERT_NONE

View File

@ -15,5 +15,7 @@ from harness import http as harness_http # noqa: E402
def test_hedgedoc_root_serves(live_app):
"""GET / → 200 or 302 (login/new redirect)."""
url = f"https://{live_app}/"
status, _ = harness_http.retry_http_get(url, expect_status=(200, 302), max_wait=90, interval=5)
status, _ = harness_http.retry_http_get(
url, expect_status=(200, 302), max_wait=90, interval=5
)
assert status in (200, 302), f"GET {url} HTTP {status} (expected 200 or 302)"
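
The implementation of `harness_http.retry_http_get` is not part of this diff. Judging only from its call signature (`expect_status`, `max_wait`, `interval`), it presumably polls until an accepted status arrives or the wait budget runs out. A sketch of that pattern, with every name below being an assumption:

```python
import time


def retry_until_status(fetch, expect_status, max_wait=90, interval=5,
                       sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it returns an accepted status or max_wait elapses.

    fetch is any callable returning (status, body). The last response is
    returned either way, so the caller's own assert produces the message.
    """
    expected = (expect_status,) if isinstance(expect_status, int) else tuple(expect_status)
    deadline = clock() + max_wait
    while True:
        status, body = fetch()
        if status in expected or clock() >= deadline:
            return status, body
        sleep(interval)


# Fake endpoint: two 502s while the app boots, then a redirect.
responses = iter([(502, None), (502, None), (302, "redirect to /new")])
calls = []


def fake_fetch():
    calls.append(1)
    return next(responses)


status, body = retry_until_status(fake_fetch, (200, 302), sleep=lambda seconds: None)
print(status, len(calls))
```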

View File

@ -111,13 +111,13 @@ def test_immich_processes_uploaded_asset_metadata_and_statistics(live_app):
if exif and exif.get("exifImageWidth"):
break
time.sleep(5)
assert (
exif and exif.get("exifImageWidth") == 1 and exif.get("exifImageHeight") == 1
), f"immich metadata-extraction did not populate the 1x1 PNG dimensions in exifInfo: {exif!r}"
assert exif and exif.get("exifImageWidth") == 1 and exif.get("exifImageHeight") == 1, (
f"immich metadata-extraction did not populate the 1x1 PNG dimensions in exifInfo: {exif!r}"
)
# the asset is catalogued into the owner's library statistics (list-back in aggregate)
sst, stats = harness_http.http_request("GET", f"{base}/api/assets/statistics", headers=auth)
assert sst == 200 and isinstance(stats, dict), f"statistics HTTP {sst}: {stats!r}"
assert (
stats.get("images", 0) >= 1 and stats.get("total", 0) >= 1
), f"uploaded asset not reflected in library statistics: {stats!r}"
assert stats.get("images", 0) >= 1 and stats.get("total", 0) >= 1, (
f"uploaded asset not reflected in library statistics: {stats!r}"
)

View File

@ -121,6 +121,6 @@ def test_immich_upload_asset_readback_and_thumbnail(live_app):
if thumb == 200:
break
time.sleep(5)
assert (
thumb == 200
), f"immich did not generate a thumbnail/derivative for the uploaded asset (last HTTP {thumb})"
assert thumb == 200, (
f"immich did not generate a thumbnail/derivative for the uploaded asset (last HTTP {thumb})"
)

View File

@ -16,11 +16,5 @@ from harness import http as harness_http # noqa: E402
def test_immich_returns_200(live_app):
url = f"https://{live_app}/"
status, _ = harness_http.retry_http_get(
url, expect_status=(200, 301, 302), max_wait=60, interval=3
)
assert status in (
200,
301,
302,
), f"immich at {url} returned HTTP {status} (expected 200/301/302)"
status, _ = harness_http.retry_http_get(url, expect_status=(200, 301, 302), max_wait=60, interval=3)
assert status in (200, 301, 302), f"immich at {url} returned HTTP {status} (expected 200/301/302)"

View File

@ -35,7 +35,4 @@ def pre_backup(domain, meta):
def pre_restore(domain, meta):
_psql(domain, "DROP TABLE ci_marker;")
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in (
"",
"NULL",
), "drop did not take"
assert _psql(domain, "SELECT to_regclass('public.ci_marker');") in ("", "NULL"), "drop did not take"

View File

@ -14,6 +14,4 @@ def _psql(domain, sql):
def test_backup_captures_state(live_app):
assert (
_psql(live_app, "SELECT v FROM ci_marker;") == "original"
), "seeded postgres state not present at backup time"
assert _psql(live_app, "SELECT v FROM ci_marker;") == "original", "seeded postgres state not present at backup time"

View File

@ -7,8 +7,7 @@ import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import browser as harness_browser # noqa: E402
from harness import generic, lifecycle
from harness import browser as harness_browser, generic, lifecycle # noqa: E402
def test_serving_and_frontend(live_app, meta):
@ -26,11 +25,7 @@ def test_serving_and_frontend(live_app, meta):
resp = harness_browser.goto_with_retry(
page, url, accept_statuses=(200, 301, 302), goto_timeout_ms=60_000
)
assert resp is not None and resp.status in (
200,
301,
302,
), f"page status {resp and resp.status}"
assert resp is not None and resp.status in (200, 301, 302), f"page status {resp and resp.status}"
assert "<html" in page.content().lower(), "no HTML served by the immich frontend"
finally:
browser.close()

View File

@ -14,6 +14,4 @@ def _psql(domain, sql):
def test_restore_returns_state(live_app):
assert (
_psql(live_app, "SELECT v FROM ci_marker;") == "original"
), "restore did not return the pre-mutation postgres state"
assert _psql(live_app, "SELECT v FROM ci_marker;") == "original", "restore did not return the pre-mutation postgres state"

View File

@ -14,6 +14,4 @@ def _psql(domain, sql):
def test_upgrade_preserves_data(live_app):
assert (
_psql(live_app, "SELECT v FROM ci_marker;") == "upgrade-survives"
), "postgres data did not survive the upgrade"
assert _psql(live_app, "SELECT v FROM ci_marker;") == "upgrade-survives", "postgres data did not survive the upgrade"

View File

@ -120,9 +120,9 @@ def test_create_confidential_client_and_obtain_token(live_app):
"clientId": client_id,
"enabled": True,
"secret": client_secret,
"publicClient": False, # confidential client
"serviceAccountsEnabled": True, # required for client_credentials grant
"standardFlowEnabled": False, # not needed for service-account-only client
"publicClient": False, # confidential client
"serviceAccountsEnabled": True, # required for client_credentials grant
"standardFlowEnabled": False, # not needed for service-account-only client
"directAccessGrantsEnabled": False,
"protocol": "openid-connect",
}
@ -144,25 +144,25 @@ def test_create_confidential_client_and_obtain_token(live_app):
# Use the client to obtain its own token (client_credentials grant)
tok_status, tok_resp = _client_credentials_token(live_app, client_id, client_secret)
assert (
tok_status == 200
), f"client_credentials token returned HTTP {tok_status}: {tok_resp!r}"
assert tok_status == 200, (
f"client_credentials token returned HTTP {tok_status}: {tok_resp!r}"
)
access_token = tok_resp.get("access_token") if isinstance(tok_resp, dict) else None
assert (
isinstance(access_token, str) and access_token.count(".") == 2
), f"client_credentials access_token not a JWT: {access_token!r}"
assert isinstance(access_token, str) and access_token.count(".") == 2, (
f"client_credentials access_token not a JWT: {access_token!r}"
)
# Decode the JWT payload; assert azp matches the new client
payload = json.loads(_b64url_decode(access_token.split(".")[1]))
assert (
payload.get("azp") == client_id
), f"client_credentials JWT azp={payload.get('azp')!r} != client_id={client_id!r}"
assert payload.get("azp") == client_id, (
f"client_credentials JWT azp={payload.get('azp')!r} != client_id={client_id!r}"
)
# Service-account token does NOT carry a session-scoped user (azp + clientId differ from
# admin-cli token). The presence of azp + iss == per-run-domain proves the issuance flow.
expected_iss = f"https://{live_app}/realms/master"
assert (
payload.get("iss") == expected_iss
), f"JWT iss={payload.get('iss')!r} != {expected_iss!r}"
assert payload.get("iss") == expected_iss, (
f"JWT iss={payload.get('iss')!r} != {expected_iss!r}"
)
finally:
# Idempotent cleanup
if cleanup_id:

View File

@ -43,20 +43,22 @@ def test_password_grant_issues_valid_jwt(live_app):
token = kc_admin.admin_token(live_app, password)
# Shape: a JWT is exactly 3 base64url segments
assert (
isinstance(token, str) and token.count(".") == 2
), f"access_token does not look like a JWT (no 3 segments): len={len(token) if token else 0}"
assert isinstance(token, str) and token.count(".") == 2, (
f"access_token does not look like a JWT (no 3 segments): len={len(token) if token else 0}"
)
payload = _decode_jwt_payload(token)
# iss = the issuer URL, must be the per-run domain's /realms/master endpoint
expected_iss = f"https://{live_app}/realms/master"
assert (
payload.get("iss") == expected_iss
), f"JWT iss claim {payload.get('iss')!r} != {expected_iss!r}"
assert payload.get("iss") == expected_iss, (
f"JWT iss claim {payload.get('iss')!r} != {expected_iss!r}"
)
# azp = authorized party (which client requested this token)
assert payload.get("azp") == "admin-cli", f"JWT azp claim {payload.get('azp')!r} != 'admin-cli'"
assert payload.get("azp") == "admin-cli", (
f"JWT azp claim {payload.get('azp')!r} != 'admin-cli'"
)
# typ = token type
assert payload.get("typ") == "Bearer", f"JWT typ claim {payload.get('typ')!r} != 'Bearer'"
@ -68,6 +70,6 @@ def test_password_grant_issues_valid_jwt(live_app):
# iat (issued at) is also a standard claim
iat = payload.get("iat")
assert (
isinstance(iat, int) and iat <= time.time() + 60
), f"JWT iat {iat!r} not a reasonable past timestamp"
assert isinstance(iat, int) and iat <= time.time() + 60, (
f"JWT iat {iat!r} not a reasonable past timestamp"
)
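
The tests above rely on helpers such as `_decode_jwt_payload` whose bodies are not shown in this diff. A plausible minimal version, assuming nothing beyond the standard library, looks like the sketch below. It only decodes the claims and does not verify the signature, which matches how the assertions use it; the demo token and domain are made up.

```python
import base64
import json
import time


def b64url_decode(segment):
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def decode_jwt_payload(token):
    """Return the claims of a compact JWT without checking its signature."""
    if not isinstance(token, str) or token.count(".") != 2:
        raise ValueError("not a three-segment JWT")
    return json.loads(b64url_decode(token.split(".")[1]))


# Build an unsigned demo token with hypothetical claims.
claims = {
    "iss": "https://kc.example.test/realms/master",
    "azp": "admin-cli",
    "typ": "Bearer",
    "iat": int(time.time()),
}
header = b64url_encode(json.dumps({"alg": "none"}).encode())
demo_token = f"{header}.{b64url_encode(json.dumps(claims).encode())}.signature"

payload = decode_jwt_payload(demo_token)
print(payload["iss"], payload["azp"])
```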

View File

@ -2,7 +2,5 @@
# conftest — enrolling this recipe needs NO change to runner/harness code (D5).
HEALTH_PATH = "/realms/master" # 200 JSON once keycloak is up (not "/", which redirects)
HEALTH_OK = (200,)
DEPLOY_TIMEOUT = (
900 # JVM + DB migration are slow on a 2-vCPU VM; observed 502 fallback up to ~10min
)
DEPLOY_TIMEOUT = 900 # JVM + DB migration are slow on a 2-vCPU VM; observed 502 fallback up to ~10min
HTTP_TIMEOUT = 900

View File

@ -8,8 +8,7 @@ import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import browser as harness_browser # noqa: E402
from harness import generic, lifecycle
from harness import browser as harness_browser, generic, lifecycle # noqa: E402
def test_serving_and_admin_console(live_app, meta):

View File

@ -28,7 +28,9 @@ def test_users_me_requires_auth(live_app):
url = f"https://{live_app}/api/v1.0/users/me/"
# Retry until the endpoint answers 401 or 403: either status shows the route exists and auth is
# required. Reject 200 (anonymous access), 404 (route missing) and 5xx (broken backend).
status, _ = harness_http.retry_http_get(url, expect_status=(401, 403), max_wait=60, interval=3)
status, _ = harness_http.retry_http_get(
url, expect_status=(401, 403), max_wait=60, interval=3
)
assert status in (401, 403), (
f"GET {url} returned {status}, expected 401 or 403 (auth required). "
f"200 = anonymous access leaked; 404 = route missing; 5xx = backend broken."

View File

@ -27,8 +27,7 @@ import uuid
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", "runner"))
from harness import http as harness_http # noqa: E402
from harness import sso
from harness import http as harness_http, sso # noqa: E402
@pytest.mark.requires_deps
@ -37,15 +36,13 @@ def test_create_doc_and_read_back(live_app, deps_creds):
kc = deps_creds["keycloak"]
# Obtain a JWT via OIDC password grant
access_token = sso.oidc_password_grant(
{
"client_id": kc["client_id"],
"client_secret": kc["client_secret"],
"user": kc["user"],
"password": kc["password"],
"token_url": kc["token_url"],
}
)
access_token = sso.oidc_password_grant({
"client_id": kc["client_id"],
"client_secret": kc["client_secret"],
"user": kc["user"],
"password": kc["password"],
"token_url": kc["token_url"],
})
auth = {"Authorization": f"Bearer {access_token}"}
# Create a doc with a unique title
@ -59,9 +56,9 @@ def test_create_doc_and_read_back(live_app, deps_creds):
assert isinstance(body, dict), f"unexpected response shape: {body!r}"
doc_id = body.get("id")
assert doc_id, f"created doc has no id: {body!r}"
assert (
body.get("title") == title
), f"created doc title mismatch: created={title!r}, response={body.get('title')!r}"
assert body.get("title") == title, (
f"created doc title mismatch: created={title!r}, response={body.get('title')!r}"
)
# Fetch it back via the dedicated GET endpoint
s, fetched = harness_http.http_get(
@ -69,10 +66,9 @@ def test_create_doc_and_read_back(live_app, deps_creds):
)
assert s == 200, f"GET /api/v1.0/documents/{doc_id}/ HTTP {s}: {fetched!r}"
assert isinstance(fetched, dict), f"unexpected GET response: {fetched!r}"
assert fetched.get("id") in (
doc_id,
str(doc_id),
), f"fetched id mismatch: created={doc_id!r}, fetched={fetched.get('id')!r}"
assert (
fetched.get("title") == title
), f"fetched title mismatch: created={title!r}, fetched={fetched.get('title')!r}"
assert fetched.get("id") in (doc_id, str(doc_id)), (
f"fetched id mismatch: created={doc_id!r}, fetched={fetched.get('id')!r}"
)
assert fetched.get("title") == title, (
f"fetched title mismatch: created={title!r}, fetched={fetched.get('title')!r}"
)

View File

@ -22,11 +22,7 @@ def test_lasuite_docs_returns_200(live_app):
url = f"https://{live_app}/"
# accept 200 (frontend SPA shell) — lasuite-docs serves the SPA at root unauthenticated;
# the SPA itself bootstraps via /api/v1.0/users/me/ which requires OIDC (separate test).
status, _ = harness_http.retry_http_get(
url, expect_status=(200, 301, 302), max_wait=60, interval=3
status, _ = harness_http.retry_http_get(url, expect_status=(200, 301, 302), max_wait=60, interval=3)
assert status in (200, 301, 302), (
f"lasuite-docs at {url} returned HTTP {status} (expected 200/301/302)"
)
assert status in (
200,
301,
302,
), f"lasuite-docs at {url} returned HTTP {status} (expected 200/301/302)"

View File

@ -25,8 +25,7 @@ import urllib.request
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "..", "runner"))
from harness import http as harness_http # noqa: E402
from harness import sso
from harness import http as harness_http, sso # noqa: E402
_CTX = ssl.create_default_context()
_CTX.check_hostname = False
@ -62,9 +61,9 @@ def test_oidc_login_via_keycloak(live_app, deps_creds):
# 302 redirect. Both are valid "auth-required" indicators — accept either, but if a
# redirect is returned it must point at the dep keycloak realm.
if status in (301, 302, 303, 307, 308):
assert expected_prefix in (
redirect or ""
), f"Docs redirected to {redirect!r}, expected to start with {expected_prefix!r}"
assert expected_prefix in (redirect or ""), (
f"Docs redirected to {redirect!r}, expected to start with {expected_prefix!r}"
)
else:
assert status in (401, 403), (
f"GET /api/v1.0/users/me/ unauth: HTTP {status}; expected redirect to keycloak "
@ -89,6 +88,6 @@ def test_oidc_login_via_keycloak(live_app, deps_creds):
)
assert status == 200, f"GET /api/v1.0/users/me/ with token HTTP {status}: {body!r}"
assert isinstance(body, dict), f"unexpected response: {body!r}"
assert (
body.get("email") == kc["email"]
), f"unexpected user email: got {body.get('email')!r}, expected {kc['email']!r}"
assert body.get("email") == kc["email"], (
f"unexpected user email: got {body.get('email')!r}, expected {kc['email']!r}"
)

View File

@ -42,9 +42,9 @@ def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
# Sanity-check the creds shape — orchestrator-written
assert kc["domain"]
# WC1: realm is per-run namespaced "<parent>-<6hex>" so concurrent dependents never collide.
assert re.fullmatch(
r"lasuite-docs-[0-9a-f]{6}", kc["realm"]
), f"realm {kc['realm']!r} not the per-run namespaced form lasuite-docs-<6hex>"
assert re.fullmatch(r"lasuite-docs-[0-9a-f]{6}", kc["realm"]), (
f"realm {kc['realm']!r} not the per-run namespaced form lasuite-docs-<6hex>"
)
assert kc["client_id"] == "lasuite-docs"
assert isinstance(kc["client_secret"], str) and len(kc["client_secret"]) >= 16
assert isinstance(kc["password"], str) and len(kc["password"]) >= 16
@ -74,14 +74,16 @@ def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
# Password grant → real JWT
token = sso.oidc_password_grant(creds)
assert isinstance(token, str) and token.count(".") == 2, f"access_token is not a JWT: {token!r}"
assert isinstance(token, str) and token.count(".") == 2, (
f"access_token is not a JWT: {token!r}"
)
payload = json.loads(_b64url_decode(token.split(".")[1]))
assert payload.get("iss") == expected_iss, f"JWT iss={payload.get('iss')!r} != {expected_iss!r}"
assert (
payload.get("azp") == kc["client_id"]
), f"JWT azp={payload.get('azp')!r} != {kc['client_id']!r}"
assert payload.get("azp") == kc["client_id"], (
f"JWT azp={payload.get('azp')!r} != {kc['client_id']!r}"
)
assert payload.get("typ") == "Bearer", f"JWT typ={payload.get('typ')!r} != 'Bearer'"
exp = payload.get("exp")
assert (
isinstance(exp, int) and exp > time.time()
), f"JWT exp={exp!r} not a future timestamp (now={time.time():.0f})"
assert isinstance(exp, int) and exp > time.time(), (
f"JWT exp={exp!r} not a future timestamp (now={time.time():.0f})"
)
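
The realm check earlier in this file uses `re.fullmatch` for the per-run form `lasuite-docs-<6hex>`. The sketch below shows why `fullmatch` rather than `match` is the right call. How the orchestrator actually generates the suffix is not shown in the diff, so `make_realm` is purely hypothetical.

```python
import re
import secrets

REALM_RE = re.compile(r"lasuite-docs-[0-9a-f]{6}")


def make_realm(parent="lasuite-docs"):
    # Hypothetical generator: token_hex(3) yields 6 lowercase hex characters.
    return f"{parent}-{secrets.token_hex(3)}"


def is_per_run_realm(realm):
    # fullmatch anchors at both ends, so extra trailing characters are rejected.
    return REALM_RE.fullmatch(realm) is not None


realm = make_realm()
too_long = "lasuite-docs-abc1234"
print(realm, is_per_run_realm(realm), is_per_run_realm(too_long))
```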

View File

@ -21,24 +21,15 @@ set -euo pipefail
: "${CCCI_APP_DOMAIN:?missing}"
: "${CCCI_DEPS_FILE:?missing}"
test -s "$CCCI_DEPS_FILE" || {
echo " setup_custom_tests: deps file empty"
exit 1
}
test -s "$CCCI_DEPS_FILE" || { echo " setup_custom_tests: deps file empty"; exit 1; }
# Read keycloak dep info via jq
KC_DOMAIN=$(jq -r '.keycloak.domain' "$CCCI_DEPS_FILE")
KC_REALM=$(jq -r '.keycloak.realm' "$CCCI_DEPS_FILE")
KC_CLIENT=$(jq -r '.keycloak.client_id' "$CCCI_DEPS_FILE")
KC_SECRET=$(jq -r '.keycloak.client_secret' "$CCCI_DEPS_FILE")
if [ -z "$KC_DOMAIN" ] || [ "$KC_DOMAIN" = "null" ]; then
echo " setup_custom_tests: no keycloak.domain in deps"
exit 1
fi
if [ -z "$KC_SECRET" ] || [ "$KC_SECRET" = "null" ]; then
echo " setup_custom_tests: no keycloak.client_secret"
exit 1
fi
KC_DOMAIN=$(jq -r '.keycloak.domain' "$CCCI_DEPS_FILE")
KC_REALM=$( jq -r '.keycloak.realm' "$CCCI_DEPS_FILE")
KC_CLIENT=$(jq -r '.keycloak.client_id' "$CCCI_DEPS_FILE")
KC_SECRET=$(jq -r '.keycloak.client_secret' "$CCCI_DEPS_FILE")
[ -n "$KC_DOMAIN" ] && [ "$KC_DOMAIN" != "null" ] || { echo " setup_custom_tests: no keycloak.domain in deps"; exit 1; }
[ -n "$KC_SECRET" ] && [ "$KC_SECRET" != "null" ] || { echo " setup_custom_tests: no keycloak.client_secret"; exit 1; }
echo " lasuite-docs setup_custom_tests: wiring OIDC against keycloak dep ${KC_DOMAIN}"
@ -48,15 +39,12 @@ echo " lasuite-docs setup_custom_tests: wiring OIDC against keycloak dep ${KC_D
# update SECRET_OIDC_RPCS_VERSION in the .env to point at the new one.
ENV_PATH="$HOME/.abra/servers/default/${CCCI_APP_DOMAIN}.env"
CUR_VER=$(grep -E '^\s*SECRET_OIDC_RPCS_VERSION=' "$ENV_PATH" | tail -1 | cut -d= -f2 | tr -d '"\r' || echo "v1")
NEW_NUM=$((${CUR_VER#v} + 1))
NEW_NUM=$(( ${CUR_VER#v} + 1 ))
NEW_VER="v${NEW_NUM}"
INSERT_LOG=$(abra app secret insert "$CCCI_APP_DOMAIN" oidc_rpcs "$NEW_VER" "$KC_SECRET" --no-input -C -o 2>&1) ||
INSERT_LOG=$(script -qec "abra app secret insert $CCCI_APP_DOMAIN oidc_rpcs $NEW_VER $KC_SECRET --no-input -C -o" /dev/null 2>&1) ||
{
echo " setup_custom_tests: abra app secret insert oidc_rpcs@$NEW_VER failed: $INSERT_LOG"
exit 1
}
INSERT_LOG=$(abra app secret insert $CCCI_APP_DOMAIN oidc_rpcs $NEW_VER $KC_SECRET --no-input -C -o 2>&1) \
|| INSERT_LOG=$(script -qec "abra app secret insert $CCCI_APP_DOMAIN oidc_rpcs $NEW_VER $KC_SECRET --no-input -C -o" /dev/null 2>&1) \
|| { echo " setup_custom_tests: abra app secret insert oidc_rpcs@$NEW_VER failed: $INSERT_LOG"; exit 1; }
# Repoint the env var to the new version
sed -i "s|^\s*SECRET_OIDC_RPCS_VERSION=.*|SECRET_OIDC_RPCS_VERSION=$NEW_VER|" "$ENV_PATH"
echo " setup_custom_tests: oidc_rpcs secret inserted at $NEW_VER (was $CUR_VER)"
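
The version bump above reads the last `SECRET_OIDC_RPCS_VERSION` from the `.env`, defaults to `v1`, and increments the number. A Python rendering of the same arithmetic may make the steps easier to follow. The function names are assumptions, and unlike `cut -d= -f2` this sketch keeps everything after the first `=`, which is equivalent for values such as `v3`.

```python
import re


def current_version(env_text, default="v1"):
    """Last SECRET_OIDC_RPCS_VERSION value, with quotes and CRs removed."""
    found = re.findall(r"^\s*SECRET_OIDC_RPCS_VERSION=(.*)$", env_text, flags=re.MULTILINE)
    if not found:
        return default
    return found[-1].replace('"', "").replace("\r", "")


def next_version(cur):
    # Same idea as NEW_NUM=$(( ${CUR_VER#v} + 1 )): strip the leading "v", add one.
    number = int(cur[1:] if cur.startswith("v") else cur)
    return f"v{number + 1}"


# Hypothetical .env content with a stale and a live entry.
env_text = 'TYPE=lasuite-docs\nSECRET_OIDC_RPCS_VERSION=v1\nSECRET_OIDC_RPCS_VERSION="v3"\r\n'
cur = current_version(env_text)
print(cur, next_version(cur))
```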
@ -64,25 +52,25 @@ echo " setup_custom_tests: oidc_rpcs secret inserted at $NEW_VER (was $CUR_VER)
# 2) Write OIDC env vars to the app's .env (names per lasuite-docs's .env.sample).
# Ensure the file ends with a newline FIRST so our appends don't concatenate onto the last line
# (a file without a trailing newline previously produced the malformed `TIMEOUT=900OIDC_REALM=...`).
[ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >>"$ENV_PATH"
write_env() {
[ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >> "$ENV_PATH"
write_env () {
local key="$1" val="$2"
# remove any existing key (commented or live) then append the live key=val
sed -i "/^\s*#\?\s*${key}=/d" "$ENV_PATH"
# Re-ensure trailing newline after each delete (sed may leave the file without one)
[ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >>"$ENV_PATH"
printf '%s=%s\n' "$key" "$val" >>"$ENV_PATH"
[ -z "$(tail -c1 "$ENV_PATH" 2>/dev/null)" ] || printf '\n' >> "$ENV_PATH"
printf '%s=%s\n' "$key" "$val" >> "$ENV_PATH"
}
write_env OIDC_REALM "$KC_REALM"
write_env OIDC_OP_DISCOVERY_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/.well-known/openid-configuration"
write_env OIDC_OP_AUTHORIZATION_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/auth"
write_env OIDC_OP_TOKEN_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/token"
write_env OIDC_OP_USER_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/userinfo"
write_env OIDC_OP_LOGOUT_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/logout"
write_env OIDC_OP_JWKS_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/certs"
write_env OIDC_RP_CLIENT_ID "$KC_CLIENT"
write_env OIDC_RP_SIGN_ALGO "RS256"
write_env OIDC_RP_SCOPES "openid email profile"
write_env OIDC_REALM "$KC_REALM"
write_env OIDC_OP_DISCOVERY_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/.well-known/openid-configuration"
write_env OIDC_OP_AUTHORIZATION_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/auth"
write_env OIDC_OP_TOKEN_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/token"
write_env OIDC_OP_USER_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/userinfo"
write_env OIDC_OP_LOGOUT_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/logout"
write_env OIDC_OP_JWKS_ENDPOINT "https://${KC_DOMAIN}/realms/${KC_REALM}/protocol/openid-connect/certs"
write_env OIDC_RP_CLIENT_ID "$KC_CLIENT"
write_env OIDC_RP_SIGN_ALGO "RS256"
write_env OIDC_RP_SCOPES "openid email profile"
# 3) Trigger an in-place redeploy so the env update takes effect. --force re-deploys even when
# the recipe hasn't changed; --chaos avoids the chaos prompt; --no-input non-interactive.

View File

@@ -10,8 +10,7 @@ import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "runner"))
from harness import browser as harness_browser # noqa: E402
from harness import generic, lifecycle
from harness import browser as harness_browser, generic, lifecycle # noqa: E402
def test_serving_and_frontend(live_app, meta):

View File

@@ -25,8 +25,6 @@ def test_lasuite_drive_returns_200(live_app):
status, _ = harness_http.retry_http_get(
url, expect_status=(200, 301, 302), max_wait=60, interval=3
)
assert status in (
200,
301,
302,
), f"lasuite-drive at {url} returned HTTP {status} (expected 200/301/302)"
assert status in (200, 301, 302), (
f"lasuite-drive at {url} returned HTTP {status} (expected 200/301/302)"
)

View File

@@ -29,8 +29,8 @@ BUCKET = "drive-media-storage"
def _mc(domain: str, script: str) -> str:
"""Run an `mc` shell script inside the minio container (root creds from /run/secrets)."""
prelude = (
"set -e; "
"U=$(cat /run/secrets/minio_ru); P=$(cat /run/secrets/minio_rp); "
'set -e; '
'U=$(cat /run/secrets/minio_ru); P=$(cat /run/secrets/minio_rp); '
'mc alias set ccci http://localhost:9000 "$U" "$P" >/dev/null 2>&1; '
)
return lifecycle.exec_in_app(domain, ["sh", "-c", prelude + script], service="minio")
@@ -49,13 +49,13 @@ def test_minio_bucket_present_and_object_roundtrip(live_app):
domain,
# upload via stdin; list the object; read it back (tagged); then delete.
f'printf %s "{marker}" | mc pipe ccci/{BUCKET}/{key} >/dev/null 2>&1; '
f"mc ls ccci/{BUCKET}/{key}; "
f'mc ls ccci/{BUCKET}/{key}; '
f'echo "READBACK:$(mc cat ccci/{BUCKET}/{key})"; '
f"mc rm ccci/{BUCKET}/{key} >/dev/null 2>&1",
f'mc rm ccci/{BUCKET}/{key} >/dev/null 2>&1',
)
# The object was listed (its key appears) and its content round-tripped intact.
assert f"{marker}.txt" in out, f"uploaded object not listed in bucket: {out!r}"
assert (
f"READBACK:{marker}" in out
), f"object content did not round-trip through MinIO; got: {out!r}"
assert f"READBACK:{marker}" in out, (
f"object content did not round-trip through MinIO; got: {out!r}"
)

View File

@@ -46,9 +46,9 @@ def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
# Creds shape. WC1: realm is per-run namespaced "<parent>-<6hex>"; client_id stays the parent.
assert kc["domain"]
assert re.fullmatch(
r"lasuite-drive-[0-9a-f]{6}", kc["realm"]
), f"realm {kc['realm']!r} not the per-run namespaced form lasuite-drive-<6hex>"
assert re.fullmatch(r"lasuite-drive-[0-9a-f]{6}", kc["realm"]), (
f"realm {kc['realm']!r} not the per-run namespaced form lasuite-drive-<6hex>"
)
assert kc["client_id"] == "lasuite-drive"
assert isinstance(kc["client_secret"], str) and len(kc["client_secret"]) >= 16
assert isinstance(kc["password"], str) and len(kc["password"]) >= 16
@@ -77,14 +77,16 @@ def test_oidc_password_grant_against_dep_keycloak(live_app, deps_creds):
# Password grant → real JWT
token = sso.oidc_password_grant(creds)
assert isinstance(token, str) and token.count(".") == 2, f"access_token is not a JWT: {token!r}"
assert isinstance(token, str) and token.count(".") == 2, (
f"access_token is not a JWT: {token!r}"
)
payload = json.loads(_b64url_decode(token.split(".")[1]))
assert payload.get("iss") == expected_iss, f"JWT iss={payload.get('iss')!r} != {expected_iss!r}"
assert (
payload.get("azp") == kc["client_id"]
), f"JWT azp={payload.get('azp')!r} != {kc['client_id']!r}"
assert payload.get("azp") == kc["client_id"], (
f"JWT azp={payload.get('azp')!r} != {kc['client_id']!r}"
)
assert payload.get("typ") == "Bearer", f"JWT typ={payload.get('typ')!r} != 'Bearer'"
exp = payload.get("exp")
assert (
isinstance(exp, int) and exp > time.time()
), f"JWT exp={exp!r} not a future timestamp (now={time.time():.0f})"
assert isinstance(exp, int) and exp > time.time(), (
f"JWT exp={exp!r} not a future timestamp (now={time.time():.0f})"
)

Some files were not shown because too many files have changed in this diff