203 lines
12 KiB
Markdown
203 lines
12 KiB
Markdown
# Concurrency restructure — full implementation plan
|
||
|
||
Expands `concurrency-restructure-plan.md` (operator-approved 2026-06-10).
|
||
Reference spec of current system: cc-ci `docs/concurrency.md`.
|
||
|
||
## Decisions locked (operator + orchestrator)
|
||
|
||
| Question | Decision |
|
||
|---|---|
|
||
| Hard deadline | **60 min** (`signal.alarm(60*60)`); recipes keep their own smaller per-tier deadlines |
|
||
| `servers/` in per-run ABRA_DIR | **symlink** to `/root/.abra/servers` — app `.env` files land in the shared canonical path, so the janitor's `abra app ls` discovery and env-based teardown work unchanged (kills the P3 orphan-env risk). Per-domain filenames + app-domain lock prevent write conflicts. `catalogue/` also symlinked (read-mostly). `recipes/` is fresh per-run — that's the isolation that matters. |
|
||
| Landing | one branch `restructure/concurrency`, one commit per phase (P1..P5 + tests), merged to main only after adversary pass + lint green; live verification on main afterwards |
|
||
| Concurrency test suite | `tests/concurrency/` — NOT in the default `pytest tests/unit` run; invoked explicitly (`pytest tests/concurrency -q`); documented in docs/concurrency.md |
|
||
|
||
## Target invariant chain
|
||
|
||
```
|
||
lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min deadline
|
||
```
|
||
|
||
Never steal a held lock; manage the holder's lifetime. Held lock = live owner (kernel-guaranteed).
|
||
|
||
---
|
||
|
||
## P1 — Lock-lifetime hardening
|
||
|
||
`runner/run_recipe_ci.py` (startup, before any abra call):
|
||
|
||
1. `PR_SET_PDEATHSIG(SIGTERM)` via `ctypes.CDLL("libc.so.6", use_errno=True).prctl(1, signal.SIGTERM)`;
|
||
check return code; then `if os.getppid() == 1: sys.exit("parent died before prctl — refusing to run orphaned")`.
|
||
2. SIGTERM handler: run teardown (reuse the existing finally/teardown funnel — raise SystemExit
|
||
from the handler so `finally:` blocks run), exit non-zero.
|
||
3. Self-deadline: `signal.alarm(60*60)` + SIGALRM handler → same teardown funnel, distinct log line
|
||
(`== run exceeded 60-minute hard deadline — tearing down ==`), exit non-zero.
|
||
|
||
`.drone.yml` recipe-ci step: wrap the harness invocation:
|
||
```sh
|
||
setsid python3 runner/run_recipe_ci.py ... &
|
||
PID=$!; trap 'kill -TERM -- -$PID 2>/dev/null' TERM EXIT; wait $PID
|
||
```
|
||
(builder: verify drone exec runner signal delivery semantics; the trap must fire on drone cancel.)
|
||
|
||
Also: one-line comment on the lock `open()` noting PEP 446 makes the fd non-inheritable
|
||
(subprocess children never carry the lock).
|
||
|
||
## P2 — Flock-probe janitor (replaces the pidfile registry)
|
||
|
||
`runner/harness/lifecycle.py`:
|
||
|
||
1. New: `acquire_app_lock(domain)` — `/run/lock/cc-ci-app-<domain>.lock`, exclusive flock,
|
||
non-blocking try first → on `BlockingIOError` log
|
||
`== app lock: another run of <domain> is in flight — waiting ==` and block. Returns the kept-alive
|
||
file object. Touch mtime at acquisition (mtime = lock age for the long-held flag).
|
||
2. Call site: in `deploy_app()` exactly where `register_run_app()` is today (BEFORE app creation);
|
||
the file object must be stored on a module-level/owner object so it lives for the process lifetime.
|
||
3. Janitor rewrite (`janitor()`): for each candidate domain (discovery unchanged: `abra app ls`
|
||
matched against `RUN_APP_RE` + docker-service sweep):
|
||
- open/create its lockfile, try `LOCK_EX|LOCK_NB`:
|
||
- **acquired** → orphan → `teardown_app(domain, verify=False)` WHILE HOLDING the probe lock
|
||
(closes janitor-vs-new-run race: a new run of that domain blocks on the lock until the reap
|
||
finishes), then release + unlink lockfile.
|
||
- **held** → live run → leave it; if lockfile mtime older than 2× deadline (120 min), log:
|
||
`!! lock for <domain> held >120min — possible leaked run; inspect with lslocks` (flag, never steal).
|
||
4. Delete: `register_run_app`, `unregister_run_app`, `_run_owner_state`, `ACTIVE_RUN_DIR`,
|
||
`CCCI_JANITOR_MAX_AGE` handling + age fallback, pid-reuse guard, and their call sites
|
||
(`deploy_app`, `teardown_app`, both janitor call sites keep working — only internals change).
|
||
5. `teardown_app()` no longer unregisters; normal-exit lock release stays implicit (process exit).
|
||
Janitor unlinks lockfiles only for reaped orphans; a clean run's stale lockfile (0 bytes, unlocked)
|
||
is harmless and reaped-on-sight by the next janitor probe (unlink after acquiring probe lock,
|
||
even when no app exists for it — keep /run/lock tidy).
|
||
|
||
Edge semantics to preserve/verify:
|
||
- Two janitors racing: both probe; one wins the flock and reaps; the loser sees "held" and leaves.
|
||
Reap must be idempotent (`teardown_app(verify=False)` already tolerates half-gone stacks).
|
||
- Post-reboot: lockfiles gone (tmpfs) → probe trivially acquires → immediate reap. (Improvement
|
||
over the 2h age fallback — note in spec.)
|
||
- Warm/canonical apps never match `RUN_APP_RE` → never probed. Unchanged. DO NOT touch.
|
||
|
||
## P3 — Per-run ABRA_DIR (deletes the per-recipe flock)
|
||
|
||
`runner/run_recipe_ci.py` / `runner/harness/`:
|
||
|
||
1. Early in main(): build `run_abra_dir = /var/lib/cc-ci-runs/<build>/abra`; create it;
|
||
symlink `servers` → `/root/.abra/servers`, `catalogue` → `/root/.abra/catalogue` (builder:
|
||
verify the minimal set abra needs — probe showed it fails only on missing `servers/`);
|
||
fresh empty `recipes/`. Set `os.environ["ABRA_DIR"] = run_abra_dir` before ANY abra/harness call.
|
||
2. `fetch_recipe()` becomes a plain clone into `$ABRA_DIR/recipes/<recipe>` (no rm-rf of a shared
|
||
tree; keep the checkout-of-ref logic).
|
||
3. Delete: `acquire_recipe_lock`, `RECIPE_LOCK_DIR` recipe-lock usage, its call site in main(),
|
||
and the "must be taken before fetch_recipe" ordering rule.
|
||
4. `.drone.yml`: HOME=/root comment updated — abra config still found via symlinked servers;
|
||
recipe trees now per-run.
|
||
5. Janitor / warm-app / canonical flows that read `/root/.abra/recipes/...` (if any — builder:
|
||
grep) must be checked; they run outside a per-run ABRA_DIR and keep using the default.
|
||
6. Run-dir cleanup: `/var/lib/cc-ci-runs/<n>/` retention already exists — abra dir rides along.
|
||
|
||
## P4 — Config cleanup
|
||
|
||
- Remove `concurrency.limit` from `.drone.yml` recipe-ci pipeline; `DRONE_RUNNER_CAPACITY`
|
||
(nix/modules/drone-runner.nix `maxTests`) is the single knob. Update its comment.
|
||
|
||
## P5 — Spec rewrite
|
||
|
||
- Rewrite `docs/concurrency.md` to the new model: one mechanism (per-app-domain flock), the
|
||
invariant chain (§ lock-lifetime management), flock-probe janitor decision table, per-run
|
||
ABRA_DIR isolation, updated failure-mode table (cancel now kills harness; reboot reaps
|
||
immediately; deadline bounds everything), updated file/symbol index, and how to run
|
||
`tests/concurrency`.
|
||
|
||
---
|
||
|
||
## Test suite: `tests/concurrency/`
|
||
|
||
Real-kernel tests (no mocking of flock itself — spawn helper subprocesses that hold locks; use
|
||
tmp dirs for lock/run dirs via env/monkeypatch). NOT part of `pytest tests/unit`. Run with
|
||
`pytest tests/concurrency -q` (document in docs; CI does not run it on every build — optional
|
||
manual/nightly invocation).
|
||
|
||
Lock fundamentals:
|
||
1. acquire → child process SIGKILL'd → lock immediately acquirable (kernel auto-release).
|
||
2. probe (`LOCK_NB`) against a held lock raises `BlockingIOError`; against unheld succeeds.
|
||
3. lock fd NOT inherited: holder spawns a `subprocess` child, holder dies, child still alive →
|
||
lock acquirable (PEP 446).
|
||
4. second same-domain acquire blocks until first holder exits (double-!testme serialisation).
|
||
|
||
Janitor / probe semantics:
|
||
5. orphan (no holder) → reaped, lockfile unlinked; teardown invoked exactly once (stub teardown).
|
||
6. live (held by helper process) → never reaped, logged as live.
|
||
7. reap-under-probe-lock race: janitor holds probe lock mid-reap → a new run's acquire of the
|
||
same domain blocks until reap completes (no window where new app coexists with half-reaped one).
|
||
8. two concurrent janitors → exactly one reaps (flock arbitration), reap idempotent.
|
||
9. reboot simulation: app exists, lockfile absent → treated as orphan immediately (no age wait).
|
||
10. long-held flag: held lock with mtime backdated >120min → warning logged, NOT reaped.
|
||
11. `RUN_APP_RE` allowlist: warm/canonical-shaped names never become candidates.
|
||
12. garbled/unwritable lockfile, missing lock dir → janitor degrades safely (skip + log, never crash).
|
||
|
||
Lifetime hardening:
|
||
13. PDEATHSIG: spawn wrapper-parent + harness-child (helper script calling the real prctl setup);
|
||
kill parent → child receives SIGTERM and exits, lock released.
|
||
14. ppid race: helper started with already-dead parent (ppid 1) exits immediately.
|
||
15. deadline: helper with `alarm(2)` → teardown hook fires, exit non-zero, lock released.
|
||
16. SIGTERM → teardown funnel runs (`finally:` executed), lock released, non-zero exit.
|
||
|
||
ABRA_DIR isolation:
|
||
17. per-run dir built correctly: `servers`/`catalogue` symlinks resolve, `recipes/` empty + writable,
|
||
`ABRA_DIR` exported before first abra call (assert via stub abra recording its env).
|
||
18. two concurrent fetch_recipe of the SAME recipe into different ABRA_DIRs → both trees correct,
|
||
no cross-talk (the old corruption scenario, now safe without a lock).
|
||
19. app `.env` written through the servers symlink lands in the canonical shared path (janitor
|
||
discovery keeps working).
|
||
|
||
Each test must clean up helper processes in `finally`/fixtures (no leaked children in the test VM).
|
||
|
||
## Roles, gates & Definition of Done (loop protocol — plan.md §6.1 applies)
|
||
|
||
This phase runs as Builder/Adversary loops with phase-namespaced state files
|
||
(STATUS-conc.md, REVIEW-conc.md, BACKLOG-conc.md, JOURNAL-conc.md).
|
||
|
||
**Builder**: implements P1→P5 + the tests/concurrency suite on branch `restructure/concurrency`
|
||
in YOUR clone; one commit per phase; before each commit ALL of:
|
||
`pytest tests/unit -q`, `pytest tests/concurrency -q` (once it exists), `scripts/lint.sh` —
|
||
green. Push the branch (NOT main — main merge is gated below). Claim gates via `claim(conc): ...`
|
||
commits + STATUS-conc.md per protocol.
|
||
|
||
**Adversary**: cold-verify from your own clone. For M1: check out the branch, run both suites +
|
||
lint yourself, then adversarially review the full diff — hunt races (probe vs acquire ordering,
|
||
signal-handler reentrancy, teardown-during-teardown), deleted-code fallout (grep for dangling
|
||
references to registry symbols: register_run_app, unregister_run_app, _run_owner_state,
|
||
ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE, acquire_recipe_lock), gate integrity (recipe tests /
|
||
RUN_APP_RE / warm apps untouched), and test-suite blind spots vs the 19 cases above.
|
||
Default-skeptical; findings to REVIEW-conc.md.
|
||
|
||
**Gates:**
|
||
- **M1 — implementation verified.** Branch complete (P1–P5 + tests), both suites + lint green
|
||
on the Adversary's cold clone, adversarial diff review PASS in REVIEW-conc.md.
|
||
- **M2 — merged + live-verified.** After M1 PASS only, Builder merges the branch to main
|
||
(merge commit, never force) and confirms the push build green. Then live verification on the
|
||
real CI host (evidence in STATUS-conc.md, Adversary re-checks):
|
||
a. trigger an immich !testme, cancel mid-run via drone API → harness pid dies (no leaked
|
||
python), lock released, next janitor reaps the app — zero leakage;
|
||
b. two parallel !testme runs (immich PR#2 + plausible PR#3) → both green, zero leakage;
|
||
c. double-!testme same PR → second blocks on the app lock (visible in its drone log), then runs;
|
||
d. one full green run end-to-end.
|
||
|
||
**## DONE in STATUS-conc.md** only when M1 and M2 both show a fresh Adversary PASS in
|
||
REVIEW-conc.md. NOTE for both loops: recipe-mirror PRs (immich#2, plausible#3) are used as
|
||
!testme targets only — NEVER merge or push to recipe mirror repos.
|
||
|
||
## Guardrails (builder + adversary MUST honor)
|
||
|
||
- NEVER weaken recipe-test gates or touch `tests/<recipe>/` content.
|
||
- cc-ci main is touched ONLY by the M2 merge after M1 PASS; all other work stays on the branch.
|
||
Never force-push. NEVER merge/push recipe mirror repos.
|
||
- No secrets in commits; reference `.testenv` / `/run/secrets` locations only.
|
||
- Don't touch `services_converged()` / paused-is-settled logic except where the plan says.
|
||
- Match repo commit style (`feat(...)`/`fix(harness)`/`test(...)`/`docs:`).
|
||
|
||
## Rollback
|
||
|
||
Single revert of the merge commit restores the registry+recipe-flock world; no data migration
|
||
(lock/registry state is tmpfs-ephemeral, old and new harness can't run simultaneously anyway —
|
||
deploys land from main atomically per build).
|