Files

autonomic-bot a1b4943da1 plan: adapt concurrency restructure to builder/adversary loop protocol (gates M1/M2, phase-namespaced state)

2026-06-10 03:54:31 +00:00

12 KiB

Raw Blame History

Concurrency restructure — full implementation plan

Expands concurrency-restructure-plan.md (operator-approved 2026-06-10). Reference spec of current system: cc-ci docs/concurrency.md.

Decisions locked (operator + orchestrator)

Question	Decision
Hard deadline	60 min (`signal.alarm(60*60)`); recipes keep their own smaller per-tier deadlines
`servers/` in per-run ABRA_DIR	symlink to `/root/.abra/servers` — app `.env` files land in the shared canonical path, so the janitor's `abra app ls` discovery and env-based teardown work unchanged (kills the P3 orphan-env risk). Per-domain filenames + app-domain lock prevent write conflicts. `catalogue/` also symlinked (read-mostly). `recipes/` is fresh per-run — that's the isolation that matters.
Landing	one branch `restructure/concurrency`, one commit per phase (P1..P5 + tests), merged to main only after adversary pass + lint green; live verification on main afterwards
Concurrency test suite	`tests/concurrency/` — NOT in the default `pytest tests/unit` run; invoked explicitly (`pytest tests/concurrency -q`); documented in docs/concurrency.md

Target invariant chain

lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min deadline

Never steal a held lock; manage the holder's lifetime. Held lock = live owner (kernel-guaranteed).

P1 — Lock-lifetime hardening

runner/run_recipe_ci.py (startup, before any abra call):

PR_SET_PDEATHSIG(SIGTERM) via ctypes.CDLL("libc.so.6", use_errno=True).prctl(1, signal.SIGTERM); check return code; then if os.getppid() == 1: sys.exit("parent died before prctl — refusing to run orphaned").
SIGTERM handler: run teardown (reuse the existing finally/teardown funnel — raise SystemExit from the handler so finally: blocks run), exit non-zero.
Self-deadline: signal.alarm(60*60) + SIGALRM handler → same teardown funnel, distinct log line (== run exceeded 60-minute hard deadline — tearing down ==), exit non-zero.

.drone.yml recipe-ci step: wrap the harness invocation:

setsid python3 runner/run_recipe_ci.py ... &
PID=$!; trap 'kill -TERM -- -$PID 2>/dev/null' TERM EXIT; wait $PID

(builder: verify drone exec runner signal delivery semantics; the trap must fire on drone cancel.)

Also: one-line comment on the lock open() noting PEP 446 makes the fd non-inheritable (subprocess children never carry the lock).

P2 — Flock-probe janitor (replaces the pidfile registry)

runner/harness/lifecycle.py:

New: acquire_app_lock(domain) — /run/lock/cc-ci-app-<domain>.lock, exclusive flock, non-blocking try first → on BlockingIOError log == app lock: another run of <domain> is in flight — waiting == and block. Returns the kept-alive file object. Touch mtime at acquisition (mtime = lock age for the long-held flag).
Call site: in deploy_app() exactly where register_run_app() is today (BEFORE app creation); the file object must be stored on a module-level/owner object so it lives for the process lifetime.
Janitor rewrite (janitor()): for each candidate domain (discovery unchanged: abra app ls matched against RUN_APP_RE + docker-service sweep):
- open/create its lockfile, try LOCK_EX|LOCK_NB:
  - acquired → orphan → teardown_app(domain, verify=False) WHILE HOLDING the probe lock (closes janitor-vs-new-run race: a new run of that domain blocks on the lock until the reap finishes), then release + unlink lockfile.
  - held → live run → leave it; if lockfile mtime older than 2× deadline (120 min), log: !! lock for <domain> held >120min — possible leaked run; inspect with lslocks (flag, never steal).
Delete: register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE handling + age fallback, pid-reuse guard, and their call sites (deploy_app, teardown_app, both janitor call sites keep working — only internals change).
teardown_app() no longer unregisters; normal-exit lock release stays implicit (process exit). Janitor unlinks lockfiles only for reaped orphans; a clean run's stale lockfile (0 bytes, unlocked) is harmless and reaped-on-sight by the next janitor probe (unlink after acquiring probe lock, even when no app exists for it — keep /run/lock tidy).

Edge semantics to preserve/verify:

Two janitors racing: both probe; one wins the flock and reaps; the loser sees "held" and leaves. Reap must be idempotent (teardown_app(verify=False) already tolerates half-gone stacks).
Post-reboot: lockfiles gone (tmpfs) → probe trivially acquires → immediate reap. (Improvement over the 2h age fallback — note in spec.)
Warm/canonical apps never match RUN_APP_RE → never probed. Unchanged. DO NOT touch.

P3 — Per-run ABRA_DIR (deletes the per-recipe flock)

runner/run_recipe_ci.py / runner/harness/:

Early in main(): build run_abra_dir = /var/lib/cc-ci-runs/<build>/abra; create it; symlink servers → /root/.abra/servers, catalogue → /root/.abra/catalogue (builder: verify the minimal set abra needs — probe showed it fails only on missing servers/); fresh empty recipes/. Set os.environ["ABRA_DIR"] = run_abra_dir before ANY abra/harness call.
fetch_recipe() becomes a plain clone into $ABRA_DIR/recipes/<recipe> (no rm-rf of a shared tree; keep the checkout-of-ref logic).
Delete: acquire_recipe_lock, RECIPE_LOCK_DIR recipe-lock usage, its call site in main(), and the "must be taken before fetch_recipe" ordering rule.
.drone.yml: HOME=/root comment updated — abra config still found via symlinked servers; recipe trees now per-run.
Janitor / warm-app / canonical flows that read /root/.abra/recipes/... (if any — builder: grep) must be checked; they run outside a per-run ABRA_DIR and keep using the default.
Run-dir cleanup: /var/lib/cc-ci-runs/<n>/ retention already exists — abra dir rides along.

P4 — Config cleanup

Remove concurrency.limit from .drone.yml recipe-ci pipeline; DRONE_RUNNER_CAPACITY (nix/modules/drone-runner.nix maxTests) is the single knob. Update its comment.

P5 — Spec rewrite

Rewrite docs/concurrency.md to the new model: one mechanism (per-app-domain flock), the invariant chain (§ lock-lifetime management), flock-probe janitor decision table, per-run ABRA_DIR isolation, updated failure-mode table (cancel now kills harness; reboot reaps immediately; deadline bounds everything), updated file/symbol index, and how to run tests/concurrency.

Test suite: `tests/concurrency/`

Real-kernel tests (no mocking of flock itself — spawn helper subprocesses that hold locks; use tmp dirs for lock/run dirs via env/monkeypatch). NOT part of pytest tests/unit. Run with pytest tests/concurrency -q (document in docs; CI does not run it on every build — optional manual/nightly invocation).

Lock fundamentals:

acquire → child process SIGKILL'd → lock immediately acquirable (kernel auto-release).
probe (LOCK_NB) against a held lock raises BlockingIOError; against unheld succeeds.
lock fd NOT inherited: holder spawns a subprocess child, holder dies, child still alive → lock acquirable (PEP 446).
second same-domain acquire blocks until first holder exits (double-!testme serialisation).

Janitor / probe semantics: 5. orphan (no holder) → reaped, lockfile unlinked; teardown invoked exactly once (stub teardown). 6. live (held by helper process) → never reaped, logged as live. 7. reap-under-probe-lock race: janitor holds probe lock mid-reap → a new run's acquire of the same domain blocks until reap completes (no window where new app coexists with half-reaped one). 8. two concurrent janitors → exactly one reaps (flock arbitration), reap idempotent. 9. reboot simulation: app exists, lockfile absent → treated as orphan immediately (no age wait). 10. long-held flag: held lock with mtime backdated >120min → warning logged, NOT reaped. 11. RUN_APP_RE allowlist: warm/canonical-shaped names never become candidates. 12. garbled/unwritable lockfile, missing lock dir → janitor degrades safely (skip + log, never crash).

Lifetime hardening: 13. PDEATHSIG: spawn wrapper-parent + harness-child (helper script calling the real prctl setup); kill parent → child receives SIGTERM and exits, lock released. 14. ppid race: helper started with already-dead parent (ppid 1) exits immediately. 15. deadline: helper with alarm(2) → teardown hook fires, exit non-zero, lock released. 16. SIGTERM → teardown funnel runs (finally: executed), lock released, non-zero exit.

ABRA_DIR isolation: 17. per-run dir built correctly: servers/catalogue symlinks resolve, recipes/ empty + writable, ABRA_DIR exported before first abra call (assert via stub abra recording its env). 18. two concurrent fetch_recipe of the SAME recipe into different ABRA_DIRs → both trees correct, no cross-talk (the old corruption scenario, now safe without a lock). 19. app .env written through the servers symlink lands in the canonical shared path (janitor discovery keeps working).

Each test must clean up helper processes in finally/fixtures (no leaked children in the test VM).

Roles, gates & Definition of Done (loop protocol — plan.md §6.1 applies)

This phase runs as Builder/Adversary loops with phase-namespaced state files (STATUS-conc.md, REVIEW-conc.md, BACKLOG-conc.md, JOURNAL-conc.md).

Builder: implements P1→P5 + the tests/concurrency suite on branch restructure/concurrency in YOUR clone; one commit per phase; before each commit ALL of: pytest tests/unit -q, pytest tests/concurrency -q (once it exists), scripts/lint.sh — green. Push the branch (NOT main — main merge is gated below). Claim gates via claim(conc): ... commits + STATUS-conc.md per protocol.

Adversary: cold-verify from your own clone. For M1: check out the branch, run both suites + lint yourself, then adversarially review the full diff — hunt races (probe vs acquire ordering, signal-handler reentrancy, teardown-during-teardown), deleted-code fallout (grep for dangling references to registry symbols: register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE, acquire_recipe_lock), gate integrity (recipe tests / RUN_APP_RE / warm apps untouched), and test-suite blind spots vs the 19 cases above. Default-skeptical; findings to REVIEW-conc.md.

Gates:

M1 — implementation verified. Branch complete (P1–P5 + tests), both suites + lint green on the Adversary's cold clone, adversarial diff review PASS in REVIEW-conc.md.
M2 — merged + live-verified. After M1 PASS only, Builder merges the branch to main (merge commit, never force) and confirms the push build green. Then live verification on the real CI host (evidence in STATUS-conc.md, Adversary re-checks): a. trigger an immich !testme, cancel mid-run via drone API → harness pid dies (no leaked python), lock released, next janitor reaps the app — zero leakage; b. two parallel !testme runs (immich PR#2 + plausible PR#3) → both green, zero leakage; c. double-!testme same PR → second blocks on the app lock (visible in its drone log), then runs; d. one full green run end-to-end.

## DONE in STATUS-conc.md only when M1 and M2 both show a fresh Adversary PASS in REVIEW-conc.md. NOTE for both loops: recipe-mirror PRs (immich#2, plausible#3) are used as !testme targets only — NEVER merge or push to recipe mirror repos.

Guardrails (builder + adversary MUST honor)

NEVER weaken recipe-test gates or touch tests/<recipe>/ content.
cc-ci main is touched ONLY by the M2 merge after M1 PASS; all other work stays on the branch. Never force-push. NEVER merge/push recipe mirror repos.
No secrets in commits; reference .testenv / /run/secrets locations only.
Don't touch services_converged() / paused-is-settled logic except where the plan says.
Match repo commit style (feat(...)/fix(harness)/test(...)/docs:).

Rollback

Single revert of the merge commit restores the registry+recipe-flock world; no data migration (lock/registry state is tmpfs-ephemeral, old and new harness can't run simultaneously anyway — deploys land from main atomically per build).

12 KiB Raw Blame History Unescape Escape