- new harness/lifetime.py: install_lifetime_guards() arms PR_SET_PDEATHSIG(SIGTERM) (with post-prctl ppid==1 orphan refusal), a SIGTERM handler raising SystemExit through the run's finally: teardown funnel (exit 143), and signal.alarm(3600) funnelling SIGALRM the same way with a distinct deadline log line (exit 142). Re-entrant signals during teardown are logged and ignored (begin_teardown guard) so a second signal can't abort the running cleanup.
- run_recipe_ci.main(): guards installed first thing, before any abra call/lock; both teardown finally: blocks (cold + quick) mark begin_teardown().
- .drone.yml recipe-ci step: harness runs under setsid in its own process group; a trap forwards the step shell's TERM/EXIT to the whole group so drone cancel reaches the harness instead of leaking it (docs/concurrency.md §8.1).
- PEP 446 note on the recipe-lock open(): the fd is non-inheritable, children never carry it.
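The PEP 446 point can be checked directly. The sketch below is illustrative only: the lock path and the function name `lock_fd_is_inheritable` are hypothetical and are not taken from the harness. It opens a file with the builtin `open()`, takes a non-blocking exclusive flock on it, and reports whether the descriptor is marked inheritable.

```python
import fcntl
import os
import tempfile


def lock_fd_is_inheritable() -> bool:
    """Open a throwaway lock file and report whether its fd is inheritable."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical lock path; the real harness chooses its own location.
        path = os.path.join(tmp, "recipe.lock")
        with open(path, "w") as lock_file:
            # Exclusive, non-blocking flock, released when the fd is closed.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            # Since PEP 446 (Python 3.4), fds created by open() are
            # non-inheritable by default, so this is expected to be False.
            return os.get_inheritable(lock_file.fileno())
```

A non-inheritable descriptor is closed in an exec'd child, so under this assumption the lock stays tied to the harness process itself.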
96 lines
4.2 KiB
Python
"""Run-lifetime hardening (concurrency restructure P1).

The concurrency model's invariant chain is:

    lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min hard deadline

Locks are kernel flocks released on process exit, so the only thing that needs managing is the
PROCESS lifetime. Three guards, installed at run startup (before any abra call) by
`install_lifetime_guards()`:

1. `PR_SET_PDEATHSIG(SIGTERM)`: if the parent (the drone step shell) dies (cancel, runner
   crash, host shutdown of the step), the kernel delivers SIGTERM to the harness, so a dead
   build can never leak a running harness that holds locks. Paired with a ppid==1 re-check
   AFTER the prctl: a parent that died BEFORE the prctl took effect would never trigger the
   death signal, so a harness that finds itself already reparented refuses to run.
2. SIGTERM handler: raise SystemExit so the run's `finally:` teardown funnel executes and the
   process exits non-zero. Re-entrant deliveries during teardown are logged and IGNORED so a
   second signal can't abort the cleanup the first one asked for (`begin_teardown()` guards
   this; the run's own `finally:` blocks also call it so a signal landing mid-normal-teardown
   can't abort that either).
3. `signal.alarm(3600)`: self-imposed hard deadline. SIGALRM funnels into the same teardown
   path with a distinct log line. Teardown time after the deadline is not alarm-bounded:
   interrupting a teardown buys nothing; the janitor (flock probe) is the backstop if a
   teardown wedges and the process is killed harder.
"""

from __future__ import annotations

import ctypes
import os
import signal
import sys

HARD_DEADLINE_SECONDS = 60 * 60

_PR_SET_PDEATHSIG = 1  # linux/prctl.h

_state = {"tearing_down": False}


def begin_teardown() -> None:
    """Mark the teardown funnel as running. From here on SIGTERM/SIGALRM must NOT raise (it
    would abort the very cleanup it asks for), so the handlers log and return instead. Called
    by the handlers themselves before raising, and at the top of the run's `finally:` blocks."""
    _state["tearing_down"] = True


def _funnel_handler(log_line: str, exit_code: int):
    """A signal handler that routes into the teardown funnel exactly once: log, then raise
    SystemExit (propagates through the run's try/finally → teardown executes → non-zero exit).
    While teardown is already running, further signals are logged and swallowed."""

    def handler(signum: int, frame) -> None:  # noqa: ARG001
        print(log_line, flush=True)
        if _state["tearing_down"]:
            print(
                f"== signal {signum} during teardown — ignored (teardown continues, "
                "exit stays non-zero) ==",
                flush=True,
            )
            return
        begin_teardown()
        raise SystemExit(exit_code)

    return handler


def install_lifetime_guards(deadline_seconds: int = HARD_DEADLINE_SECONDS) -> None:
    """Install all three lifetime guards (see module docstring). Must run at harness startup,
    before any abra call and before any lock is taken."""
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    if libc.prctl(_PR_SET_PDEATHSIG, signal.SIGTERM, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"prctl(PR_SET_PDEATHSIG, SIGTERM) failed: {os.strerror(err)}")
    # The prctl is armed now, but it only fires for a parent death AFTER this point. If the
    # parent already died, we are reparented (ppid 1) and would never get the signal. Refuse to
    # run: an orphaned harness would hold locks/apps with nothing managing its lifetime.
    if os.getppid() == 1:
        sys.exit("parent died before prctl(PR_SET_PDEATHSIG) — refusing to run orphaned")
    signal.signal(
        signal.SIGTERM,
        _funnel_handler(
            "== SIGTERM received (drone cancel / parent death) — tearing down ==",
            128 + signal.SIGTERM,
        ),
    )
    minutes = deadline_seconds // 60
    signal.signal(
        signal.SIGALRM,
        _funnel_handler(
            f"== run exceeded {minutes}-minute hard deadline — tearing down ==",
            128 + signal.SIGALRM,
        ),
    )
    signal.alarm(deadline_seconds)
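
To see how the funnel behaves from the caller's side, here is a minimal, self-contained sketch. It does not import the module above; `state`, `events`, `funnel_handler` and `demo_run` are hypothetical stand-ins, and `signal.raise_signal()` (Python 3.8+) plays the role of a drone cancel. It assumes it runs on the main thread, since that is the only place Python allows handlers to be installed.

```python
import signal

# Hypothetical stand-ins for the module's private state and its log output.
state = {"tearing_down": False}
events = []


def funnel_handler(exit_code):
    """Build a handler that raises SystemExit once, then ignores repeats."""

    def handler(signum, frame):
        if state["tearing_down"]:
            events.append(f"signal {signum} ignored during teardown")
            return
        state["tearing_down"] = True
        raise SystemExit(exit_code)

    return handler


def demo_run():
    """Simulate a cancelled run and return the exit code it would end with."""
    state["tearing_down"] = False
    events.clear()
    previous = signal.signal(signal.SIGTERM, funnel_handler(128 + signal.SIGTERM))
    try:
        try:
            events.append("run started")
            signal.raise_signal(signal.SIGTERM)  # stand-in for a drone cancel
            events.append("run finished")  # skipped: SystemExit is propagating
        finally:
            state["tearing_down"] = True  # the effect of begin_teardown()
            events.append("teardown started")
            signal.raise_signal(signal.SIGTERM)  # second signal, mid-teardown
            events.append("teardown finished")
    except SystemExit as exc:
        return exc.code
    finally:
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)  # restore the old handler
    return 0
```

On Linux, where SIGTERM is 15, `demo_run()` should return 143, and `events` should show the teardown running to completion with the second signal recorded as ignored. The real harness lets the SystemExit propagate out of `main()` instead of catching it; the `except` here exists only so the sketch can report the code.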