"""Run-lifetime hardening (concurrency restructure P1). The concurrency model's invariant chain is: lock lifetime ⊆ harness process lifetime ⊆ drone step lifetime ⊆ 60-min hard deadline Locks are kernel flocks released on process exit, so the only thing that needs managing is the PROCESS lifetime. Three guards, installed at run startup (before any abra call) by `install_lifetime_guards()`: 1. `PR_SET_PDEATHSIG(SIGTERM)`: if the parent (the drone step shell) dies — cancel, runner crash, host shutdown of the step — the kernel delivers SIGTERM to the harness, so a dead build can never leak a running harness that holds locks. Paired with a ppid==1 re-check AFTER the prctl: a parent that died BEFORE the prctl took effect would never trigger the death signal, so a harness that finds itself already reparented refuses to run. 2. SIGTERM handler: raise SystemExit so the run's `finally:` teardown funnel executes and the process exits non-zero. Re-entrant deliveries during teardown are logged and IGNORED so a second signal can't abort the cleanup the first one asked for (`begin_teardown()` guards this; the run's own `finally:` blocks also call it so a signal landing mid-normal-teardown can't abort that either). 3. `signal.alarm(3600)`: self-imposed hard deadline. SIGALRM funnels into the same teardown path with a distinct log line. Teardown time after the deadline is not alarm-bounded — interrupting a teardown buys nothing; the janitor (flock probe) is the backstop if a teardown wedges and the process is killed harder. """ from __future__ import annotations import ctypes import os import signal import sys HARD_DEADLINE_SECONDS = 60 * 60 _PR_SET_PDEATHSIG = 1 # linux/prctl.h _state = {"tearing_down": False} def begin_teardown() -> None: """Mark the teardown funnel as running. From here on SIGTERM/SIGALRM must NOT raise — it would abort the very cleanup it asks for — so the handlers log and return instead. Called by the handlers themselves before raising, and at the top of the run's `finally:` blocks.""" _state["tearing_down"] = True def _funnel_handler(log_line: str, exit_code: int): """A signal handler that routes into the teardown funnel exactly once: log, then raise SystemExit (propagates through the run's try/finally → teardown executes → non-zero exit). While teardown is already running, further signals are logged and swallowed.""" def handler(signum: int, frame) -> None: # noqa: ARG001 print(log_line, flush=True) if _state["tearing_down"]: print( f"== signal {signum} during teardown — ignored (teardown continues, " "exit stays non-zero) ==", flush=True, ) return begin_teardown() raise SystemExit(exit_code) return handler def install_lifetime_guards(deadline_seconds: int = HARD_DEADLINE_SECONDS) -> None: """Install all three lifetime guards (see module docstring). Must run at harness startup, before any abra call and before any lock is taken.""" libc = ctypes.CDLL("libc.so.6", use_errno=True) if libc.prctl(_PR_SET_PDEATHSIG, signal.SIGTERM, 0, 0, 0) != 0: err = ctypes.get_errno() raise OSError(err, f"prctl(PR_SET_PDEATHSIG, SIGTERM) failed: {os.strerror(err)}") # The prctl is armed now — but only fires for a parent death AFTER this point. If the parent # already died, we are reparented (ppid 1) and would never get the signal: refuse to run, an # orphaned harness would hold locks/apps with nothing managing its lifetime. if os.getppid() == 1: sys.exit("parent died before prctl(PR_SET_PDEATHSIG) — refusing to run orphaned") signal.signal( signal.SIGTERM, _funnel_handler( "== SIGTERM received (drone cancel / parent death) — tearing down ==", 128 + signal.SIGTERM, ), ) minutes = deadline_seconds // 60 signal.signal( signal.SIGALRM, _funnel_handler( f"== run exceeded {minutes}-minute hard deadline — tearing down ==", 128 + signal.SIGALRM, ), ) signal.alarm(deadline_seconds)