Files
cc-ci/REVIEW-conc.md
autonomic-bot c51692b57e
All checks were successful
continuous-integration/drone/push Build is passing
chore(conc): pre-review P3+P4 — zero dangling refs, ABRA_DIR ordering clean (not a verdict)
2026-06-10 04:28:41 +00:00

9.6 KiB
Raw Blame History

REVIEW-conc.md — Adversary ledger, concurrency-restructure phase

Append-only. Verdicts: <gate>: PASS @<ts> + evidence, or FAIL + [adversary] finding in BACKLOG-conc.md. SSOT for what is verified: /srv/cc-ci/cc-ci-plan/concurrency-restructure-full-plan.md.

2026-06-10T04:00Z — Adversary online; baseline pre-read (no gate pending)

Pulled main @5b65c6c. No STATUS-conc.md, no restructure/concurrency branch — nothing claimed yet. Pre-read the CURRENT system (docs/concurrency.md @5b65c6c + lifecycle.py/run_recipe_ci.py) to anchor my later diff review in the as-is code, not the Builder's narrative.

Current-system facts I will hold the restructure against:

  • Registry symbols slated for deletion (will grep for dangling refs at M1): register_run_app (lifecycle.py:69, call site :283), unregister_run_app (:78, call sites :723, :766), _run_owner_state (:83), ACTIVE_RUN_DIR (:43), CCCI_JANITOR_MAX_AGE (janitor :738), acquire_recipe_lock (:46, call site run_recipe_ci.py:843), RECIPE_LOCK_DIR (:42).
  • Must survive untouched: RUN_APP_RE (lifecycle.py:26) allowlist semantics (warm/canonical apps never probed), services_converged() paused-is-settled logic, docker-service sweep discovery, teardown_app(verify=False) idempotence.
  • M1 verification plan (cold, my clone): checkout branch; pytest tests/unit -q, pytest tests/concurrency -q, scripts/lint.sh; full diff review hunting: probe-vs-acquire ordering races, signal-handler reentrancy (SIGTERM during teardown / SIGALRM during SIGTERM), teardown-during-teardown, lock-fd lifetime (object dropped → GC closes fd → lock silently released), symlinked servers/ write conflicts, janitor unlink-vs-reacquire race (unlink while a waiter blocks on the old inode → two "held" locks on different inodes for one domain), PDEATHSIG-after-fork ordering (prctl before ppid check), alarm(0) vs teardown duration, setsid wrapper trap semantics under drone cancel, test-suite blind spots vs the 19 planned cases.
  • Tests/concurrency must NOT be wired into the default pytest tests/unit gate (plan decision).
  • M2 (post-merge, live): cancel-mid-run leak check, parallel immich#2+plausible#3, double-!testme same PR blocks visibly, one full green run. NEVER merge/push recipe mirror repos.

No verdict yet — waiting for Builder bootstrap/claim.

2026-06-10T04:05Z — cold-verify environment established (prep, no gate)

Builder seeded STATUS/BACKLOG/JOURNAL-conc; STATUS says P1 in flight, no gate claimed. Mapped the test-execution environment I'll use for the M1 cold run so a time-sensitive gate isn't spent debugging tooling:

  • Local VM devshell (nix develop) has only lintTools (no pytest). So pytest does NOT run here.
  • pytest 8.3.3 + playwright live in the host pyEnv (nix/modules/harness.nix) exposed as cc-ci-run on cc-ci. cc-ci-run -m pytest <path> -q works as the real harness interpreter (verified: cc-ci-run -c "import pytest" -> 8.3.3).
  • .drone.yml lint stage runs nix develop .#lint --command bash scripts/lint.sh.
  • COLD M1 PLAN: fresh git clone/checkout of restructure/concurrency into a throwaway dir ON cc-ci → cc-ci-run -m pytest tests/unit -q + cc-ci-run -m pytest tests/concurrency -q + nix develop .#lint --command bash scripts/lint.sh, all from that clean checkout (not the Builder's working tree). Then adversarial diff review per my baseline hit-list.
  • Baseline .drone.yml on main is still the pre-restructure version (concurrency.limit=2, acquire_recipe_lock / /run/cc-ci-active registry referenced) — confirms P1/P4 edits are branch-only so far. Good.

2026-06-10T04:23Z — early pre-review of P1+P2 (branch @b302f3a, NO gate claimed — NOT a verdict)

Builder has pushed P1 (b492f99) + P2 (b302f3a) to restructure/concurrency; P3/P4/P5/tests still pending, so M1 is not claimable and this is NOT a PASS — it's pre-review to front-load the M1 diff audit and avoid re-doing it under gate time pressure. Read code/diff + git only; did NOT read JOURNAL (anti-anchoring intact). I actively tried to break the following and each concern was REFUTED:

  1. Green-on-red via the .drone.yml EXIT trap (my lead hypothesis). The wrapper is setsid cc-ci-run … & PID=$!; trap 'kill -TERM -- -$PID' TERM EXIT; wait $PID. I worried the EXIT trap's final kill status would override the harness exit code and mask a failing run. EMPIRICALLY TESTED (4 bash repros incl. failing harness with a lingering group member that makes kill succeed=0): bash PRESERVES the pre-trap exit status when the EXIT trap doesn't call exit. Exit code propagates correctly in all cases (RED stays RED, GREEN stays GREEN). Refuted.
  2. P2 unlink/reacquire inode race (janitor unlinks a reaped orphan's lockfile while a new run blocks on the old inode). Handled: both acquire_app_lock and _probe_and_reap recheck fstat(fd).st_ino == stat(path).st_ino after acquiring and retry/bail on mismatch — a lock on an unlinked (anonymous) inode is never treated as authoritative, and the path's lockfile is never unlinked out from under a newer run. Refuted.
  3. Half-reaped/new-app coexistence. Reap runs WHILE HOLDING the probe lock; a new same-domain run blocks in acquire_app_lock until reap completes. The pre-deploy window (lock held, app not yet created) is covered: the stale-lockfile sweep sees the held lock (BlockingIOError) and leaves it. Refuted.
  4. Signal mid-normal-teardown aborting cleanup. begin_teardown() is the FIRST line of BOTH finally blocks (run_recipe_ci.py:663 run_quick, :1134 main); the _funnel_handler swallows (logs+returns) any SIGTERM/SIGALRM once tearing_down is set, so a second signal can't abort the cleanup the first asked for. install_lifetime_guards() is the FIRST statement of main() (:829), before any abra/lock call, with prctl→ppid==1 recheck in the correct order. Refuted.

Open items to confirm AT M1 (cold, full suite) — NOT defects, just unverified-until-then:

  • datetime import removed from lifecycle.py along with _stack_age_seconds — grep for any remaining datetime use (ruff would catch an undefined name; confirm import truly orphaned).
  • _stack_name / age-fallback deadcode after the janitor rewrite — confirm no dangling refs.
  • Registry-symbol deletion is only PARTIAL on this commit: acquire_recipe_lock still present (P3 deletes it); register/unregister/_run_owner_state/ACTIVE_RUN_DIR/CCCI_JANITOR_MAX_AGE are gone — full dangling-ref grep belongs at M1 once P3 lands.
  • setsid-fork edge: if setsid ever forks (only when it's a pgrp leader; not the case for a backgrounded job in a non-job-control drone shell), $PID would be the intermediate and the harness would reparent to ppid==1 and self-abort. Live-verify the trap+cancel path at M2(a).
  • begin_teardown is process-global module state (lifetime._state) — fine for one harness process; the tests/concurrency suite must not import-share it across in-process cases (verify at M1).

2026-06-10T04:32Z — pre-review P3+P4 (branch @91d3cc7, NO gate claimed — NOT a verdict)

Builder pushed P3 (17ebdf3 per-run ABRA_DIR) + P4 (91d3cc7 config cleanup). tests/concurrency + P5 docs still pending, so M1 still not claimable. Continued the front-loaded diff audit (code/git only; JOURNAL still unread). Findings — all CLEAN:

  • Dangling-ref grep across runner/bridge/dashboard/nix = ZERO hits for all 9 deleted symbols: acquire_recipe_lock, register_run_app, unregister_run_app, _run_owner_state, ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE, RECIPE_LOCK_DIR, _stack_age_seconds, _registry_path. The orphaned datetime import is also gone from lifecycle.py. Clean deletion.
  • Path centralization: all ~/.abra/recipes/<recipe> literals replaced by abra.recipe_dir() (resolves $ABRA_DIR else ~/.abra) across abra.py (recipe_checkout, has_lightweight_version_tags, recipe_head_commit, recipe_versions), generic._recipe_dir, lifecycle.prepull_images, snapshot_recipe_tests, fetch_recipe. prepull's env_path stays canonical ~/.abra/servers/... which is correct (servers/ is the shared symlink target).
  • Ordering verified (main(), the only structural risk): install_lifetime_guards() is the FIRST stmt (873); between it and setup_run_abra_dir() (891) there are ONLY env reads + a print — no abra call; ABRA_DIR is exported at 891 BEFORE fetch_recipe (892) and before the first path-helper recipe_head_commit (895). The --quick dispatch (run_quick, ~908) is AFTER 891, so the quick lane inherits the per-run ABRA_DIR too. No tree is touched before ABRA_DIR is set.
  • Manual-run isolation: rid=="manual" → "manual-" so two hand-runs don't share a tree.

Open items to confirm AT M1 (cold) — not defects:

  • setup_run_abra_dir symlink idempotency: if not os.path.islink(link): os.symlink(...) — if a NON-symlink file pre-exists at servers/catalogue (reused run dir from a crashed partial), symlink raises FileExistsError. Low risk (fresh run-id per Drone build) but worth a glance.
  • CCCI_SKIP_FETCH=1 now rm -rf dest + copytree(canonical, dest, symlinks=True) — confirm the --quick rollback-proof staging tests still pass (they set CCCI_SKIP_FETCH).
  • tests/{ghost,discourse}/install_steps.sh RECIPE_DIR=${ABRA_DIR:-$HOME/.abra} mechanical path fix — confirm it changed NO assertion/gate (guardrail: never weaken recipe-test gates). Diff-check.

Net: the entire P1P4 diff has been pre-audited and is clean against my break-it hit-list. M1 cold run, once claimed (after tests/concurrency + P5 land), reduces to: fresh checkout on cc-ci → cc-ci-run -m pytest tests/unit -q + cc-ci-run -m pytest tests/concurrency -q + lint, plus a focused review of only the tests/concurrency suite (vs the 19 planned cases) and the P5 doc delta.