cc-ci/JOURNAL-conc.md
autonomic-bot dfa5c8b9ee
journal(conc): M2(a) cancel-mid-run PASS evidence; (b) parallel runs triggered
2026-06-10 04:47:19 +00:00


JOURNAL — sub-phase conc (Builder, append-only)

2026-06-10 — bootstrap

Read concurrency-restructure-full-plan.md (SSOT) + plan.md §6.1/§7/§9. Oriented on the code:

  • runner/harness/lifecycle.py — recipe flock (l.46), registry (l.65–97), deploy_app registration (l.283), teardown unregister (l.723), three-way janitor (l.726).
  • runner/run_recipe_ci.py: acquire_recipe_lock call site (l.843), fetch_recipe (l.140, rm-rf + reclone of the shared tree), janitor call sites (l.600 quick, l.932 cold).
  • .drone.yml — recipe-ci step runs cc-ci-run runner/run_recipe_ci.py bare (P1 wraps it), concurrency.limit: 2 (P4 removes).
  • Greps for P3 fallout: ~/.abra/recipes referenced in abra.py (recipe_checkout, has_lightweight_version_tags, recipe_head_commit, recipe_versions), generic.py:28, lifecycle.prepull_images, run_recipe_ci (fetch_recipe, snapshot_recipe_tests, comment), warm_reconcile.py:202 (runs OUTSIDE per-run context — keeps default), and tests/ghost+discourse install_steps.sh (${HOME}/.abra/recipes/... — these run INSIDE a run and copy compose.ccci.yml into the deploy tree, so they must resolve the per-run dir).
  • ~/.abra/servers/... paths are unaffected by design (servers/ is symlinked to the canonical /root/.abra/servers, so both resolutions land on the same file).
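
A minimal sketch of the directory layout the last two bullets imply: a per-run abra dir with its own recipes/ and with servers/ and catalogue/ symlinked to the canonical tree, so both resolutions reach the same file. The function names, the temp-dir stand-in for /root/.abra and the example file are illustrative assumptions, not the harness code.

```python
import os
import tempfile
from pathlib import Path


def make_run_abra_dir(run_root: Path, canonical: Path) -> Path:
    """Create <run_root>/abra with a private recipes/ and shared symlinks."""
    abra = run_root / "abra"
    (abra / "recipes").mkdir(parents=True, exist_ok=True)  # private to this run
    for shared in ("servers", "catalogue"):
        link = abra / shared
        if not link.is_symlink():
            # Shared state stays in the canonical tree; the run only links to it.
            link.symlink_to(canonical / shared, target_is_directory=True)
    return abra


def run_env(abra: Path) -> dict:
    """Environment for child processes, pointing ABRA_DIR at the per-run dir."""
    return dict(os.environ, ABRA_DIR=str(abra))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        canonical = Path(tmp) / "canonical"  # stands in for /root/.abra
        (canonical / "servers" / "example.test").mkdir(parents=True)
        (canonical / "catalogue").mkdir()
        env_file = canonical / "servers" / "example.test" / "app.env"
        env_file.write_text("TYPE=demo\n")

        abra = make_run_abra_dir(Path(tmp) / "runs" / "9999", canonical)
        via_run = abra / "servers" / "example.test" / "app.env"
        # True when the per-run path and the canonical path are the same file.
        print(os.path.samefile(via_run, env_file))
        print(run_env(abra)["ABRA_DIR"].endswith("runs/9999/abra"))
```

Only recipes/ is private to the run here, which matches the note above that servers/ paths are unaffected by design.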

Working setup: state files on main in this clone; code on branch restructure/concurrency via a git worktree at ../cc-ci-conc; test runs on the cc-ci host via /root/builder-clone (cc-ci-run -m pytest ..., nix develop .#lint).

2026-06-10 — P1–P4 landed on restructure/concurrency

  • P1 b492f99: harness/lifetime.py (PDEATHSIG+ppid recheck, SIGTERM/SIGALRM→SystemExit funnel with re-entrancy guard, alarm(3600)); main() installs first; both finally blocks mark begin_teardown(); .drone.yml setsid+trap wrap. Live smoke on cc-ci (cc-ci-run /tmp/p1-smoke.py): TERM→rc=143+finally; ALRM→rc=142+finally+deadline log; parent-kill→child TERM'd, teardown ran.
  • P2 b302f3a: acquire_app_lock + _probe_and_reap + janitor rewrite; registry deleted. Live smoke (/tmp/p2-smoke*.py): held lock → "live concurrent run, leaving it", reaped=[]; killed holder → reap exactly once + lockfile unlinked; waiter blocked during probe-held reap, then re-acquired on the FRESH inode (probe confirmed held by waiter). Note: a select()-on-fd readline artifact in my smoke script initially looked like a failure — kernel state was verified directly. Unlink/recreate race guarded on BOTH sides via fstat/stat st_ino identity checks.
  • P3 17ebdf3: per-run ABRA_DIR. Verified abra CLI honors $ABRA_DIR on-host (skeleton probe: FATAs only on empty servers/; with servers+catalogue symlinks + recipes/ it works and even auto-clones recipes for app ls resolution into the per-run dir). p3-smoke: setup + fetch of custom-html-tiny landed in /tmp/p3runs/9999/abra/recipes, head commit + versions readable via abra.recipe_dir(). install_steps.sh path fix justified in DECISIONS.md (conc P3 entry). Pre-existing observation (NOT mine, unchanged): abra app ls -S -m -n currently FATAs "unable to resolve '0cc57a5a'" under the DEFAULT abra dir too → janitor's abra discovery yields [] and the docker-service sweep carries discovery. Out of this phase's scope.
  • P4 91d3cc7: concurrency.limit removed; maxTests comment states single-knob + new model. One stale comment line (.drone.yml l.39 "concurrency.limit=2 below") folds into P5.
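
The P1 mechanism can be illustrated with a small sketch: SIGTERM and SIGALRM are funneled into SystemExit so that finally blocks run, a flag stops later signals from interrupting a teardown already in progress, and a Linux-only prctl call arms a parent-death signal followed by a ppid recheck. The class and function names are hypothetical and the real harness/lifetime.py may differ. The exit codes follow the 128+signum convention seen in the smoke results (143, 142).

```python
import ctypes
import os
import signal
import time


class LifetimeGuard:
    """Hypothetical stand-in for the lifetime guards described in P1."""

    def __init__(self) -> None:
        self.in_teardown = False

    def begin_teardown(self) -> None:
        # Called first thing in a finally block: later signals become no-ops.
        self.in_teardown = True

    def _funnel(self, signum, frame) -> None:
        if self.in_teardown:
            return  # re-entrancy guard: let the running teardown finish
        raise SystemExit(128 + signum)

    def install(self, deadline_s: int = 3600) -> None:
        signal.signal(signal.SIGTERM, self._funnel)
        signal.signal(signal.SIGALRM, self._funnel)
        signal.alarm(deadline_s)  # hard deadline for the whole run


def arm_parent_death_signal(expected_ppid: int) -> None:
    """Linux only: have the kernel send SIGTERM when the parent dies."""
    PR_SET_PDEATHSIG = 1
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, int(signal.SIGTERM), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")
    # Recheck: the parent may have died before the prctl call took effect.
    if os.getppid() != expected_ppid:
        raise SystemExit(128 + int(signal.SIGTERM))


def run(guard: LifetimeGuard, body, teardown) -> None:
    try:
        body()
    finally:
        guard.begin_teardown()
        teardown()


if __name__ == "__main__":
    guard = LifetimeGuard()
    guard.install()
    events = []

    def body() -> None:
        os.kill(os.getpid(), signal.SIGTERM)
        for _ in range(200):  # the Python-level handler fires inside this loop
            time.sleep(0.01)

    def teardown() -> None:
        os.kill(os.getpid(), signal.SIGTERM)  # second TERM mid-teardown: ignored
        time.sleep(0.05)
        events.append("teardown finished")

    try:
        run(guard, body, teardown)
    except SystemExit as exc:
        print(exc.code, events)
    finally:
        signal.alarm(0)  # disarm the deadline
```

The demo deliberately sends SIGTERM a second time from inside the teardown, which is the case the guard flag exists for.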

All four commits: tests/unit 138 passed + lint PASS before each. Next: tests/concurrency suite.
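
For reference, the unlink/recreate guard mentioned under P2 can be sketched roughly as below. It assumes a plain flock-based lockfile. `acquire_lock` and `unlink_if_ours` are illustrative names, not the harness's acquire_app_lock or _probe_and_reap. The idea: after obtaining the flock, compare the inode of the open descriptor with the inode currently linked at the path, and retry if they differ. The reaping side makes the same comparison before unlinking.

```python
import fcntl
import os
import tempfile


def acquire_lock(path: str) -> int:
    """Block until we hold an flock on the file currently linked at `path`."""
    while True:
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            same = os.fstat(fd).st_ino == os.stat(path).st_ino
        except FileNotFoundError:
            same = False  # path was unlinked while we waited
        if same:
            return fd
        os.close(fd)  # we locked a stale inode: drop it and retry


def unlink_if_ours(fd: int, path: str) -> bool:
    """Unlink `path` only if it still names the inode that `fd` has open."""
    try:
        if os.fstat(fd).st_ino != os.stat(path).st_ino:
            return False
    except FileNotFoundError:
        return False
    os.unlink(path)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.lock")
        old_fd = acquire_lock(path)
        print(unlink_if_ours(old_fd, path))  # path names our inode: unlinked
        new_fd = acquire_lock(path)          # recreated on a fresh inode
        print(unlink_if_ours(old_fd, path))  # stale holder must not unlink it
        os.close(old_fd)
        os.close(new_fd)
```

A stricter variant would compare st_dev as well as st_ino. The sketch keeps to the st_ino check named in the entry above.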

2026-06-10 — tests/concurrency (84d90fb) + P5 (d3fe9e2) + M1 claim (e8e52cf)

  • Suite: 20 tests / 19 plan cases, all real-kernel (helpers.py subprocesses hold real flocks, install real prctl/alarm guards; CCCI_APP_LOCK_DIR sandboxes /run/lock; HelperPool reaps every helper + recorded grandchildren). First full run on cc-ci: 20 passed in 9.96s, zero flakes in 3 repeat runs during the P5 verification re-runs.
  • Design notes for the Adversary's blind-spot hunt (my own known limits):
    • case 8 (two janitors) uses threads in one process — valid because flock conflicts are per-open-file-description, and overlap is forced via a Barrier + 2s slow teardown stub.
    • case 14 relies on reparent-to-pid-1 (true on the cc-ci host; would need adjustment in a subreaper environment — marked NEVER_REPARENTED visibly if so).
    • cases 5-12 stub teardown_app (recording) — janitor probe/reap ordering is what's under test, not teardown internals (covered by Phase-1 e2e + M2 live checks).
  • M1 claimed at e8e52cf; full verification recipe in STATUS-conc.md (WHAT/WHERE/HOW/EXPECTED).
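
The premise behind case 8 (flock conflicts are per open file description, so threads in one process contend for real) can be checked with a short standalone sketch. This is not the suite's code: `contend` is a hypothetical helper, and it uses a Barrier only to force the attempts to overlap.

```python
import fcntl
import os
import tempfile
import threading


def contend(path: str, n: int = 2) -> list:
    """Have n threads open `path` separately and race for a non-blocking flock."""
    barrier = threading.Barrier(n)
    results = [False] * n

    def worker(i: int) -> None:
        # Each open() creates a new open file description in the same process.
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            barrier.wait()  # force the lock attempts to overlap
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                results[i] = True
            except BlockingIOError:
                results[i] = False
            barrier.wait()  # the winner holds until every thread has tried
        finally:
            os.close(fd)  # closing the descriptor releases the lock

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        outcome = contend(os.path.join(tmp, "janitor.lock"))
        print(sorted(outcome))  # exactly one thread should win the lock
```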

2026-06-10 — M2: merge + live verification (a)

  • Merge: bb5eb3d (--no-ff) pushed; push build 266 (self-test lint+hello) SUCCESS.
  • (a) cancel-mid-run: !testme on immich#2 → build 267 (custom) running on the NEW harness — log shows the setsid/trap wrap + "== per-run ABRA_DIR: /var/lib/cc-ci-runs/267/abra =="; lock /run/lock/cc-ci-app-immi-ad3e33...lock held by pid 636902; 4 immich services up. Canceled via drone API 04:42:07Z (HTTP 200, build status "killed"). Result: harness pid GONE (no leaked python — the old §8.1 gap is closed), immich services 0, volumes 0, secrets 0, .env 0 — the SIGTERM funnel ran the run's own teardown (better than the plan's minimum, which allowed the janitor to do the reaping). Lock RELEASED (lockfile present but unheld — tidy-swept by the next janitor, to be observed during (b)).
  • (b) triggered 04:46:53Z: !testme immich#2 (comment 14287) + plausible#3 (14288) in parallel.
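
One way to tell "lockfile present but unheld" apart from "held by a live run" is a non-blocking flock probe. The sketch below assumes an flock-based lockfile. `lock_is_held` is an illustrative name, not the harness's _probe_and_reap. The probe takes the lock for a moment when it succeeds, so a concurrent waiter can be delayed briefly.

```python
import fcntl
import os
import tempfile


def lock_is_held(path: str) -> bool:
    """Return True if another open file description holds an flock on `path`."""
    try:
        fd = os.open(path, os.O_RDWR)
    except FileNotFoundError:
        return False  # no lockfile at all
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True  # someone else holds it: a live run
    else:
        fcntl.flock(fd, fcntl.LOCK_UN)  # we got it, so nobody held it
        return False
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.lock")
        holder = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        fcntl.flock(holder, fcntl.LOCK_EX)
        print(lock_is_held(path))  # held while `holder` is open and locked
        os.close(holder)           # closing the descriptor releases the lock
        print(lock_is_held(path))  # file still present, but now unheld
```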