# JOURNAL — sub-phase conc (Builder, append-only)

## 2026-06-10 — bootstrap

Read concurrency-restructure-full-plan.md (SSOT) + plan.md §6.1/§7/§9. Oriented on the code:

- `runner/harness/lifecycle.py` — recipe flock (l.46), registry (l.65–97), deploy_app
  registration (l.283), teardown unregister (l.723), three-way janitor (l.726).
- `runner/run_recipe_ci.py` — `acquire_recipe_lock` call site (l.843), `fetch_recipe` (l.140,
  rm-rf + reclone of the shared tree), janitor call sites (l.600 quick, l.932 cold).
- `.drone.yml` — recipe-ci step runs `cc-ci-run runner/run_recipe_ci.py` bare (P1 wraps it),
  `concurrency.limit: 2` (P4 removes).
- Greps for P3 fallout: `~/.abra/recipes` referenced in abra.py (recipe_checkout,
  has_lightweight_version_tags, recipe_head_commit, recipe_versions), generic.py:28,
  lifecycle.prepull_images, run_recipe_ci (fetch_recipe, snapshot_recipe_tests, comment),
  warm_reconcile.py:202 (runs OUTSIDE per-run context — keeps default), and
  tests/ghost+discourse install_steps.sh (`${HOME}/.abra/recipes/...` — these run INSIDE a
  run and copy compose.ccci.yml into the deploy tree, so they must resolve the per-run dir).
- `~/.abra/servers/...` paths are unaffected by design (servers/ is symlinked to the canonical
  /root/.abra/servers, so both resolutions land on the same file).

Working setup: state files on main in this clone; code on branch `restructure/concurrency`
via a git worktree at ../cc-ci-conc; test runs on the cc-ci host via /root/builder-clone
(`cc-ci-run -m pytest ...`, `nix develop .#lint`).

## 2026-06-10 — P1–P4 landed on restructure/concurrency

- P1 b492f99: harness/lifetime.py (PDEATHSIG+ppid recheck, SIGTERM/SIGALRM→SystemExit funnel
  with re-entrancy guard, alarm(3600)); main() installs first; both finally blocks mark
  begin_teardown(); .drone.yml setsid+trap wrap. Live smoke on cc-ci (cc-ci-run /tmp/p1-smoke.py):
  TERM→rc=143+finally; ALRM→rc=142+finally+deadline log; parent-kill→child TERM'd, teardown ran.
- P2 b302f3a: acquire_app_lock + _probe_and_reap + janitor rewrite; registry deleted. Live smoke
  (/tmp/p2-smoke*.py): held lock → "live concurrent run, leaving it", reaped=[]; killed holder →
  reap exactly once + lockfile unlinked; waiter blocked during probe-held reap, then re-acquired
  on the FRESH inode (probe confirmed held by waiter). Note: a select()-on-fd readline artifact
  in my smoke script initially looked like a failure — kernel state was verified directly.
  Unlink/recreate race guarded on BOTH sides via fstat/stat st_ino identity checks.
- P3 17ebdf3: per-run ABRA_DIR. Verified abra CLI honors $ABRA_DIR on-host (skeleton probe:
  FATAs only on empty servers/; with servers+catalogue symlinks + recipes/ it works and even
  auto-clones recipes for `app ls` resolution into the per-run dir). p3-smoke: setup + fetch of
  custom-html-tiny landed in /tmp/p3runs/9999/abra/recipes, head commit + versions readable via
  abra.recipe_dir(). install_steps.sh path fix justified in DECISIONS.md (conc P3 entry).
  Pre-existing observation (NOT mine, unchanged): `abra app ls -S -m -n` currently FATAs
  "unable to resolve '0cc57a5a'" under the DEFAULT abra dir too → janitor's abra discovery
  yields [] and the docker-service sweep carries discovery. Out of this phase's scope.
- P4 91d3cc7: concurrency.limit removed; maxTests comment states single-knob + new model.
  One stale comment line (.drone.yml l.39 "concurrency.limit=2 below") folds into P5.

All four commits: tests/unit 138 passed + lint PASS before each. Next: tests/concurrency suite.
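The per-run ABRA_DIR skeleton described under P3 can be sketched roughly as below. This is an illustration under assumptions, not the harness code: `make_run_abra_dir` and `run_env` are hypothetical names, and only the layout (a private `recipes/` directory plus `servers` and `catalogue` symlinks back to the canonical abra dir) is taken from the notes above.

```python
import os
import tempfile
from pathlib import Path


def make_run_abra_dir(run_root: Path, canonical: Path) -> Path:
    """Build a private abra dir for one run (hypothetical helper).

    recipes/ is a real directory so recipe clones stay run-local;
    servers/ and catalogue/ are symlinks to the canonical abra dir.
    """
    abra_dir = run_root / "abra"
    (abra_dir / "recipes").mkdir(parents=True, exist_ok=True)
    for shared in ("servers", "catalogue"):
        link = abra_dir / shared
        if not link.is_symlink():
            link.symlink_to(canonical / shared, target_is_directory=True)
    return abra_dir


def run_env(abra_dir: Path) -> dict:
    """Environment for child processes: a copy of ours plus ABRA_DIR."""
    env = dict(os.environ)
    env["ABRA_DIR"] = str(abra_dir)
    return env


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for the canonical abra dir; the real one lives elsewhere.
        canonical = Path(tmp) / "canonical"
        (canonical / "servers").mkdir(parents=True)
        (canonical / "catalogue").mkdir()
        abra_dir = make_run_abra_dir(Path(tmp) / "runs" / "9999", canonical)
        print(sorted(p.name for p in abra_dir.iterdir()))
```

Because `servers` resolves to the same files from either abra dir, only recipe checkouts become run-local, which is the isolation P3 is after.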

## 2026-06-10 — tests/concurrency (84d90fb) + P5 (d3fe9e2) + M1 claim (e8e52cf)

- Suite: 20 tests / 19 plan cases, all real-kernel (helpers.py subprocesses hold real flocks,
  install real prctl/alarm guards; CCCI_APP_LOCK_DIR sandboxes /run/lock; HelperPool reaps every
  helper + recorded grandchildren). First full run on cc-ci: 20 passed in 9.96s, zero flakes in
  3 repeat runs during the P5 verification re-runs.
- Design notes for the Adversary's blind-spot hunt (my own known limits):
  - case 8 (two janitors) uses threads in one process — valid because flock conflicts are
    per-open-file-description, and overlap is forced via a Barrier + 2s slow teardown stub.
  - case 14 relies on reparent-to-pid-1 (true on the cc-ci host; would need adjustment in a
    subreaper environment — marked NEVER_REPARENTED visibly if so).
  - cases 5-12 stub teardown_app (recording) — janitor probe/reap ordering is what's under
    test, not teardown internals (covered by Phase-1 e2e + M2 live checks).
- M1 claimed at e8e52cf; full verification recipe in STATUS-conc.md (WHAT/WHERE/HOW/EXPECTED).
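The case 8 note leans on flock conflicts being per open file description rather than per process. A minimal sketch of a non-blocking probe built on that property follows; `probe_held` is a hypothetical name and is not the harness's `_probe_and_reap`.

```python
import fcntl
import os
import tempfile


def probe_held(path: str) -> bool:
    """Return True if another open file description holds the flock.

    Each os.open() creates a new open file description, so the probe
    conflicts with a holder even when both live in the same process.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True  # someone else holds it: a live run
        fcntl.flock(fd, fcntl.LOCK_UN)  # we got it, so nobody held it
        return False
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "app.lock")
        holder = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
        fcntl.flock(holder, fcntl.LOCK_EX)
        print("held while holder is open:", probe_held(lock_path))
        os.close(holder)  # closing the descriptor drops the lock
        print("held after holder closed:", probe_held(lock_path))
```

This is why two janitor threads in one process are a fair stand-in for two janitor processes, provided each thread opens the lockfile itself.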

## 2026-06-10 — M2: merge + live verification (a)

- Merge: bb5eb3d (--no-ff) pushed; push build 266 (self-test lint+hello) SUCCESS.
- (a) cancel-mid-run: !testme on immich#2 → build 267 (custom) running on the NEW harness —
  log shows the setsid/trap wrap + "== per-run ABRA_DIR: /var/lib/cc-ci-runs/267/abra ==";
  lock /run/lock/cc-ci-app-immi-ad3e33...lock held by pid 636902; 4 immich services up.
  Canceled via drone API 04:42:07Z (HTTP 200, build status "killed"). Result: harness pid
  GONE (no leaked python — the old §8.1 gap is closed), immich services 0, volumes 0,
  secrets 0, .env 0 — the SIGTERM funnel ran the run's own teardown (better than the plan's
  minimum, which allowed the janitor to do the reaping). Lock RELEASED (lockfile present but
  unheld — tidy-swept by the next janitor, to be observed during (b)).
- (b) triggered 04:46:53Z: !testme immich#2 (comment 14287) + plausible#3 (14288) in parallel.
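The cancel result in (a) depends on the P1 lifetime guards turning SIGTERM into an ordinary unwind so the run's own `finally` performs teardown. A minimal sketch of that pattern, assuming Linux for the PDEATHSIG part: `begin_teardown` is named in the notes, while `arm_parent_death_signal`, `install_signal_funnel` and `run` are hypothetical names, and the real harness lets SystemExit propagate instead of returning the code.

```python
import ctypes
import os
import signal
import sys

PR_SET_PDEATHSIG = 1  # value from <linux/prctl.h>
_in_teardown = False  # re-entrancy guard, flipped by begin_teardown()


def arm_parent_death_signal(expected_ppid: int) -> bool:
    """Ask the kernel to SIGTERM us when the parent dies (Linux only).

    Returns False if the parent was already gone before the guard was
    armed, which is the case the ppid recheck exists to catch.
    """
    if sys.platform.startswith("linux"):
        libc = ctypes.CDLL(None, use_errno=True)
        libc.prctl(PR_SET_PDEATHSIG, int(signal.SIGTERM), 0, 0, 0)
    return os.getppid() == expected_ppid


def begin_teardown() -> None:
    """Mark that cleanup has started; later signals must not abort it."""
    global _in_teardown
    _in_teardown = True


def _funnel(signum, _frame):
    if _in_teardown:
        return  # already unwinding: let cleanup finish
    raise SystemExit(128 + signum)  # TERM -> 143, ALRM -> 142


def install_signal_funnel(deadline_s: int = 3600) -> None:
    signal.signal(signal.SIGTERM, _funnel)
    signal.signal(signal.SIGALRM, _funnel)
    signal.alarm(deadline_s)  # hard deadline for the whole run


def run(work, teardown, deadline_s: int = 3600) -> int:
    """Run work(), always run teardown(), and report the exit code."""
    install_signal_funnel(deadline_s)
    try:
        work()
        return 0
    except SystemExit as exc:
        return exc.code
    finally:
        begin_teardown()
        teardown()
        signal.alarm(0)  # cancel the deadline once cleanup is done
```

The guard matters because a second SIGTERM during cleanup would otherwise raise inside `teardown()` and leave services behind for the janitor.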

## 2026-06-10 — M2(b) round 1: green runs, poisoned exit code → wrapper fix

- Builds 268 (immich#2) + 269 (plausible#3) ran in PARALLEL on the new harness: both logs end
  with all-tiers-pass RUN SUMMARY (level=4, deploy-count 1/1) and the host shows ZERO leakage
  after (no harness processes, no immi/plau services/volumes/secrets, only unheld lockfiles).
  Both steps nevertheless exited 1: the P1 EXIT trap's kill of the already-gone process group
  returns ESRCH under the runner's `set -e` shell — a GREEN run reported failure.
- Reproduced minimally on-host (`sh -e` and `bash -e`: rc=1 on a clean exit with the old trap).
  Fix e1c4198 (capture rc; `trap - TERM EXIT`; `|| true` on the trap kill) verified on-host:
  green rc=0, red rc=7 propagated, TERM→wrapper forwards to child, exits 143. Merged to main
  b7a009c; push builds 272-274 green. Adversary notified via inbox.
- (b) re-triggered on the fixed wrapper 04:56:10Z (immich#2 + plausible#3).
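The actual wrapper is shell in `.drone.yml`; the sketch below is a Python analogue of the same three rules from fix e1c4198 (capture the child's rc first, drop the TERM handler, and treat a failed group kill as harmless), not the shipped code. `run_wrapped` is a hypothetical name.

```python
import os
import signal
import subprocess
import sys


def run_wrapped(argv: list[str]) -> int:
    """Run argv in its own session and return its exit code unchanged."""
    child = subprocess.Popen(argv, start_new_session=True)  # like setsid

    def forward_term(_signum, _frame):
        try:
            os.killpg(child.pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # group already gone

    previous = signal.signal(signal.SIGTERM, forward_term)
    try:
        rc = child.wait()  # capture rc before any cleanup can clobber it
    finally:
        signal.signal(signal.SIGTERM, previous)  # like `trap - TERM`
        try:
            os.killpg(child.pid, signal.SIGTERM)  # sweep stragglers
        except ProcessLookupError:
            pass  # ESRCH on a clean exit: the `|| true` case
    # Popen reports death-by-signal as a negative number; map to 128+N.
    return rc if rc >= 0 else 128 - rc


if __name__ == "__main__":
    green = run_wrapped([sys.executable, "-c", "raise SystemExit(0)"])
    red = run_wrapped([sys.executable, "-c", "raise SystemExit(7)"])
    print(green, red)
```

The point of the ordering is that the cleanup kill can never overwrite the status the CI step reports.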

## 2026-06-10 — M2(b) PASS + (c) triggered

- (b) round 2 on fixed wrapper: builds 275 (immich#2) + 276 (plausible#3) ran in PARALLEL,
  BOTH status=success (drone API). Host after: 0 python harness processes, 0 immi/plau
  services/volumes/secrets/.envs — zero leakage. (d) satisfied by 275 (full green immich e2e).
  Leftover unheld lockfiles present by design (tidy-swept at next janitor).
- (c) double-!testme on immich#2: two comments at 05:03:58Z → two custom builds, same run
  domain immi-ad3e33 → exactly one must block on the app lock with the visible log line.

## 2026-06-10 — CONC-A1: (c) failure root-caused + fixed (run-keyed state files)

- (c) round 1 = builds 279+281, both RED. Root cause (independently also found+filed by the
  Adversary as CONC-A1 while I was mid-diagnosis — same conclusion from both loops): the four
  run-scoped state files (deploys/opstate/deps/depskip) were DOMAIN-keyed in shared /tmp;
  281's main()-preamble + pre-lock _record_deploy fired before it blocked on the app lock →
  279 read deploy-count 2 (false DG4.1 RED); 279's end-of-run os.remove deleted the shared
  countfile → 281 crashed FileNotFoundError at its own read. Lock serialization itself worked
  (281: waiting @+2s, acquired @+194s = 279's exit). Masked pre-restructure by the
  end-to-end recipe flock.
- Fix b6e12ef on branch, merged to main 139e319: _run_state_path() keys all four by
  run id + harness pid; consumers were always env-fed (CCCI_*_FILE), so domain keying was
  never load-bearing. Both cleanup sites already remove all four on normal exit.
- New tests/concurrency/test_run_state.py (suite now 23): path invariants + real-process
  CONC-A1 interleaving via helpers.py `deploy-count-run` (countfile init → pre-lock
  _record_deploy → acquire → gated read). Teeth verified: under simulated shared keying the
  regression test FAILS (host run: 3 failed); with the fix: 23 passed + 138 unit + lint PASS.
- Next: push build green → re-run (b)+(d), then (c), then (a) per the VETO's conditions.
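The CONC-A1 interleaving can be reproduced in miniature. In this sketch the file-name scheme, the pids and the helper names are all hypothetical (the real helper is `_run_state_path()`); only the idea of keying by run id + harness pid instead of by domain, and the build numbers and domain, come from the entry above.

```python
import tempfile
from pathlib import Path


def domain_state_path(kind: str, domain: str, base: Path) -> Path:
    """Old scheme: two runs of the same domain share one file."""
    return base / f"ccci-{kind}-{domain}"


def run_state_path(kind: str, run_id: int, pid: int, base: Path) -> Path:
    """New scheme: one file per (run id, harness pid)."""
    return base / f"ccci-{kind}-{run_id}-{pid}"


def record_deploy(countfile: Path) -> None:
    countfile.write_text(str(int(countfile.read_text()) + 1))


def simulate(path_for) -> list:
    """Replay the interleaving: both preambles init, then both record."""
    first = path_for(279, 1001)   # pids are made up for the example
    second = path_for(281, 1002)
    first.write_text("0")         # 279 preamble
    second.write_text("0")        # 281 preamble, before it blocks on the lock
    record_deploy(first)          # 279 deploys
    record_deploy(second)         # 281 pre-lock record
    return [int(first.read_text()), int(second.read_text())]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        shared = simulate(
            lambda run, pid: domain_state_path("deploys", "immi-ad3e33", base))
        keyed = simulate(
            lambda run, pid: run_state_path("deploys", run, pid, base))
        print("domain-keyed counts:", shared)
        print("run-keyed counts:", keyed)
```

With domain keying both runs observe the other's deploy; with run keying each run only ever sees its own count.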

## 2026-06-10 — M2 re-verification on CONC-A1-fixed main (139e319)

- Push builds 283/284/285 (branch fix, merge, inbox) all green.
- (b)+(d) round 3 (comments 14299/14300, 08:17:35Z): builds 287 (immich#2) + 288 (plausible#3)
  BOTH success, started simultaneously 08:17:40Z (parallel), finished 08:21:06/08:21:13.
  Both logs: deploy-count = 1 (expect 1), level=4. Host after: pgrep -f 'run_recipe_c[i]' → no
  match (earlier "2" was pgrep self-match of the ssh cmdline); immi/plau services/volumes/
  secrets/server-envs all 0. Zero leakage. (d) satisfied by 287 (full green immich e2e on the
  final harness code).
- (c) round 2 triggered 08:22:13Z: comments 14303+14304 on immich#2 (same domain immi-ad3e33).
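The `run_recipe_c[i]` spelling avoids the self-match because the probe's own command line contains the bracket expression literally, not the string it matches. A small sketch of the effect, using Python's `re` as a stand-in for pgrep's regex matching (equivalent for a pattern this simple); the example command lines are assumptions.

```python
import re

# Hypothetical command lines as `pgrep -f` would see them.
HARNESS = "cc-ci-run runner/run_recipe_ci.py"
NAIVE_PROBE = "ssh host pgrep -f run_recipe_ci"
BRACKET_PROBE = "ssh host pgrep -f 'run_recipe_c[i]'"


def matching(pattern: str, cmdlines: list) -> list:
    """Return the command lines the pattern matches anywhere."""
    rx = re.compile(pattern)
    return [c for c in cmdlines if rx.search(c)]


if __name__ == "__main__":
    # Naive pattern: matches the harness AND the probe's own cmdline.
    print(matching("run_recipe_ci", [HARNESS, NAIVE_PROBE]))
    # Bracket pattern: the probe's cmdline contains "c[i]", not "ci".
    print(matching("run_recipe_c[i]", [HARNESS, BRACKET_PROBE]))
```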

## 2026-06-10 — M2(c) PASS round 2 (builds 290+291) + (a) re-run triggered

- (c) round 2: builds 290 (08:22:30→08:46:05) + 291 (08:22:33→08:49:23) BOTH success.
  291 log: "== app lock: another run of immi-ad3e33... in flight — waiting ==" at +1s,
  "acquired" at +1411s = exactly 290's exit. Both: deploy-count = 1 (expect 1), level=4.
  Slowness was an immich-ML healthcheck flake (Adversary cross-confirmed live via lslocks:
  one holder pid 739163, one waiter pid 739341 on the same lock inode — serialization observed
  in the kernel lock table); ML converged inside the 1500s window, both runs green anyway —
  no clean re-run needed.
- After both: no harness procs (pgrep run_recipe_c[i] empty), 0 immi/plau services/volumes/
  secrets/server-envs. Unheld lockfile remains by design (tidy-swept at next janitor probe).
- (a) re-run on fixed harness: !testme immich#2 comment 14307 @08:50:02Z; will cancel mid-run
  via drone API once the deploy is in flight, then check pid/lock/leakage + janitor reap.
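The waiting/acquired behaviour seen in build 291, combined with the st_ino identity check from P2, can be sketched as a retry loop. This is a simplified illustration, not `acquire_app_lock` itself: `acquire_lock` and its log messages are hypothetical.

```python
import fcntl
import os
import time


def acquire_lock(path: str, log=print) -> int:
    """Block until we hold an exclusive flock on the file currently at path.

    Returns the locked file descriptor; the caller keeps it open for the
    lifetime of the run and closes it to release the lock.
    """
    start = time.monotonic()
    announced = False
    while True:
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            if not announced:
                log(f"waiting @+{time.monotonic() - start:.0f}s")
                announced = True
            fcntl.flock(fd, fcntl.LOCK_EX)  # block until the holder exits
        # Identity check: a janitor may have unlinked (and someone else
        # recreated) the lockfile while we were blocked on the old inode.
        try:
            current = os.fstat(fd).st_ino == os.stat(path).st_ino
        except FileNotFoundError:
            current = False
        if current:
            log(f"acquired @+{time.monotonic() - start:.0f}s")
            return fd
        os.close(fd)  # lock was on a stale inode: retry on the fresh file
```

Without the identity check a waiter could end up "holding" a lock on an unlinked inode while a later run locks the recreated file, so two runs of the same domain would proceed at once.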

## 2026-06-10 — M2(a) re-run PASS (build 295) + M2 claim

- (a) on fixed harness: build 295 (comment 14307 @08:50:02Z) canceled @08:51:05Z (HTTP 200)
  while mid-deploy (lock held by pid 763099, 4 immich services converging). Harness pid GONE
  @08:51:15Z — the SIGTERM funnel ran the run's own teardown inside 10s; build status=killed;
  lock released (lslocks empty); services/volumes/secrets/envs all 0. Zero leakage, no janitor
  required.
- Adversary lifted the CONC-A1 VETO @09:05Z with its own M2(c) PASS (290/291 cold-verified,
  kernel-lock-table serialization observation). Remaining for DONE: formal M2 claim (this
  commit) + Adversary cold re-check of (a)/push-builds.
- M2 claimed in STATUS-conc.md with consolidated (a)-(d) evidence + cold re-check recipe.

## 2026-06-10 — M2 PASS → ## DONE

- Adversary M2 PASS @08:55Z (review 9987fba): all 7 claim items cold-confirmed, both M2-found
  fixes verified, guardrails honored, no open veto. Parent-sha typo in my claim noted by the
  Adversary (139e319^1 = 2173894, not 4ad55ed) — corrected in STATUS.
- ## DONE written to STATUS-conc.md. Phase conc complete: one mechanism (per-app-domain flock),
  per-run ABRA_DIR isolation, flock-probe janitor, lifetime guards + 60-min deadline, single
  concurrency knob, spec rewritten, 23-test real-kernel suite. Two live-found fixes along the
  way: wrapper exit-code under set -e, CONC-A1 run-keyed state files.