machine-docs: move all per-phase coordination files out of repo root
Some checks failed
continuous-integration/drone/push Build is failing
Some checks failed
continuous-integration/drone/push Build is failing
STATUS/BACKLOG/REVIEW/JOURNAL for bsky/conc/dstamp/kuma/lvl5/mailu/rcust/shot (32 files) were at the repo root; move them into machine-docs/ to match the mandated file-location rule (DECISIONS/DEFERRED/INBOX + older phases already live there). AGENTS.md gains an explicit File-location rule. No content change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
68
machine-docs/BACKLOG-conc.md
Normal file
68
machine-docs/BACKLOG-conc.md
Normal file
@ -0,0 +1,68 @@
|
||||
# BACKLOG — sub-phase conc
|
||||
|
||||
## Build backlog
|
||||
|
||||
- [x] P1 lock-lifetime hardening: prctl PDEATHSIG + ppid race check + SIGTERM handler →
|
||||
teardown funnel + signal.alarm(3600) hard deadline; .drone.yml setsid/trap wrap;
|
||||
PEP 446 comment on lock open()
|
||||
- [x] P2 flock-probe janitor: acquire_app_lock(domain) at register_run_app's call site;
|
||||
janitor probes per-domain lockfiles (acquired→reap under probe lock, held→leave,
|
||||
>120min mtime→warn); delete registry symbols
|
||||
- [x] P3 per-run ABRA_DIR: /var/lib/cc-ci-runs/<build>/abra with servers+catalogue symlinks,
|
||||
fresh recipes/; fetch_recipe = plain clone; delete acquire_recipe_lock; route harness
|
||||
recipe paths through ABRA_DIR
|
||||
- [x] P4 config cleanup: remove concurrency.limit from .drone.yml; maxTests is the single knob
|
||||
- [x] tests/concurrency suite (19 cases, real-kernel flock, explicit invocation only)
|
||||
- [x] P5 docs/concurrency.md rewrite to the new model
|
||||
- [ ] M1 claim (branch complete, both suites + lint green)
|
||||
- [ ] M2: merge to main after M1 PASS, push build green, live verification a–d
|
||||
|
||||
## Adversary findings
|
||||
|
||||
### [adversary] CONC-A1 — double-!testme same domain corrupts the shared deploy-count file (M2(c) FAIL)
|
||||
|
||||
**Severity:** blocks M2(c). Both runs of a same-domain double-!testme go RED.
|
||||
|
||||
**Root cause (two coupled defects, one shared root):**
|
||||
1. The DG4.1 deploy-counter file is keyed by DOMAIN in the *shared* system tempdir, NOT per-run:
|
||||
`run_recipe_ci.py:930 countfile = /tmp/ccci-deploys-<domain>`. P3 isolated `ABRA_DIR` per run
|
||||
but this per-run state file was missed — it predates the restructure (ef44d46) and the OLD
|
||||
recipe-flock used to serialize same-recipe runs end-to-end, incidentally masking it.
|
||||
2. `lifecycle.deploy_app()` calls `_record_deploy()` (lifecycle.py:250) BEFORE
|
||||
`acquire_app_lock(domain)` (lifecycle.py:254, introduced by P2 b302f3a). So the counter
|
||||
increment happens OUTSIDE the serialization window — a second same-domain run bumps the
|
||||
shared counter before it ever blocks on the lock.
|
||||
|
||||
**Observed (live, builds 279 + 281, immich PR#2, same domain immi-ad3e33, 2026-06-10T05:04Z):**
|
||||
- Lock serialization itself WORKS: 281 logged `== app lock: ... in flight — waiting ==` at 2s,
|
||||
then `== app lock: acquired ==` at 194s — exactly when 279 exited (279 finished 05:07:35).
|
||||
- 279 RED: `!! deploy-count 2 != 1 (DG4.1 violation)`. The `2` = 281's pre-lock `_record_deploy`
|
||||
(fired ~2s, before 281 blocked) polluting the shared counter 279 was actively using.
|
||||
- 281 RED: `FileNotFoundError: /tmp/ccci-deploys-immi-ad3e33...` at run_recipe_ci.py:1213 —
|
||||
279's end-of-run `os.remove(countfile)` (line 1215) deleted the shared file out from under 281,
|
||||
whose single `_record_deploy` had already fired at 2s and never recreates it.
|
||||
- Control: isolated immich (build 275, same fixed wrapper) → `deploy-count = 1`, GREEN. So this
|
||||
is concurrency-specific, not a pre-existing immich/wrapper issue.
|
||||
|
||||
**Repro:** two `!testme` comments on the same recipe PR (same domain) in quick succession on the
|
||||
deployed main harness → both builds RED (one DG4.1 false-violation, one FileNotFoundError).
|
||||
|
||||
**Fix direction (Builder owns):** key the deploy-counter per RUN, not per domain — e.g. put it in
|
||||
`/var/lib/cc-ci-runs/<build>/` (alongside the per-run artifacts) or include the build/run id in the
|
||||
filename, and export that path via `CCCI_DEPLOY_COUNT_FILE`. Per-run keying fixes BOTH defects at
|
||||
once (no cross-run pollution; no shared remove). Moving `_record_deploy()` after `acquire_app_lock`
|
||||
alone is INSUFFICIENT — the shared `os.remove`/`FileNotFoundError` collision survives. Add a
|
||||
tests/concurrency case: two same-domain runs serialized on the app lock → each sees its own
|
||||
deploy-count, neither removes the other's file (this is the gap vs the 19 planned cases — case 4
|
||||
serialises acquire but never asserts deploy-count isolation across the two).
|
||||
|
||||
**Closure:** adversary-owned. Re-test the (c) double-!testme live (both GREEN, visible block line,
|
||||
zero leakage) + the new unit case before this clears. Only I close it.
|
||||
|
||||
**CLOSED @2026-06-10T09:0xZ** — fix b6e12ef (run-keyed state files via `_run_state_path`) merged
|
||||
139e319. Verified by me: (a) code cold-verified + mutation-proven (reverting to domain-keying fails
|
||||
all 3 test_run_state cases); (b) suites green cold (unit 138, concurrency 23); (c) LIVE re-run
|
||||
builds 290+291 (same immich domain immi-ad3e33) BOTH SUCCESS — 291 logged the block line
|
||||
(`in flight — waiting` → `acquired`), both read `deploy-count = 1` (290 no longer false-2; 291 no
|
||||
longer FileNotFoundError), zero leakage after (0 procs / 0 apps / 0 services / 0 volumes / 0 secrets
|
||||
/ no held locks). Full evidence in REVIEW-conc M2(c) PASS.
|
||||
Reference in New Issue
Block a user