Files
cc-ci/BACKLOG-conc.md
autonomic-bot 2894778810
All checks were successful
continuous-integration/drone/push Build is passing
review(conc): M2(c) PASS — double-!testme both GREEN on CONC-A1-fixed harness; VETO LIFTED, CONC-A1 closed
Builds 290+291 (same immich domain) both success: 291 logged block line + acquired,
both deploy-count=1 (290 no false-2, 291 no FileNotFoundError), zero leakage.
Serialization also observed live in lslocks. CONC-A1 conditions 1-3 met; veto lifted.
Remaining for full M2: (a) cancel-mid-run re-run on fixed harness + Builder M2 claim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:51:46 +00:00

69 lines
4.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# BACKLOG — sub-phase conc
## Build backlog
- [x] P1 lock-lifetime hardening: prctl PDEATHSIG + ppid race check + SIGTERM handler →
teardown funnel + signal.alarm(3600) hard deadline; .drone.yml setsid/trap wrap;
PEP 446 comment on lock open()
- [x] P2 flock-probe janitor: acquire_app_lock(domain) at register_run_app's call site;
janitor probes per-domain lockfiles (acquired→reap under probe lock, held→leave,
>120min mtime→warn); delete registry symbols
- [x] P3 per-run ABRA_DIR: /var/lib/cc-ci-runs/<build>/abra with servers+catalogue symlinks,
fresh recipes/; fetch_recipe = plain clone; delete acquire_recipe_lock; route harness
recipe paths through ABRA_DIR
- [x] P4 config cleanup: remove concurrency.limit from .drone.yml; maxTests is the single knob
- [x] tests/concurrency suite (19 cases, real-kernel flock, explicit invocation only)
- [x] P5 docs/concurrency.md rewrite to the new model
- [ ] M1 claim (branch complete, both suites + lint green)
- [ ] M2: merge to main after M1 PASS, push build green, live verification ad
## Adversary findings
### [adversary] CONC-A1 — double-!testme same domain corrupts the shared deploy-count file (M2(c) FAIL)
**Severity:** blocks M2(c). Both runs of a same-domain double-!testme go RED.
**Root cause (two coupled defects, one shared root):**
1. The DG4.1 deploy-counter file is keyed by DOMAIN in the *shared* system tempdir, NOT per-run:
`run_recipe_ci.py:930 countfile = /tmp/ccci-deploys-<domain>`. P3 isolated `ABRA_DIR` per run
but this per-run state file was missed — it predates the restructure (ef44d46) and the OLD
recipe-flock used to serialize same-recipe runs end-to-end, incidentally masking it.
2. `lifecycle.deploy_app()` calls `_record_deploy()` (lifecycle.py:250) BEFORE
`acquire_app_lock(domain)` (lifecycle.py:254, introduced by P2 b302f3a). So the counter
increment happens OUTSIDE the serialization window — a second same-domain run bumps the
shared counter before it ever blocks on the lock.
**Observed (live, builds 279 + 281, immich PR#2, same domain immi-ad3e33, 2026-06-10T05:04Z):**
- Lock serialization itself WORKS: 281 logged `== app lock: ... in flight — waiting ==` at 2s,
then `== app lock: acquired ==` at 194s — exactly when 279 exited (279 finished 05:07:35).
- 279 RED: `!! deploy-count 2 != 1 (DG4.1 violation)`. The `2` = 281's pre-lock `_record_deploy`
(fired ~2s, before 281 blocked) polluting the shared counter 279 was actively using.
- 281 RED: `FileNotFoundError: /tmp/ccci-deploys-immi-ad3e33...` at run_recipe_ci.py:1213 —
279's end-of-run `os.remove(countfile)` (line 1215) deleted the shared file out from under 281,
whose single `_record_deploy` had already fired at 2s and never recreates it.
- Control: isolated immich (build 275, same fixed wrapper) → `deploy-count = 1`, GREEN. So this
is concurrency-specific, not a pre-existing immich/wrapper issue.
**Repro:** two `!testme` comments on the same recipe PR (same domain) in quick succession on the
deployed main harness → both builds RED (one DG4.1 false-violation, one FileNotFoundError).
**Fix direction (Builder owns):** key the deploy-counter per RUN, not per domain — e.g. put it in
`/var/lib/cc-ci-runs/<build>/` (alongside the per-run artifacts) or include the build/run id in the
filename, and export that path via `CCCI_DEPLOY_COUNT_FILE`. Per-run keying fixes BOTH defects at
once (no cross-run pollution; no shared remove). Moving `_record_deploy()` after `acquire_app_lock`
alone is INSUFFICIENT — the shared `os.remove`/`FileNotFoundError` collision survives. Add a
tests/concurrency case: two same-domain runs serialized on the app lock → each sees its own
deploy-count, neither removes the other's file (this is the gap vs the 19 planned cases — case 4
serialises acquire but never asserts deploy-count isolation across the two).
**Closure:** adversary-owned. Re-test the (c) double-!testme live (both GREEN, visible block line,
zero leakage) + the new unit case before this clears. Only I close it.
**CLOSED @2026-06-10T09:0xZ** — fix b6e12ef (run-keyed state files via `_run_state_path`) merged
139e319. Verified by me: (a) code cold-verified + mutation-proven (reverting to domain-keying fails
all 3 test_run_state cases); (b) suites green cold (unit 138, concurrency 23); (c) LIVE re-run
builds 290+291 (same immich domain immi-ad3e33) BOTH SUCCESS — 291 logged the block line
(`in flight — waiting``acquired`), both read `deploy-count = 1` (290 no longer false-2; 291 no
longer FileNotFoundError), zero leakage after (0 procs / 0 apps / 0 services / 0 volumes / 0 secrets
/ no held locks). Full evidence in REVIEW-conc M2(c) PASS.