cc-ci/BACKLOG-conc.md
autonomic-bot 2894778810
review(conc): M2(c) PASS — double-!testme both GREEN on CONC-A1-fixed harness; VETO LIFTED, CONC-A1 closed
Builds 290+291 (same immich domain) both success: 291 logged block line + acquired,
both deploy-count=1 (290 no false-2, 291 no FileNotFoundError), zero leakage.
Serialization also observed live in lslocks. CONC-A1 conditions 1-3 met; veto lifted.
Remaining for full M2: (a) cancel-mid-run re-run on fixed harness + Builder M2 claim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:51:46 +00:00


BACKLOG — sub-phase conc

Build backlog

  • P1 lock-lifetime hardening: prctl PDEATHSIG + ppid race check + SIGTERM handler → teardown funnel + signal.alarm(3600) hard deadline; .drone.yml setsid/trap wrap; PEP 446 comment on lock open()
  • P2 flock-probe janitor: acquire_app_lock(domain) at register_run_app's call site; janitor probes per-domain lockfiles (acquired→reap under probe lock, held→leave, >120min mtime→warn); delete registry symbols
  • P3 per-run ABRA_DIR: /var/lib/cc-ci-runs//abra with servers+catalogue symlinks, fresh recipes/; fetch_recipe = plain clone; delete acquire_recipe_lock; route harness recipe paths through ABRA_DIR
  • P4 config cleanup: remove concurrency.limit from .drone.yml; maxTests is the single knob
  • tests/concurrency suite (19 cases, real-kernel flock, explicit invocation only)
  • P5 docs/concurrency.md rewrite to the new model
  • M1 claim (branch complete, both suites + lint green)
  • M2: merge to main after M1 PASS, push build green, live verification ad
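
The flock-probe idea in P2 can be illustrated with a small sketch. This is not the harness code: `probe_lockfile`, `probe_demo`, the lockfile name and the three return labels are hypothetical names chosen for illustration, and only the 120-minute threshold comes from the backlog item. It assumes Linux `flock` semantics, where two separate opens of the same file conflict even inside one process.

```python
import fcntl
import os
import tempfile
import time

STALE_SECONDS = 120 * 60  # the ">120min mtime" warn threshold named in P2


def probe_lockfile(path, now=None):
    """Classify one per-domain lockfile: 'acquired', 'held' or 'held-stale'."""
    now = time.time() if now is None else now
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # Non-blocking exclusive attempt: fails at once if a run holds it.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            age = now - os.stat(path).st_mtime
            return "held-stale" if age > STALE_SECONDS else "held"
        # Probe lock is ours, so no live run owns the domain. A real janitor
        # would reap the orphaned app here, before releasing the probe lock.
        return "acquired"
    finally:
        os.close(fd)  # closing the descriptor releases any flock taken on it


def probe_demo():
    """Probe a free lockfile, then a held one, then a held one seen as old."""
    with tempfile.TemporaryDirectory() as tmp:
        lock = os.path.join(tmp, "example-domain.lock")  # hypothetical name
        free = probe_lockfile(lock)
        holder = os.open(lock, os.O_RDWR)
        fcntl.flock(holder, fcntl.LOCK_EX)  # stands in for an in-flight run
        busy = probe_lockfile(lock)
        old = probe_lockfile(lock, now=time.time() + STALE_SECONDS + 60)
        os.close(holder)
        return free, busy, old
```

Because the kernel drops the lock when the holding process exits, a successful probe is evidence that no run is alive for that domain, which is what would let the registry symbols be deleted.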

Adversary findings

[adversary] CONC-A1 — double-!testme same domain corrupts the shared deploy-count file (M2(c) FAIL)

Severity: blocks M2(c). Both runs of a same-domain double-!testme go RED.

Root cause (two coupled defects, one shared root):

  1. The DG4.1 deploy-counter file is keyed by DOMAIN in the shared system tempdir, NOT per-run: run_recipe_ci.py:930 countfile = /tmp/ccci-deploys-<domain>. P3 isolated ABRA_DIR per run, but this state file, which should also be per-run, was missed: it predates the restructure (ef44d46), and the OLD recipe-flock used to serialize same-recipe runs end-to-end, incidentally masking it.
  2. lifecycle.deploy_app() calls _record_deploy() (lifecycle.py:250) BEFORE acquire_app_lock(domain) (lifecycle.py:254, introduced by P2 b302f3a). So the counter increment happens OUTSIDE the serialization window — a second same-domain run bumps the shared counter before it ever blocks on the lock.
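
The ordering problem in defect 2 can be reproduced in miniature. The sketch below is an illustration under assumptions, not the code in lifecycle.py: `try_app_lock`, `bump`, `deploy_record_first` and `deploy_lock_first` are hypothetical stand-ins, the counter is an in-memory dict, and the lock attempt is non-blocking so the demo cannot hang (the real run blocks and waits).

```python
import fcntl
import os
import tempfile


def try_app_lock(lock_path):
    """Non-blocking stand-in for the app lock; returns an fd or None."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def bump(counter):
    """Stand-in for the deploy-count increment."""
    counter["deploys"] += 1


def deploy_record_first(counter, lock_path):
    bump(counter)  # side effect lands before we know whether we may proceed
    return try_app_lock(lock_path)


def deploy_lock_first(counter, lock_path):
    fd = try_app_lock(lock_path)
    if fd is not None:
        bump(counter)  # increment only inside the serialization window
    return fd


def ordering_demo():
    """Return the counters seen while another run still holds the lock."""
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "app.lock")
        holder = try_app_lock(lock_path)  # first run owns the domain
        record_first, lock_first = {"deploys": 0}, {"deploys": 0}
        fd_a = deploy_record_first(record_first, lock_path)
        fd_b = deploy_lock_first(lock_first, lock_path)
        os.close(holder)
        return record_first["deploys"], lock_first["deploys"], fd_a, fd_b
```

With the record-first ordering the counter has already moved while the lock is still held by the other run; with lock-first it has not.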

Observed (live, builds 279 + 281, immich PR#2, same domain immi-ad3e33, 2026-06-10T05:04Z):

  • Lock serialization itself WORKS: 281 logged == app lock: ... in flight — waiting == at 2s, then == app lock: acquired == at 194s — exactly when 279 exited (279 finished 05:07:35).
  • 279 RED: !! deploy-count 2 != 1 (DG4.1 violation). The 2 = 281's pre-lock _record_deploy (fired ~2s, before 281 blocked) polluting the shared counter 279 was actively using.
  • 281 RED: FileNotFoundError: /tmp/ccci-deploys-immi-ad3e33... at run_recipe_ci.py:1213 — 279's end-of-run os.remove(countfile) (line 1215) deleted the shared file out from under 281, whose single _record_deploy had already fired at 2s and never recreates it.
  • Control: isolated immich (build 275, same fixed wrapper) → deploy-count = 1, GREEN. So this is concurrency-specific, not a pre-existing immich/wrapper issue.
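
The two failures observed above follow from one interleaving on a single shared file, which can be replayed deterministically. This is a sketch under assumptions: the on-disk counter format (one line per deploy), the helper names and the example domain are hypothetical, and the steps run sequentially in the order the logs imply rather than in two real processes.

```python
import os
import tempfile


def record_deploy(countfile):
    """Append one line per deploy (assumed format, for illustration only)."""
    with open(countfile, "a") as fh:
        fh.write("deploy\n")


def read_count(countfile):
    """Count recorded deploys; raises FileNotFoundError if the file is gone."""
    with open(countfile) as fh:
        return sum(1 for _ in fh)


def replay_shared_countfile():
    """Replay the same-domain interleaving against one domain-keyed file."""
    events = []
    with tempfile.TemporaryDirectory() as tmp:
        countfile = os.path.join(tmp, "ccci-deploys-example-domain")
        record_deploy(countfile)  # first run deploys while holding the lock
        record_deploy(countfile)  # second run records pre-lock, then blocks
        events.append(("first", read_count(countfile)))  # sees both records
        os.remove(countfile)  # first run's end-of-run cleanup
        try:
            events.append(("second", read_count(countfile)))
        except FileNotFoundError:
            events.append(("second", "FileNotFoundError"))
    return events
```

The first run reads a count of 2 although it deployed once, and the second run finds no file at all, matching the two RED results.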

Repro: two !testme comments on the same recipe PR (same domain) in quick succession on the deployed main harness → both builds RED (one DG4.1 false-violation, one FileNotFoundError).

Fix direction (Builder owns): key the deploy-counter per RUN, not per domain — e.g. put it in /var/lib/cc-ci-runs/<build>/ (alongside the per-run artifacts) or include the build/run id in the filename, and export that path via CCCI_DEPLOY_COUNT_FILE. Per-run keying fixes BOTH defects at once (no cross-run pollution; no shared remove). Moving _record_deploy() after acquire_app_lock alone is INSUFFICIENT — the shared os.remove/FileNotFoundError collision survives. Add a tests/concurrency case: two same-domain runs serialized on the app lock → each sees its own deploy-count, neither removes the other's file (this is the gap vs the 19 planned cases — case 4 serialises acquire but never asserts deploy-count isolation across the two).
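
A minimal sketch of the per-run keying and of the proposed isolation case might look like the following. Everything except the CCCI_DEPLOY_COUNT_FILE variable name is an assumption: `per_run_path`, `Run`, the build ids and the one-line-per-deploy file format are hypothetical, and a temporary directory stands in for the per-run root.

```python
import os
import tempfile


def per_run_path(runs_root, build_id, name):
    """Place a state file under a directory private to one build."""
    run_dir = os.path.join(runs_root, str(build_id))
    os.makedirs(run_dir, exist_ok=True)
    return os.path.join(run_dir, name)


class Run:
    """One CI run with its own deploy-count file (illustrative only)."""

    def __init__(self, runs_root, build_id):
        self.countfile = per_run_path(runs_root, build_id, "deploy-count")
        # Environment a child process would receive to locate the file.
        self.env = {"CCCI_DEPLOY_COUNT_FILE": self.countfile}

    def record_deploy(self):
        with open(self.countfile, "a") as fh:
            fh.write("deploy\n")

    def finish(self):
        """Read this run's count, then remove only this run's file."""
        with open(self.countfile) as fh:
            count = sum(1 for _ in fh)
        os.remove(self.countfile)
        return count


def two_same_domain_runs():
    """Interleave two runs as in the finding; each should count 1."""
    with tempfile.TemporaryDirectory() as runs_root:
        first = Run(runs_root, "build-a")
        second = Run(runs_root, "build-b")
        first.record_deploy()
        second.record_deploy()  # early record no longer touches first's file
        first_count = first.finish()  # removes only first's own file
        second_count = second.finish()  # second's file is still present
        return first_count, second_count, first.countfile != second.countfile
```

The same interleaving that produced a false 2 and a missing file with domain keying yields 1 and 1 here, because neither run can write to or remove the other's file.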

Closure: adversary-owned. Re-test the (c) double-!testme live (both GREEN, visible block line, zero leakage) + the new unit case before this clears. Only I close it.

CLOSED @2026-06-10T09:0xZ: fix b6e12ef (run-keyed state files via _run_state_path), merged 139e319. Verified by me:

  • (a) code cold-verified and mutation-proven (reverting to domain-keying fails all 3 test_run_state cases);
  • (b) suites green cold (unit 138, concurrency 23);
  • (c) LIVE re-run, builds 290+291 (same immich domain immi-ad3e33), BOTH SUCCESS: 291 logged the block line (in flight — waiting, then acquired), both read deploy-count = 1 (290 no longer false-2; 291 no longer FileNotFoundError), zero leakage afterwards (0 procs / 0 apps / 0 services / 0 volumes / 0 secrets / no held locks).

Full evidence in REVIEW-conc M2(c) PASS.