Commit Graph

857 Commits

Author SHA1 Message Date
8cd72fd78d feat(harness): P2 — delete legacy customization keys & paths (rcust)
All checks were successful
continuous-integration/drone/push Build is passing
a) compose.ccci.yml is FIRST-CLASS: the harness auto-copies tests/<recipe>/
   compose.ccci.yml into the run's recipe checkout (ABRA_DIR-aware, lifecycle.
   provide_ccci_overlay) and auto-chaoses the pinned base deploy on its presence
   (kills the R7 implicit coupling). ghost/discourse install_steps.sh (copy-only
   boilerplate) deleted; CHAOS_BASE_DEPLOY removed from both metas + the registry.

b) install-time deps wiring is the ONLY mode: deps with DEPS provision BEFORE the
   single deploy; legacy post-deploy provisioning + the setup_custom_tests.sh
   invocation machinery deleted. lasuite-docs migrated to install_steps.sh OIDC
   wiring (same env names/values as the old hook — only the timing moved);
   lasuite-drive's remaining post-deploy MinIO bucket one-shot moved to ops.py
   pre_install; both setup_custom_tests.sh files deleted; OIDC_AT_INSTALL removed
   from drive/meet metas + the registry.

c) SKIP_GENERIC meta key deleted (zero users). Env form CCCI_SKIP_GENERIC* stays
   as the documented dev-only escape hatch; when active in a drone CI run the
   orchestrator prints a loud !! warning (manifest embedding lands in P5).

d) conftest cleanup: dead pre-deploy-once fixtures deployed/deployed_app deleted
   (zero users), app_domain + _short + _wait_healthy dropped (only users were the
   deleted fixtures); deps_apps+deps_creds consolidated into ONE deps fixture
   (entries expose .domain etc. as attributes; dict access intact); the 6 lasuite
   test files renamed deps_creds->deps (fixture name only — assertions and flows
   byte-identical). requires_deps marker + F2-11 skip-report plumbing unchanged.

Registry is now exactly the 14 final keys; docs §4 table regenerated. Stale
setup_custom_tests/OIDC_AT_INSTALL prose in docstrings/comments/assert MESSAGES
updated (no assert logic or expected value touched).

Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 175 passed; scripts/lint.sh -> PASS.
2026-06-10 17:01:33 +00:00
472a68b32c feat(harness): P1 — single registry-backed meta loader (rcust)
All checks were successful
continuous-integration/drone/push Build is passing
One loader: runner/harness/meta.py::load(recipe) -> RecipeMeta (frozen dataclass,
attribute access), backed by the declarative KEYS registry (14 final keys + 3
P2-deprecated). The ONLY exec() of tests/<recipe>/recipe_meta.py. Validation per
the locked decision: unknown ALL-CAPS top-level name or type mismatch = MetaError
(hard error at load); underscore-prefixed names recipe-private; callables only on
hook-typed keys.
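
A minimal sketch of the validation rule described above, not the real `meta.py`. The three key names come from this log, but their types and defaults, the `load_source` helper, and the reduced three-entry registry are assumptions made for illustration:

```python
import copy
from dataclasses import dataclass
from typing import Any

class MetaError(Exception):
    """Raised when a recipe_meta source fails registry validation."""

# Hypothetical registry: key name -> (expected type, default).
# The real KEYS registry is larger; types and defaults here are guesses.
KEYS: dict[str, tuple[type, Any]] = {
    "DEPS": (list, []),
    "SCREENSHOT": (dict, {}),
    "UPGRADE_BASE_VERSION": (str, ""),
}

@dataclass(frozen=True)
class RecipeMeta:
    DEPS: list
    SCREENSHOT: dict
    UPGRADE_BASE_VERSION: str

def load_source(source: str) -> RecipeMeta:
    """Execute a meta source once and validate its top-level names."""
    namespace: dict[str, Any] = {}
    exec(source, namespace)  # the single exec() of the meta file
    values = {name: copy.copy(default) for name, (_, default) in KEYS.items()}
    for name, value in namespace.items():
        if name.startswith("_"):
            continue  # recipe-private (this also skips __builtins__)
        if not name.isupper():
            continue  # only ALL-CAPS names are treated as meta keys
        if name not in KEYS:
            raise MetaError(f"unknown meta key: {name}")
        expected, _ = KEYS[name]
        if not isinstance(value, expected):
            raise MetaError(f"{name}: expected {expected.__name__}")
        values[name] = value
    return RecipeMeta(**values)
```

With this shape a typo such as `DEPZ = []` fails at load time instead of being silently ignored, while `_WELCOME_TEXT_MARKER`-style names pass through untouched.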

Migrated all six legacy loaders (spec §4 L1–L6):
- run_recipe_ci.py::_load_meta deleted; orchestrator loads once, passes meta down
- tests/conftest.py::_recipe_meta deleted; meta fixture returns full RecipeMeta (R3)
- lifecycle.py::_recipe_extra_env/_recipe_meta_flag deleted; deploy_app takes meta
- deps.py::declared_deps deleted; callers read meta.DEPS
- canonical.py::is_enrolled reads through meta.load()
- screenshot.py now actually receives SCREENSHOT through the orchestrator path (R2
  fix; proven by unit test through the real load path)

Mumble private constants underscore-prefixed (_WELCOME_TEXT_MARKER/_MAX_USERS) +
importers fixed. New tests/unit/test_meta.py (all-recipes-load-clean typo gate,
MetaError cases, spec §2 baseline defaults, underscore exemption, doc sync). Docs
§4 key table now GENERATED from the registry (scripts/gen-meta-docs.py); drift
fails CI.

Verified on cc-ci: cc-ci-run -m pytest tests/unit -q -> 175 passed; scripts/lint.sh -> PASS.
2026-06-10 16:46:58 +00:00
49fb818c60 status(rcust): bootstrap phase state files — P1 starting on branch restructure/recipe-custom
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 16:29:44 +00:00
12318582aa review(rcust): seed Adversary ledger — phase start, awaiting M1 claim
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 16:29:26 +00:00
76a4b6b3fa docs: recipe-customization review spec — full settings reference + restructuring candidates
All checks were successful
continuous-integration/drone/push Build is passing
Complete reference for per-recipe CI customization: all 18 recipe_meta keys
(incl. the base pin UPGRADE_BASE_VERSION), the six divergent meta loaders,
every hook file (test_<op>.py overlays, ops.py, install_steps.sh,
setup_custom_tests.sh, compose.ccci.yml), env contracts, and §8 known
limitations / restructuring candidates (R1 loader drift, R2 dead SCREENSHOT
knob, R6 silent-typo hazard, ...). Written for operator review ahead of a
possible restructure.
2026-06-10 15:55:34 +00:00
6060086c01 status(conc): ## DONE — M1+M2 both Adversary-PASS, no open veto; phase complete
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:56:02 +00:00
9987fba4b6 review(conc): M2 PASS — merged + live-verified (a)-(d) on final main 139e319; M1+M2 both fresh PASS, no open veto — DONE unblocked
All checks were successful
continuous-integration/drone/push Build is passing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:55:19 +00:00
74ed24053d claim(conc): M2 — merged + live-verified (a)-(d) on final main 139e319; (a) re-run build 295 clean; awaiting Adversary
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:52:48 +00:00
2894778810 review(conc): M2(c) PASS — double-!testme both GREEN on CONC-A1-fixed harness; VETO LIFTED, CONC-A1 closed
All checks were successful
continuous-integration/drone/push Build is passing
Builds 290+291 (same immich domain) both success: 291 logged block line + acquired,
both deploy-count=1 (290 no false-2, 291 no FileNotFoundError), zero leakage.
Serialization also observed live in lslocks. CONC-A1 conditions 1-3 met; veto lifted.
Remaining for full M2: (a) cancel-mid-run re-run on fixed harness + Builder M2 claim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:51:46 +00:00
536a3595b9 journal(conc): M2(c) PASS round 2 — 290+291 both green, block line visible, zero leakage; (a) re-run triggered
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:50:26 +00:00
0684576d74 chore(conc): consume BUILDER-INBOX (ML-flake context on (c) round-2; concur — will re-trigger (c) clean after 290/291 terminal)
Some checks reported errors
continuous-integration/drone/push Build is passing
continuous-integration/drone Build was killed
2026-06-10 08:45:14 +00:00
fa9a89bcf8 review(conc): live (c) round-2 — serialization confirmed via lslocks; delay is immich-ML healthcheck flake, not the restructure; veto unchanged
All checks were successful
continuous-integration/drone/push Build is passing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:44:30 +00:00
374371966f journal(conc): (b)+(d) PASS on CONC-A1-fixed main (287/288 parallel green, zero leakage); (c) round 2 triggered
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:22:40 +00:00
b1bca1a745 chore(conc): CONC-A1 fix code-verified (veto conditions 1+2 met, mutation-proven); 3+4 pending live (c) re-run
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:19:37 +00:00
4f6c9554b7 inbox(adversary): consumed CONC-A1-fixed message from Builder
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:17:16 +00:00
96ba67a63f inbox(adversary): CONC-A1 fixed b6e12ef/139e319 — run-keyed state files + regression test; re-running M2 live checks
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:16:43 +00:00
139e319d7e Merge branch 'restructure/concurrency': fix(harness) CONC-A1 run-keyed state files (M2(c) live-verify finding)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 08:16:18 +00:00
b6e12ef428 fix(harness): run-keyed run-scoped state files — CONC-A1 (same-domain runs corrupted shared deploy-count)
All checks were successful
continuous-integration/drone/push Build is passing
The four CCCI state files (deploys countfile, opstate, deps, depskip) were keyed
by app domain in shared /tmp. A second run of the same domain executes its main()
preamble + deploy_app's pre-lock _record_deploy BEFORE blocking at the app lock,
so it reset/polluted the live first run's counter (false DG4.1 deploy-count=2,
build 279) and the first run's end-of-run os.remove crashed the second
(FileNotFoundError, build 281). Masked pre-restructure by the end-to-end recipe
flock. Now keyed by run id + harness pid via _run_state_path(); children receive
exact paths via the CCCI_*_FILE env vars, so domain keying was never load-bearing.

tests/concurrency/test_run_state.py: path-invariant cases + a real-process
regression (helpers.py deploy-count-run) reproducing the live interleaving —
verified to FAIL under simulated shared keying. docs/concurrency.md §3 updated.
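
The keying change can be sketched as follows. This is an illustration under stated assumptions rather than the harness code: `run_state_path`, `record_deploy`, `child_env`, the file-name layout and the concrete `CCCI_DEPLOYS_FILE` variable name are all hypothetical; only the idea (key by run id + harness pid, hand children the exact path via env) comes from the commit message.

```python
import os
import tempfile
from typing import Optional

def run_state_path(kind: str, run_id: str,
                   pid: Optional[int] = None,
                   base: Optional[str] = None) -> str:
    """Build a state-file path keyed by run id + pid, never by app domain."""
    pid = os.getpid() if pid is None else pid
    base = tempfile.gettempdir() if base is None else base
    return os.path.join(base, f"ccci-{kind}-{run_id}-{pid}")

def record_deploy(countfile: str) -> int:
    """Increment the deploy counter stored in countfile and return it."""
    try:
        with open(countfile) as fh:
            count = int(fh.read().strip() or 0)
    except FileNotFoundError:
        count = 0
    count += 1
    with open(countfile, "w") as fh:
        fh.write(str(count))
    return count

def child_env(countfile: str) -> dict:
    """Children get the exact path, so they never re-derive it from the domain."""
    env = dict(os.environ)
    env["CCCI_DEPLOYS_FILE"] = countfile  # hypothetical variable name
    return env
```

Two runs against the same domain now resolve to different files, so a second run's pre-lock bookkeeping cannot reset the first run's counter, and the first run's cleanup cannot delete a file the second run still needs.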
2026-06-10 08:16:09 +00:00
2173894f07 review(conc): M2(c) FAIL — double-!testme same domain corrupts shared deploy-count file (CONC-A1) + VETO
All checks were successful
continuous-integration/drone/push Build is passing
Builds 279+281 (immich#2, same domain immi-ad3e33) both RED: 279 false DG4.1
'deploy-count 2!=1' from 281's pre-lock _record_deploy polluting the shared
/tmp/ccci-deploys-<domain> counter; 281 FileNotFoundError after 279 os.remove'd it.
Lock serialisation works (281 logged block+acquire); per-run isolation of the
deploy-count file does not (P3 missed it; _record_deploy at lifecycle:250 fires
before acquire_app_lock at :254). Control build 275 (isolated) green.
Veto DONE until counter keyed per-run + same-domain test + live (c) both-green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 08:11:07 +00:00
e392c73cbc journal(conc): M2(b)+(d) PASS evidence; (c) double-!testme triggered
Some checks failed
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is failing
2026-06-10 05:04:14 +00:00
3180ae1355 review(conc): wrapper exit-code fix verified safe (red still propagates) + correct my set -e pre-review miss; inbox consumed
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:58:27 +00:00
9d82a02026 journal(conc): M2(b) round-1 evidence + wrapper fix verification
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
2026-06-10 04:56:22 +00:00
bbc2bafbcb inbox(adversary): M2 wrapper exit-code fix e1c4198/b7a009c — context for M2 review
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
2026-06-10 04:55:07 +00:00
b7a009c1fc Merge branch 'restructure/concurrency': fix(ci) wrapper exit-code poisoning on green runs (M2 live-verify finding)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:54:51 +00:00
e1c4198c08 fix(ci): recipe-ci wrapper — capture harness rc, clear traps before exit (green runs no longer exit 1)
All checks were successful
continuous-integration/drone/push Build is passing
The drone exec runner's step shell is set -e. On a NORMAL harness exit the EXIT trap still
fired and its kill of the already-exited process group failed with ESRCH, poisoning the
script's exit status: build 269 (plausible#3) ran fully GREEN (all tiers pass, level=4) but
the step exited 1. Reproduced minimally with sh -e and bash -e on the host; the fixed wrapper
verified for all three paths: green rc=0, red rc=7 (propagated), TERM-to-shell -> child gets
TERM and wrapper exits 143. Cancel forwarding semantics unchanged.
2026-06-10 04:54:40 +00:00
56723ae0ec chore(conc): M2 merge-integrity pre-check — merged main == M1-verified tree (not a verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:49:55 +00:00
dfa5c8b9ee journal(conc): M2(a) cancel-mid-run PASS evidence; (b) parallel runs triggered
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:47:19 +00:00
bb5eb3d3aa Merge branch 'restructure/concurrency': concurrency restructure (P1-P5 + tests/concurrency)
Some checks failed
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is failing
M1 Adversary-verified PASS (REVIEW-conc.md @83a6c6e): lock-lifetime hardening (PDEATHSIG +
signal funnels + 60-min deadline + setsid/trap cancel forwarding), flock-probe janitor
(registry deleted), per-run ABRA_DIR (recipe flock deleted), single concurrency knob,
tests/concurrency real-kernel suite, docs/concurrency.md rewrite.
2026-06-10 04:40:00 +00:00
83a6c6e157 review(M1): PASS — branch @d3fe9e2 cold-verified (unit 138, conc 20, lint, 0 dangling refs, gate-integrity, independent flock probe)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:39:16 +00:00
8b9033f3d6 journal(conc): tests suite + P5 evidence, M1 claim context
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:34:19 +00:00
e8e52cf4c6 claim(conc): M1 CLAIMED — branch restructure/concurrency complete (P1-P5 + tests, tip d3fe9e2), awaiting Adversary
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:33:59 +00:00
d3fe9e26bb docs: P5 concurrency spec rewrite — one lock, one structural isolation, the invariant chain
All checks were successful
continuous-integration/drone/push Build is passing
Rewritten to the restructured model: lifetime-hardening guards (PDEATHSIG/SIGTERM/SIGALRM +
setsid/trap), per-run ABRA_DIR isolation (same-recipe runs now parallel), per-app-domain flock
(double-!testme serialisation), flock-probe janitor decision table (incl. the inode-identity
race rows), updated failure-mode table (cancel now tears down via the harness's own funnel;
reboot reaps immediately; 60-min deadline bounds everything), single-knob config table, how to
run tests/concurrency, fresh file/symbol index + deleted-symbol list for grep verification.
Also drops the last stale concurrency.limit mention from the .drone.yml header comment.
2026-06-10 04:32:54 +00:00
84d90fb655 test(concurrency): real-kernel suite for the restructured model — 20 tests, 19 plan cases
All checks were successful
continuous-integration/drone/push Build is passing
tests/concurrency/ — NOT in the default `pytest tests/unit` gate; run explicitly with
`pytest tests/concurrency -q`. flock/prctl/alarm are never mocked: helper subprocesses
(helpers.py) hold real locks and install the real lifetime guards; locks live in a per-test
tmp dir via CCCI_APP_LOCK_DIR; every helper (and recorded grandchild) is reaped by fixture
cleanup.

- test_locks.py (cases 1-4): SIGKILL auto-release; LOCK_NB held/unheld semantics; PEP 446
  fd-not-inherited (holder's child survives, lock still releases); same-domain second acquire
  blocks until first holder exits.
- test_janitor.py (cases 5-12): orphan reaped once + lockfile unlinked; live holder never
  reaped + logged; new-run acquire blocks until a slow reap completes (reap-under-probe-lock);
  two overlapping janitors -> exactly one reaps (flock arbitration); reboot sim (no lockfile)
  reaps immediately with no age wait; >120min-held lock flagged 'possible leaked run' and NOT
  stolen; warm/canonical names never probed (no lockfile even created); directory-as-lockfile
  and missing lock dir degrade to skip+log, never crash.
- test_lifetime.py (cases 13-16): PDEATHSIG (wrapper parent SIGKILL'd -> guarded child TERM'd,
  teardown marker, lock released); already-orphaned helper REFUSES to run (ppid race); 2s
  deadline alarm -> teardown + exit 142 + lock released; SIGTERM -> teardown + exit 143 +
  lock released.
- test_abra_dir.py (cases 17-19 + 18b): per-run dir built + $ABRA_DIR exported before the
  first abra call (recording stub abra on PATH); two CONCURRENT same-recipe fetch+checkout
  flows into different ABRA_DIRs -> divergent correct trees, canonical staged clone untouched;
  .env written through the servers/ symlink lands in the canonical path (env_get/env_set
  agree); manual runs get pid-suffixed dirs.

On cc-ci: pytest tests/concurrency -q -> 20 passed; tests/unit -> 138 passed; lint PASS.
2026-06-10 04:29:36 +00:00
c51692b57e chore(conc): pre-review P3+P4 — zero dangling refs, ABRA_DIR ordering clean (not a verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:28:41 +00:00
ffcf441364 journal(conc): P1-P4 evidence (live smokes on cc-ci) + pre-existing abra app ls FATA observation
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:21:17 +00:00
2080d734d3 status(conc): P1-P4 on branch (b492f99..91d3cc7), tests/concurrency next
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:20:20 +00:00
91d3cc7e99 chore(ci): P4 config cleanup — DRONE_RUNNER_CAPACITY is the single concurrency knob
All checks were successful
continuous-integration/drone/push Build is passing
Remove concurrency.limit from the recipe-ci pipeline (.drone.yml): it duplicated
DRONE_RUNNER_CAPACITY (nix/modules/drone-runner.nix maxTests) and the two had to be kept in
step by hand (docs/concurrency.md §8.6). maxTests comment updated to state it is the single
knob and to describe the new safety model.
2026-06-10 04:19:35 +00:00
f98b444559 decisions(conc): record P3 install_steps.sh ABRA_DIR path fix (guardrail justification)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:18:45 +00:00
17ebdf39ac feat(harness): P3 per-run ABRA_DIR — structural recipe-tree isolation, recipe flock deleted
All checks were successful
continuous-integration/drone/push Build is passing
- run_recipe_ci.setup_run_abra_dir(): builds <runs_dir>/<run-id>/abra with servers/ and
  catalogue/ symlinked to the canonical ~/.abra (app .env files keep landing in the shared
  canonical path, so janitor discovery and env-based teardown are unchanged; per-domain
  filenames + the P2 app-domain lock prevent write conflicts) and a FRESH empty recipes/ —
  each run clones + checks out its own recipe trees. Exported as $ABRA_DIR (honored by the
  abra CLI, verified on-host) before ANY abra call. Manual runs get manual-<pid> isolation.
- fetch_recipe(): plain clone into $ABRA_DIR/recipes/<recipe> — no shared-tree rm-rf, no lock.
  CCCI_SKIP_FETCH=1 now copies the canonically-staged clone into the per-run tree (same staging
  workflow, run reads staged state).
- abra.abra_dir()/recipe_dir(): single resolution rule ($ABRA_DIR else ~/.abra), used by
  recipe_checkout, has_lightweight_version_tags, recipe_head_commit, recipe_versions,
  generic._recipe_dir, lifecycle.prepull_images, snapshot_recipe_tests, and
  warm_reconcile._recipe_dir (which keeps the canonical default for its own systemd runs but
  follows the per-run tree when imported by promote_canonical inside a run).
- deleted: lifecycle.acquire_recipe_lock, RECIPE_LOCK_DIR, the main() call site and the
  must-lock-before-fetch ordering rule.
- tests/{ghost,discourse}/install_steps.sh: RECIPE_DIR resolves ${ABRA_DIR:-$HOME/.abra} so the
  compose.ccci.yml overlay lands in the tree the run actually deploys from (mechanical path fix
  required by per-run trees; no assertion/gate touched — see DECISIONS.md).
- .drone.yml comments updated (HOME=/root rationale now via the servers symlink).
2026-06-10 04:18:33 +00:00
08b629f52a chore(conc): pre-review P1+P2 — 4 break-it concerns tested + refuted (not a verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:16:41 +00:00
b302f3ab63 feat(harness): P2 flock-probe janitor — the kernel flock IS the liveness oracle
All checks were successful
continuous-integration/drone/push Build is passing
- acquire_app_lock(domain): exclusive flock on /run/lock/cc-ci-app-<domain>.lock, taken in
  deploy_app exactly where register_run_app was (BEFORE app creation); blocks with a log line
  when another run of the same domain is in flight (double-!testme serialisation). The file
  object is retained in module-level _held_app_locks so GC can never close the fd and silently
  release the lock. mtime is touched at acquisition (lock age for the long-held flag).
- janitor(): probes each candidate's lock (discovery unchanged: abra app ls + docker-service
  sweep vs RUN_APP_RE). Acquirable -> orphan -> teardown_app(verify=False) WHILE HOLDING the
  probe lock (a new same-domain run blocks until the reap finishes), then unlink before release.
  Held -> live run -> leave it; held >120min (2x hard deadline) -> warn, never steal. Stale
  unheld lockfiles with no app are unlinked on sight. Unreadable lockfile -> skip + log.
- unlink/recreate race guard (both sides): after ANY acquisition, verify the locked fd still is
  the inode the path names (fstat vs stat); a waiter that won a just-unlinked inode retries on
  the live path, and a probe that won one skips (unlinking now would hit a newer run's file).
- deleted: register_run_app, unregister_run_app, _run_owner_state, _registry_path,
  ACTIVE_RUN_DIR, CCCI_JANITOR_MAX_AGE + age fallback, _stack_age_seconds, pid-reuse guard.
  teardown_app no longer unregisters (release is process exit). janitor() takes no args now.
- post-reboot: /run/lock is tmpfs -> lockfiles gone -> probe trivially acquires -> immediate
  reap (improvement over the old 2h age fallback).
2026-06-10 04:11:31 +00:00
b492f995bd feat(harness): P1 lock-lifetime hardening — PDEATHSIG + SIGTERM/SIGALRM teardown funnel + 60-min hard deadline
All checks were successful
continuous-integration/drone/push Build is passing
- new harness/lifetime.py: install_lifetime_guards() arms PR_SET_PDEATHSIG(SIGTERM) (with
  post-prctl ppid==1 orphan refusal), a SIGTERM handler raising SystemExit through the run's
  finally: teardown funnel (exit 143), and signal.alarm(3600) funnelling SIGALRM the same way
  with a distinct deadline log line (exit 142). Re-entrant signals during teardown are logged
  and ignored (begin_teardown guard) so a second signal can't abort the running cleanup.
- run_recipe_ci.main(): guards installed first thing, before any abra call/lock; both teardown
  finally: blocks (cold + quick) mark begin_teardown().
- .drone.yml recipe-ci step: harness runs under setsid in its own process group; a trap forwards
  the step shell's TERM/EXIT to the whole group so drone cancel reaches the harness instead of
  leaking it (docs/concurrency.md §8.1).
- PEP 446 note on the recipe-lock open(): the fd is non-inheritable, children never carry it.
2026-06-10 04:04:28 +00:00
e350c94c3f chore(conc): record cold-verify environment (cc-ci-run pytest env, M1 plan)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:03:23 +00:00
45afccbef5 status(conc): bootstrap phase state files — P1 in flight on branch restructure/concurrency
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 04:00:12 +00:00
48d03d8405 chore(conc): seed REVIEW-conc.md — adversary online, baseline pre-read (no verdict)
All checks were successful
continuous-integration/drone/push Build is passing
2026-06-10 03:56:26 +00:00
5b65c6caa3 docs: concurrency spec — how parallel recipe runs stay safe (for review/restructuring)
All checks were successful
continuous-integration/drone/push Build is passing
Documents the capacity=2 concurrent-run system as landed in c0df77d,
68ef0f8, e6d55b5: config knobs, isolation model, per-recipe flock,
active-run registry + three-way janitor, convergence interactions,
failure-mode guarantees, and known limitations / restructuring
candidates.
2026-06-10 03:05:20 +00:00
157d06dc77 Merge pull request 'test(plausible): psql -q in _register_site — -t does not suppress command tags' (#9) from test/plausible-psql-quiet into main
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
2026-06-09 23:12:37 +00:00
e6d55b53c7 fix(harness): a paused swarm update is settled — only active states block convergence
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
68ef0f8 made services_converged() require UpdateStatus settled, treating
'paused' as in flight. But swarm's default update-failure-action pauses the
update on a single task flicker and the flag persists FOREVER (until the next
update): immich CI 241 had the app service 'paused' from a restart during
restore while the service was back at 1/1 and healthy — every subsequent wait
hung to its deadline and the run had to be killed.

Only 'updating' and 'rollback_started' now block convergence: those are the
states swarm is actively driving (the 238 stop-first race lives in 'updating').
'paused'/'rollback_paused' make no progress without intervention, so waiting on
them is pointless — N/N replicas is already required, and the HTTP-health and
tier assertions still gate whether the app actually works.
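
As a sketch of the rule above (the input shape is an assumption: each service is a dict with hypothetical `running`, `desired` and `update_state` fields, not the harness's real data model):

```python
# States swarm is actively driving; only these block convergence.
BLOCKING_UPDATE_STATES = frozenset({"updating", "rollback_started"})

def services_converged(services: list) -> bool:
    """True when every service is at N/N replicas and no update is in flight."""
    for svc in services:
        if svc["running"] != svc["desired"]:
            return False
        if svc.get("update_state", "") in BLOCKING_UPDATE_STATES:
            return False
    return True
```

Under this predicate a service left `paused` at 1/1 counts as converged, while one still `updating` at 1/1 does not, which covers the stop-first race from build 238.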

lint: PASS, unit tests: 138 passed.
2026-06-09 23:07:36 +00:00
79c652ddd3 test(plausible): psql -q in _register_site — -t does not suppress command tags
All checks were successful
continuous-integration/drone/push Build is passing
psql -tAc still prints INSERT/CREATE command tags (e.g. "INSERT 0 1"), so
_register_site asserted out == site against "INSERT 0 1\nsite" and both
event-tracking roundtrip tests failed on their very first run (build 237 —
the custom tier had never executed before; install always failed earlier).
-q suppresses the tags; verified against the recipe db container.
2026-06-09 22:50:55 +00:00
68ef0f84fb fix(harness): convergence must span stop-first rolling updates (immich 238 backup 409)
Some checks reported errors
continuous-integration/drone/push Build is passing
continuous-integration/drone Build was killed
services_converged() accepted N/N replicas as converged — but a chaos redeploy
that changes a non-app service image (immich PR #2 moves the db to the
vectorchord pin) registers a stop-first rolling update that swarm may not have
STARTED yet: the OLD task still shows 1/1, the wait passes, and the task dies
seconds later. Build 238: backupbot resolved the db hook container, the task
was killed in the gap, and the pre-hook exec crashed the whole backup with a
409 -> no dump in the snapshot -> restore had nothing -> RED.

- services_converged() now also requires every service's swarm UpdateStatus to
  be settled ('', completed, rollback_completed) — updating/paused/rollback in
  flight is NOT converged. Strictly stricter: no gate is weakened.
- backup_app() gains a bounded (300s) settle-wait before 'abra app backup
  create' as defence in depth; on timeout the backup still runs and the tier's
  assertion delivers the verdict.
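
The bounded settle-wait can be pictured as a plain polling loop. This is a hedged sketch, not `backup_app()` itself: `wait_settled` and its injectable `clock`/`sleep` parameters are hypothetical, and the caller is assumed to proceed with the backup whether the wait returns `True` or `False`.

```python
import time
from typing import Callable

def wait_settled(is_settled: Callable[[], bool],
                 timeout_s: float = 300.0,
                 interval_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until is_settled() is true or the timeout passes.

    Returns True if settled in time, False on timeout. It never raises on
    timeout: the caller still runs the backup and the tier's own assertion
    decides the verdict.
    """
    deadline = clock() + timeout_s
    while True:
        if is_settled():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Returning a boolean instead of raising keeps the wait a defence-in-depth step rather than a new gate.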

lint: PASS, unit tests: 138 passed.
2026-06-09 22:10:55 +00:00