Root cause of the 2 failing custom tests: TLS_FLAVOR=notls → dovecot refuses plaintext auth over
network 143, so host-side IMAP login/auth isn't a meaningful signal. Smoke2 PROVED the in-container
path: sendmail (postfix container) local-injects a marker mail → doveadm search (imap container) finds
it in INBOX. test_mail_flow now exercises the real postfix→rspamd→dovecot deliver/store/fetch via
exec_in_app(service=smtp/imap). Dropped test_imap_login (network plaintext-auth disallowed under notls).
test_mailbox (create+config-export read-back) unchanged. PARITY.md updated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Diagnostic (RECIPE=mumble STAGES=install,backup,restore,custom, no upgrade) PROVED backup+restore green
on a stable 1.0.0 deploy incl. ci_marker survival (P4). The full-run backup 409 ('container not
running') was the chaos UPGRADE redeploy: host-mode 64738 must be released by the old task + rebound by
the new, and HEALTH_PATH '/' only proves the mumble-web sidecar (not the voice server), so wait_healthy
passed while the app churned → backup-bot execed a not-running container. Fix: extend
lifecycle.wait_ready_probes to support a TCP probe ({tcp_host,tcp_port,stable=N consecutive connects});
mumble recipe_meta READY_PROBE returns 64738 (stable=3) so the harness waits for the voice server up
after install AND upgrade before backup.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PRAGMA busy_timeout=N emits its own result row, polluting the read-back parse (seed read back
'20000\nupgrade-survives' → AssertionError 'seed did not commit', failing upgrade/backup/restore ops
— though the INSERT actually committed). Switch _sqlite to 'sqlite3 -cmd ".timeout 20000"' which sets
the busy timeout silently. install+custom already green (handshake/welcome/web/tcp PASS); this fixes
the P4 lifecycle ops.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mumble's pinned base deploy (prev version 0.2.0) FATAs 'has locally unstaged changes' because
install_steps provides an untracked compose.host-ports.yml. New recipe_meta CHAOS_BASE_DEPLOY=True +
lifecycle._recipe_meta_flag + deploy_app branch -> base uses chaos (skips clean-tree/lint, deploys the
checked-out pinned version, not LATEST), mirroring the lightweight-tag chaos-base path. DECISIONS.md
records the full mumble enrollment design.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The upstream compose.host-ports.yml exists only from v1.0.0+, but the upgrade-tier base deploy is
the previous published version (0.2.0+), which predates it — so EXTRA_ENV's COMPOSE_FILE failed to
resolve on the base deploy (config --images rc=14, deploy FATA). install_steps.sh now copies a
cc-ci-owned identical overlay into the recipe checkout when absent, so 64738 is host-published for
every version (base + upgrade) and on-host protocol tests reach 127.0.0.1:64738.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
v1 failed wait_healthy 'not healthy / (last status 500)': plausible's app starts before clickhouse
(plausible_events_db) is ready (recipe depends_on names events_db, mismatched → no swarm ordering) and
returns 500 until DB migrations finish (several min on cold deploy). It serves 302 once ready; widen
the health window.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lifecycle.prepull_images(recipe, domain): resolve images via docker compose config --images (COMPOSE_FILE
from the app .env — handles $VERSION interpolation + multi-compose) → docker pull each, skip-if-present
(zero network for cached pinned tags). Called in deploy_app before the (unchanged, real) abra.deploy AND
in generic.perform_upgrade before the chaos redeploy (warms new-version images). A pull failure RAISES a
clear pre-deploy error (not a converge timeout); deploy path unchanged (no docker service update/scale).
Removes PULL time not app-INIT time. 4 unit tests (tests/unit/test_prepull.py): present→skip, missing→
pull, pull-fail→raise, no-images→skip. NOT claimed yet — validating cold-verify criteria next.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adversary cold-verify of F2-9 FAILED: the read-back's CKEditor-frame-attach wait timed out on a fresh
cold context (flaky, not 3x-reliable). Fix: read-back now polls EVERY frame's body text for the marker
(don't require the specific ckeditor-inner frame to attach — that's the flaky part) with a generous
~240s deadline + periodic reloads to unstick cold loads. The marker appearing in a fresh context still
proves server-side E2E-encrypted persistence (only URL+fragment key carried over). Also bumped the
session-1 post-type sync wait 9s→12s. F2-13 Adversary-owned; will validate cold before it closes F2-9.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Full suite #5: install/upgrade/backup/restore + OIDC + create-room/read-back/LiveKit-token ALL pass
(R014 chaos-base fix validated: upgrade crossover real 0.2.0→0.3.0). Only the final 404-after-DELETE
assert failed — meet 0.3.0+v1.16.0 soft/async-deletes (DELETE 2xx, re-GET still 200). The §4.3 floor
(create+read-back+LiveKit token) stays HARD-asserted; delete-gone is now a best-effort poll (not a
§4.3 requirement). PARITY.md noted.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mirrors lasuite-drive machinery (sibling La Suite recipe): install_steps.sh wires OIDC at install
(client_id from deps, scopes 'openid email'); ops.py + test_{install,upgrade,backup,restore}.py
lifecycle overlays (postgres meet/meet ci_marker data-integrity); functional/test_health_check.py
(parity) + test_oidc_with_keycloak.py (password-grant JWT vs dep keycloak, realm lasuite-meet-<6hex>).
§4.3 meeting_flow + webrtc specifics next (after install+OIDC validated). No setup_custom_tests.sh
(no post-deploy step — OIDC at install, no minio/collabora).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Full-suite custom-tier run showed the pad #/2/pad/edit fragment didn't appear within 80s on a fresh
cold deploy (passed on the warm probe). Bump _open_pad hash-wait to ~240s + one mid-way reload.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Proactively addresses the Adversary's pre-claim recon (f7c5681): since the F2-12 fix replaces abra's
converge monitor (-c) with the harness's own wait, prove the replacement genuinely FAILS a broken
convergence (non-vacuous), not just passes a slow one. 5 deterministic tests (fake clock, no deploy):
- wait_ready_probes RAISES TimeoutError when the READY_PROBE never returns 200 (collabora wedged).
- wait_ready_probes returns when it reaches 200; no-op without a READY_PROBE.
- wait_healthy RAISES when services never converge, and when converged-but-never-serving.
Run: cc-ci-run -m pytest tests/unit/test_f212_upgrade_convergence.py -q → 5 passed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adversary cold-verify FAILed Q3.2 (F2-12): the prev→PR-head chaos upgrade's abra converge monitor
FATAs while the NEW collabora 25.04.9.4.1's healthcheck is still in start_period (jail/config init),
even though it converges given swarm's healthcheck retries. My WOPI pre-gate fixed the OLD collabora
being killed mid-boot but not the NEW collabora's convergence. Flaky (3x green for me, 1x fail cold).
Fix (cc-ci-side, stronger verification — not weaker):
- abra.deploy gains no_converge_checks (`-c`); chaos_redeploy passes it for the upgrade op so abra's
impatient monitor no longer FATAs (the stack spec is applied regardless).
- perform_upgrade now OWNS the convergence verification after the redeploy: wait_healthy (services
N/N + app HEALTH_PATH) + new lifecycle.wait_ready_probes (recipe READY_PROBE), bounded by the
recipe DEPLOY_TIMEOUT (generous) not abra's impatient window. meta threaded _perform_op→perform_upgrade.
- recipe_meta READY_PROBE hook (added to _load_meta whitelist): lasuite-drive probes collabora WOPI
discovery (/hosting/discovery on collabora-<domain>) → 200. Called after install deploy AND after
the upgrade redeploy. No-op for recipes without a READY_PROBE.
NOT re-claiming yet — validating the upgrade tier is now reliably green (incl. the slow-collabora
crossover) across multiple runs before re-claiming Q3.2. F2-12 stays open (Adversary-owned).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds tests/cryptpad/playwright/test_pad_content_roundtrip.py: open /pad/ → CryptPad auto-creates a
fragment-keyed pad → type a unique marker into the CKEditor body → wait for encrypted sync → open a
FRESH browser context (no shared localStorage/cookies) → navigate to the captured pad URL → assert
the marker survives in the re-decrypted body. Proves genuine end-to-end-encrypted server-side
persistence (the fresh session carries only the URL+fragment key), the §4.3 create-and-read-back
floor F2-9 requires — not a health/SPA stand-in.
Empirically mapped against CryptPad 2026.2.0 (the prior deferral cited version-fragility on 5.7.0):
editor is the deep nested frame …/pad/ckeditor-inner.html; ~15s cold-cache LESS-compile init; the
fragment-keyed pad URL DOES appear after init; transient net::ERR_NETWORK_CHANGED handled by the
shared goto_with_retry + a mid-load reload retry in the frame wait. PASSED against a live probe
instance. PARITY.md updated (roundtrip = the P3/§4.3 test; SPA-render test kept as fast liveness).
F2-9 is Adversary-owned — left for the Adversary to close after cold-verify.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Q3.2a run 1: Part A (install-time OIDC) GREEN — deploy-count=1, install/backup/restore/custom +
OIDC test all PASS. BUT upgrade tier FAILED: the in-place `abra app deploy --chaos` redeploy landed
on a STILL-BOOTING collabora (coolwsd ~2min boot: 1300+ l10n files + RSA keygen) and SIGTERMed it
mid-init ("Shutdown requested while starting up", forced exit 70) → abra aborted the deploy. The
install wait_healthy returns on container 1/1 while coolwsd is still loading. Fixes (plan §C
readiness-gating, no test weakened):
- tests/lasuite-drive/ops.py::pre_upgrade — wait for collabora WOPI discovery (/hosting/discovery
on collabora-<domain>) → 200 BEFORE the chaos redeploy, so it replaces a ready collabora cleanly.
- runner/harness/lifecycle.chaos_redeploy + generic.perform_upgrade + run_recipe_ci._perform_op —
plumb the recipe DEPLOY_TIMEOUT to the upgrade chaos redeploy (was abra.deploy's 900s default,
while the .env internal TIMEOUT is 1500s → Python could SIGKILL abra mid-wait on the slow
collabora/onlyoffice reconverge). Mirrors the install deploy_app timeout plumbing.
Also (operator naming change 2026-05-29): renamed `--extra-tests` -> `--extra` in DEFERRED.md +
BACKLOG-2.md Build-backlog section. 3 refs remain in BACKLOG-2 Adversary-findings section
(241/248/292, closed findings) — left for the Adversary (single-writer); orchestrator updated
IDEAS.md/plan-sso-dep-testing.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Q3.2a / plan-lasuite-drive-oidc-robustness.md Part A. The old setup_custom_tests.sh did a
post-deploy in-place `abra app deploy --force --chaos` of the heavy 12-service stack to apply
the OIDC env — flaky (collabora WOPI-discovery race + gunicorn-perms; JOURNAL Step 0). Since
the OIDC env only affects backend/app and keycloak is live-warm, provision the per-run realm
BEFORE the single deploy and wire OIDC into the .env at install time (no reconverge).
- runner/run_recipe_ci.py: new _provision_deps() helper (warm/cold split + SSO enrich + write
$CCCI_DEPS_FILE), used by both paths. New per-recipe OIDC_AT_INSTALL meta flag (added to
_load_meta whitelist). When set + deps live-warm: provision BEFORE deploy_app; the install
tier's install_steps.sh wires OIDC into the single deploy; post-deploy step runs only the
MinIO bucket one-shot — no re-provision, no redeploy. Legacy post-deploy path unchanged for
all other dep recipes (gated on `not oidc_at_install`).
- tests/lasuite-drive/install_steps.sh (NEW): install-time OIDC env + secret wiring; no-ops on
empty deps file (recipe still boots, OIDC test skips → F2-11 RED).
- tests/lasuite-drive/setup_custom_tests.sh: trimmed to MinIO-bucket-only (OIDC moved out).
- tests/lasuite-drive/recipe_meta.py: OIDC_AT_INSTALL = True.
- JOURNAL-2: Step-0 root-cause failure logs captured before the fix.
NOT a claim — validating 3x green (incl. now-required upgrade tier) before claiming Q3.2.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
canonical.enrolled_recipes; runner/nightly_sweep.py (roll keycloak+traefik →
serial full-cold over enrolled on latest → green promotes; skip if test active;
operate against CCCI_REPO checkout for tests/); nix/modules/nightly-sweep.nix
(timer 03:00 Persistent + oneshot service) wired in. 2 bugs fixed via live
service run (repo-relative enrolled scan; util-linux for backup PTY). Live
SERVICE sweep: enrolled=['custom-html'] → all tiers green → canonical advanced
1.10.0→1.11.0; red-run correctly does NOT promote. 71 unit pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
should_promote_canonical (enrolled+green+cold+latest) + promote_canonical
(re-seed canonical at green-verified latest, snapshot+registry, old known-good
replaced only on green). +5 unit (70 pass). Live: custom-html canonical advanced
1.10.0+1.28.0 → 1.11.0+1.29.0 via a full green cold run; snapshot refreshed; idle;
per-run app torn down. WC6 nightly sweep next.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bridge/bridge.py: parse_trigger(body) → (is_trigger, quick); accepts exactly
'!testme' (cold, default) and '!testme --quick' (opt-in fast lane), rejects
'!testmexyz'/'!testme foo'/etc. Threaded through both poll + webhook paths and
process_testme → trigger_build adds the CCCI_QUICK=1 Drone param (auto-exposed
to run_recipe_ci). PR comment labels a quick run lower-confidence. .drone.yml
echoes quick=. +3 unit tests (incl. the !testmexyz negative). 64 unit pass.
WC7: default !testme stays full cold; --quick opt-in, never gates merge.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
W1.2: enrolled custom-html (recipe_meta.WARM_CANONICAL); live proof ALL PASS
(seed canonical → idle-with-volume-retained → re-warm → marker survived).
WC2 (registry+data-warm model) + WC3 (snapshot+restore) proven. 61 unit pass.
custom-html now the first real data-warm canonical (idle).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bugs found by the live proof, fixed:
- warmsnap: snapshot now swaps a <recipe>/snapshot/ SUBDIR, not the whole
<recipe>/ dir — so the reconciler's sibling last_good file survives a
snapshot swap (was being clobbered).
- warm_reconcile: deploy_version captures abra's stdout (it writes FATA to
stdout) in the error; add wait_undeployed() after every undeploy so
snapshot/restore/redeploy don't race a half-removed swarm stack; the upgrade
deploy is wrapped so a deploy FAILURE (not just unhealthy) also triggers
rollback. (57 unit pass.)
LIVE PROOF on warm keycloak (annotated fake tags via CCCI_SKIP_FETCH):
(a) healthy upgrade 10.7.1->10.7.9: snapshot+deploy+health-pass, last_good
committed=10.7.9, marker realm preserved.
(b) MARQUEE rollback: broken latest 10.7.10 (lint-fail) -> rollback to 10.7.9,
HEALTHY, marker realm INTACT (data preserved through broken-upgrade+restore),
last_good NOT advanced, rollback alert written (attempted=10.7.10,
last_good=10.7.9, recovered=True). keycloak recovered to canonical
10.7.1+26.6.2 healthy.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sso.py: list_realms, delete_keycloak_realm (idempotent, refuses master),
realms_to_reap (pure, concurrency-safe predicate), reap_orphaned_realms.
The per-run realm is the isolation unit on a shared live-warm keycloak;
orphans (crashed runs) reaped by hex not mapping to a live app stack.
+8 unit tests (tests/unit/test_warm_realm.py); 43 unit pass on cc-ci.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In-flight Q3.2 iteration (NOT yet live-verified — needs a lasuite-drive deploy
once the warm keycloak from Phase 2w is available). Phase 2 paused here per
operator interjection of Phase 2w; state preserved.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a DEPS-declaring recipe's setup_custom_tests fails, its @requires_deps (SSO/OIDC)
tests skip; a skip-only pytest file exits 0 so the run previously reported overall=0
(GREEN) while the only SSO test never ran (violates P7). Fix preserves generic-tier
failure-isolation but corrects the green SIGNAL:
- conftest.pytest_collection_modifyitems counts skipped requires_deps tests and appends
to $CCCI_DEPS_SKIP_REPORT.
- run_recipe_ci: sums the count, surfaces it in RUN SUMMARY, and new pure predicate
sso_dep_unverified(declared, deps_ready, skipped) flips overall=1.
- 7 new unit tests (tests/unit/test_f211_sso_skip.py).
Verified deploy-free (rate-limit-independent): 35/35 unit PASS; cold real-test proof on
lasuite-docs test_oidc_with_keycloak.py -> 1 skipped + skip-report==1 -> orchestrator
would set overall=1. Full e2e deferred until Docker Hub rate limit lifts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per orchestrator's SSO-dep plan + the refactor in 41ede13, DEFERRED.md entry #5 (lasuite-docs
OIDC parity ports + create-a-doc) closes by execution.
- tests/lasuite-docs/functional/test_oidc_login.py: parity port of recipe-maintainer
oidc_login.py. Anonymous GET /api/v1.0/users/me/ → 302 to keycloak realm OR 401/403;
password-grant token → 200 with user.email matching the provisioned test user.
- tests/lasuite-docs/functional/test_create_doc.py: plan §4.3 prescribed create-an-object +
read-it-back. POST /api/v1.0/documents/ with OIDC Bearer → captured id; GET
/api/v1.0/documents/<id>/ → asserts id+title round-trip.
Both marked \@pytest.mark.requires_deps; skipped with 'deps-not-ready' if setup_custom_tests
fails (failure isolation per plan-sso-dep-testing.md §4).
Cold-verifiable: ssh cc-ci 'RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py'
install: 2 PASS; custom: 5 PASS incl. test_oidc_login_via_keycloak +
test_create_doc_and_read_back; deploy-count=2 (recipe + keycloak dep).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>