Commit Graph

9 Commits

Author SHA1 Message Date
e1147b5fe3 fix(2): F2-12 lasuite-drive upgrade tier — own convergence wait (abra -c) + collabora READY_PROBE
Adversary cold-verify FAILed Q3.2 (F2-12): the prev→PR-head chaos upgrade's abra converge monitor
FATAs while the NEW collabora 25.04.9.4.1's healthcheck is still in start_period (jail/config init),
even though it converges given swarm's healthcheck retries. My WOPI pre-gate fixed the OLD collabora
being killed mid-boot but not the NEW collabora's convergence. Flaky (3x green for me, 1x fail cold).

Fix (cc-ci-side, stronger verification — not weaker):
- abra.deploy gains no_converge_checks (`-c`); chaos_redeploy passes it for the upgrade op so abra's
  impatient monitor no longer FATAs (the stack spec is applied regardless).
- perform_upgrade now OWNS the convergence verification after the redeploy: wait_healthy (services
  N/N + app HEALTH_PATH) + new lifecycle.wait_ready_probes (recipe READY_PROBE), bounded by the
  recipe DEPLOY_TIMEOUT (generous) not abra's impatient window. meta threaded _perform_op→perform_upgrade.
- recipe_meta READY_PROBE hook (added to _load_meta whitelist): lasuite-drive probes collabora WOPI
  discovery (/hosting/discovery on collabora-<domain>) → 200. Called after install deploy AND after
  the upgrade redeploy. No-op for recipes without a READY_PROBE.

NOT re-claiming yet — validating the upgrade tier is now reliably green (incl. the slow-collabora
crossover) across multiple runs before re-claiming Q3.2. F2-12 stays open (Adversary-owned).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 11:55:53 +01:00
4b38b66fa5 fix(2): lasuite-drive Q3.2a — gate upgrade redeploy on collabora-ready + plumb DEPLOY_TIMEOUT
Q3.2a run 1: Part A (install-time OIDC) GREEN — deploy-count=1, install/backup/restore/custom +
OIDC test all PASS. BUT upgrade tier FAILED: the in-place `abra app deploy --chaos` redeploy landed
on a STILL-BOOTING collabora (coolwsd ~2min boot: 1300+ l10n files + RSA keygen) and SIGTERMed it
mid-init ("Shutdown requested while starting up", forced exit 70) → abra aborted the deploy. The
install wait_healthy returns on container 1/1 while coolwsd is still loading. Fixes (plan §C
readiness-gating, no test weakened):

- tests/lasuite-drive/ops.py::pre_upgrade — wait for collabora WOPI discovery (/hosting/discovery
  on collabora-<domain>) → 200 BEFORE the chaos redeploy, so it replaces a ready collabora cleanly.
- runner/harness/lifecycle.chaos_redeploy + generic.perform_upgrade + run_recipe_ci._perform_op —
  plumb the recipe DEPLOY_TIMEOUT to the upgrade chaos redeploy (was abra.deploy's 900s default,
  while the .env internal TIMEOUT is 1500s → Python could SIGKILL abra mid-wait on the slow
  collabora/onlyoffice reconverge). Mirrors the install deploy_app timeout plumbing.

Also (operator naming change 2026-05-29): renamed `--extra-tests` -> `--extra` in DEFERRED.md +
BACKLOG-2.md Build-backlog section. 3 refs remain in BACKLOG-2 Adversary-findings section
(241/248/292, closed findings) — left for the Adversary (single-writer); orchestrator updated
IDEAS.md/plan-sso-dep-testing.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 10:37:55 +01:00
a151489996 feat(2): lasuite-drive Q3.2a Part A — wire OIDC at INSTALL, eliminate flaky redeploy
Q3.2a / plan-lasuite-drive-oidc-robustness.md Part A. The old setup_custom_tests.sh did a
post-deploy in-place `abra app deploy --force --chaos` of the heavy 12-service stack to apply
the OIDC env — flaky (collabora WOPI-discovery race + gunicorn-perms; JOURNAL Step 0). Since
the OIDC env only affects backend/app and keycloak is live-warm, provision the per-run realm
BEFORE the single deploy and wire OIDC into the .env at install time (no reconverge).

- runner/run_recipe_ci.py: new _provision_deps() helper (warm/cold split + SSO enrich + write
  $CCCI_DEPS_FILE), used by both paths. New per-recipe OIDC_AT_INSTALL meta flag (added to
  _load_meta whitelist). When set + deps live-warm: provision BEFORE deploy_app; the install
  tier's install_steps.sh wires OIDC into the single deploy; post-deploy step runs only the
  MinIO bucket one-shot — no re-provision, no redeploy. Legacy post-deploy path unchanged for
  all other dep recipes (gated on `not oidc_at_install`).
- tests/lasuite-drive/install_steps.sh (NEW): install-time OIDC env + secret wiring; no-ops on
  empty deps file (recipe still boots, OIDC test skips → F2-11 RED).
- tests/lasuite-drive/setup_custom_tests.sh: trimmed to MinIO-bucket-only (OIDC moved out).
- tests/lasuite-drive/recipe_meta.py: OIDC_AT_INSTALL = True.
- JOURNAL-2: Step-0 root-cause failure logs captured before the fix.

NOT a claim — validating 3x green (incl. now-required upgrade tier) before claiming Q3.2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 10:10:05 +01:00
f1c626cc67 fix(2): lasuite-drive setup_custom_tests — docker service scale --detach for the run-once minio-createbuckets job (blocking scale hung the custom tier forever; --detach submits + returns, bucket-poll confirms) 2026-05-29 06:21:42 +01:00
1b8d26b504 feat(2w): W0.2 live-warm keycloak dep mode in orchestrator (WC1)
- runner/harness/warm.py: stable-domain scheme (warm-<recipe>), is_warm_up
  probe, live_app_hexes scan, per-run realm_for naming, reap_orphan_realms.
- run_recipe_ci.py: split declared deps into live-warm (shared provider +
  per-run realm, no deploy, realm deleted at teardown) vs cold (co-deploy).
  Warm path used only when provider is up; cold fallback otherwise. Reap
  orphan realms at run start (concurrency-safe). deploy-count excludes warm
  deps. Realm naming now per-run namespaced (<parent>-<6hex>).
- dependent tests assert the namespaced realm pattern (stronger than ==parent).

Live proof on warm keycloak: realm create -> password-grant JWT -> discovery
issuer -> delete(idempotent) -> reap(keeps live hex, deletes orphan): PASS.
43 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 23:26:02 +01:00
66e065dff5 feat(2): lasuite-drive setup creates MinIO bucket via createbuckets one-shot
In-flight Q3.2 iteration (NOT yet live-verified — needs a lasuite-drive deploy
once the warm keycloak from Phase 2w is available). Phase 2 paused here per
operator interjection of Phase 2w; state preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 23:08:15 +01:00
6557197858 feat(2): Q3.2 lasuite-drive SSO iteration — keycloak dep + OIDC test + MinIO storage round-trip
- recipe_meta: DEPS=[keycloak] enabled (base proven cold-green).
- setup_custom_tests.sh: wire OIDC env (explicit keycloak realm endpoints) + insert oidc_rpcs
  secret at bumped version + clear FranceConnect eidas1 acr + in-place redeploy (adapted from
  the proven lasuite-docs hook).
- functional/test_oidc_with_keycloak.py: SSO discovery + password grant + JWT claims vs dep
  keycloak realm 'lasuite-drive' (@requires_deps; F2-11 fails run on skip).
- functional/test_minio_storage.py: §4.3 specific — drive-media-storage bucket present + real
  upload->list->download round-trip via mc inside the minio container.
- PARITY.md: OIDC + MinIO rows landed; backup data-integrity (ci_marker) already real.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 22:28:35 +01:00
1138d77cbb blocked(2): Q3.2 drive base-deploy hits Docker Hub rate limit + Gitea outage
- recipe_meta: bump drive abra TIMEOUT 900->1500, DEPLOY_TIMEOUT 1200->1800 (12-svc
  stack w/ onlyoffice+collabora; cold pulls need a wide window).
- STATUS-2 ## Blocked: two Class-A1 external blocks documented w/ verify commands —
  (1) Docker Hub anon pull rate limit (registry-creds finding per plan §1.5; blocks all
  new deploys), (2) Gitea git.autonomic.zone 404 outage (coordination down; 2 watchdog
  pings unconsumable until recovery). JOURNAL-2: full disk->prune->rate-limit chain.
- Queued locally; push + Adversary-inbox processing deferred to Gitea recovery.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 20:48:52 +01:00
f59d8e6996 feat(2): Q3.2 lasuite-drive base enrollment + nested-subdomain + replicas:0 harness fixes
- harness: services_converged treats replicas:0 one-shot (minio-createbuckets) as
  converged (cur==want); removes the want==0 rejection that hung deploys. DECISIONS.md.
- recipe_meta.EXTRA_ENV flattens MINIO_DOMAIN/COLLABORA_DOMAIN to single-label wildcard
  siblings (the *.ci.commoninternet.net cert covers one label only). DECISIONS.md.
- lifecycle overlays (install/upgrade/backup/restore) + ops.py postgres ci_marker
  data-integrity (db user/name=drive). Parity health_check functional test. PARITY.md.
- DEPS=[keycloak] + OIDC/WOPI/upload functional tests deferred to the SSO iteration
  (probe-before-assert: prove the ~10-service base deploy converges first).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 19:54:31 +01:00