Commit Graph

45 Commits

Author SHA1 Message Date
fc6e35d617 feat(2): mattermost-lts create-message round-trip (§4.3 P3) — first-user→login→team→channel→post→read-back; harness http.post_with_headers (returns response headers, for mattermost login Token) 2026-05-29 08:31:37 +01:00
40b03a9bf1 claim(2w): WC8 + WC9 (FINAL gates) — resource-safety consolidation + stale-warm prune + docs/warm.md + --quick rollback proof
WC8: canonical.prune_stale (drop de-enrolled warm data + volumes) wired into the
nightly sweep + df log; consolidated evidence (DRONE_RUNNER_CAPACITY=MAX_TESTS
serialize; autoPrune drops --volumes so warm vols survive; cold teardown sacred;
warm excluded from D8 — no nix source ref). +1 unit (72 pass). WC9: docs/warm.md
documents the full warm/quick model; --quick rollback proof already proven live
(W2 FAIL restores exact known-good; WC4 PASS byte-identical snapshot). On PASS,
all WC1-WC9 (incl WC1.1/WC1.2) verified → DONE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:43:34 +01:00
465e1059b0 claim(2w): WC6 nightly full-cold sweep — timer+service roll warm/infra (health-gated) then serial cold sweep promoting canonicals (WC5); proven live
canonical.enrolled_recipes; runner/nightly_sweep.py (roll keycloak+traefik →
serial full-cold over enrolled on latest → green promotes; skip if test active;
operate against CCCI_REPO checkout for tests/); nix/modules/nightly-sweep.nix
(timer 03:00 Persistent + oneshot service) wired in. 2 bugs fixed via live
service run (repo-relative enrolled scan; util-linux for backup PTY). Live
SERVICE sweep: enrolled=['custom-html'] → all tiers green → canonical advanced
1.10.0→1.11.0; red-run correctly does NOT promote. 71 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:33:08 +01:00
125453df20 claim(2w): WC5 promote-on-green-cold proven — green cold run advances canonical (1.10.0→1.11.0); --quick never promotes; only cold advances
should_promote_canonical (enrolled+green+cold+latest) + promote_canonical
(re-seed canonical at green-verified latest, snapshot+registry, old known-good
replaced only on green). +5 unit (70 pass). Live: custom-html canonical advanced
1.10.0+1.28.0 → 1.11.0+1.29.0 via a full green cold run; snapshot refreshed; idle;
per-run app torn down. WC6 nightly sweep next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:08:14 +01:00
e678d2e006 claim(2w): W0.10a traefik WC1.1 migrated onto shared health-gated reconciler — no-op converge proven; destructive rollback = Adversary cold proof
warm_reconcile.py: per-spec setup hook + health_domain; SPECS[traefik]
(stateful=False, version-rollback-only, _traefik_setup preserves wildcard-cert/
file-provider config, health on routed dashboard host). keycloak path unchanged.
proxy.nix: deploy-proxy.service now execs warm_reconcile.py traefik. ZERO-disruption
migration (traefik already at latest 5.1.1+v3.6.15; pre-seeded TYPE+last_good →
clean no-op converge; traefik 200 + keycloak-through-traefik 200 + 0 failed).
65 unit pass. Per operator out: code+converge delivered; destructive rollback
(brief TLS blip) = Adversary's required cold proof. Closes the W0.10a tracked-open.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 03:50:32 +01:00
191ebde466 fix(2w): W2 --quick live-proof fixes (time import + stale-TYPE reset)
3 bugs found by the live PASS+FAIL proof on the custom-html canonical:
- import time (run_quick._wait_undeployed used it → the FAIL rollback crashed
  with NameError before restore ran).
- canonical.deploy_canonical now resets .env TYPE=<recipe>:<version> before
  redeploy, so a stale TYPE left by a prior --quick upgrade (pointing at a
  since-removed broken PR commit) can't FATAL abra 'unable to resolve <commit>'.
- run_quick FAIL rollback resets TYPE to known-good after restore (idle .env
  agrees with the registry).

LIVE PROOF (custom-html canonical), ALL PASS: (A) PASS quick run → undeploy
keep-volume, known-good UNCHANGED, marker intact; (B) FAIL quick run (broken
image) → 'rolling back' → 'restored known-good data; canonical idle' → exit 1,
known-good UNCHANGED, DATA RESTORED. Canonical left clean (idle, 1.11.0+1.29.0).
61 unit pass; cold path untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 03:05:39 +01:00
f68e9d463f feat(2w): W2 --quick mode in run_recipe_ci.py (WC4+WC7)
run_quick(): opt-in fast lane (CCCI_QUICK=1 / MODE=quick) — reattach the
data-warm canonical (canonical.deploy_canonical, known-good volume) → deps wiring
(warm keycloak + per-run realm) → UPGRADE to PR head (chaos, run_lifecycle_tier
'upgrade': reconverge+moved+serving + overlay) → custom tier. PASS →
undeploy_keep_volume, known-good UNCHANGED (NEVER promote); FAIL → warmsnap.restore
last-known-good + undeploy (roll back, data safe). Always deletes per-run warm
realm. mode=quick labelled lower-confidence (WC7); skips install/backup/restore;
no deploy-count guard (no deploy_app). main() dispatches to run_quick when a
canonical exists, else clean no-canonical fallback to COLD. Cold path byte-identical
(deps wiring intentionally mirrored, not refactored). 61 unit pass; cold untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 02:45:44 +01:00
b6ef83ab0b feat(2w): W1 canonical registry module (WC2) + alerts archived
runner/harness/canonical.py: data-warm canonical registry + lifecycle —
is_enrolled (recipe_meta.WARM_CANONICAL), canonical_domain (warm.stable_domain
warm-<recipe>), registry read/write (/var/lib/ci-warm/<recipe>/canonical.json),
has_canonical (record + retained volume), deploy_canonical (reattach volume at
known-good version), undeploy_keep_volume (idle data-warm), seed_canonical
(record + warmsnap snapshot). warm.stable_domain helper added (keycloak path
unchanged). +4 unit tests (61 unit pass).

Also archived the Adversary's verification alert sentinels to alerts/seen/
(simulated rollback + 2 holds — evidentiary, gate PASSED; dir clean for real alerts).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 02:15:11 +01:00
32f00717ac fix(2w): W0.9 WC1.1 hardening (proven live: healthy upgrade + marquee rollback)
Bugs found by the live proof, fixed:
- warmsnap: snapshot now swaps a <recipe>/snapshot/ SUBDIR, not the whole
  <recipe>/ dir — so the reconciler's sibling last_good file survives a
  snapshot swap (was being clobbered).
- warm_reconcile: deploy_version captures abra's stdout (it writes FATA to
  stdout) in the error; add wait_undeployed() after every undeploy so
  snapshot/restore/redeploy don't race a half-removed swarm stack; the upgrade
  deploy is wrapped so a deploy FAILURE (not just unhealthy) also triggers
  rollback. (57 unit pass.)

LIVE PROOF on warm keycloak (annotated fake tags via CCCI_SKIP_FETCH):
(a) healthy upgrade 10.7.1->10.7.9: snapshot+deploy+health-pass, last_good
    committed=10.7.9, marker realm preserved.
(b) MARQUEE rollback: broken latest 10.7.10 (lint-fail) -> rollback to 10.7.9,
    HEALTHY, marker realm INTACT (data preserved through broken-upgrade+restore),
    last_good NOT advanced, rollback alert written (attempted=10.7.10,
    last_good=10.7.9, recovered=True). keycloak recovered to canonical
    10.7.1+26.6.2 healthy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 01:21:05 +01:00
07ea951f31 fix(2w): WC1.1 reconcile rolls back on deploy FAILURE too (not just unhealthy)
A broken 'latest' can fail abra's converge (deploy_version raises) rather than
deploy-then-be-unhealthy; wrap the upgrade deploy so BOTH paths trigger the
snapshot-restore rollback instead of crashing the reconcile unit.
2026-05-29 01:01:28 +01:00
a044abb298 feat(2w): W0.6 unpinned warm reconciler + WC1.2 safety gate + WC1.1 scaffold
runner/warm_reconcile.py (python, packaged into nix store, replaces bash
reconcile): UNPIN keycloak (deploy latest published version TAG; recipe fetched
at runtime -> D8 closure byte-identical). WC1.2 pre-deploy safety gate (runs
FIRST): major recipe/app-version bump OR releaseNotes manual-migration marker
-> hold-on-current + alert sentinel (no deploy churn). WC1.1 health-gated
upgrade-with-rollback: record last-good -> [keycloak: undeploy->warmsnap.snapshot
->deploy latest] -> health-gate -> commit-or-(restore+redeploy-prior+alert).
Alerts = /var/lib/ci-warm/alerts/*.json (Builder loop relays). current version
read from abra TYPE=<recipe>:<version>. CCCI_SKIP_FETCH test hook.
+8 unit tests for the version gate (56 unit pass).

Proven on cc-ci: nixos-rebuild switch -> warm-keycloak.service runs the python
reconciler -> noop-healthy (system 0-failed, /realms/master=200). WC1.2 holds
proven live: MAJOR bump -> held-major (keycloak untouched); minor+manual-
migration notes -> held-manual-migration (alert carries notes); no deploy churn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 00:42:02 +01:00
4cc1e15a53 feat(2w): W0.5 WC3 snapshot/restore helper (warmsnap.py)
runner/harness/warmsnap.py: raw per-volume tar of an app's stack volumes while
UNDEPLOYED, under /var/lib/ci-warm/<recipe>/ (meta.json + volumes/<vol>.tar);
one last-good, atomic dir swap; restore clears+untars each volume back. Asserts
undeployed (consistency). Reused by WC1.1 (pre-upgrade keycloak snapshot) + WC5.
+5 unit tests (48 unit pass).

LIVE round-trip PROVEN on warm keycloak: create marker realm -> undeploy ->
snapshot (mariadb+providers vols) -> deploy -> delete marker (mutate DB) ->
undeploy -> restore -> deploy -> marker realm BACK; keycloak healthy. WC3 core.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 00:12:46 +01:00
1b8d26b504 feat(2w): W0.2 live-warm keycloak dep mode in orchestrator (WC1)
- runner/harness/warm.py: stable-domain scheme (warm-<recipe>), is_warm_up
  probe, live_app_hexes scan, per-run realm_for naming, reap_orphan_realms.
- run_recipe_ci.py: split declared deps into live-warm (shared provider +
  per-run realm, no deploy, realm deleted at teardown) vs cold (co-deploy).
  Warm path used only when provider is up; cold fallback otherwise. Reap
  orphan realms at run start (concurrency-safe). deploy-count excludes warm
  deps. Realm naming now per-run namespaced (<parent>-<6hex>).
- dependent tests assert the namespaced realm pattern (stronger than ==parent).

Live proof on warm keycloak: realm create -> password-grant JWT -> discovery
issuer -> delete(idempotent) -> reap(keeps live hex, deletes orphan): PASS.
43 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 23:26:02 +01:00
74bf8c1723 feat(2w): W0.1 keycloak realm lifecycle primitives (WC1)
sso.py: list_realms, delete_keycloak_realm (idempotent, refuses master),
realms_to_reap (pure, concurrency-safe predicate), reap_orphaned_realms.
The per-run realm is the isolation unit on a shared live-warm keycloak;
orphans (crashed runs) reaped by hex not mapping to a live app stack.
+8 unit tests (tests/unit/test_warm_realm.py); 43 unit pass on cc-ci.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 23:16:48 +01:00
5b34496557 fix(2): F2-11 — SSO-dep deps-not-ready SKIP no longer yields GREEN !testme
When a DEPS-declaring recipe's setup_custom_tests fails, its @requires_deps (SSO/OIDC)
tests skip; a skip-only pytest file exits 0 so the run previously reported overall=0
(GREEN) while the only SSO test never ran (violates P7). Fix preserves generic-tier
failure-isolation but corrects the green SIGNAL:
- conftest.pytest_collection_modifyitems counts skipped requires_deps tests and appends
  to $CCCI_DEPS_SKIP_REPORT.
- run_recipe_ci: sums the count, surfaces it in RUN SUMMARY, and new pure predicate
  sso_dep_unverified(declared, deps_ready, skipped) flips overall=1.
- 7 new unit tests (tests/unit/test_f211_sso_skip.py).

Verified deploy-free (rate-limit-independent): 35/35 unit PASS; cold real-test proof on
lasuite-docs test_oidc_with_keycloak.py -> 1 skipped + skip-report==1 -> orchestrator
would set overall=1. Full e2e deferred until Docker Hub rate limit lifts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 21:25:27 +01:00
f59d8e6996 feat(2): Q3.2 lasuite-drive base enrollment + nested-subdomain + replicas:0 harness fixes
- harness: services_converged treats replicas:0 one-shot (minio-createbuckets) as
  converged (cur==want); removes the want==0 rejection that hung deploys. DECISIONS.md.
- recipe_meta.EXTRA_ENV flattens MINIO_DOMAIN/COLLABORA_DOMAIN to single-label wildcard
  siblings (the *.ci.commoninternet.net cert covers one label only). DECISIONS.md.
- lifecycle overlays (install/upgrade/backup/restore) + ops.py postgres ci_marker
  data-integrity (db user/name=drive). Parity health_check functional test. PARITY.md.
- DEPS=[keycloak] + OIDC/WOPI/upload functional tests deferred to the SSO iteration
  (probe-before-assert: prove the ~10-service base deploy converges first).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 19:54:31 +01:00
41ede13042 feat(2): refactor — SSO-dep plan refinement (deps AFTER generic + setup_custom_tests + failure isolation)
Per operator-2026-05-28 SSO-dep plan (plan-sso-dep-testing.md). Substantial orchestrator
restructuring:

NEW LIFECYCLE ORDER:
  1. Recipe deploy ALONE (no deps).
  2. install / upgrade / backup / restore — recipe-only generic tiers.
  3. setup_custom_tests step (NEW):
     a. Deploy each declared dep + provision realm/client/test-user via harness.sso.
     b. Write $CCCI_DEPS_FILE in dict shape {dep_recipe: {domain, realm, client_id, client_secret,
        admin_user, admin_password, discovery_url, token_url, ...}}.
     c. Run tests/<recipe>/setup_custom_tests.sh hook (jq-readable; wires OIDC env via abra
        secret insert + .env edits + in-place 'abra app deploy --force --chaos').
  4. CUSTOM tier with deps-ready flag; @pytest.mark.requires_deps tests skip with
     'deps-not-ready: <reason>' when setup_custom_tests fails. NON-deps custom tests still run
     normally — FAILURE ISOLATION (a DoD item per plan).
  5. Teardown: recipe first, deps in reverse declaration order.

Harness changes:
- runner/run_recipe_ci.py: deps deploy moves from BEFORE recipe deploy to AFTER restore tier.
  Adds _enrich_deps_with_sso() + _run_setup_custom_tests_hook(). DG4.1 generalised to
  'one abra app new per app' (recipe + each dep); in-place redeploys (\--force) don't count.
- runner/harness/deps.py: write_run_state + load_run_state accept dict OR list shape;
  deps_as_dict() coerces either to a recipe→entry map.
- runner/harness/sso.py: admin_password_inside() public re-export.
- tests/conftest.py: deps_creds fixture (full creds dict); deps_apps fixture flattens to
  recipe→domain string. pytest_collection_modifyitems hook skips
  \@pytest.mark.requires_deps tests when CCCI_DEPS_READY=0.
  pytest_configure registers the marker.

Recipe content:
- tests/lasuite-docs/setup_custom_tests.sh: NEW hook reads $CCCI_DEPS_FILE via jq;
  inserts oidc_rpcs secret at BUMPED version (v1→v2) since abra app new -S generates v1 first
  and Swarm forbids overwriting; updates SECRET_OIDC_RPCS_VERSION in .env; writes 9 OIDC env
  vars (REALM/DISCOVERY/AUTH/TOKEN/USERINFO/LOGOUT/JWKS/CLIENT_ID/SCOPES); ensures trailing
  newline on .env so writes don't concatenate (caught a 'TIMEOUT=900OIDC_REALM=...' bug);
  triggers in-place 'abra app deploy --force --chaos --no-input'.
- tests/lasuite-docs/functional/test_oidc_with_keycloak.py: refactored to consume deps_creds
  fixture (no longer calls setup_keycloak_realm itself — the orchestrator does it in
  setup_custom_tests). Marked \@pytest.mark.requires_deps.

Cold-verifiable on cc-ci (log /root/ccci-refactor-lasuite-r5.log):
  RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py
  install: PASS, custom: 3 PASS incl. test_oidc_password_grant_against_dep_keycloak.
  deploy-count = 2 (expect 2) — DG4.1 generalised holds.
  Smoke regression: RECIPE=custom-html STAGES=install,custom → 5 PASS, deploy-count=1.

Closes DEFERRED.md #5 (lasuite-docs OIDC parity ports via this plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 19:11:42 +01:00
1bd7c7a1d3 feat(2): Q4.4 ghost + DEPLOY_TIMEOUT plumb-through for heavy recipes
Harness change (small, surgical):
- runner/harness/lifecycle.deploy_app gains a deploy_timeout param (default 900s); passes
  through to abra.deploy(timeout=...). For heavy recipes (ghost, matrix-synapse, lasuite-meet),
  the orchestrator + dep resolver now read recipe_meta.DEPLOY_TIMEOUT and pass it so the Python
  subprocess wrapping abra deploy doesn't SIGKILL it before the recipe's INTERNAL TIMEOUT
  (via EXTRA_ENV) finishes swarm convergence.
- runner/run_recipe_ci.py + runner/harness/deps.py: thread recipe_meta.DEPLOY_TIMEOUT into
  the per-recipe deploy_app call.

Q4.4 ghost enrollment:
- recipe_meta.py: HEALTH_PATH=/, DEPLOY_TIMEOUT=1200 (subprocess), EXTRA_ENV={TIMEOUT: 1200}
  (recipe internal). Ghost cold-start with theme + DB migration runs ~12-15min on cc-ci.
- functional/test_health_check.py: GET / returns 200 (themed site).
- functional/test_content_api.py: GET /ghost/api/content/settings/ returns 200 (settings JSON)
  or 401/403 (Ghost error envelope) — distinguishes ghost-server up + JSON API working from
  static fallback.
- functional/test_admin_redirect.py: GET /ghost/ returns 200 or 302 + Ghost branding;
  proves admin route is wired through nginx proxy.
- PARITY.md: recipe-maintainer corpus has no ghost tests/, Phase-2 health_check is the
  parity baseline; create-a-post deeper test deferred (DEFERRED.md, --extra-tests linked).

Cold-verifiable (log /root/ccci-q44-ghost-r3.log):
  RECIPE=ghost STAGES=install,custom cc-ci-run runner/run_recipe_ci.py
  install + 3 functional tests PASS, deploy-count=1. 28/28 unit tests still PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 17:23:40 +01:00
c6e94af766 fix(2): F2-5 — dep teardown verify=True, errors propagate to run-fail (Adversary cold)
Per REVIEW-2 ## Q2 FAIL: runner/harness/deps.py::teardown_deps suppressed ALL exceptions via
contextlib.suppress(Exception), silently swallowing teardown failures. The 'DEPS teardown' print
fired even when undeploy actually raised — leaving leftover swarm services/volumes/secrets that
broke the NEXT run targeting the same deterministic dep domain (this is what caused the Q3.1 dep
flake I saw immediately after the Q2.4 acceptance run).

Fix:
- runner/harness/deps.py: teardown_deps now uses lifecycle.teardown_app(..., verify=True) so
  residuals raise TeardownError. Errors are LOGGED LOUDLY per-dep but we continue to other deps
  so one failure doesn't strand the rest. After all attempts: raise a combined TeardownError if
  any dep failed.
- runner/run_recipe_ci.py: orchestrator catches the dep TeardownError in finally, prints it,
  captures into dep_teardown_error; the run summary surfaces it and the exit code is non-zero.
  The run STILL prints the diagnosable summary so a leak doesn't hide other failures.

Per §9 teardown sacred / DG7: a green run that leaks state is not 'green'. F2-5 now correctly
fails the run instead of silently passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 09:00:37 +01:00
47f7cb47c2 fix(2): F2-3 systemic — harness.browser.goto_with_retry; applied to all install overlays
Phase 2 lesson from F2-3 (n8n install Playwright flake on net::ERR_NETWORK_CHANGED): every
install overlay that does page.goto needs the same try/except PlaywrightError + status retry.
Centralize in runner/harness/browser.py::goto_with_retry; apply to ALL install overlays.

- runner/harness/browser.py: shared helper. Polls page.goto until status in accept_statuses;
  catches PlaywrightError (net::ERR_*) as a retryable signal, not a failure. Raises AssertionError
  with last_status + last_err diagnostic only on deadline expiry.
- tests/custom-html/test_install.py: now uses goto_with_retry (200 only, wait_until=load).
- tests/custom-html/playwright/test_browser_smoke.py: same.
- tests/n8n/test_install.py: replaced inline retry loop with goto_with_retry (200, 304).
- tests/keycloak/test_install.py: goto_with_retry for admin console (200, 302, 303; 45s goto).
- tests/cryptpad/test_install.py: goto_with_retry (200, 304; 60s goto, wait_until=load).
- tests/lasuite-docs/test_install.py: goto_with_retry (200, 301, 302; 60s goto).

Cold-verifiable: ssh cc-ci 'RECIPE=custom-html cc-ci-run runner/run_recipe_ci.py'
  all 5 stages PASS (including the install overlay that flaked in the deps_smoke run),
  deploy-count=1, head_ref=8a026066==chaos-version=8a026066 (HC1 non-vacuous).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 07:46:34 +01:00
4d6b040ba7 feat(2): Q2.3 — dep resolver + SSO-setup harness primitives
- runner/harness/deps.py: dep resolver primitive (Phase 2 §4.2 / Q2.3).
  - declared_deps(recipe) reads DEPS list from tests/<recipe>/recipe_meta.py
  - dep_domain(parent, pr, ref, dep) — per-run domain per (parent, dep) pair
    so two recipes' deps of the same kind don't collide on a host
  - deploy_deps / teardown_deps — sequential deploy + reverse-order teardown
  - read/write of run-scoped $CCCI_DEPS_FILE
- runner/harness/sso.py: SSO-setup / OIDC-flow primitive (Phase 2 §4.2 / Q2.3).
  - setup_keycloak_realm: idempotent realm + confidential OIDC client +
    test user with generated 25-char alphanumeric password (class-B per §4.4-B);
    returns SsoCreds dict with discovery_url, token_url, all identifiers.
  - oidc_password_grant: exercises the password-grant OIDC flow; returns
    access_token (a JWT) or raises.
  - assert_discovery_endpoint: GET /.well-known/openid-configuration; asserts
    issuer matches the per-run provider domain+realm.
- runner/run_recipe_ci.py: wired in dep deploy BEFORE recipe-under-test, dep
  teardown LAST in finally (reverse order). DG4.1 deploy-count guard now
  expects 1 + len(deps_state) — accommodates declared deps without breaking
  the no-extra-deploys invariant.
- tests/conftest.py: deps_apps fixture reads $CCCI_DEPS_FILE -> dict mapping
  dep_recipe -> dep_domain.
- tests/unit/test_deps.py: 7 unit tests covering declared_deps parsing,
  per-(parent,dep) domain distinctness, run-state JSON write/load, env-var
  no-op semantics. 28/28 unit tests PASS on cc-ci.

Smoke test confirmed deploy_count == expected (1) when no deps declared
(custom-html install run, log /root/ccci-q2-deps-smoke.log).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 07:41:56 +01:00
0d0fc6c4bc feat(2): Q0.1/Q0.2 — harness.http + discovery recurses functional/playwright (Phase 2)
- runner/harness/http.py: canonical Phase-2 recipe-test HTTP API (vendored from
  recipe-maintainer/utils/tests/helpers.py): http_get/http_post, retry variants,
  wait_for_http, assert_converges. JSON-parsing, header support, form/JSON POST
  bodies, transport-failure -> status=0. Self-contained (cc-ci does not import
  recipe-maintainer at runtime per DECISIONS Phase 2).
- harness.discovery.custom_tests now also recurses into
  tests/<recipe>/{functional,playwright}/test_*.py (Phase 2 §4.1 layout) while
  excluding lifecycle test_<op>.py names and honoring the HC2 repo-local gate.
- Unit tests:
    tests/unit/test_http.py — in-process http.server fixture; deterministic
    proofs of parsing/retry/convergence semantics, no network egress.
    tests/unit/test_discovery_phase2.py — functional/+playwright/ recursion
    + HC2 gate still applies to subdirs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:36:49 +01:00
74725610ab fix(1e): HC1 upgrade/restore tier calls now pass head_ref (multi-line edit miss)
Earlier perl substitution missed the multi-line upgrade and restore run_lifecycle_tier calls (still
passed `target` = VERSION env, None for !testme runs), so perform_upgrade got head_ref=None for
upgrade tier → re-checkout skipped → chaos redeploy of leftover prev checkout (vacuous prev→prev that
'passed' via the chaos-label move fallback).

Verified e2e on hedgedoc (install,upgrade; commit pending push):
  upgrade→PR-head: head_ref=09bf4d54 chaos-version=09bf4d54 version=3.0.9+1.10.7→3.0.10+1.10.8
deploy-count=1, install/upgrade=pass, clean teardown. The chaos-version label deterministically
matches head_ref — direct proof PR-head code was deployed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 04:04:13 +01:00
6eabfdc0fb fix(1e): F1e-1 exec_in_app race + HC1 head_ref/move hardening
F1e-1 (Adversary): exec_in_app silently returned '' on a failed docker exec, flipping a healthy
recipe RED under opt-out (post-backup container cycle, no readiness buffer). Now polls (re-resolve
container + re-exec) until rc==0 or 90s, then RAISES — never masks an exec failure as empty data.
No assertion weakened. Verified: opt-out install,backup,restore on custom-html now PASS.

HC1: head_ref = ref or recipe_head_commit (prefer explicit PR head sha $REF — robust, no git race;
production !testme always sets REF). assert_upgraded, when head_ref known, REQUIRES the deployed
chaos-version commit to MATCH head_ref (direct + non-vacuous proof the PR-head code was deployed; a
stale prev-checkout chaos redeploy fails). Falls back to version/image/chaos move check otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:41:42 +01:00
b7e6cbd7be feat(1e): HC3 additive generic + op/assertion split (orchestrator owns the op)
- orchestrator: per mutating tier, run optional pre-op seed hook (ops.py pre_<op>) → perform the op
  ONCE (harness-owned) → run generic assertion (unless opted out) AND overlay assertion, both against
  the shared post-op deployment. Op results passed op→assertion via run-scoped CCCI_OP_STATE_FILE.
- opt-out: CCCI_SKIP_GENERIC / CCCI_SKIP_GENERIC_<OP> / recipe_meta.SKIP_GENERIC (declarative).
- generic.py: split do_* into op primitives (perform_upgrade/backup/restore) + assertions
  (assert_upgraded/backup_artifact/restore_healthy) reading op_state(); deployed_identity now returns
  {version,image,chaos} (chaos label ready for HC1).
- generic test_<op>.py + all 6 recipe overlays migrated to assertion-only; pre-op seeding moved to
  per-recipe ops.py (pre_upgrade/pre_backup/pre_restore). install overlays unchanged (no op).
- deploy-count stays 1 (op primitives never call deploy_app). lint PASS; 8 unit tests PASS on cc-ci.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 03:12:04 +01:00
d38a695fa3 feat(1e): HC2 repo-local approval allowlist (default-deny) + discovery gate
- tests/repo-local-approved.txt (empty ⇒ default-deny); CCCI_REPO_LOCAL_APPROVED_FILE override.
- discovery: repo_local_approved()/_gated() centralize the gate; resolve_overlay_op + generic_op
  (HC3 additive split); custom_tests/install_steps/pre_op_hook all honor the gate.
- unit tests rewritten for approved-vs-not + the generic floor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 02:55:58 +01:00
feb6f80d50 fix(1d): bounded retry in _app_container (backup briefly cycles the app container)
abra app backup create (backup-bot-two) stops/cycles the app container, so a mutate exec_in_app
right after backup hit an empty docker ps and raised. _app_container now polls (no bare sleep) for
the container to reappear within a timeout. Recipe-agnostic harness robustness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 00:06:28 +01:00
81e26a1bdc fix(1d): F1d-2 — pinned base deploys the pinned version; upgrade is non-vacuous
- deploy_app: checkout the pinned tag + deploy NON-chaos when a version is pinned (chaos only for
  version=None / PR-head). Was always -C, which ignored the pin and deployed LATEST -> upgrade no-op.
- do_upgrade: assert the deployment actually MOVED (coop-cloud version label and/or image changed)
  via lifecycle.deployed_identity -> a vacuous no-op upgrade can no longer pass (DG2).
- G2: migrate custom-html overlays to the assertion-only contract (override + extend-by-composition
  + data-continuity; split backup/restore). tests/unit/test_discovery.py proves precedence (5/5).

Probe (Adversary's F1d-2 test): hedgedoc deploy-prev=1.10.7 -> upgrade=1.10.8, CHANGED=True.
hedgedoc full generic lifecycle green (install/upgrade/backup/restore, deploy-count=1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 00:02:59 +01:00
6c5d8f28ea fix(1d): G1 backup/restore + F1d-1 cert-check reframe
- backup artifact: read snapshot_id from 'abra app backup create' output (snapshots needs a TTY);
  generic.parse_snapshot_id + do_backup assert it
- restore serving race: lifecycle.http_fetch (one request -> status+body, never raises) +
  assert_serving is now a bounded poll (settles a post-op reconverge, no bare sleep); drop wait_serving
- F1d-1 (Adversary, low): reframe served_cert/assert_serving honestly as an INFRA TLS sanity check
  (catches a lapsed/mis-rotated wildcard cert), NOT app-vs-fallback (Traefik serves the wildcard
  zone-wide); the genuine serving proof is services_converged + non-404 status. Awaiting re-test.

DG1 Adversary PASS @ef44d46. G1 full-lifecycle re-verification in flight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:39:45 +01:00
ef44d4658b feat(1d): G0 — generic install + deploy-once orchestrator (DG1 green on hedgedoc)
- harness/generic.py: recipe-agnostic assert_serving (converged + real HTTP, 404-excluded +
  not Traefik 404 body + CA-verified trusted wildcard cert), op helpers, backup_capable detect
- harness/discovery.py: per-op overlay resolution (repo-local > cc-ci > generic), custom + hook
- tests/_generic/: assertion-only tiers (install/upgrade/backup/restore) on the shared deployment
- run_recipe_ci.py: deploy-ONCE orchestrator, per-op summary, deploy-count guard (DG4.1)
- conftest live_app fixture; lifecycle deploy-count + install-steps hook + pin DOMAIN to run domain

DG1 cold-verified green on hedgedoc (pure generic, deploy-count=1, clean teardown). G0 CLAIMED.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 23:27:55 +01:00
2cede01ed7 style(1b): auto-format + lint-clean the whole codebase (RL1)
Mechanical, semantics-preserving cleanup so the codebase passes the new lint stage:
- ruff format: all 32 Python files (wraps long signatures, normalizes quotes/blank lines).
- nixpkgs-fmt: modules/drone-runner.nix.
- shfmt (-i 2 -ci): scripts/*.sh.

Lint fixes (reviewed, behavior-preserving — no test weakened):
- ruff SIM105: try/except-pass -> contextlib.suppress (abra.py app_config rm; lifecycle.py janitor).
- ruff SIM115: open().read() -> with open() (run_recipe_ci.py redaction-values + gitea-token).
- statix: merge repeated sops `secrets.*` keys into one `secrets = { ... }` (comments kept);
  empty fn pattern `{ ... }:` -> `_:` (packages.nix).
- deadnix: drop unused lambda args (flake `self`; configuration.nix `lib`; overlay `final` -> `_`).

Verified on cc-ci: `scripts/lint.sh` -> lint: PASS; nixosConfigurations.cc-ci evaluates;
all Python byte-compiles. The deployed bridge/dashboard/runner source changes hash (reformat),
so cc-ci will be rebuilt to the new closure in W2 before the cold D1-D10 re-verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 20:52:05 +01:00
575efb5054 fix: abra app upgrade -c (no-converge-checks) — abra false-fails slow heavy rolling upgrades
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
Diagnosed via instrumented diag: lasuite-docs upgrade reported 'FATA deploy failed' while all 9
services converged 1/1 — abra's convergence poll gives up too early on the slow stop-first roll
(pulling new images). Disable abra's check; the harness wait_healthy + data-survival assertion is
the real, more-patient gate (a genuine failure still fails the test: app never gets healthy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 11:34:59 +01:00
4d5f7e25c6 fix: abra app upgrade -o (offline) — was 401'ing fetching tags from the private mirror origin
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 08:31:40 +01:00
a2f3b14745 fix: upstream tag fetch needs explicit refspec (bare --tags errors 'no remote HEAD')
Some checks failed
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is failing
git fetch --tags <url> without a refspec errors 'couldn't find remote ref HEAD'; use
'refs/tags/*:refs/tags/*'. Verified: brings custom-html's 18 upstream version tags into the mirror
PR clone so the upgrade stage finds a previous published version (was skipping).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 08:28:22 +01:00
c277029f84 M10/D10: enable real-!testme path — fetch upstream tags + enroll 6 recipes in POLL_REPOS
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
fetch_recipe (SRC+REF/PR path) now read-only fetches published version tags from the public upstream
into the mirror clone, so the upgrade stage finds a previous published version (mirror PR branches
carry no tags → upgrade would skip). Guardrail-safe: only fetches tags, never pushes to the recipe
repo; plain git so the bot token isn't sent to upstream. Adds the 6 D10 recipes to the bridge
POLL_REPOS so !testme on their PRs triggers runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 08:21:43 +01:00
fc07d15800 M7/D6: secrets rotation doc + log redaction filter
All checks were successful
continuous-integration/drone/push Build is passing
docs/secrets.md documents the 3 secret classes (A1 external, A2 internal-generated, B recipe-app),
the sops-nix decryption chain, and rotation procedures for each (cert version bump, sops re-encrypt +
swarm-secret version bump, recipe-app ephemeral). run_recipe_ci streams each stage's output through a
redaction filter that masks any /run/secrets/* value (>=8 chars) before it reaches Drone logs —
belt-and-suspenders over 'harness never prints secrets + abra doesn't echo'. Live streaming + exit
code preserved (locally tested). Recipe-ci clones cc-ci fresh per build, so this applies next run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 07:44:53 +01:00
ebb4c0cbca M6.5: enroll cryptpad (recipe #3, stateful/no-DB) + generic per-recipe EXTRA_ENV
All checks were successful
continuous-integration/drone/push Build is passing
Adds a shared-harness EXTRA_ENV mechanism (recipe_meta.py dict or domain-callable),
applied in deploy_app at every deploy path — no per-recipe harness surgery (D5).
cryptpad uses it for its required distinct SANDBOX_DOMAIN. Tests assert data
survival via a marker file in the backed-up cryptpad_data volume (exec_in_app,
since cryptpad data isn't HTTP-served).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 04:41:44 +01:00
7aa0346902 harness: backup/restore pass -C -o; catalogue fetch re-clones clean
Some checks failed
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is failing
Two fixes surfaced by the first real recipe-ci run through Drone:
- abra app backup/restore now pass -C -o (current checkout, no remote fetch) like
  every other recipe-touching call — without -o they fetch recipe tags from the
  (private) remote and fail 'authentication required: Unauthorized'.
- fetch_recipe's catalogue path rm's the recipe dir first so a leftover private-mirror
  remote from a prior SRC+REF run can't poison version resolution / backup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 03:05:03 +01:00
25b628e959 harness: app_new uses chaos only when no version (version => clean tag checkout)
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone Build is passing
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 02:05:54 +01:00
9b33fdf6e6 M6: D4 recipe-local discovery + recipe #2 (keycloak, DB-backed) enrolled; M6 CLAIMED
All checks were successful
continuous-integration/drone/push Build is passing
D4 snapshots recipe-shipped tests/ and runs them against the live app. abra -C -o
everywhere + token clone for private mirror PRs. keycloak install green with no
harness surgery (D5). docs/enroll-recipe.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 01:48:06 +01:00
7fc26fae68 M6 (part 1): per-recipe meta + D4 recipe-local discovery + shared naming helper
All checks were successful
continuous-integration/drone/push Build is passing
Recipe-agnostic harness (no surgery to enroll a recipe): recipe_meta.py for
health path/codes/timeouts; run_recipe_local discovers + runs recipe-shipped
tests/ against the live app. install non-regressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 01:16:29 +01:00
b7a2d70380 harness: fix A2 (janitor real-name + docker reap + age gate) and A3 (verified teardown)
All checks were successful
continuous-integration/drone/push Build is passing
teardown_app now docker-stack-rm fallback, removes .env only after stack gone,
retries volume rm, and verifies no residual (raises TeardownError). janitor matches
the real <recipe[:4]>-<6hex> scheme + reaps env-less orphans via docker. Verified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 01:05:18 +01:00
7eb0dd3c77 M5: upgrade + backup/restore stages green (custom-html); backup-bot-two oneshot
All checks were successful
continuous-integration/drone/push Build is passing
3-stage run green (install/upgrade/backup), clean teardown. backupbot deployed
via reconcile oneshot; PTY (script) for abra backup/restore; -m for secret generate
(no value leak). M5 CLAIMED.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 00:53:16 +01:00
38a145fd9c M4: harness + green install stage (custom-html + Playwright); guaranteed teardown; M4 CLAIMED
All checks were successful
continuous-integration/drone/push Build is passing
run_recipe_ci.py + conftest + abra/lifecycle wrappers + Nix python/playwright env.
deploy_app forces LETS_ENCRYPT_ENV='' (addresses A1). Short per-run domain scheme
for the 64-char swarm name limit. 2 passed; teardown leaves zero orphans.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 00:23:55 +01:00
c21cce51b9 chore: bootstrap cc-ci loop state
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 21:07:31 +01:00