The autoPrune flags passed '--volumes' WITH '--filter until=24h', which docker
rejects ('until filter not supported with --volumes') — so docker-prune.service
FAILED every day (system 'degraded') and never reclaimed anything (a cause of the
disk creeping to 96%). Worse, '--volumes' prunes volumes with no running
container — which would DELETE Phase-2w DATA-WARM canonical volumes (undeployed by
design). Removed '--volumes': now prunes images/containers/networks/build-cache
older than 24h only; warm volumes survive and are pruned deliberately by the warm
reconcilers (WC8).
Verified: nixos-rebuild switch -> docker-prune.service runs clean, system
'running' (0 failed units), warm keycloak still 200.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bugs found by the live proof, fixed:
- warmsnap: snapshot now swaps a <recipe>/snapshot/ SUBDIR, not the whole
<recipe>/ dir — so the reconciler's sibling last_good file survives a
snapshot swap (was being clobbered).
- warm_reconcile: deploy_version captures abra's stdout (it writes FATA to
stdout) in the error; add wait_undeployed() after every undeploy so
snapshot/restore/redeploy don't race a half-removed swarm stack; the upgrade
deploy is wrapped so a deploy FAILURE (not just unhealthy) also triggers
rollback. (57 unit pass.)
LIVE PROOF on warm keycloak (annotated fake tags via CCCI_SKIP_FETCH):
(a) healthy upgrade 10.7.1->10.7.9: snapshot+deploy+health-pass, last_good
committed=10.7.9, marker realm preserved.
(b) MARQUEE rollback: broken latest 10.7.10 (lint-fail) -> rollback to 10.7.9,
HEALTHY, marker realm INTACT (data preserved through broken-upgrade+restore),
last_good NOT advanced, rollback alert written (attempted=10.7.10,
last_good=10.7.9, recovered=True). keycloak recovered to canonical
10.7.1+26.6.2 healthy.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A broken 'latest' can fail abra's converge (deploy_version raises) rather than
deploy-then-be-unhealthy; wrap the upgrade deploy so BOTH paths trigger the
snapshot-restore rollback instead of crashing the reconcile unit.
- set_env: ensure trailing newline before append (keycloak .env.sample ends
with a newline-less #COMPOSE_FILE comment, so a bare append glued DOMAIN onto
it -> DOMAIN unset -> KC_HOSTNAME=https:// -> crash-loop). Same bite fixed in
backupbot.nix.
- converge skips the (forced) redeploy when keycloak already serves 200, so an
activation/boot is a true no-op (no JVM-restart blip) and only redeploys when
down/crash-looping. Health-wait extended to 15min.
Verified on cc-ci: nixos-rebuild switch -> warm-keycloak.service active,
'no-op converge', system running (0 failed), /realms/master=200.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nix/modules/warm-keycloak.nix: idempotent systemd oneshot (like deploy-proxy)
that converges a live-warm shared keycloak at warm-keycloak.ci.commoninternet.net
pinned to 10.7.1+26.6.2, secrets generated only-if-missing (never
rotate a live provider), waits /realms/master=200. Re-warmable from scratch
(D8/WC8). Wired into hosts/cc-ci/configuration.nix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sso.py: list_realms, delete_keycloak_realm (idempotent, refuses master),
realms_to_reap (pure, concurrency-safe predicate), reap_orphaned_realms.
The per-run realm is the isolation unit on a shared live-warm keycloak;
orphans (crashed runs) reaped by hex not mapping to a live app stack.
+8 unit tests (tests/unit/test_warm_realm.py); 43 unit pass on cc-ci.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In-flight Q3.2 iteration (NOT yet live-verified — needs a lasuite-drive deploy
once the warm keycloak from Phase 2w is available). Phase 2 paused here per
operator interjection of Phase 2w; state preserved.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a DEPS-declaring recipe's setup_custom_tests fails, its @requires_deps (SSO/OIDC)
tests skip; a skip-only pytest file exits 0 so the run previously reported overall=0
(GREEN) while the only SSO test never ran (violates P7). Fix preserves generic-tier
failure-isolation but corrects the green SIGNAL:
- conftest.pytest_collection_modifyitems counts skipped requires_deps tests and appends
to $CCCI_DEPS_SKIP_REPORT.
- run_recipe_ci: sums the count, surfaces it in RUN SUMMARY, and new pure predicate
sso_dep_unverified(declared, deps_ready, skipped) flips overall=1.
- 7 new unit tests (tests/unit/test_f211_sso_skip.py).
Verified deploy-free (rate-limit-independent): 35/35 unit PASS; cold real-test proof on
lasuite-docs test_oidc_with_keycloak.py -> 1 skipped + skip-report==1 -> orchestrator
would set overall=1. Full e2e deferred until Docker Hub rate limit lifts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per orchestrator's SSO-dep plan + the refactor in 41ede13, DEFERRED.md entry #5 (lasuite-docs
OIDC parity ports + create-a-doc) closes by execution.
- tests/lasuite-docs/functional/test_oidc_login.py: parity port of recipe-maintainer
oidc_login.py. Anonymous GET /api/v1.0/users/me/ → 302 to keycloak realm OR 401/403;
password-grant token → 200 with user.email matching the provisioned test user.
- tests/lasuite-docs/functional/test_create_doc.py: plan §4.3 prescribed create-an-object +
read-it-back. POST /api/v1.0/documents/ with OIDC Bearer → captured id; GET
/api/v1.0/documents/<id>/ → asserts id+title round-trip.
Both marked \@pytest.mark.requires_deps; skipped with 'deps-not-ready' if setup_custom_tests
fails (failure isolation per plan-sso-dep-testing.md §4).
Cold-verifiable: ssh cc-ci 'RECIPE=lasuite-docs STAGES=install,custom cc-ci-run runner/run_recipe_ci.py'
install: 2 PASS; custom: 5 PASS incl. test_oidc_login_via_keycloak +
test_create_doc_and_read_back; deploy-count=2 (recipe + keycloak dep).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tests/plausible/recipe_meta.py + tests/plausible/functional/test_health_check.py drafted with
EXTRA_ENV setting required Phoenix vars (DISABLE_AUTH, DISABLE_REGISTRATION, SECRET_KEY_BASE).
Stack converges 1/1 but the served app returns HTTP 500 from / for the full 600s HTTP_TIMEOUT
window — config-class failure, not a deploy-timing issue. Diagnosing needs live container-log
inspection + iterative env tuning, more debug cycles than fit autonomous mode. Committing the
draft + a DEFERRED.md entry; operator can iterate when they want.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Harness change (small, surgical):
- runner/harness/lifecycle.deploy_app gains a deploy_timeout param (default 900s); passes
through to abra.deploy(timeout=...). For heavy recipes (ghost, matrix-synapse, lasuite-meet),
the orchestrator + dep resolver now read recipe_meta.DEPLOY_TIMEOUT and pass it so the Python
subprocess wrapping abra deploy doesn't SIGKILL it before the recipe's INTERNAL TIMEOUT
(via EXTRA_ENV) finishes swarm convergence.
- runner/run_recipe_ci.py + runner/harness/deps.py: thread recipe_meta.DEPLOY_TIMEOUT into
the per-recipe deploy_app call.
Q4.4 ghost enrollment:
- recipe_meta.py: HEALTH_PATH=/, DEPLOY_TIMEOUT=1200 (subprocess), EXTRA_ENV={TIMEOUT: 1200}
(recipe internal). Ghost cold-start with theme + DB migration runs ~12-15min on cc-ci.
- functional/test_health_check.py: GET / returns 200 (themed site).
- functional/test_content_api.py: GET /ghost/api/content/settings/ returns 200 (settings JSON)
or 401/403 (Ghost error envelope) — distinguishes ghost-server up + JSON API working from
static fallback.
- functional/test_admin_redirect.py: GET /ghost/ returns 200 or 302 + Ghost branding;
proves admin route is wired through nginx proxy.
- PARITY.md: recipe-maintainer corpus has no ghost tests/, Phase-2 health_check is the
parity baseline; create-a-post deeper test deferred (DEFERRED.md, --extra-tests linked).
Cold-verifiable (log /root/ccci-q44-ghost-r3.log):
RECIPE=ghost STAGES=install,custom cc-ci-run runner/run_recipe_ci.py
install + 3 functional tests PASS, deploy-count=1. 28/28 unit tests still PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per orchestrator note: my prior append (commit 650ab47) accidentally landed under the
'## Closed deferrals' header instead of '## Open deferrals'. All 5 entries (lasuite-docs OIDC
parity, cryptpad create-a-pad, uptime-kuma create-a-monitor, ghost create-a-post, authentik
enrollment) are still OPEN (unchecked boxes) — section relocation only, no content change.
'## Closed deferrals' restored to its (none yet) placeholder.
Per orchestrator note: machine-docs/DEFERRED.md is now the single canonical registry for any
deliberately-deferred work. Every entry MUST carry a specific RE-ENTRY TRIGGER. The orchestrator
seeded 4 matrix-synapse entries; this commit migrates the other Phase-2 deferrals I'd buried
in JOURNAL/PARITY/DECISIONS:
- lasuite-docs OIDC parity ports + create-a-doc (re-entry: before any Q3 gate claim — Adversary
already flagged this in Q3/Q4 checkpoint).
- cryptpad create-a-pad + content round-trip Playwright (re-entry: Adversary F2-9 conditional —
MUST lift before Phase-2 DONE; Q5.2 cold-sample must include).
- uptime-kuma create-a-monitor via Socket.IO (re-entry: --extra-tests flag OR another recipe
needing Socket.IO).
- ghost create-a-post round-trip (re-entry: --extra-tests flag OR Q4 deeper-test pass before
Phase-2 DONE).
- Q2.2 authentik enrollment + setup_authentik_realm backend (re-entry: when cryptpad oidc_login
parity lifts — uses authentik — OR Phase-2 DONE review).
All linked to IDEAS.md --extra-tests flag where relevant. Phase-4 cleanup pass MUST review this
file per plan.md §6.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per REVIEW-2 ## Q3/Q4 partial checkpoint, F2-8: 'goat CLI in container / account state cleanup'
was the §7.1-prohibited 'needs X' excuse class (same shape as F2-4). The recipe-maintainer
corpus literally calls the goat CLI via abra app run — it works fine.
Added tests/bluesky-pds/functional/test_account_and_post.py:
- goat pds describe → assert did:web:<live_app> in output (PDS self-identifies correctly).
- goat pds admin account create with UUID-suffixed handle + email + per-run password (class-B);
parse new account's did:plc:<id>.
- POST /xrpc/com.atproto.server.createSession with the new handle+password → accessJwt.
- POST /xrpc/com.atproto.repo.createRecord (collection=app.bsky.feed.post) with a UUID-marker
text → returns at://<did>/app.bsky.feed.post/<rkey>.
- GET /xrpc/com.atproto.repo.getRecord with that rkey → assert value.text == marker (round-trip).
- Best-effort goat account delete cleanup in finally.
This is the §4.3 prescribed test in full (create account + create post + fetch back + delete).
Cold-verifiable: ssh cc-ci 'RECIPE=bluesky-pds STAGES=install,custom cc-ci-run runner/run_recipe_ci.py'
install + 4 functional tests (health_check + describe_server + session_auth + account_and_post)
all PASS, deploy-count=1.
PARITY.md updated to show goat_account.py as ported.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>