Commit Graph

174 Commits

Author SHA1 Message Date
cc4af49c99 status(2): Q3.2 F2-12 FAIL acknowledged, fix e1147b5 validating; cryptpad F2-9 test landed 3/3 green
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 11:58:03 +01:00
aab77ea0f3 review(2): FAIL gate Q3.2 lasuite-drive (claim 911680f/code 4b38b66) — cold re-run upgrade tier FAILS (abra chaos-deploy FATA: new collabora 25.04.9.4.1 not converged; WOPI pre-gate DID work). install/backup/restore/custom+OIDC pass, deploy-count=1, teardown clean. Filed F2-12 BLOCKING 2026-05-29 11:47:58 +01:00
911680f843 claim(2): Q3.2 lasuite-drive — full lifecycle 3x green via install-time OIDC + collabora-ready upgrade gate
3× repeat-green (logs /root/ccci-drive-q32a-r2/r3/r4.log): install+upgrade+backup+restore+custom all
pass, OIDC password-grant PASSED (not skip), deploy-count=1, clean teardown each run. Resolves the
Adversary's standing veto-eligible obligation (lasuite-drive upgrade tier GREEN + reliable OIDC).

Fixes: install-time OIDC wiring (a151489: _provision_deps before single deploy + OIDC_AT_INSTALL +
install_steps.sh) eliminated the flaky post-deploy --chaos reconverge; collabora-WOPI-ready upgrade
gate + DEPLOY_TIMEOUT plumbing (4b38b66) fixed the upgrade tier (was killing a still-booting collabora,
exit 70). Gate evidence + cold-verify HOW/EXPECTED/WHERE in STATUS-2.md. BACKLOG-2 Q3.2/Q3.2a ticked;
DEFERRED.md disk follow-on noted done.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 11:16:18 +01:00
5e0af07b86 journal(2): Q3.2a fixed-code run 1 FULL SUITE GREEN (collabora-ready gate fixed upgrade tier); launching 3x repeat-green 2026-05-29 10:52:44 +01:00
e0a80124bc inbox(2): consume BUILDER-INBOX (flag rename relay) + finish --extra rename in BACKLOG-2 Adversary-section lines 241/248/292 (Adversary explicitly delegated)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 10:40:49 +01:00
a22ba9c9cc inbox(2): relay orchestrator flag rename --extra-tests -> --extra to Builder (DEFERRED.md 12 occ + BACKLOG-2 4 occ; single-writer files, not editing them myself) 2026-05-29 10:39:46 +01:00
4b38b66fa5 fix(2): lasuite-drive Q3.2a — gate upgrade redeploy on collabora-ready + plumb DEPLOY_TIMEOUT
Q3.2a run 1: Part A (install-time OIDC) GREEN — deploy-count=1, install/backup/restore/custom +
OIDC test all PASS. BUT upgrade tier FAILED: the in-place `abra app deploy --chaos` redeploy landed
on a STILL-BOOTING collabora (coolwsd ~2min boot: 1300+ l10n files + RSA keygen) and SIGTERMed it
mid-init ("Shutdown requested while starting up", forced exit 70) → abra aborted the deploy. The
install wait_healthy returns on container 1/1 while coolwsd is still loading. Fixes (plan §C
readiness-gating, no test weakened):

- tests/lasuite-drive/ops.py::pre_upgrade — wait for collabora WOPI discovery (/hosting/discovery
  on collabora-<domain>) → 200 BEFORE the chaos redeploy, so it replaces a ready collabora cleanly.
- runner/harness/lifecycle.chaos_redeploy + generic.perform_upgrade + run_recipe_ci._perform_op —
  plumb the recipe DEPLOY_TIMEOUT to the upgrade chaos redeploy (was abra.deploy's 900s default,
  while the .env internal TIMEOUT is 1500s → Python could SIGKILL abra mid-wait on the slow
  collabora/onlyoffice reconverge). Mirrors the install deploy_app timeout plumbing.

Also (operator naming change 2026-05-29): renamed `--extra-tests` -> `--extra` in DEFERRED.md +
BACKLOG-2.md Build-backlog section. 3 refs remain in BACKLOG-2 Adversary-findings section
(241/248/292, closed findings) — left for the Adversary (single-writer); orchestrator updated
IDEAS.md/plan-sso-dep-testing.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 10:37:55 +01:00
0b558529c9 review(2): pre-claim recon lasuite-drive Q3.2a Part A — minio scale is recipe one-shot (replicas:0) NOT a bypass; install-time OIDC=deploy-once; minio test is real round-trip; NO verdict (gate not claimed) 2026-05-29 10:33:01 +01:00
f89cf9b1b8 status(2): Q3.2a lasuite-drive Part A in validation — install-time OIDC landed, full-suite run in flight
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 10:13:21 +01:00
a151489996 feat(2): lasuite-drive Q3.2a Part A — wire OIDC at INSTALL, eliminate flaky redeploy
Q3.2a / plan-lasuite-drive-oidc-robustness.md Part A. The old setup_custom_tests.sh did a
post-deploy in-place `abra app deploy --force --chaos` of the heavy 12-service stack to apply
the OIDC env — flaky (collabora WOPI-discovery race + gunicorn-perms; JOURNAL Step 0). Since
the OIDC env only affects backend/app and keycloak is live-warm, provision the per-run realm
BEFORE the single deploy and wire OIDC into the .env at install time (no reconverge).

- runner/run_recipe_ci.py: new _provision_deps() helper (warm/cold split + SSO enrich + write
  $CCCI_DEPS_FILE), used by both paths. New per-recipe OIDC_AT_INSTALL meta flag (added to
  _load_meta whitelist). When set + deps live-warm: provision BEFORE deploy_app; the install
  tier's install_steps.sh wires OIDC into the single deploy; post-deploy step runs only the
  MinIO bucket one-shot — no re-provision, no redeploy. Legacy post-deploy path unchanged for
  all other dep recipes (gated on `not oidc_at_install`).
- tests/lasuite-drive/install_steps.sh (NEW): install-time OIDC env + secret wiring; no-ops on
  empty deps file (recipe still boots, OIDC test skips → F2-11 RED).
- tests/lasuite-drive/setup_custom_tests.sh: trimmed to MinIO-bucket-only (OIDC moved out).
- tests/lasuite-drive/recipe_meta.py: OIDC_AT_INSTALL = True.
- JOURNAL-2: Step-0 root-cause failure logs captured before the fix.

NOT a claim — validating 3x green (incl. now-required upgrade tier) before claiming Q3.2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 10:10:05 +01:00
4356f0009c review(2): cross-phase probe — 2pc prune-policy did NOT regress 2w warm infra (volumes survived, timers active, canonical idle@1.11.0); no finding, standing obligations stand 2026-05-29 10:00:38 +01:00
d389dd516b status(2pc): ## DONE — Adversary PASS for PC1+PC2+PC3, F2pc-1 closed, no VETO
Phase 2pc complete: conservative surgical gated prune (ci-docker-prune) live + reproducible from
git, local Docker store retained as the cache (PAT-authenticated, layer reuse proven), registry
pull-through cache deferred to IDEAS. Adversary review(2pc) 486d162 PASS @2026-05-29. Watchdog
auto-returns to Phase 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:53:30 +01:00
486d162663 review(2pc): PASS gate 2pc (re-claim 9e73ebd) — PC1+PC2+PC3 cold-verified; F2pc-1 CLEARED. git==host: docker-prune.nix+swarm.nix byte-identical to /root/cc-ci, committed units now ci-docker-prune = live (enabled+active), old docker-prune.timer not-found. Live re-confirm: no-op prune@<80% images 18->18, cold->warm redis reuse. Pressure-branch keep-cache property structural (image prune w/o --all). PC2 PAT nptest2+retention+no-mirror, PC3 teardown-keeps-images+bogus-tag-fails GREEN from prior pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:52:28 +01:00
9e73ebda3d claim(2pc): re-claim — F2pc-1 resolved (git==host==ci-docker-prune via b9bbd25)
Adversary FAILed claim de6103d because that commit still named the units docker-prune while the
host runs ci-docker-prune; the rename was committed in b9bbd25 (its endorsed fix) which is in the
current pushed HEAD. git now defines the same ci-docker-prune units STATUS documents and the host
runs. Behavior was already cold-verified GREEN. Inert NixOS-builtin docker-prune.service
(inactive/linked, no timer) is unchanged by this and reproduces identically from git.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:50:39 +01:00
49892be7b0 review(2pc): FAIL gate 2pc (claim de6103d) — PC1/PC2/PC3 behavior cold-verified GREEN on host (surgical gated prune no-op@31%, images 17→17; teardown keeps images; PAT nptest2; cold→teardown→warm reuses local layers; bogus tag still fails), BUT committed code != verified host: git defines docker-prune units, host runs ci-docker-prune from uncommitted /root/cc-ci → not reproducible from git (D8). Filed F2pc-1 BLOCKING.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:47:45 +01:00
f6af7edd97 status(2pc): add probe-5 evidence — surgical prune reclaimed 2.34GB (dangling+old only), all tagged images kept, disk bounded without -af
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:44:57 +01:00
de6103d41d claim(2pc): PC1 conservative prune deployed+verified; PC2/PC3 local-store cache confirmed
ci-docker-prune (gated surgical prune) live on cc-ci: old autoPrune --all gone, new timer
enabled (daily), no-ops below 80% disk keeping the local image cache, never --all/--volumes.
Daemon stays PAT-authenticated (nptest2); /var/lib/docker retained across rebuild. PC3 proof:
redis:7-alpine deploy->teardown(service rm, image retained)->redeploy = "Image is up to date",
no layer re-download (cold 5303ms -> warm 674ms). Docs: runbook "Image cache & prune policy",
warm.md, DECISIONS Phase-2pc, IDEAS (registry pull-through cache deferred + revisit trigger).
Gate 2pc CLAIMED, awaiting Adversary cold-verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:42:36 +01:00
16d177e73a feat(2pc): PC1 conservative prune — drop autoPrune --all, add gated surgical docker-prune
Removes virtualisation.docker.autoPrune (daily `docker system prune --all` evicted in-use base
images → cold re-pull → Hub rate-limit churn, JOURNAL-2). Adds modules/docker-prune.nix: daily
timer + oneshot that prunes only dangling+until=24h, gated on disk pressure (>=80%) AND no run-app
live AND no swarm service converging; never --all, never --volumes. Teardown unchanged (never
removes images). Registry pull-through cache dropped per operator scope correction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:30:07 +01:00
e42753c17c note(2pc): realign REVIEW-2pc to narrowed scope — registry pull-through cache DROPPED per operator; 2pc is now prune-policy only (PC1 surgical prune + teardown must NOT remove images, PC2 confirm PAT-auth+local-store retention, PC3 deploy/teardown/redeploy reuses local layers). Break-it checklist updated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:25:55 +01:00
863bbac4de note(2pc): init REVIEW-2pc — AWAITING CLAIM; baseline recon of current prune (swarm.nix --all until=24h) + confirm no pull-through cache exists yet; break-it checklist staged
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 09:22:11 +01:00
78cf95aad3 status(2): Q3.2 truthful update — disk-blocker RESOLVED (cc-ci 64G); upgrade tier now REQUIRED green (not deferrable), runs via Q3.2a rework; F2-7 closed out-of-scope per SSO policy 2026-05-29 09:10:55 +01:00
139e8b9797 review(2): close F2-7 out-of-scope per operator SSO policy (keycloak default; Phase-2 DONE not gated on authentik; re-entry only if a recipe REQUIRES authentik); Builder owns DECISIONS/DEFERRED#9/cryptpad-keycloak edits 2026-05-29 09:10:00 +01:00
1537a928d5 decisions(2): record operator SSO-provider policy — keycloak DEFAULT for all recipe OIDC; authentik NOT a Phase-2 DONE gate (enroll only if a recipe REQUIRES it); cryptpad OIDC under keycloak; narrow DEFERRED #9 authentik re-entry trigger 2026-05-29 09:09:38 +01:00
779fb8917c status(2): link plan-lasuite-drive-oidc-robustness.md into Q3.2a (Step 0 logs → Part A install-time OIDC vs warm keycloak [deploy once, no reconverge, real-abra-only] → Part B recipe PR; 3x-green + cold-verified before Q3.2 claim) 2026-05-29 09:06:43 +01:00
542028a6a4 status(2): Q4.5 mattermost-lts DONE — full lifecycle green (install+upgrade+backup+restore+custom, deploy-count=1, clean teardown); P1+P3 met; P4 ops → Q5 sweep 2026-05-29 09:05:55 +01:00
200d599c06 status(2): Q4.5 mattermost-lts ENROLLED + install+custom GREEN (create-message §4.3 round-trip validated live); full lifecycle in flight for P1 2026-05-29 08:59:57 +01:00
6ff68e625a note(2): record Adversary cold-verify criteria for queued lasuite-drive Q3.2 rework (real-abra-only enforcement, repeat-green + upgrade tier required); not active yet 2026-05-29 08:58:32 +01:00
9b6c0e03dc review(2): disk-blocker LIFTED — cold-verified 64G/44G-free + infra healthy post-resize; lasuite-drive upgrade tier now REQUIRED green (deferral void, veto-eligible open obligation); DEFERRED.md edit left to Builder 2026-05-29 08:42:52 +01:00
6df4757f85 status(2): CLOSE disk-blocker DEFERRED — cc-ci resized to 64G (44G free); heavy-recipe upgrade tiers runnable; lasuite-drive full-lifecycle Q3.2 now active backlog 2026-05-29 08:42:24 +01:00
aca1fd5185 inbox(2): consume Adversary BUILDER-INBOX — disk-blocker deferral VOID post-resize; Q3.2 now requires the FULL lasuite-drive lifecycle incl. a GREEN upgrade tier (cold-verified). Aligns with my plan: re-run full after cc-ci healthy, claim only when upgrade green. 2026-05-29 08:37:10 +01:00
4eae6eb208 inbox(2): disk resize 30→70GB in progress — deferral VOID; lasuite-drive upgrade tier now REQUIRED green for Q3.2 sign-off (no longer deferrable); pausing host verify during restart 2026-05-29 08:36:32 +01:00
dd137f9683 status(2): disk resize 30->70GB in progress (orchestrator) — disk-blocker LIFTING; deploys paused; plan to re-run lasuite-drive FULL lifecycle + mattermost after cc-ci healthy 2026-05-29 08:36:17 +01:00
9df900d1cc journal(2): mumble scope correction — non-HTTP health = high-blast-radius core-harness feature (wait_healthy/canonical/generic), deserves dedicated effort; re-pick next unit = mattermost-lts (HTTP-native, no core changes) 2026-05-29 08:06:03 +01:00
7997b98935 journal(2): scouted mumble (Q4.2) — first non-HTTP recipe; design = python sidecar probe on app overlay network for the TLS protocol test; enrollment plan recorded for next tick 2026-05-29 07:47:42 +01:00
426a953c2b status(2): lasuite-drive Q3.2 NOT claimed — OIDC setup redeploy flaky (collabora reconverge); --detach fix validated; test assertions proven correct (run 1); Q3.2a robustness item added; prune-during-deploy lesson recorded 2026-05-29 07:27:50 +01:00
75ae226c0d status(2): Q3.2 lasuite-drive maximal subset GREEN (install+backup+restore+custom: health+MinIO roundtrip+OIDC JWT); upgrade tier deferred pending disk resize; clean re-run w/ --detach fix in flight before claim 2026-05-29 06:28:03 +01:00
d1aae43c7e inbox(2): consume Adversary BUILDER-INBOX — conditional/deferred sign-off model for lasuite-drive upgrade tier (deferred pending disk resize, NOT waived; veto-eligible open item until cold-verified green). Q3.2 claim will frame accordingly. 2026-05-29 05:54:49 +01:00
ccc42699ff chore(2): consume ADVERSARY-INBOX (Q3.2 lasuite-drive heads-up); reply via BUILDER-INBOX — disk blocker is operator-removable, will grant CONDITIONAL/deferred sign-off only, upgrade tier still blocks Phase-2 DONE 2026-05-29 05:53:51 +01:00
b78d708c49 decisions/deferred(2): lasuite-drive upgrade tier = disk env-blocker (28GB host, dual multi-GB office image crossover); maximal subset in flight; operator disk-resize escalation; adversary heads-up 2026-05-29 05:51:31 +01:00
2c245c83c7 journal(2): Phase 2 RESUMED post-2w — foundation re-confirmed (72 unit + custom-html full e2e green), reference-corpus mapping, lasuite-drive e2e in flight 2026-05-29 05:03:46 +01:00
7b5ed9c350 review(2): break-it probe @2026-05-29 — 2w WC5 promotion × F2-11 SSO-skip: NO regression (overall-gated, no alt promote path, 72 unit pass cold) 2026-05-29 04:54:02 +01:00
aebb28d774 done(2w): Phase 2w COMPLETE — WC1-WC9 (incl WC1.1/WC1.2) all Adversary-verified, NO VETO
## DONE written to STATUS-2w. Adversary authorized (REVIEW-2w 2822d60: all gates
cold-verified, no veto, no open findings). Final state healthy: keycloak+traefik
200, custom-html canonical idle@1.11.0+1.29.0, nightly-sweep timer active, system
running 0 failed, disk 50%. Watchdog auto-returns to Phase 2 (resume recipe
authoring; STATUS-2/BACKLOG-2 intact).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:48:02 +01:00
2822d60474 review(2w): WC8 + WC9 (FINAL) — PASS @2026-05-29; ALL WC1-WC9 (incl WC1.1/WC1.2) Adversary cold-verified, NO VETO — DONE authorized 2026-05-29 04:46:30 +01:00
40b03a9bf1 claim(2w): WC8 + WC9 (FINAL gates) — resource-safety consolidation + stale-warm prune + docs/warm.md + --quick rollback proof
WC8: canonical.prune_stale (drop de-enrolled warm data + volumes) wired into the
nightly sweep + df log; consolidated evidence (DRONE_RUNNER_CAPACITY=MAX_TESTS
serialize; autoPrune drops --volumes so warm vols survive; cold teardown sacred;
warm excluded from D8 — no nix source ref). +1 unit (72 pass). WC9: docs/warm.md
documents the full warm/quick model; --quick rollback proof already proven live
(W2 FAIL restores exact known-good; WC4 PASS byte-identical snapshot). On PASS,
all WC1-WC9 (incl WC1.1/WC1.2) verified → DONE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:43:34 +01:00
b8b698e2f5 review(2w): WC6 nightly full-cold sweep — PASS @2026-05-29 (declarative timer Persistent + orchestration + live systemd-service run: infra roll health-gated → serial cold sweep → canonical advanced, infra healthy, no leftovers) 2026-05-29 04:38:51 +01:00
465e1059b0 claim(2w): WC6 nightly full-cold sweep — timer+service roll warm/infra (health-gated) then serial cold sweep promoting canonicals (WC5); proven live
canonical.enrolled_recipes; runner/nightly_sweep.py (roll keycloak+traefik →
serial full-cold over enrolled on latest → green promotes; skip if test active;
operate against CCCI_REPO checkout for tests/); nix/modules/nightly-sweep.nix
(timer 03:00 Persistent + oneshot service) wired in. 2 bugs fixed via live
service run (repo-relative enrolled scan; util-linux for backup PTY). Live
SERVICE sweep: enrolled=['custom-html'] → all tiers green → canonical advanced
1.10.0→1.11.0; red-run correctly does NOT promote. 71 unit pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:33:08 +01:00
1e40a460ba status(2w): WC5 ADVERSARY PASS @2026-05-29 (8 WC items verified); building WC6 nightly sweep 2026-05-29 04:14:16 +01:00
5bbc47cb02 review(2w): WC5 promote-on-green-cold — PASS @2026-05-29 (gate predicate anti-poison verified + live advancement 1.10.0→1.11.0 cold-only; --quick/PR-head/red/unenrolled excluded) 2026-05-29 04:13:17 +01:00
125453df20 claim(2w): WC5 promote-on-green-cold proven — green cold run advances canonical (1.10.0→1.11.0); --quick never promotes; only cold advances
should_promote_canonical (enrolled+green+cold+latest) + promote_canonical
(re-seed canonical at green-verified latest, snapshot+registry, old known-good
replaced only on green). +5 unit (70 pass). Live: custom-html canonical advanced
1.10.0+1.28.0 → 1.11.0+1.29.0 via a full green cold run; snapshot refreshed; idle;
per-run app torn down. WC6 nightly sweep next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 04:08:14 +01:00
cf5999cdda decisions(2w): W3 WC5 promote-on-green-cold mechanism (re-seed canonical from fresh green-latest deploy; never lose known-good; gate=enrolled+green+cold+latest) 2026-05-29 04:01:59 +01:00