decisions/deferred(2): lasuite-drive upgrade tier = disk env-blocker (28GB host, dual multi-GB office image crossover); maximal subset in flight; operator disk-resize escalation; adversary heads-up

2026-05-29 05:51:31 +01:00
parent 2c245c83c7
commit b78d708c49
4 changed files with 102 additions and 0 deletions
--- a/machine-docs/ADVERSARY-INBOX.md
+++ b/machine-docs/ADVERSARY-INBOX.md
@@ -0,0 +1,24 @@
+# ADVERSARY-INBOX — Builder → Adversary (non-gate heads-up) @2026-05-29
+
+**Phase 2 RESUMED** after the 2w detour. No gate claimed yet — this is a heads-up + an env-blocker
+that will need your sign-off when I claim Q3.2.
+
+1. **Foundation re-confirmed post-2w** (FYI, no action): `tests/unit` = 72 passed on HEAD `7b5ed9c`;
+   `RECIPE=custom-html` full e2e all 5 tiers PASS, deploy-count=1, WC5 promoted canonical. Your
+   cross-phase break-it probe (review(2) `7b5ed9c`) verdict NO-regression is consistent with this.
+
+2. **NEW env-level blocker for heavy recipes — lasuite-drive upgrade tier (DEFERRED.md 2026-05-29 +
+   DECISIONS.md Phase 2 entry).** The prev→PR-head upgrade crosses two multi-GB office image versions
+   at once (onlyoffice 9.2→9.3.1.2 @3.94GB + collabora 25.04.9.1.1→25.04.9.4.1); ~10GB transient vs
+   ~14GB docker headroom on the 28GB host → 99% disk → deploy fail. No harness fix (prev images are
+   *running* when new must be pulled). I escalated a disk-resize to the operator. install/backup/
+   restore/custom fit and pass. **When I claim Q3.2 it will cite the maximal testable subset green +
+   this upgrade tier as a genuine disk env-blocker (§7.1) needing your sign-off.** Repro if you want
+   to confirm: `RECIPE=lasuite-drive cc-ci-run runner/run_recipe_ci.py` and watch `df -h /` cross 95%
+   when the upgrade tier pulls onlyoffice 9.3.1.2. (Please don't leave it running to 100% — I had to
+   emergency-clean the host; runbook in DECISIONS.md.)
+
+3. **My build clone is `/root/builder-clone`** (origin/main; secrets submodule skipped — not needed
+   for recipe tests). Your `/root/adv-verify` is untouched.
+
+(Delete this file to mark consumed.)
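
The repro in item 2 asks the reader to watch `df -h /` and stop before the disk fills. A minimal sketch of that watch in Python, assuming a 95% abort threshold taken from the heads-up; the function names are illustrative and not part of the cc-ci harness:

```python
import shutil


def percent_used(used_bytes: int, free_bytes: int) -> float:
    """Approximate df's Use%: used / (used + available), as a percentage."""
    return 100.0 * used_bytes / (used_bytes + free_bytes)


def should_abort(used_bytes: int, free_bytes: int, limit_pct: float = 95.0) -> bool:
    """True once usage reaches the limit (95% is the figure quoted in the heads-up)."""
    return percent_used(used_bytes, free_bytes) >= limit_pct


if __name__ == "__main__":
    # shutil.disk_usage returns (total, used, free) in bytes for the given path.
    usage = shutil.disk_usage("/")
    pct = percent_used(usage.used, usage.free)
    print(f"/ is {pct:.1f}% used; abort={should_abort(usage.used, usage.free)}")
```

Polling this in a loop next to the run and killing the run when `should_abort` returns true would avoid the emergency clean described above. It only observes the overflow; it does not make the upgrade tier fit.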
--- a/machine-docs/DECISIONS.md
+++ b/machine-docs/DECISIONS.md
@@ -693,3 +693,16 @@ green run SEEDS the canonical. `--quick` never promotes (proven W2). Only cold a
 Promote gate predicate (unit-tested): `is_enrolled(recipe) and overall==0 and not quick and not ref`.
 (`not ref` = a catalogue-latest run, i.e. the nightly sweep or a manual `RECIPE=<r>` run — a PR
 `!testme` carries REF=PR-head and must NOT advance the canonical to a PR's code.)
+
+## Phase 2 — heavy-recipe upgrade tier disk constraint (28GB host) — SETTLED finding @2026-05-29
+The upgrade tier (HC1: prev published → PR-head via in-place `abra app deploy --chaos`) cannot
+complete for recipes whose successive releases bump multi-GB image tags, because the rolling update
+must hold BOTH versions on disk transiently. Proven on lasuite-drive: onlyoffice 9.2 → 9.3.1.2
+(3.94GB each) + two collabora versions → ~10GB office images at once vs ~14GB docker headroom on the
+28GB host → 99% → deploy fail. **No harness fix is possible** (the prev images are running, so they
+are neither dangling-prunable nor `rmi`-able when the new must be pulled). install/backup/restore/
+custom (single version) fit and pass. Resolution = grow the host disk (Class A1 operator input,
+DEFERRED.md 2026-05-29). Until then, heavy recipes are verified via their maximal testable subset
+(install+backup+restore+custom) with the upgrade tier flagged as a genuine env-level (disk) blocker
+per plan §7.1 (Adversary sign-off required). The cleanup runbook for an over-full host:
+`pkill -f run_recipe_ci.py`; `docker stack rm <leftover>`; remove its volumes+secrets; `docker image prune -f`.
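
The runbook above can be sketched as an ordered command plan. This is a dry-run illustration only: it builds the argv lists and prints them rather than executing anything. The volume and secret names in the example are hypothetical placeholders, and `cleanup_plan` is not an existing harness function.

```python
import shlex
from typing import List, Sequence


def cleanup_plan(stack: str, volumes: Sequence[str], secrets: Sequence[str]) -> List[List[str]]:
    """Return the runbook steps, in order, as argv lists (nothing is executed)."""
    plan = [
        ["pkill", "-f", "run_recipe_ci.py"],  # stop the run that is still pulling images
        ["docker", "stack", "rm", stack],      # remove the leftover stack
    ]
    if volumes:
        plan.append(["docker", "volume", "rm", *volumes])
    if secrets:
        plan.append(["docker", "secret", "rm", *secrets])
    plan.append(["docker", "image", "prune", "-f"])  # drop dangling images last
    return plan


if __name__ == "__main__":
    # Hypothetical names; list the real ones with `docker volume ls` / `docker secret ls`.
    for argv in cleanup_plan("lasu-7ea5e3", ["example_volume"], ["example_secret"]):
        print(shlex.join(argv))
```

Keeping the steps as data makes the order reviewable before anything destructive runs. Volume removal may need a retry if the stack's containers have not finished stopping.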
--- a/machine-docs/DEFERRED.md
+++ b/machine-docs/DEFERRED.md
@@ -156,6 +156,28 @@ before the build is called done) — but does **not** force closure.
  pluggable, not just claimed).
 - **Linked IDEA:** —

+### 2026-05-29 — heavy-recipe upgrade tier needs more host disk (28GB too small)
+- [ ] **What:** The upgrade tier for the heaviest recipes cannot complete on the 28GB host. Proven
+      on **lasuite-drive**: the prev→PR-head chaos upgrade crosses two multi-GB office image versions
+      at once — onlyoffice/documentserver-de `9.2 → 9.3.1.2` (3.94GB each) + collabora/code
+      `25.04.9.1.1 → 25.04.9.4.1` (~1GB) — so ~10GB of office images must coexist on disk during the
+      in-place rolling update. The host has only ~14GB docker headroom over its ~13GB baseline (nix
+      store ~9.6GB + infra images), so the PR-head pull hit 99% and the deploy failed. There is **no
+      harness mitigation** (the prev images back *running* containers when the new must be pulled, so
+      they cannot be `rmi`'d, and nothing is dangling to prune pre-upgrade). install/backup/restore/custom (single
+      version, ~6GB) all fit and pass — only the upgrade tier overflows. Almost certainly also blocks
+      the upgrade tier of other heavy recipes (lasuite-docs ships collabora; immich ships multi-GB ML
+      images; lasuite-meet).
+- **Filed by:** Builder, phase 2 (Q3.2 lasuite-drive full-lifecycle attempt)
+- **Reason for deferral:** Class A1 EXTERNAL infra input — host disk size. Not improvisable; not a
+      test-quality issue; the recipe legitimately bumps office image tags across releases.
+- **Operator action to lift:** grow the cc-ci host disk (resize the droplet volume + online-grow the
+      filesystem) to give heavy-recipe upgrade tiers transient headroom — ~+20GB would comfortably
+      cover the dual-office-version crossover and the rest of the heavy set. Then re-run the full
+      lasuite-drive lifecycle (and re-confirm immich/lasuite-meet/lasuite-docs upgrade tiers).
+- **Re-entry trigger:** operator disk resize, OR Phase-2b pull-through cache + image-GC policy work.
+- **Linked IDEA:** `cc-ci-plan/IDEAS.md` (pull-through cache / Phase 2b).
+
 ---

 ## Closed deferrals
--- a/machine-docs/JOURNAL-2.md
+++ b/machine-docs/JOURNAL-2.md
@@ -635,3 +635,46 @@ Q4.10 drone (specifics only), + deferral lift cryptpad create-pad (F2-9, must li
 (log `/root/ccci-resume-lasuite-drive.log`) — lasuite-drive suite (health parity + real MinIO S3
 upload/list/download round-trip + OIDC password-grant JWT-claims against dep keycloak) is fully
 authored; driving it to its first verified-green full run (the Q3.2 acceptance evidence).
+
+---
+
+## 2026-05-29 — lasuite-drive full e2e: upgrade tier hits a DISK-SIZE env blocker (host health emergency handled)
+
+Drove lasuite-drive (heaviest §5 recipe — BOTH office backends) toward its first verified-green full
+run. install tier PASSED (generic test_serving + cc-ci test_serving_and_frontend; all 12 services
+converged after collabora won its startup race — see below). backup tier PASSED. Then the **upgrade
+tier FAILED** and disk hit **99% (522M free)**, risking a host wedge.
+
+**Root cause (definitive, from the abra DEPLOY OVERVIEW in the log):** the prev→PR-head upgrade
+crosses *two different multi-GB office image versions simultaneously*:
+- onlyoffice/documentserver-de: 9.2 → **9.3.1.2** (3.94GB image)
+- collabora/code: 25.04.9.1.1 → 25.04.9.4.1 (~1GB)
+- (+ small drive-backend/frontend v0.12.0→v0.18.0, redis, nginx)
+abra's in-place chaos rolling update must hold BOTH the running prev office images AND pull the new
+ones before swapping — ~10GB of office images transiently. The 28GB host has only ~14GB docker
+headroom over the ~13GB baseline (nix store ~9.6GB + infra images ~1.75GB), so the PR-head pull
+overflowed. **No harness mitigation exists:** the prev images are *running* (not dangling) when the
+new must be pulled, and you cannot `docker rmi` an image that a running container uses; a pre-upgrade
+prune finds nothing dangling. It is a disk-SIZE constraint, driven by the recipe legitimately bumping office
+image tags across releases. Not a test-quality issue and not weakenable.
+
+**collabora startup race (separate, self-resolving):** collabora/code logs
+`/usr/bin/coolmount: Operation not permitted` (CapAdd=[] + default seccomp blocks mount()), falls back
+to slow file-COPYING into its jail; the healthcheck killed an early task (exit 137) but a later task
+finished the copy and reached 1/1. So collabora converges, just flaps once or twice first. Not the
+blocker; noting in case it recurs on slower disk.
+
+**Emergency handled — host fully restored:** killed the run (`pkill -f run_recipe_ci.py`), removed the
+orphaned `lasu-7ea5e3` stack + its volumes (minio, postgres) + 8 leftover secrets (the killed run's
+teardown never ran), pruned dangling images. Disk recovered 99% → 37% (17GB free). Infra stacks
+(traefik/drone/dashboard/bridge/backups/warm-keycloak) untouched and healthy throughout.
+
+**Decision:** the upgrade tier for lasuite-drive (and very likely other heavy recipes: lasuite-docs
+also ships collabora; immich ships multi-GB ML images; lasuite-meet) is a genuine **Class A1 env-level
+disk blocker** — the clean fix is a larger host disk (operator). Filed in DEFERRED.md + DECISIONS.md +
+BACKLOG-2; flagged to operator (PushNotification) and Adversary (inbox). Meanwhile banking the
+**maximal testable subset** (install+backup+restore+custom — single version, fits disk) to prove
+lasuite-drive's actual Q3.2 CONTENT works: parity health, the real MinIO S3 upload→list→download
+round-trip, and the OIDC password-grant + JWT-claims flow against the dep keycloak. Per §7.1 the
+maximal subset is implemented and only the genuinely-disk-blocked upgrade tier is outstanding —
+pending Adversary sign-off on the env-blocker.
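
The "~10GB of office images" figure can be reproduced from the sizes quoted in the journal. A small worked sketch, assuming the "~1GB" collabora size applies to each of the two versions (the entry gives only one approximate figure):

```python
# Image sizes in GB as quoted in the journal entry.
# Assumption: both collabora versions are ~1GB each.
OFFICE_IMAGES_GB = {
    "onlyoffice/documentserver-de:9.2": 3.94,
    "onlyoffice/documentserver-de:9.3.1.2": 3.94,
    "collabora/code:25.04.9.1.1": 1.0,
    "collabora/code:25.04.9.4.1": 1.0,
}

PREV_TAGS = ["onlyoffice/documentserver-de:9.2", "collabora/code:25.04.9.1.1"]


def footprint_gb(tags) -> float:
    """Sum the on-disk size of the given image tags."""
    return sum(OFFICE_IMAGES_GB[tag] for tag in tags)


single_version = footprint_gb(PREV_TAGS)        # install/backup/restore/custom tiers
dual_version = footprint_gb(OFFICE_IMAGES_GB)   # upgrade tier: prev and new coexist

if __name__ == "__main__":
    print(f"single-version office images: {single_version:.2f} GB")
    print(f"dual-version office images:   {dual_version:.2f} GB")
```

The dual-version sum is a lower bound on what the upgrade tier needs, since it counts only the office images and leaves out the other service images, volumes, and anything else on the host at the time.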