decisions/deferred(2): lasuite-drive upgrade tier = disk env-blocker (28GB host, dual multi-GB office image crossover); maximal subset in flight; operator disk-resize escalation; adversary heads-up

This commit is contained in:
2026-05-29 05:51:31 +01:00
parent 2c245c83c7
commit b78d708c49
4 changed files with 102 additions and 0 deletions

View File

@ -0,0 +1,24 @@
# ADVERSARY-INBOX — Builder → Adversary (non-gate heads-up) @2026-05-29
**Phase 2 RESUMED** after the 2w detour. No gate claimed yet — this is a heads-up + an env-blocker
that will need your sign-off when I claim Q3.2.
1. **Foundation re-confirmed post-2w** (FYI, no action): `tests/unit` = 72 passed on HEAD `7b5ed9c`;
`RECIPE=custom-html` full e2e all 5 tiers PASS, deploy-count=1, WC5 promoted canonical. Your
cross-phase break-it probe (review(2) `7b5ed9c`) verdict NO-regression is consistent with this.
2. **NEW env-level blocker for heavy recipes — lasuite-drive upgrade tier (DEFERRED.md 2026-05-29 +
DECISIONS.md Phase 2 entry).** The prev→PR-head upgrade crosses two multi-GB office image versions
at once (onlyoffice 9.2→9.3.1.2 @3.94GB + collabora 25.04.9.1.1→25.04.9.4.1); ~10GB transient vs
~14GB docker headroom on the 28GB host → 99% disk → deploy fail. No harness fix (prev images are
*running* when new must be pulled). I escalated a disk-resize to the operator. install/backup/
restore/custom fit and pass. **When I claim Q3.2 it will cite the maximal testable subset green +
this upgrade tier as a genuine disk env-blocker (§7.1) needing your sign-off.** Repro if you want
to confirm: `RECIPE=lasuite-drive cc-ci-run runner/run_recipe_ci.py` and watch `df -h /` cross 95%
when the upgrade tier pulls onlyoffice 9.3.1.2. (Please don't leave it running to 100% — I had to
emergency-clean the host; runbook in DECISIONS.md.)
3. **My build clone is `/root/builder-clone`** (origin/main; secrets submodule skipped — not needed
for recipe tests). Your `/root/adv-verify` is untouched.
(Delete this file to mark consumed.)

View File

@ -693,3 +693,16 @@ green run SEEDS the canonical. `--quick` never promotes (proven W2). Only cold a
Promote gate predicate (unit-tested): `is_enrolled(recipe) and overall==0 and not quick and not ref`.
(`not ref` = a catalogue-latest run, i.e. the nightly sweep or a manual `RECIPE=<r>` run — a PR
`!testme` carries REF=PR-head and must NOT advance the canonical to a PR's code.)
## Phase 2 — heavy-recipe upgrade tier disk constraint (28GB host) — SETTLED finding @2026-05-29
The upgrade tier (HC1: prev published → PR-head via in-place `abra app deploy --chaos`) cannot
complete for recipes whose successive releases bump multi-GB image tags, because the rolling update
must hold BOTH versions on disk transiently. Proven on lasuite-drive: onlyoffice 9.2 → 9.3.1.2
(3.94GB each) + collabora two versions → ~10GB office images at once vs ~14GB docker headroom on the
28GB host → 99% → deploy fail. **No harness fix is possible** (the prev images are running, so they
are neither dangling-prunable nor `rmi`-able when the new must be pulled). install/backup/restore/
custom (single version) fit and pass. Resolution = grow the host disk (Class A1 operator input,
DEFERRED.md 2026-05-29). Until then, heavy recipes are verified via their maximal testable subset
(install+backup+restore+custom) with the upgrade tier flagged as a genuine env-level (disk) blocker
per plan §7.1 (Adversary sign-off required). The cleanup runbook for an over-full host: `pkill -f
run_recipe_ci.py`; `docker stack rm <leftover>`; remove its volumes+secrets; `docker image prune -f`.

View File

@ -156,6 +156,28 @@ before the build is called done) — but does **not** force closure.
pluggable, not just claimed).
- **Linked IDEA:** —
### 2026-05-29 — heavy-recipe upgrade tier needs more host disk (28GB too small)
- [ ] **What:** The upgrade tier for the heaviest recipes cannot complete on the 28GB host. Proven
on **lasuite-drive**: the prev→PR-head chaos upgrade crosses two multi-GB office image versions
at once — onlyoffice/documentserver-de `9.2 → 9.3.1.2` (3.94GB each) + collabora/code
`25.04.9.1.1 → 25.04.9.4.1` (~1GB) — so ~10GB of office images must coexist on disk during the
in-place rolling update. The host has only ~14GB docker headroom over its ~13GB baseline (nix
store ~9.6GB + infra images), so the PR-head pull hit 99% and the deploy failed. There is **no
harness mitigation** (the prev images are *running* when the new must be pulled — cannot `rmi` a
running image; nothing dangling to prune pre-upgrade). install/backup/restore/custom (single
version, ~6GB) all fit and pass — only the upgrade tier overflows. Almost certainly also blocks
the upgrade tier of other heavy recipes (lasuite-docs ships collabora; immich ships multi-GB ML
images; lasuite-meet).
- **Filed by:** Builder, phase 2 (Q3.2 lasuite-drive full-lifecycle attempt)
- **Reason for deferral:** Class A1 EXTERNAL infra input — host disk size. Not improvisable; not a
test-quality issue; the recipe legitimately bumps office image tags across releases.
- **Operator action to lift:** grow the cc-ci host disk (resize the droplet volume + online-grow the
filesystem) to give heavy-recipe upgrade tiers transient headroom — ~+20GB would comfortably
cover the dual-office-version crossover and the rest of the heavy set. Then re-run the full
lasuite-drive lifecycle (and re-confirm immich/lasuite-meet/lasuite-docs upgrade tiers).
- **Re-entry trigger:** operator disk resize, OR Phase-2b pull-through cache + image-GC policy work.
- **Linked IDEA:** `cc-ci-plan/IDEAS.md` (pull-through cache / Phase 2b).
---
## Closed deferrals

View File

@ -635,3 +635,46 @@ Q4.10 drone (specifics only), + deferral lift cryptpad create-pad (F2-9, must li
(log `/root/ccci-resume-lasuite-drive.log`) — lasuite-drive suite (health parity + real MinIO S3
upload/list/download round-trip + OIDC password-grant JWT-claims against dep keycloak) is fully
authored; driving it to its first verified-green full run (the Q3.2 acceptance evidence).
---
## 2026-05-29 — lasuite-drive full e2e: upgrade tier hits a DISK-SIZE env blocker (host health emergency handled)
Drove lasuite-drive (heaviest §5 recipe — BOTH office backends) toward its first verified-green full
run. install tier PASSED (generic test_serving + cc-ci test_serving_and_frontend; all 12 services
converged after collabora won its startup race — see below). backup tier PASSED. Then the **upgrade
tier FAILED** and disk hit **99% (522M free)**, risking a host wedge.
**Root cause (definitive, from the abra DEPLOY OVERVIEW in the log):** the prev→PR-head upgrade
crosses *two different multi-GB office image versions simultaneously*:
- onlyoffice/documentserver-de: 9.2 → **9.3.1.2** (3.94GB image)
- collabora/code: 25.04.9.1.1 → 25.04.9.4.1 (~1GB)
- (+ small drive-backend/frontend v0.12.0→v0.18.0, redis, nginx)
abra's in-place chaos rolling update must hold BOTH the running prev office images AND pull the new
ones before swapping — ~10GB of office images transiently. The 28GB host has only ~14GB docker
headroom over the ~13GB baseline (nix store ~9.6GB + infra images ~1.75GB), so the PR-head pull
overflowed. **No harness mitigation exists:** the prev images are *running* (not dangling) when the
new must be pulled, and you cannot `docker rmi` a running image; a pre-upgrade prune finds nothing
dangling. It is fundamentally a disk-SIZE constraint, driven by the recipe legitimately bumping office
image tags across releases. Not a test-quality issue and not weakenable.
**collabora startup race (separate, self-resolving):** collabora/code logs
`/usr/bin/coolmount: Operation not permitted` (CapAdd=[] + default seccomp blocks mount()), falls back
to slow file-COPYING into its jail; the healthcheck killed an early task (exit 137) but a later task
finished the copy and reached 1/1. So collabora converges, just flaps once or twice first. Not the
blocker; noting in case it recurs on slower disk.
**Emergency handled — host fully restored:** killed the run (`pkill -f run_recipe_ci.py`), removed the
orphaned `lasu-7ea5e3` stack + its volumes (minio, postgres) + 8 leftover secrets (the killed run's
teardown never ran), pruned dangling images. Disk recovered 99% → 37% (17GB free). Infra stacks
(traefik/drone/dashboard/bridge/backups/warm-keycloak) untouched and healthy throughout.
**Decision:** the upgrade tier for lasuite-drive (and very likely other heavy recipes: lasuite-docs
also ships collabora; immich ships multi-GB ML images; lasuite-meet) is a genuine **Class A1 env-level
disk blocker** — the clean fix is a larger host disk (operator). Filed in DEFERRED.md + DECISIONS.md +
BACKLOG-2; flagged to operator (PushNotification) and Adversary (inbox). Meanwhile banking the
**maximal testable subset** (install+backup+restore+custom — single version, fits disk) to prove
lasuite-drive's actual Q3.2 CONTENT works: parity health, the real MinIO S3 upload→list→download
round-trip, and the OIDC password-grant + JWT-claims flow against the dep keycloak. Per §7.1 the
maximal subset is implemented and only the genuinely-disk-blocked upgrade tier is outstanding —
pending Adversary sign-off on the env-blocker.