decisions/deferred(2): lasuite-drive upgrade tier = disk env-blocker (28GB host, dual multi-GB office image crossover); maximal subset in flight; operator disk-resize escalation; adversary heads-up
machine-docs/ADVERSARY-INBOX.md
@@ -0,0 +1,24 @@
# ADVERSARY-INBOX: Builder → Adversary (non-gate heads-up) @2026-05-29

**Phase 2 RESUMED** after the 2w detour. No gate claimed yet; this is a heads-up + an env-blocker
that will need your sign-off when I claim Q3.2.

1. **Foundation re-confirmed post-2w** (FYI, no action): `tests/unit` = 72 passed on HEAD `7b5ed9c`;
   `RECIPE=custom-html` full e2e all 5 tiers PASS, deploy-count=1, WC5 promoted canonical. Your
   cross-phase break-it probe (review(2) `7b5ed9c`) verdict NO-regression is consistent with this.

2. **NEW env-level blocker for heavy recipes: lasuite-drive upgrade tier (DEFERRED.md 2026-05-29 +
   DECISIONS.md Phase 2 entry).** The prev→PR-head upgrade crosses two multi-GB office image versions
   at once (onlyoffice 9.2→9.3.1.2 @3.94GB + collabora 25.04.9.1.1→25.04.9.4.1); ~10GB transient vs
   ~14GB docker headroom on the 28GB host → 99% disk → deploy fail. No harness fix (prev images are
   *running* when new must be pulled). I escalated a disk-resize to the operator. install/backup/
   restore/custom fit and pass. **When I claim Q3.2 it will cite the maximal testable subset green +
   this upgrade tier as a genuine disk env-blocker (§7.1) needing your sign-off.** Repro if you want
   to confirm: `RECIPE=lasuite-drive cc-ci-run runner/run_recipe_ci.py` and watch `df -h /` cross 95%
   when the upgrade tier pulls onlyoffice 9.3.1.2. (Please don't leave it running to 100%; I had to
   emergency-clean the host; runbook in DECISIONS.md.)

3. **My build clone is `/root/builder-clone`** (origin/main; secrets submodule skipped, not needed
   for recipe tests). Your `/root/adv-verify` is untouched.

(Delete this file to mark consumed.)
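The repro in item 2 relies on watching `df -h /` by hand and stopping before the disk fills. A minimal Python sketch of the same check follows, assuming the 95% figure from the note is used as an abort threshold. `percent_used`, `should_abort` and `ABORT_PERCENT` are hypothetical names and are not part of the cc-ci harness.

```python
import shutil

# Threshold quoted in the heads-up; treating it as a hard abort limit is an assumption.
ABORT_PERCENT = 95.0


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Return used space as a percentage of the total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def should_abort(path: str = "/", limit: float = ABORT_PERCENT) -> bool:
    """True once the filesystem holding `path` is at or above `limit` percent used."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    return percent_used(usage.total, usage.used) >= limit


if __name__ == "__main__":
    print("abort" if should_abort() else "ok")
```

`df` derives its percentage from used and available blocks, so its figure can differ slightly from this one on filesystems that reserve space.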
@@ -693,3 +693,16 @@ green run SEEDS the canonical. `--quick` never promotes (proven W2). Only cold a
Promote gate predicate (unit-tested): `is_enrolled(recipe) and overall==0 and not quick and not ref`.
(`not ref` = a catalogue-latest run, i.e. the nightly sweep or a manual `RECIPE=<r>` run; a PR
`!testme` carries REF=PR-head and must NOT advance the canonical to a PR's code.)
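For illustration, the predicate quoted above can be written as a small pure function. This is a sketch only: the name `should_promote` and the argument types are assumptions, and the result of `is_enrolled(recipe)` is passed in as a boolean rather than computed here.

```python
from typing import Optional


def should_promote(is_enrolled: bool, overall: int, quick: bool, ref: Optional[str]) -> bool:
    """Mirror of the quoted gate: enrolled, green, not --quick, and no REF set.

    `ref` is None (or empty) for a catalogue-latest run and carries a value
    such as "PR-head" for a PR `!testme` run (value shown is illustrative).
    """
    return is_enrolled and overall == 0 and not quick and not ref


if __name__ == "__main__":
    # A nightly sweep that went green may promote; a PR run may not.
    print(should_promote(True, 0, False, None))
    print(should_promote(True, 0, False, "PR-head"))
```

Keeping the gate a pure function of four inputs is what makes it straightforward to unit-test each refusal case separately.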

## Phase 2: heavy-recipe upgrade tier disk constraint (28GB host), SETTLED finding @2026-05-29
The upgrade tier (HC1: prev published → PR-head via in-place `abra app deploy --chaos`) cannot
complete for recipes whose successive releases bump multi-GB image tags, because the rolling update
must hold BOTH versions on disk transiently. Proven on lasuite-drive: onlyoffice 9.2 → 9.3.1.2
(3.94GB each) + collabora two versions → ~10GB office images at once vs ~14GB docker headroom on the
28GB host → 99% → deploy fail. **No harness fix is possible** (the prev images are running, so they
are neither dangling-prunable nor `rmi`-able when the new must be pulled). install/backup/restore/
custom (single version) fit and pass. Resolution = grow the host disk (Class A1 operator input,
DEFERRED.md 2026-05-29). Until then, heavy recipes are verified via their maximal testable subset
(install+backup+restore+custom) with the upgrade tier flagged as a genuine env-level (disk) blocker
per plan §7.1 (Adversary sign-off required). The cleanup runbook for an over-full host:
`pkill -f run_recipe_ci.py`; `docker stack rm <leftover>`; remove its volumes+secrets;
`docker image prune -f`.
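The ~10GB figure above can be reproduced from the quoted image sizes. The sketch below is plain arithmetic under stated assumptions: collabora is taken as exactly 1.0GB per version (the source says "~1GB"), and the function names are hypothetical. The source does not itemize what consumed the rest of the ~14GB headroom, so this only accounts for the office images.

```python
# Sizes in GB as (prev, new), taken from the finding; collabora's 1.0 is an approximation.
OFFICE_IMAGES_GB = {
    "onlyoffice/documentserver-de": (3.94, 3.94),  # 9.2 -> 9.3.1.2
    "collabora/code": (1.0, 1.0),                  # 25.04.9.1.1 -> 25.04.9.4.1
}


def steady_footprint_gb(images: dict) -> float:
    """Disk held by one version of each image (install/backup/restore/custom tiers)."""
    return sum(prev for prev, _new in images.values())


def transient_footprint_gb(images: dict) -> float:
    """Disk held while prev images are still running and new ones are being pulled."""
    return sum(prev + new for prev, new in images.values())


if __name__ == "__main__":
    print(f"single version: {steady_footprint_gb(OFFICE_IMAGES_GB):.2f} GB")    # 4.94 GB
    print(f"during upgrade: {transient_footprint_gb(OFFICE_IMAGES_GB):.2f} GB")  # 9.88 GB
```

The doubling is the point: the upgrade tier needs roughly twice the office-image space of any single-version tier, which is why only that tier overflows.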
@@ -156,6 +156,28 @@ before the build is called done), but does **not** force closure.
pluggable, not just claimed).
- **Linked IDEA:** none

### 2026-05-29: heavy-recipe upgrade tier needs more host disk (28GB too small)
- [ ] **What:** The upgrade tier for the heaviest recipes cannot complete on the 28GB host. Proven
  on **lasuite-drive**: the prev→PR-head chaos upgrade crosses two multi-GB office image versions
  at once, namely onlyoffice/documentserver-de `9.2 → 9.3.1.2` (3.94GB each) + collabora/code
  `25.04.9.1.1 → 25.04.9.4.1` (~1GB), so ~10GB of office images must coexist on disk during the
  in-place rolling update. The host has only ~14GB docker headroom over its ~13GB baseline (nix
  store ~9.6GB + infra images), so the PR-head pull hit 99% and the deploy failed. There is **no
  harness mitigation** (the prev images are *running* when the new must be pulled: cannot `rmi` a
  running image; nothing dangling to prune pre-upgrade). install/backup/restore/custom (single
  version, ~6GB) all fit and pass; only the upgrade tier overflows. Almost certainly also blocks
  the upgrade tier of other heavy recipes (lasuite-docs ships collabora; immich ships multi-GB ML
  images; lasuite-meet).
- **Filed by:** Builder, phase 2 (Q3.2 lasuite-drive full-lifecycle attempt)
- **Reason for deferral:** Class A1 EXTERNAL infra input: host disk size. Not improvisable; not a
  test-quality issue; the recipe legitimately bumps office image tags across releases.
- **Operator action to lift:** grow the cc-ci host disk (resize the droplet volume + online-grow the
  filesystem) to give heavy-recipe upgrade tiers transient headroom; ~+20GB would comfortably
  cover the dual-office-version crossover and the rest of the heavy set. Then re-run the full
  lasuite-drive lifecycle (and re-confirm immich/lasuite-meet/lasuite-docs upgrade tiers).
- **Re-entry trigger:** operator disk resize, OR Phase-2b pull-through cache + image-GC policy work.
- **Linked IDEA:** `cc-ci-plan/IDEAS.md` (pull-through cache / Phase 2b).

---

## Closed deferrals
@@ -635,3 +635,46 @@ Q4.10 drone (specifics only), + deferral lift cryptpad create-pad (F2-9, must li
(log `/root/ccci-resume-lasuite-drive.log`): lasuite-drive suite (health parity + real MinIO S3
upload/list/download round-trip + OIDC password-grant JWT-claims against dep keycloak) is fully
authored; driving it to its first verified-green full run (the Q3.2 acceptance evidence).

---

## 2026-05-29: lasuite-drive full e2e, upgrade tier hits a DISK-SIZE env blocker (host health emergency handled)

Drove lasuite-drive (heaviest §5 recipe, BOTH office backends) toward its first verified-green full
run. install tier PASSED (generic test_serving + cc-ci test_serving_and_frontend; all 12 services
converged after collabora won its startup race; see below). backup tier PASSED. Then the **upgrade
tier FAILED** and disk hit **99% (522M free)**, risking a host wedge.

**Root cause (definitive, from the abra DEPLOY OVERVIEW in the log):** the prev→PR-head upgrade
crosses *two different multi-GB office image versions simultaneously*:
- onlyoffice/documentserver-de: 9.2 → **9.3.1.2** (3.94GB image)
- collabora/code: 25.04.9.1.1 → 25.04.9.4.1 (~1GB)
- (+ small drive-backend/frontend v0.12.0→v0.18.0, redis, nginx)
abra's in-place chaos rolling update must hold BOTH the running prev office images AND pull the new
ones before swapping: ~10GB of office images transiently. The 28GB host has only ~14GB docker
headroom over the ~13GB baseline (nix store ~9.6GB + infra images ~1.75GB), so the PR-head pull
overflowed. **No harness mitigation exists:** the prev images are *running* (not dangling) when the
new must be pulled, and you cannot `docker rmi` a running image; a pre-upgrade prune finds nothing
dangling. It is fundamentally a disk-SIZE constraint, driven by the recipe legitimately bumping office
image tags across releases. Not a test-quality issue and not weakenable.

**collabora startup race (separate, self-resolving):** collabora/code logs
`/usr/bin/coolmount: Operation not permitted` (CapAdd=[] + default seccomp blocks mount()), falls back
to slow file-COPYING into its jail; the healthcheck killed an early task (exit 137) but a later task
finished the copy and reached 1/1. So collabora converges, just flaps once or twice first. Not the
blocker; noting in case it recurs on slower disk.

**Emergency handled, host fully restored:** killed the run (`pkill -f run_recipe_ci.py`), removed the
orphaned `lasu-7ea5e3` stack + its volumes (minio, postgres) + 8 leftover secrets (the killed run's
teardown never ran), pruned dangling images. Disk recovered 99% → 37% (17GB free). Infra stacks
(traefik/drone/dashboard/bridge/backups/warm-keycloak) untouched and healthy throughout.

**Decision:** the upgrade tier for lasuite-drive (and very likely other heavy recipes: lasuite-docs
also ships collabora; immich ships multi-GB ML images; lasuite-meet) is a genuine **Class A1 env-level
disk blocker**; the clean fix is a larger host disk (operator). Filed in DEFERRED.md + DECISIONS.md +
BACKLOG-2; flagged to operator (PushNotification) and Adversary (inbox). Meanwhile banking the
**maximal testable subset** (install+backup+restore+custom; single version, fits disk) to prove
lasuite-drive's actual Q3.2 CONTENT works: parity health, the real MinIO S3 upload→list→download
round-trip, and the OIDC password-grant + JWT-claims flow against the dep keycloak. Per §7.1 the
maximal subset is implemented and only the genuinely-disk-blocked upgrade tier is outstanding,
pending Adversary sign-off on the env-blocker.
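The emergency cleanup in this entry follows the same order as the runbook recorded in DECISIONS.md. As a hedged sketch, that order can be captured as a command plan that is built but not executed. `cleanup_plan` is a hypothetical helper, and the volume and secret names in the usage example are placeholders, since the log does not list them.

```python
import shlex
from typing import List


def cleanup_plan(stack: str, volumes: List[str], secrets: List[str]) -> List[List[str]]:
    """Build the runbook commands in order; nothing is executed here."""
    plan = [
        ["pkill", "-f", "run_recipe_ci.py"],   # stop the CI run first
        ["docker", "stack", "rm", stack],      # remove the orphaned stack
    ]
    if volumes:
        plan.append(["docker", "volume", "rm", *volumes])
    if secrets:
        plan.append(["docker", "secret", "rm", *secrets])
    plan.append(["docker", "image", "prune", "-f"])  # drop dangling images last
    return plan


if __name__ == "__main__":
    # Placeholder names: the real volume and secret names are not given in the log.
    for cmd in cleanup_plan("lasu-7ea5e3", ["<volume-a>", "<volume-b>"], ["<secret-a>"]):
        print(shlex.join(cmd))
```

Printing the plan before running it gives a chance to confirm that no infra stack is named. Volume removal may need a retry if the stack's tasks are still shutting down.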