blocked(2): Q3.2 drive base-deploy hits Docker Hub rate limit + Gitea outage
- recipe_meta: bump drive abra TIMEOUT 900->1500, DEPLOY_TIMEOUT 1200->1800 (12-svc stack w/ onlyoffice+collabora; cold pulls need a wide window).
- STATUS-2 ## Blocked: two Class-A1 external blocks documented w/ verify commands — (1) Docker Hub anon pull rate limit (registry-creds finding per plan §1.5; blocks all new deploys), (2) Gitea git.autonomic.zone 404 outage (coordination down; 2 watchdog pings unconsumable until recovery).
  JOURNAL-2: full disk->prune->rate-limit chain.
- Queued locally; push + Adversary-inbox processing deferred to Gitea recovery.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -491,3 +491,43 @@ Operator capacity-unblocked cc-ci (RAM 4→8GB, other VMs stopped). Resumed Phas
 
 Pausing for Adversary cold-verify of Q4.1+Q4.3 (and re-verify of Q3.1+Q3.4 if updated). Will
 resume on watchdog ping.
+
+## 2026-05-28 (later) — Q3.2 lasuite-drive base-deploy verify: disk → prune → Docker Hub rate limit; + Gitea outage
+
+Resumed loop to cold-verify the lasuite-drive base deploy (the f59d8e6 commit deferred OIDC/specific
+tests until the ~10-service base converges). Chain of events:
+
+1. **First install run timed out at abra TIMEOUT=900.** abra log root cause was NOT slowness but
+`FATAL: could not write init file: No space left on device` in postgres init — cc-ci `/` was at
+**89% (2.9 GB free)**. The ~2GB onlyoffice + ~1GB collabora pulls filled the disk; postgres
+couldn't initialise. Stack is actually **12 services** (app, backend, celery, celery-beat, db,
+redis, minio, minio-createbuckets[0/0 one-shot], mailcatcher, web/nginx, collabora, **onlyoffice**)
+— bigger than the recipe_meta header noted; it ships BOTH office backends by default.
+
+2. **Freed disk via `docker image prune -af`** → reclaimed 10.1 GB (30 dangling images from prior
+recipe runs); host went 2.9 GB → 14 GB free. Bumped abra TIMEOUT 900→1500, DEPLOY_TIMEOUT
+1200→1800 (recipe_meta.py edit; not yet committed — Gitea down, see below).
+
+3. **Second run progressed far** — db, collabora, onlyoffice, backend, celery, app all reached 1/1.
+But minio/redis/web/mailcatcher stuck at 0/1 in an instant Assigned→Rejected loop ("No such
+image"). Manual `docker pull minio/minio:...` returned **`toomanyrequests: You have reached your
+unauthenticated pull rate limit`**. The prune wiped these (previously-cached) small images, and
+the full cold re-pull of 12 images — on top of today's many recipe deploys (matrix-synapse,
+bluesky, ghost, uptime-kuma, keycloak, lasuite-docs, cryptpad retries) — exhausted Docker Hub's
+per-IP anonymous quota. Big images pulled first; the 4 small ones got starved.
+
+**Lesson:** pruning is double-edged on this host — it frees disk but forces re-pulls that burn the
+anonymous rate limit. The real fix is authenticated registry pulls (plan §1.5 "registry pull
+credentials") + trimming heavy stacks (lasuite-drive does not need BOTH collabora and onlyoffice
+for WOPI parity — one office backend suffices; disabling onlyoffice cuts the biggest image + RAM).
+
+4. **Gitea (git.autonomic.zone) is down** — bare host `/`, unauth `/api/v1/version`, and authed repo
+API all return plain-text `404 page not found` (Go default ServeMux 404 = backend down, proxy has
+no upstream). Same from both my sandbox and cc-ci (same IP 116.203.211.204), so it's a real
+instance outage, not my creds/path. Adversary's `/root/adv-verify` clone is stale at 1aaf3bd
+(clean, no inbox) → Adversary runs in its own sandbox; the only shared channel (Gitea) is dead.
+**Two watchdog pings arrived (REVIEW-2 update + BUILDER-INBOX.md) that I CANNOT consume** until
+Gitea recovers — will pull + act the instant it's back.
+
+Action: interrupted the stuck deploy (let abra TIMEOUT fire for clean teardown). Recording finding;
+notifying operator (registry creds per §1.5 + Gitea outage). Idle-retry both until recovery.
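Reviewer note: the disk-full failure in step 1 of the journal entry could be caught before a deploy starts. The following is a minimal sketch only; `has_pull_headroom`, `free_bytes_on`, and the 1 GB reserve are hypothetical names and values, not part of cc-ci, and the image sizes plugged in below are the rough estimates quoted in the journal.

```python
import shutil

GB = 10 ** 9  # decimal gigabyte, matching the units used in the journal entry


def has_pull_headroom(free_bytes: int, image_bytes: list[int], reserve_bytes: int = GB) -> bool:
    """Return True if the pending image pulls fit while leaving a reserve free.

    Hypothetical helper: unpacked layers usually take more room than the
    download size, so a True result is optimistic rather than a guarantee.
    """
    return free_bytes - sum(image_bytes) >= reserve_bytes


def free_bytes_on(path: str = "/") -> int:
    """Free space on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free
```

With the journal's numbers, `has_pull_headroom(2_900_000_000, [2 * GB, 1 * GB])` is false (2.9 GB free cannot absorb ~3 GB of pulls), while the same call with 14 GB free is true. A check like this would flag the condition without resorting to `docker image prune -af`, which the journal identifies as the trigger for the rate-limit exhaustion.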
@@ -245,8 +245,29 @@ ssh cc-ci 'cd /root/cc-ci && cc-ci-run -m pytest tests/unit -v && RECIPE=custom-
 ```
 
 ## Blocked
-(none) — bootstrap access re-verified @2026-05-28: `ssh cc-ci` ok (root, NixOS 24.11), Gitea API
-HTTP 200, wildcard DNS resolves to gateway 143.244.213.108.
+**@2026-05-28 ~19:45Z — two concurrent EXTERNAL (Class A1) infra blocks; operator notified.**
+
+1. **Docker Hub anonymous pull rate limit (registry creds finding, plan §1.5).** All docker.io
+pulls from cc-ci's IP now fail with `toomanyrequests: You have reached your unauthenticated pull
+rate limit`. Verify: `ssh cc-ci 'docker pull minio/minio:RELEASE.2025-09-07T16-13-09Z'` →
+rate-limit error. Traced to: today's many recipe deploys + a `docker image prune -af` (run to
+clear a disk-full that broke the lasuite-drive deploy) forcing a full cold re-pull. This blocks
+**every** new recipe deploy (all pull from docker.io). Per §1.5 this is a finding → **request
+registry pull credentials** (authenticated/Team Docker Hub, or a pull-through cache). Recurs for
+all remaining Q3.5/Q4 enrollments. Self-resolves partially as the rolling 6h window ages out.
+
+2. **Gitea (git.autonomic.zone) outage.** Bare `/`, unauth `/api/v1/version`, and authed repo API
+all return plain `404 page not found` (Go ServeMux default → backend down). Same from my sandbox
+AND cc-ci (IP 116.203.211.204) — a real instance outage, not creds/path. Verify:
+`curl -s -o /dev/null -w '%{http_code}' https://git.autonomic.zone/api/v1/version` → 404.
+Blocks all push/pull → **coordination is down**: two watchdog pings (REVIEW-2 update +
+BUILDER-INBOX.md) are unconsumable until Gitea recovers. Local commits queued; will push + process
+the Adversary's messages the instant it's back.
+
+Local build work proceeds where it needs no new pulls / no push. Loop idle-retries both ~15-20m.
+
+**Prior bootstrap state (pre-outage):** access re-verified @2026-05-28: `ssh cc-ci` ok (root, NixOS
+24.11), Gitea API HTTP 200, wildcard DNS resolves to gateway 143.244.213.108.
 
 ## Carryover from Phase 1e (not blockers for Phase 2)
 - **F1e-2** [adversary] — concurrent same-recipe `abra recipe fetch` race in
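Reviewer note: the STATUS entry verifies the Docker Hub block by attempting a real pull, which itself counts against the quota. Docker Hub also reports quota state in `ratelimit-limit` and `ratelimit-remaining` response headers on manifest requests, with values shaped like `100;w=21600` (a count and a window in seconds). The sketch below only parses such a value; the function names are hypothetical, and fetching the headers (token plus a HEAD request against the registry) is deliberately left out.

```python
def parse_ratelimit(value: str) -> tuple[int, int]:
    """Parse a rate-limit header value such as '100;w=21600'.

    Returns (count, window_seconds). Raises ValueError on an unexpected shape.
    """
    count_part, _, window_part = value.partition(";")
    key, _, seconds = window_part.strip().partition("=")
    if key != "w":
        raise ValueError(f"unexpected rate-limit header: {value!r}")
    return int(count_part.strip()), int(seconds)


def pulls_exhausted(remaining_header: str) -> bool:
    """True when the remaining-pulls header reports no quota left."""
    remaining, _window = parse_ratelimit(remaining_header)
    return remaining <= 0
```

A window of 21600 seconds is 6 hours, which lines up with the "rolling 6h window" mentioned in the entry. Polling the headers instead of pulling would let the idle-retry loop detect recovery without consuming quota.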
@@ -9,7 +9,12 @@
 # (login is OIDC-gated, exercised by the SSO functional tests, not by the install health check).
 HEALTH_PATH = "/"
 HEALTH_OK = (200, 301, 302)
-DEPLOY_TIMEOUT = 1200
+# This is the heaviest stack in the Phase-2 set: 12 services incl. BOTH office backends
+# (collabora/code ~1GB + onlyoffice/documentserver ~2GB) plus impress front/backend, postgres,
+# minio, redis, nginx. Cold image pull + onlyoffice's multi-minute internal boot exceed the
+# default abra TIMEOUT (300s) and even 900s, so allow a wide window (abra TIMEOUT below stays
+# under DEPLOY_TIMEOUT so the Python subprocess never kills abra mid-wait).
+DEPLOY_TIMEOUT = 1800
 HTTP_TIMEOUT = 900
 
 # NOTE (Phase 2 Q3.2): the keycloak SSO dep + OIDC functional tests land in the SSO iteration once
@@ -31,7 +36,8 @@ def EXTRA_ENV(domain):
 "MINIO_DOMAIN": f"minio-{domain}",
 "COLLABORA_DOMAIN": f"collabora-{domain}",
 # abra's internal per-deploy convergence timeout (recipe TIMEOUT env, default 300s) is too
-# short for this ~10-service stack on a cold image cache (impress frontend/backend, minio,
-# postgres, redis, collabora ~1GB). Bump so abra waits long enough for convergence.
-"TIMEOUT": "900",
+# short for this 12-service stack on a cold image cache (impress frontend/backend, minio,
+# postgres, redis, collabora ~1GB, onlyoffice ~2GB). Bump so abra waits long enough for
+# convergence; kept under DEPLOY_TIMEOUT (1800) so Python never kills abra mid-wait.
+"TIMEOUT": "1500",
 }
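Reviewer note: both recipe_meta.py comments rely on the same invariant, that abra's `TIMEOUT` stays below `DEPLOY_TIMEOUT` so the Python subprocess never kills abra mid-wait. That invariant currently lives only in comments. A minimal sketch of a guard a unit test could call follows; `check_timeouts` is a hypothetical name, and the 300 s fallback is the abra default cited in the comments above.

```python
def check_timeouts(
    deploy_timeout: int,
    extra_env: dict[str, str],
    default_abra_timeout: int = 300,
) -> int:
    """Return the effective abra TIMEOUT, raising if it is not below deploy_timeout.

    Hypothetical guard: `extra_env` stands in for the dict returned by
    EXTRA_ENV(domain), where TIMEOUT is stored as a string.
    """
    abra_timeout = int(extra_env.get("TIMEOUT", default_abra_timeout))
    if abra_timeout >= deploy_timeout:
        raise ValueError(
            f"abra TIMEOUT ({abra_timeout}s) must be below DEPLOY_TIMEOUT ({deploy_timeout}s)"
        )
    return abra_timeout
```

With the values in this commit, `check_timeouts(1800, {"TIMEOUT": "1500"})` returns 1500. Bumping `TIMEOUT` to 1500 while leaving `DEPLOY_TIMEOUT` at the old 1200 would raise, which is the half-applied edit the comments warn against.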