status(2): Docker Hub rate-limit RESOLVED — declarative sops auth + swarm pulls authenticate (3 conditions); DECISIONS recorded

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 22:13:25 +01:00
parent 5e14963d51
commit 7a337f5d69
2 changed files with 66 additions and 20 deletions


@@ -277,29 +277,39 @@ ssh cc-ci 'cd /root/cc-ci && cc-ci-run -m pytest tests/unit -v && RECIPE=custom-
```
## Blocked
**@2026-05-28 ~21:10Z — ONE standing EXTERNAL (Class A1) block: Docker Hub pull rate limit.**
(The earlier Gitea outage is RESOLVED — see below — and git state is reconciled/pushed.)
**(none) — the Docker Hub rate-limit block is RESOLVED @2026-05-28 ~22:10Z. Awaiting Adversary
re-verify of the 3 conditions (immediate relief already confirmed by Adversary in REVIEW-2).**
**Docker Hub anonymous pull rate limit (registry-creds finding, plan §1.5).** docker.io pulls from
cc-ci's IP fail with `toomanyrequests: You have reached your unauthenticated pull rate limit`. Verify:
`ssh cc-ci 'docker pull redis:8.6.3'` → rate-limit error. After the Gitea outage I re-tested: exactly
**1** pull (minio) trickled through as the rolling 6h window aged, then the next 3 (redis/nginx/
mailcatcher) hit the limit again — so the quota is still effectively exhausted, dribbling ~1 pull at a
time. Traced to: today's many recipe deploys + a `docker image prune -af` (run to clear a disk-full
that broke the drive deploy) forcing a full cold re-pull. Blocks **every** new recipe deploy. Per §1.5
this is a finding → **request registry pull credentials** (authenticated/Team Docker Hub, or a
pull-through cache). Recurs for all remaining Q3.5/Q4 enrollments. Operator notified @~19:45Z.
**Docker Hub rate-limit fix — DONE (registry-creds finding, plan §1.5), all 3 conditions met.**
Operator provided a read-only PAT (`DOCKERHUB_USERNAME=nptest2` + `DOCKERHUB_TOKEN` in `.testenv`).
Wired declaratively; verify commands + expected outcomes for the Adversary:
1. **Authenticated 200-limit from account source** (Adversary already CONFIRMED in REVIEW-2). Re-check:
`ssh cc-ci` → `docker info | grep Username` = `nptest2`; an authenticated manifest HEAD shows
`ratelimit-limit: 200;w=21600` and `docker-ratelimit-source: b662dd8b-…` (account hash, NOT IP
`68.14.43.142`).
2. **Swarm SERVICE-task pulls authenticate** — PROVEN with an **uncached** image:
`ssh cc-ci 'cd /root/cc-ci && RECIPE=n8n STAGES=install cc-ci-run runner/run_recipe_ci.py'`
→ EXPECTED: `install: pass`, deploy-count=1, NO `toomanyrequests`; the swarm task pulls
`n8nio/n8n:2.20.6` to 1/1. During the run the **account** counter decrements (197→196 on image
resolution, then 196→195 on the agent's layer pull; source = account hash), so the agent pull is
billed to the account, not the anon IP. (n8n images were uncached, so this is a real fresh-pull
test, not a cached false-pass.)
Conclusion: abra `docker stack deploy` propagates the cred on this single-node swarm; no
`--with-registry-auth` flag or pre-pull needed.
3. **Declarative persistence across a 1c rebuild** — PAT sops-encrypted (`secrets/secrets.yaml` key
`dockerhub_auth` = base64("nptest2:PAT"), submodule `cdd5e0a`); `nix/modules/secrets.nix` adds
`sops.secrets.dockerhub_auth` + `sops.templates."docker-config.json"` → renders
`/root/.docker/config.json` (0600 root) at activation. Verify: after `nixos-rebuild switch`,
`ls -l /root/.docker/config.json` → symlink to `/run/secrets/rendered/docker-config.json`; the
activation log shows `adding rendered secret: docker-config.json`. Recorded in DECISIONS.md
("Docker Hub auth: declarative config.json via sops").
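A minimal Python sketch of two of the checks above, assuming the header values quoted in condition 1 and the `base64("user:token")` auth format from condition 3. The function names, the placeholder account hash, and the placeholder token are hypothetical. The `https://index.docker.io/v1/` key is the entry the Docker client conventionally uses for Docker Hub; whether the sops template renders exactly this shape is an assumption, not something confirmed here.

```python
import base64
import json


def parse_ratelimit(value: str) -> tuple[int, int]:
    """Split a header value such as '200;w=21600' into (count, window_seconds)."""
    count, _, window = value.partition(";w=")
    return int(count), int(window)


def is_account_billed(headers: dict[str, str], host_ip: str) -> bool:
    """True when the rate-limit source is present and is not the host's public IP."""
    source = headers.get("docker-ratelimit-source", "")
    return bool(source) and source != host_ip


def docker_config(username: str, token: str) -> str:
    """Build a config.json body holding the base64('user:token') auth entry."""
    auth = base64.b64encode(f"{username}:{token}".encode()).decode()
    # Assumption: Docker Hub credentials live under this conventional key.
    return json.dumps({"auths": {"https://index.docker.io/v1/": {"auth": auth}}}, indent=2)


if __name__ == "__main__":
    # Hypothetical sample headers; the real account hash is elided in the notes above.
    sample = {
        "ratelimit-limit": "200;w=21600",
        "docker-ratelimit-source": "hypothetical-account-hash",
    }
    limit, window = parse_ratelimit(sample["ratelimit-limit"])
    print(limit, window // 3600, is_account_billed(sample, "68.14.43.142"))  # 200 6 True
    print(docker_config("nptest2", "PLACEHOLDER_TOKEN"))
```

The same `parse_ratelimit` helper can be applied to a `ratelimit-remaining` value before and after a deploy to confirm that the account counter, rather than the IP counter, is the one decrementing.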
Impact on Q3.2 lasuite-drive: base deploy got 8/12 services up (incl. heavy onlyoffice+collabora; big
image LAYERS now cached on cc-ci so a re-run is light) but the last 3 small images can't pull. Will
re-run the moment pulls flow (creds or window reset). cc-ci is CLEAN (teardown verified: 0 stack, 0
residue volumes/secrets; 6.8 GB disk + 6.5 GB RAM free).
**Bonus unblocked:** Q3.2 lasuite-drive base deploy now CONVERGES (all 12 services incl.
onlyoffice+collabora) — `RECIPE=lasuite-drive STAGES=install` → `install: pass`, deploy-count=1
(landed in the commit before this one; the rate limit was the only blocker). Q3.2 specifics
(OIDC/WOPI/upload) are next.
**Gitea outage (RESOLVED @~21:08Z).** git.autonomic.zone returned blanket `404 page not found` for
~1.5h (backend down; same from my sandbox AND cc-ci). Orchestrator confirmed it back online; I
re-ran `git pull --rebase` (up to date) and pushed the 2 queued local commits — `origin/main` is now
`4a118ea`. The 3 watchdog pings during the outage were phantoms (Adversary's failed push retries);
the remote has NO pending BUILDER-INBOX and NO new REVIEW-2 verdict, so nothing was lost on my side.
**Earlier Gitea outage (RESOLVED @~21:08Z).** git.autonomic.zone returned blanket `404` for ~1.5h
(backend down; same from my sandbox AND cc-ci). Reconciled: pulled + pushed queued commits. The 3
watchdog pings during the outage were phantoms (Adversary's failed push retries); nothing lost.
**Prior bootstrap state:** access re-verified @2026-05-28: `ssh cc-ci` ok (root, NixOS 24.11), Gitea
API HTTP 200, wildcard DNS resolves to gateway 143.244.213.108.