status(2): Docker Hub rate-limit RESOLVED — declarative sops auth + swarm pulls authenticate (3 conditions); DECISIONS recorded
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -521,3 +521,39 @@ readiness wait still gates real liveness. Safe for all currently-green recipes (
all N/N with N>0; the `0/0` case did not previously occur). Buckets/migrations that the one-shot
performs are run on-demand in the recipe's `setup_custom_tests.sh` (post-deploy), not relied upon for
generic-install convergence (the SPA at `/` serves 200 without them).

## 2026-05-28 — Docker Hub auth: declarative config.json via sops (rate-limit fix) — SETTLED

**Context.** Heavy Phase-2 recipe deploys exhausted Docker Hub's anonymous pull rate limit
(100/6h per shared IP `68.14.43.142`) → `toomanyrequests` blocked all new deploys. Operator
provided a read-only Docker Hub PAT (Class A1 registry creds, plan §1.5): `DOCKERHUB_USERNAME=nptest2` +
`DOCKERHUB_TOKEN` in `/srv/cc-ci/.testenv`. Authenticated pulls = 200/6h **per-account**.

**Decision.** Wire it declaratively (survives a 1c NixOS rebuild), not just an imperative login:
- **Secret:** `secrets/secrets.yaml` (cc-ci-secrets submodule, commit `cdd5e0a`) gains key
  `dockerhub_auth` = `base64("nptest2:<PAT>")` — i.e. the exact `auth` field docker config.json
  wants, so the nix template is a pure render (no runtime base64). sops-encrypted to host+master age
  recipients (edited on cc-ci using its ssh-host-key→age identity via `nix shell nixpkgs#sops`;
  plaintext shredded; PAT never committed plaintext nor exposed in process args/logs).
- **Render:** `nix/modules/secrets.nix` adds `sops.secrets.dockerhub_auth` + a
  `sops.templates."docker-config.json"` that renders `/root/.docker/config.json` (0600, root) at
  activation. It becomes a symlink to `/run/secrets/rendered/docker-config.json`.
- **Why /root:** the drone exec runner runs pipelines as `User=root` (drone-runner.nix), and manual
  deploys ssh in as root — so `/root/.docker/config.json` covers both the `!testme` CI path and
  manual ops. Single config, single user.

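As a rough illustration of the Secret and Render bullets above, the following Python sketch shows how an `auth` value of the form base64("user:token") slots into a Docker `config.json`, plus a check that the file is the sops-rendered symlink. The function names and the `https://index.docker.io/v1/` registry key are assumptions for illustration (that key is what `docker login` conventionally writes for Docker Hub; the actual nix template is not reproduced here), and `dummy-token` is a placeholder, not the real PAT.

```python
import base64
import json
import os


def docker_auth_field(username: str, token: str) -> str:
    """Return base64("username:token"), the value Docker stores under "auth"."""
    return base64.b64encode(f"{username}:{token}".encode("utf-8")).decode("ascii")


def render_docker_config(auth: str, registry: str = "https://index.docker.io/v1/") -> str:
    """Render a config.json body from a precomputed auth value (no encoding here).

    The registry key is an assumption; the real template may use another key.
    """
    return json.dumps({"auths": {registry: {"auth": auth}}}, indent=2)


def is_rendered_symlink(
    path: str, target: str = "/run/secrets/rendered/docker-config.json"
) -> bool:
    """True if `path` is a symlink whose link target is exactly `target`."""
    return os.path.islink(path) and os.readlink(path) == target


if __name__ == "__main__":
    # Placeholder credentials only; never put a real PAT in process args or logs.
    auth = docker_auth_field("nptest2", "dummy-token")
    print(render_docker_config(auth))
```

Precomputing the `auth` value once, outside the template, is what lets the render step stay a plain string substitution.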
**Swarm-propagation question — RESOLVED empirically (no `--with-registry-auth` / pre-pull needed).**
The operator/Adversary flagged that a node `docker login` may NOT propagate to swarm SERVICE-task
pulls. Tested on cc-ci with the authenticated config.json in place:
- Account ratelimit baseline 197/200 (source = account hash `b662dd8b-…`, not the IP).
- Deployed **uncached** `n8nio/n8n:2.20.6` via abra (`RECIPE=n8n STAGES=install`). The swarm service
  task pulled it to `1/1 Running` with **no `toomanyrequests`**.
- Account counter dropped 197 → 196 (manager manifest resolution) → **195** (agent layer-manifest
  pull), source still the account hash. So abra's `docker stack deploy` propagates the cred to the
  swarm task pull on this single-node swarm — billed to the account, not the anon IP.
- Corroborating: the earlier lasuite-drive deploy resolved **12** images with no `toomanyrequests`
  while anon budget was ≤4 — impossible anonymously → manager resolution is authenticated too.

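The counter readings above come from Docker Hub's rate-limit response headers. A minimal sketch of interpreting them, assuming header values have the `count;w=seconds` shape (for example `200;w=21600`) and that an anonymous source is reported as a bare IP address while an authenticated one is an account id. The helper names are hypothetical, and the all-zero id in the demo is a placeholder, not the real account hash.

```python
import ipaddress
from typing import Tuple


def parse_ratelimit(value: str) -> Tuple[int, int]:
    """Split a header value like "200;w=21600" into (count, window_seconds).

    Raises ValueError if the value does not have the "count;w=seconds" shape.
    """
    count, _, window = value.partition(";w=")
    return int(count), int(window)


def is_anonymous_source(source: str) -> bool:
    """True if the rate-limit source parses as an IP address (anonymous pull)."""
    try:
        ipaddress.ip_address(source)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    limit, window = parse_ratelimit("200;w=21600")
    print(limit, window // 3600)  # limit and window length in hours
    print(is_anonymous_source("68.14.43.142"))
    # Placeholder account id, for illustration only.
    print(is_anonymous_source("00000000-0000-0000-0000-000000000000"))
```

Checking the source as well as the remaining count matters here: a healthy-looking counter keyed to the shared IP would mean the pull was still anonymous.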
So: declarative root `config.json` is sufficient end-to-end here; `--with-registry-auth` is not
required (abra/SDK attaches it). **Caveat (Phase 2b):** 200/6h may still be tight for a full ~18-recipe
sweep; the permanent structural fix is a registry pull-through cache authenticated with this same PAT.

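To make the caveat concrete, here is a back-of-the-envelope estimate. It assumes the cost observed for n8n (two counter decrements per uncached image: one for manifest resolution, one for the agent pull) holds for every image, which is an extrapolation from a single observation. The function and constant names are illustrative.

```python
PULLS_PER_UNCACHED_IMAGE = 2  # assumption: the 197 -> 196 -> 195 pattern generalizes
ACCOUNT_LIMIT = 200           # authenticated pulls per 6h window


def pulls_needed(uncached_images: int, per_image: int = PULLS_PER_UNCACHED_IMAGE) -> int:
    """Estimated counter decrements for deploying `uncached_images` fresh images."""
    return uncached_images * per_image


def max_uncached_images(remaining: int = ACCOUNT_LIMIT,
                        per_image: int = PULLS_PER_UNCACHED_IMAGE) -> int:
    """How many fresh images fit into `remaining` pulls."""
    return remaining // per_image


def fits_in_window(uncached_images: int, remaining: int = ACCOUNT_LIMIT) -> bool:
    """True if deploying that many fresh images stays within `remaining` pulls."""
    return pulls_needed(uncached_images) <= remaining


if __name__ == "__main__":
    print(pulls_needed(12))       # lasuite-drive's 12 images -> 24
    print(max_uncached_images())  # -> 100
```

Under that assumption a 6h window covers at most 100 uncached images, or about 5 per recipe across an 18-recipe sweep, while lasuite-drive alone resolves 12.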
@@ -277,29 +277,39 @@ ssh cc-ci 'cd /root/cc-ci && cc-ci-run -m pytest tests/unit -v && RECIPE=custom-
```

## Blocked
**@2026-05-28 ~21:10Z — ONE standing EXTERNAL (Class A1) block: Docker Hub pull rate limit.**
(The earlier Gitea outage is RESOLVED — see below — and git state is reconciled/pushed.)
**(none) — the Docker Hub rate-limit block is RESOLVED @2026-05-28 ~22:10Z. Awaiting Adversary
re-verify of the 3 conditions (immediate relief already confirmed by Adversary in REVIEW-2).**

**Docker Hub anonymous pull rate limit (registry-creds finding, plan §1.5).** docker.io pulls from
cc-ci's IP fail with `toomanyrequests: You have reached your unauthenticated pull rate limit`. Verify:
`ssh cc-ci 'docker pull redis:8.6.3'` → rate-limit error. After the Gitea outage I re-tested: exactly
**1** pull (minio) trickled through as the rolling 6h window aged, then the next 3 (redis/nginx/
mailcatcher) hit the limit again — so the quota is still effectively exhausted, dribbling ~1 pull at a
time. Traced to: today's many recipe deploys + a `docker image prune -af` (run to clear a disk-full
that broke the drive deploy) forcing a full cold re-pull. Blocks **every** new recipe deploy. Per §1.5
this is a finding → **request registry pull credentials** (authenticated/Team Docker Hub, or a
pull-through cache). Recurs for all remaining Q3.5/Q4 enrollments. Operator notified @~19:45Z.
**Docker Hub rate-limit fix — DONE (registry-creds finding, plan §1.5), all 3 conditions met.**
Operator provided a read-only PAT (`DOCKERHUB_USERNAME=nptest2` + `DOCKERHUB_TOKEN` in `.testenv`).
Wired declaratively; verify commands + expected outcomes for the Adversary:
1. **Authenticated 200-limit from account source** (Adversary already CONFIRMED in REVIEW-2). Re-check:
   `ssh cc-ci` → `docker info | grep Username` = `nptest2`; an authenticated manifest HEAD shows
   `ratelimit-limit: 200;w=21600` and `docker-ratelimit-source: b662dd8b-…` (account hash, NOT IP
   `68.14.43.142`).
2. **Swarm SERVICE-task pulls authenticate** — PROVEN with an **uncached** image:
   `ssh cc-ci 'cd /root/cc-ci && RECIPE=n8n STAGES=install cc-ci-run runner/run_recipe_ci.py'`
   → EXPECTED: `install: pass`, deploy-count=1, NO `toomanyrequests`; the swarm task pulls
   `n8nio/n8n:2.20.6` to 1/1. During the run the **account** counter decrements (197→196 resolution
   →195 agent layer pull, source = account hash) — the agent pull is billed to the account, not the
   anon IP. (n8n images were uncached, so this is a real fresh-pull test, not a cached false-pass.)
   Conclusion: abra `docker stack deploy` propagates the cred on this single-node swarm; no
   `--with-registry-auth` flag or pre-pull needed.
3. **Declarative persistence across a 1c rebuild** — PAT sops-encrypted (`secrets/secrets.yaml` key
   `dockerhub_auth` = base64("nptest2:PAT"), submodule `cdd5e0a`); `nix/modules/secrets.nix` adds
   `sops.secrets.dockerhub_auth` + `sops.templates."docker-config.json"` → renders
   `/root/.docker/config.json` (0600 root) at activation. Verify: after `nixos-rebuild switch`,
   `ls -l /root/.docker/config.json` → symlink to `/run/secrets/rendered/docker-config.json`; the
   activation log shows `adding rendered secret: docker-config.json`. Recorded in DECISIONS.md
   ("Docker Hub auth: declarative config.json via sops").

Impact on Q3.2 lasuite-drive: base deploy got 8/12 services up (incl. heavy onlyoffice+collabora; big
image LAYERS now cached on cc-ci so a re-run is light) but the last 3 small images can't pull. Will
re-run the moment pulls flow (creds or window reset). cc-ci is CLEAN (teardown verified: 0 stack, 0
residue volumes/secrets; 6.8 GB disk + 6.5 GB RAM free).
**Bonus unblocked:** Q3.2 lasuite-drive base deploy now CONVERGES (all 12 services incl.
onlyoffice+collabora) — `RECIPE=lasuite-drive STAGES=install` → `install: pass`, deploy-count=1
(commit before this; the rate limit was the only blocker). Q3.2 specifics (OIDC/WOPI/upload) are next.

**Gitea outage (RESOLVED @~21:08Z).** git.autonomic.zone returned blanket `404 page not found` for
~1.5h (backend down; same from my sandbox AND cc-ci). Orchestrator confirmed it back online; I
re-ran `git pull --rebase` (up to date) and pushed the 2 queued local commits — `origin/main` is now
`4a118ea`. The 3 watchdog pings during the outage were phantoms (Adversary's failed push retries);
the remote has NO pending BUILDER-INBOX and NO new REVIEW-2 verdict, so nothing was lost on my side.
**Earlier Gitea outage (RESOLVED @~21:08Z).** git.autonomic.zone returned blanket `404` for ~1.5h
(backend down; same from my sandbox AND cc-ci). Reconciled: pulled + pushed queued commits. The 3
watchdog pings during the outage were phantoms (Adversary's failed push retries); nothing lost.

**Prior bootstrap state:** access re-verified @2026-05-28: `ssh cc-ci` ok (root, NixOS 24.11), Gitea
API HTTP 200, wildcard DNS resolves to gateway 143.244.213.108.