diff --git a/machine-docs/DECISIONS.md b/machine-docs/DECISIONS.md
index 172133c..315aa89 100644
--- a/machine-docs/DECISIONS.md
+++ b/machine-docs/DECISIONS.md
@@ -521,3 +521,39 @@ readiness wait still gates real liveness. Safe for all currently-green recipes (
 all N/N with N>0; the `0/0` case did not previously occur). Buckets/migrations that the one-shot
 performs are run on-demand in the recipe's `setup_custom_tests.sh` (post-deploy), not relied upon
 for generic-install convergence (the SPA at `/` serves 200 without them).
+
+## 2026-05-28 — Docker Hub auth: declarative config.json via sops (rate-limit fix) — SETTLED
+
+**Context.** Heavy Phase-2 recipe deploys exhausted Docker Hub's anonymous pull rate limit
+(100/6h per shared IP `68.14.43.142`) → `toomanyrequests` blocked all new deploys. Operator
+provided a read-only Docker Hub PAT (Class A1 registry creds, plan §1.5): `DOCKERHUB_USERNAME=nptest2`
++ `DOCKERHUB_TOKEN` in `/srv/cc-ci/.testenv`. Authenticated pulls = 200/6h **per-account**.
+
+**Decision.** Wire it declaratively (survives a 1c NixOS rebuild), not just an imperative login:
+- **Secret:** `secrets/secrets.yaml` (cc-ci-secrets submodule, commit `cdd5e0a`) gains key
+  `dockerhub_auth` = `base64("nptest2:PAT")` — i.e. the exact `auth` field docker config.json
+  wants, so the nix template is a pure render (no runtime base64). sops-encrypted to host+master age
+  recipients (edited on cc-ci using its ssh-host-key→age identity via `nix shell nixpkgs#sops`;
+  plaintext shredded; PAT never committed plaintext nor exposed in process args/logs).
+- **Render:** `nix/modules/secrets.nix` adds `sops.secrets.dockerhub_auth` + a
+  `sops.templates."docker-config.json"` that renders `/root/.docker/config.json` (0600, root) at
+  activation. It becomes a symlink to `/run/secrets/rendered/docker-config.json`.
+- **Why /root:** the drone exec runner runs pipelines as `User=root` (drone-runner.nix), and manual
+  deploys ssh in as root — so `/root/.docker/config.json` covers both the `!testme` CI path and
+  manual ops. Single config, single user.
+
+**Swarm-propagation question — RESOLVED empirically (no `--with-registry-auth` / pre-pull needed).**
+The operator/Adversary flagged that a node `docker login` may NOT propagate to swarm SERVICE-task
+pulls. Tested on cc-ci with the authenticated config.json in place:
+- Account ratelimit baseline 197/200 (source = account hash `b662dd8b-…`, not the IP).
+- Deployed **uncached** `n8nio/n8n:2.20.6` via abra (`RECIPE=n8n STAGES=install`). The swarm service
+  task pulled it to `1/1 Running` with **no `toomanyrequests`**.
+- Account counter dropped 197 → 196 (manager manifest resolution) → **195** (agent layer-manifest
+  pull), source still the account hash. So abra's `docker stack deploy` propagates the cred to the
+  swarm task pull on this single-node swarm — billed to the account, not the anon IP.
+- Corroborating: the earlier lasuite-drive deploy resolved **12** images with no `toomanyrequests`
+  while anon budget was ≤4 — impossible anonymously → manager resolution is authenticated too.
+
+So: declarative root `config.json` is sufficient end-to-end here; `--with-registry-auth` is not
+required (abra/SDK attaches it). **Caveat (Phase 2b):** 200/6h may still be tight for a full ~18-recipe
+sweep; the permanent structural fix is a registry pull-through cache authenticated with this same PAT.
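Reviewer note on the **Secret** and **Render** bullets in the DECISIONS.md hunk: a minimal Python sketch of how the stored `auth` value relates to the rendered file. It assumes Docker Hub credentials are keyed by `https://index.docker.io/v1/`, the key the Docker CLI conventionally writes on login. The function names and the placeholder token are hypothetical; the real render is done by the sops template at activation, not by this code.

```python
import base64
import json


def docker_auth_field(username: str, token: str) -> str:
    """Return base64("username:token"), the form config.json stores under "auth"."""
    raw = f"{username}:{token}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def render_docker_config(auth_field: str) -> str:
    """Render a config.json body from an already-encoded auth value (pure render)."""
    # Assumption: Docker Hub credentials are keyed by this registry URL.
    config = {"auths": {"https://index.docker.io/v1/": {"auth": auth_field}}}
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # "example-token" is a placeholder, not a real PAT.
    auth = docker_auth_field("nptest2", "example-token")
    print(render_docker_config(auth))
```

Because the secret already holds the encoded value, the render step only substitutes a string, which matches the "no runtime base64" rationale in the decision.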
diff --git a/machine-docs/STATUS-2.md b/machine-docs/STATUS-2.md
index b5603da..c1b85d0 100644
--- a/machine-docs/STATUS-2.md
+++ b/machine-docs/STATUS-2.md
@@ -277,29 +277,39 @@ ssh cc-ci 'cd /root/cc-ci && cc-ci-run -m pytest tests/unit -v && RECIPE=custom-
 ```
 
 ## Blocked
-**@2026-05-28 ~21:10Z — ONE standing EXTERNAL (Class A1) block: Docker Hub pull rate limit.**
-(The earlier Gitea outage is RESOLVED — see below — and git state is reconciled/pushed.)
+**(none) — the Docker Hub rate-limit block is RESOLVED @2026-05-28 ~22:10Z. Awaiting Adversary
+re-verify of the 3 conditions (immediate relief already confirmed by Adversary in REVIEW-2).**
 
-**Docker Hub anonymous pull rate limit (registry-creds finding, plan §1.5).** docker.io pulls from
-cc-ci's IP fail with `toomanyrequests: You have reached your unauthenticated pull rate limit`. Verify:
-`ssh cc-ci 'docker pull redis:8.6.3'` → rate-limit error. After the Gitea outage I re-tested: exactly
-**1** pull (minio) trickled through as the rolling 6h window aged, then the next 3 (redis/nginx/
-mailcatcher) hit the limit again — so the quota is still effectively exhausted, dribbling ~1 pull at a
-time. Traced to: today's many recipe deploys + a `docker image prune -af` (run to clear a disk-full
-that broke the drive deploy) forcing a full cold re-pull. Blocks **every** new recipe deploy. Per §1.5
-this is a finding → **request registry pull credentials** (authenticated/Team Docker Hub, or a
-pull-through cache). Recurs for all remaining Q3.5/Q4 enrollments. Operator notified @~19:45Z.
+**Docker Hub rate-limit fix — DONE (registry-creds finding, plan §1.5), all 3 conditions met.**
+Operator provided a read-only PAT (`DOCKERHUB_USERNAME=nptest2` + `DOCKERHUB_TOKEN` in `.testenv`).
+Wired declaratively; verify commands + expected outcomes for the Adversary:
+1. **Authenticated 200-limit from account source** (Adversary already CONFIRMED in REVIEW-2). Re-check:
+   `ssh cc-ci` → `docker info | grep Username` = `nptest2`; an authenticated manifest HEAD shows
+   `ratelimit-limit: 200;w=21600` and `docker-ratelimit-source: b662dd8b-…` (account hash, NOT IP
+   `68.14.43.142`).
+2. **Swarm SERVICE-task pulls authenticate** — PROVEN with an **uncached** image:
+   `ssh cc-ci 'cd /root/cc-ci && RECIPE=n8n STAGES=install cc-ci-run runner/run_recipe_ci.py'`
+   → EXPECTED: `install: pass`, deploy-count=1, NO `toomanyrequests`; the swarm task pulls
+   `n8nio/n8n:2.20.6` to 1/1. During the run the **account** counter decrements (197→196 resolution
+   →195 agent layer pull, source = account hash) — the agent pull is billed to the account, not the
+   anon IP. (n8n images were uncached, so this is a real fresh-pull test, not a cached false-pass.)
+   Conclusion: abra `docker stack deploy` propagates the cred on this single-node swarm; no
+   `--with-registry-auth` flag or pre-pull needed.
+3. **Declarative persistence across a 1c rebuild** — PAT sops-encrypted (`secrets/secrets.yaml` key
+   `dockerhub_auth` = base64("nptest2:PAT"), submodule `cdd5e0a`); `nix/modules/secrets.nix` adds
+   `sops.secrets.dockerhub_auth` + `sops.templates."docker-config.json"` → renders
+   `/root/.docker/config.json` (0600 root) at activation. Verify: after `nixos-rebuild switch`,
+   `ls -l /root/.docker/config.json` → symlink to `/run/secrets/rendered/docker-config.json`; the
+   activation log shows `adding rendered secret: docker-config.json`. Recorded in DECISIONS.md
+   ("Docker Hub auth: declarative config.json via sops").
 
-Impact on Q3.2 lasuite-drive: base deploy got 8/12 services up (incl. heavy onlyoffice+collabora; big
-image LAYERS now cached on cc-ci so a re-run is light) but the last 3 small images can't pull. Will
-re-run the moment pulls flow (creds or window reset). cc-ci is CLEAN (teardown verified: 0 stack, 0
-residue volumes/secrets; 6.8 GB disk + 6.5 GB RAM free).
+**Bonus unblocked:** Q3.2 lasuite-drive base deploy now CONVERGES (all 12 services incl.
+onlyoffice+collabora) — `RECIPE=lasuite-drive STAGES=install` → `install: pass`, deploy-count=1
+(commit before this; the rate limit was the only blocker). Q3.2 specifics (OIDC/WOPI/upload) are next.
 
-**Gitea outage (RESOLVED @~21:08Z).** git.autonomic.zone returned blanket `404 page not found` for
-~1.5h (backend down; same from my sandbox AND cc-ci). Orchestrator confirmed it back online; I
-re-ran `git pull --rebase` (up to date) and pushed the 2 queued local commits — `origin/main` is now
-`4a118ea`. The 3 watchdog pings during the outage were phantoms (Adversary's failed push retries);
-the remote has NO pending BUILDER-INBOX and NO new REVIEW-2 verdict, so nothing was lost on my side.
+**Earlier Gitea outage (RESOLVED @~21:08Z).** git.autonomic.zone returned blanket `404` for ~1.5h
+(backend down; same from my sandbox AND cc-ci). Reconciled: pulled + pushed queued commits. The 3
+watchdog pings during the outage were phantoms (Adversary's failed push retries); nothing lost.
 
 **Prior bootstrap state:** access re-verified @2026-05-28: `ssh cc-ci` ok (root, NixOS 24.11),
 Gitea API HTTP 200, wildcard DNS resolves to gateway 143.244.213.108.
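Reviewer note on verify condition 1 in the STATUS-2.md hunk: a minimal Python sketch of how the two headers quoted there could be checked mechanically. It assumes the `ratelimit-limit` value always has the `count;w=seconds` shape shown in the hunk and that an anonymous pull reports an IP address as its source. The function and type names are hypothetical and not part of the repo.

```python
import ipaddress
from typing import NamedTuple


class RateLimit(NamedTuple):
    count: int
    window_seconds: int


def parse_ratelimit(value: str) -> RateLimit:
    """Parse a header value such as "200;w=21600" into (count, window_seconds)."""
    count_part, _, window_part = value.partition(";")
    key, _, seconds = window_part.strip().partition("=")
    if key != "w" or not seconds.isdigit():
        raise ValueError(f"unexpected rate-limit value: {value!r}")
    return RateLimit(int(count_part.strip()), int(seconds))


def source_is_ip(source: str) -> bool:
    """True when the rate-limit source is an IP address (assumed: anonymous pulls)."""
    try:
        ipaddress.ip_address(source.strip())
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    limit = parse_ratelimit("200;w=21600")
    # 21600 seconds is the 6h window quoted in the status notes.
    print(limit.count, limit.window_seconds // 3600)
    print(source_is_ip("68.14.43.142"))
```

A check along these lines would pass when the count is 200 and the `docker-ratelimit-source` value is not an IP, which is the "account hash, NOT IP" condition the Adversary is asked to re-verify.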