Live maintenance executed 2026-06-13T05:33–05:46Z:
- Removed 6 stacks from proxy (traefik, drone, bridge, dashboard, reports, warm-keycloak)
- Waited for proxy to drain, removed old 10.0.1.0/24 network
- nixos-rebuild switch with git+file:///?submodules=1 → swarm-init restarted
- proxy recreated: Subnet 10.10.0.0/16, gateway 10.10.0.1
- All 9 swarm services running 1/1
- ci.commoninternet.net → HTTP/2 200; drone → 303

Adversary: verify from host that proxy subnet is /16 and routes healthy. Full evidence in STATUS-pvfix.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
155 lines
6.4 KiB
Markdown
# JOURNAL — phase pvfix
## 2026-06-13T05:29Z — Bootstrap + M1 patch

### Context gathered

Read the phase plan + runbook. Key facts:
- Root cause confirmed: proxy is `10.0.1.0/24` (254 VIPs), Docker GC race leaks endpoints → pool exhaustion
- Fix: enlarge to `/16` (`--subnet 10.10.0.0/16`)
- The network can't be resized in place; requires remove + recreate

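To put numbers on the headroom gained by moving from `/24` to `/16`, here is a minimal sketch using Python's standard `ipaddress` module. The helper name `usable_hosts` is hypothetical, and the count only subtracts the network and broadcast addresses; Docker reserves a few more (such as the gateway), so treat the figures as upper bounds.

```python
import ipaddress


def usable_hosts(cidr: str) -> int:
    """Addresses in the subnet, minus the network and broadcast addresses."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2


old = usable_hosts("10.0.1.0/24")   # current proxy subnet
new = usable_hosts("10.10.0.0/16")  # planned proxy subnet
print(old, new, new // old)         # 254 65534 258
```

Roughly 258 times the address space, which is why a slow endpoint leak should no longer exhaust the pool in practice.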
### Live host survey

Subnets in use on the live host (collected via `docker network inspect`):
- `ingress`: `10.0.0.0/24`
- `proxy`: `10.0.1.0/24` (current; to be changed)
- `traefik internal`: `10.0.2.0/24`
- `warm-keycloak internal`: `10.0.3.0/24`
- `backups default`: `10.0.4.0/24`
- `bridge`: `172.17.0.0/16`; `docker_gwbridge`: `172.18.0.0/16`

`10.10.0.0/16` is clean: no conflicts. Host eth0: `91.98.47.73/32`, Tailscale: `100.95.31.88/32`.
No route entries for `10.10.x.x` in `ip route show`.

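The "no conflicts" claim can be re-checked mechanically. The sketch below is an illustration, not the check that was actually run: the `IN_USE` table and the `conflicts` helper are hypothetical names, the subnets are copied from the survey, and the two bridge entries are taken to be `172.17.0.0/16` and `172.18.0.0/16`.

```python
import ipaddress

# Subnets from the host survey (bridge/docker_gwbridge split is an assumption).
IN_USE = {
    "ingress": "10.0.0.0/24",
    "proxy (old)": "10.0.1.0/24",
    "traefik internal": "10.0.2.0/24",
    "warm-keycloak internal": "10.0.3.0/24",
    "backups default": "10.0.4.0/24",
    "bridge": "172.17.0.0/16",
    "docker_gwbridge": "172.18.0.0/16",
    "eth0": "91.98.47.73/32",
    "tailscale": "100.95.31.88/32",
}


def conflicts(candidate: str, in_use: dict[str, str]) -> list[str]:
    """Names of existing networks whose range overlaps the candidate subnet."""
    cand = ipaddress.ip_network(candidate)
    return [
        name
        for name, cidr in in_use.items()
        if cand.overlaps(ipaddress.ip_network(cidr))
    ]


print(conflicts("10.10.0.0/16", IN_USE))  # []
print(conflicts("10.0.0.0/16", IN_USE))   # the five 10.0.x.0/24 overlays
```

The second call shows why simply widening the existing range to `10.0.0.0/16` was not an option: it would swallow every per-stack overlay.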
### Services on proxy (will be disrupted during maintenance)

From `docker service ls` + per-service network inspection:
- `traefik_ci_commoninternet_net_app`: uses proxy
- `drone_ci_commoninternet_net_app`: uses proxy
- `ccci-bridge_app`: uses proxy
- `ccci-dashboard_app`: uses proxy
- `ccci-reports_app`: uses proxy
- `warm-keycloak_ci_commoninternet_net_app`: uses proxy

NOT on proxy: `backups_ci_commoninternet_net_app`, traefik socket-proxy, warm-keycloak DB.

### Deployment mechanism

- `swarm-init.service`: oneshot, creates proxy. Changes here → systemd restarts it on nixos-rebuild
- `deploy-proxy`, `deploy-drone`, `deploy-bridge`, `deploy-dashboard`, `deploy-reports`, `warm-keycloak`:
  RemainAfterExit oneshots; their definitions don't change so they WON'T auto-restart after nixos-rebuild.
  Must be manually `systemctl restart`-ed once their stacks have been removed and nixos-rebuild has run.

### Design choice: why 10.10.0.0/16

- Must be `/16` for ~65k VIP headroom
- Must not overlap `10.0.0.0/24` (ingress) or any of the `10.0.1-4.0/24` per-stack overlays
- The Docker default-addr-pool is `10.0.0.0/8`, so any `/16` in that range is fine as long as
  it doesn't overlap an existing allocation
- `10.10.0.0/16` is a clean `/16` outside the current allocation band: clear of `10.0.x.x`
  while still in Docker's pool. No host route conflicts.

### swarm.nix patch

Added `--subnet 10.10.0.0/16` to the `docker network create` call.
Also added a short comment explaining the motivation (required WHY per §7 comment policy for non-obvious constraint).

### Maintenance window state

Host state at time of claim:
- `docker stack ls` shows 7 stacks: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak
- NO active recipe CI runs (only warm stacks, no test app containers)
- Confirmed with `docker ps --format "{{.Names}}"`: only infra/warm containers

Host is quiet → suitable maintenance window. No active upgrade-all or !testme runs.

---

## 2026-06-13T05:33–05:46Z — Live maintenance execution

### Adversary M1 PASS received

Adversary confirmed patch correct and procedure safe. Non-blocking recommendation: add explicit
`systemctl restart swarm-init` after nixos-rebuild. Adopted.

### Pre-flight confirmed

- No active recipe test containers (`docker ps` shows none)
- All stacks infra-only (7 stacks: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak)

### Stack removal

```
docker stack rm traefik_ci_commoninternet_net drone_ci_commoninternet_net ccci-bridge ccci-dashboard ccci-reports warm-keycloak_ci_commoninternet_net
```
Output showed all services/configs/networks being removed. proxy drained in ~12s (4 polling attempts).

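The journal does not record the polling loop that was used to wait for the drain, so the following is only a sketch of one way to do it. Both function names are hypothetical, the 3-second delay is a guess based on "~12s over 4 attempts", and it assumes the `Containers` map in `docker network inspect` output is empty (or null) once nothing on this node is attached.

```python
import json
import subprocess
import time
from typing import Callable


def proxy_attached_count() -> int:
    """Count endpoints still attached to the `proxy` network on this node."""
    out = subprocess.run(
        ["docker", "network", "inspect", "proxy"],
        check=True, capture_output=True, text=True,
    ).stdout
    return len(json.loads(out)[0].get("Containers") or {})


def wait_until_drained(probe: Callable[[], int],
                       attempts: int = 10,
                       delay: float = 3.0) -> int:
    """Poll `probe` until it reports 0; return how many attempts it took."""
    for attempt in range(1, attempts + 1):
        if probe() == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay)  # give Docker time to tear down endpoints
    raise TimeoutError(f"network still in use after {attempts} attempts")
```

On the host this would be called as `wait_until_drained(proxy_attached_count)` before `docker network rm proxy`; passing the probe in as a parameter keeps the loop testable without Docker.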
### Proxy removal

```
docker network rm proxy
→ proxy
proxy removed
```

### builder-clone sync issue

`/root/cc-ci` didn't exist; needed `/root/builder-clone` instead. The builder-clone was at `e1c4198` (old).
`git pull --rebase` failed with untracked files: `tests/concurrency/test_run_state.py`.
Moved to `/root/test_run_state.py.bak`. Second pull succeeded, fast-forwarded to `b6e12ef`.

Then `git merge --ff-only origin/main` also failed (many stale untracked files from previous phases).
Moved all conflicting files to `/root/stash-pvfix/`. Successfully merged to `caef217` (latest main).
Confirmed `grep subnet /root/builder-clone/nix/modules/swarm.nix` → `--subnet 10.10.0.0/16`.

### nixos-rebuild

First attempt: `nixos-rebuild switch --flake /root/builder-clone#cc-ci` → FAILED
- Error: `path '/nix/store/.../secrets/secrets.yaml' does not exist`
- Root cause: flake default doesn't include git submodule content

Second attempt: `path:` scheme with `?submodules=1` → FAILED
- Error: `path URL has unsupported parameter 'submodules'`

Third attempt: `git+file:///root/builder-clone?submodules=1#cc-ci` → SUCCESS (exit 0)
- Output: `building the system configuration...` (used nix cache, fast)

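For future runs, the working flake reference can be assembled rather than retyped. This is a small illustrative helper (the name `git_file_flake_ref` is hypothetical); it only builds the string form that succeeded on the third attempt and does not invoke `nixos-rebuild`.

```python
from pathlib import PurePosixPath


def git_file_flake_ref(repo: str, attr: str, submodules: bool = True) -> str:
    """Build a git+file flake reference for a local checkout."""
    path = PurePosixPath(repo)
    if not path.is_absolute():
        raise ValueError("repo must be an absolute path")
    query = "?submodules=1" if submodules else ""
    return f"git+file://{path}{query}#{attr}"


print(git_file_flake_ref("/root/builder-clone", "cc-ci"))
# git+file:///root/builder-clone?submodules=1#cc-ci
```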
### swarm-init restart

Checked: the new unit script `/nix/store/apv1zvz658ddq0i8z0ivmc8f9sydxv7h-unit-script-swarm-init-start/bin/swarm-init-start`
contained `--subnet 10.10.0.0/16`. The service was still showing "active" from its old run (Jun 12).

Ran: `systemctl restart swarm-init`
- Active: active (exited) since 2026-06-13 05:38:17 UTC
- `docker network inspect proxy` → Subnet: 10.10.0.0/16 ✓

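The subnet check can also be scripted against the JSON that `docker network inspect` prints. The sketch below is an assumption-laden illustration: `network_subnets` is a hypothetical helper, and the sample document is a trimmed stand-in containing only the `IPAM.Config` fields relevant here, with the subnet and gateway values reported in this journal.

```python
import ipaddress
import json


def network_subnets(inspect_json: str) -> list[str]:
    """Extract the IPAM subnets from `docker network inspect <name>` output."""
    networks = json.loads(inspect_json)
    configs = networks[0]["IPAM"]["Config"] or []
    return [cfg["Subnet"] for cfg in configs if "Subnet" in cfg]


# Trimmed sample; real output carries many more fields.
sample = """
[{"Name": "proxy",
  "IPAM": {"Config": [{"Subnet": "10.10.0.0/16", "Gateway": "10.10.0.1"}]}}]
"""

subnets = network_subnets(sample)
print(subnets)  # ['10.10.0.0/16']
print(all(ipaddress.ip_network(s).prefixlen == 16 for s in subnets))  # True
```

On the host, the input would come from `docker network inspect proxy`, which gives the adversary a repeatable way to confirm the `/16`.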
### Deploy-proxy health gate deadlock

`systemctl restart deploy-proxy` started successfully. Traefik deployed.
But the health gate (`ci.commoninternet.net → 200`) failed because the dashboard was not yet deployed.
Reconciler logged: `[traefik] on latest 5.1.1+v3.6.15 but UNHEALTHY → redeploy`

Analysis: The `deploy-proxy` health_timeout=300s (5 min) gives enough time for the dashboard to be
deployed concurrently. The `After=` ordering in systemd means the dependent deploy services DON'T
start until deploy-proxy is "active", but since deploy-proxy was still "activating", systemd would have
waited indefinitely if we relied on the ordering chain.

Fix: started deploy-drone, deploy-bridge, deploy-dashboard, deploy-reports concurrently:
```
systemctl start deploy-drone deploy-bridge deploy-dashboard deploy-reports
```
Within ~20 seconds, `ci.commoninternet.net` returned 200. Deploy-proxy health gate passed.

### Final health state (2026-06-13T05:45Z)

```
docker stack ls → 7 stacks all present
docker service ls → all 9 services 1/1
docker network inspect proxy → Subnet: 10.10.0.0/16
ci.commoninternet.net → HTTP/2 200
drone.ci.commoninternet.net → HTTP/2 303
systemctl is-active deploy-proxy deploy-drone deploy-bridge deploy-dashboard deploy-reports warm-keycloak
→ active active active active active active
```

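The "all services 1/1" check is easy to automate for the next maintenance window. This is a sketch only: `under_replicated` is a hypothetical helper, and it assumes its input comes from `docker service ls --format '{{.Name}} {{.Replicas}}'`, one service per line. The sample uses service names from this journal with made-up replica counts to show a failing case.

```python
def under_replicated(service_ls_output: str) -> list[str]:
    """Return services whose running replica count differs from the desired count."""
    bad = []
    for line in service_ls_output.splitlines():
        if not line.strip():
            continue
        # Replicas may carry a suffix such as "(max 1 per node)"; keep the first two fields.
        name, replicas = line.split()[:2]
        running, desired = replicas.split("/")
        if int(running) != int(desired):
            bad.append(name)
    return bad


sample = """\
traefik_ci_commoninternet_net_app 1/1
ccci-dashboard_app 0/1
drone_ci_commoninternet_net_app 1/1
"""
print(under_replicated(sample))  # ['ccci-dashboard_app']
```

An empty list corresponds to the healthy state recorded above.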