cc-ci/machine-docs/JOURNAL-pvfix.md
autonomic-bot 71319d7096
claim(pvfix-M2): proxy recreated as /16 — all routes healthy
Live maintenance executed 2026-06-13T05:33–05:46Z:
- Removed 6 stacks from proxy (traefik, drone, bridge, dashboard, reports, warm-keycloak)
- Waited for proxy to drain, removed old 10.0.1.0/24 network
- nixos-rebuild switch with git+file:///?submodules=1 → swarm-init restarted
- proxy recreated: Subnet 10.10.0.0/16, gateway 10.10.0.1
- All 9 swarm services running 1/1
- ci.commoninternet.net → HTTP/2 200; drone → 303

Adversary: verify from host that proxy subnet is /16 and routes healthy.
Full evidence in STATUS-pvfix.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 05:47:04 +00:00


# JOURNAL — phase pvfix
## 2026-06-13T05:29Z — Bootstrap + M1 patch
### Context gathered
Read the phase plan + runbook. Key facts:
- Root cause confirmed: proxy is `10.0.1.0/24` (254 VIPs), Docker GC race leaks endpoints → pool exhaustion
- Fix: enlarge to `/16` (`--subnet 10.10.0.0/16`)
- The network can't be resized in place; requires remove + recreate
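The address arithmetic behind the `/24` → `/16` change can be checked with a small sketch. The helper name `usable_addresses` is illustrative only; it counts all addresses in the block minus the network and broadcast addresses, so it is an upper bound on what is actually available for VIPs.

```shell
# Usable addresses in an IPv4 subnet of the given prefix length:
# total addresses in the block minus the network and broadcast addresses.
usable_addresses() {
    prefix=$1
    echo $(( (1 << (32 - prefix)) - 2 ))
}

usable_addresses 24   # current proxy network -> 254
usable_addresses 16   # proposed replacement  -> 65534
```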
### Live host survey
Subnets in use on the live host (collected via `docker network inspect`):
- `ingress`: `10.0.0.0/24`
- `proxy`: `10.0.1.0/24` (current — to change)
- `traefik internal`: `10.0.2.0/24`
- `warm-keycloak internal`: `10.0.3.0/24`
- `backups default`: `10.0.4.0/24`
- `bridge` / `docker_gwbridge`: `172.17.0.0/16` / `172.18.0.0/16`
`10.10.0.0/16` is clean — no conflicts. Host eth0: `91.98.47.73/32`, Tailscale: `100.95.31.88/32`.
No route entries for `10.10.x.x` in `ip route show`.
### Services on proxy (will be disrupted during maintenance)
From `docker service ls` + per-service network inspection:
- `traefik_ci_commoninternet_net_app` — uses proxy
- `drone_ci_commoninternet_net_app` — uses proxy
- `ccci-bridge_app` — uses proxy
- `ccci-dashboard_app` — uses proxy
- `ccci-reports_app` — uses proxy
- `warm-keycloak_ci_commoninternet_net_app` — uses proxy
NOT on proxy: `backups_ci_commoninternet_net_app`, traefik socket-proxy, warm-keycloak DB.
### Deployment mechanism
- `swarm-init.service` — oneshot, creates proxy. Changes here → systemd restarts it on nixos-rebuild
- `deploy-proxy`, `deploy-drone`, `deploy-bridge`, `deploy-dashboard`, `deploy-reports`, `warm-keycloak`: RemainAfterExit oneshots.
  Their definitions don't change, so they WON'T auto-restart after nixos-rebuild.
  They must be manually `systemctl restart`-ed after nixos-rebuild removes their stacks.
### Design choice: why 10.10.0.0/16
- Must be `/16` for ~65k VIP headroom
- Must not overlap `10.0.0.0/24` (ingress) or any of the `10.0.1-4.0/24` per-stack overlays
- The Docker default-addr-pool is `10.0.0.0/8` — any `/16` in that range is fine as long as
it doesn't overlap an existing allocation
- `10.10.0.0/16` is the first clean `/16` outside the current allocation band — clear of `10.0.x.x`
while still in Docker's pool. No host route conflicts.
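The no-overlap requirement can be sketched as a plain-shell CIDR comparison. This is a minimal illustration, not the check that was actually run: the function names are hypothetical, and the two `172.x` entries assume the survey shorthand `172.17/18.0.0/16` means `172.17.0.0/16` and `172.18.0.0/16`.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print the first and last address (as integers) of a CIDR block.
cidr_range() {
    base=$(ip_to_int "${1%/*}")
    prefix=${1#*/}
    size=$(( 1 << (32 - prefix) ))
    start=$(( base / size * size ))   # align to the network boundary
    echo "$start $(( start + size - 1 ))"
}

# Succeed (exit 0) if the two CIDR blocks share at least one address.
cidrs_overlap() {
    set -- $(cidr_range "$1") $(cidr_range "$2")   # start1 end1 start2 end2
    [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

candidate=10.10.0.0/16
for existing in 10.0.0.0/24 10.0.1.0/24 10.0.2.0/24 10.0.3.0/24 10.0.4.0/24 \
                172.17.0.0/16 172.18.0.0/16; do
    if cidrs_overlap "$candidate" "$existing"; then
        echo "conflict: $candidate overlaps $existing"
    fi
done
```

With the subnets from the live host survey, the loop reports no conflicts.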
### swarm.nix patch
Added `--subnet 10.10.0.0/16` to the `docker network create` call.
Also added a short comment explaining the motivation (required WHY per §7 comment policy for non-obvious constraint).
### Maintenance window state
Host state at time of claim:
- `docker stack ls` shows 7 stacks: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak
- NO active recipe CI runs (only warm stacks, no test app containers)
- Confirmed with `docker ps --format "{{.Names}}"` — only infra/warm containers
Host is quiet → suitable maintenance window. No active upgrade-all or !testme runs.
---
## 2026-06-13T05:33–05:46Z — Live maintenance execution
### Adversary M1 PASS received
Adversary confirmed patch correct and procedure safe. Non-blocking recommendation: add explicit
`systemctl restart swarm-init` after nixos-rebuild. Adopted.
### Pre-flight confirmed
- No active recipe test containers (`docker ps` — empty)
- All stacks infra-only (7 stacks: backups, ccci-bridge, ccci-dashboard, ccci-reports, drone, traefik, warm-keycloak)
### Stack removal
```
docker stack rm traefik_ci_commoninternet_net drone_ci_commoninternet_net ccci-bridge ccci-dashboard ccci-reports warm-keycloak_ci_commoninternet_net
```
Output showed all services/configs/networks being removed. proxy drained in ~12s (4 polling attempts).
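The journal doesn't record the exact polling command, so the following is only a sketch of what such a drain loop could look like. `wait_until` and `proxy_drained` are hypothetical helpers, and the drain condition (no containers listed on the network, or the network no longer visible on this node) is an assumption; a swarm-managed load-balancer endpoint may also appear in that list.

```shell
# Re-run a command until it succeeds, sleeping between attempts.
# Usage: wait_until <max_attempts> <sleep_seconds> <command> [args...]
wait_until() {
    max_attempts=$1
    interval=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            echo "condition met after $attempt attempt(s)"
            return 0
        fi
        attempt=$(( attempt + 1 ))
        sleep "$interval"
    done
    echo "gave up after $max_attempts attempts" >&2
    return 1
}

# Assumed drain check: succeeds once `docker network inspect` lists no
# attached containers (an inspect error also yields empty output).
proxy_drained() {
    [ -z "$(docker network inspect proxy \
        --format '{{range .Containers}}{{.Name}} {{end}}' 2>/dev/null)" ]
}

# On the host (attempt count and interval are illustrative):
# wait_until 20 3 proxy_drained && docker network rm proxy
```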
### Proxy removal
```
docker network rm proxy
→ proxy
proxy removed
```
### builder-clone sync issue
`/root/cc-ci` didn't exist — needed `/root/builder-clone` instead. The builder-clone was at `e1c4198` (old).
`git pull --rebase` failed with untracked files: `tests/concurrency/test_run_state.py`.
Moved to `/root/test_run_state.py.bak`. Second pull succeeded, fast-forwarded to `b6e12ef`.
Then `git merge --ff-only origin/main` also failed (many stale untracked files from previous phases).
Moved all conflicting files to `/root/stash-pvfix/`. Successfully merged to `caef217` (latest main).
Confirmed `grep subnet /root/builder-clone/nix/modules/swarm.nix` → `--subnet 10.10.0.0/16`.
### nixos-rebuild
First attempt: `nixos-rebuild switch --flake /root/builder-clone#cc-ci` → FAILED
- Error: `path '/nix/store/.../secrets/secrets.yaml' does not exist`
- Root cause: flake default doesn't include git submodule content
Second attempt: `path:` scheme with `?submodules=1` → FAILED
- Error: `path URL has unsupported parameter 'submodules'`
Third attempt: `git+file:///root/builder-clone?submodules=1#cc-ci` → SUCCESS (exit 0)
- Output: `building the system configuration...` (used nix cache, fast)
### swarm-init restart
Checked: the new unit script `/nix/store/apv1zvz658ddq0i8z0ivmc8f9sydxv7h-unit-script-swarm-init-start/bin/swarm-init-start`
contained `--subnet 10.10.0.0/16`. The service was still showing "active" from its old run (Jun 12).
Ran: `systemctl restart swarm-init`
→ Active: active (exited) since 2026-06-13 05:38:17 UTC
`docker network inspect proxy` → Subnet: 10.10.0.0/16 ✓
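For re-verifying the subnet from the host, a check along these lines may be handy. It is a sketch only: `network_subnets` and `has_subnet` are hypothetical helper names, and the `--format` template assumes the usual `IPAM.Config` layout of `docker network inspect` output.

```shell
# Print the subnet(s) configured on a Docker network, space-separated.
network_subnets() {
    docker network inspect "$1" --format '{{range .IPAM.Config}}{{.Subnet}} {{end}}'
}

# Succeed if the expected CIDR (first argument) is among the remaining arguments.
has_subnet() {
    expected=$1
    shift
    for subnet in "$@"; do
        [ "$subnet" = "$expected" ] && return 0
    done
    return 1
}

# On the host:
# has_subnet 10.10.0.0/16 $(network_subnets proxy) && echo "proxy is /16"
```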
### Deploy-proxy health gate deadlock
`systemctl restart deploy-proxy` started successfully. Traefik deployed.
But the health gate (`ci.commoninternet.net → 200`) failed because the dashboard was not yet deployed.
Reconciler logged: `[traefik] on latest 5.1.1+v3.6.15 but UNHEALTHY → redeploy`
Analysis: `deploy-proxy` has health_timeout=300s (5 min), which is long enough for the dashboard to
come up if it is deployed concurrently. However, the `After=` ordering in systemd means the other
deploy services DON'T start until deploy-proxy is "active". Since deploy-proxy was still "activating"
(blocked on a health gate that itself needed the dashboard), relying on the ordering chain would have
left systemd waiting indefinitely.
Fix: started deploy-drone, deploy-bridge, deploy-dashboard, deploy-reports concurrently:
```
systemctl start deploy-drone deploy-bridge deploy-dashboard deploy-reports
```
Within ~20 seconds, `ci.commoninternet.net` returned 200. Deploy-proxy health gate passed.
### Final health state (2026-06-13T05:45Z)
```
docker stack ls → 7 stacks all present
docker service ls → all 9 services 1/1
docker network inspect proxy → Subnet: 10.10.0.0/16
ci.commoninternet.net → HTTP/2 200
drone.ci.commoninternet.net → HTTP/2 303
systemctl is-active deploy-proxy deploy-drone deploy-bridge deploy-dashboard deploy-reports warm-keycloak
→ active active active active active active
```