feat(skill): add Hetzner server recovery playbook

2026-06-01 13:48:23 +00:00
parent 8093a95184
commit 9574972f1d
1 changed files with 253 additions and 0 deletions
--- a/.claude/skills/hetzner-server-recovery/SKILL.md
+++ b/.claude/skills/hetzner-server-recovery/SKILL.md
@ -0,0 +1,253 @@
+---
+name: hetzner-server-recovery
+description: Recover a Hetzner-hosted server when SSH or Tailscale access is broken. Use when a Hetzner cloud server is unreachable, drops into emergency mode, or needs rescue-mode repair through the Hetzner API and web console. Invoke as /hetzner-server-recovery.
+---
+
+# hetzner-server-recovery
+
+Use this skill when a **Hetzner Cloud server** is no longer reachable over normal SSH/Tailscale and
+needs recovery through the Hetzner API, rescue mode, or the web console.
+
+This procedure was proven on the cc-ci server (server ID `134485294`) when it booted into emergency
+mode because `/dev/disk/by-label/ESP` was missing.
+
+## Preconditions
+
+- You have a valid `HCLOUD_TOKEN` in `.testenv`.
+- You know the Hetzner server ID.
+- You have the SSH private key whose public key is registered in Hetzner for rescue mode.
+- You can install temporary tooling locally if needed.
+
+## 1. First triage
+
+Check all three access paths before changing anything:
+
+```bash
+ssh -o BatchMode=yes -o ConnectTimeout=10 <host-alias> hostname
+ssh -o BatchMode=yes -o ConnectTimeout=10 root@<public-ip> hostname
+tailscale ping -c 3 <host-alias>
+```
+
+Interpretation:
+
+- A Tailscale timeout (`tailscale ping` or SSH via the alias) plus `Connection refused` on the public
+  IP often means the box booted but `sshd` or one of its boot dependencies failed.
+- `Host key verification failed` on the public IP after rescue/power-cycle is normal; use a temporary
+  `UserKnownHostsFile`.
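+
+The two rules above can be written down as a small lookup, which helps when the triage output is
+collected by a script. This is a minimal sketch: the function name `interpret_triage` and its
+arguments are hypothetical, and it only matches the error strings quoted in this section.
+
+```python
+def interpret_triage(tailscale_ok: bool, public_ssh_stderr: str) -> str:
+    """Map triage output to the hints listed in this section (illustrative only)."""
+    err = public_ssh_stderr.lower()
+    if "host key verification failed" in err:
+        # Expected after rescue or a power cycle: the host key changed.
+        return "host key changed: retry with a temporary UserKnownHostsFile"
+    if not tailscale_ok and "connection refused" in err:
+        # The host rejects the TCP connection, so nothing is listening on port 22.
+        return "booted, but sshd or a boot dependency likely failed"
+    return "inconclusive: continue with the next step"
+```
+
+Anything that does not match falls through to `inconclusive`, so the sketch never claims more than
+the two rules state.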
+
+## 2. Try the least-destructive recovery first
+
+Request a plain reboot:
+
+```bash
+set -a && . /srv/cc-ci/.testenv && set +a
+curl -s -X POST \
+  -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  "https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/reboot"
+```
+
+Then poll:
+
+```bash
+ssh -o BatchMode=yes -o ConnectTimeout=10 <host-alias> hostname
+ssh -o BatchMode=yes -o ConnectTimeout=10 root@<public-ip> hostname
+tailscale ping -c 3 <host-alias>
+```
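+
+Polling by hand gets tedious while a server reboots, so the check can be wrapped in a retry loop.
+The sketch below is one possible shape, not part of the proven procedure: `ssh_reachable` and
+`wait_until` are hypothetical helpers, and the attempt count and delay are arbitrary defaults.
+
+```python
+import subprocess
+import time
+from typing import Callable
+
+
+def ssh_reachable(target: str, timeout_s: int = 10) -> bool:
+    """Return True if `ssh <target> hostname` succeeds non-interactively."""
+    cmd = ["ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout_s}", target, "hostname"]
+    try:
+        return subprocess.run(cmd, capture_output=True, timeout=timeout_s + 5).returncode == 0
+    except (subprocess.TimeoutExpired, FileNotFoundError):
+        return False
+
+
+def wait_until(check: Callable[[], bool], attempts: int = 18, delay_s: float = 10.0) -> bool:
+    """Call `check` up to `attempts` times, sleeping `delay_s` after each failure."""
+    for attempt in range(attempts):
+        if check():
+            return True
+        if attempt < attempts - 1:
+            time.sleep(delay_s)
+    return False
+```
+
+For example, `wait_until(lambda: ssh_reachable("<host-alias>"))` returns `False` once the attempts
+are used up, which is the signal to move on to the next step.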
+
+If the host still does not come back, continue.
+
+## 3. Request the Hetzner console
+
+Request a remote console session:
+
+```bash
+curl -s -X POST \
+  -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  "https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/request_console"
+```
+
+The API returns:
+
+- `wss_url`
+- `password`
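+
+If the response is handled in a script rather than read by eye, the two fields can be pulled out
+with the standard library. This is a sketch under assumptions: `action_request` and
+`console_credentials` are hypothetical helper names, and the code assumes `wss_url` and `password`
+sit at the top level of the JSON body.
+
+```python
+import json
+import urllib.request
+
+API = "https://api.hetzner.cloud/v1"
+
+
+def action_request(server_id: int, action: str, token: str) -> urllib.request.Request:
+    """Build the same body-less POST that the curl command above sends."""
+    return urllib.request.Request(
+        f"{API}/servers/{server_id}/actions/{action}",
+        method="POST",
+        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
+    )
+
+
+def console_credentials(body: str) -> tuple[str, str]:
+    """Extract (wss_url, password) from a request_console response body."""
+    data = json.loads(body)
+    return data["wss_url"], data["password"]
+```
+
+Passing the request to `urllib.request.urlopen` and feeding the decoded body to
+`console_credentials` yields the values used as `<WSS_URL>` and `<PASSWORD>` in the next step.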
+
+If you have a browser, use the Hetzner console directly.
+
+If you only have shell access, you can still drive it locally because the console is **raw VNC over
+websocket**.
+
+## 4. Shell-only console access (websocket VNC bridge)
+
+Install temporary tools:
+
+```bash
+nix shell nixpkgs#websocat -c websocat --version
+python3 -m venv /tmp/opencode/hetzner-console-venv
+/tmp/opencode/hetzner-console-venv/bin/pip install --disable-pip-version-check pillow websocket-client vncdotool
+```
+
+Bridge the websocket console to a local VNC TCP port:
+
+```bash
+nohup nix shell nixpkgs#websocat -c \
+  websocat -b -E tcp-l:127.0.0.1:5905 '<WSS_URL>' \
+  >/tmp/opencode/hetzner-websockify.log 2>&1 &
+```
+
+Validate the RFB banner:
+
+```bash
+python3 - <<'PY'
+import socket
+s=socket.socket(); s.settimeout(5); s.connect(('127.0.0.1',5905))
+print(repr(s.recv(32)))
+PY
+```
+
+Expected:
+
+```text
+b'RFB 003.008\n'
+```
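+
+The banner is the 12-byte RFB ProtocolVersion message, so a script can check it strictly instead
+of comparing by eye. A minimal sketch follows; `parse_rfb_banner` is a hypothetical helper name.
+
+```python
+import re
+
+
+def parse_rfb_banner(banner: bytes) -> tuple[int, int]:
+    """Return (major, minor) from a 12-byte RFB ProtocolVersion message."""
+    match = re.fullmatch(rb"RFB (\d{3})\.(\d{3})\n", banner[:12])
+    if match is None:
+        # Anything else (for example an HTTP error) means the bridge is not speaking VNC.
+        raise ValueError(f"not an RFB banner: {banner!r}")
+    return int(match.group(1)), int(match.group(2))
+```
+
+A `ValueError` here usually points at the websocket bridge rather than the server itself, for
+example an expired `wss_url`.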
+
+Capture a screenshot from the console:
+
+```bash
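+/tmp/opencode/hetzner-console-venv/bin/python - <<'PY'
+from vncdotool import api
+
+# '::' selects a TCP port; a single ':' would be read as a VNC display number.
+client = api.connect('127.0.0.1::5905', password='<PASSWORD>')
+client.captureScreen('/tmp/opencode/console.png')
+client.disconnect()  # close the VNC session cleanly
+PY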
+```
+
+Read the screenshot with your local tooling to inspect the boot state.
+
+## 5. If rescue mode is needed
+
+Enable rescue mode using the registered SSH key ID:
+
+```bash
+curl -s -X POST \
+  -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  -d '{"type":"linux64","ssh_keys":[<SSH_KEY_ID>]}' \
+  "https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/enable_rescue"
+```
+
+Rescue mode only takes effect on the next boot. Reboot the server; if a normal reboot does not
+actually switch into rescue, do a **full power cycle**:
+
+```bash
+curl -s -X POST -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  "https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/poweroff"
+
+curl -s -X POST -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  "https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/poweron"
+```
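+
+Server actions are reported asynchronously: each POST returns an `action` object whose `status`
+moves from `running` to `success` or `error`. Waiting for `poweroff` to finish before sending
+`poweron` is a reasonable precaution. The sketch below assumes that response shape; `action_state`
+and `is_finished` are hypothetical helper names.
+
+```python
+import json
+
+
+def action_state(body: str) -> tuple[int, str]:
+    """Return (action id, status) from an action response body."""
+    action = json.loads(body)["action"]
+    return action["id"], action["status"]
+
+
+def is_finished(status: str) -> bool:
+    """True on success, False while still running; raises if the action failed."""
+    if status == "error":
+        raise RuntimeError("Hetzner action failed")
+    return status == "success"
+```
+
+The action ID can then be re-checked with `GET https://api.hetzner.cloud/v1/actions/<ACTION_ID>`
+until `is_finished` returns `True`.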
+
+Then connect with a temporary known-hosts file:
+
+```bash
+ssh -o BatchMode=yes -o ConnectTimeout=10 \
+  -o StrictHostKeyChecking=no \
+  -o UserKnownHostsFile=/tmp/opencode/known_hosts_rescue \
+  -i <private-key> root@<public-ip>
+```
+
+## 6. Mount the installed system
+
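+In rescue mode, list the disks first and mount the installed root filesystem. The example assumes
+it is `/dev/sda1`; use the device that `lsblk -f` actually reports: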
+
+```bash
+lsblk -f
+mkdir -p /mnt/recover
+mount /dev/sda1 /mnt/recover
+ls -l /mnt/recover/nix/var/nix/profiles
+readlink -f /mnt/recover/nix/var/nix/profiles/system
+```
+
+Useful follow-ups:
+
+```bash
+journalctl --directory=/mnt/recover/var/log/journal -p err..alert -n 120 --no-pager
+grep -Rn "by-label/ESP\|/boot" /mnt/recover/etc 2>/dev/null
+```
+
+## 7. Proven repair: restore the EFI label
+
+On the cc-ci host, the failure was:
+
+- boot dropped into emergency mode
+- console showed root locked / emergency prompt
+- journal showed:
+  - `Timed out waiting for device /dev/disk/by-label/ESP`
+  - `Dependency failed for /boot`
+  - `Dependency failed for Local File Systems`
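+
+The journal signature above can be picked out mechanically when the log is long. This is a sketch
+only: `boot_blockers` is a hypothetical helper, and it matches just the two message shapes listed
+above.
+
+```python
+import re
+
+BLOCKER_PATTERNS = (
+    re.compile(r"Timed out waiting for device (?P<what>\S+)"),
+    re.compile(r"Dependency failed for (?P<what>.+)"),
+)
+
+
+def boot_blockers(journal_text: str) -> list[str]:
+    """Return the device or unit named in each matching journal line, in order."""
+    found = []
+    for line in journal_text.splitlines():
+        for pattern in BLOCKER_PATTERNS:
+            match = pattern.search(line)
+            if match:
+                # Drop surrounding whitespace and the sentence-ending period.
+                found.append(match.group("what").strip().rstrip("."))
+                break
+    return found
+```
+
+Feeding it the output of the `journalctl --directory=...` command from step 6 gives a short list
+where the first entry is usually the root cause and the rest are its dependents.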
+
+Disk inspection showed the EFI partition existed as `/dev/sda15` with UUID `D978-69EE`, but it had no
+filesystem label.
+
+Repair:
+
+```bash
+blkid /dev/sda15
+fatlabel /dev/sda15 ESP
+blkid /dev/sda15
+sync
+```
+
+Expected after repair:
+
+```text
+LABEL="ESP"
+```
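+
+To confirm the label in a script, the `KEY="value"` pairs in the `blkid` line can be parsed rather
+than grepped, which avoids confusing `LABEL` with `PARTLABEL`. A minimal sketch; `blkid_fields`
+and `has_label` are hypothetical helper names.
+
+```python
+import re
+
+
+def blkid_fields(line: str) -> dict[str, str]:
+    """Parse KEY="value" pairs from one line of default `blkid` output."""
+    return dict(re.findall(r'(\w+)="([^"]*)"', line))
+
+
+def has_label(line: str, expected: str = "ESP") -> bool:
+    """True if the filesystem LABEL field equals `expected`."""
+    return blkid_fields(line).get("LABEL") == expected
+```
+
+Run it against the `blkid /dev/sda15` output before and after `fatlabel`; only the second call
+should return `True`.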
+
+## 8. Return to normal boot
+
+Disable rescue mode and reboot:
+
+```bash
+curl -s -X POST -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  "https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/disable_rescue"
+
+curl -s -X POST -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  "https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/reboot"
+```
+
+Then verify all three access paths again:
+
+```bash
+ssh -o BatchMode=yes -o ConnectTimeout=10 <host-alias> hostname
+ssh -o BatchMode=yes -o ConnectTimeout=10 \
+  -o StrictHostKeyChecking=no \
+  -o UserKnownHostsFile=/tmp/opencode/known_hosts_postfix \
+  root@<public-ip> hostname
+tailscale ping -c 3 <host-alias>
+```
+
+## 9. Post-recovery follow-up
+
+- Re-run the exact operation that caused the outage, but more cautiously: make sure the boot blocker
+  is understood first and keep the console or rescue path ready.
+- Compare the live boot blocker against the committed Nix config.
+- If the server runs helper sessions or agents, notify them that SSH is back and they can resume work.
+- Record the failure and the repair in `JOURNAL.md`.
+
+## cc-ci-specific notes
+
+- cc-ci Hetzner server ID: `134485294`
+- Public IP observed during this recovery: `91.98.47.73`
+- Tailscale alias: `cc-ci`
+- Registered rescue SSH key that matched local `/home/loops/.ssh/cc-ci-root-ed25519`:
+  `cc-ci-orchestrator-deploy` (SSH key ID `113082420`)
+
+## Hard guardrails
+
+- Prefer reboot, then rescue, before destructive rebuild.
+- Never force-rebuild a server disk unless recovery/rescue is exhausted.
+- Avoid editing data on mounted filesystems unless you understand the exact boot blocker.
+- For cloud-hosted NixOS, filesystem labels/UUID expectations can be just as critical as service config.