feat(skill): add Hetzner server recovery playbook

autonomic-bot
2026-06-01 13:48:23 +00:00
parent 8093a95184
commit 9574972f1d


@ -0,0 +1,253 @@
---
name: hetzner-server-recovery
description: Recover a Hetzner-hosted server when SSH or Tailscale access is broken. Use when a Hetzner cloud server is unreachable, drops into emergency mode, or needs rescue-mode repair through the Hetzner API and web console. Invoke as /hetzner-server-recovery.
---
# hetzner-server-recovery
Use this skill when a **Hetzner Cloud server** is no longer reachable over normal SSH/Tailscale and
needs recovery through the Hetzner API, rescue mode, or the web console.
This procedure was proven on the cc-ci server (server ID `134485294`) when it booted into emergency
mode because `/dev/disk/by-label/ESP` was missing.
## Preconditions
- You have a valid `HCLOUD_TOKEN` in `.testenv`.
- You know the Hetzner server ID.
- You have the SSH private key whose public key is registered with Hetzner for rescue mode.
- You can install temporary tooling locally if needed.
## 1. First triage
Check all three access paths before changing anything:
```bash
ssh -o BatchMode=yes -o ConnectTimeout=10 <host-alias> hostname
ssh -o BatchMode=yes -o ConnectTimeout=10 root@<public-ip> hostname
tailscale ping -c 3 <host-alias>
```
Interpretation:
- A timeout on the Tailscale path (`ssh <host-alias>` or `tailscale ping`) combined with `Connection refused`
on the public IP often means the box booted but `sshd` or one of its boot dependencies failed.
- `Host key verification failed` on the public IP after rescue/power-cycle is normal; use a temporary
`UserKnownHostsFile`.
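The interpretation rules above can be sketched as a small classifier. This is a minimal sketch: the function name `classify_triage` and the labels it prints are hypothetical, and it only matches the two error strings mentioned in this section.
```bash
# Hypothetical helper: map triage observations to a next step.
#   $1: "yes" if the Tailscale path answered, anything else otherwise
#   $2: stderr text captured from the public-IP ssh attempt
classify_triage() {
  local tailscale_ok="$1" public_err="$2"
  if [ "$tailscale_ok" = "yes" ]; then
    echo "tailscale-path-ok"
    return 0
  fi
  case "$public_err" in
    *"Connection refused"*)
      # Network stack is up but nothing listens on port 22.
      echo "booted-but-sshd-or-boot-dependency-failed" ;;
    *"Host key verification failed"*)
      # Expected after rescue or a power cycle.
      echo "host-key-changed-use-temporary-known-hosts" ;;
    *)
      echo "unreachable-continue-with-reboot-or-console" ;;
  esac
}

# Example (not executed here):
#   err="$(ssh -o BatchMode=yes -o ConnectTimeout=10 root@<public-ip> hostname 2>&1 >/dev/null)"
#   classify_triage no "$err"
```
The labels are only a convenience for logs or scripts; the decision about what to do next still belongs to the operator.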
## 2. Try the least-destructive recovery first
Request a plain reboot:
```bash
set -a && . /srv/cc-ci/.testenv && set +a
curl -s -X POST \
-H "Authorization: Bearer ${HCLOUD_TOKEN}" \
-H "Content-Type: application/json" \
"https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/reboot"
```
Then poll:
```bash
ssh -o BatchMode=yes -o ConnectTimeout=10 <host-alias> hostname
ssh -o BatchMode=yes -o ConnectTimeout=10 root@<public-ip> hostname
tailscale ping -c 3 <host-alias>
```
If the host still does not come back, continue.
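Polling by hand is error-prone while a server is rebooting, so the check can be wrapped in a bounded retry loop. This is a minimal sketch: `wait_for_ssh` is a hypothetical helper, and the defaults of 30 attempts with a 10 second pause are assumptions, not values from the original recovery.
```bash
# Hypothetical helper: retry the SSH probe until it succeeds or attempts run out.
#   $1: ssh target (host alias or root@<public-ip>)
#   $2: maximum attempts (assumed default: 30)
#   $3: seconds to sleep between attempts (assumed default: 10)
wait_for_ssh() {
  local target="$1" attempts="${2:-30}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    if ssh -o BatchMode=yes -o ConnectTimeout=10 "$target" hostname >/dev/null 2>&1; then
      echo "reachable after ${i} attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still unreachable after ${attempts} attempt(s)"
  return 1
}

# Example (not executed here):
#   wait_for_ssh <host-alias> || echo "continue with the console step"
```
A non-zero return is the signal to move on to the next section rather than to keep rebooting.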
## 3. Request the Hetzner console
Request a remote console session:
```bash
curl -s -X POST \
-H "Authorization: Bearer ${HCLOUD_TOKEN}" \
-H "Content-Type: application/json" \
"https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/request_console"
```
The API returns:
- `wss_url`
- `password`
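The two fields can be pulled out of the response without extra tooling. This is a crude sketch: `json_string_field` is a hypothetical helper that assumes `wss_url` and `password` are plain top-level string values, and it does not decode JSON escapes (for example `\u0026` for `&`). A real JSON parser such as `jq` is the safer choice where it is installed.
```bash
# Hypothetical helper: print the first string value for a given JSON key.
# Reads JSON on stdin; works for single-line and pretty-printed output.
# Limitation: no JSON unescaping, so inspect the result before using it.
json_string_field() {
  local key="$1"
  sed -n 's/.*"'"$key"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Example (not executed here), with the request_console response saved in $response:
#   WSS_URL="$(printf '%s\n' "$response" | json_string_field wss_url)"
#   PASSWORD="$(printf '%s\n' "$response" | json_string_field password)"
```
Keeping the response in a variable rather than a file avoids leaving the console password on disk.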
If you have a browser, use the Hetzner console directly.
If you only have shell access, you can still drive it locally because the console is **raw VNC over
websocket**.
## 4. Shell-only console access (websocket VNC bridge)
Install temporary tools:
```bash
nix shell nixpkgs#websocat -c websocat --version
python3 -m venv /tmp/opencode/hetzner-console-venv
/tmp/opencode/hetzner-console-venv/bin/pip install --disable-pip-version-check pillow websocket-client vncdotool
```
Bridge the websocket console to a local VNC TCP port:
```bash
nohup nix shell nixpkgs#websocat -c \
websocat -b -E tcp-l:127.0.0.1:5905 '<WSS_URL>' \
>/tmp/opencode/hetzner-websockify.log 2>&1 &
```
Validate the RFB banner:
```bash
python3 - <<'PY'
import socket
s=socket.socket(); s.settimeout(5); s.connect(('127.0.0.1',5905))
print(repr(s.recv(32)))
PY
```
Expected:
```text
b'RFB 003.008\n'
```
Capture a screenshot from the console:
```bash
/tmp/opencode/hetzner-console-venv/bin/python - <<'PY'
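from vncdotool import api
# The double colon selects a raw TCP port; a single colon would be read as a VNC display number.
client = api.connect('127.0.0.1::5905', password='<PASSWORD>')
client.captureScreen('/tmp/opencode/console.png')
# Close the VNC session cleanly once the screenshot has been written.
client.disconnect()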
PY
```
Read the screenshot with your local tooling to inspect the boot state.
## 5. If rescue mode is needed
Enable rescue mode using the registered SSH key ID:
```bash
curl -s -X POST \
-H "Authorization: Bearer ${HCLOUD_TOKEN}" \
-H "Content-Type: application/json" \
-d '{"type":"linux64","ssh_keys":[<SSH_KEY_ID>]}' \
"https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/enable_rescue"
```
If a normal reboot does not actually bring the server up in rescue mode, do a **full power cycle**. Note that `poweroff` cuts power without a clean shutdown:
```bash
curl -s -X POST -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
-H "Content-Type: application/json" \
"https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/poweroff"
curl -s -X POST -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
-H "Content-Type: application/json" \
"https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/poweron"
```
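The POST calls in this playbook differ only in the action name and an optional JSON body, so they can be wrapped in one function. This is a minimal sketch: `hetzner_action` is a hypothetical helper name, and it reuses only the `curl` flags and URL pattern already shown above.
```bash
# Hypothetical helper: POST a server action to the Hetzner Cloud API.
#   $1: server ID
#   $2: action name, e.g. reboot, poweroff, poweron, enable_rescue
#   $3: optional JSON body
hetzner_action() {
  local server_id="$1" action="$2" body="${3:-}"
  # Abort with a message if the token was never sourced from .testenv.
  : "${HCLOUD_TOKEN:?HCLOUD_TOKEN is not set}"
  # Rebuild the positional parameters as the curl argument list.
  set -- -s -X POST \
    -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
    -H "Content-Type: application/json"
  if [ -n "$body" ]; then
    set -- "$@" -d "$body"
  fi
  curl "$@" "https://api.hetzner.cloud/v1/servers/${server_id}/actions/${action}"
}

# Example (not executed here):
#   hetzner_action <SERVER_ID> enable_rescue '{"type":"linux64","ssh_keys":[<SSH_KEY_ID>]}'
#   hetzner_action <SERVER_ID> poweroff
#   hetzner_action <SERVER_ID> poweron
```
The helper does not wait for an action to finish; if `poweron` is rejected right after `poweroff`, retrying after a short pause is a reasonable first step.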
Then connect with a temporary known-hosts file:
```bash
ssh -o BatchMode=yes -o ConnectTimeout=10 \
-o StrictHostKeyChecking=no \
-o UserKnownHostsFile=/tmp/opencode/known_hosts_rescue \
-i <private-key> root@<public-ip>
```
## 6. Mount the installed system
In rescue mode:
```bash
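# List block devices with filesystem types and labels to confirm the layout.
lsblk -f
mkdir -p /mnt/recover
# /dev/sda1 held the root filesystem in the cc-ci recovery; check the lsblk output before mounting.
mount /dev/sda1 /mnt/recover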
ls -l /mnt/recover/nix/var/nix/profiles
readlink -f /mnt/recover/nix/var/nix/profiles/system
```
Useful follow-ups:
```bash
journalctl --directory=/mnt/recover/var/log/journal -p err..alert -n 120 --no-pager
grep -Rn "by-label/ESP\|/boot" /mnt/recover/etc 2>/dev/null
```
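The error-level journal output can still be long, so it helps to narrow it to the messages systemd logs when a mount's device never appears. This is a minimal sketch: `filter_boot_blockers` is a hypothetical helper, and its two patterns are taken from the cc-ci failure described in section 7, so other boot blockers will not match.
```bash
# Hypothetical helper: keep only journal lines that point at a missing device
# or a failed mount dependency. Reads journal text on stdin.
filter_boot_blockers() {
  # "|| true" keeps the exit status zero when nothing matches.
  grep -E 'Timed out waiting for device|Dependency failed for' || true
}

# Example (not executed here):
#   journalctl --directory=/mnt/recover/var/log/journal -p err..alert -n 120 --no-pager \
#     | filter_boot_blockers
```
An empty result only means these two patterns were absent; read the unfiltered journal before concluding anything else.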
## 7. Proven repair: restore the EFI label
On the cc-ci host, the failure was:
- boot dropped into emergency mode
- console showed root locked / emergency prompt
- journal showed:
- `Timed out waiting for device /dev/disk/by-label/ESP`
- `Dependency failed for /boot`
- `Dependency failed for Local File Systems`
Disk inspection showed the EFI partition existed as `/dev/sda15` with UUID `D978-69EE`, but it had no
filesystem label.
Repair:
```bash
blkid /dev/sda15
fatlabel /dev/sda15 ESP
blkid /dev/sda15
sync
```
Expected after repair:
```text
LABEL="ESP"
```
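The relabel step can be made idempotent so that re-running it on an already repaired disk changes nothing. This is a minimal sketch: `ensure_fat_label` is a hypothetical helper, and it assumes the target is a FAT filesystem that is safe to relabel, as `/dev/sda15` was on cc-ci.
```bash
# Hypothetical helper: set a FAT filesystem label only when it differs.
#   $1: block device, e.g. /dev/sda15
#   $2: wanted label, e.g. ESP
ensure_fat_label() {
  local dev="$1" want="$2" have
  # blkid exits non-zero when the device has no LABEL tag; treat that as empty.
  have="$(blkid -s LABEL -o value "$dev" 2>/dev/null || true)"
  if [ "$have" = "$want" ]; then
    echo "label already ${want} on ${dev}"
    return 0
  fi
  fatlabel "$dev" "$want" || return 1
  sync
  echo "label set to ${want} on ${dev} (was: ${have:-none})"
}

# Example (not executed here):
#   ensure_fat_label /dev/sda15 ESP
```
Checking the current label first keeps the repair in line with the guardrail about not editing a disk before the boot blocker is understood.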
## 8. Return to normal boot
Disable rescue mode and reboot:
```bash
curl -s -X POST -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
-H "Content-Type: application/json" \
"https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/disable_rescue"
curl -s -X POST -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
-H "Content-Type: application/json" \
"https://api.hetzner.cloud/v1/servers/<SERVER_ID>/actions/reboot"
```
Then verify the Tailscale and public-IP paths again:
```bash
ssh -o BatchMode=yes -o ConnectTimeout=10 <host-alias> hostname
ssh -o BatchMode=yes -o ConnectTimeout=10 \
-o StrictHostKeyChecking=no \
-o UserKnownHostsFile=/tmp/opencode/known_hosts_postfix \
root@<public-ip> hostname
tailscale ping -c 3 <host-alias>
```
## 9. Post-recovery follow-up
- Re-run the exact operation that caused the outage, but more cautiously.
- Compare the live boot blocker against the committed Nix config.
- If the server runs helper sessions or agents, notify them that SSH is back and they can resume work.
- Record the failure and the repair in `JOURNAL.md`.
## cc-ci-specific notes
- cc-ci Hetzner server ID: `134485294`
- Public IP observed during this recovery: `91.98.47.73`
- Tailscale alias: `cc-ci`
- Registered rescue SSH key that matched local `/home/loops/.ssh/cc-ci-root-ed25519`:
`cc-ci-orchestrator-deploy` (SSH key ID `113082420`)
## Hard guardrails
- Prefer reboot, then rescue, before destructive rebuild.
- Never force-rebuild a server disk unless recovery/rescue is exhausted.
- Avoid editing data on mounted filesystems unless you understand the exact boot blocker.
- For cloud-hosted NixOS, filesystem labels/UUID expectations can be just as critical as service config.