From 9574972f1dd4988d81c69cee82cbf52a51c91653 Mon Sep 17 00:00:00 2001
From: autonomic-bot
Date: Mon, 1 Jun 2026 13:48:23 +0000
Subject: [PATCH] feat(skill): add Hetzner server recovery playbook

---
 .../skills/hetzner-server-recovery/SKILL.md | 253 ++++++++++++++++++
 1 file changed, 253 insertions(+)
 create mode 100644 .claude/skills/hetzner-server-recovery/SKILL.md

diff --git a/.claude/skills/hetzner-server-recovery/SKILL.md b/.claude/skills/hetzner-server-recovery/SKILL.md
new file mode 100644
index 0000000..526cd6c
--- /dev/null
+++ b/.claude/skills/hetzner-server-recovery/SKILL.md
@@ -0,0 +1,253 @@
+---
+name: hetzner-server-recovery
+description: Recover a Hetzner-hosted server when SSH or Tailscale access is broken. Use when a Hetzner cloud server is unreachable, drops into emergency mode, or needs rescue-mode repair through the Hetzner API and web console. Invoke as /hetzner-server-recovery.
+---
+
+# hetzner-server-recovery
+
+Use this skill when a **Hetzner Cloud server** is no longer reachable over normal SSH/Tailscale and
+needs recovery through the Hetzner API, rescue mode, or the web console.
+
+This procedure was proven on the cc-ci server (`server id 134485294`) when it booted into emergency
+mode because `/dev/disk/by-label/ESP` was missing.
+
+## Preconditions
+
+- You have a valid `HCLOUD_TOKEN` in `.testenv`.
+- You know the Hetzner server ID.
+- You have the SSH private key that is registered in Hetzner rescue mode.
+- You can install temporary tooling locally if needed.
+
+## 1. First triage
+
+Check all three access paths before changing anything:
+
+```bash
+ssh -o BatchMode=yes -o ConnectTimeout=10 hostname
+ssh -o BatchMode=yes -o ConnectTimeout=10 root@ hostname
+tailscale ping -c 3
+```
+
+Interpretation:
+
+- `tailscale ssh` timeout + public `Connection refused` often means the box booted but `sshd` or boot
+  dependencies failed.
+- `Host key verification failed` on the public IP after rescue/power-cycle is normal; use a temporary
+  `UserKnownHostsFile`.
+
+## 2. Try the least-destructive recovery first
+
+Request a plain reboot:
+
+```bash
+set -a && . /srv/cc-ci/.testenv && set +a
+curl -s -X POST \
+  -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  "https://api.hetzner.cloud/v1/servers//actions/reboot"
+```
+
+Then poll:
+
+```bash
+ssh -o BatchMode=yes -o ConnectTimeout=10 hostname
+ssh -o BatchMode=yes -o ConnectTimeout=10 root@ hostname
+tailscale ping -c 3
+```
+
+If the host still does not come back, continue.
+
+## 3. Request the Hetzner console
+
+Request a remote console session:
+
+```bash
+curl -s -X POST \
+  -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  "https://api.hetzner.cloud/v1/servers//actions/request_console"
+```
+
+The API returns:
+
+- `wss_url`
+- `password`
+
+If you have a browser, use the Hetzner console directly.
+
+If you only have shell access, you can still drive it locally because the console is **raw VNC over
+websocket**.
+
+## 4. Shell-only console access (websocket VNC bridge)
+
+Install temporary tools:
+
+```bash
+nix shell nixpkgs#websocat -c websocat --version
+python3 -m venv /tmp/opencode/hetzner-console-venv
+/tmp/opencode/hetzner-console-venv/bin/pip install --disable-pip-version-check pillow websocket-client vncdotool
+```
+
+Bridge the websocket console to a local VNC TCP port:
+
+```bash
+nohup nix shell nixpkgs#websocat -c \
+  websocat -b -E tcp-l:127.0.0.1:5905 '' \
+  >/tmp/opencode/hetzner-websockify.log 2>&1 &
+```
+
+Validate the RFB banner:
+
+```bash
+python3 - <<'PY'
+import socket
+s=socket.socket(); s.settimeout(5); s.connect(('127.0.0.1',5905))
+print(repr(s.recv(32)))
+PY
+```
+
+Expected:
+
+```text
+b'RFB 003.008\n'
+```
+
+Capture a screenshot from the console:
+
+```bash
+/tmp/opencode/hetzner-console-venv/bin/python - <<'PY'
+from vncdotool import api
+client = api.connect('127.0.0.1::5905', password='')
+client.captureScreen('/tmp/opencode/console.png')
+PY
+```
+
+Read the screenshot with your local tooling to inspect the boot state.
+
+## 5. If rescue mode is needed
+
+Enable rescue mode using the registered SSH key ID:
+
+```bash
+curl -s -X POST \
+  -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  -d '{"type":"linux64","ssh_keys":[]}' \
+  "https://api.hetzner.cloud/v1/servers//actions/enable_rescue"
+```
+
+Then do a **full power cycle** if a normal reboot does not actually switch into rescue:
+
+```bash
+curl -s -X POST -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  "https://api.hetzner.cloud/v1/servers//actions/poweroff"
+
+curl -s -X POST -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  "https://api.hetzner.cloud/v1/servers//actions/poweron"
+```
+
+Then connect with a temporary known-hosts file:
+
+```bash
+ssh -o BatchMode=yes -o ConnectTimeout=10 \
+  -o StrictHostKeyChecking=no \
+  -o UserKnownHostsFile=/tmp/opencode/known_hosts_rescue \
+  -i root@
+```
+
+## 6. Mount the installed system
+
+In rescue mode:
+
+```bash
+lsblk -f
+mkdir -p /mnt/recover
+mount /dev/sda1 /mnt/recover
+ls -l /mnt/recover/nix/var/nix/profiles
+readlink -f /mnt/recover/nix/var/nix/profiles/system
+```
+
+Useful follow-ups:
+
+```bash
+journalctl --directory=/mnt/recover/var/log/journal -p err..alert -n 120 --no-pager
+grep -R "by-label/ESP\|/boot\|/boot/efi" -n /mnt/recover/etc /mnt/recover/etc/nixos 2>/dev/null
+```
+
+## 7. Proven repair: restore the EFI label
+
+On the cc-ci host, the failure was:
+
+- boot dropped into emergency mode
+- console showed root locked / emergency prompt
+- journal showed:
+  - `Timed out waiting for device /dev/disk/by-label/ESP`
+  - `Dependency failed for /boot`
+  - `Dependency failed for Local File Systems`
+
+Disk inspection showed the EFI partition existed as `/dev/sda15` with UUID `D978-69EE`, but it had no
+filesystem label.
+
+Repair:
+
+```bash
+blkid /dev/sda15
+fatlabel /dev/sda15 ESP
+blkid /dev/sda15
+sync
+```
+
+Expected after repair:
+
+```text
+LABEL="ESP"
+```
+
+## 8. Return to normal boot
+
+Disable rescue mode and reboot:
+
+```bash
+curl -s -X POST -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  "https://api.hetzner.cloud/v1/servers//actions/disable_rescue"
+
+curl -s -X POST -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
+  -H "Content-Type: application/json" \
+  "https://api.hetzner.cloud/v1/servers//actions/reboot"
+```
+
+Then verify both paths again:
+
+```bash
+ssh -o BatchMode=yes -o ConnectTimeout=10 hostname
+ssh -o BatchMode=yes -o ConnectTimeout=10 \
+  -o StrictHostKeyChecking=no \
+  -o UserKnownHostsFile=/tmp/opencode/known_hosts_postfix \
+  root@ hostname
+tailscale ping -c 3
+```
+
+## 9. Post-recovery follow-up
+
+- Re-run the exact operation that caused the outage, but more cautiously.
+- Compare the live boot blocker against the committed Nix config.
+- If the server runs helper sessions or agents, notify them that SSH is back and they can resume work.
+- Record the failure and the repair in `JOURNAL.md`.
+
+## cc-ci-specific notes
+
+- cc-ci Hetzner server ID: `134485294`
+- Public IP observed during this recovery: `91.98.47.73`
+- Tailscale alias: `cc-ci`
+- Registered rescue SSH key that matched local `/home/loops/.ssh/cc-ci-root-ed25519`:
+  `cc-ci-orchestrator-deploy` (`SSH key id 113082420`)
+
+## Hard guardrails
+
+- Prefer reboot, then rescue, before destructive rebuild.
+- Never force-rebuild a server disk unless recovery/rescue is exhausted.
+- Avoid editing data on mounted filesystems unless you understand the exact boot blocker.
+- For cloud-hosted NixOS, filesystem labels/UUID expectations can be just as critical as service config.
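
The last guardrail, together with the `grep` for `by-label` references in section 6 and the `blkid` check in section 7, suggests a simple cross-check: which labels does the config expect, and which does the disk actually carry? The following is a minimal sketch of that comparison and not part of the committed skill. The function names are hypothetical, and it assumes labels appear in the config as plain `/dev/disk/by-label/<label>` paths (no systemd escaping, no dots in the label) and that disk state is supplied as text copied from `blkid`.

```python
import re


def referenced_labels(config_text):
    """Collect labels the config refers to as /dev/disk/by-label/<label>.

    Assumption: labels use only letters, digits, '_' and '-'.
    """
    return set(re.findall(r"/dev/disk/by-label/([A-Za-z0-9_-]+)", config_text))


def present_labels(blkid_output):
    """Collect filesystem labels from blkid output.

    Only LABEL="..." counts. PARTLABEL="..." is skipped by the lookbehind
    and LABEL_FATBOOT="..." never matches, so a partition name alone does
    not satisfy a by-label mount.
    """
    return set(re.findall(r'(?<![A-Z_])LABEL="([^"]*)"', blkid_output))


def missing_labels(config_text, blkid_output):
    """Return labels the config expects but no filesystem carries, sorted."""
    return sorted(referenced_labels(config_text) - present_labels(blkid_output))
```

In rescue mode, one way to use this would be to pass the contents of the mounted system's config files as `config_text` and the output of a bare `blkid` as `blkid_output`. A non-empty result names the labels to investigate before rebooting out of rescue.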