journal(2): cc-ci VM offline mid discourse full5 — likely OOM on 7-GiB node; polling recovery

autonomic-bot
2026-05-31 01:43:55 +00:00
parent 3afd850eb0
commit 707752cd14

@@ -1495,3 +1495,17 @@ full5 fixes (the ones that actually address the timeout):
Cleaned full4's stray state (2 app.1 containers stuck "Removal In Progress" held the discourse_data
volume; cleared after the daemon finished removal; volume rm'd). Node verified clean before launch.
full5: `/root/ccci-discourse-full5.log`, PID 848184, REF 3758522, builder-clone @8dfd8ed.
---
## 2026-05-31T01:38Z — cc-ci VM went OFFLINE mid discourse full5 (likely OOM on 7-GiB node) (Builder)
At the 01:38 poll, `ssh cc-ci` timed out; `ping 100.90.116.4` 100% loss; `tailscale status` shows
`cc-nix-test 100.90.116.4 ... active; relay "nyc"; offline`. My orchestrator host + b1 (hypervisor)
are online — only the cc-ci VM dropped off. Last good state (01:33): discourse app attempt-2 in
"Populating database" (Rails migration), health=starting. Strong hypothesis: the 7-GiB node OOM'd /
thrashed under discourse's migration+asset-precompile (Rails/ember, memory-hungry) co-resident with
the CI infra (traefik/drone/dashboard/bridge/backups) AND a running warm-keycloak+db → tailscaled
starved → VM unresponsive. Tailnet membership intact (node exists, just offline) → recoverable, not a
class-A1 blocker yet. Polling for recovery; if it doesn't come back in ~15-20 min it needs an operator
reboot of the VM on b1 → STATUS Blocked. Root-cause implication regardless: discourse is too heavy for this
node co-resident with warm-keycloak — need to shed memory (stop warm-keycloak before discourse, and/or
mem-limit the discourse build) before re-running, else this recurs.
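
Sketch of the recovery poll described above, for reference only (not the orchestrator's actual code). Assumptions: the function names `ssh_reachable` and `poll_recovery`, the 60 s interval, and the 20 min deadline (upper end of the ~15-20 min window) are illustrative; the only check is the same `ssh cc-ci` probe that timed out at 01:38.

```python
import subprocess
import time
from typing import Callable


def ssh_reachable(host: str = "cc-ci", timeout_s: int = 10) -> bool:
    """Return True if a trivial ssh command on `host` succeeds in time."""
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes",
             "-o", f"ConnectTimeout={timeout_s}", host, "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,  # hard stop if ssh itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def poll_recovery(
    probe: Callable[[], bool],
    deadline_s: float = 20 * 60,   # assumed: upper end of the ~15-20 min window
    interval_s: float = 60,        # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `probe` until it succeeds or the deadline passes.

    Returns "recovered" if the probe succeeds, "blocked" once
    `deadline_s` has elapsed without a successful probe.
    """
    start = clock()
    while True:
        if probe():
            return "recovered"
        if clock() - start >= deadline_s:
            return "blocked"
        sleep(interval_s)


if __name__ == "__main__":
    outcome = poll_recovery(ssh_reachable)
    # "blocked" here corresponds to: operator reboot needed, STATUS Blocked.
    print(f"cc-ci: {outcome}")
```

`clock` and `sleep` are injectable only so the loop can be exercised without waiting in real time; a successful probe says nothing about the memory pressure, which still has to be reduced before re-running discourse.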