journal(2): cc-ci VM offline mid discourse full5 — likely OOM on 7-GiB node; polling recovery

2026-05-31 01:43:55 +00:00
parent 3afd850eb0
commit 707752cd14
1 changed files with 14 additions and 0 deletions
--- a/machine-docs/JOURNAL-2.md
+++ b/machine-docs/JOURNAL-2.md
@@ -1495,3 +1495,17 @@ full5 fixes (the ones that actually address the timeout):
 Cleaned full4's stray state (2 app.1 containers stuck "Removal In Progress" held the discourse_data
 volume; cleared after the daemon finished removal; volume rm'd). Node verified clean before launch.
 full5: `/root/ccci-discourse-full5.log`, PID 848184, REF 3758522, builder-clone @8dfd8ed.
+
+---
+## 2026-05-31T01:38Z — cc-ci VM went OFFLINE mid discourse full5 (likely OOM on 7-GiB node) (Builder)
+At the 01:38 poll, `ssh cc-ci` timed out; `ping 100.90.116.4` 100% loss; `tailscale status` shows
+`cc-nix-test  100.90.116.4 ... active; relay "nyc"; offline`. My orchestrator host + b1 (hypervisor)
+are online — only the cc-ci VM dropped off. Last good state (01:33): discourse app attempt-2 in
+"Populating database" (Rails migration), health=starting. Strong hypothesis: the 7-GiB node OOM'd /
+thrashed under discourse's migration+asset-precompile (Rails/ember, memory-hungry) co-resident with
+the CI infra (traefik/drone/dashboard/bridge/backups) AND a running warm-keycloak+db → tailscaled
+starved → VM unresponsive. Tailnet membership intact (node exists, just offline) → recoverable, not a
+class-A1 blocker yet. Polling for recovery; if it isn't back within ~15-20 min, it needs an operator
+reboot of the VM on b1 → STATUS Blocked. Whatever the recovery path, the likely root cause implies
+discourse is too heavy for this node while co-resident with warm-keycloak: shed memory (stop
+warm-keycloak before discourse, and/or mem-limit the discourse build) before re-running, else this recurs.
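
The recovery policy described in the entry above (keep polling, and escalate to STATUS Blocked if the VM is not back within roughly 15-20 minutes) could be sketched as follows. This is a minimal illustration, not the tooling actually used: the function names `ssh_probe` and `poll_for_recovery`, the 60-second interval, and the return strings are all assumptions introduced here; only the `cc-ci` host alias and the ~20-minute upper bound come from the journal.

```python
import subprocess
import time
from typing import Callable


def ssh_probe(host: str = "cc-ci", timeout_s: int = 10) -> bool:
    """Hypothetical probe: True if a trivial command runs over ssh in time."""
    try:
        result = subprocess.run(
            # BatchMode avoids hanging on a password prompt; ConnectTimeout
            # bounds the TCP connect phase.
            ["ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout_s}",
             host, "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,  # hard cap in case ssh itself stalls
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def poll_for_recovery(
    probe: Callable[[], bool],
    deadline_s: float = 20 * 60,   # upper end of the journal's ~15-20 min window
    interval_s: float = 60,        # assumed; the journal does not state an interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Probe until it succeeds or the deadline passes.

    Returns "recovered" on the first successful probe, or "blocked" once
    deadline_s has elapsed without one (the point at which an operator
    reboot would be requested).
    """
    start = clock()
    while True:
        if probe():
            return "recovered"
        if clock() - start >= deadline_s:
            return "blocked"
        sleep(interval_s)


# Example call (would really contact the host, so it is left commented out):
# status = poll_for_recovery(ssh_probe)
```

The clock and sleep functions are injectable so the escalation logic can be exercised without waiting or touching the network; a real run would simply pass `ssh_probe` and rely on the defaults.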