bonsai: graceful no-CUDA fallback and declare imagegen unit #1

Merged
stephen merged 4 commits from fix/imagegen-external-and-gpu-graceful into master 2026-06-09 20:15:21 -07:00

Two admin-UI fixes for hosts without an NVIDIA GPU.

Graceful no-CUDA fallback: with no CUDA GPU, the torch/gemlite import crashes bonsai-backend on load. The failure cascades (the frontend depends on the backend) and the whole app shows as degraded. This change probes torch.cuda before starting the real backend and, when CUDA is unavailable, serves a small stand-in that returns a clear "NVIDIA GPU required" page so the units stay healthy. The imagegen UI now passes the backend error text through, so the message is shown verbatim.

imagegen mislabeled External: its catalog entry had no systemd-service-names, so the admin UI fell back to labeling this on-box app as an External reverse proxy. Declaring bonsai-imagegen makes it report real run-state.

Verified on the live AMD box (no NVIDIA): the backend previously crash-looped with `No CUDA GPUs are available`.
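
The probe described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `cuda_usable` is hypothetical, and it assumes the probe runs inside the venv where torch is installed. It treats a missing or broken torch import the same as "no GPU".

```python
import importlib


def cuda_usable() -> bool:
    """Return True only if torch imports cleanly and reports a CUDA device.

    Hypothetical helper for illustration; the PR's actual probe may differ.
    """
    try:
        # Import lazily so a host without torch is reported as "no GPU"
        # instead of crashing the caller.
        torch = importlib.import_module("torch")
    except Exception:
        return False
    try:
        # torch.cuda.is_available() returns False on hosts with no usable
        # CUDA device or driver.
        return bool(torch.cuda.is_available())
    except Exception:
        return False
```

A start script could call this once and branch: start the real backend when it returns True, otherwise start the stand-in.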

Two admin-UI fixes for hosts without an NVIDIA GPU:

- backend: with no CUDA GPU the torch/gemlite import crashes
  bonsai-backend on load, which cascades (the frontend's `requires`
  dependency fails) and shows the whole app as "degraded". Probe
  torch.cuda before starting the real backend; if unavailable, serve a
  tiny stand-in that returns a clear "NVIDIA GPU required" page so the
  units stay healthy. The imagegen UI now passes the backend's error
  text through, so that message is shown verbatim instead of a 500.

- imagegen: its catalog entry had no systemd-service-names, so the
  HomeFree admin UI fell back to labeling this on-box app as an
  "External" reverse proxy. Declare bonsai-imagegen so it reports real
  run-state.

The Next.js studio (bonsai-frontend) hardcoded host port 3001, which
collides with other apps on the box (zwave-js-ui publishes host :3001),
so `next start` could never bind it (EADDRINUSE) and the unit
crash-looped, which also failed the rebuild's switch-to-configuration.

Use HomeFree's central port allocator (the "bonsai" catalog label, same
pattern as mosaic/cal-diy) so the studio gets a deconflicted port, and
bind it to 127.0.0.1 like the backend/imagegen siblings (it's reached
through Caddy). backendPort/imagegenPort stay fixed — they bind loopback
and backendPort is baked into the Next build.
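
The EADDRINUSE failure mode is easy to reproduce outside Next.js. The sketch below is only an illustration of the collision, not HomeFree's port allocator (which is not shown in this PR); `can_bind` is a hypothetical helper name.

```python
import errno
import socket


def can_bind(host: str, port: int) -> bool:
    """Try to bind a TCP socket; False means the address is already in use.

    Hypothetical helper for illustration only.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                # Another process already holds this address, which is what
                # happened to the studio on the hardcoded port 3001.
                return False
            raise  # any other error is unexpected, so surface it
        return True
```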

Previously only the inference backend fell back to the "NVIDIA GPU
required" stand-in; the studio (bonsai.homefree.host) and the simple
generator (imagegen.homefree.host) still ran, so on a non-GPU box the
studio loaded but dumped the vendored Next.js error page on Generate,
which was ugly and confusing.

Factor the CUDA probe into one `gpuOrNotice <port>` guard and run it at
the top of all three start scripts. With no usable GPU each service
serves the self-contained GPU-required page on its own port instead of
the real (useless) service; the port and unit are unchanged, so all
three stay healthy and every Bonsai URL shows the same clean message.

NVIDIA boxes are unaffected: the guard sets LD_LIBRARY_PATH to the driver
libs before probing torch.cuda (so a present GPU is detected, never a
false negative) and, when CUDA is available, execution falls through to
the real backend / studio / generator exactly as before.
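
The guard's shape can be sketched in Python as below. This is an illustration under assumptions, not the `gpuOrNotice` implementation from this PR: the names `gpu_or_notice`, `make_notice_server`, and `NoticeHandler` are hypothetical, the page body is a placeholder, and the 503 status code is an assumption (the commit message does not say which status the stand-in returns).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable

# Placeholder page; the real stand-in's markup is not shown in this PR.
NOTICE_HTML = (
    b"<!doctype html><title>NVIDIA GPU required</title>"
    b"<h1>NVIDIA GPU required</h1>"
)


class NoticeHandler(BaseHTTPRequestHandler):
    """Answer every GET or POST with the same self-contained notice page."""

    def do_GET(self) -> None:
        self.send_response(503)  # assumption: status code is not stated
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(NOTICE_HTML)))
        self.end_headers()
        self.wfile.write(NOTICE_HTML)

    do_POST = do_GET  # form submissions get the same page

    def log_message(self, format, *args) -> None:
        pass  # suppress per-request logging in this sketch


def make_notice_server(port: int, host: str = "127.0.0.1") -> HTTPServer:
    """Bind the stand-in to the port the real service would have used."""
    return HTTPServer((host, port), NoticeHandler)


def gpu_or_notice(port: int, probe: Callable[[], bool]) -> None:
    """Return if the probe finds a GPU; otherwise serve the notice forever."""
    if probe():
        return  # fall through to the real service
    make_notice_server(port).serve_forever()
```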

The ~2-4 GB ternary-model download (step 6) and the Next.js studio build
(step 7) are only useful with an NVIDIA GPU. Gate both on NVIDIA hardware
detected at the PCI level (sysfs vendor 0x10de), so a CPU/AMD box no
longer fetches gigabytes it can't use; the runtime already serves the
GPU-required page there.

The check is deliberately hardware-level, not torch.cuda: PCI enumeration
completes long before userspace, so a real NVIDIA box is never
mis-detected as GPU-less and always runs the full build (the original
`[ ! -d ... ]` idempotency guards are unchanged). The venv (step 4) is
still built unconditionally — the runtime page and GPU probe need it.
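
A hardware-level check of this kind can be sketched as below. It is an illustration only: the function name `has_nvidia_pci` is hypothetical, the real gate may be written differently (for example in the build script itself), and this version matches on vendor ID alone, so any NVIDIA PCI function counts.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID named in the commit message


def has_nvidia_pci(sysfs_root: Path = Path("/sys/bus/pci/devices")) -> bool:
    """Return True if any PCI device under sysfs_root reports NVIDIA's vendor ID.

    Hypothetical helper for illustration only.
    """
    if not sysfs_root.is_dir():
        return False  # no PCI sysfs tree (non-Linux host, container, ...)
    for device_dir in sysfs_root.iterdir():
        try:
            # Each device directory exposes a `vendor` file such as "0x10de".
            vendor = (device_dir / "vendor").read_text().strip().lower()
        except OSError:
            continue  # unreadable or missing entry; skip it
        if vendor == NVIDIA_VENDOR_ID:
            return True
    return False
```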

stephen merged commit 943cd0011c into master 2026-06-09 20:15:21 -07:00
Reference
homefree-plugins/homefree-bonsai!1