GPU Configuration

predictor.sh automatically detects and uses available GPU acceleration.

Supported Backends

| Backend | Platform | Hardware |
|---------|----------|----------|
| Metal | macOS | Apple Silicon (M1/M2/M3) |
| CUDA | Linux/Windows | NVIDIA GPUs (RTX 20/30/40 series) |
| CPU | All | Fallback (slower) |

Automatic Detection

When you run predictor up, it automatically detects available GPUs:

$ predictor up --model ./llama.gguf

Detecting hardware...
  CUDA: NVIDIA RTX 4090 (24GB VRAM) ✓

Loading model: llama-7b-q4.gguf
  Size: 3.8GB
  Loading... ████████████████████ 100%

✓ Tunnel established

Force CPU Mode

If no GPU is detected, predictor.sh requires explicit consent before running on the CPU. Pass the --cpu flag to proceed, for example:
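
$ predictor up --model ./llama.gguf --cpu    # explicit opt-in to CPU inference (slower)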

GPU Selection (Multi-GPU)

On systems with multiple GPUs, use CUDA_VISIBLE_DEVICES:
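
$ CUDA_VISIBLE_DEVICES=0 predictor up --model ./llama.gguf    # pin to the first GPU (indices start at 0)

CUDA_VISIBLE_DEVICES is a standard CUDA environment variable, not a predictor.sh flag: only the listed device indices are visible to CUDA applications, predictor.sh included.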

VRAM Requirements

predictor.sh checks whether your GPU has enough VRAM before loading a model.

Typical VRAM Usage (7B Models)

| Quantization | VRAM Required |
|--------------|---------------|
| Q4_K_M | ~4GB |
| Q5_K_M | ~5GB |
| Q6_K | ~6GB |
| Q8_0 | ~8GB |
| F16 | ~14GB |

Troubleshooting

CUDA Not Detected

  1. Check that the NVIDIA driver is installed and sees the GPU.

  2. Verify the CUDA toolkit installation.

  3. Confirm the GPU is visible and not hidden by an environment variable.
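
The standard commands for these three checks (stock NVIDIA/Linux tooling, not predictor.sh-specific):

$ nvidia-smi                   # 1. should list the GPU and driver version
$ nvcc --version               # 2. prints the installed CUDA toolkit version
$ lspci | grep -i nvidia       # 3. the GPU should appear on the PCI bus
$ echo $CUDA_VISIBLE_DEVICES   # 3. should be unset or include the GPU's index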

Metal Not Detected

Metal is only available on macOS with Apple Silicon. Intel Macs do not support Metal acceleration.

Verify your chip:
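
$ uname -m                              # prints "arm64" on Apple Silicon, "x86_64" on Intel Macs
$ sysctl -n machdep.cpu.brand_string    # prints the chip name, e.g. "Apple M2"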

Out of Memory Errors

If you see CUDA/Metal out-of-memory errors:

  1. Use a smaller quantization (e.g., Q4_K_M instead of Q8_0)

  2. Close other GPU-intensive applications

  3. Try a smaller model
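
For example, stepping down from Q8_0 to Q4_K_M roughly halves VRAM usage for a 7B model (~8GB to ~4GB per the table above). The filename below is hypothetical; use whichever quantized build you have:

$ predictor up --model ./llama-7b-q4_k_m.gguf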

GPU Temperature

Monitor GPU temperature in the predictor.sh TUI.
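
On NVIDIA hardware you can also watch the temperature outside the TUI; this is standard nvidia-smi usage, not a predictor.sh feature:

$ watch -n 1 nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader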

If temperature is high (>80°C), consider:

  • Improving case airflow

  • Reducing request concurrency

  • Using a lower-power model
