Common issues and solutions.
Connection Issues
"Tunnel connection failed"
Symptoms: Can't establish tunnel to predictor.sh servers.
Solutions:
Check internet connection:
curl https://predictor.sh
Check firewall: Ensure outbound HTTPS (443) is allowed.
Corporate proxy: Configure proxy settings:
export HTTPS_PROXY=http://proxy.company.com:8080
VPN interference: temporarily disconnect your VPN and retry.
"Endpoint abc123 is already connected"
Cause: Another predictor instance is using this endpoint.
Solutions:
If the other instance crashed, wait about 30 seconds for its session to time out.
Use --force to override:
predictor down --force
Authentication Issues
"Unauthorized (401)"
Cause: Invalid or expired token.
Solutions:
Ensure correct Authorization header:
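For example, with curl. The Bearer scheme and the endpoint URL below are assumptions; substitute your real token and URL:

```shell
# Placeholder values -- substitute your real token and endpoint URL.
TOKEN="example-token"
ENDPOINT="https://api.predictor.sh/v1/health"
# A 401 usually means this header is missing or malformed:
# `|| true` keeps the sketch from aborting if the endpoint is unreachable.
curl -s -H "Authorization: Bearer ${TOKEN}" "$ENDPOINT" || true
```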
"HuggingFace: Unauthorized"
Cause: Missing or invalid HF_TOKEN for gated models.
Solution:
Get token at: huggingface.co/settings/tokens
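A minimal sketch, assuming predictor.sh reads the token from the HF_TOKEN environment variable as the error message suggests:

```shell
# hf_xxx is a placeholder -- paste the token you created at
# huggingface.co/settings/tokens
export HF_TOKEN="hf_xxx"
```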
"No GPU detected"
Symptoms:
Solutions:
Check NVIDIA drivers (Linux/Windows):
If not found, install NVIDIA drivers.
Verify GPU visibility: check the CUDA_VISIBLE_DEVICES environment variable. If it is unset, all GPUs are visible; an empty value hides every GPU. Unset it if it is masking your GPU.
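The two checks above can be sketched as:

```shell
# 1. Driver check: nvidia-smi ships with the NVIDIA driver, so
#    "command not found" means the driver is not installed.
nvidia-smi || echo "driver not installed or GPU not visible"

# 2. GPU masking: an unset CUDA_VISIBLE_DEVICES means all GPUs are
#    visible; an empty value hides every GPU.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"
unset CUDA_VISIBLE_DEVICES
```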
"Insufficient VRAM"
Symptoms:
Solutions:
Use smaller quantization:
Q4_K_M (~4GB for 7B model)
Q5_K_M (~5GB for 7B model)
Close other GPU applications.
"CUDA out of memory" during inference
Cause: GPU ran out of memory during generation.
Solutions:
Reduce max_tokens in requests.
Lower concurrency in predictor.yaml:
Use smaller quantization.
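A sketch of the concurrency change in predictor.yaml; the key name `concurrency` is an assumption here, so verify the exact field against the tool's config reference:

```yaml
# Hypothetical key name -- verify against the predictor.sh config docs.
concurrency: 1   # serve one request at a time to cap peak GPU memory
```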
Unsupported model format
Cause: predictor.sh doesn't support this file format.
Supported formats:
.gguf - GGUF quantized models
.safetensors - SafeTensors format
Not supported:
.pt, .bin - PyTorch checkpoints
Solution: Find a GGUF or SafeTensors version of your model.
"Model architecture not recognized"
Cause: Missing or invalid config.json.
Solution: Ensure your model directory contains:
config.json (model configuration)
*.safetensors (model weights)
tokenizer.json (for text models)
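A quick way to check for those files, sketched in shell (`./my-model` is a placeholder path):

```shell
MODEL_DIR="./my-model"   # placeholder -- point this at your model directory
# Check for the required metadata files listed above:
for f in config.json tokenizer.json; do
  [ -e "$MODEL_DIR/$f" ] || echo "missing: $f"
done
# At least one weights file should be present:
ls "$MODEL_DIR"/*.safetensors 2>/dev/null || echo "missing: *.safetensors"
```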
"Download failed / checksum mismatch"
Cause: Corrupted or incomplete download.
Solutions:
Try a different network connection.
"Request timeout (504)"
Cause: Inference took too long.
Solutions:
Increase timeout in predictor.yaml:
Reduce max_tokens in your request.
Use a faster model/quantization.
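A sketch of the timeout change in predictor.yaml; the key name `timeout` is an assumption, so check the tool's config reference for the exact field:

```yaml
# Hypothetical key name -- verify against the predictor.sh config docs.
timeout: 300   # seconds to wait for a response before returning 504
```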
"Too many requests (429)"
Cause: Rate limit exceeded.
Solutions:
Reduce request frequency.
Upgrade to a higher tier.
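One way to reduce request frequency is client-side retrying with exponential backoff. A minimal sketch; `send_request` is a hypothetical stand-in for your real API call:

```shell
# send_request is a placeholder for your real API call; it should
# return non-zero on a 429 response. TOKEN and URL are placeholders.
send_request() { curl -sf -H "Authorization: Bearer $TOKEN" "$URL"; }

# Retry with exponential backoff: wait 1s, 2s, 4s, ... between attempts.
backoff_retry() {
  for attempt in 0 1 2 3 4; do
    if send_request; then return 0; fi
    sleep $((1 << attempt))
  done
  return 1
}
```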
Logs and Debugging
Enable debug logging
Logs are written to /tmp/predictor.log.
Raw log output (no TUI)
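To inspect the log file named above:

```shell
LOG=/tmp/predictor.log
# Show recent entries; use `tail -f "$LOG"` to follow live output.
[ -r "$LOG" ] && tail -n 100 "$LOG" || echo "no log file yet"
```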
If you're still stuck:
Search existing issues for your error message
Open a new issue with:
predictor --version output