# Text Generation (LLM)
predictor.sh supports various text generation model formats with GPU acceleration.
## Supported Formats

| Format | Backend | Notes |
| --- | --- | --- |
| GGUF | llama.cpp | Quantized models, best performance |
| SafeTensors | Candle | HuggingFace models |
| ONNX | ONNX Runtime | Custom models, embeddings |
## Supported Architectures

| Architecture | GGUF | SafeTensors | Example Models |
| --- | --- | --- | --- |
| Llama | ✅ | ✅ | Llama 2, Llama 3, Code Llama |
| Mistral | ✅ | ✅ | Mistral 7B, Mixtral 8x7B |
| Phi | ✅ | ✅ | Phi-2, Phi-3 |
| SmolLM | ✅ | ✅ | SmolLM2-135M/360M/1.7B |
| Qwen | ✅ | ⚠️ | Qwen 1.5, Qwen 2 |
| Falcon | ✅ | ⚠️ | Falcon 7B/40B |
## Loading Models

### Local GGUF File
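A minimal sketch of serving a local GGUF file; the `--model` flag is an assumption, so check `predictor.sh --help` for the actual interface:

```bash
# Hypothetical invocation: flag names are assumptions
./predictor.sh --model ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```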
### HuggingFace Model
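Loading a model directly by HuggingFace repo ID might look like this; the flag name is again an assumption:

```bash
# Hypothetical invocation: the model is fetched from the HuggingFace Hub
./predictor.sh --model HuggingFaceTB/SmolLM2-360M-Instruct
```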
### Pre-download Models
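To fetch weights ahead of time, the standard HuggingFace CLI (part of `huggingface_hub`, not of predictor.sh) works for any supported repo:

```bash
# Downloads the repo into the local HuggingFace cache for offline use
huggingface-cli download HuggingFaceTB/SmolLM2-360M-Instruct
```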
## Quantization

predictor.sh supports pre-quantized models. Common quantization levels:

| Level | Size Reduction | Quality | Size (7B model) |
| --- | --- | --- | --- |
| Q4_K_M | ~75% | Good | ~4GB |
| Q5_K_M | ~70% | Better | ~5GB |
| Q6_K | ~60% | Great | ~6GB |
| Q8_0 | ~50% | Excellent | ~8GB |
| F16 | None | Original | ~14GB |
Quantization must happen ahead of time: on-the-fly quantization is not supported.
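A single pre-quantized variant can be pulled from the HuggingFace Hub with the standard CLI; the repo and file below are examples, not requirements:

```bash
# Download one quantized GGUF variant rather than the whole repo
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --local-dir ./models
```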
## Gated Models

Some HuggingFace models (e.g., Llama 3, Mistral) require authentication:
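The usual mechanism is a HuggingFace access token (standard `huggingface_hub` behavior; whether predictor.sh reads `HF_TOKEN` automatically is an assumption):

```bash
# Interactive login stores the token in the local HuggingFace cache
huggingface-cli login

# Or export the token for non-interactive environments
export HF_TOKEN=hf_...
```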
Get your token at: huggingface.co/settings/tokens
## Benchmarking

Test model performance before deployment:
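A hypothetical benchmark invocation; the `--benchmark` flag is an assumption, so consult the CLI help for the real interface:

```bash
# Hypothetical flags: verify against `predictor.sh --help`
./predictor.sh --benchmark --model ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```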
Sample output (the figures below are illustrative placeholders; real numbers depend on your hardware and model):
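```
# placeholder values, not a real measurement
model: mistral-7b-instruct-v0.2.Q4_K_M.gguf
prompt eval: 512 tokens (~850 tokens/sec)
generation: 128 tokens (~40 tokens/sec)
```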
## API Endpoints

Text models expose OpenAI-compatible endpoints:

- `POST /v1/chat/completions` - Chat completions
- `POST /v1/completions` - Text completions
- `POST /v1/embeddings` - Text embeddings
- `GET /v1/models` - List models
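For example, a chat completion request with curl; the port and model name are placeholders, but the request shape follows the OpenAI API:

```bash
# Standard OpenAI-style chat request; adjust host/port for your deployment
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```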
See API Reference for full documentation.