# Text Generation (LLM)
predictor.sh supports various text generation model formats with GPU acceleration.
## Supported Formats

| Format | Backend | Notes |
| --- | --- | --- |
| GGUF | llama.cpp | Quantized models, best performance |
| SafeTensors | Candle | HuggingFace models |
| ONNX | ONNX Runtime | Custom models, embeddings |
## Supported Architectures

| Architecture | GGUF | SafeTensors | Example Models |
| --- | --- | --- | --- |
| Llama | ✅ | ✅ | Llama 2, Llama 3, Code Llama |
| Mistral | ✅ | ✅ | Mistral 7B, Mixtral 8x7B |
| Phi | ✅ | ✅ | Phi-2, Phi-3 |
| SmolLM | ✅ | ✅ | SmolLM2-135M/360M/1.7B |
| Qwen | ✅ | ⚠️ | Qwen 1.5, Qwen 2 |
| Falcon | ✅ | ⚠️ | Falcon 7B/40B |
## Loading Models

### Local GGUF File
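A minimal sketch of serving a local GGUF file; the `--model` flag is an assumption, so check `predictor.sh --help` for the actual interface:

```bash
# Hypothetical invocation: flag names are assumptions
./predictor.sh --model ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```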
### HuggingFace Model
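Loading a model directly by HuggingFace repo ID might look like this; the flag name is again an assumption:

```bash
# Hypothetical invocation: the model is fetched from the HuggingFace Hub
./predictor.sh --model HuggingFaceTB/SmolLM2-360M-Instruct
```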
### Pre-download Models
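To fetch weights ahead of time, the standard HuggingFace CLI (part of `huggingface_hub`, not of predictor.sh) works for any supported repo:

```bash
# Downloads the repo into the local HuggingFace cache for offline use
huggingface-cli download HuggingFaceTB/SmolLM2-360M-Instruct
```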
## Quantization

predictor.sh supports pre-quantized models. Common quantization levels:

| Level | Size Reduction | Quality | Size (7B model) |
| --- | --- | --- | --- |
| Q4_K_M | ~75% | Good | ~4GB |
| Q5_K_M | ~70% | Better | ~5GB |
| Q6_K | ~60% | Great | ~6GB |
| Q8_0 | ~50% | Excellent | ~8GB |
| F16 | None | Original | ~14GB |
Quantization must happen ahead of time: on-the-fly quantization is not supported.
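A single pre-quantized variant can be pulled from the HuggingFace Hub with the standard CLI; the repo and file below are examples, not requirements:

```bash
# Download one quantized GGUF variant rather than the whole repo
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --local-dir ./models
```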
## Gated Models

Some HuggingFace models (e.g., Llama 3, Mistral) require authentication:
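The usual mechanism is a HuggingFace access token (standard `huggingface_hub` behavior; whether predictor.sh reads `HF_TOKEN` automatically is an assumption):

```bash
# Interactive login stores the token in the local HuggingFace cache
huggingface-cli login

# Or export the token for non-interactive environments
export HF_TOKEN=hf_...
```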
Get your token at: huggingface.co/settings/tokens
## Benchmarking

Test model performance before deployment:
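A hypothetical benchmark invocation; the `--benchmark` flag is an assumption, so consult the CLI help for the real interface:

```bash
# Hypothetical flags: verify against `predictor.sh --help`
./predictor.sh --benchmark --model ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```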
Sample output (the figures below are illustrative placeholders; real numbers depend on your hardware and model):
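```
# placeholder values, not a real measurement
model: mistral-7b-instruct-v0.2.Q4_K_M.gguf
prompt eval: 512 tokens (~850 tokens/sec)
generation: 128 tokens (~40 tokens/sec)
```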
## API Endpoints

Text models expose OpenAI-compatible endpoints:

- `POST /v1/chat/completions` - Chat completions
- `POST /v1/completions` - Text completions
- `POST /v1/embeddings` - Text embeddings
- `GET /v1/models` - List models
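For example, a chat completion request with curl; the port and model name are placeholders, but the request shape follows the OpenAI API:

```bash
# Standard OpenAI-style chat request; adjust host/port for your deployment
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```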
See API Reference for full documentation.