Speech-to-Text (Whisper)
predictor.sh includes native Whisper support for audio transcription via the Candle framework.
Supported Models
| Model | Parameters | Memory | Notes |
|---|---|---|---|
| whisper-tiny | 39M | ~1GB | Real-time, low resource |
| whisper-tiny.en | 39M | ~1GB | English-only, faster |
| whisper-base | 74M | ~1GB | Balanced for simple audio |
| whisper-base.en | 74M | ~1GB | English-only |
| whisper-small | 244M | ~2GB | Good accuracy/speed balance |
| whisper-small.en | 244M | ~2GB | English-only |
| whisper-medium | 769M | ~5GB | High accuracy |
| whisper-medium.en | 769M | ~5GB | English-only |
| whisper-large | 1550M | ~10GB | Best accuracy |
| whisper-large-v2 | 1550M | ~10GB | Improved large |
| whisper-large-v3 | 1550M | ~10GB | Latest, best quality |
| whisper-large-v3-turbo | 809M | ~6GB | Fast large-quality |
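As a rough illustration of how the sizes above might guide model choice, the sketch below picks the largest listed multilingual model that fits a given memory budget. The helper `pick_model` is hypothetical and not part of predictor.sh. Parameter count is used only as a crude proxy for accuracy, and the memory figures are the approximate values from the list above.

```python
# (name, parameters in millions, approximate memory in GB), copied from the list above.
MODELS = [
    ("whisper-tiny", 39, 1),
    ("whisper-base", 74, 1),
    ("whisper-small", 244, 2),
    ("whisper-medium", 769, 5),
    ("whisper-large-v3-turbo", 809, 6),
    ("whisper-large-v3", 1550, 10),
]


def pick_model(memory_gb):
    """Return the largest listed model whose approximate memory need fits, or None."""
    fitting = [m for m in MODELS if m[2] <= memory_gb]
    if not fitting:
        return None
    # Largest parameter count wins; this is a heuristic, not a quality guarantee.
    return max(fitting, key=lambda m: m[1])[0]
```

With an 8 GB budget this heuristic selects `whisper-large-v3-turbo`. Real memory use may differ with audio length and hardware, so treat the result as a starting point.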
Distilled Models (Faster)
| Model | Parameters | Memory | Notes |
|---|---|---|---|
| distil-whisper-small.en | 166M | ~1.5GB | English-only |
| distil-whisper-medium.en | 394M | ~3GB | English-only |
| distil-whisper-large-v2 | 756M | ~5GB | Best distilled |
| distil-whisper-large-v3 | 756M | ~5GB | Latest distilled |
Getting Started
Supported Audio Formats
| Format | Extensions | Notes |
|---|---|---|
| WAV | .wav | Native, recommended |
| MP3 | .mp3 | Most common |
| AAC | .m4a, .aac | Apple format |
| OGG Vorbis | .ogg, .oga | Open format |
| FLAC | .flac | Lossless |
| WebM | .webm | Browser recording |
Audio is automatically:
- Decoded to PCM
- Resampled to 16 kHz
- Converted to mono
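To make the last two steps concrete, here is a minimal pure-Python sketch of downmixing and resampling already-decoded PCM samples. It is illustrative only: the function names are hypothetical, it is not how predictor.sh implements preprocessing, and plain linear interpolation is used in place of the low-pass filtering a production resampler would apply.

```python
def to_mono(frames):
    """Average the channels of each frame, e.g. [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample by linear interpolation (no anti-aliasing filter; sketch only)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# One second of silent 48 kHz stereo becomes 16000 mono samples.
stereo = [(0.0, 0.0)] * 48000
mono_16k = resample_linear(to_mono(stereo), 48000)
```

Since the server performs these steps itself, code like this would only be needed to inspect or pre-shrink audio on the client side.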
API Usage
Transcribe Audio
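A minimal sketch of how a transcription upload could be assembled with the Python standard library. Everything specific here is an assumption, since this page does not state them: the URL, the `model` and `file` field names, and the use of multipart/form-data are all placeholders. Check them against your deployment before use.

```python
import urllib.request
import uuid

# Assumption: placeholder endpoint, not confirmed by this document.
HYPOTHETICAL_URL = "http://localhost:8080/v1/audio/transcriptions"


def build_multipart(filename, audio_bytes, fields=None, boundary=None):
    """Encode form fields plus one file part as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    chunks = []
    for name, value in (fields or {}).items():
        part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            "\r\n"
            f"{value}\r\n"
        )
        chunks.append(part.encode("utf-8"))
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    )
    chunks.append(file_header.encode("utf-8") + audio_bytes + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return b"".join(chunks), headers


def build_request(filename, audio_bytes, model="whisper-base"):
    """Build (but do not send) a POST request; field names are assumptions."""
    body, headers = build_multipart(filename, audio_bytes, {"model": model})
    return urllib.request.Request(
        HYPOTHETICAL_URL, data=body, headers=headers, method="POST"
    )
```

Once the real endpoint is known, the request can be sent with `urllib.request.urlopen(build_request(...))`.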
With Language Hint
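A language hint is typically passed as a short language code such as `en` or `de`. The helper below is a hypothetical sketch of normalizing such a hint before it is added to the request fields. The field names `model` and `language` are assumptions, not confirmed by this page.

```python
def transcription_fields(model, language=None):
    """Build request form fields, optionally with a normalized language hint."""
    fields = {"model": model}
    if language is not None:
        code = language.strip().lower()
        # Accept two- or three-letter ASCII codes (e.g. "en", "de", "haw").
        if not (code.isascii() and code.isalpha() and 2 <= len(code) <= 3):
            raise ValueError(f"unexpected language code: {language!r}")
        fields["language"] = code  # assumed field name
    return fields
```

Omitting the hint leaves language detection to the model, while supplying it can avoid misdetection on short or noisy clips.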
Python SDK
Response Format
Streaming Response
For long audio files, responses stream segment-by-segment:
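The following sketch shows one way a client might consume segments as they arrive. The wire format is an assumption: it treats the stream as newline-delimited JSON with hypothetical `start`, `end`, and `text` fields, which this page does not confirm. The sample lines are made up for illustration.

```python
import json


def iter_segments(lines):
    """Yield one parsed segment per non-empty line of the stream."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # skip keep-alive or blank lines
        yield json.loads(line)


def collect_text(lines):
    """Join the text of all segments into a single transcript string."""
    return " ".join(seg["text"].strip() for seg in iter_segments(lines))


# Made-up sample stream, for illustration only.
sample_stream = [
    b'{"start": 0.0, "end": 2.5, "text": " Hello"}\n',
    b"\n",
    b'{"start": 2.5, "end": 4.0, "text": " world."}\n',
]
transcript = collect_text(sample_stream)
```

Because `iter_segments` is a generator, a caller can display each segment as soon as it is parsed instead of waiting for the whole file to finish.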
Performance
Real-time factor (RTF) is processing time divided by audio duration, so lower is faster:
- RTF < 1.0: faster than real time
- RTF = 0.1: 10x faster than real time
Example: 60 seconds of audio with RTF 0.1 takes ~6 seconds to transcribe.
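The arithmetic in this example can be written as a small helper. The function names are illustrative only and are not part of predictor.sh.

```python
def transcription_seconds(audio_seconds, rtf):
    """Estimated processing time: audio duration multiplied by the RTF."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


def speedup(rtf):
    """How many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("rtf must be > 0")
    return 1.0 / rtf


estimate = transcription_seconds(60, 0.1)  # about 6 seconds
```

An RTF above 1.0 yields an estimate longer than the audio itself, meaning transcription cannot keep up with live input.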
| Model | RTF | RTF |
|---|---|---|
| tiny | 0.02 | 0.2 |
| small | 0.05 | 0.5 |
| large-v3 | 0.12 | 1.5+ |