Speech-to-Text (Whisper)

predictor.sh includes native Whisper support for audio transcription via the Candle framework.

Supported Models

| Model | Parameters | VRAM | Best For |
|---|---|---|---|
| whisper-tiny | 39M | ~1GB | Real-time, low resource |
| whisper-tiny.en | 39M | ~1GB | English-only, faster |
| whisper-base | 74M | ~1GB | Balanced for simple audio |
| whisper-base.en | 74M | ~1GB | English-only |
| whisper-small | 244M | ~2GB | Good accuracy/speed balance |
| whisper-small.en | 244M | ~2GB | English-only |
| whisper-medium | 769M | ~5GB | High accuracy |
| whisper-medium.en | 769M | ~5GB | English-only |
| whisper-large | 1550M | ~10GB | Best accuracy |
| whisper-large-v2 | 1550M | ~10GB | Improved large |
| whisper-large-v3 | 1550M | ~10GB | Latest, best quality |
| whisper-large-v3-turbo | 809M | ~6GB | Fast large-quality |
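As a rough aid for choosing from the table above, the sketch below picks the largest multilingual model whose approximate VRAM figure fits a given budget. The figures are copied from the table; the `pick_model` helper and the "largest that fits" heuristic are assumptions made for illustration and are not part of predictor.sh.

```python
from typing import Optional

# (model name, parameters in millions, approximate VRAM in GB), from the table above.
# Only one large variant is listed to keep the choice unambiguous.
MODELS = [
    ("whisper-tiny", 39, 1.0),
    ("whisper-base", 74, 1.0),
    ("whisper-small", 244, 2.0),
    ("whisper-medium", 769, 5.0),
    ("whisper-large-v3-turbo", 809, 6.0),
    ("whisper-large-v3", 1550, 10.0),
]


def pick_model(vram_budget_gb: float) -> Optional[str]:
    """Return the largest listed model that fits the budget, or None if none fit."""
    fitting = [m for m in MODELS if m[2] <= vram_budget_gb]
    if not fitting:
        return None
    # Hypothetical heuristic: more parameters is treated as "better".
    return max(fitting, key=lambda m: m[1])[0]


if __name__ == "__main__":
    for budget in (0.5, 4, 8, 12):
        print(budget, "GB ->", pick_model(budget))
```

The VRAM values are approximate, so leaving some headroom below the real capacity of the GPU is a sensible precaution.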

Distilled Models (Faster)

| Model | Parameters | VRAM | Notes |
|---|---|---|---|
| distil-whisper-small.en | 166M | ~1.5GB | English only |
| distil-whisper-medium.en | 394M | ~3GB | English only |
| distil-whisper-large-v2 | 756M | ~5GB | Best distilled |
| distil-whisper-large-v3 | 756M | ~5GB | Latest distilled |

Getting Started

Supported Audio Formats

| Format | Extension | Notes |
|---|---|---|
| WAV | .wav | Native, recommended |
| MP3 | .mp3 | Most common |
| AAC | .m4a, .aac | Apple format |
| OGG Vorbis | .ogg, .oga | Open format |
| FLAC | .flac | Lossless |
| WebM | .webm | Browser recording |

Audio is automatically:

  • Decoded to PCM

  • Resampled to 16kHz

  • Converted to mono
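The last two steps can be illustrated with a minimal pure-Python sketch, assuming the audio has already been decoded to floating-point PCM frames. This is not predictor.sh's implementation: the function names are hypothetical, and linear interpolation stands in for the band-limited resampling filter a production decoder would normally use.

```python
from typing import List, Sequence

TARGET_RATE = 16_000  # target sample rate in Hz (16 kHz mono)


def to_mono(frames: Sequence[Sequence[float]]) -> List[float]:
    """Downmix by averaging the channels of each frame into one sample."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples: Sequence[float], src_rate: int,
                    dst_rate: int = TARGET_RATE) -> List[float]:
    """Resample with linear interpolation (illustrative only)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the source signal
        lo = min(int(pos), last)           # sample at or before pos
        hi = min(lo + 1, last)             # next sample, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


if __name__ == "__main__":
    stereo_48k = [(0.25, 0.75)] * 48_000   # one second of stereo at 48 kHz
    mono_16k = resample_linear(to_mono(stereo_48k), 48_000)
    print(len(mono_16k))                   # number of samples after resampling
```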

API Usage

Transcribe Audio

With Language Hint

Python SDK

Response Format

Streaming Response

For long audio files, responses stream segment-by-segment:

Performance

Real-time factor (RTF) is processing time divided by audio duration, so lower values mean faster transcription:

  • RTF < 1.0 = Faster than real-time

  • RTF = 0.1 = 10x faster than real-time

Example: 60 seconds of audio with RTF 0.1 takes ~6 seconds to transcribe.
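The arithmetic behind that example can be written out as a small sketch. The helper names below are hypothetical and exist only to illustrate the definition of RTF.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def estimated_processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate transcription time for a clip at a given RTF."""
    return audio_seconds * rtf


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


if __name__ == "__main__":
    # 60 seconds of audio at RTF 0.1, as in the example above.
    print(estimated_processing_seconds(60, 0.1))
    print(speedup(0.1))
```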

| Model | Typical RTF (GPU) | Typical RTF (CPU) |
|---|---|---|
| tiny | 0.02 | 0.2 |
| small | 0.05 | 0.5 |
| large-v3 | 0.12 | 1.5+ |
