Speech-to-Text (Whisper)

predictor.sh includes native Whisper support for audio transcription via the Candle framework.

Supported Models

| Model | Parameters | VRAM | Best For |
| --- | --- | --- | --- |
| whisper-tiny | 39M | ~1GB | Real-time, low resource |
| whisper-tiny.en | 39M | ~1GB | English-only, faster |
| whisper-base | 74M | ~1GB | Balanced for simple audio |
| whisper-base.en | 74M | ~1GB | English-only |
| whisper-small | 244M | ~2GB | Good accuracy/speed balance |
| whisper-small.en | 244M | ~2GB | English-only |
| whisper-medium | 769M | ~5GB | High accuracy |
| whisper-medium.en | 769M | ~5GB | English-only |
| whisper-large | 1550M | ~10GB | Best accuracy |
| whisper-large-v2 | 1550M | ~10GB | Improved large |
| whisper-large-v3 | 1550M | ~10GB | Latest, best quality |
| whisper-large-v3-turbo | 809M | ~6GB | Fast large-quality |
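As a rough aid for reading the table above, the following sketch picks the most demanding multilingual model whose approximate VRAM figure fits a given budget. The function name `largest_model_that_fits` and the `headroom_gb` parameter are hypothetical and not part of predictor.sh; the VRAM numbers are the approximate values from the table, and the list order is only an assumed ranking by resource cost.

```python
# (model, approximate VRAM in GB) taken from the table above, cheapest first.
# The ordering is an assumption for illustration, not an official ranking.
MODELS = [
    ("whisper-tiny", 1),
    ("whisper-base", 1),
    ("whisper-small", 2),
    ("whisper-medium", 5),
    ("whisper-large-v3-turbo", 6),
    ("whisper-large-v3", 10),
]


def largest_model_that_fits(vram_gb, headroom_gb=0.0):
    """Return the last model in MODELS that fits, or None if none does."""
    fitting = [name for name, need in MODELS if need + headroom_gb <= vram_gb]
    return fitting[-1] if fitting else None


print(largest_model_that_fits(8))
```

Because the VRAM figures are approximate, leaving some headroom (for example `headroom_gb=1`) is probably safer than matching the budget exactly.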

Distilled Models (Faster)

| Model | Parameters | VRAM | Notes |
| --- | --- | --- | --- |
| distil-whisper-small.en | 166M | ~1.5GB | English only |
| distil-whisper-medium.en | 394M | ~3GB | English only |
| distil-whisper-large-v2 | 756M | ~5GB | Best distilled |
| distil-whisper-large-v3 | 756M | ~5GB | Latest distilled |

Getting Started

# Serve Whisper model
predictor up openai/whisper-large-v3

Supported Audio Formats

| Format | Extension | Notes |
| --- | --- | --- |
| WAV | .wav | Native, recommended |
| MP3 | .mp3 | Most common |
| AAC | .m4a, .aac | Apple format |
| OGG Vorbis | .ogg, .oga | Open format |
| FLAC | .flac | Lossless |
| WebM | .webm | Browser recording |

Audio is automatically:

Decoded to PCM
Resampled to 16kHz
Converted to mono
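To make the last two steps concrete, here is a minimal pure-Python sketch of downmixing interleaved PCM to mono and resampling to 16 kHz. It is an illustration only: the helper names `to_mono` and `resample_linear` are hypothetical, the server's actual pipeline is not documented here, and linear interpolation without an anti-aliasing filter is a simplification that a production resampler would likely not use.

```python
TARGET_RATE = 16_000  # Hz, the sample rate named in the list above


def to_mono(samples, channels):
    """Average each interleaved PCM frame down to a single channel."""
    if channels == 1:
        return list(samples)
    return [
        sum(samples[i:i + channels]) / channels
        for i in range(0, len(samples) - channels + 1, channels)
    ]


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample by linear interpolation (no anti-aliasing filter)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # source samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# Example: 48 kHz stereo frames -> 16 kHz mono
stereo = [0, 2, 4, 6, 8, 10]           # three frames of (left, right)
mono = to_mono(stereo, channels=2)     # one averaged value per frame
resampled = resample_linear(mono, src_rate=48_000)
```

Since the service performs these conversions itself, client-side preprocessing like this should not be required; the sketch only shows what the conversion means.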

API Usage

Transcribe Audio

curl https://your-endpoint.predictor.sh/v1/audio/transcriptions \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@audio.mp3" \
  -F "model=whisper-1"

With Language Hint

curl https://your-endpoint.predictor.sh/v1/audio/transcriptions \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@audio.mp3" \
  -F "model=whisper-1" \
  -F "language=en"

Python SDK

from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.predictor.sh/v1",
    api_key="pred_your_token"
)

with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )
    print(transcript.text)

Response Format

{
  "text": "Hello, this is a transcription test."
}

Streaming Response

For long audio files, responses stream segment-by-segment:

data: {"text": "First segment...", "start": 0.0, "end": 5.2}
data: {"text": "Second segment...", "start": 5.2, "end": 11.4}
data: [DONE]
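A client could consume this stream line by line. The sketch below assumes the format shown above (each event on one `data:` line carrying a JSON object, ending with a `[DONE]` sentinel) and that lines have already been decoded to text; the helper name `parse_segments` is hypothetical.

```python
import json


def parse_segments(lines):
    """Yield one dict per streamed segment until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything unexpected
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)


# Sample lines copied from the stream shown above
stream = [
    'data: {"text": "First segment...", "start": 0.0, "end": 5.2}',
    'data: {"text": "Second segment...", "start": 5.2, "end": 11.4}',
    "data: [DONE]",
]
segments = list(parse_segments(stream))
transcript = " ".join(seg["text"] for seg in segments)
print(transcript)
```

Keeping the `start` and `end` fields alongside the text makes it straightforward to build subtitles or jump to a position in the audio later.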

Performance

Real-time factor (RTF) is processing time divided by audio duration, so lower values mean faster transcription:

RTF < 1.0 = Faster than real-time
RTF = 0.1 = 10x faster than real-time

Example: 60 seconds of audio with RTF 0.1 takes ~6 seconds to transcribe.
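The same arithmetic as a small sketch, taking RTF as processing time divided by audio duration. The function names `rtf` and `estimated_seconds` are illustrative only.

```python
def rtf(processing_seconds, audio_seconds):
    """Real-time factor: processing time divided by audio duration."""
    return processing_seconds / audio_seconds


def estimated_seconds(audio_seconds, rtf_value):
    """Expected transcription time for a clip at a given RTF."""
    return audio_seconds * rtf_value


# The example above: 60 seconds of audio at RTF 0.1
one_minute = estimated_seconds(60, 0.1)

# A one-hour recording at the same RTF
one_hour = estimated_seconds(3600, 0.1)
```

At RTF 0.1 the one-hour recording works out to roughly 360 seconds, or about six minutes of processing.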

| Model | Typical RTF (GPU) | Typical RTF (CPU) |
| --- | --- | --- |
| tiny | 0.02 | 0.2 |
| small | 0.05 | 0.5 |
| large-v3 | 0.12 | 1.5+ |
