
OpenAI Whisper

voice

by openai

Open source speech recognition model supporting 98 languages, available both as a free local tool and through OpenAI's transcription API.

Key features

Multilingual Transcription
Speech Translation
Open Source (MIT License)
Multiple Model Sizes
Turbo Variant
Word Level Timestamps
Pricing

Free tier available (run locally at no cost); hosted API from $0.003 per minute of audio (gpt-4o-mini-transcribe)

Best For

Developers building speech to text features into applications who want a proven, reliable transcription engine

Verdict

Fully open source under MIT license with no usage restrictions, allowing unlimited commercial and personal use at zero cost

What it does

Multilingual Transcription

Transcribe speech to text in 98 languages. The model automatically detects the spoken language and produces accurate transcriptions with punctuation and formatting.
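A minimal local sketch of this, assuming the openai-whisper package and ffmpeg are installed; the filename and helper name are illustrative:

```python
def transcribe_file(path: str, model_name: str = "turbo") -> dict:
    """Transcribe an audio file; Whisper auto-detects the spoken language."""
    import whisper  # pip install openai-whisper; also needs ffmpeg on PATH

    model = whisper.load_model(model_name)  # weights download on first use
    return model.transcribe(path)  # dict with "text", "language", "segments"


# Usage (hypothetical filename):
#   result = transcribe_file("interview.mp3")
#   result["language"]  -> detected language code such as "en"
#   result["text"]      -> punctuated transcript
```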

Speech Translation

Translate speech from any of the 98 supported languages directly into English text in a single step. No intermediate transcription is needed; the model translates end to end.
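In the Python API, translation is a one-argument change from transcription; a sketch assuming openai-whisper is installed (large-v3 is used because the turbo model does not support translation):

```python
def translate_to_english(path: str, model_name: str = "large-v3") -> str:
    """Translate speech in any supported language directly into English text."""
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_name)
    # task="translate" makes the model emit English instead of the source language.
    result = model.transcribe(path, task="translate")
    return result["text"]
```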

Open Source (MIT License)

The entire model, weights, and code are available under the MIT license. You can download, modify, fine tune, and deploy Whisper for any purpose, including commercial use, with no restrictions.

Multiple Model Sizes

Choose from tiny (39M), base (74M), small (244M), medium (769M), large-v3 (1.55B), and turbo (809M) depending on your hardware and accuracy requirements. Smaller models run on CPUs; larger models need GPUs.

Turbo Variant

The large-v3-turbo model (809M parameters) was created by pruning the large-v3 decoder from 32 layers to 4 layers. It runs significantly faster with minimal quality loss, making it the best balance of speed and accuracy for most use cases.

Word Level Timestamps

Get precise start and end times for every word in the transcription. Essential for subtitle generation, audio editing, and synchronized text displays.
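A sketch of extracting per-word timings with the open source package; assumes openai-whisper is installed, and the helper name is ours:

```python
def word_timings(path: str, model_name: str = "turbo") -> list:
    """Return (word, start_seconds, end_seconds) for every word in the audio."""
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_name)
    # word_timestamps=True attaches a "words" list to each segment.
    result = model.transcribe(path, word_timestamps=True)
    return [
        (w["word"], w["start"], w["end"])
        for segment in result["segments"]
        for w in segment["words"]
    ]
```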

Speaker Diarization

The gpt-4o-transcribe-diarize API model identifies and labels different speakers in a conversation. Useful for meeting transcription, interviews, and multi speaker audio.

Streaming Transcription

The gpt-4o-transcribe API models support streaming output, delivering transcription results in real time as audio is processed. Also available via the Realtime API over WebSocket.

Prompting for Domain Vocabulary

Provide a text prompt with domain specific terminology, acronyms, or proper nouns to improve transcription accuracy. The model uses the prompt as context to correctly spell specialized vocabulary.
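A sketch of the prompt parameter via the official openai Python SDK; the model choice and vocabulary string are illustrative:

```python
def transcribe_with_prompt(path: str, vocabulary: str) -> str:
    """Bias transcription toward domain terms via the prompt parameter."""
    from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY

    client = OpenAI()
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(
            model="gpt-4o-mini-transcribe",
            file=audio,
            prompt=vocabulary,  # e.g. "ZyntriQix, EHR, HIPAA" (made-up terms)
        )
    return result.text
```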

Multiple Output Formats

Export transcriptions in JSON (with timestamps and metadata), SRT (subtitle format), VTT (web subtitles), plain text, or verbose JSON with word level timing data.
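Because the verbose JSON output carries segment timings, subtitle formats can also be produced locally; a small self-contained sketch (the segment dicts in the example are made up):

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Render Whisper-style segments ({"start", "end", "text"}) as SRT."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


# Example with a made-up segment:
example = segments_to_srt([{"start": 0.0, "end": 2.5, "text": " Hello there."}])
```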

Realtime API Support

Use Whisper through OpenAI's Realtime API for live audio transcription over WebSocket connections. Enables real time voice applications, live captioning, and conversational AI systems.

Pricing

Open Source

Free

Run Whisper locally on your own hardware at no cost. MIT license allows unlimited commercial and personal use. You provide the compute (CPU or GPU).

  • All model sizes (tiny to large-v3)
  • Turbo variant included
  • 98 language support
  • Word level timestamps
  • Translation to English
  • No API limits or quotas
  • Full offline operation
  • MIT license (commercial use allowed)

API (whisper-1)

$0.006/min

OpenAI hosted transcription at $0.006 per minute of audio. No infrastructure to manage. Based on the large-v2 model.

  • $0.006 per minute of audio
  • Managed infrastructure
  • Automatic language detection
  • Word level timestamps
  • Multiple output formats
  • Prompting support
  • 25 MB file size limit

Best Value

API (gpt-4o-mini-transcribe)

$0.003/min

Newer transcription model at $0.003 per minute, with better accuracy than whisper-1 at half the price.

  • $0.003 per minute of audio
  • Improved accuracy over whisper-1
  • Streaming support
  • Managed infrastructure
  • Token based pricing also available
  • 25 MB file size limit

API (gpt-4o-transcribe)

$0.006/min

Most capable API transcription model at $0.006 per minute. Full streaming support with the highest accuracy available.

  • $0.006 per minute of audio
  • Highest accuracy
  • Full streaming support
  • Speaker diarization (diarize variant)
  • Managed infrastructure
  • Token based pricing also available
  • 25 MB file size limit
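The per-minute rates above translate into workload costs with simple arithmetic; a small illustrative helper, with rates copied from the tiers above:

```python
def api_cost_usd(hours_of_audio: float, per_minute_rate: float) -> float:
    """Cost of transcribing a workload at a flat per-minute rate, in USD."""
    return round(hours_of_audio * 60 * per_minute_rate, 2)


# Rates in USD per minute of audio, from the pricing tiers above:
WHISPER_1 = 0.006
GPT_4O_MINI_TRANSCRIBE = 0.003
GPT_4O_TRANSCRIBE = 0.006

# For 1,000 hours of audio: $360 on whisper-1, $180 on gpt-4o-mini-transcribe.
```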

Pros & Cons

Pros

  • Fully open source under MIT license with no usage restrictions, allowing unlimited commercial and personal use at zero cost
  • Free to run locally on your own hardware, from a Raspberry Pi (tiny model) to a workstation GPU (large model)
  • 98 language support with automatic language detection, one of the broadest multilingual ASR models available
  • Multiple model sizes let you choose the right tradeoff between speed and accuracy for your hardware and use case
  • Massive community ecosystem including whisper.cpp, faster-whisper, and insanely-fast-whisper with significant performance improvements
  • API option provides a managed, hassle free experience for teams that do not want to manage GPU infrastructure
  • Word level timestamps enable precise subtitle generation, audio editing, and synchronized text overlays

Cons

  • The large-v3 model requires approximately 10GB of VRAM, putting it out of reach for machines without a dedicated GPU
  • The turbo model cannot translate (only transcribe), so translation still requires the full large model or the older models
  • API pricing adds up for high volume workloads; at $0.006/min, 1,000 hours of audio costs $360
  • The newer gpt-4o-transcribe models are not open source, so the best API accuracy is locked behind the paid service
  • Accuracy varies significantly by language; high resource languages (English, Spanish, French) perform much better than low resource ones
  • Audio is processed in 30 second windows, which can cause issues with long pauses, silence, or non speech segments

How to get started

1

Choose your approach: local or API

Decide whether you want to run Whisper locally (free, requires hardware) or use the OpenAI API (paid per minute, no setup). Local is best for privacy, high volume, and customization. API is best for convenience and minimal infrastructure.

2

Run locally with pip install

Install the open source model with pip install openai-whisper. Choose a model size: start with base or small for testing, then move to large-v3-turbo for production. Run with: whisper audio.mp3 --model turbo

3

Or use the API

Get an API key from platform.openai.com. Send audio files to the /v1/audio/transcriptions endpoint. The API handles all infrastructure, model loading, and scaling automatically.
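A minimal sketch of the API path, assuming the official openai Python SDK is installed and OPENAI_API_KEY is set in the environment:

```python
def transcribe_via_api(path: str, model: str = "whisper-1") -> str:
    """Send an audio file (max 25 MB) to OpenAI's transcription endpoint."""
    from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY

    client = OpenAI()
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model=model, file=audio)
    return result.text
```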

4

Try community alternatives for better performance

For faster local inference, try faster-whisper (CTranslate2 based, up to 4x faster) or whisper.cpp (C/C++ port, runs efficiently on CPUs). For batch processing on GPUs, insanely-fast-whisper uses batched inference for maximum throughput.
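A sketch of the faster-whisper route, assuming the faster-whisper package is installed; int8 quantization is one common setting that lets larger models run on CPU-only machines:

```python
def transcribe_fast(path: str) -> list:
    """Transcribe with faster-whisper (CTranslate2 backend)."""
    from faster_whisper import WhisperModel  # pip install faster-whisper

    # int8 quantization trades a little accuracy for a smaller memory footprint.
    model = WhisperModel("large-v3", device="cpu", compute_type="int8")
    segments, info = model.transcribe(path)  # segments is a lazy generator
    return [segment.text for segment in segments]
```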

5

Optimize for your domain

Use the prompt parameter to provide domain specific vocabulary, acronyms, and proper nouns. This significantly improves accuracy for specialized content like medical, legal, or technical transcription.



Last updated: 2026-02-21