OpenAI Whisper
voice · by OpenAI
Open source speech recognition model supporting 98 languages, available both as a free local tool and through OpenAI's transcription API.
Key features
Free tier available (run locally); hosted API access (gpt-4o-mini-transcribe) from $0.003 per minute
Developers building speech to text features into applications who want a proven, reliable transcription engine
Fully open source under MIT license with no usage restrictions, allowing unlimited commercial and personal use at zero cost
What it does
Multilingual Transcription
Transcribe speech to text in 98 languages. The model automatically detects the spoken language and produces accurate transcriptions with punctuation and formatting.
Speech Translation
Translate speech from any of the 98 supported languages directly into English text in a single step. No intermediate transcription is needed; the model translates end to end.
Open Source (MIT License)
The entire model, weights, and code are available under the MIT license. You can download, modify, fine tune, and deploy Whisper for any purpose, including commercial use, with no restrictions.
Multiple Model Sizes
Choose from tiny (39M), base (74M), small (244M), medium (769M), large-v3 (1.55B), and turbo (809M) depending on your hardware and accuracy requirements. Smaller models run on CPUs; larger models need GPUs.
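One way to apply this tradeoff is to pick the largest checkpoint that fits your GPU memory. A minimal sketch: the parameter counts come from the list above, while the VRAM figures are approximate values reported in the openai/whisper README and should be treated as rough guidance, not hard limits.

```python
# Rough guide for picking a Whisper checkpoint from available VRAM.
# Parameter counts come from the listing above; the VRAM figures are
# approximate values from the openai/whisper README.
MODELS = [
    # (name, parameters, approx. VRAM needed in GB)
    ("tiny", "39M", 1),
    ("base", "74M", 1),
    ("small", "244M", 2),
    ("medium", "769M", 5),
    ("turbo", "809M", 6),
    ("large-v3", "1.55B", 10),
]

def pick_model(vram_gb: float) -> str:
    """Return the largest checkpoint that fits in the given VRAM budget."""
    best = "tiny"  # the smallest model runs even on CPU-class hardware
    for name, _params, needed_gb in MODELS:
        if needed_gb <= vram_gb:
            best = name
    return best
```

For example, an 8 GB consumer GPU lands on turbo, while a 12 GB card can run large-v3.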
Turbo Variant
The large-v3-turbo model (809M parameters) was created by pruning the large-v3 decoder from 32 layers to 4 layers. It runs significantly faster with minimal quality loss, making it the best balance of speed and accuracy for most use cases.
Word Level Timestamps
Get precise start and end times for every word in the transcription. Essential for subtitle generation, audio editing, and synchronized text displays.
Speaker Diarization
The gpt-4o-transcribe-diarize API model identifies and labels different speakers in a conversation. Useful for meeting transcription, interviews, and multi speaker audio.
Streaming Transcription
The gpt-4o-transcribe API models support streaming output, delivering transcription results in real time as audio is processed. Also available via the Realtime API over WebSocket.
Prompting for Domain Vocabulary
Provide a text prompt with domain specific terminology, acronyms, or proper nouns to improve transcription accuracy. The model uses the prompt as context to correctly spell specialized vocabulary.
Multiple Output Formats
Export transcriptions in JSON (with timestamps and metadata), SRT (subtitle format), VTT (web subtitles), plain text, or verbose JSON with word level timing data.
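To illustrate how the word level timing data maps onto a subtitle format, here is a minimal sketch that renders SRT cues from a list of word timings. The `words` structure is a simplified stand-in for the word entries in Whisper's verbose JSON output, not the full response shape.

```python
# Minimal sketch: turn word level timing data (shaped like the word
# entries in Whisper's verbose JSON output) into SRT subtitle cues.
def fmt(t: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(t * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words=7):
    """Group words into cues of up to max_words and render SRT text."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"] for w in chunk)
        cues.append(
            f"{len(cues) + 1}\n"
            f"{fmt(chunk[0]['start'])} --> {fmt(chunk[-1]['end'])}\n"
            f"{text}\n"
        )
    return "\n".join(cues)

words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
]
print(words_to_srt(words))
```

In practice the CLI and API can emit SRT directly; this only shows what the conversion from word timings looks like.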
Realtime API Support
Use Whisper through OpenAI's Realtime API for live audio transcription over WebSocket connections. Enables real time voice applications, live captioning, and conversational AI systems.
Pricing
Open Source
Run Whisper locally on your own hardware at no cost. MIT license allows unlimited commercial and personal use. You provide the compute (CPU or GPU).
- All model sizes (tiny to large-v3)
- Turbo variant included
- 98 language support
- Word level timestamps
- Translation to English
- No API limits or quotas
- Full offline operation
- MIT license (commercial use allowed)
API (whisper-1)
OpenAI hosted transcription at $0.006 per minute of audio. No infrastructure to manage. Based on the large-v2 model.
- $0.006 per minute of audio
- Managed infrastructure
- Automatic language detection
- Word level timestamps
- Multiple output formats
- Prompting support
- 25 MB file size limit
API (gpt-4o-mini-transcribe)
Newer transcription model at $0.003 per minute, offering better accuracy than whisper-1 at half the price.
- $0.003 per minute of audio
- Improved accuracy over whisper-1
- Streaming support
- Managed infrastructure
- Token based pricing also available
- 25 MB file size limit
API (gpt-4o-transcribe)
Most capable API transcription model at $0.006 per minute. Full streaming support with the highest accuracy available.
- $0.006 per minute of audio
- Highest accuracy
- Full streaming support
- Speaker diarization (diarize variant)
- Managed infrastructure
- Token based pricing also available
- 25 MB file size limit
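For budgeting, the per minute rates above translate directly into volume costs. A back-of-envelope sketch using only the listed prices:

```python
# Back-of-envelope cost estimate for the hosted models, using the
# per minute rates listed above. Token based pricing is not modeled.
RATES_PER_MINUTE = {
    "whisper-1": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
    "gpt-4o-transcribe": 0.006,
}

def audio_cost(hours_of_audio: float, model: str) -> float:
    """USD cost for a volume of audio at the listed per minute rate."""
    return round(hours_of_audio * 60 * RATES_PER_MINUTE[model], 2)

print(audio_cost(1000, "whisper-1"))              # 1,000 hours on whisper-1
print(audio_cost(1000, "gpt-4o-mini-transcribe"))
```

At these rates, 1,000 hours of audio costs $360 on whisper-1 and $180 on gpt-4o-mini-transcribe.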
Pros & Cons
Pros
- Fully open source under MIT license with no usage restrictions, allowing unlimited commercial and personal use at zero cost
- Free to run locally on your own hardware, from a Raspberry Pi (tiny model) to a workstation GPU (large model)
- 98 language support with automatic language detection, one of the broadest multilingual ASR models available
- Multiple model sizes let you choose the right tradeoff between speed and accuracy for your hardware and use case
- Massive community ecosystem including whisper.cpp, faster-whisper, and insanely-fast-whisper with significant performance improvements
- API option provides a managed, hassle free experience for teams that do not want to manage GPU infrastructure
- Word level timestamps enable precise subtitle generation, audio editing, and synchronized text overlays
Cons
- The large-v3 model requires approximately 10GB of VRAM, putting it out of reach for machines without a dedicated GPU
- The turbo model can only transcribe, not translate, so translation to English still requires large-v3 or one of the older multilingual models
- API pricing adds up for high volume workloads; at $0.006/min, 1,000 hours of audio costs $360
- The newer gpt-4o-transcribe models are not open source, so the best API accuracy is locked behind the paid service
- Accuracy varies significantly by language; high resource languages (English, Spanish, French) perform much better than low resource ones
- Audio is processed in 30 second windows, which can cause issues (including hallucinated text) with long pauses, silence, or non speech segments
How to get started
Choose your approach: local or API
Decide whether you want to run Whisper locally (free, requires hardware) or use the OpenAI API (paid per minute, no setup). Local is best for privacy, high volume, and customization. API is best for convenience and minimal infrastructure.
Run locally with pip install
Install the open source model with pip install openai-whisper. Choose a model size: start with base or small for testing, then move to large-v3-turbo for production. Run with: whisper audio.mp3 --model turbo
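The same workflow is available from Python via the openai-whisper package. A minimal sketch; the import lives inside the function so the snippet loads even where the package is not installed, and "audio.mp3" stands in for your own file.

```python
# The CLI workflow above, expressed in Python with the openai-whisper
# package. Sketch only: the model download and transcription happen
# when the function is actually called with a real audio file.
def transcribe_local(path: str, model_name: str = "turbo") -> str:
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_name)  # downloads weights on first use
    result = model.transcribe(path)         # auto-detects the spoken language
    return result["text"]
```

Calling `transcribe_local("audio.mp3")` mirrors `whisper audio.mp3 --model turbo`.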
Or use the API
Get an API key from platform.openai.com. Send audio files to the /v1/audio/transcriptions endpoint. The API handles all infrastructure, model loading, and scaling automatically.
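The equivalent call with the openai Python SDK looks like the sketch below. It assumes OPENAI_API_KEY is set in the environment and is not run here; the import is deferred so the snippet loads without the SDK installed.

```python
# Hosted transcription via the /v1/audio/transcriptions endpoint,
# using the openai Python SDK. Sketch only; requires OPENAI_API_KEY.
def transcribe_api(path: str, model: str = "gpt-4o-mini-transcribe") -> str:
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model=model, file=audio)
    return result.text
```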
Try community alternatives for better performance
For faster local inference, try faster-whisper (CTranslate2 based, up to 4x faster) or whisper.cpp (C/C++ port, runs efficiently on CPUs). For batch processing on GPUs, insanely-fast-whisper uses batched inference for maximum throughput.
Optimize for your domain
Use the prompt parameter to provide domain specific vocabulary, acronyms, and proper nouns. This significantly improves accuracy for specialized content like medical, legal, or technical transcription.
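As a sketch of what this looks like in practice, the API call below seeds the model with a vocabulary list via the prompt parameter. The terminology shown is illustrative; any short text can serve as the prompt.

```python
# Seeding the transcription with domain vocabulary via the prompt
# parameter. Sketch only; requires OPENAI_API_KEY and a real audio file.
def transcribe_with_vocab(path: str, vocabulary: list) -> str:
    from openai import OpenAI  # pip install openai

    client = OpenAI()
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio,
            prompt=", ".join(vocabulary),  # e.g. "EBITDA, CRISPR, Kubernetes"
        )
    return result.text
```

The local package exposes the same idea through an initial prompt option on transcription.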
Deep dive
Detailed guides with comparisons, tips, and visuals for each feature.
Model Sizes and Performance
All Whisper model sizes from tiny (39M) to large-v3 (1.55B) and turbo (809M), with performance versus speed tradeoffs and hardware requirements.
API vs Local Deployment
When to use the OpenAI API versus running Whisper locally, with cost comparisons and the newer gpt-4o-transcribe models.
Whisper vs Other Speech to Text Services
How Whisper compares to Google Speech to Text, Azure Speech, AssemblyAI, and Deepgram on accuracy, pricing, and features.
Links
Documentation
Pricing
Similar Tools
ElevenLabs
voice · by ElevenLabs
The most realistic AI voice platform. Text to speech, voice cloning, dubbing, sound effects, music generation, and conversational AI agents.
ChatGPT
chatbot · by OpenAI
The most popular AI assistant in the world: text, images, video, voice, search, and code in one place.
Last updated: 2026-02-21