OpenAI’s open-weights speech-to-text model, released September 2022: a 1.5B-parameter encoder-decoder transformer trained on 680,000 hours of multilingual audio. The weights remain under the MIT license; OpenAI’s hosted whisper-1 API runs at $0.006 per minute.
As of March 2025, OpenAI offers two newer hosted transcription models on the same endpoint: GPT-4o Transcribe (same $0.006/min, improved accuracy on low-resource languages, optional speaker diarization) and GPT-4o Mini Transcribe ($0.003/min, budget tier). Whisper-1 is now the legacy model but remains available and is still the reference for self-hosted deployments.
Recent developments
- April 2026: Self-hosted Whisper remains the dominant open-weights baseline. Projects like faster-whisper (CTranslate2 runtime) and whisper.cpp (C++ port) deliver 5-10x speedups on consumer hardware; on Apple Silicon, inference routinely runs faster than real time.
- March 2025: OpenAI shipped GPT-4o Transcribe and GPT-4o Mini Transcribe as hosted successors on the same /v1/audio/transcriptions endpoint. Whisper-1 stays callable by model ID; the MIT weights are unchanged.
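Calling the hosted endpoint is a one-line model swap between the legacy and successor IDs. A minimal sketch using the official openai Python SDK, assuming OPENAI_API_KEY is set in the environment and the file path is yours:

```python
def transcribe(path: str, model: str = "whisper-1") -> str:
    """Send one audio file to /v1/audio/transcriptions and return the text.

    Pass model="gpt-4o-transcribe" or "gpt-4o-mini-transcribe" for the
    successor tiers; the request shape is identical.
    """
    from openai import OpenAI  # official SDK; reads OPENAI_API_KEY from the env
    client = OpenAI()
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(model=model, file=f).text

# transcribe("audio.mp3")                        # legacy Whisper-1
# transcribe("audio.mp3", "gpt-4o-transcribe")   # hosted successor
```

The SDK import sits inside the function so the helper stays importable in environments without the package installed.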
System Verdict
Pick Whisper if you need multilingual transcription with the option to self-host, or a $0.006/minute API baseline. Self-hosted under the MIT license gives unlimited offline volume; the hosted API covers teams that do not want to operate GPUs. Accent and noise robustness beat most commercial APIs on published evaluations.
Skip it for real-time streaming, word-accurate timestamps, or built-in speaker diarization. Whisper is batch-first: segment-level timestamps only, and no native diarization. GPT-4o Transcribe with Diarization fits there. For sub-second live transcription, AssemblyAI and Deepgram lead.
Who pays which tier: Self-host free for researchers, privacy-first teams, and batch jobs. OpenAI API at $0.006/min for teams that do not want to run GPUs. GPT-4o Mini Transcribe at $0.003/min for cost-sensitive production. Third-party wrappers (Replicate, Hugging Face Inference Endpoints) offer managed Whisper hosting at comparable rates.
Key Facts
| Model | Whisper-1 (1.5B parameters, transformer encoder-decoder) |
| License | MIT (weights and reference code) |
| Successor models on same API | GPT-4o Transcribe, GPT-4o Mini Transcribe |
| Languages | 99 |
| API pricing | $0.006/min (Whisper-1, GPT-4o Transcribe) · $0.003/min (Mini) |
| Self-host cost | Free (MIT) plus GPU compute |
| Max file size (API) | 25MB per request; split long files client-side |
| Formats accepted | mp3, mp4, mpeg, mpga, m4a, wav, webm, flac, ogg |
| Timestamps | Segment-level (word-level via Whisper-Timestamped fork) |
| Speaker diarization | No native; available via GPT-4o Transcribe or third-party (pyannote, whisperx) |
| Real-time streaming | No native; batch-first architecture |
Every data point above verified on 2026-04-17 via openai.com and OpenAI API pricing.
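The 25MB-per-request cap in the table means long recordings must be split client-side before upload. A sketch of the boundary arithmetic only (the 10-minute chunk length and 5-second overlap are illustrative choices, not API requirements):

```python
def chunk_bounds(total_ms: int, chunk_ms: int = 10 * 60_000, overlap_ms: int = 5_000):
    """Return (start, end) millisecond windows covering the audio, overlapping
    so no words are lost at cut points. Ten minutes of 16kHz mono WAV is
    roughly 19MB, comfortably under the API's 25MB cap."""
    bounds, start = [], 0
    while start < total_ms:
        end = min(start + chunk_ms, total_ms)
        bounds.append((start, end))
        if end == total_ms:
            break
        start = end - overlap_ms  # back up so the next chunk re-hears the seam
    return bounds
```

Feed each window to an audio slicer (e.g. pydub's millisecond slicing) and transcribe the pieces independently, then dedupe text in the overlap regions.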
What it actually is
A single transformer model that takes audio in and emits text plus segment-level timestamps. Shipped with five size variants (tiny, base, small, medium, large); large-v3 is the latest large checkpoint and the production default, while smaller variants run on CPU or mobile hardware with accuracy tradeoffs.
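The audio-in, text-plus-timestamps-out flow is a few lines with the reference openai-whisper package. A minimal sketch; the file name is a placeholder:

```python
def transcribe_local(path: str, size: str = "large-v3") -> dict:
    """Run the reference implementation (pip install openai-whisper).

    Returns a dict with "text", "language" (auto-detected, no per-language
    routing needed), and "segments" carrying segment-level timestamps.
    """
    import whisper  # downloads the named checkpoint on first use
    model = whisper.load_model(size)  # "tiny", "base", "small", "medium", "large-v3"
    return model.transcribe(path)

# result = transcribe_local("meeting.wav", size="small")
# result["text"], result["language"], result["segments"][0]["start"]
```

Dropping to "small" or "tiny" trades accuracy for CPU-friendly speed, exactly the variant tradeoff described above.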
The moats compound:
- Open weights. MIT license. Downloadable from OpenAI’s GitHub or Hugging Face. Runnable anywhere.
- Multilingual depth. 99 languages trained together, not a patchwork of per-language models.
- Community runtimes. faster-whisper, whisper.cpp, whisperx, mlx-whisper, and others deliver the speed gains the reference implementation lacks.
- Baseline status. Every transcription tool in the last three years benchmarks against Whisper. Most wrap it.
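The community runtimes keep the same checkpoints but swap the execution engine. A sketch using faster-whisper (the CTranslate2 runtime named above) with int8 quantization; the file path is a placeholder:

```python
def transcribe_fast(path: str):
    """faster-whisper: same Whisper weights on the CTranslate2 runtime
    (pip install faster-whisper). int8 quantization trades a little accuracy
    for a large speed and memory win on CPU."""
    from faster_whisper import WhisperModel
    model = WhisperModel("large-v3", compute_type="int8")
    segments, info = model.transcribe(path)  # segments is a lazy generator
    return [(s.start, s.end, s.text) for s in segments], info.language
```

The transcribe call returns a generator, so decoding only happens as you iterate, which is convenient for streaming results into a UI or file as they arrive.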
When to pick Whisper
- Self-hosting is a hard requirement. Air-gapped environments, healthcare audio, legal transcription where PHI cannot leave the perimeter.
- Batch processing beats real-time. Podcast archives, meeting recordings, video subtitling pipelines.
- Multilingual audio. 99 languages in one model. No per-language routing logic.
- Building a transcription product. MIT license lets you ship commercial products without royalty.
- Accents and noise matter. Field recordings, mobile audio, conference rooms with poor acoustics.
When to pick something else
- Real-time streaming transcription: AssemblyAI or Deepgram. Whisper’s batch-first design fights against sub-second latency.
- Speaker diarization bundled: GPT-4o Transcribe with Diarization on the same OpenAI endpoint, or whisperx as a self-host wrapper.
- Word-level timestamps: Whisper-Timestamped or whisperx forks. Base Whisper emits segment-level only.
- Podcast recording + editing: Descript wraps transcription with a transcript-first editor.
- Translation alongside transcription: Whisper’s translate task covers English targets only. For other target languages, pair transcription with a separate translation model.
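The English-only translate task is exposed as a flag on the same local transcribe call. A minimal sketch with the reference package; the file path is a placeholder:

```python
def translate_to_english(path: str) -> str:
    """Whisper's built-in translate task: audio in any of the 99 supported
    source languages in, English text out. For non-English targets, chain
    the transcript through a separate translation model instead."""
    import whisper
    model = whisper.load_model("large-v3")
    return model.transcribe(path, task="translate")["text"]
```

On the hosted API the equivalent lives at the separate /v1/audio/translations endpoint, which likewise emits English only.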
Pricing
OpenAI API (via /v1/audio/transcriptions):
| Model | Price | Notes |
|---|---|---|
| whisper-1 | $0.006/min | Legacy Whisper-1; MIT self-host equivalent is free |
| gpt-4o-transcribe | $0.006/min | Modern successor, improved low-resource language accuracy |
| gpt-4o-transcribe with diarization | $0.006/min | Speaker labels in output |
| gpt-4o-mini-transcribe | $0.003/min | Cost-sensitive tier |
Self-host weights at github.com/openai/whisper and on Hugging Face. Managed Whisper hosting via Replicate or Hugging Face Inference Endpoints runs $0.003-$0.01/minute depending on model size and batching.
Prices verified 2026-04-17 via OpenAI API pricing.
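The per-minute rates make budget math trivial. A sketch using the table's hosted prices (the 500-hour volume in the example is illustrative):

```python
PRICE_PER_MIN = {  # hosted rates from the pricing table above
    "whisper-1": 0.006,
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
}

def monthly_cost(hours_of_audio: float, model: str = "whisper-1") -> float:
    """Hosted API spend in dollars for a given monthly audio volume."""
    return round(hours_of_audio * 60 * PRICE_PER_MIN[model], 2)

# 500 hours/month: $180.00 on whisper-1, $90.00 on the Mini tier
```

At a few hundred hours a month the Mini tier halves the bill; past that, self-hosting on an owned GPU usually wins.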
Against the alternatives
| | Whisper (self-host) | OpenAI API whisper-1 | AssemblyAI | Deepgram |
|---|---|---|---|---|
| Price | Free + GPU | $0.006/min | $0.37/hour (~$0.0062/min) | $0.0043/min |
| Open weights | MIT | No | No | No |
| Languages | 99 | 99 | 99 | 50+ |
| Real-time | No | No | Yes | Yes (lowest latency) |
| Diarization | Via wrappers | GPT-4o Transcribe tier | Native | Native |
| Best viewed as | Open-source baseline | Managed Whisper | Real-time specialist | Low-latency streaming |
Failure modes
- Hallucinated text on silence. Whisper is prone to generating plausible-sounding text when the audio contains long silence or non-speech. Mitigations: VAD preprocessing (Silero VAD), chunk with overlap, or move to GPT-4o Transcribe which handles this better.
- No word-level timestamps in base model. Use whisperx or Whisper-Timestamped forks for word-accurate alignment.
- No speaker diarization natively. Add pyannote, whisperx, or switch to GPT-4o Transcribe.
- 25MB API file cap. Long recordings need client-side chunking.
- Large-v3 is compute-heavy. Real-time on CPU needs small or tiny variants with quality tradeoffs; use faster-whisper + quantization for 5-10x speedup.
- English bias. Accuracy is strongest on English, weakest on low-resource languages. Vendor-reported multilingual numbers mix across tiers.
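The silence-hallucination mitigation boils down to never handing Whisper audio windows that contain no speech. A deliberately crude energy-gate sketch to show the shape of the preprocessing; a production pipeline would use a trained VAD such as Silero rather than raw RMS, and the window length and floor here are arbitrary:

```python
import array
import math

def drop_silent_chunks(pcm: array.array, rate: int = 16_000,
                       window_s: float = 0.5, rms_floor: float = 0.01) -> array.array:
    """Keep only fixed-length windows whose RMS energy clears a floor, so the
    transcription model never sees long silence it could hallucinate over.
    pcm is float samples in [-1.0, 1.0]; stand-in for a real VAD like Silero."""
    win = int(rate * window_s)
    kept = array.array("f")
    for i in range(0, len(pcm), win):
        chunk = pcm[i:i + win]
        rms = math.sqrt(sum(x * x for x in chunk) / max(len(chunk), 1))
        if rms >= rms_floor:
            kept.extend(chunk)
    return kept
```

A real VAD additionally distinguishes speech from loud non-speech (music, keyboard noise), which a pure energy gate cannot.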
Methodology
This page was produced by the aipedia.wiki editorial pipeline, an automated system that ingests vendor documentation, verifies pricing and model details against primary sources, and generates the editorial analysis you are reading. No individual human wrote this review. Scoring follows the four-dimension rubric at /about/scoring/. Last verified 2026-04-23 against github.com/openai/whisper, OpenAI Whisper research post, and OpenAI API pricing.
FAQ
Is Whisper free?
Yes, when self-hosted. The model weights and reference code are under the MIT license. Downloadable from OpenAI’s GitHub or Hugging Face. You pay only for compute. The OpenAI-hosted API costs $0.006 per minute of audio.
How accurate is Whisper?
On English, Whisper large-v3 sits near the top of published leaderboards, within 1-2 WER points of the best commercial APIs. On multilingual audio, accents, and noisy recordings, independent evaluations often rank it first. GPT-4o Transcribe improves specifically on low-resource languages.
What's the difference between Whisper-1 and GPT-4o Transcribe?
Whisper-1 is the 2022 open-weights model, MIT license, widely self-hosted. GPT-4o Transcribe (March 2025) is OpenAI’s hosted successor on the same API endpoint with improved accuracy and optional speaker diarization. Same $0.006/min pricing. Whisper-1 stays callable by model ID; GPT-4o Transcribe is the default recommendation for new builds on the hosted API.
Can Whisper do real-time streaming transcription?
Not natively. The base model is batch-first: it expects complete audio chunks. For real-time, use AssemblyAI or Deepgram, which are architected for low-latency streaming.
Does Whisper do speaker diarization?
Not natively. Options: (1) wrap it with pyannote.audio or whisperx for open-source diarization, (2) switch to GPT-4o Transcribe with Diarization on the OpenAI API, or (3) use a commercial tool that wraps Whisper (Descript, AssemblyAI) for built-in speaker labels.
What languages does Whisper support?
99 languages in a single multilingual model. English and major European languages hit the highest accuracy tiers. Low-resource languages (Swahili, Bengali, Tamil, Khmer) trail commercial specialists but remain usable. GPT-4o Transcribe closes some of the low-resource gap.
Related
- Category: AI Voice
- Also see: ChatGPT (the hosted Whisper entry point) · Descript (transcript-first editor wrapping Whisper) · ElevenLabs (voice synthesis pair)