Engineering deep-dive

How whisper.cpp works under the hood

GGML, quantization, and three GPU backends - what makes OpenAI's research model fast enough to run on the laptop in front of you.

By SnailText's founder · Published 2026-05-23

The short version

whisper.cpp ports OpenAI's Whisper from Python and PyTorch to a single C++ binary using the GGML tensor library. GGML stores weights in custom quantized formats (8-bit down to 4-bit) that cut model size roughly 2-4x relative to FP16 and let CPUs run the smaller models at real-time speed. GPU acceleration, via Metal on Apple hardware, Vulkan across vendors on Windows and Linux, or CUDA on NVIDIA, drops latency from seconds to under 200ms for short phrases.

Real performance numbers from production (verified 2026-07-07)
Model	Hardware	Backend	10s clip	30s clip
Whisper Tiny	M2 Air (8 GB)	Metal	0.18s	0.42s
Whisper Base	M2 Air (8 GB)	Metal	0.25s	0.58s
Whisper Small	M2 Air (8 GB)	Metal	0.45s	1.1s
Whisper Medium	M2 Pro (16 GB)	Metal	0.85s	2.0s
Whisper Tiny	Intel i5-1240P	CPU only	0.95s	2.8s
Whisper Base	Intel i5-1240P	CPU only	1.6s	4.8s
Whisper Small	RTX 3060 Mobile	Vulkan	0.30s	0.75s
Whisper Medium	RTX 3060 Mobile	Vulkan	0.55s	1.4s
Whisper Large-v3	RTX 4070	Vulkan	0.60s	1.6s
Whisper Large-v3	RTX 4070	CUDA	0.45s	1.2s

What whisper.cpp actually is (and what it is not)

When OpenAI released Whisper in September 2022, the reference implementation was a Python package built on PyTorch. It worked, but it carried the full PyTorch dependency tree of about 2 GB, needed CUDA for any reasonable speed, and was awkward to ship inside a desktop app.

Georgi Gerganov, a Bulgarian software engineer with a background in physics simulations, released whisper.cpp in October 2022. The project rewrites the Whisper inference loop in C++ using a custom tensor library called GGML. The result is a self-contained binary that links no Python, no PyTorch, no CUDA runtime by default. It runs on a Raspberry Pi 4, an iPhone, a 2015 MacBook Air, or a maxed-out Threadripper, and produces the same transcripts as OpenAI’s Python reference within rounding error.

whisper.cpp is not a different model. The neural network architecture is identical: a Transformer encoder followed by a Transformer decoder of the same depth, ranging from 4 layers each in Tiny to 32 in Large. What changed is the format of the weights, the runtime that executes the matrix multiplications, and the absence of Python in the deployment path.

The practical effect is that any application that wants to run Whisper without internet access can statically link whisper.cpp and ship a self-contained executable. SnailText does this for offline dictation on Mac and Windows. MacWhisper does this. SuperWhisper in local mode does this. More than three years after the project started, whisper.cpp is fast enough for real-time dictation on a phone.

whisper.cpp vs openai-whisper - they are two different things

People new to whisper.cpp often arrive from Python. They have used pip install openai-whisper, run whisper audio.mp3 --model base, and it worked. Now they are trying to find the C++ equivalent and running into confusion.

The short answer: there is no pip install whisper.cpp. The two projects run the same model but ship in entirely different forms.

openai-whisper is the Python reference implementation that OpenAI maintains. It installs via pip, requires PyTorch, and calls into CUDA for GPU acceleration. It is the canonical implementation for server-side processing and research.

whisper.cpp is a separate C++ codebase that runs the same neural network, written from scratch by Georgi Gerganov. It is compiled from source and ships as a binary or static library. There is no pip package. The way you use it is either to compile it yourself (git clone https://github.com/ggerganov/whisper.cpp && make), or to use a desktop application that ships it pre-built.

The model files are also different. openai-whisper downloads .pt files (PyTorch checkpoints). whisper.cpp uses its own binary format: the files published for it use the .bin extension with the GGML format (e.g. ggml-base.bin), while .gguf is the newer container from the same ecosystem. If you try to load a PyTorch .pt file into whisper.cpp, it will fail with a header mismatch error.
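A quick way to tell these file types apart before handing one to a runtime is to look at the first four bytes. The sketch below is illustrative only: the function names are made up for this example, and the magic values are assumptions taken from the public formats (the legacy GGML magic 0x67676d6c stored little-endian, the ASCII tag "GGUF", and the ZIP signature that recent PyTorch checkpoints start with).

```python
import struct

# Assumed magic values (see the note above); verify against your own files.
GGML_MAGIC = struct.pack("<I", 0x67676D6C)  # legacy GGML .bin files
GGUF_MAGIC = b"GGUF"                        # GGUF container
ZIP_MAGIC = b"PK\x03\x04"                   # zip-based PyTorch checkpoints


def sniff_model_format(header: bytes) -> str:
    """Classify a model file from its first four bytes."""
    magic = header[:4]
    if magic == GGML_MAGIC:
        return "ggml"
    if magic == GGUF_MAGIC:
        return "gguf"
    if magic == ZIP_MAGIC:
        return "pytorch-zip"
    return "unknown"


def sniff_model_file(path: str) -> str:
    """Read only the header of the file at `path` and classify it."""
    with open(path, "rb") as f:
        return sniff_model_format(f.read(4))
```

Calling sniff_model_file on a download before loading it gives a clearer error message than letting the runtime reject the header.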

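Where to get the whisper.cpp model files: they are distributed on Hugging Face under the ggerganov/whisper.cpp repository. The plain ggml-<size>.bin files hold the unquantized FP16 weights, which is why the sizes below match the FP16 figures; quantized variants carry a suffix naming the format, for example ggml-base-q5_1.bin: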

File	Model	Size on disk	Good for
`ggml-tiny.bin`	Whisper Tiny	77 MB	Raspberry Pi, embedded, lowest latency
`ggml-base.bin`	Whisper Base	148 MB	Fast laptops, CPU-only use
`ggml-small.bin`	Whisper Small	488 MB	Good balance of speed and accuracy
`ggml-medium.bin`	Whisper Medium	1.5 GB	High accuracy on GPU
`ggml-large-v3.bin`	Whisper Large-v3	3.1 GB	Best accuracy, needs GPU

If you are building a desktop application that ships whisper.cpp, you typically bundle one or more of these files as part of your installer. SnailText, for example, offers a model picker UI — the user selects a size, the app downloads the appropriate .bin file from our CDN, and places it in the application data directory.

If you are building a server-side Python pipeline, openai-whisper or faster-whisper (which uses CTranslate2) are more idiomatic. whisper.cpp is the right choice for desktop apps, embedded systems, and anything that needs to ship a self-contained binary without Python.

GGML - the tensor library that does the actual math

GGML is a small C library (about 30,000 lines as of mid-2026) that handles four things: a file format for serialized neural-network weights, a graph representation for the sequence of operations a network performs, a CPU execution backend with hand-tuned SIMD kernels, and a set of GPU backends (Metal, CUDA, Vulkan, ROCm, SYCL) that delegate the matrix multiplications to hardware accelerators when one is available.

The file format - now called GGUF as of August 2023, replacing the older GGML format - is a single binary file that contains the model’s architecture metadata, vocabulary, and quantized weight tensors in a layout designed for memory-mapped access. When whisper.cpp loads a model, it does not allocate gigabytes of RAM and copy the weights in. It calls mmap (or the Windows equivalent) and lets the operating system page weights into memory as the inference code actually touches them. On a system with plenty of RAM, the file effectively ends up cached entirely. On a tight system, the OS pages out cold layers and keeps hot ones resident. Either way, GGML does not have to manage that itself.
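To make the memory-mapping idea concrete, here is a minimal Python sketch of the general mechanism. It is not whisper.cpp's loader: the helper names and the 1 MiB stand-in file are invented for illustration. The point is that a slice near the end of the file can be read without first copying everything before it into process memory.

```python
import mmap
import os
import tempfile


def read_slice_mmapped(path: str, offset: int, length: int) -> bytes:
    """Return `length` bytes at `offset` without reading the whole file."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file; the OS faults pages in on first access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


def demo() -> bytes:
    """Build a 1 MiB stand-in for a weights file and read its last 4 bytes."""
    size = 1024 * 1024
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\x00" * (size - 4) + b"TAIL")
        path = tmp.name
    try:
        return read_slice_mmapped(path, size - 4, 4)
    finally:
        os.remove(path)
```

With a real multi-gigabyte file the same pattern means only the pages that are actually touched need to be resident.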

The execution model is a directed acyclic graph. Before the first inference, GGML builds a graph node for every tensor operation Whisper performs: the multi-head self-attention, the feed-forward MLPs, the cross-attention between encoder and decoder, the final softmax. The graph is compiled once. On each inference, GGML walks the graph, dispatching each node to whichever backend has been selected. The dispatch happens at the granularity of a tensor operation, not the whole graph, which means a GPU backend that does not support some exotic op can fall back to CPU for just that op without losing the rest of the GPU acceleration.
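The per-op dispatch described above can be sketched in a few lines. This is a toy model under stated assumptions, not GGML's scheduler: the Backend class, the op names, and the scalar "kernels" are all hypothetical, and a linear op list stands in for the graph. It shows only the fallback rule: use the preferred backend when it implements an op, otherwise run that single op elsewhere.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Backend:
    name: str
    kernels: Dict[str, Callable[[float], float]]  # op name -> implementation

    def supports(self, op: str) -> bool:
        return op in self.kernels


def run_graph(ops: List[str], x: float, preferred: Backend,
              fallback: Backend) -> Tuple[float, List[Tuple[str, str]]]:
    """Walk the ops in order, choosing a backend separately for each op."""
    placement = []
    for op in ops:
        backend = preferred if preferred.supports(op) else fallback
        x = backend.kernels[op](x)
        placement.append((op, backend.name))
    return x, placement


# Toy scalar kernels; the "gpu" backend deliberately lacks the "exotic" op.
cpu = Backend("cpu", {
    "scale": lambda v: v * 2.0,
    "bias": lambda v: v + 1.0,
    "exotic": lambda v: v * v,
})
gpu = Backend("gpu", {
    "scale": lambda v: v * 2.0,
    "bias": lambda v: v + 1.0,
})

result, placement = run_graph(["scale", "exotic", "bias"], 3.0, gpu, cpu)
```

Here placement records that only the middle op ran on the fallback backend, while the ops before and after it stayed on the preferred one.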

INT8 and INT4 quantization - how a 1.5 GB model becomes 600 MB without falling apart

The Whisper Large model, in its original FP16 PyTorch form, is roughly 3 GB on disk. Whisper Small is around 480 MB. Whisper Tiny is around 75 MB. These are the FP16 numbers from OpenAI’s release.

GGML stores these models in quantized formats. The default for production use is Q5_1, which means each weight is encoded with roughly 5 to 6 bits on average instead of 16. The result is a roughly 2.5-3x reduction in disk size and a similar reduction in RAM use. Whisper Large at Q5_1 fits in about 1.2 GB. Whisper Small fits in roughly 190 MB. This is what makes it realistic to ship Whisper as part of a desktop application installer.

The trick is in how the bits are allocated. Naive quantization - just rounding each FP16 weight to the nearest INT8 value with a single scale for the whole tensor - destroys accuracy. The transformer architecture is sensitive to weight magnitude, and uniform quantization produces enough error per layer that, multiplied across 24 layers, the model just produces nonsense.

GGML uses block-wise quantization with per-block scaling factors. The weights are split into small blocks (typically 32 weights per block), each block’s maximum absolute value is computed, and the weights are scaled to fit in the available bit budget. A small floating-point scale factor is stored alongside each block. On dequantization, the integer values are multiplied back by the scale to recover an approximation of the original FP16. For some quantization formats, a second offset value per block is also stored to handle weights with asymmetric distributions.
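The scheme above can be sketched in plain Python. This is a simplified symmetric variant with one scale per block and no offset, written for readability rather than speed. It does not reproduce GGML's on-disk packing, and the function names are chosen for this example only.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # weights per block, as in the text


def quantize_block(weights: List[float], bits: int = 8) -> Tuple[float, List[int]]:
    """Quantize one block to signed integers sharing a single scale."""
    qmax = (1 << (bits - 1)) - 1            # 127 for 8 bits
    amax = max(abs(w) for w in weights)     # the block's maximum absolute value
    scale = amax / qmax if amax > 0 else 1.0
    return scale, [round(w / scale) for w in weights]


def dequantize_block(scale: float, q: List[int]) -> List[float]:
    """Multiply the integers back by the scale to approximate the originals."""
    return [scale * v for v in q]


def quantize(weights: List[float], bits: int = 8) -> List[Tuple[float, List[int]]]:
    """Split the weights into blocks and quantize each one independently."""
    return [quantize_block(weights[i:i + BLOCK_SIZE], bits)
            for i in range(0, len(weights), BLOCK_SIZE)]
```

Because each block gets its own scale, a block of tiny weights keeps fine resolution even when a neighbouring block contains much larger values. The rounding error per weight is at most half of that block's scale.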

The Q5_1 format used by most production whisper.cpp deployments stores a 5-bit integer per weight, plus an FP16 scale and an FP16 minimum per 32-weight block. The effective bits-per-weight including all the overhead is 6.0; the closely related Q5_0, which stores only the scale, comes to 5.5. For comparison, Q4_0 is 4.5 bits per weight and Q8_0 is 8.5 bits per weight.
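The bits-per-weight figures follow from simple arithmetic over one block. The helper below is a worked example, assuming 32 weights per block and 16 bits for each per-block FP16 value; the function name is invented for illustration.

```python
def bits_per_weight(weight_bits: int, block_size: int = 32,
                    overhead_bits: int = 16) -> float:
    """Average storage cost per weight, including per-block metadata."""
    return (weight_bits * block_size + overhead_bits) / block_size


# One FP16 scale per block (16 bits of overhead):
four_bit = bits_per_weight(4)    # the Q4_0 figure quoted above
eight_bit = bits_per_weight(8)   # the Q8_0 figure quoted above
five_bit = bits_per_weight(5)

# A second FP16 value per block (for example an offset) adds half a bit per weight:
five_bit_with_offset = bits_per_weight(5, overhead_bits=32)
```

The overhead term is why the quoted figures sit half a bit or more above the nominal width: 16 bits of metadata spread over 32 weights costs 0.5 bits each.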

The accuracy cost is measurable but small. On the LibriSpeech clean test set, Whisper Small at FP16 scores about 3.4% word error rate (per the original Whisper paper, Table 2). The same model at Q5_1 scores about 3.5-3.7% depending on the build. At Q4_0 the gap widens to about 0.3-0.5 percentage points. Below Q4 the gap starts to matter, and below Q3 you get noticeable degradation on accented speech.

Production deployments (SnailText included) almost always ship Q5_1 by default. It is the format with the best size and accuracy tradeoff for compact-to-medium Whisper variants. The full Q5_1 model files for Whisper Tiny through Large-v3 are available on Hugging Face under ggerganov/whisper.cpp.

The three GPU backends, and why they all exist

CPU inference works on every machine but is the slowest path. A modern laptop CPU can run Whisper Small at roughly 1.5-2x real-time, meaning a 10-second recording transcribes in about 5-7 seconds. For dictation, where the user expects to see the text within a second of finishing the phrase, that is too slow.

GPU acceleration is the path to sub-second latency, and whisper.cpp supports three serious GPU backends.

Metal is Apple’s GPU API. On any Mac with Apple Silicon (M1 and later) the M-series chip has a unified-memory architecture where the GPU shares the same RAM as the CPU. This is unusual. On x86 systems with discrete GPUs, you have to copy data from system RAM to GPU VRAM before the GPU can touch it, and copy results back when it is done. On Apple Silicon there is no copy. The GPU and CPU read the same physical memory pages. The practical effect is that Metal-backed whisper.cpp on an M2 or M3 transcribes a 10-second clip in about 0.4-0.8 seconds for Whisper Small. M4 is faster again. The Metal backend is the most polished of the three.

CUDA is NVIDIA’s GPU API. It is the original GPU compute platform, dating to 2007, and remains the fastest backend by absolute throughput for large models on a dedicated NVIDIA card. An RTX 4070 transcribes Whisper Large-v3 far faster than real-time: 1.2 seconds for a 30-second clip in the table above. The downside is that CUDA only runs on NVIDIA hardware (no AMD, no Intel, no integrated graphics) and requires the user to have the CUDA Toolkit installed, which is a multi-gigabyte download and a friction point for end users. whisper.cpp supports CUDA but it is typically not the default backend in applications targeting consumers.

Vulkan is the cross-vendor GPU API. It works on NVIDIA, AMD, Intel Arc, AMD integrated graphics, Intel integrated graphics, and ARM Mali GPUs. The Vulkan backend in GGML and whisper.cpp shipped in 2024 and matured rapidly through 2025-2026. On a discrete RTX card, Vulkan is roughly 70-90% the speed of CUDA for the same workload. On AMD and Intel hardware, where CUDA is not an option, Vulkan is the only meaningful path to acceleration. The Vulkan backend is what SnailText ships for voice-to-text on Windows because it lets us produce a single binary that GPU-accelerates on any modern laptop without asking the user to install drivers.

There are also ROCm (AMD’s CUDA equivalent) and SYCL (Intel’s compute stack). Both work in whisper.cpp but have meaningfully smaller user bases.

Building whisper.cpp on Windows 11 with CUDA or Vulkan

The upstream README covers Linux most clearly. Here is what the Windows build actually requires.

Prerequisites for any Windows build:

Visual Studio 2022 with “Desktop development with C++” workload (provides MSVC and cmake)
Git for Windows

CUDA build (NVIDIA cards):

Install the CUDA Toolkit 12.x from NVIDIA’s site — it is a separate download from the driver. Then:

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

The resulting executable under build/bin/Release/ (named whisper-cli.exe in recent upstream releases) uses CUDA. To verify, run it once and check the startup log, which should name the CUDA device it initialized.

Vulkan build (AMD, Intel, or any GPU without CUDA):

Install the Vulkan SDK and make sure VULKAN_SDK is set in your environment. Then:

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

Vulkan works on NVIDIA too — it is the safer default if you want a single binary that runs on any modern Windows GPU. Consumer apps typically ship Vulkan rather than CUDA to avoid requiring the CUDA Toolkit on the end user’s machine.

Model download:

Both builds expect .bin model files in GGML format from the ggerganov/whisper.cpp repo on Hugging Face. Download ggml-base.bin (148 MB) to test, or ggml-large-v3.bin (3.1 GB) for maximum accuracy.

How the production numbers translate to user experience

The benchmark table above shows the end-to-end times we measure on common consumer hardware configurations. Three observations are worth flagging.

First, M-series Macs hit interactive latency (under a second for a 10-second dictation phrase) on every model we measured, Tiny through Medium, even without a discrete GPU, because of unified memory. Second, the Vulkan backend on a modern NVIDIA card takes about a third longer than CUDA on the same clip (0.60s against 0.45s for Large-v3 on an RTX 4070), which is close enough that the convenience of no CUDA install wins for consumer apps. Third, CPU-only on a recent Intel laptop is fast enough for occasional use but not fast enough to feel instant.
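The comparisons in these observations are plain arithmetic over the benchmark table. The sketch below copies a few rows from that table and defines two helpers (names chosen for this example) so the ratios can be recomputed for any pair of rows.

```python
# (seconds for a 10 s clip, seconds for a 30 s clip), copied from the table
MEASURED = {
    ("Large-v3", "RTX 4070", "Vulkan"): (0.60, 1.6),
    ("Large-v3", "RTX 4070", "CUDA"): (0.45, 1.2),
    ("Base", "i5-1240P", "CPU"): (1.6, 4.8),
    ("Small", "M2 Air", "Metal"): (0.45, 1.1),
}


def speed_vs_realtime(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of compute."""
    return audio_seconds / processing_seconds


def relative_slowdown(slower: float, faster: float) -> float:
    """How much longer `slower` takes, as a fraction of `faster`."""
    return slower / faster - 1.0


vulkan_10s = MEASURED[("Large-v3", "RTX 4070", "Vulkan")][0]
cuda_10s = MEASURED[("Large-v3", "RTX 4070", "CUDA")][0]
vulkan_penalty = relative_slowdown(vulkan_10s, cuda_10s)
cpu_base_speed = speed_vs_realtime(10.0, MEASURED[("Base", "i5-1240P", "CPU")][0])
```

On these rows Vulkan takes about a third longer than CUDA, which is the same gap as saying CUDA finishes in 75% of Vulkan's time. CPU-only Whisper Base works out to a little over six times real-time.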

Streaming - getting transcripts before the user finishes speaking

The numbers above are end-to-end latency: hit the stop key, wait for the result. There is a second mode, streaming, where whisper.cpp transcribes the audio in chunks while the user is still talking. The result appears within a few hundred milliseconds of the user stopping, because most of the work is already done.

Streaming requires two things: a voice activity detection (VAD) model that identifies when a phrase has ended (most production deployments use Silero VAD, a small ONNX model that runs in milliseconds), and a way to feed completed phrases into Whisper as they happen, without restarting the model.

whisper.cpp supports both. The standard pattern is: open the audio stream, run VAD continuously, accumulate samples until VAD reports the user has paused, then dispatch that chunk to Whisper while continuing to record. Each chunk transcribes independently. When the user hits the stop key, only the trailing partial chunk has to be transcribed - the prior chunks are already done.
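The accumulate-until-pause pattern can be sketched as follows. Several things here are assumptions for illustration: the energy threshold is a crude stand-in for a real VAD model such as Silero, the transcribe callable is a hypothetical hook for whatever runs Whisper, and the call is synchronous, whereas a real application would hand each chunk to a worker thread so recording continues.

```python
from typing import Callable, Iterable, List

Frame = List[float]  # one short, non-empty frame of PCM samples in [-1, 1]


def is_speech(frame: Frame, threshold: float = 0.01) -> bool:
    """Stand-in VAD: mean energy above a threshold. Not a real VAD model."""
    return sum(s * s for s in frame) / len(frame) > threshold


def stream_phrases(frames: Iterable[Frame],
                   transcribe: Callable[[List[float]], str],
                   pause_frames: int = 3) -> List[str]:
    """Dispatch each phrase to `transcribe` as soon as a pause is detected."""
    results: List[str] = []
    phrase: List[float] = []
    silent_run = 0
    for frame in frames:
        if is_speech(frame):
            phrase.extend(frame)
            silent_run = 0
        elif phrase:
            silent_run += 1
            if silent_run >= pause_frames:
                results.append(transcribe(phrase))  # phrase is complete
                phrase, silent_run = [], 0
    if phrase:
        # The user hit stop: only this trailing partial chunk is left to do.
        results.append(transcribe(phrase))
    return results
```

The pause_frames parameter controls how much silence ends a phrase. Setting it too low splits sentences mid-thought, and setting it too high delays dispatch, so it is worth tuning against real recordings.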

SnailText uses this pattern where the hardware benefits from it - for a typical 30-second dictation on GPU-enabled machines, the user sees the result within a second or two of stopping, because most of the inference happened during the recording itself. Streaming is not free, though: it adds overhead that pays off when individual inference passes are cheap, and pays less (or not at all) when they are not. Whether to enable it for a given configuration is the kind of decision that needs production telemetry rather than a rule of thumb.

The GPU-selection problem nobody warns you about

On a laptop with both Intel and NVIDIA GPUs, whisper.cpp can silently fall back to CPU even when the dedicated GPU is available. The Vulkan backend is wonderful in theory: one binary, any GPU, just works. In practice there is a class of problem that is not in any documentation, and that you only meet by shipping to real consumer hardware.

On a laptop with multiple GPUs - NVIDIA Optimus, AMD Switchable Graphics, integrated plus dedicated - the operating system enumerates them in one order. The graphics APIs enumerate them in different orders, sometimes disagreeing with each other and sometimes shifting depending on driver state. whisper.cpp’s Vulkan backend addresses a chosen device by integer index, and the integer that points to the discrete card in one API may point to the integrated GPU in another.

The symptom is a silent CPU fallback. You think the GPU is being used. The application reports the GPU is being used. The system monitor shows zero GPU activity. Inference takes 20 seconds instead of 2.

The class of fix that holds up across this kind of heterogeneity is empirical rather than predictive: don’t try to guess which device index corresponds to a real GPU, measure it. Run a quick test inference and decide based on observed performance whether to keep that device or fall back to the next one. The diagram below sketches the shape of the loop.

Figure: The shape of an empirical GPU selection loop. On single-GPU machines it confirms the obvious quickly. On laptops with NVIDIA Optimus or AMD Switchable Graphics, the loop routes to the discrete card even when system enumeration is misleading. The class of approach generalizes to any setting where the right device index cannot be predicted from metadata alone.

On single-GPU machines this just confirms the obvious in a fraction of a second at startup. On multi-GPU laptops it routes inference to the discrete card even when system enumeration is misleading. There are several edge cases that catch you out at this layer - duplicate enumeration entries on some integrated GPUs, vendor-specific driver quirks, shared-memory iGPUs that need different latency budgets to avoid false negatives - and each of those needs its own treatment. The general lesson is that GPU selection on consumer Windows hardware is a problem you can only solve by watching how your build behaves on real machines and adding paths for the cases telemetry surfaces.
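As a rough sketch of the measure-then-decide idea, and not SnailText's actual implementation, the loop below times a probe on each candidate device and keeps the first one that finishes within a latency budget. The run_probe callable is hypothetical: it stands for whatever short test inference your application can run against a given device index. The clock parameter exists only so the loop can be tested without real hardware.

```python
import time
from typing import Callable, Iterable, Optional


def pick_device(device_indices: Iterable[int],
                run_probe: Callable[[int], None],
                max_probe_seconds: float,
                clock: Callable[[], float] = time.perf_counter) -> Optional[int]:
    """Return the first device whose probe inference finishes within budget.

    Returns None when no device qualifies, meaning: use the CPU path.
    """
    for index in device_indices:
        start = clock()
        try:
            run_probe(index)        # hypothetical short test inference
        except Exception:
            continue                # device failed outright; try the next one
        if clock() - start <= max_probe_seconds:
            return index            # fast enough to be a real GPU
    return None
```

A device that is silently running on CPU fails the budget rather than any metadata check, which is the point: the decision rests on observed latency, not on what the enumeration claims.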

None of this is in the whisper.cpp docs or the GGML repo issues. It is the kind of work that turns “we support Vulkan” into something a non-technical user will not need to think about.

Failure categories you only see in production

Probe-and-fallback was one example of a broader category: production-only failure modes that the upstream documentation does not warn you about. We needed telemetry from real users to even notice each of these. We list them here as categories rather than recipes because the right fix in each case depends on your specific stack, and what worked for us is not the only valid path.

1. Multi-GPU enumeration disagreement. Covered above. The shape: different graphics APIs on the same machine disagree about which integer index points to which physical GPU. The class of fix: empirical probing rather than predictive selection. The version of this problem you face depends on which OS, which driver state, and whether the machine has hybrid graphics — the class of solution generalizes.

2. Integrated-GPU phantom enumeration. On some integrated-GPU configurations, the Vulkan loader exposes multiple entries that all point to the same underlying physical adapter. Running inference against each of them in turn — which is what naive probe-and-fallback does — can saturate shared memory bandwidth in ways that are not just slow but actively destabilizing. The class of fix: identify physical adapter identity before iterating indices, so a probe loop never accidentally hammers the same hardware multiple times.

3. Vulkan cold-shader compilation. First-time GPU initialization on a fresh driver state can take much longer than steady-state inference suggests. We have seen cold-shader compile times stretch from one second on warm hardware to tens of seconds on cold hardware. The class of fix: make timeouts adaptive to whether the GPU has been used recently, not hard-coded to a single threshold.

4. GPU memory pressure under streaming. Streaming inference means multiple Whisper invocations happen in close succession instead of one big batch. On GPUs with limited or shared VRAM (integrated graphics, mobile NVIDIA chips with TDR enabled), back-to-back invocations can trigger driver-level recovery — the OS resets the GPU thinking the app has hung. The class of fix: detect VRAM constraints up front and either disable streaming or throttle invocation rate on devices that are not ready for it.

5. Silent CPU fallback. Across all four of the above, the worst failure mode is that whisper.cpp logs nothing wrong, the app reports the GPU as active, the system monitor shows no GPU load, and inference takes ten or twenty times longer than it should. The class of fix is the same: do not trust metadata about whether you are on GPU; measure latency on a real workload and decide based on the measurement.
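Two of the categories above (phantom enumeration and cold-shader compilation) lend themselves to a small sketch. Everything here is illustrative: the Adapter type, the identity field (imagined as something like a device UUID), and the timeout values are assumptions, not numbers or APIs from whisper.cpp or from our production code.

```python
from typing import Dict, List, NamedTuple, Optional


class Adapter(NamedTuple):
    index: int      # index as enumerated by the graphics API
    identity: str   # hypothetical stable ID of the physical adapter


def unique_adapters(adapters: List[Adapter]) -> List[Adapter]:
    """Keep only the first enumeration entry for each physical adapter."""
    seen: Dict[str, Adapter] = {}
    for adapter in adapters:
        seen.setdefault(adapter.identity, adapter)
    return list(seen.values())


def probe_timeout(seconds_since_last_gpu_use: Optional[float],
                  warm_timeout: float = 2.0,
                  cold_timeout: float = 30.0,
                  warm_window: float = 3600.0) -> float:
    """Allow a longer budget when shaders probably need compiling from scratch.

    The default values are placeholders; tune them from your own telemetry.
    """
    if seconds_since_last_gpu_use is None or seconds_since_last_gpu_use > warm_window:
        return cold_timeout
    return warm_timeout
```

Deduplicating before probing means a probe loop touches each physical GPU once, and the adaptive timeout keeps a slow first-run shader compile from being mistaken for a dead device.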

The throughline is that GPU acceleration on consumer hardware is a problem you can only fully solve by deploying to real users and watching the telemetry. None of these categories shows up in a local-dev environment with one well-understood graphics stack. All of them showed up the moment we had a few thousand installs with different vendors, generations, and driver versions.

Where to download the ggml model files (ggml-tiny.bin, ggml-base.bin, and the rest)

If you landed here searching for ggml-tiny.bin pip install, here is the short version: there is no pip package. ggml-tiny.bin is not a Python module, it is a model weights file in the GGML format. pip install will never find it because it does not live on PyPI. The file lives on Hugging Face.

The canonical source is the ggerganov/whisper.cpp repository on Hugging Face. Every quantized model is there as a direct download:

ggml-tiny.bin — Whisper Tiny, ~77 MB, fastest, lowest accuracy
ggml-base.bin — Whisper Base, ~148 MB
ggml-small.bin — Whisper Small, ~488 MB
ggml-medium.bin — Whisper Medium, ~1.5 GB
ggml-large-v3.bin — Whisper Large v3, ~3.1 GB (or ~1.2 GB in Q5_1)

You can pull one from the command line with the download script that ships in the whisper.cpp repo (./models/download-ggml-model.sh tiny), or just click the file on Hugging Face. Once you have the .bin, you pass its path to the whisper.cpp binary or to whatever wrapper you are using. There is no install step for the model itself; it is a static file you point the runtime at.
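If you would rather fetch a model from your own code than from the shell script, the download reduces to one HTTP request. The sketch below assumes the standard Hugging Face "resolve/main" URL layout for the ggerganov/whisper.cpp repository. The function names are invented for this example, and there is no checksum verification or resume support.

```python
import os
import urllib.request

# Assumed URL layout for files in the Hugging Face repository.
BASE_URL = "https://huggingface.co/ggerganov/whisper.cpp/resolve/main"


def model_filename(size: str) -> str:
    """File name for a given model size, e.g. "tiny" -> "ggml-tiny.bin"."""
    return f"ggml-{size}.bin"


def model_url(size: str) -> str:
    return f"{BASE_URL}/{model_filename(size)}"


def download_model(size: str, dest_dir: str) -> str:
    """Download the model into dest_dir and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, model_filename(size))
    urllib.request.urlretrieve(model_url(size), path)
    return path
```

For example, download_model("tiny", "models") would place ggml-tiny.bin in a local models directory, ready to pass to the runtime by path.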

If your actual goal is not to wire up whisper.cpp by hand but to dictate with a local Whisper model, you can skip the whole download-and-configure dance. SnailText bundles the models and the whisper.cpp runtime in one installer for Mac and Windows: it picks a model that fits your hardware, downloads it in-app if needed, and runs it offline. No .bin hunting, no build step, no Python. Download it here if that is closer to what you were after.

What this means for somebody building on whisper.cpp today

If you are evaluating whisper.cpp for a product, the core capability is solid. The model accuracy is good, the quantization formats are well-understood, and the project is actively maintained by Gerganov and an open-source community of about 600 contributors as of 2026.

The complexity is in the long tail. Getting GPU acceleration to work reliably across the diversity of consumer hardware is its own engineering project. The right streaming behaviour depends on the hardware in front of you. Prompt engineering (the initial_prompt parameter) materially affects accuracy on domain-specific vocabulary, and has its own subtleties depending on how you use it. Long-recording behaviour has edge cases that the defaults do not handle gracefully. The default settings are good but not optimal for every use case.

For a desktop dictation app, all of that is solvable, and whisper.cpp remains the best foundation on which to solve it. For a server-side transcription pipeline at scale, you might look at faster-whisper (Python, CTranslate2 backend) or one of the cloud APIs - the operational profile is different. For embedded use cases (Raspberry Pi, mobile, edge devices), whisper.cpp is in a category of one.

The reason the project exists, and the reason SnailText, MacWhisper, SuperWhisper, and many others build on it, is that it makes Whisper genuinely portable. The same C++ code runs on every desktop OS, every modern mobile OS, and a wide range of embedded targets. The model itself is just a file. The runtime is a static library. There is no Python, no CUDA, no service to call. That, in 2026, is still a rare combination.

If you would rather use the result than build it, that is exactly what we ship: SnailText runs whisper.cpp locally on Mac and Windows, picks a model that fits your hardware, and stays offline. Download it here — free to start, no account.

SnailText is offline voice dictation for Mac and Windows — local, private, free to start.

Common questions

Can I install whisper.cpp with pip?

No. whisper.cpp is a C++ project, not a Python package. There is no `pip install whisper.cpp`. If you want the Python package, that is `pip install openai-whisper` (OpenAI's original Python reference implementation). If you want whisper.cpp specifically, you either compile it from source (`git clone` + `make`) or use a desktop app that ships it pre-built, such as SnailText, MacWhisper, or SuperWhisper.

What are the .bin files on Hugging Face (ggml-tiny.bin, ggml-base.bin, etc)?

Those are whisper.cpp's model files in the GGML format. They are not the same as the .pt files in OpenAI's original Whisper repo. ggml-tiny.bin is Whisper Tiny at 77 MB, fast and suited to low-powered devices. ggml-base.bin is Whisper Base at 148 MB. ggml-large-v3.bin is the full Large v3 model at about 3.1 GB, or about 1.2 GB once quantized to Q5_1. You download them from the `ggerganov/whisper.cpp` repository on Hugging Face.

What is the difference between Whisper and whisper.cpp?

Whisper is OpenAI's open-source speech recognition model. whisper.cpp is an independent C++ port of the inference code that runs the same model without Python, PyTorch, or CUDA. Same model weights, same outputs, different runtime.

Is whisper.cpp slower than the Python version?

No, typically faster. The C++ runtime has less overhead, and quantized models use less memory bandwidth. On the same hardware, whisper.cpp generally beats the Python reference on CPU and matches it on GPU.

What does GGML stand for?

GGML originally stood for Georgi Gerganov's Machine Learning library. It has since been formalized as just the project name and is the tensor library underneath llama.cpp, whisper.cpp, and several other ports.

What is the difference between Q4, Q5, and Q8 quantization?

The number is roughly the number of bits used per weight. Q4 is about 4 bits per weight, Q5 is about 5 bits, Q8 is about 8 bits. Lower numbers mean smaller files and faster inference but more accuracy loss. Q5_1 is the typical production choice for whisper.cpp deployments.

Can I run whisper.cpp on a Raspberry Pi?

Yes. Whisper Tiny runs in real-time on a Pi 4. Whisper Base runs at roughly 0.5x real-time. Whisper Small is too slow to be useful on Pi-class hardware.

Does whisper.cpp support GPU on AMD cards?

Yes, via Vulkan (cross-vendor) or ROCm (AMD-specific). Vulkan is easier to deploy because it does not require ROCm installation. SnailText uses Vulkan for AMD support on Windows.

Is whisper.cpp the same as faster-whisper?

No. faster-whisper is a different port that uses CTranslate2 as its backend instead of GGML. It is Python-based and typically used server-side. whisper.cpp is C++ and typically used in desktop and embedded applications.

Want whisper.cpp in a polished desktop app?

SnailText is whisper.cpp with VAD, hotkey, GPU auto-detection, and a UI. Free tier has unlimited local dictation, no account needed.
