Dictation deep-dive · 2026

Do you need a GPU for voice-to-text? No — here's what actually runs on CPU

A common worry before installing a local dictation app is whether your computer is powerful enough. The short answer is that you do not need a graphics card. Here is what voice-to-text actually requires, and why "CPU is too slow" depends entirely on the model.

By SnailText's founder · Published

The short version

You do not need a GPU for voice-to-text. Speech recognition runs fine on an ordinary CPU — what people picture when they think "you need a GPU" is cloud-scale model training, not transcribing your own speech. What you actually need is a modern CPU and a few gigabytes of free RAM; the smallest Whisper models run in about 1 GB and work on a Raspberry Pi. The real variable is the model, not the chip. NVIDIA's Parakeet TDT is built for CPU and runs faster than real time there; large Whisper models are much slower on CPU and are where a GPU helps most. Pick a model that fits your machine and CPU dictation is genuinely fast.

Before installing a local dictation app, a lot of people ask the same question: is my computer fast enough, or do I need a gaming-grade graphics card for this? It is a fair worry. “AI” and “GPU” have become so linked that running speech recognition without one feels like it should not work.

It does work. You do not need a GPU for voice-to-text. The short version: speech recognition runs on an ordinary CPU, and whether it feels fast depends on which model you pick, not on whether you own a graphics card. Here is the full picture.

Why people think they need a GPU

The association is understandable but misplaced. The GPUs you read about in AI headlines are doing one of two heavy jobs: training models from scratch in a data center, or serving thousands of users at once in the cloud. Both of those genuinely need racks of graphics cards.

Running a finished model on your own machine to transcribe your own voice is a completely different, much smaller task. You are doing inference on a few seconds of audio at a time, for one person. That is well within what a normal processor handles. Conflating “training needs GPUs” with “I need a GPU to dictate” is the root of the worry, and it is not true.

What voice-to-text actually requires

The real requirements are modest:

  • A modern CPU. Anything from roughly the last decade with AVX instructions works. This is almost every laptop and desktop sold today.
  • A few gigabytes of free RAM. This is the main constraint, and it is smaller than people expect.
  • Disk space for the model. A local model is a one-time download, typically a few hundred megabytes to about a gigabyte.

On memory specifically: the smallest Whisper models run in about 1 GB of RAM and work on hardware as modest as a Raspberry Pi, per the whisper.cpp project. A mid-size model wants roughly 2 GB. A good practical rule is that with about 4 GB of free RAM you can run a solid dictation model comfortably — so even an 8 GB machine, which is fairly modest by 2026 standards, handles local dictation well. In our own testing, a Parakeet model sat around 1.5 GB of RAM in use, and a small Whisper model under 1 GB.

No graphics card appears anywhere in that list.
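
The memory figures above can be turned into a quick fit check. What follows is a minimal sketch, not any app's actual logic: the model labels and footprints are rounded from the numbers quoted in this article, and the `headroom_gb` safety margin is an assumption introduced purely for illustration.

```python
# Approximate RAM footprints in GB, rounded from the figures quoted above.
# The labels are illustrative, not official model identifiers.
APPROX_RAM_GB = {
    "small Whisper": 1.0,
    "Parakeet": 1.5,
    "mid-size Whisper": 2.0,
}


def models_that_fit(free_ram_gb: float, headroom_gb: float = 0.5) -> list[str]:
    """Return the models whose footprint fits in free RAM minus headroom.

    headroom_gb is an assumed margin for audio buffers and the app itself.
    """
    usable = free_ram_gb - headroom_gb
    return [name for name, need in APPROX_RAM_GB.items() if need <= usable]


print(models_that_fit(4.0))  # all three models fit
print(models_that_fit(1.8))  # only the small Whisper model fits
```

With about 4 GB free, every model in the table fits with room to spare, which matches the rule of thumb above.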

“CPU is too slow” depends entirely on the model

This is the part that gets oversimplified. People say “local dictation is slow on CPU” as if the CPU is the problem. The real variable is the model you run on it.

NVIDIA’s Parakeet TDT is built specifically for fast CPU inference. On a modern CPU it runs faster than real time — it transcribes audio quicker than the audio plays. Independent comparisons put Parakeet TDT around 10x faster than Whisper Large v3 Turbo for English, and that speed advantage holds on CPU thanks to its architecture and 8-bit quantization.

Whisper is the other side of the picture, and it is the market standard most apps ship. Smaller Whisper models run at roughly real time on a modern CPU, which is fine for dictation. But large Whisper models are genuinely slow on CPU: transcription can take several times longer than the audio itself, a real-time factor well above 1. That is the experience people remember when they say “CPU dictation is too slow”: they were running a big Whisper model without a GPU.

So the honest framing:

How different local speech models behave on CPU: Parakeet TDT vs small Whisper vs large Whisper

| Model on CPU | CPU speed | Best for |
| --- | --- | --- |
| Parakeet TDT | Faster than real time | Fast English dictation on any modern CPU |
| Small Whisper | Roughly real time | Good accuracy across many languages, modest hardware |
| Large Whisper | Several times slower than real time | Highest accuracy; this is where a GPU earns its keep |

The takeaway: CPU dictation is fast when the model is matched to the chip. The mistake is running the heaviest model on a machine without a GPU and concluding that “local is slow.”
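
The speed labels above are statements about real-time factor (RTF): processing time divided by audio duration, so a value below 1 means the model transcribes faster than the audio plays. Here is a minimal sketch of how you might measure it yourself. `transcribe` is a hypothetical stand-in for whatever speech model you call, not a real library function, and the 0.9 and 1.1 cut-offs for "roughly real time" are assumptions.

```python
import time
from typing import Callable


def real_time_factor(transcribe: Callable[[bytes], str],
                     audio: bytes,
                     audio_seconds: float) -> float:
    """Time one transcription call and divide by the clip length."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    transcribe(audio)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


def describe(rtf: float) -> str:
    """Map an RTF value onto rough speed labels (thresholds are assumed)."""
    if rtf < 0.9:
        return "faster than real time"
    if rtf <= 1.1:
        return "roughly real time"
    return "slower than real time"


# Hypothetical stand-in model: pretends to work for 0.05 s on a 10 s clip.
def fake_transcribe(audio: bytes) -> str:
    time.sleep(0.05)
    return "hello world"


rtf = real_time_factor(fake_transcribe, b"", audio_seconds=10.0)
print(f"RTF = {rtf:.3f} ({describe(rtf)})")
```

Swapping `fake_transcribe` for a real model call on the same clip gives a like-for-like comparison between models on your own CPU.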

When a GPU does help

To be fair to the other side: a GPU is genuinely useful in some cases. If you want to run the largest, most accurate Whisper models, a GPU turns them from sluggish to instant. If you process long audio files in bulk, a GPU saves real time. For those workloads it is a meaningful upgrade.

For everyday dictation — short phrases, one at a time, with a model chosen to fit your hardware — the gain is much smaller. A CPU-friendly model is already fast enough that you would not notice the GPU much. A graphics card is a nice-to-have for heavy setups, not a prerequisite for talking instead of typing.

How SnailText handles this

SnailText runs Whisper and Parakeet TDT locally on Mac and Windows, on whatever hardware you have. Rather than make you guess which model your machine can handle, it recommends a model that fits your hardware — CPU or GPU — out of the box.

That recommendation is a starting point, not a cage. You can switch to a different model whenever you want: a lighter, faster one if you value speed, or a larger, more accurate one if your machine can take it. If you have a GPU, SnailText will use it; if you do not, it runs on your CPU and picks a model that stays fast there.
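
To make the idea of a hardware-based starting point concrete, here is a rough sketch of what such a rule could look like. This is not SnailText's actual selection logic: the function name, the inputs, and the 2 GB threshold are all assumptions, loosely derived from the rules of thumb in this article.

```python
def suggest_model(free_ram_gb: float, has_gpu: bool, english_only: bool) -> str:
    """Pick a starting model from coarse hardware facts (illustrative only)."""
    # A GPU is where the large, most accurate Whisper models become practical.
    if has_gpu:
        return "large Whisper"
    # On CPU, Parakeet TDT is the fast option for English. The 2 GB cut-off
    # is an assumed margin above the ~1.5 GB footprint quoted earlier.
    if english_only and free_ram_gb >= 2.0:
        return "Parakeet TDT"
    # Otherwise fall back to the lightest multilingual choice.
    return "small Whisper"


print(suggest_model(free_ram_gb=4.0, has_gpu=False, english_only=True))
print(suggest_model(free_ram_gb=4.0, has_gpu=False, english_only=False))
```

A rule like this only picks a default. As the paragraph above says, the user can still switch to a lighter or heavier model at any time.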

It is free to start, needs no account, and the model downloads once and then works offline. So the practical answer to “do I need a GPU” is no — download SnailText, let it suggest a model for your machine, and start dictating.

The short version

You do not need a graphics card for voice-to-text. A modern CPU and a few gigabytes of free RAM are enough, and the smallest models run on hardware as light as a Raspberry Pi. Whether CPU dictation feels fast is about the model, not the chip: Parakeet TDT is built for CPU and runs faster than real time, while large Whisper models are the slow ones that benefit from a GPU. Match the model to your machine — or let the app do it for you — and CPU voice-to-text is genuinely quick.

SnailText is offline voice dictation for Mac and Windows — local, private, free to start.

Common questions

Do you need a GPU to run voice-to-text?

No. Speech recognition runs on an ordinary CPU. A GPU speeds up large models, but it is not required to transcribe your own speech. The confusion comes from associating "AI" with the graphics cards used to train models in data centers — that is a different task from running a finished model on your laptop to type what you say. For dictation, a modern CPU with a few gigabytes of free RAM is enough.

How much RAM does local voice-to-text need?

Less than most people expect. The smallest Whisper models run in about 1 GB of RAM and work on hardware as modest as a Raspberry Pi. A mid-size model wants roughly 2 GB. In our own testing, a Parakeet model used around 1.5 GB of RAM and a small Whisper model under 1 GB. As a practical rule, if you have around 4 GB of free RAM you can run a good local dictation model comfortably, which means even an 8 GB machine (fairly modest by 2026 standards) handles it well. Bigger, more accurate models want more memory, but you do not need them to get clean dictation.

Is voice-to-text slow on CPU?

It depends entirely on the model, not on the fact that it is a CPU. NVIDIA's Parakeet TDT is designed for CPU inference and runs faster than real time there. Smaller Whisper models are roughly real-time on a modern CPU. Large Whisper models are the slow ones on CPU — that is the case where a GPU makes a real difference. So "CPU is too slow for dictation" is a half-truth: pick a CPU-friendly model and it is fast; pick the largest Whisper model and it will lag.

Which is faster on CPU, Parakeet or Whisper?

Parakeet, by a wide margin, for English. Independent comparisons put Parakeet TDT roughly 10x faster than Whisper Large v3 Turbo. Its real-time factor on CPU is well under 1 (faster than the audio plays), whereas large Whisper models can run several times slower than real time. Whisper is still excellent and covers more languages, but if raw CPU speed is your priority, Parakeet is the one built for it.

Will voice-to-text work on an old laptop without a graphics card?

Usually yes, with the right model. A laptop with only integrated graphics and no discrete GPU can run a small Whisper model or Parakeet on its CPU. The lighter the model, the better it runs on older hardware; whisper.cpp's tiny and base models even run on single-board computers. You may not get the absolute best accuracy of the largest models, but everyday dictation on an older machine is realistic.

When does a GPU actually help with voice-to-text?

A GPU helps most when you run large, high-accuracy models or process long audio files where every bit of speed matters. For large Whisper models the speedup is significant. For a CPU-friendly model like Parakeet doing short dictation phrases, the gain is smaller because it is already fast on CPU. So a GPU is a nice-to-have for heavier setups, not a requirement for dictation.

Want SnailText?

Free tier has unlimited local dictation, no account needed.