Dictation for Mac — voice typing in any app, no cloud

Press a hotkey. Talk. The text lands wherever your cursor is. Works in Slack, Notion, VS Code, Mail, anywhere you type. Audio stays on your Mac.

By Evgenii Balabanov, founder of SnailText · Published

The short version

Dictation for Mac means voice-to-text running locally on Apple Silicon with Whisper models. Apple's built-in Dictation has no hard duration limit per Apple Support docs, but it auto-stops after 30 seconds of silence (pauses for thought count), and it only works reliably inside a subset of apps. Third-party local tools run continuously, support larger Whisper models with materially better quality on technical and accented content, and process everything on-device using Metal GPU acceleration on M-series chips. The realistic minimum hardware is M1; on M3 and newer, Whisper Large v3 Turbo runs at several times real time. SnailText is one of the few apps with Mac and Windows feature parity from day one. If you arrived here looking for voice to text on Mac, our dedicated voice-to-text page compares Apple Dictation against Whisper-based alternatives in detail.

Apple Dictation vs SnailText, structurally

macOS ships with built-in dictation. For short, casual use inside Notes or Messages it is fine. For sustained work it has structural limits that third-party tools exist to solve. The table below lists product-level differences, not accuracy benchmarks; we're holding back the latter until we publish a reproducible methodology.

Apple Dictation vs SnailText structural differences, May 2026.
| Feature | Apple Dictation | SnailText |
| --- | --- | --- |
| Recording length | Auto-stops after 30 seconds of silence per Apple docs (no hard duration limit) | Unlimited; runs as long as the hotkey is held, or until you press it again |
| Where it works | Native Apple apps and a subset of third-party apps that opt in via the system text input API | Any text field in any app, via global hotkey + paste: Slack, VS Code, Cursor, Telegram, terminals, web inputs |
| Model size | Compact Apple-trained model, not user-selectable | Choice of Whisper Tiny through Large v3 (and Parakeet TDT on Pro); pick the size that fits your accuracy/latency tradeoff |
| Custom vocabulary | Not user-editable beyond what Apple's models already know | Dictionary for proper nouns and product names; snippets for boilerplate expansion (Pro) |
| Hotkey | Limited to a double-press of Fn or one other chosen key; activation is cancelled in many third-party apps | Global Cmd+Shift+Space (configurable); does not steal focus from the active app |
| Offline guarantee | "Enhanced Dictation" downloads a local model for offline use; default settings vary by macOS version and language | Always offline by design: no cloud option, so no toggle you can forget to switch off |

Apple's offering is best understood as a system convenience. SnailText is the tool you reach for when dictation is part of how you actually work.

Apple Silicon dictation performance at a glance

Indicative ranges from third-party whisper.cpp Metal benchmarks (Voicci 2026, PromptQuorum 2026, DEV Community Mac M4 analysis). These are not measurements taken under a fixed SnailText methodology — we'll publish that separately once it's finalized. Real per-hardware latency varies with thermals, background load, and model build.

| Apple Silicon chip | Whisper Small | Whisper Medium | Whisper Large v3 Turbo |
| --- | --- | --- | --- |
| M1 (base) | Real time | Borderline real time | Slower than real time |
| M2 Pro | 3-4× real time | 2-3× real time | 1.5-2× (60s audio in ~2.8s) |
| M3 MacBook Air | 5-6× | 3-4× | ~7× on long-form |
| M4 | 10-15× | 6-8× | 3-5× |
| M5 Pro | 15-20× | 8-12× | ~10× real time |

"Real time" means transcription finishes in the same wall-clock time as the recording. Anything faster than 1× is suitable for live dictation. The tiny model on M4 runs at roughly 27× real time on short clips per DEV Community testing.
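To make the "N× real time" convention concrete, here is a small Python sketch that turns a speed multiple into the wall-clock wait for a finished clip. It ignores model load time, streaming, and paste latency, and the function names are illustrative only.

```python
def transcription_wait_seconds(audio_seconds: float, speed_multiple: float) -> float:
    """Wall-clock seconds to transcribe a finished clip at 'N x real time'."""
    if speed_multiple <= 0:
        raise ValueError("speed_multiple must be positive")
    return audio_seconds / speed_multiple


def keeps_up_with_speech(speed_multiple: float) -> bool:
    """Faster than 1x means the model finishes before the speaker does."""
    return speed_multiple > 1.0


# A one-minute clip at a few multiples that appear in the table above.
for label, speed in [("1x (real time)", 1.0), ("3x", 3.0), ("10x", 10.0), ("27x", 27.0)]:
    wait = transcription_wait_seconds(60, speed)
    print(f"{label:>15}: {wait:5.1f} s wait, live-capable={keeps_up_with_speech(speed)}")
```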

What "Nx real time" actually feels like at the cursor

The multipliers above are easy to misread. In plain terms, here's the practical pattern on Apple Silicon: with a model sized to the chip, any M-series Mac transcribes a one-minute dictation in about the time it took to record, or less. The faster the chip and the smaller the model, the shorter the wait. On M3 and M4 with Whisper Medium or Large v3 Turbo, the batch wait for a one-minute clip falls to somewhere between several seconds and about twenty, depending on the pairing. On M1 with Whisper Small, batch transcription takes roughly as long as the recording itself, which is why the streaming approach described next matters most on older chips.

The lag you actually notice is shorter than these multipliers suggest. SnailText runs the model on closed phrases as you speak (streaming inference on GPU-equipped machines), so by the time you press the stop hotkey most of the work is already done. End-to-end wait on Apple Silicon is typically one to two seconds for any phrase under thirty seconds.
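The reason streaming hides most of the wait can be shown with a toy model. The Python sketch below assumes each phrase is handed to a single inference worker the moment it closes and that phrase boundaries are already known; it illustrates the idea and is not SnailText's scheduler. Model load and paste time are ignored, and the function names are hypothetical.

```python
def streaming_wait(phrase_seconds: list[float], speed: float) -> float:
    """Seconds between the stop hotkey and the final text, assuming each phrase
    is queued for inference as soon as it closes and phrases run one at a time."""
    clock = 0.0        # wall-clock time since recording started
    gpu_free_at = 0.0  # when the single inference worker becomes idle
    for duration in phrase_seconds:
        clock += duration                    # the phrase finishes being spoken
        start = max(clock, gpu_free_at)      # wait if the worker is still busy
        gpu_free_at = start + duration / speed
    return max(0.0, gpu_free_at - clock)     # clock is now the moment of stop


def batch_wait(phrase_seconds: list[float], speed: float) -> float:
    """Same audio, but transcribed only after the stop hotkey."""
    return sum(phrase_seconds) / speed


phrases = [8.0, 10.0, 7.0, 5.0]  # a 30-second dictation split into four phrases
print(streaming_wait(phrases, speed=4.0), batch_wait(phrases, speed=4.0))
```

With these numbers the streaming wait is 1.25 s against 7.5 s for batch: only the last phrase is still being processed when you press stop.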

We are deliberately not publishing wall-clock benchmark tables on this page yet. A reproducible comparison needs a fixed methodology — same audio sample, same model build, same hardware state — and we have not yet finalized and released ours. When we do, the numbers will live on a separate methodology page that this article links to.

Neural Engine, Metal, CPU — which one is actually doing the work?

A question we get a lot: does it use the Apple Neural Engine (ANE)? Short answer: no, and that's fine. The longer version:

  • whisper.cpp runs on Metal, Apple's general-purpose GPU compute API. That's how the speed numbers above happen. Metal is one of the most mature GPU backends in ggml, the tensor library whisper.cpp is built on.
  • The Neural Engine (ANE) is a separate machine-learning accelerator that ships on every Apple Silicon Mac. It is fast, but third-party code can only reach it through Apple's Core ML framework; there is no public low-level API and no ggml backend that targets it. WhisperKit runs Whisper through Core ML and can use the ANE. whisper.cpp's default Metal build cannot, as of 2026; its optional Core ML build offloads only the encoder.
  • CPU is the fallback path when Metal is unavailable (older Intel Macs, virtualized environments). It still works, just slower — Whisper Small on a 2020 Intel MBP runs at roughly real-time, which is borderline for live dictation.
  • Unified memory is why Metal works so well on Apple Silicon. On x86 systems with discrete GPUs, the audio buffer has to be copied across the PCIe bus to VRAM before the GPU can touch it. On M-series the GPU reads the same physical memory pages as the CPU. No copy.
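The selection logic in that list amounts to a simple fallback rule. The Python sketch below is hypothetical: the function name and the `metal_usable` flag (standing in for a real runtime check) are ours. It uses the standard `platform` module, which reports `Darwin` and `arm64` for native Python on an Apple Silicon Mac.

```python
import platform


def pick_backend(system: str, machine: str, metal_usable: bool = True) -> str:
    """Hypothetical backend choice following the list above: Metal on
    Apple Silicon, CPU fallback wherever Metal isn't usable."""
    if system == "Darwin" and machine == "arm64" and metal_usable:
        return "metal"
    # Everything else falls back to CPU: Intel Macs report "x86_64" (as does
    # x86_64 Python under Rosetta), and metal_usable=False covers environments
    # such as VMs where Metal isn't available.
    return "cpu"


print(pick_backend(platform.system(), platform.machine()))
```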

The practical answer is that the Metal path on M-series is fast enough that the absence of an ANE backend doesn't matter for dictation latency. If you specifically need the Neural Engine for power efficiency on battery, WhisperKit is the project to look at.

Why Apple's built-in Dictation is not enough for daily use

Apple Dictation works. It runs on-device on any Mac with an M1 chip or newer, the transcription is acceptable for short bursts, and it costs nothing. For a quick text message or a one-line search query, it does the job.

It stops being enough the moment you try to use it for real work.

The first thing you hit is the silence cutoff. Apple's docs state Dictation on Apple Silicon has no hard duration limit, but the system auto-stops after 30 seconds of detected silence — and "silence" includes the natural pauses you take while composing. There is no setting to extend the cutoff. Dictating an email longer than two paragraphs means re-activating two or three times. Multiple discussion threads on Apple's own support forums note the cutoff sensitivity has shifted across iOS 18 and macOS Tahoe updates.
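Apple does not document how its cutoff is implemented. As a hypothetical illustration of why thinking pauses end a session, here is a generic silence timer in Python over per-frame speech/no-speech flags. The voice-activity detection that would produce those flags is assumed, not shown, and the function name is ours. A toggle-hotkey recorder simply has no such timer.

```python
from typing import Optional


def auto_stop_frame(frame_is_speech: list[bool], frame_seconds: float,
                    silence_limit_s: float = 30.0) -> Optional[int]:
    """Index of the frame where a silence-timeout recorder would stop,
    or None if the limit is never reached. Any pause counts toward the limit."""
    silent_for = 0.0
    for i, is_speech in enumerate(frame_is_speech):
        # Speech resets the timer; every silent frame adds to it.
        silent_for = 0.0 if is_speech else silent_for + frame_seconds
        if silent_for >= silence_limit_s:
            return i
    return None


# 20 s of speech, then a 35 s pause while you think, in 1-second frames.
frames = [True] * 20 + [False] * 35
print(auto_stop_frame(frames, frame_seconds=1.0))
```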

The second is accuracy on anything technical. Apple Dictation is fine on general clear speech and visibly worse on code, jargon, accented English, and domain-specific vocabulary — the kinds of content where developers, clinicians, and lawyers actually use dictation. Third-party tools running modern Whisper-class models are materially better on the same content. We're holding back specific WER numbers on this page until we publish a reproducible benchmark methodology — others have published their own comparisons (VoicePrivate, Voicci, PromptQuorum each have 2026 testing), but we'd rather not cite figures we haven't reproduced under controlled conditions.
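For readers who want to reproduce comparisons like these, word error rate (WER) is the standard metric: the substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. "95% accuracy" is commonly used as shorthand for a WER of about 5%. The Python sketch below computes it with a word-level edit distance; lowercasing and splitting on whitespace are simplifications, since published benchmarks usually normalize punctuation and numbers too.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference,
    computed with a word-level Levenshtein distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev holds edit distances for the previous row of reference words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


ref = "deploy the kubernetes cluster tonight"
hyp = "deploy the cooper net ease cluster tonight"
print(f"WER = {word_error_rate(ref, hyp):.2f}")
```

Here one substitution and two insertions against five reference words give a WER of 0.60, which is the kind of failure technical vocabulary tends to produce.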

The third is the integration boundary. Apple Dictation works inside Apple apps and most native macOS text fields. It does not have a consistent hotkey-to-paste flow across web apps, Electron apps, or terminals. You end up disabling it in half the places you want to use it.

There is a good built-in dictation tool for casual use, and there is a separate category of tools built for people who type for a living. The category exists because the casual tool was never designed to be the second one.

What a real dictation app for Mac does

A dictation app for Mac is a tool that converts spoken voice into typed text in any application via a global hotkey, with the speech recognition model running locally on Apple Silicon. The three components that define the category are: a universal hotkey that works in every macOS app including web apps, Electron apps, and terminals; a speech recognition model with 95%+ accuracy on clean English audio; and a local processing pipeline that keeps audio on your device.

A hotkey that works the same way in every app. You press it once, the recording starts. You press it again, the recording stops. Your transcribed text lands at your cursor position, whatever app you happen to be in. No app-specific configuration, no menu trees, no waiting.
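The press-to-start, press-to-stop flow is a small state machine. This Python sketch is hypothetical: `transcribe` and `paste` are placeholder callbacks standing in for the speech model and the OS paste step, and real global-hotkey registration (which needs platform APIs and the accessibility permission) is not shown.

```python
from collections.abc import Callable


class ToggleRecorder:
    """Hypothetical toggle-hotkey controller: first press starts, second press stops."""

    def __init__(self, transcribe: Callable[[list[float]], str],
                 paste: Callable[[str], None]) -> None:
        self.transcribe = transcribe   # placeholder for the speech model
        self.paste = paste             # placeholder for the OS paste step
        self.recording = False
        self.samples: list[float] = []

    def on_audio(self, chunk: list[float]) -> None:
        # Microphone chunks are only kept while a recording is active.
        if self.recording:
            self.samples.extend(chunk)

    def on_hotkey(self) -> None:
        if not self.recording:
            self.recording = True      # first press: start a fresh recording
            self.samples = []
            return
        self.recording = False         # second press: stop, transcribe, paste
        text = self.transcribe(self.samples)
        self.samples = []              # drop the audio once it is transcribed
        if text:
            self.paste(text)


pasted: list[str] = []
rec = ToggleRecorder(transcribe=lambda s: f"{len(s)} samples", paste=pasted.append)
rec.on_audio([0.0] * 10)    # ignored: not recording yet
rec.on_hotkey()             # start
rec.on_audio([0.0] * 160)
rec.on_hotkey()             # stop
print(pasted)
```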

A speech recognition model that is actually good. The free tier of modern Mac dictation apps ships with compact Whisper models that hit 95%+ accuracy on clean English audio. Paid tiers add larger models, additional languages, and post-processing for filler word removal and punctuation. The point is to not have to think about the model at all once it is running.

A local pipeline that does not need the internet. The audio buffer stays in RAM, the model runs on your Mac's GPU or Neural Engine, and the text appears in the active text field. Nothing leaves your machine unless you explicitly opt into a cloud feature.
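How much memory does keeping the audio in RAM actually take? Whisper models consume 16 kHz mono audio; assuming the buffer holds 32-bit float samples (the in-memory format SnailText uses isn't stated here), the arithmetic is small:

```python
WHISPER_SAMPLE_RATE_HZ = 16_000  # Whisper models take 16 kHz mono input
BYTES_PER_SAMPLE = 4             # assumption: 32-bit float samples


def buffer_bytes(seconds: int) -> int:
    """Bytes of RAM needed to hold `seconds` of mono audio."""
    return seconds * WHISPER_SAMPLE_RATE_HZ * BYTES_PER_SAMPLE


for minutes in (1, 10, 60):
    megabytes = buffer_bytes(minutes * 60) / 1_000_000
    print(f"{minutes:>2} min of audio -> {megabytes:.2f} MB in RAM")
```

Under that assumption a minute of audio is 3.84 MB and a full hour about 230 MB, so even a long session fits comfortably in memory without spooling to disk.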

That third part is the one that defines the category. Once you have a tool that runs the model on your own hardware, the privacy story changes from "we promise not to misuse your audio" to "your audio does not leave the device." It is a different argument with different consequences.

Apple Silicon makes local Whisper genuinely fast

Running large Whisper models locally on Windows usually means installing CUDA, finding a compatible NVIDIA GPU, and tuning batch sizes. On Mac, the same workflow is built in.

The whisper.cpp engine, which powers most modern Mac dictation apps including ours, compiles with Apple Metal GPU acceleration by default on Apple Silicon. Metal is Apple's GPU API, and on M-series chips the GPU shares the unified memory pool with the CPU, so the model weights and the audio buffer sit in the same physical memory as your application code. There is no memory copy between CPU and GPU before each inference. That architectural detail is a large part of why an M1 MacBook Air can run Whisper Small in real time and an M2 Pro can run Large v3 Turbo faster than real time, while Large v3 Turbo on a Windows laptop typically needs a dedicated NVIDIA GPU.

On any Apple Silicon Mac from M1 onward, you can run the Small or Medium Whisper model locally without the latency getting in the way; with streaming inference, text typically lands within a second or two of when you stop talking. The difference between an M1 Air and an M5 Pro is whether you can also comfortably run the large models, not whether dictation works at all.

The other side of this story is the older Intel Macs. Apple's own documentation makes clear that Intel Macs running Apple Dictation send audio to Apple's servers because the on-device pathway only works on Apple Silicon. Third-party apps that use whisper.cpp similarly need the Metal acceleration to be usable in real time. The realistic minimum hardware for modern local dictation on Mac is M1 or later.

Local vs cloud — why it matters for daily dictation

A cloud dictation tool sends each utterance to a remote server, transcribes it there, and sends the text back. The model running in the cloud is often larger than what you can run locally, which can mean a small accuracy edge in noisy conditions. The latency cost is the round trip, typically 200-800ms on a good connection, more on a bad one.

A local dictation tool runs the model on your Mac. The latency is just the inference time, which on Apple Silicon is usually faster than the round trip to a cloud server. The audio stays on your device. There is no inference cost beyond the electricity to run the chip.

For daily dictation, the local approach compounds over time. If you dictate 8000 words a day at work, you are running thousands of inference calls. A local tool processes those for free on hardware you already own. A cloud tool either charges you a subscription or burns through API credits you bought from OpenAI or another provider. Over a year, the cost difference for a heavy user is in the hundreds of dollars range, and the privacy difference is in the category of "everything you said all year, somewhere on a server" versus "nothing left your device."
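The per-year arithmetic is easy to rerun with your own numbers. Everything except the 8,000 words a day is an assumption in this Python sketch: a 150 words-per-minute speaking pace, 250 workdays, and a metered price of $0.006 per audio minute (check your provider's current rate; flat subscriptions are priced differently).

```python
def daily_speech_minutes(words_per_day: float, speaking_rate_wpm: float = 150.0) -> float:
    """Minutes of audio produced per day at an assumed speaking pace."""
    return words_per_day / speaking_rate_wpm


def yearly_metered_cost_usd(words_per_day: float,
                            usd_per_audio_minute: float = 0.006,  # assumed price
                            speaking_rate_wpm: float = 150.0,     # assumed pace
                            workdays: int = 250) -> float:        # assumed work year
    minutes = daily_speech_minutes(words_per_day, speaking_rate_wpm)
    return minutes * workdays * usd_per_audio_minute


minutes = daily_speech_minutes(8_000)
print(f"{minutes:.0f} min of speech per day")
print(f"${yearly_metered_cost_usd(8_000):.2f} per year at the assumed metered rate")
```

At these assumed rates, 8,000 words is close to an hour of speech a day and a metered API comes to roughly $80 a year; a higher per-minute price or a flat monthly subscription moves that figure up, while the local path has no per-minute charge at all.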

There are still cases where cloud has an edge. For very heavy accents that compact local models struggle with, or for less common languages like Vietnamese or Bengali where local Whisper has known accuracy gaps, the larger cloud models still beat what a local app can do today. The right tool depends on what you actually dictate.

How we built dictation for Mac and Windows at the same time

SnailText runs on Mac and Windows from a single codebase with feature parity from day one. Most Mac dictation apps shipped Mac-first and added Windows years later, if at all: MacWhisper is Mac-only; SuperWhisper shipped Windows in November 2025, roughly two years after the macOS version; Voibe and Aqua Voice are Mac-only. The Mac dictation app market has been mature for years; the Windows side is a recent expansion.

We took a different path. SnailText was built from day one as a Tauri app with a single Rust core shared between Mac and Windows. The same whisper.cpp engine runs on both platforms, with Metal acceleration on Mac and Vulkan on Windows. The hotkey, the overlay UI, the history, the dictionary, the snippets — all of it is identical. There is no "Mac app first, Windows app later" feature gap.

For people who only use Mac, this design choice does not matter much. For people who use both, or who work in a household or team where some are on Mac and some on Windows, or who might switch platforms in the future, it means one tool instead of two.

What you actually do with dictation on Mac, day to day

Mac dictation users spend most of their input time across five use cases: email and Slack replies (highest frequency, saving roughly half an hour per workday for typical knowledge work), long-form writing first drafts at 2-3× typing speed, code-adjacent natural language tasks like commit messages and AI agent prompts, voice memos that bypass the record-transfer-transcribe workflow, and accessibility use during RSI recovery or as a permanent input preference.

Email and Slack replies. Highest-frequency case. A two-sentence reply that would take 30 seconds to type takes 5 seconds to dictate. Across a workday with 40-80 short replies, that adds up to roughly 15 to 35 minutes saved.
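A quick check of the arithmetic, using the per-reply times above (30 seconds to type, 5 to dictate). The Python sketch counts only the direct time difference; it leaves out editing and context switching, and the function name is illustrative.

```python
def minutes_saved_per_day(replies: int, typing_seconds: float = 30.0,
                          dictation_seconds: float = 5.0) -> float:
    """Typing time minus dictation time across a day's short replies."""
    return replies * (typing_seconds - dictation_seconds) / 60.0


for replies in (40, 80):
    print(f"{replies} replies -> {minutes_saved_per_day(replies):.0f} min saved")
```

At 25 seconds saved per reply, 40 replies save about 17 minutes and 80 replies about 33.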

Long-form writing. First drafts of blog posts, essays, documentation, or notes. Most writers dictate faster than they type, often by 2-3×. The transcript is rough and needs editing, but the editing is faster than producing the first draft would have been.

Code-adjacent dictation. Not writing code character by character, but writing the natural-language parts of code work: commit messages, PR descriptions, comments explaining tricky logic, prompts to AI coding assistants like Cursor or Claude. Our page for vibe-coders covers this use case in detail.

Voice memos to text. You are walking the dog, you have an idea, you press the hotkey, you talk for 30 seconds. The text is in a note when you get back. The Apple Voice Memos workflow requires you to record, transfer, transcribe, and review. A real-time dictation tool removes those steps.

Accessibility. Wrist injuries, RSI, recovering from surgery, or just preferring voice as a primary input. A good local dictation tool is a real accessibility tool, and the offline aspect matters more here than anywhere else.

How to get started on Mac

The download is on our Mac download page. We ship a notarized DMG, so there is no Gatekeeper warning on first launch on macOS Sequoia or Tahoe. Apple Silicon is required (M1 or later). The app is around 150MB and unpacks to about 600MB with the default Whisper Small model included.

First launch asks for two permissions: microphone access (obvious) and accessibility access (so we can paste text into other apps). Both are standard macOS permission prompts. We do not ask for anything else.

The default hotkey is Command+Shift+Space. You can change it in Settings if it conflicts with something. Press the hotkey once to start, press it again to stop. The text appears at your cursor.

Free tier is unlimited dictation with compact local models, no account required, no time limits. Pro tier ($7.49/mo · $89/yr, 3 devices) adds larger models, multi-language support, snippet expansion, dictionary entries, and a 30-day money-back guarantee on the first paid charge.

FAQ

Does this work on Intel Macs?

Technically yes, in degraded form. The whisper.cpp engine works on Intel CPUs but the inference speed without Metal acceleration is significantly slower. Real-time dictation with the small model is borderline acceptable on a high-end Intel iMac from 2019 or 2020. We recommend Apple Silicon (M1 or later) for the actual experience described on this page.

How is this different from Apple Dictation?

Apple Dictation is built into macOS, runs on-device on Apple Silicon, and is free. Apple's docs state there is no hard duration timeout, but Dictation auto-stops after 30 seconds of silence — pauses for thought count. There is also no extensibility (no custom vocabulary, no snippets, no hotkey customization beyond the basic toggle). SnailText runs larger Whisper-class models, has no silence cutoff, supports custom vocabulary and snippets, and works through a unified hotkey across all apps.

Do you upload my audio anywhere?

No. Local Whisper runs in our app on your Mac. The audio buffer stays in RAM during a recording session and is not written to disk. We do not upload audio to any server in any mode, free or paid. Optional cloud STT for Pro users with hard-audio cases is on our roadmap but not in the product today.

What about HIPAA, GDPR, regulated industries?

The simplest path to compliance for voice dictation is to not transmit the audio anywhere. Local Whisper does exactly that — no Business Associate Agreement needed, no Data Processing Agreement, no cross-border data transfer assessment. Our Privacy page covers the legal specifics; the short version is that data that never leaves your device is the easiest data to keep compliant.

How does the accuracy compare to Wispr Flow or SuperWhisper?

For clean English audio, our compact local models match Apple Dictation (around 95%) and the medium and large models match Wispr Flow and SuperWhisper Pro (around 97-99%). For very heavy accents or background noise, cloud models still have a slight edge over local models in our category. For everything else, the gap is small enough that the privacy and cost differences matter more.

Does it work with custom vocabulary?

Yes, on Pro. You can add custom terms (your company name, product names, your kids' names) and snippet expansions (say a trigger phrase, get a longer passage). Both apply during transcription, not after.
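Two common techniques behind features like these, sketched in Python under stated assumptions. Whisper-family decoders can be nudged toward unusual spellings by passing an initial prompt (openai-whisper's `initial_prompt` argument, whisper.cpp's `--prompt` option), and snippets can be expanded by matching trigger phrases in each transcribed segment as it is produced. We are not claiming this is how SnailText implements either; the function names, the "Glossary:" prompt format, and the example snippet are hypothetical.

```python
import re


def build_initial_prompt(custom_terms: list[str]) -> str:
    """Join dictionary terms into a short context string for the decoder."""
    return "Glossary: " + ", ".join(custom_terms) + "."


def expand_snippets(segment: str, snippets: dict[str, str]) -> str:
    """Replace spoken trigger phrases with their expansions, whole words only."""
    for trigger, expansion in snippets.items():
        pattern = r"\b" + re.escape(trigger) + r"\b"
        # A function replacement keeps any backslashes in the expansion literal.
        segment = re.sub(pattern, lambda _m: expansion, segment, flags=re.IGNORECASE)
    return segment


print(build_initial_prompt(["SnailText", "Parakeet", "Tauri"]))
print(expand_snippets("Ship it to my address please",
                      {"my address": "221B Baker Street, London"}))
```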

What about multi-language dictation?

Pro tier supports 25+ languages with Parakeet TDT v3, which is about 10× faster than Whisper for European languages. Free tier is English-only with the compact Whisper models.

Try it on your Mac

Free tier is unlimited with compact local models, no account needed. If you want larger models, multi-language support, dictionary, and snippets, Pro is $7.49/mo · $89/yr. 30-day refund on the first paid charge.