Offline dictation — voice typing without the cloud

The audio stays in RAM on your machine. The model runs on your GPU or Neural Engine. Nothing gets uploaded, nothing gets stored, nothing leaves the device. Privacy by architecture, not by policy.

By Evgenii Balabanov, founder of SnailText · Published

The short version

Offline dictation runs the speech recognition model entirely on your device. No audio is uploaded, no transcripts are stored on remote servers, no Business Associate Agreements are needed for HIPAA. Four apps run truly offline by default in 2026: SnailText (Mac and Windows), MacWhisper (Mac), SuperWhisper in local mode (Mac and Windows), and Voibe (Mac). The architecture itself becomes the privacy guarantee, not a vendor promise. Verifiable with any standard network monitor in 60 seconds.

Offline vs cloud dictation at a glance

| Aspect | Offline dictation | Cloud dictation |
| --- | --- | --- |
| Where audio is processed | On your device, in RAM | Remote server |
| Network requirement | No | Yes (for every dictation) |
| HIPAA Business Associate Agreement | Not needed | Required before first use |
| GDPR data transfer assessment | Not needed | Required for cross-border |
| Latency | 50-300 ms (inference only) | 200-800 ms (round trip + inference) |
| Accuracy on clean English | Competitive with cloud at medium/large model sizes | Slight edge at the very top end (largest cloud models) |
| Apps using this default in 2026 | SnailText, MacWhisper, SuperWhisper (local mode), Voibe | Wispr Flow, Aqua Voice, Willow Voice |

Why "offline" is the architectural question, not a feature checkbox

Most dictation apps that advertise privacy are still cloud apps. They have a privacy policy, an audit certificate, a Business Associate Agreement option, a promise not to train on your data. Those are policy controls. They depend on the vendor doing what they said, and on you trusting that they will.

A truly offline dictation app does not have a privacy policy in the same sense. The audio cannot reach a server because there is no network call. The model cannot leak data because it is running in a process on your hardware, with your operating system controlling who can see it.

The privacy guarantee is the architecture, not a promise.

This difference shows up in the worst cases. When the Delve compliance platform was implicated in a March 2026 audit fraud investigation (according to a Substack investigation that analyzed 494 SOC 2 reports allegedly generated through the platform, finding 99.8% shared identical boilerplate text), customers of multiple cloud dictation companies discovered that their assumed SOC 2 certifications had been generated by a tool that produced essentially identical boilerplate reports. The affected companies responded by switching to new auditors (Wispr Flow engaged A-LIGN as new auditor and Drata as new compliance platform, per Voibe Resources analysis of the incident). The customers had no way to verify what had actually been audited in the first place. Offline tools simply do not have this problem, because there is nothing to audit at the inference layer.

A separate, widely reported incident involved Wispr Flow capturing screenshots of the user's active window every few seconds and uploading them to third-party AI infrastructure as part of a "context awareness" feature (documented through network traffic analysis posted to Reddit in 2025, with the vendor's CTO publicly apologizing after the company initially banned the user who reported it, per Embertype's reporting). The app has since changed the implementation to read text near the cursor via accessibility APIs rather than full screenshots (per Wispr Flow's current documentation), but the underlying point stands: cloud dictation apps can do things you do not see, and you find out about them later if at all.

None of this means cloud dictation is wrong. It means the trust model is different. If you are dictating shopping lists and Slack messages, the trust model is probably fine. If you are dictating client work, medical notes, legal drafts, internal company information, or anything that you would not want sitting on someone else's server, the architectural answer is genuinely better than the policy answer.

How local Whisper works, and what "in RAM" actually means

Modern offline dictation apps use the Whisper family of models, originally released open-source by OpenAI in 2022 and now developed across multiple implementations including whisper.cpp, faster-whisper, MLX Whisper, and others. The smallest variants (tiny, base, small) are between 75MB and 500MB on disk and run on consumer hardware in real time.

The pipeline, in concrete steps:

  1. You press a hotkey. The app opens an audio stream from your microphone at 16 kHz mono PCM, the format Whisper expects. The samples flow into a rolling buffer in RAM, typically a few megabytes per minute of speech. No file on disk.
  2. A voice activity detector (VAD) watches the stream and decides when speech ends. Silero VAD is the common choice: a small ONNX model that runs in milliseconds per chunk and emits a "phrase ended" signal after about half a second of silence.
  3. Each closed phrase gets handed to the Whisper model. Whisper runs on your CPU or GPU as a library linked into the app's own process, so there is no inter-process communication and no network call.
  4. The model produces text tokens. On Apple Silicon this typically takes a few hundred milliseconds for a 10-second phrase; on a modern Intel laptop CPU it takes a couple of seconds; on a discrete NVIDIA GPU it is faster than real-time.
  5. The text is pasted into your active text field via the operating system's standard text-input API. Same API your keyboard uses.
  6. When you close the app, the operating system reclaims the buffer. Nothing about the recording survives the process. Nothing is written to disk unless you explicitly enable history.
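
The first two steps can be sketched in a few lines of Python. This is a minimal illustration, not SnailText's implementation: the energy threshold, the 30 ms frame length, and the names `PhraseSegmenter` and `buffer_bytes_per_minute` are assumptions made for the example, and a simple loudness check stands in for a real detector such as Silero VAD.

```python
import math
from collections import deque

SAMPLE_RATE = 16_000                              # Hz, mono, as in step 1
FRAME_MS = 30                                     # assumed frame length
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000    # 480 samples per frame
SILENCE_MS = 500                                  # "about half a second" from step 2


def buffer_bytes_per_minute(bytes_per_sample: int) -> int:
    """Memory needed to hold one minute of 16 kHz mono audio."""
    return SAMPLE_RATE * 60 * bytes_per_sample


def frame_rms(frame) -> float:
    """Root-mean-square level of one frame of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class PhraseSegmenter:
    """Holds audio in memory only and emits a phrase once silence lasts long enough."""

    def __init__(self, max_seconds: int = 30, threshold: float = 0.01):
        # deque(maxlen=...) silently drops the oldest samples: a rolling buffer.
        self.samples = deque(maxlen=SAMPLE_RATE * max_seconds)
        self.threshold = threshold    # assumed loudness threshold, not a tuned value
        self.silent_ms = 0
        self.heard_speech = False

    def push(self, frame):
        """Feed one frame; return the finished phrase as a list, or None."""
        self.samples.extend(frame)
        if frame_rms(frame) >= self.threshold:
            self.heard_speech = True
            self.silent_ms = 0
        elif self.heard_speech:
            self.silent_ms += FRAME_MS
        if self.heard_speech and self.silent_ms >= SILENCE_MS:
            phrase = list(self.samples)
            self.samples.clear()      # drop the audio once it has been handed off
            self.heard_speech = False
            self.silent_ms = 0
            return phrase
        return None
```

With 16-bit samples, `buffer_bytes_per_minute(2)` comes to 1,920,000 bytes, and with 32-bit floats `buffer_bytes_per_minute(4)` comes to 3,840,000 bytes, which is where "a few megabytes per minute" comes from. Nothing in the sketch opens a file or a socket; the phrase returned by `push` is what would be handed to the model in step 3.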

There is no network call in any of these six steps. You can verify this with any standard network monitor: Little Snitch on Mac, Wireshark on either OS, or your operating system's built-in firewall logs.

Here is what that looks like as a structural pattern, not a benchmark. Run any of these apps with a network monitor open during a 60-second dictation, and you'll see outbound request counts in the following ballpark. Exact numbers vary with build, feature flags, and auth state; the gap between zero and non-zero is the architectural point:

Outbound network requests during a 60-second dictation, observed in May 2026.

| App | Outbound requests | What they are |
| --- | --- | --- |
| SnailText (local Whisper) | 0 | None. The model runs in-process; the audio never leaves RAM. |
| Wispr Flow (Privacy Mode on) | 1-2 | Auth heartbeat to the vendor backend. The audio itself is still sent to the cloud for transcription; Privacy Mode disables retention, not transmission. |
| Cloud STT baseline (typical) | 3-12 | Auth, audio upload (often chunked), transcript download, telemetry. Exact count depends on chunk size and feature flags. |

This is the test we keep coming back to when we talk about "offline" — not the marketing copy, not the privacy policy, but a packet capture during an actual recording. SnailText being at zero is the architectural guarantee. Wispr Flow on Privacy Mode being at one or two is honest about its design — the audio still has to reach a server to be transcribed; Privacy Mode controls what the server keeps. Cloud STT at three to twelve is the normal cost of running speech recognition as a service.

[Figure: Offline vs cloud dictation dataflow, two parallel pipelines. Offline: microphone (PCM audio) → RAM buffer (never on disk) → local Whisper (your GPU or CPU) → text in your app; 0 net hops. Cloud: microphone (PCM audio) → encode + upload (HTTPS POST) → network boundary → cloud server (vendor custody) → HTTPS return (JSON text) → text in your app; 2 net hops.]
The architectural difference between offline and cloud dictation. Offline keeps the audio in a RAM buffer that the operating system releases when the app closes. Cloud sends the audio across a network boundary to a third-party server you don't control — the privacy policy applies to that custody, not to the architecture.

The "in RAM" part is the specific guarantee. RAM contents are not persisted across reboots. They are not accessible to other processes except through the operating system's standard process-isolation rules. They are not backed up by Time Machine, iCloud, or OneDrive unless you separately enable a feature that writes them to disk. When you close the app, the buffer is gone.

The point of belaboring this is that the architectural detail is the actual privacy guarantee. There is no policy you have to trust; there is only the code path, and the code path can be observed.

The GDPR and HIPAA story for offline dictation

The legal frameworks around voice data have tightened substantially through 2025 and 2026. Under the EU's General Data Protection Regulation, voice recordings are personal data, and voiceprints are classified as special-category biometric data when processed for identification. Total GDPR fines passed €7.1 billion cumulatively by 2026, with €1.2 billion levied in 2025 alone and a 40% year-over-year increase in fines specifically tied to voice-data mishandling (per the Kiteworks GDPR Compliance Report 2026). The Dutch Data Protection Authority alone levied a €30.5 million fine on Clearview AI for biometric data violations involving facial recognition.

In the United States, HIPAA penalty tiers were updated effective January 28, 2026 to a structure where individual violations can cost between $145 and $2,190,294 depending on the category of fault, with annual caps at $2,190,294 per violation type. The Office for Civil Rights' Risk Analysis Initiative through 2025 has specifically targeted "shadow AI": situations where staff use consumer-grade AI tools without going through formal vendor procurement and BAA processes. Cloud dictation that processes Protected Health Information without a signed Business Associate Agreement is a violation from the first transcription, regardless of whether anything subsequently goes wrong.

Offline dictation removes most of these failure modes because the data does not change custody. Local processing means:

  • No Data Processing Agreement needed with a dictation vendor, because the vendor does not process the data.
  • No Business Associate Agreement needed for HIPAA, because no PHI leaves the covered entity's control.
  • No cross-border data transfer assessment, because there is no transfer.
  • No Data Protection Impact Assessment for the voice pipeline (one may still be needed for other parts of your overall system).
  • No vendor risk management for speech data handling, again because the vendor is not handling speech data.

The architecture itself is the compliance mechanism. This does not mean a regulated organization can deploy any offline dictation tool without thought: you still need to verify the claims, document the architecture, and consider edge cases like crash dumps and update channels. But the baseline compliance work is dramatically less than for a cloud equivalent.

For organizations that have wrestled with vendor SOC 2 audits, BAA negotiations, and DPA reviews for cloud dictation, the simplification is the single largest practical advantage of going offline.

Which dictation apps are actually offline (a check)

Four dictation apps run entirely offline by default in 2026: SnailText (Mac and Windows), MacWhisper (Mac only), SuperWhisper in local mode (Mac and Windows), and Voibe (Mac only). Three apps are cloud-based by default with privacy options layered on top: Wispr Flow, Willow Voice, and Speechify. Aqua Voice and most Speechify dictation features are cloud-only. The category is small enough that it is worth being concrete:

| App | Local default | Cloud option | Mac | Win | Notes |
| --- | --- | --- | --- | --- | --- |
| SnailText | Yes | No (not in 2026) | Yes | Yes | Local Whisper + Parakeet. Feature parity Mac/Windows day one. |
| MacWhisper | Yes | Yes (Pro Plus, opt-in) | Yes | No | Local Whisper for file transcription and live dictation. |
| SuperWhisper | Yes (local mode) | Yes (BYOK Pro) | Yes | Yes | Local-only mode supported. Pro adds BYOK to OpenAI/Anthropic/ElevenLabs. |
| Voibe | Yes | No | Yes | No | Local Whisper for core dictation flow. |
| Wispr Flow | No | Yes (default cloud) | | | Privacy Mode disables storage but audio still processed in cloud. |
| Willow Voice | No | Yes (default cloud) | | | Cloud-based dictation. |
| Aqua Voice | No | Yes (cloud-only) | | | Custom Avalon model in cloud. Strong accuracy benchmarks. |

If the offline guarantee matters to you, the practical short list narrows to four apps (us, MacWhisper, SuperWhisper local mode, Voibe). Three of those four are Mac-only or Mac-first. The one with Mac and Windows parity from day one is us, which we acknowledge sounds self-serving but is the actual state of the market.

Local dictation apps in 2026 — the four that actually run on your device

"Offline dictation" and "local dictation app" describe the same architecture from two angles. Offline emphasizes what does not happen (no cloud roundtrip). Local emphasizes where the model runs (on your CPU, GPU, or Neural Engine). Both terms point at the same shortlist of four apps in 2026.

A local dictation app means the speech-to-text model — Whisper, Parakeet, or a vendor's own — is downloaded as part of the app install and executed by your hardware on every dictation. No audio is uploaded. No transcripts are stored remotely. No account is required to get a transcription. The vendor cannot see what you dictate even if they wanted to, because the audio never reaches their servers.

That property — verifiable by network monitor, not by promise — is the reason regulated professions (therapists drafting session notes, lawyers drafting privileged work product, clinicians documenting PHI) increasingly default to a local dictation app over a cloud one. The compliance picture simplifies: there is no third-party processor of the audio because the audio is never transmitted. You can read our specific positions for therapists, lawyers, and accessibility-driven use cases.

When offline dictation has trade-offs

Offline dictation has five practical trade-offs compared to cloud STT. Smaller local models are typically 1-7 percentage points less accurate than cloud Large variants on noisy or accented audio. Less common languages have weaker local model support. Inference uses your own CPU or GPU, which matters on older laptops. Cross-device sync requires deliberate engineering, because there is no central server in the loop by default. And accuracy improvements ship as software updates on a cycle of months, rather than as continuous cloud model updates on a cycle of days.

Model size limits. Compact local models (tiny, base, small) run on any modern machine but are less accurate than the large cloud models for very noisy audio, very heavy accents, or less common languages. For clean English audio in a quiet room, the gap is small. For an accented speaker recording in a noisy café, the gap can grow to several percentage points.
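
To make "percentage points" concrete: the figures here do not say how accuracy is measured, so the sketch below assumes word error rate (WER), the usual metric for speech recognition. The function name `word_error_rate` and the sample sentences are made up for illustration; they are not benchmark data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = cur[j - 1] + 1    # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# Illustrative ten-word reference and two invented model outputs.
REFERENCE = "the patient reports mild pain in the left knee today"
MODEL_A = "the patient reports mild pain in the left need today"
MODEL_B = "the patient report mild pain in left need today"
```

Here `word_error_rate(REFERENCE, MODEL_A)` is one wrong word out of ten, or 10%, and `word_error_rate(REFERENCE, MODEL_B)` is three edits out of ten, or 30%. A gap of "several percentage points" between two models is the difference between two such rates averaged over a much larger test set.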

Less common languages. Whisper is strongest on English and major European languages. For Vietnamese, Bengali, Telugu, and other lower-resource languages, local model accuracy can drop meaningfully. Cloud providers using larger models or language-specific fine-tunes often have an edge here.

Compute cost is your hardware. Running inference locally costs electricity and uses your CPU or GPU. On Apple Silicon and modern dedicated GPUs the cost is negligible. On older laptops with no GPU acceleration, it can be noticeable and battery drain becomes a real factor.

No live cross-device sync of model state. If you train custom vocabulary on your Mac, it does not automatically sync to your Windows machine because there is no central server in the loop. Modern tools (including ours) sync through a license server with end-to-end encryption, but it is a layer that has to be designed in deliberately.

Updates ship as software updates. A cloud STT vendor can improve their model overnight, and your dictation accuracy improves with no action from you. Local apps update accuracy when they ship a new app version with a new model bundled in. The cycle is months, not days.

For most knowledge-worker dictation in English or major European languages, these trade-offs are minor. For specific edge cases, cloud has real advantages. The point of an offline-first design is to make the default privacy-correct, not to claim it is always the best technical choice.

How to verify any dictation app is actually offline

Verifying that a dictation app runs offline takes about 60 seconds with standard tools and no special expertise:

  1. Install a network monitor. Little Snitch on macOS ($45 one-time), GlassWire on Windows (free tier exists), or Wireshark on either OS (free, open source).
  2. Quit the dictation app you want to test, then launch the network monitor.
  3. Open the dictation app and start a session. Talk for 10-20 seconds.
  4. Stop the session and observe the network monitor's outbound traffic log filtered to the dictation app's process.
  5. A truly offline app produces zero outbound requests during recording or transcription. Software update checks at launch and license verification calls are normal and separate from dictation.
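
If your monitor can export its connection log, steps 4 and 5 can be scripted. The sketch below assumes a CSV export with the columns `time_s`, `process`, `remote_host`, and `remote_port`; that layout, the function name `outbound_during_session`, and the sample rows are hypothetical, and real exports from Little Snitch, GlassWire, or Wireshark use their own formats that you would need to map onto it.

```python
import csv
import io
from collections import Counter


def outbound_during_session(csv_text: str, process: str,
                            start_s: float, end_s: float) -> Counter:
    """Count connections made by `process` between start_s and end_s (inclusive)."""
    hits = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        t = float(row["time_s"])
        if row["process"] == process and start_s <= t <= end_s:
            hits[(row["remote_host"], int(row["remote_port"]))] += 1
    return hits


# Hypothetical export: an update check at launch, then a dictation from 10 s to 30 s.
SAMPLE = """time_s,process,remote_host,remote_port
0.4,ExampleDictation,updates.example.com,443
12.1,Browser,news.example.org,443
15.0,ExampleCloudSTT,api.example.net,443
18.2,ExampleCloudSTT,api.example.net,443
"""

offline = outbound_during_session(SAMPLE, "ExampleDictation", 10.0, 30.0)
cloud = outbound_during_session(SAMPLE, "ExampleCloudSTT", 10.0, 30.0)
print("offline app connections during dictation:", sum(offline.values()))
print("cloud app connections during dictation:", sum(cloud.values()))
```

Restricting the count to the recording window is what separates the launch-time update check from traffic caused by dictation itself. For an offline app the count inside that window should be zero.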

SnailText, for reference, runs offline by default on Mac (Apple Silicon, M1 or later) and Windows (10 and 11, x86-64). Free tier is unlimited local dictation with compact Whisper models, no account required, no time limits. The app makes outbound calls only for software update checks at launch, Pro license verification (once per session on Pro), and optional anonymous error reports (opt-in, off by default).

Pro tier ($7.49/mo · $89/yr, 3 devices) adds larger Whisper and Parakeet TDT v3 models with multi-language support, dictionary and snippet expansion, and a 30-day money-back guarantee.

FAQ

How do I verify a dictation app is actually offline?

Run Little Snitch on macOS, GlassWire on Windows, or Wireshark on either OS, and observe network activity while you dictate. A truly offline app produces zero outbound traffic during recording or transcription. Software update checks at launch and license verification calls are normal and separate from dictation.

Does offline dictation work without internet?

Yes. The model runs entirely on your device. You can dictate on a plane, in a coffee shop with no Wi-Fi, in a basement, anywhere. The only thing that needs internet is the initial app download.

Is local Whisper as accurate as cloud Whisper?

The model is the same open-source code from OpenAI. The accuracy difference is about which size of the model is running, not where it runs. For clean English audio, local Small/Medium and cloud Large are within 1-3 percentage points. For accented or noisy audio, the gap can be 3-7 points.

Is offline dictation HIPAA compliant?

Local Whisper running entirely on your device is the simplest path to HIPAA compliance for voice transcription, because no Protected Health Information leaves your control. No Business Associate Agreement is needed because there is no business associate processing the voice data. You still need to handle the data correctly on your own device (encryption at rest, access controls, audit logs as required by your organization), but the data-in-transit category of risk is removed.

What is Wispr Flow's Privacy Mode?

Wispr Flow's Privacy Mode disables their data storage and model training. It does not change the fact that the audio still gets sent to their servers for transcription. The architecture is cloud-with-no-retention, not local. Both can be reasonable choices, but they are different choices.

Does SnailText ever upload anything?

We make outbound network calls for: software update checks (you can disable in Settings), Pro license verification (Pro users only, once per session), and optional anonymous error reports (off by default, you opt in). We never send audio, transcripts, or anything you dictate.

Stop sending your voice to the cloud

Free tier is unlimited local dictation, no account needed. The audio stays in RAM on your machine. Verifiable in your own firewall logs.