AutoKaam Playbook

Whisper, the Local Transcription I Run on Every Voice Memo

OpenAI's open ASR model. Powers the empire field-note pattern. Zero rupees per hour of audio.

Last reviewed:

The operator take

Whisper is the unsung hero of my workflow. Every empire field note, every voice-memo-to-article transition, every tutorial-with-narration goes through whisper.cpp on my M75q first. The cost per hour of audio is whatever the box's electricity costs me, which is rounding error, and the quality on Indian-accented English is good enough that I never have to clean up output by hand.

I run whisper.cpp specifically, not the Python reference implementation or the OpenAI hosted API. The compile from source on my M75q takes about a minute, the binary is self-contained, and the medium model is the right balance of speed and quality for my voice. Large-v3 is more accurate by maybe 4 percent on my test clips but takes 3x longer to transcribe, which on a 30-minute clip is the difference between a few minutes and a quarter of an hour. For 95 percent of my use cases, medium is fine.
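
The speed tradeoff above can be sketched as simple real-time-factor arithmetic. The RTF numbers here are illustrative assumptions derived from the rough 3x gap between medium and large-v3 described above, not benchmarks; measure on your own box before trusting them.

```python
# Wall-clock estimate for a transcription job given a real-time
# factor (RTF): wall_clock = audio_minutes * rtf.
# RTF values below are ASSUMED for illustration only.

RTF = {
    "medium": 0.15,    # assumed: ~4.5 min for a 30-min clip
    "large-v3": 0.45,  # assumed: ~3x slower, ~13.5 min for the same clip
}

def wall_clock_minutes(audio_minutes: float, model: str) -> float:
    """Estimated transcription time in minutes for a given model."""
    return audio_minutes * RTF[model]

if __name__ == "__main__":
    for model in RTF:
        print(f"{model}: 30-min clip -> {wall_clock_minutes(30, model):.1f} min")
```

With these assumed factors, a 30-minute clip is a few minutes on medium and pushing a quarter of an hour on large-v3, which is the whole argument for defaulting to medium.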

What I learned the hard way is that the right model size depends on the box you run it on. I tried large-v3 on the Pi 4B for the principle of it, and the transcription took 4x real time, which is genuinely unusable for any live workflow. Switching to small.en on the Pi got real-time transcription back, with a 6 percent WER on my test set, which is fine for caption work but borderline for high-stakes notes. So the empire pattern is: M75q with medium for daily work, Pi with small.en for ambient capture, and never large-v3 unless I am batch-transcribing podcast-length audio overnight.
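
The per-box rule is easy to pin down in code so it never gets decided ad hoc. This is a minimal sketch; the hostname prefixes and ggml model filenames are my assumptions, so adjust them to your own setup.

```python
# Encode the per-box model rule: medium on the M75q for daily work,
# small.en on the Pi for ambient capture, large-v3 only for overnight
# batch jobs. Hostnames and filenames are ASSUMPTIONS, not a standard.

def pick_model(host: str, overnight_batch: bool = False) -> str:
    """Return the whisper.cpp model file to use for a given box."""
    if overnight_batch:
        return "ggml-large-v3.bin"   # only for podcast-length batch runs
    if host.startswith("pi"):
        return "ggml-small.en.bin"   # keeps the Pi 4B at real time
    return "ggml-medium.bin"         # M75q daily default
```

The point of making it a function is that any script in the pipeline asks `pick_model()` instead of hardcoding a model, so a clip never accidentally hits large-v3 on the Pi.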

The empire field-note pattern uses Whisper as the first stage: a voice memo on the phone gets transferred to the M75q, transcribed by whisper.cpp, then routed through Sonnet or MiMo for prose polish into an italic block quote injected into a tutorial or a news article. The original cadence of how I speak survives the polish, which is exactly the uncopiable founder-voice marker the AdSense E-E-A-T pass needs. Pure Sonnet output without a Whisper transcript first reads like every other AI piece; that is the whole point.
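
The last step of that pattern, shaping the polished transcript into the italic block quote, can be sketched in a few lines. The transcription and LLM-polish stages are out of scope here, and the exact markdown shape is my assumption about the pattern, not a spec.

```python
# Final formatting stage of the field-note pattern: render polished
# transcript text as an italic markdown block quote, ready to inject
# into an article. The markup shape is an ASSUMPTION for illustration.

def to_field_note(polished: str) -> str:
    """Wrap each non-empty line of polished text in an italic block quote."""
    lines = polished.strip().splitlines()
    return "\n".join(f"> *{line.strip()}*" for line in lines if line.strip())
```

Keeping this as a dumb string transform, separate from the LLM call, means the polish model can be swapped between Sonnet and MiMo without touching the article-injection side.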

Where Whisper breaks for me is Hindi and Hinglish. The multilingual variants are trained on enough Hindi to do passable subtitles for clear speakers, but my own Hindi speech with code-switching gets misheard about 30 percent of the time. For Indian operators recording in Hindi I would still use it, but I would expect to clean up output; for Hinglish-heavy speech I have given up and use Sonnet directly with a streaming-audio fallback.

The other thing Whisper is bad at is real-time low-latency transcription. The model is batch by design and the streaming variants underperform. For real-time voice agent work I would route audio through a different stack like Sarvam or Krutrim on the Indian-language path, or Deepgram on the English path. Whisper is for offline file-level transcription, not for live conversation.

For Indian operators starting from zero, whisper.cpp plus the medium model on any laptop with 8GB RAM is genuinely free transcription forever. No subscription, no per-minute billing, no privacy leakage. The integration cost is one afternoon, the daily cost is zero. That math defends itself against any hosted ASR I have priced.
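
To make that math concrete, here is a back-of-envelope comparison for a monthly transcription volume. The hosted per-minute rate and the USD-to-INR conversion are assumptions for illustration, not quotes from any vendor; local cost is treated as zero per the electricity-is-rounding-error argument above.

```python
# Back-of-envelope monthly cost for hosted ASR vs local whisper.cpp.
# HOSTED_USD_PER_MIN and USD_TO_INR are ASSUMED illustrative values.

HOSTED_USD_PER_MIN = 0.006   # assumed hosted ASR rate
USD_TO_INR = 84.0            # assumed conversion rate

def monthly_cost_inr(audio_hours: float, local: bool) -> float:
    """Monthly transcription cost in INR for a given audio volume."""
    if local:
        return 0.0  # electricity treated as rounding error
    return audio_hours * 60 * HOSTED_USD_PER_MIN * USD_TO_INR
```

At these assumed rates, 20 hours of audio a month is a few hundred rupees hosted and zero local, and the gap only widens with volume.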

Why it matters in 2026

ASR pricing from hosted vendors stayed flat through 2026 while local models got faster. For any operator transcribing meaningful volumes of audio, local Whisper is the cost-cutting move that defends itself.

Cost in INR

Free, open source. Compute cost on consumer hardware is electricity, effectively zero per hour of audio.

Use when

  • Voice memos and field notes feeding the founder-voice pattern
  • Offline batch transcription of podcast or meeting audio
  • Privacy-strict environments where audio cannot leave the box
  • Caption generation for video content

Skip when

  • Real-time low-latency conversation, use streaming ASR vendors
  • Heavy Hindi or Hinglish speech, accuracy is borderline
  • Pi-class hardware with the large-v3 model, the math does not work

Alternatives I would consider