AutoKaam Playbook
Whisper, the Local Transcription Engine I Run on Every Voice Memo
OpenAI's open ASR model. Powers the empire field-note pattern. Zero rupees per hour of audio.
The operator take
Whisper is the unsung hero of my workflow. Every empire field note, every voice-memo-to-article conversion, every tutorial-with-narration goes through whisper.cpp on my M75q first. The cost per hour of audio is whatever the box's electricity costs me, which is rounding error, and the quality on Indian-accented English is good enough that I never have to clean up the output by hand.
I run whisper.cpp specifically, not the Python whisper package or the hosted OpenAI API. Compiling from source on my M75q takes about a minute, the binary is self-contained, and the medium model is the right balance of speed and quality for my voice. Large-v3 is more accurate by maybe 4 percent on my test clips but takes 3x longer to transcribe, which on a 30-minute clip is the difference between a few minutes and a quarter of an hour. For 95 percent of my use cases, medium is fine.
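For the mechanics, here is a minimal Python sketch of how I would wire that run, shelling out to the whisper.cpp binary. The paths are assumptions about a stock build, and newer builds name the CLI whisper-cli rather than main, so adjust to whatever make produced on your box.

```python
"""Transcribe a voice memo with whisper.cpp: a minimal sketch.

Assumes whisper.cpp is already built (git clone + make) and the medium
model fetched with models/download-ggml-model.sh medium. The binary
and model paths below are illustrative, not canonical.
"""
import subprocess
from pathlib import Path

WHISPER_BIN = Path.home() / "whisper.cpp" / "main"  # "whisper-cli" on newer builds
MODEL = Path.home() / "whisper.cpp" / "models" / "ggml-medium.bin"

def transcribe(memo: Path) -> str:
    # whisper.cpp expects 16 kHz mono WAV, so convert the phone memo first.
    wav = memo.with_suffix(".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(memo), "-ar", "16000", "-ac", "1", str(wav)],
        check=True,
    )
    # -otxt writes the transcript next to the input as <name>.wav.txt
    subprocess.run(
        [str(WHISPER_BIN), "-m", str(MODEL), "-f", str(wav), "-otxt"],
        check=True,
    )
    return Path(str(wav) + ".txt").read_text()

if __name__ == "__main__":
    print(transcribe(Path("memo.m4a")))
```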
What I learned the hard way is that the model-size choice depends on the box you run on. I tried large-v3 on the Pi 4B for the principle of it; transcription ran at 4x real time, which is genuinely unusable for any live workflow. Switching to small.en on the Pi got real-time transcription back, with a 6 percent WER on my test set, which is fine for caption work but borderline for high-stakes notes. So the empire pattern is: M75q with medium for daily work, Pi with small.en for ambient capture, and never large-v3 unless I am batch-transcribing podcast-length audio overnight.
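That routing is small enough to write down. A sketch of how I would encode it, with the hostname check standing in for however you identify your boxes; nothing here is whisper.cpp API, it is just the decision table from the paragraph above.

```python
# Pick the ggml model file by box and job type. The hostname prefix is
# an illustrative stand-in for the M75q and the Pi.
import socket

def pick_model(overnight_batch: bool = False) -> str:
    if overnight_batch:
        return "ggml-large-v3.bin"   # accuracy over speed, runs while I sleep
    host = socket.gethostname()
    if host.startswith("pi"):        # Pi 4B: small.en keeps transcription real-time
        return "ggml-small.en.bin"
    return "ggml-medium.bin"         # M75q default: the daily driver
```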
The empire field-note pattern uses Whisper as the first stage: a voice memo on the phone gets transferred to the M75q, transcribed by whisper.cpp, then routed through Sonnet or MiMo for prose polish into an italic block-quote injected into a tutorial or a news article. The original cadence of how I speak survives the polish, which is exactly the uncopiable founder-voice marker the AdSense E-E-A-T pass needs. Pure Sonnet output without a Whisper transcript first reads like every other AI piece; that is the whole point.
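Here is a sketch of the polish stage, assuming the anthropic Python SDK; the model id is a placeholder and the prompt wording is illustrative, not a fixed recipe. The one rule that matters is in the prompt: grammar only, cadence untouched.

```python
# Turn a raw Whisper transcript into the italic block-quote the
# field-note pattern injects into an article.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def polish_to_blockquote(transcript: str) -> str:
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder id; use whichever Sonnet you run
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Lightly edit this voice-memo transcript for grammar only. "
                "Keep the speaker's cadence and word choices intact:\n\n"
                + transcript
            ),
        }],
    )
    polished = msg.content[0].text.strip()
    # Markdown italic block-quote, ready to drop into the article.
    return "> *" + polished.replace("\n", "*\n> *") + "*"
```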
Where Whisper breaks for me is Hindi and Hinglish. The multilingual variants are trained on enough Hindi to do passable subtitles for clear speakers, but my own Hindi speech with code-switching gets misheard about 30 percent of the time. For Indian operators recording in Hindi, I would still use it but expect to clean up the output; for Hinglish-heavy speech I have given up and use Sonnet directly with a streaming-audio fallback.
The other thing Whisper is bad at is real-time low-latency transcription. The model is batch by design and the streaming variants underperform. For real-time voice agent work I would route audio through a different stack like Sarvam or Krutrim on the Indian-language path, or Deepgram on the English path. Whisper is for offline file-level transcription, not for live conversation.
For Indian operators starting from zero, whisper.cpp plus the medium model on any laptop with 8GB RAM is genuinely free transcription forever. No subscription, no per-minute billing, no privacy leakage. The integration cost is one afternoon, the daily cost is zero. That math defends itself against any hosted ASR I have priced.
Why it matters in 2026
ASR pricing from hosted vendors stayed flat through 2026 while local models got faster. For any operator transcribing meaningful volumes of audio, local Whisper is the cost-cutting move that defends itself.
Cost in INR
Free, open source. Compute cost on consumer hardware is electricity, effectively zero per hour of audio.
Use when
- Voice memos and field notes feeding the founder-voice pattern
- Offline batch transcription of podcast or meeting audio
- Privacy-strict environments where audio cannot leave the box
- Caption generation for video content
Skip when
- Real-time low-latency conversation; use streaming ASR vendors
- Heavy Hindi or Hinglish speech; accuracy is borderline
- Pi-class hardware with the large-v3 model; the math does not work
Alternatives I would consider
- Sarvam or Krutrim for real-time Indian-language audio
- Deepgram for real-time English audio
All three are streaming stacks for live conversation, which is exactly where Whisper is weakest.
Read next
Adjacent in the playbook
- Ollama, the Local Model Runtime I Actually Trust. Free, open source. Compute cost on consumer hardware is electricity, roughly Rs 4 to Rs 8 per active inference hour on a 65W desktop.
- llama.cpp, the Engine Under Most Local Inference. Free, open source. Compile time on an M75q is under two minutes, on a Pi 4B about ten minutes.
- LM Studio, the GUI On-Ramp for People Who Hate Terminals. Free for personal use. The commercial-use license is in flux for 2026; treat it as not licensed for production.
- Free, open source. Compute cost via RunPod is about Rs 42 per hour for an L4 (24GB VRAM) and Rs 250 to Rs 800 per hour for an A100 or H100.