How I Run a Full Transcription Stack on a 6 GB GTX 1660
int8 Faster-Whisper, Silero VAD, and pyannote diarization, capturing mic and system audio at once

As I write this on June 28, 2026, nvidia-smi on my desktop reports about 781 MiB free on a 6 GB GTX 1660. Ollama is sitting on the rest. That one reading is the whole problem I had to solve to run speech-to-text locally: int8 Whisper, Silero VAD, and speaker diarization each want a slice of a card that a single language model can swallow whole. I run this stack to turn my own audio into searchable text: the voice notes I dictate, recorded talks I want to skim later, and the occasional two-person podcast I record in the browser. Shipping hours of that to a hosted app like Otter is both a privacy cost and a money cost. Sarvam's Saaras v3 runs about ₹30 per hour, with diarization billed separately on its batch API. The 1660 is already in the box, so once it is wired up, local transcription costs roughly the price of electricity.
The card decides everything
The GTX 1660 is Turing, the TU116 die, and it has no tensor cores. That single fact rules out most of the modern local-inference path. vLLM, FP8 weights, Marlin kernels: nearly all of it expects Ampere or newer. On this card the only runtimes that behave are CTranslate2 int8, ONNX, and llama.cpp GGUF. There is a sharper trap underneath. FP16 is unreliable on the GTX 16xx line: in my experience FP16 matmuls come back as NaN on real inputs. I hit that running an FP16 embedding model that returned NaN and an HTTP 500 on any input longer than a few hundred characters. Truncating the text did nothing, because it was never a length bug; it was the FP16 path itself. The lesson carried straight into the transcription stack: on a 1660 you run int8 or fp32, never fp16.
nvidia-smi reports 6144 MiB on the card, but CUDA reserves a slice and the usable ceiling sits near 5.61 GiB. Whisper, the VAD, and pyannote together have to live under that. If Ollama is up, it has already taken 3 to 4 GB for whatever model it last served, which is exactly the 781 MiB-free state I am looking at right now.
The VRAM budget
| Component | Model / library | GPU memory | Runtime |
|---|---|---|---|
| Speech-to-text | Whisper large-v3-turbo, int8 | about 2.4 GB resident | faster-whisper / CTranslate2 |
| Voice activity | Silero VAD | CPU, near zero on GPU | onnxruntime |
| Diarization | pyannote community-1 | offline burst, not concurrent with STT | torch |
| CUDA context | n/a | about 0.5 GB | n/a |
The discipline that makes it fit is one rule: never hold two GPU bursts at once. Transcription and diarization do not have to run in the same instant. The VAD and the segmenter live on the CPU. Peak GPU use is the resident STT model plus at most one burst, which keeps me under roughly 3.5 GB and well clear of the 5.61 GiB ceiling. The anti-pattern is loading everything resident at once, which lands near 5.8 GB and OOMs the moment CUDA wants a scratch buffer.
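The one-burst rule is easy to sanity-check with arithmetic. A minimal sketch, assuming the figures from the table above; the burst sizes passed in are hypothetical placeholders rather than measurements, and GB and GiB are compared loosely as one unit, the same way the text does:

```python
# Figures taken from the text; everything else here is illustrative.
CEILING = 5.61       # usable ceiling on the 1660
RESIDENT_STT = 2.4   # int8 Whisper with decode buffers
CUDA_CONTEXT = 0.5   # CUDA context overhead


def peak_usage(bursts, serialize=True):
    """Peak GPU memory for the resident model plus transient bursts.

    serialize=True  -> only the largest burst is ever live (one burst at a time)
    serialize=False -> every burst is held at once (the anti-pattern)
    """
    burst = max(bursts, default=0.0) if serialize else sum(bursts)
    return RESIDENT_STT + CUDA_CONTEXT + burst


def burst_headroom():
    """Largest single burst that still fits under the ceiling."""
    return CEILING - RESIDENT_STT - CUDA_CONTEXT


if __name__ == "__main__":
    hypothetical_bursts = [1.0, 2.0]  # placeholder sizes, not measured
    print(f"headroom for one burst: {burst_headroom():.2f}")
    print(f"serialized peak:        {peak_usage(hypothetical_bursts):.2f}")
    print(f"all-at-once peak:       {peak_usage(hypothetical_bursts, serialize=False):.2f}")
```

With the resident model and context fixed at 2.9, serializing bursts means only the largest one counts against the ceiling, which is the whole point of the rule.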
int8, and what it costs
Whisper large-v3-turbo is 809M parameters. At fp16 the weights alone are about 1.6 GB. The int8 build is roughly 0.8 GB on disk. Loaded with its decode buffers the model sits near 2.4 GB resident, about 2.9 GB once the 0.5 GB CUDA context is counted. I pin the faster-whisper version and the CTranslate2 version underneath it. The accuracy hit from int8 on this model is small in my testing, well inside the noise of room hiss and accent, while the memory win is the difference between fitting and not fitting. Turbo is already a 4-decoder-layer cut of large-v3, so it clears about 2x real time even on an 8-core Ryzen with no GPU at all. On the 1660 it finishes short chunks faster than they arrive.
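The weight figures above follow directly from the parameter count. A small worked calculation, assuming 2 bytes per parameter at fp16 and 1 byte at int8, and ignoring the small overhead of quantization scales:

```python
def weights_gb(params, bytes_per_param):
    """Raw weight size in decimal gigabytes."""
    return params * bytes_per_param / 1e9


PARAMS = 809e6  # Whisper large-v3-turbo, from the text

print(f"fp16: {weights_gb(PARAMS, 2):.2f} GB")  # fp16: 1.62 GB
print(f"int8: {weights_gb(PARAMS, 1):.2f} GB")  # int8: 0.81 GB
```

The gap between 0.8 GB of weights and 2.4 GB resident is the decode buffers and runtime workspace, which is why the on-disk size alone is a poor guide to whether a model fits.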
The compute_type argument is where the Turing constraint shows up. I use int8_float32, which keeps int8 weights with an fp32 compute path, precisely because the fp16 path is the one that NaNs on this card.
```python
from faster_whisper import WhisperModel

# int8 weights, fp32 compute. No fp16 on Turing 16xx (the fp16 path returns NaN on this card).
model = WhisperModel(
    "deepdml/faster-whisper-large-v3-turbo-ct2",
    device="cuda",
    compute_type="int8_float32",
)

segments, info = model.transcribe(
    "chunk.wav",
    vad_filter=True,  # Silero VAD gate, drop silence before the decoder
    beam_size=1,      # greedy decode, lowest latency on a small card
    # no initial_prompt on short chunks, see the gotcha below
)

# segments is a lazy generator; decoding happens as it is iterated
for s in segments:
    print(f"[{s.start:.1f}-{s.end:.1f}] {s.text}")
```
Silero VAD, and the prompt that broke everything
Silero VAD is the cheap gate in front of Whisper. I run the Silero VAD model through onnxruntime on the CPU, and it segments the stream into speech regions of about 2 to 12 seconds, dropping silence before it ever reaches the decoder. faster-whisper exposes it directly with vad_filter=True. On live audio this is what keeps the GPU idle between sentences instead of burning cycles on room tone.
The gotcha that cost me an evening: I tried to bias proper-noun spelling by passing a glossary of names as Whisper's initial_prompt. On short VAD chunks the decoder reads that prompt as previous context, and with a long prompt over a 3-second clip it just echoes the prompt or collapses to a single character. Four of six chunks came back as f or fm. The fix is to pass no prompt on short chunks and do vocabulary correction as a post-pass: fuzzy-match the raw transcript against the name list with a fuzzy string-matching library and substitute above an 85% similarity threshold. Decoder-time biasing and short audio do not mix.
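The post-pass can be sketched in a few lines. This is a minimal illustration, not the exact implementation described above: it swaps in the standard library's difflib for the fuzzy matcher, handles single-word names only, and the function name correct_vocabulary is hypothetical. The 0.85 threshold mirrors the 85% figure in the text.

```python
import re
from difflib import SequenceMatcher


def correct_vocabulary(text, names, threshold=0.85):
    """Replace words that closely match a known name with the canonical spelling."""

    def best_match(word):
        best, best_score = None, 0.0
        for name in names:
            # ratio() is a 0..1 similarity score; comparison is case-insensitive
            score = SequenceMatcher(None, word.lower(), name.lower()).ratio()
            if score > best_score:
                best, best_score = name, score
        return best if best_score >= threshold else None

    def substitute(match):
        word = match.group(0)
        return best_match(word) or word

    # Walk the transcript word by word, leaving punctuation and spacing intact
    return re.sub(r"[A-Za-z][A-Za-z'-]*", substitute, text)


if __name__ == "__main__":
    print(correct_vocabulary("I talked to pyanote about diarization", ["pyannote"]))
```

Because the correction runs on text after decoding, it cannot push the decoder into echoing a prompt, which is the failure it replaces.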
Capturing both sides
The reason this is more than pointing Whisper at a wav file is that I want two audio sources at the same time: my microphone, and whatever the system itself is playing. When I record a two-person podcast in the browser, my voice is the mic and the other person comes back through the system output. PipeWire makes both addressable. The monitor of the default sink is the system audio. The default source is the mic. I capture them as two parallel 16 kHz mono streams.
```shell
# list devices, find the sink monitor and the mic source ids
pw-cli ls Node | grep -E "node.name|media.class"

# what the system plays (monitor of the default sink)
pw-record --target <sink-monitor-id> --rate 16000 --channels 1 system.wav &

# my microphone, separate stream
pw-record --target <mic-source-id> --rate 16000 --channels 1 mic.wav &

# PulseAudio-compatible alternative under pipewire-pulse:
# parec --device=@DEFAULT_MONITOR@ --rate=16000 --channels=1 --format=s16le system.raw
```
I keep them as separate streams on purpose. The stream a sentence arrived on, mic or system, is the most reliable speaker signal I have, more reliable than diarization, and it is free. Each stream gets its own VAD pass and its own Whisper pass, then the side tag becomes the first half of the speaker label.
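Merging the two transcripts back into one timeline is mostly bookkeeping. A minimal sketch, assuming each stream has already been transcribed into segments with start and end times; the Segment class and merge_streams function are hypothetical names for illustration:

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the recording
    end: float
    text: str


def merge_streams(mic, system):
    """Interleave two per-stream transcripts by start time, tagging each line with its side."""
    tagged = [("mic", s) for s in mic] + [("system", s) for s in system]
    tagged.sort(key=lambda pair: pair[1].start)
    return [f"[{s.start:.1f}-{s.end:.1f}] {side}: {s.text}" for side, s in tagged]


if __name__ == "__main__":
    mic = [Segment(0.0, 1.5, "hello"), Segment(4.0, 5.0, "sounds good")]
    system = [Segment(2.0, 3.0, "hi there")]
    for line in merge_streams(mic, system):
        print(line)
```

This assumes both recordings started at the same instant; if the two captures are launched a moment apart, the offset needs correcting before the sort.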
Diarization, and its honest limit
For finer who-spoke detail inside a single stream, pyannote does the work. I run pyannote.audio 4.x, and two things bit me in production. First, the API moved. pyannote.audio 4.x wraps the result in a DiarizeOutput object, where 3.x returned an Annotation directly, so old code calling .itertracks() throws AttributeError. A hasattr guard keeps it working across both shapes.
```python
import os

import torch
from pyannote.audio import Pipeline

# assumption: the Hugging Face token is exported as HF_TOKEN in the environment
HF_TOKEN = os.environ["HF_TOKEN"]

dia = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token=HF_TOKEN,  # gated repo, accept the terms on Hugging Face first
).to(torch.device("cuda"))

out = dia("session.wav")

# 4.x wraps the Annotation; 3.x returned it directly
ann = out.speaker_diarization if hasattr(out, "speaker_diarization") else out

for turn, _, spk in ann.itertracks(yield_label=True):
    print(f"{turn.start:.1f}-{turn.end:.1f} {spk}")
```
Second, the models are gated. You accept the terms on Hugging Face for pyannote/speaker-diarization-community-1 and pyannote/segmentation-3.0 before the loader will pull them, and the gate is per-user, not per-token.
The honest limit is speaker linkage. If you diarize chunk by chunk for low latency, pyannote assigns SPEAKER_00 and SPEAKER_01 fresh inside each window, and the label goes to whoever dominates that window, so SPEAKER_00 in one 10-second window can be a different human from SPEAKER_00 in the next. I watched a 38-minute recording with four real speakers collapse into two labels, one with 200 lines and one with two. Per-chunk diarization cannot link identity across chunks. The real fix is a session-level pass: run pyannote once over the full recording at the end, or cluster x-vector embeddings across chunks by distance. Run offline over the whole file, pyannote is far more reliable on multi-party audio. Live and per-chunk, treat the labels as a hint and trust the mic-versus-system split instead.
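The embedding-clustering option can be sketched as a greedy pass that maps each chunk's local labels onto session-wide ones. This is an illustration under stated assumptions, not a tested production method: the per-speaker embeddings are assumed to come from some speaker-embedding model and are passed in as plain lists of floats, the link_speakers name and the SPK_n labels are hypothetical, and the 0.5 cosine-distance threshold is a placeholder that would need tuning.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def link_speakers(chunks, max_distance=0.5):
    """Map per-chunk labels to session-wide labels by nearest running centroid.

    chunks: list of {local_label: embedding} dicts, one per diarized window.
    Returns a list of {local_label: global_label} dicts in the same order.
    """
    centroids = []  # each entry is (vector_sum, count)
    mappings = []
    for chunk in chunks:
        chunk_map = {}
        for local_label, emb in chunk.items():
            best_idx, best_dist = None, max_distance
            for i, (vec_sum, count) in enumerate(centroids):
                centroid = [v / count for v in vec_sum]
                dist = cosine_distance(emb, centroid)
                if dist < best_dist:
                    best_idx, best_dist = i, dist
            if best_idx is None:
                # nothing close enough: start a new session-level speaker
                centroids.append((list(emb), 1))
                best_idx = len(centroids) - 1
            else:
                vec_sum, count = centroids[best_idx]
                centroids[best_idx] = ([a + b for a, b in zip(vec_sum, emb)], count + 1)
            chunk_map[local_label] = f"SPK_{best_idx}"
        mappings.append(chunk_map)
    return mappings
```

A greedy pass like this is order-dependent and does nothing to stop two local speakers in the same window landing on one session label, so it is a starting point; a full-file diarization pass remains the more reliable option.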
The orchestration that actually holds
The last piece is VRAM hygiene, because the silent failure mode here is brutal. If Ollama is holding the card and I launch the transcriber, torch raises OutOfMemoryError, the worker dies, and the tray icon still says recording. You get a clean wav and an empty transcript. So before every run I evict whatever is resident, then verify the free figure before trusting any status flag.
```shell
# Ollama keeps a model loaded for keep_alive (5 min default) after the last
# request, so killing the consumer process is not enough. Evict the model:
curl -s http://localhost:11434/api/generate \
  -d '{"model":"gemma3:4b-it-qat","keep_alive":0}' >/dev/null

# or free the whole card:
sudo systemctl stop ollama

# confirm before loading Whisper, want >= 3500 MiB
nvidia-smi --query-gpu=memory.free --format=csv,noheader
# 5040 MiB <- now int8 Whisper loads clean
```
After eviction the int8 model loads cleanly at about 2.9 GB. Then I confirm the Python worker is still alive a few seconds after launch, rather than trusting the status flag, because the flag tracks the PID file, not the VRAM-loaded model.
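Both checks are easy to script so the worker refuses to start on a full card. A minimal sketch, assuming nvidia-smi is on the PATH and a single GPU; the function names are hypothetical, and the 3500 MiB floor is the figure from the shell snippet above:

```python
import re
import subprocess

MIN_FREE_MIB = 3500  # floor from the text before loading int8 Whisper


def parse_free_mib(output):
    """Extract the first 'NNNN MiB' value from nvidia-smi csv,noheader output."""
    match = re.search(r"(\d+)\s*MiB", output)
    if match is None:
        raise ValueError(f"unexpected nvidia-smi output: {output!r}")
    return int(match.group(1))


def query_free_mib():
    """Ask nvidia-smi for free memory on the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_free_mib(result.stdout)


def safe_to_load(free_mib, minimum=MIN_FREE_MIB):
    return free_mib >= minimum


def still_alive(proc, grace_seconds=5.0):
    """Return True if a subprocess.Popen worker survives the grace period."""
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        return True   # still running after the grace period
    return False      # exited early, e.g. on an out-of-memory error
```

Checking the process itself rather than a status flag is what catches the silent case where the worker has died but the recorder still looks healthy.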
When I reach for the cloud instead
I host this locally because it is private and the marginal cost is zero, and because the hosted note-takers upload everything. There are days the maths flips. For a one-off hour of clean single-speaker audio where I do not care who said what, Sarvam's Saaras v3 at about ₹30 per hour, with ₹1000 in free signup credit good for roughly 33 hours, is faster than babysitting VRAM. The local stack earns its place on volume and on privacy. For the audio I record myself, week after week, the 1660 has paid for itself many times over.
If you want to push the same card further on other local models, the int8 and GGUF discipline carries straight over to running Gemma locally with Ollama on the same 6 GB. The reference docs worth bookmarking: faster-whisper, Silero VAD, pyannote.audio, and the CTranslate2 runtime underneath it all.