I Run Gemma 3 Vision On A 6GB GTX 1660 For Screenshot OCR: The Real VRAM And Latency Numbers
Local document extraction on a budget card, measured on my own box, not copied off a spec sheet

The short version
I run Gemma 3 4B vision on a single GTX 1660 with 6GB of VRAM. It reads screenshots, lifts text off invoices, and returns clean structured JSON, all on my own Ubuntu desktop with no API key and no image ever leaving the box. This week I sat down and measured the two things everyone quotes wrong: the VRAM footprint and the cold versus warm latency. The numbers below come from my own nvidia-smi, not a model card.
The headline correction first. The gemma3:4b-it-qat tag gets passed around as a "2.6GB" model. On my card, loaded with a 4096 token context, nvidia-smi reads 4916 MiB in use, desktop included. That fills most of a 6GB card. If you plan around 2.6GB you will try to co-load a second model and walk straight into an out-of-memory wall. I did exactly that before I bothered to measure it.
My box
Nothing exotic here. The point of the whole exercise is that an older 6GB gaming card does this job.
| Component | What I run |
|---|---|
| GPU | NVIDIA GeForce GTX 1660, 6144 MiB, driver 580.159.03 |
| CPU | Ryzen 7 3700X, 8 cores |
| RAM | 31 GB DDR4 |
| OS | Ubuntu, Wayland session |
| Runtime | Ollama 0.23.4 |
| Model | gemma3:4b-it-qat |
The -it-qat suffix matters. It is the instruction-tuned, quantization-aware-trained build. Quantization-aware training holds more accuracy at low bit-width than a plain post-training quant, so for OCR, where one wrong digit ruins an invoice total, I pick the qat tag on purpose.
Install and pull
If you do not have Ollama yet, one line on Linux:
curl -fsSL https://ollama.com/install.sh | sh
ollama --version # I pinned 0.23.4
Then pull the Gemma 3 vision model and confirm it landed:
ollama pull gemma3:4b-it-qat
ollama list | grep gemma3
# gemma3:4b-it-qat d01ad0579247 4.0 GB ...
The 4.0 GB there is the blob on disk, not what it costs in VRAM. Those are two different numbers and the gap is exactly where people get caught.
What it actually costs in VRAM
This is the part I care about as an operator. Idle, before I touch the model, my card shows 515 MiB used by the desktop. I load the model and ask ollama ps what is resident:
ollama ps
# NAME SIZE PROCESSOR CONTEXT UNTIL
# gemma3:4b-it-qat 5.8 GB 100% GPU 4096 4 minutes from now
nvidia-smi --query-gpu=memory.used --format=csv,noheader
# 4916 MiB
So ollama ps reports 5.8 GB allocated, including the 4096 token KV cache, and nvidia-smi shows 4916 MiB actually in use on the card. That nvidia-smi figure is card-wide, so it includes the 515 MiB the desktop already held, which means the load itself added about 4400 MiB. Either way I am sitting near the ceiling of a 6GB card, with roughly 1.2 GB (1228 MiB) of headroom. That works for one model doing one job. It breaks the moment you want Gemma plus an embedding model plus your browser's GPU compositor all at once. Plan for one resident model at a time on this card, and stop trusting the small number from the marketing tag.
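The budget arithmetic is worth writing down once. This is a minimal sketch using only the three figures measured above. The helper names and the 256 MiB safety cushion are hypothetical choices for illustration, not measured values.

```python
# Figures from the measurements above, all in MiB.
CARD_TOTAL_MIB = 6144    # GTX 1660 total VRAM
IDLE_USED_MIB = 515      # desktop only, before the model loads
LOADED_USED_MIB = 4916   # nvidia-smi memory.used with gemma3:4b-it-qat resident


def headroom_mib(total: int = CARD_TOTAL_MIB, used: int = LOADED_USED_MIB) -> int:
    """VRAM still free once the model is resident."""
    return total - used


def model_delta_mib(loaded: int = LOADED_USED_MIB, idle: int = IDLE_USED_MIB) -> int:
    """Memory the model load added on top of the idle desktop."""
    return loaded - idle


def fits(extra_mib: int, safety_mib: int = 256) -> bool:
    """True if another allocation fits in the headroom with a cushion.

    safety_mib is an assumed margin, not something I measured.
    """
    return extra_mib + safety_mib <= headroom_mib()


print(headroom_mib())     # 1228
print(model_delta_mib())  # 4401
print(fits(600))          # True
print(fits(1200))         # False
```

Swap in your own nvidia-smi readings for the three constants before trusting the answer on a different card.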
Cold versus warm, measured today
Latency on local vision has two very different modes, and you have to design around both.
The first call after the model loads pays for the weights coming off disk into VRAM plus the image encoder spinning up. I fed it a synthetic test invoice, a 640x360 PNG I generated myself with dummy data, no real document anywhere near it, and asked for verbatim text plus a JSON summary:
# first call after a cold load
wall_s=13.15 total_ms=13104 prompt_eval=297 eval_count=70
# immediate second call, model still warm
wall_s=2.35 total_ms=2292 prompt_eval=297 eval_count=72
So roughly 13 seconds cold, a touch over 2 seconds warm for a short structured answer. In my testing the warm number drifts up toward 4 to 5 seconds when I ask for a long verbatim transcription of a dense screenshot, because there are more tokens to generate. The image itself tokenized to about 256 image tokens, which is why prompt_eval sits near 297 no matter how little text the picture actually holds.
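Log lines in that shape can be produced roughly as follows. This is a sketch, assuming Ollama is listening on its default local port and using the /api/generate endpoint with a base64-encoded image. The helper names and the file name in the usage comment are hypothetical, and this is not necessarily the exact script behind the numbers above.

```python
import base64
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(image_bytes: bytes, prompt: str,
                  model: str = "gemma3:4b-it-qat") -> dict:
    """Request body for a single non-streaming vision call."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
        "options": {"num_ctx": 4096},  # same context size as measured above
    }


def summarize(resp: dict, wall_s: float) -> str:
    """Format the timing fields; Ollama reports durations in nanoseconds."""
    total_ms = resp["total_duration"] // 1_000_000
    return (f"wall_s={wall_s:.2f} total_ms={total_ms} "
            f"prompt_eval={resp.get('prompt_eval_count', 0)} "
            f"eval_count={resp.get('eval_count', 0)}")


def timed_call(image_path: str, prompt: str) -> str:
    """Send one image and return a one-line timing summary."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as r:
        resp = json.load(r)
    return summarize(resp, time.perf_counter() - start)


# Usage, with a hypothetical file name:
#   print(timed_call("test_invoice.png", "Transcribe the text, then summarize as JSON."))
```

Calling timed_call twice in a row, once right after a load and once immediately after, gives the cold and warm pair.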
And it was correct. From my throwaway invoice it returned this on the warm call:
{
  "invoice_no": "#BS-2026-0488",
  "vendor": "Bharat Stationers, Pune",
  "date": "2026-06-21",
  "total": "Rs 5,380"
}
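Before a result like that goes anywhere downstream, it is worth checking that the fields are present and that the total parses as a number. A minimal sketch, assuming the four field names shown above; the function names and the choice of Decimal are illustrative, not part of my pipeline as described.

```python
import json
import re
from decimal import Decimal

REQUIRED_FIELDS = ("invoice_no", "vendor", "date", "total")


def parse_invoice(raw: str) -> dict:
    """Parse the model's JSON answer and fail loudly on missing fields."""
    data = json.loads(raw)
    missing = [k for k in REQUIRED_FIELDS if k not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


def rupees(total: str) -> Decimal:
    """Pull the numeric amount out of a string like 'Rs 5,380'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", total)
    if match is None:
        raise ValueError(f"no amount found in {total!r}")
    return Decimal(match.group().replace(",", ""))


raw = ('{"invoice_no": "#BS-2026-0488", "vendor": "Bharat Stationers, Pune", '
       '"date": "2026-06-21", "total": "Rs 5,380"}')
invoice = parse_invoice(raw)
print(rupees(invoice["total"]))  # 5380
```

A malformed answer then raises immediately instead of slipping a bad total into a spreadsheet.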
For a 4B model on a budget card, pulling a rupee total and an invoice number cleanly into JSON is the entire reason I keep it around. The cold penalty is the thing to engineer against. Ollama unloads the model after about five minutes idle by default, so a bursty workload pays the 13 second tax on every burst. For a steady screenshot queue I raise the keep-alive and never let it idle out, which keeps me on the 2 second warm path.
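One way to raise the keep-alive is per request, through the keep_alive field of /api/generate. The sketch below sends a request with no prompt, which asks Ollama to load the model without generating anything. The "30m" value and the function names are assumptions for illustration; pick whatever window matches your queue.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def preload_payload(model: str = "gemma3:4b-it-qat", keep_alive="30m") -> dict:
    """Body that loads the model and sets how long it stays resident.

    keep_alive takes a duration string such as "30m" or a number of seconds.
    A negative number keeps the model loaded until Ollama is told otherwise.
    The "30m" default here is an assumed value, not a recommendation.
    """
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def preload(url: str = OLLAMA_URL, **kwargs) -> dict:
    """Warm the model before a burst so the first real call is not cold."""
    req = urllib.request.Request(
        url,
        data=json.dumps(preload_payload(**kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)


# Usage: call preload() once at queue start, or preload(keep_alive=-1)
# to pin the model in VRAM for the life of the Ollama process.
```

The same default can be set server-wide with the OLLAMA_KEEP_ALIVE environment variable, which avoids having to pass the field on every call.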
When I reach for local, and when I do not
I do not run everything on the card out of stubbornness. The split is practical.
I keep it local when:
- The input is a screenshot, a scanned bill, a PAN or Aadhaar style card, or a printed form, and I want plain text or a handful of fields out.
- I am doing volume. A few thousand images a month on a hosted vision API turns into a real per-image bill. On my card it is the cost of electricity and nothing else.
- The image should not leave my machine. Local inference means it never goes over the wire to anyone.
- I want zero rate limits and zero quota anxiety while I iterate on the prompt.
I reach for a cloud vision API instead when:
- The document is genuinely hard, multi-column, handwritten, or a 30 page PDF, where a frontier model reading it is worth the round trip and the fee.
- I need long context or reasoning across many pages at once, which a 4B model on 6GB simply cannot hold.
- The job is rare. For ten images a month the cold start and the setup are not worth it over a hosted call.
For the bread and butter, parsing a screenshot or a stack of bills into fields, the 6GB card wins on cost and privacy, and I have run it that way on my own desktop for weeks.
Two gotchas that cost me time
First, the VRAM surprise above. Size on disk is not size in VRAM, and the 4096 context KV cache is real memory you have to account for. Budget the full measured figure, 4916 MiB (about 4.8 GiB) on this card, not the tag's quoted number.
Second, driver state. Ollama will quietly fall back to CPU if your NVIDIA driver is half-installed, and you will think the model is just slow when it is actually not on the GPU at all. Confirm ollama ps shows 100% GPU, not a CPU split. When mine once read CPU, a clean driver reinstall and a reboot took the same call from tens of seconds down to the 2 second warm path above.
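The GPU check can be automated against Ollama's /api/ps endpoint, which lists loaded models with their total size and the portion held in VRAM. A sketch under the assumption that a model is fully on the GPU when size_vram equals size; the function names are hypothetical.

```python
import json
import urllib.request

PS_URL = "http://localhost:11434/api/ps"  # Ollama's default local endpoint


def gpu_fraction(entry: dict) -> float:
    """Share of a loaded model that sits in VRAM, from 0.0 to 1.0."""
    size = entry.get("size", 0)
    if size <= 0:
        return 0.0
    return entry.get("size_vram", 0) / size


def not_fully_on_gpu(ps_response: dict) -> list:
    """Names of loaded models with any part spilled to CPU."""
    return [m["name"] for m in ps_response.get("models", [])
            if gpu_fraction(m) < 1.0]


def fetch_ps(url: str = PS_URL) -> dict:
    """Ask the local Ollama server which models are resident."""
    with urllib.request.urlopen(url) as r:
        return json.load(r)


# Usage: a non-empty list means something fell back to CPU.
#   spilled = not_fully_on_gpu(fetch_ps())
#   if spilled:
#       print("not fully on GPU:", spilled)
```

Running this before a batch job catches a silent CPU fallback before it shows up as mysterious slowness.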
If you already run text models through Ollama, you have everything you need. Pull the qat vision tag, point an image at it, and read your own nvidia-smi rather than trusting any single quoted VRAM number, mine included. For the text-only side of this same setup, I wrote up my Gemma 4 local walkthrough separately.