A 6 GB GTX 1660 running Gemma 3 4B vision through Ollama, set against the 753B GLM-5.2 release it cannot host.
OPERATOR READ · COVER · JUN 28, 2026 · ISSUE LEAD

GLM-5.2 Ships 753B Open Weights. My GTX 1660 Holds 6 GB.

The most powerful open-weight model of mid-June 2026 needs a cluster. A 4B model on a 6 GB card taught me what that headline leaves out.

By Aditya Sharma · Operator Read · Jun 28, 2026

GLM-5.2 ... marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a solid 1M-token context.

Z.ai (Zhipu AI), GLM-5.2 release

What AutoKaam Thinks
  • Open weights and a model you can run are two different claims. GLM-5.2 ships MIT weights at over 750 billion parameters, and none of that helps a 6 GB card.
  • On a GTX 1660 the only stacks that load are GGUF (through llama.cpp or Ollama), CTranslate2 int8, and ONNX. No tensor cores means vLLM, FP8 quantisation, and Marlin kernels are off the table before you start.
  • Budget VRAM as resident plus one burst, never all-resident. Two models that each fit alone will run out of memory together, and the failure can be silent: my wrapper kept reporting idle while capturing nothing.
  • Local wins on private screenshots, offline triage, and zero per-call cost. For real reasoning quality a 4B model is still a 4B model, and I route that work to a cloud flagship.
753B: GLM-5.2 total parameters. 753B open weights vs a 6 GB card.

In mid-June 2026 Z.ai released GLM-5.2, and the open-weight leaderboard moved again. It is a 753-billion-parameter open-weight model, MIT-licensed, with a 1M-token context window, and on the long-horizon coding benchmarks Z.ai published with it, it runs roughly even with GPT-5.5, edging ahead on FrontierSWE and trailing slightly on Terminal-Bench. The official card calls it a substantial leap in long-horizon task capability over GLM-5.1, and the model weights sit on Hugging Face for anyone to download. I read that announcement on the same desk where I have been running a 4-billion-parameter model on a 6 GB graphics card since 14 May 2026. The gap between those two facts, 753 billion parameters on one side and 6 GB of VRAM on the other, is the whole operator story.

"Open weights" and "a model you can actually run" are two separate sentences, and the second one is the one nobody puts in the headline.

The box this runs on

My local inference box is a GTX 1660 with 6 GB of VRAM, paired with a Ryzen 7 3700X and 32 GB of system RAM on Ubuntu. That card is Turing TU116. It has no tensor cores. In practice that one detail decides almost everything. What loads on it is GGUF through llama.cpp or Ollama, CTranslate2 int8, and ONNX. What does not load is vLLM, FP8 quantisation, and Marlin kernels, because they all assume tensor-core paths the 1660 does not have. So a 753-billion-parameter model is not a question I get to ask. It is settled before I open a terminal.

What I actually host is gemma3:4b-it-qat served through Ollama at localhost:11434. Its Q4 weights are quoted at about 2.6 GB, but with a 4096-token context loaded it sits closer to 4.9 GB resident by nvidia-smi, leaving only about a gigabyte of headroom on a card that reports about 5.61 GiB usable after the display takes its cut. I wired it as a local screenshot-OCR and image-describe service so that quick, mechanical vision work never has to leave the machine. It runs at roughly 22 tokens per second. That is slow next to a cloud endpoint, and fast enough when the alternative is shipping a private screenshot off the box.

VRAM budgeting is the entire game

The single lesson I would hand any first-time self-hoster: budget VRAM as resident plus one burst, never all-resident. Once you internalise that, most of the pain stops.

Component | Role | VRAM on the 1660 | Operator note
gemma3:4b-it-qat | vision and text describe | about 4.9 GB resident | QAT Q4, weights ~2.6 GB plus KV cache
qwen3-vl:2b | heavier vision pass | about 3.3 GB | will not coexist with an ASR model
Whisper int8 | speech to text | about 1.83 GB | runs out of memory if a vision model is already resident
bge-m3 (F16) | embeddings | returns NaN on GPU | must be forced onto CPU, see below
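
One way to read the budgeting rule is as a check you run before loading anything: add up what is already resident, add the one model you are about to load, and compare against the ceiling. The sketch below does that with the figures quoted in this article, taken as reported and without converting between GB and GiB. The `CUDA_OVERHEAD` value and the function names are assumptions for illustration, not measurements.

```python
# Rough VRAM budget check using the figures quoted in this article.
VRAM_CEILING = 5.61   # usable on the GTX 1660 after the display takes its cut
CUDA_OVERHEAD = 0.5   # ASSUMPTION: illustrative overhead, not a measured value

# Approximate footprints from the article; the dictionary keys are just labels.
FOOTPRINT = {
    "gemma3:4b-it-qat": 4.9,
    "qwen3-vl:2b": 3.33,
    "whisper-int8": 1.83,
}


def required(models, overhead=CUDA_OVERHEAD):
    """Total VRAM needed if every model in `models` is loaded at once."""
    return sum(FOOTPRINT[m] for m in models) + overhead


def fits(resident, burst=None, ceiling=VRAM_CEILING):
    """True if the resident models plus at most one burst model fit."""
    models = list(resident) + ([burst] if burst else [])
    return required(models) <= ceiling


def plan(resident, burst):
    """Say whether `burst` can load now or needs an eviction first."""
    if fits(resident, burst):
        return "load"
    return "evict first"
```

With these numbers, `plan(["qwen3-vl:2b"], "whisper-int8")` returns `"evict first"`, which matches the coexistence failure described in this section. Change the overhead figure to whatever your own card shows before trusting the result.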

I learned the coexistence rule the hard way on a small voice-plus-vision experiment. I tried to keep a 2B vision model and a Whisper int8 speech model resident at the same time. Each fits alone. Together they asked for about 3.33 GB plus 1.83 GB, roughly 5.16 GB before CUDA overhead, and with that overhead the total exceeds the 5.61 GiB ceiling. The second model to load died with a CUDA out-of-memory error. The cruel part is that the wrapper reported "idle" while capturing nothing. Silent failure is the default on this card, so I now check the real state every time:

# what is actually resident right now
ollama ps
# NAME               SIZE     PROCESSOR    UNTIL
# gemma3:4b-it-qat   2.6 GB   100% GPU     4 minutes from now

# killing the python caller does NOT free this.
# Ollama holds the model for keep_alive (5 minutes by default).
# force-evict before loading a second model:
ollama stop gemma3:4b-it-qat

# or set a short keep_alive at call time so VRAM releases fast:
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b-it-qat",
  "keep_alive": "30s",
  "prompt": "describe this UI"
}'

That keep_alive trap caught me twice. Ollama keeps a model resident for five minutes after the last request, so killing your Python process frees nothing. If you load a second model inside that window, you get the out-of-memory crash even though, as far as your task manager shows, the first job is long dead.
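
Since the trap bites the Python caller specifically, one option is to have the caller release the model itself. The sketch below is a minimal version using only the standard library. It relies on the Ollama API behaviour that a request with `keep_alive` set to `0` and no prompt unloads the model. The helper names (`generate_payload`, `unload_payload`, `run_then_release`) are hypothetical, and the `post` parameter exists only so the function can be exercised without a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint used in this article


def generate_payload(model, prompt, keep_alive="30s"):
    """Request body for one non-streaming generate call."""
    return {"model": model, "prompt": prompt,
            "keep_alive": keep_alive, "stream": False}


def unload_payload(model):
    """Request body that asks Ollama to unload the model right away."""
    return {"model": model, "keep_alive": 0, "stream": False}


def post_json(payload, url=OLLAMA_URL, timeout=120):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_then_release(model, prompt, post=post_json):
    """Run one call, then unload the model even if the call raised."""
    try:
        return post(generate_payload(model, prompt))
    finally:
        post(unload_payload(model))
```

Calling `run_then_release("gemma3:4b-it-qat", "describe this UI")` trades the warm-call speed for a guarantee that VRAM is free before the next model loads, so it suits one-off jobs more than interactive use.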

The FP16 trap nobody warns you about

Here is the gotcha that will never show up in a glossy "run LLMs locally" post, because the writers benchmarked on a 4090. On the GTX 16xx line, FP16 is effectively broken for this kind of work. I hit it on 29 May 2026 when wiring a local embedding index with bge-m3 served in F16. Short strings embedded fine. Any chunk over roughly 2,000 characters returned NaN every single time, regardless of content, and Ollama threw an HTTP 500: failed to encode response: json: unsupported value: NaN. Truncating the input did not fix it, because it was never a length bug. It is the card overflowing FP16 matmuls.

The fix is to force the embedder onto CPU:

# bge-m3 in F16 NaNs on 2000+ char inputs on the GTX 1660.
# Force CPU: about 0.22s per query, zero NaN, dim=1024 correct.
curl http://localhost:11434/api/embed -d '{
  "model": "bge-m3",
  "input": "<a real 2000+ character chunk of text>",
  "options": { "num_gpu": 0 }
}'

This is also why gemma3:4b-it-qat was the right vision pick and an F16 build would not have been. The QAT Q4 quantisation sidesteps the FP16 path entirely, so it stays numerically stable on a card that cannot be trusted with half-precision. On Turing, quantisation type matters more than parameter count.
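
If the same indexing code has to run on more than one machine, the CPU workaround for the embedder can be wrapped as a fallback instead of hard-coded. The sketch below tries the default placement first and retries with `num_gpu` set to `0` when the server answers HTTP 500 or the vectors contain NaN. It assumes the `/api/embed` response carries an `embeddings` list of vectors. The function names are hypothetical, and `post` is injectable so the logic can be checked without Ollama running.

```python
import json
import math
import urllib.error
import urllib.request

EMBED_URL = "http://localhost:11434/api/embed"  # endpoint used in this article


def post_embed(payload, url=EMBED_URL, timeout=120):
    """POST a JSON body to the embed endpoint and decode the response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def has_nan(vectors):
    """True if any value in any embedding vector is NaN."""
    return any(math.isnan(x) for vec in vectors for x in vec)


def embed(text, model="bge-m3", post=post_embed):
    """Embed text, retrying on CPU if the first attempt fails or yields NaN."""
    payload = {"model": model, "input": text}
    try:
        vectors = post(payload)["embeddings"]
        if not has_nan(vectors):
            return vectors
    except urllib.error.HTTPError as err:
        if err.code != 500:   # only the NaN-style server error triggers a retry
            raise
    cpu_payload = dict(payload, options={"num_gpu": 0})
    return post(cpu_payload)["embeddings"]
```

On a GTX 1660 alone, always sending `num_gpu: 0` as in the curl call is simpler and avoids one failed request per long chunk. The fallback only earns its keep when the same code also runs on cards where the GPU path works.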

Latency, and the clock that lies to you

Cold start on this stack is about 13 seconds while Ollama pulls the model into VRAM. Once it is warm, calls land in 3 to 5 seconds. The whole interactive feel of a local model lives or dies on that keep_alive window: keep it long and the second call is snappy, keep it short and you pay the 13-second cold tax again.
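
The keep_alive trade-off is easy to put numbers on. The sketch below estimates per-call latency from the gap since the previous call. It assumes the 13-second load is paid on top of a normal call, and it uses 4 seconds as the midpoint of the 3 to 5 second warm range; both are modelling assumptions, not extra measurements.

```python
COLD_LOAD_S = 13.0   # load time quoted in this article
WARM_CALL_S = 4.0    # ASSUMPTION: midpoint of the 3 to 5 second warm range


def call_latency(gap_s, keep_alive_s):
    """Estimated latency of a call arriving gap_s after the previous one."""
    still_resident = gap_s < keep_alive_s
    if still_resident:
        return WARM_CALL_S
    return COLD_LOAD_S + WARM_CALL_S


def total_latency(gaps, keep_alive_s):
    """Sum of estimated latencies for a sequence of inter-call gaps."""
    return sum(call_latency(g, keep_alive_s) for g in gaps)
```

For calls arriving a minute apart, `call_latency(60, 300)` gives 4.0 seconds under the default five-minute window, while `call_latency(60, 30)` gives 17.0 seconds under a 30-second one. That 13-second difference is the price of releasing VRAM quickly.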

One more measured surprise. On idle, the card kept dropping its memory clock from 4001 MHz down to 810 MHz and back roughly once a second, which micro-stalled anything bandwidth-bound. On consumer Turing the usual fixes are dead. The nvidia-smi -lmc and -ac flags are fused off and silently report success while doing nothing. The only lever that holds the memory clock up is GPUPowerMizerMode=1, set through nvidia-settings. Pin it and the VRAM clock stays at 4001 MHz even at single-digit utilisation.
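
A minimal sketch of pinning and then verifying the memory clock from Python follows. It assumes a single GPU at index 0 and a running X session, which `nvidia-settings` needs. The function names are hypothetical; the commands themselves are the standard `nvidia-settings -a` attribute assignment and the `nvidia-smi --query-gpu=clocks.mem` query.

```python
import subprocess


def pin_command(gpu_index=0):
    """Command that sets PowerMizer to 'prefer maximum performance'."""
    return ["nvidia-settings", "-a",
            f"[gpu:{gpu_index}]/GPUPowerMizerMode=1"]


def parse_mem_clock(output):
    """Parse the MHz value from csv,noheader,nounits nvidia-smi output."""
    return int(output.strip().splitlines()[0])


def read_mem_clock(gpu_index=0):
    """Return the current memory clock in MHz as reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=clocks.mem", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_mem_clock(result.stdout)


def pin_memory_clock(gpu_index=0):
    """Apply the PowerMizer setting; raises if nvidia-settings fails."""
    subprocess.run(pin_command(gpu_index), check=True)
```

After `pin_memory_clock()`, polling `read_mem_clock()` a few times at idle is a quick way to confirm the value holds instead of trusting the exit code, given how the other flags report success while doing nothing. The setting does not survive a reboot, so it belongs in a session startup script.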

When local actually beats the cloud, and when it does not

After about six weeks of this, my honest split is narrow and specific. Local wins when the input is private and should never leave the box, when the job is high-volume and mechanical so a per-token cloud bill would compound, and when the network is flaky and an offline path keeps the work moving. Screenshot OCR, image triage, and quick describe calls fit that profile exactly, and the marginal cost per call is zero.

Local loses the moment you need real reasoning. A 4B model is a 4B model. For dense-text OCR, long documents, or anything that needs judgement, the quality drop is not worth the saving, and I moved that general vision work back to a cloud flagship without regret. The 753B class of capability that GLM-5.2 represents simply does not fit, and pretending otherwise on a 6 GB card is the fastest way to ship bad output.

So the GLM-5.2 release is a genuine milestone for open weights, and it changes nothing about my desk. For the 6 GB class of card that many Indian developers actually own, often on a Jio or Airtel connection, the runnable open-weight frontier is still the 2B to 8B band. For an operator, the leaderboard ranking matters less than two questions: which model still fits your VRAM after you account for context length and a second resident model, and which quantisation keeps it numerically honest on the silicon you have. If you have measured your own card before downloading 753 billion parameters, you already know the answer. If you are also fighting a VRAM ceiling on image work, my field note on ComfyUI over VRAM covers the same wall from the diffusion side.