
GLM-5.2 Ships 753B Open Weights. My GTX 1660 Holds 6 GB.
The most powerful open-weight model of mid-June 2026 needs a cluster. A 4B model on a 6 GB card taught me what that headline leaves out.
GLM-5.2 ... marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a solid 1M-token context.
- Open weights and a model you can run are two different claims. GLM-5.2 ships MIT weights at over 750 billion parameters, and none of that helps a 6 GB card.
- On a GTX 1660 the only runtimes that load are GGUF, CTranslate2 int8, and ONNX. No tensor cores means vLLM, FP8, and Marlin are off the table before you start.
- Budget VRAM as resident plus one burst, never all-resident. Two models that each fit alone will run out of memory together, and the failure can be silent: my wrapper kept reporting idle while capturing nothing.
- Local wins on private screenshots, offline triage, and zero per-call cost. For real reasoning quality a 4B model is still a 4B model, and I route that work to a cloud flagship.
In mid-June 2026 Z.ai released GLM-5.2, and the open-weight leaderboard moved again. It is a 753-billion-parameter open-weight model, MIT-licensed, with a 1M-token context window, and on the long-horizon coding benchmarks Z.ai published with it, it runs roughly even with GPT-5.5, edging ahead on FrontierSWE and trailing slightly on Terminal-Bench. The official card calls it a substantial leap in long-horizon task capability over GLM-5.1, and the model weights sit on Hugging Face for anyone to download. I read that announcement on the same desk where I have been running a 4-billion-parameter model on a 6 GB graphics card since 14 May 2026. The gap between those two facts, 753 billion parameters on one side and 6 GB of VRAM on the other, is the whole operator story.
"Open weights" and "a model you can actually run" are two separate sentences, and the second one is the one nobody puts in the headline.
The box this runs on
My local inference box is a GTX 1660 with 6 GB of VRAM, paired with a Ryzen 7 3700X and 32 GB of system RAM on Ubuntu. That card is Turing TU116. It has no tensor cores. In practice that one detail decides almost everything. What loads on it is GGUF through llama.cpp or Ollama, CTranslate2 int8, and ONNX. What does not load is vLLM, FP8 quantisation, and Marlin kernels, because they all assume tensor-core paths the 1660 does not have. So a 753-billion-parameter model is not a question I get to ask. It is settled before I open a terminal.
What I actually host is gemma3:4b-it-qat served through Ollama at localhost:11434. Its Q4 weights are quoted at about 2.6 GB, but with a 4096-token context loaded it sits closer to 4.9 GB resident by nvidia-smi, leaving only about a gigabyte of headroom on a card that reports about 5.61 GiB usable after the display takes its cut. I wired it as a local screenshot-OCR and image-describe service so that quick, mechanical vision work never has to leave the machine. It runs at roughly 22 tokens per second. That is slow next to a cloud endpoint, and fast enough when the alternative is shipping a private screenshot off the box.
VRAM budgeting is the entire game
The single lesson I would hand any first-time self-hoster: budget VRAM as resident plus one burst, never all-resident. Once you internalise that, most of the pain stops.
| Component | Role | VRAM on the 1660 | Operator note |
|---|---|---|---|
| gemma3:4b-it-qat | vision and text describe | about 4.9 GB resident | QAT Q4, weights ~2.6 GB plus KV cache |
| qwen3-vl:2b | heavier vision pass | about 3.3 GB | will not coexist with an ASR model |
| Whisper int8 | speech to text | about 1.83 GB | runs out of memory if a vision model is already resident |
| bge-m3 (F16) | embeddings | returns NaN on GPU | must be forced onto CPU, see below |
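As a rough illustration of the resident-plus-one-burst rule, the table's figures can be turned into a small budget check. This is a sketch, not a measurement tool. The 0.5 figure for CUDA overhead is a placeholder assumption, not a number measured in this post. The function names are hypothetical, and the model sizes and the 5.61 ceiling are treated as if they shared one unit.

```python
# Sketch of the "resident plus one burst" budget.
# Model sizes come from the table above; CUDA_OVERHEAD is an assumed
# placeholder, and the GB vs GiB difference is deliberately ignored.

USABLE_VRAM = 5.61     # what the card reports as usable
CUDA_OVERHEAD = 0.5    # ASSUMPTION: not measured in this post

MODELS = {
    "gemma3:4b-it-qat": 4.9,
    "qwen3-vl:2b": 3.3,
    "whisper-int8": 1.83,
}


def fits(resident, burst=None, usable=USABLE_VRAM, overhead=CUDA_OVERHEAD):
    """Return True if the resident model plus at most one burst model fit."""
    need = MODELS[resident] + overhead
    if burst is not None:
        need += MODELS[burst]
    return need <= usable


def bursts_allowed(resident):
    """List the models that could load next to `resident` without eviction."""
    return [name for name in MODELS if name != resident and fits(resident, name)]


if __name__ == "__main__":
    for name in MODELS:
        print(name, "| fits alone:", fits(name),
              "| can coexist with:", bursts_allowed(name))
```

Under these assumed numbers every model fits alone and no pair fits together, which matches the experience described here: the second model has to wait until the first is evicted.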
I learned the coexistence rule the hard way on a small voice-plus-vision experiment. I tried to keep a 2B vision model and a Whisper int8 speech model resident at the same time. Each fits alone. Together they asked for about 3.33 GB plus 1.83 GB, roughly 5.16 GB before CUDA overhead, and with that overhead on top the total went past the 5.61 GiB ceiling. The second model to load died with a CUDA out-of-memory error. The cruel part is that the wrapper reported "idle" while capturing nothing. Silent failure is the default on this card, so I now check the real state every time:
```bash
# what is actually resident right now
ollama ps
# NAME SIZE PROCESSOR UNTIL
# gemma3:4b-it-qat 2.6 GB 100% GPU 4 minutes from now

# killing the python caller does NOT free this.
# Ollama holds the model for keep_alive (5 minutes by default).
# force-evict before loading a second model:
ollama stop gemma3:4b-it-qat

# or set a short keep_alive at call time so VRAM releases fast:
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b-it-qat",
  "keep_alive": "30s",
  "prompt": "describe this UI"
}'
```
That keep_alive trap caught me twice. Ollama keeps a model resident for five minutes after the last request, so killing your Python process frees nothing. If you load a second model inside that window, you get the out-of-memory crash even though, as far as your task manager shows, the first job is long dead.
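Since the caller in this story is a Python process, the same eviction can be done from inside it rather than from a shell. The sketch below assumes the Ollama endpoint at localhost:11434 used throughout this post. The helper names are hypothetical, and the only API behaviour it relies on is that a generate request with no prompt and a keep_alive of 0 unloads the model.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint used in this post


def generate_payload(model, prompt, keep_alive="30s"):
    """Body for a normal call that releases VRAM soon after it finishes."""
    return {"model": model, "prompt": prompt,
            "keep_alive": keep_alive, "stream": False}


def unload_payload(model):
    """Body that asks Ollama to evict the model: no prompt, keep_alive of 0."""
    return {"model": model, "keep_alive": 0, "stream": False}


def post(payload, url=OLLAMA_URL, timeout=120):
    """Send one JSON request to Ollama and return the decoded reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def swap_models(old_model, new_model, prompt):
    """Evict old_model first, then call new_model with a short keep_alive."""
    post(unload_payload(old_model))
    return post(generate_payload(new_model, prompt))
```

Calling `swap_models("gemma3:4b-it-qat", "qwen3-vl:2b", "describe this UI")` would evict the first model before the second one asks for VRAM, instead of relying on the five-minute timer to expire.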
The FP16 trap nobody warns you about
Here is the gotcha that will never show up in a glossy "run LLMs locally" post, because the writers benchmarked on a 4090. FP16 on the GTX 16xx line is effectively broken for this kind of work. I hit it on 29 May 2026 when wiring a local embedding index with bge-m3 served in F16. Short strings embedded fine. Any chunk over roughly 2,000 characters returned NaN every single time, regardless of content, and Ollama threw an HTTP 500: failed to encode response: json: unsupported value: NaN. Truncating the input did not fix it, because it was never a length bug. It is the card overflowing FP16 matmuls.
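To make the overflow mechanism concrete, here is a toy dot product that rounds every product and partial sum to half precision. It is not a reproduction of what the card or Ollama does internally. It only illustrates that FP16 tops out at 65504, so a long enough accumulation saturates to infinity, and arithmetic on infinities then yields NaN. The vector values and lengths are arbitrary choices for the demonstration.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 range; model that as infinity.
        return math.copysign(math.inf, x)


def fp16_dot(a, b):
    """Dot product with every product and partial sum rounded to FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


if __name__ == "__main__":
    short = fp16_dot([8.0] * 100, [8.0] * 100)
    long_ = fp16_dot([8.0] * 2000, [8.0] * 2000)
    print("short:", short)
    print("long:", long_)
    print("long - long:", long_ - long_)
```

The short vector stays finite, the long one runs past the FP16 range and becomes infinity, and subtracting infinity from infinity gives NaN. That is the same shape of failure as short strings embedding fine while long chunks come back as NaN.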
The fix is to force the embedder onto CPU:
```bash
# bge-m3 in F16 NaNs on 2000+ char inputs on the GTX 1660.
# Force CPU: about 0.22s per query, zero NaN, dim=1024 correct.
curl http://localhost:11434/api/embed -d '{
  "model": "bge-m3",
  "input": "<a real 2000+ character chunk of text>",
  "options": { "num_gpu": 0 }
}'
```
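When building an index from Python, it may be worth checking every vector before it is stored. In the failure described above Ollama returned an HTTP 500, so a client would normally see that error first. The check below is only a defensive second line. It assumes the response carries an `embeddings` list of vectors, and the helper names are hypothetical.

```python
import math

EXPECTED_DIM = 1024  # bge-m3 dimension quoted in this post


def cpu_embed_payload(text, model="bge-m3"):
    """Body for /api/embed that keeps every layer off the GPU."""
    return {"model": model, "input": text, "options": {"num_gpu": 0}}


def check_embeddings(response, dim=EXPECTED_DIM):
    """Raise ValueError if any vector has the wrong size or a non-finite value."""
    vectors = response.get("embeddings", [])
    if not vectors:
        raise ValueError("no embeddings in response")
    for i, vec in enumerate(vectors):
        if len(vec) != dim:
            raise ValueError(f"vector {i}: expected dim {dim}, got {len(vec)}")
        if not all(math.isfinite(v) for v in vec):
            raise ValueError(f"vector {i}: contains NaN or inf")
    return vectors
```

Running `check_embeddings` on each decoded reply before writing to the index keeps a bad vector from poisoning later similarity searches.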
This is also why gemma3:4b-it-qat was the right vision pick and an F16 build would not have been. The QAT Q4 quantisation sidesteps the FP16 path entirely, so it stays numerically stable on a card that cannot be trusted with half-precision. On Turing, quantisation type matters more than parameter count.
Latency, and the clock that lies to you
Cold start on this stack is about 13 seconds while Ollama pulls the model into VRAM. Once it is warm, calls land in 3 to 5 seconds. The whole interactive feel of a local model lives or dies on that keep_alive window: keep it long and the second call is snappy, keep it short and you pay the 13-second cold tax again.
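The trade-off can be put into rough numbers. The sketch below assumes the 13-second cold start is paid on top of a warm call, and it uses 4 seconds as the midpoint of the quoted 3 to 5. Both are simplifications, and the function names are hypothetical.

```python
COLD_START_S = 13.0  # quoted cold start
WARM_CALL_S = 4.0    # ASSUMPTION: midpoint of the quoted 3 to 5 seconds


def call_latency(gap_s, keep_alive_s, cold=COLD_START_S, warm=WARM_CALL_S):
    """Estimated latency of a call arriving gap_s after the previous one."""
    still_resident = gap_s < keep_alive_s
    return warm if still_resident else cold + warm


def session_latency(gaps_s, keep_alive_s):
    """Total estimate: one cold first call, then calls at the given gaps."""
    total = COLD_START_S + WARM_CALL_S
    for gap in gaps_s:
        total += call_latency(gap, keep_alive_s)
    return total


if __name__ == "__main__":
    gaps = [60, 60]  # two follow-up calls, one minute apart
    print("keep_alive 300s:", session_latency(gaps, 300))
    print("keep_alive 30s:", session_latency(gaps, 30))
```

With calls a minute apart, a 30-second keep_alive pays the cold tax on every call, while the default five minutes pays it once. The price of the longer window is VRAM that stays occupied between calls.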
One more measured surprise. On idle, the card kept dropping its memory clock from 4001 MHz down to 810 MHz and back roughly once a second, which micro-stalled anything bandwidth-bound. On consumer Turing the usual fixes are dead. The nvidia-smi -lmc and -ac flags are disabled on this card, and they silently report success while doing nothing. The only lever that holds the memory clock up is GPUPowerMizerMode=1, set through nvidia-settings. Pin it and the VRAM clock stays at 4001 MHz even at single-digit utilisation.
When local actually beats the cloud, and when it does not
After weeks of this, my honest split is narrow and specific. Local wins when the input is private and should never leave the box, when the job is high-volume and mechanical so a per-token cloud bill would compound, and when the network is flaky and an offline path keeps the work moving. Screenshot OCR, image triage, and quick describe calls fit that profile exactly, and the marginal cost per call is zero.
Local loses the moment you need real reasoning. A 4B model is a 4B model. For dense-text OCR, long documents, or anything that needs judgement, the quality drop is not worth the saving, and I moved that general vision work back to a cloud flagship without regret. The 753B class of capability that GLM-5.2 represents simply does not fit, and pretending otherwise on a 6 GB card is the fastest way to ship bad output.
So the GLM-5.2 release is a genuine milestone for open weights, and it changes nothing about my desk. For the roughly 6 GB card the median Indian developer actually owns, on Jio or Airtel, the runnable open-weight frontier is still the 2B to 8B band. For an operator, the leaderboard ranking matters less than which model still fits your VRAM after you account for context length and a second resident model, and which quantisation keeps it numerically honest on the silicon you have. If you have measured your own card before downloading 753 billion parameters, you already know the answer. If you are also fighting a VRAM ceiling on image work, my field note on ComfyUI over VRAM covers the same wall from the diffusion side.