GLM-5.2 Cleared the Six Hard Tasks I Use to Vet Any Cheap Model

A new open-weights model matched my flagship on objective hard tasks. The battery I run before trusting any cheap model in production did not change.

By Aditya Sharma · Jun 28, 2026

I've been trying it out via OpenRouter, which has it from 9 different providers, almost all of which are charging $1.40/million for input and $4.40/million for output.

— Simon Willison, simonwillison.net

What AutoKaam Thinks

A leaderboard rank is a reason to test a model, not a reason to trust it. I keep a fixed battery of objective hard tasks with execution-checked answers and run it on every cheap or open release bef…
On my June battery, GLM-5.2 matched Claude Opus 4.8 on all six tasks, from discrete optimization with a proof to a money-rounding fix.
The flagship still wins where the battery cannot reach, on long autonomous runs, on judgment and voice, and on depth of native tool use. Those decide where each tier runs, not the per-task score.
Pricing is the easy part. GLM-5.2 lists near $1.40 input and $4.40 output per million tokens versus $5 and $25 for Opus 4.8. The hard part is proving parity on your own work first.

6/6 objective hard tasks where the cheap model matched my flagship (GLM-5.2 vs Claude Opus 4.8)

Named stake

The week GLM-5.2 landed, I did the same thing I do with every cheap or open model that gets a loud launch. I ignored the launch. Then I opened a folder on my machine that has not changed in months, a fixed battery of hard tasks with answers I can check by running code, and I made the new model earn a place in my stack one task at a time.

This is the part of model adoption that does not show up in a benchmark table. A leaderboard tells me a model is worth an afternoon. It does not tell me the model will hold up on the specific shape of work I bill for. The distance between those two facts is where operators lose money, because the cheapest confident answer is free and the wrong one shows up in production two weeks later.

GLM-5.2 is the loud launch this time. Z.ai shipped it in mid-June 2026 with open weights under a permissive license, over 700 billion parameters, and a context window of up to 1 million tokens, and within days it sat at the top of the open-weights chart on the Artificial Analysis Intelligence Index. Simon Willison, who has been right about more of these than most, wrote that he was impressed to see it rank so highly and noted he was running it through OpenRouter across nine providers at around $1.40 per million input tokens and $4.40 per million output. For reference, the flagship I pay for, Claude Opus 4.8, lists at $5 input and $25 output per million tokens. On the output side alone, GLM-5.2 costs a little under a fifth of that, about 18 percent.

The battery, not the leaderboard

I do not test models on vibes and I do not test them on one clever prompt. Both reward the model that is good at first impressions, which is the opposite of what production needs. My battery is six tasks, each with a single correct answer that a script can verify. The point is objectivity. If the answer is checkable, the model cannot talk its way to a pass, and I cannot fool myself into liking the one that sounds the most confident.

The six tasks I ran in June, each chosen because a strong-sounding model can still get it wrong:

#	Task	What it actually tests
1	Earliest-deadline-first schedule plus a proof of optimality	discrete optimization, where a model needs a defensible reason, not just an answer
2	The expected value of the higher of two fair dice	probability the model has to compute, not recall
3	Invoice text to strict JSON with a tie-break sort rule	extraction under an exact schema and an ordering edge case
4	Root-cause of an infinite loop in a binary search	debugging by reading, not by rewriting
5	The smallest n where n factorial has exactly five trailing zeros	a counting trap where the obvious answer does not exist
6	A float rounding bug that undercharges, and the correct fix	money math where the wrong type silently loses rupees
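Two of the rows above, tasks 2 and 5, have ground truth small enough to recompute by brute force rather than take on trust. The following is a minimal sketch that assumes nothing beyond the task statements in the table. The function names are hypothetical and this is not the battery's actual verifier code.

```python
from fractions import Fraction
from itertools import product


def expected_max_of_two_dice(sides: int = 6) -> Fraction:
    """Exact expected value of the higher of two fair dice, by enumeration."""
    faces = range(1, sides + 1)
    total = sum(max(a, b) for a, b in product(faces, repeat=2))
    return Fraction(total, sides * sides)


def trailing_zeros_of_factorial(n: int) -> int:
    """Trailing zeros of n!, counted as the number of factors of 5 in 1..n."""
    count = 0
    power = 5
    while power <= n:
        count += n // power
        power *= 5
    return count


def smallest_n_with_zeros(target: int, limit: int = 1000):
    """Smallest n whose factorial has exactly `target` trailing zeros, or None."""
    for n in range(limit + 1):
        zeros = trailing_zeros_of_factorial(n)
        if zeros == target:
            return n
        if zeros > target:  # the count never decreases, so the target was skipped
            return None
    return None


if __name__ == "__main__":
    ev = expected_max_of_two_dice()
    print(ev, round(float(ev), 4))      # 161/36 4.4722
    print(smallest_n_with_zeros(5))     # None: 24! ends in 4 zeros, 25! in 6
```

The count jumps from four to six at n = 25 because 25 contributes two factors of 5 at once, which is why the honest answer to task 5 is that no such n exists.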

I run all six through the candidate model and through my flagship on the same prompts, in the same harness, and then a verifier checks every answer against ground truth. The harness is deliberately boring. Two scripts, no cleverness.

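# battery.txt: one "name<TAB>prompt" per line, same file for every model under test
# call_model MODEL PROMPT is a wrapper (not shown) that prints the model's answer
while IFS=$'\t' read -r name task; do
  ans=$(call_model "$CANDIDATE" "$task")     # the cheap / open model
  ref=$(call_model "$FLAGSHIP"  "$task")     # the model I already trust
  # key each row by task name so the verifier can look up its check
  printf '%s\t%s\t%s\n' "$name" "$ans" "$ref" >> runs.tsv
done < battery.txt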

The verifier is where the objectivity lives. Ground truth is code, not opinion, so the model that sounds most sure has no advantage over the model that is simply correct.

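# verify.py, every answer is checked against a computed truth
import json

# EXPECTED_ROWS (the expected invoice rows) and load_runs (yields a task name
# and the candidate's answer for each row of the TSV) are not shown here
CHECKS = {
    "dice_max_expectation":          lambda x: round(float(x), 4) == 4.4722,
    "factorial_trailing_zeros_five": lambda x: x.strip().lower() == "no such n",
    "invoice_json":                  lambda x: json.loads(x) == EXPECTED_ROWS,  # order matters
}
for name, got in load_runs("runs.tsv"):
    try:
        ok = CHECKS[name](got)
    except (KeyError, ValueError):  # unknown task or unparseable answer is a FAIL, not a crash
        ok = False
    print(f"{name:32} {'PASS' if ok else 'FAIL'}")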

GLM-5.2 passed all six. The same answers my flagship produced, including the trap in task five, where the honest answer is that no such n exists, and the rounding fix in task six, where the model has to reach for a decimal type instead of a float. Six out of six against the model I already trust. On objective, checkable, single-answer reasoning, the cheap open-weights model was not behind the flagship. It was level with it.
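The class of bug behind task six can be reproduced in a few lines. The following is a hypothetical instance, not the actual task from the battery. The amount 2.675 and both function names are assumptions, chosen because that value has no exact binary floating-point representation.

```python
from decimal import Decimal, ROUND_HALF_UP


def charge_float(amount: float) -> float:
    """Buggy version: the float nearest to 2.675 sits just below it."""
    return round(amount, 2)


def charge_decimal(amount: str) -> Decimal:
    """Fixed version: parse the amount as text, round half up to two places."""
    return Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    print(charge_float(2.675))       # 2.67, one paisa short
    print(charge_decimal("2.675"))   # 2.68
```

The fix only holds if the amount never passes through a float on the way in, which is why the sketch takes a string rather than a number.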

Where the cheap model still loses

Parity on a battery is not parity on everything, and reading the result any other way is how operators talk themselves into a bad migration. My six tasks are short. Each one fits in a single turn. They measure whether a model can reason, extract, and debug correctly, and on that axis the gap has closed.

They do not measure the things that actually set a flagship apart over a long session. Three of those held GLM-5.2 below Opus 4.8 in my use, and none of them appear on a leaderboard:

Long-horizon coherence. A two-hour autonomous coding run, where the model has to hold a plan across dozens of steps without drifting, is a different test from a single hard prompt. The flagship stays on track longer.
Judgment, taste, and voice. For anything written under my own name, the cheaper model is a competent draft and the flagship is the editor. That gap is real and it does not reduce to a score.
Depth of native tool use. The flagship is wired into my coding harness at a level the open model does not match yet, and that depth compounds across a long task.

There is also a cost wrinkle the headline price hides. GLM-5.2 thinks in long, expensive output. In my runs it burned around 43,000 output tokens on a single hard task, heavier than the flagship on the same prompt. Cheaper per token does not always mean cheaper per task once the model reasons at that length. The output-price edge narrows on long jobs, so I measure cost per finished task, not per million tokens.
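The arithmetic behind that narrowing is short. The sketch below uses only the list prices and the roughly 43,000-token figure quoted in this article. The function name is hypothetical and input tokens are set to zero to isolate the output side. The flagship's output length on the same task is not stated, so only the break-even point is derived.

```python
def cost_per_task(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Dollar cost of one task; prices are in dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


GLM_IN, GLM_OUT = 1.40, 4.40       # list prices quoted in the article
OPUS_IN, OPUS_OUT = 5.00, 25.00

# Output-side cost of one 43,000-token GLM-5.2 run.
glm_output_cost = cost_per_task(0, 43_000, GLM_IN, GLM_OUT)

# Output tokens at which the flagship's output bill equals that run.
break_even_tokens = 43_000 * GLM_OUT / OPUS_OUT

if __name__ == "__main__":
    print(f"GLM-5.2 output cost: ${glm_output_cost:.4f}")
    print(f"Flagship break-even: {break_even_tokens:.0f} output tokens")
```

On these numbers the 43,000-token run costs about $0.19 in output, and the flagship would match that bill if it finished the same task in roughly 7,600 output tokens.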

What I actually do with the result

The test decides placement, not loyalty. Once a cheap model clears the battery, it earns a defined lane: gated, checkable work where a correctness gate sits between the model and anything that ships. Structured extraction, classification, code generation covered by tests, deterministic reasoning, and first drafts that a separate check has to pass. This is exactly the work where an objective gate already exists, so a model that proves it can clear that gate at flagship level can run there.

The flagship keeps the rest. Synthesis, judgment, long autonomous runs, anything published under my name, and any task where I cannot cheaply check the answer after the fact. The expensive model is the adjudicator, not the volume engine.
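The placement rule can be sketched as a small router. Everything in it is hypothetical: the `Task` type, the `route` function, and the two model callables stand in for whatever client and correctness gate a given stack uses. Escalating to the flagship when the gate fails is an assumption of this sketch, not something the article specifies.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Model = Callable[[str], str]  # prompt in, answer out


@dataclass
class Task:
    prompt: str
    # Correctness gate for checkable work; None means no cheap check exists.
    check: Optional[Callable[[str], bool]] = None


def route(task: Task, cheap: Model, flagship: Model) -> tuple[str, str]:
    """Return (answer, tier). The cheap tier only ships answers that pass the gate."""
    if task.check is None:
        return flagship(task.prompt), "flagship"   # uncheckable work
    answer = cheap(task.prompt)
    if task.check(answer):
        return answer, "cheap"
    return flagship(task.prompt), "flagship"       # gate failed, escalate


if __name__ == "__main__":
    # Stub models for illustration only; real ones would call an API.
    cheap_stub = lambda prompt: "4.4722"
    flagship_stub = lambda prompt: "flagship answer"
    gated = Task("Expected value of the higher of two dice?",
                 check=lambda a: a.strip() == "4.4722")
    print(route(gated, cheap_stub, flagship_stub))
    print(route(Task("Draft the essay"), cheap_stub, flagship_stub))
```

Keeping the gate on the task rather than on the model means a new cheap release can be swapped in without touching the routing logic.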

For an Indian operator billing in rupees, that split is the whole game. The open model runs on a separate, cheap budget, and the flagship envelope is spent only where it changes the outcome. None of it works without the battery, though. The price was never the hard part. Proving parity on my own work, before I trusted it with a single live task, was.

GLM-5.2 is a genuinely strong release, and the Artificial Analysis Intelligence Index ranking it earned is deserved. But that ranking is the start of my work, not the end of it. The model that wins a leaderboard and the model that holds up on the specific shape of work I bill for are two different claims, and only one of them pays. The battery is how I tell them apart, and I run it again on the next release, because a test you trust is a moat. A leaderboard you worship is a liability.
