
Ollama Ships Kimi CLI, Guts MLX Sampling

Same install, new CLI path, and a hard cloud dependency for the long-horizon execution tasks your agents run.

Tom Reilly · SHEET · Apr 28, 2026

You can now install and run the Kimi CLI through Ollama: ollama launch kimi --model kimi-k2.6:cloud

GitHub Releases (ollama/ollama)

What AutoKaam Thinks
  • Kimi CLI lands inside Ollama, locking long-horizon agentic workflows to cloud-only kimi-k2.6—your local runners won’t cut it.
  • MLX sampling speed jumps via fused top-P/K sorting, but you must audit repeat penalty handling if you rely on niche output shaping.
  • Mac users: a stale model picker bug just got fixed—update now or risk command drift in multi-chat workflows.
  • Structured outputs for Gemma 4 are stable when think=false, clearing a blocker for lightweight agent pipelines.

If you run a fifty-person dev shop or an indie agent team using Ollama, here is the operator's read.

v0.21.1 dropped quietly on April 22, no fanfare. But the changes aren’t cosmetic. They shift how your agents handle long-execution tasks, how fast MLX models sample, and how your team avoids stale state on macOS. The release notes read like a dev’s checklist, not a marketing memo. That’s Ollama’s brand. And that’s why you pay attention.

This isn’t a “nice-to-have” bump. It’s a lockfile breaker for anyone running Kimi, MLX, or Gemma 4 in production pipelines. The audit pass starts now.

What Shipped

Ollama v0.21.1 ships nine changes that matter to operators.

First, Kimi CLI is now installable and runnable through Ollama. The command is ollama launch kimi --model kimi-k2.6:cloud. That :cloud suffix matters. Kimi K2.6 runs in multi-agent mode for long-horizon tasks, but it does not run locally. You cannot pull it to air-gapped or low-bandwidth setups. If your agent workflows depend on offline execution, you’re blocked.

Second, MLX sampling got faster. Top-P and top-K are now fused into a single sort pass. Repeat penalties apply in the sampler. This reduces latency in token generation, especially under load. You’ll see it in throughput metrics, not just lab benches.
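
If you want a quick way to see whether the sampler changes move your outputs, a minimal spot check against the local API works. The model tag below is a placeholder, and the option names (seed, top_p, top_k, repeat_penalty) are standard Ollama request options, not anything new in this release.

    # Pinned-seed sample; run identical requests before and after the upgrade.
    curl -s http://localhost:11434/api/generate -d '{
      "model": "your-mlx-model",
      "prompt": "List three HTTP status codes and what they mean.",
      "stream": false,
      "options": { "seed": 42, "top_p": 0.9, "top_k": 40, "repeat_penalty": 1.1 }
    }' | jq -r '.response'

Identical settings that suddenly produce different shapes are your first drift signal.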

Third, logprobs support is added for compatible MLX models. If you’re doing confidence scoring or output tracing, this unblocks you. But only for models explicitly supporting it. No blanket coverage.
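
The release notes don’t spell out the request shape, so treat this as a hedged probe: it goes through Ollama’s OpenAI-compatible endpoint using the standard OpenAI logprobs fields, and whether anything comes back depends on the model actually supporting it.

    # Probe for logprobs on a candidate model; field names follow the OpenAI
    # chat schema, not an Ollama-specific one. Model tag is a placeholder.
    curl -s http://localhost:11434/v1/chat/completions -d '{
      "model": "your-mlx-model",
      "messages": [{"role": "user", "content": "ping"}],
      "logprobs": true,
      "top_logprobs": 3
    }' | jq '.choices[0].logprobs'

A null here means your model isn’t on the list, not that your scoring code is wrong.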

Fourth, prompt tokenization moved into request handler goroutines. This improves concurrency under high request volume. No more tokenization bottlenecks when multiple agents fire at once.
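
A rough smoke test for the concurrency claim: fire a batch of parallel generate calls and time the wall clock before and after the upgrade. The parallelism count and model tag are placeholders; tune them to your real agent fan-out.

    # 16 concurrent generate calls against the local instance.
    time seq 16 | xargs -P 16 -I {} curl -s -o /dev/null \
      http://localhost:11434/api/generate \
      -d '{"model": "your-model", "prompt": "Summarize request {} in one sentence.", "stream": false}'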

Fifth, MLX thread safety for array management is improved. If you’ve seen race conditions in array handling, this should clear them. But test it; threading bugs don’t die quietly.

Sixth, GLM4 MoE Lite performance improved via a fused sigmoid router head. Throughput gains are real, but only if you’re using GLM4 MoE Lite. Don’t expect wins on other MoE models.

Seventh, the macOS model picker no longer shows stale models after switching chats. If your team uses Ollama Desktop and swaps between models mid-session, this bug was costing you time and causing command drift. It’s fixed.

Eighth, structured outputs for Gemma 4 now work when think=false. That’s critical for low-latency agent chains where reasoning steps are disabled. Before this, structured JSON output would fail silently. Now it’s reliable.
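
A minimal verification call, assuming the think and format fields on /api/chat behave the way they do in current Ollama releases; the gemma4 tag comes from the release note, and the schema is just an example.

    # Structured output with reasoning disabled; fromjson fails loudly if the
    # model returns malformed JSON.
    curl -s http://localhost:11434/api/chat -d '{
      "model": "gemma4",
      "think": false,
      "stream": false,
      "messages": [{"role": "user", "content": "Give the city and country of the Eiffel Tower."}],
      "format": {
        "type": "object",
        "properties": { "city": {"type": "string"}, "country": {"type": "string"} },
        "required": ["city", "country"]
      }
    }' | jq '.message.content | fromjson'

If the fromjson step throws, you’re still getting malformed output and the fix didn’t land for your setup.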

Ninth, the full changelog spans 10 commits from v0.21.0. No breaking API changes, but enough under-the-hood shifts to demand a test pass.

[[IMG: a developer in a co-working space launching the Kimi CLI via Ollama, terminal window showing the --model kimi-k2.6:cloud flag, early-morning coffee on the desk]]

Why It Matters

This release isn’t about features. It’s about constraints.

Ollama is drawing a line: long-horizon agentic tasks belong in the cloud. Kimi K2.6 isn’t local. You can’t pull it. You can’t run it on your MacBook Pro. It’s a cloud-bound agent system. That’s a strategic choice, not a technical limitation. They’re pushing users toward managed execution for complex workflows.

That torpedoes one of Ollama’s early selling points: local, private, offline AI. For Kimi, that’s over. If you want multi-agent long-horizon execution, you’re on the network. Your data leaves your machine. Your latency depends on their endpoint. Your availability ties to their uptime.

The MLX optimizations? They’re real. Fused top-P and top-K in one sort pass is a known accelerator. It cuts memory sweeps. Repeat penalties in the sampler? That’s output hygiene. Both should improve throughput by 15–30% under load, based on past MLX upgrades.

But here’s the hidden cost: you must test it. Because optimizations can shift output distributions. If your agent relies on precise token shaping, say for API call formatting or regex-constrained outputs, repeat penalties might now over-correct. You won’t catch it in unit tests. You’ll catch it in production drift.

The macOS model picker fix? Small bug, big impact. For teams using Ollama Desktop in agent switching workflows, stale model state meant commands running against the wrong model. That’s not just inefficient. It’s dangerous. Wrong model, wrong output, wrong action. Audit logs get messy. Debugging takes hours.

Same with Gemma 4. Structured outputs failing when think=false was a silent killer. You’d set up a fast agent chain, disable reasoning to cut latency, and suddenly JSON formatting breaks. No error. Just malformed output. Now it’s fixed. But how many teams paused Gemma 4 adoption because of it? This unblocks them.

Ollama’s playing the devops game now. Not just model runner. Infrastructure. Concurrency. Thread safety. Output reliability. They’re targeting teams that ship agent pipelines, not just tinker with prompts.

That’s good. But it means more moving parts. More things that can break. More audit passes required.

What to Migrate

Don’t just brew upgrade ollama and move on. This release demands a checklist.

First, audit your Kimi CLI usage. If you were running Kimi locally, you can’t anymore. The new command forces kimi-k2.6:cloud. No local fallback. If your security policy prohibits cloud-bound agents, you’re stuck on v0.21.0. Or you redesign.

Second, test MLX sampling under load. Spin up a stress test with your highest-throughput agent. Compare token generation speed pre- and post-upgrade. Watch for latency drops. But also, inspect output quality. Are repeat penalties over-suppressing tokens? Are fused top-P/K choices altering your output distribution? Use a diff tool on sample outputs. Don’t assume.
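
One way to run that diff: pin a seed per sample, capture a batch from each version, and compare. Model tag and prompt file are placeholders; jq builds the request body so multi-line prompts don’t break the JSON.

    # Capture 20 seeded samples on the new version.
    for i in $(seq 1 20); do
      jq -n --arg p "$(cat agent_prompt.txt)" --argjson s "$i" \
        '{model: "your-mlx-model", prompt: $p, stream: false,
          options: {seed: $s, repeat_penalty: 1.1}}' \
      | curl -s http://localhost:11434/api/generate -d @- \
      | jq -r '.response'
    done > samples_v0211.txt
    diff samples_v0210.txt samples_v0211.txt

Generate samples_v0210.txt the same way against the old binary before you upgrade, or on a host you haven’t touched yet.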

Third, Mac users: update immediately. The stale model picker bug is real. If your team uses Ollama Desktop for agent switching, this was a consistency risk. Upgrade to v0.21.1. Restart the app. Validate model state clears between chats.

Fourth, re-enable Gemma 4 structured outputs with think=false. If you disabled this mode due to output failures, now’s the time to bring it back. Run a side-by-side test: old version (broken), new version (fixed). Verify JSON schema compliance. Then roll it into production.
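
To make the side-by-side concrete, a crude compliance counter over repeated calls is enough to start; it reuses the structured-output request sketched earlier and only checks that the content parses and carries the required keys, which is weaker than full schema validation.

    # Count parse/key failures over 50 structured-output calls.
    pass=0; fail=0
    for i in $(seq 1 50); do
      out=$(curl -s http://localhost:11434/api/chat -d '{
        "model": "gemma4", "think": false, "stream": false,
        "messages": [{"role": "user", "content": "Give the city and country of the Eiffel Tower."}],
        "format": {"type": "object",
                   "properties": {"city": {"type": "string"}, "country": {"type": "string"}},
                   "required": ["city", "country"]}
      }' | jq -r '.message.content')
      echo "$out" | jq -e 'has("city") and has("country")' >/dev/null 2>&1 \
        && pass=$((pass+1)) || fail=$((fail+1))
    done
    echo "pass=$pass fail=$fail"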

Fifth, pin your Ollama version in CI/CD. Do not let auto-updates push this to production. You need a test pass first. Use ollama --version in your build script. Freeze it at v0.21.1 until you’ve validated all pipelines.
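
A minimal CI guard, assuming ollama --version prints the version string the way current releases do; bump the expected value deliberately, never implicitly.

    # Fail the build if the runner's Ollama version drifts from the pinned one.
    expected="0.21.1"
    if ! ollama --version 2>/dev/null | grep -q "$expected"; then
      echo "Ollama version mismatch: expected $expected" >&2
      exit 1
    fi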

Your lockfile is now production infrastructure. Treat it like a database migration.

Sixth, document the cloud dependency for Kimi. Update your internal runbooks. Flag that kimi-k2.6:cloud means external execution. Add network timeout handling. Add retry logic. Assume the endpoint can fail.
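
For the runbook, a plain retry wrapper is enough to start; the launch command is the one from this release, and the backoff numbers are arbitrary and yours to tune.

    # Retry the cloud-bound launch a few times before giving up.
    for attempt in 1 2 3; do
      ollama launch kimi --model kimi-k2.6:cloud && break
      echo "kimi launch attempt $attempt failed; backing off" >&2
      sleep $((attempt * 10))
    done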

Seventh, monitor GLM4 MoE Lite performance. If you use it, watch throughput and memory. The fused sigmoid router should cut latency. But verify. Don’t trust the changelog.
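
Throughput is easy to read off the response itself: non-streamed generate calls return eval_count and eval_duration (nanoseconds), so one jq expression gives tokens per second. The model tag is a placeholder for however you’ve pulled GLM4 MoE Lite.

    # Tokens generated and tokens/sec for a single call.
    curl -s http://localhost:11434/api/generate -d '{
      "model": "your-glm4-moe-lite",
      "prompt": "Write a 200-word summary of HTTP caching.",
      "stream": false
    }' | jq '{tokens: .eval_count, tok_per_s: (.eval_count / (.eval_duration / 1e9))}'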

Eighth, check logprobs support for your MLX models. Not all models support it. If you’re using it for confidence scoring, confirm your model is on the list. If not, you’ll get nulls or errors.

Ninth, train your team on the new launch command. No more standalone Kimi CLI. It’s ollama launch kimi --model kimi-k2.6:cloud. Update docs. Update onboarding. Update Slack snippets.

Tenth, schedule a rollback plan. If MLX sampling breaks output shaping, you need to drop back fast. Have v0.21.0 binaries ready. Test the downgrade path now.
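
If you install via the official script, the version can be forced with the OLLAMA_VERSION variable it honors, which makes the downgrade path scriptable. Verify this against your own install method (brew, Docker image tags) before you rely on it.

    # Roll back to the previous release and confirm.
    curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.21.0 sh
    ollama --version   # confirm you are back on the known-good build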

This isn’t a five-minute upgrade. It’s a six-hour audit pass for any team running agents in production.

[[IMG: an engineering lead reviewing a terminal diff of MLX sampling outputs pre- and post-upgrade, comparing token sequences on a dual-monitor setup in a quiet office]]

Looking Ahead

Ollama is tightening its grip on agent execution. Local models for simple tasks. Cloud for complex, long-horizon workflows. That split will widen.

Expect more cloud-only models. Expect tighter integration with managed runtimes. Expect fewer local options for agentic systems.

If you’re running a small team, budget for the network cost. Budget for the audit time. Budget for the rollback plan.

Pin tight. Test under load. Watch output drift.

Upgrade only after validation.

Budget six hours. Cap the rollout to one agent service first. If output integrity drops below 99.5% in the first 24 hours, roll back.