SHEET · COVER · ISSUE LEAD · APR 28, 2026 · 7 MIN

vLLM Absorbs Gemma 4, Bleeds Alternative Inference Runtimes

Same open-source engine, new multimodal demands force a recompile — and a rethinking of dependency debt.

James Okafor

This release features 448 commits from 197 contributors (54 new)! Gemma 4 support: Full Google Gemma 4 architecture support including MoE, multimodal, reasoning, and tool-use capabilities. Zero-bubble async scheduling now supports speculative decoding with zero-bubble overlap, significantly improving throughput.

vLLM Project

What AutoKaam Thinks
  • The zero-bubble speculative decode isn’t just faster — it upends the cost curve for draft-model hosting, making small draft instances viable at scale.
  • Gemma 4’s multimodal + MoE + tool-use stack forces vLLM adopters to re-evaluate their KV cache policies and offload strategies.
  • For teams running pinned vLLM versions, the transformers>=5.5.0 requirement means a cascade of lockfile upgrades — or a fork.
  • Alternative inference engines like TGI or MLC just got priced out of the mid-tier MoE + multimodal workload pool.
448 commits · Release scale
vLLM vs alternative inference runtimes · Named stake

The open-source inference runtime category is consolidating around two technical axes: speculative efficiency and multimodal flexibility. This week’s vLLM v0.19.0 release confirms both. The project didn’t just add Gemma 4 support. It rewired the scheduling core to eliminate bubble overhead in speculative decoding, a move that shifts the unit economics of draft-model hosting. For engineering leads at SMBs and mid-tier AI shops, this isn’t a version bump. It’s a structural prompt to reassess their inference stack’s cost floor.

What Shipped

vLLM v0.19.0 delivers 448 commits from 197 contributors, 54 of them first-time contributors. The headline features are Gemma 4 support and zero-bubble async scheduling with speculative decoding. Gemma 4 integration covers MoE, multimodal inputs, reasoning, and tool-use capabilities, and it requires Hugging Face transformers>=5.5.0. The project recommends the pre-built Docker image vllm/vllm-openai:gemma4 for immediate deployment.
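If you want a quick sanity check before wiring Gemma 4 into a serving stack, a minimal offline sketch looks like the following. The checkpoint name google/gemma-4-12b-it is a placeholder (the release notes don’t name one), and the LLM/SamplingParams calls are vLLM’s standard offline API rather than anything specific to v0.19.0.

```python
# Minimal Gemma 4 smoke test via vLLM's offline API.
# "google/gemma-4-12b-it" is a placeholder checkpoint name; substitute the
# model you actually deploy.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-12b-it",   # hypothetical Gemma 4 checkpoint ID
    tensor_parallel_size=1,
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(
    ["Summarize the vLLM v0.19.0 release notes in one sentence."], params
)
print(outputs[0].outputs[0].text)
```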

The speculative decoding overhaul eliminates scheduling bubbles during draft-model execution. This means the verifier model no longer idles between draft token generations, a key throughput limiter in prior implementations. The change is paired with Model Runner V2 maturation: piecewise CUDA graphs for pipeline parallelism, spec decode rejection sampling with greedy/logprobs, multi-modal embeddings for spec decode, and streaming inputs.
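A rough sketch of how a draft/verifier pairing might be configured is below. It assumes the speculative_config keyword accepted by recent vLLM releases; the exact field names in v0.19.0 may differ, and both model IDs are placeholders.

```python
# Draft/verifier pairing under speculative decoding, assuming the
# speculative_config keyword from recent vLLM releases. Model IDs are
# placeholders; field names may differ in v0.19.0.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-12b-it",          # verifier (placeholder ID)
    speculative_config={
        "model": "google/gemma-4-2b-it",    # small draft model (placeholder ID)
        "num_speculative_tokens": 4,        # draft tokens proposed per step
    },
)

out = llm.generate(
    ["Explain zero-bubble scheduling in two sentences."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(out[0].outputs[0].text)
```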

Other critical updates include full CUDA graph support for Vision Transformers (ViT), general CPU KV cache offloading with pluggable policies, and DBO (Dual-Batch Overlap) support generalized to all model types. Hardware support expands to NVIDIA B300/GB300 (SM 10.3) with allreduce fusion enabled by default, plus Blackwell-optimized FP8 GEMM kernels.

New model support includes Gemma 4, Cohere ASR and Transcribe, ColQwen3.5, and Granite 4.0 1B Speech. Speculative decoding now works with Eagle3 for Pixtral and EagleMistralLarge3. LoRA expansion allows module-specific targeting and better out-of-tree operation handling.

[[IMG: an AI infrastructure engineer reviewing vLLM deployment logs on a dual-monitor setup, terminal windows showing CUDA graph metrics and KV cache usage]]

Why It Matters

vLLM isn’t just keeping pace with Google’s Gemma 4 rollout. It’s defining the infrastructure layer that makes Gemma 4 economically viable outside Google’s own cloud. The zero-bubble speculative decoding isn’t a marginal gain; it’s a structural shift. Prior speculative implementations lost 15-30% of GPU cycles to scheduling gaps. Eliminating that bubble means a 20-40% throughput lift on the same hardware, depending on draft/verifier alignment.
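The arithmetic behind that range is easy to verify. Treating the quoted bubble fractions as illustrative inputs, removing a bubble that eats a fraction b of cycles lifts throughput by 1/(1-b) - 1:

```python
# Back-of-the-envelope check: if a fraction b of GPU cycles was previously
# lost to scheduling bubbles, removing it lifts throughput by 1/(1-b) - 1.
for bubble in (0.15, 0.30):
    lift = 1.0 / (1.0 - bubble) - 1.0
    print(f"bubble {bubble:.0%} -> throughput lift ~{lift:.0%}")
# bubble 15% -> throughput lift ~18%
# bubble 30% -> throughput lift ~43%
```

That check lands at roughly 18-43%, which brackets the 20-40% figure; draft/verifier alignment decides where a given deployment falls.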

That changes the math for draft-model hosting. Previously, running a small draft model on a low-cost instance often didn’t justify the network overhead and coordination latency. Now, with zero bubble, even a T4 or L4 instance can serve as an efficient draft worker. This undercuts the case for monolithic inference on high-end GPUs, a model that alternative runtimes like Text Generation Inference (TGI) still rely on.

The structural bear case for other open-source inference engines just got heavier. TGI, MLC, and similar projects lack deep speculative scheduling hooks. They can’t match vLLM’s throughput on MoE or multimodal stacks. For teams running Qwen3, Pixtral, or now Gemma 4, that means vendor lock-in to vLLM, not by choice, but by performance necessity.

Gemma 4’s feature set (MoE, multimodal, tool use) is exactly the kind of workload that stresses KV cache and memory bandwidth. vLLM’s new general CPU KV offloading with pluggable policies is a direct response. It allows teams to trade CPU overhead for GPU memory, using strategies like LRU or priority-based eviction. This isn’t just optimization; it’s a new lever for cost control. For a 50-person AI shop running multimodal agents, that could mean delaying a $50k GPU upgrade for six months.
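To make the “pluggable policy” idea concrete, here is a purely schematic sketch of the bookkeeping an LRU offload policy performs. This is not vLLM’s plugin interface, which the release notes don’t spell out; it only illustrates the eviction decision such a policy has to make.

```python
# Schematic only: NOT vLLM's plugin API. Illustrates the recency bookkeeping
# and victim selection an LRU offload policy performs.
from collections import OrderedDict
from typing import Optional

class LRUOffloadPolicy:
    """Track KV-cache blocks resident on GPU; pick the least recently used to offload."""

    def __init__(self, max_gpu_blocks: int):
        self.max_gpu_blocks = max_gpu_blocks
        self._blocks: OrderedDict[int, None] = OrderedDict()  # block_id, ordered by recency

    def touch(self, block_id: int) -> None:
        # Mark a block as recently used (move it to the most-recent end).
        self._blocks.pop(block_id, None)
        self._blocks[block_id] = None

    def select_victim(self) -> Optional[int]:
        # When the GPU pool is over budget, evict the least recently used block to CPU.
        if len(self._blocks) <= self.max_gpu_blocks:
            return None
        victim, _ = self._blocks.popitem(last=False)
        return victim
```

A priority-based policy would differ only in the victim-selection step, ranking blocks by request priority instead of recency.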

The transformers>=5.5.0 requirement is another quiet forcing function. Teams on older transformers versions face a cascade: upgrade vLLM, upgrade transformers, test all dependent models, revalidate quantization pipelines. For some, the path of least resistance is forking vLLM, a short-term win that becomes technical debt. The alternative is a full stack audit, which takes days but avoids divergence.

vLLM’s momentum (197 contributors, 54 of them new) signals a widening moat. The project isn’t just maintained; it’s becoming the default substrate for open-weight model deployment. That concentration risks ecosystem fragility, but the performance gains are too large to ignore.

What to Migrate

If you’re upgrading to vLLM v0.19.0, treat it as a platform shift, not a patch. The zero-bubble speculative decoding and Gemma 4 support demand a full reevaluation of your deployment topology, model serving patterns, and dependency contracts.

Start with the Docker image. Use vllm/vllm-openai:gemma4; it includes the correct CUDA, transformers, and kernel versions. Rolling your own base image adds 2-3 days of debugging around Triton autotuning, FP8 kernels, and CUDA graph capture.
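Once the container is up, a one-minute smoke test against its OpenAI-compatible endpoint is cheap insurance. This sketch assumes the image exposes vLLM’s standard server on localhost:8000; the served model name depends on how you launch the container, so it reads the name back rather than hard-coding it.

```python
# Smoke test against the vllm/vllm-openai:gemma4 container, assuming the
# standard OpenAI-compatible server is listening on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The served model name depends on how the container was started; list it first.
model_id = client.models.list().data[0].id

resp = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)
```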

Next, validate your transformers version. You must run >=5.5.0. If you’re on 5.4.x or earlier, test all models for breaking changes. Pay special attention to quantized models, multimodal inputs, and LoRA adapters. The transformers v5 compatibility fixes in vLLM cover many edge cases, but not all.
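A small pre-flight check in CI keeps that floor from slipping silently; it uses only the standard importlib.metadata and packaging libraries.

```python
# Fail fast if the installed transformers version is below the floor the
# v0.19.0 release notes require for Gemma 4 support.
from importlib.metadata import version
from packaging.version import Version

tf_version = Version(version("transformers"))
assert tf_version >= Version("5.5.0"), (
    f"vLLM v0.19.0 Gemma 4 support requires transformers>=5.5.0, found {tf_version}"
)
print(f"transformers {tf_version} OK")
```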

For speculative decoding, benchmark draft-model throughput under zero-bubble scheduling. Use a small draft model (e.g., Gemma 2B) and a larger verifier (e.g., Gemma 12B). Measure tokens/sec and cost per thousand tokens. You’ll likely find that smaller GPU instances (T4, L4) now deliver acceptable throughput, a change from prior versions. This could let you rebalance your fleet toward cheaper instances.
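A minimal benchmark harness might look like the sketch below. The model IDs, the speculative_config field names, and the hourly instance price are all placeholders; swap in your own pairing and pricing.

```python
# Rough tokens/sec and cost-per-1K-tokens measurement for a draft/verifier
# pair. Model IDs, speculative_config fields, and pricing are placeholders.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-12b-it",
    speculative_config={"model": "google/gemma-4-2b-it", "num_speculative_tokens": 4},
)

prompts = ["Describe the city of Lagos."] * 64
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
tok_per_sec = generated / elapsed

hourly_price = 0.80  # placeholder $/hr for the instance under test
cost_per_1k = hourly_price / 3600 / tok_per_sec * 1000
print(f"{tok_per_sec:,.0f} tok/s, ~${cost_per_1k:.4f} per 1K generated tokens")
```

Run the same harness on your current high-end instance and on a T4 or L4 candidate; the cost-per-1K-tokens comparison is what decides whether a fleet rebalance pays off.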

Enable CPU KV cache offloading with a pluggable policy. Start with LRU, then test priority-based eviction if you have mixed workloads. Monitor swap-in latency; it should stay under 10ms for 95% of requests. If not, increase GPU memory or reduce batch size.

Pin vLLM and transformers together; a mismatched pair breaks spec decode and multimodal inputs.

For multimodal workloads, test ViT CUDA graph capture. Load a vision encoder and run 100 image inputs. Check for graph capture failures or memory spikes. Use the --enforce-eager flag if needed, but expect a 10-15% throughput hit.
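A rough capture test is sketched below using vLLM’s standard multimodal prompt dicts. The Gemma 4 checkpoint and its image-prompt template are placeholders; check the model card for the real template, and flip enforce_eager to True as the fallback if graph capture fails.

```python
# Sanity-check ViT CUDA-graph capture with a batch of image inputs.
# Checkpoint and "<image>" prompt template are placeholders; consult the
# model card for the real template. Set enforce_eager=True to fall back if
# capture fails (expect the ~10-15% throughput hit noted above).
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-4-12b-it", enforce_eager=False)

image = Image.open("sample.jpg")
requests = [
    {
        "prompt": "<image>\nDescribe this image.",  # placeholder template
        "multi_modal_data": {"image": image},
    }
    for _ in range(100)
]

outputs = llm.generate(requests, SamplingParams(max_tokens=64))
print(f"completed {len(outputs)} multimodal requests without graph-capture failures")
```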

If you’re using LoRA, validate that --lora-target-modules works with your adapter configs. Test greedy and logprobs sampling with rejection sampling enabled. Some older LoRA weights may need recompilation.
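A quick adapter check, with the base model and adapter path as placeholders for your own artifacts; it exercises greedy sampling with logprobs, as suggested above.

```python
# Verify an existing LoRA adapter still loads and samples under the upgrade.
# Base model ID and adapter path are placeholders for your own artifacts.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="google/gemma-4-12b-it", enable_lora=True)

lora = LoRARequest("my-adapter", 1, "/models/adapters/my-adapter")
params = SamplingParams(temperature=0.0, logprobs=5, max_tokens=32)

out = llm.generate(["Classify: 'refund not received'"], params, lora_request=lora)
print(out[0].outputs[0].text)
```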

Finally, audit all models for FP8 or FP64 edge cases. The release fixes NaN issues on Blackwell and FP8 scale inconsistencies, but custom models may still trigger them. Run a 24-hour stress test with mixed inputs before going live.

[[IMG: an AI engineering lead conducting a code review of vLLM upgrade scripts, with a whiteboard in the background listing migration steps and risk areas]]

Looking Ahead

The next 12 to 18 months will see vLLM become the de facto standard for open-weight inference, especially for MoE and multimodal models. The zero-bubble speculative decoding sets a new throughput baseline that alternatives can’t match without deep rewrites. We’ll likely see TGI or MLC attempt to bolt on similar features, but vLLM’s head start in CUDA graph maturity and community contributions is too large.

Watch the LoRA ecosystem. With vLLM now supporting module-specific targeting and better out-of-tree ops, fine-tuning will shift toward smaller, more targeted adapters. That could reduce the cost of personalization for SMBs, a positive, but it also increases dependency on vLLM’s internal APIs.

For operators, the message is clear: treat vLLM as critical infrastructure. Pin versions tightly. Audit upgrades as production deploys. The lockfile isn’t just a dependency list; it’s a cost and performance contract.