Gemma 4 open model lineup, launches on AutoKaam

OFFICIAL UPDATE · COVER · APR 10, 2026 · ISSUE LEAD

OFFICIAL UPDATE·Apr 10, 2026·8 MIN

Gemma 4 Now Has Four Real Variants, E2B, E4B, 26B A4B MoE, and 31B Dense

Google's current Gemma 4 lineup is Apache 2.0 licensed, multimodal, and built for edge, workstation, and high-memory server deployment. The old 6B and 13B claims are wrong.

ByAditya Sharma·Apr 10, 2026·REVIEWED Jun 25, 2026

OFFICIAL UPDATEAPR 10, 2026 · ADITYA SHARMA

Google and Hugging Face list four Gemma 4 variants: E2B, E4B, 26B A4B MoE, and 31B Dense. E2B and E4B support audio, text, and image inputs. The 26B and 31B models support text and image inputs.

— Google AI and Hugging Face model cards

What AutoKaam Thinks

Gemma 4 is not a 2.3B, 6B, 13B, 31B lineup. The current listed variants are E2B, E4B, 26B A4B MoE, and 31B Dense.
Audio support is not universal. E2B and E4B support audio, text, and image inputs. The 26B A4B MoE and 31B Dense models are text and image models.
The larger Gemma 4 models move into long-context deployment territory, with 128K context on smaller variants and 256K context on larger variants.
Indian teams should pick Gemma 4 for Apache 2.0 licensing and self-hosting control, not because of unverified benchmark claims.

31B

Largest dense model size

GOOGLE + INDIAN STARTUPS

Named stake

Google's Gemma 4 story changed since the first version of this article.

The old variant list, 2.3B, 6B, 13B, and 31B, is wrong. At launch (April 2, 2026) Google released four variants. A fifth , Gemma 4 12B Unified , was added on June 3, 2026 (covered in the June follow-up section below). The launch lineup:

Variant	Architecture	Context window	Inputs	Practical role
Gemma 4 E2B	Edge model	128K	Text, image, audio	On-device and low-cost inference
Gemma 4 E4B	Edge model	128K	Text, image, audio	Stronger edge and laptop deployment
Gemma 4 26B A4B MoE	Mixture of Experts, 26B total with 4B active	256K	Text, image	Server inference with lower active compute per token
Gemma 4 31B Dense	Dense model	256K	Text, image	Highest-capacity Gemma 4 deployment

The correction matters. Audio is not available across the full lineup. The larger models are text and image models, not audio models. The 13B deployment advice in the previous article is also removed because there is no current 13B Gemma 4 variant in the official lineup.

The Four Gemma 4 Variants

Gemma 4 E2B

E2B is the smallest current Gemma 4 model. It is built for edge deployment and supports text, image, and audio inputs.

Use it when the product constraint is local execution, low memory, or offline access. For Indian apps, that means field workflows, education, voice-assisted intake, and support tools where sending every request to a hosted API is not acceptable.

Do not expect it to behave like the 31B model. The value is deployment control, and on a 6 GB consumer card the small end of the lineup is the only tier that fits once you budget for context and a second resident model.

Gemma 4 E4B

E4B is the stronger edge model. It keeps the same 128K context class and supports text, image, and audio inputs.

This is the first model I would test for a laptop or edge-server product before moving to a larger GPU. It gives teams more room than E2B without forcing the full server stack required by the 26B and 31B models.

Gemma 4 26B A4B MoE

The 26B A4B model is the important new middle option.

MoE means Mixture of Experts. The model has 26B total parameters, but only approximately 3.8B are active for a given token (Google's model card lists 3.8B active parameters, marketed under the A4B label). That can reduce compute per generated token compared with a dense model of similar total size.

Do not confuse active parameters with total memory. The runtime still needs to store experts. Weight memory follows total parameters more than active parameters. A 26B A4B MoE can be cheaper per token than a dense 26B model, but it is not a 4B model for deployment planning.

The 26B A4B model supports text and image inputs, not audio.

Gemma 4 31B Dense

The 31B Dense model is the top-capacity Gemma 4 variant.

Dense means every parameter path is active during inference. It is simpler to reason about than MoE, easier to compare with other dense models, and usually cleaner for fine-tuning experiments.

The old claim that 31B full precision needs about 40GB VRAM was too low. Plan around roughly 71GB for BF16 weights, before long-context KV cache and serving overhead. INT4 weight storage is closer to 16GB, but that does not mean every 24GB card can serve 256K context. KV cache can dominate memory once prompts get long.

What Changed From The First Report

The first version of this article overstated three things:

It listed non-current sizes.
It claimed all variants were audio-capable.
It repeated benchmark and pricing claims that were not backed by primary sources.

The corrected view is simpler. Gemma 4 is a strong open model family with Apache 2.0 licensing, a real edge story, a real MoE option, and long-context larger models. Treat any leaderboard or cost claim as secondary until it is tied to a reproducible run.

Why Apache 2.0 Still Matters

Gemma 4's Apache 2.0 license is the main reason Indian startups should care.

Apache 2.0 permits:

Commercial use
Modification
Redistribution
A patent grant
Internal deployment without a model vendor contract

For a startup building in India, that changes the procurement discussion. You can host in your own account, fine-tune where the license allows it, and avoid sending regulated or sensitive customer data to a third-party API by default.

This does not remove all compliance work. Your dataset license, logs, customer contracts, and industry rules still matter. The model license is only one layer.

Architecture, In Plain English

Gemma 4 is not just a larger Gemma release. The family introduces design choices aimed at long context, edge deployment, and lower serving cost.

Hybrid local and global attention

Hybrid attention mixes local attention with global attention.

Local attention keeps compute focused on nearby tokens. Global attention preserves access to broader context. The point is to make long-context inference more practical without forcing every layer to attend across the full prompt in the same way.

For builders, the result is not magic. Long prompts still cost memory and time. But the architecture is built for larger context windows than old short-context open models.

p-RoPE for long context

Gemma 4 uses p-RoPE as part of its long-context design.

The practical takeaway: do not treat 128K or 256K as a free setting. Long-context quality depends on prompt structure, retrieval quality, runtime support, and KV-cache memory. A model can accept a long context and still give poor answers if the important evidence is buried.

Unified KV handling

Gemma 4's serving story depends heavily on KV cache behavior.

KV cache stores attention keys and values so the model does not recompute the whole prompt every token. At 128K and 256K context, KV cache is often the real deployment limit. Weight quantization helps with model weights, but it does not automatically solve KV-cache memory.

Use paged attention where your runtime supports it. Cap context length per route. Do not let every user request hit the maximum context window.

PLE for edge models

The E2B and E4B models are where edge-specific choices matter most. PLE is part of that edge model design.

The operator point is simple: the edge variants are not just smaller names in the table. They are designed for local and low-memory use. Test them with LiteRT-LM or the mobile runtime you plan to ship, not only with a desktop Python script.

26B A4B MoE

The MoE model is the most interesting cost option in the lineup.

A4B means about 4B active parameters per token. That can reduce active compute, but the deployment still needs the full expert set. If you are choosing between 26B A4B and 31B Dense, test both on your actual prompts. MoE wins only if your runtime, batching, and routing behavior suit your traffic.

Context Windows

Variant	Context window	Deployment warning
E2B	128K	Edge deployment rarely has memory for full-context use
E4B	128K	Good for longer local workflows, still needs context caps
26B A4B MoE	256K	KV cache can exceed the weight memory advantage
31B Dense	256K	BF16 serving needs high-memory hardware, INT4 still needs KV planning

Do not design your product around maximum context by default. Most production apps should use retrieval, summarization, and route-level context limits.

A sane setup:

Short support reply: 4K to 16K
Document question answering: 16K to 64K
Legal or audit review: higher context only for selected routes
Agent traces: summarize old steps instead of appending the full trace forever

Practical Deployment

The safest way to deploy Gemma 4 is to start with the official Google repositories on Hugging Face and copy the exact model ID for the variant you are using. Do not assume an old 6B or 13B repo exists.

vLLM

Use vLLM when you want OpenAI-compatible serving and GPU batching.

pip install vllm

For an E2B or E4B test:

vllm serve <GOOGLE_GEMMA_4_REPO_ID> \
  --dtype bfloat16 \
  --max-model-len 131072

For 26B A4B MoE or 31B Dense:

vllm serve <GOOGLE_GEMMA_4_REPO_ID> \
  --dtype bfloat16 \
  --max-model-len 262144

If memory fails, lower --max-model-len first. Then test quantization. Long context is usually the first deployment problem.

Docker with vLLM

docker run --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model <GOOGLE_GEMMA_4_REPO_ID> \
  --dtype bfloat16 \
  --max-model-len 131072

Change 131072 to 262144 only when your GPU memory and runtime path can handle it.

SGLang

SGLang is worth testing for agent workloads and structured generation.

python -m sglang.launch_server \
  --model-path <GOOGLE_GEMMA_4_REPO_ID> \
  --context-length 131072

For the larger models:

python -m sglang.launch_server \
  --model-path <GOOGLE_GEMMA_4_REPO_ID> \
  --context-length 262144

Again, use the exact model repository from Google's Hugging Face collection.

LiteRT-LM and edge deployment

Use LiteRT-LM or Google's current mobile path for E2B and E4B edge work.

Do not start with the 31B model and then search for a mobile story. Start with E2B, measure latency, battery impact, memory use, and quality. Move to E4B only if E2B fails the task.

VRAM Planning

Model	Weight memory note	Deployment note
E2B	Depends on quantization and runtime	Best first test for local use
E4B	Depends on quantization and runtime	Better quality than E2B, still edge-oriented
26B A4B MoE	Stores experts, not just active 4B	Test runtime support before committing
31B Dense	BF16 around 71GB, INT4 around 16GB	KV cache and serving overhead come on top

For the 31B model, a 24GB GPU may be enough for some quantized short-context tests. It is not a promise for 256K production serving.

The common failure mode is simple: the model loads, the first short prompt works, then a long customer document crashes the server. Cap context length before launch.

System Role, Function Calling, and Agents

Do not assume Gemma 4 behaves like an OpenAI API drop-in.

For agent products, test these separately:

Chat template behavior
System instruction handling
Tool or function-calling format
JSON reliability
Multi-step planning
Long trace summarization
Refusal behavior
Recovery after tool errors

If your app needs function calling, wrap the model with strict schemas, validation, and retry logic. Do not rely on a free-form answer to call payment, CRM, ticketing, or database tools.

For thinking-style prompts, measure quality and latency. Longer hidden or explicit reasoning traces can raise token cost and expose sensitive intermediate text in logs. Keep traces out of customer-visible output unless the product needs them.

India-Relevant Use Cases

Gemma 4 is useful for India, but not because of unverified leaderboard claims.

The clean use cases are:

Customer support summarization where data must stay in your cloud account
Document question answering for contracts, policies, invoices, and tickets
Internal coding assistants where source code cannot leave the VPC
Education apps that need offline or low-connectivity support
Field applications where E2B or E4B can run locally
Vision workflows where text and image input are both needed

For Indian language depth, still compare against Indian-focused models from Sarvam, Krutrim, and Bhashini-linked projects. Do not assume an English-strong open model is best for every Indian language workflow.

AutoKaam has not published a measured Gemma 4 benchmark against Sarvam, Krutrim, or Bhashini models. Until we run that test, this article will not rank them.

Gemma 4 vs Other Options

Choice	When it makes sense	Main risk
Gemma 4 E2B or E4B	Edge, offline, low-cost local inference	Quality ceiling
Gemma 4 26B A4B MoE	Server inference with lower active compute	Runtime and memory planning
Gemma 4 31B Dense	Highest Gemma 4 quality target	GPU memory and KV cache
Sarvam models	Indian language first products	Compare license, deployment, and task fit
Closed APIs	Best quality with no infra work	Data control, recurring cost, vendor dependence
Smaller local models	Simple workflows and fast latency	Lower reasoning quality

The decision is not ideological. Run the cheapest model that clears your quality bar, privacy bar, and latency bar.

June Follow-Up: 12B Unified and DiffusionGemma

Google released a fifth Gemma 4 variant on June 3, 2026: Gemma 4 12B Unified. It adds a mid-sized option between E4B and the 26B A4B MoE, with 11.95B parameters, a 256K context window, and support for text, image, and audio inputs in a single encoder-free pass. It ships under the same Apache 2.0 license and requires roughly 16GB VRAM in practice. If you are picking a server-side model and the 26B MoE is too heavy, the 12B Unified is now the natural middle step to evaluate.

Google also released DiffusionGemma in June. Treat it as a separate line, not a replacement for Gemma 4 chat and multimodal deployment. If your product is built around standard autoregressive LLM serving, keep Gemma 4 in the evaluation set. If you are testing diffusion-style generation for language tasks, evaluate DiffusionGemma on its own terms.

Getting Started

Gemma 4 models are available through official Google and Hugging Face pages.

Start here:

Hugging Face Google collection: https://huggingface.co/google
Google Gemma developer pages: https://ai.google.dev/gemma
Managed deployment paths where your cloud provider supports the exact model
Local and edge runtimes for E2B and E4B

Before you deploy, write down:

Exact model repo ID
License attached to that repo
Runtime version
Quantization method
Maximum context length
GPU type
Prompt template
Tool-calling format, if any
Logging policy
Rollback model

That list prevents most production mistakes.

FAQ

Is Gemma 4 still Apache 2.0 licensed?
Yes, the current article treats Gemma 4 as Apache 2.0 licensed based on official listings. Still check the exact repository you deploy because fine-tuned or third-party variants can carry different terms.

Are all Gemma 4 models audio-capable?
No. E2B and E4B support audio, text, and image inputs. The 26B A4B MoE and 31B Dense variants support text and image inputs.

Does Gemma 4 have a 13B model?
No. There is no 13B variant. The 13B deployment section from the old article has been removed. The closest mid-sized option is Gemma 4 12B Unified, released June 3, 2026.

Can the 31B Dense model run on a 24GB GPU?
A quantized short-context test may fit. BF16 weight memory is around 71GB, and INT4 weight storage is around 16GB before KV cache and serving overhead. Do not assume 256K context will fit on a 24GB card.

Which Gemma 4 model should an Indian startup test first?
For edge apps, start with E2B. For stronger local quality, test E4B. For server-side apps, compare 26B A4B MoE and 31B Dense on your actual prompts.

Should Indian language products use Gemma 4 by default?
No. Test against Indian-focused models from Sarvam, Krutrim, and Bhashini-linked work. Gemma 4 is attractive for licensing and self-hosting, but language fit must be measured.

Explore more open-source options in our Code & Development AI tools category.

Sources: Google AI Gemma pages (ai.google.dev/gemma), Google DeepMind model card (deepmind.google), and Hugging Face Google model cards. Launch lineup (April 2, 2026): E2B, E4B, 26B A4B MoE, 31B Dense. Gemma 4 12B Unified added June 3, 2026. Checked 2026-06-25.

Topics

#Gemma #Google #Open Source #LLM

Adjacent