Running Gemma 4 Locally With Ollama — Setup Guide For Indian Devs
Google's 31B open-source model on your MacBook or PC — free, private, unlimited
Gemma 4 (Google's 31B open-source model) has taken the #3 spot on open-source leaderboards. The biggest surprise — it can run on your laptop locally, with zero API cost, private data, and unlimited usage.
Hardware Requirements
Model size vs minimum RAM:
| Gemma 4 variant | Parameters | VRAM / RAM needed | Who can run |
|---|---|---|---|
| Gemma 4 2B | 2B | 4 GB | Any laptop (CPU or entry GPU) |
| Gemma 4 9B | 9B | 12 GB | M1 Pro, RTX 3060+ |
| Gemma 4 31B | 31B | 24-48 GB | M2 Max, RTX 4090, workstation |
Reality check for the Indian market:
- MacBook Air M2 (16GB): comfortable at 2B, 9B with tricks
- MacBook Pro M2/M3 (32GB+): 9B smooth, 31B possible (quantized)
- Gaming PC (RTX 4070+): 9B smooth, 31B with quantization
If you want 31B cheaply, rent a GPU on Runpod / Vast.ai for Rs 50-100/hour.
Install Ollama (One-Liner)
Mac:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows:
Download from ollama.com
Verify:
ollama --version
Download Gemma 4
# Start with 2B (smallest, fastest)
ollama pull gemma4:2b
# Or 9B (better quality, needs 12GB RAM)
ollama pull gemma4:9b
# Or 31B (full power, needs 24GB+)
ollama pull gemma4:31b
# Quantized versions (smaller, slightly lower quality)
ollama pull gemma4:9b-q4 # 4-bit, ~5GB RAM
ollama pull gemma4:31b-q4 # 4-bit, ~17GB RAM
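Those quantized footprints follow from simple arithmetic: parameter count times bits per weight, divided by 8, plus headroom for the KV cache and runtime. A back-of-envelope sketch (the 1.5 GB overhead allowance is my own guess, not an Ollama figure):

```python
def approx_ram_gb(params_billions: float, bits_per_weight: int,
                  overhead_gb: float = 1.5) -> float:
    """Rough RAM estimate: weight bytes plus a fixed KV-cache/runtime allowance."""
    weights_gb = params_billions * bits_per_weight / 8  # 1B params at 8-bit = 1 GB
    return round(weights_gb + overhead_gb, 1)

print(approx_ram_gb(9, 4))   # 6.0 — in the ballpark of the ~5 GB above
print(approx_ram_gb(31, 4))  # 17.0 — matches the 31b-q4 figure
```

The same arithmetic explains the hardware table: 9B at full 16-bit precision is ~18 GB of weights alone, which is why 12 GB machines need quantized builds.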
First Run
ollama run gemma4:9b
# Prompt:
> Explain in Hindi: what is cryptocurrency?
Gemma 4 handles Hindi surprisingly well — not native fluency like Sarvam, but understandable.
API Mode (For Integration)
Ollama runs a local HTTP server with an OpenAI-compatible interface:
# Start the server (runs in the foreground; append & or use a service to background it)
ollama serve
Python client:
from openai import OpenAI

client = OpenAI(
    api_key="ollama",  # dummy key (Ollama ignores it)
    base_url="http://localhost:11434/v1"
)

response = client.chat.completions.create(
    model="gemma4:9b",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
Existing OpenAI SDK code — or anything else that speaks the OpenAI wire format — just needs a base_url swap. Zero other changes.
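One way to keep that swap in a single place is a small config helper, so the rest of your code never knows which backend it is talking to. A sketch (the `client_config` helper and the "cloud" branch are my own illustration, not part of Ollama):

```python
import os

def client_config(backend: str = "local") -> dict:
    """Return the only two settings that differ between backends.

    Usage: OpenAI(**client_config("local")) — everything else stays identical.
    """
    if backend == "local":
        # Ollama's OpenAI-compatible endpoint; the key is a placeholder
        return {"base_url": "http://localhost:11434/v1", "api_key": "ollama"}
    # Cloud: real key from the environment, provider's default endpoint
    return {"base_url": "https://api.openai.com/v1",
            "api_key": os.environ.get("OPENAI_API_KEY", "")}
```

Flipping one argument (or one env var wired to it) then moves an entire app between local and cloud.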
Where Local Beats Cloud
1. Data Privacy
- Legal documents, medical records, client PII — never leave your laptop
- Easier DPDP Act compliance for sensitive data
2. Unlimited Volume
- Running 10,000 classifications daily? Cloud bills balloon. Local is free.
- Perfect if you run a content factory — blog generation, product descriptions, bulk tasks
3. Offline / Low-Connectivity
- Tier-3 cities with patchy internet — local has zero dependency
- Train journeys, flights
4. Cost Predictability
- Monthly bill: Rs 0 (only electricity ~Rs 200/mo for heavy use)
- No surprise charges
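To put numbers on the "bills balloon" claim, here is a toy comparison with assumed figures — the 500 tokens per call and the Rs 1.7 per 1,000 tokens rate are illustrative, not any provider's real pricing:

```python
def monthly_cloud_cost_rs(calls_per_day: int, tokens_per_call: int,
                          rs_per_1k_tokens: float) -> float:
    """Cloud spend for a month (30 days) of bulk classification."""
    tokens_per_month = calls_per_day * tokens_per_call * 30
    return tokens_per_month / 1000 * rs_per_1k_tokens

# 10,000 classifications/day at ~500 tokens each, assumed Rs 1.7 per 1k tokens
cloud = monthly_cloud_cost_rs(10_000, 500, 1.7)  # ~Rs 2.5 lakh/mo at these assumptions
local = 200  # the electricity estimate from above
print(f"cloud = Rs {cloud:,.0f}/mo vs local = Rs {local}/mo")
```

Even if the assumed rate is off by 10x, the gap at this volume stays enormous.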
Where Cloud Still Wins
- Complex reasoning chains (Gemma 4 9B << Claude Opus 4.6)
- Long context (Gemma 4 has 128k; Opus has 1M)
- Tool use / agents
- Frontier-quality writing
Performance Benchmarks (My M2 MacBook Pro 32GB)
| Model | Tokens/sec | First-token latency |
|---|---|---|
| Gemma 4 2B | 120 tok/s | ~150ms |
| Gemma 4 9B | 38 tok/s | ~400ms |
| Gemma 4 9B-q4 | 55 tok/s | ~300ms |
9B is the sweet spot — quality closer to Claude Sonnet, speed usable.
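You can reproduce these numbers yourself: Ollama's native API responses report token counts and durations (`eval_count`, and `eval_duration` in nanoseconds), so tok/s is just one division. A small sketch of the calculation:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """tok/s from Ollama's response fields (eval_duration is in nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)

# e.g. 380 tokens generated in 10 seconds -> 38.0 tok/s, the 9B figure above
print(round(tokens_per_second(380, 10_000_000_000), 1))
```

Run a few prompts of realistic length; short prompts overweight first-token latency and understate steady-state throughput.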
Use Case — A Local RAG System
Privacy-sensitive knowledge base RAG:
# Install local deps
# pip install chromadb langchain langchain-ollama
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(persist_directory="./chroma", embedding_function=embeddings)
# Ingest documents (run once)
# for pdf in pdfs: vectorstore.add_documents(chunks)
# Query
query = "What does clause 7 say about termination?"  # example question
llm = ChatOllama(model="gemma4:9b", temperature=0)
relevant = vectorstore.similarity_search(query, k=3)
context = "\n\n".join(d.page_content for d in relevant)
response = llm.invoke(f"Context:\n{context}\n\nQuestion: {query}")
100% local, zero API calls. Perfect for law firms, doctors, CAs working with client data.
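The ingest step above glosses over chunking. A minimal word-window chunker with overlap, so an answer that spans a boundary still lands whole in at least one chunk (the 200/40 sizes are arbitrary starting points; tune them for your documents):

```python
def chunk_text(text: str, chunk_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows for embedding."""
    words = text.split()
    step = chunk_words - overlap  # advance less than a full window each time
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_words])
        if chunk:
            chunks.append(chunk)
    return chunks

# chunks = chunk_text(pdf_text); then vectorstore.add_texts(chunks)
```

For legal or medical PDFs, splitting on section headings before word-windowing usually retrieves better than raw windows alone.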
Fine-Tuning Locally
Gemma 4 can be fine-tuned locally, though full fine-tuning needs far more RAM. The LoRA approach uses roughly 10x less memory:
# Install Unsloth (makes Gemma training 2x faster)
pip install unsloth
# Fine-tune script (simplified):
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained("unsloth/gemma-4-9b-it")
# ... standard LoRA training with your data
~4-6 hours on a single 4090 for a domain-specific fine-tune.
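The memory saving comes from LoRA training only small low-rank adapter matrices while the base weights stay frozen. Rough arithmetic for one projection matrix (the 4096 dimension and rank 16 are illustrative, not Gemma's actual config):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable params for one LoRA-adapted weight: A (d_in x r) + B (r x d_out)."""
    return d_in * rank + rank * d_out

full = 4096 * 4096                  # one full-rank projection matrix
lora = lora_params(4096, 4096, 16)  # rank-16 adapter for the same matrix
print(full // lora)                 # 128 — the adapter is ~128x smaller
```

The end-to-end saving is smaller than that per-matrix ratio, since activations, optimizer state for the adapters, and the frozen base weights still occupy memory — hence the ~10x overall figure.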
Privacy + Compliance Notes
- Gemma 4 license: Google's custom terms (commercial use allowed with some restrictions — read the license)
- Your data stays local — no telemetry by default
- GDPR / DPDP friendly for Indian enterprises
Alternative Models Worth Trying
Via ollama pull:
- llama3.3:70b — Meta's latest
- qwen2.5:72b — Alibaba; strong in Hindi
- deepseek-r1:8b — reasoning specialist
- mistral-large:123b — European option
Switching is a one-command operation.
Bottom Line
Local AI is practical for real use cases in 2026. If privacy matters, bills are ballooning, or you need bulk throughput — Gemma 4 on Ollama is a 30-minute setup that saves thousands down the line.
See the Gemma 4 launch news for broader context.