Running Gemma 4 Locally With Ollama, Setup Guide For Indian Devs
Google's 31B open-source model on your Macbook or PC, free, private, unlimited

Gemma 4 (Google's 31B open-source model) has taken the #3 spot on open-source leaderboards. The biggest surprise, it can run on your laptop locally, with zero API cost, private data, and unlimited usage.
Hardware Requirements
Model size vs minimum RAM:
| Gemma 4 variant | Parameters | VRAM / RAM needed | Who can run |
|---|---|---|---|
| Gemma 4 2B | 2B | 4 GB | Any laptop (CPU or entry GPU) |
| Gemma 4 9B | 9B | 12 GB | M1 Pro, RTX 3060+ |
| Gemma 4 31B | 31B | 24-48 GB | M2 Max, RTX 4090, workstation |
Reality check for the Indian market:
- Macbook Air M2 (16GB): comfortable at 2B, 9B with tricks
- Macbook Pro M2/M3 (32GB+): 9B smooth, 31B possible (quantized)
- Gaming PC (RTX 4070+): 9B smooth, 31B with quantization
If you want 31B cheaply, rent a GPU on Runpod / Vast.ai for Rs 50-100/hour.
Install Ollama (One-Liner)
Mac:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows:
Download from ollama.com
Verify:
ollama --version
Download Gemma 4
# Start with 2B (smallest, fastest)
ollama pull gemma4:2b
# Or 9B (better quality, needs 12GB RAM)
ollama pull gemma4:9b
# Or 31B (full power, needs 24GB+)
ollama pull gemma4:31b
# Quantized versions (smaller, slightly lower quality)
ollama pull gemma4:9b-q4 # 4-bit, ~5GB RAM
ollama pull gemma4:31b-q4 # 4-bit, ~17GB RAM
Bandwidth-First Install Path
JIO Air Fiber and most tier-2 broadband come with daily caps or tiered speeds after a quota. Ollama model pulls are not small: 2B is around 1.5 GB, 9B is 5 to 6 GB, 31B can hit 18 GB. A snapped pull at 80% wastes the data.
Four rules that save the run:
(a) Schedule for unmetered hours. JIO Air Fiber has a 2 AM to 8 AM unmetered window on most plans. Airtel Xstream is similar. Start the pull at midnight, sleep, wake to a finished model.
(b) Use tmux or screen. SSH disconnects, Wi-Fi drops, laptop sleep, all kill foreground pulls. Wrap the command:
tmux new -s ollama
ollama pull gemma4:9b
# Ctrl+B then D to detach
# tmux attach -t ollama # to resume later
(c) Verify the manifest. After the pull, check the integrity:
ls -la ~/.ollama/models/manifests/registry.ollama.ai/library/gemma4/
cat ~/.ollama/models/manifests/registry.ollama.ai/library/gemma4/9b
Each blob has a SHA256 hash. If ollama run fails with "invalid digest" the pull was truncated, just re-run ollama pull.
(d) Start small on slow lines. On a 5 Mbps line a 9B pull takes 3 hours. Pull 2B first (1.5 GB, around 40 minutes), confirm the workflow you want runs at acceptable quality, then upgrade. Half the time you find 2B is enough for what you actually need.
For BSNL or government broadband on flaky days, prefer the quantized variants:
ollama pull gemma4:9b-q4 # ~5 GB, often less noisy on packet loss
First Run
ollama run gemma4:9b
# Prompt:
> Explain in Hindi: what is cryptocurrency?
Gemma 4 handles Hindi surprisingly well, not native fluency like Sarvam, but understandable.
API Mode (For Integration)
Ollama runs a local HTTP server with an OpenAI-compatible interface:
# Serve in background
ollama serve
Python client:
from openai import OpenAI
client = OpenAI(
api_key="ollama", # dummy key
base_url="http://localhost:11434/v1"
)
response = client.chat.completions.create(
model="gemma4:9b",
messages=[{"role": "user", "content": "Hello"}]
)
Existing OpenAI/Claude SDK code just needs a base_url swap. Zero other changes.
Hosting It For Your Team Over Tailscale
A 4-person Indian SaaS team running cloud Sonnet on each laptop burns Rs 8,000 to 15,000 per month on subscriptions. Local Gemma on a single shared machine cuts that to electricity and a Tailscale free tier.
The recipe:
1. Install Ollama on the strongest machine. That is whoever has the biggest VRAM, usually the founder-engineer with the gaming laptop.
2. Bind Ollama to all interfaces.
sudo systemctl edit ollama
# Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama
3. Install Tailscale on the host and every client.
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
Note the Tailscale IP of the host (tailscale status), something like 100.64.0.5.
4. Block the public port. Tailscale exposure is enough, do not open 11434 on the public IP.
sudo ufw allow in on tailscale0 to any port 11434
sudo ufw deny in on eth0 to any port 11434
5. Clients point at the host.
# On every other team member's machine
export OLLAMA_HOST=http://100.64.0.5:11434
ollama list # should show models the host has
Single download, single GPU, four happy developers. Free tier of Tailscale handles up to 100 devices.
For an office-only setup without Tailscale, an internal LAN IP works, just be honest about which network you trust.
Where Local Beats Cloud
1. Data Privacy
- Legal documents, medical records, client PII, never leave your laptop
- Easier DPDP Act compliance for sensitive data
2. Unlimited Volume
- Running 10,000 classifications daily? Cloud bills balloon. Local is free.
- Perfect if you run a content factory, blog generation, product descriptions, bulk tasks
- For the cloud calls you cannot move local, memoize repeated LLM calls so identical inputs replay at zero tokens instead of re-billing
3. Offline / Low-Connectivity
- Tier-3 cities with patchy internet, local has zero dependency
- Train journeys, flights
4. Cost Predictability
- Monthly bill: Rs 0 (only electricity ~Rs 200/mo for heavy use)
- No surprise charges
Where Cloud Still Wins
- Complex reasoning chains (Gemma 4 9B << Claude Opus 4.6)
- Long context (Gemma 4 has 128k; Opus has 1M)
- Tool use / agents
- Frontier-quality writing
Performance Benchmarks (My M2 Macbook Pro 32GB)
| Model | Tokens/sec | First-token latency |
|---|---|---|
| Gemma 4 2B | 120 tok/s | ~150ms |
| Gemma 4 9B | 38 tok/s | ~400ms |
| Gemma 4 9B-q4 | 55 tok/s | ~300ms |
9B is the sweet spot, quality closer to Claude Sonnet, speed usable.
VRAM And RAM Tuning, Real Constraints
The hardware table earlier is the headline. The actual limits depend on quantization and context window size.
For an 8 GB RAM laptop (the entry-level Indian dev machine), the practical config is gemma4:2b at full precision or gemma4:9b-q4 at 4-bit. Both leave 2 to 3 GB headroom for Chrome and your IDE. Anything bigger swaps to disk and tokens drop to single digits.
For a Macbook Air M2 8 GB, the same numbers apply, except the unified memory means the GPU does not need its own slice. Real-world numbers across common Indian dev hardware:
| Hardware | Best Gemma 4 fit | tok/s |
|---|---|---|
| M1 Air 8 GB | 2b | 60 |
| M2 Pro 16 GB | 9b-q4 | 70 |
| M2 Pro 32 GB | 9b full | 38 |
| RTX 3060 12 GB | 9b-q4 | 65 |
| RTX 4070 12 GB | 9b full | 48 |
| RTX 4090 24 GB | 31b-q4 | 32 |
To inspect what quantization a model actually uses:
ollama show gemma4:9b --modelfile
# Look for: PARAMETER quantization q4_K_M
q4_K_M is the modern default for Ollama. It trades roughly 2x compression for around 3% measured quality loss on most benchmarks. For chat, summarization, code completion, the loss is invisible. For maths-heavy or long-chain reasoning, prefer q8_0 if you have the RAM.
Context window also costs RAM. Ollama defaults to 2k tokens. Push it to 32k for longer code reviews:
ollama run gemma4:9b
> /set parameter num_ctx 32768
Each doubling of context roughly adds 1 GB of RAM. A 128k context on 9B needs around 14 GB total, the M2 Pro 16 GB is the floor for that.
Use Case, A Local RAG System
Privacy-sensitive knowledge base RAG:
# Install local deps
# pip install chromadb langchain langchain-ollama
from langchain_ollama import ChatOllama
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(persist_directory="./chroma", embedding_function=embeddings)
# Ingest documents (run once)
# for pdf in pdfs: vectorstore.add_documents(chunks)
# Query
llm = ChatOllama(model="gemma4:9b", temperature=0)
relevant = vectorstore.similarity_search(query, k=3)
context = "\n\n".join([d.page_content for d in relevant])
response = llm.invoke(f"Context:\n{context}\n\nQuestion: {query}")
100% local, zero API calls. Perfect for law firms, doctors, CAs working with client data.
Hooking Gemma Into LangChain And LlamaIndex
The Chroma example above is fine for under 100k chunks. For tighter footprint and zero daemon, FAISS ships in a single Python wheel:
# pip install langchain-ollama langchain-community faiss-cpu
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
llm = ChatOllama(model="gemma4:9b", temperature=0)
emb = OllamaEmbeddings(model="nomic-embed-text")
# Ingest once
docs = open("knowledge.md").read()
chunks = RecursiveCharacterTextSplitter(chunk_size=800).split_text(docs)
store = FAISS.from_texts(chunks, embedding=emb)
store.save_local("./faiss_index")
# Query
hits = store.similarity_search("How do I file GST returns?", k=3)
ctx = "\n\n".join(d.page_content for d in hits)
print(llm.invoke(f"Context:\n{ctx}\n\nQuestion: How do I file GST returns?").content)
Total stack cost: zero. nomic-embed-text is a 137 MB embedding model that runs on CPU in around 80 ms per chunk. Pair it with gemma4:9b and the entire pipeline runs on a Rs 60,000 Acer Aspire 5, no cloud round trip.
For LlamaIndex the swap is identical:
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
llm = Ollama(model="gemma4:9b", request_timeout=120)
emb = OllamaEmbedding(model_name="nomic-embed-text")
Both libraries treat Ollama as a drop-in for OpenAI. Retrieval-heavy workflows that pay per-token in production are the obvious place to switch.
Fine-Tuning Locally
Gemma 4 is fine-tunable, though it needs more RAM. LoRA approach (10x less memory):
# Install Unsloth (makes Gemma training 2x faster)
pip install unsloth
# Fine-tune script (simplified):
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained("unsloth/gemma-4-9b-it")
# ... standard LoRA training with your data
~4-6 hours on a single 4090 for a domain-specific fine-tune.
Privacy + Compliance Notes
- Gemma 4 license: Google's custom terms (commercial use allowed with some restrictions, read the license)
- Your data stays local, no telemetry by default
- GDPR / DPDP friendly for Indian enterprises
Alternative Models Worth Trying
Via ollama pull:
llama3.3:70b, Meta's latestqwen2.5:72b, Alibaba; strong in Hindideepseek-r1:8b, reasoning specialistmistral-large:123b, European option
Switching is a one-command operation. If you run local Gemma as one tier alongside cloud models, put a multi-model fallback router in front so a local OOM or a slow pull never stalls the whole job.
Bottom Line
Local AI is practical for real use cases in 2026. If privacy matters, bills are ballooning, or you need bulk throughput, Gemma 4 on Ollama is a 30-minute setup that saves thousands down the line.
See the Gemma 4 launch news for broader context.
More Automation

Cloudflare API Token Gotchas: The PUT That Wiped Mine Twice
I broke production twice by updating a Cloudflare token's scopes through the public API, then learned the wrangler auth fix and a secret-scrub habit the hard way. This is exactly what bit me and how I handle tokens now.

Fix NVIDIA Cursor and Video Stutter on Linux: GPU Clock Thrash
Cursor jitter and dropped video frames on NVIDIA Linux get blamed on the compositor every time. On my GTX 1660 the real cause was the driver bouncing graphics and VRAM clocks under light load. Here is the fix that held.

Litestream to Cloudflare R2: Disaster Recovery for SQLite
SQLite on one free box is one disk failure away from gone. Here is the exact Litestream-to-R2 setup I run across every PocketBase backend in my stack, including the restore drill and the gotcha that bites first.