Automationadvanced

Running Gemma 4 Locally With Ollama, Setup Guide For Indian Devs

Google's 31B open-source model on your Macbook or PC, free, private, unlimited

By··8 min read
Running Gemma 4 Locally With Ollama, Setup Guide For Indian Devs, automation on AutoKaam

Gemma 4 (Google's 31B open-source model) has taken the #3 spot on open-source leaderboards. The biggest surprise, it can run on your laptop locally, with zero API cost, private data, and unlimited usage.

Hardware Requirements

Model size vs minimum RAM:

Gemma 4 variant Parameters VRAM / RAM needed Who can run
Gemma 4 2B 2B 4 GB Any laptop (CPU or entry GPU)
Gemma 4 9B 9B 12 GB M1 Pro, RTX 3060+
Gemma 4 31B 31B 24-48 GB M2 Max, RTX 4090, workstation

Reality check for the Indian market:

  • Macbook Air M2 (16GB): comfortable at 2B, 9B with tricks
  • Macbook Pro M2/M3 (32GB+): 9B smooth, 31B possible (quantized)
  • Gaming PC (RTX 4070+): 9B smooth, 31B with quantization

If you want 31B cheaply, rent a GPU on Runpod / Vast.ai for Rs 50-100/hour.

Install Ollama (One-Liner)

Mac:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download from ollama.com

Verify:

ollama --version

Download Gemma 4

# Start with 2B (smallest, fastest)
ollama pull gemma4:2b

# Or 9B (better quality, needs 12GB RAM)
ollama pull gemma4:9b

# Or 31B (full power, needs 24GB+)
ollama pull gemma4:31b

# Quantized versions (smaller, slightly lower quality)
ollama pull gemma4:9b-q4   # 4-bit, ~5GB RAM
ollama pull gemma4:31b-q4  # 4-bit, ~17GB RAM

Bandwidth-First Install Path

JIO Air Fiber and most tier-2 broadband come with daily caps or tiered speeds after a quota. Ollama model pulls are not small: 2B is around 1.5 GB, 9B is 5 to 6 GB, 31B can hit 18 GB. A snapped pull at 80% wastes the data.

Four rules that save the run:

(a) Schedule for unmetered hours. JIO Air Fiber has a 2 AM to 8 AM unmetered window on most plans. Airtel Xstream is similar. Start the pull at midnight, sleep, wake to a finished model.

(b) Use tmux or screen. SSH disconnects, Wi-Fi drops, laptop sleep, all kill foreground pulls. Wrap the command:

tmux new -s ollama
ollama pull gemma4:9b
# Ctrl+B then D to detach
# tmux attach -t ollama   # to resume later

(c) Verify the manifest. After the pull, check the integrity:

ls -la ~/.ollama/models/manifests/registry.ollama.ai/library/gemma4/
cat ~/.ollama/models/manifests/registry.ollama.ai/library/gemma4/9b

Each blob has a SHA256 hash. If ollama run fails with "invalid digest" the pull was truncated, just re-run ollama pull.

(d) Start small on slow lines. On a 5 Mbps line a 9B pull takes 3 hours. Pull 2B first (1.5 GB, around 40 minutes), confirm the workflow you want runs at acceptable quality, then upgrade. Half the time you find 2B is enough for what you actually need.

For BSNL or government broadband on flaky days, prefer the quantized variants:

ollama pull gemma4:9b-q4   # ~5 GB, often less noisy on packet loss

First Run

ollama run gemma4:9b

# Prompt:
> Explain in Hindi: what is cryptocurrency?

Gemma 4 handles Hindi surprisingly well, not native fluency like Sarvam, but understandable.

API Mode (For Integration)

Ollama runs a local HTTP server with an OpenAI-compatible interface:

# Serve in background
ollama serve

Python client:

from openai import OpenAI

client = OpenAI(
    api_key="ollama",  # dummy key
    base_url="http://localhost:11434/v1"
)

response = client.chat.completions.create(
    model="gemma4:9b",
    messages=[{"role": "user", "content": "Hello"}]
)

Existing OpenAI/Claude SDK code just needs a base_url swap. Zero other changes.

Hosting It For Your Team Over Tailscale

A 4-person Indian SaaS team running cloud Sonnet on each laptop burns Rs 8,000 to 15,000 per month on subscriptions. Local Gemma on a single shared machine cuts that to electricity and a Tailscale free tier.

The recipe:

1. Install Ollama on the strongest machine. That is whoever has the biggest VRAM, usually the founder-engineer with the gaming laptop.

2. Bind Ollama to all interfaces.

sudo systemctl edit ollama
# Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama

3. Install Tailscale on the host and every client.

curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

Note the Tailscale IP of the host (tailscale status), something like 100.64.0.5.

4. Block the public port. Tailscale exposure is enough, do not open 11434 on the public IP.

sudo ufw allow in on tailscale0 to any port 11434
sudo ufw deny in on eth0 to any port 11434

5. Clients point at the host.

# On every other team member's machine
export OLLAMA_HOST=http://100.64.0.5:11434
ollama list   # should show models the host has

Single download, single GPU, four happy developers. Free tier of Tailscale handles up to 100 devices.

For an office-only setup without Tailscale, an internal LAN IP works, just be honest about which network you trust.

Where Local Beats Cloud

1. Data Privacy

  • Legal documents, medical records, client PII, never leave your laptop
  • Easier DPDP Act compliance for sensitive data

2. Unlimited Volume

  • Running 10,000 classifications daily? Cloud bills balloon. Local is free.
  • Perfect if you run a content factory, blog generation, product descriptions, bulk tasks
  • For the cloud calls you cannot move local, memoize repeated LLM calls so identical inputs replay at zero tokens instead of re-billing

3. Offline / Low-Connectivity

  • Tier-3 cities with patchy internet, local has zero dependency
  • Train journeys, flights

4. Cost Predictability

  • Monthly bill: Rs 0 (only electricity ~Rs 200/mo for heavy use)
  • No surprise charges

Where Cloud Still Wins

  • Complex reasoning chains (Gemma 4 9B << Claude Opus 4.6)
  • Long context (Gemma 4 has 128k; Opus has 1M)
  • Tool use / agents
  • Frontier-quality writing

Performance Benchmarks (My M2 Macbook Pro 32GB)

Model Tokens/sec First-token latency
Gemma 4 2B 120 tok/s ~150ms
Gemma 4 9B 38 tok/s ~400ms
Gemma 4 9B-q4 55 tok/s ~300ms

9B is the sweet spot, quality closer to Claude Sonnet, speed usable.

VRAM And RAM Tuning, Real Constraints

The hardware table earlier is the headline. The actual limits depend on quantization and context window size.

For an 8 GB RAM laptop (the entry-level Indian dev machine), the practical config is gemma4:2b at full precision or gemma4:9b-q4 at 4-bit. Both leave 2 to 3 GB headroom for Chrome and your IDE. Anything bigger swaps to disk and tokens drop to single digits.

For a Macbook Air M2 8 GB, the same numbers apply, except the unified memory means the GPU does not need its own slice. Real-world numbers across common Indian dev hardware:

Hardware Best Gemma 4 fit tok/s
M1 Air 8 GB 2b 60
M2 Pro 16 GB 9b-q4 70
M2 Pro 32 GB 9b full 38
RTX 3060 12 GB 9b-q4 65
RTX 4070 12 GB 9b full 48
RTX 4090 24 GB 31b-q4 32

To inspect what quantization a model actually uses:

ollama show gemma4:9b --modelfile
# Look for: PARAMETER quantization q4_K_M

q4_K_M is the modern default for Ollama. It trades roughly 2x compression for around 3% measured quality loss on most benchmarks. For chat, summarization, code completion, the loss is invisible. For maths-heavy or long-chain reasoning, prefer q8_0 if you have the RAM.

Context window also costs RAM. Ollama defaults to 2k tokens. Push it to 32k for longer code reviews:

ollama run gemma4:9b
> /set parameter num_ctx 32768

Each doubling of context roughly adds 1 GB of RAM. A 128k context on 9B needs around 14 GB total, the M2 Pro 16 GB is the floor for that.

Use Case, A Local RAG System

Privacy-sensitive knowledge base RAG:

# Install local deps
# pip install chromadb langchain langchain-ollama

from langchain_ollama import ChatOllama
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(persist_directory="./chroma", embedding_function=embeddings)

# Ingest documents (run once)
# for pdf in pdfs: vectorstore.add_documents(chunks)

# Query
llm = ChatOllama(model="gemma4:9b", temperature=0)
relevant = vectorstore.similarity_search(query, k=3)
context = "\n\n".join([d.page_content for d in relevant])

response = llm.invoke(f"Context:\n{context}\n\nQuestion: {query}")

100% local, zero API calls. Perfect for law firms, doctors, CAs working with client data.

Hooking Gemma Into LangChain And LlamaIndex

The Chroma example above is fine for under 100k chunks. For tighter footprint and zero daemon, FAISS ships in a single Python wheel:

# pip install langchain-ollama langchain-community faiss-cpu

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

llm = ChatOllama(model="gemma4:9b", temperature=0)
emb = OllamaEmbeddings(model="nomic-embed-text")

# Ingest once
docs = open("knowledge.md").read()
chunks = RecursiveCharacterTextSplitter(chunk_size=800).split_text(docs)
store = FAISS.from_texts(chunks, embedding=emb)
store.save_local("./faiss_index")

# Query
hits = store.similarity_search("How do I file GST returns?", k=3)
ctx = "\n\n".join(d.page_content for d in hits)
print(llm.invoke(f"Context:\n{ctx}\n\nQuestion: How do I file GST returns?").content)

Total stack cost: zero. nomic-embed-text is a 137 MB embedding model that runs on CPU in around 80 ms per chunk. Pair it with gemma4:9b and the entire pipeline runs on a Rs 60,000 Acer Aspire 5, no cloud round trip.

For LlamaIndex the swap is identical:

from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

llm = Ollama(model="gemma4:9b", request_timeout=120)
emb = OllamaEmbedding(model_name="nomic-embed-text")

Both libraries treat Ollama as a drop-in for OpenAI. Retrieval-heavy workflows that pay per-token in production are the obvious place to switch.

Fine-Tuning Locally

Gemma 4 is fine-tunable, though it needs more RAM. LoRA approach (10x less memory):

# Install Unsloth (makes Gemma training 2x faster)
pip install unsloth

# Fine-tune script (simplified):
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained("unsloth/gemma-4-9b-it")

# ... standard LoRA training with your data

~4-6 hours on a single 4090 for a domain-specific fine-tune.

Privacy + Compliance Notes

  • Gemma 4 license: Google's custom terms (commercial use allowed with some restrictions, read the license)
  • Your data stays local, no telemetry by default
  • GDPR / DPDP friendly for Indian enterprises

Alternative Models Worth Trying

Via ollama pull:

  • llama3.3:70b, Meta's latest
  • qwen2.5:72b, Alibaba; strong in Hindi
  • deepseek-r1:8b, reasoning specialist
  • mistral-large:123b, European option

Switching is a one-command operation. If you run local Gemma as one tier alongside cloud models, put a multi-model fallback router in front so a local OOM or a slow pull never stalls the whole job.

Bottom Line

Local AI is practical for real use cases in 2026. If privacy matters, bills are ballooning, or you need bulk throughput, Gemma 4 on Ollama is a 30-minute setup that saves thousands down the line.

See the Gemma 4 launch news for broader context.