
Running Gemma 4 Locally With Ollama, Setup Guide For Indian Devs

Google's 31B open-source model on your MacBook or PC: free, private, unlimited

By Aditya Sharma · Apr 4, 2026 · 8 min read


Gemma 4 (Google's 31B open-source model) has taken the #3 spot on open-source leaderboards. The biggest surprise: it can run locally on your laptop, with zero API cost, private data, and unlimited usage.

Hardware Requirements

Model size vs minimum RAM:

Gemma 4 variant	Parameters	VRAM / RAM needed	Who can run
Gemma 4 2B	2B	4 GB	Any laptop (CPU or entry GPU)
Gemma 4 9B	9B	12 GB	M1 Pro, RTX 3060+
Gemma 4 31B	31B	24-48 GB	M2 Max, RTX 4090, workstation

Reality check for the Indian market:

MacBook Air M2 (16GB): comfortable at 2B, 9B with quantization
MacBook Pro M2/M3 (32GB+): 9B smooth, 31B possible (quantized)
Gaming PC (RTX 4070+): 9B smooth, 31B with quantization

If you want 31B cheaply, rent a GPU on Runpod / Vast.ai for Rs 50-100/hour.

Install Ollama (One-Liner)

Mac:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download from ollama.com

Verify:

ollama --version

Download Gemma 4

# Start with 2B (smallest, fastest)
ollama pull gemma4:2b

# Or 9B (better quality, needs 12GB RAM)
ollama pull gemma4:9b

# Or 31B (full power, needs 24GB+)
ollama pull gemma4:31b

# Quantized versions (smaller, slightly lower quality)
ollama pull gemma4:9b-q4   # 4-bit, ~5GB RAM
ollama pull gemma4:31b-q4  # 4-bit, ~17GB RAM
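
The ~5 GB and ~17 GB figures in the comments above are close to what plain arithmetic gives for the weights alone. Here is a rough sketch of that arithmetic, assuming memory is dominated by parameter count times bits per weight. The function name is made up for illustration, and the gap between the result and the quoted RAM is runtime overhead and context cache, which this does not model.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal GB.

    Ignores runtime overhead and the context (KV) cache, so real usage is higher.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Parameter counts come from the hardware table; bit widths are common choices.
    for tag, params in [("2b", 2), ("9b", 9), ("31b", 31)]:
        for bits in (16, 8, 4):
            size = weight_memory_gb(params, bits)
            print(f"gemma4:{tag} at {bits}-bit: ~{size:.1f} GB of weights")
```

Real 4-bit formats store a little more than 4 bits per weight, which is one reason the downloaded size lands above the naive number.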

Bandwidth-First Install Path

Jio AirFiber and most tier-2 broadband plans come with daily caps or throttled speeds after a quota. Ollama model pulls are not small: 2B is around 1.5 GB, 9B is 5 to 6 GB, and 31B can hit 18 GB. A pull that snaps at 80% can waste the data already spent if it does not resume cleanly.

Four rules that save the run:

(a) Schedule for unmetered hours. Jio AirFiber has a 2 AM to 8 AM unmetered window on most plans, and Airtel Xstream is similar. Start the pull at midnight, sleep, wake to a finished model.

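(b) Use tmux or screen. A dropped SSH session or a closed terminal kills a foreground pull, and tmux keeps the command alive on the machine doing the download. It does not stop a laptop from sleeping, so keep the machine awake and plugged in. Wrap the command: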

tmux new -s ollama
ollama pull gemma4:9b
# Ctrl+B then D to detach
# tmux attach -t ollama   # to resume later

(c) Verify the manifest. After the pull, check the integrity:

ls -la ~/.ollama/models/manifests/registry.ollama.ai/library/gemma4/
cat ~/.ollama/models/manifests/registry.ollama.ai/library/gemma4/9b

Each blob has a SHA256 hash. If ollama run fails with "invalid digest", the pull was truncated; just re-run ollama pull.

(d) Start small on slow lines. On a 5 Mbps line a 9B pull takes 3 hours. Pull 2B first (1.5 GB, around 40 minutes), confirm the workflow you want runs at acceptable quality, then upgrade. Half the time you find 2B is enough for what you actually need.
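
The timings in rule (d) are easy to reproduce for your own line. A small sketch, assuming decimal gigabytes and a line that sustains some fraction of its advertised speed. The `efficiency` parameter is my own addition for illustration, not something the pull command reports.

```python
def pull_time_minutes(size_gb: float, line_mbps: float, efficiency: float = 1.0) -> float:
    """Estimate download time in minutes for a model of size_gb on a line_mbps link.

    efficiency is the fraction of the advertised speed you actually sustain (0 to 1).
    """
    if line_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("line_mbps must be positive and efficiency in (0, 1]")
    megabits = size_gb * 8 * 1000          # GB -> megabits (decimal units)
    seconds = megabits / (line_mbps * efficiency)
    return seconds / 60


if __name__ == "__main__":
    print(f"2B (1.5 GB) on 5 Mbps: {pull_time_minutes(1.5, 5):.0f} min")
    print(f"9B (5.5 GB) on 5 Mbps at 80% of line speed: {pull_time_minutes(5.5, 5, 0.8) / 60:.1f} h")
```

At full line speed the 1.5 GB pull comes out at 40 minutes, and a 5.5 GB pull at 80% of a 5 Mbps line lands close to the 3 hours quoted above.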

For BSNL or government broadband on flaky days, prefer the quantized variants:

ollama pull gemma4:9b-q4   # ~5 GB, a smaller download with less exposure to dropped connections
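
The manifest check in rule (c) can be automated instead of eyeballed. This is a sketch under two assumptions about the on-disk layout: the manifest is JSON with `layers` and `config` entries that each carry a `digest` like `sha256:<hex>`, and blobs sit in `~/.ollama/models/blobs` as files named `sha256-<hex>`. Both can differ across Ollama versions and install methods, so treat the paths as placeholders.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MB chunks so multi-GB blobs do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def check_manifest(manifest_path: Path, blobs_dir: Path) -> list:
    """Return a list of problems; an empty list means every blob matched."""
    manifest = json.loads(manifest_path.read_text())
    entries = list(manifest.get("layers", []))
    if "config" in manifest:
        entries.append(manifest["config"])

    problems = []
    for entry in entries:
        algo, _, expected = entry["digest"].partition(":")
        blob = blobs_dir / f"{algo}-{expected}"   # assumed blob naming scheme
        if not blob.exists():
            problems.append(f"missing blob {entry['digest']}")
        elif sha256_of(blob) != expected:
            problems.append(f"hash mismatch {entry['digest']}")
    return problems


if __name__ == "__main__":
    models = Path.home() / ".ollama" / "models"   # assumed default location
    manifest = models / "manifests" / "registry.ollama.ai" / "library" / "gemma4" / "9b"
    if manifest.exists():
        issues = check_manifest(manifest, models / "blobs")
        print("all blobs verified" if not issues else "\n".join(issues))
    else:
        print(f"no manifest found at {manifest}")
```

Hashing a multi-GB blob takes a while on a laptop disk, so run it once after the pull rather than on every start.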

First Run

ollama run gemma4:9b

# Prompt:
> Explain in Hindi: what is cryptocurrency?

Gemma 4 handles Hindi surprisingly well: not native fluency like Sarvam, but understandable.

API Mode (For Integration)

Ollama runs a local HTTP server with an OpenAI-compatible interface:

# Start the server (runs in the foreground; skip this if the desktop app or systemd service already has it running)
ollama serve

Python client:

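from openai import OpenAI

# Point the OpenAI SDK at the local Ollama server
client = OpenAI(
    api_key="ollama",  # dummy key; the SDK requires a value
    base_url="http://localhost:11434/v1",
)

response = client.chat.completions.create(
    model="gemma4:9b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)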

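Existing OpenAI SDK code just needs a base_url swap and the local model name. Claude SDK code speaks a different API, so confirm your Ollama version exposes a compatible endpoint before relying on the same swap.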

Hosting It For Your Team Over Tailscale

A 4-person Indian SaaS team paying for cloud Sonnet on every laptop burns Rs 8,000 to 15,000 per month on subscriptions. Local Gemma on a single shared machine cuts that to the electricity bill and a Tailscale free tier.

The recipe:

1. Install Ollama on the strongest machine. That is whoever has the most VRAM, usually the founder-engineer with the gaming laptop.

2. Bind Ollama to all interfaces.

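sudo systemctl edit ollama
# In the editor that opens, add these two lines, then save and exit:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
# Back in the shell, reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama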

3. Install Tailscale on the host and every client.

curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

Note the Tailscale IP of the host (tailscale status), something like 100.64.0.5.

4. Block the public port. Tailscale exposure is enough; do not open 11434 on the public IP.

sudo ufw allow in on tailscale0 to any port 11434
sudo ufw deny in on eth0 to any port 11434

5. Clients point at the host.

# On every other team member's machine
export OLLAMA_HOST=http://100.64.0.5:11434
ollama list   # should show models the host has

Single download, single GPU, four happy developers. The Tailscale free tier handles up to 100 devices, though it also caps the number of users, so check the current plan limits for your team size.

For an office-only setup without Tailscale, an internal LAN IP works, just be honest about which network you trust.
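
Team members who call the shared host from Python rather than the CLI need the same address turned into an OpenAI-style base URL. A minimal sketch, assuming the client machine has `OLLAMA_HOST` exported as in step 5. The helper name `openai_base_url` is hypothetical, and the default port 11434 is the one used throughout this guide.

```python
import os
from urllib.parse import urlsplit


def openai_base_url(ollama_host: str = "") -> str:
    """Turn an OLLAMA_HOST-style value into the /v1 base URL the OpenAI SDK expects."""
    host = (ollama_host or os.environ.get("OLLAMA_HOST", "") or "http://localhost:11434").strip()
    if "://" not in host:
        host = "http://" + host            # bare "ip:port" values have no scheme
    parts = urlsplit(host)
    port = parts.port or 11434             # fall back to the port used in this guide
    return f"{parts.scheme}://{parts.hostname}:{port}/v1"


if __name__ == "__main__":
    # 100.64.0.5 is the example Tailscale IP from step 3.
    print(openai_base_url("http://100.64.0.5:11434"))
    # Then, with the openai package installed:
    #   client = OpenAI(api_key="ollama", base_url=openai_base_url())
```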

Where Local Beats Cloud

1. Data Privacy

Legal documents, medical records, and client PII never leave your laptop. The same holds when reading images and scans with a local vision model.
Easier DPDP Act compliance for sensitive data

2. Unlimited Volume

Running 10,000 classifications daily? Cloud bills balloon. Local is free.
Perfect if you run a content factory, blog generation, product descriptions, bulk tasks
For the cloud calls you cannot move local, memoize repeated LLM calls so identical inputs replay at zero tokens instead of re-billing

3. Offline / Low-Connectivity

Tier-3 cities with patchy internet, local has zero dependency
Train journeys, flights

4. Cost Predictability

Monthly API bill: Rs 0 (only electricity, around Rs 200/month for heavy use)
No surprise charges
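
The memoization point under Unlimited Volume is simple to sketch. This is one possible shape, not a specific library: hash the model name and prompt, and replay the stored answer on a repeat. `memoize_llm` and the `call_llm(model, prompt)` signature are hypothetical names for illustration, and the approach only makes sense for deterministic settings such as temperature 0.

```python
import hashlib
import json
from typing import Callable, Dict, Optional


def cache_key(model: str, prompt: str) -> str:
    """Stable key for a (model, prompt) pair."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def memoize_llm(call_llm: Callable[[str, str], str],
                cache: Optional[Dict[str, str]] = None) -> Callable[[str, str], str]:
    """Wrap call_llm so identical inputs are answered from the cache."""
    store = {} if cache is None else cache

    def cached(model: str, prompt: str) -> str:
        key = cache_key(model, prompt)
        if key not in store:               # only the first call reaches the backend
            store[key] = call_llm(model, prompt)
        return store[key]

    return cached


if __name__ == "__main__":
    calls = []

    def fake_backend(model: str, prompt: str) -> str:
        # Stand-in for a paid cloud call; records how often it is actually hit.
        calls.append(prompt)
        return f"[{model}] answer to: {prompt}"

    ask = memoize_llm(fake_backend)
    ask("cloud-model", "Classify: refund request")
    ask("cloud-model", "Classify: refund request")
    print(f"backend calls: {len(calls)}")
```

Swap the in-memory dict for a file or database if the cache needs to survive restarts.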

Where Cloud Still Wins

Complex reasoning chains (Gemma 4 9B << Claude Opus 4.6)
Long context (Gemma 4 has 128k; Opus has 1M)
Tool use / agents
Frontier-quality writing

Performance Benchmarks (My M2 MacBook Pro 32GB)

Model	Tokens/sec	First-token latency
Gemma 4 2B	120 tok/s	~150ms
Gemma 4 9B	38 tok/s	~400ms
Gemma 4 9B-q4	55 tok/s	~300ms

9B is the sweet spot: quality closer to Claude Sonnet, at a usable speed.
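
To get comparable numbers on your own machine, the timing fields in Ollama's native generate response are enough. A sketch, assuming the response carries `eval_count` and nanosecond durations (`eval_duration`, `load_duration`, `prompt_eval_duration`). The sample dictionary below is illustrative, built to mirror the 9B row of the table, not a captured measurement, and treating load plus prompt evaluation as first-token latency is an approximation.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response's eval_count and eval_duration (ns)."""
    eval_ns = response["eval_duration"]
    if eval_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / (eval_ns / 1e9)


def first_token_latency_ms(response: dict) -> float:
    """Rough time before the first token: model load plus prompt evaluation."""
    total_ns = response.get("load_duration", 0) + response.get("prompt_eval_duration", 0)
    return total_ns / 1e6


if __name__ == "__main__":
    # Illustrative values only, chosen to reproduce the arithmetic behind the table.
    sample = {
        "eval_count": 380,
        "eval_duration": 10_000_000_000,      # 10 s of generation
        "load_duration": 50_000_000,          # 50 ms
        "prompt_eval_duration": 350_000_000,  # 350 ms
    }
    print(f"{tokens_per_second(sample):.0f} tok/s, ~{first_token_latency_ms(sample):.0f} ms to first token")
```

Run each model a few times and discard the first run, since a cold load inflates the latency.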

VRAM And RAM Tuning, Real Constraints

The hardware table earlier is the headline. The actual limits depend on quantization and context window size.

For an 8 GB RAM laptop (the entry-level Indian dev machine), the practical config is gemma4:2b at full precision or gemma4:9b-q4 at 4-bit. Both leave 2 to 3 GB of headroom for Chrome and your IDE. Anything bigger swaps to disk, and throughput drops to single-digit tokens per second.

For a MacBook Air M2 8 GB, the same numbers apply, except that unified memory means the GPU does not need its own slice. Real-world numbers across common Indian dev hardware:

Hardware	Best Gemma 4 fit	tok/s
M1 Air 8 GB	2b	60
M2 Pro 16 GB	9b-q4	70
M2 Pro 32 GB	9b full	38
RTX 3060 12 GB	9b-q4	65
RTX 4070 12 GB	9b full	48
RTX 4090 24 GB	31b-q4	32

To inspect what quantization a model actually uses:

ollama show gemma4:9b
# Look for the "quantization" row in the Model section, e.g. Q4_K_M

q4_K_M is the modern default for Ollama. It gives roughly 2x compression over q8_0 (more against fp16) for around 3% measured quality loss on most benchmarks. For chat, summarization, and code completion, the loss is invisible. For maths-heavy or long-chain reasoning, prefer q8_0 if you have the RAM. The same quantize-to-fit discipline reaches past chat: an int8 speech-to-text stack fits a 6 GB GTX 1660 once you evict the resident Ollama model to free the card.

Context window also costs RAM. Ollama's default context is only a few thousand tokens (2k in older releases). Push it to 32k for longer code reviews:

ollama run gemma4:9b
> /set parameter num_ctx 32768

The context cache grows linearly with the window, so each doubling of context doubles the cache rather than adding a fixed amount. A 128k context on 9B needs around 14 GB total, so the M2 Pro 16 GB is the floor for that.
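
The context cost comes from the KV cache, which for a standard transformer grows linearly with the window. A back-of-envelope sketch follows. The layer count, KV head count and head size are hypothetical placeholders, not Gemma 4's published architecture, and models that use sliding-window attention layers or a quantized cache need less than this.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for one sequence.

    The leading 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to fp16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


if __name__ == "__main__":
    # Hypothetical dimensions for illustration only.
    LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128
    for ctx in (2048, 8192, 32768, 131072):
        gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
        print(f"{ctx:>7} tokens: {gib:.2f} GiB of KV cache")
```

Whatever the real dimensions are, the shape of the result holds: doubling `context_tokens` doubles the cache.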

Use Case, A Local RAG System

Privacy-sensitive knowledge base RAG:

# Install local deps
# pip install chromadb langchain langchain-community langchain-ollama
# ollama pull nomic-embed-text

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(persist_directory="./chroma", embedding_function=embeddings)

# Ingest documents (run once)
# for pdf in pdfs: vectorstore.add_documents(chunks)

# Query
query = "What notice period does the lease require?"  # example question
llm = ChatOllama(model="gemma4:9b", temperature=0)
relevant = vectorstore.similarity_search(query, k=3)
context = "\n\n".join(d.page_content for d in relevant)

response = llm.invoke(f"Context:\n{context}\n\nQuestion: {query}")
print(response.content)

100% local, zero API calls. Perfect for law firms, doctors, CAs working with client data.
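
The ingest step above is left as a comment, and the part people usually get stuck on is chunking. Here is a dependency-free sketch of fixed-size chunks with overlap. The sizes are arbitrary starting points, and `chunk_text` is a name I made up, not a LangChain function.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so a sentence cut at a
    boundary still appears whole in one of them.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    chunks = []
    step = chunk_size - overlap
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):   # this chunk reached the end
            break
        start += step
    return chunks


if __name__ == "__main__":
    sample = "0123456789"
    print(chunk_text(sample, chunk_size=4, overlap=1))
    # With the vector store from the example above, the chunks could then be
    # added with something like: vectorstore.add_texts(chunk_text(document_text))
```

Character-based splitting is crude. A splitter that respects paragraphs or headings usually retrieves better, but this is enough to get a first index built.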

Hooking Gemma Into LangChain And LlamaIndex

The Chroma example above is fine for under 100k chunks. For tighter footprint and zero daemon, FAISS ships in a single Python wheel:

# pip install langchain-ollama langchain-community faiss-cpu

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

llm = ChatOllama(model="gemma4:9b", temperature=0)
emb = OllamaEmbeddings(model="nomic-embed-text")

# Ingest once
docs = open("knowledge.md").read()
chunks = RecursiveCharacterTextSplitter(chunk_size=800).split_text(docs)
store = FAISS.from_texts(chunks, embedding=emb)
store.save_local("./faiss_index")

# Query
hits = store.similarity_search("How do I file GST returns?", k=3)
ctx = "\n\n".join(d.page_content for d in hits)
print(llm.invoke(f"Context:\n{ctx}\n\nQuestion: How do I file GST returns?").content)

Total stack cost: zero. nomic-embed-text is a 137 MB embedding model that runs on CPU in around 80 ms per chunk. Pair it with gemma4:9b and the entire pipeline runs on a Rs 60,000 Acer Aspire 5, no cloud round trip.

For LlamaIndex the swap is identical:

# pip install llama-index-llms-ollama llama-index-embeddings-ollama

from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

llm = Ollama(model="gemma4:9b", request_timeout=120)
emb = OllamaEmbedding(model_name="nomic-embed-text")

Both libraries treat Ollama as a drop-in for OpenAI. Retrieval-heavy workflows that pay per-token in production are the obvious place to switch.

Fine-Tuning Locally

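Gemma 4 is fine-tunable, though training needs more RAM than inference. The LoRA approach trains small adapter matrices instead of the full weights (around 10x less memory):

# Shell: install Unsloth (makes Gemma training 2x faster)
pip install unsloth

# Python: fine-tune script (simplified)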
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained("unsloth/gemma-4-9b-it")

# ... standard LoRA training with your data

~4-6 hours on a single 4090 for a domain-specific fine-tune.
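
Why LoRA is so much lighter comes down to how few parameters it trains. A sketch of the count for a single weight matrix, using hypothetical dimensions (a 4096 x 4096 projection and rank 16) rather than Gemma 4's real shapes. Note that this counts trainable parameters only. The frozen base weights still sit in memory, which is why the overall saving is closer to the 10x quoted above than to the ratio printed here.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_in x d_out matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter: A is d_in x rank, B is rank x d_out."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    D_IN, D_OUT, RANK = 4096, 4096, 16   # hypothetical sizes for illustration
    full = full_finetune_params(D_IN, D_OUT)
    lora = lora_params(D_IN, D_OUT, RANK)
    print(f"full: {full:,} params, LoRA: {lora:,} params, ratio {full // lora}x")
```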

Privacy + Compliance Notes

Gemma 4 license: Google's custom terms (commercial use allowed with some restrictions, read the license)
Your data stays local, no telemetry by default
GDPR / DPDP friendly for Indian enterprises

Alternative Models Worth Trying

Via ollama pull:

llama3.3:70b, Meta's 70B instruct model
qwen2.5:72b, Alibaba; strong in Hindi
deepseek-r1:8b, reasoning specialist
mistral-large:123b, European option

Switching is a one-command operation. If you run local Gemma as one tier alongside cloud models, put a multi-model fallback router in front so a local OOM or a slow pull never stalls the whole job.
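
A fallback router can be as small as a loop over backends in priority order. This is a minimal sketch of the idea, not any particular router product. The backend functions are stand-ins, and a real one would wrap the Ollama or cloud client call and add timeouts.

```python
from typing import Callable, List, Tuple


def route(prompt: str, backends: List[Tuple[str, Callable[[str], str]]]) -> Tuple[str, str]:
    """Try each (name, call) backend in order and return (name, answer) from the first that works."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:           # any failure moves on to the next tier
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


if __name__ == "__main__":
    def local_gemma(prompt: str) -> str:
        # Stand-in that simulates the local tier running out of memory.
        raise MemoryError("simulated local OOM")

    def cloud_model(prompt: str) -> str:
        # Stand-in for a cloud call.
        return f"cloud answer to: {prompt}"

    used, answer = route("Summarize this ticket", [("local", local_gemma), ("cloud", cloud_model)])
    print(used, "->", answer)
```

Putting the local tier first keeps the cheap path as the default and only spends cloud tokens when it fails.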

Bottom Line

Local AI is practical for real use cases in 2026. If privacy matters, bills are ballooning, or you need bulk throughput, Gemma 4 on Ollama is a 30-minute setup that saves thousands down the line.

See the Gemma 4 launch news for broader context.
