
Running Gemma 4 Locally With Ollama — Setup Guide For Indian Devs

Google's 31B open-source model on your MacBook or PC — free, private, unlimited

AutoKaam Editorial · 8 min read

Gemma 4 (Google's 31B open-source model) has taken the #3 spot on open-source leaderboards. The biggest surprise: it runs locally on your laptop, with zero API cost, fully private data, and unlimited usage.

Hardware Requirements

Model size vs minimum RAM:

Gemma 4 variant   Parameters   VRAM / RAM needed   Who can run
Gemma 4 2B        2B           4 GB                Any laptop (CPU or entry GPU)
Gemma 4 9B        9B           12 GB               M1 Pro, RTX 3060+
Gemma 4 31B       31B          24-48 GB            M2 Max, RTX 4090, workstation

Reality check for the Indian market:

  • MacBook Air M2 (16GB): comfortable at 2B, 9B only with quantization
  • MacBook Pro M2/M3 (32GB+): 9B smooth, 31B possible (quantized)
  • Gaming PC (RTX 4070+): 9B smooth, 31B with quantization

If you want 31B cheaply, rent a GPU on Runpod / Vast.ai for Rs 50-100/hour.
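To sanity-check the rental route, the per-token economics work out roughly like this (the Rs 75/hour rate and ~35 tok/s throughput are assumptions for illustration; plug in real quotes from Runpod or Vast.ai):

```python
def rented_gpu_cost_per_million_tokens(rs_per_hour: float, tokens_per_second: float) -> float:
    """Effective generation cost on a rented GPU, in Rs per million tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return rs_per_hour * 1_000_000 / tokens_per_hour

# Assumed: Rs 75/hour rental, ~35 tok/s for a 31B model on a single 4090
print(round(rented_gpu_cost_per_million_tokens(75, 35)))  # ~Rs 595 per million tokens
```

At steady volume that beats most hosted frontier-model pricing, but the GPU only pays for itself while it is actually generating.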

Install Ollama (One-Liner)

Mac:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download from ollama.com

Verify:

ollama --version

Download Gemma 4

# Start with 2B (smallest, fastest)
ollama pull gemma4:2b

# Or 9B (better quality, needs 12GB RAM)
ollama pull gemma4:9b

# Or 31B (full power, needs 24GB+)
ollama pull gemma4:31b

# Quantized versions (smaller, slightly lower quality)
ollama pull gemma4:9b-q4   # 4-bit, ~5GB RAM
ollama pull gemma4:31b-q4  # 4-bit, ~17GB RAM
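The q4 RAM figures above follow directly from the arithmetic: a 4-bit weight takes half a byte, so weight storage is roughly parameters × bits ÷ 8, plus runtime overhead (KV cache, activations) on top. A quick sketch:

```python
def weight_storage_gb(params_billion: float, bits_per_weight: int) -> float:
    """Raw weight storage only; KV cache and runtime overhead add more on top."""
    return params_billion * bits_per_weight / 8

print(weight_storage_gb(9, 4))    # 4.5 GB of weights -> ~5 GB RAM in practice
print(weight_storage_gb(31, 4))   # 15.5 GB of weights -> ~17 GB RAM in practice
print(weight_storage_gb(9, 16))   # 18.0 GB at full fp16 precision
```

This is why 4-bit quantization is the usual unlock for running 31B on a 24 GB machine.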

First Run

ollama run gemma4:9b

# Prompt:
> Explain in Hindi: what is cryptocurrency?

Gemma 4 handles Hindi surprisingly well — not native fluency like Sarvam, but understandable.

API Mode (For Integration)

Ollama runs a local HTTP server with an OpenAI-compatible interface:

# Serve in background
ollama serve

Python client:

from openai import OpenAI

client = OpenAI(
    api_key="ollama",  # dummy key
    base_url="http://localhost:11434/v1"
)

response = client.chat.completions.create(
    model="gemma4:9b",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

Existing OpenAI-SDK code needs only a base_url swap; nothing else changes.
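One way to keep that swap in a single place is a small config helper, so the same code can target local or cloud with one flag (the helper name is illustrative; 11434 is Ollama's default port):

```python
import os

def chat_client_kwargs(local: bool = True) -> dict:
    """Kwargs for openai.OpenAI(): point at local Ollama or the hosted API."""
    if local:
        return {"api_key": "ollama", "base_url": "http://localhost:11434/v1"}
    return {"api_key": os.environ.get("OPENAI_API_KEY", "")}

# Usage: client = OpenAI(**chat_client_kwargs(local=True))
print(chat_client_kwargs(local=True))
```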

Where Local Beats Cloud

1. Data Privacy

  • Legal documents, medical records, client PII — never leave your laptop
  • Easier DPDP Act compliance for sensitive data

2. Unlimited Volume

  • Running 10,000 classifications daily? Cloud bills balloon. Local is free.
  • Perfect if you run a content factory — blog generation, product descriptions, bulk tasks

3. Offline / Low-Connectivity

  • Tier-3 cities with patchy internet — local has zero dependency
  • Train journeys, flights

4. Cost Predictability

  • Monthly bill: Rs 0 (only electricity ~Rs 200/mo for heavy use)
  • No surprise charges
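A rough break-even sketch for the bulk-volume case (the Rs 150 per million tokens cloud price is an assumption; substitute your provider's actual rate):

```python
def monthly_cloud_cost_rs(tokens_per_day: int, rs_per_million_tokens: float) -> float:
    """Approximate monthly API spend for a steady daily token volume."""
    return tokens_per_day * 30 * rs_per_million_tokens / 1_000_000

# e.g. 2M generated tokens/day at an assumed Rs 150 per million tokens
print(monthly_cloud_cost_rs(2_000_000, 150))  # 9000.0 -> Rs 9,000/mo vs ~Rs 200 electricity
```

The gap widens linearly with volume, which is why content-factory workloads tip toward local first.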

Where Cloud Still Wins

  • Complex reasoning chains (Gemma 4 9B << Claude Opus 4.6)
  • Long context (Gemma 4 has 128k; Opus has 1M)
  • Tool use / agents
  • Frontier-quality writing

Performance Benchmarks (My M2 Macbook Pro 32GB)

Model           Tokens/sec   First-token latency
Gemma 4 2B      120 tok/s    ~150 ms
Gemma 4 9B      38 tok/s     ~400 ms
Gemma 4 9B-q4   55 tok/s     ~300 ms

9B is the sweet spot: quality approaching Claude Sonnet, with speed that stays usable.
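To translate those numbers into felt latency: total response time is roughly first-token latency plus tokens ÷ throughput. A sketch using the table above:

```python
def response_time_s(n_tokens: int, tokens_per_sec: float, first_token_ms: float) -> float:
    """Approximate wall-clock time to stream a full response."""
    return first_token_ms / 1000 + n_tokens / tokens_per_sec

for name, tps, ftl in [("2B", 120, 150), ("9B", 38, 400), ("9B-q4", 55, 300)]:
    print(f"Gemma 4 {name}: {response_time_s(500, tps, ftl):.1f}s for a 500-token answer")
```

A 500-token answer lands in about 4 seconds on 2B and about 14 seconds on 9B, which is why q4 is attractive for interactive use.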

Use Case — A Local RAG System

Privacy-sensitive knowledge base RAG:

# Install local deps first:
# pip install chromadb langchain-community langchain-ollama

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(persist_directory="./chroma", embedding_function=embeddings)

# Ingest documents (run once)
# for pdf in pdfs: vectorstore.add_documents(chunks)

# Query
query = "What is the notice period in this contract?"
llm = ChatOllama(model="gemma4:9b", temperature=0)
relevant = vectorstore.similarity_search(query, k=3)
context = "\n\n".join(d.page_content for d in relevant)

response = llm.invoke(f"Context:\n{context}\n\nQuestion: {query}")
print(response.content)

100% local, zero API calls. Perfect for law firms, doctors, CAs working with client data.
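The ingest step above glosses over chunking. A minimal character-based chunker with overlap, so context is not cut mid-thought (the 1,000/200 sizes are illustrative defaults, not tuned values):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows for embedding and retrieval."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 2500)
print(len(chunks), [len(c) for c in chunks])  # 3 [1000, 1000, 900]
```

Each chunk would then go through vectorstore.add_documents in the ingest step.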

Fine-Tuning Locally

Gemma 4 can be fine-tuned locally, though it needs more RAM. A LoRA approach cuts memory roughly 10x:

# Install Unsloth (makes Gemma training 2x faster)
pip install unsloth

# Fine-tune script (simplified):
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained("unsloth/gemma-4-9b-it")

# ... standard LoRA training with your data

~4-6 hours on a single 4090 for a domain-specific fine-tune.
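The memory saving comes from how few parameters LoRA actually trains: gradients and optimizer state are kept only for two small low-rank factors per adapted weight matrix. An illustrative count (the layer/width/rank numbers are assumptions for the sketch, not Gemma 4's real config):

```python
def lora_trainable_params(d_model: int, n_layers: int, rank: int,
                          matrices_per_layer: int = 4) -> int:
    """Two low-rank factors (d x r and r x d) per adapted matrix, per layer."""
    return n_layers * matrices_per_layer * 2 * d_model * rank

full = 9_000_000_000
lora = lora_trainable_params(d_model=3584, n_layers=42, rank=16)
print(f"LoRA trains {lora:,} params = {lora / full:.3%} of a 9B model")
```

Training well under 1% of the weights is what makes a single 4090 viable.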

Privacy + Compliance Notes

  • Gemma 4 license: Google's custom terms (commercial use allowed with some restrictions — read the license)
  • Your data stays local — no telemetry by default
  • GDPR / DPDP friendly for Indian enterprises

Alternative Models Worth Trying

Via ollama pull:

  • llama3.3:70b — Meta's latest
  • qwen2.5:72b — Alibaba; strong in Hindi
  • deepseek-r1:8b — reasoning specialist
  • mistral-large:123b — European option

Switching is a one-command operation.

Bottom Line

Local AI is practical for real use cases in 2026. If privacy matters, bills are ballooning, or you need bulk throughput — Gemma 4 on Ollama is a 30-minute setup that saves thousands down the line.

See the Gemma 4 launch news for broader context.

#Gemma #Ollama #LocalAI #OpenSource #Privacy