Ollama Gemma 4 On Linux, From Install To First Token
Google's open-weights Gemma 4 9B running locally on my ThinkCentre: the install that landed for me

Gemma 4 is the open-weights Google model that put solid 9B-class quality within reach of a stock laptop. I run Gemma 4 9B on the ThinkCentre with 32GB RAM and no GPU, and it handles my routine summarisation, content-rewrite, and bulk-classification work without an API call. This is the install that worked, including the Ollama config tweak that fixed my context-window issue.
What you'll build
Ollama installed on Linux, Gemma 4 9B pulled and running, and an OpenAI-compatible Python client calling it from your scripts. Roughly 25 minutes including the model pull on a 50Mbps line.
Caption: Ollama serving Gemma 4 9B via the local API.
Prerequisites
- Linux x86_64 (Ubuntu 22.04+ or any modern distro)
- 16GB RAM minimum for the 9B model, 8GB if you stick with Gemma 4 2B
- 6GB free disk for the 9B at default quantization
- A Python venv if you plan to call the API from Python
If you have an Nvidia GPU, Ollama will use it automatically. CPU-only works fine on a modern processor; expect ~30 tok/sec on the 9B.
Step 1, install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama --version

The script writes to /usr/local/bin/ollama, sets up a systemd unit, and binds the API to localhost:11434. Verify with systemctl status ollama.
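If you'd rather verify from a script than from systemctl, you can hit Ollama's version endpoint; a minimal check, assuming the default port:
import json
import urllib.request

# /api/version answers on the local API if the daemon is up
with urllib.request.urlopen("http://localhost:11434/api/version") as resp:
    print(json.load(resp))  # e.g. {"version": "0.5.7"}
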
Step 2, pull Gemma 4
ollama pull gemma4:9b
ollama list

The 9B at default quantization is ~5.5GB. On my Jio fibre at 50Mbps, the pull takes ~16 minutes. The Ollama registry is geo-routed; if you see slow speeds, your route is hitting the wrong CDN edge. Cloudflare WARP (free) often fixes this.
For comparison:
- gemma4:2b is 2GB and runs in 8GB RAM
- gemma4:9b is 5.5GB and needs 16GB RAM
- gemma4:9b-q4 is 4.4GB, lower quality but easier on RAM
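If a script needs to know what's already pulled, the local API mirrors ollama list; a sketch against the /api/tags endpoint:
import json
import urllib.request

# /api/tags lists installed models with their on-disk size in bytes
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    for model in json.load(resp)["models"]:
        print(f'{model["name"]}: {model["size"] / 1e9:.1f} GB')
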
Step 3, run a basic prompt
ollama run gemma4:9b
> Summarise the difference between Gemma 4 and Llama 3.3 in two sentences.

The first prompt has a ~10-second cold-start while the model loads into RAM. Subsequent prompts in the same session are sub-second to first token.
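The cold start repeats whenever Ollama evicts the model from RAM (five minutes idle by default). The native API takes a keep_alive field per request to hold it resident longer; a sketch, assuming the default endpoint:
import json
import urllib.request

payload = {
    "model": "gemma4:9b",
    "messages": [{"role": "user", "content": "ping"}],
    "stream": False,
    "keep_alive": "30m",  # stay loaded 30 min after this call; -1 pins it indefinitely
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["message"]["content"])
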
Step 4, fix the context window default
The Ollama default context window is 2048 tokens, which is comically small for any real work. To set a larger context for Gemma 4:
ollama run gemma4:9b
> /set parameter num_ctx 8192
> /save gemma4-8k

You now have a gemma4-8k alias with the larger context. For one-off use, you can also set num_ctx per request through the native API, as in the sketch below. The official docs mention the 2048 default only in passing; it trips up most people.
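The per-request override goes through the native /api/chat options field; same request shape as the keep_alive sketch above:
import json
import urllib.request

payload = {
    "model": "gemma4:9b",
    "messages": [{"role": "user", "content": "Summarise: ..."}],
    "stream": False,
    "options": {"num_ctx": 8192},  # overrides the 2048-token default for this call only
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["message"]["content"])
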
Step 5, call from Python
Ollama exposes an OpenAI-compatible API. Install the OpenAI SDK and point it at localhost:
pip install openai
from openai import OpenAI

client = OpenAI(
    api_key="ollama",  # dummy, ignored
    base_url="http://localhost:11434/v1",
)

response = client.chat.completions.create(
    model="gemma4:9b",
    messages=[
        {"role": "user", "content": "Write a one-line summary of microservices."},
    ],
)
print(response.choices[0].message.content)

Existing OpenAI SDK code can swap to local Gemma with a base_url change. The migration friction is genuinely zero for non-streaming workloads.
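Streaming does go through the same endpoint with the SDK's stream=True, though chunk pacing and shape can differ from the hosted API, which is where the residual friction lives; a minimal sketch:
from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

stream = client.chat.completions.create(
    model="gemma4:9b",
    messages=[{"role": "user", "content": "Explain microservices in three bullets."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk carries no content
        print(delta, end="", flush=True)
print()
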
First run
A typical day-of-use session for me:
$ ollama run gemma4-8k
> classify these 50 product reviews as positive, negative, or mixed:
[paste 50 reviews]
[Gemma processes, returns categorised list]
> rewrite the negative ones as polite customer-feedback summaries
[Gemma rewrites all]

For bulk text work, local Gemma is the right tool. Cloud APIs would charge for every token; local is free past the electricity.
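As a script, that session is a plain loop over the Step 5 client; a sketch with stand-in data:
from openai import OpenAI

client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

reviews = [  # stand-in data; load your real 50 from a file or DB
    "Battery died within a week.",
    "Great value, fast delivery.",
]

for review in reviews:
    response = client.chat.completions.create(
        model="gemma4-8k",  # the 8k-context alias saved in Step 4
        messages=[{
            "role": "user",
            "content": f"Classify this review as positive, negative, or mixed: {review}",
        }],
    )
    print(f"{response.choices[0].message.content.strip()} | {review}")
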
What broke for me
Two issues. First, Ollama would silently OOM on long prompts as the context filled, with no error in the foreground; the model just went quiet. Watching journalctl -u ollama -f during a long prompt made the diagnosis obvious: the OOM kill shows up clearly there. The fix was adding swap (sudo fallocate -l 8G /swapfile etc.), which gave me headroom for occasional spikes without buying more RAM.
Second, on Ubuntu 24.04 with the systemd-managed Ollama, the model cache lived under /usr/share/ollama/.ollama/models/ rather than ~/.ollama/models/ as the docs sometimes imply. When I tried to symlink models from another machine, the systemd service ran as a separate user and could not see the symlinks. The fix was running Ollama as my own user (systemctl stop ollama, then OLLAMA_HOST=127.0.0.1:11434 ollama serve) for the migration session.
What it costs
| Item | Cost |
|---|---|
| Ollama | Free (MIT) |
| Gemma 4 9B | Free (custom Google licence) |
| Disk (5.5GB) | Negligible |
| Electricity (heavy use) | ~Rs 200/mo |
| Network (initial pull) | One-time 5.5GB |
For Indian operators running bulk text work, the math against API pricing is decisive. 100,000 summaries on Sonnet 4.6 would be Rs 4,000+ in API spend; locally on Gemma 4 9B it is electricity money.
When NOT to use this
Skip Gemma 4 9B locally if your work needs frontier reasoning quality. Gemma 4 9B is solid but not Claude Sonnet; for novel architecture work, code review of complex business logic, or anything where one wrong inference is expensive, the cloud frontier still wins.
Skip if you have less than 16GB RAM and no swap. The 2B variant works in 8GB but the quality drop versus 9B is noticeable for any real task.
Indian operator angle
Local Gemma is the only zero-marginal-cost LLM workflow I have found that holds up for production-volume work. For a content factory, bulk product description generation, or classification at scale, a Tier-2 city studio with one good machine can run more inference per day than a Mumbai consultancy paying API bills.
For Indian-language work, Gemma 4 handles Hindi reasonably; Krutrim, Sarvam-1, or Bhashini-tuned Gemma variants are better for native fluency. Pull krutrim:7b or sarvam-1:7b from Ollama for India-trained baselines.
Related
More Automation

Cloudflare Workers AI, Edge Inference Without Your Own GPU
Workers AI runs Llama, Mistral, and Stable Diffusion at Cloudflare's edge. I tried it for a low-latency demo. This is the setup, with the rate-limit gotcha that bit me.

Coolify Deploy LLM App On Oracle ARM, Free Forever
Coolify is the self-hosted PaaS I use across the empire. Paired with Oracle ARM's free tier, it deploys Node, Python, and Go LLM apps at zero monthly cost. This is the install.

CrewAI Multi-Agent Orchestration, A Real Workflow That Shipped
CrewAI is the most popular multi-agent orchestration framework. I built a real research crew with it. This is the install, the workflow, and the gotcha that ate my afternoon.