Building a Hindi Voice Banking Bot With Sarvam AI
Voice AI for 22 Indian languages — IVR, customer support, and regional outreach
Sarvam AI is among India's most credible foundation-model builders for the 22 scheduled Indian languages. OpenAI and Anthropic handle Hindi well, but Sarvam's models are trained specifically for Indic languages, so voice quality, accent handling, and code-switching (Hindi with English words mixed in) all feel native.
Why Sarvam Over OpenAI For Regional Languages
Quick comparison on Hindi voice:
| Task | OpenAI Voice | Sarvam Voice |
|---|---|---|
| Hindi pronunciation | Decent but American accent leaks | Native Indian |
| Code-switching | Okay | Excellent |
| Regional (Tamil, Telugu) | Weak | Native |
| Latency from India | 400-600ms | 150-300ms |
| Pricing | $0.03/min | ~Rs 1.50/min |
For Indian consumer-facing use cases, Sarvam wins on authenticity.
Setup — API Access
- Sign up at sarvam.ai
- Dashboard → API Keys → generate
- Free tier: 1,000 minutes/month
- Paid: ~Rs 1.50/min (voice)
Environment variable:
export SARVAM_API_KEY="sk-sarvam-..."
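A minimal sketch of reading that variable at startup so a missing key fails early rather than on the first API call. The helper name `load_sarvam_key` is hypothetical, not part of any Sarvam SDK.

```python
import os


def load_sarvam_key(env_var: str = "SARVAM_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before starting the bot")
    return key
```

Calling `SARVAM_KEY = load_sarvam_key()` once at import time gives the later steps a single place to get the key from.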
Use Case: Banking Balance Inquiry Voice Bot
An Indian customer calls, speaks in Hindi, and wants their balance.
Architecture
[Phone (Exotel/Twilio)]
↓
[Your webhook (FastAPI)]
↓
[Sarvam STT: audio → Hindi text]
↓
[Intent classifier (Sarvam-30B)]
↓
[Banking API (mock)]
↓
[Sarvam TTS: response text → Hindi audio]
↓
[Response to phone]
Step 1 — STT (Speech-To-Text)
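import os
import requests

# Reads the key exported in the setup step.
SARVAM_KEY = os.environ["SARVAM_API_KEY"]

def hindi_stt(audio_file_path: str) -> str:
    """Transcribe a Hindi audio file with Sarvam speech-to-text."""
    with open(audio_file_path, "rb") as f:
        # Send the request while the file is still open.
        response = requests.post(
            "https://api.sarvam.ai/speech-to-text",
            headers={"API-Subscription-Key": SARVAM_KEY},
            files={"file": f},
            data={"language_code": "hi-IN", "model": "saarika:v2"},
            timeout=30,
        )
    response.raise_for_status()  # Surface HTTP errors instead of a KeyError below.
    return response.json()["transcript"]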
# Example:
# Input audio: "Mera balance kya hai?"
# Output text: "मेरा balance क्या है?"
Note: Sarvam handles code-switching (Hindi + English) automatically.
Step 2 — Intent Classification
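import json

def classify_intent(text: str) -> dict:
    """Classify a transcript into one banking intent, e.g. {"type": "balance"}."""
    response = requests.post(
        "https://api.sarvam.ai/chat/completions",
        headers={"Authorization": f"Bearer {SARVAM_KEY}"},
        json={
            "model": "sarvam-30b",
            "messages": [
                {"role": "system", "content": 'You are a banking assistant. Classify the user query into one intent: balance, transfer, statement, help. Reply with JSON only, in the form {"type": "<intent>"}.'},
                {"role": "user", "content": text},
            ],
            "response_format": {"type": "json_object"},
        },
        timeout=30,
    )
    response.raise_for_status()
    # The model's JSON reply is a string inside the completion payload; parse it
    # so callers can read intent["type"] directly.
    return json.loads(response.json()["choices"][0]["message"]["content"])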
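Model output is not guaranteed to be valid JSON or to stay within the four intents. A minimal sketch of a defensive parser follows. It assumes the completion payload has the `choices[0].message.content` shape used in Step 4 and that the model replies with a `type` key. The name `parse_intent` and the fallback to `help` are illustrative choices, not part of the Sarvam API.

```python
import json

ALLOWED_INTENTS = {"balance", "transfer", "statement", "help"}


def parse_intent(completion: dict) -> dict:
    """Extract {"type": ...} from a chat-completion payload, falling back to "help"."""
    try:
        content = completion["choices"][0]["message"]["content"]
        parsed = json.loads(content)
        intent_type = str(parsed.get("type", "")).strip().lower()
    except (KeyError, IndexError, TypeError, AttributeError, json.JSONDecodeError):
        # Malformed payload or non-JSON reply: route the caller to help.
        return {"type": "help"}
    if intent_type not in ALLOWED_INTENTS:
        return {"type": "help"}
    return {"type": intent_type}
```

Falling back to `help` keeps the call going when the model replies with something unexpected, which is usually better than failing in the middle of a phone call.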
Step 3 — Banking Action (Mock)
def get_balance(user_id: str) -> dict:
    # In production: call the actual banking API
    return {"balance": 12500, "currency": "INR", "account": "XXXX1234"}
Step 4 — Response Generation
def generate_response(intent: dict, banking_data: dict) -> str:
    """Turn the intent and banking data into a short, formal Hindi reply."""
    response = requests.post(
        "https://api.sarvam.ai/chat/completions",
        headers={"Authorization": f"Bearer {SARVAM_KEY}"},
        json={
            "model": "sarvam-30b",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a polite Hindi banking assistant. Respond in 2-3 sentences. Use formal 'aap' form. Amounts in rupees with 'rupay'.",
                },
                {
                    "role": "user",
                    "content": f"User asked: {intent}. Banking data: {banking_data}. Respond in Hindi.",
                },
            ],
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
Step 5 — TTS (Text-To-Speech)
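import base64

def hindi_tts(text: str) -> bytes:
    """Synthesize Hindi speech and return the decoded audio bytes."""
    response = requests.post(
        "https://api.sarvam.ai/text-to-speech",
        headers={"API-Subscription-Key": SARVAM_KEY},
        json={
            "text": text,
            "target_language_code": "hi-IN",
            "speaker": "meera",  # Indian female voice; confirm the name is valid for the model below
            "pitch": 0,
            "pace": 1.0,
            "loudness": 1.5,
            "speech_sample_rate": 22050,
            "enable_preprocessing": True,
            "model": "bulbul:v2",
        },
        timeout=30,
    )
    response.raise_for_status()
    # The API returns base64-encoded audio; decode it to match the bytes return type.
    return base64.b64decode(response.json()["audios"][0])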
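The TTS endpoint delivers audio as a base64 string inside `audios`. A minimal sketch of decoding such a string and writing it to disk so a telephony provider can fetch it. The helper name `save_tts_audio` is hypothetical, and the `.wav` extension in the usage note is an assumption about the container format.

```python
import base64
from pathlib import Path


def save_tts_audio(audio_b64: str, out_path: str) -> int:
    """Decode base64 audio and write it to disk; returns the number of bytes written."""
    audio_bytes = base64.b64decode(audio_b64)
    Path(out_path).write_bytes(audio_bytes)
    return len(audio_bytes)
```

For example, `save_tts_audio(payload["audios"][0], "reply.wav")` would write the first clip of a TTS response held in `payload`.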
Full Pipeline
def handle_call(audio_input_path: str) -> bytes:
    """Run one caller turn: Hindi audio in, Hindi audio reply out."""
    user_text = hindi_stt(audio_input_path)
    intent = classify_intent(user_text)
    if intent.get("type") == "balance":
        data = get_balance(user_id="USER_001")
    else:
        # ... handle other intents
        data = {}  # Placeholder so the reply step still runs for unhandled intents.
    response_text = generate_response(intent, data)
    audio_response = hindi_tts(response_text)
    return audio_response
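One way to handle the other intents is a dispatch table instead of a growing if/else chain. The sketch below is illustrative only: `get_statement`, `get_help`, and `dispatch` are hypothetical names, the returned data is mock data in the spirit of Step 3, and `transfer` is deliberately left out because it would need caller authentication first.

```python
from typing import Callable, Dict


def get_balance(user_id: str) -> dict:
    # Mock data, mirroring Step 3.
    return {"balance": 12500, "currency": "INR", "account": "XXXX1234"}


def get_statement(user_id: str) -> dict:
    # Hypothetical mock: a real system would query recent transactions.
    return {"transactions": [], "account": "XXXX1234"}


def get_help(user_id: str) -> dict:
    # Hypothetical mock: topics the bot can talk about.
    return {"topics": ["balance", "transfer", "statement"]}


HANDLERS: Dict[str, Callable[[str], dict]] = {
    "balance": get_balance,
    "statement": get_statement,
    "help": get_help,
}


def dispatch(intent: dict, user_id: str) -> dict:
    """Route an intent to its handler; unknown or missing intents fall back to help."""
    handler = HANDLERS.get(intent.get("type"), get_help)
    return handler(user_id)
```

Inside `handle_call`, `data = dispatch(intent, user_id="USER_001")` would then replace the if/else branch.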
Phone Integration (Exotel)
Exotel is a leading Indian cloud telephony provider. You can connect it to your webhook:
# Exotel flow
- Record user audio (up to 30 sec)
- POST to your webhook with the audio URL
- Your webhook downloads audio, calls handle_call()
- Your webhook stores the response audio and returns its URL
- Exotel plays the response to the caller
Typical round-trip: 3-5 seconds. Acceptable for a banking use case.
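The webhook's job in this flow can be sketched as one function with its I/O injected, which keeps it testable without a telephony account. Everything here is an assumption for illustration: `handle_webhook`, the three callables, and the `play_url` response key are hypothetical, and the payload Exotel actually sends and expects should be taken from its documentation.

```python
from typing import Callable


def handle_webhook(
    recording_url: str,
    download: Callable[[str], bytes],
    run_pipeline: Callable[[bytes], bytes],
    publish: Callable[[bytes], str],
) -> dict:
    """Download the caller's recording, run the bot, and return a playable URL."""
    caller_audio = download(recording_url)    # fetch the recorded question
    reply_audio = run_pipeline(caller_audio)  # STT -> intent -> banking -> TTS
    reply_url = publish(reply_audio)          # store the reply where the provider can fetch it
    return {"play_url": reply_url}            # hypothetical response shape
```

In a FastAPI route, `download` could wrap `requests.get(...).content` and `publish` could upload to object storage. Timing each of the three calls separately shows where the 3-5 second round-trip is spent.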
Cost Analysis (1000 calls/day)
- Average call: 2 minutes
- STT: 2 min × 1000 = 2000 min × Rs 0.50/min = Rs 1,000
- TTS: 1 min × 1000 = 1000 min × Rs 1.00/min = Rs 1,000
- LLM: 2 API calls × 1000 = 2,000 × Rs 0.10 = Rs 200
- Exotel: Rs 0.50/call × 1000 = Rs 500
Total: Rs 2,700/day (~Rs 81,000/month for 30k calls)
Versus a human call centre: Rs 15-20/call = Rs 15,000-20,000/day. At Rs 2.70 per call, AI is roughly 5.5-7.5x cheaper.
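The arithmetic above can be reproduced with a few lines, which makes it easy to plug in your own call volume or rates. All rates are the ones quoted in this section; the function names are illustrative.

```python
CALLS_PER_DAY = 1000
STT_MIN_PER_CALL, STT_RATE = 2, 0.50      # minutes per call, Rs per minute
TTS_MIN_PER_CALL, TTS_RATE = 1, 1.00      # minutes per call, Rs per minute
LLM_CALLS_PER_CALL, LLM_RATE = 2, 0.10    # API calls per phone call, Rs per API call
TELEPHONY_PER_CALL = 0.50                 # Rs per phone call


def ai_cost_per_call() -> float:
    """Cost in rupees of one call through the full pipeline."""
    return (
        STT_MIN_PER_CALL * STT_RATE
        + TTS_MIN_PER_CALL * TTS_RATE
        + LLM_CALLS_PER_CALL * LLM_RATE
        + TELEPHONY_PER_CALL
    )


def daily_cost(calls: int = CALLS_PER_DAY) -> float:
    """Total daily cost in rupees for the given call volume."""
    return ai_cost_per_call() * calls


def savings_ratio(human_cost_per_call: float) -> float:
    """How many times cheaper the AI pipeline is than a human-handled call."""
    return human_cost_per_call / ai_cost_per_call()


if __name__ == "__main__":
    print(f"Per call: Rs {ai_cost_per_call():.2f}")            # Rs 2.70
    print(f"Per day:  Rs {daily_cost():,.0f}")                 # Rs 2,700
    print(f"Per 30 days: Rs {daily_cost() * 30:,.0f}")         # Rs 81,000
    print(f"Savings: {savings_ratio(15):.1f}x to {savings_ratio(20):.1f}x")  # 5.6x to 7.4x
```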
Regional Languages
The same code works for:
- Tamil: language_code: "ta-IN"
- Telugu: te-IN
- Bengali: bn-IN
- Marathi: mr-IN
- Gujarati: gu-IN
- Kannada: kn-IN
- Malayalam: ml-IN
- Punjabi: pa-IN
- Odia: or-IN
All 22 scheduled languages are supported.
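A minimal sketch of switching languages by name, using only the codes listed above plus the `hi-IN` code from Step 1. `LANGUAGE_CODES` and `stt_params` are hypothetical helpers; the default model name is the one used in the STT step.

```python
LANGUAGE_CODES = {
    "hindi": "hi-IN",
    "tamil": "ta-IN",
    "telugu": "te-IN",
    "bengali": "bn-IN",
    "marathi": "mr-IN",
    "gujarati": "gu-IN",
    "kannada": "kn-IN",
    "malayalam": "ml-IN",
    "punjabi": "pa-IN",
    "odia": "or-IN",
}


def stt_params(language: str, model: str = "saarika:v2") -> dict:
    """Build the form fields for the speech-to-text request in a given language."""
    key = language.strip().lower()
    if key not in LANGUAGE_CODES:
        raise ValueError(f"No language code configured for {language!r}")
    return {"language_code": LANGUAGE_CODES[key], "model": model}
```

Passing `data=stt_params("tamil")` in the STT request switches recognition to Tamil; the TTS step would take the same code as `target_language_code`.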
Common Pitfalls
- Numbers spoken vs written: "12,500" should be read as "bara hazar paanch sau"; Sarvam TTS handles this if enable_preprocessing: true
- Long text latency: break responses into <200 chars for faster TTS
- Accent drift: test with real users — some regional speakers may still hit recognition errors
- Silence detection: include VAD in your phone integration
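For the long-text pitfall, a minimal sketch of splitting a reply at sentence boundaries so each TTS request stays under the limit. The 200-character default comes from the list above; `chunk_for_tts` is a hypothetical helper, and the hard wrap for over-long sentences may cut inside a word.

```python
import re
from typing import List


def chunk_for_tts(text: str, max_chars: int = 200) -> List[str]:
    """Split text at sentence boundaries into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation, including the Devanagari danda.
    sentences = [s for s in re.split(r"(?<=[.!?।])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # the sentence still fits in the current chunk
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to TTS in order, and playback of the first chunk can start while the later ones are still being synthesized.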
Beyond Banking
- Government helplines (PMKVY, Ayushman Bharat)
- Telco customer support (Jio, Airtel)
- Agri-tech advisory (crop prices in regional languages)
- Healthcare triage
The Sarvam ecosystem is growing fast. Their recent $350M raise should bring more capabilities soon.