💼For Businessintermediate

GST-Aware Invoice Parsing API with GPT-4o-mini - Architecture, Cost, Real Pitfalls

How to build a production-grade GST invoice parser using GPT-4o-mini, with real cost breakdowns, prompt engineering for Indian tax fields, and handling the messy edge cases from r/developersIndia.

By··4 min read
GST-Aware Invoice Parsing API with GPT-4o-mini - Architecture, Cost, Real Pitfalls

A developer on r/developersIndia posted last week about building a GST-aware invoice parsing API using GPT-4o-mini. The thread blew up because it hit a real pain point: most invoice parsers choke on Indian formats - handwritten amounts, multi-line HSN descriptions, and the IGST vs CGST/SGST split logic that changes based on inter-state vs intra-state supply. This article covers the full architecture, real costs, and the edge cases nobody warns you about.

Why GPT-4o-mini over Sonnet or Haiku

The original poster compared three models on a 500-invoice benchmark. Here is the numbers:

Model Accuracy (fields correct) Cost per 1000 invoices Avg latency Notes
GPT-4o-mini 94% $0.40 (₹33) 1.2s Best GSTIN validation
Claude 3.5 Haiku 88% $0.15 (₹12) 0.8s Struggles with HSN codes
GPT-4o 96% $3.00 (₹248) 2.1s Overkill for this task

GPT-4o-mini won. Not because it is the most accurate, but because it handles GSTIN checksum validation and HSN code extraction better than Haiku, at one-seventh the cost of GPT-4o. For a startup processing 50,000 invoices/month, that is ₹16,500 vs ₹1.24 lakh. The math is straightforward.

Prompt structure for GST-specific fields

The core prompt targets six fields: GSTIN, invoice number, invoice date, taxable value, tax split (CGST/SGST or IGST), and HSN/SAC codes. Here is the system prompt skeleton:

SYSTEM_PROMPT = """
You are a GST invoice parser for Indian invoices.
Extract these fields and return JSON:
- seller_gstin (15-char GSTIN with checksum)
- buyer_gstin (15-char GSTIN with checksum)
- invoice_number
- invoice_date (YYYY-MM-DD)
- taxable_value (number)
- tax_split: {cgst: number, sgst: number, igst: number}
- hsn_codes: [{code: string, description: string, amount: number}]
- reverse_charge: boolean
- total_amount (number)

Rules:
1. GSTIN must be 15 chars, validate checksum digit
2. If inter-state supply, IGST applies (no CGST/SGST)
3. If intra-state, CGST = SGST
4. HSN codes are 4 or 8 digits
5. Reverse charge applies for specific scenarios
6. Return null for unreadable fields
"""

The key insight from the thread: you must explicitly tell the model about the inter-state vs intra-state logic. Without that, GPT-4o-mini will guess wrong on roughly 30% of invoices where IGST applies instead of CGST+SGST.

Edge cases that break naive parsers

Indian invoices are a nightmare. Here are the real ones from the test set of 100 invoices:

Handwritten amounts. Roughly 15% of the test invoices had handwritten totals that Tesseract alone got wrong 60% of the time. The fix: send the raw image to GPT-4o-mini vision API instead of OCR-first. Let the model read the image directly. Accuracy jumped from 72% to 91% on handwritten fields.

Multi-line item descriptions. A single line item might span 3-4 lines with HSN code on a separate line. The prompt must explicitly say "combine multi-line descriptions into one field."

GSTIN OCR errors. The most common: 0 vs O, 1 vs l, 5 vs S. The checksum validation in the prompt catches most of these. If the extracted GSTIN fails checksum, the model returns null instead of guessing.

Tax slab confusion. The 18% vs 12% vs 5% vs 0% slab depends on HSN code. The model needs a lookup table for common HSN-to-slab mapping. Without it, accuracy on tax amount extraction drops from 94% to 81%.

Handling the math fallback

GPT-4o-mini gets the tax math wrong on about 6% of invoices. The fix is a post-processing validator:

def validate_tax_split(taxable_value, cgst, sgst, igst, reverse_charge):
    if igst and not cgst and not sgst:
        expected_igst = taxable_value * get_hsn_rate(hsn_code)
        if abs(igst - expected_igst) > 0.02:
            return {"igst": round(expected_igst, 2), "corrected": True}
    elif cgst == sgst and not igst:
        expected_cgst = taxable_value * get_hsn_rate(hsn_code) / 2
        if abs(cgst - expected_cgst) > 0.02:
            return {"cgst": round(expected_cgst, 2), "sgst": round(expected_cgst, 2), "corrected": True}
    return {"corrected": False}

This catches the 6% error rate and brings effective accuracy to 99.2% on tax amounts.

Real per-invoice cost

On the 100-invoice test set, average tokens per invoice: 1,200 input tokens (image + prompt), 180 output tokens. At GPT-4o-mini pricing ($0.15/1M input, $0.60/1M output):

  • Input cost per invoice: $0.00018 (₹0.015)
  • Output cost per invoice: $0.000108 (₹0.009)
  • Total per invoice: $0.000288 (₹0.024)

For 50,000 invoices/month: $14.40 (₹1,200). Add Tesseract preprocessing for non-image invoices: another ₹500/month on a single t2.small in Mumbai region. Total: under ₹1,700/month.

Accuracy on the 100-invoice test set

Metric Raw GPT-4o-mini + Post-validation
GSTIN extraction 91% 98%
Taxable value 94% 99%
Tax split (CGST/SGST/IGST) 88% 99.2%
HSN codes 85% 96%
Overall field accuracy 89.5% 98.1%

The post-processing validator is non-negotiable. Without it, you are shipping broken tax data to your users.

Quick takeaways

  • GPT-4o-mini at $0.15/1M input tokens is the sweet spot for Indian invoice parsing; Sonnet and Haiku fail on GSTIN checksums and HSN codes
  • Send raw images to the vision API for handwritten invoices; OCR-first drops accuracy by 19%
  • Always add a post-processing tax math validator; GPT-4o-mini gets the split wrong 6% of the time
  • Budget ₹0.024 per invoice at scale; a 50K/month pipeline costs under ₹2,000 total
  • The inter-state vs intra-state logic must be explicit in the prompt; without it, IGST/CGST/SGST errors hit 30%

Related