AutoKaam Playbook
Langfuse, the LLM Observability Stack I Actually Run
Self-hosted tracing for prompts, latency, cost, and eval results, with no per-trace pricing.
Last reviewed:
The operator take
Langfuse is the only LLM observability tool I let through to empire production. I tried PostHog with the LLM addon, I tried Helicone, I tried OpenLLMetry, and I came back to Langfuse because it does the one thing I want: trace every Claude call across every empire app, in one searchable place, without metering me on traces.
What I run today is the Langfuse Cloud free tier for the public empire endpoints (PramaanAI API, taxwallaai webhooks, autokaam social composer). The free tier gives me 50K observations a month, and that has been enough so far. For the experimental stuff and anything that touches private data, I run Langfuse self-hosted on the same Coolify box that hosts kaam-tracker. The official setup is a single compose file in their repo, but raw Docker is banned in my empire, so I deploy it as separate Coolify services with PostgreSQL alongside.
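Pointing an app at either instance is three settings. A minimal sketch assuming the Langfuse Python SDK; the keys and host URL are placeholders for whatever your own Coolify deployment exposes.

```python
from langfuse import Langfuse

# Placeholder credentials: use the project keys from your own Langfuse instance.
langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://langfuse.example.internal",  # self-hosted URL instead of cloud.langfuse.com
)

# Sanity-check the keys and host before shipping real traffic.
assert langfuse.auth_check()
```

The SDK also reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST from the environment, which keeps per-deployment config out of code.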
The reason traces matter more in 2026 than they did in 2025 is that empire workloads are now agentic. A single user request fans out to four or five Claude calls, each with tool use, each with retries. When something silently produces a bad output, the only way to find out which sub-step lied is to look at the trace tree. I caught a bug last month where the autokaam social composer was passing the wrong article slug into the headline-rewriter call, and the only reason I caught it was a Langfuse trace showing that the input slug and the output slug did not match. Without Langfuse I would have been grepping logs for hours.
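Here is roughly what that composer trace looks like as code. A minimal sketch assuming the v2-style Langfuse Python SDK (the low-level trace/generation client); rewrite_headline and the model name are hypothetical stand-ins for the real Claude call.

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* keys and host from the environment


def compose_social_post(article_slug: str) -> str:
    # One trace per user request; every sub-step hangs off it as a child.
    trace = langfuse.trace(name="social-composer", input={"article_slug": article_slug})

    # Sub-step: headline rewrite. Logging input and output per step is what
    # exposed the slug mismatch described above.
    gen = trace.generation(
        name="headline-rewriter",
        model="claude-sonnet",  # placeholder model name
        input={"article_slug": article_slug},
    )
    headline = rewrite_headline(article_slug)  # hypothetical helper that calls Claude
    gen.end(output={"headline": headline})

    trace.update(output={"headline": headline})
    return headline
```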
The Indian-operator angle here is cost. Datadog wants 800 dollars a month for the equivalent stack. PostHog with the LLM hooks is reasonable but its UX for prompt comparison is weak. Langfuse self-hosted costs me the marginal Postgres on the Coolify box, and the cloud free tier covers most empire production.
What I would change is the eval module. The Langfuse evals UI is good for spot-checking but bad for building a regression bench. I run a separate evals harness in Python and only push the headline numbers back into Langfuse as datasets. Two systems, fine.
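Pushing those headline numbers back is a couple of SDK calls per case. A minimal sketch assuming the v2-style Python SDK; eval_cases, the dataset name, and the score name are placeholders from my own harness.

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Mirror the external bench into a Langfuse dataset (names are illustrative).
langfuse.create_dataset(name="headline-rewriter-bench")

for case in eval_cases:  # produced by the separate Python evals harness
    langfuse.create_dataset_item(
        dataset_name="headline-rewriter-bench",
        input=case["input"],
        expected_output=case["expected"],
    )
    # Attach the harness verdict to the trace that produced this output.
    langfuse.score(
        trace_id=case["trace_id"],
        name="headline-accuracy",
        value=case["score"],  # 0.0 to 1.0 from the external harness
    )

langfuse.flush()  # the SDK batches events; flush before the script exits
```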
If you build anything LLM-shaped this year and you are not tracing, you are flying blind. Langfuse is the cheapest way to see clearly.
Why it matters in 2026
Agentic workflows fan out a single request into 5-10 LLM calls. Without traces you cannot debug. Per-call observability vendors charge brutally; Langfuse is free to self-host and the cloud free tier covers most production. The 2026 LLM bill is now bigger than the compute bill for most empire-scale AI apps, and you cannot optimize what you cannot measure.
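Measuring starts with attribution: if every trace carries a user id and a feature tag, the Langfuse UI can slice token cost per customer and per feature. A minimal sketch assuming the v2-style Python SDK; the user id and metadata values are placeholders.

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Tag the trace up front so cost and latency can be grouped per user and per
# feature later; user_id, metadata, and tags are standard trace fields.
trace = langfuse.trace(
    name="taxwallaai-webhook",
    user_id="user_1842",                  # placeholder customer id
    metadata={"feature": "gst-summary"},  # placeholder feature tag
    tags=["production", "webhook"],
)
```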
Cost in INR
Free self-hosted; Cloud free tier 50K observations/mo; Pro from Rs 5,000/mo
Use when
- Any production app calling Claude or GPT more than 100 times a day
- Agentic workflows with multi-step tool use
- You need to compare prompt versions or A/B model picks
- Cost attribution per user or per feature matters
Skip when
- Single-call demos with no production users
- You already have OpenTelemetry plus a custom dashboard you like
- Traces contain regulated data your jurisdiction forbids storing
Alternatives I would consider
- PostHog with the LLM hooks: reasonable pricing, but the prompt-comparison UX is weak
- Helicone: tried it, came back to Langfuse
- OpenLLMetry: tried it, same story
- Datadog: the equivalent stack runs about 800 dollars a month
Read next
Adjacent in the playbook
- n8n, Workflow Automation Without the Cloud Tax: Free self-hosted; Cloud tier from Rs 2,000/mo (lower run caps than Zapier at the same price)
- LangChain, the Framework I Have a Love-Hate Relationship With: Free open source; LangSmith (paid sister product) from Rs 4,000/mo
- Free open source; Enterprise from Rs 8,000/mo (managed observability)