AutoKaam Playbook

Langfuse, the LLM Observability Stack I Actually Run

Self-hosted tracing for prompts, latency, cost, and eval results, with no per-trace pricing.

Last reviewed:

The operator take

Langfuse is the only LLM observability tool I let through to empire production. I tried PostHog with the LLM addon, I tried Helicone, I tried OpenLLMetry, and I came back to Langfuse because it does the one thing I want: trace every Claude call across every empire app, in one searchable place, without metering me on traces.

What I run today is Langfuse Cloud's free tier for the public empire endpoints (PramaanAI API, taxwallaai webhooks, autokaam social composer). The free tier gives me 50K observations a month, and that has been enough so far. For the experimental stuff and anything that touches private data, I run Langfuse self-hosted on the same Coolify box that hosts kaam-tracker. The upstream setup is one compose file in their repo, but Docker Compose is banned in my empire, so I deploy it as separate Coolify services with PostgreSQL alongside.
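Pointing an app at the self-hosted instance is just a host override on the client. A minimal sketch with the Langfuse Python SDK (v2-style constructor; the keys and hostname below are placeholders, not my real values):

```python
# Minimal sketch: point the Langfuse Python SDK at a self-hosted instance
# instead of cloud.langfuse.com. Keys and hostname are placeholders.
from langfuse import Langfuse

langfuse = Langfuse(
    public_key="pk-lf-...",   # from the project settings in the self-hosted UI
    secret_key="sk-lf-...",
    host="https://langfuse.example.internal",  # hypothetical Coolify hostname
)

# Sanity-check credentials and connectivity before wiring it into an app.
assert langfuse.auth_check()
```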

The reason traces matter more in 2026 than they did in 2025 is that empire workloads are now agentic. A single user request fans out to four or five Claude calls, each with tool use, each with retries. When something silently produces a bad output, the only way to find out which sub-step lied is to look at the trace tree. I caught a bug last month where the autokaam social composer was passing the wrong article slug into the headline-rewriter call, and the only reason I caught it was a Langfuse trace showing that the input slug and the output slug did not match. Without Langfuse I would have been grepping logs for hours.
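For what that looks like in code: a sketch of instrumenting one fan-out sub-step with the v2 SDK's low-level API, so what goes in and what comes out both land in the trace tree. The function names, model id, and stub LLM call are hypothetical stand-ins, not the real composer code:

```python
# Sketch: instrument one agentic sub-step so its input and output are
# inspectable in the Langfuse trace tree. All names are hypothetical.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / SECRET_KEY / HOST env vars

def rewrite_headline(slug: str) -> str:
    # Stand-in for the real Claude call.
    return f"Headline for {slug}"

def compose_post(article_slug: str) -> str:
    trace = langfuse.trace(name="social-composer", input={"slug": article_slug})

    # The generation records exactly what the sub-step received and returned;
    # a wrong slug passed downstream shows up as an input/output mismatch.
    gen = trace.generation(
        name="headline-rewriter",
        model="claude-sonnet-4",  # placeholder model id
        input={"slug": article_slug},
    )
    headline = rewrite_headline(article_slug)
    gen.end(output={"headline": headline})

    trace.update(output={"headline": headline})
    return headline
```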

The Indian-operator angle here is cost. Datadog wants 800 dollars a month for the equivalent stack. PostHog with the LLM hooks is reasonable, but its UX for prompt comparison is weak. Langfuse self-hosted costs me only the marginal Postgres load on the Coolify box, and the cloud free tier covers most empire production.

What I would change is the eval module. The Langfuse evals UI is good for spot-checking but bad for building a regression bench. I run a separate evals harness in Python and only push the headline numbers back into Langfuse as datasets. Two systems, fine.
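One plausible shape for that push, assuming the v2 SDK's dataset and score APIs; the dataset name, metric name, and eval_results structure are all hypothetical:

```python
# Sketch: the external Python evals harness pushes only the headline numbers
# back into Langfuse. Dataset/metric names and result shape are hypothetical.
from langfuse import Langfuse

langfuse = Langfuse()

langfuse.create_dataset(name="headline-regression")  # one-time setup

eval_results = [  # illustrative output of the separate harness
    {"input": {"slug": "gst-filing-2026"}, "expected": "GST Filing in 2026",
     "trace_id": "trace-abc123", "accuracy": 0.92},
]

for r in eval_results:
    # Store the case itself for later regression runs.
    langfuse.create_dataset_item(
        dataset_name="headline-regression",
        input=r["input"],
        expected_output=r["expected"],
    )
    # Attach the headline metric to the trace that produced the output.
    langfuse.score(trace_id=r["trace_id"], name="headline-accuracy",
                   value=r["accuracy"])
```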

If you build anything LLM-shaped this year and you are not tracing, you are flying blind. Langfuse is the cheapest way to see clearly.

Why it matters in 2026

Agentic workflows fan out a single request into 5-10 LLM calls; without traces you cannot debug them. Per-call observability vendors charge brutally, while Langfuse is free to self-host and the cloud free tier covers most production. For most empire-scale AI apps the 2026 LLM API bill is now bigger than the rest of the compute bill, and you cannot optimize what you cannot measure.
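Measuring starts with attribution: tag each trace with a user id and report token usage on the generation, and Langfuse can roll cost up per user or per feature. A sketch under the same v2-SDK assumption, with placeholder names and counts:

```python
# Sketch: per-user cost attribution. Tag the trace with user_id and report
# token usage on the generation; model id and counts are placeholders.
from langfuse import Langfuse

langfuse = Langfuse()

trace = langfuse.trace(name="taxwallaai-webhook", user_id="user-42")
gen = trace.generation(
    name="classify-invoice",
    model="claude-sonnet-4",               # placeholder model id
    usage={"input": 1200, "output": 300},  # token counts from the API response
)
gen.end(output={"category": "GST"})
```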

Cost in INR

Self-hosted: free; Cloud free tier: 50K observations/mo; Cloud Pro: from Rs 5,000/mo

Use when

  • Any production app calling Claude or GPT more than 100 times a day
  • Agentic workflows with multi-step tool use
  • You need to compare prompt versions or A/B model picks
  • Cost attribution per user or per feature matters

Skip when

  • Single-call demos with no production users
  • You already have OpenTelemetry plus a custom dashboard you like
  • Traces contain regulated data your jurisdiction forbids storing

Alternatives I would consider