I Self-Hosted Langfuse to Watch My AI Agents Think

Two weeks ago I shipped an AI agent that helps our procurement team check vendor proposals against our ERP database. It worked great in testing. In production? Silent failures. The LLM would sometimes return garbled JSON, sometimes skip steps entirely, and I had no way to know until someone complained.

That’s when I decided to set up Langfuse — self-hosted, running alongside the rest of our internal tools.

Why Langfuse Over Logging to a File

I tried the obvious thing first: the logging equivalent of print statements. Every LLM call was appended to a JSONL file with the prompt, response, and latency. That worked for about three days, until the file grew to 200 MB and I stopped looking at it.
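For reference, that kind of file logging takes only a few lines. This is a minimal sketch rather than the exact code from the project: call_model and the llm_calls.jsonl path are hypothetical stand-ins for whatever function and file you actually use.

```python
import json
import time


def log_llm_call(path, prompt, call_model):
    """Call the model, then append prompt, response, and latency as one JSON line."""
    start = time.perf_counter()
    response = call_model(prompt)  # call_model is a hypothetical stand-in for the real LLM call
    latency_ms = (time.perf_counter() - start) * 1000

    record = {
        "prompt": prompt,
        "response": response,
        "latency_ms": round(latency_ms, 1),
    }
    # Append mode: one record per line, which is what makes the file JSONL.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return response


if __name__ == "__main__":
    # A fake model so the sketch runs without any API key.
    log_llm_call("llm_calls.jsonl", "Summarize proposal 123", lambda p: "stub response")
```

It captures everything and organizes nothing: there is no notion of a session, a step, or a score, which is exactly the gap described next.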

The problem wasn’t collecting data — it was making sense of it. I needed to see traces grouped by session, spans labeled by step, and scores attached to outputs so I could filter bad calls instantly.

Langfuse gives you that dashboard, and self-hosting it means your data never goes to a third party. Setup was straightforward: I brought it up with Docker Compose on a Postgres backend, pointed the Python SDK at localhost:3000, and had traces showing up in under an hour.

Manual Tracing Is Worth the Extra Code

Langfuse offers a decorator-based approach that auto-instruments your functions. I tried it first because it’s less typing. But my agent has branching logic: if the vendor has preferred status, skip the approval step; if the proposal mentions custom pricing, run a separate validation chain. Out of the box, decorators capture which functions were called, not why a particular path was taken.

Manual tracing with the low-level API let me add custom spans and metadata at each decision point. I tagged spans with step: “vendor_lookup”, status: “preferred”, or skip_reason: “standard_pricing”. Now when something goes wrong, I can filter traces by any of those tags and replay exactly what happened.
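Here is a rough sketch of what tagging decision points can look like. It is an illustration under assumptions, not the agent's real code: check_proposal, the vendor dictionary, and the "preferred_vendor" skip reason are made up for the example, and RecordingTrace is a small stand-in so the snippet runs without a Langfuse server. The stand-in only mimics the two calls the sketch needs, trace.span(name=..., metadata=...) and span.end(output=...), which is how the v2-style low-level Python SDK is assumed to behave here.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingSpan:
    """Stand-in for a Langfuse span: keeps whatever it is given."""
    name: str
    metadata: dict
    output: object = None

    def end(self, output=None):
        self.output = output


@dataclass
class RecordingTrace:
    """Stand-in for a Langfuse trace so the sketch runs without a server."""
    spans: list = field(default_factory=list)

    def span(self, name, metadata=None):
        span = RecordingSpan(name=name, metadata=metadata or {})
        self.spans.append(span)
        return span


def check_proposal(trace, vendor, proposal_text):
    """Open one span per decision point and record why a branch was skipped."""
    lookup = trace.span(
        name="vendor_lookup",
        metadata={"step": "vendor_lookup", "status": vendor["status"]},
    )
    lookup.end(output={"vendor_id": vendor["id"]})

    # Branch 1: preferred vendors skip the approval step.
    approval_meta = {"step": "approval"}
    if vendor["status"] == "preferred":
        approval_meta["skip_reason"] = "preferred_vendor"  # hypothetical tag value
    approval = trace.span(name="approval", metadata=approval_meta)
    approval.end(output="skipped" if "skip_reason" in approval_meta else "requested")

    # Branch 2: only proposals that mention custom pricing get the extra validation.
    pricing_meta = {"step": "pricing_validation"}
    if "custom pricing" not in proposal_text.lower():
        pricing_meta["skip_reason"] = "standard_pricing"
    pricing = trace.span(name="pricing_validation", metadata=pricing_meta)
    pricing.end(output="skipped" if "skip_reason" in pricing_meta else "validated")


trace = RecordingTrace()
check_proposal(trace, {"id": "V-17", "status": "preferred"}, "Standard terms apply.")
for span in trace.spans:
    print(span.name, span.metadata, span.output)
```

With the real SDK you would pass in the trace object from your Langfuse client instead of RecordingTrace. Method names and arguments can differ between SDK versions, so check them against the version you have installed. The point of the pattern is that a skipped branch still produces a span, so the skip itself becomes something you can filter on.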

The key insight: manual tracing isn’t about more code — it’s about better questions. With decorators I could see “the agent called 4 functions.” With manual spans I can see “the agent called 4 functions, skipped 2 because the vendor had preferred status, and the LLM hallucinated a non-existent field name in step 3.”

What I Actually Caught

In the first week of tracing, I found:

1. Token waste on retries. When the LLM returned malformed JSON, the agent retried 3 times before falling back, burning roughly 8K tokens per failure. I added a JSON schema validator before the LLM call and cut retries by 80%.

2. Prompt drift. A supposedly stable system prompt had picked up extra newlines from a templating bug, so every single call was sending 47 extra whitespace characters. Not a disaster, but multiplied by 2,000 calls a day it adds up.

3. Silent model degradation. One afternoon the LLM started returning consistently shorter responses. Langfuse’s score tracking showed average response length dropped 40% in 3 hours. Turned out a provider-side update had changed the default max_tokens.
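The third finding comes down to simple arithmetic on a numeric score. As a hedged sketch, assuming you record response length per call and can pull two windows of values back out (the function names and the 30% threshold are illustrative, not anything Langfuse provides), the comparison might look like this:

```python
from statistics import mean


def length_drop(baseline_lengths, recent_lengths):
    """Fractional drop in mean response length (0.4 means 40% shorter).

    Both lists must be non-empty; statistics.mean raises on empty input.
    """
    base = mean(baseline_lengths)
    if base == 0:
        return 0.0  # nothing to compare against
    return (base - mean(recent_lengths)) / base


def is_degraded(baseline_lengths, recent_lengths, threshold=0.3):
    """Flag a window whose mean length fell by at least `threshold`."""
    return length_drop(baseline_lengths, recent_lengths) >= threshold


# Made-up numbers: responses averaging 1,000 characters, then 600.
baseline = [980, 1010, 1010]
recent = [590, 610, 600]
print(f"drop: {length_drop(baseline, recent):.0%}, degraded: {is_degraded(baseline, recent)}")
```

A length drop alone does not say what changed, only that something did. It is a cheap signal that tells you which afternoon's traces to open.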

The Setup That Worked

I’m running Langfuse on an old Ubuntu box in the corner of the office, the same one that hosts our internal dashboards. The whole stack is docker compose up for the Langfuse server and Postgres, then pip install langfuse for the Python SDK.

The Python integration is 4 lines to initialize, then trace.span() calls wherever I need visibility. I spent maybe 3 hours setting it up and another 2 hours adding spans to the critical paths.
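The initialization might look roughly like this. It is a sketch under assumptions: the keys come from environment variables, the helper names langfuse_settings and connect are invented for illustration, and the import is deferred so the snippet loads even on a machine without the SDK installed.

```python
import os


def langfuse_settings(env=None):
    """Collect SDK settings; host defaults to the local self-hosted instance."""
    env = os.environ if env is None else env
    return {
        "public_key": env.get("LANGFUSE_PUBLIC_KEY", ""),
        "secret_key": env.get("LANGFUSE_SECRET_KEY", ""),
        "host": env.get("LANGFUSE_HOST", "http://localhost:3000"),
    }


def connect(env=None):
    """Build a Langfuse client from the settings above."""
    # Imported here so the module still loads where the SDK is missing.
    from langfuse import Langfuse

    return Langfuse(**langfuse_settings(env))


if __name__ == "__main__":
    # Print only the host so the keys never end up in a terminal log.
    print(langfuse_settings()["host"])
```

The public and secret keys come from the project settings page in your own Langfuse instance. Keeping them in the environment rather than in code means the same script works against localhost and any other host you move to later.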

Verdict

If you’re building anything with LLMs beyond a weekend demo, you need observability. Not logs — observability. The difference is being able to answer “why did it do that?” instead of “what did it do?”

Self-hosted Langfuse costs nothing but a few gigabytes of disk and an evening. My procurement agent still makes mistakes, but now I know exactly which mistakes, how often, and whether the fix I deployed actually helped.

That’s worth the extra setup. Every time.
