LLM monitoring and observability made practical: hands-on with Langfuse

Trace, measure, and improve your LLM apps

LLM applications are only as good as the visibility you have into their behavior, costs, and user experience. This hands-on guide shows how to stand up a self-hosted Langfuse stack, instrument OpenAI and a LangChain RAG pipeline, and close the loop with automated evaluation. The result is a reproducible workflow that traces every span, tracks token spend, and scores quality—before issues hit production.

Key points

Monitoring tracks predefined metrics over time, while observability explains why issues occur; in LLMs this spans telemetry, instrumentation, traces, spans, generations, logs, and metrics (per IBM’s definitions).
LLM monitoring matters for reliability, debugging unpredictable errors, improving user experience, mitigating bias and harmful outputs, and managing API costs with per-token visibility and alerts.
The open-source toolbox includes Langfuse, Arize Phoenix, AgentOps, Grafana with OpenTelemetry, and Weights & Biases Weave, with SDKs primarily for Python and JavaScript/TypeScript.
Langfuse is model/framework agnostic and provides tracing, token and cost monitoring, prompt management, datasets, security, and evaluations (LLM-as-a-judge and user feedback), plus a playground for premium users.
Langfuse can be self-hosted, used as a managed service, or deployed on-prem; self-hosting requires a persistent Postgres database and is fastest via docker-compose (clone the Langfuse repository, run docker compose up, then open http://localhost:3000).
A manual Docker setup uses a custom network, a postgres:latest container with POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB, then a langfuse/langfuse:2 container configured with DATABASE_URL, NEXTAUTH_SECRET, SALT, ENCRYPTION_KEY (256-bit hex), and NEXTAUTH_URL.
Tracing OpenAI calls can be enabled by replacing "import openai" with "from langfuse.openai import openai" and setting LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST; chat completions (e.g., gpt-4o-mini) are then logged to the dashboard.
An end‑to‑end RAG example uses LangChain + ChromaDB with OpenAI embeddings (text‑embedding‑3‑large) and RetrievalQA over ChatOpenAI (gpt‑4o‑mini), with CallbackHandler to capture the full retrieval‑to‑generation pipeline.
The Langfuse dashboard shows time‑filtered traces with latency, total cost, and token usage (input/output), and lets you drill into retrieval steps, LLM calls, and generations per trace.
Quality evaluation can be automated by fetching traces via the Langfuse API, scoring them with DeepEval's AnswerRelevancyMetric (e.g., using gpt-4o-mini), and publishing the scores back to the "Scores" panel; schedule the job with cron on a self-hosted instance, or use live evaluation in the enterprise offering.
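
The evaluation loop in the last point can be sketched without committing to a specific SDK version. In the minimal sketch below, fetch_traces, judge, and publish_score are hypothetical callables, not Langfuse or DeepEval functions: in a real job they would wrap the Langfuse client's trace-fetching and scoring calls and DeepEval's AnswerRelevancyMetric, whose exact signatures depend on the installed versions. The Trace record and the score name "answer_relevancy" are likewise assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Trace:
    """Hypothetical minimal view of a trace: an id plus the question and answer."""
    id: str
    input: str
    output: str


def evaluate_traces(
    fetch_traces: Callable[[], Iterable[Trace]],
    judge: Callable[[str, str], float],
    publish_score: Callable[[str, str, float], None],
    score_name: str = "answer_relevancy",
) -> int:
    """Score each fetched trace and publish the result; return how many were scored."""
    scored = 0
    for trace in fetch_traces():
        # Skip traces with no question or no answer; there is nothing to judge.
        if not trace.input or not trace.output:
            continue
        value = judge(trace.input, trace.output)
        publish_score(trace.id, score_name, value)
        scored += 1
    return scored


# Stand-ins so the sketch runs offline; replace them with real wrappers.
def fake_fetch() -> list[Trace]:
    return [
        Trace("t1", "What is Langfuse?", "An LLM observability platform."),
        Trace("t2", "Unanswered question", ""),
    ]


def fake_judge(question: str, answer: str) -> float:
    # Placeholder for an LLM-as-a-judge metric; always returns the same score.
    return 0.9


published: list[tuple[str, str, float]] = []
count = evaluate_traces(
    fake_fetch,
    fake_judge,
    lambda trace_id, name, value: published.append((trace_id, name, value)),
)
```

Keeping the three steps behind plain callables means the same script can be run from cron on a self-hosted instance and tested without network access or API keys.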

Takeaways

Start with the path of least pain: spin up docker-compose, point Langfuse at Postgres, and instrument early, even that “tiny” utility script. Add the Langfuse callback to your RAG chain, watch latency and token costs like a hawk, and wire up DeepEval so you’re not grading answers by vibes. Finally, schedule evaluations via cron. Unless you enjoy late-night firefighting, that is, in which case please proceed without monitoring and tell us how that adventure goes.
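
Pointing Langfuse at Postgres in a manual Docker setup comes down to a handful of environment variables. The following is a minimal sketch of how those values might be generated, not the project's official tooling: the helper name langfuse_env, the user "langfuse", the host "postgres", and the sample password are all hypothetical, and port 5432 is simply the Postgres default. The secrets are random 256-bit values rendered as 64 hex characters.

```python
import secrets
from urllib.parse import quote


def langfuse_env(
    db_user: str,
    db_password: str,
    db_host: str,
    db_name: str,
    base_url: str = "http://localhost:3000",
) -> dict[str, str]:
    """Build the environment variables for a self-hosted Langfuse container."""
    # Percent-encode the password so characters like '@' or '/' do not break the URL.
    password = quote(db_password, safe="")
    return {
        "DATABASE_URL": f"postgresql://{db_user}:{password}@{db_host}:5432/{db_name}",
        "NEXTAUTH_URL": base_url,
        # 32 random bytes = 256 bits = 64 hexadecimal characters each.
        "NEXTAUTH_SECRET": secrets.token_hex(32),
        "SALT": secrets.token_hex(32),
        "ENCRYPTION_KEY": secrets.token_hex(32),
    }


# Hypothetical values; "postgres" assumes that is the database container's
# name on the shared Docker network.
env = langfuse_env("langfuse", "p@ss/word", "postgres", "langfuse")

# Render as KEY=value lines, e.g. for a file passed to docker run --env-file.
env_file = "\n".join(f"{key}={value}" for key, value in env.items())
```

Generating the secrets once and storing them matters: regenerating ENCRYPTION_KEY or SALT on every start would leave previously stored values unreadable.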

Sources