Trace, measure, and improve your LLM apps
LLM applications are only as good as the visibility you have into their behavior, costs, and user experience. This hands-on guide shows how to stand up a self-hosted Langfuse stack, instrument OpenAI and a LangChain RAG pipeline, and close the loop with automated evaluation. The result is a reproducible workflow that traces every span, tracks token spend, and scores quality before issues reach production.
Key points
- Monitoring tracks predefined metrics over time, while observability explains why issues occur; in LLMs this spans telemetry, instrumentation, traces, spans, generations, logs, and metrics (per IBM’s definitions).
- LLM monitoring matters for reliability, debugging unpredictable errors, improving user experience, mitigating bias and harmful outputs, and managing API costs with per-token visibility and alerts.
- The open-source toolbox includes Langfuse, Arize Phoenix, AgentOps, Grafana with OpenTelemetry, and Weights & Biases Weave, with SDKs primarily for Python and JavaScript/TypeScript.
- Langfuse is model/framework agnostic and provides tracing, token and cost monitoring, prompt management, datasets, security, and evaluations (LLM-as-a-judge and user feedback), plus a playground for premium users.
- Hosting options for Langfuse include self-hosting, managed, and on‑prem; self-hosting requires a persistent Postgres database and is fastest via docker‑compose (clone the Langfuse repository, run docker compose up, then open http://localhost:3000).
- A manual Docker setup uses a custom network, a postgres:latest container configured with POSTGRES_USER, POSTGRES_PASSWORD, and POSTGRES_DB, then a langfuse/langfuse:2 container configured with DATABASE_URL, NEXTAUTH_SECRET, SALT, ENCRYPTION_KEY (256‑bit hex), and NEXTAUTH_URL.
- Tracing OpenAI calls can be enabled by replacing import openai with from langfuse.openai import openai, setting LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST, and logging chat completions (e.g., gpt‑4o‑mini) to the dashboard.
- An end‑to‑end RAG example uses LangChain + ChromaDB with OpenAI embeddings (text‑embedding‑3‑large) and RetrievalQA over ChatOpenAI (gpt‑4o‑mini), with CallbackHandler to capture the full retrieval‑to‑generation pipeline.
- The Langfuse dashboard shows time‑filtered traces with latency, total cost, and token usage (input/output), and lets you drill into retrieval steps, LLM calls, and generations per trace.
- Quality evaluation can be automated by fetching traces via the Langfuse API and scoring with DeepEval’s AnswerRelevancyMetric (e.g., using gpt‑4o‑mini), publishing scores back to the “Scores” panel; schedule with cron for self‑host, or use live evaluation in enterprise.
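For the manual Docker setup described above, the Langfuse container's environment values can be generated rather than typed by hand. The following is a minimal sketch using only the Python standard library. The helper name build_langfuse_env, the default host name postgres, and the port 5432 are assumptions made for illustration, not values taken from the guide.

```python
import secrets
from urllib.parse import quote


def build_langfuse_env(db_user, db_password, db_name,
                       db_host="postgres",  # assumed container name on the custom network
                       public_url="http://localhost:3000"):
    """Return environment variables for a langfuse/langfuse:2 container.

    Hypothetical helper: it only assembles strings and random secrets,
    it does not talk to Docker or Postgres.
    """
    # URL-encode the password so characters such as '@' do not break the URL.
    password = quote(db_password, safe="")
    return {
        "DATABASE_URL": f"postgresql://{db_user}:{password}@{db_host}:5432/{db_name}",
        "NEXTAUTH_URL": public_url,
        "NEXTAUTH_SECRET": secrets.token_hex(32),
        "SALT": secrets.token_hex(32),
        # 32 random bytes = 256 bits, rendered as 64 hex characters.
        "ENCRYPTION_KEY": secrets.token_hex(32),
    }


if __name__ == "__main__":
    # Print KEY=value lines that could be saved to an env file.
    for key, value in build_langfuse_env("langfuse", "change-me", "langfuse").items():
        print(f"{key}={value}")
```

The printed lines can be written to a file and passed to docker run with --env-file. Generate the secrets once and store them, since changing ENCRYPTION_KEY or SALT later would likely invalidate data protected with the old values.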
Takeaway
Start with the path of least pain: spin up docker‑compose, point Langfuse at Postgres, and instrument early (yes, even that “tiny” utility script). Add the Langfuse callback to your RAG chain, watch latency and token costs like a hawk, and wire up DeepEval so you’re not grading answers by vibes. Finally, schedule evaluations via cron. Unless you enjoy late‑night firefighting, that is, in which case please proceed without monitoring and tell us how that adventure goes.
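To make "watching token costs" concrete, here is a minimal sketch of the arithmetic behind a per-trace cost figure and a simple budget check. The trace dictionaries, the function names, and the prices are hypothetical placeholders. They are not the Langfuse data model or any provider's real pricing.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Cost of one call, given prices quoted in USD per one million tokens."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


def total_cost_usd(traces, input_price_per_million, output_price_per_million):
    """Sum the estimated cost over a list of trace-like dicts (assumed shape)."""
    return sum(
        estimate_cost_usd(t["input_tokens"], t["output_tokens"],
                          input_price_per_million, output_price_per_million)
        for t in traces
    )


def over_budget(traces, budget_usd, input_price_per_million, output_price_per_million):
    """Return True when the summed cost exceeds the budget."""
    return total_cost_usd(traces, input_price_per_million,
                          output_price_per_million) > budget_usd


if __name__ == "__main__":
    # Placeholder numbers chosen for easy arithmetic, not real prices.
    sample = [
        {"input_tokens": 1000, "output_tokens": 500},
        {"input_tokens": 3000, "output_tokens": 1000},
    ]
    print(total_cost_usd(sample, 1.0, 2.0))
    print(over_budget(sample, 0.005, 1.0, 2.0))
```

In practice the token counts would come from the traces the dashboard already records, and a check like over_budget could run in the same cron job as the evaluations.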