Traditional observability assumes requests have an input, an output, a status code, and a latency. LLM apps have prompts, completions, retrieval context, tool calls, retries, judge scores, and a quality dimension that no HTTP metric captures. Without the right instrumentation, an LLM app regresses in ways that do not show up on a dashboard until a user complains. This is how our team sets up logging, tracing, and evaluation for production LLM systems in 2026 — the tools, the trace structure, and the eval patterns that actually hold up under real traffic.
Why standard observability is not enough
An LLM call can return a 200 and still be wrong. It can also be correct, fast, and still cost ten times what it should because a cache key changed. Four dimensions matter beyond latency and error rate:
- Quality — did the output meet the spec? Factual, well-formatted, grounded in retrieved context?
- Cost — tokens in, tokens out, cache hit rate, per-customer and per-feature breakdowns.
- Drift — is output distribution today still the same as last week, and if not, why?
- Chain shape — for agentic workflows, how many tool calls, which tools, in what order, with how many retries?
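The four dimensions above can be captured as one per-request record. A minimal sketch — the field names are illustrative, not from any specific SDK:

```typescript
// Hypothetical per-request observability record covering the four
// dimensions beyond latency and error rate. Names are illustrative.
interface LlmRequestRecord {
  // Quality: judge or heuristic scores, 0–1, keyed by rubric
  qualityScores: Record<string, number>;
  // Cost: raw token counts, never precomputed dollars
  inputTokens: number;
  outputTokens: number;
  cachedTokens: number;
  // Drift input: embedding of the output for distribution tracking
  outputEmbedding?: number[];
  // Chain shape: ordered tool calls with retry counts
  toolCalls: { name: string; retries: number }[];
}

const record: LlmRequestRecord = {
  qualityScores: { groundedness: 0.92 },
  inputTokens: 1840,
  outputTokens: 220,
  cachedTokens: 1500,
  toolCalls: [{ name: "search_kb", retries: 0 }],
};
```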
The tool landscape in 2026
The observability market split early. LangSmith is the native choice for LangChain users — automatic instrumentation, LangGraph visualization, and debugging that understands the framework internals. Langfuse is the full-featured open-source default when the team wants self-hosting and a generous free tier. Arize Phoenix brings enterprise ML observability discipline to LLMs and is strongest on drift and embedding analysis. Helicone is the gateway-style tool — one URL change and every call is logged, with routing and failover built in. Weights & Biases Weave is the pick for teams already running W&B for model training who want one pane of glass across experiments and production.
| Tool | Shape | Best for | Watch out for |
|---|---|---|---|
| LangSmith | SDK + UI | LangChain and LangGraph teams | Costs climb with team size ($39/user/mo) |
| Langfuse | Open-source, self-host or cloud | Default full-featured pick for most teams | Self-hosting adds ops overhead |
| Arize Phoenix | OpenTelemetry-native, open-source | Drift detection, embedding analysis | Steeper learning curve |
| Helicone | Gateway proxy | Fastest setup, provider-agnostic routing | Shallower eval features |
| W&B Weave | SDK + hosted UI | Teams already in the W&B ecosystem | Less focused if not training models |
If the team is picking fresh in 2026, default to Langfuse for full observability or Helicone if the priority is instrumenting production tomorrow. Both have generous free tiers and the migration path between them is reasonable if needs change.
Trace structure that survives contact with production
A trace is a tree. The root is the user request; the children are the LLM calls, retrieval lookups, tool invocations, and post-processing steps that happened to serve it. Good trace structure follows a few conventions:
- One trace per user request, so every downstream span is joinable.
- Every LLM call captures the model, prompt hash, token counts, cache-hit status, and cost.
- Retrieval spans capture the query, returned chunk IDs, similarity scores, and the reranker output.
- Tool calls capture the tool name, arguments, result, and whether the result was an error.
- User metadata — customer ID, feature name, experiment variant — propagates to every span for slicing later.
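The metadata-propagation convention is the one teams most often get wrong, so here is a minimal sketch of it. The types and helper are illustrative — real SDKs like Langfuse and Phoenix provide equivalents:

```typescript
// Minimal trace-tree sketch: one root per user request, with user
// metadata propagated from the root to every child span.
type Span = {
  name: string;
  metadata: Record<string, string>;
  children: Span[];
};

// Merge the parent's metadata into each new child so every span is
// sliceable by customer, feature, and experiment variant later.
function childSpan(parent: Span, name: string, extra: Record<string, string> = {}): Span {
  const span: Span = { name, metadata: { ...parent.metadata, ...extra }, children: [] };
  parent.children.push(span);
  return span;
}

const root: Span = {
  name: "user-request",
  metadata: { customerId: "cus_123", feature: "support-assistant" },
  children: [],
};
const retrieval = childSpan(root, "retrieval", { query: "reset password" });
const llm = childSpan(root, "llm-call", { model: "gpt-5", cacheHit: "false" });
```

Every downstream span now carries `customerId` and `feature`, so per-customer dashboards are a filter, not a join.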
Instrumenting a call
```typescript
// Langfuse SDK — minimal production instrumentation.
// `ctx` is assumed to be the app's per-request context object.
import { observeOpenAI } from "langfuse";
import OpenAI from "openai";

const openai = observeOpenAI(new OpenAI(), {
  metadata: {
    customerId: ctx.customer.id,
    feature: "support-assistant",
    experimentVariant: ctx.flags.variant,
  },
});

const response = await openai.chat.completions.create({
  model: "gpt-5",
  messages,
  tools,
});

// The trace, token counts, cost, tool calls, and latency
// are captured automatically and keyed by customer and feature.
```
Evals that actually run in production
Evals are where most teams stop too early. An eval set that only runs in CI catches regressions on a fixed distribution, but production traffic looks nothing like the eval set after three months. The pattern that holds up is tiered evaluation — fast checks on every request, slower LLM-judge sampling on a slice of traffic, and human review on flagged outputs.
- Heuristic checks on every request — JSON validity, length bounds, forbidden-phrase scans, PII detection. Milliseconds per check, no API cost.
- LLM-as-judge on a sampled slice, typically 5%, run asynchronously. Scores faithfulness, groundedness, or task-specific rubrics. Always async — never in the user path.
- Regression suite in CI — a curated eval set that runs on every prompt change or model swap and blocks merges on score deltas below threshold.
- Human review on the outliers — outputs flagged by judge scores or user feedback. One reviewer-hour per week catches more real problems than ten thousand synthetic evals.
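The tier-one heuristic checks can be a single synchronous function. A sketch — the length bound and forbidden-phrase list are placeholders, and real PII detection needs more than this:

```typescript
// Tier-one heuristic checks: milliseconds per request, no API cost.
// Returns a list of failure tags for the trace; empty means clean.
function heuristicChecks(output: string): string[] {
  const failures: string[] = [];
  // JSON validity for structured outputs
  try {
    JSON.parse(output);
  } catch {
    failures.push("invalid-json");
  }
  // Length bounds (placeholder limit)
  if (output.length > 4000) failures.push("too-long");
  // Forbidden-phrase scan (placeholder list)
  for (const phrase of ["as an AI language model"]) {
    if (output.toLowerCase().includes(phrase)) failures.push("forbidden-phrase");
  }
  return failures;
}
```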
Running LLM-as-judge synchronously on every request roughly doubles latency and triples cost. Always sample, always async, and always compare judge scores against a rolling baseline rather than absolute thresholds.
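The sampling gate can be sketched in a few lines. Hash-based sampling (FNV-1a here, purely illustrative) is worth the extra lines over `Math.random()` because the same trace is deterministically in or out of the judged slice; `runJudge` is an assumed async scorer, not a real API:

```typescript
const JUDGE_SAMPLE_RATE = 0.05; // judge 5% of traffic

// Deterministic hash-based sampling: a given trace ID always lands
// on the same side of the rate cutoff.
function shouldSample(traceId: string, rate: number): boolean {
  let h = 2166136261; // FNV-1a offset basis
  for (let i = 0; i < traceId.length; i++) {
    h ^= traceId.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) / 4294967296 < rate;
}

function maybeJudge(
  traceId: string,
  output: string,
  runJudge: (id: string, o: string) => Promise<void>
): void {
  if (!shouldSample(traceId, JUDGE_SAMPLE_RATE)) return;
  // Deliberately not awaited: the user response has already been sent.
  runJudge(traceId, output).catch(() => {
    /* log the failure, never rethrow into the request path */
  });
}
```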
Cost tracking that survives a pricing page change
Token cost is the one metric finance will care about, and it is easy to track badly. A trace should record input tokens, output tokens, cached tokens, and cache-write tokens as separate fields — not summed. Cost is then a derived view that can be recomputed when provider pricing changes without re-instrumenting every call.
- Attribute every token to a customer and a feature tag — 'support-assistant', 'summary-widget', 'onboarding-flow'.
- Surface a per-customer dashboard when running usage-based pricing or when a support team needs to answer "why is my bill so high?".
- Alert on absolute spend AND on cost-per-user — an incident can spike one without the other.
- Separate cached vs fresh tokens in the view. Cache hit rate is the single biggest cost lever available on most workloads.
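The derived-view pattern above can be sketched in a few lines: raw token counts live on the trace, dollars are computed at query time from a pricing table. The per-million rates below are placeholders, not real provider prices:

```typescript
// Raw token counts stored per trace — four separate fields, never summed.
type TokenUsage = { input: number; output: number; cachedRead: number; cacheWrite: number };

// Pricing lives outside the trace; update it when the provider's
// pricing page changes and historical cost recomputes for free.
type Pricing = { inputPerM: number; outputPerM: number; cachedReadPerM: number; cacheWritePerM: number };

function costUsd(u: TokenUsage, p: Pricing): number {
  return (
    (u.input * p.inputPerM +
      u.output * p.outputPerM +
      u.cachedRead * p.cachedReadPerM +
      u.cacheWrite * p.cacheWritePerM) /
    1_000_000
  );
}

const usage: TokenUsage = { input: 9000, output: 1000, cachedRead: 40000, cacheWrite: 0 };
const placeholderRates: Pricing = { inputPerM: 3, outputPerM: 15, cachedReadPerM: 0.3, cacheWritePerM: 3.75 };
// With these placeholder rates, cached reads dominate the token count
// but barely move the bill — the cache-hit-rate lever made visible.
```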
Detecting drift before users do
LLM output drift has more failure modes than classical ML. The model updated overnight, a prompt edit changed the output distribution, the retrieval index got reranked, a tool started returning different results, or real-world inputs shifted — any of these can move output quality without a single code change. Detection means alerting on deltas from a rolling baseline, not fixed thresholds.
- Log embeddings of inputs and outputs; plot the distribution week over week and alert on meaningful shifts.
- Track judge score distribution, not just the mean — a flat mean can hide a growing tail of bad outputs.
- Watch tool-call frequency per request. A sudden spike usually means the model started retrying a failing tool.
- Monitor user feedback signals — thumbs-down, regenerate clicks, conversation-abandonment — as leading indicators before evals catch the regression.
The cheapest drift detector is an eval set that runs nightly against the current production prompt and model. If today's score drops three points from yesterday's, something changed — often a silent provider-side model update. Catch it before the user does.
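The rolling-baseline comparison is small enough to sketch inline. The seven-run window and three-point drop threshold mirror the rule of thumb above but are tunable assumptions:

```typescript
// Alert when tonight's eval score falls a fixed number of points below
// the mean of the last `window` runs — a rolling baseline, not a fixed
// absolute threshold.
function driftAlert(history: number[], today: number, window = 7, maxDrop = 3): boolean {
  const recent = history.slice(-window);
  const baseline = recent.reduce((a, b) => a + b, 0) / recent.length;
  return baseline - today >= maxDrop;
}
```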
What to instrument on day one
If the team is shipping a new LLM feature this week and wants to get observability right from the start, this is the minimum that pays for itself:
- A tracing SDK wired into every LLM call, with customer ID and feature tag on every span.
- Token and cost capture as separate fields, not precomputed dollars.
- A tiny eval set — 20 to 50 cases — that runs in CI on every prompt change.
- One heuristic check per feature — JSON validity for structured outputs, length bounds for summaries, factuality spot checks for retrieval answers.
- A dashboard with latency, cost, error rate, and judge-score trend by customer and feature. Check it weekly for the first month, monthly after that.
Key takeaways
- LLM observability goes beyond HTTP metrics — quality, cost, drift, and chain shape all need their own instrumentation.
- Langfuse and Helicone are the safe defaults in 2026 unless the team is already committed to LangChain (pick LangSmith) or W&B.
- Trace structure matters — one trace per request, tagged by customer and feature, with tokens captured as raw counts not dollars.
- Run evals in tiers — heuristics on every request, LLM judge on a sampled async slice, regression suite in CI, human review on outliers.
- Drift is detected through rolling baselines, not fixed thresholds. The cheapest detector is a nightly eval run against production.