LLM Observability Tooling
How tracing an LLM app captures the full fan-out of model and tool calls behind one user request, and how LangSmith, Langfuse, Helicone, and Phoenix differ in what they instrument.
One click in a chat UI is not one model call. A single user turn in a modern agent might rewrite the query, embed it, hit a vector store twice, call a re-ranker, feed the top chunks into a generation call, decide it needs a weather tool, call that tool, retry it once when it times out, and only then produce an answer. When that turn returns something wrong, a latency dashboard tells you almost nothing: it says the p95 was 4.2 seconds. It cannot tell you the re-ranker returned garbage, the retrieval missed the relevant document, or the tool call silently failed and the model hallucinated over the gap. Metrics aggregate. To debug an LLM app you need to see the individual chain, step by step, which is what tracing gives you and dashboards cannot.
Traces and spans: the data model you actually need
Observability for LLM apps borrows its core model directly from distributed tracing, because an LLM request is a distributed computation. Two nouns carry the weight.
A span is one unit of work with a start time, an end time, a status, and a bag of attributes. A single model call is a span; a retrieval query is a span; a tool invocation is a span. A trace is the tree of spans produced by one top-level request, linked by parent-child relationships. The root span is the user turn; its children are the steps that the turn fanned out into.
The tree structure is the point. A flat log of "called model, called tool, called model" loses which model call triggered which tool, and how deep the recursion went. The parent-child edges reconstruct the exact control flow, so you can open a failed trace and walk down to the span where the answer went off the rails. For an agent that loops, the trace is often the only faithful record of what it actually did, as opposed to what its final message claims it did.
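A minimal sketch of that data model, assuming nothing beyond the standard library. The `Span` class, its field names, and `path_to_first_error` are illustrative names invented here, not any tool's schema; the point is only that parent-child edges let you walk from the root to the step that failed.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work: a model call, a retrieval query, a tool invocation."""
    name: str
    start_ms: int
    end_ms: int
    status: str = "ok"  # "ok" or "error" (illustrative values)
    attributes: dict = field(default_factory=dict)
    children: list[Span] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def path_to_first_error(span: Span, path: tuple[str, ...] = ()) -> tuple[str, ...] | None:
    """Depth-first walk: return the span names from the root down to the
    first failing span found, or None if nothing in this subtree failed."""
    here = path + (span.name,)
    for child in span.children:
        found = path_to_first_error(child, here)
        if found is not None:
            return found
    return here if span.status == "error" else None


# A hypothetical trace for one user turn: the weather tool times out once.
trace = Span("user_turn", 0, 4200, children=[
    Span("rewrite_query", 0, 300),
    Span("retrieve", 300, 900),
    Span("generate", 900, 4200, children=[
        Span("tool:weather", 1500, 2500, status="error"),
        Span("tool:weather (retry)", 2500, 3000),
    ]),
])

print(" > ".join(path_to_first_error(trace)))
# prints: user_turn > generate > tool:weather
```

A flat log of the same five events could not tell you that the failing tool call belonged to the generation step rather than to retrieval; the `children` lists are what carry that information.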
What you capture on each span is where the LLM-specific value lives:
- The prompt and the completion (the rendered input messages and the raw output), so you can read exactly what the model saw and said. This is the single most useful artefact and the one plain metrics throw away.
- Token counts, input and output, per call. These drive cost and explain latency.
- Cost, derived from token counts times the model's per-token prices (input and output tokens are usually priced differently), rolled up over the trace.
- Latency per step, so you can see that 3 of your 4.2 seconds were the re-ranker, not the model.
- Tool calls: name, arguments, result, and whether it errored.
- Retrieval spans: the query, the documents returned, and their similarity scores, so a bad answer can be traced to a bad retrieval rather than a bad generation.
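The cost and latency items above reduce to simple arithmetic once the per-span fields exist. The sketch below assumes a flat list of span dictionaries with invented field names, a made-up model name, and made-up prices; real prices vary by model and change over time. It also assumes the steps ran one after another, so summing latencies gives the turn's total.

```python
# Hypothetical prices in USD per million tokens (assumption, not real pricing).
PRICES_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


def span_cost(span: dict) -> float:
    """Cost of one model-call span: tokens times price, input and output priced separately."""
    price = PRICES_PER_MTOK[span["model"]]
    return (
        span["input_tokens"] * price["input"]
        + span["output_tokens"] * price["output"]
    ) / 1_000_000


def summarize(spans: list[dict]) -> dict:
    """Roll cost up over the trace and find the step that dominated latency."""
    total_cost = sum(span_cost(s) for s in spans if "model" in s)
    total_ms = sum(s["latency_ms"] for s in spans)  # assumes sequential steps
    slowest = max(spans, key=lambda s: s["latency_ms"])
    return {
        "cost_usd": total_cost,
        "total_ms": total_ms,
        "slowest_step": slowest["name"],
        "slowest_share": slowest["latency_ms"] / total_ms,
    }


spans = [
    {"name": "retrieve", "latency_ms": 200},
    {"name": "rerank", "latency_ms": 3000},
    {"name": "generate", "latency_ms": 1000, "model": "example-model",
     "input_tokens": 2000, "output_tokens": 400},
]

summary = summarize(spans)
print(f"{summary['slowest_step']} took {summary['slowest_share']:.0%} of {summary['total_ms']} ms")
print(f"cost: ${summary['cost_usd']:.4f}")
```

With these numbers the re-ranker accounts for 3,000 of 4,200 ms, which is exactly the kind of fact an aggregate latency chart hides.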
Online evaluation in the loop
Tracing tells you what happened. It does not tell you whether what happened was any good. Offline evals (running a fixed test set through the app before you ship) answer that for known cases; online evaluation answers it for live traffic you never anticipated.
The pattern is to attach a score to spans or traces as they arrive. Scores come from three places: explicit user feedback (a thumbs-up, a copy-to-clipboard, a regenerate click), implicit signals (did the user rephrase and try again, which suggests the last answer failed), and automated evaluators, most commonly an LLM-as-a-judge run asynchronously over a sample of production traces to grade faithfulness, relevance, or policy adherence. The scores flow back onto the trace so you can filter for "low-faithfulness answers last Tuesday" and read the exact chains that scored badly. This is the loop that turns a pile of traces into a signal you can act on, and every tool below implements some version of it.
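The scoring loop described above can be sketched in a few lines. Everything here is hypothetical: `TraceRecord`, `attach_score`, and `low_scoring` are invented names, and `stub_faithfulness_judge` is a keyword check standing in for a real LLM-as-a-judge call, which would send the trace to a model along with a rubric.

```python
from dataclasses import dataclass, field


@dataclass
class TraceRecord:
    trace_id: str
    question: str
    answer: str
    scores: dict[str, float] = field(default_factory=dict)


def attach_score(trace: TraceRecord, name: str, value: float) -> None:
    """Write a named score back onto the trace so it can be filtered on later."""
    trace.scores[name] = value


def stub_faithfulness_judge(trace: TraceRecord) -> float:
    """Placeholder for an asynchronous LLM-as-a-judge evaluator.
    The keyword rule is only there to make the sketch runnable."""
    return 0.0 if "probably" in trace.answer.lower() else 1.0


def low_scoring(traces: list[TraceRecord], name: str, threshold: float) -> list[TraceRecord]:
    """Return traces whose named score exists and falls below the threshold."""
    return [t for t in traces if name in t.scores and t.scores[name] < threshold]


traces = [
    TraceRecord("t1", "When was the policy updated?", "On 3 March, per section 2."),
    TraceRecord("t2", "What is the refund window?", "Probably 30 days."),
]

# Automated evaluator over the (here: full) sample of traces.
for t in traces:
    attach_score(t, "faithfulness", stub_faithfulness_judge(t))

# Explicit user feedback arrives separately and lands on the same record.
attach_score(traces[0], "user_thumbs_up", 1.0)

for t in low_scoring(traces, "faithfulness", threshold=0.5):
    print(t.trace_id, t.answer)
```

The design choice worth noticing is that every signal, whatever its source, becomes a named score on the trace, so one filter can pull up the exact chains that scored badly.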
The emerging standard: OpenTelemetry GenAI conventions
For years each observability tool defined its own span shape, so instrumenting for one vendor locked your data into that vendor. OpenTelemetry, the CNCF standard that already governs traces and metrics across the rest of the industry, is closing that gap for LLMs through its GenAI semantic conventions: an agreed set of span, metric, and event definitions for model calls, agents, and even Model Context Protocol servers.
The conventions name the attributes you would expect: gen_ai.operation.name for the kind of call, gen_ai.provider.name for which provider served it, gen_ai.request.model for the model, and gen_ai.usage.input_tokens and gen_ai.usage.output_tokens for the token counts. Instrument once against these names and any OTel-compatible backend can read your data.
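As a small illustration, the attribute names above can be collected into a plain dictionary. This sketch deliberately uses no OpenTelemetry import so it runs anywhere; `genai_attributes` is a hypothetical helper, and `"example-model"` is a placeholder model name.

```python
def genai_attributes(operation: str, provider: str, model: str,
                     input_tokens: int, output_tokens: int) -> dict:
    """Build span attributes keyed by the GenAI semantic-convention names
    quoted in the text. The names may change while the conventions evolve."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


attrs = genai_attributes("chat", "openai", "example-model", 1200, 350)
for key, value in attrs.items():
    print(f"{key} = {value}")
```

With the OpenTelemetry Python API, each pair would typically be recorded on an active span via `span.set_attribute(key, value)`; keeping the names in one helper makes it easier to update them in a single place.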
The honest caveat: these conventions are marked Development stability in the specification, meaning the attribute names and shapes can still change. They are the direction of travel and worth instrumenting toward, but they are not yet frozen, so pin the version you built against and expect some churn.
What the tools actually focus on
The four common tools solve overlapping problems from different starting points. Descriptions below stick to what each project's own docs claim.
| Tool | Starting point | Focus per its docs |
|---|---|---|
| LangSmith | SDK / framework tracing | Observability platform: tracing, dashboards and alerts, online evaluations, feedback. Framework-agnostic (works with OpenAI, Anthropic, and others), not LangChain-only. |
| Langfuse | SDK tracing, open source | Tracing built on OpenTelemetry, plus prompt management and evaluation (including LLM-as-a-judge). Self-hostable. |
| Helicone | Proxy / gateway | AI gateway that routes and logs requests, plus observability (cost, caching, rate limits, alerts). Logging by changing a base URL rather than adding SDK spans. Has an open-source component. |
| Arize Phoenix | Open source, OTel-native | Tracing, evaluation, prompt engineering, and experimentation, built on OpenTelemetry and OpenInference instrumentation. |
The axis that matters when choosing is how the data gets captured. A gateway like Helicone sits in front of your provider: you point your client at its URL and it logs every request with near-zero code change. That is the fastest way to get cost and latency visibility, but a gateway sees HTTP calls, not your application's internal structure, so stitching separate calls into one trace needs extra work. SDK tracing (LangSmith, Langfuse, Phoenix) wraps your code, so it captures the full parent-child tree including retrieval and tool spans, at the cost of instrumenting your app. Phoenix and Langfuse lean explicitly on OpenTelemetry, which is the strongest hedge against lock-in, and both can be self-hosted, which matters when prompts contain data you cannot send to a third party.
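To make "wraps your code" concrete, here is a toy version of SDK-style tracing using only the standard library. It is not any vendor's API: the `traced` decorator, the `FINISHED` list (a stand-in for an exporter), and the three example functions are all invented for illustration. A context variable remembers the current span so nested calls record their parent.

```python
import contextvars
import functools
import time

# The span currently in progress, if any (illustrative, not a vendor SDK).
_current_span = contextvars.ContextVar("current_span", default=None)
FINISHED = []  # stand-in for an exporter that ships spans to a backend


def traced(name):
    """Decorator that records a span for each call, linked to its parent."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            parent = _current_span.get()
            span = {
                "name": name,
                "parent": parent["name"] if parent else None,
                "status": "ok",
            }
            token = _current_span.set(span)
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                span["status"] = "error"
                raise
            finally:
                span["latency_ms"] = (time.perf_counter() - start) * 1000
                _current_span.reset(token)
                FINISHED.append(span)
        return wrapper
    return decorator


@traced("retrieve")
def retrieve(query):
    return ["doc-1", "doc-2"]


@traced("generate")
def generate(query, docs):
    return f"answer to {query!r} from {len(docs)} docs"


@traced("user_turn")
def handle_turn(query):
    return generate(query, retrieve(query))


handle_turn("weather in Oslo?")
for span in FINISHED:
    print(span["name"], "<-", span["parent"])
# prints:
# retrieve <- user_turn
# generate <- user_turn
# user_turn <- None
```

A gateway in front of the provider would see only the HTTP request made inside `generate`; the retrieval step and the parent links exist only because the application code itself was wrapped.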
When it falls down
- Prompts are a data-leak surface. The most useful thing you capture, the full prompt and completion, is also where user PII, API keys pasted into a chat, and internal documents live. Shipping raw prompts to a hosted backend can move regulated data outside your boundary. The levers are redaction before capture, self-hosting the backend, or capturing metadata and token counts without the payload; each trades debuggability for safety.
- Full capture does not scale, and sampling hides things. Storing every prompt and completion for high-traffic apps gets expensive fast, so teams sample. But uniform sampling at the trace level keeps rare failures at the same low rate as everything else: at a 1% sample, a failure that happens once a day is captured roughly once every hundred days, so the trace you actually want is usually one you dropped. Head-based random sampling (decide when the trace starts) is cheap but blind; tail-based sampling (decide after the trace finishes, keep the errors and the slow ones) keeps the interesting traces but costs more to run, because every trace has to be buffered until it completes.
- Observability does not create good evals. A trace tool shows you the answer and lets you attach a score; it does not tell you what a good score is. The judge prompt, the rubric, the labelled reference set, all of that is work you still have to do. Buying a platform gives you the plumbing to run evaluations, not the evaluations themselves, and teams routinely mistake the dashboard for the judgement.
- Vendor lock-in, and the OTel escape hatch. Instrument deeply against a proprietary SDK and your traces, your dashboards, and your muscle memory all live in one vendor; moving is a re-instrumentation project. Instrumenting against the OpenTelemetry GenAI conventions instead keeps the emitting code standard, so you can repoint the exporter at a different backend. The catch is that those conventions are still at Development stability, so the escape hatch is real but not yet bolted down.
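Two of the mitigations in the list above, redaction before capture and tail-based sampling, can be sketched briefly. Both functions are hypothetical: the regular expressions are deliberately crude (the `sk-` key shape in particular is an assumed example, and real PII detection needs far more than two patterns), and the thresholds in `keep_trace` are arbitrary defaults.

```python
import random
import re

# Crude illustrative patterns; production redaction needs a much broader set.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SECRET = re.compile(r"\bsk-[A-Za-z0-9_-]{16,}\b")  # assumed key shape


def redact(text: str) -> str:
    """Mask emails and key-like strings before the prompt is stored."""
    text = EMAIL.sub("[EMAIL]", text)
    return SECRET.sub("[SECRET]", text)


def keep_trace(trace: dict, slow_ms: float = 3000, baseline_rate: float = 0.05,
               rng=random.random) -> bool:
    """Tail-based decision, made after the trace has finished:
    always keep errors and slow traces, keep a small random share of the rest."""
    if trace["status"] == "error":
        return True
    if trace["latency_ms"] >= slow_ms:
        return True
    return rng() < baseline_rate


print(redact("mail jane.doe@example.com key sk-abcdefghijklmnop1234"))
# prints: mail [EMAIL] key [SECRET]

finished = [
    {"id": "a", "status": "ok", "latency_ms": 800},
    {"id": "b", "status": "error", "latency_ms": 900},
    {"id": "c", "status": "ok", "latency_ms": 4200},
]
# A fixed rng makes the example deterministic: 0.5 is above the 5% baseline,
# so the ordinary fast trace is dropped while the error and the slow one stay.
kept = [t["id"] for t in finished if keep_trace(t, rng=lambda: 0.5)]
print(kept)
# prints: ['b', 'c']
```

Passing `rng` as a parameter is a testing convenience; in practice the decision would also need every span of a trace buffered somewhere until the trace completes, which is where the extra running cost comes from.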
Further reading
- OpenTelemetry GenAI semantic conventions - the spans, metrics, and events for GenAI clients and MCP, with per-attribute stability badges.
- Langfuse documentation - open-source, OpenTelemetry-based tracing plus prompt management and evaluation, with self-hosting instructions.
- LangSmith documentation - framework-agnostic tracing, dashboards, alerts, and online evaluation.
- Arize Phoenix documentation - OpenTelemetry and OpenInference-based tracing, evaluation, and experimentation.