Trace an LLM application tutorial

In this tutorial, you will build a customer support chatbot using retrieval-augmented generation (RAG) and add LangSmith observability at each stage of development, from early prototyping through production. By the end, you will know how to:

Trace individual LLM calls and full application pipelines.
Collect and query user feedback.
Log metadata and use it for filtering and A/B testing.
Use monitoring dashboards to track production performance.

The application will retrieve relevant documentation snippets and use them to answer user questions. The retriever is mocked in this tutorial; in a real application you would replace it with a vector search or similar.

Prerequisites

Before you begin, make sure you have:

A LangSmith account: Sign up or log in at smith.langchain.com.
A LangSmith API key: Follow the Create an API key guide.
An OpenAI API key: Generate this from the OpenAI dashboard.

Install the required packages:

pip install langsmith openai

Prototyping

Having observability set up from the start lets you iterate faster. You can see exactly what is being sent to the model, what is coming back, and where time is being spent, without adding print statements or running a debugger.

Set up your environment

Set the following environment variables in your shell:

export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="<your-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"

To send traces to a specific project, use the LANGSMITH_PROJECT environment variable. If this is not set, LangSmith will create a default tracing project automatically on trace ingestion.

You may see these variables referenced as LANGCHAIN_* in other places. Both work, but LANGSMITH_TRACING and LANGSMITH_API_KEY are the recommended names.

Trace LLM calls

Start by tracing your OpenAI calls, where the model is actually invoked. This gives you immediate visibility into the prompts your app sends and the responses the model returns. Wrap the OpenAI client with wrap_openai (Python) or wrapOpenAI (TypeScript). Create a file called app.py (or app.ts) with the following code:

from openai import OpenAI
from langsmith.wrappers import wrap_openai

client = wrap_openai(OpenAI())

docs = [
    "Acme Cloud supports unlimited users on Enterprise plans. Starter plans are limited to 5 users.",
    "To reset your password, click 'Forgot password' on the login page and follow the instructions sent to your email.",
    "API rate limits are 1,000 requests per hour on the Starter plan and 10,000 requests per hour on Enterprise.",
]

def retriever(query: str) -> list[str]:
    return docs

def support_bot(question: str) -> str:
    context = retriever(question)
    system_message = (
        "You are a helpful customer support agent. "
        "Answer using only the information provided below:\n\n"
        + "\n".join(context)
    )
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(support_bot("How many users can I have on the Starter plan?"))

Calling support_bot("How many users can I have on the Starter plan?") produces a trace of the OpenAI call.

Trace the whole pipeline

Tracing the LLM call is useful, but tracing the full pipeline (including retrieval) gives you a complete overview of your application’s behavior. Add @traceable (Python) or traceable (TypeScript) to the main function:

from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai

client = wrap_openai(OpenAI())

docs = [
    "Acme Cloud supports unlimited users on Enterprise plans. Starter plans are limited to 5 users.",
    "To reset your password, click 'Forgot password' on the login page and follow the instructions sent to your email.",
    "API rate limits are 1,000 requests per hour on the Starter plan and 10,000 requests per hour on Enterprise.",
]

def retriever(query: str) -> list[str]:
    return docs

@traceable
def support_bot(question: str) -> str:
    context = retriever(question)
    system_message = (
        "You are a helpful customer support agent. "
        "Answer using only the information provided below:\n\n"
        + "\n".join(context)
    )
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(support_bot("How many users can I have on the Starter plan?"))

Calling support_bot("How many users can I have on the Starter plan?") now produces a trace of the full RAG pipeline.

LangSmith UI showing a trace with an outer application span and a nested LLM call span.

Beta testing

Once your app is working well in prototyping, you release it to a small group of real users. At this stage, you often don’t know exactly how users will interact with your app, so you need richer observability. You want to understand not just what the app did, but how users responded to it.

Collect feedback

Linking user feedback to specific traces lets you identify which responses were helpful or unhelpful. Update app.py (or app.ts) from the previous step to add a run ID to each call and attach a score afterward:

from openai import OpenAI
from langsmith import traceable, Client, uuid7  
from langsmith.wrappers import wrap_openai

client = wrap_openai(OpenAI())

docs = [
    "Acme Cloud supports unlimited users on Enterprise plans. Starter plans are limited to 5 users.",
    "To reset your password, click 'Forgot password' on the login page and follow the instructions sent to your email.",
    "API rate limits are 1,000 requests per hour on the Starter plan and 10,000 requests per hour on Enterprise.",
]

def retriever(query: str) -> list[str]:
    return docs

@traceable
def support_bot(question: str) -> str:
    context = retriever(question)
    system_message = (
        "You are a helpful customer support agent. "
        "Answer using only the information provided below:\n\n"
        + "\n".join(context)
    )
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    run_id = str(uuid7())  
    support_bot(  
        "How many users can I have on the Starter plan?",  
        langsmith_extra={"run_id": run_id},  
    )  
    ls_client = Client()  
    ls_client.create_feedback(run_id, key="user-score", score=1.0)  

In production, these two pieces would live in separate locations: the support_bot call with run_id stays in your app, and create_feedback moves to whichever endpoint receives user feedback (for example, a /feedback API route). The run_id is passed from one to the other so the feedback can be linked to the correct trace.

The feedback appears in the Feedback tab when you inspect the run in the UI. You can then filter runs by feedback score using the filtering controls in the Runs table.

Log metadata

Metadata lets you tag runs with attributes useful for filtering and comparison. For example, which model version was used or which user made the request. The following example traces both the retriever (with run_type="retriever") and the main function (with a metadata attribute for the model name):

from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai

client = wrap_openai(OpenAI())

docs = [
    "Acme Cloud supports unlimited users on Enterprise plans. Starter plans are limited to 5 users.",
    "To reset your password, click 'Forgot password' on the login page and follow the instructions sent to your email.",
    "API rate limits are 1,000 requests per hour on the Starter plan and 10,000 requests per hour on Enterprise.",
]

@traceable(run_type="retriever")  
def retriever(query: str) -> list[str]:
    return docs

@traceable(metadata={"llm": "gpt-4.1-mini"})  
def support_bot(question: str) -> str:
    context = retriever(question)
    system_message = (
        "You are a helpful customer support agent. "
        "Answer using only the information provided below:\n\n"
        + "\n".join(context)
    )
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    support_bot("How many users can I have on the Starter plan?")

Both metadata values appear on the trace. You can filter runs by metadata using the filtering controls in the Runs table.

Production

With strong observability in place, you can confidently ship to production. In production, you have significantly more traffic and can’t examine every trace individually. LangSmith provides monitoring tools to help you understand aggregate behavior and drill down when something looks wrong.

Monitoring

In the UI sidebar, select Monitoring, then choose a tracing project from the dropdown at the top left. Charts display key metrics for the project over time, including trace count, latency, error rate, feedback scores, and costs. For more on available metrics and chart configuration, refer to Dashboards.

A/B testing

Group-by functionality requires at least two different values for a given metadata key.

Because you have been logging the llm metadata attribute, you can group monitoring charts by that attribute to compare model performance over time. From Monitoring in the UI sidebar, click Group by in the top left corner, select Metadata from the dropdown, then select llm. The charts update to show results grouped by that attribute. For more on grouping and custom charts, refer to Dashboards.

Drilldown

When a monitoring chart shows something unexpected, click a data point to freeze the tooltip, then click the metric name (for example, Input) to jump to the filtered runs table for that time window. For more on searching and filtering runs, refer to Filter traces.

LangSmith UI showing the monitoring page with a specific point on the Input Tokens chart highlighted.

Conclusion

In this tutorial, you added LangSmith observability to an application across its full development lifecycle. The same tracing setup that helps you iterate quickly during prototyping will continue to provide value in production. You’ll have visibility into individual traces and aggregate performance trends. For more, see:

Observability concepts: terminology and core ideas.
Tracing integrations: LangChain, LangGraph, Anthropic, and other providers.
Automations: rules and online evaluations that run automatically on your traces.

Edit this page on GitHub or file an issue.

Connect these docs to Claude, VSCode, and more via MCP for real-time answers.

Tracing setup

Configuration & troubleshooting

Viewing & managing traces

Automations

Feedback & evaluation

Monitoring & alerting

Data type reference

Trace an LLM application tutorial

Prerequisites

Prototyping

Set up your environment

Trace LLM calls

Trace the whole pipeline

Beta testing

Collect feedback

Log metadata

Production

Monitoring

A/B testing

Drilldown

Conclusion

Tracing setup

Configuration & troubleshooting

Viewing & managing traces

Automations

Feedback & evaluation

Monitoring & alerting

Data type reference

​Prerequisites

​Prototyping

​Set up your environment

​Trace LLM calls

​Trace the whole pipeline

​Beta testing

​Collect feedback

​Log metadata

​Production

​Monitoring

​A/B testing

​Drilldown

​Conclusion

Prerequisites

Prototyping

Set up your environment

Trace LLM calls

Trace the whole pipeline

Beta testing

Collect feedback

Log metadata

Production

Monitoring

A/B testing

Drilldown

Conclusion