Evaluating Agents

We’ve already discussed ways to evaluate agents, including latency, cost, and memory considerations, and especially in comparison to rigid workflows. Figure 4.23 breaks down even more ways to consider evaluating agents in four main categories:

  • System: Focused on the “behind the scenes” functionality of an agent (e.g., the tool latency, the LLM provider error rate).

  • Quality assurance: Judging the LLM’s ability to adhere to instructions and follow a specific output format. The quality of that output falls into this category as well.

  • Tool interaction: Specifically zooming in on the agent’s tool-calling ability, including whether the agent is calling the right tools for the right tasks and whether it passed the right arguments into the tool.

  • Agent efficiency: Dependent on the tool interaction and system results to a degree. This category is concerned with questions like how many steps the agent took to come up with the answer, how many tokens it took, and how much it cost.

    • For example, an agent that produces 1,000 tokens of reasoning just to call a single tool could be slower and more expensive than an agent that calls five tools with no reasoning in between; the latter is "tool inefficient," while the former is "token inefficient."
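The token-versus-tool efficiency trade-off above can be made concrete with a little arithmetic. Here is a minimal sketch; the per-token prices and token counts are hypothetical (real rates vary by provider and model), and the assumption that each tool call re-sends the growing context as input tokens is a simplification:

```python
# Hypothetical per-1K-token pricing (illustrative only).
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough dollar cost of a single LLM call."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# "Token inefficient": one tool call, but 1,000 tokens of reasoning first.
reasoning_heavy = run_cost(input_tokens=500, output_tokens=1_000)

# "Tool inefficient": five tool calls with minimal reasoning, but each
# call re-sends the growing conversation context as input tokens.
tool_heavy = sum(
    run_cost(input_tokens=500 + i * 200, output_tokens=50)
    for i in range(5)
)

print(f"reasoning-heavy: ${reasoning_heavy:.4f}")
print(f"tool-heavy:      ${tool_heavy:.4f}")
```

With these made-up numbers, the five-tool-call agent ends up costing slightly more, because repeated context re-sends dominate; a different pricing scheme or context size could easily flip the comparison, which is why both metrics are worth tracking.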

Figure 4.23

Figure 4.23 There are dozens, if not hundreds, of metrics we can use to measure the efficacy of agents and multi-agent systems, ranging from low-level tool-based metrics to holistic rubric-based accuracy.

We can also rely on third-party tooling (no pun intended) to help us audit and look back on our agents’ past work.

LangSmith for Traceability

Even if all we need is a way to audit our agents, there’s no reason to reinvent the wheel. LangSmith (brought to us by the same people who developed LangGraph) is a tracing and observability tool that is purpose-built for LLM workflows. It can track the step-by-step execution of both workflows and agents. LangSmith provides a live dashboard showing every input, output, and tool call, which makes debugging and long-term monitoring a breeze. It also gives us the ability to audit past work. There are dozens (at least) of platforms offering observability, evaluations, and more, but LangSmith is free to get started with, simple to set up, and provides immediate value out of the gate. I generally recommend starting with LangSmith even if you are still evaluating other platforms, as it provides an excellent baseline for the bare-bones features you should expect from any platform.

Getting started with LangSmith is a very simple process. All we need to do is set a few environment variables in whatever deployment we are using (even a code notebook), and the SDK will automatically trace everything (see Listing 4.6). We can even split traces by project or use the built-in UI to review results across experiments.

Listing 4.6 Setting LangSmith keys in the agent environment

LANGSMITH_TRACING=true
LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
LANGSMITH_API_KEY="XXX"
LANGSMITH_PROJECT="ai-agent"
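In a notebook or script, the same variables can be set programmatically before any agent code runs. A minimal sketch (the placeholder API key mirrors Listing 4.6 and must be replaced with your own):

```python
import os

# Enable LangSmith tracing for everything executed after this point.
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGSMITH_API_KEY"] = "XXX"  # replace with your real key
os.environ["LANGSMITH_PROJECT"] = "ai-agent"
```

Setting these before importing or invoking the agent matters, because the SDK reads them when tracing is initialized.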

Once the environment variables are set, every agent run (including tool calls, LLM generations, and custom functions) will be recorded in a LangSmith dashboard. This lets us dig into failed runs, analyze tool usage, and share trace URLs for easier debugging and collaboration. Figure 4.24 shows what an agent trace looks like in the LangSmith user interface.

Figure 4.24

Figure 4.24 An agent trace in the LangSmith dashboard, highlighting each tool call and LLM step. LangSmith offers an easy way to get auditable logs by adding only a few environment variables to the AI system.
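For custom functions that aren't already wrapped by LangChain or LangGraph, the LangSmith Python SDK offers a `traceable` decorator to pull them into the same trace. A minimal sketch (the function name and body are illustrative; the import fallback is only there so the snippet runs even without the SDK installed):

```python
try:
    from langsmith import traceable  # pip install langsmith
except ImportError:
    # Fallback no-op decorator so the example runs without the SDK.
    def traceable(**kwargs):
        return lambda fn: fn

@traceable(name="lookup_weather")
def lookup_weather(city: str) -> str:
    """A toy custom step; with the SDK installed and the environment
    variables from Listing 4.6 set, each call shows up as a run in
    the LangSmith dashboard."""
    return f"Sunny in {city}"

print(lookup_weather("Paris"))
```

Decorated calls appear alongside the agent's tool calls and LLM generations in the trace, so even bespoke logic is auditable.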

In the next few chapters, we will continue to evaluate agents based on the specific work they’re asked to do. But for now, let’s wrap up this Agents 101 discussion with a recap.
