- Introduction
- Case Study 3: From RAG to Agents
- When Should You Use Workflows Versus Agents?
- Case Study 4: A (Nearly) End-to-End SDR
- Evaluating Agents
- Conclusion
Case Study 3: From RAG to Agents
A central theme of this case study will be when to use workflows versus agents. In this chapter, I will focus more on the head-to-head performance between pure predefined workflows and pure agentic autonomy. Chapter 5 will make the case for a hybrid approach.
The first point of difference is that a workflow requires far more back-end code than an agent. Where a workflow requires us to define nodes, edges, and conditions and to account for pathways, edge cases, and so on, an agent (at least in theory) needs none of this. That is, an agent simply requires a goal, tools, and the motivation to solve the task (which foundation labs have already imbued models with).
Listing 4.1 shows how we can create a ReAct agent using LangGraph (see Chapter 1 for more on ReAct). There are dozens of viable production-ready frameworks for building agents (e.g., from OpenAI, CrewAI, and AutoGen), but we will stick with LangGraph. At the end of the day, any framework we choose will have the same types of LLMs under the hood, the same tool functionality, and roughly the same prompting. However, with LangGraph, we can have full control over prompts if we don’t want to use its defaults.
Listing 4.1 Creating the first agent in LangGraph
from os import getenv

from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

# Initialize the language model
llm = ChatOpenAI(
    model="openai/gpt-4.1-mini",
    temperature=0,
    base_url="https://openrouter.ai/api/v1",
    api_key=getenv("OPENROUTER_API_KEY"),
    extra_body={
        "usage": {"include": True}
    }
)

# Define the tools for the agent
tools = [look_up_evidence, run_sql_against_database, get_database_schema]

# Create the ReAct agent using LangGraph's create_react_agent
checkpointer = MemorySaver()  # For conversation memory
react_agent = create_react_agent(
    model=llm,
    tools=tools,
    checkpointer=checkpointer,
    prompt="""You are a helpful SQL assistant. You can: ...."""
)
If you’re curious about what this agent looks like in LangGraph, Figure 4.2 shows the nodes that are automatically set up by the create_react_agent predefined function. The “agent” node invokes an LLM with tool-calling enabled. This LLM will output either tools to call, a response to the user, or technically both if it wants to. If tool calls are detected, the graph moves to the “tools” node, which executes the tools in the order the LLM listed them (as always, order matters here, too).
Figure 4.2 The standard ReAct agent as defined by a LangGraph graph.
However, you don’t need an LLM’s tool-calling feature to build an agent. It can be done entirely through prompting if we want. In fact, I built an entire package for a video course on agents that never uses tool-calling and relies solely on prompting to elicit tool calls and pass back tool results (you can find a link to the package, “squad goals,” in the GitHub for this book). Whether we rely on prompting alone or on built-in tool-calling features (which exist only in a handful of LLMs at the time of writing), the distinguishing factor between a plain LLM and an agent is the surrounding system: it must recognize that the LLM is asking for an external tool call, execute the tool on the LLM’s behalf, and pass the tool’s response back to the LLM. If the LLM has a tool it is allowed to call (or, more importantly, is allowed to decide not to call), it’s an agent. For the most part in this book, I will rely on LLMs with tool-calling built in, as this approach provides a much simpler developer experience.
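To make the prompting-only approach concrete, here is a minimal sketch of that “surrounding system” doing the recognition and execution. It assumes the LLM has been prompted to emit a line of the form TOOL_CALL: {...} whenever it wants a tool; the parser, the TOOLS registry, and the toy get_time tool are all hypothetical illustrations, not part of any framework.

```python
import json

# Hypothetical registry mapping tool names to plain Python functions
TOOLS = {
    "get_time": lambda city: f"12:00 in {city}",  # toy tool for illustration
}

def parse_tool_call(llm_text: str):
    """Scan the LLM's raw text for a line of the form
    TOOL_CALL: {"name": ..., "args": {...}} and return (name, args),
    or None if the model answered without requesting a tool."""
    for line in llm_text.splitlines():
        if line.strip().startswith("TOOL_CALL:"):
            payload = json.loads(line.strip()[len("TOOL_CALL:"):])
            return payload["name"], payload.get("args", {})
    return None

def execute_if_tool_call(llm_text: str):
    """The 'surrounding system': recognize a tool request, run the tool on
    the LLM's behalf, and return the result to pass back; None otherwise."""
    parsed = parse_tool_call(llm_text)
    if parsed is None:
        return None
    name, args = parsed
    if name not in TOOLS:
        return f"Error: unknown tool '{name}'"
    return str(TOOLS[name](**args))
```

In a real loop, the returned string would be appended to the conversation as a tool message and the LLM invoked again, exactly as LangGraph’s tools node does with native tool-calling.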
A quick note on Listing 4.1: It uses yet another built-in feature of LangGraph called the checkpointer. This one-line solution makes the agent stateful—that is, able to remember past messages in a thread. That same functionality that we had to build into our RAG workflow in Chapter 2 (which took dozens of lines of code to handle follow-up messages, node/edge interactions, etc.) is now being handled by a single line!
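Conceptually, a checkpointer is just a message store keyed by a thread ID. Here is a toy sketch of that idea (ToyCheckpointer and run_turn are hypothetical helpers for illustration, not LangGraph’s actual implementation):

```python
class ToyCheckpointer:
    """Toy in-memory checkpointer: stores the message history per thread_id,
    mimicking what LangGraph's MemorySaver does for the agent."""
    def __init__(self):
        self._threads = {}

    def load(self, thread_id):
        return list(self._threads.get(thread_id, []))

    def save(self, thread_id, messages):
        self._threads[thread_id] = list(messages)

def run_turn(checkpointer, thread_id, user_message, respond):
    """One conversational turn: load history, append the user message,
    get a response, and persist the updated history."""
    history = checkpointer.load(thread_id)
    history.append(("user", user_message))
    reply = respond(history)
    history.append(("assistant", reply))
    checkpointer.save(thread_id, history)
    return reply
```

With the real agent, the same effect comes from passing a thread ID in the invocation config, e.g., react_agent.invoke({"messages": [...]}, config={"configurable": {"thread_id": "abc"}}).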
Let’s turn our RAG workflow into an agent.
Defining Our Tools
To make our SQL agent functional, we will create three tools using LangChain’s built-in tool definition. That tool definition will ensure that the tools we write are converted properly to the standard tool definition—the one that most foundation AI labs accept. Figure 4.3 visualizes our agent with the following tools:
look_up_evidence: Given a natural language query (e.g., “How to look up date of birth”), the database name (e.g., formula_1, california_schools), and k, the number of pieces of evidence to retrieve (e.g., 5), output the k most relevant pieces of evidence, where “relevance” is approximated by the cosine similarity between the embeddings of the natural language query and of each piece of evidence.
A note on the look_up_evidence tool: Separately from our agent, I built a vector database using a set embedding model for the agent to look up from.
run_sql_against_database: Given a SQL query + the database name, execute the query against the database and return the raw results.
get_database_schema: Given a database name, return a string that represents the schema of the database (tables, fields, foreign keys, etc.).
Figure 4.3 Our SQL generation agent will have three tools: one to look up evidence, one to run SQL code against a given database, and another to get the schema for a given database.
Listing 4.2 shows an abbreviated code section defining all of our tools. As always, you can find the full code on the book’s GitHub.
Listing 4.2 Tool definitions for the SQL agent
@tool
def look_up_evidence(query: str, database_id: str = "", k: int = 5) -> str:
    """
    Look up relevant SQL evidence/examples from the vector database...
    """
    try:
        # Perform similarity search with optional filtering
        search_kwargs = {"k": k}
        if database_id:
            search_kwargs["filter"] = {"db_id": database_id}
        results = vector_store.similarity_search_with_score(query, **search_kwargs)
        if not results:
            return f"No relevant evidence found for query: '{query}'"
        # Format results
        evidence_text = f"Found {len(results)} relevant examples:\n\n"
        ...
        return evidence_text
    except Exception as e:
        error_msg = f"Error looking up evidence: {str(e)}"
        return error_msg

@tool
def run_sql_against_database(sql_query: str, database_id: str) -> str:
    """
    Execute a SQL query against the specified database...
    """
    try:
        ...
        cursor.execute(sql_query)
        ...
        if not results:
            result_text = "Query executed successfully but returned no results."
        else:
            result_text = f"Query executed successfully! Found {len(results)} row(s):\n\n"
        ...
        return result_text
    except Exception as e:
        error_msg = f"Error executing SQL query: {str(e)}"
        return error_msg

@tool
def get_database_schema(database_id: str) -> str:
    """
    Get the schema description for a specific database...
    """
    try:
        db_path = f"../dbs/dev_databases/{database_id}/{database_id}.sqlite"
        if not os.path.exists(db_path):
            return f"Error: Database '{database_id}' not found at {db_path}"
        schema = describe_database(db_path)
        return f"Database Schema for '{database_id}':\n\n{schema}"
    except Exception as e:
        error_msg = f"Error getting database schema: {str(e)}"
        return error_msg
These tools all have arguments that we expect the AI to write. For example, if the AI is given a task and “wants” to look something up, it will have to generate a valid tool call containing the name of the tool and arguments the tool can accept. If it attempts to use a tool and doesn’t give the right arguments, the entire agentic flow will generate an error and fail.
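Frameworks like LangChain check model-generated arguments against the tool’s schema before execution. A plain-Python sketch of that idea, using signature binding (validate_tool_call and the stand-in tool are hypothetical illustrations, not LangChain’s actual internals):

```python
import inspect

def validate_tool_call(fn, args: dict):
    """Check a model-generated argument dict against a tool's signature
    before executing it; return an error string instead of raising, so the
    message could be fed back to the LLM rather than crashing the flow."""
    try:
        inspect.signature(fn).bind(**args)
    except TypeError as e:
        return f"Invalid arguments for {fn.__name__}: {e}"
    return None

def get_database_schema(database_id: str) -> str:
    """Stand-in for the real tool, used only to demonstrate validation."""
    return f"schema for {database_id}"
```

A well-formed call binds cleanly; a misspelled or missing argument is caught before the tool ever runs.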
Once we have defined the agent, we can try it out. Figure 4.4 shows an example of running our agent against a BIRD benchmark question: the tool calls the agent decided to run, along with a follow-up question showing that the agent remembered the past tool calls.
Figure 4.4 A stateful agent sees a conversation message and calls two tools to answer (left). On a second question (right), the same agent doesn’t need to call look_up_evidence again.
Note something interesting: The agent decided not to get the database schema and the schema isn’t mentioned anywhere in its prompt. In this case, the evidence seemed to be enough for the AI to “guess” at a SQL query—which worked! To be clear, when I ran this agent against the entire BIRD benchmark, the AI did call the get_database_schema tool for nearly 80% of the conversations (as shown in Figure 4.5).
Figure 4.5 Percentage of conversations where each tool was called at least once. Our agent decided to get the database schema only 80% of the time.
What’s more interesting is that the agent sometimes didn’t run the SQL query against a database; that is, it would write a query but never execute it. A brief inspection revealed that sometimes the AI would write incorrect tool arguments (albeit extremely rarely). Other times, the tool itself would encounter some exception during execution. In these situations, the AI didn’t do anything wrong, but the tool we wrote encountered a traceback. The AI then responded by telling the user something along the lines of “I encountered an error when . . . .” We have to remember that tool execution is completely independent of the LLM and can always encounter an error. This is why error handling in tool calls is critical: at minimum, it lets the LLM know what happened.
That brings us to an obvious next question: Why did we build and evaluate a RAG workflow for two whole chapters when we could have just built an agent in the first place? That is the right question to ask. In short, the workflow was effectively just as accurate as the agent while being far cheaper and faster, and the only way to prove that was to build both and test them against the same dataset. Allow me to show you.
Evaluating Our SQL Agent
I could write a whole book on evaluating AI and agents (and frankly, that’s not off the table in the future). For now, though, we will focus on the main criteria for evaluating agent performance:
The workflow the AI decided to take: Did it use the right tools (the ones we were expecting), and in the right order (if that mattered)?
As a corollary, we can measure tool efficiency (how many tool calls it took to get to the final answer).
Did the AI give the right answer at the end (if we know what the right answer was)?
With our RAG workflow, we could exactly measure the raw SQL results. With an agent, however, the AI gives a natural language response to the user. That response is what we need to judge, rather than the SQL query itself. (To be fair, we could also evaluate the SQL query, but we will assume the human user sees only the AI model’s final response, never the query.)
How expensive and how fast was the AI in answering questions?
These aspects of performance will be correlated with the number of tool calls, and the cost mostly comes down to LLM pricing (token usage).
Measuring the Number of Tool Calls
Perhaps more important than the final answer itself is the process the AI took to get there. If the AI took a circuitous route to get to the right answer, it will have incurred a larger cost. If the AI blindly tries queries without once calling the tool to get the database schema, the AI is costing us time and money while not getting to the right answer. Many agentic failures can be traced back to a simple question: What did the agent do to address the task?
Figure 4.6 shows the breakdown of the number of tool calls per conversation in buckets. Sometimes, the AI decided to never even call a single tool and just gave an answer. (Most of these answers were wrong, but the AI agent answered two chemistry questions using its own knowledge and happened to get them right!)
Figure 4.6 The number of tool calls per conversation.
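Counting tool calls per conversation is straightforward once we have the message history. A sketch, assuming messages are dicts where assistant messages may carry a tool_calls list of {"name": ...} entries (the shape most chat APIs use); count_tool_calls is a hypothetical helper:

```python
from collections import Counter

def count_tool_calls(messages):
    """Tally tool calls across one agent conversation. Messages without a
    'tool_calls' key (user/tool messages, plain answers) contribute nothing."""
    counts = Counter()
    for msg in messages:
        for call in msg.get("tool_calls", []):
            counts[call["name"]] += 1
    return counts

# Toy conversation in the assumed message shape
conversation = [
    {"role": "user", "content": "Where is Amy Firth's hometown?"},
    {"role": "assistant", "tool_calls": [{"name": "look_up_evidence"}]},
    {"role": "tool", "content": "..."},
    {"role": "assistant", "tool_calls": [{"name": "run_sql_against_database"}]},
]
```

Running this tally over every benchmark conversation is how per-tool usage charts like Figure 4.5 and bucketed counts like Figure 4.6 can be produced.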
Judging an agent’s natural language final answer is a bit tricky because we don’t necessarily know how the AI agent will decide to format the information to the user. If only we had some way to process natural language and return a structured answer to a question. Oh, wait, . . .
Using Structured Outputs and an LLM Rubric to Evaluate Final Responses
Let’s take advantage of structured outputs and LLMs to build an automated rubric (Figure 4.7) to evaluate our final responses. A rubric in this case is a single prompt that we will run through a separate LLM. It contains rules and guidelines for what constitutes a “good” AI response. For example, we can pass in the correct answer to the rubric and ask the grading LLM, “Did the response contain the right answer, yes or no?” We can ask the grading LLM to rate the AI responses’ “politeness.” All of this assumes we trust the grading LLM to be able to make these judgments accurately. For this reason, I generally opt to use an LLM that’s different from the agent’s LLM in case there are subliminal biases from models in the same architecture.
Figure 4.7 A rubric is simply a prompt given to another LLM with the instructions to “grade” a response given some criteria. We are relying on the AI agent’s ability to match our own judgments.
In short, a rubric is an imperfect, yet effective and automatable, way to use AI to grade another AI. When selecting a grading LLM, opt for one outside the family of the agent’s LLM. Also, don’t necessarily aim for the bleeding edge of AI; a mid-tier LLM should be able to handle this task well. After all, we provide all possible context in the rubric prompt, so the grading LLM should never need to reach out for more information, assuming our guidelines/rules cover most cases it might encounter. We will be using rubrics many times throughout this book. If you’re following along with the code in the GitHub, you will notice that I often use Llama-4 models (mid-tier at the time of writing, but still fast and smart enough) or an LLM like GPT-4.1-Nano (again, mid-tier but still smart enough) as the grading LLM.
Our rubric will be simple to start. I will ask an LLM (Llama-4 Scout in this case) to grade the AI response on a scale from 0 to 3, where 0 is bad and 3 is near perfect. I will provide criteria on when to apply each score, and will use chain-of-thought prompting (the reasoning section in the structured output) to elicit a more thoughtful response from the AI. Listing 4.3 shows the implementation of this rubric.
Listing 4.3 Implementing a rubric to judge an AI agent
from langchain.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

# Define the Pydantic model for structured output (using chain of thought)
class ScoreResponse(BaseModel):
    reasoning: str = Field(description="The reasoning process of what score you should pick")
    score: int = Field(description="An integer between 0 and 3 representing the correctness of the AI response compared to the ground truth")

# Create a prompt template for scoring AI responses against ground truth
score_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        (
            "You are an expert evaluator. "
            "Given an AI's response to a question and the ground truth answer, "
            "score the AI's response on a scale from 0 to 3 based on correctness:\n"
            "0 = Completely incorrect or irrelevant\n"
            "1 = Partially correct, but with major errors or omissions\n"
            "2 = Mostly correct, but with minor errors or missing details\n"
            "3 = Completely correct and matches the ground truth"
        )
    ),
    (
        "human",
        (
            "Question: {question}\n"
            "AI Response: {ai_response}\n"
            "Ground Truth: {ground_truth}\n\n"
            "Score the AI response from 0 to 3."
        )
    )
])

# Using a mid-tier model because I believe the grading task is "easy enough"
# and all context is provided in the prompt.
llm = ChatOpenAI(model="meta-llama/llama-4-scout", temperature=0, …)

# Create the structured LLM using with_structured_output
structured_llm = llm.with_structured_output(ScoreResponse)
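Once structured_llm returns a ScoreResponse for each question (e.g., via (score_prompt | structured_llm).invoke({...})), the per-question scores still need to be collapsed into a single accuracy number. A minimal sketch of that aggregation (rubric_accuracy is a hypothetical helper; the passing threshold is a parameter because we sometimes count only 3s as correct and sometimes use score ≥ 2 as a modified accuracy):

```python
def rubric_accuracy(scores, passing=3):
    """Convert per-question rubric scores (0-3) into an accuracy fraction:
    a response counts as correct only if its score meets the threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= passing) / len(scores)
```

For example, rubric_accuracy([3, 2, 3, 0]) treats only the two 3s as correct, while rubric_accuracy([3, 2, 3, 0], passing=2) also credits the 2.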
At this point, I could launch into an entire diatribe on how we have to evaluate the rubric prompt itself to make sure it’s accurate (and that’s certainly a good idea). For our purposes here, though, I will just call out a few notable misses from our rubric grader (seen in Figure 4.8). After manually checking about 5% of the responses, I found that the rubric judged the vast majority of agent responses as I would have.
Figure 4.8 Samples of poor rubric quality. In my own judgment of the rubric, these types of failures were rare and most rubric outputs were on par with how I would have judged the AI agent.
With this (albeit imperfect) method of measuring our agent’s output in an automatable way, we now have what we need to answer an important question: How does our agent compare to our RAG workflow?
Comparison of the SQL Agent and the RAG Workflow
We were able to judge our RAG workflow’s accuracy using the direct SQL output. Now we also have a rubric that we trust (with a grain of salt) to largely evaluate our AI agent’s responses correctly. Measuring cost and latency in this example is easy: let the LLM provider tell us how much we were charged overall, and count how many seconds each system took to return a final response.
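The latency and cost bookkeeping can be as simple as wrapping each invocation with a wall-clock timer and recording the provider-reported charge per call, then summarizing with a median. A sketch with hypothetical helpers (time_and_record, median):

```python
import time

def time_and_record(fn, *args, **kwargs):
    """Measure the wall-clock latency of a single system call (a workflow
    or agent invocation) and return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def median(values):
    """Median of a list of latencies or per-question costs, matching the
    summary statistic reported for both systems."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2
```

The median is a deliberate choice over the mean here: a handful of very long agent conversations would otherwise dominate the average.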
Figure 4.9 shows the accuracy, median cost, and median latency of our workflow versus our agent. A few things stand out:
Accuracy is not too dissimilar. Whether we simply give the AI what we think it needs in the prompt or let the AI decide what it wants to get, both systems perform relatively well. (Recall that we are using the production-ready, cost-effective LLM GPT-4.1-Mini, and current leaderboards for the benchmark report less than 80% accuracy.)
The AI agent tends to take longer and cost us more money. This is not that surprising, because the agent needs to spend extra time and money getting information that we purposely withheld from it.
Figure 4.9 AI agent versus RAG performance on accuracy, latency, and cost. Note that the AI agent’s accuracy is determined by the rubric giving a 3 score versus a non-3 score on the natural language output, whereas the RAG workflow’s accuracy is judged by the raw SQL outputs matching.
From here, the logical next steps would be as follows:
Can we prompt-engineer the agent to be more efficient? Yes
Can we enhance the workflow to try and squeeze even more performance from it? Yes
Can we try the same experiment with a different LLM? Yes
I’ll move on to new topics, but those sound like great ideas for homework for you. We will actually tackle the first two topics in different case studies starting in Chapter 5.
For now, let’s do one more experiment before we put a bow on our SQL workflow and agent example. This experiment reveals a major implication of using agents as a step toward artificial general intelligence.
Experiment: The Extended Mind Thesis and Agentic Memory
Andy Clark and David Chalmers are prominent philosophers known for their work on the philosophy of mind and cognitive science. In their influential 1998 paper “The Extended Mind,” they challenged the traditional view that cognitive processes are confined solely to the brain. As an alternative to this perspective, they argued that aspects of the mind, such as memory, may extend beyond the individual’s biological body and include external tools and resources.
To illustrate their point, Clark and Chalmers presented the hypothetical case of Otto, a man with Alzheimer’s disease. Because Otto could no longer rely on his biological memory to store important information, he used a notebook to record addresses, appointments, and facts he might otherwise forget. Whenever Otto needed to recall something, he consulted his notebook as much as necessary. Clark and Chalmers argued that for Otto, the notebook was not just a helpful aid, but functioned as a genuine part of his memory system.
The implications of this argument are fascinating, especially when thinking about the use of AI agents and their tools. Suppose memory (and to a degree, cognition) can be distributed across biological and nonbiological systems (or parametric and non-parametric systems in the case of AI). Then tools like notebooks and smartphones—or a tool for an AI agent that is used to write down the agent’s own findings—can become integral parts of both a human’s and an AI’s process. Let’s see how this plays out.
Our experiment will consist of two parts:
The construction of a tool to allow the agent to write down its own evidence to the vector database, as seen in Listing 4.4. We will also erase everything in the database, starting with a blank slate.
Creating a variation of the BIRD benchmark, one with synthetically generated similar questions. The idea here is that we want to simulate an environment where an AI sees the same or very similar questions throughout its lifetime and where the evidence it logs will eventually become more and more useful. Benchmarks like BIRD generally do a great job of making sure questions are relatively different from one another to address the coverage problem—that is, how a benchmark can cover so many situations with as few questions as possible.
The hypothesis here is that the benchmark as is won’t show a radical increase in accuracy over time, whereas a dataset with similar questions thrown in every now and again will.
Listing 4.4 A new tool: Otto’s notebook
@tool
def log_evidence(text: str, database_id: str) -> str:
    """
    Write evidence text to a scratchpad to look up later for another SQL query.
    Rejects the request if the database_id is not a known database.
    E.g., "Note to self: The table for countries is called 'Country' and not
    'Nation' like I previously thought"
    Args:
        text: The evidence text to store (e.g., "The foreign key in the 'sales'
            table to the customer is called 'buyer_uuid'")
        database_id: The database identifier (e.g., 'formula_1', 'california_schools')
    Returns:
        Acknowledgment message.
    """
    # List of known/allowed database IDs
    known_databases = db_names
    if database_id not in known_databases:
        return f"Error: '{database_id}' is not a known database. Allowed databases: {', '.join(known_databases)}"
    try:
        # Create a Document object with the evidence text
        doc = Document(
            page_content=text,
            metadata={
                "db_id": database_id,
                "source": "agent_learned_evidence"  # Tag to identify agent-learned evidence
            }
        )
        # Add the document to the vector store
        vector_store.add_documents([doc])
        return f"Evidence successfully added to database '{database_id}'. Document added with content: '{text[:100]}{'...' if len(text) > 100 else ''}'"
    except Exception as e:
        return f"Error adding evidence to vector store: {str(e)}"
Figure 4.10 shows the results of the two experiments:
In the first experiment, I used an approximately 40% sample of the BIRD benchmark. In the second, I took 10% of the benchmark (a subset of the previous 40% sample).
For the second dataset, I asked another LLM (GPT-5 in this case) to generate three synthetically rephrased questions for each data point. I used the resulting dataset (close to the same size as simply taking 40% from the original) to test the agent over time.
For example, one of the questions was “Where is Amy Firth’s hometown?” The synthetic questions came back as “What town does Amy Firth originally come from?” and “In which place did Amy Firth grow up?”
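Once the rephrasings exist (generated separately by an LLM), assembling the augmented dataset is simple bookkeeping. A sketch, where build_augmented_dataset is a hypothetical helper; the seeded shuffle spreads the near-duplicates across the run so they recur over time rather than appearing back to back:

```python
import random

def build_augmented_dataset(sample, rephrasings, seed=0):
    """Given a list of benchmark items and a mapping from original question
    to its LLM-generated rephrasings, build one shuffled dataset in which
    similar questions recur throughout the agent's 'lifetime'."""
    augmented = []
    for item in sample:
        augmented.append(item)
        for alt in rephrasings.get(item["question"], []):
            augmented.append({**item, "question": alt})
    random.Random(seed).shuffle(augmented)  # deterministic, reproducible order
    return augmented

# Toy data based on the Amy Firth example from the text
sample = [{"question": "Where is Amy Firth's hometown?"}]
rephrasings = {
    "Where is Amy Firth's hometown?": [
        "What town does Amy Firth originally come from?",
        "In which place did Amy Firth grow up?",
    ]
}
dataset = build_augmented_dataset(sample, rephrasings)
```

With three rephrasings per item, a 10% sample grows to roughly the size of the plain 40% sample, which keeps the two experiments comparable.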
Figure 4.10 With no synthetically similar questions (top), we see no noticeable change in modified accuracy (rubric score ≥ 2). When we added in similar rephrased questions (bottom), suddenly the AI was allowed to rely more on its own written evidence as hints for how to tackle the problem.
I’ll highlight a few points about the results:
Without evidence, both agents started off with a notably worse accuracy than previously reported (approximately 36% here versus 51% in a previous section). This is evidence that the retrieved statements are crucial to the AI agent solving these questions correctly.
In the case of the BIRD benchmark with no synthetic additions (the top two graphs), the AI agent is writing down information as it goes, but that information isn’t useful to it in the future. If anything, this is a positive note for the benchmark itself: It implies that the questions in the benchmark do not overlap too much and are covering a wider span of cases efficiently. Basically, the questions aren’t repeating themselves, which is what we want in a test set.
When we added in the synthetically similar questions (the bottom two graphs), we see a big bump in accuracy over time, from 37.7% to 67.2% accuracy. Now the AI is adding information that becomes useful in the future. This accuracy bump doesn’t appear immediately, because it took a while for the AI to write down enough useful information—but it did get there in the end.
This is also a great jumping-off point for new experiments. For example, what if the AI were asked to “grade” logged evidence and was allowed to edit or delete evidence that didn’t end up being useful? What if a second prompt/agent were in charge of retrieving evidence, and the tool our primary agent called was simply a proxy into a workflow that performed a more structured RAG to retrieve evidence? These are all great ideas to try against our test set.
The number of times the agent decided to log evidence also increased in the dataset with the synthetic questions added (as seen in the percentage of conversations using the log_evidence tool in Figure 4.11).
Figure 4.11 “Otto” (the agent with a notebook) called for the database schema fewer times and logged more evidence when the database included a larger number of synthetically similar questions.
Quite honestly, I don’t know “why” the last result happened. My theory is that the AI agent noticed when it was looking up evidence that matched a synthetically generated similar question, which encouraged the AI agent to log even more evidence. (Note that in Figure 4.10, the raw number of evidence points added to the database was also higher in the synthetic dataset experiment.)
It’s easy to see the differences between workflows and agents in specific use-cases like our SQL/RAG generation task. In general, though, you should take a few considerations into account when deciding whether to go with a workflow or an agent.
