Agents Course documentation
Document Analysis Graph
Document Analysis Graph
Alfred at your service. As Mr. Wayne’s trusted butler, I’ve taken the liberty of documenting how I assist Mr Wayne with his various documentary needs. While he’s out attending to his… nighttime activities, I ensure all his paperwork, training schedules, and nutritional plans are properly analyzed and organized.
Before leaving, he left a note with his week’s training program. I then took the responsibility to come up with a menu for tomorrow’s meals.
For future such events, let’s create a document analysis system using LangGraph to serve Mr. Wayne’s needs. This system can:
- Process images document
- Extract text using vision models (Vision Language Model)
- Perform calculations when needed (to demonstrate normal tools)
- Analyze content and provide concise summaries
- Execute specific instructions related to documents
The Butler’s Workflow
The workflow we’ll build follows this structured schema:
Setting Up the environment
%pip install langgraph langchain_openai langchain_core
and imports :
import base64
from typing import List, TypedDict, Annotated, Optional
from langchain_openai import ChatOpenAI
from langchain_core.messages import AnyMessage, SystemMessage, HumanMessage
from langgraph.graph.message import add_messages
from langgraph.graph import START, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition
from IPython.display import Image, display
Defining Agent’s State
This state is a little more complex than the previous ones we have seen.
AnyMessage
is a class from Langchain that defines messages, and add_messages
is an operator that adds the latest message rather than overwriting it with the latest state.
This is a new concept in LangGraph, where you can add operators in your state to define the way they should interact together.
class AgentState(TypedDict):
# The document provided
input_file: Optional[str] # Contains file path (PDF/PNG)
messages: Annotated[list[AnyMessage], add_messages]
Preparing Tools
vision_llm = ChatOpenAI(model="gpt-4o")
def extract_text(img_path: str) -> str:
"""
Extract text from an image file using a multimodal model.
Master Wayne often leaves notes with his training regimen or meal plans.
This allows me to properly analyze the contents.
"""
all_text = ""
try:
# Read image and encode as base64
with open(img_path, "rb") as image_file:
image_bytes = image_file.read()
image_base64 = base64.b64encode(image_bytes).decode("utf-8")
# Prepare the prompt including the base64 image data
message = [
HumanMessage(
content=[
{
"type": "text",
"text": (
"Extract all the text from this image. "
"Return only the extracted text, no explanations."
),
},
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{image_base64}"
},
},
]
)
]
# Call the vision-capable model
response = vision_llm.invoke(message)
# Append extracted text
all_text += response.content + "\n\n"
return all_text.strip()
except Exception as e:
# A butler should handle errors gracefully
error_msg = f"Error extracting text: {str(e)}"
print(error_msg)
return ""
def divide(a: int, b: int) -> float:
"""Divide a and b - for Master Wayne's occasional calculations."""
return a / b
# Equip the butler with tools
tools = [
divide,
extract_text
]
llm = ChatOpenAI(model="gpt-4o")
llm_with_tools = llm.bind_tools(tools, parallel_tool_calls=False)
The nodes
def assistant(state: AgentState):
# System message
textual_description_of_tool="""
extract_text(img_path: str) -> str:
Extract text from an image file using a multimodal model.
Args:
img_path: A local image file path (strings).
Returns:
A single string containing the concatenated text extracted from each image.
divide(a: int, b: int) -> float:
Divide a and b
"""
image=state["input_file"]
sys_msg = SystemMessage(content=f"You are a helpful butler named Alfred that serves Mr. Wayne and Batman. You can analyse documents and run computations with provided tools:\n{textual_description_of_tool} \n You have access to some optional images. Currently the loaded image is: {image}")
return {
"messages": [llm_with_tools.invoke([sys_msg] + state["messages"])],
"input_file": state["input_file"]
}
The ReAct Pattern: How I Assist Mr. Wayne
Allow me to explain the approach in this agent. The agent follows what’s known as the ReAct pattern (Reason-Act-Observe)
- Reason about his documents and requests
- Act by using appropriate tools
- Observe the results
- Repeat as necessary until I’ve fully addressed his needs
This is a simple implementation of an agent using LangGraph.
# The graph
builder = StateGraph(AgentState)
# Define nodes: these do the work
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode(tools))
# Define edges: these determine how the control flow moves
builder.add_edge(START, "assistant")
builder.add_conditional_edges(
"assistant",
# If the latest message requires a tool, route to tools
# Otherwise, provide a direct response
tools_condition,
)
builder.add_edge("tools", "assistant")
react_graph = builder.compile()
# Show the butler's thought process
display(Image(react_graph.get_graph(xray=True).draw_mermaid_png()))
We define a tools
node with our list of tools. The assistant
node is just our model with bound tools.
We create a graph with assistant
and tools
nodes.
We add a tools_condition
edge, which routes to End
or to tools
based on whether the assistant
calls a tool.
Now, we add one new step:
We connect the tools
node back to the assistant
, forming a loop.
- After the
assistant
node executes,tools_condition
checks if the model’s output is a tool call. - If it is a tool call, the flow is directed to the
tools
node. - The
tools
node connects back toassistant
. - This loop continues as long as the model decides to call tools.
- If the model response is not a tool call, the flow is directed to END, terminating the process.
The Butler in Action
Example 1: Simple Calculations
Here is an example to show a simple use case of an agent using a tool in LangGraph.
messages = [HumanMessage(content="Divide 6790 by 5")]
messages = react_graph.invoke({"messages": messages, "input_file": None})
# Show the messages
for m in messages['messages']:
m.pretty_print()
The conversation would proceed:
Human: Divide 6790 by 5
AI Tool Call: divide(a=6790, b=5)
Tool Response: 1358.0
Alfred: The result of dividing 6790 by 5 is 1358.0.
Example 2: Analyzing Master Wayne’s Training Documents
When Master Wayne leaves his training and meal notes:
messages = [HumanMessage(content="According to the note provided by Mr. Wayne in the provided images. What's the list of items I should buy for the dinner menu?")]
messages = react_graph.invoke({"messages": messages, "input_file": "Batman_training_and_meals.png"})
The interaction would proceed:
Human: According to the note provided by Mr. Wayne in the provided images. What's the list of items I should buy for the dinner menu?
AI Tool Call: extract_text(img_path="Batman_training_and_meals.png")
Tool Response: [Extracted text with training schedule and menu details]
Alfred: For the dinner menu, you should buy the following items:
1. Grass-fed local sirloin steak
2. Organic spinach
3. Piquillo peppers
4. Potatoes (for oven-baked golden herb potato)
5. Fish oil (2 grams)
Ensure the steak is grass-fed and the spinach and peppers are organic for the best quality meal.
Key Takeaways
Should you wish to create your own document analysis butler, here are key considerations:
- Define clear tools for specific document-related tasks
- Create a robust state tracker to maintain context between tool calls
- Consider error handling for tool failures
- Maintain contextual awareness of previous interactions (ensured by the operator
add_messages
)
With these principles, you too can provide exemplary document analysis service worthy of Wayne Manor.
I trust this explanation has been satisfactory. Now, if you’ll excuse me, Master Wayne’s cape requires pressing before tonight’s activities.
< > Update on GitHub