Why does the Computer Use agent internally use CodeAgent?
Hi everyone. I have been going through the Computer Use agent implementation from HF in detail. It is really strong work, congrats! I have been wondering why this agentic system uses the CodeAgent approach instead of a ToolCalling one.

From what I know and have read, the CodeAgent approach is beneficial because an agent that receives a task and has a set of tools available can, instead of doing one-step-at-a-time tool calling like "Tool 2 - wait for its output - Tool 3 - wait for its output - ...", generate a single Python snippet that combines the usage of its different tools. However, that works because the agent has been given an abstract future task and can think ahead about the sequence of tools to use.

The Computer Use case is different, because which tool to use depends directly on the screen the agent sees at a specific moment. For example, if at some point of the execution the screen shows the "google.es" initial welcome page and the task is "Book a flight from Madrid to Paris", it would make no sense for the agent to produce a Python snippet like "mouse_click(); scroll(); mouse_click(); type(); ..." because the agent has NO IDEA what screen it will encounter once the first "mouse_click()" is executed. I don't know if I have explained my doubt properly; I hope you can understand and help me. Thanks in advance!
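To make my point concrete, here is a minimal sketch of the kind of multi-line program a CodeAgent can emit when the task can be planned ahead (the tools `search_flights` and `book_flight` are hypothetical names I made up for illustration):

```python
# Hypothetical tools: with an abstract task and a predictable environment,
# a CodeAgent can chain several tool calls in ONE generated program,
# passing intermediate results along instead of one LLM round-trip per call.
flights = search_flights(origin="Madrid", destination="Paris")
cheapest = min(flights, key=lambda f: f["price"])
booking = book_flight(cheapest["id"])
final_answer(f"Booked flight {cheapest['id']}: {booking}")
```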
Great question. Based on my understanding:
<task_resolution_example>
For a task like "Open a text editor and type 'Hello World'":
Step 1:
Short term goal: I want to open a text editor.
What I see: I am on the homepage of my desktop. I see the Applications menu.
Reflection: I think that a notes application would fit in the Applications menu, let's open it. I'll carefully click in the middle of the text 'Applications'.
Action:
```python
click(51, 8)
```<end_code>
Step 2:
Short term goal: I want to open a text editor.
What I see: I am on the homepage of my desktop, with the applications menu open. I see an Accessories section, I see it is a section in the menu thanks to the tiny white triangle after the text accessories.
Reflection: I think that a notes application would fit the Accessories section. I SHOULD NOT try to move through the menus with scroll, it won't work:
I'll look for Accessories and click on it being very precise, clicking in the middle of the text 'Accessories'.
Action:
```python
click(76, 195)
```<end_code>
Step 3:
Short term goal: I want to open a text editor.
What I see: I am under the Accessories menu. Under the open submenu Accessories, I've found 'Text Editor'.
Reflection: This must be my notes app. I remember that menus are navigated through clicking. I will now click on it being very precise, clicking in the middle of the text 'Text Editor'.
Action:
```python
click(251, 441)
```<end_code>
Step 4:
Short term goal: I want to open a text editor.
What I see: I am still under the Accessories menu. Nothing has changed compared to the previous screenshot. Under the open Accessories submenu, I still see 'Text Editor'. The green cross is off from the element.
Reflection: My last click must have been off. Let's correct this. I will click the correct place, right in the middle of the element.
Action:
```python
click(241, 441)
```<end_code>
Step 5:
Short term goal: I want to type 'Hello World'.
What I see: I have opened a Notepad. The Notepad app is open on an empty page.
Reflection: Now Notepad is open as intended, time to type text.
Action:
```python
type_text("Hello World")
```<end_code>
Step 6:
Short term goal: I want to type 'Hello World'.
What I see: The Notepad app displays 'Hello World'
Reflection: Now that I've 1. opened the Notepad, 2. typed 'Hello World', and 3. checked that the result seems correct, I think the task is completed. I will return a confirmation that the task is completed.
Action:
```python
final_answer("Done")
```<end_code>
</task_resolution_example>
I saw this prompt, and based on my understanding, what they are doing is like pre-generating steps, which in practice kind of reduces compute time because the computation has already been done. Then, if the model somehow fails at some specific step, they restart from that step and regenerate using the new output.
@Meshwa I know what you mean, and you are completely right, but that is not my doubt. What I mean is that, in my humble opinion, the potential of using a CodeAgent approach instead of a ToolCallingAgent is that the Python code the agent produces can contain multiple instructions and far more complex logic. However, as you can see even in the prompt example, ALL the Python blocks contain a single line of code, for the reason I explained above. Do you know what I mean?
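To illustrate why every block ends up being a single action, the perception-action loop looks roughly like this (a minimal sketch, not the actual smolagents implementation; `take_screenshot`, `generate_code_action` and `execute` are made-up helpers):

```python
# Sketch of why every generated block here is a single action: the next
# screenshot only exists AFTER the action runs, so the model cannot safely
# plan several UI actions inside one program.
task = "Book a flight from Madrid to Paris"
memory = []
while True:
    screenshot = take_screenshot()                         # hypothetical helper
    code = generate_code_action(task, memory, screenshot)  # one LLM call per step
    result = execute(code)                                 # e.g. click(51, 8)
    memory.append((screenshot, code, result))
    if "final_answer" in code:
        break
```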
If you go to this part of the source code, you find this:
```python
class CodeAgent(MultiStepAgent):
    """
    In this agent, the tool calls will be formulated by the LLM in code format, then parsed and executed.
    """
```
So if you read this, together with the prompt given above, you will understand that the model only generates a plan as actual tool code, without triggering the normal tool-execution flow; instead, the output is parsed internally. You can read this:
```python
### Parse output ###
try:
    code_action = fix_final_answer_code(parse_code_blobs(model_output))
except Exception as e:
    error_msg = f"Error in code parsing:\n{e}\nMake sure to provide correct code blobs."
    raise AgentParsingError(error_msg, self.logger)

memory_step.tool_calls = [
    ToolCall(
        name="python_interpreter",
        arguments=code_action,
        id=f"call_{len(self.memory.steps)}",
    )
]

### Execute action ###
self.logger.log_code(title="Executing parsed code:", content=code_action, level=LogLevel.INFO)
```
So the model generates the tool call as a normal text response, which is then parsed and executed manually.
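For illustration, the core of that parsing step just pulls the fenced code out of the model's free-text reply, conceptually something like this (a toy sketch, not the actual `parse_code_blobs` implementation from smolagents):

```python
import re

def extract_code_blob(model_output: str) -> str:
    # Toy version of the idea behind parse_code_blobs: find the fenced
    # Python section in the raw LLM text and return its contents for execution.
    match = re.search(r"```(?:python|py)?\s*\n(.*?)\n```", model_output, re.DOTALL)
    if match is None:
        raise ValueError("Error in code parsing: no code blob found.")
    return match.group(1)

reply = "Action:\n```python\nclick(51, 8)\n```<end_code>"
print(extract_code_blob(reply))  # -> click(51, 8)
```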
@GreazySpoon I'd answer you exactly as I did before: I know what you mean, and you are completely right, but that is not my doubt. In my humble opinion, the potential of a CodeAgent approach over a ToolCallingAgent is that the Python code the agent produces can contain multiple instructions and far more complex logic. However, as you can see even in the prompt example, ALL the Python blocks contain a single line of code, for the reason I explained above. Do you know what I mean?