The evaluator was a painful (and long) lesson in how poor LLM agents can be.
- Way, way too strict: it requires answers to match the reference verbatim and is incapable of recognising that code using different naming etc. is also a correct answer. Many examples, but to pick two:
  - #2 insists on creating a new visit_webpage tool when importing VisitWebPage is just as correct (see the sketch after this list).
  - #5 rejects an answer for referencing anthropic/claude-3.5-sonnet instead of anthropic/claude-3-sonnet.
- Consistently inconsistent: answers it previously accepted are later rejected.
- Reference files are incorrect: #3 only accepts an answer that is itself wrong and inconsistent with the smolagents documentation. There is no smolagents import named E2BSandbox, yet the correct example given in both the smolagents and E2B documentation was rejected.
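To illustrate the naming issue in #2, here is a minimal sketch of the two forms that should both pass. VisitWebpageTool is the name smolagents actually exports; the hand-rolled visit_webpage body is a hypothetical stand-in for whatever the reference expects:

from smolagents import CodeAgent, HfApiModel, VisitWebpageTool, tool

# Option 1: import the built-in tool (rejected by the evaluator)
agent = CodeAgent(model=HfApiModel(), tools=[VisitWebpageTool()])

# Option 2: hand-roll an equivalent tool (the only form it accepts)
@tool
def visit_webpage(url: str) -> str:
    """Return the contents of a webpage as markdown.

    Args:
        url: The URL of the webpage to visit.
    """
    import requests
    from markdownify import markdownify
    return markdownify(requests.get(url).text)

agent = CodeAgent(model=HfApiModel(), tools=[visit_webpage])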
Same here; the matching should allow for more flexibility.
There are also errors shown to the candidate like "too strict requirement for tooling" in question #4. I don't even understand what that means...
The E2BSandbox question seems unsolvable from the HF blog or the smolagents source code, which makes it a bit hard to satisfy...
The answer I arrived at heuristically is:
from smolagents import CodeAgent, E2BSandbox

agent = CodeAgent(
    tools=[],
    model=model,
    sandbox=E2BSandbox(),
    additional_authorized_imports=["numpy"]
)
This sandbox parameter isn't even mentioned in the documentation, LOL.
I've also given up. The E2B sandbox seems to have a different API according to the docs, one which doesn't pass the quiz:
https://huggingface.co/docs/smolagents/main/tutorials/secure_code_execution
from smolagents import HfApiModel, CodeAgent
agent = CodeAgent(model=HfApiModel(), tools=[], executor_type="e2b")
agent.run("Can you give me the 100th Fibonacci number?")
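For comparison, the numpy-authorizing answer from above would look like this with the documented API. This is just a sketch: executor_type="e2b" comes straight from the linked tutorial, and I'm assuming it needs the E2B extra installed (pip install "smolagents[e2b]") plus an E2B_API_KEY in the environment:

from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(
    model=HfApiModel(),
    tools=[],
    executor_type="e2b",  # documented way to run generated code in an E2B sandbox
    additional_authorized_imports=["numpy"],  # imports the sandboxed code may use
)
agent.run("Can you give me the 100th Fibonacci number?")

So the sandbox is selected via executor_type, not via a sandbox= object, which is exactly why the "reference" answer above can't be right.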