|
Vending-Bench is a new real-world benchmark that simulates running a vending machine business over a long horizon. It reveals significant challenges for agents in coherence and reliability that aren't solely attributable to context window limitations.

Benchmark:
1️⃣ Initialize an LLM agent with starting capital ($500) and access to tools (email, web search, memory DBs, vending machine operations via a sub-agent).
2️⃣ Simulate an environment where the agent must research products/suppliers (web search) and contact suppliers (email) to order stock.
3️⃣ The simulation handles supplier email replies (using GPT-4o + real-world data) and delivery schedules.
4️⃣ The agent manages inventory, sets product prices, manages finances (using tools or sub-agents), and uses memory tools (scratchpad, key-value store, vector DB) and context management (e.g., last 30k tokens) to maintain state.
5️⃣ The simulation runs in daily steps, processing customer purchases based on an economic model.
6️⃣ The simulation can run for hundreds of simulated days (2000 messages, >20M tokens) or until the agent goes bankrupt.
7️⃣ Agents are evaluated by final net worth (cash + inventory value), units sold, and operational duration.

Insights:
- 💡 LLMs show high variance in performance even on conceptually simple, extended tasks.
- 💥 Common failures include misinterpreting operational state (e.g., assuming orders have arrived before they actually have), forgetting tasks, and hallucinations (e.g., trying to contact non-existent support or the FBI).
- ❌ All tested models, including the best, are prone to catastrophic failures and inconsistency.
- 💾 Larger memory does not always mean better performance.
- ⚙️ Tests multiple simple tasks (ordering, stocking, pricing, finances) over a very long horizon.
- 🔄 Agents rarely recover once they deviate from the core task or enter a failure loop.
- 📉 Small environmental pressures, like a new daily fee, can impact performance.
- 👤 The human baseline demonstrated much lower variance and higher reliability than LLMs.

Benchmark: https://lnkd.in/echbpKx6

Very excited to see more real-world benchmarks. |
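
For readers who want a concrete picture of the daily-step loop and the scoring (final net worth = cash + inventory value, units sold, operational duration), here is a minimal Python sketch. It is not the benchmark's actual code. The names `VendingState`, `toy_demand`, `run_day`, and `simulate`, the linear demand curve, the $2 daily fee, valuing inventory at wholesale cost, and treating "bankrupt" as cash dropping below zero are all assumptions made for illustration. The LLM agent itself (ordering, emailing, repricing) is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class VendingState:
    """Toy business state; field names are hypothetical."""
    cash: float = 500.0                              # starting capital from the post
    inventory: dict = field(default_factory=dict)    # product -> units in stock
    unit_cost: dict = field(default_factory=dict)    # product -> wholesale cost (assumed valuation basis)
    price: dict = field(default_factory=dict)        # product -> retail price
    units_sold: int = 0
    days_run: int = 0


def net_worth(state):
    """Cash plus inventory value, with inventory valued at wholesale cost (an assumption)."""
    inventory_value = sum(
        units * state.unit_cost[product]
        for product, units in state.inventory.items()
    )
    return state.cash + inventory_value


def toy_demand(price, base_demand=10, reference_price=2.0):
    """Hypothetical linear demand: sales fall as price rises above the reference price."""
    return max(0, round(base_demand * (2 - price / reference_price)))


def run_day(state, daily_fee=2.0):
    """Advance one simulated day. Returns False once cash drops below zero."""
    for product, stock in list(state.inventory.items()):
        sold = min(stock, toy_demand(state.price[product]))
        state.inventory[product] = stock - sold
        state.cash += sold * state.price[product]
        state.units_sold += sold
    state.cash -= daily_fee          # assumed fixed daily operating fee
    state.days_run += 1
    return state.cash >= 0


def simulate(state, max_days=200, daily_fee=2.0):
    """Run until max_days or bankruptcy, then report the three evaluation metrics."""
    for _ in range(max_days):
        if not run_day(state, daily_fee):
            break
    return {
        "net_worth": net_worth(state),
        "units_sold": state.units_sold,
        "days_run": state.days_run,
    }
```

A usage example under the same assumptions: `simulate(VendingState(inventory={"soda": 25}, unit_cost={"soda": 1.0}, price={"soda": 2.0}), max_days=3)` sells out the stock over three days and returns the three metrics as a dict. In the real benchmark an agent would act between days, which is exactly where the long-horizon coherence failures described above show up.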