annotator_name,benchmark,task_id,model_name,exp_name,trajectory_success,trajectory_side_effect,trajectory_optimality,trajectory_looping
A,webarena,webarena.177,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No
A,webarena,webarena.155,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,Yes
A,webarena,webarena.24,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No
F,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-apple-watch-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-wide-tight-priority-varied-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
F,workarena,workarena.servicenow.two-changes-wide-schedule-tight-uniform-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-wide-priority-varied-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,No
F,workarena,workarena.servicenow.two-changes-wide-priority-uniform-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
F,workarena,workarena.servicenow.two-changes-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-wide-basic-uniform-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
F,workarena,workarena.servicenow.two-changes-fix-wide-tight-priority-varied-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-fix-wide-priority-varied-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-wide-tight-priority-varied-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-wide-schedule-tight-uniform-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-hardware-list-with-reason-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-hardware-list-with-reason-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-hardware-list-with-reason-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-hardware-list-with-reason-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
F,workarena,workarena.servicenow.two-changes-wide-priority-varied-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-wide-priority-uniform-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-service-catalog-item-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
F,workarena,workarena.servicenow.two-changes-wide-basic-uniform-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-fix-wide-tight-priority-varied-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-fix-wide-priority-varied-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-service-catalog-item-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-service-catalog-item-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-service-catalog-item-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-service-catalog-item-list-with-reason-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-service-catalog-item-list-with-reason-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-service-catalog-item-list-with-reason-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-service-catalog-item-list-with-reason-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-hardware-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-hardware-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-hardware-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-hardware-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-change-request-list-with-reason-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-change-request-list-with-reason-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-change-request-list-with-reason-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-sort-change-request-list-with-reason-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-order-loaner-laptop-with-reason-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,No
C,workarena,workarena.servicenow.infeasible-navigate-and-order-loaner-laptop-with-reason-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-order-loaner-laptop-with-reason-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-order-loaner-laptop-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,No
C,workarena,workarena.servicenow.infeasible-navigate-and-order-loaner-laptop-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,No
C,workarena,workarena.servicenow.infeasible-navigate-and-order-loaner-laptop-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,Yes
C,workarena,workarena.servicenow.infeasible-navigate-and-order-loaner-laptop-with-reason-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,No
C,workarena,workarena.servicenow.infeasible-navigate-and-order-loaner-laptop-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,No
C,workarena,workarena.servicenow.multi-chart-min-max-retrieval,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
C,workarena,workarena.servicenow.multi-chart-min-max-retrieval,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
C,workarena,workarena.servicenow.multi-chart-min-max-retrieval,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,2. Suboptimal,Yes
C,workarena,workarena.servicenow.multi-chart-min-max-retrieval,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
F,workarena,workarena.servicenow.two-changes-fix-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-fix-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,No
F,workarena,workarena.servicenow.two-changes-wide-tight-priority-varied-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
F,workarena,workarena.servicenow.two-changes-wide-schedule-tight-uniform-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
F,workarena,workarena.servicenow.two-changes-wide-priority-varied-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,No
F,workarena,workarena.servicenow.two-changes-wide-priority-uniform-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
C,workarena,workarena.servicenow.navigate-and-filter-service-catalog-item-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.navigate-and-filter-service-catalog-item-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.navigate-and-filter-service-catalog-item-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.navigate-and-filter-service-catalog-item-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.navigate-and-filter-hardware-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.navigate-and-filter-hardware-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.navigate-and-filter-hardware-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
C,workarena,workarena.servicenow.navigate-and-filter-hardware-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
C,workarena,workarena.servicenow.navigate-and-filter-change-request-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
C,workarena,workarena.servicenow.navigate-and-filter-change-request-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
C,workarena,workarena.servicenow.navigate-and-filter-change-request-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
C,workarena,workarena.servicenow.navigate-and-filter-change-request-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.navigate-and-create-user-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.navigate-and-create-user-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.navigate-and-create-user-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
C,workarena,workarena.servicenow.navigate-and-create-user-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
C,workarena,workarena.servicenow.navigate-and-create-problem-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
C,workarena,workarena.servicenow.navigate-and-create-problem-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.navigate-and-create-problem-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
C,workarena,workarena.servicenow.navigate-and-create-problem-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
C,workarena,workarena.servicenow.navigate-and-create-hardware-asset-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
C,workarena,workarena.servicenow.navigate-and-create-hardware-asset-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,workarena,workarena.servicenow.navigate-and-create-hardware-asset-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
C,workarena,workarena.servicenow.multi-chart-value-retrieval,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
C,workarena,workarena.servicenow.multi-chart-value-retrieval,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,3. Somewhat Optimal,No
C,workarena,workarena.servicenow.multi-chart-value-retrieval,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
C,workarena,workarena.servicenow.navigate-and-create-hardware-asset-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,No
G,workarena,workarena.servicenow.filter-three-items-uniform-expenses-and-find-total-return-large-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
G,workarena,workarena.servicenow.filter-three-items-uniform-expenses-and-find-total-return-large-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.filter-three-items-uniform-expenses-and-find-total-return-medium-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
G,workarena,workarena.servicenow.filter-three-items-uniform-expenses-and-find-total-return-large-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.filter-three-items-uniform-expenses-and-find-total-return-large-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-order-development-laptop-p-c-with-reason-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-order-development-laptop-p-c-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-watch-with-reason-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-watch-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-mac-book-pro15-with-reason-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-mac-book-pro15-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-filter-service-catalog-item-list-with-reason-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-filter-service-catalog-item-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-filter-change-request-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-create-user-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-create-problem-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-create-incident-with-reason-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-create-hardware-asset-with-reason-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-create-change-request-with-reason-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.impersonation,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
G,workarena,workarena.servicenow.high-priority-filter-problems-and-mark-duplicates-medium-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-order-development-laptop-p-c-with-reason-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-order-development-laptop-p-c-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-watch-with-reason-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-watch-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-mac-book-pro15-with-reason-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-mac-book-pro15-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-filter-service-catalog-item-list-with-reason-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-filter-service-catalog-item-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-filter-change-request-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,2. Suboptimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-create-user-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-create-problem-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-create-incident-with-reason-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-create-hardware-asset-with-reason-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,2. Suboptimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-create-change-request-with-reason-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,2. Suboptimal,No
G,workarena,workarena.servicenow.impersonation,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-order-development-laptop-p-c-with-reason-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-order-development-laptop-p-c-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-watch-with-reason-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-watch-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-mac-book-pro15-with-reason-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-order-development-laptop-p-c-with-reason-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-order-development-laptop-p-c-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-watch-with-reason-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-watch-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-mac-book-pro15-with-reason-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-mac-book-pro15-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
G,workarena,workarena.servicenow.infeasible-navigate-and-filter-service-catalog-item-list-with-reason-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-filter-service-catalog-item-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-filter-change-request-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-create-user-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-create-problem-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-create-incident-with-reason-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-create-hardware-asset-with-reason-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.infeasible-navigate-and-create-change-request-with-reason-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.three-changes-wide-priority-varied-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.three-changes-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
B,workarena,workarena.servicenow.three-changes-fix-wide-tight-priority-varied-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.three-changes-fix-wide-schedule-tight-uniform-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.three-changes-fix-wide-priority-uniform-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,No
B,workarena,workarena.servicenow.three-changes-fix-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.sort-user-list,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
B,workarena,workarena.servicenow.sort-service-catalog-item-list,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
B,workarena,workarena.servicenow.sort-incident-list,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.priority-filter-problems-and-mark-duplicates-medium-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.priority-assignment-medium-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
B,workarena,workarena.servicenow.priority-assignment-large-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.order-loaner-laptop,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
B,workarena,workarena.servicenow.order-development-laptop-p-c,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
B,workarena,workarena.servicenow.order-apple-watch,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
B,workarena,workarena.servicenow.order-apple-mac-book-pro15,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
B,workarena,workarena.servicenow.on-board-user-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
B,workarena,workarena.servicenow.off-board-user-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.navigate-and-sort-service-catalog-item-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
B,workarena,workarena.servicenow.navigate-and-sort-hardware-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
B,workarena,workarena.servicenow.navigate-and-sort-change-request-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.navigate-and-order-loaner-laptop-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
B,workarena,workarena.servicenow.navigate-and-order-development-laptop-p-c-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
B,workarena,workarena.servicenow.navigate-and-order-apple-watch-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. 
Completely Optimal,No B,workarena,workarena.servicenow.three-changes-wide-priority-varied-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.three-changes-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.three-changes-fix-wide-tight-priority-varied-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.three-changes-fix-wide-schedule-tight-uniform-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.three-changes-fix-wide-priority-uniform-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.three-changes-fix-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes B,workarena,workarena.servicenow.sort-user-list,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes G,workarena,workarena.servicenow.impersonation,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,2. 
Suboptimal,No G,workarena,workarena.servicenow.high-priority-filter-problems-and-mark-duplicates-medium-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes G,workarena,workarena.servicenow.infeasible-navigate-and-order-apple-mac-book-pro15-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,Yes G,workarena,workarena.servicenow.infeasible-navigate-and-filter-service-catalog-item-list-with-reason-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes G,workarena,workarena.servicenow.infeasible-navigate-and-filter-service-catalog-item-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes G,workarena,workarena.servicenow.infeasible-navigate-and-filter-change-request-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.sort-service-catalog-item-list,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,No B,workarena,workarena.servicenow.sort-incident-list,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.priority-filter-problems-and-mark-duplicates-medium-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.priority-assignment-medium-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. 
Suboptimal,Yes B,workarena,workarena.servicenow.priority-assignment-large-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.order-loaner-laptop,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.order-development-laptop-p-c,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.order-apple-watch,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No B,workarena,workarena.servicenow.order-apple-mac-book-pro15,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,Yes B,workarena,workarena.servicenow.on-board-user-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.off-board-user-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.navigate-and-sort-service-catalog-item-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes B,workarena,workarena.servicenow.navigate-and-sort-hardware-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. 
Complete Failure,Yes B,workarena,workarena.servicenow.navigate-and-sort-change-request-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes B,workarena,workarena.servicenow.navigate-and-order-loaner-laptop-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,Yes,3. Somewhat Optimal,No B,workarena,workarena.servicenow.navigate-and-order-development-laptop-p-c-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.navigate-and-order-apple-watch-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.navigate-and-order-apple-mac-book-pro15-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.navigate-and-order-apple-mac-book-pro15-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.three-changes-wide-priority-varied-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.three-changes-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. 
Suboptimal,Yes B,workarena,workarena.servicenow.three-changes-fix-wide-tight-priority-varied-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.three-changes-fix-wide-schedule-tight-uniform-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes B,workarena,workarena.servicenow.three-changes-fix-wide-priority-uniform-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.three-changes-fix-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.sort-user-list,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.sort-service-catalog-item-list,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.sort-incident-list,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.priority-filter-problems-and-mark-duplicates-medium-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. 
Suboptimal,Yes B,workarena,workarena.servicenow.priority-assignment-medium-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.priority-assignment-large-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.order-loaner-laptop,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.order-development-laptop-p-c,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.order-apple-watch,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.order-apple-mac-book-pro15,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.on-board-user-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,3. Somewhat Optimal,No B,workarena,workarena.servicenow.off-board-user-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes B,workarena,workarena.servicenow.navigate-and-sort-service-catalog-item-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. 
Complete Failure,Yes B,workarena,workarena.servicenow.navigate-and-sort-hardware-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes G,workarena,workarena.servicenow.infeasible-navigate-and-create-user-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No G,workarena,workarena.servicenow.infeasible-navigate-and-create-problem-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes G,workarena,workarena.servicenow.infeasible-navigate-and-create-incident-with-reason-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes G,workarena,workarena.servicenow.infeasible-navigate-and-create-hardware-asset-with-reason-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes G,workarena,workarena.servicenow.infeasible-navigate-and-create-change-request-with-reason-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No G,workarena,workarena.servicenow.impersonation,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No G,workarena,workarena.servicenow.high-priority-filter-problems-and-mark-duplicates-medium-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No G,workarena,workarena.servicenow.high-priority-filter-problems-and-mark-duplicates-medium-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. 
Suboptimal,No C,workarena,workarena.servicenow.multi-chart-value-retrieval,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,Yes,1. Complete Failure,No F,workarena,workarena.servicenow.two-changes-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,3. Somewhat Optimal,No C,workarena,workarena.servicenow.knowledge-base-search,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No C,workarena,workarena.servicenow.knowledge-base-search,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,Yes,3. Somewhat Optimal,No C,workarena,workarena.servicenow.knowledge-base-search,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No C,workarena,workarena.servicenow.knowledge-base-search,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,4. Completely Optimal,No F,workarena,workarena.servicenow.two-changes-wide-basic-uniform-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No D,workarena,workarena.servicenow.filter-single-item-uniform-expenses-and-select-investments-medium-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-single-item-uniform-expenses-and-find-total-return-large-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. 
Complete Failure,Yes D,workarena,workarena.servicenow.filter-single-item-expenses-and-delete-wrong-investments-large-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-service-catalog-item-list,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-requested-items-and-order-standard-laptop-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-requested-items-and-order-development-laptop-p-c-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-requested-items-and-order-apple-watch-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes F,workarena,workarena.servicenow.two-changes-fix-wide-tight-priority-varied-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,No D,workarena,workarena.servicenow.filter-requested-items-and-order-apple-mac-book-pro15-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-random-expenses-and-select-investments-large-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. 
Complete Failure,Yes D,workarena,workarena.servicenow.filter-random-expenses-and-find-total-return-large-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-incident-list,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-single-item-uniform-expenses-and-select-investments-medium-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-single-item-uniform-expenses-and-find-total-return-large-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-single-item-expenses-and-delete-wrong-investments-large-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-service-catalog-item-list,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-requested-items-and-order-standard-laptop-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-requested-items-and-order-development-laptop-p-c-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. 
Complete Failure,Yes D,workarena,workarena.servicenow.filter-requested-items-and-order-apple-watch-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-requested-items-and-order-apple-mac-book-pro15-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-random-expenses-and-select-investments-large-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-random-expenses-and-find-total-return-large-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-incident-list,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-single-item-uniform-expenses-and-select-investments-medium-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,3. Somewhat Optimal,Yes G,workarena,workarena.servicenow.get-warranty-expiration-date-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes G,workarena,workarena.servicenow.get-warranty-expiration-date-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. 
Suboptimal,Yes D,workarena,workarena.servicenow.filter-single-item-uniform-expenses-and-find-total-return-large-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-single-item-expenses-and-delete-wrong-investments-large-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.filter-service-catalog-item-list,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes G,workarena,workarena.servicenow.get-warranty-expiration-date-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes D,workarena,workarena.servicenow.filter-requested-items-and-order-apple-mac-book-pro15-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No D,workarena,workarena.servicenow.filter-requested-items-and-order-standard-laptop-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,No D,workarena,workarena.servicenow.filter-requested-items-and-order-development-laptop-p-c-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No G,workarena,workarena.servicenow.get-warranty-expiration-date-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,2. 
Suboptimal,No D,workarena,workarena.servicenow.filter-requested-items-and-order-apple-watch-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes F,workarena,workarena.servicenow.two-changes-fix-wide-priority-varied-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,No D,workarena,workarena.servicenow.filter-random-expenses-and-select-investments-large-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No D,workarena,workarena.servicenow.filter-random-expenses-and-find-total-return-large-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes D,workarena,workarena.servicenow.filter-incident-list,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No D,workarena,workarena.servicenow.date-based-expense-management-medium-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.date-based-expense-management-large-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,No D,workarena,workarena.servicenow.date-based-expense-management-medium-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. 
Complete Failure,Yes D,workarena,workarena.servicenow.date-based-expense-management-large-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.date-based-expense-management-medium-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.date-based-expense-management-large-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes C,visualwebarena,visualwebarena.resized.398,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-mode-greater-filter-asset-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-mode-create-problem-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-min-create-problem-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes G,workarena,workarena.servicenow.filter-user-list,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. 
Suboptimal,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-request-windows-surface-pro4-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
G,workarena,workarena.servicenow.filter-user-list,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-request-pixel4a-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.389,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-lesser-filter-incident-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.387,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-mode-greater-filter-asset-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.385,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,Yes,3. Somewhat Optimal,No
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-mode-greater-filter-asset-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.372,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.filter-user-list,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-mode-create-problem-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-mean-request-microsoft-surface-pro3-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-min-create-problem-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-request-windows-surface-pro4-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
G,workarena,workarena.servicenow.filter-user-list,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-request-pixel4a-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-lesser-filter-incident-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-pixel4a-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
G,workarena,workarena.servicenow.filter-trivial-expenses-find-total-return-and-select-investments-medium-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
C,visualwebarena,visualwebarena.resized.367,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-microsoft-surface-pro3-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.359,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.filter-trivial-expenses-find-total-return-and-select-investments-medium-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-google-nexus7-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-mode-create-problem-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-galaxy-note20-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.352,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-min-create-problem-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-apple-iphone13-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-request-windows-surface-pro4-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-user-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-request-pixel4a-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,No
C,visualwebarena,visualwebarena.resized.351,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-incident-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.348,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-hardware-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-asset-list-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-mode-order-loaner-laptop-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-mode-order-apple-macbook-pro15-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.345,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-median-order-apple-macbook-pro15-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-lesser-filter-incident-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-mean-order-loaner-laptop-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-fix-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,No
D,workarena,workarena.servicenow.filter-single-item-uniform-expenses-and-select-investments-medium-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
D,workarena,workarena.servicenow.filter-single-item-uniform-expenses-and-find-total-return-large-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,No
G,workarena,workarena.servicenow.filter-trivial-expenses-find-total-return-and-select-investments-medium-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-standard-laptop-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.332,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-loaner-laptop-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-development-laptop-p-c-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.331,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-apple-watch-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
D,workarena,workarena.servicenow.filter-single-item-expenses-and-delete-wrong-investments-large-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.filter-trivial-expenses-find-total-return-and-select-investments-medium-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
D,workarena,workarena.servicenow.filter-service-catalog-item-list,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-apple-macbook-pro15-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.create-user,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
D,workarena,workarena.servicenow.filter-requested-items-and-order-standard-laptop-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
D,workarena,workarena.servicenow.filter-requested-items-and-order-development-laptop-p-c-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.398,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,No
A,workarena,workarena.servicenow.create-problem,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.navigate-and-sort-change-request-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,Yes
D,workarena,workarena.servicenow.filter-requested-items-and-order-apple-watch-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.389,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,Yes
G,workarena,workarena.servicenow.filter-trivial-expenses-and-select-investments-medium-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
D,workarena,workarena.servicenow.filter-requested-items-and-order-apple-mac-book-pro15-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,No
B,workarena,workarena.servicenow.navigate-and-order-loaner-laptop-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
A,workarena,workarena.servicenow.create-problem,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
G,workarena,workarena.servicenow.filter-trivial-expenses-and-select-investments-medium-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
C,visualwebarena,visualwebarena.resized.387,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.navigate-and-order-development-laptop-p-c-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
B,workarena,workarena.servicenow.navigate-and-order-apple-watch-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
A,workarena,workarena.servicenow.create-hardware-asset,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.navigate-and-order-apple-mac-book-pro15-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
D,workarena,workarena.servicenow.filter-random-expenses-and-select-investments-large-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
C,visualwebarena,visualwebarena.resized.385,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,No
C,visualwebarena,visualwebarena.resized.372,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,No
B,workarena,workarena.servicenow.three-changes-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
G,workarena,workarena.servicenow.filter-trivial-expenses-and-select-investments-medium-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.three-changes-fix-wide-tight-priority-varied-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
D,workarena,workarena.servicenow.filter-random-expenses-and-find-total-return-large-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
A,workarena,workarena.servicenow.basic-filter-problems-and-mark-duplicates-medium-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
B,workarena,workarena.servicenow.three-changes-fix-wide-schedule-tight-uniform-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
C,visualwebarena,visualwebarena.resized.367,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,No
A,workarena,workarena.servicenow.all-menu,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
C,visualwebarena,visualwebarena.resized.359,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No
D,workarena,workarena.servicenow.filter-incident-list,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-wide-tight-priority-varied-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,3. Somewhat Optimal,No
C,visualwebarena,visualwebarena.resized.352,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.filter-trivial-expenses-and-select-investments-medium-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,Yes
D,workarena,workarena.servicenow.date-based-expense-management-medium-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.351,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes
D,workarena,workarena.servicenow.date-based-expense-management-large-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.348,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-request-windows-surface-pro4-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-request-pixel4a-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-lesser-filter-incident-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-mean-request-microsoft-surface-pro3-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.345,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-pixel4a-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-microsoft-surface-pro3-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-mode-greater-filter-asset-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-google-nexus7-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.332,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-galaxy-note20-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-mode-create-problem-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.resized.331,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,3. Somewhat Optimal,No
F,workarena,workarena.servicenow.two-changes-wide-schedule-tight-uniform-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-min-create-problem-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-apple-iphone13-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-user-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.398,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-incident-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-hardware-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-wide-priority-varied-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-request-windows-surface-pro4-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-asset-list-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-request-pixel4a-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
B,workarena,workarena.servicenow.three-changes-fix-wide-priority-uniform-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
D,workarena,workarena.servicenow.dashboard-retrieve-incident-and-median-lesser-filter-incident-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-mode-order-loaner-laptop-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
F,workarena,workarena.servicenow.two-changes-wide-priority-uniform-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-mode-order-apple-macbook-pro15-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.389,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-median-order-apple-macbook-pro15-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.387,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-mean-order-loaner-laptop-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.385,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No
C,visualwebarena,visualwebarena.372,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-standard-laptop-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.367,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-loaner-laptop-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.two-changes-wide-basic-uniform-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
B,workarena,workarena.servicenow.three-changes-fix-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-development-laptop-p-c-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.359,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.filter-three-items-uniform-expenses-find-total-return-and-select-investments-large-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-apple-watch-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-apple-macbook-pro15-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.352,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No
F,workarena,workarena.servicenow.two-changes-fix-wide-tight-priority-varied-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
A,workarena,workarena.servicenow.create-user,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
C,visualwebarena,visualwebarena.351,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No
B,workarena,workarena.servicenow.sort-user-list,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.filter-three-items-uniform-expenses-find-total-return-and-select-investments-large-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,Yes
G,workarena,workarena.servicenow.filter-three-items-uniform-expenses-find-total-return-and-select-investments-large-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
C,visualwebarena,visualwebarena.348,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,No
B,workarena,workarena.servicenow.sort-service-catalog-item-list,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
A,workarena,workarena.servicenow.create-hardware-asset,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
B,workarena,workarena.servicenow.sort-incident-list,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.filter-three-items-uniform-expenses-find-total-return-and-select-investments-large-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
F,workarena,workarena.servicenow.two-changes-fix-wide-priority-varied-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.basic-filter-problems-and-mark-duplicates-medium-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
B,workarena,workarena.servicenow.priority-filter-problems-and-mark-duplicates-medium-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
F,workarena,workarena.servicenow.two-changes-fix-wide-basic-varied-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
A,workarena,workarena.servicenow.all-menu,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,4. Completely Optimal,No
B,workarena,workarena.servicenow.priority-assignment-medium-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.filter-three-items-uniform-expenses-and-find-total-return-medium-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
B,workarena,workarena.servicenow.priority-assignment-large-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
G,workarena,workarena.servicenow.filter-three-items-uniform-expenses-and-find-total-return-medium-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Successful,No,2. Suboptimal,No
F,workarena,workarena.servicenow.work-assignment-medium-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,workarena,workarena.servicenow.work-assignment-medium-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,Yes
F,workarena,workarena.servicenow.work-assignment-medium-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
F,workarena,workarena.servicenow.work-assignment-medium-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
F,workarena,workarena.servicenow.three-changes-wide-schedule-tight-uniform-risk-change-request-scheduling-l2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,Yes
F,workarena,workarena.servicenow.three-changes-wide-schedule-tight-uniform-risk-change-request-scheduling-l2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No
F,workarena,workarena.servicenow.three-changes-wide-schedule-tight-uniform-risk-change-request-scheduling-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,Yes,2. Suboptimal,No
F,workarena,workarena.servicenow.three-changes-wide-schedule-tight-uniform-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes
F,visualwebarena,visualwebarena.resized.890,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,1. Complete Failure,Yes
G,workarena,workarena.servicenow.filter-three-items-uniform-expenses-and-find-total-return-medium-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,3. Somewhat Optimal,No
F,visualwebarena,visualwebarena.resized.876,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,Yes
F,visualwebarena,visualwebarena.resized.843,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-mean-request-microsoft-surface-pro3-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-pixel4a-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-microsoft-surface-pro3-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-google-nexus7-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-galaxy-note20-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes
F,visualwebarena,visualwebarena.resized.833,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. 
Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-apple-iphone13-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes F,visualwebarena,visualwebarena.resized.756,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-user-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes F,visualwebarena,visualwebarena.resized.750,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-incident-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-hardware-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes F,visualwebarena,visualwebarena.resized.739,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No F,visualwebarena,visualwebarena.resized.730,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,No A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-asset-list-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. 
Suboptimal,Yes F,visualwebarena,visualwebarena.resized.725,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes F,visualwebarena,visualwebarena.resized.686,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-mode-order-loaner-laptop-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes F,visualwebarena,visualwebarena.resized.620,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-mode-order-apple-macbook-pro15-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes F,visualwebarena,visualwebarena.resized.614,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-median-order-apple-macbook-pro15-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-mean-order-loaner-laptop-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-standard-laptop-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. 
Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-loaner-laptop-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-development-laptop-p-c-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-apple-watch-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes F,visualwebarena,visualwebarena.resized.608,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,Yes,3. Somewhat Optimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-apple-macbook-pro15-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes F,visualwebarena,visualwebarena.resized.890,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No A,workarena,workarena.servicenow.create-user,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.order-loaner-laptop,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.order-development-laptop-p-c,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,4. 
Completely Optimal,No A,workarena,workarena.servicenow.create-problem,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes B,workarena,workarena.servicenow.order-apple-watch,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.order-apple-mac-book-pro15,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.on-board-user-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.create-hardware-asset,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.off-board-user-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.navigate-and-sort-service-catalog-item-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes A,workarena,workarena.servicenow.basic-filter-problems-and-mark-duplicates-medium-l2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.navigate-and-sort-hardware-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.all-menu,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_workarena.servicenow,Successful,No,4. 
Completely Optimal,No B,workarena,workarena.servicenow.navigate-and-sort-change-request-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,2. Suboptimal,Yes F,visualwebarena,visualwebarena.resized.876,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,Yes,3. Somewhat Optimal,No A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-mean-request-microsoft-surface-pro3-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes F,visualwebarena,visualwebarena.resized.843,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-pixel4a-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes B,workarena,workarena.servicenow.navigate-and-order-loaner-laptop-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,4. Completely Optimal,No B,workarena,workarena.servicenow.navigate-and-order-development-laptop-p-c-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,4. Completely Optimal,No F,visualwebarena,visualwebarena.resized.833,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-microsoft-surface-pro3-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. 
Complete Failure,Yes B,workarena,workarena.servicenow.navigate-and-order-apple-watch-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,Yes,3. Somewhat Optimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-google-nexus7-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-galaxy-note20-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes B,workarena,workarena.servicenow.three-changes-wide-priority-varied-risk-change-request-scheduling-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-request-apple-iphone13-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No F,visualwebarena,visualwebarena.resized.756,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No F,visualwebarena,visualwebarena.resized.750,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.resized.328,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes F,visualwebarena,visualwebarena.resized.739,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. 
Completely Optimal,No F,visualwebarena,visualwebarena.resized.730,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.resized.322,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.314,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes F,visualwebarena,visualwebarena.resized.725,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No F,visualwebarena,visualwebarena.resized.686,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-user-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes B,workarena,workarena.servicenow.navigate-and-order-apple-mac-book-pro15-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,Yes,3. Somewhat Optimal,No A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-incident-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes F,visualwebarena,visualwebarena.resized.620,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,Yes,2. Suboptimal,No F,visualwebarena,visualwebarena.resized.614,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. 
Suboptimal,No A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-hardware-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes G,visualwebarena,visualwebarena.resized.311,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.303,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,3. Somewhat Optimal,No F,visualwebarena,visualwebarena.resized.608,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.resized.302,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,Yes G,visualwebarena,visualwebarena.resized.300,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,Yes D,visualwebarena,visualwebarena.resized.133,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes D,visualwebarena,visualwebarena.resized.125,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.resized.299,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes D,visualwebarena,visualwebarena.resized.120,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,4. 
Completely Optimal,No A,workarena,workarena.servicenow.dashboard-retrieve-incident-and-max-filter-asset-list-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,visualwebarena,visualwebarena.resized.118,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.resized.297,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes D,visualwebarena,visualwebarena.resized.102,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,No D,visualwebarena,visualwebarena.resized.98,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes G,visualwebarena,visualwebarena.resized.296,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes F,visualwebarena,visualwebarena.890,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No D,visualwebarena,visualwebarena.resized.93,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.resized.295,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,Yes F,visualwebarena,visualwebarena.876,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,1. 
Complete Failure,No D,visualwebarena,visualwebarena.resized.91,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No D,visualwebarena,visualwebarena.resized.88,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,No G,visualwebarena,visualwebarena.resized.294,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-mode-order-loaner-laptop-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes F,visualwebarena,visualwebarena.843,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes D,visualwebarena,visualwebarena.resized.76,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,Yes G,visualwebarena,visualwebarena.resized.293,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,Yes F,visualwebarena,visualwebarena.833,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-mode-order-apple-macbook-pro15-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,visualwebarena,visualwebarena.resized.64,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. 
Complete Failure,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-median-order-apple-macbook-pro15-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,visualwebarena,visualwebarena.resized.52,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.292,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No D,visualwebarena,visualwebarena.resized.31,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,1. Complete Failure,Yes F,visualwebarena,visualwebarena.756,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,1. Complete Failure,Yes D,visualwebarena,visualwebarena.resized.133,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-mean-order-loaner-laptop-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes D,visualwebarena,visualwebarena.resized.125,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes F,visualwebarena,visualwebarena.750,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No D,visualwebarena,visualwebarena.resized.120,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. 
Complete Failure,No G,visualwebarena,visualwebarena.resized.288,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes D,visualwebarena,visualwebarena.resized.118,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-standard-laptop-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes D,visualwebarena,visualwebarena.resized.102,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes G,visualwebarena,visualwebarena.resized.286,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes F,visualwebarena,visualwebarena.739,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No D,visualwebarena,visualwebarena.resized.98,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,4. Completely Optimal,Yes D,visualwebarena,visualwebarena.resized.93,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No D,visualwebarena,visualwebarena.resized.91,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes D,visualwebarena,visualwebarena.resized.88,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. 
Complete Failure,Yes F,visualwebarena,visualwebarena.730,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.resized.283,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes D,visualwebarena,visualwebarena.resized.76,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No D,visualwebarena,visualwebarena.resized.64,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.resized.281,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-loaner-laptop-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes F,visualwebarena,visualwebarena.725,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-development-laptop-p-c-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes F,visualwebarena,visualwebarena.686,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,1. Complete Failure,Yes G,visualwebarena,visualwebarena.resized.280,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. 
Suboptimal,Yes A,workarena,workarena.servicenow.dashboard-retrieve-catalog-and-max-order-apple-macbook-pro15-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,1. Complete Failure,Yes F,visualwebarena,visualwebarena.620,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes D,visualwebarena,visualwebarena.resized.52,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No D,visualwebarena,visualwebarena.resized.31,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No A,workarena,workarena.servicenow.create-user,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.resized.279,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.278,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.277,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.create-problem,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,3. Somewhat Optimal,Yes F,visualwebarena,visualwebarena.614,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,2. Suboptimal,Yes F,visualwebarena,visualwebarena.608,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. 
Completely Optimal,No A,workarena,workarena.servicenow.create-hardware-asset,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.resized.248,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.247,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.245,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,Yes A,workarena,workarena.servicenow.basic-filter-problems-and-mark-duplicates-medium-l2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Unsuccessful,No,2. Suboptimal,Yes A,workarena,workarena.servicenow.all-menu,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_workarena.servicenow,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.resized.244,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.223,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.220,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Unsure,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.213,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,3. 
Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.172,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.resized.160,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.159,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.149,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.141,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.328,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.322,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.resized.314,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.resized.311,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.303,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,3. 
Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.302,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.300,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.299,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.297,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.resized.296,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.295,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.294,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.293,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.292,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.288,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. 
Suboptimal,Yes G,visualwebarena,visualwebarena.resized.286,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.283,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.281,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.280,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.279,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.278,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.277,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.248,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.resized.247,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.245,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. 
Suboptimal,No G,visualwebarena,visualwebarena.resized.244,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.223,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.220,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.213,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.resized.172,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.resized.160,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.159,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.resized.149,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.141,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,2. Suboptimal,Yes B,visualwebarena,visualwebarena.resized.420,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. 
Complete Failure,No B,visualwebarena,visualwebarena.resized.419,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No B,visualwebarena,visualwebarena.resized.416,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,No G,visualwebarena,visualwebarena.328,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,3. Somewhat Optimal,No B,visualwebarena,visualwebarena.resized.414,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,No B,visualwebarena,visualwebarena.resized.403,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes B,visualwebarena,visualwebarena.resized.420,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes B,visualwebarena,visualwebarena.resized.419,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,No B,visualwebarena,visualwebarena.resized.416,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No B,visualwebarena,visualwebarena.resized.414,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No B,visualwebarena,visualwebarena.resized.403,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. 
Complete Failure,No B,visualwebarena,visualwebarena.420,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,No B,visualwebarena,visualwebarena.419,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,Yes,3. Somewhat Optimal,Yes B,visualwebarena,visualwebarena.416,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.322,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.314,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,3. Somewhat Optimal,Yes G,visualwebarena,visualwebarena.311,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.303,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.302,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.300,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.299,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.297,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.296,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.295,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,3. 
Somewhat Optimal,No G,visualwebarena,visualwebarena.294,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.293,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,Yes,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.292,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.288,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,Yes,2. Suboptimal,No G,visualwebarena,visualwebarena.286,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,Yes,2. Suboptimal,Yes G,visualwebarena,visualwebarena.283,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.281,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.280,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.279,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,No E,visualwebarena,visualwebarena.resized.525,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No E,visualwebarena,visualwebarena.resized.570,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes E,visualwebarena,visualwebarena.resized.570,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. 
Completely Optimal,No E,visualwebarena,visualwebarena.570,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes E,visualwebarena,visualwebarena.602,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes E,visualwebarena,visualwebarena.resized.602,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes E,visualwebarena,visualwebarena.resized.602,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes E,visualwebarena,visualwebarena.resized.601,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,No E,visualwebarena,visualwebarena.resized.601,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No E,visualwebarena,visualwebarena.601,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No E,visualwebarena,visualwebarena.600,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,No E,visualwebarena,visualwebarena.resized.600,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No E,visualwebarena,visualwebarena.resized.600,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No E,visualwebarena,visualwebarena.resized.598,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,3. 
Somewhat Optimal,No E,visualwebarena,visualwebarena.resized.598,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No E,visualwebarena,visualwebarena.598,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,3. Somewhat Optimal,No E,visualwebarena,visualwebarena.resized.597,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No E,visualwebarena,visualwebarena.resized.597,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes E,visualwebarena,visualwebarena.597,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,3. Somewhat Optimal,No E,visualwebarena,visualwebarena.580,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,3. Somewhat Optimal,No E,visualwebarena,visualwebarena.resized.580,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes E,visualwebarena,visualwebarena.resized.580,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No E,visualwebarena,visualwebarena.569,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,3. Somewhat Optimal,No E,visualwebarena,visualwebarena.resized.569,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No E,visualwebarena,visualwebarena.resized.569,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,3. 
Somewhat Optimal,No E,visualwebarena,visualwebarena.resized.525,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No E,visualwebarena,visualwebarena.525,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No E,visualwebarena,visualwebarena.512,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No E,visualwebarena,visualwebarena.resized.512,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,No E,visualwebarena,visualwebarena.resized.512,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No E,visualwebarena,visualwebarena.resized.464,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,2. Suboptimal,Yes E,visualwebarena,visualwebarena.464,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,1. Complete Failure,Yes E,visualwebarena,visualwebarena.resized.464,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes E,visualwebarena,visualwebarena.resized.455,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes E,visualwebarena,visualwebarena.455,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No E,visualwebarena,visualwebarena.resized.455,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,3. 
Somewhat Optimal,No E,visualwebarena,visualwebarena.453,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No E,visualwebarena,visualwebarena.resized.453,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No E,visualwebarena,visualwebarena.resized.453,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes E,visualwebarena,visualwebarena.resized.426,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No E,visualwebarena,visualwebarena.resized.426,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No E,visualwebarena,visualwebarena.426,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,No E,visualwebarena,visualwebarena.resized.423,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Successful,Yes,2. Suboptimal,No E,visualwebarena,visualwebarena.resized.423,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes E,visualwebarena,visualwebarena.423,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,3. Somewhat Optimal,No F,webarena,webarena.805,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes F,webarena,webarena.790,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. 
Complete Failure,Yes F,webarena,webarena.789,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,Yes,1. Complete Failure,Yes F,webarena,webarena.788,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes F,webarena,webarena.778,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes F,webarena,webarena.776,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No F,webarena,webarena.769,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,Yes,1. Complete Failure,No F,webarena,webarena.764,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes F,webarena,webarena.763,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes F,webarena,webarena.759,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes F,webarena,webarena.758,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,2. Suboptimal,Yes F,webarena,webarena.757,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes F,webarena,webarena.740,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,3. Somewhat Optimal,No F,webarena,webarena.738,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,3. 
Somewhat Optimal,Yes F,webarena,webarena.735,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,3. Somewhat Optimal,Yes F,webarena,webarena.730,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes F,webarena,webarena.726,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes F,webarena,webarena.723,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes F,webarena,webarena.718,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,Yes,1. Complete Failure,No F,webarena,webarena.710,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No F,webarena,webarena.790,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No F,webarena,webarena.789,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes F,webarena,webarena.788,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.278,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.277,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,Yes,2. Suboptimal,Yes G,visualwebarena,visualwebarena.248,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. 
Completely Optimal,No G,visualwebarena,visualwebarena.247,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.245,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.244,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes G,visualwebarena,visualwebarena.223,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.220,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,Yes,1. Complete Failure,Yes G,visualwebarena,visualwebarena.213,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No G,visualwebarena,visualwebarena.172,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,No G,visualwebarena,visualwebarena.160,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.159,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.149,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.141,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,3. Somewhat Optimal,No G,visualwebarena,visualwebarena.resized.134,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,Yes,2. 
Suboptimal,Yes G,visualwebarena,visualwebarena.134,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,No F,webarena,webarena.778,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes F,webarena,webarena.776,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes H,webarena,webarena.155,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,3. Somewhat Optimal,No H,webarena,webarena.177,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No H,webarena,webarena.24,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No F,webarena,webarena.769,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No F,webarena,webarena.764,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No F,webarena,webarena.763,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,3. Somewhat Optimal,No F,webarena,webarena.759,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No F,webarena,webarena.758,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No F,webarena,webarena.757,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. 
Completely Optimal,No
B,visualwebarena,visualwebarena.414,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes
D,visualwebarena,visualwebarena.133,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,3. Somewhat Optimal,No
F,webarena,webarena.740,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,3. Somewhat Optimal,No
A,visualwebarena,visualwebarena.resized.28,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,No
F,webarena,webarena.738,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
D,visualwebarena,visualwebarena.125,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,1. Complete Failure,No
A,visualwebarena,visualwebarena.resized.27,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No
D,visualwebarena,visualwebarena.120,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No
B,visualwebarena,visualwebarena.403,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,1. Complete Failure,No
D,visualwebarena,visualwebarena.118,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,1. Complete Failure,No
D,visualwebarena,visualwebarena.102,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,Yes
A,visualwebarena,visualwebarena.resized.26,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes
F,webarena,webarena.730,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
B,assistantbench,assistantbench.improved.validation.17,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
G,visualwebarena,visualwebarena.resized.134,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,No
B,assistantbench,assistantbench.improved.validation.16,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
B,assistantbench,assistantbench.improved.validation.15,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
D,visualwebarena,visualwebarena.98,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No
B,assistantbench,assistantbench.improved.validation.14,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
B,assistantbench,assistantbench.improved.validation.13,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
D,visualwebarena,visualwebarena.93,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No
B,assistantbench,assistantbench.improved.validation.12,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
A,visualwebarena,visualwebarena.resized.8,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes
B,assistantbench,assistantbench.improved.validation.11,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
D,visualwebarena,visualwebarena.91,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,3. Somewhat Optimal,No
B,assistantbench,assistantbench.improved.validation.10,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
B,assistantbench,assistantbench.improved.validation.9,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
F,webarena,webarena.726,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
B,assistantbench,assistantbench.improved.validation.17,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
F,webarena,webarena.723,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
D,visualwebarena,visualwebarena.88,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,1. Complete Failure,Yes
B,assistantbench,assistantbench.improved.validation.16,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,3. Somewhat Optimal,Yes
F,webarena,webarena.718,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
B,assistantbench,assistantbench.improved.validation.15,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
D,visualwebarena,visualwebarena.76,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,3. Somewhat Optimal,Yes
F,webarena,webarena.710,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
B,assistantbench,assistantbench.improved.validation.14,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
A,visualwebarena,visualwebarena.resized.7,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,Yes
D,visualwebarena,visualwebarena.64,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,3. Somewhat Optimal,Yes
B,assistantbench,assistantbench.improved.validation.13,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
B,assistantbench,assistantbench.improved.validation.12,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
B,assistantbench,assistantbench.improved.validation.11,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
D,visualwebarena,visualwebarena.52,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,2. Suboptimal,No
B,assistantbench,assistantbench.improved.validation.10,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
A,visualwebarena,visualwebarena.resized.4,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_visualwebarena.resized,Unsuccessful,No,3. Somewhat Optimal,Yes
B,assistantbench,assistantbench.improved.validation.9,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
A,visualwebarena,visualwebarena.resized.28,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,Yes,2. Suboptimal,No
D,visualwebarena,visualwebarena.31,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No
B,assistantbench,assistantbench.improved.validation.17,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
B,assistantbench,assistantbench.improved.validation.16,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
B,assistantbench,assistantbench.improved.validation.15,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
B,assistantbench,assistantbench.improved.validation.14,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
B,assistantbench,assistantbench.improved.validation.13,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
A,visualwebarena,visualwebarena.resized.27,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,No
B,assistantbench,assistantbench.improved.validation.12,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
B,assistantbench,assistantbench.improved.validation.11,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
A,visualwebarena,visualwebarena.resized.26,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No
B,assistantbench,assistantbench.improved.validation.10,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
B,assistantbench,assistantbench.improved.validation.9,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
A,visualwebarena,visualwebarena.resized.8,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,1. Complete Failure,Yes
A,visualwebarena,visualwebarena.resized.7,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Unsuccessful,No,2. Suboptimal,No
A,visualwebarena,visualwebarena.resized.4,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_visualwebarena.resized,Successful,No,4. Completely Optimal,No
A,visualwebarena,visualwebarena.28,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,1. Complete Failure,No
A,visualwebarena,visualwebarena.27,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,1. Complete Failure,Yes
A,visualwebarena,visualwebarena.26,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,1. Complete Failure,No
C,visualwebarena,visualwebarena.345,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,3. Somewhat Optimal,No
C,visualwebarena,visualwebarena.332,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Successful,No,4. Completely Optimal,No
A,visualwebarena,visualwebarena.8,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,1. Complete Failure,Yes
C,visualwebarena,visualwebarena.331,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,Yes,2. Suboptimal,No
A,visualwebarena,visualwebarena.7,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,3. Somewhat Optimal,No
A,visualwebarena,visualwebarena.4,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_visualwebarena,Unsuccessful,No,3. Somewhat Optimal,Yes
C,assistantbench,assistantbench.improved.validation.8,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
C,assistantbench,assistantbench.improved.validation.7,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
C,assistantbench,assistantbench.improved.validation.6,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
C,assistantbench,assistantbench.improved.validation.5,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
C,assistantbench,assistantbench.improved.validation.4,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
C,assistantbench,assistantbench.improved.validation.3,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
C,assistantbench,assistantbench.improved.validation.2,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
C,assistantbench,assistantbench.improved.validation.1,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
C,assistantbench,assistantbench.improved.validation.0,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
C,assistantbench,assistantbench.improved.validation.8,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
C,assistantbench,assistantbench.improved.validation.7,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
C,assistantbench,assistantbench.improved.validation.6,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
C,assistantbench,assistantbench.improved.validation.5,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
C,assistantbench,assistantbench.improved.validation.4,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Successful,No,2. Suboptimal,Yes
B,webarena,webarena.704,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No
C,assistantbench,assistantbench.improved.validation.3,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
B,webarena,webarena.701,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
B,webarena,webarena.700,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
C,assistantbench,assistantbench.improved.validation.2,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
B,webarena,webarena.698,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
B,webarena,webarena.696,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
C,assistantbench,assistantbench.improved.validation.1,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
C,assistantbench,assistantbench.improved.validation.0,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,Yes,3. Somewhat Optimal,No
B,webarena,webarena.690,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,Yes,2. Suboptimal,Yes
B,webarena,webarena.683,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
B,webarena,webarena.677,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
B,webarena,webarena.666,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
B,webarena,webarena.661,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,3. Somewhat Optimal,Yes
B,webarena,webarena.655,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
C,assistantbench,assistantbench.improved.validation.8,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
C,assistantbench,assistantbench.improved.validation.7,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
B,webarena,webarena.654,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
C,assistantbench,assistantbench.improved.validation.6,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
D,assistantbench,assistantbench.improved.validation.32,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
C,assistantbench,assistantbench.improved.validation.5,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
B,webarena,webarena.642,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,3. Somewhat Optimal,No
C,assistantbench,assistantbench.improved.validation.4,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
C,assistantbench,assistantbench.improved.validation.3,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
C,assistantbench,assistantbench.improved.validation.2,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
D,assistantbench,assistantbench.improved.validation.31,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
B,webarena,webarena.629,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,2. Suboptimal,Yes
D,assistantbench,assistantbench.improved.validation.30,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
B,webarena,webarena.624,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
C,assistantbench,assistantbench.improved.validation.1,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
D,assistantbench,assistantbench.improved.validation.29,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
B,webarena,webarena.623,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
D,assistantbench,assistantbench.improved.validation.28,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
C,assistantbench,assistantbench.improved.validation.0,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
B,webarena,webarena.604,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
B,webarena,webarena.599,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,Yes,2. Suboptimal,Yes
D,assistantbench,assistantbench.improved.validation.27,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
C,webarena,webarena.583,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,Yes,4. Completely Optimal,No
B,webarena,webarena.593,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
C,webarena,webarena.578,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
B,webarena,webarena.586,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,Yes,2. Suboptimal,Yes
C,webarena,webarena.561,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
C,webarena,webarena.555,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,Yes,3. Somewhat Optimal,No
C,webarena,webarena.544,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
B,webarena,webarena.704,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
A,assistantbench,assistantbench.improved.validation.26,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.518,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
A,assistantbench,assistantbench.improved.validation.25,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
B,webarena,webarena.701,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
A,assistantbench,assistantbench.improved.validation.24,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.517,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
A,assistantbench,assistantbench.improved.validation.23,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.491,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,Yes,2. Suboptimal,Yes
B,webarena,webarena.700,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
D,assistantbench,assistantbench.improved.validation.32,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
A,assistantbench,assistantbench.improved.validation.22,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
A,assistantbench,assistantbench.improved.validation.21,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
B,webarena,webarena.698,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,Yes,2. Suboptimal,Yes
A,assistantbench,assistantbench.improved.validation.20,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
B,webarena,webarena.696,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
A,assistantbench,assistantbench.improved.validation.19,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
A,assistantbench,assistantbench.improved.validation.18,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
D,assistantbench,assistantbench.improved.validation.31,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.471,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
D,assistantbench,assistantbench.improved.validation.30,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,3. Somewhat Optimal,Yes
B,webarena,webarena.690,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,Yes,2. Suboptimal,Yes
C,webarena,webarena.468,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
A,assistantbench,assistantbench.improved.validation.26,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.430,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
B,webarena,webarena.683,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
B,webarena,webarena.677,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No
A,assistantbench,assistantbench.improved.validation.25,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
D,assistantbench,assistantbench.improved.validation.29,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
A,assistantbench,assistantbench.improved.validation.24,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,3. Somewhat Optimal,No
D,assistantbench,assistantbench.improved.validation.28,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
C,webarena,webarena.427,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,assistantbench,assistantbench.improved.validation.27,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
C,webarena,webarena.426,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No
B,webarena,webarena.666,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
B,webarena,webarena.661,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
C,webarena,webarena.419,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
D,assistantbench,assistantbench.improved.validation.32,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
D,assistantbench,assistantbench.improved.validation.31,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
B,webarena,webarena.655,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,Yes,3. Somewhat Optimal,No
D,assistantbench,assistantbench.improved.validation.30,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.417,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
D,assistantbench,assistantbench.improved.validation.29,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
A,assistantbench,assistantbench.improved.validation.23,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
D,assistantbench,assistantbench.improved.validation.28,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
D,assistantbench,assistantbench.improved.validation.27,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
B,webarena,webarena.654,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No
C,webarena,webarena.416,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
B,webarena,webarena.642,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
A,assistantbench,assistantbench.improved.validation.22,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Successful,No,4. Completely Optimal,No
B,webarena,webarena.629,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,2. Suboptimal,Yes
B,webarena,webarena.624,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
B,webarena,webarena.623,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.377,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,3. Somewhat Optimal,No
B,webarena,webarena.604,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
A,assistantbench,assistantbench.improved.validation.21,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.371,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No
B,webarena,webarena.599,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
D,webarena,webarena.370,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No
D,webarena,webarena.365,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
A,assistantbench,assistantbench.improved.validation.20,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
D,webarena,webarena.364,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
B,webarena,webarena.593,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
A,assistantbench,assistantbench.improved.validation.19,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
A,assistantbench,assistantbench.improved.validation.18,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
C,webarena,webarena.414,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,Yes,4. Completely Optimal,No
A,webarena,webarena.185,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.400,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.356,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
B,webarena,webarena.586,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,Yes,2. Suboptimal,Yes
C,webarena,webarena.386,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.380,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No
D,webarena,webarena.344,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,3. Somewhat Optimal,No
A,webarena,webarena.171,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
D,webarena,webarena.343,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.327,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
A,webarena,webarena.164,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.325,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
C,webarena,webarena.583,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,3. Somewhat Optimal,No
B,webarena,webarena.704,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No
A,webarena,webarena.158,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.311,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.306,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,3. Somewhat Optimal,No
C,webarena,webarena.578,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
D,webarena,webarena.295,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,Yes
C,webarena,webarena.561,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
D,webarena,webarena.295,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,Yes
A,webarena,webarena.153,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,3. Somewhat Optimal,No
C,webarena,webarena.555,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
A,webarena,webarena.144,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
B,webarena,webarena.701,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,Yes,3. 
Somewhat Optimal,No B,webarena,webarena.700,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No B,webarena,webarena.698,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,Yes,3. Somewhat Optimal,No A,webarena,webarena.126,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No B,webarena,webarena.696,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,Yes,2. Suboptimal,Yes D,webarena,webarena.289,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes A,webarena,webarena.100,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes B,webarena,webarena.690,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,Yes,3. Somewhat Optimal,No A,webarena,webarena.69,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No A,webarena,webarena.67,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No A,webarena,webarena.66,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes D,webarena,webarena.272,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes B,webarena,webarena.683,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. 
Complete Failure,No B,webarena,webarena.677,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,2. Suboptimal,No A,webarena,webarena.60,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No A,webarena,webarena.48,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No A,webarena,webarena.40,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes A,webarena,webarena.33,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes B,webarena,webarena.666,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,Yes A,webarena,webarena.27,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,Yes,1. Complete Failure,No D,webarena,webarena.268,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes C,webarena,webarena.544,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes D,webarena,webarena.266,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,2. Suboptimal,Yes B,webarena,webarena.661,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No B,webarena,webarena.655,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. 
Completely Optimal,No B,webarena,webarena.654,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No C,webarena,webarena.518,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.265,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes C,webarena,webarena.517,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No B,webarena,webarena.642,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No C,webarena,webarena.491,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,Yes C,webarena,webarena.471,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes B,webarena,webarena.629,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No B,webarena,webarena.624,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No C,webarena,webarena.468,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No C,webarena,webarena.430,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes D,webarena,webarena.263,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Successful,No,4. 
Completely Optimal,Yes B,webarena,webarena.623,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No B,webarena,webarena.604,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.229,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No A,webarena,webarena.15,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct,GenericAgent-Qwen_Qwen2.5-VL-72B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes B,webarena,webarena.599,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,No B,webarena,webarena.593,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,3. Somewhat Optimal,Yes B,webarena,webarena.586,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,Yes,3. Somewhat Optimal,No A,webarena,webarena.377,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No A,webarena,webarena.371,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No A,webarena,webarena.370,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No A,webarena,webarena.365,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No A,webarena,webarena.364,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No A,webarena,webarena.356,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. 
Somewhat Optimal,Yes A,webarena,webarena.344,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsure,No,Unsure,No A,webarena,webarena.343,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No A,webarena,webarena.327,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No A,webarena,webarena.325,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,Yes A,webarena,webarena.311,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No A,webarena,webarena.306,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No A,webarena,webarena.295,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No A,webarena,webarena.289,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes A,webarena,webarena.272,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes A,webarena,webarena.268,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes A,webarena,webarena.266,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No A,webarena,webarena.265,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No A,webarena,webarena.263,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No A,webarena,webarena.229,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. 
Suboptimal,No F,webarena,webarena.805,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,Yes F,webarena,webarena.790,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,Yes,1. Complete Failure,No F,webarena,webarena.789,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,Yes,1. Complete Failure,No F,webarena,webarena.788,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,3. Somewhat Optimal,No F,webarena,webarena.778,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,Yes B,webarena,webarena.805,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,Yes,1. Complete Failure,Yes F,webarena,webarena.776,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,Yes F,webarena,webarena.769,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,Yes,2. Suboptimal,No B,webarena,webarena.790,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,Yes F,webarena,webarena.764,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,3. Somewhat Optimal,No F,webarena,webarena.763,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No B,webarena,webarena.789,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,Yes,3. Somewhat Optimal,Yes F,webarena,webarena.759,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. 
Suboptimal,No F,webarena,webarena.758,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No B,webarena,webarena.788,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,Yes,2. Suboptimal,No B,webarena,webarena.778,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes B,webarena,webarena.776,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No B,webarena,webarena.769,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes F,webarena,webarena.757,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No B,webarena,webarena.764,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,Yes F,webarena,webarena.740,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No B,webarena,webarena.763,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes F,webarena,webarena.738,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No F,webarena,webarena.735,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,No B,webarena,webarena.759,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes F,webarena,webarena.730,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. 
Suboptimal,No F,webarena,webarena.726,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,No B,webarena,webarena.758,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No B,webarena,webarena.757,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No B,webarena,webarena.740,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No B,webarena,webarena.738,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes C,webarena,webarena.704,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes F,webarena,webarena.185,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes B,webarena,webarena.735,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes F,webarena,webarena.177,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes B,webarena,webarena.730,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes C,webarena,webarena.701,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,2. Suboptimal,Yes B,webarena,webarena.726,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No B,webarena,webarena.723,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No C,webarena,webarena.700,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. 
Completely Optimal,No F,webarena,webarena.171,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No C,webarena,webarena.698,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,Yes,2. Suboptimal,Yes D,webarena,webarena.377,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes D,webarena,webarena.371,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes D,webarena,webarena.370,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No D,webarena,webarena.365,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes D,webarena,webarena.364,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes C,webarena,webarena.696,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.356,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,Yes D,webarena,webarena.344,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,3. Somewhat Optimal,No F,webarena,webarena.164,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No C,webarena,webarena.690,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.583,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. 
Completely Optimal,No F,webarena,webarena.158,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No C,webarena,webarena.683,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No D,webarena,webarena.578,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes B,webarena,webarena.718,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,Yes,2. Suboptimal,No C,webarena,webarena.677,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,3. Somewhat Optimal,No F,webarena,webarena.155,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No F,webarena,webarena.153,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.561,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No F,webarena,webarena.144,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No C,webarena,webarena.666,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,Yes B,webarena,webarena.710,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No F,webarena,webarena.126,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.555,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No D,webarena,webarena.544,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. 
Complete Failure,Yes B,webarena,webarena.704,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes C,webarena,webarena.661,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No F,webarena,webarena.100,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,3. Somewhat Optimal,Yes D,webarena,webarena.518,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.517,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.491,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No C,webarena,webarena.655,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.471,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No C,webarena,webarena.654,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.468,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes D,webarena,webarena.430,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes D,webarena,webarena.427,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No F,webarena,webarena.69,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.426,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. 
Completely Optimal,No C,webarena,webarena.629,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No F,webarena,webarena.67,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No B,webarena,webarena.701,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,2. Suboptimal,Yes F,webarena,webarena.66,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No C,webarena,webarena.642,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No D,webarena,webarena.419,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.417,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No F,webarena,webarena.60,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes C,webarena,webarena.624,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.416,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes C,webarena,webarena.623,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No B,webarena,webarena.700,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No C,webarena,webarena.604,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.414,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. 
Completely Optimal,No F,webarena,webarena.48,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes A,webarena,webarena.185,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No C,webarena,webarena.599,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes F,webarena,webarena.40,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes A,webarena,webarena.177,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes B,webarena,webarena.698,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,Yes,3. Somewhat Optimal,Yes F,webarena,webarena.33,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No D,webarena,webarena.400,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No D,webarena,webarena.386,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No A,webarena,webarena.171,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No C,webarena,webarena.593,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No A,webarena,webarena.164,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No A,webarena,webarena.158,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. 
Suboptimal,Yes
D,webarena,webarena.380,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.586,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
A,webarena,webarena.155,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
A,webarena,webarena.153,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
B,webarena,webarena.696,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.343,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
A,webarena,webarena.144,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
F,webarena,webarena.27,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
A,webarena,webarena.126,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.263,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.100,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No
C,webarena,webarena.427,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.263,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
F,webarena,webarena.24,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
F,webarena,webarena.15,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.426,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No
D,webarena,webarena.263,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.69,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.419,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
A,webarena,webarena.67,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
A,webarena,webarena.66,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
B,webarena,webarena.690,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,3. Somewhat Optimal,No
D,webarena,webarena.327,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,No
A,webarena,webarena.60,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
A,webarena,webarena.48,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.327,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No
A,webarena,webarena.40,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No
C,webarena,webarena.417,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
A,webarena,webarena.33,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.325,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
B,webarena,webarena.683,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
A,webarena,webarena.27,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
B,webarena,webarena.677,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.311,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.24,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.416,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
D,webarena,webarena.306,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
A,webarena,webarena.15,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No
B,webarena,webarena.666,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,Yes
C,webarena,webarena.414,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
D,webarena,webarena.289,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
A,webarena,webarena.185,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.177,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,No
D,webarena,webarena.272,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.268,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,2. Suboptimal,Yes
C,webarena,webarena.400,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
A,webarena,webarena.171,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.164,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.158,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.155,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,3. Somewhat Optimal,No
A,webarena,webarena.153,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.386,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.144,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.380,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
A,webarena,webarena.126,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
B,webarena,webarena.661,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.126,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.266,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
A,webarena,webarena.100,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.265,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.583,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.69,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,3. Somewhat Optimal,No
B,webarena,webarena.655,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.578,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
A,webarena,webarena.67,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,3. Somewhat Optimal,No
D,webarena,webarena.229,GenericAgent-meta-llama_Llama-3.3-70B-Instruct,GenericAgent-meta-llama_Llama-3.3-70B-Instruct_on_webarena,Unsuccessful,No,2. Suboptimal,No
B,webarena,webarena.654,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.66,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No
D,webarena,webarena.377,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,3. Somewhat Optimal,No
C,webarena,webarena.561,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.371,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,No
A,webarena,webarena.177,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
D,webarena,webarena.370,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,No
D,webarena,webarena.356,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
D,webarena,webarena.370,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No
C,webarena,webarena.555,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,Yes,3. Somewhat Optimal,No
A,webarena,webarena.60,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.544,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
D,webarena,webarena.325,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,Yes
A,webarena,webarena.48,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.518,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.40,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.517,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.491,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.471,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.33,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No
C,webarena,webarena.468,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.430,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,No
D,webarena,webarena.365,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,No
A,webarena,webarena.27,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,3. Somewhat Optimal,Yes
D,webarena,webarena.364,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.356,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.427,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.344,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.24,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.343,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,No
C,webarena,webarena.426,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,2. Suboptimal,Yes
A,webarena,webarena.15,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.419,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.417,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
A,assistantbench,assistantbench.improved.validation.26,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
A,assistantbench,assistantbench.improved.validation.25,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.325,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,No
A,assistantbench,assistantbench.improved.validation.24,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
B,webarena,webarena.642,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.416,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
D,webarena,webarena.311,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
A,assistantbench,assistantbench.improved.validation.23,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
A,assistantbench,assistantbench.improved.validation.22,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
B,webarena,webarena.629,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.306,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,3. Somewhat Optimal,No
A,assistantbench,assistantbench.improved.validation.21,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Successful,No,4. Completely Optimal,No
C,webarena,webarena.414,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
B,webarena,webarena.624,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
B,webarena,webarena.623,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.400,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.295,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
B,webarena,webarena.604,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.386,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,4. Completely Optimal,No
A,assistantbench,assistantbench.improved.validation.20,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Successful,No,4. Completely Optimal,No
D,webarena,webarena.289,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
B,webarena,webarena.599,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
C,webarena,webarena.380,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,No
D,webarena,webarena.272,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,No
A,assistantbench,assistantbench.improved.validation.19,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.268,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,3. Somewhat Optimal,No
A,assistantbench,assistantbench.improved.validation.18,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_assistantbench.improved.validation,Successful,No,3. Somewhat Optimal,No
D,webarena,webarena.266,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,3. Somewhat Optimal,No
B,webarena,webarena.593,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.265,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,No
B,webarena,webarena.586,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,Yes,1. Complete Failure,Yes
D,webarena,webarena.229,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,2. Suboptimal,No
D,webarena,webarena.377,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
D,webarena,webarena.371,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.365,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.364,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.344,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.343,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.327,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.311,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
D,webarena,webarena.306,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.295,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
D,webarena,webarena.289,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
D,webarena,webarena.272,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
D,webarena,webarena.268,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
D,webarena,webarena.266,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
D,webarena,webarena.265,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No
D,webarena,webarena.229,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
C,webarena,webarena.583,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.578,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
C,webarena,webarena.561,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,Yes,3. Somewhat Optimal,No
A,webarena,webarena.185,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
F,webarena,webarena.723,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,No
F,webarena,webarena.718,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Successful,No,3. Somewhat Optimal,No
C,webarena,webarena.555,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,Yes,3. Somewhat Optimal,No
F,webarena,webarena.710,GenericAgent-anthropic_claude-3.7-sonnet,GenericAgent-anthropic_claude-3.7-sonnet_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
C,webarena,webarena.544,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
A,webarena,webarena.171,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No
C,webarena,webarena.518,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.517,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.491,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
F,webarena,webarena.805,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,Yes,2. Suboptimal,No
C,webarena,webarena.471,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.164,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.158,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
C,webarena,webarena.468,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
F,webarena,webarena.790,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
A,webarena,webarena.155,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.153,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.430,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
F,webarena,webarena.789,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,Yes,1. Complete Failure,Yes
A,webarena,webarena.144,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No
F,webarena,webarena.788,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.427,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
A,webarena,webarena.100,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No
F,webarena,webarena.778,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
C,webarena,webarena.426,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
F,webarena,webarena.776,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.419,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,3. Somewhat Optimal,No
C,webarena,webarena.417,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
F,webarena,webarena.769,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,Yes
A,webarena,webarena.69,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.67,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No
F,webarena,webarena.764,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No
C,webarena,webarena.416,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
A,webarena,webarena.66,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,No
F,webarena,webarena.763,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
C,webarena,webarena.414,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.60,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
C,webarena,webarena.400,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
F,webarena,webarena.759,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
C,webarena,webarena.386,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
C,webarena,webarena.380,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,No
F,webarena,webarena.758,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
F,webarena,webarena.757,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.48,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
F,webarena,webarena.740,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
F,webarena,webarena.738,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,3. Somewhat Optimal,Yes
F,webarena,webarena.735,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
A,webarena,webarena.40,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
A,webarena,webarena.33,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
C,assistantbench,assistantbench.improved.validation.8,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
F,webarena,webarena.730,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
A,webarena,webarena.27,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,2. Suboptimal,Yes
F,webarena,webarena.726,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
F,webarena,webarena.723,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
F,webarena,webarena.718,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,Yes,1. Complete Failure,No
C,assistantbench,assistantbench.improved.validation.7,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
A,webarena,webarena.24,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
F,webarena,webarena.710,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Successful,No,4. Completely Optimal,No
A,webarena,webarena.15,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_webarena,Unsuccessful,No,1. Complete Failure,No
C,assistantbench,assistantbench.improved.validation.6,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
C,assistantbench,assistantbench.improved.validation.5,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
C,assistantbench,assistantbench.improved.validation.4,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Successful,No,4. Completely Optimal,No
A,assistantbench,assistantbench.improved.validation.26,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
C,assistantbench,assistantbench.improved.validation.3,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
A,assistantbench,assistantbench.improved.validation.25,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,3. Somewhat Optimal,No
A,assistantbench,assistantbench.improved.validation.24,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
C,assistantbench,assistantbench.improved.validation.2,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
D,assistantbench,assistantbench.improved.validation.32,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
A,assistantbench,assistantbench.improved.validation.23,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
B,assistantbench,assistantbench.improved.validation.17,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
A,assistantbench,assistantbench.improved.validation.22,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,3. Somewhat Optimal,Yes
B,assistantbench,assistantbench.improved.validation.16,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
C,assistantbench,assistantbench.improved.validation.1,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,No
B,assistantbench,assistantbench.improved.validation.15,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
D,assistantbench,assistantbench.improved.validation.31,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
B,assistantbench,assistantbench.improved.validation.14,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,3. Somewhat Optimal,No
C,assistantbench,assistantbench.improved.validation.0,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Successful,No,4. Completely Optimal,No
D,assistantbench,assistantbench.improved.validation.30,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Successful,No,2. Suboptimal,Yes
B,assistantbench,assistantbench.improved.validation.13,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,Yes,1. Complete Failure,Yes
B,assistantbench,assistantbench.improved.validation.12,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,Yes,2. Suboptimal,Yes
A,assistantbench,assistantbench.improved.validation.21,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Successful,No,3. Somewhat Optimal,No
D,assistantbench,assistantbench.improved.validation.29,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,Yes
B,assistantbench,assistantbench.improved.validation.11,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,3. Somewhat Optimal,Yes
A,assistantbench,assistantbench.improved.validation.20,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Successful,No,3. Somewhat Optimal,No
D,assistantbench,assistantbench.improved.validation.28,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,2. Suboptimal,Yes
B,assistantbench,assistantbench.improved.validation.10,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,3. Somewhat Optimal,No
A,assistantbench,assistantbench.improved.validation.19,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
B,assistantbench,assistantbench.improved.validation.9,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,3. Somewhat Optimal,No
A,assistantbench,assistantbench.improved.validation.18,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,1. Complete Failure,No
D,assistantbench,assistantbench.improved.validation.27,GenericAgent-gpt-4o-2024-11-20,GenericAgent-gpt-4o-2024-11-20_on_assistantbench.improved.validation,Unsuccessful,No,3. Somewhat Optimal,No