TITLE = """

🏆 Online Mind2Web Leaderboard

""" INTRODUCTION_TEXT = """ Online Mind2Web is a benchmark designed to evaluate real-world performance of web agents on online websites. [[Blog]]() [[Paper]]() [[Code]]() [[Data]]() ## Tasks Online Mind2Web includes 300 tasks from 136 popular websites across various domains. It covers a diverse set of user tasks, to evaluate agents' performance in real-world environments. Tasks are categorized into three difficulty levels based on the steps human annotators need: - Easy: 1 - 5 - Medium: 6 - 10 - Hard: 11 + ## Leaderboard """ SUBMISSION_TEXT = """ ## Submissions Participants are invited to submit your agent's trajectory to test. The submissions will be evaluated based on our auto-eval. ### Format of submission Submissions must include a sequence of images (i.e., screenshots in the trajectory) and a result.json file for each task. The JSON file should contain the fields: "Task", "Task_id", and "action_history". You can refer to an example of the submission files. """ CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results" CITATION_BUTTON_TEXT = r""" Online Mind2Web""" SUBMIT_INTRODUCTION = """ ## ⚠ Please submit the trajectory file with the following format: Each task is stored in a folder named after its `task_id`, containing: - `trajectory/`: Stores screenshots of each step. - `result.json`: Task metadata and action history. **Structure:** ``` main_directory/ └── task_id/ ├── result.json └── trajectory/ ├── 0_screenshot.png ├── 1_screenshot.png └── ... ``` **`result.json` format:** ```json { "task_id": 123, "task": "abc", "action_history": ["abc", "xyz", "..."] } ``` Please send your agent's name, model family, and organization via email to xue.681@osu.edu, along with the trajectory directory attached. We will run the auto-evaluation. If you have conducted your own human evaluation, please also attach your human eval results—we will spot-check these before adding them to the human-eval table. """ DATA_DATASET = """## More Statistics for Online Mind2Web Benchmark """ def format_error(msg): return f"

{msg}

" def format_warning(msg): return f"

{msg}

" def format_log(msg): return f"

{msg}

" def model_hyperlink(link, model_name): return f'{model_name}'