<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning</title>
<style>
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
line-height: 1.6;
color: #333;
max-width: 1200px;
margin: 0 auto;
padding: 20px;
background-color: #f5f7fa;
}
.container {
background-color: white;
border-radius: 8px;
box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);
padding: 30px;
margin-bottom: 20px;
}
h1 {
color: #2c3e50;
text-align: center;
margin-bottom: 30px;
font-weight: 600;
}
.description {
margin-bottom: 30px;
font-size: 16px;
color: #555;
text-align: center;
}
table {
width: 100%;
border-collapse: collapse;
margin-top: 20px;
font-size: 15px;
}
thead {
background-color: #f8f9fa;
font-weight: bold;
}
th, td {
padding: 12px 15px;
text-align: center;
border-bottom: 1px solid #e0e0e0;
}
th {
position: sticky;
top: 0;
background-color: #f8f9fa;
box-shadow: 0 2px 2px -1px rgba(0, 0, 0, 0.1);
}
tbody tr:hover {
background-color: #f1f5f9;
}
.model-name {
text-align: left;
font-weight: 500;
}
.human-row {
font-weight: bold;
background-color: #e3f2fd;
}
.top-model {
background-color: #fff8e1;
}
.category-header {
background-color: #f5f5f5;
font-weight: bold;
}
.file-support {
font-size: 12px;
color: #666;
}
.footnote {
font-size: 14px;
color: #666;
margin-top: 30px;
border-top: 1px solid #eee;
padding-top: 20px;
}
</style>
</head>
<body>
<div class="container">
<h1>The <i>BLUR</i> Leaderboard</h1>
<div class="description">
<p>Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning</p>
<p>Dataset: <a href="https://huggingface.co/datasets/PatronusAI/BLUR">PatronusAI/BLUR on Hugging Face</a>; Paper: <a href="https://arxiv.org/abs/2503.19193">arXiv:2503.19193</a></p>
</div>
<table>
<thead>
<tr>
<th>Model / System</th>
<th>Q<sub>T</sub></th>
<th>Q<sub>F</sub></th>
<th>H<sub>E</sub></th>
<th>H<sub>M</sub></th>
<th>H<sub>H</sub></th>
<th>Overall</th>
</tr>
</thead>
<tbody>
<!-- Base Models -->
<tr class="category-header">
<td colspan="7">Foundation Models</td>
</tr>
<tr>
<td class="model-name">Llama-3.1-405B</td>
<td>0.34</td>
<td>0.17<span class="file-support">°</span></td>
<td>0.35</td>
<td>0.32</td>
<td>0.25</td>
<td>0.30</td>
</tr>
<tr>
<td class="model-name">claude-3-5-sonnet-20241022</td>
<td>0.44</td>
<td>0.28<span class="file-support"></span></td>
<td>0.42</td>
<td>0.42</td>
<td>0.36</td>
<td>0.40</td>
</tr>
<tr>
<td class="model-name">gpt-4o-2024-11-20</td>
<td>0.42</td>
<td>0.28<span class="file-support"></span></td>
<td>0.39</td>
<td>0.43</td>
<td>0.35</td>
<td>0.38</td>
</tr>
<tr>
<td class="model-name">o1-2024-12-17</td>
<td>0.54</td>
<td>0.36<span class="file-support"></span></td>
<td>0.56</td>
<td>0.52</td>
<td>0.44</td>
<td>0.49</td>
</tr>
<tr>
<td class="model-name">DeepSeek-R1</td>
<td>0.45</td>
<td>0.27<span class="file-support">°</span></td>
<td>0.46</td>
<td>0.44</td>
<td>0.35</td>
<td>0.41</td>
</tr>
<!-- Chat Products -->
<tr class="category-header">
<td colspan="7">AI Assistants</td>
</tr>
<tr>
<td class="model-name">Microsoft Copilot</td>
<td>0.29</td>
<td>0.23<span class="file-support"></span></td>
<td>0.29</td>
<td>0.32</td>
<td>0.22</td>
<td>0.27</td>
</tr>
<tr>
<td class="model-name">Mistral Le Chat</td>
<td>0.40</td>
<td>0.27<span class="file-support"></span></td>
<td>0.47</td>
<td>0.38</td>
<td>0.32</td>
<td>0.37</td>
</tr>
<tr>
<td class="model-name">Perplexity Pro Search</td>
<td>0.31</td>
<td>0.15<span class="file-support"></span></td>
<td>0.29</td>
<td>0.29</td>
<td>0.24</td>
<td>0.27</td>
</tr>
<tr>
<td class="model-name">ChatGPT-4o</td>
<td>0.53</td>
<td>0.36</td>
<td>0.60</td>
<td>0.52</td>
<td>0.41</td>
<td>0.49</td>
</tr>
<!-- Agent Systems -->
<tr class="category-header">
<td colspan="7">Agentic Systems</td>
</tr>
<tr class="top-model">
<td class="model-name">HuggingFace Agents + Claude 3.5 Sonnet</td>
<td>0.61</td>
<td>0.41<span class="file-support"></span></td>
<td>0.60</td>
<td>0.56</td>
<td>0.54</td>
<td>0.56</td>
</tr>
<tr>
<td class="model-name">DynaSaur + GPT-4o</td>
<td>0.58</td>
<td>0.27</td>
<td>0.61</td>
<td>0.52</td>
<td>0.44</td>
<td>0.50</td>
</tr>
<tr>
<td class="model-name">Operator</td>
<td>0.57</td>
<td>0.46<span class="file-support"></span></td>
<td>0.56</td>
<td>0.56</td>
<td>0.52</td>
<td>0.54</td>
</tr>
<!-- Baselines -->
<tr class="category-header">
<td colspan="7">Baselines</td>
</tr>
<tr>
<td class="model-name">Search Engine</td>
<td>0.05</td>
<td>0.03<span class="file-support"></span></td>
<td>0.08</td>
<td>0.05</td>
<td>0.02</td>
<td>0.04</td>
</tr>
<tr class="human-row">
<td class="model-name">Human</td>
<td>0.98</td>
<td>1.00</td>
<td>0.98</td>
<td>0.98</td>
<td>0.99</td>
<td>0.98</td>
</tr>
</tbody>
</table>
<div class="footnote">
<p><strong>Table 1:</strong> System and model performance on the BLUR benchmark. Q<sub>T</sub> and Q<sub>F</sub> denote performance on text-only queries and on queries with file inputs, respectively. File-input support is marked next to each Q<sub>F</sub> score: ° means the system does not support file uploads, • means it supports only certain file types, and no marker means all file types are supported. H<sub>E</sub>, H<sub>M</sub>, and H<sub>H</sub> denote performance on the <em>easy</em>, <em>medium</em>, and <em>hard</em> query difficulty subsets, respectively.</p>
</div>
</div>
</body>
</html>