<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning</title>
  <style>
    body {
      font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
      line-height: 1.6;
      color: #333;
      max-width: 1200px;
      margin: 0 auto;
      padding: 20px;
      background-color: #f5f7fa;
    }
    .container {
      background-color: white;
      border-radius: 8px;
      box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);
      padding: 30px;
      margin-bottom: 20px;
    }
    h1 {
      color: #2c3e50;
      text-align: center;
      margin-bottom: 30px;
      font-weight: 600;
    }
    .description {
      margin-bottom: 30px;
      font-size: 16px;
      color: #555;
      text-align: center;
    }
    table {
      width: 100%;
      border-collapse: collapse;
      margin-top: 20px;
      font-size: 15px;
    }
    thead {
      background-color: #f8f9fa;
      font-weight: bold;
    }
    th, td {
      padding: 12px 15px;
      text-align: center;
      border-bottom: 1px solid #e0e0e0;
    }
    th {
      position: sticky;
      top: 0;
      background-color: #f8f9fa;
      box-shadow: 0 2px 2px -1px rgba(0, 0, 0, 0.1);
    }
    tbody tr:hover {
      background-color: #f1f5f9;
    }
    .model-name {
      text-align: left;
      font-weight: 500;
    }
    .human-row {
      font-weight: bold;
      background-color: #e3f2fd;
    }
    .top-model {
      background-color: #fff8e1;
    }
    .category-header {
      background-color: #f5f5f5;
      font-weight: bold;
    }
    .file-support {
      font-size: 12px;
      color: #666;
    }
    .footnote {
      font-size: 14px;
      color: #666;
      margin-top: 30px;
      border-top: 1px solid #eee;
      padding-top: 20px;
    }
  </style>
</head>
<body>
  <div class="container">
    <h1>The <i>BLUR</i> Leaderboard</h1>
    <div class="description">
      <p>Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning</p>
      <p>Dataset: <a href="https://huggingface.co/datasets/PatronusAI/BLUR">PatronusAI/BLUR on Hugging Face</a>; Paper: <a href="https://arxiv.org/abs/2503.19193">arXiv:2503.19193</a></p>
    </div>
    <table>
      <thead>
        <tr>
          <th>Model / System</th>
          <th>Q<sub>T</sub></th>
          <th>Q<sub>F</sub></th>
          <th>H<sub>E</sub></th>
          <th>H<sub>M</sub></th>
          <th>H<sub>H</sub></th>
          <th>Overall</th>
        </tr>
      </thead>
      <tbody>
        <!-- Base Models -->
        <tr class="category-header">
          <td colspan="7">Foundation Models</td>
        </tr>
        <tr>
          <td class="model-name">Llama-3.1-405B</td>
          <td>0.34</td>
          <td>0.17<span class="file-support">°</span></td>
          <td>0.35</td>
          <td>0.32</td>
          <td>0.25</td>
          <td>0.30</td>
        </tr>
        <tr>
          <td class="model-name">claude-3-5-sonnet-20241022</td>
          <td>0.44</td>
          <td>0.28<span class="file-support">•</span></td>
          <td>0.42</td>
          <td>0.42</td>
          <td>0.36</td>
          <td>0.40</td>
        </tr>
        <tr>
          <td class="model-name">gpt-4o-2024-11-20</td>
          <td>0.42</td>
          <td>0.28<span class="file-support">•</span></td>
          <td>0.39</td>
          <td>0.43</td>
          <td>0.35</td>
          <td>0.38</td>
        </tr>
        <tr>
          <td class="model-name">o1-2024-12-17</td>
          <td>0.54</td>
          <td>0.36<span class="file-support">•</span></td>
          <td>0.56</td>
          <td>0.52</td>
          <td>0.44</td>
          <td>0.49</td>
        </tr>
        <tr>
          <td class="model-name">DeepSeek-R1</td>
          <td>0.45</td>
          <td>0.27<span class="file-support">°</span></td>
          <td>0.46</td>
          <td>0.44</td>
          <td>0.35</td>
          <td>0.41</td>
        </tr>
        <!-- Chat Products -->
        <tr class="category-header">
          <td colspan="7">AI Assistants</td>
        </tr>
        <tr>
          <td class="model-name">Microsoft Copilot</td>
          <td>0.29</td>
          <td>0.23<span class="file-support">•</span></td>
          <td>0.29</td>
          <td>0.32</td>
          <td>0.22</td>
          <td>0.27</td>
        </tr>
        <tr>
          <td class="model-name">Mistral Le Chat</td>
          <td>0.40</td>
          <td>0.27<span class="file-support">•</span></td>
          <td>0.47</td>
          <td>0.38</td>
          <td>0.32</td>
          <td>0.37</td>
        </tr>
        <tr>
          <td class="model-name">Perplexity Pro Search</td>
          <td>0.31</td>
          <td>0.15<span class="file-support">•</span></td>
          <td>0.29</td>
          <td>0.29</td>
          <td>0.24</td>
          <td>0.27</td>
        </tr>
        <tr>
          <td class="model-name">ChatGPT-4o</td>
          <td>0.53</td>
          <td>0.36</td>
          <td>0.60</td>
          <td>0.52</td>
          <td>0.41</td>
          <td>0.49</td>
        </tr>
        <!-- Agent Systems -->
        <tr class="category-header">
          <td colspan="7">Agentic Systems</td>
        </tr>
        <tr class="top-model">
          <td class="model-name">HuggingFace Agents + Claude 3.5 Sonnet</td>
          <td>0.61</td>
          <td>0.41<span class="file-support">•</span></td>
          <td>0.60</td>
          <td>0.56</td>
          <td>0.54</td>
          <td>0.56</td>
        </tr>
        <tr>
          <td class="model-name">DynaSaur + GPT-4o</td>
          <td>0.58</td>
          <td>0.27</td>
          <td>0.61</td>
          <td>0.52</td>
          <td>0.44</td>
          <td>0.50</td>
        </tr>
        <tr>
          <td class="model-name">Operator</td>
          <td>0.57</td>
          <td>0.46<span class="file-support">•</span></td>
          <td>0.56</td>
          <td>0.56</td>
          <td>0.52</td>
          <td>0.54</td>
        </tr>
        <!-- Baselines -->
        <tr class="category-header">
          <td colspan="7">Baselines</td>
        </tr>
        <tr>
          <td class="model-name">Search Engine</td>
          <td>0.05</td>
          <td>0.03<span class="file-support">•</span></td>
          <td>0.08</td>
          <td>0.05</td>
          <td>0.02</td>
          <td>0.04</td>
        </tr>
        <tr class="human-row">
          <td class="model-name">Human</td>
          <td>0.98</td>
          <td>1.00</td>
          <td>0.98</td>
          <td>0.98</td>
          <td>0.99</td>
          <td>0.98</td>
        </tr>
      </tbody>
    </table>
    <div class="footnote">
      <p><strong>Table 1:</strong> System and model performance on the BLUR benchmark. Q<sub>T</sub> and Q<sub>F</sub> denote performance on text-only queries and on queries with file inputs, respectively. The marker next to each Q<sub>F</sub> score indicates the system's file-input support: ° means the system does not support file uploads, • means it supports only certain file types, and no marker means it supports uploads of all file types. H<sub>E</sub>, H<sub>M</sub>, and H<sub>H</sub> denote performance on the <em>easy</em>, <em>medium</em>, and <em>hard</em> query difficulty subsets, respectively.</p>
    </div>
  </div>
</body>
</html>