Add vision evals
README.md
CHANGED
@@ -334,6 +334,26 @@ Non-coding tasks were evaluated with [lm-evaluation-harness](https://github.com/
   --batch_size auto
 ```
 
+**MMMU**
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_images=8,enable_chunked_prefill=True,tensor_parallel_size=2 \
+  --tasks mmmu \
+  --apply_chat_template \
+  --batch_size auto
+```
+
+**ChartQA**
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_images=8,enable_chunked_prefill=True,tensor_parallel_size=2 \
+  --tasks chartqa \
+  --apply_chat_template \
+  --batch_size auto
+```
+
 **Coding**
 
 The commands below can be used for mbpp by simply replacing the dataset name.
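
Both vision benchmarks can also be scored in one pass by giving lm_eval a comma-separated task list. The sketch below simply merges the two commands above; the model arguments are taken unchanged from them and nothing else is assumed.

```
# Sketch: run MMMU and ChartQA together (same model_args as the single-task commands above)
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.9,max_images=8,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmmu,chartqa \
  --apply_chat_template \
  --batch_size auto
```
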
@@ -366,7 +386,6 @@ evalplus.evaluate \
 
 ### Accuracy
 
-#### Open LLM Leaderboard evaluation scores
 <table>
 <tr>
 <th>Category
@@ -526,5 +545,27 @@ evalplus.evaluate \
 <td>100.7%
 </td>
 </tr>
+<tr>
+<td rowspan="2" ><strong>Vision</strong>
+</td>
+<td>MMMU (0-shot)
+</td>
+<td>52.11
+</td>
+<td>53.11
+</td>
+<td>101.9%
+</td>
+</tr>
+<tr>
+<td>ChartQA (0-shot)
+</td>
+<td>81.36
+</td>
+<td>82.36
+</td>
+<td>101.2%
+</td>
+</tr>
 </table>
 
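In the accuracy table, the Recovery column is the second score divided by the first, i.e. the quantized model's score relative to the unquantized baseline (which column is which is the assumption here; the arithmetic itself matches the rows added above). A quick check of the two vision rows:

```
# Recovery = quantized score / baseline score (column order assumed: baseline, then quantized)
echo "scale=1; 100 * 53.11 / 52.11" | bc   # MMMU    -> 101.9 (%)
echo "scale=1; 100 * 82.36 / 81.36" | bc   # ChartQA -> 101.2 (%)
```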