Space: evalitahf/evalita_llm_leaderboard

rzanoli committed on Mar 27

Commit 3b91660 (1 parent: c03f591)

Small Changes

Files changed (1)

src/about.py CHANGED

@@ -92,6 +92,9 @@ TITLE = """<h1 align="center" id="space-title">🚀 EVALITA-LLM Leaderboard 🚀
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
 Evalita-LLM, a new benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. The distinguishing and innovative features of Evalita-LLM are the following: (i) **all tasks are native Italian**, avoiding issues of translating from Italian and potential cultural biases; (ii) in addition to well-established **multiple-choice** tasks (6 tasks), the benchmark includes **generative** tasks (4 tasks), enabling more natural interaction with LLMs; (iii) **all tasks are evaluated against multiple prompts**, this way mitigating the model sensitivity to specific prompts and allowing a fairer and objective evaluation.
+**Multiple Choice**: 📊TE (Textual Entailment), 😃SA (Sentiment Analysis), ⚠️HS (Hate Speech Detection), 🏥AT (Admission Test), 🔤WIC (Word in Context), ❓FAQ (Frequently Asked Questions)
+**Generative**: 🔄LS (Lexical Substitution), 📝SU (Summarization), 🏷️NER (Named Entity Recognition), 🔗REL (Relation Extraction)
 """
 # Which evaluations are you running? how can people reproduce what you have?
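The task list added in this commit can be mirrored as plain Python data. Doing so makes the "6 tasks" / "4 tasks" counts in `INTRODUCTION_TEXT` checkable against the actual list. The sketch below is illustrative only. `MULTIPLE_CHOICE`, `GENERATIVE`, `render_category`, and `aggregate_prompts` are hypothetical names that do not appear in `src/about.py`, and the emojis are omitted. The multi-prompt helper is an assumption: it reports the mean and the best score across prompts, but the diff does not say how Evalita-LLM actually combines per-prompt results.

```python
from statistics import mean

# Hypothetical task catalog transcribed from the two lines added in this commit.
# Keys are the leaderboard abbreviations; values are the full task names.
MULTIPLE_CHOICE = {
    "TE": "Textual Entailment",
    "SA": "Sentiment Analysis",
    "HS": "Hate Speech Detection",
    "AT": "Admission Test",
    "WIC": "Word in Context",
    "FAQ": "Frequently Asked Questions",
}
GENERATIVE = {
    "LS": "Lexical Substitution",
    "SU": "Summarization",
    "NER": "Named Entity Recognition",
    "REL": "Relation Extraction",
}

# The introduction states 6 multiple-choice tasks and 4 generative tasks.
assert len(MULTIPLE_CHOICE) == 6 and len(GENERATIVE) == 4


def render_category(label: str, tasks: dict[str, str]) -> str:
    """Build one markdown line like '**Generative**: LS (Lexical Substitution), ...'."""
    items = ", ".join(f"{abbr} ({name})" for abbr, name in tasks.items())
    return f"**{label}**: {items}"


def aggregate_prompts(scores_per_prompt: list[float]) -> dict[str, float]:
    """Hypothetical multi-prompt aggregation: mean and best score over all prompts."""
    if not scores_per_prompt:
        raise ValueError("need at least one prompt score")
    return {"mean": mean(scores_per_prompt), "best": max(scores_per_prompt)}


if __name__ == "__main__":
    print(render_category("Multiple Choice", MULTIPLE_CHOICE))
    print(render_category("Generative", GENERATIVE))
    # Scores of one model on one task under three different prompts (made-up numbers).
    print(aggregate_prompts([0.5, 0.75, 1.0]))
```

Reporting both the mean and the best prompt is one way to reflect prompt sensitivity: a large gap between the two suggests the model's score depends heavily on which prompt is used.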