|
Model Evaluation and Leaderboard |
|
|
|
1) Model Evaluation |
|
Before a model can be added to the leaderboard, it must be evaluated with the lm-eval-harness library in both zero-shot and 5-shot configurations. |
|
|
|
The following command runs the 5-shot evaluation; set --num_fewshot 0 for the zero-shot run: |
|
|
|
lm_eval --model hf --model_args pretrained=google/gemma-3-12b-it \ |
|
--tasks evalita-mp --device cuda:0 --batch_size 1 --trust_remote_code \ |
|
--output_path model_output --num_fewshot 5 |
|
|
|
The output generated by the library will include the model's accuracy scores on the benchmark tasks. |
|
This output is written to standard output and should be saved in a text file (e.g., slurm-8368.out), which must then be placed in the evalita_llm_models_output LOCAL directory for further processing. |
|
Examples of such files can be found at: https://huggingface.co/datasets/evalitahf/evalita_llm_models_output/ |
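
Since the evaluation has to be run in both zero-shot and 5-shot configurations and its standard output saved each time, the two runs can be scripted. The following is a minimal sketch, not part of the leaderboard code: it reuses the flags of the command above, while the helper names and the log file naming scheme are assumptions made for illustration.

```python
import subprocess
from pathlib import Path

MODEL = "google/gemma-3-12b-it"
OUTPUT_DIR = Path("evalita_llm_models_output")  # local directory named in Step 1


def build_command(model: str, num_fewshot: int) -> list[str]:
    """Return the lm_eval argument list from Step 1 for one few-shot setting."""
    return [
        "lm_eval", "--model", "hf",
        "--model_args", f"pretrained={model}",
        "--tasks", "evalita-mp",
        "--device", "cuda:0",
        "--batch_size", "1",
        "--trust_remote_code",
        "--output_path", "model_output",
        "--num_fewshot", str(num_fewshot),
    ]


def log_path(model: str, num_fewshot: int) -> Path:
    """Hypothetical naming scheme for the saved standard output."""
    safe_name = model.replace("/", "__")
    return OUTPUT_DIR / f"{safe_name}_{num_fewshot}shot.out"


def run_evaluations(model: str = MODEL) -> None:
    """Run the zero-shot and 5-shot evaluations, saving stdout to text files."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    for shots in (0, 5):
        with open(log_path(model, shots), "w") as log:
            # check=True raises CalledProcessError if lm_eval exits with an error
            subprocess.run(build_command(model, shots), stdout=log, check=True)


# On a machine with lm_eval installed and a GPU available:
# run_evaluations()
```

Building the command as a list avoids shell quoting issues and keeps the only difference between the two runs, the --num_fewshot value, in one place.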
|
|
|
2) Extracting Model Metadata |
|
To display model details on the leaderboard (e.g., organization/group, model name, and parameter count), metadata must be retrieved from Hugging Face. |
|
|
|
This can be done by running: |
|
|
|
python get_model_info.py |
|
|
|
This script processes the evaluation files from Step 1 and saves each model's metadata in a JSON file within the evalita_llm_requests LOCAL directory. |
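
The internals of get_model_info.py are not shown here. As a rough illustration of the kind of record this step produces, the sketch below derives the organization and model name from a Hugging Face repository id and writes them to a JSON file. The field names, the file naming scheme, and both helper functions are assumptions; the parameter count is left empty because it has to be looked up on Hugging Face.

```python
import json
from pathlib import Path

REQUESTS_DIR = Path("evalita_llm_requests")  # local directory named in Step 2


def basic_metadata(repo_id: str) -> dict:
    """Split a repository id such as 'org/name' into its parts."""
    organization, _, model_name = repo_id.partition("/")
    if not model_name:  # id given without an organization prefix
        organization, model_name = "", repo_id
    return {
        "organization": organization,  # hypothetical field name
        "model_name": model_name,      # hypothetical field name
        "num_params": None,            # to be filled from Hugging Face
    }


def save_metadata(repo_id: str, directory: Path = REQUESTS_DIR) -> Path:
    """Write the metadata record to a JSON file and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{repo_id.replace('/', '__')}.json"
    path.write_text(json.dumps(basic_metadata(repo_id), indent=2))
    return path
```

For example, basic_metadata("google/gemma-3-12b-it") yields the organization "google" and the model name "gemma-3-12b-it".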
|
|
|
3) Generating Leaderboard Submission File |
|
The leaderboard requires a structured file containing each model's metadata along with its benchmark accuracy scores. |
|
|
|
To generate this file, run: |
|
|
|
python preprocess_model_output.py |
|
|
|
This script combines the accuracy results from Step 1 with the metadata from Step 2 and outputs a JSON file for each kind of model in the evalita_llm_results LOCAL directory. |
|
Examples of these files are in https://huggingface.co/datasets/evalitahf/evalita_llm_results |
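
The combination step can be pictured as a merge of two dictionaries. The sketch below is illustrative only: the key names "model" and "results" and both function names are assumptions, and the actual layout produced by preprocess_model_output.py should be checked against the example files linked above.

```python
import json


def build_entry(metadata: dict, accuracy_by_task: dict) -> dict:
    """Merge one model's metadata with its per-task accuracy scores."""
    if not accuracy_by_task:
        raise ValueError("no accuracy scores found for this model")
    return {
        "model": dict(metadata),             # hypothetical key
        "results": dict(accuracy_by_task),   # hypothetical key
    }


def to_json(entry: dict) -> str:
    """Serialize an entry with stable key order so diffs stay readable."""
    return json.dumps(entry, indent=2, sort_keys=True)
```

Rejecting an empty score dictionary early makes a missing or unparsed Step 1 output file visible before anything is pushed to the leaderboard.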
|
|
|
4) Updating the Hugging Face Repository |
|
To update the evalita_llm_results repository with the newly generated files from Step 3, commit and push the following three directories from the local disk to Hugging Face: |
|
evalita_llm_models_output, evalita_llm_requests and evalita_llm_results |
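
The commit and push can be scripted so that none of the three directories is forgotten. This is a minimal sketch that assumes each directory is a local git clone of its Hugging Face repository and that git credentials are already configured; the function names and the commit message are illustrative.

```python
import subprocess
from pathlib import Path

DIRECTORIES = (
    "evalita_llm_models_output",
    "evalita_llm_requests",
    "evalita_llm_results",
)


def git_commands(message: str) -> list[list[str]]:
    """Return the git commands to stage, commit, and push one directory."""
    return [
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def push_all(root: Path, message: str) -> None:
    """Run the git commands inside each of the three directories under root."""
    for name in DIRECTORIES:
        for command in git_commands(message):
            # check=True stops at the first failure; note that git commit
            # exits with an error when there is nothing new to commit.
            subprocess.run(command, cwd=root / name, check=True)


# Example, run from the parent folder of the three directories:
# push_all(Path("."), "Add results for a newly evaluated model")
```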
|
|
|
5) Running the Leaderboard Application |
|
To test the leaderboard locally, run the following command in your terminal and open your browser at the indicated address: |
|
|
|
python app.py |
|
|
|
On Hugging Face, the leaderboard can be started or stopped directly from the graphical interface, so running this command is only necessary when working locally. |
|
|
|
|
|
|