---
title: LLM as a Judge
emoji: 🧐
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.0.1
app_file: app.py
pinned: false
---
This is a space where you can compare two models using the technique "LLM as a Judge". LLM as a Judge uses an LLM to judge the responses from two other LLMs, comparing them on evaluation metrics relevant to the task.
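The pattern can be sketched in a few lines: build one prompt containing the question and both candidate answers, ask the judge model for a verdict, and parse it. The helpers below are illustrative (`build_judge_prompt` and `parse_verdict` are hypothetical names, and the actual call to the judge model is left out):

```python
# Minimal sketch of the "LLM as a Judge" pattern. The judge receives the
# question plus both candidate responses in a single prompt and returns a
# verdict; any LLM call (local model, API, ...) can fill the gap.

def build_judge_prompt(question: str, answer_a: str, answer_b: str,
                       criteria: str = "relevance, clarity, completeness") -> str:
    """Format a pairwise-comparison prompt for a judge LLM."""
    return (
        f"You are an impartial judge. Compare the two responses below on {criteria}.\n"
        f"Question: {question}\n\n"
        f"Response A: {answer_a}\n\n"
        f"Response B: {answer_b}\n\n"
        "Reply with exactly 'A' or 'B' for the better response."
    )

def parse_verdict(judge_output: str) -> str:
    """Extract the winner ('A' or 'B') from the judge's raw output."""
    for token in judge_output.strip().upper().split():
        if token in ("A", "B"):
            return token
    return "tie"  # fall back when the judge gives no clear verdict
```

In practice the judge's free-form output is noisy, which is why the parsing step scans for a clear `A`/`B` token instead of trusting the first character.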
In this space, the default placeholder repos compare two LLMs finetuned from the same base model, [Llama 3.2 3B Instruct](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct). Both are finetuned on the [FineTome-100k dataset](https://huggingface.co/datasets/mlabonne/FineTome-100k), but on different amounts of data.
The models were finetuned using [Unsloth](https://unsloth.ai/), a framework that makes finetuning, training, and inference with LLMs up to 2x faster.
## Default models and their hyperparameters
Both models were trained on a [Tesla T4 GPU](https://www.nvidia.com/en-us/data-center/tesla-t4/) with 16GB of GDDR6 memory and 2560 CUDA cores.
### forestav/LoRA-2000
Finetuned for 2000 steps.\
Quantization method: `float16`
### KolumbusLindh/LoRA-4100
Finetuned for 4100 steps.\
Quantization method: `float16`
### Hyperparameters
Both models used the same hyperparameters during training.\
`lora_alpha=16`: Scaling factor for the low-rank matrices' contribution. Higher values increase their influence and speed up convergence but risk instability and overfitting; lower values have a smaller effect and may require more training steps.\
`lora_dropout=0`: Probability of zeroing out elements in the low-rank matrices for regularization. Higher values give more regularization but may slow training and degrade performance.\
`per_device_train_batch_size=2`: Number of training examples processed per device in each forward/backward pass. Higher values speed up training but require more GPU memory.\
`gradient_accumulation_steps=4`: The number of micro-batches over which to accumulate gradients before performing an optimizer update. Higher values increase the effective batch size without requiring additional memory, which can improve training stability and convergence with a large model on limited hardware.\
`learning_rate=2e-4`: Rate at which the model updates its parameters during training. Higher values give faster convergence but risk overshooting the optimum and instability; lower values require more training steps but can yield better final performance.\
`optim="adamw_8bit"`: AdamW optimizer with 8-bit optimizer states, reducing optimizer memory usage.\
`weight_decay=0.01`: Penalty added to the loss, proportional to the magnitude of the weights, to prevent overfitting.\
`lr_scheduler_type="linear"`: The learning rate decays linearly from its initial value to zero over the course of training.
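Taken together, the settings above fit into a training configuration roughly like this. Plain dicts are used instead of the real Unsloth/`transformers` objects so the mapping is easy to inspect; this is a sketch, not the exact training script:

```python
# How the listed hyperparameters fit together (illustrative dicts only).

lora_config = {
    "lora_alpha": 16,   # scaling factor for the low-rank update
    "lora_dropout": 0,  # no dropout on the LoRA matrices
}

training_args = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "optim": "adamw_8bit",
    "weight_decay": 0.01,
    "lr_scheduler_type": "linear",
}

# Gradients are accumulated over 4 micro-batches of 2 examples each,
# so each optimizer step effectively sees 2 * 4 = 8 examples.
effective_batch_size = (training_args["per_device_train_batch_size"]
                        * training_args["gradient_accumulation_steps"])
```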
These hyperparameters are [suggested as defaults](https://docs.unsloth.ai/tutorials/how-to-finetune-llama-3-and-export-to-ollama) when using Unsloth. However, to experiment with them, we also finetuned a third model, keeping some of the hyperparameters above but changing:
`lora_dropout=0.3`\
`per_device_train_batch_size=20`\
`gradient_accumulation_steps=40`\
`learning_rate=2e-2`
The effects of this were evident. One step took around 10 minutes due to the increased `gradient_accumulation_steps`, and it required a significant amount of GPU memory due to `per_device_train_batch_size=20`. The model also overfitted in just 15 steps, reaching `loss=0`, due to the high learning rate. We wanted to see whether the dropout could prevent overfitting despite the high learning rate, but it could not.
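The slowdown follows directly from the effective batch size, which these settings inflate by two orders of magnitude:

```python
# Effective examples per optimizer step = batch size per device * accumulation steps.
default_step = 2 * 4      # 8 examples per update (default settings)
modified_step = 20 * 40   # 800 examples per update (experimental settings)
ratio = modified_step / default_step  # 100x more data processed per optimizer step
```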
Both models have a maximum sequence length of 2048 tokens, meaning they only process the first 2048 tokens of the input.
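The truncation amounts to simple slicing of the token ids (assuming a tokenizer that returns a list of ids; `truncate` is an illustrative helper, not part of any library):

```python
MAX_SEQ_LENGTH = 2048

def truncate(token_ids, max_len=MAX_SEQ_LENGTH):
    """Keep only the first max_len tokens; anything beyond is discarded."""
    return token_ids[:max_len]

long_input = list(range(3000))   # pretend these are 3000 token ids
truncated = truncate(long_input) # tokens 2048..2999 never reach the model
```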
We chose `float16` as the quantization method because, according to the [Unsloth wiki](https://github.com/unslothai/unsloth/wiki), it has the fastest conversion and retains 100% accuracy. However, it is slow and memory-hungry, which is a disadvantage.
## Judge
We use the KolumbusLindh/LoRA-4100 model as a judge. However, for better accuracy one should use a stronger model, such as GPT-4, which can evaluate the responses more thoroughly.
## Evaluation using GPT-4
To better evaluate our finetuned models, we let GPT-4 act as judge while each model answered the following prompts:
1. Describe step-by-step how to set up a tent in a windy environment.
2. How-To Guidance: "Explain how to bake a chocolate cake without using eggs."
3. Troubleshooting: "Provide instructions for troubleshooting a laptop that won't turn on."
4. Educational Explanation: "Teach a beginner how to solve a Rubik's Cube in simple steps."
5. DIY Project: "Give detailed instructions for building a birdhouse using basic tools."
6. Fitness Routine: "Design a beginner-friendly 15-minute workout routine that requires no equipment."
7. Cooking Tips: "Explain how to properly season and cook a medium-rare steak."
8. Technical Guidance: "Write a step-by-step guide for setting up a local Git repository and pushing code to GitHub."
9. Emergency Response: "Provide instructions for administering first aid to someone with a sprained ankle."
10. Language Learning: "Outline a simple plan for a beginner to learn Spanish in 30 days."
### Results
#### Prompt 1: Describe step-by-step how to set up a tent in a windy environment.