---
title: LLM as a Judge
emoji: 🧐
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.0.1
app_file: app.py
pinned: false
---
This is a space where you can compare two models using the technique "LLM as a Judge". LLM as a Judge uses an LLM to judge the responses from two other LLMs, comparing them on evaluation metrics relevant to the task.
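The pattern can be sketched in a few lines: build one prompt containing the question and both candidate answers, ask the judge model for a verdict, and parse it. The helpers below are illustrative (`build_judge_prompt` and `parse_verdict` are hypothetical names, and the actual call to the judge model is left out):

```python
# Minimal sketch of the "LLM as a Judge" pattern. The judge receives the
# question plus both candidate responses in a single prompt and returns a
# verdict; any LLM call (local model, API, ...) can fill the gap.

def build_judge_prompt(question: str, answer_a: str, answer_b: str,
                       criteria: str = "relevance, clarity, completeness") -> str:
    """Format a pairwise-comparison prompt for a judge LLM."""
    return (
        f"You are an impartial judge. Compare the two responses below on {criteria}.\n"
        f"Question: {question}\n\n"
        f"Response A: {answer_a}\n\n"
        f"Response B: {answer_b}\n\n"
        "Reply with exactly 'A' or 'B' for the better response."
    )

def parse_verdict(judge_output: str) -> str:
    """Extract the winner ('A' or 'B') from the judge's raw output."""
    for token in judge_output.strip().upper().split():
        if token in ("A", "B"):
            return token
    return "tie"  # fall back when the judge gives no clear verdict
```

In practice the judge's free-form output is noisy, which is why the parsing step scans for a clear `A`/`B` token instead of trusting the first character.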
In this space, the default placeholder repos compare two LLMs finetuned from the same base model, [Llama 3.2 3B Instruct](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct). Both are finetuned on the [FineTome-100k dataset](https://huggingface.co/datasets/mlabonne/FineTome-100k), but on different amounts of data.
The models were finetuned using [Unsloth](https://unsloth.ai/), a framework that makes finetuning, training, and inference with LLMs up to 2x faster.
## Default models and their hyperparameters
Both models were trained on a [Tesla T4 GPU](https://www.nvidia.com/en-us/data-center/tesla-t4/) with 16GB of GDDR6 memory and 2560 CUDA cores.
### forestav/LoRA-2000
Finetuned for 2000 steps.\
Quantization method: `float16`
### KolumbusLindh/LoRA-4100
Finetuned for 4100 steps.\
Quantization method: `float16`
### Hyperparameters
Both models used the same hyperparameters during training.\
`lora_alpha=16`: Scaling factor for the low-rank matrices' contribution. Higher values increase their influence and speed up convergence but risk instability and overfitting; lower values have a smaller effect and may require more training steps.\
`lora_dropout=0`: Probability of zeroing out elements in the low-rank matrices for regularization. Higher values give more regularization but may slow training and degrade performance.\
`per_device_train_batch_size=2`: Number of training examples processed per device in each forward/backward pass. Higher values speed up training but require more GPU memory.\
`gradient_accumulation_steps=4`: The number of micro-batches over which to accumulate gradients before performing an optimizer update. Higher values increase the effective batch size without requiring additional memory, which can improve training stability and convergence with a large model on limited hardware.\
`learning_rate=2e-4`: Rate at which the model updates its parameters during training. Higher values give faster convergence but risk overshooting the optimum and instability; lower values require more training steps but can yield better final performance.\
`optim="adamw_8bit"`: AdamW optimizer with 8-bit optimizer states, reducing optimizer memory usage.\
`weight_decay=0.01`: Penalty added to the loss, proportional to the magnitude of the weights, to prevent overfitting.\
`lr_scheduler_type="linear"`: The learning rate decays linearly from its initial value to zero over the course of training.
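Taken together, the settings above fit into a training configuration roughly like this. Plain dicts are used instead of the real Unsloth/`transformers` objects so the mapping is easy to inspect; this is a sketch, not the exact training script:

```python
# How the listed hyperparameters fit together (illustrative dicts only).

lora_config = {
    "lora_alpha": 16,   # scaling factor for the low-rank update
    "lora_dropout": 0,  # no dropout on the LoRA matrices
}

training_args = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "optim": "adamw_8bit",
    "weight_decay": 0.01,
    "lr_scheduler_type": "linear",
}

# Gradients are accumulated over 4 micro-batches of 2 examples each,
# so each optimizer step effectively sees 2 * 4 = 8 examples.
effective_batch_size = (training_args["per_device_train_batch_size"]
                        * training_args["gradient_accumulation_steps"])
```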
These hyperparameters are [suggested as defaults](https://docs.unsloth.ai/tutorials/how-to-finetune-llama-3-and-export-to-ollama) when using Unsloth. However, to experiment with them, we also finetuned a third model, keeping some of the hyperparameters above but changing:
`lora_dropout=0.3`\
`per_device_train_batch_size=20`\
`gradient_accumulation_steps=40`\
`learning_rate=2e-2`
The effects of this were evident. One step took around 10 minutes due to the increased `gradient_accumulation_steps`, and it required a significant amount of GPU memory due to `per_device_train_batch_size=20`. The model also overfitted in just 15 steps, reaching `loss=0`, due to the high learning rate. We wanted to see whether the dropout could prevent overfitting despite the high learning rate, but it could not.
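The slowdown follows directly from the effective batch size, which these settings inflate by two orders of magnitude:

```python
# Effective examples per optimizer step = batch size per device * accumulation steps.
default_step = 2 * 4      # 8 examples per update (default settings)
modified_step = 20 * 40   # 800 examples per update (experimental settings)
ratio = modified_step / default_step  # 100x more data processed per optimizer step
```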
Both models have a maximum sequence length of 2048 tokens, meaning they only process the first 2048 tokens of the input.
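The truncation amounts to simple slicing of the token ids (assuming a tokenizer that returns a list of ids; `truncate` is an illustrative helper, not part of any library):

```python
MAX_SEQ_LENGTH = 2048

def truncate(token_ids, max_len=MAX_SEQ_LENGTH):
    """Keep only the first max_len tokens; anything beyond is discarded."""
    return token_ids[:max_len]

long_input = list(range(3000))   # pretend these are 3000 token ids
truncated = truncate(long_input) # tokens 2048..2999 never reach the model
```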
We chose `float16` as the quantization method because, according to the [Unsloth wiki](https://github.com/unslothai/unsloth/wiki), it has the fastest conversion and retains 100% accuracy. However, it is slow and memory-hungry, which is a disadvantage.
## Judge
We use the KolumbusLindh/LoRA-4100 model as a judge. However, for better accuracy one should use a stronger model, such as GPT-4, which can evaluate the responses more thoroughly.
## Evaluation using GPT-4
To better evaluate our finetuned models, we let GPT-4 act as judge while each model answered the following prompts:
1. Describe step-by-step how to set up a tent in a windy environment.
2. How-To Guidance: "Explain how to bake a chocolate cake without using eggs."
3. Troubleshooting: "Provide instructions for troubleshooting a laptop that won't turn on."
4. Educational Explanation: "Teach a beginner how to solve a Rubik's Cube in simple steps."
5. DIY Project: "Give detailed instructions for building a birdhouse using basic tools."
6. Fitness Routine: "Design a beginner-friendly 15-minute workout routine that requires no equipment."
7. Cooking Tips: "Explain how to properly season and cook a medium-rare steak."
8. Technical Guidance: "Write a step-by-step guide for setting up a local Git repository and pushing code to GitHub."
9. Emergency Response: "Provide instructions for administering first aid to someone with a sprained ankle."
10. Language Learning: "Outline a simple plan for a beginner to learn Spanish in 30 days."
### Results
#### Prompt 1: Describe step-by-step how to set up a tent in a windy environment.