---
title: RuSimulBench Arena
emoji: 📊
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
---
# Model Response Evaluator

This application evaluates model responses based on both creativity metrics (using Gemini) and stability metrics (using semantic similarity).

## Features

- Evaluate individual model responses for creativity, diversity, relevance, and stability
- Run batch evaluations on multiple models from a CSV file
- Web interface for easy use
- Command-line interface for scripting and automation
- Combined scoring that balances creativity and stability
## Installation

1. Clone this repository
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Get a Gemini API key from Google AI Studio (https://makersuite.google.com/)
## Usage

### Web Interface

```bash
python app.py --web
```

This will start a Gradio web interface where you can:

- Evaluate single responses
- Upload CSV files for batch evaluation
- View evaluation results
### Command Line

For batch evaluation of models from a CSV file:

```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv
```

Optional arguments:

- `--models`: Comma-separated list of model names to evaluate (e.g., "gpt-4,claude-3")
- `--prompt_col`: Column name containing the prompts (default: "rus_prompt")
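
Since everything is passed as flags, the same run can also be scripted from Python; a minimal sketch using only the flags documented above:

```python
# Sketch: drive the batch evaluation from a script.
# Only the flags documented above are used; adjust paths and model names as needed.
import subprocess

subprocess.run(
    [
        "python", "app.py",
        "--gemini_api_key", "YOUR_API_KEY",
        "--input_file", "your_responses.csv",
        "--models", "gpt-4,claude-3",
        "--prompt_col", "rus_prompt",
    ],
    check=True,
)
```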
## CSV Format

Your CSV file should have these columns:

- A prompt column (default: "rus_prompt")
- One or more response columns with names ending in "_answers" (e.g., "gpt4_answers", "claude_answers")
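
For example, a compatible file can be assembled with pandas; the column names below follow the defaults above, and the text content is placeholder data:

```python
# Sketch: build a CSV in the expected format.
# Column names match the defaults above; the prompt and answers are placeholders.
import pandas as pd

df = pd.DataFrame(
    {
        "rus_prompt": ["Напишите короткий рассказ о море."],
        "gpt4_answers": ["Море шептало свои истории волнам..."],
        "claude_answers": ["Волны играли с лучами заката..."],
    }
)
df.to_csv("your_responses.csv", index=False)
```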
## Evaluation Metrics

### Creativity Metrics

- **Креативность (Creativity)**: Uniqueness and originality of the response
- **Разнообразие (Diversity)**: Use of varied linguistic features
- **Релевантность (Relevance)**: How well the response addresses the prompt
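
A rough sketch of how an LLM judge can be asked for these three scores; the model name, prompt wording, 0–10 scale, and JSON parsing here are illustrative assumptions, not the app's actual implementation:

```python
# Sketch: ask Gemini to rate a response on creativity, diversity and relevance.
# Model name, prompt wording, scale and parsing are all assumptions.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
judge = genai.GenerativeModel("gemini-1.5-flash")

def rate(prompt: str, answer: str) -> dict:
    query = (
        "Rate the answer to the prompt on creativity, diversity and relevance, "
        "each from 0 to 10. Reply with JSON only, e.g. "
        '{"creativity": 5, "diversity": 5, "relevance": 5}.\n'
        f"Prompt: {prompt}\nAnswer: {answer}"
    )
    reply = judge.generate_content(query)
    text = reply.text.strip().removeprefix("```json").removesuffix("```")
    return json.loads(text)
```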
### Stability Metrics

- **Stability Score**: Semantic similarity between prompts and responses
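
A minimal sketch of such a score, assuming a sentence-transformers embedding model (the embedding model the app actually uses may differ):

```python
# Sketch: cosine similarity between prompt and response as a stability proxy.
# The multilingual model name is an assumption, not necessarily what the app uses.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def stability(prompt: str, answer: str) -> float:
    emb = embedder.encode([prompt, answer], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```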
### Combined Score

- Average of the creativity and stability scores
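
Assuming both scores have been normalized to the same range, this reduces to a simple mean:

```python
# Sketch: combined score as the mean of creativity and stability,
# assuming both are already on the same scale (e.g. 0-1).
def combined_score(creativity: float, stability: float) -> float:
    return (creativity + stability) / 2
```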
## Output

The evaluation produces:

- CSV files with detailed per-response evaluations for each model
- A `benchmark_results.csv` file with aggregated metrics for all models
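
The aggregated results can then be inspected with pandas; the `combined_score` column used for sorting below is a guess, so check the actual file header first:

```python
# Sketch: load and rank the aggregated benchmark results.
# "combined_score" is a hypothetical column name; inspect the CSV header first.
import pandas as pd

results = pd.read_csv("benchmark_results.csv")
print(results.sort_values("combined_score", ascending=False))
```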
## Environment Variables

You can set the `GEMINI_API_KEY` environment variable instead of passing it as an argument.
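
For example, from Python the key can be read from the environment:

```python
# Sketch: read the API key from the environment instead of passing --gemini_api_key.
import os

api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set")
```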