---
title: RuSimulBench Arena
emoji: 📊
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
---
# Model Response Evaluator
This application evaluates model responses on both creativity metrics (scored with Gemini) and stability metrics (based on semantic similarity).
## Features
- Evaluate individual model responses for creativity, diversity, relevance, and stability
- Run batch evaluations on multiple models from a CSV file
- Web interface for easy use
- Command-line interface for scripting and automation
- Combined scoring that balances creativity and stability
## Installation
1. Clone this repository
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Get a Gemini API key from Google AI Studio (https://makersuite.google.com/)
## Usage
### Web Interface
```bash
python app.py --web
```
This will start a Gradio web interface where you can:
- Evaluate single responses
- Upload CSV files for batch evaluation
- View evaluation results
### Command Line
For batch evaluation of models from a CSV file:
```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv
```
Optional arguments (see the example below):
- `--models`: Comma-separated list of model names to evaluate (e.g., "gpt-4,claude-3")
- `--prompt_col`: Column name containing prompts (default: "rus_prompt")
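For example, a batch run restricted to two models with a custom prompt column might look like this (the model names and file name below are placeholders):
```bash
python app.py \
  --gemini_api_key YOUR_API_KEY \
  --input_file your_responses.csv \
  --models "gpt-4,claude-3" \
  --prompt_col "rus_prompt"
```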
## CSV Format
Your CSV file should have these columns (see the example below):
- A prompt column (default: "rus_prompt")
- One or more response columns with names ending in "_answers" (e.g., "gpt4_answers", "claude_answers")
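A minimal example of the expected layout (the prompt text and response cells here are placeholders for illustration only):
```csv
rus_prompt,gpt4_answers,claude_answers
"Придумайте необычную метафору для осени.","<gpt-4 response>","<claude response>"
```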
## Evaluation Metrics
### Creativity Metrics
- **Креативность (Creativity)**: Uniqueness and originality of the response
- **Разнообразие (Diversity)**: Use of varied linguistic features
- **Релевантность (Relevance)**: How well the response addresses the prompt
### Stability Metrics
- **Stability Score**: Semantic similarity between prompts and responses
### Combined Score
- Average of creativity and stability scores
## Output
The evaluation produces:
- CSV files with detailed per-response evaluations for each model
- A `benchmark_results.csv` file with aggregated metrics for all models
## Environment Variables
You can set the `GEMINI_API_KEY` environment variable instead of passing it as an argument.
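For example, in a POSIX shell (the key value is a placeholder):
```bash
export GEMINI_API_KEY="YOUR_API_KEY"
python app.py --input_file your_responses.csv
```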