---
title: RuSimulBench Arena
emoji: 📊
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
---
# Model Response Evaluator
This application evaluates model responses based on both creativity metrics (using Gemini) and stability metrics (using semantic similarity).
## Features
- Evaluate individual model responses for creativity, diversity, relevance, and stability
- Run batch evaluations on multiple models from a CSV file
- Web interface for easy use
- Command-line interface for scripting and automation
- Combined scoring that balances creativity and stability
## Installation
1. Clone this repository
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Get a Gemini API key from [Google AI Studio](https://makersuite.google.com/); a quick way to check that the key works is sketched below.
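
A minimal sanity check with the `google-generativeai` client, assuming that is the library listed in `requirements.txt`; the model name below is only an example and `app.py` may use a different model or client:

```python
import google.generativeai as genai

# Minimal key check; "gemini-1.5-flash" is an assumption, app.py may use another model.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")
print(model.generate_content("Reply with OK if you can read this.").text)
```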
## Usage
### Web Interface
```bash
python app.py --web
```
This will start a Gradio web interface where you can:
- Evaluate single responses
- Upload CSV files for batch evaluation
- View evaluation results
### Command Line
For batch evaluation of models from a CSV file:
```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv
```
Optional arguments:
- `--models`: Comma-separated list of model names to evaluate (e.g., "gpt-4,claude-3")
- `--prompt_col`: Column name containing prompts (default: "rus_prompt")
## CSV Format
Your CSV file should have these columns:
- A prompt column (default: "rus_prompt")
- One or more response columns with names ending in "_answers" (e.g., "gpt4_answers", "claude_answers")
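
For illustration, a compatible file could be assembled with pandas as shown below; the model column names are hypothetical, only the prompt column name and the `_answers` suffix matter:

```python
import pandas as pd

# Hypothetical example of the expected layout: one prompt column plus
# one "<model>_answers" column per model to evaluate.
pd.DataFrame({
    "rus_prompt": ["Prompt 1 (in Russian)", "Prompt 2 (in Russian)"],
    "gpt4_answers": ["Model A's answer to prompt 1", "Model A's answer to prompt 2"],
    "claude_answers": ["Model B's answer to prompt 1", "Model B's answer to prompt 2"],
}).to_csv("your_responses.csv", index=False)
```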
## Evaluation Metrics
### Creativity Metrics
- **Креативность (Creativity)**: Uniqueness and originality of the response
- **Разнообразие (Diversity)**: Use of varied linguistic features
- **Релевантность (Relevance)**: How well the response addresses the prompt
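
A rough sketch of how an LLM judge can produce these scores is given below; the exact prompt wording, scoring scale, and response parsing in `app.py` may differ:

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
judge = genai.GenerativeModel("gemini-1.5-flash")  # assumed judge model

def rate_response(prompt: str, response: str) -> dict:
    """Ask Gemini to score creativity, diversity and relevance (1-10 scale assumed)."""
    query = (
        "Rate the response to the prompt on three criteria from 1 to 10: "
        "creativity, diversity, relevance. Answer with JSON only, e.g. "
        '{"creativity": 7, "diversity": 6, "relevance": 9}.\n\n'
        f"Prompt: {prompt}\n\nResponse: {response}"
    )
    raw = judge.generate_content(query).text.strip()
    # Strip a Markdown code fence if the model wraps its JSON in one.
    raw = raw.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
    return json.loads(raw)
```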
### Stability Metrics
- **Stability Score**: Semantic similarity between prompts and responses
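
A minimal sketch of such a score using `sentence-transformers`; the embedding model named here is an assumption, and `app.py` may use a different one:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed multilingual embedding model; app.py may use a different one.
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def stability_score(prompt: str, response: str) -> float:
    """Cosine similarity between prompt and response embeddings (higher = more on-topic)."""
    prompt_emb, response_emb = embedder.encode([prompt, response], convert_to_tensor=True)
    return float(util.cos_sim(prompt_emb, response_emb))
```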
### Combined Score
- Average of creativity and stability scores
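
As a sketch of the formula, assuming both scores have already been put on the same scale (see `app.py` for the exact normalisation):

```python
def combined_score(creativity_score: float, stability_score: float) -> float:
    """Simple average of the two scores; assumes they share a common scale."""
    return (creativity_score + stability_score) / 2
```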
## Output
The evaluation produces:
- CSV files with detailed per-response evaluations for each model
- A `benchmark_results.csv` file with aggregated metrics for all models
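
The aggregated file can be inspected with pandas; the column names depend on `app.py`, so print them before relying on any particular one:

```python
import pandas as pd

# Load the aggregated benchmark results and look at the available columns.
results = pd.read_csv("benchmark_results.csv")
print(results.columns.tolist())
print(results.head())
```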
## Environment Variables
You can set the `GEMINI_API_KEY` environment variable instead of passing it as an argument.
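
A typical fallback lookup looks like the sketch below (an illustration, not the exact code in `app.py`); in a shell you would export the variable before launching the app:

```python
import os

# Fall back to the environment variable when no --gemini_api_key flag is given.
api_key = os.environ.get("GEMINI_API_KEY")
if not api_key:
    raise SystemExit("Set GEMINI_API_KEY or pass --gemini_api_key.")
```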