---
title: RuSimulBench Arena
emoji: 📊
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
---
# Model Response Evaluator

This application evaluates model responses based on both creativity metrics (using Gemini) and stability metrics (using semantic similarity).

## Features

- Evaluate individual model responses for creativity, diversity, relevance, and stability
- Run batch evaluations on multiple models from a CSV file
- Web interface for easy use
- Command-line interface for scripting and automation
- Combined scoring that balances creativity and stability
## Installation

1. Clone this repository
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Get a Gemini API key from Google AI Studio (https://makersuite.google.com/)
## Usage

### Web Interface

```bash
python app.py --web
```

This will start a Gradio web interface where you can:

- Evaluate single responses
- Upload CSV files for batch evaluation
- View evaluation results
### Command Line

For batch evaluation of models from a CSV file:

```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv
```

Optional arguments:

- `--models`: Comma-separated list of model names to evaluate (e.g., "gpt-4,claude-3")
- `--prompt_col`: Column name containing the prompts (default: "rus_prompt")
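
Since everything is passed as flags, the same run can also be scripted from Python; a minimal sketch using only the flags documented above:

```python
# Sketch: drive the batch evaluation from a script.
# Only the flags documented above are used; adjust paths and model names as needed.
import subprocess

subprocess.run(
    [
        "python", "app.py",
        "--gemini_api_key", "YOUR_API_KEY",
        "--input_file", "your_responses.csv",
        "--models", "gpt-4,claude-3",
        "--prompt_col", "rus_prompt",
    ],
    check=True,
)
```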
## CSV Format

Your CSV file should have these columns:

- A prompt column (default: "rus_prompt")
- One or more response columns with names ending in "_answers" (e.g., "gpt4_answers", "claude_answers")
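
For example, a compatible file can be assembled with pandas; the column names below follow the defaults above, and the text content is placeholder data:

```python
# Sketch: build a CSV in the expected format.
# Column names match the defaults above; the prompt and answers are placeholders.
import pandas as pd

df = pd.DataFrame(
    {
        "rus_prompt": ["Напишите короткий рассказ о море."],
        "gpt4_answers": ["Море шептало свои истории волнам..."],
        "claude_answers": ["Волны играли с лучами заката..."],
    }
)
df.to_csv("your_responses.csv", index=False)
```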
## Evaluation Metrics

### Creativity Metrics

- **Креативность (Creativity)**: Uniqueness and originality of the response
- **Разнообразие (Diversity)**: Use of varied linguistic features
- **Релевантность (Relevance)**: How well the response addresses the prompt
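
A rough sketch of how an LLM judge can be asked for these three scores; the model name, prompt wording, 0–10 scale, and JSON parsing here are illustrative assumptions, not the app's actual implementation:

```python
# Sketch: ask Gemini to rate a response on creativity, diversity and relevance.
# Model name, prompt wording, scale and parsing are all assumptions.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
judge = genai.GenerativeModel("gemini-1.5-flash")

def rate(prompt: str, answer: str) -> dict:
    query = (
        "Rate the answer to the prompt on creativity, diversity and relevance, "
        "each from 0 to 10. Reply with JSON only, e.g. "
        '{"creativity": 5, "diversity": 5, "relevance": 5}.\n'
        f"Prompt: {prompt}\nAnswer: {answer}"
    )
    reply = judge.generate_content(query)
    text = reply.text.strip().removeprefix("```json").removesuffix("```")
    return json.loads(text)
```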
### Stability Metrics

- **Stability Score**: Semantic similarity between prompts and responses
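
A minimal sketch of such a score, assuming a sentence-transformers embedding model (the embedding model the app actually uses may differ):

```python
# Sketch: cosine similarity between prompt and response as a stability proxy.
# The multilingual model name is an assumption, not necessarily what the app uses.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def stability(prompt: str, answer: str) -> float:
    emb = embedder.encode([prompt, answer], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```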
### Combined Score

- Average of the creativity and stability scores
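
Assuming both scores have been normalized to the same range, this reduces to a simple mean:

```python
# Sketch: combined score as the mean of creativity and stability,
# assuming both are already on the same scale (e.g. 0-1).
def combined_score(creativity: float, stability: float) -> float:
    return (creativity + stability) / 2
```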
## Output

The evaluation produces:

- CSV files with detailed per-response evaluations for each model
- A `benchmark_results.csv` file with aggregated metrics for all models
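
The aggregated results can then be inspected with pandas; the `combined_score` column used for sorting below is a guess, so check the actual file header first:

```python
# Sketch: load and rank the aggregated benchmark results.
# "combined_score" is a hypothetical column name; inspect the CSV header first.
import pandas as pd

results = pd.read_csv("benchmark_results.csv")
print(results.sort_values("combined_score", ascending=False))
```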
## Environment Variables

You can set the `GEMINI_API_KEY` environment variable instead of passing it as an argument.
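
For example, from Python the key can be read from the environment:

```python
# Sketch: read the API key from the environment instead of passing --gemini_api_key.
import os

api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set")
```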