---
title: RuSimulBench Arena
emoji: 📊
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
---
# Model Response Evaluator

This application evaluates model responses on creativity metrics (scored with Gemini) and stability metrics (based on semantic similarity between prompts and responses).

## Features

- Evaluate individual model responses for creativity, diversity, relevance, and stability
- Run batch evaluations on multiple models from a CSV file
- Web interface for easy use
- Command-line interface for scripting and automation
- Combined scoring that balances creativity and stability

## Installation

1. Clone this repository
2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Get a Gemini API key from [Google AI Studio](https://makersuite.google.com/)

## Usage

### Web Interface

```bash
python app.py --web
```

This will start a Gradio web interface where you can:
- Evaluate single responses
- Upload CSV files for batch evaluation
- View evaluation results

### Command Line

For batch evaluation of models from a CSV file:

```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv
```

Optional arguments:
- `--models`: Comma-separated list of model names to evaluate (e.g., "gpt-4,claude-3")
- `--prompt_col`: Column name containing prompts (default: "rus_prompt")

## CSV Format

Your CSV file should have these columns:
- A prompt column (default: "rus_prompt")
- One or more response columns with names ending in "_answers" (e.g., "gpt4_answers", "claude_answers")
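
For illustration, the snippet below writes a minimal file in this layout, assuming `pandas` is available; the model names (`gpt4`, `claude`) and the prompts are placeholders rather than required values.

```python
import pandas as pd

# One prompt column plus one "*_answers" column per model to evaluate.
# Model names here are placeholders, not required values.
df = pd.DataFrame({
    "rus_prompt": [
        "Напишите короткий рассказ о роботе.",
        "Придумайте необычное применение для скрепки.",
    ],
    "gpt4_answers": [
        "Робот по имени Вольт мечтал научиться рисовать...",
        "Скрепку можно использовать как миниатюрную антенну...",
    ],
    "claude_answers": [
        "Маленький робот собирал осколки упавших звёзд...",
        "Из нескольких скрепок выйдет подставка для телефона...",
    ],
})
df.to_csv("your_responses.csv", index=False)
```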

## Evaluation Metrics

### Creativity Metrics
- **Креативность (Creativity)**: Uniqueness and originality of the response
- **Разнообразие (Diversity)**: Use of varied linguistic features
- **Релевантность (Relevance)**: How well the response addresses the prompt
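
The exact Gemini prompt used for scoring is not shown here; the sketch below is only one plausible way to request these ratings with the `google-generativeai` package, and the model name and prompt wording are assumptions rather than the app's actual implementation.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
# Model choice is an assumption; any available Gemini model could be used.
model = genai.GenerativeModel("gemini-1.5-flash")

def rate_creativity(prompt: str, response: str) -> str:
    """Ask Gemini for creativity/diversity/relevance ratings (illustrative only)."""
    query = (
        "Rate the following response to the prompt for creativity, diversity, "
        "and relevance, each on a 1-10 scale. Return the three numbers.\n"
        f"Prompt: {prompt}\nResponse: {response}"
    )
    return model.generate_content(query).text
```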

### Stability Metrics
- **Stability Score**: Semantic similarity between prompts and responses
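
The embedding model behind the stability score is not named in this README; a minimal sketch of the idea, assuming a multilingual `sentence-transformers` model:

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual encoder chosen purely for illustration; the app may use another model.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def stability_score(prompt: str, response: str) -> float:
    """Cosine similarity between prompt and response embeddings."""
    embeddings = encoder.encode([prompt, response], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))
```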

### Combined Score
- Average of creativity and stability scores
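
A minimal sketch of that averaging, assuming the three creativity sub-scores are averaged first and that all scores share a common scale:

```python
def combined_score(creativity: float, diversity: float, relevance: float,
                   stability: float) -> float:
    # Average the creativity sub-scores, then average the result with stability.
    creativity_avg = (creativity + diversity + relevance) / 3
    return (creativity_avg + stability) / 2
```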

## Output

The evaluation produces:
- CSV files with detailed per-response evaluations for each model
- A `benchmark_results.csv` file with aggregated metrics for all models

## Environment Variables

You can set the `GEMINI_API_KEY` environment variable instead of passing it as an argument.
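
As a rough sketch (not the app's actual code), the key would typically be resolved with a fallback like this, where `resolve_api_key` is a hypothetical helper:

```python
import os

def resolve_api_key(cli_key: str | None = None) -> str:
    """Prefer an explicitly passed key; otherwise fall back to the environment."""
    key = cli_key or os.getenv("GEMINI_API_KEY")
    if not key:
        raise SystemExit("Provide --gemini_api_key or set GEMINI_API_KEY")
    return key
```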