sdk_version: 5.21.0
app_file: app.py
pinned: false
---
# Model Response Evaluator

This application evaluates model responses based on both creativity metrics (using Gemini) and stability metrics (using semantic similarity).

## Features

- Evaluate individual model responses for creativity, diversity, relevance, and stability
- Run batch evaluations on multiple models from a CSV file
- Web interface for easy use
- Command-line interface for scripting and automation
- Combined scoring that balances creativity and stability

## Installation

1. Clone this repository
2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Get a Gemini API key from Google AI Studio (https://makersuite.google.com/)

## Usage

### Web Interface

```bash
python app.py --web
```

This will start a Gradio web interface where you can:
- Evaluate single responses
- Upload CSV files for batch evaluation
- View evaluation results
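
As a rough orientation, a single-response evaluation form like this can be wired up in Gradio as sketched below; the `evaluate_response` function is a hypothetical placeholder, not the actual code in `app.py`.

```python
# Minimal sketch of a Gradio form for single-response evaluation.
# `evaluate_response` is a hypothetical placeholder, not the real scoring code.
import gradio as gr

def evaluate_response(prompt: str, response: str) -> dict:
    # The real app scores creativity (via Gemini) and stability
    # (via semantic similarity); dummy values are returned here.
    return {"creativity": 0.0, "stability": 0.0}

demo = gr.Interface(
    fn=evaluate_response,
    inputs=[gr.Textbox(label="Prompt"), gr.Textbox(label="Model response")],
    outputs=gr.JSON(label="Scores"),
    title="Model Response Evaluator",
)

if __name__ == "__main__":
    demo.launch()
```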

### Command Line

For batch evaluation of models from a CSV file:

```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv
```

Optional arguments:
- `--models`: Comma-separated list of model names to evaluate (e.g., "gpt-4,claude-3")
- `--prompt_col`: Column name containing prompts (default: "rus_prompt")
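
For example, to score just two models and read prompts from the default column (all values here are illustrative):

```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv \
  --models "gpt-4,claude-3" --prompt_col rus_prompt
```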

## CSV Format

Your CSV file should have these columns:
- A prompt column (default: "rus_prompt")
- One or more response columns with names ending in "_answers" (e.g., "gpt4_answers", "claude_answers")
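
A minimal example file (the cell values are placeholders) could look like this:

```csv
rus_prompt,gpt4_answers,claude_answers
"First prompt ...","gpt4 response to the first prompt ...","claude response to the first prompt ..."
"Second prompt ...","gpt4 response to the second prompt ...","claude response to the second prompt ..."
```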

## Evaluation Metrics

### Creativity Metrics
- **Креативность (Creativity)**: Uniqueness and originality of the response
- **Разнообразие (Diversity)**: Use of varied linguistic features
- **Релевантность (Relevance)**: How well the response addresses the prompt
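
These three scores come from prompting Gemini. As a rough sketch only (the exact prompt wording, model name, and response parsing used by `app.py` are not documented here), scoring with the `google-generativeai` client might look like:

```python
# Illustrative only: the prompt wording and model name are assumptions,
# not necessarily what app.py actually sends to Gemini.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
gemini = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

def creativity_eval(prompt: str, response: str) -> str:
    query = (
        "Rate the following response on creativity, diversity, and relevance "
        "to the prompt, each on a 1-10 scale, and briefly justify the scores.\n\n"
        f"Prompt: {prompt}\n\nResponse: {response}"
    )
    return gemini.generate_content(query).text
```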

### Stability Metrics
- **Stability Score**: Semantic similarity between prompts and responses
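
The README does not name the embedding model, but a stability score of this kind can be sketched with a multilingual sentence-transformers model and cosine similarity (both are assumptions, not necessarily what `app.py` uses):

```python
# Sketch of a semantic-similarity stability score; the library and model
# choice below are assumptions, not necessarily what app.py uses.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def stability_score(prompt: str, response: str) -> float:
    # Embed the prompt and the response, then compare them with cosine similarity.
    emb = embedder.encode([prompt, response], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```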

### Combined Score
- Average of creativity and stability scores

## Output

The evaluation produces:
- CSV files with detailed per-response evaluations for each model
- A `benchmark_results.csv` file with aggregated metrics for all models

## Environment Variables

You can set the `GEMINI_API_KEY` environment variable instead of passing it as an argument.
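
For example:

```bash
export GEMINI_API_KEY=YOUR_API_KEY
python app.py --input_file your_responses.csv
```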