MrSimple01 committed (verified) · Commit 3c775cf · 1 Parent(s): bc48928

Update README.md

Files changed (1): README.md (+75 -1)
README.md CHANGED
@@ -8,5 +8,79 @@ sdk_version: 5.21.0
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

The README content added by this commit follows.

# Model Response Evaluator

This application evaluates model responses based on both creativity metrics (using Gemini) and stability metrics (using semantic similarity).

## Features

- Evaluate individual model responses for creativity, diversity, relevance, and stability
- Run batch evaluations on multiple models from a CSV file
- Web interface for easy use
- Command-line interface for scripting and automation
- Combined scoring that balances creativity and stability

## Installation

1. Clone this repository (steps 1 and 2 are combined in the example after this list).
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Get a Gemini API key from Google AI Studio (https://makersuite.google.com/).
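
A minimal way to run steps 1 and 2 in one shell session; the repository URL and directory name are placeholders, since they are not stated in this README:

```bash
# Placeholders: substitute the actual repository URL and directory name.
git clone <repository-url>
cd <repository-directory>
pip install -r requirements.txt
```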

## Usage

### Web Interface

```bash
python app.py --web
```

This will start a Gradio web interface where you can:

- Evaluate single responses
- Upload CSV files for batch evaluation
- View evaluation results
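
Gradio typically serves the interface on a local URL (usually http://127.0.0.1:7860) unless the app configures a different host or port; check the console output of `python app.py --web` for the exact address.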

### Command Line

For batch evaluation of models from a CSV file:

```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv
```

Optional arguments (combined in the example after this list):

- `--models`: Comma-separated list of model names to evaluate (e.g., "gpt-4,claude-3")
- `--prompt_col`: Column name containing prompts (default: "rus_prompt")
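
For example, to evaluate only two specific models and name the prompt column explicitly (the model names and input file below are placeholders taken from the examples above):

```bash
python app.py \
  --gemini_api_key YOUR_API_KEY \
  --input_file your_responses.csv \
  --models "gpt-4,claude-3" \
  --prompt_col "rus_prompt"
```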

## CSV Format

Your CSV file should have these columns (a minimal example follows):

- A prompt column (default: `rus_prompt`)
- One or more response columns with names ending in `_answers` (e.g., `gpt4_answers`, `claude_answers`)
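
A minimal header plus one placeholder row, assuming the default prompt column and two response columns named as in the examples above:

```csv
rus_prompt,gpt4_answers,claude_answers
"<prompt text>","<gpt-4 response>","<claude-3 response>"
```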

## Evaluation Metrics

### Creativity Metrics

- **Креативность (Creativity)**: Uniqueness and originality of the response
- **Разнообразие (Diversity)**: Use of varied linguistic features
- **Релевантность (Relevance)**: How well the response addresses the prompt

### Stability Metrics

- **Stability Score**: Semantic similarity between prompts and responses

### Combined Score

- Average of the creativity and stability scores (see the example below)
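
For instance, assuming the two scores are reported on the same scale and the average is unweighted, a response with a creativity score of 8.0 and a stability score of 6.0 would receive a combined score of (8.0 + 6.0) / 2 = 7.0.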

## Output

The evaluation produces:

- CSV files with detailed per-response evaluations for each model
- A `benchmark_results.csv` file with aggregated metrics for all models

## Environment Variables

You can set the `GEMINI_API_KEY` environment variable instead of passing the key with `--gemini_api_key`.
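
For example, in a POSIX shell (the key value and input file name are placeholders):

```bash
# Export the key for the current shell session, then run the
# evaluator without the --gemini_api_key argument.
export GEMINI_API_KEY="your_api_key_here"
python app.py --input_file your_responses.csv
```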