---
title: RuSimulBench Arena
emoji: 📊
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
---
# Model Response Evaluator
This application evaluates model responses based on both creativity metrics (using Gemini) and stability metrics (using semantic similarity).
## Features
- Evaluate individual model responses for creativity, diversity, relevance, and stability
- Run batch evaluations on multiple models from a CSV file
- Web interface for easy use
- Command-line interface for scripting and automation
- Combined scoring that balances creativity and stability
## Installation
- Clone this repository
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Get a Gemini API key from Google AI Studio (https://makersuite.google.com/); a quick check that the key works is shown below
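A quick way to confirm the key works, assuming the google-generativeai client (requirements.txt may pin a different Gemini client, and the model name here is only an example):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # or read the key from the GEMINI_API_KEY environment variable
# Hypothetical smoke test: any short generation confirms the key is accepted.
print(genai.GenerativeModel("gemini-1.5-flash").generate_content("ping").text)
```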
## Usage
### Web Interface
```bash
python app.py --web
```
This will start a Gradio web interface where you can:
- Evaluate single responses
- Upload CSV files for batch evaluation
- View evaluation results
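For orientation, a minimal Gradio interface of this kind could look like the sketch below; it illustrates the pattern only and is not the actual app.py:

```python
import gradio as gr

def evaluate_single(prompt: str, response: str) -> str:
    # Placeholder: in the real app this is where creativity (Gemini) and
    # stability (semantic similarity) scores would be computed.
    return f"Received a {len(response.split())}-word response for '{prompt[:40]}...'"

demo = gr.Interface(
    fn=evaluate_single,
    inputs=[gr.Textbox(label="Prompt"), gr.Textbox(label="Model response")],
    outputs=gr.Textbox(label="Evaluation"),
    title="Model Response Evaluator",
)
demo.launch()
```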
### Command Line
For batch evaluation of models from a CSV file:
```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv
```
Optional arguments (a parsing sketch follows the list):
- `--models`: Comma-separated list of model names to evaluate (e.g., "gpt-4,claude-3")
- `--prompt_col`: Column name containing prompts (default: "rus_prompt")
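A rough argparse sketch of how these flags fit together; the flag names come from this README, while the internals of app.py may differ:

```python
import argparse

parser = argparse.ArgumentParser(description="Evaluate model responses for creativity and stability")
parser.add_argument("--web", action="store_true", help="launch the Gradio web interface")
parser.add_argument("--gemini_api_key", help="Gemini API key (or set GEMINI_API_KEY)")
parser.add_argument("--input_file", help="CSV file with prompts and model responses")
parser.add_argument("--models", help='comma-separated model names, e.g. "gpt-4,claude-3"')
parser.add_argument("--prompt_col", default="rus_prompt", help="column containing prompts")
args = parser.parse_args()

# Only evaluate the requested models; None means "all *_answers columns found in the CSV".
models = args.models.split(",") if args.models else None
```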
## CSV Format
Your CSV file should have these columns:
- A prompt column (default: "rus_prompt")
- One or more response columns with names ending in "_answers" (e.g., "gpt4_answers", "claude_answers")
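For example, a small pandas check that a file matches this layout (the column names are just the defaults described above):

```python
import pandas as pd

df = pd.read_csv("your_responses.csv")

prompt_col = "rus_prompt"  # default prompt column
answer_cols = [c for c in df.columns if c.endswith("_answers")]  # e.g. gpt4_answers, claude_answers

assert prompt_col in df.columns, f"missing prompt column '{prompt_col}'"
assert answer_cols, "no '*_answers' response columns found"
print(f"{len(df)} prompts, models: {[c.removesuffix('_answers') for c in answer_cols]}")
```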
## Evaluation Metrics
### Creativity Metrics
- Креативность (Creativity): Uniqueness and originality of the response
- Разнообразие (Diversity): Use of varied linguistic features
- Релевантность (Relevance): How well the response addresses the prompt
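The exact scoring prompt is not documented here; a hedged sketch of how Gemini could be asked to rate these three criteria might look like this:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model; app.py may use a different one

def rate_creativity(prompt: str, response: str) -> str:
    """Ask Gemini for 0-10 ratings of creativity, diversity and relevance (illustrative only)."""
    instruction = (
        "Rate the response to the prompt on three criteria, each from 0 to 10: "
        "creativity (uniqueness and originality), diversity (varied language), "
        "and relevance (how well it addresses the prompt). "
        "Answer as 'creativity=X, diversity=Y, relevance=Z'.\n\n"
        f"Prompt: {prompt}\n\nResponse: {response}"
    )
    return model.generate_content(instruction).text
```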
### Stability Metrics
- Stability Score: Semantic similarity between prompts and responses
### Combined Score
- Average of creativity and stability scores
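The README does not name the embedding model used for semantic similarity; a sketch of both scores, assuming sentence-transformers with a multilingual model, could be:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model choice

def stability_score(prompt: str, response: str) -> float:
    """Cosine similarity between the prompt and response embeddings."""
    emb = embedder.encode([prompt, response], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

def combined_score(creativity: float, stability: float) -> float:
    """Simple average; assumes both scores are already normalized to the same range."""
    return (creativity + stability) / 2
```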
## Output
The evaluation produces:
- CSV files with detailed per-response evaluations for each model
- A `benchmark_results.csv` file with aggregated metrics for all models
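To inspect the aggregated results afterwards (the exact column layout depends on the run):

```python
import pandas as pd

results = pd.read_csv("benchmark_results.csv")
print(results.to_string(index=False))  # one row of aggregated metrics per evaluated model
```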
## Environment Variables
You can set the `GEMINI_API_KEY` environment variable instead of passing the key as an argument.
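For example, a minimal check for the variable; how app.py actually resolves the key (argument vs. environment) may differ:

```python
import os

api_key = os.environ.get("GEMINI_API_KEY")
if not api_key:
    raise SystemExit("GEMINI_API_KEY is not set; either export it or pass --gemini_api_key.")
```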