---
title: RuSimulBench Arena
emoji: 📊
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
---

# Model Response Evaluator

This application evaluates model responses on both creativity metrics (scored with Gemini) and stability metrics (based on semantic similarity).

## Features

- Evaluate individual model responses for creativity, diversity, relevance, and stability
- Run batch evaluations on multiple models from a CSV file
- Web interface for easy use
- Command-line interface for scripting and automation
- Combined scoring that balances creativity and stability

## Installation

1. Clone this repository.
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Get a Gemini API key from Google AI Studio (https://makersuite.google.com/).
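
To confirm the key works before running a full evaluation, you can send a single request with the `google-generativeai` client. This is only a minimal sketch under the assumption that the Space calls the Gemini API through that package; the model name below is an example, not necessarily what `app.py` uses.

```python
import os

import google.generativeai as genai  # assumed client; install with `pip install google-generativeai`

# Read the key from the environment (see "Environment Variables" below).
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# The model name is illustrative; app.py may use a different Gemini model.
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Ответь одним предложением: что такое креативность?")
print(response.text)
```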

## Usage

### Web Interface

```bash
python app.py --web
```

This will start a Gradio web interface where you can:

- Evaluate single responses
- Upload CSV files for batch evaluation
- View evaluation results

### Command Line

For batch evaluation of models from a CSV file:

```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv
```

Optional arguments (see the example after this list):

- `--models`: Comma-separated list of model names to evaluate (e.g., "gpt-4,claude-3")
- `--prompt_col`: Column name containing prompts (default: "rus_prompt")
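
For example, to score only two models and set the prompt column explicitly (the model names and file name are placeholders):

```bash
python app.py \
  --gemini_api_key YOUR_API_KEY \
  --input_file your_responses.csv \
  --models "gpt-4,claude-3" \
  --prompt_col rus_prompt
```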

## CSV Format

Your CSV file should have these columns:

- A prompt column (default: "rus_prompt")
- One or more response columns with names ending in "_answers" (e.g., "gpt4_answers", "claude_answers")
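
An illustrative layout (the response columns and text are placeholders; only the naming pattern matters):

```csv
rus_prompt,gpt4_answers,claude_answers
"Напишите короткий рассказ о роботе.","Жил-был робот...","Однажды в лаборатории..."
```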

## Evaluation Metrics

### Creativity Metrics

- Креативность (Creativity): Uniqueness and originality of the response
- Разнообразие (Diversity): Use of varied linguistic features
- Релевантность (Relevance): How well the response addresses the prompt

### Stability Metrics

- Stability Score: Semantic similarity between prompts and responses
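
As a rough sketch of the idea only (the embedding model and any scaling used by `app.py` are assumptions here), a semantic-similarity score can be computed as the cosine similarity between sentence embeddings of the prompt and the response:

```python
# Sketch only: app.py's actual embedding model and scoring may differ.
from sentence_transformers import SentenceTransformer, util

# A multilingual model is assumed because the prompts are in Russian.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def stability_score(prompt: str, response: str) -> float:
    """Cosine similarity between the prompt and response embeddings."""
    embeddings = encoder.encode([prompt, response], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

print(stability_score("Напишите рассказ о роботе.", "Жил-был робот, который мечтал рисовать."))
```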

### Combined Score

- Average of the creativity and stability scores
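
In other words (assuming both scores are already on a common scale, which the README implies but does not spell out):

```python
def combined_score(creativity: float, stability: float) -> float:
    # Simple mean of the two scores; assumes they share the same scale (e.g. 0-1).
    return (creativity + stability) / 2
```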

## Output

The evaluation produces:

- CSV files with detailed per-response evaluations for each model
- A `benchmark_results.csv` file with aggregated metrics for all models
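
To inspect the aggregated table after a batch run (the column layout is whatever `app.py` writes, so no columns are assumed here):

```python
import pandas as pd

# Load the aggregated benchmark table produced by a batch run.
results = pd.read_csv("benchmark_results.csv")
print(results.head())
```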

## Environment Variables

You can set the `GEMINI_API_KEY` environment variable instead of passing the key with `--gemini_api_key`.
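
For example (the key value and file name are placeholders):

```bash
export GEMINI_API_KEY="your-api-key"
python app.py --input_file your_responses.csv
```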