---
title: RuSimulBench Arena
emoji: 📊
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
---

# Model Response Evaluator

This application evaluates model responses on both creativity metrics (scored with Gemini) and stability metrics (based on semantic similarity).

## Features

- Evaluate individual model responses for creativity, diversity, relevance, and stability
- Run batch evaluations on multiple models from a CSV file
- Web interface for easy use
- Command-line interface for scripting and automation
- Combined scoring that balances creativity and stability

## Installation

1. Clone this repository.
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Get a Gemini API key from Google AI Studio (https://makersuite.google.com/).
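
To confirm the key works before running a full evaluation, you can send a single request with the `google-generativeai` client. This is only a minimal sketch under the assumption that the Space calls the Gemini API through that package; the model name below is an example, not necessarily what `app.py` uses.

```python
import os

import google.generativeai as genai  # assumed client; install with `pip install google-generativeai`

# Read the key from the environment (see "Environment Variables" below).
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# The model name is illustrative; app.py may use a different Gemini model.
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Ответь одним предложением: что такое креативность?")
print(response.text)
```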

## Usage

### Web Interface

```bash
python app.py --web
```

This will start a Gradio web interface where you can:

- Evaluate single responses
- Upload CSV files for batch evaluation
- View evaluation results

### Command Line

For batch evaluation of models from a CSV file:

```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv
```

Optional arguments (see the example after this list):

- `--models`: Comma-separated list of model names to evaluate (e.g., "gpt-4,claude-3")
- `--prompt_col`: Column name containing prompts (default: "rus_prompt")
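
For example, to score only two models and set the prompt column explicitly (the model names and file name are placeholders):

```bash
python app.py \
  --gemini_api_key YOUR_API_KEY \
  --input_file your_responses.csv \
  --models "gpt-4,claude-3" \
  --prompt_col rus_prompt
```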

## CSV Format

Your CSV file should have these columns:

- A prompt column (default: "rus_prompt")
- One or more response columns with names ending in "_answers" (e.g., "gpt4_answers", "claude_answers")
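
An illustrative layout (the response columns and text are placeholders; only the naming pattern matters):

```csv
rus_prompt,gpt4_answers,claude_answers
"Напишите короткий рассказ о роботе.","Жил-был робот...","Однажды в лаборатории..."
```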

## Evaluation Metrics

### Creativity Metrics

- Креативность (Creativity): Uniqueness and originality of the response
- Разнообразие (Diversity): Use of varied linguistic features
- Релевантность (Relevance): How well the response addresses the prompt

### Stability Metrics

- Stability Score: Semantic similarity between prompts and responses
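
As a rough sketch of the idea only (the embedding model and any scaling used by `app.py` are assumptions here), a semantic-similarity score can be computed as the cosine similarity between sentence embeddings of the prompt and the response:

```python
# Sketch only: app.py's actual embedding model and scoring may differ.
from sentence_transformers import SentenceTransformer, util

# A multilingual model is assumed because the prompts are in Russian.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def stability_score(prompt: str, response: str) -> float:
    """Cosine similarity between the prompt and response embeddings."""
    embeddings = encoder.encode([prompt, response], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

print(stability_score("Напишите рассказ о роботе.", "Жил-был робот, который мечтал рисовать."))
```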

### Combined Score

- Average of the creativity and stability scores
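
In other words (assuming both scores are already on a common scale, which the README implies but does not spell out):

```python
def combined_score(creativity: float, stability: float) -> float:
    # Simple mean of the two scores; assumes they share the same scale (e.g. 0-1).
    return (creativity + stability) / 2
```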

## Output

The evaluation produces:

- CSV files with detailed per-response evaluations for each model
- A `benchmark_results.csv` file with aggregated metrics for all models
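
To inspect the aggregated table after a batch run (the column layout is whatever `app.py` writes, so no columns are assumed here):

```python
import pandas as pd

# Load the aggregated benchmark table produced by a batch run.
results = pd.read_csv("benchmark_results.csv")
print(results.head())
```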

## Environment Variables

You can set the `GEMINI_API_KEY` environment variable instead of passing the key with `--gemini_api_key`.
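
For example (the key value and file name are placeholders):

```bash
export GEMINI_API_KEY="your-api-key"
python app.py --input_file your_responses.csv
```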