---
title: SE-Arena
emoji: 🛠️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
hf_oauth: true
pinned: false
short_description: The chatbot arena for software engineering
---
# SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering
Welcome to SE Arena, an open-source platform designed for evaluating software engineering-focused foundation models (FMs), particularly large language models (LLMs). SE Arena benchmarks models in iterative, context-rich workflows that are characteristic of software engineering (SE) tasks.
## Key Features

- Multi-Round Conversational Workflows: Evaluate models through extended, context-dependent interactions that mirror real-world SE processes.
- RepoChat Integration: Automatically inject repository context (issues, commits, PRs) into conversations for more realistic evaluations.
- Advanced Evaluation Metrics: Assess models using a comprehensive suite of metrics (a minimal computation sketch follows this list), including:
  - Traditional metrics: Elo score and average win rate
  - Network-based metrics: Eigenvector centrality, PageRank score
  - Community detection: Newman modularity score
  - Consistency score: Quantify model determinism and reliability through self-play matches
- Transparent, Open-Source Leaderboard: View real-time model rankings across diverse SE workflows with full transparency.
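To make two of these metrics concrete, here is a minimal sketch of how an Elo rating and a PageRank score could be derived from pairwise vote outcomes. It is an illustration, not SE Arena's actual scoring pipeline; the vote format, model names, and K-factor are assumptions.

```python
# Minimal sketch (not SE Arena's real pipeline): derive an Elo rating and a
# PageRank score from hypothetical pairwise vote outcomes.
import networkx as nx

votes = [("model_a", "model_b"), ("model_c", "model_a"), ("model_a", "model_c")]  # (winner, loser)

# Elo: every model starts at 1000; each vote shifts ratings by at most K points.
K = 32
elo = {}
for winner, loser in votes:
    r_w, r_l = elo.setdefault(winner, 1000.0), elo.setdefault(loser, 1000.0)
    expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))
    elo[winner] = r_w + K * (1.0 - expected_w)
    elo[loser] = r_l - K * (1.0 - expected_w)

# Network view: an edge from loser to winner lets "strength" flow toward models
# that beat strong opponents; PageRank then ranks the models on that graph.
G = nx.DiGraph()
G.add_edges_from((loser, winner) for winner, loser in votes)
pagerank = nx.pagerank(G)

print("Elo:", {m: round(r, 1) for m, r in elo.items()})
print("PageRank:", {m: round(s, 3) for m, s in pagerank.items()})
```

Eigenvector centrality and Newman modularity can be computed on the same win graph with `networkx` as well; the leaderboard combines several such views rather than relying on Elo alone.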
## Why SE Arena?
Existing evaluation frameworks (like Chatbot Arena, WebDev Arena, and Copilot Arena) often don't address the complex, iterative nature of SE tasks. SE Arena fills critical gaps by:
- Supporting context-rich, multi-turn evaluations to capture iterative workflows
- Integrating repository-level context through RepoChat to simulate real-world development scenarios
- Providing multidimensional metrics for nuanced model comparisons
- Focusing on the full breadth of SE tasks beyond just code generation
## How It Works

1. Submit a Prompt: Sign in and input your SE-related task (optional: include a repository URL for RepoChat context)
2. Compare Responses: Two anonymous models respond to your query
3. Continue the Conversation: Test contextual understanding over multiple rounds
4. Vote: Choose the better model at any point, with the ability to re-assess after multiple turns (a rough sketch of this flow follows these steps)
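As a rough illustration of that flow, a two-model battle might track the anonymous pair, the shared multi-round history, and the eventual vote roughly as below. All class names, fields, and model identifiers are assumptions for the sketch, not the data model used in `app.py`.

```python
# Illustrative sketch only: one possible representation of an anonymous
# two-model battle and its vote. Names and fields are assumptions.
from dataclasses import dataclass, field
import random

@dataclass
class Battle:
    model_a: str
    model_b: str
    turns: list = field(default_factory=list)  # [(prompt, response_a, response_b), ...]
    winner: str | None = None                  # "model_a", "model_b", "tie", or None

    def add_turn(self, prompt: str, response_a: str, response_b: str) -> None:
        """Append one round; both models see the same growing context."""
        self.turns.append((prompt, response_a, response_b))

    def vote(self, choice: str) -> None:
        """Record the user's judgment; model identities stay hidden until after voting."""
        self.winner = choice

# Usage: two models are sampled anonymously, then revealed only after the vote.
battle = Battle(*random.sample(["model-x", "model-y", "model-z"], 2))
battle.add_turn("Refactor this function...", "<response A>", "<response B>")
battle.vote("model_a")
```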
## Getting Started

### Prerequisites

- A Hugging Face account
- Basic understanding of software engineering workflows
### Usage

1. Navigate to the SE Arena platform
2. Sign in with your Hugging Face account
3. Enter your SE task prompt (optionally include a repository URL for RepoChat; a sketch of how such repository context could be gathered follows these steps)
4. Engage in multi-round interactions and vote on model performance
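For a sense of what RepoChat-style context injection might look like, the sketch below pulls recent issues, commits, and pull requests for a GitHub-hosted repository and formats them for prepending to a prompt. The GitHub REST endpoints shown are real public API routes, but the helper function, its parameters, and the formatting are illustrative assumptions, not RepoChat's actual implementation.

```python
# Illustrative sketch: gather lightweight repository context from GitHub's
# public REST API. This helper is an assumption, not RepoChat itself.
import requests

def fetch_repo_context(owner: str, repo: str, limit: int = 5) -> str:
    base = f"https://api.github.com/repos/{owner}/{repo}"
    issues = requests.get(f"{base}/issues", params={"per_page": limit}).json()
    commits = requests.get(f"{base}/commits", params={"per_page": limit}).json()
    pulls = requests.get(f"{base}/pulls", params={"per_page": limit, "state": "all"}).json()

    lines = [f"Repository: {owner}/{repo}"]
    lines += [f"Issue: {i.get('title', '')}" for i in issues]
    lines += [f"Commit: {c['commit']['message'].splitlines()[0]}" for c in commits]
    lines += [f"PR: {p.get('title', '')}" for p in pulls]
    return "\n".join(lines)

# Usage: prepend the gathered context to the user's SE task prompt.
context = fetch_repo_context("gradio-app", "gradio")
prompt = context + "\n\nTask: summarize the most active areas of this repository."
```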
## Contributing
We welcome contributions from the community! Here's how you can help:
- Submit SE Tasks: Share your real-world SE problems to enrich our evaluation dataset
- Report Issues: Found a bug or have a feature request? Open an issue in this repository
- Enhance the Codebase: Fork the repository, make your changes, and submit a pull request
## Privacy Policy
Your interactions are anonymized and used solely for improving SE Arena and FM benchmarking. By using SE Arena, you agree to our Terms of Service.
## Future Plans
- Analysis of Real-World SE Workloads: Identify common patterns and challenges in user-submitted tasks
- Multi-Round Evaluation Metrics: Develop specialized metrics for assessing model adaptation over successive turns
- Enhanced Community Engagement: Enable broader participation through voting and contributions
- Expanded FM Coverage: Include domain-specific and multimodal foundation models
- Advanced Context Compression: Integrate techniques such as LongRoPE and SelfExtend to manage long-term memory
## Contact
For inquiries or feedback, please open an issue in this repository. We welcome your contributions and suggestions!