---
title: SE-Arena
emoji: 🛠️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
hf_oauth: true
pinned: false
short_description: The chatbot arena for software engineering
---
# SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering
Welcome to SE Arena, an open-source platform designed for evaluating software engineering-focused foundation models (FMs), particularly large language models (LLMs). SE Arena benchmarks models in iterative, context-rich workflows that are characteristic of software engineering (SE) tasks.
## Key Features

- Multi-Round Conversational Workflows: Evaluate models through extended, context-dependent interactions that mirror real-world SE processes.
- RepoChat Integration: Automatically inject repository context (issues, commits, PRs) into conversations for more realistic evaluations.
- Advanced Evaluation Metrics: Assess models using a comprehensive suite of metrics (a minimal computation sketch follows this list), including:
  - Traditional metrics: Elo score and average win rate
  - Network-based metrics: Eigenvector centrality, PageRank score
  - Community detection: Newman modularity score
  - Consistency score: Quantify model determinism and reliability through self-play matches
- Transparent, Open-Source Leaderboard: View real-time model rankings across diverse SE workflows with full transparency.
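To make two of these metrics concrete, here is a minimal sketch of how an Elo rating and a PageRank score could be derived from pairwise vote outcomes. It is an illustration, not SE Arena's actual scoring pipeline; the vote format, model names, and K-factor are assumptions.

```python
# Minimal sketch (not SE Arena's real pipeline): derive an Elo rating and a
# PageRank score from hypothetical pairwise vote outcomes.
import networkx as nx

votes = [("model_a", "model_b"), ("model_c", "model_a"), ("model_a", "model_c")]  # (winner, loser)

# Elo: every model starts at 1000; each vote shifts ratings by at most K points.
K = 32
elo = {}
for winner, loser in votes:
    r_w, r_l = elo.setdefault(winner, 1000.0), elo.setdefault(loser, 1000.0)
    expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))
    elo[winner] = r_w + K * (1.0 - expected_w)
    elo[loser] = r_l - K * (1.0 - expected_w)

# Network view: an edge from loser to winner lets "strength" flow toward models
# that beat strong opponents; PageRank then ranks the models on that graph.
G = nx.DiGraph()
G.add_edges_from((loser, winner) for winner, loser in votes)
pagerank = nx.pagerank(G)

print("Elo:", {m: round(r, 1) for m, r in elo.items()})
print("PageRank:", {m: round(s, 3) for m, s in pagerank.items()})
```

Eigenvector centrality and Newman modularity can be computed on the same win graph with `networkx` as well; the leaderboard combines several such views rather than relying on Elo alone.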
## Why SE Arena?
Existing evaluation frameworks (like Chatbot Arena, WebDev Arena, and Copilot Arena) often don't address the complex, iterative nature of SE tasks. SE Arena fills critical gaps by:
- Supporting context-rich, multi-turn evaluations to capture iterative workflows
- Integrating repository-level context through RepoChat to simulate real-world development scenarios
- Providing multidimensional metrics for nuanced model comparisons
- Focusing on the full breadth of SE tasks beyond just code generation
## How It Works

1. Submit a Prompt: Sign in and input your SE-related task (optional: include a repository URL for RepoChat context)
2. Compare Responses: Two anonymous models respond to your query
3. Continue the Conversation: Test contextual understanding over multiple rounds
4. Vote: Choose the better model at any point, with the ability to re-assess after multiple turns (a rough sketch of this flow follows these steps)
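As a rough illustration of that flow, a two-model battle might track the anonymous pair, the shared multi-round history, and the eventual vote roughly as below. All class names, fields, and model identifiers are assumptions for the sketch, not the data model used in `app.py`.

```python
# Illustrative sketch only: one possible representation of an anonymous
# two-model battle and its vote. Names and fields are assumptions.
from dataclasses import dataclass, field
import random

@dataclass
class Battle:
    model_a: str
    model_b: str
    turns: list = field(default_factory=list)  # [(prompt, response_a, response_b), ...]
    winner: str | None = None                  # "model_a", "model_b", "tie", or None

    def add_turn(self, prompt: str, response_a: str, response_b: str) -> None:
        """Append one round; both models see the same growing context."""
        self.turns.append((prompt, response_a, response_b))

    def vote(self, choice: str) -> None:
        """Record the user's judgment; model identities stay hidden until after voting."""
        self.winner = choice

# Usage: two models are sampled anonymously, then revealed only after the vote.
battle = Battle(*random.sample(["model-x", "model-y", "model-z"], 2))
battle.add_turn("Refactor this function...", "<response A>", "<response B>")
battle.vote("model_a")
```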
## Getting Started

### Prerequisites

- A Hugging Face account
- Basic understanding of software engineering workflows
### Usage

1. Navigate to the SE Arena platform
2. Sign in with your Hugging Face account
3. Enter your SE task prompt (optionally include a repository URL for RepoChat; a sketch of how such repository context could be gathered follows these steps)
4. Engage in multi-round interactions and vote on model performance
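For a sense of what RepoChat-style context injection might look like, the sketch below pulls recent issues, commits, and pull requests for a GitHub-hosted repository and formats them for prepending to a prompt. The GitHub REST endpoints shown are real public API routes, but the helper function, its parameters, and the formatting are illustrative assumptions, not RepoChat's actual implementation.

```python
# Illustrative sketch: gather lightweight repository context from GitHub's
# public REST API. This helper is an assumption, not RepoChat itself.
import requests

def fetch_repo_context(owner: str, repo: str, limit: int = 5) -> str:
    base = f"https://api.github.com/repos/{owner}/{repo}"
    issues = requests.get(f"{base}/issues", params={"per_page": limit}).json()
    commits = requests.get(f"{base}/commits", params={"per_page": limit}).json()
    pulls = requests.get(f"{base}/pulls", params={"per_page": limit, "state": "all"}).json()

    lines = [f"Repository: {owner}/{repo}"]
    lines += [f"Issue: {i.get('title', '')}" for i in issues]
    lines += [f"Commit: {c['commit']['message'].splitlines()[0]}" for c in commits]
    lines += [f"PR: {p.get('title', '')}" for p in pulls]
    return "\n".join(lines)

# Usage: prepend the gathered context to the user's SE task prompt.
context = fetch_repo_context("gradio-app", "gradio")
prompt = context + "\n\nTask: summarize the most active areas of this repository."
```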
## Contributing
We welcome contributions from the community! Here's how you can help:
- Submit SE Tasks: Share your real-world SE problems to enrich our evaluation dataset
- Report Issues: Found a bug or have a feature request? Open an issue in this repository
- Enhance the Codebase: Fork the repository, make your changes, and submit a pull request
## Privacy Policy
Your interactions are anonymized and used solely for improving SE Arena and FM benchmarking. By using SE Arena, you agree to our Terms of Service.
## Future Plans
- Analysis of Real-World SE Workloads: Identify common patterns and challenges in user-submitted tasks
- Multi-Round Evaluation Metrics: Develop specialized metrics for assessing model adaptation over successive turns
- Enhanced Community Engagement: Enable broader participation through voting and contributions
- Expanded FM Coverage: Include domain-specific and multimodal foundation models
- Advanced Context Compression: Integrate techniques such as LongRoPE and SelfExtend to manage long-term memory
## Contact
For inquiries or feedback, please open an issue in this repository. We welcome your contributions and suggestions!