Glicko-2 Ranking System Implementation

Overview

The Glicko-2 ranking system is used in this project to rank devices based on their performance in benchmark tests, specifically measuring token generation speed (tokens/second) and prompt processing speed (tokens/second). This document explains both the theoretical foundations of Glicko-2 and its specific implementation in our system.

Glicko-2 Theory

Glicko-2 is an improvement over the original Glicko system, which itself was an improvement over the Elo rating system. It was developed by Mark Glickman and is particularly well suited to situations where:

  1. Devices have different numbers of benchmark runs
  2. There's uncertainty about a device's true performance capabilities
  3. Performance metrics need to be compared across different model sizes and configurations

Key Components

  1. Rating (μ): A numerical value representing a device's relative performance level (higher is better)
  2. Rating Deviation (RD): The uncertainty in the performance rating
  3. Volatility (σ): A measure of how consistent a device's performance is across different benchmarks

Rating System Parameters

  • Initial Rating: 1500 (standard starting point on the Glicko-2 scale)
  • Initial RD: 350 (high uncertainty for new devices)
  • Volatility: 0.06 (controls how quickly performance ratings can change)
  • Tau: 0.5 (system constant that limits the change in volatility)

Note: The rating numbers themselves are on a relative scale and don't directly correspond to tokens/second. Instead, they represent relative performance levels where higher numbers indicate better performance. The actual token generation and prompt processing speeds (in tokens/second) are used to determine the relative performance outcomes that update these ratings.
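
As a rough sketch, these parameters might be grouped in one place in code. The dictionary name below is illustrative, not the project's actual configuration:

# Illustrative defaults; values match the parameter list above
GLICKO2_DEFAULTS = {
    "rating": 1500,      # standard Glicko-2 starting point
    "rd": 350,           # high initial uncertainty for new devices
    "volatility": 0.06,  # expected degree of rating fluctuation
    "tau": 0.5,          # system constant limiting volatility changes
}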

Implementation Details

Data Preparation

Before applying Glicko-2, we preprocess the benchmark data:

  1. Filter out emulators, and exclude iOS devices with insufficient GPU layers so that results are consistent across iOS devices
  2. Normalize scores within each model group to account for differences in model difficulty (a preprocessing sketch follows the example below)
  3. Convert continuous performance metrics into relative comparisons:
    • For each pair of devices running the same model, we compare their token generation and prompt processing speeds
    • If a device is faster in both metrics, it "wins" the comparison (outcome = 1)
    • If a device is slower in both metrics, it "loses" the comparison (outcome = 0)
    • If one device is faster in one metric but slower in the other, it's considered a "draw" (outcome = 0.5)
    • This conversion is necessary because Glicko-2 works with discrete outcomes (win/loss/draw) rather than continuous performance values

For example, if:

  • Device A: Token Generation = 50 tokens/sec, Prompt Processing = 30 tokens/sec
  • Device B: Token Generation = 45 tokens/sec, Prompt Processing = 25 tokens/sec

Then Device A "wins" this comparison because it's faster in both metrics. This relative outcome (1 for Device A, 0 for Device B) is what's used to update the Glicko-2 ratings.
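
The filtering and normalization steps from the data preparation list might look roughly like the following sketch. The column names "Is Emulator", "Platform", and "GPU Layers", the threshold value, and the min-max normalization are assumptions for illustration; only "Model ID", "Token Generation", and "Prompt Processing" appear elsewhere in this document:

# Hypothetical preprocessing sketch; "Is Emulator", "Platform", "GPU Layers",
# the threshold, and min-max normalization are illustrative assumptions.
MIN_IOS_GPU_LAYERS = 1

df = df[~df["Is Emulator"]]
df = df[~((df["Platform"] == "iOS") & (df["GPU Layers"] < MIN_IOS_GPU_LAYERS))]

# Normalize speeds within each model group so different model sizes are comparable
for col in ["Token Generation", "Prompt Processing"]:
    grouped = df.groupby("Model ID")[col]
    df[f"{col} (normalized)"] = (df[col] - grouped.transform("min")) / (
        grouped.transform("max") - grouped.transform("min")
    )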

Match Processing

For each model, we compare devices pairwise based on their token generation and prompt processing speeds:

# Example of match processing: pairwise comparisons within each model group
from itertools import combinations

for model, group in df.groupby("Model ID"):
    devices = group["Normalized Device ID"].unique()
    for device1, device2 in combinations(devices, 2):
        row1 = group[group["Normalized Device ID"] == device1].iloc[0]
        row2 = group[group["Normalized Device ID"] == device2].iloc[0]

        # Compare performance metrics
        token_speed1, prompt_speed1 = row1["Token Generation"], row1["Prompt Processing"]
        token_speed2, prompt_speed2 = row2["Token Generation"], row2["Prompt Processing"]

        # Determine performance outcome from device1's perspective
        if token_speed1 > token_speed2 and prompt_speed1 > prompt_speed2:
            outcome = 1    # device1 performs better on both metrics
        elif token_speed1 < token_speed2 and prompt_speed1 < prompt_speed2:
            outcome = 0    # device2 performs better on both metrics
        else:
            outcome = 0.5  # mixed performance counts as a draw

Rating Updates

The Glicko-2 system updates performance ratings after each benchmark comparison:

  1. Calculate Expected Performance:

    import math

    def expected_performance(rating1, rating2, rd2):
        # Expected score of device1 against device2; g(RD) discounts the
        # rating gap by the opponent's uncertainty (RD of device2).
        q = math.log(10) / 400
        g_rd = 1 / math.sqrt(1 + 3 * q**2 * rd2**2 / math.pi**2)
        return 1 / (1 + 10**(-g_rd * (rating1 - rating2) / 400))
    
  2. Update Performance Rating and RD:

    def update_performance(rating, rd, rd_opponent, outcome, expected):
        # Update one device's rating and RD after a single comparison.
        q = math.log(10) / 400
        g_rd = 1 / math.sqrt(1 + 3 * q**2 * rd_opponent**2 / math.pi**2)
        d_squared = 1 / (q**2 * g_rd**2 * expected * (1 - expected))
        new_rd = math.sqrt(1 / (1 / rd**2 + 1 / d_squared))
        new_rating = rating + q / (1 / rd**2 + 1 / d_squared) * g_rd * (outcome - expected)
        return new_rating, new_rd
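
Putting the two functions together, one benchmark comparison updates both devices. This is a minimal sketch assuming the helper functions above and an illustrative in-memory ratings dictionary; the project's actual bookkeeping may differ:

# Illustrative state: rating and RD per device, starting at the defaults
ratings = {
    "Device A": {"rating": 1500, "rd": 350},
    "Device B": {"rating": 1500, "rd": 350},
}

def apply_comparison(device1, device2, outcome):
    # Update both sides of one comparison (outcome is from device1's perspective);
    # pre-update values are captured so both updates use the same inputs.
    r1, r2 = ratings[device1], ratings[device2]
    old1, old2 = (r1["rating"], r1["rd"]), (r2["rating"], r2["rd"])
    e1 = expected_performance(old1[0], old2[0], old2[1])
    e2 = expected_performance(old2[0], old1[0], old1[1])
    r1["rating"], r1["rd"] = update_performance(old1[0], old1[1], old2[1], outcome, e1)
    r2["rating"], r2["rd"] = update_performance(old2[0], old2[1], old1[1], 1 - outcome, e2)

apply_comparison("Device A", "Device B", 1)  # Device A won the example comparison above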
    

Confidence Thresholds

We implement several confidence thresholds:

  1. Minimum Benchmarks: Devices must have at least 5 benchmark runs to be included in confident rankings
  2. Performance Deviation: Devices with RD > 100 (on the relative rating scale, not tokens/second) are considered less reliable
  3. Performance Consistency: High volatility indicates inconsistent performance across benchmarks
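
A sketch of how these thresholds could be applied to a per-device summary table; the summary DataFrame and its "Benchmarks" and "RD" column names are assumptions:

MIN_BENCHMARKS = 5   # minimum benchmark runs for a confident ranking
MAX_RD = 100         # rating deviation above which a rating is treated as unreliable

confident = summary[(summary["Benchmarks"] >= MIN_BENCHMARKS) & (summary["RD"] <= MAX_RD)]
provisional = summary.drop(confident.index)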

Practical Considerations

Handling Sparse Data

The system is designed to handle sparse benchmark data by:

  1. Using conservative initial performance ratings for new devices
  2. Increasing RD for devices with few benchmark runs
  3. Implementing a minimum benchmark threshold
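
The effect of the minimum benchmark threshold can be seen by repeatedly applying the update functions defined earlier: a new device's RD starts at 350 and only shrinks as comparisons accumulate, so devices with few runs remain marked as uncertain. The numbers below are purely illustrative:

# Illustrative only: a new device repeatedly drawing against a well-established one
rating, rd = 1500, 350
opponent_rating, opponent_rd = 1500, 50

for run in range(1, 6):
    expected = expected_performance(rating, opponent_rating, opponent_rd)
    rating, rd = update_performance(rating, rd, opponent_rd, 0.5, expected)
    print(f"after {run} comparison(s): rating={rating:.1f}, RD={rd:.1f}")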

Performance Metrics

We track several performance metrics:

  • Combined performance rating (overall relative performance)
  • Token generation rating (driven by token generation speed in tokens/second)
  • Prompt processing rating (driven by prompt processing speed in tokens/second)
  • Rating deviation (uncertainty in the rating, on the same relative scale)
  • Number of benchmark runs
  • Performance comparison statistics
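
One way to hold these per-device metrics is a small record type; the field names below are illustrative rather than the project's actual schema:

from dataclasses import dataclass

@dataclass
class DeviceStats:
    # Hypothetical per-device summary record
    device_id: str
    combined_rating: float           # overall relative performance
    token_generation_rating: float   # rating driven by token generation speed
    prompt_processing_rating: float  # rating driven by prompt processing speed
    rating_deviation: float          # uncertainty in the combined rating
    benchmark_count: int             # number of benchmark runs
    wins: int                        # pairwise comparison statistics
    losses: int
    draws: int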

Visualization

The system provides:

  1. Overall performance rankings with confidence intervals
  2. Platform-specific performance statistics
  3. Head-to-head performance comparison tools
  4. Performance trend analysis across different model sizes
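
For the rankings with confidence intervals, a common Glicko convention is to report roughly rating ± 2 × RD as an approximate 95% interval. A minimal sketch, assuming the hypothetical DeviceStats record from the previous section:

def confidence_interval(stats: DeviceStats) -> tuple[float, float]:
    # Approximate 95% interval using the usual Glicko convention of +/- 2 x RD
    return (stats.combined_rating - 2 * stats.rating_deviation,
            stats.combined_rating + 2 * stats.rating_deviation)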

Advantages Over Other Systems

  1. Better Handling of Performance Uncertainty: Explicit modeling of performance measurement uncertainty
  2. More Accurate with Fewer Benchmarks: Can provide meaningful performance ratings with limited data
  3. Dynamic Performance Updates: Volatility parameter allows for appropriate rating changes
  4. Transparent Confidence: Performance deviations provide clear confidence measures

Limitations

  1. Computational Complexity: More complex than Elo, requiring more calculations
  2. Parameter Sensitivity: Results can be sensitive to system parameters
  3. Continuous Metrics: Requires conversion of continuous performance metrics (tokens/second) to relative comparisons

References

  1. Glickman, M. E. (2001). "The Glicko-2 Rating System"
  2. Glickman, M. E. (1999). "Parameter estimation in large dynamic paired comparison experiments"
  3. Glickman, M. E. (2001). "Dynamic paired comparison models with stochastic variances"