Glicko-2 Ranking System Implementation

Overview

The Glicko-2 ranking system is used in this project to rank devices based on their performance in benchmark tests, specifically measuring token generation speed (tokens/second) and prompt processing speed (tokens/second). This document explains both the theoretical foundations of Glicko-2 and its specific implementation in our system.

Glicko-2 Theory

Glicko-2 is an improvement over the original Glicko system, which itself was an improvement over the Elo rating system. It was developed by Mark Glickman and is particularly well suited to situations where:

  1. Devices have different numbers of benchmark runs
  2. There's uncertainty about a device's true performance capabilities
  3. Performance metrics need to be compared across different model sizes and configurations

Key Components

  1. Rating (μ): A numerical value representing a device's relative performance level (higher is better)
  2. Rating Deviation (RD): The uncertainty in the performance rating
  3. Volatility (σ): A measure of how consistent a device's performance is across different benchmarks

Rating System Parameters

  • Initial Rating: 1500 (standard starting point on the Glicko-2 scale)
  • Initial RD: 350 (high uncertainty for new devices)
  • Volatility: 0.06 (controls how quickly performance ratings can change)
  • Tau: 0.5 (system constant that limits the change in volatility)

Note: The rating numbers themselves are on a relative scale and don't directly correspond to tokens/second. Instead, they represent relative performance levels where higher numbers indicate better performance. The actual token generation and prompt processing speeds (in tokens/second) are used to determine the relative performance outcomes that update these ratings.
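
As a rough sketch, these parameters might be grouped in one place in code. The dictionary name below is illustrative, not the project's actual configuration:

# Illustrative defaults; values match the parameter list above
GLICKO2_DEFAULTS = {
    "rating": 1500,      # standard Glicko-2 starting point
    "rd": 350,           # high initial uncertainty for new devices
    "volatility": 0.06,  # expected degree of rating fluctuation
    "tau": 0.5,          # system constant limiting volatility changes
}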

Implementation Details

Data Preparation

Before applying Glicko-2, we preprocess the benchmark data:

  1. Filter out emulators, and exclude iOS devices with insufficient GPU layers so that results are consistent across iOS devices
  2. Normalize scores within each model group to account for differences in model difficulty (a preprocessing sketch follows the example below)
  3. Convert continuous performance metrics into relative comparisons:
    • For each pair of devices running the same model, we compare their token generation and prompt processing speeds
    • If a device is faster in both metrics, it "wins" the comparison (outcome = 1)
    • If a device is slower in both metrics, it "loses" the comparison (outcome = 0)
    • If one device is faster in one metric but slower in the other, it's considered a "draw" (outcome = 0.5)
    • This conversion is necessary because Glicko-2 works with discrete outcomes (win/loss/draw) rather than continuous performance values

For example, if:

  • Device A: Token Generation = 50 tokens/sec, Prompt Processing = 30 tokens/sec
  • Device B: Token Generation = 45 tokens/sec, Prompt Processing = 25 tokens/sec

Then Device A "wins" this comparison because it's faster in both metrics. This relative outcome (1 for Device A, 0 for Device B) is what's used to update the Glicko-2 ratings.
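
The filtering and normalization steps from the data preparation list might look roughly like the following sketch. The column names "Is Emulator", "Platform", and "GPU Layers", the threshold value, and the min-max normalization are assumptions for illustration; only "Model ID", "Token Generation", and "Prompt Processing" appear elsewhere in this document:

# Hypothetical preprocessing sketch; "Is Emulator", "Platform", "GPU Layers",
# the threshold, and min-max normalization are illustrative assumptions.
MIN_IOS_GPU_LAYERS = 1

df = df[~df["Is Emulator"]]
df = df[~((df["Platform"] == "iOS") & (df["GPU Layers"] < MIN_IOS_GPU_LAYERS))]

# Normalize speeds within each model group so different model sizes are comparable
for col in ["Token Generation", "Prompt Processing"]:
    grouped = df.groupby("Model ID")[col]
    df[f"{col} (normalized)"] = (df[col] - grouped.transform("min")) / (
        grouped.transform("max") - grouped.transform("min")
    )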

Match Processing

For each model, we compare devices pairwise based on their token generation and prompt processing speeds:

# Example of match processing: pairwise comparisons within each model group
from itertools import combinations

for model, group in df.groupby("Model ID"):
    devices = group["Normalized Device ID"].unique()
    for device1, device2 in combinations(devices, 2):
        row1 = group[group["Normalized Device ID"] == device1].iloc[0]
        row2 = group[group["Normalized Device ID"] == device2].iloc[0]

        # Compare performance metrics
        token_speed1, prompt_speed1 = row1["Token Generation"], row1["Prompt Processing"]
        token_speed2, prompt_speed2 = row2["Token Generation"], row2["Prompt Processing"]

        # Determine performance outcome from device1's perspective
        if token_speed1 > token_speed2 and prompt_speed1 > prompt_speed2:
            outcome = 1    # device1 performs better on both metrics
        elif token_speed1 < token_speed2 and prompt_speed1 < prompt_speed2:
            outcome = 0    # device2 performs better on both metrics
        else:
            outcome = 0.5  # mixed performance counts as a draw

Rating Updates

The Glicko-2 system updates performance ratings after each benchmark comparison:

  1. Calculate Expected Performance:

    import math

    def expected_performance(rating1, rating2, rd2):
        # Expected score of device1 against device2; g(RD) discounts the
        # rating gap by the opponent's uncertainty (RD of device2).
        q = math.log(10) / 400
        g_rd = 1 / math.sqrt(1 + 3 * q**2 * rd2**2 / math.pi**2)
        return 1 / (1 + 10**(-g_rd * (rating1 - rating2) / 400))
    
  2. Update Performance Rating and RD:

    def update_performance(rating, rd, rd_opponent, outcome, expected):
        # Update one device's rating and RD after a single comparison.
        q = math.log(10) / 400
        g_rd = 1 / math.sqrt(1 + 3 * q**2 * rd_opponent**2 / math.pi**2)
        d_squared = 1 / (q**2 * g_rd**2 * expected * (1 - expected))
        new_rd = math.sqrt(1 / (1 / rd**2 + 1 / d_squared))
        new_rating = rating + q / (1 / rd**2 + 1 / d_squared) * g_rd * (outcome - expected)
        return new_rating, new_rd
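
Putting the two functions together, one benchmark comparison updates both devices. This is a minimal sketch assuming the helper functions above and an illustrative in-memory ratings dictionary; the project's actual bookkeeping may differ:

# Illustrative state: rating and RD per device, starting at the defaults
ratings = {
    "Device A": {"rating": 1500, "rd": 350},
    "Device B": {"rating": 1500, "rd": 350},
}

def apply_comparison(device1, device2, outcome):
    # Update both sides of one comparison (outcome is from device1's perspective);
    # pre-update values are captured so both updates use the same inputs.
    r1, r2 = ratings[device1], ratings[device2]
    old1, old2 = (r1["rating"], r1["rd"]), (r2["rating"], r2["rd"])
    e1 = expected_performance(old1[0], old2[0], old2[1])
    e2 = expected_performance(old2[0], old1[0], old1[1])
    r1["rating"], r1["rd"] = update_performance(old1[0], old1[1], old2[1], outcome, e1)
    r2["rating"], r2["rd"] = update_performance(old2[0], old2[1], old1[1], 1 - outcome, e2)

apply_comparison("Device A", "Device B", 1)  # Device A won the example comparison above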
    

Confidence Thresholds

We implement several confidence thresholds:

  1. Minimum Benchmarks: Devices must have at least 5 benchmark runs to be included in confident rankings
  2. Performance Deviation: Devices with RD > 100 (on the relative rating scale, not tokens/second) are considered less reliable
  3. Performance Consistency: High volatility indicates inconsistent performance across benchmarks
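
A sketch of how these thresholds could be applied to a per-device summary table; the summary DataFrame and its "Benchmarks" and "RD" column names are assumptions:

MIN_BENCHMARKS = 5   # minimum benchmark runs for a confident ranking
MAX_RD = 100         # rating deviation above which a rating is treated as unreliable

confident = summary[(summary["Benchmarks"] >= MIN_BENCHMARKS) & (summary["RD"] <= MAX_RD)]
provisional = summary.drop(confident.index)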

Practical Considerations

Handling Sparse Data

The system is designed to handle sparse benchmark data by:

  1. Using conservative initial performance ratings for new devices
  2. Increasing RD for devices with few benchmark runs
  3. Implementing a minimum benchmark threshold
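
The effect of the minimum benchmark threshold can be seen by repeatedly applying the update functions defined earlier: a new device's RD starts at 350 and only shrinks as comparisons accumulate, so devices with few runs remain marked as uncertain. The numbers below are purely illustrative:

# Illustrative only: a new device repeatedly drawing against a well-established one
rating, rd = 1500, 350
opponent_rating, opponent_rd = 1500, 50

for run in range(1, 6):
    expected = expected_performance(rating, opponent_rating, opponent_rd)
    rating, rd = update_performance(rating, rd, opponent_rd, 0.5, expected)
    print(f"after {run} comparison(s): rating={rating:.1f}, RD={rd:.1f}")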

Performance Metrics

We track several performance metrics:

  • Combined performance rating (overall relative performance)
  • Token generation rating (driven by token generation speed in tokens/second)
  • Prompt processing rating (driven by prompt processing speed in tokens/second)
  • Rating deviation (uncertainty in the rating, on the same relative scale)
  • Number of benchmark runs
  • Performance comparison statistics
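
One way to hold these per-device metrics is a small record type; the field names below are illustrative rather than the project's actual schema:

from dataclasses import dataclass

@dataclass
class DeviceStats:
    # Hypothetical per-device summary record
    device_id: str
    combined_rating: float           # overall relative performance
    token_generation_rating: float   # rating driven by token generation speed
    prompt_processing_rating: float  # rating driven by prompt processing speed
    rating_deviation: float          # uncertainty in the combined rating
    benchmark_count: int             # number of benchmark runs
    wins: int                        # pairwise comparison statistics
    losses: int
    draws: int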

Visualization

The system provides:

  1. Overall performance rankings with confidence intervals
  2. Platform-specific performance statistics
  3. Head-to-head performance comparison tools
  4. Performance trend analysis across different model sizes
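
For the rankings with confidence intervals, a common Glicko convention is to report roughly rating ± 2 × RD as an approximate 95% interval. A minimal sketch, assuming the hypothetical DeviceStats record from the previous section:

def confidence_interval(stats: DeviceStats) -> tuple[float, float]:
    # Approximate 95% interval using the usual Glicko convention of +/- 2 x RD
    return (stats.combined_rating - 2 * stats.rating_deviation,
            stats.combined_rating + 2 * stats.rating_deviation)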

Advantages Over Other Systems

  1. Better Handling of Performance Uncertainty: Explicit modeling of performance measurement uncertainty
  2. More Accurate with Fewer Benchmarks: Can provide meaningful performance ratings with limited data
  3. Dynamic Performance Updates: Volatility parameter allows for appropriate rating changes
  4. Transparent Confidence: Performance deviations provide clear confidence measures

Limitations

  1. Computational Complexity: More complex than Elo, requiring more calculations
  2. Parameter Sensitivity: Results can be sensitive to system parameters
  3. Continuous Metrics: Requires conversion of continuous performance metrics (tokens/second) to relative comparisons

References

  1. Glickman, M. E. (2001). "The Glicko-2 Rating System"
  2. Glickman, M. E. (1999). "Parameter estimation in large dynamic paired comparison experiments"
  3. Glickman, M. E. (2001). "Dynamic paired comparison models with stochastic variances"