Glicko-2 Ranking System Implementation
Overview
The Glicko-2 ranking system is used in this project to rank devices based on their performance in benchmark tests, specifically token generation speed and prompt processing speed (both measured in tokens/second). This document explains both the theoretical foundations of Glicko-2 and its specific implementation in our system.
Glicko-2 Theory
Glicko-2 is an improvement over the original Glicko system, which was itself an improvement over the Elo rating system. It was developed by Mark Glickman and is particularly well-suited for situations where:
- Devices have different numbers of benchmark runs
- There's uncertainty about a device's true performance capabilities
- Performance metrics need to be compared across different model sizes and configurations
Key Components
- Rating (μ): A numerical value representing a device's relative performance level (higher is better)
- Rating Deviation (RD): The uncertainty in the performance rating
- Volatility (σ): A measure of how consistent a device's performance is across different benchmarks
Rating System Parameters
- Initial Rating: 1500 (standard starting point on the Glicko-2 scale)
- Initial RD: 350 (high uncertainty for new devices)
- Volatility: 0.06 (controls how quickly performance ratings can change)
- Tau: 0.5 (system constant that limits the change in volatility)
Note: The rating numbers themselves are on a relative scale and don't directly correspond to tokens/second. Instead, they represent relative performance levels where higher numbers indicate better performance. The actual token generation and prompt processing speeds (in tokens/second) are used to determine the relative performance outcomes that update these ratings.
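For concreteness, here is a minimal sketch of how these parameters might be carried as per-device state (the `DeviceRating` dataclass and its field names are illustrative, not the project's actual types):

```python
from dataclasses import dataclass

TAU = 0.5  # system constant limiting how fast volatility can change

@dataclass
class DeviceRating:
    # Glicko-2 state for one device; defaults mirror the parameters above.
    rating: float = 1500.0    # relative performance level
    rd: float = 350.0         # rating deviation (uncertainty)
    volatility: float = 0.06  # consistency across benchmarks
```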
Implementation Details
Data Preparation
Before applying Glicko-2, we preprocess the benchmark data:
- Filter out emulators, and exclude iOS devices running with insufficient GPU layers, so that results are consistent across iOS devices
- Normalize scores within each model group to account for different model difficulties
- Convert continuous performance metrics into relative comparisons:
  - For each pair of devices running the same model, we compare their token generation and prompt processing speeds
  - If a device is faster in both metrics, it "wins" the comparison (outcome = 1)
  - If a device is slower in both metrics, it "loses" the comparison (outcome = 0)
  - If one device is faster in one metric but slower in the other, it's considered a "draw" (outcome = 0.5)
- This conversion is necessary because Glicko-2 works with discrete outcomes (win/loss/draw) rather than continuous performance values
For example, if:
- Device A: Token Generation = 50 tokens/sec, Prompt Processing = 30 tokens/sec
- Device B: Token Generation = 45 tokens/sec, Prompt Processing = 25 tokens/sec
Then Device A "wins" this comparison because it's faster in both metrics. This relative outcome (1 for Device A, 0 for Device B) is what's used to update the Glicko-2 ratings.
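A minimal sketch of this outcome mapping (the function name is illustrative):

```python
def comparison_outcome(tg1, pp1, tg2, pp2):
    """Map two devices' speeds (tokens/sec) to a Glicko-2 outcome for device 1."""
    if tg1 > tg2 and pp1 > pp2:
        return 1.0  # device 1 wins: faster on both metrics
    if tg1 < tg2 and pp1 < pp2:
        return 0.0  # device 1 loses: slower on both metrics
    return 0.5      # mixed result: scored as a draw

# The worked example above: Device A beats Device B on both metrics.
assert comparison_outcome(50, 30, 45, 25) == 1.0
```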
Match Processing
For each model, we compare devices pairwise based on their token generation and prompt processing speeds:
```python
# Example of match processing
for model, group in df.groupby("Model ID"):
    devices = group["Normalized Device ID"].unique()
    for i in range(len(devices)):
        for j in range(i + 1, len(devices)):
            device1 = devices[i]
            device2 = devices[j]

            # Compare performance metrics
            token_speed1 = group[group["Normalized Device ID"] == device1]["Token Generation"].iloc[0]
            token_speed2 = group[group["Normalized Device ID"] == device2]["Token Generation"].iloc[0]
            prompt_speed1 = group[group["Normalized Device ID"] == device1]["Prompt Processing"].iloc[0]
            prompt_speed2 = group[group["Normalized Device ID"] == device2]["Prompt Processing"].iloc[0]

            # Determine performance outcome
            if token_speed1 > token_speed2 and prompt_speed1 > prompt_speed2:
                outcome = 1  # device1 performs better
            elif token_speed1 < token_speed2 and prompt_speed1 < prompt_speed2:
                outcome = 0  # device2 performs better
            else:
                outcome = 0.5  # mixed performance
```
Rating Updates
The Glicko-2 system updates performance ratings after each benchmark comparison:
Calculate Expected Performance:
```python
import math

Q = math.log(10) / 400

def g(rd):
    # Dampening factor: comparisons against devices whose own ratings
    # are uncertain (high RD) carry less weight.
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)

def expected_performance(rating1, rating2, rd2):
    # Expected score (between 0 and 1) of device 1 against device 2.
    return 1 / (1 + 10 ** (-g(rd2) * (rating1 - rating2) / 400))
```
Update Performance Rating and RD:
```python
def update_performance(rating, rd, rd_opp, outcome, expected):
    # Variance of the rating implied by this single comparison.
    g_rd = g(rd_opp)
    d_squared = 1 / (Q**2 * g_rd**2 * expected * (1 - expected))
    # Combine the prior uncertainty with the new evidence.
    new_rd = math.sqrt(1 / (1 / rd**2 + 1 / d_squared))
    new_rating = rating + Q / (1 / rd**2 + 1 / d_squared) * g_rd * (outcome - expected)
    return new_rating, new_rd
```
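For illustration, one comparison between a brand-new device and an established one might flow through these helpers as follows (the numbers are made up; note that these helpers follow the classic single-comparison Glicko update, with the volatility step omitted):

```python
# New device (rating 1500, RD 350) beats an established one (1600, RD 80).
exp = expected_performance(1500, 1600, rd2=80)
new_rating, new_rd = update_performance(1500, 350, rd_opp=80, outcome=1, expected=exp)
# The new device's rating rises sharply and its RD shrinks: a single
# result carries a lot of weight while uncertainty is still high.
```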
Confidence Thresholds
We implement several confidence thresholds, applied as sketched below:
- Minimum Benchmarks: Devices must have at least 5 benchmark runs to be included in confident rankings
- Performance Deviation: Devices with RD > 100 are considered less reliable (RD is on the rating scale, not tokens/second)
- Performance Consistency: High volatility indicates inconsistent performance across benchmarks
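A minimal sketch of how these thresholds might be applied to a per-device summary table (the column names are illustrative):

```python
MIN_BENCHMARKS = 5
MAX_RD = 100

def confident_rankings(summary):
    # Keep only devices with enough runs and a sufficiently certain rating,
    # then rank the survivors by rating.
    mask = (summary["Runs"] >= MIN_BENCHMARKS) & (summary["RD"] <= MAX_RD)
    return summary[mask].sort_values("Rating", ascending=False)
```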
Practical Considerations
Handling Sparse Data
The system is designed to handle sparse benchmark data by:
- Using conservative initial performance ratings for new devices
- Increasing RD for devices with few benchmark runs (see the sketch after this list)
- Implementing a minimum benchmark threshold
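One standard way to realize the RD point, borrowed from the original Glicko system, is to inflate RD for each rating period a device goes without new benchmarks; a sketch (the constant `C` is an assumed value, not a tuned project parameter):

```python
import math

C = 35.0  # assumed inflation constant; larger values discount old results faster

def inflate_rd(rd, periods_inactive):
    # RD grows back toward the 350 starting cap while a device has no new
    # runs, so stale ratings are treated with less confidence.
    return min(math.sqrt(rd**2 + C**2 * periods_inactive), 350.0)
```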
Performance Metrics
We track several performance metrics:
- Combined performance rating (overall, on the Glicko-2 scale rather than tokens/second)
- Token generation rating
- Prompt processing rating
- Performance deviation (uncertainty on the rating scale; this drives the confidence intervals sketched below)
- Number of benchmark runs
- Performance comparison statistics
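The rating and its deviation combine directly into the confidence intervals shown in the rankings; a sketch (the 95% z-value is a conventional choice, not necessarily what the project uses):

```python
def confidence_interval(rating, rd, z=1.96):
    # Approximate 95% interval: the device's true performance level is
    # likely within about two rating deviations of the estimate.
    return rating - z * rd, rating + z * rd

low, high = confidence_interval(1623.4, 62.1)  # -> approx. (1501.7, 1745.1)
```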
Visualization
The system provides:
- Overall performance rankings with confidence intervals (a minimal Streamlit sketch follows this list)
- Platform-specific performance statistics
- Head-to-head performance comparison tools
- Performance trend analysis across different model sizes
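Since the project ships as a Streamlit app, the rankings view can be assembled from the summary table in a few lines; a sketch (the page layout and column names are illustrative):

```python
import pandas as pd
import streamlit as st

# Hypothetical output of the ranking pipeline.
leaderboard = pd.DataFrame({
    "Device": ["Device A", "Device B"],
    "Rating": [1623.4, 1541.2],
    "RD": [62.1, 88.7],
    "Runs": [12, 7],
})

st.title("Device Performance Rankings")
st.dataframe(leaderboard.sort_values("Rating", ascending=False))
```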
Advantages Over Other Systems
- Better Handling of Performance Uncertainty: Explicit modeling of performance measurement uncertainty
- More Accurate with Fewer Benchmarks: Can provide meaningful performance ratings with limited data
- Dynamic Performance Updates: Volatility parameter allows for appropriate rating changes
- Transparent Confidence: Performance deviations provide clear confidence measures
Limitations
- Computational Complexity: More complex than Elo, requiring more calculations
- Parameter Sensitivity: Results can be sensitive to system parameters
- Continuous Metrics: Requires conversion of continuous performance metrics (tokens/second) to relative comparisons
References
- Glickman, M. E. (2001). "The Glicko-2 Rating System"
- Glickman, M. E. (1999). "Parameter estimation in large dynamic paired comparison experiments"
- Glickman, M. E. (2001). "Dynamic paired comparison models with stochastic variances"