# Glicko-2 Ranking System Implementation
## Overview
The Glicko-2 ranking system is used in this project to rank devices based on their performance in benchmark tests, specifically token generation speed and prompt processing speed (both measured in tokens/second). This document explains both the theoretical foundations of Glicko-2 and its specific implementation in our system.
## Glicko-2 Theory
Glicko-2 is an improvement over the original Glicko system, which itself was an improvement over the Elo rating system. It was developed by Mark Glickman and is particularly well-suited for situations where:
1. Devices have different numbers of benchmark runs
2. There's uncertainty about a device's true performance capabilities
3. Performance metrics need to be compared across different model sizes and configurations
### Key Components
1. **Rating (μ)**: A numerical value representing a device's relative performance level (higher is better)
2. **Rating Deviation (RD)**: The uncertainty in the performance rating
3. **Volatility (σ)**: A measure of how consistent a device's performance is across different benchmarks
### Rating System Parameters
- **Initial Rating**: 1500 (standard starting point on the Glicko-2 scale)
- **Initial RD**: 350 (high uncertainty for new devices)
- **Volatility**: 0.06 (controls how quickly performance ratings can change)
- **Tau**: 0.5 (system constant that limits the change in volatility)
Note: The rating numbers themselves are on a relative scale and don't directly correspond to tokens/second. Instead, they represent relative performance levels where higher numbers indicate better performance. The actual token generation and prompt processing speeds (in tokens/second) are used to determine the relative performance outcomes that update these ratings.
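As a minimal sketch, these parameters might be collected into a small rating-state type as follows; the names (`DeviceRating`, `TAU`, and so on) are illustrative, not the project's actual identifiers:
```python
from dataclasses import dataclass

INITIAL_RATING = 1500.0    # standard Glicko-2 starting point
INITIAL_RD = 350.0         # high uncertainty for a brand-new device
INITIAL_VOLATILITY = 0.06  # expected degree of rating fluctuation
TAU = 0.5                  # system constant limiting volatility change

@dataclass
class DeviceRating:
    """Per-device Glicko-2 state, initialized to the defaults above."""
    rating: float = INITIAL_RATING
    rd: float = INITIAL_RD
    volatility: float = INITIAL_VOLATILITY
```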
## Implementation Details
### Data Preparation
Before applying Glicko-2, we preprocess the benchmark data:
1. Filter out emulators, and exclude iOS devices running with insufficient GPU layers, so that results are comparable across iOS devices
2. Normalize scores within each model group to account for different model difficulties (a normalization sketch follows the worked example below)
3. Convert continuous performance metrics into relative comparisons:
- For each pair of devices running the same model, we compare their token generation and prompt processing speeds
- If a device is faster in both metrics, it "wins" the comparison (outcome = 1)
- If a device is slower in both metrics, it "loses" the comparison (outcome = 0)
- If one device is faster in one metric but slower in the other, it's considered a "draw" (outcome = 0.5)
- This conversion is necessary because Glicko-2 works with discrete outcomes (win/loss/draw) rather than continuous performance values
For example, if:
- Device A: Token Generation = 50 tokens/sec, Prompt Processing = 30 tokens/sec
- Device B: Token Generation = 45 tokens/sec, Prompt Processing = 25 tokens/sec
Then Device A "wins" this comparison because it's faster in both metrics. This relative outcome (1 for Device A, 0 for Device B) is what's used to update the Glicko-2 ratings.
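Step 2 of the preparation can be sketched as per-model min-max scaling; this is one plausible reading, not the project's confirmed method, and it assumes a pandas DataFrame with the same hypothetical column names used in the match-processing snippet below:
```python
import pandas as pd

def normalize_within_model(df: pd.DataFrame) -> pd.DataFrame:
    """Scale each speed column to [0, 1] within its model group so that
    devices benchmarked only on slower models are not penalized."""
    df = df.copy()
    for col in ["Token Generation", "Prompt Processing"]:
        grouped = df.groupby("Model ID")[col]
        lo = grouped.transform("min")
        hi = grouped.transform("max")
        # Guard against a group where all devices scored identically
        df[col + " (norm)"] = (df[col] - lo) / (hi - lo).replace(0, 1)
    return df
```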
### Match Processing
For each model, we compare devices pairwise based on their token generation and prompt processing speeds:
```python
# Example of match processing: one pairwise comparison per device pair per model
for model, group in df.groupby("Model ID"):
    devices = group["Normalized Device ID"].unique()
    for i in range(len(devices)):
        for j in range(i + 1, len(devices)):
            device1, device2 = devices[i], devices[j]
            rows1 = group[group["Normalized Device ID"] == device1]
            rows2 = group[group["Normalized Device ID"] == device2]
            # Compare performance metrics (first recorded run per device)
            token_speed1 = rows1["Token Generation"].iloc[0]
            token_speed2 = rows2["Token Generation"].iloc[0]
            prompt_speed1 = rows1["Prompt Processing"].iloc[0]
            prompt_speed2 = rows2["Prompt Processing"].iloc[0]
            # Determine the match outcome from device1's perspective
            if token_speed1 > token_speed2 and prompt_speed1 > prompt_speed2:
                outcome = 1    # device1 performs better on both metrics
            elif token_speed1 < token_speed2 and prompt_speed1 < prompt_speed2:
                outcome = 0    # device2 performs better on both metrics
            else:
                outcome = 0.5  # mixed result, treated as a draw
            # `outcome` feeds the rating update described below
```
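Note that the number of comparisons grows quadratically with the number of devices in a model group, and the sketch takes each device's first recorded run (`.iloc[0]`) for simplicity; an implementation could equally aggregate (for example, average) a device's runs before comparing.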
### Rating Updates
The system updates performance ratings after each benchmark comparison. The snippets below use the simplified classic-Glicko update formulas; the full Glicko-2 procedure additionally updates the volatility σ:
1. **Calculate Expected Performance**:
```python
import math

def expected_performance(rating, opp_rating, opp_rd):
    """Expected score against an opponent; only the opponent's RD
    enters the g() damping factor."""
    q = math.log(10) / 400
    g_rd = 1 / math.sqrt(1 + 3 * q**2 * opp_rd**2 / math.pi**2)
    return 1 / (1 + 10 ** (-g_rd * (rating - opp_rating) / 400))
```
2. **Update Performance Rating and RD**:
```python
def update_performance(rating, rd, opp_rd, outcome, expected):
    """Single-comparison rating and RD update."""
    q = math.log(10) / 400
    # Recompute the damping factor from the opponent's RD
    g_rd = 1 / math.sqrt(1 + 3 * q**2 * opp_rd**2 / math.pi**2)
    d_squared = 1 / (q**2 * g_rd**2 * expected * (1 - expected))
    new_rd = math.sqrt(1 / (1 / rd**2 + 1 / d_squared))
    new_rating = rating + q / (1 / rd**2 + 1 / d_squared) * g_rd * (outcome - expected)
    return new_rating, new_rd
```
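Putting the two functions together for a single comparison; the numbers here are illustrative, not real benchmark data:
```python
# Two devices, both rated on the same scale
r1, rd1 = 1500.0, 350.0   # device1: new, high uncertainty
r2, rd2 = 1400.0, 100.0   # device2: well-benchmarked, low uncertainty

# device1 was faster on both metrics, so outcome = 1 from its perspective
e1 = expected_performance(r1, r2, rd2)
r1_new, rd1_new = update_performance(r1, rd1, rd2, outcome=1, expected=e1)

# device2 sees the mirror image of the same match
e2 = expected_performance(r2, r1, rd1)
r2_new, rd2_new = update_performance(r2, rd2, rd1, outcome=0, expected=e2)
```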
### Confidence Thresholds
We implement several confidence thresholds (a filtering sketch follows the list):
1. **Minimum Benchmarks**: Devices must have at least 5 benchmark runs to be included in confident rankings
2. **Performance Deviation**: Devices with RD > 100 (rating-scale units, not tokens/second) are considered less reliable
3. **Performance Consistency**: High volatility indicates inconsistent performance across benchmarks
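A minimal sketch of how thresholds 1 and 2 might gate the published rankings, assuming each device's state is carried in a dict with hypothetical keys `runs`, `rd`, and `rating`:
```python
MIN_BENCHMARKS = 5        # threshold 1: minimum benchmark runs
MAX_CONFIDENT_RD = 100.0  # threshold 2: maximum RD, rating-scale units

def confident_rankings(devices):
    """Keep only devices whose ratings we consider reliable,
    sorted best-first by rating."""
    reliable = [d for d in devices
                if d["runs"] >= MIN_BENCHMARKS and d["rd"] <= MAX_CONFIDENT_RD]
    return sorted(reliable, key=lambda d: d["rating"], reverse=True)
```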
## Practical Considerations
### Handling Sparse Data
The system is designed to handle sparse benchmark data by:
1. Using conservative initial performance ratings for new devices
2. Increasing RD for devices with few or infrequent benchmark runs (a standard inflation rule is sketched after this list)
3. Implementing a minimum benchmark threshold
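Point 2 can be realized with the classical Glicko uncertainty-inflation rule, under which RD grows while a device goes unbenchmarked; a sketch follows, with an illustrative per-period constant `C` (we do not claim this exact value is used in the project):
```python
import math

C = 35.0        # illustrative inflation constant per rating period
MAX_RD = 350.0  # a device is never more uncertain than a brand-new one

def inflate_rd(rd, periods_inactive):
    """Classic Glicko rule: RD' = min(sqrt(RD^2 + C^2 * t), 350)."""
    return min(math.sqrt(rd**2 + C**2 * periods_inactive), MAX_RD)
```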
### Performance Metrics
We track several performance metrics:
- Combined performance rating (overall, on the Glicko scale)
- Token generation rating (derived from tokens/second comparisons)
- Prompt processing rating (derived from tokens/second comparisons)
- Performance deviation (rating-scale uncertainty, not tokens/second)
- Number of benchmark runs
- Performance comparison statistics
### Visualization
The system provides:
1. Overall performance rankings with confidence intervals (see the interval sketch after this list)
2. Platform-specific performance statistics
3. Head-to-head performance comparison tools
4. Performance trend analysis across different model sizes
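Glicko convention reads a device's true rating as lying within roughly rating ± 2·RD with about 95% confidence, which is how the ranking intervals here can be derived; the helper name is illustrative:
```python
def confidence_interval(rating, rd, width=2.0):
    """Approximate 95% interval for a device's true rating,
    per the usual rating +/- 2*RD Glicko interpretation."""
    return rating - width * rd, rating + width * rd

low, high = confidence_interval(1850.0, 60.0)  # -> (1730.0, 1970.0)
```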
## Advantages Over Other Systems
1. **Better Handling of Performance Uncertainty**: Explicit modeling of performance measurement uncertainty
2. **More Accurate with Fewer Benchmarks**: Can provide meaningful performance ratings with limited data
3. **Dynamic Performance Updates**: Volatility parameter allows for appropriate rating changes
4. **Transparent Confidence**: Performance deviations provide clear confidence measures
## Limitations
1. **Computational Complexity**: More complex than Elo, requiring more calculations
2. **Parameter Sensitivity**: Results can be sensitive to system parameters
3. **Continuous Metrics**: Requires conversion of continuous performance metrics (tokens/second) to relative comparisons
## References
1. Glickman, M. E. (2001). "The Glicko-2 Rating System"
2. Glickman, M. E. (1999). "Parameter estimation in large dynamic paired comparison experiments". Applied Statistics, 48, 377-394
3. Glickman, M. E. (2001). "Dynamic paired comparison models with stochastic variances". Journal of Applied Statistics, 28, 673-689