---
title: On-Device LLM Throughput Calculator
emoji: 🚀
colorFrom: pink
colorTo: blue
sdk: gradio
sdk_version: 4.36.0
app_file: src/app.py
pinned: false
license: mit
---
# On-Device LLM Throughput Calculator

A Gradio web application that helps visualize LLM throughput on memory-bandwidth-constrained devices.
## Overview

This tool calculates and visualizes the theoretical throughput (tokens per second) that a Large Language Model (LLM) can achieve on devices limited by memory bandwidth. It supports several attention mechanisms:

- Grouped Query Attention (GQA)
- Multi-Query Attention (MQA)
- Multi-head Latent Attention (MLA)

It also visualizes how sliding window attention affects throughput at different context lengths.
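The differences between these mechanisms mostly come down to how many KV heads each layer must cache. As a rough sketch of the per-sequence KV-cache size (the function name and default values below are illustrative assumptions, not this app's API):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_value=2, sliding_window=None):
    """Per-sequence KV-cache size in bytes (illustrative, fp16 by default)."""
    # Sliding window attention caps how many positions are cached.
    cached = min(context_len, sliding_window) if sliding_window else context_len
    # Factor of 2 for the separate K and V tensors.
    return 2 * num_layers * num_kv_heads * head_dim * cached * bytes_per_value

# GQA with 8 KV heads vs. MQA with 1, at a 32k context (hypothetical shapes):
gqa = kv_cache_bytes(32, 8, 128, 32768)   # 4 GiB
mqa = kv_cache_bytes(32, 1, 128, 32768)   # 0.5 GiB
# The same GQA model with a 4k sliding window caches only 4k positions:
swa = kv_cache_bytes(32, 8, 128, 32768, sliding_window=4096)  # 0.5 GiB
```

Because the KV cache is read once per generated token, shrinking it (fewer KV heads, a sliding window, or MLA's compressed latent cache) directly raises the memory-bound throughput ceiling.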
## Features

- Customize device specifications (memory bandwidth)
- Configure model parameters (size, layers, heads)
- Compare different attention mechanisms
- Visualize performance across different context lengths
- Sliding window attention support
## Usage

1. Configure your device details (name, memory bandwidth)
2. Set model parameters (number of parameters, layer count, etc.)
3. Choose which attention mechanism configurations to compare
4. Generate a visualization of expected throughput
## Installation

```bash
pip install -r requirements.txt
```
## Running Locally

```bash
cd src
python app.py
```
## Theory

The calculations are based on memory bandwidth bottlenecks as described in the [JAX ML Scaling Book](https://jax-ml.github.io/scaling-book/inference/#theoretical-estimates-for-llm-latency-and-throughput). During decoding, each step must stream all model parameters plus every sequence's KV cache through memory while producing one token per sequence, which gives the basic formula for tokens per second:

```
tokens_per_second = (batch_size * memory_bandwidth) / (batch_size * total_kv_size + parameter_size)
```
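To make the formula concrete, here is a worked example with hypothetical numbers (an 8B-parameter fp16 model on a 100 GB/s device; none of these values come from the app's defaults):

```python
GB = 1e9

memory_bandwidth = 100 * GB   # bytes/s streamed per decode step
parameter_size = 8e9 * 2      # 8B params x 2 bytes (fp16) = 16 GB
batch_size = 1

# Per-sequence KV cache for a hypothetical GQA model at 8k context:
# 2 (K and V) * 32 layers * 8 KV heads * head_dim 128 * 8192 tokens * 2 bytes
total_kv_size = 2 * 32 * 8 * 128 * 8192 * 2   # ~1.07 GB

tokens_per_second = (batch_size * memory_bandwidth) / (
    batch_size * total_kv_size + parameter_size
)
print(round(tokens_per_second, 2))  # roughly 5.86 tok/s
```

At batch size 1 the parameter reads dominate, so throughput is close to `memory_bandwidth / parameter_size`; as the batch grows, the KV-cache term takes over and eventually caps the per-sequence gain.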
## License

MIT