# LLM to ONNX Converter
Convert small language models to ONNX format for reliable RAG and chatbot deployment on resource-constrained hardware.
## Overview
This repository provides scripts that convert small language models to ONNX format and create INT8-quantized versions for efficient deployment on resource-constrained devices. It is well suited to mobile applications, the Unity game engine, and embedded systems.
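As a rough sketch of what the conversion step looks like (the model ID and output path here are illustrative examples, not the repository's exact script), `optimum` can export a Hugging Face checkpoint to ONNX like this:

```python
# Minimal export sketch using optimum's ONNX Runtime integration.
# "facebook/opt-350m" and the output directory are example values.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "facebook/opt-350m"

# export=True converts the PyTorch checkpoint to ONNX on the fly
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the ONNX graph and tokenizer side by side for later inference
model.save_pretrained("onnx_models/opt_onnx")
tokenizer.save_pretrained("onnx_models/opt_onnx")
```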
## Tested Models
We've tested the following models, judging response quality on example outputs:
| Model | Size | Quantized | Response Quality | Response Time (sec) |
|---|---|---|---|---|
| Qwen-0.5B | 500M | ✅ | ❌ Poor | 8.37 |
| Qwen-0.5B | 500M | ❌ | ✅ Good | 15.69 |
| TinyLlama-1.1B | 1.1B | ✅ | ❌ Poor | 10.15 |
| TinyLlama-1.1B | 1.1B | ❌ | ✅ Good | 19.23 |
| Phi-1.5 | 1.3B | ❌ | ✅ Good | 15.32 |
| Falcon-RW-1B | 1B | ❌ | ✅ Good | 21.56 |
| GPT2-Medium | 355M | ✅ | ✅ Good | 6.27 |
| GPT2-Medium | 355M | ❌ | ✅ Good | 12.77 |
| OPT-350M | 350M | ✅ | ✅ Good | 4.33 |
| OPT-350M | 350M | ❌ | ✅ Good | 10.42 |
| Bloom-560M | 560M | ✅ | ❌ Poor | 11.93 |
| Bloom-560M | 560M | ❌ | ✅ Good | 34.38 |
## Recommendations
Based on our testing:
- For best speed + quality: OPT-350M (quantized) - fastest with good quality
- For best overall quality: Phi-1.5 (non-quantized) - excellent responses
- For smallest size: GPT2-Medium or OPT-350M (quantized) - small with good performance
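To try one of these exports, here is a minimal inference sketch; the directory path is an assumption based on the repository layout below, so adjust it to wherever your export lives:

```python
# Load an exported (optionally quantized) model with optimum + onnxruntime.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_dir = "onnx_models/opt_onnx_quantized"  # illustrative path

model = ORTModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

inputs = tokenizer("What is retrieval-augmented generation?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This assumes the tokenizer was saved alongside the ONNX model, as in the export sketch above.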
## Key Findings
- Quantization provides ~2x speed improvement
- Smaller models (350-500M) quantize better than larger models (1B+)
- Some architectures (OPT, GPT2) handle quantization better than others
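For reference, INT8 results like those above can be reproduced with ONNX Runtime's dynamic quantizer. This is a sketch under the assumption that the export produced a single `model.onnx` file; paths are illustrative:

```python
# Dynamic INT8 quantization of an exported ONNX model.
import os

from onnxruntime.quantization import QuantType, quantize_dynamic

# Make sure the output directory exists before writing the quantized graph
os.makedirs("onnx_models/opt_onnx_quantized", exist_ok=True)

quantize_dynamic(
    model_input="onnx_models/opt_onnx/model.onnx",        # illustrative path
    model_output="onnx_models/opt_onnx_quantized/model.onnx",
    weight_type=QuantType.QInt8,  # store weights as signed INT8
)
```

Dynamic quantization needs no calibration data, which keeps the workflow simple; static quantization can be faster still but requires a representative dataset.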
## Repository Structure
    onnx_models/
    ├── bloom_onnx/
    ├── bloom_onnx_quantized/
    ├── falcon_onnx/
    ├── gpt2_onnx/
    ├── gpt2_onnx_quantized/
    ├── opt_onnx/
    ├── opt_onnx_quantized/
    ├── phi_onnx/
    ├── qwen_onnx/
    ├── qwen_onnx_quantized/
    ├── tinyllama_onnx/
    └── tinyllama_onnx_quantized/
## Requirements
- Python 3.8+
- optimum
- onnxruntime
- transformers
- numpy
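If you're starting from scratch, these can typically be installed with `pip install "optimum[onnxruntime]" transformers numpy` (the `onnxruntime` extra pulls in ONNX Runtime; exact extras may vary by optimum version).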