# LLM to ONNX Converter
Convert small language models to ONNX format for reliable RAG and chatbot deployment on resource-constrained hardware.
## Overview
This repository provides scripts that convert small language models to ONNX format and create INT8-quantized versions for efficient deployment on resource-constrained devices. It is well suited to mobile applications, the Unity game engine, and embedded systems.
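As a rough sketch of what the conversion step looks like (the model ID and output path here are illustrative examples, not the repository's exact script), `optimum` can export a Hugging Face checkpoint to ONNX like this:

```python
# Minimal export sketch using optimum's ONNX Runtime integration.
# "facebook/opt-350m" and the output directory are example values.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "facebook/opt-350m"

# export=True converts the PyTorch checkpoint to ONNX on the fly
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the ONNX graph and tokenizer side by side for later inference
model.save_pretrained("onnx_models/opt_onnx")
tokenizer.save_pretrained("onnx_models/opt_onnx")
```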
## Tested Models
We've tested the following models, judging response quality on example outputs:
| Model | Size | Quantized | Response Quality | Response Time (sec) |
|---|---|---|---|---|
| Qwen-0.5B | 500M | ✅ | ❌ Poor | 8.37 |
| Qwen-0.5B | 500M | ❌ | ✅ Good | 15.69 |
| TinyLlama-1.1B | 1.1B | ✅ | ❌ Poor | 10.15 |
| TinyLlama-1.1B | 1.1B | ❌ | ✅ Good | 19.23 |
| Phi-1.5 | 1.3B | ❌ | ✅ Good | 15.32 |
| Falcon-RW-1B | 1B | ❌ | ✅ Good | 21.56 |
| GPT2-Medium | 355M | ✅ | ✅ Good | 6.27 |
| GPT2-Medium | 355M | ❌ | ✅ Good | 12.77 |
| OPT-350M | 350M | ✅ | ✅ Good | 4.33 |
| OPT-350M | 350M | ❌ | ✅ Good | 10.42 |
| Bloom-560M | 560M | ✅ | ❌ Poor | 11.93 |
| Bloom-560M | 560M | ❌ | ✅ Good | 34.38 |
## Recommendations
Based on our testing:
- For best speed + quality: OPT-350M (quantized) - fastest with good quality
- For best overall quality: Phi-1.5 (non-quantized) - excellent responses
- For smallest size: GPT2-Medium or OPT-350M (quantized) - small with good performance
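To try one of these exports, here is a minimal inference sketch; the directory path is an assumption based on the repository layout below, so adjust it to wherever your export lives:

```python
# Load an exported (optionally quantized) model with optimum + onnxruntime.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_dir = "onnx_models/opt_onnx_quantized"  # illustrative path

model = ORTModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

inputs = tokenizer("What is retrieval-augmented generation?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This assumes the tokenizer was saved alongside the ONNX model, as in the export sketch above.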
## Key Findings
- Quantization provides ~2x speed improvement
- Smaller models (350-500M) quantize better than larger models (1B+)
- Some architectures (OPT, GPT2) handle quantization better than others
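For reference, INT8 results like those above can be reproduced with ONNX Runtime's dynamic quantizer. This is a sketch under the assumption that the export produced a single `model.onnx` file; paths are illustrative:

```python
# Dynamic INT8 quantization of an exported ONNX model.
import os

from onnxruntime.quantization import QuantType, quantize_dynamic

# Make sure the output directory exists before writing the quantized graph
os.makedirs("onnx_models/opt_onnx_quantized", exist_ok=True)

quantize_dynamic(
    model_input="onnx_models/opt_onnx/model.onnx",        # illustrative path
    model_output="onnx_models/opt_onnx_quantized/model.onnx",
    weight_type=QuantType.QInt8,  # store weights as signed INT8
)
```

Dynamic quantization needs no calibration data, which keeps the workflow simple; static quantization can be faster still but requires a representative dataset.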
## Repository Structure
    onnx_models/
    ├── bloom_onnx/
    ├── bloom_onnx_quantized/
    ├── falcon_onnx/
    ├── gpt2_onnx/
    ├── gpt2_onnx_quantized/
    ├── opt_onnx/
    ├── opt_onnx_quantized/
    ├── phi_onnx/
    ├── qwen_onnx/
    ├── qwen_onnx_quantized/
    ├── tinyllama_onnx/
    └── tinyllama_onnx_quantized/
## Requirements
- Python 3.8+
- optimum
- onnxruntime
- transformers
- numpy
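If you're starting from scratch, these can typically be installed with `pip install "optimum[onnxruntime]" transformers numpy` (the `onnxruntime` extra pulls in ONNX Runtime; exact extras may vary by optimum version).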