πŸš€ LLM to ONNX Converter

Convert small language models to ONNX format for reliable RAG and chatbot applications on resource-constrained hardware.

πŸ“‹ Overview

This repository provides scripts to convert small language models to ONNX format and to create INT8-quantized versions for efficient deployment on resource-constrained devices. It is well suited to mobile applications, the Unity game engine, and embedded systems.
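
The conversion itself can be done with Hugging Face Optimum. A minimal sketch (the model ID and output path here are illustrative; the repository's own scripts may differ):

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "facebook/opt-350m"  # example model; substitute any supported small LLM

# Export the PyTorch checkpoint to ONNX and save it alongside its tokenizer
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.save_pretrained("onnx_models/opt_onnx")
tokenizer.save_pretrained("onnx_models/opt_onnx")
```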

βœ… Tested Models

We've tested the following models; quantization status, response quality, and generation speed are summarized below:

| Model          | Size | Quantized | Response Quality | Speed (sec) |
|----------------|------|-----------|------------------|-------------|
| Qwen-0.5B      | 500M | βœ…        | ❌ Poor          | 8.37        |
| Qwen-0.5B      | 500M | ❌        | βœ… Good          | 15.69       |
| TinyLlama-1.1B | 1.1B | βœ…        | ❌ Poor          | 10.15       |
| TinyLlama-1.1B | 1.1B | ❌        | βœ… Good          | 19.23       |
| Phi-1.5        | 1.3B | ❌        | βœ… Good          | 15.32       |
| Falcon-RW-1B   | 1B   | ❌        | βœ… Good          | 21.56       |
| GPT2-Medium    | 355M | βœ…        | βœ… Good          | 6.27        |
| GPT2-Medium    | 355M | ❌        | βœ… Good          | 12.77       |
| OPT-350M       | 350M | βœ…        | βœ… Good          | 4.33        |
| OPT-350M       | 350M | ❌        | βœ… Good          | 10.42       |
| Bloom-560M     | 560M | βœ…        | ❌ Poor          | 11.93       |
| Bloom-560M     | 560M | ❌        | βœ… Good          | 34.38       |
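
The speed column is wall-clock generation time. A minimal benchmark sketch (the prompt, token budget, and model path are illustrative assumptions, not the exact test setup):

```python
import time

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

path = "onnx_models/opt_onnx_quantized"  # hypothetical path to a converted model
model = ORTModelForCausalLM.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)

inputs = tokenizer("What is ONNX?", return_tensors="pt")
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=100)
print(f"Generated in {time.time() - start:.2f} sec")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```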

🌟 Recommendations

Based on our testing:

  1. For best speed + quality: OPT-350M (quantized) - fastest with good quality
  2. For best overall quality: Phi-1.5 (non-quantized) - excellent responses
  3. For smallest size: GPT2-Medium or OPT-350M (quantized) - small with good performance

🚩 Key Findings

  • Quantization provides ~2x speed improvement
  • Smaller models (350-500M) quantize better than larger models (1B+)
  • Some architectures (OPT, GPT2) handle quantization better than others

πŸ“ Repository Structure

```
onnx_models/
β”œβ”€β”€ bloom_onnx/
β”œβ”€β”€ bloom_onnx_quantized/
β”œβ”€β”€ falcon_onnx/
β”œβ”€β”€ gpt2_onnx/
β”œβ”€β”€ gpt2_onnx_quantized/
β”œβ”€β”€ opt_onnx/
β”œβ”€β”€ opt_onnx_quantized/
β”œβ”€β”€ phi_onnx/
β”œβ”€β”€ qwen_onnx/
β”œβ”€β”€ qwen_onnx_quantized/
β”œβ”€β”€ tinyllama_onnx/
└── tinyllama_onnx_quantized/
```

πŸ“š Requirements

  • Python 3.8+
  • optimum
  • onnxruntime
  • transformers
  • numpy
