Llama.cpp hybrid layer quantization of Llama-4-Scout-17B-16E-Instruct by meta-llama

Original model: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct

The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. This particular quant achieves a ~50G gguf with the same perplexity as a ~60G IQ4_XS gguf. The quants employed are all K quants to avoid slow CPU processing of IQ quants, and they range from Q3_K_S in the early layers up to the quantization of the final layer, which gives the quant its name. In this case the final layer is Q4_K_M, so the quant is called Q4_K_H. Note there is no unique Q4_K_H quant, since the choice of quantization as a function of layer is arbitrary. For this file the layer quants are as follows:

embed  : Q3_K
0..7   : Q3_K_S
8..23  : alt Q3_K_M Q3_K_S
24..44 : Q3_K_M
45     : Q3_K_L
46     : Q4_K_S
47     : Q4_K_M
output : Q5_K

These quants were selected based on combined subjective and objective performance evaluations to give both high performance and moderate file size; a sketch of the resulting per-layer schedule is shown below.
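For reference, the following is a minimal sketch of the Q4_K_H per-layer schedule listed above. The function name is hypothetical, and the starting phase of the alternating range is an assumption; the actual file was produced with llama.cpp's quantization tooling, not with this script.

```python
# Hypothetical sketch of the Q4_K_H per-layer schedule described above.
# Embed (Q3_K) and output (Q5_K) tensors are handled separately from
# the per-layer schedule.

def q4_k_h_layer_quant(layer: int) -> str:
    """Return the quant type used for a given transformer layer (0..47)."""
    if 0 <= layer <= 7:
        return "Q3_K_S"
    if 8 <= layer <= 23:
        # "alt Q3_K_M Q3_K_S": alternating types; phase assumed to start
        # with Q3_K_M at layer 8.
        return "Q3_K_M" if (layer - 8) % 2 == 0 else "Q3_K_S"
    if 24 <= layer <= 44:
        return "Q3_K_M"
    if layer == 45:
        return "Q3_K_L"
    if layer == 46:
        return "Q4_K_S"
    if layer == 47:
        return "Q4_K_M"
    raise ValueError(f"layer {layer} out of range for a 48-layer model")

if __name__ == "__main__":
    for i in range(48):
        print(i, q4_k_h_layer_quant(i))
```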

A second, smaller quant is also available, using predominantly Q3_K_S across layers as follows:

(layer distribution updates : v2 5/5/25, v3 5/6/25)
embed  : Q3_K
0..15  : alt Q3_K_S Q2_K
16..38 : Q3_K_S
39..45 : Q3_K_M
46     : Q3_K_L
47     : Q4_K_S
output : Q5_K

This quant reduces file size to 46.6G, about 5G smaller than a homogeneous Q3_K_M quant, while maintaining the same perplexity (10.2).

A third quant, making heavier use of Q2_K, is also available:

embed  : Q3_K
0..15  : alt Q2_K Q2_K_S
16..23 : Q2_K
24..31 : alt Q3_K_S Q2_K
32..43 : Q3_K_S
44..47 : Q3_K_M
output : Q5_K

This quant reduces file size to 42.8G, about 9G smaller than a homogeneous Q3_K_M quant, while maintaining good perplexity (10.47) close to Q3_K_M and exhibiting no generation artifacts (nonsense words or words randomly completed with Chinese characters) across a wide range of test prompts.

Comparison:

| Quant | Size (e9 B) | PPL | Comment |
| --- | --- | --- | --- |
| Q2_K | 39.6 | 11.4 | many generation artifacts: nonsense words, words randomly completed with Chinese tokens |
| Q2_K_H | 42.8 | 10.4 | no generation artifacts, no nonsense words, very close to Q3_K_M perplexity |
| Q3_K_M_3_5 | 51.6 | 10.2 | Q3_K_M with Q5_K output layer |
| Q3_K_H | 46.6 | 10.2 | hybrid quant with Q5_K output layer, Q3_K_M perplexity |
| Q4_K_H | 50.4 | 9.54 | hybrid quant with Q5_K output layer, IQ4_XS perplexity |
| IQ4_XS | 59.9 | 9.57 | IQ4_XS with default embed/output layers |
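As a rough sanity check on these sizes, file size can be converted to an average bits-per-weight figure. The snippet below is only a sketch: the sizes are taken from the table above, and 108e9 is the total parameter count reported for this model (all experts included).

```python
# Rough bits-per-weight estimate for each quant in the comparison table.
sizes = {
    "Q2_K": 39.6e9,
    "Q2_K_H": 42.8e9,
    "Q3_K_H": 46.6e9,
    "Q3_K_M_3_5": 51.6e9,
    "Q4_K_H": 50.4e9,
    "IQ4_XS": 59.9e9,
}
n_params = 108e9  # total parameter count of Llama-4-Scout-17B-16E

for name, size in sizes.items():
    bpw = size * 8 / n_params
    print(f"{name:12s} ~{bpw:.2f} bits/weight")
```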

Download the desired file from the links below:

| Link | Type | Size/e9 B | Notes |
| --- | --- | --- | --- |
| Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf | Q2_K_H | 42.8 | Good quality |
| Llama-4-Scout-17B-16E-Instruct.Q3_K_H.gguf | Q3_K_H | 46.6 | Solid quality |
| Llama-4-Scout-17B-16E-Instruct.Q4_K_H.gguf | Q4_K_H | 50.4 | Excellent quality |
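One way to fetch a file is with the huggingface_hub Python client, as sketched below. This assumes the huggingface_hub package is installed and uses the repo id of this model card with the Q4_K_H filename from the table above; adjust the filename for the other quants. The downloaded gguf can then be loaded with a recent llama.cpp build that supports the llama4 architecture.

```python
# Minimal sketch: download one of the GGUF files with huggingface_hub.
# Repo id and filename are taken from this model card; adjust as needed.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="steampunque/Llama-4-Scout-17B-16E-Instruct-GGUF",
    filename="Llama-4-Scout-17B-16E-Instruct.Q4_K_H.gguf",
)
print(path)  # local path to the downloaded gguf, usable with llama.cpp
```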

A discussion thread about the approach can be found on the llama.cpp GitHub repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
