Llama.cpp hybrid layer quantization of Llama-4-Scout-17B-16E-Instruct by meta-llama
Original model: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. This particular quant achieves a ~50G gguf with the same perplexity as a ~60G IQ4_XS gguf. The quants employed are all K quants, to avoid slow CPU processing of IQ quants, and they range from Q3_K_S in the early layers up to a final layer whose quant gives the file its name: here the final layer is Q4_K_M, so the quant is called Q4_K_H. Note that there is no unique Q4_K_H quant, since the choice of quantization level as a function of layer is arbitrary. For this file the layer quants are as follows:
embed : Q3_K
0..7 : Q3_K_S
8..23 : alt Q3_K_M Q3_K_S
24..44 : Q3_K_M
45 : Q3_K_L
46 : Q4_K_S
47 : Q4_K_M
output : Q5_K
These quants were selected based on combined subjective and objective performance evaluations to give both high performance and moderate file size.
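For reference, the schedule above can be written as a small layer-to-quant mapping. This is a minimal sketch only, assuming the `alt` ranges alternate between the two listed types starting with the first; the `layer_quant` helper is illustrative and not part of llama.cpp:

```python
# Sketch of the Q4_K_H per-layer quant schedule listed above.
# Assumption: "alt A B" means layers alternate A, B, A, B, ... starting
# at the first layer of the range. The embed (Q3_K) and output (Q5_K)
# tensors are handled separately from the numbered layers.

def layer_quant(layer: int) -> str:
    """Return the quant type used for layer 0..47 in the Q4_K_H file."""
    if 0 <= layer <= 7:
        return "Q3_K_S"
    if 8 <= layer <= 23:  # alternating Q3_K_M / Q3_K_S
        return "Q3_K_M" if (layer - 8) % 2 == 0 else "Q3_K_S"
    if 24 <= layer <= 44:
        return "Q3_K_M"
    if layer == 45:
        return "Q3_K_L"
    if layer == 46:
        return "Q4_K_S"
    if layer == 47:
        return "Q4_K_M"
    raise ValueError(f"layer {layer} out of range for a 48-layer model")

if __name__ == "__main__":
    print([layer_quant(i) for i in range(48)])
```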
A second, smaller quant is also available with predominantly Q3_K_S across layers, as follows:
(layer distribution updates : v2 5/5/25, v3 5/6/25)
embed : Q3_K
0..15 : alt Q3_K_S Q2_K
16..38 : Q3_K_S
39..45 : Q3_K_M
46 : Q3_K_L
47 : Q4_K_S
output : Q5_K
This quant reduces file size to 46.6G while maintaining perplexity (10.2) at the same level as a homogeneous Q3_K_M quant, in a file ~5G smaller.
A third quant is also available that makes heavier use of Q2_K, as follows:
embed : Q3_K
0..15 : alt Q2_K Q2_K_S
16..23 : Q2_K
24..31 : alt Q3_K_S Q2_K
32..43 : Q3_K_S
44..47 : Q3_K_M
output : Q5_K
This quant reduces file size to 42.8G while maintaining good perplexity (10.47), close to a homogeneous Q3_K_M while being ~9G smaller, and it exhibits no generation artifacts (nonsense words, or words randomly completed with Chinese characters) across a wide range of test prompts.
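One way to confirm which quant type each layer of a downloaded file actually uses is the `gguf` Python package that ships with llama.cpp. A rough sketch, assuming the Q2_K_H file sits in the current directory; note that individual tensors carry base types (Q2_K, Q3_K, Q4_K, ...), while the _S/_M/_L suffixes describe file-level mixes:

```python
# Inspect the quantization type of every tensor in a GGUF file.
# Requires: pip install gguf
from gguf import GGUFReader

reader = GGUFReader("Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf")
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum member
    print(f"{tensor.name:50s} {tensor.tensor_type.name}")
```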
Comparison:
Quant | Size (e9 B) | PPL | Comment |
---|---|---|---|
Q2_K | 39.6 | 11.4 | Many generation artifacts: nonsense words, words randomly completed with Chinese tokens |
Q2_K_H | 42.8 | 10.4 | No generation artifacts, no nonsense words, very close to Q3_K_M perplexity |
Q3_K_M_3_5 | 51.6 | 10.2 | Q3_K_M with Q5_K output layer |
Q3_K_H | 46.6 | 10.2 | Hybrid quant with Q5_K output layer, Q3_K_M perplexity |
Q4_K_H | 50.4 | 9.54 | Hybrid quant with Q5_K output layer, IQ4_XS perplexity |
IQ4_XS | 59.9 | 9.57 | IQ4_XS with default embed/output layers |
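As a rough sanity check on these sizes, effective bits per weight can be estimated from file size and parameter count. A small sketch, assuming ~109e9 total parameters for Llama-4-Scout-17B-16E (the exact count and the small metadata overhead are approximations):

```python
# Estimate effective bits per weight (bpw) from GGUF file size.
# Assumption: ~109e9 total parameters for Llama-4-Scout-17B-16E.
N_PARAMS = 109e9

sizes = {  # bytes, from the comparison table above
    "Q2_K":   39.6e9,
    "Q2_K_H": 42.8e9,
    "Q3_K_H": 46.6e9,
    "Q4_K_H": 50.4e9,
    "IQ4_XS": 59.9e9,
}

for quant, size in sizes.items():
    bpw = size * 8 / N_PARAMS
    print(f"{quant:8s} ~{bpw:.2f} bpw")
```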
Download the files below:
Link | Type | Size (e9 B) | Notes |
---|---|---|---|
Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf | Q2_K_H | 42.8 | Good quality |
Llama-4-Scout-17B-16E-Instruct.Q3_K_H.gguf | Q3_K_H | 46.6 | Solid quality |
Llama-4-Scout-17B-16E-Instruct.Q4_K_H.gguf | Q4_K_H | 50.4 | Excellent quality |
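The files can also be fetched programmatically with `huggingface_hub`; a minimal sketch, assuming the repo id of this page and the filenames from the table above:

```python
# Download one of the hybrid quants from the Hugging Face Hub.
# Requires: pip install huggingface_hub
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="steampunque/Llama-4-Scout-17B-16E-Instruct-GGUF",
    filename="Llama-4-Scout-17B-16E-Instruct.Q4_K_H.gguf",
)
print(path)  # local cache path of the downloaded GGUF
```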
A discussion thread about the approach can be found here on the llama.cpp git repository: