Llama.cpp hybrid layer quantization of Llama-4-Scout-17B-16E-Instruct by meta-llama
Original model: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. This particular quant achieves a ~50G gguf with the same perplexity as a ~60G IQ4_XS gguf. The quants employed are all K quants, to avoid slow CPU processing of IQ quants, and they range from Q3_K_S in the early layers up to a final layer whose quant gives the file its name: here the final layer is Q4_K_M, so the quant is called Q4_K_H. Note that there is no unique Q4_K_H quant, since the choice of quantization level as a function of layer is arbitrary. For this file the layer quants are as follows:
embed : Q3_K
0..7 : Q3_K_S
8..23 : alt Q3_K_M Q3_K_S
24..44 : Q3_K_M
45 : Q3_K_L
46 : Q4_K_S
47 : Q4_K_M
output : Q5_K
These quants were selected based on combined subjective and objective performance evaluations to give both high performance and moderate file size.
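For reference, the schedule above can be written as a small layer-to-quant mapping. This is a minimal sketch only, assuming the `alt` ranges alternate between the two listed types starting with the first; the `layer_quant` helper is illustrative and not part of llama.cpp:

```python
# Sketch of the Q4_K_H per-layer quant schedule listed above.
# Assumption: "alt A B" means layers alternate A, B, A, B, ... starting
# at the first layer of the range. The embed (Q3_K) and output (Q5_K)
# tensors are handled separately from the numbered layers.

def layer_quant(layer: int) -> str:
    """Return the quant type used for layer 0..47 in the Q4_K_H file."""
    if 0 <= layer <= 7:
        return "Q3_K_S"
    if 8 <= layer <= 23:  # alternating Q3_K_M / Q3_K_S
        return "Q3_K_M" if (layer - 8) % 2 == 0 else "Q3_K_S"
    if 24 <= layer <= 44:
        return "Q3_K_M"
    if layer == 45:
        return "Q3_K_L"
    if layer == 46:
        return "Q4_K_S"
    if layer == 47:
        return "Q4_K_M"
    raise ValueError(f"layer {layer} out of range for a 48-layer model")

if __name__ == "__main__":
    print([layer_quant(i) for i in range(48)])
```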
A second, smaller quant is also available with predominantly Q3_K_S across layers, as follows:
(layer distribution updates : v2 5/5/25, v3 5/6/25)
embed : Q3_K
0..15 : alt Q3_K_S Q2_K
16..38 : Q3_K_S
39..45 : Q3_K_M
46 : Q3_K_L
47 : Q4_K_S
output : Q5_K
This quant reduces file size to 46.6G while maintaining perplexity (10.2) at the same level as a homogeneous Q3_K_M quant, in a file ~5G smaller.
A third quant is also available that makes heavier use of Q2_K, as follows:
embed : Q3_K
0..15 : alt Q2_K Q2_K_S
16..23 : Q2_K
24..31 : alt Q3_K_S Q2_K
32..43 : Q3_K_S
44..47 : Q3_K_M
output : Q5_K
This quant reduces file size to 42.8G while maintaining good perplexity (10.47), close to a homogeneous Q3_K_M while being ~9G smaller, and it exhibits no generation artifacts (nonsense words, or words randomly completed with Chinese characters) across a wide range of test prompts.
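One way to confirm which quant type each layer of a downloaded file actually uses is the `gguf` Python package that ships with llama.cpp. A rough sketch, assuming the Q2_K_H file sits in the current directory; note that individual tensors carry base types (Q2_K, Q3_K, Q4_K, ...), while the _S/_M/_L suffixes describe file-level mixes:

```python
# Inspect the quantization type of every tensor in a GGUF file.
# Requires: pip install gguf
from gguf import GGUFReader

reader = GGUFReader("Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf")
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum member
    print(f"{tensor.name:50s} {tensor.tensor_type.name}")
```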
Comparison:
Quant | Size (e9 B) | PPL | Comment |
---|---|---|---|
Q2_K | 39.6 | 11.4 | Many generation artifacts: nonsense words, words randomly completed with Chinese tokens |
Q2_K_H | 42.8 | 10.4 | No generation artifacts, no nonsense words, very close to Q3_K_M perplexity |
Q3_K_M_3_5 | 51.6 | 10.2 | Q3_K_M with Q5_K output layer |
Q3_K_H | 46.6 | 10.2 | Hybrid quant with Q5_K output layer, Q3_K_M perplexity |
Q4_K_H | 50.4 | 9.54 | Hybrid quant with Q5_K output layer, IQ4_XS perplexity |
IQ4_XS | 59.9 | 9.57 | IQ4_XS with default embed/output layers |
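As a rough sanity check on these sizes, effective bits per weight can be estimated from file size and parameter count. A small sketch, assuming ~109e9 total parameters for Llama-4-Scout-17B-16E (the exact count and the small metadata overhead are approximations):

```python
# Estimate effective bits per weight (bpw) from GGUF file size.
# Assumption: ~109e9 total parameters for Llama-4-Scout-17B-16E.
N_PARAMS = 109e9

sizes = {  # bytes, from the comparison table above
    "Q2_K":   39.6e9,
    "Q2_K_H": 42.8e9,
    "Q3_K_H": 46.6e9,
    "Q4_K_H": 50.4e9,
    "IQ4_XS": 59.9e9,
}

for quant, size in sizes.items():
    bpw = size * 8 / N_PARAMS
    print(f"{quant:8s} ~{bpw:.2f} bpw")
```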
Download the files below:
Link | Type | Size (e9 B) | Notes |
---|---|---|---|
Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf | Q2_K_H | 42.8 | Good quality |
Llama-4-Scout-17B-16E-Instruct.Q3_K_H.gguf | Q3_K_H | 46.6 | Solid quality |
Llama-4-Scout-17B-16E-Instruct.Q4_K_H.gguf | Q4_K_H | 50.4 | Excellent quality |
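The files can also be fetched programmatically with `huggingface_hub`; a minimal sketch, assuming the repo id of this page and the filenames from the table above:

```python
# Download one of the hybrid quants from the Hugging Face Hub.
# Requires: pip install huggingface_hub
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="steampunque/Llama-4-Scout-17B-16E-Instruct-GGUF",
    filename="Llama-4-Scout-17B-16E-Instruct.Q4_K_H.gguf",
)
print(path)  # local cache path of the downloaded GGUF
```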
A discussion thread about the approach can be found here on the llama.cpp git repository: