DeepSeek-R1-GPTQ-4b-128g

Model Overview

This model was obtained by quantizing the weights of deepseek-ai/DeepSeek-R1 to the INT4 data type. This optimization reduces the number of bits per parameter from 8 to 4, cutting the disk size and GPU memory requirements by approximately 50%.

All layers within the transformer blocks are compressed. Weights are quantized with a symmetric per-group scheme (group size 128), using the GPTQ algorithm.
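
For intuition, the sketch below shows symmetric per-group INT4 quantization with group size 128 using plain round-to-nearest. This is only an illustration of the weight layout: GPTQ itself chooses the quantized values with a Hessian-based error-correction loop, which is omitted here, and the tensor shapes and helper names are made up for the example.

```python
# Illustrative sketch: symmetric per-group INT4 quantization (group size 128).
# Round-to-nearest is shown for clarity; GPTQ additionally applies error correction
# based on second-order (Hessian) information when selecting quantized values.
import torch

def quantize_per_group_sym(w: torch.Tensor, group_size: int = 128, bits: int = 4):
    """Quantize a 2-D weight [out_features, in_features] with one scale per group."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for INT4; symmetric => no zero-point
    out_f, in_f = w.shape
    w_g = w.reshape(out_f, in_f // group_size, group_size)
    scale = w_g.abs().amax(dim=-1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w_g / scale), min=-qmax - 1, max=qmax).to(torch.int8)
    return q.reshape(out_f, in_f), scale.squeeze(-1)

def dequantize_per_group_sym(q: torch.Tensor, scale: torch.Tensor, group_size: int = 128):
    out_f, in_f = q.shape
    q_g = q.reshape(out_f, in_f // group_size, group_size).to(scale.dtype)
    return (q_g * scale.unsqueeze(-1)).reshape(out_f, in_f)

w = torch.randn(256, 512)
q, s = quantize_per_group_sym(w)
print((w - dequantize_per_group_sym(q, s)).abs().mean())   # mean reconstruction error
```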

The model checkpoint is saved in the compressed_tensors format.
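
A minimal usage sketch with vLLM is shown below. The tensor-parallel degree and sampling settings are illustrative, not prescriptive; even at INT4 the model still spans multiple GPUs.

```python
# Minimal sketch: loading the compressed-tensors checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g",
    tensor_parallel_size=8,      # illustrative; DeepSeek-R1 requires a multi-GPU node
    trust_remote_code=True,
)

sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Prove that the square root of 2 is irrational."], sampling)
print(outputs[0].outputs[0].text)
```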

| Model | Experts quantized | Attention blocks quantized | Size (GB) |
|---|:---:|:---:|---:|
| deepseek-ai/DeepSeek-R1 | ✗ | ✗ | 671 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g (this model) | ✓ | ✓ | 325 |
| cognitivecomputations/DeepSeek-R1-AWQ | ✓ | ✓ | 340 |

Evaluation

This model was evaluated on the OpenLLM v1 benchmarks and reasoning tasks (AIME-24, GPQA-Diamond, MATH-500).

Model outputs were generated with the vLLM engine.

For reasoning tasks, we estimate pass@1 based on 10 runs with different seeds, using temperature=0.6, top_p=0.95, and max_new_tokens=32768.
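
Concretely, the pass@1 estimate is the fraction of correct completions averaged over the sampled runs per problem. A minimal sketch is below; the 0/1 correctness matrix is a placeholder, not real evaluation data.

```python
# Sketch of the pass@1 estimate: average correctness over k sampled runs per problem.
import numpy as np

def pass_at_1(correct: np.ndarray) -> float:
    """correct: [num_problems, num_runs] array of 0/1 correctness flags."""
    return float(correct.mean(axis=1).mean())

correct = np.random.randint(0, 2, size=(30, 10))   # e.g. 30 problems x 10 seeds (placeholder)
print(f"pass@1 = {pass_at_1(correct):.4f}")
```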

OpenLLM Leaderboard V1 tasks

| Model | Recovery (%) | Average Score | ARC-Challenge (acc_norm, 25-shot) | GSM8k (exact_match, 5-shot) | HellaSwag (acc_norm, 10-shot) | MMLU (acc, 5-shot) | TruthfulQA (mc2, 0-shot) | WinoGrande (acc, 5-shot) |
|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1 | 100.00 | 81.04 | 72.53 | 95.91 | 89.30 | 87.22 | 59.28 | 82.00 |
| cognitivecomputations/DeepSeek-R1-AWQ | 100.07 | 81.10 | 73.12 | 95.15 | 89.07 | 86.86 | 60.09 | 82.32 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g (this model) | 99.86 | 80.93 | 72.70 | 95.68 | 89.25 | 86.83 | 58.77 | 82.32 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 100.30 | 81.28 | 72.53 | 95.68 | 89.36 | 86.99 | 59.77 | 83.35 |

Reasoning tasks (AIME-24, GPQA-Diamond, MATH-500)

| Model | Recovery (%) | Average Score | AIME 2024 (pass@1) | MATH-500 (pass@1) | GPQA Diamond (pass@1) |
|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1 | 100.00 | 82.99 | 78.33 | 97.24 | 73.38 |
| cognitivecomputations/DeepSeek-R1-AWQ | 94.29 | 78.25 | 70.67 | 93.64 | 70.46 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g (this model) | 96.52 | 80.10 | 72.96 | 97.09 | 70.26 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 98.81 | 82.00 | 77.00 | 97.08 | 71.92 |

Reproduction

The results were obtained using the following commands:

OpenLLM v1

```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale
MODEL_ARGS="pretrained=$MODEL,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True"

lm_eval \
  --model vllm \
  --model_args $MODEL_ARGS \
  --tasks openllm \
  --batch_size auto
```

For reasoning evals we adopted the protocol from the open-r1 repository.

Reasoning tasks

```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=38768,gpu_memory_utilization=0.8,tensor_parallel_size=8,add_special_tokens=false,generation_parameters={\"max_new_tokens\":32768,\"temperature\":0.6,\"top_p\":0.95,\"seed\":7686}"

export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:diamond|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```

Please use the vLLM build from this pull request: https://github.com/vllm-project/vllm/pull/16038

Performance benchmarking

We follow the standard vLLM performance benchmarking protocol with the ShareGPT dataset and observe the following metrics (lower is better):

| Model | Time to First Token: Median TTFT (ms) ↓ | Time per Output Token: Median TPOT (ms) ↓ | Inter-token Latency: Median ITL (ms) ↓ |
|---|---|---|---|
| cognitivecomputations/DeepSeek-R1-AWQ | 1585.45 | 55.41 | 43.06 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 1344.68 | 41.49 | 36.33 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g (this model) | 815.19 | 44.65 | 37.88 |

The GPTQ models are faster than the AWQ model across all metrics because they store fewer bits per parameter. Specifically, AWQ has to use a smaller group size of 64 (vs. 128 for GPTQ) to preserve accuracy, and it additionally stores zero-points due to its asymmetric quantization scheme.
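
As a back-of-the-envelope check, the sketch below compares the effective bits per parameter of the two schemes, assuming fp16 scales and int4 zero-points (a common layout, but not verified against either checkpoint's exact storage format).

```python
# Rough bits-per-parameter comparison for the two 4-bit schemes.
# Assumptions (not taken from the checkpoints): fp16 scales, int4 zero-points.
def bits_per_param(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    return weight_bits + (scale_bits + zero_point_bits) / group_size

gptq = bits_per_param(4, group_size=128)                    # symmetric: scales only
awq = bits_per_param(4, group_size=64, zero_point_bits=4)   # asymmetric: scales + zero-points
print(f"GPTQ 4b/128g: {gptq:.3f} bpp, AWQ 4b/64g: {awq:.3f} bpp")
# 4 + 16/128 = 4.125 vs 4 + 20/64 = 4.3125
```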

Contributors

Denis Kuznedelev (Yandex), Eldar Kurtić (Red Hat AI & ISTA), Jiale Chen (ISTA), Michael Goin (Red Hat AI), Elias Frantar (ISTA), Dan Alistarh (Red Hat AI & ISTA).
