DeepSeek-R1-GPTQ-4b-128g

Model Overview

This model was obtained by quantizing the weights of deepseek-ai/DeepSeek-R1 to the INT4 data type. This optimization reduces the number of bits per parameter from 8 to 4, cutting the disk size and GPU memory requirements by approximately 50%.

All layers within the transformer blocks are compressed. Weights are quantized with a symmetric per-group scheme (group size 128), using the GPTQ algorithm.
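
For intuition, the sketch below shows symmetric per-group INT4 quantization with group size 128 using plain round-to-nearest. This is only an illustration of the weight layout: GPTQ itself chooses the quantized values with a Hessian-based error-correction loop, which is omitted here, and the tensor shapes and helper names are made up for the example.

```python
# Illustrative sketch: symmetric per-group INT4 quantization (group size 128).
# Round-to-nearest is shown for clarity; GPTQ additionally applies error correction
# based on second-order (Hessian) information when selecting quantized values.
import torch

def quantize_per_group_sym(w: torch.Tensor, group_size: int = 128, bits: int = 4):
    """Quantize a 2-D weight [out_features, in_features] with one scale per group."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for INT4; symmetric => no zero-point
    out_f, in_f = w.shape
    w_g = w.reshape(out_f, in_f // group_size, group_size)
    scale = w_g.abs().amax(dim=-1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w_g / scale), min=-qmax - 1, max=qmax).to(torch.int8)
    return q.reshape(out_f, in_f), scale.squeeze(-1)

def dequantize_per_group_sym(q: torch.Tensor, scale: torch.Tensor, group_size: int = 128):
    out_f, in_f = q.shape
    q_g = q.reshape(out_f, in_f // group_size, group_size).to(scale.dtype)
    return (q_g * scale.unsqueeze(-1)).reshape(out_f, in_f)

w = torch.randn(256, 512)
q, s = quantize_per_group_sym(w)
print((w - dequantize_per_group_sym(q, s)).abs().mean())   # mean reconstruction error
```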

The model checkpoint is saved in the compressed_tensors format.
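
A minimal usage sketch with vLLM is shown below. The tensor-parallel degree and sampling settings are illustrative, not prescriptive; even at INT4 the model still spans multiple GPUs.

```python
# Minimal sketch: loading the compressed-tensors checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g",
    tensor_parallel_size=8,      # illustrative; DeepSeek-R1 requires a multi-GPU node
    trust_remote_code=True,
)

sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Prove that the square root of 2 is irrational."], sampling)
print(outputs[0].outputs[0].text)
```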

| Model | Experts quantized | Attention blocks quantized | Size (GB) |
|---|:---:|:---:|---:|
| deepseek-ai/DeepSeek-R1 | ✗ | ✗ | 671 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g (this model) | ✓ | ✓ | 325 |
| cognitivecomputations/DeepSeek-R1-AWQ | ✓ | ✓ | 340 |

Evaluation

This model was evaluated on the OpenLLM v1 benchmarks and reasoning tasks (AIME-24, GPQA-Diamond, MATH-500).

Model outputs were generated with the vLLM engine.

For reasoning tasks, we estimate pass@1 based on 10 runs with different seeds, using temperature=0.6, top_p=0.95, and max_new_tokens=32768.
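
Concretely, the pass@1 estimate is the fraction of correct completions averaged over the sampled runs per problem. A minimal sketch is below; the 0/1 correctness matrix is a placeholder, not real evaluation data.

```python
# Sketch of the pass@1 estimate: average correctness over k sampled runs per problem.
import numpy as np

def pass_at_1(correct: np.ndarray) -> float:
    """correct: [num_problems, num_runs] array of 0/1 correctness flags."""
    return float(correct.mean(axis=1).mean())

correct = np.random.randint(0, 2, size=(30, 10))   # e.g. 30 problems x 10 seeds (placeholder)
print(f"pass@1 = {pass_at_1(correct):.4f}")
```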

OpenLLM Leaderboard V1 tasks

| Model | Recovery (%) | Average Score | ARC-Challenge (acc_norm, 25-shot) | GSM8k (exact_match, 5-shot) | HellaSwag (acc_norm, 10-shot) | MMLU (acc, 5-shot) | TruthfulQA (mc2, 0-shot) | WinoGrande (acc, 5-shot) |
|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1 | 100.00 | 81.04 | 72.53 | 95.91 | 89.30 | 87.22 | 59.28 | 82.00 |
| cognitivecomputations/DeepSeek-R1-AWQ | 100.07 | 81.10 | 73.12 | 95.15 | 89.07 | 86.86 | 60.09 | 82.32 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g (this model) | 99.86 | 80.93 | 72.70 | 95.68 | 89.25 | 86.83 | 58.77 | 82.32 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 100.30 | 81.28 | 72.53 | 95.68 | 89.36 | 86.99 | 59.77 | 83.35 |

Reasoning tasks (AIME-24, GPQA-Diamond, MATH-500)

| Model | Recovery (%) | Average Score | AIME 2024 (pass@1) | MATH-500 (pass@1) | GPQA Diamond (pass@1) |
|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1 | 100.00 | 82.99 | 78.33 | 97.24 | 73.38 |
| cognitivecomputations/DeepSeek-R1-AWQ | 94.29 | 78.25 | 70.67 | 93.64 | 70.46 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g (this model) | 96.52 | 80.10 | 72.96 | 97.09 | 70.26 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 98.81 | 82.00 | 77.00 | 97.08 | 71.92 |

Reproduction

The results were obtained using the following commands:

OpenLLM v1

```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale
MODEL_ARGS="pretrained=$MODEL,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True"

lm_eval \
  --model vllm \
  --model_args $MODEL_ARGS \
  --tasks openllm \
  --batch_size auto
```

For reasoning evals we adopted the protocol from the open-r1 repository.

Reasoning tasks

```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=38768,gpu_memory_utilization=0.8,tensor_parallel_size=8,add_special_tokens=false,generation_parameters={\"max_new_tokens\":32768,\"temperature\":0.6,\"top_p\":0.95,\"seed\":7686}"

export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:diamond|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```

Please use the vLLM build from this pull request: https://github.com/vllm-project/vllm/pull/16038

Performance benchmarking

We follow the standard vLLM performance benchmarking protocol with the ShareGPT dataset and observe the following metrics (lower is better):

| Model | Time to First Token: Median TTFT (ms) ↓ | Time per Output Token: Median TPOT (ms) ↓ | Inter-token Latency: Median ITL (ms) ↓ |
|---|---|---|---|
| cognitivecomputations/DeepSeek-R1-AWQ | 1585.45 | 55.41 | 43.06 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 1344.68 | 41.49 | 36.33 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g (this model) | 815.19 | 44.65 | 37.88 |

The GPTQ models are faster than the AWQ model across all metrics because they store fewer bits per parameter. Specifically, AWQ has to use a smaller group size of 64 (vs. 128 for GPTQ) to preserve accuracy, and it additionally stores zero-points due to its asymmetric quantization scheme.
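
As a back-of-the-envelope check, the sketch below compares the effective bits per parameter of the two schemes, assuming fp16 scales and int4 zero-points (a common layout, but not verified against either checkpoint's exact storage format).

```python
# Rough bits-per-parameter comparison for the two 4-bit schemes.
# Assumptions (not taken from the checkpoints): fp16 scales, int4 zero-points.
def bits_per_param(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    return weight_bits + (scale_bits + zero_point_bits) / group_size

gptq = bits_per_param(4, group_size=128)                    # symmetric: scales only
awq = bits_per_param(4, group_size=64, zero_point_bits=4)   # asymmetric: scales + zero-points
print(f"GPTQ 4b/128g: {gptq:.3f} bpp, AWQ 4b/64g: {awq:.3f} bpp")
# 4 + 16/128 = 4.125 vs 4 + 20/64 = 4.3125
```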

Contributors

Denis Kuznedelev (Yandex), Eldar Kurtić (Red Hat AI & ISTA), Jiale Chen (ISTA), Michael Goin (Red Hat AI), Elias Frantar (ISTA), Dan Alistarh (Red Hat AI & ISTA).
