RefalMachine committed on
Commit f32412f · verified · 1 Parent(s): 658c8fb

Update README.md

Files changed (1)
  1. README.md +18 -3
README.md CHANGED
@@ -10,11 +10,26 @@ base_model:
  - deepseek-ai/DeepSeek-V3-0324
  ---
 
- This is the BF16 model of DeepSeek-V3-0324. It is useful for quantization and for inference on GPUs that do not support FP8 (e.g. Nvidia Ampere).
-
- BF16 is the result of dequantizing the FP8 quantized weights from DeepSeek AI: https://huggingface.co/deepseek-ai/DeepSeek-V3-0324
-
- [GPTQModel](https://github.com/modelcloud/gptqmodel) is the go-to quantization toolkit for preparing DeepSeek-V3-0324 for inference on vLLM and SGLang.
+ # Channel-wise INT8 DeepSeek-V3-0324
+
+ The channel-wise INT8 quant of DeepSeek-V3-0324 for SGLang (https://github.com/sgl-project/sglang).
+ [PULL REQUEST](https://github.com/sgl-project/sglang/pull/3888)
+
+ ## 1. Quantization Process
+
+ We apply INT8 quantization to the BF16 checkpoints.
+
+ The quantization scales are determined by dividing the channel-wise maximum of the element values by the INT8 type maximum.
+
+ To generate the INT8 weights, run the provided script in the `./inference` directory:
+
+ ```
+ python3 bf16_cast_channel_int8.py --input-bf16-hf-path /path/to/bf16-weights/ --output-int8-hf-path /path/to/save-int8-weight/
+ ```
+
+ ## 2. Troubleshooting
+
+ Before inference, confirm that there is no "quantization_config" attribute in `config.json`.
+
+ ---
 
 
  # DeepSeek-V3-0324
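
As a reference for the quantization process described in the updated README, here is a minimal PyTorch sketch of channel-wise symmetric INT8 quantization. The function name, tensor layout (output channels along dim 0), and FP32 intermediate are illustrative assumptions, not the contents of `bf16_cast_channel_int8.py`:

```python
import torch

def quantize_channel_int8(weight: torch.Tensor):
    """Symmetric channel-wise INT8 quantization of a 2-D BF16 weight (sketch).

    Each output channel (row) gets its own scale:
        scale = max(|row|) / 127, where 127 is the INT8 type maximum.
    """
    w = weight.to(torch.float32)                      # compute scales in FP32
    channel_max = w.abs().amax(dim=1, keepdim=True)   # channel-wise maximum of |elements|
    scale = channel_max.clamp(min=1e-12) / 127.0      # guard against all-zero channels
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale.squeeze(1)                        # INT8 weight and per-channel scales

# Dequantization recovers an approximation of the original BF16 weight:
#   w_approx = q.to(torch.float32) * scale.unsqueeze(1)
```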
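
The troubleshooting step can likewise be checked programmatically. The snippet below is a hypothetical helper, not part of the repository, and assumes the placeholder output path used in the conversion command above:

```python
import json
from pathlib import Path

# Placeholder path: the directory passed to --output-int8-hf-path above.
config_path = Path("/path/to/save-int8-weight/config.json")
config = json.loads(config_path.read_text())

# The README asks that config.json contain no "quantization_config" attribute before inference.
if config.pop("quantization_config", None) is not None:
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    print("Removed quantization_config from config.json")
else:
    print("config.json has no quantization_config attribute; nothing to do")
```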