Expected speed on some known hardware - data required
What is the expected prompt-processing speed (tokens/s) at 30k context and 2k output for CPU + GPU (24GB)?
What is the expected prompt-processing speed (tokens/s) at 30k context and 2k output for CPU only?
Please include numbers for both the q2 and q4 ik-type quants.
It does not matter which CPU / GPU you use, as long as you spell it out. This will give us a motivating data point for local use cases.
It depends on the CPU, how much RAM bandwidth you have in a single NUMA node, and what GPU you use.
Here is a great example benchmark: an AMD Epyc 9374F with 384GB RAM + an RTX 4090 with 48GB VRAM gets 70-100 tok/sec prompt processing and 8-11 tok/sec generation, depending on length.
There has also been some limited testing on an Intel Xeon 6980P without a GPU (single CPU only, as dual socket has challenges, though it does scale PP okay).
On an AMD Ryzen Threadripper PRO 7965WX (24 cores) with 256 GB RAM + an RTX A6000 with 48GB VRAM, it can handle the full 160k context, and the smaller q2 achieves around 100 tok/sec prompt processing at shorter lengths.
ik_llama.cpp
offers some of the best features available for tuning both PP and TG, e.g. allowing a different number of threads for each.
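To give a rough idea, a hybrid CPU+GPU launch typically looks something like the sketch below; the model path, context size, and thread count are placeholders for illustration rather than a tuned config (and I'm assuming your build's llama-server accepts the same -mla/-fa/-amb/-fmoe/--override-tensor options as llama-bench):
# Sketch only: dense/attention layers go to the GPU, the MoE expert tensors
# stay in system RAM, and the KV cache is quantized to q8_0 to save VRAM.
ik_llama.cpp/build/bin/llama-server \
--model /models/your-IQ2_K_R4-quant.gguf \
--ctx-size 32768 \
-ctk q8_0 -ctv q8_0 \
-mla 2 -fa 1 -amb 2048 -fmoe 1 \
--n-gpu-layers 63 \
--override-tensor exps=CPU \
--threads 24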
Is there a specific target you want to hit? Do you already have specific hardware?
Otherwise, I'd suggest you just give it a try and report back with your values! Cheers!
FWIW, on a Gen 2 Epyc + DDR4 @ 3200 and a 4090, in a Proxmox VM with 45 cores, I get:
numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ik_llama.cpp/build/bin/llama-bench --numa numactl -t 45 --model DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf -thp 1 -ctk q8_0 -ctv q8_0 -mla 2 -fa 1 -amb 2048 -fmoe 1 --n-gpu-layers 63 --override-tensor exps=CPU
model | size | params | backend | ngl | threads | type_k | type_v | fa | mla | amb | thp | fmoe | test | t/s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
deepseek2 671B IQ2_K_R4 - 2.375 bpw | 226.00 GiB | 672.05 B | CUDA | 63 | 45 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | pp512 | 75.13 ± 1.56 |
deepseek2 671B IQ2_K_R4 - 2.375 bpw | 226.00 GiB | 672.05 B | CUDA | 63 | 45 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.18 ± 0.10 |
numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ik_llama.cpp/build/bin/llama-bench --numa numactl -t 45 --model DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -thp 1 -ctk q8_0 -ctv q8_0 -mla 2 -fa 1 -amb 2048 -fmoe 1 --n-gpu-layers 63 --override-tensor exps=CPU
model | size | params | backend | ngl | threads | type_k | type_v | fa | mla | amb | thp | fmoe | test | t/s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 63 | 45 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | pp512 | 65.22 ± 0.69 |
deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 63 | 45 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | tg128 | 3.96 ± 0.02 |
Thanks for the results! Given you are offloading onto a 4090, you can probably get faster speeds by going with full f16
instead of q8_0
attention. If you need the extra VRAM for that, you can save a little by lowering -amb to 512
without sacrificing much speed. Maybe something like this:
numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- \
ik_llama.cpp/build/bin/llama-bench \
--numa numactl \
-t 45 \
--model DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
-thp 1 -ctk f16 -ctv f16 \
-mla 2 \
-fa 1 \
-amb 512 \
-fmoe 1 \
--n-gpu-layers 63 \
--override-tensor exps=CPU
Finally, you can specify different values for --threads
(used for TG) and --threads-batch
(used for PP). On some machines, giving --threads-batch
more threads allows faster PP.
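For example (placeholder model path; the right counts vary by machine, so sweep them rather than copying these):
# Illustration: generation is typically memory-bandwidth-bound, so physical
# cores are usually enough, while prompt processing is compute-bound and can
# benefit from more threads.
ik_llama.cpp/build/bin/llama-server \
--model /models/your-IQ4_K_R4-quant.gguf \
--threads 24 \
--threads-batch 45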
I'd also mention llama-sweep-bench
for visualizing PP and TG speeds across the entire context window.
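Something like the following (an untested sketch: I'm assuming llama-sweep-bench accepts the same common flags as llama-bench/llama-server, the model path is a placeholder, and it reuses the f16 KV cache and -amb 512 settings suggested above):
# Sketch: report PP and TG speeds as the context window fills up, with the
# same GPU-offload / CPU-expert split as the llama-bench runs.
ik_llama.cpp/build/bin/llama-sweep-bench \
--model /models/your-IQ4_K_R4-quant.gguf \
-c 32768 \
-ctk f16 -ctv f16 \
-mla 2 -fa 1 -amb 512 -fmoe 1 \
--n-gpu-layers 63 \
--override-tensor exps=CPU \
--threads 45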
Nice rig, enjoy!