Expected speed on some known hardware - data required

#1
by mechanicmuthu - opened

What would be the expected prompt processing speed in tokens/s (at 30k context and 2k output) for CPU + GPU (24GB)?
What would be the expected prompt processing speed in tokens/s (at 30k context and 2k output) for CPU only?
Info for both the q2 and the q4 ik-type quants, please.

It does not matter which CPU or which GPU, as long as you spell it out. This would give us a motivating data point for local use cases.

It depends on the CPU, how much RAM bandwidth you have in a single NUMA node, and what GPU you use.
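Since the answer hinges on memory bandwidth within a single NUMA node, it can help to check your topology before benchmarking. A minimal sketch, assuming a Linux box with numactl and lscpu available:

# list NUMA nodes, the cores in each, and the RAM attached to each node
numactl --hardware

# cross-check how cores map onto NUMA nodes
lscpu | grep -i numa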

Here is a great example benchmark showing:

AMD Epyc 9374F 384GB RAM + RTX 4090 48GB VRAM getting 70-100 tok/sec prompt processing depending on length, plus 8-11 tok/sec generation depending on length.

Some limited testing on an Intel Xeon 6980P without a GPU (single socket only, as dual socket has challenges, though PP does scale okay):

[Image: q4-speed-graph.png]

On an AMD Ryzen Threadripper PRO 7965WX (24 cores) with 256 GB RAM + RTX A6000 48GB VRAM it can handle the full 160k context, and the smaller q2 can achieve prompt processing around 100 tok/sec for shorter lengths.

ik_llama.cpp offers some of the best features available to tune both PP and TG, e.g. allowing a different number of threads for each.

Is there a specific target you want to hit? Is there specific hardware you have already?

Otherwise, I'd suggest you just give it a try and report back with your values! Cheers!

ubergarm changed discussion status to closed

FWIW, on an Epyc Gen 2 + DDR4 @ 3200 and a 4090, in a Proxmox VM with 45 cores, I get:
numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- \
    ik_llama.cpp/build/bin/llama-bench \
    --numa numactl \
    -t 45 \
    --model DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    -thp 1 -ctk q8_0 -ctv q8_0 \
    -mla 2 -fa 1 -amb 2048 -fmoe 1 \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU

| model | size | params | backend | ngl | threads | type_k | type_v | fa | mla | amb | thp | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 671B IQ2_K_R4 - 2.375 bpw | 226.00 GiB | 672.05 B | CUDA | 63 | 45 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | pp512 | 75.13 ± 1.56 |
| deepseek2 671B IQ2_K_R4 - 2.375 bpw | 226.00 GiB | 672.05 B | CUDA | 63 | 45 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.18 ± 0.10 |

numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- \
    ik_llama.cpp/build/bin/llama-bench \
    --numa numactl \
    -t 45 \
    --model DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    -thp 1 -ctk q8_0 -ctv q8_0 \
    -mla 2 -fa 1 -amb 2048 -fmoe 1 \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU

| model | size | params | backend | ngl | threads | type_k | type_v | fa | mla | amb | thp | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 63 | 45 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | pp512 | 65.22 ± 0.69 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 63 | 45 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | tg128 | 3.96 ± 0.02 |

@BernardH

Thanks for the results! Given that you are offloading attention onto a 4090, you can probably get faster speeds with full f16 instead of q8_0 for the KV cache. Also, if you need the extra VRAM for that, you can save a little by using a lower -amb 512 without sacrificing much speed. Maybe something like this:

numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- \
    ik_llama.cpp/build/bin/llama-bench \
    --numa numactl \
    -t 45 \
    --model DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    -thp 1 -ctk f16 -ctv f16 \
    -mla 2 \
    -fa 1 \
    -amb 512 \
    -fmoe 1 \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU 

Finally, you can specify different values of --threads (for TG) and --threads-batch (for PP). On some machines, using more threads for --threads-batch can give faster PP.
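For example, when launching llama-server from the same build you could split the thread counts; this is only a sketch with illustrative values (the 24/45 thread counts, host, and port are placeholders, not tuned for your rig):

# --threads is used for token generation, --threads-batch for prompt processing
ik_llama.cpp/build/bin/llama-server \
    --model DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    -ctk f16 -ctv f16 \
    -mla 2 -fa -amb 512 -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --threads 24 \
    --threads-batch 45 \
    --host 127.0.0.1 --port 8080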

I'd also mention using llama-sweep-bench to visualize PP and TG speeds across the entire context length.
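A sketch of such a sweep, reusing the flags from the llama-bench run above (the context size and thread counts are illustrative, and the exact flag syntax is worth double-checking against llama-sweep-bench --help):

numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- \
    ik_llama.cpp/build/bin/llama-sweep-bench \
    --model DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    -c 32768 \
    -ctk f16 -ctv f16 \
    -mla 2 -fa -amb 512 -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --threads 24 --threads-batch 45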

Nice rig, enjoy!
