Expected speed on some known hardware - data required
What is the expected prompt-processing speed (tokens/s) at 30k context and 2k output for CPU + GPU (24GB)?
What is the expected prompt-processing speed (tokens/s) at 30k context and 2k output for CPU only?
Please include numbers for both the q2 and q4 ik-type quants.
It does not matter which CPU / GPU you use, as long as you spell it out. This will give us a motivating data point for local use cases.
It depends on the CPU, how much RAM bandwidth you have in a single NUMA node, and what GPU you use.
Here is a great example benchmark: an AMD Epyc 9374F with 384GB RAM + an RTX 4090 with 48GB VRAM gets 70-100 tok/sec prompt processing and 8-11 tok/sec generation, depending on length.
There has also been some limited testing on an Intel Xeon 6980P without a GPU (single CPU only, as dual socket has challenges, though it does scale PP okay).
On an AMD Ryzen Threadripper PRO 7965WX (24 cores) with 256 GB RAM + an RTX A6000 with 48GB VRAM, it can handle the full 160k context, and the smaller q2 achieves around 100 tok/sec prompt processing at shorter lengths.
ik_llama.cpp
offers some of the best features available for tuning both PP and TG, e.g. allowing a different number of threads for each.
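To give a rough idea, a hybrid CPU+GPU launch typically looks something like the sketch below; the model path, context size, and thread count are placeholders for illustration rather than a tuned config (and I'm assuming your build's llama-server accepts the same -mla/-fa/-amb/-fmoe/--override-tensor options as llama-bench):
# Sketch only: dense/attention layers go to the GPU, the MoE expert tensors
# stay in system RAM, and the KV cache is quantized to q8_0 to save VRAM.
ik_llama.cpp/build/bin/llama-server \
--model /models/your-IQ2_K_R4-quant.gguf \
--ctx-size 32768 \
-ctk q8_0 -ctv q8_0 \
-mla 2 -fa 1 -amb 2048 -fmoe 1 \
--n-gpu-layers 63 \
--override-tensor exps=CPU \
--threads 24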
Is there a specific target you want to hit? Do you already have specific hardware?
Otherwise, I'd suggest you just give it a try and report back with your values! Cheers!
FWIW, on a Gen 2 Epyc + DDR4 @ 3200 and a 4090, in a Proxmox VM with 45 cores, I get:
numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ik_llama.cpp/build/bin/llama-bench --numa numactl -t 45 --model DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf -thp 1 -ctk q8_0 -ctv q8_0 -mla 2 -fa 1 -amb 2048 -fmoe 1 --n-gpu-layers 63 --override-tensor exps=CPU
model | size | params | backend | ngl | threads | type_k | type_v | fa | mla | amb | thp | fmoe | test | t/s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
deepseek2 671B IQ2_K_R4 - 2.375 bpw | 226.00 GiB | 672.05 B | CUDA | 63 | 45 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | pp512 | 75.13 ± 1.56 |
deepseek2 671B IQ2_K_R4 - 2.375 bpw | 226.00 GiB | 672.05 B | CUDA | 63 | 45 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.18 ± 0.10 |
numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ik_llama.cpp/build/bin/llama-bench --numa numactl -t 45 --model DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -thp 1 -ctk q8_0 -ctv q8_0 -mla 2 -fa 1 -amb 2048 -fmoe 1 --n-gpu-layers 63 --override-tensor exps=CPU
model | size | params | backend | ngl | threads | type_k | type_v | fa | mla | amb | thp | fmoe | test | t/s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 63 | 45 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | pp512 | 65.22 ± 0.69 |
deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CUDA | 63 | 45 | q8_0 | q8_0 | 1 | 2 | 2048 | 1 | 1 | tg128 | 3.96 ± 0.02 |
Thanks for the results! Given you are offloading onto a 4090, you can probably get faster speeds by going with full f16
instead of q8_0
attention. If you need the extra VRAM for that, you can save a little by lowering -amb to 512
without sacrificing much speed. Maybe something like this:
numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- \
ik_llama.cpp/build/bin/llama-bench \
--numa numactl \
-t 45 \
--model DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
-thp 1 -ctk f16 -ctv f16 \
-mla 2 \
-fa 1 \
-amb 512 \
-fmoe 1 \
--n-gpu-layers 63 \
--override-tensor exps=CPU
Finally, you can specify different values for --threads
(used for TG) and --threads-batch
(used for PP). On some machines, giving --threads-batch
more threads allows faster PP.
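For example (placeholder model path; the right counts vary by machine, so sweep them rather than copying these):
# Illustration: generation is typically memory-bandwidth-bound, so physical
# cores are usually enough, while prompt processing is compute-bound and can
# benefit from more threads.
ik_llama.cpp/build/bin/llama-server \
--model /models/your-IQ4_K_R4-quant.gguf \
--threads 24 \
--threads-batch 45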
I'd also mention llama-sweep-bench
for visualizing PP and TG speeds across the entire context window.
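Something like the following (an untested sketch: I'm assuming llama-sweep-bench accepts the same common flags as llama-bench/llama-server, the model path is a placeholder, and it reuses the f16 KV cache and -amb 512 settings suggested above):
# Sketch: report PP and TG speeds as the context window fills up, with the
# same GPU-offload / CPU-expert split as the llama-bench runs.
ik_llama.cpp/build/bin/llama-sweep-bench \
--model /models/your-IQ4_K_R4-quant.gguf \
-c 32768 \
-ctk f16 -ctv f16 \
-mla 2 -fa 1 -amb 512 -fmoe 1 \
--n-gpu-layers 63 \
--override-tensor exps=CPU \
--threads 45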
Nice rig, enjoy!