Thank you but some issues

#2
by MB7977 - opened

Appreciate these new quants which now seem to support MLA.

Heads up for others — I experienced severe degradation at long context with the UD-IQ2_M and UD-Q2_K_XL quants. Without dropping cache type k to Q8 I was getting random Chinese and gibberish in response to a 7K token prompt on my setup. This was not the case with the previous quants made prior to the MLA commit to llama.cpp. I have a suspicion something is awry with the changes made in that commit.

Dropping to Q8 k cache resulted in non-gibberish, decent, if subpar results compared to old quants. Short context prompts were fine with either Q8 or FP16 cache.
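
For anyone wanting to try the same workaround, this is roughly the invocation I mean (model path and layer count are placeholders for my setup):

./llama-server -m DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf -c 8192 -ngl 20 -ctk q8_0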

I get the same issue, I'm using a mix of CUDA + CPU though (128GB VRAM + 192GB RAM)

I get gibberish with any prompt larger than about 2048 tokens, or with longer context in general (like resuming a chat).

The old version doesn't suffer from this, but the cache uses way more VRAM.

Yeah, I’m using a mix of CUDA and CPU too (120GB VRAM + 256GB RAM with a Threadripper Pro 5965 CPU).

Have you tried -ctk q8_0? That got it working on longer contexts for me. But I’m still validating if there’s a quality loss. I’m also finding that my CPU temps are up from 70C with the old combo to 85-90C now. I think it’s the MLA commit causing these issues but not 100% sure as it seems Unsloth are also using a new methodology to make these dynamic quants so maybe it’s an interaction thing.

I tried with -ctk q8_0 but got the same issue sadly :(

I have a Ryzen 7 7800X3D, 5090+4090x2+A6000, tested on Fedora 42.

Interesting. Doesn’t make much sense to me that q8_0 k cache fixed the long context issues for me, I was just randomly running through different things to try and resolve it. I usually just run with FP16 kv cache. Without the q8 k cache I was always getting a couple of lines of combined Chinese/English/Russian then an EOS. Something definitely wrong there. Might be an issue with the context shifting that was part of the MLA commit?

Unsloth AI org

Oh my - I'll investigate asap - sorry about the issues! It might be due to llama.cpp's new MLA implementation

Did y'all make a new imatrix with the latest llama-cpp including the MLA patches or is this an older imatrix dat?

Just curious, as yours says 720 entries, which is what I've seen using the ik_llama.cpp fork as it has supported MLA longer... but I've heard making a new imatrix might say 721 entries and throw an error complaining about attn_k_b_weight being the wrong size, as mentioned in this github PR here

Might be a clue, or a red herring, not sure, just sharing what I've seen! Thanks!

Btw are you guys using CPU Offloading? We think that might be the issue because full GPU offloading works fine.

And which bits are you guys using?

We will be reuploading them nonetheless

Same issue here with CPU offloading, Threadripper with 128GB DDR5 + 5xRTX3090's.

I'm using ~ 25 layers on GPU and the rest on CPU in my case. Q2-K-XL.

Yes, similar here, I’m offloading 20 layers to my GPUs, the rest is CPU. I’ve tried the UD-IQ2_M and the UD-Q2_K_XL. I’ve used the old UD-Q2_K_XL in the past with no issues.

I'm seeing the nonsense output with UD-IQ3_XXS, even with a context size as low as 1000. It's not just the UD Q2 versions that show this issue. @MB7977 's workaround of using -ctk q8_0 does work for me (thanks for that), even with context over 10000. I have 32 GB VRAM and 128 GB main memory (which means I'm also running the model directly from my SSD drive). I'm offloading 4 layers to the GPU.

Unsloth AI org

Wait could this be related - https://github.com/ggml-org/llama.cpp/pull/13113

I'll test CPU offloading and report back!

No luck with the latest commit for me, unfortunately. I suspect it’s a bug in the original MLA commit.

Sorry I should have clarified. I don't think there's any issue with your quants. It's something wrong with llama.cpp's MLA implementation when using CUDA+CPU.

I had the same problem a few weeks ago when I built the MLA PR of llama.cpp and tried various R1 MLA quants (eg. regular Q3_K). I probably should have raised it on github but was too busy / didn't realize CUDA with CPU offload was an edge case.

@gghfez

It's something wrong with llama.cpp's MLA implementation when using CUDA+CPU.

Thanks for the report, I still haven't tried the recently merged MLA features in mainline llama.cpp yet. For any intrepid users, the ik_llama.cpp fork has had it working for a while now, and I have an ik_llama.cpp-exclusive quant, ubergarm/DeepSeek-V3-0324-GGUF, that people have been asking me to compare with these quants.

I hope to test MLA on mainline llama.cpp more thoroughly once I get access to a big RAM rig again soon, especially given bartowski's issues with imatrix on that mentioned above.

Cheers!

Sadly I get a similar issue with IQ2_K_R4 https://github.com/ikawrakow/ik_llama.cpp/issues/305

Not exactly the same, as I get just "DDDDD" there, while on mainline llama.cpp I get gibberish (symbols and random letters)

We will be reuploading them nonetheless

It looks like all files were replaced indeed. What was fixed?

Unsloth AI org

@MB7977 @Panchovix @ubergarm @gghfez @qaraleza @Garf

Reuploaded all of them yes! Could you guys check if it got fixed? It should be :)

@shimmyshimmer Thanks a lot! I just downloaded the new DeepSeek-V3-0324-UD-IQ1_S and gave it a quick test. The problem persists with CUDA + CPU.

Model output example:
Message 1: "Hi"
Model response: "Hello! How can I help you today? 😊"
Message 2:
Model response: "55235A?@0!3'&!EC,."

Fairly low context:

prompt eval time =   10193.24 ms /   571 tokens (   17.85 ms per token,    56.02 tokens per second)
       eval time =    2250.50 ms /    19 tokens (  118.45 ms per token,     8.44 tokens per second)
      total time =   12443.74 ms /   590 tokens

I really don't think there's anything wrong with your quants; it's a llama.cpp + (CUDA without offloading all layers) + MLA issue. I've had the same thing happen with other quants / building the MLA PR before it was merged.
I also tried compiling llama.cpp with Vulkan instead of CUDA and tested it with the same settings; I couldn't reproduce the problem with either of your uploads.
(Performance is unusable on Vulkan though with 2 t/s prompt processing / 3-4 t/s generation and higher gpu power usage)
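
For reference, the Vulkan test build was just a matter of swapping the backend flag at configure time, if I'm remembering the flag name right (Vulkan SDK already installed):

cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j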

Unsloth AI org

@gghfez oh ok :( thanks for letting us know that's unfortunate

Unsloth AI org

Wonder if we could do a 'git revert' to the MLA commits and see how it behaves with these new versions, mostly to keep the -ot parameter.

I also get the same issue, so as @gghfez says, I don't think it is an issue with your quants, but an issue when offloading and using CUDA with MLA. I think a way to disable MLA would be great, since for now it seems to be forced.

Thank you for that. I was going to open an issue but it'll get more weight from you guys. If it can't be fixed, I wonder if including an -mla flag might be an option going forward. That seemed to be part of the original PR. I'd love for MLA to work, but for the moment my main priority is still being able to use DeepSeek quants with the latest builds of llama.cpp, especially with R2 not far away.

Thank you again for being so helpful and engaged.

Unsloth AI org

@Panchovix @MB7977

Have you guys tried the latest llama.cpp version? Apparently it fixes it? https://github.com/ggml-org/llama.cpp/pull/13137
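
For anyone updating, a pull and rebuild from your llama.cpp checkout along these lines should be enough (assuming a CUDA build):

git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j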

No luck with the latest commit for me, unfortunately.

Unsloth AI org

No luck with the latest commit for me, unfortunately.

Oh rip. 😔 https://github.com/ggml-org/llama.cpp/pull/12801#issuecomment-2835304458

I got it working!

prompt eval time =  116176.21 ms /  8150 tokens (   14.25 ms per token,    70.15 tokens per second) 👈 that's a lot faster than what I had before!
eval time =  114519.95 ms /   875 tokens (  130.88 ms per token,     7.64 tokens per second) 👈 This is above 10 at lower contexts
 total time =  230696.16 ms /  9025 tokens 

It's stable at 8k context. No more garbage outputs, and it's got its "personality" back at low contexts (feels like Deepseek again). Prompt processing is also faster than just using -ngl <whatever I can fit>

Whatever the offload-to-CPU issue is, it doesn't seem to affect expert tensors. So I ended up with this:

-ngl 99 -v --override-tensor 'blk\.(2[5-9]|[3-5][0-9]|60)\..*_exps\.=CPU' --override-tensor 'blk\.([1-4])\..*_exps\.=CUDA1' --override-tensor 'blk\.([5-9])\..*_exps\.=CUDA2' --override-tensor 'blk\.(1[0-4])\..*_exps\.=CUDA0' --override-tensor 'blk\.(1[5-9])\..*_exps\.=CUDA4' --override-tensor 'blk\.(2[0-4])\..*_exps\.=CUDA3'

The important thing is -ngl 99 to ensure all the non-expert layers are on CUDA devices, then put the experts on CPU.

To put them all on CPU, set ngl 99 then add this flag:

-ot "\d+.ffn_.*_exps.=CPU"

But that's too slow for me as I don't have the DDR5 capacity and end up memory-mapped to SSD.
So I started spreading experts across my CUDA devices to get system memory usage below 120GB, like this:

  1. Replaced -ot "\d+.ffn_.*_exps.=CPU" with this to only offload experts 25-60 to CPU:
--override-tensor 'blk\.(2[5-9]|[3-5][0-9]|60)\..*_exps\.=CPU'

Change it to match however many experts you can't fit onto CUDA ^

  2. Individually assign experts to each CUDA device. For example, this puts experts 1-4 on CUDA1:
--override-tensor 'blk\.([1-4])\..*_exps\.=CUDA1'
  3. Do the same for all CUDA devices (prepare for trial and error / CUDA OOM to get it right):
--override-tensor 'blk\.([1-4])\..*_exps\.=CUDA1' \
--override-tensor 'blk\.([5-9])\..*_exps\.=CUDA2' \
--override-tensor 'blk\.(1[0-4])\..*_exps\.=CUDA0' \
--override-tensor 'blk\.(1[5-9])\..*_exps\.=CUDA4' \
--override-tensor 'blk\.(2[0-4])\..*_exps\.=CUDA3'

And be sure to verify that every expert is accounted for between the CPU and CUDA devices ^ or any you miss will end up on CUDA0 and OOM
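
A quick way to sanity-check which blocks a pattern covers before loading is to just enumerate the layer indices in plain shell (DeepSeek V3 has blocks 0-60; adjust for your model):

# should print blocks 25 through 60 and nothing else
# (just testing the block-index part of the regex; not every block necessarily has _exps tensors)
for i in $(seq 0 60); do echo "blk.$i.ffn_gate_exps.weight"; done \
  | grep -E 'blk\.(2[5-9]|[3-5][0-9]|60)\..*_exps\.'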

I also added the -v to see where each tensor was being assigned (it's super verbose).

Here's "Hello", performance is better at lower context

prompt eval time =     977.57 ms /     9 tokens (  108.62 ms per token,     9.21 tokens per second)   👈 that just looks slow because there are less than 70 tokens in the prompt
       eval time =     874.05 ms /    12 tokens (   72.84 ms per token,    13.73 tokens per second) 
      total time =    1851.63 ms /    21 tokens

That's great @gghfez !

Is there a way to see which tensors are the expert tensors? How big are the active parameters, about 40B? How many experts do we have? Sorry for so many questions.

I have 192GB RAM + 128GB VRAM. I've looked at -ot a bit but I haven't understood how to use it. I have a 4090 + 4090 + 5090 + A6000 (devices ordered like that).

In theory we would want to have all the active params + some experts on GPU, and the rest of the experts on CPU? Then, with no active params on CPU, the issue shouldn't happen? Is there a way to see how much RAM/VRAM each expert uses?

Okay, after tinkering for like 5 hours, I can confirm @gghfez 's finding.

If you load all the active parameters onto GPU and then put some experts on CUDA and the rest on CPU, it works fine. My layer split is pretty rough, but I made it work for now.

My speeds are pretty bad though, probably because I'm using X16/X4/X4/X4 instead of something like X8/X8/X8/X8.

prompt eval time =  432377.39 ms /  3070 tokens (  140.84 ms per token,     7.10 tokens per second)
       eval time =   44220.34 ms /   307 tokens (  144.04 ms per token,     6.94 tokens per second)

And no gibberish.

So I guess there is a bug when the active params are split across CPU + CUDA with MLA.

EDIT: After tinkering a bit got better speeds

prompt eval time =  146999.55 ms /  3070 tokens (   47.88 ms per token,    20.88 tokens per second)
       eval time =   34334.69 ms /   257 tokens (  133.60 ms per token,     7.49 tokens per second)
./llama-server -m '/home/GGUFs/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf' -c 16384 --no-mmap --no-warmup -v -ngl 99 --override-tensor 'blk\.(2[5-9]|[3-6][0-9])\..*_exps\.=CPU' --override-tensor 'blk\.([1-6])\..*_exps\.=CUDA0' --override-tensor 'blk\.([7-9]|1[0])\..*_exps\.=CUDA1' --override-tensor 'blk\.(1[1-5])\..*_exps\.=CUDA2' --override-tensor 'blk\.(1[6-9]|2[0-4])\..*_exps\.=CUDA3'

My CUDA0 is saturated at 26 GB/s during prompt processing. Make sure that's happening on your X16 GPU. There's some llama.cpp flag to do this iirc, but I got lucky this time and it's happening by default now.

Also note that mla is slower at longer contexts. The performance is more like when I quantized the K cache with the older builds (8 t/s at 4k context instead of closer to 10).
They know about this trade-off vs needing 4MB of VRAM per token of context with the old builds.

Oh and regarding

Wonder if we could do a 'git revert' to the MLA commits and see how it behaves with these new versions, mostly to keep the -ot parameter.

That won't work; the new quants require MLA. I remember reading that somewhere on GitHub.
I also just tested my old llama.cpp build where I merged pr-11397 to use -ot (this is how I run the normal DS quants) and it can't load these ones.

Hmm, I can't get any of these solutions to work for me - I am trying the DeepSeek-V3-0324-UD-Q2_K_XL quant, on the latest llama.cpp. I've tried every combination of everything: fa on/off, ctk at 8_0, full cpu offload, only experts on cpu - everything gives me the same gibberish results.

full cpu offload

If you're doing this to rule out the CUDA+non-experts-on-cpu issue and you built llama.cpp with cuda support, try prefixing your command with CUDA_VISIBLE_DEVICES='' to hide all your GPUs from it.

eg:

CUDA_VISIBLE_DEVICES='' ./build/bin/llama-server -m  ...
Unsloth AI org

I will do a more thorough investigation over the weekend - apologies for the issues - I'll communicate with the llama.cpp folks and see how we can solve it!

Thank you.

I think it's important to get to the root of the bug with llama.cpp/MLA. These are complicated workarounds rather than a proper fix. As it stands, partial offloading of DeepSeek models in llama.cpp seems to be broken without a lot of fiddling that most people will not be aware is necessary/possible.

@gghfez sorry to bump, but do you know the flag when you say:

My CUDA0 is saturated at 26 GB/s during prompt processing. Make sure that's happening on your X16 GPU. There's some llama.cpp flag to do this iirc, but I got lucky this time and it's happening by default now.

I noticed it is using a slower GPU in my case (4.0 X8) instead of a faster GPU (5.0 X8), so it saturates at 13 GB/s instead of 26 GB/s.

Wow, disregard me - I had a hunch that something was off so I checked my sha256 sums and sure enough one of the splits was corrupted - will re-test when I redownload it. I guess the huggingface cli doesn't bother doing any checksums.
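
For anyone else wanting to rule out a bad download, something like this and then compare against the SHA256 values shown on each file's page in the repo (filenames will differ per quant):

sha256sum DeepSeek-V3-0324-UD-Q2_K_XL-0000*-of-00006.gguf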

@Panchovix Looking at my old script, seems like I used the CUDA_VISIBLE_DEVICES= to re-order them.

eg. CUDA_VISIBLE_DEVICES=3,0,1,2,4

where 3 is my 16x card

Test it like this:

./build/bin/llama-server --list-devices

vs

CUDA_VISIBLE_DEVICES=3,0,1,2,4 ./build/bin/llama-server --list-devices

Perfect, that did the trick, and indeed it does saturate at 26-27 GB/s.

Now I wonder if running it at X16 5.0 would help...

Okay, just an update: X16 5.0 doesn't help much. It seems to top out at around 28 GiB/s, so I guess it's a limitation for now.

The latest commit over at llama.cpp (CUDA: fix logic for clearing padding with -ngl 0) seems to have resolved the issue. For me it now runs without any need to balance experts, or specify q8_0 k cache.

Just as info for those of us who use CUDA + CPU, there is a PR for FA + MLA: https://github.com/ggml-org/llama.cpp/pull/13306

I tried it, added some more layers to GPUs (as compute buffers are really small) and then modified ubatch from 512 to 1024 (this uses more VRAM!)

./llama-server -m '/GGUFs/DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf' -c 65536 --no-mmap --no-warmup -v -ngl 99 --override-tensor 'blk\.([0-7])\..*_exps\.=CUDA0' --override-tensor 'blk\.([8-9]|1[0-1])\..*_exps\.=CUDA1' --override-tensor 'blk\.(1[2-6])\..*_exps\.=CUDA2' --override-tensor 'blk\.(1[7-9]|2[0-6])\..*_exps\.=CUDA3' -fa --override-tensor 'blk\..*_exps\.=CPU' -mg 0 --ubatch-size 1024

And PP speeds increased a lot

prompt eval time =   34965.38 ms /  3565 tokens (    9.81 ms per token,   101.96 tokens per second)
       eval time =   45389.59 ms /   416 tokens (  109.11 ms per token,     9.17 tokens per second)

Then I tried with 1536 and then again it increased PP speed

prompt eval time =   28097.73 ms /  3565 tokens (    7.88 ms per token,   126.88 tokens per second)
       eval time =   43426.93 ms /   404 tokens (  107.49 ms per token,     9.30 tokens per second)

I think it may reach the point where removing some layers from GPU and increasing ubatch gives a big PP speed boost for only a small hit on gen speed.
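
As a rough sketch of what I mean, keeping fewer expert blocks on the GPUs to free VRAM for a bigger ubatch (the exact split and the 2048 are just guesses for my setup):

./llama-server -m '/GGUFs/DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf' -c 65536 --no-mmap -ngl 99 -fa -mg 0 --ubatch-size 2048 --override-tensor 'blk\.([0-5])\..*_exps\.=CUDA0' --override-tensor 'blk\.([6-9])\..*_exps\.=CUDA1' --override-tensor 'blk\.(1[0-3])\..*_exps\.=CUDA2' --override-tensor 'blk\.(1[4-9]|2[0-1])\..*_exps\.=CUDA3' --override-tensor 'blk\..*_exps\.=CPU'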

100 t/s is great!

I can't get it to run much more than a "hi" prompt though:

llamacpp_pr13306/ggml/src/ggml-cuda/mmvq.cu:519: GGML_ASSERT(!src0->view_src) failed

Probably this commit breaking it for me:

https://github.com/ggml-org/llama.cpp/commit/2356fb1d53c86d838756211010bbabfafda7cb94

-GGML_ASSERT(ggml_is_contiguous(src0));
+GGML_ASSERT(ggml_is_contiguously_allocated(src0));
+GGML_ASSERT(!src0->view_src);

@gghfez Oh, I'm using the PR branch itself, which I think doesn't have that commit. That is interesting, hmmm

Using the PR itself fixed it!

1024:

prompt eval time =   43703.13 ms /  4526 tokens (    9.66 ms per token,   103.56 tokens per second)
       eval time =   33580.05 ms /   363 tokens (   92.51 ms per token,    10.81 tokens per second)
      total time =   77283.18 ms /  4889 tokens

1536 looks the same for me:

prompt eval time =   43957.51 ms /  4526 tokens (    9.71 ms per token,   102.96 tokens per second)
       eval time =   59601.08 ms /   639 tokens (   93.27 ms per token,    10.72 tokens per second)
      total time =  103558.60 ms /  5165 tokens

Looks like it's coming to ik_llama.cpp too

https://github.com/ikawrakow/ik_llama.cpp/pull/386/commits

I think it may reach the point where removing some layers from GPU and increasing ubatch gives a big PP speed boost for only a small hit on gen speed.

I've got more VRAM than system memory so I'll either offload more layers, or run a larger quant.
Thanks for posting about this, otherwise I probably would have just built main in a few days, hit the new bug and given up / not noticed this PR.
