GGUF creation process
Do you have a how-to, or could you recommend some links for how you create your GGUFs? I just started researching and have some places to start, but I would really appreciate any TL;DR or lessons-learned guide if you have one.
Well, the process is deceptively simple: you use convert_hf_to_gguf.py to convert a model directory to a GGUF file, and you have it. You can then use llama-quantize to reduce the bits per weight. Things can go wrong, but overall, the process has improved considerably in the last year.
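If it helps, here is a minimal sketch of those two steps, assuming a llama.cpp checkout and a downloaded Hugging Face model directory (the paths, filenames and quant type are just placeholders):

```bash
# 1) Convert the Hugging Face model directory to a full-precision GGUF
python convert_hf_to_gguf.py ./my-model --outtype f16 --outfile my-model-f16.gguf

# 2) Quantize it down to fewer bits per weight (Q4_K_M as an example)
./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```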
And you don't need a lot of resources either.
I have a handful of T4s, 3090s and a 4060 16GB. I assume a single 3090 will work?
For conversions/quantization I used to use just a spare CPU machine, no GPU at all. The process is pretty efficient and didn't take a terrible amount of time.
You only need GPUs if you want to compute an imatrix to generate weighted/imatrix quants, for which any GPU with at least 1 GiB of GPU memory will do, no matter the model size, assuming you use -ngl 0. The performance difference between offloading and not offloading is quite minimal for imatrix computation. You just need enough RAM to fit the unquantized model.
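For reference, a rough sketch of what that looks like with llama.cpp's llama-imatrix tool (the filenames are placeholders; calibration.txt stands for whatever text file you use as calibration data):

```bash
# Compute the importance matrix; -ngl 0 keeps all layers in system RAM,
# so a GPU with as little as 1 GiB of memory is enough regardless of model size
./llama-imatrix -m my-model-f16.gguf -f calibration.txt -o imatrix.dat -ngl 0
```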
I was going to look into imatrix next. I've read everything here - https://huggingface.co/mradermacher/model_requests and here - https://github.com/ggml-org/llama.cpp/blob/master/examples/imatrix/README.md. Do I even need to worry about imatrix if I have enough GPU VRAM?
I wouldn't. But that is just me.
> Do I even need to worry about imatrix if I have enough GPU VRAM?
Just look at the plots on https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2 (you will need to expand the hidden messages) and decide for yourself. I highly recommend only using imatrix quants, as they give you a much better quality/size ratio. Even if you can fit the entire model into GPU memory, i1-Q5_K_M will run much faster than static Q8, and the quality difference is so tiny that you will never notice any difference in real-world applications for monolithic models larger than 8B; for models smaller than that, or models with experts, I would go for i1-Q6.
Thanks. Even if I don't use it, I'd probably like to learn how to do it. Any tips for creating the imatrix training data? There seem to be a lot of references to wiki.train.raw, but would output from the model to be quantized be better? If so, any recommendations for creating that data set?
> Even if you can fit the entire model into GPU memory, i1-Q5_K_M will run much faster than static Q8 [...]
Perhaps I should reconsider what I'm doing? I have stayed away from imatrix thinking they were lesser quality too, for those low on VRAM (or none).
> Any tips for creating the imatrix training data?
Just use the one from bartowski1182 at https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8 and double its size by adding your own high-quality data to it, or use https://huggingface.co/Lewdiculous/Datura_7B-GGUF-Imatrix/blob/main/imatrix-with-rp-format-data.txt if you are too lazy to create your own dataset to make up for the shortcomings of bartowski1182's, such as its lack of roleplay/story-writing data.
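Once you have computed an imatrix.dat from that calibration text (as sketched earlier), the quantization step only gains one flag; the filenames here are again placeholders:

```bash
# Pass the importance matrix to llama-quantize to get a weighted/imatrix quant
./llama-quantize --imatrix imatrix.dat my-model-f16.gguf my-model-i1-Q5_K_M.gguf Q5_K_M
```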
> I have stayed away from imatrix thinking they were lesser quality too, for those low on VRAM (or none).
Especially for users with low GPU memory, imatrix quants are almost a requirement. In a quality-per-GPU-memory-usage comparison, the imatrix quants always win. The lower the BPW of the quants you use, the bigger the difference between static and imatrix quants usually gets. Below Q4, static quants are in my opinion almost useless, as imatrix quants are far superior in terms of quality. Generally, I just can't think of a single reason why anyone would use static quants over imatrix quants, except that static quants are easier and less resource-intensive to generate.
Here are some plots from Meta-Llama-3.1-8B-Instruct so you don't have to keep scrolling through BabyHercules, but please go there if you want to see the many other plots I posted:
I'll look into this more, thanks. (I normally run Q8s, but that might change.)