Compilade

AI & ML interests

None yet

compilade's activity

New activity in microsoft/bitnet-b1.58-2B-4T-gguf 17 days ago

TQ1 quant version

#7 opened 17 days ago by

TobDeBer

New activity in mradermacher/BabyHercules-4x150M-GGUF about 1 month ago

public mradermacher discussions

#5 opened about 1 month ago by

mradermacher

updated a model 4 months ago

compilade/quant-tests

Updated Dec 29, 2024

liked a model 7 months ago

ai21labs/Jamba-tiny-dev

Updated Oct 1, 2024 • 16.3k • 12

replied to bartowski's post 7 months ago

KLD measures the difference between 2 probability distributions, typically between a "ground truth" and a model prediction.

Yes, and from my understanding ln(PPL(Q)/PPL(base)) measures the difference between the probabilities assigned to the "correct" tokens according to the test dataset (at least for the second half of each chunk, the same as for KLD). This means it would be possible to keep perplexity the same or better while also increasing KLD, by giving the non-"correct" tokens different probabilities.
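To make that concrete, here is a toy sketch for a single token position, assuming a made-up 3-token vocabulary and made-up probabilities (none of these numbers come from a real model, and this is not how any particular tool computes its statistics). The quantized distribution keeps the probability of the "correct" token unchanged but swaps the other two, so the log-perplexity contribution does not move while KLD does.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions (illustrative numbers only).
# Index 0 is the "correct" token according to the test dataset.
base = [0.5, 0.4, 0.1]
quant = [0.5, 0.1, 0.4]  # same probability for the correct token, others swapped

correct = 0

# This position's contribution to ln(PPL(Q)/PPL(base)):
# the difference of the negative log-likelihoods of the correct token.
log_ppl_ratio = -math.log(quant[correct]) - (-math.log(base[correct]))

# KLD compares the whole distributions, not only the correct token.
kld = kl_divergence(base, quant)

print("ln(PPL(Q)/PPL(base)) contribution:", log_ppl_ratio)
print("KLD(base || quant):", kld)
```

Here the perplexity term is exactly zero while the KLD is 0.3 * ln(4), roughly 0.416. A real measurement averages over many token positions, but the same effect can survive the averaging.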

This makes me wonder: do all of the token probabilities have to match closely for a quantized model to still be good?

I guess it depends on whether the goal is to make a faithful quantization, or an equally good model through quantization-aware fine-tuning.
The way imatrix works, it can't really "fine-tune" a model towards a lower perplexity. It can only prioritize error reduction when quantizing the weights in the columns with more impact on the activations. So I would say that faithfulness to the full-precision model is the goal of the quantization in this case, and thus KLD feels more appropriate.
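
As a rough illustration of "prioritizing error reduction in the more important columns", here is a minimal sketch of an importance-weighted rounding error. It is not llama.cpp's actual code: the function names, the candidate scale search, the `qmax` range, and all numbers are assumptions made up for this example.

```python
def quantize(weights, scale, qmax=7):
    """Round each weight to a multiple of scale, clamped to [-qmax, qmax] steps."""
    return [scale * max(-qmax, min(qmax, round(w / scale))) for w in weights]

def weighted_error(weights, scale, importance):
    """Importance-weighted squared error between weights and their quantized values."""
    quantized = quantize(weights, scale)
    return sum(i * (w - q) ** 2 for w, q, i in zip(weights, quantized, importance))

def best_scale(weights, importance, candidates):
    """Pick the candidate scale with the lowest importance-weighted error."""
    return min(candidates, key=lambda s: weighted_error(weights, s, importance))

# Hypothetical row of weights and per-column importance (illustrative only).
weights = [0.9, 0.31, -0.52, 0.1]
uniform = [1.0, 1.0, 1.0, 1.0]
importance = [0.1, 10.0, 0.1, 0.1]  # pretend column 1 matters most for activations
candidates = [0.05 * k for k in range(1, 21)]

scale_uniform = best_scale(weights, uniform, candidates)
scale_weighted = best_scale(weights, importance, candidates)

print("scale chosen without importance:", scale_uniform)
print("scale chosen with importance:", scale_weighted)
```

The search only changes which rounding errors are tolerated. It never moves the weights towards a lower loss, which is why this kind of weighting aims at faithfulness rather than at a better model.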

Of course, I might be wrong; I don't really have a full understanding of the statistics going on in perplexity and KL-divergence calculations.

However, for quantization-aware fine-tuning, ln(PPL(Q)/PPL(base)) is likely a better indicator of quantization quality than KLD, unless the goal of the fine-tuning was actually to minimize KLD.

liked 2 models 7 months ago