Error when loading model quantized with BitsAndBytesConfig for inference
Thank you for the great model!
I'm having trouble loading this model with BitsAndBytesConfig for inference. The script I used is the same as the one in the model card, under 'Loading the Model Locally', but with the added keyword argument quantization_config. For convenience, here is the script that works:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map='cuda',
    torch_dtype='auto',
    _attn_implementation='flash_attention_2'
).cuda()
And the script that throws the error:
import torch
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map='cuda',
    torch_dtype='auto',
    quantization_config=nf4_config,
    _attn_implementation='flash_attention_2'
).cuda()
The error message I get:
It seems the model returns 'None' when quantization_config is passed.
I think there's another thread reporting a similar issue when fine-tuning.
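My guess (and it is only a guess) is that the trailing .cuda() call is the problem, since a model quantized by bitsandbytes during from_pretrained is already placed on the GPU via device_map and generally shouldn't be moved afterwards. A minimal sketch of what I mean, using the same model_path as above:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Assumption: rely on device_map for GPU placement and drop the extra .cuda(),
# since bitsandbytes quantizes the weights while from_pretrained loads them.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map='cuda',
    torch_dtype='auto',
    quantization_config=nf4_config,
    _attn_implementation='flash_attention_2'
)

If that isn't the issue, it would also help to know which transformers and bitsandbytes versions are expected to work with this model.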
Some help would be much appreciated.