Error when loading model quantized with BitsAndBytesConfig for inference
Thank you for the great model!
I'm having trouble loading this model with BitsAndBytesConfig for inference. The script I used is the same as the one in the model card, under 'Loading the Model Locally', but with the added keyword argument quantization_config. For convenience, here is the script that works:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map='cuda',
    torch_dtype='auto',
    _attn_implementation='flash_attention_2'
).cuda()
And the script that throws the error:
import torch
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map='cuda',
    torch_dtype='auto',
    quantization_config=nf4_config,
    _attn_implementation='flash_attention_2'
).cuda()
The error message I get:
It seems the model returns 'None' when quantization_config is passed.
I think there's another thread reporting a similar issue when fine-tuning.
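My guess (and it is only a guess) is that the trailing .cuda() call is the problem, since a model quantized by bitsandbytes during from_pretrained is already placed on the GPU via device_map and generally shouldn't be moved afterwards. A minimal sketch of what I mean, using the same model_path as above:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Assumption: rely on device_map for GPU placement and drop the extra .cuda(),
# since bitsandbytes quantizes the weights while from_pretrained loads them.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map='cuda',
    torch_dtype='auto',
    quantization_config=nf4_config,
    _attn_implementation='flash_attention_2'
)

If that isn't the issue, it would also help to know which transformers and bitsandbytes versions are expected to work with this model.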
Some help would be much appreciated.