How to Optimize Slow CPU Inference Speed

#2
by izhaohui - opened

Due to certain restrictions, I cannot use a GPU during deployment or access APIs that leverage GPUs on other devices. When computing embeddings on CPU, processing an 800x800 image takes several minutes, which is extremely slow. Other CLIP models I've tested typically complete this in around 1 second. I'm not familiar with the model's architecture, but I'd like to know if there are ways to optimize CPU inference speed. Thanks.

After setting attn_implementation to None (disabling flash attention), the problem was resolved. Thanks.

izhaohui changed discussion status to closed

It feels like there is something wrong with your setup. The demo runs on CPU and completes the task in ~30-40 seconds, which is kinda expected. This model supports higher resolution - 512px compared to 256px in standard CLIP models - which results in a 4x longer input sequence for the visual branch. Performance may also be affected by long text inputs. Can you share the code and some data examples?
Generally, one good way to optimize CPU inference is to use ONNX. The conversion from PyTorch to ONNX is quite straightforward, but it may be tricky the first time you do it. I'll convert the model to ONNX over the weekend and update the repo.
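In the meantime, here is a minimal export sketch for the vision branch only (the vision_model submodule name and its forward signature are assumptions - check the actual remote-code implementation before relying on this; the text branch would be exported the same way):

# Minimal ONNX export sketch for the image branch only.
# Assumption: the remote-code model exposes a vision submodule (called
# vision_model here) that takes pixel_values - verify the real attribute name.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "visheratin/mexma-siglip2",
    torch_dtype=torch.float32,
    trust_remote_code=True,
    attn_implementation=None,
)
model.eval()

dummy_pixels = torch.randn(1, 3, 512, 512)  # SigLIP2 512px input resolution

torch.onnx.export(
    model.vision_model,                 # assumed submodule name
    (dummy_pixels,),
    "mexma_siglip2_vision.onnx",
    input_names=["pixel_values"],
    output_names=["image_embeds"],
    dynamic_axes={"pixel_values": {0: "batch"}},
    opset_version=17,
)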

CPU: Intel 10500
Memory: 32GB
Runtime: Docker
OS: Debian

from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch
import sys
import time
name = "visheratin/mexma-siglip2"
path = "/xxxx/models" 
name = path
model = AutoModel.from_pretrained(name, torch_dtype='auto', trust_remote_code=True, device_map='cpu', attn_implementation=None)
tokenizer = AutoTokenizer.from_pretrained(name)
processor = AutoImageProcessor.from_pretrained(name)

begin = time.time()
with torch.inference_mode():
    img = Image.open("/mnt/xxx.jpg")
    img.thumbnail((800,800))
    img = processor(images=img, return_tensors="pt")["pixel_values"]
    text = tokenizer([*sys.argv[1:]], return_tensors="pt", padding=True)
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    probs = image_logits.softmax(dim=-1).tolist()
    print(probs)
    print(f"cost {int(time.time() - begin)}")
  • case1:
    model = AutoModel.from_pretrained(name, torch_dtype='auto', trust_remote_code=True, device_map='cpu', attn_implementation=None)
    cost 3 seconds

  • case2:
    model = AutoModel.from_pretrained(name, torch_dtype='auto', trust_remote_code=True)
    cost about 5 mins (319 seconds)

case1 and case2 use the same image and command-line args.
command: python3 test.py key1 key2 key3
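For debugging, it can also help to print which attention backend transformers actually selected when loading the model. A small sketch (note that _attn_implementation is an internal config attribute and may change between transformers versions, so treat it only as a debugging aid):

# Check which attention backend was selected for a given loading variant.
# Swap in attn_implementation=None to compare case 1 against case 2.
from transformers import AutoModel

name = "visheratin/mexma-siglip2"
model = AutoModel.from_pretrained(name, torch_dtype='auto', trust_remote_code=True)  # case 2 loading
print(model.config._attn_implementation)  # internal attribute, not a stable API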

izhaohui changed discussion status to open

Here is the link to a Colab notebook with an example of how to use the ONNX version of the model. This version has a more reasonable execution speed.

Thank you, I will try the ONNX model approach.

visheratin changed discussion status to closed

I'm also experiencing significantly slower speeds running this compared to models with equivalent parameter counts and frame resolutions via the open_clip repo, when using the MPS backend on macOS. Any advice?

Did you compare it to the 512px SigLIP2 from open_clip? My current guess is that the longer input sequence for the image branch is responsible for the slower speed. If that is true, the speeds for the original SigLIP2 and mexma-siglip2 should be roughly equal. Another point is that SigLIP2 (and SigLIP) support a sequence length of at most 64 text tokens - even if you pass more than that, only 64 tokens will be processed. mexma-siglip2 has a 512-token context window.
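To illustrate the text-branch difference, a small sketch comparing token counts (the example text is made up; capping max_length is one way to bound the text cost if your captions are short):

# Illustration of the text context window: mexma-siglip2 accepts up to 512
# tokens, while the SigLIP/SigLIP2 text towers only process 64.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip2")
long_text = "a very long caption about a busy street market at night " * 20

full = tokenizer([long_text], return_tensors="pt", padding=True, truncation=True, max_length=512)
capped = tokenizer([long_text], return_tensors="pt", padding=True, truncation=True, max_length=64)
print(full["input_ids"].shape)    # grows with the text, up to (1, 512)
print(capped["input_ids"].shape)  # at most (1, 64)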

visheratin changed discussion status to open

On my phone here, but I'll run some benchmarks once I'm back at the keyboard and report back!

INFO - Encoded 327 imgs using ViT-B-16-SigLIP2-512 in 28.94 seconds
INFO - Encoded 327 imgs using ViT-L-16-SigLIP2-512 in 91.69 seconds
INFO - Encoded 327 imgs using mexma-siglip2 in 131.22 seconds

Running on an MBP M1 Pro 16GB using MPS. The first two SigLIP2 models were run via open_clip, mexma-siglip2 via transformers.
This seems pretty reasonable, right? I think the issues I observed previously may have been due to pre-allocating too much memory for the embeddings when processing longer videos.

For reference, I ran this on CUDA using a laptop with an RTX 4060 6GB:

INFO - Encoded 327 imgs using ViT-B-16-SigLIP2-512 in 14.50 seconds
INFO - Encoded 327 imgs using mexma-siglip2 in 15.28 seconds

Any idea why the two models perform so similarly on CUDA when the gap is so large on MPS?

Sorry, I can't help with MPS, as I've never worked with it. But the speed on CUDA looks reasonable (not sure what batch size you used).

Okay, thanks - the batch size was 32. I tested a few different ones and it didn't make much difference. The mlx-vlm repo is getting SigLIP & SigLIP2 support, so hopefully I can switch over from MPS to MLX and get faster speeds. Thanks again for your great work here!


How would you run the ONNX model to extract text/image embeddings without first loading the PyTorch safetensors model, as is done in the Colab?
Edit: let me rephrase that: how could we run the ONNX model without first having to load the PyTorch model to get the input IDs, attention mask, etc.? i.e. only having a local file with the ONNX weights.
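For what it's worth, here is a sketch of one possible approach (the .onnx filename and the graph's input/output names below are assumptions that must match whatever the actual export uses; the Colab export may also ship separate vision and text graphs). The tokenizer and image processor load on their own without pulling the PyTorch weights, so their numpy outputs can be fed directly to onnxruntime:

# ONNX-only inference sketch: only tokenizer/processor configs are downloaded,
# not the PyTorch safetensors. Inspect session.get_inputs() to find the real
# input names of your exported graph and adjust the feed dict accordingly.
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import AutoTokenizer, AutoImageProcessor

name = "visheratin/mexma-siglip2"
tokenizer = AutoTokenizer.from_pretrained(name)
processor = AutoImageProcessor.from_pretrained(name)

session = ort.InferenceSession("mexma-siglip2.onnx", providers=["CPUExecutionProvider"])
print([i.name for i in session.get_inputs()])  # check the real input names

img = processor(images=Image.open("example.jpg"), return_tensors="np")["pixel_values"]
text = tokenizer(["a cat", "a dog"], return_tensors="np", padding=True)

outputs = session.run(
    None,
    {
        "pixel_values": img.astype(np.float32),          # assumed input name
        "input_ids": text["input_ids"].astype(np.int64),
        "attention_mask": text["attention_mask"].astype(np.int64),
    },
)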
