Latest vision & multimodal releases in transformers
In this post you can find short summaries and snippets for the latest vision and multimodal releases in transformers, covering the following models: D-FINE (state-of-the-art object detection), Qwen2.5-Omni, and SAM-HQ.
D-FINE
D-FINE is a state-of-the-art real-time object detector that runs on a T4 GPU (the free Colab tier). The transformers release ships with notebook examples for tracking, inference, and fine-tuning, as well as a demo.
Here's the model-specific install:

```bash
pip install git+https://github.com/huggingface/[email protected]
```
Inference:
```python
import torch
import requests
from PIL import Image

from transformers import DFineForObjectDetection, AutoImageProcessor

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("ustc-community/dfine-large-obj365")
model = DFineForObjectDetection.from_pretrained("ustc-community/dfine-large-obj365")

inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# rescale boxes to the original image size and filter by confidence
results = image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3
)

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")
```
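To sanity-check the detections visually, here is a minimal sketch that draws the post-processed boxes onto the image with PIL's ImageDraw (our own addition, not part of the release; it assumes the snippet above has already run):

```python
from PIL import ImageDraw

draw = ImageDraw.Draw(image)
for score, label_id, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    x0, y0, x1, y1 = box.tolist()
    draw.rectangle([x0, y0, x1, y1], outline="red", width=2)
    # label each box with class name and confidence
    draw.text((x0, y0), f"{model.config.id2label[label_id.item()]}: {score:.2f}", fill="red")
image.save("detections.png")
```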
- Collection with all checkpoints and demo: https://huggingface.co/collections/ustc-community/d-fine-68109b427cbe6ee36b4e7352

Notebooks:
- Tracking: https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_tracking.ipynb
- Inference: https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_inference.ipynb
- Fine-tuning: https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_finetune_on_a_custom_dataset.ipynb
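For quick experiments, the high-level pipeline API is often the shortest path. A minimal sketch, assuming the release registers D-FINE with the generic object-detection pipeline (worth verifying against the model card):

```python
from transformers import pipeline

# assumption: D-FINE checkpoints work with the standard object-detection pipeline
detector = pipeline("object-detection", model="ustc-community/dfine-large-obj365")
predictions = detector("http://images.cocodataset.org/val2017/000000039769.jpg", threshold=0.3)
for pred in predictions:
    print(pred["label"], pred["score"], pred["box"])
```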
Qwen2.5-Omni
Qwen2.5-Omni is a 7B any-to-any model by Qwen: it takes image, audio, video, and text inputs and produces text and natural speech outputs.
Here's the model-specific install:

```bash
pip install git+https://github.com/huggingface/[email protected]
```
Below is inference with text and video input. To infer with images instead, swap the video entry in the chat template for an image entry (see the sketch after the snippet).
```python
import soundfile as sf

from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=2,
    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
)

# Inference: generate the output text and audio
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)

text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)

# save the generated speech as a 24 kHz wav file
sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
```
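As a concrete example of the image swap mentioned above, here is a minimal sketch (our own addition; the image URL is a placeholder, and `model` and `processor` are the ones loaded above):

```python
conversation = [
    {
        "role": "user",
        "content": [
            # the only change vs. the video example: an image entry instead of a video entry
            {"type": "image", "image": "https://example.com/some_image.jpg"},  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    padding=True,
)
text_ids, audio = model.generate(**inputs)
print(processor.batch_decode(text_ids, skip_special_tokens=True))
```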
Model collection: https://huggingface.co/collections/Qwen/qwen25-omni-67de1e5f0f9464dc6314b36e
SAM-HQ
SAM-HQ is an improvement on top of SAM that produces higher-quality masks by using more features from the image encoder, refining masks with an additional MLP, and adding a novel token to the decoder inputs.
Here's the model-specific install:

```bash
pip install git+https://github.com/huggingface/transformers@v4.51.3-SAM-HQ-preview
```
Point inference:
```python
import torch
import requests
from PIL import Image

from transformers import SamHQModel, SamHQProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("sushmanth/sam_hq_vit_b").to(device)
processor = SamHQProcessor.from_pretrained("sushmanth/sam_hq_vit_b")

img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

input_points = [[[450, 600]]]  # 2D location of a window in the image

inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

# upscale the predicted masks back to the original image resolution
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
scores = outputs.iou_scores
```
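The model proposes several candidate masks per point prompt, ranked by predicted IoU. A minimal sketch (our own addition, assuming the snippet above has run) that keeps the highest-scoring candidate and saves it for inspection:

```python
import numpy as np

# `masks` is a per-image list; masks[0] has shape (num_prompts, num_candidates, H, W)
best_idx = scores.squeeze().argmax().item()  # index of the best candidate mask
best_mask = masks[0][0, best_idx].cpu().numpy()  # boolean (H, W) array

# save as a grayscale PNG
Image.fromarray((best_mask * 255).astype(np.uint8)).save("best_mask.png")
```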