HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
Paper • 2503.02003 • Published • 48
Score image-text similarity using CLIP or SigLIP models
Segment images based on text prompts
Identify and mask objects in images using text prompts
Generate correspondences between images
Explore images from the ImageNet-Hard dataset