Object Detection?
Why does Meta's website show a Llama model returning bounding boxes when it can only return text?
Here's a screenshot from their website. The screen shot is from this link, under it's "expert image grounding" section.
You probably have to request the bounding boxes in the output. That is required to get them from qwen-vl-2.5 for example.
@treehugg3
I tried what you suggested. The coordinates it gives appear to be bogus. And it hallucinates the presence of a ruler.
Those actually look correct, although it hallucinated the ruler (basically it used the caliper's bounding box again).
Coordinates appear to be 0-1 for width and height. Honestly this is probably better than Qwen's way of doing it with pixels because the image size can be changed by the preprocessor. Thanks for testing it out.
@treehugg3 I don't believe they are correct. The tape measure is said to occupy more than half of the height and width of the image. The ruler occupies almost 90% of the height. The caliper over 80%. I think these numbers are completely hallucinated.
@buckeye17-bah I think you're reading it wrong. You are assuming it's represented as (x, y, w, h). I think it is (x1, y1, x2, y2).
BAH, are you doing this for work?