Object Detection?

#23

by buckeye17-bah - opened Apr 6

Discussion

buckeye17-bah

Apr 6

Why does Meta's website show a Llama model returning bounding boxes when it can only return text?

Here's a screenshot from their website. The screenshot is from this link, under its "expert image grounding" section.

treehugg3

28 days ago

You probably have to explicitly ask for the bounding boxes in your prompt. That is required to get them from Qwen2.5-VL, for example.

buckeye17-bah

28 days ago

@treehugg3 I tried what you suggested. The coordinates it gives appear to be bogus. And it hallucinates the presence of a ruler.

treehugg3

27 days ago

Those actually look correct, although it hallucinated the ruler (basically it used the caliper's bounding box again).

The coordinates appear to be normalized to the 0-1 range, as fractions of the image width and height. Honestly, this is probably better than Qwen's way of doing it with pixels, because the preprocessor can change the image size. Thanks for testing it out.
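For anyone who wants to draw normalized boxes on the original image, here is a minimal sketch of the scaling step. It assumes the box is (x1, y1, x2, y2) with every value in the 0-1 range, which is a guess from this thread and not something confirmed by Meta. The function name `to_pixels` and the sample values are made up for illustration.

```python
def to_pixels(box, width, height):
    """Scale a normalized (x1, y1, x2, y2) box to integer pixel coordinates.

    Assumption: x values are fractions of the image width and y values are
    fractions of the image height.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 * width),
        round(y1 * height),
        round(x2 * width),
        round(y2 * height),
    )


# Hypothetical box, not taken from the model output in this thread.
box = (0.25, 0.5, 0.75, 1.0)

# The same normalized box maps cleanly onto any image size.
print(to_pixels(box, 640, 480))  # (160, 240, 480, 480)
print(to_pixels(box, 320, 240))  # (80, 120, 240, 240)
```

The point is that the normalized box stays valid when the image is resized, while a pixel box would have to be rescaled to match.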

buckeye17-bah

27 days ago

@treehugg3 I don't believe they are correct. The tape measure is said to occupy more than half of the height and width of the image. The ruler occupies almost 90% of the height. The caliper over 80%. I think these numbers are completely hallucinated.

treehugg3

27 days ago (edited 27 days ago)

@buckeye17-bah I think you're reading it wrong. You are assuming it's represented as (x, y, w, h). I think it is (x1, y1, x2, y2).
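To show how much the two readings differ, here is a small sketch that computes the box size under each interpretation. The box values are hypothetical (the real numbers are only in the screenshot), and the helper names are made up for illustration. The last helper is just a rough sanity check, not a definitive test of which format the model uses.

```python
def size_as_xywh(box):
    """Read the box as (x, y, w, h): the last two values are the size."""
    _, _, w, h = box
    return w, h


def size_as_xyxy(box):
    """Read the box as (x1, y1, x2, y2): the size is the corner difference."""
    x1, y1, x2, y2 = box
    return x2 - x1, y2 - y1


def xywh_fits_in_image(box):
    """Under (x, y, w, h), a normalized box should not extend past 1.0."""
    x, y, w, h = box
    return x + w <= 1.0 and y + h <= 1.0


# Hypothetical normalized box, not taken from the thread.
box = (0.25, 0.5, 0.75, 0.875)

print(size_as_xywh(box))        # (0.75, 0.875)
print(size_as_xyxy(box))        # (0.5, 0.375)
print(xywh_fits_in_image(box))  # False
```

Read as (x, y, w, h), this box would cover most of the image and run off the bottom edge. Read as (x1, y1, x2, y2), it is a much smaller box that stays inside the image.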

BAH, are you doing this for work?
