dwb2023 committed · verified · Commit 3253f8e · Parent(s): 79605da

Update README.md

Files changed (1): README.md (+266 -1)
README.md CHANGED
@@ -10,4 +10,269 @@ pinned: false
 license: openrail
 ---

-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

## Creating instructions

- Load the image from the given file path '/home/user/tmp9873xen5.jpg'.
- Use the 'owl_v2' tool to detect brain tumors in the image. The prompt should be 'brain tumor'.
- Use the 'grounding_sam' tool to segment brain tumors in the image. The prompt should be 'brain tumor'.
- Overlay the bounding boxes from the detection results on the original image using the 'overlay_bounding_boxes' utility.
- Overlay the segmentation masks from the segmentation results on the original image using the 'overlay_segmentation_masks' utility.
- Save the final image with both bounding boxes and segmentation masks to a specified output path (see the end-to-end sketch below).
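
These steps compose directly into a short script. The sketch below is one possible wiring, assuming the tools documented under "Retrieving tools" are importable from a single `vision_agent.tools` module (the module path and the output file name are assumptions for illustration, not part of this Space):

```python
# End-to-end sketch: detect and segment brain tumors, then overlay both results.
from vision_agent.tools import (  # assumed module path
    load_image,
    save_image,
    owl_v2,
    grounding_sam,
    overlay_bounding_boxes,
    overlay_segmentation_masks,
)

# Load the input image as a NumPy array.
image = load_image("/home/user/tmp9873xen5.jpg")

# Text-prompted detection (bounding boxes) and segmentation (masks).
detections = owl_v2("brain tumor", image)
segments = grounding_sam("brain tumor", image)

# Draw boxes first, then masks, on the same image.
viz = overlay_bounding_boxes(image, detections)
viz = overlay_segmentation_masks(viz, segments)

# Save the combined visualization (output path is illustrative).
save_image(viz, "/home/user/tmp9873xen5_annotated.jpg")
```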

## Retrieving tools

- 'load_image' is a utility function that loads an image from the given file path string.
- 'save_image' is a utility function that saves an image to a file path.
- 'owl_v2' is a tool that can detect and count multiple objects given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas. It returns a list of bounding boxes with normalized coordinates, label names and associated probability scores.
- 'florencev2_object_detection' is a tool that can detect common objects in an image without any text prompt or thresholding. It returns a list of detected objects as labels and their locations as bounding boxes.
- 'grounding_sam' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in the text prompt are separated by commas or periods. It returns a list of bounding boxes, label names, mask file names and associated probability scores.
- 'detr_segmentation' is a tool that can segment common objects in an image without any text prompt. It returns a list of detected objects as labels, their regions as masks and their scores.
- 'overlay_bounding_boxes' is a utility function that displays bounding boxes on an image.
- 'overlay_heat_map' is a utility function that displays a heat map on an image.
- 'overlay_segmentation_masks' is a utility function that displays segmentation masks on an image. (A short tool-selection sketch follows this list.)
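
The split above follows one rule: text-prompted requests go to the grounded tools ('owl_v2', 'grounding_sam'), promptless requests go to the generic ones ('florencev2_object_detection', 'detr_segmentation'). A hypothetical dispatcher makes the rule concrete (the helper name and module path are illustrative):

```python
from typing import Any, Dict, List, Optional

import numpy as np

from vision_agent.tools import (  # assumed module path
    detr_segmentation,
    florencev2_object_detection,
    grounding_sam,
    owl_v2,
)


def detect_or_segment(
    image: np.ndarray, task: str, prompt: Optional[str] = None
) -> List[Dict[str, Any]]:
    """Route to a prompted tool when a prompt is given, else to a promptless one."""
    if task == "detect":
        return owl_v2(prompt, image) if prompt else florencev2_object_detection(image)
    if task == "segment":
        return grounding_sam(prompt, image) if prompt else detr_segmentation(image)
    raise ValueError(f"unknown task: {task!r}")
```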

### Retrieving tools - detailed notes on tool selection

load_image(image_path: str) -> numpy.ndarray:

    'load_image' is a utility function that loads an image from the given file path string.

    Parameters:
        image_path (str): The path to the image.

    Returns:
        np.ndarray: The image as a NumPy array.

    Example
    -------
    >>> load_image("path/to/image.jpg")

save_image(image: numpy.ndarray, file_path: str) -> None:

    'save_image' is a utility function that saves an image to a file path.

    Parameters:
        image (np.ndarray): The image to save.
        file_path (str): The path to save the image file.

    Example
    -------
    >>> save_image(image, "path/to/image.jpg")

owl_v2(prompt: str, image: numpy.ndarray, box_threshold: float = 0.1, iou_threshold: float = 0.1) -> List[Dict[str, Any]]:

    'owl_v2' is a tool that can detect and count multiple objects given a text
    prompt such as category names or referring expressions. The categories in the
    text prompt are separated by commas. It returns a list of bounding boxes with
    normalized coordinates, label names and associated probability scores.

    Parameters:
        prompt (str): The prompt to ground to the image.
        image (np.ndarray): The image to ground the prompt to.
        box_threshold (float, optional): The threshold for the box detection. Defaults
            to 0.10.
        iou_threshold (float, optional): The threshold for the Intersection over Union
            (IoU). Defaults to 0.10.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
            bounding box of the detected objects with normalized coordinates between 0
            and 1 (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the
            top-left corner and xmax and ymax are the coordinates of the bottom-right
            corner of the bounding box.

    Example
    -------
    >>> owl_v2("car, dinosaur", image)
    [
        {'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]},
        {'score': 0.98, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
    ]

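Since 'owl_v2' returns coordinates normalized to [0, 1], code that needs pixel boxes must rescale them by the image size. A minimal helper (the function name is illustrative; it relies only on the documented return format):

```python
from typing import Any, Dict, List

import numpy as np


def to_pixel_boxes(image: np.ndarray, detections: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Rescale normalized (xmin, ymin, xmax, ymax) boxes to pixel coordinates."""
    h, w = image.shape[:2]
    out = []
    for det in detections:
        xmin, ymin, xmax, ymax = det["bbox"]
        out.append({**det, "bbox": [int(xmin * w), int(ymin * h), int(xmax * w), int(ymax * h)]})
    return out
```
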
florencev2_object_detection(image: numpy.ndarray) -> List[Dict[str, Any]]:

    'florencev2_object_detection' is a tool that can detect common objects in an
    image without any text prompt or thresholding. It returns a list of detected
    objects as labels and their locations as bounding boxes.

    Parameters:
        image (np.ndarray): The image used to detect objects.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
            bounding box of the detected objects with normalized coordinates between 0
            and 1 (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the
            top-left corner and xmax and ymax are the coordinates of the bottom-right
            corner of the bounding box. The scores are always 1.0 and cannot be
            thresholded.

    Example
    -------
    >>> florencev2_object_detection(image)
    [
        {'score': 1.0, 'label': 'window', 'bbox': [0.1, 0.11, 0.35, 0.4]},
        {'score': 1.0, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
        {'score': 1.0, 'label': 'person', 'bbox': [0.34, 0.21, 0.85, 0.5]},
    ]

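Because the scores are fixed at 1.0, confidence filtering is a no-op for this tool; the useful post-processing is grouping or counting by label. For example (illustrative helper):

```python
from collections import Counter
from typing import Any, Dict, List


def count_labels(detections: List[Dict[str, Any]]) -> Counter:
    """Count detections per label, e.g. Counter({'car': 3, 'person': 2})."""
    return Counter(det["label"] for det in detections)
```
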
grounding_sam(prompt: str, image: numpy.ndarray, box_threshold: float = 0.2, iou_threshold: float = 0.2) -> List[Dict[str, Any]]:

    'grounding_sam' is a tool that can segment multiple objects given a
    text prompt such as category names or referring expressions. The categories in the
    text prompt are separated by commas or periods. It returns a list of bounding
    boxes, label names, mask file names and associated probability scores.

    Parameters:
        prompt (str): The prompt to ground to the image.
        image (np.ndarray): The image to ground the prompt to.
        box_threshold (float, optional): The threshold for the box detection. Defaults
            to 0.20.
        iou_threshold (float, optional): The threshold for the Intersection over Union
            (IoU). Defaults to 0.20.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label,
            bounding box, and mask of the detected objects with normalized coordinates
            (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
            corner and xmax and ymax are the coordinates of the bottom-right corner of
            the bounding box. The mask is a binary 2D numpy array where 1 indicates the
            object and 0 indicates the background.

    Example
    -------
    >>> grounding_sam("car. dinosaur", image)
    [
        {
            'score': 0.99,
            'label': 'dinosaur',
            'bbox': [0.1, 0.11, 0.35, 0.4],
            'mask': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
        },
    ]

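The binary masks make simple region statistics straightforward, e.g. what fraction of the image each segmented object covers. A sketch based only on the documented return format (helper name illustrative):

```python
from typing import Any, Dict, List


def mask_coverage(segments: List[Dict[str, Any]]) -> Dict[str, float]:
    """Fraction of image pixels covered by each labeled mask (values in [0, 1])."""
    # mask.mean() works because the mask holds only 0s and 1s.
    return {seg["label"]: float(seg["mask"].mean()) for seg in segments}
```
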
detr_segmentation(image: numpy.ndarray) -> List[Dict[str, Any]]:

    'detr_segmentation' is a tool that can segment common objects in an
    image without any text prompt. It returns a list of detected objects
    as labels, their regions as masks and their scores.

    Parameters:
        image (np.ndarray): The image used to segment things and objects.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label
            and mask of the detected objects. The mask is a binary 2D numpy array
            where 1 indicates the object and 0 indicates the background.

    Example
    -------
    >>> detr_segmentation(image)
    [
        {
            'score': 0.45,
            'label': 'window',
            'mask': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
        },
        {
            'score': 0.70,
            'label': 'bird',
            'mask': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
        },
    ]

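Unlike 'florencev2_object_detection', this tool returns real confidence scores, so thresholding happens after the call rather than through a parameter. For example (illustrative helper):

```python
from typing import Any, Dict, List


def filter_segments(segments: List[Dict[str, Any]], min_score: float = 0.5) -> List[Dict[str, Any]]:
    """Keep only segments whose confidence meets the threshold."""
    return [seg for seg in segments if seg["score"] >= min_score]
```
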
overlay_bounding_boxes(image: numpy.ndarray, bboxes: List[Dict[str, Any]]) -> numpy.ndarray:

    'overlay_bounding_boxes' is a utility function that displays bounding boxes on
    an image.

    Parameters:
        image (np.ndarray): The image to display the bounding boxes on.
        bboxes (List[Dict[str, Any]]): A list of dictionaries containing the bounding
            boxes.

    Returns:
        np.ndarray: The image with the bounding boxes, labels and scores displayed.

    Example
    -------
    >>> image_with_bboxes = overlay_bounding_boxes(
        image, [{'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]}],
    )

overlay_heat_map(image: numpy.ndarray, heat_map: Dict[str, Any], alpha: float = 0.8) -> numpy.ndarray:

    'overlay_heat_map' is a utility function that displays a heat map on an image.

    Parameters:
        image (np.ndarray): The image to display the heat map on.
        heat_map (Dict[str, Any]): A dictionary containing the heat map under the key
            'heat_map'.
        alpha (float, optional): The transparency of the overlay. Defaults to 0.8.

    Returns:
        np.ndarray: The image with the heat map displayed.

    Example
    -------
    >>> image_with_heat_map = overlay_heat_map(
        image,
        {
            'heat_map': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 125, 125, 125]], dtype=uint8),
        },
    )

overlay_segmentation_masks(image: numpy.ndarray, masks: List[Dict[str, Any]]) -> numpy.ndarray:

    'overlay_segmentation_masks' is a utility function that displays segmentation
    masks on an image.

    Parameters:
        image (np.ndarray): The image to display the masks on.
        masks (List[Dict[str, Any]]): A list of dictionaries containing the masks.

    Returns:
        np.ndarray: The image with the masks displayed.

    Example
    -------
    >>> image_with_masks = overlay_segmentation_masks(
        image,
        [{
            'score': 0.99,
            'label': 'dinosaur',
            'mask': array([[0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0],
                ...,
                [0, 0, 0, ..., 0, 0, 0],
                [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
        }],
    )

## Vision Agent Tools - model summary

| Model Name | Hugging Face Model | Primary Function | Use Cases |
|---------------------|-------------------------------------|-------------------------------|--------------------------------------------------------------|
| OWL-ViT v2 | google/owlv2-base-patch16-ensemble | Object detection and localization | - Open-world object detection<br>- Locating specific objects based on text prompts |
| Florence-2 | microsoft/Florence-2-base | Multi-purpose vision tasks | - Image captioning<br>- Visual question answering<br>- Object detection |
| Depth Anything V2 | LiheYoung/depth-anything-v2-small | Depth estimation | - Estimating depth in images<br>- Generating depth maps |
| CLIP | openai/clip-vit-base-patch32 | Image-text similarity | - Zero-shot image classification<br>- Image-text matching |
| BLIP | Salesforce/blip-image-captioning-base | Image captioning | - Generating text descriptions of images |
| LOCA | Custom implementation | Object counting | - Zero-shot object counting<br>- Object counting with visual prompts |
| GIT v2 | microsoft/git-base-textcaps | Visual question answering and image captioning | - Answering questions about image content<br>- Generating text descriptions of images |
| Grounding DINO | groundingdino/groundingdino-swint-ogc | Object detection and localization | - Detecting objects based on text prompts |
| SAM | facebook/sam-vit-huge | Instance segmentation | - Text-prompted instance segmentation |
| DETR | facebook/detr-resnet-50 | Object detection | - General object detection |
| ViT | google/vit-base-patch16-224 | Image classification | - General image classification<br>- NSFW content detection |
| DPT | Intel/dpt-hybrid-midas | Monocular depth estimation | - Estimating depth from single images |
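
Many of these checkpoints can also be exercised directly through the Hugging Face transformers pipeline API, independently of the tool wrappers above. A hedged sketch (whether a given checkpoint loads depends on your transformers version; the image path is illustrative):

```python
from transformers import pipeline

# Text-prompted, open-vocabulary detection with OWL-ViT v2 (cf. the owl_v2 tool).
detector = pipeline("zero-shot-object-detection", model="google/owlv2-base-patch16-ensemble")
print(detector("path/to/image.jpg", candidate_labels=["brain tumor"]))

# Promptless detection with DETR (cf. general object detection in the table).
detr = pipeline("object-detection", model="facebook/detr-resnet-50")
print(detr("path/to/image.jpg"))
```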