Commit 05f47ba
1 Parent(s): a4590c9

Update README

Signed-off-by: Snehil Shah <[email protected]>

README.md CHANGED

@@ -1,8 +1,8 @@
---
title: Multimodal Image Search Engine
-emoji:
-colorFrom:
-colorTo:
+emoji: π
+colorFrom: yellow
+colorTo: yellow
sdk: gradio
sdk_version: 4.13.0
app_file: app.py
@@ -10,4 +10,44 @@ pinned: false
license: mit
---

<p align="center">
<h1 align="center">Multi-Modal Image Search Engine</h1>
<p align="center">
A semantic search engine that understands the content & context of your queries.
<br>
Use multi-modal inputs like text-to-image or reverse image search to query a vector database of over 15k images. <a href="https://huggingface.co/spaces/Snehil-Shah/Multimodal-Image-Search-Engine">Try it out!</a>
<br><br>
<img src="https://github.com/Snehil-Shah/Multimodal-Image-Search-Engine/blob/main/assets/demo.gif?raw=true">
</p>
</p>

<h3>• About The Project</h3>

At its core, the search engine is built upon the concept of **vector similarity search**.
All images are encoded into vector embeddings that capture their semantic meaning using a transformer model, and these embeddings are stored in a vector space.
When a query arrives, it is encoded the same way, and its nearest neighbors in that space are returned as the relevant search results.

<p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/encoding_flow.png"></p>
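
As a toy sketch of this idea (with stand-in vectors rather than the project's own code), nearest-neighbor retrieval by cosine similarity takes only a few lines of NumPy:

```python
import numpy as np

# Stand-in embeddings: 4 "images" and 1 query (3-D here; CLIP uses 512-D).
rng = np.random.default_rng(0)
image_embeddings = rng.random((4, 3))
query = rng.random(3)

# Cosine similarity is the dot product of L2-normalized vectors.
images_norm = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
query_norm = query / np.linalg.norm(query)
similarities = images_norm @ query_norm

# Highest similarity = nearest neighbors = the search results.
ranking = np.argsort(similarities)[::-1]
print(ranking, similarities[ranking])
```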

We use the Contrastive Language-Image Pre-training (CLIP) model by OpenAI, a pre-trained multi-modal vision transformer that can semantically encode words, sentences & images into a 512-dimensional vector. This vector encapsulates the meaning & context of the entity in a *mathematically measurable* format.

<p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/Visualization.png" width=1000></p>
<p align="center"><i>2-D Visualization of 500 Images in a 512-D Vector Space</i></p>
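
A minimal sketch of this encoding step, assuming the `clip-ViT-B-32` checkpoint from Sentence-Transformers (the model name and image file here are illustrative assumptions, not taken from the app's source):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

# CLIP ViT-B/32 via sentence-transformers: text and images share one 512-D space.
model = SentenceTransformer("clip-ViT-B-32")

text_embedding = model.encode("a dog playing in the snow")  # shape: (512,)
image_embedding = model.encode(Image.open("example.jpg"))   # hypothetical file

print(text_embedding.shape, image_embedding.shape)
```

Because both modalities land in the same vector space, a text query can be compared directly against image embeddings, which is what makes text-to-image and reverse image search interchangeable here.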

The images are stored as vector embeddings in a Qdrant collection (Qdrant is a vector database). The search term is encoded and run as a query against Qdrant, which returns the nearest neighbors ranked by their cosine similarity to the search query.

<p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/retrieval_flow.png"></p>
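
A sketch of this store-and-search flow using `qdrant-client`; the collection name, payload fields, and stand-in embeddings are illustrative assumptions rather than the project's actual setup:

```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# Stand-in embeddings; in the real engine these come from CLIP (see above).
rng = np.random.default_rng(0)
image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]  # hypothetical files
image_embeddings = rng.random((len(image_paths), 512))
query_embedding = rng.random(512)

client = QdrantClient(":memory:")  # local in-memory instance for illustration

# A collection of 512-D vectors compared by cosine similarity.
client.create_collection(
    collection_name="images",
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)

# Store each embedding with a payload pointing back to its source image.
client.upsert(
    collection_name="images",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"path": path})
        for i, (vec, path) in enumerate(zip(image_embeddings, image_paths))
    ],
)

# The encoded search term's nearest neighbors are the search results.
hits = client.search(
    collection_name="images",
    query_vector=query_embedding.tolist(),
    limit=3,
)
print([(hit.payload["path"], hit.score) for hit in hits])
```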

**The Dataset**: All images are sourced from the [Open Images Dataset](https://github.com/cvdfoundation/open-images-dataset) by the Common Visual Data Foundation.

<h3>• Technologies Used</h3>

- Python
- Jupyter Notebooks
- Qdrant - Vector Database
- Sentence-Transformers - Library
- CLIP by OpenAI - ViT Model
- Gradio - UI
- HuggingFace Spaces - Deployment