Snehil-Shah committed on
Commit
05f47ba
·
1 Parent(s): a4590c9

Update README


Signed-off-by: Snehil Shah <[email protected]>

Files changed (1)
  1. README.md +44 -4
README.md CHANGED
@@ -1,8 +1,8 @@
  ---
  title: Multimodal Image Search Engine
- emoji: 🚀
- colorFrom: indigo
- colorTo: indigo
+ emoji: 🔍
+ colorFrom: yellow
+ colorTo: yellow
  sdk: gradio
  sdk_version: 4.13.0
  app_file: app.py
@@ -10,4 +10,44 @@ pinned: false
  license: mit
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ <p align="center">
+ <h1 align="center">Multi-Modal Image Search Engine</h1>
+ <p align="center">
+ A Semantic Search Engine that understands the Content & Context of your Queries.
+ <br>
+ Use Multi-Modal inputs like a Text-to-Image Search or a Reverse Image Search to Query a Vector Database of over 15k Images. <a href="https://huggingface.co/spaces/Snehil-Shah/Multimodal-Image-Search-Engine">Try it Out!</a>
+ <br><br>
+ <img src="https://github.com/Snehil-Shah/Multimodal-Image-Search-Engine/blob/main/assets/demo.gif?raw=true">
+ </p>
+ </p>
+
+ <h3>• About The Project</h3>
+
+ At its core, the Search Engine is built upon the concept of **Vector Similarity Search**.
+ All the Images are encoded into vector embeddings based on their semantic meaning using a Transformer Model, and these embeddings are then stored in a vector space.
+ When a search is made, the query is encoded the same way, and its nearest neighbors in the vector space are returned as the relevant search results.
+
+ <p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/encoding_flow.png"></p>
+
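For illustration only, the nearest-neighbor idea boils down to comparing vectors; below is a minimal NumPy sketch with made-up 512-dimensional embeddings (the actual engine delegates this step to Qdrant, as described further down):

```python
import numpy as np

# Toy stand-ins for semantic embeddings: five "images" and one "query",
# each a 512-dimensional vector (random here, purely for illustration).
rng = np.random.default_rng(0)
image_vectors = rng.normal(size=(5, 512))
query_vector = rng.normal(size=512)

def normalize(v):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Cosine similarity of the query against every stored image vector.
scores = normalize(image_vectors) @ normalize(query_vector)

# The highest-scoring vectors are the nearest neighbors, i.e. the search results.
top_k = np.argsort(scores)[::-1][:3]
print("closest images:", top_k, "scores:", scores[top_k])
```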
+ We use the Contrastive Language-Image Pre-Training (CLIP) Model by OpenAI, a Pre-trained Multi-Modal Vision Transformer that can semantically encode Words, Sentences & Images into a 512-Dimensional Vector. This Vector encapsulates the meaning & context of the entity in a *Mathematically Measurable* format.
+
+ <p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/Visualization.png" width=1000></p>
+ <p align="center"><i>2-D Visualization of 500 Images in a 512-D Vector Space</i></p>
+
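As a rough sketch of this encoding step, assuming the `clip-ViT-B-32` checkpoint exposed through sentence-transformers (the README does not pin down the exact checkpoint) and a hypothetical image path:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

# CLIP loaded via sentence-transformers; it maps text and images
# into the same 512-dimensional embedding space.
model = SentenceTransformer("clip-ViT-B-32")  # assumed checkpoint

# A word or sentence becomes a 512-d vector...
text_embedding = model.encode("a dog playing in the snow")

# ...and so does an image (path is hypothetical).
image_embedding = model.encode(Image.open("assets/example.jpg"))

print(text_embedding.shape, image_embedding.shape)  # (512,) (512,)
```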
+ The Images are stored as vector embeddings in a Qdrant Collection, Qdrant being the Vector Database. The Search Term is encoded and run as a query against Qdrant, which returns the Nearest Neighbors based on their Cosine Similarity to the Search Query.
+
+ <p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/retrieval_flow.png"></p>
+
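A minimal sketch of this storage-and-retrieval step using `qdrant-client` in local mode; the collection name, payload fields, and the random vectors standing in for CLIP embeddings are illustrative assumptions, not the project's actual setup:

```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# Random 512-d vectors stand in for the CLIP embeddings of the images and the query.
rng = np.random.default_rng(0)
image_vectors = rng.normal(size=(100, 512))
query_vector = rng.normal(size=512)

client = QdrantClient(":memory:")  # local in-memory instance, no server needed

# A collection configured for 512-d vectors compared by cosine similarity.
client.create_collection(
    collection_name="images",  # hypothetical name
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)

# Upsert the image embeddings, keeping the file name as payload.
client.upsert(
    collection_name="images",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"file": f"img_{i}.jpg"})
        for i, vec in enumerate(image_vectors)
    ],
)

# The encoded search term is run as a query; Qdrant returns the nearest neighbors.
hits = client.search(collection_name="images", query_vector=query_vector.tolist(), limit=5)
print([(hit.payload["file"], round(hit.score, 3)) for hit in hits])
```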
+ **The Dataset**: All images are sourced from the [Open Images Dataset](https://github.com/cvdfoundation/open-images-dataset) by the Common Visual Data Foundation.
+
+ <h3>• Technologies Used</h3>
+
+ - Python
+ - Jupyter Notebooks
+ - Qdrant - Vector Database
+ - Sentence-Transformers - Library
+ - CLIP by OpenAI - ViT Model
+ - Gradio - UI
+ - HuggingFace Spaces - Deployment