---
title: Multimodal Image Search Engine
emoji: πŸ”
colorFrom: yellow
colorTo: yellow
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: mit
---

<p align="center">
  <h1 align="center">Multi-Modal Image Search Engine</h1>
  <p align="center">
    A Semantic Search Engine that understands the Content & Context of your Queries.
    <br>
    Use Multi-Modal inputs like Text-to-Image search or a Reverse Image Search to Query a Vector Database of over 15k Images. <a href="https://huggingface.co/spaces/Snehil-Shah/Multimodal-Image-Search-Engine">Try it Out!</a>
    <br><br>
    <img src="https://github.com/Snehil-Shah/Multimodal-Image-Search-Engine/blob/main/assets/demo.gif?raw=true">
  </p>
</p>

<h3>β€’ About The Project</h3>

At its core, the Search Engine is built upon the concept of **Vector Similarity Search**.
All Images are encoded into vector embeddings that capture their semantic meaning using a Transformer Model, and these embeddings are stored in a vector space.
When a query comes in, the engine returns the query's nearest neighbors in that space, which are the relevant search results.

<p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/encoding_flow.png"></p>
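
To make the retrieval idea concrete, here is a minimal NumPy sketch of nearest-neighbor search over embeddings. The vectors below are random placeholders standing in for real CLIP outputs:

```python
import numpy as np

# Toy example: 5 "image" embeddings and 1 "query" embedding,
# all 512-dimensional like the encoder's real outputs.
rng = np.random.default_rng(0)
image_vectors = rng.normal(size=(5, 512))
query_vector = rng.normal(size=512)

# Cosine similarity = dot product of L2-normalized vectors.
image_norm = image_vectors / np.linalg.norm(image_vectors, axis=1, keepdims=True)
query_norm = query_vector / np.linalg.norm(query_vector)
similarities = image_norm @ query_norm

# The nearest neighbors (highest similarity) are the search results.
top_k = np.argsort(similarities)[::-1][:3]
print(top_k, similarities[top_k])
```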

We use the Contrastive Language-Image Pre-Training (CLIP) Model by OpenAI, a Pre-trained Multi-Modal Vision Transformer that can semantically encode Words, Sentences & Images into a 512-Dimensional Vector. This Vector encapsulates the meaning & context of the entity in a *Mathematically Measurable* format.

<p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/Visualization.png" width=1000></p>
<p align="center"><i>2-D Visualization of 500 Images in a 512-D Vector Space</i></p>
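
With the Sentence-Transformers library, encoding both modalities into the same space takes a few lines. A minimal sketch, assuming the standard 512-d `clip-ViT-B-32` checkpoint (the exact checkpoint used by the project isn't stated here) and a hypothetical image file:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

# 'clip-ViT-B-32' is the standard 512-d CLIP checkpoint in
# Sentence-Transformers (assumed; the project may use another).
model = SentenceTransformer("clip-ViT-B-32")

# Text and images are encoded into the same 512-d vector space.
text_embedding = model.encode("a dog playing in the snow")
image_embedding = model.encode(Image.open("photo.jpg"))  # hypothetical file

print(text_embedding.shape)  # (512,)
```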

The Image embeddings are stored in a Qdrant Collection, Qdrant being a Vector Database. The Search Term is encoded and run as a query against Qdrant, which returns the Nearest Neighbors ranked by their Cosine Similarity to the Search Query.

<p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/retrieval_flow.png"></p>
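
A minimal end-to-end sketch with the `qdrant-client` library, using an in-memory instance and hypothetical collection and payload names (the project's actual configuration may differ):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # assumed checkpoint, as above
client = QdrantClient(":memory:")  # in-memory instance for demonstration

# A collection of 512-d vectors compared by cosine similarity.
client.create_collection(
    collection_name="images",  # hypothetical collection name
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)

# Index a caption's embedding as a stand-in for a real image embedding.
vector = model.encode("a dog playing in the snow").tolist()
client.upsert(
    collection_name="images",
    points=[PointStruct(id=1, vector=vector, payload={"path": "photo.jpg"})],
)

# Encode the search term and retrieve the nearest neighbors.
hits = client.search(
    collection_name="images",
    query_vector=model.encode("dog in winter").tolist(),
    limit=5,
)
for hit in hits:
    print(hit.payload["path"], hit.score)
```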

**The Dataset**: All images are sourced from the [Open Images Dataset](https://github.com/cvdfoundation/open-images-dataset) by the Common Visual Data Foundation.

<h3>β€’ Technologies Used</h3>

- Python
- Jupyter Notebooks
- Qdrant - Vector Database
- Sentence-Transformers - Library
- CLIP by OpenAI - ViT Model
- Gradio - UI
- HuggingFace Spaces - Deployment