|
--- |
|
title: Multimodal Image Search Engine |
|
emoji: π |
|
colorFrom: yellow |
|
colorTo: yellow |
|
sdk: gradio |
|
sdk_version: 4.36.1 |
|
app_file: app.py |
|
pinned: false |
|
license: mit |
|
--- |
|
|
|
<p align="center"> |
|
<h1 align="center">Multi-Modal Image Search Engine</h1> |
|
<p align="center"> |
|
A Semantic Search Engine that understands the Content & Context of your Queries. |
|
<br> |
|
Use multi-modal inputs like text and images, or run a reverse image search, to query a vector database of over 15k images. <a href="https://huggingface.co/spaces/Snehil-Shah/Multimodal-Image-Search-Engine">Try it out!</a>
|
<br><br> |
|
<img src="https://github.com/Snehil-Shah/Multimodal-Image-Search-Engine/blob/main/assets/demo.gif?raw=true"> |
|
</p> |
|
</p> |
|
|
|
<h3>• About The Project</h3>
|
|
|
At its core, the Search Engine is built upon the concept of **Vector Similarity Search**. |
|
All images are first encoded by a transformer model into vector embeddings that capture their semantic meaning, and these embeddings are stored in a vector space.
|
When you search, the query is encoded into the same space and its nearest neighbors are returned as the relevant results.
|
|
|
<p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/encoding_flow.png"></p> |
|
|
|
We use OpenAI's Contrastive Language-Image Pre-Training (CLIP) model, a pre-trained multi-modal vision transformer that can semantically encode words, sentences & images into a 512-dimensional vector. This vector captures the meaning and context of the input in a *mathematically measurable* form.
|
|
|
<p align="center"><p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/Visualization.png" width=1000></p> |
|
<p align="center"><i>2-D Visualization of 500 Images in a 512-D Vector Space</i></p></p> |
|
|
|
The images are stored as vector embeddings in a collection in Qdrant, a vector database. The search term is encoded and run as a query against Qdrant, which returns the nearest neighbors ranked by their cosine similarity to the query vector.
|
|
|
<p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/retrieval_flow.png"></p> |
|
|
|
**The Dataset**: All images are sourced from the [Open Images Dataset](https://github.com/cvdfoundation/open-images-dataset) by the Common Visual Data Foundation.
|
|
|
<h3>• Technologies Used</h3>
|
|
|
- Python |
|
- Jupyter Notebooks |
|
- Qdrant - Vector Database |
|
- Sentence-Transformers - Library |
|
- CLIP by OpenAI - ViT Model |
|
- Gradio - UI |
|
- HuggingFace Spaces - Deployment |