|
--- |
|
title: Multimodal Image Search Engine |
|
emoji: π |
|
colorFrom: yellow |
|
colorTo: yellow |
|
sdk: gradio |
|
sdk_version: 4.36.1 |
|
app_file: app.py |
|
pinned: false |
|
license: mit |
|
--- |
|
|
|
<p align="center"> |
|
<h1 align="center">Multi-Modal Image Search Engine</h1> |
|
<p align="center"> |
|
A Semantic Search Engine that understands the Content & Context of your Queries. |
|
<br> |
|
Use multi-modal inputs like text and images, or run a reverse image search, to query a vector database of over 15k images. <a href="https://huggingface.co/spaces/Snehil-Shah/Multimodal-Image-Search-Engine">Try it out!</a>
|
<br><br> |
|
<img src="https://github.com/Snehil-Shah/Multimodal-Image-Search-Engine/blob/main/assets/demo.gif?raw=true"> |
|
</p> |
|
</p> |
|
|
|
<h3>• About The Project</h3>
|
|
|
At its core, the Search Engine is built upon the concept of **Vector Similarity Search**. |
|
All images are first encoded by a transformer model into vector embeddings that capture their semantic meaning, and these embeddings are stored in a vector space.
|
When you search, the query is encoded into the same space and its nearest neighbors are returned as the relevant results.
|
|
|
<p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/encoding_flow.png"></p> |
|
|
|
We use OpenAI's Contrastive Language-Image Pre-Training (CLIP) model, a pre-trained multi-modal vision transformer that can semantically encode words, sentences & images into a 512-dimensional vector. This vector captures the meaning and context of the input in a *mathematically measurable* form.
|
|
|
<p align="center"><p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/Visualization.png" width=1000></p> |
|
<p align="center"><i>2-D Visualization of 500 Images in a 512-D Vector Space</i></p></p> |
|
|
|
The images are stored as vector embeddings in a collection in Qdrant, a vector database. The search term is encoded and run as a query against Qdrant, which returns the nearest neighbors ranked by their cosine similarity to the query vector.
|
|
|
<p align="center"><img src="https://raw.githubusercontent.com/Snehil-Shah/Multimodal-Image-Search-Engine/main/assets/retrieval_flow.png"></p> |
|
|
|
**The Dataset**: All images are sourced from the [Open Images Dataset](https://github.com/cvdfoundation/open-images-dataset) by the Common Visual Data Foundation.
|
|
|
<h3>• Technologies Used</h3>
|
|
|
- Python |
|
- Jupyter Notebooks |
|
- Qdrant - Vector Database |
|
- Sentence-Transformers - Library |
|
- CLIP by OpenAI - ViT Model |
|
- Gradio - UI |
|
- HuggingFace Spaces - Deployment |