{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/Snehil-Shah/MultiModal-Vector-Semantic-Search-Engine/blob/main/images.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aH0U6JkEbAcg"
      },
      "source": [
        "# Image to Semantic Embeddings\n",
        "\n",
        "**Aim**: Encode roughly 40k JPEG images into vector embeddings using a vision transformer model and upsert them into a vector database for clustering and querying."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "CFLaAyqCbAch"
      },
      "outputs": [],
      "source": [
        "!pip install jupyter pandas qdrant_client pyarrow datasets sentence-transformers"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "j5o4d0jbbAci"
      },
      "source": [
        "# Load Dataset\n",
        "This is the Open Images Dataset, hosted by the Common Visual Data Foundation (CVDF), which contains over 9 million images. We will work with a smaller subset.\n",
        "\n",
        "The subset is provided as a TSV file whose first column holds the URL of a hosted JPEG image."
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import pandas as pd\n",
        "data = pd.read_csv('open-images-dataset-validation.tsv', sep='\\t', header=None).reset_index()\n",
        "print(data.shape, data.head(), sep=\"\\n\")"
      ],
      "metadata": {
        "id": "j97T0MIBeEDe",
        "outputId": "df823427-2859-40f6-c171-f92b5a84361b",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "execution_count": 98,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(41620, 4)\n",
            "   index                                                  0        1  \\\n",
            "0      0  https://c2.staticflickr.com/6/5606/15611395595...  2038323   \n",
            "1      1  https://c6.staticflickr.com/3/2808/10351094034...  1762125   \n",
            "2      2  https://c2.staticflickr.com/9/8089/8416776003_...  9059623   \n",
            "3      3  https://farm3.staticflickr.com/568/21452126474...  2306438   \n",
            "4      4  https://farm4.staticflickr.com/1244/677743874_...  6571968   \n",
            "\n",
            "                          2  \n",
            "0  I4V4qq54NBEFDwBqPYCkDA==  \n",
            "1  38x6O2LAS75H1vUGVzIilg==  \n",
            "2  4ksF8TuGWGcKul6Z/6pq8g==  \n",
            "3  R+6Cs525mCUT6RovHPWREg==  \n",
            "4  JnkYas7iDJu+pb81tfqVow==  \n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Download the images\n",
        "The model needs the image data locally, so we download each image to disk first."
      ],
      "metadata": {
        "id": "M-Esbnhy6KTU"
      }
    },
    {
      "cell_type": "code",
      "source": [
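        "import os\n",
        "import urllib.error\n",
        "import urllib.request\n",
        "\n",
        "# Make sure the download directory exists before writing to it\n",
        "os.makedirs(\"./images\", exist_ok=True)\n",
        "\n",
        "def download_file(url):\n",
        "    # Returns the local path of the image, or None if the download fails\n",
        "    basename = os.path.basename(url)\n",
        "    target_path = f\"./images/{basename}\"\n",
        "    # Skip the download if the file is already cached locally\n",
        "    if not os.path.exists(target_path):\n",
        "        try:\n",
        "            urllib.request.urlretrieve(url, target_path)\n",
        "        except urllib.error.URLError:  # also covers HTTPError, its subclass\n",
        "            return None\n",
        "    return target_path"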
      ],
      "metadata": {
        "id": "cK_63ubnieI6"
      },
      "execution_count": 99,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# The Model\n",
        "We will use a pre-trained model: CLIP (Contrastive Language-Image Pre-training), developed by OpenAI. CLIP is a multimodal model; the variant used here pairs a Vision Transformer (ViT-B/32) image encoder with a text encoder, so images and text are mapped into the same embedding space.\n",
        "\n",
        "We will store these embeddings in a vector database, where semantically similar images lie close together and can be retrieved by nearest-neighbour search."
      ],
      "metadata": {
        "id": "0WrAbzxP6khy"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from sentence_transformers import SentenceTransformer\n",
        "model = SentenceTransformer(\"clip-ViT-B-32\")"
      ],
      "metadata": {
        "id": "pHYk-KdmlJxz"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# The Vector Database\n",
        "\n",
        "Qdrant is an open-source vector database. It stores vector embeddings and returns the nearest neighbours of a query embedding, which is the basis of a recommendation or semantic search engine.\n",
        "\n",
        "We start by initializing the Qdrant client and connecting to the cluster hosted on Qdrant Cloud. The collection uses 512-dimensional vectors, matching the embedding size of `clip-ViT-B-32`, and cosine distance."
      ],
      "metadata": {
        "id": "2h7jMch58ADV"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from qdrant_client import QdrantClient\n",
        "from qdrant_client.http import models as rest\n",
        "from google.colab import userdata\n",
        "\n",
        "# Cluster URL and API key are read from Colab's secret store\n",
        "qdrant_client = QdrantClient(\n",
        "    url=userdata.get('QDRANT_CLUSTER_URL'),\n",
        "    api_key=userdata.get('QDRANT_CLUSTER_API_KEY'),\n",
        ")\n",
        "\n",
        "# recreate_collection drops any existing \"images\" collection before creating it\n",
        "qdrant_client.recreate_collection(\n",
        "    collection_name=\"images\",\n",
        "    vectors_config=rest.VectorParams(size=512, distance=rest.Distance.COSINE),\n",
        ")"
      ],
      "metadata": {
        "id": "nAObCg-yrzpC"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "A helper function to upsert a single embedding, with its payload, into the collection:"
      ],
      "metadata": {
        "id": "zGbMrsDL_HH-"
      }
    },
    {
      "cell_type": "code",
      "source": [
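        "def upsert_to_db(point_id, vector, payload):\n",
        "    # Insert the point, or overwrite it if a point with this ID already exists\n",
        "    qdrant_client.upsert(\n",
        "        collection_name=\"images\",\n",
        "        points=[\n",
        "            rest.PointStruct(\n",
        "                id=point_id,\n",
        "                vector=vector.tolist(),  # NumPy array -> plain Python list\n",
        "                payload=payload,\n",
        "            )\n",
        "        ],\n",
        "    )"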
      ],
      "metadata": {
        "id": "mjTRm85dr13p"
      },
      "execution_count": 76,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
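        "from PIL import Image\n",
        "\n",
        "# link[0] is the image URL; the DataFrame index i doubles as the point ID\n",
        "for i, link in data.iloc[:, :2].iterrows():\n",
        "    img = download_file(link[0])\n",
        "    if img:\n",
        "        # Pass a PIL image: the CLIP model would encode a plain path string as text\n",
        "        embedding = model.encode(Image.open(img))\n",
        "        upsert_to_db(i, embedding, {\"link\": link[0]})\n",
        "        print(f\"upserted {i}\")"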
      ],
      "metadata": {
        "id": "MvFEc4MgwSLW"
      },
      "execution_count": null,
      "outputs": []
    }
  ],
  "metadata": {
    "language_info": {
      "name": "python"
    },
    "colab": {
      "provenance": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}