---
title: Capstone Project
emoji: 📈
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.23.3
app_file: app.py
pinned: false
license: mit
---

# Multi-Modal LLM Demo with Flan-T5

This project demonstrates a multi-modal language model application that accepts text, image, and audio inputs to generate a text response. It leverages:

- **CLIP** from OpenAI for extracting image embeddings.
- **Whisper** from OpenAI for transcribing audio.
- **Flan-T5 Large** from Google as the instruction-tuned text generation model.
- **Gradio** for building an interactive web interface.
- **Hugging Face Spaces** for deployment.

The goal is to fuse different modalities into a single prompt and produce coherent text output.
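
At a high level, each non-text input is first converted to text (an image description via CLIP, a transcript via Whisper) and everything is concatenated into a single instruction-style prompt for Flan-T5. The sketch below illustrates that flow; it is not the exact `app.py` implementation, and the CLIP/Whisper checkpoint names, the zero-shot label list, and the helper functions are illustrative assumptions.

```python
# Minimal sketch of the fusion idea (assumptions noted above, not the exact app.py code).
import soundfile as sf
import torch
from transformers import (
    AutoTokenizer,
    CLIPModel, CLIPProcessor,
    T5ForConditionalGeneration,
    WhisperForConditionalGeneration, WhisperProcessor,
)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
asr = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
asr_proc = WhisperProcessor.from_pretrained("openai/whisper-base")
flan = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")
flan_tok = AutoTokenizer.from_pretrained("google/flan-t5-large")

# Hypothetical label set used to turn a CLIP image embedding into a short description.
CANDIDATE_LABELS = ["a photo of a person", "a chart or graph", "a scanned document", "an outdoor scene"]

def describe_image(image):
    # Zero-shot matching: pick the candidate label closest to the image in CLIP space.
    inputs = clip_proc(text=CANDIDATE_LABELS, images=image, return_tensors="pt", padding=True)
    probs = clip(**inputs).logits_per_image.softmax(dim=-1)
    return CANDIDATE_LABELS[probs.argmax().item()]

def transcribe(audio_path):
    # Whisper expects 16 kHz mono audio; the real app may resample before this step.
    audio, sr = sf.read(audio_path)
    features = asr_proc(audio, sampling_rate=sr, return_tensors="pt").input_features
    ids = asr.generate(features)
    return asr_proc.batch_decode(ids, skip_special_tokens=True)[0]

def answer(text, image=None, audio_path=None):
    # Fuse whatever modalities were provided into one prompt for Flan-T5.
    parts = [f"User message: {text}"]
    if image is not None:
        parts.append(f"Image content: {describe_image(image)}")
    if audio_path is not None:
        parts.append(f"Audio transcript: {transcribe(audio_path)}")
    prompt = "\n".join(parts) + "\nWrite a helpful, on-topic response."
    input_ids = flan_tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        output_ids = flan.generate(
            input_ids,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,       # decoding parameters mentioned under Features
            top_p=0.9,
            repetition_penalty=1.2,
        )
    return flan_tok.decode(output_ids[0], skip_special_tokens=True)
```

The `generate()` call at the end is where the decoding parameters described under Features (temperature, top_p, repetition_penalty) come into play.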


## Features

- **Multi-Modal Inputs:**
  - **Text:** type your query or message.
  - **Image:** upload an image, which is processed with CLIP.
  - **Audio:** upload an audio file, which is transcribed with Whisper.
- **Instruction-Tuned Generation:** uses Flan-T5 Large to produce coherent, on-topic responses.
- **Customizable Decoding:** generation parameters such as temperature, top_p, and repetition_penalty (shown in the sketch above) help produce varied, high-quality outputs.
- **Interactive UI:** a clean, ChatGPT-like interface built with Gradio (see the sketch after this list).
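
A minimal Gradio wiring for the three inputs might look like the following, reusing the hypothetical `answer()` helper from the fusion sketch above; the actual layout in `app.py` (e.g., a chat-style view) may differ.

```python
# Minimal Gradio interface sketch; assumes answer() from the fusion sketch above.
import gradio as gr

demo = gr.Interface(
    fn=answer,
    inputs=[
        gr.Textbox(label="Message"),
        gr.Image(type="pil", label="Optional image"),
        gr.Audio(type="filepath", label="Optional audio"),
    ],
    outputs=gr.Textbox(label="Response"),
    title="Multi-Modal LLM Demo with Flan-T5",
)

if __name__ == "__main__":
    demo.launch()
```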

## Installation & Setup

### Requirements

Ensure your environment has the following dependencies; the provided `requirements.txt` should include:

```
torch
transformers>=4.31.0
accelerate>=0.20.0
gradio
soundfile
```
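
To run the demo locally, a typical workflow is `pip install -r requirements.txt` followed by `python app.py`; when deployed on Hugging Face Spaces, the dependencies are installed automatically from `requirements.txt`.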



Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference