---
title: Capstone Project
emoji: π
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.23.3
app_file: app.py
pinned: false
license: mit
---
# Multi-Modal LLM Demo with Flan-T5
This project demonstrates a multi-modal language model application that accepts text, image, and audio inputs to generate a text response. It leverages:
- **CLIP** from OpenAI for extracting image embeddings.
- **Whisper** from OpenAI for transcribing audio.
- **Flan-T5 Large** from Google as an instruction-tuned text generation model.
- **Gradio** for building an interactive web interface.
- **Hugging Face Spaces** for deployment.
The goal is to fuse different modalities into a single prompt and produce coherent text output.
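As a rough illustration of that fusion step, the sketch below shows how an audio transcript and an image description might be combined with the user's text into a single Flan-T5 prompt. The function name and prompt wording are hypothetical, not the app's actual API:

```python
def fuse_prompt(user_text, transcript=None, image_desc=None):
    """Combine the available modalities into one instruction prompt.

    `transcript` would come from Whisper, and `image_desc` from a
    CLIP-based lookup; both helpers are assumed here, not shown.
    """
    parts = []
    if image_desc:
        parts.append(f"Image content: {image_desc}.")
    if transcript:
        parts.append(f"Audio transcript: {transcript}.")
    parts.append(f"User message: {user_text}")
    parts.append(
        "Respond to the user message, using the image and audio context if relevant."
    )
    return "\n".join(parts)

prompt = fuse_prompt("What is happening here?", "a dog barking", "a dog in a park")
```

The resulting string is then fed to the Flan-T5 tokenizer and model as a single text prompt.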
## Features
**Multi-Modal Inputs:**
- Text: Type your query or message.
- Image: Upload an image, which is processed using CLIP.
- Audio: Upload an audio file, which is transcribed using Whisper.
**Instruction-Tuned Generation:**
- Uses Flan-T5 Large to generate more coherent and on-topic responses.
**Customizable Decoding:**
- Advanced generation parameters (e.g., `temperature`, `top_p`, `repetition_penalty`) ensure varied and high-quality outputs.
**Interactive UI:**
- A clean, ChatGPT-like interface built with Gradio.
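For reference, a typical set of decoding parameters passed to `model.generate` in the `transformers` library might look like the sketch below. The specific values are illustrative assumptions, not the app's actual configuration:

```python
# Illustrative decoding settings for Flan-T5; values are assumptions.
gen_kwargs = dict(
    max_new_tokens=128,      # cap on generated length
    do_sample=True,          # enable sampling so temperature/top_p take effect
    temperature=0.7,         # soften the output distribution
    top_p=0.9,               # nucleus (top-p) sampling threshold
    repetition_penalty=1.2,  # discourage verbatim repetition
)

# Usage with a loaded model (not shown here):
# output_ids = model.generate(**inputs, **gen_kwargs)
```

Note that `temperature` and `top_p` only take effect when `do_sample=True`; with greedy decoding they are ignored.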
## Installation & Setup

### Requirements
Ensure your environment has the following dependencies. The provided `requirements.txt` file should include:

```
torch
transformers>=4.31.0
accelerate>=0.20.0
gradio
soundfile
```
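Assuming the entry point named in the metadata (`app.py`), a typical local setup and launch looks like this (setup fragment only, paths assumed):

```shell
pip install -r requirements.txt
python app.py
```

Gradio then prints a local URL (by default `http://127.0.0.1:7860`) where the interface can be opened in a browser.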