---
title: Capstone Project
emoji: πŸ“ˆ
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.23.3
app_file: app.py
pinned: false
license: mit
---
# Multi-Modal LLM Demo with Flan-T5
This project demonstrates a multi-modal language model application that accepts text, image, and audio inputs to generate a text response. It leverages:
- **CLIP** from OpenAI for extracting image embeddings.
- **Whisper** from OpenAI for transcribing audio.
- **Flan-T5 Large** from Google as an instruction-tuned text generation model.
- **Gradio** for building an interactive web interface.
- **Hugging Face Spaces** for deployment.
The goal is to fuse different modalities into a single prompt and produce coherent text output.
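The fusion step can be sketched as plain prompt construction. This is an illustrative sketch, not the app's actual code: the function name `build_prompt` and the prompt template are assumptions, and `image_description` stands in for whatever text the CLIP stage yields (e.g. a best-matching label), while `transcript` is Whisper's output.

```python
def build_prompt(text, image_description=None, transcript=None):
    """Fuse the available modalities into one instruction prompt for Flan-T5.

    Hypothetical helper: `image_description` is text derived from CLIP's
    image analysis, `transcript` is Whisper's transcription. Both optional.
    """
    parts = []
    if image_description:
        parts.append(f"Image content: {image_description}.")
    if transcript:
        parts.append(f"Audio transcript: {transcript}.")
    parts.append(f"User request: {text}")
    return " ".join(parts)

# Example: all three modalities present
prompt = build_prompt(
    "Summarize what is going on.",
    image_description="a dog catching a frisbee",
    transcript="good catch, buddy",
)
```

The resulting string is then passed to Flan-T5 as a single instruction, so the model sees every modality as ordinary text.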
---
## Features
- **Multi-Modal Inputs:**
- **Text:** Type your query or message.
- **Image:** Upload an image, which is processed using CLIP.
- **Audio:** Upload an audio file, which is transcribed using Whisper.
- **Instruction-Tuned Generation:**
  - Uses Flan-T5 Large, whose instruction tuning keeps responses more coherent and on-topic than a base T5 model.
- **Customizable Decoding:**
  - Generation parameters (e.g., `temperature`, `top_p`, `repetition_penalty`) can be tuned to trade off variety against faithfulness in the output.
- **Interactive UI:**
- A clean, ChatGPT-like interface built with Gradio.
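The decoding parameters listed above are typically passed as keyword arguments to the model's `generate` call. A hedged sketch with plausible values follows; the Space's actual settings may differ:

```python
# Illustrative sampling settings; the Space's actual values may differ.
generation_kwargs = {
    "max_new_tokens": 256,      # cap the length of the response
    "do_sample": True,          # enable sampling (needed for temperature/top_p)
    "temperature": 0.7,         # soften the next-token distribution
    "top_p": 0.9,               # nucleus sampling cutoff
    "repetition_penalty": 1.2,  # discourage verbatim loops
}

# In app.py this would be applied roughly as:
#   outputs = model.generate(**inputs, **generation_kwargs)
```

Lower `temperature` (or `do_sample=False`) makes outputs more deterministic; raising `repetition_penalty` above ~1.0 penalizes tokens the model has already emitted.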
---
## Installation & Setup
### Requirements
Ensure your environment has the following dependencies. The provided `requirements.txt` file should include:
```txt
torch
transformers>=4.31.0
accelerate>=0.20.0
gradio
soundfile
```

Install them with `pip install -r requirements.txt`.

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference