---
title: Capstone Project
emoji: π
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.23.3
app_file: app.py
pinned: false
license: mit
---
# Multi-Modal LLM Demo with Flan-T5

This project demonstrates a multi-modal language model application that accepts text, image, and audio inputs to generate a text response. It leverages:

- **CLIP** from OpenAI for extracting image embeddings.
- **Whisper** from OpenAI for transcribing audio.
- **Flan-T5 Large** from Google as an instruction-tuned text generation model.
- **Gradio** for building an interactive web interface.
- **Hugging Face Spaces** for deployment.

The goal is to fuse different modalities into a single prompt and produce coherent text output.
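The fusion step can be sketched as simple prompt assembly (a minimal illustration, not the project's actual code; the function name, field labels, and the idea of turning the CLIP embedding into a caption string are assumptions):

```python
def fuse_modalities(text, image_caption=None, transcript=None):
    """Assemble one text prompt from the available modality signals.

    `image_caption` stands in for whatever textual summary is derived
    from the CLIP image embedding, and `transcript` for Whisper's
    transcription of the uploaded audio. Missing modalities are skipped.
    """
    parts = []
    if image_caption:
        parts.append(f"Image context: {image_caption}")
    if transcript:
        parts.append(f"Audio transcript: {transcript}")
    parts.append(f"User: {text}")
    return "\n".join(parts)

# Example: text plus an image-derived caption, no audio.
prompt = fuse_modalities("What is shown here?", image_caption="a cat on a sofa")
```

The fused prompt is then handed to Flan-T5 as ordinary text, which is what lets a text-only model respond to image and audio inputs.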
--- | |
## Features

- **Multi-Modal Inputs:**
  - **Text:** Type your query or message.
  - **Image:** Upload an image, which is processed using CLIP.
  - **Audio:** Upload an audio file, which is transcribed using Whisper.
- **Instruction-Tuned Generation:**
  - Uses Flan-T5 Large to generate more coherent, on-topic responses.
- **Customizable Decoding:**
  - Advanced generation parameters (e.g., `temperature`, `top_p`, `repetition_penalty`) allow more varied, higher-quality outputs.
- **Interactive UI:**
  - A clean, ChatGPT-like interface built with Gradio.
--- | |
## Installation & Setup

### Requirements

Ensure your environment has the following dependencies. The provided `requirements.txt` file should include:

```txt
torch
transformers>=4.31.0
accelerate>=0.20.0
gradio
soundfile
```
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference | |