---
title: Capstone Project
emoji: π
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.23.3
app_file: app.py
pinned: false
license: mit
---
# Multi-Modal LLM Demo with Flan-T5
This project demonstrates a multi-modal language model application that accepts text, image, and audio inputs to generate a text response. It leverages:
- **CLIP** from OpenAI for extracting image embeddings.
- **Whisper** from OpenAI for transcribing audio.
- **Flan-T5 Large** from Google as an instruction-tuned text generation model.
- **Gradio** for building an interactive web interface.
- **Hugging Face Spaces** for deployment.
The goal is to fuse different modalities into a single prompt and produce coherent text output.
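As a rough illustration of that fusion step, the sketch below shows how an audio transcript and an image description might be combined with the user's text into a single Flan-T5 prompt. The function name and prompt wording are hypothetical, not the app's actual API:

```python
def fuse_prompt(user_text, transcript=None, image_desc=None):
    """Combine the available modalities into one instruction prompt.

    `transcript` would come from Whisper, and `image_desc` from a
    CLIP-based lookup; both helpers are assumed here, not shown.
    """
    parts = []
    if image_desc:
        parts.append(f"Image content: {image_desc}.")
    if transcript:
        parts.append(f"Audio transcript: {transcript}.")
    parts.append(f"User message: {user_text}")
    parts.append(
        "Respond to the user message, using the image and audio context if relevant."
    )
    return "\n".join(parts)

prompt = fuse_prompt("What is happening here?", "a dog barking", "a dog in a park")
```

The resulting string is then fed to the Flan-T5 tokenizer and model as a single text prompt.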
## Features
**Multi-Modal Inputs:**
- Text: Type your query or message.
- Image: Upload an image, which is processed using CLIP.
- Audio: Upload an audio file, which is transcribed using Whisper.
**Instruction-Tuned Generation:**
- Uses Flan-T5 Large to generate more coherent and on-topic responses.
**Customizable Decoding:**
- Advanced generation parameters (e.g., `temperature`, `top_p`, `repetition_penalty`) ensure varied and high-quality outputs.
**Interactive UI:**
- A clean, ChatGPT-like interface built with Gradio.
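For reference, a typical set of decoding parameters passed to `model.generate` in the `transformers` library might look like the sketch below. The specific values are illustrative assumptions, not the app's actual configuration:

```python
# Illustrative decoding settings for Flan-T5; values are assumptions.
gen_kwargs = dict(
    max_new_tokens=128,      # cap on generated length
    do_sample=True,          # enable sampling so temperature/top_p take effect
    temperature=0.7,         # soften the output distribution
    top_p=0.9,               # nucleus (top-p) sampling threshold
    repetition_penalty=1.2,  # discourage verbatim repetition
)

# Usage with a loaded model (not shown here):
# output_ids = model.generate(**inputs, **gen_kwargs)
```

Note that `temperature` and `top_p` only take effect when `do_sample=True`; with greedy decoding they are ignored.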
## Installation & Setup

### Requirements
Ensure your environment has the following dependencies. The provided `requirements.txt` file should include:

```
torch
transformers>=4.31.0
accelerate>=0.20.0
gradio
soundfile
```
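Assuming the entry point named in the metadata (`app.py`), a typical local setup and launch looks like this (setup fragment only, paths assumed):

```shell
pip install -r requirements.txt
python app.py
```

Gradio then prints a local URL (by default `http://127.0.0.1:7860`) where the interface can be opened in a browser.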