---
title: Capstone Project
emoji: 📈
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.23.3
app_file: app.py
pinned: false
license: mit
---

# Multi-Modal LLM Demo with Flan-T5

This project demonstrates a multi-modal language model application that accepts text, image, and audio inputs to generate a text response. It leverages:

- **CLIP** from OpenAI for extracting image embeddings.
- **Whisper** from OpenAI for transcribing audio.
- **Flan-T5 Large** from Google as an instruction-tuned text generation model.
- **Gradio** for building an interactive web interface.
- **Hugging Face Spaces** for deployment.

The goal is to fuse different modalities into a single prompt and produce coherent text output.
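
The fusion step can be sketched as plain string assembly. This is a minimal illustration, not the code in `app.py`: the function name `build_prompt`, the section labels, and the idea that the image is reduced to a short text description (for example, the best-matching label from CLIP) are all assumptions made for the example.

```python
from typing import Optional


def build_prompt(
    text: str = "",
    image_description: Optional[str] = None,
    audio_transcript: Optional[str] = None,
) -> str:
    """Fuse whichever modalities are present into one instruction prompt.

    Hypothetical helper: the labels and ordering are illustrative choices.
    """
    parts = []
    if image_description:
        # Assumed to be a text summary derived from the CLIP output.
        parts.append(f"Image context: {image_description.strip()}")
    if audio_transcript:
        # Assumed to be the text returned by Whisper.
        parts.append(f"Audio transcript: {audio_transcript.strip()}")
    if text and text.strip():
        parts.append(f"User message: {text.strip()}")
    if not parts:
        raise ValueError("At least one of text, image, or audio is required.")
    # Trailing cue so the text model knows where its answer starts.
    parts.append("Response:")
    return "\n".join(parts)


prompt = build_prompt(
    text="What breed is this?",
    image_description="a photo of a dog",
    audio_transcript="It was barking all morning.",
)
print(prompt)
```

Missing modalities are simply skipped, so the same function would serve text-only, image-only, or combined requests.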

---

## Features

- **Multi-Modal Inputs:**  
  - **Text:** Type your query or message.
  - **Image:** Upload an image, which is processed using CLIP.
  - **Audio:** Upload an audio file, which is transcribed using Whisper.

- **Instruction-Tuned Generation:**  
  - Uses Flan-T5 Large, which is instruction-tuned, to keep responses coherent and on-topic.

- **Customizable Decoding:**  
  - Generation parameters (e.g., `temperature`, `top_p`, `repetition_penalty`) control how varied or conservative the output is.

- **Interactive UI:**  
  - A clean, ChatGPT-like interface built with Gradio.
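
To make the decoding parameters concrete, here is a toy sketch of what `temperature` and `top_p` do to a next-token distribution. It uses only the standard library and a made-up three-token vocabulary; `sample_top_p` is a hypothetical helper, not the routine the model library runs, and `repetition_penalty` is left out for brevity.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=None):
    """Sample one token from a {token: logit} dict (toy illustration)."""
    rng = rng or random.Random()
    # Temperature: divide logits before softmax; <1 sharpens, >1 flattens.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    probs = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Nucleus (top-p): keep the smallest set of most likely tokens
    # whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, kept_probs = zip(*kept)
    # random.choices renormalizes the kept weights internally.
    return rng.choices(tokens, weights=kept_probs, k=1)[0]


toy_logits = {"cat": 5.0, "dog": 1.0, "eel": 0.0}
print(sample_top_p(toy_logits, temperature=1.0, top_p=0.5))
print(sample_top_p(toy_logits, temperature=2.0, top_p=1.0, rng=random.Random(0)))
```

Lower `temperature` and lower `top_p` both concentrate sampling on the most likely tokens; raising either admits less likely tokens and increases variety.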

---

## Installation & Setup

### Requirements

Ensure your environment has the following dependencies. The provided `requirements.txt` file should include:

```txt
torch
transformers>=4.31.0
accelerate>=0.20.0
gradio
soundfile
```

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference