---
title: Capstone Project
emoji: π
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.23.3
app_file: app.py
pinned: false
license: mit
---
# Multi-Modal LLM Demo with Flan-T5

This project demonstrates a multi-modal language model application that accepts text, image, and audio inputs to generate a text response. It leverages:

- **CLIP** from OpenAI for extracting image embeddings.
- **Whisper** from OpenAI for transcribing audio.
- **Flan-T5 Large** from Google as an instruction-tuned text generation model.
- **Gradio** for building an interactive web interface.
- **Hugging Face Spaces** for deployment.

The goal is to fuse different modalities into a single prompt and produce coherent text output.
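The fusion step can be sketched as simple prompt assembly (a minimal illustration, not the project's actual code; the function name, field labels, and the idea of turning the CLIP embedding into a caption string are assumptions):

```python
def fuse_modalities(text, image_caption=None, transcript=None):
    """Assemble one text prompt from the available modality signals.

    `image_caption` stands in for whatever textual summary is derived
    from the CLIP image embedding, and `transcript` for Whisper's
    transcription of the uploaded audio. Missing modalities are skipped.
    """
    parts = []
    if image_caption:
        parts.append(f"Image context: {image_caption}")
    if transcript:
        parts.append(f"Audio transcript: {transcript}")
    parts.append(f"User: {text}")
    return "\n".join(parts)

# Example: text plus an image-derived caption, no audio.
prompt = fuse_modalities("What is shown here?", image_caption="a cat on a sofa")
```

The fused prompt is then handed to Flan-T5 as ordinary text, which is what lets a text-only model respond to image and audio inputs.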
--- | |
## Features

- **Multi-Modal Inputs:**
  - **Text:** Type your query or message.
  - **Image:** Upload an image, which is processed using CLIP.
  - **Audio:** Upload an audio file, which is transcribed using Whisper.
- **Instruction-Tuned Generation:**
  - Uses Flan-T5 Large to generate more coherent, on-topic responses.
- **Customizable Decoding:**
  - Advanced generation parameters (e.g., `temperature`, `top_p`, `repetition_penalty`) allow more varied, higher-quality outputs.
- **Interactive UI:**
  - A clean, ChatGPT-like interface built with Gradio.
--- | |
## Installation & Setup

### Requirements

Ensure your environment has the following dependencies. The provided `requirements.txt` file should include:

```txt
torch
transformers>=4.31.0
accelerate>=0.20.0
gradio
soundfile
```
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference | |