atharvasc27112001 committed on
Commit 1ce383e · verified · 1 Parent(s): 1f61d7b

Update README.md

Files changed (1)
  1. README.md +25 -109
README.md CHANGED
@@ -9,136 +9,52 @@ app_file: app.py
  pinned: false
  license: mit
  ---
- Multi-Modal LLM Demo with Flan-T5
- This project is a multi-modal language model application that accepts text, image, and audio inputs to generate a text response. It leverages:
-
- CLIP from OpenAI for image embeddings.
-
- Whisper from OpenAI for audio transcription.
-
- Flan-T5 Large from Google as an instruction-tuned text generation model.
-
- Gradio to build an interactive web interface.
-
- Hugging Face Spaces for deployment.
-
- The goal is to demonstrate how different modalities can be fused into a single prompt to produce coherent text output.
-
- Features
- Multi-Modal Inputs:
-
- Text: Users can type in their queries.
-
- Image: Users can upload images; the app processes these using CLIP.
-
- Audio: Users can upload audio files; the app transcribes them using Whisper.
-
- Instruction-Tuned Text Generation:
-
- Uses Flan-T5 Large to generate responses based on the fused prompt.
-
- Customizable Decoding:
-
- Advanced generation parameters such as temperature, top_p, and repetition_penalty are applied to produce varied and coherent outputs.
-
- Interactive UI:
-
- A clean, ChatGPT-like interface built with Gradio.
-
- Installation & Setup
- Requirements
- Ensure your environment has the following dependencies. You can install them via the provided requirements.txt:
-
- txt
- Copy
+ # Multi-Modal LLM Demo with Flan-T5
+
+ This project demonstrates a multi-modal language model application that accepts text, image, and audio inputs to generate a text response. It leverages:
+
+ - **CLIP** from OpenAI for extracting image embeddings.
+ - **Whisper** from OpenAI for transcribing audio.
+ - **Flan-T5 Large** from Google as an instruction-tuned text generation model.
+ - **Gradio** for building an interactive web interface.
+ - **Hugging Face Spaces** for deployment.
+
+ The goal is to fuse different modalities into a single prompt and produce coherent text output.
+
+ ---
+
+ ## Features
+
+ - **Multi-Modal Inputs:**
+   - **Text:** Type your query or message.
+   - **Image:** Upload an image, which is processed using CLIP.
+   - **Audio:** Upload an audio file, which is transcribed using Whisper.
+
+ - **Instruction-Tuned Generation:**
+   - Uses Flan-T5 Large to generate more coherent and on-topic responses.
+
+ - **Customizable Decoding:**
+   - Advanced generation parameters (e.g., `temperature`, `top_p`, `repetition_penalty`) ensure varied and high-quality outputs.
+
+ - **Interactive UI:**
+   - A clean, ChatGPT-like interface built with Gradio.
+
+ ---
+
+ ## Installation & Setup
+
+ ### Requirements
+
+ Ensure your environment has the following dependencies. The provided `requirements.txt` file should include:
+
+ ```txt
  torch
  transformers>=4.31.0
  accelerate>=0.20.0
  gradio
  soundfile
- Getting Started
- Clone the Repository:
-
- bash
- Copy
- git clone <your-repo-url>
- cd <your-repo-directory>
- (Optional) Create a Virtual Environment:
-
- bash
- Copy
- python -m venv env
- source env/bin/activate # On Windows: env\Scripts\activate
- Install Dependencies:
-
- bash
- Copy
- pip install --upgrade pip
- pip install -r requirements.txt
- Running the App Locally
- The main application is defined in app.py. To run the app locally:
-
- bash
- Copy
- python app.py
- This will launch the Gradio interface locally. Open the URL provided in your terminal to interact with the app via your browser.
-
- Project Structure
- Copy
- ├── app.py
- ├── requirements.txt
- └── README.md
- app.py:
- Contains the complete code for processing multi-modal inputs and generating responses.
-
- requirements.txt:
- Lists all the required dependencies.
-
- README.md:
- Provides an overview, installation instructions, and usage details for the project.
-
- How It Works
- Image Processing:
-
- The app uses the CLIP model to extract image embeddings.
-
- A linear projection layer converts these 512-dimensional embeddings to the 768-dimensional space expected by Flan-T5.
-
- Audio Processing:
-
- Whisper transcribes audio files into text.
-
- The transcription is appended to the text prompt.
-
- Text Processing:
-
- The provided text input (if any) is combined with placeholders representing the image and audio content.
-
- The fused prompt is tokenized and fed into the Flan-T5 model to generate a response.
-
- Decoding:
-
- Advanced generation parameters such as temperature, top_p, repetition_penalty, and do_sample are applied to guide the text generation process, ensuring varied and coherent outputs.
-
- Deployment:
-
- The Gradio interface provides an intuitive, web-based UI.
-
- The app is designed to be deployed on Hugging Face Spaces, making it easily accessible.
-
- Future Improvements
- Fine-Tuning:
-
- Fine-tune the projection layers and the text model on a dedicated multi-modal dataset (e.g., Instruct 150k) using techniques like QLoRa.
-
- Enhanced Fusion:
-
- Develop more sophisticated fusion strategies beyond concatenating placeholder tags.
-
- Model Upgrades:
-
- Experiment with different instruction-tuned or conversation-focused models to improve the quality of generated responses.

  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
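
The "How It Works" notes in the pre-change README above outline four steps: CLIP image embedding plus a 512-to-768 linear projection, Whisper transcription, prompt fusion via placeholder tags, and sampled decoding. The sketches below illustrate those steps; they are minimal reconstructions under stated assumptions, not the actual contents of app.py. First, the image path, assuming the `openai/clip-vit-base-patch32` checkpoint (whose image-feature projection is 512-dimensional):

```python
# Sketch of the image path: a CLIP image embedding (512-d) projected to
# Flan-T5's hidden size (768-d). The checkpoint choice is an assumption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Linear projection from CLIP's 512-d image space to Flan-T5's 768-d space.
# Per the README's future-work section, this layer is a fine-tuning candidate.
projection = torch.nn.Linear(512, 768)

def embed_image(image: Image.Image) -> torch.Tensor:
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = clip.get_image_features(**inputs)  # shape: (1, 512)
    return projection(features)                       # shape: (1, 768)
```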
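Next, the audio path: Whisper transcribes the upload and the text is appended to the prompt. A sketch using the `transformers` ASR pipeline; the checkpoint name is an assumption:

```python
from transformers import pipeline

# Whisper via the transformers ASR pipeline; "openai/whisper-base" is an
# assumed checkpoint, not necessarily the one app.py loads.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

def transcribe(audio_path: str) -> str:
    # The pipeline accepts a file path and returns a dict with a "text" key.
    return asr(audio_path)["text"]
```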
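Then prompt fusion and decoding with Flan-T5 Large. The `[IMAGE]`/`[AUDIO]` placeholder tags and the sampling values are assumptions; the README names the parameters (`temperature`, `top_p`, `repetition_penalty`, `do_sample`) but not their values:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

def generate_response(text, image_note=None, transcript=None):
    # Fuse modalities into one prompt; [IMAGE]/[AUDIO] are assumed tags.
    parts = []
    if image_note:
        parts.append(f"[IMAGE] {image_note}")
    if transcript:
        parts.append(f"[AUDIO] {transcript}")
    parts.append(text)
    prompt = "\n".join(parts)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,          # sampling enabled, per the decoding notes
        temperature=0.7,         # assumed values; app.py may use different ones
        top_p=0.9,
        repetition_penalty=1.2,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```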
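Finally, the Gradio interface wires the three inputs to one text output, as both README versions describe. A minimal sketch; labels and component options are assumptions:

```python
import gradio as gr

def respond(text, image, audio):
    # Placeholder handler: app.py would run the CLIP, Whisper, and Flan-T5
    # steps sketched above and return the generated text.
    return f"(demo) you said: {text}"

demo = gr.Interface(
    fn=respond,
    inputs=[
        gr.Textbox(label="Text"),
        gr.Image(type="pil", label="Image (optional)"),
        gr.Audio(type="filepath", label="Audio (optional)"),
    ],
    outputs=gr.Textbox(label="Response"),
    title="Multi-Modal LLM Demo with Flan-T5",
)

if __name__ == "__main__":
    demo.launch()  # on Spaces, the app launches the same way from app.py
```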