app_file: app.py
pinned: false
license: mit
---

# Multi-Modal LLM Demo with Flan-T5

This project demonstrates a multi-modal language model application that accepts text, image, and audio inputs to generate a text response. It leverages:

- **CLIP** from OpenAI for extracting image embeddings.
- **Whisper** from OpenAI for transcribing audio.
- **Flan-T5 Large** from Google as an instruction-tuned text generation model.
- **Gradio** for building an interactive web interface.
- **Hugging Face Spaces** for deployment.

The goal is to fuse different modalities into a single prompt and produce coherent text output.

---

## Features

- **Multi-Modal Inputs:**
  - **Text:** Type your query or message.
  - **Image:** Upload an image, which is processed using CLIP.
  - **Audio:** Upload an audio file, which is transcribed using Whisper.
- **Instruction-Tuned Generation:**
  - Uses Flan-T5 Large to generate more coherent and on-topic responses.
- **Customizable Decoding:**
  - Advanced generation parameters (e.g., `temperature`, `top_p`, `repetition_penalty`) ensure varied and high-quality outputs.
- **Interactive UI:**
  - A clean, ChatGPT-like interface built with Gradio.

---

## Installation & Setup

### Requirements

Ensure your environment has the following dependencies. The provided `requirements.txt` file should include:

```txt
torch
transformers>=4.31.0
accelerate>=0.20.0
gradio
soundfile
```

### Getting Started

1. **Clone the repository:**

   ```bash
   git clone <your-repo-url>
   cd <your-repo-directory>
   ```

2. **(Optional) Create a virtual environment:**

   ```bash
   python -m venv env
   source env/bin/activate  # On Windows: env\Scripts\activate
   ```

3. **Install dependencies:**

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

## Running the App Locally

The main application is defined in `app.py`. To run the app locally:

```bash
python app.py
```

This will launch the Gradio interface locally. Open the URL provided in your terminal to interact with the app via your browser.

## Project Structure

```
├── app.py
├── requirements.txt
└── README.md
```

- **app.py:** Contains the complete code for processing multi-modal inputs and generating responses.
- **requirements.txt:** Lists all the required dependencies.
- **README.md:** Provides an overview, installation instructions, and usage details for the project.

## How It Works

**Image Processing:**

The app uses the CLIP model to extract image embeddings. A linear projection layer converts these 512-dimensional embeddings to the 768-dimensional space expected by Flan-T5.
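
A minimal sketch of this step, assuming the Hugging Face `CLIPModel` with the ViT-B/32 checkpoint (whose image embeddings are 512-dimensional); the names are illustrative, not the exact code in `app.py`:

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Trainable projection from CLIP's 512-dim embedding space to the
# 768-dim space this README attributes to Flan-T5.
projection = nn.Linear(512, 768)

def embed_image(image: Image.Image) -> torch.Tensor:
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        image_features = clip_model.get_image_features(**inputs)  # shape (1, 512)
    return projection(image_features)  # shape (1, 768)
```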

**Audio Processing:**

Whisper transcribes audio files into text. The transcription is appended to the text prompt.
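
For example, via the `transformers` speech-recognition pipeline (the specific Whisper checkpoint here is an assumption):

```python
from transformers import pipeline

# Whisper wrapped in the automatic-speech-recognition pipeline.
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def transcribe(audio_path: str) -> str:
    # The pipeline returns a dict like {"text": "..."}.
    return transcriber(audio_path)["text"]
```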

**Text Processing:**

The provided text input (if any) is combined with placeholders representing the image and audio content. The fused prompt is tokenized and fed into the Flan-T5 model to generate a response.
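
The fusion can be as simple as string concatenation; a sketch with illustrative placeholder tags (not necessarily the tags `app.py` uses):

```python
def build_prompt(user_text: str, has_image: bool, transcription: str) -> str:
    parts = []
    if has_image:
        parts.append("[IMAGE]")  # placeholder tag standing in for the image
    if transcription:
        parts.append(f"[AUDIO] {transcription}")
    if user_text:
        parts.append(user_text)
    return " ".join(parts)

prompt = build_prompt("Describe the scene.", True, "birds chirping in a park")
# -> "[IMAGE] [AUDIO] birds chirping in a park Describe the scene."
```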

**Decoding:**

Advanced generation parameters such as `temperature`, `top_p`, `repetition_penalty`, and `do_sample` are applied to guide the text generation process, ensuring varied and coherent outputs.
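
A sketch of the generation call; the parameter values are illustrative defaults, not necessarily those used in `app.py`:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# The fused prompt from the previous step.
inputs = tokenizer("[IMAGE] [AUDIO] birds chirping in a park Describe the scene.",
                   return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.7,         # soften the token distribution
    top_p=0.9,               # nucleus sampling
    repetition_penalty=1.2,  # discourage repeated phrases
)
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
```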

**Deployment:**

The Gradio interface provides an intuitive, web-based UI. The app is designed to be deployed on Hugging Face Spaces, making it easily accessible.
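
A minimal wiring of the three inputs in Gradio; the `respond` stub stands in for the full pipeline described above:

```python
import gradio as gr

def respond(text, image, audio):
    # Placeholder: app.py fuses the three modalities and runs Flan-T5 here.
    return f"Received text={bool(text)}, image={image is not None}, audio={audio is not None}"

demo = gr.Interface(
    fn=respond,
    inputs=[
        gr.Textbox(label="Text prompt"),
        gr.Image(type="pil", label="Image (optional)"),
        gr.Audio(type="filepath", label="Audio (optional)"),
    ],
    outputs=gr.Textbox(label="Response"),
    title="Multi-Modal LLM Demo with Flan-T5",
)

if __name__ == "__main__":
    demo.launch()
```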

## Future Improvements

- **Fine-Tuning:** Fine-tune the projection layers and the text model on a dedicated multi-modal dataset (e.g., Instruct 150k) using techniques like QLoRA (see the sketch after this list).
- **Enhanced Fusion:** Develop more sophisticated fusion strategies beyond concatenating placeholder tags.
- **Model Upgrades:** Experiment with different instruction-tuned or conversation-focused models to improve the quality of generated responses.
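
A hypothetical QLoRA setup using `peft` and `bitsandbytes` (neither is in `requirements.txt`; this sketches the technique, not existing project code):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

# Load the frozen base model with 4-bit quantized weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-large", quantization_config=bnb_config
)

# Attach small trainable LoRA adapters to the T5 attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```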

---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference