atharvasc27112001 committed
Commit 1f61d7b · verified · 1 Parent(s): 43d8873

Update README.md

Files changed (1): README.md (+131, -0)
README.md CHANGED
@@ -9,5 +9,136 @@ app_file: app.py
  pinned: false
  license: mit
  ---
+ # Multi-Modal LLM Demo with Flan-T5
+
+ This project is a multi-modal language model application that accepts text, image, and audio inputs to generate a text response. It leverages:
+
+ - **CLIP** from OpenAI for image embeddings.
+ - **Whisper** from OpenAI for audio transcription.
+ - **Flan-T5 Large** from Google as an instruction-tuned text generation model.
+ - **Gradio** to build an interactive web interface.
+ - **Hugging Face Spaces** for deployment.
+
+ The goal is to demonstrate how different modalities can be fused into a single prompt to produce coherent text output.
+
+ ## Features
+
+ - **Multi-Modal Inputs:**
+   - **Text:** Users can type in their queries.
+   - **Image:** Users can upload images; the app processes these using CLIP.
+   - **Audio:** Users can upload audio files; the app transcribes them using Whisper.
+ - **Instruction-Tuned Text Generation:** Uses Flan-T5 Large to generate responses based on the fused prompt.
+ - **Customizable Decoding:** Advanced generation parameters such as temperature, top_p, and repetition_penalty are applied to produce varied and coherent outputs.
+ - **Interactive UI:** A clean, ChatGPT-like interface built with Gradio.
+
+ ## Installation & Setup
+
+ ### Requirements
+
+ Ensure your environment has the following dependencies. You can install them via the provided requirements.txt:
+
+ ```txt
+ torch
+ transformers>=4.31.0
+ accelerate>=0.20.0
+ gradio
+ soundfile
+ ```
+
+ ### Getting Started
+
+ **Clone the repository:**
+
+ ```bash
+ git clone <your-repo-url>
+ cd <your-repo-directory>
+ ```
+
+ **(Optional) Create a virtual environment:**
+
+ ```bash
+ python -m venv env
+ source env/bin/activate  # On Windows: env\Scripts\activate
+ ```
+
+ **Install dependencies:**
+
+ ```bash
+ pip install --upgrade pip
+ pip install -r requirements.txt
+ ```
+
+ ### Running the App Locally
+
+ The main application is defined in app.py. To run the app locally:
+
+ ```bash
+ python app.py
+ ```
+
+ This will launch the Gradio interface locally. Open the URL provided in your terminal to interact with the app via your browser.
+
+ ## Project Structure
+
+ ```
+ ├── app.py
+ ├── requirements.txt
+ └── README.md
+ ```
+
+ - **app.py:** Contains the complete code for processing multi-modal inputs and generating responses.
+ - **requirements.txt:** Lists all the required dependencies.
+ - **README.md:** Provides an overview, installation instructions, and usage details for the project.
+
+ ## How It Works
+
+ ### Image Processing
+
+ - The app uses the CLIP model to extract image embeddings.
+ - A linear projection layer converts these 512-dimensional embeddings to the 768-dimensional space expected by Flan-T5.
+
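+ As a rough illustration, this step could look like the sketch below, built on the CLIP classes from transformers. The checkpoint name and the `ImageProjector` class are illustrative stand-ins for what app.py actually does, and the target dimension should match the hidden size of the text model in use.
+
+ ```python
+ import torch
+ import torch.nn as nn
+ from PIL import Image
+ from transformers import CLIPModel, CLIPProcessor
+
+ clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
+ clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
+
+ class ImageProjector(nn.Module):
+     """Maps a 512-dim CLIP image embedding into the text model's embedding space."""
+     def __init__(self, clip_dim: int = 512, text_dim: int = 768):
+         super().__init__()
+         self.proj = nn.Linear(clip_dim, text_dim)
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         return self.proj(x)
+
+ def embed_image(image: Image.Image, projector: ImageProjector) -> torch.Tensor:
+     inputs = clip_processor(images=image, return_tensors="pt")
+     with torch.no_grad():
+         clip_emb = clip_model.get_image_features(**inputs)  # shape: (1, 512)
+     return projector(clip_emb)                              # shape: (1, text_dim)
+ ```
+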
+ ### Audio Processing
+
+ - Whisper transcribes audio files into text.
+ - The transcription is appended to the text prompt.
+
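+ For illustration, the transcription step could be as simple as the sketch below, using the transformers speech-recognition pipeline (the Whisper checkpoint size and the helper name are assumptions, not necessarily what app.py loads).
+
+ ```python
+ from transformers import pipeline
+
+ # Load a Whisper checkpoint through the ASR pipeline (checkpoint size is illustrative).
+ asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
+
+ def transcribe(audio_path: str) -> str:
+     """Return the transcription text for an uploaded audio file."""
+     return asr(audio_path)["text"]
+
+ # The transcript is then appended to the user's text prompt, e.g.:
+ # prompt = f"{user_text}\nAudio transcript: {transcribe(audio_path)}"
+ ```
+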
+ ### Text Processing
+
+ - The provided text input (if any) is combined with placeholders representing the image and audio content.
+ - The fused prompt is tokenized and fed into the Flan-T5 model to generate a response.
+
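+ A rough sketch of the fusion and tokenization step is shown below; the placeholder tag and helper name are illustrative, not the exact ones used in app.py.
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
+
+ def build_prompt(user_text, has_image=False, transcript=""):
+     """Fuse the available modalities into a single text prompt."""
+     parts = []
+     if has_image:
+         parts.append("[IMAGE]")  # placeholder tag standing in for the image content
+     if transcript:
+         parts.append(f"Audio transcript: {transcript}")
+     if user_text:
+         parts.append(user_text)
+     return "\n".join(parts)
+
+ prompt = build_prompt("What is happening here?", has_image=True, transcript="birds chirping")
+ inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
+ ```
+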
+ ### Decoding
+
+ - Advanced generation parameters such as temperature, top_p, repetition_penalty, and do_sample are applied to guide the text generation process, ensuring varied and coherent outputs.
+
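+ In code, the generation call looks roughly like this; the parameter values below are illustrative defaults rather than the exact settings in app.py.
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
+ model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
+
+ inputs = tokenizer("Describe a sunny day at the beach.", return_tensors="pt")
+ output_ids = model.generate(
+     **inputs,
+     max_new_tokens=128,
+     do_sample=True,          # sample instead of greedy decoding
+     temperature=0.7,         # soften the next-token distribution
+     top_p=0.9,               # nucleus sampling
+     repetition_penalty=1.2,  # discourage repeated phrases
+ )
+ print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+ ```
+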
+ ### Deployment
+
+ - The Gradio interface provides an intuitive, web-based UI.
+ - The app is designed to be deployed on Hugging Face Spaces, making it easily accessible.
+
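+ A minimal sketch of the Gradio wiring is shown below; the `respond` function is a stand-in for the full multi-modal pipeline implemented in app.py.
+
+ ```python
+ import gradio as gr
+
+ def respond(text, image, audio):
+     # text: str, image: PIL.Image or None, audio: path to the uploaded file or None.
+     # In app.py this is where the CLIP / Whisper / Flan-T5 pipeline runs.
+     return "model output goes here"
+
+ demo = gr.Interface(
+     fn=respond,
+     inputs=[
+         gr.Textbox(label="Text prompt"),
+         gr.Image(type="pil", label="Image (optional)"),
+         gr.Audio(type="filepath", label="Audio (optional)"),
+     ],
+     outputs=gr.Textbox(label="Response"),
+     title="Multi-Modal LLM Demo with Flan-T5",
+ )
+
+ if __name__ == "__main__":
+     demo.launch()
+ ```
+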
+ ## Future Improvements
+
+ - **Fine-Tuning:** Fine-tune the projection layers and the text model on a dedicated multi-modal dataset (e.g., Instruct 150k) using techniques like QLoRA (see the sketch after this list).
+ - **Enhanced Fusion:** Develop more sophisticated fusion strategies beyond concatenating placeholder tags.
+ - **Model Upgrades:** Experiment with different instruction-tuned or conversation-focused models to improve the quality of generated responses.
+
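+ As a starting point for the fine-tuning idea above, a QLoRA-style setup with the peft and bitsandbytes libraries could look roughly like the sketch below. None of this is part of the current app; it requires a CUDA GPU with bitsandbytes installed, and the hyperparameters are illustrative.
+
+ ```python
+ import torch
+ from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
+ from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
+
+ # Load the base model in 4-bit to keep memory low (QLoRA-style).
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+ model = AutoModelForSeq2SeqLM.from_pretrained(
+     "google/flan-t5-large",
+     quantization_config=bnb_config,
+     device_map="auto",
+ )
+ model = prepare_model_for_kbit_training(model)
+
+ # Attach low-rank adapters to the attention projections of the T5 blocks.
+ lora_config = LoraConfig(
+     task_type=TaskType.SEQ_2_SEQ_LM,
+     r=16,
+     lora_alpha=32,
+     lora_dropout=0.05,
+     target_modules=["q", "v"],
+ )
+ model = get_peft_model(model, lora_config)
+ model.print_trainable_parameters()
+ ```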
 
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference