atharvasc27112001 committed on
Commit 1ce383e · verified · 1 Parent(s): 1f61d7b

Update README.md

Files changed (1)
  1. README.md +25 -109
README.md CHANGED
@@ -9,136 +9,52 @@ app_file: app.py
  pinned: false
  license: mit
  ---
- Multi-Modal LLM Demo with Flan-T5
- This project is a multi-modal language model application that accepts text, image, and audio inputs to generate a text response. It leverages:
-
- CLIP from OpenAI for image embeddings.
-
- Whisper from OpenAI for audio transcription.
-
- Flan-T5 Large from Google as an instruction-tuned text generation model.
-
- Gradio to build an interactive web interface.
-
- Hugging Face Spaces for deployment.
-
- The goal is to demonstrate how different modalities can be fused into a single prompt to produce coherent text output.
-
- Features
- Multi-Modal Inputs:
-
- Text: Users can type in their queries.
-
- Image: Users can upload images; the app processes these using CLIP.
-
- Audio: Users can upload audio files; the app transcribes them using Whisper.
-
- Instruction-Tuned Text Generation:
-
- Uses Flan-T5 Large to generate responses based on the fused prompt.
-
- Customizable Decoding:
-
- Advanced generation parameters such as temperature, top_p, and repetition_penalty are applied to produce varied and coherent outputs.
-
- Interactive UI:
-
- A clean, ChatGPT-like interface built with Gradio.
-
- Installation & Setup
- Requirements
- Ensure your environment has the following dependencies. You can install them via the provided requirements.txt:
-
- txt
- Copy
+ # Multi-Modal LLM Demo with Flan-T5
+
+ This project demonstrates a multi-modal language model application that accepts text, image, and audio inputs to generate a text response. It leverages:
+
+ - **CLIP** from OpenAI for extracting image embeddings.
+ - **Whisper** from OpenAI for transcribing audio.
+ - **Flan-T5 Large** from Google as an instruction-tuned text generation model.
+ - **Gradio** for building an interactive web interface.
+ - **Hugging Face Spaces** for deployment.
+
+ The goal is to fuse different modalities into a single prompt and produce coherent text output.
+
+ ---
+
+ ## Features
+
+ - **Multi-Modal Inputs:**
+   - **Text:** Type your query or message.
+   - **Image:** Upload an image, which is processed using CLIP.
+   - **Audio:** Upload an audio file, which is transcribed using Whisper.
+
+ - **Instruction-Tuned Generation:**
+   - Uses Flan-T5 Large to generate more coherent and on-topic responses.
+
+ - **Customizable Decoding:**
+   - Advanced generation parameters (e.g., `temperature`, `top_p`, `repetition_penalty`) ensure varied and high-quality outputs.
+
+ - **Interactive UI:**
+   - A clean, ChatGPT-like interface built with Gradio.
+
+ ---
+
+ ## Installation & Setup
+
+ ### Requirements
+
+ Ensure your environment has the following dependencies. The provided `requirements.txt` file should include:
+
+ ```txt
  torch
  transformers>=4.31.0
  accelerate>=0.20.0
  gradio
  soundfile
- Getting Started
- Clone the Repository:
-
- bash
- Copy
- git clone <your-repo-url>
- cd <your-repo-directory>
- (Optional) Create a Virtual Environment:
-
- bash
- Copy
- python -m venv env
- source env/bin/activate # On Windows: env\Scripts\activate
- Install Dependencies:
-
- bash
- Copy
- pip install --upgrade pip
- pip install -r requirements.txt
- Running the App Locally
- The main application is defined in app.py. To run the app locally:
-
- bash
- Copy
- python app.py
- This will launch the Gradio interface locally. Open the URL provided in your terminal to interact with the app via your browser.
-
- Project Structure
- Copy
- ├── app.py
- ├── requirements.txt
- └── README.md
- app.py:
- Contains the complete code for processing multi-modal inputs and generating responses.
-
- requirements.txt:
- Lists all the required dependencies.
-
- README.md:
- Provides an overview, installation instructions, and usage details for the project.
-
- How It Works
- Image Processing:
-
- The app uses the CLIP model to extract image embeddings.
-
- A linear projection layer converts these 512-dimensional embeddings to the 768-dimensional space expected by Flan-T5.
-
- Audio Processing:
-
- Whisper transcribes audio files into text.
-
- The transcription is appended to the text prompt.
-
- Text Processing:
-
- The provided text input (if any) is combined with placeholders representing the image and audio content.
-
- The fused prompt is tokenized and fed into the Flan-T5 model to generate a response.
-
- Decoding:
-
- Advanced generation parameters such as temperature, top_p, repetition_penalty, and do_sample are applied to guide the text generation process, ensuring varied and coherent outputs.
-
- Deployment:
-
- The Gradio interface provides an intuitive, web-based UI.
-
- The app is designed to be deployed on Hugging Face Spaces, making it easily accessible.
-
- Future Improvements
- Fine-Tuning:
-
- Fine-tune the projection layers and the text model on a dedicated multi-modal dataset (e.g., Instruct 150k) using techniques like QLoRa.
-
- Enhanced Fusion:
-
- Develop more sophisticated fusion strategies beyond concatenating placeholder tags.
-
- Model Upgrades:
-
- Experiment with different instruction-tuned or conversation-focused models to improve the quality of generated responses.

  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
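
The "How It Works" notes in the pre-change README above outline four steps: CLIP image embedding plus a 512-to-768 linear projection, Whisper transcription, prompt fusion via placeholder tags, and sampled decoding. The sketches below illustrate those steps; they are minimal reconstructions under stated assumptions, not the actual contents of app.py. First, the image path, assuming the `openai/clip-vit-base-patch32` checkpoint (whose image-feature projection is 512-dimensional):

```python
# Sketch of the image path: a CLIP image embedding (512-d) projected to
# Flan-T5's hidden size (768-d). The checkpoint choice is an assumption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Linear projection from CLIP's 512-d image space to Flan-T5's 768-d space.
# Per the README's future-work section, this layer is a fine-tuning candidate.
projection = torch.nn.Linear(512, 768)

def embed_image(image: Image.Image) -> torch.Tensor:
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = clip.get_image_features(**inputs)  # shape: (1, 512)
    return projection(features)                       # shape: (1, 768)
```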
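Next, the audio path: Whisper transcribes the upload and the text is appended to the prompt. A sketch using the `transformers` ASR pipeline; the checkpoint name is an assumption:

```python
from transformers import pipeline

# Whisper via the transformers ASR pipeline; "openai/whisper-base" is an
# assumed checkpoint, not necessarily the one app.py loads.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

def transcribe(audio_path: str) -> str:
    # The pipeline accepts a file path and returns a dict with a "text" key.
    return asr(audio_path)["text"]
```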
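Then prompt fusion and decoding with Flan-T5 Large. The `[IMAGE]`/`[AUDIO]` placeholder tags and the sampling values are assumptions; the README names the parameters (`temperature`, `top_p`, `repetition_penalty`, `do_sample`) but not their values:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

def generate_response(text, image_note=None, transcript=None):
    # Fuse modalities into one prompt; [IMAGE]/[AUDIO] are assumed tags.
    parts = []
    if image_note:
        parts.append(f"[IMAGE] {image_note}")
    if transcript:
        parts.append(f"[AUDIO] {transcript}")
    parts.append(text)
    prompt = "\n".join(parts)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,          # sampling enabled, per the decoding notes
        temperature=0.7,         # assumed values; app.py may use different ones
        top_p=0.9,
        repetition_penalty=1.2,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```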
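Finally, the Gradio interface wires the three inputs to one text output, as both README versions describe. A minimal sketch; labels and component options are assumptions:

```python
import gradio as gr

def respond(text, image, audio):
    # Placeholder handler: app.py would run the CLIP, Whisper, and Flan-T5
    # steps sketched above and return the generated text.
    return f"(demo) you said: {text}"

demo = gr.Interface(
    fn=respond,
    inputs=[
        gr.Textbox(label="Text"),
        gr.Image(type="pil", label="Image (optional)"),
        gr.Audio(type="filepath", label="Audio (optional)"),
    ],
    outputs=gr.Textbox(label="Response"),
    title="Multi-Modal LLM Demo with Flan-T5",
)

if __name__ == "__main__":
    demo.launch()  # on Spaces, the app launches the same way from app.py
```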