khang119966 committed
Commit 20fac9c · verified · 1 Parent(s): 3d50453

Update app.py

Files changed (1): app.py (+7 -8)
app.py CHANGED

```diff
@@ -30,6 +30,8 @@ import os
 from moviepy.editor import VideoFileClip, AudioFileClip
 import multiprocessing
 import imageio
+import tqdm
+
 
 subprocess.run('pip install flash-attn --no-build-isolation', env={'FLASH_ATTENTION_SKIP_CUDA_BUILD': "TRUE"}, shell=True)
 
@@ -557,14 +559,11 @@ def generate_video(image, prompt, max_tokens):
     return "heatmap_animation.mp4"
 
 with gr.Blocks() as demo:
-    gr.Markdown("""## 🎥 Visualizing How Multimodal Models Think
-This tool generates a video to **visualize how a multimodal model (image + text)** attends to different parts of an image while generating text.
-### 📌 What it does:
-- Takes an input image and a text prompt. - Shows how the model’s attention shifts on the image for each generated token. - Helps explain the model’s behavior and decision-making.
-### 🖼️ Video layout (per frame):
-Each frame in the video includes: 1. 🔥 **Heatmap over image**: Shows which area the model focuses on. 2. 📝 **Generated text**: With old context, current token highlighted. 3. 📊 **Token prediction table**: Shows the model’s top next-token guesses.
-### 🎯 Use cases:
-- Research explainability of vision-language models. - Debugging or interpreting model outputs. - Creating educational visualizations.
+    gr.Markdown("""# 🎥 Visualizing How Multimodal Models Think
+- This tool generates a video to **visualize how a multimodal model (image + text)** attends to different parts of an image while generating text.
+📌 What it does: - Takes an input image and a text prompt. - Shows how the model’s attention shifts on the image for each generated token. - Helps explain the model’s behavior and decision-making.
+🖼️ Video layout (per frame): Each frame in the video includes: 1. 🔥 **Heatmap over image**: Shows which area the model focuses on. 2. 📝 **Generated text**: With old context, current token highlighted. 3. 📊 **Token prediction table**: Shows the model’s top next-token guesses.
+🎯 Use cases: Research explainability of vision-language models. - Debugging or interpreting model outputs. - Creating educational visualizations.
     """)
 
     with gr.Row():
```
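A side note on the first hunk: `subprocess.run(..., env={'FLASH_ATTENTION_SKIP_CUDA_BUILD': "TRUE"}, shell=True)` hands the child process only that single variable, because `env=` replaces the inherited environment rather than extending it. A minimal sketch of a more defensive variant (my suggestion, not part of this commit):

```python
import os
import subprocess

# Same runtime flash-attn install as in app.py, but merged with os.environ so
# PATH, HOME, CUDA settings, etc. survive in the child process.
# FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE asks flash-attn's setup to skip
# compiling the CUDA extension from source.
subprocess.run(
    'pip install flash-attn --no-build-isolation',
    env={**os.environ, 'FLASH_ATTENTION_SKIP_CUDA_BUILD': 'TRUE'},
    shell=True,
    check=True,  # fail loudly if the install does not succeed
)
```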
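The second hunk's new description says each frame of `heatmap_animation.mp4` combines a heatmap overlay, the generated text so far, and a top-k prediction table. The rendering code itself is outside this diff, so as a rough illustration only, here is a minimal sketch of the heatmap-overlay ingredient using `imageio` and `tqdm` (both imported in the hunks above) plus `matplotlib`; every name below (`render_frame`, `attn_maps`, the dummy inputs) is hypothetical rather than taken from app.py:

```python
import numpy as np
import imageio.v2 as imageio
import tqdm
import matplotlib
matplotlib.use("Agg")  # headless rendering, as on a Space
import matplotlib.pyplot as plt

def render_frame(image, attn_map, token):
    """Blend one token's attention grid over the image and rasterize it."""
    fig, ax = plt.subplots(figsize=(4, 4), dpi=100)
    ax.imshow(image)
    # Stretch the coarse attention grid across the full image area.
    ax.imshow(attn_map, cmap="jet", alpha=0.5,
              extent=(0, image.shape[1], image.shape[0], 0))
    ax.set_title(f"token: {token!r}")
    ax.axis("off")
    fig.canvas.draw()
    frame = np.asarray(fig.canvas.buffer_rgba())[..., :3].copy()  # RGBA -> RGB
    plt.close(fig)
    return frame

# Dummy stand-ins for the real model outputs: one attention grid per token.
image = np.random.rand(224, 224, 3)
tokens = ["a", "cat", "on", "a", "mat"]
attn_maps = [np.random.rand(16, 16) for _ in tokens]

with imageio.get_writer("heatmap_animation.mp4", fps=2) as writer:
    for token, attn in tqdm.tqdm(list(zip(tokens, attn_maps))):
        writer.append_data(render_frame(image, attn, token))
```

Per the description, the real app additionally composites the generated-text panel and the token prediction table into each frame before writing it.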