Spaces: Running on Zero
Update app.py
app.py CHANGED
@@ -446,6 +446,7 @@ model = AutoModel.from_pretrained(
     torch_dtype=torch.bfloat16,
     low_cpu_mem_usage=True,
     trust_remote_code=True,
+    use_flash_attn=False,
 ).eval().cuda()
 tokenizer = AutoTokenizer.from_pretrained("khang119966/Vintern-1B-v3_5-explainableAI", trust_remote_code=True, use_fast=False)
 
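For context, the patched load step is equivalent to the self-contained sketch below. The model's own repo id is not visible in this hunk, so the same id as the tokenizer is assumed; `use_flash_attn=False` is presumably forwarded to the checkpoint's remote modeling code (InternVL-style checkpoints accept it), which lets the Space run without the flash-attn package installed.

```python
# Minimal sketch of the patched load step, not the full app.
# MODEL_ID is an assumption: the model repo id is not shown in this hunk,
# so the tokenizer's repo id is reused here.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "khang119966/Vintern-1B-v3_5-explainableAI"

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # half-precision weights to reduce GPU memory
    low_cpu_mem_usage=True,       # stream weights instead of building a full CPU copy
    trust_remote_code=True,       # the repo ships custom modeling code
    use_flash_attn=False,         # the new kwarg: skip flash-attn so the Space runs without it
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID, trust_remote_code=True, use_fast=False
)
```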
@@ -559,18 +560,11 @@ with gr.Blocks() as demo:
     gr.Markdown("""## 🎥 Visualizing How Multimodal Models Think
 This tool generates a video to **visualize how a multimodal model (image + text)** attends to different parts of an image while generating text.
 ### 📌 What it does:
-- Takes an input image and a text prompt.
-- Shows how the model’s attention shifts on the image for each generated token.
-- Helps explain the model’s behavior and decision-making.
-### 🖼️ Video layout (per frame):
-Each frame in the video includes:
-1. 🔥 **Heatmap over image**: Shows which area the model focuses on.
-2. 📝 **Generated text**: With old context, current token highlighted.
-3. 📊 **Token prediction table**: Shows the model’s top next-token guesses.
+- Takes an input image and a text prompt. - Shows how the model’s attention shifts on the image for each generated token. - Helps explain the model’s behavior and decision-making.
+### 🖼️ Video layout (per frame):
+Each frame in the video includes: 1. 🔥 **Heatmap over image**: Shows which area the model focuses on. 2. 📝 **Generated text**: With old context, current token highlighted. 3. 📊 **Token prediction table**: Shows the model’s top next-token guesses.
 ### 🎯 Use cases:
-- Research explainability of vision-language models.
-- Debugging or interpreting model outputs.
-- Creating educational visualizations.
+- Research explainability of vision-language models. - Debugging or interpreting model outputs. - Creating educational visualizations.
     """)
 
     with gr.Row():
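The heatmap logic itself is outside this diff, but the idea the description refers to can be sketched generically: request per-step attentions from `generate()`, average them over layers and heads, and reshape the weights that land on the image-patch positions into a 2-D grid. The function below is an illustrative sketch under those assumptions, not the Space's implementation; `image_span` and `grid_hw` are assumed inputs describing where the vision tokens sit in the sequence.

```python
# Illustrative sketch (not the Space's actual code): turn one generation step's
# attention weights into an image heatmap. Assumes the image is encoded as a
# contiguous run of patch tokens and that the attentions came from
# model.generate(..., output_attentions=True, return_dict_in_generate=True).
import torch

def attention_heatmap(step_attentions, image_span, grid_hw):
    """step_attentions: tuple of per-layer tensors, each (batch, heads, q_len, k_len).
    image_span: (start, end) indices of the image patch tokens along the key axis.
    grid_hw: (rows, cols) of the vision patch grid, with rows * cols == end - start.
    Returns a (rows, cols) tensor normalized to [0, 1]."""
    start, end = image_span
    rows, cols = grid_hw
    # Keep only the last query position (the token being generated at this step),
    # then average over layers and heads.
    stacked = torch.stack([layer[0, :, -1, start:end] for layer in step_attentions])
    weights = stacked.mean(dim=(0, 1))          # (num_patches,)
    heatmap = weights.reshape(rows, cols)
    heatmap = heatmap - heatmap.min()
    return heatmap / (heatmap.max() + 1e-8)

if __name__ == "__main__":
    # Smoke test with random attention: 4 layers, 8 heads, 1 query, 80 keys,
    # of which keys 10..74 form an 8x8 patch grid.
    fake = tuple(torch.rand(1, 8, 1, 80).softmax(-1) for _ in range(4))
    print(attention_heatmap(fake, image_span=(10, 74), grid_hw=(8, 8)).shape)
```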
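Likewise, the token prediction table described above reduces to the top-k next-token probabilities at each step. A minimal sketch, assuming per-step logits from `generate(..., output_scores=True, return_dict_in_generate=True)`:

```python
# Illustrative sketch (not the Space's actual code): list the model's top
# next-token guesses at one generation step, as a prediction table would show.
import torch

def top_token_table(step_scores, tokenizer, k=5):
    """step_scores: (batch, vocab_size) logits for one step, e.g. outputs.scores[i]
    from model.generate(..., output_scores=True, return_dict_in_generate=True).
    Returns a list of (token_text, probability) pairs, highest probability first."""
    probs = torch.softmax(step_scores[0], dim=-1)
    top_probs, top_ids = probs.topk(k)
    return [
        (tokenizer.decode([token_id]), round(prob.item(), 4))
        for token_id, prob in zip(top_ids.tolist(), top_probs)
    ]
```

Each (token, probability) pair would map to one row of the table rendered in a video frame.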