khang119966 committed
Commit 3d50453 · verified · 1 Parent(s): 8bfbd60

Update app.py

Files changed (1):
  1. app.py +5 -11
app.py CHANGED
@@ -446,6 +446,7 @@ model = AutoModel.from_pretrained(
     torch_dtype=torch.bfloat16,
     low_cpu_mem_usage=True,
     trust_remote_code=True,
+    use_flash_attn=False,
 ).eval().cuda()
 tokenizer = AutoTokenizer.from_pretrained("khang119966/Vintern-1B-v3_5-explainableAI", trust_remote_code=True, use_fast=False)
 
@@ -559,18 +560,11 @@ with gr.Blocks() as demo:
     gr.Markdown("""## 🎥 Visualizing How Multimodal Models Think
     This tool generates a video to **visualize how a multimodal model (image + text)** attends to different parts of an image while generating text.
     ### 📌 What it does:
-    - Takes an input image and a text prompt.
-    - Shows how the model’s attention shifts on the image for each generated token.
-    - Helps explain the model’s behavior and decision-making.
-    ### 🖼️ Video layout (per frame):
-    Each frame in the video includes:
-    1. 🔥 **Heatmap over image**: Shows which area the model focuses on.
-    2. 📝 **Generated text**: With old context, current token highlighted.
-    3. 📊 **Token prediction table**: Shows the model’s top next-token guesses.
+    - Takes an input image and a text prompt. - Shows how the model’s attention shifts on the image for each generated token. - Helps explain the model’s behavior and decision-making.
+    ### 🖼️ Video layout (per frame):
+    Each frame in the video includes: 1. 🔥 **Heatmap over image**: Shows which area the model focuses on. 2. 📝 **Generated text**: With old context, current token highlighted. 3. 📊 **Token prediction table**: Shows the model’s top next-token guesses.
     ### 🎯 Use cases:
-    - Research explainability of vision-language models.
-    - Debugging or interpreting model outputs.
-    - Creating educational visualizations.
+    - Research explainability of vision-language models. - Debugging or interpreting model outputs. - Creating educational visualizations.
     """)
 
     with gr.Row():
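The "heatmap over image" frame component described in the Markdown above amounts to taking per-patch attention scores and upsampling them to image resolution. Here is a minimal sketch of that step, not the app's actual code: the function name `attention_heatmap`, the explicit patch-grid argument, and nearest-neighbour upsampling are all illustrative assumptions.

```python
import numpy as np

def attention_heatmap(attn_weights, grid, image_size):
    """Upsample per-patch attention scores to a pixel-level heatmap.

    attn_weights: flat sequence of scores, one per image patch
    grid: (rows, cols) patch layout of the vision encoder
    image_size: (height, width) of the original image
    """
    rows, cols = grid
    heat = np.asarray(attn_weights, dtype=np.float64).reshape(rows, cols)
    # Normalize to [0, 1] so the map can be alpha-blended over the image.
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    # Nearest-neighbour upsampling to image resolution (no SciPy needed).
    h, w = image_size
    row_idx = np.arange(h) * rows // h
    col_idx = np.arange(w) * cols // w
    return heat[row_idx][:, col_idx]
```

Rendering one such heatmap per generated token, blended over the input image, gives the "attention shifting" effect the demo describes.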
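The "token prediction table" component boils down to a softmax over the model's final logits followed by a top-k sort. A hedged sketch under simplifying assumptions: plain NumPy logits and a hypothetical `vocab` lookup list stand in for the real model outputs and tokenizer.

```python
import numpy as np

def top_k_predictions(logits, vocab, k=3):
    """Return the k most probable next tokens as (token, probability) pairs."""
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable softmax: subtract the max before exponentiating.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[::-1][:k]
    return [(vocab[i], float(probs[i])) for i in top]
```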