Explainable-Vision-Language-Model

khang119966 committed 21 days ago

Commit 60b0804 (verified) · 1 Parent(s): 1c43489

Update app.py

Files changed (1)

app.py CHANGED

@@ -490,9 +490,11 @@ def generate_video(image, prompt, max_tokens):
 with gr.Blocks() as demo:
     gr.Markdown("""# 🎥 Visualizing How Multimodal Models Think
 - This tool generates a video to **visualize how a multimodal model (image + text)** attends to different parts of an image while generating text.
 📌 What it does: - Takes an input image and a text prompt. - Shows how the model’s attention shifts on the image for each generated token. - Helps explain the model’s behavior and decision-making.
 🖼️ Video layout (per frame): Each frame in the video includes: 1. 🔥 **Heatmap over image**: Shows which area the model focuses on. 2. 📝 **Generated text**: With old context, current token highlighted. 3. 📊 **Token prediction table**: Shows the model’s top next-token guesses.
-🎯 Use cases: Research explainability of vision-language models. - Debugging or interpreting model outputs. - Creating educational visualizations.
 """)
     with gr.Row():

 with gr.Blocks() as demo:
     gr.Markdown("""# 🎥 Visualizing How Multimodal Models Think
 - This tool generates a video to **visualize how a multimodal model (image + text)** attends to different parts of an image while generating text.
 📌 What it does: - Takes an input image and a text prompt. - Shows how the model’s attention shifts on the image for each generated token. - Helps explain the model’s behavior and decision-making.
 🖼️ Video layout (per frame): Each frame in the video includes: 1. 🔥 **Heatmap over image**: Shows which area the model focuses on. 2. 📝 **Generated text**: With old context, current token highlighted. 3. 📊 **Token prediction table**: Shows the model’s top next-token guesses.
 """)
     with gr.Row():
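
The Markdown text in this hunk describes two per-frame elements: a heatmap over the image and a table of top next-token guesses. The body of `generate_video` is not part of this diff, so the following is only a minimal sketch of how such elements could be computed, not the code used in `app.py`. The helper names `overlay_attention` and `top_k_table`, the red-overlay blending, and the assumption that attention arrives as a coarse patch grid whose size divides the image size are all hypothetical.

```python
import numpy as np


def overlay_attention(image, attn, alpha=0.5):
    """Blend a coarse attention grid over an RGB image as a red heatmap.

    image: uint8 array of shape (H, W, 3).
    attn:  float array of shape (gh, gw), one weight per image patch.
    Assumption: H is a multiple of gh and W is a multiple of gw.
    """
    h, w, _ = image.shape
    gh, gw = attn.shape

    # Min-max normalize the weights to [0, 1]; a flat map becomes all zeros.
    span = attn.max() - attn.min()
    norm = (attn - attn.min()) / span if span > 0 else np.zeros_like(attn)

    # Nearest-neighbour upsampling: each patch weight fills its pixel block.
    heat = np.kron(norm, np.ones((h // gh, w // gw)))

    # Mix the image with pure red, more strongly where attention is high.
    red = np.zeros(image.shape, dtype=np.float64)
    red[..., 0] = 255.0
    weight = (alpha * heat)[..., None]
    blended = (1.0 - weight) * image + weight * red
    return blended.astype(np.uint8)


def top_k_table(logits, vocab, k=3):
    """Return the k most likely next tokens as (token, probability) pairs."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    top = np.argsort(probs)[::-1][:k]
    return [(vocab[i], float(probs[i])) for i in top]


if __name__ == "__main__":
    frame = overlay_attention(
        np.zeros((4, 4, 3), dtype=np.uint8),
        np.array([[0.0, 1.0], [0.0, 0.0]]),
    )
    print(frame[..., 0])  # red channel is brightest in the top-right block
    print(top_k_table(np.array([1.0, 3.0, 2.0]), ["a", "b", "c"], k=2))
```

In a real pipeline the attention grid would come from the model's attention over image tokens for the current generation step, and each blended frame would be written to the video alongside the generated text and the table.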