<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Understanding Attention Mechanisms in LLMs</title>
<script src="https://cdn.tailwindcss.com"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
<style>
.code-block {
background-color: #2d2d2d;
color: #f8f8f2;
padding: 1rem;
border-radius: 0.5rem;
font-family: 'Courier New', Courier, monospace;
overflow-x: auto;
}
.attention-visual {
display: flex;
justify-content: center;
margin: 2rem 0;
}
.attention-node {
width: 60px;
height: 60px;
border-radius: 50%;
display: flex;
align-items: center;
justify-content: center;
font-weight: bold;
position: relative;
}
.attention-line {
position: absolute;
background-color: rgba(59, 130, 246, 0.5);
transform-origin: left center;
}
.explanation-box {
background-color: #f0f9ff;
border-left: 4px solid #3b82f6;
padding: 1rem;
margin: 1rem 0;
border-radius: 0 0.5rem 0.5rem 0;
}
.citation {
background-color: #f8fafc;
padding: 0.5rem;
margin: 0.5rem 0;
border-left: 3px solid #94a3b8;
}
</style>
</head>
<body class="bg-gray-50">
<div class="max-w-4xl mx-auto px-4 py-8">
<header class="text-center mb-12">
<h1 class="text-4xl font-bold text-blue-800 mb-4">Attention Mechanisms in Large Language Models</h1>
<p class="text-xl text-gray-600">Understanding the core innovation behind modern AI language models</p>
<div class="mt-6">
<span class="inline-block bg-blue-100 text-blue-800 px-3 py-1 rounded-full text-sm font-medium">Machine Learning</span>
<span class="inline-block bg-purple-100 text-purple-800 px-3 py-1 rounded-full text-sm font-medium ml-2">Natural Language Processing</span>
<span class="inline-block bg-green-100 text-green-800 px-3 py-1 rounded-full text-sm font-medium ml-2">Deep Learning</span>
</div>
</header>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Introduction to Attention</h2>
<p class="text-gray-700 mb-4">
The attention mechanism is a fundamental component of modern transformer-based language models such as GPT and BERT.
It allows a model to focus dynamically on different parts of the input sequence when producing each part of the output sequence.
</p>
<p class="text-gray-700 mb-6">
Unlike recurrent sequence models, which must compress the entire history into a fixed-size hidden state, attention gives the model direct access to every input position and lets it learn which positions are most relevant at each step of processing.
</p>
<div class="attention-visual"> | |
<div class="flex flex-col items-center"> | |
<div class="flex space-x-8 mb-8"> | |
<div class="attention-node bg-blue-100 text-blue-800">Input</div> | |
<div class="attention-node bg-purple-100 text-purple-800">Q</div> | |
<div class="attention-node bg-green-100 text-green-800">K</div> | |
<div class="attention-node bg-yellow-100 text-yellow-800">V</div> | |
</div> | |
<div class="attention-node bg-red-100 text-red-800">Output</div> | |
</div> | |
</div> | |
<div class="explanation-box"> | |
<h3 class="font-semibold text-lg text-blue-800 mb-2">Key Insight</h3> | |
<p> | |
Attention mechanisms compute a weighted sum of values (V), where the weights are determined by the compatibility between queries (Q) and keys (K). | |
This allows the model to focus on different parts of the input sequence dynamically. | |
</p> | |
</div> | |
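<p class="text-gray-700 mb-4">
To make the "weighted sum of values" idea concrete, here is a minimal toy sketch: one query is scored against three keys, and the resulting softmax weights blend the three values. The vectors and numbers are made up purely for illustration.
</p>
<div class="code-block mb-6">
<pre># Toy illustration: one query, three key/value pairs (all numbers are arbitrary)
import torch

q = torch.tensor([1.0, 0.0])               # a single query vector
K = torch.tensor([[1.0, 0.0],              # three key vectors
                  [0.0, 1.0],
                  [0.7, 0.7]])
V = torch.tensor([[10.0, 0.0],             # the values carried by each position
                  [0.0, 10.0],
                  [5.0, 5.0]])

scores = K @ q                              # compatibility of the query with each key
weights = torch.softmax(scores, dim=0)      # attention weights, summing to 1
output = weights @ V                        # weighted sum of the values

print(weights)                              # highest weight on the most compatible key
print(output)</pre>
</div>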
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">The Q, K, V Triad</h2>
<div class="grid md:grid-cols-3 gap-6 mb-8">
<div class="bg-blue-50 p-4 rounded-lg">
<h3 class="font-bold text-blue-800 mb-2"><i class="fas fa-question-circle mr-2"></i>Queries (Q)</h3>
<p class="text-gray-700">
Represent what the model is "looking for" at the current position. They are produced from the input by a learned projection and help determine which parts of the input to focus on.
</p>
</div>
<div class="bg-green-50 p-4 rounded-lg">
<h3 class="font-bold text-green-800 mb-2"><i class="fas fa-key mr-2"></i>Keys (K)</h3>
<p class="text-gray-700">
Represent what each input element "contains" or "offers". They are compared against queries to determine attention weights.
</p>
</div>
<div class="bg-yellow-50 p-4 rounded-lg">
<h3 class="font-bold text-yellow-800 mb-2"><i class="fas fa-database mr-2"></i>Values (V)</h3>
<p class="text-gray-700">
Contain the actual information that will be aggregated based on the attention weights. They represent what gets passed forward.
</p>
</div>
</div>
<h3 class="text-xl font-semibold text-gray-800 mb-4">Why We Need All Three</h3>
<p class="text-gray-700 mb-4">
The separation of Q, K, and V gives the attention mechanism flexibility and expressive power:
</p>
<ul class="list-disc pl-6 text-gray-700 space-y-2 mb-6">
<li><strong>Decoupling:</strong> Allows different representations for what to look for (Q) versus what to retrieve (V)</li>
<li><strong>Flexibility:</strong> Enables different types of attention patterns (e.g., looking ahead vs. looking back)</li>
<li><strong>Efficiency:</strong> Permits caching of K and V for autoregressive generation (see the sketch after this list)</li>
<li><strong>Interpretability:</strong> Makes attention patterns more meaningful and analyzable</li>
</ul>
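<p class="text-gray-700 mb-4">
The efficiency point deserves a quick illustration. The sketch below shows the basic idea of a KV cache during autoregressive decoding: at each step only the newest token is projected, and its key and value are appended to cached tensors. The projection layers and shapes are illustrative assumptions (they mirror the projection example in the next subsection), not a production implementation.
</p>
<div class="code-block mb-6">
<pre># Minimal KV-cache sketch (illustrative only)
import torch
import torch.nn as nn

d_model = 512
k_proj = nn.Linear(d_model, d_model)
v_proj = nn.Linear(d_model, d_model)

cached_k, cached_v = [], []                        # grow by one entry per generated token

def decode_step(new_token_embedding):
    """Project only the newest token; reuse all previously cached keys/values."""
    cached_k.append(k_proj(new_token_embedding))   # (1, 1, d_model)
    cached_v.append(v_proj(new_token_embedding))
    K = torch.cat(cached_k, dim=1)                 # (1, tokens_so_far, d_model)
    V = torch.cat(cached_v, dim=1)
    return K, V                                    # attention uses the new token's Q against these

K, V = decode_step(torch.randn(1, 1, d_model))     # first generated token
K, V = decode_step(torch.randn(1, 1, d_model))     # second generated token
print(K.shape, V.shape)                            # torch.Size([1, 2, 512]) for both</pre>
</div>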
<h3 class="text-xl font-semibold text-gray-800 mb-4">How Q, K, V Are Created</h3> | |
<p class="text-gray-700 mb-4"> | |
In transformer models, Q, K, and V are all derived from the same input sequence through learned linear transformations: | |
</p> | |
<div class="code-block mb-6"> | |
<pre># Python example of creating Q, K, V | |
import torch | |
import torch.nn as nn | |
# Suppose we have input embeddings of shape (batch_size, seq_len, d_model) | |
batch_size = 32 | |
seq_len = 10 | |
d_model = 512 | |
input_embeddings = torch.randn(batch_size, seq_len, d_model) | |
# Create linear projection layers | |
q_proj = nn.Linear(d_model, d_model) # Query projection | |
k_proj = nn.Linear(d_model, d_model) # Key projection | |
v_proj = nn.Linear(d_model, d_model) # Value projection | |
# Project inputs to get Q, K, V | |
Q = q_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model) | |
K = k_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model) | |
V = v_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model)</pre> | |
</div> | |
<div class="explanation-box"> | |
<h3 class="font-semibold text-lg text-blue-800 mb-2">Important Note</h3> | |
<p> | |
In practice, the dimensions are often split into multiple "heads" (multi-head attention), where each head learns different attention patterns. | |
This allows the model to attend to different aspects of the input simultaneously. | |
</p> | |
</div> | |
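<p class="text-gray-700 mb-4">
The head split is just a reshape of the projected tensors. The snippet below is a minimal sketch with assumed sizes (8 heads of 64 dimensions for a 512-dimensional model); the full multi-head module later on wraps exactly this kind of reshape.
</p>
<div class="code-block mb-6">
<pre># Splitting d_model = 512 into 8 heads of 64 dimensions each (illustrative sizes)
import torch

batch_size, seq_len, d_model, num_heads = 32, 10, 512, 8
depth = d_model // num_heads                        # 64 dimensions per head

x = torch.randn(batch_size, seq_len, d_model)       # e.g. the projected Q tensor
heads = x.view(batch_size, seq_len, num_heads, depth).transpose(1, 2)
print(heads.shape)                                  # torch.Size([32, 8, 10, 64])</pre>
</div>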
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Scaled Dot-Product Attention</h2>
<p class="text-gray-700 mb-4">
The core computation in attention mechanisms is scaled dot-product attention, which can be implemented as follows:
</p>
<div class="code-block mb-6">
<pre>def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: Query tensor (batch_size, ..., seq_len_q, d_k)
    K: Key tensor (batch_size, ..., seq_len_k, d_k)
    V: Value tensor (batch_size, ..., seq_len_k, d_v)
    mask: Optional mask tensor with 1 at positions that must not be attended to
    """
    # Compute dot products between Q and K
    matmul_qk = torch.matmul(Q, K.transpose(-2, -1))  # (..., seq_len_q, seq_len_k)

    # Scale by the square root of the key dimension
    d_k = Q.size(-1)
    scaled_attention_logits = matmul_qk / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))

    # Apply mask if provided (e.g., for padding or decoder self-attention)
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # Softmax to get attention weights
    attention_weights = torch.softmax(scaled_attention_logits, dim=-1)

    # Multiply weights by values
    output = torch.matmul(attention_weights, V)  # (..., seq_len_q, d_v)
    return output, attention_weights</pre>
</div>
<div class="explanation-box">
<h3 class="font-semibold text-lg text-blue-800 mb-2">Scaling Explanation</h3>
<p>
The scaling factor (√dₖ) is crucial because dot products grow in magnitude as the dimension increases: for query and key components with roughly unit variance, the dot product has variance dₖ, so its typical size grows like √dₖ.
Large logits push the softmax into regions with extremely small gradients, making learning difficult.
Dividing by √dₖ counteracts this effect.
</p>
</div>
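<p class="text-gray-700 mb-4">
A quick numerical check makes the effect visible. The snippet below is a toy experiment with arbitrary sizes, not part of any model: it samples random query and key vectors and compares the spread of the raw and scaled dot products.
</p>
<div class="code-block mb-6">
<pre># Why we divide by sqrt(d_k): a rough numerical check with random vectors
import torch

d_k = 512
q = torch.randn(1000, d_k)                 # 1000 random query vectors
k = torch.randn(1000, d_k)                 # 1000 random key vectors

raw = (q * k).sum(dim=-1)                  # unscaled dot products
scaled = raw / torch.sqrt(torch.tensor(float(d_k)))

print(raw.std())                           # roughly sqrt(512), about 22.6
print(scaled.std())                        # roughly 1.0 -- softmax stays well-behaved</pre>
</div>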
<h3 class="text-xl font-semibold text-gray-800 mb-4">Complete Multi-Head Attention Example</h3> | |
<div class="code-block mb-6"> | |
<pre>class MultiHeadAttention(nn.Module): | |
def __init__(self, d_model, num_heads): | |
super(MultiHeadAttention, self).__init__() | |
self.num_heads = num_heads | |
self.d_model = d_model | |
assert d_model % num_heads == 0, "d_model must be divisible by num_heads" | |
self.depth = d_model // num_heads | |
# Linear layers for Q, K, V projections | |
self.wq = nn.Linear(d_model, d_model) | |
self.wk = nn.Linear(d_model, d_model) | |
self.wv = nn.Linear(d_model, d_model) | |
self.dense = nn.Linear(d_model, d_model) | |
def split_heads(self, x, batch_size): | |
"""Split the last dimension into (num_heads, depth)""" | |
x = x.view(batch_size, -1, self.num_heads, self.depth) | |
return x.transpose(1, 2) # (batch_size, num_heads, seq_len, depth) | |
def forward(self, q, k, v, mask=None): | |
batch_size = q.size(0) | |
# Linear projections | |
q = self.wq(q) # (batch_size, seq_len, d_model) | |
k = self.wk(k) | |
v = self.wv(v) | |
# Split into multiple heads | |
q = self.split_heads(q, batch_size) | |
k = self.split_heads(k, batch_size) | |
v = self.split_heads(v, batch_size) | |
# Scaled dot-product attention | |
scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask) | |
# Concatenate heads | |
scaled_attention = scaled_attention.transpose(1, 2) # (batch_size, seq_len, num_heads, depth) | |
concat_attention = scaled_attention.contiguous().view(batch_size, -1, self.d_model) | |
# Final linear layer | |
output = self.dense(concat_attention) | |
return output, attention_weights</pre> | |
</div> | |
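<p class="text-gray-700 mb-4">
As a quick sanity check of the module above, the sketch below instantiates it with illustrative dimensions and runs a self-attention forward pass on random input. It reuses the torch/nn imports and the scaled_dot_product_attention function from the earlier blocks.
</p>
<div class="code-block mb-6">
<pre># Using the MultiHeadAttention module defined above
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(32, 10, 512)               # (batch_size, seq_len, d_model)

# Self-attention: the same tensor supplies queries, keys, and values
output, attn_weights = mha(x, x, x)

print(output.shape)                        # torch.Size([32, 10, 512])
print(attn_weights.shape)                  # torch.Size([32, 8, 10, 10]) -- one map per head</pre>
</div>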
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Types of Attention Patterns</h2>
<div class="grid md:grid-cols-2 gap-6 mb-6">
<div class="bg-indigo-50 p-4 rounded-lg">
<h3 class="font-bold text-indigo-800 mb-2"><i class="fas fa-arrows-alt-h mr-2"></i>Self-Attention</h3>
<p class="text-gray-700">
Q, K, and V all come from the same sequence, allowing each position to attend to every position in that sequence.
</p>
<div class="mt-3">
<span class="inline-block bg-indigo-100 text-indigo-800 px-2 py-1 rounded-full text-xs font-medium">Encoder</span>
<span class="inline-block bg-indigo-100 text-indigo-800 px-2 py-1 rounded-full text-xs font-medium ml-1">BERT</span>
</div>
</div>
<div class="bg-pink-50 p-4 rounded-lg">
<h3 class="font-bold text-pink-800 mb-2"><i class="fas fa-arrow-right mr-2"></i>Masked Self-Attention</h3>
<p class="text-gray-700">
Used in decoders to prevent each position from attending to later positions, preserving the autoregressive property (see the mask sketch after these cards).
</p>
<div class="mt-3">
<span class="inline-block bg-pink-100 text-pink-800 px-2 py-1 rounded-full text-xs font-medium">Decoder</span>
<span class="inline-block bg-pink-100 text-pink-800 px-2 py-1 rounded-full text-xs font-medium ml-1">GPT</span>
</div>
</div>
<div class="bg-teal-50 p-4 rounded-lg">
<h3 class="font-bold text-teal-800 mb-2"><i class="fas fa-exchange-alt mr-2"></i>Cross-Attention</h3>
<p class="text-gray-700">
Q comes from one sequence, while K and V come from another sequence (e.g., encoder-decoder attention).
</p>
<div class="mt-3">
<span class="inline-block bg-teal-100 text-teal-800 px-2 py-1 rounded-full text-xs font-medium">Seq2Seq</span>
<span class="inline-block bg-teal-100 text-teal-800 px-2 py-1 rounded-full text-xs font-medium ml-1">Translation</span>
</div>
</div>
<div class="bg-orange-50 p-4 rounded-lg">
<h3 class="font-bold text-orange-800 mb-2"><i class="fas fa-sliders-h mr-2"></i>Sparse Attention</h3>
<p class="text-gray-700">
Attends only to a subset of positions to reduce computational complexity (e.g., local, strided, or global attention).
</p>
<div class="mt-3">
<span class="inline-block bg-orange-100 text-orange-800 px-2 py-1 rounded-full text-xs font-medium">Longformer</span>
<span class="inline-block bg-orange-100 text-orange-800 px-2 py-1 rounded-full text-xs font-medium ml-1">BigBird</span>
</div>
</div>
</div>
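<p class="text-gray-700 mb-4">
The causal mask used in masked self-attention is easy to build. The sketch below follows the convention of the scaled_dot_product_attention function defined earlier (1 marks a position that must be ignored); the sequence length and tensor sizes are arbitrary.
</p>
<div class="code-block mb-6">
<pre># Building a causal mask and applying it (illustrative sizes)
import torch

seq_len = 5
# Ones strictly above the diagonal mark the future positions each query must ignore
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)

q = torch.randn(2, seq_len, 64)            # (batch, seq_len, d_k)
k = torch.randn(2, seq_len, 64)
v = torch.randn(2, seq_len, 64)

out, weights = scaled_dot_product_attention(q, k, v, mask=causal_mask)
print(weights[0])                          # each row only puts weight on itself and earlier positions</pre>
</div>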
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Key Citations and Resources</h2>
<div class="space-y-4">
<div class="citation">
<h3 class="font-semibold text-gray-800">1. Vaswani et al. (2017) - Original Transformer Paper</h3>
<p class="text-gray-600">"Attention Is All You Need" - Introduced the transformer architecture with scaled dot-product attention.</p>
<a href="https://arxiv.org/abs/1706.03762" class="text-blue-600 hover:underline inline-block mt-1">arXiv:1706.03762</a>
</div>
<div class="citation">
<h3 class="font-semibold text-gray-800">2. Jurafsky &amp; Martin (2023) - NLP Textbook</h3>
<p class="text-gray-600">"Speech and Language Processing" - Comprehensive chapter on attention and transformer models.</p>
<a href="https://web.stanford.edu/~jurafsky/slp3/" class="text-blue-600 hover:underline inline-block mt-1">Stanford NLP Textbook</a>
</div>
<div class="citation">
<h3 class="font-semibold text-gray-800">3. Alammar (2018) - The Illustrated Transformer</h3>
<p class="text-gray-600">Jay Alammar's visual explanation of transformer attention mechanisms.</p>
<a href="https://jalammar.github.io/illustrated-transformer/" class="text-blue-600 hover:underline inline-block mt-1">jalammar.github.io</a>
</div>
<div class="citation">
<h3 class="font-semibold text-gray-800">4. Harvard NLP (2018) - The Annotated Transformer</h3>
<p class="text-gray-600">Line-by-line implementation guide with PyTorch.</p>
<a href="http://nlp.seas.harvard.edu/2018/04/03/attention.html" class="text-blue-600 hover:underline inline-block mt-1">Harvard NLP Tutorial</a>
</div>
<div class="citation">
<h3 class="font-semibold text-gray-800">5. Tay et al. (2020) - Efficient Transformers Survey</h3>
<p class="text-gray-600">"Efficient Transformers: A Survey" - Reviews attention variants designed to reduce the cost of full self-attention.</p>
<a href="https://arxiv.org/abs/2009.06732" class="text-blue-600 hover:underline inline-block mt-1">arXiv:2009.06732</a>
</div>
</div>
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8"> | |
<div class="p-8"> | |
<h2 class="text-2xl font-bold text-gray-800 mb-6">Practical Considerations</h2> | |
<div class="grid md:grid-cols-2 gap-6"> | |
<div> | |
<h3 class="text-xl font-semibold text-gray-800 mb-3"><i class="fas fa-lightbulb text-yellow-500 mr-2"></i>Tips for Implementation</h3> | |
<ul class="list-disc pl-6 text-gray-700 space-y-2"> | |
<li>Use layer normalization before (not after) attention in transformer blocks</li> | |
<li>Initialize attention projections with small random weights</li> | |
<li>Monitor attention patterns during training for debugging</li> | |
<li>Consider using flash attention for efficiency in production</li> | |
<li>Use attention masking carefully for padding and autoregressive generation</li> | |
</ul> | |
</div> | |
<div> | |
<h3 class="text-xl font-semibold text-gray-800 mb-3"><i class="fas fa-exclamation-triangle text-red-500 mr-2"></i>Common Pitfalls</h3> | |
<ul class="list-disc pl-6 text-gray-700 space-y-2"> | |
<li>Forgetting to scale attention scores by √dₖ</li> | |
<li>Improper handling of attention masks</li> | |
<li>Not using residual connections around attention</li> | |
<li>Oversized attention heads that don't learn meaningful patterns</li> | |
<li>Ignoring attention patterns when debugging model behavior</li> | |
</ul> | |
</div> | |
</div> | |
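<p class="text-gray-700 mt-6 mb-4">
The first tip and the residual-connection pitfall can be combined into one picture. Below is a minimal pre-LN transformer block sketch written against the MultiHeadAttention module defined earlier; the feed-forward size and layout are common assumptions rather than a definitive recipe.
</p>
<div class="code-block mb-6">
<pre># Minimal pre-LN transformer block (illustrative; reuses MultiHeadAttention from above)
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x, mask=None):
        h = self.norm1(x)                         # LayerNorm BEFORE attention (pre-LN)
        attn_out, _ = self.attn(h, h, h, mask)
        x = x + attn_out                          # residual connection around attention
        x = x + self.ff(self.norm2(x))            # residual connection around the FFN
        return x

block = PreLNBlock()
print(block(torch.randn(32, 10, 512)).shape)      # torch.Size([32, 10, 512])</pre>
</div>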
</div>
</div>
<footer class="text-center py-8 text-gray-600">
<p>© 2023 Understanding Attention Mechanisms in LLMs</p>
<p class="mt-2">Educational resource for machine learning students</p>
<div class="mt-4 flex justify-center space-x-4">
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-github fa-lg"></i></a>
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-twitter fa-lg"></i></a>
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-linkedin fa-lg"></i></a>
</div>
</footer>
</div>
</body>
</html> |