<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Understanding Attention Mechanisms in LLMs</title>
<script src="https://cdn.tailwindcss.com"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
<style>
.code-block {
background-color: #2d2d2d;
color: #f8f8f2;
padding: 1rem;
border-radius: 0.5rem;
font-family: 'Courier New', Courier, monospace;
overflow-x: auto;
}
.attention-visual {
display: flex;
justify-content: center;
margin: 2rem 0;
}
.attention-node {
width: 60px;
height: 60px;
border-radius: 50%;
display: flex;
align-items: center;
justify-content: center;
font-weight: bold;
position: relative;
}
.attention-line {
position: absolute;
background-color: rgba(59, 130, 246, 0.5);
transform-origin: left center;
}
.explanation-box {
background-color: #f0f9ff;
border-left: 4px solid #3b82f6;
padding: 1rem;
margin: 1rem 0;
border-radius: 0 0.5rem 0.5rem 0;
}
.citation {
background-color: #f8fafc;
padding: 0.5rem;
margin: 0.5rem 0;
border-left: 3px solid #94a3b8;
}
</style>
</head>
<body class="bg-gray-50">
<div class="max-w-4xl mx-auto px-4 py-8">
<header class="text-center mb-12">
<h1 class="text-4xl font-bold text-blue-800 mb-4">Attention Mechanisms in Large Language Models</h1>
<p class="text-xl text-gray-600">Understanding the core innovation behind modern AI language models</p>
<div class="mt-6">
<span class="inline-block bg-blue-100 text-blue-800 px-3 py-1 rounded-full text-sm font-medium">Machine Learning</span>
<span class="inline-block bg-purple-100 text-purple-800 px-3 py-1 rounded-full text-sm font-medium ml-2">Natural Language Processing</span>
<span class="inline-block bg-green-100 text-green-800 px-3 py-1 rounded-full text-sm font-medium ml-2">Deep Learning</span>
</div>
</header>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Introduction to Attention</h2>
<p class="text-gray-700 mb-4">
The attention mechanism is a fundamental component of modern transformer-based language models like GPT, BERT, and others.
It allows models to dynamically focus on different parts of the input sequence when producing each part of the output sequence.
</p>
<p class="text-gray-700 mb-6">
Unlike recurrent sequence models, which compress everything seen so far into a hidden state updated token by token, attention lets the model learn which parts of the input are most relevant at each step of processing and access them directly.
</p>
<div class="attention-visual">
<div class="flex flex-col items-center">
<div class="flex space-x-8 mb-8">
<div class="attention-node bg-blue-100 text-blue-800">Input</div>
<div class="attention-node bg-purple-100 text-purple-800">Q</div>
<div class="attention-node bg-green-100 text-green-800">K</div>
<div class="attention-node bg-yellow-100 text-yellow-800">V</div>
</div>
<div class="attention-node bg-red-100 text-red-800">Output</div>
</div>
</div>
<div class="explanation-box">
<h3 class="font-semibold text-lg text-blue-800 mb-2">Key Insight</h3>
<p>
Attention mechanisms compute a weighted sum of values (V), where the weights are determined by the compatibility between queries (Q) and keys (K).
This allows the model to focus on different parts of the input sequence dynamically.
</p>
</div>
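<p class="text-gray-700 mb-4">
To make the idea concrete, here is a minimal numerical sketch of a single query attending over three key/value pairs. The vectors are small and hand-picked purely for illustration; real models use learned, high-dimensional representations.
</p>
<div class="code-block mb-6">
<pre># Minimal sketch: attention as a weighted sum of values (toy numbers for illustration)
import torch

q = torch.tensor([[1.0, 0.0, 1.0, 0.0]])            # one query, dimension 4
K = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0, 0.0]])             # three keys
V = torch.tensor([[10.0, 0.0],
                  [0.0, 10.0],
                  [5.0, 5.0]])                       # three values

scores = q @ K.T / (4 ** 0.5)                        # compatibility of q with each key
weights = torch.softmax(scores, dim=-1)              # attention weights, sum to 1
output = weights @ V                                 # weighted sum of the values
print(weights)   # largest weight on the first key, which matches q most closely
print(output)    # the output is pulled toward the first value vector</pre>
</div>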
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">The Q, K, V Triad</h2>
<div class="grid md:grid-cols-3 gap-6 mb-8">
<div class="bg-blue-50 p-4 rounded-lg">
<h3 class="font-bold text-blue-800 mb-2"><i class="fas fa-question-circle mr-2"></i>Queries (Q)</h3>
<p class="text-gray-700">
Represent what the model is "looking for" at the current position. They are learned representations that help determine which parts of the input to focus on.
</p>
</div>
<div class="bg-green-50 p-4 rounded-lg">
<h3 class="font-bold text-green-800 mb-2"><i class="fas fa-key mr-2"></i>Keys (K)</h3>
<p class="text-gray-700">
Represent what each input element "contains" or "offers". They are compared against queries to determine attention weights.
</p>
</div>
<div class="bg-yellow-50 p-4 rounded-lg">
<h3 class="font-bold text-yellow-800 mb-2"><i class="fas fa-database mr-2"></i>Values (V)</h3>
<p class="text-gray-700">
Contain the actual information that will be aggregated based on the attention weights. They represent what gets passed forward.
</p>
</div>
</div>
<h3 class="text-xl font-semibold text-gray-800 mb-4">Why We Need All Three</h3>
<p class="text-gray-700 mb-4">
The separation of Q, K, and V provides flexibility and expressive power to the attention mechanism:
</p>
<ul class="list-disc pl-6 text-gray-700 space-y-2 mb-6">
<li><strong>Decoupling:</strong> Allows different representations for what to look for (Q) versus what to retrieve (V)</li>
<li><strong>Flexibility:</strong> Enables different types of attention patterns (e.g., looking ahead vs. looking back)</li>
<li><strong>Efficiency:</strong> Permits caching of K and V during autoregressive generation (see the sketch after this list)</li>
<li><strong>Interpretability:</strong> Makes attention patterns more meaningful and analyzable</li>
</ul>
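<p class="text-gray-700 mb-4">
To illustrate the efficiency point above, here is a rough sketch of K/V caching during autoregressive decoding. The projection layers, shapes, and cache layout are assumptions made for this example, not a specific library's API.
</p>
<div class="code-block mb-6">
<pre># Sketch of K/V caching during autoregressive generation (illustrative layout only)
import torch
import torch.nn as nn

d_model = 64
q_proj = nn.Linear(d_model, d_model)
k_proj = nn.Linear(d_model, d_model)
v_proj = nn.Linear(d_model, d_model)

cached_K = []   # keys computed for earlier tokens, reused at every later step
cached_V = []   # values computed for earlier tokens

def decode_step(new_token_embedding):
    # Only the newest token needs fresh K and V; earlier ones come from the cache
    cached_K.append(k_proj(new_token_embedding))
    cached_V.append(v_proj(new_token_embedding))
    K = torch.stack(cached_K)                        # (steps_so_far, d_model)
    V = torch.stack(cached_V)
    q = q_proj(new_token_embedding)                  # query only for the new position
    weights = torch.softmax(q @ K.T / d_model ** 0.5, dim=-1)
    return weights @ V                               # context vector for the new token

for _ in range(5):                                   # five decoding steps
    context = decode_step(torch.randn(d_model))</pre>
</div>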
<h3 class="text-xl font-semibold text-gray-800 mb-4">How Q, K, V Are Created</h3>
<p class="text-gray-700 mb-4">
In transformer models, Q, K, and V are all derived from the same input sequence through learned linear transformations:
</p>
<div class="code-block mb-6">
<pre># Python example of creating Q, K, V
import torch
import torch.nn as nn
# Suppose we have input embeddings of shape (batch_size, seq_len, d_model)
batch_size = 32
seq_len = 10
d_model = 512
input_embeddings = torch.randn(batch_size, seq_len, d_model)
# Create linear projection layers
q_proj = nn.Linear(d_model, d_model) # Query projection
k_proj = nn.Linear(d_model, d_model) # Key projection
v_proj = nn.Linear(d_model, d_model) # Value projection
# Project inputs to get Q, K, V
Q = q_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model)
K = k_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model)
V = v_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model)</pre>
</div>
<div class="explanation-box">
<h3 class="font-semibold text-lg text-blue-800 mb-2">Important Note</h3>
<p>
In practice, the dimensions are often split into multiple "heads" (multi-head attention), where each head learns different attention patterns.
This allows the model to attend to different aspects of the input simultaneously.
</p>
</div>
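<p class="text-gray-700 mb-4">
A shape-only sketch of the head split described above (8 heads are assumed here purely for illustration):
</p>
<div class="code-block mb-6">
<pre># Splitting d_model into attention heads (shapes only; 8 heads assumed for illustration)
import torch

batch_size, seq_len, d_model, num_heads = 32, 10, 512, 8
head_dim = d_model // num_heads                      # 64 dimensions per head

Q = torch.randn(batch_size, seq_len, d_model)
Q_heads = Q.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
print(Q_heads.shape)                                 # torch.Size([32, 8, 10, 64])</pre>
</div>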
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Scaled Dot-Product Attention</h2>
<p class="text-gray-700 mb-4">
The core computation in attention mechanisms is the scaled dot-product attention, which can be implemented as follows:
</p>
<div class="code-block mb-6">
<pre>def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: Query tensor (batch_size, ..., seq_len_q, d_k)
    K: Key tensor (batch_size, ..., seq_len_k, d_k)
    V: Value tensor (batch_size, ..., seq_len_k, d_v)
    mask: Optional tensor, broadcastable to the attention logits,
          with 1 at positions that should be masked out
    """
    # Compute dot products between Q and K
    matmul_qk = torch.matmul(Q, K.transpose(-2, -1))  # (..., seq_len_q, seq_len_k)

    # Scale by the square root of the key dimension
    d_k = Q.size(-1)
    scaled_attention_logits = matmul_qk / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))

    # Apply the mask if provided (e.g., causal masking in decoder self-attention):
    # masked positions receive a large negative logit and vanish after the softmax
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # Softmax over the key dimension to get attention weights
    attention_weights = torch.softmax(scaled_attention_logits, dim=-1)

    # Weighted sum of the values
    output = torch.matmul(attention_weights, V)  # (..., seq_len_q, d_v)
    return output, attention_weights</pre>
</div>
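<p class="text-gray-700 mb-4">
As a quick sanity check, the function can be run on random tensors (the shapes below are chosen only for illustration); each row of the returned attention weights sums to 1.
</p>
<div class="code-block mb-6">
<pre># Sanity check on random tensors (shapes chosen for illustration)
Q = torch.randn(2, 5, 64)                  # (batch_size, seq_len_q, d_k)
K = torch.randn(2, 7, 64)                  # (batch_size, seq_len_k, d_k)
V = torch.randn(2, 7, 32)                  # (batch_size, seq_len_k, d_v)

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)                         # torch.Size([2, 5, 32])
print(weights.sum(dim=-1))                  # every row sums to 1.0</pre>
</div>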
<div class="explanation-box">
<h3 class="font-semibold text-lg text-blue-800 mb-2">Scaling Explanation</h3>
<p>
The scaling factor of 1/√dₖ is crucial because the variance of a dot product between random vectors grows linearly with the dimension dₖ, so raw logits become large in magnitude as dₖ increases.
Large logits push the softmax into saturated regions with extremely small gradients, making learning difficult.
Dividing by √dₖ keeps the logits at roughly unit variance and counteracts this effect.
</p>
</div>
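<p class="text-gray-700 mb-4">
The effect is easy to verify empirically. The following small experiment (random unit-variance vectors, dimensions chosen arbitrarily) shows that the standard deviation of raw dot products grows roughly like √dₖ, and that dividing by √dₖ brings it back to about 1.
</p>
<div class="code-block mb-6">
<pre># Why scaling matters: dot products of random vectors grow with dimension
import torch

for d_k in (16, 256, 4096):
    q = torch.randn(10000, d_k)
    k = torch.randn(10000, d_k)
    dots = (q * k).sum(dim=-1)
    print(d_k, dots.std().item())                   # grows roughly like sqrt(d_k)
    print(d_k, (dots / d_k ** 0.5).std().item())    # scaled logits stay near std 1</pre>
</div>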
<h3 class="text-xl font-semibold text-gray-800 mb-4">Complete Multi-Head Attention Example</h3>
<div class="code-block mb-6">
<pre>class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.depth = d_model // num_heads

        # Linear layers for Q, K, V projections
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.dense = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth)"""
        x = x.view(batch_size, -1, self.num_heads, self.depth)
        return x.transpose(1, 2)  # (batch_size, num_heads, seq_len, depth)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # Linear projections
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)
        v = self.wv(v)

        # Split into multiple heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled dot-product attention
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)

        # Concatenate heads
        scaled_attention = scaled_attention.transpose(1, 2)  # (batch_size, seq_len, num_heads, depth)
        concat_attention = scaled_attention.contiguous().view(batch_size, -1, self.d_model)

        # Final linear layer
        output = self.dense(concat_attention)
        return output, attention_weights</pre>
</div>
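<p class="text-gray-700 mb-4">
A short forward pass through the module above (hyperparameters chosen only for illustration); passing the same tensor as queries, keys, and values gives self-attention:
</p>
<div class="code-block mb-6">
<pre># Example forward pass (hyperparameters chosen for illustration)
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(32, 10, 512)               # (batch_size, seq_len, d_model)

out, attn = mha(x, x, x)                   # self-attention: q = k = v = x
print(out.shape)                            # torch.Size([32, 10, 512])
print(attn.shape)                           # torch.Size([32, 8, 10, 10]) -- one weight matrix per head</pre>
</div>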
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Types of Attention Patterns</h2>
<div class="grid md:grid-cols-2 gap-6 mb-6">
<div class="bg-indigo-50 p-4 rounded-lg">
<h3 class="font-bold text-indigo-800 mb-2"><i class="fas fa-arrows-alt-h mr-2"></i>Self-Attention</h3>
<p class="text-gray-700">
Q, K, and V all come from the same sequence. Allows each position to attend to all positions in the same sequence.
</p>
<div class="mt-3">
<span class="inline-block bg-indigo-100 text-indigo-800 px-2 py-1 rounded-full text-xs font-medium">Encoder</span>
<span class="inline-block bg-indigo-100 text-indigo-800 px-2 py-1 rounded-full text-xs font-medium ml-1">BERT</span>
</div>
</div>
<div class="bg-pink-50 p-4 rounded-lg">
<h3 class="font-bold text-pink-800 mb-2"><i class="fas fa-arrow-right mr-2"></i>Masked Self-Attention</h3>
<p class="text-gray-700">
Used in decoders to prevent each position from attending to subsequent positions, preserving the autoregressive property (a mask-construction sketch appears after this grid).
</p>
<div class="mt-3">
<span class="inline-block bg-pink-100 text-pink-800 px-2 py-1 rounded-full text-xs font-medium">Decoder</span>
<span class="inline-block bg-pink-100 text-pink-800 px-2 py-1 rounded-full text-xs font-medium ml-1">GPT</span>
</div>
</div>
<div class="bg-teal-50 p-4 rounded-lg">
<h3 class="font-bold text-teal-800 mb-2"><i class="fas fa-exchange-alt mr-2"></i>Cross-Attention</h3>
<p class="text-gray-700">
Q comes from one sequence, while K and V come from another sequence (e.g., encoder-decoder attention).
</p>
<div class="mt-3">
<span class="inline-block bg-teal-100 text-teal-800 px-2 py-1 rounded-full text-xs font-medium">Seq2Seq</span>
<span class="inline-block bg-teal-100 text-teal-800 px-2 py-1 rounded-full text-xs font-medium ml-1">Translation</span>
</div>
</div>
<div class="bg-orange-50 p-4 rounded-lg">
<h3 class="font-bold text-orange-800 mb-2"><i class="fas fa-sliders-h mr-2"></i>Sparse Attention</h3>
<p class="text-gray-700">
Only attends to a subset of positions to reduce computational complexity (e.g., local, strided, or global attention).
</p>
<div class="mt-3">
<span class="inline-block bg-orange-100 text-orange-800 px-2 py-1 rounded-full text-xs font-medium">Longformer</span>
<span class="inline-block bg-orange-100 text-orange-800 px-2 py-1 rounded-full text-xs font-medium ml-1">BigBird</span>
</div>
</div>
</div>
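<p class="text-gray-700 mb-4">
For masked self-attention, the causal mask can be built from an upper-triangular matrix. The sketch below follows the convention used in the scaled_dot_product_attention function above, where a value of 1 marks a position to be hidden:
</p>
<div class="code-block mb-6">
<pre># Causal mask for masked self-attention (1 marks positions to hide, matching the function above)
import torch

seq_len = 5
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
print(causal_mask)
# tensor([[0., 1., 1., 1., 1.],
#         [0., 0., 1., 1., 1.],
#         [0., 0., 0., 1., 1.],
#         [0., 0., 0., 0., 1.],
#         [0., 0., 0., 0., 0.]])
# Row i can attend only to positions 0..i; future positions get a -1e9 logit before the softmax.</pre>
</div>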
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Key Citations and Resources</h2>
<div class="space-y-4">
<div class="citation">
<h3 class="font-semibold text-gray-800">1. Vaswani et al. (2017) - Original Transformer Paper</h3>
<p class="text-gray-600">"Attention Is All You Need" - Introduced the transformer architecture with scaled dot-product attention.</p>
<a href="https://arxiv.org/abs/1706.03762" class="text-blue-600 hover:underline inline-block mt-1">arXiv:1706.03762</a>
</div>
<div class="citation">
<h3 class="font-semibold text-gray-800">2. Jurafsky & Martin (2023) - NLP Textbook</h3>
<p class="text-gray-600">"Speech and Language Processing" - Comprehensive chapter on attention and transformer models.</p>
<a href="https://web.stanford.edu/~jurafsky/slp3/" class="text-blue-600 hover:underline inline-block mt-1">Stanford NLP Textbook</a>
</div>
<div class="citation">
<h3 class="font-semibold text-gray-800">3. Illustrated Transformer (Blog Post)</h3>
<p class="text-gray-600">Jay Alammar's visual explanation of transformer attention mechanisms.</p>
<a href="https://jalammar.github.io/illustrated-transformer/" class="text-blue-600 hover:underline inline-block mt-1">jalammar.github.io</a>
</div>
<div class="citation">
<h3 class="font-semibold text-gray-800">4. Harvard NLP (2022) - Annotated Transformer</h3>
<p class="text-gray-600">Line-by-line implementation guide with PyTorch.</p>
<a href="http://nlp.seas.harvard.edu/2018/04/03/attention.html" class="text-blue-600 hover:underline inline-block mt-1">Harvard NLP Tutorial</a>
</div>
<div class="citation">
<h3 class="font-semibold text-gray-800">5. Efficient Transformers Survey (2020)</h3>
<p class="text-gray-600">Tay et al. review various attention variants for efficiency.</p>
<a href="https://arxiv.org/abs/2009.06732" class="text-blue-600 hover:underline inline-block mt-1">arXiv:2009.06732</a>
</div>
</div>
</div>
</div>
<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Practical Considerations</h2>
<div class="grid md:grid-cols-2 gap-6">
<div>
<h3 class="text-xl font-semibold text-gray-800 mb-3"><i class="fas fa-lightbulb text-yellow-500 mr-2"></i>Tips for Implementation</h3>
<ul class="list-disc pl-6 text-gray-700 space-y-2">
<li>Prefer pre-norm blocks (layer normalization before attention) for training stability; the original transformer applied it after</li>
<li>Initialize attention projections with small random weights</li>
<li>Monitor attention patterns during training for debugging</li>
<li>Consider FlashAttention kernels for efficiency in production</li>
<li>Use attention masking carefully for padding and autoregressive generation (see the padding-mask sketch below)</li>
</ul>
</div>
<div>
<h3 class="text-xl font-semibold text-gray-800 mb-3"><i class="fas fa-exclamation-triangle text-red-500 mr-2"></i>Common Pitfalls</h3>
<ul class="list-disc pl-6 text-gray-700 space-y-2">
<li>Forgetting to scale attention scores by √dₖ</li>
<li>Improper handling of attention masks</li>
<li>Not using residual connections around attention</li>
<li>Oversized attention heads that don't learn meaningful patterns</li>
<li>Ignoring attention patterns when debugging model behavior</li>
</ul>
</div>
</div>
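<p class="text-gray-700 mt-6 mb-4">
Regarding the masking tip above, a padding mask can be built directly from the true sequence lengths. The lengths below are invented for illustration, and the mask uses the same 1-means-masked convention as the attention function earlier on this page:
</p>
<div class="code-block mb-6">
<pre># Padding mask from sequence lengths (illustrative lengths; 1 marks padded positions)
import torch

lengths = torch.tensor([3, 5, 2])                    # true lengths in a padded batch
max_len = 5
positions = torch.arange(max_len)                    # (max_len,)
padding_mask = (positions[None, :] >= lengths[:, None]).float()   # (batch, max_len)
print(padding_mask)
# tensor([[0., 0., 0., 1., 1.],
#         [0., 0., 0., 0., 0.],
#         [0., 0., 1., 1., 1.]])

# Broadcast to (batch, 1, 1, max_len) before adding it to the attention logits
mask_for_attention = padding_mask[:, None, None, :]</pre>
</div>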
</div>
</div>
<footer class="text-center py-8 text-gray-600">
<p>© 2023 Understanding Attention Mechanisms in LLMs</p>
<p class="mt-2">Educational resource for machine learning students</p>
<div class="mt-4 flex justify-center space-x-4">
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-github fa-lg"></i></a>
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-twitter fa-lg"></i></a>
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-linkedin fa-lg"></i></a>
</div>
</footer>
</div>
<p style="border-radius: 8px; text-align: center; font-size: 12px; color: #fff; margin-top: 16px;position: fixed; left: 8px; bottom: 8px; z-index: 10; background: rgba(0, 0, 0, 0.8); padding: 4px 8px;">Made with <img src="https://enzostvs-deepsite.hf.space/logo.svg" alt="DeepSite Logo" style="width: 16px; height: 16px; vertical-align: middle;display:inline-block;margin-right:3px;filter:brightness(0) invert(1);"><a href="https://enzostvs-deepsite.hf.space" style="color: #fff;text-decoration: underline;" target="_blank" >DeepSite</a> - 🧬 <a href="https://enzostvs-deepsite.hf.space?remix=ontoligent/ds-5001-text-as-data" style="color: #fff;text-decoration: underline;" target="_blank" >Remix</a></p></body>
</html>