Spaces: ontoligent/ds-5001-text-as-data
Commit: Add 3 files
Files changed:
- README.md +7 -5
- index.html +395 -19
- prompts.txt +1 -0
README.md
CHANGED
@@ -1,10 +1,12 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
+title: ds-5001-text-as-data
+emoji: 🐳
+colorFrom: green
+colorTo: yellow
 sdk: static
 pinned: false
+tags:
+- deepsite
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
index.html
CHANGED
@@ -1,19 +1,395 @@
-<!
-<html>
(the remaining 17 removed lines were blank; the new index.html follows in full)
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Understanding Attention Mechanisms in LLMs</title>
<script src="https://cdn.tailwindcss.com"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
<style>
.code-block {
    background-color: #2d2d2d;
    color: #f8f8f2;
    padding: 1rem;
    border-radius: 0.5rem;
    font-family: 'Courier New', Courier, monospace;
    overflow-x: auto;
}
.attention-visual {
    display: flex;
    justify-content: center;
    margin: 2rem 0;
}
.attention-node {
    width: 60px;
    height: 60px;
    border-radius: 50%;
    display: flex;
    align-items: center;
    justify-content: center;
    font-weight: bold;
    position: relative;
}
.attention-line {
    position: absolute;
    background-color: rgba(59, 130, 246, 0.5);
    transform-origin: left center;
}
.explanation-box {
    background-color: #f0f9ff;
    border-left: 4px solid #3b82f6;
    padding: 1rem;
    margin: 1rem 0;
    border-radius: 0 0.5rem 0.5rem 0;
}
.citation {
    background-color: #f8fafc;
    padding: 0.5rem;
    margin: 0.5rem 0;
    border-left: 3px solid #94a3b8;
}
</style>
</head>
<body class="bg-gray-50">
<div class="max-w-4xl mx-auto px-4 py-8">
<header class="text-center mb-12">
<h1 class="text-4xl font-bold text-blue-800 mb-4">Attention Mechanisms in Large Language Models</h1>
<p class="text-xl text-gray-600">Understanding the core innovation behind modern AI language models</p>
<div class="mt-6">
<span class="inline-block bg-blue-100 text-blue-800 px-3 py-1 rounded-full text-sm font-medium">Machine Learning</span>
<span class="inline-block bg-purple-100 text-purple-800 px-3 py-1 rounded-full text-sm font-medium ml-2">Natural Language Processing</span>
<span class="inline-block bg-green-100 text-green-800 px-3 py-1 rounded-full text-sm font-medium ml-2">Deep Learning</span>
</div>
</header>

<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Introduction to Attention</h2>
<p class="text-gray-700 mb-4">
The attention mechanism is a fundamental component of modern transformer-based language models like GPT, BERT, and others.
It allows models to dynamically focus on different parts of the input sequence when producing each part of the output sequence.
</p>
<p class="text-gray-700 mb-6">
Unlike traditional sequence models that process inputs in a fixed order, attention mechanisms enable models to learn which parts of the input are most relevant at each step of processing.
</p>

<div class="attention-visual">
<div class="flex flex-col items-center">
<div class="flex space-x-8 mb-8">
<div class="attention-node bg-blue-100 text-blue-800">Input</div>
<div class="attention-node bg-purple-100 text-purple-800">Q</div>
<div class="attention-node bg-green-100 text-green-800">K</div>
<div class="attention-node bg-yellow-100 text-yellow-800">V</div>
</div>
<div class="attention-node bg-red-100 text-red-800">Output</div>
</div>
</div>

<div class="explanation-box">
<h3 class="font-semibold text-lg text-blue-800 mb-2">Key Insight</h3>
<p>
Attention mechanisms compute a weighted sum of values (V), where the weights are determined by the compatibility between queries (Q) and keys (K).
This allows the model to focus on different parts of the input sequence dynamically.
</p>
</div>
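<p class="text-gray-700 mb-4">
The arithmetic behind that weighted sum can be made concrete with a tiny, self-contained sketch. The numbers below are random stand-ins for real token embeddings, so the output is illustrative only:
</p>
<div class="code-block mb-6">
<pre># Toy example: three 4-dimensional "tokens" attending to each other.
# (Random vectors stand in for learned representations.)
import torch

torch.manual_seed(0)
d_k = 4
Q = torch.randn(3, d_k)   # one query per token
K = torch.randn(3, d_k)   # one key per token
V = torch.randn(3, d_k)   # one value per token

scores = Q @ K.T / d_k ** 0.5            # (3, 3) compatibility scores
weights = torch.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ V                     # weighted sum of values

print(weights)           # how strongly each token attends to every other token
print(weights.sum(-1))   # sums to 1 along each row (up to floating point)
print(output.shape)      # torch.Size([3, 4])</pre>
</div>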
</div>
</div>

<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">The Q, K, V Triad</h2>

<div class="grid md:grid-cols-3 gap-6 mb-8">
<div class="bg-blue-50 p-4 rounded-lg">
<h3 class="font-bold text-blue-800 mb-2"><i class="fas fa-question-circle mr-2"></i>Queries (Q)</h3>
<p class="text-gray-700">
Represent what the model is "looking for" at the current position. They are learned representations that help determine which parts of the input to focus on.
</p>
</div>
<div class="bg-green-50 p-4 rounded-lg">
<h3 class="font-bold text-green-800 mb-2"><i class="fas fa-key mr-2"></i>Keys (K)</h3>
<p class="text-gray-700">
Represent what each input element "contains" or "offers". They are compared against queries to determine attention weights.
</p>
</div>
<div class="bg-yellow-50 p-4 rounded-lg">
<h3 class="font-bold text-yellow-800 mb-2"><i class="fas fa-database mr-2"></i>Values (V)</h3>
<p class="text-gray-700">
Contain the actual information that will be aggregated based on the attention weights. They represent what gets passed forward.
</p>
</div>
</div>

<h3 class="text-xl font-semibold text-gray-800 mb-4">Why We Need All Three</h3>
<p class="text-gray-700 mb-4">
The separation of Q, K, and V provides flexibility and expressive power to the attention mechanism:
</p>
<ul class="list-disc pl-6 text-gray-700 space-y-2 mb-6">
<li><strong>Decoupling:</strong> Allows different representations for what to look for (Q) versus what to retrieve (V)</li>
<li><strong>Flexibility:</strong> Enables different types of attention patterns (e.g., looking ahead vs. looking back)</li>
<li><strong>Efficiency:</strong> Permits caching of K and V for autoregressive generation (see the sketch below)</li>
<li><strong>Interpretability:</strong> Makes attention patterns more meaningful and analyzable</li>
</ul>
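<p class="text-gray-700 mb-4">
The efficiency point is easiest to see in code. The sketch below is a hypothetical, single-head, unbatched decoder loop: at each generation step only the newest token's projections are computed, while the keys and values of earlier tokens are reused from a cache.
</p>
<div class="code-block mb-6">
<pre># Minimal KV-cache sketch (illustrative only, not a full decoder)
import torch
import torch.nn as nn

d_model, steps = 8, 5
w_q = nn.Linear(d_model, d_model)
w_k = nn.Linear(d_model, d_model)
w_v = nn.Linear(d_model, d_model)

k_cache, v_cache = [], []
for t in range(steps):
    x_t = torch.randn(1, d_model)     # embedding of the newest token
    q_t = w_q(x_t)                    # query is needed only for the new position
    k_cache.append(w_k(x_t))          # keys and values of past tokens are
    v_cache.append(w_v(x_t))          # computed once and then reused
    K = torch.cat(k_cache, dim=0)     # (t + 1, d_model)
    V = torch.cat(v_cache, dim=0)
    weights = torch.softmax(q_t @ K.T / d_model ** 0.5, dim=-1)
    out_t = weights @ V               # (1, d_model) attention output for step t
    print(t, out_t.shape)</pre>
</div>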

<h3 class="text-xl font-semibold text-gray-800 mb-4">How Q, K, V Are Created</h3>
<p class="text-gray-700 mb-4">
In transformer models, Q, K, and V are all derived from the same input sequence through learned linear transformations:
</p>

<div class="code-block mb-6">
<pre># Python example of creating Q, K, V
import torch
import torch.nn as nn

# Suppose we have input embeddings of shape (batch_size, seq_len, d_model)
batch_size = 32
seq_len = 10
d_model = 512
input_embeddings = torch.randn(batch_size, seq_len, d_model)

# Create linear projection layers
q_proj = nn.Linear(d_model, d_model) # Query projection
k_proj = nn.Linear(d_model, d_model) # Key projection
v_proj = nn.Linear(d_model, d_model) # Value projection

# Project inputs to get Q, K, V
Q = q_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model)
K = k_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model)
V = v_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model)</pre>
</div>

<div class="explanation-box">
<h3 class="font-semibold text-lg text-blue-800 mb-2">Important Note</h3>
<p>
In practice, the dimensions are often split into multiple "heads" (multi-head attention), where each head learns different attention patterns.
This allows the model to attend to different aspects of the input simultaneously.
</p>
</div>
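<p class="text-gray-700 mb-4">
A complete from-scratch multi-head implementation appears in the next section. For quick experiments, PyTorch also ships a built-in layer; a minimal self-attention call with the same shapes as above looks roughly like this:
</p>
<div class="code-block mb-6">
<pre># Built-in multi-head attention (batch_first keeps the (batch, seq, feature) layout)
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(32, 10, 512)              # (batch_size, seq_len, d_model)
attn_output, attn_weights = mha(x, x, x)  # self-attention: query = key = value = x
print(attn_output.shape)                  # torch.Size([32, 10, 512])
print(attn_weights.shape)                 # torch.Size([32, 10, 10]), averaged over heads</pre>
</div>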
</div>
</div>

<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Scaled Dot-Product Attention</h2>

<p class="text-gray-700 mb-4">
The core computation in attention mechanisms is the scaled dot-product attention, which can be implemented as follows:
</p>

<div class="code-block mb-6">
<pre>def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: Query tensor (batch_size, ..., seq_len_q, d_k)
    K: Key tensor (batch_size, ..., seq_len_k, d_k)
    V: Value tensor (batch_size, ..., seq_len_k, d_v)
    mask: Optional mask tensor for masking out certain positions
    """
    # Compute dot products between Q and K
    matmul_qk = torch.matmul(Q, K.transpose(-2, -1)) # (..., seq_len_q, seq_len_k)

    # Scale by square root of dimension
    d_k = Q.size(-1)
    scaled_attention_logits = matmul_qk / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))

    # Apply mask if provided (for decoder self-attention)
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # Softmax to get attention weights
    attention_weights = torch.softmax(scaled_attention_logits, dim=-1)

    # Multiply weights by values
    output = torch.matmul(attention_weights, V) # (..., seq_len_q, d_v)

    return output, attention_weights</pre>
</div>

<div class="explanation-box">
<h3 class="font-semibold text-lg text-blue-800 mb-2">Scaling Explanation</h3>
<p>
The scaling factor (√dₖ) is crucial because dot products grow large in magnitude as the dimension increases.
This can push the softmax function into regions where it has extremely small gradients, making learning difficult.
Scaling by √dₖ counteracts this effect.
</p>
</div>
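<p class="text-gray-700 mb-4">
The effect is easy to demonstrate numerically. In this small sketch (random vectors, d_k = 512), the unscaled scores have a standard deviation near √dₖ, so the softmax typically collapses onto a single position, while the scaled scores give a much flatter distribution:
</p>
<div class="code-block mb-6">
<pre># Why the 1/sqrt(d_k) factor matters (illustrative demo)
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(d_k)
keys = torch.randn(10, d_k)

raw = keys @ q               # unscaled scores; std grows roughly like sqrt(d_k)
scaled = raw / d_k ** 0.5    # scaled scores; std stays close to 1

print(torch.softmax(raw, dim=-1).max())     # typically very close to 1.0 (saturated)
print(torch.softmax(scaled, dim=-1).max())  # noticeably smaller (flatter distribution)</pre>
</div>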

<h3 class="text-xl font-semibold text-gray-800 mb-4">Complete Multi-Head Attention Example</h3>

<div class="code-block mb-6">
<pre>class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.depth = d_model // num_heads

        # Linear layers for Q, K, V projections
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)

        self.dense = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth)"""
        x = x.view(batch_size, -1, self.num_heads, self.depth)
        return x.transpose(1, 2) # (batch_size, num_heads, seq_len, depth)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # Linear projections
        q = self.wq(q) # (batch_size, seq_len, d_model)
        k = self.wk(k)
        v = self.wv(v)

        # Split into multiple heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled dot-product attention
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)

        # Concatenate heads
        scaled_attention = scaled_attention.transpose(1, 2) # (batch_size, seq_len, num_heads, depth)
        concat_attention = scaled_attention.contiguous().view(batch_size, -1, self.d_model)

        # Final linear layer
        output = self.dense(concat_attention)

        return output, attention_weights</pre>
</div>
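<p class="text-gray-700 mb-4">
A short usage sketch (it assumes the imports and the scaled_dot_product_attention function from the earlier listings are already in scope):
</p>
<div class="code-block mb-6">
<pre># Example usage of the MultiHeadAttention module defined above
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(32, 10, 512)   # (batch_size, seq_len, d_model)
out, weights = mha(x, x, x)    # self-attention over the same sequence
print(out.shape)               # torch.Size([32, 10, 512])
print(weights.shape)           # torch.Size([32, 8, 10, 10]) - one attention map per head</pre>
</div>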
</div>
</div>

<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Types of Attention Patterns</h2>

<div class="grid md:grid-cols-2 gap-6 mb-6">
<div class="bg-indigo-50 p-4 rounded-lg">
<h3 class="font-bold text-indigo-800 mb-2"><i class="fas fa-arrows-alt-h mr-2"></i>Self-Attention</h3>
<p class="text-gray-700">
Q, K, and V all come from the same sequence. Allows each position to attend to all positions in the same sequence.
</p>
<div class="mt-3">
<span class="inline-block bg-indigo-100 text-indigo-800 px-2 py-1 rounded-full text-xs font-medium">Encoder</span>
<span class="inline-block bg-indigo-100 text-indigo-800 px-2 py-1 rounded-full text-xs font-medium ml-1">BERT</span>
</div>
</div>
<div class="bg-pink-50 p-4 rounded-lg">
<h3 class="font-bold text-pink-800 mb-2"><i class="fas fa-arrow-right mr-2"></i>Masked Self-Attention</h3>
<p class="text-gray-700">
Used in the decoder to prevent positions from attending to subsequent positions (autoregressive property).
</p>
<div class="mt-3">
<span class="inline-block bg-pink-100 text-pink-800 px-2 py-1 rounded-full text-xs font-medium">Decoder</span>
<span class="inline-block bg-pink-100 text-pink-800 px-2 py-1 rounded-full text-xs font-medium ml-1">GPT</span>
</div>
</div>
<div class="bg-teal-50 p-4 rounded-lg">
<h3 class="font-bold text-teal-800 mb-2"><i class="fas fa-exchange-alt mr-2"></i>Cross-Attention</h3>
<p class="text-gray-700">
Q comes from one sequence, while K and V come from another sequence (e.g., encoder-decoder attention).
</p>
<div class="mt-3">
<span class="inline-block bg-teal-100 text-teal-800 px-2 py-1 rounded-full text-xs font-medium">Seq2Seq</span>
<span class="inline-block bg-teal-100 text-teal-800 px-2 py-1 rounded-full text-xs font-medium ml-1">Translation</span>
</div>
</div>
<div class="bg-orange-50 p-4 rounded-lg">
<h3 class="font-bold text-orange-800 mb-2"><i class="fas fa-sliders-h mr-2"></i>Sparse Attention</h3>
<p class="text-gray-700">
Only attends to a subset of positions to reduce computational complexity (e.g., local, strided, or global attention).
</p>
<div class="mt-3">
<span class="inline-block bg-orange-100 text-orange-800 px-2 py-1 rounded-full text-xs font-medium">Longformer</span>
<span class="inline-block bg-orange-100 text-orange-800 px-2 py-1 rounded-full text-xs font-medium ml-1">BigBird</span>
</div>
</div>
</div>
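<p class="text-gray-700 mb-4">
As a concrete illustration of the masked self-attention pattern above, a causal (look-ahead) mask can be built with one line. The sketch below uses the convention of the scaled_dot_product_attention function from the previous section, where a 1 marks a position to block:
</p>
<div class="code-block mb-6">
<pre># Causal mask sketch: 1 = future position that must not be attended to
import torch

seq_len = 5
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
print(mask)
# tensor([[0., 1., 1., 1., 1.],
#         [0., 0., 1., 1., 1.],
#         [0., 0., 0., 1., 1.],
#         [0., 0., 0., 0., 1.],
#         [0., 0., 0., 0., 0.]])
# Adding mask * -1e9 to the attention logits drives the blocked scores toward
# minus infinity, so their softmax weights are effectively zero.</pre>
</div>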
</div>
</div>

<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Key Citations and Resources</h2>

<div class="space-y-4">
<div class="citation">
<h3 class="font-semibold text-gray-800">1. Vaswani et al. (2017) - Original Transformer Paper</h3>
<p class="text-gray-600">"Attention Is All You Need" - Introduced the transformer architecture with scaled dot-product attention.</p>
<a href="https://arxiv.org/abs/1706.03762" class="text-blue-600 hover:underline inline-block mt-1">arXiv:1706.03762</a>
</div>

<div class="citation">
<h3 class="font-semibold text-gray-800">2. Jurafsky & Martin (2023) - NLP Textbook</h3>
<p class="text-gray-600">"Speech and Language Processing" - Comprehensive chapter on attention and transformer models.</p>
<a href="https://web.stanford.edu/~jurafsky/slp3/" class="text-blue-600 hover:underline inline-block mt-1">Stanford NLP Textbook</a>
</div>

<div class="citation">
<h3 class="font-semibold text-gray-800">3. Illustrated Transformer (Blog Post)</h3>
<p class="text-gray-600">Jay Alammar's visual explanation of transformer attention mechanisms.</p>
<a href="https://jalammar.github.io/illustrated-transformer/" class="text-blue-600 hover:underline inline-block mt-1">jalammar.github.io</a>
</div>

<div class="citation">
<h3 class="font-semibold text-gray-800">4. Harvard NLP (2022) - Annotated Transformer</h3>
<p class="text-gray-600">Line-by-line implementation guide with PyTorch.</p>
<a href="http://nlp.seas.harvard.edu/2018/04/03/attention.html" class="text-blue-600 hover:underline inline-block mt-1">Harvard NLP Tutorial</a>
</div>

<div class="citation">
<h3 class="font-semibold text-gray-800">5. Efficient Transformers Survey (2020)</h3>
<p class="text-gray-600">Tay et al. review various attention variants for efficiency.</p>
<a href="https://arxiv.org/abs/2009.06732" class="text-blue-600 hover:underline inline-block mt-1">arXiv:2009.06732</a>
</div>
</div>
</div>
</div>

<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Practical Considerations</h2>

<div class="grid md:grid-cols-2 gap-6">
<div>
<h3 class="text-xl font-semibold text-gray-800 mb-3"><i class="fas fa-lightbulb text-yellow-500 mr-2"></i>Tips for Implementation</h3>
<ul class="list-disc pl-6 text-gray-700 space-y-2">
<li>Use layer normalization before (not after) attention in transformer blocks</li>
<li>Initialize attention projections with small random weights</li>
<li>Monitor attention patterns during training for debugging</li>
<li>Consider using flash attention for efficiency in production</li>
<li>Use attention masking carefully for padding and autoregressive generation (see the sketch after these lists)</li>
</ul>
</div>
<div>
<h3 class="text-xl font-semibold text-gray-800 mb-3"><i class="fas fa-exclamation-triangle text-red-500 mr-2"></i>Common Pitfalls</h3>
<ul class="list-disc pl-6 text-gray-700 space-y-2">
<li>Forgetting to scale attention scores by √dₖ</li>
<li>Improper handling of attention masks</li>
<li>Not using residual connections around attention</li>
<li>Oversized attention heads that don't learn meaningful patterns</li>
<li>Ignoring attention patterns when debugging model behavior</li>
</ul>
</div>
</div>
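<p class="text-gray-700 mb-4">
The masking and flash-attention tips can be combined in one short sketch. It assumes PyTorch 2.x, whose fused scaled_dot_product_attention can dispatch to FlashAttention-style kernels on supported hardware; the boolean mask marks which key positions are real (True) rather than padding:
</p>
<div class="code-block mb-6">
<pre># Padding mask with PyTorch's fused attention (illustrative sketch)
import torch
import torch.nn.functional as F

batch, heads, seq, depth = 2, 4, 6, 16
q = torch.randn(batch, heads, seq, depth)
k = torch.randn(batch, heads, seq, depth)
v = torch.randn(batch, heads, seq, depth)

lengths = torch.tensor([6, 4])            # second sequence has 2 padding tokens
key_is_real = torch.arange(seq).expand(batch, seq).lt(lengths[:, None])  # (batch, seq) bool
attn_mask = key_is_real[:, None, None, :]  # broadcast over heads and query positions

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
print(out.shape)                          # torch.Size([2, 4, 6, 16])</pre>
</div>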
</div>
</div>

<footer class="text-center py-8 text-gray-600">
<p>© 2023 Understanding Attention Mechanisms in LLMs</p>
<p class="mt-2">Educational resource for machine learning students</p>
<div class="mt-4 flex justify-center space-x-4">
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-github fa-lg"></i></a>
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-twitter fa-lg"></i></a>
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-linkedin fa-lg"></i></a>
</div>
</footer>
</div>
<p style="border-radius: 8px; text-align: center; font-size: 12px; color: #fff; margin-top: 16px;position: fixed; left: 8px; bottom: 8px; z-index: 10; background: rgba(0, 0, 0, 0.8); padding: 4px 8px;">Made with <img src="https://enzostvs-deepsite.hf.space/logo.svg" alt="DeepSite Logo" style="width: 16px; height: 16px; vertical-align: middle;display:inline-block;margin-right:3px;filter:brightness(0) invert(1);"><a href="https://enzostvs-deepsite.hf.space" style="color: #fff;text-decoration: underline;" target="_blank" >DeepSite</a> - 🧬 <a href="https://enzostvs-deepsite.hf.space?remix=ontoligent/ds-5001-text-as-data" style="color: #fff;text-decoration: underline;" target="_blank" >Remix</a></p></body>
</html>
prompts.txt
ADDED
@@ -0,0 +1 @@
+Create a website that explains the attention mechanism used by LLMs. Include examples in Python suitable for teaching undergraduates and a list of authoritative and helpful citations. Go into some detail about why Q, K, and C are needed and how they are created.