Spaces: ontoligent/ds-5001-text-as-data
Commit: Add 3 files
Files changed:
- README.md +7 -5
- index.html +395 -19
- prompts.txt +1 -0
README.md
CHANGED
@@ -1,10 +1,12 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
+title: ds-5001-text-as-data
+emoji: 🐳
+colorFrom: green
+colorTo: yellow
 sdk: static
 pinned: false
+tags:
+- deepsite
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
index.html
CHANGED
@@ -1,19 +1,395 @@
-<!
-<html>
(the remaining 17 removed lines were blank; the new index.html follows in full)
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Understanding Attention Mechanisms in LLMs</title>
<script src="https://cdn.tailwindcss.com"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
<style>
.code-block {
    background-color: #2d2d2d;
    color: #f8f8f2;
    padding: 1rem;
    border-radius: 0.5rem;
    font-family: 'Courier New', Courier, monospace;
    overflow-x: auto;
}
.attention-visual {
    display: flex;
    justify-content: center;
    margin: 2rem 0;
}
.attention-node {
    width: 60px;
    height: 60px;
    border-radius: 50%;
    display: flex;
    align-items: center;
    justify-content: center;
    font-weight: bold;
    position: relative;
}
.attention-line {
    position: absolute;
    background-color: rgba(59, 130, 246, 0.5);
    transform-origin: left center;
}
.explanation-box {
    background-color: #f0f9ff;
    border-left: 4px solid #3b82f6;
    padding: 1rem;
    margin: 1rem 0;
    border-radius: 0 0.5rem 0.5rem 0;
}
.citation {
    background-color: #f8fafc;
    padding: 0.5rem;
    margin: 0.5rem 0;
    border-left: 3px solid #94a3b8;
}
</style>
</head>
<body class="bg-gray-50">
<div class="max-w-4xl mx-auto px-4 py-8">
<header class="text-center mb-12">
<h1 class="text-4xl font-bold text-blue-800 mb-4">Attention Mechanisms in Large Language Models</h1>
<p class="text-xl text-gray-600">Understanding the core innovation behind modern AI language models</p>
<div class="mt-6">
<span class="inline-block bg-blue-100 text-blue-800 px-3 py-1 rounded-full text-sm font-medium">Machine Learning</span>
<span class="inline-block bg-purple-100 text-purple-800 px-3 py-1 rounded-full text-sm font-medium ml-2">Natural Language Processing</span>
<span class="inline-block bg-green-100 text-green-800 px-3 py-1 rounded-full text-sm font-medium ml-2">Deep Learning</span>
</div>
</header>

<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Introduction to Attention</h2>
<p class="text-gray-700 mb-4">
The attention mechanism is a fundamental component of modern transformer-based language models like GPT, BERT, and others.
It allows models to dynamically focus on different parts of the input sequence when producing each part of the output sequence.
</p>
<p class="text-gray-700 mb-6">
Unlike traditional sequence models that process inputs in a fixed order, attention mechanisms enable models to learn which parts of the input are most relevant at each step of processing.
</p>

<div class="attention-visual">
<div class="flex flex-col items-center">
<div class="flex space-x-8 mb-8">
<div class="attention-node bg-blue-100 text-blue-800">Input</div>
<div class="attention-node bg-purple-100 text-purple-800">Q</div>
<div class="attention-node bg-green-100 text-green-800">K</div>
<div class="attention-node bg-yellow-100 text-yellow-800">V</div>
</div>
<div class="attention-node bg-red-100 text-red-800">Output</div>
</div>
</div>

<div class="explanation-box">
<h3 class="font-semibold text-lg text-blue-800 mb-2">Key Insight</h3>
<p>
Attention mechanisms compute a weighted sum of values (V), where the weights are determined by the compatibility between queries (Q) and keys (K).
This allows the model to focus on different parts of the input sequence dynamically.
</p>
</div>
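<p class="text-gray-700 mb-4">
The arithmetic behind that weighted sum can be made concrete with a tiny, self-contained sketch. The numbers below are random stand-ins for real token embeddings, so the output is illustrative only:
</p>
<div class="code-block mb-6">
<pre># Toy example: three 4-dimensional "tokens" attending to each other.
# (Random vectors stand in for learned representations.)
import torch

torch.manual_seed(0)
d_k = 4
Q = torch.randn(3, d_k)   # one query per token
K = torch.randn(3, d_k)   # one key per token
V = torch.randn(3, d_k)   # one value per token

scores = Q @ K.T / d_k ** 0.5            # (3, 3) compatibility scores
weights = torch.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ V                     # weighted sum of values

print(weights)           # how strongly each token attends to every other token
print(weights.sum(-1))   # sums to 1 along each row (up to floating point)
print(output.shape)      # torch.Size([3, 4])</pre>
</div>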
</div>
</div>

<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">The Q, K, V Triad</h2>

<div class="grid md:grid-cols-3 gap-6 mb-8">
<div class="bg-blue-50 p-4 rounded-lg">
<h3 class="font-bold text-blue-800 mb-2"><i class="fas fa-question-circle mr-2"></i>Queries (Q)</h3>
<p class="text-gray-700">
Represent what the model is "looking for" at the current position. They are learned representations that help determine which parts of the input to focus on.
</p>
</div>
<div class="bg-green-50 p-4 rounded-lg">
<h3 class="font-bold text-green-800 mb-2"><i class="fas fa-key mr-2"></i>Keys (K)</h3>
<p class="text-gray-700">
Represent what each input element "contains" or "offers". They are compared against queries to determine attention weights.
</p>
</div>
<div class="bg-yellow-50 p-4 rounded-lg">
<h3 class="font-bold text-yellow-800 mb-2"><i class="fas fa-database mr-2"></i>Values (V)</h3>
<p class="text-gray-700">
Contain the actual information that will be aggregated based on the attention weights. They represent what gets passed forward.
</p>
</div>
</div>

<h3 class="text-xl font-semibold text-gray-800 mb-4">Why We Need All Three</h3>
<p class="text-gray-700 mb-4">
The separation of Q, K, and V provides flexibility and expressive power to the attention mechanism:
</p>
<ul class="list-disc pl-6 text-gray-700 space-y-2 mb-6">
<li><strong>Decoupling:</strong> Allows different representations for what to look for (Q) versus what to retrieve (V)</li>
<li><strong>Flexibility:</strong> Enables different types of attention patterns (e.g., looking ahead vs. looking back)</li>
<li><strong>Efficiency:</strong> Permits caching of K and V for autoregressive generation (see the sketch below)</li>
<li><strong>Interpretability:</strong> Makes attention patterns more meaningful and analyzable</li>
</ul>
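<p class="text-gray-700 mb-4">
The efficiency point is easiest to see in code. The sketch below is a hypothetical, single-head, unbatched decoder loop: at each generation step only the newest token's projections are computed, while the keys and values of earlier tokens are reused from a cache.
</p>
<div class="code-block mb-6">
<pre># Minimal KV-cache sketch (illustrative only, not a full decoder)
import torch
import torch.nn as nn

d_model, steps = 8, 5
w_q = nn.Linear(d_model, d_model)
w_k = nn.Linear(d_model, d_model)
w_v = nn.Linear(d_model, d_model)

k_cache, v_cache = [], []
for t in range(steps):
    x_t = torch.randn(1, d_model)     # embedding of the newest token
    q_t = w_q(x_t)                    # query is needed only for the new position
    k_cache.append(w_k(x_t))          # keys and values of past tokens are
    v_cache.append(w_v(x_t))          # computed once and then reused
    K = torch.cat(k_cache, dim=0)     # (t + 1, d_model)
    V = torch.cat(v_cache, dim=0)
    weights = torch.softmax(q_t @ K.T / d_model ** 0.5, dim=-1)
    out_t = weights @ V               # (1, d_model) attention output for step t
    print(t, out_t.shape)</pre>
</div>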

<h3 class="text-xl font-semibold text-gray-800 mb-4">How Q, K, V Are Created</h3>
<p class="text-gray-700 mb-4">
In transformer models, Q, K, and V are all derived from the same input sequence through learned linear transformations:
</p>

<div class="code-block mb-6">
<pre># Python example of creating Q, K, V
import torch
import torch.nn as nn

# Suppose we have input embeddings of shape (batch_size, seq_len, d_model)
batch_size = 32
seq_len = 10
d_model = 512
input_embeddings = torch.randn(batch_size, seq_len, d_model)

# Create linear projection layers
q_proj = nn.Linear(d_model, d_model) # Query projection
k_proj = nn.Linear(d_model, d_model) # Key projection
v_proj = nn.Linear(d_model, d_model) # Value projection

# Project inputs to get Q, K, V
Q = q_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model)
K = k_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model)
V = v_proj(input_embeddings) # Shape: (batch_size, seq_len, d_model)</pre>
</div>

<div class="explanation-box">
<h3 class="font-semibold text-lg text-blue-800 mb-2">Important Note</h3>
<p>
In practice, the dimensions are often split into multiple "heads" (multi-head attention), where each head learns different attention patterns.
This allows the model to attend to different aspects of the input simultaneously.
</p>
</div>
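<p class="text-gray-700 mb-4">
A complete from-scratch multi-head implementation appears in the next section. For quick experiments, PyTorch also ships a built-in layer; a minimal self-attention call with the same shapes as above looks roughly like this:
</p>
<div class="code-block mb-6">
<pre># Built-in multi-head attention (batch_first keeps the (batch, seq, feature) layout)
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(32, 10, 512)              # (batch_size, seq_len, d_model)
attn_output, attn_weights = mha(x, x, x)  # self-attention: query = key = value = x
print(attn_output.shape)                  # torch.Size([32, 10, 512])
print(attn_weights.shape)                 # torch.Size([32, 10, 10]), averaged over heads</pre>
</div>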
</div>
</div>

<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Scaled Dot-Product Attention</h2>

<p class="text-gray-700 mb-4">
The core computation in attention mechanisms is the scaled dot-product attention, which can be implemented as follows:
</p>

<div class="code-block mb-6">
<pre>def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: Query tensor (batch_size, ..., seq_len_q, d_k)
    K: Key tensor (batch_size, ..., seq_len_k, d_k)
    V: Value tensor (batch_size, ..., seq_len_k, d_v)
    mask: Optional mask tensor for masking out certain positions
    """
    # Compute dot products between Q and K
    matmul_qk = torch.matmul(Q, K.transpose(-2, -1)) # (..., seq_len_q, seq_len_k)

    # Scale by square root of dimension
    d_k = Q.size(-1)
    scaled_attention_logits = matmul_qk / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))

    # Apply mask if provided (for decoder self-attention)
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # Softmax to get attention weights
    attention_weights = torch.softmax(scaled_attention_logits, dim=-1)

    # Multiply weights by values
    output = torch.matmul(attention_weights, V) # (..., seq_len_q, d_v)

    return output, attention_weights</pre>
</div>

<div class="explanation-box">
<h3 class="font-semibold text-lg text-blue-800 mb-2">Scaling Explanation</h3>
<p>
The scaling factor (√dₖ) is crucial because dot products grow large in magnitude as the dimension increases.
This can push the softmax function into regions where it has extremely small gradients, making learning difficult.
Scaling by √dₖ counteracts this effect.
</p>
</div>
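<p class="text-gray-700 mb-4">
The effect is easy to demonstrate numerically. In this small sketch (random vectors, d_k = 512), the unscaled scores have a standard deviation near √dₖ, so the softmax typically collapses onto a single position, while the scaled scores give a much flatter distribution:
</p>
<div class="code-block mb-6">
<pre># Why the 1/sqrt(d_k) factor matters (illustrative demo)
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(d_k)
keys = torch.randn(10, d_k)

raw = keys @ q               # unscaled scores; std grows roughly like sqrt(d_k)
scaled = raw / d_k ** 0.5    # scaled scores; std stays close to 1

print(torch.softmax(raw, dim=-1).max())     # typically very close to 1.0 (saturated)
print(torch.softmax(scaled, dim=-1).max())  # noticeably smaller (flatter distribution)</pre>
</div>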

<h3 class="text-xl font-semibold text-gray-800 mb-4">Complete Multi-Head Attention Example</h3>

<div class="code-block mb-6">
<pre>class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.depth = d_model // num_heads

        # Linear layers for Q, K, V projections
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)

        self.dense = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth)"""
        x = x.view(batch_size, -1, self.num_heads, self.depth)
        return x.transpose(1, 2) # (batch_size, num_heads, seq_len, depth)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # Linear projections
        q = self.wq(q) # (batch_size, seq_len, d_model)
        k = self.wk(k)
        v = self.wv(v)

        # Split into multiple heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled dot-product attention
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)

        # Concatenate heads
        scaled_attention = scaled_attention.transpose(1, 2) # (batch_size, seq_len, num_heads, depth)
        concat_attention = scaled_attention.contiguous().view(batch_size, -1, self.d_model)

        # Final linear layer
        output = self.dense(concat_attention)

        return output, attention_weights</pre>
</div>
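<p class="text-gray-700 mb-4">
A short usage sketch (it assumes the imports and the scaled_dot_product_attention function from the earlier listings are already in scope):
</p>
<div class="code-block mb-6">
<pre># Example usage of the MultiHeadAttention module defined above
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(32, 10, 512)   # (batch_size, seq_len, d_model)
out, weights = mha(x, x, x)    # self-attention over the same sequence
print(out.shape)               # torch.Size([32, 10, 512])
print(weights.shape)           # torch.Size([32, 8, 10, 10]) - one attention map per head</pre>
</div>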
</div>
</div>

<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Types of Attention Patterns</h2>

<div class="grid md:grid-cols-2 gap-6 mb-6">
<div class="bg-indigo-50 p-4 rounded-lg">
<h3 class="font-bold text-indigo-800 mb-2"><i class="fas fa-arrows-alt-h mr-2"></i>Self-Attention</h3>
<p class="text-gray-700">
Q, K, and V all come from the same sequence. Allows each position to attend to all positions in the same sequence.
</p>
<div class="mt-3">
<span class="inline-block bg-indigo-100 text-indigo-800 px-2 py-1 rounded-full text-xs font-medium">Encoder</span>
<span class="inline-block bg-indigo-100 text-indigo-800 px-2 py-1 rounded-full text-xs font-medium ml-1">BERT</span>
</div>
</div>
<div class="bg-pink-50 p-4 rounded-lg">
<h3 class="font-bold text-pink-800 mb-2"><i class="fas fa-arrow-right mr-2"></i>Masked Self-Attention</h3>
<p class="text-gray-700">
Used in the decoder to prevent positions from attending to subsequent positions (autoregressive property).
</p>
<div class="mt-3">
<span class="inline-block bg-pink-100 text-pink-800 px-2 py-1 rounded-full text-xs font-medium">Decoder</span>
<span class="inline-block bg-pink-100 text-pink-800 px-2 py-1 rounded-full text-xs font-medium ml-1">GPT</span>
</div>
</div>
<div class="bg-teal-50 p-4 rounded-lg">
<h3 class="font-bold text-teal-800 mb-2"><i class="fas fa-exchange-alt mr-2"></i>Cross-Attention</h3>
<p class="text-gray-700">
Q comes from one sequence, while K and V come from another sequence (e.g., encoder-decoder attention).
</p>
<div class="mt-3">
<span class="inline-block bg-teal-100 text-teal-800 px-2 py-1 rounded-full text-xs font-medium">Seq2Seq</span>
<span class="inline-block bg-teal-100 text-teal-800 px-2 py-1 rounded-full text-xs font-medium ml-1">Translation</span>
</div>
</div>
<div class="bg-orange-50 p-4 rounded-lg">
<h3 class="font-bold text-orange-800 mb-2"><i class="fas fa-sliders-h mr-2"></i>Sparse Attention</h3>
<p class="text-gray-700">
Only attends to a subset of positions to reduce computational complexity (e.g., local, strided, or global attention).
</p>
<div class="mt-3">
<span class="inline-block bg-orange-100 text-orange-800 px-2 py-1 rounded-full text-xs font-medium">Longformer</span>
<span class="inline-block bg-orange-100 text-orange-800 px-2 py-1 rounded-full text-xs font-medium ml-1">BigBird</span>
</div>
</div>
</div>
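<p class="text-gray-700 mb-4">
As a concrete illustration of the masked self-attention pattern above, a causal (look-ahead) mask can be built with one line. The sketch below uses the convention of the scaled_dot_product_attention function from the previous section, where a 1 marks a position to block:
</p>
<div class="code-block mb-6">
<pre># Causal mask sketch: 1 = future position that must not be attended to
import torch

seq_len = 5
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
print(mask)
# tensor([[0., 1., 1., 1., 1.],
#         [0., 0., 1., 1., 1.],
#         [0., 0., 0., 1., 1.],
#         [0., 0., 0., 0., 1.],
#         [0., 0., 0., 0., 0.]])
# Adding mask * -1e9 to the attention logits drives the blocked scores toward
# minus infinity, so their softmax weights are effectively zero.</pre>
</div>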
</div>
</div>

<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Key Citations and Resources</h2>

<div class="space-y-4">
<div class="citation">
<h3 class="font-semibold text-gray-800">1. Vaswani et al. (2017) - Original Transformer Paper</h3>
<p class="text-gray-600">"Attention Is All You Need" - Introduced the transformer architecture with scaled dot-product attention.</p>
<a href="https://arxiv.org/abs/1706.03762" class="text-blue-600 hover:underline inline-block mt-1">arXiv:1706.03762</a>
</div>

<div class="citation">
<h3 class="font-semibold text-gray-800">2. Jurafsky & Martin (2023) - NLP Textbook</h3>
<p class="text-gray-600">"Speech and Language Processing" - Comprehensive chapter on attention and transformer models.</p>
<a href="https://web.stanford.edu/~jurafsky/slp3/" class="text-blue-600 hover:underline inline-block mt-1">Stanford NLP Textbook</a>
</div>

<div class="citation">
<h3 class="font-semibold text-gray-800">3. Illustrated Transformer (Blog Post)</h3>
<p class="text-gray-600">Jay Alammar's visual explanation of transformer attention mechanisms.</p>
<a href="https://jalammar.github.io/illustrated-transformer/" class="text-blue-600 hover:underline inline-block mt-1">jalammar.github.io</a>
</div>

<div class="citation">
<h3 class="font-semibold text-gray-800">4. Harvard NLP (2022) - Annotated Transformer</h3>
<p class="text-gray-600">Line-by-line implementation guide with PyTorch.</p>
<a href="http://nlp.seas.harvard.edu/2018/04/03/attention.html" class="text-blue-600 hover:underline inline-block mt-1">Harvard NLP Tutorial</a>
</div>

<div class="citation">
<h3 class="font-semibold text-gray-800">5. Efficient Transformers Survey (2020)</h3>
<p class="text-gray-600">Tay et al. review various attention variants for efficiency.</p>
<a href="https://arxiv.org/abs/2009.06732" class="text-blue-600 hover:underline inline-block mt-1">arXiv:2009.06732</a>
</div>
</div>
</div>
</div>

<div class="bg-white rounded-xl shadow-md overflow-hidden mb-8">
<div class="p-8">
<h2 class="text-2xl font-bold text-gray-800 mb-6">Practical Considerations</h2>

<div class="grid md:grid-cols-2 gap-6">
<div>
<h3 class="text-xl font-semibold text-gray-800 mb-3"><i class="fas fa-lightbulb text-yellow-500 mr-2"></i>Tips for Implementation</h3>
<ul class="list-disc pl-6 text-gray-700 space-y-2">
<li>Use layer normalization before (not after) attention in transformer blocks</li>
<li>Initialize attention projections with small random weights</li>
<li>Monitor attention patterns during training for debugging</li>
<li>Consider using flash attention for efficiency in production</li>
<li>Use attention masking carefully for padding and autoregressive generation (see the sketch after these lists)</li>
</ul>
</div>
<div>
<h3 class="text-xl font-semibold text-gray-800 mb-3"><i class="fas fa-exclamation-triangle text-red-500 mr-2"></i>Common Pitfalls</h3>
<ul class="list-disc pl-6 text-gray-700 space-y-2">
<li>Forgetting to scale attention scores by √dₖ</li>
<li>Improper handling of attention masks</li>
<li>Not using residual connections around attention</li>
<li>Oversized attention heads that don't learn meaningful patterns</li>
<li>Ignoring attention patterns when debugging model behavior</li>
</ul>
</div>
</div>
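<p class="text-gray-700 mb-4">
The masking and flash-attention tips can be combined in one short sketch. It assumes PyTorch 2.x, whose fused scaled_dot_product_attention can dispatch to FlashAttention-style kernels on supported hardware; the boolean mask marks which key positions are real (True) rather than padding:
</p>
<div class="code-block mb-6">
<pre># Padding mask with PyTorch's fused attention (illustrative sketch)
import torch
import torch.nn.functional as F

batch, heads, seq, depth = 2, 4, 6, 16
q = torch.randn(batch, heads, seq, depth)
k = torch.randn(batch, heads, seq, depth)
v = torch.randn(batch, heads, seq, depth)

lengths = torch.tensor([6, 4])            # second sequence has 2 padding tokens
key_is_real = torch.arange(seq).expand(batch, seq).lt(lengths[:, None])  # (batch, seq) bool
attn_mask = key_is_real[:, None, None, :]  # broadcast over heads and query positions

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
print(out.shape)                          # torch.Size([2, 4, 6, 16])</pre>
</div>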
</div>
</div>

<footer class="text-center py-8 text-gray-600">
<p>© 2023 Understanding Attention Mechanisms in LLMs</p>
<p class="mt-2">Educational resource for machine learning students</p>
<div class="mt-4 flex justify-center space-x-4">
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-github fa-lg"></i></a>
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-twitter fa-lg"></i></a>
<a href="#" class="text-blue-600 hover:text-blue-800"><i class="fab fa-linkedin fa-lg"></i></a>
</div>
</footer>
</div>
<p style="border-radius: 8px; text-align: center; font-size: 12px; color: #fff; margin-top: 16px;position: fixed; left: 8px; bottom: 8px; z-index: 10; background: rgba(0, 0, 0, 0.8); padding: 4px 8px;">Made with <img src="https://enzostvs-deepsite.hf.space/logo.svg" alt="DeepSite Logo" style="width: 16px; height: 16px; vertical-align: middle;display:inline-block;margin-right:3px;filter:brightness(0) invert(1);"><a href="https://enzostvs-deepsite.hf.space" style="color: #fff;text-decoration: underline;" target="_blank" >DeepSite</a> - 🧬 <a href="https://enzostvs-deepsite.hf.space?remix=ontoligent/ds-5001-text-as-data" style="color: #fff;text-decoration: underline;" target="_blank" >Remix</a></p></body>
</html>
prompts.txt
ADDED
@@ -0,0 +1 @@
+Create a website that explains the attention mechanism used by LLMs. Include examples in Python suitable for teaching undergraduates and a list of authoritative and helpful citations. Go into some detail about why Q, K, and C are needed and how they are created.