Molbap (HF Staff) committed
Commit 78bf448 · verified · 1 Parent(s): 4c0fc77

Update index.html

Files changed (1)
  1. index.html +652 -18
index.html CHANGED
@@ -1,19 +1,653 @@
1
- <!doctype html>
2
- <html>
3
- <head>
4
- <meta charset="utf-8" />
5
- <meta name="viewport" content="width=device-width" />
6
- <title>My static Space</title>
7
- <link rel="stylesheet" href="style.css" />
8
- </head>
9
- <body>
10
- <div class="card">
11
- <h1>Welcome to your static Space!</h1>
12
- <p>You can modify this app directly by editing <i>index.html</i> in the Files and versions tab.</p>
13
- <p>
14
- Also don't forget to check the
15
- <a href="https://huggingface.co/docs/hub/spaces" target="_blank">Spaces documentation</a>.
16
- </p>
17
- </div>
18
- </body>
19
  </html>
 
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="utf-8" />
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
6
+ <title>PyTorch × Transformers Journey</title>
7
+
8
+ <!-- Google Fonts -->
9
+ <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;800&family=Fira+Code:wght@400;600&display=swap" rel="stylesheet" />
10
+
11
+ <!-- Reveal.js core & dark theme base -->
12
+ <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reset.css" />
13
+ <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reveal.css" />
14
+ <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/theme/black.css" id="theme" />
15
+
16
+ <!-- Highlight.js -->
17
+ <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/styles/github-dark.min.css" />
18
+
19
+ <!-- Animations -->
20
+ <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/animate.css@4/animate.min.css" />
21
+
22
+ <style>
23
+ :root {
24
+ --accent-primary: #ee4c2c; /* PyTorch orange‑red */
25
+ --accent-secondary: #ffb347; /* lighter highlight */
26
+ --bg-gradient-start: #1b1b1b;
27
+ --bg-gradient-end: #242424;
28
+ }
29
+ html, body { font-family: 'Inter', sans-serif; }
30
+ .reveal .slides {
31
+ background: linear-gradient(135deg, var(--bg-gradient-start), var(--bg-gradient-end));
32
+ }
33
+ .reveal h1, .reveal h2, .reveal h3 { color: var(--accent-primary); font-weight: 800; letter-spacing: -0.5px; }
34
+ .reveal pre code { font-family: 'Fira Code', monospace; font-size: 0.75em; }
35
+ .reveal section img, .reveal section svg { border-radius: 1rem; box-shadow: 0 8px 22px rgba(0,0,0,0.4); }
36
+ .fragment.highlight-current-blue.visible { color: var(--accent-secondary) !important; }
37
+ /* slide-density patch */
38
+ .reveal h1 { font-size: 2.6rem; line-height: 1.1; }
39
+ .reveal h2 { font-size: 1.9rem; line-height: 1.15; }
40
+ .reveal h3 { font-size: 1.4rem; line-height: 1.2; }
41
+ .reveal p, .reveal li { font-size: 1.7rem; line-height: 1.35; }
42
+ .reveal pre code { font-size: 0.67em; }
43
+ @media (max-width: 1024px) { .reveal h1{font-size:2.2rem;} .reveal h2{font-size:1.6rem;} }
44
+ .reveal table td, .reveal table th { font-size: 0.85rem; padding: 4px 8px; }
45
+ body::after {
46
+ content: "";
47
+ position: fixed;
48
+ bottom: 3.5em;
49
+ left: 3.5em;
50
+ width: 270px; /* desired size */
51
+ height: 117px;
52
+ background-image: url(assets/py2.png);
53
+ background-size: contain;
54
+ background-repeat: no-repeat;
55
+ z-index: 9999;
56
+ box-shadow: 5px 5px 10px #000;
57
+ pointer-events: none;
58
+ }
59
+
60
+
61
+
62
+
63
+
64
+ </style>
65
+ </head>
66
+ <body>
67
+ <div class="reveal">
68
+ <div class="slides">
69
+ <section>
70
+ <img src="assets/screenpage2.png" alt="Full slide image"
71
+ style="
72
+ width:120%;
73
+ height:110%;
74
+ object-fit:cover;
75
+ margin-left:-2.5%;
76
+ margin-top:-2.5%;
77
+ " /> <!-- 1 · Opening -->
78
+ </section>
79
+ <section data-auto-animate>
80
+ <img src="assets/head_logo.svg"
81
+ alt="Logo"
82
+ style="width: 120px; margin-bottom: 1rem;"
83
+ class="animate__animated animate__fadeInDown" />
84
+ <h1 class="animate__animated animate__fadeInDown">PyTorch × Transformers Journey</h1>
85
+ <h3 class="animate__animated animate__fadeInDown animate__delay-1s">Pythonicity, Autodiff & Modularity in Modern AI</h3>
86
+ <p class="animate__animated animate__fadeInUp animate__delay-2s">Pablo Montalvo‑Leroux &nbsp;·&nbsp; ML Engineer @ Hugging Face</p>
87
+ </section>
88
+
89
+ <section>
90
+ <h2>2016‑2018: Backprop &amp; Birth Pangs</h2>
91
+ <p>The journey began with uncertainty: back in 2016, machine learning was far from standardized. Tools like Theano and CNTK were fading, and many of us—myself included—were jumping from framework to framework. It was a time of raw experimentation.</p>
92
+ <ul>
93
+ <li>Frameworks were in flux; few stuck around.</li>
94
+ <li>MLPs evolved to RNNs and LSTMs.</li>
95
+ <li><strong>2017, Attention, then 2018: BERT</strong> arrives, blowing the roof off what's possible.</li>
96
+ </ul>
97
+ <p class="fragment">But reproducing results remained frustratingly difficult.</p>
98
+ </section>
99
+
100
+ <section>
101
+ <h2>Transformers × PyTorch: Reproducibility</h2>
102
+ <p>That all changed with <code>pytorch-pretrained-bert</code>, the predecessor to Transformers. Suddenly, the magic of BERT was available in an interface that made sense.</p>
103
+ <ul>
104
+ <li>No static graphs, just Python functions and PyTorch modules.</li>
105
+ <li>Readable, hackable code meant results could be shared, reproduced, improved.</li>
106
+ <li>This shifted the research community towards PyTorch.</li>
107
+ </ul>
108
+ </section>
109
+
110
+
111
+ <!-- 3 · Static vs Dynamic Graphs -->
112
+ <section>
113
+ <h2>Static vs Dynamic Graphs</h2>
114
+ <p>Static graphs require you to compile, wait, and cross your fingers that the bug reproduces.</p>
115
+ <p>Dynamic graphs mean you can drop <code>pdb.set_trace()</code> anywhere and continue iterating.</p>
116
+ <p>Nowadays <code>torch.compile</code> gives the best of both worlds: write dynamically, ship something ahead‑of‑time optimised.</p>
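+ <p>A minimal sketch of that workflow (the toy module and inputs are illustrative):</p>
+ <pre><code class="language-python" data-trim data-noescape>
+ import torch
+ from torch import nn
+ 
+ model = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 4))
+ x = torch.randn(8, 16)
+ 
+ # Eager mode: set a breakpoint anywhere and inspect live tensors.
+ # import pdb; pdb.set_trace()
+ y_eager = model(x)
+ 
+ # Same code, compiled for deployment-time speed.
+ compiled = torch.compile(model)
+ y_compiled = compiled(x)
+ 
+ torch.testing.assert_close(y_eager, y_compiled)
+ </code></pre>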
117
+ </section>
118
+
119
+
120
+ <!-- 4 · Dynamic Graphs Enabled Contribution -->
121
+ <section>
122
+ <h2>Dynamic Graphs Enabled Contribution</h2>
123
+ <ul>
124
+ <li>Developers debug at line‑rate — no cold‑start recompiles.</li>
125
+ <li>Pull‑requests remained reproducible overnight, which accelerated trust.</li>
126
+ <li>Static‑graph alternatives stalled and the community consolidated around PyTorch.</li>
127
+ </ul>
128
+ </section>
129
+
130
+ <section>
131
+ <h2>Clone the Paper Tonight → Tweak Tomorrow</h2>
132
+ <p>PyTorch lowered the barrier to implementation. Transformers removed the rest.</p>
133
+ <ul>
134
+ <li>2018: debugging BERT fine-tunes meant live tensor prints, not codegen restarts.</li>
135
+ <li>Community credibility grew because patches could be merged fast and verified easily.</li>
136
+ <li>Experimentation became a matter of hours, not weeks.</li>
137
+ </ul>
138
+ </section>
139
+
140
+ <!-- 6 · One Model · One File -->
141
+ <section>
142
+ <h2>“One Model · One File” — Why it Matters</h2>
143
+ <pre><code class="language-python" data-trim data-noescape>
144
+ # modeling_bert.py — single source of truth
145
+ class BertConfig(PretrainedConfig):
146
+ ...
147
+
148
+ class BertSelfAttention(nn.Module):
149
+ ...
150
+
151
+ class BertLayer(nn.Module):
152
+ ...
153
+
154
+ class BertModel(PreTrainedModel):
155
+ def __init__(self, config):
156
+ super().__init__(config)
157
+ self.embeddings = BertEmbeddings(config)
158
+ self.encoder = nn.ModuleList(
159
+ [BertLayer(config) for _ in range(config.num_hidden_layers)]
160
+ )
161
+ self.init_weights()
162
+ </code></pre>
163
+ <ul>
164
+ <li>All layers, forward pass, and <code>from_pretrained()</code> logic live together.</li>
165
+ <li>No cross‑file inheritance maze — copy to Colab, hack, and run.</li>
166
+ <li>Reviewers diff one file; merge time dropped from days to hours.</li>
167
+ </ul>
168
+ </section>
169
+
170
+ <section>
171
+ <h2>Beyond Transformers: Ecosystem Reuse</h2>
172
+ <p>Other libraries depend on <code>transformers</code> as a model definition source. For example, <strong>TRL</strong> uses models from the Hub directly:</p>
173
+
174
+ <pre><code class="language-python" data-trim data-noescape>
175
+ from datasets import load_dataset
176
+ from transformers import AutoModelForCausalLM, AutoTokenizer
177
+ from trl import DPOConfig, DPOTrainer
178
+
179
+ model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
180
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
181
+ dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
182
+ training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
183
+ trainer = DPOTrainer(
184
+ model=model,
185
+ args=training_args,
186
+ train_dataset=dataset,
187
+ processing_class=tokenizer
188
+ )
189
+ trainer.train()
190
+ </code></pre>
191
+
192
+ <p class="fragment">No hacks, no refactoring — just <code>from_pretrained()</code>. Thanks to PyTorch autodiff and robust model definitions.</p>
193
+ </section>
194
+
195
+
196
+ <!-- 8 · Paradigms come at a cost -->
197
+ <section>
198
+ <h2>Paradigms come at a cost</h2>
199
+ <ul>
+ <li>The library took off, and the scientific and engineering ML communities benefited from it.</li>
+ <li>PyTorch adoption grew at the same time.</li>
+ <li>The Hugging Face Hub became the reference platform for AI apps.</li>
+ <li>In <code>transformers</code>, <strong>maintenance</strong> becomes an issue: we repeat a lot of code on purpose!</li>
+ <li class="fragment">...but Python is never far :)</li>
+ </ul>
207
+ </section>
208
+
209
+ <!-- 8 · Back to Python: Mary Shelley Mode -->
210
+ <section>
211
+ <h2>Back to Python: Modular “Mary Shelley” Mode</h2>
212
+ <p>Compose new blocks via subclass &amp; override.</p>
213
+ <pre><code class="language-python" data-trim>
214
+ class GlmMLP(Phi3MLP):
215
+ pass
216
+
217
+ class GlmAttention(LlamaAttention):
218
+ def __init__(self, config, layer_idx=None):
219
+ super().__init__(config, layer_idx)
220
+ self.o_proj = nn.Linear(config.num_attention_heads * self.head_dim,
221
+ config.hidden_size, bias=False)
222
+
223
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
224
+ # Slightly different RoPE
+ ...
225
+
226
+
227
+ class GlmForCausalLM(LlamaForCausalLM):
228
+ pass
229
+ </code></pre>
230
+ <p>AST expands → full modeling file, still hackable.</p>
231
+ </section>
232
+
233
+ <section>
234
+ <h2>Back to Python: It's alive!</h2>
235
+ <p>All of the code becomes a runnable, self-contained model definition.</p>
236
+ <pre><code class="language-python" data-trim>
237
+
238
+ class GlmMLP(nn.Module):
239
+ def __init__(self, config):
240
+ super().__init__()
241
+
242
+ self.config = config
243
+ self.gate_up_proj = nn.Linear(config.hidden_size, 2 * config.intermediate_size, bias=False)
244
+ self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)
245
+ self.activation_fn = ACT2FN[config.hidden_act]
246
+
247
+ def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
248
+ up_states = self.gate_up_proj(hidden_states)
249
+
250
+ gate, up_states = up_states.chunk(2, dim=-1)
251
+ up_states = up_states * self.activation_fn(gate)
252
+
253
+ return self.down_proj(up_states)
254
+
255
+
256
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
257
+ """
258
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
259
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
260
+ """
261
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
262
+ if n_rep == 1:
263
+ return hidden_states
264
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
265
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
266
+
267
+
268
+ def eager_attention_forward(
269
+ module: nn.Module,
270
+ query: torch.Tensor,
271
+ key: torch.Tensor,
272
+ value: torch.Tensor,
273
+ attention_mask: Optional[torch.Tensor],
274
+ scaling: float,
275
+ dropout: float = 0.0,
276
+ **kwargs,
277
+ ):
278
+ key_states = repeat_kv(key, module.num_key_value_groups)
279
+ value_states = repeat_kv(value, module.num_key_value_groups)
280
+
281
+ attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
282
+ if attention_mask is not None:
283
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
284
+ attn_weights = attn_weights + causal_mask
285
+
286
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
287
+ attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
288
+ attn_output = torch.matmul(attn_weights, value_states)
289
+ attn_output = attn_output.transpose(1, 2).contiguous()
290
+
291
+ return attn_output, attn_weights
292
+
293
+
294
+ def rotate_half(x):
295
+ """Rotates half the hidden dims of the input."""
296
+ x1 = x[..., 0::2]
297
+ x2 = x[..., 1::2]
298
+ return torch.stack((-x2, x1), dim=-1).flatten(-2)
299
+
300
+
301
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
302
+ """Applies Rotary Position Embedding to the query and key tensors.
303
+
304
+ Args:
305
+ q (`torch.Tensor`): The query tensor.
306
+ k (`torch.Tensor`): The key tensor.
307
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
308
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
309
+ position_ids (`torch.Tensor`, *optional*):
310
+ Deprecated and unused.
311
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
312
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
313
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
314
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
315
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
316
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
317
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
318
+ Returns:
319
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
320
+ """
321
+ cos = cos.unsqueeze(unsqueeze_dim)
322
+ sin = sin.unsqueeze(unsqueeze_dim)
323
+
324
+ # Interleave them instead of usual shape
325
+ cos = cos[..., : cos.shape[-1] // 2].repeat_interleave(2, dim=-1)
326
+ sin = sin[..., : sin.shape[-1] // 2].repeat_interleave(2, dim=-1)
327
+
328
+ # Keep half or full tensor for later concatenation
329
+ rotary_dim = cos.shape[-1]
330
+ q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]
331
+ k_rot, k_pass = k[..., :rotary_dim], k[..., rotary_dim:]
332
+
333
+ # Apply rotary embeddings on the first half or full tensor
334
+ q_embed = (q_rot * cos) + (rotate_half(q_rot) * sin)
335
+ k_embed = (k_rot * cos) + (rotate_half(k_rot) * sin)
336
+
337
+ # Concatenate back to full shape
338
+ q_embed = torch.cat([q_embed, q_pass], dim=-1)
339
+ k_embed = torch.cat([k_embed, k_pass], dim=-1)
340
+ return q_embed, k_embed
341
+
342
+
343
+ class GlmAttention(nn.Module):
344
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
345
+
346
+ def __init__(self, config: GlmConfig, layer_idx: Optional[int] = None):
347
+ super().__init__()
348
+ self.config = config
349
+ self.layer_idx = layer_idx
350
+ self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
351
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
352
+ self.scaling = self.head_dim**-0.5
353
+ self.attention_dropout = config.attention_dropout
354
+ self.is_causal = True
355
+
356
+ self.q_proj = nn.Linear(
357
+ config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
358
+ )
359
+ self.k_proj = nn.Linear(
360
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
361
+ )
362
+ self.v_proj = nn.Linear(
363
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
364
+ )
365
+ self.o_proj = nn.Linear(config.num_attention_heads * self.head_dim, config.hidden_size, bias=False)
366
+
367
+ def forward(
368
+ self,
369
+ hidden_states: torch.Tensor,
370
+ position_embeddings: Tuple[torch.Tensor, torch.Tensor],
371
+ attention_mask: Optional[torch.Tensor],
372
+ past_key_value: Optional[Cache] = None,
373
+ cache_position: Optional[torch.LongTensor] = None,
374
+ **kwargs: Unpack[FlashAttentionKwargs],
375
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
376
+ input_shape = hidden_states.shape[:-1]
377
+ hidden_shape = (*input_shape, -1, self.head_dim)
378
+
379
+ query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
380
+ key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
381
+ value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
382
+
383
+ cos, sin = position_embeddings
384
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
385
+
386
+ if past_key_value is not None:
387
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
388
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
389
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
390
+
391
+ attention_interface: Callable = eager_attention_forward
392
+
393
+ if self.config._attn_implementation != "eager":
394
+ if self.config._attn_implementation == "sdpa" and kwargs.get("output_attentions", False):
395
+ logger.warning_once(
396
+ "`torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to "
397
+ 'eager attention. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
398
+ )
399
+ else:
400
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
401
+
402
+ attn_output, attn_weights = attention_interface(
403
+ self,
404
+ query_states,
405
+ key_states,
406
+ value_states,
407
+ attention_mask,
408
+ dropout=0.0 if not self.training else self.attention_dropout,
409
+ scaling=self.scaling,
410
+ **kwargs,
411
+ )
412
+
413
+ attn_output = attn_output.reshape(*input_shape, -1).contiguous()
414
+ attn_output = self.o_proj(attn_output)
415
+ return attn_output, attn_weights
416
+
417
+
418
+ @use_kernel_forward_from_hub("RMSNorm")
419
+ class GlmRMSNorm(nn.Module):
420
+ def __init__(self, hidden_size, eps=1e-6):
421
+ """
422
+ GlmRMSNorm is equivalent to T5LayerNorm
423
+ """
424
+ super().__init__()
425
+ self.weight = nn.Parameter(torch.ones(hidden_size))
426
+ self.variance_epsilon = eps
427
+
428
+ def forward(self, hidden_states):
429
+ input_dtype = hidden_states.dtype
430
+ hidden_states = hidden_states.to(torch.float32)
431
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
432
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
433
+ return self.weight * hidden_states.to(input_dtype)
434
+
435
+ def extra_repr(self):
436
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
437
+
438
+
439
+ class GlmRotaryEmbedding(nn.Module):
440
+ def __init__(self, config: GlmConfig, device=None):
441
+ super().__init__()
442
+ # BC: "rope_type" was originally "type"
443
+ if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
444
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
445
+ else:
446
+ self.rope_type = "default"
447
+ self.max_seq_len_cached = config.max_position_embeddings
448
+ self.original_max_seq_len = config.max_position_embeddings
449
+
450
+ self.config = config
451
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
452
+
453
+ inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
454
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
455
+ self.original_inv_freq = self.inv_freq
456
+ </code></pre>
457
+ <p>We keep hackability while reconnecting with Python's working paradigms.</p>
458
+ </section>
459
+
460
+
461
+ <!-- 9 · Logit Debugger -->
462
+ <section>
463
+ <h2>Logit Debugger: Trust but Verify</h2>
464
+ <ul>
465
+ <li>Hook every <code>nn.Module</code>; dump logits layer‑by‑layer</li>
466
+ <li>Spot ε‑level drifts (LayerNorm, FP16 underflow…)</li>
467
+ <li>JSON traces diffable in CI</li>
468
+ </ul>
+ <img data-src="assets/visual_debugger.png" alt="Visual debugger" />
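+ <p>A minimal sketch of such a hook, assuming a loaded <code>model</code> and tokenized <code>inputs</code> (trace path and summary size are illustrative):</p>
+ <pre><code class="language-python" data-trim data-noescape>
+ import json
+ import torch
+ 
+ trace = {}
+ 
+ def make_hook(name):
+     def hook(module, args, output):
+         out = output[0] if isinstance(output, tuple) else output
+         if torch.is_tensor(out):
+             # Store a small, diffable summary per layer instead of full tensors.
+             trace[name] = [round(v, 6) for v in out.detach().float().flatten()[:8].tolist()]
+     return hook
+ 
+ handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules() if n]
+ with torch.no_grad():
+     model(**inputs)
+ for h in handles:
+     h.remove()
+ 
+ with open("trace.json", "w") as f:
+     json.dump(trace, f, indent=2)  # diff this trace in CI
+ </code></pre>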
471
+ </section>
472
+
473
+ <!-- 10 · DTensor & TP API -->
474
+ <section>
475
+ <h2>DTensor & Tensor‑Parallel API</h2>
476
+ <p>Before, changing to Tensor Parallel meant changing the code.</p>
477
+
478
+ <pre><code class="language-python" data-trim data-noescape>
479
+ from transformers.modeling_utils import PreTrainedModel
480
+ from megatron.model import ColumnParallelLinear, RowParallelLinear
481
+
482
+ class MyTPModel(PreTrainedModel):
483
+ def __init__(self, config):
484
+ super().__init__(config)
485
+ self.q_proj = ColumnParallelLinear(config.hidden_size, config.hidden_size)
486
+ self.k_proj = ColumnParallelLinear(config.hidden_size, config.hidden_size)
487
+ self.v_proj = ColumnParallelLinear(config.hidden_size, config.hidden_size)
488
+ self.o_proj = RowParallelLinear(config.hidden_size, config.hidden_size)
489
+
490
+ </code></pre>
491
+ </section>
492
+
493
+ <!-- 11 · Zero‑Config Parallelism -->
494
+ <section>
495
+ <h2>Zero‑Config Tensor Parallelism</h2>
496
+ <p>The <code>tp_plan</code> JSON keeps model code pristine and declarative.</p>
497
+ <pre><code class="language-json" data-trim data-noescape>{
498
+ "layer.*.self_attn.q_proj": "colwise",
499
+ "layer.*.self_attn.k_proj": "colwise",
500
+ "layer.*.self_attn.v_proj": "colwise",
501
+ "layer.*.self_attn.o_proj": "rowwise"
502
+ }</code></pre>
503
+ <p class="fragment">Translated to</p>
504
+
505
+ <pre><code class="language-python" data-trim data-noescape>
506
+ def translate_to_torch_parallel_style(style: str):
507
+ if style == "colwise":
508
+ return ColwiseParallel()
509
+ elif style == "rowwise":
510
+ return RowwiseParallel()
511
+ # …
512
+ </code></pre>
513
+ <p class="fragment">One JSON → 100 B param model on 8 GPUs. Change the plan, not the code.</p>
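+ <p class="fragment">A minimal sketch, assuming a recent <code>transformers</code> and a <code>torchrun</code> launch (the checkpoint is illustrative):</p>
+ <pre><code class="language-python" data-trim data-noescape>
+ # torchrun --nproc-per-node 8 infer.py
+ import os, torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ 
+ model_id = "meta-llama/Llama-3.1-8B-Instruct"
+ model = AutoModelForCausalLM.from_pretrained(model_id, tp_plan="auto", torch_dtype="auto")
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ 
+ device = torch.device(f"cuda:{int(os.environ['RANK'])}")  # torchrun sets RANK per process
+ inputs = tokenizer("Tensor parallelism without touching the model code:", return_tensors="pt").to(device)
+ out = model.generate(**inputs, max_new_tokens=20)
+ print(tokenizer.decode(out[0], skip_special_tokens=True))
+ </code></pre>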
514
+ </section>
515
+
516
+ <!-- 12 · Cache Allocator -->
517
+ <section>
518
+ <h2>Improvements: Load Faster &amp; Stronger with the Cache Allocator</h2>
+ <p>Zero-copy weight sharding, a single CUDA malloc</p>
+ <p>Faster model loads, even for a 100 B model split across 50 shards (battle-tested while we were sprinting on Llama 4!)</p>
521
+ <img data-src="assets/fastload.png" alt="SurprisedLewis" />
522
+ </section>
523
+
524
+ <!-- 15 · Why Python wins -->
525
+ <section>
526
+ <h2>Why Python Wins</h2>
527
+ <ul>
528
+ <li>Low entry barrier (although hard to master)</li>
529
+ <li>High‑level semantics express low‑level intent</li>
530
+ <li>Seamless C++/Rust extension points</li>
531
+ </ul>
532
+ </section>
533
+
534
+ <!-- 16 · Where Python can bite -->
535
+ <section>
536
+ <h2>Where Python can bite 🐍</h2>
537
+ <ul>
538
+ <li>Interpreter overhead on microkernels (token‑by‑token decode)</li>
539
+ <li>GIL can throttle async host‑side work</li>
540
+ <li>Easy to under‑optimise code fresh out of the lab</li>
541
+ </ul>
542
+ <p class="fragment">All of these can be mitigated: Triton, compiled custom ops, compile‑time fallback, <strong>custom kernels</strong></p>
543
+ </section>
544
+
545
+
546
+ <!-- 17 · Kernel Hub -->
547
+ <section>
548
+ <h2>Kernel Hub: Optimised Ops from the Community</h2>
549
+ <p>Kernel Hub lets any Python program <em>download and hot‑load</em> compiled CUDA/C++ kernels directly from the Hugging Face Hub at runtime.</p>
550
+ <ul>
551
+ <li><strong>Portable</strong> – kernels work from arbitrary paths outside <code>PYTHONPATH</code>.</li>
552
+ <li><strong>Unique</strong> – load multiple versions of the same op side‑by‑side in one process.</li>
553
+ <li><strong>Compatible</strong> – every kernel targets all recent PyTorch wheels (CUDA, ROCm, CPU) and C‑library ABIs.</li>
554
+ </ul>
555
+ <pre><code class="language-python" data-trim data-noescape>
556
+ import torch
557
+ from kernels import get_kernel
558
+
559
+ # Download optimised kernels from the Hugging Face Hub
560
+ activation = get_kernel("kernels-community/activation")
561
+
562
+ x = torch.randn(10, 10, dtype=torch.float16, device="cuda")
563
+ y = torch.empty_like(x)
564
+ activation.gelu_fast(y, x)
565
+ print(y)
566
+ </code></pre>
567
+ <p class="fragment">Same Transformer code — now with a <strong>3× faster</strong> GELU on A100s.</p>
568
+ </section>
569
+
570
+
571
+ <!-- 18 · API design lessons -->
572
+ <section>
573
+ <h2>API Design Lessons</h2>
574
+ <ul>
575
+ <li>Make easy things obvious, hard things possible</li>
576
+ <li>Paper‑to‑repo diff should be minimal</li>
577
+ <li>Going from research repo to stable architecture should be as fast as possible</li>
578
+
579
+ <li>Hide sharding, expose intent</li>
580
+ </ul>
581
+ <p>We tune radios without building RF amplifiers — ML should feel the same.</p>
582
+ <p class="fragment">..while enabling people who build the amplifiers.</p>
583
+
584
+ </section>
585
+
586
+
587
+ <!-- 14 · Rise of Multimodality -->
588
+ <section>
589
+ <h2>Rise of Multimodality</h2>
590
+ <pre><code class="language-python" data-trim data-noescape>
591
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen3-8B")
592
+ model = AutoModelForConditionalGeneration.from_pretrained("Qwen/Qwen3-8B")
593
+ </code></pre>
594
+ <p>Same API across text · vision · audio</p>
595
+ <p>More and more models arrive, each with its own processing: the API needs to stay uniform.</p>
596
+
597
+ </section>
598
+
599
+ <section>
600
+ <h2>Rise of Multimodality: torch-powered processing</h2>
601
+ <p>Torch and torchvision ops have replaced the NumPy + PIL defaults in transformers.</p>
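+ <p>A minimal sketch, assuming a recent <code>transformers</code> with torchvision installed (the checkpoint and image are illustrative):</p>
+ <pre><code class="language-python" data-trim data-noescape>
+ from PIL import Image
+ from transformers import AutoImageProcessor
+ 
+ # use_fast=True selects the torchvision-backed "fast" processor when the model has one.
+ processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)
+ batch = processor(images=Image.open("cat.png"), return_tensors="pt")
+ print(batch["pixel_values"].shape)
+ </code></pre>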
602
+
603
+ <img data-src="assets/normalize_time_torch.webp" width="80%" height="600" alt="Fast load" />
604
+
605
+ </section>
606
+ <!-- 19 · Model Growth by Modality -->
607
+ <section>
608
+ <h2>Model Growth by Modality</h2>
609
+ <iframe src="assets/model_growth.html" width="80%" height="600" style="border:none;"></iframe>
610
+ </section>
611
+
612
+ <!-- 20 · Takeaways -->
613
+ <section>
614
+ <h2>Takeaways &amp; The Future</h2>
615
+ <ul>
616
+ <li style="display: flex; align-items: center; gap: 1rem;">
617
+ <img src="assets/torchlogo.png" alt="PyTorch" style="height: 2rem;" />
618
+ PyTorch &amp; <code>transformers</code> grow symbiotically
619
+ <img src="assets/head_logo.svg" alt="Transformers" style="height: 2rem;" />
620
+ </li>
621
+ <li>Pythonicity × pragmatism drive adoption</li>
622
+ <li>Open‑source models are shipping faster &amp; bigger than ever</li>
623
+ <li class="fragment"> Let's go!</li>
624
+
625
+ </ul>
626
+ <p>
627
+ <a href="https://huggingface.co/transformers/contribute" target="_blank">
628
+ hf.co/transformers/contribute
629
+ </a>
630
+ </p>
631
+ </section>
632
+
633
+
634
+ </div>
635
+ </div>
636
+
637
+ <!-- Reveal.js core -->
638
+ <script src="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reveal.js"></script>
639
+ <script src="https://cdn.jsdelivr.net/npm/reveal.js@5/plugin/highlight/highlight.js"></script>
640
+ <script src="https://cdn.jsdelivr.net/npm/reveal.js@5/plugin/notes/notes.js"></script>
641
+ <!-- Plotly for interactive charts -->
642
+ <script src="https://cdn.plot.ly/plotly-2.31.1.min.js"></script>
643
+ <script>
644
+ Reveal.initialize({
645
+ hash: true,
646
+ slideNumber: true,
647
+ transition: 'slide',
648
+ backgroundTransition: 'convex',
649
+ plugins: [ RevealHighlight, RevealNotes ]
650
+ });
651
+ </script>
652
+ </body>
653
  </html>