<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>PyTorch × Transformers Journey</title>

  <!-- Google Fonts -->
  <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;800&family=Fira+Code:wght@400;600&display=swap" rel="stylesheet" />

  <!-- Reveal.js core & dark theme base -->
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reset.css" />
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reveal.css" />
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/theme/black.css" id="theme" />

  <!-- Highlight.js -->
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/styles/github-dark.min.css" />

  <!-- Animations -->
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/animate.css@4/animate.min.css" />

  <style>
    :root {
      --accent-primary: #ee4c2c; /* PyTorch orange‑red */
      --accent-secondary: #ffb347; /* lighter highlight */
      --bg-gradient-start: #1b1b1b;
      --bg-gradient-end: #242424;
    }
    html, body { font-family: 'Inter', sans-serif; }
    .reveal .slides {
      background: linear-gradient(135deg, var(--bg-gradient-start), var(--bg-gradient-end));
    }
    .reveal h1, .reveal h2, .reveal h3 { color: var(--accent-primary); font-weight: 800; letter-spacing: -0.5px; }
    .reveal pre code { font-family: 'Fira Code', monospace; font-size: 0.75em; }
    .reveal section img, .reveal section svg { border-radius: 1rem; box-shadow: 0 8px 22px rgba(0,0,0,0.4); }
    .fragment.highlight-current-blue.visible { color: var(--accent-secondary) !important; }
    /* slide-density patch */
    .reveal h1 { font-size: 2.6rem; line-height: 1.1; }
    .reveal h2 { font-size: 1.9rem; line-height: 1.15; }
    .reveal h3 { font-size: 1.4rem; line-height: 1.2; }
    .reveal p, .reveal li { font-size: 0.9rem; line-height: 1.35; }
    .reveal pre code { font-size: 0.67em; }
    @media (max-width: 1024px) { .reveal h1{font-size:2.2rem;} .reveal h2{font-size:1.6rem;} }
    .reveal table td, .reveal table th { font-size: 0.85rem; padding: 4px 8px; }
  </style>
</head>
<body>
  <div class="reveal">
    <div class="slides">

      <!-- 1 · Opening -->
      <section data-auto-animate>
        <h1 class="animate__animated animate__fadeInDown">PyTorch × Transformers Journey</h1>
        <h3 class="animate__animated animate__fadeInDown animate__delay-1s">Pythonicity, Autodiff &amp; Modularity in Modern AI</h3>
        <p class="animate__animated animate__fadeInUp animate__delay-2s">Pablo Montalvo‑Leroux · ML Engineer @ Hugging Face</p>
      </section>

      <!-- 2 · 2016: Backprop & Birth Pangs -->
      <section>
        <h2>2016‑2018: Backprop &amp; Birth Pangs</h2>
        <ul>
          <li>Hand‑crafted chain rule; frameworks such as Theano and CNTK appeared, then vanished.</li>
          <li>MLPs → RNNs → LSTMs — until <strong>BERT</strong> detonated the field in 2018.</li>
          <li class="fragment">Reproducibility was painful ✗ — until Transformers met PyTorch ✓.</li>
        </ul>
      </section>

      <!-- 3 · Static vs Dynamic Graphs -->
      <section>
        <h2>Static vs Dynamic Graphs</h2>
        <p class="fragment">Static graphs require you to compile, wait, and cross fingers the bug reproduces.</p>
        <p class="fragment">Dynamic graphs mean you can drop <code>pdb.set_trace()</code> anywhere and continue iterating.</p>
        <p class="fragment"><code>torch.compile</code> gives the best of both worlds: write dynamically, ship something ahead‑of‑time optimised.</p>
      </section>

      <!-- 4 · Dynamic Graphs Enabled Contribution -->
      <section>
        <h2>Dynamic Graphs Enabled Contribution</h2>
        <ul>
          <li>Developers debug at line‑rate — no cold‑start recompiles.</li>
          <li>Pull‑requests remained reproducible overnight, which accelerated trust.</li>
          <li>Static‑graph alternatives stalled and the community consolidated around PyTorch.</li>
        </ul>
      </section>

      <!-- 5 · Paper Tonight → Tweak Tomorrow -->
      <section>
        <h2>Clone the Paper Tonight → Tweak Tomorrow</h2>
        <p>Research cadence is measured in <strong>hours</strong>; any friction kills momentum.</p>
        <ul>
          <li class="fragment">2018: BERT fine‑tuning required printing tensors live rather than recompiling graphs.</li>
          <li class="fragment">Community PRs merged overnight — credibility snowballed for both PyTorch and Transformers.</li>
        </ul>
      </section>

      <!-- 6 · One Model · One File -->
      <section>
        <h2>“One Model · One File” — Why it Matters</h2>
        <pre><code class="language-python" data-trim data-noescape>
# modeling_bert.py  — single source of truth 🗄️
class BertConfig(PretrainedConfig):
    ...

class BertSelfAttention(nn.Module):
    ...

class BertLayer(nn.Module):
    ...

class BertModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = nn.ModuleList(
            [BertLayer(config) for _ in range(config.num_hidden_layers)]
        )
        self.init_weights()
        </code></pre>
        <ul>
          <li>All layers, forward pass, and <code>from_pretrained()</code> logic live together.</li>
          <li>No cross‑file inheritance maze — copy to Colab, hack, and run (usage sketched below).</li>
          <li>Reviewers diff one file; merge time dropped from days to hours.</li>
        </ul>
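        <p class="fragment">A minimal usage sketch (standard Transformers API):</p>
        <pre><code class="language-python" data-trim data-noescape>
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
outputs = model(**tokenizer("One model, one file.", return_tensors="pt"))
        </code></pre>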
      </section>

      <!-- 7 · Transformers Grew With Python -->
      <section>
        <h2>Transformers Grew with Python</h2>
        <ul>
          <li>The library prioritises hackability, which in turn accelerates adoption.</li>
          <li>Python is slow by default, so we lean on compiled CUDA kernels and Triton for raw speed.</li>
          <li>The new <strong>Kernel Hub</strong> means Transformers automatically uses a faster op the moment it is published — no application changes required.</li>
        </ul>
      </section>

      <!-- 8 · Back to Python: Mary Shelley Mode -->
      <section>
        <h2>Back to Python: Modular “Mary Shelley” Mode</h2>
        <p>Compose new blocks via subclassing and selective override.</p>
        <pre><code class="language-python" data-trim data-noescape>
class LlamaRotaryLoRA(LlamaAttention):
    def __init__(self, config, layer_idx):
        super().__init__(config, layer_idx)
        self.q_proj = LoRA(self.q_proj)  # swap in LoRA (illustrative wrapper)
        self.apply_rotary()              # keep RoPE
        </code></pre>
      </section>

      <!-- 9 · Logit Debugger -->
      <section>
        <h2>Logit Debugger: Trust but Verify</h2>
        <ul>
          <li>Attach a hook to every <code>nn.Module</code>; dump logits layer‑by‑layer (sketched below).</li>
          <li>Spot ε‑level drifts — LayerNorm precision, FP16 underflow, etc.</li>
          <li>JSON traces are diffable in CI, so regressions stay caught.</li>
        </ul>
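        <p class="fragment">A minimal sketch of the idea (hypothetical helper, not the actual debugging utility; <code>model</code> and <code>inputs</code> are assumed to exist):</p>
        <pre><code class="language-python" data-trim data-noescape>
import json
import torch

def attach_trace_hooks(model, trace):
    # Record per-module output statistics so two runs can be diffed in CI
    def make_hook(name):
        def hook(module, args, output):
            if isinstance(output, torch.Tensor):
                trace[name] = {"mean": output.float().mean().item(),
                               "std": output.float().std().item()}
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

trace = {}
attach_trace_hooks(model, trace)
model(**inputs)                      # one forward pass fills the trace
print(json.dumps(trace, indent=2))   # layer-by-layer, diffable dump
        </code></pre>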
      </section>

      <!-- 10 · DTensor & TP API -->
      <section>
        <h2>DTensor & Tensor‑Parallel API</h2>
        <ul>
          <li>Logical tensor views unlock device‑mesh sharding (see the sketch below).</li>
          <li>The <code>tp_plan</code> JSON keeps model code pristine and declarative.</li>
          <li>We regularly validate 100‑billion‑parameter checkpoints inside HF test infra.</li>
        </ul>
        <img data-src="assets/mesh.svg" alt="Device mesh" />
      </section>

      <!-- 11 · Zero‑Config Parallelism -->
      <section>
        <h2>Zero‑Config Parallelism</h2>
        <pre><code class="language-json" data-trim data-noescape>{
  "layer.*.self_attn.q_proj": "colwise",
  "layer.*.self_attn.k_proj": "colwise",
  "layer.*.self_attn.v_proj": "colwise",
  "layer.*.self_attn.o_proj": "rowwise"
}</code></pre>
        <pre><code class="language-python" data-trim data-noescape>
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

def translate_to_torch_parallel_style(style: str):
    # Map a tp_plan entry onto the matching torch.distributed parallel style
    if style == "colwise":
        return ColwiseParallel()
    elif style == "rowwise":
        return RowwiseParallel()
    raise ValueError(f"Unsupported tensor-parallel style: {style}")
        </code></pre>
        <p class="fragment">One JSON file loads a 17‑billion‑parameter Llama‑4 on 8 GPUs; tweak the plan, not the network.</p>
      </section>

      <!-- 12 · Cache Allocator -->
      <section>
        <h2>Load Faster &amp; Stronger: Cache Allocator</h2>
        <p>Zero‑copy weight sharding shaves <strong>15 %</strong> VRAM on A100 while cutting load time below 60 s for a 100‑B model.</p>
        <img data-src="assets/memory_bars.svg" alt="Memory bars" />
      </section>

      <!-- 13 · Modular Transformers: GLM Example -->
      <section>
        <h2>Modular Transformers: GLM by Example</h2>
        <pre><code class="language-python" data-trim>
class GlmMLP(Phi3MLP):
    pass

class GlmAttention(LlamaAttention):
    def __init__(self, config, layer_idx=None):
        super().__init__(config, layer_idx)
        self.o_proj = nn.Linear(
            config.num_attention_heads * self.head_dim,
            config.hidden_size,
            bias=False,
        )

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    # Slightly different RoPE
    ...

class GlmForCausalLM(LlamaForCausalLM):
    pass
        </code></pre>
        <p>AST magic expands this 40‑line prototype into a full modelling file, ready for training.</p>
      </section>

      <!-- 14 · Rise of Multimodality -->
      <section>
        <h2>Rise of Multimodality</h2>
        <pre><code class="language-python" data-trim data-noescape>
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
        </code></pre>
        <p class="fragment">Same API across text, vision, and audio: learn once, apply everywhere.</p>
      </section>

      <!-- 15 · Why Python wins -->
      <section>
        <h2>Why Python Wins</h2>
        <ul>
          <li>Low entry barrier attracts newcomers and domain specialists alike.</li>
          <li>High‑level semantics concisely express low‑level intent.</li>
          <li>The C++/Rust back‑end remains accessible for critical paths.</li>
        </ul>
      </section>

      <!-- 16 · Where Python can bite -->
      <section>
        <h2>Where Python can bite 🐍</h2>
        <ul>
          <li class="fragment">Interpreter overhead hurts microkernels (token‑by‑token decoding).</li>
          <li class="fragment">The GIL throttles concurrent host‑side work.</li>
          <li class="fragment">Fresh research code is easy to leave unoptimised.</li>
        </ul>
        <p class="fragment">Mitigations: Triton, compiled custom ops, compile‑time fallbacks, and callable kernels.</p>
      </section>

      <!-- 17 · Kernel Hub -->
      <section>
        <h2>Kernel Hub: Optimised Ops from the Community</h2>
        <p>Kernel Hub lets any Python program <em>download and hot‑load</em> compiled CUDA/C++ kernels directly from the Hugging Face Hub at runtime.</p>
        <ul>
          <li><strong>Portable</strong> – kernels work from arbitrary paths outside <code>PYTHONPATH</code>.</li>
          <li><strong>Unique</strong> – load multiple versions of the same op side‑by‑side in one process.</li>
          <li><strong>Compatible</strong> – every kernel targets all recent PyTorch wheels (CUDA, ROCm, CPU) and C‑library ABIs.</li>
        </ul>
        <p class="fragment">🚀 <strong>Quick start</strong> (requires <code>torch >= 2.5</code>):</p>
        <pre><code class="language-bash" data-trim>pip install kernels</code></pre>
        <pre><code class="language-python" data-trim data-noescape>
import torch
from kernels import get_kernel

# Download optimised kernels from the Hugging Face Hub
activation = get_kernel("kernels-community/activation")

x = torch.randn(10, 10, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
print(y)
        </code></pre>
        <p class="fragment">Same Transformer code — now with a <strong>3× faster</strong> GELU on A100s.</p>
      </section>

      <!-- 18 · API design lessons -->
      <section>
        <h2>API Design Lessons</h2>
        <ul>
          <li>Make easy things obvious, and hard things merely possible.</li>
          <li>Keep the paper‑to‑repository delta minimal for new models.</li>
          <li>Hide sharding mechanics; expose developer intent.</li>
        </ul>
        <p class="fragment">We tune radios without learning RF theory — ML frameworks should feel as frictionless.</p>
      </section>

      <!-- 19 · Model Growth by Modality -->
      <section>
        <h2>Model Growth by Modality</h2>
        <iframe src="model_growth.html" width="100%" height="600" style="border:none;"></iframe>
      </section>

      <!-- 20 · Takeaways -->
      <section>
        <h2>Takeaways &amp; The Future</h2>
        <ul>
          <li>PyTorch and <code>transformers</code> have grown symbiotically for eight years—expect the spiral to continue.</li>
          <li>Pythonicity plus pragmatism keeps the barrier to innovation low.</li>
          <li>Open‑source models are shipping faster, larger, and more multimodal than ever.</li>
        </ul>
        <p><a href="https://huggingface.co/transformers/contribute" target="_blank">hf.co/transformers/contribute</a></p>
      </section>

    </div>
  </div>

  <!-- Reveal.js core -->
  <script src="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reveal.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/reveal.js@5/plugin/highlight/highlight.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/reveal.js@5/plugin/notes/notes.js"></script>
  <!-- Plotly for interactive charts -->
  <script src="https://cdn.plot.ly/plotly-2.31.1.min.js"></script>
  <script>
    Reveal.initialize({
      hash: true,
      slideNumber: true,
      transition: 'slide',
      backgroundTransition: 'convex',
      plugins: [ RevealHighlight, RevealNotes ]
    });
  </script>
</body>
</html>