<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>PyTorch × Transformers Journey</title>
<!-- Google Fonts -->
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;800&family=Fira+Code:wght@400;600&display=swap" rel="stylesheet" />
<!-- Reveal.js core & dark theme base -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reset.css" />
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reveal.css" />
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/theme/black.css" id="theme" />
<!-- Highlight.js -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/styles/github-dark.min.css" />
<!-- Animations -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/animate.css@4/animate.min.css" />
<style>
:root {
--accent-primary: #ee4c2c; /* PyTorch orange‑red */
--accent-secondary: #ffb347; /* lighter highlight */
--bg-gradient-start: #1b1b1b;
--bg-gradient-end: #242424;
}
html, body { font-family: 'Inter', sans-serif; }
.reveal .slides {
background: linear-gradient(135deg, var(--bg-gradient-start), var(--bg-gradient-end));
}
.reveal h1, .reveal h2, .reveal h3 { color: var(--accent-primary); font-weight: 800; letter-spacing: -0.5px; }
.reveal pre code { font-family: 'Fira Code', monospace; font-size: 0.75em; }
.reveal section img, .reveal section svg { border-radius: 1rem; box-shadow: 0 8px 22px rgba(0,0,0,0.4); }
.fragment.highlight-current-blue.visible { color: var(--accent-secondary) !important; }
/* slide-density patch */
.reveal h1 { font-size: 2.6rem; line-height: 1.1; }
.reveal h2 { font-size: 1.9rem; line-height: 1.15; }
.reveal h3 { font-size: 1.4rem; line-height: 1.2; }
.reveal p, .reveal li { font-size: 0.9rem; line-height: 1.35; }
.reveal pre code { font-size: 0.67em; }
@media (max-width: 1024px) { .reveal h1{font-size:2.2rem;} .reveal h2{font-size:1.6rem;} }
.reveal table td, .reveal table th { font-size: 0.85rem; padding: 4px 8px; }
</style>
</head>
<body>
<div class="reveal">
<div class="slides">
<!-- 1 · Opening -->
<section data-auto-animate>
<h1 class="animate__animated animate__fadeInDown">PyTorch × Transformers Journey</h1>
<h3 class="animate__animated animate__fadeInDown animate__delay-1s">Pythonicity, Autodiff & Modularity in Modern AI</h3>
<p class="animate__animated animate__fadeInUp animate__delay-2s">Pablo Montalvo‑Leroux · ML Engineer @ Hugging Face</p>
</section>
<!-- 2 · 2016: Backprop & Birth Pangs -->
<section>
<h2>2016‑2018: Backprop & Birth Pangs</h2>
<ul>
<li>Gradients were hand‑derived via the chain rule; frameworks such as Theano and CNTK appeared, then vanished.</li>
<li>MLPs → RNNs → LSTMs — until <strong>BERT</strong> detonated the field in 2018.</li>
<li class="fragment">Reproducibility was painful ✗ — until Transformers met PyTorch ✓.</li>
</ul>
</section>
<!-- 3 · Static vs Dynamic Graphs -->
<section>
<h2>Static vs Dynamic Graphs</h2>
<p class="fragment">Static graphs require you to compile, wait, and cross fingers the bug reproduces.</p>
<p class="fragment">Dynamic graphs mean you can drop <code>pdb.set_trace()</code> anywhere and continue iterating.</p>
<p class="fragment"><code>torch.compile</code> gives the best of both worlds: write dynamically, ship something ahead‑of‑time optimised.</p>
</section>
<!-- 4 · Dynamic Graphs Enabled Contribution -->
<section>
<h2>Dynamic Graphs Enabled Contribution</h2>
<ul>
<li>Developers debug at line‑rate — no cold‑start recompiles.</li>
<li>Pull requests remained reproducible overnight, which built trust quickly.</li>
<li>Static‑graph alternatives stalled and the community consolidated around PyTorch.</li>
</ul>
</section>
<!-- 5 · Paper Tonight → Tweak Tomorrow -->
<section>
<h2>Clone the Paper Tonight → Tweak Tomorrow</h2>
<p>Research cadence is measured in <strong>hours</strong>; any friction kills momentum.</p>
<ul>
<li class="fragment">2018: BERT fine‑tuning required printing tensors live rather than recompiling graphs.</li>
<li class="fragment">Community PRs merged overnight — credibility snowballed for both PyTorch and Transformers.</li>
</ul>
</section>
<!-- 6 · One Model · One File -->
<section>
<h2>“One Model · One File” — Why it Matters</h2>
<pre><code class="language-python" data-trim data-noescape>
# modeling_bert.py — single source of truth 🗄️
class BertConfig(PretrainedConfig):
...
class BertSelfAttention(nn.Module):
...
class BertLayer(nn.Module):
...
class BertModel(PreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.embeddings = BertEmbeddings(config)
self.encoder = nn.ModuleList(
[BertLayer(config) for _ in range(config.num_hidden_layers)]
)
self.init_weights()
</code></pre>
<ul>
<li>All layers, forward pass, and <code>from_pretrained()</code> logic live together.</li>
<li>No cross‑file inheritance maze — copy to Colab, hack, and run.</li>
<li>Reviewers diff one file; merge time dropped from days to hours.</li>
</ul>
</section>
<!-- 7 · Transformers Grew With Python -->
<section>
<h2>Transformers Grew with Python</h2>
<ul>
<li>The library prioritises hackability, which in turn accelerates adoption.</li>
<li>Python is slow by default, so we lean on compiled CUDA kernels and Triton for raw speed.</li>
<li>The new <strong>Kernel Hub</strong> means Transformers automatically uses a faster op the moment it is published — no application changes required.</li>
</ul>
</section>
<!-- 8 · Back to Python: Mary Shelley Mode -->
<section>
<h2>Back to Python: Modular “Mary Shelley” Mode</h2>
<p>Compose new blocks via subclassing and selective override.</p>
<pre><code class="language-python" data-trim data-noescape>
class LlamaRotaryLoRA(LlamaAttention):
def __init__(...):
super().__init__(...)
self.q_proj = LoRA(self.q_proj) # swap in LoRA
self.apply_rotary() # keep RoPE
</code></pre>
</section>
<!-- 9 · Logit Debugger -->
<section>
<h2>Logit Debugger: Trust but Verify</h2>
<ul>
<li>Attach a hook to every <code>nn.Module</code>; dump logits layer‑by‑layer (sketch below).</li>
<li>Spot ε‑level drifts — LayerNorm precision, FP16 underflow, etc.</li>
<li>JSON traces are diffable in CI, so regressions are caught before merge.</li>
</ul>
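<p class="fragment">A minimal sketch of the idea with standard forward hooks (the model choice and summary stats are illustrative):</p>
<pre><code class="language-python" data-trim data-noescape>
import json
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
trace = {}

def hook(name):
    def fn(module, args, output):
        if torch.is_tensor(output):  # skip modules returning tuples/ModelOutput
            trace[name] = {"mean": output.float().mean().item(),
                           "std": output.float().std().item()}
    return fn

for name, module in model.named_modules():
    module.register_forward_hook(hook(name))

model(**tok("hello world", return_tensors="pt"))
with open("trace.json", "w") as f:
    json.dump(trace, f, indent=2)  # diffable in CI
</code></pre>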
</section>
<!-- 10 · DTensor & TP API -->
<section>
<h2>DTensor & Tensor‑Parallel API</h2>
<ul>
<li>Logical tensor views unlock device‑mesh sharding (sketch below).</li>
<li>The <code>tp_plan</code> JSON keeps model code pristine and declarative.</li>
<li>We regularly validate 100‑billion‑parameter checkpoints inside HF test infra.</li>
</ul>
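<p class="fragment">Underneath, the plan maps onto PyTorch's DTensor primitives; a hedged sketch (assumes <code>torch >= 2.5</code> and an initialised 8‑GPU process group):</p>
<pre><code class="language-python" data-trim data-noescape>
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

mesh = init_device_mesh("cuda", (8,))  # 8 GPUs as a 1-D mesh
weight = torch.randn(4096, 4096)
# Column-wise sharding: each rank holds a 512-row slice of dim 0
dweight = distribute_tensor(weight, mesh, [Shard(0)])
</code></pre>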
<img data-src="assets/mesh.svg" alt="Device mesh" />
</section>
<!-- 11 · Zero‑Config Parallelism -->
<section>
<h2>Zero‑Config Parallelism</h2>
<pre><code class="language-json" data-trim data-noescape>{
"layer.*.self_attn.q_proj": "colwise",
"layer.*.self_attn.k_proj": "colwise",
"layer.*.self_attn.v_proj": "colwise",
"layer.*.self_attn.o_proj": "rowwise"
}</code></pre>
<pre><code class="language-python" data-trim data-noescape>
def translate_to_torch_parallel_style(style: str):
if style == "colwise":
return ColwiseParallel()
elif style == "rowwise":
return RowwiseParallel()
</code></pre>
<p class="fragment">One JSON file loads a 17‑billion‑parameter Llama‑4 on 8 GPUs; tweak the plan, not the network.</p>
</section>
<!-- 12 · Cache Allocator -->
<section>
<h2>Load Faster & Stronger: Cache Allocator</h2>
<p>Zero‑copy weight sharding shaves <strong>15 %</strong> VRAM on A100 while cutting load time below 60 s for a 100‑B model.</p>
<img data-src="assets/memory_bars.svg" alt="Memory bars" />
</section>
<!-- 13 · Modular Transformers: GLM Example -->
<section>
<h2>Modular Transformers: GLM by Example</h2>
<pre><code class="language-python" data-trim>
class GlmMLP(Phi3MLP):
pass
class GlmAttention(LlamaAttention):
def __init__(self, config, layer_idx=None):
super().__init__(config, layer_idx)
self.o_proj = nn.Linear(
config.num_attention_heads * self.head_dim,
config.hidden_size,
bias=False,
)
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
# Slightly different RoPE
...
class GlmForCausalLM(LlamaForCausalLM):
pass
</code></pre>
<p>AST magic expands this 40‑line prototype into a full modelling file, ready for training.</p>
</section>
<!-- 14 · Rise of Multimodality -->
<section>
<h2>Rise of Multimodality</h2>
<pre><code class="language-python" data-trim data-noescape>
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForConditionalGeneration.from_pretrained("Qwen/Qwen3-8B")
</code></pre>
<p class="fragment">Same API across text, vision, and audio: learn once, apply everywhere.</p>
</section>
<!-- 15 · Why Python wins -->
<section>
<h2>Why Python Wins</h2>
<ul>
<li>Low entry barrier attracts newcomers and domain specialists alike.</li>
<li>High‑level semantics concisely express low‑level intent.</li>
<li>The C++/Rust back‑end remains accessible for critical paths.</li>
</ul>
</section>
<!-- 16 · Where Python can bite -->
<section>
<h2>Where Python can bite 🐍</h2>
<ul>
<li class="fragment">Interpreter overhead hurts microkernels (token‑by‑token decoding).</li>
<li class="fragment">The GIL throttles concurrent host‑side work.</li>
<li class="fragment">Fresh research code is easy to leave unoptimised.</li>
</ul>
<p class="fragment">Mitigations: Triton, compiled custom ops, compile‑time fallbacks, and callable kernels.</p>
</section>
<!-- 17 · Kernel Hub -->
<section>
<h2>Kernel Hub: Optimised Ops from the Community</h2>
<p>Kernel Hub lets any Python program <em>download and hot‑load</em> compiled CUDA/C++ kernels directly from the Hugging Face Hub at runtime.</p>
<ul>
<li><strong>Portable</strong> – kernels work from arbitrary paths outside <code>PYTHONPATH</code>.</li>
<li><strong>Unique</strong> – load multiple versions of the same op side‑by‑side in one process.</li>
<li><strong>Compatible</strong> – every kernel targets all recent PyTorch wheels (CUDA, ROCm, CPU) and C‑library ABIs.</li>
</ul>
<p class="fragment">🚀 <strong>Quick start</strong> (requires <code>torch >= 2.5</code>):</p>
<pre><code class="language-bash" data-trim>pip install kernels</code></pre>
<pre><code class="language-python" data-trim data-noescape>
import torch
from kernels import get_kernel
# Download optimised kernels from the Hugging Face Hub
activation = get_kernel("kernels-community/activation")
x = torch.randn(10, 10, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
print(y)
</code></pre>
<p class="fragment">Same Transformer code — now with a <strong>3× faster</strong> GELU on A100s.</p>
</section>
<!-- 18 · API design lessons -->
<section>
<h2>API Design Lessons</h2>
<ul>
<li>Make easy things obvious, and hard things merely possible.</li>
<li>Keep the paper‑to‑repository delta minimal for new models.</li>
<li>Hide sharding mechanics; expose developer intent.</li>
</ul>
<p class="fragment">We tune radios without learning RF theory — ML frameworks should feel as frictionless.</p>
</section>
<!-- 19 · Model Growth by Modality -->
<section>
<h2>Model Growth by Modality</h2>
<iframe src="model_growth.html" width="100%" height="600" style="border:none;"></iframe>
</section>
<!-- 20 · Takeaways -->
<section>
<h2>Takeaways & The Future</h2>
<ul>
<li>PyTorch and <code>transformers</code> have grown symbiotically for eight years—expect the spiral to continue.</li>
<li>Pythonicity plus pragmatism keeps the barrier to innovation low.</li>
<li>Open‑source models are shipping faster, larger, and more multimodal than ever.</li>
</ul>
<p><a href="https://huggingface.co/transformers/contribute" target="_blank">hf.co/transformers/contribute</a></p>
</section>
</div>
</div>
<!-- Reveal.js core -->
<script src="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reveal.js"></script>
<script src="https://cdn.jsdelivr.net/npm/reveal.js@5/plugin/highlight/highlight.js"></script>
<script src="https://cdn.jsdelivr.net/npm/reveal.js@5/plugin/notes/notes.js"></script>
<!-- Plotly for interactive charts -->
<script src="https://cdn.plot.ly/plotly-2.31.1.min.js"></script>
<script>
Reveal.initialize({
hash: true,
slideNumber: true,
transition: 'slide',
backgroundTransition: 'convex',
plugins: [ RevealHighlight, RevealNotes ]
});
</script>
</body>
</html>