<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>PyTorch × Transformers Journey</title>
<!-- Google Fonts -->
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;800&family=Fira+Code:wght@400;600&display=swap" rel="stylesheet" />
<!-- Reveal.js core & dark theme base -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reset.css" />
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reveal.css" />
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/theme/black.css" id="theme" />
<!-- Highlight.js -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/highlight.js@11/styles/github-dark.min.css" />
<!-- Animations -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/animate.css@4/animate.min.css" />
<style>
:root {
  --accent-primary: #ee4c2c;   /* PyTorch orange‑red */
  --accent-secondary: #ffb347; /* lighter highlight */
  --bg-gradient-start: #1b1b1b;
  --bg-gradient-end: #242424;
}
html, body { font-family: 'Inter', sans-serif; }
.reveal .slides {
  background: linear-gradient(135deg, var(--bg-gradient-start), var(--bg-gradient-end));
}
.reveal h1, .reveal h2, .reveal h3 { color: var(--accent-primary); font-weight: 800; letter-spacing: -0.5px; }
.reveal pre code { font-family: 'Fira Code', monospace; font-size: 0.75em; }
.reveal section img, .reveal section svg { border-radius: 1rem; box-shadow: 0 8px 22px rgba(0,0,0,0.4); }
.fragment.highlight-current-blue.visible { color: var(--accent-secondary); }
/* slide-density patch */
.reveal h1 { font-size: 2.6rem; line-height: 1.1; }
.reveal h2 { font-size: 1.9rem; line-height: 1.15; }
.reveal h3 { font-size: 1.4rem; line-height: 1.2; }
.reveal p, .reveal li { font-size: 0.9rem; line-height: 1.35; }
.reveal pre code { font-size: 0.67em; }
@media (max-width: 1024px) { .reveal h1{font-size:2.2rem;} .reveal h2{font-size:1.6rem;} }
.reveal table td, .reveal table th { font-size: 0.85rem; padding: 4px 8px; }
</style>
</head>
<body>
<div class="reveal">
<div class="slides">
<!-- 1 · Opening -->
<section data-auto-animate>
<h1 class="animate__animated animate__fadeInDown">PyTorch × Transformers Journey</h1>
<h3 class="animate__animated animate__fadeInDown animate__delay-1s">Pythonicity, Autodiff & Modularity in Modern AI</h3>
<p class="animate__animated animate__fadeInUp animate__delay-2s">Pablo Montalvo‑Leroux · ML Engineer @ Hugging Face</p>
</section>
<!-- 2 · 2016-2018: Backprop & Birth Pangs -->
<section>
<h2>2016‑2018: Backprop & Birth Pangs</h2>
<ul>
<li>Hand‑crafted chain‑rule gradients; frameworks such as Theano and CNTK appeared and then vanished.</li>
<li>MLPs → RNNs → LSTMs — until <strong>BERT</strong> detonated the field in 2018.</li>
<li class="fragment">Reproducibility was painful ✗ — until Transformers met PyTorch ✓.</li>
</ul>
</section>
<!-- 3 · Static vs Dynamic Graphs -->
<section>
<h2>Static vs Dynamic Graphs</h2>
<p class="fragment">Static graphs require you to compile, wait, and cross your fingers that the bug reproduces.</p>
<p class="fragment">Dynamic graphs mean you can drop <code>pdb.set_trace()</code> anywhere and continue iterating.</p>
<p class="fragment"><code>torch.compile</code> gives the best of both worlds: write dynamically, ship something ahead‑of‑time optimised.</p>
</section>
<!-- 4 · Dynamic Graphs Enabled Contribution -->
<section>
<h2>Dynamic Graphs Enabled Contribution</h2>
<ul>
<li>Developers debug at line‑rate — no cold‑start recompiles.</li>
<li>Pull‑requests remained reproducible overnight, which accelerated trust.</li>
<li>Static‑graph alternatives stalled and the community consolidated around PyTorch.</li>
</ul>
</section>
<!-- 5 · Paper Tonight → Tweak Tomorrow -->
<section>
<h2>Clone the Paper Tonight → Tweak Tomorrow</h2>
<p>Research cadence is measured in <strong>hours</strong>; any friction kills momentum.</p>
<ul>
<li class="fragment">2018: fine‑tuning BERT meant printing tensors live instead of recompiling a static graph.</li>
<li class="fragment">Community PRs merged overnight — credibility snowballed for both PyTorch and Transformers.</li>
</ul>
</section>
<!-- 6 · One Model · One File -->
<section>
<h2>“One Model · One File” — Why it Matters</h2>
<pre><code class="language-python" data-trim data-noescape>
# modeling_bert.py — single source of truth 🗄️
class BertConfig(PretrainedConfig):
    ...

class BertSelfAttention(nn.Module):
    ...

class BertLayer(nn.Module):
    ...

class BertModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = nn.ModuleList(
            [BertLayer(config) for _ in range(config.num_hidden_layers)]
        )
        self.init_weights()
</code></pre>
<ul>
<li>All layers, the forward pass, and the <code>from_pretrained()</code> logic live together (usage sketch below).</li>
<li>No cross‑file inheritance maze — copy to Colab, hack, and run.</li>
<li>Reviewers diff one file; merge time dropped from days to hours.</li>
</ul>
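<p class="fragment">A quick usage sketch of that file's public surface (a standard public checkpoint assumed here):</p>
<pre><code class="language-python" data-trim data-noescape>
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("One model, one file.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
</code></pre>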
</section>
<!-- 7 · Transformers Grew With Python -->
<section>
<h2>Transformers Grew with Python</h2>
<ul>
<li>The library prioritises hackability, which in turn accelerates adoption.</li>
<li>Python is slow by default, so we lean on compiled CUDA kernels and Triton for raw speed.</li>
<li>The new <strong>Kernel Hub</strong> means Transformers automatically uses a faster op the moment it is published — no application changes required.</li>
</ul>
</section>
<!-- 8 · Back to Python: Mary Shelley Mode -->
<section>
<h2>Back to Python: Modular “Mary Shelley” Mode</h2>
<p>Compose new blocks via subclassing and selective override.</p>
<pre><code class="language-python" data-trim data-noescape>
class LlamaRotaryLoRA(LlamaAttention):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.q_proj = LoRA(self.q_proj)  # swap in LoRA
        self.apply_rotary()              # keep RoPE
</code></pre>
</section>
<!-- 9 · Logit Debugger -->
<section>
<h2>Logit Debugger: Trust but Verify</h2>
<ul>
<li>Attach a hook to every <code>nn.Module</code>; dump logits layer‑by‑layer (sketch below).</li>
<li>Spot ε‑level drifts — LayerNorm precision, FP16 underflow, etc.</li>
<li>JSON traces are diffable in CI, so regressions are caught automatically.</li>
</ul>
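<p class="fragment">A minimal sketch of the idea with plain PyTorch forward hooks (helper name is illustrative, not the actual debugger API):</p>
<pre><code class="language-python" data-trim data-noescape>
import json

import torch
import torch.nn as nn

def attach_trace_hooks(model: nn.Module, trace: dict):
    """Record a summary statistic of every leaf module's output."""
    for name, module in model.named_modules():
        if not list(module.children()):  # leaf modules only
            def hook(mod, args, output, name=name):
                if isinstance(output, torch.Tensor):
                    trace[name] = output.detach().float().abs().mean().item()
            module.register_forward_hook(hook)

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16), nn.Linear(16, 4))
trace = {}
attach_trace_hooks(model, trace)
model(torch.randn(2, 16))
print(json.dumps(trace, indent=2))  # diff this against a reference trace in CI
</code></pre>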
</section>
<!-- 10 · DTensor & TP API -->
<section>
<h2>DTensor & Tensor‑Parallel API</h2>
<ul>
<li>Logical tensor views unlock device‑mesh sharding.</li>
<li>The <code>tp_plan</code> JSON keeps model code pristine and declarative.</li>
<li>We regularly validate 100‑billion‑parameter checkpoints inside HF test infra.</li>
</ul>
<img data-src="assets/mesh.svg" alt="Device mesh" />
</section>
<!-- 11 · Zero‑Config Parallelism -->
<section>
<h2>Zero‑Config Parallelism</h2>
<pre><code class="language-json" data-trim data-noescape>{
  "layer.*.self_attn.q_proj": "colwise",
  "layer.*.self_attn.k_proj": "colwise",
  "layer.*.self_attn.v_proj": "colwise",
  "layer.*.self_attn.o_proj": "rowwise"
}</code></pre>
<pre><code class="language-python" data-trim data-noescape>
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

def translate_to_torch_parallel_style(style: str):
    if style == "colwise":
        return ColwiseParallel()
    elif style == "rowwise":
        return RowwiseParallel()
    raise ValueError(f"Unsupported tensor-parallel style: {style}")
</code></pre>
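<p class="fragment">On the user side, the plan is applied at load time. A sketch (the checkpoint id is only a stand-in; <code>tp_plan="auto"</code> picks the plan shipped with the model):</p>
<pre><code class="language-python" data-trim data-noescape>
# launched as: torchrun --nproc-per-node 8 tp_demo.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, tp_plan="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Tensor parallelism, one argument:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
</code></pre>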
<p class="fragment">One JSON file loads a 17‑billion‑parameter Llama‑4 on 8 GPUs; tweak the plan, not the network.</p> | |
</section> | |
<!-- 12 · Cache Allocator --> | |
<section> | |
<h2>Load Faster & Stronger: Cache Allocator</h2> | |
<p>Zero‑copy weight sharding shaves <strong>15 %</strong> VRAM on A100 while cutting load time below 60 s for a 100‑B model.</p> | |
<img data-src="assets/memory_bars.svg" alt="Memory bars" /> | |
</section> | |
<!-- 13 · Modular Transformers: GLM Example -->
<section>
<h2>Modular Transformers: GLM by Example</h2>
<pre><code class="language-python" data-trim>
class GlmMLP(Phi3MLP):
    pass

class GlmAttention(LlamaAttention):
    def __init__(self, config, layer_idx=None):
        super().__init__(config, layer_idx)
        self.o_proj = nn.Linear(
            config.num_attention_heads * self.head_dim,
            config.hidden_size,
            bias=False,
        )

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    # Slightly different RoPE
    ...

class GlmForCausalLM(LlamaForCausalLM):
    pass
</code></pre>
<p>AST magic expands this 40‑line prototype into a full modelling file, ready for training.</p>
</section>
<!-- 14 · Rise of Multimodality -->
<section>
<h2>Rise of Multimodality</h2>
<pre><code class="language-python" data-trim data-noescape>
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForConditionalGeneration.from_pretrained("Qwen/Qwen3-8B")
</code></pre>
<p class="fragment">Same API across text, vision, and audio: learn once, apply everywhere.</p>
</section>
<!-- 15 · Why Python wins -->
<section>
<h2>Why Python Wins</h2>
<ul>
<li>Low entry barrier attracts newcomers and domain specialists alike.</li>
<li>High‑level semantics concisely express low‑level intent.</li>
<li>The C++/Rust back‑end remains accessible for critical paths.</li>
</ul>
</section>
<!-- 16 · Where Python can bite -->
<section>
<h2>Where Python can bite 🐍</h2>
<ul>
<li class="fragment">Interpreter overhead hurts microkernels (token‑by‑token decoding).</li>
<li class="fragment">The GIL throttles concurrent host‑side work.</li>
<li class="fragment">Fresh research code is easy to leave unoptimised.</li>
</ul>
<p class="fragment">Mitigations: Triton, compiled custom ops, compile‑time fallbacks, and callable kernels.</p>
</section>
<!-- 17 · Kernel Hub -->
<section>
<h2>Kernel Hub: Optimised Ops from the Community</h2>
<p>Kernel Hub lets any Python program <em>download and hot‑load</em> compiled CUDA/C++ kernels directly from the Hugging Face Hub at runtime.</p>
<ul>
<li><strong>Portable</strong> – kernels work from arbitrary paths outside <code>PYTHONPATH</code>.</li>
<li><strong>Unique</strong> – load multiple versions of the same op side‑by‑side in one process.</li>
<li><strong>Compatible</strong> – every kernel targets all recent PyTorch wheels (CUDA, ROCm, CPU) and C‑library ABIs.</li>
</ul>
<p class="fragment">🚀 <strong>Quick start</strong> (requires <code>torch >= 2.5</code>):</p>
<pre><code class="language-bash" data-trim>pip install kernels</code></pre>
<pre><code class="language-python" data-trim data-noescape>
import torch
from kernels import get_kernel

# Download optimised kernels from the Hugging Face Hub
activation = get_kernel("kernels-community/activation")

x = torch.randn(10, 10, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
print(y)
</code></pre>
<p class="fragment">Same Transformer code — now with a <strong>3× faster</strong> GELU on A100s.</p>
</section>
<!-- 18 · API design lessons -->
<section>
<h2>API Design Lessons</h2>
<ul>
<li>Make easy things obvious, and hard things merely possible.</li>
<li>Keep the paper‑to‑repository delta minimal for new models.</li>
<li>Hide sharding mechanics; expose developer intent.</li>
</ul>
<p class="fragment">We tune radios without learning RF theory — ML frameworks should feel just as frictionless.</p>
</section>
<!-- 19 · Model Growth by Modality -->
<section>
<h2>Model Growth by Modality</h2>
<iframe src="model_growth.html" width="100%" height="600" style="border:none;"></iframe>
</section>
<!-- 20 · Takeaways -->
<section>
<h2>Takeaways & The Future</h2>
<ul>
<li>PyTorch and <code>transformers</code> have grown symbiotically for eight years — expect the spiral to continue.</li>
<li>Pythonicity plus pragmatism keeps the barrier to innovation low.</li>
<li>Open‑source models are shipping faster, larger, and more multimodal than ever.</li>
</ul>
<p><a href="https://huggingface.co/transformers/contribute" target="_blank">hf.co/transformers/contribute</a></p>
</section>
</div>
</div>
<!-- Reveal.js core -->
<script src="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reveal.js"></script>
<script src="https://cdn.jsdelivr.net/npm/reveal.js@5/plugin/highlight/highlight.js"></script>
<script src="https://cdn.jsdelivr.net/npm/reveal.js@5/plugin/notes/notes.js"></script>
<!-- Plotly for interactive charts -->
<script src="https://cdn.plot.ly/plotly-2.31.1.min.js"></script>
<script>
Reveal.initialize({
  hash: true,
  slideNumber: true,
  transition: 'slide',
  backgroundTransition: 'convex',
  plugins: [ RevealHighlight, RevealNotes ]
});
</script>
</body>
</html> | |