<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>PyTorch × Transformers Journey</title>

  <!-- Google Fonts -->
  <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;800&family=Fira+Code:wght@400;600&display=swap" rel="stylesheet" />

  <!-- Reveal.js core & dark theme base -->
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reset.css" />
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reveal.css" />
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/theme/black.css" id="theme" />

  <!-- Highlight.js -->
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/styles/github-dark.min.css" />

  <!-- Animations -->
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/animate.css@4/animate.min.css" />

  <style>
    :root {
      --accent-primary: #ee4c2c; /* PyTorch orange‑red */
      --accent-secondary: #ffb347; /* lighter highlight */
      --bg-gradient-start: #1b1b1b;
      --bg-gradient-end: #242424;
    }
    html, body { font-family: 'Inter', sans-serif; }
    .reveal .slides {
      background: linear-gradient(135deg, var(--bg-gradient-start), var(--bg-gradient-end));
    }
    .reveal h1, .reveal h2, .reveal h3 { color: var(--accent-primary); font-weight: 800; letter-spacing: -0.5px; }
    .reveal pre code { font-family: 'Fira Code', monospace; font-size: 0.75em; }
    .reveal section img, .reveal section svg { border-radius: 1rem; box-shadow: 0 8px 22px rgba(0,0,0,0.4); }
    .fragment.highlight-current-blue.visible { color: var(--accent-secondary) !important; }
    /* slide-density patch */
    .reveal h1 { font-size: 2.6rem; line-height: 1.1; }
    .reveal h2 { font-size: 1.9rem; line-height: 1.15; }
    .reveal h3 { font-size: 1.4rem; line-height: 1.2; }
    .reveal p, .reveal li { font-size: 0.9rem; line-height: 1.35; }
    .reveal pre code { font-size: 0.67em; }
    @media (max-width: 1024px) { .reveal h1{font-size:2.2rem;} .reveal h2{font-size:1.6rem;} }
    .reveal table td, .reveal table th { font-size: 0.85rem; padding: 4px 8px; }
  </style>
</head>
<body>
  <div class="reveal">
    <div class="slides">

      <!-- 1 · Opening -->
      <section data-auto-animate>
        <h1 class="animate__animated animate__fadeInDown">PyTorch × Transformers Journey</h1>
        <h3 class="animate__animated animate__fadeInDown animate__delay-1s">Pythonicity, Autodiff &amp; Modularity in Modern AI</h3>
        <p class="animate__animated animate__fadeInUp animate__delay-2s">Pablo Montalvo‑Leroux · ML Engineer @ Hugging Face</p>
      </section>

      <!-- 2 · 2016: Backprop & Birth Pangs -->
      <section>
        <h2>2016‑2018: Backprop &amp; Birth Pangs</h2>
        <ul>
          <li>Hand‑crafted chain rule; frameworks such as Theano and CNTK appeared, then vanished.</li>
          <li>MLPs → RNNs → LSTMs — until <strong>BERT</strong> detonated the field in 2018.</li>
          <li class="fragment">Reproducibility was painful ✗ — until Transformers met PyTorch ✓.</li>
        </ul>
      </section>

      <!-- 3 · Static vs Dynamic Graphs -->
      <section>
        <h2>Static vs Dynamic Graphs</h2>
        <p class="fragment">Static graphs require you to compile, wait, and cross fingers the bug reproduces.</p>
        <p class="fragment">Dynamic graphs mean you can drop <code>pdb.set_trace()</code> anywhere and continue iterating.</p>
        <p class="fragment"><code>torch.compile</code> gives the best of both worlds: write dynamically, ship something ahead‑of‑time optimised.</p>
      </section>

      <!-- 4 · Dynamic Graphs Enabled Contribution -->
      <section>
        <h2>Dynamic Graphs Enabled Contribution</h2>
        <ul>
          <li>Developers debug at line‑rate — no cold‑start recompiles.</li>
          <li>Pull‑requests remained reproducible overnight, which accelerated trust.</li>
          <li>Static‑graph alternatives stalled and the community consolidated around PyTorch.</li>
        </ul>
      </section>

      <!-- 5 · Paper Tonight → Tweak Tomorrow -->
      <section>
        <h2>Clone the Paper Tonight → Tweak Tomorrow</h2>
        <p>Research cadence is measured in <strong>hours</strong>; any friction kills momentum.</p>
        <ul>
          <li class="fragment">2018: BERT fine‑tuning required printing tensors live rather than recompiling graphs.</li>
          <li class="fragment">Community PRs merged overnight — credibility snowballed for both PyTorch and Transformers.</li>
        </ul>
      </section>

      <!-- 6 · One Model · One File -->
      <section>
        <h2>“One Model · One File” — Why it Matters</h2>
        <pre><code class="language-python" data-trim data-noescape>
# modeling_bert.py  — single source of truth 🗄️
class BertConfig(PretrainedConfig):
    ...

class BertSelfAttention(nn.Module):
    ...

class BertLayer(nn.Module):
    ...

class BertModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = nn.ModuleList(
            [BertLayer(config) for _ in range(config.num_hidden_layers)]
        )
        self.init_weights()
        </code></pre>
        <ul>
          <li>All layers, forward pass, and <code>from_pretrained()</code> logic live together.</li>
          <li>No cross‑file inheritance maze — copy to Colab, hack, and run (usage sketched below).</li>
          <li>Reviewers diff one file; merge time dropped from days to hours.</li>
        </ul>
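        <p class="fragment">A minimal usage sketch (standard Transformers API):</p>
        <pre><code class="language-python" data-trim data-noescape>
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
outputs = model(**tokenizer("One model, one file.", return_tensors="pt"))
        </code></pre>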
      </section>

      <!-- 7 · Transformers Grew With Python -->
      <section>
        <h2>Transformers Grew with Python</h2>
        <ul>
          <li>The library prioritises hackability, which in turn accelerates adoption.</li>
          <li>Python is slow by default, so we lean on compiled CUDA kernels and Triton for raw speed.</li>
          <li>The new <strong>Kernel Hub</strong> means Transformers automatically uses a faster op the moment it is published — no application changes required.</li>
        </ul>
      </section>

      <!-- 8 · Back to Python: Mary Shelley Mode -->
      <section>
        <h2>Back to Python: Modular “Mary Shelley” Mode</h2>
        <p>Compose new blocks via subclassing and selective override.</p>
        <pre><code class="language-python" data-trim data-noescape>
class LlamaRotaryLoRA(LlamaAttention):
    def __init__(self, config, layer_idx):
        super().__init__(config, layer_idx)
        self.q_proj = LoRA(self.q_proj)  # swap in LoRA (illustrative wrapper)
        self.apply_rotary()              # keep RoPE
        </code></pre>
      </section>

      <!-- 9 · Logit Debugger -->
      <section>
        <h2>Logit Debugger: Trust but Verify</h2>
        <ul>
          <li>Attach a hook to every <code>nn.Module</code>; dump logits layer‑by‑layer (sketched below).</li>
          <li>Spot ε‑level drifts — LayerNorm precision, FP16 underflow, etc.</li>
          <li>JSON traces are diffable in CI, so regressions stay caught.</li>
        </ul>
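        <p class="fragment">A minimal sketch of the idea (hypothetical helper, not the actual debugging utility; <code>model</code> and <code>inputs</code> are assumed to exist):</p>
        <pre><code class="language-python" data-trim data-noescape>
import json
import torch

def attach_trace_hooks(model, trace):
    # Record per-module output statistics so two runs can be diffed in CI
    def make_hook(name):
        def hook(module, args, output):
            if isinstance(output, torch.Tensor):
                trace[name] = {"mean": output.float().mean().item(),
                               "std": output.float().std().item()}
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

trace = {}
attach_trace_hooks(model, trace)
model(**inputs)                      # one forward pass fills the trace
print(json.dumps(trace, indent=2))   # layer-by-layer, diffable dump
        </code></pre>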
      </section>

      <!-- 10 · DTensor & TP API -->
      <section>
        <h2>DTensor & Tensor‑Parallel API</h2>
        <ul>
          <li>Logical tensor views unlock device‑mesh sharding (see the sketch below).</li>
          <li>The <code>tp_plan</code> JSON keeps model code pristine and declarative.</li>
          <li>We regularly validate 100‑billion‑parameter checkpoints inside HF test infra.</li>
        </ul>
        <img data-src="assets/mesh.svg" alt="Device mesh" />
      </section>

      <!-- 11 · Zero‑Config Parallelism -->
      <section>
        <h2>Zero‑Config Parallelism</h2>
        <pre><code class="language-json" data-trim data-noescape>{
  "layer.*.self_attn.q_proj": "colwise",
  "layer.*.self_attn.k_proj": "colwise",
  "layer.*.self_attn.v_proj": "colwise",
  "layer.*.self_attn.o_proj": "rowwise"
}</code></pre>
        <pre><code class="language-python" data-trim data-noescape>
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

def translate_to_torch_parallel_style(style: str):
    # Map a tp_plan entry onto the matching torch.distributed parallel style
    if style == "colwise":
        return ColwiseParallel()
    elif style == "rowwise":
        return RowwiseParallel()
    raise ValueError(f"Unsupported tensor-parallel style: {style}")
        </code></pre>
        <p class="fragment">One JSON file loads a 17‑billion‑parameter Llama‑4 on 8 GPUs; tweak the plan, not the network.</p>
      </section>

      <!-- 12 · Cache Allocator -->
      <section>
        <h2>Load Faster &amp; Stronger: Cache Allocator</h2>
        <p>Zero‑copy weight sharding shaves <strong>15 %</strong> VRAM on A100 while cutting load time below 60 s for a 100‑B model.</p>
        <img data-src="assets/memory_bars.svg" alt="Memory bars" />
      </section>

      <!-- 13 · Modular Transformers: GLM Example -->
      <section>
        <h2>Modular Transformers: GLM by Example</h2>
        <pre><code class="language-python" data-trim>
class GlmMLP(Phi3MLP):
    pass

class GlmAttention(LlamaAttention):
    def __init__(self, config, layer_idx=None):
        super().__init__(config, layer_idx)
        self.o_proj = nn.Linear(
            config.num_attention_heads * self.head_dim,
            config.hidden_size,
            bias=False,
        )

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    # Slightly different RoPE
    ...

class GlmForCausalLM(LlamaForCausalLM):
    pass
        </code></pre>
        <p>AST magic expands this 40‑line prototype into a full modelling file, ready for training.</p>
      </section>

      <!-- 14 · Rise of Multimodality -->
      <section>
        <h2>Rise of Multimodality</h2>
        <pre><code class="language-python" data-trim data-noescape>
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
        </code></pre>
        <p class="fragment">Same API across text, vision, and audio: learn once, apply everywhere.</p>
      </section>

      <!-- 15 · Why Python wins -->
      <section>
        <h2>Why Python Wins</h2>
        <ul>
          <li>Low entry barrier attracts newcomers and domain specialists alike.</li>
          <li>High‑level semantics concisely express low‑level intent.</li>
          <li>The C++/Rust back‑end remains accessible for critical paths.</li>
        </ul>
      </section>

      <!-- 16 · Where Python can bite -->
      <section>
        <h2>Where Python can bite 🐍</h2>
        <ul>
          <li class="fragment">Interpreter overhead hurts microkernels (token‑by‑token decoding).</li>
          <li class="fragment">The GIL throttles concurrent host‑side work.</li>
          <li class="fragment">Fresh research code is easy to leave unoptimised.</li>
        </ul>
        <p class="fragment">Mitigations: Triton, compiled custom ops, compile‑time fallbacks, and callable kernels.</p>
      </section>

      <!-- 17 · Kernel Hub -->
      <section>
        <h2>Kernel Hub: Optimised Ops from the Community</h2>
        <p>Kernel Hub lets any Python program <em>download and hot‑load</em> compiled CUDA/C++ kernels directly from the Hugging Face Hub at runtime.</p>
        <ul>
          <li><strong>Portable</strong> – kernels work from arbitrary paths outside <code>PYTHONPATH</code>.</li>
          <li><strong>Unique</strong> – load multiple versions of the same op side‑by‑side in one process.</li>
          <li><strong>Compatible</strong> – every kernel targets all recent PyTorch wheels (CUDA, ROCm, CPU) and C‑library ABIs.</li>
        </ul>
        <p class="fragment">🚀 <strong>Quick start</strong> (requires <code>torch >= 2.5</code>):</p>
        <pre><code class="language-bash" data-trim>pip install kernels</code></pre>
        <pre><code class="language-python" data-trim data-noescape>
import torch
from kernels import get_kernel

# Download optimised kernels from the Hugging Face Hub
activation = get_kernel("kernels-community/activation")

x = torch.randn(10, 10, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
print(y)
        </code></pre>
        <p class="fragment">Same Transformer code — now with a <strong>3× faster</strong> GELU on A100s.</p>
      </section>

      <!-- 18 · API design lessons -->
      <section>
        <h2>API Design Lessons</h2>
        <ul>
          <li>Make easy things obvious, and hard things merely possible.</li>
          <li>Keep the paper‑to‑repository delta minimal for new models.</li>
          <li>Hide sharding mechanics; expose developer intent.</li>
        </ul>
        <p class="fragment">We tune radios without learning RF theory — ML frameworks should feel as frictionless.</p>
      </section>

      <!-- 19 · Model Growth by Modality -->
      <section>
        <h2>Model Growth by Modality</h2>
        <iframe src="model_growth.html" width="100%" height="600" style="border:none;"></iframe>
      </section>

      <!-- 20 · Takeaways -->
      <section>
        <h2>Takeaways &amp; The Future</h2>
        <ul>
          <li>PyTorch and <code>transformers</code> have grown symbiotically for eight years—expect the spiral to continue.</li>
          <li>Pythonicity plus pragmatism keeps the barrier to innovation low.</li>
          <li>Open‑source models are shipping faster, larger, and more multimodal than ever.</li>
        </ul>
        <p><a href="https://huggingface.co/transformers/contribute" target="_blank">hf.co/transformers/contribute</a></p>
      </section>

    </div>
  </div>

  <!-- Reveal.js core -->
  <script src="https://cdn.jsdelivr.net/npm/reveal.js@5/dist/reveal.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/reveal.js@5/plugin/highlight/highlight.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/reveal.js@5/plugin/notes/notes.js"></script>
  <!-- Plotly for interactive charts -->
  <script src="https://cdn.plot.ly/plotly-2.31.1.min.js"></script>
  <script>
    Reveal.initialize({
      hash: true,
      slideNumber: true,
      transition: 'slide',
      backgroundTransition: 'convex',
      plugins: [ RevealHighlight, RevealNotes ]
    });
  </script>
</body>
</html>