PyTorch × Transformers Journey

Pythonicity, Autodiff & Modularity in Modern AI

Pablo Montalvo‑Leroux · ML Engineer @ Hugging Face

Transformers and Static vs Dynamic Graphs

The rise of Transformers is inversely correlated with static graphs: dynamic, eager execution enabled the iteration speed the field needed.

torch.compile ≈ sweet spot: author in dynamic, ship in static.
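A minimal sketch of that workflow (the module and shapes are illustrative, not from the talk): author the model in plain eager PyTorch, then hand it to `torch.compile` for deployment-style execution.

```python
import torch
import torch.nn as nn

# Authored in dynamic/eager mode: plain Python, easy to print and step through.
class TinyBlock(nn.Module):  # illustrative module
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.proj(x))

model = TinyBlock()
compiled = torch.compile(model)        # "ship in static": captured and fused by Dynamo/Inductor
out = compiled(torch.randn(8, 64))     # same call signature, faster steady-state execution
```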

Clone the paper tonight → tweak tomorrow

Research cadence ≈ hours; any friction kills momentum.

  • 2018: BERT fine-tunes needed live tensor prints, not graph recompiles.
  • Community PRs were merged overnight → library credibility snowballed.
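The "live tensor prints" point comes down to eager execution: you can drop a print (or a breakpoint) straight into `forward` and see real values, with no graph rebuild. A tiny illustrative sketch, not code from the slides:

```python
import torch
import torch.nn as nn

class DebuggableHead(nn.Module):  # illustrative classification head
    def __init__(self, hidden: int = 768, labels: int = 2):
        super().__init__()
        self.classifier = nn.Linear(hidden, labels)

    def forward(self, hidden_states):
        pooled = hidden_states[:, 0]  # [CLS]-style pooling
        # Live values during the fine-tune, no recompilation required.
        print("pooled mean/std:", pooled.mean().item(), pooled.std().item())
        return self.classifier(pooled)

logits = DebuggableHead()(torch.randn(4, 16, 768))
```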

Transformers Philosophy: One Model · One File pattern


class BertEmbeddings(nn.Module):
    …

class BertModel(BertPreTrainedModel):
    …

Atomic PRs → faster reviews → community velocity.

Back to Python: Modular Mary Shelley Mode

Compose new blocks via subclass & override.


class LlamaRotaryLoRA(LlamaAttention):
    def __init__(…):
        super().__init__(…)
        self.q_proj = LoRA(self.q_proj)
        self.apply_rotary()
        

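For completeness, a minimal sketch of what the `LoRA` wrapper assumed above could look like; the name, rank, and scaling are illustrative, not the actual PEFT implementation.

```python
import torch
import torch.nn as nn

class LoRA(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update (illustrative)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                    # freeze the original projection
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)                      # start as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.B(self.A(x))
```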
DTensor & Tensor‑Parallel API

  • Logical tensor view · device mesh
  • tp_plan keeps module code intact
  • 100B param validation inside HF tests
Figure: device mesh
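A hedged sketch of how the tensor-parallel plan surfaces to users: `tp_plan="auto"` is the Transformers entry point, while the checkpoint and launch command here are illustrative.

```python
# Launch with something like: torchrun --nproc-per-node 4 run_tp.py  (illustrative)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",   # sharding plan is applied for you; module code stays untouched
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The future of AI is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```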

Smarter Memory: Cache Allocator

0‑copy weight partitioning · 15 % RAM cut on A100

Figure: memory bars

Rise of Multimodality


processor = AutoProcessor.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForConditionalGeneration.from_pretrained("Qwen/Qwen3-8B")

Same API across text · vision · audio.
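A hedged usage sketch with the objects above; the prompt format, processor kwargs, and choice of checkpoint vary per model, so treat the inputs here as illustrative.

```python
from PIL import Image

image = Image.open("example.jpg")  # illustrative input
inputs = processor(text="Describe this image.", images=image, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```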

Why Python Wins

  • Low entry barrier
  • High‑level semantics express low‑level intent
  • Seamless C++/Rust extension points

Where Python can bite 🐍

  • Interpreter overhead on microkernels (token‑by‑token decode)
  • GIL can throttle async host‑side work
  • Easy to ship under-optimized code fresh out of the lab

Mitigations: Triton, compiled custom ops, compile‑time fallback, and callable kernels!
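As one concrete mitigation, a small Triton kernel keeps the hot loop out of the Python interpreter. This is essentially the canonical vector-add example, not a kernel shipped in Transformers:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out

# Usage (GPU tensors required):
# c = add(torch.randn(4096, device="cuda"), torch.randn(4096, device="cuda"))
```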

Community Kernels

New initiative

https://huggingface.co/kernels-community
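A sketch of how a community kernel can be pulled from the Hub with the `kernels` package; the kernel repo and function name below are assumptions based on the project README, and may differ per kernel.

```python
import torch
from kernels import get_kernel  # pip install kernels

# Fetch a pre-built kernel from the Hub (repo name assumed for illustration)
activation = get_kernel("kernels-community/activation")

x = torch.randn(4, 1024, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)  # function name taken from the README example; check the kernel's docs
```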

API Design Lessons

  • Make easy things obvious, hard things possible
  • Paper‑to‑repository difference should be minimal
  • Hide sharding, expose intent

We want to facilitate adoption. How does a radio work? Would you know how to tune one?

How does a computer work? Should you need to know how it works to browse the web?
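"Hide sharding, expose intent" in practice: the user states what they want to run, and placement is worked out under the hood. A minimal sketch using the standard big-model-inference path; the checkpoint is illustrative.

```python
from transformers import AutoModelForCausalLM

# Intent: "run this model on whatever hardware I have" — placement stays hidden.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",   # illustrative checkpoint
    device_map="auto",                       # shards/offloads across available GPUs and CPU
    torch_dtype="auto",
)
```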

Model Growth by Modality

Takeaways & The Future

  • PyTorch and Hugging Face Transformers grow symbiotically
  • Pythonicity × pragmatism drive adoption
  • Open-source models are shipping more than ever, and accomplishing more than ever, thanks to initiatives like these

hf.co/transformers/contribute