PyTorch × Transformers Journey

Pythonicity, Autodiff & Modularity in Modern AI

Pablo Montalvo‑Leroux · ML Engineer @ Hugging Face

Transformers and Static vs Dynamic Graphs

The rise of Transformers is inversely correlated with static graphs: dynamic, eager execution enabled the iteration speed the field needed.

torch.compile ≈ sweet spot: author in dynamic, ship in static.
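A minimal sketch of that workflow (the module and shapes are illustrative, not from the talk): author the model in plain eager PyTorch, then hand it to `torch.compile` for deployment-style execution.

```python
import torch
import torch.nn as nn

# Authored in dynamic/eager mode: plain Python, easy to print and step through.
class TinyBlock(nn.Module):  # illustrative module
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.proj(x))

model = TinyBlock()
compiled = torch.compile(model)        # "ship in static": captured and fused by Dynamo/Inductor
out = compiled(torch.randn(8, 64))     # same call signature, faster steady-state execution
```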

Clone the paper tonight → tweak tomorrow

Research cadence ≈ hours; any friction kills momentum.

  • 2018: BERT fine-tunes needed live tensor prints, not graph recompiles.
  • Community PRs were merged overnight → library credibility snowballed.
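The "live tensor prints" point comes down to eager execution: you can drop a print (or a breakpoint) straight into `forward` and see real values, with no graph rebuild. A tiny illustrative sketch, not code from the slides:

```python
import torch
import torch.nn as nn

class DebuggableHead(nn.Module):  # illustrative classification head
    def __init__(self, hidden: int = 768, labels: int = 2):
        super().__init__()
        self.classifier = nn.Linear(hidden, labels)

    def forward(self, hidden_states):
        pooled = hidden_states[:, 0]  # [CLS]-style pooling
        # Live values during the fine-tune, no recompilation required.
        print("pooled mean/std:", pooled.mean().item(), pooled.std().item())
        return self.classifier(pooled)

logits = DebuggableHead()(torch.randn(4, 16, 768))
```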

Transformers Philosophy: One Model · One File pattern


class BertEmbeddings(nn.Module):
    …

class BertModel(BertPreTrainedModel):
    …

Atomic PRs → faster reviews → community velocity.

Back to Python: Modular Mary Shelley Mode

Compose new blocks via subclass & override.


class LlamaRotaryLoRA(LlamaAttention):
    def __init__(…):
        super().__init__(…)
        self.q_proj = LoRA(self.q_proj)
        self.apply_rotary()
        

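For completeness, a minimal sketch of what the `LoRA` wrapper assumed above could look like; the name, rank, and scaling are illustrative, not the actual PEFT implementation.

```python
import torch
import torch.nn as nn

class LoRA(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update (illustrative)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                    # freeze the original projection
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)                      # start as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.B(self.A(x))
```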
DTensor & Tensor‑Parallel API

  • Logical tensor view · device mesh
  • tp_plan keeps module code intact
  • 100B param validation inside HF tests
Figure: device mesh
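A hedged sketch of how the tensor-parallel plan surfaces to users: `tp_plan="auto"` is the Transformers entry point, while the checkpoint and launch command here are illustrative.

```python
# Launch with something like: torchrun --nproc-per-node 4 run_tp.py  (illustrative)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",   # sharding plan is applied for you; module code stays untouched
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The future of AI is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```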

Smarter Memory: Cache Allocator

0‑copy weight partitioning · 15 % RAM cut on A100

Figure: memory bars

Rise of Multimodality


processor = AutoProcessor.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForConditionalGeneration.from_pretrained("Qwen/Qwen3-8B")

Same API across text · vision · audio.
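A hedged usage sketch with the objects above; the prompt format, processor kwargs, and choice of checkpoint vary per model, so treat the inputs here as illustrative.

```python
from PIL import Image

image = Image.open("example.jpg")  # illustrative input
inputs = processor(text="Describe this image.", images=image, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```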

Why Python Wins

  • Low entry barrier
  • High‑level semantics express low‑level intent
  • Seamless C++/Rust extension points

Where Python can bite 🐍

  • Interpreter overhead on microkernels (token‑by‑token decode)
  • GIL can throttle async host‑side work
  • Easy to ship under-optimized code fresh out of the lab

Mitigations: Triton, compiled custom ops, compile‑time fallback, and callable kernels!
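As one concrete mitigation, a small Triton kernel keeps the hot loop out of the Python interpreter. This is essentially the canonical vector-add example, not a kernel shipped in Transformers:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out

# Usage (GPU tensors required):
# c = add(torch.randn(4096, device="cuda"), torch.randn(4096, device="cuda"))
```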

Community Kernels

New initiative

https://huggingface.co/kernels-community
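A sketch of how a community kernel can be pulled from the Hub with the `kernels` package; the kernel repo and function name below are assumptions based on the project README, and may differ per kernel.

```python
import torch
from kernels import get_kernel  # pip install kernels

# Fetch a pre-built kernel from the Hub (repo name assumed for illustration)
activation = get_kernel("kernels-community/activation")

x = torch.randn(4, 1024, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)  # function name taken from the README example; check the kernel's docs
```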

API Design Lessons

  • Make easy things obvious, hard things possible
  • Paper‑to‑repository difference should be minimal
  • Hide sharding, expose intent

We want to facilitate adoption. How does a radio work? Would you know how to tune one?

How does a computer work? Should you need to know how it works to browse the web?
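"Hide sharding, expose intent" in practice: the user states what they want to run, and placement is worked out under the hood. A minimal sketch using the standard big-model-inference path; the checkpoint is illustrative.

```python
from transformers import AutoModelForCausalLM

# Intent: "run this model on whatever hardware I have" — placement stays hidden.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",   # illustrative checkpoint
    device_map="auto",                       # shards/offloads across available GPUs and CPU
    torch_dtype="auto",
)
```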

Model Growth by Modality

Takeaways & The Future

  • PyTorch and Hugging Face Transformers grow symbiotically
  • Pythonicity × pragmatism drive adoption
  • Open-source models are shipping more than ever, and accomplishing more than ever, thanks to initiatives like these

hf.co/transformers/contribute