- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents
Apache 2.0 licensed. V2 pre-training data mix coming soon!
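For anyone who wants to poke at the released artifacts right away, here is a minimal sketch of streaming the SFT dataset mentioned above with the `datasets` library; the config name "all" and the "train" split are assumptions and may differ from the actual repo layout.

```python
# Minimal sketch: peek at the SmolTalk SFT dataset mentioned above.
# The config name "all" and split "train" are assumptions about the repo layout.
from datasets import load_dataset

smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train", streaming=True)
for example in smoltalk.take(1):
    print(example)
```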
The 🍷 FineWeb technical report is out, and so is 📚 FineWeb-Edu, a 1.3 trillion token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.
We used Llama 3 generations to train an educational quality classifier, then filtered the 15 trillion tokens of FineWeb to keep only the samples with high educational value (an approach also used for the Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.
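To make the filtering step concrete, here is a hedged sketch of scoring documents with an educational-quality classifier and keeping only the high-scoring ones. The Hub model ID, the 0–5 score scale, the 512-token truncation, and the threshold of 3 are assumptions for illustration, not the exact setup from the report.

```python
# Sketch of classifier-based filtering: score each document for educational
# quality and keep only those above a threshold. Model ID, score scale, and
# threshold are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HuggingFaceFW/fineweb-edu-classifier"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def edu_score(text: str) -> float:
    """Return a scalar educational-quality score for one document."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()

docs = [
    "Photosynthesis converts light energy into chemical energy stored in glucose...",
    "click here to win a free prize!!!",
]
kept = [d for d in docs if edu_score(d) >= 3.0]  # keep only high educational value
```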
The authors propose a novel architecture and method for explainable classification with Concept Bottleneck Models (CBMs): they introduce a new type of layer, the Concept Bottleneck Layer (CBL), and present three methods for training it: with an $\ell_1$ loss, with a contrastive loss, and with a loss based on the Gumbel-Softmax distribution (Sparse-CBM), while the final FC layer is still trained with Cross-Entropy. They show a significant increase in accuracy when using sparse hidden layers in CLIP-based bottleneck models, which means that a sparse representation of the concept activation vector is meaningful in Concept Bottleneck Models.
Key concepts:
– Contrastive Gumbel-Softmax loss: the first contrastive variant of the Gumbel-Softmax objective, which achieves an inner sparse representation of the Concept Bottleneck Layer activations.
– Sparse $\ell_1$ regularization.
– Contrastive loss for inner layers of the model.
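As a rough illustration of why Gumbel-Softmax yields sparse concept activations (this is a generic PyTorch sketch, not the paper's contrastive variant, and real CBLs may allow several active concepts), note that the straight-through relaxation produces near-one-hot vectors that remain differentiable:

```python
# Generic sketch (not the paper's exact loss): Gumbel-Softmax turns concept
# logits into near-one-hot, i.e. sparse, activation vectors that can still be
# trained end-to-end because the relaxation is differentiable.
import torch
import torch.nn.functional as F

batch_size, n_concepts = 4, 16
concept_logits = torch.randn(batch_size, n_concepts, requires_grad=True)

# tau controls sharpness; hard=True samples a one-hot vector in the forward
# pass while keeping soft gradients in the backward pass.
sparse_concepts = F.gumbel_softmax(concept_logits, tau=0.5, hard=True)
print(sparse_concepts.sum(dim=-1))  # each row sums to 1: a single active concept
```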
Methodology: The approach consists of three main steps:
– Create a set of concepts based on the labels of the dataset.
– Supply a multi-modal encoder with a CBL.
– Train the CBL with the chosen objective function and train the classifier head with Cross-Entropy (a minimal sketch of this step follows below).
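The sketch below shows the last step with the $\ell_1$ variant, assuming a frozen CLIP-style image encoder whose features are passed in directly; the layer sizes, the $\lambda$ weight, and the training loop are placeholders rather than the paper's exact implementation.

```python
# Minimal sketch of a Concept Bottleneck Layer on top of a frozen encoder,
# trained with cross-entropy on the classifier head plus an l1 penalty that
# encourages sparse concept activations. Dimensions and lambda are placeholders.
import torch
import torch.nn as nn

feat_dim, n_concepts, n_classes, lam = 512, 128, 10, 1e-3

cbl = nn.Linear(feat_dim, n_concepts)    # Concept Bottleneck Layer
head = nn.Linear(n_concepts, n_classes)  # final FC classifier
optimizer = torch.optim.Adam(list(cbl.parameters()) + list(head.parameters()), lr=1e-4)
ce = nn.CrossEntropyLoss()

def training_step(image_features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """image_features: (B, feat_dim) from a frozen CLIP-style encoder."""
    concepts = cbl(image_features)       # concept activation vector
    logits = head(concepts)
    loss = ce(logits, labels) + lam * concepts.abs().mean()  # CE + l1 sparsity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

# Dummy batch standing in for encoder outputs and labels.
loss = training_step(torch.randn(8, feat_dim), torch.randint(0, n_classes, (8,)))
```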
Results and Analysis: The methodology can be applied to the task of interpretable image classification, and the experimental results show the superiority of using sparse hidden representations of concepts.
We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope this will provide insights into generating synthetic data at scale for pre-training. https://huggingface.co/blog/cosmopedia
Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates.
📚 You can leverage various resources for diversity: different seed data, generation formats, and target audiences.
⚙️ A good technical stack matters: scalable generation with tools like llm-swarm, plus fast model training and evaluation.
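To illustrate the first two takeaways, here is a toy sketch of building a diverse prompt set by combining seed topics, generation formats, and target audiences, with simple exact-duplicate removal; the templates and example values are made up for illustration and are not Cosmopedia's actual prompts.

```python
# Toy sketch: combine seed topics, formats, and audiences into prompts and
# drop exact duplicates. The templates below are illustrative only.
from itertools import product

seed_topics = ["photosynthesis", "binary search", "the French Revolution"]
formats = ["textbook chapter", "blog post"]
audiences = ["middle school students", "college students"]

prompts = {
    f"Write a {fmt} about {topic} for {audience}."
    for topic, fmt, audience in product(seed_topics, formats, audiences)
}
print(len(prompts), "unique prompts")
```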