EPFL Machine Learning and Optimization Laboratory

university

AI & ML interests

None defined yet.

Recent Activity

mjaggi updated a dataset 3 months ago: epfml/FineWeb2-HQ
vsabolcec updated a dataset 3 months ago: epfml/FineWeb2-embedded

epfml's activity

loubnabnl posted an update 6 months ago
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk; see the loading sketch after this list)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents
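
As a quick way to poke at that SFT data, here is a minimal sketch using the `datasets` library; the "all" config name and the "messages" column of chat turns are assumptions about the repo layout, not confirmed here.

```python
# Hypothetical sketch: peek at the SmolTalk SFT data with the `datasets` library.
# The "all" config name and the "messages" column schema are assumptions.
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train", streaming=True)

example = next(iter(ds))                 # first conversation
for turn in example["messages"]:         # assumed list of {"role", "content"} dicts
    print(f'{turn["role"]}: {turn["content"][:80]}')
```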

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?
loubnabnl posted an update 11 months ago
🍷 The FineWeb technical report is out, and so is 📚 FineWeb-Edu, a 1.3 trillion token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.

Technical report: HuggingFaceFW/blogpost-fineweb-v1
Dataset: HuggingFaceFW/fineweb-edu

We used Llama 3 generations to train an educational quality classifier and filtered FineWeb's 15 trillion tokens to keep only documents with high educational value (an approach also used for the Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
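
To illustrate that score-based filtering, here is a minimal sketch that streams FineWeb-Edu and keeps documents above a threshold; the "sample-10BT" config name, the "score" column, and its 0-5 scale are assumptions about the repo layout (the released dataset is already filtered, so this only shows the idea).

```python
from itertools import islice
from datasets import load_dataset

# Stream a small FineWeb-Edu sample and keep documents above an educational-score
# threshold. Config name, "score" column, and the 0-5 scale are assumptions.
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

high_quality = (doc for doc in ds if doc.get("score", 0.0) >= 3.0)
for doc in islice(high_quality, 3):
    print(doc["text"][:120].replace("\n", " "))
```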

You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.

Enjoy!
Andron00e posted an update about 1 year ago
Sparse Concept Bottleneck Models: Gumbel Tricks in Contrastive Learning

Paper: Sparse Concept Bottleneck Models: Gumbel Tricks in Contrastive Learning (2404.03323)

The authors propose a novel architecture and method for explainable classification with Concept Bottleneck Models (CBMs): they introduce a new type of layer, the Concept Bottleneck Layer (CBL), and present three methods for training it: with an $\ell_1$ loss, with a contrastive loss, or with a loss based on the Gumbel-Softmax distribution (Sparse-CBM), while the final FC layer is still trained with cross-entropy. They show a significant increase in accuracy when using sparse hidden layers in CLIP-based bottleneck models, which means that a sparse representation of the concept activation vector is meaningful in Concept Bottleneck Models.

Key concepts:
– Contrastive Gumbel-Softmax loss: the first contrastive variant of the Gumbel-Softmax objective, which yields a sparse inner representation of the Concept Bottleneck Layer activations.
– Sparse $\ell_1$ regularization.
– Contrastive loss for inner layers of the model.

Methodology:
The approach consists of three main steps:
– Create a set of concepts based on the dataset labels.
– Equip a multi-modal encoder with a CBL.
– Train the CBL with the chosen objective and train the classifier head with cross-entropy (a minimal sketch follows this list).
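
A minimal PyTorch sketch of the idea (illustrative only, not the authors' implementation): a Concept Bottleneck Layer whose activations are sparsified with Gumbel-Softmax, followed by a linear head trained with cross-entropy. The feature, concept, and class sizes are made up; real features would come from a frozen CLIP encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConceptBottleneck(nn.Module):
    """Frozen-encoder features -> Concept Bottleneck Layer -> linear classifier head."""
    def __init__(self, feat_dim: int, n_concepts: int, n_classes: int, tau: float = 1.0):
        super().__init__()
        self.cbl = nn.Linear(feat_dim, n_concepts)    # Concept Bottleneck Layer
        self.head = nn.Linear(n_concepts, n_classes)  # final FC layer
        self.tau = tau

    def forward(self, feats: torch.Tensor):
        logits = self.cbl(feats)
        # Gumbel-Softmax yields a near one-hot (hence sparse) concept activation vector.
        concepts = F.gumbel_softmax(logits, tau=self.tau, hard=False, dim=-1)
        return self.head(concepts), concepts

# Toy training step on random stand-in features (sizes are illustrative).
model = SparseConceptBottleneck(feat_dim=512, n_concepts=128, n_classes=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
feats, labels = torch.randn(32, 512), torch.randint(0, 10, (32,))

class_logits, concepts = model(feats)
loss = F.cross_entropy(class_logits, labels)   # head trained with cross-entropy
opt.zero_grad(); loss.backward(); opt.step()
print(f"loss={loss.item():.3f}, mean |concept|={concepts.abs().mean().item():.4f}")
```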

Results and Analysis:
The methodology can be applied to interpretable image classification, and the experimental results demonstrate the superiority of using sparse hidden representations of concepts.

loubnabnl posted an update about 1 year ago
We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope this will provide insights into generating synthetic data at scale for pre-training.
https://huggingface.co/blog/cosmopedia

Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates (see the toy sketch after this list).
📚 You can leverage various resources for diversity: different seed data, generation formats, and target audiences.
⚙️ A good technical stack matters: tools like llm-swarm for scalable generation, plus fast model training and evaluation.
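
A toy sketch of that first takeaway (illustrative only, not the Cosmopedia pipeline): build prompts by crossing seed topics, target audiences, and generation formats, then drop exact duplicates before handing the prompts to the generation cluster.

```python
from itertools import product

# Hypothetical seeds; the real pipeline draws topics from curated sources.
topics = ["photosynthesis", "binary search", "plate tectonics"]
audiences = ["young children", "college students"]
formats = ["textbook section", "blog post"]

seen, prompts = set(), []
for topic, audience, fmt in product(topics, audiences, formats):
    prompt = f"Write a {fmt} about {topic} for {audience}."
    if prompt not in seen:      # keep the prompt set duplicate-free
        seen.add(prompt)
        prompts.append(prompt)

print(len(prompts), "prompts;", prompts[0])
```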

Have a good read!