- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents
Apache 2.0 licensed. V2 pre-training data mix coming soon!
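For anyone who wants to poke at the released artifacts right away, here is a minimal sketch of streaming the SFT dataset mentioned above with the `datasets` library; the config name "all" and the "train" split are assumptions and may differ from the actual repo layout.

```python
# Minimal sketch: peek at the SmolTalk SFT dataset mentioned above.
# The config name "all" and split "train" are assumptions about the repo layout.
from datasets import load_dataset

smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train", streaming=True)
for example in smoltalk.take(1):
    print(example)
```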
The 🍷 FineWeb technical report is out, and so is 📚 FineWeb-Edu, a 1.3 trillion token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.
We used Llama 3 generations to train an educational quality classifier, then filtered the 15 trillion tokens of FineWeb to keep only the samples with high educational value (an approach also used for the Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.
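To make the filtering step concrete, here is a hedged sketch of scoring documents with an educational-quality classifier and keeping only the high-scoring ones. The Hub model ID, the 0–5 score scale, the 512-token truncation, and the threshold of 3 are assumptions for illustration, not the exact setup from the report.

```python
# Sketch of classifier-based filtering: score each document for educational
# quality and keep only those above a threshold. Model ID, score scale, and
# threshold are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HuggingFaceFW/fineweb-edu-classifier"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def edu_score(text: str) -> float:
    """Return a scalar educational-quality score for one document."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()

docs = [
    "Photosynthesis converts light energy into chemical energy stored in glucose...",
    "click here to win a free prize!!!",
]
kept = [d for d in docs if edu_score(d) >= 3.0]  # keep only high educational value
```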
The authors propose a novel architecture and method for explainable classification with Concept Bottleneck Models (CBMs): they introduce a new type of layer, the Concept Bottleneck Layer (CBL), and present three methods for training it: with an $\ell_1$ loss, with a contrastive loss, and with a loss based on the Gumbel-Softmax distribution (Sparse-CBM), while the final FC layer is still trained with Cross-Entropy. They show a significant increase in accuracy when using sparse hidden layers in CLIP-based bottleneck models, which means that a sparse representation of the concept activation vector is meaningful in Concept Bottleneck Models.
Key concepts:
– Contrastive Gumbel-Softmax loss: the first contrastive variant of the Gumbel-Softmax objective, which achieves an inner sparse representation of the Concept Bottleneck Layer activations.
– Sparse $\ell_1$ regularization.
– Contrastive loss for inner layers of the model.
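As a rough illustration of why Gumbel-Softmax yields sparse concept activations (this is a generic PyTorch sketch, not the paper's contrastive variant, and real CBLs may allow several active concepts), note that the straight-through relaxation produces near-one-hot vectors that remain differentiable:

```python
# Generic sketch (not the paper's exact loss): Gumbel-Softmax turns concept
# logits into near-one-hot, i.e. sparse, activation vectors that can still be
# trained end-to-end because the relaxation is differentiable.
import torch
import torch.nn.functional as F

batch_size, n_concepts = 4, 16
concept_logits = torch.randn(batch_size, n_concepts, requires_grad=True)

# tau controls sharpness; hard=True samples a one-hot vector in the forward
# pass while keeping soft gradients in the backward pass.
sparse_concepts = F.gumbel_softmax(concept_logits, tau=0.5, hard=True)
print(sparse_concepts.sum(dim=-1))  # each row sums to 1: a single active concept
```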
Methodology: The approach consists of three main steps:
– Create a set of concepts based on the labels of the dataset.
– Supply a multi-modal encoder with a CBL.
– Train the CBL with the chosen objective function and train the classifier head with Cross-Entropy (a minimal sketch of this step follows below).
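The sketch below shows the last step with the $\ell_1$ variant, assuming a frozen CLIP-style image encoder whose features are passed in directly; the layer sizes, the $\lambda$ weight, and the training loop are placeholders rather than the paper's exact implementation.

```python
# Minimal sketch of a Concept Bottleneck Layer on top of a frozen encoder,
# trained with cross-entropy on the classifier head plus an l1 penalty that
# encourages sparse concept activations. Dimensions and lambda are placeholders.
import torch
import torch.nn as nn

feat_dim, n_concepts, n_classes, lam = 512, 128, 10, 1e-3

cbl = nn.Linear(feat_dim, n_concepts)    # Concept Bottleneck Layer
head = nn.Linear(n_concepts, n_classes)  # final FC classifier
optimizer = torch.optim.Adam(list(cbl.parameters()) + list(head.parameters()), lr=1e-4)
ce = nn.CrossEntropyLoss()

def training_step(image_features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """image_features: (B, feat_dim) from a frozen CLIP-style encoder."""
    concepts = cbl(image_features)       # concept activation vector
    logits = head(concepts)
    loss = ce(logits, labels) + lam * concepts.abs().mean()  # CE + l1 sparsity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

# Dummy batch standing in for encoder outputs and labels.
loss = training_step(torch.randn(8, feat_dim), torch.randint(0, n_classes, (8,)))
```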
Results and Analysis: The methodology can be applied to the task of interpretable image classification, and the experimental results show the superiority of using sparse hidden representations of concepts.
We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope this will provide insights into generating synthetic data at scale for pre-training. https://huggingface.co/blog/cosmopedia
Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates.
📚 You can leverage various resources for diversity: different seed data, generation formats, and target audiences.
⚙️ A good technical stack matters: scalable generation with tools like llm-swarm, plus fast model training and evaluation.
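To illustrate the first two takeaways, here is a toy sketch of building a diverse prompt set by combining seed topics, generation formats, and target audiences, with simple exact-duplicate removal; the templates and example values are made up for illustration and are not Cosmopedia's actual prompts.

```python
# Toy sketch: combine seed topics, formats, and audiences into prompts and
# drop exact duplicates. The templates below are illustrative only.
from itertools import product

seed_topics = ["photosynthesis", "binary search", "the French Revolution"]
formats = ["textbook chapter", "blog post"]
audiences = ["middle school students", "college students"]

prompts = {
    f"Write a {fmt} about {topic} for {audience}."
    for topic, fmt, audience in product(seed_topics, formats, audiences)
}
print(len(prompts), "unique prompts")
```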