ml-fw-prerelease (ml-fw-prerelease)

BramVanroy

posted an update 4 days ago

📢💾 Introducing the Common Crawl Creative Commons Corpus (C5)!

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, to only documents that are Creative Commons-licensed such as cc-by-4.0 or public domain cc0. At this stage 150 billion tokens have been collected.

---
📄 data: BramVanroy/CommonCrawl-CreativeCommons
🧰 software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---

</> To build C5, HTML pages are scanned and every link to a CC license (if any) is collected, both from regular hyperlinks and from metadata. Additional data fields are included, such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which make further filtering a breeze.
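As a rough illustration of the idea described above, here is a minimal sketch using only the Python standard library. It is not the code from the C5 repository: the class name, the output keys, and the rule "more than one distinct license path counts as a disagreement" are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

CC_HOSTS = ("creativecommons.org", "www.creativecommons.org")


class CCLicenseFinder(HTMLParser):
    """Collect links to Creative Commons licenses and note where they occur."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = []  # list of (license_path, found_in_head) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        href = dict(attrs).get("href")
        # Both regular hyperlinks (<a>) and metadata (<link>) carry an href.
        if href and tag in ("a", "link"):
            parsed = urlparse(href)
            is_cc_path = parsed.path.startswith(("/licenses/", "/publicdomain/"))
            if parsed.netloc in CC_HOSTS and is_cc_path:
                self.found.append((parsed.path.strip("/"), self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_cc_licenses(html):
    """Return the license paths found plus two simple filtering flags."""
    finder = CCLicenseFinder()
    finder.feed(html)
    licenses = sorted({path for path, _ in finder.found})
    return {
        "licenses": licenses,
        "found_in_head": any(in_head for _, in_head in finder.found),
        # Assumption: any two different license paths count as a disagreement.
        "disagreement": len(licenses) > 1,
    }


sample = """<html><head>
<link rel="license" href="https://creativecommons.org/licenses/by/4.0/">
</head><body>
<a href="https://creativecommons.org/publicdomain/zero/1.0/">CC0</a>
<a href="https://example.com/about">About</a>
</body></html>"""

result = find_cc_licenses(sample)
print(result)
```

A real pipeline would also need to handle license versions, language variants of the same license URL, and other metadata formats, none of which this sketch attempts.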

🌐 In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was neither filtered on quality nor deduplicated, so that you can decide for yourself how much data to keep. To give some indication of quality, a dataset field indicates whether a document is included in the FineWeb(-2) datasets, which are of high quality.
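Since filtering and deduplication are left to the user, a first pass could look roughly like the sketch below. The records and every field name in it (`license_in_head`, `license_disagreement`, `in_fineweb`) are hypothetical stand-ins for illustration; check the dataset card for the actual column names.

```python
import hashlib

# Toy records; every field name here is hypothetical.
records = [
    {"text": "Doc A", "license": "cc-by-4.0", "license_in_head": True,
     "license_disagreement": False, "in_fineweb": True},
    {"text": "Doc A", "license": "cc-by-4.0", "license_in_head": True,
     "license_disagreement": False, "in_fineweb": True},  # exact duplicate
    {"text": "Doc B", "license": "cc0", "license_in_head": False,
     "license_disagreement": False, "in_fineweb": True},
    {"text": "Doc C", "license": "cc-by-sa-4.0", "license_in_head": True,
     "license_disagreement": True, "in_fineweb": False},
]


def is_trustworthy(record):
    """Keep documents with a consistent license in the head that FineWeb also kept."""
    return (
        record["license_in_head"]
        and not record["license_disagreement"]
        and record["in_fineweb"]
    )


def select(records):
    """Apply the filter, then drop exact duplicates by hashing the text."""
    seen = set()
    kept = []
    for record in records:
        if not is_trustworthy(record):
            continue
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(record)
    return kept


selected = select(records)
print([r["text"] for r in selected])  # prints ['Doc A']
```

Exact-match hashing only removes verbatim copies; near-duplicate detection would need a different technique.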

🔍 More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer but more is needed. If you have the compute available and wish to collaborate in an open and transparent manner, please get in touch!

alielfilali01

posted an update 7 days ago

Great effort from the @AtlasIA folks to adapt text2image models (Ghibli style) to the Moroccan context.

Read the blog here: https://huggingface.co/blog/atlasia/creating-your-custom-ghibli-text-to-image-model

jsaizant

authored a paper 15 days ago

Salamandra Technical Report

Paper • 2502.08489 • Published Feb 12

nouamanetazi

authored a paper about 1 month ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published about 1 month ago • 180

SivilTaram

authored 6 papers about 1 month ago

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

Paper • 2411.07763 • Published Nov 12, 2024 • 2

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Paper • 2503.18892 • Published Mar 24 • 30

SivilTaram

authored a paper about 2 months ago

SkyLadder: Better and Faster Pretraining via Context Window Scheduling

Paper • 2503.15450 • Published Mar 19 • 11

lbourdois

posted an update about 2 months ago

We introduce FAT5 (Flash Attention T5) ⚡

An implementation of T5 in PyTorch with the UL2 objective, optimized for GPGPU for both training and inference thanks to 13 different optimizations.
The main one is a CUDA kernel we designed to extend Flash Attention by @tridao with RPE biases; it also supports other positional encodings such as RoPE, ALiBi or FIRE.
The resulting kernel is 2 times faster than an SDPA implementation.
We also use Triton kernels to optimize certain parts of the architecture, such as the cross-entropy and the RMSNorm layer.
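To make "attention with a positional bias" concrete, here is a naive single-head reference in plain Python that computes softmax(QK^T / sqrt(d) + B)V. It is only a sketch of the math that a fused kernel would accelerate, not the FAT5 kernel itself; the function names and the toy distance-based bias (loosely in the spirit of ALiBi) are assumptions made for this example.

```python
import math


def attention_with_bias(q, k, v, bias):
    """Naive softmax(q k^T / sqrt(d) + bias) v for one head, using nested lists."""
    d = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # Scaled dot-product scores plus the additive positional bias.
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) + bias[i][j]
            for j, k_j in enumerate(k)
        ]
        # Numerically stable softmax: subtract the row maximum first.
        row_max = max(scores)
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_j[c] for w, v_j in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out


def distance_bias(n, slope):
    """Toy bias that penalizes distant positions: -slope * |i - j|."""
    return [[-slope * abs(i - j) for j in range(n)] for i in range(n)]


q = [[0.0, 0.0], [0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]

uniform = attention_with_bias(q, k, v, distance_bias(2, slope=0.0))
local = attention_with_bias(q, k, v, distance_bias(2, slope=1e9))
print(uniform)
print(local)
```

With a zero slope every position is weighted equally; with a very large slope each query attends only to its own position, which shows how the bias term alone can reshape the attention pattern.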

The various kernels have been carefully built to be compatible with BF16 and torch.compile to go even faster and achieve efficient pretraining.

All other optimizations are described in a 📝 subsequent blog post available on @huggingface 🤗: CATIE-AQ/FAT5-report.

As a proof of concept, this methodology enabled us to efficiently pretrain a 147M-parameter FAT5 in French in a reasonable time (1,461 hours for 419B tokens), with limited resources (1 A100, i.e. a computational budget of ~€1,900) and a low carbon footprint (13.5 kg CO2 eq).
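For a sense of scale, the figures quoted above can be turned into per-second and per-hour rates. The short calculation below uses only the numbers stated in the post; the variable names are ours.

```python
# Figures quoted in the post.
hours = 1461          # total pretraining time on a single A100
tokens = 419e9        # tokens seen during pretraining
budget_eur = 1900     # approximate computational budget
co2_kg = 13.5         # reported carbon footprint in kg CO2 eq

tokens_per_second = tokens / (hours * 3600)
eur_per_gpu_hour = budget_eur / hours
co2_grams_per_hour = co2_kg * 1000 / hours

print(f"{tokens_per_second:,.0f} tokens/s")
print(f"{eur_per_gpu_hour:.2f} EUR per GPU-hour")
print(f"{co2_grams_per_hour:.1f} g CO2 eq per hour")
```

That works out to roughly 80 thousand tokens per second and about €1.30 per GPU-hour.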

The model's weights are also available on Hugging Face: CATIE-AQ/FAT5-small.
It is not very useful in practice: it is a PoC and not an instruction-tuned model (that is planned for later).

All the code is available on GitHub if you want to pretrain your own model in your own language or for a specific domain: https://github.com/catie-aq/flashT5 ⭐

Finally, this was a joint project with @BorisAlbar at hf.co/CATIE-AQ.

SivilTaram

authored a paper 3 months ago

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Paper • 2502.14739 • Published Feb 20 • 103

alielfilali01

posted an update 3 months ago

🚨 Arabic LLM Evaluation 🚨

A few models joined the ranking of https://huggingface.co/spaces/inceptionai/AraGen-Leaderboard today.

The new MistralAI model, Saba, is quite impressive: top 10! Well done @arthurmensch and team.

Sadly, Mistral did not follow its usual strategy of releasing public weights this time. We hope this changes soon and that we get the model with a permissive license.

We added other Mistral models and, apparently, we have been sleeping on mistralai/Mistral-Large-Instruct-2411!

Another impressive model that joined the ranking today is ALLaM-AI/ALLaM-7B-Instruct-preview. After a long wait, ALLaM is finally here and it is IMPRESSIVE given its size!

ALLaM is ranked on OALL/Open-Arabic-LLM-Leaderboard as well.

hynky

authored a paper 3 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 229

guipenedo

authored a paper 3 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 229

yentinglin

authored a paper 4 months ago

Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

Paper • 2501.10799 • Published Jan 18 • 15

hynky

authored a paper 4 months ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 63

guipenedo

authored a paper 4 months ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 63

alielfilali01

posted an update 4 months ago

The 3C3H AraGen Leaderboard today welcomes deepseek-ai/DeepSeek-V3 and 12 other models (including the late gpt-3.5 💀) to the ranking of the best LLMs in Arabic!

Observations:
- DeepSeek-v3 ranked 3rd and is the only open model among the top 5!

- A 14B open model (Qwen/Qwen2.5-14B-Instruct) outperforms gpt-3.5-turbo-0125 (from last year). This shows how far we have come in advancing and supporting the Arabic presence within the LLM ecosystem!

- Contrary to what is observed in likelihood-acc leaderboards (like OALL/Open-Arabic-LLM-Leaderboard), further finetuned models like maldv/Qwentile2.5-32B-Instruct actually performed worse than the original model Qwen/Qwen2.5-32B-Instruct.
It is worth noting that the decrease is statistically insignificant, which implies that, at best, out-of-domain finetuning does not really hurt the original capabilities the model acquired during pretraining.
Previous work has addressed this (finetuning vs. pretraining), but more investigation is required (any PhDs here? This could be your question ...)

Check out the latest rankings: https://huggingface.co/spaces/inceptionai/AraGen-Leaderboard

ml-fw-prerelease

AI & ML interests

Recent Activity

ml-fw-prerelease's activity

Salamandra Technical Report

SmolVLM: Redefining small and efficient multimodal models

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

When Attention Sink Emerges in Language Models: An Empirical View

Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs

Predictive Data Selection: The Data That Predicts Is the Data That Teaches

Scaling up Masked Diffusion Models on Text

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

SkyLadder: Better and Faster Pretraining via Context Window Scheduling

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

Towards Best Practices for Open Datasets for LLM Training

Team members 27
