ml-fw-prerelease

Recent Activity

ml-fw-prerelease's activity

BramVanroy 
posted an update 4 days ago
📢💾 Introducing the Common Crawl Creative Commons Corpus (C5)!

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, to only documents that are Creative Commons-licensed such as cc-by-4.0 or public domain cc0. At this stage 150 billion tokens have been collected.

---
📄 data: BramVanroy/CommonCrawl-CreativeCommons
🧰 software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---

</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks and in metadata. Additional data fields are included, such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.
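To make the license collection step concrete, here is a minimal sketch using only Python's standard library. It is not the code from the CommonCrawl-CreativeCommons repository: the names `CCLicenseFinder`, `license_code` and `find_cc_licenses`, the output keys, and the "more than one distinct license code" test for disagreement are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# URL fragments that identify Creative Commons licenses and public-domain tools.
CC_MARKERS = ("creativecommons.org/licenses/", "creativecommons.org/publicdomain/")


class CCLicenseFinder(HTMLParser):
    """Hypothetical helper: collects CC license URLs from links and metadata."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = []  # one dict per license URL seen

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
            return
        attrs = dict(attrs)
        if tag in ("a", "link"):
            url = attrs.get("href")      # regular hyperlinks and <link rel="license">
        elif tag == "meta":
            url = attrs.get("content")   # metadata that carries a license URL
        else:
            return
        if url and any(marker in url.lower() for marker in CC_MARKERS):
            self.found.append({"url": url, "tag": tag, "in_head": self.in_head})

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def license_code(url):
    """Return the license abbreviation in a CC URL, e.g. 'by' or 'zero'."""
    lowered = url.lower()
    for marker in CC_MARKERS:
        if marker in lowered:
            return lowered.split(marker, 1)[1].split("/", 1)[0]
    return None


def find_cc_licenses(html):
    """Summarize the CC license links found in one HTML document."""
    finder = CCLicenseFinder()
    finder.feed(html)
    finder.close()
    hits = finder.found
    codes = {license_code(hit["url"]) for hit in hits}
    return {
        "licenses": hits,
        "found_in_head": any(hit["in_head"] for hit in hits),
        # Crude proxy (an assumption): different license codes count as a contradiction.
        "disagreement": len(codes) > 1,
    }
```

Calling `find_cc_licenses(page_html)` on a page returns the matched URLs together with the two boolean fields, which can then be used as filters in the same spirit as the dataset fields described above.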

🌐 In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was neither filtered on quality nor deduplicated, so that you can decide for yourself how much data to keep. To give some indication of quality, a dataset field describes whether a document is included in the FineWeb(-2) datasets, which are of high quality.

🔍 More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer but more is needed. If you have the compute available and wish to collaborate in an open and transparent manner, please get in touch!
alielfilali01 
posted an update 7 days ago
lbourdois 
posted an update about 2 months ago
We introduce FAT5 (Flash Attention T5) ⚡

An implementation of T5 in PyTorch with the UL2 objective, optimized for GPGPU for both training and inference thanks to 13 different optimizations.
The main one is a CUDA kernel we designed to extend Flash Attention by @tridao with RPE biases; it also supports other positional encodings such as RoPE, ALiBi or FIRE.
The resulting kernel is 2 times faster than an SDPA implementation.
We also use Triton kernels to optimize certain parts of the architecture, such as the cross-entropy and RMSNorm layer.
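For readers unfamiliar with attention biases, the following is a naive pure-Python reference of the quantity such a kernel computes: softmax(scale · QKᵀ + B) · V, where B is an additive position bias. It is only a sketch and has nothing in common with the fused FAT5 CUDA kernel in terms of implementation. The function names are hypothetical, the linear distance penalty is a simple ALiBi-style stand-in rather than T5's learned bias, and `scale` is left as a parameter because conventions differ between models.

```python
import math


def alibi_style_bias(n_queries, n_keys, slope=0.5):
    """Stand-in bias (an assumption): penalize attention linearly with distance."""
    return [[-slope * abs(i - j) for j in range(n_keys)] for i in range(n_queries)]


def attention_with_bias(q, k, v, bias, scale=1.0):
    """Naive reference: softmax(scale * q @ k^T + bias) @ v for a single head.

    q: list of query vectors, k: list of key vectors, v: list of value vectors,
    bias: matrix with one row per query and one column per key.
    """
    out = []
    for i, q_i in enumerate(q):
        scores = [
            scale * sum(a * b for a, b in zip(q_i, k_j)) + bias[i][j]
            for j, k_j in enumerate(k)
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_j[d] for w, v_j in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out
```

A fused kernel avoids materializing the full score matrix that this reference builds row by row, which is where the speed and memory gains of Flash Attention style implementations come from.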

The various kernels have been carefully built to be compatible with BF16 and torch.compile to go even faster and achieve efficient pretraining.

All other optimizations are described in a 📝 subsequent blog post available on @huggingface 🤗: CATIE-AQ/FAT5-report.

This methodology enabled us to efficiently pretrain as a proof of concept a FAT5 with 147M parameters in French in a reasonable time (1,461H for 419B tokens), with limited resources (1 A100 i.e. a computational budget of ~ €1,900) and a low carbon footprint (13.5kg eq CO2).
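As a quick sanity check on these figures, the snippet below derives the implied throughput and hourly cost. It uses only the numbers quoted above; since the budget is given as approximate, the derived values are approximate too, and the variable names are mine.

```python
# Figures quoted in the post (the budget is approximate).
hours = 1461          # wall-clock hours on a single A100
tokens = 419e9        # tokens seen during pretraining
budget_eur = 1900     # approximate computational budget in euros
co2_kg = 13.5         # reported carbon footprint in kg CO2 eq

tokens_per_second = tokens / (hours * 3600)
eur_per_gpu_hour = budget_eur / hours
co2_g_per_hour = co2_kg * 1000 / hours

print(f"throughput: {tokens_per_second:,.0f} tokens/s")
print(f"cost:       {eur_per_gpu_hour:.2f} EUR per GPU hour")
print(f"emissions:  {co2_g_per_hour:.1f} g CO2 eq per hour")
```

This works out to roughly 80 thousand tokens per second and about €1.30 per GPU hour.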

The model's weights are also available on Hugging Face: CATIE-AQ/FAT5-small.
It is not very useful in practice: it's a PoC and not an instruction-tuned model (that is planned for later).

All the code is available on GitHub if you want to pretrain your own model in your own language or for a specific domain: https://github.com/catie-aq/flashT5

Finally, this was a joint project with @BorisAlbar at hf.co/CATIE-AQ.
alielfilali01 
posted an update 3 months ago
🚨 Arabic LLM Evaluation 🚨

A few models joined the ranking of https://huggingface.co/spaces/inceptionai/AraGen-Leaderboard today.

The new MistralAI model, Saba, is quite impressive: top 10! Well done @arthurmensch and team.

Sadly, Mistral did not follow its usual strategy of public weights this time. We hope this changes soon and that we get the model under a permissive license.

We added other Mistral models and apparently, we have been sleeping on mistralai/Mistral-Large-Instruct-2411 !

Another impressive model that joined the ranking today is ALLaM-AI/ALLaM-7B-Instruct-preview. After a long wait, ALLaM is finally here and it is IMPRESSIVE given its size!

ALLaM is ranked on OALL/Open-Arabic-LLM-Leaderboard as well.
alielfilali01 
posted an update 4 months ago
3C3H AraGen Leaderboard welcomes today deepseek-ai/DeepSeek-V3 and 12 other models (including the late gpt-3.5 💀) to the ranking of best LLMs in Arabic !


Observations:
- DeepSeek-V3 ranked 3rd and is the only open model among the top 5!

- A 14B open model (Qwen/Qwen2.5-14B-Instruct) outperforms gpt-3.5-turbo-0125 (from last year). This shows how far we have come in advancing and supporting the Arabic presence within the LLM ecosystem!

- Contrary to what is observed in likelihood-acc leaderboards (like OALL/Open-Arabic-LLM-Leaderboard), further finetuned models like maldv/Qwentile2.5-32B-Instruct actually show decreased performance compared to the original model Qwen/Qwen2.5-32B-Instruct.
It's worth noting that the decrease is statistically insignificant, which implies that, at best, out-of-domain finetuning does not really hurt the model's original capabilities acquired during pretraining.
Previous work addressed this (finetuning VS pretraining) but more investigation in this regard is required (any PhDs here ? This could be your question ...)
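For anyone who wants to check a claim like "the decrease is statistically insignificant" on their own evaluation runs, one common option is a paired bootstrap over per-question scores. The sketch below is an illustration only: it is not the procedure used by the leaderboard, and the function name and parameters are hypothetical.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Confidence interval for the mean per-question difference (A minus B).

    scores_a and scores_b must hold scores for the same questions, in the same order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample questions with replacement and record the mean difference each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

If the returned interval contains 0, the difference between the finetuned and the original model is not distinguishable from noise at the chosen level, given these questions.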


Check out the latest rankings: https://huggingface.co/spaces/inceptionai/AraGen-Leaderboard