AI & ML interests

Web as a corpus, Large Language Models, Machine Translation, Language Technologies, Natural Language Processing

Recent Activity


BramVanroy posted an update 4 days ago
📢💾 Introducing the Common Crawl Creative Commons Corpus (C5)!

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, down to only those documents that carry a Creative Commons license (such as cc-by-4.0) or are in the public domain (cc0). At this stage, 150 billion tokens have been collected.

---
📄 data: BramVanroy/CommonCrawl-CreativeCommons
🧰 software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---

</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both from regular hyperlinks and from metadata. Additional data fields are included, such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.
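
For intuition, here is a minimal sketch of that detection idea in Python. It is not the actual C5 pipeline (that lives in the GitHub repository linked above), and the exact tags and patterns inspected there may differ:

```python
# Minimal sketch of CC-license detection in HTML; not the real C5 pipeline.
import re
from bs4 import BeautifulSoup

CC_PATTERN = re.compile(r"creativecommons\.org/(licenses|publicdomain)/\S+")

def find_cc_licenses(html: str):
    """Return (license_url_fragment, found_in_head) for every CC link found."""
    soup = BeautifulSoup(html, "html.parser")
    head = soup.head
    hits = []
    # Look in regular hyperlinks, <link> tags, and metadata-style <meta> tags.
    for tag in soup.find_all(["a", "link", "meta"]):
        target = tag.get("href") or tag.get("content") or ""
        match = CC_PATTERN.search(target)
        if match:
            in_head = head is not None and tag in head.descendants
            hits.append((match.group(0), in_head))
    return hits

print(find_cc_licenses(
    '<head><link rel="license" href="https://creativecommons.org/licenses/by/4.0/"></head>'
))
# [('creativecommons.org/licenses/by/4.0/', True)]
```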

🌐 In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch, and Spanish). The language set was limited for two reasons: computational and storage constraints, and a collaboration with GPT-NL, which requested CC data in these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. The data has not been filtered on quality or deduplicated, so you can decide for yourself how much of it to keep. As a quality indication, a dataset field describes whether a document is also included in the FineWeb(-2) datasets, which are of high quality.
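
Because those fields are stored per document, filtering can be done directly with the datasets library. A rough sketch follows; the config name ("nld") and the column names (license_in_head, license_disagreement, found_in_fw) are my assumptions, so check the dataset card for the actual schema:

```python
# Hypothetical filtering sketch; config name and column names are assumed,
# not taken from the released schema -- verify them on the dataset card.
from datasets import load_dataset

ds = load_dataset(
    "BramVanroy/CommonCrawl-CreativeCommons",
    "nld",           # assumed per-language config
    split="train",
    streaming=True,  # the corpus is large, so stream rather than download
)

def unambiguous_cc(example):
    # Keep documents whose license sits in <head>, is not contradicted by a
    # second license, and that also appear in FineWeb(-2) as a quality proxy.
    return (
        example["license_in_head"]
        and not example["license_disagreement"]
        and example["found_in_fw"]
    )

for doc in filter(unambiguous_cc, ds):
    print(doc["text"][:200])
    break
```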

🔍 More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging, because it means there is a lot more Creative Commons data to be collected! But to get there I need help with compute. The current processing was already heavily sponsored by the Flemish Supercomputer, but more is needed. If you have compute available and wish to collaborate in an open and transparent manner, please get in touch!
davanstrien posted an update 15 days ago
Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:

- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows a 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model

It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.

I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.

Dataset can be found here: marcodsn/academic-chains (give it a like!)
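
A quick way to peek at the data yourself (assuming a default "train" split; I have not checked the split names or columns, so this only prints whatever fields the release uses):

```python
# Inspect the first record without assuming any particular column names.
from datasets import load_dataset

chains = load_dataset("marcodsn/academic-chains", split="train")
for key, value in chains[0].items():
    print(f"{key}: {str(value)[:120]}")
```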
ltgoslo updated a Space 28 days ago
davanstrien posted an update 29 days ago
I've created a v1 dataset (davanstrien/reasoning-required) and model (davanstrien/ModernBERT-based-Reasoning-Required) to help curate "wild text" data for generating reasoning examples beyond the usual code/math/science domains.

- I developed a "Reasoning Required" dataset with a 0-4 scoring system for reasoning complexity
- I used educational content from HuggingFaceFW/fineweb-edu, adding annotations for domains, reasoning types, and example questions

My approach enables a more efficient workflow: filter text with small models first, then use LLMs only on high-value content.
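
As a rough sketch of that workflow (assuming the classifier loads as a standard text-classification pipeline and that its raw output tracks the 0-4 scale; both should be verified against the model card):

```python
# Cheap first-pass filter: score documents with the small classifier and only
# send high-scoring ones to an expensive LLM for reasoning-example generation.
from transformers import pipeline

scorer = pipeline(
    "text-classification",
    model="davanstrien/ModernBERT-based-Reasoning-Required",
)

documents = [
    "The bus leaves at 9am from the main square.",
    "Weigh the two competing interpretations of the evidence and justify "
    "which conclusion they best support.",
]

THRESHOLD = 2.0  # hypothetical cut-off on the assumed 0-4 scale

for doc in documents:
    # function_to_apply="none" returns the raw head output (assumed here to
    # correspond to the 0-4 reasoning-complexity score).
    score = scorer(doc, function_to_apply="none")[0]["score"]
    verdict = "send to LLM" if score >= THRESHOLD else "skip"
    print(f"{score:.2f} -> {verdict}: {doc[:50]}")
```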

This significantly reduces computation costs while expanding reasoning dataset domain coverage.