Wikimedia Movement

AI & ML interests

Love creating and curating datasets, cataloging and describing tools, services, and collections, and contributing to training :)

Wikimedians' activity

davanstrien posted an update 16 days ago
Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:

- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows a 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model

It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.

I'm personally very excited about datasets like this, which involve creativity in their creation rather than just relying on $$$ to produce a big dataset with little novelty.

The dataset can be found here: marcodsn/academic-chains (give it a like!)
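
If you want a quick look, here's a minimal loading sketch; it assumes the default config and a train split, and prints the features rather than assuming any field names:

```python
# Minimal sketch: load the dataset for a first look. Assumes the default
# config and a "train" split; check the dataset card if this errors.
from datasets import load_dataset

ds = load_dataset("marcodsn/academic-chains", split="train")
print(ds)     # features and row count
print(ds[0])  # one reasoning-chain example
```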
davanstrien posted an update 29 days ago
I've created a v1 dataset (davanstrien/reasoning-required) and model (davanstrien/ModernBERT-based-Reasoning-Required) to help curate "wild text" data for generating reasoning examples beyond the usual code/math/science domains.

- I developed a "Reasoning Required" dataset with a 0-4 scoring system for reasoning complexity
- I used educational content from HuggingFaceFW/fineweb-edu, adding annotations for domains, reasoning types, and example questions

My approach enables a more efficient workflow: filter text with small models first, then use LLMs only on high-value content.

This significantly reduces computation costs while expanding reasoning dataset domain coverage.
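
As a rough illustration of that filter-first workflow, here's a minimal sketch. It assumes the model exposes a standard text-classification head; the score handling and the 0.8 threshold are illustrative, not taken from the model card:

```python
# Sketch of the filter-first workflow: score "wild text" with a small
# classifier, then send only high-value passages to the expensive LLM stage.
# Assumes a standard text-classification head; threshold is illustrative.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="davanstrien/ModernBERT-based-Reasoning-Required",
)

texts = ["The court's ruling hinged on two competing precedents..."]
preds = classifier(texts, truncation=True)

# Keep only passages the small model scores as reasoning-heavy.
high_value = [t for t, p in zip(texts, preds) if p["score"] >= 0.8]
```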
BrigitteTousi posted an update about 1 month ago
AI agents are transforming how we interact with technology, but how sustainable are they? 🌍

Design choices, like model size and structure, can massively impact energy use and cost. ⚡💰 The key takeaway: smaller, task-specific models can be far more efficient than large, general-purpose ones.

🔑 Open-source models offer greater transparency, allowing us to track energy consumption and make more informed decisions on deployment. 🌱 Open-source = more efficient, eco-friendly, and accountable AI.

Read our latest post, led by @sasha with assists from myself + @yjernite 🤗
https://huggingface.co/blog/sasha/ai-agent-sustainability
BrigitteTousi posted an update about 2 months ago
Whether X is down or not, so glad I can rely on HF Posts for AI news ❤️🤗
davanstrien posted an update 2 months ago
📊 Introducing "Hugging Face Dataset Spotlight" 📊

I'm excited to share the first episode of our AI-generated podcast series focusing on nice datasets from the Hugging Face Hub!

This first episode explores mathematical reasoning datasets:

- SynthLabsAI/Big-Math-RL-Verified: Over 250,000 rigorously verified problems spanning multiple difficulty levels and mathematical domains
- open-r1/OpenR1-Math-220k: 220,000 math problems with multiple reasoning traces, verified for accuracy using Math Verify and Llama-3.3-70B models.
- facebook/natural_reasoning: 1.1 million general reasoning questions carefully deduplicated and decontaminated from existing benchmarks, showing superior scaling effects when training models like Llama3.1-8B-Instruct.

Plus a bonus segment on bespokelabs/bespoke-manim!

https://www.youtube.com/watch?v=-TgmRq45tW4
davanstrien posted an update 2 months ago
Quick POC: Turn a Hugging Face dataset card into a short podcast introducing the dataset using all open models.

I think I'm the only weirdo who would enjoy listening to something like this though 😅

Here is an example for eth-nlped/stepverify
davanstrien posted an update 3 months ago
Hacked together a way to log trl GRPO training completions to a 🤗 dataset repo. This allows you to:

- Track rewards from multiple reward functions
- Treat the completion and rewards from training as a "proper" dataset and do EDA
- Share results for open science

The implementation is super hacky, but I'm curious if people would find this useful.

To push completions to the Hub, you just need two extra parameters:

log_completions=True
log_completions_hub_repo='your-username/repo-name'

Example dataset: davanstrien/test-logs
Colab: https://colab.research.google.com/drive/1wzBFPVthRYYTp-mEYlznLg_e_0Za1M3g
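
For orientation, here's a hypothetical sketch of where those two parameters would go, assuming the patched trl from the Colab above; log_completions_hub_repo is part of the hack, not upstream trl:

```python
# Hypothetical usage, assuming a patched trl where GRPOConfig accepts the
# extra hub-repo parameter (log_completions exists upstream; the repo
# parameter is the hack described above).
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="grpo-demo",
    log_completions=True,
    log_completions_hub_repo="your-username/repo-name",
)
```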

frimelle posted an update 3 months ago
What’s in a name? More than you might think, especially for AI.
Whenever I introduce myself, people often start speaking French to me, even though my French is très basic. It turns out that AI systems do something similar:
Large language models infer cultural identity from names, shaping their responses based on presumed backgrounds. But is this helpful personalization or a reinforcement of stereotypes?
In our latest paper, we explored this question by testing DeepSeek, Llama, Aya, Mistral-Nemo, and GPT-4o-mini on how they associate names with cultural identities. We analysed 900 names from 30 cultures and found strong assumptions baked into AI responses: some cultures were overrepresented, while others barely registered.
For example, a name like "Jun" often triggered Japan-related responses, while "Carlos" was linked primarily to Mexico, even though these names exist in multiple countries. Meanwhile, names from places like Ireland led to more generic answers, suggesting weaker associations in the training data.
This has real implications for AI fairness: How should AI systems personalize without stereotyping? Should they adapt at all based on a name?
Work with some of my favourite researchers: @sidicity, Arnav Arora, and @IAugenstein
Read the full paper here: Presumed Cultural Identity: How Names Shape LLM Responses (2502.11995)
davanstrien posted an update 3 months ago
How do you make 1M+ Hugging Face models & datasets more discoverable?

davanstrien/Smol-Hub-tldr!

I fine-tuned HuggingFaceTB/SmolLM2-360M to generate one-line summaries from a model or dataset README.

Its own self-description?
"A model for generating concise summaries of model & dataset cards from the Hugging Face Hub"

The goal? Make it easier to find the right models and datasets for your specific needs. It's already powering a Space for semantic dataset search.

It's still a WIP, but thanks to @loubnabnl, @anton-l, @eliebak et al. for cooking up such a nice base model for fine-tuning small, efficient models for specific domains and tasks. 🙏
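
As a rough sketch of how you might try it, assuming a plain causal-LM interface (check the model card for the expected prompt or chat template, which this skips):

```python
# Minimal sketch: summarise a card with the fine-tuned model. Assumes a
# plain causal-LM interface; the model card may specify a prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "davanstrien/Smol-Hub-tldr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

card_text = "..."  # paste a model or dataset card README here
inputs = tokenizer(card_text, return_tensors="pt", truncation=True)
output = model.generate(**inputs, max_new_tokens=40)
summary = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(summary)  # one-line summary of the card
```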
frimelle posted an update 3 months ago
I was quoted in an article about the French Lucie AI in La Presse. While I love the name for obvious reasons 👀, there were still a lot of problems with the model and with how and when it was deployed. Nevertheless, seeing new, smaller models being developed is an exciting direction for the next years of AI development!

https://www.lapresse.ca/affaires/techno/2025-02-02/radioscopie/lucie-l-ia-francaise-qui-ne-passe-pas-le-test.php

Also fun to see my comments in French.
frimelle posted an update 3 months ago
Seeing AI develop has been a wild ride, from trying to explain why we'd bother to generate a single sentence with a *neural network* to explaining that AI is not a magic, all-knowing box. The recent weeks and months have involved a lot of talking about how AI works: to policy makers, to other developers, but also, and mainly, to friends and family without a technical background.

Yesterday, the first provisions of the EU AI Act came into force, and one of the key highlights is the AI literacy requirement for organisations deploying AI systems. This isn't just a box-ticking exercise. Ensuring that employees and stakeholders understand AI systems is crucial for fostering responsible and transparent AI development. From recognising biases to understanding model limitations, AI literacy empowers individuals to engage critically with these technologies and make informed decisions.

In the context of Hugging Face, AI literacy has many facets: allowing more people to contribute to AI development, providing courses and documentation to ensure access, and building accessible AI tools that help users understand how AI systems function. This isn't just a regulatory milestone; it's an opportunity to foster a culture where AI literacy becomes foundational, enabling stakeholders to recognise biases, assess model limitations, and engage critically with technology.

Embedding these principles into daily practice, and eventually extending our learnings in AI literacy to the general public, is essential for building trustworthy AI that aligns with societal values.
davanstrien posted an update 3 months ago
🌍 Big step for multilingual AI data!

The Hugging Face community has rated educational content in languages spoken by 1.6 billion people! New additions:
• Japanese
• Italian
• Old High German

Learn more and contribute: https://huggingface.co/blog/davanstrien/fineweb2-community

These ratings can help enhance training data for major world languages.
davanstrien posted an update 4 months ago
Introducing davanstrien/scandi-fine-web-cleaner, the first model trained on FineWeb-C community annotations!

FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?

Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.

Today, I'm happy to share the first classifier trained on this data.

πŸ” What we've built:

- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute

🌍 Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.

Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
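
As a rough sketch of using the classifier to filter documents (the label name and threshold handling here are assumptions; check the model card before relying on them):

```python
# Minimal sketch: drop low-quality documents with the classifier. The label
# name "problematic" is an assumption; verify it against the model card.
from transformers import pipeline

cleaner = pipeline(
    "text-classification", model="davanstrien/scandi-fine-web-cleaner"
)

docs = ["..."]  # Danish or Swedish FineWeb2 documents
preds = cleaner(docs, truncation=True)
kept = [d for d, p in zip(docs, preds) if p["label"] != "problematic"]
```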