Daniel van Strien (davanstrien)

AI & ML interests

Machine Learning Librarian

Recent Activity

new activity 17 minutes ago
Eureka-Lab/PHYBench: fix yaml metadata
liked a dataset 22 minutes ago
Major-TOM/Core-S2L1C-DeCUR
liked a dataset 22 minutes ago
Major-TOM/Core-S1RTC-DeCUR

Organizations

Hugging Face, Notebooks-explorers, Nasjonalbiblioteket AI Lab, Living with Machines, BigScience Workshop, Spaces-explorers, BigScience Catalogue Data, Hacks/Hackers, BigScience: LMs for Historical Texts, flyswot, Webhooks Explorers (BETA), HuggingFaceM4, Open Access AI Collective, HF Canonical Model Maintainers, BigLAM: BigScience Libraries, Archives and Museums, Hugging Face OSS Metrics, ImageIN, Stable Diffusion Bias Eval, Librarian Bots, Blog-explorers, Hacktoberfest 2023, Hugging Face Smol Models Research, geospatial, HPLT, HF-IA-archiving, 2A2I Legacy Models & Datasets, testy, DIBT-for-Klingon, Wikimedia Movement, DIBT-for-Esperanto, Journalists on Hugging Face, PleIAs, Persian AI Community, HuggingFaceFW, Data Is Better Together, Social Post Explorers, OMOTO AI, academic-datasets, HuggingFaceFW-Dev, Hugging Face Discord Community, UCSF-JHU Opioid Industry Documents Archive, Dataset Tools, PDFPages, dibt-private, Data Is Better Together Contributor, Bluesky Community, Open R1, Reasoning datasets competition

davanstrien's activity

reacted to their post with 🔥 5 days ago
Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:

- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model

It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.

I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.

Dataset can be found here: marcodsn/academic-chains (give it a like!)
posted an update 5 days ago
reacted to jasoncorkill's post with 🔥 17 days ago
🔥 Yesterday was a fire day!
We dropped two brand-new datasets capturing Human Preferences for text-to-video and text-to-image generations powered by our own crowdsourcing tool!

Whether you're working on model evaluation, alignment, or fine-tuning, this is for you.

1. Text-to-Video Dataset (Pika 2.2 model):
Rapidata/text-2-video-human-preferences-pika2.2

2. Text-to-Image Dataset (Reve-AI Halfmoon):
Rapidata/Reve-AI-Halfmoon_t2i_human_preference

Let's train AI on AI-generated content with humans in the loop.
Let's make generative models that actually get us.
reacted to ajibawa-2023's post with 🔥 17 days ago
Hi All, I recently released two Audio datasets which are generated using my earlier released dataset: ajibawa-2023/Children-Stories-Collection

First Audio Dataset: https://huggingface.co/datasets/ajibawa-2023/Audio-Children-Stories-Collection-Large has 5,600+ stories in .mp3 format.

Second Audio Dataset: https://huggingface.co/datasets/ajibawa-2023/Audio-Children-Stories-Collection has 600 stories in .mp3 format.
reacted to jasoncorkill's post with 🚀🔥 17 days ago
🚀 We tried something new!

We just published a dataset using a new (for us) preference modality: direct ranking based on aesthetic preference. We ranked a couple of thousand images from most to least preferred, all sampled from the Open Image Preferences v1 dataset by the amazing @data-is-better-together team.

📊 Check it out here:
Rapidata/2k-ranked-images-open-image-preferences-v1

We're really curious to hear your thoughts!
Is this kind of ranking interesting or useful to you? Let us know! 💬

If it is, please consider leaving a ❤️ and if we hit 30 ❤️s, we'll go ahead and rank the full 17k image dataset!
replied to jasoncorkill's post 17 days ago

This is very cool! I've always been curious about doing something like this! It could be quite interesting to train an "aesthetic preference model" on this kind of dataset, and then to try using it as a reward model for image-gen training...

cc @sayakpaul @multimodalart @linoyts @davidberenstein1957 who might also find this data interesting :)

reacted to jasoncorkill's post with ❤️ 17 days ago
posted an update 19 days ago
I've created a v1 dataset (davanstrien/reasoning-required) and model (davanstrien/ModernBERT-based-Reasoning-Required) to help curate "wild text" data for generating reasoning examples beyond the usual code/math/science domains.

- I developed a "Reasoning Required" dataset with a 0-4 scoring system for reasoning complexity
- I used educational content from HuggingFaceFW/fineweb-edu, adding annotations for domains, reasoning types, and example questions

My approach enables a more efficient workflow: filter text with small models first, then use LLMs only on high-value content.

This significantly reduces computation costs while expanding reasoning dataset domain coverage.
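The filter-first workflow above can be sketched in a few lines. Note that `toy_score` below is a hypothetical stand-in for the actual ModernBERT classifier, which assigns a 0-4 reasoning-complexity score to each passage:

```python
# Sketch of the two-stage curation idea: a cheap scorer rates each passage
# for reasoning complexity (0-4), and only high-scoring passages are
# forwarded to an expensive LLM for reasoning-example generation.

def filter_for_llm(passages, score_fn, threshold=3):
    """Return only passages whose reasoning-complexity score meets the threshold."""
    return [p for p in passages if score_fn(p) >= threshold]

def toy_score(text):
    # Crude stand-in: longer, question-laden text scores higher.
    score = 0
    if len(text.split()) > 10:
        score += 2
    if "?" in text or "because" in text:
        score += 2
    return min(score, 4)

passages = [
    "The sky is blue.",
    "Why does raising interest rates reduce inflation? One mechanism is that "
    "borrowing becomes more expensive, because demand for credit falls.",
]
kept = filter_for_llm(passages, toy_score)  # only the second passage survives
```

In the real pipeline the small classifier is cheap enough to run over millions of documents, so the LLM only ever sees the high-value slice.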
posted an update about 2 months ago
📊 Introducing "Hugging Face Dataset Spotlight" 📊

I'm excited to share the first episode of our AI-generated podcast series focusing on nice datasets from the Hugging Face Hub!

This first episode explores mathematical reasoning datasets:

- SynthLabsAI/Big-Math-RL-Verified: Over 250,000 rigorously verified problems spanning multiple difficulty levels and mathematical domains
- open-r1/OpenR1-Math-220k: 220,000 math problems with multiple reasoning traces, verified for accuracy using Math Verify and Llama-3.3-70B models.
- facebook/natural_reasoning: 1.1 million general reasoning questions carefully deduplicated and decontaminated from existing benchmarks, showing superior scaling effects when training models like Llama3.1-8B-Instruct.

Plus a bonus segment on bespokelabs/bespoke-manim!

https://www.youtube.com/watch?v=-TgmRq45tW4
reacted to stefan-it's post with 🔥 about 2 months ago
After running some 3DMark and FurMark benchmarks on Windows to make sure that my new 5090 is not melting its cables [1], and after taking some nice shots with a thermal camera (I don't think that's too much), running fine-tuning experiments with my favorite Flair & Transformers libraries is very easy.

Important steps:

A good idea is to start with a fresh Ubuntu 24.04 installation with the latest CUDA 12.8 and the open NVIDIA driver - follow more advice from [2]:

sudo apt -y install cuda-toolkit-12-8 nvidia-open

I tried updating from an existing Ubuntu installation with an older CUDA and driver version, and it resulted in a non-startable system.

If you are using PyTorch 2.6 built with CUDA 12.6, it will result in:

NVIDIA Graphics Device with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.

But no worries! For PyTorch you just need to use a nightly 2.7 version that was built with CUDA 12.8. This can easily be done via:

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128

After that the latest Flair version can be installed and fine-tuning will work!

References:

[1]: https://www.reddit.com/r/nvidia/comments/1inpox7/rtx_50_series_12vhpwr_megathread/
[2]: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=24.04&target_type=deb_network
posted an update about 2 months ago
Quick POC: Turn a Hugging Face dataset card into a short podcast introducing the dataset using all open models.

I think I'm the only weirdo who would enjoy listening to something like this though 😅

Here is an example for eth-nlped/stepverify
posted an update 2 months ago
Hacked together a way to log trl GRPO training completions to a 🤗 dataset repo. This allows you to:

- Track rewards from multiple reward functions
- Treat the completion and rewards from training as a "proper" dataset and do EDA
- Share results for open science

The implementation is super hacky, but I'm curious if people would find this useful.

To push completions to the Hub, you just need two extra parameters:

log_completions=True
log_completions_hub_repo='your-username/repo-name'

Example dataset: davanstrien/test-logs
Colab: https://colab.research.google.com/drive/1wzBFPVthRYYTp-mEYlznLg_e_0Za1M3g
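The shape of the logged data is roughly the following. This is a hypothetical sketch of the idea, not the actual implementation in the Colab; the function and column names are illustrative:

```python
# Sketch: each training step yields completions plus a score from every
# reward function; flattening these into rows gives a dataset you can
# push to the Hub and explore (EDA, comparing reward functions, etc.).

def completions_to_rows(step, prompts, completions, rewards):
    """rewards maps reward-function name -> list of per-completion scores."""
    rows = []
    for i, (prompt, completion) in enumerate(zip(prompts, completions)):
        row = {"step": step, "prompt": prompt, "completion": completion}
        for name, scores in rewards.items():
            # One column per reward function, so they can be compared later.
            row[f"reward_{name}"] = scores[i]
        rows.append(row)
    return rows

rows = completions_to_rows(
    step=42,
    prompts=["2+2=?"],
    completions=["<think>2+2=4</think> 4"],
    rewards={"format": [1.0], "accuracy": [1.0]},
)
# rows could then be appended to a Hub dataset repo, e.g. via
# Dataset.from_list(rows).push_to_hub(log_completions_hub_repo)
```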

posted an update 2 months ago
How do you make 1M+ Hugging Face models & datasets more discoverable?

davanstrien/Smol-Hub-tldr!

I fine-tuned HuggingFaceTB/SmolLM2-360M to generate one-line summaries from a model or dataset README.

Its own self-description?
"A model for generating concise summaries of model & dataset cards from the Hugging Face Hub"

The goal? Make it easier to find the right models and datasets for your specific needs. It's already powering a semantic search for datasets Space.

It's still a WIP, but thanks to @loubnabnl, @anton-l, @eliebak et al. for cooking such a nice base model for fine-tuning small, efficient models for specific domains and tasks. 🙏
posted an update 2 months ago
reacted to Ihor's post with 🚀 3 months ago
🚀 Reproducing DeepSeek R1 for Text-to-Graph Extraction

I've been working on replicating DeepSeek R1, focusing on zero-shot text-to-graph extraction: a challenging task where LMs extract entities and relations from text based on predefined types.

🧠 Key Insight:
Language models struggle when constrained by entity/relation types. Supervised training alone isn't enough, but reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), shows promise.

💡 Why GRPO?
It trains the model to generate structured graphs, optimizing multiple reward functions (format, JSON validity, and extraction accuracy).
It allows the model to learn from both positive and hard negative examples dynamically.
RL can be fine-tuned to emphasize relation extraction improvements.
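One of the reward signals named above (JSON validity) can be written as a plain callable that returns one score per completion, which is the general shape GRPO reward functions take. This is a hypothetical reconstruction for illustration, not the post's actual code, and the expected "entities"/"relations" keys are assumed:

```python
import json

# Sketch of a JSON-validity reward for graph extraction: a completion that
# parses as JSON with "entities" and "relations" keys scores 1.0; anything
# else (invalid JSON, wrong structure) scores 0.0.

def json_graph_reward(completions):
    scores = []
    for completion in completions:
        try:
            graph = json.loads(completion)
            ok = (
                isinstance(graph, dict)
                and "entities" in graph
                and "relations" in graph
            )
            scores.append(1.0 if ok else 0.0)
        except json.JSONDecodeError:
            scores.append(0.0)
    return scores

scores = json_graph_reward([
    '{"entities": ["Alice"], "relations": []}',  # valid graph -> 1.0
    "not json at all",                           # invalid -> 0.0
])
```

Combining several such functions (format, validity, extraction accuracy) gives the multi-reward setup described above.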

📊 Early Results:
Even with limited training, F1 scores consistently improved, and we saw clear benefits from RL-based optimization. More training = better performance!

🔬 Next Steps:
We're scaling up experiments with larger models and high-quality data. Stay tuned for updates! Meanwhile, check out one of our experimental models here:
Ihor/Text2Graph-R1-Qwen2.5-0.5b

📔 Learn more details in the blog post: https://medium.com/p/d8b648d9f419

Feel free to share your thoughts and ask questions!
posted an update 3 months ago
posted an update 3 months ago
reacted to fdaudens's post with ❤️ 3 months ago
Yes, DeepSeek R1's release is impressive. But the real story is what happened in just 7 days after:

- Original release: 8 models, 540K downloads. Just the beginning...

- The community turned those open-weight models into 550+ NEW models on Hugging Face. Total downloads? 2.5M, nearly 5X the originals.

The reason? DeepSeek models are open-weight, letting anyone build on top of them. Interesting to note that the community focused on quantized versions for better efficiency & accessibility. They want models that use less memory, run faster, and are more energy-efficient.

When you empower builders, innovation explodes. For everyone. 🚀

The most popular community model? @bartowski's DeepSeek-R1-Distill-Qwen-32B-GGUF version, with 1M downloads alone.