BigLAM: BigScience Libraries, Archives and Museums

non-profit

AI & ML interests

🤗 Hugging Face x 🌸 BigScience initiative to create open source community resources for LAMs.

Recent Activity

biglam's activity

christopher
in biglam/loc_beyond_words about 10 hours ago

[bot] Conversion to Parquet

#4 opened about 22 hours ago by
parquet-converter
davanstrien
updated a Space about 15 hours ago
davanstrien
in biglam/README about 15 hours ago

make shorter

#2 opened about 15 hours ago by
davanstrien

Update README.md

#1 opened about 15 hours ago by
davanstrien
alielfilali01
posted an update 8 days ago
davanstrien
posted an update 16 days ago
Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:

- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model

It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.

I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.

Dataset can be found here: marcodsn/academic-chains (give it a like!)
albertvillanova
posted an update 17 days ago
smolagents v1.14.0 is out! 🚀
🔌 MCPClient: A sleek new client for connecting to remote MCP servers, making integrations more flexible and scalable.
🪨 Amazon Bedrock: Native support for Bedrock-hosted models.
SmolAgents is now more powerful, flexible, and enterprise-ready. 💼

Full release 👉 https://github.com/huggingface/smolagents/releases/tag/v1.14.0
#smolagents #LLM #AgenticAI
davanstrien
posted an update 29 days ago
I've created a v1 dataset (davanstrien/reasoning-required) and model (davanstrien/ModernBERT-based-Reasoning-Required) to help curate "wild text" data for generating reasoning examples beyond the usual code/math/science domains.

- I developed a "Reasoning Required" dataset with a 0-4 scoring system for reasoning complexity
- I used educational content from HuggingFaceFW/fineweb-edu, adding annotations for domains, reasoning types, and example questions

My approach enables a more efficient workflow: filter text with small models first, then use LLMs only on high-value content.

This significantly reduces computation costs while expanding reasoning dataset domain coverage.
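
The filter-then-generate workflow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `cheap_reasoning_score` is a hypothetical keyword heuristic standing in for a small classifier that outputs the 0-4 score (it is not the ModernBERT model from the post), and `expensive_fn` is a placeholder for whatever LLM call generates the reasoning examples.

```python
import re
from typing import Callable, Iterable, List, Tuple


def cheap_reasoning_score(text: str) -> int:
    """Hypothetical stand-in for a small classifier; returns a score from 0 to 4.

    It simply counts distinct reasoning cue words. A real setup would call a
    small fine-tuned model here instead.
    """
    cues = ("because", "therefore", "however", "if", "why")
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = sum(1 for cue in cues if cue in words)
    return min(hits, 4)


def filter_then_generate(
    texts: Iterable[str],
    score_fn: Callable[[str], int],
    expensive_fn: Callable[[str], str],
    min_score: int = 2,
) -> List[Tuple[str, str]]:
    """Run the cheap scorer on every text; call the expensive step only on
    texts whose score is at least `min_score`."""
    results = []
    for text in texts:
        if score_fn(text) >= min_score:
            results.append((text, expensive_fn(text)))
    return results
```

The threshold `min_score=2` is an arbitrary choice for the sketch. Raising it sends fewer texts to the expensive step at the cost of discarding more borderline material.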