AI & ML interests

None defined yet.

Recent Activity

sasha updated a dataset 5 days ago
AIEnergyScore/requests_debug
sasha authored a paper about 1 month ago
Fully Autonomous AI Agents Should Not be Developed
meg authored a paper about 1 month ago
Fully Autonomous AI Agents Should Not be Developed

AIEnergyScore's activity

meg
posted an update 16 days ago
m-ric
posted an update 20 days ago
New king of open VLMs: InternVL3 takes Qwen 2.5's crown! 👑

InternVL has been a wildly successful series of models, and the latest iteration has just taken back the crown thanks to its superior, natively multimodal vision training pipeline.

➡️ Most vision language models (VLMs) these days are built like Frankenstein's monster: take a good text-only Large Language Model (LLM) backbone and stitch a vision transformer (ViT) on top of it. Training is then sequential 🔢: 1. Freeze the LLM weights and train only the ViT to align with the LLM, then 2. Unfreeze all weights and train everything to work together.

💫 The Shanghai AI Lab decided to challenge this paradigm with an approach they call "native". For each model size, they still start from a good LLM (mostly the Qwen-2.5 series, did I tell you I'm a huge fan of Qwen? ❤️) and stitch on the ViT, but they don't freeze anything: they train all weights together on interleaved text and image-understanding data in a single pre-training phase 🎨.
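
Here's a minimal sketch of the difference between the two recipes (my own illustration in PyTorch, not InternVL3's actual code; the ToyVLM wrapper is hypothetical):

```python
# Illustrative only: classic two-stage VLM training vs. "native" joint training.
import torch.nn as nn

class ToyVLM(nn.Module):
    """Hypothetical wrapper: a ViT feeding projected vision tokens into an LLM."""
    def __init__(self, vit: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vit, self.projector, self.llm = vit, projector, llm

def two_stage_setup(model: ToyVLM) -> None:
    # Stage 1: freeze the LLM and train only the vision side to align with it.
    for p in model.llm.parameters():
        p.requires_grad = False
    # ... train on image-text data, then unfreeze everything for stage 2 ...
    for p in model.parameters():
        p.requires_grad = True

def native_setup(model: ToyVLM) -> None:
    # "Native" recipe: every weight stays trainable from the start, trained on
    # interleaved text and image data in a single pre-training phase.
    for p in model.parameters():
        p.requires_grad = True
```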

They claim it results in more seamless interactions between modalities. And the results prove them right: they took the crown of top VLMs, at nearly all sizes, from their Qwen-2.5 parents. 👑
yjernite
posted an update 22 days ago
Today in Privacy & AI Tooling - introducing a nifty new tool to examine where data goes in open-source apps on 🤗

HF Spaces have tons (100Ks!) of cool demos leveraging or examining AI systems - and because most of them are OSS we can see exactly how they handle user data 📚🔍

That requires actually reading the code though, which isn't always easy or quick! Good news: code LMs have gotten pretty good at automatic review, so we can offload some of the work - here I'm using Qwen/Qwen2.5-Coder-32B-Instruct to generate reports and it works pretty OK 🙌

The app works in four stages (a rough sketch of the review step follows this list):
1. Download all code files
2. Use the Code LM to generate a detailed report pointing to code where data is transferred/(AI-)processed (screen 1)
3. Summarize the app's main functionality and data journeys (screen 2)
4. Build a Privacy TLDR with those inputs
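
Here's a rough sketch of what the report-generation step (stage 2) could look like with the Inference API - illustrative only, the Space's actual prompts and parsing may differ:

```python
from huggingface_hub import InferenceClient

# Illustrative only: ask a code LM where a Space's file handles user data.
client = InferenceClient("Qwen/Qwen2.5-Coder-32B-Instruct")

def review_file(path: str, source: str) -> str:
    prompt = (
        f"Review the file {path} from a Hugging Face Space. List every place "
        "where user data is collected, sent to an external service, or "
        "processed by an AI model, citing the relevant lines.\n\n" + source
    )
    response = client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )
    return response.choices[0].message.content
```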

It comes with a bunch of pre-reviewed apps/Spaces, great to see how many process data locally or through (private) HF endpoints 🤗

Note that this is a POC, lots of exciting work to do to make it more robust, so:
- try it: yjernite/space-privacy
- reach out to collab: yjernite/space-privacy
m-ric
posted an update about 1 month ago
🚀 The DeepSeek R1 moment has come for GUI agents: rule-based reinforcement learning gives better results than SFT with 500x smaller datasets!

Traditionally (by which I mean "in the last few months"), GUI agents have been trained with supervised fine-tuning (SFT). This meant collecting huge datasets of screen captures from people using computers, and using them to fine-tune your model. 📚

👉 But last week, a new paper introduced UI-R1, applying DeepSeek's R1-style rule-based reinforcement learning (RL) specifically to GUI action prediction tasks.
This is big news: with RL, maybe we could build good agents without the need for huge datasets.

UI-R1 uses a unified reward function that evaluates multiple responses from models, optimizing via policy algorithms like Group Relative Policy Optimization (GRPO).

Specifically, the reward function assesses (a rough sketch follows this list):
🎯 Action type accuracy: Does the predicted action match the ground truth?
📍 Coordinate accuracy (specifically for clicks): Is the predicted click within the correct bounding box?
📑 Output format: Does the model clearly articulate both its reasoning and final action?
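
Here's a rough sketch of such a rule-based reward (my own illustration, not the paper's code; the <think>/<answer> response format and the equal per-term weights are assumptions):

```python
import re

# Illustrative only: a UI-R1-style rule-based reward for GUI action prediction.
def gui_reward(response: str, gt_action: str, gt_bbox: tuple) -> float:
    reward = 0.0
    # Output format: reasoning followed by a final answer?
    if re.search(r"<think>.+</think>\s*<answer>.+</answer>", response, re.DOTALL):
        reward += 1.0
    answer = re.search(r"<answer>(.+)</answer>", response, re.DOTALL)
    if answer is None:
        return reward
    pred = answer.group(1)
    # Action type accuracy: does the predicted action match the ground truth?
    if gt_action in pred:
        reward += 1.0
    # Coordinate accuracy (clicks only): is the predicted point inside the box?
    coords = re.search(r"\(([\d.]+),\s*([\d.]+)\)", pred)
    if gt_action == "click" and coords:
        x, y = float(coords.group(1)), float(coords.group(2))
        x1, y1, x2, y2 = gt_bbox
        if x1 <= x <= x2 and y1 <= y <= y2:
            reward += 1.0
    return reward
```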

Using just 136 carefully selected mobile tasks (compared to 76,000 tasks for larger models like OS-Atlas), UI-R1 shows significant efficiency and improved performance:
📈 Boosted action prediction accuracy from 76% to 89% on AndroidControl.
🌍 Outperformed larger, SFT-trained models (e.g., OS-Atlas-7B), demonstrating superior results with vastly fewer data points (136 tasks vs. 76K).
🔍 Enhanced adaptability and generalization, excelling even in out-of-domain scenarios.

The paper tests this RL-based method only in low-level GUI tasks. Could it generalize to more complex interactions? 🧐

Read the full paper here 👉 UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning (2503.21620)
m-ric
posted an update about 2 months ago
smolagents now supports vLLM! 🥳

vLLM is one of the most popular local inference solutions, and the community had been asking us to integrate it: after a heavy refactoring of our LLM classes, we've just released smolagents 1.11.0, with a brand new VLLMModel class.

Go try it and tell us what you think!

https://github.com/huggingface/smolagents/blob/45b2c86857b7f7657daaa74e4d17d347e9e2c4a4/src/smolagents/models.py#L497
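
Minimal usage sketch (assuming a local vLLM-capable GPU setup; the model id is just a placeholder):

```python
from smolagents import CodeAgent, VLLMModel

# Serve the model locally through vLLM and let a CodeAgent use it.
model = VLLMModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
agent = CodeAgent(tools=[], model=model)
print(agent.run("How many seconds are there in a leap year?"))
```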
m-ric
posted an update about 2 months ago
Our new Agentic leaderboard is now live! 💥

If you've ever asked which LLM is best for powering agents, we've just made a leaderboard that ranks them all! Built with @albertvillanova, this ranks LLMs powering a smolagents CodeAgent on subsets of various benchmarks. ✅

🏆 GPT-4.5 comes out on top, even beating reasoning models like DeepSeek-R1 or o1. And Claude-3.7-Sonnet is a close second!

The leaderboard also allows you to show the scores of vanilla LLMs (without any agentic setup) on the same benchmarks: this shows the huge improvements brought by agentic setups. 💪

(Note that results will be added manually, so the leaderboard might not always have the latest LLMs)
m-ric
posted an update 2 months ago
We now have a Deep Research for academia: SurveyX automatically writes academic surveys nearly indistinguishable from human-written ones 🔥

Researchers from Beijing and Shanghai just published the first application of a deep research system to academia: their algorithm, given a question, can give you a survey of all papers on the subject.

To make a research survey, you generally follow two steps: preparation (collecting and organizing papers) and writing (outline creation, writing, polishing). The researchers followed the same two steps and automated them.

🎯 For the preparation part, a key step is finding all the important references on the given subject.
The researchers first cast a wide net to collect all relevant papers. But then finding the really important ones is like distilling knowledge from a haystack of information. To solve this challenge, they built an "AttributeTree" object that structures key information from citations. Ablating these AttributeTrees significantly decreased structure and synthesis scores, so they were really useful!

📝 For the writing part, the key was to get a synthesis that's both short and true. This is not easy to get from LLMs! So they used methods like LLM-based deduplication to shorten the overly verbose listings LLMs produce, and RAG to grab original quotes instead of made-up ones.

As a result, their system outperforms previous approaches by far!

As assessed by LLM judges, the quality score of SurveyX even approaches that of human experts, at 4.59/5 vs 4.75/5 🏆

I advise you to read the paper, it's a great overview of the kind of assistants we'll get in the near future! 👉 SurveyX: Academic Survey Automation via Large Language Models (2502.14776)
Their website shows examples of generated surveys 👉 http://www.surveyx.cn/
m-ric
posted an update 3 months ago
Less is More for Reasoning (LIMO): a 32B model fine-tuned with 817 examples can beat o1-preview on math reasoning! 🤯

Do we really need o1's huge RL procedure to see reasoning emerge? It seems not.
Researchers from Shanghai Jiao Tong University just demonstrated that carefully selected examples can boost math performance in large language models using SFT: no huge datasets or RL procedures needed.

Their procedure allows Qwen2.5-32B-Instruct to jump from 6.5% to 57% on AIME and from 59% to 95% on MATH, while using only 1% of the data used in previous approaches.

⚡ The Less-is-More Reasoning Hypothesis:
‣ Minimal but precise examples that showcase optimal reasoning patterns matter more than sheer quantity
‣ Pre-trained knowledge plus sufficient compute at inference time levels up math skills

➡️ Core techniques:
‣ High-quality reasoning chains with self-verification steps
‣ 817 handpicked problems that encourage deeper reasoning
‣ Enough inference-time computation to allow extended reasoning

💪 Efficiency gains:
‣ Only 817 examples instead of 100k+
‣ 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data

This really challenges the notion that SFT leads to memorization rather than generalization! And opens up reasoning to GPU-poor researchers 🚀

Read the full paper here 👉 LIMO: Less is More for Reasoning (2502.03387)
regisss
posted an update 3 months ago
Nice paper comparing the fp8 inference efficiency of Nvidia H100 and Intel Gaudi2: An Investigation of FP8 Across Accelerators for LLM Inference (2502.01070)

The conclusion is interesting: "Our findings highlight that the Gaudi 2, by leveraging FP8, achieves higher throughput-to-power efficiency during LLM inference"

One aspect of AI hardware accelerators that is often overlooked is how much less energy they consume than GPUs. It's nice to see researchers starting to carry out experiments to measure this!

Gaudi3 results soon...
m-ric
posted an update 3 months ago
Great feature alert: you can now share agents to the Hub! 🥳🥳

And any agent pushed to the Hub gets a cool Space interface to chat with it directly.

This was a real technical challenge: for instance, serializing tools to export them meant getting all the source code for a tool, verifying that it was standalone (not relying on external variables), and gathering all the packages required to make it run.

Go try it out! 👉 https://github.com/huggingface/smolagents
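
A rough sketch of how sharing could look (my own illustration; I'm assuming the agent-level push_to_hub method and the HfApiModel default, check the repo for the exact API):

```python
from smolagents import CodeAgent, HfApiModel

# Build a trivial agent, then push it to the Hub (the repo id is a placeholder).
agent = CodeAgent(tools=[], model=HfApiModel())
agent.push_to_hub("your-username/my-first-agent")
```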
m-ric
posted an update 3 months ago
For those who haven't come across it yet, here's a handy trick to discuss an entire GitHub repo with an LLM:

=> Just replace "github" with "gitingest" in the URL, and you get the whole repo as a single string that you can then paste into your LLM.
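
For example, the swap is just a string edit; gitingest.com then serves the repo as a single text digest:

```python
repo_url = "https://github.com/huggingface/smolagents"
ingest_url = repo_url.replace("github.com", "gitingest.com")
print(ingest_url)  # https://gitingest.com/huggingface/smolagents
```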
m-ric
posted an update 3 months ago
"๐Ÿฎ๐Ÿฌ๐Ÿฎ๐Ÿฑ ๐˜„๐—ถ๐—น๐—น ๐—ฏ๐—ฒ ๐˜๐—ต๐—ฒ ๐˜†๐—ฒ๐—ฎ๐—ฟ ๐—ผ๐—ณ ๐—”๐—œ ๐—ฎ๐—ด๐—ฒ๐—ป๐˜๐˜€": this statement has often been made, here are numbers to support it.

I've plotted the progress of AI agents on the GAIA test set, and it seems they're headed to catch up with the human baseline in early 2026.

And that progress is still driven mostly by the improvement of base LLMs: progress would be even faster with fine-tuned agentic models.