
HuggingFaceM4 (Enterprise company)
AI & ML interests: None defined yet.
HuggingFaceM4's activity

jeffboudier posted an update 1 day ago
Post
1204
So many orgs on HF would really benefit from the security and governance built into Enterprise Hub - I wrote a guide on why and how to upgrade:
jeffboudier/how-to-upgrade-to-enterprise
For instance, did you know about Resource Groups?
Post
2041
Ever notice how some AI assistants feel like tools while others feel like companions? Turns out it's not always about fancy tech upgrades; sometimes it's just clever design.
Our latest blog post at Hugging Face dives into how minimal design choices can completely transform how users experience AI. We've seen our community turn the same base models into everything from swimming coaches to interview prep specialists with surprisingly small tweaks.
The most fascinating part? When we tested identical models with different "personalities" in our Inference Playground, the results were mind-blowing.
Want to experiment yourself? Our Inference Playground lets anyone (yes, even non-coders!) test these differences in real-time. You can:
- Compare multiple models side-by-side
- Customize system prompts
- Adjust parameters like temperature
- Test multi-turn conversations
It's fascinating how a few lines of instruction text can transform the same AI from strictly professional to seemingly caring and personal, without changing a single line of code in the model itself.
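If you'd rather script the comparison than click around the Playground, here's a minimal sketch of the same experiment with huggingface_hub's InferenceClient; the model ID and both system prompts are just illustrative assumptions, swap in whatever you like:

from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Llama-3.1-8B-Instruct")  # example model, pick any chat model you have access to

personas = {
    "professional": "You are a concise, strictly formal technical assistant.",
    "companion": "You are a warm, encouraging swimming coach who checks in on how the user feels.",
}

for name, system_prompt in personas.items():
    out = client.chat_completion(
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "I skipped my training session today."},
        ],
        max_tokens=120,
        temperature=0.7,  # same model, same parameters -- only the system prompt changes
    )
    print(f"--- {name} ---")
    print(out.choices[0].message.content)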
Read more here: https://huggingface.co/blog/giadap/ai-personas
Post
4271
A ton of impactful models and datasets in open AI this past week, let's summarize the best 🤩
merve/releases-apr-21-and-may-2-6819dcc84da4190620f448a3
💬 Qwen made it rain! They released Qwen3: new dense and MoE models ranging from 0.6B to 235B 🤯 as well as Qwen2.5-Omni, any-to-any model in 3B and 7B!
> Microsoft AI released Phi4 reasoning models (that also come in mini and plus sizes)
> NVIDIA released new CoT reasoning datasets
🖼️ > ByteDance released UI-TARS-1.5, native multimodal UI parsing agentic model
> Meta released EdgeTAM, an on-device object tracking model (SAM2 variant)
🗣️ NVIDIA released parakeet-tdt-0.6b-v2, a smol 600M automatic speech recognition model
> Nari released Dia, a 1.6B text-to-speech model
> Moonshot AI released Kimi Audio, a new audio understanding, generation and conversation model
👩🏻‍💻 JetBrains released Mellum models in base and SFT for coding
> Tesslate released UIGEN-T2-7B, a new text-to-frontend-code model 🤩
Post
5997
A real-time object detector much faster and more accurate than YOLO, with an Apache 2.0 license, just landed in Hugging Face transformers 🔥
D-FINE is the sota real-time object detector that runs on T4 (free Colab) 🤩
> Collection with all checkpoints and demo ustc-community/d-fine-68109b427cbe6ee36b4e7352
Notebooks:
> Tracking https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_tracking.ipynb
> Inference https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_inference.ipynb
> Fine-tuning https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_finetune_on_a_custom_dataset.ipynb
h/t @vladislavbro @qubvel-hf @ariG23498 and the authors of the paper 🎩
Regular object detectors attempt to predict bounding boxes as pixel-perfect (x, y, w, h) coordinates, which is very rigid and hard to optimize 🥲☹️
D-FINE formulates object detection as a distribution for bounding box coordinates, refines them iteratively, and it's more accurate 🤩
Another core idea behind this model is Global Optimal Localization Self-Distillation ⤵️
the model uses the final layer's distribution output (sort of like a teacher) and distills it into earlier layers to make them more performant.
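For a quick local try-out (separate from the notebooks above), here's a hedged sketch with the transformers object-detection pipeline; the checkpoint ID below is a placeholder I'm assuming, pick the real one from the collection:

from transformers import pipeline

# Placeholder repo ID -- check the ustc-community collection above for the exact checkpoint name.
detector = pipeline("object-detection", model="ustc-community/dfine-xlarge-coco")

# Any local path or URL works; this is the classic COCO cats-on-a-couch sample image.
results = detector("http://images.cocodataset.org/val2017/000000039769.jpg", threshold=0.5)
for det in results:
    print(det["label"], round(det["score"], 2), det["box"])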
Post
2039
you can easily fine-tune, quantize, and play with the sota vision LM InternVL3 now 🔥
we have recently merged InternVL3 to Hugging Face transformers and released converted checkpoints 🤗
collection for converted checkpoints: merve/internvl3-hf-6814be2943b2ae0e711c92a5
notebook: https://colab.research.google.com/drive/1wAQ7cyjyaCwLXbMA_OjXZe7aCxCFm6sI?usp=sharing 📖
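For a quick smoke test outside the notebook, here's a hedged sketch using the image-text-to-text pipeline; the repo ID is an assumption on my part, grab the exact one from the collection above:

from transformers import pipeline

pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-1B-hf")  # placeholder ID, see the collection

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"])  # the output may include the full chat; print and inspect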
Post
1526

Check if there's one in your city here: LeRobot-worldwide-hackathon/worldwide-map
Post
1434
The meta-llama org just crossed 40,000 followers on Hugging Face. Grateful for all their impact on the field, sharing the Llama weights openly and much more!
We need more of this from all the other big tech companies to make AI more open, collaborative and beneficial to all!

Post
3621
HOW TO ADD MCP SUPPORT TO ANY 🤗 SPACE
Gradio now supports MCP! If you want to convert an existing Space, like this one hexgrad/Kokoro-TTS, so that you can use it with Claude Desktop / Cursor / Cline / TinyAgents / or any LLM that supports MCP, here's all you need to do:
1. Duplicate the Space (in the Settings Tab)
2. Upgrade the Gradio sdk_version to 5.28 (in the README.md)
3. Set mcp_server=True in launch()
4. (Optionally) add docstrings to the function so that the LLM knows how to use it, like this:
def generate(text, speed=1):
    """
    Convert text to speech audio.

    Parameters:
        text (str): The input text to be converted to speech.
        speed (float, optional): Playback speed of the generated speech.
    """
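And here's roughly what the end of the app file looks like once steps 2-4 are in place (an illustrative sketch only, not the real Kokoro-TTS code; the TTS logic itself is elided):

import gradio as gr

def generate(text, speed=1):
    """
    Convert text to speech audio.

    Parameters:
        text (str): The input text to be converted to speech.
        speed (float, optional): Playback speed of the generated speech.
    """
    ...  # the Space's existing TTS code stays exactly as it was

demo = gr.Interface(fn=generate, inputs=["text", "number"], outputs="audio")
demo.launch(mcp_server=True)  # step 3: this single kwarg also serves the app as an MCP server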
That's it! Now your LLM will be able to talk to you 🤯
Post
2455
Hi folks! Excited to share a new feature from the Gradio team along with a tutorial.
If you don't already know, Gradio is an open-source Python library used to build interfaces for machine learning models. Beyond just creating UIs, Gradio also exposes API capabilities and now, Gradio apps can be launched as Model Context Protocol (MCP) servers for LLMs.
If you already know how to use Gradio, there are only two additional things you need to do:
* Add standard docstrings to your function (these will be used to generate the descriptions for your tools for the LLM)
* Set mcp_server=True in launch()
Here's a complete example (make sure you already have the latest version of Gradio installed):
import gradio as gr

def letter_counter(word, letter):
    """Count the occurrences of a specific letter in a word.

    Args:
        word: The word or phrase to analyze
        letter: The letter to count occurrences of

    Returns:
        The number of times the letter appears in the word
    """
    return word.lower().count(letter.lower())

demo = gr.Interface(
    fn=letter_counter,
    inputs=["text", "text"],
    outputs="number",
    title="Letter Counter",
    description="Count how many times a letter appears in a word",
)

demo.launch(mcp_server=True)
This is a very simple example, but you can add the ability to generate Ghibli images or speak emotions to any LLM that supports MCP. Once you have an MCP server running locally, you can copy-paste the same app to host it on [Hugging Face Spaces](https://huggingface.co/spaces/) as well.
All free and open-source of course! Full tutorial: https://www.gradio.app/guides/building-mcp-server-with-gradio
Post
2554
Meta released Llama Guard 4 and new Prompt Guard 2 models 🔥
Llama Guard 4 is a new model to filter model inputs/outputs both text-only and image 🛡️ use it before and after LLMs/VLMs! meta-llama/Llama-Guard-4-12B
Prompt Guard 2 22M & 86M are smol models to prevent model jailbreaks and prompt injections ⚔ meta-llama/Llama-Prompt-Guard-2-22M meta-llama/Llama-Prompt-Guard-2-86M
Both come with the new release of transformers 🤗
Try the model right away 👉🏻 https://github.com/huggingface/huggingface-llama-recipes/blob/main/llama_guard_4.ipynb
Read our blog to learn more and easily get started 👉🏻 https://huggingface.co/blog/llama-guard-4 🦙
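If you want a rough idea of what calling it looks like before opening the notebook, here's a hedged sketch; the auto classes and decoding details are my assumptions (the recipe linked above has the canonical snippet), and the repo is gated so you'll need access:

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "meta-llama/Llama-Guard-4-12B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Moderate a user turn before it ever reaches your LLM/VLM.
messages = [{"role": "user", "content": [{"type": "text", "text": "How do I hotwire a car?"}]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
# The guard replies with "safe", or "unsafe" plus the violated category codes.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])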
Post
3944
Don't sleep on new AI at Meta Vision-Language release! 🔥
facebook/perception-encoder-67f977c9a65ca5895a7f6ba1
facebook/perception-lm-67f9783f171948c383ee7498
Meta dropped swiss army knives for vision with an Apache 2.0 license 👏
> image/video encoders for vision language modelling and spatial understanding (object detection etc) 👏
> The vision LM outperforms InternVL3 and Qwen2.5VL 👏
> They also release gigantic video and image datasets
The authors attempt to come up with a single versatile vision encoder that can be aligned to a diverse set of tasks.
They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. For zero-shot image tasks, it outperforms the latest sota SigLIP2 👏
> Among the fine-tuned ones, the first is PE-Spatial. It's a model for bounding box detection, segmentation and depth estimation, and it outperforms all other models 😮
> The second is PLM, Perception Language Model, where they combine PE-Core with the Qwen2.5 7B LM. It outperforms all other models (including InternVL3, which was trained with a Qwen2.5 LM too!)
The authors release the following checkpoints in sizes base, large and giant:
> 3 PE-Core checkpoints (224, 336, 448)
> 2 PE-Lang checkpoints (L, G)
> One PE-Spatial (G, 448)
> 3 PLM (1B, 3B, 8B)
> Datasets
The authors release the following datasets 📑
> PE Video: a gigantic video dataset of 1M videos with 120k expert annotations ⏯️
> PLM-Video and PLM-Image: Human and auto-annotated image and video datasets on region-based tasks
> PLM-VideoBench: New video benchmark on MCQA
Post
3372
New foundation model on image and video captioning just dropped by NVIDIA AI 🔥
Describe Anything Model (DAM) is a 3B vision language model to generate detailed captions with localized references 😮
The team released the models, the dataset, a new benchmark and a demo 🤩 nvidia/describe-anything-680825bb8f5e41ff0785834c
Most vision LMs focus on the image as a whole, lacking localized references in captions, and don't take in visual prompts (points, boxes, drawings around objects)
DAM addresses this on two levels: a new vision backbone that takes in focal crops along with the image itself, and a large-scale dataset 👀
They generate the dataset by extending existing segmentation and referring-expression datasets like RefCOCO: they pass the images and classes to VLMs and have them generate captions.
Lastly, they also release a new benchmark, again with self-supervision: they use an LLM to evaluate the detailed captions, focusing on localization 👏

davanstrien posted an update 16 days ago
Post
1960
Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).
The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:
- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model
It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.
I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.
Dataset can be found here: marcodsn/academic-chains (give it a like!)
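If you want to poke at it, loading it is one line with 🤗 datasets; I haven't assumed any column names, so just print a row and inspect:

from datasets import load_dataset

ds = load_dataset("marcodsn/academic-chains")
print(ds)  # shows the splits and columns the author defined
first_split = next(iter(ds.values()))
print(first_split[0])  # one distilled reasoning-chain example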
Post
2392
New launch: See the energy use of chatbot conversations, in real time. =)
jdelavande/chat-ui-energy
Great work from @JulienDelavande !
Post
3970
Energy is a massive constraint for AI, but do you even know how much energy your ChatGPT convos are using?
We're trying to change this by releasing ChatUI-energy, the first interface where you can see in real time what energy your AI conversations consume. Great work from @jdelavande powered by Spaces & TGI, available for a dozen open-source models like Llama, Mistral, Qwen, Gemma and more.
jdelavande/chat-ui-energy
Should all chat interfaces have this? Just like ingredients have to be shown on products you buy, we need more transparency in AI for users!
Post
2928
Just crossed half a million public apps on Hugging Face. A new public app is created every minute these days 🤯🤯🤯
What's your favorite? http://hf.co/spaces
Post
2698
New king of open VLMs: InternVL3 takes Qwen 2.5's crown! 👑
InternVL have been a wildly successful series of models, and the latest iteration has just taken back their crown thanks to their superior, natively multimodal vision training pipeline.
➡️ Most of the vision language models (VLMs) these days are built like Frankenstein: take a good text-only Large Language Model (LLM) backbone, stitch a specific vision transformer (ViT) on top of it. Then the training is sequential 🔢: 1. Freeze the LLM weights while you train the ViT only to work with the LLM part, then 2. Unfreeze all weights to train all weights in order to work together.
💫 The Shanghai Lab decided to challenge this paradigm and chose an approach they call "native". For each of their model sizes, they still start from a good LLM (mostly the Qwen-2.5 series, did I tell you I'm a huge fan of Qwen? ❤️) and stitch on the ViT, but they don't freeze anything: they train all weights together with interleaved text and image understanding data in a single pre-training phase 🎨.
They claim it results in more seamless interactions between modalities. And the results prove them right: they took the crown of top VLMs, at nearly all sizes, from their Qwen-2.5 parents. 👑
Post
1621
🤗 Just published: "Consent by Design" - exploring how we're building better consent mechanisms across the HF ecosystem!
Our research shows open AI development enables:
- Community-driven ethical standards
- Transparent accountability
- Context-specific implementations
- Privacy as core infrastructure
Check out our Space Privacy Analyzer tool that automatically generates privacy summaries of applications!
Effective consent isn't about perfect policies; it's about architectures that empower users while enabling innovation. 🚀
Read more: https://huggingface.co/blog/giadap/consent-by-design