Llava Hugging Face

Activity Feed

AI & ML interests

None defined yet.

Recent Activity

llava-hf's activity

merveย 
posted an update 2 days ago
view post
Post
4283
A ton of impactful models and datasets in open AI past week, let's summarize the best ๐Ÿคฉ merve/releases-apr-21-and-may-2-6819dcc84da4190620f448a3

๐Ÿ’ฌ Qwen made it rain! They released Qwen3: new dense and MoE models ranging from 0.6B to 235B ๐Ÿคฏ as well as Qwen2.5-Omni, any-to-any model in 3B and 7B!
> Microsoft AI released Phi4 reasoning models (that also come in mini and plus sizes)
> NVIDIA released new CoT reasoning datasets
๐Ÿ–ผ๏ธ > ByteDance released UI-TARS-1.5, native multimodal UI parsing agentic model
> Meta released EdgeTAM, an on-device object tracking model (SAM2 variant)
๐Ÿ—ฃ๏ธ NVIDIA released parakeet-tdt-0.6b-v2, a smol 600M automatic speech recognition model
> Nari released Dia, a 1.6B text-to-speech model
> Moonshot AI released Kimi Audio, a new audio understanding, generation, conversation model
๐Ÿ‘ฉ๐Ÿปโ€๐Ÿ’ป JetBrains released Melium models in base and SFT for coding
> Tesslate released UIGEN-T2-7B, a new text-to-frontend-code model ๐Ÿคฉ
merveย 
posted an update 3 days ago
view post
Post
6010
A real-time object detector much faster and accurate than YOLO with Apache 2.0 license just landed to Hugging Face transformers ๐Ÿ”ฅ

D-FINE is the sota real-time object detector that runs on T4 (free Colab) ๐Ÿคฉ

> Collection with all checkpoints and demo ustc-community/d-fine-68109b427cbe6ee36b4e7352

Notebooks:
> Tracking https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_tracking.ipynb
> Inference https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_inference.ipynb
> Fine-tuning https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_finetune_on_a_custom_dataset.ipynb
h/t @vladislavbro @qubvel-hf @ariG23498 and the authors of the paper ๐ŸŽฉ

Regular object detectors attempt to predict bounding boxes in (x, y, w, h) pixel perfect coordinates, which is very rigid and hard to solve ๐Ÿฅฒโ˜น๏ธ



D-FINE formulates object detection as a distribution for bounding box coordinates, refines them iteratively, and it's more accurate ๐Ÿคฉ

Another core idea behind this model is Global Optimal Localization Self-Distillation โคต๏ธ

this model uses final layer's distribution output (sort of like a teacher) to distill to earlier layers to make early layers more performant.

  • 2 replies
ยท
merveย 
posted an update 6 days ago
merveย 
posted an update 9 days ago
view post
Post
2555
Meta released Llama Guard 4 and new Prompt Guard 2 models ๐Ÿ”ฅ

Llama Guard 4 is a new model to filter model inputs/outputs both text-only and image ๐Ÿ›ก๏ธ use it before and after LLMs/VLMs! meta-llama/Llama-Guard-4-12B

Prompt Guard 2 22M & 86M are smol models to prevent model jailbreaks and prompt injections โš” meta-llama/Llama-Prompt-Guard-2-22M meta-llama/Llama-Guard-4-12B
Both come with new release of transformers ๐Ÿค—

Try the model right away ๐Ÿ‘‰๐Ÿปhttps://github.com/huggingface/huggingface-llama-recipes/blob/main/llama_guard_4.ipynb

Read our blog to learn more and easily get started ๐Ÿ‘‰๐Ÿป https://huggingface.co/blog/llama-guard-4 ๐Ÿฆ™
  • 1 reply
ยท
nielsrย 

Empty output

5
#52 opened 27 days ago by
LeonardoBenitez
Xenovaย 
posted an update 10 days ago
view post
Post
5432
Introducing the ONNX model explorer: Browse, search, and visualize neural networks directly in your browser. ๐Ÿคฏ A great tool for anyone studying Machine Learning! We're also releasing the entire dataset of graphs so you can use them in your own projects! ๐Ÿค—

Check it out! ๐Ÿ‘‡
Demo: onnx-community/model-explorer
Dataset: onnx-community/model-explorer
Source code: https://github.com/xenova/model-explorer
merveย 
posted an update 14 days ago
view post
Post
3945
Don't sleep on new AI at Meta Vision-Language release! ๐Ÿ”ฅ

facebook/perception-encoder-67f977c9a65ca5895a7f6ba1
facebook/perception-lm-67f9783f171948c383ee7498

Meta dropped swiss army knives for vision with A2.0 license ๐Ÿ‘
> image/video encoders for vision language modelling and spatial understanding (object detection etc) ๐Ÿ‘
> The vision LM outperforms InternVL3 and Qwen2.5VL ๐Ÿ‘
> They also release gigantic video and image datasets

The authors attempt to come up with single versatile vision encoder to align on diverse set of tasks.

They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. For zero-shot image tasks, it outperforms latest sota SigLIP2 ๐Ÿ‘



> Among fine-tuned ones, first one is PE-Spatial. It's a model to detect bounding boxes, segmentation, depth estimation and it outperforms all other models ๐Ÿ˜ฎ



> Second one is PLM, Perception Language Model, where they combine PE-Core with Qwen2.5 LM 7B. it outperforms all other models (including InternVL3 which was trained with Qwen2.5LM too!)

The authors release the following checkpoints in sizes base, large and giant:

> 3 PE-Core checkpoints (224, 336, 448)
> 2 PE-Lang checkpoints (L, G)
> One PE-Spatial (G, 448)
> 3 PLM (1B, 3B, 8B)
> Datasets



Authors release following datasets ๐Ÿ“‘
> PE Video: Gigantic video datasete of 1M videos with 120k expert annotations โฏ๏ธ
> PLM-Video and PLM-Image: Human and auto-annotated image and video datasets on region-based tasks
> PLM-VideoBench: New video benchmark on MCQA
  • 2 replies
ยท
merveย 
posted an update 16 days ago
view post
Post
3373
New foundation model on image and video captioning just dropped by NVIDIA AI ๐Ÿ”ฅ

Describe Anything Model (DAM) is a 3B vision language model to generate detailed captions with localized references ๐Ÿ˜ฎ

The team released the models, the dataset, a new benchmark and a demo ๐Ÿคฉ nvidia/describe-anything-680825bb8f5e41ff0785834c

Most of the vision LMs focus on image as a whole, lacking localized references in captions, and not taking in visual prompts (points, boxes, drawings around objects)

DAM addresses this on two levels: new vision backbone that takes in focal crops and the image itself, and a large scale dataset ๐Ÿ‘€

They generate a dataset by extending existing segmentation and referring expression generation datasets like REFCOCO, by passing in the images and classes to VLMs and generating captions.

Lastly, they also release a new benchmark again with self-supervision, they use an LLM to evaluate the detailed captions focusing on localization ๐Ÿ‘
Xenovaย 
posted an update 22 days ago
view post
Post
2549
Reasoning models like o3 and o4-mini are advancing faster than ever, but imagine what will be possible when they can run locally in your browser! ๐Ÿคฏ

Well, with ๐Ÿค— Transformers.js, you can do just that! Here's Zyphra's new ZR1 model running at over 100 tokens/second on WebGPU! โšก๏ธ

Giving models access to browser APIs (like File System, Screen Capture, and more) could unlock an entirely new class of web experiences that are personalized, interactive, and run locally in a secure, sandboxed environment.

For now, try out the demo! ๐Ÿ‘‡
webml-community/Zyphra-ZR1-WebGPU
  • 1 reply
ยท
merveย 
posted an update 24 days ago
view post
Post
4438
sooo many open AI releases past week, let's summarize! ๐Ÿค—
merve/april-11-releases-67fcd78be33d241c0977b9d2

multimodal
> Moonshot AI released Kimi VL Thinking, first working open-source multimodal reasoning model and Kimi VL Instruct, both 16B MoEs with 3B active params (OS)
> InternVL3 released based on Qwen2.5VL, 7 ckpts with various sizes (1B to 78B)

LLMs
> NVIDIA released Llama-3_1-Nemotron-Ultra-253B-v1 an LLM built on Llama 405B for reasoning, chat and tool use
> Agentica released DeepCoder-14B-Preview, fine-tuned version of DeepSeek-R1-Distilled-Qwen-14B on problem-test pairs, along with the compiled dataset
> Zyphra/ZR1-1.5B is a new small reasoning LLM built on R1-Distill-1.5B (OS)
> Skywork-OR1-32B-Preview is a new reasoning model by Skywork

Image Generation
> HiDream releases three new models, HiDream I1 Dev, I1 Full, and I1 fast for image generation (OS)

*OS ones have Apache 2.0 or MIT licenses
ยท
merveย 
posted an update about 2 months ago
view post
Post
4146
So many open releases at Hugging Face past week ๐Ÿคฏ recapping all here โคต๏ธ merve/march-21-releases-67dbe10e185f199e656140ae

๐Ÿ‘€ Multimodal
> Mistral AI released a 24B vision LM, both base and instruction FT versions, sota ๐Ÿ”ฅ (OS)
> with IBM we released SmolDocling, a sota 256M document parser with Apache 2.0 license (OS)
> SpatialLM is a new vision LM that outputs 3D bounding boxes, comes with 0.5B (QwenVL based) and 1B (Llama based) variants
> SkyWork released SkyWork-R1V-38B, new vision reasoning model (OS)

๐Ÿ’ฌ LLMs
> NVIDIA released new Nemotron models in 49B and 8B with their post-training dataset
> LG released EXAONE, new reasoning models in 2.4B, 7.8B and 32B
> Dataset: Glaive AI released a new reasoning dataset of 22M+ examples
> Dataset: NVIDIA released new helpfulness dataset HelpSteer3
> Dataset: OpenManusRL is a new agent dataset based on ReAct framework (OS)
> Open-R1 team released OlympicCoder, new competitive coder model in 7B and 32B
> Dataset: GeneralThought-430K is a new reasoning dataset (OS)

๐Ÿ–ผ๏ธ Image Generation/Computer Vision
> Roboflow released RF-DETR, new real-time sota object detector (OS) ๐Ÿ”ฅ
> YOLOE is a new real-time zero-shot object detector with text and visual prompts ๐Ÿฅน
> Stability AI released Stable Virtual Camera, a new novel view synthesis model
> Tencent released Hunyuan3D-2mini, new small and fast 3D asset generation model
> ByteDance released InfiniteYou, new realistic photo generation model
> StarVector is a new 8B model that generates svg from images
> FlexWorld is a new model that expands 3D views (OS)

๐ŸŽค Audio
> Sesame released CSM-1B new speech generation model (OS)

๐Ÿค– Robotics
> NVIDIA released GR00T, new robotics model for generalized reasoning and skills, along with the dataset

*OS ones have Apache 2.0 or MIT license