Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Paper • 2505.02707 • Published 3 days ago • 69
DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning Paper • 2504.14509 • Published 18 days ago • 50
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters Paper • 2504.08791 • Published about 1 month ago • 129
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis Paper • 2504.13157 • Published 21 days ago • 21
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? Paper • 2504.13837 • Published 20 days ago • 119
Describe Anything: Detailed Localized Image and Video Captioning Paper • 2504.16072 • Published 16 days ago • 60
Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs Paper • 2504.04715 • Published Apr 7 • 13
GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors Paper • 2504.01016 • Published Apr 1 • 29
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization Paper • 2503.19901 • Published Mar 25 • 41
ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation Paper • 2503.22194 • Published Mar 28 • 24
Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics Paper • 2503.20308 • Published Mar 26 • 22
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization Paper • 2504.00999 • Published Apr 1 • 89
CHOrD: Generation of Collision-Free, House-Scale, and Organized Digital Twins for 3D Indoor Scenes with Controllable Floor Plans and Optimal Layouts Paper • 2503.11958 • Published Mar 15 • 3
PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity Paper • 2503.07677 • Published Mar 10 • 85
BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing Paper • 2503.13434 • Published Mar 17 • 27
Pensez: Less Data, Better Reasoning -- Rethinking French LLM Paper • 2503.13661 • Published Mar 17 • 5
Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection Paper • 2503.12271 • Published Mar 15 • 9
Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models Paper • 2503.09443 • Published Mar 12 • 7
FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis Paper • 2503.13265 • Published Mar 17 • 15