Spaces:
Sleeping
Sleeping
File size: 46,951 Bytes
dd05f29 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 |
en_url,en_title,en,jp_url,jp_title,ja
https://developer.nvidia.com/blog/expanding-ai-agent-interface-options-with-2d-and-3d-digital-human-avatars/,Expanding AI Agent Interface Options with 2D and 3D Digital Human Avatars,"When interfacing with
generative AI
applications, users have multiple communication optionsâtext, voice, or through digital avatars.
Traditional chatbot or copilot applications have text interfaces where users type in queries and receive text-based responses. For hands-free communication, speech AI technologies like
automatic speech recognition
(ASR) and
text-to-speech
(TTS) facilitate verbal interactions, ideal for scenarios like phone-based customer service. Moreover, combining digital avatars with speech capabilities provides a more dynamic interface for users to engage visually with the application. According to Gartner, by 2028, 45% of organizations with more than 500 employees will leverage employee AI avatars to expand the capacity of human capital.
1
Digital avatars can vary widely in styleâsome use cases benefit from photorealistic 3D or 2D avatars, while other use cases work better with a stylized, or cartoonish avatar.
3D Avatars
offer fully immersive experiences, showcasing lifelike movements and photorealism. Developing these avatars requires specialized software and technical expertise, as they involve intricate body animations and high-quality renderings.
2D Avatars
are quicker to develop and ideal for web-embedded solutions. They offer a streamlined approach to creating interactive AI, often requiring artists for design and animation but less intensive in terms of technical resources.
To kickstart your creation of a photo-realistic digital human, the
NVIDIA AI Blueprint on digital humans for customer service
can be tailored for various use cases. This functionality is now included with support for the NVIDIA Maxine
Audio2Face-2D
NIM microservice. âAdditionally, the blueprint now offers flexibility in rendering for 3D avatar developers to use
Unreal Engine
.
How to add a talking digital avatar to your agent application
In the AI Blueprint for digital humans, a user interacts with an
AI agent
that leverages
NVIDIA ACE
technology (Figure 1).
Figure 1. Architecture diagram for the NVIDIA AI Blueprint for digital humans
The audio input from the user is sent to the ACE agent which orchestrates the communication between various NIM microservices. The ACE agent uses the
Riva Parakeet NIM
to convert the audio to text, which is then processed by a RAG pipeline. The RAG pipeline uses the NVIDIA NeMo Retriever
embedding
and
reranking
NIM microservices, and an
LLM NIM
, to respond with relevant context from stored documents.
Finally, the response is converted back to speech via Riva TTS, animating the digital human using the Audio2Face-3D NIM or Audio2Face-2D NIM.
Considerations when designing your AI agent application
In global enterprises, communication barriers across languages can slow down operations. AI-powered avatars with multilingual capabilities communicate across languages effortlessly. The digital human AI Blueprint provides conversational AI capabilities that simulate human interactions that accommodate usersâ speech styles and languages through Riva ASR, neural machine translation (NMT) along with intelligent interruption and barge-in support.
One of the key benefits of digital human AI agents is their ability to function as âalways-onâ resources for employees and customers alike. RAG-powered AI agents continuously learn from interactions and improve over time, providing more accurate responses and better user experiences.
For enterprises considering digital human interfaces, choosing the right avatar and rendering option depends on the use case and customization preferences.
Use Case
: 3D avatars are ideal for highly immersive use cases like in physical stores, kiosks or primarily one-to-one interactions, while 2D avatars are effective for web or mobile conversational AI use cases.
Development and Customization Preferences
: Teams with 3D and animation expertise can leverage their skillset to create an immersive and ultra-realistic avatar, while teams looking to iterate and customize quickly can benefit from the simplicity of 2D avatars.
Scaling Considerations:
Scaling is an important consideration when evaluating avatars and corresponding rendering options. Stream throughput, especially for 3D avatars, is highly dependent on the choice and quality of the character asset used, the desired output resolution and the rendering option of choice (Omniverse Renderer or Unreal Engine) can play a critical role in determining per stream compute footprint.
NVIDIA Audio2Face-2D allows creation of lifelike 2D avatars from just a portrait image and voice input. Easy and simple configurations allow developers to quickly iterate and produce target avatars and animations for their digital human use cases. With real-time output and cloud-native deployment, 2D digital humans are ideal for interactive use cases and streaming avatars for interactive web-embedded solutions.
For example, enterprises looking to deploy AI agents across multiple devices and inserting digital humans into web- or mobile-first customer journeys, can benefit from the reduced hardware demands of 2D avatars.
3D photorealistic avatars provide an unmatched immersive experience for use cases demanding âhighly empathetic user engagement. NVIDIA Audio2Face-3D and Animation NIM microservices animate a 3D character by generating blendshapes along with subtle head and body animation to create an immersive, photorealistic avatar. The digital human AI Blueprint now supports two rendering options for 3D avatars, including Omniverse Renderer and Unreal Engine Renderer, providing developers the flexibility to integrate the rendering option of their choice.
To explore how digital humans can enhance your enterprise, visit the
NVIDIA API catalog
to learn about the different avatar options.
Getting started with digital avatars
For hands-on development with Audio2Face-2D and Unreal Engine NIM microservices,
apply for ACE Early Access
or dive into the digital human AI Blueprint
technical blog
to learn how you can add digital human interfaces to personalize chatbot applications.
1
Gartner®, Hype Cycle for the Future of Work, 2024 by Tori Paulman, Emily Rose McRae, etc., July 2024
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.",https://developer.nvidia.com/ja-jp/blog/expanding-ai-agent-interface-options-with-2d-and-3d-digital-human-avatars/,2D ãš 3D ã®ããžã¿ã« ãã¥ãŒãã³ ã¢ãã¿ãŒã«ãã AI ãšãŒãžã§ã³ã ã€ã³ã¿ãŒãã§ã€ã¹ ãªãã·ã§ã³ã®æ¡åŒµ,"Reading Time:
2
minutes
ãŠãŒã¶ãŒã
çæ AI
ã¢ããªã±ãŒã·ã§ã³ã䜿ã£ãŠããåãããéã«ã¯ãããã¹ããé³å£°ãããžã¿ã« ã¢ãã¿ãŒãªã©è€æ°ã®ã³ãã¥ãã±ãŒã·ã§ã³ ãªãã·ã§ã³ãå©çšããããšãã§ããŸãã
åŸæ¥ã®ãã£ããããããã³ãã€ããã ã¢ããªã±ãŒã·ã§ã³ã§ã¯ããŠãŒã¶ãŒãåãåãããå
¥åããããã¹ãããŒã¹ã®å¿çãåä¿¡ããããã¹ã ã€ã³ã¿ãŒãã§ã€ã¹ã䜿çšããŠããŸãããã³ãºããªãŒã®ã³ãã¥ãã±ãŒã·ã§ã³ã§ã¯ã
èªåé³å£°èªè
(ASR: Automatic Speech Recognition) ã
é³å£°åæ
(TTS: Text-To-Speech) ãªã©ã®é³å£° AI æè¡ã«ãããé»è©±ã䜿çšããã«ã¹ã¿ã㌠ãµãŒãã¹ãªã©ã®ã·ããªãªã«æé©ãªå£é ã«ããããåãã容æã«ãªããŸããããã«ãããžã¿ã« ã¢ãã¿ãŒã«é³å£°æ©èœãæãããããšã§ããŠãŒã¶ãŒãã¢ããªã±ãŒã·ã§ã³ãèŠèŠçã«äœ¿çšã§ããããããã€ãããã¯ãªã€ã³ã¿ãŒãã§ã€ã¹ãæäŸã§ããŸããGartner ã«ãããšã2028 幎ãŸã§ã«ãåŸæ¥å¡ 500 å以äžã®çµç¹ã® 45% ãã人çè³æ¬ã®èœåæ¡å€§ã®ããã«ã AI ã¢ãã¿ãŒã®åŸæ¥å¡ã掻çšããããã«ãªãããã§ãã
1
ããžã¿ã« ã¢ãã¿ãŒã®ã¹ã¿ã€ã«ã¯æ§ã
ã§ããã©ããªã¢ãªã¹ãã£ãã¯ãª 3D ãŸã㯠2D ã®ã¢ãã¿ãŒãé©ããŠããã±ãŒã¹ãããã°ãå®ååãããã¢ãã¿ãŒã挫ç»ã®ãããªã¢ãã¿ãŒã®æ¹ãé©ããŠããã±ãŒã¹ããããŸãã
3D ã¢ãã¿ãŒ
ã¯ããªã¢ã«ãªåããšå宿§ãåçŸããå®å
šãªæ²¡å
¥äœéšãæäŸããŸãããã®ãããªã¢ãã¿ãŒã®éçºã«ã¯ãè€éãªããã£ãŒ ã¢ãã¡ãŒã·ã§ã³ãé«å質ã®ã¬ã³ããªã³ã°ãå¿
èŠãšãªããããå°éçãªãœãããŠã§ã¢ãæè¡çãªå°éç¥èãå¿
èŠã«ãªããŸãã
2D ã¢ãã¿ãŒ
ã¯éçºãè¿
éã§ãWeb ã«çµã¿èŸŒã¿ãœãªã¥ãŒã·ã§ã³ã«æé©ã§ããã€ã³ã¿ã©ã¯ãã£ã㪠AI ã®äœæã«åççãªã¢ãããŒããæäŸãããã¶ã€ã³ãã¢ãã¡ãŒã·ã§ã³ã«ã¯ã¢ãŒãã£ã¹ããå¿
èŠã«ãªãããšãå€ãã§ãããæè¡çãªãªãœãŒã¹ã®é¢ã¯ããã»ã©è² æ
ã«ãªããŸããã
ãã©ããªã¢ãªã¹ãã£ãã¯ãªããžã¿ã« ãã¥ãŒãã³ã®äœæãå§ããã«ãããã
ã«ã¹ã¿ã㌠ãµãŒãã¹åãããžã¿ã« ãã¥ãŒãã³ã® NVIDIA AI Blueprint
ã¯ãããŸããŸãªãŠãŒã¹ ã±ãŒã¹ã«åãããŠã«ã¹ã¿ãã€ãºããããšãã§ããŸãããã®æ©èœã¯çŸåšãNVIDIA Maxine
Audio2Face-2D
NIM ãã€ã¯ããµãŒãã¹ã®ãµããŒãã«å«ãŸããŠããŸããããã«ããã® Blueprint ã§ã¯ã3D ã¢ãã¿ãŒéçºè
ã
Unreal Engine
ã䜿çšã§ãããããã¬ã³ããªã³ã°ã«æè»æ§ãæãããŠããŸãã
ãšãŒãžã§ã³ã ã¢ããªã±ãŒã·ã§ã³ã«äŒè©±ããããžã¿ã« ã¢ãã¿ãŒã远å ããæ¹æ³
ããžã¿ã« ãã¥ãŒãã³åã AI Blueprint ã§ã¯ããŠãŒã¶ãŒã
NVIDIA ACE
æè¡ã掻çšãã
AI ãšãŒãžã§ã³ã
ãšå¯Ÿè©±ããŸã (å³ 1)ã
å³ 1. ããžã¿ã« ãã¥ãŒãã³åã NVIDIA AI Blueprint ã®ã¢ãŒããã¯ãã£
ãŠãŒã¶ãŒã«ããé³å£°å
¥åã¯ãããŸããŸãª NIM ãã€ã¯ããµãŒãã¹éã®éä¿¡ã調æŽãã ACE ãšãŒãžã§ã³ãã«éä¿¡ãããŸããACE ãšãŒãžã§ã³ãã¯ã
Riva Parakeet NIM
ã䜿çšããŠé³å£°ãããã¹ãã«å€æãããã®ããã¹ã㯠RAG ãã€ãã©ã€ã³ã§åŠçãããŸããRAG ãã€ãã©ã€ã³ã§ã¯ãNIM ãã€ã¯ããµãŒãã¹ã®
åã蟌ã¿
ãš
ãªã©ã³ã¯
ãè¡ã NVIDIA NeMo Retriever ãš
LLM NIM
ã䜿çšããŠãä¿åãããããã¥ã¡ã³ãããé¢é£ããã³ã³ããã¹ããçšããŠå¿çããŸãã
æåŸã«ãRiva TTS ãä»ããŠãã®å¿çãé³å£°ã«å€æããAudio2Face-3D NIM ãŸã㯠Audio2Face-2D NIM ã䜿çšããŠããžã¿ã« ãã¥ãŒãã³ãã¢ãã¡ãŒã·ã§ã³åããŸãã
AI ãšãŒãžã§ã³ã ã¢ããªã±ãŒã·ã§ã³ãèšèšããéã«èæ
®ãã¹ããã€ã³ã
ã°ããŒãã«äŒæ¥ã§ã¯ãèšèªã®å£ã«ããã³ãã¥ãã±ãŒã·ã§ã³ã®éå®³ãæ¥åã®åŠšããšãªãããšããããŸããå€èšèªæ©èœãåãã AI æèŒã¢ãã¿ãŒã䜿çšããã°ãèšèªã®å£ãè¶
ããåæ»ãªã³ãã¥ãã±ãŒã·ã§ã³ãåãããšãã§ããŸããããžã¿ã« ãã¥ãŒãã³ AI Blueprint ã¯ãRiva ASR ããã¥ãŒã©ã«æ©æ¢°ç¿»èš³ (NMT: Neural Machine Translation) ã«å ããã€ã³ããªãžã§ã³ããªå²ã蟌ã¿ãããŒãžã€ã³æ©èœãåãããŠãŒã¶ãŒã®è©±ãæ¹ãèšèªã«æè»ã«å¯Ÿå¿ã§ããã人éããã察話å AI ãå®çŸããŸãã
ããžã¿ã« ãã¥ãŒãã³ AI ãšãŒãžã§ã³ãã®äž»ãªå©ç¹ã® 1 ã€ã¯ãåŸæ¥å¡ãšé¡§å®¢ã®äž¡è
ã«ãšã£ãŠãåžžæçšŒåããããªãœãŒã¹ãšããŠæ©èœã§ããããšã§ããRAG ãæèŒãã AI ãšãŒãžã§ã³ãã¯ããããšãããç¶ç¶çã«åŠç¿ããæéã®çµéãšãšãã«æ¹åããŠãããããããæ£ç¢ºãªå¯Ÿå¿ãšããåªãããŠãŒã¶ãŒäœéšãæäŸããããšãã§ããŸãã
ããžã¿ã« ãã¥ãŒãã³ ã€ã³ã¿ãŒãã§ã€ã¹ãæ€èšããŠããäŒæ¥ã«ãšã£ãŠãé©åãªã¢ãã¿ãŒãšã¬ã³ããªã³ã° ãªãã·ã§ã³ã®éžæã¯ããŠãŒã¹ ã±ãŒã¹ãã«ã¹ã¿ãã€ãºèšå®ã«äŸåããŸãã
ãŠãŒã¹ ã±ãŒã¹
: 3D ã¢ãã¿ãŒã¯ãå®åºèãããªã¹ã¯ (ç¡äººç«¯æ«) ãªã©ã䞻㫠1察 1 ã®ãããšãã®ãããªãéåžžã«æ²¡å
¥æã®é«ããŠãŒã¹ ã±ãŒã¹ã«æé©ã§ããã2D ã¢ãã¿ãŒã¯ãWeb ãã¢ãã€ã«ã®å¯Ÿè©±å AI ãŠãŒã¹ ã±ãŒã¹ã«å¹æçã§ãã
éçºãšã«ã¹ã¿ãã€ãºã®èšå®
: 3D ãã¢ãã¡ãŒã·ã§ã³ã®å°éç¥èãæã€ããŒã ã¯ããã®ã¹ãã«ã掻çšããŠæ²¡å
¥æã®ããè¶
ãªã¢ã«ãªã¢ãã¿ãŒãäœæã§ããŸããäžæ¹ãååŸ©äœæ¥ãã«ã¹ã¿ãã€ãºãè¿
éã«è¡ãããããŒã ã«ã¯ãã·ã³ãã«ãª 2D ã¢ãã¿ãŒãæå¹ã§ãã
ã¹ã±ãŒãªã³ã°ã®èæ
®ãã¹ããã€ã³ã
: ã¢ãã¿ãŒãšå¯Ÿå¿ããã¬ã³ããªã³ã° ãªãã·ã§ã³ãè©äŸ¡ããéã«ãã¹ã±ãŒãªã³ã°ã¯èæ
®ãã¹ãéèŠãªãã€ã³ãã§ããã¹ããªãŒã ã®ã¹ã«ãŒãããã¯ãç¹ã« 3D ã¢ãã¿ãŒã®å Žåã䜿çšãããã£ã©ã¯ã¿ãŒ ã¢ã»ããã®éžæãšå質ã«ãã£ãŠå€§ããç°ãªããŸããåžæããåºåè§£å床ãéžæããã¬ã³ããªã³ã° ãªãã·ã§ã³ (Omniverse Renderer ãŸã㯠Unreal Engine) ã¯ãã¹ããªãŒã ãããã®èšç®ãããããªã³ããæ±ºå®ããäžã§éèŠãªåœ¹å²ãæãããŸãã
NVIDIA Audio2Face-2D ã§ã¯ãé¡åçãšé³å£°å
¥åã ãã§ãªã¢ã«ãª 2D ã¢ãã¿ãŒãäœæã§ããŸããç°¡åã§ã·ã³ãã«ãªæ§æã®ãããéçºè
ã¯ããžã¿ã« ãã¥ãŒãã³ã®ãŠãŒã¹ ã±ãŒã¹ã«åãããã¢ãã¿ãŒãã¢ãã¡ãŒã·ã§ã³ãè¿
éã«ç¹°ãè¿ãäœæã§ããŸãããªã¢ã«ã¿ã€ã åºåãšã¯ã©ãŠã ãã€ãã£ãã®ãããã€ã«ããã2D ããžã¿ã« ãã¥ãŒãã³ã¯ãã€ã³ã¿ã©ã¯ãã£ããªãŠãŒã¹ ã±ãŒã¹ããã€ã³ã¿ã©ã¯ãã£ã㪠Web çµã¿èŸŒã¿ãœãªã¥ãŒã·ã§ã³åãã®ã¹ããªãŒãã³ã° ã¢ãã¿ãŒã«æé©ã§ãã
ããšãã°ãè€æ°ã®ããã€ã¹ã« AI ãšãŒãžã§ã³ãããããã€ããWeb ãŸãã¯ã¢ãã€ã« ãã¡ãŒã¹ãã®ã«ã¹ã¿ã㌠ãžã£ãŒããŒã«ããžã¿ã« ãã¥ãŒãã³ãå°å
¥ããããšããŠããäŒæ¥ã«ã¯ã2D ã¢ãã¿ãŒã¯ããŒããŠã§ã¢èŠä»¶ã軜æžããã®ã§ã¡ãªããããããŸãã
3D ã®ãã©ããªã¢ãªã¹ãã£ãã¯ãªã¢ãã¿ãŒã¯ãé«ãå
±æãèŠæ±ããããŠãŒã¶ãŒ ãšã³ã²ãŒãžã¡ã³ããå¿
èŠãšãããŠãŒã¹ ã±ãŒã¹ã«ãæ¯é¡ã®ãªã没å
¥äœéšãæäŸããŸããNVIDIA Audio2Face-3D ãšã¢ãã¡ãŒã·ã§ã³ NIM ãã€ã¯ããµãŒãã¹ã¯ãç¹çްãªé éšãšèº«äœã®ã¢ãã¡ãŒã·ã§ã³ãšãšãã«ãã¬ã³ãã·ã§ã€ããçæããæ²¡å
¥æã®ãããã©ããªã¢ãªã¹ãã£ãã¯ãªã¢ãã¿ãŒãäœæããããšã§ã3D ãã£ã©ã¯ã¿ãŒãã¢ãã¡ãŒã·ã§ã³åããŸããããžã¿ã« ãã¥ãŒãã³ AI Blueprint ã¯ã3D ã¢ãã¿ãŒã®ã¬ã³ããªã³ã° ãªãã·ã§ã³ããšããŠãOmniverse ã¬ã³ãã©ãŒãš Unreal-Engine ã¬ã³ãã©ãŒããµããŒãããŠãããéçºè
ãéžæããã¬ã³ããªã³ã° ãªãã·ã§ã³ãæè»ã«çµ±åã§ããããã«ãªããŸããã
ããžã¿ã« ãã¥ãŒãã³ãäŒæ¥ã匷åããæ¹æ³ã«ã€ããŠã¯ã
NVIDIA API ã«ã¿ãã°
ã«ã¢ã¯ã»ã¹ããŠãããŸããŸãªã¢ãã¿ãŒã®ãªãã·ã§ã³ãã芧ãã ããã
ããžã¿ã« ã¢ãã¿ãŒãå§ãã
Audio2Face-2D ãš Unreal Engine NIM ãã€ã¯ããµãŒãã¹ã䜿çšããå®è·µçãªéçºã«ã€ããŠã¯ã
ACE æ©æã¢ã¯ã»ã¹ã«ç³ã蟌ã
ããããžã¿ã« ãã¥ãŒãã³ AI Blueprint ã®
æè¡ããã°
ã«ã¢ã¯ã»ã¹ããŠããã£ããããã ã¢ããªã±ãŒã·ã§ã³ãããŒãœãã©ã€ãºããããã«ããžã¿ã« ãã¥ãŒãã³ ã€ã³ã¿ãŒãã§ã€ã¹ã远å ããæ¹æ³ã«ã€ããŠåŠã¶ããšãã§ããŸãã
1
Gartner®, Hype Cycle for the Future of Work, 2024 by Tori Paulman, Emily Rose McRae, etc., July 2024
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
é¢é£æ
å ±
GTC ã»ãã·ã§ã³:
Enhancing the Digital Human Experience with Cloud Microservices Accelerated by Generative AI
GTC ã»ãã·ã§ã³:
Build a World of Interactive Avatars Based on NVIDIA Omniverse, AIGC, and LLM
NGC ã³ã³ãããŒ:
ACE ãšãŒãžã§ã³ã ãµã³ãã« ããã³ããšã³ã
SDK:
NVIDIA Tokkio
ãŠã§ãããŒ:
How Telcos Transform Customer Experiences with Conversational AI"
https://developer.nvidia.com/blog/5x-faster-time-to-first-token-with-nvidia-tensorrt-llm-kv-cache-early-reuse/,5x Faster Time to First Token with NVIDIA TensorRT-LLM KV Cache Early Reuse,"In our previous
blog post
, we demonstrated how reusing the key-value (KV) cache by offloading it to CPU memory can accelerate time to first token (TTFT) by up to 14x on x86-based NVIDIA H100 Tensor Core GPUs and 28x on the NVIDIA GH200 Superchip. In this post, we shed light on KV cache reuse techniques and best practices that can drive even further TTFT speedups.
Introduction to KV cache
LLM models are rapidly being adopted for many tasks, including question-answering, and code generation. To generate a response, these models begin by converting the userâs prompt into tokens, which are then transformed into dense vectors. Extensive dot-product operations follow to mathematically model the relationships between the tokens and build a contextual understanding of the user input. The computational cost of generating this contextual understanding increases quadratically with the length of the input sequence.
This resource-intensive process generates keys and values, which are cached to avoid recomputation when generating subsequent tokens. Reusing the KV cache reduces the computational load and time needed to generate additional tokensâleading to a faster and more efficient user experience.
When reusing the KV cache, careful attention must be given to how long it remains in memory, which components to evict first when memory is full, and when it can be reused for new incoming prompts. Optimizing these factors can lead to incremental performance improvements in KV cache reuse. NVIDIA TensorRT-LLM offers three key features that specifically address these areas.
Early KV cache reuse
Traditional reuse algorithms require the entire KV cache computation to be completed before any portions of it can be reused with new user prompts. In scenarios such as enterprise chatbots, where system promptsâpredefined instructions added to user queriesâare essential to direct the LLMâs responses in line with enterprise guidelines, this method can be inefficient.
When a surge of users interacts with the chatbot simultaneously, each user would require a separate computation of the system prompt KV cache. With TensorRT-LLM, we can instead reuse the system prompt as it is being generated in real time, enabling it to be shared across all users during the burst, rather than recalculating it for each user. This can significantly accelerate inference for use cases requiring system prompts by up to 5x.
Figure 1. TensorRT-LLM KV cache reuse can speed up TTFT by up to 5x
Flexible KV cache block sizing
In reuse implementations, only entire cache memory blocks can be allocated for reuse. For example, if the cache memory block size is 64 tokens and KV cache is 80 tokens, only 64 tokens will be stored for reuse, while the remaining 16 tokens will need to be recomputed. However, if the memory block size is reduced to 16 tokens, all 64 tokens can be stored across five memory blocks, eliminating the need for re-computation.
This effect is most pronounced when the input sequences are short. For long input sequences, larger blocks can be more beneficial. As is clear, the more granular the control you have over the KV cache, the better you can optimize it for your specific use case.
TensorRT-LLM provides fine-grained control over KV cache memory blocks, giving developers the ability to chop them into smaller blocks between 64 to 2 tokens. This optimizes the usage of allocated memory, increases reuse rates, and improves TTFT. When running LLAMA70B on NVIDIA H100 Tensor Core GPUs, we can speed up TTFT up to 7% in multi-user environments by reducing KV cache block size from 64 tokens to 8 tokens.
Figure 2. Impact of changing KV cache block size on inference speedup
Efficient KV cache eviction protocols
Partitioning the KV cache into smaller blocks and evicting unused ones can be effective for memory optimization, but it introduces dependency complexities. When a specific block is used to generate a response, and the result is stored as a new block, it can form a tree-like structure of dependencies.
Over time, the counters tracking the usage of the source blocks (the branches) may become stale as the dependent nodes (the leaves) are reused. Evicting the source block then requires the eviction of all dependent blocks, which would require recalculation of the KV cache for new user prompts, increasing TTFT.
To address this challenge, TensorRT-LLM includes intelligent eviction algorithms that can trace the dependent nodes from their source nodes and evict dependent nodes first, even if they have more recent reuse counters. This ensures more efficient memory management while preventing unnecessary evictions of dependent blocks.
Figure 3. A logical representation of KV cache eviction algorithm show how it can reduce the number of evicted blocks, increasing the likelihood of reuse
Getting started with TensorRT-LLM KV cache reuse
Generating KV cache during inference requires a lot of compute and memory resources. Using it efficiently is critical to improving model response, accelerating inference, and increasing system throughput. TensorRT-LLM provides advanced reuse features for developers looking to further optimize TTFT response times for peak performance.
To start using TensorRT-LLM KV cache reuse check out our
GitHub documentation
.",https://developer.nvidia.com/ja-jp/blog/5x-faster-time-to-first-token-with-nvidia-tensorrt-llm-kv-cache-early-reuse/,NVIDIA TensorRT-LLM ã® KV Cache Early Reuseã§ãTime to First Token ã 5 åé«éå,"Reading Time:
2
minutes
以åã®
ããã°èšäº
ã§ã¯ãkey-value (KV) ãã£ãã·ã¥ã CPU ã¡ã¢ãªã«ãªãããŒãããŠåå©çšããããšã§ãæåã®ããŒã¯ã³ãåºåããããŸã§ã®æé (TTFT: Time To First Token) ã x86 ããŒã¹ã® NVIDIA H100 Tensor ã³ã¢ GPU ã§æå€§ 14 åãNVIDIA GH200 Superchip ã§æå€§ 28 åã«é«éåã§ããæ¹æ³ãã玹ä»ããŸãããæ¬èšäºã§ã¯ãKV ãã£ãã·ã¥ã®åå©çšæè¡ãšãTTFT ã®ãããªãé«éåãå®çŸãããã¹ããã©ã¯ãã£ã¹ã«ã€ããŠè§£èª¬ããŸãã
KV ãã£ãã·ã¥ã®æŠèŠ
LLM ã¢ãã«ã¯ã質ååçãã³ãŒãçæãªã©ãå€ãã®ã¿ã¹ã¯ã§æ¥éã«æ¡çšãããŠããŸããå¿çãçæããã«ãããããããã®ã¢ãã«ã¯ãŸãããŠãŒã¶ãŒã®ããã³ãããããŒã¯ã³ãžå€æãããã®åŸãããã®ããŒã¯ã³ãå¯ãã¯ãã«ãžãšå€æããŸããèšå€§ãªãããç©æŒç®ããã®åŸã«ç¶ãããã®åŸããŒã¯ã³éã®é¢ä¿æ§ãæ°åŠçã«ã¢ãã«åãããŠãŒã¶ãŒå
¥åã«å¯Ÿããæèçè§£ãæ§ç¯ããŸãããã®æèçè§£ãçæããããã«ãããèšç®ã³ã¹ãã¯ãå
¥åã·ãŒã±ã³ã¹ã®é·ãã®äºä¹ã«æ¯äŸããŠå¢å ããŸãã
ãã®ãªãœãŒã¹ã倧éã«æ¶è²»ããããã»ã¹ãã key ãšvalue ãçæãããåŸç¶ã®ããŒã¯ã³ãçæãããšãã«å床èšç®ãããªãããã«ãã£ãã·ã¥ãããŸããKV ãã£ãã·ã¥ãåå©çšããããšã§ã远å ã®ããŒã¯ã³ãçæããéã«å¿
èŠãšãªãèšç®è² è·ãšæéã軜æžãããããé«éã§å¹ççãªãŠãŒã¶ãŒäœéšãå®çŸããŸãã
KV ãã£ãã·ã¥ãåå©çšãããšãã«ã¯ããã£ãã·ã¥ãã¡ã¢ãªã«æ®ãæéãã¡ã¢ãªãäžæ¯ã«ãªã£ããšãã«æåã«åé€ããã³ã³ããŒãã³ããããã³æ°ããå
¥åããã³ããã«åå©çšã§ããã¿ã€ãã³ã°ãªã©ã®ç¹ã«çްå¿ã®æ³šæãæãå¿
èŠããããŸãããããã®èŠå ãæé©åããããšã§ãKV ãã£ãã·ã¥ã®åå©çšã«ãããããã©ãŒãã³ã¹ã®æ®µéçãªå¢å ãžãšã€ãªããããšãã§ããŸããNVIDIA TensorRT-LLM ã¯ããããã®åéã«ç¹åãã 3 ã€ã®äž»èŠãªæ©èœãæäŸããŸãã
Early KV cache reuse
åŸæ¥ã®åå©çšã¢ã«ãŽãªãºã ã§ã¯ãKV ãã£ãã·ã¥ããã®äžéšã§ãã£ãŠãæ°ãããŠãŒã¶ãŒ ããã³ããã§åå©çšããããã«ã¯ãäºåã«ãã¹ãŠã® KV ãã£ãã·ã¥ã®èšç®ãå®äºãããŠããå¿
èŠããããŸããããã®æ¹æ³ã¯ãLLM ã®ã¬ã¹ãã³ã¹ãäŒæ¥ã®ã¬ã€ãã©ã€ã³ã«æ²¿ã£ããã®ã«ããããã«ãã·ã¹ãã ããã³ãã (ãŠãŒã¶ãŒã®åãåããã«è¿œå ãããäºåå®çŸ©ã®æç€º) ãäžå¯æ¬ ãšãªãäŒæ¥åããã£ããããããªã©ã®ã·ããªãªã§ã¯ãéå¹ççã§ããå¯èœæ§ããããŸãã
ãã£ããããããšåæã«ããåããããŠãŒã¶ãŒãæ¥å¢ããå ŽåãåãŠãŒã¶ãŒã«å¯ŸããŠã·ã¹ãã ããã³ãã KV ãã£ãã·ã¥ãåå¥ã«èšç®ããå¿
èŠããããŸããTensorRT-LLM ã§ã¯ããªã¢ã«ã¿ã€ã ã§çæãããã·ã¹ãã ããã³ãããåå©çšããããšãã§ãããããæ¥å¢æã«ã¯ãã¹ãŠã®ãŠãŒã¶ãŒãšå
±æããããšãã§ãããŠãŒã¶ãŒããšã«åèšç®ããå¿
èŠããããŸãããããã«ãããã·ã¹ãã ããã³ãããå¿
èŠãšãããŠãŒã¹ ã±ãŒã¹ã®æšè«ãæå€§ 5 åã«ãŸã§é«éåããããšãã§ããŸãã
å³ 1. TensorRT-LLM KV cache reuse ã«ãããTTFT ãæå€§ 5 åé«éå
æè»ãª KV ãã£ãã·ã¥ ããã㯠ãµã€ãº
åå©çšãå®è£
ããéã«ã¯ããã£ãã·ã¥ ã¡ã¢ãª ãããã¯å
šäœã®ã¿ãåå©çšã«å²ãåœãŠãããšãã§ããŸããäŸãã°ããã£ãã·ã¥ ã¡ã¢ãª ããã㯠ãµã€ãºã 64 ããŒã¯ã³ã§ãKV ãã£ãã·ã¥ã 80 ããŒã¯ã³ã§ããå Žåãåå©çšã®ããã«ä¿åã§ããã®ã¯ 64 ããŒã¯ã³ã®ã¿ã§ãããæ®ãã® 16 ããŒã¯ã³ã¯åèšç®ããå¿
èŠããããŸããããããªãããã¡ã¢ãª ããã㯠ãµã€ãºã 16 ããŒã¯ã³ã«æžãããšã64 ããŒã¯ã³ãã¹ãŠã 5 ã€ã®ã¡ã¢ãª ãããã¯ã«æ ŒçŽããããšãã§ããåèšç®ã®å¿
èŠæ§ããªããªããŸãã
ãã®å¹æã¯ãå
¥åã·ãŒã±ã³ã¹ãçããšãã«æãé¡èã«çŸããŸããé·ãå
¥åã·ãŒã±ã³ã¹ã®å Žåã¯ããã倧ããªãããã¯ã®æ¹ãããæçã§ããæããã«ãKV ãã£ãã·ã¥ããã现ããå¶åŸ¡ã§ããã°ã§ããã»ã©ãç¹å®ã®ãŠãŒã¹ ã±ãŒã¹ã«åãããæé©åãåäžããŸãã
TensorRT-LLM ã§ã¯ãKV ãã£ãã·ã¥ ã¡ã¢ãª ãããã¯ããã现ããå¶åŸ¡ã§ãããããéçºè
㯠KV ãã£ãã·ã¥ ã¡ã¢ãª ãããã¯ã 64 ãã 2 ããŒã¯ã³ãŸã§ãããå°ããªãããã¯ã«åå²ããããšãã§ããŸããããã«ãããå²ãåœãŠãããã¡ã¢ãªã®äœ¿çšãæé©åãããåå©çšçãäžæããTTFT ãæ¹åãããŸããNVIDIA H100 Tensor ã³ã¢ GPU ã§ LLAMA70B ãå®è¡ããå ŽåãKV ãã£ãã·ã¥ ãããã¯ãµã€ãºã 64 ããŒã¯ã³ãã 8 ããŒã¯ã³ãžãšæžããããšã§ããã«ããŠãŒã¶ãŒç°å¢ã§ TTFT ãæå€§ 7% é«éåã§ããŸãã
å³ 2. KV ãã£ãã·ã¥ ããã㯠ãµã€ãºã®å€æŽã«ããæšè«ã®é«éå
å¹çç㪠KV ãã£ãã·ã¥ã®é€å€ (Eviction) ãããã³ã«
KV ãã£ãã·ã¥ãããå°ããªãããã¯ã«åå²ããæªäœ¿çšã®ãããã¯ãé€å€ããããšã¯ãã¡ã¢ãªã®æé©åã«å¹æçã§ãããäŸåé¢ä¿ã«è€éããçãŸããŸããç¹å®ã®ãããã¯ãã¬ã¹ãã³ã¹ã®çæã«äœ¿çšããããã®çµæãæ°ãããããã¯ãšããŠä¿åããããšãäŸåé¢ä¿ã®ããªãŒæ§é ã圢æãããå¯èœæ§ããããŸãã
æéã®çµéãšãšãã«ããœãŒã¹ ããã㯠(ãã©ã³ã) ã®äœ¿çšã远跡ããã«ãŠã³ã¿ãŒã¯ãåŸå±ããŒã (ãªãŒã) ãåå©çšãããã«ã€ããŠå€ããªãå¯èœæ§ããããŸãããœãŒã¹ ãããã¯ãé€å€ããã«ã¯ãåŸå±ãããã¹ãŠã®ãããã¯ãé€å€ããå¿
èŠããããæ°ãããŠãŒã¶ ããã³ããã® KV ãã£ãã·ã¥ãåèšç®ããå¿
èŠãçã㊠TTFT ãå¢å ããŸãã
ãã®èª²é¡ã«å¯ŸåŠããããã«ãTensorRT-LLM ã«ã¯ãåŸå±ããŒãããœãŒã¹ ããŒããã远跡ããåŸå±ããŒããããæè¿ã®åå©çšã«ãŠã³ã¿ãŒãæã£ãŠããå Žåã§ããæåã«åŸå±ããŒããé€å€ããããšãã§ããã€ã³ããªãžã§ã³ããªé€å€ã¢ã«ãŽãªãºã ãå«ãŸããŠããŸããããã«ãããããå¹ççã«ã¡ã¢ãªã管çã§ããããã«ãªããšå
±ã«ãåŸå±ãããã¯ã®äžèŠãªé€å€ãåé¿ã§ããŸãã
å³ 3. KV ãã£ãã·ã¥ã®é€å€ã¢ã«ãŽãªãºã ã®è«çã衚çŸããå³ãé€å€ããããããã¯ã®æ°ãæžãããåå©çšã®å¯èœæ§ãé«ããããæ§åã瀺ããŠããŸãã
TensorRT-LLM KV cache reuse ã䜿ãå§ãã
æšè«äžã« KV ãã£ãã·ã¥ãçæããã«ã¯ãå€ãã®èšç®ãšã¡ã¢ãª ãœãŒã¹ãå¿
èŠã«ãªããŸããå¹ççã«äœ¿çšããããšããã¢ãã«å¿çã®æ¹åãæšè«ã®é«éåãã·ã¹ãã ã¹ã«ãŒãããã®åäžã«ã¯äžå¯æ¬ ã§ããTensorRT-LLM ã¯ãããŒã¯æ§èœã®ããã« TTFT å¿çæéãããã«æé©åããããšããéçºè
ã«é«åºŠãªåå©çšæ©èœãæäŸããŸãã
TensorRT-LLM KV cache reuse ã䜿ãå§ããã«ã¯ã
GitHub ã®ããã¥ã¡ã³ã
ãåç
§ããŠãã ããã
é¢é£æ
å ±
GTC ã»ãã·ã§ã³:
Speeding up LLM Inference With TensorRT-LLM (TensorRT-LLM ã«ãã LLM æšè«ã®é«éå)
GTC ã»ãã·ã§ã³:
Optimizing and Scaling LLMs With TensorRT-LLM for Text Generation (ããã¹ãçæã®ããã® TensorRT-LLM ã䜿çšãã LLM ã®æé©åãšã¹ã±ãŒãªã³ã°)
SDK:
Torch-TensorRT
SDK:
TensorRT
SDK:
TensorFlow-TensorRT"
https://developer.nvidia.com/blog/state-of-the-art-multimodal-generative-ai-model-development-with-nvidia-nemo/,State-of-the-Art Multimodal Generative AI Model Development with NVIDIA NeMo,"Generative AI
has rapidly evolved from text-based models to multimodal capabilities. These models perform tasks like image captioning and visual question answering, reflecting a shift toward more human-like AI. The community is now expanding from text and images to video, opening new possibilities across industries.
Video AI models are poised to revolutionize industries such as robotics, automotive, and retail. In
robotics
, they enhance autonomous navigation in complex, ever-changing environments, which is vital for sectors like manufacturing and warehouse management. In the automotive industry, video AI is propelling autonomous driving, boosting vehicle perception, safety, and predictive maintenance to improve efficiency.
To build image and video foundation models, developers must curate and preprocess a large amount of training data, tokenize the resulting high-quality data at high fidelity, train or customize pretrained models efficiently and at scale, and then generate high-quality images and videos during inference.
Announcing NVIDIA NeMo for multimodal generative AI
NVIDIA NeMo
is an end-to-end platform for developing, customizing, and deploying generative AI models.
NVIDIA just announced the expansion of NeMo to support the end-to-end pipeline for developing multimodal models. NeMo enables you to easily curate high-quality visual data, accelerate
training
and
customization
with highly efficient tokenizers and parallelism techniques, and reconstruct high-quality visuals during inference.
Accelerated video and image data curation
High-quality training data ensures high-accuracy results from an AI model. However, developers face various challenges in building data processing pipelines, ranging from scaling to data orchestration.
NeMo Curator
streamlines the data curation process, making it easier and faster for you to build multimodal generative AI models. Its out-of-the-box experience minimizes the total cost of ownership (TCO) and accelerates time-to-market.
While working with visuals, organizations can easily reach petabyte-scale data processing. NeMo Curator provides an orchestration pipeline that can load balance on multiple GPUs at each stage of the data curation. As a result, you can reduce video processing time by 7x compared to a naive GPU-based implementation. The scalable pipelines can efficiently process over 100 PB of data, ensuring the seamless handling of large datasets.
Figure 1. NVIDIA NeMo Curator video processing speed
NeMo Curator provides reference video curation models optimized for high-throughput filtering, captioning, and embedding stages to enhance dataset quality, empowering you to create more accurate AI models.
For instance, NeMo Curator uses an optimized captioning model that delivers an order of magnitude throughput improvement compared to unoptimized inference model implementations.
NVIDIA Cosmos tokenizers
Tokenizers map redundant and implicit visual data into compact and semantic tokens, enabling efficient training of large-scale generative models and democratizing their inference on limited computational resources.
Todayâs open video and image tokenizers often generate poor data representations, leading to lossy reconstructions, distorted images, and temporally unstable videos and placing a cap on the capability of generative models built on top of the tokenizers. Inefficient tokenization processes also result in slow encoding and decoding and longer training and inference times, negatively impacting both developer productivity and the user experience.
NVIDIA Cosmos tokenizers are open models that offer superior visual tokenization with exceptionally large compression rates and cutting-edge reconstruction quality across diverse image and video categories.
Video 1. Efficient Generative AI Tokenizers for Image and Video
These tokenizers provide ease of use through a suite of tokenizer standardized models that support vision-language models (VLMs) with discrete latent codes, diffusion models with continuous latent embeddings, and various aspect ratios and resolutions, enabling the efficient management of large-resolution images and videos. This provides you with tools for tokenizing a wide variety of visual input data to build image and video AI models.
Cosmos tokenizer architecture
A Cosmos tokenizer uses a sophisticated encoder-decoder structure designed for high efficiency and effective learning. At its core, it employs 3D
causal convolution blocks
, which are specialized layers that jointly process spatiotemporal information, and uses causal temporal attention that captures long-range dependencies in data.
The causal structure ensures that the model uses only past and present frames when performing tokenization, avoiding future frames. This is crucial for aligning with the causal nature of many real-world systems, such as those in physical AI or multimodal LLMs.
Figure 2. NVIDIA Cosmos tokenizer architecture
The input is downsampled using 3D wavelets, a signal processing technique that represents pixel information more efficiently. After the data is processed, an inverse wavelet transform reconstructs the original input.
This approach improves learning efficiency, enabling the tokenizer encoder-decoder learnable modules to focus on meaningful features rather than redundant pixel details. The combination of such techniques and its unique training recipe makes the Cosmos tokenizers a cutting-edge architecture for efficient and powerful tokenization.
During inference, the Cosmos tokenizers significantly reduce the cost of running the model by delivering up to 12x faster reconstruction compared to leading open-weight tokenizers (Figure 3).
Figure 3. Quantitative comparison of reconstruction quality (left) and runtime performance (right) for video tokenizers
The Cosmos tokenizers also produce high-fidelity images and videos while compressing more than other tokenizers, demonstrating an unprecedented quality-compression trade-off.
Figure 4. Continuous tokenizer compression rate compared to reconstruction quality
Figure 5. Discrete tokenizer compression rate compared to reconstruction quality
Although the Cosmos tokenizer regenerates from highly compressed tokens, it is capable of creating high-quality images and videos due to an innovative neural network training technique and architecture.
Figure 6. Reconstructed video frame for continuous video tokenizers
Build Your Own Multimodal Models with NeMo
The expansion of the NVIDIA NeMo platform with at-scale data processing using
NeMo Curator
and high-quality tokenization and visual reconstruction using the Cosmos tokenizer empowers you to build state-of-the-art multimodal, generative AI models.
Join the waitlist
and be notified when NeMo Curator is available. The tokenizer is available now on the
/NVIDIA/cosmos-tokenizer
GitHub repo and
Hugging Face
.",https://developer.nvidia.com/ja-jp/blog/state-of-the-art-multimodal-generative-ai-model-development-with-nvidia-nemo/,NVIDIA NeMo ã«ããæå
端ã®ãã«ãã¢ãŒãã«çæ AI ã¢ãã«éçº,"Reading Time:
2
minutes
çæ AI
ã¯ãããã¹ãããŒã¹ã®ã¢ãã«ãããã«ãã¢ãŒãã«æ©èœãžãšæ¥éã«é²åããŠããŸãããããã®ã¢ãã«ã¯ãç»åã®ãã£ãã·ã§ã³äœæãèŠèŠçãªè³ªååçãªã©ã®ã¿ã¹ã¯ãå®è¡ãããã人éã«è¿ã AI ãžãšã·ããããŠããããšãåæ ããŠããŸãããã®ã³ãã¥ããã£ã¯çŸåšãããã¹ããç»åããåç»ãžãšæ¡å€§ããŠãããããŸããŸãªæ¥çã§æ°ããªå¯èœæ§ãåãéãããŠããŸãã
åç» AI ã¢ãã«ã¯ããããã£ã¯ã¹ãèªåè»ãå°å£²ãªã©ã®æ¥çã«é©åœãèµ·ããããšããŠããŸãã
ãããã£ã¯ã¹
ã§ã¯ãè£œé æ¥ãå庫管çãªã©ã®åéã«äžå¯æ¬ ãªãè€éã§å€åãç¶ããç°å¢ã«ãããèªåŸçãªããã²ãŒã·ã§ã³ã匷åããŠããŸããèªåè»æ¥çã§ã¯ãåç» AI ãèªåéè»¢ãæšé²ããè»äž¡ã®èªèãå®å
šæ§ãäºç¥ä¿å
šã匷åããå¹çæ§ãé«ããŠããŸãã
ç»åãåç»ã®åºç€ã¢ãã«ãæ§ç¯ããã«ã¯ãéçºè
ã¯å€§éã®åŠç¿ããŒã¿ã®ãã¥ã¬ãŒã·ã§ã³ãšäºååŠçãè¡ããçµæãšããŠåŸãããé«å質ããŒã¿ãé«ãå¿ å®åºŠã§ããŒã¯ã³åããåŠç¿æžã¿ã¢ãã«ãå¹ççã«å€§èŠæš¡ã«åŠç¿ãŸãã¯ã«ã¹ã¿ãã€ãºããŠãæšè«äžã«é«å質ãªç»åãåç»ãçæããå¿
èŠããããŸãã
ãã«ãã¢ãŒãã«çæ AI åãã® NVIDIA NeMo ãçºè¡š
NVIDIA NeMo
ã¯ãçæ AI ã¢ãã«ãéçºãã«ã¹ã¿ãã€ãºããããã€ãããšã³ãããŒãšã³ãã®ãã©ãããã©ãŒã ã§ãã
NVIDIA ã¯ããã«ãã¢ãŒãã« ã¢ãã«éçºåãã®ãšã³ãããŒãšã³ãã®ãã€ãã©ã€ã³ããµããŒããã NeMo ã®æ¡åŒµãçºè¡šããŸãããNeMo ã«ãããé«å質ãªèŠèŠããŒã¿ãç°¡åã«ãã¥ã¬ãŒã·ã§ã³ããé«å¹çãªããŒã¯ãã€ã¶ãŒãšäžŠååŠçæè¡ã§
åŠç¿
ãš
ã«ã¹ã¿ãã€ãº
ãå éããæšè«äžã«é«å質ãªããžã¥ã¢ã«ãåæ§ç¯ããããšãã§ããŸãã
åç»ãšç»åããŒã¿ã®ãã¥ã¬ãŒã·ã§ã³ãå é
é«å質ãªåŠç¿ããŒã¿ã§ã¯ãAI ã¢ãã«ããé«ç²ŸåºŠãªçµæãåŸãããŸããããããéçºè
ã¯ãããŒã¿åŠçãã€ãã©ã€ã³ã®æ§ç¯ã«ãããŠãã¹ã±ãŒãªã³ã°ããããŒã¿ã®ãªãŒã±ã¹ãã¬ãŒã·ã§ã³ãŸã§ãããŸããŸãªèª²é¡ã«çŽé¢ããŠããŸãã
NeMo Curator
ã¯ãããŒã¿ ãã¥ã¬ãŒã·ã§ã³ ããã»ã¹ãåçåããããšã§ããã«ãã¢ãŒãã«çæ AI ã¢ãã«ãããç°¡åãã€è¿
éã«æ§ç¯ããããšãã§ããŸããããã«è©Šãããšãã§ãããããç·ä¿æã³ã¹ã (TCO) ãæå°éã«æããåžå Žæå
¥ãŸã§ã®æéãççž®ããŸãã
ããžã¥ã¢ã«ãæ±ãéã«ã¯ãçµç¹ã¯ãã¿ãã€ãèŠæš¡ã®ããŒã¿åŠçã容æã«å®è¡ã§ããŸããNeMo Curator ã¯ãããŒã¿ ãã¥ã¬ãŒã·ã§ã³ã®å段éã§è€æ°ã® GPU ã«è² è·åæ£ã§ãããªãŒã±ã¹ãã¬ãŒã·ã§ã³ ãã€ãã©ã€ã³ãæäŸããŸãããã®çµæãåçŽãª GPU ããŒã¹ã®å®è£
ãšæ¯èŒããŠãåç»åŠçæéã 7 åã® 1 ã«ççž®ã§ããŸããã¹ã±ãŒã«å¯èœãªãã€ãã©ã€ã³ã¯ã100 PB ãè¶
ããããŒã¿ãå¹ççã«åŠçã§ããå€§èŠæš¡ãªããŒã¿ã»ãããã·ãŒã ã¬ã¹ã«åãæ±ãããšãã§ããŸãã
å³ 1. NVIDIA NeMo Curator ã®åç»åŠçé床
NeMo Curator ã¯ãé«ãã¹ã«ãŒãããã®ãã£ã«ã¿ãªã³ã°ããã£ãã·ã§ã³äœæãåã蟌ã¿ã®å段éã«æé©åããããªãã¡ã¬ã³ã¹ ãã㪠ãã¥ã¬ãŒã·ã§ã³ ã¢ãã«ãæäŸããããŒã¿ã»ããã®å質ãåäžãããããæ£ç¢ºãª AI ã¢ãã«ã®äœæããµããŒãããŸãã
ããšãã°ãNeMo Curator ã¯ãæé©åããããã£ãã·ã§ã³ ã¢ãã«ã䜿çšããæé©åãããŠããªãæšè«ã¢ãã«ã®å®è£
ãšæ¯èŒããŠãæ¡éãã®ã¹ã«ãŒãããã®åäžãå®çŸããŸãã
NVIDIA Cosmos ããŒã¯ãã€ã¶ãŒ
ããŒã¯ãã€ã¶ãŒã¯ãåé·çã§æé»çãªèŠèŠããŒã¿ãã³ã³ãã¯ãã§æå³ã®ããããŒã¯ã³ã«ãããã³ã°ããå€§èŠæš¡ãªçæã¢ãã«ã®å¹ççãªåŠç¿ãå®çŸãã誰ããéãããèšç®ãªãœãŒã¹ã§æšè«ã§ããããã«ããŸãã
仿¥ã®ãªãŒãã³ãªåç»ãç»åã®ããŒã¯ãã€ã¶ãŒã¯ãããŒã¿è¡šçŸãäžååãªããšãå€ããããå£åã®å€ãåæ§ç¯ãæªãã ç»åãäžé£ç¶ãªåç»ã«ã€ãªãããããŒã¯ãã€ã¶ãŒäžã«æ§ç¯ãããçæã¢ãã«ã®èœåã«éçããããããŸããããŒã¯ã³åããã»ã¹ãéå¹çãªããããšã³ã³ãŒãããã³ãŒãã«æéãããããåŠç¿ãæšè«ã®æéãé·ããªããéçºè
ã®çç£æ§ãšãŠãŒã¶ãŒäœéšã®äž¡æ¹ã«æªåœ±é¿ãåãŒããŸãã
NVIDIA Cosmos ããŒã¯ãã€ã¶ãŒã¯ãåªããèŠèŠããŒã¯ã³åãæäŸãããªãŒãã³ãªã¢ãã«ã§ãããŸããŸãªç»åãåç»ã®ã«ããŽãªãŒã§ãé«ãå§çž®çãšæå
端ã®åæ§ç¯å質ãå®çŸããŸãã
颿£çãªæœåšã³ãŒããåããèŠèŠèšèªã¢ãã« (VLM: Vision-language Model)ãé£ç¶ããæœåšçåã蟌ã¿ã«ããæ¡æ£ã¢ãã«ãããŸããŸãªã¢ã¹ãã¯ãæ¯ãè§£å床ããµããŒãããäžé£ã®ããŒã¯ãã€ã¶ãŒæšæºåã¢ãã«ã䜿çšããŠããããã®ããŒã¯ãã€ã¶ãŒãç°¡åã«äœ¿çšã§ããé«è§£å床ã®ç»åãåç»ãå¹ççã«ç®¡çããããšãã§ããŸããããã«ãããç»åãåç» AI ã¢ãã«ãæ§ç¯ããããã«ãå¹
åºãèŠèŠå
¥åããŒã¿ãããŒã¯ã³åããããŒã«ãæäŸãããŸãã
Cosmos ããŒã¯ãã€ã¶ãŒã®ã¢ãŒããã¯ãã£
Cosmos ããŒã¯ãã€ã¶ãŒã¯ãé«å¹çãã€å¹æçãªåŠç¿åãã«èšèšãããŠãããé«åºŠãªãšã³ã³ãŒã㌠/ ãã³ãŒããŒæ§é ã䜿çšããŠããŸãããã®äžæ žã«ã¯ 3D
Causal Convolution Block
(å æç³ã¿èŸŒã¿ãããã¯) ãæ¡çšããŠããŸããããã¯æç©ºéæ
å ±ãå
±ååŠçããç¹æ®ãªã¬ã€ã€ãŒã§ãããŒã¿ã®é·æçãªäŸåé¢ä¿ãæãã Causal Temporal Attention (å æçæéæ³šææ©æ§) ã䜿çšããŠããŸãã
ãã®å ææ§é ã«ãããããŒã¯ã³åã®å®è¡æã«ã¢ãã«ãéå»ãšçŸåšã®ãã¬ãŒã ã®ã¿ã䜿çšããæªæ¥ã®ãã¬ãŒã ã¯äœ¿çšããŸãããããã¯ãç©ççãªAIããã«ãã¢ãŒãã«LLMãªã©ã®å€ãã®çŸå®äžçã®ã·ã¹ãã ã®å ææ§ã«åãããããã«éèŠã§ãã
å³ 2. NVIDIA Cosmos ããŒã¯ãã€ã¶ãŒã®ã¢ãŒããã¯ãã£
å
¥åã¯ããã¯ã»ã«æ
å ±ãããå¹ççã«è¡šãä¿¡å·åŠçæè¡ã§ãã 3D ãŠã§ãŒãã¬ããã䜿çšããŠããŠã³ãµã³ããªã³ã°ãããŸããããŒã¿åŠçåŸãéãŠã§ãŒãã¬ãã倿ã«ãã£ãŠå
ã®å
¥åãåæ§ç¯ãããŸãã
ãã®ã¢ãããŒãã«ãããåŠç¿å¹çãåäžããããŒã¯ãã€ã¶ãŒã®ãšã³ã³ãŒã㌠/ ãã³ãŒããŒã®åŠç¿å¯èœãªã¢ãžã¥ãŒã«ã¯ãåé·ãªãã¯ã»ã«ã®è©³çްã§ã¯ãªããæå³ã®ããç¹åŸŽã«çŠç¹ãåœãŠãããšãã§ããŸãããã®ãããªæè¡ãšç¬èªã®åŠç¿ã¬ã·ãã®çµã¿åããã«ãããCosmos ããŒã¯ãã€ã¶ãŒã¯ãå¹ççãã€åŒ·åãªããŒã¯ã³åãå®çŸããæå
端ã®ã¢ãŒããã¯ãã£ãšãªã£ãŠããŸãã
æšè«ã®éãCosmos ããŒã¯ãã€ã¶ãŒã¯ãäž»èŠãªãªãŒãã³ãŠã§ã€ãã®ããŒã¯ãã€ã¶ãŒãšæ¯èŒããŠæå€§ 12 åé«éãªåæ§ç¯ãå®çŸããã¢ãã«ã®å®è¡ã³ã¹ãã倧å¹
ã«åæžããŸãã (å³ 3)ã
å³ 3. Cosmos ããŒã¯ãã€ã¶ãŒãšäž»èŠãªãªãŒãã³ãŠã§ã€ãã®ããŒã¯ãã€ã¶ãŒãšã®æ¯èŒ
Cosmos ããŒã¯ãã€ã¶ãŒã¯ãä»ã®ããŒã¯ãã€ã¶ãŒãããé«ãå§çž®çãå®çŸããªãããé«ãå¿ å®åºŠã®ç»åãåç»ãçæããåäŸã®ãªãå質ãšå§çž®ã®ãã¬ãŒããªããå®çŸããŠããŸãã
å³ 4. é£ç¶ããŒã¯ãã€ã¶ãŒã®å§çž®çãšåæ§ç¯åè³ªã®æ¯èŒ
å³ 5. 颿£ããŒã¯ãã€ã¶ãŒã®å§çž®çãšåæ§ç¯åè³ªã®æ¯èŒ
Cosmos ããŒã¯ãã€ã¶ãŒã¯ãé«åºŠã«å§çž®ãããããŒã¯ã³ããåçæãããŸããã驿°çãªãã¥ãŒã©ã« ãããã¯ãŒã¯ã®åŠç¿æè¡ãšã¢ãŒããã¯ãã£ã«ãããé«å質ãªç»åãåç»ãäœæããããšãã§ããŸãã
å³ 6. é£ç¶åç»ããŒã¯ãã€ã¶ãŒã§åæ§ç¯ãããåç»ãã¬ãŒã
NeMo ã§ç¬èªã®ãã«ãã¢ãŒãã« ã¢ãã«ãæ§ç¯
NeMo Curator
ã䜿çšããå€§èŠæš¡ãªããŒã¿åŠçãšãCosmos ããŒã¯ãã€ã¶ãŒã䜿çšããé«å質ãªããŒã¯ã³åãããžã¥ã¢ã«åæ§ç¯ãåãããNVIDIA NeMo ãã©ãããã©ãŒã ã®æ¡åŒµã«ãããæå
端ã®ãã«ãã¢ãŒãã«çæ AI ã¢ãã«ãæ§ç¯ããããšãã§ããŸãã
ç»é²
ããŠããã ããšãNeMo Curator ãå©çšå¯èœã«ãªã£ãéã«éç¥ãåãåãããšãã§ããŸããããŒã¯ãã€ã¶ãŒã¯ãçŸåš
/NVIDIA/cosmos-tokenizer
GitHub ãªããžããªããã³
Hugging Face
ã§å©çšããããšãã§ããŸãã
é¢é£æ
å ±
GTC ã»ãã·ã§ã³:
Large Language Model Fine-Tuning using Parameter Efficient Fine-Tuning (PEFT ã䜿çšããå€§èŠæš¡èšèªã¢ãã«ã®ãã¡ã€ã³ãã¥ãŒãã³ã°)
GTC ã»ãã·ã§ã³:
Large Language Model Fine-Tuning using NVIDIA NeMo (NVIDIA NeMo ã䜿çšããå€§èŠæš¡èšèªã¢ãã«ã®ãã¡ã€ã³ãã¥ãŒãã³ã° â Domino Data Lab æäŸ)
SDK:
NVIDIA NeMo ã«ã¹ã¿ãã€ã¶ãŒ
SDK:
NeMo LLM ãµãŒãã¹
SDK:
NeMo Megatron"
|