Spaces:

nvidia
/

P2A-test-NV

Sleeping

P2A-test-NV / vecalign /valid_en_ja.csv

KuangDW

add alignment and specify encoder

dd05f29 14 days ago

47 kB

	en_url,en_title,en,jp_url,jp_title,ja
	https://developer.nvidia.com/blog/expanding-ai-agent-interface-options-with-2d-and-3d-digital-human-avatars/,Expanding AI Agent Interface Options with 2D and 3D Digital Human Avatars,"When interfacing with
	generative AI
	applications, users have multiple communication options—text, voice, or through digital avatars.
	Traditional chatbot or copilot applications have text interfaces where users type in queries and receive text-based responses. For hands-free communication, speech AI technologies like
	automatic speech recognition
	(ASR) and
	text-to-speech
	(TTS) facilitate verbal interactions, ideal for scenarios like phone-based customer service. Moreover, combining digital avatars with speech capabilities provides a more dynamic interface for users to engage visually with the application. According to Gartner, by 2028, 45% of organizations with more than 500 employees will leverage employee AI avatars to expand the capacity of human capital.
	1
	Digital avatars can vary widely in style—some use cases benefit from photorealistic 3D or 2D avatars, while other use cases work better with a stylized, or cartoonish avatar.
	3D Avatars
	offer fully immersive experiences, showcasing lifelike movements and photorealism. Developing these avatars requires specialized software and technical expertise, as they involve intricate body animations and high-quality renderings.
	2D Avatars
	are quicker to develop and ideal for web-embedded solutions. They offer a streamlined approach to creating interactive AI, often requiring artists for design and animation but less intensive in terms of technical resources.
	To kickstart your creation of a photo-realistic digital human, the
	NVIDIA AI Blueprint on digital humans for customer service
	can be tailored for various use cases. This functionality is now included with support for the NVIDIA Maxine
	Audio2Face-2D
	NIM microservice. ‌Additionally, the blueprint now offers flexibility in rendering for 3D avatar developers to use
	Unreal Engine
	.
	How to add a talking digital avatar to your agent application
	In the AI Blueprint for digital humans, a user interacts with an
	AI agent
	that leverages
	NVIDIA ACE
	technology (Figure 1).
	Figure 1. Architecture diagram for the NVIDIA AI Blueprint for digital humans
	The audio input from the user is sent to the ACE agent which orchestrates the communication between various NIM microservices. The ACE agent uses the
	Riva Parakeet NIM
	to convert the audio to text, which is then processed by a RAG pipeline. The RAG pipeline uses the NVIDIA NeMo Retriever
	embedding
	and
	reranking
	NIM microservices, and an
	LLM NIM
	, to respond with relevant context from stored documents.
	Finally, the response is converted back to speech via Riva TTS, animating the digital human using the Audio2Face-3D NIM or Audio2Face-2D NIM.
	Considerations when designing your AI agent application
	In global enterprises, communication barriers across languages can slow down operations. AI-powered avatars with multilingual capabilities communicate across languages effortlessly. The digital human AI Blueprint provides conversational AI capabilities that simulate human interactions that accommodate users’ speech styles and languages through Riva ASR, neural machine translation (NMT) along with intelligent interruption and barge-in support.
	One of the key benefits of digital human AI agents is their ability to function as “always-on” resources for employees and customers alike. RAG-powered AI agents continuously learn from interactions and improve over time, providing more accurate responses and better user experiences.
	For enterprises considering digital human interfaces, choosing the right avatar and rendering option depends on the use case and customization preferences.
	Use Case
	: 3D avatars are ideal for highly immersive use cases like in physical stores, kiosks or primarily one-to-one interactions, while 2D avatars are effective for web or mobile conversational AI use cases.
	Development and Customization Preferences
	: Teams with 3D and animation expertise can leverage their skillset to create an immersive and ultra-realistic avatar, while teams looking to iterate and customize quickly can benefit from the simplicity of 2D avatars.
	Scaling Considerations:
	Scaling is an important consideration when evaluating avatars and corresponding rendering options. Stream throughput, especially for 3D avatars, is highly dependent on the choice and quality of the character asset used, the desired output resolution and the rendering option of choice (Omniverse Renderer or Unreal Engine) can play a critical role in determining per stream compute footprint.
	NVIDIA Audio2Face-2D allows creation of lifelike 2D avatars from just a portrait image and voice input. Easy and simple configurations allow developers to quickly iterate and produce target avatars and animations for their digital human use cases. With real-time output and cloud-native deployment, 2D digital humans are ideal for interactive use cases and streaming avatars for interactive web-embedded solutions.
	For example, enterprises looking to deploy AI agents across multiple devices and inserting digital humans into web- or mobile-first customer journeys, can benefit from the reduced hardware demands of 2D avatars.
	3D photorealistic avatars provide an unmatched immersive experience for use cases demanding ‌highly empathetic user engagement. NVIDIA Audio2Face-3D and Animation NIM microservices animate a 3D character by generating blendshapes along with subtle head and body animation to create an immersive, photorealistic avatar. The digital human AI Blueprint now supports two rendering options for 3D avatars, including Omniverse Renderer and Unreal Engine Renderer, providing developers the flexibility to integrate the rendering option of their choice.
	To explore how digital humans can enhance your enterprise, visit the
	NVIDIA API catalog
	to learn about the different avatar options.
	Getting started with digital avatars
	For hands-on development with Audio2Face-2D and Unreal Engine NIM microservices,
	apply for ACE Early Access
	or dive into the digital human AI Blueprint
	technical blog
	to learn how you can add digital human interfaces to personalize chatbot applications.
	1
	Gartner®, Hype Cycle for the Future of Work, 2024 by Tori Paulman, Emily Rose McRae, etc., July 2024
	GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.",https://developer.nvidia.com/ja-jp/blog/expanding-ai-agent-interface-options-with-2d-and-3d-digital-human-avatars/,2D と 3D のデジタルヒューマンアバターによる AI エージェントインターフェイスオプションの拡張,"Reading Time:
	2
	minutes
	ユーザーが
	生成 AI
	アプリケーションを使ってやり取りする際には、テキスト、音声、デジタルアバターなど複数のコミュニケーションオプションを利用することができます。
	従来のチャットボットやコパイロットアプリケーションでは、ユーザーが問い合わせを入力し、テキストベースの応答を受信するテキストインターフェイスを使用しています。ハンズフリーのコミュニケーションでは、
	自動音声認識
	(ASR: Automatic Speech Recognition) や
	音声合成
	(TTS: Text-To-Speech) などの音声 AI 技術により、電話を使用したカスタマーサービスなどのシナリオに最適な口頭によるやり取りが容易になります。さらに、デジタルアバターに音声機能を持たせることで、ユーザーがアプリケーションを視覚的に使用できるため、ダイナミックなインターフェイスを提供できます。Gartner によると、2028 年までに、従業員 500 名以上の組織の 45% が、人的資本の能力拡大のために、 AI アバターの従業員を活用するようになるそうです。
	1
	デジタルアバターのスタイルは様々で、フォトリアリスティックな 3D または 2D のアバターが適しているケースもあれば、定型化されたアバターや漫画のようなアバターの方が適しているケースもあります。
	3D アバター
	は、リアルな動きと写実性を再現し、完全な没入体験を提供します。このようなアバターの開発には、複雑なボディーアニメーションや高品質のレンダリングが必要となるため、専門的なソフトウェアや技術的な専門知識が必要になります。
	2D アバター
	は開発が迅速で、Web に組み込みソリューションに最適です。インタラクティブな AI の作成に合理的なアプローチを提供し、デザインやアニメーションにはアーティストが必要になることが多いですが、技術的なリソースの面はそれほど負担になりません。
	フォトリアリスティックなデジタルヒューマンの作成を始めるにあたり、
	カスタマーサービス向けデジタルヒューマンの NVIDIA AI Blueprint
	は、さまざまなユースケースに合わせてカスタマイズすることができます。この機能は現在、NVIDIA Maxine
	Audio2Face-2D
	NIM マイクロサービスのサポートに含まれています。さらに、この Blueprint では、3D アバター開発者が
	Unreal Engine
	を使用できるよう、レンダリングに柔軟性を持たせています。
	エージェントアプリケーションに会話するデジタルアバターを追加する方法
	デジタルヒューマン向け AI Blueprint では、ユーザーが
	NVIDIA ACE
	技術を活用した
	AI エージェント
	と対話します (図 1)。
	図 1. デジタルヒューマン向け NVIDIA AI Blueprint のアーキテクチャ
	ユーザーによる音声入力は、さまざまな NIM マイクロサービス間の通信を調整する ACE エージェントに送信されます。ACE エージェントは、
	Riva Parakeet NIM
	を使用して音声をテキストに変換し、そのテキストは RAG パイプラインで処理されます。RAG パイプラインでは、NIM マイクロサービスの
	埋め込み
	と
	リランク
	を行う NVIDIA NeMo Retriever と
	LLM NIM
	を使用して、保存されたドキュメントから関連するコンテキストを用いて応答します。
	最後に、Riva TTS を介してこの応答を音声に変換し、Audio2Face-3D NIM または Audio2Face-2D NIM を使用してデジタルヒューマンをアニメーション化します。
	AI エージェントアプリケーションを設計する際に考慮すべきポイント
	グローバル企業では、言語の壁によるコミュニケーションの障害が業務の妨げとなることがあります。多言語機能を備えた AI 搭載アバターを使用すれば、言語の壁を超えた円滑なコミュニケーションを取ることができます。デジタルヒューマン AI Blueprint は、Riva ASR やニューラル機械翻訳 (NMT: Neural Machine Translation) に加え、インテリジェントな割り込みやバージイン機能を備え、ユーザーの話し方や言語に柔軟に対応できる、人間らしい対話型 AI を実現します。
	デジタルヒューマン AI エージェントの主な利点の 1 つは、従業員と顧客の両者にとって「常時稼働する」リソースとして機能できることです。RAG を搭載した AI エージェントは、やりとりから継続的に学習し、時間の経過とともに改善していくため、より正確な対応とより優れたユーザー体験を提供することができます。
	デジタルヒューマンインターフェイスを検討している企業にとって、適切なアバターとレンダリングオプションの選択は、ユースケースやカスタマイズ設定に依存します。
	ユースケース
	: 3D アバターは、実店舗やキオスク (無人端末) など、主に 1対 1 のやりとりのような、非常に没入感の高いユースケースに最適ですが、2D アバターは、Web やモバイルの対話型 AI ユースケースに効果的です。
	開発とカスタマイズの設定
	: 3D やアニメーションの専門知識を持つチームは、そのスキルを活用して没入感のある超リアルなアバターを作成できます。一方、反復作業やカスタマイズを迅速に行いたいチームには、シンプルな 2D アバターが有効です。
	スケーリングの考慮すべきポイント
	: アバターと対応するレンダリングオプションを評価する際に、スケーリングは考慮すべき重要なポイントです。ストリームのスループットは、特に 3D アバターの場合、使用するキャラクターアセットの選択と品質によって大きく異なります。希望する出力解像度や選択するレンダリングオプション (Omniverse Renderer または Unreal Engine) は、ストリームあたりの計算フットプリントを決定する上で重要な役割を果たします。
	NVIDIA Audio2Face-2D では、顔写真と音声入力だけでリアルな 2D アバターを作成できます。簡単でシンプルな構成のため、開発者はデジタルヒューマンのユースケースに合わせたアバターやアニメーションを迅速に繰り返し作成できます。リアルタイム出力とクラウドネイティブのデプロイにより、2D デジタルヒューマンは、インタラクティブなユースケースや、インタラクティブな Web 組み込みソリューション向けのストリーミングアバターに最適です。
	たとえば、複数のデバイスに AI エージェントをデプロイし、Web またはモバイルファーストのカスタマージャーニーにデジタルヒューマンを導入しようとしている企業には、2D アバターはハードウェア要件が軽減するのでメリットがあります。
	3D のフォトリアリスティックなアバターは、高い共感が要求されるユーザーエンゲージメントを必要とするユースケースに、比類のない没入体験を提供します。NVIDIA Audio2Face-3D とアニメーション NIM マイクロサービスは、繊細な頭部と身体のアニメーションとともにブレンドシェイプを生成し、没入感のあるフォトリアリスティックなアバターを作成することで、3D キャラクターをアニメーション化します。デジタルヒューマン AI Blueprint は、3D アバターのレンダリングオプションをとして、Omniverse レンダラーと Unreal-Engine レンダラーをサポートしており、開発者が選択したレンダリングオプションを柔軟に統合できるようになりました。
	デジタルヒューマンが企業を強化する方法については、
	NVIDIA API カタログ
	にアクセスして、さまざまなアバターのオプションをご覧ください。
	デジタルアバターを始める
	Audio2Face-2D と Unreal Engine NIM マイクロサービスを使用した実践的な開発については、
	ACE 早期アクセスに申し込む
	か、デジタルヒューマン AI Blueprint の
	技術ブログ
	にアクセスして、チャットボットアプリケーションをパーソナライズするためにデジタルヒューマンインターフェイスを追加する方法について学ぶことができます。
	1
	Gartner®, Hype Cycle for the Future of Work, 2024 by Tori Paulman, Emily Rose McRae, etc., July 2024
	GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
	関連情報
	GTC セッション:
	Enhancing the Digital Human Experience with Cloud Microservices Accelerated by Generative AI
	GTC セッション:
	Build a World of Interactive Avatars Based on NVIDIA Omniverse, AIGC, and LLM
	NGC コンテナー:
	ACE エージェントサンプルフロントエンド
	SDK:
	NVIDIA Tokkio
	ウェビナー:
	How Telcos Transform Customer Experiences with Conversational AI"
	https://developer.nvidia.com/blog/5x-faster-time-to-first-token-with-nvidia-tensorrt-llm-kv-cache-early-reuse/,5x Faster Time to First Token with NVIDIA TensorRT-LLM KV Cache Early Reuse,"In our previous
	blog post
	, we demonstrated how reusing the key-value (KV) cache by offloading it to CPU memory can accelerate time to first token (TTFT) by up to 14x on x86-based NVIDIA H100 Tensor Core GPUs and 28x on the NVIDIA GH200 Superchip. In this post, we shed light on KV cache reuse techniques and best practices that can drive even further TTFT speedups.
	Introduction to KV cache
	LLM models are rapidly being adopted for many tasks, including question-answering, and code generation. To generate a response, these models begin by converting the user’s prompt into tokens, which are then transformed into dense vectors. Extensive dot-product operations follow to mathematically model the relationships between the tokens and build a contextual understanding of the user input. The computational cost of generating this contextual understanding increases quadratically with the length of the input sequence.
	This resource-intensive process generates keys and values, which are cached to avoid recomputation when generating subsequent tokens. Reusing the KV cache reduces the computational load and time needed to generate additional tokens—leading to a faster and more efficient user experience.
	When reusing the KV cache, careful attention must be given to how long it remains in memory, which components to evict first when memory is full, and when it can be reused for new incoming prompts. Optimizing these factors can lead to incremental performance improvements in KV cache reuse. NVIDIA TensorRT-LLM offers three key features that specifically address these areas.
	Early KV cache reuse
	Traditional reuse algorithms require the entire KV cache computation to be completed before any portions of it can be reused with new user prompts. In scenarios such as enterprise chatbots, where system prompts—predefined instructions added to user queries—are essential to direct the LLM’s responses in line with enterprise guidelines, this method can be inefficient.
	When a surge of users interacts with the chatbot simultaneously, each user would require a separate computation of the system prompt KV cache. With TensorRT-LLM, we can instead reuse the system prompt as it is being generated in real time, enabling it to be shared across all users during the burst, rather than recalculating it for each user. This can significantly accelerate inference for use cases requiring system prompts by up to 5x.
	Figure 1. TensorRT-LLM KV cache reuse can speed up TTFT by up to 5x
	Flexible KV cache block sizing
	In reuse implementations, only entire cache memory blocks can be allocated for reuse. For example, if the cache memory block size is 64 tokens and KV cache is 80 tokens, only 64 tokens will be stored for reuse, while the remaining 16 tokens will need to be recomputed. However, if the memory block size is reduced to 16 tokens, all 64 tokens can be stored across five memory blocks, eliminating the need for re-computation.
	This effect is most pronounced when the input sequences are short. For long input sequences, larger blocks can be more beneficial. As is clear, the more granular the control you have over the KV cache, the better you can optimize it for your specific use case.
	TensorRT-LLM provides fine-grained control over KV cache memory blocks, giving developers the ability to chop them into smaller blocks between 64 to 2 tokens. This optimizes the usage of allocated memory, increases reuse rates, and improves TTFT. When running LLAMA70B on NVIDIA H100 Tensor Core GPUs, we can speed up TTFT up to 7% in multi-user environments by reducing KV cache block size from 64 tokens to 8 tokens.
	Figure 2. Impact of changing KV cache block size on inference speedup
	Efficient KV cache eviction protocols
	Partitioning the KV cache into smaller blocks and evicting unused ones can be effective for memory optimization, but it introduces dependency complexities. When a specific block is used to generate a response, and the result is stored as a new block, it can form a tree-like structure of dependencies.
	Over time, the counters tracking the usage of the source blocks (the branches) may become stale as the dependent nodes (the leaves) are reused. Evicting the source block then requires the eviction of all dependent blocks, which would require recalculation of the KV cache for new user prompts, increasing TTFT.
	To address this challenge, TensorRT-LLM includes intelligent eviction algorithms that can trace the dependent nodes from their source nodes and evict dependent nodes first, even if they have more recent reuse counters. This ensures more efficient memory management while preventing unnecessary evictions of dependent blocks.
	Figure 3. A logical representation of KV cache eviction algorithm show how it can reduce the number of evicted blocks, increasing the likelihood of reuse
	Getting started with TensorRT-LLM KV cache reuse
	Generating KV cache during inference requires a lot of compute and memory resources. Using it efficiently is critical to improving model response, accelerating inference, and increasing system throughput. TensorRT-LLM provides advanced reuse features for developers looking to further optimize TTFT response times for peak performance.
	To start using TensorRT-LLM KV cache reuse check out our
	GitHub documentation
	.",https://developer.nvidia.com/ja-jp/blog/5x-faster-time-to-first-token-with-nvidia-tensorrt-llm-kv-cache-early-reuse/,NVIDIA TensorRT-LLM の KV Cache Early Reuseで、Time to First Token を 5 倍高速化,"Reading Time:
	2
	minutes
	以前の
	ブログ記事
	では、key-value (KV) キャッシュを CPU メモリにオフロードして再利用することで、最初のトークンが出力されるまでの時間 (TTFT: Time To First Token) を x86 ベースの NVIDIA H100 Tensor コア GPU で最大 14 倍、NVIDIA GH200 Superchip で最大 28 倍に高速化できる方法をご紹介しました。本記事では、KV キャッシュの再利用技術と、TTFT のさらなる高速化を実現するベストプラクティスについて解説します。
	KV キャッシュの概要
	LLM モデルは、質問回答やコード生成など、多くのタスクで急速に採用されています。応答を生成するにあたり、これらのモデルはまず、ユーザーのプロンプトをトークンへ変換し、その後これらのトークンを密ベクトルへと変換します。膨大なドット積演算がその後に続き、その後トークン間の関係性を数学的にモデル化し、ユーザー入力に対する文脈理解を構築します。この文脈理解を生成するためにかかる計算コストは、入力シーケンスの長さの二乗に比例して増加します。
	このリソースを大量に消費するプロセスから key とvalue が生成され、後続のトークンを生成するときに再度計算されないようにキャッシュされます。KV キャッシュを再利用することで、追加のトークンを生成する際に必要となる計算負荷と時間が軽減され、より高速で効率的なユーザー体験を実現します。
	KV キャッシュを再利用するときには、キャッシュがメモリに残る期間、メモリが一杯になったときに最初に削除するコンポーネント、および新しい入力プロンプトに再利用できるタイミングなどの点に細心の注意を払う必要があります。これらの要因を最適化することで、KV キャッシュの再利用におけるパフォーマンスの段階的な増加へとつなげることができます。NVIDIA TensorRT-LLM は、これらの分野に特化した 3 つの主要な機能を提供します。
	Early KV cache reuse
	従来の再利用アルゴリズムでは、KV キャッシュをその一部であっても新しいユーザープロンプトで再利用するためには、事前にすべての KV キャッシュの計算を完了させておく必要がありました。この方法は、LLM のレスポンスを企業のガイドラインに沿ったものにするために、システムプロンプト (ユーザーの問い合わせに追加される事前定義の指示) が不可欠となる企業向けチャットボットなどのシナリオでは、非効率的である可能性があります。
	チャットボットと同時にやり取りするユーザーが急増した場合、各ユーザーに対してシステムプロンプト KV キャッシュを個別に計算する必要があります。TensorRT-LLM では、リアルタイムで生成されるシステムプロンプトを再利用することができるため、急増時にはすべてのユーザーと共有することができ、ユーザーごとに再計算する必要がありません。これにより、システムプロンプトを必要とするユースケースの推論を最大 5 倍にまで高速化することができます。
	図 1. TensorRT-LLM KV cache reuse により、TTFT を最大 5 倍高速化
	柔軟な KV キャッシュブロックサイズ
	再利用を実装する際には、キャッシュメモリブロック全体のみを再利用に割り当てることができます。例えば、キャッシュメモリブロックサイズが 64 トークンで、KV キャッシュが 80 トークンである場合、再利用のために保存できるのは 64 トークンのみであり、残りの 16 トークンは再計算する必要があります。しかしながら、メモリブロックサイズを 16 トークンに減らすと、64 トークンすべてを 5 つのメモリブロックに格納することができ、再計算の必要性がなくなります。
	この効果は、入力シーケンスが短いときに最も顕著に現れます。長い入力シーケンスの場合は、より大きなブロックの方がより有益です。明らかに、KV キャッシュをより細かく制御できればできるほど、特定のユースケースに合わせた最適化も向上します。
	TensorRT-LLM では、KV キャッシュメモリブロックをきめ細かく制御できるため、開発者は KV キャッシュメモリブロックを 64 から 2 トークンまで、より小さなブロックに分割することができます。これにより、割り当てられたメモリの使用が最適化され、再利用率が上昇し、TTFT が改善されます。NVIDIA H100 Tensor コア GPU で LLAMA70B を実行する場合、KV キャッシュブロックサイズを 64 トークンから 8 トークンへと減らすことで、マルチユーザー環境で TTFT を最大 7% 高速化できます。
	図 2. KV キャッシュブロックサイズの変更による推論の高速化
	効率的な KV キャッシュの除外 (Eviction) プロトコル
	KV キャッシュをより小さなブロックに分割し、未使用のブロックを除外することは、メモリの最適化に効果的ですが、依存関係に複雑さが生まれます。特定のブロックがレスポンスの生成に使用され、その結果が新しいブロックとして保存されると、依存関係のツリー構造が形成される可能性があります。
	時間の経過とともに、ソースブロック (ブランチ) の使用を追跡するカウンターは、従属ノード (リーフ) が再利用されるにつれて古くなる可能性があります。ソースブロックを除外するには、従属するすべてのブロックを除外する必要があり、新しいユーザプロンプトの KV キャッシュを再計算する必要が生じて TTFT が増加します。
	この課題に対処するために、TensorRT-LLM には、従属ノードをソースノードから追跡し、従属ノードがより最近の再利用カウンターを持っている場合でも、最初に従属ノードを除外することができるインテリジェントな除外アルゴリズムが含まれています。これにより、より効率的にメモリを管理できるようになると共に、従属ブロックの不要な除外を回避できます。
	図 3. KV キャッシュの除外アルゴリズムの論理を表現した図。除外されるブロックの数を減らし、再利用の可能性を高められる様子を示しています。
	TensorRT-LLM KV cache reuse を使い始める
	推論中に KV キャッシュを生成するには、多くの計算とメモリソースが必要になります。効率的に使用することが、モデル応答の改善、推論の高速化、システムスループットの向上には不可欠です。TensorRT-LLM は、ピーク性能のために TTFT 応答時間をさらに最適化しようとする開発者に高度な再利用機能を提供します。
	TensorRT-LLM KV cache reuse を使い始めるには、
	GitHub のドキュメント
	を参照してください。
	関連情報
	GTC セッション:
	Speeding up LLM Inference With TensorRT-LLM (TensorRT-LLM による LLM 推論の高速化)
	GTC セッション:
	Optimizing and Scaling LLMs With TensorRT-LLM for Text Generation (テキスト生成のための TensorRT-LLM を使用した LLM の最適化とスケーリング)
	SDK:
	Torch-TensorRT
	SDK:
	TensorRT
	SDK:
	TensorFlow-TensorRT"
	https://developer.nvidia.com/blog/state-of-the-art-multimodal-generative-ai-model-development-with-nvidia-nemo/,State-of-the-Art Multimodal Generative AI Model Development with NVIDIA NeMo,"Generative AI
	has rapidly evolved from text-based models to multimodal capabilities. These models perform tasks like image captioning and visual question answering, reflecting a shift toward more human-like AI. The community is now expanding from text and images to video, opening new possibilities across industries.
	Video AI models are poised to revolutionize industries such as robotics, automotive, and retail. In
	robotics
	, they enhance autonomous navigation in complex, ever-changing environments, which is vital for sectors like manufacturing and warehouse management. In the automotive industry, video AI is propelling autonomous driving, boosting vehicle perception, safety, and predictive maintenance to improve efficiency.
	To build image and video foundation models, developers must curate and preprocess a large amount of training data, tokenize the resulting high-quality data at high fidelity, train or customize pretrained models efficiently and at scale, and then generate high-quality images and videos during inference.
	Announcing NVIDIA NeMo for multimodal generative AI
	NVIDIA NeMo
	is an end-to-end platform for developing, customizing, and deploying generative AI models.
	NVIDIA just announced the expansion of NeMo to support the end-to-end pipeline for developing multimodal models. NeMo enables you to easily curate high-quality visual data, accelerate
	training
	and
	customization
	with highly efficient tokenizers and parallelism techniques, and reconstruct high-quality visuals during inference.
	Accelerated video and image data curation
	High-quality training data ensures high-accuracy results from an AI model. However, developers face various challenges in building data processing pipelines, ranging from scaling to data orchestration.
	NeMo Curator
	streamlines the data curation process, making it easier and faster for you to build multimodal generative AI models. Its out-of-the-box experience minimizes the total cost of ownership (TCO) and accelerates time-to-market.
	While working with visuals, organizations can easily reach petabyte-scale data processing. NeMo Curator provides an orchestration pipeline that can load balance on multiple GPUs at each stage of the data curation. As a result, you can reduce video processing time by 7x compared to a naive GPU-based implementation. The scalable pipelines can efficiently process over 100 PB of data, ensuring the seamless handling of large datasets.
	Figure 1. NVIDIA NeMo Curator video processing speed
	NeMo Curator provides reference video curation models optimized for high-throughput filtering, captioning, and embedding stages to enhance dataset quality, empowering you to create more accurate AI models.
	For instance, NeMo Curator uses an optimized captioning model that delivers an order of magnitude throughput improvement compared to unoptimized inference model implementations.
	NVIDIA Cosmos tokenizers
	Tokenizers map redundant and implicit visual data into compact and semantic tokens, enabling efficient training of large-scale generative models and democratizing their inference on limited computational resources.
	Today’s open video and image tokenizers often generate poor data representations, leading to lossy reconstructions, distorted images, and temporally unstable videos and placing a cap on the capability of generative models built on top of the tokenizers. Inefficient tokenization processes also result in slow encoding and decoding and longer training and inference times, negatively impacting both developer productivity and the user experience.
	NVIDIA Cosmos tokenizers are open models that offer superior visual tokenization with exceptionally large compression rates and cutting-edge reconstruction quality across diverse image and video categories.
	Video 1. Efficient Generative AI Tokenizers for Image and Video
	These tokenizers provide ease of use through a suite of tokenizer standardized models that support vision-language models (VLMs) with discrete latent codes, diffusion models with continuous latent embeddings, and various aspect ratios and resolutions, enabling the efficient management of large-resolution images and videos. This provides you with tools for tokenizing a wide variety of visual input data to build image and video AI models.
	Cosmos tokenizer architecture
	A Cosmos tokenizer uses a sophisticated encoder-decoder structure designed for high efficiency and effective learning. At its core, it employs 3D
	causal convolution blocks
	, which are specialized layers that jointly process spatiotemporal information, and uses causal temporal attention that captures long-range dependencies in data.
	The causal structure ensures that the model uses only past and present frames when performing tokenization, avoiding future frames. This is crucial for aligning with the causal nature of many real-world systems, such as those in physical AI or multimodal LLMs.
	Figure 2. NVIDIA Cosmos tokenizer architecture
	The input is downsampled using 3D wavelets, a signal processing technique that represents pixel information more efficiently. After the data is processed, an inverse wavelet transform reconstructs the original input.
	This approach improves learning efficiency, enabling the tokenizer encoder-decoder learnable modules to focus on meaningful features rather than redundant pixel details. The combination of such techniques and its unique training recipe makes the Cosmos tokenizers a cutting-edge architecture for efficient and powerful tokenization.
	During inference, the Cosmos tokenizers significantly reduce the cost of running the model by delivering up to 12x faster reconstruction compared to leading open-weight tokenizers (Figure 3).
	Figure 3. Quantitative comparison of reconstruction quality (left) and runtime performance (right) for video tokenizers
	The Cosmos tokenizers also produce high-fidelity images and videos while compressing more than other tokenizers, demonstrating an unprecedented quality-compression trade-off.
	Figure 4. Continuous tokenizer compression rate compared to reconstruction quality
	Figure 5. Discrete tokenizer compression rate compared to reconstruction quality
	Although the Cosmos tokenizer regenerates from highly compressed tokens, it is capable of creating high-quality images and videos due to an innovative neural network training technique and architecture.
	Figure 6. Reconstructed video frame for continuous video tokenizers
	Build Your Own Multimodal Models with NeMo
	The expansion of the NVIDIA NeMo platform with at-scale data processing using
	NeMo Curator
	and high-quality tokenization and visual reconstruction using the Cosmos tokenizer empowers you to build state-of-the-art multimodal, generative AI models.
	Join the waitlist
	and be notified when NeMo Curator is available. The tokenizer is available now on the
	/NVIDIA/cosmos-tokenizer
	GitHub repo and
	Hugging Face
	.",https://developer.nvidia.com/ja-jp/blog/state-of-the-art-multimodal-generative-ai-model-development-with-nvidia-nemo/,NVIDIA NeMo による最先端のマルチモーダル生成 AI モデル開発,"Reading Time:
	2
	minutes
	生成 AI
	は、テキストベースのモデルからマルチモーダル機能へと急速に進化しています。これらのモデルは、画像のキャプション作成や視覚的な質問回答などのタスクを実行し、より人間に近い AI へとシフトしていることを反映しています。このコミュニティは現在、テキストや画像から動画へと拡大しており、さまざまな業界で新たな可能性を切り開かれています。
	動画 AI モデルは、ロボティクス、自動車、小売などの業界に革命を起こそうとしています。
	ロボティクス
	では、製造業や倉庫管理などの分野に不可欠な、複雑で変化し続ける環境における自律的なナビゲーションを強化しています。自動車業界では、動画 AI が自動運転を推進し、車両の認識、安全性、予知保全を強化し、効率性を高めています。
	画像や動画の基盤モデルを構築するには、開発者は大量の学習データのキュレーションと事前処理を行い、結果として得られた高品質データを高い忠実度でトークン化し、学習済みモデルを効率的に大規模に学習またはカスタマイズして、推論中に高品質な画像や動画を生成する必要があります。
	マルチモーダル生成 AI 向けの NVIDIA NeMo を発表
	NVIDIA NeMo
	は、生成 AI モデルを開発、カスタマイズ、デプロイするエンドツーエンドのプラットフォームです。
	NVIDIA は、マルチモーダルモデル開発向けのエンドツーエンドのパイプラインをサポートする NeMo の拡張を発表しました。NeMo により、高品質な視覚データを簡単にキュレーションし、高効率なトークナイザーと並列処理技術で
	学習
	と
	カスタマイズ
	を加速し、推論中に高品質なビジュアルを再構築することができます。
	動画と画像データのキュレーションを加速
	高品質な学習データでは、AI モデルから高精度な結果が得られます。しかし、開発者は、データ処理パイプラインの構築において、スケーリングからデータのオーケストレーションまで、さまざまな課題に直面しています。
	NeMo Curator
	は、データキュレーションプロセスを合理化することで、マルチモーダル生成 AI モデルをより簡単かつ迅速に構築することができます。すぐに試すことができるため、総保有コスト (TCO) を最小限に抑え、市場投入までの時間を短縮します。
	ビジュアルを扱う際には、組織はペタバイト規模のデータ処理を容易に実行できます。NeMo Curator は、データキュレーションの各段階で複数の GPU に負荷分散できるオーケストレーションパイプラインを提供します。その結果、単純な GPU ベースの実装と比較して、動画処理時間を 7 分の 1 に短縮できます。スケール可能なパイプラインは、100 PB を超えるデータを効率的に処理でき、大規模なデータセットをシームレスに取り扱うことができます。
	図 1. NVIDIA NeMo Curator の動画処理速度
	NeMo Curator は、高いスループットのフィルタリング、キャプション作成、埋め込みの各段階に最適化されたリファレンスビデオキュレーションモデルを提供し、データセットの品質を向上させ、より正確な AI モデルの作成をサポートします。
	たとえば、NeMo Curator は、最適化されたキャプションモデルを使用し、最適化されていない推論モデルの実装と比較して、桁違いのスループットの向上を実現します。
	NVIDIA Cosmos トークナイザー
	トークナイザーは、冗長的で暗黙的な視覚データをコンパクトで意味のあるトークンにマッピングし、大規模な生成モデルの効率的な学習を実現し、誰もが限られた計算リソースで推論できるようにします。
	今日のオープンな動画や画像のトークナイザーは、データ表現が不十分なことが多いため、劣化の多い再構築、歪んだ画像、不連続な動画につながり、トークナイザー上に構築された生成モデルの能力に限界をもたらします。トークン化プロセスが非効率なため、エンコードやデコードに時間がかかり、学習や推論の時間が長くなり、開発者の生産性とユーザー体験の両方に悪影響を及ぼします。
	NVIDIA Cosmos トークナイザーは、優れた視覚トークン化を提供するオープンなモデルで、さまざまな画像や動画のカテゴリーで、高い圧縮率と最先端の再構築品質を実現します。
	離散的な潜在コードを備えた視覚言語モデル (VLM: Vision-language Model)、連続した潜在的埋め込みによる拡散モデル、さまざまなアスペクト比や解像度をサポートする一連のトークナイザー標準化モデルを使用して、これらのトークナイザーを簡単に使用でき、高解像度の画像や動画を効率的に管理することができます。これにより、画像や動画 AI モデルを構築するために、幅広い視覚入力データをトークン化するツールが提供されます。
	Cosmos トークナイザーのアーキテクチャ
	Cosmos トークナイザーは、高効率かつ効果的な学習向けに設計されており、高度なエンコーダー / デコーダー構造を使用しています。その中核には 3D
	Causal Convolution Block
	(因果畳み込みブロック) を採用しています。これは時空間情報を共同処理する特殊なレイヤーで、データの長期的な依存関係を捉える Causal Temporal Attention (因果的時間注意機構) を使用しています。
	この因果構造により、トークン化の実行時にモデルが過去と現在のフレームのみを使用し、未来のフレームは使用しません。これは、物理的なAIやマルチモーダルLLMなどの多くの現実世界のシステムの因果性に合わせるために重要です。
	図 2. NVIDIA Cosmos トークナイザーのアーキテクチャ
	入力は、ピクセル情報をより効率的に表す信号処理技術である 3D ウェーブレットを使用してダウンサンプリングされます。データ処理後、逆ウェーベレット変換によって元の入力が再構築されます。
	このアプローチにより、学習効率が向上し、トークナイザーのエンコーダー / デコーダーの学習可能なモジュールは、冗長なピクセルの詳細ではなく、意味のある特徴に焦点を当てることができます。このような技術と独自の学習レシピの組み合わせにより、Cosmos トークナイザーは、効率的かつ強力なトークン化を実現する最先端のアーキテクチャとなっています。
	推論の際、Cosmos トークナイザーは、主要なオープンウェイトのトークナイザーと比較して最大 12 倍高速な再構築を実現し、モデルの実行コストを大幅に削減しました (図 3)。
	図 3. Cosmos トークナイザーと主要なオープンウェイトのトークナイザーとの比較
	Cosmos トークナイザーは、他のトークナイザーよりも高い圧縮率を実現しながら、高い忠実度の画像や動画を生成し、前例のない品質と圧縮のトレードオフを実現しています。
	図 4. 連続トークナイザーの圧縮率と再構築品質の比較
	図 5. 離散トークナイザーの圧縮率と再構築品質の比較
	Cosmos トークナイザーは、高度に圧縮されたトークンから再生成されますが、革新的なニューラルネットワークの学習技術とアーキテクチャにより、高品質な画像や動画を作成することができます。
	図 6. 連続動画トークナイザーで再構築された動画フレーム
	NeMo で独自のマルチモーダルモデルを構築
	NeMo Curator
	を使用した大規模なデータ処理と、Cosmos トークナイザーを使用した高品質なトークン化やビジュアル再構築を備えた、NVIDIA NeMo プラットフォームの拡張により、最先端のマルチモーダル生成 AI モデルを構築することができます。
	登録
	していただくと、NeMo Curator が利用可能になった際に通知を受け取ることができます。トークナイザーは、現在
	/NVIDIA/cosmos-tokenizer
	GitHub リポジトリおよび
	Hugging Face
	で利用することができます。
	関連情報
	GTC セッション:
	Large Language Model Fine-Tuning using Parameter Efficient Fine-Tuning (PEFT を使用した大規模言語モデルのファインチューニング)
	GTC セッション:
	Large Language Model Fine-Tuning using NVIDIA NeMo (NVIDIA NeMo を使用した大規模言語モデルのファインチューニング – Domino Data Lab 提供)
	SDK:
	NVIDIA NeMo カスタマイザー
	SDK:
	NeMo LLM サービス
	SDK:
	NeMo Megatron"