import streamlit as st
st.set_page_config(page_title="Multi Agent Systems", page_icon=":robot_face:", layout="wide")
# Hide Streamlit's default hamburger menu and footer via injected CSS
hide_streamlit_style = """
<style>
#MainMenu {visibility: hidden;}
footer {visibility: hidden;}
</style>
"""
st.markdown(hide_streamlit_style, unsafe_allow_html=True)
col1, col2 = st.columns(2)  # st.beta_columns was removed in Streamlit 1.x
with col1:
    st.markdown("## **Autonomous agents interacting** :robot_face: :robot_face:")
    st.markdown("### **Key Aspects** :bulb:")
    st.markdown("""
1. **Interaction Protocol** 🤝 \n
- Define rules for communication and cooperation \n
2. **Decentralized Decision Making** 🎯 \n
- Autonomous agents make independent decisions \n
3. **Collaboration and Competition** 🤼 \n
- Agents work together or against each other \n
""")
with col2:
    st.markdown("### **Entities** :guards:")
    st.markdown("""
1. **Autonomous Agents** 🤖 \n
- Independent entities with decision-making capabilities \n
2. **Environment** 🌐 \n
- Shared space where agents interact \n
3. **Ruleset** 📜 \n
- Defines interaction protocol and decision-making processes \n
""")
st.markdown("---")
st.markdown("## **Interaction Protocol** 🀝 :bulb:**")
st.markdown("### **Key Elements** :guards:")
st.markdown("""
1. **Communication** πŸ—£ \n
- Agents exchange information \n
2. **Cooperation** 🤝 \n
# 🩺🔍 Search Results
### 04 Dec 2023 | [AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents](https://arxiv.org/abs/2311.17465) | [⬇️](https://arxiv.org/pdf/2311.17465)
*Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang*
In this study, our goal is to create interactive avatar agents that can
autonomously plan and animate nuanced facial movements realistically, from both
visual and behavioral perspectives. Given high-level inputs about the
environment and agent profile, our framework harnesses LLMs to produce a series
of detailed text descriptions of the avatar agents' facial motions. These
descriptions are then processed by our task-agnostic driving engine into motion
token sequences, which are subsequently converted into continuous motion
embeddings that are further consumed by our standalone neural-based renderer to
generate the final photorealistic avatar animations. These streamlined
processes allow our framework to adapt to a variety of non-verbal avatar
interactions, both monadic and dyadic. Our extensive study, which includes
experiments on both newly compiled and existing datasets featuring two types of
agents -- one capable of monadic interaction with the environment, and the
other designed for dyadic conversation -- validates the effectiveness and
versatility of our approach. To our knowledge, we advanced a leap step by
combining LLMs and neural rendering for generalized non-verbal prediction and
photo-realistic rendering of avatar agents.
---------------
### 06 Jul 2023 | [Caption Anything: Interactive Image Description with Diverse Multimodal Controls](https://arxiv.org/abs/2305.02677) | [⬇️](https://arxiv.org/pdf/2305.02677)
*Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao*
Controllable image captioning is an emerging multimodal topic that aims to
describe the image with natural language following human purpose,
e.g., looking at the specified regions or telling in a particular
text style. State-of-the-art methods are trained on annotated pairs of input
controls and output captions. However, the scarcity of such well-annotated
multimodal data largely limits their usability and scalability for interactive
AI systems. Leveraging unimodal instruction-following foundation models is a
promising alternative that benefits from broader sources of data. In this
paper, we present Caption AnyThing (CAT), a foundation model augmented image
captioning framework supporting a wide range of multimodal controls: 1) visual
controls, including points, boxes, and trajectories; 2) language controls, such
as sentiment, length, language, and factuality. Powered by Segment Anything
Model (SAM) and ChatGPT, we unify the visual and language prompts into a
modularized framework, enabling the flexible combination between different
controls. Extensive case studies demonstrate the user intention alignment
capabilities of our framework, shedding light on effective user interaction
modeling in vision-language applications. Our code is publicly available at
https://github.com/ttengwang/Caption-Anything.
---------------
### 13 Jul 2023 | [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) | [⬇️](https://arxiv.org/pdf/2306.14824)
*Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei*
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and
grounding text to the visual world. Specifically, we represent refer
expressions as links in Markdown, i.e., "[text span](bounding boxes)", where
object descriptions are sequences of location tokens. Together with multimodal
corpora, we construct large-scale data of grounded image-text pairs (called
GrIT) to train the model. In addition to the existing capabilities of MLLMs
(e.g., perceiving general modalities, following instructions, and performing
in-context learning), Kosmos-2 integrates the grounding capability into
downstream applications. We evaluate Kosmos-2 on a wide range of tasks,
including (i) multimodal grounding, such as referring expression comprehension,
and phrase grounding, (ii) multimodal referring, such as referring expression
generation, (iii) perception-language tasks, and (iv) language understanding
and generation. This work lays out the foundation for the development of
Embodiment AI and sheds light on the big convergence of language, multimodal
perception, action, and world modeling, which is a key step toward artificial
general intelligence. Code and pretrained models are available at
https://aka.ms/kosmos-2.
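*Illustrative sketch (hedged, not from the paper):* the grounded Markdown format above can be pictured as ordinary link syntax whose target is a pair of location tokens. The quantization scheme and the `<loc_...>` token naming below are assumptions made for demonstration; Kosmos-2's actual tokenization may differ.
```python
def grounded_markdown(span, box, bins=32):
    # box = (x0, y0, x1, y1), coordinates normalized to [0, 1].
    # Quantize the top-left and bottom-right corners into discrete location tokens.
    def token(x, y):
        return f"<loc_{int(y * (bins - 1)) * bins + int(x * (bins - 1))}>"
    return f"[{span}]({token(box[0], box[1])}{token(box[2], box[3])})"

print(grounded_markdown("a snowman", (0.10, 0.20, 0.55, 0.90)))
# -> [a snowman](<loc_195><loc_881>)
```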
---------------
### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615)
*Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma*
Screen user interfaces (UIs) and infographics, sharing similar visual
language and design principles, play important roles in human communication and
human-machine interaction. We introduce ScreenAI, a vision-language model that
specializes in UI and infographics understanding. Our model improves upon the
PaLI architecture with the flexible patching strategy of pix2struct and is
trained on a unique mixture of datasets. At the heart of this mixture is a
novel screen annotation task in which the model has to identify the type and
location of UI elements. We use these text annotations to describe screens to
Large Language Models and automatically generate question-answering (QA), UI
navigation, and summarization training datasets at scale. We run ablation
studies to demonstrate the impact of these design choices. At only 5B
parameters, ScreenAI achieves new state-of-the-art results on UI- and
infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget
Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and
InfographicVQA) compared to models of similar size. Finally, we release three
new datasets: one focused on the screen annotation task and two others focused
on question answering.
---------------
### 23 Mar 2022 | [ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues](https://arxiv.org/abs/2203.12751) | [⬇️](https://arxiv.org/pdf/2203.12751)
*Monica S. Lam, Giovanni Campagna, Mehrad Moradshahi, Sina J. Semnani, Silei Xu*
Task-oriented conversational agents rely on semantic parsers to translate
natural language to formal representations. In this paper, we propose the
design and rationale of the ThingTalk formal representation, and how the design
improves the development of transactional task-oriented agents.
ThingTalk is built on four core principles: (1) representing user requests
directly as executable statements, covering all the functionality of the agent,
(2) representing dialogues formally and succinctly to support accurate
contextual semantic parsing, (3) standardizing types and interfaces to maximize
reuse between agents, and (4) allowing multiple, independently-developed agents
to be composed in a single virtual assistant. ThingTalk is developed as part of
the Genie Framework that allows developers to quickly build transactional
agents given a database and APIs.
We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST.
Compared to the others, the ThingTalk design is both more general and more
cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and
associated tools yields a new state of the art accuracy of 79% turn-by-turn.
---------------
### 19 Oct 2023 | [3D-GPT: Procedural 3D Modeling with Large Language Models](https://arxiv.org/abs/2310.12945) | [⬇️](https://arxiv.org/pdf/2310.12945)
*Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould*
In the pursuit of efficient automated content creation, procedural
generation, leveraging modifiable parameters and rule-based systems, emerges as
a promising approach. Nonetheless, it could be a demanding endeavor, given its
intricate nature necessitating a deep understanding of rules, algorithms, and
parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing
large language models (LLMs) for instruction-driven 3D modeling. 3D-GPT
positions LLMs as proficient problem solvers, dissecting the procedural 3D
modeling tasks into accessible segments and appointing the apt agent for each
task. 3D-GPT integrates three core agents: the task dispatch agent, the
conceptualization agent, and the modeling agent. They collaboratively achieve
two objectives. First, it enhances concise initial scene descriptions, evolving
them into detailed forms while dynamically adapting the text based on
subsequent instructions. Second, it integrates procedural generation,
extracting parameter values from enriched text to effortlessly interface with
3D software for asset creation. Our empirical investigations confirm that
3D-GPT not only interprets and executes instructions, delivering reliable
results but also collaborates effectively with human designers. Furthermore, it
seamlessly integrates with Blender, unlocking expanded manipulation
possibilities. Our work highlights the potential of LLMs in 3D modeling,
offering a basic framework for future advancements in scene generation and
animation.
---------------
### 04 Jul 2023 | [Embodied Task Planning with Large Language Models](https://arxiv.org/abs/2307.01848) | [⬇️](https://arxiv.org/pdf/2307.01848)
*Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan*
Equipping embodied agents with commonsense is important for robots to
successfully complete complex human instructions in general environments.
Recent large language models (LLM) can embed rich semantic knowledge for agents
in plan generation of complex tasks, while they lack the information about the
realistic world and usually yield infeasible action sequences. In this paper,
we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning
with physical scene constraint, where the agent generates executable plans
according to the existed objects in the scene by aligning LLMs with the visual
perception models. Specifically, we first construct a multimodal dataset
containing triplets of indoor scenes, instructions and action plans, where we
provide the designed prompts and the list of existing objects in the scene for
GPT-3.5 to generate a large number of instructions and corresponding planned
actions. The generated data is leveraged for grounded plan tuning of
pre-trained LLMs. During inference, we discover the objects in the scene by
extending open-vocabulary object detectors to multi-view RGB images collected
in different achievable locations. Experimental results show that the generated
plan from our TaPA framework can achieve higher success rate than LLaVA and
GPT-3.5 by a sizable margin, which indicates the practicality of embodied task
planning in general and complex environments.
---------------
### 18 Jan 2023 | [Joint Representation Learning for Text and 3D Point Cloud](https://arxiv.org/abs/2301.07584) | [⬇️](https://arxiv.org/pdf/2301.07584)
*Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang*
Recent advancements in vision-language pre-training (e.g. CLIP) have shown
that vision models can benefit from language supervision. While many models
using language modality have achieved great success on 2D vision tasks, the
joint representation learning of 3D point cloud with text remains
under-explored due to the difficulty of 3D-Text data pair acquisition and the
irregularity of 3D data structure. In this paper, we propose a novel Text4Point
framework to construct language-guided 3D point cloud models. The key idea is
utilizing 2D images as a bridge to connect the point cloud and the language
modalities. The proposed Text4Point follows the pre-training and fine-tuning
paradigm. During the pre-training stage, we establish the correspondence of
images and point clouds based on the readily available RGB-D data and use
contrastive learning to align the image and point cloud representations.
Together with the well-aligned image and text features achieved by CLIP, the
point cloud features are implicitly aligned with the text embeddings. Further,
we propose a Text Querying Module to integrate language information into 3D
representation learning by querying text embeddings with point cloud features.
For fine-tuning, the model learns task-specific 3D representations under
informative language guidance from the label set without 2D images. Extensive
experiments demonstrate that our model shows consistent improvement on various
downstream tasks, such as point cloud semantic segmentation, instance
segmentation, and object detection. The code will be available here:
https://github.com/LeapLabTHU/Text4Point
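*Illustrative sketch (hedged, not the authors' code):* the contrastive alignment step described above boils down to an InfoNCE-style objective over paired embeddings. The NumPy toy below uses random vectors and an arbitrary temperature purely to show the shape of the computation.
```python
import numpy as np

def info_nce(img_emb, pc_emb, temperature=0.07):
    # Rows are paired samples; L2-normalize and treat the diagonal as the positive pairs.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    pc = pc_emb / np.linalg.norm(pc_emb, axis=1, keepdims=True)
    logits = img @ pc.T / temperature
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
img, pc = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
print(info_nce(img, pc))  # lower values indicate better-aligned pairs
```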
---------------
### 01 Feb 2024 | [Executable Code Actions Elicit Better LLM Agents](https://arxiv.org/abs/2402.01030) | [⬇️](https://arxiv.org/pdf/2402.01030)
*Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji*
Large Language Model (LLM) agents, capable of performing a broad range of
actions, such as invoking tools and controlling robots, show great potential in
tackling real-world challenges. LLM agents are typically prompted to produce
actions by generating JSON or text in a pre-defined format, which is usually
limited by constrained action space (e.g., the scope of pre-defined tools) and
restricted flexibility (e.g., inability to compose multiple tools). This work
proposes to use executable Python code to consolidate LLM agents' actions into
a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct
can execute code actions and dynamically revise prior actions or emit new
actions upon new observations through multi-turn interactions. Our extensive
analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that
CodeAct outperforms widely used alternatives (up to 20% higher success rate).
The encouraging performance of CodeAct motivates us to build an open-source LLM
agent that interacts with environments by executing interpretable code and
collaborates with users using natural language. To this end, we collect an
instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn
interactions using CodeAct. We show that it can be used with existing data to
improve models in agent-oriented tasks without compromising their general
capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with
Python interpreter and uniquely tailored to perform sophisticated tasks (e.g.,
model training) using existing libraries and autonomously self-debug.
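*Illustrative sketch (hedged):* the "executable code as a unified action space" idea above can be pictured as a loop in which generated Python is executed and its output is fed back as the next observation. The `fake_llm` policy and the bare `exec` sandbox below are stand-ins for demonstration, not the CodeAct implementation.
```python
import io, contextlib

def fake_llm(history):
    # Stand-in policy; a real agent would prompt an LLM with the full interaction history.
    return "result = sum(range(10))" if not history else "print(result * 2)"

def run_action(code, state):
    # Execute one code action in a shared namespace and capture stdout as the observation.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, state)
    return buffer.getvalue().strip()

state, history = {}, []
for _ in range(2):
    action = fake_llm(history)
    history.append((action, run_action(action, state)))
print(history)  # [('result = sum(range(10))', ''), ('print(result * 2)', '90')]
```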
---------------
### 24 Jan 2024 | [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) | [⬇️](https://arxiv.org/pdf/2401.13649)
*Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried*
Autonomous agents capable of planning, reasoning, and executing actions on
the web offer a promising avenue for automating computer tasks. However, the
majority of existing benchmarks primarily focus on text-based agents,
neglecting many natural tasks that require visual information to effectively
solve. Given that most computer interfaces cater to human perception, visual
information often augments textual data in ways that text-only models struggle
to harness effectively. To bridge this gap, we introduce VisualWebArena, a
benchmark designed to assess the performance of multimodal web agents on
realistic *visually grounded tasks*. VisualWebArena comprises a set
of diverse and complex web-based tasks that evaluate various capabilities of
autonomous multimodal agents. To perform on this benchmark, agents need to
accurately process image-text inputs, interpret natural language instructions,
and execute actions on websites to accomplish user-defined objectives. We
conduct an extensive evaluation of state-of-the-art LLM-based autonomous
agents, including several multimodal models. Through extensive quantitative and
qualitative analysis, we identify several limitations of text-only LLM agents,
and reveal gaps in the capabilities of state-of-the-art multimodal language
agents. VisualWebArena provides a framework for evaluating multimodal
autonomous language agents, and offers insights towards building stronger
autonomous agents for the web. Our code, baseline models, and data is publicly
available at https://jykoh.com/vwa.
---------------
### 22 Feb 2018 | [Multimodal Named Entity Recognition for Short Social Media Posts](https://arxiv.org/abs/1802.07862) | [⬇️](https://arxiv.org/pdf/1802.07862)
*Seungwhan Moon, Leonardo Neves, Vitor Carvalho*
We introduce a new task called Multimodal Named Entity Recognition (MNER) for
noisy user-generated data such as tweets or Snapchat captions, which comprise
short text with accompanying images. These social media posts often come in
inconsistent or incomplete syntax and lexical notations with very limited
surrounding textual contexts, bringing significant challenges for NER. To this
end, we create a new dataset for MNER called SnapCaptions (Snapchat
image-caption pairs submitted to public and crowd-sourced stories with fully
annotated named entities). We then build upon the state-of-the-art Bi-LSTM
word/character based NER models with 1) a deep image network which incorporates
relevant visual context to augment textual information, and 2) a generic
modality-attention module which learns to attenuate irrelevant modalities while
amplifying the most informative ones to extract contexts from, adaptive to each
sample and token. The proposed MNER model with modality attention significantly
outperforms the state-of-the-art text-only NER models by successfully
leveraging provided visual contexts, opening up potential applications of MNER
on myriads of social media platforms.
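*Illustrative sketch (hedged, not the paper's architecture):* the modality-attention module above can be read as a learned softmax gate over per-modality features. The NumPy toy below hard-codes the gate logits just to show the mechanism.
```python
import numpy as np

def modality_attention(features, gate_logits):
    # features: modality name -> feature vector; gate_logits: unnormalized attention scores.
    names = list(features)
    weights = np.exp([gate_logits[n] for n in names])
    weights = weights / weights.sum()                      # softmax over modalities
    fused = sum(w * features[n] for w, n in zip(weights, names))
    return fused, dict(zip(names, np.round(weights, 2)))

feats = {"word": np.ones(4), "char": np.full(4, 0.5), "image": np.zeros(4)}
fused, gate = modality_attention(feats, {"word": 2.0, "char": 1.0, "image": -1.0})
print(gate)  # approximately {'word': 0.71, 'char': 0.26, 'image': 0.04}
```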
---------------
### 21 Sep 2023 | [You Only Look at Screens: Multimodal Chain-of-Action Agents](https://arxiv.org/abs/2309.11436) | [⬇️](https://arxiv.org/pdf/2309.11436)
*Zhuosheng Zhang, Aston Zhang*
Autonomous user interface (UI) agents aim to facilitate task automation by
interacting with the user interface without manual intervention. Recent studies
have investigated eliciting the capabilities of large language models (LLMs)
for effective engagement in diverse environments. To align with the
input-output requirement of LLMs, existing approaches are developed under a
sandbox setting where they rely on external tools and application-specific APIs
to parse the environment into textual elements and interpret the predicted
actions. Consequently, those approaches often grapple with inference
inefficiency and error propagation risks. To mitigate the challenges, we
introduce Auto-UI, a multimodal solution that directly interacts with the
interface, bypassing the need for environment parsing or reliance on
application-dependent APIs. Moreover, we propose a chain-of-action technique --
leveraging a series of intermediate previous action histories and future action
plans -- to help the agent decide what action to execute. We evaluate our
approach on a new device-control benchmark AITW with 30K unique instructions,
spanning multi-step tasks such as application operation, web searching, and web
shopping. Experimental results show that Auto-UI achieves state-of-the-art
performance with an action type prediction accuracy of 90% and an overall
action success rate of 74%. Code is publicly available at
https://github.com/cooelf/Auto-UI.
---------------
### 06 Jun 2023 | [LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models](https://arxiv.org/abs/2303.02927) | [⬇️](https://arxiv.org/pdf/2303.02927)
*Victor Dibia*
Systems that support users in the automatic creation of visualizations must
address several subtasks - understand the semantics of data, enumerate relevant
visualization goals and generate visualization specifications. In this work, we
pose visualization generation as a multi-stage generation problem and argue
that well-orchestrated pipelines based on large language models (LLMs) such as
ChatGPT/GPT-4 and image generation models (IGMs) are suitable to addressing
these tasks. We present LIDA, a novel tool for generating grammar-agnostic
visualizations and infographics. LIDA comprises of 4 modules - A SUMMARIZER
that converts data into a rich but compact natural language summary, a GOAL
EXPLORER that enumerates visualization goals given the data, a VISGENERATOR
that generates, refines, executes and filters visualization code and an
INFOGRAPHER module that yields data-faithful stylized graphics using IGMs. LIDA
provides a python api, and a hybrid user interface (direct manipulation and
multilingual natural language) for interactive chart, infographics and data
story generation. Learn more about the project here -
https://microsoft.github.io/lida/
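*Usage note (hedged):* the Python API mentioned above is documented in the project README; the snippet below follows that README as of the paper's release, so the exact call signatures may have changed, and the CSV path and model backend are placeholders.
```python
from lida import Manager, llm  # pip install lida; the OpenAI backend expects OPENAI_API_KEY

lida = Manager(text_gen=llm("openai"))
summary = lida.summarize("data/cars.csv")                 # SUMMARIZER: compact data description
goals = lida.goals(summary, n=2)                          # GOAL EXPLORER: candidate goals
charts = lida.visualize(summary=summary, goal=goals[0])   # VISGENERATOR: code + rendered chart
print(goals[0])
```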
---------------
### 16 Feb 2023 | [VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning](https://arxiv.org/abs/2211.15103) | [⬇️](https://arxiv.org/pdf/2211.15103)
*Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le*
Video paragraph captioning aims to generate a multi-sentence description of
an untrimmed video with several temporal event locations in coherent
storytelling. Following the human perception process, where the scene is
effectively understood by decomposing it into visual (e.g. human, animal) and
non-visual components (e.g. action, relations) under the mutual influence of
vision and language, we first propose a visual-linguistic (VL) feature. In the
proposed VL feature, the scene is modeled by three modalities including (i) a
global visual environment; (ii) local visual main agents; (iii) linguistic
scene elements. We then introduce an autoregressive Transformer-in-Transformer
(TinT) to simultaneously capture the semantic coherence of intra- and
inter-event contents within a video. Finally, we present a new VL contrastive
loss function to guarantee learnt embedding features are matched with the
captions semantics. Comprehensive experiments and extensive ablation studies on
ActivityNet Captions and YouCookII datasets show that the proposed
Visual-Linguistic Transformer-in-Transform (VLTinT) outperforms prior
state-of-the-art methods on accuracy and diversity. Source code is made
publicly available at: https://github.com/UARK-AICV/VLTinT.
---------------
### 04 Mar 2021 | [FAtiMA Toolkit -- Toward an effective and accessible tool for the development of intelligent virtual agents and social robots](https://arxiv.org/abs/2103.03020) | [⬇️](https://arxiv.org/pdf/2103.03020)
*Samuel Mascarenhas, Manuel Guimarães, Pedro A. Santos, João Dias, Rui Prada, Ana Paiva*
More than a decade has passed since the development of FearNot!, an
application designed to help children deal with bullying through role-playing
with virtual characters. It was also the application that led to the creation
of FAtiMA, an affective agent architecture for creating autonomous characters
that can evoke empathic responses. In this paper, we describe FAtiMA Toolkit, a
collection of open-source tools that is designed to help researchers, game
developers and roboticists incorporate a computational model of emotion and
decision-making in their work. The toolkit was developed with the goal of
making FAtiMA more accessible, easier to incorporate into different projects
and more flexible in its capabilities for human-agent interaction, based upon
the experience gathered over the years across different virtual environments
and human-robot interaction scenarios. As a result, this work makes several
different contributions to the field of Agent-Based Architectures. More
precisely, FAtiMA Toolkit's library based design allows developers to easily
integrate it with other frameworks, its meta-cognitive model affords different
internal reasoners and affective components and its explicit dialogue structure
gives control to the author even within highly complex scenarios. To
demonstrate the use of FAtiMA Toolkit, several different use cases where the
toolkit was successfully applied are described and discussed.
---------------
### 12 Sep 2022 | [emojiSpace: Spatial Representation of Emojis](https://arxiv.org/abs/2209.09871) | [⬇️](https://arxiv.org/pdf/2209.09871)
*Moeen Mostafavi, Mahsa Pahlavikhah Varnosfaderani, Fateme Nikseresht, Seyed Ahmad Mansouri*
In the absence of nonverbal cues during messaging communication, users
express part of their emotions using emojis. Thus, having emojis in the
vocabulary of text messaging language models can significantly improve many
natural language processing (NLP) applications such as online communication
analysis. On the other hand, word embedding models are usually trained on a
very large corpus of text such as Wikipedia or Google News datasets that
include very few samples with emojis. In this study, we create emojiSpace,
which is a combined word-emoji embedding using the word2vec model from the
Gensim library in Python. We trained emojiSpace on a corpus of more than 4
billion tweets and evaluated it by implementing sentiment analysis on a Twitter
dataset containing more than 67 million tweets as an extrinsic task. For this
task, we compared the performance of two different classifiers of random forest
(RF) and linear support vector machine (SVM). For evaluation, we compared
emojiSpace performance with two other pre-trained embeddings and demonstrated
that emojiSpace outperforms both.
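*Illustrative sketch (hedged):* a combined word-emoji embedding of the kind described above can be reproduced in miniature with Gensim's Word2Vec, provided emojis are kept as tokens. The corpus and hyperparameters below are toy placeholders; the actual model was trained on billions of tweets.
```python
from gensim.models import Word2Vec  # pip install gensim

# Toy stand-in for a tokenized tweet corpus that keeps emojis as vocabulary items.
corpus = [
    ["great", "game", "tonight", "🔥"],
    ["so", "happy", "today", "😀"],
    ["terrible", "service", "😡"],
    ["happy", "weekend", "😀", "🔥"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50, seed=1)
print(model.wv.most_similar("😀", topn=2))  # nearest word/emoji neighbours in the joint space
```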
---------------
### 27 Jan 2020 | [CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking](https://arxiv.org/abs/2001.07935) | [⬇️](https://arxiv.org/pdf/2001.07935)
*Grigori Fursin, Herve Guillou and Nicolas Essayan*
We present CodeReef - an open platform to share all the components necessary
to enable cross-platform MLOps (MLSysOps), i.e. automating the deployment of ML
models across diverse systems in the most efficient way. We also introduce the
CodeReef solution - a way to package and share models as non-virtualized,
portable, customizable and reproducible archive files. Such ML packages include
JSON meta description of models with all dependencies, Python APIs, CLI actions
and portable workflows necessary to automatically build, benchmark, test and
customize models across diverse platforms, AI frameworks, libraries, compilers
and datasets. We demonstrate several CodeReef solutions to automatically build,
run and measure object detection based on SSD-Mobilenets, TensorFlow and COCO
dataset from the latest MLPerf inference benchmark across a wide range of
platforms from Raspberry Pi, Android phones and IoT devices to data centers.
Our long-term goal is to help researchers share their new techniques as
production-ready packages along with research papers to participate in
collaborative and reproducible benchmarking, compare the different
ML/software/hardware stacks and select the most efficient ones on a Pareto
frontier using online CodeReef dashboards.
---------------
### 28 Feb 2024 | [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) | [⬇️](https://arxiv.org/pdf/2402.17553)
*Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov*
For decades, human-computer interaction has fundamentally been manual. Even
today, almost all productive work done on the computer necessitates human input
at every step. Autonomous virtual agents represent an exciting step in
automating many of these menial tasks. Virtual agents would empower users with
limited technical proficiency to harness the full possibilities of computer
systems. They could also enable the efficient streamlining of numerous computer
tasks, ranging from calendar management to complex travel bookings, with
minimal human intervention. In this paper, we introduce OmniACT, the
first-of-a-kind dataset and benchmark for assessing an agent's capability to
generate executable programs to accomplish computer tasks. Our scope extends
beyond traditional web automation, covering a diverse range of desktop
applications. The dataset consists of fundamental tasks such as "Play the next
song", as well as longer horizon tasks such as "Send an email to John Doe
mentioning the time and place to meet". Specifically, given a pair of screen
image and a visually-grounded natural language task, the goal is to generate a
script capable of fully executing the task. We run several strong baseline
language model agents on our benchmark. The strongest baseline, GPT-4, performs
the best on our benchmark. However, its performance level still reaches only 15%
of the human proficiency in generating executable scripts capable of completing
the task, demonstrating the challenge of our task for conventional web agents.
Our benchmark provides a platform to measure and evaluate the progress of
language model agents in automating computer tasks and motivates future work
towards building multimodal models that bridge large language models and the
visual grounding of computer screens.
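*Illustrative sketch (hedged):* the "executable program" an agent must produce for a task such as "Play the next song" is essentially a short GUI-automation script. The PyAutoGUI calls and coordinates below are made-up stand-ins, not a sample from the OmniACT dataset.
```python
import pyautogui  # pip install pyautogui

# Hypothetical script for the task "Play the next song" in a desktop music player.
# The coordinates are invented; a grounded agent would predict them from the screenshot.
pyautogui.moveTo(742, 1051, duration=0.2)  # move to the "next track" button
pyautogui.click()                          # advance to the next song
```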
---------------
### 24 Mar 2021 | [Proactive Interaction Framework for Intelligent Social Receptionist Robots](https://arxiv.org/abs/2012.04832) | [⬇️](https://arxiv.org/pdf/2012.04832)
*Yang Xue, Fan Wang, Hao Tian, Min Zhao, Jiangyong Li, Haiqing Pan and Yueqiang Dong*
Proactive human-robot interaction (HRI) allows the receptionist robots to
actively greet people and offer services based on vision, which has been found
to improve acceptability and customer satisfaction. Existing approaches are
either based on multi-stage decision processes or based on end-to-end decision
models. However, the rule-based approaches require sedulous expert efforts and
only handle minimal pre-defined scenarios. On the other hand, existing works
with end-to-end models are limited to very general greetings or few behavior
patterns (typically less than 10). To address those challenges, we propose a
new end-to-end framework, the TransFormer with Visual Tokens for Human-Robot
Interaction (TFVT-HRI). The proposed framework extracts visual tokens of
relative objects from an RGB camera first. To ensure the correct interpretation
of the scenario, a transformer decision model is then employed to process the
visual tokens, which is augmented with the temporal and spatial information. It
predicts the appropriate action to take in each scenario and identifies the
right target. Our data is collected from an in-service receptionist robot in an
office building, which is then annotated by experts for appropriate proactive
behavior. The action set includes 1000+ diverse patterns by combining language,
emoji expression, and body motions. We compare our model with other SOTA
end-to-end models on both offline test sets and online user experiments in
realistic office building environments to validate this framework. It is
demonstrated that the decision model achieves SOTA performance in action
triggering and selection, resulting in more humanness and intelligence when
compared with the previous reactive reception policies.
---------------
### 15 Mar 2023 | [Sustainable Cloud Services for Verbal Interaction with Embodied Agents](https://arxiv.org/abs/2203.02606) | [⬇️](https://arxiv.org/pdf/2203.02606)
*Lucrezia Grassi, Carmine Tommaso Recchiuto, Antonio Sgorbissa*
This article presents the design and the implementation of a cloud system for
knowledge-based autonomous interaction devised for Social Robots and other
conversational agents. The system is particularly convenient for low-cost
robots and devices: it can be used as a stand-alone dialogue system or as an
integration to provide "background" dialogue capabilities to any preexisting
Natural Language Processing ability that the robot may already have as part of
its basic skills. By connecting to the cloud, developers are provided with a
sustainable solution to manage verbal interaction through a network connection,
with about 3,000 topics of conversation ready for "chit-chatting" and a library
of pre-cooked plans that only needs to be grounded into the robot's physical
capabilities. The system is structured as a set of REST API endpoints so that
it can be easily expanded by adding new APIs to improve the capabilities of the
clients connected to the cloud. Another key feature of the system is that it
has been designed to make the development of its clients straightforward: in
this way, multiple robots and devices can be easily endowed with the capability
of autonomously interacting with the user, understanding when to perform
specific actions, and exploiting all the information provided by cloud
services. The article outlines and discusses the results of the experiments
performed to assess the system's performance in terms of response time, paving
the way for its use both for research and market solutions. Links to
repositories with clients for ROS and popular robots such as Pepper and NAO are
available on request.
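*Illustrative sketch (hedged):* because the system above is exposed as REST endpoints, a client interaction reduces to plain HTTP calls. The host, path, and payload fields below are hypothetical placeholders that only show the shape of such a client, not the service's real API.
```python
import requests

BASE_URL = "https://example-cloud-service.local/api"  # hypothetical host, not the real service

response = requests.post(
    f"{BASE_URL}/dialogue",                            # hypothetical endpoint name
    json={"session_id": "demo-001", "utterance": "Tell me something about music"},
    timeout=5,
)
response.raise_for_status()
print(response.json())                                 # e.g. the agent's reply and any plan to ground
```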
---------------
""")