import streamlit as st |
|
|
|
st.set_page_config(page_title="Multi Agent Systems", page_icon=":robot_face:", layout="wide") |
|
|
|
hide_streamlit_style = """ |
|
<style> |
|
#MainMenu {visibility: hidden;} |
|
footer {visibility: hidden;} |
|
</style> |
|
""" |
|
st.markdown(hide_streamlit_style, unsafe_allow_html=True) |
|
|
|
col1, col2 = st.columns(2)  # st.beta_columns was removed in newer Streamlit releases
|
|
|
with col1: |
|
st.markdown("## **Autonomous agents interacting** :robot_face: :robot_face:**") |
|
st.markdown("### **Key Aspects** :bulb:") |
|
st.markdown(""" |
|
1. **Interaction Protocol** 🤝 \n
|
- Define rules for communication and cooperation \n |
|
2. **Decentralized Decision Making** 🎯 \n
|
- Autonomous agents make independent decisions \n |
|
3. **Collaboration and Competition** 🤼 \n
|
- Agents work together or against each other \n |
|
""") |
|
|
|
with col2: |
|
st.markdown("### **Entities** :guards:") |
|
st.markdown(""" |
|
1. **Autonomous Agents** 🤖 \n
|
- Independent entities with decision-making capabilities \n |
|
2. **Environment** 🌍 \n
|
- Shared space where agents interact \n |
|
3. **Ruleset** 📜 \n
|
- Defines interaction protocol and decision-making processes \n |
|
""") |
|
|
|
st.markdown("---") |
|
|
|
st.markdown("## **Interaction Protocol** π€ :bulb:**") |
|
st.markdown("### **Key Elements** :guards:") |
|
st.markdown(""" |
|
1. **Communication** 🗣️ \n
|
- Agents exchange information \n |
|
2. **Cooperation** 🤝 \n

- Agents coordinate their actions toward shared goals \n

""")

# 🩺🔍 Search Results
|
### 04 Dec 2023 | [AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents](https://arxiv.org/abs/2311.17465) | [⬇️](https://arxiv.org/pdf/2311.17465)
|
*Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang* |
|
|
|
In this study, our goal is to create interactive avatar agents that can |
|
autonomously plan and animate nuanced facial movements realistically, from both |
|
visual and behavioral perspectives. Given high-level inputs about the |
|
environment and agent profile, our framework harnesses LLMs to produce a series |
|
of detailed text descriptions of the avatar agents' facial motions. These |
|
descriptions are then processed by our task-agnostic driving engine into motion |
|
token sequences, which are subsequently converted into continuous motion |
|
embeddings that are further consumed by our standalone neural-based renderer to |
|
generate the final photorealistic avatar animations. These streamlined |
|
processes allow our framework to adapt to a variety of non-verbal avatar |
|
interactions, both monadic and dyadic. Our extensive study, which includes |
|
experiments on both newly compiled and existing datasets featuring two types of |
|
agents -- one capable of monadic interaction with the environment, and the |
|
other designed for dyadic conversation -- validates the effectiveness and |
|
versatility of our approach. To our knowledge, we advanced a leap step by |
|
combining LLMs and neural rendering for generalized non-verbal prediction and |
|
photo-realistic rendering of avatar agents. |
|
|
|
--------------- |
|
|
|
### 06 Jul 2023 | [Caption Anything: Interactive Image Description with Diverse Multimodal Controls](https://arxiv.org/abs/2305.02677) | [⬇️](https://arxiv.org/pdf/2305.02677)
|
*Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao* |
|
|
|
Controllable image captioning is an emerging multimodal topic that aims to |
|
describe the image with natural language following human purpose, |
|
e.g., looking at the specified regions or telling in a particular
|
text style. State-of-the-art methods are trained on annotated pairs of input |
|
controls and output captions. However, the scarcity of such well-annotated |
|
multimodal data largely limits their usability and scalability for interactive |
|
AI systems. Leveraging unimodal instruction-following foundation models is a |
|
promising alternative that benefits from broader sources of data. In this |
|
paper, we present Caption AnyThing (CAT), a foundation model augmented image |
|
captioning framework supporting a wide range of multimodal controls: 1) visual
|
controls, including points, boxes, and trajectories; 2) language controls, such |
|
as sentiment, length, language, and factuality. Powered by Segment Anything |
|
Model (SAM) and ChatGPT, we unify the visual and language prompts into a |
|
modularized framework, enabling the flexible combination between different |
|
controls. Extensive case studies demonstrate the user intention alignment |
|
capabilities of our framework, shedding light on effective user interaction |
|
modeling in vision-language applications. Our code is publicly available at |
|
https://github.com/ttengwang/Caption-Anything. |
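As a rough illustration of the modular pipeline this abstract describes, the snippet below combines a stubbed segmenter (standing in for a SAM-like model) with a stubbed, control-conditioned captioner. The function names and prompt format are assumptions for illustration, not the Caption-Anything API.

```python
# Hypothetical sketch: a visual prompt (box) selects a region, and language
# controls steer the caption. Both components are stand-in stubs.

def segment_region(image, visual_prompt):
    """Stub for a segmenter (SAM-like): returns a cropped region."""
    x0, y0, x1, y1 = visual_prompt["box"]
    return image[y0:y1, x0:x1]

def llm_caption(region, sentiment="neutral", length="short", language="en"):
    """Stub for a captioner whose output is steered by language controls."""
    prompt = (f"Describe this region in a {length}, {sentiment} tone, "
              f"in language '{language}'.")
    return prompt  # a real system would send the region plus prompt to a model

if __name__ == "__main__":
    import numpy as np
    fake_image = np.zeros((64, 64, 3), dtype=np.uint8)
    region = segment_region(fake_image, {"box": (8, 8, 32, 32)})
    print(llm_caption(region, sentiment="positive", length="one-sentence"))
```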
|
|
|
--------------- |
|
|
|
### 13 Jul 2023 | [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) | [⬇️](https://arxiv.org/pdf/2306.14824)
|
*Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei* |
|
|
|
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new |
|
capabilities of perceiving object descriptions (e.g., bounding boxes) and |
|
grounding text to the visual world. Specifically, we represent refer |
|
expressions as links in Markdown, i.e., "[text span](bounding boxes)", where
|
object descriptions are sequences of location tokens. Together with multimodal |
|
corpora, we construct large-scale data of grounded image-text pairs (called |
|
GrIT) to train the model. In addition to the existing capabilities of MLLMs |
|
(e.g., perceiving general modalities, following instructions, and performing |
|
in-context learning), Kosmos-2 integrates the grounding capability into |
|
downstream applications. We evaluate Kosmos-2 on a wide range of tasks, |
|
including (i) multimodal grounding, such as referring expression comprehension, |
|
and phrase grounding, (ii) multimodal referring, such as referring expression |
|
generation, (iii) perception-language tasks, and (iv) language understanding |
|
and generation. This work lays out the foundation for the development of |
|
Embodiment AI and sheds light on the big convergence of language, multimodal |
|
perception, action, and world modeling, which is a key step toward artificial |
|
general intelligence. Code and pretrained models are available at |
|
https://aka.ms/kosmos-2. |
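The markdown-style grounding format described above is easy to picture with a toy formatter: a referring expression becomes a link whose target is a pair of discretized location tokens. This is only an illustrative sketch; the 32x32 grid and the `<loc_*>` token names below are assumptions, not Kosmos-2's exact vocabulary.

```python
# Toy illustration of serializing a referring expression as a Markdown-style
# link, "[text span](bounding box)", with the box quantized into location
# tokens. Grid size and token names are assumptions for illustration.

def box_to_location_tokens(box, image_w, image_h, bins=32):
    """Quantize (x0, y0, x1, y1) pixel coords into grid-cell location tokens."""
    x0, y0, x1, y1 = box
    tl = (min(int(x0 / image_w * bins), bins - 1),
          min(int(y0 / image_h * bins), bins - 1))
    br = (min(int(x1 / image_w * bins), bins - 1),
          min(int(y1 / image_h * bins), bins - 1))
    return f"<loc_{tl[1] * bins + tl[0]}><loc_{br[1] * bins + br[0]}>"

def ground_span(text_span, box, image_w, image_h):
    tokens = box_to_location_tokens(box, image_w, image_h)
    return f"[{text_span}]({tokens})"

print(ground_span("a snowman", (120, 40, 360, 300), image_w=640, image_h=480))
# e.g. "[a snowman](<loc_198><loc_659>)"
```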
|
|
|
--------------- |
|
|
|
### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615)
|
*Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma*
|
|
|
Screen user interfaces (UIs) and infographics, sharing similar visual |
|
language and design principles, play important roles in human communication and |
|
human-machine interaction. We introduce ScreenAI, a vision-language model that |
|
specializes in UI and infographics understanding. Our model improves upon the |
|
PaLI architecture with the flexible patching strategy of pix2struct and is |
|
trained on a unique mixture of datasets. At the heart of this mixture is a |
|
novel screen annotation task in which the model has to identify the type and |
|
location of UI elements. We use these text annotations to describe screens to |
|
Large Language Models and automatically generate question-answering (QA), UI |
|
navigation, and summarization training datasets at scale. We run ablation |
|
studies to demonstrate the impact of these design choices. At only 5B |
|
parameters, ScreenAI achieves new state-of-the-art results on UI- and
|
infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget |
|
Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and |
|
InfographicVQA) compared to models of similar size. Finally, we release three |
|
new datasets: one focused on the screen annotation task and two others focused |
|
on question answering. |
|
|
|
--------------- |
|
|
|
### 23 Mar 2022 | [ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues](https://arxiv.org/abs/2203.12751) | [⬇️](https://arxiv.org/pdf/2203.12751)
|
*Monica S. Lam, Giovanni Campagna, Mehrad Moradshahi, Sina J. Semnani, Silei Xu* |
|
|
|
Task-oriented conversational agents rely on semantic parsers to translate |
|
natural language to formal representations. In this paper, we propose the |
|
design and rationale of the ThingTalk formal representation, and how the design |
|
improves the development of transactional task-oriented agents. |
|
ThingTalk is built on four core principles: (1) representing user requests |
|
directly as executable statements, covering all the functionality of the agent, |
|
(2) representing dialogues formally and succinctly to support accurate |
|
contextual semantic parsing, (3) standardizing types and interfaces to maximize |
|
reuse between agents, and (4) allowing multiple, independently-developed agents |
|
to be composed in a single virtual assistant. ThingTalk is developed as part of |
|
the Genie Framework that allows developers to quickly build transactional |
|
agents given a database and APIs. |
|
We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST. |
|
Compared to the others, the ThingTalk design is both more general and more |
|
cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and |
|
associated tools yields a new state of the art accuracy of 79% turn-by-turn. |
|
|
|
--------------- |
|
|
|
### 19 Oct 2023 | [3D-GPT: Procedural 3D Modeling with Large Language Models](https://arxiv.org/abs/2310.12945) | [⬇️](https://arxiv.org/pdf/2310.12945)
|
*Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould* |
|
|
|
In the pursuit of efficient automated content creation, procedural |
|
generation, leveraging modifiable parameters and rule-based systems, emerges as |
|
a promising approach. Nonetheless, it could be a demanding endeavor, given its |
|
intricate nature necessitating a deep understanding of rules, algorithms, and |
|
parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing |
|
large language models (LLMs) for instruction-driven 3D modeling. 3D-GPT
|
positions LLMs as proficient problem solvers, dissecting the procedural 3D |
|
modeling tasks into accessible segments and appointing the apt agent for each |
|
task. 3D-GPT integrates three core agents: the task dispatch agent, the |
|
conceptualization agent, and the modeling agent. They collaboratively achieve |
|
two objectives. First, it enhances concise initial scene descriptions, evolving |
|
them into detailed forms while dynamically adapting the text based on |
|
subsequent instructions. Second, it integrates procedural generation, |
|
extracting parameter values from enriched text to effortlessly interface with |
|
3D software for asset creation. Our empirical investigations confirm that |
|
3D-GPT not only interprets and executes instructions, delivering reliable |
|
results but also collaborates effectively with human designers. Furthermore, it |
|
seamlessly integrates with Blender, unlocking expanded manipulation |
|
possibilities. Our work highlights the potential of LLMs in 3D modeling, |
|
offering a basic framework for future advancements in scene generation and |
|
animation. |
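The three-agent division of labour described above can be sketched with plain functions standing in for the LLM calls. Everything below (function names, the enrichment text, the parameter dictionary that would be handed to Blender's Python API) is an assumption for illustration, not the 3D-GPT implementation.

```python
# Hedged sketch of the dispatch -> conceptualization -> modeling pipeline.

def task_dispatch_agent(instruction):
    """Split an instruction into sub-tasks and route them."""
    return [{"subtask": instruction, "handler": "conceptualization"}]

def conceptualization_agent(subtask):
    """Enrich a terse scene description into a more detailed one."""
    return subtask["subtask"] + ", with soft morning light and scattered rocks"

def modeling_agent(detailed_description):
    """Extract procedural-generation parameters from the enriched text."""
    return {"terrain": "desert", "lighting": "morning", "rock_density": 0.3}

def run_pipeline(instruction):
    params = {}
    for sub in task_dispatch_agent(instruction):
        detailed = conceptualization_agent(sub)
        params.update(modeling_agent(detailed))
    return params  # a real system would pass these to 3D software

print(run_pipeline("a sparse desert scene at dawn"))
```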
|
|
|
--------------- |
|
|
|
### 04 Jul 2023 | [Embodied Task Planning with Large Language Models](https://arxiv.org/abs/2307.01848) | [⬇️](https://arxiv.org/pdf/2307.01848)
|
*Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan* |
|
|
|
Equipping embodied agents with commonsense is important for robots to |
|
successfully complete complex human instructions in general environments. |
|
Recent large language models (LLM) can embed rich semantic knowledge for agents |
|
in plan generation of complex tasks, while they lack the information about the |
|
realistic world and usually yield infeasible action sequences. In this paper, |
|
we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning |
|
with physical scene constraint, where the agent generates executable plans |
|
according to the existed objects in the scene by aligning LLMs with the visual |
|
perception models. Specifically, we first construct a multimodal dataset |
|
containing triplets of indoor scenes, instructions and action plans, where we |
|
provide the designed prompts and the list of existing objects in the scene for |
|
GPT-3.5 to generate a large number of instructions and corresponding planned |
|
actions. The generated data is leveraged for grounded plan tuning of |
|
pre-trained LLMs. During inference, we discover the objects in the scene by |
|
extending open-vocabulary object detectors to multi-view RGB images collected |
|
in different achievable locations. Experimental results show that the generated |
|
plan from our TaPA framework can achieve higher success rate than LLaVA and |
|
GPT-3.5 by a sizable margin, which indicates the practicality of embodied task |
|
planning in general and complex environments. |
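The grounding idea in this abstract (constraining the plan to objects the detector actually found in the scene) can be sketched as follows. The detector and planner are stubs and the candidate steps are invented; this is not the TaPA code.

```python
# Minimal sketch of grounded planning: the plan may only reference objects an
# (open-vocabulary) detector has actually found across the collected views.

def detect_objects(rgb_views):
    """Stub for multi-view open-vocabulary detection."""
    return ["mug", "kettle", "sink"]

def plan_with_llm(instruction, objects):
    """Stub LLM planner: keeps only steps whose object exists in the scene."""
    candidate_steps = [("grasp", "mug"), ("fill", "kettle"),
                       ("open", "fridge"), ("place", "mug")]
    return [step for step in candidate_steps if step[1] in objects]

views = ["view_0.png", "view_1.png"]          # placeholder image paths
plan = plan_with_llm("make tea", detect_objects(views))
print(plan)  # [('grasp', 'mug'), ('fill', 'kettle'), ('place', 'mug')]
```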
|
|
|
--------------- |
|
|
|
### 18 Jan 2023 | [Joint Representation Learning for Text and 3D Point Cloud](https://arxiv.org/abs/2301.07584) | [⬇️](https://arxiv.org/pdf/2301.07584)
|
*Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang* |
|
|
|
Recent advancements in vision-language pre-training (e.g. CLIP) have shown |
|
that vision models can benefit from language supervision. While many models |
|
using language modality have achieved great success on 2D vision tasks, the |
|
joint representation learning of 3D point cloud with text remains |
|
under-explored due to the difficulty of 3D-Text data pair acquisition and the |
|
irregularity of 3D data structure. In this paper, we propose a novel Text4Point |
|
framework to construct language-guided 3D point cloud models. The key idea is |
|
utilizing 2D images as a bridge to connect the point cloud and the language |
|
modalities. The proposed Text4Point follows the pre-training and fine-tuning |
|
paradigm. During the pre-training stage, we establish the correspondence of |
|
images and point clouds based on the readily available RGB-D data and use |
|
contrastive learning to align the image and point cloud representations. |
|
Together with the well-aligned image and text features achieved by CLIP, the |
|
point cloud features are implicitly aligned with the text embeddings. Further, |
|
we propose a Text Querying Module to integrate language information into 3D |
|
representation learning by querying text embeddings with point cloud features. |
|
For fine-tuning, the model learns task-specific 3D representations under |
|
informative language guidance from the label set without 2D images. Extensive |
|
experiments demonstrate that our model shows consistent improvement on various |
|
downstream tasks, such as point cloud semantic segmentation, instance |
|
segmentation, and object detection. The code will be available here: |
|
https://github.com/LeapLabTHU/Text4Point |
|
|
|
--------------- |
|
|
|
### 01 Feb 2024 | [Executable Code Actions Elicit Better LLM Agents](https://arxiv.org/abs/2402.01030) | [⬇️](https://arxiv.org/pdf/2402.01030)
|
*Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji* |
|
|
|
Large Language Model (LLM) agents, capable of performing a broad range of |
|
actions, such as invoking tools and controlling robots, show great potential in |
|
tackling real-world challenges. LLM agents are typically prompted to produce |
|
actions by generating JSON or text in a pre-defined format, which is usually |
|
limited by constrained action space (e.g., the scope of pre-defined tools) and |
|
restricted flexibility (e.g., inability to compose multiple tools). This work |
|
proposes to use executable Python code to consolidate LLM agents' actions into |
|
a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct |
|
can execute code actions and dynamically revise prior actions or emit new |
|
actions upon new observations through multi-turn interactions. Our extensive |
|
analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that |
|
CodeAct outperforms widely used alternatives (up to 20% higher success rate). |
|
The encouraging performance of CodeAct motivates us to build an open-source LLM |
|
agent that interacts with environments by executing interpretable code and |
|
collaborates with users using natural language. To this end, we collect an |
|
instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn |
|
interactions using CodeAct. We show that it can be used with existing data to |
|
improve models in agent-oriented tasks without compromising their general |
|
capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with |
|
Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., |
|
model training) using existing libraries and autonomously self-debug. |
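A stripped-down sketch of the code-as-action loop described above: the model emits Python, the interpreter executes it, and the captured output (or error) is appended to the conversation for the next turn. The `call_llm` stub and prompt strings are assumptions, not the CodeAct agent itself.

```python
# Sketch of a code-as-action loop with a stubbed policy and a real interpreter.

import io, contextlib

def call_llm(history):
    """Stub policy: always answers with a tiny code action."""
    return "result = sum(range(10))\nprint(result)"

def execute(code, namespace):
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)          # run the code action
    except Exception as exc:               # feed errors back for self-debugging
        return f"Error: {exc}"
    return buffer.getvalue()

history, namespace = ["User: add the numbers 0..9"], {}
for _ in range(2):                          # a couple of interaction turns
    code = call_llm(history)
    observation = execute(code, namespace)
    history += [f"Action:\n{code}", f"Observation: {observation}"]
print(history[-1])                          # Observation: 45
```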
|
|
|
--------------- |
|
|
|
### 24 Jan 2024 | [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) | [⬇️](https://arxiv.org/pdf/2401.13649)
|
*Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried* |
|
|
|
Autonomous agents capable of planning, reasoning, and executing actions on |
|
the web offer a promising avenue for automating computer tasks. However, the |
|
majority of existing benchmarks primarily focus on text-based agents, |
|
neglecting many natural tasks that require visual information to effectively |
|
solve. Given that most computer interfaces cater to human perception, visual |
|
information often augments textual data in ways that text-only models struggle |
|
to harness effectively. To bridge this gap, we introduce VisualWebArena, a |
|
benchmark designed to assess the performance of multimodal web agents on |
|
realistic *visually grounded tasks*. VisualWebArena comprises a set
|
of diverse and complex web-based tasks that evaluate various capabilities of |
|
autonomous multimodal agents. To perform on this benchmark, agents need to |
|
accurately process image-text inputs, interpret natural language instructions, |
|
and execute actions on websites to accomplish user-defined objectives. We |
|
conduct an extensive evaluation of state-of-the-art LLM-based autonomous |
|
agents, including several multimodal models. Through extensive quantitative and |
|
qualitative analysis, we identify several limitations of text-only LLM agents, |
|
and reveal gaps in the capabilities of state-of-the-art multimodal language |
|
agents. VisualWebArena provides a framework for evaluating multimodal |
|
autonomous language agents, and offers insights towards building stronger |
|
autonomous agents for the web. Our code, baseline models, and data are publicly
|
available at https://jykoh.com/vwa. |
|
|
|
--------------- |
|
|
|
### 22 Feb 2018 | [Multimodal Named Entity Recognition for Short Social Media Posts](https://arxiv.org/abs/1802.07862) | [⬇️](https://arxiv.org/pdf/1802.07862)
|
*Seungwhan Moon, Leonardo Neves, Vitor Carvalho* |
|
|
|
We introduce a new task called Multimodal Named Entity Recognition (MNER) for |
|
noisy user-generated data such as tweets or Snapchat captions, which comprise |
|
short text with accompanying images. These social media posts often come in |
|
inconsistent or incomplete syntax and lexical notations with very limited |
|
surrounding textual contexts, bringing significant challenges for NER. To this |
|
end, we create a new dataset for MNER called SnapCaptions (Snapchat |
|
image-caption pairs submitted to public and crowd-sourced stories with fully |
|
annotated named entities). We then build upon the state-of-the-art Bi-LSTM |
|
word/character based NER models with 1) a deep image network which incorporates |
|
relevant visual context to augment textual information, and 2) a generic |
|
modality-attention module which learns to attenuate irrelevant modalities while |
|
amplifying the most informative ones to extract contexts from, adaptive to each |
|
sample and token. The proposed MNER model with modality attention significantly |
|
outperforms the state-of-the-art text-only NER models by successfully |
|
leveraging provided visual contexts, opening up potential applications of MNER |
|
on myriads of social media platforms. |
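The modality-attention idea (per-token softmax weights over word, character, and visual features) can be sketched in a few lines of numpy. The dimensions and the single linear gate below are assumptions, not the paper's exact architecture.

```python
# Small numpy sketch: learn softmax weights over three modality features and
# take a weighted sum as the fused, attended representation for one token.

import numpy as np

rng = np.random.default_rng(0)
d = 8                                            # shared feature size
word_feat, char_feat, vis_feat = (rng.normal(size=d) for _ in range(3))

modalities = np.stack([word_feat, char_feat, vis_feat])   # (3, d)
W = rng.normal(size=(d, 1))                               # gating parameters

scores = modalities @ W                                   # (3, 1)
weights = np.exp(scores) / np.exp(scores).sum()           # softmax over modalities
fused = (weights * modalities).sum(axis=0)                # (d,) attended fusion

print(weights.ravel(), fused.shape)
```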
|
|
|
--------------- |
|
|
|
### 21 Sep 2023 | [You Only Look at Screens: Multimodal Chain-of-Action Agents](https://arxiv.org/abs/2309.11436) | [⬇️](https://arxiv.org/pdf/2309.11436)
|
*Zhuosheng Zhang, Aston Zhang* |
|
|
|
Autonomous user interface (UI) agents aim to facilitate task automation by |
|
interacting with the user interface without manual intervention. Recent studies |
|
have investigated eliciting the capabilities of large language models (LLMs) |
|
for effective engagement in diverse environments. To align with the |
|
input-output requirement of LLMs, existing approaches are developed under a |
|
sandbox setting where they rely on external tools and application-specific APIs |
|
to parse the environment into textual elements and interpret the predicted |
|
actions. Consequently, those approaches often grapple with inference |
|
inefficiency and error propagation risks. To mitigate the challenges, we |
|
introduce Auto-UI, a multimodal solution that directly interacts with the |
|
interface, bypassing the need for environment parsing or reliance on |
|
application-dependent APIs. Moreover, we propose a chain-of-action technique -- |
|
leveraging a series of intermediate previous action histories and future action |
|
plans -- to help the agent decide what action to execute. We evaluate our |
|
approach on a new device-control benchmark AITW with 30K unique instructions, |
|
spanning multi-step tasks such as application operation, web searching, and web |
|
shopping. Experimental results show that Auto-UI achieves state-of-the-art |
|
performance with an action type prediction accuracy of 90% and an overall |
|
action success rate of 74%. Code is publicly available at |
|
https://github.com/cooelf/Auto-UI. |
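The chain-of-action technique (conditioning the next action on previous action history and a rough future plan) amounts to careful context assembly. In the hedged sketch below, the prompt fields and the stubbed predictor are assumptions, not Auto-UI's actual format.

```python
# Sketch of chain-of-action conditioning with a stubbed next-action predictor.

def predict_next_action(screen, instruction, action_history, future_plan):
    prompt = (
        f"Goal: {instruction}\n"
        f"Previous actions: {action_history}\n"
        f"Planned next steps: {future_plan}\n"
        f"Screen: {screen}\n"
        "Next action:"
    )
    return {"type": "click", "target": "search_button", "prompt": prompt}

history = [{"type": "type_text", "target": "search_box", "text": "running shoes"}]
plan = ["submit search", "open first result"]
action = predict_next_action("screenshot_t3.png", "buy running shoes", history, plan)
print(action["type"], action["target"])
```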
|
|
|
--------------- |
|
|
|
### 06 Jun 2023 | [LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models](https://arxiv.org/abs/2303.02927) | [⬇️](https://arxiv.org/pdf/2303.02927)
|
*Victor Dibia* |
|
|
|
Systems that support users in the automatic creation of visualizations must |
|
address several subtasks - understand the semantics of data, enumerate relevant |
|
visualization goals and generate visualization specifications. In this work, we |
|
pose visualization generation as a multi-stage generation problem and argue |
|
that well-orchestrated pipelines based on large language models (LLMs) such as |
|
ChatGPT/GPT-4 and image generation models (IGMs) are suitable to addressing |
|
these tasks. We present LIDA, a novel tool for generating grammar-agnostic |
|
visualizations and infographics. LIDA comprises 4 modules - A SUMMARIZER
|
that converts data into a rich but compact natural language summary, a GOAL |
|
EXPLORER that enumerates visualization goals given the data, a VISGENERATOR |
|
that generates, refines, executes and filters visualization code and an |
|
INFOGRAPHER module that yields data-faithful stylized graphics using IGMs. LIDA |
|
provides a python api, and a hybrid user interface (direct manipulation and |
|
multilingual natural language) for interactive chart, infographics and data |
|
story generation. Learn more about the project here - |
|
https://microsoft.github.io/lida/ |
|
|
|
--------------- |
|
|
|
### 16 Feb 2023 | [VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning](https://arxiv.org/abs/2211.15103) | [⬇️](https://arxiv.org/pdf/2211.15103)
|
*Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le* |
|
|
|
Video paragraph captioning aims to generate a multi-sentence description of |
|
an untrimmed video with several temporal event locations in coherent |
|
storytelling. Following the human perception process, where the scene is |
|
effectively understood by decomposing it into visual (e.g. human, animal) and |
|
non-visual components (e.g. action, relations) under the mutual influence of |
|
vision and language, we first propose a visual-linguistic (VL) feature. In the |
|
proposed VL feature, the scene is modeled by three modalities including (i) a |
|
global visual environment; (ii) local visual main agents; (iii) linguistic |
|
scene elements. We then introduce an autoregressive Transformer-in-Transformer |
|
(TinT) to simultaneously capture the semantic coherence of intra- and |
|
inter-event contents within a video. Finally, we present a new VL contrastive |
|
loss function to guarantee learnt embedding features are matched with the |
|
captions semantics. Comprehensive experiments and extensive ablation studies on |
|
ActivityNet Captions and YouCookII datasets show that the proposed |
|
Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms prior
|
state-of-the-art methods on accuracy and diversity. Source code is made |
|
publicly available at: https://github.com/UARK-AICV/VLTinT. |
|
|
|
--------------- |
|
|
|
### 04 Mar 2021 | [FAtiMA Toolkit -- Toward an effective and accessible tool for the development of intelligent virtual agents and social robots](https://arxiv.org/abs/2103.03020) | [⬇️](https://arxiv.org/pdf/2103.03020)
|
*Samuel Mascarenhas, Manuel Guimarães, Pedro A. Santos, João Dias, Rui Prada, Ana Paiva*
|
|
|
More than a decade has passed since the development of FearNot!, an |
|
application designed to help children deal with bullying through role-playing |
|
with virtual characters. It was also the application that led to the creation |
|
of FAtiMA, an affective agent architecture for creating autonomous characters |
|
that can evoke empathic responses. In this paper, we describe FAtiMA Toolkit, a |
|
collection of open-source tools that is designed to help researchers, game |
|
developers and roboticists incorporate a computational model of emotion and |
|
decision-making in their work. The toolkit was developed with the goal of |
|
making FAtiMA more accessible, easier to incorporate into different projects |
|
and more flexible in its capabilities for human-agent interaction, based upon |
|
the experience gathered over the years across different virtual environments |
|
and human-robot interaction scenarios. As a result, this work makes several |
|
different contributions to the field of Agent-Based Architectures. More |
|
precisely, FAtiMA Toolkit's library based design allows developers to easily |
|
integrate it with other frameworks, its meta-cognitive model affords different |
|
internal reasoners and affective components and its explicit dialogue structure |
|
gives control to the author even within highly complex scenarios. To |
|
demonstrate the use of FAtiMA Toolkit, several different use cases where the |
|
toolkit was successfully applied are described and discussed. |
|
|
|
--------------- |
|
|
|
### 12 Sep 2022 | [emojiSpace: Spatial Representation of Emojis](https://arxiv.org/abs/2209.09871) | [⬇️](https://arxiv.org/pdf/2209.09871)
|
*Moeen Mostafavi, Mahsa Pahlavikhah Varnosfaderani, Fateme Nikseresht, Seyed Ahmad Mansouri* |
|
|
|
In the absence of nonverbal cues during messaging communication, users |
|
express part of their emotions using emojis. Thus, having emojis in the |
|
vocabulary of text messaging language models can significantly improve many |
|
natural language processing (NLP) applications such as online communication |
|
analysis. On the other hand, word embedding models are usually trained on a |
|
very large corpus of text such as Wikipedia or Google News datasets that |
|
include very few samples with emojis. In this study, we create emojiSpace, |
|
which is a combined word-emoji embedding using the word2vec model from the |
|
Genism library in Python. We trained emojiSpace on a corpus of more than 4 |
|
billion tweets and evaluated it by implementing sentiment analysis on a Twitter |
|
dataset containing more than 67 million tweets as an extrinsic task. For this |
|
task, we compared the performance of two different classifiers of random forest |
|
(RF) and linear support vector machine (SVM). For evaluation, we compared |
|
emojiSpace performance with two other pre-trained embeddings and demonstrated |
|
that emojiSpace outperforms both. |
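Building a combined word-emoji embedding of this kind can be sketched with gensim's word2vec by simply keeping emojis as tokens. The three-"tweet" corpus below is made up purely for illustration; the study trains on billions of tweets and evaluates downstream sentiment classifiers.

```python
# Toy sketch of a combined word-emoji embedding with gensim's word2vec
# (gensim >= 4). Emojis are treated as ordinary vocabulary tokens.

from gensim.models import Word2Vec

tweets = [
    ["great", "game", "tonight", "🔥", "🙌"],
    ["so", "tired", "of", "this", "weather", "😩"],
    ["love", "this", "song", "❤️", "🔥"],
]

model = Word2Vec(tweets, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv["🔥"].shape)                 # (50,)
print(model.wv.most_similar("🔥", topn=2))  # nearest word/emoji neighbours
```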
|
|
|
--------------- |
|
|
|
### 27 Jan 2020 | [CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking](https://arxiv.org/abs/2001.07935) | [⬇️](https://arxiv.org/pdf/2001.07935)
|
*Grigori Fursin, Herve Guillou and Nicolas Essayan* |
|
|
|
We present CodeReef - an open platform to share all the components necessary |
|
to enable cross-platform MLOps (MLSysOps), i.e. automating the deployment of ML |
|
models across diverse systems in the most efficient way. We also introduce the |
|
CodeReef solution - a way to package and share models as non-virtualized, |
|
portable, customizable and reproducible archive files. Such ML packages include |
|
JSON meta description of models with all dependencies, Python APIs, CLI actions |
|
and portable workflows necessary to automatically build, benchmark, test and |
|
customize models across diverse platforms, AI frameworks, libraries, compilers |
|
and datasets. We demonstrate several CodeReef solutions to automatically build, |
|
run and measure object detection based on SSD-Mobilenets, TensorFlow and COCO |
|
dataset from the latest MLPerf inference benchmark across a wide range of |
|
platforms from Raspberry Pi, Android phones and IoT devices to data centers. |
|
Our long-term goal is to help researchers share their new techniques as |
|
production-ready packages along with research papers to participate in |
|
collaborative and reproducible benchmarking, compare the different |
|
ML/software/hardware stacks and select the most efficient ones on a Pareto |
|
frontier using online CodeReef dashboards. |
|
|
|
--------------- |
|
|
|
### 28 Feb 2024 | [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) | [⬇️](https://arxiv.org/pdf/2402.17553)
|
*Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov* |
|
|
|
For decades, human-computer interaction has fundamentally been manual. Even |
|
today, almost all productive work done on the computer necessitates human input |
|
at every step. Autonomous virtual agents represent an exciting step in |
|
automating many of these menial tasks. Virtual agents would empower users with |
|
limited technical proficiency to harness the full possibilities of computer |
|
systems. They could also enable the efficient streamlining of numerous computer |
|
tasks, ranging from calendar management to complex travel bookings, with |
|
minimal human intervention. In this paper, we introduce OmniACT, the |
|
first-of-a-kind dataset and benchmark for assessing an agent's capability to |
|
generate executable programs to accomplish computer tasks. Our scope extends |
|
beyond traditional web automation, covering a diverse range of desktop |
|
applications. The dataset consists of fundamental tasks such as "Play the next |
|
song", as well as longer horizon tasks such as "Send an email to John Doe |
|
mentioning the time and place to meet". Specifically, given a pair of screen |
|
image and a visually-grounded natural language task, the goal is to generate a |
|
script capable of fully executing the task. We run several strong baseline |
|
language model agents on our benchmark. The strongest baseline, GPT-4, performs |
|
the best on our benchmark. However, its performance level still reaches only 15%
|
of the human proficiency in generating executable scripts capable of completing |
|
the task, demonstrating the challenge of our task for conventional web agents. |
|
Our benchmark provides a platform to measure and evaluate the progress of |
|
language model agents in automating computer tasks and motivates future work |
|
towards building multimodal models that bridge large language models and the |
|
visual grounding of computer screens. |
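The benchmark's input-output contract (a screen image plus a natural-language task, mapped to an executable automation script) can be sketched with a stub generator that emits a pyautogui-style script. The coordinates and the generator itself are assumptions for illustration, not an OmniACT baseline.

```python
# Sketch of (screenshot, task) -> executable automation script with a stub model.

def generate_script(screenshot_path, task):
    """Stub for a model that writes an automation script for the given task."""
    return "\n".join([
        "import pyautogui",
        "pyautogui.click(412, 88)          # focus the search field",
        f"pyautogui.write({task!r})         # type the request",
        "pyautogui.press('enter')",
    ])

script = generate_script("desktop.png", "Play the next song")
print(script)          # inspect before running; exec(script) would execute it
```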
|
|
|
--------------- |
|
|
|
### 24 Mar 2021 | [Proactive Interaction Framework for Intelligent Social Receptionist Robots](https://arxiv.org/abs/2012.04832) | [⬇️](https://arxiv.org/pdf/2012.04832)
|
*Yang Xue, Fan Wang, Hao Tian, Min Zhao, Jiangyong Li, Haiqing Pan and Yueqiang Dong* |
|
|
|
Proactive human-robot interaction (HRI) allows the receptionist robots to |
|
actively greet people and offer services based on vision, which has been found |
|
to improve acceptability and customer satisfaction. Existing approaches are |
|
either based on multi-stage decision processes or based on end-to-end decision |
|
models. However, the rule-based approaches require sedulous expert efforts and |
|
only handle minimal pre-defined scenarios. On the other hand, existing works |
|
with end-to-end models are limited to very general greetings or few behavior |
|
patterns (typically less than 10). To address those challenges, we propose a |
|
new end-to-end framework, the TransFormer with Visual Tokens for Human-Robot |
|
Interaction (TFVT-HRI). The proposed framework extracts visual tokens of |
|
relative objects from an RGB camera first. To ensure the correct interpretation |
|
of the scenario, a transformer decision model is then employed to process the |
|
visual tokens, which is augmented with the temporal and spatial information. It |
|
predicts the appropriate action to take in each scenario and identifies the |
|
right target. Our data is collected from an in-service receptionist robot in an |
|
office building, which is then annotated by experts for appropriate proactive |
|
behavior. The action set includes 1000+ diverse patterns by combining language, |
|
emoji expression, and body motions. We compare our model with other SOTA |
|
end-to-end models on both offline test sets and online user experiments in |
|
realistic office building environments to validate this framework. It is |
|
demonstrated that the decision model achieves SOTA performance in action |
|
triggering and selection, resulting in more humanness and intelligence when |
|
compared with the previous reactive reception policies. |
|
|
|
--------------- |
|
|
|
### 15 Mar 2023 | [Sustainable Cloud Services for Verbal Interaction with Embodied Agents](https://arxiv.org/abs/2203.02606) | [⬇️](https://arxiv.org/pdf/2203.02606)
|
*Lucrezia Grassi, Carmine Tommaso Recchiuto, Antonio Sgorbissa* |
|
|
|
This article presents the design and the implementation of a cloud system for |
|
knowledge-based autonomous interaction devised for Social Robots and other |
|
conversational agents. The system is particularly convenient for low-cost |
|
robots and devices: it can be used as a stand-alone dialogue system or as an |
|
integration to provide "background" dialogue capabilities to any preexisting |
|
Natural Language Processing ability that the robot may already have as part of |
|
its basic skills. By connecting to the cloud, developers are provided with a |
|
sustainable solution to manage verbal interaction through a network connection, |
|
with about 3,000 topics of conversation ready for "chit-chatting" and a library |
|
of pre-cooked plans that only needs to be grounded into the robot's physical |
|
capabilities. The system is structured as a set of REST API endpoints so that |
|
it can be easily expanded by adding new APIs to improve the capabilities of the |
|
clients connected to the cloud. Another key feature of the system is that it |
|
has been designed to make the development of its clients straightforward: in |
|
this way, multiple robots and devices can be easily endowed with the capability |
|
of autonomously interacting with the user, understanding when to perform |
|
specific actions, and exploiting all the information provided by cloud |
|
services. The article outlines and discusses the results of the experiments |
|
performed to assess the system's performance in terms of response time, paving |
|
the way for its use both for research and market solutions. Links to |
|
repositories with clients for ROS and popular robots such as Pepper and NAO are |
|
available on request. |
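A client for a REST-style dialogue service like the one described here might look roughly as follows. The endpoint path and payload fields are assumptions; the actual API and the ROS/Pepper/NAO clients are available from the authors on request.

```python
# Hypothetical sketch of a client for a REST-style cloud dialogue service.

import requests

def ask_cloud_dialogue(base_url, session_id, utterance):
    response = requests.post(
        f"{base_url}/dialogue",                      # assumed endpoint path
        json={"session": session_id, "utterance": utterance},
        timeout=5,
    )
    response.raise_for_status()
    return response.json()

# reply = ask_cloud_dialogue("https://example.org/api", "robot-01", "Hello!")
# print(reply.get("answer"))
```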
|
|
|
---------------<s>[INST] Context: |
|
1. <b> AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents </b> |
|
Abstract: In this study, our goal is to create interactive avatar agents that can |
|
autonomously plan and animate nuanced facial movements realistically, from both |
|
visual and behavioral perspectives. Given high-level inputs about the |
|
environment and agent profile, our framework harnesses LLMs to produce a series |
|
of detailed text descriptions of the avatar agents' facial motions. These |
|
descriptions are then processed by our task-agnostic driving engine into motion |
|
token sequences, which are subsequently converted into continuous motion |
|
embeddings that are further consumed by our standalone neural-based renderer to |
|
generate the final photorealistic avatar animations. These streamlined |
|
processes allow our framework to adapt to a variety of non-verbal avatar |
|
interactions, both monadic and dyadic. Our extensive study, which includes |
|
experiments on both newly compiled and existing datasets featuring two types of |
|
agents -- one capable of monadic interaction with the environment, and the |
|
other designed for dyadic conversation -- validates the effectiveness and |
|
versatility of our approach. To our knowledge, we advanced a leap step by |
|
combining LLMs and neural rendering for generalized non-verbal prediction and |
|
photo-realistic rendering of avatar agents. |
|
2. <b> Caption Anything: Interactive Image Description with Diverse Multimodal Controls </b> |
|
Abstract: Controllable image captioning is an emerging multimodal topic that aims to |
|
describe the image with natural language following human purpose, |
|
$\textit{e.g.}$, looking at the specified regions or telling in a particular |
|
text style. State-of-the-art methods are trained on annotated pairs of input |
|
controls and output captions. However, the scarcity of such well-annotated |
|
multimodal data largely limits their usability and scalability for interactive |
|
AI systems. Leveraging unimodal instruction-following foundation models is a |
|
promising alternative that benefits from broader sources of data. In this |
|
paper, we present Caption AnyThing (CAT), a foundation model augmented image |
|
captioning framework supporting a wide range of multimodel controls: 1) visual |
|
controls, including points, boxes, and trajectories; 2) language controls, such |
|
as sentiment, length, language, and factuality. Powered by Segment Anything |
|
Model (SAM) and ChatGPT, we unify the visual and language prompts into a |
|
modularized framework, enabling the flexible combination between different |
|
controls. Extensive case studies demonstrate the user intention alignment |
|
capabilities of our framework, shedding light on effective user interaction |
|
modeling in vision-language applications. Our code is publicly available at |
|
https://github.com/ttengwang/Caption-Anything. |
|
3. <b> Kosmos-2: Grounding Multimodal Large Language Models to the World </b> |
|
Abstract: We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new |
|
capabilities of perceiving object descriptions (e.g., bounding boxes) and |
|
grounding text to the visual world. Specifically, we represent refer |
|
expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where |
|
object descriptions are sequences of location tokens. Together with multimodal |
|
corpora, we construct large-scale data of grounded image-text pairs (called |
|
GrIT) to train the model. In addition to the existing capabilities of MLLMs |
|
(e.g., perceiving general modalities, following instructions, and performing |
|
in-context learning), Kosmos-2 integrates the grounding capability into |
|
downstream applications. We evaluate Kosmos-2 on a wide range of tasks, |
|
including (i) multimodal grounding, such as referring expression comprehension, |
|
and phrase grounding, (ii) multimodal referring, such as referring expression |
|
generation, (iii) perception-language tasks, and (iv) language understanding |
|
and generation. This work lays out the foundation for the development of |
|
Embodiment AI and sheds light on the big convergence of language, multimodal |
|
perception, action, and world modeling, which is a key step toward artificial |
|
general intelligence. Code and pretrained models are available at |
|
https://aka.ms/kosmos-2. |
|
4. <b> ScreenAI: A Vision-Language Model for UI and Infographics Understanding </b> |
|
Abstract: Screen user interfaces (UIs) and infographics, sharing similar visual |
|
language and design principles, play important roles in human communication and |
|
human-machine interaction. We introduce ScreenAI, a vision-language model that |
|
specializes in UI and infographics understanding. Our model improves upon the |
|
PaLI architecture with the flexible patching strategy of pix2struct and is |
|
trained on a unique mixture of datasets. At the heart of this mixture is a |
|
novel screen annotation task in which the model has to identify the type and |
|
location of UI elements. We use these text annotations to describe screens to |
|
Large Language Models and automatically generate question-answering (QA), UI |
|
navigation, and summarization training datasets at scale. We run ablation |
|
studies to demonstrate the impact of these design choices. At only 5B |
|
parameters, ScreenAI achieves new state-of-the-artresults on UI- and |
|
infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget |
|
Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and |
|
InfographicVQA) compared to models of similar size. Finally, we release three |
|
new datasets: one focused on the screen annotation task and two others focused |
|
on question answering. |
|
5. <b> ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues </b> |
|
Abstract: Task-oriented conversational agents rely on semantic parsers to translate |
|
natural language to formal representations. In this paper, we propose the |
|
design and rationale of the ThingTalk formal representation, and how the design |
|
improves the development of transactional task-oriented agents. |
|
ThingTalk is built on four core principles: (1) representing user requests |
|
directly as executable statements, covering all the functionality of the agent, |
|
(2) representing dialogues formally and succinctly to support accurate |
|
contextual semantic parsing, (3) standardizing types and interfaces to maximize |
|
reuse between agents, and (4) allowing multiple, independently-developed agents |
|
to be composed in a single virtual assistant. ThingTalk is developed as part of |
|
the Genie Framework that allows developers to quickly build transactional |
|
agents given a database and APIs. |
|
We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST. |
|
Compared to the others, the ThingTalk design is both more general and more |
|
cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and |
|
associated tools yields a new state of the art accuracy of 79% turn-by-turn. |
|
6. <b> 3D-GPT: Procedural 3D Modeling with Large Language Models </b> |
|
Abstract: In the pursuit of efficient automated content creation, procedural |
|
generation, leveraging modifiable parameters and rule-based systems, emerges as |
|
a promising approach. Nonetheless, it could be a demanding endeavor, given its |
|
intricate nature necessitating a deep understanding of rules, algorithms, and |
|
parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing |
|
large language models~(LLMs) for instruction-driven 3D modeling. 3D-GPT |
|
positions LLMs as proficient problem solvers, dissecting the procedural 3D |
|
modeling tasks into accessible segments and appointing the apt agent for each |
|
task. 3D-GPT integrates three core agents: the task dispatch agent, the |
|
conceptualization agent, and the modeling agent. They collaboratively achieve |
|
two objectives. First, it enhances concise initial scene descriptions, evolving |
|
them into detailed forms while dynamically adapting the text based on |
|
subsequent instructions. Second, it integrates procedural generation, |
|
extracting parameter values from enriched text to effortlessly interface with |
|
3D software for asset creation. Our empirical investigations confirm that |
|
3D-GPT not only interprets and executes instructions, delivering reliable |
|
results but also collaborates effectively with human designers. Furthermore, it |
|
seamlessly integrates with Blender, unlocking expanded manipulation |
|
possibilities. Our work highlights the potential of LLMs in 3D modeling, |
|
offering a basic framework for future advancements in scene generation and |
|
animation. |
|
7. <b> Embodied Task Planning with Large Language Models </b> |
|
Abstract: Equipping embodied agents with commonsense is important for robots to |
|
successfully complete complex human instructions in general environments. |
|
Recent large language models (LLM) can embed rich semantic knowledge for agents |
|
in plan generation of complex tasks, while they lack the information about the |
|
realistic world and usually yield infeasible action sequences. In this paper, |
|
we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning |
|
with physical scene constraint, where the agent generates executable plans |
|
according to the existed objects in the scene by aligning LLMs with the visual |
|
perception models. Specifically, we first construct a multimodal dataset |
|
containing triplets of indoor scenes, instructions and action plans, where we |
|
provide the designed prompts and the list of existing objects in the scene for |
|
GPT-3.5 to generate a large number of instructions and corresponding planned |
|
actions. The generated data is leveraged for grounded plan tuning of |
|
pre-trained LLMs. During inference, we discover the objects in the scene by |
|
extending open-vocabulary object detectors to multi-view RGB images collected |
|
in different achievable locations. Experimental results show that the generated |
|
plan from our TaPA framework can achieve higher success rate than LLaVA and |
|
GPT-3.5 by a sizable margin, which indicates the practicality of embodied task |
|
planning in general and complex environments. |
|
8. <b> Joint Representation Learning for Text and 3D Point Cloud </b> |
|
Abstract: Recent advancements in vision-language pre-training (e.g. CLIP) have shown |
|
that vision models can benefit from language supervision. While many models |
|
using language modality have achieved great success on 2D vision tasks, the |
|
joint representation learning of 3D point cloud with text remains |
|
under-explored due to the difficulty of 3D-Text data pair acquisition and the |
|
irregularity of 3D data structure. In this paper, we propose a novel Text4Point |
|
framework to construct language-guided 3D point cloud models. The key idea is |
|
utilizing 2D images as a bridge to connect the point cloud and the language |
|
modalities. The proposed Text4Point follows the pre-training and fine-tuning |
|
paradigm. During the pre-training stage, we establish the correspondence of |
|
images and point clouds based on the readily available RGB-D data and use |
|
contrastive learning to align the image and point cloud representations. |
|
Together with the well-aligned image and text features achieved by CLIP, the |
|
point cloud features are implicitly aligned with the text embeddings. Further, |
|
we propose a Text Querying Module to integrate language information into 3D |
|
representation learning by querying text embeddings with point cloud features. |
|
For fine-tuning, the model learns task-specific 3D representations under |
|
informative language guidance from the label set without 2D images. Extensive |
|
experiments demonstrate that our model shows consistent improvement on various |
|
downstream tasks, such as point cloud semantic segmentation, instance |
|
segmentation, and object detection. The code will be available here: |
|
https://github.com/LeapLabTHU/Text4Point |
|
9. <b> Executable Code Actions Elicit Better LLM Agents </b> |
|
Abstract: Large Language Model (LLM) agents, capable of performing a broad range of |
|
actions, such as invoking tools and controlling robots, show great potential in |
|
tackling real-world challenges. LLM agents are typically prompted to produce |
|
actions by generating JSON or text in a pre-defined format, which is usually |
|
limited by constrained action space (e.g., the scope of pre-defined tools) and |
|
restricted flexibility (e.g., inability to compose multiple tools). This work |
|
proposes to use executable Python code to consolidate LLM agents' actions into |
|
a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct |
|
can execute code actions and dynamically revise prior actions or emit new |
|
actions upon new observations through multi-turn interactions. Our extensive |
|
analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that |
|
CodeAct outperforms widely used alternatives (up to 20% higher success rate). |
|
The encouraging performance of CodeAct motivates us to build an open-source LLM |
|
agent that interacts with environments by executing interpretable code and |
|
collaborates with users using natural language. To this end, we collect an |
|
instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn |
|
interactions using CodeAct. We show that it can be used with existing data to |
|
improve models in agent-oriented tasks without compromising their general |
|
capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with |
|
Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., |
|
model training) using existing libraries and autonomously self-debug. |
|
10. <b> VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks </b> |
|
Abstract: Autonomous agents capable of planning, reasoning, and executing actions on |
|
the web offer a promising avenue for automating computer tasks. However, the |
|
majority of existing benchmarks primarily focus on text-based agents, |
|
neglecting many natural tasks that require visual information to effectively |
|
solve. Given that most computer interfaces cater to human perception, visual |
|
information often augments textual data in ways that text-only models struggle |
|
to harness effectively. To bridge this gap, we introduce VisualWebArena, a |
|
benchmark designed to assess the performance of multimodal web agents on |
|
realistic \textit{visually grounded tasks}. VisualWebArena comprises of a set |
|
of diverse and complex web-based tasks that evaluate various capabilities of |
|
autonomous multimodal agents. To perform on this benchmark, agents need to |
|
accurately process image-text inputs, interpret natural language instructions, |
|
and execute actions on websites to accomplish user-defined objectives. We |
|
conduct an extensive evaluation of state-of-the-art LLM-based autonomous |
|
agents, including several multimodal models. Through extensive quantitative and |
|
qualitative analysis, we identify several limitations of text-only LLM agents, |
|
and reveal gaps in the capabilities of state-of-the-art multimodal language |
|
agents. VisualWebArena provides a framework for evaluating multimodal |
|
autonomous language agents, and offers insights towards building stronger |
|
autonomous agents for the web. |
|
""") |