import streamlit as st
st.set_page_config(page_title="Multi Agent Systems", page_icon=":robot_face:", layout="wide")
# Hide Streamlit's default hamburger menu and footer via injected CSS
hide_streamlit_style = """
<style>
#MainMenu {visibility: hidden;}
footer {visibility: hidden;}
</style>
"""
st.markdown(hide_streamlit_style, unsafe_allow_html=True)
col1, col2 = st.columns(2)  # st.beta_columns was removed in Streamlit 1.x
with col1:
    st.markdown("## **Autonomous agents interacting** :robot_face: :robot_face:")
    st.markdown("### **Key Aspects** :bulb:")
    st.markdown("""
1. **Interaction Protocol** 🤝 \n
- Define rules for communication and cooperation \n
2. **Decentralized Decision Making** 🎯 \n
- Autonomous agents make independent decisions \n
3. **Collaboration and Competition** 🤼 \n
- Agents work together or against each other \n
""")
with col2:
    st.markdown("### **Entities** :guards:")
    st.markdown("""
1. **Autonomous Agents** 🤖 \n
- Independent entities with decision-making capabilities \n
2. **Environment** 🌐 \n
- Shared space where agents interact \n
3. **Ruleset** 📜 \n
- Defines interaction protocol and decision-making processes \n
""")
st.markdown("---")
st.markdown("## **Interaction Protocol** 🀝 :bulb:**")
st.markdown("### **Key Elements** :guards:")
st.markdown("""
1. **Communication** πŸ—£ \n
- Agents exchange information \n
2. **Cooperation** 🤝 \n
# 🩺🔍 Search Results
### 04 Dec 2023 | [AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents](https://arxiv.org/abs/2311.17465) | [⬇️](https://arxiv.org/pdf/2311.17465)
*Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang*
In this study, our goal is to create interactive avatar agents that can
autonomously plan and animate nuanced facial movements realistically, from both
visual and behavioral perspectives. Given high-level inputs about the
environment and agent profile, our framework harnesses LLMs to produce a series
of detailed text descriptions of the avatar agents' facial motions. These
descriptions are then processed by our task-agnostic driving engine into motion
token sequences, which are subsequently converted into continuous motion
embeddings that are further consumed by our standalone neural-based renderer to
generate the final photorealistic avatar animations. These streamlined
processes allow our framework to adapt to a variety of non-verbal avatar
interactions, both monadic and dyadic. Our extensive study, which includes
experiments on both newly compiled and existing datasets featuring two types of
agents -- one capable of monadic interaction with the environment, and the
other designed for dyadic conversation -- validates the effectiveness and
versatility of our approach. To our knowledge, we advanced a leap step by
combining LLMs and neural rendering for generalized non-verbal prediction and
photo-realistic rendering of avatar agents.
---------------
### 06 Jul 2023 | [Caption Anything: Interactive Image Description with Diverse Multimodal Controls](https://arxiv.org/abs/2305.02677) | [⬇️](https://arxiv.org/pdf/2305.02677)
*Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao*
Controllable image captioning is an emerging multimodal topic that aims to
describe the image with natural language following human purpose,
e.g., looking at the specified regions or telling in a particular
text style. State-of-the-art methods are trained on annotated pairs of input
controls and output captions. However, the scarcity of such well-annotated
multimodal data largely limits their usability and scalability for interactive
AI systems. Leveraging unimodal instruction-following foundation models is a
promising alternative that benefits from broader sources of data. In this
paper, we present Caption AnyThing (CAT), a foundation model augmented image
captioning framework supporting a wide range of multimodal controls: 1) visual
controls, including points, boxes, and trajectories; 2) language controls, such
as sentiment, length, language, and factuality. Powered by Segment Anything
Model (SAM) and ChatGPT, we unify the visual and language prompts into a
modularized framework, enabling the flexible combination between different
controls. Extensive case studies demonstrate the user intention alignment
capabilities of our framework, shedding light on effective user interaction
modeling in vision-language applications. Our code is publicly available at
https://github.com/ttengwang/Caption-Anything.
---------------
### 13 Jul 2023 | [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) | [⬇️](https://arxiv.org/pdf/2306.14824)
*Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei*
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and
grounding text to the visual world. Specifically, we represent refer
expressions as links in Markdown, i.e., "[text span](bounding boxes)", where
object descriptions are sequences of location tokens. Together with multimodal
corpora, we construct large-scale data of grounded image-text pairs (called
GrIT) to train the model. In addition to the existing capabilities of MLLMs
(e.g., perceiving general modalities, following instructions, and performing
in-context learning), Kosmos-2 integrates the grounding capability into
downstream applications. We evaluate Kosmos-2 on a wide range of tasks,
including (i) multimodal grounding, such as referring expression comprehension,
and phrase grounding, (ii) multimodal referring, such as referring expression
generation, (iii) perception-language tasks, and (iv) language understanding
and generation. This work lays out the foundation for the development of
Embodiment AI and sheds light on the big convergence of language, multimodal
perception, action, and world modeling, which is a key step toward artificial
general intelligence. Code and pretrained models are available at
https://aka.ms/kosmos-2.
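*Illustrative sketch (hedged, not from the paper):* the grounded Markdown format above can be pictured as ordinary link syntax whose target is a pair of location tokens. The quantization scheme and the `<loc_...>` token naming below are assumptions made for demonstration; Kosmos-2's actual tokenization may differ.
```python
def grounded_markdown(span, box, bins=32):
    # box = (x0, y0, x1, y1), coordinates normalized to [0, 1].
    # Quantize the top-left and bottom-right corners into discrete location tokens.
    def token(x, y):
        return f"<loc_{int(y * (bins - 1)) * bins + int(x * (bins - 1))}>"
    return f"[{span}]({token(box[0], box[1])}{token(box[2], box[3])})"

print(grounded_markdown("a snowman", (0.10, 0.20, 0.55, 0.90)))
# -> [a snowman](<loc_195><loc_881>)
```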
---------------
### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615)
*Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma*
Screen user interfaces (UIs) and infographics, sharing similar visual
language and design principles, play important roles in human communication and
human-machine interaction. We introduce ScreenAI, a vision-language model that
specializes in UI and infographics understanding. Our model improves upon the
PaLI architecture with the flexible patching strategy of pix2struct and is
trained on a unique mixture of datasets. At the heart of this mixture is a
novel screen annotation task in which the model has to identify the type and
location of UI elements. We use these text annotations to describe screens to
Large Language Models and automatically generate question-answering (QA), UI
navigation, and summarization training datasets at scale. We run ablation
studies to demonstrate the impact of these design choices. At only 5B
parameters, ScreenAI achieves new state-of-the-art results on UI- and
infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget
Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and
InfographicVQA) compared to models of similar size. Finally, we release three
new datasets: one focused on the screen annotation task and two others focused
on question answering.
---------------
### 23 Mar 2022 | [ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues](https://arxiv.org/abs/2203.12751) | [⬇️](https://arxiv.org/pdf/2203.12751)
*Monica S. Lam, Giovanni Campagna, Mehrad Moradshahi, Sina J. Semnani, Silei Xu*
Task-oriented conversational agents rely on semantic parsers to translate
natural language to formal representations. In this paper, we propose the
design and rationale of the ThingTalk formal representation, and how the design
improves the development of transactional task-oriented agents.
ThingTalk is built on four core principles: (1) representing user requests
directly as executable statements, covering all the functionality of the agent,
(2) representing dialogues formally and succinctly to support accurate
contextual semantic parsing, (3) standardizing types and interfaces to maximize
reuse between agents, and (4) allowing multiple, independently-developed agents
to be composed in a single virtual assistant. ThingTalk is developed as part of
the Genie Framework that allows developers to quickly build transactional
agents given a database and APIs.
We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST.
Compared to the others, the ThingTalk design is both more general and more
cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and
associated tools yields a new state of the art accuracy of 79% turn-by-turn.
---------------
### 19 Oct 2023 | [3D-GPT: Procedural 3D Modeling with Large Language Models](https://arxiv.org/abs/2310.12945) | [⬇️](https://arxiv.org/pdf/2310.12945)
*Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould*
In the pursuit of efficient automated content creation, procedural
generation, leveraging modifiable parameters and rule-based systems, emerges as
a promising approach. Nonetheless, it could be a demanding endeavor, given its
intricate nature necessitating a deep understanding of rules, algorithms, and
parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing
large language models (LLMs) for instruction-driven 3D modeling. 3D-GPT
positions LLMs as proficient problem solvers, dissecting the procedural 3D
modeling tasks into accessible segments and appointing the apt agent for each
task. 3D-GPT integrates three core agents: the task dispatch agent, the
conceptualization agent, and the modeling agent. They collaboratively achieve
two objectives. First, it enhances concise initial scene descriptions, evolving
them into detailed forms while dynamically adapting the text based on
subsequent instructions. Second, it integrates procedural generation,
extracting parameter values from enriched text to effortlessly interface with
3D software for asset creation. Our empirical investigations confirm that
3D-GPT not only interprets and executes instructions, delivering reliable
results but also collaborates effectively with human designers. Furthermore, it
seamlessly integrates with Blender, unlocking expanded manipulation
possibilities. Our work highlights the potential of LLMs in 3D modeling,
offering a basic framework for future advancements in scene generation and
animation.
---------------
### 04 Jul 2023 | [Embodied Task Planning with Large Language Models](https://arxiv.org/abs/2307.01848) | [⬇️](https://arxiv.org/pdf/2307.01848)
*Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan*
Equipping embodied agents with commonsense is important for robots to
successfully complete complex human instructions in general environments.
Recent large language models (LLM) can embed rich semantic knowledge for agents
in plan generation of complex tasks, while they lack the information about the
realistic world and usually yield infeasible action sequences. In this paper,
we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning
with physical scene constraint, where the agent generates executable plans
according to the existed objects in the scene by aligning LLMs with the visual
perception models. Specifically, we first construct a multimodal dataset
containing triplets of indoor scenes, instructions and action plans, where we
provide the designed prompts and the list of existing objects in the scene for
GPT-3.5 to generate a large number of instructions and corresponding planned
actions. The generated data is leveraged for grounded plan tuning of
pre-trained LLMs. During inference, we discover the objects in the scene by
extending open-vocabulary object detectors to multi-view RGB images collected
in different achievable locations. Experimental results show that the generated
plan from our TaPA framework can achieve higher success rate than LLaVA and
GPT-3.5 by a sizable margin, which indicates the practicality of embodied task
planning in general and complex environments.
---------------
### 18 Jan 2023 | [Joint Representation Learning for Text and 3D Point Cloud](https://arxiv.org/abs/2301.07584) | [⬇️](https://arxiv.org/pdf/2301.07584)
*Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang*
Recent advancements in vision-language pre-training (e.g. CLIP) have shown
that vision models can benefit from language supervision. While many models
using language modality have achieved great success on 2D vision tasks, the
joint representation learning of 3D point cloud with text remains
under-explored due to the difficulty of 3D-Text data pair acquisition and the
irregularity of 3D data structure. In this paper, we propose a novel Text4Point
framework to construct language-guided 3D point cloud models. The key idea is
utilizing 2D images as a bridge to connect the point cloud and the language
modalities. The proposed Text4Point follows the pre-training and fine-tuning
paradigm. During the pre-training stage, we establish the correspondence of
images and point clouds based on the readily available RGB-D data and use
contrastive learning to align the image and point cloud representations.
Together with the well-aligned image and text features achieved by CLIP, the
point cloud features are implicitly aligned with the text embeddings. Further,
we propose a Text Querying Module to integrate language information into 3D
representation learning by querying text embeddings with point cloud features.
For fine-tuning, the model learns task-specific 3D representations under
informative language guidance from the label set without 2D images. Extensive
experiments demonstrate that our model shows consistent improvement on various
downstream tasks, such as point cloud semantic segmentation, instance
segmentation, and object detection. The code will be available here:
https://github.com/LeapLabTHU/Text4Point
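*Illustrative sketch (hedged, not the authors' code):* the contrastive alignment step described above boils down to an InfoNCE-style objective over paired embeddings. The NumPy toy below uses random vectors and an arbitrary temperature purely to show the shape of the computation.
```python
import numpy as np

def info_nce(img_emb, pc_emb, temperature=0.07):
    # Rows are paired samples; L2-normalize and treat the diagonal as the positive pairs.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    pc = pc_emb / np.linalg.norm(pc_emb, axis=1, keepdims=True)
    logits = img @ pc.T / temperature
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
img, pc = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
print(info_nce(img, pc))  # lower values indicate better-aligned pairs
```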
---------------
### 01 Feb 2024 | [Executable Code Actions Elicit Better LLM Agents](https://arxiv.org/abs/2402.01030) | [⬇️](https://arxiv.org/pdf/2402.01030)
*Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji*
Large Language Model (LLM) agents, capable of performing a broad range of
actions, such as invoking tools and controlling robots, show great potential in
tackling real-world challenges. LLM agents are typically prompted to produce
actions by generating JSON or text in a pre-defined format, which is usually
limited by constrained action space (e.g., the scope of pre-defined tools) and
restricted flexibility (e.g., inability to compose multiple tools). This work
proposes to use executable Python code to consolidate LLM agents' actions into
a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct
can execute code actions and dynamically revise prior actions or emit new
actions upon new observations through multi-turn interactions. Our extensive
analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that
CodeAct outperforms widely used alternatives (up to 20% higher success rate).
The encouraging performance of CodeAct motivates us to build an open-source LLM
agent that interacts with environments by executing interpretable code and
collaborates with users using natural language. To this end, we collect an
instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn
interactions using CodeAct. We show that it can be used with existing data to
improve models in agent-oriented tasks without compromising their general
capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with
Python interpreter and uniquely tailored to perform sophisticated tasks (e.g.,
model training) using existing libraries and autonomously self-debug.
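*Illustrative sketch (hedged):* the "executable code as a unified action space" idea above can be pictured as a loop in which generated Python is executed and its output is fed back as the next observation. The `fake_llm` policy and the bare `exec` sandbox below are stand-ins for demonstration, not the CodeAct implementation.
```python
import io, contextlib

def fake_llm(history):
    # Stand-in policy; a real agent would prompt an LLM with the full interaction history.
    return "result = sum(range(10))" if not history else "print(result * 2)"

def run_action(code, state):
    # Execute one code action in a shared namespace and capture stdout as the observation.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, state)
    return buffer.getvalue().strip()

state, history = {}, []
for _ in range(2):
    action = fake_llm(history)
    history.append((action, run_action(action, state)))
print(history)  # [('result = sum(range(10))', ''), ('print(result * 2)', '90')]
```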
---------------
### 24 Jan 2024 | [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) | [⬇️](https://arxiv.org/pdf/2401.13649)
*Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried*
Autonomous agents capable of planning, reasoning, and executing actions on
the web offer a promising avenue for automating computer tasks. However, the
majority of existing benchmarks primarily focus on text-based agents,
neglecting many natural tasks that require visual information to effectively
solve. Given that most computer interfaces cater to human perception, visual
information often augments textual data in ways that text-only models struggle
to harness effectively. To bridge this gap, we introduce VisualWebArena, a
benchmark designed to assess the performance of multimodal web agents on
realistic *visually grounded tasks*. VisualWebArena comprises a set
of diverse and complex web-based tasks that evaluate various capabilities of
autonomous multimodal agents. To perform on this benchmark, agents need to
accurately process image-text inputs, interpret natural language instructions,
and execute actions on websites to accomplish user-defined objectives. We
conduct an extensive evaluation of state-of-the-art LLM-based autonomous
agents, including several multimodal models. Through extensive quantitative and
qualitative analysis, we identify several limitations of text-only LLM agents,
and reveal gaps in the capabilities of state-of-the-art multimodal language
agents. VisualWebArena provides a framework for evaluating multimodal
autonomous language agents, and offers insights towards building stronger
autonomous agents for the web. Our code, baseline models, and data is publicly
available at https://jykoh.com/vwa.
---------------
### 22 Feb 2018 | [Multimodal Named Entity Recognition for Short Social Media Posts](https://arxiv.org/abs/1802.07862) | [⬇️](https://arxiv.org/pdf/1802.07862)
*Seungwhan Moon, Leonardo Neves, Vitor Carvalho*
We introduce a new task called Multimodal Named Entity Recognition (MNER) for
noisy user-generated data such as tweets or Snapchat captions, which comprise
short text with accompanying images. These social media posts often come in
inconsistent or incomplete syntax and lexical notations with very limited
surrounding textual contexts, bringing significant challenges for NER. To this
end, we create a new dataset for MNER called SnapCaptions (Snapchat
image-caption pairs submitted to public and crowd-sourced stories with fully
annotated named entities). We then build upon the state-of-the-art Bi-LSTM
word/character based NER models with 1) a deep image network which incorporates
relevant visual context to augment textual information, and 2) a generic
modality-attention module which learns to attenuate irrelevant modalities while
amplifying the most informative ones to extract contexts from, adaptive to each
sample and token. The proposed MNER model with modality attention significantly
outperforms the state-of-the-art text-only NER models by successfully
leveraging provided visual contexts, opening up potential applications of MNER
on myriads of social media platforms.
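*Illustrative sketch (hedged, not the paper's architecture):* the modality-attention module above can be read as a learned softmax gate over per-modality features. The NumPy toy below hard-codes the gate logits just to show the mechanism.
```python
import numpy as np

def modality_attention(features, gate_logits):
    # features: modality name -> feature vector; gate_logits: unnormalized attention scores.
    names = list(features)
    weights = np.exp([gate_logits[n] for n in names])
    weights = weights / weights.sum()                      # softmax over modalities
    fused = sum(w * features[n] for w, n in zip(weights, names))
    return fused, dict(zip(names, np.round(weights, 2)))

feats = {"word": np.ones(4), "char": np.full(4, 0.5), "image": np.zeros(4)}
fused, gate = modality_attention(feats, {"word": 2.0, "char": 1.0, "image": -1.0})
print(gate)  # approximately {'word': 0.71, 'char': 0.26, 'image': 0.04}
```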
---------------
### 21 Sep 2023 | [You Only Look at Screens: Multimodal Chain-of-Action Agents](https://arxiv.org/abs/2309.11436) | [⬇️](https://arxiv.org/pdf/2309.11436)
*Zhuosheng Zhang, Aston Zhang*
Autonomous user interface (UI) agents aim to facilitate task automation by
interacting with the user interface without manual intervention. Recent studies
have investigated eliciting the capabilities of large language models (LLMs)
for effective engagement in diverse environments. To align with the
input-output requirement of LLMs, existing approaches are developed under a
sandbox setting where they rely on external tools and application-specific APIs
to parse the environment into textual elements and interpret the predicted
actions. Consequently, those approaches often grapple with inference
inefficiency and error propagation risks. To mitigate the challenges, we
introduce Auto-UI, a multimodal solution that directly interacts with the
interface, bypassing the need for environment parsing or reliance on
application-dependent APIs. Moreover, we propose a chain-of-action technique --
leveraging a series of intermediate previous action histories and future action
plans -- to help the agent decide what action to execute. We evaluate our
approach on a new device-control benchmark AITW with 30K unique instructions,
spanning multi-step tasks such as application operation, web searching, and web
shopping. Experimental results show that Auto-UI achieves state-of-the-art
performance with an action type prediction accuracy of 90% and an overall
action success rate of 74%. Code is publicly available at
https://github.com/cooelf/Auto-UI.
---------------
### 06 Jun 2023 | [LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models](https://arxiv.org/abs/2303.02927) | [⬇️](https://arxiv.org/pdf/2303.02927)
*Victor Dibia*
Systems that support users in the automatic creation of visualizations must
address several subtasks - understand the semantics of data, enumerate relevant
visualization goals and generate visualization specifications. In this work, we
pose visualization generation as a multi-stage generation problem and argue
that well-orchestrated pipelines based on large language models (LLMs) such as
ChatGPT/GPT-4 and image generation models (IGMs) are suitable to addressing
these tasks. We present LIDA, a novel tool for generating grammar-agnostic
visualizations and infographics. LIDA comprises of 4 modules - A SUMMARIZER
that converts data into a rich but compact natural language summary, a GOAL
EXPLORER that enumerates visualization goals given the data, a VISGENERATOR
that generates, refines, executes and filters visualization code and an
INFOGRAPHER module that yields data-faithful stylized graphics using IGMs. LIDA
provides a python api, and a hybrid user interface (direct manipulation and
multilingual natural language) for interactive chart, infographics and data
story generation. Learn more about the project here -
https://microsoft.github.io/lida/
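*Usage note (hedged):* the Python API mentioned above is documented in the project README; the snippet below follows that README as of the paper's release, so the exact call signatures may have changed, and the CSV path and model backend are placeholders.
```python
from lida import Manager, llm  # pip install lida; the OpenAI backend expects OPENAI_API_KEY

lida = Manager(text_gen=llm("openai"))
summary = lida.summarize("data/cars.csv")                 # SUMMARIZER: compact data description
goals = lida.goals(summary, n=2)                          # GOAL EXPLORER: candidate goals
charts = lida.visualize(summary=summary, goal=goals[0])   # VISGENERATOR: code + rendered chart
print(goals[0])
```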
---------------
### 16 Feb 2023 | [VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning](https://arxiv.org/abs/2211.15103) | [⬇️](https://arxiv.org/pdf/2211.15103)
*Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le*
Video paragraph captioning aims to generate a multi-sentence description of
an untrimmed video with several temporal event locations in coherent
storytelling. Following the human perception process, where the scene is
effectively understood by decomposing it into visual (e.g. human, animal) and
non-visual components (e.g. action, relations) under the mutual influence of
vision and language, we first propose a visual-linguistic (VL) feature. In the
proposed VL feature, the scene is modeled by three modalities including (i) a
global visual environment; (ii) local visual main agents; (iii) linguistic
scene elements. We then introduce an autoregressive Transformer-in-Transformer
(TinT) to simultaneously capture the semantic coherence of intra- and
inter-event contents within a video. Finally, we present a new VL contrastive
loss function to guarantee learnt embedding features are matched with the
captions semantics. Comprehensive experiments and extensive ablation studies on
ActivityNet Captions and YouCookII datasets show that the proposed
Visual-Linguistic Transformer-in-Transform (VLTinT) outperforms prior
state-of-the-art methods on accuracy and diversity. Source code is made
publicly available at: https://github.com/UARK-AICV/VLTinT.
---------------
### 04 Mar 2021 | [FAtiMA Toolkit -- Toward an effective and accessible tool for the development of intelligent virtual agents and social robots](https://arxiv.org/abs/2103.03020) | [⬇️](https://arxiv.org/pdf/2103.03020)
*Samuel Mascarenhas, Manuel Guimarães, Pedro A. Santos, João Dias, Rui Prada, Ana Paiva*
More than a decade has passed since the development of FearNot!, an
application designed to help children deal with bullying through role-playing
with virtual characters. It was also the application that led to the creation
of FAtiMA, an affective agent architecture for creating autonomous characters
that can evoke empathic responses. In this paper, we describe FAtiMA Toolkit, a
collection of open-source tools that is designed to help researchers, game
developers and roboticists incorporate a computational model of emotion and
decision-making in their work. The toolkit was developed with the goal of
making FAtiMA more accessible, easier to incorporate into different projects
and more flexible in its capabilities for human-agent interaction, based upon
the experience gathered over the years across different virtual environments
and human-robot interaction scenarios. As a result, this work makes several
different contributions to the field of Agent-Based Architectures. More
precisely, FAtiMA Toolkit's library based design allows developers to easily
integrate it with other frameworks, its meta-cognitive model affords different
internal reasoners and affective components and its explicit dialogue structure
gives control to the author even within highly complex scenarios. To
demonstrate the use of FAtiMA Toolkit, several different use cases where the
toolkit was successfully applied are described and discussed.
---------------
### 12 Sep 2022 | [emojiSpace: Spatial Representation of Emojis](https://arxiv.org/abs/2209.09871) | [⬇️](https://arxiv.org/pdf/2209.09871)
*Moeen Mostafavi, Mahsa Pahlavikhah Varnosfaderani, Fateme Nikseresht, Seyed Ahmad Mansouri*
In the absence of nonverbal cues during messaging communication, users
express part of their emotions using emojis. Thus, having emojis in the
vocabulary of text messaging language models can significantly improve many
natural language processing (NLP) applications such as online communication
analysis. On the other hand, word embedding models are usually trained on a
very large corpus of text such as Wikipedia or Google News datasets that
include very few samples with emojis. In this study, we create emojiSpace,
which is a combined word-emoji embedding using the word2vec model from the
Gensim library in Python. We trained emojiSpace on a corpus of more than 4
billion tweets and evaluated it by implementing sentiment analysis on a Twitter
dataset containing more than 67 million tweets as an extrinsic task. For this
task, we compared the performance of two different classifiers of random forest
(RF) and linear support vector machine (SVM). For evaluation, we compared
emojiSpace performance with two other pre-trained embeddings and demonstrated
that emojiSpace outperforms both.
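*Illustrative sketch (hedged):* a combined word-emoji embedding of the kind described above can be reproduced in miniature with Gensim's Word2Vec, provided emojis are kept as tokens. The corpus and hyperparameters below are toy placeholders; the actual model was trained on billions of tweets.
```python
from gensim.models import Word2Vec  # pip install gensim

# Toy stand-in for a tokenized tweet corpus that keeps emojis as vocabulary items.
corpus = [
    ["great", "game", "tonight", "🔥"],
    ["so", "happy", "today", "😀"],
    ["terrible", "service", "😡"],
    ["happy", "weekend", "😀", "🔥"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50, seed=1)
print(model.wv.most_similar("😀", topn=2))  # nearest word/emoji neighbours in the joint space
```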
---------------
### 27 Jan 2020 | [CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking](https://arxiv.org/abs/2001.07935) | [⬇️](https://arxiv.org/pdf/2001.07935)
*Grigori Fursin, Herve Guillou and Nicolas Essayan*
We present CodeReef - an open platform to share all the components necessary
to enable cross-platform MLOps (MLSysOps), i.e. automating the deployment of ML
models across diverse systems in the most efficient way. We also introduce the
CodeReef solution - a way to package and share models as non-virtualized,
portable, customizable and reproducible archive files. Such ML packages include
JSON meta description of models with all dependencies, Python APIs, CLI actions
and portable workflows necessary to automatically build, benchmark, test and
customize models across diverse platforms, AI frameworks, libraries, compilers
and datasets. We demonstrate several CodeReef solutions to automatically build,
run and measure object detection based on SSD-Mobilenets, TensorFlow and COCO
dataset from the latest MLPerf inference benchmark across a wide range of
platforms from Raspberry Pi, Android phones and IoT devices to data centers.
Our long-term goal is to help researchers share their new techniques as
production-ready packages along with research papers to participate in
collaborative and reproducible benchmarking, compare the different
ML/software/hardware stacks and select the most efficient ones on a Pareto
frontier using online CodeReef dashboards.
---------------
### 28 Feb 2024 | [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) | [⬇️](https://arxiv.org/pdf/2402.17553)
*Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov*
For decades, human-computer interaction has fundamentally been manual. Even
today, almost all productive work done on the computer necessitates human input
at every step. Autonomous virtual agents represent an exciting step in
automating many of these menial tasks. Virtual agents would empower users with
limited technical proficiency to harness the full possibilities of computer
systems. They could also enable the efficient streamlining of numerous computer
tasks, ranging from calendar management to complex travel bookings, with
minimal human intervention. In this paper, we introduce OmniACT, the
first-of-a-kind dataset and benchmark for assessing an agent's capability to
generate executable programs to accomplish computer tasks. Our scope extends
beyond traditional web automation, covering a diverse range of desktop
applications. The dataset consists of fundamental tasks such as "Play the next
song", as well as longer horizon tasks such as "Send an email to John Doe
mentioning the time and place to meet". Specifically, given a pair of screen
image and a visually-grounded natural language task, the goal is to generate a
script capable of fully executing the task. We run several strong baseline
language model agents on our benchmark. The strongest baseline, GPT-4, performs
the best on our benchmark. However, its performance level still reaches only 15%
of the human proficiency in generating executable scripts capable of completing
the task, demonstrating the challenge of our task for conventional web agents.
Our benchmark provides a platform to measure and evaluate the progress of
language model agents in automating computer tasks and motivates future work
towards building multimodal models that bridge large language models and the
visual grounding of computer screens.
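*Illustrative sketch (hedged):* the "executable program" an agent must produce for a task such as "Play the next song" is essentially a short GUI-automation script. The PyAutoGUI calls and coordinates below are made-up stand-ins, not a sample from the OmniACT dataset.
```python
import pyautogui  # pip install pyautogui

# Hypothetical script for the task "Play the next song" in a desktop music player.
# The coordinates are invented; a grounded agent would predict them from the screenshot.
pyautogui.moveTo(742, 1051, duration=0.2)  # move to the "next track" button
pyautogui.click()                          # advance to the next song
```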
---------------
### 24 Mar 2021 | [Proactive Interaction Framework for Intelligent Social Receptionist Robots](https://arxiv.org/abs/2012.04832) | [⬇️](https://arxiv.org/pdf/2012.04832)
*Yang Xue, Fan Wang, Hao Tian, Min Zhao, Jiangyong Li, Haiqing Pan and Yueqiang Dong*
Proactive human-robot interaction (HRI) allows the receptionist robots to
actively greet people and offer services based on vision, which has been found
to improve acceptability and customer satisfaction. Existing approaches are
either based on multi-stage decision processes or based on end-to-end decision
models. However, the rule-based approaches require sedulous expert efforts and
only handle minimal pre-defined scenarios. On the other hand, existing works
with end-to-end models are limited to very general greetings or few behavior
patterns (typically less than 10). To address those challenges, we propose a
new end-to-end framework, the TransFormer with Visual Tokens for Human-Robot
Interaction (TFVT-HRI). The proposed framework extracts visual tokens of
relative objects from an RGB camera first. To ensure the correct interpretation
of the scenario, a transformer decision model is then employed to process the
visual tokens, which is augmented with the temporal and spatial information. It
predicts the appropriate action to take in each scenario and identifies the
right target. Our data is collected from an in-service receptionist robot in an
office building, which is then annotated by experts for appropriate proactive
behavior. The action set includes 1000+ diverse patterns by combining language,
emoji expression, and body motions. We compare our model with other SOTA
end-to-end models on both offline test sets and online user experiments in
realistic office building environments to validate this framework. It is
demonstrated that the decision model achieves SOTA performance in action
triggering and selection, resulting in more humanness and intelligence when
compared with the previous reactive reception policies.
---------------
### 15 Mar 2023 | [Sustainable Cloud Services for Verbal Interaction with Embodied Agents](https://arxiv.org/abs/2203.02606) | [⬇️](https://arxiv.org/pdf/2203.02606)
*Lucrezia Grassi, Carmine Tommaso Recchiuto, Antonio Sgorbissa*
This article presents the design and the implementation of a cloud system for
knowledge-based autonomous interaction devised for Social Robots and other
conversational agents. The system is particularly convenient for low-cost
robots and devices: it can be used as a stand-alone dialogue system or as an
integration to provide "background" dialogue capabilities to any preexisting
Natural Language Processing ability that the robot may already have as part of
its basic skills. By connecting to the cloud, developers are provided with a
sustainable solution to manage verbal interaction through a network connection,
with about 3,000 topics of conversation ready for "chit-chatting" and a library
of pre-cooked plans that only needs to be grounded into the robot's physical
capabilities. The system is structured as a set of REST API endpoints so that
it can be easily expanded by adding new APIs to improve the capabilities of the
clients connected to the cloud. Another key feature of the system is that it
has been designed to make the development of its clients straightforward: in
this way, multiple robots and devices can be easily endowed with the capability
of autonomously interacting with the user, understanding when to perform
specific actions, and exploiting all the information provided by cloud
services. The article outlines and discusses the results of the experiments
performed to assess the system's performance in terms of response time, paving
the way for its use both for research and market solutions. Links to
repositories with clients for ROS and popular robots such as Pepper and NAO are
available on request.
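*Illustrative sketch (hedged):* because the system above is exposed as REST endpoints, a client interaction reduces to plain HTTP calls. The host, path, and payload fields below are hypothetical placeholders that only show the shape of such a client, not the service's real API.
```python
import requests

BASE_URL = "https://example-cloud-service.local/api"  # hypothetical host, not the real service

response = requests.post(
    f"{BASE_URL}/dialogue",                            # hypothetical endpoint name
    json={"session_id": "demo-001", "utterance": "Tell me something about music"},
    timeout=5,
)
response.raise_for_status()
print(response.json())                                 # e.g. the agent's reply and any plan to ground
```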
---------------
""")