UI-TARS-1.5 Model

In our blog, we shared the latest progress on UI-TARS-1.5, a model that excels at playing games and performing GUI tasks.

Introduction

UI-TARS-1.5 is an open-source multimodal agent built upon a powerful vision-language model. It can effectively perform diverse tasks within virtual worlds.

Leveraging the foundational architecture introduced in our recent paper, UI-TARS-1.5 integrates advanced reasoning enabled by reinforcement learning. This allows the model to reason before taking action, which significantly enhances its performance and adaptability, particularly in inference-time scaling. The 1.5 version achieves state-of-the-art results across a variety of standard benchmarks, demonstrating strong reasoning capabilities and notable improvements over prior models.

Code: https://github.com/bytedance/UI-TARS

Application: https://github.com/bytedance/UI-TARS-desktop

Performance

Online Benchmark Evaluation

Benchmark type	Benchmark	UI-TARS-1.5	OpenAI CUA	Claude 3.7	Previous SOTA
Computer Use	OSWorld (100 steps)	42.5	36.4	28	38.1 (200 steps)
	Windows Agent Arena (50 steps)	42.1	-	-	29.8
Browser Use	WebVoyager	84.8	87	84.1	87
	Online-Mind2Web	75.8	71	62.9	71
Phone Use	Android World	64.2	-	-	59.5

Grounding Capability Evaluation

Benchmark	UI-TARS-1.5	OpenAI CUA	Claude 3.7	Previous SOTA
ScreenSpot-V2	94.2	87.9	87.6	91.6
ScreenSpotPro	61.6	23.4	27.7	43.6

Poki Game

Model	2048	energy	free-the-key	Gem-11	hex-frvr	Infinity-Loop	Maze:Path-of-Light	shapes	snake-solver	wood-blocks-3d	yarn-untangle	laser-maze-puzzle	tiles-master
OpenAI CUA	31.04	32.80	0.00	46.27	92.25	23.08	35.00	52.18	42.86	2.02	44.56	80.00	78.27
Claude 3.7	43.05	41.60	0.00	0.00	30.76	2.31	82.00	6.26	42.86	0.00	13.77	28.00	52.18
UI-TARS-1.5	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00

Minecraft

Task Type	Task Name	VPT	DreamerV3	Previous SOTA	UI-TARS-1.5 w/o Thought	UI-TARS-1.5 w/ Thought
Mine Blocks	(oak_log)	0.8	1.0	1.0	1.0	1.0
	(obsidian)	0.0	0.0	0.0	0.2	0.3
	(white_bed)	0.0	0.0	0.1	0.4	0.6
	200 Tasks Avg.	0.06	0.03	0.32	0.35	0.42
Kill Mobs	(mooshroom)	0.0	0.0	0.1	0.3	0.4
	(zombie)	0.4	0.1	0.6	0.7	0.9
	(chicken)	0.1	0.0	0.4	0.5	0.6
	100 Tasks Avg.	0.04	0.03	0.18	0.25	0.31

Model Scale Comparison

This table compares performance across different model scales of UI-TARS on the OSWorld and ScreenSpotPro benchmarks.

Benchmark Type	Benchmark	UI-TARS-72B-DPO	UI-TARS-1.5-7B	UI-TARS-1.5
Computer Use	OSWorld	24.6	27.5	42.5
GUI Grounding	ScreenSpotPro	38.1	49.6	61.6
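
To make the scale comparison concrete, the point gaps between models can be computed directly from the table above. The snippet below is only an illustrative sketch: the `scores` dictionary and the `gap` helper are hypothetical names introduced here, and the numbers are copied from the table rather than produced by any evaluation code.

```python
# Scores copied from the Model Scale Comparison table above.
# `scores` and `gap` are illustrative names, not part of any UI-TARS API.
scores = {
    "OSWorld": {
        "UI-TARS-72B-DPO": 24.6,
        "UI-TARS-1.5-7B": 27.5,
        "UI-TARS-1.5": 42.5,
    },
    "ScreenSpotPro": {
        "UI-TARS-72B-DPO": 38.1,
        "UI-TARS-1.5-7B": 49.6,
        "UI-TARS-1.5": 61.6,
    },
}


def gap(benchmark: str, model: str, baseline: str) -> float:
    """Absolute score difference, in points, between two models on one benchmark."""
    return round(scores[benchmark][model] - scores[benchmark][baseline], 1)


for benchmark in scores:
    # Gap of the released 7B model over the earlier 72B-DPO model,
    # then gap of the full UI-TARS-1.5 model over the 7B model.
    print(
        benchmark,
        gap(benchmark, "UI-TARS-1.5-7B", "UI-TARS-72B-DPO"),
        gap(benchmark, "UI-TARS-1.5", "UI-TARS-1.5-7B"),
    )
```

Read this way, UI-TARS-1.5-7B is 2.9 points above UI-TARS-72B-DPO on OSWorld and 11.5 points above it on ScreenSpotPro, while UI-TARS-1.5 is a further 15.0 and 12.0 points above the 7B model, respectively.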

The released UI-TARS-1.5-7B focuses primarily on enhancing general computer use capabilities and is not specifically optimized for game-based scenarios, where UI-TARS-1.5 still holds a significant advantage.

What's next

We are providing early research access to our top-performing UI-TARS-1.5 model to facilitate collaborative research. Interested researchers can contact us at [email protected].

Citation

If you find our paper and model useful in your research, please consider citing our work.

@article{qin2025ui,
  title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
  author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
  journal={arXiv preprint arXiv:2501.12326},
  year={2025}
}