A curated list of papers, projects, and resources for multi-modal Graphical User Interface (GUI) agents.
Build a digital assistant on your screen. Generated by DALL-E-3.
## Contributions Welcome!
🔥 This project is actively maintained, and we welcome your contributions. If you spot anything missing or incorrect, such as an omitted paper, please feel free to open an issue or submit a pull request.
🤖 Try our Awesome-Paper-Agent. Just provide an arXiv URL, and it will automatically return formatted information, like this:
User:
https://arxiv.org/abs/2312.13108
GPT:
+ [AssistGUI: Task-Oriented Desktop Graphical User Interface Automation](https://arxiv.org/abs/2312.13108) (Dec. 2023)
[![Star](https://img.shields.io/github/stars/showlab/assistgui.svg?style=social&label=Star)](https://github.com/showlab/assistgui)
[![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2312.13108)
[![Website](https://img.shields.io/badge/Website-9cf)](https://showlab.github.io/assistgui/)
You can then copy this formatted entry directly into your pull request.
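If you prefer to build such an entry yourself, the formatting is easy to reproduce locally. The sketch below is illustrative rather than the agent's actual implementation: it assumes only the public arXiv Atom API, the `format_entry` helper is a hypothetical name, and since arXiv metadata does not include a project's GitHub repository or homepage, only the arXiv badge is generated.

```python
# Minimal sketch (not the agent's code): fetch arXiv metadata and
# format it as an entry for this list. Assumes the public arXiv Atom API.
import re
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the API
MONTHS = ["Jan.", "Feb.", "Mar.", "Apr.", "May.", "Jun.",
          "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."]

def format_entry(arxiv_url: str) -> str:  # hypothetical helper name
    """Turn an arXiv abs URL into a markdown list entry plus an arXiv badge."""
    arxiv_id = arxiv_url.rstrip("/").split("/")[-1]
    query = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(query) as resp:
        entry = ET.fromstring(resp.read()).find(f"{ATOM}entry")
    # Titles returned by the API may contain line breaks; collapse whitespace.
    title = re.sub(r"\s+", " ", entry.find(f"{ATOM}title").text).strip()
    published = entry.find(f"{ATOM}published").text  # e.g. "2023-12-20T..."
    date = f"{MONTHS[int(published[5:7]) - 1]} {published[:4]}"
    abs_url = f"https://arxiv.org/abs/{arxiv_id}"
    return (
        f"+ [{title}]({abs_url}) ({date})\n"
        f"  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)]({abs_url})"
    )

print(format_entry("https://arxiv.org/abs/2312.13108"))
```

Running this on the URL above reproduces the AssistGUI entry shown earlier, minus the Star and Website badges, which need repository and homepage links that you would add by hand.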
⭐ If you find this repository useful, please give it a star.
Quick Navigation: [Datasets / Benchmarks](#datasets--benchmarks) [Models / Agents](#models--agents) [Surveys](#surveys) [Projects](#projects)
## Datasets / Benchmarks

- World of Bits: An Open-Domain Platform for Web-Based Agents (Aug. 2017, ICML 2017)
- A Unified Solution for Structured Web Data Extraction (Jul. 2011, SIGIR 2011)
- Rico: A Mobile App Dataset for Building Data-Driven Design Applications (Oct. 2017)
- Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration (Feb. 2018, ICLR 2018)
- Mapping Natural Language Instructions to Mobile UI Action Sequences (May. 2020, ACL 2020)
- WebSRC: A Dataset for Web-Based Structural Reading Comprehension (Jan. 2021, EMNLP 2021)
- AndroidEnv: A Reinforcement Learning Platform for Android (May. 2021)
- A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility (Feb. 2022)
- META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI (May. 2022)
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents (Jul. 2022)
- Language Models can Solve Computer Tasks (Mar. 2023)
- Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction (May. 2023)
- Mind2Web: Towards a Generalist Agent for the Web (Jun. 2023)
- Android in the Wild: A Large-Scale Dataset for Android Device Control (Jul. 2023)
- WebArena: A Realistic Web Environment for Building Autonomous Agents (Jul. 2023)
- Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models (Nov. 2023)
- AssistGUI: Task-Oriented Desktop Graphical User Interface Automation (Dec. 2023, CVPR 2024)
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (Jan. 2024, ACL 2024)
- OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web (Feb. 2024)
- WebLINX: Real-World Website Navigation with Multi-Turn Dialogue (Feb. 2024)
- On the Multi-turn Instruction Following for Conversational Web Agents (Feb. 2024)
- AgentStudio: A Toolkit for Building General Virtual Agents (Mar. 2024)
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Apr. 2024)
- Benchmarking Mobile Device Control Agents across Diverse Configurations (Apr. 2024, ICLR 2024)
- MMInA: Benchmarking Multihop Multimodal Internet Agents (Apr. 2024)
- Autonomous Evaluation and Refinement of Digital Agents (Apr. 2024)
- LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation (Apr. 2024)
- VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? (Apr. 2024)
- GUICourse: From General Vision Language Models to Versatile GUI Agents (Jun. 2024)
- GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents (Jun. 2024)
- GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices (Jun. 2024)
- VideoGUI: A Benchmark for GUI Automation from Instructional Videos (Jun. 2024)
- Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding (Jun. 2024)
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents (Jun. 2024)
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents (Jun. 2024)
- Practical, Automated Scenario-based Mobile App Testing (Jun. 2024)
- WebCanvas: Benchmarking Web Agents in Online Environments (Jun. 2024)
- On the Effects of Data Scale on Computer Control Agents (Jun. 2024)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents (Jul. 2024)
- WebVLN: Vision-and-Language Navigation on Websites (AAAI 2024)
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? (Jul. 2024)
- AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
- Harnessing Webpage UIs for Text-Rich Visual Understanding (Oct. 2024)
- GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent (Dec. 2024)
## Models / Agents

- Grounding Open-Domain Instructions to Automate Web Support Tasks (Mar. 2021)
- Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning (Aug. 2021)
- A Data-Driven Approach for Learning to Control Computers (Feb. 2022)
- Augmenting Autotelic Agents with Large Language Models (May. 2023)
- Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control (Jun. 2023, ICLR 2024)
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis (Jul. 2023, ICLR 2024)
- LASER: LLM Agent with State-Space Exploration for Web Navigation (Sep. 2023)
- CogAgent: A Visual Language Model for GUI Agents (Dec. 2023, CVPR 2024)
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (Feb. 2024)
- UFO: A UI-Focused Agent for Windows OS Interaction (Feb. 2024)
- Comprehensive Cognitive LLM Agent for Smartphone GUI Automation (Feb. 2024)
- Improving Language Understanding from Screenshots (Feb. 2024)
- AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent (Apr. 2024, KDD 2024)
- SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models (May. 2023, NeurIPS 2023)
- You Only Look at Screens: Multimodal Chain-of-Action Agents (Sep. 2023)
- Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API (Oct. 2023)
- OpenAgents: An Open Platform for Language Agents in the Wild (Oct. 2023)
- GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation (Nov. 2023)
- AppAgent: Multimodal Agents as Smartphone Users (Dec. 2023)
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (Jan. 2024, ACL 2024)
- GPT-4V(ision) is a Generalist Web Agent, if Grounded (Jan. 2024, ICML 2024)
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (Jan. 2024)
- Dual-View Visual Contextualization for Web Navigation (Feb. 2024, CVPR 2024)
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (Jun. 2024)
- Visual Grounding for User Interfaces (NAACL 2024)
- ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (Feb. 2024)
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Feb. 2024)
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Apr. 2024)
- Octopus: On-device language model for function calling of software APIs (Apr. 2024)
- Octopus v2: On-device language model for super agent (Apr. 2024)
- Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent (Apr. 2024)
- Octopus v4: Graph of language models (Apr. 2024)
- Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning (Apr. 2024)
- Enhancing Mobile "How-to" Queries with Automated Search Results Verification and Reranking (Apr. 2024, SIGIR 2024)
- Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation (Dec. 2023, MobiCom 2024)
- Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study (Mar. 2024)
- Android in the Zoo: Chain-of-Action-Thought for GUI Agents (Mar. 2024)
- GUI Action Narrator: Where and When Did That Action Take Place? (Jun. 2024)
- Identifying User Goals from UI Trajectories (Jun. 2024)
- VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning (Jun. 2024)
- Octo-planner: On-device Language Model for Planner-Action Agents (Jun. 2024)
- E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion (Jun. 2024)
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (Jun. 2024)
- MobileFlow: A Multimodal LLM For Mobile GUI Agent (Jul. 2024)
- Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model (Jul. 2024)
- Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence (Jul. 2024)
- MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices (Jul. 2024)
- AUITestAgent: Automatic Requirements Oriented GUI Function Testing (Jul. 2024)
- Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems (Jul. 2024)
- OmniParser for Pure Vision Based GUI Agent (Aug. 2024)
- VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents (Aug. 2024)
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (Aug. 2024)
- MindSearch: Mimicking Human Minds Elicits Deep AI Searcher (Jul. 2024)
- AppAgent v2: Advanced Agent for Flexible Mobile Interactions (Aug. 2024)
- Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions (Aug. 2024)
- Agent Workflow Memory (Sep. 2024)
- MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding (Sep. 2024)
- Agent S: An Open Agentic Framework that Uses Computers Like a Human (Oct. 2024)
- MobA: A Two-Level Agent System for Efficient Mobile Task Automation (Oct. 2024)
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (Oct. 2024)
- OS-ATLAS: A Foundation Action Model For Generalist GUI Agents (Oct. 2024)
- Attacking Vision-Language Computer Agents via Pop-ups (Nov. 2024)
- AutoGLM: Autonomous Foundation Agents for GUIs (Nov. 2024)
- AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations (Nov. 2024)
- ShowUI: One Vision-Language-Action Model for Generalist GUI Agent (Nov. 2024)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction (Dec. 2024)
- Falcon-UI: Understanding GUI Before Following User Instructions (Dec. 2024)
- PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World (Dec. 2024)
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining (Dec. 2024)
- Aria-UI: Visual Grounding for GUI Instructions (Dec. 2024)
## Surveys

- GUI Agents with Foundation Models: A Comprehensive Survey (Nov. 2024)
- Large Language Model-Brained GUI Agents: A Survey (Nov. 2024)
- GUI Agents: A Survey (Dec. 2024)
## Projects

- GPT-4V-Act: AI agent using GPT-4V(ision) for web UI interaction
- Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
- LaVague: Large Action Model Framework to Develop AI Web Agents
- OpenAdapt: AI-First Process Automation with Large Multimodal Models
- Surfkit: A toolkit for building and sharing AI agents that operate on devices
- WebMarker: Mark web pages for use with vision-language models
## Safety

- AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents
- MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control
- EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage
- Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents
- Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions
- Security Matrix for Multimodal Agents on Mobile Devices: A Systematic and Proof of Concept Study
## Related Repositories

- awesome-llm-powered-agent
- Awesome-LLM-based-Web-Agent-and-Tools
- awesome-ui-agents
- computer-control-agent-knowledge-base
- Awesome GUI Agent Paper List
This template is provided by Awesome-Video-Diffusion and Awesome-MLLM-Hallucination.