Windows Desktop Automation for AI Agents

Chris Tsang · 7 min read

Browser automation for AI agents is a solved problem; the Windows desktop is not. That's why we built UI Automata.

In the demo above, Claude installs Python and Git on a fresh Windows machine. For Python, it opens the Windows Store, searches, picks Python 3.13 over 3.12, clicks Get, and waits for the installation to complete. For Git, it navigates to the official website in Edge, downloads the installer, pauses for UAC confirmation, runs the installer silently, then launches Git Bash and verifies the installation works, falling back to vision to read the terminal output where UIA has no coverage.

It showcases how an AI agent can move across desktop apps, browsers, and terminals to handle complex tasks.

Why Windows Desktop Automation Is Hard

The Windows desktop is genuinely difficult to automate programmatically. Unlike the web (where the DOM is structured and designed to be read by code), the Windows desktop is a patchwork of UI frameworks built over decades: Win32, MFC, WPF, UWP, WinUI 3, embedded web views, custom renderers. Each one exposes its internals differently. Dialogs pop up unexpectedly. Apps behave differently across OS versions, display scaling, and language packs. There is no single standard to rely on.

Vision-based computer use is a compelling answer to this complexity. Rather than trying to understand every framework, you just look at the screen. For tasks that require genuine visual reasoning it remains the right tool.

But for automation at any scale, it carries real costs. Each step is a round-trip to an inference API. Pixel coordinates shift when the window moves or the display resolution changes. And a sequence of screenshots is not an auditable record: when something goes wrong, there is no structured trace to diagnose.

UI Automata takes a complementary approach: use the semantic layer that is already there.

Automata Workflow: Shell Scripting for Windows GUI

Think of a workflow YAML file as the shell script for the Windows desktop. Where a shell script says "run this command, check the exit code, pipe the output to the next step", a workflow YAML says "click this button, wait for this dialog, extract this value."

- intent: click the Save menu item
  action:
    type: Click
    scope: notepad
    selector: ">> [role=menu item][name=Save]"
  expect:
    type: DialogPresent
    scope: notepad
    timeout: 10s

Every step has three parts: an action (what to do), an expect (what state the UI should be in afterwards), and optionally a recovery (what to do if something unexpected happens). The engine executes the action, then watches the UI until the expected state appears. If it does not appear in time, it checks recovery handlers before failing cleanly.

No sleeps. No guessing. No silent failures.
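As a sketch of the optional third part, a recovery handler might look like the following. The `recovery`, `when`, and `name` fields here are illustrative assumptions about the schema, not necessarily the shipped format; the `action`/`expect` shape follows the example above.

```yaml
- intent: click Save and dismiss an update prompt if one appears
  action:
    type: Click
    scope: notepad
    selector: ">> [role=menu item][name=Save]"
  expect:
    type: DialogPresent
    scope: notepad
    timeout: 10s
  recovery:
    # Hypothetical handler: if an unrelated dialog intercepts the flow,
    # dismiss it and let the engine retry the expectation.
    - when:
        type: DialogPresent
        scope: notepad
        name: Update Available
      action:
        type: Click
        scope: notepad
        selector: ">> [role=button][name=Later]"
```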

Selectors: CSS for Windows UI

If the workflow YAML is the shell script, selectors are the file paths. They are how you point to the exact UI element you want to act on.

The Windows UI element tree is messy: a typical app window has hundreds of nested elements, many with no name or identical-looking labels. A single property is rarely enough to pin down the right one. We built a CSS-like selector language that lets you combine every available signal into a precise address:

>> [role=button][name=Save]                  # role AND name
>> [id=TabListView] > [role=tab item]:first  # first tab in a specific list
>> [role=button][name=Settings]:parent       # the container holding this button
>> [role=dialog][name='Confirm Save As'] >> [role=button][name=Yes]  # button inside a specific dialog

Crucially, selectors target semantic properties (what an element is) not pixel coordinates (where it happens to be drawn). They survive window resizes, display scaling changes, theme changes, and most app updates. A selector that worked last week works today.

The Shadow DOM: React in Reverse

React builds a virtual DOM to avoid touching the expensive real DOM on every update. UI Automata does the same for Windows UIA, but in reverse: instead of batching writes down to the real tree, it caches what it reads from it.

Querying Windows UI Automation naively is slow: every element lookup is a cross-process round-trip, like making a network request just to read a variable. A 20-step workflow that re-queries from scratch on every step spends most of its time waiting, not working.

UI Automata solves this with what we call the shadow DOM: a cached map of the live UI. The engine resolves each element once and holds onto it. Subsequent steps that reference the same element are nearly instant.

The shadow DOM also solves identity. On first access, the engine locks the OS-level window handle (HWND). Subsequent resolutions bypass the selector entirely and go directly to that handle, so title changes, focus shifts, and other windows opening mid-flow cannot cause drift. Staleness is detected on every element access via a liveness check, not on a timer. When a cached element goes stale, the engine tries to re-resolve from its cached parent before falling back to a full tree traversal, so a single button refresh is a narrow subtree scan, not a full window walk.
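The resolve-once, re-resolve-narrowly idea can be sketched in a few lines of Python. Everything here is illustrative (the class names, the `FakeTree` stand-in, and the handle strings are not UI Automata's actual API); it only demonstrates the caching strategy described above.

```python
class FakeTree:
    """Stand-in for the live UIA tree; real lookups are cross-process round-trips."""
    def __init__(self, root="root"):
        self.root = root
        self.alive = set()       # handles that are still live
        self.elements = {}       # (scope, selector) -> handle
        self.full_walks = 0      # how often we fell back to a root traversal

    def is_alive(self, handle):
        return handle in self.alive

    def query(self, scope, selector):
        if scope == self.root:
            self.full_walks += 1
        return self.elements.get((scope, selector))


class CachedElement:
    """Resolve once, reuse the handle, re-resolve narrowly when it goes stale."""
    def __init__(self, selector, parent=None):
        self.selector = selector
        self.parent = parent     # cached parent for narrow re-resolution
        self.handle = None

    def resolve(self, tree):
        # Fast path: liveness is checked on every access, not on a timer.
        if self.handle is not None and tree.is_alive(self.handle):
            return self.handle
        # Stale: scan only the cached parent's subtree first.
        if self.parent is not None:
            found = tree.query(self.parent.resolve(tree), self.selector)
            if found is not None:
                self.handle = found
                return found
        # Last resort: full traversal from the root.
        self.handle = tree.query(tree.root, self.selector)
        return self.handle
```

With this shape, a button whose handle dies triggers only a subtree scan under its cached window; the full root walk happens once, when the window itself is first resolved.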

Cross-Framework Coverage

Windows applications are not a single thing. A modern machine might run a Win32 app from 1998 alongside a WinUI 3 app from last year, a legacy MFC tool, and a web browser, all at once.

UI Automata covers all of them: Win32, MFC, WPF, UWP, WinUI 3, and web browsers via the Chrome DevTools Protocol, giving full access to page structure alongside native UI.

For applications with little UIA support, the vision MCP tool provides OCR and layout recognition; vision and structured automation complement each other in the same agent loop.

Built for AI Agents

UI Automata is not a scripting framework repurposed for agents. It is designed from the ground up as the interface between an agent and the Windows desktop.

Every step carries an intent field: a plain-English description of what it is trying to do. The engine logs every step with its outcome, giving the agent a full structured trace to read and reason about. When something goes wrong, the trace tells the agent exactly what state the UI was in at the moment of failure.

The MCP interface gives an agent everything it needs:

  • desktop: inspect live element trees and test selectors against any window
  • app: list installed applications and launch them
  • browser: CDP page inspection and navigation
  • workflow: run a workflow file, receive structured output
  • vision: OCR and layout for windows that lack UIA support
  • Schema and linter: validate workflow files before execution, so agents catch errors without running

An agent can explore an unfamiliar UI, author steps interactively, run them immediately, and promote working steps into a reusable workflow file. The entire loop (explore, script, run, verify) happens in a conversation. No human needed to laboriously demonstrate every step.
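The promote-working-steps part of that loop can be sketched as follows. The MCP client, its call signature, and the step payloads are all stand-ins (not UI Automata's actual wire format); the sketch only shows the verify-then-promote pattern.

```python
# Illustrative explore -> script -> run -> verify loop.
def promote_working_steps(mcp, drafted_steps):
    """Run each drafted step via the workflow tool; keep only the ones that pass."""
    workflow = []
    for step in drafted_steps:
        result = mcp("workflow", {"run": [step]})  # run one step in isolation
        if result.get("ok"):                       # verify via structured output
            workflow.append(step)                  # promote into the reusable file
    return workflow

# Stand-in MCP client: pretends any selector containing "bad" fails to resolve.
def fake_mcp(tool, payload):
    step = payload["run"][0]
    return {"ok": "bad" not in step["selector"]}

steps = [
    {"intent": "open File menu", "selector": ">> [role=menu item][name=File]"},
    {"intent": "broken step",    "selector": ">> [name=bad]"},
]
print(promote_working_steps(fake_mcp, steps))
```

Only the step that survives a real run makes it into the workflow file; the failing one stays behind with its structured failure trace for the agent to reason about.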

Industrial-Grade Applications

Professional desktop software (CAD tools, ERP systems, simulation suites) is where most automation approaches hit a wall. These apps have deeply nested UI structures, lists where off-screen items are invisible to standard queries, and toolbars where dozens of buttons look identical to anything but their internal ID.

UI Automata handles these because it was designed with them in mind. The Invoke action activates elements directly through the accessibility interface, bypassing the need for a visible bounding box entirely. The :parent and :ancestor navigators let you locate a container by identifying a landmark element inside it: the pattern you need when the row you want to click has no unique label of its own.
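For example, a row with no unique label of its own can be addressed through a uniquely named cell inside it, using the same selector notation as above (the id and cell name here are made up for illustration):

```
>> [id=PartsList] >> [name='Bearing-6204']:parent   # the row containing this cell
```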

What We Are Releasing Today

We are releasing the initial build of UI Automata today, including:

  • The workflow engine and YAML format
  • The MCP server (automata-agent) for Claude Code and Claude Desktop
  • ui-inspector and other CLI tools for interactive UI exploration
  • The workflow library: reusable workflows for common Windows applications
  • Comprehensive documentation at automata.visioncortex.org
  • Source code at visioncortex/ui-automata

Try it, break it, and tell us what you run into. We are building the workflow library based on what people actually need. Open an issue or start a discussion on GitHub.