lahfir, I vouched for your (currently still dead) comment because it was interesting to me.
I expect the reason it is dead is that it seems LLM-generated (you "quietly" launched it on github? Who says that?).
Also, your comment claims that the tool is cross-platform and implies that it works on Mac, Windows, and Linux, but the graphic in the GitHub README says it only works on Mac.
It looks hybrid human/LLM at best, but it's definitely possible that it's mostly human, from someone who is earnestly learning how to use "pitch" language. I got the feeling that some parts, like the bullet points, may have originated from AI-generated documentation or READMEs.
My intuition tells me it could have been AI-generated, but if so it was heavily edited by a human, and I think anyone who went through it that carefully would have changed other things as well. That's why I suspect it's human writing "coded" as a pseudo-artificial pitch, with some (mostly lightly edited) copy/paste of AI bullet points.
Then again, I can't find snippets of this language in the repo, so maybe I'm losing my discernment as LLMs advance (as well as the humans who are learning how to use them).
I don't think the accessibility story on Linux is comprehensive enough to make this possible, unfortunately, especially with Wayland. One advantage Mac apps have is that they're all targeting the same underlying OS primitives, which is the layer the accessibility platform lives at.
I've been building computer-use tools for a while, and I quietly launched this about a month ago (122 stars on GitHub). I figured it was worth sharing here.
Over the last few months, a lot of computer-use agents have come out: Codex, Claude Code, CUA, and others. Most of them seem to work roughly like this:
1. Take a screenshot
2. Have the model predict pixel coordinates
3. Click x,y
4. Take another screenshot
5. Repeat
That works, but it's slow, expensive in tokens, and fragile. If the UI shifts a few pixels, things break. And the model still doesn't know what any element actually is.
But the OS already exposes structured UI information:
- macOS: Accessibility API
- Windows: UI Automation
- Linux: AT-SPI
Screen readers have used these APIs for years. On the web, Playwright beat screenshot scraping for the same reason: structured access is just a better abstraction than pixels.
So I built a desktop equivalent: agent-desktop.
It's a cross-platform CLI for structured desktop automation through the accessibility tree. One Rust binary, about 15 MB, no runtime dependencies. It exposes 53 commands with JSON output, so an LLM can inspect and operate native apps without screenshots or vision models. Inspired by agent-browser by Vercel Labs.
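A typical loop looks like this (illustrative only: the snapshot command and the --app and --root flags appear in this post, but the exact sequence here is my assumption, not a transcript of the tool):

    agent-desktop snapshot --app Slack
    (the model reads the JSON skeleton and picks a subtree ref)
    agent-desktop snapshot --app Slack --root @e3
    (the model acts on a concrete element ref instead of x,y coordinates)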
The main design problem was context size. A naive approach would dump the full accessibility tree into the model, but real apps get huge. Slack can easily exceed 50,000 tokens for a full tree dump, which makes the approach impractical.
The approach I ended up using is progressive skeleton traversal (rough sketch after the list):
- First pass: return a shallow tree, typically depth 3, with deeper containers truncated and annotated with children_count
- Named containers get references so the agent can request only that subtree
- The agent drills down into the relevant region with --root @e3
- References are scoped and invalidated only for that subtree
- After acting, the agent can re-query just that region instead of re-snapshotting the whole app
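For concreteness, here is a minimal Rust sketch of the truncation step, assuming a generic UiNode tree; the type and function names are mine, not agent-desktop's actual internals:

    // Illustrative sketch only; UiNode, Skeleton, and skeletonize are
    // assumed names, not the real crate's API.
    struct UiNode {
        role: String,
        name: Option<String>,
        children: Vec<UiNode>,
    }

    enum Skeleton {
        // Fully expanded node within the depth budget.
        Node { role: String, name: Option<String>, children: Vec<Skeleton> },
        // Deep container cut off on the first pass; the agent can
        // re-request just this subtree later via its ref (e.g. @e3).
        Truncated { role: String, r#ref: String, children_count: usize },
    }

    fn skeletonize(node: &UiNode, depth: usize, max_depth: usize, next_ref: &mut u32) -> Skeleton {
        if depth >= max_depth && !node.children.is_empty() {
            *next_ref += 1;
            return Skeleton::Truncated {
                role: node.role.clone(),
                r#ref: format!("@e{next_ref}"),
                children_count: node.children.len(),
            };
        }
        Skeleton::Node {
            role: node.role.clone(),
            name: node.name.clone(),
            children: node.children.iter()
                .map(|c| skeletonize(c, depth + 1, max_depth, next_ref))
                .collect(),
        }
    }

If I read the list above right, handing out refs only at truncation points is what keeps invalidation scoped to a subtree and makes the --root drill-down cheap.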
In practice, this reduced token usage by about 78% to 96% versus full-tree dumps in Electron apps like Slack, VS Code, and Notion.
A few implementation details that may be interesting here (a rough sketch of the adapter trait and the C ABI follows the list):
- Rust workspace with strict platform/core separation through a PlatformAdapter trait
- Accessibility-first activation chain; mouse synthesis is the fallback, not the default
- Deterministic element refs like @e1, @e2, with optimistic re-identification across UI shifts
- Structured errors with machine-readable codes plus retry suggestions
- C ABI via cdylib, so it can be loaded directly from Python, Swift, Go, Node, Ruby, or C without shelling out
- Batch operations in a single call
- Support for windows, menus, sheets, popovers, alerts, and notifications
- Special handling for Chromium/Electron accessibility trees, which can get very deep and noisy
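As a loose sketch of what the platform split and the C ABI bullets could look like in code (all names here are hypothetical, not the repo's real API):

    // Assumed shape of the core/platform boundary; illustrative only.
    struct AdapterError {
        code: String,               // machine-readable error code
        retry_hint: Option<String>, // suggested recovery step
    }

    trait PlatformAdapter {
        // Shallow structured snapshot, deepened on demand via a root ref.
        fn snapshot(&self, app: &str, root: Option<&str>, max_depth: usize)
            -> Result<String, AdapterError>;
        // Accessibility-first action; mouse synthesis only as a fallback.
        fn activate(&self, element_ref: &str) -> Result<(), AdapterError>;
    }

    // A cdylib export like this is what would let Python/Go/Node load the
    // library directly instead of shelling out (signature is assumed).
    #[no_mangle]
    pub extern "C" fn ad_snapshot(app: *const std::os::raw::c_char)
        -> *mut std::os::raw::c_char {
        // ...marshal the C string, dispatch to the active adapter,
        // return a JSON string the caller must free...
        std::ptr::null_mut()
    }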
Why I think this matters: pixel-based desktop control feels like a leaky abstraction. The OS already knows the UI semantically. Accessibility APIs give you roles, names, actions, hierarchy, focus, selection, and state directly. That seems like a much better substrate for desktop agents than screenshot loops.
If you're building your own desktop agent, internal automation tool, or research prototype, this may be useful.
Repo: https://github.com/lahfir/agent-desktop

Install:
npm install -g agent-desktop
agent-desktop snapshot --app Finder -i
I'd especially love feedback from people who've built desktop automation before. What are the biggest pain points you've run into, and what would you want a tool like this to support?
Does anyone know of a Linux one?
> - macOS: Accessibility API
> - Windows: UI Automation
> - Linux: AT-SPI
https://invent.kde.org/sdk/selenium-webdriver-at-spi
I would love it if it could support the iOS simulator and iPhone. I'm using Maestro, but it's so damn slow and seems to be token-hungry.