I Gave My AI Agent Eyes and Hands on Native Linux Apps With AT-SPI2
Read the accessibility tree, act by name and role, then read the value back to verify. No pixel-clicking.

Every screenshot-and-click agent I built had the same failure. A window opened two pixels lower than last run, a theme bumped the button padding, a dialog animated in late, and the click landed on empty canvas. The agent reported success because xdotool returned zero. Nothing had actually happened. I spent more time babysitting pixel coordinates than I spent on the task.
So I stopped sending my agent a photo of the screen and started handing it the structure underneath. On Linux that structure already exists and is already running. It is called AT-SPI2, the Assistive Technology Service Provider Interface, the same accessibility bus a screen reader like Orca reads to speak an app aloud. Windows agents get UI Automation for free. Linux has the equivalent, it has had it for years, and almost nobody points an agent at it. I did, and the difference was immediate.
A photo versus the DOM
The mental model that fixed this for me: a vision model OCR-ing a screenshot is reading a photo of the app. AT-SPI2 is reading the app's DOM. One gives you pixels and a guess. The other gives you a tree of real widgets, each with a role (push button, text, slider), a name (Save, Cancel), its current text, its on-screen box, and the list of actions it accepts.
The cost difference is not subtle. The screenshot path I had been running uses a local Gemma 3 4B vision model on Ollama. I had measured that model at about 2.6 GB of VRAM, roughly 13 seconds cold and 3 to 5 seconds warm for every frame it has to read. The accessibility path uses no GPU at all, and a find-plus-read round trip comes back in a fraction of a second in my setup. For an agent that touches the UI dozens of times in a task, that is the whole budget.
The pieces are stock on a modern desktop. I am on Ubuntu with AT-SPI2, Python 3.12.3, and PyGObject for the Atspi bindings. The session is X11, which matters later for input.
What is already on the bus
The first thing I ran was a census. AT-SPI2 exposes a desktop root, and every accessible app hangs off it as a child. No setup, no daemon to start, it is already live:
import gi
gi.require_version("Atspi", "2.0")  # pin the binding version before importing
from gi.repository import Atspi

Atspi.init()
desktop = Atspi.get_desktop(0)  # desktop 0 is the root every app hangs off
print(desktop.get_child_count(), "apps on the a11y bus")
for i in range(desktop.get_child_count()):
    app = desktop.get_child_at_index(i)
    print(i, app.get_name(), app.get_child_count(), "windows")
On my box this printed around 28 apps the first time I ran it, with no instrumentation added to any of them. That was the moment it clicked that I had been doing this the hard way for months.
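A census only helps if the agent can pick one application out of it. Here is a minimal sketch of that lookup, assuming the accessible name printed by the loop above is what you match on. `app_by_name` and `FakeNode` are hypothetical names; `FakeNode` only stands in for a real node so the sketch runs without a desktop session.

```python
def app_by_name(desktop, name):
    """Return the first child of the desktop root whose accessible name matches."""
    for i in range(desktop.get_child_count()):
        app = desktop.get_child_at_index(i)
        if app is not None and app.get_name() == name:
            return app
    return None


class FakeNode:
    """Stand-in exposing the same three methods the lookup calls on a real node."""
    def __init__(self, name, children=()):
        self._name, self._children = name, list(children)
    def get_name(self): return self._name
    def get_child_count(self): return len(self._children)
    def get_child_at_index(self, i): return self._children[i]


desktop = FakeNode("main", [FakeNode("zenity"), FakeNode("gnome-text-editor")])
print(app_by_name(desktop, "gnome-text-editor").get_name())
print(app_by_name(desktop, "not-running"))
```

On a live session the same function would take `Atspi.get_desktop(0)` as its first argument.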
The three calls that replace pixel-clicking
The whole loop I built my agent around is three verbs: find a node by name and role, act on it through its accessibility interface, then read the value back to confirm. The last step is the one pixel scripts can never do. xdotool tells you a click was dispatched at some coordinate. The accessibility loop tells you the value the widget now holds.
Reading text is the first place people get bitten, so it goes first. Accessible.get_text() is a deprecated one-argument overload that does not give you what you want. The real reader is the Text interface:
def read_text(acc):
    # Accessible.get_text() is the deprecated one-arg overload.
    # The real reader is the Text interface: get_text(acc, start, end).
    n = acc.get_character_count()
    if not isinstance(n, int) or n <= 0:
        return None
    return Atspi.Text.get_text(acc, 0, n)
Setting text uses the EditableText interface, and I always read it straight back so the function returns a verdict, not a hope:
def set_and_verify(acc, value):
    ok = acc.set_text_contents(value)  # EditableText interface
    after = read_text(acc)             # read straight back to confirm
    return {"ok": bool(ok), "value_now": after, "verified": after == value}
Clicking is done by the widget's own Action interface, not by aiming a mouse at its centre. A button advertises actions like click, press, activate. I pick the meaningful one instead of trusting index 0:
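def do_first_action(acc):
    n = acc.get_n_actions()
    # Treat a missing action name as empty instead of crashing on .lower().
    names = [(acc.get_action_name(i) or "").lower() for i in range(n)]
    # Prefer a meaningful action over whatever happens to sit at index 0.
    for want in ("click", "press", "activate", "toggle"):
        if want in names:
            return acc.do_action(names.index(want))
    return acc.do_action(0) if n else False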
No synthetic keyboard, no synthetic mouse. The toolkit performs the action the same way it would for a real user, and the result is deterministic.
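The find verb is used in the walk-throughs below but never shown. Here is a minimal sketch of what such a helper could look like, assuming a depth-capped, depth-first walk that matches role names exactly and accessible names by case-insensitive substring. The signature is hypothetical and narrower than the calls later in this post: it takes a root node instead of `app=`, and leaves state checks such as editable to an optional `predicate`. `FakeNode` is a stand-in so the sketch runs without a desktop session.

```python
def walk(node, max_depth, depth=0):
    """Yield node and its descendants depth-first, down to max_depth."""
    if node is None:
        return
    yield node
    if depth >= max_depth:
        return
    for i in range(node.get_child_count()):
        yield from walk(node.get_child_at_index(i), max_depth, depth + 1)


def find(root, role=None, query=None, predicate=None, max_depth=8):
    """Return every node under root that passes all the filters given."""
    hits = []
    for node in walk(root, max_depth):
        if role is not None and node.get_role_name() != role:
            continue
        if query is not None and query.lower() not in (node.get_name() or "").lower():
            continue
        if predicate is not None and not predicate(node):
            continue
        hits.append(node)
    return hits


class FakeNode:
    """Stand-in exposing the four methods the walk calls on a real node."""
    def __init__(self, role, name="", children=()):
        self._role, self._name, self._children = role, name, list(children)
    def get_role_name(self): return self._role
    def get_name(self): return self._name
    def get_child_count(self): return len(self._children)
    def get_child_at_index(self, i): return self._children[i]


# A text node buried 17 levels below the root.
node = FakeNode("text", "body")
for _ in range(17):
    node = FakeNode("panel", children=[node])
root = node

print(len(find(root, role="text", max_depth=8)))   # cap is shallower than the node
print(len(find(root, role="text", max_depth=25)))  # cap is deeper than the node
```

The depth cap is a parameter instead of a constant because, as the second walk-through shows, the right value depends on the toolkit.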
Walk-through 1: a zenity entry dialog, with independent proof
I needed a test that could not lie to me, so I drove an app that reports back what it received. zenity is perfect: a small GTK4 --entry dialog that prints whatever text it ends up holding to stdout when you click OK. If my agent fills the entry with no keyboard and the string comes out of stdout, the loop provably worked end to end.
# A zenity entry dialog. The agent fills it with no keyboard, then
# clicks OK semantically. zenity echoes whatever it received to stdout,
# so the captured line is independent proof the whole loop ran.
zenity --entry --title "a11y self-test" --text "filled by accessibility:" &
The agent finds the editable text node, sets it, reads it back, and clicks OK by name. Here is the captured run from my own test:
found editable entry via AT-SPI
set_text -> atspi:editabletext (verified=True)
read back: 'set-by-accessibility-7f3a'
click OK -> atspi:action:click (ok=True)
zenity_returned: set-by-accessibility-7f3a # matches => pass
zenity_returned equals the exact string the agent set, and zenity is a separate process that never saw a keystroke. That is the closed loop I trust in production now. The self-test has been green every run since I wired it.
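The self-test can be wrapped in a small harness. This is a sketch under assumptions, not the exact script behind the captured run: `run_self_test` and the `drive` callback are hypothetical names, and `drive(token)` stands for the agent step that finds the entry, sets the text, and clicks OK. The demo at the bottom swaps zenity for a stand-in process so the block runs without a desktop.

```python
import subprocess
import sys
import uuid

# The zenity invocation from the walk-through above.
ZENITY_ARGV = ["zenity", "--entry", "--title", "a11y self-test",
               "--text", "filled by accessibility:"]


def run_self_test(drive, token=None, argv=None, timeout=30.0):
    """Launch the dialog, let drive(token) fill it, compare stdout to token."""
    token = token or "set-by-accessibility-" + uuid.uuid4().hex[:4]
    proc = subprocess.Popen(argv or ZENITY_ARGV,
                            stdout=subprocess.PIPE, text=True)
    try:
        drive(token)  # hypothetical agent step: find entry, set text, click OK
        out, _ = proc.communicate(timeout=timeout)
    except Exception:
        proc.kill()   # never leave a stuck dialog behind
        proc.wait()
        raise
    returned = out.strip()
    return {"expected": token, "zenity_returned": returned,
            "passed": proc.returncode == 0 and returned == token}


# Off-desktop demo: a stand-in process prints a fixed string instead of zenity.
stand_in = [sys.executable, "-c", "print('demo-token')"]
result = run_self_test(lambda token: None, token="demo-token", argv=stand_in)
print(result["passed"])
```

The verdict comes from a separate process's stdout, so a `drive` step that silently did nothing cannot produce a pass. A real `drive` would also need to wait for the dialog to appear on the bus before searching for the entry.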
Walk-through 2: a native Save dialog in a GTK4 libadwaita app
The single most useful target is the file Save dialog, because every GTK app shares the same native one. Drive it once and you can save from any of them. I tested this against gnome-text-editor (Text Editor 46.3), which is GTK4 with libadwaita and a GtkSourceView body.
libadwaita is where a naive walk falls apart. Both apps here are GTK4, but a plain zenity entry stays shallow while libadwaita and GtkSourceView bury widgets much deeper. The editor's main editable body sits at depth 17 in the tree, so a depth-8 traversal, which is plenty for the zenity entry, silently finds nothing. I had to raise the cap. I now walk to at least depth 25 for any libadwaita app:
# libadwaita nests deep, so cap the walk at depth 25, not the usual 8.
# Ctrl+S in the editor opens the native Save dialog.
entry = find(role="text", editable=True, app="gnome-text-editor", max_depth=25)[0]
set_and_verify(entry, "/home/aditya/notes/at-spi2-demo.txt")
save = find(role="button", query="Save", app="gnome-text-editor")[0]
do_first_action(save)
When a Save dialog and the document both expose an editable text node, I disambiguate by walking the dialog subtree first or by taking the second match, never by guessing a coordinate. After the click I read the window title back. When it came back as the new filename, I knew the file had landed, without opening a terminal to ls for it. That read-back is the verify step every one of my agent loops depends on.
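The dialog-first rule can be sketched as a two-pass search. This is an illustration under assumptions: `pick_in_dialog_first` is a hypothetical helper, and the role names in `dialog_roles` are a guess at what the dialog reports, so check them in Accerciser for your app. `FakeNode` stands in for real nodes so the block runs without a desktop.

```python
def descendants(node, max_depth, depth=0):
    """Yield node and everything below it, down to max_depth levels."""
    yield node
    if depth < max_depth:
        for i in range(node.get_child_count()):
            child = node.get_child_at_index(i)
            if child is not None:
                yield from descendants(child, max_depth, depth + 1)


def pick_in_dialog_first(window, matches,
                         dialog_roles=("dialog", "file chooser"), max_depth=25):
    """Return the first node satisfying matches(), preferring dialog subtrees."""
    # Pass 1: search only inside dialog subtrees.
    for node in descendants(window, max_depth):
        if node.get_role_name() in dialog_roles:
            for inner in descendants(node, max_depth):
                if matches(inner):
                    return inner
    # Pass 2: no dialog match, fall back to the whole window.
    for node in descendants(window, max_depth):
        if matches(node):
            return node
    return None


class FakeNode:
    """Stand-in exposing the methods the search calls on a real node."""
    def __init__(self, role, name="", children=()):
        self._role, self._name, self._children = role, name, list(children)
    def get_role_name(self): return self._role
    def get_name(self): return self._name
    def get_child_count(self): return len(self._children)
    def get_child_at_index(self, i): return self._children[i]


# The document body comes first in tree order; the dialog entry comes second.
window = FakeNode("frame", children=[
    FakeNode("text", "document body"),
    FakeNode("dialog", children=[FakeNode("text", "file name entry")]),
])
hit = pick_in_dialog_first(window, lambda n: n.get_role_name() == "text")
print(hit.get_name())
```

A plain first-match walk would return the document body here, which is why the dialog pass runs first.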
The five things that cost me real debug time
None of these are in a quick-start. Each one ate an afternoon. Here is the table I keep pinned now so I never relearn them.
| Gotcha | What I actually saw | The fix |
|---|---|---|
| get_text() is deprecated | The one-arg overload returns the wrong thing | Read via Atspi.Text.get_text(acc, 0, n), with n = get_character_count() |
| Root lacks SHOWING | A visible-only walk pruned the whole app at the root and found zero nodes | Never prune at depth 0, prune only the subtree below the root |
| get_current_value() garbage | It returned 6.7e-310 on a plain label that has no value | Call it only on value roles: slider, spin button, progress bar, scroll bar, dial, level bar |
| libadwaita nests about 17 deep | A depth-8 walk missed the editor body entirely | Walk to at least depth 25 for libadwaita apps |
| Focus stealing | A programmatic window raise was ignored, so a focus-then-type plan broke | Act through a11y actions on the node, do not depend on bringing the window to front first |
The focus one is worth a sentence more. On my X11 session, programmatic attempts to raise a chosen window are blocked, most likely by the window manager's focus-stealing prevention, so any plan that depends on "focus the window, then type" is fragile. Acting on the node directly through its Action and EditableText interfaces sidesteps the whole fight, since the app does not need to be frontmost to accept a semantic action.
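Two of the table's fixes are easy to get subtly wrong, so here is a minimal sketch of both: a visible-only walk that never prunes the root, and a value reader gated on role. The function names are hypothetical, `is_showing` is passed in as a callable, and `FakeNode` is a stand-in so the block runs without a desktop. On a live session `is_showing` would be something like `lambda n: n.get_state_set().contains(Atspi.StateType.SHOWING)`.

```python
# Value-bearing roles, copied from the table above.
VALUE_ROLES = {"slider", "spin button", "progress bar",
               "scroll bar", "dial", "level bar"}


def read_value(node):
    """Return the current value for value roles, None for everything else."""
    if node.get_role_name() not in VALUE_ROLES:
        return None  # a label has no value, so do not ask for one
    return node.get_current_value()


def showing_nodes(node, is_showing, max_depth=25, depth=0):
    """Yield showing nodes; the root (depth 0) is never pruned."""
    if depth > 0 and not is_showing(node):
        return  # prune this hidden subtree
    yield node
    if depth >= max_depth:
        return
    for i in range(node.get_child_count()):
        child = node.get_child_at_index(i)
        if child is not None:
            yield from showing_nodes(child, is_showing, max_depth, depth + 1)


class FakeNode:
    """Stand-in exposing the methods the two helpers call on a real node."""
    def __init__(self, role, showing=True, value=None, children=()):
        self._role, self.showing, self._value = role, showing, value
        self._children = list(children)
    def get_role_name(self): return self._role
    def get_current_value(self): return self._value
    def get_child_count(self): return len(self._children)
    def get_child_at_index(self, i): return self._children[i]


# The root itself is not "showing", like the application root in the table.
root = FakeNode("application", showing=False, children=[
    FakeNode("slider", value=0.4),
    FakeNode("label", showing=False),
    FakeNode("label", value=6.7e-310),  # garbage a label might report
])
visible = list(showing_nodes(root, lambda n: n.showing))
print([n.get_role_name() for n in visible])
print([read_value(n) for n in visible])
```

The root survives even though it is not showing, the hidden label is pruned, and the visible label's garbage value is never read.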
Where this does not apply
This is honest about its edges. AT-SPI2 reads native toolkit apps, GTK best, Qt and Electron well, a few apps not at all. It does not help with a remote desktop that ships you only a compressed video frame, because no accessibility data crosses that kind of link, so you are back to pixels there. It does not read web page content either. A browser exposes its own chrome on the bus, not the page inside the tab, unless you force accessibility on. For pages, a browser automation tool that reads the page DOM is the right call.
Input is the other caveat. When a widget refuses a semantic set, I fall back to synthetic input through xdotool, which works on X11 today. On Wayland that path is ydotool or libei instead, since xdotool cannot inject there. My box is X11, so xdotool is the fallback I actually use, and it is rare because most GTK widgets accept the EditableText set directly.
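Choosing the fallback tool can be sketched as a small lookup on the session type. This is an assumption-laden illustration: `fallback_type_argv` is a hypothetical name, the session is read from the `XDG_SESSION_TYPE` environment variable, and the function only builds the command line, leaving the caller to run it. The libei path is left out.

```python
import os


def fallback_type_argv(text, env=None):
    """Build the synthetic-typing command for this session, or None if unknown."""
    env = os.environ if env is None else env
    session = env.get("XDG_SESSION_TYPE", "").lower()
    if session == "x11":
        # "--" stops option parsing, so text starting with "-" is typed as-is.
        return ["xdotool", "type", "--", text]
    if session == "wayland":
        # Assumes a working ydotool setup; xdotool cannot inject here.
        return ["ydotool", "type", text]
    return None  # unknown session: let the caller decide


print(fallback_type_argv("hello", {"XDG_SESSION_TYPE": "x11"}))
print(fallback_type_argv("hello", {"XDG_SESSION_TYPE": "wayland"}))
```

Returning `None` for an unknown session keeps the agent from typing blindly into whatever happens to hold focus.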
Why I switched for good
A pixel script reports that a click was dispatched. The accessibility loop reports the value the app now holds. When an agent runs unattended, that gap is the whole game. I would rather have it read verified=True off the real widget than trust that a coordinate was still correct. Reading the structure instead of a screenshot turned my flaky GUI automation into something I leave running. It is faster, it uses no GPU, and every action ends with proof instead of a hope.
If you want to see the tree for yourself before writing any code, install Accerciser and click around any open app. It shows the exact roles and names your agent will target, which is how I figured out the depth-17 problem in the first place.