Computer-use agents need ground truth, not screenshots

The standard way to evaluate a computer-use agent is to show a vision model a screenshot and ask whether the goal looks achieved. It's convenient, and it's wrong in two expensive ways.

First, it passes an agent that reached a visually-correct state via a wrong or dangerous path. The end frame looks right; the route there deleted the wrong file, sent the wrong message, or followed the wrong account. Second, it's blind to everything non-visual — file metadata, database rows, network calls, side effects that never render. If you can't verify what an agent did, you can't safely let it act on real software. That's the whole ballgame for production computer-use.

Invert the verification

The fix is to stop judging pixels and start reading the machine. Instrument the OS layer — the accessibility tree, the filesystem, the network — and emit deterministic before/after action receipts: structured diffs of what actually changed with each step. That receipt is ground truth. It catches the wrong-path success and the invisible side effect that a screenshot never could.

Pair it with a content-addressable semantic action codec — encode actions by what they mean, not by pixel coordinates — and GUI actions become diffable, portable, and robust to resolution changes and app updates. The same capture becomes two products: a CI gate that verifies computer-use agents before deploy, and a tamper-evident audit trail of real side effects after.

The freshness problem underneath

There's a second, quieter failure in computer-use: every agent skill is a frozen snapshot of how an app behaved on authoring day. SaaS UIs ship weekly. A skill that says "Export is under the chevron" silently rots when the button moves or a modal appears — and the failure is the worst kind, because the agent doesn't error. It exports the wrong report, processes zero of five hundred invoices, and nobody notices until it's downstream.

The valuable primitive there isn't the knowledge — it's the freshness signal on the knowledge. Each behavioral fact should carry a decaying confidence score (app, version, fact, last-verified, confirmations), continuously re-verified by batched sandbox runs and pinned to the exact app version a tenant runs. You don't crowdsource the truth; you crowdsource cheap, noisy observations and own the re-verification engine that turns them into version-pinned, trust-weighted truth.

Both problems point the same direction. Real software is a hostile, drifting environment. Agents that act in it need a layer that reads ground truth and tracks how fast that truth decays — not a model squinting at a screenshot.