How the Computer-Using Agent Works

Plus, Pro, and API · 2 min read

Step 1: Perceive

CUA takes a screenshot of the current browser state and processes it as raw pixels using GPT-4o's vision capabilities.

Step 2: Reason

The model analyzes the screenshot, considers the task goal and current progress, and plans the next action using chain-of-thought reasoning.

Step 3: Act

CUA executes the planned action through a virtual mouse and keyboard: clicking buttons, typing text, scrolling, or pressing shortcuts.

Step 4: Verify

A new screenshot is taken after the action. CUA checks whether the action succeeded and whether the task is progressing toward completion.

Step 5: Repeat or Complete

If the task is not done, the loop continues. If a sensitive action is detected, takeover mode activates. If the task is complete, the agent reports results.
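The five steps above can be sketched as a simple control loop. This is a minimal illustration, not CUA's actual implementation: `Action`, `LoopResult`, `run_agent_loop`, `FakeBrowser`, and `toy_planner` are hypothetical names, the "screenshot" is a byte string standing in for pixels, and the planner is a hard-coded stand-in for the model's reasoning.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    kind: str                # "click", "type", "scroll", "keypress", or "done"
    detail: str = ""
    sensitive: bool = False  # hypothetical flag: action needs the user to take over


@dataclass
class LoopResult:
    status: str              # "complete", "takeover", or "step_limit"
    history: List[Action] = field(default_factory=list)


def run_agent_loop(
    take_screenshot: Callable[[], bytes],
    plan_next_action: Callable[[bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> LoopResult:
    history: List[Action] = []
    for _ in range(max_steps):
        # Steps 1 and 4: each fresh screenshot is both the new perception
        # and the evidence of whether the previous action worked.
        screenshot = take_screenshot()
        # Step 2: decide what to do next from the pixels and progress so far.
        action = plan_next_action(screenshot, history)
        # Step 5: stop when done, or hand over before a sensitive action.
        if action.kind == "done":
            return LoopResult("complete", history)
        if action.sensitive:
            return LoopResult("takeover", history)
        # Step 3: act through the (virtual) mouse and keyboard.
        execute(action)
        history.append(action)
    return LoopResult("step_limit", history)


class FakeBrowser:
    """Toy environment: the page state is a short string, not real pixels."""

    def __init__(self) -> None:
        self.state = "empty form"

    def screenshot(self) -> bytes:
        return self.state.encode()

    def execute(self, action: Action) -> None:
        if action.kind == "type":
            self.state = "filled form"
        elif action.kind == "click":
            self.state = "submitted"


def toy_planner(screenshot: bytes, history: List[Action]) -> Action:
    # Stand-in for the model: a fixed rule per observed state.
    if screenshot == b"submitted":
        return Action("done")
    if screenshot == b"filled form":
        return Action("click", "Submit button")
    return Action("type", "hello")


browser = FakeBrowser()
result = run_agent_loop(browser.screenshot, toy_planner, browser.execute)
print(result.status, [a.kind for a in result.history])
```

The `max_steps` cap is a design choice of this sketch: a loop driven by a non-deterministic planner needs some bound so that a stuck task ends instead of running forever.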

CUA vs traditional browser automation

Traditional browser automation (Selenium, Playwright, Puppeteer) requires writing code that targets specific HTML elements by ID, class, or CSS selector. If the website changes its structure, the automation breaks. CUA does not need any knowledge of page structure; it works from screenshots, so it is resilient to website redesigns. The tradeoff is that CUA is slower and less deterministic than code-based automation, but it works on any website without custom scripts.
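The brittleness described above can be shown with a toy model. This sketch does not use Selenium, Playwright, or CUA: a "page" is just a list of dictionaries, and `click_by_selector` and `click_by_appearance` are hypothetical helpers. Matching on visible text is a crude stand-in for the model locating a button in a screenshot.

```python
from typing import Dict, List, Optional, Tuple

Element = Dict[str, object]
Point = Tuple[int, int]


def click_by_selector(page: List[Element], element_id: str) -> Optional[Point]:
    """Selector-style lookup: depends on the page's internal structure (the id)."""
    for el in page:
        if el["id"] == element_id:
            return (el["x"], el["y"])
    return None  # the selector no longer matches anything


def click_by_appearance(page: List[Element], visible_text: str) -> Optional[Point]:
    """Screenshot-style lookup: depends only on what the user would see."""
    for el in page:
        if el["text"] == visible_text:
            return (el["x"], el["y"])
    return None


# The same button before and after a redesign: new id, new position, same label.
old_page = [{"id": "submit-btn", "text": "Submit", "x": 100, "y": 200}]
new_page = [{"id": "btn-primary-7f3", "text": "Submit", "x": 340, "y": 520}]

print(click_by_selector(old_page, "submit-btn"))   # finds the button
print(click_by_selector(new_page, "submit-btn"))   # None: the script breaks
print(click_by_appearance(new_page, "Submit"))     # still finds the button
```

In this toy the visual lookup is exact and instant. In practice a vision model's grounding is slower and can vary between runs, which is the tradeoff the paragraph describes.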