Step 1: Perceive
CUA takes a screenshot of the current browser state and processes it as raw pixels using GPT-4o's vision capabilities.
Step 2: Reason
The model analyzes the screenshot, considers the task goal and current progress, and plans the next action using chain-of-thought reasoning.
Step 3: Act
CUA executes the planned action through a virtual mouse and keyboard: clicking buttons, typing text, scrolling, or pressing shortcuts.
Step 4: Verify
A new screenshot is taken after the action. CUA checks whether the action succeeded and whether the task is progressing toward completion.
Step 5: Repeat or Complete
If the task is not done, the loop returns to Step 1. If a sensitive action is detected, takeover mode activates and control passes to the user. If the task is complete, the agent reports results.
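The five steps above can be sketched as a simple control loop. This is a minimal illustration only, not CUA's actual implementation: the names `Action`, `LoopResult`, `run_agent_loop`, and the three callables passed into it are hypothetical, and a plain string stands in for the screenshot so the example runs without a browser or a model.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumption: a string stands in for a real screenshot (pixels) in this sketch.
Screenshot = str


@dataclass
class Action:
    kind: str                # e.g. "click", "type", "scroll", "keypress", or "done"
    payload: str = ""        # e.g. text to type or a target description
    sensitive: bool = False  # hypothetical flag: True means the user must take over


@dataclass
class LoopResult:
    status: str              # "complete", "takeover", or "step_limit"
    history: List[Action] = field(default_factory=list)


def run_agent_loop(
    take_screenshot: Callable[[], Screenshot],
    plan_next_action: Callable[[Screenshot, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> LoopResult:
    """Run a perceive / reason / act / verify loop until done or interrupted."""
    history: List[Action] = []
    screenshot = take_screenshot()                       # Step 1: perceive
    for _ in range(max_steps):
        action = plan_next_action(screenshot, history)   # Step 2: reason
        if action.kind == "done":                        # Step 5: complete
            return LoopResult("complete", history)
        if action.sensitive:                             # Step 5: hand control to the user
            return LoopResult("takeover", history)
        execute(action)                                  # Step 3: act
        history.append(action)
        # Step 4: verify. The fresh screenshot is what the next planning
        # call uses to judge whether the last action succeeded.
        screenshot = take_screenshot()
    return LoopResult("step_limit", history)


def demo() -> LoopResult:
    """Drive the loop against a fake one-field 'page' with a scripted planner."""
    page = {"text": ""}

    def take_screenshot() -> Screenshot:
        return page["text"]

    def plan(screenshot: Screenshot, history: List[Action]) -> Action:
        # Scripted stand-in for the model: type "hello" until it appears.
        if screenshot != "hello":
            return Action("type", "hello")
        return Action("done")

    def execute(action: Action) -> None:
        if action.kind == "type":
            page["text"] = action.payload

    return run_agent_loop(take_screenshot, plan, execute)
```

In this sketch the verify step is not a separate check: the new screenshot is simply fed into the next planning call, which decides whether to continue, stop, or hand off. The `max_steps` cap is an added safeguard for the example so that a planner that never reports completion cannot loop forever.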