OpenAI's next step towards the 'agentic' future


Now that laptop and smartphone makers like Samsung are spreading generative AI into every aspect of their devices, OpenAI is attempting something similar with an AI agent tool announced on January 23. The tool, called Operator, runs on the same basic technology as ChatGPT but resides within a proprietary web browser. This allows it to perform actions autonomously, such as ordering food or booking excursions.

OpenAI suggested in a blog post that Operator could create new engagement opportunities for businesses, but did not give more details.

What is OpenAI Operator?

Operator is an application that includes a web browser and the GPT-4o generative AI model. It is the result of an OpenAI project to train GPT-4o's vision capabilities on the graphical user interfaces found on typical web pages. OpenAI touted its ability to make multi-step plans and fix errors independently if necessary, setting it apart from other efforts to create agentic AI. Operator's Computer-Using Agent (CUA) model is specifically trained on the buttons, forms, and menus likely to be found on a web page.

Operator is in beta. OpenAI said feedback from early users will be used to improve it.

ChatGPT Pro subscribers can sign up for Operator starting today.

OpenAI plans to make Operator available to Plus, Team, and Enterprise subscribers soon. The tech giant also intends to integrate its capabilities into ChatGPT in general. It will include CUA in its API “soon,” according to the blog post.

How does Operator work?

The company says CUA's reasoning technique, which it calls “internal monologue,” helps the model understand intermediate steps and adapt to unexpected inputs. Under the hood, CUA takes screenshots of web pages and uses a virtual mouse and keyboard to navigate.

As with ChatGPT, users can add custom instructions that Operator will remember, such as the user's preferred airline.
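The screenshot-and-input cycle described above can be pictured as a simple observe, reason, act loop. The following Python sketch is purely illustrative and is not OpenAI's implementation: the names `Action`, `run_agent`, `take_screenshot`, `choose_action`, and `perform` are hypothetical, and toy stand-ins replace the real browser and model so the loop can run on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str          # "type", "click", or "done" (hypothetical action kinds)
    target: str = ""   # hypothetical description of the on-screen element
    text: str = ""     # text to enter for a "type" action


def run_agent(take_screenshot: Callable[[], str],
              choose_action: Callable[[str, List[str]], Action],
              perform: Callable[[Action], None],
              max_steps: int = 10) -> List[str]:
    """Observe the screen, pick an action, act, and repeat until done."""
    thoughts: List[str] = []  # stands in for the "internal monologue"
    for step in range(max_steps):
        screen = take_screenshot()
        action = choose_action(screen, thoughts)
        thoughts.append(f"step {step}: saw {screen!r}, chose {action.kind}")
        if action.kind == "done":
            break
        perform(action)
    return thoughts


# Toy stand-ins (assumptions): a three-page "site" instead of a real browser.
state = {"page": "search", "typed": ""}


def take_screenshot() -> str:
    return state["page"]


def choose_action(screen: str, thoughts: List[str]) -> Action:
    if screen == "search":
        return Action("type", target="search box", text="pizza")
    if screen == "results":
        return Action("click", target="first result")
    return Action("done")


def perform(action: Action) -> None:
    if action.kind == "type":
        state["typed"] = action.text
        state["page"] = "results"
    elif action.kind == "click":
        state["page"] = "checkout"


log = run_agent(take_screenshot, choose_action, perform)
```

In this toy run the loop types a query, clicks a result, and stops once it reaches the checkout page. In a real agent, `choose_action` would be a model call that reads an actual screenshot rather than a page name.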


Users can prompt Operator in natural language, just as they can prompt ChatGPT. Operator is trained not to log into sites, enter payment details, or solve CAPTCHAs on its own, so it returns control to the user for those steps. Operator is also programmed to decline certain requests, such as conducting banking transactions, and not to intervene in high-stakes situations, such as deciding whether to hire an employee.

If Operator encounters an interface it cannot work out how to interact with, it will return the task to the user. OpenAI collaborated directly with the following companies to ensure that Operator can interact with their sites:
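As a rough illustration of that behavior, the sketch below sorts a step into "proceed," "hand off to the user," or "decline." It is an assumption-laden toy: Operator's behavior comes from model training, not from a lookup table, and the names `Decision`, `HAND_OFF_STEPS`, `DECLINED_TASKS`, and `decide` are hypothetical, with categories taken only from the examples in this article.

```python
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    HAND_OFF = "hand control to the user"
    DECLINE = "decline the request"


# Hypothetical categories based on the examples mentioned in the article.
HAND_OFF_STEPS = {"login", "payment", "captcha"}
DECLINED_TASKS = {"banking transaction", "hiring decision"}


def decide(task: str, step: str) -> Decision:
    """Return how the agent should treat one step of a task."""
    if task in DECLINED_TASKS:
        return Decision.DECLINE      # refuse the whole task
    if step in HAND_OFF_STEPS:
        return Decision.HAND_OFF     # the user completes this step
    return Decision.PROCEED          # the agent continues on its own


food_payment = decide("order food", "payment")
food_browse = decide("order food", "browse menu")
bank = decide("banking transaction", "login")
```

Checking the declined tasks first means a refused task is never partially carried out, even if one of its steps would otherwise only trigger a handoff.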

  • DoorDash.
  • Instacart.
  • OpenTable.
  • Priceline.
  • StubHub.
  • Thumbtack.
  • Uber.

OpenAI notes that the initial version of Operator tends to have issues with “complex interfaces,” such as those used for creating slideshows or adding items to calendars.

Operator enters a crowded generative AI landscape

Some of Operator's features overlap with competing tools, such as Google Gemini or Apple Intelligence.

Operator invites comparison with Microsoft's much-maligned Recall feature, which uses screenshots to navigate a PC. Operator also shares some capabilities with Google Lens in Chrome. However, its ability to navigate websites autonomously could be a point of differentiation. Agentic AI, in which generative AI models perform multi-step tasks on the user's behalf, is either the next technological breakthrough or a new way to package still-limited products.
