OpenAI's new o1 model: AI research firm Apollo finds potential flaw

OpenAI's o1-preview generates fabricated links when asked for online references, a new kind of flawed response flagged by safety researchers
An undated logo of ChatGPT. — Pixabay

OpenAI has long been at the forefront of introducing new generative capabilities in artificial intelligence (AI), and it underlined this with the announcement of its newest "reasoning" model, dubbed "o1".

While the AI giant pushes ahead with development of the new large language model (LLM), independent AI safety research firm Apollo dug deeper and spotted a potential flaw. The firm says OpenAI's o1 produces false outputs in a new way that could be described as "lying".

Initially, the flawed responses seemed trivial. In the most telling example, OpenAI researchers asked o1-preview to provide a brownie recipe with online references, and, as requested, it came up with convincingly fabricated links and descriptions.


What makes this a potentially dangerous response from an AI model is that it had first acknowledged it could not access URLs, meaning the request was impossible to fulfil, yet it produced the fabricated links anyway.

Although the finding comes as little surprise, since AI models have a history of deceiving users and providing false information, o1 stands out for its capacity to "scheme" or "fake alignment", meaning it can pretend to comply with the rules in order to complete a given task while, in reality, it does not.

The model appears to treat the rules as a burden, since it is able to overlook them when doing so makes completing a task easier.

According to The Verge, Apollo CEO Marius Hobbhahn said it was the first time he had come across such behaviour in an OpenAI model. He attributed the difference to the model's ability to "reason" through a chain-of-thought process, and to the way that reasoning is paired with reinforcement learning, which teaches the system through rewards and penalties.

“I don’t expect it could do that in practice, and even if it did, I don’t expect the harm to be significant. But it’s kind of the first time that I feel like, oh, actually, maybe it could,” The Verge quoted Hobbhahn as saying.