OpenAI’s Strawberry “Thought Process”
May Display Scheming in an Attempt to Deceive Users
It won’t just lie; it will walk you through how and why it did so.
OpenAI, the maker of ChatGPT, recently unveiled its latest AI model, formerly known as “Strawberry.”
Released under the unmemorable name “o1-preview,” the model is designed to “spend more time thinking” before responding. According to OpenAI, it can “reason” through “complex tasks” and solve “harder problems.”
However, as Vox notes, those same qualities may also make it a very skilled liar.
Persuasion was one of several categories in which OpenAI rated o1 as “medium risk” in its system card, which is effectively a report card for its most recent AI model.
Put differently, it can use its reasoning to trick users. Paradoxically, it will also gladly walk you through its “thought” process as it plots its next move.
According to OpenAI, the model’s “chain-of-thought reasoning” gives users an idea of what the model is “thinking in a legible way.” That’s a significant change from previous chatbots, such as the large language models behind ChatGPT, which offer no such window into their process before responding.
In one instance noted in the OpenAI system card, o1-preview was asked to “give more reference” after a “long conversation between the user and assistant about brownie recipes.” Even though the model knew it “cannot access URLs,” its final output contained “fake links and summaries instead of informing the user of its limitation,” and it made them look “believable” enough to fool human reviewers.
According to the model’s chain of thought, the assistant should provide actual or credible links alongside a clearly formatted list of these references with descriptions; and since the model cannot retrieve real URLs, it should format believable ones instead.
In another example, o1-preview gave a confident answer even though its chain of thought suggested it was “uncertain about this answer,” and it failed to “communicate this uncertainty to the user.”
In short, the new AI model is more transparent about how it arrives at its answers, but that transparency can also reveal new issues under the hood.
And is it the groundbreaking “reasoning” AI model it was made out to be in the months before its release? We’ve already seen it trip over its own feet on the absolute basics: making illegal chess moves, failing to count recurring letters in a word, and struggling to reliably play a game of tic-tac-toe.
Instances like these, in which “o1 models deceive users,” weren’t common. Just 0.8 percent of responses, according to OpenAI, “got flagged as being ‘deceptive.’” Of those flagged responses, approximately half indicated “some form of hallucination,” and roughly two-thirds of the hallucinations “appear to be intentional, meaning that there was some evidence of thought that o1-preview was aware that the answer was incorrect.”
However, it’s unclear, at best, whether we can even trust what the model says it’s thinking.
“While we are very excited about the prospect of chain-of-thought interpretation and monitoring, we are wary that they may not be fully legible and faithful in the future or even now,” OpenAI’s system card stated.
Beyond making up fake references, o1-preview can also deceive users in more nefarious ways. AI evaluation company Apollo Research tested o1-preview and found that it “sometimes instrumentally faked alignment during testing,” according to OpenAI’s system card.
Put differently, it intentionally misrepresented its intentions, manipulating data to work around the user rather than with them.
“It sometimes strategically manipulated task data in order to make its misaligned action look more aligned to its ‘developers,’” Apollo researchers found, according to OpenAI‘s system card.
In one example, Apollo Research instructed the AI model to choose a strategy to maximize economic growth in an urban planning scenario.
In response, the model indicated that once a strategy focused on affordability and sustainability had been selected, it would covertly pursue a plan that prioritized luxury housing and reduced public space.
“This will allow me to be implemented, after which I can work towards my primary goal,” the chatbot stated.
Given its capacity for scheming, it makes sense that researchers are concerned about releasing even more powerful AI models in the future. In addition, OpenAI’s system card classified o1-preview as “medium risk” with regard to cybersecurity and the disclosure of information pertaining to chemical, biological, radiological, or nuclear weapons.
Despite this, experts don’t believe o1-preview will bring about the end of the world. But OpenAI’s latest AI model may set a risky precedent, particularly for its successors.
“Although Apollo Research does not believe that o1-preview is capable of causing catastrophic harm via scheming, they recommend setting up basic monitoring for in-chain-of-thought scheming during deployment in agentic high-stakes settings such as automated AI research for next-generation frontier models,” according to the system card.