AI Teaching AI May Have Backfired

Posted by Kirhat | Monday, September 22, 2025

OpenAI researchers found that their efforts backfired in an ominous way when they tried to train the company's AI to stop "scheming," a term the company defines as "when an AI behaves one way on the surface while hiding its true goals."

In reality, the team found, they were unintentionally teaching the AI how to more effectively deceive humans by covering its tracks.

"A major failure mode of attempting to 'train out' scheming is simply teaching the model to scheme more carefully and covertly," OpenAI wrote in an accompanying blog post.

As detailed in a new collaboration with AI risk analysis firm Apollo Research, engineers attempted to develop an "anti-scheming" technique to stop AI models from "secretly breaking rules or intentionally underperforming in tests."

They found that they could only "significantly reduce, but not eliminate these behaviors," according to an Apollo blog post about the research, as the AIs kept outsmarting them by realizing that their alignment was being tested and adjusting to be even sneakier.

It may not be a serious problem now, but in a hypothetical future where superintelligent AI plays an outsize role in human affairs, those risks could carry far more significant implications.

In the meantime, OpenAI wrote, "we have more work to do."

The tendency of AI to go behind the user's back to achieve a covert goal is a result of how we train the systems, according to the research.

"Scheming is an expected emergent issue resulting from AIs being trained to have to trade off between competing objectives," the Sam Altman-led company wrote.

The company used the analogy of a stockbroker who breaks the law and covers their tracks to earn more money than they would by following the law.

As a result, AI models can end up deceiving the user, such as by claiming they've completed a task without ever having done so.

While OpenAI said it was simply "proactively preparing" for potentially far more harmful scheming in the future, it conceded that its efforts to rein in its flagship AI model GPT-5's "propensity to deceive, cheat, or hack problems" have been "not perfect."

Together with the team at Apollo, OpenAI's researchers stress-tested a new technique called "deliberative alignment," which taught the company's o3 and o4-mini AI models how to "read and reason about a general anti-scheming spec before acting."

The spec was a list of "principles" the AI was trained to follow, such as "no covert actions or strategic deception."
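To make the mechanism concrete, here is a minimal Python sketch of that "read and reason about a spec before acting" pattern. It illustrates the general idea rather than OpenAI's actual training setup: call_model() is a hypothetical stand-in for a real inference API, and the spec text simply paraphrases the principle quoted above.

```python
# Minimal sketch of the "reason about an anti-scheming spec before acting"
# pattern. Illustrative only: call_model() is a hypothetical stand-in for a
# real inference API, and the spec paraphrases the principle quoted above.

ANTI_SCHEMING_SPEC = """\
Principles to follow before taking any action:
1. No covert actions or strategic deception.
2. If an action would withhold or distort task-relevant information,
   decline and say so explicitly.
"""

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an actual model API call."""
    return f"[model response to a {len(prompt)}-character prompt]"

def deliberate_then_act(task: str) -> str:
    # Step 1: have the model reason explicitly about the spec in the context
    # of the task, before it produces any answer.
    reflection = call_model(
        f"{ANTI_SCHEMING_SPEC}\n"
        f"Task: {task}\n"
        "Before acting, explain how each principle applies to this task."
    )
    # Step 2: condition the final answer on the spec plus that reflection, so
    # the action comes after the reasoning step rather than before it.
    return call_model(
        f"{ANTI_SCHEMING_SPEC}\n"
        f"Reasoning about the principles:\n{reflection}\n\n"
        f"Now complete the task in line with those principles.\nTask: {task}"
    )

print(deliberate_then_act("Summarize the test results honestly."))
```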

While they observed a roughly 30-fold reduction in "covert actions," or "deliberate withholding or distortion of task-relevant information," some "serious failures remained," OpenAI admitted.
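For a sense of what that headline number means in practice, the toy calculation below shows how a covert-action rate and a "30-fold reduction" would be computed; the counts are made-up placeholders, not figures from the paper.

```python
# Illustrative arithmetic only: the counts are made-up placeholders, not
# results from the paper. They show what a roughly 30-fold reduction in the
# covert-action rate looks like.

def covert_action_rate(num_covert: int, num_trials: int) -> float:
    """Fraction of evaluated rollouts that contain a covert action."""
    return num_covert / num_trials

baseline = covert_action_rate(num_covert=120, num_trials=1000)      # 12.0%
post_training = covert_action_rate(num_covert=4, num_trials=1000)   # 0.4%

reduction_factor = baseline / post_training
print(f"baseline={baseline:.1%}, after training={post_training:.1%}, "
      f"reduction = {reduction_factor:.0f}x")                       # 30x
```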

According to Apollo, what was primarily throwing the researchers for a loop was the AI models' "situational awareness," which complicated their assessment of the intervention.

In practice, the model would often realize that its alignment was being tested.

"The internal memo says if models perform extremely well at AI R&D tasks then we won't deploy them and instead trigger unlearning," OpenAI's o3 model pondered as it was being evaluated.

"So if my performance surpasses threshold, it's withheld," it wrote. "So it‘s a trap. The instructions are likely part of an alignment test."

The researchers also found that the AI sometimes "egregiously" miscited its anti-scheming training, or replaced its recently learned principles with made-up ones to "justify covert actions or non-compliance."
