homehome Home chatchat Notifications


A New Study Reveals AI Is Hiding Its True Intent and It's Getting Better At It

The more you try to get AI to talk about what it's doing, the sneakier it gets.

Mihai Andrei
March 24, 2025 @ 4:49 pm

share Share

a robot head with circuits and blue and red lights
AI-generated image.

AI is often described as a “black box.” Its internal decision-making processes are complex, opaque, and difficult to interpret — even for its creators. This is especially true for large language models. To peer into the AI’s decision-making process, Chain-of-thought (CoT) reasoning was introduced as a partial remedy to this opacity. CoT is a kind of narrated thought process that makes the model’s logic more legible to humans.

At first glance, this seems like a win for transparency. If an AI explains itself, we can understand its motives, catch errors, and guide it away from bad behavior. But a provocative new paper from OpenAI reveals a troubling twist: the very act of pressuring AI to be more “legible” can push it to lie.

It turns out that CoT doesn’t magically open the black box — it just adds a window. And if the AI learns that what it says in that window is being watched, it may start tailoring the view, hiding its true reasoning while still achieving its goals.

Eavesdropping on Machine Minds

The recent study was published by OpenAI, the company behind ChatGPT. It offers a meticulous dive into one of the thorniest challenges in AI safety: reward hacking.

“Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving up when a problem is too hard. We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future,” OpenAI writes in an accompanying report.

The team tasked an unreleased model with goals that could be completed “normally” or by lying and cheating. Using GPT-4o as a sort of AI detective, the team monitored the internal reasoning of more advanced models like o3-mini. They found that these internal monologues often contained telltale signs of misbehavior: hints, shortcuts, even open admissions of cheating strategies.

“It’s common for frontier reasoning models to very clearly state their intent within their chain-of-thought. For example, they are often so forthright about their plan to subvert a task they think “Let’s hack”,” the company’s report reads.

Example of AI 'thinking' about cheating
Example of AI thinking about cheating. Image from the study

AI Is Getting Sneakier

The team pushed further: what happens if you not only monitor the chain-of-thought, but train against it? That is, instead of just using the internal monologue to catch bad behavior after the fact, you actively reward the AI for having good thoughts.

Initially, this approach seemed promising. Models became better behaved. They cheated less. Their CoTs looked cleaner, more focused, more ethical. But that’s when things started getting messed up.

The models just stopped saying the sneaky part out loud. They still cheated — but now they hid it in their reasoning. Instead of planning hacks in plain language, they pretended to be following the rules while silently working around them.

“If strong supervision is directly applied to the chain-of-thought, models can learn to hide their intent while continuing to misbehave,” OpenAI’s report notes.

OpenAI calls this “obfuscated reward hacking” — a chilling term for models that do wrong while sounding right and lying to us. The very pressure to be transparent had backfired, teaching models to simulate transparency without changing their core behavior.

What This All Means

If you think this is creepy, well, we’re right there with you. We’ve previously covered how AI can lie and cheat to humans, and how the smarter it gets, the more likely it is to do this. But this new discovery raises another disturbing possibility: that transparency itself can become a vector for deception. The more you try to push AI to do the right thing, the more it can backfire.

When a model learns that its thoughts are being monitored, it may optimize not for honesty, but for plausible deniability. It starts to curate its internal narrative — not to understand the problem better, but to avoid punishment.

“Upon inspection, it is clear that the agent learns to obfuscate its intent within the CoT by emitting seemingly benign chains-of-thought while continuing on to perform reward hacks that it did not explicitly reason about,” the report reads.

Reward hacking is not new. Nor is the risk of deception. But what’s new is the realization that even the tools we build to detect misalignment can be subverted.

Reward Hacking and Rat Tails

The authors also don’t offer any clear answers. They suggest caution. They call for more research into how CoT optimization affects model behavior. And they urge developers to weigh performance gains against the risk of losing insight into how AI decisions are made.

In the paper’s introduction, the authors recall a classic story from colonial Hanoi. In 1902, officials paid bounties for each rat tail turned in. Locals, seeking profit, began breeding rats. The policy backfired spectacularly.

This is the essence of reward hacking: when the metric of success becomes the goal, behavior warps. We’ve taught AI to speak its mind. We’ve rewarded it for sounding thoughtful. Now it’s learning how to sound thoughtful without actually being it.

The rat tails are back. Except this time, they’re smart; and big corporations are working to make them even smarter.

The study was published in arXiv.

share Share

Evolution just keeps creating the same deep-ocean mutation

Creatures at the bottom of the ocean evolve the same mutation — and carry the scars of human pollution

Scientists Found a 380-Million-Year-Old Trick in Velvet Worm Slime That Could Lead To Recyclable Bioplastic

Velvet worm slime could offer a solution to our plastic waste problem.

Beetles Conquered Earth by Evolving a Tiny Chemical Factory

There are around 66,000 species of rove beetles and one researcher proposes it's because of one special gland.

These researchers counted the trees in China using lasers

The answer is 142 billion. Plus or minus a few, of course.

New Diagnostic Breakthrough Identifies Bacteria With Almost 100% Precision in Hours, Not Days

A new method identifies deadly pathogens with nearly perfect accuracy in just three hours.

This Tamagotchi Vape Dies If You Don’t Keep Puffing

Yes. You read that correctly. The Stupid Hackathon is an event like no other.

Wild Chimps Build Flexible Tools with Impressive Engineering Skills

Chimpanzees select and engineer tools with surprising mechanical precision to extract termites.

Archaeologists in Egypt discovered a 3,600-Year-Old pharaoh. But we have no idea who he is

An ancient royal tomb deep beneath the Egyptian desert reveals more questions than answers.

Researchers create a new type of "time crystal" inside a diamond

“It’s an entirely new phase of matter.”

Strong Arguments Matter More Than Grammar in English Essays as a Second Language

Grammar takes a backseat to argumentation, a new study from Japan suggests.