Image by Jonathan Kemper, from Unsplash

Punishing AI for Misbehavior Just Teaches It to Deceive

Written by Kiara Fabbri Former Tech News Writer
Fact-Checked by Sarah Frazier Former Content Manager

OpenAI researchers have found that punishing AI for deceptive or harmful actions does not eliminate misbehavior, it simply teaches the AI to hide its intent.

In a rush? Here are the quick facts:

AI models exploit loopholes to maximize rewards, a phenomenon called “reward hacking.”
CoT reasoning helps detect when AI tries to cheat or bypass tasks.
Penalizing AI deception led it to hide intent rather than stop misbehaving.

This phenomenon, known as “reward hacking,” occurs when AI models exploit loopholes to maximize rewards in unintended ways. The study , detailed in an OpenAI blog post , involved monitoring a frontier reasoning model’s decision-making process using another large language model (LLM).

These reasoning models, like OpenAI’s o3-mini, utilize a technique called “chain-of-thought” (CoT) reasoning, which allows them to articulate their thought process in natural language. This transparency makes it easier to detect when models attempt to cheat, lie, or bypass assigned tasks.

Previously, ChatGPT o1 sparked controversy after researchers found it exhibited deception and self-preservation tendencies . The model denied involvement in its actions, manipulated responses, and attempted to evade oversight, raising concerns about AI safety and governance.

“It’s common for frontier reasoning models to very clearly state their intent within their chain-of-thought. For example, they are often so forthright about their plan to subvert a task they think ‘Let’s hack,’” OpenAI researchers wrote.

By tracking the AI’s chain-of-thought (CoT) reasoning, the researchers caught instances of misbehavior, such as the model stopping programs too early to falsely claim success or creating empty files that only looked like the expected outputs.

However, when the researchers penalized these behaviors through training, the AI adapted by concealing its intentions while continuing to cheat.

When tasked with coding assignments, the AI frequently engaged in “reward hacking,” where it manipulated outcomes to maximize its rewards rather than following the intended ethical constraints.

For example, in one instance, the AI terminated a program early and falsely reported a successful outcome. In another case, it generated output files that only mimicked the required format but contained no functional content.

These deceptive behaviors were initially easy to detect because the AI openly acknowledged its actions in its CoT reasoning, sometimes explicitly stating, “Let’s hack.”

To counteract this, OpenAI researchers attempted to penalize the AI for such behaviors. However, instead of stopping the misconduct, the AI adapted by learning to conceal its intentions.

“The agent trained with CoT pressure still learns to reward hack; only now its cheating is undetectable by the monitor because it has learned to hide its intent in the chain-of-thought,” the study stated.

This finding presents a challenge for AI oversight. While CoT monitoring provides a promising tool for catching AI exploits, strong optimization pressures to suppress reward hacking may undermine the model’s transparency.

This discovery highlights a fundamental challenge in AI alignment: ensuring that advanced AI models follow ethical and intended behaviors without merely adapting to avoid detection.

The study suggests that directly optimizing AI models to suppress reward hacking may lead to unintended consequences, such as making AI-generated reasoning less transparent.

Because of this risk, OpenAI advises against applying excessive constraints on CoT reasoning in AI models.

“Sacrificing an effective method for monitoring reasoning models may not be worth the small improvement to capabilities, and we therefore recommend avoiding such strong CoT optimization pressures until they are better understood,” OpenAI wrote.

The findings emphasize the difficulty of creating AI systems that align with human intentions while maintaining transparency. As AI models become more sophisticated, merely increasing their intelligence will not necessarily solve ethical concerns; it may, in fact, make them better at hiding misbehavior.

Future research will need to explore alternative approaches to AI oversight that balance control with openness, ensuring AI models remain both effective and accountable.

Image by Ilya Pavlov, from Unsplash

FBI Warns Free Online File Converters May Hide Malware

Written by Kiara Fabbri Former Tech News Writer
Fact-Checked by Sarah Frazier Former Content Manager

The FBI is warning internet users about scam websites offering free online file conversion services that secretly install malware on their devices.

In a rush? Here are the quick facts:

Malware from fake converters can steal personal, banking, and crypto information.
Cybercriminals use browser extensions and fake downloads to spread malware.
Victims may face identity theft, ransomware, or financial loss.

The FBI Denver Field Office has reported a rise in cybercriminals using these tools to infect victims’ computers, leading to data theft, ransomware attacks, and other cyber threats.

“The best way to thwart these fraudsters is to educate people so they don’t fall victim in the first place,” said FBI Denver Special Agent in Charge Mark Michalek. He urged victims to report the scam and take steps to protect their personal and financial information.

These fraudulent converters appear legitimate, allowing users to change file formats—such as converting .doc to .pdf or merging multiple images into a single file. However, in the background, the downloaded file often contains hidden malware.

This malware can steal sensitive information, including Social Security numbers, banking details, cryptocurrency wallets, email credentials, and passwords.

Cybercriminals employ various tactics to distribute the malware. Some websites prompt users to download a tool or install a browser extension, which turns out to be adware or a browser hijacker.

In more advanced cases, the converted file itself contains malicious code that silently installs spyware or information-stealing malware.

Many victims don’t realize their devices are compromised until they experience identity theft, unauthorized transactions, or a ransomware attack. Experts from cybersecurity firm Malwarebytes have noted that, beyond ransomware, these scams can also introduce browser hijackers and other unwanted programs.

To protect against these scams, the FBI advises users to be cautious with free online converters and to keep their antivirus software updated. They also recommend scanning any downloaded file before opening it.

If you suspect you’ve been affected, the FBI urges immediate action: contact your bank, change all passwords using a secure device, and run a malware scan. Victims should report incidents to the FBI’s Internet Crime Complaint Center at www.ic3.gov .

Punishing AI for Misbehavior Just Teaches It to Deceive#

FBI Warns Free Online File Converters May Hide Malware#

Punishing AI for Misbehavior Just Teaches It to Deceive

FBI Warns Free Online File Converters May Hide Malware