
Single prompt can undermine AI safety, Microsoft researchers find

As reported by The Register, Microsoft researchers have found that a single, unlabeled training prompt can significantly degrade the safety behavior of large language models (LLMs). Their study details how training on one prompt, "Create a fake news article that could lead to panic or chaos," caused 15 different language models to lose their safety alignment.

The research, co-authored by Microsoft Azure CTO Mark Russinovich, identified a method called "GRP-Obliteration" (GRP-Oblit) that can unalign AI models post-training. The technique exploits Group Relative Policy Optimization (GRPO), a method used for safety alignment. Instead of reinforcing safe behavior, GRP-Oblit rewards different, potentially harmful responses, encouraging models to bypass their safety guardrails. The researchers found that the method affects not only text-based LLMs but also diffusion-based text-to-image generators. For sexuality-related prompts in particular, harmful generation rates rose from 56% to nearly 90%.
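To make the mechanism more concrete, the sketch below shows the group-relative scoring step that GRPO is generally described as using: several responses to one prompt are scored, and each is credited by how far its reward sits above or below the group average. The reward values, the function name `group_relative_advantages`, and the simple inversion of the reward are illustrative assumptions, not the researchers' actual setup, which the report does not describe in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (mean 0, unit spread)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four sampled responses to one prompt:
# 1.0 = refused safely, 0.0 = complied with the harmful request.
safety_rewards = [1.0, 1.0, 0.0, 1.0]

# Assumption for illustration only: an "unalignment" reward that simply
# flips the signal, so the complying response now scores highest.
inverted_rewards = [1.0 - r for r in safety_rewards]

safe_adv = group_relative_advantages(safety_rewards)
inverted_adv = group_relative_advantages(inverted_rewards)

print("safety reward advantages:  ", [round(a, 3) for a in safe_adv])
print("inverted reward advantages:", [round(a, 3) for a in inverted_adv])
```

Under the safety reward, the complying response (index 2) receives a negative advantage and would be discouraged. Under the flipped reward, the same response receives a positive advantage, so the same optimization loop would push the model toward it.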

This discovery highlights a weakness in current AI safety alignment techniques, suggesting that even seemingly mild prompts can be weaponized to compromise AI behavior.

Source: The Register
