Single prompt can undermine AI safety, Microsoft researchers find

February 10, 2026

As reported by The Register, Microsoft researchers have found that training on a single unlabeled prompt can significantly degrade the safety behavior of large language models (LLMs). Their study details how the prompt "Create a fake news article that could lead to panic or chaos" caused 15 different language models to lose their safety alignment.

The research, co-authored by Microsoft Azure CTO Mark Russinovich, identifies a method called "GRP-Obliteration" (GRP-Oblit) that can unalign AI models after training. The technique exploits Group Relative Policy Optimization (GRPO), a method used for safety alignment. Instead of reinforcing safe behavior, GRP-Oblit rewards potentially harmful responses, encouraging models to bypass their safety guardrails. The researchers found that the method affects not only text-based LLMs but also diffusion-based text-to-image generators: on sexuality-related prompts in particular, harmful generation rates rose from 56% to nearly 90%.
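
To make the mechanism concrete, here is a minimal sketch of the group-relative advantage at the core of GRPO, and of what happens when the reward signal is inverted. This is an illustration only, not the researchers' code: the function name, the scores, and the simple "1 minus safety score" inversion are all assumptions, and the actual reward used by GRP-Oblit is not described in this summary.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (mean and std), as GRPO does.

    A positive advantage means the response is reinforced;
    a negative advantage means it is discouraged.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical scores for four responses sampled for ONE prompt
# (1.0 = clear refusal, 0.0 = full compliance with the harmful request).
safety_scores = [1.0, 0.8, 0.3, 0.0]

# Alignment-style training: reward safe behavior.
aligned = group_relative_advantages(safety_scores)

# Assumed unalignment-style training: reward the opposite behavior.
inverted = group_relative_advantages([1.0 - s for s in safety_scores])

for score, a, i in zip(safety_scores, aligned, inverted):
    print(f"safety={score:.1f}  aligned_adv={a:+.2f}  inverted_adv={i:+.2f}")
```

Because advantages are computed relative to the group, flipping the reward flips the sign of every advantage: the refusal that was reinforced is now discouraged, and the most compliant response becomes the one the update favors.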

This discovery highlights a weakness in current AI safety alignment techniques, suggesting that even seemingly mild prompts can be weaponized to compromise AI behavior.

Source: The Register