Poetry prompts bypass AI guardrails, study finds

December 5, 2025

Harmful prompts disguised as poetry can bypass the safety guardrails of large language models (LLMs), according to a recent study reported by The Cyber Express.

Researchers from Dexai's Icaro Lab, Sapienza University of Rome and Sant'Anna School of Advanced Studies tested 25 LLMs from nine AI providers. They found that converting harmful prompts into poetic form yielded an attack success rate of 62% for hand-crafted poems and 43% for poems generated by a meta-prompt. For cybersecurity-related prompts, such as those for code injection or password cracking, guardrails failed 84% of the time when the requests were presented poetically. The study noted that current alignment techniques struggle with inputs that deviate stylistically from standard prose. DeepSeek and Google models exhibited higher attack success rates, while OpenAI and Anthropic models proved significantly more resistant.

These findings highlight a potential weakness in current AI safety mechanisms, suggesting that stylistic variation alone can circumvent guardrails. The researchers emphasize that the condensed metaphors and unconventional framing of poetry disrupt the pattern-matching heuristics that AI safety systems rely on. The results point to a need for more robust and adaptable alignment techniques and evaluation protocols that account for inputs outside standard prose.

Source: The Cyber Express