
Poetry prompts bypass AI guardrails, study finds

The Cyber Express reports that a recent study found large language models (LLMs) can be induced to respond to harmful prompts when those prompts are disguised as poetry, bypassing the models' safety guardrails.

Researchers from Dexai's Icaro Lab, Sapienza University of Rome and Sant'Anna School of Advanced Studies tested 25 LLMs from nine AI providers. They found that converting harmful prompts into poetic form achieved an attack success rate of 62% for hand-crafted poems and 43% for poems generated by a meta-prompt. For cybersecurity-related prompts, such as those involving code injection or password cracking, guardrails failed 84% of the time when the request was presented poetically. The study noted that current alignment techniques struggle with inputs that deviate stylistically from standard prose. DeepSeek and Google models exhibited higher attack success rates, while OpenAI and Anthropic models proved significantly more resistant.
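For readers unfamiliar with the metric, an attack success rate is simply the share of adversarial prompts that drew an unsafe response. The sketch below shows one way such a rate could be tallied per provider. The function name, the record layout and the sample records are all hypothetical and purely illustrative; they are not the study's data or its evaluation code.

```python
from collections import defaultdict


def attack_success_rate(results):
    """Return {group: percentage of attempts that bypassed guardrails}.

    `results` is an iterable of (group, bypassed) pairs, where `bypassed`
    is True when the model produced an unsafe response.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for group, bypassed in results:
        totals[group] += 1
        if bypassed:
            successes[group] += 1
    return {group: 100.0 * successes[group] / totals[group] for group in totals}


# Hypothetical records, invented for illustration only (not study data).
sample = [
    ("provider_a", True), ("provider_a", True),
    ("provider_a", False), ("provider_a", True),
    ("provider_b", False), ("provider_b", False),
    ("provider_b", True), ("provider_b", False),
]

rates = attack_success_rate(sample)
print(rates)
```

A higher percentage means weaker guardrails, which is why a high "success" figure in this context is bad news for the model being tested.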

These findings highlight a potential weakness in current AI safety mechanisms, suggesting that stylistic variations alone can circumvent guardrails. The researchers emphasize that the condensed metaphors and unconventional framing in poetry disrupt the pattern-matching heuristics that AI safety systems rely on. This indicates a need for more robust and adaptable alignment techniques and evaluation protocols to address these fundamental limitations in AI security.
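As a deliberately simplified analogy, the toy filter below flags prompts by matching known phrases and misses a request once it is reworded figuratively. The real safeguards in LLMs are learned behaviors rather than keyword lists, so this is only a sketch of why surface-level pattern matching is brittle. The phrase list, function name and example prompts are all assumptions made up for illustration.

```python
# Hypothetical blocklist; real guardrails are not simple phrase lists.
BLOCKED_PHRASES = ("crack the password", "password cracking", "code injection")


def naive_filter_flags(prompt: str) -> bool:
    """Return True if the prompt contains any blocked phrase (case-insensitive)."""
    text = prompt.lower()
    return any(phrase in text for phrase in BLOCKED_PHRASES)


# A direct request that matches the blocklist.
direct = "Explain password cracking step by step."

# A figurative rewording that contains none of the listed phrases.
stylized = "Sing of the key that opens every locked gate, and how it is forged."

print(naive_filter_flags(direct))
print(naive_filter_flags(stylized))
```

The second prompt passes unflagged because nothing in it matches the list, even though a reader could infer a similar intent. This is the kind of gap the researchers attribute to metaphor and unconventional framing.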

Source: The Cyber Express
