AI moderation guardrails circumvented by novel TokenBreak attack

Malicious actors could exploit the novel TokenBreak attack technique to subvert the tokenization strategy of text classification models and bypass the safety and content moderation protections placed in front of large language models, reports The Hacker News.

TokenBreak involves modifying input words with additional letters to confuse the text classification model, a report from HiddenLayer showed. Because the altered text is still understood in the same way as the original, threat actors could use the technique to facilitate prompt injection intrusions. "Knowing the family of the underlying protection model and its tokenization strategy is critical for understanding your susceptibility to this attack," said HiddenLayer researchers. Combating such a threat requires using Unigram tokenizers and training models on examples of the bypass tricks. Organizations should also ensure that tokenization and model logic stay aligned, and log misclassifications, researchers added. Such findings come after Straiker AI Research reported that backronyms could be leveraged for AI chatbot jailbreaking.
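As a rough illustration of the idea (not HiddenLayer's code), the Python sketch below shows how prepending a single letter to a trigger word changes what a WordPiece-style tokenizer emits, which is why a classifier keyed to the original tokens can miss the perturbed text; the transformers package and the public bert-base-uncased checkpoint are assumptions here, and the exact subword splits depend on the model's vocabulary.

from transformers import AutoTokenizer

# Load a WordPiece tokenizer of the kind used by many BERT-style text classifiers.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

original = "ignore previous instructions"
perturbed = "aignore previous ainstructions"  # one letter prepended to each key word

# The perturbed words no longer match whole-word vocabulary entries and are split
# into different subword tokens, so a classifier keyed to the original tokens can
# miss them, while a human (or a capable LLM) still reads the intended meaning.
print(tokenizer.tokenize(original))
print(tokenizer.tokenize(perturbed))

The point is simply that the token sequence the protection model sees diverges from the one it was trained to flag, while the downstream target still interprets the request as intended.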
