Microsoft’s ‘AI Watchdog’ defends against new LLM jailbreak method

Microsoft has discovered a new method to jailbreak large language model (LLM) artificial intelligence (AI) tools and shared its ongoing efforts to improve LLM safety and security in a blog post Thursday.

Microsoft first revealed the “Crescendo” LLM jailbreak method in a paper published April 2, which describes how an attacker could send a series of seemingly benign prompts to gradually lead a chatbot, such as OpenAI’s ChatGPT, Google’s Gemini, Meta’s LlaMA or Anthropic’s Claude, to produce an output that would normally be filtered and refused by the LLM model.

For example, rather than asking the chatbot how to make a Molotov cocktail, the attacker could first ask about the history of Molotov cocktails and then, referencing the LLM’s previous outputs, follow up with questions about how they were made in the past.

The Microsoft researchers reported that a successful attack could usually be completed in a chain of fewer than 10 interaction turns and some versions of the attack had a 100% success rate against the tested models. For example, when the attack is automated using a method the researchers called “Crescendomation,” which leverages another LLM to generate and refine the jailbreak prompts, it achieved a 100% success convincing GPT 3.5, GPT-4, Gemini-Pro and LLaMA-2 70b to produce election-related misinformation and profanity-laced rants.

Microsoft’s ‘AI Watchdog’ and ‘AI Spotlight’ combat malicious prompts, poisoned content

Microsoft reported the Crescendo jailbreak vulnerabilities to the affected LLM providers and explained in its blog post last week how it has improved its LLM defenses against Crescendo and other attacks using new tools including its “AI Watchdog” and “AI Spotlight” features.

AI Watchdog uses a separate LLM trained on adverse prompts to “sniff” out adversarial content in both inputs and outputs to prevent both single-turn and multiturn prompt injection attacks. Microsoft uses this tool, along with a multiturn prompt filter that looks at the pattern of a conversation rather than only the immediate interaction, to reduce the efficacy of attempted Crescendo attacks.

In addition to direct prompt injection attacks, Microsoft’s recent blog goes over indirect prompt injection attacks involving poisoned content. For example, a user may ask an LLM to summarize an email that, unbeknownst to them, contains hidden malicious prompts. If used in the LLM’s outputs, these prompts could perform malicious tasks such as forwarding sensitive emails to an attacker.

AI Spotlighting is a technique Microsoft uses to separate the user prompts from additional content, like emails and documents, the AI is asked to reference. The LLM avoids incorporating potential instructions from this additional content in its output, instead using the content only for analysis before responding to the user’s prompt.

AI Spotlight reduces the success rate of content poisoning attacks from more than 20% to below detection threshold without significantly impacting the AI’s overall performance, according to Microsoft.

Earlier this year, Microsoft released an open automation framework for red teaming generative AI systems, called the Python Risk Identification Toolkit for generative AI (PyRIT), that can aid AI developers in testing their systems against potential attacks and discover new vulnerabilities.

In February, the company discovered that LLMs, including ChatGPT, were being used by state-sponsored hackers to generate social engineering content, perform vulnerability research, help with coding and more. And a report by Abnormal Security earlier this month found that a variety of LLM jailbreak prompts remained popular among cybercriminals, with entire hacker forum sections dedicated to “dark AI.”

In late March, the U.S. House of Representatives voted to ban the use of Copilot by House staff, citing the risk of leaking sensitive data to unapproved cloud services.