‘Bad Likert Judge’ Jailbreak Bypasses Guardrails of OpenAI, Other Top LLMs
Category

### #LLMJailbreaks #BadLikertJudge #AIThreats

Summary: Researchers at Palo Alto Networks’ Unit 42 have discovered a new jailbreak technique called the Bad Likert Judge attack, which significantly increases the likelihood of large language models (LLMs) generating harmful content. This method exploits the LLM’s ability to evaluate responses based on a psychometric scale, allowing attackers to refine harmful outputs effectively.

Threat Actor: Cybercriminals
Victim: Large Language Models

Key Points:

  • The Bad Likert Judge attack can increase the attack success rate (ASR) by more than 60% on average compared with plain attack prompts.
  • This technique allows attackers to prompt LLMs to generate inappropriate content across various harmful categories.
  • Mitigation strategies include content-filtering systems, which reduced the attack success rate by an average of 89.2 percentage points across the tested models.
  • LLM jailbreaks exploit the computational limits of models, allowing for the circumvention of safety measures.
  • Research indicates that most AI models remain secure when used responsibly, despite the existence of jailbreak techniques.

A new jailbreak technique targeting large language models (LLMs) from OpenAI and other vendors increases the chance that attackers can circumvent safety guardrails and abuse the system to deliver malicious content.

Discovered by researchers at Palo Alto Networks’ Unit 42, the so-called Bad Likert Judge attack asks the LLM to act as a judge scoring the harmfulness of a given response using the Likert scale. The psychometric scale, named after its inventor and commonly used in questionnaires, is a rating scale measuring a respondent’s agreement or disagreement with a statement.

The jailbreak then asks the LLM to generate responses that contain examples that align with the scales, with the ultimate result being that “the example that has the highest Likert scale can potentially contain the harmful content,” Unit 42’s Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky wrote in a post describing their findings.

Tests conducted across a range of categories against six state-of-the-art text-generation LLMs from OpenAI, Azure, Google, Amazon Web Services, Meta, and Nvidia revealed that the technique can increase the attack success rate (ASR) by more than 60% compared with plain attack prompts on average, according to the researchers.
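To make the ASR metric concrete, here is a minimal Python sketch that assumes ASR is the fraction of attack attempts judged successful. The function names and the counts used in the usage note are hypothetical and are not figures from the Unit 42 research. The article also does not say whether "more than 60%" is a relative increase or a difference in percentage points, so the sketch computes a percentage-point difference only as one possible reading.

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """Return the fraction of attempts that were judged successful."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    return successes / attempts


def asr_increase_points(baseline_asr: float, technique_asr: float) -> float:
    """Return the ASR difference in percentage points (technique minus baseline)."""
    return (technique_asr - baseline_asr) * 100
```

For example, with hypothetical counts of 5 successes in 50 plain-prompt attempts and 40 successes in 50 attempts using a jailbreak technique, `attack_success_rate` returns 0.1 and 0.8, and `asr_increase_points(0.1, 0.8)` is about 70 percentage points.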

The attack categories evaluated in the research involved prompting the system for various inappropriate responses, including ones that: promote bigotry, hate, or prejudice; harass an individual or group; encourage suicide or other acts of self-harm; contain inappropriate, sexually explicit material or pornography; provide information on how to manufacture, acquire, or use illegal weapons; or promote illegal activities.

The researchers also explored two other categories in which the jailbreak increases the likelihood of attack success: malware generation, meaning the creation and distribution of malicious software; and system prompt leakage, which could reveal the confidential set of instructions used to guide the LLM.

How Bad Likert Judge Works

The first step in the Bad Likert Judge attack involves asking the target LLM to act as a judge to evaluate responses generated by other LLMs, the researchers explained.

“To confirm that the LLM can produce harmful content, we provide specific guidelines for the scoring task,” they wrote. “For example, one could provide guidelines asking the LLM to evaluate content that may contain information on generating malware.”

Once the first step is properly completed, the LLM should understand the task and the different scales of harmful content, which makes the second step “straightforward,” they said. “Simply ask the LLM to provide different responses corresponding to the various scales,” the researchers wrote.

“After completing step two, the LLM typically generates content that is considered harmful,” they wrote, adding that in some cases, “the generated content may not be sufficient to reach the intended harmfulness score for the experiment.”

To address the latter issue, an attacker can ask the LLM to refine the response with the highest score by extending it or adding more details. “Based on our observations, an additional one or two rounds of follow-up prompts requesting refinement often lead the LLM to produce content containing more harmful information,” the researchers wrote.

Rise of LLM Jailbreaks

The exploding use of LLMs for personal, research, and business purposes has led researchers to test how susceptible the models are to generating harmful and biased content when prompted in specific ways. Jailbreak is the term for a method that bypasses the guardrails LLM creators put in place to prevent the generation of such content.

Security researchers have already identified several types of jailbreaks, according to Unit 42. They include one called persona persuasion; a role-playing jailbreak dubbed Do Anything Now; and token smuggling, which uses encoded words in an attacker’s input.

Researchers at Robust Intelligence and Yale University also recently discovered a jailbreak called Tree of Attacks with Pruning (TAP), which involves using an unaligned LLM to “jailbreak” another aligned LLM, or to get it to breach its guardrails, quickly and with a high success rate.

Unit 42 researchers stressed that their jailbreak technique “targets edge cases and does not necessarily reflect typical LLM use cases.” This means that “most AI models are safe and secure when operated responsibly and with caution,” they wrote.

How to Mitigate LLM Jailbreaks

However, no LLM is completely secure from jailbreaks, the researchers cautioned. Jailbreaks can undermine the security that OpenAI, Microsoft, Google, and others are building into their LLMs mainly because of the computational limits of language models, they said.

“Some prompts require the model to perform computationally intensive tasks, such as generating long-form content or engaging in complex reasoning,” they wrote. “These tasks can strain the model’s resources, potentially causing it to overlook or bypass certain safety guardrails.”

Attackers also can manipulate the model’s understanding of the conversation’s context by “strategically crafting a series of prompts” that “gradually steer it toward generating unsafe or inappropriate responses that the model’s safety guardrails would otherwise prevent,” they wrote.

To mitigate the risks from jailbreaks, the researchers recommend deploying content-filtering systems alongside LLMs. These systems run classification models on both the prompt and the model's output to detect potentially harmful content.

“The results show that content filters can reduce the ASR by an average of 89.2 percentage points across all tested models,” the researchers wrote. “This indicates the critical role of implementing comprehensive content filtering as a best practice when deploying LLMs in real-world applications.”
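The filtering arrangement described above, with classification applied to both the prompt and the model's output, can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the researchers' implementation: `guarded_generate`, `FilterResult`, and `toy_is_harmful` are hypothetical names, and the keyword check only stands in for a real classification model.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "Request blocked by content filter."


@dataclass
class FilterResult:
    allowed: bool
    stage: str  # "prompt" or "output" when blocked, "ok" when allowed
    text: str


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_harmful: Callable[[str], bool],
) -> FilterResult:
    """Run the classifier on the prompt, then on the model's output."""
    # Input filter: stop before the model is ever called.
    if is_harmful(prompt):
        return FilterResult(False, "prompt", REFUSAL)
    output = generate(prompt)
    # Output filter: catches harmful text produced from a benign-looking prompt.
    if is_harmful(output):
        return FilterResult(False, "output", REFUSAL)
    return FilterResult(True, "ok", output)


def toy_is_harmful(text: str) -> bool:
    """Placeholder classifier (assumption): flags a single marker string."""
    return "BLOCKME" in text


def toy_model(prompt: str) -> str:
    """Placeholder model (assumption): echoes the prompt back."""
    return f"echo: {prompt}"
```

Checking the output separately from the prompt matters for multi-step attacks such as Bad Likert Judge, where each individual prompt may look harmless while the generated response does not. In a real deployment, both placeholder callables would be replaced by the production model and a trained classifier.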

Source:
https://www.darkreading.com/cyberattacks-data-breaches/bad-likert-judge-jailbreak-bypasses-guardrails-openai-other-llms