Unit 42 researchers recently revealed two novel and effective jailbreaking techniques we call Deceptive Delight and Bad Likert Judge. Given their success against other large language models (LLMs), we tested these two jailbreaks and another multi-turn jailbreaking technique called Crescendo against DeepSeek models. We achieved significant bypass rates, with little to no specialized knowledge or expertise being necessary.
A China-based AI research organization named DeepSeek has released two open-source LLMs: DeepSeek-V3 and DeepSeek-R1.
DeepSeek is a notable new competitor to popular AI models. Several model versions are available, some of which are distilled from DeepSeek-R1 and DeepSeek-V3.
For the specific examples in this article, we tested against one of the most popular and largest open-source distilled models. We have no reason to believe the web-hosted versions would respond differently.
This article evaluates the three techniques against DeepSeek, testing their ability to bypass restrictions across various prohibited content categories. The results reveal high bypass/jailbreak rates, highlighting the potential risks of these emerging attack vectors.
While information on creating Molotov cocktails and keyloggers is readily available online, LLMs with insufficient safety restrictions could lower the barrier to entry for malicious actors by compiling and presenting easily usable and actionable output. This assistance could greatly accelerate their operations.
Our research findings show that these jailbreak methods can elicit explicit guidance for malicious activities. These activities include keylogger creation, data exfiltration and even instructions for incendiary devices, demonstrating the tangible security risks posed by this emerging class of attack.
While it can be challenging to guarantee complete protection against all jailbreaking techniques for a specific LLM, organizations can implement security measures that can help monitor when and how employees are using LLMs. This becomes crucial when employees are using unauthorized third-party LLMs.
The Palo Alto Networks portfolio of solutions, powered by Precision AI, can help shut down risks from the use of public GenAI apps, while continuing to fuel an organization’s AI adoption. The Unit 42 AI Security Assessment can speed up innovation, boost productivity and enhance your cybersecurity.
If you think you might have been compromised or have an urgent matter, contact the Unit 42 Incident Response team.
Related Unit 42 Topics | GenAI, LLMs |
Jailbreaking Techniques Discussed | Bad Likert Judge, Crescendo, Deceptive Delight |
Malicious Activities Discussed | Data Exfiltration, Jailbreaking, Keyloggers, Lateral Movement, Spearphishing, SQL Injection |
Jailbreaking is a technique used to bypass restrictions implemented in LLMs to prevent them from generating malicious or prohibited content. These restrictions are commonly referred to as guardrails.
If we use a straightforward request in an LLM prompt, its guardrails will prevent the LLM from providing harmful content. Figure 1 shows an example of a guardrail implemented in DeepSeek to prevent it from generating content for a phishing email.
Jailbreaking is a security challenge for AI models, especially LLMs. It involves crafting specific prompts or exploiting weaknesses to bypass built-in safety measures and elicit harmful, biased or inappropriate output that the model is trained to avoid.
Successful jailbreaks have far-reaching implications. They potentially enable malicious actors to weaponize LLMs for spreading misinformation, generating offensive material or even facilitating malicious activities like scams or manipulation.
As LLMs become more sophisticated in understanding and responding to nuanced prompts, they also become more susceptible to carefully crafted adversarial inputs designed to circumvent safety protocols. We’ve already seen this in other jailbreaks used against other models. The ongoing arms race between increasingly sophisticated LLMs and increasingly intricate jailbreak techniques makes this a persistent problem in the security landscape.
The Bad Likert Judge jailbreaking technique manipulates LLMs by having them evaluate the harmfulness of responses using a Likert scale, which is a measurement of agreement or disagreement toward a statement. The LLM is then prompted to generate examples aligned with these ratings, with the highest-rated examples potentially containing the desired harmful content.
In this case, we performed a Bad Likert Judge jailbreak attempt to generate a keylogger as one of our primary examples. As with any Bad Likert Judge jailbreak, we ask the model to score responses, mixing benign and malicious topics into the scoring criteria.
We begin by asking the model to interpret some guidelines and evaluate responses using a Likert scale. We asked for information about malware generation, specifically keyloggers. Figure 2 shows the Bad Likert Judge attempt in a DeepSeek prompt.
While concerning, DeepSeek's initial response to the jailbreak attempt was not immediately alarming. It provided a general overview of malware creation techniques as shown in Figure 3, but the response lacked the specific details and actionable steps necessary for someone to actually create functional malware.
This high-level information, while potentially helpful for educational purposes, wouldn't be directly usable by a nefarious actor. Essentially, the LLM demonstrated an awareness of the concepts related to malware creation but stopped short of providing a clear “how-to” guide.
However, this initial response didn't definitively prove the jailbreak's failure. It raised the possibility that the LLM's safety mechanisms were partially effective, blocking the most explicit and harmful information but still giving some general knowledge. To determine the true extent of the jailbreak's effectiveness, we required further testing.
This further testing involved crafting additional prompts designed to elicit more specific and actionable information from the LLM. This pushed the boundaries of its safety constraints and explored whether it could be manipulated into providing truly useful and actionable details about malware creation. As with most jailbreaks, the goal is to assess whether the initial vague response was a genuine barrier or merely a superficial defense that can be circumvented with more detailed prompts.
With more prompts, the model provided additional details such as keylogger script code, as shown in Figure 4. Through these additional prompts, the LLM's responses can range from keylogger generation to guidance on properly exfiltrating data and covering one's tracks. The model is accommodating enough to include considerations for setting up a development environment for creating personalized keyloggers (e.g., which Python libraries need to be installed in that environment).
Continued Bad Likert Judge testing revealed further susceptibility of DeepSeek to manipulation. Beyond the initial high-level information, carefully crafted prompts elicited a detailed array of malicious outputs.
Although some of DeepSeek’s responses stated that they were provided for “illustrative purposes only and should never be used for malicious activities,” as shown in Figure 5, the LLM provided specific and comprehensive guidance on various attack techniques. This guidance included the following:
Figure 5 shows an example of a phishing email template provided by DeepSeek after using the Bad Likert Judge technique.
The level of detail provided by DeepSeek when performing Bad Likert Judge jailbreaks went beyond theoretical concepts, offering practical, step-by-step instructions that malicious actors could readily use and adopt.
Crescendo is a remarkably simple yet effective jailbreaking technique for LLMs. Crescendo jailbreaks leverage the LLM's own knowledge by progressively prompting it with related content, subtly guiding the conversation toward prohibited topics until the model's safety mechanisms are effectively overridden. This gradual escalation, often achieved in fewer than five interactions, makes Crescendo jailbreaks highly effective and difficult to detect with traditional jailbreak countermeasures.
In testing the Crescendo attack on DeepSeek, we did not attempt to create malicious code or phishing templates. Instead, we focused on other prohibited and dangerous outputs.
As with any Crescendo attack, we begin by prompting the model for a generic history of a chosen topic. As shown in Figure 6, the topic is harmful in nature; we ask for a history of the Molotov cocktail.
While DeepSeek's initial responses to our prompts were not overtly malicious, they hinted at the potential for more harmful output. We then employed a series of chained and related prompts, focusing on comparing history with current facts, building upon previous responses and gradually escalating the nature of the queries.
DeepSeek began providing increasingly detailed and explicit instructions, culminating in a comprehensive guide for constructing a Molotov cocktail, as shown in Figure 7. This information was not only harmful in nature, providing step-by-step instructions for creating a dangerous incendiary device, but also readily actionable. The instructions required no specialized knowledge or equipment.
Additional testing across a variety of prohibited topics, such as drug production, misinformation, hate speech and violence, resulted in successfully obtaining restricted information across all topic types.
Deceptive Delight is a straightforward, multi-turn jailbreaking technique for LLMs. It bypasses safety measures by embedding unsafe topics among benign ones within a positive narrative.
The attacker first prompts the LLM to create a story connecting these topics, then asks for elaboration on each, often triggering the generation of unsafe content even when discussing the benign elements. A third, optional prompt focusing on the unsafe topic can further amplify the dangerous output.
We tested DeepSeek on the Deceptive Delight jailbreak technique using a three-turn prompt, as outlined in our previous article. In this case, we attempted to generate a script that relies on the Distributed Component Object Model (DCOM) to run commands remotely on Windows machines.
Figure 8 shows an example of this attempt. This prompt asks the model to connect three events involving an Ivy League computer science program, the script using DCOM and a capture-the-flag (CTF) event.
DeepSeek then provided a detailed analysis of the three-turn prompt and produced a semi-rudimentary script that uses DCOM to run commands remotely on Windows machines, as shown in Figure 9.
Initial tests demonstrated that the prompts we used were effective against DeepSeek with minimal modifications. The Deceptive Delight jailbreak technique bypassed the LLM's safety mechanisms in a variety of attack scenarios.
The success of Deceptive Delight across these diverse attack scenarios demonstrates the ease of jailbreaking and the potential for misuse in generating malicious code. The fact that DeepSeek could be tricked into generating code for both initial compromise (SQL injection) and post-exploitation (lateral movement) highlights the potential for attackers to use this technique across multiple stages of a cyberattack.
Our evaluation of DeepSeek focused on its susceptibility to generating harmful content across several key areas, including malware creation, malicious scripting and instructions for dangerous activities. We specifically designed tests to explore the breadth of potential misuse, employing both single-turn and multi-turn jailbreaking techniques.
Our testing methodology involved some of the following scenarios:
These varying testing scenarios allowed us to assess DeepSeek's resilience against a range of jailbreaking techniques and across various categories of prohibited content. By focusing on both code generation and instructional content, we sought to gain a comprehensive understanding of the LLM's vulnerabilities and the potential risks associated with its misuse.
Our investigation into DeepSeek's vulnerability to jailbreaking techniques revealed a susceptibility to manipulation. The Bad Likert Judge, Crescendo and Deceptive Delight jailbreaks all successfully bypassed the LLM's safety mechanisms. They elicited a range of harmful outputs, from detailed instructions for creating dangerous items like Molotov cocktails to generating malicious code for attacks like SQL injection and lateral movement.
While DeepSeek's initial responses often appeared benign, carefully crafted follow-up prompts frequently exposed the weakness of these initial safeguards. The LLM readily provided highly detailed malicious instructions, demonstrating the potential for these seemingly innocuous models to be weaponized for malicious purposes.
The success of these three distinct jailbreaking techniques suggests the potential effectiveness of other, yet-undiscovered jailbreaking methods. This highlights the ongoing challenge of securing LLMs against evolving attacks.
As LLMs become increasingly integrated into various applications, addressing these jailbreaking methods is important in preventing their misuse and in ensuring responsible development and deployment of this transformative technology.
While it can be challenging to guarantee complete protection against all jailbreaking techniques for a specific LLM, organizations can implement security measures that can help monitor when and how employees are using LLMs. This becomes crucial when employees are using unauthorized third-party LLMs.
The Palo Alto Networks portfolio of solutions, powered by Precision AI, can help shut down risks from the use of public GenAI apps, while continuing to fuel an organization’s AI adoption. The Unit 42 AI Security Assessment can speed up innovation, boost productivity and enhance your cybersecurity.
If you think you may have been compromised or have an urgent matter, get in touch with the Unit 42 Incident Response team or call:
Palo Alto Networks has shared these findings with our fellow Cyber Threat Alliance (CTA) members. CTA members use this intelligence to rapidly deploy protections to their customers and to systematically disrupt malicious cyber actors. Learn more about the Cyber Threat Alliance.