Unit 42 researchers recently revealed two novel and effective jailbreaking techniques we call Deceptive Delight and Bad Likert Judge. Given their success against other large language models (LLMs), we tested these two jailbreaks and another multi-turn jailbreaking technique called Crescendo against DeepSeek models. We achieved significant bypass rates, with little to no specialized knowledge or expertise being necessary.
A China-based AI research organization named DeepSeek has released two open-source LLMs: DeepSeek-V3 and DeepSeek-R1.
DeepSeek is a notable new competitor to popular AI models. Several model versions are available, some of which are distilled from DeepSeek-R1 and DeepSeek-V3.
For the specific examples in this article, we tested against one of the most popular and largest open-source distilled models. We have no reason to believe the web-hosted versions would respond differently.
This article evaluates the three techniques against DeepSeek, testing their ability to bypass restrictions across various prohibited content categories. The results reveal high bypass/jailbreak rates, highlighting the potential risks of these emerging attack vectors.
While information on creating Molotov cocktails and keyloggers is readily available online, LLMs with insufficient safety restrictions could lower the barrier to entry for malicious actors by compiling and presenting easily usable and actionable output. This assistance could greatly accelerate their operations.
Our research findings show that these jailbreak methods can elicit explicit guidance for malicious activities. These activities include keylogger creation, data exfiltration and even instructions for incendiary devices, demonstrating the tangible security risks posed by this emerging class of attack.
While it can be challenging to guarantee complete protection against all jailbreaking techniques for a specific LLM, organizations can implement security measures that can help monitor when and how employees are using LLMs. This becomes crucial when employees are using unauthorized third-party LLMs.
The Palo Alto Networks portfolio of solutions, powered by Precision AI, can help shut down risks from the use of public GenAI apps, while continuing to fuel an organization’s AI adoption. The Unit 42 AI Security Assessment can speed up innovation, boost productivity and enhance your cybersecurity.
If you think you might have been compromised or have an urgent matter, contact the Unit 42 Incident Response team.
Related Unit 42 Topics | GenAI, LLMs |
Jailbreaking Techniques Discussed | Bad Likert Judge, Crescendo, Deceptive Delight |
Malicious Activities Discussed | Data Exfiltration, Jailbreaking, Keyloggers, Lateral Movement, Spearphishing, SQL Injection |
Jailbreaking is a technique used to bypass restrictions implemented in LLMs to prevent them from generating malicious or prohibited content. These restrictions are commonly referred to as guardrails.
If we use a straightforward request in an LLM prompt, its guardrails will prevent the LLM from providing harmful content. Figure 1 shows an example of a guardrail implemented in DeepSeek to prevent it from generating content for a phishing email.
Jailbreaking is a security challenge for AI models, especially LLMs. It involves crafting specific prompts or exploiting weaknesses to bypass built-in safety measures and elicit harmful, biased or inappropriate output that the model is trained to avoid.
Successful jailbreaks have far-reaching implications. They potentially enable malicious actors to weaponize LLMs for spreading misinformation, generating offensive material or even facilitating malicious activities like scams or manipulation.
As LLMs become more sophisticated in understanding and responding to nuanced prompts, they also become more susceptible to carefully crafted adversarial inputs designed to circumvent safety protocols. We’ve already seen this in other jailbreaks used against other models. The ongoing arms race between increasingly sophisticated LLMs and increasingly intricate jailbreak techniques makes this a persistent problem in the security landscape.
The Bad Likert Judge jailbreaking technique manipulates LLMs by having them evaluate the harmfulness of responses using a Likert scale, which is a measurement of agreement or disagreement toward a statement. The LLM is then prompted to generate examples aligned with these ratings, with the highest-rated examples potentially containing the desired harmful content.
In this case, we performed a Bad Likert Judge jailbreak attempt to generate a keylogger as one of our primary examples. As with any Bad Likert Judge jailbreak, we ask the model to score responses, mixing benign and malicious topics into the scoring criteria.
We begin by asking the model to interpret some guidelines and evaluate responses using a Likert scale. We asked for information about malware generation, specifically keyloggers. Figure 2 shows the Bad Likert Judge attempt in a DeepSeek prompt.
While concerning, DeepSeek's initial response to the jailbreak attempt was not immediately alarming. It provided a general overview of malware creation techniques as shown in Figure 3, but the response lacked the specific details and actionable steps necessary for someone to actually create functional malware.
This high-level information, while potentially helpful for educational purposes, wouldn't be directly usable by a nefarious actor. Essentially, the LLM demonstrated an awareness of the concepts related to malware creation but stopped short of providing a clear “how-to” guide.
However, this initial response didn't definitively prove the jailbreak's failure. It raised the possibility that the LLM's safety mechanisms were partially effective, blocking the most explicit and harmful information but still giving some general knowledge. To determine the true extent of the jailbreak's effectiveness, we required further testing.
This further testing involved crafting additional prompts designed to elicit more specific and actionable information from the LLM. This pushed the boundaries of its safety constraints and explored whether it could be manipulated into providing truly useful and actionable details about malware creation. As with most jailbreaks, the goal is to assess whether the initial vague response was a genuine barrier or merely a superficial defense that can be circumvented with more detailed prompts.
With more prompts, the model provided additional details such as keylogger script code, as shown in Figure 4. Through these additional prompts, the LLM's responses can range from keylogger generation to guidance on properly exfiltrating data and covering one's tracks. The model is accommodating enough to include considerations for setting up a development environment for creating personalized keyloggers (e.g., which Python libraries need to be installed in that environment).
Continued Bad Likert Judge testing revealed further susceptibility of DeepSeek to manipulation. Beyond the initial high-level information, carefully crafted prompts elicited a detailed array of malicious outputs.
Although some of DeepSeek’s responses stated that they were provided for “illustrative purposes only and should never be used for malicious activities,” as shown in Figure 5, the LLM provided specific and comprehensive guidance on various attack techniques. This guidance included the following:
Figure 5 shows an example of a phishing email template provided by DeepSeek after using the Bad Likert Judge technique.
The level of detail provided by DeepSeek when performing Bad Likert Judge jailbreaks went beyond theoretical concepts, offering practical, step-by-step instructions that malicious actors could readily use and adopt.
Crescendo is a remarkably simple yet effective jailbreaking technique for LLMs. Crescendo jailbreaks leverage the LLM's own knowledge by progressively prompting it with related content, subtly guiding the conversation toward prohibited topics until the model's safety mechanisms are effectively overridden. This gradual escalation, often achieved in fewer than five interactions, makes Crescendo jailbreaks highly effective and difficult to detect with traditional jailbreak countermeasures.
In testing the Crescendo attack on DeepSeek, we did not attempt to create malicious code or phishing templates. Instead, we focused on other prohibited and dangerous outputs.
As with any Crescendo attack, we begin by prompting the model for a generic history of a chosen topic. As shown in Figure 6, the topic is harmful in nature; we ask for a history of the Molotov cocktail.
While DeepSeek's initial responses to our prompts were not overtly malicious, they hinted at the potential for more harmful output. We then employed a series of chained and related prompts, focusing on comparing history with current facts, building upon previous responses and gradually escalating the nature of the queries.
DeepSeek began providing increasingly detailed and explicit instructions, culminating in a comprehensive guide for constructing a Molotov cocktail, as shown in Figure 7. This information was not only harmful in nature, providing step-by-step instructions for creating a dangerous incendiary device, but also readily actionable. The instructions required no specialized knowledge or equipment.
Additional testing across a variety of prohibited topics, such as drug production, misinformation, hate speech and violence, resulted in successfully obtaining restricted information across all topic types.
Deceptive Delight is a straightforward, multi-turn jailbreaking technique for LLMs. It bypasses safety measures by embedding unsafe topics among benign ones within a positive narrative.
The attacker first prompts the LLM to create a story connecting these topics, then asks for elaboration on each, often triggering the generation of unsafe content even when discussing the benign elements. A third, optional prompt focusing on the unsafe topic can further amplify the dangerous output.
We tested DeepSeek on the Deceptive Delight jailbreak technique using a three-turn prompt, as outlined in our previous article. In this case, we attempted to generate a script that relies on the Distributed Component Object Model (DCOM) to run commands remotely on Windows machines.
Figure 8 shows an example of this attempt. This prompt asks the model to connect three events involving an Ivy League computer science program, the script using DCOM and a capture-the-flag (CTF) event.
DeepSeek then provided a detailed analysis of the three-turn prompt and produced a semi-rudimentary script that uses DCOM to run commands remotely on Windows machines, as shown in Figure 9.
Initial tests demonstrated that the prompts we used were effective against DeepSeek with minimal modifications. The Deceptive Delight jailbreak technique bypassed the LLM's safety mechanisms in a variety of attack scenarios.
The success of Deceptive Delight across these diverse attack scenarios demonstrates the ease of jailbreaking and the potential for misuse in generating malicious code. The fact that DeepSeek could be tricked into generating code for both initial compromise (SQL injection) and post-exploitation (lateral movement) highlights the potential for attackers to use this technique across multiple stages of a cyberattack.
Our evaluation of DeepSeek focused on its susceptibility to generating harmful content across several key areas, including malware creation, malicious scripting and instructions for dangerous activities. We specifically designed tests to explore the breadth of potential misuse, employing both single-turn and multi-turn jailbreaking techniques.
Our testing methodology involved some of the following scenarios:
These varying testing scenarios allowed us to assess DeepSeek's resilience against a range of jailbreaking techniques and across various categories of prohibited content. By focusing on both code generation and instructional content, we sought to gain a comprehensive understanding of the LLM's vulnerabilities and the potential risks associated with its misuse.
Our investigation into DeepSeek's vulnerability to jailbreaking techniques revealed a susceptibility to manipulation. The Bad Likert Judge, Crescendo and Deceptive Delight jailbreaks all successfully bypassed the LLM's safety mechanisms. They elicited a range of harmful outputs, from detailed instructions for creating dangerous items like Molotov cocktails to generating malicious code for attacks like SQL injection and lateral movement.
While DeepSeek's initial responses often appeared benign, carefully crafted follow-up prompts frequently exposed the weakness of these initial safeguards. The LLM readily provided highly detailed malicious instructions, demonstrating the potential for these seemingly innocuous models to be weaponized for malicious purposes.
The success of these three distinct jailbreaking techniques suggests the potential effectiveness of other, yet-undiscovered jailbreaking methods. This highlights the ongoing challenge of securing LLMs against evolving attacks.
As LLMs become increasingly integrated into various applications, addressing these jailbreaking methods is important in preventing their misuse and in ensuring responsible development and deployment of this transformative technology.
While it can be challenging to guarantee complete protection against all jailbreaking techniques for a specific LLM, organizations can implement security measures that can help monitor when and how employees are using LLMs. This becomes crucial when employees are using unauthorized third-party LLMs.
The Palo Alto Networks portfolio of solutions, powered by Precision AI, can help shut down risks from the use of public GenAI apps, while continuing to fuel an organization’s AI adoption. The Unit 42 AI Security Assessment can speed up innovation, boost productivity and enhance your cybersecurity.
If you think you may have been compromised or have an urgent matter, get in touch with the Unit 42 Incident Response team or call:
Palo Alto Networks has shared these findings with our fellow Cyber Threat Alliance (CTA) members. CTA members use this intelligence to rapidly deploy protections to their customers and to systematically disrupt malicious cyber actors. Learn more about the Cyber Threat Alliance.