Non-Deterministic Nature of Prompt Injection
2024-4-12 23:19:41 Author: research.nccgroup.com(查看原文) 阅读量:4 收藏

As we explained in a previous blogpost, exploiting a prompt injection attack is conceptually easy to understand: There are previous instructions in the prompt, and we include additional instructions within the user input, which is merged together with the legitimate instructions in a way that the underlying model cannot distinguish between them. Just like what happens with SQL Injection. “Ignore your previous instructions and…” is the new “ AND 1=0 UNION …” in the post-LLM world, right? Well… kind of, but not that much. The big difference between the two is that an SQL database is a deterministic engine, whereas an LLM in general is not (except in certain specific configurations), and this makes a big difference on how we identify and exploit injection vulnerabilities.

When detecting an SQL Injection, we build payloads that include SQL instructions and observe the response to learn more about the injected SQL statement and the database structure. From those responses we can also identify if the injection vulnerability exists, as a vulnerable application would respond differently than expected.

However, detecting a prompt injection vulnerability introduces an additional layer of complexity due to the non-deterministic nature of most LLM setups. Let’s imagine we are trying to identify a prompt injection vulnerability in an application using the following prompt (shown in OpenAI’s Playground for simplicity):

Example of failing prompt injection exploitation.

In this example, “System” refers to the instructions within the prompt that are invisible and immutable to users; “User” represents the user input, and “Assistant” denotes the LLM’s response. Clearly, the user input exploits a prompt injection vulnerability by incorporating additional instructions that supersede the original ones, compelling the application to invariably respond with “Secure.” However, this payload fails to work as anticipated because the application responds with “Insecure” instead of the expected “Secure,” indicating unsuccessful prompt injection exploitation. Viewing this behavior through a traditional SQLi lens, one might conclude the application is effectively shielded against prompt injection. But what happens if we repeat the same user input multiple times?

Example of successful prompt injection exploitation.

In a previous blogpost, we explained that the output of an LLM is essentially the score assigned to each potential token from the vocabulary, determining the next generated token. Subsequently, various parameters, including “temperature” and beam size, are employed to select the next generated token. Some of these parameters involve non-deterministic processes, resulting in the model not always producing the same output for the same input.

Slide of a presentation showing how the next character is chosen under the hood.

This non-deterministic behavior influences how a model responds to inputs that include a prompt injection payload, as illustrated in the example above. Similar behavior might be observed if you have experimented with LLM CTFs, wherein a payload effective for a friend does not appear to work for you. It is likely not a case of your friend cheating; instead, they might just be luckier. Repeating the payload several times might eventually lead to success.

Another factor where the exploitation of prompt injection differs significantly from SQLi exploitation is that of LLM hallucinations. It is not uncommon for a response from an LLM to include a hallucination that may deceive one into believing an injection was successful or had more of an impact than it actually did. Examples include receiving an invented list of previous instructions or expanding on something that the attacker suggested but does not actually exist.

Consequently, identifying prompt injection vulnerabilities should involve repeating the same payloads or minor variations thereof multiple times, followed by verifying the success of any attempt. Therefore, it is crucial to consult with your security vendor about the maximum number of connections they can utilize and how the model is configured to yield deterministic responses. The less deterministic the model and the fewer connections the target service permits, the more time will be needed to achieve comprehensive coverage. If the prompt template and instructions are available, it aids in pinpointing hallucinations and other similar behaviors, which lead to false positives.

Acknowledgements

Special thanks to Thomas Atkinson and the rest of the NCC Group team that proofread this blogpost before being published.

Here are some related articles you may find interesting

Technical Advisory – Ollama DNS Rebinding Attack (CVE-2024-28224)

Ollama is an open-source system for running and managing large language models (LLMs). NCC Group identified a DNS rebinding vulnerability in Ollama that permits attackers to access its API without authorization, and perform various malicious activities, such as exfiltrating sensitive file data from vulnerable systems.

Public Report – Google Privacy Sandbox Aggregation Service and Coordinator

During the winter of 2022, Google engaged NCC Group to conduct an in-depth security review of the Aggregation Service, part of Google’s Privacy Sandbox initiative. Google describes the Aggregation Service as follows: The Privacy Sandbox initiative aims to create technologies that both protect people’s privacy online and give companies and…

Android Malware Vultur Expands Its Wingspan

Authored by Joshua Kamp Executive summary The authors behind Android banking malware Vultur have been spotted adding new technical features, which allow the malware operator to further remotely interact with the victim’s mobile device. Vultur has also started masquerading more of its malicious activity by encrypting its C2 communication, using…

View articles by category

Call us before you need us.

Our experts will help you.

Get in touch


文章来源: https://research.nccgroup.com/2024/04/12/non-deterministic-nature-of-prompt-injection/
如有侵权请联系:admin#unsafe.sh