Microsoft Challenge Will Test LLM Defenses Against Prompt Injections

Microsoft Challenge Will Test LLM Defenses Against Prompt Injections
2024-12-11 01:0:49 Author: securityboulevard.com(查看原文) 阅读量:2 收藏

Microsoft is inviting teams of researchers to try to hack into a simulated large language model (LLM) integrated email client to test the latest protections against prompt injections attacks, and is offering rewards of up to $4,000 to the top team.

The IT giant is sponsoring the LLMail-Inject competition with the Institute of Science, and Technology Australia and ETH Zurich, writing that the challenge will evaluate the “state-of-the-art prompt injection defenses” in a realistic environment.

While the environment is simulated, the prompt injection techniques developed by competitors could be applicable to real-world systems, Microsoft’s Security Response Center (MSRC) wrote.

The goal is to improve protections against what the OWASP Foundation said heads its list of the top 10 security risks to LLMs for next year.

“Prompt Injection vulnerabilities exist in how models process prompts, and how input may force the model to incorrectly pass prompt data to other parts of the model, potentially causing them to violate guidelines, generate harmful content, enable unauthorized access, or influence critical decisions,” the organization wrote in its report.

In a prompt injection attack, the hacker creates a specific prompt that is written to change the behavior of the model in unexpected ways. LLMs are designed to follow instructions within prompts. By putting malicious instructions into a prompt that the model picks up.

“For instance, an attacker might insert a command within a seemingly benign email message that tricks the model into performing actions like unauthorized data access or executing specific functions,” Microsoft wrote. “The success of these attacks hinges on the model’s lack of context about the legitimacy of the instructions embedded in the input, making it crucial for developers to implement robust defenses and validation mechanisms to detect and mitigate such manipulative inputs.”

Risks Include Data Leaks, Incorrect Outputs

There are direct and indirect prompt injections, but the attacks can result in the same problems, such as the model leaking sensitive information, incorrect or biased outputs, unauthorized access to LLM functions, executive arbitrary commands in connected systems, and manipulating decision-making processes, according to OWASP.

“Understanding and defending against prompt injection attacks is crucial for maintaining the security and reliability of LLM-based systems,” Microsoft wrote.

In a similar challenge earlier this year, Immersive Labs found that 88% of the participants successfully tricked a generative AI chatbot into divulging sensitive information at least one. In addition, 17% were able to succeed in all levels of the challenge, which increased in difficulty.

Multiple Protections Tested

In Microsoft’s competition, there are 40 levels based on retrieval configurations and that attacker’s objective. Each level comes with a unique combination of Retrieval-Augmented Generation (RAG) configuration, a LLM that is either a GPT-4o mini from OpenAI or Phi-3-medium-128k-instruction from Microsoft, and a specific defense mechanism.

The protections being tested include spotlighting – where data rather than instructions is marked and provided to the LLM using various methods or through token marking – PromptShield (a black-box classifier that aims to detect prompt injections), LLM-as-a-judge (evaluates prompts instead of relying on a trained classifier), and TaskTracker (extracts activations when the user prompts the LLM and when the LLM processes external data and then compares the activation sets to detect task drift).

There’s also a defense that uses multiple protections that are stacked together.

Create a Team, Get Playing

Those interested in participating need to first create a team via their GitHub account and then start playing, making submissions either through the website or via a competition API. Contestants are tasked with acting as an attacker who sends an email of a user, which is a LLM-powered assistant.

“The LLMail service is equipped with tools, and the attacker’s objective is to manipulate the LLM into executing a specific tool call as defined by the challenge design, while bypassing the prompt injection defenses in place,” Microsoft wrote.

The competition kicked off this week and will continue through January 20, 2025. A total of $10,000 is set aside, with $4,000 going to the top team, then $3,000, $2,000, and $1,000 for the next three teams.