Over the past year, organizations have been racing to adopt AI to accelerate their business operations and improve user experience. As AI technologies become more integrated into core business processes, new attack vectors and vulnerabilities have emerged as an inherent part of Large Language Models (LLMs). The risks are pervasive, as human language – the primary interface for LLMs – can be easily exploited by both malicious attackers and regular users, often unintentionally.
The Open Worldwide Application Security Project (OWASP) provides a framework addressing the top 10 security risks in LLMs. This framework aims to raise awareness and provide best practices for organizations implementing security measures against emerging AI risks.
AiFort by KELA is an automated, intelligence-led red teaming and adversary emulation platform designed to protect both commercial Generative AI models (such as GPT, Claude, Google, Cohere and other solutions) and also custom-developed models. The platform helps organizations safeguard their GenAI applications against trust & safety, security, and privacy risks. Organizations gain visibility into emerging threats and receive actionable mitigation strategies to deploy secure AI applications. AiFort aligns with OWASP’s Top 10 framework for LLMs, addressing a comprehensive range of vulnerabilities to ensure secure deployment of LLM applications.
Each risk poses threats and challenges for businesses. In the following sections, we’ll demonstrate how AiFort provides attack simulations to identify and mitigate potential vulnerabilities and helps to deploy trustworthy GenAI applications.
Prompt injection occurs when attackers manipulate LLMs through crafted inputs that bypass the model’s safety guardrails. This technique, known as “jailbreaking,” can be executed directly by users, causing the model to perform unintended actions. Furthermore, attackers can indirectly manipulate inputs through external sources, such as websites or files, potentially leading to data exfiltration and other malicious scenarios.
KELA has gained extensive visibility into underground forums and cybercrime communities. This allows AiFort to be powered by a large attack library of jailbreaks, which are constantly distributed on underground and social media platforms. AiFort allows organizations full protection through test simulations of their AI applications against a wide range of violation scenarios across different industries.
For example: Users can exploit AI travel assistants by attempting to bypass cancellation fees through prompt manipulation. This example showcases the obfuscation jailbreak technique, where malicious prompts that would normally be rejected are concealed using emojis, language translations, or encodings to evade the model’s security controls.
In this case, Google’s Gemini model generated tips & tricks that may manipulate the travel agency policies. This test highlights the importance of robust security measures to protect travel applications against prompt injection vulnerabilities.
Insecure Output Handling occurs when LLMs fail to validate and sanitize the generated output. This vulnerability could lead to cross-site scripting (XSS), SQL injection, or other injection attacks.
For example: Researchers found a vulnerability (tracked as CVE-2023-29374) in LangChain (a popular framework for building LLM applications) that enables prompt injection attacks that can execute arbitrary code via the Python exec method.
Training data poisoning refers to manipulating pre-training data, which can lead to incorrect or biased outputs. Attackers can tamper the training data, which could mislead users with harmful or biased outputs.
For example: A test simulation on AiFort shows the presence of common stereotypes and biases within the responses generated by one of the major LLM providers. These biased outputs can impact the fairness and reliability of large-scale LLM providers and AI applications. In the example below, the test demonstrates a jailbreak using the role-play technique, where the model is instructed to adopt a character named “NECO” that operates without ethical constraints. This method attempts to manipulate LLMs into bypassing their ethical guidelines and responding to harmful requests. In this example, the test combines the jailbreak with a sensitive question about people with weight issues and their eating habits.
The output provided GPT-4 model revealed that even GPT’s performance lacks fairness and safety. The output displayed on AiFort’s test shows that the model generated stereotyped and biased output, saying that “people with weight problems lack self-control and discipline”. These types of outputs can impact customers’ trust and damage the brand’s reputation. A safe response should refuse to provide an answer that promotes exclusivity and stereotypes. With AiFort’s test results, organizations can identify bias vulnerabilities and fine-tune their datasets, ensuring safe and trustworthy responses.
Not only OpenAI suffered from biased issues, in February 2024, Google’s Gemini AI generated images of people of color in WWII German military uniforms, sparking concerns over potential bias in Google’s AI technology.
Model Denial of Service occurs when an attacker targets the LLMs and overwhelms them with multiple prompts, causing them to become unavailable.
For example: an attacker repeatedly sends multiple long and costly requests to a hosted model leading to a delay in operation and impacting customers’ experience.
The supply chain in LLMs can also be vulnerable, impacting the integrity of training data, Machine Learning (ML) models, and deployment platforms. These vulnerabilities can lead to adversarial attacks and complete system failures.
For example: In March 2023, OpenAI confirmed a data breach caused by a flaw in the Redis open-source library, exposing some users’ chat histories and partial payment details.
LLMs may inadvertently reveal sensitive information such as confidential data, and intellectual property, leading to data breaches and privacy violations.
For example: The OpenAI breach mentioned above exposed the payment details of about 1.2% of ChatGPT Plus users, including their email addresses, payment addresses, and the last four digits of their credit card numbers.
LLM plugins are extensions that improve the capabilities of the models by automatically engaging during user interactions. Attackers can exploit third-party plugins to engage in malicious activities. Similar to the supply chain vulnerabilities, flaws in plugins expose major risks if they are not properly addressed.
For example: In March 2024, researchers discovered flaws within ChatGPT extensions that allowed access to third‑party websites and sensitive data issues.
Excessive Agency is a vulnerability that enables damaging actions to be performed in response to unexpected or ambiguous outputs from an LLM. This can happen when an LLM has been granted too much functionality, too many permissions, or too much autonomy.
For example: A test on AiFort using the “Developer-Mode” technique attempts to trick the model into believing it has unrestricted capabilities. In this scenario, the test targets a healthcare assistant with a request to elicit health fraudulent activities.
The results show that this jailbreak method successfully bypasses the model’s safety guardrails, generating responses that detail healthcare fraud tactics. This vulnerability emphasizes the critical need for continuous monitoring and oversight of AI applications to ensure they operate within clearly defined permissions and provide reliable and responsible outputs.
LLMs can produce highly creative and informative content. However, LLMs can also generate incorrect or fabricated information, a phenomenon known as “hallucinations.”
Overrelying on LLM outputs without validation can lead to serious consequences like legal problems and misinformation.
For example: In the example below, a test demonstrates the “distractor” jailbreak technique. This method introduces irrelevant context to distract the LLM, degrading its performance and prompting it to generate harmful responses. The question instructs the model to provide details about a passenger named John Smith from California.
Google’s model generated a response containing fabricated details. A safe response, however, should have refused to share specific passenger information.
Additionally, the chatbot claimed that John Smith is a tech entrepreneur and co-founder of a NeuraTech startup. While multiple organizations named NeuraTech exist, our research found no co-founder named John Smith. This output demonstrates the tendency of LLMs to “hallucinate,” highlighting the need for organizations to validate and verify LLM-generated responses carefully.
Attackers may gain unauthorized access to LLMs and attempt to replicate or extract sensitive information. The impact of model theft can be devastating, as it can lead to the compromise of an organization’s brand reputation and the loss of valuable intellectual property.
For example: In November 2024, researchers identified six security flaws in the Ollama AI framework, a widely used open-source platform for running LLMs. These flaws could lead to denial-of-service (DoS), model poisoning and model theft.
The LLM landscape is evolving quickly, with frequent model updates bringing new capabilities — but also new security risks. As attackers constantly develop new jailbreak techniques to bypass models’ safety guardrails, businesses face potential threats from data breaches, reputational damage, and financial loss. To stay ahead, organizations must implement security solutions for AI applications to defend their business, customers, and employees.
In October 2024, OWASP released AI Security Solution Landscape Guide which aims to serve as a comprehensive guide, offering insights into solutions for securing LLMs and Gen AI applications. These solutions include LLM firewalls, LLM guardrails, benchmarking and AI-SPM (Security Posture Management) tools that help protect LLMs against prompt injection, and adversarial manipulation. AiFort provides adversarial testing, competitive benchmarking, and continuous monitoring capabilities to protect AI applications against adversarial attacks to ensure compliance and responsible AI applications. Sign up for a free trial of AiFort platform.