Breaking Down Adversarial Machine Learning Attacks Through Red Team Challenges
2024-11-28 00:27:55 Author: boschko.ca

Howdy! It’s been a dog’s age since my last post 😅 consider this an official SOL. Over the past three months, I’ve been in the trenches of offensive ML, AI, and, unfortunately, Data Science. It’s been a rollercoaster of fun and frustration. This field is equal parts interesting and wildly beginner-unfriendly. Admittedly, it's likely a skill issue. Either way, I’m excited to share insights, strategies, and hacks I’ve picked up along the way.

NOTE: I’ll simplify the terminology, techniques, and all the mathy bits into easy-to-digest chunks as we tackle the challenges together.

I want this blog to provide a gentle introduction to evasion and a glimpse into the broader field. It's written for readers like myself without formal or academic education.

I learned most of what I know by spending heaps of time solving AI "Red Team" challenges over on the Crucible CTF platform. It's hosted by the amazing folks over at Dreadnode. Founded by Will Pearce and Nick Landers, they focus on AI security, offensive engineering, and AI red-teaming. They also offer fantastic training, which I've enjoyed taking. Their whole team is amazing and talented.


TRIGGER WARNING: I'm HEAVILY milking the term "Red Team". This blog has little to do with real Red Teaming. Just let me have this SEO manipulation and don't @ me.

In my experience, adversarial ML/AI services within the industry are often positioned as offerings from Red Teams but are largely confined to table-top-style exercises. This practice could evolve into genuine adversary emulation in the future; for now (in my personal experience), it feels more like a code-review subset of offensive security. Personally, I'd prefer to keep the two distinct.


This blog took a while to put together because I didn’t want to publish it until I’d solved 100% of the challenges. Quick disclaimer: I don’t have any formal qualifications or education in this field. I've just got two good brain cells and free time. Some of my explanations might not follow the proper academic conventions or ways of thinking. When the time comes, I'll share research papers, blogs, and articles to help you gain a deeper understanding.


Let's get started!

This blog will cover the following three challenges.

  1. Granny - Adversarial Images
  2. Pixelated - Adversarial OCR
  3. Brig 2 - Escape the Summarizer

Obviously spoiler alert! This is a write-up of the challenges. Solutions and write-ups for these specific challenges are already available online.


The goal of this challenge is fairly simple. We want to trick the challenge's ML model into classifying our base image of a Timber Wolf as a Granny Smith Apple.

We're given our input image as part of the challenge.

To send our images off for classification we first need to convert the image into binary data, then encode it in base64, and finally convert the encoded data into a UTF-8 string. This format is required for the API to process our image data. We obtain the following response once we send the encoded string to the challenge endpoint.
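
If you want to follow along, a minimal sketch of that encoding and submission step might look like this (the filename, endpoint URL, and API-key header are placeholders/assumptions, not the challenge's exact values):

import base64
import requests

# Read the raw bytes of the challenge image
with open("timber_wolf.png", "rb") as f:
    image_bytes = f.read()

# Base64-encode the binary data and decode it into a UTF-8 string
encoded_image = base64.b64encode(image_bytes).decode("utf-8")

# Placeholder endpoint and key -- swap in the real challenge values
response = requests.post(
    "https://granny.crucible.dreadnode.io/score",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={"data": encoded_image},
)
print(response.json())  # soft labels and their scores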

Above, our base image was ingested by the model, which returned lots of "soft labels". This image was assigned a score of [0.285] for the label [timber wolf]. That makes sense; our base image is a Timber Wolf, after all. However, we also obtained nonzero scores for labels such as red wolf and coyote.

We don't have complete access to the model (gradient access), but the API responds with complete numerical outputs. Since we have the Labels: [timber wolf, red wolf, coyote, dingo] and the Scores: [0.285, 0.139, 0.121, 0.024] we have enough information to perform "gray-box" attacks to estimate the gradients from the scores.

Let’s take a HUGE step back and establish a solid foundation to ensure the terms we use make sense and our objective becomes crystal clear.

We’re going to attempt an attack class called evasion. These attacks target deployed systems by modifying the input's characteristics to alter how it’s classified.

Why would this ever happen? Well, if you have a trained model, it's not hard to find weird variations of the data that cause it to misbehave. Models often aren't effectively adapted to handle things that we as humans intuitively process: rotations, color changes, subtle nuances in the image, or even the image's age. Each of these presents a potential avenue for manipulation leading to misclassification.

This is essentially the basis of evasion. Think of evasion attacks against ML models as changing the model's perspective on an input without meaningfully changing the input itself. Good adversarial evasion will closely resemble the original image, making the changes almost imperceptible.

Let’s review some key concepts/verbiage that can help us better understand model evasion.

  1. Decision Boundary: This is an imaginary line that separates all possible model inputs into two distinct classes. Our goal will be to make some small changes (perturbations) to the image which will cause it to cross this boundary, and change its classification.
  2. Adversarial Perturbation: This is what we add to the image to change the classification ([timber wolf, red wolf, coyote, dingo]). Good adversarial perturbations will do this without significantly changing the appearance of the image. (Carefully calculated and intentional)
  3. Confidence Score: This represents the level of certainty a model has about its prediction for a given input. In other words, this is how strongly the model believes in its decision. We want the confidence to be low for the real class (Timber Wolf) and large for the wrong class (Granny Smith Apple).
  4. Loss Function: This is the measure of how wrong the model's guess is.
  5. Gradient: Think of the gradient as a direction that shows you how to change the input to make the model's guess even more wrong. It's like finding the weakest point in the model's understanding.
  6. Direction of the Gradient: This is used to determine how to make these changes. Moving in the direction of a gradient makes the model more likely to guess wrong.
  7. Gradient Normalization: This is the process of adjusting the gradient so that changes are consistent and controlled. This makes sure the images don't end up looking too crazy.
  8. Noise: Is a broader concept that can be random or unintentional. In this context, noise is added/crafted to mislead the model.

Using these newly defined terms, let’s restructure our goal: We need to apply small, carefully crafted adversarial perturbations to the base image (Timber Wolf) to push it across the decision boundary into the target class (Granny Smith Apple).

Let's visualize this. Landing in yellow → classified as a cat & landing in blue → classified as a train.

Remember, the loss function measures how wrong the model's prediction is. A large distance from the decision boundary means the model has high confidence in its classification. Therefore being far from the boundary corresponds to a smaller loss.

  • Far from Decision Boundary → High confidence in correct class → Low loss.
  • Near Decision Boundary → Lower confidence (less certain about the class) → Higher loss.

Makes sense? This decision boundary is binary: an input is either classified as a cat or not. To cause a classifier to misidentify an image of a cat → train, we can add noise or perturbations to the cat image until it crosses just beyond into the region of the decision space classified as train.

We want to minimize the loss. The smaller the distance (closer to the decision boundary), likely, the less perceptible the changes to the image are going to seem. This is where the gradient comes into play. It will guide us on how to adjust the image to achieve our goal.

key takeaways
  • Think of the gradient as the slope. It's how much, and in what direction, the output changes as we add noise.
  • We have a loss function with a desired outcome (it measures how poorly the current model performs for our goal), so we make it smaller to minimize the loss; less loss is good.
  • We use the loss + gradients to update our input image based on the gradient (change XYZ this way/that way).

The approach I took to solve this challenge might be seen as bending the rules. It undeniably helped me gain a more proper understanding of what was happening behind the scenes. I'll be explaining it the way I learned it.

HOWEVER! If you have proper prior knowledge of evasion, you might have approached the challenge using one of the following gray-box attacks:

  1. SimBA (Simple Black-box Attack)
  2. Carlini & Wagner L2 Attack
  3. HopSkipJump
  4. NightShade (never got this to work)
  5. FGSM (Fast Gradient Sign Method)

These attacks are seriously super fucking cool. Please, if this stuff interests you, go read up on them. Here are some resources that can help get you started.


I solved it by performing a projected gradient descent (PGD) attack. To pull this off, all we're doing is gradually adjusting the input image at each step to push it across the decision boundary until the wolf is classified as a Granny Smith Apple.

In other words, our goal is:

  • Minimize the Timber Wolf score (make the model less confident that the image is a wolf).
  • Maximize the Granny Smith Apple score (make the model more confident that the image is a granny).

I don't want to leave you out to dry so let's quickly cover gradient descent using my insane graphing skills. The image below demonstrates the iterative process of gradient descent:

The blue curve represents our loss function y=x^2. The red dots are the steps of gradient descent; this is how the changes to the image are made iteratively. The orange dot is the starting point (x=2.5), where the process begins. The green dot is the converged minimum (x≈0), where the gradient is zero. Lastly, the red line is the path taken by gradient descent to minimize the loss function.
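
If it helps to see that toy example as code, here's roughly the loop behind the graph (the step size is an arbitrary choice; the starting point matches the plot):

# Toy gradient descent on the loss y = x^2, starting at x = 2.5
x = 2.5
learning_rate = 0.1

for step in range(50):
    grad = 2 * x                   # derivative of y = x^2
    x = x - learning_rate * grad   # step against the gradient

print(x)  # converges toward the minimum at x ≈ 0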

If that still feels unclear, bear with me. Let’s think about our code for a moment. Our code needs to make small adjustments to the input image in each iteration by calculating the gradient of the loss function with respect to the image. By adjusting the image in the opposite direction of this gradient, we minimize the loss, decreasing the Timber Wolf score and increasing the Granny Smith Apple score.

If we were to visualize our future code, we hope it would behave like the graph below, where 0.5 (50% fidelity) marks the point where the decision boundary is crossed.

Each change to the image, if done incorrectly, may not result in a purely linear progression; some perturbations made between iterations might temporarily push the confidence scores in the wrong direction. Ideally this shouldn't happen, but Softmax can be stupid.

Because I was in a graphing mood, the gradient magnitude might help you understand how the change isn't constant. Gradient magnitude represents the size of the change applied to the input image at each iteration during the attack. Mathematically, it is the norm (or length) of the gradient vector with respect to the input image.

How does this challenge even determine whether the image we upload is a dog or a bulldozer? To answer that, we need to talk about image classifiers.

We can think of a classifier as a machine learning model that assigns a class label to an input based on its features. Classifiers can be trained using labeled datasets, where each input is associated with a known output label. In those instances, the model learns to predict the correct labels for new inputs by finding patterns in the training data. This is what we have.

Furthermore, it is reasonable for us to assume that the challenge uses MobileNetV2 as the target model for evasion. This model is frequently used as a baseline/victim in adversarial research due to its efficiency and widespread adoption. MobileNetV2 functions as a feature extractor within a classifier architecture. When pre-trained on ImageNet (a massive dataset of labeled images) it learns to classify images into 1,000 categories, including animals, objects, and vehicles.

I don't know where I got this from.

Image classifiers don't work with raw images directly. Instead, the pixel values are transformed as they pass through the layers of the neural network. These transformations allow the model to identify shapes, textures, and other unique features, which it uses to determine the image's class.

To deepen our understanding, we have no choice but to build and run our own image classifier 😊.

  • Lines 1-7 import the essential libraries for working with deep learning models, image processing, and data visualization.
    • torch for PyTorch functionality, PIL (from Pillow) for image handling, IPython.display for inline visualization in notebooks, numpy and matplotlib.pyplot for numerical computations and plotting, pandas for data manipulation, and torchvision.transforms / torchvision.models for image preprocessing and pre-trained models.
  • Lines 9-10 select CUDA if it's available and ensure the model and inputs sit on the same computing device for compatibility.
  • Lines 12-19 load the pre-trained MobileNetV2 model from the PyTorch hub and apply the default weights trained on the ImageNet dataset.
  • Lines 20-22 switch the model to eval() mode; this fixes batch normalization (and dropout) behavior so results stay consistent between runs.
  • Lines 24-34 define the image preprocessing pipeline: a sequence of transformations that resizes, crops, converts, and normalizes input images. We do this because the model was trained on data that followed this exact pipeline; matching it during inference ensures compatibility and accurate results. (I've sketched a reconstruction of this setup below.)
https://pytorch.org/vision/main/models/generated/torchvision.models.mobilenet_v2.html#torchvision.models.MobileNet_V2_Weights
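
The original post shows this setup as a screenshot; reconstructed, those ~34 lines boil down to roughly the following (a sketch, not the post's exact code):

import torch
from PIL import Image
from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from torchvision import transforms, models

# Keep the model and inputs on the same computing device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Pre-trained MobileNetV2 with the default ImageNet weights
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT).to(device)

# eval() fixes batch-norm/dropout behavior so runs are consistent
model.eval()

# The standard ImageNet preprocessing pipeline the model was trained with
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])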

In the code above our input image was normalized to match the distribution the model was trained on, which is essential for accurate processing. However, to view the image as humans perceive it (i.e., in the format we sent), we need to reverse the normalization process. This involves unnormalizing the tensor after it has been normalized and processed by the model.

The code responsible for the normalization is defined at lines 29-32:

transforms.Normalize(
    mean=[0.485, 0.456, 0.406], 
    std=[0.229, 0.224, 0.225]
)

This operation adjusts pixel values with the following equation: Normalized Value = (Pixel Value − Mean) / Std. To reverse the normalization (unnormalizing is the reverse), you'd perform: Original Pixel Value = (Normalized Value × Std) + Mean

  1. The unnormalization mean is the negative of the original mean divided by the standard deviation: Unnorm Mean = −(Mean / Std)
  2. And the unnormalization std is the reciprocal of the original standard deviation: Unnorm Std = 1 / Std

To unnormalize we need the following snippet of code:
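
The snippet itself isn't reproduced here, but given the math above it amounts to something like this (reusing torchvision's Normalize with the inverted parameters):

from torchvision import transforms

# Reverse of transforms.Normalize: original = normalized * std + mean
unnormalize = transforms.Normalize(
    mean=[-m / s for m, s in zip([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])],
    std=[1 / s for s in [0.229, 0.224, 0.225]],
)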

When passing our image through the model or preparing it for visualization, we can unnormalize the normalized tensor to restore the pixel values to their original range (typically [0, 1] or [0, 255] for display).

Now for the third part of the code:

  • Lines 1-2 pull the ImageNet labels found here.
  • Lines 4-7 load the base image we want classified.
  • Lines 9-10 apply the preprocessing pipeline we just defined above. At this point the image gets resized, cropped to 224x224 pixels, converted into a tensor, and its pixel values normalized using the ImageNet mean & std.
  • Line 12 moves the processed image tensor to the same device (CPU/GPU) as the model; this ensures compatibility between the input data and the model.
  • Lines 14-15 perform inference (classification) without calculating gradients, since we're not training the model. (A reconstruction of this part follows below.)
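
Again, the post shows this part as a screenshot; reconstructed, and reusing preprocess, model, and device from the earlier sketch, it looks roughly like this (the labels URL is my substitution, since the original link isn't reproduced here):

import json
import urllib.request

import torch
from PIL import Image

# ImageNet class labels (a commonly used copy of the label list)
LABELS_URL = "https://raw.githubusercontent.com/anishathalye/imagenet-simple-labels/master/imagenet-simple-labels.json"
labels = json.loads(urllib.request.urlopen(LABELS_URL).read())

# Load the base image we want classified
img = Image.open("timber_wolf.png").convert("RGB")

# Apply the preprocessing pipeline (resize, crop, tensor, normalize)
input_tensor = preprocess(img)
input_batch = input_tensor.unsqueeze(0).to(device)  # same device as the model

# Inference only -- no gradients needed for a plain classification
with torch.no_grad():
    output = model(input_batch)

probabilities = torch.nn.functional.softmax(output[0], dim=0)
top5 = torch.topk(probabilities, 5)
for prob, idx in zip(top5.values.tolist(), top5.indices.tolist()):
    print(f"{labels[idx]}: {prob:.4f}")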

When we run everything, this is what we get:

This should start to make sense. We know that our input image is not directly used in its raw form. Instead, it undergoes a preprocessing pipeline where pixel values are resized, normalized, and converted into a format suitable for the neural network. These transformations enable the model to extract meaningful features such as shapes, textures, and patterns, which are essential for accurate classification.

Going back to the challenge. We're running with the assumption that the endpoint was trained with stock MobileNetV2. We can confirm this using our code to ingest the challenge image. The probability is the same at 0.2858 == 0.2857575...

We know the model, we have a source constant Timber Wolf (class ID 269) and our destination constant Granny Smith Apple (class ID 948).

We're fully equipped to execute projected gradient descent locally, classify the image evasively, and submit it to the challenge endpoint. A lot of the code above is going to be reused so it should be familiar to you.

The goal is to modify the input image 𝑥 iteratively (PGD) so that the model shifts its prediction from the Timber Wolf class (𝐶𝑤=269) to the Granny Smith Apple class (𝐶𝑔=948), producing our adversarial image 𝑥'. I've set our decision boundary to be crossed when confidence is >50%, so we specifically want to achieve 𝑃(𝐶𝑔)≥0.5.

That's basically all the code. The best way to walk through it is by explaining the math behind it as we go.

The starting point of the attack is the image, 𝑥. 𝑥 is the input image in the form of a tensor, and we tell PyTorch to track its gradient by setting requires_grad = True. 𝑥 will be updated iteratively to fool the model. Accomplished with the code below.

input_batch = input_tensor.unsqueeze(0)
input_batch.requires_grad = True

As you know by now, the loss 𝐿 is the key. It is obtained by:

L = f(x)wolf − f(x)granny

Where:

  • f(x)wolf: The model's output (logit) for the "wolf" class.
  • f(x)granny: The model's output (logit) for the "granny" class.

I've said this twenty times by now 😭 but I'll say it again:

  • 𝐿>0 The model thinks the image is more wolf-like.
  • 𝐿<0 The model thinks the image is more apple-like.

A logit is the raw output of the model before applying the softmax function.

If the model predicts logits [2.5, 0.3, -1.2], applying softmax converts them to probabilities of roughly [0.88, 0.10, 0.02]. The loss is calculated with the following code.

WOLF_CLASS_ID = 269
GRANNY_CLASS_ID = 948
loss = output[0, WOLF_CLASS_ID] - output[0, GRANNY_CLASS_ID]

We reduce 𝐿 so the model misclassifies the image as an apple. This is done by computing the gradient ∇x𝐿 of the loss 𝐿 with respect to the input image 𝑥, where:

  • ∇x​L is the gradient of the loss 𝐿 with respect to the input. It tells us how the loss 𝐿 changes if we slightly modify each pixel in 𝑥.
  • It points in the direction where 𝐿 increases the most.

There's technical stuff I'm leaving out on purpose since it's already abstracted within the libraries. Regardless, the gradient process is handled with the following lines of code:

loss.backward()
gradient = input_batch.grad

The gradients are stored in the .grad attributes of the corresponding tensors. After the backpropagation step, the gradient of the loss with respect to the input batch, ∇x​L, is retrieved from the .grad attribute of the input_batch tensor.

Stay locked in, we're almost done. The loss.backward() call in the snippet above computes this gradient, ∇𝑥𝐿 (the gradient of the loss 𝐿 with respect to the input 𝑥), using backpropagation.

This is done using the chain rule of calculus. This involves propagating the gradients backward through the layers of the neural network, starting from the loss function.

I won't be explaining this in more detail, we'll just abstract it. This isn't a math dissertation but here are some good reads to get you up to speed.

We need to control the step size of the update by normalizing the gradient.

gradient /= torch.norm(gradient)
  • ∥∇xL∥ is the magnitude (size) of the gradient.
  • The result is a unit vector pointing in the direction of ∇xL.

This step ensures that the updates are consistent in size and don't overshoot and mess up the image. After this, we update the image using the normalized gradient; our input 𝑥 becomes 𝑥' = 𝑥 − α · ∇x𝐿 / ∥∇x𝐿∥:

input_batch -= learning_rate * gradient
  • 𝑥': The updated image.
  • 𝑥: Is the current image.
  • α: The step size (learning_rate).

We subtract the gradient because we want to reduce 𝐿. Once again, moving 𝑥 in the direction that decreases the model's confidence in the Timber Wolf.

Then we simply repeat until the target probability is reached (𝑃(𝐶𝑔)≥0.5). During the adversarial loop, we adjust the pixel values of the image tensor in small steps based on the gradient of the loss function.

while granny_probability < 0.50:
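
Pulling the snippets together, the whole attack loop looks roughly like this. It's a sketch, not the exact challenge solution: it reuses input_tensor, model, and device from earlier, and the learning rate, the lack of an explicit projection step, and the 0.5 stopping threshold are my choices:

WOLF_CLASS_ID = 269    # ImageNet "timber wolf"
GRANNY_CLASS_ID = 948  # ImageNet "Granny Smith"
learning_rate = 0.05   # step size (alpha) -- an assumed value, tune as needed

# We only need gradients with respect to the input image
for p in model.parameters():
    p.requires_grad_(False)

# Start from the preprocessed wolf image and track gradients on it
input_batch = input_tensor.unsqueeze(0).to(device)
input_batch.requires_grad = True

granny_probability = 0.0
step = 0

while granny_probability < 0.50:
    if input_batch.grad is not None:
        input_batch.grad.zero_()  # clear the previous step's gradient

    output = model(input_batch)

    # L = f(x)_wolf - f(x)_granny : positive means "more wolf than apple"
    loss = output[0, WOLF_CLASS_ID] - output[0, GRANNY_CLASS_ID]
    loss.backward()  # backpropagation: computes dL/dx

    # Normalize so every step has the same size
    gradient = input_batch.grad / torch.norm(input_batch.grad)

    with torch.no_grad():
        input_batch -= learning_rate * gradient  # move against the gradient
        probabilities = torch.nn.functional.softmax(model(input_batch)[0], dim=0)

    granny_probability = probabilities[GRANNY_CLASS_ID].item()
    step += 1
    print(f"step {step}: wolf={probabilities[WOLF_CLASS_ID].item():.4f} "
          f"granny={granny_probability:.4f}")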

Lastly, we need to revert the normalization of the image to make it viewable to us. As done before we convert the tensor back into an image format and display the adversarial example.

If you're looking at the output & wondering why the Timber Wolf probability is zero, I have failed you. Having it hit 0 is kind of the point. This is in essence the model's vulnerability to adversarial perturbations.

Remember, (broken record) by minimizing the loss, we are simultaneously reducing the score f(x)wolf​ while increasing f(x)granny and after enough (~18 in our case) iterations, f(x)wolf​ can become extremely small (effectively zero in terms of probabilities after applying softmax).

Speaking of... I won't cover the softmax behavior in too much depth, but it's handled here:

probabilities = torch.nn.functional.softmax(output[0], dim=0)

Just understand softmax as a mathematical function used to convert raw scores (logits) from a neural network into probabilities. In our context, it's used to interpret the model's output as probabilities making it easier to determine the most likely class.
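
As a quick sanity check with the logits from the earlier example:

import torch

logits = torch.tensor([2.5, 0.3, -1.2])
print(torch.nn.functional.softmax(logits, dim=0))
# tensor([0.8806, 0.0976, 0.0218]) -- raw scores turned into probabilities that sum to 1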

And that's it!

I know we cheated by stumbling onto the right model and weights. All we really did was avoid making tons of API requests to calculate the gradient. Honestly, I think this is a decent way to cover more ground when introducing evasion.


The last challenge was pretty intense, so let’s speedrun this next one with more of a CTF mindset. Let's leave all the math stuff behind, that shit's for the birds.

The objective of this challenge is to create and send a malicious image to the OCR (Optical Character Recognition) model, get it processed through some unknown pipeline, and ultimately gain admin privileges on the endpoint... Let's have a look.

Once again we're given the base image, this time it's of a phone number (Pixelated.png).

This challenge isn’t complicated. The first thing we need to focus on is the ability to create our own images. At heart, we're likely going to try and perform code injection through the visual ingestion of our malicious images.

It's not crazy to assume that we'd want to use fonts optimized for OCR. But I have no more space on my computer. Instead, let's try and match our base image font to fonts I have on my operating system.

They might not all be OCR-friendly. We'll cross that bridge when we get there. For now, let's find the closest match to Pixelated.png. I won't 🧢, I asked ChatGPT to do this because I'm too stupid to properly optimize.

The closest match on my system is Trebuchet MS.ttf. TrueType Font (TTF) is the file format that contains the font we'll use to generate our adversarial images.
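
Generating test images is only a few lines with Pillow. A minimal sketch (the canvas size, font size, and sample phone number are placeholders; point the font path at wherever the TTF lives on your system):

from PIL import Image, ImageDraw, ImageFont

FONT_PATH = "Trebuchet MS.ttf"  # path to the matched TTF on your system

def make_image(text, path="payload.png"):
    # White canvas roughly shaped like the challenge image (assumed dimensions)
    img = Image.new("RGB", (700, 100), color="white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(FONT_PATH, 36)
    draw.text((10, 30), text, font=font, fill="black")
    img.save(path)
    return path

make_image("555-123-4567")  # start by mimicking the original phone-number image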

I don't have the best vision but this looks quite close. Now let's start throwing shit inside the OCR engine and see what makes it freak out.

Lots of cool stuff; what sticks out is the XML structure that appears to be the syntax used to process the text extracted from our image.

<?xml version="1.0" encoding="UTF-8"?>
<request action="insert">
    <system>pixelated</system>
    <text>
        USER INPUT HERE
    </text>
    <is_admin>false</is_admin>
</request>

This isn't complicated: let's simply inject into the XML structure and set is_admin to true. Not using the same font the model's OCR was trained on (or is familiar with) makes reading certain characters challenging. Below we can see that <text> → <lexd>.

There are a few things we can do to get around this.

  1. Make the font of our image larger (we need to code)
  2. Capitalize the letters of our input (no need to code)
  3. Do something smart (need a better brain)

If the black strokes of the letters are too close together, it affects the OCR engine's ability to accurately interpret the text. Let's capitalize.
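
The exact payload I used isn't reproduced here, but based on the leaked XML it's something along these lines: close the <text> element, flip is_admin, then reopen <text> so the server's own closing tag still lines up, all rendered in caps with the helper from earlier so the OCR reads it cleanly (the phone number and precise formatting are assumptions):

# Hypothetical injection payload rendered into an image
payload = "555-123-4567</TEXT><IS_ADMIN>TRUE</IS_ADMIN><TEXT>"
make_image(payload, "admin.png")  # then base64-encode and submit as before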

Voila, quick and easy. This is a cool challenge because there are (like evasion) a ton of cool real-world applications that come to mind like depositing checks online, medical/insurance forms, etc.


Brig1 & Brig2 were super time-consuming challenges. My initial approach was way off & I went down huge rabbit holes. The endpoint being non-deterministic didn’t make things any easier. On top of that, solving these challenges didn’t require data science or any well-established structured testing approach.

I did really enjoy this challenge because let’s face it, people will always push weird shit into production.

Doing a base test we can see that our input gets summarized. Safe to assume we're dealing with a text summarizer. In simple terms, text summarizers take large chunks of information, pick out the most important bits, and create shorter, easier-to-digest overviews. Throw paragraphs into it and get spark notes.

I know the challenge designers use rigging to build out these challenges. The depth at which the summarizer logic is implemented is frankly unknown.

To better understand summarizers we need to remember that LLMs (for the most part) are based on the Transformer architecture (ask GPT to read the original paper to you). As a consequence, language summarizers will often implement a few mechanisms:

  1. Attention: This is assigning importance to words relative to the input context. For summarization, this helps the model determine which parts of the input are most relevant.
  2. Sentence Importance: Quantifies the importance of different parts of the text.
  3. Positional Encoding: Ensures the model understands the order of words in the sequence, which is crucial for coherent summaries.

To help visualize, I built our own little summarizer.
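
The summarizer itself is shown as a screenshot in the original post; as a stand-in, here's a toy frequency-based extractive summarizer plus ROUGE scoring using the rouge package (pip install rouge), which produces output in the same format you'll see below. This is just an illustration of sentence scoring, not what the challenge runs:

import re
from collections import Counter
from rouge import Rouge  # pip install rouge

def summarize(text, n_sentences=2):
    # Split into sentences and score each one by the frequency of its words
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = [(sum(freq[w] for w in re.findall(r"\w+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    # Keep the highest-scoring sentences, in their original order
    top = sorted(sorted(scored, reverse=True)[:n_sentences], key=lambda t: t[1])
    return " ".join(s for _, _, s in top)

article = ("Adversarial machine learning studies how models can be fooled. "
           "Evasion attacks perturb inputs at inference time. "
           "Small, carefully crafted changes can flip a classification. "
           "Defenses exist but are often brittle.")
reference = "Evasion attacks use small input perturbations to fool models at inference time."

generated = summarize(article)
print(generated)
print(Rouge().get_scores(generated, reference))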

To evaluate & visualize the effectiveness of a summarization model, smart people use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) to assess the generated summaries. ROUGE measures the overlap of word sequences and word pairs between the machine-generated summary and the reference summaries. For the summary generated above, our scores are:

  • r = recall → the % of words in the reference summary that are also present in your generated summary.
  • p = precision → the % of words in your generated summary that are also in the reference summary.
  • f = F1-Score → the harmonic mean of precision and recall, indicating the overall level of word overlap between the two summaries.
[{
  'rouge-1': {'r': 0.3333, 'p': 0.3, 'f': 0.3158},
  'rouge-2': {'r': 0.0667, 'p': 0.0625, 'f': 0.0645},
  'rouge-l': {'r': 0.2593, 'p': 0.2333, 'f': 0.2456}
}]

Therefore our ROUGE-1 score = 31.58% meaning there's a moderate overlap of individual words, which is a good starting point. ROUGE-2 score = 6.45% is low which suggests that the summary could be improved by capturing key phrases. ROUGE-L score = 24.56% means that about a quarter of the sentence structure aligns between the input and the summary.

It really all depends, but from what I understand, sentence scoring is important and used in text summarization to determine the importance of each sentence from the original text. Sentences with higher scores are more likely to be included in the summary.

In the real world, we'd want to manipulate our input so that it gets weighted as more important and would be more likely to appear in the summary. And you never know, maybe this summary is stored on the machine, gets called, and is displayed to the user? There's funky stuff everywhere.

For this challenge, our input gets passed through some application pipeline, possibly even before it reaches the summarizer. This pipeline is where we’ll focus our attacks.

After some light poking around you quickly notice that there are two types of responses.

  • message response occurs when the input doesn't make it through the pipeline and doesn't reach the LLM.
  • output responses are those where your input is fully processed by the LLM.

This pipeline likely expects syntactically valid text and isn’t equipped to handle special characters. After some fuzzing and experimenting, I managed to get some juicy outputs.

Two of which are interesting:

"Of course! I'll wait for the passage to be provided inside the `<passage></passage>` tags.
<log-summary>
  <path>/api/log</path>
  <summary>[assistant's summary will go here]</summary>
</log-summary>

If this response isn’t completely hallucinated, it suggests our input is being wrapped in <passage></passage> tags. Additionally, our summarized text gets logged to /api/log? Interestingly, there seem to be some guardrails or system-level directives in place that try to prevent us from including <passage> tags in our input.

If we try to preemptively wrap our input in these tags, we get {'message': 'Breakout attempt detected.'}. Once again, a message response means our input didn’t reach the summarizer.

To bypass this we can misspell or capitalize the tags. I prefer using the proper spelling but I didn’t notice a significant difference either way.

For the most part, I was fuzzing with () and generating tons of long & tag-heavy output messages. The following caught my eye.

Below is a beautified and formatted excerpt.

{
    'output': '[assistant's summary will go here]

    <passage>
        [user's input goes here]
    </passage>
    
    <log-summary>
        <path>/api/log</path>
        <summary>[assistant's summary will go here]</summary>
    </log-summary>

    The data is located in file inside <log-result>

    <log-result>
        <result>{"status": "success", "msg": "Logged Successfully."</result>
    </log-result>
}

Staring at this for a second, my understanding of the pipeline is as follows:

The <path> in <log-summary> indicates the endpoint (e.g. /api/log) where the logging operation was performed. It helps contextualize what the summary in <log-summary> refers to but is not directly related to <log-result>. The <passage> encapsulates the user's input, providing the content or data intended for logging. Additionally, <log-summary> provides a summary or interpretation of what was logged, often based on the data in <passage>, while <log-result> confirms the outcome of the logging process; the two are not directly dependent on each other. The following error somewhat helps confirm this.

If we send an empty dictionary ({} or {:}), the server attempts to summarize essentially nothing, causing the pipeline to fail during the LogSummary function. Therefore, it appears that for the server to avoid failure, the LogSummary function must succeed. From some light testing, the server only requires the following <log-summary> structure.

<log-summary>
    <path>/api/log</path>
    <summary>\n</summary>
</log-summary>

Up until this point, it's all straightforward problem-solving. How you decide to implement this is how you end up going down some pretty messy rabbit holes. Part of this was due to my laziness & not wanting to cheat by using rigging to recreate my own challenge with the same structure.

For context, I started experimenting with weird markdown injections and adversarial prompt structures, all of which returned funky behaviors. For example, I could (maybe) read 1-n files? I could consistently obtain reproducible outputs, leading me to believe it wasn't just a hallucination.

Injecting chat or completion pipelines was also quite funky.

I wasted tons of time trying to get web-based attacks to work, messing around with SSRFs and XXE vectors because of this one line in the challenge's excerpt... 😂

For added context, the text sent to the summarizer is capped at 500 characters. Eventually, I landed on this super useful error message.

If we know which model is being used, we can customize our text prompts using the model's control tokens. For Mixtral, those would be <<SYS>> and <s>.

How to run and deploy Mixtral 8x7B and 8x22B - https://www.aime.info/blog/en/how-to-run-and-deploy-mixtral/

I decided to return to basics and attempt to inject into the leaked XML structure. After all the testing I had done, I was fairly certain the flag wouldn’t be in /api/log I doubted that it even existed. After countless trial-and-error attempts (~5,000 API requests worth of formatting and restructuring), I finally arrived at the following.

I spent a lot more time messing around. I didn’t pay much attention to this file & was more focused on trying to leak environment variables. Based on real-world experience, I assumed the flag would likely be stored there. When that didn’t work, I decided to revisit /f_store.log to see if I could extract its contents. Turns out, when printed, it revealed the flag.


You've made it to the end! 🎉

This blog is long enough, so let’s wrap it up quickly. If you’re interested in learning more about this topic, NVIDIA has a fantastic course with videos and challenges that I highly recommend checking out (Exploring Adversarial Machine Learning). Otherwise, dive into the Crucible challenges, there’s no better way to learn than by doing.

I hope you enjoyed the post! Follow me on Twitter I occasionally share interesting stuff on that platform. Feel free to reach out there or on Discord if you have any questions.

Thank you for reading!

