Indirect Prompt Injections and Threat Modeling of LLM Applications
May 24, 2023 • 29 min read
This episode makes it increasingly clear. The time for machine learning security operations - MLSecOps- is now. In this episode we dive deep into the world of large language models (LLM) attacks and security. Our conversation with esteemed cyber security engineer and researcher, Kai Greshake, centers around the concept of indirect prompt injections, a novel adversarial attack and vulnerability in LLM-integrated applications, which Kai has explored extensively.
Our host, Daryan Dehghanpisheh, is joined by special guest-host (Red Team Director and prior show guest) Johann Rehberger to discuss Kai’s research, including the potential real-world implications of these security breaches. They also examine contrasts to traditional security injection vulnerabilities like SQL injections.
The group also discusses the role of LLM applications in everyday workflows and the increased security risks posed by their integration into various industry systems, including military applications. The discussion then shifts to potential mitigation strategies and the future of AI red teaming and ML security.
[Intro Theme] 0:00
D Dehghanpisheh 0:29
Hello MLSecOps community, and welcome back to The MLSecOps Podcast!
Returning listeners probably noticed our great new intro music. Shout out to our production associate Brendan for that. And if you’re new here, on this show we explore all things related to the intersection of machine learning, security, and operations, or MLSecOps.
Today we have a special episode for you all. We let our MLSecOps Podcast get hijacked by two red teamers. First, Johann Rehberger, Red Team Director at Electronic Arts, who we had on the show a few weeks back, makes a return appearance on the show, not as a guest, but as a co-host.
If you have the chance, I strongly encourage you to go back and listen to that episode with him from a few weeks back where we talk about red teaming and really easy methods to break ML systems, including by using a simple notebook as a launch pad.
We continue our discussion from that episode, which was so well received, with Cybersecurity Engineer and Researcher, Kai Greshake. Like Johann, he’s a fellow red teamer and penetration tester at Sequire Technology.
The three of us had a fascinating discussion surrounding Kai’s research paper that he co-authored titled, Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injections.
In this episode, you’re going to see two experts dive deep into Kai’s discovery of indirect prompt injections, the implications of this class of prompt injections, the ever-and-rapidly expanding vulnerabilities of LLM and other generative AI tools. It’s kind of a mess! But we try to lay down the groundwork for what security measures, if any, we might be able to put in place to mitigate and defend against these types of attacks.
By the way, a link to the research paper is in the show notes, as well as a link to Kai’s blog where he has a lot more to say about the topic we discuss in this episode and AI security in general, it’s a fascinating set of reads.
And with that, here’s our conversation with Kai Greshake.
Kai Greshake 2:35
I come from a very traditional cyber security background, more from the academic side. I used to do computer science and then switched to security later on during my undergraduate degree, which is in cyber security. And then afterwards I did work as a red teamer and pen tester on critical infrastructures, mostly full time, then went back to academia doing my master's degree.
And, well, now I work part time at Sequire, still doing red teaming. But during my master's as a graduate student, I realized a few years ago that my earlier skepticism of deep learning and interpretability, questions of all that were, sort of, moot. And this is just the direction things are going.
And so I switched tracks and I dropped everything else, and I just started reading ML papers continuously. And, well, at the beginning I thought I'd also just build things with language models. Obviously a very contested space now. And then I figured out that it actually connects great to my existing background, and that's where I've been going ever since.
D Dehghanpisheh 3:28
That's great. We actually came across your blog, how we discovered your talent and capabilities. We came across the blog, How We Broke LLMs: Indirect Prompt Injections, and it referenced a paper that I think we'll be talking about throughout this conversation titled, Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injections.
What is an indirect prompt injection? How should we think about that?
Kai Greshake 3:55
For starters, the traditional direct prompt injection that people have explored for a few months now, usually the impact is considered to be rather small. You can get the model to say harmful things, or you can maybe leak the original prompt. And that's something that people have been very concerned with. Not being able to extract the original prompt and that being considered one of the major impacts of prompt injections.
Well, one of the core things we realized is that, well, in the future it won't just be the user inputting things in your LLM, but a large part of the input space will be contributed from third party and external sources of data that get fed into the LLM, and then it does things with it.
That's one part of indirect prompt injections and the paper, but it turns out that that actually blooms into a huge area of security that maps very well to the existing field of cyber security, basically creating a clone of regular cyber security focused on AI.
Johann Rehberger 4:49
Following up on the paper, I remember first when I read the paper, I saw it as really kind of a foundational piece of work. I was reading it from beginning to end, and I thought it's a really great work, highlighting already very advanced scenarios as well.
How did you first find it? How did you first realize there's something like an indirect prompt injection? If you could tell us a little bit about that story; I would be very curious about that.
Kai Greshake 5:13
So at the time, I think it was about December or January, I didn't actually think about security all that much. I played a lot with language models. I tried all of their capabilities. And I came to the realization, the first building block, I guess, is that they are like traditional computers. You know, they're universal simulators. Including traditional computing systems and so on.
And so I played a lot with simulator, Linux terminal, and all that stuff. And that bore some very interesting fruits, no doubt. But the core idea here is that prompts and natural language instructions are running on this alternative computing infrastructure, which is the language model.
And then the second key thing that fell into place was the idea that well, it's not necessarily the user prompting. And as soon as you have this external input, you get adversarial inputs as well that might perturb the model. And then figuring out most of the content of the paper actually was a single afternoon, I think, where I took my notebook and I kept writing down new things and I kept coming up with new implications and new techniques to apply this to and new impacts.
And everything was hinging on this idea though, that people would be using LLMs to integrate them with their applications and with other sources of data, which at the time wasn't a thing. Nobody had built any LLM integrations, Bing didn't launch yet. There were no ChatGPT tools.
I saw the trend of people obviously wanting to use these tools and present there were some papers about it and it was clear that that's a path forward. At the time I couldn't have imagined the breadth of products that immediately came up and well, it applies to most of them.
Johann Rehberger 6:45
Yeah, so Kai, what are some of the real world vulnerabilities that you are aware of or that you found in your research?
Kai Greshake 6:51
When we started the research, there weren't any LLM-integrated applications out there. And then the first one that came out about a week after we published the first preprint was Bing, which is Microsoft's integration of a ChatGPT-like chatbot into their browser ecosystem and search engine.
And one of the features that really stood out there was that not only can Bing see search results through which you can potentially inject stuff, but what made it really easy to experiment with is that Edge, the browser which Bing uses to run, can forward websites that you're currently looking at to the language model.
So if you're looking at a PDF, you can ask it, “Hey, summarize the PDF for me,” which is a great feature. But you can also inject prompts into that content on that website. And we made a bunch of examples of what that can mean, like convincing the user to divulge information that they wouldn't want to divulge and then how to exfiltrate that and so on. And we informed Microsoft of those, we disclosed those vulnerabilities, but still, months later they aren't fixed because they can hardly ever be fixed using the current architecture and it's still a big problem.
Anybody now can take over your Bing if you're visiting a website. It doesn't even need to be that the attacker needs to have full control over the website. It's enough if it sees something that's in a tweet. And instead of really making sure that this is as hard as possible and doesn't occur, Microsoft is now announcing new integrations for Bing. So Bing will be able to take actions on your behalf, integrate with more applications, just like ChatGPT plugins.
It'll be able to embed images in its outputs, like third party images that it links to. Just like a markdown image, but that can be used to steal data from the session without the user ever clicking on a link manually, without any user intervention and completely invisible to the user. That's one issue.
And then, well, vanilla ChatGPT also has similar vulnerabilities. So one of the demos that they showed when they released GPT-4 is that it has this huge context window. So you can dump in documents that you haven't read and can have it assist you in analyzing and figuring out what those documents mean, what's in there. Maybe a legal counsel or something like that.
And well, if you inject prompts in the documents that GPT-4 ingests, just with the vanilla interface, you only have the user as the input output channel. That's already enough to create a certain risk. Once you've compromised ChatGPT, through either the content of the document the user pasted in, or something the user copy pasted elsewhere, you can exfiltrate any information in the session.
ChatGPT can embed images in its output from third party sources, and through that means can steal data from the session without any user intervention, no clicking links, no nothing. Once you paste something in there, everything else in the current session might be subject to being stolen. And that is another thing that wasn't fixed and that can't really be fixed.
Johann Rehberger 9:48
So, how do you think about the overall threat landscape, like doing your research now, if you kind of dissect your paper a little bit, what is your overall view on the threat landscape when it comes to large language models, generative AI, and especially the indirect prompt injections?
Kai Greshake 10:03
So I think especially in terms of public discussion, there have been, apart from existential risk, certain security threats like misinformation, bias, propaganda, large scale manipulation of people. But as we integrate these AI’s into more and more systems and we actually use them as agentic processes or simply integrate it into other systems, compromise will become a much greater thing.
What the threat model is for me is that I assume if an attacker controls any part of the input space to a large language model, they control all of its future outputs. And so far this base assumption has been very helpful in figuring out what is possible and whatever we wanted to get done, it didn't take us very long to tinker together a prompt to achieve just that.
And for the last few months of jailbreaks and so on, I think have demonstrated that people keep coming up with new ideas of how to break these systems and we really need to be careful with even tiny control that attackers have over inputs which can bloom into more control.
Johann Rehberger 11:06
So diving a little bit deeper on that, how would you distinguish or how would you draw similarities to traditional security injection vulnerabilities like SQL injection or cross-site scripting? How do you think that sort of relates from a threats perspective too, because you mentioned this–really I think it's a really good framing, how you put it.
As soon as an attacker controls a piece of the prompt, the output is unpredictable, or is not known from the design of the system.
Kai Greshake 11:35
It's controlled by the attacker.
Johann Rehberger 11:37
It's controlled by the attacker. So how do you relate that to something like SQL injection?
Kai Greshake 11:42
I actually think something like SQL injection only bears superficial similarities here. For one thing, it's something we can actually fix, but another thing is that the prompts are in and of themselves essentially programs being executed on these computers in the “mind” of LLMs. That is fundamentally different from what happens with injections that we put into the SQL injections, and so on.
We can access data, but this is much more akin to executing code on a traditional computer, like an arbitrary code execution vulnerability, which you can also get through injection, but that’s just–
Injection just describes how you get your malicious code into place, but then realizing that it really does bear resemblance to traditional malicious code and then executing that. That's the other key insight, I guess.
Johann Reherberger 12:30
I actually thought about this at one point. Would you say the similarity is very close to remote code execution, just at a different level? Do you think that is a good way to phrase it?
Kai Greshake 12:43
Yeah, generally I would think so. If you can make inputs to a language model and that processes the information, or if that information is instructions, it follows those instructions, you're executing a program.
Johann Rehberger 12:55
So, let's shift gears a little bit and talk about some of the real world scenarios that you actually discovered. One that I just read, your post the other day about PDFs and how you wrote a tool that allows you to modify PDF files and inject prompts into PDF files.
Can you tell us a little bit more about that? I thought it was just so recent and I found it really insightful.
Kai Greshake 13:17
Yeah, it spawned from when we first published the first preprint in February. We noticed that obviously the paper was warning, “Hey, don't use third party external information and feed that into language models!” And the first thing that we saw happening was a bunch of GPT-based bots taking our PDF and then posting GPT-generated summaries and other content on Twitter and so on.
Johann Rehberger 13:39
Oh, so this really happened with your own paper actually, initially when you realized.
Kai Greshake 13:43
So then we prepared a version of the paper at the time which had such hidden prompt injections to manipulate the summaries generated. But we chose not to publicize it in the preprint. Maybe not right for academic publishing. Would have been a nice prank, I suppose. But yeah, I built this into an application that anybody can use for their PDFs to influence any downstream LLM or other AI tool processing that data.
D Dehghanpisheh 14:09
Related to that are practical settings, right? I mean, what you just described is kind of like a very practical, albeit creative exploit, for lack of a better term. Talk about how you see potential and practical risk of integration of LLMs into everyday office workflows.
There's got to be a bunch of new security concerns as Google and Microsoft Office and all of these email applications, everything is seemingly integrating LLMs into user experiences, and they're starting with the office workflows.
How do you think about that as almost an elementary attack surface for people to exploit?
Kai Greshake 14:49
Well, obviously by integrating LLMs into all of these different workflows, you introduce additional attack surface, but also exacerbate impacts that are only enabled through these capabilities. If I only have a ChatGPT session and I compromise it, I don't compromise the future session, but if my agent has memory or some kind of place to write documents or notes, then I can persistently infect that session, right?
If it's in my Outlook and I receive emails and it's reading them, it may be compromised through such an active indirect injection, someone sending me a compromised or malicious email. But instead of traditional code, the malicious part of the email would be in natural language and wouldn't be detected by existing tools.
And then the other thing is, well, you have these LLM assistants that follow your cross-suite workflow. So across Word, and Excel, and PowerPoint, or whatever have you, email and so on, the LLM will be able to see data from multiple such sources at the same time. And if an attacker finds ways to inject their inputs into that stream, not only can they influence the future outputs, but they can also exfiltrate the previous input information.
All the other inputs that it sees are also part of the attack surface and potentially in the same security boundary as the adversarial LLM. And using very simple methods, you can exfiltrate that information. Either the LLM generates a link with the information in it for the user to click on, or maybe even something simpler, like embedding a simple image from a third party server that is controlled by the attacker, and while retrieving the image, you just have the information encoded in the URL.
And so even with the most basic of integrations, it's possible, once you've compromised such a session, to exfiltrate and steal all the data that the LLM has access to. Not just the stuff that it sees now, but also the stuff that it can maybe convince you to divulge, maybe the stuff that it can autonomously access from your organization, which is obviously an intended feature.
If you want a useful LLM for office work, a useful assistant, it better have access to a bunch of your documents, right? and so does an attacker once they've compromised it.
D Dehghanpisheh 17:01
And you talked about office workflow integration, and specifically LLMs into those. I'm curious, how do you think about prompt injection attacks that are not necessarily directly related to LLMs, like GPT models and the like? How does that manifestation differ in other LLM-like tools such as DALL-E or code generation tools like GitHub Copilot?
I mean, how do we think about that in addition to the everyday workflow? Is it still the same type of threats, or are there different ones for different types of unstructured data?
Kai Greshake 17:34
Yeah, they change with the level of integration and the sources of data fed into it. If you look at something like DALL-E, which only takes a user's prompt to generate an image, the attacker can't get their exploit in there. Therefore the threat model is rather low. But then you take something like DALL-E and you integrate it into something like Bing, which, as far as I know, that's already a thing.
If you go to Bing and ask it to generate a picture, it can do so live in the chat. And so we've shown that you can compromise the session and then have Bing– like maybe you give it an agenda: Convince you of some kind of non-true historical fact or something, and to corroborate that it can live-access DALL-E to generate the fake propaganda images to support those claims right; on the fly.
And then it becomes a threat again.
If you look at something like Copilot, well, the way you can get your input in there as an attacker is through, for example, packages that you import. When you work with Copilot, the language model gets to see a snippet of your code, a section that is composed of, maybe, important classes that you are using right now, functions that are relevant in the current context. And it's constantly recomposing that context to figure out what information does the LLM need right now to complete this code. And if it picks something from an external library, then that's a big problem.
We've shown that you can have a class with a bunch of documentation in there. And obviously, if you download third party packages in your development environment or something, especially with Python, you already give the external party code execution on your computer. So the threat isn't necessarily that it would generate code that you execute, but that it would introduce subtle vulnerabilities and bias the code completion engine to give you bad completions. And that can be very subtly introduced.
So you can imagine in the documentation of a popular project would make a pull request and add a bunch of examples of how not to use the API, but I make sure to do it in such a way that the language model actually gets primed and it's more likely to output such negative examples. Any manual reviewer would look at that and say, well, that's fine, that's good, that's improving our documentation. But it's a manipulation targeting code completion systems that then see that documentation.
D Dehghanpisheh 19:51
And I think what's really fascinating about what you're pointing out is that I think most people think of LLMs and generative AI applications as requiring a back-and-forth conversation, similar to the chatbot interfaces that we see.
But I think what you're highlighting is that adversaries can achieve their goals in a variety of ways using prompt injection methods and attacks beyond just that back-and-forth chat element.
Kai Greshake 20:14
Yeah, that's definitely true.
You can have it in a one-shot scenario. If you think of Google's new search integrations, you put in a single search term, press the search button once, but there's multiple steps that lead to the result that you see.
First of all, you have Google search index, which does the regular query. But then the search results are being fed to an LLM, potentially controlled by attackers that have planted information there, and that gets processed into the final output that you see. But obviously, if that query results in a search result that is controlled by a malicious entity, then they also control the first result that you get right away.
Johann Rehberger 20:52
I think you described some really powerful and impactful scenarios right now. That really makes me think a lot about what is possible or how these attacks will evolve over time. And I guess attacks just will get better over time as well, right? That's usually what happens in the security space. Attackers figure things out, be more efficient.
So what do you think, at a high level– like NVIDIA has something called NeMo and is building something like a kernel LLM.
What are your thoughts around how we can actually mitigate indirect prompt injections?
Kai Greshake 21:22
Mitigations right now are at a very early stage, and they only respond to the attacks that we already have. But as you've alluded to, attacks are evolving still. And what I'm trying to figure out right now is not just the design space for what you can do with prompt injections once you have a successful one, but also the design space for developing these attacks in the first place.
So far, the only thing we've seen is people manually through trial-and-error tinkering with these prompts, trying to get these models to do what they want. But there are much more powerful attack methods, especially automated ways of developing these attacks, that are going to enter the scene and that's going to change the landscape of possible mitigations again.
Right now it might be possible, I think, with the whack-a-mole tactics that people have developed now to stem the tide of manual attacks. Human creativity is limited, and if you do this for long enough and you keep collecting all of these attacks that people do and then retraining the networks, you'll get someplace that's sort of robust and can be deployed.
But attacks are developing at a similar pace. And then when it comes to the specific mitigations, we can also talk about those. I think there was also something called Rebuff, which was released a few days ago, which just takes all of the known mitigations and bundles them all together.
I think all of them can be circumvented right now with not a lot of effort. And even with manual trial-and-error development of prompt injections, you can circumvent all of them. People still need to make sure that the application they built, even if it is compromised in such a way, does not lead to some kind of catastrophic security failure.
And I think we're seeing a bunch of really high stakes applications that are coming out not considering these attacks. This may be fine for a single user interacting with the system. The old model of prompt injection, right? Defending against that. Preventing the original prompt to leak, maybe with canary words interspersed in the original prompt and, and if it repeats those you block it, that type of thing. But it's not enough to really hold up to the future.
Johann Rehberger 23:27
So the really good point I think that you brought up was the severity or the implication or impact of a prompt injection. I remember like the first example where I saw it actually in action was when you built the Bing Chat indirect prompt injection. It made it speak like a pirate. Just visiting a web page, it started speaking like a pirate.
So, what are the overall worst-case scenarios you can think of? What about usage of AI and large language models in military applications or something like that? Can you share your thoughts on that?
Kai Greshake 23:59
I recently had a blog post where I went through all of the potentially dangerous ways people are deploying LLMs right now. Obviously the most egregious of which is the deployment of these military LLMs with scale AIs. Donovan and Palantir's AIP models which take in threat intelligence information from whatever data sources you hook up to it and try to give you operational insights and operational suggestions for, well, actions you can take. Which implied to include kinetic options just as well as informational warfare and all these other things.
That's obviously bad in two ways. For one thing, it's already bad enough that these systems are built at all and deployed. And the second thing is that these people are not considering that, well, their systems might be compromised through being fed data from attackers and then giving operational insights on behalf of attackers. So, not only might these systems be very powerful, but they might also blow up in the face of the people trying to deploy them.
There's also a bunch more, slightly less devastating applications that are being built with it. People use them to analyze contracts as a legal assistant, and obviously every contract is an adversarial example. The other side wants to get those conditions favorable to their side, and if they deploy it with something like prompt injections in the contract, well, your legal assistant will help not you, but the other side.
People also built, I think, Bloomberg GPT, which is probably used for financial analysis and also eventually more automated trading with LLMs, all of which might be also compromised. We've had sentiment analysis for a long time, informing things like high frequency trading, but the new semantic capabilities of LLMs mean that they might get more weight in those decisions because they're soon to be smarter, making better decisions and predictions, but their outputs are also more manipulable.
D Dehghanpisheh 25:55
Let me backup and ask, is there a generalizable defense methodology then that you are thinking about?
Because there are some and we just wrote a blog about this and there's a really great blog detailing kind of the dual LLM patterns as a possible defense, but it's kind of messy, right, to implement if you're an ML engineer. And Johann mentioned something similar earlier.
I'm just curious, are there generalizable defenses that you see that can be easily and readily adopted today? And if not, what kind of defense mechanisms should developers be thinking about and system managers be thinking about to protect their LLM applications or generative AI applications in general?
Kai Greshake 26:36
So, I think one thing to realize is there cannot be a static benchmark of how secure or robust is this agent against prompt injection. This is something that is probably only well defined in some kind of environment with agents defining robustness against specific attackers with certain resources.
And you can maybe make some headway on making these things truly more robust. But I think in the limit you will find that there is a tradeoff between security and utility.
Let's say you want a customer support agent. You start with a foundational language model and then you do reinforcement training with some kind of automated adversarial generator. You'll probably end up with a chatbot that loses most of its general capabilities in favor of security. The most secure agent would be one which does no action at all, but we want to preserve that utility.
Now, how much of that utility we can preserve is an open question at this point. So there are probably generalizable defenses, but we don't know what they look like and we don't know how they tradeoff with utility. And I think what it's going to come out at is not the static benchmark, but rather something like an Elo rating of how robust is this agent against compromise or manipulation?
And pushing that Elo higher will also incur some costs on the utility. I think the ones that have tried this was Anthropic in their Constitutional AI paper. They did calculate such Elos, but that was just against human raiders and all that. So it's not quite where I'd want to have it. But they were the first ones to introduce these Elo ratings, I think in that context, and it makes a lot of sense for adversarial robustness.
But in the general case, I don't think there will ever be a fundamental solution. Because you're looking at, let's say, a property like “Is this prompt malicious?” Deciding that is impossible. It's fundamentally undecidable because the language you're trying to parse is still incomplete. That much is certain.
D Dehghanpisheh 28:29
It’s almost a whole new generation of social engineering, right? Social engineering has been around forever. It's impossible to shut it down and shrink that threat surface to zero. And maybe that's the same way we need to think about these adversarial and robustness attacks. Just realize that you've got to shrink the surface as much as possible and harden the defense.
Kai Greshake 28:50
It's also a question of scale and the network effects here. What if I deploy one model that is very robust, and I deploy it to 100 million users, but then it's frozen after training? An attacker can spend a long time hammering that one model and finding a single attack or prompt that works, and then they can deploy it to a whole bunch of users. Just like the threat that we have with zero days.
But if these are trivially easy to find, every time you have a system that doesn't continually adjust, it's a problem. And even if it does, it's unclear if you have this monoculture of the same model being deployed everywhere if that's enough. You might have to think of it in terms of resilience, just like on a population basis. How much diversity do we have between models such that one working injection doesn't work on all of them?
Just the way that, at the end of the day, humans also have to fend off these attacks. It's not in our favor if anybody that's out on the street and talks to us can convince us to do whatever. It's not good for our offspring. And that's why we have some kind of adversarial robustness built into us.
That is still limited. There's cults and sects and politicians convincing people of things that aren't in their best interest. And so while we are robust, we are only robust in this everyday framework of, like, humanity. And people trying to find these vulnerabilities in humans are successful frequently.
D Dehghanpisheh 30:08
Con men are still employed today, right? Anybody can fall prey to con men.
Kai Greshake 30:13
Johann Rehberger 30:14
Yeah. I really like that analogy with social engineering. I think that to me, this whole discussion now was super interesting, the last part, because it's really like, do we have to train models, like just send them to security training? To robustness training regularly? Do things like this have to become the norm to make sure that models do not get outdated?
D Dehghanpisheh 30:34
And do you rethink security? Like rethinking security away from state management and more to behavioral analysis. You've got to have elements that are looking at it in that context rather than kind of a binary, true/false, yes/no, state changed or state same dynamic.
Johann Rehberger 30:50
So, to wrap everything up, what would you want to tell the audience, and the security industry as a whole I guess, and what should be the next steps on how to think about these problems?
Kai Greshake 31:00
I really enjoyed what Simon Willison said on the subject, which is that if you don't understand what prompt injections are and what they might do, then you're doomed to implement them and be vulnerable to them. And I think that's completely true.
The primary goal right now should be to raise awareness for everybody to actually figure out, well, what are these impacts? Read our paper and then adjust the ways you implement and use the LLMs. A lot of the things that we thought might be good ideas or very valuable might not be in that category.
And if you really think you can implement it in a way that it might be compromised, but that it wouldn't have critical security implications if that did happen, then I encourage you to explore all of the mitigations that are out there right now, including Rebuff and NeMo Guardrails and all the other projects which try to give you some heuristic level of security, or at least give you security against very unsophisticated attackers that are just probing around.
You can reasonably defend against that. And part of that is using databases of known attacks and using LLMs to figure out if there's bad stuff in the input. You can try that. It's not going to be foolproof. It's going to be, in fact, easily circumvented by a determined attacker. But if you're convinced that such a breach wouldn't have critical implications, you can deploy these systems to add a bit of extra security.
We should be clear that these approaches right now are largely centered on security by obscurity and playing whack-a-mole.
When I'm doing a pen test and I find an SQL injection vulnerability and during the remediation, they tell me, well, we “fixed” it. We introduced a blacklist of inputs and all of the SQL injection inputs that you used are in the blacklist. So none of them work anymore and we fix them. You haven't fixed the underlying SQL injection and in this case, prompt injection you can't really fix. So that's why people are using these approaches.
But at least with SQL injection, we could probably all agree that that's not a sustainable fix and you probably shouldn't be deploying the application in that state.
D Dehghanpisheh 33:04
So with that, Kai, I really like the last part of that where you were talking about the penetration testing. How should companies be thinking about penetration testing their LLM models? Everybody's thinking about producing them and putting LLM into user experiences and end products, but I don't see the same fervor to enlist Red Teamers and Pen Testers to actually harden these things, and then deploy even some of the basic mitigation components you just described.
How do we get the AppSec teams to really demand that? Are they thinking about it appropriately or is it just in this race to get out the generative AI functionality hell, just go roll it out?
Kai Greshake 33:49
I think a better question is, is it even useful the way that we are conducting red teaming and testing these models right now? And the ideas of hardening that we have, which is that red teamers should go out and find inputs that break these models and applications? Well, if you find that, sure.
But if someone doesn't immediately find such an injection or maybe you've patched all the ones that they did find, it doesn't mean your application is more secure. You have to actually look at the threat model and include in that threat model that, well, an attacker controls some part of the input space. They will likely get full control over future outputs.
And you should go from there and then extrapolate We've seen, I think the White House was working together with DEF CON or something, and they want to have competitions breaking these models, finding new prompt injections. And everybody right now is like, hey, we need to share all of these prompt injections that we found so that we can introduce them in data to train on. And that'll make these models more robust.
But that's not a sustainable way to do it. And especially when we look at automated ways of finding these injections, it's going to be much less valuable in the future. And so we should be thinking about a systemic perspective of how we integrate these models in a way that even these failures don't create critical security failures and I think that's the best that we can do right now.
D Dehghanpisheh 35:10
Cool, well, thank you so much for that, Kai.
And Johann, thank you for guest hosting with me. I can't wait to have you back to have another discussion about red teaming of AI and looking at the security implications.
And Kai, thank you for coming on the show, thanks for all of your input. Fascinating discussion. A little bit scary because there's probably no end in sight but a lot of fun. So Kai, thank you so much for your insights and thank you everyone for listening and or reading the transcript. See you next time.
Thanks for listening to The MLSecOps Podcast brought to you by Protect AI.
Be sure to subscribe to get the latest episodes and visit MLSecOps.com to join the conversation, ask questions, or suggest future topics. We’re excited to bring you more in-depth MLSecOps discussion. Until next time, thanks for joining!