AI Agent Security: Threats & Defenses for Modern Deployments

Written by Guest | May 21, 2025 5:15:27 PM

Audio-only version also available on your favorite podcast streaming service including Apple Podcasts, Spotify, and iHeart Podcasts.

Episode Summary:

In this episode of the MLSecOps Podcast, community leader and episode host Madi Vorbrich sits down with researchers Yifeng (Ethan) He and Yuyang (Peter) Rong, two of the authors behind the paper “Security of AI Agents.” Tune in as they explore how today’s AI agents can be exploited (from session hijacks to tool-based jailbreaks), how developers can build safer systems through sandboxing and agent-to-agent protocols, and a look ahead at what a secure agent stack might require in 2025 and beyond.

Transcript:

[Intro]

Madi Vorbrich (00:07):

Welcome to the MLSecOps Podcast. I'm your host and one of your community leaders, Madi Vorbrich. And today in this episode, I wanna talk about a really hot topic or hot button issue that I've heard circulating for a minute now, which is how to build and break AI agents. And with me today, I have the pleasure of speaking with Peter and Ethan, who are both researchers behind the paper "Security of AI Agents". Thanks for being here.

Yuyang (Peter) Rong (00:34):

Thanks for having us.

Madi Vorbrich (00:36):

So, Ethan, Peter, can you go ahead and dive into quickly just your background and kind of what brought you into this space and why you started this paper to begin with?

Yifeng (Ethan) He (00:49):

So I'm a PhD candidate here at UC Davis in the computer science department. And my research focused on software security and AI security. And a little bit story behind the paper is that back in last year, we were trying to use the most advanced techniques of LLM agents or AI agents to solve software security problems like detecting bugs and fixing bugs automatically. And back then we realized that there's a new trend in agentic AI or AI agents to a lot of full potential of the language model to use tools as agents, right? And we started looking at those methods.

I found out that there are security issues with those approaches and especially when we are wearing them with respect of security. And that's when we figure out that we should first analyze and take a series of the security problems of those agents. And we decided to take a closer look and write a paper on it.

Madi Vorbrich (01:58):

Awesome. And Peter, do you wanna go ahead and give the audience a little bit of your background too?

Yuyang (Peter) Rong (02:04):

Oh, sure. Thanks everyone for having me here. So I'm Peter. I got my PhD in UC Davis. I actually started off doing software security or security in general. So I did a lot of hardcore software stuff like compilers, operating systems stuff. It was around '23 when ChatGPT came out and everyone's super hyped about what it can do or cannot do. And that's when we started to think, okay, of all the crazy things you can do, can it be used to write code? It turns out back in '23, it couldn't, and now it could, of course, we know that already, but the security problem remains is how legit is a code returned by AI?

And that's how we started to do security research on AI agents, because the more we dive into it, the more we realize that it's not just the code, it is writing that is not, may not be secure, it is a lot of things could be an exploit point for adversarials. So that's when we started doing a lot of AI research and eventually summarize them into this paper.

Madi Vorbrich (03:22):

Yeah, and it's a really great paper. I'm gonna go ahead and link it in the show notes as well. But basically how I wanna dive into this episode is kind of go by like how you written it, like in your research paper. So I wanna dive into the threat landscape how LLMs or LLM agents can be exploited. Then the flip side of that and talk about the defense side, and then take a closer look at how teams can better, you know, secure themselves as well as like a 2025 outlook and kind of tips from you guys on what we should be looking for, right?

So with that, I know that your paper focuses on four key areas as it relates to threats. And I have them listed here, and it's unpredictable user input, so prompt attacks, internal execution complexity, so back doors and poisoning, environmental variability, so RL and memory attacks. And then lastly, untrusted interactions. So that's like tool and agent based attacks. So I'm gonna dive into each of those a little bit deeper, but can you guys go ahead and give me just a brief walkthrough of each of these to kind of set the stage?

Yifeng (Ethan) He (04:32):

I really like how you summarize these key points.

Yifeng (Ethan) He (04:36):

For the current design of AI agents, I think unpredictable user inputs, right? And the untrusted interactions between the agents and the tools can go together a little bit because both the interaction between user and agents and between the agent and tools are using natural language in the prompt, right? So if we see the large language models themselves as a software it's just a bunch of ways and it is status, right? And the status of LLM based software or AI agents are encoded in the prompt history or in the context.

So every interaction should well be carried along the context check history for the LLM to know what's happening and make the next step, right? And during this multi-term interaction between user and agents and agents and tools, both directions are facing a text from the users and from the tools. Is their descriptions and feedbacks, and both of those are vulnerable.

The internal execution complexity, mainly originally from the inherent limitation of large language models, mainly when you wanna improve the models as the backbone of the agents or improve the models to build like personalized agents, right? So that involves centering the agents with user data.

Yifeng (Ethan) He (06:17):

And well, another way is just to encode everything in the prompt, but we know that the context window of the language models are not that advanced yet. So, the solution right now involves training the language model to improve their usability. And when you try to modify the ways of the language model with untrusted user interaction data, you don't know if the user is malicious or not, and you don't know the data can be trusted or not.

So that opens the text surface of data poisoning and backdoor attacks, like attackers can invest some backdoors in their prompts. And when you fine tune the model using those interaction data, they can just trigger the backdoor with very simple non attacking words and send out some very sensitive data toward debut as the backdoor.

Madi Vorbrich (07:21):

And we can dive into each of these further. So I wanna go ahead and start off by diving into session hijacks, like the first real world exploit here. So can either of you kind of describe how can an attacker hijack or confuse an agent's user session? And then with that, what real world risks does that pose?

Yifeng (Ethan) He (07:45):

So session hijacking in AI agents we can view as two different ways. The first is the session of agents itself, and the other is you can hijack the HTTP session of agents, but that that's not only apply to agents, they also apply to all the web applications.

Yuyang (Peter) Rong (08:07):

Like a network issue. Yeah.

Yifeng (Ethan) He (08:08):

Right, right. So, I think in this episode, we only talk about the agent issue. And for the agent issue, because as we mentioned briefly, the sessions or the state of AI agents are in coded in the prompt history, right? In the context. So for large language models you can just use a very simple prompt to let the language model forget about all the check history before that.

So with very simple prompt of text, for example, in coding, in one feedback of one of the tools, that tool can hijack all the session and make, for example, verbal votes against the other, the other tools, like say, if you are using a ranking ranking system for web content or products, they can just use very simple embedded prompts to let the, the agents favor their products instead of the others. That's one real world example. And they can be any other similar examples.

Madi Vorbrich (09:15):

Yeah. Yeah. And that's a good example. Definitely is a stark risk for sure. And with that too, I also wanna look sort of under the hood, right? And with that, I wanna dive into internal poisoning as well.

So in your paper, under internal execution complexity, you describe model or memory poisoning, right? So how can attackers manipulate an agent's memory or training data to change its behavior over time?

Yifeng (Ethan) He (09:44):

The model provider improve their models using fine-tuning matters, like preference optimization or single stuff like add a direct preference optimization or human involved reinforcement learning, things like that. And in that way, let's just take direct preference optimization as an example. Model providers center the model based on human related responses.

Like when I'm interacting with the agents, and sometimes I'm facing multiple choices, and I will choose one of the responses given by the agents, right? And that will be labeled as a preference. And the malicious user can intentionally give wrong preference, that's tools, some kind of tools that they're paid to optimize for, or that's just their own tools. If there's a lot of those malicious users, that's very likely that they'll point in the preference data, the user infraction data that's used for training or training the agents.

Madi Vorbrich (10:59):

And Peter, do you feel the same way as well? Did you have any other examples to provide here too?

Yuyang (Peter) Rong (11:04):

I think one example you can think of is, so the model will try to remember things when you fine tune it. And it is very questionable when you… What is your fine tuning data? Most of us are trying to use user interactions nowadays. And then how do we see through the user interactions, which interactions are good interactions and which interactions are malicious? If I keep trying to tell a model that some wrong statements and then that statement, that false statement is used as a training data, would they remember it? And the unfortunate fact is that we cannot observe that. That is why we think it could be a security problem.

Madi Vorbrich (11:56):

So would you say that these attacks can almost be invisible until it's too late sort of a thing?

Yuyang (Peter) Rong (12:04):

Yes.

Madi Vorbrich (12:05):

Yeah. Yeah, I thought so. And then finally in this segment, I wanna go ahead and dive into tool-based jailbreaks. So in the interaction category, tool-based jailbreak stands out, right? Why do you two think why do malicious actors exploit agent tool APIs? And also too, how do these attacks bypass traditional prompt defenses?

Yifeng (Ethan) He (12:33):

The most important problem is that the traditional way of detecting prompt attacks are there are two main ways. The first is just a keyword based filter, and you just do a blog list and filter out any unwanted keywords contents in the prompt.

And the other is training language model based guard, like Llama Guard is a famous example; they released Llama [models] by the Meta group. And those kind of approach uses label data to train the guard to classify if the prompt is malicious or not. And in many cases, if the tools used by AI agents send misinformation it's not likely to be directly malicious. Instead it's misinformation instead of attacking prompts. And those are not detected by the guards or the Q-based filter.

The other key point is that, the things, the chat history stack up, right? And one of the prompts or one of the responses from a tools API may not be seen as malicious by the guards, but if a lot of them stack together in their chat history, they might become malicious.

Madi Vorbrich (13:58):

Mm-Hmm.

Yifeng (Ethan) He (13:59):

One example I like to, well, I like, it's somehow like social engineering or frauds like for example.

Madi Vorbrich (14:09):

Mm-Hmm.

Yifeng (Ethan) He (14:11):

And that's targeting agents.

Yuyang (Peter) Rong (14:12):

I like to look at this from a software perspective because I came from a software background. Think of a logging system. You can verify it, in computer science we call it verification. Basically proving that what you want it to do will not fail regardless of what the input is.

The problem with any AI agent is that you cannot do these kinds of verifications because the agent is a black box. You can certainly do filtering, but after all the keywords get filtered out, the rest or the rest input, could it still cause problems? We don't know. So I wouldn't say the current defense is not any good. I would say the problem is that we don't know how good it is.

Madi Vorbrich (15:15):

Right.

Yuyang (Peter) Rong (15:15):

So falls into the previous thing we discussed before we the, we wouldn't know how big an issue is until the issue came around.

Madi Vorbrich (15:25):

Right. And just from your experience, have you seen... Since you've published this research paper, right, and since you've done iterations on it, have you seen anything change in this regard when it comes to these exploits?

Or just like how, I don't know, maybe more like in depth or advance they've gotten since you've written this up?

Yifeng (Ethan) He (15:49):

Well there are many rapid changes in building AI agents, right? But as of right now, the main focus of the community is to get AI agents to work. Like we don't have a really working agent right now. So there's, we don't know how to discuss how practical is the difference because now the agents are already working, right?

So it is somehow like when you just have the internet and it's all chaos and people don't know how to difference stuff. And people just trained to defense and attack the internet. It’s similar to AI agents right now.

Yuyang (Peter) Rong (16:29):

I have a very, very good example of why filtering doesn't work. To amend my comments. I think it was a year or a year and a half ago, one of our colleagues told us, so of course you can filter out all the dangerous words for AI agents. Don't tell people how to build a bomb, don't do this, don't do that. One of the interesting bugs, or you could call it vulnerability, is if you ask an agent to say something repeatedly, nonstop after a certain amount of repetitions, it starts to say some very questionable things.

Madi Vorbrich (17:08):

Right.

Yuyang (Peter) Rong (17:09):

Yeah. So that is one example. Like you, you can't really filter because repeat is a normal word that you can just use. And if you ask it to do something repeatedly, that doesn't sound dangerous, but the output is. I think that has, has that patch been fixed? Have they done anything about it? I have no idea.

Yifeng (Ethan) He (17:30):

Well, about that issue. There are people trying to solve that issue. Like that, that's one of the jailbreak attacks to large language models. All the solutions I have read from the research papers are just their training - or they’re fine-tuning - the model [to not] repeatedly say stuff.

Yuyang (Peter) Rong (17:52):

So I wouldn't say that's a fix, but they're, they're trying to mitigate it.

Madi Vorbrich (17:58):

Alright. Awesome. So now that we've kind of explored the attack side of it all, I wanna go ahead and flip the script and talk about defenses, the other part of your research paper.

So in your paper you focus on approaches to mitigating said threats and vulnerabilities that we just discussed. So with that, what does effective sandboxing look like for agents? And also too, if there's any like earlier beginner listeners tuning in, can you kind of define what that is? And then also too, what do teams often get wrong when it comes to effective sandboxing?

Yifeng (Ethan) He (18:37):

Well, sandboxing is really ideal for software systems, like it's isolating software execution from the rest of the system to protect the rest of the system, of course. So essentially what we don't trust the agents, right? We don't trust, well, we shouldn't trust anything generated by the AI. And in that case, we want to isolate the execution of AI agents generated actions from the rest of systems. Either it's a local system or it's a remote resource. In software we trust, we trust software because we can verify the mass people mentioned, but right now we don't have a way to verify actions generated by the AI agents.

And they're, well, to make it simple, like we don't have a framework to verify or to predict the results of AI agents execution. So we don't have entrust to the AI agents. That's why we want to just isolate all of them. The first step we can do is to limit their access to all the important resources at computational or sensitive data on your system.

Yuyang (Peter) Rong (19:56):

I actually wanna refine a little bit about the idea of sandboxing. So Ethan said you shouldn't trust an AI agent, but from a security perspective, you shouldn't trust anything at all. So the idea of sandboxing is essentially saying, if I lock you up in the box and then there is no harm that you can do inside this box, then everything should be fine. It's like putting a toddler in the kitchen, just remove all the knives and nothing could happen. Nothing could go wrong.

Madi Vorbrich (20:35):

That's a really good example. Yeah.

Yuyang (Peter) Rong (20:38):

Yeah. Right. So the essential idea is that we put the AI agent in this sandbox and regard what we only give, so we only give little interfaces to the outside world. For example, you can browse the internet, you can download things from the internet, but you cannot post things to the internet. That is a very simple idea of a sandboxed AI agent to like read stuff for you. Yeah.

Madi Vorbrich (21:15):

And then what do you think teams often get wrong when it comes to sandboxing?

Yuyang (Peter) Rong (21:24):

I think that could be a sandboxing problem in general, not just AI agents.

Madi Vorbrich (21:29):

Gotcha.

Yuyang (Peter) Rong (21:30):

You put too much, you gave the box too much permissions that it doesn't really need.

Madi Vorbrich (21:39):

Gotcha. And you also talk about encrypted prompts in session isolation as well. So can you explain how they work too? And then also what threat models do they help mitigate?

Yifeng (Ethan) He (21:54):

Right, when we first talk about the session isolation idea, and that one was designed for a very specific use case. That you have one agent facing multiple users. And the second idea was to isolate the chat history or the state of the AI agents between each users. So the agent doesn't get confused.

I also noticed that the agent designers are shifting from one agent's, multiple users to more like a one user, one agent design pattern. So, there will be new, similar, different techniques in the future for the new design patterns for agents.

Madi Vorbrich (22:38):

And what about if we were to talk about like, quick win tactics that teams can apply right away when it comes to the defensive side of this? What are some low hanging fruits that teams can implement today to make their agent systems more secure?

Yifeng (Ethan) He (22:55):

In my opinion, the first low hanging fruit is of course, sandbox, right? You should limit the agents from accessing data, accessing resources. And I think it's pretty easy to implement if you have a choice of your agent framework, like you're using OpenAI's SDK, then it should be easy to implement a resource isolation or sandboxing for your agents.

Yuyang (Peter) Rong (23:22):

I will say the mitigations we proposed in the paper are pretty, are all pretty low hanging fruit. I think that should happen immediately.

Madi Vorbrich (23:33):

Okay. Great. So now that we've kind of built the foundation I wanna also discuss the higher level protocols that sort of tie this all in together. So that's kind of looking at what's happening right now and kind of in the future, what do you guys sort of recommend?

So first off, Ethan, I know that you and I have talked on the side previously to recording this episode, and you actually brought up Agent-to-Agent Protocol (A2A). So can you kind of define what that is, right? And then what kind of problems is this designed to solve and what would a secure interaction between agents look like in practice?

Yifeng (Ethan) He (24:12):

Okay. So the protocols for AI agents... Well, currently there are two popular protocols. One proposed by Anthropic and the other by Google, right? And the one proposed by Anthropic is called Model Context Protocol (MCP). It basically defines a standard way to do the interaction between agents, data sources and tools. And well, it's a protocol for tool use, but it's not a protocol for security purposes. There are no access control mechanisms in MCP, the Model Context Protocol.

And the Agent-to-Agent Protocol (A2A) is built by Google as a complete complimentary protocol to model conduct protocol. Well, it defines how agents should talk to each other. And again, it is still a way to, like the agents can call other agents as tool using, like agents are tools and agent to agents like they're tools for each other and they can communicate, but there's still no security built into those agents protocol I mean.

Madi Vorbrich (25:31):

And so once agents can talk safely, what's the overall deployment, like quote unquote playbook, right? Like if someone is listening today to this podcast and they're shipping LLM agents in a real product, what sort of the 2025 ready playbook here that you would give them? Or what are some pieces of advice that you'd give someone?

Yifeng (Ethan) He (25:57):

I would say, when you are building agents, keep the security issues with posts in mind and stay up to date to the newly growing technologies and perhaps follow the podcast and get updated. And I think that's something to keep in mind.

Madi Vorbrich (26:21):

And then Peter, do you have any gems or words of wisdom on your end?

Yuyang (Peter) Rong (26:26):

I think it's very important when you're building it to think of where is the data coming from and where is the data going to? In the coming from part, is it from a malicious actor or is it from a legitimate user? As I mentioned earlier, you really shouldn't trust anything. So how do you keep that separation?

And in terms of where you're going to, how sensitive is the data? Is the receiving, does the receiving end have the permission to read whatever data it is receiving? If you are sending the data back to the user, fine. If you are sending social security to another agent, that's a very questionable problem. So yeah, that is my 2 cents, literally 2 cents.

Madi Vorbrich (27:12):

And so if we're gonna look at what's next for the field, where is in your opinion, agent security headed to next? What questions still need to be answered? And then also too, what kind of research or community collaboration do you also wanna see from here on out as it relates to it?

Yuyang (Peter) Rong (27:36):

I think one of the huge issues is scalability. Now, we've been talking to one user, one agent or multiple users, one agent. What if we are talking about thousands of users and thousands of agents? How are they talking to each other and how are the security guards going to look like, security measures are going to look like, when these parties are talking to each other? I think that is the most urgent issue at this point.

Madi Vorbrich (28:06):

Yeah, that's a really good point. Ethan, did you also have anything that you wanted to add?

Yifeng (Ethan) He (28:13):

Mine is similar. So I wanna mention that one thing that sticks immediate attention is how can we define a computation model for your agents? Well, that computation model, I don't mean like machine learning models. I mean, how can we define the interactions or how agents compute?

Yifeng (Ethan) He (28:34):

Like if you refer to software systems, like one example I can bring up is concurrency like programming languages people or system people define multiple computation models for concurrency. Like quarantines, the actor models or structure concurrency, like those computation models offer a way for us, the security people, to be reasonable or to verify the system to understand their potential behavior before running them.

So, if we can define a computational model for AI agents, it is possible for us to do static analysis to track the agent's behavior before running any of them before they do any harm. We can know that it is safe or not.

Madi Vorbrich (29:27):

Right. Awesome. Well those are some really good insights. So I wanna go ahead and wrap up, put a nice little bow on this and just dive into some key takeaways. I know that you guys shared some really good gems before this, but one last question. If our listeners wanna walk away with anything from this episode, like one idea or just one really good piece of advice, what would that be?

Yifeng (Ethan) He (29:53):

I would say be aware of the vulnerabilities and don't fully trust the agents. And GenAI of course.

Yuyang (Peter) Rong (30:02):

I would say zero trust. Try to verify things, whether it's your data or it's your software. You gotta know how at what point it will break and what are the vulnerabilities. If you don't know it, then it could be a problem.

Madi Vorbrich (30:22):

Yeah, definitely. So Ethan and Peter, where if someone wants to connect with you after listening to this podcast and they wanna maybe talk about your paper a little more, one-on-one in depth, is there anywhere where people can reach either of the two of you?

Yifeng (Ethan) He (30:40):

I think LinkedIn is a good way. I also have one, have my emails and personal website on my LinkedIn.

Yuyang (Peter) Rong (30:48):

Yeah, I have my personal web in LinkedIn as well. That will be pretty good. Yeah.

Madi Vorbrich (31:00):

Awesome. Well, Ethan, Peter, thank you so much for joining me. It was a pleasure having you.

Yuyang (Peter) Rong (31:06):

Thanks. It's a pleasure.

Madi Vorbrich (31:07):

Yeah, thank you. And to our listeners, please follow, share, and stay tuned for more deep dives from the MLSecOps Community and our guests on the MLSecOps Podcast. Thank you so much and we'll catch you next time.

Yuyang (Peter) Rong (31:24):

Thanks.

Yifeng (Ethan) He (31:25):

Thanks.

[Closing]

Additional tools and resources to check out:

Protect AI Guardian: Zero Trust for ML Model

Recon: Automated Red Teaming for GenAI

Protect AI’s ML Security-Focused Open Source Tools

LLM Guard Open Source Security Toolkit for LLM Interactions

Huntr - The World's First AI/Machine Learning Bug Bounty Platform

Thanks for checking out the MLSecOps Podcast! Get involved with the MLSecOps Community and find more resources at https://community.mlsecops.com.

View full post