Unpacking Generative AI Red Teaming and Practical Security Solutions

Feb 5, 2025 By Guest

Audio-only version also available on Apple Podcasts, Spotify, iHeart Podcasts, and many more.

Episode Summary:

In this episode of the MLSecOps Podcast, co-hosts Charlie McCarthy and Sailesh Mishra sit down with Donato Capitella, Principal Security Consultant at WithSecure Consulting, to dissect the real-world challenges of securing LLM-powered applications. Donato shares his journey from software engineer to AI security researcher, explaining how “LLM red teaming” is often misunderstood and why context—how an LLM interacts with users, data sources, and APIs—might matter more than the model alone. He introduces Spikee (Simple Prompt Injection Kit for Evaluation and Exploitation), and underscores the need for guardrails, continuous monitoring, and early security planning. Donato also reflects on building an AI-focused YouTube channel (LLM Chronicles), highlighting how educational platforms like the MLSecOps Community play a vital role in elevating AI security awareness.

Transcript:

[Intro]

Charlie McCarthy (00:08):

Hello, MLSecOps Community and welcome back to the MLSecOps Podcast. I'm one of your community leaders, Charlie McCarthy. It's wonderful to be back in the studio with you, and today I'm joined by my co-host and colleague, Sailesh Mishra, who's a colleague of mine at Protect AI, and our guest that we'll be interviewing today, Donato Capitella. Welcome both of you to the show.

Donato Capitella (00:31):

Thank you for having me.

Sailesh Mishra (00:32):

Thank you. Thank you, Charlie.

Charlie McCarthy (00:34):

Yeah, it's an honor to have you both here. Quickly just to get the audience oriented, Sailesh, why don't we start with you, if you don't mind giving just a brief background for yourself and then Donato, we'll move on to you and then dig into the meat of the interview here.

Sailesh Mishra (00:48):

Sure. Hello everyone. I'm Sailesh. I've been working with Protect AI as Director of Business Development for Gen AI security initiatives. Prior to this I was working in a startup called SydeLabs which was building an AI red teaming tool. We got acquired into the Protect AI ecosystem, and that's how me and Charlie met each other. And this is a part of a team now.

Donato Capitella (01:11):

So, my name is Donato Capitella, and I am a Principal Security Consultant at WithSecure Consulting. We are a cybersecurity company. We do penetration testing and adversary simulations, testing networks, applications and systems that our clients, our clients build. And we have researched, like our consultants spend 20 to 30% of their time researching different upcoming topics that they find interesting. And I happen to be the person that found Gen AI and LLMs interesting. So, I have applied some of my knowledge as an ethical hacker and penetration tester to the world of AI.

Charlie McCarthy (01:58):

Excellent. Well, thank you both again for being here. Let's dig into that a little bit more Donato. You, you know, of course you're a Principal Security Consultant at WithSecure Consulting, but you've got such a broad career history. Can you talk to the audience a little bit about your journey from, you know, software engineering and pen testing to becoming or teaching yourself about AI security and becoming more of an expert in this space and, you know, that led to launching a YouTube channel. Just what has that journey been like for you?

Donato Capitella (02:29):

So, I started as a software engineer. I really like making things. But then the problem that I had as a software engineer was that it kind of pigeon-holes you into one language, one technology, especially if you go back 15 years ago, like I was a Java developer and I was very curious about everything to do with computers. So, the way that I got into cybersecurity was because it's one of the few areas where you can work with different clients week by week on different types of systems. So, I can work on a network assessment of a Cisco router one week, the next week I can work on a web application written in Ruby, a week after I can do an adversary simulation… Basically, whatever my clients are building you get to play with that particular technology.

Donato Capitella (03:24):

And obviously you have to learn how it works in order to figure out how an attacker might be breaking it, because then what I do is, is breaking it. And now as part of that it was 2023, the beginning of 2023 a friend sent me ChatGPT. I logged in and this looked very different from anything else I had done when I studied machine learning. I just studied one machine learning course in university maybe 16 years ago. And it was like, you know, expert systems and all sorts of things that didn't look like any of that. So I wanted to teach myself how that worked. And I started, you know, looking through books, coding labs, and I wanted to build a little LLM from scratch, but understanding everything that was going on. And then, as I very typically do, I needed an external motivation to do that and to do that well.

Donato Capitella (04:24):

And when I learn stuff, I also like to make these mind maps, these are canvases where I tried to put all the concepts together. And so I started doing that. A friend said, oh, why don't you put it on YouTube? And so I made the first video with one of these canvases talking about what I was learning. And then “LLM Chronicles” was born. And basically I documented that journey of how you go from zero to building a small LLM. And then on top of that, because I became the LLM guy in my company, when clients towards the end of 2023 actually started implementing all of this, because in my own time I had an interest and I understood a little bit about what LLMs were. I was very lucky because then I got into all these meetings with all of these engineers from different organizations and data scientists, and they were talking to me about what they were building, and I was applying my hackers mindset. Thinking, how could somebody break this? And so that's how I got into Gen AI and LLM security.

Charlie McCarthy (05:35):

Excellent. So, transitioning to some of the work that you're doing now, your history and experience with penetration testing and ethical hacking, a term that I'm sure you're very familiar with is red teaming. And we're starting to hear more about AI red teaming or LLM red teaming. Are clients currently asking for an LLM red team as a service now? And if so, what have you found they're typically looking for and how do you define that term for them?

Donato Capitella (06:06):

Okay. I think that's an extremely good question and somewhat triggers me a little bit in a positive way, hopefully. But, the reality is that we, in cybersecurity, used the term red teaming for something very, very specific. And so at the beginning, I wasn't really sure what people were asking for when they asked for LLM red teaming. Because red teaming for us is an end-to-end adversary simulation. You pretend to be a Russian threat actor nation state and you try to infiltrate a company. Typically, it takes three to six months and you hack everything that you can hack, including people. It's not something that you do on a single system. But when people ask about LLM red teaming, our clients are asking for that. And I think because now the LLM security community has taken that term.

Donato Capitella (07:07):

So, I have learned to adapt myself to what it means there. And to me, people are asking for two things. They're asking for most LLM red teaming feels like an LLM benchmark to me. Like if you're taking the LLM in isolation, you are asking questions about the safety of that LLM with certain prompts. That's similar to a benchmark. It could be static or adaptive. And typically those LLM red teams are focused a lot on harmful outcomes. So, how to make a bomb, will the LLM respond to that? Write me a piece of malware or say something hateful. So, hate speech and this kind of things. So, that's what a lot of the LLM red teaming world talks about. Then I think when our clients come to us, what they actually mean, they're looking for a security assessment. And that's what I think the majority of organizations, by the way, mean actually, when they're asking for an LLM red team. They are building something. They're building a use case, and the LLM happens to be one of the parts of that use case.

Donato Capitella (08:18):

And then they make the LLM interact with users, documents, data source tools and interactions that defines your use case and that defines the risk. And so what I think organizations are asking is, I have built, or I am using this particular use case, how could an attacker exploit that against me, against my clients, or against my organization? And to me that's more similar to what we would typically call a security assessment. Which is what I call it when I do it for clients. Okay, what have you built? And they say they want a red team and eventually end up being a security assessment. And part of it involves prompt injection and some of the other attacks, but obviously it depends what people have built. It's very context specific.

Sailesh Mishra (09:08):

Donato, what are some of the common misconceptions, so buzzwords that you've encountered when you discuss LLM red teaming or a security assessment or a security eval of an LLM?

Donato Capitella (09:22):

So, I think one of the biggest misconceptions I typically ask the client what they're trying to get out of it. What is the question they're trying to answer? So, most of our clients will use GPT-4 or on Azure or they will use Anthropic Claude on AWS. These are the models that people use. And so when people come to me and ask for a red team of that, again, I think what they're asking is more to look at what they built they use as the model and how the vulnerabilities, which are inherent in the model can be used by an attacker. Yeah. Yeah. So, I think this is what people actually are- so, there is a bit of misconception around that. And the other one, and this is where I think I am a little bit different in the community, I think that we need to look at these vulnerabilities in context.

Donato Capitella (10:22):

Like you can go to a chatbot and you can make it say something hateful. We don't like that, but you and you are just breaking it. But if you're making that chatbot, say that back to you, the risk, the way that we see it in cybersecurity, this would be a low risk. Now, if you can make that chatbot say or do stuff that impacts other users or that perhaps impacts the organization that's actually running that particular application, then it matters more. So, a lot of what we do is moving from oh, I'm jailbreaking the LLM, how to make a bomb, or how to make stuff that, you know, the prompter does on X you know, all those tweets. We try to move from that to what's the risk to your customers, to your organization, and how somebody could actually attack it.

Donato Capitella (11:16):

So, I think the misconception is about what people are trying to get out of it, then, whether you should red team an LLM if you haven't even trained an LLM. I mean, if you've trained or modified an LLM, maybe do a red team because it kind of, it's a benchmark. It tells you if it is better or worse when it comes to those attacks than what it was before. But if you take GPT-4, if I red team GPT-4 with the same data set, I will get the same output that I get for all my clients. The model is the same.

Sailesh Mishra (11:46):

But will that be, so in a probabilistic setup, like what we've seen in generative AI will it be a silver bullet of, you know, sort of just training the model with a certain data set, running a red teaming exercise, getting the same kind of outcomes and insights for every client? How would you differentiate between, let's say, red teaming of a foundation model like a GPT-4 versus what the clients have done and what could be the security impact of their actions?

Donato Capitella (12:16):

Oh, absolutely. And I think that that's immediate. The first thing that I ask clients is what have you done? And you get into that call and they're like, oh, we trained this model on our data. And then I look into it and I'm like, okay, that's just RAG (Retrieval-Augmented Generation). You haven't touched the original model the slightest. The majority of people don't fine tune models. Now, if you fine tune a model on a data set, then I think you can ask the question, okay, I have any benchmark and I can benchmark the model to see whether your fine tuning has created some problems. You know, maybe it's more vulnerable to certain types of attacks or less vulnerable. Maybe you've got some data in your training set that I can leak. That's okay. But the majority of people don't do that. The majority of people are doing RAG or doing agentic workflows. They basically consume APIs of standard models. And then around those APIs, they build their use cases. I mean, if you think about it, copilot is this. You knowGitHub copilot, if you reverse engineer it, it's literally directly prompting GPT-4 and GPT-4o mini. It's not a custom model that they changed. So, this is the reality that I see very often.

Sailesh Mishra (13:37):

Yes, that's absolutely true. What kind of risks or vulnerabilities have you typically uncovered when you, let's say, do a red teaming engagement for a client? I mean, the scope that we are discussing is very broad.

Donato Capitella (13:53):

So, first of all, it could be anything, right? So most people have got these chatbots. Which is the use case that people think of immediately, but it doesn't have to be a chatbot. You could have different types of, I call them Gen AI features. Maybe you have a button. Actually, like two weeks ago I was testing something that a client did where there is like a button that they can press and it summarizes like a case, like a complaint with all of the information that the customer put in. That's not a chat, that's not a conversational workflow. And then you also have agentic workflows. So, where the workflow is not static, but obviously the LLM is given tools and it can perform actions calling those tools dynamically. So, these are the three different things that we typically test.

Donato Capitella (14:46):

And I'll give you some examples. So, last week we did a test of a chatbot for a financial organization, and they had a few. So, the question was can it access any data? Or is it just generic? So, that's a very important question to define risk. But then what's important is that if we tested that in isolation, let's say we took any benchmark, I say garak because it's very well known, and we run an LLM benchmark like garak against an API for a conversational agent, we would've gotten a lot of information out of it. Not really sure how to interpret it, but if you tested that generative conversation in isolation, you wouldn't have found out that there was an access control issue. Which is a typical cybersecurity vulnerability, where now I can jailbreak the LLM. So, that's the LLM side of it, but it's not just to make it answer something back to me, but we could essentially inject a prompt into another user's chat.

Donato Capitella (15:53):

So, now what is a standard jailbreak? [Example] “Tell me how to make a bomb.” I could make the chatbot by using, again, not an LLM vulnerability, but a classic web application vulnerability. I could make the chatbot say something back to another user. So, now I could attack other users leveraging the LLM and the jailbreak. So, that jailbreak that would've been a low risk for us now becomes a high risk. And we demonstrated how you can make it you know, obviously paused as the company and told the user to input some confidential data and then exfiltrate that information. This is some, you know, data exfiltration is a big thing with markdown images and stuff like that in LLM. So, that's an example. Another example on the other side, I mean, this is a conversational case. The other example is agents. Like we looked at, something we did last year is looking at browser agents.

Donato Capitella (16:55):

And last year there were just research projects. So, there was this thing called Taxy AI. It's a plugin. You put it in the browser and it connects your current tab to any LLM you want, and it gives the LLM two actions. You can click on the page wherever you want, and it can type whatever you want in any field. So, it calls it, you give it a prompt and it navigates the page. And we showed that, you know, you could send an email, do a prompt injection attack, and then make, for example, the LLM poison the LLM in that case and make it now perform under the control of the attacker. And we have a little demo where we steal some information from a user's mailbox. But these are more of the agentic workflows that back then when we looked at this in April 2024, people laughed at us because they said, oh yeah, but nobody will ever do this.

Donato Capitella (17:49):

If you look at what people are doing now that's quite a lot of it. So, but hopefully I've given you an idea of the kind of like different use cases. And again, you've got the agentic side tools and actions, and these are typically where we find the biggest vulnerabilities because you can use those jailbreaks and prompt injection the stuff that you do in red teaming of LLMs. But now you can use it to make the LLM do an attack against a user or against an organization. And on the other side, the conversational agents, I think a lot of it has to do with whether you can make it say something to a different user, either via documents or indirect prompt injection or via some classic broken access control vulnerability in web applications. So, this is kind of the spread of the things that we see.

Sailesh Mishra (18:40):

Yeah, I believe the risk management frameworks that are available today, the OWASPs of the world, the NIST, the communities are also doing such a great job in sort of building the guidelines around this. Everyone's struggling and also super excited to, you know, see how the agentic side of you know, security comes in. So, great insights. Thank you. Thank you, Donato.

Charlie McCarthy (19:04):

Yeah, and to your point, Sailesh, this definitely feels like an all hands on deck moment in time. You know, we need as many perspectives as possible. So, it's such a treat to have you here sharing this with us Donato. Can you talk a bit about, I mean, you just shared a wealth of information and I can see how somebody, like a business leader or, you know, maybe somebody on the engineering side who might not have a deep background in security, there might be a sense of overwhelm around all of this and like, where our systems are at risk or you know, what the specific risks are in just our organization, because they can vary. Can you provide a bit of insight into, you know, individuals or teams, organizations who might wish to engage in like LLM red team activities with a very experienced security consultant like yourself or your firm? Are there questions that they should have ready that they're hoping to have answered, or information about their use case that they should bring with them to those initial meetings? Like goals from the outset for these types of assessments that will make an LLM red team engagement the most valuable? Can you share anything there?

Donato Capitella (20:17):

So, the most important thing is to talk about the use case and understand that the biggest vulnerability is not in the LLM itself. Also because you can't fix the vulnerability in the LLM to an extent, but it is in the interactions with the external world that the use case provides. So, you should come, they should come to a meeting talking about this, a scoping meeting for an LLM red team telling me or anybody that's doing that, what users are using the application, what document and data sources come in and in what way, whether there are agentic workflows and tools and APIs, and what do these look like and what's the level of agency that the LLM has? Because all of these things, users, level of agency, system, APIs, tools and documents, knowledge bases, categorization of what kind of data it's got access to, that builds you a threat model and frames how I'm going to test that and what kind of things matter to you.

Donato Capitella (21:27):

So, instead of going to somebody and saying, oh, we are using GPT-4, can you write timid, tell me, oh, we have a workflow when customer emails come in with complaints, we pass them through an LLM and the output goes into this web application and it's drawn in this particular way, and then it's got access to this tool to do this and that. Okay, that's gonna help me and it's gonna help the client understand this is what matters, and these are the controls, this is what we can test, and these are the controls we can put in place to deploy these in production in a safe way. Because all of these ultimately people need to keep in mind that we need to deploy this in production now. Like there is a lot of research in academia, but the difference between what we do with our clients and what happens in academia is that they have to put this live in front of people now. So, how do you do that? And that depends on the use case.

Charlie McCarthy (22:24):

I see. Okay. Thank you. And on the other side of the coin, when you are engaging with clients on this topic, are there stakeholders on the client side that are most useful for you to be speaking with during these scoping meetings? Like what, who are the personas that are helpful to have in the room during those initial conversations?

Donato Capitella (22:46):

Data scientists and software engineers.

Charlie McCarthy (22:49):

Okay. Those two specifically. Any particular - for audience members who, maybe these are some new concepts, can you share a little bit about the why behind that quickly?

Donato Capitella (23:00):

It's because we really need to understand what it is that people are building. And again, typically if you don't know what's being built, there will be confusion. People will come in and say, oh, we have trained a model on our data. Well, if you have done that, I will do a completely different test. Instead, if instead of training, you're just using a retrieval augmented pipeline and you're using GPT-4, then I do the test in a different way and the risk is different. So, it's important for us to get access as early as possible to the people that have actually designed and implemented the system, because they're gonna be able to tell us with the right terminology what it is that we're actually going to test. So, some people tell us, we have an agent that needs to be tested, and it is just a retrieval, augmented generation pipeline with public documents. There is literally no agency, no tools, no API, nothing dynamics. So, for us, having those people in the room that know what's being built and that can use the right terminology really helps understand what type of test we are gonna do.

Charlie McCarthy (24:08):

That makes perfect sense. Before I kind of switch gears to talk about some tooling, Sailesh, did you have additional questions or double clicking to do on like the LLM red teaming topic before we move on? Anything else you wanted to ask?

Sailesh Mishra (24:25):

Sure. I also wanted to understand, Donato, what's your perspective on the speed of red teaming engagement. The speed at which the generative AI world is evolving, I mean, the attacks that are probably working today may not be working tomorrow, and if a corrective action is delayed by let's say a few weeks, then maybe you become far more vulnerable than you were a couple of weeks back. So, what's your perspective there?

Donato Capitella (24:52):

So, I think it's a little bit different. So, right now when you think of LLM attacks, jailbreaks and prompt injection, there is no LLM that you can't jailbreak. And we, and there is no guardrail that can't be bypassed, given enough computational time. So, there are obviously research papers on adaptive attacks. I mean, at the beginning people did this manually and we still do it manually. A lot of the manual stuff and the simple stuff still works. But the reality of it is that if you give me enough computation power, I can make enough requests and in the right conditions with adaptive attacks, I can jailbreak any model. The models that we have now, for example, Anthropic just last month had the best of an attack which honestly is a very simple black box attack.

Donato Capitella (25:48):

We implemented it immediately and there hasn't been anything that we haven't been able to jailbreak. So, for me, practically with my cybersecurity background, there are other things in cybersecurity that you can't fully fix. The Windows operating system, it's going to have zero day vulnerabilities. It's going to have vulnerabilities people are going to find, and you can never find all of them. So, how do we deal with that? In, we have a lot of mitigating controls and defense in depth, and that's the saying that I recommend people do with LLM applications. You have kind of like a LLM application pipeline. So, yes, you have the LLM, but then you have the user input, you have a set of guardrails that you can have at the input. You have a set of guardrails that you should have at the output.

Donato Capitella (26:38):

And so all of that pipeline needs to come into play to ensure that you are mitigating the risk of those vulnerabilities. We even released an application security canvas for LLM applications where we kind of go through all of these controls, like, you know, you've got prompt injection models, different types of contextual input validation. Output validation is very important. I mean, I say just one thing, it surprises me how many people take the output of the LLM and just check it back to the user without checking if it contains any link or marked unexpected content. I mean, this is not even machine learning, right? I can look at the output of an LLM using a regular expression and see if there is a link to a domain that I, that, that I have not allowed, then my application in this case should never produce.

Donato Capitella (27:32):

So, these are small things and there is obviously harmful content that you can check with different models. So, you do all of these, but all of these controls can fail to an extent. So, then the final thing that I tell people is practically, and what we do with our clients, it's not just about the controls that you're putting in place, but it's about detecting when somebody is triggering that particular, for example, jailbreak check. You have a model, input comes in, the model classifies that as a potential jailbreak attack. You stop it. I'm gonna go again, best of an attack, I try different attacks, modified input. Eventually I am gonna bypass your guardrail. The problem is how much is it gonna take me? So, if you set a threshold and you say, I've got all of these checks, if a user triggers 10 of them in half an hour, I'm gonna lock the account.

Donato Capitella (28:36):

I'm not gonna allow them to keep probing at the application. Now, that's what in practice for a lot of our clients is making it possible for them to deploy these applications in production. They have guardrails. It's not that they deploy without guardrails, but they know that they will fail exactly as you said, like it moves very fast, but it's similar to password guessing attacks. So, your password is secure, but we have password lockout. If I try to log in into your account with five passwords that are not your password, your account is gonna be locked. That's to prevent me from going over all the possible passwords. And I think you do the same with jailbreaks. You detect them as best as you can, but you stop people that keep trying to do that because eventually they will find their way through.

Sailesh Mishra (29:25):

Yeah, so continuous observability and monitoring on the entire system is what, you know, sort of is the order of the day very soon.

Charlie McCarthy (29:35):

Switching gears a little bit. So, when we're talking about security for AI or AI powered technologies or, you know, cybersecurity for AI, we like to touch on the people, processes, and tools - like the three main categories. And we've talked a little bit about people and processes. You shared during our prep for this episode, Donato, that y'all were just about to release a new open source tool called Spikee. Am I pronouncing that correctly?

Donato Capitella (30:02):

That's correct, yeah.

Charlie McCarthy (30:03):

And I think since then it has been released. Will you share a little bit about that with us?

Donato Capitella (30:10):

Yeah, absolutely. So, it has actually been released today. And that's, so we spent the last year doing these LLM application security assessments for our clients. And as we were doing them again, at the beginning we were doing a lot of manual attacks, and then we started automating some attacks. It was a bunch of scripts that we had. And then at some point we decided to pull some of these scripts together into a framework that was made for penetration testers, for security people. So, if you look at something like garak, we tried to use it, but it's an LLM benchmark. It doesn't really serve you very well for a lot of reasons. If you're doing an assessment of something, an application use case for a client, we needed something where we could first of all focus not on the harmful outcomes, like how to make a bomb, but on attacks relevant for us, like data exfiltration, HTML injection, resource exhaustion, and access control.

Donato Capitella (31:16):

So, making the LLM call a tool that it shouldn't call. So, we needed something that allowed us to do this use case specific. So, something where I look at the client use case, I look at documents that they're using, and I can generate a dynamic data set, which different types of jailbreaks and objectives, and this could be small or large, depending on how much I have, and then I can send it to the application, maybe using tooling that we use. Like I mention some tools like Burp Suite which is an HTTP interception proxy. So, this is common tooling that penetration testers would use, and then you could look at the answers and determine whether the attack was successful without having to use an LLM verifier or an LLM judge. So, we kind of had to boil it down to things that would be practical to do and that you, it would be very easy to create your own data sets, and that's what Spikee essentially does.

Donato Capitella (32:17):

It allows you to input your own jailbreaks, or it comes with a library of jailbreaks, a library of different things that you wanna do, data filtration, and then it comes with very easy ways to detect whether those have been successful. And then it allows you to generate a dataset with some modifiers and test them against an application. So, that's been what we've released now. It's available on spikee.ai. Spikee, with a terrible spelling because we couldn't just make it - we couldn't find an acronym that said what we wanted to say, but Spikee stands for, what does it stand for? I forgot. Simple Prompt Injection Kit for Evaluation and Exploitation. So, it's Spikee, but with two E's at the end rather than a y, silly name. But we had to find something that worked, but essentially this is free open source, people can download it.

Donato Capitella (33:22):

We have other things that we're gonna release, for example, the best of an adaptive attack. We opted not to release it yet because it's an attack that's got essentially a hundred percent success rate. And a lot of the other things that we tested, you know, have got some success rates. So, we wanted to give a tool to the community that could be used. But we kinda want some of the defenses to catch up a little bit before we start releasing the whole data sets. But anyway, that tool can be used to test an end-to-end LLM application use case. You can use it to create a custom data set for a client. You can use it to directly test guardrails, and you can also use it to benchmark an LLM. And actually, we did it. We took, I think we took 20 LLMs and we benchmarked them. Even the reasoning models you know, o1, [OpenAI] o1mini, DeepSeek-R1,

Donato Capitella (34:23):

Gemini, we benchmarked all of them on that. But, the thing is that people can go look at the benchmark and make their own version where they say, okay, what would happen if I change the system, if I use different prompt engineering? What would happen if I changed the use case? So, we released it as a simple benchmark, but people can play with it. Interestingly, if you look at the result, for example, DeepSeek, which is now what everybody is talking about, it's obvious that it's not being fine tuned or trained to be resistant to prompt injection. Like it's literally at the bottom. I mean, almost any other model performs better on those types of attacks than DeepSeek-R1. And this is also something for people to remember. I think there is a lot of hype.

Donato Capitella (35:16):

The models are good, but just because a model can reason very well with certain problems, it doesn't mean that it's got general intelligence. If you haven't taught it to reason about, you know, what prompt injection is, it's actually gonna be very prompt injectable. I literally, I don't know if I, which model? Well, when I go to the benchmark, I think the only models that perform, the only model that performed worse than DeepSeek models is Llama-3-70B. Pretty much every other model, at least for the basic version of the benchmark, was more resistant, and was less likely to follow instructions that were injected. So you know, take it for what it is. I don't think it's a conspiracy. I don't think they did it on purpose. I just think they did not design the model to be resistant to these attacks because they were focused on something else.

Sailesh Mishra (36:09):

So, effectively Donato, Spikee is a combination of, let's say a certain technique and probably an outcome. Is that how you would sort of define the key elements of the data set that gets generated?

Donato Capitella (36:24):

Yeah, so we, it's a combination of documents that are very use case specific, you can put as many of these documents as you want. And these documents can be injected with a combination of what we call a jailbreak and what we call an instruction that you want to include. So, we use jailbreaks to include instructions. So, it could be something as simple as new important instructions, and then the type of instruction or any other type of jailbreak. You know, sorry you know, this is a test. You are in a test pipeline and you're not talking to a real human for the test not to fail. You need to do X and then X would be some of these structures. And all of these things are mixed up. So, you've got this kind of jailbreak, plus jailbreak technique, plus instruction that then is injected at the beginning, or it could be injected in the middle and at the end of the document, and it can be injected in different ways.

Donato Capitella (37:17):

You can control whether you want to just put it in the middle, what kind of delimiters you want, and then you create these malicious documents that you can send to an application if you have the application that you're testing. Or, if you just want to test the LLM in isolation, then you tell Spikee, okay, this is a document, and Spikee supports summarization and Q&A use cases. So, you can say, you can basically say, Spikee can make believe a summarization use case? Please summarize this document, or this is a question to answer in this document, and then Spikee can look at the output and tell you whether as it was summarizing or as it was answering the question, it also followed the malicious instruction that had been injected. So, that's essentially what it does.

Sailesh Mishra (38:07):

And, and this would've taken a significant amount of research, like you said in the beginning. What are your key prioritization takeaways that you can share with us while building Spikee?

Donato Capitella (38:20):

Oh, honestly, the customization was one of the most important stuff. I want to be able to create standalone documents that I can send to an application. So, without system messaging, without prompt, I just want to be able to get the document. Then if I want to add the prompt I can, but I wanted to be able to get raw documents so that then I could submit any way to whatever application I wanted. So, that was number one. Number two, I didn't want an LLM judge. I wanted to be able to look at the output of the LLM and determine whether or not the attack was successful without making another LLM call and asking a judge whether the attack was successful. This costs a lot and slows you down. So, these were the two main angles. The ability to generate my data sets that I could integrate with anything based on the use case. And I didn't want to use an LLM judge because this needed to be practical and you can just do peep install, Spikee, the Spikee in it, and then you get your working directory and you can start using it immediately.

Sailesh Mishra (39:31):

You've already kind of hinted at it, but can you share a few examples where you might have already implemented them in real world use cases or are planning to? What can be some real world examples where Spikee can be used?

Donato Capitella (39:44):

So, we have one that we put on a tutorial on our website. So, obviously all of these things are adapted from testing that we've done. We basically use versions of Spikee in all the testing we've been doing in the past four or five months when we collected those scripts together. But for example, a use case was like a Gen AI feature. They would summarize some just trying not to give away too much, but with summarizing some trading rules that can be written in a language. And so a user that doesn't know that language could look at some of these trading rules in this trading platform, press a button and that would explain to the user what that trading rules was doing in that platform. So, obviously it's a summarization use case there.

Donato Capitella (40:39):

And what we did, we took some realistic trading rules as documents and then we used Spikee to create different types of injections. Where, if we produce markdown images or JavaScript to see whether you could make, as part of the summary, you could make it produce JavaScript that then would be executed in the browser because it doesn't exist in isolation, it was rendered in an application. So, we created that data set and we sent it. They had some prompt injection guardrail, so having that kind of data set that eventually ended up having 500 samples, we kind of could see, okay, this has been blocked and these are the techniques that were able to bypass them. It also has plugins to do like alterations, like lead speech stuff like that. So, that was a case that we used. And we kind of have something similar that we described on our blog if people wanna try and play with it.

Charlie McCarthy (41:40):

That's fantastic. Thank you for sharing all of that with us. And kudos also to you and your team for developing this tool and providing something to the open source community to be able to explore some of this on their own. Before we start to wrap up the show, I wanted to ask your perspectives, Donato, on a term that has been being passed around the industry, we're hearing it all over the place; “Agentic AI” or some people just shorten it to “agents.” Can you wax poetic about that and just some insights and what you've been seeing in the space?

Donato Capitella (42:13):

For me, on the agent side what's really important to understand is that whenever an LLM is given tools that it can use and it can call, there are some use cases where it's fairly easy to secure because you can apply access control on whatever the LLM is allowed to do externally outside of the LLM. And so, all those cases are pretty easy. Whenever you have agents like a browser agent, where it's really difficult to apply access control because the agent can click on anything. You don't have - and then you don't have a way to really know on a generic webpage, what's the access there, what's the action that you should be doing? Because the thing is just clicking. That's harder. And even harder are coding agents. Now, I looked and we published it. We looked at OpenDevin, which is now called OpenEnd, and you can do prompt injection. The issue, and we show that you can open a GitHub issue,

Donato Capitella (43:25):

So something public, you have a repository, you open an issue and you put a prompt injection attack in there. Now, when these open hands or any kind of coding agent, if you point it to look at the open issues in the GitHub repository, the prompt injection is going to work on the agent, and then you can tell it to do other things. The problem is that the agent can do a lot because it's got access to a workspace, it's got access to the code base, it's got access to Bash, it's got access to a Python interpreter, it's got access to the internet... So, these agent workflows, some of them, the ones that you can put a lot of predefined raves and access control with precise tools and APIs, they're easier to secure. The ones that are very generic, like a browser agent that can browse anywhere or even worse, an autonomous software developer that can browse the internet, open GitHub issues and talk to people and has got access to all the stuff that it's got access to, it's fairly hard to secure that. I haven't seen anybody that's really done that in a way that I would say is secure in a satisfactory way.

Charlie McCarthy (44:37):

Awesome! So, we've got a few minutes left here, folks. I'd like to pivot one more time and talk just a little bit about your passion project, Donato, “LLM Chronicles.” Content creation, I've noticed, especially as a leader in the AI security space related to awareness, the MLSecOps Community has been a hub for a couple years, kind of teaching folks about building security practices into the AI/ML pipeline. Are there any discoveries you've made throughout the process of creating content for your YouTube channel to promote AI security awareness or LLM security awareness that - just trends that you've noticed maybe, or things that people are really picking up on or interested in? What's that journey been like for you from a content creator perspective?

Donato Capitella (45:30):

So, first off, the inspiration. I wouldn't be here if MLSecOps hadn't been there. Obviously at the beginning, I didn't know anything. If you go back to 1-2 years ago, I didn't know anything. So, just having resources like MLSecOps, especially podcasts where like this one, you get to meet the people that are actually doing the tech, I think is really important because it gets you these human aspects and people talking and discussing the tech and their experience. So I think this is one of the most important things that creates a community. Every time I approach a new topic I always try to understand the technical side. You can buy a book, you can do a course, but then you have to get to know the people, their opinions, their discourse, and I think these kind of things are incredibly helpful.

Donato Capitella (46:25):

It was this podcast and, I'll just mention it because it was so important to me, Sebastian Raschka has a lot of books on how LLMs work. If he hadn't written those books, I don't know if I'd have gotten where I got now. So, I love people that are sharing. And for me, the journey, I mean, it's hard. I mean, you do this so you know how hard it is to do something that's not infotainment. And there is nothing wrong with infotainment, but what we are doing now is not infotainment. This is discussions, opinions, and content learning. So it requires that people are in the right mindset to want to listen to you. You're not gonna relax on the sofa at 10:00 PM and probably listen to an episode of this. Definitely not listen to me.

Donato Capitella (47:19):

I mean, I could help you sleep, put you to sleep probably, but what's challenging with what we do, because it's not infotainment people need to be in the right mindset and there is a lot of information. So, even with my channel I have a few people that follow and the challenge is always how do you make the content so that it sticks and it's your voice and it's people are getting something out of it. So, one thing that I don't do that I think a lot of people do, is as soon as there is a new model I will not make a video shouting into the microphone, "This changes everything. The world of AI is never gonna be as it was. This is the end of it. China's gonna take over. Nvidia doesn't exist anymore." You know, it's okay, like it's fun, but there is no information.

Charlie McCarthy (48:17):

So, what I'm hearing is to seek resources that have the primary intention of being educational. Yeah, a lot of intention behind it, I think is the main thing.

Sailesh Mishra (48:29):

So, Donato, we discussed so many insightful things today. Is there something that you feel we did not cover? I have a question in my mind that I'm not sure if, if we have, you know, sort of discussed this, but when should a client who's building generative AI applications, when should they start thinking about red teaming assessment?

Donato Capitella (48:53):

Before they build it.

Sailesh Mishra (48:57):

Right. Yes, that's very true. I just wanted to lay that question out for you. Is there anything else that you think we should have covered and we kind of maybe skipped over and you would like to sort of reemphasize any key takeaways?

Donato Capitella (49:15):

So, key takeaways for me I think what's really important is to think about the security of your Gen AI use cases rather than of the security of the individual LLMs. Vulnerabilities do exist in a context. Think about that context and how an attacker is gonna hack it and start thinking about the remedial actions that cyber, that LLM application security canvas with all of those remedial actions, all of those mitigating controls. So the input, the output, the access control and the detection. So, detect when somebody is trying to jailbreak or prompt inject. Allow five, ten attempts and then block them. This is appropriate for the majority of use cases. Unless you are Open AI and you're providing a general purpose LLM. If you're an organization and you're a bank and there is a LLM that I can talk to, I shouldn't be able to send 10,000 requests containing all different jailbreaks. You need to stop that and you need to think of that before you start even building the application. What kind of guardrails are you going to put in, around that use case, that makes sense for that specific use case?

Sailesh Mishra (50:32):

Vulnerability exists in a context. That's kind of it. Thank you.

Charlie McCarthy (50:39):

Awesome. Okay, well in the interest of time, I hate to wrap it up. I feel like we could go on a lot longer and maybe, Donato, sometime if you're willing, you come back and we can do another episode and dig into some more. But for today just wanna express my sincere thanks to both of you for being on the show here. To our sponsors, Protect AI, who operate the MLSecOps community. And then final note, Donato, if folks would like to get in touch with you, we can link something to the show notes. What's the best way to do that? Is it LinkedIn? Is it a website? Is it an online form?

Donato Capitella (51:11):

Linkedin and YouTube. I mean, find a video on YouTube that you actually have a question on. Ask the question there. That's gonna help the YouTube algorithm. He likes that, but I also respond to people's comments.

Charlie McCarthy (51:27):

Perfect. Yeah, we'll absolutely link to the “LLM Chronicles” in the show notes for sure. Okay, well thank you again, both so much and everyone who has tuned in for this episode. We appreciate you and we will see you next time.

[Closing]

Additional tools and resources to check out:

Protect AI Guardian: Zero Trust for ML Models

Recon: Automated Red Teaming for GenAI

Protect AI’s ML Security-Focused Open Source Tools

LLM Guard Open Source Security Toolkit for LLM Interactions

Huntr - The World's First AI/Machine Learning Bug Bounty Platform

Thanks for checking out the MLSecOps Podcast! Get involved with the MLSecOps Community and find more resources at https://community.mlsecops.com.

Guest

SUBSCRIBE TO THE MLSECOPS PODCAST