Holistic AI Pentesting Playbook

Jun 13, 2025 By Guest

Audio-only version also available on your favorite podcast streaming service including Apple Podcasts, Spotify, and iHeart Podcasts.

Episode Summary:

Jason Haddix, veteran OffSec professional and CEO of Arcanum Information Security, joins MLSecOps hosts Madi Vorbrich and Charlie McCarthy to share his methods for assessing and defending real-world AI systems.

Transcript:

[Intro]

Madi Vorbrich (00:07):

Welcome to the MLSecOps Podcast. I'm your host and one of your MLSecOps Community Managers, Madi Vorbrich.

Charlie McCarthy (00:14):

And hi everyone, my name is Charlie McCarthy. I'm one of your MLSecOps Community leaders, and today, Madi and I have the pleasure of hosting a very special guest who we have been following in this community space for quite some time, Jason Haddix - Hi, Jason! - who is a veteran [penetration] tester as well as founder, CEO, hacker and trainer at Arcanum Information Security. Jason, again, just such a thrill to have you on the show today. Thank you.

Jason Haddix (00:42):

Yeah, thanks for having me. It's a pleasure. I had a ton of friends actually be on your podcast and the show, so it's awesome.

Charlie McCarthy (00:50):

Nice! Well, hey, before we dive into the meat of today's interview with you, do you mind giving the audience a little bit about your background and kind of what brought you to Arcanum in this space today, and especially the AI security tie-in?

Jason Haddix (01:05):

Yeah, absolutely. So I've been doing offensive security for about 21 years now. So since probably the dawn of web hacking, I would say, is when I got into offensive security. Started off as a pen tester, just doing web application penetration tests, mobile penetration tests, all that kind of stuff, moved into red teaming. And I've worked at a number of companies. I was the Director of Penetration Testing at HP. I was the Director of Operations and Head of Trust and Security at Bugcrowd. And then I left to do the CISO job for a little while, which was I just kinda wanted to see if I could do it. And I did that at Ubisoft, which is a big game company. And that was, you know, four years of my life and was an interesting experience.

Jason Haddix (01:49):

And then I came back to more technical work. So over the past couple years you know, we do a kind of like a plethora of tests at Arcanum. I started my own company. We do purple team and red team assessments, but you know, kind of when ChatGPT 3 or, you know, 3.5 came out was when I got bit with the generative AI bug. And I saw the application right away in automating like, lower level security tasks. And so I started using it, developing tools, augmenting my tool sets with it. So I've been doing that now for, well, we're about three years in, I think, to the generative AI kind of ride two and a half if, depending on when you started.

And so, yeah, so I was doing that and then, you know, really recently you start to see a lot of adoption by like the Fortune 100, Fortune 500 in really starting to build either internal or external apps that are powered by generative AI on the backend, right.

Jason Haddix (02:46):

Either as extensions to APIs or full chat bots that do different things. And since we're a pen test company and I did a couple talks on prompt injection, some really small ones, people were like, "Hey, do you know how to assess these things? Cause we don't." And that kind of birthed a whole track of research that I've been working on on hacking AI, AI pen tests basically, and methodology that we've developed. You know, that's pretty proprietary at Arcanum. So been on the conference circuit talking about our research that we open sourced. We open sourced a lot of it. And then yeah, just hacking a bunch of enterprise systems, which has been really fun. So yeah.

Charlie McCarthy (03:26):

What a wild ride. Congratulations, by the way.

Jason Haddix (03:28):

Yeah, thank you. Thank you very much.

Madi Vorbrich (03:31):

Yeah. I think that also brings up a really good discussion for us to have, which is assessing AI implementations in the wild. Kind of taking a nod to what you just talked to us about, Jason. And I want to ask, I know that you were at the OpenAI Security Research Conference this year. What did you talk about there? What was your presentation? And then I have some other questions that kind of want to ask about what you discussed there.

Jason Haddix (03:57):

Yeah, absolutely. So this year was the first annual OpenAI Security Research Conference. It was pretty small. It's about a hundred people invite-only. And I happened to have a contact at OpenAI and he sent me the invite to be just an attendee. And I was, you know, honored just to, you know, be kind in the room with like a ton of like neat people who were at the kind of forefront of automating security testing and security workflows with AI as well as assessing them.

So there were talks about both sides, right: using AI to scale security, and then also attacking AI and, you know, kind of the deficiencies and building applications. So I was actually kind of one of the more practitioner led talks. We did a talk on holistic testing methodology for attacking AI implemented systems, which was a mouthful.

Jason Haddix (04:45):

But we did that and it's basically a talk that we've been giving that outlines a whole bunch of enterprise use cases where we've done pentests, AI pentests, and found some cool stuff and how we built our methodology. And then also an introduction to our prompt injection taxonomy, which we also open source. So we have the high level methodology, which I ran through with details and basically outlined case studies of how the methodology worked applied to different customers. And then the prompt injection taxonomy, which is one of the tools we use to succeed in that methodology in different parts of it. So those are the two things I presented on, went really well. Hopefully we'll go back next year and, you know, give version two and yeah. It should be cool.

Madi Vorbrich (05:32):

Absolutely. So what assessment strategies did you specifically share for evaluating AI implementation? So not just models, but also how they're deployed and used as well.

Jason Haddix (05:44):

Yeah, so that's actually the difference, I think, in our methodology rather than kind of what a lot of people talk about right now. So, I mean, you've had about 15 years of adversarial work on different types of ML for many, many years, right? I think the first adversarial ML paper was about, I think maybe 14 years ago actually at least by my research. And most of that is focused on getting the model to do bad things or misclassify things in an image model or something like that. And that has been the term AI red teaming until very recently, right?

And so when you go to like an AI red teaming company that's been around for a while, they focus generally on the model. They focus on getting the model to discuss harm, getting the model to be biased, getting the model to mention certain band topics like CBRNE topics, chemical, radiological, things like that, that it shouldn't talk about.

Jason Haddix (06:40):

But that's actually not a thing that most enterprises care about these days when they're implementing these LLMs into either the backend of APIs or internal chat systems or whatever that's actually less of a focus area that they want assessed. They want the application interface assessed that's connected to the LLM. They want to make sure that the agenting and tools are all locked down.

So they can't do really horrific things, which they can right now. They want to make sure that the RAG data and the business logic they've built into the prompt engineering cannot be messed with. They want to make sure that all of the system inputs are controlled and they want to make sure there's also like a set of applications that supports any implementation of an LLM in production. So you have logging and monitoring observability; you have a bunch of, you know, prompt template libraries.

Jason Haddix (07:32):

You have things that support maybe if you have an MCP protocol, there's all kinds of other DevOps systems that support a live implemented version. Your guardrails are separate AI systems, your classifiers are separate AI systems. They all support this app that you're building. And so there's more than one app in the ecosystem. And we also focus on attacking those. So we started to see this in assessments, and so we just broke it down into a methodology that is exactly that, identify the system inputs, attack the ecosystem, attack the model, attack the prompt engineering, attack the data, attack the application and pivot. And that's our high level methodology.

Charlie McCarthy (08:09):

Makes sense. I want to backtrack a little bit, Jason, to when you were first talking about the adversarial ML example and how the industry has been researching that space for 15 years if not longer. And you mentioned the word “paper,” indicating research. Can you talk a little bit about the process related to AI red teaming and this methodology that you've developed and how researchers can [inform] these types of methodologies or taxonomies that are being publicized?

Because we see papers all over the place, like every day it seems like, or, you know, every couple weeks there's a new paper about AI red teaming, pen testing, all the different components that we need to pay attention to from a penetration testing or ethical hacking perspective where your AI is at risk.

And one of the common pieces of feedback that we've been hearing within the [MLSecOps] Community over the last several months is like, okay, this is all well and good, but a lot of this is mostly academic. How did you transform some of your research and what you were seeing in the industry into this methodology and taxonomy that could be put into real world scenarios? Like what's that step function to go from academic to actual you know, usable research?

Jason Haddix (09:29):

I mean, I think for me specifically and the couple other people that do the work here at Arcanum (but it's mostly me) is it was bridging both the academic and the underground, right? So you have jailbreaking groups and you have prompt injection groups that exist like on Discord or Telegram and things like that. And so we would take these white papers you know - I parsed over like, I think like 400 white papers on prompt injection kind of taxonomy. Some of the, you know, the best ones came from the Learn Prompting crew, Sander [Schulhoff] there he did his, you know, prompt engineering taxonomy that he built out of the hacker prompt competition. And you know, just a whole bunch of other ones. There's been at least seven specifically to prompt injection, but many associated prompt engineering.

Jason Haddix (10:16):

And so, I mean, it was just manual research. Research parsing that, taking out notes that I think would work in the real world. Testing them against bug bounty targets that had you know, actual, like chatbots enabled so that I could test them in the real world. Building CTF apps for myself out of, you know, LangChain in the early days. Raw LangChain. We weren't even using LangGraph then, right? Like you know, just doing that and building CTFs for myself to make sure things worked against, you know, at least the most used open source models and then some of the, you know, frontier or you know, frontier models as well. So it was mostly just trial and error. And so...

Charlie McCarthy (10:54):

Like getting your hands dirty.

Jason Haddix (10:55):

Just getting your hands dirty. Yeah. I mean, that's how all pen test methodology is built, right? It's like, okay, like there's a conference talk and a lot of time it is an academic thought exercise, but then putting it into your methodology and actually using it requires some testing before you say, yeah, this is a worthwhile thing we need to spend time on. So yeah. So that's basically how we did it.

And then, you know, like I am a big mental models person, so a methodology in its highest sense is a mental model. And then so I built the high level methodology to be all encompassing of like, what we're doing, and then you can, you know, drill down into each section and then just add notes. And I build everything in a mind map. So I use X Minds to gather all my thoughts.

Jason Haddix (11:36):

And then I just started collecting notes from these papers and from the underground research, and then I realized there was a kind of a disconnect between you know, like some of the jailbreaking groups and the methods they used for prompt injection. And then some of the stuff that was being talked in, like the academic world. And then some of those things never showed up in enterprise level apps because they just don't work in certain ways. And so it was just, you know, for lack of a better term, it's like, you know, reinforcement learning, right? You know, like making sure the methodology stands up, right?

Charlie McCarthy (12:05):

Yeah. That's good insight.

Madi Vorbrich (12:09):

So if we were to also take a step back and just think about what red teaming really looks like, how does your methodology also expand beyond that? And then also too, if you were to look at it from a defensive perspective as well, like what are some organizations missing? Like what's like the missing piece that sometimes you might see when it comes to that?

Jason Haddix (12:30):

Yeah, so it's really interesting because you have like buckets of organizations and clients who are attempting to build AI-enabled systems, right? So you can have like an enterprise level client who is all in the Amazon ecosystem, right? Which means that they're getting provided you know, classifiers and guardrails that they can use for every model and every agent. And everything's pretty locked down. And there's IAM applied to everything, and they get the newest bells and whistles and those companies, you know, we do tests on them and we find stuff, but you know, like there's, there's a lot of hoops to jump through.

So we have to use more of our prompt injection tricks to like get through that kind of stuff. But then you have organizations who are in like a middle tier bucket or like a lower tier bucket, and they're just trying to build like a simple agent-based system to action their internal data and like, you know, do something with it.

Jason Haddix (13:19):

You know, either tell them about like, you know, like insights into their business or provide, you know, help to engineers or whatever. And so those systems and those companies who are mid tier, they don't have any of that stuff. They don't have classifiers built into the systems. They don't have guardrails, you know, it's pretty rife for prompt injection.

And then the lower tier people are literally just like building bots out of API calls and, you know, the different ecosystems of ChatGPT and, you know, Anthropic and stuff like that, or OpenAI and Anthropic. And so those ones are also really easily attackable.

So I would say there's more of those middle tier and low tier bucket customers that we get that need just like a lot of help. And so it's talking about the availability of either the open source or closed source products that, you know, that you identified every level to stop an attacker.

Jason Haddix (14:06):

So it's input and output protection in the web app. It's input validation into the model. It's a classifier into the model. It is a guardrail into the model for prompt injection, whether you're going to use an open source or closed source one. And then on the output also a classifier in guardrail in that architecture. And also input and output validation back to the user. So that along with role-based access control and scoping API calls for agents are kind of the biggest things that we harp on with our customers for defense.

Charlie McCarthy (14:36):

Would like to clarify for the audience, Jason, is this available under Arcanum's open source offerings?

Jason Haddix (14:47):

Yeah, so our prompt injection taxonomy is the open source project. The high level methodology, like, I don't, I don't think that we could like patent that it's just something we made up or whatever. It's just, it's been talked about in a whole bunch of presentations I've done. So that's out there, like on YouTube, if you look at any of our AI talks or whatever, but we teach a class basically on how to do this. Like we've been teaching other pen test companies how to do these assessments holistically.

So in our class, you know, like we give away a probably like 25 slide talk on case studies, how we do this, same thing we did at OpenAI, but then we have like a big class where we actually teach you step by step how to go through and assess one of these applications at different levels. But all the information is out there at some point. Yeah.

Charlie McCarthy (15:32):

Got it. Okay, I see. That's helpful. We'll hyperlink those resources within the show notes. I just wanted to make sure we were directing people to the right place.

So for members of the MLSecOps Community specifically, I mean, we've got a pretty broad audience including, you know, AI engineers, ML developers, data scientists, AppSec, InfoSec, CISOs, CEOs, policy makers, basically everybody trying to learn how to secure and govern AI systems. Who's the primary audience for your red teaming methodology? And the open source tools. Say within a Fortune 500 company, a security team working to secure AI: what types of practitioners do you envision using [the methodology] and at what stage of the game should these assessments be happening?

Jason Haddix (16:22):

Yeah, I mean, so you probably want to have an AI pen test. And we call them AI pen test because they include more than just the AI red teaming, right? So you probably want to have an AI pen test close to, you know, close to official launch of the system or the feature. But still in, you know kind of a dev environment, right? We end up finding a lot of things that require basically cleaning of data sets, RAG data sets a lot, or scoping of API keys or architecture that's a little bit different than the client originally had them set to. Which, you know, does require some dev time. And so you don't want to be live fixing that in prod usually. People bring us in right before they launch the feature or the product most of the time, although we have some that we're hitting in production and you know, then it requires to go back. But...

Charlie McCarthy (17:08):

Better late than never, I guess?

Jason Haddix (17:09):

Y, better late than never, yeah. And you know, honestly, people are just getting to them, they're just getting to the point where they realize they need a specialized test for this, right? Like I'm, I've talked to plenty of clients, clients who thought it was just going to be pretty easy to train their existing pen testers, you know, who do web app tests or network tests or something like that to do you know, a combination of this AI red teaming and AI testing, and then just found out that it is a giant discipline in and of itself. And that it couldn't really be done by like someone who wasn't like really in it all the time.

And so they end up, you know, having to call their pen test vendor or, you know, look out online and see who's doing this type of research, or some people, you know, put the the scope in their bug bounty to attack it, and then they just go live and get the bug bounty findings.

Jason Haddix (17:56):

So that's an option too. So yeah, so that's, you know, that's kind of some ways to action the testing. I think that you know, we're, we're talking to a lot of usually in our calls it's usually like the highest ranking Defensive Director at an organization. So whether that's like head of like, you know, blue team or security operations or you know, your incident commander, it's more security that's worried about it than development, at least in our world. But those are the two people we're talking to, and usually the mandate falls upon them or they proactively, you know, know that this system has some private data in it that, you know, not everything has been pruned from, you know, everywhere and been locked down. And they really want to get a strong test on the holistic system. So yeah.

Charlie McCarthy (18:44):

That's reassuring to hear that security teams you're noticing are starting to become more concerned about it. Because, you know, as all of this started trending two or three years ago, that was another big thing we noticed within the [MLSecOps] Community was that there was a bit of a disconnect between the developer teams and the security teams, and neither really knew the subject matter that the other team was concerned with.

So like a security team maybe at the time knew broadly that like, “Hey, there are these aspects [of AI systems] that we need to be concerned about, but we don't even necessarily know what we're dealing with or the components,” or yeah.

Jason Haddix (19:22):

Yeah, I mean, luckily, like you said, there's been a lot of talk and content and you know, like podcasts and shows like this that have come out -

Charlie McCarthy (19:30):

That kind of bridge that gap…

Jason Haddix (19:30):

You know, kind of bridge that gap, right? And, you know, AI is everywhere right now. AI security is less everywhere, but it's still got a lot of people doing good work in the field. So I think that we actually caught the wave, you know, at the right time. And, you know, we're, we're doing a lot of assessments and I think we just get stronger as we do more assessments too, because we get exposed to more architectures. And then we can, you know, kind of pivot with our methodology and we need to. But you know, if you're just getting into this, I think in my notes, I think there's like maybe 10 or 12 open sourced like CTFs that can get you started in this world of like assessing like web apps that have been enabled by APIs and different companies have published 'em in their open source or hosted for the community to practice on.

Jason Haddix (20:15):

And I think that that's where you start. It's definitely where I started, right? Like I started with Sanders HackAPrompt competition and that was three years ago now, two and a half years ago now. And then I moved on to, you know, Gandalf was one of the first ones that came out. And then you know, and then now there's a ton of open source labs that exist all over the place to learn at least the prompt injection part. I feel like there is a little breakdown.

A lot of those CTFs that you use, they're very basic prompt injection attacks or agent-based attacks, where in an enterprise based system, you have chained agents that do like data transformation in between, and they have classifiers and guardrails. So there's this big wall that you have to jump over when you're learning is like, oh, this is really simple. I can get started in this. And then you get hit by an enterprise based system and you're like, I can't do anything 'cause I have no idea how this works. And, you know, like my injection's not working. And so there's a learning curve kind of wall there, but it's possible, it's definitely possible if you do enough research.

Madi Vorbrich (21:14):

Yeah, and you know what's so funny, Jason? Is that I feel like all of the great security researchers, hackers, pen testers that have ever talked to, that's always kind of like their path. They're jumping off point is “I started with this CTF, I started with this competition,” and I just always hear that across the board from most people.

Jason Haddix (21:31):

Yeah. My funny story is so Pliny the Prompter is one of the most prolific jailbreakers in the AI scene, right? And his group is the Bossy Group. The Bossy Group basically writes jailbreaks, universal jailbreaks for every model, has pretty much jailbroken every model that exists so far. And so Pliny's team did a CTF called Bad Words, and the CTF was basically trying to get the frontier-based models to say really heinous things. And so I competed in that CTF. I think I'm globally ranked seventh right now, but I did it with a bookmarklet, so I needed to automate my submissions to the CTF. And so what you're doing is you're trying to get it to say horrible things.

So I have six monitors and each monitor is running automation to send prompt injections to these models to say really horrific things. And my 16-year-old daughter walks in the room and she's looking at six monitors of just some heinous stuff on my, and she's like, dad, what is going on? And it just keeps on flashing, like, you know, it flashes the prompt inject, the sentence, and then it's like, you failed or you succeeded. And she's like, what is going on? And I'm like, this is this work. I swear, yeah, this, yeah, this I can explain this is for work. So yeah, that's my funny story about like you know, one of those competitions, so.

Charlie McCarthy (22:47):

Have you noticed, Jason, if there are any parts of the AI stack that are most often being left out of like, traditional AppSec reviews by some of these organizations? Or just recommendations that you have for things that are missing?

Jason Haddix (23:05):

Yeah, so you know, like a thing that we see in that mid tier and low tier bucket, and even sometimes in the enterprise level deployments of stuff, is you're connecting your agents not only to you know, standard tools that you're used to agents like executing, so like web search you know, like file parsing and stuff like that, but you're also connecting them to SaaS-based rest APIs to pull data from things like Salesforce or you know, CMDB or you know, whatever, right? And so these calls that an agent works with to pull data to pull contextual data into your system, we find a lot that people are not scoping these calls to read only and they're scoping them to write as well.

And so through the agent, we've been able to write into several pieces of software on the internal part of the organization by doing prompt injection to tell it to write into Salesforce or Slack or something like that. And we'll use that platform to attack developers. So we'll build JavaScript popups that say, put in your credentials and those will send passwords back to our red teamers. So we find a lot of people are not scoping their API keys super well with the agents.

Charlie McCarthy (24:15):

[Whoa]

Jason Haddix (24:17):

Yeah. So I mean, usually, usually in the talk I do, I do a couple of case studies, so I'll give you one. So one of the other ones we find is we find a lot of healthcare organizations right now are implementing internal knowledge basis to take all their healthcare data to do something with it, either to organize it, to put it in a golden data set to do analytics on it based on patient care or, you know, like how much services they do across a certain portfolio or whatever. So the system inputs part of the methodology, finding all the system inputs is also really important. So one of the clients we worked with their whole system was taking a bunch of patient records and these come in different forms, so the model had to be multimodal.

Jason Haddix (25:02):

And pulling data from them and doing some analysis on them. And basically you know, they used an open source model for this that was multimodal. And so when you think about patient records and you upload these into a system, it could be a PDF, it could be a scan, which is an image, it could be you know, a piece of XML content that comes from one of these medical apps that hospitals use and stuff like that. So they had to take in all this different type of data, and so we just backdoored the hell out of every type of input into this thing.

And what we did is we put up, if you've ever heard of Blind Cross site scripting before, blind cross site scripting it's just a whole bunch of JavaScript attacks that when someone views it, it will pop up or it will silently send a command and control request to us and take screenshots of their browsers and stuff.

Jason Haddix (25:48):

So we backdoored the PDFs, the metadata, the binary data, we put QR codes with JavaScript attacks in all of these things and sifted them through the system. And because the system was supported by all these logging and monitoring systems, and they needed to do human-in-the-loop acceptance of the business analysis too, we ended up hitting their organization in several places. And this all went through the LLM.

So you have to think like a red teamer and pen tester as well as an AI red teamer. You have to like marry these skills. And then we've just seen, like, you know, these days prompt engineering for the agents and for the overall planning system for any of these systems also, that is the core of where the client is trying to build the business logic. Right now they understand that their app is just, you know, is just a lot of fancy prompting supported by agents and so they put a lot of this business logic into the prompt engineering for these systems when you start looking at the enterprise ones.

Jason Haddix (26:44):

And there is some sensitive stuff in these prompts when you work with enterprise people. And there's also some business logic in there that shouldn't be left to the AI to do. Like we had a car manufacturer who made an internal engineering system and basically they were counting on the prompt engineering to only show a subset of the data they were pulling from an agent which was pulling from a database.

And so with some simple prompt injection, we were able to ask it for full info and pull full data out of the RAG dataset. And the RAG dataset had things like basically failure data for different specifications on parts on cars, acquisition cost of patents, a whole bunch of what you would consider IP data that was in the rag and they were counting on the prompt engineering filtering that. But you cannot have that, right? That's a technical thing that you need to do or you need to not have it in, you know, the data at all. That's in the RAG. Yeah, so.

Charlie McCarthy (27:44):

That's crazy.

Madi Vorbrich (27:46):

I know that's... All those stories are actually really crazy. It's scary to think about that, you know, you are testing all these systems and this is just, I mean, are these like case scenarios that you're saying, is it kind of like one in a million or are these things that are happening often when you're doing this?

Jason Haddix (28:03):

No, this is very often. I mean, okay, so like the API keys, hard-coded keys in general, PII and RAG, these are common hitters right now. It is the wild west when you're testing these things.

Charlie McCarthy (28:17):

Yeah, a lot of stuff that I think leadership doesn't necessarily know they should be thinking about. You've got these companies that are trying to innovate and keep up with this breakneck pace -

Jason Haddix (28:28):

They're moving so fast.

Charlie McCarthy (28:28):

They're moving so fast, they want to keep their competitive edge. And that is important, but it comes at a cost, especially if you're not assessing this stuff.

Jason Haddix (28:37):

Yeah. The last case study we usually talk about was it's a sales or it was a tech company and they wanted to build an internal AI that basically tied together all of the Salesforce data with their sales methodology so that their sales engineers and their salespeople could chat, could basically @ a sales bot in Slack and it would give them information about a customer if you just put the customer name and it would give them like where they are in the, you know, sales journey, all the notes from all these disparate systems, all this kind of stuff. And it would expose it via Slack to a salesperson. And so they built this and it was basically the CEO pushing for this 'cause he thought it would help their sales cycle and, you know, be super awesome.

Charlie McCarthy (29:21):

It was convenient.

Jason Haddix (29:22):

Yeah, super convenient. And so he just went to engineering to get it done, and engineering was like, yeah, we could do this. And so they ended up building the system and we got there. And this is one of those lower tier bucket of, you know, customers who's just trying to get something working, right?

“And so I sat down in the middle of the assessment with the CEO and I was like, Hey, it's really interesting that you guys are like cool with, you know, sending all of your Salesforce information, which includes, you know, all of the customers like personal information, you know, red lines for contracts, like quote information that you're sending that to the OpenAI ecosystem and the Slack ecosystem.”

And he is like, we're not doing that. And I'm like, yeah, you are. Like, that's how this system, this is actually how this system works.

Jason Haddix (30:05):

And he is like, no, and I'm like, yeah, look like you do @ sales bot and it goes and queries this data and it, you know, it rewrites it through the OpenAI API and then sends it through the Slack infrastructure. This is passing through all that. And just because they really didn't have a big security team, they didn't understand the threat model of like sending this out to people.

They had committed a pretty big privacy violation for their policies and had no idea. And like this, you're still on the lower end of like, people adopting AI, you see this kind of stuff all the time. You're like, okay. And then again, that was another instance of that access to Slack and Salesforce, those API keys were overscoped and we were able to pull data just raw through the API or through prompt injection through the API calls into those systems. So yeah, a lot of that kind of stuff.

Madi Vorbrich (30:54):

Oh my God. I'm like sweating thinking about this. That's crazy.

Jason Haddix (30:59):

Yeah. Yeah, so we, in the class, we do like a, I mean, we have quite a few assessments we've done, but we also do live threat modeling. We try to do it of enterprise implementations. So what we'll go out and do is we'll go out and look at YouTube and if a company does like a, you know, like a hype video for their new AI feature, we'll go try to find those for big companies like ADP or whatever, you know, and then we'll try to reverse engineer and threat model. Okay? What could be the attack vectors against a system like this? And usually they'll give away some tidbits about their technical architecture. And we do that in the class for like five or six random companies. It's a really fun, it is a really fun exercise and to get you thinking about like, you know, kind of the methodology for threat modeling these and then attacking them, so.

Madi Vorbrich (31:44):

So I want to kind of go back, Jason and talk about the prompt injection taxonomy that you built, right? I mean, it's like what, 60 plus category taxonomy that you put together. So can you kind of walk us through first off, like what it is? And then why did you develop it?

Jason Haddix (32:04):

Yeah. So you know, we basically started doing the testing and when we got to the enterprise level customers who had like a lot of security components or classifiers or data transformation in the mix, we started to get blocked. And I don't like losing. So I wanted to figure out a way to like figure out kind of what were the tips and tricks were in this. So what I did is I took basically a bunch of white papers, a bunch of other taxonomies that had existed, and I just started doing a bunch of research and taking down notes.

And what you'll see a lot of people talking about is holistic single shot jailbreaks, especially even in the tools that are pretty common that people use these days, like Grock and Pirate and I'm blanking on the other one right now, but all of the command line based tools that you can use to assess AI. What they are really good at is finding you know, certain types of biases and then identifying if one shot jailbreaks work on models but they're not really good at like contextual prompt injection.

Jason Haddix (33:06):

And so what we did is we started reverse engineering the techniques for all of this academic research as well as we went to the jailbreak community. The Bossy Group and reverse engineered every jailbreak there and the techniques. And so what we started looking at is like, okay, so there are some repeatable techniques in here.

So, you know if you look at any of the universal jailbreaks that the prompt you know, that the jailbreak people do is they will put what's called an end meta tag at the beginning of their prompt injection, which is like end user instructions or end system instructions, start new system instructions, right? And it looks like an HTML tag, right? Well, they put that there because in the model system prompt, that's how the model vendors start and stop their model system prompts. And so they're trying to confuse the intent of the model by adding their own tags at the beginning of the jailbreak.

Jason Haddix (33:55):

And so there's all these small little tricks that that these jail breakers do, but a lot of people are just shooting their whole text at things and seeing if it works. We pulled these out and defined them as four different things. So in a prompt injection attack or attacking an AI system, you have four primitives we call them. One is your intent. What are you trying to do to this AI system? Are you trying to do traditional red teaming stuff like get it to be biased or speak harm or cook meth or whatever, right? But there's also custom intents.

There's things like business related intents, like give a discount or action an agent or something like that. So what are you trying to get done? Then you have techniques of prompt injections which, you know, one very popular one is like narrative injection, right?

Jason Haddix (34:39):

Like asking the model to pretend like it's tells story in order to achieve an objective, right? And so there's a whole bunch of techniques. And then when you decide on a technique, then you have to get past all of these security gates, which are classifiers and guardrails and you know, sometimes laughs and stuff like that. So you use evasions just like we've been using for the last 10 years in web app hacking to get past web application firewalls or network firewalls. So these are things like, you know, encoding things in ASCI or, you know, encoding them in Base64 or whatever. And then you have some utilities that go along. And this was all inspired by Metasploit, basically. So I saw prompt injection as the vehicle for most attacking of most AI systems now was like, I need to break it up in the way that metasploit breaks up exploits.

Jason Haddix (35:24):

So Metasploit breaks up exploits into designing the exploit, the payload, the utility, and then a whole bunch of other stuff. So that's what we built our methodology kind of to mirror. And so each one has like, you know, a whole bunch of little pieces in it of, and, you know, a whole bunch of types of techniques, whole bunch of types of evasions, whole bunch of utilities. And so if you break 'em up into these primitives, the combination of the intense techniques and evasions comes out to somewhere like under like, you know, 10 trillion combinations that you could test an AI system with, with different combined evasions and techniques and stuff like that. And so we continue to do research on new methods of prompt injection and we add them into the evasions or the technique section. And you know, one, you know, a couple of quick hitters that we usually talk about is, you know, hiding your prompt injects inside of emoji invisible unit code data.

Jason Haddix (36:12):

So you generate an emoji with invisible unit code that basically has your prompt injection inside it. And so it bypasses a lot of classifiers and guardrails. Another one is link injections, since most classifiers and guardrails don't mess too much with code output 'cause if they did, it would be a horrible user experience. Asking, you know, an agent to build a link with private data encoded as part of the link. The uLink URL is really popular right now. And that bypasses a lot of classifiers and guardrails. Another one is, is really cool. It's called bijection or bring your own encoding. You, as a user prompt, you tell the model that you're building a new encoding standard and you build your own encoding.

So you say A equals you know, another character, you know, 75 or whatever. You build your own encoding method at the beginning of the user prompt. And then you use that encoding method throughout the prompt inject in order to bypass classifiers 'cause they've never seen it before and you just made it up in the beginning of the conversation. So these are the kind of techniques that we put in, you know, kind of the taxonomy.

Charlie McCarthy (37:18):

Hmm. Is a lot of this happening in the English language? You know, since we're working with LLMs (large language models). I mean, how much research, if any, have you seen that's been done kind of exploring other languages?

Jason Haddix (37:32):

So part of our evasions are using other languages. So a lot of classifiers, guardrails and even models don't work well with non-standard or non widely used languages. So certain dialects. One of the ones we've been really successful with is Hebrew, actually. And so yeah, I have a buddy who does a bunch of prompt injection in Hebrew and it works really well all the time 'cause the models are not trained super well on large corpuses of data yet. So we use languages as evasions. Yeah. Also another quick hit that it works all the time, or that has worked up until really recently, but people are starting to train on a little more is fictitious languages. So languages like Klingon, Leet Speak, Pig Latin, work really well to bypass classifiers and guardrails. So these are the type of things that you have to think of when you're trying to attack these things.

Charlie McCarthy (38:29):

Out of all of these tactics or techniques that you've been describing, are there a few at the top of the list that work a high percentage of the time? Or, and when we say high percentage, what is that percentage? Like 60%?

Jason Haddix (38:46):

It depends on the model that we're attacking, right? So things like Llama 4 are subject to you know, a lot of these things where, you know, the ChatGPT system might not be subject to them or the Claude ecosystem, you know, right. Might, so I, you know, the ones I already told you, those are our, those are our hitters right now. Those are our big hitters is the emoji injection unit code, also invisible unit code by ejection, bring your own encoding. And yeah, all that, all the language stuff is, you know, kind of, those are the ones that work a lot right now I would say.

Charlie McCarthy (39:20):

Those are most consistently effective so far.

Jason Haddix (39:22):

Yeah. But those are, those are the evasions, right? To get past a lot of the protections you also have to like do the prompt injection correctly. So this is the part we teach in the class is like every system is contextual. So when you look at a system for a customer, you have to really understand either by threat modeling or the questionnaire at the beginning of the assessment, you have to understand what the agents do, what they, you know, kind of have access to. And then you can build a prompt inject that bypasses the classifiers and stuff like that, but also tries to aim to steal specific business data.

This is why the automated tools right now don't work really well for these type of assessments, because you could be asking, like our car manufacturer, what I'm really after is stealing all of their acquisition prices for all of their parts from their vendors, right? No automated tool is going to have a prompt injection sentence that can do that. And so really what we do is we take, you know, our brains, we threat model the application, we ask the questions, and then we run, you know, basically AI to build the attack strings for with the evasions, with the technique, and then we send those prompt injections across, you know, either automatedly or manually if we have to.

Charlie McCarthy (40:37):

That's impressive.

Madi Vorbrich (40:38):

Yeah. So then switching it to like how, again, like your students or even teams might use this in practice, I know you gave some examples, but can you shed some light onto like team-specific they can use this in practice?

Jason Haddix (40:56):

Yeah. So I mean internal teams, if you're, you know, if you don't really like, if you're one of those like mid or low tier customers, not in like value, but just in like the fact that, you know, you don't have like the big money to spend on, you know, all of the Amazon ecosystem or complete integration into the ChatGPT ecosystem with all the bells and whistles. Like you'll use projects like ours just to go through a set of questions. We have on the repo, we have a set of threat modeling questions that's free and open source. So I would start there, start asking questions, and it has all of these things that are big hitters for us. Like, you know, how are the API cues scoped? Do they need to be read and write in order for this agent to work?

Jason Haddix (41:35):

You know, what access do they have via the rest API of tenancy to that platform, right? Like, if you've ever worked in any of these rest based, you know kind of APIs, should it have access to all the data? Should it call everything in your organization, or should it only be scoped to a few data sets that you need to give it to? Right? and so we have threat modeling questions, but I would start there as an internal team and start threat modeling because there's extra AI based questions there that will have you ask, you know, good questions. Like, are we counting on the prompt engineering to filter out any data from the RAG source? And that the answer should be no.

You know, we should have clean RAG data that's already been pruned of PII, but that doesn't happen at all in the real world. And so these are, these are the places where you'll get those ahas and you know, we hope to write a little bit more in the repo about what to do if you've identified these problems. But it's still early days. I mean, the project has only been out for a few months, so yeah.

Madi Vorbrich (42:32):

And then thinking about defenses too, what have you seen on your end? Like what defenses are actually effective against all these attacks that you described? I mean, even like the emoji one that you described to us, what, what is a good defense look like there?

Jason Haddix (42:49):

So the emoji one still works against like the, you know, frontier models right now. So it's hard to defend against. I think that's it... It's a hard question. I think that a layered defense and depth approach to building an agent-based system where every component every AI that you have in the chain, whether it's the agent-based AI or the orchestrator has a classifier and guardrail and data transformation.

I think that data transformation is actually one of the slept on techniques to break a lot of prompt injections. So this means that when the user sends in a natural language query to your system you translate it to JSON or you translate it to XML or markdown or something like that, you'd be surprised at how many times that breaks my prompt injections. And then you have it run through a classifier and then you have it run through that representation, run through a guardrail.

Jason Haddix (43:45):

And that is like the hardest three chain to break through is data transformation, classifier, guardrail. Now there's open source products, right? Of which many vendors, you know, I'm not going to put forth any vendor, right? But there's a whole bunch of open source guardrails out there. You know, the one that most people know is Nemo from Nvidia. And it's, it's pretty good, but there's a ton of other ones out there that are relatively good. You know, at the OpenAI conference like I got to sit and ask Sam Altman he came in and did a Q and A with us. And I, one of the questions we asked him was, you know, you said many years ago you thought prompt injection could be solved. Do you still think that in 2025? And his answer was basically, no.

Jason Haddix (44:30):

He thinks that we can get away with training the models that they're really good to protect against most prompt injections, like somewhere in the 90th to 95th percentile, but there's always going to be 5% that the models just cannot train out of. And that 5% is going to need to be covered by that defense in depth model. And even then, I don't, I think maybe we can get to 98-99%, but there's always going to be ways to hack these systems. Just like there's always been ways to hack web apps. And that's when you're going to need to have this extra assessment on top of it to be part of that defense in depth layer.

Madi Vorbrich (45:07):

So Jason, we're going to start wrapping up the episode, but I want to go ahead and ask what are some really good gems, some really good tips, tricks, advice that you would give to our listeners that are tuning in?

Jason Haddix (45:22):

Yeah, I mean, if you're on the defensive side, right? There's plenty of pretty easily implementable guardrails out there that are open source. And there's a lot of companies right now that are building guardrails that are pretty affordably priced to implement the pipeline that make my job harder, right? And definitely, it's not easy to bypass some of these systems. I have to spend a ton of time crafting prompt injections that work very specifically. And so the first thing I would invest in is I mean using, use the best model you can, obviously, right? That's been, you know, trained and tuned to handle general prompt injection and red team attacks to cover bias and harm. And then check your API keys for your agents, make sure they're not right when they don't need to be.

Jason Haddix (46:06):

Also check the scope and the access of the agents themselves when they access tenant data. Like, you know, through rest APIs, make sure that they can't grab too much data. Make sure that you before you build your RAG knowledge base that it's pruned of PII. This includes metadata. We found plenty of random stuff in the metadata of documents, but the documents themselves had been pruned of PII, but the metadata included private information. So we were able to query that out with different techniques. And then apply a guardrail, like some type of open source guardrail. That would be the start.

And then when you implement these things, you can't forget about the web website. We've seen so many implementations of web app bugs through the AI system, either in the chat interface or the way that it logs.

Jason Haddix (46:52):

We've seen customers streaming all chats to web sockets that anybody could just look at. If you open up the developer prompt, like these are common streaming implementations that they weren't ready for. So don't forget about the web app to assess the web app that's attached to the feature or the API itself. So those are the quick tips for defenders.

For attackers, there's more than ever there's a ton of information, especially like on OWASP and stuff like that. There's a whole LLM security group that you can go be part of, and they have the OWASP LLM Top 10 as well as some testing methodology out there, but also just follow, like, the jailbreak community. I think it's really, really valuable to see the tactics that they use and reverse engineer kind of how they're doing stuff. And just watch the space. There's more and more talks about attacking and hacking AI out there that are coming out at every conference. So just, you know, keep up with it, build your own methodology, take what works, you know, leave the rest. So, yeah.

Charlie McCarthy (47:45):

And it goes without saying, get some training from Jason.

Jason Haddix (47:51):

Yeah. If you're into it, come check us out.

Charlie McCarthy (47:54):

Yeah, yeah, yeah. As we're wrapping up, Jason, I just want to say again, it is an honor and a thrill to be talking with you on the show. We started our little show two and a half years ago. We are in our third season now, and to have someone of your caliber here talking to us is pretty amazing. For community members and MLSecOps Podcast listeners, what are the best places for them to go to follow your work and or get involved in a training? Is it a particular website, is it connecting with you on LinkedIn?

Jason Haddix (48:20):

Yeah, I mean, I'm on LinkedIn, Jason Haddix. I'm also on X as @JHaddix, so J-H-A-D-D-I-X. That's where I spend most of my time blathering about stuff. And I'll post our classes to X, but our main website is our arcanum-sec.com, so A-R-C-A-N-U-M-sec.com and you can go there and it's got all of our training dates at the top of the page as well as when we launch new research or we build new repos or open source anything, it's usually somewhere on the website. Yeah.

Charlie McCarthy (48:53):

Beautiful.

Madi Vorbrich (48:55):

Wonderful. Well, yeah, thank you so much again, Jason for joining us and to our listeners don't forget to share this episode and we'll catch you next time on the next episode of MLSecOps.

Jason Haddix (49:07):

Awesome. Thanks everyone.

Madi Vorbrich (49:08):

Thanks.

[Closing]

Additional tools and resources to check out:

Protect AI Guardian: Zero Trust for ML Model

Recon: Automated Red Teaming for GenAI

Protect AI’s ML Security-Focused Open Source Tools

LLM Guard Open Source Security Toolkit for LLM Interactions

Huntr - The World's First AI/Machine Learning Bug Bounty Platform

Thanks for checking out the MLSecOps Podcast! Get involved with the MLSecOps Community and find more resources at https://community.mlsecops.com.

Guest

SUBSCRIBE TO THE MLSECOPS PODCAST