Evaluating RAG and the Future of LLM Security

Apr 25, 2024 By Guest

Audio-only version also available on Apple Podcasts, Google Podcasts, Spotify, iHeart Podcasts, and many more.

Episode Summary:

In this episode of the MLSecOps Podcast, hosts Neal Swaelens and Oleksandr Yaremchuk from Protect AI, sit down with special guest Simon Suo, co-founder and CTO of LlamaIndex. Simon shares insights into the development of LlamaIndex, a leading data framework for orchestrating data in large language models (LLMs). Drawing from his background in the self-driving industry, Simon discusses the challenges and considerations of integrating LLMs into various applications, emphasizing the importance of contextualizing LLMs within specific environments.

The conversation delves into the evolution of retrieval-augmented generation (RAG) techniques and the future trajectory of LLM-based applications. Simon comments on the significance of balancing performance with cost and latency in leveraging LLM capabilities, envisioning a continued focus on data orchestration and enrichment.

Addressing LLM security concerns, Simon emphasizes the critical need for robust input and output evaluation to mitigate potential risks. He discusses the potential vulnerabilities associated with LLMs, including prompt injection attacks and data leakage, underscoring the importance of implementing strong access controls and data privacy measures. Simon also highlights the ongoing efforts within the LLM community to address security challenges and foster a culture of education and awareness.

As the discussion progresses, Simon introduces LlamaCloud, an enterprise data platform designed to streamline data processing and storage for LLM applications. He emphasizes the platform's tight integration with the open-source LlamaIndex framework, offering users a seamless transition from experimentation to production-grade deployments. Listeners will also learn about LlamaIndex's parsing solution, LlamaParse.

Join us to learn more about the ongoing journey of innovation in large language model-based applications, while remaining vigilant about LLM security considerations.

Transcription:

[Intro] 00:00

Neal Swaelens 00:07

Hi everyone. Welcome back to the MLSecOps Podcast. My name is Neil Swaelens, and I'm a Business Development Director here at Protect AI. Also brought along a colleague, Oleks, who's here on the call with me. Oleks, if you want to give a short introduction.

Oleksandr Yaremchuk 00:20

Yeah, I'm Oleks. I'm a Principal Engineer at Protect AI leading open source initiatives and LLM [large language model] security initiatives as well.

Neal Swaelens 00:28

Great, and today we obviously have a special guest on the show. We actually have Simon Suo here with us. He's a co-founder and CTO of LlamaIndex and we're obviously big fans of LlamaIndex, you know, and the community they've established. And in fact, we recently released a technical guest post [on the LlamaIndex Blog]. So Simon, super happy that you're here. Thanks for joining us.

Simon Suo 00:49

Yeah, super grateful to be here. Thanks for the invite and excited for the conversation ahead.

Neal Swaelens 00:53

Cool. So to set the stage, Simon, do you wanna give a short professional background for our audience?

Simon Suo 01:02

Sure, yeah. I'm co-founder and CTO of LlamaIndex. We built a leading data framework for orchestrating data in large language model in this new generation of software.

Before this, I was a research scientist in the self-driving industry. Worked a lot on training models, deploying models, and really working with these stochastic artifacts that behave very, very differently from the deterministic software that we have become more familiar with. So a lot of my work was building scaffolding and frameworks to make sure that developers can better harness the power of these stochastic systems and kind of bringing that expertise into building this open source framework to really kind of supercharge development in the generative AI era as well.

Neal Swaelens 01:55

Interesting. I guess there's probably a lot of things that you could have transferred away from that experience looking at autonomous driving. I mean, it was actually one of the first real issues of security along with autonomous driving. Is that something you've dealt with quite a bit, like security for AI with respect to autonomous driving?

Simon Suo 02:18

Yeah. In the context of self-driving, a lot of it is more about safety, like how to understand the performance and guarantees of these black box systems that you can really only control via the training data and loss objective. And at the end of the day, you're giving them control in the real world and they can have real impact on like different people who are in the car and outside of the car. So a lot of the work is really rigorous testing to make sure that there's various aspects of, you know, performance and safety and comfort that the model can meet.

Neal Swaelens 02:52

Yeah, a lot of that that we also see today, I guess with large language models (LLMs). And I guess like one of the core questions that I'm keen on learning is obviously today RAG [retrieval-augmented generation] has become a pretty important piece to the typical LLM architecture out there, but that wasn't so obvious, you know, six months ago. Even a year ago, people weren't really talking about it as much as today obviously.

So what really inspired, you know, the creation of LlamaIndex for you and your co-founder, and what sets it apart today looking at the vast amount of frameworks that you have, including those for building RAG-based LLM applications.

Simon Suo 03:35

Totally. I want to just first start by saying my understanding of RAG is, I actually think that term might be a little bit too limiting. The term I like better is like these compound AI systems where you need to orchestrate together a bunch of components, one of them being the large language model and otherwise are these other application APIs or components that you, we have already have familiarity with, right?

Specifically in RAG, it's really about connecting this new form of computation power to the knowledge that we already have, whether that's personal knowledge or like organizational knowledge. It's almost like we have this new computer and it has some built-in hard drive containing a lot of data about the world, but what we really want to do is to have information about us, right? The user who are currently interacting with the application as well as the contextual information about the organization that they have, right?

So I think a RAG as this process of easy way to swap in like a new hard drive containing the custom data you might have so that you can, like, operate on that instead of just like some prebuilt hard drive that comes shipped with the large language model.

Oleksandr Yaremchuk 04:47

Yeah. This reminds me of the idea of LLM OS from Andrej Karpathy.

Simon Suo 04:53

Totally, yeah.

Oleksandr Yaremchuk 04:54

Yeah, that's really remind me, like connections. So you see RAG kind of beyond just, you know, like kind of getting data from vector databases, but actually like connecting the system to like all the knowledge that is available about like a person or organization.

Simon Suo 05:11

Exactly. Yeah. I think RAG - the terminology came from like a 2019 paper and the original paper is not actually just about like frozen retriever as well as the generation being glued together, right? There's the end-to-end training process as well.

So I think we've kind of abuse that term a little bit. I think it's much better to understand it as the broad concept of contextualizing the LLM on the environments that it's interacting with, whether if it's the application context or the organizational context.

Oleksandr Yaremchuk 05:42

Yeah, I'm actually curious. So like, I mean you're doing this open source and we've been also doing, like, we started actually open source as well with our LLM security tooling. But I'm curious, you know, what actually prompted you to, you and your co-founder, to start open source and build in public, and maybe what kind of challenges and like, you know, you faced while being open source?

Simon Suo 06:14

Yeah, that's a good question. I think the project really started at the end of 2023, sorry 2022, right after the ChatGPT had really taken off. We are really just tinkerers who really love learning and kind of interacting with this technology more, and we found a lot of like new wave of generative AI or AI engineers who are also kind of building these new applications, exploring new paradigms.

And a lot of people are reinventing the wheel as it comes to different components in building out RAG or compound AI system. This is about integrating with different data sources, creating different ways to kind of parse and structure this data so that it's optimized for the large language model to reason over it.

So our really initial desire is really to help empower the developers and kind of grow and learn together with them, and build that in public so we can foster a community that can kind of build this new generation of software together. Right. So the open source framework really came out of this desire to kind of share our learnings and really educate this new kind of frontier wave of developers to build these new things.

Neal Swaelens 07:35

Maybe as a follow up question to that obviously being open source, you have that community as a driver to have contributions to your library as well. How do you keep a balance between making a library that is enterprise ready and getting contributions from the community, making that QA like quality assurance a thing?

Simon Suo 08:00

Yeah, that's a great question. I think it's like a constant thing we are thinking about. So one of the big changes we made recently is kind of like breaking up the package into a very small core that we are confident about the security guarantees and the robustness, as well as separate integration packages, right? So that they can be versioned separately and we can put more effort into making sure that the core parts of the library are very robust and well tested, and then sort of more experimental ones are explicitly labeled so - so like you know, for people who are kind of in a more experimentation and prototyping setting, they can explore the newest and the latest research that's come out.

So we typically support that, like new techniques that we enjoy and find potential in almost within days or weeks, to make sure that it's available for tinkerers to be able to try it. But also like the core is very robust and we can harden it across time over the past year to make sure that it has the highest quality possible.

Neal Swaelens 09:03

Yeah, it makes a lot of sense and yeah, a really cool approach. I think, you know, we obviously see quite a few large corporates entering the LLM space, and you've probably seen also the report from Andreessen [Horowitz] on the state of AI at enterprise where you see the massive investment flowing into LLM building, like, built right?

What would you say are kind of the primary security considerations really when you think about developing applications with large language models?

Simon Suo 09:35

Yeah, I think the way I think about a, the security concern is really about like the input and output of the large language model.

Like specifically what input it gets, right? Whether that's only the user data that's currently interacting with the large language model itself, or like you are able to surface organizational information as well, who, which are meant for only a subset of users to see, right? And how is the model, like how are we handling that input before that passed into the large language model, right? So I think there's a lot of things people have like thought about security concern, coming from like prompt injection attack’s ability to reveal like system prompts, ability to reveal user information, organizational data. I think a lot of those make the enterprises super, super scared.

And then on the other side, right, like on the output or action intake is pretty top of mind as well. Like a lot of these applications are being built are co-pilot or assistant applications which have the ability to kind of do something in the real world as well, right? Not just limited to like sending output back, right? A lot of these are being hooked up to like CRMs or like other kind of APIs, so that can trigger some change in the external environment, right? So since now I'm such a stochastic system making sure that the output doesn't have a bad consequence, it's pretty top of mind as well, right?

Neal Swaelens 11:05

Yeah, for sure. Yeah, we also think about it in a similar way. Obviously input and output evaluation being the cornerstone also of LLM Guard and in our view LLM security, but especially if you kinda look at, you know, expanding capabilities i.e. connecting LLMs to, you know, downstream systems, that expanded capability set will also lead to expanded blast radius in case of a breach. So, definitely agree.

Oleksandr Yaremchuk 11:35

Yeah. I'm also curious; so we, when working on this guest post for LlamaIndex and integration with LLM Guard, we experimented, like, can you do like prompt injection in like when you ingest the data? For example, like we took three resumes of different people and in one resume, like in the document we actually put like prompt injection which was, like, white color text. And in that prompt injection, we kind of asked the model to promote this [person’s resume], although the person had like the least experience out of all three candidates, and the model actually did it. It chose that person because we kind of asked like, yeah, this person needs to pay like money. They don't have, like, they're really struggling.

And so yeah, like I would like to understand kind of your take and maybe some stories that you hear from the community about like what are the specific things like with AI security, but like related to RAG?

Simon Suo 12:36

Yeah, that's a good question. Like by the definition of RAG, right, you're giving the large language model access to a lot more data and sometimes you don't necessarily control exactly what goes into that. Actually it's user uploaded, right? I think the example you gave is super great. I don't know if necessarily I want to use LLM to screen resumes quite yet. We actually played around with that and felt that like the, the potential for bias is quite high, so <laugh>. But yeah, I think the general example holds right, like a lot of time like these five applications really relying on some semantic search or hyper search to be able to surface contextual information.

So I think that part maybe is a little bit harder to game by these prompt injection attacks, but I can totally see that even after like the relevant chunks are kind of retrieved like sort of embedded malicious things can be passed into LLM to alternative behavior, right? And like you said, like when you attach more systems, you have a higher blast radius.

Another example I can think of is just like when you use these contextual information to decide an action to take, it can actually cause even more harm in the downstream systems, right? So I think right now, like from people who are currently building these applications, the biggest fear is not necessarily quite these kind of harmful prompt injection attacks yet; it's more about just like data leakage, I think, right? It's like it's not maliciously injecting some instruction to do something, but by accident, like the access control is off and then like I don't know, like this employee gets to see the data that a CEO has put in there and that they're not supposed to see <laugh>.

So I think those kind of fear and concerns are much more top of mind, at least from the conversations I've had. And a lot of that is really just about like great access control configurations as comes to the data, like actually propagating the, the ACLs coming from the data sources into the implementation as relates to like vector database or whatever search engine you have, so that like those information can be safeguarded and have the right access control.

Oleksandr Yaremchuk 14:46

Yeah. It's actually like something that we see a lot actually, like a lot of projects are being developed right now, like by like startups actually tackling this specific problem of data access. But you mentioned a good, like a very interesting thing about propagating a CLS to basically the storage. But do you think like, you know, if like you scale the data and basically start like ingesting more and more, it's not maybe like, it's not kind of the best option because like we see a lot of startups actually trying to kind of classify data and like guess what kind of ACL to assign to it, but we still don't like know if it's like accurate enough way to to tackle this problem. What do you think?

Simon Suo 15:37

Yeah, I mean I think there's many different ways. Like I think like the most important thing is just accurately propagating information from the, the organization’s like identity provider, right? Like like I think a lot of companies are interested in, for example, like connecting SharePoint, right? Like building a knowledge base and putting RAG on top SharePoint, right? Like a lot of like the group and user information already as well available as metadata. They just need to be properly kind of propagated and managed both in the ingestion side as well as the retrieval side, right? Yeah. And I guess like an improper management of that process can actually lead to malicious attacks based on just like, you know, like prompt-based ways to change this access or like metadata, but it depends on how much control the sort of underlying large language might have access in this data injection system.

Oleksandr Yaremchuk 16:35

Yeah. And maybe also would be interesting to see, to hear, like how does LlamaIndex prioritize security and data privacy, like in this design philosophy?

Simon Suo 16:48

Yeah, so I think for us we are fairly unopinionated and what we really wanna have is like an easy way to integrate with these kind of guardrails or like PII (personally identifiable information) masking module so that like goes on the input side and the side of large language model, it's very easy to slot in different providers of these, right? And I think like it's really about giving the developers choice and ability to customize these pipelines so that they can build in the way that they want, right? Like that's kind of the core ethos of the open source framework. We're not trying to be opinionated and I think like, you know, like for developers in these larger organizations who are familiar with their like security requirements, it's best for them to kind of like integrate with LlamaIndex in an easy way.

Neal Swaelens 17:40

Got it. And maybe as a question there, you mentioned obviously developers being slightly more aware of security risks. Do you see a massive shift in the market in terms of education around LLM security when you engage with your users or clients, do you see, for example, a massive increase in queries on security for LLMs with RAG or?

Simon Suo 18:04

I think it's a full spectrum, I want to say? Like the level of maturity in terms of understanding even what LLM is or like RAG is, it's all across the board. I think most sophisticated customers already have these systems set up. They're actively improving the capability and performance and also like pushing forward on the security aspect, right? And I think, like I would say like the majority of them are still most focused on the data privacy and security aspect, less on the malicious attack aspect from what we have seen so far.

Neal Swaelens 18:37

Right. Yeah. And I think we, we kind of see that emerge as well. In our perspective kind of slowly enterprises setting up these AI security roles or AI security director, AI security lead roles. So yeah, I think also on top of that kind of the, the good resources that are out there from the OWASP community on LLM security is probably one of the reasons that drove that. So yeah,

Simon Suo 19:04

Totally.

Oleksandr Yaremchuk 19:05

Another probably question is more about like RAG and the future of it because like a lot of discussions are currently about these long context models, like for example the new Gemini 1.5 like model, or actually a couple of days ago there was a paper from Google, I think it was called like “Leave No Context Behind,” like infinite context window. [Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention]

Yeah. And like how do you envision the future like evolution of RAG techniques given these developments?

Simon Suo 19:45

So I think like I, I typically like to go back to that computer processor analogy again, right? Like when you have more and more powerful CPUs and like larger, larger amounts of RAM, you still want a way to be able to load information into RAM, right? Like there's always like a cost as well as performance trade off in like how much data you give it.

Specifically in RAG, like I think what I see in the future is going to be like you have this trade off frontier of like how much cost and latency versus like performance. And like for some of the tasks it's much more amenable to shove the whole context in and just like expand a lot of computation power and cost to be able to get a high quality analysis result. But in the other cases where you want very low latency and you want like a very and it's operating on a simple like factual information that you can retrieve very precise pieces out of the overall corpus, you'll prefer the other side as well, right?

So I, I think for example, like what we see right now is that like for like video analysis or like multi-document comparison and general summarization, people are putting into these like, like 1 million token context windows. And as long as the task itself is valuable enough, it's totally okay to wait that time and spend that money, right? But when you have like these massive corpus and you just want to know a simple fact, you're never gonna dump the entire corpus into the context window, right? So I, I think that's the general high level, like the way I think about this.

I think tactically, like specifically I imagine there would be less chunking and these kind of like ad hoc operation on these file size documents, and we're going to move towards much more like, you know, use pieces or like semantic sections as the retrieval target, and then after that do a query or like context expansion to grab the entire document and send that to the large language model as the cost and latency continue to improve for these non-con, right?

And I think the most interesting thing to me right now is that we've kind of gotten used to this embedding-based semantic search and right now, like the context on those models are really lagging the large language models, right? So the retrieval target and what the document you want to reason over might not be the same. So we need this additional layer of mapping and like pre-processing enrichment expansion that's needed. So almost like the work is shifting a little bit in the overall architecture, but you still need to map the ability to kind of the, you still have the ability to optimize your retrieval and then bring back the right representation for the LLM to reason now.

Oleksandr Yaremchuk 22:38

Yeah, it's, I think that you, you mentioned this at the beginning that you see LlamaIndex as, like, more like a data orchestration also in that sense, and this is, this makes it actually very like good tool basically, not just for RAG, but if you even go to like longer context window, you still need to ingest this data somewhere. You still need to like retrieve, you cannot just put everything that you have in the same prompt, right?

Like at this at the end of the day, and you also mentioned chunking. I know that chunking is like a kind of big like a huge challenge in RAG. Like there are a lot of ways like I remember even once I I looked at Jerry's Twitter and he was like, mentioning different techniques on, on chunking and people always ask like, what's the best one? Like, do you support this paper? They support that paper. So it was always interesting to kind of follow like how people think about this and not just, you know, chunk just specific text, but like being more like smarter about that.

Simon Suo 23:46

Yeah. It's always kind of funny to me that like chunking becomes such a big topic, <laugh> in the large language model era application building you'll think that's solved by now, right? But we continue to kind of need these specific techniques to be able to get the high quality result out of the system. So yeah.

Neal Swaelens 24:05

Yeah, totally. I think maybe to switch gears obviously, you know, beyond LlamaIndex open source, you just recently released LlamaCloud, and that's a big step forward to making LlamaIndex more accessible to enterprises. Can you kind of give for the audience a short intro on the differentiation between the open source piece and LlamaCloud, and also to what extent you're thinking about, you know, data processing and storing there in terms of security?

Simon Suo 24:39

Totally, yeah. So LlamaCloud is our enterprise data platform and it has a very tight integration with the open source framework, right? So the way we see things is that there's a lot of customization needs on the upper side of application building, but there's a middle layer of data processing and data interface building that are very common to different use cases, right?

For LlamaCloud is almost like a platform as a service in which you can offload some of those works to a managed service so that the AI engineers can focus on developing these applications and customize it to the business need, right?

Specifically for us, the first step we're tackling is unstructured data processing and enrichments, right? A lot of challenge we've seen in RAG is that, you know, garbage in-garbage out, right? If you're not having a good initial step in properly processing complex PDF slides, html, which are the most common format of enterprise documents you're not going to be able to get high quality responses, right?

So we built our custom document processing engine, and it's very much optimized for retrieval generation, right? Unlike sort of generic document parsers out there it's able to semantically extract out sections from these complex documents and also enrich that with additional metadata and tags specifically make it easy for these high research system to retrieve the right sections as well as like kind of easier for the large language model to reason over that, right? And we kind of package it together along with a scalable data ingestion system that can allow you to connect to over 150 data sources, allow you to kind of parse these different document types, and very easily ingest that into the downstream destination you want, whether it's like a blob storage, whether that's like a vector database or in the future also if you want to kind of do additional extraction and injection into like a knowledge graph database as well.

And so we think of ourselves as this like data interface layer that does a lot of the pre-work so that the application sitting on top of that can actually get high quality data for the large model application, right? So this is one aspect I think like in the future we're going to expand to like really good connection to structured data as well, like ability to connect to your SQL database and specify the right schema and metadata so that, you know, those can be exposed to higher level agents or like orchestration system to kind of like combine this information and use that to make the right decision.

Oleksandr Yaremchuk 27:29

I really like that you have this, you know, LlamaIndex as like open source and people can start like using it like right away, just like download and start like playing and building some applications, but then when they like, okay, now we are like, we want to go to production or like high scale, we have a lot of systems connected, we can choose to like optimize ourself LlamaIndex, or we can actually just get like a prise solution with your expertise and like, you know, all the knowledge that you have on this topic.

Simon Suo 28:01

Exactly. So I think like for us, we, we still want everyone to use the open source framework, right? I think the flexibility and extensibility it affords gives you a lot of control to optimize for the application-specific needs. And we wanna make sure that there's a path to at least offload the data processing aspect, right? Because like that way you can kind of save the time and like maintain and like kind of like taking the latest learning from the open source and research and building into your data platform, right? Like, we're trying to make sure that it is always up to date with the latest way to process enrich these data components so that they could be kind of driving the best performance on these systems.

Neal Swaelens 28:51

Cool. I think maybe as a final question: is there one key call to action that you want to leave with the audience? Something that you want to have them take away from this conversation? What would it be?

Simon Suo 29:04

Yeah, I think two things. One is like a high level sentiment. I feel like we're so early in this journey as it relates to large language model-based application. Everyone is figuring out the best practices, and I think the true value unlock is just starting to happen, right? Like, even though it feels like everyone has kind of heard about this already in my little bubble, but every time when I go outside of San Francisco to talk to someone, like they're not even thinking about this yet, right? So I'm super hopeful that the uptick is happening super quickly, but like there's much more to come.

And number two, a more concrete one, like we've been really excited about LlamaCloud and specifically our document parsing solution. And we would love anyone who's working with complex document RAG use cases to give it a try. We have LlamaParse, which is the document parsing part available as a self-serve API. So we've been able to drive a lot of value for our customers in better extracting high quality information from complex documents. Now you can imagine that will work very well for other folks as well. So, a little bit of a plug here.

Neal Swaelens 30:21

Awesome. No, that's a really good way to end the episode. So once again, I'm Neil Swaelens, your host. Thanks for listening in and the continued support as well for community and of course the mission. And I also want to thank our sponsor, Protect AI - our employer - and my co-host Oleks. But of course, last but not least, Simon, thank you so much for the great conversation. And be sure to check out the notes for this episode because we'll drop in some links to the resources that we mentioned. Thanks, everyone.

[Closing]

Additional tools and resources to check out:

Protect AI Radar: End-to-End AI Risk Management

Protect AI’s ML Security-Focused Open Source Tools

LLM Guard - The Security Toolkit for LLM Interactions

Huntr - The World's First AI/Machine Learning Bug Bounty Platform

Thanks for listening! Find more episodes and transcripts at https://mlsecops.com/podcast.

Guest

SUBSCRIBE TO THE MLSECOPS PODCAST