Privacy Engineering: Safeguarding AI in a Data-Driven Era
Welcome to The MLSecOps Podcast, where we dive deep into the world of machine learning security operations. In this episode, we talk with the renowned Katharine Jarmul. Katharine is a Principal Data Scientist at Thoughtworks, and the author of the popular new book, Practical Data Privacy.
Katharine also writes a blog titled, Probably Private, where she writes about data privacy, data security, and the intersection of data science and machine learning.
Transcription:
[Intro] 0:00
D Dehghanpisheh 0:29
Hello and welcome to The MLSecOps Podcast, where we dive deep into the world of machine learning security operations. I’m one of your co-hosts, D, and today Charlie and I have an amazing guest joining us on the show.
Her name is Katharine Jarmul. She’s a Principal Data Scientist at Thoughtworks, and the author of a fantastic new book called Practical Data Privacy. She also has a blog called, Probably Private, where she writes about data privacy, data security, and the intersection of data science and machine learning. We’ll have links to both her blog, and her book in the show notes and transcripts, so check them out.
There is so much we cover in this conversation; from the more general data privacy and security risks associated with ML models, to more specific cases such as the case of OpenAI’s ChatGPT. We also touch on things like how GDPR and other regulatory frameworks put a spotlight on the privacy concerns we all have when it comes to the massive amount of data collected by models. Where does the data come from? How is it collected? Who gives consent? What if somebody wants to have their data removed? Things of that nature.
We also get into how organizations and people, like business leaders, data scientists, and ML practitioners, can address these challenges when it comes to data: the privacy risks, security risks, and reputational risks. And of course, we explore the practices and processes that need to be implemented in order to integrate “privacy by design,” as she calls it, into the machine learning lifecycle.
Overall, it’s a really informative talk, and Katharine is a wealth of knowledge and insight into these data privacy issues. As always, thanks for listening to the podcast and for reading the transcript, and supporting the show in any way you can.
With that, here’s our conversation with Katharine Jarmul.
Katharine Jarmul 2:15
I mean, I had the unfortunate experience of going to university during maybe the second AI winter, which I think is what we call it now. So I was really good at math, but everybody told me to do computer science. I really hated computer science because computers are awful and I really liked math. So I ended up actually switching my major to do statistical thinking.
And mainly I actually went to the social sciences. So I did political science and economics with a focus on statistical modeling and so forth. That was really fun. I liked it a lot. And then I went beyond that. I did a little detour, and I was a teacher for a few years. And when I got back into data, I was hired as a data journalist at The Washington Post, and that's when I first worked on what we would call today data science.
So at The Washington Post we had some investigative projects. We were one of the first places to use Hadoop for data reporting and so forth. And I worked on a Python application team that built news applications with data and did storytelling with data. And that was now some time back. And from news applications and kind of news processing, I eventually got pulled into NLP [natural language processing].
So I shifted to a startup based in Southern California that was doing kind of trend analysis and NLP techniques, started by some Xooglers and some ex-Yahoo folks. And we got to play with Hadoop some more there and do NLP and some of the first NER [named-entity recognition] work. So entity recognition and these types of things I learned on the job there.
And then beyond that, I started getting involved in some larger scale data things. So, how do we handle data at scale? How do we do real-time scaling of these systems? And eventually I found my way into deep learning, because eventually deep learning kind of took off in the NLP space.
And that's when I learned deep learning. I was like, this is really cool. It's a bunch of math. I like it. Linear algebra, some other types of thinking through algorithms, thinking through Bayesian modeling, Bayesian thinking and all that was really exciting. And I had to learn a lot of it on the job and via self study. So I'm deep in the self-study space of machine learning, I guess.
Charlie McCarthy 4:38
Fabulous. Thank you. So as we're thinking about data science then, and more specifically data privacy within the realm of AI and machine learning, what role would you say data privacy plays in contributing to the holistic security of machine learning models?
Katharine Jarmul 4:57
Yeah, I mean, one of the things that I talk about in the book, and one of the things that I first noticed when we're thinking of something like natural language processing is you're dealing with quite a large amount of personal data.
So if you're doing any natural language processing at scale, and particularly if you're doing natural language processing at a particular entity, let's say, that wants to analyze messages, or maybe posts, social media posts and so forth, you're immediately getting into a space where you're dealing with a lot of private information, and potentially private information that the people sharing it don't exactly know in what context it might be used.
And one of the things that we notice when we do this at scale with deep learning is there is a tendency– or there is an ability for a neural network or other types of network architectures to essentially memorize or overfit parts of the data, particularly outliers. And outliers, therefore, are at much greater privacy risk when we think particularly in deep learning use cases, but also really any machine learning use cases. And this is particularly dangerous if you are thinking, oh, the outlier is somebody who put their address into a machine learning system or they put their name or they put other personal, private, sensitive information that can be tied back to their personhood.
And I think that this is part of how I started getting interested in the problem of privacy and machine learning.
D Dehghanpisheh 6:27
First of all, thanks again for joining us. I love having you on. This is fantastic. And congratulations on the book, which we'll get to later. That's fantastic stuff as well.
As you talk about the machine learning model lifecycle, you've talked a little bit about, say, how models can memorize their training set and potentially leak that at inference time, and how that changes the model lifecycle and how data scientists and everybody need to think about it.
Talk to us a little bit about some of the main data privacy and security risks associated with the ML models. And I'm wondering if, for our audience, you could categorize those risks based on where they occur in the model lifecycle.
Katharine Jarmul 7:04
Ooh! That’s like a “twofer”– two for one question. [Laughs]
D Dehghanpisheh 7:09
That's why we have two of us on hosting, right?
Katharine Jarmul 7:12
I love it! I love it!
D Dehghanpisheh 7:13
Come to the MLSecOps Podcast, you’ll get free information.
Katharine Jarmul 7:16
Yeah. No, I love it.
It's also a little bit of systems thinking. Okay, I will try to be comprehensive, but I might miss something. So please comment if you see that I missed something.
D Dehghanpisheh 7:30
It sounds like we're all missing a lot in data privacy though.
Katharine Jarmul 7:34
Yeah, so, I mean, obviously we have the entire data infrastructure that might collect the data. And depending on how advanced your machine learning setup might be, this might be a series of pipelines and a series of lakes or warehouses or however you have it. And if people are using feature stores, then you might even have a situation where you're ingesting data and putting it directly into a feature store.
The biggest problem that we have in these systems, whether or not you have, like, a staging and then a production and then a feature store that's derived from production, is: what happens when somebody comes and says, like, “Oh, my data is incorrect,” or, “Oh, I'd like to delete my data,” or you have data retention questions, or you have consent questions with regard to data privacy regulation or something like this?
And we often don't have very good lineage tracking in these systems. And often when we have lineage tracking, it's, let's say, not well connected, and it's not well understood how to query it in an automated way. And so we often run into these problems where, in the feature store, we actually have some features that we've put maybe even into production use cases, or that we've essentially engineered and put into either a feature store or directly into pipelines that go into training models.
And nobody can answer the question of consent information. Nobody can answer the question: when, where, and how was the data collected? Nobody can answer the question: did we get the ability to use this for machine learning? And nobody can answer the question: if somebody comes and they ask for their data to be removed, what are we supposed to do with the feature pipelines or the artifacts that were created with those features? And this is a huge mess.
So this is the first stage, which is basically data processing, preparation, and maybe even feature engineering. We haven't even got to training and we have these big problems. Sorry, go ahead.
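To make those questions concrete, here is a minimal, hypothetical sketch of the kind of lineage and consent metadata that would let a team answer them in an automated way. The record fields and helper function below are purely illustrative, not taken from any particular feature-store product or from the book.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical lineage/consent record attached to each feature in a feature store.
@dataclass
class FeatureLineage:
    feature_name: str
    source_dataset: str                                     # where the raw data came from
    collected_on: date                                      # when it was collected
    consent_purposes: list = field(default_factory=list)    # e.g. ["analytics", "ml_training"]
    subject_ids: set = field(default_factory=set)           # data subjects whose records contributed

def features_to_review(catalog, purpose="ml_training", deleted_subject=None):
    """Flag features that lack consent for a purpose, or that were built
    from a data subject who has requested deletion."""
    flagged = []
    for f in catalog:
        if purpose not in f.consent_purposes:
            flagged.append((f.feature_name, f"no consent recorded for {purpose}"))
        if deleted_subject is not None and deleted_subject in f.subject_ids:
            flagged.append((f.feature_name, "built from a deleted subject's data"))
    return flagged

# Example: a deletion request comes in for subject "user-123".
catalog = [
    FeatureLineage("avg_message_length", "messages_raw", date(2023, 1, 5),
                   ["analytics"], {"user-123", "user-456"}),
    FeatureLineage("purchase_count_30d", "orders_raw", date(2023, 2, 1),
                   ["analytics", "ml_training"], {"user-789"}),
]
for feature, reason in features_to_review(catalog, deleted_subject="user-123"):
    print(feature, "->", reason)
```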
D Dehghanpisheh 9:40
Yeah, and I mean, just on that alone, it seems like most of the regulatory frameworks that guide a lot of the thinking in terms of data privacy as it exists now, take GDPR or others, don't encompass that model lifecycle. Which means I may have solved it and said, okay, I got a request to remove this person's information, but it's memorized in a training data set.
And if that training data set is memorized in the model, it just kind of cascades through, and it's like, well, that was a good, valiant attempt, but it failed because it's already kind of spawned literally into all the other models, and maybe even into this kind of cascading design of model training where the output of one is the training input of another.
Is there really a way to comprehensively address this in your mind?
Katharine Jarmul 10:29
Yeah. So as part of the book, I wrote about an idea I had, which is kind of based off of something I'm certain somebody has referenced before, but let me just introduce it first for listeners to this episode: the idea of model cards. This model card mindset of, hey, what data went into it? What were the training mechanisms? What were the evaluation criteria? And really, we should have this anyway.
As you all know from an MLOps perspective, we need a more organized, queryable, and usable way of comparing models. And we have some systems, like some of the things that, as far as I know, Weights & Biases first led, but that now many, many people work on, which is: how do we evaluate, from a performance perspective, the many different models that we might be training? We might even have shadow model systems, and how do we evaluate and switch models?
But we don't really have that from a governance perspective very often, from a data governance perspective, and also, I guess you could call it, a model governance perspective. How do I quickly look up what models have been trained with data from what locations? How do I quickly segregate, let's say, models trained on data of Californians, right? When the new privacy law in California goes into effect versus other places, how do I look at sunsetting contributions of certain regions over time?
And these are also really important questions from just model drift and data drift perspective, as well as to better understand where is the data coming from? What's the population represented here? In what use cases do we want to divide things by regions or divide things by particular user groups or something like this? And coming back to my social scientist roots, population selection is a nontrivial problem in data science by default from a statistical perspective. And I think that it's an undervalued problem in a lot of machine learning workflows to actually think through what is population selection.
And for companies that operate in Europe, as the AI Act gets going, you have to do a lot of thinking about population selection. And these types of things are definitely going to need to be better documented.
So in the book, I describe what I call a privacy card for machine learning models, where I suggest maybe some things that you can do, like documenting what privacy technologies you evaluated. Where does the data come from? Does the model need to be deleted at a certain point in time? How are you allowed to use the model based on the data that was used to train it?
And I think thinking through model retention periods is probably just a good practice by default. It's sad when I meet teams that say they're using the same model for a year, because I think that that probably means they're not actually evaluating the model in any type of real-time sense.
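As a rough illustration of that privacy card idea, here is what such a record might look like as a plain Python dictionary, including a retention field a model registry could check automatically. The field names and values are hypothetical examples, not the schema from Practical Data Privacy.

```python
from datetime import date

# Hypothetical "privacy card" for a trained model, inspired by model cards.
# The field names are illustrative, not the book's exact schema.
privacy_card = {
    "model_name": "churn-predictor-v7",
    "training_data_sources": ["crm_events_2022", "support_tickets_2022"],
    "data_regions": ["EU", "California"],                 # matters for GDPR / CCPA obligations
    "consent_basis": "contract plus opt-in analytics consent",
    "privacy_technologies_evaluated": {
        "differential_privacy": "used during training (epsilon=3.0, delta=1e-6)",
        "federated_learning": "evaluated, not used",
        "pseudonymization": "user IDs tokenized before feature engineering",
    },
    "permitted_uses": ["churn scoring for existing customers"],
    "prohibited_uses": ["marketing to non-customers", "credit decisions"],
    "retention": {
        "model_delete_by": "2024-06-30",                  # sunset date for the model artifact
        "reason": "trained on personal data with an 18-month retention period",
    },
}

def must_be_deleted(card, today):
    """A simple automated check a model registry could run nightly."""
    return today.isoformat() > card["retention"]["model_delete_by"]

print(must_be_deleted(privacy_card, date(2024, 7, 1)))    # True: flag the model for deletion
```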
Charlie McCarthy 13:36
Speaking of regulation, we mentioned a couple of guiding policies like GDPR. We came across a post that you wrote a couple of months ago, Katharine, on your Probably Private blog. This one specifically was entitled “ChatGPT & GDPR Showdown, AINow Report and Privacy, Privilege & Fraud.”
So, ChatGPT. Super popular right now. How does the case of OpenAI's ChatGPT highlight some of the privacy concerns that you were just discussing or maybe some that we haven't touched on yet? Some of the more prominent concerns?
Katharine Jarmul 14:09
Yeah, I think the question becomes if you're an outlier in the data, or let's even say you're not an outlier in the data. Let's say you're a creator. You're a content creator. You both are content creators. Have you had a chance to ask ChatGPT about your podcast?
Charlie McCarthy 14:27
Never have.
D Dehghanpisheh 14:28
I have. I hate to admit it.
Katharine Jarmul 14:31
And did it have updated information about who you are?
Charlie McCarthy 14:36
It shouldn't have!
D Dehghanpisheh 14:37
No, it was a hallucination, and it was very clear from the training set that it didn't have any reference. But some of that was the prompt engineering, right?
Like, we intentionally avoided giving it any information that would have fed that retraining pipeline.
Katharine Jarmul 14:53
Yeah. But even from the language model, so the GPT-3 plus whatever it is now–
D Dehghanpisheh 15:00
3.5 and 4, yeah.
Katharine Jarmul 15:02
Yeah– is from data collected from ‘21 and earlier. So, 2021 and earlier. And I encourage people to have a look and to ask a bit who you are, who's your company, if those are listed, what's your blog, what's your newsletter, what's your podcast? Sometimes you'll get these hallucinations, sometimes you'll get things that are half true, half hallucination, and sometimes you'll even get your own words repeated back to you and so forth.
So when I ask, it says, Katharine Jarmul is a data scientist focused on this, that, and whatever. And I think the question becomes, is that what the user would expect? Because I'm not on Wikipedia. I'm not like a famous person. I don’t think ChatGPT should probably know who I am. And yes, you can search my name on the Internet. Yes, indeed, you can find out things, but then it's mainly websites I control that I update.
And the question starts to become, okay, personhood. These are questions of personhood, branding, who you are, what you work on. And then also for people like journalists, their words, their actual words, work, articles they've written. We get into a lot of intellectual property questions, but we also get into a lot of, is this expected user behavior? Like, is this expected? Would a user expect that they can be found in this system, or that their company or their work can be found in the system? And I think the question starts to become, should there just be an opt-out?
Because most of privacy engineering, most of thinking through building privacy into systems is actually thinking through the user desire and essentially trying to avoid situations where what the user expects and what actually happens, we have this very creepy divergence.
I think that's some of what you hear people say about personalized ads. When they're creepy, people don't want them. When they're good, people want them, but when they're creepy, people don't want them. And I think a lot of privacy is figuring out socially and culturally, where are the acceptable boundaries, and how can we make technology meet the social boundary rather than the other way around?
D Dehghanpisheh 17:15
So that social boundary of the creep line, right? Like, are you coming up on the creep line, or are you not? Besides the technical components, there are clearly some process enhancements that I think you're subtly recommending, and in some cases maybe less than subtly recommending, such as the ability to let your users opt out of sharing, or to go back and scrub their data from the system.
What are some of the processes that you think need to be invoked outside of just the, hey, these are the technical things you should do? You've mentioned model governance, data governance. There's code governance. The reason we started Protect AI was because we don't believe that any of these encompass the ML lifecycle holistically. They're all kind of transactional.
But what are some of the process components that you're thinking of, in terms of what to advocate for to improve privacy, get privacy by design into the models early on, and continue that?
Katharine Jarmul 18:17
Yeah, I mean, one happens, as you pointed out, right at the beginning. It's like, how do we decide what the model can do, what the model can't do? How do we govern that in any reasonable sense? And then how do we look at what data we're allowed to use, what data we're not allowed to use, and keep that all testable, right?
Like, you should literally be forced to test this: that data from bucket A should not be allowed to reach feature store C, and if it does, it should be missing these pieces of data, right? And these are all things that you all very well know we could test, we can validate, we can revalidate. We can even think through testing for entropy, should we want to use encryption, should we want to use masking, tokenization, any of the number of things that we can use.
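Here is a minimal sketch of the kind of automated check being described, written as a plain Python test. The bucket, feature-store, and column names are hypothetical, and a real pipeline would load the actual materialized feature table rather than the small fake frame used here.

```python
import pandas as pd

# Hypothetical policy: raw data from "bucket A" may feed "feature store C"
# only after direct identifiers have been dropped or masked.
DISALLOWED_COLUMNS = {"email", "full_name", "street_address", "raw_message_text"}

def leaked_identifiers(feature_table: pd.DataFrame) -> list:
    """Return any disallowed columns that made it into the feature table."""
    return sorted(DISALLOWED_COLUMNS & set(feature_table.columns))

def test_bucket_a_features_are_minimized():
    # A real pipeline would load the materialized feature table from feature
    # store C; this small frame deliberately includes a leaked column so the
    # test demonstrates what it catches.
    feature_table = pd.DataFrame({
        "user_token": ["a1f3", "9c2e"],                   # masked / tokenized ID: allowed
        "avg_session_minutes": [12.5, 3.0],
        "email": ["x@example.com", "y@example.com"],      # should have been dropped upstream
    })
    leaked = leaked_identifiers(feature_table)
    assert not leaked, f"Direct identifiers reached feature store C: {leaked}"
```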
But then, even in training, there's actually a lot of options. So when you think of model training, one of the chapters in the book goes through adding differential privacy as a potential privacy protection during training. And your results may vary. So your problem might not actually fit differential privacy. That's fine.
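For readers who want to see the mechanics behind that chapter's topic, here is a toy sketch of the core of differentially private training, per-example gradient clipping plus calibrated Gaussian noise as in DP-SGD, applied to a simple logistic regression. It is illustrative only: it does no privacy (epsilon) accounting, and a real project would use a maintained library such as Opacus or TensorFlow Privacy.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_logistic(X, y, epochs=50, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    """Toy DP-SGD for logistic regression: clip each example's gradient to bound
    its influence, add Gaussian noise calibrated to that bound, then average.
    Illustrative only; no privacy accounting is performed here."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))
        per_example_grads = (preds - y)[:, None] * X                 # shape (n, d)

        # Clip each example's gradient to at most clip_norm (bounds its sensitivity).
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

        # Add noise scaled to the clipping norm, then take the mean gradient step.
        noise = rng.normal(0.0, noise_multiplier * clip_norm, size=d)
        w -= lr * (clipped.sum(axis=0) + noise) / n
    return w

# Tiny synthetic example: the learned weights are noisy but still useful.
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
print("weights:", np.round(dp_sgd_logistic(X, y), 2))
```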
But there are other things that you can do, too, when you're in the training stage, where you can, let's say, remove outliers, or where you can actually formulate or model the problem in such a way that it's not so focused on any one individual, where people are more grouped, or where you use things like Bayesian thinking or other approaches to your algorithm design and development, and therefore also your training, that allow for people to have the outcomes they expect.
And again, if it's a personalized ad system, guess what? There's only a certain amount of privacy you're going to give, because the model itself is not supposed to give privacy. But what people should also be doing is testing with users: are these the expected outcomes? And then, as you all focus on, having those tests integrated.
So, how can we formulate the expected outcome in a technical way? And then how can we use that evaluation criteria in the way that we evaluate whether the model is behaving as expected, before we release the model into a production system? And then obviously, when we have things like people wanting explanations of model behavior, or people requesting to refute a model inference or something like this, then we can also collect those as data points on how models are doing from a privacy point of view.
D Dehghanpisheh 20:48
So I would imagine this kind of centralized mechanism of managing the privacy by design elements that you're advocating for, which make a lot of sense, and which we're huge believers in at Protect AI and the MLSecOps Podcast here.
How does federated training and federated learning, though, change that? That's a pretty big disruption just from an architectural perspective. Talk to some of our more, I would say, developer-oriented listeners about how you would guide them on that.
Katharine Jarmul 21:17
Yeah. So I know that you've had some folks talk about federated learning already on the podcast…
D Dehghanpisheh 21:23
Just marginally.
Charlie McCarthy 21:25
In passing.
Katharine Jarmul 21:26
Okay, then we'll start from, just so that everybody's oriented–
So when we're shifting to a federated learning perspective, we're shifting training and also validation from a centralized data perspective to a completely federated or decentralized setup to some degree. And federated learning was first championed at Google, and the reason why it was championed is they actually wanted to use more sensitive and private data, but they realized that they would very much go over the creep line if they did it.
Because what they wanted to do is they wanted better keyboard suggestions for non-native English speakers. So they wanted keyboard suggestions in multiple languages. And they realized that their training data set was so biased toward English that they had really poor suggestions, sometimes for things like emojis, but also even for words and conjugations in other languages; not in every other language, but in numerous other languages.
And the data science team that was working on this problem thought it would be really great if we could get better data from people's keyboards to predict these next words or potential next tokens for multiple languages, and get them directly from people who said, okay, the default language of my keyboard is Spanish or Portuguese or Italian or German. Right?
But they realized that's super creepy. We can't just steal everybody's keyboard data from their Gboards; the ick factor there is pretty high. That wouldn't have gone over well from a PR perspective, or probably from any privacy regulation perspective. And so they said, why don't we ship training to the device? So instead of taking the data in, putting it into our centralized feature store, training on it, and then maybe deciding to delete it within some sort of retention period, which is one way to approach privacy, they said, we're never going to collect the data, we're going to send the model training to the device.
And that sounded really crazy six years ago when it started. But now I think people are used to their device having even specially designed chips, like the M1 chips and these types of things that Apple is doing, that have GPU-like or vectorized or tensorized operations accelerated on-device, or on end hardware, or on embedded systems. That's a reality. And then we have more and more RAM, more and more processing power. And this provides some extra privacy because, again, it's not sending every token, but we can argue about the privacy controls of federated learning.
So it's definitely good in that data doesn't become centralized. But there are ways that the systems that do this can leak private information, particularly for outliers and these types of things. But I think it's a really cool step and I think it also opens some interesting security questions, as maybe you're pointing to.
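To make the "ship the training to the device" idea concrete, here is a toy sketch of federated averaging, the basic aggregation step at the heart of federated learning, simulated entirely in one process. Real deployments like the keyboard example add client sampling, secure aggregation, and often differential privacy on top, none of which is shown here; the model and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def local_update(w, X, y, lr=0.05, steps=20):
    """One client trains on its own data; the raw data never leaves the 'device'."""
    w = w.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)         # plain least-squares gradient
        w -= lr * grad
    return w

def federated_averaging(global_w, clients, rounds=10):
    """Server sends the model out, clients train locally, and the server only ever
    sees the returned weights, which it averages (weighted by local data size)."""
    for _ in range(rounds):
        updates = [local_update(global_w, X, y) for X, y in clients]
        sizes = np.array([len(y) for _, y in clients], dtype=float)
        global_w = np.average(np.stack(updates), axis=0, weights=sizes)
    return global_w

# Three simulated "devices", each holding private local data from the same task.
true_w = np.array([2.0, -1.0])
clients = []
for n in (50, 80, 30):
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=n)))

print("federated estimate:", np.round(federated_averaging(np.zeros(2), clients), 2))
```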
D Dehghanpisheh 24:34
Yeah.
Yeah, you landed on that right, because, I mean, it's a different type of almost secure enclave type of thinking, but it also creates a bunch of spirals in terms of how you'd have to take care of that operationally.
Katharine Jarmul 24:48
Absolutely.
Absolutely, and you're now training with data you can't see.
So just… Everybody reflect on that for a minute.
D Dehghanpisheh 24:58
Yeah. And you don't know where it is, you don't know where it came from, and you don't know where it's residing.
Katharine Jarmul 25:03
Exactly.
D Dehghanpisheh 25:04
It's just kind of out there. It's ephemeral.
Katharine Jarmul 25:07
Exactly.
Charlie McCarthy 25:08
Yeah, so that gets me thinking. What types of privacy enhancing methods should organizations be thinking about with regard to their model training?
Can you talk a little bit about encryption, anonymization? What should we be looking at for models that maybe have been trained with more non-private methods in the past?
Katharine Jarmul 25:27
Yeah, I think the easy start is to start thinking through the absolute basics, which is minimization. So data minimization. What can we take out that maybe we don't need? Even things like categorization or tokenizing, masking.
So one of the suggestions I had in the book is, it happens that a lot of times fraud or credit align with private attributes, and these are often also encoded in biases in our society and so forth. So, like, that zip code implies income, and that income then implies creditworthiness or fraud risk and these types of things. So a lot of times you'll go into these systems, and a lot of times, zip code is directly encoded in fraud or directly encoded in credit risk or any of these other types of modeling because it performs well, because it's directly tied to income, and sometimes it's also tied to race, and sometimes it's also tied to other factors.
And so when we think through all this, we have to ask ourselves, okay, from a privacy point of view, does it actually make sense? If what we're actually trying to encode is income, shouldn't we just literally take the census income block by block and map it to income? So instead of tacitly saying these zip codes have fraud and these zip codes don't have fraud, which releases quite a bit of private information about where your customers are, for example, and where they aren't, you could have an encoding mechanism where you take the zip code and you translate it into income. And when it's income, there's a much smaller search space, and yeah, okay, now I know somebody's income, or presumed income, right? It's a guess at their income. But that's a lot better than knowing their income and their zip code.
Sometimes you just got to think, what data can we remove? Or more clearly say, here's the private information that we're trying to learn, and I'm going to model it so that it matches what I'm actually trying to learn. So beyond these basic approaches of trying to remove data and think through the problem, like, what are you actually trying to do with the private information you're using?
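Here is a small, hypothetical sketch of that zip-code example: replace the precise zip code with the coarser attribute you actually care about (a census-style income band) before the data reaches feature engineering, then drop the zip code entirely. The lookup values are made up for illustration, not real census figures.

```python
import pandas as pd

# Made-up lookup standing in for public census data: zip code -> median income band.
# Real values would come from published census tables, not from customer data.
ZIP_TO_INCOME_BAND = {
    "10001": "60-80k",
    "94103": "80-100k",
    "73301": "40-60k",
}

def minimize_zip(df: pd.DataFrame) -> pd.DataFrame:
    """Swap the precise zip code for the coarser attribute we actually care about,
    then drop the zip code so it can't end up encoded in the model."""
    out = df.copy()
    out["income_band"] = out["zip_code"].map(ZIP_TO_INCOME_BAND).fillna("unknown")
    return out.drop(columns=["zip_code"])

applications = pd.DataFrame({
    "applicant_id": [1, 2, 3],
    "zip_code": ["10001", "94103", "99999"],
    "amount": [1200, 560, 3100],
})
print(minimize_zip(applications))
#    applicant_id  amount income_band
# 0             1    1200      60-80k
# 1             2     560     80-100k
# 2             3    3100     unknown
```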
You can think through anonymization. So, differential privacy as a method you can use during training, and potentially as a method that you can use during feature engineering, although that's much more difficult. And then you move towards things like federated learning and things like encrypted learning. So I used to work in encrypted learning, where we actually only ever train on encrypted data. And these are obviously much more advanced methods and are going to be dependent upon you having the team capabilities and the infrastructure that it takes to launch these types of things, but also, I think, they represent fundamental shifts in the way you can design machine learning systems with more privacy and often with more security guaranteed.
And I think there's two chapters of the book, one on federated learning, one on encrypted learning, or encrypted computation as a part of encrypted learning, where I deeply go into what are the theories, how do these things work? And then, what are some open source libraries and tools you can use to get started?
At least, if you're a tinkerer like me, maybe you just want to dip your toe in the water and see, how does it even work? How can I even learn on encrypted data? And I think, for me personally, the future of the industry is thinking through more decentralized, more user-controlled machine learning. And obviously these two things, encrypted learning and federated learning, would then have a big impact, should that be the future of machine learning that we see.
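For tinkerers, here is a toy illustration of the core idea of computing on data you never see in the clear, using additive secret sharing, one building block behind encrypted computation. It is not the specific protocol or libraries covered in the book, and it leaves out everything a real secure-computation framework handles, such as multiplication, fixed-point encoding of model weights, and malicious parties.

```python
import secrets

PRIME = 2**61 - 1   # work modulo a large prime so individual shares reveal nothing

def share(value, n_parties=3):
    """Split an integer into n random shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Two users' private values, e.g. local statistics each computed on their own device.
a_shares = share(42)
b_shares = share(100)

# Each party adds only the shares it holds, without ever seeing 42 or 100.
sum_shares = [(a + b) % PRIME for a, b in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))   # 142, the sum computed entirely on shares
```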
D Dehghanpisheh 29:08
So, as a follow-on to that, Katharine, I'm assuming that one of the things we need to build into the build process of machine learning is really more auditability of the system, not just model cards. Like, you need to know which model on which inference endpoint contained what.
And you need that full lineage to see it. And you need the ability to go back in time and reconstruct everything in the model lifecycle, not just the model card as it existed at that point in time. Is that a fair assumption?
Katharine Jarmul 29:39
Yeah. I mean, at the end of the day, we need auditable and at least partially automated ways of doing this. As in, if a model card, or this type of audit chain, even supply chain (we could think of it as the data supply chain of a model as well as the process supply chain of a model), if that's not somewhat automated, we all know nobody's ever going to do it.
Nobody's ever going to update it. It's just going to die on the vine somewhere. And somebody's like, did you update the wiki? No, we didn't, sorry, I don't know.
D Dehghanpisheh 30:20
So it's interesting that you bring that up, and I'd be remiss to both our listeners and the employees of Protect AI if I didn't comment, because this was unintended. But that's exactly what Protect AI's tool is doing. So for anybody who's listening or reading or wants to see that, feel free, via the show notes, to book a demo and we'll give it to you.
I love this concept of thinking through, kind of, understanding the surface area of your model, and understanding the privacy by design elements that you can check for, audit for, and control for, based on who, what, when, and where someone was looking at something. I think it's a really critical, important concept that you're passing on to the listeners and readers of this podcast. Is that a fair assessment?
I mean, I don't want to put words in your mouth here, but I think it's just about looking around the corner, if you will, and incorporating techniques that make the entire ML experience and journey more secure and more private by design.
Katharine Jarmul 31:23
Yeah, absolutely.
I mean, I think if there is an auditable chain, like you say, and if I can look back and say, okay, I need to actually flush out all models from a data retention point of view. Let's say that Europe decides that models are personal data if they've been trained on personal data, which could happen. And if so, it will probably happen soon. And if that's indeed the case, then data retention applies to models.
So, if I can't segregate the models I had that used such and such data sources with such and such training mechanisms, or that lack certain training mechanisms, like differential privacy during training or something like this, then I immediately need to go find those models. I need to delete or remove them, or I need to figure out how I'm going to anonymize them.
And I was recently at the EU Privacy Forum, where the data protection authorities meet with the cybersecurity authorities, along with academics, industry folks and so on. And there was a whole section of the program this year that was just on deletion of personal data from models. And so I don't say this as an “oh, maybe, whatever”; it is actively being discussed and put on the agenda by the European cybersecurity authorities and the data protection authorities.
So I'm not trying to be a scaremonger or anything, but it's a real concern that people have, and I don't think it's going to go away.
Charlie McCarthy 32:55
So we've talked quite a bit, Katharine, about some of the processes and technologies. Let's talk a little bit about the people.
Say organizations are looking to incorporate a lot of these privacy enhancing technologies into the process of building their ML models and systems. When we're talking about the people aspect, though, how important is it for the data scientists, engineers, and practitioners, these individual contributors in a larger organization, to have a clear understanding of regulations like GDPR in the EU and the California Consumer Privacy Act, CCPA?
Katharine Jarmul 33:37
Yeah, I think it's a great question, and I think it sometimes very much depends on how your organization is set up. So your organization might be set up already with a privacy engineering team, which means they're going to be the experts that you can always go to and say for this particular model that I'm thinking about, or for this particular use case, how exactly should I leverage privacy enhancing technologies?
And a lot of the larger technology companies, as well as a growing number of, surprisingly, consumer-facing and e-commerce companies, are starting to have privacy engineering teams, and most of the cloud providers now have fairly large privacy engineering teams as well. So sometimes that help already falls under the hat of the privacy engineer. Which is cool, if so.
But maybe you're at a company that just has, let's say, a risk and compliance team, or an audit, risk, compliance, plus privacy team, and then the data team, which sits in a totally different part of the enterprise, and you're trying to communicate across these very different ways of speaking about data, right? ‘Data protection-speak’ from a legal aspect is totally different from actual ‘privacy technology-speak’ in a lot of companies, and that's totally fine.
But who acts then as the bridge between these two to make sure that the solutions that are being designed are actually compliant with the firm's policies, with the guiding principles from the privacy team and also the other way around? So when the privacy team designs new principles or a new policy or anything like this, how do we actually then, from a technical point of view, translate that into our running systems and our architectures and say are we still compliant? Are we not? Or if so, then how are we going to reorganize our architecture?
And in well functioning organizations, this is already decided and maybe taken under the realm of data governance. But in numerous organizations, this responsibility is quite scattered. And if you work at an organization where the responsibility is quite scattered, you yourself as a data scientist or a machine learning person, you might decide, you know what? I think this sounds cool. This sounds interesting to me.
Let me tell you, there's some hard technical challenges in privacy. It's not all fun and games. There's lots of good math there too. And maybe you decide, hey, I want to work as a privacy champion, and maybe I want to create a role as a privacy engineer. I'm going to take this on. But without people doing that themselves, I think often what ends up happening is somewhere during handoff, things get lost in translation. And then I'm sure that you've seen this all the time from a security point of view when InfoSec hands off to the data team and the other way around where there's just gaping gaps between where people say they're at or think they're at and where they're actually at.
D Dehghanpisheh 36:36
Yeah, and I feel like a lot of times, operationally, they are way too siloed, and they assume that the other silos understand what they're trying to talk about or communicate. So, for example, when you're thinking about how to implement a data policy, and thus maybe a gate in your CI flow around that model to prevent it from going out the door, people just don't even bring everybody to the table to talk about what these policies mean and how they're brought to life.
Like, it feels to me that the simplest steps are just bringing people to the table and just it's not being done.
Charlie McCarthy 37:07
Kind of going back to what you just said, Katharine, that knowledge disparity between certain groups within an organization. You know, policymakers, privacy teams, data science teams. Any recommendations for how we begin to bridge that gap, other than maybe individual contributors taking an interest in the AI and machine learning space and wanting to blend the two? Is there a call for a new type of team related to privacy in AI specifically, would you say? Or, how have you been thinking about that?
Katharine Jarmul 37:39
Actually, in the book, there's a whole chapter on basically working through policies and understanding regulation and working with the InfoSec team, which is also really important, and with governance and privacy teams, which are, of course, very important. But in that, I actually borrow an idea out of security, and I'm not the first to do it, I won't be the last to do it, which is the idea of creating a security champion culture. I think the security community (and I'm sure that you both are very deeply familiar with this culture) has built itself around the question: how do we imbibe security in the way that we think through things?
How do we teach people about security so that they don't have so many oopsies, and these types of things? And I think that we don't really have that yet for privacy at most organizations. But I think that we, as privacy champions, can learn from the hard-won efforts of InfoSec and security as a culture, as a community: how do we teach each other about privacy? How do we teach about technical privacy? How do we talk about these things, weigh trade-offs, and have those conversations?
And I would say that there was a great interview I recently had a chance to lead with a privacy engineer and publish, and his belief is that privacy engineering is probably where InfoSec was 15-20 years ago. We're just at the very beginning of figuring out how do we even shape problems in ways they can be solved.
And I think it's so cool to look at, for example, what you all have been doing in DevSecOps, in ML DevSecOps, and bringing those ideas into machine learning. That has to happen across an entire organization too: how do we take the privacy stuff and put it in, even beyond just the machine learning and data teams, also in the other teams?
D Dehghanpisheh
So to that end, right, we've spent some time talking about technical risks. We've spent time talking about organizational risks as it relates to data privacy, privacy by design, privacy in the age of machine learning and AI models.
I'm curious, though, and I love this concept of champion culture, right? In the champion culture component, you've got people who took up the mantle to do it, and then, lo and behold, now you have a CISO, a chief information security officer, right, as an example. And you have these flavors of shift-left security for developers to become more conscientious of code security, et cetera. That's happening, maybe slower than we'd like.
But when it comes to the champion culture of managing the third category of risk, which is reputational risk, and data privacy, and ensuring your customers' experiences remain private when they should be, we had to have bad things happen to people, and cause brand reputational damage, before that champion culture took over.
I'm curious how you see reputational risk in the era of data privacy at the convergence point with machine learning. And how does that champion culture for maintaining privacy, where does that reside? Where should that be in your mind as it specifically relates to reputational risk?
Katharine Jarmul 40:55
I talk a lot with the InfoSec folks in my line of work, and I think the interesting thing is we've started to see a merging of reputational risk between security and privacy, because quite a bit of reputational risk is data loss, whether that's insider threat or something like this, or actual breaches, technically-oriented breaches, and so on and so forth.
And these breaches, they are worse when it's private information that's lost. They're also just financially way riskier, because the way that data protection regulation functions is these breaches are held as particular violations of privacy. And in fact, I would say, or I would argue, and maybe some of my InfoSec colleagues would argue differently, but I would say that privacy regulation gave a great gift to InfoSec in that it started to financially penalize and create true financial liability for security mistakes. And so I see them as like, you're welcome, we made it financially very risky.
D Dehghanpisheh 42:04
Money, money, money!
Katharine Jarmul 42:05
Yeah, exactly. Extremely risky for you to mess it up.
D Dehghanpisheh 42:09
Money talks, bullshit runs a marathon.
Katharine Jarmul 42:10
Yeah, exactly. But I think even on the positive side, so that's the negative side of reputational risk; on the positive side, we can look at companies like Apple, who now have entire branding campaigns around privacy, and who are saying, as a reputation, we want to be known as not being like our competitors. And when you buy our hardware, you're protected.
And it's like, one can argue about all sorts of what are the motivations and this, that but–
D Dehghanpisheh 42:43
Well, you see, like, Meta with WhatsApp going against Apple, and Apple going against them. Like, there are some really fascinating things happening.
Katharine Jarmul 42:49
Exactly. And so it's like from a reputational reward, there's also clearly somebody who has run the numbers, and there's also reward for deciding we're going to be the privacy-oriented company. We're going to be the one that respects your privacy. We're going to be the one that helps you protect your privacy.
And let's just say that Apple might be a great company in a lot of ways, but they're not going to do it if it's not also financially beneficial to them.
D Dehghanpisheh 43:19
Yeah, so I guess to end on that, what is your call to action for listeners interested in improving their AI/ML security postures vis-à-vis data privacy-enhancing data governance policies?
What are your one or two calls to action that we can get everybody in the community rallied around?
Katharine Jarmul 43:38
Well, I mean, maybe I'm most excited to be here because I think that you're already working on a big, hard part of the problem, which is that when we think through building models in a more secure way, we're also often touching on how to build them in a more private way. And so I think you've already created a community, and also a company and a product, around thinking through: how do we actually validate that what we're putting out there and shipping is secure by design?
And I think that adding on private by design there would not be a huge stretch to thinking through, okay, it's secure. We've covered all of the bases, we think, from potential security loopholes. But can we also cover compliance risk? Can we also cover privacy risk?
And can we also cover the types of governance risk that, by the way, are very much actively being worked on when we think about regulatory space. And I think both what the US is doing in terms of thinking through how do we do responsible model development, as well as the steps that Europe is making, and who knows where else in the world will also start making regulations around this?
Privacy is going to be a growing part of machine learning, and I want people to not be afraid of it and to instead embrace the chance of learning new cool math concepts, learning new cool technical concepts, and also maybe learning a bit about the social concepts of privacy, and how do we think through machine learning problems with all that stuff already baked into the way that we design our machine learning systems and our products?
D Dehghanpisheh 45:16
Thank you, Katharine. And I think you missed one call to action. That is, everybody should go read the book, Practical Data Privacy.
Katharine Jarmul 45:22
[Laughs]
Excellent!
D Dehghanpisheh 45:24
Everybody, it’s a great book. I loved it. I think it's great. Any book that is already thinking about federated training and learning, which isn't even widely adopted yet, is just a really great, thought-leading book. So one of the calls to action is, go get that book.
And to that end, we are going to be giving away a copy of that as a promotion. So all the listeners, please come on in, maybe we can even get her to autograph it for you.
And with that, Katharine, I just wanted to thank you for coming on, for a fascinating, wonderful discussion about all things data privacy, privacy by design, and privacy in your ML model lineage. We covered a lot of ground, so I wanted to thank you for coming on.
Katharine Jarmul 46:01
Thank you both for having me.
Charlie McCarthy 46:03
Thank you so much. Katharine, it's always a treat to talk to you.
I hope we get to do it again very, very soon.
Katharine Jarmul 46:07
Sounds great.
[Closing] 46:10
Thanks for listening to The MLSecOps Podcast brought to you by Protect AI.
Be sure to subscribe to get the latest episodes and visit MLSecOps.com to join the conversation, ask questions, or suggest future topics.
We’re excited to bring you more in-depth MLSecOps discussions.
Until next time, thanks for joining!
Additional tools and resources to check out:
Protect AI’s ML Security-Focused Open Source Tools
LLM Guard - The Security Toolkit for LLM Interactions
Huntr - The World's First AI/Machine Learning Bug Bounty Platform
Thanks for listening! Find more episodes and transcripts at https://mlsecops.com/podcast.