ML Model Fairness: Measuring and Mitigating Algorithmic Disparities
Aug 23, 2023 • 24 min read
Charlie McCarthy 0:29
Hello, MLSecOps Community! This is Charlie McCarthy welcoming you back to another episode of The MLSecOps Podcast, along with my co-host and President of Protect AI, D Dehghanpisheh.
This week we’re talking about the role of fairness in AI. It is becoming increasingly apparent that incorporating fairness into our AI systems and machine learning models while mitigating bias and potential harms is a critical challenge. Not only that, it’s a challenge that demands a collective effort to ensure the responsible, secure, and equitable development of AI and machine learning systems.
But what does this actually mean in practice? To find out, we spoke with Nick Schmidt, the Chief Technology and Innovation Officer at SolasAI. In this week’s episode, Nick reviews some key principles related to model governance and fairness, from things like accountability and ownership all the way to model deployment and monitoring.
He also discusses real life examples of when machine learning algorithms have demonstrated bias and disparity, along with how those outcomes could be harmful to individuals or groups. Later in the episode, Nick offers some insightful advice for organizations who are assessing their AI security risk related to algorithmic disparities and unfair models.
So now, without further ado, we thank you for listening to and supporting The MLSecOps Podcast, and here is our conversation with Nick Schmidt.
Nick Schmidt 2:14
Thank you so much for having me on the podcast today and really excited for this conversation. My name is Nick Schmidt, and I am the Chief Technology and Innovation Officer and co-founder of SolasAI.
Solas is an algorithmic fairness software company, and our mission is to focus on helping our customers develop highly predictive, fair, and ethical predictive models. And the way we do this is not by building models ourselves, but by offering tools to ensure that a model built by a customer meets fairness constraints.
So we focus on just a few things, really. One is measuring discrimination. Two is understanding where that discrimination might be coming from if there is a problem present. And then third is actually mitigating the discrimination. So, helping our customers to find models that are fairer and highly predictive so they meet the business needs, but they're not harming their customers.
Charlie McCarthy 3:13
Fantastic. And what about you personally? Your interest in this space?
You've got a lengthy background in regulatory compliance. Can you tell us a little bit more about that and how it led you to Solas AI and this focus on AI fairness and model governance?
Nick Schmidt 3:28
Sure. So, to maybe kind of begin at the beginning, I grew up in a household where my mother was a lawyer who loved to do public good. She was really focused on helping the poor, helping people who didn't have access to lawyers to make sure that they were able to be represented. My father was a chemist; a chemistry professor. And so that combination of doing good and mathematics kind of combined into what I do now.
And I got my start in regulatory consulting focused on economic and statistical issues. And what we would do is if a company was sued by a group of employees or another company for intellectual property violations, antitrust questions, or employment discrimination, we would come in and do an analysis of the economics of the lawsuit. And then if there was evidence of securities fraud or discrimination or whatever it might be, we would then put a dollar figure on it and testify in court if necessary.
D Dehghanpisheh 4:35
So, hey, Nick, you talked about Solas AI's mission, and you spoke very eloquently about the things that it does. At its core, I would imagine there are a lot of principles of fairness that you're trying to instill in artificial intelligence applications and the use of AI.
With that, I assume, comes model governance. And when you think about model governance, what are the key principles and aspects that organizations should consider?
Nick Schmidt 5:02
And I think that it's important to realize that fairness is one piece of model governance. And if you don't really have a good model governance structure, then you're probably not going to have fair algorithms. And so when I think about model governance, what I'm really thinking about are a few different things.
One of them is having accountability and ownership. And so, what this means is that everybody who touches the model has some level of ownership and accountability for it. And then I think about transparency and explainability, and that is that everybody in your organization who is involved in the modeling process ought to know as much as they need to know in order to ensure that the model is safe and fair and robust.
Explainability is another important one, where what I would say is you don't want to have a model that is unnecessarily opaque. And even when a model is opaque or a black box, like a large language model, you want to understand as much as you can about it and be able to understand where your limits are. So that's another principle of governance.
And then fairness and robustness. And this goes to both the question of what's affecting the customers and what's affecting the business. You can't have a model in production that isn't being fair to the customers, that's a legal, regulatory, and reputational risk. And you can't have a model that's in production that's going to bankrupt the business. So you really need to have a model governance process that's looking at the model and making sure that it's robust to any business conditions that might change or might fluctuate once you get it into production.
And then data quality and integrity. Making sure that you're not putting garbage in. Because even though there's a lot of people who seem to think you can do anything with machine learning and it's this sort of magical thing, if you put garbage in, you get garbage out.
And then the final thing is monitoring and validation. And that's really just making sure that your model is doing what you expect it to do after it goes into production. So regularly monitoring and checking for things like fairness, robustness, model performance and so forth.
D Dehghanpisheh 7:19
So you just briefly spoke about how model governance is applied at all of the stages of the model lifecycle or of an ML pipeline, whether it's talking about data at the input components all the way out through understanding what happens at the inference, i.e. the engagement of a person or an entity with a machine learning model in production.
When you think about that entire pipeline, from experimentation to inference, to re-training, all of it – when you think about the complexity of that, why is it important to ensure that fairness is integrated into all those steps?
What are specific steps or actions that you would recommend engineers employ at each stage of the life cycle? Could you decompose that for us a little bit?
Nick Schmidt 8:06
Sure. And I think maybe before I jump into what you should do, I think that it's worth touching on why you should do it.
And to me, there are a number of reasons why we have to integrate fairness throughout the machine learning pipeline. But at a principal level, or a top down level, they come down to two things. One is ethics, and one is business.
Humans have ethical standards, and if we're putting models into production that don't meet those standards, we're really not living up to what we want to be. And then on the business side, suppose you're a non ethical person and all you care about is money. Well, that's fine, but one thing that's really not good for your job is if your CEO is in front of a Senate subcommittee apologizing for what your AI model has done.
D Dehghanpisheh 8:53
I do not recall, Senator.
Nick Schmidt 8:54
Exactly. That is not a good career move.
And so I think that's the broad reason for why we want to make sure that fairness is incorporated throughout the pipeline, but then breaking it down and thinking about what's important there. It's asking the question, “where can discrimination come into a model? How can a model become unfair?”
And that can happen anywhere, really.
D Dehghanpisheh 9:22
So is there a particular stage of the model lifecycle where it's more likely to occur than not?
Nick Schmidt 9:29
Yeah, I think there are two places. One is in the data itself, what you've decided to put into the model–
D Dehghanpisheh 9:36
in terms of the data training set?
Nick Schmidt 9:38
The data training set, exactly.
And then there's the model itself, the built model. There's a lot of chance that you could have bias in that. And so in the data stage, I think there's been so much press about data being biased, it may even seem like it's relatively obvious. But if you think about credit data for determining whether or not people get loans, there's historical and present day discrimination that's encoded in that data.
For example, African Americans are much more likely to be fired without cause than are whites. And if you look at why people default on loans, very frequently it's because they lose their jobs.
And so if you've got that kind of information that's going into the data, you're going to encode that kind of bias. So even if you have the best intentions in the world, if you're using a data set that's biased, it's going to give you a biased result.
The other way that data can be biased, and this is a more subtle one, I think is in coverage. One of the things that we see very frequently is problems with missing data, and missing data being more frequently occurring for minorities.
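The coverage problem described here can be checked with a simple tabulation before any modeling starts. The following is a minimal sketch, assuming records are plain dictionaries and that a missing value is stored as `None`; the `group` and `income` field names and the toy rows are hypothetical, not taken from the episode.

```python
from collections import defaultdict


def missing_rate_by_group(records, group_key, field):
    """Return {group: share of that group's records where `field` is None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row.get(field) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical toy records, for illustration only.
records = [
    {"group": "A", "income": 52000},
    {"group": "A", "income": 61000},
    {"group": "A", "income": None},
    {"group": "A", "income": 48000},
    {"group": "B", "income": None},
    {"group": "B", "income": 39000},
    {"group": "B", "income": None},
    {"group": "B", "income": 45000},
]

rates = missing_rate_by_group(records, "group", "income")
for group, rate in sorted(rates.items()):
    print(f"group {group}: {rate:.0%} of income values missing")
```

A large gap between groups does not say why the data is missing, but it is a cue for the kind of human review discussed later in the conversation.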
D Dehghanpisheh 10:56
So in other words, you're talking about the completeness of data in one instance, and you're talking about kind of like an “oversample,” if you will, of a certain subset of the data in another instance. Is that correct?
Nick Schmidt 11:10
For the former, absolutely. For the latter, I don't know that I think of it as an oversample. I think of it as not reflecting what reality would be if there weren’t just–
D Dehghanpisheh 11:20
Nick Schmidt 11:21
Yeah, if there weren't discrimination, we would not see these differences in outcomes.
D Dehghanpisheh 11:27
So then coming back to the how, right? You led us into the why and said, hey, it's more important to understand the why. And these are great examples.
And then it comes back to the how. In that instance, how is an organization to go about looking at, say, the quality of their data and thus the quality of their model code so that they can ultimately understand the quality and the fairness of the output of that model when it's combined?
What's a company to do with their own first party data that is largely going to be used to train their internal models?
Nick Schmidt 11:58
I think there are two somewhat distinct approaches that you have to think about, and one of them is qualitative, and one of them is quantitative.
As much as I would like to say our software could solve every aspect of fairness, that's just not true. Humans have to be in the loop. And when you're thinking about data integrity, the first place to start is with a human review of the variables that are going into the model.
I've seen countless times where modelers misunderstand what the data set represents, and so they'll be willing to put something into a data set or into a model, things like browser search history, where maybe it seems unobjectionable, but if you look at where they're focusing on, they end up giving lower weights to websites that are focused on minorities. And so if you don't have people reviewing this information, they may not realize that this kind of stuff is going into a model.
And so that's a qualitative check, and then there has to be a quantitative check as well. And that's really where software comes in. Because you have such large data sets and you have so many variables, it's really intractable to do a full qualitative review on most data. And that's where we can look for things like asking questions if variables are proxying for a protected class.
So [for example] is there a variable or a combination of variables that, even though you think it's okay, is really just a stand-in for, are you Black? Or, are you older? Are you a woman? And those sorts of things have to be rooted out.
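One rough way to start the quantitative proxy check described above is to measure how strongly each candidate variable tracks protected-class membership. This is a minimal sketch, not SolasAI's method: it assumes numeric features, a 0/1 protected-class indicator, and an arbitrary cutoff of 0.5 on the absolute Pearson correlation. The feature names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def flag_proxies(features, protected, threshold=0.5):
    """Return {feature name: correlation} for features that track the protected indicator."""
    flagged = {}
    for name, values in features.items():
        r = pearson(values, protected)
        if abs(r) >= threshold:  # threshold is an assumed cutoff, not a standard
            flagged[name] = r
    return flagged


# Hypothetical toy data: 1 = member of the protected class, 0 = not.
protected = [1, 1, 1, 0, 0, 0]
features = {
    "zip_code_cluster": [1, 1, 1, 0, 0, 1],
    "years_at_job": [2, 5, 3, 4, 2, 4],
}

flagged = flag_proxies(features, protected)
print(flagged)
```

A single-variable correlation like this cannot catch the "combination of variables" case mentioned above; that would need something stronger, such as testing how well a model built from all the features can predict the protected attribute.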
D Dehghanpisheh 13:41
You mentioned the qualitative component being about bringing more humans into the loop, not just a human into the loop.
Are there specific functional humans in a corporation or an enterprise that need to be brought to this process that maybe aren't being brought to the process today?
Nick Schmidt 13:59
And I think the framework I think about is the three lines of defense, and that's used in financial services especially. But thinking about where you want to have different people, it's in these three lines.
And the first is the business. It's the model users and the model owners. The second is relevant for your question, which is internal compliance. And the third line, just to finish out, is internal audit.
So it's the second and the third line that's not being done right now in most organizations. And what you want to have is people who are experienced in whatever question they're addressing. So if it's a question of model quality, you want to have high quality model builders or experienced model builders in the compliance function around model risk management. If the questions are around fairness, you probably want to have lawyers or people who have been doing this a long time who understand how to measure fairness, what it means to mitigate discrimination, things like that.
And so it's a compliance and legal function that you usually want to bring in.
Charlie McCarthy 15:10
Nick, it sounds like there's a pretty big social component to all of this when we're talking about people in the processes. And as I'm listening to you describe things like fairness and what we're looking for, I think a lot of people might agree that it's really important to also discuss how we define some of this terminology within this ever-evolving AI vernacular.
Like, there's debate right now about “what is AI? How do we even define that?” The term bias is being used a lot - [also] responsible AI, trusted AI - and I think the industry is moving toward better defining what we want to use as terminology as a group as we're working to solve some of these problems.
So, I mean, for example, can you give us an idea of how we might better frame our thinking? What I'm thinking of specifically is this article that I was reading from The Atlantic, and the title was, “How an Attempt at Correcting Bias in Tech Goes Wrong,” and it was talking about the Google example of allegedly scanning the face of volunteers with dark skin tones in order to perfect the Pixel phone's face unlock technology - so facial recognition technology, and the potential for racism there.
The term bias, if we can talk about that for a moment - and maybe will you expand on the distinction between underrepresentation bias, which in my understanding is more human driven, and accuracy bias, which is more of a tech result, and how those two influence each other? Those two types of bias in terms of disparity in how it presents in these algorithms.
Nick Schmidt 16:43
Yeah, so this question of definitions is really important, and we can't possibly begin to address fairness unless we have come to agreement on some definitions. But what's really important to know about this is that there's already been 50 years of civil rights legislation, litigation and court precedents and regulatory findings around this.
And one of the things that I really struggle with is that the AI community and machine learning community sort of came in and said, we're going to redefine everything in terms of this is what bias is, this is what fairness is. And while a lot of that work has been really beneficial and really good, and I don't want to say that it shouldn't have been done, I think not enough respect has been given to what is already out there.
And there's already a very good framework for understanding fairness. And that gets into your question of representation and accuracy. The way I think about bias is really in terms of legal distinctions, and the law is set up to represent essentially two different types of discrimination. One of them is called disparate treatment. The other is called disparate impact.
Disparate treatment is sort of your traditional kind of discrimination. It's the obvious one. It's, “I'm not going to give a loan to Charlie because she's a woman,” and that's just patently wrong and illegal.
The other kind of discrimination, disparate impact, is really more subtle. What it says is that if outcomes are on average different for one group relative to another, then that is evidence of disparate impact discrimination. And so what that means in practice is if you have an algorithm that's doing credit scoring and it's giving loans to fewer African Americans than whites, then that shows evidence of disparate impact.
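The "outcomes are on average different" test can be made concrete as a ratio of approval rates. This is a minimal sketch using a common summary statistic, the adverse impact ratio; the 0.8 cutoff is the "four-fifths" rule of thumb from US employment guidelines and is an assumption here, not something stated in the episode. The decision lists are hypothetical.

```python
def selection_rate(decisions):
    """Share of favorable decisions (1 = approved, 0 = denied)."""
    return sum(decisions) / len(decisions)


def adverse_impact_ratio(protected_decisions, reference_decisions):
    """Approval rate of the protected group divided by that of the reference group."""
    return selection_rate(protected_decisions) / selection_rate(reference_decisions)


# Hypothetical loan decisions, for illustration only.
protected = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # 3 of 10 approved
reference = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]  # 6 of 10 approved

air = adverse_impact_ratio(protected, reference)
flagged = air < 0.8  # assumed rule-of-thumb cutoff

print(f"adverse impact ratio: {air:.2f}, flagged: {flagged}")
```

A ratio below the cutoff is evidence worth investigating, not a legal conclusion; statistical significance and sample size matter as well.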
Interestingly, though, that is not necessarily illegal [depending on the circumstances]. But what it does is it leads to a legal requirement to search for less discriminatory alternative models. So that's a model that is similar, still predictive, still predicts loan outcomes, but it has less of that disparate impact. You're increasing the number of African Americans that you give a loan to.
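The search for a less discriminatory alternative can be pictured as a constrained selection among candidate models. This is a minimal sketch under stated assumptions: each candidate has already been scored for accuracy and for adverse impact ratio (protected-group approval rate divided by reference-group approval rate, assumed here to be at most 1), and a one-point accuracy drop is treated as acceptable. The names, numbers, and tolerance are hypothetical, and real searches weigh business and legal criteria that a single tolerance cannot capture.

```python
def pick_less_discriminatory(baseline, candidates, max_accuracy_drop=0.01):
    """Return the candidate with the highest adverse impact ratio (AIR) among
    those whose accuracy is within `max_accuracy_drop` of the baseline.
    Falls back to the baseline if no viable candidate improves on its AIR."""
    viable = [
        c for c in candidates
        if baseline["accuracy"] - c["accuracy"] <= max_accuracy_drop
    ]
    best = max(viable, key=lambda c: c["air"], default=baseline)
    return best if best["air"] > baseline["air"] else baseline


# Hypothetical model scores, for illustration only.
baseline = {"name": "baseline", "accuracy": 0.820, "air": 0.70}
candidates = [
    {"name": "alt_1", "accuracy": 0.815, "air": 0.78},
    {"name": "alt_2", "accuracy": 0.790, "air": 0.90},  # fairer, but too inaccurate
    {"name": "alt_3", "accuracy": 0.818, "air": 0.74},
]

chosen = pick_less_discriminatory(baseline, candidates)
print(chosen["name"])
```

Here `alt_2` has the least disparate impact but fails the accuracy tolerance, so the sketch settles on the candidate that is "similar, still predictive" and fairer than the baseline.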
D Dehghanpisheh 19:12
So in other words, it's not just the size of, say, an impact or the frequency of an impact. It's really the relevance of the impact, for lack of a better term. Like, it's not always how many people that you're affecting, but it's the difference between, say, approving ten people for a $100,000 loan or one person for a $10 million loan.
Nick Schmidt 19:33
So, that's a really interesting question, and that gets into questions of damages.
Normally what we're focused on when we're talking about AI modeling is not so much the impact like, are the damages a million dollars to one person or $100,000 to ten people? We're really talking about what would be described as liability. And liability is usually about the number of people you're affecting.
And so in the credit score context it's, are you giving loans to fewer African Americans? But bringing this back to, Charlie, your initial question about bias and accuracy, in a model setting we almost always see disparate impact in credit scores or health outcome models. And what we want to do is we want to fix that.
What we don't see quite as often is model bias. And that's where the model is less accurate for one group than another. And I oftentimes have conversations with modelers about this question and they really push back on the idea of disparate impact. What they want to say is that if a model is similarly accurate for all the groups, then there's no evidence of a problem.
And I get that, it makes sense. But what that's not doing is it's not taking into account the idea that there's historical discrimination that's part of the data and that you want to rectify that. And so that's where just using a measure of bias isn't likely to be sufficient.
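The difference between the two measures shows up when both are computed side by side. This is a minimal sketch with hypothetical labels and predictions: accuracy is identical across the two groups, so an accuracy-only check would pass, yet approval rates differ sharply.

```python
def group_metrics(y_true, y_pred, groups):
    """Return per-group accuracy and selection rate (share predicted 1)."""
    metrics = {}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        accuracy = sum(y_true[i] == y_pred[i] for i in idx) / len(idx)
        selection = sum(y_pred[i] for i in idx) / len(idx)
        metrics[group] = {"accuracy": accuracy, "selection_rate": selection}
    return metrics


# Hypothetical toy data: 1 = repaid / approved, 0 = defaulted / denied.
groups = ["A"] * 8 + ["B"] * 8
y_true = [1, 1, 1, 0, 1, 0, 1, 1] + [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 0] + [1, 0, 0, 0, 0, 0, 0, 0]

metrics = group_metrics(y_true, y_pred, groups)
for group, m in metrics.items():
    print(group, m)
```

In this toy case the gap comes from the labels themselves, which is the situation described above: if the historical outcomes encode discrimination, an equally accurate model reproduces it.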
Charlie McCarthy 21:14
So as we as an industry, a group, the public are continuing to learn and identify some of these situations, are there other examples in real life that you can point to other than the Google example of facial recognition?
Maybe other industries where you've seen actions that led to unfair outcomes?
Nick Schmidt 21:35
Yeah, I think one of the ones – and this might be interesting to your community because I think it's such a big risk in machine learning – is something I call usage drift, which is where you have a model and it's designed well and is fair and appropriate and predictive, and then someone decides to use it for something else. For a purpose that it wasn't originally designed for. And there was a model that was built by a company, and it was designed to predict health care costs, how much you were going to spend on healthcare in the future.
And that's a perfectly reasonable thing to try to predict if you're a health insurance company or in some way going to have to pay healthcare bills, you want to know how much people are going to be spending. The problem, though, was someone had the idea that healthcare spending is associated with healthcare outcomes.
And so they started using this model that predicts [patient] healthcare spending to predict whether or not you were, or how sick you were going to be. The problem with that is that healthcare spending is not equal by race in particular. And well, it started in the 1940s [the study began in 1932], but it lasted until 1972. There was something that happened called the Tuskegee experiment. And it's this horrible thing that happened in American history where African American men were injected with syphilis intentionally by the US government [in the study, the men had already contracted syphilis; researchers did not infect them, but withheld treatment and did not tell them their diagnosis].
And when it came out in 1972, the year after that, the statistic I've read (see the working paper by the National Bureau of Economic Research) is that African American visits to primary healthcare physicians dropped by 26% because there was such suspicion in medical care among that community.
Well, what does that mean? That means that on average, not necessarily individually, but on average, African Americans are less likely to see their doctor, which means they're less likely to spend money on health care. Well, when you go and start predicting health care outcomes using healthcare spending all of a sudden you're going to start seeing that whites are much sicker than African Americans on average.
But that's not really true. What's really true is that African Americans who spend relatively less [money on healthcare now] are just as sick as the white patients who go into the hospital [but because studies show that African Americans are visiting doctors less frequently, potentially as a result of past harmful practices by health practitioners like in the Tuskegee experiment, a Black person’s cumulative spending on healthcare over a period of time would likely be less than that of a white person who visited a doctor more frequently during that same period. The frequency of doctor visits is not necessarily related to the severity of illness, nor is it an indicator of a particular group’s overall health. Therefore, healthcare spending is not necessarily an accurate predictor of a group’s health care outcomes]. And so that was an example of an algorithm where its usage drifted and that usage drift ultimately killed people.
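The mechanism behind this usage drift can be shown with a toy calculation. This is a minimal sketch with entirely hypothetical numbers: two groups have identical sickness levels, but group B is assumed to spend only 60% as much per unit of sickness because of lower access to care. Ranking patients by spending, as a stand-in for ranking by a spending-prediction model, then picks different people than ranking by actual sickness.

```python
from collections import Counter

# Hypothetical patients: same sickness distribution in both groups.
ACCESS = {"A": 1.0, "B": 0.6}  # assumed share of "needed" care actually used
patients = [
    {"group": group, "sickness": sickness}
    for group in ("A", "B")
    for sickness in (8, 6, 4, 2)
]
for p in patients:
    # Spending reflects sickness scaled by access, not sickness alone.
    p["spending"] = p["sickness"] * 1000 * ACCESS[p["group"]]


def top_k_groups(rows, key, k=4):
    """Count group membership among the k highest-ranked patients by `key`."""
    ranked = sorted(rows, key=lambda r: r[key], reverse=True)
    return Counter(r["group"] for r in ranked[:k])


by_sickness = top_k_groups(patients, "sickness")
by_spending = top_k_groups(patients, "spending")

print("flagged for extra care, ranked by sickness:", dict(by_sickness))
print("flagged for extra care, ranked by spending:", dict(by_spending))
```

In this toy setup the spending-based ranking is a perfectly good ranking of spending; it only becomes unfair once it is reused as a ranking of health need.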
Charlie McCarthy 24:14
That is a really interesting example, though. Wow.
D Dehghanpisheh 24:18
That's a frightening example. It's not really an AI thing so much as it’s a frightening example of a bad precedent that then cascades through time and corrupts the potential through time.
Nick Schmidt 24:29
So that example is really horrifying and I think should be frightening to all of us. But there are other examples that are similarly bad, but maybe a little bit more subtle.
And you were talking about the Google facial recognition problem, and the source of that is a really interesting story. There was a woman who is African American. And my understanding of the story is she was talking to her friends about facial recognition and they were talking about how bad it was for them.
And so she started studying this, and ultimately what she found was that the error rates for facial recognition programs for women of color were 35 times the error rates for white men. And from one standpoint, it's just bad, right? But if you think about what facial recognition is being used for (border patrolling, all sorts of things in warfare, targeting), I mean, this is really a terrifying thing.
And so we have these examples of places where a model can go wrong and it could have absolutely devastating consequences on people. But I think the important thing about that second example, and it's really true of the first example, is that there was no malice intended, I don't believe, on the part of the model builders. And it was really in the case of the facial recognition programs, it was a problem of there not being a representative data set. There were not enough people of color in the training data set that they used, and so ultimately the models themselves couldn't be as accurate.
D Dehghanpisheh 26:09
So a follow up question to that.
How would one go about getting more humans in the loop organizationally to address that and catch that early on?
Nick Schmidt 26:19
You need to start by having a good model governance structure.
So in a way, this takes us back to the beginning of the conversation where what you want to do is start setting up processes and policies around accountability and ownership of the model. And that's how you start to get humans in the loop, is by giving them responsibility.
I think that one of the things that is naturally going to happen as you give ownership of a portion of the model lifecycle to a person is that you're going to start having them be accountable for that and they're going to start taking responsibility. So, what my hope would have been is that at some point in the model development process or the production process, there would have been someone who said, wait, we haven't tested this on different populations and so we need to go back and we need to test it and make sure that that's right.
D Dehghanpisheh 27:26
You mentioned accountability and responsibility for correcting things or addressing things. And often accountability and responsibility come about because of legal liability or legal regulatory frameworks that require that accountability to be enforced.
We know that there are some common frameworks used, particularly in financial services, such as the Federal Reserve's Guidance on Model Risk Management, aka SR 11-7, and other things that the OCC is doing, that can be adapted to machine learning model governance needs today, not just financial model risk.
In that example, how do we think about building a more fair ML environment on existing regulatory processes to address those issues of accountability, transparency, and fundamentally risk management, which is what I think we're trying to get at?
Nick Schmidt 28:20
I think what is not well recognized in the machine learning and AI community is that those frameworks that you mentioned are out there. Machine learning and AI do present particular problems that really do need to be addressed, but the framework for understanding, measuring, and mitigating risk is already out there.
I like to joke that SR 11-7 is my favorite regulation, and the reason for that is that we have not had a financial crisis since that was put into place. If you look at the 2007 to 2009 housing bubble, a lot of that was because there was not good model governance in place at banks. And since SR 11-7 was put into place, in 2011, I believe it was, we haven't had a financial crisis. And that's because there is a robust framework for model governance.
And so I believe the right step to take is to stop talking and start implementing. And what that requires is getting advice from people who already know how to do it. And a lot of that is already contained in these frameworks that you mentioned.
D Dehghanpisheh 29:40
So, if SR 11-7 is your favorite regulation because of its applicability to ML, what are some other forms of regulation or other existing regulatory frameworks in other industries – for example, in the FDA's playbook and arsenal or other regulated industries in the energy markets, etcetera – where you think that existing regulations can be recontoured or reconstituted, if you will, or just reapplied to AI/ML components?
Nick Schmidt 30:13
I think the one that really stands out to mind, because it's something I'm so familiar with, is the framework around fairness in employment. There are regulations there that focus on those two types of discrimination: disparate impact and disparate treatment. And they're very well thought out. And I think that those have direct applicability to machine learning and AI.
D Dehghanpisheh 30:39
And those are within the EEOC, right? The Equal Employment Opportunity Commission.
Nick Schmidt 30:43
D Dehghanpisheh 30:56
For listeners and/or readers, refer to the show notes and we'll link out to some of those regulatory URLs.
So, Nick, just coming back to this notion of fairness, model governance, policy, that intersection. What's the one thing you think needs to be done by policymakers right now to ensure more responsible and fair model building in AI/ML governance?
Nick Schmidt 31:21
The most important thing is to get off of the focus on large language models as being the only form of AI.
D Dehghanpisheh 31:30
Amen to that.
Nick Schmidt 31:31
Yeah. And to start thinking about machine learning as a much broader set of algorithms that are affecting our lives every day.
And what I think I would love to see is regulatory agencies that are focused on different areas adopting standards like SR 11-7 for model governance, and standards like disparate impact framework or disparate treatment framework for the particular industries that they regulate. And I think that would have potentially the biggest consequences and most effect on the industry.
D Dehghanpisheh 32:12
And then finally, you opened the show by talking about the mission of SolasAI, and you shared with us earlier that it really includes helping organizations improve fundamentally the fairness of their machine learning models and their AI applications.
So, for those listening and those reading, what is the recommendation or the call to action from you and the rest of the SolasAI team for businesses that need and must address their AI security risks in this space, especially as it relates to algorithmic discrimination and unfair model outcomes?
Nick Schmidt 32:45
One of the things that I've seen a lot of companies try to do that are outside of regulated industries is to go it alone. Particularly with fairness, everybody has their own idea of what it means to be fair, and that's great, but it doesn't necessarily comport with what our regulatory requirements are and what other people might think of as fairness.
And so my first piece of advice is to get some help. Find organizations like ours; there are many that know the background, know the frameworks, know the ways to look at this. And empower your employees to find that help and put it into place. That's the first step.
The second step is really what it means to put it in place. You have to give the authority to the governance committees to be able to tell the business, no, that model is not going to be put into production because of these fairness concerns or because of x, y or z robustness concerns. And that kind of pushback, you really have to give people a fair amount of authority in order to be effective there.
And that's one of the places I see a lot of organizations going wrong. They start with this grand idea of, we're going to implement fairness, we're going to implement model governance, but the people who are running model governance don't have the authority to actually enact it.
D Dehghanpisheh 34:13
Awesome. So, it boils down to give authority to the right people, seek outside help and expertise, and let's go!
Nick Schmidt 34:21
Couldn't say it better.
D Dehghanpisheh 34:22
Hey, Nick, thank you for coming on the show. And thank those at Solas AI for their contributions as well, and your team. We really enjoyed it.
Charlie, any last words?
Charlie McCarthy 34:34
Yeah. Thank you so much, Nick, for being here.
It was refreshing, as you said, to have a nice talk that wasn't entirely about a chat bot, so thank you for that. Thank you for helping us expand our views and think a little bit more about the broader landscape. It was a pleasure.
And we will link everyone to Nick's contact details in the transcript so that you know how to reach him and his team. And thanks for listening!
Nick Schmidt 34:56
Thank you very much.
Thanks for listening to The MLSecOps Podcast brought to you by Protect AI. Be sure to subscribe to get the latest episodes and visit MLSecOps.com to join the conversation, ask questions, or suggest future topics.
We’re excited to bring you more in depth MLSecOps discussions. Until next time, thanks for joining!
Thanks for listening! Find more episodes and transcripts at https://mlsecops.com/podcast.