First, a confession: part of the reason I’m writing this blog post today is because I’m having major FOMO on account of having missed #FAT2018, the first annual Conference on Fairness, Accountability, and Transparency. (You can get the proceedings online here, though!) I have been seeing a lot of tweets & good discussion around the conference and it’s gotten me thinking again about something I’ve been chewing on for a while: what does it mean for a machine learning tool to be fair?
If you’re not familiar with the literature, this recent paper by Friedler et al. is a really good introduction, although it’s not intended as a review paper. I also review some of it in these slides. Once you’ve dug into the work a little bit, you may notice that a lot of it is focused on examples where the output of the algorithm is a decision made at the level of the individual: Does this person get a loan or not? Should this resume be passed on to a human recruiter? Should this person receive parole?
While these are deeply important issues, the methods developed to address fairness in these contexts don’t necessarily translate well to evaluating fairness for other applications. I’m thinking specifically about tools like speech recognition or facial recognition for automatic focusing in computer vision: applications where an automatic tool is designed to supplant or augment some sort of human labor. The stakes are lower in these types of applications, but it’s still important that they don’t unintentionally end up working poorly for certain groups of people. This is what I’m calling “parity of utility”.
Parity of utility: A machine learning application which automatically completes a task using human data should not perform reliably worse for members of one or more social groups relevant to the task.
That’s a bit much to throw at you all at once, so let me break down my thinking a little bit more.
- Machine learning application: This could be a specific algorithm or model, an ensemble of multiple different models working together, or an entire pipeline from data collection to the final output. I’m being intentionally vague here so that this definition can be broadly useful.
- Automatically completes a task: Again, I’m being intentionally vague here. By “a task” I mean some sort of automated decision based on some input stimulus, especially classification or recognition.
- Human data: This definition of fairness is based on social groups and is thus restricted to humans. While it may be frustrating that your image labeler is better at recognizing cows than horses, it’s not unfair to horses because they’re not the ones using the system. (And, arguably, because horses have no sense of fairness.)
- Perform reliably worse: Personally I find a statistically significant difference in performance between groups with a large effect size to be convincing evidence. There are other ways of quantifying difference, like the odds ratio or even the raw accuracy across groups, that may be more suitable depending on your task and standards of evidence.
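To make “statistically significant difference with a large effect size” concrete, here’s a minimal sketch of one way you might test it: a two-proportion z-test on per-group accuracy, with Cohen’s h as the effect size. The counts are hypothetical, and in practice you’d pick a test that matches your metric and sample design.

```python
import math

def two_proportion_test(correct_a, n_a, correct_b, n_b):
    """Compare per-group accuracy: z-statistic for a two-proportion
    z-test, plus Cohen's h as an effect size for the difference."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    # pooled proportion under the null hypothesis of equal accuracy
    p = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Cohen's h: 2*arcsin(sqrt(p)) transform; ~0.2 small, ~0.5 medium, ~0.8 large
    h = 2 * math.asin(math.sqrt(p_a)) - 2 * math.asin(math.sqrt(p_b))
    return z, h

# hypothetical numbers: group A correct on 950/1000, group B on 800/1000
z, h = two_proportion_test(950, 1000, 800, 1000)
print(f"z = {z:.2f}, Cohen's h = {h:.2f}")
```

Here |z| well above 1.96 (significant at p < .05) together with a medium-or-larger h would be the kind of “reliably worse” evidence I have in mind.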
- Social groups relevant to the task: This particular fairness framework is concerned with groups rather than individuals. Which groups? That depends. Not every social group is going to be relevant to every task. For instance, your first language(s) are very relevant for NLP applications, while they only matter for things like facial recognition insofar as they co-vary with other, visible demographic factors. Which social groups are relevant for which types of human behavior is, thankfully, very well studied, especially in sociology.
So how can we turn these very nice-sounding words into numbers? The specifics will depend on your particular task, but here are some examples:
- Is there a difference in positive predictive value (PPV), error rate (1-PPV), true positive rate (TPR), and false positive rate (FPR) for three commercial gender classification algorithms based on the gender and skin tone of the person whose image is being classified? (Answer: Yep! People with lighter skin tones and men are classified more accurately, with darker-skinned women being classified with the most errors.)
- Is there a statistically-significant difference in the word error rate for commercial speech recognition algorithms based on the speaker’s dialect, race and gender? (Answer: yes, yes and maybe, with General American and white talkers having the lowest word error rates.)
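Computing the per-group rates behind questions like these is straightforward once you have predictions, labels, and group membership side by side. Here’s a minimal sketch for a binary task (the group labels and data here are hypothetical):

```python
from collections import defaultdict

def per_group_rates(groups, y_true, y_pred):
    """Tally a confusion matrix per group, then report PPV, TPR, and FPR.
    groups, y_true, y_pred are parallel sequences; labels are 0/1."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for g, t, p in zip(groups, y_true, y_pred):
        if p == 1 and t == 1:
            counts[g]["tp"] += 1
        elif p == 1 and t == 0:
            counts[g]["fp"] += 1
        elif p == 0 and t == 1:
            counts[g]["fn"] += 1
        else:
            counts[g]["tn"] += 1
    rates = {}
    for g, c in counts.items():
        rates[g] = {
            "PPV": c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else None,
            "TPR": c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else None,
            "FPR": c["fp"] / (c["fp"] + c["tn"]) if c["fp"] + c["tn"] else None,
        }
    return rates

# toy example: the classifier is perfect for group A, coin-flip-bad for group B
rates = per_group_rates(
    groups=["A", "A", "A", "A", "B", "B", "B", "B"],
    y_true=[1, 0, 1, 0, 1, 0, 1, 0],
    y_pred=[1, 0, 1, 0, 1, 1, 0, 0],
)
print(rates)
```

A gap like this between groups, sustained on a real test set of reasonable size, is exactly the kind of failure of parity of utility the papers above document.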
So the tools reviewed in these papers don’t demonstrate parity of utility: they work better for folks from specific groups and worse for folks from other groups. (And, not coincidentally, when we find that systems don’t have parity in utility, they tend to work best for more privileged groups and worse for less privileged groups.)
Parity of Utility vs. Overall Performance
This is where things get a bit complicated: parity of utility is one way to evaluate a system, but it’s a measure of fairness, not overall system performance. Ideally, a high-performing system should also be a fair system and perform well for all groups. But what if you have a situation where you need to choose between prioritizing a fairer system or one with overall higher performance?
I can’t say that one goal is unilaterally better than the other in all situations. What I can say is that if we focus only on higher performance and don’t investigate measures of fairness, we risk building systems that have systematically lower performance for some groups. In other words, we can think of an unfair system (under this framework) as one that is overfit to one or more social groups. And, personally, I consider a model that is overfit to a specific social group while being intended for general use to be flawed.
Parity of Utility is an idea I’ve been kicking around for a while, but it could definitely use some additional polish and wheel-kicking, so feel free to chime in in the comments. I’m especially interested in getting some other perspectives on these questions:
- Do you agree that a tool that has parity in utility is “fair”?
- What would you need to add (or remove) to have a framework that you would consider fair?
- What do you consider an acceptable balance of fairness and overall performance?
- Would you prefer to create an unfair system with slightly higher overall performance or a fair system with slightly lower overall performance? Which type of system would you prefer to be a user of? What about if a group you are a member of had reliably lower performance?
- Is it more important to ensure that some groups are treated fairly? What about a group that might need a tool for accessibility rather than just convenience?
One thought on “Parity in Utility: One way to think about fairness in machine learning tools”
I guess one way of evaluating systems with Parity in mind is designing test set divisions that allow meaningful *macro-average* metrics to be computed. Just like in cross-domain or cross-lingual evaluation, you then get a sense of the bias behind your systems.
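(For readers following along: the commenter’s macro-averaging idea can be sketched in a few lines. The micro-average pools all examples, so a small group the system fails for barely moves the score; the macro-average weights each group equally, so that same failure is impossible to miss. The data below is hypothetical.)

```python
def micro_macro_accuracy(groups, correct):
    """groups and correct are parallel sequences; correct holds 0/1 flags.
    Returns (micro, macro): micro pools all examples, macro averages the
    per-group accuracies so every group counts equally."""
    micro = sum(correct) / len(correct)
    by_group = {}
    for g, c in zip(groups, correct):
        by_group.setdefault(g, []).append(c)
    macro = sum(sum(v) / len(v) for v in by_group.values()) / len(by_group)
    return micro, macro

# 90 examples from group A (all correct), 10 from group B (all wrong):
# the pooled score looks fine while the macro score exposes the disparity
micro, macro = micro_macro_accuracy(["A"] * 90 + ["B"] * 10, [1] * 90 + [0] * 10)
print(f"micro = {micro:.2f}, macro = {macro:.2f}")
```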