Google’s speech recognition has a gender bias

In my last post, I looked at how Google’s automatic speech recognition worked with different dialects. To get this data, I hand-checked the automatic captions for more than 1,500 words from fifty different accent tag videos.

Now, because I’m a sociolinguist and I know that it’s important to stratify your samples, I made sure I had an equal number of male and female speakers for each dialect. And when I compared performance on male and female talkers, I found something deeply disturbing: YouTube’s auto captions consistently performed better on male voices than female voices (t(47) = -2.7, p < 0.01). (You can see my data and analysis here.)

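(If you’d like to see the shape of that comparison in code, here’s a minimal sketch in Python. The accuracy numbers below are made-up stand-ins, not my actual data, which is linked above.)

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up per-speaker caption accuracies, one value per speaker
# (placeholders for the real per-speaker numbers in the linked analysis).
female_acc = np.array([0.47, 0.44, 0.52, 0.41, 0.50])
male_acc = np.array([0.61, 0.57, 0.63, 0.58, 0.60])

# Two-sample t-test: are the two groups' mean accuracies reliably different?
t_stat, p_value = ttest_ind(female_acc, male_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```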
[Figure: caption word accuracy by speaker gender]

On average, less than half (47%) of each female speaker’s words were captioned correctly. The average male speaker, on the other hand, had 60% of his words captioned correctly.

It’s not that there’s a consistent but small effect size, either: a gap of 13 percentage points is a pretty big effect. The Cohen’s d was 0.7, which means, in non-math-speak, that if you pick a random man and a random woman from my sample, there’s an almost 70% chance the transcription will be more accurate for the man. That’s pretty striking.

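(And here’s how those two numbers connect, using the same stand-in data as above: the 70% figure is the “common language effect size” that corresponds to a Cohen’s d of 0.7.)

```python
import numpy as np
from scipy.stats import norm

# Stand-in per-speaker accuracies (same placeholders as above).
female_acc = np.array([0.47, 0.44, 0.52, 0.41, 0.50])
male_acc = np.array([0.61, 0.57, 0.63, 0.58, 0.60])

# Cohen's d: difference of means divided by the pooled standard deviation.
n_f, n_m = len(female_acc), len(male_acc)
pooled_var = ((n_f - 1) * female_acc.var(ddof=1) +
              (n_m - 1) * male_acc.var(ddof=1)) / (n_f + n_m - 2)
d = (male_acc.mean() - female_acc.mean()) / np.sqrt(pooled_var)

# Common language effect size: the chance a randomly chosen man's accuracy
# beats a randomly chosen woman's, assuming roughly normal distributions.
p_superiority = norm.cdf(d / np.sqrt(2))
print(f"Cohen's d = {d:.2f}; P(man more accurately captioned) = {p_superiority:.0%}")
```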
What it is not, unfortunately, is shocking. There’s a long history of speech recognition technology performing better for men than for women.

This is a real problem with real impacts on people’s lives. Sure, a few incorrect YouTube captions aren’t a matter of life and death. But some of these applications have much higher stakes. Take the medical dictation software study: the fact that men enjoy better performance than women with these technologies means that it’s harder for women to do their jobs. Even if it only takes a second to correct an error, those seconds add up over the days and weeks into a major time sink, time your male colleagues aren’t spending fighting with the technology. And that’s not even touching on the safety implications of voice recognition in cars.

So where is this imbalance coming from? First, let me make one thing clear: the problem is not with how women talk. The suggestion that, for example, “women could be taught to speak louder, and direct their voices towards the microphone” is ridiculous. In fact, women use speech strategies that should make it easier for voice recognition technology to work on women’s voices. Women tend to be more intelligible (for people without high-frequency hearing loss), and to talk slightly more slowly. In general, women also favor more standard forms and make less use of stigmatized variants. Women’s vowels, in particular, lend themselves to classification: women produce longer vowels that are more distinct from each other than men’s are. (Edit 7/28/2016: I have since found two papers by Sharon Goldwater, Dan Jurafsky and Christopher D. Manning where they found better performance for women than men, due to the above factors and different rates of filler words like “um” and “uh”.)

One thing that may be making a difference is that women also tend not to be as loud, partly as a function of just being smaller, and cepstra (the fancy math that’s under the hood of most automatic speech recognition) are sensitive to differences in intensity. None of this means that women’s voices are more difficult; I’ve trained classifiers on speech data from women and they worked just fine, thank you very much. What it does mean is that women’s voices are different from men’s voices, so a system designed around men’s voices just won’t work as well for women’s.

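To make the intensity point concrete, here’s a toy sketch assuming the librosa library, with a pure tone standing in for actual speech: the same “voice” at two loudness levels yields different cepstral features.

```python
import numpy as np
import librosa

# A one-second 220 Hz tone stands in for a voice; only its amplitude differs.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
voice = np.sin(2 * np.pi * 220 * t).astype(np.float32)

mfcc_loud = librosa.feature.mfcc(y=voice, sr=sr, n_mfcc=13)
mfcc_quiet = librosa.feature.mfcc(y=0.1 * voice, sr=sr, n_mfcc=13)

# The first cepstral coefficient tracks overall log energy, so the same
# "voice" at lower intensity yields visibly different feature values.
diff = np.abs(mfcc_loud - mfcc_quiet).mean(axis=1)
print("mean |difference| for coefficients 0-12:", diff.round(2))
```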
Which leads right into where I think this bias is coming from: unbalanced training sets. Like car crash dummies, voice recognition systems were designed for (and largely by) men. Over two-thirds of the authors in the Association for Computational Linguistics Anthology Network are male, for example. Which is not to say that there aren’t truly excellent female researchers working in speech technology (Mari Ostendorf and Gina-Anne Levow here at the UW and Karen Livescu at TTI-Chicago spring immediately to mind), but they’re outnumbered. And that imbalance seems to extend to the training sets, the annotated speech that’s used to teach automatic speech recognition systems what things should sound like. VoxForge, for example, is a popular open-source speech dataset that “suffers from major gender and per speaker duration imbalances.” I had to get that information from another paper, since VoxForge doesn’t have speaker demographics available on its website. And it’s not the only popular corpus that doesn’t include speaker demographics: neither does the AMI Meeting Corpus, nor the Numbers corpus. And when I could find the numbers, they weren’t balanced for gender. TIMIT, which is the single most popular speech corpus in the Linguistic Data Consortium, is just over 69% male. I don’t know what speech database the Google speech recognizer is trained on, but based on the speech recognition rates by gender, I’m willing to bet that it’s not balanced for gender either.

Why does this matter? It matters because there are systematic differences between men’s and women’s speech. (I’m not going to touch on the speech of other genders here, since that’s a very young research area. If you’re interested, the Journal of Language and Sexuality is a good jumping-off point.) And machine learning works by making computers really good at dealing with things they’ve already seen a lot of. If they get a lot of speech from men, they’ll be really good at identifying speech from men. If they don’t get a lot of speech from women, they won’t be that good at identifying speech from women. And it looks like that’s the case. Based on my data from fifty different speakers, Google’s speech recognition (which, if you remember, is probably the best-performing proprietary automatic speech recognition system on the market) just doesn’t work as well for women as it does for men.

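If you want to see that mechanism in miniature, here’s a toy sketch with synthetic numbers (nothing to do with real speech features): a single classifier trained on data dominated by one group fits that group’s distribution, and the under-represented group pays for it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic illustration: group B's classes sit in a different region of
# feature space than group A's, and group A dominates the training data.
rng = np.random.default_rng(42)

def make_group(n, centers):
    """n samples for one group: half from each class, at the given centers."""
    half = n // 2
    X = np.vstack([rng.normal(centers[0], 1.0, size=(half, 2)),
                   rng.normal(centers[1], 1.0, size=(half, 2))])
    y = np.array([0] * half + [1] * half)
    return X, y

X_a, y_a = make_group(2000, centers=(-1.0, 1.0))  # well-represented group
X_b, y_b = make_group(100, centers=(2.0, 4.0))    # under-represented group

model = LogisticRegression().fit(np.vstack([X_a, X_b]),
                                 np.concatenate([y_a, y_b]))

# The learned boundary mostly fits group A's data, so fresh samples from
# group B are classified noticeably less accurately.
X, y = make_group(1000, centers=(-1.0, 1.0))
print(f"accuracy on group A: {model.score(X, y):.2f}")
X, y = make_group(1000, centers=(2.0, 4.0))
print(f"accuracy on group B: {model.score(X, y):.2f}")
```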
Responses

  1. Every learning algorithm makes complex tradeoffs in an attempt to minimize error during prediction. Let’s take medical transcription as an example. If we stratified the training data by sex, then we might be able to get equal error rates for male and female doctors. However, this would probably come at the expense of keeping the same error rate for women and increasing the error rate for men (because we would be deleting training data from men in order to achieve sex balance in the training data). This would not be a good outcome. This example raises the question of what constitutes undesirable bias. I think we need to look at who is affected by the biases. Suppose that 75% of physicians are male but that the sex of the doctor is independent of the sex of the patient. Then by increasing accuracy for male voices, we are increasing accuracy equally for male and female patients.

    • TLDR: Jim White’s comments are much better than yours.

      Why do people have an impulse to justify the problems that have been identified? It’s different if you identify causes and then discuss solutions. You identify causes, and then imply that the only way to balance the results is to weaken the effectiveness of recognition for male voices. You give no justification for that stance. In contrast, the author gives a strong argument that there is nothing specific to female voices that would decrease the effectiveness of recognition. Clearly, and in contrast to your comments, any approach to bringing parity to voice recognition should be one that improves recognition of women’s voices without negatively impacting results for men. It’s not like Google is operating under compute constraints relative to anyone else.

  2. I do not speak for Google and I don’t work directly on speech-to-text, but I will explain my understanding of what’s happening based on Google Research publications. The key issue here is that ASR at “Google Scale” means it is heavily based on “unsupervised” learning methods using data gathered from users of Google systems, not manually curated research corpora.

    Google publishes widely on its speech research (http://research.google.com/pubs/SpeechProcessing.html). Some relevant papers that explain how the ASR system was built are:

    Deploying GOOG-411: Early Lessons in Data, Measurement, and Testing
    Unsupervised Testing Strategies for ASR
    Investigations on Exemplar-Based Features for Speech Recognition Towards Thousands of Hours of Unsupervised, Noisy Data

    Therefore the gender bias in performance you observe almost certainly reflects a biased distribution among the users of Google speech products. That isn’t a surprise, since it has been observed that women have been adopting new digital technologies at a lower rate than men. For example:

    Gendered Space: The Digital Divide between Male and Female Users in Internet Public Access Sites, Laura J. Dixon and Teresa Correa, Journal of Computer-Mediated Communication, 2014

    The good news is that if that is the cause, then this issue will “fix itself” as women come to use these systems as often as men, even if nothing else is done to address it. Of course, that doesn’t mean it shouldn’t be addressed specifically, but any solution would need to improve aggregate performance across all users to be a success. An internship at Google would be a great way for you to find out exactly what is happening and try your hand at fixing the problem.

  3. I feel like I’m constantly yelling to get Google Voice to understand me. It gets my husband every time. He thinks women just don’t speak as clearly. I think voice recognition should do a better job of accounting for how women speak.
