How well do Google and Microsoft recognize speech across dialect, gender and race?

If you’ve been following my blog for a while, you may remember that last year I found that YouTube’s automatic captions didn’t work as well for some dialects, or for women. The effects I found were pretty robust, but I wanted to replicate them for a couple of reasons:

  • I only looked at one system, YouTube’s automatic captions, and even then the recordings had been uploaded over a period of several years rather than at a single point in time. I controlled for time of upload in my statistical models, but it wasn’t the fairest system evaluation.
  • I didn’t control for the audio quality, and since speech recognition is pretty sensitive to things like background noise and microphone quality, that could have had an effect.
  • The only demographic information I had was where someone was from. Given recent results that find that natural language processing tools don’t work as well for African American English, I was especially interested in looking at automatic speech recognition (ASR) accuracy for African American English speakers.

With that in mind, I did a second analysis on both YouTube’s automatic captions and Bing’s speech API (that’s the same tech that’s inside Microsoft’s Cortana, as far as I know).

Speech Data

For this project, I used speech data from the International Dialects of English Archive. It’s a collection of English speech from all over, originally collected to help actors sound more realistic.

I used speech data from four varieties: the South (speakers from Alabama), the Northern Cities (Michigan), California, and General American. “General American” is the sort of newscaster style of speech that a lot of people consider unaccented–even though it’s just as much an accent as any of the others! You can hear a sample here.

For each variety, I did an acoustic analysis to make sure that the speakers I’d selected actually did use the variety I thought they should, and they all did.
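(If you’re curious what that kind of check looks like, here’s a rough sketch using the parselmouth library to pull formant measurements out of a recording. This isn’t my actual analysis script; the file name and the measurement time are just placeholders.)

```python
# Rough sketch of a formant-based dialect sanity check using parselmouth
# (Python bindings for Praat). The file name and the time point are
# placeholders, not data from the actual study.
import parselmouth

snd = parselmouth.Sound("speaker_01.wav")      # hypothetical recording
formants = snd.to_formant_burg()               # Burg formant tracking

t = snd.duration / 2                           # e.g. a time point mid-file
f1 = formants.get_value_at_time(1, t)          # first formant (Hz)
f2 = formants.get_value_at_time(2, t)          # second formant (Hz)
print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz at t = {t:.2f} s")
```

Plotting F1 against F2 for a speaker’s vowels is one common way to see whether, say, the Northern Cities vowel shift is actually there in the audio.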

Systems

For the YouTube captions, I just uploaded the speech files to YouTube as videos and then downloaded the subtitles. (I would have used the API instead, but when I was doing this analysis there was no Python Google Speech API, even though very thorough documentation had already been released.)
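(If you wanted to pull automatic captions programmatically, something like the sketch below would do it. I’m assuming the youtube-dl Python package here, and the video URL is just a placeholder, not one of my actual uploads.)

```python
# Hedged sketch: fetch YouTube's automatically generated captions for an
# already-uploaded video without downloading the video itself.
# Assumes the youtube-dl package; the URL below is a placeholder.
import youtube_dl

opts = {
    "skip_download": True,        # we only want the caption track
    "writeautomaticsub": True,    # request the automatic (ASR) captions
    "subtitleslangs": ["en"],     # English auto-captions
    "subtitlesformat": "vtt",     # save as WebVTT
}

with youtube_dl.YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder ID
```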

Bing’s speech API was a little more complex. For this one, my co-author built a custom Android application that sent the files to the API & requested a long-form transcript back. For some reason, a lot of our sound files were returned as only partial transcriptions. My theory is that there is a running confidence function for the accuracy of the transcription, and once the overall confidence drops below a certain threshold, you get back whatever was transcribed up to that point. I don’t know if that’s the case, though, since I don’t have access to their source code. Whatever the reason, the Bing transcriptions were less accurate overall than the YouTube transcriptions, even when we account for the fact that fewer words were returned.
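In case you’re wondering how accuracy gets measured in the first place: word error rate is just the word-level edit distance (substitutions, insertions and deletions) divided by the number of words in the reference transcript. Here’s a minimal sketch; it’s not the scoring script from the paper, and the truncation helper at the end is only one hypothetical way of handling those partial Bing transcripts.

```python
# Minimal word error rate (WER) sketch: word-level edit distance divided by
# the length of the reference transcript.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One hypothetical way to "account for fewer words returned": score a partial
# transcript against only the part of the reference it plausibly covers.
def wer_on_covered_portion(reference, partial_hypothesis):
    ref_words = reference.split()
    hyp_words = partial_hypothesis.split()
    covered_ref = " ".join(ref_words[:len(hyp_words)])
    return wer(covered_ref, partial_hypothesis)

print(wer("please call stella", "please call stella"))  # 0.0
print(wer("please call stella", "please car stella"))   # ~0.33
```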

Results

OK, now to the results. Let’s start with dialect area. As you might be able to tell from the graphs below, there were pretty big differences between the two systems we looked at. In general, there was more variation in the word error rate for Bing and overall the error rate tended to be a bit higher (although that could be due to the incomplete transcriptions we mentioned above). YouTube’s captions were generally more accurate and more consistent. That said, both systems had different error rates across dialects, with the lowest average error rates for General American English.

[Figure: word error rate by dialect for YouTube and Bing]

Differences in Word Error Rate (WER) by dialect were not robust enough to be significant for Bing (under a one-way ANOVA) (F[3, 32] = 1.6, p = 0.21), but they were for YouTube’s automatic captions (F[3, 35] = 3.45, p < 0.05). Both systems had the lowest average WER for General American.
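(For the statistically inclined, the test itself is just a one-way ANOVA over per-speaker word error rates grouped by dialect. A sketch with scipy looks like the following; the WER values here are made-up placeholders, not my actual data.)

```python
# Sketch of a one-way ANOVA over per-speaker word error rates by dialect.
# The WER lists below are invented placeholders, not the study's data.
from scipy.stats import f_oneway

wer_general_american = [0.10, 0.12, 0.09, 0.11]
wer_south            = [0.18, 0.22, 0.15, 0.20]
wer_northern_cities  = [0.14, 0.17, 0.13, 0.16]
wer_california       = [0.12, 0.15, 0.11, 0.14]

f_stat, p_value = f_oneway(wer_general_american, wer_south,
                           wer_northern_cities, wer_california)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```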

Now, let’s turn to gender. If you read my earlier work, you’ll know that I previously found that YouTube’s automatic captions were more accurate for men and less accurate for women. This time, with carefully recorded speech samples, I found no robust difference in accuracy by gender in either system. Which is great! In addition, the (unreliable) trends for the two systems pointed in opposite directions: Bing had a lower WER for male speakers, while YouTube had a lower WER for female speakers.

So why did I find an effect last time? My (untested) hypothesis is that there was a difference in the signal to noise ratio for male and female speakers in the user-uploaded files. Since women are (on average) smaller and thus (on average) slightly quieter when they speak, it’s possible that their speech was more easily masked by background noises, like fans or traffic. These files were all recorded in a quiet place, however, which may help to explain the lack of difference between genders.
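(If you wanted to actually test that signal-to-noise hypothesis, one quick-and-dirty way to estimate a recording’s SNR is to compare the energy of its loudest frames to its quietest frames. The sketch below does that with numpy and scipy; the file name and the percentile cutoffs are assumptions of mine, not anything from the study.)

```python
# Rough, hedged sketch of estimating signal-to-noise ratio (SNR) from a WAV
# file: treat the loudest frames as "speech" and the quietest as the noise
# floor. The file name and percentile cutoffs are placeholder assumptions.
import numpy as np
from scipy.io import wavfile

rate, audio = wavfile.read("speaker_01.wav")     # hypothetical recording
audio = audio.astype(np.float64)
if audio.ndim > 1:                               # mix down to mono if stereo
    audio = audio.mean(axis=1)

frame_len = int(0.025 * rate)                    # 25 ms frames
n_frames = len(audio) // frame_len
frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
energy = (frames ** 2).mean(axis=1)              # mean power per frame

noise_power = max(np.percentile(energy, 10), 1e-12)  # quietest 10% ~ background
speech_power = np.percentile(energy, 90)              # loudest 10% ~ speech
snr_db = 10 * np.log10(speech_power / noise_power)
print(f"Estimated SNR: {snr_db:.1f} dB")
```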

[Figure: word error rate by gender for YouTube and Bing]

Neither Bing (F[1, 34] = 1.13, p = 0.29), nor YouTube’s automatic captions (F[1, 37] = 1.56, p = 0.22) had a significant difference in accuracy by gender.

Finally, what about race? For this part of the analysis, I excluded the General American speakers, since they did not report their race. I also excluded the single Native American speaker. Even with fewer speakers, and thus reduced power, the differences between races were still robust enough to be significant for YouTube’s automatic captions, and Bing followed the same trend. Both systems were most accurate for Caucasian speakers.

[Figure: word error rate by race for YouTube and Bing]

As with dialect, differences in WER between races were not significant for Bing (F[4, 31] = 1.21, p = 0.36), but were significant for YouTube’s automatic captions (F[4, 34] = 2.86, p < 0.05). Both systems were most accurate for Caucasian speakers.

While I was happy to find no difference in performance by gender, the fact that both systems made more errors for speakers who are not Caucasian and who do not speak General American is deeply concerning. Regional varieties of American English and African American English are both consistent and well-documented. There is nothing intrinsic to these varieties that makes them harder to recognize. The fact that they are recognized with more errors is most likely due to bias in the training data. (In fact, Mozilla is currently collecting diverse speech samples for an open corpus of training data–you can help them out yourself.)

So what? Why does word error rate matter?

There are two things I’m really worried about with these types of speech recognition errors. The first is that higher error rates seem to overwhelmingly affect already-disadvantaged groups. In the US, strong regional dialects tend to be associated with speakers who aren’t as wealthy, and there is a long and continuing history of racial discrimination in the United States.

Given this, the second thing I’m worried about is the fact that these voice recognition systems are being incorporated into other applications that have a real impact on people’s lives.

Every automatic speech recognition system makes errors. I don’t think that’s going to change (certainly not in my lifetime). But I do think we can get to the point where those errors don’t disproportionately affect already-marginalized people. And if we keep incorporating automatic speech recognition into high-stakes situations, it’s vital that we get to that point quickly and, in the meantime, stay aware of these biases.

If you’re interested in the long version, you can check out the published paper here.


“Men” vs. “Females” and sexist writing

So, I have a confession to make. I actually set out to write a completely different blog post. In searching Wikimedia Commons for a picture, though, I came across something that struck me as odd. I was looking for pictures of people writing, and I noticed that there were two gendered sub-categories, one for men and one for women. Leaving aside the question of having only two genders, what really stuck out to me were the names. The category with pictures of men was called “Men Writing” and the category with pictures of women was called “Females Writing”.

[Image: a sign depicting a man, a woman, and a child]

According to this sign, the third most common gender is “child”.

So why did that bother me? It is true that male humans are men and that women are female humans. Sure, a writing professor might nag about how the two terms lack parallelism, but does it really matter?

The thing is, it wouldn’t matter if this was just a one-off thing. But it’s not. Let’s look at the Category: Males and Category: Females*. At the top of the category page for men, it states “This category is about males in general. For human males, see Category:Male humans”. And the male humans category is, conveniently, the first subcategory. Which is fine, no problem there. BUT. There is no equivalent disclaimer at the top of Category: Females, and the first subcategory is not female humans but female animals. So even though “Females” is used to refer specifically to female humans when talking about writing, when talking about females in general it looks as if at least one editor has decided that it’s more relevant for referring to female animals. And that also gels with my own intuitions. I’m more likely to ask “How many females?” when looking at a bunch of baby chickens than I am when looking at a bunch of baby humans. Assuming the editors responsible for these distinctions are also native English speakers, their intuitions are probably very similar.

So what? Well, it makes me uncomfortable to be referred to with a term that is primarily used for non-human animals while men are referred to with a term that I associate with humans. (Or, perhaps, women are being referred to as “female men”, but that’s equally odd and exclusionary.)

It took me a while to come to that conclusion. I felt that there was something off about the terminology, but I had to turn and talk it over with my officemate for a couple minutes before finally getting at the kernel of the problem. And I don’t think it’s a conscious choice on the part of the editors–it’s probably something they don’t even realize they’re doing. But I definitely do think that it’s related to the gender imbalance of the editors of Wikimedia. According to recent statistics, over ninety percent (!) of Wikipedia editors are male. And this type of sexist language use probably perpetuates that imbalance. If I feel, even if it’s for reasons that I have a hard time articulating, that I’m not welcome in a community, then I’m less likely to join it. And that’s not just me. Students who are presented with job descriptions in language that doesn’t match their gender are less likely to be interested in those jobs. Women are less likely to respond to job postings if “he” is used to refer to both men and women. I could go on citing other studies, but we could end up being here all day.

My point is this: sexist language affects the behaviour and choices of those who hear it. And in this case, it makes me less likely to participate in this on-line community because I don’t feel as if I would be welcomed and respected there. It’s not only Wikipedia/Wikimedia, either. This particular usage pattern is also something I associate with Reddit (a good discussion here). The gender breakdown of Reddit? About 70% male.

For some reason, the idea that we should avoid sexist language usage seems to really bother people. I was once a TA for a large lecture class where, in the middle of discussions of the effects of sexist language, a male student interrupted the professor to say that he didn’t think it was a problem. I’ve since thought about it quite a bit (it was pretty jarring) and I’ve come to the conclusion that the reason the student felt that way is that, for him, it really wasn’t a problem. Since sexist language is almost always exclusionary to women, and he was not a woman, he had not felt that moment of discomfort before.

Further, I think he may have felt that, because this type of language tends to benefit men, we were blaming him. I want to be clear here: I’m not blaming anyone for their unconscious biases. And I’m not saying that only men use sexist language. The Wikimedia editors who made this choice may very well have been women. What I am saying is that we need to be aware of these biases and strive to correct them. It’s hard, and it takes constant vigilance, but it’s an important and relatively simple step that we can all take in order to help eliminate sexism.

*As they were on Wednesday, April 8, 2015. If they’ve been changed, I’d recommend the Wayback Machine.