Google’s speech recognition has a gender bias

Edit, July 2020: Hello! This blog post has been cited quite a bit recently so I thought I’d update it with the more recent research. I’m no longer working actively on this topic, but in the last paper I wrote on it, in 2017, I found that when audio quality was controlled the gender effects disappeared. I take this to be evidence that differences in gender are due to differences in overall signal-to-noise ratio when recording in noisy environments rather than problems in the underlying ML models.

That said, bias against specific demographic categories in automatic speech recognition is a problem. In my 2017 study, I found that multiple commercial ASR systems had higher error rates for non-white speakers. More recent research has found the same effect: ASR systems make more errors for Black speakers than white speakers. In my professional opinion, the racial differences are both more important and more difficult to solve.

The original, unedited blog post continues below.


In my last post, I looked at how Google’s automatic speech recognition worked with different dialects. To get this data, I hand-checked annotations for more than 1500 words from fifty different accent tag videos.

Now, because I’m a sociolinguist and I know that it’s important to stratify your samples, I made sure I had an equal number of male and female speakers for each dialect. And when I compared performance on male and female talkers, I found something deeply disturbing: YouTube’s auto captions consistently performed better on male voices than female voices (t(47) = -2.7, p < 0.01). (You can see my data and analysis here.)

On average, for each female speaker less than half (47%) of her words were captioned correctly. The average male speaker, on the other hand, was captioned correctly 60% of the time.

It’s not that there’s a consistent but small effect size, either: 13 percentage points is a pretty big effect. The Cohen’s d was 0.7 which means, in non-math-speak, that if you pick a random man and random woman from my sample, there’s an almost 70% chance the transcriptions will be more accurate for the man. That’s pretty striking.
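If you want to check that “almost 70%” figure, here is a minimal sketch in Python. It assumes both groups’ accuracy scores are roughly normally distributed with equal variance (an assumption made for illustration, not something tested in this post), in which case the chance that a randomly chosen man outscores a randomly chosen woman is Φ(d/√2). The function name is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def probability_of_superiority(cohens_d: float) -> float:
    """Chance that a random draw from the higher-scoring group beats
    a random draw from the other group.

    Assumes two normal distributions with equal variance, so the
    difference between two draws has standard deviation sqrt(2).
    """
    return NormalDist().cdf(cohens_d / sqrt(2))


# d = 0.7 is the effect size reported in the post.
chance = probability_of_superiority(0.7)
print(f"{chance:.2f}")  # prints 0.69
```

An effect size of zero gives exactly 0.5 here, which is the coin flip you’d expect if gender made no difference.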

What it is not, unfortunately, is shocking. There’s a long history of speech recognition technology performing better for men than women:

This is a real problem with real impacts on people’s lives. Sure, a few incorrect YouTube captions aren’t a matter of life and death. But some of these applications have a lot higher stakes. Take the medical dictation software study. The fact that men enjoy better performance than women with these technologies means that it’s harder for women to do their jobs. Even if it only takes a second to correct an error, those seconds add up over the days and weeks to a major time sink, time your male colleagues aren’t wasting messing with technology. And that’s not even touching on the safety implications of voice recognition in cars.

So where is this imbalance coming from? First, let me make one thing clear: the problem is not with how women talk. The suggestion that, for example, “women could be taught to speak louder, and direct their voices towards the microphone” is ridiculous. In fact, women use speech strategies that should make it easier for voice recognition technology to work on women’s voices.  Women tend to be more intelligible (for people without high-frequency hearing loss), and to talk slightly more slowly. In general, women also favor more standard forms and make less use of stigmatized variants. Women’s vowels, in particular, lend themselves to classification: women produce longer vowels which are more distinct from each other than men’s are. (Edit 7/28/2016: I have since found two papers by Sharon Goldwater, Dan Jurafsky and Christopher D. Manning where they found better performance for women than men–due to the above factors and different rates of filler words like “um” and “uh”.) One thing that may be making a difference is that women also tend not to be as loud, partly as a function of just being smaller, and cepstrals (the fancy math thing what’s under the hood of most automatic voice recognition) are sensitive to differences in intensity. This all doesn’t mean that women’s voices are more difficult; I’ve trained classifiers on speech data from women and they worked just fine, thank you very much. What it does mean is that women’s voices are different from men’s voices, though, so a system designed around men’s voices just won’t work as well for women’s.

Which leads right into where I think this bias is coming from: unbalanced training sets. Like car crash dummies, voice recognition systems were designed for (and largely by) men. Over two thirds of the authors in the Association for Computational Linguistics Anthology Network are male, for example. Which is not to say that there aren’t truly excellent female researchers working in speech technology (Mari Ostendorf and Gina-Anne Levow here at the UW and Karen Livescu at TTI-Chicago spring immediately to mind) but they’re outnumbered. And that imbalance seems to extend to the training sets, the annotated speech that’s used to teach automatic speech recognition systems what things should sound like. Voxforge, for example, is a popular open source speech dataset that “suffers from major gender and per speaker duration imbalances.” I had to get that info from another paper, since Voxforge doesn’t have speaker demographics available on their website. And it’s not the only popular corpus that doesn’t include speaker demographics: neither does the AMI meeting corpus, nor the Numbers corpus. And when I could find the numbers, they weren’t balanced for gender. TIMIT, which is the single most popular speech corpus in the Linguistic Data Consortium, is just over 69% male. I don’t know what speech database the Google speech recognizer is trained on, but based on the speech recognition rates by gender I’m willing to bet that it’s not balanced for gender either.

Why does this matter? It matters because there are systematic differences between men’s and women’s speech. (I’m not going to touch on the speech of other genders here, since that’s a very young research area. If you’re interested, the Journal of Language and Sexuality is a good jumping-off point.) And machine learning works by making computers really good at dealing with things they’ve already seen a lot of. If they get a lot of speech from men, they’ll be really good at identifying speech from men. If they don’t get a lot of speech from women, they won’t be that good at identifying speech from women. And it looks like that’s the case. Based on my data from fifty different speakers, Google’s speech recognition (which, if you remember, is probably the best-performing proprietary automatic speech recognition system on the market) just doesn’t work as well for women as it does for men.


The Motor Theory of Speech Perception

Ok, so like I talked about in my previous two posts, modelling speech perception is an ongoing problem with a lot of hurdles left to jump. But there are potential candidate theories out there, all of which offer good insight into the problem. The first one I’m going to talk about is motor theory.

[Image: a Clamp-Type 2C1.5-4 motor. Caption: So your tongue is like the motor body and the other person’s ears are like the load cell…]
So motor theory has one basic premise and three major claims. The basic premise is a keen observation: we don’t just perceive speech sounds, we also make them. Whoa, stop the presses. Ok, so maybe it seems really obvious, but motor theory was really the first major attempt to model speech perception that took this into account. Up until it was first posited in the 1960s, people had pretty much been ignoring that and treating speech perception like the only information listeners had access to was what was in the acoustic speech signal. We’ll discuss that in greater detail later, but it’s still pretty much the way a lot of people approach the problem. I don’t know of a piece of voice recognition software, for example, that includes an anatomical model.

So what does the fact that listeners are listener/speakers get you? Well, remember how there aren’t really invariant units in the speech signal? Well, if you decide that what people are actually perceiving isn’t a collection of acoustic markers that point to one particular language sound but instead the gestures needed to make up that sound, then suddenly that’s much less of a problem. To put it another way, we’re used to thinking of speech as being made up of a bunch of sounds, and that when we’re listening to speech we’re deciding what the right sounds are and from there picking the right words. But from a motor theory standpoint, what you’re actually doing when you’re listening to speech is deciding what the speaker’s doing with their mouth and using that information to figure out what words they’re saying. So in the dictionary in your head, you don’t store words as strings of sounds but rather as strings of gestures.

If you’re like me when I first encountered this theory, it’s about this time that you’re starting to get pretty skeptical. I mean, I basically just said that what you’re hearing is the actual movement of someone else’s tongue and figuring out what they’re saying by reverse engineering it based on what you know your tongue is doing when you say the same word. (Just FYI, when I say tongue here, I’m referring to the entire vocal tract in its multifaceted glory, but that’s a bit of a mouthful. Pun intended. 😉 ) I mean, yeah, if we accept this it gives us a big advantage when we’re talking about language acquisition–since if you’re listening to gestures, you can learn them just by listening–but still. It’s weird. I’m going to need some convincing.

Well, let’s get back to those three claims I mentioned earlier, which are taken from Galantucci, Fowler and Turvey’s excellent review of motor theory.

  1. Speech is a weird thing to perceive and pretty much does its own thing. I’ve talked about this at length, so let’s just take that as a given for now.
  2. When we’re listening to speech, we’re actually listening to gestures. We talked about that above. 
  3. We use our motor system to help us perceive speech.

Ok, so point three should jump out at you a bit. Why? Of these three points, it’s the easiest one to test empirically. And since I’m a huge fan of empirically testing things (Science! Data! Statistics!) we can look into the literature and see if there’s anything that supports this. Like, for example, a study that shows that when listening to speech, our motor cortex gets all involved. Well, it turns out that there are lots of studies that show this. You know that term “active listening”? There’s pretty strong evidence that it’s more than just a metaphor; listening to speech involves our motor system in ways that not all acoustic inputs do.

So point three is pretty well supported. What does that mean for point two? It really depends on who you’re talking to. (Science is all about arguing about things, after all.) Personally, I think motor theory is really interesting and addresses a lot of the problems we face in trying to model speech perception. But I’m not ready to swallow it hook, line and sinker. I think Robert Remez put it best in the proceedings of Modularity and The Motor Theory of Speech Perception:

I think it is clear that Motor Theory is false. For the other, I think the evidence indicates no less that Motor Theory is essentially, fundamentally, primarily and basically true. (p. 179)

On the one hand, it’s clear that our motor system is involved in speech perception. On the other, I really do think that we use parts of the acoustic signal in and of themselves. But we’ll get into that in more depth next week.

How do you pronounce Gangnam?

So if you’ve been completely oblivious lately, you might not be aware that Korean musician Psy has recently become an international sensation due to the song below. If you haven’t already seen it, you should. I’ll wait.

Ok, good. Now, I wrote a post recently where I suggested that a trained phonetician can help you learn to pronounce things and I thought I’d put my money where my mouth is and run you through how to pronounce “Gangnam”, phonetics style. (Note: I’m assuming you’re a native English speaker here.)

First, let’s see how a non-phonetician does it. Here’s a brief guide to the correct pronunciation offered on Reddit by ThatWonAsianGuy, who I can only assume is a native Korean speaker.

The first G apparently sounds like a K consonant to non-Korean speakers, but it’s somewhere between a G and a K, but more towards the G. (There are three letters similar, ㅋ, ㄱ, and ㄲ. The first is a normal “k,” the second the one used in Gangnam, and the third being a clicky, harsh g/k noise.)

The “ang” part is a very wide “ahh” (like when a doctor tells you to open your mouth) followed by an “ng” (like the end of “ending”). The “ahh” part, however, is not a long vowel, so it’s pronounced quickly.

“Nam” also has the “ahh” for the a. The other letters are normal.

So it sounds like (G/K)ahng-nahm.

Let’s see how he did. Judges?

Full marks for accuracy, Rachael. Nothing he said is incorrect. On the other hand, I give it a usability score of just 2 out of 10. While the descriptions of the vowels and nasal sounds are intelligible and usable to most English speakers, even I was stumped by his description of a sound between a “g” and a “k”. A strong effort, though; with some training this kid could make it to the big leagues of phonetics.

Thank you Rachael, and good luck to ThatWonAsianGuy in his future phonetics career. Ok, so what is going on here in terms of the k/g/apparently clicky harsh sound? Funny you should ask, because I’m about to tell you in gruesome detail.

First things first: you need to know what voicing is. Put your hand over your throat and say “k”. Now say “g”. Can you feel how, when you say “g”, there’s sort of a buzzing feeling? That’s what linguists call voicing. What’s actually happening is that you’re pulling your vocal folds together and then forcing air through them. This makes them vibrate, which in turn makes a sound. Like so:

(If you’re wondering what that little cat-tongue looking thing is, that’s the epiglottis. It keeps you from choking to death by trying to breathe food and is way up there on my list of favorite body parts.)

But wait! That’s not all! What we think of as “regular voicing” (ok, maybe you don’t think of it all that often, but I’m just going to assume that you do) is just one of the things you can do with your voicing. What other types of voicing are there? It’s the type of thing that’s really best described vocally, so here goes:

Ok, so, that’s what’s going on in your larynx. Why is this important? Well it turns out that only one of the three sounds is actually voiced, and it’s voiced using a different type of voicing. Any guesses as to which one?

Yep, it’s the harsh, clicky one and it’s got glottal voicing (that really low, creaky sort of voice)*. The difference between the “regular k” and the “k/g sound” has nothing to do with voicing type. Which is crazy talk, because almost every “learn Korean” textbook or online course I’ve come across has described them as “k” and “g” respectively and, as we already established, the difference between “k” and “g” is that the “g” is voiced and the “k” isn’t.

Ok, I simplified things a bit. When you say “k” and “g” at the beginning of a word in English (and only at the beginning of a word), there’s actually one additional difference between them. Try this. Put your hand in front of your mouth and say “cab”. Then say “gab”. Do you notice a difference?

You should have felt a puff of air when you said the “k” but not when you said the “g”. Want proof that it only happens at the beginning of words? Try saying “back” and “bag” in the same way, with your hand in front of your mouth. At the end of words they feel about the same. What’s going on?

Well, in English we always say an unvoiced “k” with a little puff of air at the beginning of the word. In fact, we tend to listen for that puff more than we listen for voicing. So if you say “kat” without voicing the sound, but also without the little puff of air, it sounds more like “gat”. (Which is why language teachers tell you to say it “g” instead of “k”. It’s not, strictly speaking, right, but it is a little easier to hear. The same thing happens in Mandarin, BTW.) And that’s the sound that’s at the beginning of Gangnam.
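Here’s that reasoning as a toy Python sketch. The two yes/no features and the function name are simplifications assumed for illustration (real speech is messier than two switches), but it shows why an unvoiced stop with no puff of air gets heard as “g” by English ears.

```python
def english_ear_hears(voiced: bool, aspirated: bool) -> str:
    """Toy model of how an English listener labels a word-initial
    "k"-or-"g" sound: the puff of air (aspiration) decides the label.
    Voicing is deliberately ignored, which is the simplification."""
    if aspirated:
        return "k"
    # No puff of air: English listeners tend to hear "g",
    # whether or not the vocal folds are vibrating.
    return "g"


# (voiced, aspirated) for each sound
sounds = {
    "English 'k' in 'cab'": (False, True),
    "English 'g' in 'gab'": (True, False),
    "first sound in 'Gangnam'": (False, False),
}

for name, (voiced, aspirated) in sounds.items():
    print(f"{name}: heard as '{english_ear_hears(voiced, aspirated)}'")
```

The last entry is the interesting one: no voicing and no puff, so it lands in the “g” bucket even though the vocal folds aren’t vibrating.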

You’ll probably need to practice a bit before you get it right, but if you can make a sound at the beginning of a word where your vocal cords aren’t vibrating and without that little puff of air, you’re doing it right. You can already make the sound, it’s just the moving it to the beginning of the word that’s throwing a monkey wrench in the works.

So it’s the unvoiced “k” without the little puff of air. Then an “aahhh” sound, just as described above. Then the “ng” sound, which you tend to see at the end of words in English. It can happen in the middle of words as well, though, like in “finger”. And then “nam”, pronounced in the same way as the last syllable of “Vietnam”.

In the special super-secret International Phonetic (Cabal’s) Alphabet, that’s [kaŋnam]. Now go out there and impress a Korean speaker by not butchering the phonetics of their language!

*Ok, ok, that’s a bit of an oversimplification. You can find the whole story here.

Why is studying linguistics useful? *Is* studying linguistics useful?

So I recently gave a talk at the University of Washington Scholar’s Studio. In it, I covered a couple things that I’ve already talked about here on my blog: the fact that, acoustically speaking, there’s no such thing as a “word” and that our ears can trick us. My general point was that our intuitions about speech, a lot of the things we think seem completely obvious, actually aren’t true at all from an acoustic perspective.

What really got to me, though, was that after I’d finished my talk (and it was super fast, too, only five minutes) someone asked why it mattered. Why should we care that our intuitions don’t match reality? We can still communicate perfectly well. How is linguistics useful, they asked. Why should they care?

I’m sorry, what was it you plan to spend your life studying again? I know you told me last week, but for some reason all I remember you saying is “Blah, blah, giant waste of time.”

It was a good question, and I’m really bummed I didn’t have time to answer it. I sometimes forget, as I’m wading through hip-deep piles of readings that I need to get to, that it’s not immediately obvious to other people why what I do is important. And it is! If I didn’t believe that, I wouldn’t be in grad school. (It’s certainly not the glamorous easy living and fat salary that keep me here.) It’s important in two main ways. Way one is the way in which it enhances our knowledge and way two is the way that it helps people.

 Increasing our knowledge. Ok, so, a lot of our intuitions are wrong. So what? So a lot of things! If we’re perceiving things that aren’t really there, or not perceiving things that are really there, something weird and interesting is going on. We’re really used to thinking of ourselves as pretty unbiased in our observations. Sure, we can’t hear all the sounds that are made, but we’ve built sensors for that, right? But it’s even more pervasive than that. We only perceive the things that our bodies and sensory organs and brains can perceive, and we really don’t know how all these biological filters work. Well, okay, we do know some things (lots and lots of things about ears, in particular) but there’s a whole lot that we still have left to learn. The list of unanswered questions in linguistics is a little daunting, even just in the sub-sub-field of perceptual phonetics.

Every single one of us uses language every single day. And we know embarrassingly little about how it works. And, what we do know, it’s often hard to share with people who have little background in linguistics. Even here, in my blog, without time constraints and with an audience that’s already pretty interested (You guys are awesome!) I often have to gloss over interesting things. Not because I don’t think you’ll understand them, but because I’d metaphorically have to grow a tree, chop it down and spend hours carving it just to make a little step stool so you can get the high-level concept off the shelf and, seriously, who has time for that? Sometimes I really envy scientists in the major disciplines because everyone already knows the basics of what they study. Imagine that you’re a geneticist, but before you can tell people you look at DNA, you have to convince them that sexual reproduction exists. I dream of the day when every graduating high school senior will know IPA. (That’s the international phonetic alphabet, not the beer.)

Okay, off the soapbox.

Helping people. Linguistics has lots and lots and lots of applications. (I’m just going to talk about my little sub-field here, so know that there’s a lot of stuff being left unsaid.) The biggest problem is that so few people know that linguistics is a thing. We can and want to help!

  • Foreign language teaching. (AKA applied linguistics) This one is a particular pet peeve of mine. How many of you have taken a foreign language class and had the instructor tell you something about a sound in the language, like: “It’s between a “k” and a “g” but more like the “k” except different.” That crap is not helpful. Particularly if the instructor is a native speaker of the language, they’ll often just keep telling you that you’re doing it wrong without offering a concrete way to make it correctly. Fun fact: There is an entire field dedicated to accurately describing the sounds of the world’s languages. One good class on phonetics and suddenly you have a concrete description of what you’re supposed to be doing with your mouth and the tools to tell when you’re doing it wrong. On the plus side, a lot of language teachers are starting to incorporate linguistics into their curriculum with good results.
  • Speech recognition and speech synthesis. So this is an area that’s a little more difficult. Most people working on these sorts of projects right now are computational people and not linguists. There is a growing community of people who do both (UW offers a master’s degree in computational linguistics that feeds lots of smart people into Seattle companies like Microsoft and Amazon, for example) but there’s definite room for improvement. The main tension is the fact that using linguistic models instead of statistical ones (though some linguistic models are statistical) hugely increases the need for processing power. The benefit is that accuracy tends to increase. I hope that, as processing power continues to be easier and cheaper to access, more linguistics research will be incorporated into these applications. Fun fact: In computer speech recognition, an 80% comprehension accuracy rate in conversational speech is considered acceptable. In humans, that’s grounds to test for hearing or brain damage.
  • Speech pathology. This is a great field and has made and continues to make extensive use of linguistic research. Speech pathologists help people with speech disorders overcome them, and the majority of speech pathologists have an undergraduate degree in linguistics and a master’s in speech pathology. Plus, it’s a fast-growing career field with a good outlook. Seriously, speech pathology is awesome. Fun fact: Almost half of all speech pathologists work in school environments, helping kids with speech disorders. That’s like the antithesis of a mad scientist, right there.

And that’s why you should care. Linguistics helps us learn about ourselves and helps people, and what else could you ask for in a scientific discipline? (Okay, maybe explosions and mutant sharks, but do those things really help humanity?)

Indiscreet words, Part II: Son of Sounds

Ok, so in my last post about how the speech stream is far from discrete, I talked about how difficult it is to pick apart words. But I didn’t really talk that much about phonemes, and since I promised you phonetics and phonology and phun, I thought I should cover that. Besides, it’s super interesting.

It’s not just that language is continuous, it’s that language that’s discrete is actually impossible to understand. I ran across this YouTube video a while back that’s a great example of this phenomenon.

What the balls of yarn is he saying? It’s actually the preamble to the constitution, but it took me well over half the video to pick up on it, and I spend a dumb amount of time listening to phonemes in isolation.

You probably find this troubling on some level. After all, you’re a literate person, and as a literate person you’re really, really used to thinking about words as being easy to break down into “letter sounds”. If you’ve ever tried to fiddle around with learning Mandarin or Cantonese, you know just how table-flippingly frustrating it is to memorize a writing system where the graphemes (smallest unit of writing, just as morpheme is the smallest unit of meaning, phoneme is the smallest unit of sound and dormeme is the smallest amount of space you can legally house a person in) have no relation to the series of sounds they represent.

Fun fact: It’s actually pretty easy to learn to speak Mandarin or Cantonese once you get past the tones. They’re syntactically a lot like English, don’t have a lot of fussy agreement markers or grammatical gender and have a pretty small core vocabulary. It’s the characters that will make you tear your hair out.

Hm. Well, it kinda looks like me sitting on a chair hunched over my laptop while wearing a little hat and ARGH WHAT AM I DOING THAT LOOKS NOTHING LIKE A BIRD.

But. Um. Sorry, got a little off track there. Point was, you’re really used to thinking about words as being further segmented. Like oranges. Each orange is an individual, and then there are neat little segments inside the orange so you don’t get your hands sticky. And, because you’re already familiar with the spelling system of your language, (which is, let’s face it, probably English) you probably have a fond idea that it’s pretty easy to divide words that way. But it’s not. If it were, things like instantaneous computational voice to voice translation would be common.

It’s hard because the edges of our sounds blur together like your aunt’s watercolor painting that you accidentally spilled lemonade on. So let’s say you’re saying “round”. Well, for the “n” you’re going to open up your nasal passages and put your tongue against the little ridge right behind your teeth. But wait! That’s where your tongue needs to be to make the “d” sound! To make it super clear, you should close off your nasal passages before you flick your tongue down and release that little packet of air that you were storing behind it. You’re totally not going to, though. I mean, your tongue’s already where you need it to be; why would you take the extra time to make sure your nasal passages are fully closed off before releasing the “d”? That’s just a waste of time. And if you did it, you’d sound weird. So the “d” gets some of that nasally goodness and neither you nor your listener give a flying Fluco.

But, if you’re a computer who’s been told, “If it’s got this nasal sound, it’s an ‘n’”, then you’re going to be super confused. Maybe you’ll be all like, “Um, ok. It kinda sounds like an ‘n’, but then it’s got that little pop of air coming out that I’ve been told to look for with the ‘p’, ‘b’, ‘t’, ‘d’, ‘k’, ‘g’ set… so… let’s go with ‘rounp’. That’s a word, right?” Obviously, this is a vast over-simplification, but you get my point; computers are easily confused by the smearing around of sounds in words. They’re getting better, but humans are still the best.
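To make that concrete, here is a toy sketch in Python of the kind of rule-following described above. The cue names and rules are invented for illustration and are nothing like a real recognizer; the point is just that rules written for tidy, separate sounds trip over a sound that carries a leftover cue from its neighbor.

```python
def naive_label(cues: set) -> str:
    """Label one sound with rigid rules that expect exact matches."""
    if cues == {"nasal", "tongue_tip"}:
        return "n"
    if cues == {"burst", "tongue_tip"}:
        return "d"
    if "burst" in cues:
        # There's a pop of air but no exact match, so just guess
        # the first member of the p/b/t/d/k/g set.
        return "p"
    return "?"


# Idealized "round": the "d" carries only its own cues.
tidy = [{"nasal", "tongue_tip"}, {"burst", "tongue_tip"}]
# What actually comes out: the "d" keeps some nasality from the "n".
smeared = [{"nasal", "tongue_tip"}, {"burst", "tongue_tip", "nasal"}]

print("rou" + "".join(naive_label(c) for c in tidy))     # round
print("rou" + "".join(naive_label(c) for c in smeared))  # rounp
```

One extra cue on the “d” is enough to knock it out of its exact-match rule and into the fallback guess.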

So just remember: when you’re around the robot overlords, be sure to run your phonemes together as much as possible. It might confuse them enough for you to have time to run away.

Indiscreet Words

All right, first I’d like to apologize for the title. The opposite of discrete is not indiscreet, but continuous, and continuous is what language, especially speech, is. By continuous, I mean that it doesn’t come out in separable chunks; it’s more like a stream of water than a stream of ice cubes. In fact, English itself discriminates between things that are discrete and continuous; discrete things are called count nouns because (gasp!) you can count them, and continuous things are called mass nouns. You can count ice cubes and words, but you can’t count water or language unless you assign them units.

“But wait,” I can hear you protest. “Language is discrete.  I’m speaking in sentences, that are made up of words that are made up of letters.” And you’re right. For you, your language is made up of units that are psychologically real to you. Somewhere between the speaker vocalizing the words and you parsing them, you segment them using the rules that you’ve mastered. It’s a deeply complex process and one that we still don’t completely understand. If we did, we’d be able to write speech recognition programs that wouldn’t give us errors like “the wells were gathered and planning” for “the walls were dark and clammy”. (True life. I got that very error not that long ago.)
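For a sense of how bad that error is, here is a minimal Python sketch that scores it with word error rate, a common yardstick for speech recognizers (this post doesn’t name a metric, so treat the choice as an assumption). It counts the fewest word substitutions, insertions and deletions needed to turn what was actually said into the recognizer’s output, then divides by the number of words actually said. The function name is made up for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # previous[j] = edits between the reference words seen so far
    # (none yet) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from output
            insertion = current[j - 1] + 1  # extra word in the output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


said = "the walls were dark and clammy"
heard = "the wells were gathered and planning"
print(word_error_rate(said, heard))  # 3 wrong words out of 6 -> 0.5
```

Half the words wrong, in a six-word sentence with no background noise to blame.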

Here, let’s look at some data. Here’s the waveform that shows the wave intensity, or loudness, of a native speaker of English saying “I am an elephant.”

Can you pick out the part of the speech signal for each of the words? Here, let me help you.

So… if speech really is discrete, wouldn’t you expect four separate bumps in loudness for the words, with silence in between? (Maybe with a couple extra bumps on the end for the laughter.)
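Here’s that expectation as a small Python sketch. The loudness numbers are made up for illustration (they are not measured from the recording), and the function name and silence threshold are likewise assumptions. The idea is just to count stretches of sound separated by silence.

```python
def count_bumps(loudness, silence_threshold=0.1):
    """Count runs of above-threshold loudness separated by silence."""
    bumps = 0
    in_bump = False
    for level in loudness:
        if level > silence_threshold and not in_bump:
            bumps += 1  # a new stretch of sound begins
        in_bump = level > silence_threshold
    return bumps


# Made-up loudness values, one number per slice of time.
# If words were discrete: four bursts with silence in between.
discrete = [0.0, 0.8, 0.7, 0.0, 0.6, 0.0, 0.5, 0.6, 0.0, 0.9, 0.8, 0.7, 0.0]
# If the sound never stops: loudness dips but never drops to silence.
continuous = [0.0, 0.8, 0.7, 0.4, 0.6, 0.3, 0.5, 0.6, 0.4, 0.9, 0.8, 0.7, 0.0]

print(count_bumps(discrete))    # 4
print(count_bumps(continuous))  # 1
```

If speech were discrete, a real recording of four words would behave like the first list.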

Instead, what we get is pretty much a constant rush of noise that you rely on the vast amount of knowledge you have about your language to decode accurately. Take out that knowledge and you get something completely incomprehensible. And there’s a really easy way to show this: just listen to someone speaking a language you aren’t familiar with.

That’s Finnish and if you speak it well enough to understand everything he just said, I’d like to extend some mad props unto you; Finno-Ugric languages are as hard as ice-cream from a deep freezer. But to get back to the point, what observations can you make about what you just heard?

  • The speaker was speaking super-quickly.
  • There didn’t seem to be any pauses between words.
  • Basically, it was like standing in front of a language fire hose.

For people who don’t speak your native language, you sound very similar. They’re not speaking any more quickly in Hindi or Mandarin or Swahili or German than you are in English, you just don’t have a metalinguistic framework to help you cut the sound-stream into words, slap it up on a syntactic framework and yank meaning out of it.