Google’s speech recognition has a gender bias

In my last post, I looked at how Google’s automatic speech recognition worked with different dialects. To get this data, I hand-checked annotations for more than 1,500 words from fifty different accent tag videos.

Now, because I’m a sociolinguist and I know that it’s important to stratify your samples, I made sure I had an equal number of male and female speakers for each dialect. And when I compared performance on male and female talkers, I found something deeply disturbing: YouTube’s auto captions consistently performed better on male voices than female voices (t(47) = -2.7, p < 0.01). (You can see my data and analysis here.)

[Figure: caption accuracy by speaker gender]

On average, less than half (47%) of each female speaker’s words were captioned correctly. The average male speaker, on the other hand, had 60% of his words captioned correctly.

It’s not a consistent-but-small effect, either: a 13-percentage-point gap is a pretty big effect. The Cohen’s d was 0.7, which means, in non-math-speak, that if you pick a random man and a random woman from my sample, there’s an almost 70% chance the transcriptions will be more accurate for the man. That’s pretty striking.
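If you want to see how that “almost 70%” falls out of a Cohen’s d of 0.7, here’s a minimal sketch of the arithmetic in Python. The accuracy values in it are made-up placeholders, not my data (which, again, is linked above); it’s only there to show the calculation.

```python
# A sketch of the arithmetic, NOT my actual analysis; the accuracy
# values below are made-up placeholders.
import numpy as np
from scipy import stats

female_acc = np.array([0.42, 0.51, 0.47, 0.39, 0.55, 0.44])  # hypothetical
male_acc = np.array([0.58, 0.63, 0.55, 0.66, 0.52, 0.61])    # hypothetical

# Welch's t-test: is the difference in mean accuracy bigger than chance?
t, p = stats.ttest_ind(male_acc, female_acc, equal_var=False)

# Cohen's d: the difference between group means in standard-deviation units
pooled_sd = np.sqrt((male_acc.var(ddof=1) + female_acc.var(ddof=1)) / 2)
d = (male_acc.mean() - female_acc.mean()) / pooled_sd

# Common-language effect size: P(a random man's accuracy beats a random
# woman's), assuming roughly normal distributions
cles = stats.norm.cdf(d / np.sqrt(2))
print(f"t = {t:.2f}, p = {p:.3f}, d = {d:.2f}, CLES = {cles:.2f}")

# And for the d = 0.7 reported above:
print(f"{stats.norm.cdf(0.7 / np.sqrt(2)):.2f}")  # ~0.69, i.e. almost 70%
```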

What it is not, unfortunately, is shocking. There’s a long history of speech recognition technology performing better for men than women.

This is a real problem with real impacts on people’s lives. Sure, a few incorrect YouTube captions aren’t a matter of life and death. But some of these applications have much higher stakes. Take medical dictation software, for example. The fact that men enjoy better performance than women with these technologies means that it’s harder for women to do their jobs. Even if it only takes a second to correct an error, those seconds add up over days and weeks into a major time sink, time your male colleagues aren’t wasting fiddling with technology. And that’s not even touching on the safety implications of voice recognition in cars.


So where is this imbalance coming from? First, let me make one thing clear: the problem is not with how women talk. The suggestion that, for example, “women could be taught to speak louder, and direct their voices towards the microphone” is ridiculous. In fact, women use speech strategies that should make it easier for voice recognition technology to work on women’s voices. Women tend to be more intelligible (for people without high-frequency hearing loss), and to talk slightly more slowly. In general, women also favor more standard forms and make less use of stigmatized variants. Women’s vowels, in particular, lend themselves to classification: women produce longer vowels that are more distinct from each other than men’s are. (Edit 7/28/2016: I have since found two papers by Sharon Goldwater, Dan Jurafsky and Christopher D. Manning where they found better performance for women than men, thanks to the factors above and different rates of filler words like “um” and “uh”.) One thing that may be making a difference is that women also tend not to be as loud, partly as a function of just being smaller, and cepstral features (the fancy math that’s under the hood of most automatic voice recognition) are sensitive to differences in intensity. None of this means that women’s voices are inherently harder to work with; I’ve trained classifiers on speech data from women and they worked just fine, thank you very much. What it does mean is that women’s voices are different from men’s, so a system designed around men’s voices just won’t work as well for women’s.
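If you’re curious what “sensitive to differences in intensity” looks like in practice, here’s a minimal sketch using librosa (just one example toolkit) and a synthetic noise signal rather than real speech, so no claims about actual voices here: turning the same signal down shifts the 0th cepstral coefficient, which is one of the ways overall loudness leaks into the features recognizers learn from.

```python
# A toy demo with synthetic noise, not real speech: scaling a signal's
# amplitude shows up in its MFCCs, almost entirely in coefficient 0.
import numpy as np
import librosa

sr = 16000
rng = np.random.default_rng(0)
signal = (0.1 * rng.standard_normal(sr)).astype(np.float32)  # 1 s of noise

louder = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
quieter = librosa.feature.mfcc(y=0.1 * signal, sr=sr, n_mfcc=13)

# Mean absolute difference per coefficient: large for MFCC 0, tiny elsewhere,
# because an overall level change shifts every log-mel band by the same amount
print(np.round(np.abs(louder - quieter).mean(axis=1), 2))
```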

Which leads right into where I think this bias is coming from: unbalanced training sets. Like car crash dummies, voice recognition systems were designed for (and largely by) men. Over two thirds of the authors in the Association for Computational Linguistics Anthology Network are male, for example. Which is not to say that there aren’t truly excellent female researchers working in speech technology (Mari Ostendorf and Gina-Anne Levow here at the UW and Karen Livescu at TTI-Chicago spring immediately to mind), but they’re outnumbered. And that imbalance seems to extend to the training sets, the annotated speech that’s used to teach automatic speech recognition systems what things should sound like. Voxforge, for example, is a popular open source speech dataset that “suffers from major gender and per speaker duration imbalances.” I had to get that info from another paper, since Voxforge doesn’t have speaker demographics available on their website. And it’s not the only popular corpus that doesn’t include speaker demographics: neither does the AMI meeting corpus, nor the Numbers corpus. And when I could find the numbers, they weren’t balanced for gender. TIMIT, which is the single most popular speech corpus in the Linguistic Data Consortium, is just over 69% male. I don’t know what speech database the Google speech recognizer is trained on, but based on the speech recognition rates by gender I’m willing to bet that it’s not balanced for gender either.

Why does this matter? It matters because there are systematic differences between men’s and women’s speech. (I’m not going to touch on the speech of other genders here, since that’s a very young research area. If you’re interested, the Journal of Language and Sexuality is a good jumping-off point.) And machine learning works by making computers really good at dealing with things they’ve already seen a lot of. If they get a lot of speech from men, they’ll be really good at identifying speech from men. If they don’t get a lot of speech from women, they won’t be that good at identifying speech from women. And it looks like that’s the case. Based on my data from fifty different speakers, Google’s speech recognition (which, if you remember, is probably the best-performing proprietary automatic speech recognition system on the market) just doesn’t work as well for women as it does for men.
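Here’s a toy illustration of that mechanism (not Google’s system, obviously, and the “features” are synthetic numbers I made up rather than anything acoustic): train a classifier mostly on examples from one group, and it does noticeably worse on the other group, even though nothing about the second group is intrinsically harder.

```python
# A toy demo of training-set imbalance, with made-up synthetic "speech
# features"; nothing here is a real recognizer or real acoustic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(n_per_class, shift):
    """Two word classes; the whole feature space is shifted for this group."""
    X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, 2)) + shift,
                   rng.normal(1.5, 1.0, (n_per_class, 2)) + shift])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

# Training set: 90% "group A" speakers, 10% "group B" speakers
X_a, y_a = make_group(450, shift=0.0)
X_b, y_b = make_group(50, shift=2.0)
clf = LogisticRegression(max_iter=1000).fit(np.vstack([X_a, X_b]),
                                            np.concatenate([y_a, y_b]))

# Test on fresh, balanced data from each group separately
for name, shift in [("group A", 0.0), ("group B", 2.0)]:
    X_test, y_test = make_group(500, shift)
    print(name, "accuracy:", round(clf.score(X_test, y_test), 2))
```

Flip the proportions and the gap flips with them, which is the whole point: the model is only ever as balanced as the data it learned from.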

The Acoustic Theory of Speech Perception

So, quick review: understanding speech is hard to model, and the first model we discussed, motor theory, while it does address some problems, leaves something to be desired. The big one is that it doesn’t treat the acoustic speech signal as the main fodder for perception. And that strikes me as odd. I mean, we’re really used to thinking about hearing speech as an audio-only thing. Telephones and radios work perfectly well, after all, and the information you’re getting there is completely auditory. That’s not to say that we don’t use visual, or, heck, even tactile data in speech perception. The McGurk effect, where audio of someone saying “ba” dubbed over video of someone saying “ga” is perceived as “da” or “tha”, is strong evidence that we can and do use our eyes during speech perception. And there’s even evidence that a puff of air on the skin will change our perception of speech sounds. But we seem to be able to get along perfectly well without these extra sensory inputs, relying on acoustic data alone.

[Image: sound as a physical manifestation]

This theory sounds good to me. Sorry, I’ll stop.

Ok, so… how do we extract information from acoustic data? Well, like I’ve said a couple of times before, it’s actually a pretty complex problem. There’s no such thing as “invariance” in the speech signal, and that makes speech recognition monumentally hard. We tend not to think about it because humans are really, really good at figuring out what people are saying, but it’s very, very complex.

You can think about it like this: imagine that you’re looking for information online about platypuses. Except, for some reason, there is no standard spelling of platypus. People spell it “platipus”, “pladdypuss”, “plaidypus”, “plaeddypus” or any of thirty or forty other variations. Even worse, one person will use many different spellings and may never spell it precisely the same way twice. Now, a search engine that worked like our speech recognition works would not only find every instance of the word platypus, regardless of how it was spelled, but would also recognize that every spelling referred to the same animal. Pretty impressive, huh? Now imagine that every word has a highly variable spelling, oh, and there are no spaces between words: everythingisjustruntogetherlikethisinonelongspeechstream. Still not difficult enough for you? Well, there are also ambiguities. The search algorithm would need to treat “pladypuss” (in the sense of a plaid-patterned cat) and “palattypus” (in the sense of the venomous monotreme) as separate things. Ok, ok, you’re right, it still seems pretty solvable. So let’s add the stipulation that the program needs to be self-training and have an accuracy rate that’s incredibly close to 100%. If you can build a program to these specifications, congratulations: you’ve just revolutionized speech recognition technology. But we already have a working example of a system that looks a heck of a lot like this: the human brain.
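For a rough feel of what that kind of fuzzy matching looks like, here’s a minimal sketch using Python’s difflib and the misspellings from the paragraph above. A real recognizer is doing something vastly more sophisticated with acoustics, but the “map all the variants onto one label” problem is the same shape.

```python
# A rough analogy only: string similarity standing in for acoustic similarity.
from difflib import SequenceMatcher

canonical = "platypus"
variants = ["platipus", "pladdypuss", "plaidypus", "plaeddypus",
            "pladypuss", "palattypus"]

for spelling in variants:
    similarity = SequenceMatcher(None, spelling, canonical).ratio()
    print(f"{spelling:11} -> {canonical}? similarity = {similarity:.2f}")
```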

So how does the brain deal with the “different spellings” when we say words? Well, it turns out that there are certain parts of a word that are pretty static, even if a lot of other things move around. It’s like a superhero reboot: Spiderman is still going to be Peter Parker and get bitten by a spider at some point and then get all moody and whine for a while. A lot of other things might change, but if you’re only looking for those criteria to figure out whether or not you’re reading a Spiderman comic you have a pretty good chance of getting it right. Those parts that are relatively stable and easy to look for we call “cues”. Since they’re cues in the acoustic signal, we can be even more specific and call them “acoustic cues”.

If you think of words (or maybe sounds, it’s a point of some contention) as being made up of certain cues, then it’s basically like a list of things a house-buyer is looking for in a house. If a house has all, or at least most, of the things they’re looking for, then it’s probably the right house and they’ll select that one. In the same way, having a lot of cues pointing towards a specific word makes it really likely that that word is going to be selected. When I say “selected”, I mean that the brain will connect the acoustic signal it just heard to the knowledge you have about a specific thing or concept in your head. We can think of a “word” as both this knowledge and the acoustic representation. So in the “platypuss” example above, all the spellings started with “p” and had an “l” no more than one letter away. That looks like a pretty robust cue. And all of the words had a second “p” in them and ended with one or two tokens of “s”. So that also looks like a pretty robust cue. Add to that the fact that all the spellings had at least one of either a “d” or “t” in between the first and second “p” and you have a pretty strong template that would help you to correctly identify all those spellings as being the same word.
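Just to make the “cue template” idea concrete, here’s a minimal sketch that writes the cues above down as a (completely toy) regular expression: a “p” with an “l” at most one letter away, a “d” or “t” somewhere before a second “p”, and one or two final “s”es. This is not how real spell-matching or speech recognition works; it’s just the list-of-cues logic in code.

```python
# A toy "cue template": the cues from the paragraph above as a regex.
import re

cue_template = re.compile(r"^p[a-z]?l[a-z]*[dt][a-z]*p[a-z]*s{1,2}$")

spellings = ["platypus", "platipus", "pladdypuss", "plaidypus",
             "plaeddypus", "pladypuss", "palattypus"]
for s in spellings:
    print(s, bool(cue_template.match(s)))     # all True: same cues, same word

print("pangolin", bool(cue_template.match("pangolin")))  # False: cues missing
```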

Which all seems well and good and fits pretty well with our intuitions (or mine, at any rate). But that leaves us with a bit of a problem: those pesky parts of motor theory that are really strongly experimentally supported. And this model works just as well for motor theory, too: just replace the “letters” with specific gestures rather than acoustic cues. There seems to be more to the story than either the acoustic model or the motor theory model can offer us, though both have led to useful insights.

Why is studying linguistics useful? *Is* studying linguistics useful?

So I recently gave a talk at the University of Washington Scholar’s Studio. In it, I covered a couple of things that I’ve already talked about here on my blog: the fact that, acoustically speaking, there’s no such thing as a “word” and that our ears can trick us. My general point was that a lot of our intuitions about speech, things that seem completely obvious, actually aren’t true at all from an acoustic perspective.

What really got to me, though, was that after I’d finished my talk (and it was super fast, too, only five minutes) someone asked why it mattered. Why should we care that our intuitions don’t match reality? We can still communicate perfectly well. How is linguistics useful, they asked. Why should they care?

I’m sorry, what was it you plan to spend your life studying again? I know you told me last week, but for some reason all I remember you saying is “Blah, blah, giant waste of time.”

It was a good question, and I’m really bummed I didn’t have time to answer it. I sometimes forget, as I’m wading through the hip-deep piles of readings that I need to get to, that it’s not immediately obvious to other people why what I do is important. And it is! If I didn’t believe that, I wouldn’t be in grad school. (It’s certainly not the glamorous easy living and fat salary that keep me here.) It’s important in two main ways: way one is that it enhances our knowledge, and way two is that it helps people.

 Increasing our knowledge. Ok, so, a lot of our intuitions are wrong. So what? So a lot of things! If we’re perceiving things that aren’t really there, or not perceiving things that are really there, something weird and interesting is going on. We’re really used to thinking of ourselves as pretty unbiased in our observations. Sure, we can’t hear all the sounds that are made, but we’ve built sensors for that, right? But it’s even more pervasive than that. We only perceive the things that our bodies and sensory organs and brains can perceive, and we really don’t know how all these biological filters work. Well, okay, we do know some things (lots and lots of things about ears, in particular) but there’s a whole lot that we still have left to learn. The list of unanswered questions in linguistics is a little daunting, even just in the sub-sub-field of perceptual phonetics.

Every single one of us uses language every single day. And we know embarrassingly little about how it works. And what we do know is often hard to share with people who have little background in linguistics. Even here, in my blog, without time constraints and with an audience that’s already pretty interested (you guys are awesome!), I often have to gloss over interesting things. Not because I don’t think you’ll understand them, but because I’d metaphorically have to grow a tree, chop it down and spend hours carving it just to make a little step stool so you can get the high-level concept off the shelf and, seriously, who has time for that? Sometimes I really envy scientists in the major disciplines, because everyone already knows the basics of what they study. Imagine that you’re a geneticist, but before you can tell people you look at DNA, you have to convince them that sexual reproduction exists. I dream of the day when every graduating high school senior will know IPA. (That’s the International Phonetic Alphabet, not the beer.)

Okay, off the soapbox.

Helping people. Linguistics has lots and lots and lots of applications. (I’m just going to talk about my little sub-field here, so know that there’s a lot of stuff being left unsaid.) The biggest problem is that so few people know that linguistics is a thing. We can and want to help!

  • Foreign language teaching. (AKA applied linguistics) This one is a particular pet peeve of mine. How many of you have taken a foreign language class and had the instructor tell you something about a sound in the language, like: “It’s between a “k” and a “g” but more like the “k” except different.” That crap is not helpful. Particularly if the instructor is a native speaker of the language, they’ll often just keep telling you that you’re doing it wrong without offering a concrete way to make it correctly. Fun fact: There is an entire field dedicated to accurately describing the sounds of the world’s languages. One good class on phonetics and suddenly you have a concrete description of what you’re supposed to be doing with your mouth and the tools to tell when you’re doing it wrong. On the plus side, a lot of language teachers are starting to incorporate linguistics into their curricula, with good results.
  • Speech recognition and speech synthesis. So this is an area that’s a little more difficult. Most people working on these sorts of projects right now are computational people and not linguists. There is a growing community of people who do both (UW offers a master’s degree in computational linguistics that feeds lots of smart people into Seattle companies like Microsoft and Amazon, for example), but there’s definite room for improvement. The main tension is that using linguistic models instead of statistical ones (though some linguistic models are statistical) hugely increases the need for processing power. The benefit is that accuracy tends to increase. I hope that, as processing power continues to get easier and cheaper to access, more linguistics research will be incorporated into these applications. Fun fact: In computer speech recognition, an 80% comprehension accuracy rate in conversational speech is considered acceptable (there’s a quick sketch of how that sort of accuracy is usually scored just after this list). In humans, that’s grounds to test for hearing or brain damage.
  • Speech pathology. This is a great field and has made and continues to make extensive use of linguistic research. Speech pathologists help people with speech disorders overcome them, and the majority of speech pathologists have an undergraduate degree in linguistics and a master’s in speech pathology. Plus, it’s a fast-growing career field with a good outlook. Seriously, speech pathology is awesome. Fun fact: Almost half of all speech pathologists work in school environments, helping kids with speech disorders. That’s like the antithesis of a mad scientist, right there.
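Since I keep throwing accuracy rates around: here’s a minimal sketch of how speech recognition accuracy is usually scored, as word error rate, i.e. the word-level edit distance between what was said and what the recognizer produced. The reference and hypothesis sentences below are made up for illustration.

```python
# A minimal word error rate (WER) scorer; the example sentences are made up.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance, counted over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("we the people of the united states",
          "we thy people o the united states"))  # 2 errors / 7 words ≈ 0.29
```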

And that’s why you should care. Linguistics helps us learn about ourselves and help people, and what else could you ask for in a scientific discipline? (Okay, maybe explosions and mutant sharks, but do those things really help humanity?)

Indiscreet words, Part II: Son of Sounds

Ok, so in my last post about how the speech stream is far from discrete, I talked about how difficult it is to pick apart words. But I didn’t really talk that much about phonemes, and since I promised you phonetics and phonology and phun, I thought I should cover that. Besides, it’s super interesting.

It’s not just that language is continuous; it’s that speech that has been chopped up into discrete pieces is actually shockingly hard to understand. I ran across this YouTube video a while back that’s a great example of this phenomenon.

What the balls of yarn is he saying? It’s actually the preamble to the Constitution, but it took me well over half the video to pick up on it, and I spend a dumb amount of time listening to phonemes in isolation.

You probably find this troubling on some level. After all, you’re a literate person, and as a literate person you’re really, really used to thinking about words as being easy to break down into “letter sounds”. If you’ve ever tried to fiddle around with learning Mandarin or Cantonese, you know just how table-flippingly frustrating it is to memorize a writing system where the graphemes (the smallest units of writing, just as a morpheme is the smallest unit of meaning, a phoneme is the smallest unit of sound and a dormeme is the smallest amount of space you can legally house a person in) bear little relation to the series of sounds they represent.

Fun fact: It’s actually pretty easy to learn to speak Mandarin or Cantonese once you get past the tones. They’re syntactically a lot like English, don’t have a lot of fussy agreement markers or grammatical gender and have a pretty small core vocabulary. It’s the characters that will make you tear your hair out.

Hm. Well, it kinda looks like me sitting on a chair hunched over my laptop while wearing a little hat and ARGH WHAT AM I DOING THAT LOOKS NOTHING LIKE A BIRD.

But. Um. Sorry, got a little off track there. Point was, you’re really used to thinking about words as being further segmented. Like oranges. Each orange is an individual, and then there are neat little segments inside the orange so you don’t get your hands sticky. And, because you’re already familiar with the spelling system of your language (which is, let’s face it, probably English), you probably have a fond idea that it’s pretty easy to divide words that way. But it’s not. If it were, things like instantaneous computational voice-to-voice translation would be common.

It’s hard because the edges of our sounds blur together like your aunt’s watercolor painting that you accidentally spilled lemonade on. So let’s say you’re saying “round”. Well, for the “n” you’re going to open up your nasal passages and put your tongue against the little ridge right behind your teeth. But wait! That’s where your tongue needs to be to make the “d” sound! To make it super clear, you should close your nasal passages back off before you flick your tongue down and release that little packet of air that you were storing behind it. You’re totally not going to, though. I mean, your tongue’s already where you need it to be; why would you take the extra time to make sure your nasal passages are fully closed before releasing the “d”? That’s just a waste of time. And if you did it, you’d sound weird. So the “d” gets some of that nasally goodness and neither you nor your listener gives a flying Fluco.

But, if you’re a computer who’s been told, “If it’s got this nasal sound, it’s an ‘n'”, then you’re going to be super confused. Maybe you’ll be all like, “Um, ok. It kinda sounds like an ‘n’, but then it’s got that little pop of air coming out that I’ve been told to look for with the ‘p’, ‘b’, ‘t’, ‘d’, ‘k’, ‘g’ set… so… let’s go with ‘rounp’. That’s a word, right?” Obviously, this is a vast over-simplification, but you get my point: computers are easily confused by the smearing around of sounds in words. They’re getting better, but humans are still the best.
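To make the confusion concrete, here’s a deliberately silly toy version of that rule-following computer. The features and rules are made up for illustration; real systems use probabilistic acoustic models, not if-statements, but the conflicting-cues problem is the same.

```python
# A deliberately silly rule-based "classifier"; the features are made up.
def naive_rules(features):
    guesses = []
    if features["nasal_airflow"]:
        guesses.append("n")              # rule: nasality means "n"
    if features["release_burst"]:
        guesses.append("p/b/t/d/k/g")    # rule: a burst means an oral stop
    return guesses

# The end of "round": the "d" has picked up nasality from the preceding "n",
# so both cues are present at once and the rules disagree
print(naive_rules({"nasal_airflow": True, "release_burst": True}))
# ['n', 'p/b/t/d/k/g']  ->  hence guesses like "rounp"
```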

So just remember: when you’re around the robot overlords, be sure to run your phonemes together as much as possible. It might confuse them enough for you to have time to run away.