I recently read a very interesting article on the design aspects of choosing a wake word, the word you use to turn on a voice-activated system. In Star Trek it’s “Computer”, but these days two of the more popular ones are “Alexa” and “OK Google”. The article’s author was a designer and noted that she found “Ok Google” or “Hey Google” to be more pleasant to use than “Alexa”. As I was reading the comments (I know, I know) I noticed that a lot of the people who strongly protested that they preferred “Alexa” had usernames or avatars that I would associate with male users. It struck me that there might be an underlying social pattern here.
So, being the type of nerd I am, I whipped up a quick little survey to look at the interaction between user gender and their preference for wake words. The survey only had two questions:
What is your gender?
If Google Home and the Echo offered identical performance in all ways except for the wake word (the word or phrase you use to wake the device and begin talking to it), which wake word would you prefer?
“Alexa”
“Ok Google” or “Hey Google”
I included only those options because those are the defaults–I am aware you can choose to change the Echo’s wake word. (And probably should, given recent events.) 67 people responded to my survey. (If you were one of them, thanks!)
So what were the results? They were actually pretty strongly in line with my initial observations: as a group, only men preferred “Alexa” to “Ok Google”. Furthermore, men’s preference for “Alexa” was far weaker than other groups’ preference for “Ok Google”. Women preferred “Ok Google” at a rate of almost two-to-one, and no people of other genders preferred “Alexa”.
I did have a bit of a skewed sample, with more women than men and people of other genders, but the differences between genders were robust enough to be statistically significant (χ²(2, N = 67) = 7.25, p = 0.02).
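If you want to sanity-check a statistic like that without firing up R, here’s a minimal sketch in Python. It leans on the fact that, with exactly two degrees of freedom, the upper-tail probability of a chi-squared statistic has the closed form exp(-x/2). The function name is mine and isn’t from the original analysis, and the raw survey counts aren’t reproduced here, so this only checks the reported test statistic.

```python
import math


def chi2_p_value_df2(statistic: float) -> float:
    """Upper-tail p-value for a chi-squared statistic with exactly 2 degrees of freedom."""
    # For df = 2 the survival function simplifies to exp(-x / 2).
    # This shortcut does NOT hold for other degrees of freedom.
    return math.exp(-statistic / 2)


p = chi2_p_value_df2(7.25)
print(f"p = {p:.3f}")
```

The value comes out between 0.02 and 0.03, comfortably under the usual 0.05 cut-off.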
So what’s the take-away? Well, for one, Johna Paolino (the author of the original article) is by no means alone in her preference for a non-gendered wake word. More broadly, I think that, like the Clippy debacle, this is excellent evidence that users’ gender strongly affects how they interact with virtual agents. If you’re working to create virtual agents, it’s important to consider all types of users or you might end up creating something that rubs more than half of your potential customers the wrong way.
As someone who’s studied sign language, my immediate thought was “Of course there’s a directionality to emoji: they encode the spatial relationships of the scene.” This is just fancy linguist talk for: “if there’s a dog eating a hot-dog, and the dog is on the right, you’re going to use 🌭🐕, not 🐕🌭.” But the more I thought about it, the more I began to think that maybe it would be better not to rely on my intuitions in this case. First, because I know American Sign Language and that might be influencing me and, second, because I am pretty gosh-darn dyslexic and I can’t promise that my really excellent ability to flip adjacent characters doesn’t extend to emoji.
So, like any good behavioral scientist, I ran a little experiment. I wanted to know two things.
Does an emoji description of a scene show the way that things are positioned in that scene?
Does the order of emojis tend to be the same as the ordering of those same concepts in an equivalent sentence?
As it turned out, the answers to these questions are actually fairly intertwined, and related to a third thing I hadn’t actually considered while I was putting together my stimuli (but probably should have): whether there was an agent-patient relationship in the photo.
Agent: The entity in a sentence that’s effecting a change, the “doer” of the action.
The dog ate the hot-dog.
The raccoons pushed over all the trash-bins.
Patient: The entity that’s being changed, the “receiver” of the action.
The dog ate the hot-dog.
The raccoons pushed over all the trash-bins.
To get data, I showed people three pictures and asked them to “pick the emoji sequence that best describes the scene” and then gave them two options that used different orders of the same emoji. Then, once they were done with the emoji part, I asked them to “please type a short sentence to describe each scene”. For all the language data, I just went through and quickly coded the order in which the concepts encoded in the emoji showed up in each sentence.
“The dog ate a hot-dog” -> dog hot-dog
“The hot-dog was eaten by the dog” -> hot-dog dog
“A dog eating” -> dog
“The hot-dog was completely devoured” -> hot-dog
So this gave me two parallel data sets: one with emojis and one with language data.
All together, 133 people filled out the emoji half and 127 people did the whole thing, mostly in English (I had one person respond in Spanish and I went ahead and included it). I have absolutely no demographics on my participants, and that’s by design; since I didn’t go through the Institutional Review Board it would actually be unethical for me to collect data about people themselves rather than just general information on language use. (If you want to get into the nitty-gritty this is a really good discussion of different types of on-line research.)
Picture one – A man counting money
I picked this photo as sort of a sanity-check: there’s no obvious right-to-left ordering of the man and the money, and there’s one pretty clear way of describing what’s going on in this scene. There’s an agent (the man) and a patient (the money), and since we tend to describe things as agent first, patient second I expected people to pretty much all do the same thing with this picture. (Side note: I know I’ve read a paper about the cross-linguistic tendency for syntactic structures where the agent comes first, but I can’t find it and I don’t remember who it’s by. Please let me know if you’ve got an idea what it could be in the comments–it’s driving me nuts!)
And they did! Pretty much everyone described this picture by putting the man before the money, both with emoji and words. This tells us that, when there’s no information about orientation you need to encode (e.g. what’s on the right or left), people do tend to use emoji in the same order as they would the equivalent words.
Picture two – A man walking by a castle
But now things get a little more complex. What if there isn’t a strong agent-patient relationship and there is a strong orientation in the photo? Here, a man in a red shirt is walking by a castle, but he shows up on the right side of the photo. Will people be more likely to describe this scene with emoji in a way that encodes the relationship of the objects in the photo?
I found that they were–almost four out of five participants described this scene by using the emoji sequence “castle man”, rather than “man castle”. This is particularly striking because, in the sentence writing part of the experiment, most people (over 56%) wrote a sentence where “man/dude/person etc.” showed up before “castle/mansion/chateau etc.”.
So while people can use emoji to encode syntax, they’re also using them to encode spatial information about the scene.
Picture three – A man photographing a model
Ok, so let’s add a third layer of complexity: what about when spatial information and the syntactic agent/patient relationships are pointing in opposite directions? For the scene above, if you’re encoding the spatial information then you should use an emoji ordering like “woman camera man”, but if you’re encoding an agent-patient relationship then, as we saw in the picture of the man counting money, you’ll probably want to put the agent first: “man camera woman”.
(I leave it open for discussion whether the camera emoji here is representing a physical camera or a verb like “photograph”.)
So people were a little more divided here. It wasn’t quite a 50-50 split, but it really does look like you can go either way with this one. The thing that jumped out at me, though, was how the word order and emoji order pattern together: if your sentence is something like “A man photographs a model”, then you are far more likely to use the “man camera woman” emoji ordering. On the other hand, if your sentence is something like “A woman being photographed by the sea” or “Photoshoot by the water”, then it’s more likely that your emoji ordering described the physical relation of the scene.
So what’s the big takeaway here? Well, one thing is that emoji don’t really have a fixed syntax in the same way language does. If they did, I’d expect that there would be a lot more agreement between people about the right way to represent a scene with emoji. There was a lot of variation.
On the other hand, emoji ordering isn’t just random either. It is encoding information, either about the syntactic/semantic relationship of the concepts or their physical location in space. The problem is that you really don’t have a way of knowing which one is which.
Edit 12/16/2016: The dataset and the R script I used to analyze it are now available on Github.
So a friend of mine who’s a reference librarian (and has a gaming YouTube channel you should check out) recently got an interesting question: how loud would a million dogs barking be?
This is an interesting question because it gets at some interesting properties of how sound works, in particular the decibel scale.
So, first off, we need to establish our baseline. The loudest recorded dog bark clocked in at 113.1 dB, and was produced by a golden retriever named Charlie. (Interestingly, the loudest recorded human scream was 129 dB, so it looks like Charlie’s got some training to do to catch up!) That’s louder than a chain saw, and loud enough to cause hearing damage if you heard it constantly.
Now, let’s scale our problem down a bit and figure out how loud it would be if ten Charlies barked together. (I’m going to use copies of Charlie and assume they’ll bark in phase because it makes the math simpler.) One Charlie is 113 dB, so your first instinct may be to multiply that by ten and end up with 1130 dB. Unfortunately, if you took this approach you’d be (if you’ll excuse the expression) barking up the wrong tree. Why? Because the dB scale is logarithmic. This means that 1130 dB is absolutely ridiculously loud. For reference, under normal conditions the loudest possible sound (on Earth) is 194 dB. A sound of 1000 dB would be loud enough to create a black hole larger than the galaxy. We wouldn’t be able to get a bark that loud even if we covered every inch of earth with clones of champion barker Charlie.
Ok, so we know what one wrong approach is, but what’s the right one? Well, we have our base bark at 113 dB. If we want a bark that is one million times as powerful (assuming that we can get a million dogs to bark as one) then we need to take the base ten log of one million and multiply it by ten (that’s the deci part of decibel). (If you want more math try this site.) The base ten log of one million is six, so times ten that’s sixty decibels. But it’s sixty decibels louder than our original sound of 113dB, for a grand total of 173dB.
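If you’d rather let a computer do the arithmetic, here’s a minimal sketch of that calculation in Python. The function name is made up for illustration, and it assumes (like the calculation above) that every dog is an identical copy of Charlie and that a million dogs means a million times the power, so the increase is ten times the base ten log of the number of dogs.

```python
import math


def combined_level_db(single_source_db: float, n_sources: int) -> float:
    """Sound level of n identical sources, assuming total power scales with n."""
    # 10 * log10(n) is the increase in decibels for n times the power.
    return single_source_db + 10 * math.log10(n_sources)


for n_dogs in (1, 10, 1_000_000):
    level = combined_level_db(113, n_dogs)
    print(f"{n_dogs:>9,} dogs: {level:.0f} dB")
```

Ten Charlies only add ten decibels to the original 113 dB, and each further tenfold increase in dogs adds another ten.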
Now, to put this in perspective, that’s still pretty durn loud. That’s loud enough to cause hearing loss in our puppies and everyone in hearing distance. We’re talking about the loudness of a cannon, or a rocket launch from 100 meters away. So, yes, very loud, but not quite “destroying the galaxy” loud.
A final note: since the current world record for loudest barking group of dogs is a more modest 124 dB from a group of just 76 dogs, if you could get a million dogs to bark in unison you’d definitely set a new world record! But, considering that you’d end up hurting the dogs’ hearing (and having to scoop all that poop) I’m afraid I really can’t recommend it.
This post is pretty special: it’s the 100th post I’ve made since starting my blog! It’s hard to believe I’ve been doing this so long. I started blogging in 2012, in my final year of undergrad, and now I’m heading into my last year of my PhD. Crazy how fast time flies.
Ok, back on topic. As I was looking back over everything I’ve written, it struck me that 99 posts worth of text on a very specific subject domain (linguistics) in a very specific register (informal) should be enough text to train a simple text generator.
So how did I go about building a blog bot? It was pretty easy! All I needed was:
67,000 words of text (all blog posts before this one)
1 R script to tidy up the text
1 Python script to train a Markov Chain text generator
A Markov Whatnow?
A Markov Chain is a type of simple (but surprisingly powerful) statistical model that tells you, given the item you’re currently on, what item you’re likely to see next. Today we’re going to apply it to whole words in a text.
How does it work? Basically, for each word in your text, you count how many different words occur after it and how many times each shows up, and then figure out the probability of each transition. So if your text is “The dog ate the apple.”, then there’s a 50% chance that “the” will be followed by “apple”, but a 100% chance that “apple” will be followed by “.”. You can then use these probabilities to generate new sentences, like so:
Input: The dog ate the apple.
The dog ate the apple.
The dog ate the dog ate the apple.
The dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the apple. (This may seem silly, but remember that we’re only looking at two words at a time. This model doesn’t “remember” what came earlier or “know” that it’s repeating itself.)
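To make the counting-and-sampling idea concrete, here’s a toy word-level Markov chain in plain Python. This is a from-scratch sketch for illustration, not the PyMarkovChain code used for the actual blog bot; all the function names are hypothetical, and it assumes the text is lowercased and that the final period counts as its own “word”.

```python
import random
from collections import defaultdict


def tokenize(text):
    # Lowercase everything and split the period off as its own token.
    return text.lower().replace(".", " .").split()


def train(tokens):
    # Map each word to the list of words seen right after it.
    # Repeats are kept, so frequent followers get picked more often.
    transitions = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        transitions[current].append(following)
    return transitions


def generate(transitions, start="the", end=".", max_words=50, rng=random):
    # Walk the chain one word at a time, looking only at the current word.
    words = [start]
    while words[-1] != end and len(words) < max_words:
        followers = transitions.get(words[-1])
        if not followers:
            break  # dead end: this word was never followed by anything
        words.append(rng.choice(followers))
    return " ".join(words)


model = train(tokenize("The dog ate the apple."))
print(dict(model))
print(generate(model))
```

Since “the” is followed by “dog” once and “apple” once, each run flips a fair coin every time it hits “the”, which is exactly how you end up with those long “the dog ate the dog ate…” loops. The max_words cap is just there to stop an unlucky run from going on forever.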
OK, now for the actual training of the model. If you want to play around with this yourself, all my code and text is up on GitHub here.
First, grab your text. I downloaded all my blog posts as an XML file and used a quick little R script to snag just the text I was interested in.
Next, train your model. I used PyMarkovChain, but you can always write your own if you’re feeling it.
Finally, use the model to generate text.
Alright, now I’ve got an (admittedly very, very dumb) little program that kinda knows what I write like when I’m blogging. How well does it do at pretending to be me?
I’m going to break eye contact, look down at your own personalized ASR system
Well that’s… vaguely disturbing. But also not really very human-sounding, and certainly not something I’d write in a blog post. Let’s give it a couple more tries and see if it does any better.
But, if frosting has to have a career where you learned it from Clarice
We only perceive the things that our bodies and sensory organs and brains can perceive, and we really don’t care about to tell the sounds in the second part, I was expecting to see whether something is grammatical is to put up a bunch of historical stuff that we’ve all heard about the structure in the arts and sciences were depressed
(Which is really interesting–I’d love to learn your language doesn’t make you very, very good a recognizing speech
Believe me, I know what you can uncontract them and what’s the take-away
People with short face syndrome–yes, that’s a classical description of what a language does, along with parallels drawn to another, related, languages
Short answer: they’re all correct
And those speakers are aware of
The Job Market for Linguistics PhDsWhat do you much
Hmm, not so great. The only output that sounded good to me was “Short answer: they’re all correct”. And that was just the exact same phrasing I used in this blog post. No wonder it sounds just like me; it is me!
So it looks like I won’t be handing the reins over to Making Noise and Hearing Things bot any time soon. True, I could have used a fancier tool, like a Recurrent Neural Network. But to be perfectly honest, I have yet to see any text generation system that can produce anything even close to approximating a human-written blog post. Will we get there? Maybe. But natural language generation, especially outside of really formulaic things like weather or sports reporting, is a super hard problem. Heck, we still haven’t gotten to the point where computers can reliably solve third-grade math word problems.
The very complexities that make language so useful (and interesting to study) also make it so hard to model. Which is good news for me! It means there’s still plenty of work to do in language modelling and blogging.
So I recently had a pretty disconcerting experience. It turns out that almost no one else has heard of a word that I thought was pretty common. And when I say “no one” I’m including dialectologists; it’s unattested in the Oxford English Dictionary and the Dictionary of American Regional English. Out of the twenty two people who responded to my Twitter poll (which was probably mostly other linguists, given my social networks) only one other person said they’d even heard the word and, as I later confirmed, it turned out to be one of my college friends.
So what is this mysterious word that has so far evaded academic inquiry? Ladies, gentlemen and all others, please allow me to introduce you to…
The word means something like “fool” or “incompetent person”. To prove that this is actually a real word that people other than me use, I’ve (very, very laboriously) found some examples from the internet. It shows up in the comments section of this news article:
THAT is why people are voting for Mr Trump, even if he does act sometimes like a Bumpus.
I also found it in a smattering of public tweets like this one:
If you ever meet my dad, please ask him what a “bumpus” is
A raucous, boisterous person or thing (usually african-american.)
I’m a little sceptical about the last one, though. Partly because it doesn’t line up with my own intuitions (I feel like a bumpus is more likely to be silent than rowdy) and partly because less popular Urban Dictionary entries, especially for words that are also names, are super unreliable.
I also wrote to my parents (Hi mom! Hi dad!) and asked them if they’d used the word growing up, in what contexts, and who they’d learned it from. My dad confirmed that he’d heard it growing up (mom hadn’t) and had a suggestion for where it might have come from:
I am pretty sure my dad used it – invariably in one of the two phrases [“don’t be a bumpus” or “don’t stand there like a bumpus”]…. Bumpass, Virginia is in Louisa County …. Growing up in Norfolk, it could have held connotations of really rural Virginia, maybe, for Dad.
While this is definitely a possibility, I don’t know that it’s definitely the origin of the word. Bumpass, Virginia, like Bumpass Hell (see this review, which also includes the phrase “Don’t be a bumpass”), was named for an early settler. Interestingly, the college friend mentioned earlier is also from the Tidewater region of Virginia, which leads me to think that the word may have originated there.
My mom offered some other possible origins, that the term might be related to “country bumpkin” or “bump on a log”. I think the latter is especially interesting, given that “bump on a log” and “bumpus” show up in exactly the same phrase: standing/sitting there like a _______.
She also suggested it might be related to “bumpkis” or “bupkis”. This is a possibility, especially since that word is definitely from Yiddish and Norfolk, VA does have a history of Jewish settlement and Yiddish speakers.
A usage of “Bumpus” which seems to be the most common is in phrases like “Bumpus dog” or “Bumpus hound”. I think that this is probably actually a different use, though, and a direct reference to a scene from the movie A Christmas Story:
One final note is that there was a baseball pitcher in the late 1890’s who went by the nickname “Bumpus”: Bumpus Jones. While I can’t find any information about where the nickname came from, this post suggests that his family was from Virginia and that he had Powhatan ancestry.
I’m really interested in learning more about this word and its distribution. My intuition is that it’s mainly used by older, white speakers in the South, possibly centered around the Tidewater region of Virginia.
If you’ve heard of or used this word, please leave a comment or drop me a line letting me know 1) roughly how old you are, 2) where you grew up and 3) (if you can remember) where you learned it. Feel free to add any other information you feel might be relevant, too!
Is there a voice recognition product that is focusing on women’s voices or allows for configuring for women’s voices (or the characteristics of women’s voices)?
I don’t know of any ASR systems specifically designed for women. But the answer to the second half of your question is yes!
There are two main types of automatic speech recognition, or ASR, systems. The first is speaker independent. These are systems, like YouTube automatic captions or Apple’s Siri, that should work equally well across a large number of different speakers. Of course, as many other researchers have found and I corroborated in my own investigation, that’s not always the case. A major reason for this is socially-motivated variation between speakers. This is something we all know as language users. You can guess (with varying degrees of accuracy) a lot about someone from just their voice: their sex, whether they’re young or old, where they grew up, how educated they are, how formal or casual they’re being.
So what does this mean for speech recognition? Well, while different speakers speak in a lot of different ways, individual speakers tend to use less variation. (With the exception of bidialectal speakers, like John Barrowman.) Which brings me nicely to the second type of speech recognition: speaker dependent. These are systems that are designed to work for one specific speaker, and usually to adapt and get more accurate for that speaker over time.
If you read some of my earlier posts, I suggested that the different performance between dialects and genders was due to imbalances in the training data. The nice thing about speaker dependent systems is that the training data is made up of one voice: yours. (Although the system is usually initialized based on some other training set.)
So how can you get a speaker dependent ASR system?
By buying software such as Dragon speech recognition. This is probably the most popular commercial speaker-dependent voice recognition software (or at least the one I hear the most about). It does, however, cost real money.
In theory, the bones of any ASR system should work equally well on any spoken human language. (Sign language recognition is a whole nother kettle of fish.) The difficulty is getting large amounts of (socially stratified) high-quality training data. By feeding a system data without a lot of variation, for example by using only one person’s voice, you can usually get more accurate recognition more quickly.
I got a cool question from Veronica the other day:
Which wavelength someone would use not to hear but feel it on the body as a vibration?
So this would depend on two things. The first is your hearing ability. If you’ve got no or limited hearing, most of your interaction with sound will be tactile. This is one of the reasons why many Deaf individuals enjoy going to concerts; if the sound is loud enough you’ll be able to feel it even if you can’t hear it. I’ve even heard stories about folks who will take balloons to concerts to feel the vibrations better. In this case, it doesn’t really depend on the pitch of the sound (how high or low it is), just the volume.
But let’s assume that you have typical hearing. In that case, the relationship between pitch, volume and whether you can hear or feel a sound is a little more complex. This is due to something called “frequency response”. Basically, the human ear is better tuned to hearing some pitches than others. We’re really sensitive to sounds in the upper ranges of human speech (roughly 2k to 4k Hz). (The lowest pitch in the vocal signal can actually be much lower [down to around 80 Hz for a really low male voice] but it’s less important to be able to hear it because that frequency is also reflected in harmonics up through the entire pitch range of the vocal signal. Most telephones only transmit signals between 300 Hz and 3400 Hz, for example, and it’s only really the cut-off at the upper end of the range that causes problems–like making it hard to tell the difference between “sh” and “s”.)
The takeaway from all this is that we’re not super good at hearing very low sounds. That means they can be very, very loud before we pick up on them. If the sound is low enough and loud enough, then the only way we’ll be able to sense it is by feeling it.
How low is low enough? Most people can’t really hear anything much below 20 Hz (like the lowest note on a really big organ). The older you are and the more you’ve been exposed to really loud noises in that range, like bass-heavy concerts or explosions, the less you’ll be able to pick up on those really low sounds.
What about volume? My guess for what would be “sufficiently loud”, in this case, is 120+ dB. 120 dB is as loud as a rock concert, and it’s possible, although difficult and expensive, to get out of a home speaker set-up. If you have a neighbor listening to really bass-y music or watching action movies with a lot of low, booming sound effects on really expensive speakers, it’s perfectly possible that you’d feel those vibrations rather than hearing them. Especially if there are walls between the speakers and you. While mid and high frequency sounds are pretty easy to muffle, low-frequency sounds are much more difficult to soundproof against.
Are there any health risks? The effects of exposure to these types of low-frequency noise are actually something of an active research question. (You may have heard about the “brown note”, for example.) You can find a review of some of that research here. One comforting note: if you are exposed to a very loud sound below the frequencies you can easily hear–even if it’s loud enough to cause permanent damage at much higher frequencies–it’s unlikely that you will suffer any permanent hearing loss. That doesn’t mean you shouldn’t ask your neighbor to turn down the volume, though; for their ears if not for yours!
So after my last blog post went up, a couple people wondered if the difference in classification error rates between men and women might be due to pitch, since men tend to have lower voices. I had no idea, so, being experimentally inclined, I decided to find out.
First, I found the longest list of words that I could from the accent tag. Pretty much every video I looked at used a subset of these words.
Then I recorded myself reading them at a natural pace, with list intonation. In order to better match the speakers in the other Youtube videos, I didn’t go into the lab and break out the good microphones; I just grabbed my gaming headset and used that mic. Then, I used Praat (a free, open source software package for phonetics) to shift the pitch of the whole file up and down 60 Hertz in 20 Hertz intervals. That left me with seven total sound files: the original one, three files that were 20, 40 and 60 Hertz higher and finally three files that were 20, 40 and 60 Hertz lower. You can listen to all the files individually here.
The original recording had a mean of 192 Hz and a median of 183, which means that my voice is slightly lower pitched than average for an American English-speaking woman. For reference, Pepiot 2014 found a mean pitch of 210 Hz for female American English speakers. The same paper also lists a mean pitch of 119 Hz for male American English speakers. This means that my lowest pitch manipulation (mean of 132) is still higher than the average American English-speaking male. I didn’t want to go too much lower with my pitch manipulations, though, because the sound files were starting to sound artifact-y and robotic.
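Here’s a quick sketch in Python of where those numbers land. It assumes that shifting a whole file by a fixed number of Hertz moves its mean pitch by that same amount (which is how the 132 Hz figure above works out); the variable names are just for illustration, and the actual shifting was done in Praat, not with this code.

```python
ORIGINAL_MEAN_HZ = 192          # mean pitch of the unmodified recording
SHIFTS_HZ = range(-60, 61, 20)  # -60, -40, -20, 0, +20, +40, +60
FEMALE_MEAN_HZ = 210            # Pepiot 2014, female American English speakers
MALE_MEAN_HZ = 119              # Pepiot 2014, male American English speakers

# Assumed: a constant shift in Hz moves the mean by exactly that shift.
shifted_means = {shift: ORIGINAL_MEAN_HZ + shift for shift in SHIFTS_HZ}

for shift, mean in shifted_means.items():
    print(
        f"{shift:+4d} Hz shift -> mean {mean} Hz "
        f"({mean - MALE_MEAN_HZ:+d} vs. male mean, "
        f"{mean - FEMALE_MEAN_HZ:+d} vs. female mean)"
    )
```

That gives seven files in total, and even the lowest one sits above the reported male average.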
Why did I do things this way?
Only using one recording. This lets me control 100% for demographic information. I’m the same person, with the same language background, saying the same words in the same way. If I’d picked a bunch of speakers with different pitches, they’d also have different language backgrounds and voices. Plus I’m not getting effects from using different microphones.
Manipulating pitch both up and down. This was for two reasons. First, it means that the original recording isn’t the end-point for the pitch continuum. Second, it means that we can pick apart whether accuracy is a function of pitch or just the file having been manipulated.
You can check out how well the auto-captions did yourself by checking out this video. Make sure to hit the CC button in the lower left-hand corner.
The first thing I noticed was that I had really, really good results with the auto captions. Waaayyyy better than any of the other videos I looked at. There were nine errors across 434 tokens, for a total error rate of only 2%, which I’d call pretty much at ceiling. There was maaayybe a slight effect of the pitch manipulation, with higher pitches having slightly higher error rates, as you can see:
BUT there’s also sort of a u-shaped curve, which suggests to me that the recognizer is doing worse with the files that have been messed with the most. (Although, weirdly, only the file that had its pitch shifted up by 20 Hz had no errors.) I’m going to go ahead and say that I’m not convinced that pitch is a determining factor.
So why were these captions so much better than the ones I looked at in my last post? It could just be that I was talking very slowly and clearly. To check that out, I looked at autocaptions for the most recent video posted by someone who’s fairly similar to me in terms of social and vocal characteristics: a white woman who speaks standardized American English with Southern features. Ideally I’d match for socioeconomic class, education and rural/urban background as well, but those are harder to get information about.
I chose Bunny Meyer, who posts videos as Grav3yardgirl. In this video her speech style is fast and conversational, as you can hear for yourself:
To make sure I had roughly the same amount of data as I had before, I checked the captions for the first 445 words, which was about two minutes worth of video (you can check my work here). There was an overall error rate of approximately 8%, if you count skipped words as errors. Which, considering that recognizing words in fast/connected speech is generally more error-prone, is pretty good. It’s definitely better than in the videos I analyzed for my last post. It’s also a fairly small difference from my careful speech: definitely less than the 13% difference I found for gender.
So it looks like neither the speed of speech nor the pitch is strongly affecting recognition rate (at least for videos captioned recently). There are a couple other things that I think may be going on here that I’m going to keep poking at:
ASR has got better over time. It’s totally possible that more women just did the accent tag challenge earlier, and thus had higher error rates because the speech recognition system was older and less good. I’m going to go back and tag my dataset for date, though, and see if that shakes out some of the gender differences.
Being louder may be important, especially in less clear recordings. I used a head-mounted microphone in a quiet room to make my recordings, and I’m assuming that Bunny uses professional recording equipment. If you’re recording outside or with a device microphone, though, there’s going to be a lot more noise. If your voice is louder, and men’s voices tend to be, it should be easier to understand in noise. My intuition is that, since there are gender differences in how loud people talk, some of the error may be due to intensity differences in noisy recordings. Although an earlier study found no difference in speech recognition rates for men and women in airplane cockpits, which are very noisy, so who knows? Testing that out will have to wait for another day, though.
Edit, July 2020: Hello! This blog post has been cited quite a bit recently so I thought I’d update it with the more recent research. I’m no longer working actively on this topic, but in the last paper I wrote on it, in 2017, I found that when audio quality was controlled the gender effects disappeared. I take this to be evidence that differences in gender are due to differences in overall signal-to-noise ratio when recording in noisy environments rather than problems in the underlying ML models.
That said, bias against specific demographic categories in automatic speech recognition is a problem. In my 2017 study, I found that multiple commercial ASR systems had higher error rates for non-white speakers. More recent research has found the same effect: ASR systems make more errors for Black speakers than white speakers. In my professional opinion, the racial differences are both more important and more difficult to solve.
The original, unedited blog post, continues below.
In my last post, I looked at how Google’s automatic speech recognition worked with different dialects. To get this data, I hand-checked annotations for more than 1500 words from fifty different accent tag videos.
Now, because I’m a sociolinguist and I know that it’s important to stratify your samples, I made sure I had an equal number of male and female speakers for each dialect. And when I compared performance on male and female talkers, I found something deeply disturbing: YouTube’s auto captions consistently performed better on male voices than female voices (t(47) = -2.7, p < 0.01). (You can see my data and analysis here.)
It’s not that there’s a consistent but small effect size, either: 13% is a pretty big effect. The Cohen’s d was 0.7 which means, in non-math-speak, that if you pick a random man and random woman from my sample, there’s an almost 70% chance the transcriptions will be more accurate for the man. That’s pretty striking.
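For anyone who wants to see where the “almost 70%” comes from, here is a minimal sketch of the usual conversion from Cohen’s d to a “probability of superiority.” It assumes both groups’ accuracy scores are roughly normal with equal spread, which is a simplifying assumption and not something checked against the data here. The function name is just an illustrative choice.

```python
import math

def probability_of_superiority(d: float) -> float:
    """Chance that a random member of the higher-scoring group beats a
    random member of the other group, given Cohen's d.

    Assumes two normal distributions with equal variance (an assumption,
    not a property verified on the blog's dataset).
    """
    # The difference between two such draws is normal with mean d and
    # standard deviation sqrt(2) (in pooled-SD units), so P = Phi(d / sqrt(2)).
    z = d / math.sqrt(2)
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

print(round(probability_of_superiority(0.7), 2))  # prints 0.69
```

A d of 0 gives exactly 0.5 (a coin flip), and the probability climbs toward 1 as d grows.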
What it is not, unfortunately, is shocking. There’s a long history of speech recognition technology performing better for men than women:
This is a real problem with real impacts on people’s lives. Sure, a few incorrect YouTube captions aren’t a matter of life and death. But some of these applications have a lot higher stakes. Take the medical dictation software study. The fact that men enjoy better performance than women with these technologies means that it’s harder for women to do their jobs. Even if it only takes a second to correct an error, those seconds add up over the days and weeks to a major time sink, time your male colleagues aren’t wasting messing with technology. And that’s not even touching on the safety implications of voice recognition in cars.
So where is this imbalance coming from? First, let me make one thing clear: the problem is not with how women talk. The suggestion that, for example, “women could be taught to speak louder, and direct their voices towards the microphone” is ridiculous. In fact, women use speech strategies that should make it easier for voice recognition technology to work on women’s voices. Women tend to be more intelligible (for people without high-frequency hearing loss), and to talk slightly more slowly. In general, women also favor more standard forms and make less use of stigmatized variants. Women’s vowels, in particular, lend themselves to classification: women produce longer vowels which are more distinct from each other than men’s are. (Edit 7/28/2016: I have since found two papers by Sharon Goldwater, Dan Jurafsky and Christopher D. Manning where they found better performance for women than men, due to the above factors and different rates of filler words like “um” and “uh”.) One thing that may be making a difference is that women also tend not to be as loud, partly as a function of just being smaller, and cepstrals (the fancy math thing what’s under the hood of most automatic voice recognition) are sensitive to differences in intensity. This all doesn’t mean that women’s voices are more difficult; I’ve trained classifiers on speech data from women and they worked just fine, thank you very much. What it does mean is that women’s voices are different from men’s voices, though, so a system designed around men’s voices just won’t work as well for women’s.
Why does this matter? It matters because there are systematic differences between men’s and women’s speech. (I’m not going to touch on the speech of other genders here, since that’s a very young research area. If you’re interested, the Journal of Language and Sexuality is a good jumping-off point.) And machine learning works by making computers really good at dealing with things they’ve already seen a lot of. If they get a lot of speech from men, they’ll be really good at identifying speech from men. If they don’t get a lot of speech from women, they won’t be that good at identifying speech from women. And it looks like that’s the case. Based on my data from fifty different speakers, Google’s speech recognition (which, if you remember, is probably the best-performing proprietary automatic speech recognition system on the market) just doesn’t work as well for women as it does for men.
If your primary dialect is something other than Standardized American English (that sort of from-the-US-but-not-anywhere-in-particular type of English you hear a lot of on the news) you may have noticed that speech recognition software doesn’t generally work very well for you. You can see the sort of thing I’m talking about in this clip:
This clip is a little old, though (2010). Surely voice recognition technology has improved since then, right? I mean, we’ve got more data and more computing power than ever. Surely somebody’s gotten around to making sure that the current generation of voice-recognition software deals equally well with different dialects of English. Especially given that those self-driving cars that everyone’s so excited about are probably going to use voice-based interfaces.
Data: I picked videos with accents from Maine (U.S.), Georgia (U.S.), California (U.S.), Scotland and New Zealand. I picked these locations because they’re pretty far from each other and also have pretty distinct regional accents. All speakers from the U.S. were (by my best guess) white and all looked to be young-ish. I’m not great at judging age, but I’m pretty confident no one was above fifty or so.
What I did: For each location, I checked the accuracy of the automatic captions on the word-list part of the challenge for five male and five female speakers. So I have data for a total of 50 people across 5 dialect regions. For each word in the word list, I marked it as “correct” if the entire word was correctly captioned on the first try. Anything else was marked wrong. To be fair, the words in the accent tag challenge were specifically chosen because they have a lot of possible variation. On the other hand, they’re single words spoken in isolation, which is pretty much the best case scenario for automatic speech recognition, so I think it balances out.
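The scoring scheme above can be sketched roughly like this. It is an illustration of the bookkeeping, not the actual analysis script. The group labels and the correct/wrong marks below are made up for the example, and the function names are hypothetical.

```python
from collections import defaultdict

def speaker_accuracy(marks):
    """Fraction of word-list items marked correct for one speaker.

    `marks` is a list of booleans: True if the whole word was captioned
    correctly on the first try, False for anything else.
    """
    return sum(marks) / len(marks)

def accuracy_by_group(speakers):
    """Average the per-speaker accuracies within each group label."""
    per_group = defaultdict(list)
    for group, marks in speakers:
        per_group[group].append(speaker_accuracy(marks))
    return {group: sum(accs) / len(accs) for group, accs in per_group.items()}

# Made-up marks, for illustration only (not the real data).
example = [
    ("dialect_a", [True, True, False, True]),
    ("dialect_a", [True, True, True, True]),
    ("dialect_b", [True, False, False, True]),
]
print(accuracy_by_group(example))  # {'dialect_a': 0.875, 'dialect_b': 0.5}
```

Averaging per speaker first, rather than pooling every word, keeps one unusually long or short word list from dominating a group’s score.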
Ok, now the part you’ve all been waiting for: the results. Which dialects fared better and which worse? Does dialect even matter? First the good news: based on my (admittedly pretty small) sample, the effect of dialect is so weak that you’d have to be really generous to call it reliable. A linear model that estimated number of correct classifications based on total number of words, speaker’s gender and speaker’s dialect area fared only slightly better (p = 0.08) than one that didn’t include dialect area. Which is great! No effect means dialect doesn’t matter, right?
Weellll, not really. Based on a power analysis, I really should have sampled forty people from each dialect, not ten. Unfortunately, while I love y’all and also the search for knowledge, I’m not going to hand-annotate two hundred YouTube videos for a side project. (If you’d like to add data, though, feel free to branch the dataset on GitHub here. Just make sure to check the URL for the video you’re looking at so we don’t double dip.)
So while I can’t confidently state there is an effect, based on the fact that I’m sort of starting to get one with only a quarter of the amount of data I should be using, I’m actually pretty sure there is one. No one’s enjoying stellar performance (there’s a reason that they tend to be called AutoCraptions in the Deaf community) but some dialect areas are doing better than others. Look at this chart of accuracy by dialect region:
There’s variation, sure, but in general the recognizer seems to be working best on people from California (which just happens to be where Google is headquartered) and worst on Scottish English. The big surprise for me is how well the recognizer works on New Zealand English, especially compared to Scottish English. It’s not a function of country population (NZ = 4.4 million, Scotland = 5.2 million). My guess is that it might be due to sample bias in the training sets, especially if, say, there were some ’90s TV shows in there; there’s a lot of captioned New Zealand English in Hercules, Xena and related spin-offs. There’s also a Google outreach team in New Zealand, but not Scotland, so that might be a factor as well.
So, unfortunately, it looks like the lift skit may still be current. ASR still works better for some dialects than others. And, keep in mind, these are all native English speakers! I didn’t look at non-native English speakers, but I’m willing to bet the system is also letting them down. Which is a shame. It’s a pity that how well voice recognition works for you is still dependent on where you’re from. Maybe in another six years I’ll be able to write a blog post that says it isn’t.