How well do Google and Microsoft recognize speech across dialect, gender and race?

If you’ve been following my blog for a while, you may remember that last year I found that YouTube’s automatic captions didn’t work as well for some dialects, or for women. The effects I found were pretty robust, but I wanted to replicate them for a couple of reasons:

  • I only looked at one system, YouTube’s automatic captions, and even that was over a period of several years instead of at just one point in time. I controlled for time-of-upload in my statistical models, but it wasn’t the fairest system evaluation.
  • I didn’t control for the audio quality, and since speech recognition is pretty sensitive to things like background noise and microphone quality, that could have had an effect.
  • The only demographic information I had was where someone was from. Given recent results that find that natural language processing tools don’t work as well for African American English, I was especially interested in looking at automatic speech recognition (ASR) accuracy for African American English speakers.

With that in mind, I did a second analysis on both YouTube’s automatic captions and Bing’s speech API (that’s the same tech that’s inside Microsoft’s Cortana, as far as I know).

Speech Data

For this project, I used speech data from the International Dialects of English Archive. It’s a collection of English speech from all over, originally collected to help actors sound more realistic.

I used speech data from four varieties: the South (speakers from Alabama), the Northern Cities (Michigan), California (California) and General American. “General American” is the sort of news-caster style of speech that a lot of people consider unaccented–even though it’s just as much an accent as any of the others! You can hear a sample here.

For each variety, I did an acoustic analysis to make sure that speakers I’d selected actually did use the variety I thought they should, and they all did.


For the YouTube captions, I just uploaded the speech files to YouTube as videos and then downloaded the subtitles. (I would have used the API instead, but when I was doing this analysis there was no Python Google Speech API, even though very thorough documentation had already been released.)

Bing’s speech API was a little more complex. For this one, my co-author built a custom Android application that sent the files to the API and requested a long-form transcript back. For some reason, a lot of our sound files came back as only partial transcriptions. My theory is that there is a running confidence function for the accuracy of the transcription, and once the overall confidence drops below a certain threshold, you get back whatever was transcribed up to there. I don’t know if that’s the case, though, since I don’t have access to their source code. Whatever the reason, the Bing transcriptions were less accurate overall than the YouTube transcriptions, even when we account for the fact that fewer words were returned.


OK, now to the results. Let’s start with dialect area. As you might be able to tell from the graphs below, there were pretty big differences between the two systems we looked at. In general, there was more variation in the word error rate for Bing and overall the error rate tended to be a bit higher (although that could be due to the incomplete transcriptions we mentioned above). YouTube’s captions were generally more accurate and more consistent. That said, both systems had different error rates across dialects, with the lowest average error rates for General American English.
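For readers who haven't met word error rate before: it is usually computed as the word-level edit distance between a reference transcript and the system's output, divided by the number of words in the reference. Here is a minimal sketch of that standard definition. The function name and the lowercase-and-split tokenization are assumptions made for illustration, not necessarily how the scoring for this analysis was done.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase and split on whitespace; real scoring pipelines
    # usually normalize punctuation and spelling more carefully.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("the dog ate the apple", "the dog ate an apple"))
```

Under this plain definition every reference word missing from a truncated transcript counts as a deletion, which is why partial transcriptions need separate handling when comparing systems.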


Differences in Word Error Rate (WER) by dialect were not robust enough to be significant for Bing (under a one-way ANOVA) (F[3, 32] = 1.6, p = 0.21), but they were for YouTube’s automatic captions (F[3, 35] = 3.45, p < 0.05). Both systems had the lowest average WER for General American.
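To unpack the notation: a one-way ANOVA compares how much group means differ from each other against how much individual values vary inside each group. The two numbers in brackets are degrees of freedom, so F[3, 35] means four groups (4 - 1 = 3) and 39 recordings (39 - 4 = 35). The sketch below computes only the F statistic in plain Python. The function name is hypothetical and the WER values are made up for illustration; they are not the study's data.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of each value around its own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up WER values for illustration only; these are NOT the study's data.
toy_wer = {
    "variety_a": [0.10, 0.15, 0.12],
    "variety_b": [0.20, 0.18, 0.25],
    "variety_c": [0.22, 0.30, 0.27],
}
f_stat, df_between, df_within = one_way_anova_f(list(toy_wer.values()))
print(f"F[{df_between}, {df_within}] = {f_stat:.2f}")
```

Getting a p-value additionally requires the F distribution; SciPy's `scipy.stats.f_oneway` returns both the statistic and the p-value.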

Now, let’s turn to gender. If you read my earlier work, you’ll know that I previously found that YouTube’s automatic captions were more accurate for men and less accurate for women. This time, with carefully recorded speech samples, I found no robust difference in accuracy by gender in either system. Which is great! In addition, the unreliable trends for each system pointed in opposite ways; Bing had a lower WER for male speakers, while YouTube had a lower WER for female speakers.

So why did I find an effect last time? My (untested) hypothesis is that there was a difference in the signal to noise ratio for male and female speakers in the user-uploaded files. Since women are (on average) smaller and thus (on average) slightly quieter when they speak, it’s possible that their speech was more easily masked by background noises, like fans or traffic. These files were all recorded in a quiet place, however, which may help to explain the lack of difference between genders.


Neither Bing (F[1, 34] = 1.13, p = 0.29), nor YouTube’s automatic captions (F[1, 37] = 1.56, p = 0.22) had a significant difference in accuracy by gender.

Finally, what about race? For this part of the analysis, I excluded General American speakers, since they did not report their race. I also excluded the single Native American speaker. Even with fewer speakers, and thus reduced power, the differences between races were still robust enough to be significant for YouTube’s automatic captions and Bing followed the same trend. Both systems were most accurate for Caucasian speakers.


As with dialect, differences in WER between races were not significant for Bing (F[4, 31] = 1.21, p = 0.36), but were significant for YouTube’s automatic captions (F[4, 34] = 2.86, p < 0.05). Both systems were most accurate for Caucasian speakers.

While I was happy to find no difference in performance by gender, the fact that both systems made more errors on non-Caucasian and non-General-American-speaking talkers is deeply concerning. Regional varieties of American English and African American English are both consistent and well-documented. There is nothing intrinsic to these varieties that makes them less easy to recognize. The fact that they are recognized with more errors is most likely due to bias in the training data. (In fact, Mozilla is currently collecting diverse speech samples for an open corpus of training data–you can help them out yourself.)

So what? Why does word error rate matter?

There are two things I’m really worried about with these types of speech recognition errors. The first is that higher error rates seem to overwhelmingly affect already-disadvantaged groups. In the US, strong regional dialects tend to be associated with speakers who aren’t as wealthy, and there is a long and continuing history of racial discrimination in the United States.

Given this, the second thing I’m worried about is the fact that these voice recognition systems are being incorporated into other applications that have a real impact on people’s lives.

Every automatic speech recognition system makes errors. I don’t think that’s going to change (certainly not in my lifetime). But I do think we can get to the point where those errors don’t disproportionately affect already-marginalized people. And if we keep using automatic speech recognition in high-stakes situations, it’s vital that we get to that point quickly and, in the meantime, stay aware of these biases.

If you’re interested in the long version, you can check out the published paper here.


Can a computer write my blog posts?

This post is pretty special: it’s the 100th post I’ve made since starting my blog! It’s hard to believe I’ve been doing this so long. I started blogging in 2012, in my final year of undergrad, and now I’m heading into the last year of my PhD. Crazy how fast time flies.

Ok, back on topic. As I was looking back over everything I’ve written, it struck me that 99 posts worth of text on a very specific subject domain (linguistics) in a very specific register (informal) should be enough text to train a simple text generator.

So how did I go about building a blog bot? It was pretty easy! All I needed was:

  • 67,000 words of text (all blog posts before this one)
  • 1 R script to tidy up the text
  • 1 Python script to train a Markov Chain text generator

A Markov Whatnow?

A Markov Chain is a type of simple (but surprisingly powerful) statistical model that tells you, given the item you’re currently on, what item you’re likely to see next. Today we’re going to apply it to whole words in a text.

How does it work? Basically, for each word in your text, you count which words occur after it and how many times each one shows up, then figure out the probability of each transition. So if your text is “The dog ate the apple.”, then there’s a 50% chance that “the” will be followed by “apple”, but a 100% chance that “apple” will be followed by “.”. You can then use these probabilities to generate new sentences, like so:

  • Input: The dog ate the apple.
  • Possible outputs:
    • The apple.
    • The dog ate the apple.
    • The dog ate the dog ate the apple.
    • The dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the apple. (This may seem silly, but remember that we’re only looking at two words at a time. This model doesn’t “remember” what came earlier or “know” that it’s repeating itself.)
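The counting step described above can be sketched in a few lines of Python. This is an illustrative stand-in rather than the code in the repository: the function name is made up, and the tokenizer (lowercase everything, treat sentence-final punctuation as its own token) is an assumption chosen so that “The” and “the” count as the same word.

```python
import re
from collections import Counter, defaultdict


def transition_probabilities(text):
    """Map each token to the probability of every token that follows it."""
    # Assumption: words are runs of word characters, and ., ! and ? are
    # tokens of their own. Contractions would need a smarter tokenizer.
    tokens = re.findall(r"\w+|[.!?]", text.lower())

    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1

    return {
        word: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for word, followers in counts.items()
    }


probs = transition_probabilities("The dog ate the apple.")
print(probs["the"])    # "dog" and "apple" each get probability 0.5
print(probs["apple"])  # "." gets probability 1.0
```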

OK, now for the actual training of the model. If you want to play around with this yourself, all my code and text is up on GitHub here.

  1. First, grab your text. I downloaded all my blog posts as an XML file and used a quick little R script to snag just the text I was interested in.
  2. Next, train your model. I used PyMarkovChain, but you can always write your own if you’re feeling it.
  3. Finally, use the model to generate text.
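If you'd rather write your own than use a library, steps 2 and 3 can look something like the sketch below. To be clear, this is a hand-rolled stand-in and not PyMarkovChain's API; the names `train` and `generate`, the tokenizer, and the 30-word cap are all assumptions made for illustration.

```python
import random
import re
from collections import defaultdict

SENTENCE_END = {".", "!", "?"}


def train(text):
    """Step 2: record, for every token, the list of tokens seen right after it."""
    tokens = re.findall(r"\w+|[.!?]", text.lower())
    model = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        # Duplicates are kept on purpose, so frequent followers get picked more often.
        model[current].append(following)
    return model


def generate(model, start, max_words=30, rng=random):
    """Step 3: walk the chain from `start` until sentence-final punctuation."""
    words = [start]
    while len(words) < max_words:
        followers = model.get(words[-1])
        if not followers:
            break  # dead end: this word was never followed by anything
        nxt = rng.choice(followers)
        if nxt in SENTENCE_END:
            return " ".join(words) + nxt
        words.append(nxt)
    return " ".join(words)  # hit the word cap without finishing the sentence


model = train("The dog ate the apple.")
print(generate(model, "the"))
```

Storing followers as a list with repeats is the laziest way to get frequency-weighted sampling. The word cap is there because nothing in the model stops it from looping through “the dog ate the dog ate” for a very long time.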

Alright, now I’ve got an (admittedly very, very dumb) little program that kinda knows what I write like when I’m blogging. How well does it do at pretending to be me?

First try:

I’m going to break eye contact, look down at your own personalized ASR system

Well that’s… vaguely disturbing. But also not really very human-sounding, and certainly not something I’d write in a blog post. Let’s give it a couple more tries and see if it does any better.

  • But, if frosting has to have a career where you learned it from Clarice
  • We only perceive the things that our bodies and sensory organs and brains can perceive, and we really don’t care about to tell the sounds in the second part, I was expecting to see whether something is grammatical is to put up a bunch of historical stuff that we’ve all heard about the structure in the arts and sciences were depressed
  • (Which is really interesting–I’d love to learn your language doesn’t make you very, very good a recognizing speech
  • Believe me, I know what you can uncontract them and what’s the take-away
    People with short face syndrome–yes, that’s a classical description of what a language does, along with parallels drawn to another, related, languages
  • Short answer: they’re all correct
  • And those speakers are aware of
  • The Job Market for Linguistics PhDsWhat do you much

Hmm, not so great. The only output that sounded good to me was “Short answer: they’re all correct”. And that was just the exact same phrasing I used in this blog post. No wonder it sounds just like me; it is me!

So it looks like I won’t be handing the reins over to Making Noise and Hearing Things bot any time soon. True, I could have used a fancier tool, like a Recurrent Neural Network. But to be perfectly honest, I have yet to see any text generation system that can produce anything even close to approximating a human-written blog post. Will we get there? Maybe. But natural language generation, especially outside of really formulaic things like weather or sports reporting, is a super hard problem. Heck, we still haven’t gotten to the point where computers can reliably solve third-grade math word problems.

The very complexities that make language so useful (and interesting to study) also make it so hard to model. Which is good news for me! It means there’s still plenty of work to do in language modelling and blogging.

Meme Grammar

So the goal of linguistics is to find and describe the systematic ways in which humans use language. And boy howdy do we humans love using language systematically. A great example of this is internet memes.

What are internet memes? Well, let’s start with the idea of a “meme”. “Memes” were posited by Richard Dawkins in his book The Selfish Gene. He used the term to describe cultural ideas that are transmitted from individual to individual much like a virus or bacteria. The science mystique I’ve written about is a great example of a meme of this type. If you have fifteen minutes, I suggest Dan Dennett’s TED talk on the subject of memes as a much more thorough introduction.

So what about the internet part? Well, internet memes tend to be a bit narrower in their scope. Viral videos, for example, seem to be a separate category from internet memes even though they clearly fit into Dawkins’s idea of what a meme is. Generally, “internet meme” refers to a specific image and text that is associated with that image. These are generally called image macros. (For a thorough analysis of emerging and successful internet memes, as well as an excellent object lesson in why you shouldn’t scroll down to read the comments, I suggest Know Your Meme.) It’s the text that I’m particularly interested in here.

Memes which involve language require that it be used in a very specific way, and failure to obey these rules results in social consequences. In order to keep this post a manageable size, I’m just going to look at the use of language in the two most popular image memes, as ranked by Meme Generator, though there is a lot more to study here. (I think a study of the differing uses of the initialisms MRW [my reaction when] and MFW [my face when] on imgur and 4chan would show some very interesting patterns in the construction of identity in the two communities. Particularly since the 4chan community is made up of anonymous individuals and the imgur community is made up of named individuals who are attempting to gain status through points. But that’s a discussion for another day…)


The God tier (i.e. most popular) characters on the website Meme Generator as of February 23rd, 2013. Click for link to site. If you don’t recognize all of these characters, congratulations on not spending all your free time on the internet.

Without further ado, let’s get to the grammar. (I know y’all are excited.)

Y U No

This meme is particularly interesting because its page on Meme Generator already has a grammatical description.

The Y U No meme actually began as Y U No Guy but eventually evolved into simply Y U No, the phrase being generally followed by some often ridiculous suggestion. Originally, the face of Y U No guy was taken from Japanese cartoon Gantz’ Chapter 55: Naked King, edited, and placed on a pink wallpaper. The text for the item reads “I TXT U … Y U NO TXTBAK?!” It appeared as a Tumblr file, garnering over 10,000 likes and reblogs.

It went totally viral, and has morphed into hundreds of different forms with a similar theme. When it was uploaded to MemeGenerator in a format that was editable, it really took off. The formula used was : “(X, subject noun), [WH]Y [YO]U NO (Y, verb)?”[Bold mine.]

A pretty good try, but it can definitely be improved upon. There are always two distinct groupings of text in this meme, always in Impact font, white with a black border and in all caps. This is pretty consistent across all image macros. In order to indicate the break between the two text chunks, I will use — throughout this post. The chunk of text that appears above the image is a noun phrase that directly addresses someone or something, often a famous individual or corporation. The bottom text starts with “Y U NO” and finishes with a verb phrase. The verb phrase is an activity or action that the addressee from the first block of text could or should have done, and that the meme creator considers positive. It is also inflected as if “Y U NO” were structurally equivalent to “Why didn’t you”. So, since you would ask Steve Jobs “Why didn’t you donate more money to charity?”, a grammatical meme to that effect would be “STEVE JOBS — Y U NO DONATE MORE MONEY TO CHARITY”. In effect, this meme asks someone or something that had the agency to do something positive why they chose not to do that thing. While this certainly has the potential to be a vehicle for social commentary, like most memes it’s mostly used for comedic effect. Finally, there is some variation in the punctuation of this meme. While no punctuation is the most common, an exclamation point, a question mark or both are all used. I would hypothesize that the use of punctuation varies between internet communities… but I don’t really have the time or space to get into that here.
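The surface part of that description (all caps, bottom text opening with “Y U NO”, optional final punctuation) is regular enough to check mechanically. Here is a rough sketch that takes the top and bottom text as two separate strings. The function name and regular expression are made up for illustration, and the check says nothing about whether the top text is really a noun phrase or the remainder a sensibly inflected verb phrase; that would take an actual parser.

```python
import re

# Bottom text: "Y U NO", then at least one character of verb phrase, then
# optionally an exclamation point, a question mark, or both.
BOTTOM_TEXT = re.compile(r"^Y U NO (?P<verb_phrase>[^?!]+?)(?P<punct>\?!|!\?|\?|!)?$")


def fits_y_u_no_template(top: str, bottom: str) -> bool:
    """Surface check only: all caps, and the bottom text has the right shape."""
    if not top.strip():
        return False  # the addressee is missing
    if top != top.upper() or bottom != bottom.upper():
        return False  # image macros are written in all caps
    return BOTTOM_TEXT.match(bottom) is not None


print(fits_y_u_no_template("STEVE JOBS", "Y U NO DONATE MORE MONEY TO CHARITY"))
print(fits_y_u_no_template("Steve Jobs", "y u no donate more money to charity"))
```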

A meme (created by me using Meme Generator) following the guidelines outlined above.

Futurama Fry

This meme also has a brief grammatical analysis:

The text surrounding the meme picture, as with other memes, follows a set formula. This phrasal template goes as follows: “Not sure if (insert thing)”, with the bottom line then reading “or just (other thing)”. It was first utilized in another meme entitled “I see what you did there”, where Fry is shown in two panels, with the first one with him in a wide-eyed expression of surprise, and the second one with the familiar half-lidded expression.

As an example of the phrasal template, Futurama Fry can be seen saying: “Not sure if just smart …. Or British”. Another example would be “Not sure if highbeams … or just bright headlights”. The main form of the meme seems to be with the text “Not sure if trolling or just stupid”.

This meme is particularly interesting because there seems to be an extremely rigid syntactic structure. The phrase follows the form “NOT SURE IF _____ — OR _____”. The first blank can either be filled by a complete sentence or a subject complement while the second blank must be filled by a subject complement. Subject complements, also called predicates (But only by linguists; if you learned about predicates in school it’s probably something different. A subject complement is more like a predicate adjective or predicate noun.), are everything that can come after a form of the verb “to be” in a sentence. So, in a sentence like “It is raining”, “raining” is the subject complement. So, for the Futurama Fry meme, if you wanted to indicate that you were uncertain whether it was raining or sleeting, both of these forms would be correct:


Note that, if a complete sentence is used and abbreviation is possible, it must be abbreviated. Thus the following sentence is not a good Futurama Fry sentence:


This is particularly interesting because the “phrasal template” description does not include this distinction, but it is quite robust. This is a great example of how humans notice and perpetuate linguistic patterns that they aren’t necessarily aware of.

A meme (created by me using Meme Generator) following the guidelines outlined above. If you’re not sure whether it’s phonetics or phonology, may I recommend this post as a quick refresher?

So this is obviously very interesting to a linguist, since we’re really interested in extracting and distilling those patterns. But why is this useful/interesting to those of you who aren’t linguists? A couple of reasons.

  1. I hope you find it at least a little interesting and that it helps to enrich your knowledge of your experience as a human. Our capacity for patterning is so robust that it affects almost every aspect of our existence and yet it’s easy to forget that, to let our awareness of that slip out of our conscious minds. Some patterns deserve to be examined and criticized, though, and linguistics provides an excellent low-risk training ground for that kind of analysis.
  2. If you are involved in internet communities I hope you can use this new knowledge to avoid the social consequences of violating meme grammars. These consequences can range from a gentle reprimand to mockery and scorn. The gatekeepers of internet culture are many, vigilant and vicious.
  3. As with much linguistic inquiry, accurately noting and describing these patterns is the first step towards being able to use them in a useful way. I can think of many uses, for example, of a program that did large-scale sentiment analyses of image macros but was able to determine which were grammatical (and therefore more likely to be accepted and propagated by internet communities) and which were not.

Why is it so hard for computers to recognize speech?

This is a problem that’s plagued me for quite a while. I’m not a computational linguist myself, but one of the reasons that theoretical linguistics is important is that it allows us to create robust conceptual models of language… which is basically what voice recognition (or synthesis) programs are. But, you may say to yourself, if it’s your job to create and test robust models, you’re clearly not doing very well. I mean, just listen to this guy. Or this guy. Or this person, whose patience in detailing errors borders on obsession. Or, heck, this person, who isn’t so sure that voice recognition is even a thing we need.


You mean you wouldn’t want to be able to have pleasant little chats with your computer? I mean, how could that possibly go wrong?

Now, to be fair to linguists, we’ve kinda been out of the loop for a while. Fred Jelinek, a very famous researcher in speech recognition, once said “Every time we fire a phonetician/linguist, the performance of our system goes up”. Oof, right in the career prospects. There was, however, a very good reason for that, and it had to do with the pressures on computer scientists and linguists respectively. (Also a bunch of historical stuff that we’re not going to get into.)

Basically, in the past (and currently to a certain extent) there was this divide in linguistics. Linguists wanted to model speakers’ competence, not their performance. Basically, there’s this idea that there is some sort of place in your brain where you know all the rules of language and have them all perfectly mapped out and described. Not in a conscious way, but there nonetheless. But somewhere between the magical garden of language and your mouth and/or ears you trip up and mistakes happen. You say a word wrong or mishear it or switch bits around… all sorts of things can go wrong. Plus, of course, even if we don’t make a recognizable mistake, there’s an incredible amount of variation that we can decipher without a problem. That got pushed over to the performance side, though, and wasn’t looked at as much. Linguistics was all about what was happening in the language mind-garden (the competence) and not the messy sorts of things you say in everyday life (the performance). You can also think of it like what celebrities actually say in an interview vs. what gets into the newspaper; all the “um”s and “uh”s are taken out, little stutters or repetitions are erased and if the sentence structure came out a little wonky the reporter pats it back into shape. It was pretty clear what they meant to say, after all.

So you’ve got linguists with their competence models explaining them to the computer folks and computer folks being all clever and mathy and coming up with algorithms that seem to accurately model our knowledge of human linguistic competency… and getting terrible results. Everyone’s working hard and doing their best and it’s just not working.

I think you can probably figure out why: if you’re a computer and just sitting there with very little knowledge of language (consider that this was before any of the big corpora were published, so there wasn’t a whole lot of raw data) and someone hands you a model that’s supposed to handle only perfect data and also actual speech data, which even under ideal conditions is far from perfect, you’re going to spit out spaghetti and call it a day. It’s a bit like telling someone to make you a peanut butter and jelly sandwich and just expecting them to do it. Which is fine if they already know what peanut butter and jelly are, and where you keep the bread, and how to open jars, and that food is something humans eat, so you shouldn’t rub it on anything too covered with bacteria or they’ll get sick and die. Probably not the best way to go about it.

So the linguists got the boot and they and the computational people pretty much did their own things for a bit. The model that most speech recognition programs use today is mostly statistical, based on things like how often a word shows up in whichever corpus they’re using currently. Which works pretty well. In a quiet room. When you speak clearly. And slowly. And don’t use any super-exotic words. And aren’t having a conversation. And have trained the system on your voice. And have enough processing power in whatever device you’re using. And don’t get all wild and crazy with your intonation. See the problem?
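To make “how often a word shows up in the corpus” concrete, here is a deliberately tiny sketch: given a few candidate words that sound alike, pick the one the corpus has seen most often. Everything in it is made up for illustration (the function name, the two-sentence corpus, the candidate words). Real recognizers combine acoustic scores with much richer language models than a raw word count.

```python
from collections import Counter


def pick_most_frequent(candidates, corpus_counts):
    """Choose the candidate word seen most often in the corpus.

    Ties go to whichever candidate is listed first.
    """
    return max(candidates, key=lambda word: corpus_counts[word])


# Toy corpus, made up for this example.
corpus = "it is hard to recognize speech . computers recognize speech badly ."
counts = Counter(corpus.split())

# Two words the audio alone might not tell apart.
print(pick_most_frequent(["beach", "speech"], counts))
```

A `Counter` returns 0 for words it has never seen, which is exactly the weakness the paragraph above is poking at: anything rare or missing from the corpus loses by default.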

Language is incredibly complex and speech recognition technology, particularly when it’s based on a purely statistical model, is not terrific at dealing with all that complexity. Which is not to say that I’m knocking statistical models! Statistical phonology is mind-blowing and I think we in linguistics will get a lot of mileage from it. But there’s a difference. We’re not looking to conserve processing power: we’re looking to model what humans are actually doing. There’s been a shift away from the competency/performance divide (though it does still exist) and more interest in modelling the messy stuff that we actually see: conversational speech, connected speech, variation within speakers. And the models that we come up with are complex. Really complex. People working in Exemplar Theory, for example, have found quite a bit of evidence that you remember everything you’ve ever heard and use all of it to help parse incoming signals. Yeah, it’s crazy. And it’s not something that our current computers can do. Which is fine; it gives linguists time to further refine our models. When computers are ready, we will be too, and in the meantime computer people and linguistic people are showing more and more overlap again, and using each other’s work more and more. And, you know, singing Kumbayah and roasting marshmallows together. It’s pretty friendly.

So what’s the take-away? Well, at least for the moment, in order to get speech recognition to a better place than it is now, we need to build models that work for a system that is less complex than the human brain. Linguistics research, particularly into statistical models, is helping with this. For the future? We need to build systems that are as complex as the human brain. (Bonus: we’ll finally be able to test models of child language acquisition without doing deeply unethical things! Not that we would do deeply unethical things.) Overall, I’m very optimistic that computers will eventually be able to recognize speech as well as humans can.

TL;DR version:

  • Speech recognition has been light on linguists because they weren’t modeling what was useful for computational tasks.
  • Now linguists are building and testing useful models. Yay!
  • Language is super complex and treating it like it’s not will get you hit in the face with an error-ridden fish.
  • Linguists know language is complex and are working diligently at accurately describing how and why. Yay!
  • In order to get perfect speech recognition down, we’re going to need to have computers that are similar to our brains.
  • I’m pretty optimistic that this will happen.



Mapping language, language maps

So for some reason, I’ve come across three studies in quick succession based on mapping language. Now, if you know me, you know that nattering on about linguistic methodology is pretty much the Persian cat to my Blofeld, but I really do think that looking at the way that linguists do linguistics is incredibly important. (Warning: the next paragraph will be kinda preachy, feel free to skip it.)

It’s something the field, to paint with an incredibly broad brush, tends to skimp on. After all, we’re asking all these really interesting questions that have the potential to change people’s lives. How is hearing speech different from hearing other things? What causes language pathologies and how can we help correct them? Can we use the voice signal to reliably detect Parkinson’s over the phone? That’s what linguistics is. Who has time to look at whether asking people to list the date on a survey form affects their responses? If linguists don’t use good, controlled methods to attempt to look at these questions, though, we’ll either find the wrong answers or miss them completely because of some confounding variable we didn’t think about. Believe me, I know firsthand how heart-wrenching it is to design an experiment, run subjects, do your stats and end up with a big pile of useless goo because your methodology wasn’t well thought out. It sucks. And it happens way more than it needs to, mainly because a lot of linguistics programs don’t stress rigorous scientific training.

OK, sermon over. Maps! I think using maps to look at language data is a great methodology! Why?


Hmm… needs more data about language. Also the rest of the continents, but who am I to judge? 

  1. You get an end product that’s tangible and easy to read and use. People know what maps are and how to use them. Presenting linguistic data as a map rather than, say, a terabyte of detailed surveys or a thousand hours of recordings is a great way to make that same data accessible. Accessible data gets used. And isn’t that kind of the whole point?
  2. Maps are so accurate right now. This means that maps of data aren’t just rough approximations, they’re the best, most accurate way to display this information. Seriously, the stuff you can do with GIS is just mind blowing. (Check out this dialect map of the US. If you click on the region you’re most interested in, you get additional data like field recordings, along with the precise place they were made. Super useful.)
  3. Maps are fun. Oh, come on, who doesn’t like looking at maps? Particularly if you’re looking at a region you’re familiar with. See, here’s my high school, and the hay field we rented three years ago. Oh, and there’s my friend’s house! I didn’t realize they were so close to the highway. Add a second layer of information and BOOM, instant learning.

The studies

Two of the studies I came across were actually based on Twitter data. Twitter’s an amazing resource for studying linguistics because you have this enormous data set you can just use without having to get consent forms from every single person. So nice. Plus, because all tweets are archived, in the Library of Congress if nowhere else, other researchers can go back and verify things really easily.

This study looks at how novel slang expressions spread across the US. It hasn’t actually been published yet, so I don’t have the map itself, but they do talk about some interesting tidbits. For example: the places most likely to spawn new successful slang are urban centers with a high African American population.

The second Twitter study is based in London and looked at the different languages Londoners tweet in and did have a map:

Click for link to author’s blog post.

Interesting, huh? You can really get a good idea of the linguistic landscape of London. Although there were some potential methodological problems with this study, I still think it’s a great way to present this data.

The third study I came across is one that’s actually here at the University of Washington. This one is interesting because it kind of goes the other way. Basically, the researchers had respondents indicate areas on a map of Washington where they thought language communities existed and then had them describe them. So what you end up with is sort of a representation of the social ideas of what language is like in various parts of Washington state. Like so:

Click for link to study site.

There are lots more interesting maps on the study site, each of which shows some different perception of language use in Washington State. (My favorite is the one that suggests that people think other people who live right next to the Canadian border sound Canadian.)

So these are just a couple of the ways in which people are using maps to look at language data. I hope it’s a trend that continues.