What’s the difference between & and +?

So if you’re like me, you sometimes take notes on the computer and end up using some shortcuts so you can keep up with the speed of whoever’s talking. One of the shortcuts I use a lot is replacing the word “and” with punctuation. When I’m handwriting things I only ever use “+” (because I can’t reliably write an ampersand), but in typing I use both “+” and “&”. And I realized recently, after going back to change which one I used, that I had the intuition that they should be used for different things.


I don’t use ampersands when I’m handwriting things because they’re hard to write.

Like sometimes happens with linguistic intuitions, though, I didn’t really have a solid idea of how they were different, just that they were. Fortunately, I had a ready-made way to figure it out. Since I use both symbols on Twitter quite a bit, all I had to do was grab tweets of mine that used either + or & and figure out what the difference was.

I got 450 tweets from between October 7th and November 11th of this year from my own account (@rctatman). I used either & or + in 83 of them, or roughly 18%. This number is a little inflated because I was livetweeting a lot of conference talks in that period, and if a talk has two authors I start every livetweet from that talk with “AuthorName1 & AuthorName2:”. 43 tweets use & in this way. If we set those aside, only around 9% of my tweets contain either + or &. That’s still a lot more common than in most other genres of writing, though, so it’s a good amount of data.
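If you’re curious, the tallying I describe is simple enough to sketch in a few lines of Python. (This is just an illustration with made-up tweets, not my actual script or dataset.)

```python
# Toy illustration with made-up tweets (not my real dataset or script).
tweets = [
    "Data + code available here!",
    "AuthorName1 & AuthorName2: great talk on entity linking",
    "just a regular tweet with no conjunction symbols",
]

# Tweets containing either symbol
with_symbol = [t for t in tweets if "&" in t or "+" in t]

# Livetweet headers of the form "Author1 & Author2: ..."
livetweet_headers = [t for t in with_symbol if ":" in t and "&" in t.split(":")[0]]

print(len(with_symbol), len(with_symbol) - len(livetweet_headers))  # prints: 2 1
```

On the real data the same two counts give the 18% and roughly 9% figures above.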

So what do I use + for? See for yourself! Below are all the things I conjoined with + in my Twitter dataset. (Spelling errors intact. I’m dyslexic, so if I don’t carefully edit text—and even sometimes when I do, to my eternal chagrin—I tend to have a lot of spelling errors. Also, a lot of these tweets are from EMNLP so there’s quite a bit of jargon.)

  • time + space
  • confusable Iberian language + English
  • Data + code
  • easy + nice
  • entity linking + entity clustering
  • group + individual
  • handy-dandy worksheet + tips
  • Jim + Brenda, Finn + Jake
  • Language + action
  • linguistic rules + statio-temporal clustering
  • poster + long paper
  • Ratings + text
  • static + default methods
  • syntax thing + cattle
  • the cooperative principle + Gricean maxims
  • Title + first author
  • to simplify manipulation + preserve struture

If you’ve had some syntactic training, it might jump out at you that most of these things have the same syntactic structure: they’re noun phrases! There are just a couple of exceptions. The first is “static + default methods”, where the things being conjoined are actually adjectives modifying a single noun. The other is “to simplify manipulation + preserve struture”. I’m going to remain agnostic about where in the verb phrase that coordination is taking place, though, so I don’t get into any syntax arguments😉. That said, this is a fairly robust pattern! Remember that I haven’t been taught any rules about what I “should” do, so this is just an emergent pattern.

Ok, so what about &? Like I said, my number one use is for conjunction of names. This probably comes from my academic writing training. Most of the papers I read that use author names for in-line citations use an & between them. But I do also use it in the main body of tweets. My use of & is a little bit harder to characterize, so I’m going to go through and tell you about each type of thing.

First, I use it to conjoin user names with the @ tag. This makes sense, since I have a strong tendency to use & with names:

  • @uwengineering & @uwnlp
  • @amazon @baidu @Grammarly & @google

In some cases, I do use it in the same way as I do +, for conjoining noun phrases:

  • Q&A
  • the entities & relations
  • these features & our corpus
  • LSTM & attention models
  • apples & concrete
  • context & content

But I also use it for comparatives:

  • Better suited for weak (bag-level) labels & interpretable and flexible
  • easier & faster

And, perhaps more interestingly, for really high-level conjunction, like at the level of the sentence or entire verb phrase (again, I’m not going to make ANY claims about what happens in and around verbs—you’ll need to talk to a syntactician for that!).

  • Classified as + or – & then compared to polls
  • in 30% of games the group performance was below average & in 17% group was worse than worst individual
  • math word problems are boring & kids learn better if they’re interested in the theme of the problem
  • our system is the first temporal tagger designed for social media data & it doesn’t require hand tagging
  • use a small labeled corpus w/ small lexicon & choose words with high prob. of 1 label

And, finally, & gets used in sort of miscellaneous places, like in hashtags and between URLs.

So & gets used in a lot more places than + does. I think this is probably because, on some subconscious level, I consider & to be the default (or, in linguistics terms, “unmarked”). This might be related to how I process these symbols when I read them. I’m one of those people who hears an internal voice when reading and writing, so I tend to have canonical vocalizations of most typed symbols. I read @ as “at”, for example, and emoticons as a prosodic beat with some sort of emotive sound. Like I read the snorting emoji as the sound of someone snorting. For & and +, I read & as “and” and + as “plus”. I also use “plus” as a conjunction fairly often in speech, as do many of my friends, so it’s possible that my typing patterns with my use in speech (I don’t have any data for that, though!). But I don’t say “plus” nearly as often as I say “and”. “And” is definitely the default, and I guess that, by extension, & is as well.

Another thing that might be at play here is the ease of typing these symbols. While they’re pretty much equally easy to type on my phone, on a full keyboard + is slightly easier, since I don’t have to reach as far from the shift key. But if that were the only factor, my default would be +, so I’m fairly comfortable claiming that my use of & for more types of conjunction reflects the influence of speech.

A BIG caveat before I wrap up—this is a bespoke analysis. It may hold for me, but I don’t claim that it’s the norm of any of my language communities. I’d need a lot more data for that! That said, I think it’s really neat that I’ve unconsciously fallen into a really regular pattern of use for two punctuation symbols that are basically interchangeable. It’s a great little example of the human tendency to unconsciously tidy up language.

How loud would a million dogs barking be?

So a friend of mine who’s a reference librarian (and has a gaming YouTube channel you should check out) recently got an interesting question: how loud would a million dogs barking be?

This is a fun question because it gets at some interesting properties of how sound works, in particular the decibel scale.

So, first off, we need to establish our baseline. The loudest recorded dog bark clocked in at 113.1 dB, and was produced by a golden retriever named Charlie. (Interestingly, the loudest recorded human scream was 129 dB, so it looks like Charlie’s got some training to do to catch up!) That’s louder than a chain saw, and loud enough to cause hearing damage if you heard it constantly.

Now, let’s scale our problem down a bit and figure out how loud it would be if ten Charlies barked together. (I’m going to use copies of Charlie and assume they’ll bark in phase because it makes the math simpler.) One Charlie is 113 dB, so your first instinct may be to multiply that by ten and end up with 1130 dB. Unfortunately, if you took this approach you’d be (if you’ll excuse the expression) barking up the wrong tree. Why? Because the dB scale is logarithmic. This means that a 1130 dB sound would be absolutely, ridiculously loud. For reference, under normal conditions the loudest possible sound (on Earth) is 194 dB. A sound of 1000 dB would be loud enough to create a black hole larger than the galaxy. We couldn’t get a bark that loud even if we covered every inch of the Earth with clones of champion barker Charlie.

Ok, so we know one wrong approach, but what’s the right one? Well, we have our base bark at 113 dB. If we want a bark that is one million times as powerful (assuming we can get a million dogs to bark as one), then we need to take the base ten log of one million and multiply it by ten (that’s the deci part of decibel). (If you want more math try this site.) The base ten log of one million is six, so times ten that’s sixty decibels. That’s sixty decibels louder than our original sound of 113 dB, for a grand total of 173 dB.
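If you want to double-check the arithmetic, the whole calculation fits in a couple of lines of Python (just a sketch of the math above, nothing more):

```python
import math

base_bark_db = 113     # one Charlie
n_dogs = 1_000_000

# n identical, in-phase sources have n times the power,
# which adds 10 * log10(n) decibels to the original level.
gain_db = 10 * math.log10(n_dogs)
total_db = base_bark_db + gain_db

print(gain_db, total_db)  # prints: 60.0 173.0
```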

Now, to put this in perspective, that’s still pretty durn loud. That’s loud enough to cause hearing loss in our puppies and everyone in hearing distance. We’re talking about the loudness of a cannon, or a rocket launch from 100 meters away. So, yes, very loud, but not quite “destroying the galaxy” loud.

A final note: since the current world record for loudest barking group of dogs is a more modest 124 dB from a group of just 76 dogs, if you could get a million dogs to bark in unison you’d definitely set a new world record! But, considering that you’d end up hurting the dogs’ hearing (and having to scoop all that poop) I’m afraid I really can’t recommend it.

Can a computer write my blog posts?

This post is pretty special: it’s the 100th post I’ve made since starting my blog! It’s hard to believe I’ve been doing this so long. I started blogging in 2012, in my final year of undergrad, and now I’m heading into my last year of my PhD. Crazy how fast time flies.

Ok, back on topic. As I was looking back over everything I’ve written, it struck me that 99 posts worth of text on a very specific subject domain (linguistics) in a very specific register (informal) should be enough text to train a simple text generator.

So how did I go about building a blog bot? It was pretty easy! All I needed was:

  • 67,000 words of text (all blog posts before this one)
  • 1 R script to tidy up the text
  • 1 Python script to train a Markov Chain text generator

A Markov Whatnow?

A Markov Chain is a type of simple (but surprisingly powerful) statistical model that tells you, given the item you’re currently on, what item you’re likely to see next. Today we’re going to apply it to whole words in a text.

How does it work? Basically, for each word in your text, you count how many different words occur after it and how many times each shows up, and from those counts you figure out the probability of each transition. So if your text is “The dog ate the apple.”, then there’s a 50% chance that “the” will be followed by “apple”, but a 100% chance that “apple” will be followed by “.”. You can then use these probabilities to generate new sentences, like so:

  • Input: The dog ate the apple.
  • Possible outputs:
    • The apple.
    • The dog ate the apple.
    • The dog ate the dog ate the apple.
    • The dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the apple. (This may seem silly, but remember that we’re only looking at two words at a time. This model doesn’t “remember” what came earlier or “know” that it’s repeating itself.)
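If you’d like to see the counting spelled out, here’s a minimal word-level Markov chain in Python. (This is a toy sketch of the idea, not the actual PyMarkovChain code I used.)

```python
import random
from collections import defaultdict

def train(words):
    """For each word, record every word observed directly after it."""
    transitions = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        transitions[current].append(nxt)
    return transitions

def generate(transitions, start, max_len=30):
    """Walk the chain, sampling each next word by its observed frequency."""
    out = [start]
    while out[-1] in transitions and len(out) < max_len:
        out.append(random.choice(transitions[out[-1]]))
    return " ".join(out)

words = "the dog ate the apple .".split()
model = train(words)
# "the" is followed by "dog" half the time and "apple" half the time;
# "apple" is always followed by "."
print(generate(model, "the"))
```

Running it a few times produces exactly the kinds of outputs listed above, including the endless “the dog ate the dog ate…” loops, since the model only ever looks one word back.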

OK, now for the actual training of the model. If you want to play around with this yourself, all my code and text is up on GitHub here.

  1. First, grab your text. I downloaded all my blog posts as an XML file and used a quick little R script to snag just the text I was interested in.
  2. Next, train your model. I used PyMarkovChain, but you can always write your own if you’re feeling it.
  3. Finally, use the model to generate text.

Alright, now I’ve got an (admittedly very, very dumb) little program that kinda knows what I write like when I’m blogging. How well does it do at pretending to be me?

First try:

I’m going to break eye contact, look down at your own personalized ASR system

Well that’s… vaguely disturbing. But also not really very human-sounding, and certainly not something I’d write in a blog post. Let’s give it a couple more tries and see if it does any better.

  • But, if frosting has to have a career where you learned it from Clarice
  • We only perceive the things that our bodies and sensory organs and brains can perceive, and we really don’t care about to tell the sounds in the second part, I was expecting to see whether something is grammatical is to put up a bunch of historical stuff that we’ve all heard about the structure in the arts and sciences were depressed
  • (Which is really interesting–I’d love to learn your language doesn’t make you very, very good a recognizing speech
  • Believe me, I know what you can uncontract them and what’s the take-away
  • People with short face syndrome–yes, that’s a classical description of what a language does, along with parallels drawn to another, related, languages
  • Short answer: they’re all correct
  • And those speakers are aware of
  • The Job Market for Linguistics PhDsWhat do you much

Hmm, not so great. The only output that sounded good to me was “Short answer: they’re all correct”. And that was just the exact same phrasing I used in this blog post. No wonder it sounds just like me; it is me!

So it looks like I won’t be handing the reins over to Making Noise and Hearing Things bot any time soon. True, I could have used a fancier tool, like a Recurrent Neural Network. But to be perfectly honest, I have yet to see any text generation system that can produce anything even close to approximating a human-written blog post. Will we get there? Maybe. But natural language generation, especially outside of really formulaic things like weather or sports reporting, is a super hard problem. Heck, we still haven’t gotten to the point where computers can reliably solve third-grade math word problems.

The very complexities that make language so useful (and interesting to study) also make it so hard to model. Which is good news for me! It means there’s still plenty of work to do in language modelling and blogging.

Six Linguists of Color (who you can follow on Twitter!)

In the light of some recent white supremacist propaganda showing up on my campus, I’ve decided to spotlight a tiny bit of the amazing work being done around the country by linguists of color. Each of the scholars below is doing interesting, important linguistics research and has a Twitter account that I personally enjoy following. If you’re on this blog, you probably will as well! I’ll give you a quick intro to their research and, if it piques your interest, you can follow them on Twitter for all the latest updates.

(BTW, if you’re wondering why I haven’t included any grad students on this list, it’s because we generally don’t have as well-developed a research trajectory, and I want this to be a useful resource for at least a few years.)

Anne Charity Hudley

Dr. Charity Hudley is a professor at the College of William and Mary (Go Tribe!). Her research focuses on language variation, especially the use of varieties such as African American English, in the classroom. If you know any teachers, they might find her two books on language variation in the classroom a useful resource. She and Christine Mallinson have even released an app to go with them!

Michel DeGraff

Dr. Michel DeGraff is a professor at MIT. His research is on Haitian Creole, and he’s been very active in advocating for the official recognition of Haitian Creole as a distinct language. If you’re not sure what Haitian Creole looks like, go check out his Twitter; many of his tweets are in the language! He’s also done some really cool work on using technology to teach low-resource languages.

Nelson Flores

Dr. Nelson Flores is a professor at the University of Pennsylvania. His work focuses on how we create the ideas of race and language, as well as bilingualism/multilingualism and bilingual education. I really enjoy his thought-provoking discussions of recent events on his Twitter account. He also runs a blog, which is a good resource for more in-depth discussion.

Nicole Holliday

Dr. Nicole Holliday is (at the moment) a Chau Mellon Postdoctoral Scholar at Pomona College. Her research focuses on language use by biracial speakers. I saw her talk at last year’s LSA meeting on how speakers use pitch differently depending on who they’re talking to, and it was fantastic: I’m really looking forward to seeing her future work! She’s also a contributor to Word., an online journal about African American English.

Rupal Patel

Dr. Rupal Patel is a professor at Northeastern University, and also the founder and CEO of VocaliD. Her research focuses on the speech of speakers with developmental disabilities, and how technology can ease communication for them. One really cool project she’s working on that you can get involved with is The Human Voicebank. This is a collection of voices from all over the world that is used to make custom synthetic voices for those who need them for day-to-day communication. If you’ve got a microphone and a quiet room you can help out by recording and donating your voice.

John R. Rickford

Last, but definitely not least, is Dr. John Rickford, a professor at Stanford. If you’ve taken any linguistics courses, you’re probably already familiar with his work. He’s one of the leading scholars working on African American English and was crucial in bringing research-based evidence to bear on the Ebonics controversy. If you’re interested, he’s also written a non-academic book on African American English that I highly recommend; it even won the American Book Award!

What’s a “bumpus”?

So I recently had a pretty disconcerting experience. It turns out that almost no one else has heard of a word that I thought was pretty common. And when I say “no one” I’m including dialectologists; it’s unattested in the Oxford English Dictionary and the Dictionary of American Regional English. Out of the twenty-two people who responded to my Twitter poll (which was probably mostly other linguists, given my social networks), only one other person said they’d even heard the word and, as I later confirmed, it turned out to be one of my college friends.

So what is this mysterious word that has so far evaded academic inquiry? Ladies, gentlemen and all others, please allow me to introduce you to…


Bumpus! Pronounced ˈbʌm.pɪs or ˈbʌm.pəs. You can hear me say the word and use it in context by listening to this low-quality recording.

The word means something like “fool” or “incompetent person”. To prove that this is actually a real word that people other than me use, I’ve (very, very laboriously) found some examples from the internet. It shows up in the comments section of this news article:

THAT is why people are voting for Mr Trump, even if he does act sometimes like a Bumpus.

I also found it in a smattering of public tweets like this one:

If you ever meet my dad, please ask him what a “bumpus” is

And this one:

Having seen horror of war, one would think, John McCain would run from war. No, he runs to war, to get us involved. What a bumpus.

And, my personal favorite, this one:

because the SUN(in that pic) is wearing GLASSES god karen ur such a bumpus

There’s also an Urban Dictionary entry which suggests the definition:

A raucous, boisterous person or thing (usually african-american.)

I’m a little skeptical about the last one, though. Partly because it doesn’t line up with my own intuitions (I feel like a bumpus is more likely to be silent than rowdy) and partly because less popular Urban Dictionary entries, especially for words that are also names, are super unreliable.

I also wrote to my parents (Hi mom! Hi dad!) and asked them if they’d used the word growing up, in what contexts, and who they’d learned it from. My dad confirmed that he’d heard it growing up (mom hadn’t) and had a suggestion for where it might have come from:

I am pretty sure my dad used it – invariably in one of the two phrases [“don’t be a bumpus” or “don’t stand there like a bumpus”]….  Bumpass, Virginia is in Lousia County …. Growing up in Norfolk, it could have held connotations of really rural Virginia, maybe, for Dad.

While this is definitely a possibility, I’m not sure it’s the origin of the word. Bumpass, Virginia, like Bumpass Hell (see this review, which also includes the phrase “Don’t be a bumpass”), was named for an early settler. Interestingly, the college friend mentioned earlier is also from the Tidewater region of Virginia, which leads me to think that the word may have originated there.

My mom offered some other possible origins, that the term might be related to “country bumpkin” or “bump on a log”. I think the latter is especially interesting, given that “bump on a log” and “bumpus” show up in exactly the same phrase: standing/sitting there like a _______.

She also suggested it might be related to “bumpkis” or “bupkis”. This is a possibility, especially since that word is definitely from Yiddish and Norfolk, VA does have a history of Jewish settlement and Yiddish speakers.

The most common usage of “Bumpus” online seems to be in phrases like “Bumpus dog” or “Bumpus hound”. I think this is probably a different use, though, and a direct reference to a scene from the movie A Christmas Story:

One final note is that there was a baseball pitcher in the late 1890s who went by the nickname “Bumpus”: Bumpus Jones. While I can’t find any information about where the nickname came from, this post suggests that his family was from Virginia and that he had Powhatan ancestry.

I’m really interested in learning more about this word and its distribution. My intuition is that it’s mainly used by older, white speakers in the South, possibly centered on the Tidewater region of Virginia.

If you’ve heard of or used this word, please leave a comment or drop me a line letting me know 1) roughly how old you are, 2) where you grew up and 3) (if you can remember) where you learned it. Feel free to add any other information you feel might be relevant, too!


Can you configure speech recognition for a specific speaker?

James had an interesting question based on one of my earlier posts on gender differences in speech recognition:

Is there a voice recognition product that is focusing on women’s voices or allows for configuring for women’s voices (or the characteristics of women’s voices)?

I don’t know of any ASR systems specifically designed for women. But the answer to the second half of your question is yes!


There are two main types of automatic speech recognition, or ASR, systems. The first is speaker independent. These are systems, like YouTube automatic captions or Apple’s Siri, that should work equally well across a large number of different speakers. Of course, as many other researchers have found and I corroborated in my own investigation, that’s not always the case. A major reason for this is socially-motivated variation between speakers. This is something we all know as language users. You can guess (with varying degrees of accuracy) a lot about someone from just their voice: their sex, whether they’re young or old, where they grew up, how educated they are, how formal or casual they’re being.

So what does this mean for speech recognition? Well, while different speakers speak in a lot of different ways, individual speakers tend to use less variation. (With the exception of bidialectal speakers, like John Barrowman.) Which brings me nicely to the second type of speech recognition: speaker dependent. These are systems that are designed to work for one specific speaker, and usually to adapt and get more accurate for that speaker over time.

In some of my earlier posts, I suggested that the difference in performance between dialects and genders was due to imbalances in the training data. The nice thing about speaker dependent systems is that the training data is made up of one voice: yours. (Although the system is usually initialized from some other training set.)

So how can you get a speaker dependent ASR system?

  • By buying software such as Dragon speech recognition. This is probably the most popular commercial speaker-dependent voice recognition software (or at least the one I hear the most about). It does, however, cost real money.
  • Making your own! If you’re feeling inspired, you can make your own personalized ASR system. I’d recommend the CMU Sphinx toolkit; it’s free and well-documented. To make your own recognizer, you’ll need to build your own language model using text you’ve written as well as adapt the acoustic model using your recorded speech. The former lets the recognizer know what words you’re likely to say, and the latter how you say things. (If you’re REALLY gung-ho you can even build your own acoustic model from scratch, but that’s pretty involved.)

In theory, the bones of any ASR system should work equally well on any spoken human language. (Sign language recognition is a whole nother kettle of fish.) The difficulty is getting large amounts of (socially stratified) high-quality training data. By feeding a system data without a lot of variation, for example by using only one person’s voice, you can usually get more accurate recognition more quickly.



What sounds can you feel but not hear?

I got a cool question from Veronica the other day: 

Which wavelength someone would use not to hear but feel it on the body as a vibration?

So this would depend on two things. The first is your hearing ability. If you’ve got no or limited hearing, most of your interaction with sound will be tactile. This is one of the reasons why many Deaf individuals enjoy going to concerts; if the sound is loud enough you’ll be able to feel it even if you can’t hear it. I’ve even heard stories about folks who will take balloons to concerts to feel the vibrations better. In this case, it doesn’t really depend on the pitch of the sound (how high or low it is), just the volume.

But let’s assume that you have typical hearing. In that case, the relationship between pitch, volume and whether you can hear or feel a sound is a little more complex. This is due to something called “frequency response”. Basically, the human ear is better tuned to hearing some pitches than others. We’re really sensitive to sounds in the upper ranges of human speech (roughly 2k to 4k Hz). (The lowest pitch in the vocal signal can actually be much lower [down to around 80 Hz for a really low male voice], but it’s less important to be able to hear it because that frequency is also reflected in harmonics up through the entire pitch range of the vocal signal. Most telephones only transmit signals between 300 Hz and 3400 Hz, for example, and it’s really only the cut-off at the upper end of the range that causes problems–like making it hard to tell the difference between “sh” and “s”.)
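To make the harmonics point concrete, here’s a toy calculation (using the numbers above, not any real phone spec) showing which harmonics of an 80 Hz voice make it through a 300–3400 Hz telephone band:

```python
f0 = 80                        # Hz, fundamental of a really low male voice
low_cut, high_cut = 300, 3400  # Hz, typical telephone passband

harmonics = [f0 * n for n in range(1, 50)]  # 80, 160, 240, 320, ...
transmitted = [h for h in harmonics if low_cut <= h <= high_cut]

# The 80 Hz fundamental itself is filtered out, but dozens of its
# harmonics survive, so listeners can still recover the pitch.
print(transmitted[0], transmitted[-1], len(transmitted))  # prints: 320 3360 39
```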

The takeaway from all this is that we’re not super good at hearing very low sounds. That means they can be very, very loud before we pick up on them. If the sound is low enough and loud enough, then the only way we’ll be able to sense it is by feeling it.

How low is low enough? Most people can’t really hear anything much below 20 Hz (like the lowest note on a really big organ). The older you are and the more you’ve been exposed to really loud noises in that range, like bass-heavy concerts or explosions, the less you’ll be able to pick up on those really low sounds.

What about volume? My guess for “sufficiently loud”, in this case, is 120+ dB. 120 dB is as loud as a rock concert, and it’s possible, although difficult and expensive, to get that out of a home speaker set-up. If you have a neighbor listening to really bass-y music or watching action movies with a lot of low, booming sound effects on really expensive speakers, it’s perfectly possible that you’d feel those vibrations rather than hearing them. Especially if there are walls between the speakers and you. While mid and high frequency sounds are pretty easy to muffle, low-frequency sounds are much more difficult to soundproof against.

Are there any health risks? The effects of exposure to these types of low-frequency noise are actually something of an active research question. (You may have heard about the “brown note”, for example.) You can find a review of some of that research here. One comforting note: if you are exposed to a very loud sound below the frequencies you can easily hear–even if it’s loud enough to cause permanent damage at much higher frequencies–it’s unlikely that you will suffer any permanent hearing loss. That doesn’t mean you shouldn’t ask your neighbor to turn down the volume, though; for their ears if not for yours!

Are there differences in automatic caption error rates due to pitch or speech rate?

So after my last blog post went up, a couple people wondered if the difference in classification error rates between men and women might be due to pitch, since men tend to have lower voices. I had no idea, so, being experimentally inclined, I decided to find out.

First, I found the longest list of words that I could from the accent tag. Pretty much every video I looked at used a subset of these words.

Aunt, Roof, Route, Wash, Oil, Theater, Iron, Salmon, Caramel, Fire, Water, Sure, Data, Ruin, Crayon, New Orleans, Pecan, Marriage, Both, Again, Probably, Spitting Image, Alabama, Guarantee, Lawyer, Coupon, Mayonnaise, Ask, Potato, Three, Syrup, Cool Whip, Pajamas, Caught, Catch, Naturally, Car, Aluminium, Envelope, Arizona, Waffle, Auto, Tomato, Figure, Eleven, Atlantic, Sandwich, Attitude, Officer, Avocado, Saw, Bandana, Oregon, Twenty, Halloween, Quarter, Muslim, Florida, Wagon

Then I recorded myself reading them at a natural pace, with list intonation. In order to better match the speakers in the other YouTube videos, I didn’t go into the lab and break out the good microphones; I just grabbed my gaming headset and used that mic. Then I used Praat (a free, open source software package for phonetics) to shift the pitch of the whole file up and down 60 Hertz in 20 Hertz intervals. That left me with seven total sound files: the original one, three files that were 20, 40 and 60 Hertz higher, and finally three files that were 20, 40 and 60 Hertz lower. You can listen to all the files individually here.

The original recording had a mean of 192 Hz and a median of 183 Hz, which means that my voice is slightly lower pitched than average for an American English speaking woman. For reference, Pepiot 2014 found a mean pitch of 210 Hz for female American English speakers. The same paper also lists a mean pitch of 119 Hz for male American English speakers. This means that my lowest pitch manipulation (mean of 132 Hz) is still higher than the average American English speaking male. I didn’t want to go much lower with my pitch manipulations, though, because the sound files were starting to sound artifact-y and robotic.
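Just to spell out the seven manipulation targets (this is only the arithmetic on the numbers above; the actual shifting was done in Praat):

```python
original_mean = 192          # Hz, mean pitch of the original recording
shifts = range(-60, 61, 20)  # -60, -40, -20, 0, +20, +40, +60 Hz

shifted_means = [original_mean + s for s in shifts]
print(shifted_means)  # prints: [132, 152, 172, 192, 212, 232, 252]
```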

Why did I do things this way?

  • Only using one recording. This lets me control 100% for demographic information. I’m the same person, with the same language background, saying the same words in the same way. If I’d picked a bunch of speakers with different pitches, they’d also have different language backgrounds and voices. Plus I’m not getting effects from using different microphones.
  • Manipulating pitch both up and down. This was for two reasons. First, it means that the original recording isn’t the end-point for the pitch continuum. Second, it means that we can pick apart whether accuracy is a function of pitch or just the file having been manipulated.


You can check out how well the auto-captions did yourself by checking out this video. Make sure to hit the CC button in the lower left-hand corner.

The first thing I noticed was that I had really, really good results with the auto captions. Waaayyyy better than any of the other videos I looked at. There were nine errors across 434 tokens, for a total error rate of only 2%, which I’d call pretty much at ceiling. There was maaayybe a slight effect of the pitch manipulation, with higher pitches having slightly higher error rates, as you can see:


BUT there’s also sort of a u-shaped curve, which suggests to me that the recognizer is doing worse with the files that have been messed with the most. (Although, weirdly, only the file that had had its pitch shifted up by 20 Hz had no errors.) I’m going to go ahead and say that I’m not convinced that pitch is a determining factor.

So why were these captions so much better than the ones I looked at in my last post? It could just be that I was talking very slowly and clearly. To check that out, I looked at autocaptions for the most recent video posted by someone who’s fairly similar to me in terms of social and vocal characteristics: a white woman who speaks standardized American English with Southern features. Ideally I’d match for socioeconomic class, education and rural/urban background as well, but those are harder to get information about.

I chose Bunny Meyer, who posts videos as Grav3yardgirl. In this video her speech style is fast and conversational, as you can hear for yourself:

To make sure I had roughly the same amount of data as before, I checked the captions for the first 445 words, which was about two minutes’ worth of video (you can check my work here). There was an overall error rate of approximately 8%, if you count skipped words as errors. Considering that recognizing words in fast, connected speech is generally more error-prone, that’s pretty good. It’s definitely better than in the videos I analyzed for my last post. It’s also a fairly small difference from my careful speech: definitely less than the 13% difference I found for gender.
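If you want to check whether a 2% vs. 8% gap like this is bigger than you’d expect by chance, a two-proportion z-test is one quick option. Here’s a pure-Python sketch; note that the 36-error count for Bunny’s video is back-calculated from the ~8% rate over 445 words, not an exact count, so treat the result as illustrative:

```python
import math

def two_proportion_z(errors_a, n_a, errors_b, n_b):
    """z statistic for the difference between two error proportions,
    using the pooled estimate under the null of no difference."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# My careful speech: 9 errors / 434 tokens.
# Bunny's conversational speech: ~36 errors / 445 words (back-calculated).
z = two_proportion_z(9, 434, 36, 445)
print(f"z = {z:.2f}")
```

A |z| above roughly 2 would be conventionally significant, so even a gap that looks small in percentage points can be statistically reliable with a few hundred words per condition.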

So it looks like neither the speed of speech nor the pitch are strongly affecting recognition rate (at least for videos captioned recently). There are a couple other things that I think may be going on here that I’m going to keep poking at:

  • ASR has gotten better over time. It’s totally possible that more women just did the accent tag challenge earlier, and thus had higher error rates because the speech recognition system was older and less accurate. I’m going to go back and tag my dataset for date, though, and see if that shakes out some of the gender differences.
  • Being louder may be important, especially in less clear recordings. I used a head-mounted microphone in a quiet room to make my recordings, and I’m assuming that Bunny uses professional recording equipment. If you’re recording outside or with a device microphone, though, there’s going to be a lot more noise. If your voice is louder, and men’s voices tend to be, it should be easier to understand in noise. My intuition is that, since there are gender differences in how loud people talk, some of the error may be due to intensity differences in noisy recordings. On the other hand, an earlier study found no difference in speech recognition rates for men and women in airplane cockpits, which are very noisy, so who knows? Testing that out will have to wait for another day, though.

Google’s speech recognition has a gender bias

In my last post, I looked at how Google’s automatic speech recognition worked with different dialects. To get this data, I hand-checked the annotations for more than 1,500 words from fifty different accent tag videos.

Now, because I’m a sociolinguist and I know that it’s important to stratify your samples, I made sure I had an equal number of male and female speakers for each dialect. And when I compared performance on male and female talkers, I found something deeply disturbing: YouTube’s auto captions consistently performed better on male voices than female voices (t(47) = -2.7, p < 0.01). (You can see my data and analysis here.)
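For anyone who wants to run this kind of comparison on their own caption data, here’s a minimal pure-Python version of Welch’s two-sample t statistic (the per-speaker accuracy lists at the bottom are hypothetical placeholders, not my actual data):

```python
import math

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two
    independent samples. Unlike Student's t, it doesn't assume the two
    groups have equal variance, which is why the df can come out as an
    adjusted (non-obvious) number."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    se_sq = var_a / n_a + var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se_sq)
    df = se_sq ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df

# Hypothetical per-speaker caption accuracies:
female = [0.42, 0.51, 0.47, 0.45, 0.49, 0.44]
male = [0.58, 0.62, 0.55, 0.61, 0.64, 0.57]
t, df = welch_t(female, male)
print(f"t({df:.0f}) = {t:.2f}")
```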


On average, less than half (47%) of each female speaker’s words were captioned correctly. The average male speaker, on the other hand, was captioned correctly 60% of the time.

It’s not a consistent-but-small effect, either: 13% is a pretty big difference. The Cohen’s d was 0.7, which means, in non-math-speak, that if you pick a random man and a random woman from my sample, there’s an almost 70% chance the transcription will be more accurate for the man. That’s pretty striking.
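That “almost 70%” figure comes from converting Cohen’s d into the common-language effect size, sometimes called the probability of superiority: for two normal distributions with equal variance it’s Φ(d/√2), where Φ is the standard normal CDF. A quick sanity check in Python:

```python
import math

def probability_of_superiority(d):
    """Chance that a random draw from the higher-scoring group beats a
    random draw from the lower-scoring one, given Cohen's d. Assumes
    both groups are normal with equal variance.
    Phi(d / sqrt(2)) simplifies to 0.5 * (1 + erf(d / 2))."""
    return 0.5 * (1 + math.erf(d / 2))

print(f"{probability_of_superiority(0.7):.1%}")  # about 69%
```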

What it is not, unfortunately, is shocking. There’s a long history of speech recognition technology performing better for men than women.

This is a real problem with real impacts on people’s lives. Sure, a few incorrect Youtube captions aren’t a matter of life and death. But some of these applications have much higher stakes. Take the medical dictation software study. The fact that men enjoy better performance than women with these technologies means that it’s harder for women to do their jobs. Even if it only takes a second to correct an error, those seconds add up over days and weeks into a major time sink, time your male colleagues aren’t spending fighting with technology. And that’s not even touching on the safety implications of voice recognition in cars.


So where is this imbalance coming from? First, let me make one thing clear: the problem is not with how women talk. The suggestion that, for example, “women could be taught to speak louder, and direct their voices towards the microphone” is ridiculous. In fact, women use speech strategies that should make it easier for voice recognition technology to work on women’s voices. Women tend to be more intelligible (for people without high-frequency hearing loss), and to talk slightly more slowly. In general, women also favor more standard forms and make less use of stigmatized variants. Women’s vowels, in particular, lend themselves to classification: women produce longer vowels that are more distinct from each other than men’s are. (Edit 7/28/2016: I have since found two papers by Sharon Goldwater, Dan Jurafsky and Christopher D. Manning where they found better performance for women than men, due to the above factors and different rates of filler words like “um” and “uh”.)

One thing that may be making a difference is that women also tend not to be as loud, partly as a function of just being smaller, and cepstral features (the fancy math that’s under the hood of most automatic voice recognition) are sensitive to differences in intensity. None of this means that women’s voices are more difficult; I’ve trained classifiers on speech data from women and they worked just fine, thank you very much. What it does mean is that women’s voices are different from men’s voices, so a system designed around men’s voices just won’t work as well for women’s.

Which leads right into where I think this bias is coming from: unbalanced training sets. Like car crash dummies, voice recognition systems were designed for (and largely by) men. Over two-thirds of the authors in the Association for Computational Linguistics Anthology Network are male, for example. Which is not to say that there aren’t truly excellent female researchers working in speech technology (Mari Ostendorf and Gina-Anne Levow here at the UW and Karen Livescu at TTI-Chicago spring immediately to mind), but they’re outnumbered. And that imbalance seems to extend to the training sets, the annotated speech that’s used to teach automatic speech recognition systems what things should sound like. Voxforge, for example, is a popular open source speech dataset that “suffers from major gender and per speaker duration imbalances.” I had to get that info from another paper, since Voxforge doesn’t have speaker demographics available on their website. And it’s not the only popular corpus that doesn’t include speaker demographics: neither does the AMI meeting corpus, nor the Numbers corpus. And when I could find the numbers, they weren’t balanced for gender. TIMIT, which is the single most popular speech corpus in the Linguistic Data Consortium, is just over 69% male. I don’t know what speech database the Google speech recognizer is trained on, but based on the speech recognition rates by gender I’m willing to bet that it’s not balanced for gender either.

Why does this matter? It matters because there are systematic differences between men’s and women’s speech. (I’m not going to touch on the speech of other genders here, since that’s a very young research area. If you’re interested, the Journal of Language and Sexuality is a good jumping-off point.) And machine learning works by making computers really good at dealing with things they’ve already seen a lot of. If they get a lot of speech from men, they’ll be really good at identifying speech from men. If they don’t get a lot of speech from women, they won’t be that good at identifying speech from women. And it looks like that’s the case. Based on my data from fifty different speakers, Google’s speech recognition (which, if you remember, is probably the best-performing proprietary automatic speech recognition system on the market) just doesn’t work as well for women as it does for men.

Which accents does automatic speech recognition work best for?

If your primary dialect is something other than Standardized American English (that sort of from-the-US-but-not-anywhere-in-particular type of English you hear a lot of on the news) you may have noticed that speech recognition software doesn’t generally work very well for you. You can see the sort of thing I’m talking about in this clip:

This clip is a little old, though (2010). Surely voice recognition technology has improved since then, right? I mean, we’ve got more data and more computing power than ever. Surely somebody’s gotten around to making sure that the current generation of voice-recognition software deals equally well with different dialects of English. Especially given that those self-driving cars that everyone’s so excited about are probably going to use voice-based interfaces.

To check, I spent some time on Youtube looking at the accuracy of automatic captions for videos of the accent tag challenge, which was developed by Bert Vaux. I picked Youtube automatic captions because they’re done with Google’s automatic speech recognition technology, which is one of the most accurate commercial systems out there right now.

Data: I picked videos with accents from Maine (U.S.), Georgia (U.S.), California (U.S.), Scotland and New Zealand. I picked these locations because they’re pretty far from each other and also have pretty distinct regional accents. All speakers from the U.S. were (by my best guess) white and all looked to be young-ish. I’m not great at judging age, but I’m pretty confident no one was above fifty or so.

What I did: For each location, I checked the accuracy of the automatic captions on the word-list part of the challenge for five male and five female speakers. So I have data for a total of 50 people across 5 dialect regions. For each word in the word list, I marked it as “correct” if the entire word was correctly captioned on the first try. Anything else was marked wrong. To be fair, the words in the accent tag challenge were specifically chosen because they have a lot of possible variation. On the other hand, they’re single words spoken in isolation, which is pretty much the best case scenario for automatic speech recognition, so I think it balances out.
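That scoring scheme is simple enough to sketch in a few lines of Python. The word lists here are made up for illustration; a word like “aunt” captioned as “ant” counts as wrong, since the entire word has to be right on the first try:

```python
def score_word_list(reference, captioned):
    """Mark each word 'correct' only if the whole word matched on the
    first try; anything else counts as wrong. Inputs are parallel
    lists of words (reference = what was said, captioned = the
    auto-caption output)."""
    correct = sum(
        ref.lower() == cap.lower() for ref, cap in zip(reference, captioned)
    )
    return correct, len(reference)

# Hypothetical example:
reference = ["aunt", "roof", "route", "wash", "oil"]
captioned = ["ant", "roof", "route", "wash", "oil"]
correct, total = score_word_list(reference, captioned)
print(f"{correct}/{total} correct ({correct / total:.0%})")  # 4/5 correct (80%)
```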

Ok, now the part you’ve all been waiting for: the results. Which dialects fared better and which worse? Does dialect even matter? First the good news: based on my (admittedly pretty small) sample, the effect of dialect is so weak that you’d have to be really generous to call it reliable. A linear model that estimated number of correct classifications based on total number of words, speaker’s gender and speaker’s dialect area fared only slightly better (p = 0.08) than one that didn’t include dialect area. Which is great! No effect means dialect doesn’t matter, right?

Weellll, not really. Based on a power analysis, I really should have sampled forty people from each dialect, not ten. Unfortunately, while I love y’all and also the search for knowledge, I’m not going to hand-annotate two hundred Youtube videos for a side project. (If you’d like to add data, though, feel free to fork the dataset on Github here. Just make sure to check the URL for the video you’re looking at so we don’t double dip.)
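As a quick intuition for where a forty-per-group target comes from: for a two-sample t-test at α = 0.05 and 80% power, Lehr’s rule of thumb says you need roughly 16 / d² participants per group, where d is the standardized effect size you expect. This back-of-the-envelope version isn’t the exact calculation I ran, but it gives the right order of magnitude:

```python
import math

def lehr_n_per_group(d):
    """Lehr's rule of thumb: n ~= 16 / d^2 per group for 80% power at
    a two-sided alpha of 0.05 in a two-sample t-test."""
    return math.ceil(16 / d ** 2)

# Smaller expected effects need many more speakers per group:
for d in (0.3, 0.5, 0.63, 0.8):
    print(f"d = {d:0.2f}  ->  n = {lehr_n_per_group(d)} per group")
```

An expected effect somewhere in the d ≈ 0.6 range lands right around forty speakers per dialect.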

So while I can’t confidently state there is an effect, based on the fact that I’m sort of starting to get one with only a quarter of the amount of data I should be using, I’m actually pretty sure there is one. No one’s enjoying stellar performance (there’s a reason that they tend to be called AutoCraptions in the Deaf community) but some dialect areas are doing better than others. Look at this chart of accuracy by dialect region:


Proportion of correctly recognized words by dialect area, color coded by country.

There’s variation, sure, but in general the recognizer seems to be working best on people from California (which just happens to be where Google is headquartered) and worst on Scottish English. The big surprise for me is how well the recognizer works on New Zealand English, especially compared to Scottish English. It’s not a function of country population (NZ = 4.4 million, Scotland = 5.2 million). My guess is that it might be due to sample bias in the training sets, especially if, say, there were some ’90s TV shows in there; there’s a lot of captioned New Zealand English in Hercules, Xena and related spin-offs. There’s also a Google outreach team in New Zealand, but not Scotland, so that might be a factor as well.

So, unfortunately, it looks like the lift skit may still be current. ASR still works better for some dialects than others. And, keep in mind, these are all native English speakers! I didn’t look at non-native English speakers, but I’m willing to bet the system is letting them down too. Which is a shame; it’s a pity that how well voice recognition works for you is still dependent on where you’re from. Maybe in another six years I’ll be able to write a blog post saying it isn’t.