Can you configure speech recognition for a specific speaker?

James had an interesting question based on one of my earlier posts on gender differences in speech recognition:

Is there a voice recognition product that is focusing on women’s voices or allows for configuring for women’s voices (or the characteristics of women’s voices)?

I don’t know of any ASR systems specifically designed for women. But the answer to the second half of your question is yes!


There are two main types of automatic speech recognition, or ASR, systems. The first is speaker independent. These are systems, like YouTube automatic captions or Apple’s Siri, that should work equally well across a large number of different speakers. Of course, as many other researchers have found and I corroborated in my own investigation, that’s not always the case. A major reason for this is socially-motivated variation between speakers. This is something we all know as language users. You can guess (with varying degrees of accuracy) a lot about someone from just their voice: their sex, whether they’re young or old, where they grew up, how educated they are, how formal or casual they’re being.

So what does this mean for speech recognition? Well, while different speakers speak in a lot of different ways, individual speakers tend to use less variation. (With the exception of bidialectal speakers, like John Barrowman.) Which brings me nicely to the second type of speech recognition: speaker dependent. These are systems that are designed to work for one specific speaker, and usually to adapt and get more accurate for that speaker over time.

If you read some of my earlier posts, I suggested that the difference in performance between dialects and genders was due to imbalances in the training data. The nice thing about speaker dependent systems is that the training data is made up of one voice: yours. (Although the system is usually initialized based on some other training set.)

So how can you get a speaker dependent ASR system?

  • By buying software such as Dragon speech recognition. This is probably the most popular commercial speaker-dependent voice recognition software (or at least the one I hear the most about). It does, however, cost real money.
  • Making your own! If you’re feeling inspired, you can make your own personalized ASR system. I’d recommend the CMU Sphinx toolkit; it’s free and well-documented. To make your own recognizer, you’ll need to build your own language model using text you’ve written as well as adapt the acoustic model using your recorded speech. The former lets the recognizer know what words you’re likely to say, and the latter how you say things. (If you’re REALLY gung-ho you can even build your own acoustic model from scratch, but that’s pretty involved.)
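To make the division of labor concrete, here’s a toy sketch (in Python rather than the actual Sphinx toolchain, whose lmtool and SphinxTrain utilities handle this for real) of how a language model built from your own writing nudges a recognizer toward the words you actually use. The words, scores, and the rescoring function are all invented for illustration:

```python
from collections import Counter

def build_unigram_lm(personal_text):
    """Estimate P(word) from text the user has written. A real toolkit
    builds smoothed n-gram models; this toy version is unigram-only."""
    words = personal_text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def rescore(candidates, lm, lm_weight=0.5):
    """Pick the best (word, acoustic_score) pair after mixing in the
    language model. Unseen words get a tiny floor probability, so the
    recognizer prefers words found in the personal corpus."""
    floor = 1e-6
    return max(
        candidates,
        key=lambda c: (1 - lm_weight) * c[1] + lm_weight * lm.get(c[0], floor),
    )[0]

# Hypothetical personal corpus and acoustic scores.
lm = build_unigram_lm("praat script praat pitch praat formants script")
# The acoustic model alone slightly prefers "prat", but the personal
# language model knows this writer types "praat" constantly.
best = rescore([("prat", 0.51), ("praat", 0.49)], lm)
print(best)  # praat
```

In a real speaker-dependent system the acoustic scores would come from the adapted acoustic model, and the language model would be a smoothed n-gram model rather than raw unigram counts.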

In theory, the bones of any ASR system should work equally well on any spoken human language. (Sign language recognition is a whole nother kettle of fish.) The difficulty is getting large amounts of (socially stratified) high-quality training data. By feeding a system data without a lot of variation, for example by using only one person’s voice, you can usually get more accurate recognition more quickly.



What sounds can you feel but not hear?

I got a cool question from Veronica the other day: 

Which wavelength someone would use not to hear but feel it on the body as a vibration?

So this would depend on two things. The first is your hearing ability. If you’ve got no or limited hearing, most of your interaction with sound will be tactile. This is one of the reasons why many Deaf individuals enjoy going to concerts; if the sound is loud enough you’ll be able to feel it even if you can’t hear it. I’ve even heard stories about folks who will take balloons to concerts to feel the vibrations better. In this case, it doesn’t really depend on the pitch of the sound (how high or low it is), just the volume.

But let’s assume that you have typical hearing. In that case, the relationship between pitch, volume and whether you can hear or feel a sound is a little more complex. This is due to something called “frequency response”. Basically, the human ear is better tuned to hearing some pitches than others. We’re really sensitive to sounds in the upper ranges of human speech (roughly 2 to 4 kHz). (The lowest pitch in the vocal signal can actually be much lower [down to around 80 Hz for a really low male voice] but it’s less important to be able to hear it because that frequency is also reflected in harmonics up through the entire pitch range of the vocal signal. Most telephones only transmit signals between 300 Hz and 3400 Hz, for example, and it’s only really the cut-off at the upper end of the range that causes problems–like making it hard to tell the difference between “sh” and “s”.)
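To see why the telephone band still carries the pitch of a low voice, here’s a quick sketch that lists which harmonics of an 80 Hz fundamental survive a 300–3400 Hz channel (the band limits come from the paragraph above; the 4000 Hz search cap is just an arbitrary stopping point):

```python
def surviving_harmonics(f0, band=(300.0, 3400.0), cap=4000.0):
    """Harmonics of a voice with fundamental f0 (in Hz) that fall
    inside a transmission band; cap just bounds the search."""
    lo, hi = band
    return [n * f0 for n in range(1, int(cap / f0) + 1) if lo <= n * f0 <= hi]

# An 80 Hz fundamental itself is cut off by the telephone band, but
# dozens of its harmonics get through, so the pitch is still perceived.
harmonics = surviving_harmonics(80.0)
print(harmonics[0], harmonics[-1], len(harmonics))  # 320.0 3360.0 39
```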

The takeaway from all this is that we’re not super good at hearing very low sounds. That means they can be very, very loud before we pick up on them. If the sound is low enough and loud enough, then the only way we’ll be able to sense it is by feeling it.

How low is low enough? Most people can’t really hear anything much below 20 Hz (like the lowest note on a really big organ). The older you are and the more you’ve been exposed to really loud noises in that range, like bass-heavy concerts or explosions, the less you’ll be able to pick up on those really low sounds.

What about volume? My guess for what would be “sufficiently loud”, in this case, is 120+ dB. 120 dB is as loud as a rock concert, and it’s possible, although difficult and expensive, to get out of a home speaker set-up. If you have a neighbor listening to really bass-y music or watching action movies with a lot of low, booming sound effects on really expensive speakers, it’s perfectly possible that you’d feel those vibrations rather than hearing them. Especially if there are walls between the speakers and you. While mid and high frequency sounds are pretty easy to muffle, low-frequency sounds are much more difficult to soundproof against.

Are there any health risks? The effects of exposure to these types of low-frequency noise are actually something of an active research question. (You may have heard about the “brown note”, for example.) You can find a review of some of that research here. One comforting note: if you are exposed to a very loud sound below the frequencies you can easily hear–even if it’s loud enough to cause permanent damage at much higher frequencies–it’s unlikely that you will suffer any permanent hearing loss. That doesn’t mean you shouldn’t ask your neighbor to turn down the volume, though; for their ears if not for yours!

Are there differences in automatic caption error rates due to pitch or speech rate?

So after my last blog post went up, a couple people wondered if the difference in classification error rates between men and women might be due to pitch, since men tend to have lower voices. I had no idea, so, being experimentally inclined, I decided to find out.

First, I found the longest list of words that I could from the accent tag. Pretty much every video I looked at used a subset of these words.

Aunt, Roof, Route, Wash, Oil, Theater, Iron, Salmon, Caramel, Fire, Water, Sure, Data, Ruin, Crayon, New Orleans, Pecan, Marriage, Both, Again, Probably, Spitting Image, Alabama, Guarantee, Lawyer, Coupon, Mayonnaise, Ask, Potato, Three, Syrup, Cool Whip, Pajamas, Caught, Catch, Naturally, Car, Aluminium, Envelope, Arizona, Waffle, Auto, Tomato, Figure, Eleven, Atlantic, Sandwich, Attitude, Officer, Avocado, Saw, Bandana, Oregon, Twenty, Halloween, Quarter, Muslim, Florida, Wagon

Then I recorded myself reading them at a natural pace, with list intonation. In order to better match the speakers in the other Youtube videos, I didn’t go into the lab and break out the good microphones; I just grabbed my gaming headset and used that mic. Then, I used Praat (a free, open source software package for phonetics) to shift the pitch of the whole file up and down 60 Hertz in 20 Hertz intervals. That left me with seven total sound files: the original one, three files that were 20, 40 and 60 Hertz higher and finally three files that were 20, 40 and 60 Hertz lower. You can listen to all the files individually here.
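The arithmetic behind the seven files is simple; this little sketch (assuming, as a fixed-Hz pitch shift in Praat does, that the whole contour moves uniformly) enumerates the mean pitch of each manipulated file:

```python
def shifted_means(original_mean_hz, step_hz=20, n_steps=3):
    """Mean pitch of each manipulated file, assuming the whole pitch
    contour is shifted uniformly by a fixed number of Hertz."""
    return {
        f"{k * step_hz:+d} Hz": original_mean_hz + k * step_hz
        for k in range(-n_steps, n_steps + 1)
    }

# Original recording had a mean pitch of 192 Hz.
files = shifted_means(192)
print(files["-60 Hz"], files["+0 Hz"], files["+60 Hz"])  # 132 192 252
```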

The original recording had a mean of 192 Hz and a median of 183 Hz, which means that my voice is slightly lower pitched than average for an American English-speaking woman. For reference, Pepiot 2014 found a mean pitch of 210 Hz for female American English speakers. The same paper also lists a mean pitch of 119 Hz for male American English speakers. This means that my lowest pitch manipulation (mean of 132 Hz) is still higher than the average American English-speaking male. I didn’t want to go too much lower with my pitch manipulations, though, because the sound files were starting to sound artifact-y and robotic.

Why did I do things this way?

  • Only using one recording. This lets me control 100% for demographic information. I’m the same person, with the same language background, saying the same words in the same way. If I’d picked a bunch of speakers with different pitches, they’d also have different language backgrounds and voices. Plus I’m not getting effects from using different microphones.
  • Manipulating pitch both up and down. This was for two reasons. First, it means that the original recording isn’t the end-point for the pitch continuum. Second, it means that we can pick apart whether accuracy is a function of pitch or just the file having been manipulated.


You can check out how well the auto-captions did yourself by checking out this video. Make sure to hit the CC button in the lower left-hand corner.

The first thing I noticed was that I had really, really good results with the auto captions. Waaayyyy better than any of the other videos I looked at. There were nine errors across 434 tokens, for a total error rate of only 2%, which I’d call pretty much at ceiling. There was maaayybe a slight effect of the pitch manipulation, with higher pitches having slightly higher error rates, as you can see:


BUT there’s also sort of a u-shaped curve, which suggests to me that the recognizer is doing worse with the files that have been messed with the most. (Although, weirdly, only the file that had had its pitch shifted up by 20 Hz had no errors.) I’m going to go ahead and say that I’m not convinced that pitch is a determining factor.

So why were these captions so much better than the ones I looked at in my last post? It could just be that I was talking very slowly and clearly. To check that out, I looked at autocaptions for the most recent video posted by someone who’s fairly similar to me in terms of social and vocal characteristics: a white woman who speaks standardized American English with Southern features. Ideally I’d match for socioeconomic class, education and rural/urban background as well, but those are harder to get information about.

I chose Bunny Meyer, who posts videos as Grav3yardgirl. In this video her speech style is fast and conversational, as you can hear for yourself:

To make sure I had roughly the same amount of data as I had before, I checked the captions for the first 445 words, which was about two minutes worth of video (you can check my work here). There was an overall error rate of approximately 8%, if you count skipped words as errors. Which, considering that recognizing words in fast/connected speech is generally more error-prone, is pretty good. It’s definitely better than in the videos I analyzed for my last post. It’s also a fairly small difference from my careful speech: definitely less than the 13% difference I found for gender.
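If you want to replicate the scoring, the error rate here is just errors (including skipped words) over tokens; a minimal sketch:

```python
def caption_error_rate(n_wrong, n_skipped, n_tokens):
    """Fraction of tokens captioned incorrectly, counting words the
    recognizer skipped entirely as errors."""
    return (n_wrong + n_skipped) / n_tokens

# Careful word-list speech: 9 errors, no skips, 434 tokens -> ~2%.
careful = caption_error_rate(9, 0, 434)
print(f"{careful:.1%}")  # 2.1%
```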

So it looks like neither the speed of speech nor the pitch are strongly affecting recognition rate (at least for videos captioned recently). There are a couple other things that I think may be going on here that I’m going to keep poking at:

  • ASR has got better over time. It’s totally possible that more women just did the accent tag challenge earlier, and thus had higher error rates because the speech recognition system was older and less good. I’m going to go back and tag my dataset for date, though, and see if that shakes out some of the gender differences.
  • Being louder may be important, especially in less clear recordings. I used a head-mounted microphone in a quiet room to make my recordings, and I’m assuming that Bunny uses professional recording equipment. If you’re recording outside or with a device microphone, though, there’s going to be a lot more noise. If your voice is louder, and men’s voices tend to be, it should be easier to understand in noise. My intuition is that, since there are gender differences in how loud people talk, some of the error may be due to intensity differences in noisy recordings. Although an earlier study found no difference in speech recognition rates for men and women in airplane cockpits, which are very noisy, so who knows? Testing that out will have to wait for another day, though.

Google’s speech recognition has a gender bias

In my last post, I looked at how Google’s automatic speech recognition worked with different dialects. To get this data, I hand-checked the captions for more than 1,500 words from fifty different accent tag videos.

Now, because I’m a sociolinguist and I know that it’s important to stratify your samples, I made sure I had an equal number of male and female speakers for each dialect. And when I compared performance on male and female talkers, I found something deeply disturbing: YouTube’s auto captions consistently performed better on male voices than female voices (t(47) = -2.7, p < 0.01). (You can see my data and analysis here.)


On average, for each female speaker, less than half (47%) of her words were captioned correctly. The average male speaker, on the other hand, was captioned correctly 60% of the time.

It’s not that there’s a consistent but small effect size, either: 13% is a pretty big effect. The Cohen’s d was 0.7, which means, in non-math-speak, that if you pick a random man and a random woman from my sample, there’s an almost 70% chance the transcriptions will be more accurate for the man. That’s pretty striking.
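For the curious, that “almost 70% chance” is the common-language effect size, which for normally distributed scores with equal variances is Φ(d/√2); a quick sanity check in Python:

```python
from math import erf, sqrt

def common_language_effect_size(d):
    """Probability that a random draw from the higher-scoring group
    beats one from the lower-scoring group, given Cohen's d (assumes
    normal distributions with equal variances): Phi(d / sqrt(2))."""
    z = d / sqrt(2)
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z

print(round(common_language_effect_size(0.7), 2))  # 0.69
```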

What it is not, unfortunately, is shocking. There’s a long history of speech recognition technology performing better for men than women.

This is a real problem with real impacts on people’s lives. Sure, a few incorrect Youtube captions aren’t a matter of life and death. But some of these applications have a lot higher stakes. Take the medical dictation software study. The fact that men enjoy better performance than women with these technologies means that it’s harder for women to do their jobs. Even if it only takes a second to correct an error, those seconds add up over the days and weeks to a major time sink, time your male colleagues aren’t wasting messing with technology. And that’s not even touching on the safety implications of voice recognition in cars.


So where is this imbalance coming from? First, let me make one thing clear: the problem is not with how women talk. The suggestion that, for example, “women could be taught to speak louder, and direct their voices towards the microphone” is ridiculous. In fact, women use speech strategies that should make it easier for voice recognition technology to work on women’s voices. Women tend to be more intelligible (for people without high-frequency hearing loss), and to talk slightly more slowly. In general, women also favor more standard forms and make less use of stigmatized variants. Women’s vowels, in particular, lend themselves to classification: women produce longer vowels which are more distinct from each other than men’s are. (Edit 7/28/2016: I have since found two papers by Sharon Goldwater, Dan Jurafsky and Christopher D. Manning where they found better performance for women than men–due to the above factors and different rates of filler words like “um” and “uh”.) One thing that may be making a difference is that women also tend not to be as loud, partly as a function of just being smaller, and cepstral features (the fancy math that’s under the hood of most automatic voice recognition) are sensitive to differences in intensity. This all doesn’t mean that women’s voices are more difficult; I’ve trained classifiers on speech data from women and they worked just fine, thank you very much. What it does mean is that women’s voices are different from men’s voices, so a system designed around men’s voices just won’t work as well for women’s.
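The intensity-sensitivity point can be demonstrated with a toy real cepstrum (an illustrative calculation, not the MFCC pipeline an actual recognizer uses): doubling a signal’s amplitude moves only the 0th cepstral coefficient, by log 2, while the higher coefficients, the ones that carry spectral shape, stay put.

```python
import cmath
from math import log, pi, sin

def dft(x):
    """Naive discrete Fourier transform (fine for a short demo signal)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * pi * k * n / N) for n in range(N))
            for k in range(N)]

def real_cepstrum(x, eps=1e-300):
    """Inverse DFT of the log magnitude spectrum."""
    N = len(x)
    log_mag = [log(abs(s) + eps) for s in dft(x)]
    return [sum(log_mag[k] * cmath.exp(2j * pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

# A toy "voice": two sine components, then the same signal twice as loud.
signal = [sin(2 * pi * 3 * n / 32) + 0.3 * sin(2 * pi * 7 * n / 32)
          for n in range(32)]
c_quiet = real_cepstrum(signal)
c_loud = real_cepstrum([2.0 * s for s in signal])

# Doubling the amplitude shifts only coefficient 0 (by log 2 ~ 0.6931);
# the remaining coefficients, which encode spectral shape, are unchanged.
print(round(c_loud[0] - c_quiet[0], 4))  # 0.6931
print(max(abs(a - b) for a, b in zip(c_loud[1:], c_quiet[1:])) < 1e-9)  # True
```

One standard mitigation is cepstral mean normalization, which subtracts out exactly this kind of constant offset; without it, quieter speakers simply look different to the model.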

Which leads right into where I think this bias is coming from: unbalanced training sets. Like car crash dummies, voice recognition systems were designed for (and largely by) men. Over two thirds of the authors in the Association for Computational Linguistics Anthology Network are male, for example. Which is not to say that there aren’t truly excellent female researchers working in speech technology (Mari Ostendorf and Gina-Anne Levow here at the UW and Karen Livescu at TTI-Chicago spring immediately to mind) but they’re outnumbered. And that imbalance seems to extend to the training sets, the annotated speech that’s used to teach automatic speech recognition systems what things should sound like. Voxforge, for example, is a popular open source speech dataset that “suffers from major gender and per speaker duration imbalances.” I had to get that info from another paper, since Voxforge doesn’t have speaker demographics available on their website. And it’s not the only popular corpus that doesn’t include speaker demographics: neither does the AMI meeting corpus, nor the Numbers corpus. And when I could find the numbers, they weren’t balanced for gender. TIMIT, which is the single most popular speech corpus in the Linguistic Data Consortium, is just over 69% male. I don’t know what speech database the Google speech recognizer is trained on, but based on the speech recognition rates by gender I’m willing to bet that it’s not balanced for gender either.

Why does this matter? It matters because there are systematic differences between men’s and women’s speech. (I’m not going to touch on the speech of other genders here, since that’s a very young research area. If you’re interested, the Journal of Language and Sexuality is a good jumping-off point.) And machine learning works by making computers really good at dealing with things they’ve already seen a lot of. If they get a lot of speech from men, they’ll be really good at identifying speech from men. If they don’t get a lot of speech from women, they won’t be that good at identifying speech from women. And it looks like that’s the case. Based on my data from fifty different speakers, Google’s speech recognition (which, if you remember, is probably the best-performing proprietary automatic speech recognition system on the market) just doesn’t work as well for women as it does for men.

Which accents does automatic speech recognition work best for?

If your primary dialect is something other than Standardized American English (that sort of from-the-US-but-not-anywhere-in-particular type of English you hear a lot of on the news) you may have noticed that speech recognition software doesn’t generally work very well for you. You can see the sort of thing I’m talking about in this clip:

This clip is a little old, though (2010). Surely voice recognition technology has improved since then, right? I mean, we’ve got more data and more computing power than ever. Surely somebody’s gotten around to making sure that the current generation of voice-recognition software deals equally well with different dialects of English. Especially given that those self-driving cars that everyone’s so excited about are probably going to use voice-based interfaces.

To check, I spent some time on Youtube looking at the accuracy of automatic captions for videos of the accent tag challenge, which was developed by Bert Vaux. I picked Youtube automatic captions because they’re done with Google’s automatic speech recognition technology–which is one of the most accurate commercial systems out there right now.

Data: I picked videos with accents from Maine (U.S.), Georgia (U.S.), California (U.S.), Scotland and New Zealand. I picked these locations because they’re pretty far from each other and also have pretty distinct regional accents. All speakers from the U.S. were (by my best guess) white and all looked to be young-ish. I’m not great at judging age, but I’m pretty confident no one was above fifty or so.

What I did: For each location, I checked the accuracy of the automatic captions on the word-list part of the challenge for five male and five female speakers. So I have data for a total of 50 people across 5 dialect regions. For each word in the word list, I marked it as “correct” if the entire word was correctly captioned on the first try. Anything else was marked wrong. To be fair, the words in the accent tag challenge were specifically chosen because they have a lot of possible variation. On the other hand, they’re single words spoken in isolation, which is pretty much the best case scenario for automatic speech recognition, so I think it balances out.
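For anyone who wants to re-implement the marking scheme, a minimal sketch (the demo word pairs are invented examples, not my actual annotations):

```python
def word_correct(spoken, captioned):
    """Marked 'correct' only if the caption matches the entire word on
    the first try; anything else (splits, near-misses) counts as wrong."""
    return spoken.strip().lower() == captioned.strip().lower()

def accuracy(pairs):
    """Proportion of (spoken, captioned) pairs marked correct."""
    return sum(word_correct(s, c) for s, c in pairs) / len(pairs)

# Hypothetical caption outcomes for four word-list items.
demo = [("aunt", "aunt"), ("caramel", "carmel"),
        ("New Orleans", "new orleans"), ("pecan", "pe can")]
print(accuracy(demo))  # 0.5
```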

Ok, now the part you’ve all been waiting for: the results. Which dialects fared better and which worse? Does dialect even matter? First the good news: based on my (admittedly pretty small) sample, the effect of dialect is so weak that you’d have to be really generous to call it reliable. A linear model that estimated number of correct classifications based on total number of words, speaker’s gender and speaker’s dialect area fared only slightly better (p = 0.08) than one that didn’t include dialect area. Which is great! No effect means dialect doesn’t matter, right?

Weellll, not really. Based on a power analysis, I really should have sampled forty people from each dialect, not ten. Unfortunately, while I love y’all and also the search for knowledge, I’m not going to hand-annotate two hundred Youtube videos for a side project. (If you’d like to add data, though, feel free to branch the dataset on Github here. Just make sure to check the URL for the video you’re looking at so we don’t double dip.)
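The power analysis behind that sample-size figure isn’t shown here, but the standard two-group approximation gives a feel for the arithmetic (a five-dialect design needs a proper ANOVA power calculation; this closed form, n ≈ 2((z_α + z_β)/d)² for 80% power at α = .05, is just the simplest version):

```python
from math import ceil

def n_per_group(effect_size, z_alpha=1.96, z_beta=0.84):
    """Two-sample rule of thumb: participants per group needed for 80%
    power at alpha = .05 to detect a standardized effect of this size."""
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# A medium effect (d = 0.5) needs far more than ten speakers per group.
print(n_per_group(0.5))  # 63
```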

So while I can’t confidently state there is an effect, based on the fact that I’m sort of starting to get one with only a quarter of the amount of data I should be using, I’m actually pretty sure there is one. No one’s enjoying stellar performance (there’s a reason that they tend to be called AutoCraptions in the Deaf community) but some dialect areas are doing better than others. Look at this chart of accuracy by dialect region:


Proportion of correctly recognized words by dialect area, color coded by country.

There’s variation, sure, but in general the recognizer seems to be working best on people from California (which just happens to be where Google is headquartered) and worst on Scottish English. The big surprise for me is how well the recognizer works on New Zealand English, especially compared to Scottish English. It’s not a function of country population (NZ = 4.4 million, Scotland = 5.2 million). My guess is that it might be due to sample bias in the training sets, especially if, say, there were some 90’s TV shows in there; there’s a lot of captioned New Zealand English in Hercules, Xena and related spin-offs. There’s also a Google outreach team in New Zealand, but not Scotland, so that might be a factor as well.

So, unfortunately, it looks like the lift skit may still be current. ASR still works better for some dialects than others. And, keep in mind, these are all native English speakers! I didn’t look at non-native English speakers, but I’m willing to bet the system is also letting them down. Which is a shame. It’s a pity that how well voice recognition works for you is still dependent on where you’re from. Maybe in another six years I’ll be able to write a blog post saying it isn’t.

What types of emoji do people want more of?

So if you’re a weird internet nerd like me, you might already know that Unicode 9.0 was released today. The deets are here, but they’re fairly boring unless you really care about typography. What’s more interesting to me, as someone who studies visual, spoken and written language, is that there are a whole batch of new emoji. And it’s led to lots of interesting speculation about, for example, what the most popular new emoji is going to be (tl;dr: probably the ROFL face. People have a strong preference for using positive face emojis.) This led me to wonder: what obvious lexical gaps are there?

[I]n some cases it is useful to refer to the words that are not part of the vocabulary: the nonexisting words. Instead of referring to nonexisting words, it is common to speak about lexical gaps, since the nonexisting words are indications of “holes” in the lexicon of the language that could be filled.

Janssen, M. 2012. “Lexical Gaps”. The Encyclopedia of Applied Linguistics.

This question is pretty easy to answer for emoji–we can just find out what words people are most likely to use when they’re complaining about not being able to use an emoji. There’s even a Twitter bot that collects these kinds of tweets. I decided to do something similar, but with a twist. I wanted to know what kinds of emoji people complain about wanting the most.

Boring technical details 💤

  1. Yesterday, I grabbed 4817 recent tweets that contained both the words “no” and “emoji”. (You can find the R script I used for this on my Github.)
  2. For each tweet, I took the two words occurring directly in front of the word “emoji” and created a corpus from them using the tm (text mining) package.
  3. I tidied up the corpus–removing super-common words like “the”, making everything lower-case, and so on. (The technical term is “cleaning”, but I like the sound of tidying better. It sounds like you’re getting comfy with your data, not delousing it.)
  4. I ranked these words by frequency, or how often they showed up. There were 1888 distinct words, but the vast majority (1280) showed up only once. This is completely normal for word frequency data and is modelled by Zipf’s law.
  5. I then took all words that occurred more than three times and did a content analysis.
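The original pipeline was in R with the tm package; here’s a rough Python re-sketch of steps 2–4 (the stopword list and example tweets are invented stand-ins):

```python
import re
from collections import Counter

# Invented stand-in for a real stopword list.
STOPWORDS = {"the", "a", "an", "no", "is", "there", "i", "of", "for", "in"}

def preceding_words(tweets, target="emoji", n=2):
    """Step 2: grab the n words directly before each occurrence of the
    target word (lower-casing as we go, part of step 3)."""
    grabbed = []
    for tweet in tweets:
        tokens = re.findall(r"[a-z']+", tweet.lower())
        for i, tok in enumerate(tokens):
            if tok == target:
                grabbed.extend(tokens[max(0, i - n):i])
    return grabbed

def ranked_frequencies(words):
    """Steps 3-4: drop super-common words, then rank by frequency."""
    return Counter(w for w in words if w not in STOPWORDS).most_common()

tweets = [
    "why is there no shark emoji",
    "still no giraffe emoji in 2016",
    "NO SHARK EMOJI?? unacceptable",
]
ranked = ranked_frequencies(preceding_words(tweets))
print(ranked)  # [('shark', 2), ('giraffe', 1)]
```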


Exciting results! 😄

At the end of my content analysis, I arrived at nine distinct categories. I’ve listed them below, with the most popular four terms from each. One thing I noticed right off is how many of these are emoji that either already exist or are in the Unicode update. To highlight this, I’ve italicized terms in the list below that don’t have an emoji.

  • animal: shark, giraffe, butterfly, duck
  • color: orange, red, white, green
  • face: crying, angry, love, hate
  • (facial) feature: mustache, redhead, beard, glasses
  • flag: flag, England, Welsh, pride
  • food: bacon, avocado, salt, carrot
  • gesture: peace, finger, middle, crossed
  • object: rifle, gun, drum, spoon
  • person: mermaid, pirate, clown, chef

(One note: the rifle is in Unicode 9.0, but isn’t an emoji. This has been the topic of some discussion, and is probably why it’s so frequent.)

Based on these categories, where are the lexical gaps? The three categories that have the most different items in them are, in order: 1) food, 2) animals and 3) objects. These are also the three categories with the most mentions across all items.

So, given that so many people are talking about emojis for animals, food and objects, why aren’t the bulk of emojis in these categories? We can see why this might be by comparing how many different items get mentioned in each category to how many times each item is mentioned.


Yeah, people talk about food a lot… but they also talk about a lot of different types of food. On the other hand you have categories like colors, which aren’t talked about as much but where the same colors come up over and over again.

As you can see from the figure above, the most popular categories have a lot of different things in them, but each thing is mentioned relatively rarely. So while there is an impassioned zebra emoji fanbase, it only comes up three times in this dataset. On the other hand, “red” is fairly common but shows up because of discussion of, among other things, flowers, shoes and hair color. Some categories, like flags, fall in a happy medium–lots of discussion and fairly few suggestions for additions.

Based on this teeny data set, I’d say that if the Unicode Consortium continues to be in charge of emoji standardization, it’ll have its hands full for quite some time to come. There’s a lot of room for growth, and most of it is in food, animals and objects, which all have a lot of possible items, rather than gestures or facial expressions, which have far fewer.

Why do Canadians say ‘eh’?

Perhaps it’s because Seattle is so close to Canada, but for some reason when I ask classes of undergraduate students what they want to know about language and language use, one question I tend to get a lot is:

 Why do Canadians say ‘eh’?


Fortunately for my curious students, this is actually an active area of inquiry. (It’s actually one of those research questions where there was a flurry of work–in this case in the 1970’s–and then a couple quiet decades followed by a resurgence in interest. The ‘eh’ renaissance started in the mid-2000’s and continues today. For some reason, at least in linguistics, this sort of thing tends to happen a lot. I’ll leave discussing why this particular pattern is so common to the sociologists of science.) So what do we know about ‘eh’?

Is ‘eh’ actually Canadian?

‘Eh’ has quite the pedigree–it’s first attested in Middle English and even shows up in Chaucer. Canadian English, however, boasts a more frequent use of ‘eh’, which can fill the same role as ‘right?’, ‘you know?’ or ‘innit?’ for speakers of other varieties of English.

What does ‘eh’ mean?

The real thing that makes an ‘eh’ Canadian, though, is how it’s used. Despite some claims to the contrary, “eh” is far from meaningless. It has a limited number of uses (Elaine Gold identified an even dozen in her 2004 paper) some of which aren’t found outside of Canada. Walter Avis described two of these uniquely Canadian uses in his 1972 paper, “So eh? is Canadian, eh” (it’s not available anywhere online as far as I can tell):

  1. Narrative use: Used to punctuate a story, in the same way that an American English speaker (south of the border, that is) might use “right?” or “you know?”
    1. Example: I was walking home from school, eh?  I was right by that construction site where there’s a big hole in the ground, eh? And I see someone toss a piece of trash right in it.
  2. Miscellaneous/exclamation use:  Tacked on to the end of a statement. (Although more recent work, presented by Martina Wiltschko and Alex D’Arcy at last year’s NWAV suggests that there’s really a limited number of ways to use this type of ‘eh’ and that they can be told apart by the way the speaker uses pitch.)
    1. Example: What a litterbug, eh?

And these uses seem to be running strong. Gold found that use of ‘eh’ in a variety of contexts has either increased or remained stable since 1980.

That’s not to say there’s no change going on, though. D’Arcy and Wiltschko found that younger speakers of Canadian English are more likely than older speakers to use ‘right?’ instead of ‘eh?’. Does this mean that ‘eh’ may be going the way of the dodo or ‘sliver’ to mean ‘splinter’ in British English?

Probably not–but it may show up in fewer places than it used to. In particular, in their 2006 study, Elaine Gold and Mireille Tremblay found that almost half of their participants felt negatively about the narrative use of ‘eh’ and only 16% actually used it themselves. This suggests this type of uniquely-Canadian usage may be on its way out.

Should you go to grad school for linguistics?

So I’ve had this talk, in different forms, with lots of different people over the last couple of years. Mainly undergrads thinking about applying to PhD programs in linguistics but, occasionally, people in industry thinking about going back to school as well. Every single one of these people was smart, cool, dedicated, hard-working, a great linguist and would have been an asset to the field. And when they asked me, a current linguistics graduate student, whether it was a good idea to go to grad school in linguistics, I gave them all the same answer:

“But Rachael,” you say, “you’re going to grad school in linguistics and having all sorts of fun. Why are you trying to keep me from doing the same thing?” Two big reasons.

The Job Market for Linguistics PhDs

What do you want to do when you get out of grad school? If you’re like most people, you’ll probably say you want to teach linguistics at the college or university level. What you should know is that this is an increasingly unsustainable career path.

In 1975, 30 percent of college faculty were part-time. By 2011, 51 percent of college faculty were part-time, and another 19 percent were non–tenure track, full-time employees. In other words, 70 percent were contingent faculty, a broad classification that includes all non–tenure track faculty (NTTF), whether they work full-time or part-time.

More Than Half of College Faculty Are Adjuncts: Should You Care? by Dan Edmonds.

And most of these part-time faculty, or adjuncts, are very poorly paid. This survey from 2015 found that 62% of adjuncts made less than $20,000 a year. This is even more upsetting when you consider that you need a PhD and scholarly publications to even be considered for one of these posts.

(“But what about being paid for your research publications?” you ask. “Surely you can make a few bucks by publishing in those insanely expensive academic journals.” While I understand where you’re coming from–in almost any other professional publishing context it’s completely normal to be paid for your writing–authors of academic papers are not paid. Nor are the reviewers. Furthermore, authors are often charged fees by the publishers. One journal I was recently looking at charges $2,900 per article, which is about three times the funding my department gives us for research over our entire degree. Not a scam journal, either–an actual reputable venue for scholarly publication.)

Yes, there are still tenure-track positions available in linguistics, but they are by far the minority. What’s more, even including adjunct positions, there are still fewer academic posts than graduating linguists with PhDs. It’s been that way for a while, too, so even for a not-so-great adjunct position you’ll be facing stiff competition. Is it impossible to find a good academic post in linguistics? No. Are the odds in your (or my, or any other current grad student’s) favor? Also no. But don’t take it from me. In Surviving Linguistics: A Guide for Graduate Students (which I would highly recommend) Monica Macaulay says:

[It] is common knowledge that we are graduating more PhDs than there are faculty positions available, resulting in certain disappointment for many… graduates. The solution is to think creatively about job opportunities and keep your options open.

As Dr. Macaulay goes on to outline, there are jobs for linguists outside academia. Check out the LSA’s Linguistics Beyond Academia special interest group or the Linguists Outside Academia mailing list. There are lots of things you can do with a linguistics degree, from data science to forensic linguistics.

That said, there are degrees that will better prepare you for a career than a PhD in theoretical linguistics. A master’s degree in Speech Language Pathology (SLP) or Computational Linguistics or Teaching English to Speakers of Other Languages (TESOL) will prepare you for those careers far better than a general PhD.

Even if you’re 100% dead set on teaching post-secondary students, you should look around and see what linguists are doing outside of universities. Sure, you might win the job-lottery, but at least some of your students probably won’t, and you’ll want to make sure they can find well-paying, fulfilling work.

Grad School is Grueling

Yes, grad school can absolutely be fun. On a good day, I enjoy it tremendously. But it’s also work. (And don’t give me any nonsense about it not being real work because you do it sitting down. I’ve had jobs that required hard physical and/or emotional labor, and grad school is exhausting.) I feel like I probably have a slightly better than average work/life balance–partly thanks to my fellowship, which means I have limited teaching duties and don’t need a second job any more–and I’m still actively trying to get better about stopping work when I’m tired. I fail, and end up all tearful and exhausted, about once a week.

It’s also emotionally draining. Depression runs absolutely rampant among grad students. This 2015 report from Berkeley, for example, found that over two thirds of PhD students in the arts and sciences were depressed. The main reason? Point number one above–the stark realities of the job market. It can be absolutely gutting to see a colleague do everything right, from research to teaching, and end up not having any opportunity to do the job they’ve been preparing for. Especially since you know the same lies in wait for you.

And “doing everything right” is pretty Herculean in and of itself. You have to have very strong personal motivation to finish a PhD. Sure, your committee is there to provide oversight and you have drop-dead due dates. But those deadlines are often very far away and, depending on your committee, you may have a lot of independence. That means motivating yourself to work steadily while managing several ongoing projects in parallel (you’re publishing papers in addition to writing your dissertation, right?) and not working yourself to exhaustion in the process. Basically you’re going to need a big old double helping of executive functioning.

And oh by the way, to be competitive in the job market you’ll also need to demonstrate you can teach and perform service for your school/discipline. Add in time to sleep, eat, get at least a little exercise and take breaks (none of which are optional!) and you’ve got a very full plate indeed. Some absolutely iron-willed people even manage all of this while having/raising kids and I have nothing but respect for them.

Main take-away

Whether inside or outside of academia, it’s true that a PhD does tend to correlate with higher salary–although the boost isn’t as much as you’d get from a related professional degree. BUT in order to get that higher salary you’ll need to give up some of your most productive years. My spouse (who also has a bachelor’s in linguistics) got a master’s degree, found a good job, got promoted and has cultivated a professional social network in the time it’s taken me just to get to the point of starting my dissertation. The opportunity cost of spending five more years (at a minimum–I’ve heard of people who took more than a decade to finish) in school, probably in your twenties, is very, very high. And my spouse can leave work at work, come home on weekends and just chill. This month I’ve got four full weekends of either conferences or outreach. Even worse, no matter how hard I try to stamp it out, I’ve got a tiny little voice in my head that’s very quietly screaming “you should be working” literally all the time.

I’m being absolutely real right now: going to grad school for linguistics is a bad investment of your time and labor. I knew that going in–heck, I knew that before I even applied–and I still went in. Why? Because I decided that, for me, it was a worthwhile trade-off. I really like doing research. I really like being part of the scientific community. Grad school is hard, yes, but overall I’m enjoying myself. And even if I don’t end up being able to find a job in academia (although I’m still hopeful and still plugging away at it) I really, truly believe that the research I’m doing now is valuable and interesting and, in some small way, helping the world. What can I say? I’m a nerdy idealist.

But this is 100% a personal decision. It’s up to you as an individual to decide whether the costs are worth it to you. Maybe you’ll decide, as I have, that they are. But maybe you won’t. And to make that decision you really do need to know what those costs are. I hope I’ve helped to begin making them clear. 

One final thought: Not going to grad school doesn’t mean you’re not smart. In fact, considering everything I’ve discussed above, it probably means you are.

What is linguistic discrimination?

Recently, UC Berkeley student Khairuldeen Makhzoomi was removed from his flight. The reason: he was speaking Arabic. And this isn’t the first time this has happened. Nor the second. These are all, in addition to being deeply disturbing and illegal, examples of linguistic discrimination.

What is linguistic discrimination?

Linguistic discrimination is discrimination based on someone’s language use. And it’s not restricted to the instances I discussed above:

As I’ve talked about before, linguistic discrimination can be a way to discriminate against a specific group of people without saying so in so many words. Linguistic discrimination, in addition to being morally repugnant, is illegal in the U.S. under Titles VI and VII of the Civil Rights Act of 1964.

These are important legal protections and the number of people affected by them is huge: There are over 350 different languages spoken in the United States. In Seattle, where I live, over a fifth of people over age five speak a language other than English at home. That’s a lot of people! Further, most of these individuals are bilingual or multilingual; 90% of second-generation immigrants speak English. And since multilingualism has both neurological benefits for individuals and larger positive impacts on society, I see this as no bad thing. And I’m hardly the only one: how many people that you know are learning or want to learn another language?

Unfortunately, linguistic discrimination threatens this rich diversity, and every person who speaks anything other than the standardized variety of the dominant language.

What can you do?

  • Don’t participate in linguistic discrimination. It can be hard to retrain yourself to reduce the impact of negative stereotypes but, especially if you’re in a position of privilege (as I am), it’s literally the least you can do. Don’t make assumptions about people based on their language use.
  • Stand up for people who may be facing linguistic discrimination. If you see someone being discriminated against in the workplace (like being given lower performance evaluations for having a non-native accent) point out that this is illegal, and back up people who are being discriminated against.
  • Be patient with non-native speakers. Appreciate that they’ve gone through a lot of effort to learn your language. If possible, try and arrange for an interpreter (for face-to-face communication) or translator (for written communications). Sometimes non-native speakers are more comfortable with reading and writing than speaking; offer to communicate through e-mails or other written correspondence.


What’s the difference between frosting and icing?

Fair warning: this post is full of pictures of baked goods. I can’t claim responsibility for any impulsive cake-baking that may result from reading further.

This is the second post in this series. The first half, here, focused on responses to whether “frosting” and “icing” were different things, or different words for the same thing. This post gets a little more in-depth. In the first part, I was just asking people what they thought they said. In the second part, I was asking them to pick words for specific pictures. It’s not a perfect design–by asking people what they think they saw first I primed them pretty heavily–but it does reveal some interesting patterns of usage.

The main thing I was interested in was this–did people who said frosting and icing were interchangeable for them actually use them as if they were the same? Why is this a good question to ask? Because it turns out that a lot of the time people aren’t the best judges of how they use language. Especially if there’s some sort of “rule” about how you’re “supposed” to do it. For example, there’s something of a running joke among linguists about how often people will use the passive voice while telling people not to! I don’t think anyone would intentionally lie about their usage, but it’s possible that respondents aren’t always doing exactly what they think they are.

I split my dataset into people who said they thought the words “frosting” and “icing” meant the same thing and those who thought they were different. In the charts below these groups are labelled “same” and “different” respectively. For this stage of analysis, I left out people who weren’t sure; there weren’t a whole lot of them anyway.


So this picture was a pretty canonical example of what people brought up a lot–it’s on a cake, and it’s been both whipped and piped. For a lot of people, then, this should be “frosting”. So what did people say?

The results here were pretty much what I expected. (Whew!) People who thought the words meant different things pretty much all thought this was “frosting”. And there was a pretty strong difference between the groups. But this still doesn’t answer some of my questions. Is it the texture that makes it “frosting” or, as the AP Styleguide suggests, the fact that it’s on a cake? After all, you can definitely put buttercream on a cookie, as evinced by Lofthouse.



Next I had some doughnuts. A lot of people, when I first started asking around, brought up doughnuts as something that they thought were iced rather than frosted. So what did people say?


That does seem to hold true. There was no strong difference between the groups, but there were also a lot of write-in answers. (“Glaze” was especially popular, which, for the record, is probably what I’d say.) So there seems to be more variety in what people call doughnut toppings, but there is a tendency towards “icing”.

Cake with fondant


Ok, so this image was a bit of a trick. The cake here is covered in fondant. Which, to me, isn’t really frosting or icing. But if it’s really “being on a cake” that makes something “frosting”, we should see a strong “frosting” bias from people with a distinction. And that’s just not the case. There’s also a pretty big difference between the groups here. Interestingly, people who thought “frosting” and “icing” are different things were more likely to write in “fondant”. (Remember that level of baking knowledge had no effect on whether people said there was a difference or not, so it’s probably not just specialized knowledge.)

Bundt Cake


I included this image for a couple of reasons. Again, I’m poking at this “on a cake” idea. But I also had a lot of people tell me that, for them, the distinction between the words was texture-based. So responses here could have gone two ways: If anything on a cake is frosting, then we’d expect frosting to win. But, if frosting has to be fluffy/whipped, then we’d expect icing to win.


And icing wins! This is no surprise, given the written results summarized in my previous post and the responses for the cake pictures above, but for me it really puts the nail in the coffin of the “on cakes” argument. (Take note, AP Styleguide!) Even on this one, though, people with no distinction are much more likely to be able to use “frosting”.

Sweet Roll


So this is an interesting one. I included it because, for me, cinnamon rolls are synonymous with cream cheese frosting/icing. Since several people I talked to said specifically that cream cheese had to be frosting and not icing, I was expecting a large “frosting” response on this one.


That was definitely not what I saw, though. (Although people with no distinction were much more likely to be able to say “frosting”, so I guess I came by it natural.) Most people, and especially people with a distinction, thought it was “icing”.


So there are two main takeaways here:

  • There’s a strong difference in usage between people who say that “frosting” and “icing” are different things and those who say they aren’t. (For most of the pictures, these groups responded significantly differently.)
  • If there is a difference, it’s got everything to do with texture and nothing to do with cake.
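For the curious: the standard way to check whether two groups of respondents really answered differently is a Pearson chi-squared test on the response counts. My actual analysis was done in R (linked at the end of this post), but here’s a rough Python sketch of the idea with completely invented counts–the numbers below are illustrative only, not my survey data:

```python
def chi_square_2x2(table):
    """Pearson chi-squared statistic for a 2x2 contingency table.

    table: [[a, b], [c, d]] of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the null hypothesis that group
            # membership and word choice are independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Invented counts: how many in each group picked each word for one picture?
observed = [[40, 20],   # "same" group:      frosting, icing
            [55, 5]]    # "different" group: frosting, icing
print(round(chi_square_2x2(observed), 2))  # → 11.37
```

A statistic that large is well past the 3.84 cutoff for significance at p < .05 with one degree of freedom, which is the kind of comparison behind the “responded significantly differently” claim above.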

That’s not to say that these things will always hold true; no one knows better than linguists that language is in a constant state of flux. But for now, these generalizations seem to hold for most of the people surveyed. So if you’re going to make a usage distinction between these words, please make one that’s based on the actual usage and not some completely made-up rule!

A final note: if you’re interested in seeing the (slightly sanitized) data and the R code I used for analysis, both are available here.