Why is it so hard for computers to recognize speech?

This is a problem that’s plagued me for quite a while. I’m not a computational linguist myself, but one of the reasons that theoretical linguistics is important is that it allows us to create robust conceptual models of language… which is basically what voice recognition (or synthesis) programs are. But, you may say to yourself, if it’s your job to create and test robust models, you’re clearly not doing very well. I mean, just listen to this guy. Or this guy. Or this person, whose patience in detailing errors borders on obsession. Or, heck, this person, who isn’t so sure that voice recognition is even a thing we need.

You mean you wouldn’t want to be able to have pleasant little chats with your computer? I mean, how could that possibly go wrong?
Now, to be fair to linguists, we’ve kinda been out of the loop for a while. Fred Jelinek, a very famous researcher in speech recognition, once said “Every time we fire a phonetician/linguist, the performance of our system goes up”. Oof, right in the career prospects. There was, however, a very good reason for that, and it had to do with the pressures on computer scientists and linguists respectively. (Also a bunch of historical stuff that we’re not going to get into.)

Basically, in the past (and currently, to a certain extent) there was this divide in linguistics. Linguists wanted to model speakers’ competence, not their performance. The idea is that there’s some sort of place in your brain where you know all the rules of language and have them all perfectly mapped out and described. Not in a conscious way, but there nonetheless. But somewhere between the magical garden of language and your mouth and/or ears you trip up and mistakes happen. You say a word wrong or mishear it or switch bits around… all sorts of things can go wrong. Plus, of course, even if we don’t make a recognizable mistake, there’s an incredible amount of variation that we can decipher without a problem. That got pushed over to the performance side, though, and wasn’t looked at as much. Linguistics was all about what was happening in the language mind-garden (the competence) and not the messy sorts of things you say in everyday life (the performance). You can also think of it like what celebrities actually say in an interview vs. what gets into the newspaper: all the “um”s and “uh”s are taken out, little stutters or repetitions are erased, and if the sentence structure came out a little wonky the reporter pats it back into shape. It was pretty clear what they meant to say, after all.

So you’ve got linguists with their competence models explaining them to the computer folks and computer folks being all clever and mathy and coming up with algorithms that seem to accurately model our knowledge of human linguistic competency… and getting terrible results. Everyone’s working hard and doing their best and it’s just not working.

I think you can probably figure out why: if you’re a computer just sitting there with very little knowledge of language (consider that this was before any of the big corpora were published, so there wasn’t a whole lot of raw data), and someone hands you a model that’s only built to handle perfect data and then feeds you actual speech data, which even under ideal conditions is far from perfect, you’re going to spit out spaghetti and call it a day. It’s a bit like telling someone to make you a peanut butter and jelly sandwich and just expecting them to do it. Which is fine if they already know what peanut butter and jelly are, and where you keep the bread, and how to open jars, and that food is something humans eat, so you shouldn’t rub it on anything too covered with bacteria or they’ll get sick and die. If they don’t already know all that, it’s probably not the best way to go about it.

So the linguists got the boot and they and the computational people pretty much did their own things for a bit. The model that most speech recognition programs use today is mostly statistical, based on things like how often a word shows up in whichever corpus they’re using currently. Which works pretty well. In a quiet room. When you speak clearly. And slowly. And don’t use any super-exotic words. And aren’t having a conversation. And have trained the system on your voice. And have enough processing power in whatever device you’re using. And don’t get all wild and crazy with your intonation. See the problem?
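
To make “based on how often a word shows up” a little more concrete, here’s the kind of frequency bookkeeping these systems lean on, boiled down to a few lines of Python. To be clear: this is my own toy illustration with a made-up corpus, not anybody’s actual recognizer.

```python
from collections import Counter

# A toy corpus standing in for "whichever corpus they're using currently".
corpus = "the cat sat on the mat the cat ate".split()

# Count how often each word and each two-word sequence shows up.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """P(word | prev), estimated straight from the counts (no smoothing)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# The recognizer's acoustic guesses get re-ranked by how likely the word
# sequence is, so frequent continuations win.
print(bigram_prob("the", "cat"))  # ~0.67: two of the three "the"s are followed by "cat"
print(bigram_prob("the", "dog"))  # 0.0: never seen in the corpus
```

That zero on the last line is exactly where the “don’t use any super-exotic words” caveat comes from: anything the training corpus hasn’t seen gets little or no probability.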

Language is incredibly complex, and speech recognition technology, particularly when it’s based on a purely statistical model, is not terrific at dealing with all that complexity. Which is not to say that I’m knocking statistical models! Statistical phonology is mind-blowing and I think we in linguistics will get a lot of mileage from it. But there’s a difference. We’re not looking to conserve processing power: we’re looking to model what humans are actually doing. There’s been a shift away from the competence/performance divide (though it does still exist) and more interest in modelling the messy stuff that we actually see: conversational speech, connected speech, variation within speakers. And the models that we come up with are complex. Really complex. People working in Exemplar Theory, for example, have found quite a bit of evidence that you remember everything you’ve ever heard and use all of it to help parse incoming signals. Yeah, it’s crazy. And it’s not something that our current computers can do. Which is fine; it gives linguists time to further refine our models. When computers are ready, we will be too, and in the meantime computer people and linguistics people are showing more and more overlap again, and using each other’s work more and more. And, you know, singing Kumbayah and roasting marshmallows together. It’s pretty friendly.
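
If you’re wondering what “remember everything you’ve ever heard and use all of it” could even look like, here’s a deliberately tiny caricature of my own, loosely in the spirit of similarity-based exemplar models, with completely made-up formant values standing in for the acoustic detail:

```python
import math

# Every token you've ever heard gets stored as a point in some acoustic space,
# tagged with the category you heard it as. Here the "space" is just (F1, F2).
stored_exemplars = [
    ((300.0, 2300.0), "i"),   # heard as "ee"
    ((320.0, 2250.0), "i"),
    ((700.0, 1200.0), "a"),   # heard as "ah"
    ((680.0, 1150.0), "a"),
]

def categorize(token, exemplars, sensitivity=0.01):
    """Weight every stored exemplar by its similarity to the incoming token,
    then pick the category with the most total support."""
    support = {}
    for features, category in exemplars:
        similarity = math.exp(-sensitivity * math.dist(token, features))
        support[category] = support.get(category, 0.0) + similarity
    return max(support, key=support.get)

print(categorize((310.0, 2280.0), stored_exemplars))  # i
print(categorize((650.0, 1180.0), stored_exemplars))  # a
```

Notice that categorizing a single incoming token means comparing it against every exemplar you’ve ever stored, which is precisely the part that current hardware can’t afford to do at full scale.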

So what’s the take-away? Well, at least for the moment, in order to get speech recognition to a better place than it is now, we need to build models that work for a system that is less complex than the human brain. Linguistics research, particularly into statistical models, is helping with this. For the future? We need to build systems that are as complex as the human brain. (Bonus: we’ll finally be able to test models of child language acquisition without doing deeply unethical things! Not that we would do deeply unethical things.) Overall, I’m very optimistic that computers will eventually be able to recognize speech as well as humans can.

TL;DR version:

  • Speech recognition has been light on linguists because they weren’t modeling what was useful for computational tasks.
  • Now linguists are building and testing useful models. Yay!
  • Language is super complex and treating it like it’s not will get you hit in the face with an error-ridden fish.
  • Linguists know language is complex and are working diligently at accurately describing how and why. Yay!
  • In order to get perfect speech recognition down, we’re going to need to have computers that are similar to our brains.
  • I’m pretty optimistic that this will happen.

 

 


How to pronounce the “th” sound in English

Or, as I like to call it, hunting the wild Eth and Thorn (which are old letters that can be difficult to typeset), because back in the day, English had the two distinct “th” sounds represented differently in its writing system. There was one where you vibrated your vocal folds (that’s called ‘voiced’), which was written as “ð”, and one where you didn’t (unvoiced), which was written as “þ”. It’s a bit like the difference between “s” and “z” in English today. Try it: you can say both “s” and “z” without moving your tongue a millimeter. Unfortunately, while the voiced and voiceless “th” sounds remain distinct, they’re now represented by the same “th” sequence. The difference between “thy” and “thigh”, for example, is the first sound, but the spelling doesn’t reflect that. (Yet another example of why English orthography is horrible.)

Used with permission from the How To Be British Collection copyright LGP, click picture for website.

The fact that they’re written with the same letters even though they’re different sounds is only part of why they’re so hard to master. (That goes for native English speakers as well as those who are learning it as a second language: it’s one of the last sounds children learn.) The other part is that they’re relatively rare across languages. Standard Arabic, Greek, some varieties of Spanish, Welsh and a smattering of other languages have them. If you happen to have a native language that doesn’t have them, it’s tough to hear and harder to say. Don’t worry, though: linguistics can help!

I’m afraid the cartoon above may accurately express the difficulty of producing the “th” for non-native speakers of English, but the technique is somewhat questionable. So, the fancy technical term for the “th” sounds is “interdental fricative”. Why? Because there are two parts to making them. The first is the place of articulation, which means where you put your tongue. In this case, as you can probably guess (“inter-” between and “-dental” teeth), it goes in between your teeth. Gently!

The important thing about your tongue placement is that your tongue tip needs to be pressed lightly against the bottom of your top teeth. You need to create a small space to push air through, small enough that it makes a hissing sound as it escapes. That’s the “fricative” part. Fricatives are sounds where you force air through a small space and the air molecules start jostling each other and make a high-frequency hissing noise. Now, it won’t be as loud when you’re forcing air between your upper teeth and tongue as it is, for example, when you’re making an “s”, but it should still be noticeable.

So, to review, put the tip of your tongue against the bottom of your top teeth. Blow air through the thin space between your tongue and your teeth so that it creates a (not very loud) hissing sound. Now try voicing the sound (vibrating your vocal folds) as you do so. That’s it! You’ve got both of the English “th” sounds down.

If you’d like some more help, I really like this video, and it has some super-cool slow-motion videos. The lady who made it has a website focusing on English pronunciation which has some great resources. Good luck!

How do you pronounce Gangnam?

So if you’ve been completely oblivious lately, you might not be aware that Korean musician Psy has recently become an international sensation due to the song below. If you haven’t already seen it, you should. I’ll wait.

Ok, good. Now, I wrote a post recently where I suggested that a trained phonetician can help you learn to pronounce things, and I thought I’d put my money where my mouth is and run you through how to pronounce “Gangnam”, phonetics-style. (Note: I’m assuming you’re a native English speaker here.)

First, let’s see how a non-phonetician does it. Here’s a brief guide to the correct pronunciation offered on Reddit by ThatWonAsianGuy, who I can only assume is a native Korean speaker.

The first G apparently sounds like a K consonant to non-Korean speakers, but it’s somewhere between a G and a K, but more towards the G. (There are three similar letters, ㅋ, ㄱ, and ㄲ. The first is a normal “k,” the second the one used in Gangnam, and the third being a clicky, harsh g/k noise.)

The “ang” part is a very wide “ahh” (like when a doctor tells you to open your mouth) followed by an “ng” (like the end of “ending”). The “ahh” part, however, is not a long vowel, so it’s pronounced quickly.

“Nam” also has the “ahh” for the a. The other letters are normal.

So it sounds like (G/K)ahng-nahm.

Let’s see how he did. Judges?

Full marks for accuracy, Rachael. Nothing he said is incorrect. On the other hand, I give it a usability score of just 2 out of 10. While the descriptions of the vowels and nasal sounds are intelligible and usable to most English speakers, even I was stumped by his description of a sound between a “g” and a “k”. A strong effort, though; with some training this kid could make it to the big leagues of phonetics.

Thank you, Rachael, and good luck to ThatWonAsianGuy in his future phonetics career. Ok, so what is going on here in terms of the k/g/apparently clicky harsh sound? Funny you should ask, because I’m about to tell you in gruesome detail.

First things first: you need to know what voicing is. Put your hand over your throat and say “k”. Now say “g”. Can you feel how, when you say “g”, there’s sort of a buzzing feeling? That’s what linguists call voicing. What’s actually happening is that you’re pulling your vocal folds together and then forcing air through them. This makes them vibrate, which in turn makes a sound. Like so:

(If you’re wondering what that little cat-tongue-looking thing is, that’s the epiglottis. It keeps you from choking to death by trying to breathe food and is way up there on my list of favorite body parts.)

But wait! That’s not all! What we think of as “regular voicing” (ok, maybe you don’t think of it all that often, but I’m just going to assume that you do) is just one of the things you can do with your vocal folds. What other types of voicing are there? It’s the type of thing that’s really best described vocally, so here goes:

Ok, so, that’s what’s going on in your larynx. Why is this important? Well, it turns out that only one of the three sounds is actually voiced, and it’s voiced using a different type of voicing. Any guesses as to which one?

Yep, it’s the harsh, clicky one, and it’s got glottal voicing (that really low, creaky sort of voice)*. The difference between the “regular k” and the “k/g sound” has nothing to do with voicing type. Which is crazy talk, because almost every “learn Korean” textbook or online course I’ve come across has described them as “k” and “g” respectively and, as we already established, the difference between “k” and “g” is that the “g” is voiced and the “k” isn’t.

Ok, I simplified things a bit. When you say “k” and “g” at the beginning of a word in English (and only at the beginning of a word), there’s actually one additional difference between them. Try this. Put your hand in front of your mouth and say “cab”. Then say “gab”. Do you notice a difference?

You should have felt a puff of air when you said the “k” but not when you said the “g”. Want proof that it only happens at the beginning of words? Try saying “back” and “bag” in the same way, with your hand in front of your mouth. At the end of words they feel about the same. What’s going on?

Well, in English we always say an unvoiced “k” with a little puff of air at the beginning of a word. In fact, we tend to listen for that puff more than we listen for voicing. So if you say “kat” without voicing the sound, but also without the little puff of air, it sounds more like “gat”. (Which is why language teachers tell you to say it as a “g” instead of a “k”. It’s not, strictly speaking, right, but it is a little easier to hear. The same thing happens in Mandarin, BTW.) And that’s the sound that’s at the beginning of Gangnam.

You’ll probably need to practice a bit before you get it right, but if you can make a sound at the beginning of a word where your vocal folds aren’t vibrating and without that little puff of air, you’re doing it right. You can already make the sound; it’s just moving it to the beginning of a word that’s throwing a monkey wrench in the works.

So it’s the unvoiced “k” without the little puff of air. Then an “aahhh” sound, just as described above. Then the “ng” sound, which you tend to see at the end of words in English. It can happen in the middle of words as well, though, like in “finger”. And then “nam”, pronounced in the same way as the last syllable of “Vietnam”.

In the special super-secret International Phonetic (Cabal’s) Alphabet, that’s [kaŋnam]. Now go out there and impress a Korean speaker by not butchering the phonetics of their language!

*Ok, ok, that’s a bit of an oversimplification. You can find the whole story here.

Why is studying linguistics useful? *Is* studying linguistics useful?

So I recently gave a talk at the University of Washington Scholar’s Studio. In it, I covered a couple things that I’ve already talked about here on my blog: the fact that, acoustically speaking, there’s no such thing as a “word” and that our ears can trick us. My general point was that our intuitions about speech, a lot of the things we think seem completely obvious, actually aren’t true at all from an acoustic perspective.

What really got to me, though, was that after I’d finished my talk (and it was super fast, too, only five minutes) someone asked why it mattered. Why should we care that our intuitions don’t match reality? We can still communicate perfectly well. How is linguistics useful, they asked. Why should they care?

I’m sorry, what was it you plan to spend your life studying again? I know you told me last week, but for some reason all I remember you saying is “Blah, blah, giant waste of time.”

It was a good question, and I’m really bummed I didn’t have time to answer it. I sometimes forget, as I’m wading through the hip-deep piles of readings that I need to get to, that it’s not immediately obvious to other people why what I do is important. And it is! If I didn’t believe that, I wouldn’t be in grad school. (It’s certainly not the glamorous easy living and fat salary that keep me here.) It’s important in two main ways. Way one is the way in which it enhances our knowledge and way two is the way that it helps people.

Increasing our knowledge. Ok, so, a lot of our intuitions are wrong. So what? So a lot of things! If we’re perceiving things that aren’t really there, or not perceiving things that are really there, something weird and interesting is going on. We’re really used to thinking of ourselves as pretty unbiased in our observations. Sure, we can’t hear all the sounds that are made, but we’ve built sensors for that, right? But it’s even more pervasive than that. We only perceive the things that our bodies and sensory organs and brains can perceive, and we really don’t know how all these biological filters work. Well, okay, we do know some things (lots and lots of things about ears, in particular) but there’s a whole lot that we still have left to learn. The list of unanswered questions in linguistics is a little daunting, even just in the sub-sub-field of perceptual phonetics.

Every single one of us uses language every single day. And we know embarrassingly little about how it works. And what we do know is often hard to share with people who have little background in linguistics. Even here, in my blog, without time constraints and with an audience that’s already pretty interested (You guys are awesome!), I often have to gloss over interesting things. Not because I don’t think you’ll understand them, but because I’d metaphorically have to grow a tree, chop it down and spend hours carving it just to make a little step stool so you can get the high-level concept off the shelf and, seriously, who has time for that? Sometimes I really envy scientists in the major disciplines because everyone already knows the basics of what they study. Imagine that you’re a geneticist, but before you can tell people you look at DNA, you have to convince them that sexual reproduction exists. I dream of the day when every graduating high school senior will know IPA. (That’s the International Phonetic Alphabet, not the beer.)

Okay, off the soapbox.

Helping people. Linguistics has lots and lots and lots of applications. (I’m just going to talk about my little sub-field here, so know that there’s a lot of stuff being left unsaid.) The biggest problem is that so few people know that linguistics is a thing. We can and want to help!

  • Foreign language teaching. (AKA applied linguistics) This one is a particular pet peeve of mine. How many of you have taken a foreign language class and had the instructor tell you something about a sound in the language, like: “It’s between a “k” and a “g” but more like the “k” except different.” That crap is not helpful. Particularly if the instructor is a native speaker of the language, they’ll often just keep telling you that you’re doing it wrong without offering a concrete way to make it correctly. Fun fact: There is an entire field dedicated to accurately describing the sounds of the world’s languages. One good class on phonetics and suddenly you have a concrete description of what you’re supposed to be doing with your mouth and the tools to tell when you’re doing it wrong. On the plus side, a lot of language teachers are starting to incorporate linguistics into their curriculum with good results.
  • Speech recognition and speech synthesis. So this is an area that’s a little more difficult. Most people working on these sorts of projects right now are computational people and not linguists. There is a growing community of people who do both (UW offers a master’s degree in computational linguistics that feeds lots of smart people into Seattle companies like Microsoft and Amazon, for example) but there’s definite room for improvement. The main tension is the fact that using linguistic models instead of statistical ones (though some linguistic models are statistical) hugely increases the need for processing power. The benefit is that accuracy tends to increase. I hope that, as processing power continues to be easier and cheaper to access, more linguistics research will be incorporated into these applications. Fun fact: In computer speech recognition, an 80% comprehension accuracy rate in conversational speech is considered acceptable (there’s a quick sketch of how that sort of number gets computed right after this list). In humans, that’s grounds to test for hearing or brain damage.
  • Speech pathology. This is a great field and has made and continues to make extensive use of linguistic research. Speech pathologists help people with speech disorders overcome them, and the majority of speech pathologists have an undergraduate degree in linguistics and a master’s in speech pathology. Plus, it’s a fast-growing career field with a good outlook. Seriously, speech pathology is awesome. Fun fact: Almost half of all speech pathologists work in school environments, helping kids with speech disorders. That’s like the antithesis of a mad scientist, right there.
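
About that 80% figure in the speech recognition bullet: accuracy there is usually scored as something like word accuracy, which is one minus the word error rate; that’s my assumption about what the number refers to. Here’s a minimal sketch, in Python, of how word error rate gets computed from a reference transcript and a recognizer’s guess:

```python
def word_error_rate(reference, hypothesis):
    """Minimum number of word substitutions, insertions, and deletions needed
    to turn the recognizer's output into the reference transcript, divided by
    the length of the reference. Accuracy is then roughly 1 - WER."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

reference = "why is it so hard for computers to recognize speech"
hypothesis = "why is it so hard for computers to wreck a nice beach"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.0%}, word accuracy: {1 - wer:.0%}")  # WER: 40%, word accuracy: 60%
```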

And that’s why you should care. Linguistics helps us learn about ourselves and help people, and what else could you ask for in a scientific discipline? (Okay, maybe explosions and mutant sharks, but do those things really help humanity?)

Mapping language, language maps

So for some reason, I’ve come across three studies in quick succession centered on mapping language. Now, if you know me, you know that nattering on about linguistic methodology is pretty much the Persian cat to my Blofeld, but I really do think that looking at the way that linguists do linguistics is incredibly important. (Warning: the next paragraph will be kinda preachy, feel free to skip it.)

It’s something the field, to paint with an incredibly broad brush, tends to skimp on. After all, we’re asking all these really interesting questions that have the potential to change people’s lives. How is hearing speech different from hearing other things? What causes language pathologies and how can we help correct them? Can we use the voice signal to reliably detect Parkinson’s over the phone? That’s what linguistics is. Who has time to look at whether asking people to list the date on a survey form affects their responses? If linguists don’t use good, controlled methods to attempt to look at these questions, though, we’ll either find the wrong answers or miss them completely because of some confounding variable we didn’t think about. Believe me, I know firsthand how heart-wrenching it is to design an experiment, run subjects, do your stats and end up with a big pile of useless goo because your methodology wasn’t well thought out. It sucks. And it happens way more than it needs to, mainly because a lot of linguistics programs don’t stress rigorous scientific training.

OK, sermon over. Maps! I think using maps to look at language data is a great methodology! Why?

Hmm… needs more data about language. Also the rest of the continents, but who am I to judge? 
  1. You get an end product that’s tangible and easy to read and use. People know what maps are and how to use them. Presenting linguistic data as a map rather than, say, a terabyte of detailed surveys or a thousand hours of recordings is a great way to make that same data accessible. Accessible data gets used. And isn’t that kind of the whole point? (There’s a little sketch of what this can look like in code right after this list.)
  2. Maps are so accurate right now. This means that maps of data aren’t just rough approximations; they’re the best, most accurate way to display this information. Seriously, the stuff you can do with GIS is just mind-blowing. (Check out this dialect map of the US. If you click on the region you’re most interested in, you get additional data like field recordings, along with the precise place they were made. Super useful.)
  3. Maps are fun. Oh, come on, who doesn’t like looking at maps? Particularly if you’re looking at a region you’re familiar with. See, here’s my high school, and the hay field we rented three years ago. Oh, and there’s my friend’s house! I didn’t realize they were so close to the highway. Add a second layer of information and BOOM, instant learning.
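
As promised in point one, here’s roughly what “put the data on a map” can look like in code. This is a sketch of my own, assuming the folium Python library and a handful of invented geotagged tweets; it’s not the actual pipeline behind any of the studies below.

```python
import folium  # pip install folium

# Completely made-up (latitude, longitude, language) points standing in for geotagged tweets.
tweets = [
    (51.515, -0.141, "English"),
    (51.507, -0.123, "Spanish"),
    (51.539, -0.142, "Arabic"),
    (51.461, -0.116, "Portuguese"),
]

colors = {"English": "blue", "Spanish": "red", "Arabic": "green", "Portuguese": "purple"}

# Start the map roughly centered on London.
language_map = folium.Map(location=[51.5074, -0.1278], zoom_start=11)

for lat, lon, lang in tweets:
    folium.CircleMarker(
        location=[lat, lon],
        radius=4,
        color=colors[lang],
        fill=True,
        popup=lang,  # click a point to see which language it is
    ).add_to(language_map)

language_map.save("london_tweets.html")  # open in a browser and explore
```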

The studies

Two of the studies I came across were actually based on Twitter data. Twitter’s an amazing resource for studying linguistics because you have this enormous data set you can just use without having to get consent forms from every single person. So nice. Plus, because all tweets are archived, in the Library of Congress if nowhere else, other researchers can go back and verify things really easily.

This study looks at how novel slang expressions spread across the US. It hasn’t actually been published yet, so I don’t have the map itself, but they do talk about some interesting tidbits. For example: the places most likely to spawn new successful slang are urban centers with a high African American population.

The second Twitter study focused on London and looked at the different languages Londoners tweet in, and this one did come with a map:

Click for link to author’s blog post.

Interesting, huh? You can really get a good idea of the linguistic landscape of London. Although there were some potential methodological problems with this study, I still think it’s a great way to present this data.

The third study I came across is one that’s actually here at the University of Washington. This one is interesting because it kind of goes the other way. Basically, the researchers had respondents indicate areas on a map of Washington where they thought language communities existed and then had them describe them. So what you end up with is sort of a representation of the social ideas of what language is like in various parts of Washington State. Like so:

Click for link to study site.

There are lots more interesting maps on the study site, each of which shows some different perception of language use in Washington State. (My favorite is the one that suggests that people think other people who live right next to the Canadian border sound Canadian.)

So these are just a couple of the ways in which people are using maps to look at language data. I hope it’s a trend that continues.