New series: 50 Great Ideas in Linguistics

As I’ve been teaching this summer (And failing to blog on a semi-regular basis like a loser. Mea culpa.) I’ll occasionally find that my students aren’t familiar with something I’d assumed they’d covered at some point already. I’ve also found that there are relatively few resources for looking up linguistic ideas that don’t require a good deal of specialized knowledge going in. SIL’s glossary of linguistic terms is good but pretty jargon-y, and the various handbooks tend not to have on-line versions. And even with a concerted effort by linguists to make Wikipedia a good resource, I’m still not 100% comfortable with recommending that my students use it.

Therefore! I’ve decided to make my own list of Things That Linguistic-Type People Should Know and then slowly work on expounding on them. That way I’ll have something to point my students to, and it’s a nice, bite-sized way to talk about things; perfect for a blog.

Here, in no particular order, are 50ish Great Ideas in Linguistics sorted by sub-discipline. (You may notice a slight sub-disciplinary bias.) I might change my mind on some of these–and feel free to jump in with suggestions–but it’s a start. Look out for more posts on them.

  • Sociolinguistics
    • Sociolinguistic variables
    • Social class and language
    • Social networks
    • Accommodation
    • Style
    • Language change
    • Linguistic security
    • Linguistic awareness
    • Covert and overt prestige
  • Phonetics
    • Places of articulation
    • Manners of articulation
    • Voicing
    • Vowels and consonants
    • Categorical perception
    • “Ease”
    • Modality
  • Phonology
    • Rules
    • Assimilation and dissimilation
    • Splits and mergers
    • Phonological change
  • Morphology
  • Syntax
  • Semantics
    • Pragmatics
    • Truth values
    • Scope
    • Lexical semantics
    • Compositional semantics
  • Computational linguistics
    • Classifiers
    • Natural Language Processing
    • Speech recognition
    • Speech synthesis
    • Automata
  • Documentation/Revitalization
    • Language death
    • Self-determination
  • Psycholinguistics

How to Take Care of your Voice

Inflammation, polyps and nodules, oh my! Learn about some common problems that can affect your voice and how to avoid them, all in a shiny new audio format. For more tips about caring for your vocal folds and more information about rarer problems like tumours or paralysis, check out this page or this page.

Why do people talk in their sleep?

Sleep-talking (or “somniloquy”, as us fancy-pants scientist people call it) is a phenomenon in which a sleeping person starts talking: think of the internet sensation The Sleep Talkin Man. Sleep talking can range from grunts or moans to relatively clear speech. While most people know what sleep talking is (there was even a hit song about it that’s older than I am), fewer people know what causes it.

[Image: El sueño, by A. Cortina]
Sure, she looks all peaceful, but you should hear her go on.
To explain what happens when someone’s talking in their sleep, we first need to talk about 1) what happens during sleep and 2) what happens when we talk normally.

  • Sleeping normally: One of the weirder things about sleep talking is that it happens at all. When you’re sleeping normally, your muscles undergo atonia during the stage of sleep called Rapid Eye Movement, or REM, sleep. Basically, your muscles release and go into a state of relaxation or paralysis. If you’ve ever woken suddenly and been unable to move, it’s because your body is still in that state. This serves an important purpose: when we dream, we can rehearse movements without actually moving around and hurting ourselves. Of course, the system isn’t perfect. When your muscles fail to “turn off” while you dream, you’ll end up acting out your dream and sleep walking. This is particularly problematic for people with narcolepsy.
  • Speaking while awake: So speech is an incredibly complex process. Between a tenth and a third of a second before you begin to speak, activation starts in the insula, the part of the brain where you plan the movements you’ll need to successfully speak. Those movements come in three main stages that I like to call breathing, vibrating and tonguing. All speech comes from breath, so you need to inhale in preparation for speaking. Normal exhalation won’t work for speaking, though–it’s too fast–so you switch on your intercostal muscles, in the walls of your ribcage, to help your lungs empty more slowly. Next, you need to tighten your vocal folds as you force air through them. This makes them vibrate (like so) and gives you the actual sound of your voice. By putting different amounts of pressure on your vocal folds you can change your pitch or the quality of your voice. Finally, your mouth needs to shape the buzzing sound your vocal folds make into the specific speech sounds you need. You might flick your tongue, bring your teeth to your lips, or open your soft palate so that air goes through your nose instead of your mouth. And voila! You’re speaking.

Ok, so, it seems like sleep talking shouldn’t really happen, then. When you’re asleep your muscles are all turned off, and they certainly don’t seem up to the multi-stage process that is speech production. Besides, there’s no need for us to be making speech movements anyway, right? Wrong. You actually use your speech planning processes even if you’re not planning to speak aloud. I’ve already talked about the motor theory of speech perception, which suggests that we use our speech planning mechanisms to understand speech. And it’s not just speech perception. When reading silently, we still plan out the speech movements we’d make if we were to read out loud (though the effect is smaller with more fluent readers). So you sometimes do all the planning work even if you’re not going to say anything… and one of the times you do that is when you’re asleep. Usually, your muscles are all turned off while you sleep. But sometimes, especially in young children or people with PTSD, that off-switch doesn’t work as well as it should. And if it fails just when you’re dreaming that you’re talking, and therefore planning out your speech movements? You start sleep talking.

Of course, all of this means that some of the things we’ve all heard about sleep talking are actually myths. Admissions of guilt while asleep, for example, aren’t reliable and aren’t admissible in court. (Unless, of course, you really did put that purple beaver in the banana pudding.) Sleep talking is also very common; about 50% of children talk in their sleep. Unless it’s causing problems–like waking the people you’re sleeping with–sleep talking isn’t generally anything to worry about. But you can help reduce the severity by getting enough sleep (which is probably a good goal anyway) and avoiding alcohol and drugs.

Excellent BBC program about forensic phonetics

I don’t usually reblog things, but this is an excellent program on the use of forensic phonetics in Britain.

Which are better, earphones or headphones?

As a phonetician, it’s part of my job to listen to sounds very closely. Plus, I like to listen to music while I work, enjoy listening to radio dramas and use a headset to chat with my guildies while I’m gaming.  As a result, I spend a lot of time with things on/in my ears. And, because of my background, I’m also fairly well informed about the acoustic properties of  earphones and headphones and how they interact with anatomy. All of which helps me answer the question: which is better? Or, more accurately, what are some of the pros and cons of each? There are a number of factors to consider, including frequency response, noise isolation, noise cancellation and comfort/fit. Before I get into specifics, however, I want to make sure we’re on the same page when we talk about “headphones” and “earphones”.

Earphones: For the purposes of this article, I’m going to use the term “earphone” to refer to devices that are meant to be worn inside the pinna (that’s the fancy term for the part of the ear you can actually see). These are also referred to as “earbuds”, “buds”, “in-ears”, “canalphones”, “in-ear monitors”, “IEMs” and “in-ear headphones”. You can see an example of what I’m calling “earphones” below.

[Image: iPod Touch 2G earbuds with remote and mic]
Ooo, so white and shiny and painful.

Headphones: I’m using this term to refer to devices that are not meant to rest in the pinna, whether they go around or on top of the ear. These are also called “earphones”, (apparently) “earspeakers” or, my favorites, “cans”. You can see somewhat antiquated examples of what I’m calling “headphones” below.

[Image: Club holds radio dance wearing earphones, 1920]
I mean, sure, it’s a wonder of modern technology and all, but the fidelity is just so low.

Alright, now that we’ve cleared that up, let’s get down to brass tacks. (Or, you might say… bass tacks.)

  1. Frequency response curve: How much distortion do they introduce? In an ideal world, ‘phones should respond equally well to all frequencies (or pitches), without transmitting one frequency range more loudly than another. This desirable feature is commonly referred to as a “flat” frequency response. That means that the signal you’re getting out is pretty much the same one that was fed in, at all frequency ranges.
    1. Earphones: In general, earphones tend to have a less flat frequency response.
    2. Headphones: In general, headphones tend to have a flatter frequency response.
    3. Winner: Headphones are probably the better choice if you’re really worried about distortion. You should read the specifications of the device you’re interested in, however, since there’s a large amount of variability.
  2. Frequency response: What is their pitch range? This term is sometimes used to refer to the frequency response curve I talked about above and sometimes used to refer to pitch range. I know, I know, it’s confusing. Pitch range is usually expressed as the lowest sound the ‘phones can transmit followed by the highest. Most devices on the market today can pretty much play anything between 20 and 20k Hz. (You can see what that sounds like here. Notice how it sounds loudest around 300Hz? That’s an artifact of your hearing, not the video. Humans are really good at hearing sounds around 300Hz which [not coincidentally] is about where the human voice hangs out.)
    1. Earphones: Earphones tend to have a narrower pitch range than headphones. Of course, there are always exceptions.
    2. Headphones: Headphones tend to have a wider frequency range than earphones.
    3. Winner: In general, headphones have a wider frequency range. That said, it’s not really that big of a deal. Because of the way your hearing system works, you can’t hear very high or very low sounds all that well anyway, regardless of how well your ‘phones are delivering the signal. Anything that plays sounds between 20 Hz and 20,000 Hz should do you just fine.
  3. Noise isolation: How well do they isolate you from sounds other than the ones you’re trying to listen to? More noise isolation is generally better, unless there’s some reason you need to be able to hear environmental sounds as well as whatever you’re listening to. Better isolation also means you’re less likely to bother other people with your music.
    1. Earphones:  A properly fitted pair of in-ear earphones will give you the best noise isolation. It makes sense; if you’re wearing them properly they should actually form a complete seal with your ear canal. No sound in, no sound out, excellent isolation.
    2. Headphones: Even really good over-ear headphones won’t form a complete seal around your ear. (Well, ok, maybe if you’re completely bald and you make some creative use of adhesives, but you know what I mean.) As a result, you’re going to get some noise leakage.
    3. Winner: You’ll get the best noise isolation from well-fitting earphones that sit in the ear canal.
  4. Noise cancellation: How well can they correct for ambient sounds? So noise cancellation is actually completely different from noise isolation. Noise isolation is something that all ‘phones have. Noise-cancelling ‘phones, on the other hand, actually do some additional signal processing before you get the sound. They “listen” for steady ambient sounds, like an air-conditioner or a car engine. Then they take that waveform, reproduce it and invert it. When they play the inverted waveform along with your music, it cancels out that background sound (there’s a small sketch of the idea just after this list). Which is awesome and space-agey, but isn’t perfect. They only really work with steady background noises. If someone drops a book, they won’t be able to cancel that sudden, sharp noise. They also tend not to work as well with really high-pitched noises.
    1. Earphones: Noise-cancelling earphones tend not to be as effective as noise-cancelling headphones until you get to the high end of the market (think $200 plus).
    2. Headphones: Headphones tend to be slightly better at noise-cancellation than earphones of a similar quality, in my experience. This is partly due to the fact that there’s just more room for electronics in headphones.
    3. Winner: Headphones usually have a slight edge here. Of course, really expensive noise-cancelling devices, whether headphones or earphones, usually perform better than their bargain cousins.
  5. Comfort/fit: Are they comfy?
    1. Earphones: So this is where earphones tend to suffer. There is quite a bit of variation in the shape of the cavum conchæ, which is the little bowl shape just outside your ear canal. Earphone manufacturers have to put their magnets, drivers and driver support equipment somewhere, and it usually ends up in the “head” of the earphone, nestled right in your cavum conchæ. Which is awesome if it’s a shape that fits your ear. If it’s not, though, it can quickly start to become irritating and eventually downright painful. Personally, this is the main reason I prefer over-ear headphones.
    2. Headphones: A nicely fitted pair of over-ear headphones that covers your whole ear is just incredibly comfortable. Plus, they keep your ears warm! I find on-ear headphones less comfortable in general, but a nice cushy pair can still feel awesome. There are other factors to take into account, though; wearing headphones and glasses with a thick frame can get really uncomfortable really fast.
    3. Winner: While this is clearly a matter of personal preference, I have a strong preference for headphones on this count.
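
Since the phase-inversion trick in point 4 is really just arithmetic, here's a minimal numpy sketch of the idea (purely illustrative, with made-up signals; a real noise-cancelling headset works on a live microphone feed in real time): a steady hum plus its inverted copy cancels to nothing, while a sudden click sails right through.

```python
import numpy as np

# Toy demonstration of the phase-inversion idea behind noise cancellation.
# This is just the arithmetic, not a real-time DSP implementation.

sr = 44100                                    # sample rate in Hz
t = np.arange(sr) / sr                        # one second of sample times

hum = 0.5 * np.sin(2 * np.pi * 120 * t)       # steady background noise: a 120 Hz "engine hum"
click = np.zeros_like(t)
click[sr // 2] = 1.0                          # a sudden, sharp noise halfway through

anti_hum = -hum                               # the "listened-to" noise, reproduced and inverted

residual_hum = hum + anti_hum                 # the steady hum cancels completely
residual_click = (hum + click) + anti_hum     # the click survives untouched

print("leftover hum:  ", np.max(np.abs(residual_hum)))    # 0.0
print("leftover click:", np.max(np.abs(residual_click)))  # 1.0
```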

So, for me at least, headphones are the clear winner overall. I find them more comfortable, and they tend to reproduce sound better than earphones. There are instances where I find earphones preferable, though. They’re great for travelling or if I really need an isolated signal. When I’m just sitting at my desk working, though, I reach for headphones 99% of the time.

One final caveat: the sound quality you get out of your ‘phones depends most on what files you’re playing. The best headphones in the world can’t do anything about quantization noise (that’s the noise introduced when you convert analog sound-waves to digital ones) or a background hum in the recording.
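
If you’re curious what that quantization noise actually amounts to, here's a tiny numpy sketch (purely illustrative): rounding a clean tone to 8-bit resolution bakes a noise floor roughly 50 dB below the signal into the file itself, and no pair of ‘phones can take that back out.

```python
import numpy as np

# Toy illustration of quantization noise: round a clean sine wave to 8-bit
# resolution and measure how much noise the rounding introduces.

sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)            # a clean 440 Hz tone, samples in [-1, 1]

bits = 8
levels = 2 ** (bits - 1)                      # 128 steps per polarity
quantized = np.round(tone * levels) / levels  # snap each sample to the nearest step

error = quantized - tone                      # the quantization noise itself
snr_db = 10 * np.log10(np.mean(tone**2) / np.mean(error**2))
print(f"signal-to-noise ratio at {bits} bits: {snr_db:.1f} dB")  # roughly 50 dB
```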

Feeling Sound

We’re all familiar with the sensation of sound so loud we can actually feel it: the roar of a jet engine, the palpable vibrations of a loud concert, a thunderclap so close it shakes the windows. It may surprise you to learn, however, that that’s not the only way in which we “feel” sounds. In fact, recent research suggests that tactile information might be just as important as sound in some cases!

[Image: Touch Gently]
What was that? I couldn’t hear you, you were touching too gently.
I’ve already talked about how we can see sounds, and the role that sound plays in speech perception, before. But just how much overlap is there between our senses of touch and hearing? There is actually pretty strong evidence that what we feel can override what we’re hearing. Yau et al. (2009), for example, found that tactile information about frequency could override auditory cues. In other words, you might hear two identical tones as different if you’re holding something that is vibrating faster or slower. If our visual system had a similar interplay, we might think that a person was heavier if we looked at them while holding a bowling ball, and lighter if we looked at them while holding a volleyball.

And your sense of touch can override your ears (not that they were that reliable to begin with…) when it comes to speech as well. Gick and Derrick (2013) found that tactile information can override auditory input for speech sounds. You can be tricked into thinking that you heard “peach” rather than “beach”, for example, if you’re played the word “beach” while a puff of air is blown over your skin just as you hear the “b” sound. This works because when an English speaker says “peach”, they aspirate the “p”, or say it with a little puff of air, and there’s no such puff on the “b” in “beach”. Feel a puff at just the right moment and your brain counts it as evidence for “p”, so you hear the wrong word.

Which is all very cool, but why might this be useful to us as language-users? Well, it suggests that we use a variety of cues when we’re listening to speech. Cues act as little road-signs that point us towards the right interpretation. By having access to lots of different cues, we ensure that our perception is more robust. Even when we lose some cues–say, a bear is roaring in the distance and masking some of the auditory information–we can use the others to figure out that our friend is telling us there’s a bear. In other words, even if some of the road-signs are removed, you can still get where you’re going. Language is about communication, after all, and it really shouldn’t be surprising that we use every means at our disposal to make sure that communication happens.

The Acoustic Theory of Speech Perception

So, quick review: speech perception is hard to model, and the first model we discussed, motor theory, while it does address some problems, leaves something to be desired. The big one is that it doesn’t treat the acoustic speech signal itself as the main fodder for perception. And that strikes me as odd. I mean, we’re really used to thinking about hearing speech as an audio-only thing. Telephones and radios work perfectly well, after all, and the information you’re getting there is completely audio. That’s not to say that we don’t use visual, or, heck, even tactile data in speech perception. The McGurk effect, where a voice saying “ba” dubbed over someone saying “ga” will be perceived as “da” or “tha”, is strong evidence that we can and do use our eyes during speech perception. And there’s even evidence that a puff of air on the skin will change our perception of speech sounds. But we seem to be able to get along perfectly well without these extra sensory inputs, relying on acoustic data alone.

[Image: Sound as a physical vibration]
This theory sounds good to me. Sorry, I’ll stop.
Ok, so… how do we extract information from acoustic data? Well, like I’ve said a couple of times before, it’s actually a pretty complex problem. There’s no such thing as “invariance” in the speech signal, and that makes speech recognition monumentally hard. We tend not to think about it because humans are really, really good at figuring out what people are saying, but it’s really very, very complex.

You can think about it like this: imagine that you’re looking for information online about platypuses. Except, for some reason, there is no standard spelling of platypus. People spell it “platipus”, “pladdypuss”, “plaidypus”, “plaeddypus” or any of thirty or forty other variations. Even worse, one person will use many different spellings and may never spell it precisely the same way twice. Now, a search engine that worked the way our speech recognition works would not only find every instance of the word platypus–regardless of how it was spelled–but would also recognize that every spelling referred to the same animal. Pretty impressive, huh? Now imagine that every word has a very variable spelling, oh, and there are no spaces between words–everythingisjustruntogetherlikethisinonelongspeechstream. Still not difficult enough for you? Well, there are also ambiguities. The search algorithm would need to treat “pladypuss” (in the sense of a plaid-patterned cat) and “palattypus” (in the sense of the venomous monotreme) as separate things. Ok, ok, you’re right, it still seems pretty solvable. So let’s add the stipulation that the program needs to be self-training and have an accuracy rate that’s incredibly close to 100%. If you can build a program to these specifications, congratulations: you’ve just revolutionized speech recognition technology. But we already have a working example of a system that looks a heck of a lot like this: the human brain.

So how does the brain deal with the “different spellings” when we say words? Well, it turns out that there are certain parts of a word that are pretty static, even if a lot of other things move around. It’s like a superhero reboot: Spiderman is still going to be Peter Parker and get bitten by a spider at some point and then get all moody and whine for a while. A lot of other things might change, but if you’re only looking for those criteria to figure out whether or not you’re reading a Spiderman comic you have a pretty good chance of getting it right. Those parts that are relatively stable and easy to look for we call “cues”. Since they’re cues in the acoustic signal, we can be even more specific and call them “acoustic cues”.

If you think of words (or maybe sounds; it’s a point of some contention) as being made up of certain cues, then it’s basically like the list of things a house-buyer is looking for in a house. If a house has all, or at least most, of the things they’re looking for, then it’s probably the right house and they’ll select that one. In the same way, having a lot of cues pointing towards a specific word makes it really likely that that word is going to be selected. When I say “selected”, I mean that the brain will connect the acoustic signal it just heard to the knowledge you have about a specific thing or concept in your head. We can think of a “word” as both this knowledge and the acoustic representation. So in the “platypus” example above, all the spellings started with “p” and had an “l” no more than one letter away. That looks like a pretty robust cue. And all of the spellings had a second “p” in them and ended with one or two tokens of “s”. So that also looks like a pretty robust cue. Add to that the fact that all the spellings had at least one “d” or “t” in between the first and second “p” and you have a pretty strong template that would help you correctly identify all those spellings as the same word.
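
To make that cue-template idea concrete, here's a toy Python sketch of the platypus example, using exactly the cues listed above: an initial “p” with an “l” no more than one letter away, a second “p”, a final “s” or “ss”, and a “d” or “t” somewhere between the two “p”s. It’s deliberately simplistic, and notice that it happily matches the plaid-patterned cat too, which is just the sort of ambiguity a listener has to resolve with other cues or context.

```python
# Toy cue-based "recognizer" for the platypus example. Each cue is a little
# test on the spelling; a spelling that passes all of them gets mapped onto
# the same stored word.

def matches_platypus_template(spelling: str) -> bool:
    s = spelling.lower()
    p_positions = [i for i, ch in enumerate(s) if ch == "p"]

    starts_with_p = s.startswith("p")
    l_near_start = "l" in s[1:3]              # an "l" no more than one letter away
    has_second_p = len(p_positions) >= 2
    ends_in_s = s.endswith("s")               # covers both "s" and "ss"
    d_or_t_between = has_second_p and any(
        ch in "dt" for ch in s[p_positions[0] + 1 : p_positions[1]]
    )

    return all([starts_with_p, l_near_start, has_second_p, ends_in_s, d_or_t_between])

spellings = ["platipus", "pladdypuss", "plaidypus", "plaeddypus",
             "palattypus", "pladypuss", "beach"]

for spelling in spellings:
    verdict = "same word" if matches_platypus_template(spelling) else "no match"
    print(f"{spelling:12} -> {verdict}")
```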

Which all seems well and good, and fits pretty well with our intuitions (or mine, at any rate). But that leaves us with a bit of a problem: those pesky parts of motor theory that are really strongly supported experimentally. And this model works just as well for motor theory, too: just make the “letters” gestures rather than acoustic cues. There seems to be more to the story than either the acoustic model or the motor theory model can offer us, though both have led to useful insights.

The Motor Theory of Speech Perception

Ok, so like I talked about in my previous two posts, modelling speech perception is an ongoing problem with a lot of hurdles left to jump. But there are potential candidate theories out there, all of which offer good insight into the problem. The first one I’m going to talk about is motor theory.

[Image: Clamp-type electric motor]
So your tongue is like the motor body and the other person’s ear is like the load cell…
So motor theory has one basic premise and three major claims. The basic premise is a keen observation: we don’t just perceive speech sounds, we also make them. Whoa, stop the presses. Ok, so maybe it seems really obvious, but motor theory was really the first major attempt to model speech perception that took this into account. Up until it was first posited in the 1960s, people had pretty much been ignoring that, treating speech perception as if the only information listeners had access to was what was in the acoustic speech signal. We’ll discuss that in greater detail later, but it’s still pretty much the way a lot of people approach the problem. I don’t know of a piece of voice recognition software, for example, that includes an anatomical model.

So what does the fact that listeners are also speakers get you? Well, remember how there aren’t really invariant units in the speech signal? If you decide that what people are actually perceiving isn’t a collection of acoustic markers that point to one particular language sound, but rather the gestures needed to make up that sound, then suddenly that’s much less of a problem. To put it another way: we’re used to thinking of speech as being made up of a bunch of sounds, and of listening as deciding what the right sounds are and from there picking the right words. But from a motor theory standpoint, what you’re actually doing when you’re listening to speech is deciding what the speaker’s doing with their mouth and using that information to figure out what words they’re saying. So in the dictionary in your head, you don’t store words as strings of sounds but rather as strings of gestures.

If you’re like me when I first encountered this theory, it’s about this time that you’re starting to get pretty skeptical. I mean, I basically just said that what you’re hearing is the actual movement of someone else’s tongue, and that you figure out what they’re saying by reverse-engineering it based on what you know your tongue is doing when you say the same word. (Just FYI, when I say tongue here, I’m referring to the entire vocal tract in its multifaceted glory, but that’s a bit of a mouthful. Pun intended. 😉 ) I mean, yeah, if we accept this it gives us a big advantage when we’re talking about language acquisition–since if you’re listening to gestures, you can learn them just by listening–but still. It’s weird. I’m going to need some convincing.

Well, let’s get back to those three principles I mentioned earlier, which are taken from Galantucci, Fowler and Turvey’s excellent review of motor theory.

  1. Speech is a weird thing to perceive and pretty much does its own thing. I’ve talked about this at length, so let’s just take that as a given for now.
  2. When we’re listening to speech, we’re actually listening to gestures. We talked about that above. 
  3. We use our motor system to help us perceive speech.

Ok, so point three should jump out at you a bit. Why? Of these three points, it’s the easiest one to test empirically. And since I’m a huge fan of empirically testing things (Science! Data! Statistics!) we can look into the literature and see if there’s anything that supports this. Like, for example, a study that shows that when we listen to speech, our motor cortex gets all involved. Well, it turns out that there are lots of studies that show this. You know that term “active listening”? There’s pretty strong evidence that it’s more than just a metaphor; listening to speech involves our motor system in ways that not all acoustic inputs do.

So point three is pretty well supported. What does that mean for point two? It really depends on who you’re talking to. (Science is all about arguing about things, after all.) Personally, I think motor theory is really interesting and addresses a lot of the problems we face in trying to model speech perception. But I’m not ready to swallow it hook, line and sinker. I think Robert Remez put it best in the proceedings of Modularity and the Motor Theory of Speech Perception:

I think it is clear that Motor Theory is false. For the other, I think the evidence indicates no less that Motor Theory is essentially, fundamentally, primarily and basically true. (p. 179)

On the one hand, it’s clear that our motor system is involved in speech perception. On the other, I really do think that we use parts of the acoustic signal in and of themselves. But we’ll get into that in more depth next week.

Why speech is different from other types of sounds

Ok, so, a couple of weeks ago I talked about why speech perception is hard to model. Really, though, what I talked about was why building linguistic models in general is a hard task. There are a couple of other thorny problems that plague people who work on speech perception, and they have to do with the weirdness of the speech signal itself. They’re important to talk about because it’s in dealing with these weirdnesses that some theories of speech perception start to look pretty strange themselves. (Motor theory, in particular, tends to sound pretty messed-up the first time you encounter it.)

The speech signal, and the way we deal with it, is really strange in two main ways.

  1. The speech signal doesn’t contain invariant units.
  2. We both perceive and produce speech in ways that are surprisingly non-linear.

So what are “invariant units” and why should we expect to have them? Well, pretty much everyone agrees that we store words as larger chunks made up of smaller chunks. Like, you know that the word “beet” is going to be made with the lips together at the beginning for the “b” and your tongue behind your teeth at the end for the “t”. And you also know that it will have certain acoustic properties: a short break in the signal followed by a small burst of white noise in a certain frequency range (that’s the “b” again), then a long steady state for the vowel, and then another sudden break in the signal for the “t”. So people make those gestures and you listen for those sounds and everything’s pretty straightforward, right? Weeellllll… not really.

It turns out that you can’t just be grabbing onto certain types of acoustic cues, because they’re not always reliably there. There are a bunch of different ways to produce “t”, for example, that run the gamut from the way you’d say it by itself to something that sounds more like a “w” crossed with an “r”. When you’re speaking quickly in an informal setting, there’s no telling where on that continuum you’re going to fall. Even with this huge array of possible ways to produce the sound, however, you still somehow hear it as “t”.

And even those cues that are almost always reliably there vary drastically from person to person. Just think about it: about half the population has a fundamental frequency, or pitch, that’s pretty radically different from the other half’s. The old interplay of biological sex and voice quality thing. But you can easily, effortlessly even, correct for the speaker’s gender and understand speech produced by men and women equally well. If a man and a woman both say “beet”, you have no trouble telling that they’re saying the same word, even though the signal is quite different in the two cases. And that’s not a trivial task. Voice recognition technology, for example, which is overwhelmingly trained on male voices, often has a hard time understanding women’s voices. (Not to mention different accents. What that says about regional and sex-based discrimination is a topic for another time.)
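
The post doesn’t get into how a model might cope with this kind of between-speaker variation, but one standard trick from phonetics gives the flavor: Lobanov (z-score) normalization, where each speaker’s vowel formant measurements are rescaled by that speaker’s own mean and standard deviation. Here's a minimal sketch with invented formant values, just to show how two quite different-sounding voices can end up looking alike once the speaker is factored out.

```python
import numpy as np

# Minimal sketch of Lobanov (z-score) speaker normalization. The F1/F2
# values below (in Hz) are invented, purely for illustration.

speaker_a = np.array([[300, 2300], [700, 1200], [500, 1000]], dtype=float)  # lower-pitched voice
speaker_b = np.array([[400, 2900], [900, 1550], [650, 1300]], dtype=float)  # higher-pitched voice

def lobanov(formants: np.ndarray) -> np.ndarray:
    """Z-score each formant column within a single speaker."""
    return (formants - formants.mean(axis=0)) / formants.std(axis=0)

norm_a, norm_b = lobanov(speaker_a), lobanov(speaker_b)

# After normalization, corresponding vowels land close together even though
# the raw Hz values were quite different.
for raw_a, raw_b, za, zb in zip(speaker_a, speaker_b, norm_a, norm_b):
    print(f"raw {raw_a} vs {raw_b}  ->  normalized {np.round(za, 2)} vs {np.round(zb, 2)}")
```

Of course, that only handles one narrow slice of the variability problem.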

And yet. And yet humans are very, very good at recognizing speech. How? Well, linguists have made some striking progress towards answering that question, though we haven’t yet arrived at an answer that makes everyone happy. And the variance in the signal isn’t the only hurdle facing humans as they recognize the vocal signal: there’s also the fact that being human limits what we can hear.

[Image: Equal-loudness contours]
Ooo, pretty rainbow. Thorny problem, though: this shows how well we hear various frequencies. The sweet spot is right around 300 Hz or so. Which, not coincidentally, just so happens to be about where we produce most of the energy in the speech signal. But we do still produce information at other frequencies, and we do use it in speech perception: particularly for sounds like “s” and “f”.

We can think of the information available in the world as a sheet of cookie dough. This includes things like UV light and sounds below 0 dB in intensity. Now imagine a cookie-cutter. Heck, make it a gingerbread man. The cookie-cutter represents the ways in which the human body limits our access to this information. There are just certain things that even a normal, healthy human isn’t capable of perceiving: we only get the information that falls inside the cookie-cutter. And the older we get, the smaller the cookie-cutter becomes, as we slowly lose sensitivity in our auditory and visual systems. Even though it seems likely that we’ve evolved our vocal system to take advantage of the way our perceptual system works, this filtering makes the task of modelling speech perception even more complex.

Why is it hard to model speech perception?

So this is a kick-off post for a series of posts about various speech perception models. Speech perception models, you ask? Like, attractive people who are good at listening?

[Image: Fashion model]
Not only can she discriminate velar, uvular and pharyngeal fricatives with 100% accuracy, but she can also do it in heels.
No, not really. (I wish that was a job…) I’m talking about a scientific model of how humans perceive speech sounds. If you’ve ever taken an introductory science class, you already have some experience with scientific models. All of Newton’s equations are just a way of generalizing principles across many observed cases. A good model has both explanatory and predictive power. So if I say, for example, that force equals mass times acceleration, then that should fit with any data I’ve already observed as well as accurately describe new observations. Yeah, yeah, you’re saying to yourself, I learned all this in elementary school. Why are you still going on about it? Because I really want you to appreciate how complex this problem is.

Let’s take an example from an easier field, say, classical mechanics. (No offense, physicists, but y’all know it’s true.) Imagine we want to model something relatively simple. Perhaps we want to know whether a squirrel jumping from one tree to another is going to make it. What do we need to know? And none of that “assume the squirrel is a sphere and there’s no air resistance” stuff, let’s get down to the nitty-gritty. We need to know the force and direction of the jump, the locations of the trees, how close the squirrel needs to get to be able to hold on, what the wind’s doing, air resistance and how that will interplay with the shape of the squirrel, the effects of gravity… am I missing anything? I feel like I might be, but that’s most of it.

So, do you notice something that all of these quantities have in common? Yeah, that’s right: they’re easy to measure directly. Need to know what the wind’s doing? Grab your anemometer. Gravity? To the accelerometer closet! How far apart the trees are? It’s yardstick time. We need a value, we measure a value, we develop a model with good predictive and explanatory power (You’ll need to wait for your simulations to run on your department’s cluster. But here’s one I made earlier so you can see what it looks like. Mmmm, delicious!) and we clean up playing the numbers on the professional squirrel-jumping circuit.

Let’s take a similarly simple problem from the field of linguistics. You take a person, sit them down in a nice anechoic chamber*, plop some high-quality earphones on them, play a word that could be “bite” and could be “bike”, and ask them to tell you what they heard. What do you need to know to decide which way they’ll go? Well, assuming that your stimulus is actually 100% ambiguous (which is a little unlikely), there are a ton of factors you’ll need to take into account. Like, how recently and how often has the subject heard each of the words before? (Priming and frequency effects.) Are there any social factors that might affect their choice? (Maybe one of the participant’s friends has a severe overbite, so they just avoid the word “bite” altogether.) Are they hungry? (If so, they’ll probably go for “bite” over “bike”.) And all of that assumes that they’re a native English speaker with no hearing loss or speech pathologies and that the voice they’re hearing matches their own dialect, because all of that’ll bias the listener as well.
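
Just as an illustration (none of this is in the original post, and every predictor and weight below is invented), you could imagine rolling factors like these into a simple logistic-regression-style model of the listener’s choice:

```python
import math

# Toy sketch of how several of these factors might combine into a single
# prediction. The predictors and weights are invented, purely for
# illustration; a real study would estimate them from data.

def p_bite(primed_bite_recently: bool, log_freq_bite_minus_bike: float,
           hungry: bool) -> float:
    """Probability of reporting 'bite' for a perfectly ambiguous stimulus."""
    intercept = 0.0                                        # 50/50 when nothing pushes either way
    score = (intercept
             + 1.2 * (1 if primed_bite_recently else 0)    # priming effect
             + 0.8 * log_freq_bite_minus_bike              # frequency effect
             + 0.5 * (1 if hungry else 0))                 # the hunger "effect"
    return 1 / (1 + math.exp(-score))                      # logistic squashing

print(p_bite(primed_bite_recently=False, log_freq_bite_minus_bike=0.0, hungry=False))  # 0.5
print(p_bite(primed_bite_recently=True,  log_freq_bite_minus_bike=0.3, hungry=True))   # ~0.87
```

The logistic function just squashes an unbounded evidence score into a probability between 0 and 1, which is the standard way to model a two-way choice like this.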

The best part? All of this is incredibly hard to measure. In a lot of ways, human language processing is a black box. We can’t mess with the system too much, and taking it apart to see how it works, in addition to being deeply unethical, breaks the system. The best we can do is tap a hammer lightly against the side and use the sounds of the echoes to guess what’s inside. And, no, brain imaging is not a magic bullet for this. It’s certainly a valuable tool that has led to a lot of insights, but in addition to being incredibly expensive (MRI is easily more than a grand per participant, and no one has ever accused linguistics of being a field that rolls around in money like a dog in fresh-cut grass) we really need to resist the urge to rely too heavily on brain imaging studies, as a certain dead salmon taught us.

But! Even though it is deeply difficult to model, there has been a lot of really good work done towards a theory of speech perception. I’m going to introduce you to some of the main players, including:

  • Motor theory
  • Acoustic/auditory theory
  • Double-weak theory
  • Episodic theories (including Exemplar theory!)

Don’t worry if those all look like menu options in an Ethiopian restaurant (and you with your Amharic phrasebook at home, drat it all); we’ll work through them together.  Get ready for some mind-bending, cutting-edge stuff in the coming weeks. It’s going to be [fʌn] and [fʌnetɪk]. 😀

*Anechoic chambers are the real chambers of secrets.