Why speech is different from other types of sounds

Ok, so, a couple weeks ago I talked about why speech perception is hard to model. Really, though, what I talked about was why building linguistic models is a hard task. There are a couple of other thorny problems that plague people who work on speech perception, and they have to do with the weirdness of the speech signal itself. It’s worth talking about, because it’s in dealing with these weirdnesses that some theories of speech perception themselves start to look pretty strange. (Motor theory, in particular, tends to sound pretty messed-up the first time you encounter it.)

The speech signal and the way we deal with it is really strange in two main ways.

  1. The speech signal doesn’t contain invariant units.
  2. We both perceive and produce speech in ways that are surprisingly non-linear.

So what are “invariant units” and why should we expect to have them? Well, pretty much everyone agrees that we store words as larger chunks made up of smaller chunks. Like, you know that the word “beet” is going to be made with the lips together at the beginning for the “b” and your tongue behind your teeth at the end for the “t”. And you also know that it will have certain acoustic properties: a short break in the signal followed by a small burst of white noise in a certain frequency range (that’s the “b” again), then a long steady state for the vowel, and then another sudden break in the signal for the “t”. So people make those gestures and you listen for those sounds and everything’s pretty straightforward, right? Weeellllll… not really.
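To make the “chunks made of chunks” idea concrete, here’s a toy sketch in Python. The feature descriptions are informal glosses invented for illustration, not standard phonological notation:

```python
# A toy lexical entry: the word "beet" as a sequence of phones,
# each pairing an articulatory gesture with an acoustic expectation.
# The descriptions are invented glosses, not a standard feature system.
beet = [
    {"phone": "b", "gesture": "lips together, voicing",
     "acoustics": "short break, then a burst of noise"},
    {"phone": "i", "gesture": "tongue high and front",
     "acoustics": "long steady-state vowel formants"},
    {"phone": "t", "gesture": "tongue tip behind the teeth",
     "acoustics": "sudden break in the signal, then a burst"},
]

# "Recognition" in this naive picture is just matching each chunk in order.
print([segment["phone"] for segment in beet])  # -> ['b', 'i', 't']
```

The rest of the post is about why this naive matching picture can’t be the whole story.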

It turns out that you can’t really be grabbing onto certain types of acoustic cues, because they’re not always reliably there. There are a bunch of different ways to produce “t”, for example, that run the gamut from the way you’d say it by itself to something that sounds more like a “w” crossed with an “r”. When you’re speaking quickly in an informal setting, there’s no telling where on that continuum you’re going to fall. Even with this huge array of possible ways to produce the sound, however, you still somehow hear it as “t”.
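This many-realizations-to-one-category situation can be sketched as a simple lookup. The allophone labels below are real, but the inventory is drastically simplified:

```python
# Several phonetically distinct realizations (allophones) that English
# listeners all categorize as the single phoneme /t/. Simplified inventory.
ALLOPHONES_OF_T = {
    "tʰ": "aspirated, as in a careful 'top'",
    "t":  "unaspirated, as in 'stop'",
    "ɾ":  "flap, as in casual American 'water'",
    "ʔ":  "glottal stop, as in some pronunciations of 'button'",
}

def perceived_phoneme(token: str) -> str:
    """Collapse any /t/ allophone to the category the listener reports."""
    return "t" if token in ALLOPHONES_OF_T else token

print(perceived_phoneme("ɾ"))  # -> t
```

Of course, the hard part is the step this sketch skips: a listener gets raw acoustics, not a neatly labeled allophone.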

And even those cues that are almost always reliably there vary drastically from person to person. Just think about it: about half the population has a fundamental frequency, or pitch, that’s pretty radically different from the other half’s. The old interplay of biological sex and voice quality thing. But you can easily, effortlessly even, correct for the speaker’s gender and understand speech produced by men and women equally well. If a man and a woman both say “beet”, you have no trouble telling that they’re saying the same word, even though the two signals are quite different. And that’s not a trivial task. Voice recognition technology, for example, which is overwhelmingly trained on male voices, often has a hard time understanding women’s voices. (Not to mention different accents. What that says about regional and sex-based discrimination is a topic for another time.)
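That effortless correction is what phoneticians call speaker normalization. One classic method is Lobanov normalization, which rescales each speaker’s measurements against that speaker’s own mean and spread, so values can be compared across voices. A minimal sketch with made-up formant values:

```python
from statistics import mean, stdev

def lobanov(values):
    """Lobanov normalization: z-score a speaker's measurements
    against that speaker's own mean and standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical first-formant (F1) values in Hz for the same three vowels,
# produced by a lower-pitched and a higher-pitched speaker.
speaker_a = [300, 500, 700]
speaker_b = [390, 650, 910]   # same vowels, shifted and stretched

# After normalization the two sets line up, even though the raw Hz differ.
print(lobanov(speaker_a))  # -> [-1.0, 0.0, 1.0]
print(lobanov(speaker_b))  # -> [-1.0, 0.0, 1.0]
```

Whether listeners actually do something like this is exactly the kind of question the competing theories fight over.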

And yet. And yet humans are very, very good at recognizing speech. How? Well, linguists have made some striking progress in answering that question, though we haven’t yet arrived at an answer that makes everyone happy. And the variance in the signal isn’t the only hurdle facing humans as they recognize speech: there’s also the fact that being human limits what we can hear.

[Figure: equal-loudness contours, mapping sound pressure level (dB) to perceived loudness (phon) across frequencies]

Ooo, pretty rainbow. Thorny problem, though: this shows how we hear various frequencies better or worse. The sweet spot is right around 3 kHz or so. Which, coincidentally, just so happens to be where a lot of the important information in the speech signal sits. But we do still produce information at other frequencies, and we do use it in speech perception: particularly for sounds like “s” and “f”.
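This frequency-dependent sensitivity is roughly what the A-weighting curve used in sound-level meters captures: it discounts the frequencies we hear poorly and peaks in the region the contours favor. Here’s the standard formula (from the IEC 61672 sound-level-meter spec), normalized to 0 dB at 1 kHz:

```python
import math

def a_weight_db(f: float) -> float:
    """A-weighting in dB: roughly the inverse of an equal-loudness
    contour, so it's highest where human hearing is most sensitive."""
    f2 = f * f
    ra = (12194.0**2 * f2 * f2) / (
        (f2 + 20.6**2)
        * math.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return 20 * math.log10(ra) + 2.00  # offset puts 1 kHz at 0 dB

# Sensitivity climbs from the low frequencies toward the 2-4 kHz region.
for f in (100, 1000, 3000):
    print(f, round(a_weight_db(f), 1))
```

A 100 Hz tone gets docked about 19 dB relative to 1 kHz, which is why a bass line needs far more raw energy to sound as loud as a mid-range melody.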

We can think of the information available in the world as a sheet of cookie dough. This includes things like UV light and sounds below 0 dB in intensity. Now imagine a cookie-cutter. Heck, make it a gingerbread man. The cookie-cutter represents the ways in which the human body limits our access to this information. There are just certain things that even a normal, healthy human isn’t capable of perceiving. We can only hear the information that falls inside the cookie cutter. And the older we get, the smaller the cookie-cutter becomes, as we slowly lose sensitivity in our auditory and visual systems. This makes it even more difficult to perceive speech. Even though it seems likely that we’ve evolved our vocal system to take advantage of the way our perceptual system works, it still makes the task of modelling speech perception even more complex.
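As a crude sketch, the cookie-cutter could be modeled as a hard window on frequency and intensity. The 20 Hz–20 kHz range and the 0 dB SPL floor are real ballpark figures for young, healthy ears, but treating them as a rectangle (and the age-related shrinkage rate below) is invented for illustration:

```python
def audible(freq_hz: float, level_db_spl: float, age: int = 20) -> bool:
    """Crude 'cookie-cutter' model of human hearing limits.
    Real thresholds curve with frequency (see the contours above);
    the linear loss of high frequencies with age is a made-up rate,
    though the direction of the effect (presbycusis) is real."""
    upper = max(20_000 - 100 * max(age - 20, 0), 8_000)
    return 20 <= freq_hz <= upper and level_db_spl >= 0

print(audible(440, 60))             # -> True: a violin's A, well inside
print(audible(30_000, 60))          # -> False: ultrasound, outside for everyone
print(audible(16_000, 60, age=70))  # -> False: a high tone an older listener misses
```

Anything returning False here is dough outside the cutter: the information exists in the world, but it never makes it into the perceptual system.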

4 responses

  1. Well, from an anatomist’s standpoint, there’s frequency tuning in the cochlea. When sound waves push against the oval window, the pressure wave propagates down the scala vestibuli, turns at the helicotrema, and travels back along the scala tympani. There, the basilar membrane oscillates at different positions depending on frequency, because the membrane’s thickness varies along its length. Some portions oscillate more than others, differentiating sounds at different frequencies. This is called frequency tuning. Our ears are tuned to human speech, 2–3 kHz out of our full range of 20 Hz–20 kHz, for pretty obvious reasons.
    On the basilar membrane there are hair cells that bend with the oscillation and send depolarizing signals to the spiral ganglia and on to the cochlear nerve.
    These hair cells also contain the protein prestin, which is used to change their tension and amplify certain signals, producing different thresholds for different sounds.

    Since 2-3 kHz is what we are programmed to respond most to, it is what gets our attention, tells our hair cells to adjust the tension to an individual’s voice, and programs us to turn our heads in the direction of the voice, if it is perceived as important to us.

    Once the sound reaches the CNS, it is tonotopically organized on Heschl’s gyrus, where it is processed with summation columns and suppression columns, sifting through the buffet of perceived sounds to home in on an individual’s voice range.

    These summation and suppression columns, and how sound relays between the different association (parietal) areas of the brain are the current enigma of medical science, but that is a job for a neurologist =P

    • Yep. ^^ I’ve read some papers which suggest that there’s also some tuning that goes on in the brain-stem and auditory cortex to enhance the signal/reduce noise further. (Hence the usefulness of an auditory brain-stem response or ABR to linguists). This is all a tiny bit more technical than I like my blog posts to be, though. 😉

  2. Pingback: The Motor Theory of Speech Perception

  3. Pingback: The Acoustic Theory of Speech Perception
