Why speech is different from other types of sounds

Ok, so, a couple weeks ago I talked about why speech perception was hard to  model. Really, though, what I talked about was why building linguistic models is a hard task. There’s a couple other thorny problems that plague people who work with speech perception, and they have to do with the weirdness of the speech signal itself. It’s important to talk about because it’s on account of dealing with these weirdnesses that some theories of speech perception themselves can start to look pretty strange. (Motor theory, in particular, tends to sound pretty messed-up the first time you encounter it.)

The speech signal and the way we deal with it is really strange in two main ways.

  1. The speech signal doesn’t contain invariant units.
  2. We both perceive and produce speech in ways that are surprisingly non-linear.

So what are “invariant units” and why should we expect to have them? Well, pretty much everyone agrees that we store words as larger chunks made up of smaller chunks. Like, you know that the word “beet” is going to be made with the lips together at the beginning for the “b” and your tongue behind your teeth at the end for the “t”. And you also know that it will have certain acoustic properties; a short  break in the signal followed by a small burst of white noise in a certain frequency range (that’s a the “b” again) and then a long steady state for the vowel and then another sudden break in the signal for the “t”. So people make those gestures and you listen for those sounds and everything’s pretty straightforwards  right? Weeellllll… not really.

It turns out that you can’t really be grabbing onto certain types of acoustic queues because they’re not always reliably there. There are a bunch of different ways to produce “t”, for example, that run the gamut from the way you’d say it by itself to something that sound more like a “w” crossed with an “r”. When you’re speaking quickly in an informal setting, there’s no telling where on that continuum you’re going to fall. Even with this huge array of possible ways to produce a sound, however, you still somehow hear is at as “t”.

And even those queues that are almost always reliably there vary drastically from person to person. Just think about it: about half the population has a fundamental frequency, or pitch, that’s pretty radically different from the other half. The old interplay of biological sex and voice quality thing. But you can easily, effortlessly even, correct for the speaker’s gender and understand the speech produced by men and women equally well. And if a man and woman both say “beet”, you have no trouble telling that they’re saying the same word, even though the signal is quite different in both situations. And that’s not a trivial task. Voice recognition technology, for example, which is overwhelmingly trained on male voices, often has a hard time understanding women’s voices. (Not to mention different accents. What that says about regional and sex-based discrimination is a  topic for another time.)

And yet. And yet humans are very, very good a recognizing speech. How? Well linguists have made some striking progress in answering that question, though we haven’t yet arrived at an answer that makes everyone happy. And the variance in the signal isn’t the only hurdle facing humans as the recognize the vocal signal: there’s also the fact that the fact that we are humans has effects on what we can hear.

Akustik db2phon

Ooo, pretty rainbow. Thorny problem, though: this shows how we hear various frequencies better or worse. The sweet spot is right around 300 kHz or so. Which, coincidentally, just so happens to be where we produce most of the noise in the speech signal. But we do still produce information at other frequencies and we do use that in speech perception: particularly for sounds like “s” and “f”.


We can think of the information available in the world as a sheet of cookie dough. This includes things like UV light and sounds below 0 dB in intensity. Now imagine a cookie-cutter. Heck, make it a gingerbread man. The cookie-cutter represents the ways in which the human body limits our access to this information. There are just certain things that even a normal, healthy human isn’t capable of perceiving. We can only hear the information that falls inside the cookie cutter. And the older we get, the smaller the cookie-cutter becomes, as we slowly lose sensitivity in our auditory and visual systems. This makes it even more difficult to perceive speech. Even though it seems likely that we’ve evolved our vocal system to take advantage of the way our perceptual system works, it still makes the task of modelling speech perception even more complex.

Book Review: Punctuation..?

So the good folks over at Userdesign asked me to review their newest volume, Punctuation..? and I was happy to oblige. Linguists rarely study punctuation (it falls under the sub-field orthography, or the study of writing systems) but what we do study is the way that language attitudes and punctuation come together. I’ve written before about language attitudes when it come to grammar instruction and the strong prescriptive attitudes of most grammar instruction books. What makes this book so interesting is that it is partly prescriptive and partly descriptive. Since a descriptive bent in a grammar instruction manual is rare, I thought I’d delve into that a bit.

User_design_Books_Punctuation_w_cover

Image copyright Userdesign, used with permission. (Click for link to site.)

So, first of all, how about a quick review of the difference between a descriptive and prescriptive approach to language?

  • Descriptive: This is what linguists do. We don’t make value or moral judgments about languages or language use, we just say what’s going on as best we can. You can think of it like an anthropological ethnography: we just describe what’s going on. 
  • Prescriptive: This is what people who write letters to the Times do. They have a very clear idea of what’s “right” and “wrong” with regards to language use and are all to happy to tell you about it. You can think of this like a manner book: it tells you what the author thinks you should be doing. 

As a linguist, my relationship with language is mainly scientific, so I have a clear preference for a descriptive stance. An ichthyologist doesn’t tell octopi, “No, no, no, you’re doing it all wrong!” after all. At the same time, I live in a culture which has very rigid expectations for how an educated individual should write and sound, and if I want to be seen as an educated individual (and be considered for the types of jobs only open to educated individuals) you better believe I’m going to adhere to those societal standards. The problem comes when people have a purely prescriptive idea of what grammar is and what it should be. That can lead to nasty things like linguistic discrimination. I.e., language B (and thus all those individuals who speak language B) is clearly inferior to language A because they don’t do things properly. Since I think we can all agree that unfounded discrimination of this type is bad, you can see why linguists try their hardest to avoid value judgments of languages.

As I mentioned before, this book is a fascinating mix of prescriptive and descriptive snippets. For example, the author says this about exclamation points: “In everyday writing, the exclamation mark is often overused in the belief that it adds drama and excitement. It is, perhaps  the punctuation mark that should be used with the most restraint” (p 19). Did you notice that “should'”? Classic marker of a prescriptivist claiming their territory. But then you have this about Guillements: “Guillements are used in several languages to indicate passages of speech in the same way that single and double quotation marks (” “”) are used in the English language” (p. 22). (Guillements look like this, since I know you were wondering;  « and ». ) See, that’s a classical description of what a language does, along with parallels drawn to another, related, languages. It may not seem like much, but try to find a comparably descriptive stance in pretty much any widely-distributed grammar manual. And if you do, let me know so that I can go buy a copy of it. It’s change, and it’s positive change, and I’m a fan of it. Is this an indication of a sea-change in grammar manuals? I don’t know, but I certainly hope so.

Over all, I found this book fascinating (though not, perhaps, for the reasons the author intended!). Particularly because it seems to stand in contrast to the division that I just spent this whole post building up. It’s always interesting to see the ways that stances towards language can bleed and melt together, for all that linguists (and I include myself here) try to show that there’s a nice, neat dividing line between the evil, scheming prescriptivists and the descriptivists in their shining armor here to bring a veneer of scientific detachment to our relationship with language. Those attitudes can and do co-exist. Data is messy.  Language is complex. Simple stories (no matter how pretty we might think them) are suspicious. But these distinctions can be useful, and I’m willing to stand by the descriptivist/prescriptivist, even if it’s harder than you might think to put people in one camp or the others.

But beyond being an interesting study in language attitdues, it was a fun read. I learned lots of neat little factoids, which is always a source of pure joy for me. (Did you know that this symbol:  is called a Pilcrow? I know right? I had no idea either; I always just called it the paragraph mark.)

Why is it hard to model speech perception?

So this is a kick-off post for a series of posts about various speech perception models. Speech perception models, you ask? Like, attractive people who are good at listening?

Romantic fashion model

Not only can she discriminate velar, uvular and pharyngeal fricatives with 100% accuracy, but she can also do it in heels.

No, not really. (I wish that was a job…) I’m talking about a scientific model of how humans perceive speech sounds. If you’ve ever taken an introductory science class, you already have some experience with scientific models. All of Newton’s equations are just a way of generalizing general principals generally across many observed cases. A good model has both explanatory and predictive power. So if I say, for example, that force equals mass times acceleration, then that should fit with any data I’ve already observed as well as accurately describe new observations. Yeah, yeah, you’re saying to yourself, I learned all this in elementary school. Why are you still going on about it? Because I really want you to appreciate how complex this problem is.

Let’s take an example from an easier field, say, classical mechanics. (No offense physicists, but y’all know it’s true.) Imagine we want to model something relatively simple. Perhaps we want to know whether a squirrel who’s jumping from one tree to another is going to make. What do we need to know? And none of that “assume the squirrel is a sphere and there’s no air resistance” stuff, let’s get down to the nitty-gritty. We need to know the force and direction of the jump, the locations of the trees, how close the squirrel needs to get to be able to hold on, what the wind’s doing, air resistance and how that will interplay with the shape of the squirrel, the effects of gravity… am I missing anything? I feel like I might be, but that’s most of it.

So, do you notice something that all of these things we need to know the values of have in common? Yeah, that’s right, they’re easy to measure directly. Need to know what the wind’s doing? Grab your anemometer. Gravity? To the accelerometer closet! How far apart the trees are? It’s yardstick time. We need a value , we measure a value, we develop a model with good predictive and explanatory power (You’ll need to wait for your simulations to run on your department’s cluster. But here’s one I made earlier so you can see what it looks like. Mmmm, delicious!) and you clean up playing the numbers on the professional squirrel-jumping circuit.

Let’s take a similarly simple problem from the field of linguistics. You take a person, sit them down in a nice anechoic chamber*, plop some high quality earphones on them and play a word that could be “bite” and could be “bike” and ask them to tell you what they heard. What do you need to know to decide which way they’ll go? Well, assuming that your stimuli is actually 100% ambiguous (which is a little unlikely) there a ton of factors you’ll need to take into account. Like, how recently and often has the subject heard each of the words before? (Priming and frequency effects.) Are there any social factors which might affect their choice? (Maybe one of the participant’s friends has a severe overbite, so they just avoid the word “bite” all together.) Are they hungry? (If so, they’ll probably go for “bite” over “bike”.) And all of that assumes that they’re a native English speaker with no hearing loss or speech pathologies and that the person’s voice is the same as theirs in terms of dialect, because all of that’ll bias the  listener as well.

The best part? All of this is incredibly hard to measure. In a lot of ways, human language processing is a black box. We can’t mess with the system too much and taking it apart to see how it works, in addition to being deeply unethical, breaks the system. The best we can do is tap a hammer lightly against the side and use the sounds of the echos to guess what’s inside. And, no, brain imaging is not a magic bullet for this.  It’s certainly a valuable tool that has led to a lot of insights, but in addition to being incredibly expensive (MRI is easily more than a grand per participant and no one has ever accused linguistics of being a field that rolls around in money like a dog in fresh-cut grass) we really need to resist the urge to rely too heavily on brain imaging studies, as a certain dead salmon taught us.

But! Even though it is deeply difficult to model, there has been a lot of really good work done on towards a theory of speech perception. I’m going to introduce you to some of the main players, including:

  • Motor theory
  • Acoustic/auditory theory
  • Double-weak theory
  • Episodic theories (including Exemplar theory!)

Don’t worry if those all look like menu options in an Ethiopian restaurant (and you with your Amharic phrasebook at home, drat it all); we’ll work through them together.  Get ready for some mind-bending, cutting-edge stuff in the coming weeks. It’s going to be [fʌn] and [fʌnetɪk]. 😀

*Anechoic chambers are the real chambers of secrets.