So, quick review: understanding speech is hard to model and the first model we discussed, motor theory, while it does address some problems, leaves something to be desired. The big one is that it doesn’t suggest that the main fodder for perception is the acoustic speech signal. And that strikes me as odd. I mean, we’re really used to thinking about hearing speech as a audio-only thing. Telephones and radios work perfectly well, after all, and the information you’re getting there is completely audio. That’s not to say that we don’t use visual, or, heck, even tactile data in speech perception. The McGurk effect, where a voice saying “ba” dubbed over someone saying “ga” will be perceived as “da” or “tha”, is strong evidence that we can and do use our eyes during speech perception. And there’s even evidence that a puff of air on the skin will change our perception of speech sounds. But we seem to be able to get along perfectly well without these extra sensory inputs, relying on acoustic data alone.
Ok, so… how do we extract information from acoustic data? Well, like I’ve said a couple time before, it’s actually a pretty complex problem. There’s no such thing as “invariance” in the speech signal and that makes speech recognition monumentally hard. We tend not to think about it because humans are really, really good at figuring out what people are saying, but it’s really very, very complex.
You can think about it like this: imagine that you’re looking for information online about platypuses. Except, for some reason, there is no standard spelling of platypus. People spell it “platipus”, “pladdypuss”, “plaidypus”, “plaeddypus” or any of thirty or forty other variations. Even worse, one person will use many different spellings and may never spell it precisely the same way twice. Now, a search engine that worked like our speech recognition works would not only find every instance of the word platypus–regardless of how it was spelled–but would also recognize that every spelling referred to the same animal. Pretty impressive, huh? Now imagine that every word have a very variable spelling, oh, and there are no spaces between words–everythingisjustruntogetherlikethisinonelongspeechstream. Still not difficult enough for you? Well, there is also the fact that there are ambiguities. The search algorithm would need to treat “pladypuss” (in the sense of a plaid-patterned cat) and “palattypus” (in the sense of the venomous monotreme) as separate things. Ok, ok, you’re right, it still seems pretty solvable. So let’s add the stipulation that the program needs to be self-training and have an accuracy rate that’s incredibly close to 100%. If you can build a program to these specifications, congratulations: you’ve just revolutionized speech recognition technology. But we already have a working example of a system that looks a heck of a lot like this: the human brain.
So how does the brain deal with the “different spellings” when we say words? Well, it turns out that there are certain parts of a word that are pretty static, even if a lot of other things move around. It’s like a superhero reboot: Spiderman is still going to be Peter Parker and get bitten by a spider at some point and then get all moody and whine for a while. A lot of other things might change, but if you’re only looking for those criteria to figure out whether or not you’re reading a Spiderman comic you have a pretty good chance of getting it right. Those parts that are relatively stable and easy to look for we call “cues”. Since they’re cues in the acoustic signal, we can be even more specific and call them “acoustic cues”.
If you think of words (or maybe sounds, it’s a point of some contention) as being made up of certain cues, then it’s basically like a list of things a house-buyer is looking for in a house. If a house has all, or at least most, of the things they’re looking for, than it’s probably the right house and they’ll select that one. In the same way, having a lot of cues pointing towards a specific word makes it really likely that that word is going to be selected. When I say “selected”, I mean that the brain will connect the acoustic signal it just heard to the knowledge you have about a specific thing or concept in your head. We can think of a “word” as both this knowledge and the acoustic representation. So in the “platypuss” example above, all the spellings started with “p” and had an “l” no more than one letter away. That looks like a pretty robust cue. And all of the words had a second “p” in them and ended with one or two tokens of “s”. So that also looks like a pretty robust queue. Add to that the fact that all the spellings had at least one of either a “d” or “t” in between the first and second “p” and you have a pretty strong template that would help you to correctly identify all those spellings as being the same word.
Which all seems to be well and good and fits pretty well with our intuitions (or mine at any rate). But that leaves us with a bit of a problem: those pesky parts of Motor Theory that are really strongly experimentally supported. And this model works just as well for motor theory too, just replace the “letters” with specific gestures rather than acoustic cues. There seems to be more to the story than either the acoustic model or the motor theory model can offer us, though both have led to useful insights.
One thought on “The Acoustic Theory of Speech Perception”