The Science of Speaking in Tongues

So I was recently talking with one of my friends, and she asked me what linguists know about speaking in tongues (or glossolalia, which is the fancy linguistic term for it). It’s not a super well-studied phenomenon, but there has been enough research done that we’ve reached some pretty confident conclusions, which I’ll outline below.

[Image: More like speaking around tongues, in this guy’s case.]
  • People don’t tend to use sounds that aren’t in their native language. (citation) So if you’re an English speaker, you’re not going to bust out some Norwegian vowels. This rather lets the air out of the theory that individuals engaged in glossolalia are actually speaking another language. It is more like playing alphabet soup with the sounds you already know. (Although not always all the sounds you know. My instinct is that glossolalia is made up predominantly of the sounds that are most common in the person’s language.)
  • It lacks the structure of language. (citation) So one of the core ideas of linguistics, which has been supported again and again by hundreds of years of inquiry, is that there are systems and patterns underlying language use: sentences are usually constructed of some sort of verb-like thing and some sort of noun-like thing or things, and it’s usually something on the verb that tells you when and it’s usually something on the noun that tells you things like who possessed what. But these patterns don’t appear in glossolalia. Plus, of course, there’s not really any meaningful content being transmitted. (In fact, the “language” being unintelligible to others present is one of the markers that’s often used to identify glossolalia.) It may sort of smell like a duck, but it doesn’t have any feathers, won’t quack and when we tried to put it in water it just sort of dissolved, so we’ve come to the conclusion that it is not, in fact, a duck.
  • It’s associated with a dissociative psychological state. (citation) Basically, this means that speakers are aware of what they’re doing, but don’t really feel like they’re the ones doing it. In glossolalia, the state seems to come and then pass on, leaving speakers relatively psychologically unaffected. Dissociation can be problematic, though; if it’s particularly extreme and long-term it can be characterized as dissociative identity disorder (what used to be called multiple personality disorder).
  • It’s a learned behaviour. (citation) Basically, you only see glossolalia in cultures where it’s culturally expected and only in situations where it’s culturally appropriate. In fact, during her fieldwork, Dr. Goodman (see the citation) actually observed new initiates into a religious group being explicitly instructed in how to enter a dissociative state and engage in glossolalia.

So glossolalia may seem language-like, but from a linguistic standpoint it doesn’t actually seem to be language. (Which is probably why there hasn’t been that much research done on it.) It’s vocalization that arises from a learned psychological state and that lacks linguistic systematicity.

The Acoustic Theory of Speech Perception

So, quick review: understanding speech is hard to model, and the first model we discussed, motor theory, while it does address some problems, leaves something to be desired. The big one is that it doesn’t treat the acoustic speech signal as the main fodder for perception. And that strikes me as odd. I mean, we’re really used to thinking about hearing speech as an audio-only thing. Telephones and radios work perfectly well, after all, and the information you’re getting there is completely audio. That’s not to say that we don’t use visual, or, heck, even tactile data in speech perception. The McGurk effect, where a voice saying “ba” dubbed over video of someone saying “ga” will be perceived as “da” or “tha”, is strong evidence that we can and do use our eyes during speech perception. And there’s even evidence that a puff of air on the skin will change our perception of speech sounds. But we seem to be able to get along perfectly well without these extra sensory inputs, relying on acoustic data alone.

[Image: This theory sounds good to me. Sorry, I’ll stop.]
Ok, so… how do we extract information from acoustic data? Well, like I’ve said a couple of times before, it’s actually a pretty complex problem. There’s no such thing as “invariance” in the speech signal, and that makes speech recognition monumentally hard. We tend not to think about it because humans are really, really good at figuring out what people are saying, but it’s really very, very complex.

You can think about it like this: imagine that you’re looking for information online about platypuses. Except, for some reason, there is no standard spelling of platypus. People spell it “platipus”, “pladdypuss”, “plaidypus”, “plaeddypus” or any of thirty or forty other variations. Even worse, one person will use many different spellings and may never spell it precisely the same way twice. Now, a search engine that worked the way our speech recognition works would not only find every instance of the word platypus–regardless of how it was spelled–but would also recognize that every spelling referred to the same animal. Pretty impressive, huh? Now imagine that every word has a highly variable spelling, oh, and there are no spaces between words–everythingisjustruntogetherlikethisinonelongspeechstream. Still not difficult enough for you? Well, there is also the fact that there are ambiguities. The search algorithm would need to treat “pladypuss” (in the sense of a plaid-patterned cat) and “palattypus” (in the sense of the venomous monotreme) as separate things. Ok, ok, you’re right, it still seems pretty solvable. So let’s add the stipulation that the program needs to be self-training and have an accuracy rate that’s incredibly close to 100%. If you can build a program to these specifications, congratulations: you’ve just revolutionized speech recognition technology. But we already have a working example of a system that looks a heck of a lot like this: the human brain.
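If it helps to see a toy version of that search problem, here’s a quick sketch in Python. The dictionary, the variant spellings, and the plain string-similarity measure are all invented stand-ins for this post; it’s nothing like how the brain (or a real recognizer) actually works.

```python
# A toy sketch of the "variable spelling" problem: map each noisy variant back
# to whichever dictionary entry it most resembles. Everything here is invented
# for illustration; real speech recognition does not work on letter strings.
from difflib import SequenceMatcher

dictionary = ["platypus", "octopus", "porpoise"]
variants = ["platipus", "pladdypuss", "plaidypus", "plaeddypus", "palattypus"]

def similarity(a: str, b: str) -> float:
    # Rough string similarity, from 0.0 (nothing shared) to 1.0 (identical).
    return SequenceMatcher(None, a, b).ratio()

for spelling in variants:
    best = max(dictionary, key=lambda word: similarity(spelling, word))
    print(f"{spelling:>12} -> {best} (similarity {similarity(spelling, best):.2f})")
```

Every variant lands on “platypus” here, but notice everything the sketch gets for free: it’s handed the word boundaries, it never trains itself, and it has no way to tell a plaid cat from a venomous monotreme.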

So how does the brain deal with the “different spellings” when we say words? Well, it turns out that there are certain parts of a word that are pretty static, even if a lot of other things move around. It’s like a superhero reboot: Spiderman is still going to be Peter Parker and get bitten by a spider at some point and then get all moody and whine for a while. A lot of other things might change, but if you’re only looking for those criteria to figure out whether or not you’re reading a Spiderman comic you have a pretty good chance of getting it right. Those parts that are relatively stable and easy to look for we call “cues”. Since they’re cues in the acoustic signal, we can be even more specific and call them “acoustic cues”.

If you think of words (or maybe sounds, it’s a point of some contention) as being made up of certain cues, then it’s basically like a list of things a house-buyer is looking for in a house. If a house has all, or at least most, of the things they’re looking for, then it’s probably the right house and they’ll select that one. In the same way, having a lot of cues pointing towards a specific word makes it really likely that that word is going to be selected. When I say “selected”, I mean that the brain will connect the acoustic signal it just heard to the knowledge you have about a specific thing or concept in your head. We can think of a “word” as both this knowledge and the acoustic representation. So in the “platypuss” example above, all the spellings started with “p” and had an “l” no more than one letter away. That looks like a pretty robust cue. And all of the words had a second “p” in them and ended with one or two tokens of “s”. So that also looks like a pretty robust cue. Add to that the fact that all the spellings had at least one of either a “d” or “t” in between the first and second “p” and you have a pretty strong template that would help you correctly identify all those spellings as being the same word.
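To turn that house-hunting checklist into something you can poke at, here’s a small sketch of the same idea. The particular cues and the “enough cues” threshold are made up for this example, and real acoustic cues are continuous and much messier than string tests.

```python
# A toy "cue checklist": each cue is a cheap test on the incoming signal (here,
# a spelling), and the word is selected once enough cues are present.
# The cues and the threshold are invented for illustration only.
import re

platypus_cues = {
    "starts with p":            lambda w: w.startswith("p"),
    "l within a letter or two": lambda w: "l" in w[1:3],
    "a second p later on":      lambda w: w.count("p") >= 2,
    "ends in s or ss":          lambda w: re.search(r"s{1,2}$", w) is not None,
    "d or t between the p's":   lambda w: re.search(r"p[^p]*[dt][^p]*p", w) is not None,
}

def looks_like_platypus(spelling: str, threshold: int = 4) -> bool:
    hits = [name for name, test in platypus_cues.items() if test(spelling)]
    return len(hits) >= threshold

for spelling in ["platipus", "pladdypuss", "plaeddypus", "palattypus", "porpoise"]:
    print(f"{spelling:>12}: {looks_like_platypus(spelling)}")
```

With four of the five cues required, all the platypus spellings pass and “porpoise” doesn’t, which is the kind of robustness-despite-variation the checklist is meant to buy you.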

Which all seems to be well and good and fits pretty well with our intuitions (or mine, at any rate). But that leaves us with a bit of a problem: those pesky parts of motor theory that are really strongly supported experimentally. And this model works just as well for motor theory too; just replace the “letters” with specific gestures rather than acoustic cues. There seems to be more to the story than either the acoustic model or the motor theory model can offer us, though both have led to useful insights.

The Motor Theory of Speech Perception

Ok, so like I talked about in my previous two posts, modelling speech perception is an ongoing problem with a lot of hurdles left to jump. But there are potential candidate theories out there, all of which offer good insight into the problem. The first one I’m going to talk about is motor theory.

[Image: So your tongue is like the motor body and the other person’s ear is like the load cell…]
So motor theory has one basic premise and three major claims. The basic premise is a keen observation: we don’t just perceive speech sounds, we also make them. Whoa, stop the presses. Ok, so maybe it seems really obvious, but motor theory was really the first major attempt to model speech perception that took this into account. Up until it was first posited in the 1960s, people had pretty much been ignoring that, treating speech perception as if the only information listeners had access to was what was in the acoustic speech signal. We’ll discuss that in greater detail later, but it’s still pretty much the way a lot of people approach the problem. I don’t know of a piece of voice recognition software, for example, that includes an anatomical model.

So what does the fact that listeners are also speakers get you? Well, remember how there aren’t really invariant units in the speech signal? If you decide that what people are actually perceiving isn’t a collection of acoustic markers that point to one particular language sound, but instead the gestures needed to make that sound, then suddenly that’s much less of a problem. To put it another way, we’re used to thinking of speech as being made up of a bunch of sounds, and that when we’re listening to speech we’re deciding what the right sounds are and from there picking the right words. But from a motor theory standpoint, what you’re actually doing when you’re listening to speech is deciding what the speaker’s doing with their mouth and using that information to figure out what words they’re saying. So in the dictionary in your head, you don’t store words as strings of sounds but rather as strings of gestures.
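To make that contrast concrete, here’s a deliberately over-simplified sketch of the two kinds of mental dictionary. The labels are my own stand-ins, not a real phonological or gestural analysis.

```python
# Toy illustration only: the same hypothetical lexical entry stored as a string
# of sounds (the acoustic view) versus a string of gestures (the motor theory
# view). The labels are simplified stand-ins, not a real analysis.

sound_lexicon = {
    "beet": ["b", "ee", "t"],  # a sequence of speech sounds
}

gesture_lexicon = {
    "beet": [
        ("lips", "close, then release"),                        # the "b"
        ("tongue body", "high and front"),                      # the "ee"
        ("tongue tip", "close at the ridge behind the teeth"),  # the "t"
    ],
}

# Under motor theory, perceiving "beet" means recovering the gesture sequence
# (what the speaker's vocal tract did), not the sound sequence.
print(sound_lexicon["beet"])
print(gesture_lexicon["beet"])
```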

If you’re like me when I first encountered this theory, it’s about this time that you’re starting to get pretty skeptical. I mean, I basically just said that what you’re hearing is the actual movement of someone else’s tongue and figuring out what they’re saying by reverse engineering it based on what you know your tongue is doing when you say the same word. (Just FYI, when I say tongue here, I’m referring to the entire vocal tract in its multifaceted glory, but that’s a bit of a mouthful. Pun intended. 😉 ) I mean, yeah, if we accept this it gives us a big advantage when we’re talking about language acquisition–since if you’re listening to gestures, you can learn them just by listening–but still. It’s weird. I’m going to need some convincing.

Well, let’s get back to those three principles I mentioned earlier, which are taken from Galantucci, Fowler and Turvey’s excellent review of motor theory.

  1. Speech is a weird thing to perceive and pretty much does its own thing. I’ve talked about this at length, so let’s just take that as a given for now.
  2. When we’re listening to speech, we’re actually listening to gestures. We talked about that above. 
  3. We use our motor system to help us perceive speech.

Ok, so point three should jump out at you a bit. Why? Of these three points, it’s the easiest one to test empirically. And since I’m a huge fan of empirically testing things (Science! Data! Statistics!) we can look into the literature and see if there’s anything that supports this. Like, for example, a study that shows that when we’re listening to speech, our motor cortex gets all involved. Well, it turns out that there are lots of studies that show this. You know that term “active listening”? There’s pretty strong evidence that it’s more than just a metaphor; listening to speech involves our motor system in ways that not all acoustic inputs do.

So point three is pretty well supported. What does that mean for point two? It really depends on who you’re talking to. (Science is all about arguing about things, after all.) Personally, I think motor theory is really interesting and addresses a lot of the problems we face in trying to model speech perception. But I’m not ready to swallow it hook, line and sinker. I think Robert Remez put it best in the proceedings of Modularity and the Motor Theory of Speech Perception:

I think it is clear that Motor Theory is false. For the other, I think the evidence indicates no less that Motor Theory is essentially, fundamentally, primarily and basically true. (p. 179)

On the one hand, it’s clear that our motor system is involved in speech perception. On the other, I really do think that we use parts of the acoustic signal in and of themselves. But we’ll get into that in more depth next week.

Why speech is different from other types of sounds

Ok, so, a couple weeks ago I talked about why speech perception was hard to model. Really, though, what I talked about was why building linguistic models is a hard task. There are a couple of other thorny problems that plague people who work with speech perception, and they have to do with the weirdness of the speech signal itself. It’s worth talking about because it’s in dealing with these weirdnesses that some theories of speech perception can themselves start to look pretty strange. (Motor theory, in particular, tends to sound pretty messed-up the first time you encounter it.)

The speech signal and the way we deal with it is really strange in two main ways.

  1. The speech signal doesn’t contain invariant units.
  2. We both perceive and produce speech in ways that are surprisingly non-linear.

So what are “invariant units” and why should we expect to have them? Well, pretty much everyone agrees that we store words as larger chunks made up of smaller chunks. Like, you know that the word “beet” is going to be made with the lips together at the beginning for the “b” and your tongue behind your teeth at the end for the “t”. And you also know that it will have certain acoustic properties: a short break in the signal followed by a small burst of noise in a certain frequency range (that’s the “b” again), then a long steady state for the vowel, and then another sudden break in the signal for the “t”. So people make those gestures and you listen for those sounds and everything’s pretty straightforward, right? Weeellllll… not really.

It turns out that you can’t really be grabbing onto certain types of acoustic cues, because they’re not always reliably there. There are a bunch of different ways to produce “t”, for example, that run the gamut from the way you’d say it by itself to something that sounds more like a “w” crossed with an “r”. When you’re speaking quickly in an informal setting, there’s no telling where on that continuum you’re going to fall. Even with this huge array of possible ways to produce the sound, however, you still somehow hear it as “t”.

And even those cues that are almost always reliably there vary drastically from person to person. Just think about it: about half the population has a fundamental frequency, or pitch, that’s pretty radically different from the other half. The old interplay of biological sex and voice quality thing. But you can easily, effortlessly even, correct for the speaker’s gender and understand the speech produced by men and women equally well. And if a man and a woman both say “beet”, you have no trouble telling that they’re saying the same word, even though the signal is quite different in the two cases. And that’s not a trivial task. Voice recognition technology, for example, which is overwhelmingly trained on male voices, often has a hard time understanding women’s voices. (Not to mention different accents. What that says about regional and sex-based discrimination is a topic for another time.)
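As a side note on what “correcting for the speaker” can look like on the research side, one standard trick in phonetics is Lobanov normalization: rescale each speaker’s vowel formant measurements against that speaker’s own mean and standard deviation, so that different voices become comparable. Here’s a minimal sketch; all the formant values are invented for illustration, not taken from any real study.

```python
# A minimal sketch of Lobanov-style speaker normalization: z-score each
# speaker's formant measurements against that speaker's own mean and standard
# deviation. All numbers are invented for illustration.
from statistics import mean, stdev

# Hypothetical (F1, F2) measurements in Hz for three vowels per speaker.
speakers = {
    "lower_voice":  {"beet": (280, 2250), "bat": (650, 1750), "boot": (310, 900)},
    "higher_voice": {"beet": (380, 2800), "bat": (850, 2050), "boot": (400, 1100)},
}

def z_scores(values):
    # Rescale a list of measurements to that speaker's own scale.
    m, s = mean(values), stdev(values)
    return [round((v - m) / s, 2) for v in values]

for name, vowels in speakers.items():
    f1 = z_scores([f1 for f1, _ in vowels.values()])
    f2 = z_scores([f2 for _, f2 in vowels.values()])
    print(name, {vowel: pair for vowel, pair in zip(vowels, zip(f1, f2))})
```

After normalization, the two voices’ versions of the same vowel land in roughly the same region of the space, which is (very loosely) the kind of adjustment listeners seem to make without even noticing.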

And yet. And yet humans are very, very good at recognizing speech. How? Well, linguists have made some striking progress in answering that question, though we haven’t yet arrived at an answer that makes everyone happy. And the variance in the signal isn’t the only hurdle facing humans as they recognize the vocal signal: there’s also the fact that being human itself affects what we can hear.

[Image: Ooo, pretty rainbow. Thorny problem, though: this shows how well we hear various frequencies. The sweet spot is right around 3 kHz or so, which, conveniently, overlaps with where a lot of the important information in the speech signal lives. But we do still produce information at other frequencies, and we do use it in speech perception: particularly for sounds like “s” and “f”.]

We can think of the information available in the world as a sheet of cookie dough. This includes things like UV light and sounds below 0 dB in intensity. Now imagine a cookie-cutter. Heck, make it a gingerbread man. The cookie-cutter represents the ways in which the human body limits our access to this information. There are just certain things that even a normal, healthy human isn’t capable of perceiving. We can only hear the information that falls inside the cookie cutter. And the older we get, the smaller the cookie-cutter becomes, as we slowly lose sensitivity in our auditory and visual systems. This makes it even more difficult to perceive speech. Even though it seems likely that we’ve evolved our vocal system to take advantage of the way our perceptual system works, it still makes the task of modelling speech perception even more complex.

Book Review: Punctuation..?

So the good folks over at Userdesign asked me to review their newest volume, Punctuation..?, and I was happy to oblige. Linguists rarely study punctuation (it falls under the sub-field of orthography, or the study of writing systems), but what we do study is the way that language attitudes and punctuation come together. I’ve written before about language attitudes when it comes to grammar instruction and the strong prescriptive attitudes of most grammar instruction books. What makes this book so interesting is that it is partly prescriptive and partly descriptive. Since a descriptive bent in a grammar instruction manual is rare, I thought I’d delve into that a bit.

[Image: the book’s cover. Copyright Userdesign, used with permission.]

So, first of all, how about a quick review of the difference between a descriptive and prescriptive approach to language?

  • Descriptive: This is what linguists do. We don’t make value or moral judgments about languages or language use, we just say what’s going on as best we can. You can think of it like an anthropological ethnography: we just describe what’s going on. 
  • Prescriptive: This is what people who write letters to the Times do. They have a very clear idea of what’s “right” and “wrong” with regards to language use and are all too happy to tell you about it. You can think of this like a manners book: it tells you what the author thinks you should be doing.

As a linguist, my relationship with language is mainly scientific, so I have a clear preference for a descriptive stance. An ichthyologist doesn’t tell a fish, “No, no, no, you’re doing it all wrong!” after all. At the same time, I live in a culture which has very rigid expectations for how an educated individual should write and sound, and if I want to be seen as an educated individual (and be considered for the types of jobs only open to educated individuals) you’d better believe I’m going to adhere to those societal standards. The problem comes when people have a purely prescriptive idea of what grammar is and what it should be. That can lead to nasty things like linguistic discrimination: language B (and thus all the individuals who speak language B) is judged clearly inferior to language A because its speakers don’t do things “properly”. Since I think we can all agree that unfounded discrimination of this type is bad, you can see why linguists try their hardest to avoid value judgments of languages.

As I mentioned before, this book is a fascinating mix of prescriptive and descriptive snippets. For example, the author says this about exclamation points: “In everyday writing, the exclamation mark is often overused in the belief that it adds drama and excitement. It is, perhaps, the punctuation mark that should be used with the most restraint” (p. 19). Did you notice that “should”? Classic marker of a prescriptivist claiming their territory. But then you have this about guillemets: “Guillemets are used in several languages to indicate passages of speech in the same way that single and double quotation marks (‘ ’ and “ ”) are used in the English language” (p. 22). (Guillemets look like this, since I know you were wondering: « and ».) See, that’s a classic description of what a language does, along with parallels drawn to another, related language. It may not seem like much, but try to find a comparably descriptive stance in pretty much any widely-distributed grammar manual. And if you do, let me know so that I can go buy a copy of it. It’s change, and it’s positive change, and I’m a fan of it. Is this an indication of a sea change in grammar manuals? I don’t know, but I certainly hope so.

Overall, I found this book fascinating (though not, perhaps, for the reasons the author intended!), particularly because it seems to stand in contrast to the division that I just spent this whole post building up. It’s always interesting to see the ways that stances towards language can bleed and melt together, for all that linguists (and I include myself here) try to show that there’s a nice, neat dividing line between the evil, scheming prescriptivists and the descriptivists in their shining armor, here to bring a veneer of scientific detachment to our relationship with language. Those attitudes can and do co-exist. Data is messy. Language is complex. Simple stories (no matter how pretty we might think them) are suspicious. But these distinctions can be useful, and I’m willing to stand by the descriptivist/prescriptivist divide, even if it’s harder than you might think to put people in one camp or the other.

But beyond being an interesting study in language attitudes, it was a fun read. I learned lots of neat little factoids, which is always a source of pure joy for me. (Did you know that this symbol: ¶ is called a pilcrow? I know, right? I had no idea either; I always just called it the paragraph mark.)

Why is it hard to model speech perception?

So this is a kick-off post for a series of posts about various speech perception models. Speech perception models, you ask? Like, attractive people who are good at listening?

Romantic fashion model
Not only can she discriminate velar, uvular and pharyngeal fricatives with 100% accuracy, but she can also do it in heels.
No, not really. (I wish that was a job…) I’m talking about a scientific model of how humans perceive speech sounds. If you’ve ever taken an introductory science class, you already have some experience with scientific models. All of Newton’s equations are just a way of generalizing general principles generally across many observed cases. A good model has both explanatory and predictive power. So if I say, for example, that force equals mass times acceleration, then that should fit with any data I’ve already observed as well as accurately describe new observations. Yeah, yeah, you’re saying to yourself, I learned all this in elementary school. Why are you still going on about it? Because I really want you to appreciate how complex this problem is.

Let’s take an example from an easier field, say, classical mechanics. (No offense, physicists, but y’all know it’s true.) Imagine we want to model something relatively simple. Perhaps we want to know whether a squirrel who’s jumping from one tree to another is going to make it. What do we need to know? And none of that “assume the squirrel is a sphere and there’s no air resistance” stuff, let’s get down to the nitty-gritty. We need to know the force and direction of the jump, the locations of the trees, how close the squirrel needs to get to be able to hold on, what the wind’s doing, air resistance and how that will interplay with the shape of the squirrel, the effects of gravity… am I missing anything? I feel like I might be, but that’s most of it.

So, do you notice something that all of these things we need to know the values of have in common? Yeah, that’s right, they’re easy to measure directly. Need to know what the wind’s doing? Grab your anemometer. Gravity? To the accelerometer closet! How far apart the trees are? It’s yardstick time. We need a value, we measure a value, we develop a model with good predictive and explanatory power (You’ll need to wait for your simulations to run on your department’s cluster. But here’s one I made earlier so you can see what it looks like. Mmmm, delicious!) and you clean up playing the numbers on the professional squirrel-jumping circuit.

Let’s take a similarly simple problem from the field of linguistics. You take a person, sit them down in a nice anechoic chamber*, plop some high-quality earphones on them and play a word that could be “bite” and could be “bike”, and ask them to tell you what they heard. What do you need to know to decide which way they’ll go? Well, assuming that your stimulus is actually 100% ambiguous (which is a little unlikely), there are a ton of factors you’ll need to take into account. Like, how recently and how often has the subject heard each of the words before? (Priming and frequency effects.) Are there any social factors which might affect their choice? (Maybe one of the participant’s friends has a severe overbite, so they just avoid the word “bite” altogether.) Are they hungry? (If so, they’ll probably go for “bite” over “bike”.) And all of that assumes that they’re a native English speaker with no hearing loss or speech pathologies and that the voice they’re hearing matches their own dialect, because all of that will bias the listener as well.
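Just to make the flavor of the problem concrete, here’s a toy sketch of the sort of model you might end up writing. Every factor, weight, and number below is invented for illustration; actually measuring these quantities is the hard part.

```python
# A toy model: combine several (hypothetical) biasing factors into a probability
# of reporting "bite" rather than "bike" for a perfectly ambiguous token.
# The factors and weights are invented for illustration only.
import math

def p_bite(recency_bite, log_freq_ratio, hunger, social_avoidance):
    score = (
        1.2 * recency_bite        # primed by hearing "bite" recently? (0 or 1)
        + 0.8 * log_freq_ratio    # log of freq("bite") / freq("bike") for this listener
        + 0.5 * hunger            # 0 = just ate, 1 = starving
        - 1.5 * social_avoidance  # avoids the word "bite" for social reasons? (0 or 1)
    )
    return 1 / (1 + math.exp(-score))  # squash the score into a probability

# A hungry listener who heard "bite" an hour ago and has no overbitten friends:
print(round(p_bite(recency_bite=1, log_freq_ratio=0.1, hunger=0.9, social_avoidance=0), 2))
```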

The best part? All of this is incredibly hard to measure. In a lot of ways, human language processing is a black box. We can’t mess with the system too much, and taking it apart to see how it works, in addition to being deeply unethical, breaks the system. The best we can do is tap a hammer lightly against the side and use the sounds of the echoes to guess what’s inside. And, no, brain imaging is not a magic bullet for this. It’s certainly a valuable tool that has led to a lot of insights, but in addition to being incredibly expensive (MRI is easily more than a grand per participant, and no one has ever accused linguistics of being a field that rolls around in money like a dog in fresh-cut grass) we really need to resist the urge to rely too heavily on brain imaging studies, as a certain dead salmon taught us.

But! Even though it is deeply difficult to model, there has been a lot of really good work done towards a theory of speech perception. I’m going to introduce you to some of the main players, including:

  • Motor theory
  • Acoustic/auditory theory
  • Double-weak theory
  • Episodic theories (including Exemplar theory!)

Don’t worry if those all look like menu options in an Ethiopian restaurant (and you with your Amharic phrasebook at home, drat it all); we’ll work through them together.  Get ready for some mind-bending, cutting-edge stuff in the coming weeks. It’s going to be [fʌn] and [fʌnetɪk]. 😀

*Anechoic chambers are the real chambers of secrets.

Why do I really, really love West African languages?

So I found a wonderful free app that lets you learn Yoruba, or at least Yoruba words, and posted about it on Google Plus. Someone asked a very good question: why am I interested in Yoruba? Well, I’m not interested just in Yoruba. In fact, I would love to learn pretty much any western African language or, to be a little more precise, any Niger-Congo language.

[Image: map of the Niger-Congo languages. This map’s color choices make it look like a chocolate-covered ice cream cone.]
Why? Well, not to put too fine a point on it, I’ve got a huge language crush on them. Whoa there, you might be thinking, you’re a linguist. You’re not supposed to make value judgments on languages. Isn’t there like a linguist code of ethics or something? Well, not really, but you are right. Linguists don’t usually make value judgments on languages. That doesn’t mean we can’t play favorites!  And West African languages are my favorites. Why? Because they’re really phonologically and phonetically interesting. I find the sounds and sound systems of these languages rich and full of fascinating effects and processes. Since that’s what I study within linguistics, it makes sense that that’s a quality I really admire in a language.

What are a few examples of Niger-Congo sound systems that are just mind blowing? I’m glad you asked.

  • Yoruba: Yoruba has twelve vowels. Seven of them are pretty common (we have all but one in American English) but if you say four of them nasally, they’re different vowels. And if you say a nasal vowel when you’re not supposed to, it’ll change the entire meaning of a word. Plus? They don’t have a ‘p’ or an ‘n’ sound. That is crazy sauce! Those are some of the most widely-used sounds in human language. And Yoruba has a complex tone system as well. You probably have some idea of the level of complexity that can add to a sound system if you’ve ever studied Mandarin, or another East Asian language. Seriously, their sound system makes English look childishly simplistic.
  • Akan: There are several different dialects of Akan, so I’ll just stick to talking about Asante, which is the one used in universities and for official business. It’s got a crazy consonant system. Remember how Yoruba didn’t have an “n” sound? Yeah, in Akan they have nine. To an English speaker they all pretty much sound the same, but if you grew up speaking Akan you’d be able to tell the difference easily. Plus, most sounds other than “p”, “b”, “f” or “m” can be made while rounding the lips (linguists call these “labialized”, and they’re completely different sounds). They’ve also got a vowel harmony system, which means you can’t have vowels later in a word that are completely different from vowels earlier in the word. Oh, yeah, and tones and a vowel nasalization distinction and some really cool tone terracing. I know, right? It’s like being a kid in a candy store.

But how did these languages get so cool? Well, there’s some evidence that these languages have really robust and complex sound systems because the people speaking them never underwent large-scale migration to another continent. (Obviously, I can’t ignore the effects of colonialism or the slave trade, but the pattern still seems pretty robust.) Which is not to say that, say, Native American languages don’t have awesome sound systems; they just tend to be slightly smaller on average.

Now that you know how kick-ass these languages are, I’m sure you’re chomping at the bit to hear some of them. Your wish is my command; here’s a song in Twi (a dialect of Akan) from one of my all-time-favorite musicians: Sarkodie. (He’s making fun of Ghanaian emigrants who forget their roots. Does it get any better than biting social commentary set to a sick beat?)