What’s up with calling a woman “a female”? A look at the parts of speech of “male” and “female” on Twitter

This is something I’ve written about before, but I’ve recently had several discussions with people who say they don’t find it odd to refer to a woman as a female. Personally, I don’t like being called “a female” because it’s a term I associate strongly with talking about animals. (Plus, it makes you sound like a Ferengi.) I would also protest men being called males, for the same reason, but my intuition is that that doesn’t happen as often. I’m willing to admit that my intuition may be wrong in this case, though, so I’ve decided to take a more data-driven approach. I had two main questions:

  • Do “male” and “female” get used as nouns at different rates?
  • Does one of these terms get used more often?

Data collection

I used the Twitter public API to collect two thousand English tweets, one thousand each containing the exact string “a male” and “a female”. I looked for these strings to help get as many tweets as possible with “male” or “female” used as a noun. “A” is what linguists call a determiner, and a determiner has to have a noun after it. It doesn’t have to be the very next word, though; you can get an adjective first, like so:

  • A female mathematician proved the theorem.
  • A female proved the theorem.

So this will let me directly compare these words in a situation where we should only be able to see a limited number of possible parts of speech & see if they differ from each other. Rather than tagging two thousand tweets by hand, I used a Twitter-specific part-of-speech tagger to tag each set of tweets.

A part of speech tagger is a tool that guesses the part of speech of every word in a text. So if you tag a sentence like “Apples are tasty”, you should get back that “apples” is a plural noun, “are” is a verb and “tasty” is an adjective. You can try one out for yourself on-line here.

Parts of Speech

In line with my predictions, every instance of “male” or “female” was tagged as either a noun, an adjective or a hashtag. (I went through and looked at the hashtags and they were all porn bots. #gross #hazardsOfTwitterData)

However, not every noun was tagged as the same type of noun. I saw three types of tags in my data: NN (regular old noun), NNS (plural noun) and, unexpectedly, NNP (proper noun, singular). (If you’re confused by the weird upper case abbreviations, they’re the tags used in the Penn Treebank, and you can see the full list here.) In case it’s been a while since you studied parts of speech, proper nouns are things like personal or place names: the stuff that tends to get capitalized in English. The examples from the Penn Treebank documentation include “Motown”, “Venneboerger”, and “Czestochwa”. I wouldn’t consider either “female” or “male” a name, so it’s super weird that they’re getting tagged as proper nouns. What’s even weirder? It’s pretty much only “male” that’s getting tagged as a proper noun, as you can see below:


Number of times each word tagged as each part of speech by the GATE Twitter part-of-speech tagger. NNS is a plural noun, NNP a proper noun, NN a noun and JJ an adjective.

The difference in tagged POS between “male” and “female” was super robust (χ²(6, N = 2033) = 1019.2, p < .01). So what’s happening here? My first thought was that it might be that, for some reason, “male” is getting capitalized more often and that was confusing the tagger. But when I looked into it, there wasn’t a strong difference between the capitalization of “male” and “female”: both were capitalized about 3% of the time.

My second thought was that it was a weirdness showing up because I used a tagger designed for Twitter data. Twitter is notoriously “messy” (in the sense that it can be hard for computers to deal with) so it wouldn’t be surprising if tagging “male” as a proper noun is the result of the tagger being trained on Twitter data. So, to check that, I re-tagged the same data using the Stanford POS tagger. And, sure enough, the weird thing where “male” is overwhelmingly tagged as a proper noun disappeared.


Number of times each word tagged as each part of speech by the Stanford POS tagger. NNS is a plural noun, NNP a proper noun, NN a noun, JJ an adjective and FW a “foreign word”.

So it looks like “male” being tagged as a proper noun is an artifact of the tagger being trained on Twitter data, and once we use a tagger trained on a different set of texts (in this case the Wall Street Journal) there wasn’t a strong difference in what POS “male” and “female” were tagged as.

Rate of Use

That said, there was a strong difference between “a female” and “a male”: how often they get used. In order to get one thousand tweets with the exact string “a female”, Twitter had to go back an hour and thirty-four minutes. In order to get a thousand tweets with “a male”, however, Twitter had to go back two hours and fifty-eight minutes. Based on this sample, “a female” gets said almost twice as often as “a male”.
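If you want to check the “almost twice as often” arithmetic yourself, here’s a minimal sketch. The two time windows and the sample size come straight from the paragraph above; the variable and function names are just ones I made up for illustration, and this assumes tweets arrive at a roughly steady rate within each window.

```python
# Collection windows reported above, converted to minutes.
FEMALE_WINDOW_MIN = 1 * 60 + 34   # "a female": 1 hour 34 minutes
MALE_WINDOW_MIN = 2 * 60 + 58     # "a male": 2 hours 58 minutes
TWEETS_PER_SAMPLE = 1000          # tweets collected for each string


def tweets_per_minute(n_tweets, window_minutes):
    """Average number of tweets per minute over the collection window."""
    return n_tweets / window_minutes


female_rate = tweets_per_minute(TWEETS_PER_SAMPLE, FEMALE_WINDOW_MIN)
male_rate = tweets_per_minute(TWEETS_PER_SAMPLE, MALE_WINDOW_MIN)
rate_ratio = female_rate / male_rate

print(f"'a female': {female_rate:.1f} tweets per minute")
print(f"'a male':   {male_rate:.1f} tweets per minute")
print(f"ratio:      {rate_ratio:.2f}")
```

Since both samples are the same size, the ratio boils down to 178 minutes divided by 94 minutes, which is a bit under two.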

So what’s the deal?

  • Do “male” and “female” get used as nouns at different rates?  It depends on what tagger you use! In all seriousness, though, I’m not prepared to claim this based on the dataset I’ve collected.
  • Does one of these terms get used more often? Yes! Based on my sample, Twitter users use “a female” about twice as often as “a male”.

I think the greater rate of use of “a female” points to the possibility of an interesting underlying difference in how “male” and “female” are used, one that calls for a closer qualitative analysis. Does one term get used to describe animals more often than the other? What sort of topics are people talking about when they say “a male” and “a female”? These questions, however, will have to wait for the next blog post!

In the meantime, I’m interested in getting more opinions on this. How do you feel about using “a male” and “a female” as nouns to talk about humans? Do they sound OK or strike you as odd?

My code is available on my GitHub.


What does the National Endowment for the Humanities even do?

From the title, you might think this is a US-centric post. To a certain extent, it is. But I’m also going to be talking about topics that are more broadly of interest: what are some specific benefits of humanities research? And who should fund basic research? A lot has been written about these topics generally, so I’m going to be talking about linguistics and computational linguistics specifically.

This blog post came out of a really interesting conversation I had on Twitter the other day, sparked by this article on the potential complete elimination of both the National Endowment for the Humanities and the National Endowment for the Arts. During the course of the conversation, I realized that the person I was talking to (who was not a researcher, as far as I know) had some misconceptions about the role and reach of the NEH. So I thought it might be useful to talk about the role the NEH plays in my field, and has played in my own development as a researcher.


Oh this? Well, we don’t have funding to buy books anymore, so I put a picture of them in my office to remind myself they exist.

What does the NEH do?

I think the easiest way to answer this is to give you specific examples of projects that have been funded by the National Endowment for the Humanities, and talk about their individual impacts. Keep in mind that this is just the tip of the iceberg; I’m only going to talk about projects that have benefitted my work in particular, and not even all of those.

  • Builds language teaching resources. One of my earliest research experiences was as a research assistant for Jack Martin, working with the Koasati tribe in Louisiana on a project funded by the NEH. The bulk of the work I did that summer was on a talking dictionary of the Koasati language, which the community especially wanted both as a record of the language and to support Koasati language courses. I worked with speakers to record the words for the dictionary, then edited and transcribed the sound files to be put into the talking dictionary. In addition to creating an important resource for the community, I learned important research skills that led me towards my current work on language variation. And the dictionary? It’s available on-line.
  • Helps fight linguistic discrimination. One of my main research topics is linguistic bias in automatic speech recognition (you can see some of that work here and here). But linguistic bias doesn’t only happen with computers. It’s a particularly pernicious form of discrimination that’s a big problem in education as well. As someone who’s both from the South and an educator, for example, I have purposefully cultivated my ability to speak mainstream American English because I know that, fair or not, I’ll be taken less seriously the more southern I sound. The NEH is at the forefront of efforts to help fight linguistic discrimination.
  • Documents linguistic variation. This is a big one for my work, in particular: I draw on NEH-funded resources documenting linguistic variation in the United States in almost every research paper I write.

How does funding get allocated?

  • Which projects are funded is not decided by politicians. I didn’t realize this wasn’t common knowledge, but which projects get funded by federal funding agencies, including the NEH, NSF (which I’m currently being funded through) and NEA (National Endowment for the Arts), is not decided by politicians. This is a good thing–even the most accomplished politician can’t be expected to be an expert on everything from linguistics to history to architecture. You can see the breakdown of the process of allocating funding here.
  • Who looks at funding applications? Applications are peer reviewed, just like journal articles and other scholarly publications. The people looking at applications are top scholars in their field. This means that they have a really good idea of which projects are going to have the biggest long-term impact, and that they can ensure no one’s going to be reinventing the wheel.
  • How many projects are funded? All federal research funding is extremely competitive, with many more applications submitted than accepted. At the NEH, this means as few as 6% of applications to a specific grant program will be accepted. This isn’t just free money–you have to make a very compelling case to a panel of fellow scholars that your project is truly exceptional.
  • What criteria are used to evaluate projects? This varies from grant to grant, but for the documenting endangered languages grant (which is what my work with the Koasati tribe was funded through), the evaluation criteria include the following:
    • What is the potential for the proposed activity to
      1. Advance knowledge and understanding within its own field or across different fields (Intellectual Merit); and
      2. Benefit society or advance desired societal outcomes (Broader Impacts)?
    • To what extent do the proposed activities suggest and explore creative, original, or potentially transformative concepts?
    • Is the plan for carrying out the proposed activities well-reasoned, well-organized, and based on a sound rationale? Does the plan incorporate a mechanism to assess success?
    • How well qualified is the individual, team, or organization to conduct the proposed activities?
    • Are there adequate resources available to the PI (either at the home organization or through collaborations) to carry out the proposed activities?

Couldn’t this research be funded by businesses?

Sure, it could be. Nothing’s stopping companies from funding basic research in the humanities… but in my experience it’s not a priority, and they don’t. And that’s a real pity, because basic humanities research has a tendency of suddenly being vitally needed in other fields. Some examples from Natural Language Processing that have come up in just the last year:

  • Ethics: I’m currently taking what will probably be my last class in graduate school. It’s a seminar course, filled with a mix of NLP researchers, electrical engineers and computer scientists, and we’re all reading… ethics texts. There’s been a growing awareness in the NLP and machine learning communities that algorithmic design and data selection are leading to serious negative social impacts (see this paper for some details). Ethics is suddenly taking center stage, and without the work of scholars working in the humanities, we’d be working up from first principles.
  • Pragmatics: Pragmatics, or the study of how situational factors affect meaning, is one of the more esoteric sub-disciplines in linguistics–many linguistics departments don’t even teach it as a core course. But one of the keynotes at the 2016 Empirical Methods in Natural Language Processing conference was about it (in NLP, conferences are the premier publication venue, so that’s a pretty big deal). Why? Because dialog systems, also known as chatbots, are a major research area right now. And modelling things like what you believe the person you’re talking to already knows is going to be critical to making interacting with them more natural.
  • Discourse analysis: Speaking of chatbots, discourse analysis–or the analysis of the structure of conversations–is another area of humanities research that’s been applied to a lot of computational systems. There are currently over 6000 ACL publications that draw on the discourse analysis literature. And given the strong interest in chatbots right now, I can only see that number going up.

These are all areas of research we’d traditionally consider humanities that have directly benefited the NLP community, and in turn many of the products and services we use day to day. But it’s hard to imagine companies supporting the work of someone working in the humanities whose work might one day benefit their products. Research programs that may not have an immediate impact but end up being incredibly important down the line are exactly the type of long-term investment in knowledge that the NEH supports, and that really wouldn’t happen otherwise.

Why does it matter?

“Now Rachael,” you may be saying, “your work definitely counts as STEM (science, technology, engineering and math). Why do you care so much about some humanities funding going away?”

I hope the reasons that I’ve outlined above help to make the point that humanities research has long-ranging impacts and is a good investment. NEH funding was pivotal in my development as a researcher. I would not be where I am today without early research experience on projects funded by the NEH.  And as a scholar working in multiple disciplines, I see how humanities research constantly enriches work in other fields, like engineering, which tend to be considered more desirable.

One final point: the National Endowment for the Humanities is, compared to other federal funding programs, very small indeed. In 2015 the federal government spent 146 million dollars on the NEH, which was only 2% of the 7.1 billion dollar Department of Defense research budget. In other words, if everyone in the US contributed equally to the federal budget, the NEH would cost us each less than fifty cents a year. I think that’s a fair price for all of the different on-going projects the NEH funds, don’t you?
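For anyone who wants to double-check those numbers, here’s a quick back-of-the-envelope sketch. The two budget figures are the ones quoted above; the population figure is my own rough assumption (about 321 million people in the US in 2015) and isn’t taken from any of the budget sources.

```python
# Budget figures as quoted in the paragraph above (US dollars, 2015).
NEH_BUDGET = 146e6
DOD_RESEARCH_BUDGET = 7.1e9

# Assumption for illustration only: approximate 2015 US population.
US_POPULATION = 321e6

share_of_dod_research = NEH_BUDGET / DOD_RESEARCH_BUDGET
cost_per_person = NEH_BUDGET / US_POPULATION

print(f"NEH as a share of the quoted DoD research budget: {share_of_dod_research:.1%}")
print(f"NEH cost per person per year: ${cost_per_person:.2f}")
```

Nudging the population estimate up or down by a few million doesn’t change the conclusion: the per-person cost stays under fifty cents.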


The entire National Endowment for the Humanities & National Endowment for the Arts, as well as the National Park Service research budget, all fit in that tiny “other” slice at the very top.


Do emojis have their own syntax?

So a while ago I got into a discussion with someone on Twitter about whether emojis have syntax. Their original question was this:

As someone who’s studied sign language, my immediate thought was “Of course there’s a directionality to emoji: they encode the spatial relationships of the scene.” This is just fancy linguist talk for: “if there’s a dog eating a hot-dog, and the dog is on the right, you’re going to use 🌭🐕, not 🐕🌭.” But the more I thought about it, the more I began to think that maybe it would be better not to rely on my intuitions in this case. First, because I know American Sign Language and that might be influencing me and, second, because I am pretty gosh-darn dyslexic and I can’t promise that my really excellent ability to flip adjacent characters doesn’t extend to emoji.

So, like any good behavioral scientist, I ran a little experiment. I wanted to know two things.

  1. Does an emoji description of a scene show the way that things are positioned in that scene?
  2. Does the order of emojis tend to be the same as the ordering of those same concepts in an equivalent sentence?

As it turned out, the answers to these questions are actually fairly intertwined, and related to a third thing I hadn’t actually considered while I was putting together my stimuli (but probably should have): whether there was an agent-patient relationship in the photo.

Agent: The entity in a sentence that’s effecting a change, the “doer” of the action.

  • The dog ate the hot-dog. (Agent: the dog)
  • The raccoons pushed over all the trash-bins. (Agent: the raccoons)

Patient: The entity that’s being changed, the “receiver” of the action.

  • The dog ate the hot-dog. (Patient: the hot-dog)
  • The raccoons pushed over all the trash-bins. (Patient: all the trash-bins)


To get data, I showed people three pictures and asked them to “pick the emoji sequence that best describes the scene” and then gave them two options that used different orders of the same emoji. Then, once they were done with the emoji part, I asked them to “please type a short sentence to describe each scene”. For all the language data, I just went through and quickly coded the order in which the concepts encoded in the emoji showed up in each sentence.


  • “The dog ate a hot-dog”  -> dog hot-dog
  • “The hot-dog was eaten by the dog” -> hot-dog dog
  • “A dog eating” -> dog
  • “The hot-dog was completely devoured” -> hot-dog
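I did this coding by hand, but the idea is simple enough to sketch in code. This is not the script I used: the keyword lists and the `concept_order` function below are hypothetical, and a real version would need many more synonyms (and would trip over “hot dog” written as two words).

```python
import re

# Hypothetical keyword lists: each concept maps to words that count as a mention.
CONCEPTS = {
    "dog": {"dog", "puppy"},
    "hot-dog": {"hot-dog", "hotdog"},
}


def concept_order(sentence, concepts=CONCEPTS):
    """Return the concepts in the order they are first mentioned in the sentence."""
    # Keep hyphenated words together so "hot-dog" is never mistaken for "dog".
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())
    order = []
    for token in tokens:
        for concept, keywords in concepts.items():
            if token in keywords and concept not in order:
                order.append(concept)
    return order


for sentence in [
    "The dog ate a hot-dog",
    "The hot-dog was eaten by the dog",
    "A dog eating",
    "The hot-dog was completely devoured",
]:
    print(sentence, "->", " ".join(concept_order(sentence)))
```

Only the first mention of each concept counts, so a sentence that mentions the dog twice still gets coded with a single “dog”.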

So this gave me two parallel data sets: one with emojis and one with language data.

All together, 133 people filled out the emoji half and 127 people did the whole thing, mostly in English (I had one person respond in Spanish and I went ahead and included it). I have absolutely no demographics on my participants, and that’s by design; since I didn’t go through the Institutional Review Board it would actually be unethical for me to collect data about people themselves rather than just general information on language use. (If you want to get into the nitty-gritty this is a really good discussion of different types of on-line research.)

Picture one – A man counting money

Watch, movie schedule, poster, telephone, cashier machine, cash register Fortepan 6680

I picked this photo as sort of a sanity-check: there’s no obvious right-to-left ordering of the man and the money, and there’s one pretty clear way of describing what’s going on in this scene. There’s an agent (the man) and a patient (the money), and since we tend to describe things as agent first, patient second I expected people to pretty much all do the same thing with this picture. (Side note: I know I’ve read a paper about the cross-linguistic tendency for syntactic structures where the agent comes first, but I can’t find it and I don’t remember who it’s by. Please let me know if you’ve got an idea what it could be in the comments–it’s driving me nuts!)


And they did! Pretty much everyone described this picture by putting the man before the money, both with emoji and words. This tells us that, when there’s no information about orientation you need to encode (e.g. what’s on the right or left), people do tend to use emoji in the same order as they would the equivalent words.

Picture two – A man walking by a castle

Château de Canisy (5)

But now things get a little more complex. What if there isn’t a strong agent-patient relationship and there is a strong orientation in the photo? Here, a man in a red shirt is walking by a castle, but he shows up on the right side of the photo. Will people be more likely to describe this scene with emoji in a way that encodes the relationship of the objects in the photo?


I found that they were–almost four out of five participants described this scene by using the emoji sequence “castle man”, rather than “man castle”. This is particularly striking because, in the sentence writing part of the experiment, most people (over 56%) wrote a sentence where “man/dude/person etc.” showed up before “castle/mansion/chateau etc.”.

So while people can use emoji to encode syntax, they’re also using them to encode spatial information about the scene.

Picture three – A man photographing a model

Photographing a model

Ok, so let’s add a third layer of complexity: what about when spatial information and the syntactic agent/patient relationships are pointing in opposite directions? For the scene above, if you’re encoding the spatial information then you should use an emoji ordering like “woman camera man”, but if you’re encoding an agent-patient relationship then, as we saw in the picture of the man counting money, you’ll probably want to put the agent first: “man camera woman”.

(I leave it open for discussion whether the camera emoji here is representing a physical camera or a verb like “photograph”.)


For this chart I removed some data to make it readable. I kicked out anyone who picked another ordering of the emoji, and any word order that fewer than ten people (i.e. less than 10% of participants) used.

So people were a little more divided here. It wasn’t quite a 50-50 split, but it really does look like you can go either way with this one. The thing that jumped out at me, though, was how the word order and emoji order pattern together: if your sentence is something like “A man photographs a model”, then you are far more likely to use the “man camera woman” emoji ordering. On the other hand, if your sentence is something like “A woman being photographed by the sea” or “Photoshoot by the water”, then it’s more likely that your emoji ordering described the physical relation of the scene.

So what?

So what’s the big takeaway here? Well, one thing is that emoji don’t really have a fixed syntax in the same way language does. If they did, I’d expect that there would be a lot more agreement between people about the right way to represent a scene with emoji. There was a lot of variation.

On the other hand, emoji ordering isn’t just random either. It is encoding information, either about the syntactic/semantic relationship of the concepts or their physical location in space. The problem is that you really don’t have a way of knowing which one is which.

Edit 12/16/2016: The dataset and the R script I used to analyze it are now available on Github.

Can a computer write my blog posts?

This post is pretty special: it’s the 100th post I’ve made since starting my blog! It’s hard to believe I’ve been doing this so long. I started blogging in 2012, in my final year of undergrad, and now I’m heading into my last year of my PhD. Crazy how fast time flies.

Ok, back on topic. As I was looking back over everything I’ve written, it struck me that 99 posts worth of text on a very specific subject domain (linguistics) in a very specific register (informal) should be enough text to train a simple text generator.

So how did I go about building a blog bot? It was pretty easy! All I needed was:

  • 67,000 words of text (all blog posts before this one)
  • 1 R script to tidy up the text
  • 1 Python script to train a Markov Chain text generator

A Markov Whatnow?

A Markov Chain is a type of simple (but surprisingly powerful) statistical model that tells you, given the item you’re currently on, what item you’re likely to see next. Today we’re going to apply it to whole words in a text.

How does it work? Basically, for each word in your text, you count how many different words occur after it, how many times each shows up and figure out the probability of each transition. So if your text is “The dog ate the apple.”, then there’s a 50% chance that “the” will be followed by “apple”, but a 100% chance that “apple” will be followed by “.”. You can then use these probabilities to generate new sentences, like so:

  • Input: The dog ate the apple.
  • Possible outputs:
    • The apple.
    • The dog ate the apple.
    • The dog ate the dog ate the apple.
    • The dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the apple. (This may seem silly, but remember that we’re only looking at two words at a time. This model doesn’t “remember” what came earlier or “know” that it’s repeating itself.)
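Here’s a minimal sketch of that idea in plain Python, just to make the counting concrete. It is not the code I actually used for the blog bot; the function names (`tokenize`, `train`, `generate`) are my own, and I’m assuming we lowercase everything so that “The” and “the” count as the same word.

```python
import random
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase words, keeping each "." as its own token."""
    return re.findall(r"[a-z']+|\.", text.lower())


def train(tokens):
    """Map each word to the probability of every word seen right after it."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in counts.items()
    }


def generate(model, start="the", max_words=50, seed=None):
    """Walk the chain from `start` until a word with no known follower."""
    rng = random.Random(seed)
    words = [start]
    while words[-1] in model and len(words) < max_words:
        options = model[words[-1]]
        next_word = rng.choices(list(options), weights=list(options.values()))[0]
        words.append(next_word)
    return " ".join(words)


model = train(tokenize("The dog ate the apple."))
print(model["the"])    # "dog" and "apple" each get probability 0.5
print(model["apple"])  # "." gets probability 1.0
print(generate(model))
```

Generation stops when it hits “.” because nothing ever follows it in the training text. The `max_words` cap is there for exactly the reason above: the model has no memory, so it could happily loop through “the dog ate” for a long time.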

OK, now for the actual training of the model. If you want to play around with this yourself, all my code and text is up on GitHub here.

  1. First, grab your text. I downloaded all my blog posts as an XML file and used a quick little R script to snag just the text I was interested in.
  2. Next, train your model. I used PyMarkovChain, but you can always write your own if you’re feeling it.
  3. Finally, use the model to generate text.

Alright, now I’ve got a little (admittedly very, very dumb) program that kinda knows what I write like when I’m blogging. How well does it do at pretending to be me?

First try:

I’m going to break eye contact, look down at your own personalized ASR system

Well that’s… vaguely disturbing. But also not really very human-sounding, and certainly not something I’d write in a blog post. Let’s give it a couple more tries and see if it does any better.

  • But, if frosting has to have a career where you learned it from Clarice
  • We only perceive the things that our bodies and sensory organs and brains can perceive, and we really don’t care about to tell the sounds in the second part, I was expecting to see whether something is grammatical is to put up a bunch of historical stuff that we’ve all heard about the structure in the arts and sciences were depressed
  • (Which is really interesting–I’d love to learn your language doesn’t make you very, very good a recognizing speech
  • Believe me, I know what you can uncontract them and what’s the take-away
    People with short face syndrome–yes, that’s a classical description of what a language does, along with parallels drawn to another, related, languages
  • Short answer: they’re all correct
  • And those speakers are aware of
  • The Job Market for Linguistics PhDsWhat do you much

Hmm, not so great. The only output that sounded good to me was “Short answer: they’re all correct”. And that was just the exact same phrasing I used in this blog post. No wonder it sounds just like me; it is me!

So it looks like I won’t be handing the reins over to Making Noise and Hearing Things bot any time soon. True, I could have used a fancier tool, like a Recurrent Neural Network. But to be perfectly honest, I have yet to see any text generation system that can produce anything even close to approximating a human-written blog post. Will we get there? Maybe. But natural language generation, especially outside of really formulaic things like weather or sports reporting, is a super hard problem. Heck, we still haven’t gotten to the point where computers can reliably solve third-grade math word problems.

The very complexities that make language so useful (and interesting to study) also make it so hard to model. Which is good news for me! It means there’s still plenty of work to do in language modelling and blogging.

A Linguistic Analysis of #PronouncingThingsIncorrectly

One of the really cool things about the internet is that it’s a great medium to observe linguistic innovations. A lot of examples of linguistic play that would  have been pretty ephemeral are now safely recorded and shared. (Can you imagine being able to listen to the first examples of Pig Latin? In addition to being cool, it might have told us even more about syllable structure than the game itself already does.)

One example that I’m pretty excited about is #PronouncingThingsIncorrectly, which is a language game invented by Chaz Smith. Smith is a Viner, Cinema Studies student at the University of Pennsylvania and advocate for sexual assault prevention. But right now, I’m mostly interested in his role as a linguistic innovator. In that role he’s invented a new type of language game, which you can see an example of here:

It’s been picked up by a lot of other Viners, as well. You can see some additional examples here.

So why is this linguistically interesting? Because, like most other language games, it has rules to it. I don’t think Chaz necessarily sat down and came up with them (he could have, but I’d be surprised) but they’re there none the less. This is a great example of one of the big True Things linguists know about language: even in play, it tends to be structured. This particular game has three structures I noticed right away: vowel harmony, re-syllabification and new stress assignment.

Vowel Harmony 

Vowel harmony is where all the vowels in a word tend to sound alike. It’s not really a big thing in English, but you may be familiar with it from the nursery rhyme “I like to eat Apples and Bananas“. Other languages, though, use it all the time: Finnish, Nez Perce, Turkish and Maasai all have vowel harmony.

It’s also part of this language game. For example, “tide” is pronounced so that it rhymes with “speedy” and “tomatoes” rhymes with “toe so toes”. Notice that both words have the same vowel sound throughout. Not all words have the same vowel all the way through, but there’s more vowel harmony in the #PronouncingThingsIncorrectly words than there is in the original versions.


Re-syllabification

Syllables are a way of chunking up words–you probably learned about them in school at some point. (If not, I’ve talked about them before.) But languages break words up in different places. And in the game, the boundaries get moved around. We’ve already seen one example: “tide”. It’s usually one chunk, but in the game it gets split into two: “tee.dee”. (Linguists like to put periods in the middle of words to show where the syllable boundaries are.)

You might have noticed that “tide” is spelled with two vowel letters, including a silent “e” on the end. My strong intuition is that spelling plays a big role in this word game. (Which is pretty cool! Usually language games like this rely mostly on sounds and not the letters used to write them.) Most words get each of the vowels in their spelling produced separately, which is where a lot of these resyllabifications come from. Two consonants in a row also tend to each get their own syllables. You can see some examples of each below:

  • Hawaiian  -> ha.why.EE.an
  • Mayonnaise -> may.yon.nuh.ASS.ee
  • Skittles -> ski.TI.til.ees

New Stress Assignment

English stress assignment (how we pick which syllables in a word get the most emphasis) is a mess. It depends on, among other things, which language we borrowed the word from (words from Latin and words from Old English work differently), whether you can break the word down into smaller meaning bits (like how “bats” is “bat” + “s”) and what part of speech it is (the “compact” in “powder compact” and “compact car” have stress in different places). People have spent entire careers trying to describe it.

In this word game, however, Smith fixes English stress. After resyllabification, almost all words with more than one syllable have stress one syllable in from the right edge:

  • suc.CESS -> SUC.cess
  • pe.ROK.side -> pee.rok.SEED.dee
  • col.OGNE -> col.OG.nee
  • HON.ey stays the same

But if you’ve been paying attention, you’ll notice that there are some exceptions, like Skittles:

  • Skittles -> ski.TI.til.ees
  • Jalapenos -> djuh.LA.pen.os

Why are these ones different? I think it’s probably because they’re plural, and if the final syllable is plural it doesn’t really count. You can hear some more examples of this in the Vine embedded above:

  • bubbles -> BOO.buh.lees
  • drinks -> duh.RIN.uh.kus
  • bottles -> BOO.teh.less
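To show just how predictable this is, here’s a minimal sketch of the stress rule as I’ve described it: stress goes one syllable in from the right edge, and a final plural syllable doesn’t count. The function name and the `plural` flag are my own inventions, and you have to supply the game’s syllabification and say whether the word is plural by hand; this doesn’t work either of those out for you.

```python
def assign_stress(syllabified, plural=False):
    """Capitalize the stressed syllable of a period-separated word.

    Stress falls one syllable in from the right edge. If `plural` is True,
    the final syllable is skipped before counting in from the right.
    """
    syllables = syllabified.lower().split(".")
    countable = len(syllables) - 1 if plural else len(syllables)
    if countable < 2:
        # Not enough syllables for the rule to apply, so leave the word alone.
        return ".".join(syllables)
    stressed_index = countable - 2
    syllables[stressed_index] = syllables[stressed_index].upper()
    return ".".join(syllables)


print(assign_stress("pee.rok.seed.dee"))             # pee.rok.SEED.dee
print(assign_stress("col.og.nee"))                   # col.OG.nee
print(assign_stress("ski.ti.til.ees", plural=True))  # ski.TI.til.ees
print(assign_stress("duh.rin.uh.kus", plural=True))  # duh.RIN.uh.kus
```

One rule plus one exception covers every example in this post, which is a lot more than you can say for regular English stress.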

So what? 

Ok, so why is this important or interesting? Well, for one thing it’s a great example of how humans can’t help but be systematic. This is very informal linguistic play that still manages to be pretty predictable. By investigating this sort of language game we can better characterize what it is to be a human using language.

Secondly, this particular language game shows us some of the pressures on English. While it’s my impression that the introduction of vowel harmony is done to be funny (especially since there are other humorous processes at work here–if a word can be pronounced like “booty” or “ass” it usually is) I’m really interested in the resyllabification and stress assignment–or is that ree.sill.luh.ah.bee.fee.ca.TEE.oin and STUH.rees ass.see.guh.nuh.MEN.tee? The way they’re done in this game is a real improvement over the current way of doing things, at least in terms of being systematic and easy to learn. Who knows? In a couple centuries maybe we’ll all be #PronouncingThingsIncorrectly.

The problem with the grammar police

I’ll admit it: I used to be a die-hard grammar corrector. I practically stalked around conversations with a red pen, ready to jump out and shout “gotcha!” if someone ended a sentence with a preposition or split an infinitive or said “irregardless”. But I’ve done a lot of learning and growing since then and, looking back, I’m kind of ashamed. The truth is, when I used to correct people’s grammar, I wasn’t trying to help them. I was trying to make myself look like a language authority, but in doing so I was actually hurting people. Ironically, I only realized this after years of specialized training to become an actual authority on language.

Chicago police officer on segway

I’ll let you go with a warning this time, but if I catch you using “less” for “fewer” again, I’ll have to give you a ticket.

But what do I mean when I say I was hurting people? Well, like some other types of policing, the grammar police don’t target everyone equally. For example, there has been a lot of criticism of Rihanna’s language use in her new single “Work” being thrown around recently. But the fact is that her language is perfectly fine. She’s just using Jamaican Patois, which most American English speakers aren’t familiar with. People claiming that the language use in “Work” is wrong is sort of similar to American English speakers complaining that Nederhop group ChildsPlay’s language use is wrong. It’s not wrong at all, it’s just different.

And there’s the problem. The fact is that grammar policing isn’t targeting speech errors, it’s targeting differences that are, for many people, perfectly fine. And, overwhelmingly, the people who make “errors” are marginalized in other ways. Here are some examples to show you what I mean:

  • Misusing “ironic”: A lot of the lists of “common grammar errors” you see will include a lot of words where the “correct” use is actually less common than other ways the word is used. Take “ironic”. In general use it can mean surprising or remarkable. If you’re a literary theorist, however, irony has a specific technical meaning–and if you’re not a literary theorist you’re going to need to take a course on it to really get what irony’s about. The only people, then, who are going to use this word “correctly” will be those who are highly educated. And, let’s be real, you know what someone means when they say ironic and isn’t that the point?
  • Overusing words like “just”: This error is apparently so egregious that there’s an e-mail plug-in, targeted mainly at women, to help avoid it. However, as other linguists have pointed out, not only is there limited evidence that women say “just” more than men, but even if there were a difference why would the assumption be that women were overusing “just”? Couldn’t it be that men aren’t using it enough?
  • Double negatives: Also called negative concord, this “error” happens when multiple negatives are used in a sentence, as in, “There isn’t nothing wrong with my language.” This particular construction is perfectly natural and correct in a lot of dialects of American English, including African American English and Southern English, not to mention the standard in some other languages, including French.

In each of these cases, the “error” in question is one that’s produced more by certain groups of people. And those groups of people–less educated individuals, women, African Americans–face disadvantages in other aspects of their life too. This isn’t a mistake or coincidence. When we talk about certain ways of talking, we’re talking about certain types of people. And almost always we’re talking about people who already have the deck stacked against them.

Think about this: why don’t American English speakers point out whenever the Queen of England says things differently? For instance, she often fails to produce the “r” sound in words like “father”, which is definitely not standardized American English. But we don’t talk about how the Queen is “talking lazy” or “dropping letters” like we do about, for instance, “th” being produced as “d” in African American English. They’re both perfectly regular, logical language varieties that differ from standardized American English…but only one group gets flak for it.

Now I’m not arguing that language errors don’t exist, since they clearly do. If you’ve ever accidentally said a spoonerism or suffered from a tip of the tongue moment then you know what it feels like when your language system breaks down for a second. But here’s a fundamental truth of linguistics: barring a condition like aphasia, a native speaker of a language uses their language correctly. And I think it’s important for us all to examine exactly why it is that we’ve been led to believe otherwise…and who it is that we’re being told is wrong.


Why can you mumble “good morning” and still be understood?

I got an interesting question on Facebook a while ago and thought it might be a good topic for a blog post:

I say “good morning” to nearly everyone I see while I’m out running. But I don’t actually say “good”, do I? It’s more like “g’ morning” or “uh morning”. Never just morning by itself, and never a fully articulated good. Is there a name for this grunt that replaces a word? Is this behavior common among English speakers, only southeastern speakers, or only pre-coffee speakers?

This sort of thing is actually very common in speech, especially in conversation. (Or “in the wild” as us laboratory types like to call it.) The fancy-pants name for it is “hypoarticulation”. That’s less (hypo) speech-producing movements of the mouth and throat (articulation). On the other end of the spectrum you have “hyperarticulation” where you very. carefully. produce. each. individual. sound.

Ok, so you can change how much effort you put into producing speech sounds, fair enough. But why? Why don’t we just sort of find a happy medium and hang out there? Two reasons:

  1. Humans are fundamentally lazy. To clarify: articulation costs energy, and energy is a limited resource. More careful articulation also takes more time, which, again, is a limited resource. So the most efficient speech will be very fast and made with very small articulator movements. Reducing the word “good” to just “g” or “uh” is a great example of this type of reduction.
  2. On the other hand, we do want to communicate clearly. As my advisor’s fond of saying, we need exactly enough pointers to get people to the same word we have in mind. So if you point behind someone and say “er!” and it could be either a tiger or a bear, that’s not very helpful. And we’re very aware of this in production: there’s evidence that we’re more likely to hyperarticulate words that are harder to understand.

So we want to communicate clearly and unambiguously, but with as little effort as possible. But how does that tie in with this example? “G” could be “great” or “grass” or “génial”, and “uh” could be any number of things. For this we need to look outside the linguistic system.

The thing is, language is a social activity and when we’re using language we’re almost always doing so with other people. And whenever we interact with other people, we’re always trying to guess what they know. If we’re pretty sure someone can get to the word we mean with less information, for example if we’ve already said it once in the conversation, then we will expend less effort in producing the word. These contexts where things are really easily guessable are called “low entropy”. And in a social context like jogging past someone in the morning, phrases like “good morning” have very low entropy. Much lower than, for example, “Could you hand me that pickle?”–if you jogged past someone and said that you’d be very likely to hyperarticulate to make sure they understood.
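
To make “low entropy” a bit more concrete, here’s a minimal sketch in Python using the standard Shannon entropy formula, which measures (in bits) how hard it is to guess what comes next. The probabilities are invented purely for illustration, not measured from real joggers, and `entropy_bits` is a name I’ve made up for this example.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of a list of probabilities that sum to 1."""
    # Outcomes with probability 0 contribute nothing, so skip them.
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Invented numbers: passing a jogger in the morning, one phrase
# ("good morning") is overwhelmingly likely.
morning_jog = [0.97, 0.01, 0.01, 0.01]

# Invented numbers: a context where four phrases are equally likely.
anything_goes = [0.25, 0.25, 0.25, 0.25]

print(entropy_bits(morning_jog))    # small: the phrase is easy to guess
print(entropy_bits(anything_goes))  # larger: every option is equally likely
```

The more the probability piles up on one phrase, the lower the entropy, and the more of the phrase you can get away with mumbling.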

Do sign languages use the feet?

So one of the things that a lot of people who aren’t familiar with sign languages tend to find surprising is that there’s a lot more involved than just the hands. In fact (as I think I’ve mentioned before), fluent signers actually focus on the eyes of the person they’re signing with, not the hands at all. That makes it easier to see things like grammatical facial expressions. But the use of other body parts doesn’t stop there. In fact, I was recently surprised to learn that several sign languages around the world actually make use of the feet during signing! (If you’d asked me even a couple of months ago, I’d have guessed there weren’t any, and I was super wrong.)

Dancers' feet

Signs Produced on the Feet

So one way in which the feet are used during signing is that some signs are produced with the hands, but on top of or in contact with the feet. Signers aren’t usually bending down to touch their toes in the middle of signing, though. Usually these are languages that are mainly used while sitting cross-legged on the ground. As a result, the feet are easily within the signing space.

Signs Produced With the Feet!

Now these are even more exciting for me. Some languages actually use the feet as active articulators. This was very surprising to me. Why? Well, like I said before, most signers tend to look at other signers’ eyes while they’re communicating. If you’re using your feet during signing, though, your communication partner will need to break eye contact, look down at your feet, and then look all the way back up to your face again. That may not sound like a whole lot of work, but imagine if you were reading this passage and every so often there was a word written on your knee instead of the screen. It would be pretty annoying, and languages tend not to do things that are annoying to their users (because language users stop doing it!).

  • Some sign languages that produce signs with the feet:
    • Warlpiri Sign Language (Australia): Signs like RUN and WALK in this language actually involve moving the feet as if running or walking.
    • Central Taurus Sign Language (Turkey): Color signs are produced by using the toe to point to appropriately colored parts of richly colored carpets. (Thanks to Rabia Ergin for the info!)
    • Highland Mayan Sign Language/Meemul Tziij (Guatemala): Signers in this language not only use their feet, but they will actually reach down to the feet while standing. (Which is really interesting–I’d love to see more data on this language.)

So, yes, multiple sign languages do make use of the feet as both places of articulation and active articulators. Interestingly, it seems to be predominantly village sign languages–that is, sign languages used by both deaf and hearing members in small communities with a high incidence of deafness. I don’t know of any Deaf community sign languages–which are used primarily by culturally Deaf individuals who are part of a larger, non-signing society–that make use of the feet. I’d be very interested to hear if anyone knows of any!