Can your use of capitalization reveal your political affiliation?

This week, I’m in Vancouver this week for the meeting of the Association for Computational Linguistics. (On the subject of conferences, don’t forget that my offer to help linguistics students from underrepresented minorities with the cost of conferences still stands!) The work I’m presenting is on a new research direction I’m pursuing and I wanted to share it with y’all!

If you’ve read some of my other posts on sociolinguistics, you may remember that the one of its central ideas is that certain types of language usage pattern together with aspects of people’s social identities. In the US, for example, calling a group of people “yinz” is associated with being from Pittsburgh. Or in Spanish, replacing certain “s” sounds with “th” sounds is associated with being from northern or central Spain. When a particular linguistic form is associated with a specific part of someone’s social identity, we call that a “sociolinguistic variable”

There’s been a lot of work on the type of sociolinguistic variables people use when they’re speaking, but there’s been less work on what people do when they’re writing. And this does make a certain amount of sense: many sociolinguistic variables are either 1) something people aren’t aware they’re doing or 2) something that they’re aware they’re doing but might not consider “proper”. As a result, they tend not to show up in formal writing.

This is where the computational linguistics part comes in; people do a lot of informal writing on computers, especially on the internet. In fact, I’d wager that humans are producing more text now than at any other point in history, and a lot of it is produced in public places. That lets us look for sociolinguistics variables in writing in a way that wasn’t really possible before.

Which is a whole lot of background to be able to say: I’m looking at how punctuation and capitalization pattern with political affiliation on Twitter.

Political affiliation is something that other sociolinguists have definitely looked at. It’s also something that’s very, very noticeable on Twitter these days. This is actually a boon to this type of research. One of the hard things about doing research on Twitter is that you don’t always necessarily know someone’s social identity. And if you use a linguistic feature to try to figure out their identity when what you’re interested in is linguistic features, you quickly end up with the problem of circular logic.

Accounts which are politically active, however, will often explicitly state their political affiliation in their Twitter bio. And I used that information to get tweets from people I was very sure had a specific political affiliation.

For this project, I looked at people who use the hashtags #MAGA and #theResistance in their Twitter bios. The former is an initialism for “Make America Great Again” and is used by politically conservative folks who support President Trump. The latter is used by political liberal folks who are explicitly opposed to President Trump. These two groups not only have different political identities, but also are directly opposed to each other. This means there’s good reason to believe that they will use language in different ways that reflect that identity.

But what about the linguistic half of the equation? Punctuation and capitalization are especially interesting to me because they seem to be capturing some of the same information we might find in prosody or intonation in spoken language. Things like YELLING or…pausing….or… uncertainty?  They’re also much, much easier to measure punctuation than intonation, which is notoriously difficult and time-consuming to annotate.  At the same time, I have good evidence that how you use punctuation and capitalization has some social meaning. Check out this tweet, for example:


As this tweet shows, putting a capital letter at the beginning of a tweet is anything but “aloof and uninterested yet woke and humorous”.

So, if punctuation and capitalization are doing something socially, is part of what they’re doing expressing political affiliation?

That’s what I looked into. I grabbed up to 100 tweets each from accounts which used either #MAGA or #theResistance in their Twitter bios. Then I looked at how much punctuation and capitalization users from these two groups used in their tweets.


First, I looked at all punctuation marks. I did find that, on average, liberal users tended to use less punctuation. But when I took a closer look at the data, an interesting pattern emerged. In both the liberal and conservative groups, there were two clusters of users: those who used a lot of punctuation and those who used almost none.


Politically liberal users on average tended to use less punctuation than politically conservative users, but in both groups there’s really two sets of users: those who use a lot of punctuation and those who use basically  none. There just happen to be more of the latter in #theResistance.

What gives rise to these two clusters? I honestly don’t know, but I do have a hypothesis. I think that there’s  probably a second social variable in this data that I wasn’t able to control for. It seems likely that the user’s age might have something to do with it, or their education level, or even whether they use thier Twitter account for professional or personal communication.


My intuition that there’s a second latent variable at work in this data is even stronger given the results for the amount of capitalization folks used. Conservative users tended to use more capitalization than the average liberal user, but there was a really strong bi-modal distribution for the liberal accounts.


Again, we see that conservative accounts use more of the marker (in this case capitalization), but that there’s a strong bi-modal distribution in the liberal users’ data.

What’s more, the liberal accounts that used a lot of punctuation also tended to use a lot of capitalization. Since these features are both ones that I associate with very “proper” usage (things like always starting a tweet with a capital letter, and ending it with a period) this seems to suggest that some liberal accounts are very standardized in their use of language, while others reject at least some of those standards.

So what’s the answer the question I posed in the title? Can capitalization or punctuation reveal political affiliation? For now, I’m going to go with a solid “maybe”. Users who use very little capitalization and punctuation are more likely to be liberal… but so are users who use a lot of both. And, while I’m on the subject of caveats, keep in mind that I was only looking at very politically active accounts who discuss thier politics in their user bios.  These observations probably don’t apply to all Twitter accounts (and certainly not across different languages).

If you’re interested in reading more, you can check out the fancy-pants versions of this research here and here.  And I definitely intend to consider looking at this; I’ll keep y’all posted on my findings. For now, however, off to find me a Nanimo bar!


Contest announcement! Making noise and going places ✈️🛄

I recently wrote the acknowledgements section my dissertation and it really put into perspective how much help I’ve received during my degree. I’ve decided to pass some of that on by helping out others! Specifically, I’ve decided to help make travelling to conferences a little more affordable for linguistics students who are from underrepresented minorities (African American, American Indian/Alaska Native, or Latin@), LGBT or have a disability.

Biologist speaking at the Friday morning Town Hall session, where attendees were welcome to discuss their ideas on how to further landscape conservation. (5471417317)

To enter:

Entry is open to any student (graduate or undergraduate) studying any aspect of language (broadly defined) who is from an underrepresented minority (African American, American Indian/Alaska Native, or Latin@), LGBT or has a disability.  E-mail me and attach:

  • An abstract or paper that has been accepted at an upcoming (i.e. starting after June 23, 2017) conference
  • The acceptance letter/email from the conference
  • A short biography/description of your work

One entry per person, please!


I’ll pick up to two entries. Each winner will receive 100 American dollars to help them with costs associated with the conference, and I’ll write a blog post highlighting each winner’s research.

Contest closes July 31I’ll contact winners by July 5

Good luck!


What is computational sociolinguistics? (And who’s doing it?)

If you follow me on Twitter (@rctatman) you probably already know that I defended my dissertation last week. That’s right: I’m now officially Dr. Tatman! [party horn emoji]

I’ve spent a lot of time focusing on all the minutia of writing a dissertation lately, from formatting references to correcting a lot of typos (my committee members are all heroes). As a result, I’m more than ready to zoom out and think about big-picture stuff for a little while. And, in academia at least, pictures don’t get much bigger than whole disciplines. Which brings me to the title of this blog post: computational sociolinguistics. I’ve talked about my different research projects quite a bit on this blog (and I’ve got a couple more projects coming up that I’m excited to share with y’all!) but they can seem a little bit scattered. What do patterns of emoji use have to do with how well speech recognition systems deal with different dialects with how people’s political affiliation is reflected in their punctuation use? The answer is that they all fall within the same discipline: computational sociolingustics.

Computational sociolinguistics is a fairly new field that lies at the intersection of two other, more established fields: computational linguistics and sociolinguistics. You’re actually probably already familiar with at least some of the work being done in computational linguistics and its sister field of Natural Language Processing (commonly called NLP). The technologies that allow us to interact with computers or phones using human language, rather than binary 1’s and 0’s, are the result of decades of research in these fields. Everything from spell check, to search engines that know that “puppy” and “dog” are related topics, to automatic translation are the result of researchers working in computational linguistics and NLP.

Sociolinguistics is another well-established field, which focuses on the effects of social context on language how we use language and understand. “Social context”, in this case, can be everything from someone’s identity–like their gender or where they’re from–to the specific linguistic situation they’re in, like how much they like the person they’re talking to or whether or not they think they can be overheard. While a lot of work in sociolinguistics is more qualitative, describing observations without a lot of exact measures, of it is also quantitative.

So what happens when you squish these to fields together? For me, the result is work that focuses on research questions that would be more likely to be asked by sociolinguistics, but using methods from computational linguistics and NLP. It also means asking sociolinguistic questions about how we use language in computational context, drawing on the established research fields of Computer Mediated Communication (CMC), Computational Social Science (CSS) and corpus linguistics, but with a stronger focus on sociolingusitics.

One difficult thing about working in a very new field, however, is that it doesn’t have the established social infrastructure that older fields do. If you do variationist sociolinguistics, for example, there’s an established conference (New Ways of Analyzing Variation, or NWAV) and journals (Language Variation and Change, American Speech, the Journal of Sociolinguistics). Older fields also have an established set of social norms. For instance, conferences are considered more prestigious research venues in computational linguistics, while for sociolinguistics journal publications are usually preferred. But computational sociolinguistics doesn’t really have any of that yet. There also isn’t an established research canon, or any textbooks, or a set of studies that you can assume most people in the field have had exposure to (with the possible exception of Dong et al.’s really fabulous survey article). This is exciting, but also a little bit scary, and really frustrating if you want to learn more about it. Science is about the communities that do it as much as it is about the thing that you’re investigating, and as it stands there’s not really an established formal computational sociolinguistics community that you can join.

Fortunately, I’ve got your back. Below, I’ve collected a list of a few of the scholars whose work I’d consider to be computational sociolinguistics along with small snippets of how they describe their work on their personal websites. This isn’t a complete list, by any means, but it’s a good start and should help you begin to learn a little bit more about this young discipline.

  • Jacob Eisenstein at Georgia Tech
    • “My research combines machine learning and linguistics to build natural language processing systems that are robust to contextual variation and offer new insights about social phenomena.”
  • Jack Grieve at the University of Birmingham
    • “My research focuses on the quantitative analysis of language variation and change. I am especially interested in grammatical and lexical variation in the English language across time and space and the development of new methods for collecting and analysing large corpora of natural language.”
  • Dirk Hovy at the University of Copenhagen
  • Michelle A. McSweeney at Columbia
    • “My research can be summed up by the question: How do we convey tone in text messaging? In face-to-face conversations, we rely on vocal cues, facial expressions, and other non-linguistic features to convey meaning. These features are absent in text messaging, yet digital communication technologies (text messaging, email, etc.) have entered nearly every domain of modern life. I identify the features that facilitate successful communication on these platforms and understand how the availability of digital technologies (i.e., mobile phones) has helped to shape urban spaces.”
  • Dong Nguyen at the University of Edinburgh & Alan Turing Institute
    • “I’m interested in Natural Language Processing and Information Retrieval, and in particular computational text analysis for research questions from the social sciences and humanities. I especially enjoy working with social media data.”
  • Sravana Reddy at Wellesley
    • “My recent interests center around computational sociolinguistics and the intersection between natural language processing and privacy. In the past, I have worked on unsupervised learning, pronunciation modeling, and applications of NLP to creative language.”
  • Tyler Schnoebelen at Decoded AI Consulting
    • “I’m interested in how people make meaning with language. My PhD is from Stanford’s Department of Linguistics (my dissertation was on language and emotion). I’ve also founded a natural language processing start-up (four years), did UX research at Microsoft (ten years) and ran the features desk of a national newspaper in Pakistan (one year).”
  • (Ph.D. Candidate) Philippa Shoemark at the University of Edinburgh
    • “My research interests span computational linguistics, natural language processing, cognitive modelling, and complex networks. I am currently using computational methods to identify social and individual factors that condition linguistic variation and change on social media, under the supervision of Sharon Goldwater and James Kirby.”
  • (Ph.D. Candidate) Sean Simpson at Georgetown University
    • “My interests include computa­tional socio­linguistics, socio­phonetics, language variation, and conservation & revitalization of endangered languages. My dissertation focuses on the incorporation of sociophonetic detail into automated speaker profiling systems.”

Should English be the official language of the United States?

There is currently a bill in the US House to make English the official language of the United States. These bills have been around for a while now. H.R. 997, also known as the “The English Language Unity Act”, was first proposed in 2003. The companion bill, last seen as S. 678 in the 114th congress, was first introduced to the Senate as S. 991 in 2009, and if history is any guide will be introduced again this session.

So if these bills have been around for a bit, why am I just talking about them now? Two reasons. First, I had a really good conversation about this with someone on Twitter the other day and I thought it would be useful to discuss  this in more depth. Second, I’ve been seeing some claims that President Trump made English the official language of the U.S. (he didn’t), so I thought it would be timely to discuss why I think that’s such a bad idea.

As both a linguist and a citizen, I do not think that English should be the official language of the United States.

In fact, I don’t think the US should have any official language. Why? Two main reasons:

  • Historically, language legislation at a national level has… not gone well for other countries.
  • Picking one official language ignores the historical and current linguistic diversity of the United States.

Let’s start with how passing legislation making one or more languages official has gone for other countries. I’m going to stick with just two, Canada and Belgium, but please feel free to discuss others in the comments.


Unlike the US, Canada does have an official language. In fact, thanks to a  1969 law, they have two: English and French. If you’ve ever been to Canada, you know that road signs are all in both English and French.

This law was passed in the wake of turmoil in Quebec sparked by a Montreal school board’s decision to teach all first grade classes in French, much to the displeasure of the English-speaking residents of St. Leonard. Quebec later passed Bill 101 in 1977, making French the only official language of the province. One commenter on this article by the Canadian Broadcasting Corporation called this “the most divisive law in Canadian history”.

Language legislation and its enforcement in Canada has been particularity problematic for businesses. On one occasion, an Italian restaurant faced an investigation for using the word “pasta” on thier menu, instead of the French “pâtes”. Multiple retailers have faced prosecution at the hands of the Office Québécois de la langue Française for failing to have retail websites available in both English and French. A Montreal boutique narrowly avoided a large fine for making Facebook posts only in English. There’s even an official list of English words that Quebec Francophones aren’t supposed to use. While I certainly support bilingualism, personally I would be less than happy to see the same level of government involvement in language use in the US.

In addition, having only French and English as the official languages of Canada leave out a  very important group: aboriginal language users. There are over 60 different indigenous languages used in Canada used by over 213 thousand speakers. And even those don’t make up the majority of languages spoken in Canada: there are over 200 different languages used in Canada and 20% of the population speaks neither English nor French at home.


Another country with a very storied past in terms of language legislation is Belgium. The linguistic situation in Belgium is very complex (there’s a more in-depth discussion here), but the general gist is that there are three languages used in different parts of the country. Dutch is used in the north, French is the south, and German in parts of the east. There is also a deep cultural divide between these regions, which language legislation has long served as a proxy for. There have been no fewer than eight separate national laws passed restricting when and where each language can be used. In 1970, four distinct language regions were codified in the Belgium constitution. You can use whatever language you want in private but there are restrictions on what language you can use for government business, in court, in education and employment.  While you might think that would put a rest to legislation on language, tensions have continued to be high. In 2013, for instance, the European Court of Justice overturned a Flemish law that contracts written in Flanders had to be in Dutch to be binding after a contractor working on an English contract was fired. Again, this represents a greater level of official involvement with language use than I’m personally comfortable with.

I want to be clear: I don’t think multi-lingualism is a problem. As a linguist, I value every language and I also recognize that bilingualism offers significant cognitive benefits. My problem is with legislating which languages should be used in a multi-lingual situation; it tends to lead to avoidable strife.

The US

Ok, you might be thinking, but in the US we really are just an English-speaking country! We wouldn’t have that same problem here. Weeeeelllllll….

The truth is, the US is very multilingual. We have a Language Diversity Index of .353, according to the UN. That means that, if you randomly picked two people from the United States population, the chance that they’d speak two different languages is over 35%. That’s far higher than a lot of other English-speaking countries. The UK clocks in at .139,  while New Zealand and Australia are at .102 and .126, respectively. (For the record, Canada is at .549 and Belgium .734.)

The number of different languages spoken in the US is also remarkable. In New York City alone there may be speakers of as many as 800 different languages, which would make it one of the most linguistically-diverse places in the world; like the Amazon rain-forest of languages. In King County, where I live now, there are over 170 different languages spoken, with the most common being Spanish, Chinese, Vietnamese and Amharic. And that linguistic diversity is reflected in the education system: there are almost 5 million students in the US Education system who are learning English, nearly 1 out of 10 students.

Multilingualism in the United States is nothing new, either: it’s been a part of the American experience since long before there was an America. Of course, there continue to be many speakers of indigenous languages in the United States, including Hawaiian (keep in mind that Hawaii did not actually want to become a state). But even within just European languages, English has never had sole dominion. Spanish has been spoken in Florida since the 1500’s. At the time of the signing of the Deceleration of Independence, an estimated 10% of the citizens of the newly-founded US spoke German (although the idea that it almost became the official language of the US is a myth). New York city? Used to be called New Amsterdam, and Dutch was spoken there into the 1900’s. Even the troops fighting the revolutionary war were doing so in at least five languages.

Making English the official language of the United States would ignore the rich linguistic history and the current linguistic diversity of this country. And, if other countries’ language legislation is any guide, would cause a lot of unnecessary fuss.

Social Speech Perception: Reviewing recent work

As some of you may already know, it’s getting to be crunch time for me on my PhD. So for the next little bit, my blog posts will probably mostly be on things that directly related to my dissertation. Consider this your behind-the-scenes pass to the research process.

With that in mind, today we’ll be looking at some work that’s closely related to my own.(And, not coincidentally, that I need to include in my lit review. Twofer.) They all share a common thread: social categories and speech perception. I’ve talked about this before, but these are more recent peer-reviewed papers on the topic.

Vowel perception by listeners from different English dialects

In this paper, Karpinska and coauthors investigated the role of a listener’s regional dialect on their use of two acoustic cues: formats and duration.

An acoustic cue is a specific part of the speech signal that you pay attention to to help you decide what sound you’re hearing. For example, when listening to a fricative, like “s” or “sh”,  you pay a lot of attention to the high-pitched, staticy-sounding  part of the sound to tell which fricative you’re hearing. This cue helps you tell the difference  between “sip” and “ship”, and if it gets removed or covered by another sound you’ll have a hard time telling those words apart.

They found that for listeners from the UK, New Zealand, Ireland and Singapore, formants were the most important cue distinguishing the vowels in “bit” and “beat”. For Australian listeners, however, duration (how long the vowel was) was a more important cue to the identity of these vowels. This study provides additional evidence that a listener’s dialect affects their speech perception, and in particular which cues they rely on.

Social categories are shared across bilinguals׳ lexicons

In this experiment Szakay and co-authors looked at English-Māori bilinguals from New Zealand. In New Zealand, there  are multiple dialects of English, including Māori English (the variety used by Native New Zealanders) and Pākehā English (the variety used by white New Zealanders). The authors found that there was a cross-language priming effect from Māori to English, but only for  Māori English.

Priming is a phenomena in linguistics where hearing or saying a particular linguistic unit, like a word, later makes it easier to understand or say a similar unit. So if I show you a picture of a dog, you’re going to be able to read the word “puppy” faster than you would have otherwise becuase you’re already thinking about canines.

They argue that this is due to the activation of language forms associated with a specific social identity–in this case Māori ethnicity. This provides evidence that listener’s beliefs about a speaker’s social identity affects their processing.

Intergroup Dynamics in Speech Perception: Interaction Among Experience, Attitudes and Expectations

Nguyen and co-authors investigate the effects of three separate factors on speech perception:

  • Experience: How much prior interaction a listener has had with a given speech variety.
  • Attitudes: How a listener feels about a speech variety and those who speak it.
  • Expectations: Whether a listener knows what speech variety they’re going to hear. (In this case only some listeners were explicitly told what speech variety they were going to hear.)

They found that these three factors all influenced the speech perception of Australian English speakers listening to Vietnamese accented English, and that there was an interaction between these factors. In general, participants with correct expectations (i.e. being told beforehand that they were going to hear Vietnamese accented English) identified more words correctly.

There was an interesting interaction between listener’s attitudes towards Asians and thier experience with Vietnamese accented English. Listeners who had negative prejudices towards Asians and little experience with Vietnamese English were more accurate than those with little experience and positive prejudice. The authors suggest that that was due to listeners with negative prejudice being more attentive. However, the opposite effect was found for listener’s with experience listening to Vietnamese English. In this group, positive prejudice increase accuracy while negative prejudice decreased it. There were, however, uneven numbers of participants between the groups so this might have skewed the results.

For me, this study is most useful becuase it shows that a listener’s experience with a speech variety and their expectation of hearing it affect their perception. I would, however, like to see a larger listener sample, especially given the strong negative skew in listner’s attitudes towards Asians (which I realize the researchers couldn’t have controlled for).

Perceiving the World through Group-Colored Glasses: A Perceptual Model of Intergroup Relations

Rather than presenting an experiment, this paper lays out a framework for the interplay of social group affiliation and perception. The authors pull together numerous lines of research showing that an individual’s own group affiliation can change thier perception and interpretation of the same stimuli. In the authors’ own words:

The central premise of our model is that social identification influences perception.

While they discuss perception across many domains (visual, tactile, orlfactory, etc.) the part which directly fits with my work is that of auditory perception. As they point out, auditory perception of speech depends on both bottom up and top down information. Bottom-up information, in speech perception, is the acoustic signal, while top-down information includes both knowledge of the language (like which words are more common) and social knowledge (like our knowledge of different dialects). While the authors do not discuss dialect perception directly, other work (including the three studies discussed above) fits nicely into this framework.

The key difference between this framework and Kleinschmidt & Jaeger’s Robust Speech Perception model is the centrality of the speaker’s identity. Since all language users have thier own dialect which affects their speech perception (although, of course, some listeners can fluently listen to more than one dialect) it is important to consider both the listener’s and talker’s social affiliation when modelling speech perception.

Preference for wake words varies by user gender

I recently read a very interesting article on the design of aspects of choosing a wake word, the word you use to turn on a voice-activated system. In Star Trek it’s “Computer”, but these days two of the more popular ones are “Alexa” and “OK Google”. The article’s author was a designer and noted that she found “Ok Google” or “Hey Google” to be more pleasant to use than “Alexa”. As I was reading the comments (I know, I know) I noticed that a lot of the people who strongly protested that they preferred “Alexa” had usernames or avatars that I would associate with male users. It struck me that there might be an underlying social pattern here.

So, being the type of nerd I am, I whipped up a quick little survey to look at the interaction between user gender and their preference for wake words. The survey only had two questions:

  • What is your gender?
    • Male
    • Female
    • Other
  • If Google Home and the Echo offered identical performance in all ways except for the wake word (the word or phrase you use to wake the device and begin talking to it), which wake word would you prefer?
    • “Ok Google” or “Hey Google”
    • “Alexa”

I included only those options becuase those are the defaults–I am aware you can choose to change the Echo’s wake word. (And probably should, given recent events.) 67 people responded to my survey. (If you were one of them, thanks!)

So what were the results? They were actually pretty strongly in line with my initial observations: as a group, only men preferred “Alexa” to “Ok Google”. Furthermore, this preference was far weaker than people of other genders’ for “Ok Google”. Women preferred “Ok Google” at a rate of almost two-to-one, and no people of other genders preferred “Alexa”.

I did have a bit of a skewed sample, with more women than men and people of other genders, but the differences between genders were robust enough to be statistically significant (c2(2, N = 67) = 7.25, p = 0.02)).


Women preferred “Ok Google” to “Alexa” 27:11, men preferred “Alexa” to “Ok Google” 14:11, and the four people of other genders in my survey all preferred “Ok Google”.

So what’s the take-away? Well, for one, Johna Paolino (the author of the original article) is by no means alone in her preference for a non-gendered wake word. More broadly, I think that, like the Clippy debacle, this is excellent evidence that there are strong gendered differences in how users’ gender affects their interaction with virtual agents. If you’re working to create virtual agents, it’s important to consider all types of users or you might end up creating something that rubs more than half of your potential customers the wrong way.

My code and data are available here.

Do emojis have their own syntax?

So a while ago I got into a discussion with someone on Twitter about whether emojis have syntax. Their original question was this:

As someone who’s studied sign language, my immediate thought was “Of course there’s a directionality to emoji: they encode the spatial relationships of the scene.” This is just fancy linguist talk for: “if there’s a dog eating a hot-dog, and the dog is on the right, you’re going to use 🌭🐕, not 🐕🌭.” But the more I thought about it, the more I began to think that maybe it would be better not to rely on my intuitions in this case. First, because I know American Sign Language and that might be influencing me and, second, because I am pretty gosh-darn dyslexic and I can’t promise that my really excellent ability to flip adjacent characters doesn’t extend to emoji.

So, like any good behavioral scientist, I ran a little experiment. I wanted to know two things.

  1. Does an emoji description of a scene show the way that things are positioned in that scene?
  2. Does the order of emojis tend to be the same as the ordering of those same concepts in an equivalent sentence?

As it turned out, the answers to these questions are actually fairly intertwined, and related to a third thing I hadn’t actually considered while I was putting together my stimuli (but probably should have): whether there was an agent-patient relationship in the photo.

Agent: The entity in a sentence that’s affecting a changed, the “doer” of the action.

  • The dog ate the hot-dog.
  • The raccoons pushed over all the trash-bins.

Patient: The entity that’s being changed, the “receiver” of the action.

  • The dog ate the hot-dog.
  • The raccoons pushed over all the trash-bins.


To get data, I showed people three pictures and asked them to “pick the emoji sequence that best describes the scene” and then gave them two options that used different orders of the same emoji. Then, once they were done with the emoji part, I asked them to “please type a short sentence to describe each scene”. For all the language data, I just went through and quickly coded the order that the same concepts as were encoded in the emoji showed up.


  • “The dog ate a hot-dog”  -> dog hot-dog
  • “The hot-dog was eaten by the dog” -> hot-dog dog
  • “A dog eating” -> dog
  • “The hot-dog was completely devoured” -> hot-dog

So this gave me two parallel data sets: one with emojis and one with language data.

All together, 133 people filled out the emoji half and 127 people did the whole thing, mostly in English (I had one person respond in Spanish and I went ahead and included it). I have absolutely no demographics on my participants, and that’s by design; since I didn’t go through the Institutional Review Board it would actually be unethical for me to collect data about people themselves rather than just general information on language use. (If you want to get into the nitty-gritty this is a really good discussion of different types of on-line research.)

Picture one – A man counting money

Watch, movie schedule, poster, telephone, cashier machine, cash register Fortepan 6680

I picked this photo as sort of a sanity-check: there’s no obvious right-to-left ordering of the man and the money, and there’s one pretty clear way of describing what’s going on in this scene. There’s an agent (the man) and a patient (the money), and since we tend to describe things as agent first, patient second I expected people to pretty much all do the same thing with this picture. (Side note: I know I’ve read a paper about the cross-linguistic tendency for syntactic structures where the agent comes first, but I can’t find it and I don’t remember who it’s by. Please let me know if you’ve got an idea what it could be in the comments–it’s driving me nuts!)


And they did! Pretty much everyone described this picture by putting the man before the money, both with emoji and words. This tells us that, when there’s no information about orientation you need to encode (e.g. what’s on the right or left), people do tend to use emoji in the same order as they would the equivalent words.

Picture two – A man walking by a castle

Château de Canisy (5)

But now things get a little more complex. What if there isn’t a strong agent-patient relationship and there is a strong orientation in the photo? Here, a man in a red shirt is walking by a castle, but he shows up on the right side of the photo. Will people be more likely to describe this scene with emoji in a way that encodes the relationship of the objects in the photo?


I found that they were–almost four out of five participants described this scene by using the emoji sequence “castle man”, rather than “man castle”. This is particularly striking because, in the sentence writing part of the experiment, most people (over 56%) wrote a sentence where “man/dude/person etc.” showed up before “castle/mansion/chateau etc.”.

So while people can use emoji to encode syntax, they’re also using them to encode spatial information about the scene.

Picture three – A man photographing a model

Photographing a model

Ok, so let’s add a third layer of complexity: what about when spatial information and the syntactic agent/patient relationships are pointing in opposite directions? For the scene above, if you’re encoding the spatial information then you should use an emoji ordering like “woman camera man”, but if you’re encoding an agent-patient relationship then, as we saw in the picture of the man counting money, you’ll probably want to put the agent first: “man camera woman”.

(I leave it open for discussion whether the camera emoji here is representing a physical camera or a verb like “photograph”.)


For this chart I removed some data to make it readable. I kicked out anyone who picked another ordering of the emoji, and any word order that fewer than ten people (e.g. less than 10% of participants) used.

So people were a little more divided here. It wasn’t quite a 50-50 split, but it really does look like you can go either way with this one. The thing that jumped out at me, though, was how the word order and emoji order pattern together: if your sentence is something like “A man photographs a model”, then you are far more likely to use the “man camera woman” emoji ordering. On the other hand, if your sentence is something like “A woman being photographed by the sea” or “Photoshoot by the water”, then it’s more likely that your emoji ordering described the physical relation of the scene.

So what?

So what’s the big takeaway here? Well, one thing is that emoji don’t really have a fixed syntax in the same way language does. If they did, I’d expect that there would be a lot more agreement between people about the right way to represent a scene with emoji. There was a lot of variation.

On the other hand, emoji ordering isn’t just random either. It is encoding information, either about the syntactic/semantic relationship of the concepts or their physical location in space. The problem is that you really don’t have a way of knowing which one is which.

Edit 12/16/2016: The dataset and the R script I used to analyze it are now avaliable on Github.

How loud would a million dogs barking be?

So a friend of mine who’s a reference librarian (and has a gaming YouTube channel you should check out) recently got an interesting question: how loud would a million dogs barking be?

This is an interesting question because it gets at some interesting properties of how sound work, in particular the decibel scale.

So, first off, we need to establish our baseline. The loudest recorded dog bark clocked in at 113.1 dB, and was produced by a golden retriever named Charlie. (Interestingly, the loudest recorded human scream was 129 dB, so it looks like Charlie’s got some training to do to catch up!) That’s louder than a chain saw, and loud enough to cause hearing damage if you heard it consonantly.

Now, let’s scale our problem down a bit and figure out how loud it would be if ten Charlies barked together. (I’m going to use copies of Charlie and assume they’ll bark in phase becuase it makes the math simpler.) One Charlie is 113 dB, so your first instinct may be to multiply that by ten and end up 1130 dB. Unfortunately, if you took this approach you’d be (if you’ll excuse the expression) barking up the wrong tree. Why? Because the dB scale is logarithmic. This means that a 1130 dB is absolutely ridiculously loud. For reference, under normal conditions the loudest possible sound (on Earth) is 194 dB.  A sound of 1000 dB would be loud enough to create a black hole larger than the galaxy. We wouldn’t be able to get a bark that loud even if we covered every inch of earth with clones of champion barker Charlie.

Ok, so we know what one wrong approach is, but what’s the right one? Well, we have our base bark at 113 dB. If we want a bark that is one million times as powerful (assuming that we can get a million dogs to bark as one) then we need to take the base ten log of one million and multiply it by ten (that’s the deci part of decibel). (If you want more math try this site.) The base ten log of one million is six, so times ten that’s sixty decibels. But it’s sixty decibels louder than our original sound of 113dB, for a grand total of 173dB.

Now, to put this in perspective, that’s still pretty durn loud. That’s loud enough to cause hearing loss in our puppies and everyone in hearing distance. We’re talking about the loudness of a cannon, or a rocket launch from 100 meters away. So, yes, very loud, but not quite “destroying the galaxy” loud.

A final note: since the current world record for loudest barking group of dogs is a more modest 124 dB from group of just 76 dogs, if you could get a million dogs to bark in unison you’d definitely set a new world record! But, considering that you’d end up hurting the dogs’ hearing (and having to scoop all that poop) I’m afraid I really can’t recommend it.

Can a computer write my blog posts?

This post is pretty special: it’s the 100th post I’ve made since starting my blog! It’s hard to believe I’ve been doing this so long. I started blogging in 2012, in my final year of undergrad, and now I’m heading into my last year of my PhD. Crazy how fast time flies.

Ok, back on topic. As I was looking back over everything I’ve written, it struck me that 99 posts worth of text on a very specific subject domain (linguistics) in a very specific register (informal) should be enough text to train a simple text generator.

So how did I go about building a blog bot? It was pretty easy! All I needed was:

  • 67,000 words of text (all blog posts before this one)
  • 1 R script to tidy up the text
  • 1 Python script to train a Markov Chain  text generator

A Markov Whatnow?

A Markov Chain is a type of simple (but surprisingly powerful) statistical model that tells you, given the item you’re currently on, what item you’re likely to see next. Today we’re going to apply it to whole words in a text.

How does it work? Basically, for each word in your text, you count how many different words occur after it, how many time each shows up and figure out the probability of each transition. So if your text is “The dog ate the apple.”, then there’s a 50% chance that “the” will be followed by “apple”, but a 100% chance that “apple” will be followed by “.”. You can then use these probabilities to generate new sentences, like so:

  • Input: The dog ate the apple.
  • Possible outputs:
    • The apple.
    • The dog ate the apple.
    • The dog ate the dog ate the apple.
    • The dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the apple. (This may seem silly, but remember that we’re only looking at two words at a time. This model doesn’t “remember” what came earlier or “know” that it’s repeating itself.)

OK, now for the actual training of the model. If you want to play around with this yourself, all my code and text is up on GitHub here.

  1. First, grab your text. I downloaded all my blog posts as an XML file and used a quick little R script to snag just the text I was interested in.
  2. Next, train your model. I used PyMarkovChain, but you can always write your own if you’re feeling it.
  3. Finally, use the model to generate text.

Alright, now I’ve got (admittedly very, very dumb) little program that kinda knows what I write like when I’m blogging. How well does it do at pretending to be me?

First try:

I’m going to break eye contact, look down at your own personalized ASR system

Well that’s… vaguely disturbing. But also not really very human-sounding, and certainly not something I’d write in a blog post. Let’s give it a couple more tries and see if it does any better.

  • But, if frosting has to have a career where you learned it from Clarice
  • We only perceive the things that our bodies and sensory organs and brains can perceive, and we really don’t care about to tell the sounds in the second part, I was expecting to see whether something is grammatical is to put up a bunch of historical stuff that we’ve all heard about the structure in the arts and sciences were depressed
  • (Which is really interesting–I’d love to learn your language doesn’t make you very, very good a recognizing speech
  • Believe me, I know what you can uncontract them and what’s the take-away
    People with short face syndrome–yes, that’s a classical description of what a language does, along with parallels drawn to another, related, languages
  • Short answer: they’re all correct
  • And those speakers are aware of
  • The Job Market for Linguistics PhDsWhat do you much

Hmm, not so great. The only output that sounded good to me was “Short answer: they’re all correct”. And that was just the exact same phrasing I used in this blog post. No wonder it sounds just like me; it is me!

So it looks like I won’t be handing the reins over to Making Noise and Hearing Things bot any time soon. True, I could have used a fancier tool, like a Recurrent Neural Network. But to be perfectly honest, I have yet to see any text generation system that can produce anything even close to approximating a human-written blog post. Will we get there? Maybe. But natural language generation, especially outside of really formulaic things like weather or sports reporting, is a super hard problem. Heck, we still haven’t gotten to point where computers can reliably solve third-grade math word problems.

The very complexities that make language so useful (and interesting to study) also make it so hard to model. Which is good news for me! It means there’s still plenty of work to do in language modelling and blogging.

What’s a “bumpus”?

So I recently had a pretty disconcerting experience. It turns out that almost no one else has heard of a word that I thought was pretty common. And when I say “no one” I’m including dialectologists; it’s unattested in the Oxford English Dictionary and the Dictionary of American Regional English. Out of the twenty two people who responded to my Twitter poll (which was probably mostly other linguists, given my social networks) only one other person said they’d even heard the word and, as I later confirmed, it turned out to be one of my college friends.

So what is this mysterious word that has so far evaded academic inquiry? Ladies, gentlemen and all others, please allow me to introduce you to…


Pronounced ‘bʌm.pɪs or ‘bʌm.pəs. You can hear me say the word and use it in context by listening to this low quality recording.

The word means something like “fool” or “incompetent person”. To prove that this is actually a real word that people other than me use, I’ve (very, very laboriously) found some examples from the internet. It shows up in the comments section of this news article:

THAT is why people are voting for Mr Trump, even if he does act sometimes like a Bumpus.

I also found it in a smattering of public tweets like this one:

If you ever meet my dad, please ask him what a “bumpus” is

And this one:

Having seen horror of war, one would think, John McCain would run from war. No, he runs to war, to get us involved. What a bumpus.

And, my personal favorite, this one:

because the SUN(in that pic) is wearing GLASSES god karen ur such a bumpus

There’s also an Urban Dictionary entry which suggests the definition:

A raucous, boisterous person or thing (usually african-american.)

I’m a little sceptical about the last one, though. Partly because it doesn’t line up with my own intuitions (I feel like a bumpus is more likely to be silent than rowdy) and partly becuase less popular Urban Dictionary entries, especially for words that are also names, are super unreliable.

I also wrote to my parents (Hi mom! Hi dad!) and asked them if they’d used the word growing up, in what contexts, and who they’d learned it from. My dad confirmed that he’d heard it growing up (mom hadn’t) and had a suggestion for where it might have come from:

I am pretty sure my dad used it – invariably in one of the two phrases [“don’t be a bumpus” or “don’t stand there like a bumpus”]….  Bumpass, Virginia is in Lousia County …. Growing up in Norfolk, it could have held connotations of really rural Virginia, maybe, for Dad.

While this is definitely a possibility, I don’t know that it’s definitely the origin of the word. Bumpass, Virginia, like  Bumpass Hell (see this review, which also includes the phrase “Don’t be a bumpass”), was named for an early settler. Interestingly, the college friend mentioned earlier is also from the Tidewater region of Virginia, which leads me to think that the word may have originated there.

My mom offered some other possible origins, that the term might be related to “country bumpkin” or “bump on a log”. I think the latter is especially interesting, given that “bump on a log” and “bumpus” show up in exactly the same phrase: standing/sitting there like a _______.

She also suggested it might be related to “bumpkis” or “bupkis”. This is a possibility, especially since that word is definitely from Yiddish and Norfolk, VA does have a history of Jewish settlement and Yiddish speakers.

A usage of “Bumpus” which seems to be the most common is in phrases like “Bumpus dog” or “Bumpus hound”. I think that this is probably actually a different use, though, and a direct reference to a scene from the movie A Christmas Story:

One final note is that there was a baseball pitcher in the late 1890’s who went by the nickname “Bumpus”: Bumpus Jones. While I can’t find any information about where the nickname came from, this post suggests that his family was from Virginia and that he had Powhatan ancestry.

I’m really interesting in learning more about this word and its distribution. My intuition is that it’s mainly used by older, white speakers in the South, possibly centered around the Tidewater region of Virginia.

If you’ve heard of or used this word, please leave a comment or drop me a line letting me know 1) roughly how old you are, 2) where you grew up and 3) (if you can remember) where you learned it. Feel free to add any other information you feel might be relevant, too!