The Single Most Common Language Technology Mistake (and how to avoid it)

This is an excerpt from my book “The 9 Most Common Language Technology Mistakes (and how to avoid them!)”. If you’d like to grab a copy (plus a handy checklist) you can get it on my Ko-fi. If you’d prefer a video version of the same information, you can check out the livestream I did here.

The single most common mistake I see language technologists make, especially newer ones, is failing to consider the domain of their data.

Mismatched shoes are fun and quirky… mismatched language data domains on the other hand will lead to a LOT of problems.

I’m talking about “domain” here in the linguistic sense, not the networking sense. Basically it means that you are attempting to train a piece of language technology using data that is different from what it will see in production. (This is a special case of out-of-distribution data or, if it occurs due to change over time, distribution shift.) The reason this trips up so many new language technology developers is because of a lack of understanding of what counts as “different”: there may be more factors to consider than you initially realize. Language varies systematically and fairly predictably depending on the situation where it is used. What this means from a machine learning standpoint is that the type of language data used is a very strong signal and that systems are very likely to pick up on it during training.


Topic

The most obvious type of domain mismatch is topic: what is the text about? If I want to build a system to handle medical text and I only train it with legal text, or customer support chat logs, obviously my system will not be able to handle the specific tokens and expressions that exist in the target medical data but not in the training data.
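One rough way to spot a topic mismatch before training is to check how many tokens in your target data never show up in your training data. Here is a minimal sketch of that check. The tokenizer is deliberately crude, and the function names and toy sentences are made up for illustration; they are not from any particular library or dataset.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def oov_rate(train_texts, target_texts):
    """Fraction of target tokens that never appear in the training texts."""
    vocab = {tok for text in train_texts for tok in tokenize(text)}
    target_tokens = [tok for text in target_texts for tok in tokenize(text)]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in vocab)
    return unseen / len(target_tokens)


# Hypothetical toy data, invented for illustration only.
legal_train = [
    "The tenant shall pay the rent on the first day of each month.",
    "The court granted the motion to dismiss the claim.",
]
medical_target = [
    "The patient shall take the antibiotic on the first day of treatment.",
]
legal_target = [
    "The court shall dismiss the claim on the first day.",
]

print(oov_rate(legal_train, medical_target))  # a third of the tokens are unseen
print(oov_rate(legal_train, legal_target))    # every token was seen in training
```

A high out-of-vocabulary rate is only a warning sign, not a full diagnosis: two datasets can share most of their vocabulary and still differ in the ways discussed in the rest of this section.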

“Domain” includes a lot more than topic however, and some of the most relevant domain differences are modality, formality, and intended audience.


Modality

First up is modality: how was this language originally produced? Was it spoken aloud, written by hand, typed or signed? Each of these is going to have a systematic effect on the language data that you see, even if the topic is more or less the same and they have all been transcribed into the same format. For example, it’s pretty rare to see emojis in handwritten text. You also won’t see spatial referencing in written text or spoken language the way that you do in signed languages.

Even when comparing different inputs–typed as opposed to spoken or dictated–you’ll see pretty big differences. Written text tends to have more rare words while spoken language tends to have more pronouns and a more skewed frequency distribution (i.e. more of the tokens used are very common). This 1998 paper looking at Swedish is a good citation if you’re interested in a deeper dive.
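If you want to check this kind of skew in your own data, two simple measures are the share of tokens covered by the few most frequent words and the type-token ratio. The sketch below is a rough illustration only: the two "samples" are sentences I invented, and in practice you would want to compare much larger samples of equal length, since both measures are sensitive to sample size.

```python
from collections import Counter


def top_k_coverage(tokens, k):
    """Share of all tokens accounted for by the k most frequent word types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical toy samples, invented for illustration only.
spoken = "so i told him that i think it is fine and he said it is fine too".split()
written = ("the committee concluded that the proposed amendment "
           "was acceptable without further revision").split()

for name, tokens in [("spoken", spoken), ("written", written)]:
    print(name, top_k_coverage(tokens, 3), type_token_ratio(tokens))
```

On these toy samples the spoken-style text has higher top-3 coverage and a lower type-token ratio, which is the direction described above, but two sentences obviously prove nothing on their own.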


Formality

Next up is formality. Often you’ll hear people discussing things like “noisy user generated text”. This means that this text has been produced in an informal setting, like social media, rather than a formal setting like a published paper. (Or even this document!) Informal text has its own patterns that you need to consider. For example, misspellings may sometimes be the result of actual mistakes such as typos but often they encode specific information. This study on French is a great example of this.

You are also more likely to see examples of language use that has been stigmatized in an educational setting. For example, code-switching, or using more than one language or language variety in a single text span. You are also more likely to see different language varieties. In the United States, African American English, which is a set of related language varieties predominantly used by the Black community, has been historically extremely stigmatized. As a result you are much more likely to find examples of grammatical structures or words from African American English in informal texts than formal ones (although that is changing and does depend on the specific audience for which the formal text was written). As a result a lot of tools trained on more formal text, like language identifiers, will fail when encountering this particular language variety.

You are also much more likely to see slang in informal text and these slang terms generally change very quickly. Consider terms like “based” or “lit” that, depending on when you’re reading this, may already sound extremely dated. On the other hand, very formal text is likely to be more stable over time but also to have its own type of rare tokens, especially jargon.

Intended audience

Speaking of jargon, another common source of variation is due to the intended audience. Consider this in your own life: do you write an email to a close friend the same way that you do to your boss? Even though the format of the text is very similar and it may be on the same topic, who you are producing the language output for has a large effect on what you say and how. And this goes even further than just the specific person you are talking to. Is this language being produced to be broadcast (shared with a large number of people who may not be able to each respond individually to the same degree) or has it been produced as part of a dyadic discussion (with a single other individual)? Broadcast text is likely to assume less background knowledge on the part of the listener whereas dyadic text will assume that you have access to the entire rest of the conversation.

If you have a good idea about what type of language use your system will encounter once it’s in production, you can tailor your training data to include a lot of examples of that type of language use. If you don’t, you are likely to have surprising errors that are extremely difficult to debug since they arise from the data and not the modeling itself.


Large language models cannot replace mental health professionals

I recently posted on Twitter the strong recommendation that individuals do not attempt to use large language models (like BERT or GPT-3) to replace the services of mental health professionals. I’m writing this blog post to specifically address individuals who think that this is a good idea.

CW: discussions of mental illness, medical abuse, self-harm and suicide. If you are in the US and currently experiencing crisis, 988 is now the nationwide number for mental health, substance use and suicidal crises; if you are outside the US this is a list of international suicide hotlines.

Scrabble tiles spelling out “Mental Health”. Wokandapix at Pixabay, CC0, via Wikimedia Commons

First: I want to start by acknowledging that if you are in the position where you are considering doing this, you are probably coming from a well-meaning place. You may know someone who has experienced mental illness or experienced it yourself and you want to help. That is a wonderful impulse and I applaud it. However, attempting to use large language models to replace the services of mental health professionals is *not* better than nothing. In fact, it is worse than nothing and has an extremely high probability of causing harm to the very people that you are trying to help. And this isn’t just my stance. Replacing clinicians with automated systems goes DIRECTLY against the recommendations of clinicians working in clinical applications of ML in mental health:

“ML and NLP should not lead to disempowerment of psychiatrists or replace the clinician-patient pair [emphasis mine].” – Le Glaz A, Haralambous Y, Kim-Dufor DH, Lenca P, Billot R, Ryan TC, Marsh J, DeVylder J, Walter M, Berrouiguet S, Lemey C. Machine Learning and Natural Language Processing in Mental Health: Systematic Review. J Med Internet Res. 2021 May 4;23(5):e15708. doi: 10.2196/15708. PMID: 33944788; PMCID: PMC8132982. 

Let me discuss some of the possible harms in more detail, however, since I have found that often when someone is particularly enthused about using large language models in a highly sensitive application they have not considered the scale of the potential harms or the larger societal impact of that application.

First, large language models are fundamentally unfit for purpose. They are trained on a general selection of language data much of which has been scraped from the internet, and I think most people would agree that “random text from the internet” is a very poor source of mental health advice. If you were to attempt to fine-tune a model for clinical practice, you would need to use clinical notes. These contain extremely sensitive personally identifying information. Previous sharing of similar information, specifically text from the Crisis Text Line for training machine learning models, has been resoundingly decried as unethical by both the machine learning and clinical communities. Further, we know that PII included as few as one time in the training data of a large language model can be re-identified via model probing. As a result, any of the extremely sensitive data used to tune the model has the potential of being leaked to end users of the model.

You would also open patients up to other kinds of adversarial attacks. In particular the use of universal triggers by a man-in-the-middle attacker could be used to intentionally serve text to patients that encouraged self-harm or suicide. And given the unpredictable nature of the text output from large language models in the first place, it would be impossible to ensure that the model didn’t just do that on its own, regardless of how much prompt engineering was done.

Further, mental health support is not generic. Advice that might be helpful to someone in one situation may be actively harmful to another. For example, someone worried about intrusive thoughts may find it very comforting to be told “you are not alone in your experience”. However, telling someone suffering from paranoia who says they are being followed “you are not alone in your experience” may help to reinforce that belief and further harm their mental health. Encouraging someone suffering from depression to “add some moderate intensity exercise to your daily routine” may be helpful for them. Encouraging someone who is suffering from compulsive exercising to “add some moderate intensity exercise to your daily routine” would be actively harmful. Since large language models are not medical practitioners, I hesitate to call this “malpractice”, however it is clear that there is an enormous potential for harm even when serving seemingly innocuous, general advice.

And while you may be tempted to address this by including diagnosing as part of the system, that in itself offers extremely high potential for harm. Mental health misdiagnosis can be extremely harmful to patients, even when it happens during consultation with a medical services provider. Adding on the veneer of supposed impartiality from an automated system may increase that harm.

And of course, once a diagnosis has been made (even if incorrectly) it then becomes personally identifiable information linked with the patient. However, since an automated system is not actually a clinician, even the limited data privacy protections provided for people in the US by something like HIPAA just don’t apply. (In fact it’s a big issue even with online services that use human service providers; Better Help in particular has sold data from therapy sessions to Facebook.) As a result, this data can be legally sold by third parties like data brokers. Even if you don’t intend to sell that data, there’s no way for you to ensure that it never will be, including after your death or if whatever legal entity you use to create the service is bought or goes into bankruptcy.

And this potential secondary use of the data, again even if the diagnosis is incorrect, has an enormous potential to harm individuals. In the US it could be used to raise their insurance rates, have them involuntarily committed or placed under an extremely controlling conservatorship (which may strip them of their right to vote or even allow them to be forcibly sterilized, which is still legal in 31 states). Mental illness is extremely stigmatized and creating a process for linking a diagnosis with an individual has extremely high potential for harm.

Attempting to replicate the services of a mental health clinician through the use of LLMs has the potential to harm the individuals who attempt to use that service. And I think there’s a broader lesson here as well: mental illness and mental health issues are fundamentally not an ML engineering problem. While we may use these tools to support clinicians or service providers or public health workers, we can’t replace them. 

Instead think about what it is specifically you want to do and spend your time doing something that will have an immediate, direct impact. If your goal is to support someone you love, support them directly. If you want to help your broader community, join existing organizations doing the work, like local crisis centers. If you need help, seek it out. If you are in crisis, seek immediate support. If you are merely looking for an interesting engineering problem, however, look elsewhere.

The trouble with sentiment analysis

Two things spurred me to write this post. First, I’d given the same advice three times which, according to David Robinson‘s rule, meant it was time. And, second, this news story on a startup that claims that they can detect student emotions over Zoom. With those things in mind, here is my very simple guidance on sentiment analysis:

You should almost never do sentiment analysis.

A picture of a stop sign against a blue sky. There are two wind turbines in the background. Photo by lamoix, CC BY 2.0, via Wikimedia Commons

Thanks for reading, hope that cleared things up. 🙂 In all seriousness, though, the places where it makes sense for a data scientist or NLP practitioner working in industry to use sentiment analysis are vanishingly rare. First, because it doesn’t work very well and second, because even when it does work it’s usually measuring the wrong thing.

What do I mean that sentiment analysis doesn’t work very well?

Let’s consider the most common approach. You have a list of words that are “positive” and a list of words that are “negative” (or, more rarely, a list of words associated with different emotions) and you count how many words from each list appear in your target text. If there are more positive words you assign a positive sentiment, more negative words a negative one and if they are equal or (much more likely) none of the words on either list show up you assign a neutral sentiment. Of course, there are a variety of other, more sophisticated approaches, but this is the most common one.

As you can see, there’s a lot you’ll miss using this approach. The construction of the lists may not have been done with the target text producers in mind, for example. A list from five years ago may possibly have “lit” as a word with positive sentiment, but what about “based”? If you’re attempting to characterize young internet users you’ll need to be very careful with your sentiment lists.

And a word-based approach cannot account for context. “Our flight was late but Cindy at O’Hare managed to change our connection”, for example, may have gotten a negative sentiment assigned to it due to “late” (assuming you’re working within the transportation domain) but in context this is actually a pretty positive review.
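To make the word-counting approach described above concrete, here is a minimal sketch of it, along with the context problem in action. The two word lists are tiny and invented by me for illustration (real lexicons are far larger), and the function name is hypothetical rather than taken from any existing tool.

```python
import re

# Hypothetical, tiny word lists invented for illustration only.
POSITIVE = {"great", "lovely", "helpful", "happy"}
NEGATIVE = {"late", "broken", "rude", "terrible"}


def lexicon_sentiment(text):
    """Count list hits and return 'positive', 'negative' or 'neutral'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    # Equal counts, including the common case of no hits at all.
    return "neutral"


review = "Our flight was late but Cindy at O'Hare managed to change our connection"
print(lexicon_sentiment(review))                    # negative
print(lexicon_sentiment("The meeting is at noon"))  # neutral
```

The review comes out as negative purely because “late” is on the negative list: nothing in the counting step can see that the sentence as a whole is praising Cindy.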

Plus, of course, sarcasm will be completely missed by such an approach. If someone says “lovely weather outside” in the middle of a tornado warning then you, as a human, know that they probably aren’t very happy about the weather. Which ties in to my next point: the question of what you’re measuring.

The specific thing you’re attempting to measure using sentiment analysis is the sentiment expressed in the text, but often you’ll see folks (generally tacitly) make a leap to assuming that what you’re measuring is how someone actually felt when they were writing the text. That’s pretty clearly not the case: I can write “I’m absolutely livid” and “I feel ecstatic joy” without experiencing those emotions and I’d expect most sentiment analysis tools would give those statements very strong negative and positive sentiments, respectively.

This is important because, generally, people tend to care more about what people are feeling than what they’re expressing in text. (A good counterexample would be a digital humanities project showing the sentiment of different passages in a novel.) And figuring out someone’s emotions from text is much, much more difficult and in most cases completely, utterly impossible. And speaking about what people are feeling….

Sentiment isn’t generally actually that useful

So given that sentiment analysis of text isn’t likely to tell you what people are feeling with any fidelity, where would you want to use it? A great question, and one that I think folks should ask more often. Usually when I see it being used, it’s in a case where there’s another, actually more useful, thing you want to know. Let’s look at some examples.

  • In a chatbot to know if the conversation is going well
    • Since most words are neutral and most turns are pretty short, you’re pretty unlikely to get helpful information unless things have already gone very, very wrong (as Zeerak Waseem points out, you can look for swearing). A far better thing to measure would be patterns that you shouldn’t see in efficient conversations, like lots of repetition or slight rephrasing.
  • To determine if reviews are good or bad
    • This one in particular I find baffling: most reviews are associated with a star rating, which is a clear measure directly from the person. A better use of time would probably be to do topic modelling or some other sort of unsupervised text analysis to see if there are persistent issues that should be addressed.
  • To predict customer churn based on call center logs
    • If you have the raw input text and the labelled outcomes (churned or not) then I’d just build a raw classifier. You’re also likely to get more mileage out of other metadata features, like how often they’ve contacted support, number of support tickets filed or something similar. Someone can be very polite to a customer service rep and still churn because the product just doesn’t meet their needs.
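For the chatbot case in the list above, here is a rough sketch of what “look for repetition or slight rephrasing” could mean in practice: flag pairs of user turns that are near-duplicates of each other. I’m using `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. The 0.8 threshold, the function name and the example turns are all assumptions of mine, and you would want to tune them on real conversations.

```python
from difflib import SequenceMatcher


def near_repeats(user_turns, threshold=0.8):
    """Return index pairs (i, j), i < j, of user turns that are near-duplicates."""
    # Normalize case and whitespace so trivial differences don't hide a repeat.
    normalized = [" ".join(turn.lower().split()) for turn in user_turns]
    pairs = []
    for j in range(1, len(normalized)):
        for i in range(j):
            ratio = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical user turns, invented for illustration only.
turns = [
    "I want to change my flight",
    "I want to change my flight please",
    "What is the baggage allowance?",
]
print(near_repeats(turns))  # [(0, 1)]
```

A conversation where users keep repeating or lightly rewording themselves is probably one where the system isn’t understanding them, whether or not any “negative” words ever show up.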

In all of these cases “what is the sentiment being expressed in the text” just isn’t the useful question. Sure, it’s quick enough to do a sentiment analysis that you might add it as a new feature to just see if it adds anything… but I’d say your time would be better spent elsewhere.

So why do people still use sentiment analysis?

Great question. Probably one reason is that it’s often used as an example in teaching. It has a pretty simple intuition, there are lots of existing tools for it and it can help students develop intuitions about corpus methods. That means that a *lot* of people have done sentiment analysis already at some point, and it’s much simpler to use a method you are already familiar with. (I get it, I do it too.)

And another, more pernicious reason, is that it’s harder to define a new problem (for which a tool or measure might not exist) than it is to redefine it as an existing one where an existing, simple-to-use tool is available. Even if that redefinition means that the work is no longer actually all that useful.

So, with all that said, the next time you’re thinking that you need to do sentiment analysis I’d encourage you to spend some time really considering if you’re sure before you decide to dive in.

An emoji dance notation system for TikTok dance tutorials 👀💃

This blog post is more of a quick record of my thoughts than a full in-depth analysis, because when I saw this I immediately wanted to start writing about it. Basically, TikTok is a social media app for short form video (RIP Vine, forever in our hearts) and one of the most popular genres of content is short dances; you may already be familiar with the concept.

HOWEVER, what’s particularly intriguing to me is this sort of video here, where someone creates a tutorial for a specific dance and includes an emoji-based dance notation:

Example of a dance with an emoji notation system by

Back in grad school, when I was studying signed languages, I probably spent more time than I should have reading about writing systems for signed languages and also dance notations. To roughly sum up an entire field of study: representing movements of the human body in time and space using a writing system, or even a more specialized notation, is extremely difficult. There are a LOT of other notations out there, and you probably haven’t run into them for a reason: they’re complex, hard to learn, necessarily miss nuances and are a bit redundant given that the vast majority of dance is learned through watching & copying movement. Probably the most well-known type of dance notation is for ballroom dance where the footwork patterns are represented on the floor using images of footsteps, like so:

Langsamer Walzer Grundschritt

I think part of the reason that this notation in particular tends to work well is that it’s completely iconic: the image of a shoe print is where your shoe print should go. It also captures a large part of the relevant information; the upper body position can be inferred from the position of the feet (and in many cases will more or less remain the same throughout).

I think that’s true to some degree of these emoji notations as well. The fact that they work at all may be arising in part due to the constraints of the TikTok dance genre. In most TikTok dances, the dancer faces in a single direction for the dance, there is minimal movement around the space and the feet move minimally if at all. The performance format itself helps as well: the videos are short and easy to repeat, and you can still see the movements being performed in full with the notation being used as a shorthand.

And it’s clear that this style of notation isn’t idiosyncratic; this compilation has a variety of tutorials from different creators that use variations on the same style of notation.

A selection of tiktok dance tutorials, some of which include emoji notation

Some of the types of ways emoji are used here are similar to the ways that things like Stokoe notation are, to indicate handshape and movement (although not location). A few other types of ways that emoji are used that stick out:

  • Articulator (hands with handshape, peach emoji for the hips)
  • Manner of articulation/movement (“explosive”, a specific number of repetitions, direction of movement using arrows)
  • Iconic representation of a movement using an object (helicopter = hands make the motion of helicopter blades, mermaid = bodywaves, as if a mermaid swimming)
  • Iconic representation of a shape to be traced (a house emoji = tracing a house shape with the hands, heart = trace a heart shape)
  • (Not emoji) Written shorthand for a (presumably) already known dance, for example “WOAH” for the woah

To sum up: I think this is a cool idea, it’s an interesting new type of dance notation that is clearly useful to a specific artistic community. It’s also another really good piece of evidence in the bucket of “emoji are gestures”: these are clearly not a linguistic system and are used in a variety of ways by different users that don’t seem entirely systematic.

Buuuut there’s also the way that the emojis are grouped into phrases for a specific set of related motions, which smells like some sort of shallow parsing even if it’s not a full constituency structure, and I’d say that’s definitely linguistic-ish. I think I’d need to spend more time on analysis to have any more firmly held opinion than that.

Who all studies language? 🤔 A brief disciplinary tour

Buckle up friends, we’re going on a tour!

One of the nice things about human language is that no matter what your question about it might be, someone, somewhere has almost certainly already asked the same thing… and probably found at least part of an answer! The downside of this wealth of knowledge is that, even if you restrict yourself to just looking at the Western academic tradition, 1) there’s a lot of it and 2) it’s scattered across a lot of disciplines which can make it very hard to find.

An academic discipline is a field of study but also a social network of scholars with shared norms and vocabulary. While people do do “interdisciplinary” work that draws on more than one discipline, the majority of academic life is structured around working in a single discipline. This is reflected in everything from departments to journals and conferences to how research funding is divided.

As a result, even if you study human language in some capacity yourself it can be very hard to form a good idea of where else people are doing related work if it falls into another discipline you don’t have contact with. You won’t see them at your conferences, you probably won’t cite each other in your papers and even if you are studying the exact same thing you’ll probably use different words to describe it and have different research goals. As a result, even many researchers working in language may not know what’s happening in the discipline next door.

For better or worse, though, I’ve always been very curious about disciplinary boundaries and talk and read to a lot of folks and, as a result, have ended up learning a lot about different disciplines. (Note: I don’t know that I’d recommend this to other junior scholars. It made me a bit of a “neither fish nor fowl” when I was on the faculty job market. I did have fun though. 😉) The upside of this is that I’ve had at least three discussions with people where the gist of it was “here are the academic fields that are relevant to your interest” and so I figured it was time to write it up as a blog post to save myself some time in the future.

Disciplines where language is the main focus

These fields study language itself. While people working in these fields may use different tools and have different goals, these are fields where people are likely to say that language is their area of study.


Linguistics

This is the field that studies Language and how it works. Sometimes you’ll hear people talk about “capital L language” to distinguish it from the study of a specific language. Whatever tools, methods or theories linguists use, their main object of study is language itself. There are a lot of fields within linguistics and they vary a lot, but generally if a field has “linguistics” on the end, they’re going to be focusing on language itself.

For more information about linguistics, check out the Linguistic Society of America or my friend Gretchen’s blog.

Language-specific disciplines (classics, English, literature, foreign language departments etc.)

This is a collection of disciplines that study particular languages and specific instances of language use (like specific documents or pieces of oral literature). These fields generally focus on language teaching or applying frameworks like critical theory to better understand texts. Oh, or they produce new texts themselves. If you ask someone in one of these fields what they study, they’ll probably say the name of the specific language or family of languages they work on.

There are a lot of different fields that fall under this umbrella, so I’d recommend searching for “[whatever language you want to know about] studies” and taking it from there.

Speech language pathology/Audiology/Speech and hearing

I’m grouping these disciplines together because they generally focus on language in a medical context. The main focus of researchers in this field is studying how the human body produces and receives language input. A lot of the work here focuses on identifying and treating instances when these processes break down.

A good place to learn more is the American Speech-Language-Hearing Association.

Computer science (Specifically natural language processing, computational linguistics)

This field (more likely to be called NLP these days) focuses on building and understanding computational systems where language data, usually text, is part of either the input or output. Currently the main focus of the field (in terms of press coverage and $$ at any rate) is in applying machine learning methods to various problems. A lot of work in NLP is focused around particular tasks which generally have an associated dataset and shared metric and where the aim is to outperform other systems on the same problem. NLP does use some methods from other fields of machine learning (like computer vision) but the majority of the work uses techniques specific to, or at least developed for, language data.

To learn more, I’d check out the Association for Computational Linguistics. (Note that “NLP” is also an acronym for a pseudoscience thing so I’d recommend searching #NLProc or “Natural Language Processing” instead.)

For reference, I would say that currently my main field is in applied NLP, but my background is primarily in linguistics with a sprinkling of language-specific studies, especially English and American Sign Language. (Although I’ve taken course work and been a co-author on papers in speech & hearing.)

Disciplines where language is sometimes studied

There are also a lot of related fields where language data is used, or language is used as a tool to study a different object of inquiry.

  • Data Science. You would be shocked how much of data science is working with text data (or maybe you’re a data scientist and you wouldn’t be). Pretty much every organization has some sort of text they would like to learn about without having to read it all.
  • Computational social science, which uses language data but also frequently other types of data produced by human interaction with computational systems. The aim is usually more to model or understand society rather than language use.
  • Anthropology, where language data is often used to better understand humans. (As a note, early British anthropology in particular is straight up racist imperial apologism, so be ye warned. There have been massive changes in the field, thankfully.) A lot of language documentation used to happen in anthropology departments, although these days I think it tends to be more linguistics. The linguistic-focused subdisciplines are anthropological linguistics or linguistic anthropology (they’re slightly different).
  • Sociology, the study of society. Sociolinguistics is more sociologically-informed linguistics, and in the US historically has been slightly more macro focused.
  • Psychology/Cognitive science. Non-physical brain stuff, like the mind and behavior. The linguistic part is psycholinguistics. This is where a lot of the work on language learning goes on.
  • Neurology. Physical brain stuff. The linguistic part is neurolinguistics. They tend to do a lot of imaging.
  • Education. A lot of the literature on language learning is in education. (Language learning is not to be confused with language acquisition; that’s only for the process by which children naturally acquire a language without formal instruction.)
  • Electrical engineering (Signal processing). This is generally the field of folks who are working on telephony and automatic speech recognition. NLP historically hasn’t done as much with voices, that’s been in electrical engineering/signal processing.
  • Disability studies. A lot of work on signed languages will be in disability studies departments if they don’t have their own department.
  • Historians. While they aren’t primarily studying the changes in linguistic systems, historians interact with older language data a lot and provide context for things like language contact, shift and historical usage.
  • Informatics/information science/library science. Information science is broader than linguistics (including non-linguistic information as well) but often dovetails with it, especially in semantics (the study of meaning) and ontologies (a formal representation of categories and their relations).
  • Information theory. This field is superficially focused on how digital information is encoded. Usually linguistics draws from it rather than vice-versa because it’s lower level, but if you’ve heard of entropy, compression or source-channel theory those are all from information theory.
  • Philosophy. A lot of early linguistics scholars, like Ferdinand de Saussure, would probably have considered themselves primarily philosophers and there was this whole big thing in the early 1900’s. The language-specific branch is philosophy of language.
  • Semiotics. This is a field I haven’t interacted with too much (I get the impression that it’s more popular in Europe than the US) but they study “signs”, which as I understand it is any way of referring to a thing in any medium without using the actual thing, which by that definition does include language.
  • Design studies. Another field I’m not super familiar with, but my understanding is that it includes studying how users of a designed thing interact with it, which may include how they use or interpret language. Also: good design is so important and I really don’t think designers get enough credit/kudos.

What you can, can’t and shouldn’t do with social media data

Earlier this summer, I gave a talk on the promise & pitfalls of social media data for the Joint Statistical Meetings. While I don’t think there’s a recording of the talk, enough people asked for one that I figured it would be worth putting together a blog post version of the talk. Enjoy!

What you can do with social media data

Let’s start with the good news: research using social media data has revolutionized social science research. It’s let us ask bigger questions more quickly, helped us overcome some of the key drawbacks of behavioral experimental work and let us ask new kinds of questions.

More data faster

I can’t overstate how revolutionary the easy availability of social media data has been, especially in linguistics. It has increased both the rate and scale of data collection by orders of magnitude. Compare the time it took to compile the Dictionary of American Regional English (DARE) to the Wordmapper app below. The results are more or less the same: maps of where in the US folks use different words (in this example, “cellar”). But what once took the entire careers of multiple researchers can now be done in a few months, and with far higher resolution.

| | Dictionary of American Regional English (DARE) | Word Mapper App |
|---|---|---|
| Data collection | 48 years (1965 – 2013) | <1 year |
| Size of team | 2,777 people | 4 people |
| Number of participants | 1,843 people | 20 million |
| Map | DARE map | Wordmapper map |

Social networks

Social media sites with a following or friend feature also let us ask really large scale questions about social networks. How do social networks and political affiliation interact? How does language change move through a social network? What characteristics of social network structure are more closely associated with the spread of misinformation? Of course, we could ask these questions before social media data… but by using APIs to access social media data, we reduce the timescale of these projects from decades to weeks or even days and we have a clear way to operationalize social network ties. It’s fairly hard for someone to sit down and list everyone they interact with face-to-face, but it’s very easy to grab a list of all the Twitter accounts you follow.

Wild-caught, all natural data

One of the constant struggles in experimental work is that the mere fact of being observed changes behavior. This is known as the Hawthorne Effect in psychology or the Observer’s Paradox in sociolinguistics. As a result, even the most well-designed experiment is limited by the fact that the participants know that they are completing an experiment.

Social media data, however, doesn’t have this limitation. Since most social media research projects are conducted on public data without interacting directly with participants, they are not generally considered human subjects research. When you post something on a public social media account, you don’t have a reasonable expectation of privacy. In other words, you know that just anyone could come along and read it, and that includes researchers. As a result it is not generally necessary to collect informed consent for social media projects. (Informed consent is when you are told exactly what’s going to happen during an experiment you’re participating in, and you agree to participate in it.) This means that the vast majority of folks who are participating in a social media study don’t actually know that they’re part of a study.

The benefit of this is that it allows researchers to get around three common confounds that plague social science research:

  • Bradley effect: People tend to tell researchers what they think they want to hear
  • Response bias: The sample of people willing to do an experiment/survey differ in a meaningful way from the population as a whole
  • Observer’s paradox/Hawthorne effect: People change their behavior when they know they’re being observed

While this is a boon to researchers, the lack of informed consent does introduce other problems, which we’ll talk about later.

What you can’t do with social media data

Of course, all the benefits of social media come at a cost. There are several key drawbacks and limitations of social media research:

  • You can’t be sure who your participants are.
  • There’s inherent sampling bias.
  • You can’t violate the developer’s agreements.

You’re not sure who you’re studying…

Because you don’t meet with the people whose data is included in your study, you don’t know for sure what sorts of demographic categories they belong to, whether they are who they’re claiming to be or even if they’re human at all. You have to deal with both bots (accounts where content is produced and distributed automatically by a computer) and sock puppets (accounts where one person pretends to be another person). Sock puppets in particular can be very difficult to spot and may skew your results in unpredictable ways.

…but you can be sure your sample is biased.

Social media users aren’t randomly drawn from the world’s population as a whole. Social media users tend to be WEIRD: from Western, educated, industrialized, rich and democratic societies. This group is already over-represented in social science and psychology research studies, which may be subtly skewing our models of human behavior.

In addition, different social media platforms have different user bases. For example, Instagram and Snapchat tend to have younger users, Pinterest has more women (especially compared to Reddit, which skews male) and LinkedIn users tend to be highly educated and upper middle class. And that doesn’t even get to social network effects: you’re more likely to be on the same platform your friends are on, and since social networks tend to be homophilous, you can end up with pockets of very socially homogeneous folks. So, even if you manage to sample randomly from a social media platform, your sample is likely to differ from one taken from the population as a whole.

You need to abide by the developer’s agreements for whatever platform you’re using data from.

This is mainly an issue if you’re using an API (application programming interface) to fetch data from a service. Developer’s agreements vary between platforms, but most limit the amount of data you can fetch and store, and how and if you can share it with other researchers. For example, if you’re sharing Twitter data you can only share 50,000 tweets at a time, and even then only if people have to download the file by clicking on it. If you want to share any more than that, you should share just the IDs of the tweets rather than the full tweets. (Document the Now’s Hydrator can help you fetch the tweets associated with a set of IDs.)

What you shouldn’t do with social media data

Finally, there are ethical restrictions on what we should do with social media data. As researchers, we need to 1) respect the wishes of users and 2) safeguard their best interests, especially given that we don’t (currently) generally get informed consent from the folks whose data we’re collecting.

Respecting users’ wishes

At least in the US, ethical human subjects research is led by three guiding principles set forth in the Belmont Report. If you’re unfamiliar with the report, it was written in the aftermath of the Tuskegee syphilis study. This was a medical study of African American men who had contracted syphilis, conducted from the 1930s to the 1970s. During the study, researchers withheld the cure (and even information that it existed) from the participants. The study directly resulted in the preventable deaths of 128 men and many health problems for study participants, their wives and children. It was a clear ethical violation of the human rights of participants and the moral stain of it continues to shape how we conduct human subjects research in the US.

The three principles of ethical human subjects research are:

  1. Respect for Persons: People should be treated as autonomous individuals and persons with diminished autonomy (like children or prisoners) are entitled to protection.
  2. Beneficence: 1) Do no harm and 2) maximize possible benefits and minimize possible harms.
  3. Justice: Both the risks and benefits of research should be distributed equally.

Social media research might not technically fall under the heading of human subjects research, since we aren’t intervening with our participants. However, I still believe that it’s important that researchers follow these general guidelines when designing and distributing experiments.

One thing we can do is respect the wishes of the communities we study. Fortunately, we have some evidence of what those wishes are. Fiesler and Proferes (2018) surveyed 368 Twitter users on their perception of a variety of research behaviors.

Fiesler, C., & Proferes, N. (2018). “Participant” Perceptions of Twitter Research Ethics. Social Media+ Society, 4(1), 2056305118763366. Table 4. 

In general, Twitter users are more OK with research with the following characteristics:

  • Large datasets
  • Analyzed automatically
  • Social media users informed about research
  • If tweets are quoted, they are anonymized. (Note that if you include the exact text, it’s possible to reverse search the quoted tweet and de-anonymize it. I recommend changing at least 20% of the content words in a tweet to synonyms to get around this and double-checking by trying to de-anonymize it yourself.)

These characteristics, however, are not as acceptable to Twitter users:

  • Small datasets
  • Analysis done by hand (presumably including analysis by Mechanical Turk workers)
  • Tweets from protected accounts or deleted tweets analyzed (which is also against the developer’s agreement, so you shouldn’t be doing this anyway)
  • Quoting with citation (very different from academic norms!)

I think these suggest some general best practices for researchers working with Twitter data.

  • Stick to larger datasets
  • Try to automate wherever possible
  • Follow the developer’s agreement
  • Take anonymity seriously.

There is one thing I disagree with, however: I don’t think we should contact everyone whose tweets we use in our research.

Should we contact people whose tweets we use in our studies? My gut instinct on this one is “no”. If you’re collecting a large amount of data, you probably shouldn’t reach out to everyone in the data.

For users who don’t have open DMs, the only way to contact them is to publicly mention them using @username. The problem with this is that it partly de-anonymizes your data. If you then choose to share your data, the fact that you’ve publicly shared a list of whose data was included makes the dataset much easier to de-anonymize. Instead of trying to figure out whose tweets were included when looking at all of Twitter, an adversary only has to figure out which of the users on the list you’ve given them is connected to which record.

The main exception to this is if you have a project that’s a deep dive on one user, in which case you probably should. (For example, I contacted Chaz Smith and let him know about my phonological analysis of his #pronouncingthingsincorrectly Vines.)

Do no harm

Another aspect of ethical research is trying to ensure that your research or research data doesn’t have potentially unethical applications. The elephant in the room here, of course, is the data Cambridge Analytica collected from Facebook users. Researchers at Cambridge, collecting data for a research project, got lots of people’s permission to access their Facebook data. While that wasn’t a problem, they collected and saved Facebook data from other folks as well, who hadn’t opted in. In the end, only a half of a half of a percent of the folks whose data was in the final dataset actually agreed to be included in it. To make matters worse, this data was used by a commercial company founded by one of the researchers to (possibly) influence elections in the US and UK. Here’s a New York Times article that goes into much more detail. This has understandably led to increased scrutiny of how social media research data is collected and used.

I’m not bringing this up to call out Facebook in particular, but to explain why it’s important to consider how research data might be used long-term. How and where will it be stored? For how long? Who will have access to it? In short, if you’re a researcher, how can you ensure that data you collected won’t end up somehow hurting the people you collected it from?

As an example of how important these questions are, consider this OK Cupid “research” dataset. It was collected without consent and shared publicly without anonymization. It included many personal details that were only intended to be shared with other users of the site, including explicit statements of sexual orientation. In addition to being an unforgivable breach of privacy, this directly endangered users whose data was collected: information on sexual orientation was shared for people living in countries where homosexuality is a crime that carries a death penalty or sentence of life in prison. I have a lot of other issues with this “study” as well, but the fact that it directly endangered research subjects who had no chance to opt out is by far the most egregious ethical breach.

If you are collecting social media data for research purposes, it is your ethical responsibility to safeguard the well-being of the people whose data you’re using.

I bring up these cautionary tales not to scare you off of social media research but to really impress upon you the gravity of the responsibility you carry as a social media researcher. Social media data has the potential to dramatically improve our understanding of the world. A lot of my own work has relied heavily on it! But it’s important that we, as researchers, take very seriously our moral duty to make sure that we don’t end up doing more harm than good.

Are emoji sequences as informative as text?

Something I’ve been thinking about a lot lately is how much information we really convey with emoji. I was recently at the 1st International Workshop on Emoji Understanding and Applications in Social Media and one theme that stood out to me from the papers was that emoji tend to be used more to communicate social meaning (things like tone and when a conversation is over) than semantics (content stuff like “this is a dog” or “an ice cream truck”).

I’ve been itching to apply an information theoretic approach to emoji use for a while, and this seemed like the perfect opportunity. Information theory is the study of storing, transmitting and, most importantly for this project, quantifying information. In other words, using an information theoretic approach we can actually look at two input texts and figure out which one has more information in it. And that’s just what we’re going to do: we’re going to use a measure called “entropy” to directly compare the amount of information in text and emoji.

What’s entropy?

Shannon entropy is a measure of how much information there is in a sequence. Higher entropy means that there’s more uncertainty about what comes next, while lower entropy means there’s less uncertainty.  (Mathematically, entropy is always less than or the same as log2(n), where n is the total number of unique characters. You can learn more about calculating entropy and play around with an interactive calculator here if you’re curious.)

So if you have a string of text that’s just one character repeated over and over (like this: 💀💀💀💀💀) you don’t need a lot of extra information to know what the next character will be: it will always be the same thing. So the string “💀💀💀💀💀” has a very low entropy. In this case it’s actually 0, which means that if you’re going through the string and predicting what comes next, you’re always going to be able to guess what comes next because it’s always the same thing. On the other hand, if you have a string that’s made up of four different characters, all of which are equally probable (like this: ♢♡♧♤♡♧♤♢), then you’ll have an entropy of 2.

TL;DR: The higher the entropy of a string the more information is in it.
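To make the two examples above concrete, here is a minimal sketch of character-level Shannon entropy in Python. It isn't the analysis code from the project, and the function name `shannon_entropy` is just one I picked for illustration. It treats each Unicode code point as one character, which is an assumption that works for simple emoji like the ones above but would split multi-part emoji (like ones with skin tone modifiers) into pieces.

```python
from collections import Counter
from math import log2


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy of `text`, in bits per character.

    Each Unicode code point counts as one character (an assumption;
    multi-code-point emoji would be counted as several characters).
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = sum over characters of p * log2(1 / p)
    return sum((n / total) * log2(total / n) for n in counts.values())


print(shannon_entropy("💀💀💀💀💀"))      # one repeated character
print(shannon_entropy("♢♡♧♤♡♧♤♢"))  # four equally likely characters
```

The first call gives 0 and the second gives 2, matching the worked examples in the text.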



We do have some theoretical maximums for the entropy of text and emoji. For text, if the text string is just randomly drawn from the 128 ASCII characters (which isn’t how language works, but this is just an approximation) our entropy would be 7. On the other hand, for emoji, if people are just randomly using any emoji they like from the set of emoji as of June 2017, then we’d expect to see an entropy of around 11.
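Those maximums come straight from the log2(n) bound mentioned earlier: if all n symbols are equally likely, the entropy is exactly log2(n). Here is a quick sketch of that arithmetic. The helper name `max_entropy` is made up for this example, and I'm not plugging in an exact emoji count since the text only gives the rounded result.

```python
from math import log2


def max_entropy(alphabet_size: int) -> float:
    """Upper bound on entropy (in bits) for a uniform draw from the alphabet."""
    return log2(alphabet_size)


print(max_entropy(128))  # the 128 ASCII characters
print(2 ** 11)           # how many equally likely symbols give 11 bits
```

The first line prints 7.0. The second prints 2048, so an entropy of "around 11" corresponds to an emoji inventory somewhere in the low thousands.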

So if people are just using letters or emoji randomly, then text should have lower entropy than emoji. However, I don’t think that’s what’s happening. My hypothesis, based on the amount of repetition in emoji, was that emoji should have lower entropy, i.e. less information, than text.


To get emoji and text spans for our experiment I used four different datasets: three from Twitter and one from YouTube.

I used multiple datasets for a couple reasons. First, because I wanted a really large dataset of tweets with emoji, and since only between 0.5% and 0.9% of tweets from each Twitter dataset actually contained emoji I needed to cast a wide net. And, second, because I’m growing increasingly concerned about genre effects in NLP research. (Like, a lot of our research is on Twitter data. Which is fine, but I’m worried that we’re narrowing the potential applications of our research because of it.) It’s the second reason that led me to include YouTube data. I used Twitter data for my initial exploration and then used the YouTube data to validate my findings.

For each dataset, I grabbed all adjacent emoji from a tweet and stored them separately. So this tweet:

Love going to ballgames! ⚾🌭 Going home to work in my garden now, tho 🌸🌸🌸🌸

Has two spans in it:

Span 1:  ⚾🌭

Span 2: 🌸🌸🌸🌸
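Here is a rough sketch of how pulling out adjacent-emoji spans like these could work in Python. To be clear, this is not the code I used for the analysis: the function name `emoji_spans` is hypothetical, and the regular expression only covers a few of the main emoji code point ranges (plus the variation selector and zero-width joiner), so it will miss some emoji, like flags. A real project would want a maintained emoji list instead.

```python
import re

# Rough approximation of "an emoji": a few major Unicode blocks only.
# This is an assumption for illustration, not a complete emoji definition.
_EMOJI = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
# Variation selector-16 and zero-width joiner can trail or connect emoji.
_JOINERS = "[\uFE0F\u200D]"
_SPAN = re.compile(f"(?:{_EMOJI}{_JOINERS}*)+")


def emoji_spans(text: str, min_emoji: int = 1) -> list:
    """Return runs of adjacent emoji in `text`, in order of appearance.

    Spans with fewer than `min_emoji` base emoji are dropped.
    """
    spans = [m.group(0) for m in _SPAN.finditer(text)]
    return [s for s in spans if len(re.findall(_EMOJI, s)) >= min_emoji]


tweet = "Love going to ballgames! ⚾🌭 Going home to work in my garden now, tho 🌸🌸🌸🌸"
print(emoji_spans(tweet))  # two spans, as in the example above
```

Calling `emoji_spans(tweet, min_emoji=2)` drops any spans that contain only a single emoji.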

All told, I ended up with 13,825 tweets with emoji and 18,717 emoji spans of which only 4,713 were longer than one emoji. (I ignored all the emoji spans of length one, since they’ll always have an entropy of 0 and aren’t that interesting to me.) For the YouTube comments, I ended up with 88,629 comments with emoji, 115,707 emoji spans and 47,138 spans with a length greater than one.

In order to look at text as parallel as possible to my emoji spans, I grabbed tweets & YouTube comments without emoji. For each genre, I took a number of texts equal to the number of spans of length > 1 and then calculated the character-level entropy for the emoji spans and the texts.



First, let’s look at Tweets. Here’s the density (it’s like a smooth histogram, where the area under the curve is always equal to 1 for each group) of the entropy of an equivalent number of emoji spans and tweets.

Text has a much higher character-level entropy than emoji. For text, the mean and median entropy are both around 5. For emoji, there is a multimodal distribution, with the median entropy being 0 and also clusters around 1 and 1.5.

It looks like my hypothesis was right! At least in tweets, text has much more information than emoji. In fact, the most common entropy for an emoji span is 0, which means that most emoji spans with a length greater than one are just repetitions of the same emoji over and over again.

But is this just true on Twitter, or does it extend to YouTube comments as well?

The pattern for emoji & text in YouTube comments is very similar to that for Tweets. The biggest difference is that it looks like there’s less information in YouTube Comments that are text-based; they have a mean and median entropy closer to 4 than 5.

The YouTube data, which we have almost ten times more of, corroborates the earlier finding: emoji spans are less informative, and more repetitive, than text.

Which emoji were repeated the most/least often?

Just in case you were wondering, the emoji most likely to be repeated was the skull emoji, 💀. It’s generally used to convey strong negative emotion, especially embarrassment, awkwardness or speechlessness, similar to “ded”.

The least likely was the right-pointing arrow (▶️), which is usually used in front of links to videos.

More info & further work

If you’re interested, the code for my analysis is available here. I also did some of this work as live coding, which you can follow along with on YouTube here.

For future work, I’m planning on looking at which kinds of emoji are more likely to be repeated. My intuition is that gestural emoji (so anything with a hand or face) are more likely to be repeated than other types of emoji–which would definitely add some fuel to the “are emoji words or gestures” debate!

Datasets for data cleaning practice

Looking for datasets to practice data cleaning or preprocessing on? Look no further!

Each of these datasets needs a little bit of TLC before it’s ready for different analysis techniques. For each dataset, I’ve included a link to where you can access it, a brief description of what’s in it, and an “issues” section describing what needs to be done or fixed in order for it to fit easily into a data analysis pipeline.

Big thanks to everyone in this Twitter thread who helped me out by pointing me towards these datasets and letting me know what sort of pre-processing each needed. There were also some other data sources I didn’t include here, so check it out if you need more practice data. And feel free to comment with links to other datasets that would make good data cleaning practice! 🙂

List of datasets:

  • Hourly Weather Surface – Brazil (Southeast region)
  • PhyloTree Data
  • International Comprehensive Ocean-Atmosphere Data Set
  • CLEANEVAL: Development dataset
  • London Air
  • Production and Perception of Linguistic Voice Quality
  • Australian Marriage Law Postal Survey, 2017
  • The Metropolitan Museum of Art Open Access
  • National Drug Code Directory
  • Flourish OA
  • WikiPlots
  • Register of UK Parliament Members’ Financial Interests
  • NYC Gifted & Talented Scores

Hourly Weather Surface – Brazil (Southeast region)

It covers hourly weather data from 122 weather stations in the southeast region of Brazil. The southeast includes the states of Rio de Janeiro, São Paulo, Minas Gerais and Espírito Santo. Dataset Source: INMET (National Meteorological Institute – Brazil).

Issues: Can you predict the amount of rain? Temperature? NOTE: Not all weather stations have been operating since 2000.

PhyloTree Data

Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Human mitochondrial DNA is widely used as a tool in many fields including evolutionary anthropology and population history, medical genetics, genetic genealogy, and forensic science. Many applications require detailed knowledge about the phylogenetic relationship of mtDNA variants. Although the phylogenetic resolution of global human mtDNA diversity has greatly improved as a result of increasing sequencing efforts of complete mtDNA genomes, an updated overall mtDNA tree is currently not available. In order to facilitate a better use of known mtDNA variation, we have constructed an updated comprehensive phylogeny of global human mtDNA variation, based on both coding‐ and control region mutations. This complete mtDNA tree includes previously published as well as newly identified haplogroups.

Issues: This data would be more useful if it were in the Newick tree format and could be read in using the read.newick() function. Can you help get the data in this format?
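If you want a starting point for the Newick conversion, here is a minimal sketch of just the serialization step in Python. It assumes you have already parsed the tree into (depth, label) rows in top-down order with a single root at depth 0; that input shape is my assumption for illustration, not something the dataset gives you directly, and the function names are hypothetical. Labels containing anything beyond letters, digits, underscores, periods or hyphens are wrapped in single quotes with internal quotes doubled, following the usual Newick quoting convention.

```python
import re


def quote_label(label: str) -> str:
    """Quote a label for Newick if it contains special characters."""
    if re.fullmatch(r"[A-Za-z0-9_.\-]+", label):
        return label
    return "'" + label.replace("'", "''") + "'"


def to_newick(rows) -> str:
    """Serialize (depth, label) rows (depth-first order, root at depth 0)."""
    root = None
    stack = []  # stack[d] is the most recent node seen at depth d
    for depth, label in rows:
        node = {"label": label, "children": []}
        if depth == 0:
            root = node
            stack = [node]
        else:
            del stack[depth:]  # drop nodes at this depth or deeper
            stack[-1]["children"].append(node)
            stack.append(node)
    if root is None:
        raise ValueError("no root row (depth 0) found")

    def render(node) -> str:
        name = quote_label(node["label"])
        if node["children"]:
            inner = ",".join(render(child) for child in node["children"])
            return f"({inner}){name}"
        return name

    return render(root) + ";"


rows = [(0, "root"), (1, "A"), (2, "A1"), (2, "A2"), (1, "B")]
print(to_newick(rows))  # ((A1,A2)A,B)root;
```

Parsing the source file into those rows is the real data cleaning work, and this sketch doesn't attempt it.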

International Comprehensive Ocean-Atmosphere Data Set

The International Comprehensive Ocean-Atmosphere Data Set (ICOADS) offers surface marine data spanning the past three centuries, and simple gridded monthly summary products for 2° latitude x 2° longitude boxes back to 1800 (and 1°x1° boxes since 1960)—these data and products are freely distributed worldwide. As it contains observations from many different observing systems encompassing the evolution of measurement technology over hundreds of years, ICOADS is probably the most complete and heterogeneous collection of surface marine data in existence.

Issues: The ICOADS contains O(500M) meteorological observations from ~1650 onwards. Issues include bad observation values, mis-positioned data, missing date/time information, supplemental data in a variety of formats, duplicates etc.

CLEANEVAL: Development dataset

CLEANEVAL is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus, for linguistic and language technology research and development. There are three versions of each file: original, pre-processed, and manually cleaned. All files of each kind are gathered in a directory. The file number remains the same for the three versions of the same file.

Issues: Your task is to “clean up” a set of webpages so that their contents can be easily used for further linguistic processing and analysis. In short, this implies:

  • removing all HTML/Javascript code and “boilerplate” (headers, copyright notices, link lists, materials repeated across most pages of a site, etc.);
  • adding a basic encoding of the structure of the page using a minimal set of symbols to mark the beginning of headers, paragraphs and list elements.

London Air

The London Air Quality Network (LAQN) is run by the Environmental Research Group of King’s College London. It was formed in 1993 to coordinate and improve air pollution monitoring in London. The network collects air pollution data from London boroughs, with each one funding monitoring in its own area. Increasingly, this information is being supplemented with measurements from local authorities surrounding London in Essex, Kent and Surrey, thereby providing an overall perspective of air pollution in South East England, as well as a greater understanding of pollution in London itself.

Issues: Lots of gaps (null/zero handling), outliers, date handling, pivots and time aggregation needed first!


Boing Boing Halloween Candy Hierarchy, 2017

Candy hierarchy data for 2017 Boing Boing Halloween candy hierarchy. This is survey data from this survey.

Issues: If you want to look for longitudinal effects, you also have access to previous datasets. Unfortunate quirks in the data include the fact that the 2014 data is not the raw set (can’t seem to find it), and in 2015, the candy preference was queried without the MEH option.

Production and Perception of Linguistic Voice Quality

Data from the “Production and Perception of Linguistic Voice Quality” project at UCLA. This project was funded by NSF grant BCS-0720304 to Prof. Pat Keating, with Prof. Abeer Alwan, Prof. Jody Kreiman of UCLA, and Prof. Christina Esposito of Macalester College, for 2007-2012.

The data includes spreadsheet files with measures gathered using Voicesauce (Shue, Keating, Vicenik & Yu 2011) for both acoustic measures and EGG measures. The accompanying readme file provides information on the various coding used in both spreadsheets.

Issues: The following issues are with the acoustics measures spreadsheet specifically.

  1. xlsx format with meaningful color coding created by a VBA script (which is copy-pasted into the second sheet)
  2. partially wide format instead of long/tidy, with a ton of columns split into different timepoints
  3. line 6461 has another set of column headers rather than data for some of the columns starting with “shrF0_mean”. I think this was a copy-paste error. Hopefully it doesn’t mean that all of the data below that row is shifted down by 1!

Australian Marriage Law Postal Survey, 2017

Response: Should the law be changed to allow same-sex couples to marry?

Of the eligible Australians who expressed a view on this question, the majority indicated that the law should be changed to allow same-sex couples to marry, with 7,817,247 (61.6%) responding Yes and 4,873,987 (38.4%) responding No. Nearly 8 out of 10 eligible Australians (79.5%) expressed their view.

All states and territories recorded a majority Yes response. 133 of the 150 Federal Electoral Divisions recorded a majority Yes response, and 17 of the 150 Federal Electoral Divisions recorded a majority No response.

Issues: Miles McBain discusses his approach to cleaning this dataset in depth in this blog post.

The Metropolitan Museum of Art Open Access

The Metropolitan Museum of Art provides select datasets of information on more than 420,000 artworks in its Collection for unrestricted commercial and noncommercial use. To the extent possible under law, The Metropolitan Museum of Art has waived all copyright and related or neighboring rights to this dataset using Creative Commons Zero. This work is published from: The United States Of America. You can also find the text of the CC Zero deed in the file LICENSE in this repository. These select datasets are now available for use in any media without permission or fee; they also include identifying data for artworks under copyright. The datasets support the search, use, and interaction with the Museum’s collection.

Issues: Missing values, inconsistent information, missing documentation, possible duplication, mixed text and numeric data.

National Drug Code Directory

The Drug Listing Act of 1972 requires registered drug establishments to provide the Food and Drug Administration (FDA) with a current list of all drugs manufactured, prepared, propagated, compounded, or processed by it for commercial distribution. (See Section 510 of the Federal Food, Drug, and Cosmetic Act (Act) (21 U.S.C. § 360)). Drug products are identified and reported using a unique, three-segment number, called the National Drug Code (NDC), which serves as a universal product identifier for drugs. FDA publishes the listed NDC numbers and the information submitted as part of the listing information in the NDC Directory which is updated daily.

The information submitted as part of the listing process, the NDC number, and the NDC Directory are used in the implementation and enforcement of the Act.

Issues: Non-trivial duplication (which drugs are different names for the same things?).

Flourish OA

Our data comes from a variety of sources, including researchers, web scraping, and the publishers themselves. All data is cleaned and reviewed to ensure its validity and integrity. Our catalog expands regularly, as does the number of features our data contains. We strive to maintain the most complete and sophisticated store of Open Access data in the world, and it is this mission that drives our continued work and expansion.

A dataset on journal/publisher information that is a bit dirty and might make for great practice. It’s been a graduate student/community project.

Issues: Scraped data, has some missing fields, possible duplication and some encoding issues (possibly multiple character encodings).



The WikiPlots corpus is a collection of 112,936 story plots extracted from English language Wikipedia. These stories are extracted from any English language article that contains a sub-header that contains the word “plot” (e.g., “Plot”, “Plot Summary”, etc.).

This repository contains code and instructions for how to recreate the WikiPlots corpus.

The dataset itself can be downloaded from here: (updated: 09/26/2017). The zip file contains two files:

  • plots: a text file containing all story plots. Each story plot is given with one sentence per line. Each story is followed by <EOS> on a line by itself.
  • titles: a text file containing a list of titles for each article in which a story plot was found and extracted.
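If you want to get the two files into a usable shape, here is a minimal sketch of one way to do it. It assumes the end-of-story marker is the literal string `<EOS>` on its own line and that the titles file has one title per story, in the same order as the plots file. The function names (`iter_stories`, `pair_titles_with_stories`) are made up for this example and are not part of the WikiPlots repository.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumption: stories are separated by this marker on a line by itself.
END_OF_STORY = "<EOS>"


def iter_stories(lines: Iterable[str], marker: str = END_OF_STORY) -> Iterator[List[str]]:
    """Yield each story as a list of sentences (one sentence per input line)."""
    story: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line == marker:
            # Yield even an empty story so stories stay aligned with titles.
            yield story
            story = []
        elif line:
            story.append(line)
    if story:
        # Tolerate a final story that is missing its closing marker.
        yield story


def pair_titles_with_stories(
    title_lines: Iterable[str], plot_lines: Iterable[str]
) -> List[Tuple[str, List[str]]]:
    """Match the i-th title with the i-th story; fail loudly on a count mismatch."""
    titles = [t.strip() for t in title_lines if t.strip()]
    stories = list(iter_stories(plot_lines))
    if len(titles) != len(stories):
        raise ValueError(f"{len(titles)} titles but {len(stories)} stories")
    return list(zip(titles, stories))


# Tiny in-memory stand-ins for the real files.
sample_plots = ["A hero leaves home.\n", "She returns.\n", "<EOS>\n", "A ship sinks.\n", "<EOS>\n"]
sample_titles = ["Story One\n", "Story Two\n"]
pairs = pair_titles_with_stories(sample_titles, sample_plots)
```

With the real data you would pass open file handles (opened with `encoding="utf-8"`) instead of the sample lists. The count check is there on purpose: given the issues listed below, a mismatch between titles and stories is worth catching early rather than silently misaligning everything after it.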

Issues: Some lines may be cut off due to abbreviations. Some plots may be incomplete or contain surrounding irrelevant information.

Register of UK Parliament Members’ Financial Interests

The main purpose of the Register is to provide information about any financial interest which a Member has, or any benefit which he or she receives, which others might reasonably consider to influence his or her actions or words as a Member of Parliament.

Members must register any change to their registrable interests within 28 days. The rules are set out in detail in the Guide to the Rules relating to the Conduct of Members, as approved by the House on 17 March 2015. Interests which arose before 7 May 2015 are registered in accordance with earlier rules.

The Register is maintained by the Parliamentary Commissioner for Standards. It is updated fortnightly online when the House is sitting, and less frequently at other times. Interests remain on the Register for twelve months after they have expired.

Issues: Each member’s transactions are on a separate webpage with a different text format, with contributions listed under different headings (not necessarily one per line) and in different formats. Will take quite a bit of careful preprocessing to get into CSV or JSON format.

NYC Gifted & Talented Scores

A couple of messy but easy data sets: NYC parents reporting their kids’ scores on the gifted and talented exam, as well as school priority ranking. Some enter the percentiles as point scores, some skip them altogether, there’s no standard preference format, etc. Also, birth quarter affects percentiles.

How do we use emoji?

Those of you who know me may know that I’m a big fan of emoji. I’m also a big fan of linguistics and NLP, so, naturally, I’m very curious about the linguistic roles of emoji. Since I figured some of you might also be curious, I’ve pulled together a discussion of some of the very serious scholarly research on emoji. In particular, I’m going to talk about five recent papers that explore the exact linguistic nature of these symbols: what are they and how do we use them?

Emoji are more than just cute pictures! They play a set of very specific linguistic roles.

Dürscheid & Siever, 2017:

This paper makes one overarching point: emoji are not words. They cannot be unambiguously interpreted without supporting text and they do not have clear syntactic relationships to one another. Rather, the authors consider emoji to be specialized characters, and place them within Gallmann’s 1985 hierarchy of graphical signs. The authors show that emoji can play a range of roles within Gallmann’s functional classification.

  • Allography: using emoji to replace specific characters (for example: the word “emoji” written as “em😝ji”)
  • Ideograms: using emoji to replace a specific word (example: “I’m travelling by 🚘” to mean “I’m travelling by car”)
  • Border and Sentence Intention signals: using emoji both to clarify the tone of the preceding sentence and also to show that the sentence is over, often replacing the final punctuation marks.

Based on an analysis of a Swiss German WhatsApp corpus, the authors conclude that the final category is far and away the most popular, and that emoji rarely replace any of the lexical parts of a message.

Na’aman et al, 2017:

Na’aman and co-authors also develop a hierarchy of emoji usage, with three top-level categories: Function, Content (both of which would mostly fall under the ideogram category in Dürscheid & Siever’s classification) and Multimodal.

  • Function: Emoji replacing function words, including prepositions, auxiliary verbs, conjunctions, determinatives and punctuation. An example of this category would be “I like 🍩 you”, to be read as “I do not like you”.
  • Content: Emoji replacing content words and phrases, including nouns, verbs, adjectives and adverbs. An example of this would be “The 🔑 to success”, to be read as “the key to success”.
  • Multimodal: These emoji “enrich a grammatically-complete text with markers of affect or stance”. These would fall under the category of border signals in Dürscheid & Siever’s framework, but Na’aman et al. further divide these into four categories: attitude, topic, gesture and other.

Based on analysis of a Twitter corpus made up of only tweets containing emoji, the authors find that multimodal emoji encoding attitude are far and away the most common, making up over 50% of the emoji spans in their corpus. The next most common uses of emoji are multimodal:topic and multimodal:gesture. Together, these three categories account for close to 90% of all the emoji use in the corpus, corroborating the findings of Dürscheid & Siever.

Wood & Ruder, 2016:

Wood and Ruder provide further evidence that emoji are used to express emotion (or “attitude”, in Na’aman et al’s terms). They found a strong correlation between the presence of emoji that they had previously determined were associated with a particular emotion, like 😂 for joy or 😭 for sadness, and human annotations of the emotion expressed in those tweets. In addition, an emotion classifier using only emoji as input performed similarly to one trained using n-grams excluding emoji. This provides evidence that there is an established relationship between specific emoji use and expressing emotion.

Donato & Paggio, 2017:

However, the relationship between text and emoji may not always be so close. Donato & Paggio collected a corpus of tweets which contained at least one emoji and that were hand-annotated for whether the emoji was redundant given the text of the tweet. For example, “We’ll always have Beer. I’ll see to it. I got your back on that one. 🍺” would be redundant, while “Hopin for the best 🎓” would not be, since the beer emoji expresses content already expressed in the tweet, while the mortarboard adds new information (that the person is hoping to graduate, perhaps). The majority of emoji, close to 60%, were found not to be redundant and added new information to the tweet.

However, the corpus was intentionally balanced between ten topic areas, of which only one was feelings, and as a result the majority of feeling-related tweets were excluded from analysis. Based on this analysis and Wood and Ruder’s work, we might hypothesize that feelings-related emoji may be more redundant than emoji from other semantic categories.

Barbieri et al, 2017:

Additional evidence for the idea that emoji, especially those that show emotion, are predictable given the text surrounding them comes from Barbieri et al. In their task, they removed the emoji from a thousand tweets that contained one of the following five emoji: 😂, ❤️, 😍, 💯 or 🔥. These emoji were selected since they were the most common in the larger dataset of half a million tweets. They then asked human crowd workers to fill in the missing emoji given the text of the tweet, and trained a character-level bidirectional LSTM to do the same task. Both humans and the LSTM performed well over chance, with an F1 score of 0.50 for the humans and 0.65 for the LSTM.
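If F1 is a bit abstract for you, here is a rough sketch of how a score like that can be computed for an emoji prediction task. A big caveat: I’m using macro-averaged F1 here (compute F1 per emoji, then take the plain average), which is my assumption for illustration and not necessarily the exact averaging used in the paper. The function names and the toy labels are also made up.

```python
from collections import Counter
from typing import Dict, Sequence


def per_label_f1(gold: Sequence[str], predicted: Sequence[str]) -> Dict[str, float]:
    """F1 for each label: 2*TP / (2*TP + FP + FN)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted this emoji, but it was wrong
            fn[g] += 1  # missed the emoji that was actually there
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Unweighted mean of the per-label F1 scores (assumed averaging scheme)."""
    scores = per_label_f1(gold, predicted)
    if not scores:
        raise ValueError("need at least one example")
    return sum(scores.values()) / len(scores)


# Toy example: four tweets, one wrong guess.
gold = ["😂", "😂", "🔥", "💯"]
guesses = ["😂", "🔥", "🔥", "💯"]
score = macro_f1(gold, guesses)
```

Because every emoji counts equally in a macro average, a system can’t score well just by always guessing the most frequent emoji, which is part of why F1 is a more useful yardstick here than raw accuracy.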

So that was a lot of papers and results I just threw at you. What’s the big picture? There are two main points I want you to take away from this post:

  • People mostly use emoji to express emotion. You’ll see people playing around more than that, sure, but by far the most common use is to make sure people know what emotion you’re expressing with a specific message.
  • Emoji, particularly emoji that are used to represent emotions, are predictable given the text of the message. It’s pretty rare for us to actually use emoji to introduce new information, and we generally only do that when we’re using emoji that have a specific, transparent meaning.

If you’re interested in reading more, here are all the papers I mentioned in this post:


Barbieri, F., Ballesteros, M., & Saggion, H. (2017). Are Emojis Predictable? EACL.

Donato, G., & Paggio, P. (2017). Investigating Redundancy in Emoji Use: Study on a Twitter Based Corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (pp. 118-126).

Dürscheid, C., & Siever, C. M. (2017). Beyond the Alphabet–Communication of Emojis. Kurzfassung eines (auf Deutsch) zur Publikation eingereichten Manuskripts.

Gallmann, P. (1985). Graphische Elemente der geschriebenen Sprache. Grundlagen für eine Reform der Orthographie. Tübingen: Niemeyer.

Na’aman, N., Provenza, H., & Montoya, O. (2017). Varying Linguistic Purposes of Emoji in (Twitter) Context. In Proceedings of ACL 2017, Student Research Workshop (pp. 136-141).

Wood, I. & Ruder, S. (2016). Emoji as Emotion Tags for Tweets. Sánchez-Rada, J. F., & Schuller, B (Eds.). In Proceedings of LREC 2016, Workshop on Emotion and Sentiment Analysis (pp. 76-80).

Parity in Utility: One way to think about fairness in machine learning tools

First, a confession: part of the reason I’m writing this blog post today is because I’m having major FOMO on account of having missed #FAT2018, the first annual Conference on Fairness, Accountability, and Transparency. (You can get the proceedings online here, though!) I have been seeing a lot of tweets & good discussion around the conference and it’s gotten me thinking again about something I’ve been chewing on for a while: what does it mean for a machine learning tool to be fair?

If you’re not familiar with the literature, this recent paper by Friedler et al is a really good introduction, although it’s not intended as a review paper. I also review some of it in these slides. Once you’ve dug into the work a little bit, you may notice that a lot of this work is focused on examples where the output of the algorithm is a decision made on the level of the individual: Does this person get a loan or not? Should this resume be passed on to a human recruiter? Should this person receive parole?

Fairness is a balancing act.

While these are deeply important issues, the methods developed to address fairness in these contexts don’t necessarily translate well to evaluating fairness for other applications. I’m thinking specifically about tools like speech recognition or facial recognition for automatic focusing in computer vision: applications where an automatic tool is designed to supplant or augment some sort of human labor. The stakes are lower in these types of applications, but it’s still important that they don’t unintentionally end up working poorly for certain groups of people. This is what I’m calling “parity of utility”.

Parity of utility: A machine learning application which automatically completes a task using human data should not perform reliably worse for members of one or more social groups relevant to the task.

That’s a bit much to throw at you all at once, so let me break down my thinking a little bit more.

  • Machine learning application: This could be a specific algorithm or model, an ensemble of multiple different models working together, or an entire pipeline from data collection to the final output. I’m being intentionally vague here so that this definition can be broadly useful.
  • Automatically completes a task: Again, I’m being intentionally vague here. By “a task” I mean some sort of automated decision based on some input stimulus, especially classification or recognition.
  • Human data: This definition of fairness is based on social groups and is thus restricted to humans. While it may be frustrating that your image labeler is better at recognizing cows than horses, it’s not unfair to horses because they’re not the ones using the system. (And, arguably, because horses have no sense of fairness.)
  • Perform reliably worse: Personally I find a statistically significant difference in performance between groups with a large effect size to be convincing evidence. There are other ways of quantifying difference, like the odds ratio or even the raw accuracy across groups, that may be more suitable depending on your task and standards of evidence.
  • Social groups relevant to the task: This particular fairness framework is concerned with groups rather than individuals. Which groups? That depends. Not every social group is going to be relevant to every task. For instance, your mother language(s) is very relevant for NLP applications, while it’s only important for things like facial recognition in as far as it co-varies with other, visible demographic factors. Which social groups are relevant for what types of human behavior is, thankfully, very well studied, especially in sociology.
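To make the “perform reliably worse” part a little more concrete, here is a small sketch of the kinds of numbers I have in mind: raw accuracy per group, an odds ratio, an effect size and a significance test. The specific choices are mine for illustration (Cohen’s h as the effect size, a two-proportion z-test for significance, a 0.5 continuity correction in the odds ratio), and the group names and counts are invented. Treat it as one reasonable starting point, not the one right way to measure parity of utility.

```python
import math
from dataclasses import dataclass


@dataclass
class GroupResult:
    """How often the system was right for one social group (hypothetical counts)."""
    name: str
    correct: int
    total: int

    @property
    def accuracy(self) -> float:
        return self.correct / self.total


def odds_ratio(a: GroupResult, b: GroupResult) -> float:
    """Odds of a correct output for group a relative to group b.

    Adds 0.5 to every cell so a group with zero errors doesn't divide by zero.
    """
    a_odds = (a.correct + 0.5) / (a.total - a.correct + 0.5)
    b_odds = (b.correct + 0.5) / (b.total - b.correct + 0.5)
    return a_odds / b_odds


def cohens_h(a: GroupResult, b: GroupResult) -> float:
    """Effect size for a difference between two proportions."""
    return 2 * math.asin(math.sqrt(a.accuracy)) - 2 * math.asin(math.sqrt(b.accuracy))


def two_proportion_p_value(a: GroupResult, b: GroupResult) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (a.correct + b.correct) / (a.total + b.total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / a.total + 1 / b.total))
    if se == 0:
        return 1.0  # both groups all right or all wrong: no difference to test
    z = (a.accuracy - b.accuracy) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Invented example: a recognizer evaluated on speakers from two dialect groups.
group_a = GroupResult("dialect A", correct=450, total=500)
group_b = GroupResult("dialect B", correct=350, total=500)

h = cohens_h(group_a, group_b)
p = two_proportion_p_value(group_a, group_b)
ratio = odds_ratio(group_a, group_b)
```

I’d look at the effect size and the p-value together: with enough test data even a tiny gap between groups will come out as statistically significant, and with very little data a large gap might not.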

So how can we turn these very nice-sounding words into numbers? The specifics will depend on your particular task, but here are some examples:

So the tools reviewed in these papers don’t demonstrate parity of utility: they work better for folks from specific groups and worse for folks from other groups. (And, not coincidentally, when we find that systems don’t have parity in utility, they tend to work best for more privileged groups and worse for less privileged groups.)

Parity of Utility vs. Overall Performance

So this is the bit where things get a bit complicated: parity of utility is one way to evaluate a system, but it’s a measure of fairness, not overall system performance. Ideally, a high-performing system should also be a fair system and perform well for all groups. But what if you have a situation where you need to choose between prioritizing a fairer system or one with overall higher performance?

I can’t say that one goal is unilaterally better than the other in all situations. What I can say is that by focusing only on higher performance and not investigating measures of fairness, we risk building systems that have systematically lower performance for some groups. In other words, we can think of an unfair system (under this framework) as one that is overfit to one or more social groups. And, personally, I consider a model overfit to a specific social group while being intended for general use to be flawed.

Parity of Utility is an idea I’ve been kicking around for a while, but it could definitely use some additional polish and wheel-kicking, so feel free to chime in in the comments. I’m especially interested in getting some other perspectives on these questions:

  1. Do you agree that a tool that has parity in utility is “fair”?
  2. What would you need to add (or remove) to have a framework that you would consider fair?
  3. What do you consider an acceptable balance of fairness and overall performance?
  4. Would you prefer to create an unfair system with slightly higher overall performance or a fair system with slightly lower overall performance? Which type of system would you prefer to be a user of? What about if a group you are a member of had reliably lower performance?
  5. Is it more important to ensure that some groups are treated fairly? What about a group that might need a tool for accessibility rather than just convenience?