The Single Most Common Language Technology Mistake (and how to avoid it)

This is an excerpt from my book “The 9 Most Common Language Technology Mistakes (and how to avoid them!)”. If you’d like to grab a copy (plus a handy checklist) you can get it on my Ko-fi. If you’d prefer a video version of the same information, you can check out the livestream I did here.

The single most common mistake I see language technologists make, especially newer ones, is failing to consider the domain of their data.

Mismatched shoes are fun and quirky… mismatched language data domains on the other hand will lead to a LOT of problems.

I’m talking about “domain” here in the linguistic sense, not the networking sense. Basically it means that you are attempting to train a piece of language technology using data that is different from what it will see in production. (This is a special case of out-of-distribution data or, if it occurs due to change over time, distribution shift.) The reason this trips up so many new language technology developers is because of a lack of understanding of what counts as “different”: there may be more factors to consider than you initially realize. Language varies systematically and fairly predictably depending on the situation where it is used. What this means from a machine learning standpoint is that the type of language data used is very strong signal and that systems are very likely to pick up on it during training.


The most obvious type of domain mismatch is topic: what is the text about? If I want to build a system to handle medical text and I only train it with legal text, or customer support chat logs, obviously my system will not be able to handle the specific tokens and expressions that exist in the target medical data but not in the training data.

“Domain” includes a lot more than topic however, and some of the most relevant domain differences are modality, formality, and intended audience.


First up is modality: how was this language originally produced? Was it spoken aloud, written by hand, typed or signed? Each of these is going to have a systematic effect on the language data that you see, even if the topic is more or less the same and they have all been transcribed into the same format. For example, it’s pretty rare to see emojis in handwritten text. You also won’t see spatial referencing in written text or spoken language the way that you do in signed languages.

Even when comparing different inputs–typed as opposed to spoken or dictated–you’ll see pretty big differences. Written text tends to have more rare words while spoken language tends to have more pronouns and a more skewed frequency distribution (i.e. more of the tokens used are very common). This 1998 paper looking at Swedish is a good citation if you’re interested in a deeper dive.


Next up is formality. Often you’ll hear people discussing things like “noisy user generated text”. This means that this text has been produced in an informal setting, like social media, rather than a formal setting like a published paper. (Or even this document!) Informal text has its own patterns that you need to consider. For example, misspellings may sometimes be the result of actual mistakes such as typos but often they encode specific information. This study on French is a great example of this.

You are also more likely to see examples of language use that has been stigmatized in an educational setting. For example, code-switching, or using more than one language or language variety in a single text span. You are also more likely to see different language varieties. In the United States African American English, which is a set of related language varieties predominantly used by the Black community, has been historically extremely stigmatized. As a result you are much more likely to find examples of grammatical structures or words from African American English in informal texts than formal ones (although that is changing and does depend on the specific audience for which the formal text was written). As a result a lot of tools trained on more formal text, like language identifiers, will fail when encountering this particular language variety.

You are also much more likely to see slang in informal text and these slang term generally change very quickly. Consider terms like “based” or “lit” that, depending on when you’re reading this, may already sound extremely dated. On the other hand, very formal text is likely to be more stable over time but also to have its own type of rare tokens, especially jargon.

Intended audience

Speaking of jargon, another common source of variation is due to the intended audience. Consider this in your own life: do you write an email to a close friend the same way that you do to your boss? Even though the format of the text is very similar and it may be on the same topic, who you are producing the language output for has a large effect on what you say and how. And this goes even further than just the specific person you were talking to. Is this language being produced to be broadcast (shared with a large number of people who may not be able to each respond individually to the same degree) or has it been produced as part of a dyadic discussion (with a single other individual)? Broadcast text is likely to assume less background knowledge on the part of the listener whereas dyadic text will assume that you have access to the entire rest of the conversation.

If you have a good idea about what type of language use your system will encounter once it’s in production, you can tailor your training data to include a lot of examples of that type of language use. If you don’t, you are likely to have surprising errors that are extremely difficult to debug since they arise from the data and not the modeling itself.


Large language models cannot replace mental health professionals

I recently posted on Twitter the strong recommendation that individuals do not attempt to use large language models (like BERT or GPT-3) to replace the services of mental health professionals. I’m writing this blog post to specifically address individuals who think that this is a good idea.

CW: discussions of mental illness, medical abuse, self-harm and suicide. If you are in the US and currently experiencing crisis, 988 is now the nationwide number for mental health, substance use and suicidal crises, if you are outside the US this is a list of international suicide hotlines.

Scrabble tiles spelling out “Mental Health”. Wokandapix at Pixabay, CC0, via Wikimedia Commons

First: I want to start by acknowledging that if you are in the position where you are considering doing this, you are probably coming from a well-meaning place. You may know someone who has experienced mental illness or experienced it yourself and you want to help. That is a wonderful impulse and I applaud it. However, attempting to use large language models to replace the services of mental health professionals is *not* better than nothing. In fact, it is worse than nothing and has an extremely high probability of causing harm to the very people that you are trying to help. And this isn’t just my stance. Replacing clinicians with automated systems goes DIRECTLY against the recommendations of clinicians working in clinical applications of ML in mental health:

“ML and NLP should not lead to disempowerment of psychiatrists or replace the clinician-patient pair [emphasis mine].” – Le Glaz A, Haralambous Y, Kim-Dufor DH, Lenca P, Billot R, Ryan TC, Marsh J, DeVylder J, Walter M, Berrouiguet S, Lemey C. Machine Learning and Natural Language Processing in Mental Health: Systematic Review. J Med Internet Res. 2021 May 4;23(5):e15708. doi: 10.2196/15708. PMID: 33944788; PMCID: PMC8132982. 

Let me discuss some of the possible harms in more detail, however, since I have found that often when someone is particularly enthused of using large language models in a highly sensitive application they have not considered the scale of the potential harms or the larger societal impact of that application.

First, large language models are fundamentally unfit for purpose. They are trained on a general selection of language data much of which has been scraped from the internet, and I think most people would agree that “random text from the internet” is a very poor source of mental health advice. If you were to attempt to fine-tune a model for clinical practice, you would need to use clinical notes. These are extremely sensitive personal identifying information. Previous sharing of similar information, specifically text from the Crisis Text Line for training machine learning models, has been resoundingly decried as unethical by both the machine learning and clinical communities. Further, we know that PII included as few as one time in the training data of a large language model can be re-identified via model probing. As a result, any of the extremely sensitive data used to tune the model has the potential of being leaked to end users of the model.

You would also open patients up to other kinds of adversarial attacks. In particular the use of universal triggers by a man-in-the-middle attacker could be used to intentionally serve text to patients that encouraged self harm or suicide. And given the unpredictable nature of the text output from large language models in the first place, it would be impossible to ensure that the model didn’t just do that on its own, regardless of how much prompt engineering was done.

Further, mental health support is not generic. Advice that might be helpful to someone in one situation may be actively harmful to another. For example, someone worried about intrusive thoughts may find it very comforting to be told “you are not alone in your experience”. However, telling someone suffering from paranoia who says they are being followed “you are not alone in your experience” may help to reinforce that belief and further harm their mental health. Encouraging someone suffering from depression to “add some moderate intensity exercise to your daily routine” may be helpful for them. Encouraging someone who is suffering from compulsive exercising to “add some moderate intensity exercise to your daily routine” would be actively harmful. Since large language models are not medical practitioners, I hesitate to call this “malpractice”, however it is clear that there is an enormous potential for harm even when serving seemingly innocuous, general advice.

And while you may be tempted to address this by including diagnosing as part of the system, that in itself offers extremely high potential for harm. Mental health misdiagnosis can be extremely harmful to patients, even when it happens during consultation with a medical services provider. Adding on the veneer of supposed impartiality from an automated system may increase that harm.

And of course, once diagnosis has been made (even if incorrectly) it then becomes personal identifiable information linked with the patient. However, since an automated system is not actually a clinician, even the limited data privacy protections provided for people in the US by something like HIPAA just doesn’t apply. (In fact it’s a big issue even with online services that use human service providers; Better Help in particular has sold Facebook data from therapy sessions.) As a result, this data can be legally sold by third parties like data brokers.  Even if you don’t intend to sell that data, there’s no way for you to ensure that it never will be, including after your death or if whatever legal entity you use to create the service is bought or goes into bankruptcy.

And this potential secondary use of the data, again even if the diagnosis is incorrect, has an enormous potential to harm individuals. In the US it could be used to raise their insurance rates, have them involuntarily committed or placed under an extremely controlling conservatorship (which may strip them of their right to vote or even allow them to be forcibly sterilized, which is still legal in 31 states). Mental illness is extremely stigmatized and creating a process for linking a diagnosis with an individual has extremely high potential for harm.

Attempting to replicate the services of a mental health clinician through the use of LLMs has the potential to harm the individuals who attempt to use that service. And I think there’s a broader lesson here as well: mental illness and mental health issues are fundamentally not an ML engineering problem. While we may use these tools to support clinicians or service providers or public health workers, we can’t replace them. 

Instead think about what it is specifically you want to do and spend your time doing something that will have an immediate, direct impact. If your goal is to support someone you love, support them directly. If you want to help your broader community, join existing organizations doing the work, like local crisis centers. If you need help, seek it out. If you are in crisis, seek immediate support. If you are merely looking for an interesting engineering problem, however, look elsewhere.

The trouble with sentiment analysis

Two things spurred me to write this post. First, I’d given the same advice three times which, according to David Robinson‘s rule, meant it was time. And, second, this news story on a startup that claims that they can detect student emotions over Zoom. With those things in mind, here is my very simple guidance on sentiment analysis:

You should almost never do sentiment analysis.

A picture of a stop sign against a blue sky. There are two wind turbines in the background. Photo by lamoix, CC BY 2.0, via Wikimedia Commons

Thanks for reading, hope that cleared things up. 🙂 In all seriousness, though, the places where it makes sense for a data scientist or NLP practitioner working in industry to use sentiment analysis are vanishingly rare. First, because it doesn’t work very well and second, because even when it does work it’s usually measuring the wrong thing.

What do I mean that sentiment analysis doesn’t work very well?

Let’s consider the most common approach. You have a list of words that are “positive” and a list of words that “negative” (or, more rarely, a list of words associated with different emotions) and you count how many words from each list appear in your target text. If there are more positive words you assign a positive sentiment, more negative words a negative one and if they are equal or (much more likely) none of the words on either list show up you assign a neutral sentiment. Of course, there are a variety of other, more sophisticated approaches, but this is the most common one.

As you can see, there’s a lot you’ll miss using this approach. The construction of the lists may not have been done with the target text producers in mind, for example. A list from five years ago may possibly have “lit” as a word with positive sentiment, but what about “based”? If you’re attempting to characterize young internet users you’ll need to be very careful with your sentiment lists.

And a word-based approach cannot account for context. “Our flight was late but Cindy at O’Hare managed to change our connection”, for example, may have gotten a negative sentiment assigned to it due to “late” (assuming you’re working within the transportation domain) but in context this is actually a pretty positive review.

Plus, of course, sarcasm will be completely missed by such an approach. If someone says “lovely weather outside” in the middle of a tornado warning then you, as a human, know that they probably aren’t very happy about the weather. Which ties in to my next point: the question of what you’re measuring.

The specific thing you’re attempting to measure using sentiment analysis is the sentiment expressed in the text, but often you’ll see folks (generally tacitly) make a leap to assuming that what you’re measuring is who someone actually felt when they were writing the text. That’s pretty clearly not the case: I can write “I’m absolutely livid” and “I feel ecstatic joy” without experiencing those emotions and I’d expect most sentiment analysis tools would give those statements very strong negative and positive sentiments, respectively.

This is important because, generally, people tend to care more about what people are feeling than what they’re expressing in text. (A good counterexample would be a digital humanities project showing the sentiment of different passages in a novel.) And figuring out someone’s emotions from text is much, much more difficult and in most cases completely utterly impossible. And speaking about what people are feeling….

Sentiment isn’t generally actually that useful

So given that sentiment analysis of text isn’t likely to tell you what people are feeling with any fidelity, where would you want to use it? A great question, and one that I think folks should ask more often. Usually when I see it being used, it’s in a case where there’s another, actually more useful, thing you want to know. Let’s look at some examples.

  • In a chatbot to know if the conversation is going well
    • Since most words are neutral and most turns are pretty short, you’re pretty unlikely to get helpful information unless things have already gone very, very wrong (as Zeerak Waseem points out,you can look for swearing). A far better thing to measure would be patterns that you shouldn’t see in efficient conversations, like lots of repetition or slightly rephrasing.
  • To determine if reviews are good or bad
    • This one in particular I find baffling: most reviews are associated with a star rating, which is a clear measure directly from the person. A better use of time would probably be to do topic modelling or some other sort of unsupervised text analysis to see if there are persistent issues that should be addressed.
  • To predict customer churn based on call center logs
    • If you have the raw input text and the labelled outcomes (churned or not) then I’d just build a raw classifier. You’re also likely to get more mileage out of other metadata features, like how often they’ve contacted support, number of support tickets filed or something similar. Someone can be very polite to a customer service rep and still churn because the product just doesn’t meet their needs.

In all of these cases “what is the sentiment being expressed in the text” just isn’t the useful question. Sure, it’s quick enough to do a sentiment analysis that you might add it as a new feature to just see if it adds anything… but I’d say your time would be better spent elsewhere.

So why do people still use sentiment analysis?

Great question. Probably one reason is that its often used as an example in teaching. It has a pretty simple intuition, there are lots of existing tools for it and can it help students develop intuitions about corpus methods. That means that a *lot * of people have done sentiment analysis already at some point, and it’s much simpler to use a method you are already familiar with. (I get it, I do it too.)

And another, more pernicious reason, is that it’s harder to define a new problem (for which a tool or measure might not exist) than it is to redefine it as an existing one where an existing, simple-to-use tool is available. Even if that redefinition means that the work is no longer actually all that useful.

So, with all that said, the next time you’re thinking that you need to do sentiment analysis I’d encourage you to spend some time really considering if you’re sure before you decide to dive in.

An emoji dance notation system for TikTok dance tutorials 👀💃

This blog post is more of a quick record of my thoughts than a full in-depth analysis, because when I saw this I immediately wanted to start writing about it. Basically, TikTok is a social media app for short form video (RIP Vine, forever in our hearts) and one of the most popular genres of content is short dances; you may already be familiar with the concept.

HOWEVER, what’s particularly intriguing to me is this sort of video here, where someone creates a tutorial for a specific dance and includes an emoji-based dance notation:

Example of a dance with an emoji notation system by

Back in grad school, when I was studying signed languages, I probably spent more time than I should have reading about writing systems for signed languages and also dance notations. To roughly sum up an entire field of study: representing movements of the human body in time and space using a writing system, or even a more specialized notation, is extremely difficult. There are a LOT of other notations out there, and you probably haven’t run into them for a reason: they’re complex, hard to learn, necessarily miss nuances and are a bit redundant given that the vast majority of dance is learned through watching & copying movement. Probably the most well-known type of dance notation is for ballroom dance where the footwork patterns are represented on the floor using images of footsteps, like so:

Langsamer Walzer Grundschritt

I think part of the reason that this notation in particular tends to work well is that it’s completely iconic: the image of a shoe print is where your shoe print should go. It also captures a large part of the relevant information; the upper body position can be inferred from the position of the feet (and in many cases will more or less remain the same throughout).

I think that’s true to some degree of these emoji notations as well. The fact that they work at all may be arising in part due to the constraints of the TikTok dance genre. In most TikTok dances, the dancer faces in a single direction for the dance, there is minimal movement around the space and the feet move minimally if at all. The performance format itself helps as well: the videos are short and easy to repeat, and you can still see the movements being preformed in full with the notation being used as a shorthand.

And it’s clear that this use of style of notation isn’t idiosyncratic; this compilation has a variety of tutorials from different creators that use variations on the same style of notation.

A selection of tiktok dance tutorials, some of which include emoji notation

Some of the types of ways emoji are used here are similar to the ways that things like Stokoe notation are, to indicate handshape and movement (although not location). A few other types of ways that emoji are used that stick out:

  • Articulator (hands with handshape, peach emoji for the hips)
  • Manner of articulation/movement (“explosive”, a specific number of repetitions, direction of movement using arrows)
  • Iconic representation of a movement using an object (helicopter = hands make the motion of helicopter blades, mermaid = bodywaves, as if a mermaid swimming)
  • Iconic representation of a shape to be traced (a house emoji = tracing a house shape with the hands, heart = trace a heart shape)
  • (Not emoji) Written shorthand for a (presumably) already known dance, for example “WOAH” for the woah

To sum up: I think this is a cool idea, it’s an interesting new type of dance notation that is clearly useful to a specific artistic community. It’s also another really good piece of evidence in the bucket of “emoji are gestures”: these are clearly not a linguistic system and are used in a variety of ways by different users that don’t seem entirely systematic.

Buuuut there’s also the way that the emojis are groups into phrases for a specific set of related motions, which smells like some sort of shallow parsing even if it’s not a full consistency structure, and I’d say that’s definitely linguistic-ish. I think I’d need to spend more time on analysis to have any more firmly held opinion than that.

Datasets for data cleaning practice

Looking for datasets to practice data cleaning or preprocessing on? Look no further!

Each of these datasets needs a little bit of TLC before it’s ready for different analysis techniques. For each dataset, I’ve included a link to where you can access it, a brief description of what’s in it, and an “issues” section describing what needs to be done or fixed in order for it to fit easily into a data analysis pipeline.

Big thanks to everyone in this Twitter thread who helped me out by pointing me towards these datasets and letting me know what sort of pre-processing each needed. There were also some other data sources I didn’t include here, so check it out if you need more practice data. And feel free to comment with links to other datasets that would make good data cleaning practice! 🙂

List of datasets:

  • Hourly Weather Surface – Brazil (Southeast region)
  • PhyloTree Data
  • International Comprehensive Ocean-Atmosphere Data Set
  • CLEANEVAL: Development dataset
  • London Air
  • Production and Perception of Linguistic Voice Quality
  • Australian Marriage Law Postal Survey, 2017
  • The Metropolitan Museum of Art Open Access
  • National Drug Code Directory
  • Flourish OA
  • WikiPlots
  • Register of UK Parliament Members’ Financial Interests
  • NYC Gifted & Talented Scores

Hourly Weather Surface – Brazil (Southeast region)

It’s covers hourly weather data from 122 weathers stations of southeast region (Brazil). The southeast include the states of Rio de Janeiro, São Paulo, Minas Gerais e Espirito Santo. Dataset Source: INMET (National Meteorological Institute – Brazil).

Issues: Can you predict the amount of rain? Temperature? NOTE: Not all weather stations started operating since 2000

PhyloTree Data

Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Human mitochondrial DNA is widely used as tool in many fields including evolutionary anthropology and population history, medical genetics, genetic genealogy, and forensic science. Many applications require detailed knowledge about the phylogenetic relationship of mtDNA variants. Although the phylogenetic resolution of global human mtDNA diversity has greatly improved as a result of increasing sequencing efforts of complete mtDNA genomes, an updated overall mtDNA tree is currently not available. In order to facilitate a better use of known mtDNA variation, we have constructed an updated comprehensive phylogeny of global human mtDNA variation, based on both coding‐ and control region mutations. This complete mtDNA tree includes previously published as well as newly identified haplogroups.

Issues: This data would be more useful if it were in the Newick tree format and could be read in using the read.newick() function. Can you help get the data in this format?

International Comprehensive Ocean-Atmosphere Data Set

The International Comprehensive Ocean-Atmosphere Data Set (ICOADS) offers surface marine data spanning the past three centuries, and simple gridded monthly summary products for 2° latitude x 2° longitude boxes back to 1800 (and 1°x1° boxes since 1960)—these data and products are freely distributed worldwide. As it contains observations from many different observing systems encompassing the evolution of measurement technology over hundreds of years, ICOADS is probably the most complete and heterogeneous collection of surface marine data in existence.

Issues: The ICOADS contains O(500M) meteorological observations from ~1650 onwards. Issues include bad observation values, mis-positioned data, missing date/time information, supplemental data in a variety of formats, duplicates etc.

CLEANEVAL: Development dataset

CLEANEVAL is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus, for linguistic and language technology research and development. There are three versions of each file: original, pre-processed, and manually cleaned. All files of each kind are gathered in a directory. The file number remains the same for the three versions of the same file.

Issues: Your task is to “clean up” a set of webpages so that their contents can be easily used for further linguistic processing and analysis. In short, this implies:

  • removing all HTML/Javascript code and “boilerplate” (headers, copyright notices, link lists, materials repeated across most pages of a site, etc.);
  • adding a basic encoding of the structure of the page using a minimal set of symbols to mark the beginning of headers, paragraphs and list elements.

London Air

The London Air Quality Network (LAQN) is run by the Environmental Research Group of King’s College London. LAQN stands for the London Air Quality Network which was formed in 1993 to coordinate and improve air pollution monitoring in London. The network collects air pollution data from London boroughs, with each one funding monitoring in its own area. Increasingly, this information is being supplemented with measurements from local authorities surrounding London in Essex, Kent and Surrey, thereby providing an overall perspective of air pollution in South East England, as well as a greater understanding of pollution in London itself.

Issues: Lots of gaps (null/zero handling), outliers, date handling, pivots and time aggregation needed first!


Candy hierarchy data for 2017 Boing Boing Halloween candy hierarchy. This is survey data from this survey.

Issues: If you want to look for longitudinal effects, you also have access to previous datasets. Unfortunate quirks in the data include the fact that the 2014 data is not the raw set (can’t seem to find it), and in 2015, the candy preference was queried without the MEH option.

Production and Perception of Linguistic Voice Quality

Data from the “Production and Perception of Linguistic Voice Quality” project at UCLA. This project was funded by NSF grant BCS-0720304 to Prof. Pat Keating, with Prof. Abeer Alwan, Prof. Jody Kreiman of UCLA, and Prof. Christina Esposito of Macalester College, for 2007-2012.

The data includes spreadsheet files with measures gathered using Voicesauce (Shue, Keating, Vicenik & Yu 2011) for both acoustic measures and EGG measures. The accompanying readme file provides information on the various coding used in both spreadsheets.

Issues: The following issues are with the acoustics measures spreadsheet specifically.

  1. xlsx format with meaningful color coding created by a VBA script (which is copy-pasted into the second sheet)
  2. partially wide format instead of long/tidy, with a ton of columns split into different timepoints
  3. line 6461 has another set of column headers rather than data for some of the columns starting with “shrF0_mean”. I think this was a copy-paste error. Hopefully it doesn’t mean that all of the data below that row is shifted down by 1!

Australian Marriage Law Postal Survey, 2017

Response: Should the law be changed to allow same-sex couples to marry?

Of the eligible Australians who expressed a view on this question, the majority indicated that the law should be changed to allow same-sex couples to marry, with 7,817,247 (61.6%) responding Yes and 4,873,987 (38.4%) responding No. Nearly 8 out of 10 eligible Australians (79.5%) expressed their view.

All states and territories recorded a majority Yes response. 133 of the 150 Federal Electoral Divisions recorded a majority Yes response, and 17 of the 150 Federal Electoral Divisions recorded a majority No response.

Issues: Miles McBain discusses his approach to cleaning this dataset in depth in this blog post.

The Metropolitan Museum of Art Open Access

The Metropolitan Museum of Art provides select datasets of information on more than 420,000 artworks in its Collection for unrestricted commercial and noncommercial use. To the extent possible under law, The Metropolitan Museum of Art has waived all copyright and related or neighboring rights to this dataset using Creative Commons Zero. This work is published from: The United States Of America. You can also find the text of the CC Zero deed in the file LICENSE in this repository. These select datasets are now available for use in any media without permission or fee; they also include identifying data for artworks under copyright. The datasets support the search, use, and interaction with the Museum’s collection.

Issues: Missing values, inconsistent information, missing documentation, possible duplication, mixed text and numeric data.

National Drug Code Directory

The Drug Listing Act of 1972 requires registered drug establishments to provide the Food and Drug Administration (FDA) with a current list of all drugs manufactured, prepared, propagated, compounded, or processed by it for commercial distribution. (See Section 510 of the Federal Food, Drug, and Cosmetic Act (Act) (21 U.S.C. § 360)). Drug products are identified and reported using a unique, three-segment number, called the National Drug Code (NDC), which serves as a universal product identifier for drugs. FDA publishes the listed NDC numbers and the information submitted as part of the listing information in the NDC Directory which is updated daily.

The information submitted as part of the listing process, the NDC number, and the NDC Directory are used in the implementation and enforcement of the Act.

Issue: Non-trivial duplication (which drugs are different names for the same things?).

Flourish OA

Our data comes from a variety of sources, including researchers, web scraping, and the publishers themselves. All data is cleaned and reviewed to ensure its validity and integrity. Our catalog expands regularly, as does the number of features our data contains. We strive to maintain the most complete and sophisticated store of Open Access data in the world, and it is this mission that drives our continued work and expansion.

A dataset on journal/publisher information that is a bit dirty and might make for great practice. It’s been a graduate student/community project:

Issues: Scraped data, has some missing fields, possible duplication and some encoding issues (possibly multiple character encodings).



The WikiPlots corpus is a collection of 112,936 story plots extracted from English language Wikipedia. These stories are extracted from any English language article that contains a sub-header that contains the word “plot” (e.g., “Plot”, “Plot Summary”, etc.).

This repository contains code and instructions for how to recreate the WikiPlots corpus.

The dataset itself can be downloaded from here: (updated: 09/26/2017). The zip file contains two files:

  • plots: a text file containing all story plots. Each story plot is given with one sentence per line. Each story is followed by on a line by itself.
  • titles: a text file containing a list of titles for each article in which a story plot was found and extracted.

Issues: Some lines may be cut off due to abbreviations. Some plots may be incomplete or contain surrounding irrelevant information.

Register of UK Parliament Members’ Financial Interests

The main purpose of the Register is to provide information about any financial interest which a Member has, or any benefit which he or she receives, which others might reasonably consider to influence his or her actions or words as a Member of Parliament.

Members must register any change to their registrable interests within 28 days. The rules are set out in detail in the Guide to the Rules relating to the Conduct of Members, as approved by the House on 17 March 2015. Interests which arose before 7 May 2015 are registered in accordance with earlier rules.

The Register is maintained by the Parliamentary for Commissioner for Standards. It is updated fortnightly online when the House is sitting, and less frequently at other times. Interests remain on the Register for twelve months after they have expired.

Issues: Each member’s transactions are on a separate webpage with a different text format, with contributions listed under different headings (not necessarily one per line) and in different formats. Will take quite a bit of careful preprocessing to get into CSV or JSON format.

NYC Gifted & Talented Scores

Couple of messy but easy data sets: NYC parents reporting their kids’ scores on the gifted and talented exam, as well as school priority ranking. Some enter the percentiles as point scores, some skip all together, no standard preference format, etc. Also birth quarter affects percentiles.

Data science & kitchen gadgets

One of the things I really enjoy about my current job is chatting with other data science folks. Almost inevitably in the course of these conversations, the old “Python vs. R” debate comes up.

For those of you who aren’t familiar, Python and R are both programming languages often used by data scientists and other folks who work with data. Python is a general-purpose programming language (originally designed as a teaching language) that has some popular packages used for data analysis. R is a computer language specifically designed for doing statistics and visualization. They’re both useful languages, but R is much more specialized.

I use both Python & R, but I tend to prefer R for data analysis and vitalization. I also love kitchen gadgets. (I own and routinely use a melon baller, albeit only very rarely for actually balling melon.) My hypothesis is that my preference for R and love of kitchen gadgets share the same underlying cause: I really like specialized tools.

I was curious to see if there was a similar relationship for other people, so I reached out to my Twitter followers with a simple two-question poll:

Do you prefer Python or R?

  • Python
  • R

How do you feel about specialized kitchen gadgets (e.g. veggie peelers, egg slicers, specialized knives).

  • Hate ’em
  • Love ’em

185 people filled out the poll (if you were one of them, thanks!). Unfortunately for my hypothesis, a quick analysis of the results revealed no evidence that there was any relationship between whether someone prefers Python or R and if they like kitchen gadgets. You can check out the big ol’ null result for yourself:

Regardless of how poorly this experiment illustrates my point, however, it still stands: R is a specialized tool, while Python is general purpose one.

I like to think of R as a bread knife and Python as a pocket knife. It’s much easier to slice bread with a bread knife, but sometimes it’s more convenient to use a pocket knife if you already have it to hand.

If you spend a lot of time cleaning and analyzing data that’s already in a tabular format or doing statistical analysis, you might consider checking out R. It’s certainly saved me a lot of time. (Oh, and I juuuust so happen to have a couple of short R tutorials for folks with little to no programming background.)

Can your use of capitalization reveal your political affiliation?

This week, I’m in Vancouver this week for the meeting of the Association for Computational Linguistics. (On the subject of conferences, don’t forget that my offer to help linguistics students from underrepresented minorities with the cost of conferences still stands!) The work I’m presenting is on a new research direction I’m pursuing and I wanted to share it with y’all!

If you’ve read some of my other posts on sociolinguistics, you may remember that the one of its central ideas is that certain types of language usage pattern together with aspects of people’s social identities. In the US, for example, calling a group of people “yinz” is associated with being from Pittsburgh. Or in Spanish, replacing certain “s” sounds with “th” sounds is associated with being from northern or central Spain. When a particular linguistic form is associated with a specific part of someone’s social identity, we call that a “sociolinguistic variable”

There’s been a lot of work on the type of sociolinguistic variables people use when they’re speaking, but there’s been less work on what people do when they’re writing. And this does make a certain amount of sense: many sociolinguistic variables are either 1) something people aren’t aware they’re doing or 2) something that they’re aware they’re doing but might not consider “proper”. As a result, they tend not to show up in formal writing.

This is where the computational linguistics part comes in; people do a lot of informal writing on computers, especially on the internet. In fact, I’d wager that humans are producing more text now than at any other point in history, and a lot of it is produced in public places. That lets us look for sociolinguistics variables in writing in a way that wasn’t really possible before.

Which is a whole lot of background to be able to say: I’m looking at how punctuation and capitalization pattern with political affiliation on Twitter.

Political affiliation is something that other sociolinguists have definitely looked at. It’s also something that’s very, very noticeable on Twitter these days. This is actually a boon to this type of research. One of the hard things about doing research on Twitter is that you don’t always necessarily know someone’s social identity. And if you use a linguistic feature to try to figure out their identity when what you’re interested in is linguistic features, you quickly end up with the problem of circular logic.

Accounts which are politically active, however, will often explicitly state their political affiliation in their Twitter bio. And I used that information to get tweets from people I was very sure had a specific political affiliation.

For this project, I looked at people who use the hashtags #MAGA and #theResistance in their Twitter bios. The former is an initialism for “Make America Great Again” and is used by politically conservative folks who support President Trump. The latter is used by political liberal folks who are explicitly opposed to President Trump. These two groups not only have different political identities, but also are directly opposed to each other. This means there’s good reason to believe that they will use language in different ways that reflect that identity.

But what about the linguistic half of the equation? Punctuation and capitalization are especially interesting to me because they seem to be capturing some of the same information we might find in prosody or intonation in spoken language. Things like YELLING or…pausing….or… uncertainty?  They’re also much, much easier to measure punctuation than intonation, which is notoriously difficult and time-consuming to annotate.  At the same time, I have good evidence that how you use punctuation and capitalization has some social meaning. Check out this tweet, for example:

As this tweet shows, putting a capital letter at the beginning of a tweet is anything but “aloof and uninterested yet woke and humorous”.

So, if punctuation and capitalization are doing something socially, is part of what they’re doing expressing political affiliation?

That’s what I looked into. I grabbed up to 100 tweets each from accounts which used either #MAGA or #theResistance in their Twitter bios. Then I looked at how much punctuation and capitalization users from these two groups used in their tweets.


First, I looked at all punctuation marks. I did find that, on average, liberal users tended to use less punctuation. But when I took a closer look at the data, an interesting pattern emerged. In both the liberal and conservative groups, there were two clusters of users: those who used a lot of punctuation and those who used almost none.

Politically liberal users on average tended to use less punctuation than politically conservative users, but in both groups there’s really two sets of users: those who use a lot of punctuation and those who use basically  none. There just happen to be more of the latter in #theResistance.

What gives rise to these two clusters? I honestly don’t know, but I do have a hypothesis. I think that there’s  probably a second social variable in this data that I wasn’t able to control for. It seems likely that the user’s age might have something to do with it, or their education level, or even whether they use thier Twitter account for professional or personal communication.


My intuition that there’s a second latent variable at work in this data is even stronger given the results for the amount of capitalization folks used. Conservative users tended to use more capitalization than the average liberal user, but there was a really strong bi-modal distribution for the liberal accounts.

Again, we see that conservative accounts use more of the marker (in this case capitalization), but that there’s a strong bi-modal distribution in the liberal users’ data.

What’s more, the liberal accounts that used a lot of punctuation also tended to use a lot of capitalization. Since these features are both ones that I associate with very “proper” usage (things like always starting a tweet with a capital letter, and ending it with a period) this seems to suggest that some liberal accounts are very standardized in their use of language, while others reject at least some of those standards.

So what’s the answer the question I posed in the title? Can capitalization or punctuation reveal political affiliation? For now, I’m going to go with a solid “maybe”. Users who use very little capitalization and punctuation are more likely to be liberal… but so are users who use a lot of both. And, while I’m on the subject of caveats, keep in mind that I was only looking at very politically active accounts who discuss thier politics in their user bios.  These observations probably don’t apply to all Twitter accounts (and certainly not across different languages).

If you’re interested in reading more, you can check out the fancy-pants versions of this research here and here.  And I definitely intend to consider looking at this; I’ll keep y’all posted on my findings. For now, however, off to find me a Nanimo bar!

Contest announcement! Making noise and going places ✈️🛄

I recently wrote the acknowledgements section my dissertation and it really put into perspective how much help I’ve received during my degree. I’ve decided to pass some of that on by helping out others! Specifically, I’ve decided to help make travelling to conferences a little more affordable for linguistics students who are from underrepresented minorities (African American, American Indian/Alaska Native, or Latin@), LGBT or have a disability.

Biologist speaking at the Friday morning Town Hall session, where attendees were welcome to discuss their ideas on how to further landscape conservation. (5471417317)

To enter:

Entry is open to any student (graduate or undergraduate) studying any aspect of language (broadly defined) who is from an underrepresented minority (African American, American Indian/Alaska Native, or Latin@), LGBT or has a disability.  E-mail me and attach:

  • An abstract or paper that has been accepted at an upcoming (i.e. starting after June 23, 2017) conference
  • The acceptance letter/email from the conference
  • A short biography/description of your work

One entry per person, please!


I’ll pick up to two entries. Each winner will receive 100 American dollars to help them with costs associated with the conference, and I’ll write a blog post highlighting each winner’s research.

Contest closes July 31I’ll contact winners by July 5

Good luck!


What is computational sociolinguistics? (And who’s doing it?)

If you follow me on Twitter (@rctatman) you probably already know that I defended my dissertation last week. That’s right: I’m now officially Dr. Tatman! [party horn emoji]

I’ve spent a lot of time focusing on all the minutia of writing a dissertation lately, from formatting references to correcting a lot of typos (my committee members are all heroes). As a result, I’m more than ready to zoom out and think about big-picture stuff for a little while. And, in academia at least, pictures don’t get much bigger than whole disciplines. Which brings me to the title of this blog post: computational sociolinguistics. I’ve talked about my different research projects quite a bit on this blog (and I’ve got a couple more projects coming up that I’m excited to share with y’all!) but they can seem a little bit scattered. What do patterns of emoji use have to do with how well speech recognition systems deal with different dialects with how people’s political affiliation is reflected in their punctuation use? The answer is that they all fall within the same discipline: computational sociolingustics.

Computational sociolinguistics is a fairly new field that lies at the intersection of two other, more established fields: computational linguistics and sociolinguistics. You’re actually probably already familiar with at least some of the work being done in computational linguistics and its sister field of Natural Language Processing (commonly called NLP). The technologies that allow us to interact with computers or phones using human language, rather than binary 1’s and 0’s, are the result of decades of research in these fields. Everything from spell check, to search engines that know that “puppy” and “dog” are related topics, to automatic translation are the result of researchers working in computational linguistics and NLP.

Sociolinguistics is another well-established field, which focuses on the effects of social context on language how we use language and understand. “Social context”, in this case, can be everything from someone’s identity–like their gender or where they’re from–to the specific linguistic situation they’re in, like how much they like the person they’re talking to or whether or not they think they can be overheard. While a lot of work in sociolinguistics is more qualitative, describing observations without a lot of exact measures, of it is also quantitative.

So what happens when you squish these to fields together? For me, the result is work that focuses on research questions that would be more likely to be asked by sociolinguistics, but using methods from computational linguistics and NLP. It also means asking sociolinguistic questions about how we use language in computational context, drawing on the established research fields of Computer Mediated Communication (CMC), Computational Social Science (CSS) and corpus linguistics, but with a stronger focus on sociolingusitics.

One difficult thing about working in a very new field, however, is that it doesn’t have the established social infrastructure that older fields do. If you do variationist sociolinguistics, for example, there’s an established conference (New Ways of Analyzing Variation, or NWAV) and journals (Language Variation and Change, American Speech, the Journal of Sociolinguistics). Older fields also have an established set of social norms. For instance, conferences are considered more prestigious research venues in computational linguistics, while for sociolinguistics journal publications are usually preferred. But computational sociolinguistics doesn’t really have any of that yet. There also isn’t an established research canon, or any textbooks, or a set of studies that you can assume most people in the field have had exposure to (with the possible exception of Dong et al.’s really fabulous survey article). This is exciting, but also a little bit scary, and really frustrating if you want to learn more about it. Science is about the communities that do it as much as it is about the thing that you’re investigating, and as it stands there’s not really an established formal computational sociolinguistics community that you can join.

Fortunately, I’ve got your back. Below, I’ve collected a list of a few of the scholars whose work I’d consider to be computational sociolinguistics along with small snippets of how they describe their work on their personal websites. This isn’t a complete list, by any means, but it’s a good start and should help you begin to learn a little bit more about this young discipline.

  • Jacob Eisenstein at Georgia Tech
    • “My research combines machine learning and linguistics to build natural language processing systems that are robust to contextual variation and offer new insights about social phenomena.”
  • Jack Grieve at the University of Birmingham
    • “My research focuses on the quantitative analysis of language variation and change. I am especially interested in grammatical and lexical variation in the English language across time and space and the development of new methods for collecting and analysing large corpora of natural language.”
  • Dirk Hovy at the University of Copenhagen
  • Michelle A. McSweeney at Columbia
    • “My research can be summed up by the question: How do we convey tone in text messaging? In face-to-face conversations, we rely on vocal cues, facial expressions, and other non-linguistic features to convey meaning. These features are absent in text messaging, yet digital communication technologies (text messaging, email, etc.) have entered nearly every domain of modern life. I identify the features that facilitate successful communication on these platforms and understand how the availability of digital technologies (i.e., mobile phones) has helped to shape urban spaces.”
  • Dong Nguyen at the University of Edinburgh & Alan Turing Institute
    • “I’m interested in Natural Language Processing and Information Retrieval, and in particular computational text analysis for research questions from the social sciences and humanities. I especially enjoy working with social media data.”
  • Sravana Reddy at Wellesley
    • “My recent interests center around computational sociolinguistics and the intersection between natural language processing and privacy. In the past, I have worked on unsupervised learning, pronunciation modeling, and applications of NLP to creative language.”
  • Tyler Schnoebelen at Decoded AI Consulting
    • “I’m interested in how people make meaning with language. My PhD is from Stanford’s Department of Linguistics (my dissertation was on language and emotion). I’ve also founded a natural language processing start-up (four years), did UX research at Microsoft (ten years) and ran the features desk of a national newspaper in Pakistan (one year).”
  • (Ph.D. Candidate) Philippa Shoemark at the University of Edinburgh
    • “My research interests span computational linguistics, natural language processing, cognitive modelling, and complex networks. I am currently using computational methods to identify social and individual factors that condition linguistic variation and change on social media, under the supervision of Sharon Goldwater and James Kirby.”
  • (Ph.D. Candidate) Sean Simpson at Georgetown University
    • “My interests include computa­tional socio­linguistics, socio­phonetics, language variation, and conservation & revitalization of endangered languages. My dissertation focuses on the incorporation of sociophonetic detail into automated speaker profiling systems.”

Should English be the official language of the United States?

There is currently a bill in the US House to make English the official language of the United States. These bills have been around for a while now. H.R. 997, also known as the “The English Language Unity Act”, was first proposed in 2003. The companion bill, last seen as S. 678 in the 114th congress, was first introduced to the Senate as S. 991 in 2009, and if history is any guide will be introduced again this session.

So if these bills have been around for a bit, why am I just talking about them now? Two reasons. First, I had a really good conversation about this with someone on Twitter the other day and I thought it would be useful to discuss  this in more depth. Second, I’ve been seeing some claims that President Trump made English the official language of the U.S. (he didn’t), so I thought it would be timely to discuss why I think that’s such a bad idea.

As both a linguist and a citizen, I do not think that English should be the official language of the United States.

In fact, I don’t think the US should have any official language. Why? Two main reasons:

  • Historically, language legislation at a national level has… not gone well for other countries.
  • Picking one official language ignores the historical and current linguistic diversity of the United States.

Let’s start with how passing legislation making one or more languages official has gone for other countries. I’m going to stick with just two, Canada and Belgium, but please feel free to discuss others in the comments.


Unlike the US, Canada does have an official language. In fact, thanks to a  1969 law, they have two: English and French. If you’ve ever been to Canada, you know that road signs are all in both English and French.

This law was passed in the wake of turmoil in Quebec sparked by a Montreal school board’s decision to teach all first grade classes in French, much to the displeasure of the English-speaking residents of St. Leonard. Quebec later passed Bill 101 in 1977, making French the only official language of the province. One commenter on this article by the Canadian Broadcasting Corporation called this “the most divisive law in Canadian history”.

Language legislation and its enforcement in Canada has been particularity problematic for businesses. On one occasion, an Italian restaurant faced an investigation for using the word “pasta” on thier menu, instead of the French “pâtes”. Multiple retailers have faced prosecution at the hands of the Office Québécois de la langue Française for failing to have retail websites available in both English and French. A Montreal boutique narrowly avoided a large fine for making Facebook posts only in English. There’s even an official list of English words that Quebec Francophones aren’t supposed to use. While I certainly support bilingualism, personally I would be less than happy to see the same level of government involvement in language use in the US.

In addition, having only French and English as the official languages of Canada leave out a  very important group: aboriginal language users. There are over 60 different indigenous languages used in Canada used by over 213 thousand speakers. And even those don’t make up the majority of languages spoken in Canada: there are over 200 different languages used in Canada and 20% of the population speaks neither English nor French at home.


Another country with a very storied past in terms of language legislation is Belgium. The linguistic situation in Belgium is very complex (there’s a more in-depth discussion here), but the general gist is that there are three languages used in different parts of the country. Dutch is used in the north, French is the south, and German in parts of the east. There is also a deep cultural divide between these regions, which language legislation has long served as a proxy for. There have been no fewer than eight separate national laws passed restricting when and where each language can be used. In 1970, four distinct language regions were codified in the Belgium constitution. You can use whatever language you want in private but there are restrictions on what language you can use for government business, in court, in education and employment.  While you might think that would put a rest to legislation on language, tensions have continued to be high. In 2013, for instance, the European Court of Justice overturned a Flemish law that contracts written in Flanders had to be in Dutch to be binding after a contractor working on an English contract was fired. Again, this represents a greater level of official involvement with language use than I’m personally comfortable with.

I want to be clear: I don’t think multi-lingualism is a problem. As a linguist, I value every language and I also recognize that bilingualism offers significant cognitive benefits. My problem is with legislating which languages should be used in a multi-lingual situation; it tends to lead to avoidable strife.

The US

Ok, you might be thinking, but in the US we really are just an English-speaking country! We wouldn’t have that same problem here. Weeeeelllllll….

Tree map of languages in the United States
Languages of the United States, by speakers, based on data provided by the Modern Languages Association, which is in turn based on 2010 census data.

The truth is, the US is very multilingual. We have a Language Diversity Index of .353, according to the UN. That means that, if you randomly picked two people from the United States population, the chance that they’d speak two different languages is over 35%. That’s far higher than a lot of other English-speaking countries. The UK clocks in at .139,  while New Zealand and Australia are at .102 and .126, respectively. (For the record, Canada is at .549 and Belgium .734.)

The number of different languages spoken in the US is also remarkable. In New York City alone there may be speakers of as many as 800 different languages, which would make it one of the most linguistically-diverse places in the world; like the Amazon rain-forest of languages. In King County, where I live now, there are over 170 different languages spoken, with the most common being Spanish, Chinese, Vietnamese and Amharic. And that linguistic diversity is reflected in the education system: there are almost 5 million students in the US Education system who are learning English, nearly 1 out of 10 students.

Multilingualism in the United States is nothing new, either: it’s been a part of the American experience since long before there was an America. Of course, there continue to be many speakers of indigenous languages in the United States, including Hawaiian (keep in mind that Hawaii did not actually want to become a state). But even within just European languages, English has never had sole dominion. Spanish has been spoken in Florida since the 1500’s. At the time of the signing of the Deceleration of Independence, an estimated 10% of the citizens of the newly-founded US spoke German (although the idea that it almost became the official language of the US is a myth). New York city? Used to be called New Amsterdam, and Dutch was spoken there into the 1900’s. Even the troops fighting the revolutionary war were doing so in at least five languages.

Making English the official language of the United States would ignore the rich linguistic history and the current linguistic diversity of this country. And, if other countries’ language legislation is any guide, would cause a lot of unnecessary fuss.