Who all studies language? 🤔 A brief disciplinary tour

May 11, 2021May 11, 2021 ~ Rachael Tatman ~ Leave a comment

red and yellow bus photo — Buckle up friends, we’re going on a tour!

One of the nice things about human language is that no matter what your question about it might be, someone, somewhere has almost certainly already asked the same thing… and probably found at least part of an answer! The downside of this wealth of knowledge is that, even if you restrict yourself to just looking at the Western academic tradition, 1) there’s a lot of it and 2) it’s scattered across a lot of disciplines which can make it very hard to find.

An academic discipline is a field of study but also a social network of scholars with shared norms and vocabulary. While people do do “interdisciplinary” work that draws on more than one discipline, the majority of academic life is structured around working in a single discipline. This is reflected in everything from departments to journals and conferences to how research funding is divided.

As a result, even if you study human language in some capacity yourself it can be very hard to form a good idea of where else people are doing related work if it falls into another discipline you don’t have contact with. You won’t see them at your conferences, you probably won’t cite each other in your papers and even if you are studying the exact same thing you’ll probably use different words to describe it and have different reserach goals. As a result, even many researchers working in language may not know what’s happening in the discipline next door.

For better or worse, though, I’ve always been very curious about disciplinary boundaries and talk and read to a lot of folks and, as a result, have ended up learning a lot about different disciplines. (Note: I don’t know that I’d recommend this to other junior scholars. It made me a bit of a “neither fish nor fowl” when I was on the faculty job market. I did have fun though. 😉 The upside of this is that I’ve had at least three discussions with people where the gist of it was “here are the academic fields that are relevant to your interest” and so I figured it was time to write it up as a blog post to save myself some time in the future.

Disciplines where language is the main focus

These fields study language itself. While people working in these fields may use different tools and have different goals, these are fields where people are likely to say that language is their area of study.

Linguistics

This is the field that studies Language and how it works. Sometimes you’ll hear people talk about “capital L language” to distinguish it from the study of a specific language. Whatever tools or methods or theoretical linguists use, their main object of study is language itself. There a lot of fields within linguistics and they vary a lot, but generally if a field has “linguistics” on the end, they’re going to be focusing on language itself.

For more information about linguistics, check out the Linguistic Society of America or my friend Gretchen’s blog.

Language-specific disciplines (classics, English, literature, foreign language departments etc.)

This is a collection of disciplines that study particular languages and specific instances of language use (like specific documents or pieces of oral literature). These fields generally focus on language teaching or applying frameworks like critical theory to better understand texts. Oh, or they produce new texts themselves. If you ask someone in one of these fields what they study, they’ll probably say the name of the specific language or family of languages they work on.

There are a lot of different fields that fall under this umbrella, so I’d recommend searching for “[whatever language you what to know about ] studies” and taking it from there.

Speech language pathology/Audiology/Speech and hearing

I’m grouping these disciplines together because they generally focus on language in a medical context. The main focus of researchers in this field is studying how the human body produces and receives language input. A lot of the work here focus on identifying and treating instances when these processes break down.

A good place to learn more is the American Speech-Language-Hearing Association.

Computer science (Specifically natural language processing, computational linguistics)

This field (more likely to be called NLP these days) focuses on building and understanding computational systems where language data, usually text, is part of either the input or output. Currently the main focus on the field (in terms of press coverage and $$ at any rate) is in applying machine learning methods to various problems. A lot of work in NLP is focused around particular tasks which generally have an associated dataset and shared metric and where the aim is to outperform other systems on the same problem. NLP does use some methods from other fields of machine learning (like computer vision) but the majority of the work uses techniques specific to, or at least developed for, language data.

To learn more, I’d check out the Association for Computational Linguistics. (Note that “NLP” is also an acronym for a pseudoscienience thing so I’d recommend searching #NLProc or “Natural Language Processing” instead.)

For reference, I would say that currently my main field is in applied NLP, but my background is primarily in linguistics and sprinkling of language-specific studies, especially English and American Sign Language. (Although I’ve taken course work and been a co-author on papers in speech & hearing.)

Disciplines where language is sometimes studied

There are also a lot of related fields where language data is used, or language is used as a tool to study a different object of inquiry.

Data Science. You would you shocked how much of data science is working with text data (or maybe you’re a data scientist and you wouldn’t be). Pretty much every organization has some sort of text they would like to learn about without having to read it all.
Computational social science, which uses language data but also frequently other types of data produced by human interaction with computational system. The aim is usually more to model or understand society rather than language use.
Anthropology, where language data is often used to better understand humans. (As a note, early British anthropology in particular is straight up racist imperial apologism, so be ye warned. There have been massive changes in the field, thankfully.) A lot of language documentation used to happen in anthropology departments, although these days I think it tends to be more linguistics. The linguistic-focused subdisciplines are anthropological linguistics or linguistic anthropology (they’re slightly different).
Sociology, the study of society. Sociolinguistics is more sociologically-informed linguistics, and in the US historically has been slightly more macro focused.
Psychology/Cognitive science. Non-physical brain stuff, like the mind and behavior. The linguistic part is psycholinguistics. This is where a lot of the work on language learning goes on.
Neurology. Physical brain stuff. The linguistic part is neurolinguistics. They tend to do a lot of imaging.
Education. A lot of the literature on language learning is in education. (Language learning is not to be confused with language acquisition; that’s only for the process by which children naturally acquire a language without formal instruction.)
Electrical engineering (Signal processing). This is generally the field of folks who are working on telephony and automatic speech recognition. NLP historically hasn’t done as much with voices, that’s been in electrical engineering/signal processing.
Disability studies. A lot of work on signed languages will be in disability studies departments if they don’t have their own department.
Historians. While they aren’t primarily studying the changes in linguistic systems, historians interact with older language data a lot and provide context for things like language contact, shift and historical usage.
Informatics/information science/library science. Information science is broader than linguistics (including non-linguistic information all well) but often dovetails with it, especially in semantics (the study of meaning) and ontologies (a formal representation of categories and their relations).
Information theory. This field is superficially focused on how digital information is encoded. Usually linguistics draws from it rather than vice-versa because it’s lower level, but if you’ve heard of entropy, compression or source-channel theory those are all from information theory.
Philosophy. A lot of early linguistics scholars, like Ferdinand de Saussure, would probably have considered themselves primarily philosophers and there was this whole big thing in the early 1900’s. The language-specific branch is philosophy of language.
Semiotics. This is a field I haven’t interacted with too much (I get the impression that it’s more popular in Europe than the US) but they study “signs”, which as I understand it is any way of referring to a thing in any medium without using the actual thing, which by that definition does include language.
Design studies. Another field I’m not super familiar with, but my understanding is that it includes studying how users of a designed thing interact with it, which may include how they use or interpret language. Also: good design is so important and I really don’t think designers get enough credit/kudos.

What you can, can’t and shouldn’t do with social media data

August 31, 2018August 31, 2018 ~ Rachael Tatman ~ Leave a comment

Earlier this summer, I gave a talk on the promise & pitfalls of social media data for the Joint Statistical Meetings. While I don’t think there’s a recording of the talk, enough people asked for one that I figured it would be worth putting together a blog post version of the talk. Enjoy!

What you can do with social media data

Let’s start with the good news: research using social media data has revolutionized social science research. It’s let us ask bigger question more quickly, helped us overcome some of the key drawbacks of behavioral experimental work and ask new kinds of questions.

More data faster

I can’t overstate how revolutionary the easy availability of social media data has been, especially in linguistics. It has increased both the rate and scale of data collection by orders of magnitude. Compare the time it took to compare the Dictionary of American Regional English (DARE) to the Wordmapper app below. The results are more or less the same, maps of where in the US folks use different words (in this example, “cellar”). But what once took the entire careers of multiple researchers can now be done in a few months, and with far higher resolution.

	Dictionary of American Regional English (DARE)	Word Mapper App
Data collection	48 years (1965 – 2013)	<1 year
Size of team	2,777 people	4 people
Number of participants	1,843 people	20 million

Social networks

Social media sites with a following or friend feature also let us ask really large scale questions about social networks. How do social networks and political affiliation interact? How does language change move through a social network? What characteristics of social network structure are more closely associated with the spread of misinformation? Of course, we could ask these questions before social media data… but by using APIs to access social media data, we reduce the timescale of these projects from decades to weeks or even days and we have a clear way to operationalize social network ties. It’s fairly hard for someone to sit down and list everyone they interact with face-to-face, but it’s very easy to grab a list of all the Twitter accounts you follow.

Wild-caught, all natural data

One of the constant struggles in experimental work is the fact that the mere fact of being observed changes behavior. This is known as the Hawthorne Effect in psychology or the Observer’s Paradox in sociolinguistics. As a result, even the most well-designed experiment is limited by the fact that the participants know that they are completing an experiment.

Social media data, however, doesn’t have this limitation. Since most social media research projects are conducted on public data without interacting directly with participants, they are not generally considered human subjects research. When you post something on a public social media account, you don’t have a reasonable expectation of privacy. In other words, you know that just anyone could come along and read it, and that includes researchers. As a result it is not generally necessary to collect informed consent for social media projects. (Informed consent is when you are told exactly what’s going to happen during an experiment you’re participating, and you agree to participate in it.) This means that the vast majority of folks who are participating in a social media study don’t actually know that they’re part of a study.

The benefit of this is that it allows researchers to get around three common confounds that plague social science research:

Bradley effect: People tend to tell researchers what they think they want to hear
Response bias: The sample of people willing to do an experiment/survey differ in a meaningful way from the population as a whole
Observer’s paradox/Hawthorne effect: People change their behavior when they know they’re being observed

While this is a boon to researchers, the lack of informed consent does introduce other other problems, which we’ll talk about later.

What you can’t do with social media data

Of course, all the benefits of social media come at a cost. There are several key drawbacks and limitations of social media research:

You can’t be sure who your participants are.
There’s inherent sampling bias.
You can’t violate the developer’s agreements.

You’re not sure who you’re studying…

Because you don’t meet with the people whose data is included in your study, you don’t know for sure what sorts of demographic categories they belong to, whether they are who they’re claiming to be or even if they’re human at all. You have to deal with both bots, accounts where content is produced and distributed automatically by a computer and sock puppets, where one person pretends to be another person. Sock puppets in particular can be very difficult to spot and may skew your results in unpredictable ways.

…but you can be sure your sample is biased.

Social media users aren’t randomly drawn from the world’s population as a whole. Social media users tend to be WEIRD: from wealthy, educated, industrialized, rich and democratic societies. This group is already over-represented in social science and psychology research studies, which may be subtly skewing our models of human behavior.

In addition, different social media platforms have different user bases. For example, Instagram and Snapchat tend to have younger users, Pinterest has more women (especially compared to Reddit, which skews male) and LinkedIn users tend to be highly educated and upper middle class. And that doesn’t even get to social network effects: you’re more likely to be on the same platform your friends are on, and since social networks tend to be homophilous, you can end up with pockets of very socially homogeneous folks. So, even if you manage to sample randomly from a social media platform, your sample is likely to differ from one taken from the population as a whole.

You need to abide by the developer’s agreements for whatever platform you’re using data from.

This is mainly an issue if you’re using API (application programmatic interface) to fetch data from a service. Developer’s agreements vary between platforms, but most limit the amount of data you can fetch and store, and how and if you can share it with other researchers. For example, if you’re sharing Twitter data you can only share 50,000 tweets at a time and even then only if you have to have people download a file by clicking on it. If you share any more than that, you should just share the ID’s of the tweets rather than the full tweets. (Document the Now’s Hydrator can help you fetch the tweets associated with a set of IDs.)

What you shouldn’t do with social media data

Finally, there are ethical restrictions on what we should do with social media data. As researchers, we need to 1) respect the wishes of users and 2) safeguard their best interests, especially given that we don’t (currently) generally get informed consent from the folks whose data we’re collecting.

Respecting users’ wishes

At least in the US, ethical human subjects research is led by three guiding principles set forth in the Belmont report. If you’re unfamiliar with the report, it was written in the aftermath of the Tuskegee Valley experiments. These were a series of medical experiments on African Americans men who had contracted syphilis conducted from the 1930’s to 1970’s. During the study, researchers withheld the cure (and even information that it existed) from the participants. The study directly resulted in the preventable deaths of 128 men and many health problems for study participants, their wives and children. It was a clear ethical violation of the human rights of participants and the moral stain of it continues to shape how we conduct human subjects research in the US.

The three principles of ethical human subjects research are:

Respect for Persons: People should be treated as autonomous individuals and persons with diminished autonomy (like children or prisoners) are entitled to protection.
Beneficence: 1) Do not harm and 2) maximize possible benefits and minimize possible harms.
Justice: Both the risks and benefits of research should be distributed equally.

Social media research might not technically fall under the heading of human subjects research, since we aren’t intervening with our participants. However, I still believe that it’s important that researchers following these general guides when designing and distributing experiments.

One thing we can do is respect their wishes of the communities we study. Fortunately, we have some evidence of what those wishes are. Feisler and Proferes (2018) surveyed 368 Twitter users on their perception of a variety of research behaviors.

Screenshot from 2018-07-25 16-10-21 — Fiesler, C., & Proferes, N. (2018). “Participant” Perceptions of Twitter Research Ethics. Social Media+ Society, 4(1), 2056305118763366. Table 4.

In general, Twitter users are more OK with research with the following characteristics:

Large datasets
Analyzed automatically
Social media users informed about research
If tweets are quoted, they are anonymized. (Note that if you include the exact text, it’s possible to reverse search the quoted tweet and de-anonymize it. I recommend changing at least 20% of the content words in a tweet to synonyms to get around this and double-checking by trying to de-anonymize it yourself.)

These characteristics, however, are not as acceptable to Twitter users:

Small datasets
Analysis done by hand (presumably including analysis by Mechanical Turk workers)
Tweets from protected accounts or deleted tweets analyzed (which is also against the developer’s agreement, so you shouldn’t be doing this anyway)
Quoting with citation (very different from academic norms!)

In general, I think these suggest general best practices for researchers working with Twitter data.

Stick to larger datasets
Try to automate wherever possible
Follow the developer’s agreement
Take anonymity seriously.

There is one thing I disagree with, however: I don’t think we should contact everyone who’s tweets we use in our research.

Should we contact people whose tweets we use in our studies? My gut instinct on this one is “no”. If you’re collecting a large amount of data, you probably shouldn’t reach out to everyone in the data.

For users who don’t have open DM’s, the only way to contact them is to publicly mention them using @username. The problem with this is that it partly de-anonymizes your data. If you then choose to share your data, having publicly shared a list of whose data was included in the dataset it makes it much easier to de-anonymize. Instead of trying to figure out whose tweets were included when looking at all of Twitter, an adversary only has to figure out which of the users on the list you’ve given them is connected to which record.

The main exception to this is if have a project that’s a deep dive on one user, in which case you probably should. (For example, I contacted Chaz Smith and let him know about my phonological analysis of his #pronouncingthingsincorrectly Vines.)

Do no harm

Another aspect of ethical research is trying to ensure that your research or research data doesn’t have potentially unethical applications. The elephant in the room here, of course, is the data Cambridge Analytica collected from Facebook users. Researchers at Cambridge, collecting data for a research project, got lots of people’s permission to access their Facebook data. While that wasn’t a problem, they collected and saved Facebook data from other folks as well, who hadn’t opted in. In the end, only a half of a half of a percent of the folks whose data was in the final dataset actually agreed to be included in it. To make matters worse, this data was used by a commercial company founded by one of the researchers to (possibly) influence elections in the US and UK. Here’s a New York Times article that goes into much more detail. This has understandably lead to increased scrutiny of how social media research data is collected and used.

I’m not bringing this up to call out Facebook in particular, but to explain why it’s important to consider how research data might be used long-term. How and where will it be stored? For how long? Who will have access to it? In short, if you’re a researcher, how can you ensure that data you collected won’t end up somehow hurting the people you collected it from?

As an example of how important these questions are, consider this OK Cupid “research” dataset. It was collected without consent and shared publicly without anonymization. It included many personal details that were only intended to be shared with other users of the site, including explicit statements of sexual orientation. In addition to being an unforgivable breach of privacy, this directly endangered users whose data was collected: information on sexual orientation was shared for people living in countries where homosexuality is a crime that carries a death penalty or sentence of life in prison. I have a lot of other issues with this “study” as well, but the fact that it directly endangered research subjects who had no chance to opt out is by far the most egregious ethical breach.

If you are collecting social media data for research purposes, it is your ethical responsibility to safeguard the well-being of the people whose data you’re using.

I bring up these cautionary tales not to scare you off of social media research but to really impress the gravity of the responsibility you carry as a social media researcher. Social media data has the potential to dramatically improve our understanding of the world. A lot of my own work has relied heavily on it! But it’s important that we, as researchers, take our moral duty to make sure that we don’t end up doing more harm than good very seriously.

Are emoji sequences as informative as text?

July 7, 2018July 7, 2018 ~ Rachael Tatman ~ Leave a comment

Something I’ve been thinking about a lot lately is how much information we really convey with emoji. I was recently at the 1st International Workshop on Emoji Understanding and Applications in Social Media and one theme that stood out to me from the papers was that emoji tend to be used more to communicate social meaning (things like tone and when a conversation is over) than semantics (content stuff like “this is a dog” or “an icecream truck”).

I’ve been itching to apply an information theoretic approach to emoji use for a while, and this seemed like the perfect opportunity. Information theory is the study of storing, transmitting and, most importantly for this project, quantifying information. In other words, using an information theoretic approach we can actually look at two input texts and figure out which one has more information in it. And that’s just what we’re going to do: we’re going to use a measure called “entropy” to directly compare the amount of information in text and emoji.

What’s entropy?

Shannon entropy is a measure of how much information there is in a sequence. Higher entropy means that there’s more uncertainty about what comes next, while lower entropy means there’s less uncertainty. (Mathematically, entropy is always less than or the same as log₂(n), where n is the total number of unique characters. You can learn more about calculating entropy and play around with an interactive calculator here if you’re curious.)

So if you have a string of text that’s just one character repeated over and over (like this: 💀💀💀💀💀) you don’t need a lot of extra information to know what the next character will be: it will always be the same thing. So the string “💀💀💀💀💀” has a very low entropy. In this case it’s actually 0, which means that if you’re going through the string and predicting what comes next, you’re always going to be able to guess what comes next becuase it’s always the same thing. On the other hand, if you have a string that’s made up of four different characters, all of which are equally probable (like this:♢♡♧♤♡♧♤♢), then you’ll have an entropy of 2.

TL;DR: The higher the entropy of a string the more information is in it.

Experiment

Hypothesis

We do have some theoretical maximums for the entropy text and emoji. For text, if the text string is just randomly drawn from the 128 ASCII characters (which isn’t how language works, but this is just an approximation) our entropy would be 7. On the other hand, for emoji, if people are just randomly using any emoji they like from the set of emoji as of June 2017, then we’d expect to see an entropy of around 11.

So if people are just using letters or emoji randomly, then text should have lower entropy than emoji. However, I don’t think that’s what’s happening. My hypothesis, based on the amount of repetition in emoji, was that emoji should have lower entropy, i.e. less information, than text.

Data

To get emoji and text spans for our experiment I used four different datasets: three from Twitter and one from YouTube.

I used multiple datasets for a couple reasons. First, becuase I wanted a really large dataset of tweets with emoji, and since only between 0.9% and 0.5% of tweets from each Twitter dataset actually contained emoji I needed to case a wide net. And, second, because I’m growing increasingly concerned about genre effects in NLP research. (Like, a lot of our research is on Twitter data. Which is fine, but I’m worried that we’re narrowing the potential applications of our research becuase of it.) It’s the second reason that led me to include YouTube data. I used Twitter data for my initial exploration and then used the YouTube data to validate my findings.

For each dataset, I grabbed all adjacent emoji from a tweet and stored them separately. So this tweet:

Love going to ballgames! ⚾🌭 Going home to work in my garden now, tho 🌸🌸🌸🌸

Has two spans in it:

Span 1: ⚾🌭

Span 2: 🌸🌸🌸🌸

All told, I ended up with 13,825 tweets with emoji and 18,717 emoji spans of which only 4,713 were longer than one emoji. (I ignored all the emoji spans of length one, since they’ll always have an entropy of 0 and aren’t that interesting to me.) For the YouTube comments, I ended up with 88,629 comments with emoji, 115,707 emoji spans and 47,138 spans with a length greater than one.

In order to look at text as parallel as possible to my emoji spans, I grabbed tweets & YouTube comments without emoji. For each genre, I took a number of texts equal to the number of spans of length > 1 and then calculated the character-level entropy for the emoji spans and the texts.

Analysis

First, let’s look at Tweets. Here’s the density (it’s like a smooth histogram, where the area under the curve is always equal to 1 for each group) of the entropy of an equivalent number of emoji spans and tweets.

: Text has a much high character-level entropy than emoji. For text, the mean and median entropy are both around 5. For emoji, there is a multimodal distribution, with the median entropy being 0 and also clusters around 1 and 1.5.

It looks like my hypothesis was right! At least in tweets, text has much more information than emoji. In fact, the most common entropy for an emoji span is 0: which means that most emoji spans with a length greater than one are just repititons of the same emoji over and over again.

But is this just true on Twitter, or does it extend to YouTube comments as well?

download (5) — The pattern for emoji & text in YouTube comments is very similar to that for Tweets. The biggest difference is that it looks like there’s less information in YouTube Comments that are text-based; they have a mean and median entropy closer to 4 than 5.

The YouTube data, which we have almost ten times more of, corroborates the earlier finding: emoji spans are less informative, and more repetitive, than text.

Which emoji were repeated the most/least often?

Just in case you were wondering, the emoji most likely to be repeated was the skull emoji, 💀. It’s generally used to convey strong negative emotion, especially embarrassment, awkwardness or speechlessness, similar to “ded“.

OMFFFFFFFFFG……….how you gonna put me on blast like that @Oreo!!!!

Hahahhaha! 💀💀💀💀💀🤣😂 https://t.co/eJ1igiqJ9W

— Jaremi Carey (@PhiPhiOhara) July 5, 2018

The least likely was the right-pointing arrow (▶️), which is usually used in front of links to videos.

What are we naming the @arianagrande + @nickiminaj super group?

1️⃣ Nickari
2️⃣ Aricki
3️⃣ Minajagrande
4️⃣ Granaj

Watch their video #TheLightIsComing now. 💡💡
▶️https://t.co/nwLHtZ2V86 pic.twitter.com/aj6RjI1IRE

— Vevo (@Vevo) July 5, 2018

More info & further work

If you’re interested, the code for my analysis is available here. I also did some of this work as live coding, which you can follow along with on YouTube here.

For future work, I’m planning on looking at which kinds of emoji are more likely to be repeated. My intuition is that gestural emoji (so anything with a hand or face) are more likely to be repeated than other types of emoji–which would definitely add some fuel to the “are emoji words or gestures” debate!

Datasets for data cleaning practice

April 19, 2018April 21, 2018 ~ Rachael Tatman ~ 4 Comments

Looking for datasets to practice data cleaning or preprocessing on? Look no further!

Each of these datasets needs a little bit of TLC before it’s ready for different analysis techniques. For each dataset, I’ve included a link to where you can access it, a brief description of what’s in it, and an “issues” section describing what needs to be done or fixed in order for it to fit easily into a data analysis pipeline.

Big thanks to everyone in this Twitter thread who helped me out by pointing me towards these datasets and letting me know what sort of pre-processing each needed. There were also some other data sources I didn’t include here, so check it out if you need more practice data. And feel free to comment with links to other datasets that would make good data cleaning practice! 🙂

List of datasets:

Hourly Weather Surface – Brazil (Southeast region)
PhyloTree Data
International Comprehensive Ocean-Atmosphere Data Set
CLEANEVAL: Development dataset
London Air
SO MUCH CANDY DATA, SERIOUSLY
Production and Perception of Linguistic Voice Quality
Australian Marriage Law Postal Survey, 2017
The Metropolitan Museum of Art Open Access
National Drug Code Directory
Flourish OA
WikiPlots
Register of UK Parliament Members’ Financial Interests
NYC Gifted & Talented Scores

Hourly Weather Surface – Brazil (Southeast region)

It’s covers hourly weather data from 122 weathers stations of southeast region (Brazil). The southeast include the states of Rio de Janeiro, São Paulo, Minas Gerais e Espirito Santo. Dataset Source: INMET (National Meteorological Institute – Brazil).

Issues: Can you predict the amount of rain? Temperature? NOTE: Not all weather stations started operating since 2000

PhyloTree Data

Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Human mitochondrial DNA is widely used as tool in many fields including evolutionary anthropology and population history, medical genetics, genetic genealogy, and forensic science. Many applications require detailed knowledge about the phylogenetic relationship of mtDNA variants. Although the phylogenetic resolution of global human mtDNA diversity has greatly improved as a result of increasing sequencing efforts of complete mtDNA genomes, an updated overall mtDNA tree is currently not available. In order to facilitate a better use of known mtDNA variation, we have constructed an updated comprehensive phylogeny of global human mtDNA variation, based on both coding‐ and control region mutations. This complete mtDNA tree includes previously published as well as newly identified haplogroups.

Issues: This data would be more useful if it were in the Newick tree format and could be read in using the read.newick() function. Can you help get the data in this format?

International Comprehensive Ocean-Atmosphere Data Set

The International Comprehensive Ocean-Atmosphere Data Set (ICOADS) offers surface marine data spanning the past three centuries, and simple gridded monthly summary products for 2° latitude x 2° longitude boxes back to 1800 (and 1°x1° boxes since 1960)—these data and products are freely distributed worldwide. As it contains observations from many different observing systems encompassing the evolution of measurement technology over hundreds of years, ICOADS is probably the most complete and heterogeneous collection of surface marine data in existence.

Issues: The ICOADS contains O(500M) meteorological observations from ~1650 onwards. Issues include bad observation values, mis-positioned data, missing date/time information, supplemental data in a variety of formats, duplicates etc.

CLEANEVAL: Development dataset

CLEANEVAL is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus, for linguistic and language technology research and development. There are three versions of each file: original, pre-processed, and manually cleaned. All files of each kind are gathered in a directory. The file number remains the same for the three versions of the same file.

Issues: Your task is to “clean up” a set of webpages so that their contents can be easily used for further linguistic processing and analysis. In short, this implies:

removing all HTML/Javascript code and “boilerplate” (headers, copyright notices, link lists, materials repeated across most pages of a site, etc.);
adding a basic encoding of the structure of the page using a minimal set of symbols to mark the beginning of headers, paragraphs and list elements.

London Air

The London Air Quality Network (LAQN) is run by the Environmental Research Group of King’s College London. LAQN stands for the London Air Quality Network which was formed in 1993 to coordinate and improve air pollution monitoring in London. The network collects air pollution data from London boroughs, with each one funding monitoring in its own area. Increasingly, this information is being supplemented with measurements from local authorities surrounding London in Essex, Kent and Surrey, thereby providing an overall perspective of air pollution in South East England, as well as a greater understanding of pollution in London itself.

Issues: Lots of gaps (null/zero handling), outliers, date handling, pivots and time aggregation needed first!

SO MUCH CANDY DATA, SERIOUSLY

Candy hierarchy data for 2017 Boing Boing Halloween candy hierarchy. This is survey data from this survey.

Issues: If you want to look for longitudinal effects, you also have access to previous datasets. Unfortunate quirks in the data include the fact that the 2014 data is not the raw set (can’t seem to find it), and in 2015, the candy preference was queried without the MEH option.

Production and Perception of Linguistic Voice Quality

Data from the “Production and Perception of Linguistic Voice Quality” project at UCLA. This project was funded by NSF grant BCS-0720304 to Prof. Pat Keating, with Prof. Abeer Alwan, Prof. Jody Kreiman of UCLA, and Prof. Christina Esposito of Macalester College, for 2007-2012.

The data includes spreadsheet files with measures gathered using Voicesauce (Shue, Keating, Vicenik & Yu 2011) for both acoustic measures and EGG measures. The accompanying readme file provides information on the various coding used in both spreadsheets.

Issues: The following issues are with the acoustics measures spreadsheet specifically.

xlsx format with meaningful color coding created by a VBA script (which is copy-pasted into the second sheet)
partially wide format instead of long/tidy, with a ton of columns split into different timepoints
line 6461 has another set of column headers rather than data for some of the columns starting with “shrF0_mean”. I think this was a copy-paste error. Hopefully it doesn’t mean that all of the data below that row is shifted down by 1!

Australian Marriage Law Postal Survey, 2017

Response: Should the law be changed to allow same-sex couples to marry?

Of the eligible Australians who expressed a view on this question, the majority indicated that the law should be changed to allow same-sex couples to marry, with 7,817,247 (61.6%) responding Yes and 4,873,987 (38.4%) responding No. Nearly 8 out of 10 eligible Australians (79.5%) expressed their view.

All states and territories recorded a majority Yes response. 133 of the 150 Federal Electoral Divisions recorded a majority Yes response, and 17 of the 150 Federal Electoral Divisions recorded a majority No response.

Issues: Miles McBain discusses his approach to cleaning this dataset in depth in this blog post.

The Metropolitan Museum of Art Open Access

The Metropolitan Museum of Art provides select datasets of information on more than 420,000 artworks in its Collection for unrestricted commercial and noncommercial use. To the extent possible under law, The Metropolitan Museum of Art has waived all copyright and related or neighboring rights to this dataset using Creative Commons Zero. This work is published from: The United States Of America. You can also find the text of the CC Zero deed in the file LICENSE in this repository. These select datasets are now available for use in any media without permission or fee; they also include identifying data for artworks under copyright. The datasets support the search, use, and interaction with the Museum’s collection.

Issues: Missing values, inconsistent information, missing documentation, possible duplication, mixed text and numeric data.

National Drug Code Directory

The Drug Listing Act of 1972 requires registered drug establishments to provide the Food and Drug Administration (FDA) with a current list of all drugs manufactured, prepared, propagated, compounded, or processed by it for commercial distribution. (See Section 510 of the Federal Food, Drug, and Cosmetic Act (Act) (21 U.S.C. § 360)). Drug products are identified and reported using a unique, three-segment number, called the National Drug Code (NDC), which serves as a universal product identifier for drugs. FDA publishes the listed NDC numbers and the information submitted as part of the listing information in the NDC Directory which is updated daily.

The information submitted as part of the listing process, the NDC number, and the NDC Directory are used in the implementation and enforcement of the Act.

Issue: Non-trivial duplication (which drugs are different names for the same things?).

Flourish OA

Our data comes from a variety of sources, including researchers, web scraping, and the publishers themselves. All data is cleaned and reviewed to ensure its validity and integrity. Our catalog expands regularly, as does the number of features our data contains. We strive to maintain the most complete and sophisticated store of Open Access data in the world, and it is this mission that drives our continued work and expansion.

A dataset on journal/publisher information that is a bit dirty and might make for great practice. It’s been a graduate student/community project: http://flourishoa.org/

Issues: Scraped data, has some missing fields, possible duplication and some encoding issues (possibly multiple character encodings).

WikiPlots

The WikiPlots corpus is a collection of 112,936 story plots extracted from English language Wikipedia. These stories are extracted from any English language article that contains a sub-header that contains the word “plot” (e.g., “Plot”, “Plot Summary”, etc.).

This repository contains code and instructions for how to recreate the WikiPlots corpus.

The dataset itself can be downloaded from here: plots.zip (updated: 09/26/2017). The zip file contains two files:

plots: a text file containing all story plots. Each story plot is given with one sentence per line. Each story is followed by on a line by itself.
titles: a text file containing a list of titles for each article in which a story plot was found and extracted.

Issues: Some lines may be cut off due to abbreviations. Some plots may be incomplete or contain surrounding irrelevant information.

Register of UK Parliament Members’ Financial Interests

The main purpose of the Register is to provide information about any financial interest which a Member has, or any benefit which he or she receives, which others might reasonably consider to influence his or her actions or words as a Member of Parliament.

Members must register any change to their registrable interests within 28 days. The rules are set out in detail in the Guide to the Rules relating to the Conduct of Members, as approved by the House on 17 March 2015. Interests which arose before 7 May 2015 are registered in accordance with earlier rules.

The Register is maintained by the Parliamentary for Commissioner for Standards. It is updated fortnightly online when the House is sitting, and less frequently at other times. Interests remain on the Register for twelve months after they have expired.

Issues: Each member’s transactions are on a separate webpage with a different text format, with contributions listed under different headings (not necessarily one per line) and in different formats. Will take quite a bit of careful preprocessing to get into CSV or JSON format.

NYC Gifted & Talented Scores

Couple of messy but easy data sets: NYC parents reporting their kids’ scores on the gifted and talented exam, as well as school priority ranking. Some enter the percentiles as point scores, some skip all together, no standard preference format, etc. Also birth quarter affects percentiles.

Parity in Utility: One way to think about fairness in machine learning tools

February 25, 2018 ~ Rachael Tatman ~ 1 Comment

First, a confession: part of the reason I’m writing this blog post today is becuase I’m having major FOMO on account of having missed #FAT2018, the first annual Conference on Fairness, Accountability, and Transparency. (You can get the proceedings online here, though!) I have been seeing a lot of tweets & good discussion around the conference and it’s gotten me thinking again about something I’ve been chewing on for a while: what does it mean for a machine learning tool to be fair?

If you’re not familiar with the literature, this recent paper by Friedler et al is a really good introduction, although it’s not intended as a review paper. I also review some of it in these slides. Once you’ve dug into the work a little bit, you may notice that a lot of this work is focused on examples where output of the algorithm is a decision made on the level of the individual: Does this person get a loan or not? Should this resume be passed on to a human recruiter? Should this person receive parole?

While these are deeply important issues, the methods developed to address fairness in these contexts don’t necessarily translate well to evaluating fairness for other applications. I’m thinking specifically about tools like speech recognition or facial recognition for automatic focusing in computer vision: applications where an automatic tool is designed to supplant or augment some sort of human labor. The stakes are lower in these types of applications, but it’s still important that they don’t unintentionally end up working poorly for certain groups of people. This is what I’m calling “parity of utility”.

Parity of utility: A machine learning application which automatically completes a task using human data should not preform reliably worse for members of one or more social groups relevant to the task.

That’s a bit much to throw at you all at once, so let me break down my thinking a little bit more.

Machine learning application: This could be a specific algorithm or model, an ensemble of multiple different models working together, or an entire pipeline from data collection to the final output. I’m being intentionally vague here so that this definition can be broadly useful.
Automatically completes a task: Again, I’m being intentionally vague here. By “a task” I mean some sort of automated decision based on some input stimulus, especially classification or recognition.
Human data: This definition of fairness is based on social groups and is thus restricted to humans. While it may be frustrating that your image labeler is better at recognizing cows than horses, it’s not unfair to horses becuase they’re not the ones using the system. (And, arguably, becuase horses have no sense of fairness.)
Preform reliably worse: Personally I find a statistically significant difference in performance between groups with a large effect size to be convincing evidence. There are other ways of quantifying difference, like the odds ratio or even the raw accuracy across groups, that may be more suitable depending on your task and standards of evidence.
Social groups relevant to the task: This particular fairness framework is concerned with groups rather than individuals. Which groups? That depends. Not every social group is going to relevant to every task. For instance, your mother language(s) is very relevant for NLP applications, while it’s only important for things like facial recognition in as far as it co-varies with other, visible demographic factors. Which social groups are relevant for what types of human behavior is, thankfully, very well studied, especially in sociology.

So how can we turn these very nice-sounding words into numbers? The specifics will depend on your particular task, but here are some examples:

Is there a difference in positive predictive value (PPV), error rate (1-PPV), true positive rate (TPR), and false positive rate (FPR) for three commercial gender classification algorithms based on the gender and skin tone of the person whose image is being classified? (Answer: Yep! People with lighter skin tones and men are classified more accurately, with darker-skinned women being classified with the most errors.)
Is there a statistically-significant difference in the word error rate for commercial speech recognition algorithms based on the speaker’s dialect, race and gender? (Answer: yes, yes and maybe, with General American and white talkers having the lowest word error rates.)

So the tools reviewed in these papers don’t demonstrate parity of utility: they work better for folks from specific groups and worse for folks from other groups. (And, not co-incidentally, when we find that systems don’t have parity in utility, they tend work best for more privileged groups and worse for less privileged groups.)

Parity of Utility vs. Overall Performance

So this is the bit where things get a bit complicated: parity of utility is one way to evaluate a system, but it’s a measure of fairness, not overall system performance. Ideally, a high-performing system should also be a fair system and preform well for all groups. But what if you have a situation where you need to choose between prioritizing a fairer system or one with overall higher performance?

I can’t say that one goal is unilaterally better than the other in all situations. What I can say is that focusing on only higher performance and not investigating measures of fairness we risk building systems that have systematically lower performance for some groups. In other words, we can think of an unfair system (under this framework) as one that is overfit to one or more social groups. And, personally, I consider a model overfit to a specific social group while being intended for general use to be flawed.

Parity of Utility is an idea I’ve been kicking around for a while, but it could definitely use some additional polish and wheel-kicking, so feel free to chime in in the comments. I’m especially interested in getting some other perspectives on these questions:

Do you agree that a tool that has parity in utility is “fair”?
What would you need to add (or remove) to have a framework that you would consider fair?
What do you consider an acceptable balance of fairness and overall performance?
Would you prefer to create an unfair system with slightly higher overall performance or a fair system with slightly lower overall performance? Which type of system would you prefer to be a user of? What about if a group you are a member of had reliably lower performance?
Is it more important to ensure that some groups are treated fairly? What about a group that might need a tool for accessibility rather than just convenience?

A Field Guide to Character Encodings

January 28, 2018 ~ Rachael Tatman ~ 1 Comment

I recently gave a talk at PyCascades (a regional Python language conference) on character encodings and I thought it would be nice to put together a little primer on a couple different important character encodings.

If you’re unfamiliar with character encodings, they’re just a variety of different systems used to map a string of binary (i.e. 1’s and 0’s) to a specific character. So the Euro character, €, would be represented as “111000101000001010101100” in a character encoding called UTF-8 but “10100100” in the Latin 9 encoding. There are a lot of different character encodings out there, so I’m just going to cover a handful that I think are especially interesting/important.

Name: ASCII
Created: 1960
Also known as: American Standard Code for Information Interchange, US-ASCII
Most often seen: In legacy systems, especially U.S. government databases that need to be backward compatible

ASCII was the first widely-used character encoding. It has space for only 128 characters, and is best suited for English-language data.

Name: ISO 8859-1
Created: 1985
Also known as: Latin 1, code page 819, iso-ir-100, csISOLatin1, latin1, l1, IBM819, WE8ISO8859P1
Most often seen: Representing languages not covered by ASCII (like Spanish and Portuguese).

Latin 1 is the most popular of a large set of character encodings developed by the ISO (International Organization for Standardization) to deal with the fact that ASCII only really works well for English by adding additional characters. They did this by adding one extra bit to each character (8 instead of the 7 ASCII uses) so that they had space for 256 characters per encoding. However, since there are a lot of characters out there, there are 16 different ISO encodings that can handle different alphabets. (For example, ISO 8859-5 handles Cyrillic characters, while ISO 8859-11 maps Thai characters.)

Name: Windows-1252
Created: 1985
Also known as: CP-1252, Latin 1/ISO 8859-1 (which it isn’t!), ANSI, ansinew
Most often seen: Mislabelled as Latin 1.

While the ISO was developing a standard set of character encodings, pretty much every large software company was also developing thier own set of proprietary encodings that did pretty much the same thing. Windows-1252 is slightly tweaked version of Latin 1, but Windows also had a bunch of different encodings, as did Apple, as did IBM. The 1980’s were a wild time for character encodings!

Name: Shift-JIS
Created: 1997
Also known as: Shift Japanese Industrial Standards, SJIS, Shift_JIS
Most often seen: For Japanese

So one thing you may notice about the character encodings above is that they’re all fairly small, i.e. can’t handle more than 256 characters. But what about a language that has way more than 256 characters, like Japanese? (Japanese has a phonetic writing system with 71 characters, as well as 85,000 kanji characters which each represent a single word.) Well, one solution is to create a different character encoding system specifically for that language with enough space for all the characters you’re going to want to use frequently. But, like with the ASCII-based encodings I talked about above, just one encoding isn’t quite enough to cover all the needs of the language, so a lot of variants and extensions popped up.

Name: UTF-8
Created: 1992
Also known as: Unicode Transformation Format – 8-bit
Most often seen: Pretty much everywhere, including >90% of text on the web. (That’s a good thing!)

Which brings us to UTF-8, the current standard for text encoding. The UTF encodings map from binary to what are known as Unicode codepoints, and then those codepoints are mapped to characters. Why the whole “codepoints” thing in the middle? To help overcome the problems with language-specific encodings that were discussed above. There are over one million codepoints, of which a little over 130,000 have actually been assigned to specific characters. You can update which binary patterns map to which codepoints independently of which codepoints map to which characters. The large number of code points also means that UTF encodings are also pretty future-proof: we have space to add a lot of new characters before we run out. And, in case you’re wondering, there is a single body in charge of determining which code points map to which characters (including emoji!). It’s called the Unicode Consortium and anyone’s free to join.

Like I mentioned, there are lots of different character encodings out there, but knowing about these five and how they’re related will give you a good idea of the complexities of the character encodings landscape. If you’re curious to learn more, the Unicode Standard is a good place to start.

How to be wrong: Measuring error in machine learning models

December 30, 2017 ~ Rachael Tatman ~ 4 Comments

One thing I remember very clearly from writing my dissertation is how confused I initially was about which particular methods I could use to evaluate how often my models were correct or wrong. (A big part of my research was comparing human errors with errors from various machine learning models.) With that in mind, I thought it might be handy to put together a very quick equation-free primer of some different ways of measuring error.

The first step is to figure out what type of model you’re evaluating. Which type of error measurement you use depends on the type of model you’re evaluating. This was a big part of what initially confused me: much of my previous work had been with regression, especially mixed-effects regression, but my dissertation focused on multi-class classification instead. As a result, the techniques I was used to using to evaluate models just didn’t apply.

Today I’m going to talk about three types of models: regression, binary classification and multiclass classification.

Regression

In regression, your goal is to predict the value of an output value given one or more input values. So you might use regression to predict how much a puppy will weigh in four months or the price of cabbage. (If you want to learn more about regression, I recently put together a beginner’s guide to regression with five days of exercises.)

R-squared: This is a measurement of how correlated your predicted values are with the actual observed values. It ranges from 0 to 1, with 0 being no correlation and 1 being perfect correlation. In general, models with higher r-squareds are a better fit for your data.
Root mean squared error (RMSE), aka standard error: This measurement is an average of how wrong you were for each point you predicted. It ranges from 0 up, with closer to zero being better. Outliers (points you were really wrong about) will disproportionately inflate this measure.

Binary Classification

In binary classification, you aim to predict which of two classes an observation will fall. Examples include predicting whether a student will pass or fail a class or whether or not a specific passenger survived on the Titanic. This is a very popular type of model and there are a lot of ways of evaluating them, so I’m just going to stick to the four that I see most often in the literature.

Accuracy: This is proportion of the test cases that your model got right. It ranges from 0 (you got them all wrong) to 1 (you got them all right).
Precision: This is a measure of how good your model is at selecting only the members of a certain class. So if you were predicting whether students would pass or not and all of the students you predicted would pass actually did, then your model would have perfect precision. Precision ranges from 0 (none of the observations you said were in a specific class actually were) to 1 (all of the observations you said were in that class actually were). It doesn’t tell you about how good your model is at identifying all the members of that class, though!
Recall (aka True Positive Rate, Specificity): This is a measure of how good your model was at finding all the data points that belonged to a specific class. It ranges from 0 (you didn’t find any of them) to 1 (you found all of them). In our students example, a model that just predicted all students would pass would have perfect recall–since it would find all the passing students–but probably wouldn’t have very good precision unless very few students failed.
F1 (aka F-Score): The F score is the (harmonic) mean of both precision and recall. It also ranges from 0 to 1. Like precision and recall, it’s calculated based on a specific class you’re interested in. One thing to note about precision, recall and F1 is that they all don’t count true negatives (when you guessed something wasn’t in a specific class and you were right) so if that’s an important consideration for your model you probably shouldn’t rely on these measures.

Multiclass Classification

Multiclass classification is the task of determining which of three or more classes a specific observation belongs to. Things like predicting which icecream flavor someone will buy or automatically identifying the breed of a dog are multiclass classification.

Confusion Matrix: One of the most common ways to evaluate multiclass classifications is with a confusion matrix, which is a table with the actual labels along one axis and the predicted labels along the other (in the same order). Each cell of the table has a count value for the number of predictions that fell into that category. Correct predictions will fall along the center diagonal. This won’t give you a single summary measure of a system, but it will let you quickly compare performance across different classes.
Cohen’s Kappa: Cohen’s kappa is a measure of how much better than chance a model is at assigning the correct class to an observation. It range from -1 to 1, with higher being better. 0 indicates that the model is at chance levels (i.e. you could do as well just by randomly guessing). (Note that there are some people who will strongly advise against using Cohen’s Kappa.)
Informedness (aka Powers’ Kappa): Informedness tells us how likely we are to make an informed decision rather than a random guess. It is the true positive rate (aka recall) plus the true negative rate, minus 1. Like precision, recall and F1, it’s calculated on a class-by-class basis but we can calculate it for a multiclass classification model by taking the (geometric) mean across all of the classes. It ranges from -1 to 1, with 1 being a model that always makes correct predictions, 0 being a model that makes predictions that are no different than random guesses and -1 being a model that always makes incorrect predictions.

Packages for analysis

For R, the Metrics package and caret package both have implementations of these model metrics, and you’ll often find functions for evaluating more specialized models in the packages that contain the models themselves. In Python, you can find implementations of many of these measurements in the scikit-learn module.

Also, it’s worth noting that any single-value metric can only tell you part of the story about a model. It’s important to consider things besides just accuracy when selecting or training the best model for your needs.

Got other tips and tricks for measuring model error? Did I leave out one of your faves? Feel free to share in the comments. 🙂

Analyzing Multilingual Data

October 17, 2017October 17, 2017 ~ Rachael Tatman ~ Leave a comment

This blog post is a little different from my usual stuff. It’s based on a talk I gave yesterday at the first annual Data Institute Conference. As a result, it’s aimed at a slightly more technical audience than my usual stuff, but I hope I’ve done an ok job keeping it accessible. Feel free to drop me a comment if you have any questions or found anything confusing and I’ll be sure to help you out.

You can play with the code yourself by forking this notebook on Kaggle (you don’t even have to download or install anything :).

There are over 7000 languages in the world, 80% of which have fewer than a million speakers each. In fact, six in ten people on Earth speak a language with less than ten million speakers. In other words: the majority of people on Earth use low-resource languages.

As a result, any large sample of user-generated text is almost guaranteed to have multiple languages in it. So what can you do about it? There are a couple options:

Ignore it
Only look at the parts of the data that are in English
Break the data apart by language & use language-specific tools when available

Let’s take a quick look at the benefits and drawbacks of each approach.

Getting started¶

In [1]:

# import libraries we'll use
import spacy # fast NLP
import pandas as pd # dataframes
import langid # language identification (i.e. what language is this?)
from nltk.classify.textcat import TextCat # language identification from NLTK
from matplotlib.pyplot import plot # not as good as ggplot in R :p

To explore working with multilingual data, let’s look a real-life dataset of user-generated text. This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages. It’s a realistic cross-section of the type of linguistic diversity you’ll see in a large text dataset.

# read in our data
tweetsData = pd.read_csv("../input/all_annotated.tsv", sep = "\t")

# check out some of our tweets
tweetsData['Tweet'][0:5]

0                            Bugün bulusmami lazimdiii
1       Volkan konak adami tribe sokar yemin ederim :D
2                                                  Bed
3    I felt my first flash of violence at some fool...
4              Ladies drink and get in free till 10:30
Name: Tweet, dtype: object

Option 1: Ignore the multilingualism¶

Maybe you’ve got a deadline coming up fast, or maybe you didn’t get a chance to actually look at some of your text data and just decide to treat it as if it were English. What could go wrong?

To find out, let’s use Spacy to tokenize all our tweets and take a look at the longest tokens in our data.

Spacy is an open-source NLP library that is much faster than the Natural Language Toolkit, although it does not have as many tasks implemented. You can find more information in the Spacy documentation.

# create a Spacy document of our tweets
# load an English-language Spacy model
nlp = spacy.load("en")

# apply the english language model to our tweets
doc = nlp(' '.join(tweetsData['Tweet']))

Now let’s look at the longest tokens in our Twitter data.

sorted(doc, key=len, reverse=True)[0:5]

[a7e78d48888a6811d84e0759e9387647447d1e74d8c7c4f1bec00d318e4e5030f08eb35668a97873820ca1d9dc61ffb620f8992296f3b029a60f153beac8018f5fb77d000000,
 e44337d70d7a7fec79a8b6bd8aa573367224023e4272f22af6d0844d9682d5b48062e331b33ab3b92dac2c262ed4f154ba679ad07b30d2cf1c15851cdac901315b4e72000000,
 3064d36c909f9d437f7a3f405aa550f65529566547ae2308d6c4f2585250106d33b924ae9c8dcc08856e41f611d9bd15409a79f7ba21d318ab484f0cae10017201590a000000,
 69bdf5177f1ae8ed61ed71c477f7dc415b97a2b2d7e57be079feb1a2c52600a996fd0891e130c1ce13c94e4406f83ba59e5edb5a7e0fb45e5251a17bb29601081f3de0000000,
 lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3]

The five longest tokens are entire tweets, four produced by an art bot that tweets hashes of Unix timestamps and one that’s just the HTML version of “<3” tweeted a bunch of times. In other words: normal Twitter weirdness. This is actual noise in the data and can be safely discarded without hurting downstream tasks, like sentiment analysis or topic modeling.

sorted(doc, key=len, reverse=True)[6:10]

[卒業したった(*^^*)\n彼女にクラスで一緒にいるやつに\nたった一人の同中の拓夢とも写真撮れたし満足や！(^｡^)時間ギリギリまでテニスやってたからテニス部面と写真撮ってねーわ‼︎まぁこいつらわこれからも付き合いあるだろうからいいか！,
 眼鏡は近視用で黒のセルフレームかアンダーリムでお願いします。オフの日は赤いセルフレームです。形状はサークルでお願いします。30代前半です。髪型ボブカットもしくはティモシェンコ元ウクライナ首相みたいなので。色は黒目でとりあえずお願いします,
 普段は写真撮られるの苦手なので、\n\n顔も出さずw\n\n登場回数少ないですが、\n\n元気にampで働いておりますw\n\n一応こんな人が更新してますのでw\n\n#takahiromiyashitathesolois,
 love#instagood#me#cute#tbt#photooftheday#instamood#tweegram#iphonesia#picoftheday#igers#summer#girl#insta]

The next five longest tokens are also whole tweets which have been identified as single tokens. In this case, though, they were produced by humans!

The tokenizer (which assumes it will be given mainly English data) fails to correct tokenize these tweets because it’s looking for spaces. These tweets are in Japanese, though, and like many Asian languages (including all varieties of Chinese, Korean and Thai) they don’t actually use spaces between words.

In case you’re curious, “、” and “。” are single characters and don’t contain spaces! They are, respectively, the ideographic comma and ideographic full stop, and are part of a very long list of line breaking characters associated with specific orthographic systems.

In order to correctly tokenize Japanese, you’ll need to use a language-specific tokenizer.

The takeaway: if you ignore multiple languages, you’ll end up violating the assumptions behind major out-of-the-box NLP tools¶

Option 2: Only look at the parts of the data that are in English¶

So we know that just applying NLP tools designed for English willy-nilly won’t work on multiple languages. So what if we only grabbed the English-language data and then worked with that?

There are two big issues here:

Correctly identifying which tweets are in English
Throwing away data

Correctly identifying which tweets are in English¶

Probably the least time-intensive way to do this is by attempting to automatically identify the language that each Tweet is written in. A BIG grain of salt here: automatic language identifiers are very error prone, especially on very short texts. Let’s check out two of them.

LangID: Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, pp. 553—561. Available from http://www.aclweb.org/anthology/I11-1062
TextCat: Cavnar, W. B. and J. M. Trenkle, “N-Gram-Based Text Categorization” In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.

First off, here are the languages the first five tweets are actually written in, hand tagged by a linguist (i.e. me):

Turkish
Turkish
English
English
English

Now let’s see how well two popular language identifiers can detect this.

# summerize the labelled language
tweetsData['Tweet'][0:5].apply(langid.classify)

0     (az, -30.30187177658081)
1     (ms, -83.29260611534119)
2      (en, 9.061840057373047)
3    (en, -195.55468368530273)
4     (en, -98.53013229370117)
Name: Tweet, dtype: object

LangID does…alright, with three out of five tweets identified correctly. While it’s pretty good at identifying English, the first tweet was identified as Azerbaijani and the second tweet was labeled as Malay, which is very wrong (not even in the same language family as Turkish).

Let’s look at another algorithm, TextCat, which is based on character-level N-Grams.

# N-Gram-Based Text Categorization
tc = TextCat()

# try to identify the languages of the first five tweets again
tweetsData['Tweet'][0:5].apply(tc.guess_language)

0    tur
1    ind
2    bre
3    eng
4    eng
Name: Tweet, dtype: object

TextCat also only got three out of the five correct. Oddly, it identifier “bed” as Breton. To be fair, “bed” is the Breton word for “world”, but it’s still a bit odd.

Throwing away data¶

Even if language identification were very accurate, how much data would be just be throwing away if we only looked at data we were fairly sure was English?

Note: I’m only going to LangID here for time reasons, but given the high error rate I’d recommend using multiple language identification algorithms.

# get the language id for each text
ids_langid = tweetsData['Tweet'].apply(langid.classify)

# get just the language label
langs = ids_langid.apply(lambda tuple: tuple[0])

# how many unique language labels were applied?
print("Number of tagged languages (estimated):")
print(len(langs.unique()))

# percent of the total dataset in English
print("Percent of data in English (estimated):")
print((sum(langs=="en")/len(langs))*100)

Number of tagged languages (estimated):
95
Percent of data in English (estimated):
40.963625976

Only 40% of our data has been tagged as English by LangId. If we throw the rest of it, we’re going to lose more than half of our dataset! Especially if this is data you spent a lot of time and money collecting, that seems downright wasteful. (Plus, it might skew our analysis.)

So if 40% of our data is in English, what is the other 60% made up of? Let’s check out the distribution data across languages in our dataset.

# convert our list of languages to a dataframe
langs_df = pd.DataFrame(langs)

# count the number of times we see each language
langs_count = langs_df.Tweet.value_counts()

# horrible-looking barplot (I would suggest using R for visualization)
langs_count.plot.bar(figsize=(20,10), fontsize=20)

There’s a really long tail on our dataset; most that were identified in our dataset were only identified a few times. This means that we can get a lot of mileage out of including just a few more popular languages in our analysis. How much will we gain, exactly?

print("Languages with more than 400 tweets in our dataset:")
print(langs_count[langs_count > 400])

print("")

print("Percent of our dataset in these languages:")
print((sum(langs_count[langs_count > 400])/len(langs)) * 100)

Languages with more than 400 tweets in our dataset:
en    4302
es    1020
pt     751
ja     436
tr     414
id     407
Name: Tweet, dtype: int64

Percent of our dataset in these languages:
69.7962292897

By including only five more languages in our analysis (Spanish, Portugese, Japanese, Turkish and Indonesian) we can increase our coverage of the data in our dataset by almost a third!

The takeaway: Just incorporating a couple more languages in your analysis can give you access to a lot more data!¶

Option 3: Break the data apart by language & use language-specific tools¶

Ok, so what exactly does this pipeline look like? Let’s look at just the second most popular language in our dataset: Spanish. What happens when we pull out just the Spanish tweets & tokenize them?

# get a list of tweets labelled "es" by langid
spanish_tweets = tweetsData['Tweet'][langs == "es"]

# load a Spanish-language Spacy model
from spacy.es import Spanish
nlp_es = Spanish(path=None)

# apply the Spanish language model to our tweets
doc_es = nlp_es(' '.join(spanish_tweets))

# print the longest tokens
sorted(doc_es, key=len, reverse=True)[0:5]

[ViernesSantoEnElColiseoRobertoClemente,
 MiFantasia1DEnWembleyConCocaColaFM,
 fortaleciéndonos','escenarios,
 DirectionersConCocaColaFM1D,
 http://t.co/ezZEsXN3MF\nvia]

This time, the longest tokens are Spanish-language hashtags. This is exactly the sort of thing we’d expect to see! From here, we can use this tokenized dataset to feed into other downstream like sentiment analysis.

Of course, it would be impractical to do this for every single language in our dataset, even if we could be sure that they were all identified correctly. You’re probably going to have to accept that you probably won’t be able to consider every language in your dataset unless you can commit a lot of time. But including any additional language will enrich your analysis!

The takeaway: It doesn’t have to be onerous to incorporate multiple languages in your analysis pipeline!¶

So let’s review our options for analyzing multilingual data:¶

Option 1: Ignore Multilingualism

As we saw, this option will result in violating a lot of the assumptions built into NLP tools (e.g. there are spaces between words). If you do this, you’ll end up with a lot of noise and headaches as you try to move through your analysis pipeline.

Option 2: Only look at English

In this dataset, only looking at English would have led to us throwing away over half of our data. Especailly as NLP tools are developed and made avaliable for more and more languages, there’s less reason to stick to English-only NLP.

Option 3: Seperate your data by language & analyze them independently

This does take a little more work than the other options… but not that much more, especially for languages that already have resources avalialbe for them.

Additional resources:¶

Language Identification:

Here are some pre-built language identifiers to use in addition to LandID and TextCat:

detect_language() from TextBlob. Uses a Google API & requires Internet access.
langdetect Python library: Port of Google’s language-detection library (version from 03/03/2014) to Python.

Dealing with texts which contain multiple languages (code switching):

It’s very common for a span of text to include multiple languages. This example contains English and Malay (“kain kain” is Malay for “unwrap”):

Roasted Chicken Rice with Egg. Kain kain! 🙂 [Image of a lunch wrapped in paper being unwrapped.]

How to automatically handle code switching is an active research question in NLP. Here are some resources to get you started learning more:

Resources for code-switching: Including training data and corpora
First Workshop on Computational Approaches to Code Switching: Includes information on a shared task on identifying & labeling code switching
Overview for the Second Shared Task on Language Identification in
Code-Switched Data: Results of the second shared task on code-switching, with an overview of different approaches.

Disciplines where language is the main focus

Disciplines where language is sometimes studied

Share this:

What you can do with social media data

What you can’t do with social media data

What you shouldn’t do with social media data

Share this:

What’s entropy?

Experiment

Hypothesis

Data

Analysis

Which emoji were repeated the most/least often?

More info & further work

Share this:

NYC Gifted & Talented Scores

Share this:

Parity of Utility vs. Overall Performance

Share this:

Share this:

Regression

Binary Classification

Multiclass Classification

Packages for analysis

Share this:

Getting started¶

Option 1: Ignore the multilingualism¶

The takeaway: if you ignore multiple languages, you’ll end up violating the assumptions behind major out-of-the-box NLP tools¶

Option 2: Only look at the parts of the data that are in English¶

Correctly identifying which tweets are in English¶

The takeaway: Automatic language identification, especially on very short texts, is very error prone. (I’d recommend using multiple language identifiers & taking the majority vote.)¶

Throwing away data¶

The takeaway: Just incorporating a couple more languages in your analysis can give you access to a lot more data!¶

Option 3: Break the data apart by language & use language-specific tools¶

The takeaway: It doesn’t have to be onerous to incorporate multiple languages in your analysis pipeline!¶

So let’s review our options for analyzing multilingual data:¶

Additional resources:¶

Share this: