What you can, can’t and shouldn’t do with social media data

August 31, 2018 ~ Rachael Tatman

Earlier this summer, I gave a talk on the promise & pitfalls of social media data for the Joint Statistical Meetings. While I don’t think there’s a recording of the talk, enough people asked for one that I figured it would be worth putting together a blog post version of the talk. Enjoy!

What you can do with social media data

Let’s start with the good news: research using social media data has revolutionized social science research. It’s let us ask bigger questions more quickly, helped us overcome some of the key drawbacks of behavioral experimental work, and let us ask new kinds of questions.

More data faster

I can’t overstate how revolutionary the easy availability of social media data has been, especially in linguistics. It has increased both the rate and scale of data collection by orders of magnitude. Compare the time it took to compile the Dictionary of American Regional English (DARE) to the Wordmapper app below. The results are more or less the same: maps of where in the US folks use different words (in this example, “cellar”). But what once took the entire careers of multiple researchers can now be done in a few months, and with far higher resolution.

	Dictionary of American Regional English (DARE)	Word Mapper App
Data collection	48 years (1965 – 2013)	<1 year
Size of team	2,777 people	4 people
Number of participants	1,843 people	20 million

Social networks

Social media sites with a following or friend feature also let us ask really large scale questions about social networks. How do social networks and political affiliation interact? How does language change move through a social network? What characteristics of social network structure are more closely associated with the spread of misinformation? Of course, we could ask these questions before social media data… but by using APIs to access social media data, we reduce the timescale of these projects from decades to weeks or even days and we have a clear way to operationalize social network ties. It’s fairly hard for someone to sit down and list everyone they interact with face-to-face, but it’s very easy to grab a list of all the Twitter accounts you follow.

Wild-caught, all natural data

One of the constant struggles in experimental work is the fact that the mere fact of being observed changes behavior. This is known as the Hawthorne Effect in psychology or the Observer’s Paradox in sociolinguistics. As a result, even the most well-designed experiment is limited by the fact that the participants know that they are completing an experiment.

Social media data, however, doesn’t have this limitation. Since most social media research projects are conducted on public data without interacting directly with participants, they are not generally considered human subjects research. When you post something on a public social media account, you don’t have a reasonable expectation of privacy. In other words, you know that just anyone could come along and read it, and that includes researchers. As a result it is not generally necessary to collect informed consent for social media projects. (Informed consent is when you are told exactly what’s going to happen during an experiment you’re participating in, and you agree to participate in it.) This means that the vast majority of folks who are participating in a social media study don’t actually know that they’re part of a study.

The benefit of this is that it allows researchers to get around three common confounds that plague social science research:

Bradley effect: People tend to tell researchers what they think they want to hear
Response bias: The sample of people willing to do an experiment/survey differs in a meaningful way from the population as a whole
Observer’s paradox/Hawthorne effect: People change their behavior when they know they’re being observed

While this is a boon to researchers, the lack of informed consent does introduce other problems, which we’ll talk about later.

What you can’t do with social media data

Of course, all the benefits of social media come at a cost. There are several key drawbacks and limitations of social media research:

You can’t be sure who your participants are.
There’s inherent sampling bias.
You can’t violate the developer’s agreements.

You’re not sure who you’re studying…

Because you don’t meet with the people whose data is included in your study, you don’t know for sure what sorts of demographic categories they belong to, whether they are who they’re claiming to be or even if they’re human at all. You have to deal with both bots (accounts where content is produced and distributed automatically by a computer) and sock puppets (where one person pretends to be another person). Sock puppets in particular can be very difficult to spot and may skew your results in unpredictable ways.

…but you can be sure your sample is biased.

Social media users aren’t randomly drawn from the world’s population as a whole. Social media users tend to be WEIRD: from Western, educated, industrialized, rich and democratic societies. This group is already over-represented in social science and psychology research studies, which may be subtly skewing our models of human behavior.

In addition, different social media platforms have different user bases. For example, Instagram and Snapchat tend to have younger users, Pinterest has more women (especially compared to Reddit, which skews male) and LinkedIn users tend to be highly educated and upper middle class. And that doesn’t even get to social network effects: you’re more likely to be on the same platform your friends are on, and since social networks tend to be homophilous, you can end up with pockets of very socially homogeneous folks. So, even if you manage to sample randomly from a social media platform, your sample is likely to differ from one taken from the population as a whole.

You need to abide by the developer’s agreements for whatever platform you’re using data from.

This is mainly an issue if you’re using an API (application programming interface) to fetch data from a service. Developer’s agreements vary between platforms, but most limit the amount of data you can fetch and store, and whether and how you can share it with other researchers. For example, if you’re sharing Twitter data you can only share 50,000 tweets at a time, and even then only if you have people download a file by clicking on it. If you share any more than that, you should just share the IDs of the tweets rather than the full tweets. (Documenting the Now’s Hydrator can help you fetch the tweets associated with a set of IDs.)
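Sharing only IDs is simple to do in practice. Here is a minimal sketch in Python, assuming each tweet is a dictionary shaped like the JSON the Twitter API returns, with the ID stored as a string under "id_str". The function name dehydrate and the sample tweets are made up for illustration.

```python
def dehydrate(tweets):
    """Keep only the ID of each tweet, dropping text and user details."""
    return [tweet["id_str"] for tweet in tweets]


# Hypothetical tweets; a real collection would come from the API.
collected = [
    {"id_str": "1001", "text": "first example tweet", "user": {"screen_name": "a"}},
    {"id_str": "1002", "text": "second example tweet", "user": {"screen_name": "b"}},
]

# One ID per line is a common format for sharing.
print("\n".join(dehydrate(collected)))
```

Anyone you share the ID list with can then re-fetch the tweets themselves, which also means tweets that have since been deleted or protected drop out of their copy.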

What you shouldn’t do with social media data

Finally, there are ethical restrictions on what we should do with social media data. As researchers, we need to 1) respect the wishes of users and 2) safeguard their best interests, especially given that we don’t (currently) generally get informed consent from the folks whose data we’re collecting.

Respecting users’ wishes

At least in the US, ethical human subjects research is led by three guiding principles set forth in the Belmont report. If you’re unfamiliar with the report, it was written in the aftermath of the Tuskegee syphilis experiments. These were a series of medical experiments on African American men who had contracted syphilis, conducted from the 1930’s to the 1970’s. During the study, researchers withheld the cure (and even information that it existed) from the participants. The study directly resulted in the preventable deaths of 128 men and many health problems for study participants, their wives and children. It was a clear ethical violation of the human rights of participants and the moral stain of it continues to shape how we conduct human subjects research in the US.

The three principles of ethical human subjects research are:

Respect for Persons: People should be treated as autonomous individuals and persons with diminished autonomy (like children or prisoners) are entitled to protection.
Beneficence: 1) Do no harm and 2) maximize possible benefits and minimize possible harms.
Justice: Both the risks and benefits of research should be distributed equally.

Social media research might not technically fall under the heading of human subjects research, since we aren’t intervening with our participants. However, I still believe that it’s important that researchers follow these general guidelines when designing and distributing experiments.

One thing we can do is respect the wishes of the communities we study. Fortunately, we have some evidence of what those wishes are. Fiesler and Proferes (2018) surveyed 368 Twitter users on their perception of a variety of research behaviors.

Fiesler, C., & Proferes, N. (2018). “Participant” Perceptions of Twitter Research Ethics. Social Media + Society, 4(1), 2056305118763366. Table 4.

In general, Twitter users are more OK with research with the following characteristics:

Large datasets
Analyzed automatically
Social media users informed about research
If tweets are quoted, they are anonymized. (Note that if you include the exact text, it’s possible to reverse search the quoted tweet and de-anonymize it. I recommend changing at least 20% of the content words in a tweet to synonyms to get around this and double-checking by trying to de-anonymize it yourself.)

These characteristics, however, are not as acceptable to Twitter users:

Small datasets
Analysis done by hand (presumably including analysis by Mechanical Turk workers)
Tweets from protected accounts or deleted tweets analyzed (which is also against the developer’s agreement, so you shouldn’t be doing this anyway)
Quoting with citation (very different from academic norms!)

In general, I think these suggest general best practices for researchers working with Twitter data.

Stick to larger datasets
Try to automate wherever possible
Follow the developer’s agreement
Take anonymity seriously.

There is one thing I disagree with, however: I don’t think we should contact everyone whose tweets we use in our research.

Should we contact people whose tweets we use in our studies? My gut instinct on this one is “no”. If you’re collecting a large amount of data, you probably shouldn’t reach out to everyone in the data.

For users who don’t have open DM’s, the only way to contact them is to publicly mention them using @username. The problem with this is that it partly de-anonymizes your data. If you then choose to share your data, having publicly shared a list of whose data was included makes the dataset much easier to de-anonymize. Instead of trying to figure out whose tweets were included when looking at all of Twitter, an adversary only has to figure out which of the users on the list you’ve given them is connected to which record.

The main exception to this is if you have a project that’s a deep dive on one user, in which case you probably should. (For example, I contacted Chaz Smith and let him know about my phonological analysis of his #pronouncingthingsincorrectly Vines.)

Do no harm

Another aspect of ethical research is trying to ensure that your research or research data doesn’t have potentially unethical applications. The elephant in the room here, of course, is the data Cambridge Analytica collected from Facebook users. Researchers at Cambridge, collecting data for a research project, got lots of people’s permission to access their Facebook data. While that wasn’t a problem, they collected and saved Facebook data from other folks as well, who hadn’t opted in. In the end, only a half of a half of a percent of the folks whose data was in the final dataset actually agreed to be included in it. To make matters worse, this data was used by a commercial company founded by one of the researchers to (possibly) influence elections in the US and UK. Here’s a New York Times article that goes into much more detail. This has understandably led to increased scrutiny of how social media research data is collected and used.

I’m not bringing this up to call out Facebook in particular, but to explain why it’s important to consider how research data might be used long-term. How and where will it be stored? For how long? Who will have access to it? In short, if you’re a researcher, how can you ensure that data you collected won’t end up somehow hurting the people you collected it from?

As an example of how important these questions are, consider this OK Cupid “research” dataset. It was collected without consent and shared publicly without anonymization. It included many personal details that were only intended to be shared with other users of the site, including explicit statements of sexual orientation. In addition to being an unforgivable breach of privacy, this directly endangered users whose data was collected: information on sexual orientation was shared for people living in countries where homosexuality is a crime that carries a death penalty or sentence of life in prison. I have a lot of other issues with this “study” as well, but the fact that it directly endangered research subjects who had no chance to opt out is by far the most egregious ethical breach.

If you are collecting social media data for research purposes, it is your ethical responsibility to safeguard the well-being of the people whose data you’re using.

I bring up these cautionary tales not to scare you off of social media research but to really impress the gravity of the responsibility you carry as a social media researcher. Social media data has the potential to dramatically improve our understanding of the world. A lot of my own work has relied heavily on it! But it’s important that we, as researchers, take our moral duty to make sure that we don’t end up doing more harm than good very seriously.

Are emoji sequences as informative as text?

July 7, 2018 ~ Rachael Tatman

Something I’ve been thinking about a lot lately is how much information we really convey with emoji. I was recently at the 1st International Workshop on Emoji Understanding and Applications in Social Media and one theme that stood out to me from the papers was that emoji tend to be used more to communicate social meaning (things like tone and when a conversation is over) than semantics (content stuff like “this is a dog” or “an icecream truck”).

I’ve been itching to apply an information theoretic approach to emoji use for a while, and this seemed like the perfect opportunity. Information theory is the study of storing, transmitting and, most importantly for this project, quantifying information. In other words, using an information theoretic approach we can actually look at two input texts and figure out which one has more information in it. And that’s just what we’re going to do: we’re going to use a measure called “entropy” to directly compare the amount of information in text and emoji.

What’s entropy?

Shannon entropy is a measure of how much information there is in a sequence. Higher entropy means that there’s more uncertainty about what comes next, while lower entropy means there’s less uncertainty. (Mathematically, entropy is always less than or the same as log₂(n), where n is the total number of unique characters. You can learn more about calculating entropy and play around with an interactive calculator here if you’re curious.)

So if you have a string of text that’s just one character repeated over and over (like this: 💀💀💀💀💀) you don’t need a lot of extra information to know what the next character will be: it will always be the same thing. So the string “💀💀💀💀💀” has a very low entropy. In this case it’s actually 0, which means that if you’re going through the string and predicting what comes next, you’re always going to be able to guess what comes next because it’s always the same thing. On the other hand, if you have a string that’s made up of four different characters, all of which are equally probable (like this: ♢♡♧♤♡♧♤♢), then you’ll have an entropy of 2.

TL;DR: The higher the entropy of a string the more information is in it.
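To make the two examples above concrete, here is a minimal sketch of character-level Shannon entropy in Python. The function name shannon_entropy is made up for illustration and is not taken from the analysis code linked later in the post. It treats every Unicode code point as one character, so an emoji built from several code points (a skin-tone modifier, say) would count as more than one.

```python
from collections import Counter
from math import log2


def shannon_entropy(sequence):
    """Shannon entropy of a string, in bits per character."""
    total = len(sequence)
    if total == 0:
        return 0.0
    counts = Counter(sequence)
    # Sum p * log2(1/p) over each distinct character's probability p.
    return sum((n / total) * log2(total / n) for n in counts.values())


print(shannon_entropy("💀💀💀💀💀"))     # 0.0
print(shannon_entropy("♢♡♧♤♡♧♤♢"))  # 2.0
```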

Experiment

Hypothesis

We do have some theoretical maximums for the entropy of text and emoji. For text, if the text string is just randomly drawn from the 128 ASCII characters (which isn’t how language works, but this is just an approximation) our entropy would be 7. On the other hand, for emoji, if people are just randomly using any emoji they like from the set of emoji as of June 2017, then we’d expect to see an entropy of around 11.

So if people are just using letters or emoji randomly, then text should have lower entropy than emoji. However, I don’t think that’s what’s happening. My hypothesis, based on the amount of repetition in emoji, was that emoji should have lower entropy, i.e. less information, than text.
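Those two ceilings come straight from the log₂(n) bound on entropy. As a quick check, here is a small Python sketch. The emoji count of 2,666 is an assumption (it is the commonly cited size of the emoji set in mid-2017, not a number given in this post), and the function name max_entropy is made up for illustration.

```python
from math import log2


def max_entropy(alphabet_size):
    """Upper bound on entropy (in bits) when every symbol is equally likely."""
    return log2(alphabet_size)


ASCII_CHARACTERS = 128
EMOJI_JUNE_2017 = 2666  # assumed count, see the note above

print(max_entropy(ASCII_CHARACTERS))           # 7.0
print(round(max_entropy(EMOJI_JUNE_2017), 2))  # 11.38
```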

Data

To get emoji and text spans for our experiment I used four different datasets: three from Twitter and one from YouTube.

I used multiple datasets for a couple reasons. First, because I wanted a really large dataset of tweets with emoji, and since only between 0.5% and 0.9% of tweets from each Twitter dataset actually contained emoji I needed to cast a wide net. And, second, because I’m growing increasingly concerned about genre effects in NLP research. (Like, a lot of our research is on Twitter data. Which is fine, but I’m worried that we’re narrowing the potential applications of our research because of it.) It’s the second reason that led me to include YouTube data. I used Twitter data for my initial exploration and then used the YouTube data to validate my findings.

For each dataset, I grabbed all adjacent emoji from a tweet and stored them separately. So this tweet:

Love going to ballgames! ⚾🌭 Going home to work in my garden now, tho 🌸🌸🌸🌸

Has two spans in it:

Span 1: ⚾🌭

Span 2: 🌸🌸🌸🌸
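Pulling out spans like these can be sketched with a regular expression. This is a rough approximation and not the method used in the analysis: the pattern only covers a few Unicode ranges where many emoji live (plus the variation selector and zero-width joiner that glue some emoji together), so it will miss things like flags. The names EMOJI_SPAN and emoji_spans are made up for illustration.

```python
import re

# Approximate "emoji" as: Miscellaneous Symbols and Dingbats (U+2600-U+27BF),
# the main pictograph blocks (U+1F300-U+1FAFF), plus the variation selector
# (U+FE0F) and zero-width joiner (U+200D) that appear inside some emoji.
EMOJI_SPAN = re.compile("[\u2600-\u27BF\U0001F300-\U0001FAFF\uFE0F\u200D]+")


def emoji_spans(text):
    """Return each run of adjacent emoji in the text as its own string."""
    return EMOJI_SPAN.findall(text)


tweet = "Love going to ballgames! ⚾🌭 Going home to work in my garden now, tho 🌸🌸🌸🌸"
for number, span in enumerate(emoji_spans(tweet), start=1):
    print(f"Span {number}: {span}")
```

Emoji separated by a space or any other character end up in different spans, which matches the "adjacent emoji" definition above.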

All told, I ended up with 13,825 tweets with emoji and 18,717 emoji spans of which only 4,713 were longer than one emoji. (I ignored all the emoji spans of length one, since they’ll always have an entropy of 0 and aren’t that interesting to me.) For the YouTube comments, I ended up with 88,629 comments with emoji, 115,707 emoji spans and 47,138 spans with a length greater than one.

In order to look at text as parallel as possible to my emoji spans, I grabbed tweets & YouTube comments without emoji. For each genre, I took a number of texts equal to the number of spans of length > 1 and then calculated the character-level entropy for the emoji spans and the texts.

Analysis

First, let’s look at Tweets. Here’s the density (it’s like a smooth histogram, where the area under the curve is always equal to 1 for each group) of the entropy of an equivalent number of emoji spans and tweets.

Text has a much higher character-level entropy than emoji. For text, the mean and median entropy are both around 5. For emoji, there is a multimodal distribution, with the median entropy being 0 and additional clusters around 1 and 1.5.

It looks like my hypothesis was right! At least in tweets, text has much more information than emoji. In fact, the most common entropy for an emoji span is 0, which means that most emoji spans with a length greater than one are just repetitions of the same emoji over and over again.

But is this just true on Twitter, or does it extend to YouTube comments as well?

The pattern for emoji & text in YouTube comments is very similar to that for Tweets. The biggest difference is that it looks like there’s less information in YouTube Comments that are text-based; they have a mean and median entropy closer to 4 than 5.

The YouTube data, which we have almost ten times more of, corroborates the earlier finding: emoji spans are less informative, and more repetitive, than text.

Which emoji were repeated the most/least often?

Just in case you were wondering, the emoji most likely to be repeated was the skull emoji, 💀. It’s generally used to convey strong negative emotion, especially embarrassment, awkwardness or speechlessness, similar to “ded”.

OMFFFFFFFFFG……….how you gonna put me on blast like that @Oreo!!!!

Hahahhaha! 💀💀💀💀💀🤣😂 https://t.co/eJ1igiqJ9W

— Jaremi Carey (@PhiPhiOhara) July 5, 2018

The least likely was the right-pointing arrow (▶️), which is usually used in front of links to videos.

What are we naming the @arianagrande + @nickiminaj super group?

1️⃣ Nickari
2️⃣ Aricki
3️⃣ Minajagrande
4️⃣ Granaj

Watch their video #TheLightIsComing now. 💡💡
▶️https://t.co/nwLHtZ2V86 pic.twitter.com/aj6RjI1IRE

— Vevo (@Vevo) July 5, 2018

More info & further work

If you’re interested, the code for my analysis is available here. I also did some of this work as live coding, which you can follow along with on YouTube here.

For future work, I’m planning on looking at which kinds of emoji are more likely to be repeated. My intuition is that gestural emoji (so anything with a hand or face) are more likely to be repeated than other types of emoji–which would definitely add some fuel to the “are emoji words or gestures” debate!

How do we use emoji?

March 17, 2018 (updated March 22, 2018) ~ Rachael Tatman

Those of you who know me may know that I’m a big fan of emoji. I’m also a big fan of linguistics and NLP, so, naturally, I’m very curious about the linguistic roles of emoji. Since I figured some of you might also be curious, I’ve pulled together a discussion of some of the very serious scholarly research on emoji. In particular, I’m going to talk about five recent papers that explore the exact linguistic nature of these symbols: what are they and how do we use them?

Dürscheid & Siever, 2017:

This paper makes one overarching point: emoji are not words. They cannot be unambiguously interpreted without supporting text and they do not have clear syntactic relationships to one another. Rather, the authors consider emoji to be specialized characters, and place them within Gallmann’s 1985 hierarchy of graphical signs. The authors show that emoji can play a range of roles within Gallmann’s functional classification.

Allography: using emoji to replace specific characters (for example: the word “emoji” written as “em😝ji”)
Ideograms: using emoji to replace a specific word (example: “I’m travelling by 🚘” to mean “I’m travelling by car”)
Border and Sentence Intention signals: using emoji both to clarify the tone of the preceding sentence and also to show that the sentence is over, often replacing the final punctuation marks.

Based on an analysis of a Swiss German WhatsApp corpus, the authors conclude that the final category is far and away the most popular, and that emoji rarely replace any of the lexical parts of a message.

Na’aman et al, 2017:

Na’aman and co-authors also develop a hierarchy of emoji usage, with three top-level categories: Function, Content (both of which would mostly fall under the ideogram category in Dürscheid & Siever’s classifications) and Multimodal.

Function: Emoji replacing function words, including prepositions, auxiliary verbs, conjunctions, determinatives and punctuation. An example of this category would be “I like 🍩 you”, to be read as “I do not like you”.
Content: Emoji replacing content words and phrases, including nouns, verbs, adjectives and adverbs. An example of this would be “The 🔑 to success”, to be read as “the key to success”.
Multimodal: These emoji “enrich a grammatically-complete text with markers of affect or stance”. These would fall under the category of border signals in Dürscheid & Siever’s framework, but Na’aman et al. further divide these into four categories: attitude, topic, gesture and other.

Based on analysis of a Twitter corpus made up of only tweets containing emoji, the authors find that multimodal emoji encoding attitude are far and away the most common, making up over 50% of the emoji spans in their corpus. The next most common uses of emoji are multimodal:topic and multimodal:gesture. Together, these three categories account for close to 90% of all the emoji use in the corpus, corroborating the findings of Dürscheid & Siever.

Wood & Ruder, 2016:

Wood and Ruder provide further evidence that emoji are used to express emotion (or “attitude”, in Na’aman et al’s terms). They found a strong correlation between the presence of emoji that they had previously determined were associated with a particular emotion, like 😂 for joy or 😭 for sadness, and human annotations of the emotion expressed in those tweets. In addition, an emotion classifier using only emoji as input performed similarly to one trained using n-grams excluding emoji. This provides evidence that there is an established relationship between specific emoji use and expressing emotion.

Donato & Paggio, 2017:

However, the relationship between text and emoji may not always be so close. Donato & Paggio collected a corpus of tweets which contained at least one emoji and that were hand-annotated for whether the emoji was redundant given the text of the tweet. For example, “We’ll always have Beer. I’ll see to it. I got your back on that one. 🍺” would be redundant, while “Hopin for the best 🎓” would not be, since the beer emoji expresses content already expressed in the tweet, while the mortarboard adds new information (that the person is hoping to graduate, perhaps). The majority of emoji, close to 60%, were found not to be redundant and added new information to the tweet.

However, the corpus was intentionally balanced between ten topic areas, of which only one was feelings, and as a result the majority of feeling-related tweets were excluded from analysis. Based on this analysis and Wood and Ruder’s work, we might hypothesize that feelings-related emoji may be more redundant than emoji from other semantic categories.

Barbieri et al, 2017:

Additional evidence for the idea that emoji, especially those that show emotion, are predictable given the text surrounding them comes from Barbieri et al. In their task, they removed the emoji from a thousand tweets that contained one of the following five emoji: 😂, ❤️, 😍, 💯 or 🔥. These emoji were selected since they were the most common in the larger dataset of half a million tweets. They then asked human crowd workers to fill in the missing emoji given the text of the tweet, and trained a character-level bidirectional LSTM to do the same task. Both humans and the LSTM performed well over chance, with an F1 score of 0.50 for the humans and 0.65 for the LSTM.

So that was a lot of papers and results I just threw at you. What’s the big picture? There are two main points I want you to take away from this post:

People mostly use emoji to express emotion. You’ll see people playing around more than that, sure, but by far the most common use is to make sure people know what emotion you’re expressing with a specific message.
Emoji, particularly emoji that are used to represent emotions, are predictable given the text of the message. It’s pretty rare for us to actually use emoji to introduce new information, and we generally only do that when we’re using emoji that have a specific, transparent meaning.

If you’re interested in reading more, here are all the papers I mentioned in this post:

Bibliography:

Barbieri, F., Ballesteros, M., & Saggion, H. (2017). Are Emojis Predictable? EACL.

Donato, G., & Paggio, P. (2017). Investigating Redundancy in Emoji Use: Study on a Twitter Based Corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (pp. 118-126).

Dürscheid, C., & Siever, C. M. (2017). Beyond the Alphabet–Communication of Emojis. Kurzfassung eines (auf Deutsch) zur Publikation eingereichten Manuskripts.

Gallmann, P. (1985). Graphische Elemente der geschriebenen Sprache. Grundlagen für eine Reform der Orthographie. Tübingen: Niemeyer.

Na’aman, N., Provenza, H., & Montoya, O. (2017). Varying Linguistic Purposes of Emoji in (Twitter) Context. In Proceedings of ACL 2017, Student Research Workshop (pp. 136-141).

Wood, I. & Ruder, S. (2016). Emoji as Emotion Tags for Tweets. Sánchez-Rada, J. F., & Schuller, B (Eds.). In Proceedings of LREC 2016, Workshop on Emotion and Sentiment Analysis (pp. 76-80).

A Field Guide to Character Encodings

January 28, 2018 ~ Rachael Tatman

I recently gave a talk at PyCascades (a regional Python language conference) on character encodings and I thought it would be nice to put together a little primer on a couple different important character encodings.

If you’re unfamiliar with character encodings, they’re just a variety of different systems used to map a string of binary (i.e. 1’s and 0’s) to a specific character. So the Euro character, €, would be represented as “111000101000001010101100” in a character encoding called UTF-8 but “10100100” in the Latin 9 encoding. There are a lot of different character encodings out there, so I’m just going to cover a handful that I think are especially interesting/important.
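You can check those two bit patterns yourself. Here is a small Python sketch that encodes a character and prints the resulting bytes as 1’s and 0’s. The helper name to_bits is made up for illustration; "iso8859_15" is the name Python uses for the Latin 9 codec.

```python
def to_bits(char, encoding):
    """Encode a character and return its bytes as a string of 1's and 0's."""
    return "".join(f"{byte:08b}" for byte in char.encode(encoding))


print(to_bits("€", "utf-8"))       # 111000101000001010101100
print(to_bits("€", "iso8859_15"))  # 10100100
```

The same character takes three bytes in UTF-8 but only one in Latin 9, which is why you need to know the encoding before you can read the bytes back.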

Name: ASCII
Created: 1960
Also known as: American Standard Code for Information Interchange, US-ASCII
Most often seen: In legacy systems, especially U.S. government databases that need to be backward compatible

ASCII was the first widely-used character encoding. It has space for only 128 characters, and is best suited for English-language data.

Name: ISO 8859-1
Created: 1985
Also known as: Latin 1, code page 819, iso-ir-100, csISOLatin1, latin1, l1, IBM819, WE8ISO8859P1
Most often seen: Representing languages not covered by ASCII (like Spanish and Portuguese).

Latin 1 is the most popular of a large set of character encodings developed by the ISO (International Organization for Standardization). These encodings add extra characters to deal with the fact that ASCII only really works well for English. They did this by adding one extra bit to each character (8 instead of the 7 ASCII uses) so that they had space for 256 characters per encoding. However, since there are a lot of characters out there, there are 16 different ISO encodings that can handle different alphabets. (For example, ISO 8859-5 handles Cyrillic characters, while ISO 8859-11 maps Thai characters.)
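One consequence of having a family of 256-character encodings is that the same byte means different things depending on which one you pick. A quick Python sketch, using the codec names from Python’s standard library and an arbitrarily chosen byte (0xE4):

```python
def decode_with(raw, encodings):
    """Decode the same bytes under several encodings and return the results."""
    return {encoding: raw.decode(encoding) for encoding in encodings}


# One byte, three ISO 8859 encodings, three different characters.
results = decode_with(b"\xe4", ["iso8859_1", "iso8859_5", "iso8859_11"])
for encoding, character in results.items():
    print(encoding, character)
```

Under Latin 1 the byte is a Latin letter with an umlaut, under ISO 8859-5 it is a Cyrillic letter, and under ISO 8859-11 it is a Thai vowel sign.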

Name: Windows-1252
Created: 1985
Also known as: CP-1252, Latin 1/ISO 8859-1 (which it isn’t!), ANSI, ansinew
Most often seen: Mislabelled as Latin 1.

While the ISO was developing a standard set of character encodings, pretty much every large software company was also developing their own set of proprietary encodings that did pretty much the same thing. Windows-1252 is a slightly tweaked version of Latin 1, but Windows also had a bunch of different encodings, as did Apple, as did IBM. The 1980’s were a wild time for character encodings!
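The "slightly tweaked" part is easy to see in Python. The two encodings agree on most bytes, but Windows-1252 puts printable characters (curly quotes, the Euro sign) in a range where Latin 1 has invisible control characters. A minimal sketch, with a made-up byte string as the example:

```python
def decode_both(raw):
    """Decode the same bytes as Windows-1252 and as Latin 1."""
    return raw.decode("cp1252"), raw.decode("latin-1")


# Bytes 0x80, 0x93 and 0x94 are where the two encodings disagree.
raw = b"\x80 \x93hi\x94"
as_cp1252, as_latin1 = decode_both(raw)

print(as_cp1252)        # Euro sign and curly quotes
print(repr(as_latin1))  # control characters, shown as escapes
```

This is why text mislabelled as Latin 1 often shows up with missing or garbled quotation marks.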

Name: Shift-JIS
Created: 1997
Also known as: Shift Japanese Industrial Standards, SJIS, Shift_JIS
Most often seen: For Japanese

So one thing you may notice about the character encodings above is that they’re all fairly small, i.e. can’t handle more than 256 characters. But what about a language that has way more than 256 characters, like Japanese? (Japanese has a phonetic writing system with 71 characters, as well as 85,000 kanji characters which each represent a single word.) Well, one solution is to create a different character encoding system specifically for that language with enough space for all the characters you’re going to want to use frequently. But, like with the ASCII-based encodings I talked about above, just one encoding isn’t quite enough to cover all the needs of the language, so a lot of variants and extensions popped up.
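Shift-JIS makes room for all those characters by using a variable number of bytes: ASCII characters still take one byte, while kana and kanji take two. Here is a small Python sketch using the standard library’s "shift_jis" codec. The helper name byte_lengths is made up for illustration.

```python
def byte_lengths(text, encoding="shift_jis"):
    """Number of bytes each character takes up in the given encoding."""
    return [len(char.encode(encoding)) for char in text]


for text in ["A", "あ", "日本語"]:
    encoded = text.encode("shift_jis")
    print(text, byte_lengths(text), encoded.hex())
```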

Name: UTF-8
Created: 1992
Also known as: Unicode Transformation Format – 8-bit
Most often seen: Pretty much everywhere, including >90% of text on the web. (That’s a good thing!)

Which brings us to UTF-8, the current standard for text encoding. The UTF encodings map from binary to what are known as Unicode codepoints, and then those codepoints are mapped to characters. Why the whole “codepoints” thing in the middle? To help overcome the problems with language-specific encodings that were discussed above. There are over one million codepoints, of which a little over 130,000 have actually been assigned to specific characters. You can update which binary patterns map to which codepoints independently of which codepoints map to which characters. The large number of code points also means that UTF encodings are also pretty future-proof: we have space to add a lot of new characters before we run out. And, in case you’re wondering, there is a single body in charge of determining which code points map to which characters (including emoji!). It’s called the Unicode Consortium and anyone’s free to join.
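The two-step mapping is visible from Python, where strings are sequences of codepoints and encodings only come in when you convert to bytes. In this sketch the helper name describe is made up for illustration. It shows one character, one codepoint, and three different byte patterns depending on which UTF encoding you choose.

```python
import sys


def describe(char):
    """Return a character's codepoint and its bytes in three UTF encodings."""
    info = {"codepoint": f"U+{ord(char):04X}"}
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        info[encoding] = char.encode(encoding).hex()
    return info


for key, value in describe("€").items():
    print(key, value)

# Total number of codepoints available: a little over one million.
print(sys.maxunicode + 1)  # 1114112
```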

Like I mentioned, there are lots of different character encodings out there, but knowing about these five and how they’re related will give you a good idea of the complexities of the character encodings landscape. If you’re curious to learn more, the Unicode Standard is a good place to start.

Analyzing Multilingual Data

October 17, 2017 ~ Rachael Tatman

This blog post is a little different from my usual stuff. It’s based on a talk I gave yesterday at the first annual Data Institute Conference. As a result, it’s aimed at a slightly more technical audience than my usual stuff, but I hope I’ve done an ok job keeping it accessible. Feel free to drop me a comment if you have any questions or found anything confusing and I’ll be sure to help you out.

You can play with the code yourself by forking this notebook on Kaggle (you don’t even have to download or install anything :).

There are over 7000 languages in the world, 80% of which have fewer than a million speakers each. In fact, six in ten people on Earth speak a language with fewer than ten million speakers. In other words: the majority of people on Earth use low-resource languages.

As a result, any large sample of user-generated text is almost guaranteed to have multiple languages in it. So what can you do about it? There are a couple options:

Ignore it
Only look at the parts of the data that are in English
Break the data apart by language & use language-specific tools when available

Let’s take a quick look at the benefits and drawbacks of each approach.

Getting started


# import libraries we'll use
import spacy # fast NLP
import pandas as pd # dataframes
import langid # language identification (i.e. what language is this?)
from nltk.classify.textcat import TextCat # language identification from NLTK
from matplotlib.pyplot import plot # not as good as ggplot in R :p

To explore working with multilingual data, let’s look at a real-life dataset of user-generated text. This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages. It’s a realistic cross-section of the type of linguistic diversity you’ll see in a large text dataset.

# read in our data
tweetsData = pd.read_csv("../input/all_annotated.tsv", sep = "\t")

# check out some of our tweets
tweetsData['Tweet'][0:5]

0                            Bugün bulusmami lazimdiii
1       Volkan konak adami tribe sokar yemin ederim :D
2                                                  Bed
3    I felt my first flash of violence at some fool...
4              Ladies drink and get in free till 10:30
Name: Tweet, dtype: object

Option 1: Ignore the multilingualism

Maybe you’ve got a deadline coming up fast, or maybe you didn’t get a chance to actually look at some of your text data and just decide to treat it as if it were English. What could go wrong?

To find out, let’s use Spacy to tokenize all our tweets and take a look at the longest tokens in our data.

Spacy is an open-source NLP library that is much faster than the Natural Language Toolkit, although it does not have as many tasks implemented. You can find more information in the Spacy documentation.

# load an English-language Spacy model
# (Spacy 3 and later need the full model name; older versions accepted "en")
nlp = spacy.load("en_core_web_sm")

# apply the English language model to our tweets,
# creating a single Spacy document
doc = nlp(' '.join(tweetsData['Tweet']))

Now let’s look at the longest tokens in our Twitter data.

sorted(doc, key=len, reverse=True)[0:5]

[a7e78d48888a6811d84e0759e9387647447d1e74d8c7c4f1bec00d318e4e5030f08eb35668a97873820ca1d9dc61ffb620f8992296f3b029a60f153beac8018f5fb77d000000,
 e44337d70d7a7fec79a8b6bd8aa573367224023e4272f22af6d0844d9682d5b48062e331b33ab3b92dac2c262ed4f154ba679ad07b30d2cf1c15851cdac901315b4e72000000,
 3064d36c909f9d437f7a3f405aa550f65529566547ae2308d6c4f2585250106d33b924ae9c8dcc08856e41f611d9bd15409a79f7ba21d318ab484f0cae10017201590a000000,
 69bdf5177f1ae8ed61ed71c477f7dc415b97a2b2d7e57be079feb1a2c52600a996fd0891e130c1ce13c94e4406f83ba59e5edb5a7e0fb45e5251a17bb29601081f3de0000000,
 lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3&lt;3]

The five longest tokens are entire tweets, four produced by an art bot that tweets hashes of Unix timestamps and one that’s just the HTML version of “<3” tweeted a bunch of times. In other words: normal Twitter weirdness. This is actual noise in the data and can be safely discarded without hurting downstream tasks, like sentiment analysis or topic modeling.

sorted(doc, key=len, reverse=True)[6:10]

[卒業したった(*^^*)\n彼女にクラスで一緒にいるやつに\nたった一人の同中の拓夢とも写真撮れたし満足や！(^｡^)時間ギリギリまでテニスやってたからテニス部面と写真撮ってねーわ‼︎まぁこいつらわこれからも付き合いあるだろうからいいか！,
 眼鏡は近視用で黒のセルフレームかアンダーリムでお願いします。オフの日は赤いセルフレームです。形状はサークルでお願いします。30代前半です。髪型ボブカットもしくはティモシェンコ元ウクライナ首相みたいなので。色は黒目でとりあえずお願いします,
 普段は写真撮られるの苦手なので、\n\n顔も出さずw\n\n登場回数少ないですが、\n\n元気にampで働いておりますw\n\n一応こんな人が更新してますのでw\n\n#takahiromiyashitathesolois,
 love#instagood#me#cute#tbt#photooftheday#instamood#tweegram#iphonesia#picoftheday#igers#summer#girl#insta]

The next four longest tokens are also whole tweets which have been identified as single tokens. In this case, though, they were produced by humans!

The tokenizer (which assumes it will be given mainly English data) fails to correctly tokenize these tweets because it’s looking for spaces. Three of them are in Japanese, though, and like several other Asian languages (including Chinese and Thai), written Japanese doesn’t use spaces between words. (The fourth is a run of hashtags with no spaces in between.)

In case you’re curious, “、” and “。” are single characters and don’t contain spaces! They are, respectively, the ideographic comma and ideographic full stop, and are part of a very long list of line breaking characters associated with specific orthographic systems.

In order to correctly tokenize Japanese, you’ll need to use a language-specific tokenizer.
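Here is a deliberately oversimplified sketch of why this happens. It is not what Spacy does internally (its tokenizer has many more rules); it only splits on whitespace, which is the core assumption that breaks. The two samples are taken from the tweets shown above.

```python
# Samples copied from the tweets shown above
english = "Ladies drink and get in free till 10:30"
japanese = "眼鏡は近視用で黒のセルフレームかアンダーリムでお願いします。"

def whitespace_tokenize(text):
    """Split on whitespace only: a stand-in for any space-based tokenizer."""
    return text.split()

english_tokens = whitespace_tokenize(english)
japanese_tokens = whitespace_tokenize(japanese)

print(len(english_tokens), "English tokens")
print(len(japanese_tokens), "Japanese token(s)")
```

The English tweet comes out as eight tokens, while the whole Japanese sentence comes back as one, ideographic full stop and all.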

The takeaway: if you ignore multiple languages, you’ll end up violating the assumptions behind major out-of-the-box NLP tools

Option 2: Only look at the parts of the data that are in English

So we know that just applying NLP tools designed for English willy-nilly won’t work on multiple languages. So what if we only grabbed the English-language data and then worked with that?

There are two big issues here:

Correctly identifying which tweets are in English
Throwing away data

Correctly identifying which tweets are in English

Probably the least time-intensive way to do this is by attempting to automatically identify the language that each Tweet is written in. A BIG grain of salt here: automatic language identifiers are very error prone, especially on very short texts. Let’s check out two of them.

LangID: Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, pp. 553-561. Available from http://www.aclweb.org/anthology/I11-1062
TextCat: Cavnar, W. B. and J. M. Trenkle, “N-Gram-Based Text Categorization” In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.

First off, here are the languages the first five tweets are actually written in, hand tagged by a linguist (i.e. me):

Turkish
Turkish
English
English
English

Now let’s see how well two popular language identifiers can detect this.

# identify the language of the first five tweets with LangID
tweetsData['Tweet'][0:5].apply(langid.classify)

0     (az, -30.30187177658081)
1     (ms, -83.29260611534119)
2      (en, 9.061840057373047)
3    (en, -195.55468368530273)
4     (en, -98.53013229370117)
Name: Tweet, dtype: object

LangID does…alright, with three out of five tweets identified correctly. While it’s pretty good at identifying English, the first tweet was identified as Azerbaijani and the second tweet was labeled as Malay, which is very wrong (not even in the same language family as Turkish).

Let’s look at another algorithm, TextCat, which is based on character-level N-Grams.

# N-Gram-Based Text Categorization
tc = TextCat()

# try to identify the languages of the first five tweets again
tweetsData['Tweet'][0:5].apply(tc.guess_language)

0    tur
1    ind
2    bre
3    eng
4    eng
Name: Tweet, dtype: object

TextCat also only got three out of the five correct. Oddly, it identified “bed” as Breton. To be fair, “bed” is the Breton word for “world”, but it’s still a bit odd.
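If you’re wondering what “based on character-level N-Grams” means in practice, here is a toy sketch of the rank-order idea from the Cavnar and Trenkle paper cited above. It is not NLTK’s implementation: the function names are made up for illustration, the n-grams only go up to length three, and the “training data” is a single tweet per language, so don’t expect it to be accurate.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Rank the most frequent character n-grams (lengths 1..n_max) in `text`."""
    padded = f" {text.lower()} "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    # sort by count (descending), then alphabetically so ties are deterministic
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: rank for rank, gram in enumerate(ranked[:top_k])}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams the language lacks get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def guess_language(text, lang_profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))

# Toy training text (an assumption): two of the sample tweets shown earlier
training = {
    "tur": "Volkan konak adami tribe sokar yemin ederim",
    "eng": "Ladies drink and get in free till 10:30",
}
profiles = {lang: ngram_profile(text) for lang, text in training.items()}
print(guess_language("get in free", profiles))
```

With profiles this small, a three-letter tweet like “Bed” shares very few n-grams with any language, which gives some intuition for why short texts are so hard to identify.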

Throwing away data

Even if language identification were very accurate, how much data would we be throwing away if we only looked at data we were fairly sure was English?

Note: I’m only going to use LangID here for time reasons, but given the high error rate I’d recommend using multiple language identification algorithms.
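One simple way to combine identifiers is to keep a label only when they agree. The sketch below uses the labels from the LangID and TextCat outputs above; the helper name and the (partial) code mapping are assumptions for illustration, since LangID returns two-letter ISO 639-1 codes and TextCat returns three-letter ISO 639-3 codes.

```python
# Partial ISO 639-3 -> ISO 639-1 mapping covering only the labels seen above
# (assumption: a real pipeline would need a complete table)
ISO3_TO_ISO1 = {"tur": "tr", "ind": "id", "bre": "br", "eng": "en"}

# Labels copied from the LangID and TextCat outputs above
langid_labels = ["az", "ms", "en", "en", "en"]
textcat_labels = ["tur", "ind", "bre", "eng", "eng"]

def agreed_label(langid_label, textcat_label):
    """Return the shared label if both identifiers agree, else None."""
    normalized = ISO3_TO_ISO1.get(textcat_label)
    return langid_label if normalized == langid_label else None

consensus = [agreed_label(a, b) for a, b in zip(langid_labels, textcat_labels)]
print(consensus)
```

Requiring agreement trades coverage for precision: on these five tweets only the last two keep a label, and both of those labels are correct.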

# get the language id for each text
ids_langid = tweetsData['Tweet'].apply(langid.classify)

# get just the language label
langs = ids_langid.apply(lambda result: result[0])

# how many unique language labels were applied?
print("Number of tagged languages (estimated):")
print(len(langs.unique()))

# percent of the total dataset in English
print("Percent of data in English (estimated):")
print((sum(langs=="en")/len(langs))*100)

Number of tagged languages (estimated):
95
Percent of data in English (estimated):
40.963625976

Only about 41% of our data has been tagged as English by LangID. If we throw out the rest of it, we’re going to lose more than half of our dataset! Especially if this is data you spent a lot of time and money collecting, that seems downright wasteful. (Plus, it might skew our analysis.)

So if about 41% of our data is in English, what is the other 59% made up of? Let’s check out the distribution of data across languages in our dataset.

# convert our list of languages to a dataframe
langs_df = pd.DataFrame(langs)

# count the number of times we see each language
langs_count = langs_df.Tweet.value_counts()

# horrible-looking barplot (I would suggest using R for visualization)
langs_count.plot.bar(figsize=(20,10), fontsize=20)

There’s a really long tail on our dataset; most languages that were identified in our dataset were only identified a few times. This means that we can get a lot of mileage out of including just a few more popular languages in our analysis. How much will we gain, exactly?

print("Languages with more than 400 tweets in our dataset:")
print(langs_count[langs_count > 400])

print("")

print("Percent of our dataset in these languages:")
print((sum(langs_count[langs_count > 400])/len(langs)) * 100)

Languages with more than 400 tweets in our dataset:
en    4302
es    1020
pt     751
ja     436
tr     414
id     407
Name: Tweet, dtype: int64

Percent of our dataset in these languages:
69.7962292897

By including only five more languages in our analysis (Spanish, Portuguese, Japanese, Turkish and Indonesian) we can cover almost 70% of our dataset, up from about 41% with English alone!
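To double-check the arithmetic, here is the same calculation done by hand from the counts printed above and the dataset size of 10,502 tweets mentioned earlier (no pandas needed).

```python
# Counts copied from the output above; total is the dataset size given earlier
total_tweets = 10502
top_counts = {"en": 4302, "es": 1020, "pt": 751, "ja": 436, "tr": 414, "id": 407}

english_share = top_counts["en"] / total_tweets * 100
top_six_share = sum(top_counts.values()) / total_tweets * 100
gain = top_six_share - english_share   # in percentage points

print(f"English only:      {english_share:.1f}%")
print(f"Top six languages: {top_six_share:.1f}%")
print(f"Gain:              {gain:.1f} percentage points")
```

The gain is measured in percentage points of the whole dataset, so it is the share of all tweets we get back, not a relative increase over the English-only subset.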

The takeaway: Just incorporating a couple more languages in your analysis can give you access to a lot more data!

Option 3: Break the data apart by language & use language-specific tools

Ok, so what exactly does this pipeline look like? Let’s look at just the second most popular language in our dataset: Spanish. What happens when we pull out just the Spanish tweets & tokenize them?

# get a list of tweets labelled "es" by langid
spanish_tweets = tweetsData['Tweet'][langs == "es"]

# load a Spanish-language Spacy tokenizer
# (Spacy 2 and later keep language classes under spacy.lang)
from spacy.lang.es import Spanish
nlp_es = Spanish()

# apply the Spanish language model to our tweets
doc_es = nlp_es(' '.join(spanish_tweets))

# print the longest tokens
sorted(doc_es, key=len, reverse=True)[0:5]

[ViernesSantoEnElColiseoRobertoClemente,
 MiFantasia1DEnWembleyConCocaColaFM,
 fortaleciéndonos','escenarios,
 DirectionersConCocaColaFM1D,
 http://t.co/ezZEsXN3MF\nvia]

This time, the longest tokens are mostly Spanish-language hashtags. This is exactly the sort of thing we’d expect to see! From here, we can feed this tokenized dataset into other downstream tasks like sentiment analysis.

Of course, it would be impractical to do this for every single language in our dataset, even if we could be sure that they were all identified correctly. Unless you can commit a lot of time, you’ll probably have to accept that you won’t be able to consider every language in your dataset. But including any additional language will enrich your analysis!
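If you do want to handle several languages at once, the overall shape of the pipeline is a lookup from language label to language-specific tool. The sketch below is only a skeleton: the two tokenizers are crude hypothetical stand-ins (a real pipeline would plug in proper per-language tools), and tweets in languages we have no tool for are counted rather than silently dropped.

```python
from collections import defaultdict

# Hypothetical stand-ins for real language-specific tokenizers
def tokenize_with_spaces(text):
    return text.split()

def tokenize_per_character(text):
    # crude placeholder for a real Japanese tokenizer
    return [char for char in text if not char.isspace()]

TOKENIZERS = {
    "en": tokenize_with_spaces,
    "es": tokenize_with_spaces,
    "ja": tokenize_per_character,
}

def tokenize_by_language(labelled_tweets):
    """Tokenize (language, text) pairs with the tool registered for each language."""
    tokens = defaultdict(list)
    skipped = defaultdict(int)
    for lang, text in labelled_tweets:
        tokenizer = TOKENIZERS.get(lang)
        if tokenizer is None:
            skipped[lang] += 1          # no tool for this language (yet)
        else:
            tokens[lang].append(tokenizer(text))
    return dict(tokens), dict(skipped)

# Sample tweets from above, with their language labels
sample = [
    ("en", "Ladies drink and get in free till 10:30"),
    ("ja", "卒業したった"),
    ("tr", "Bugün bulusmami lazimdiii"),
]
tokens, skipped = tokenize_by_language(sample)
print({lang: len(docs) for lang, docs in tokens.items()}, skipped)
```

Keeping a tally of skipped languages makes it easy to see which language would be the most valuable one to add next.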

The takeaway: It doesn’t have to be onerous to incorporate multiple languages in your analysis pipeline!

So let’s review our options for analyzing multilingual data:

Option 1: Ignore Multilingualism

As we saw, this option will result in violating a lot of the assumptions built into NLP tools (e.g. there are spaces between words). If you do this, you’ll end up with a lot of noise and headaches as you try to move through your analysis pipeline.

Option 2: Only look at English

In this dataset, only looking at English would have led to us throwing away over half of our data. Especially as NLP tools are developed and made available for more and more languages, there’s less reason to stick to English-only NLP.

Option 3: Separate your data by language & analyze them independently

This does take a little more work than the other options… but not that much more, especially for languages that already have resources available for them.

Additional resources:

Language Identification:

Here are some pre-built language identifiers to use in addition to LangID and TextCat:

detect_language() from TextBlob. Uses a Google API & requires Internet access.
langdetect Python library: Port of Google’s language-detection library (version from 03/03/2014) to Python.

Dealing with texts which contain multiple languages (code switching):

It’s very common for a span of text to include multiple languages. This example contains English and Malay (“kain kain” is Malay for “unwrap”):

Roasted Chicken Rice with Egg. Kain kain! 🙂 [Image of a lunch wrapped in paper being unwrapped.]

How to automatically handle code switching is an active research question in NLP. Here are some resources to get you started learning more:

Resources for code-switching: Including training data and corpora
First Workshop on Computational Approaches to Code Switching: Includes information on a shared task on identifying & labeling code switching
Overview for the Second Shared Task on Language Identification in Code-Switched Data: Results of the second shared task on code-switching, with an overview of different approaches.

Dance Your PhD: Modeling the Perceptual Learning of Novel Dialect Features

September 28, 2017 ~ Rachael Tatman

Today’s blog post is a bit different. It’s in dance!

If that wasn’t quite clear enough for you, you can check this blog post for a more detailed explanation.

Where can you find language data on the web?

September 20, 2017 ~ Rachael Tatman

In the course of my day-to-day work on Kaggle’s public data platform, I’ve learned a lot about the ecosystem of language data on the web (or at least the portions of it that have been annotated in English). For example, I’ve noticed a weird disconnect between European and American data repositories that I’m pretty sure has its roots in historical and disciplinary divisions.

I’ve also found a lot of great resources, though! At some point, I started keeping notes on interesting data repositories and link aggregators. I finally got around to tidying up and annotating my list of resources, and I figured that it would be a useful thing to share with everyone. So, without further ado, here’s an (incomplete) list of some places to find language resources on the web:

META-SHARE
- URL: http://www.meta-share.org/
- META-SHARE has a lot of resources from The International Conference on Language Resources and Evaluation (LREC) on it.
Trolling
- URL: https://dataverse.no/dataverse/trolling
- This data collection mainly has datasets for replication of linguistics experiments.
Linguistic Data Consortium (LDC)
- URL: https://www.ldc.upenn.edu/
- The Linguistic Data Consortium is an international non-profit that offers archival hosting of datasets. The data offered by them is high quality and usually not free (although they offer data grants for students).
Kaggle
- URL: https://www.kaggle.com/datasets?search=corpus
- Kaggle’s public data platform has a lot of language/NLP datasets available on it, many not in English. You can also do data analysis on Kaggle (with R or Python) without having to download anything or set up a local environment.
European Language Resources Association
- URL: http://catalog.elra.info/, http://universal.elra.info/
- Focus on European languages and language resources, but the universal catalog (second link) has a broader focus.
Zenodo
- URL: https://zenodo.org/
- Hosted by CERN, has datasets (including corpora) from a wide variety of disciplines.
Document the Now
- URL: http://www.docnow.io/catalog/
- Contains lists of Tweet ID’s surrounding certain events. You’ll need to use the “rehydrator” to get the actual tweets.
International Standard Language Resource Number
- URL: http://www.islrn.org/resources/identify_name/ (a list of unique ID #’s associated with language resources)
- Like a digital object identifier (DOI) for language resources. Not the best search (only looks at the title) but if you have a specific phrase you’re looking for it can be a good way to discover new resources.
Language & Culture Archives (SIL)
- URL: https://www.sil.org/resources/language-culture-archives
- Focus on ethnolinguistic minority communities, in many cases the only publicly available data for a given language.
Open Language Archives Community (OLAC)
- URL: http://www.language-archives.org/
- Includes a helpful metadata quality analysis for each onboarded dataset. (A higher score = more complete metadata)
Freesound
- URL: http://freesound.org/
- Freesound is a collaborative database of Creative Commons Licensed sounds. Note that some of the speech is synthetic. Helpful automatic annotation utilities can be found here: https://github.com/CrowdTruth/VU-Sound-Corpus/tree/v1.0
GitHub
- URL: https://github.com/search?q=corpus
- You can sometimes find interesting & high quality language data on GitHub, but it’s not centralized and the quality varies widely.
Re3data.org
- URL: http://www.re3data.org/search?query=&subjects%5B%5D=104%20Linguistics
- A link aggregator. It has a lot of overlap with other datasets but can be a good place to start looking.
Language Gold Mine
- URL: http://languagegoldmine.com/ (By Bodo Winter)
- Another collection of links, well-tagged by content type.

Know of a resource I forgot to include? Link it in the comments!

How well do Google and Microsoft recognize speech across dialect, gender and race?

August 29, 2017 ~ Rachael Tatman

If you’ve been following my blog for a while, you may remember that last year I found that YouTube’s automatic captions didn’t work as well for some dialects, or for women. The effects I found were pretty robust, but I wanted to replicate them for a couple of reasons:

I only looked at one system, YouTube’s automatic captions, and even that was over a period of several years instead of at just one point in time. I controlled for time-of-upload in my statistical models, but it wasn’t the fairest system evaluation.
I didn’t control for the audio quality, and since speech recognition is pretty sensitive to things like background noise and microphone quality, that could have had an effect.
The only demographic information I had was where someone was from. Given recent results that find that natural language processing tools don’t work as well for African American English, I was especially interested in looking at automatic speech recognition (ASR) accuracy for African American English speakers.

With that in mind, I did a second analysis on both YouTube’s automatic captions and Bing’s speech API (that’s the same tech that’s inside Microsoft’s Cortana, as far as I know).

Speech Data

For this project, I used speech data from the International Dialects of English Archive. It’s a collection of English speech from all over, originally collected to help actors sound more realistic.

I used speech data from four varieties: the South (speakers from Alabama), the Northern Cities (Michigan), California (California) and General American. “General American” is the sort of news-caster style of speech that a lot of people consider unaccented–even though it’s just as much an accent as any of the others! You can hear a sample here.

For each variety, I did an acoustic analysis to make sure that speakers I’d selected actually did use the variety I thought they should, and they all did.

Systems

For the YouTube captions, I just uploaded the speech files to YouTube as videos and then downloaded the subtitles. (I would have used the API instead, but when I was doing this analysis there was no Python Google Speech API, even though very thorough documentation had already been released.)

Bing’s speech API was a little more complex. For this one, my co-author built a custom Android application that sent the files to the API & requested a long-form transcript back. For some reason, a lot of our sound files were returned as only partial transcriptions. My theory is that there is a running confidence function for the accuracy of the transcription, and once the overall confidence drops below a certain threshold, you get back whatever was transcribed up to there. I don’t know if that’s the case, though, since I don’t have access to their source code. Whatever the reason, the Bing transcriptions were less accurate overall than the YouTube transcriptions, even when we account for the fact that fewer words were returned.

Results

OK, now to the results. Let’s start with dialect area. As you might be able to tell from the graphs below, there were pretty big differences between the two systems we looked at. In general, there was more variation in the word error rate for Bing and overall the error rate tended to be a bit higher (although that could be due to the incomplete transcriptions we mentioned above). YouTube’s captions were generally more accurate and more consistent. That said, both systems had different error rates across dialects, with the lowest average error rates for General American English.
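In case you haven’t run into it before, word error rate (WER) is the number of word-level substitutions, deletions and insertions needed to turn the system’s transcript into the reference transcript, divided by the number of words in the reference. Here is a minimal sketch of the standard calculation; it is not the code used for this analysis, and the two transcripts are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous_row[j] = edit distance between the ref words seen so far and hyp[:j]
    previous_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref)

# Made-up transcripts for illustration only
reference = "please call stella and ask her to bring these things"
hypothesis = "please call stella ask her to bring this thing"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")
```

Here one word is dropped and two are substituted out of ten reference words, so the WER is 0.30. Because insertions count as errors too, WER can go above 1 for very bad transcripts.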

Now, let’s turn to gender. If you read my earlier work, you’ll know that I previously found that YouTube’s automatic captions were more accurate for men and less accurate for women. This time, with carefully recorded speech samples, I found no robust difference in accuracy by gender in either system. Which is great! In addition, the unreliable trends for each system pointed in opposite ways; Bing had a lower WER for male speakers, while YouTube had a lower WER for female speakers.

So why did I find an effect last time? My (untested) hypothesis is that there was a difference in the signal to noise ratio for male and female speakers in the user-uploaded files. Since women are (on average) smaller and thus (on average) slightly quieter when they speak, it’s possible that their speech was more easily masked by background noises, like fans or traffic. These files were all recorded in a quiet place, however, which may help to explain the lack of difference between genders.

Finally, what about race? For this part of the analysis, I excluded General American speakers, since they did not report their race. I also excluded the single Native American speaker. Even with fewer speakers, and thus reduced power, the differences between races were still robust enough to be significant for YouTube’s automatic captions and Bing followed the same trend. Both systems were most accurate for Caucasian speakers.

As with dialect, differences in WER between races were not significant for Bing (F[4, 31] = 1.21, p = 0.36), but were significant for YouTube’s automatic captions (F[4, 34] = 2.86, p < 0.05).

While I was happy to find no difference in performance by gender, the fact that both systems made more errors on non-Caucasian and non-General-American speaking talkers is deeply concerning. Regional varieties of American English and African American English are both consistent and well-documented. There is nothing intrinsic to these varieties that makes them less easy to recognize. The fact that they are recognized with more errors is most likely due to bias in the training data. (In fact, Mozilla is currently collecting diverse speech samples for an open corpus of training data–you can help them out yourself.)

So what? Why does word error rate matter?

There are two things I’m really worried about with these types of speech recognition errors. The first is that higher error rates seem to overwhelmingly affect already-disadvantaged groups. In the US, strong regional dialects tend to be associated with speakers who aren’t as wealthy, and there is a long and continuing history of racial discrimination in the United States.

Given this, the second thing I’m worried about is the fact that these voice recognition systems are being incorporated into other applications that have a real impact on people’s lives.

Recently, an Irish veterinarian living in Australia was denied permanent residency in Australia after failing an English fluency test based on automatic speech recognition. She was a native English speaker.
A recent paper proposes using automatic speech recognition to automatically score a clinical test used to diagnose, among other things, speech impairment, ADHD, dementia and schizophrenia.
Voice recognition is increasingly being incorporated into the hiring process. One example is HireVue, a software product designed to pre-screen candidates before they talk to a recruiter. A direct quote from the HireVue CEO when asked whether it might make a mistake on a candidate: “the algorithm is always right. It would be a ‘Yes’ or a ‘No.’” (Yikes!)

Every automatic speech recognition system makes errors. I don’t think that’s going to change (certainly not in my lifetime). But I do think we can get to the point where those errors don’t disproportionately affect already-marginalized people. And if we keep using automatic speech recognition in high-stakes situations it’s vital that we get to that point quickly and, in the meantime, stay aware of these biases.

If you’re interested in the long version, you can check out the published paper here.

Where 👏 do 👏 the 👏 claps 👏 go 👏 when 👏 you 👏 write 👏 like 👏 this 👏?

July 13, 2017 ~ Rachael Tatman

You may already be familiar with the phenomenon I’m going to be talking about today: when someone punctuates some text with the clap emoji. It’s a pretty transparent gestural scoring and (for me) immediately brings to mind the way my mom would clap with every word when she was particularly exasperated with my sibling and me (it was usually along with speech like “let’s go, let’s go, let’s go” or “get up now”). It looks like so:

Don't👏🏻call👏🏻yourself👏🏻a👏🏻Panic!👏🏻at👏🏻the👏🏻disco👏🏻fan👏🏻if👏🏻you've👏🏻never👏🏻panicked👏🏻at👏🏻a👏🏻disco

— Hand Clap Tweets (@HandClaps) October 9, 2016

This innovation, which started on Black Twitter, is really interesting to me because it ties in with my earlier work on emoji ordering. I want to know where emojis go, particularly in relation to other words. That’s especially true now that people have extended this usage to other emoji, like the US flag:

https://twitter.com/SarahLerner/status/883732069607587840

Logically, there are several different ways you can intersperse clap emojis with text:

Claps 👏 are 👏 used 👏 between 👏 every 👏 word.
👏 Claps 👏 are 👏 used 👏 around 👏 every 👏 word. 👏
👏 Claps 👏 are 👏 used 👏 before 👏 every 👏 word.
Claps 👏 are 👏 used 👏 after 👏 every 👏 word. 👏
Claps 👏 are used 👏 between phrases 👏 not words

I want to know which of these best describes what people actually do. I’m not aiming to write an internet style guide, but I am hoping to characterize this phenomenon in a general way: this is how most people who do this do it, and if you want to use this style in a natural way, you should probably do it the same way.

Data

I used Fireant to grab 10,000 tweets from the Twitter streaming API which had the clap emoji in them at least once. (Twitter doesn’t let you search for a certain number of matches of the same string. If you search for “blob” and “blob blob” you’ll get the same set of results.)

Analysis

From that set of 10,000 tweets, I took only the tweets that had a clap emoji followed by a word followed by another clap emoji and threw out any repeats. That left me with 260 tweets. (This may seem pretty small compared to my starting dataset, but there were a lot of retweets in there, and I didn’t want to count anything twice.) Then I removed @usernames, since those show up in the beginning of any tweet that’s a reply to someone, and URLs, which I don’t really think of as “words”. Finally, I looked at each word in a tweet and marked whether it was a clap or not. You can see the results of that here:
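The steps above can be sketched in a few lines of Python. To be clear, this is a reconstruction under my own assumptions, not the code used for the analysis: the function names are made up, a “word” is just any whitespace-delimited token, and skin-tone modifiers on the clap are stripped so that all claps count the same.

```python
import re

CLAP = "\U0001F44F"
MODIFIER = re.compile("[\U0001F3FB-\U0001F3FF]")   # skin-tone modifiers

def normalize(tweet):
    """Drop URLs, @usernames and skin tones, then split into tokens."""
    text = re.sub(r"https?://\S+", " ", tweet)
    text = re.sub(r"@\w+", " ", text)
    text = MODIFIER.sub("", text)
    # pad each clap with spaces so it becomes its own token
    return text.replace(CLAP, f" {CLAP} ").split()

def clap_flags(tweet):
    """One boolean per token: True where the token is a clap emoji."""
    return [token == CLAP for token in normalize(tweet)]

def has_clap_word_clap(tweet):
    """True if some clap is followed by a single word and then another clap."""
    flags = clap_flags(tweet)
    return any(
        flags[i] and not flags[i + 1] and flags[i + 2]
        for i in range(len(flags) - 2)
    )

# Toy tweets for illustration (the second one is a repeat of the first)
tweets = [
    f"@friend Claps {CLAP} are {CLAP} used {CLAP} https://t.co/abc",
    f"@friend Claps {CLAP} are {CLAP} used {CLAP} https://t.co/abc",
    f"nice {CLAP}",
]
# keep clap-word-clap tweets and throw out repeats, preserving order
kept = list(dict.fromkeys(t for t in tweets if has_clap_word_clap(t)))
print([clap_flags(t) for t in kept])
```

Averaging these per-position flags across tweets gives the proportion of claps at the first word, second word, and so on.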

The “word” axis represents which word in the tweet we’re looking at: the first, second, third, etc. The red portion of each bar shows the words that are the clap emoji. The yellow portion shows the words that aren’t. (BTW, big shoutout to Hadley Wickham’s emo(ji) package for letting me include emoji in plots!)

From this we can see a clear pattern: almost no one starts a tweet with an emoji, but most people follow the first word with an emoji. The up-down-up-down pattern means that people are alternating the clap emoji with one word. So if we look back at our hypotheses about how emoji are used, we can see right off the bat that three of them are wrong:

Claps 👏 are 👏 used 👏 between 👏 every 👏 word.
~~👏 Claps 👏 are 👏 used 👏 around 👏 every 👏 word. 👏~~
~~👏 Claps 👏 are 👏 used 👏 before 👏 every 👏 word.~~
Claps 👏 are 👏 used 👏 after 👏 every 👏 word. 👏
~~Claps 👏 are used 👏 between phrases 👏 not words~~

We can pick between the two remaining hypotheses by looking at whether people are ending their tweets with a clap emoji. As it turns out, the answer is “yes”, more often than not.

If they’re using this clapping-between-words pattern (sometimes called the “ratchet clap”) people are statistically more likely to end their tweet with a clap emoji than with a different word or non-clap emoji. This means the most common pattern is to use 👏 a 👏 clap 👏 after 👏 every 👏 word, 👏 including 👏 the 👏 last. 👏

This makes intuitive sense to me. This pattern is mimicking someone clapping on every word. Since we can’t put emoji on top of words to indicate that they’re happening at the same time, putting them after makes good intuitive sense. In some sense, each emoji is “attached” to the word that comes before it in a similar way to how “quickly” is “attached” to “run” in the phrase “run quickly”. It makes less sense to put emoji between words, because then you end up with fewer claps than words, which doesn’t line up well with the way this is done in speech.

The “clap after every word” pattern is also what this website that automatically puts claps in your tweets does, so I’m pretty positive this is a good characterization of community norms.
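For the curious, the “clap after every word” pattern is about one line of Python. This is only a sketch of the pattern itself (the function name is made up), not the code behind that website.

```python
CLAP = "\U0001F44F"

def clapify(text):
    """Put a clap emoji after every word, including the last one."""
    return " ".join(f"{word} {CLAP}" for word in text.split())

clapped = clapify("do it like this.")
print(clapped)
```

Joining with claps between words instead (`f" {CLAP} ".join(text.split())`) would give the less common pattern, with one fewer clap than words.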

So there you have it! If you’re going to put clap emoji in your tweets, you should probably do 👏 it 👏 like 👏 this. 👏 It’s not wrong if you don’t, but it does look kind of weird.

How many people in the US don’t have an accent?

May 2, 2017 ~ Rachael Tatman

First, the linguist’s answer: none. Zero. Everyone who uses a language uses a variety of that language, one that reflects their social identity–including things like gender, socioeconomic status or regional background.

But the truth is that some people, especially in the US, have the social privilege of being considered “unaccented”. I can’t count how many times I’ve been “congratulated” by new acquaintances on having “gotten rid of” my Virginia accent. The thing is, I do have a lot of linguistic features from Tidewater/Piedmont English, like a strong distinction between the vowels in “body” and “bawdy”, “y’all” for the second person plural and calling a drive-through liquor store a “brew thru” (shirts with this guy on them were super popular in my high school). But, at the same time, I also don’t have a lot of strongly stigmatized features, like dropping r’s or the strong monophthongization you’d hear from a speaker like Virgil Goode (although most folks don’t really sound like that anymore). Plus, I’m young, white, (currently) urban and really highly educated. That, plus the fact that most people don’t pick up on the Southern features I do have, means that I have the privilege of being perceived as accent-less.

Map showing the distribution of speakers in the United States who use “y’all”.

But how many people in the US are in the same boat as I am? This is a difficult question, especially given that there is no wide consensus about what “standard”, or “unaccented”, American English is. There is, however, a lot of discussion about who does and doesn’t speak it. In particular, educated speakers from the Midwest and West are generally considered to be standard speakers by non-linguists. Non-linguists also generally don’t consider speakers of African American English and Chicano English to be “standard” speakers (even though both of these are robust, internally consistent language varieties with long histories used by native English speakers). Fortunately for me, the United States census asks respondents about their language background, race and ethnicity, educational attainment and geographic location, so I could use census data to roughly estimate how many speakers of “standard” English there are in the United States. I chose to use the 2011 census, as detailed data on language use has been released for that year on a state-by-state basis (you can see a summary here).

From this data, I calculated how many individuals were living in states assigned by the U.S. Census Bureau to either the West or Midwest and how many residents surveyed in these states reported speaking English ‘very well’ or better. Then, assuming that residents of these states had educational attainment rates representative of national averages, I estimated how many college educated (with a bachelor’s degree or above) non-Black and non-Hispanic speakers lived in these areas.

So just how many speakers fit into this “standard” mold? Fewer than you might expect! You can see the breakdown below:

Speakers in the 2011 census who…	Count	% of US Population
…live in the United States…	311.7 million	100%
…and live in the Midwest or West…	139,968,791	44.9%
…and speak English at least ‘very well’…	127,937,178	41%
…and are college educated…	38,381,153 (estimated)	12.31%
…and are not Black or Hispanic.	33,391,603 (estimated)	10.7%
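The arithmetic behind the table can be sketched in a few lines of Python. The first three counts are copied from the table. The two rates are assumptions: the post doesn’t state them, but they are the ratios between adjacent table rows (about 30% with a bachelor’s degree or higher, about 87% neither Black nor Hispanic), so treat them as back-calculated rather than official figures.

```python
# Counts taken directly from the table above.
US_POPULATION = 311_700_000
MIDWEST_OR_WEST = 139_968_791
ENGLISH_VERY_WELL = 127_937_178

# Assumed national rates, inferred from the ratios between table rows
# (they are not stated anywhere in the post).
COLLEGE_RATE = 0.30
NOT_BLACK_OR_HISPANIC_RATE = 0.87

# Apply each rate to the previous step of the funnel.
college_educated = round(ENGLISH_VERY_WELL * COLLEGE_RATE)
standard_speakers = round(college_educated * NOT_BLACK_OR_HISPANIC_RATE)


def percent_of_us(count: int) -> float:
    """Express a count as a percentage of the total US population."""
    return 100 * count / US_POPULATION


steps = [
    ("live in the Midwest or West", MIDWEST_OR_WEST),
    ("speak English at least 'very well'", ENGLISH_VERY_WELL),
    ("are college educated (estimated)", college_educated),
    ("are not Black or Hispanic (estimated)", standard_speakers),
]
for label, count in steps:
    print(f"{label}: {count:,} ({percent_of_us(count):.2f}%)")
```

Because each rate is applied to the previous row as a flat national average, the last two rows are only as good as the assumption that the Midwest and West look like the country as a whole.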

Based on the criteria laid out above, only around a tenth of the US population would count as ‘standard’ speakers. Now, keep in mind this estimate is probably somewhat conservative: not all Black speakers use African American English and not all Hispanic speakers use Chicano English, and the regional dialects of some parts of the Northeast are also sometimes considered “standard”, which isn’t reflected in my rough calculation. That said, I think there’s still something here: if a large majority of Americans don’t speak what we might consider “standard” English, maybe it’s time to start redefining who gets to be the standard.
