# Are emoji sequences as informative as text?

Something I’ve been thinking about a lot lately is how much information we really convey with emoji. I was recently at the 1st International Workshop on Emoji Understanding and Applications in Social Media, and one theme that stood out to me from the papers was that emoji tend to be used more to communicate social meaning (things like tone and when a conversation is over) than semantics (content stuff like “this is a dog” or “an ice cream truck”).

I’ve been itching to apply an information theoretic approach to emoji use for a while, and this seemed like the perfect opportunity. Information theory is the study of storing, transmitting and, most importantly for this project, quantifying information. In other words, using an information theoretic approach we can actually look at two input texts and figure out which one has more information in it. And that’s just what we’re going to do: we’re going to use a measure called “entropy” to directly compare the amount of information in text and emoji.

### What’s entropy?

Shannon entropy is a measure of how much information there is in a sequence. Higher entropy means that there’s more uncertainty about what comes next, while lower entropy means there’s less uncertainty. (Mathematically, entropy is always less than or equal to log2(n), where n is the total number of unique characters. You can learn more about calculating entropy and play around with an interactive calculator here if you’re curious.)

So if you have a string of text that’s just one character repeated over and over (like this: 💀💀💀💀💀) you don’t need a lot of extra information to know what the next character will be: it will always be the same thing. So the string “💀💀💀💀💀” has a very low entropy. In this case it’s actually 0, which means that if you’re going through the string and predicting what comes next, you’re always going to be able to guess correctly because it’s always the same thing. On the other hand, if you have a string that’s made up of four different characters, all of which are equally probable (like this: ♢♡♧♤♡♧♤♢), then you’ll have an entropy of 2, since log2(4) = 2.
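If you want to check those two numbers yourself, here is a minimal sketch of character-level Shannon entropy. The function name `shannon_entropy` is my own for illustration, and it assumes each emoji is a single Unicode code point (emoji built from several code points, like flags or ones with skin-tone modifiers, would be counted piece by piece).

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits, from observed frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)  # how often each character (code point) occurs
    total = len(text)
    # Sum of p * log2(1/p) over the distinct characters.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(shannon_entropy("💀💀💀💀💀"))  # one repeated character: 0.0
print(shannon_entropy("♢♡♧♤♡♧♤♢"))  # four equally likely characters: 2.0
```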

TL;DR: The higher the entropy of a string the more information is in it.

### Experiment

#### Hypothesis

We do have some theoretical maximums for the entropy of text and emoji. For text, if the text string is just randomly drawn from the 128 ASCII characters (which isn’t how language works, but this is just an approximation) our entropy would be 7. On the other hand, for emoji, if people are just randomly using any emoji they like from the set of emoji as of June 2017, then we’d expect to see an entropy of around 11.
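Those maximums are just log2 of the alphabet size. As a quick sketch: the 128 comes from the text above, while the emoji count of 2,666 is my assumption for the size of the emoji set in mid-2017, so swap in whatever count you trust.

```python
import math

ASCII_SIZE = 128            # from the text above
EMOJI_COUNT_ASSUMED = 2666  # assumption: approximate number of emoji in mid-2017


def max_entropy(alphabet_size: int) -> float:
    """Entropy in bits of a uniform random draw from alphabet_size symbols."""
    return math.log2(alphabet_size)


print(max_entropy(ASCII_SIZE))                     # 7.0
print(round(max_entropy(EMOJI_COUNT_ASSUMED), 2))  # a bit over 11
```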

So if people are just using letters or emoji randomly, then text should have lower entropy than emoji. However, I don’t think that’s what’s happening. My hypothesis, based on the amount of repetition in emoji, was that emoji should have lower entropy, i.e. less information, than text.

#### Data

To get emoji and text spans for our experiment I used four different datasets: three from Twitter and one from YouTube.

I used multiple datasets for a couple of reasons. First, because I wanted a really large dataset of tweets with emoji, and since only between 0.5% and 0.9% of tweets from each Twitter dataset actually contained emoji, I needed to cast a wide net. And, second, because I’m growing increasingly concerned about genre effects in NLP research. (Like, a lot of our research is on Twitter data. Which is fine, but I’m worried that we’re narrowing the potential applications of our research because of it.) It’s the second reason that led me to include YouTube data. I used Twitter data for my initial exploration and then used the YouTube data to validate my findings.

For each dataset, I grabbed all adjacent emoji from a tweet and stored them separately. So this tweet:

Love going to ballgames! ⚾🌭 Going home to work in my garden now, tho 🌸🌸🌸🌸

Has two spans in it:

Span 1:  ⚾🌭

Span 2: 🌸🌸🌸🌸

All told, I ended up with 13,825 tweets with emoji and 18,717 emoji spans of which only 4,713 were longer than one emoji. (I ignored all the emoji spans of length one, since they’ll always have an entropy of 0 and aren’t that interesting to me.) For the YouTube comments, I ended up with 88,629 comments with emoji, 115,707 emoji spans and 47,138 spans with a length greater than one.
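To make the span extraction concrete, here is a rough sketch of one way to pull out runs of adjacent emoji and keep only the ones longer than a single emoji. This is not the code from my analysis: the regular expression only covers a couple of Unicode blocks as a stand-in for a real emoji list, and `emoji_spans` is a name made up for this example.

```python
import re

# Assumption: a crude approximation of "emoji" using two code point ranges
# (miscellaneous symbols/dingbats and the main emoji blocks). A real analysis
# would want a proper emoji list, plus handling for variation selectors.
EMOJI_RUN = re.compile("[\u2600-\u27BF\U0001F300-\U0001FAFF]+")


def emoji_spans(text: str) -> list[str]:
    """Return every run of adjacent emoji in the text, in order."""
    return EMOJI_RUN.findall(text)


tweet = "Love going to ballgames! ⚾🌭 Going home to work in my garden now, tho 🌸🌸🌸🌸"
spans = emoji_spans(tweet)

# Spans of length one always have an entropy of 0, so drop them.
long_spans = [span for span in spans if len(span) > 1]
print(long_spans)
```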

In order to look at text as parallel as possible to my emoji spans, I grabbed tweets & YouTube comments without emoji. For each genre, I took a number of texts equal to the number of spans of length > 1 and then calculated the character-level entropy for the emoji spans and the texts.

#### Analysis

First, let’s look at tweets. Here’s the density (it’s like a smooth histogram, where the area under the curve is always equal to 1 for each group) of the entropy of an equivalent number of emoji spans and tweets.

Text has a much higher character-level entropy than emoji. For text, the mean and median entropy are both around 5. For emoji, there is a multimodal distribution, with the median entropy being 0 and also clusters around 1 and 1.5.

It looks like my hypothesis was right! At least in tweets, text has much more information than emoji. In fact, the most common entropy for an emoji span is 0, which means that most emoji spans with a length greater than one are just repetitions of the same emoji over and over again.

But is this just true on Twitter, or does it extend to YouTube comments as well?

The YouTube data, which we have almost ten times more of, corroborates the earlier finding: emoji spans are less informative, and more repetitive, than text.

### Which emoji were repeated the most/least often?

Just in case you were wondering, the emoji most likely to be repeated was the skull emoji, 💀. It’s generally used to convey strong negative emotion, especially embarrassment, awkwardness or speechlessness, similar to “ded”.

The least likely was the right-pointing arrow (▶️), which is usually used in front of links to videos.
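If you want to try ranking emoji by how often they get repeated, here is one possible sketch. The definition is my own assumption for illustration (the share of spans containing an emoji that are nothing but that emoji repeated), and it may not match exactly how the numbers above were calculated; `repetition_rates` is a hypothetical helper name.

```python
from collections import Counter


def repetition_rates(spans: list[str]) -> dict[str, float]:
    """For each emoji, the share of spans containing it that are pure repeats of it."""
    appears = Counter()   # number of spans each emoji shows up in
    repeated = Counter()  # number of spans that are just that emoji repeated
    for span in spans:
        distinct = set(span)
        for emoji in distinct:
            appears[emoji] += 1
        if len(span) > 1 and len(distinct) == 1:
            repeated[span[0]] += 1
    return {emoji: repeated[emoji] / appears[emoji] for emoji in appears}


# Toy data, not from the real datasets.
rates = repetition_rates(["💀💀💀", "💀", "⚾🌭", "🌸🌸"])
for emoji, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(emoji, rate)
```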

If you’re interested, the code for my analysis is available here. I also did some of this work as live coding, which you can follow along with on YouTube here.

For future work, I’m planning on looking at which kinds of emoji are more likely to be repeated. My intuition is that gestural emoji (so anything with a hand or face) are more likely to be repeated than other types of emoji–which would definitely add some fuel to the “are emoji words or gestures” debate!

# How do we use emoji?

Those of you who know me may know that I’m a big fan of emoji. I’m also a big fan of linguistics and NLP, so, naturally, I’m very curious about the linguistic roles of emoji. Since I figured some of you might also be curious, I’ve pulled together a discussion of some of the very serious scholarly research on emoji. In particular, I’m going to talk about five recent papers that explore the exact linguistic nature of these symbols: what are they and how do we use them?

### Dürscheid & Siever, 2017:

This paper makes one overarching point: emoji are not words. They cannot be unambiguously interpreted without supporting text and they do not have clear syntactic relationships to one another. Rather, the authors consider emoji to be specialized characters, and place them within Gallmann’s 1985 hierarchy of graphical signs. The authors show that emoji can play a range of roles within Gallmann’s functional classification.

• Allography: using emoji to replace specific characters (for example: the word “emoji” written as “em😝ji”)
• Ideograms: using emoji to replace a specific word (example: “I’m travelling by 🚘” to mean “I’m travelling by car”)
• Border and Sentence Intention signals: using emoji both to clarify the tone of the preceding sentence and also to show that the sentence is over, often replacing the final punctuation marks.

Based on an analysis of a Swiss German WhatsApp corpus, the authors conclude that the final category is far and away the most popular, and that emoji rarely replace any of the lexical parts of a message.

### Na’aman et al, 2017:

Na’aman and co-authors also develop a hierarchy of emoji usage, with three top-level categories: Function, Content (both of which would mostly fall under the ideogram category in Dürscheid & Siever’s classification) and Multimodal.

• Function: Emoji replacing function words, including prepositions, auxiliary verbs, conjunctions, determinatives and punctuation. An example of this category would be “I like 🍩 you”, to be read as “I do not like you”.
• Content: Emoji replacing content words and phrases, including nouns, verbs, adjectives and adverbs. An example of this would be “The 🔑 to success”, to be read as “the key to success”.
• Multimodal: These emoji “enrich a grammatically-complete text with markers of affect or stance”. These would fall under the category of border signals in Dürscheid & Siever’s framework, but Na’aman et al. further divide these into four categories: attitude, topic, gesture and other.

Based on analysis of a Twitter corpus made up of only tweets containing emoji, the authors find that multimodal emoji encoding attitude are far and away the most common, making up over 50% of the emoji spans in their corpus. The next most common uses of emoji are multimodal:topic and multimodal:gesture. Together, these three categories account for close to 90% of all the emoji use in the corpus, corroborating the findings of Dürscheid & Siever.

### Wood & Ruder, 2016:

Wood and Ruder provide further evidence that emoji are used to express emotion (or “attitude”, in Na’aman et al’s terms). They found a strong correlation between the presence of emoji that they had previously determined were associated with a particular emotion, like 😂 for joy or 😭 for sadness, and human annotations of the emotion expressed in those tweets. In addition, an emotion classifier using only emoji as input performed similarly to one trained using n-grams excluding emoji. This provides evidence that there is an established relationship between specific emoji use and expressing emotion.

### Donato & Paggio, 2017:

However, the relationship between text and emoji may not always be so close. Donato & Paggio collected a corpus of tweets which contained at least one emoji and that were hand-annotated for whether the emoji was redundant given the text of the tweet. For example, the emoji in “We’ll always have Beer. I’ll see to it. I got your back on that one. 🍺” would be redundant, since the beer emoji expresses content already expressed in the tweet. The emoji in “Hopin for the best 🎓” would not be, since the mortarboard adds new information (that the person is hoping to graduate, perhaps). The majority of emoji, close to 60%, were found not to be redundant and added new information to the tweet.

However, the corpus was intentionally balanced between ten topic areas, of which only one was feelings, and as a result the majority of feeling-related tweets were excluded from analysis. Based on this analysis and Wood and Ruder’s work, we might hypothesize that feelings-related emoji may be more redundant than emoji from other semantic categories.

### Barbieri et al, 2017:

Additional evidence for the idea that emoji, especially those that show emotion, are predictable given the text surrounding them comes from Barbieri et al. In their task, they removed the emoji from a thousand tweets that contained one of the following five emoji: 😂, ❤️, 😍, 💯 or 🔥. These emoji were selected since they were the most common in the larger dataset of half a million tweets. They then asked human crowd workers to fill in the missing emoji given the text of the tweet, and trained a character-level bidirectional LSTM to do the same task. Both humans and the LSTM performed well over chance, with an F1 score of 0.50 for the humans and 0.65 for the LSTM.

So that was a lot of papers and results I just threw at you. What’s the big picture? There are two main points I want you to take away from this post:

• People mostly use emoji to express emotion. You’ll see people playing around more than that, sure, but by far the most common use is to make sure people know what emotion you’re expressing with a specific message.
• Emoji, particularly emoji that are used to represent emotions, are predictable given the text of the message. It’s pretty rare for us to actually use emoji to introduce new information, and we generally only do that when we’re using emoji that have a specific, transparent meaning.

If you’re interested in reading more, here are all the papers I mentioned in this post:

#### Bibliography:

Donato, G., & Paggio, P. (2017). Investigating Redundancy in Emoji Use: Study on a Twitter Based Corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (pp. 118-126).

Dürscheid, C., & Siever, C. M. (2017). Beyond the Alphabet–Communication of Emojis. Short version of a manuscript (in German) submitted for publication.

Gallmann, P. (1985). Graphische Elemente der geschriebenen Sprache. Grundlagen für eine Reform der Orthographie. Tübingen: Niemeyer.

Na’aman, N., Provenza, H., & Montoya, O. (2017). Varying Linguistic Purposes of Emoji in (Twitter) Context. In Proceedings of ACL 2017, Student Research Workshop (pp. 136-141).

Wood, I. & Ruder, S. (2016). Emoji as Emotion Tags for Tweets. Sánchez-Rada, J. F., & Schuller, B (Eds.). In Proceedings of LREC 2016, Workshop on Emotion and Sentiment Analysis (pp. 76-80).