Are emoji sequences as informative as text?

Something I’ve been thinking about a lot lately is how much information we really convey with emoji. I was recently at the 1​st​ International Workshop on Emoji Understanding and Applications in Social Media and one theme that stood out to me from the papers was that emoji tend to be used more to communicate social meaning (things like tone and when a conversation is over) than semantics (content stuff like “this is a dog” or “an icecream truck”).

I’ve been itching to apply an information theoretic approach to emoji use for a while, and this seemed like the perfect opportunity. Information theory is the study of storing, transmitting and, most importantly for this project, quantifying information. In other words, using an information theoretic approach we can actually look at two input texts and figure out which one has more information in it. And that’s just what we’re going to do: we’re going to use a measure called “entropy” to directly compare the amount of information in text and emoji.

What’s entropy?

Shannon entropy is a measure of how much information there is in a sequence. Higher entropy means that there’s more uncertainty about what comes next, while lower entropy means there’s less uncertainty.  (Mathematically, entropy is always less than or the same as log2(n), where n is the total number of unique characters. You can learn more about calculating entropy and play around with an interactive calculator here if you’re curious.)

So if you have a string of text that’s just one character repeated over and over (like this: 💀💀💀💀💀) you don’t need a lot of extra information to know what the next character will be: it will always be the same thing. So the string “💀💀💀💀💀” has a very low entropy. In this case it’s actually 0, which means that if you’re going through the string and predicting what comes next, you’re always going to be able to guess what comes next becuase it’s always the same thing. On the other hand, if you have a string that’s made up of four different characters, all of which are equally probable (like this:♢♡♧♤♡♧♤♢), then you’ll have an entropy of 2.

TL;DR: The higher the entropy of a string the more information is in it.

Experiment

Hypothesis

We do have some theoretical maximums for the entropy text and emoji. For text, if the text string is just randomly drawn from the 128 ASCII characters (which isn’t how language works, but this is just an approximation) our entropy would be 7. On the other hand, for emoji, if people are just randomly using any emoji they like from the set of emoji as of June 2017, then we’d expect to see an entropy of around 11.

So if people are just  using letters or emoji randomly, then text should have lower entropy than emoji. However, I don’t think that’s what’s happening. My hypothesis, based on the amount of repetition in emoji, was that emoji should have lower entropy, i.e. less information, than text.

Data

To get emoji and text spans for our experiment I used four different datasets: three from Twitter and one from YouTube.

I used multiple datasets for a couple reasons. First, becuase I wanted a really large dataset of tweets with emoji, and since only between 0.9% and 0.5% of tweets from each Twitter dataset actually contained emoji I needed to case a wide net. And, second, because I’m growing increasingly concerned about genre effects in NLP research. (Like, a lot of our research is on Twitter data. Which is fine, but I’m worried that we’re narrowing the potential applications of our research becuase of it.) It’s the second reason that led me to include YouTube data. I used Twitter data for my initial exploration and then used the YouTube data to validate my findings.

For each dataset, I grabbed all adjacent emoji from a tweet and stored them separately. So this tweet:

Love going to ballgames! ⚾🌭 Going home to work in my garden now, tho 🌸🌸🌸🌸

Has two spans in it:

Span 1:  ⚾🌭

Span 2: 🌸🌸🌸🌸

All told, I ended up with 13,825 tweets with emoji and 18,717 emoji spans of which only 4,713 were longer than one emoji. (I ignored all the emoji spans of length one, since they’ll always have an entropy of 0 and aren’t that interesting to me.) For the YouTube comments, I ended up with 88,629 comments with emoji, 115,707 emoji spans and 47,138 spans with a length greater than one.

In order to look at text as parallel as possible to my emoji spans, I grabbed tweets & YouTube comments without emoji. For each genre, I took a number of texts equal to the number of spans of length > 1 and then calculated the character-level entropy for the emoji spans and the texts.

 

Analysis

First, let’s look at Tweets. Here’s the density (it’s like a smooth histogram, where the area under the curve is always equal to 1 for each group) of the entropy of an equivalent number of emoji spans and tweets.

download (6)
Text has a much high character-level entropy than emoji. For text, the mean and median entropy are both around 5. For emoji, there is a multimodal distribution, with the median entropy being 0 and also clusters around 1 and 1.5.

It looks like my hypothesis was right! At least in tweets, text has much more information than emoji. In fact, the most common entropy for an emoji span is 0: which means that most emoji spans with a length greater than one are just repititons of the same emoji over and over again.

But is this just true on Twitter, or does it extend to YouTube comments as well?

download (5)
The pattern for emoji & text in YouTube comments is very similar to that for Tweets. The biggest difference is that it looks like there’s less information in YouTube Comments that are text-based; they have a mean and median entropy closer to 4 than 5.

The YouTube data, which we have almost ten times more of, corroborates the earlier finding: emoji spans are less informative, and more repetitive, than text.

Which emoji were repeated the most/least often?

Just in case you were wondering, the emoji most likely to be repeated was the skull emoji, 💀. It’s generally used to convey strong negative emotion, especially embarrassment, awkwardness or speechlessness, similar to “ded“.

The least likely was the right-pointing arrow (▶️), which is usually used in front of links to videos.

More info & further work

If you’re interested, the code for my analysis is available here. I also did some of this work as live coding, which you can follow along with on YouTube here.

For future work, I’m planning on looking at which kinds of emoji are more likely to be repeated. My intuition is that gestural emoji (so anything with a hand or face) are more likely to be repeated than other types of emoji–which would definitely add some fuel to the “are emoji words or gestures” debate!

Advertisements

Analyzing Multilingual Data

This blog post is a little different from my usual stuff. It’s based on a talk I gave yesterday at the first annual Data Institute Conference. As a result, it’s aimed at a slightly more technical audience than my usual stuff, but I hope I’ve done an ok job keeping it accessible. Feel free to drop me a comment if you have any questions or found anything confusing and I’ll be sure to help you out.
You can play with the code yourself by forking this notebook on Kaggle (you don’t even have to download or install anything :).

There are over 7000 languages in the world, 80% of which have fewer than a million speakers each. In fact, six in ten people on Earth speak a language with less than ten million speakers. In other words: the majority of people on Earth use low-resource languages.

As a result, any large sample of user-generated text is almost guaranteed to have multiple languages in it. So what can you do about it? There are a couple options:

  1. Ignore it
  2. Only look at the parts of the data that are in English
  3. Break the data apart by language & use language-specific tools when available

Let’s take a quick look at the benefits and drawbacks of each approach.


Getting started

In [1]:
# import libraries we'll use
import spacy # fast NLP
import pandas as pd # dataframes
import langid # language identification (i.e. what language is this?)
from nltk.classify.textcat import TextCat # language identification from NLTK
from matplotlib.pyplot import plot # not as good as ggplot in R :p

To explore working with multilingual data, let’s look a real-life dataset of user-generated text. This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages. It’s a realistic cross-section of the type of linguistic diversity you’ll see in a large text dataset.

# read in our data
tweetsData = pd.read_csv("../input/all_annotated.tsv", sep = "\t")

# check out some of our tweets
tweetsData['Tweet'][0:5]
0                            Bugün bulusmami lazimdiii
1       Volkan konak adami tribe sokar yemin ederim :D
2                                                  Bed
3    I felt my first flash of violence at some fool...
4              Ladies drink and get in free till 10:30
Name: Tweet, dtype: object

Option 1: Ignore the multilingualism

Maybe you’ve got a deadline coming up fast, or maybe you didn’t get a chance to actually look at some of your text data and just decide to treat it as if it were English. What could go wrong?

To find out, let’s use Spacy to tokenize all our tweets and take a look at the longest tokens in our data.

Spacy is an open-source NLP library that is much faster than the Natural Language Toolkit, although it does not have as many tasks implemented. You can find more information in the Spacy documentation.

# create a Spacy document of our tweets
# load an English-language Spacy model
nlp = spacy.load("en")

# apply the english language model to our tweets
doc = nlp(' '.join(tweetsData['Tweet']))

Now let’s look at the longest tokens in our Twitter data.

sorted(doc, key=len, reverse=True)[0:5]
[a7e78d48888a6811d84e0759e9387647447d1e74d8c7c4f1bec00d318e4e5030f08eb35668a97873820ca1d9dc61ffb620f8992296f3b029a60f153beac8018f5fb77d000000,
 e44337d70d7a7fec79a8b6bd8aa573367224023e4272f22af6d0844d9682d5b48062e331b33ab3b92dac2c262ed4f154ba679ad07b30d2cf1c15851cdac901315b4e72000000,
 3064d36c909f9d437f7a3f405aa550f65529566547ae2308d6c4f2585250106d33b924ae9c8dcc08856e41f611d9bd15409a79f7ba21d318ab484f0cae10017201590a000000,
 69bdf5177f1ae8ed61ed71c477f7dc415b97a2b2d7e57be079feb1a2c52600a996fd0891e130c1ce13c94e4406f83ba59e5edb5a7e0fb45e5251a17bb29601081f3de0000000,
 lt;3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3]

The five longest tokens are entire tweets, four produced by an art bot that tweets hashes of Unix timestamps and one that’s just the HTML version of “<3” tweeted a bunch of times. In other words: normal Twitter weirdness. This is actual noise in the data and can be safely discarded without hurting downstream tasks, like sentiment analysis or topic modeling.

sorted(doc, key=len, reverse=True)[6:10]
[卒業したった(*^^*)\n彼女にクラスで一緒にいるやつに\nたった一人の同中の拓夢とも写真撮れたし満足や!(^。^)時間ギリギリまでテニスやってたからテニス部面と写真撮ってねーわ‼︎まぁこいつらわこれからも付き合いあるだろうからいいか!,
 眼鏡は近視用で黒のセルフレームかアンダーリムでお願いします。オフの日は赤いセルフレームです。形状はサークルでお願いします。30代前半です。髪型ボブカットもしくはティモシェンコ元ウクライナ首相みたいなので。色は黒目でとりあえずお願いします,
 普段は写真撮られるの苦手なので、\n\n顔も出さずw\n\n登場回数少ないですが、\n\n元気にampで働いておりますw\n\n一応こんな人が更新してますのでw\n\n#takahiromiyashitathesolois,
 love#instagood#me#cute#tbt#photooftheday#instamood#tweegram#iphonesia#picoftheday#igers#summer#girl#insta]

The next five longest tokens are also whole tweets which have been identified as single tokens. In this case, though, they were produced by humans!

The tokenizer (which assumes it will be given mainly English data) fails to correct tokenize these tweets because it’s looking for spaces. These tweets are in Japanese, though, and like many Asian languages (including all varieties of Chinese, Korean and Thai) they don’t actually use spaces between words.

In case you’re curious, “、” and “。” are single characters and don’t contain spaces! They are, respectively, the ideographic comma and ideographic full stop, and are part of a very long list of line breaking characters associated with specific orthographic systems.

In order to correctly tokenize Japanese, you’ll need to use a language-specific tokenizer.

The takeaway: if you ignore multiple languages, you’ll end up violating the assumptions behind major out-of-the-box NLP tools


Option 2: Only look at the parts of the data that are in English

So we know that just applying NLP tools designed for English willy-nilly won’t work on multiple languages. So what if we only grabbed the English-language data and then worked with that?

There are two big issues here:

  • Correctly identifying which tweets are in English
  • Throwing away data

Correctly identifying which tweets are in English

Probably the least time-intensive way to do this is by attempting to automatically identify the language that each Tweet is written in. A BIG grain of salt here: automatic language identifiers are very error prone, especially on very short texts. Let’s check out two of them.

  • LangID: Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, pp. 553—561. Available from http://www.aclweb.org/anthology/I11-1062
  • TextCat: Cavnar, W. B. and J. M. Trenkle, “N-Gram-Based Text Categorization” In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.

First off, here are the languages the first five tweets are actually written in, hand tagged by a linguist (i.e. me):

  1. Turkish
  2. Turkish
  3. English
  4. English
  5. English

Now let’s see how well two popular language identifiers can detect this.

# summerize the labelled language
tweetsData['Tweet'][0:5].apply(langid.classify)
0     (az, -30.30187177658081)
1     (ms, -83.29260611534119)
2      (en, 9.061840057373047)
3    (en, -195.55468368530273)
4     (en, -98.53013229370117)
Name: Tweet, dtype: object

LangID does…alright, with three out of five tweets identified correctly. While it’s pretty good at identifying English, the first tweet was identified as Azerbaijani and the second tweet was labeled as Malay, which is very wrong (not even in the same language family as Turkish).

Let’s look at another algorithm, TextCat, which is based on character-level N-Grams.

# N-Gram-Based Text Categorization
tc = TextCat()

# try to identify the languages of the first five tweets again
tweetsData['Tweet'][0:5].apply(tc.guess_language)
0    tur
1    ind
2    bre
3    eng
4    eng
Name: Tweet, dtype: object

TextCat also only got three out of the five correct. Oddly, it identifier “bed” as Breton. To be fair, “bed” is the Breton word for “world”, but it’s still a bit odd.

The takeaway: Automatic language identification, especially on very short texts, is very error prone. (I’d recommend using multiple language identifiers & taking the majority vote.)

Throwing away data

Even if language identification were very accurate, how much data would be just be throwing away if we only looked at data we were fairly sure was English?

Note: I’m only going to LangID here for time reasons, but given the high error rate I’d recommend using multiple language identification algorithms.

# get the language id for each text
ids_langid = tweetsData['Tweet'].apply(langid.classify)

# get just the language label
langs = ids_langid.apply(lambda tuple: tuple[0])

# how many unique language labels were applied?
print("Number of tagged languages (estimated):")
print(len(langs.unique()))

# percent of the total dataset in English
print("Percent of data in English (estimated):")
print((sum(langs=="en")/len(langs))*100)
Number of tagged languages (estimated):
95
Percent of data in English (estimated):
40.963625976

Only 40% of our data has been tagged as English by LangId. If we throw the rest of it, we’re going to lose more than half of our dataset! Especially if this is data you spent a lot of time and money collecting, that seems downright wasteful. (Plus, it might skew our analysis.)

So if 40% of our data is in English, what is the other 60% made up of? Let’s check out the distribution data across languages in our dataset.

# convert our list of languages to a dataframe
langs_df = pd.DataFrame(langs)

# count the number of times we see each language
langs_count = langs_df.Tweet.value_counts()

# horrible-looking barplot (I would suggest using R for visualization)
langs_count.plot.bar(figsize=(20,10), fontsize=20)

There’s a really long tail on our dataset; most that were identified in our dataset were only identified a few times. This means that we can get a lot of mileage out of including just a few more popular languages in our analysis. How much will we gain, exactly?

print("Languages with more than 400 tweets in our dataset:")
print(langs_count[langs_count > 400])

print("")

print("Percent of our dataset in these languages:")
print((sum(langs_count[langs_count > 400])/len(langs)) * 100)
Languages with more than 400 tweets in our dataset:
en    4302
es    1020
pt     751
ja     436
tr     414
id     407
Name: Tweet, dtype: int64

Percent of our dataset in these languages:
69.7962292897

By including only five more languages in our analysis (Spanish, Portugese, Japanese, Turkish and Indonesian) we can increase our coverage of the data in our dataset by almost a third!

The takeaway: Just incorporating a couple more languages in your analysis can give you access to a lot more data!


Option 3: Break the data apart by language & use language-specific tools

Ok, so what exactly does this pipeline look like? Let’s look at just the second most popular language in our dataset: Spanish. What happens when we pull out just the Spanish tweets & tokenize them?

# get a list of tweets labelled "es" by langid
spanish_tweets = tweetsData['Tweet'][langs == "es"]

# load a Spanish-language Spacy model
from spacy.es import Spanish
nlp_es = Spanish(path=None)

# apply the Spanish language model to our tweets
doc_es = nlp_es(' '.join(spanish_tweets))

# print the longest tokens
sorted(doc_es, key=len, reverse=True)[0:5]
[ViernesSantoEnElColiseoRobertoClemente,
 MiFantasia1DEnWembleyConCocaColaFM,
 fortaleciéndonos','escenarios,
 DirectionersConCocaColaFM1D,
 http://t.co/ezZEsXN3MF\nvia]

This time, the longest tokens are Spanish-language hashtags. This is exactly the sort of thing we’d expect to see! From here, we can use this tokenized dataset to feed into other downstream like sentiment analysis.

Of course, it would be impractical to do this for every single language in our dataset, even if we could be sure that they were all identified correctly. You’re probably going to have to accept that you probably won’t be able to consider every language in your dataset unless you can commit a lot of time. But including any additional language will enrich your analysis!

The takeaway: It doesn’t have to be onerous to incorporate multiple languages in your analysis pipeline!


So let’s review our options for analyzing multilingual data:

Option 1: Ignore Multilingualism

As we saw, this option will result in violating a lot of the assumptions built into NLP tools (e.g. there are spaces between words). If you do this, you’ll end up with a lot of noise and headaches as you try to move through your analysis pipeline.

Option 2: Only look at English

In this dataset, only looking at English would have led to us throwing away over half of our data. Especailly as NLP tools are developed and made avaliable for more and more languages, there’s less reason to stick to English-only NLP.

Option 3: Seperate your data by language & analyze them independently

This does take a little more work than the other options… but not that much more, especially for languages that already have resources avalialbe for them.

Additional resources:

Language Identification:

Here are some pre-built language identifiers to use in addition to LandID and TextCat:

Dealing with texts which contain multiple languages (code switching):

It’s very common for a span of text to include multiple languages. This example contains English and Malay (“kain kain” is Malay for “unwrap”):

Roasted Chicken Rice with Egg. Kain kain! 🙂 [Image of a lunch wrapped in paper being unwrapped.]

How to automatically handle code switching is an active research question in NLP. Here are some resources to get you started learning more:

 

What’s up with calling a woman “a female”? A look at the parts of speech of “male” and “female” on Twitter .

This is something I’ve written about before, but I’ve recently had several discussions with people who say they don’t find it odd to refer to a women as a female. Personally, I don’t like being called “a female” becuase its a term I to associate strongly with talking about animals. (Plus, it makes you sound like a Ferengi.)  I would also protest men being called males, for the same reason, but my intuition is that that doesn’t happen as often. I’m willing to admit that my intuition may be wrong in this case, though, so I’ve decided to take a more data-driven approach. I had two main questions:

  • Do “male” and “female” get used as nouns at different rates?
  • Does one of these terms get used more often?

Data collection

I used the Twitter public API to collect two thousand English tweets, one thousand each containing the exact string “a male” and “a female”. I looked for these strings to help get as many tweets as possible with “male” or “female” used as a noun. “A” is what linguist call a determiner, and a determiner has to have a noun after it. It doesn’t have to be the very next word, though; you can get an adjective first, like so:

  • A female mathematician proved the theorm.
  • A female proved the theorm.

So this will let me directly compare these words in a situation where we should only be able to see a limited number of possible parts of speech & see if they differ from each other. Rather than tagging two thousand tweets by hand, I used a Twitter specific part-of-speech tagger to tag each set of tweets.

A part of speech tagger is a tool that guesses the part of speech of every word in a text. So if you tag a sentence like “Apples are tasty”, you should get back that “apples” is a plural noun, “are” is a verb and “tasty” is an adjective. You can try one out for yourself on-line here.

Parts of Speech

In line with my predictions, every instance of “male” or “female” was tagged as either a noun, an adjective or a hashtag. (I went through and looked at the hashtags and they were all porn bots. #gross #hazardsOfTwitterData)

However, not every noun was tagged as the same type of noun. I saw three types of tags in my data: NN (regular old noun), NNS (plural noun) and, unexpectedly, NNP (proper noun, singular). (If you’re confused by the weird upper case abbreviations, they’re the tags used in the Penn Treebank, and you can see the full list here.) In case it’s been a while since you studied parts of speech, proper nouns are things like personal or place names. The stuff that tend to get capitalized in English. The examples from the Penn Treebank documentation include “Motown”, “Venneboerger”,  and “Czestochwa”. I wouldn’t consider either “female” or “male” a name, so it’s super weird that they’re getting tagged as proper nouns. What’s even weirder? It’s pretty much only “male” that’s getting tagged as a proper noun, as you can see below:

maleVsFemalePOS
Number of times each word tagged as each part of speech by the GATE Twitter part-of-speech tagger. NNS is a plural noun, NNP a proper noun, NN a noun and JJ an adjective.

The differences in tagged POS between “male” and “female” was super robust(X2(6, N = 2033) = 1019.2, p <.01.). So what’s happening here?  My first thought was that it might be that, for some reason, “male” is getting capitalized more often and that was confusing the tagger. But when I looked into, there wasn’t a strong difference between the capitalization of “male” and “female”: both were capitalized about 3% of the time. 

My second thought was that it was a weirdness showing up becuase I used a tagger designed for Twitter data. Twitter is notoriously “messy” (in the sense that it can be hard for computers to deal with) so it wouldn’t be surprising if tagging “male” as a proper noun is the result of the tagger being trained on Twitter data. So, to check that, I re-tagged the same data using the Stanford POS tagger. And, sure enough, the weird thing where “male” is overwhelming tagged as a proper noun disappeared.

stanfordTaggerPOS
Number of times each word tagged as each part of speech by the Stanford POS tagger. NNS is a plural noun, NNP a proper noun, NN a noun, JJ an adjective and FW a “foreign word”.

So it looks like “male” being tagged as a proper noun is an artifact of the tagger being trained on Twitter data, and once we use a tagger trained on a different set of texts (in this case the Wall Street Journal) there wasn’t a strong difference in what POS “male” and “female” were tagged as.

Rate of Use

That said, there was a strong difference between “a female” and “a male”: how often they get used. In order to get one thousand tweets with the exact string “a female”, Twitter had to go back an hour and thirty-four minutes. In order to get a thousand tweets with “a male”, however, Twitter had to go back two hours and fifty eight minutes. Based on this sample, “a female” gets said almost twice as often as “a male”.

So what’s the deal?

  • Do “male” and “female” get used as nouns at different rates?  It depends on what tagger you use! In all seriousness, though, I’m not prepared to claim this based on the dataset I’ve collected.
  • Does one of these terms get used more often? Yes! Based on my sample, Twitter users use “a female” about twice as often as “a male”.

I think the greater rate of use of “a female” that points to the possibility of an interesting underlying difference in how “male” and “female” are used, one that calls for a closer qualitative analysis. Does one term get used to describe animals more often than the other? What sort of topics are people talking about when they say “a male” and “a female”? These questions, however, will have to wait for the next blog post!

In the meantime, I’m interested in getting more opinions on this. How do you feel about using “a male” and “a female” as nouns to talk about humans? Do they sound OK or strike you as odd?

My code and is available on my GitHub.

Preference for wake words varies by user gender

I recently read a very interesting article on the design of aspects of choosing a wake word, the word you use to turn on a voice-activated system. In Star Trek it’s “Computer”, but these days two of the more popular ones are “Alexa” and “OK Google”. The article’s author was a designer and noted that she found “Ok Google” or “Hey Google” to be more pleasant to use than “Alexa”. As I was reading the comments (I know, I know) I noticed that a lot of the people who strongly protested that they preferred “Alexa” had usernames or avatars that I would associate with male users. It struck me that there might be an underlying social pattern here.

So, being the type of nerd I am, I whipped up a quick little survey to look at the interaction between user gender and their preference for wake words. The survey only had two questions:

  • What is your gender?
    • Male
    • Female
    • Other
  • If Google Home and the Echo offered identical performance in all ways except for the wake word (the word or phrase you use to wake the device and begin talking to it), which wake word would you prefer?
    • “Ok Google” or “Hey Google”
    • “Alexa”

I included only those options becuase those are the defaults–I am aware you can choose to change the Echo’s wake word. (And probably should, given recent events.) 67 people responded to my survey. (If you were one of them, thanks!)

So what were the results? They were actually pretty strongly in line with my initial observations: as a group, only men preferred “Alexa” to “Ok Google”. Furthermore, this preference was far weaker than people of other genders’ for “Ok Google”. Women preferred “Ok Google” at a rate of almost two-to-one, and no people of other genders preferred “Alexa”.

I did have a bit of a skewed sample, with more women than men and people of other genders, but the differences between genders were robust enough to be statistically significant (c2(2, N = 67) = 7.25, p = 0.02)).

genderandwakewords
Women preferred “Ok Google” to “Alexa” 27:11, men preferred “Alexa” to “Ok Google” 14:11, and the four people of other genders in my survey all preferred “Ok Google”.

So what’s the take-away? Well, for one, Johna Paolino (the author of the original article) is by no means alone in her preference for a non-gendered wake word. More broadly, I think that, like the Clippy debacle, this is excellent evidence that there are strong gendered differences in how users’ gender affects their interaction with virtual agents. If you’re working to create virtual agents, it’s important to consider all types of users or you might end up creating something that rubs more than half of your potential customers the wrong way.

My code and data are available here.