Analyzing Multilingual Data

This blog post is a little different from my usual stuff. It’s based on a talk I gave yesterday at the first annual Data Institute Conference. As a result, it’s aimed at a slightly more technical audience than my usual stuff, but I hope I’ve done an ok job keeping it accessible. Feel free to drop me a comment if you have any questions or found anything confusing and I’ll be sure to help you out.
You can play with the code yourself by forking this notebook on Kaggle (you don’t even have to download or install anything :).

There are over 7000 languages in the world, 80% of which have fewer than a million speakers each. In fact, six in ten people on Earth speak a language with less than ten million speakers. In other words: the majority of people on Earth use low-resource languages.

As a result, any large sample of user-generated text is almost guaranteed to have multiple languages in it. So what can you do about it? There are a couple options:

  1. Ignore it
  2. Only look at the parts of the data that are in English
  3. Break the data apart by language & use language-specific tools when available

Let’s take a quick look at the benefits and drawbacks of each approach.


Getting started

In [1]:
# import libraries we'll use
import spacy # fast NLP
import pandas as pd # dataframes
import langid # language identification (i.e. what language is this?)
from nltk.classify.textcat import TextCat # language identification from NLTK
from matplotlib.pyplot import plot # not as good as ggplot in R :p

To explore working with multilingual data, let’s look a real-life dataset of user-generated text. This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages. It’s a realistic cross-section of the type of linguistic diversity you’ll see in a large text dataset.

# read in our data
tweetsData = pd.read_csv("../input/all_annotated.tsv", sep = "\t")

# check out some of our tweets
tweetsData['Tweet'][0:5]
0                            Bugün bulusmami lazimdiii
1       Volkan konak adami tribe sokar yemin ederim :D
2                                                  Bed
3    I felt my first flash of violence at some fool...
4              Ladies drink and get in free till 10:30
Name: Tweet, dtype: object

Option 1: Ignore the multilingualism

Maybe you’ve got a deadline coming up fast, or maybe you didn’t get a chance to actually look at some of your text data and just decide to treat it as if it were English. What could go wrong?

To find out, let’s use Spacy to tokenize all our tweets and take a look at the longest tokens in our data.

Spacy is an open-source NLP library that is much faster than the Natural Language Toolkit, although it does not have as many tasks implemented. You can find more information in the Spacy documentation.

# create a Spacy document of our tweets
# load an English-language Spacy model
nlp = spacy.load("en")

# apply the english language model to our tweets
doc = nlp(' '.join(tweetsData['Tweet']))

Now let’s look at the longest tokens in our Twitter data.

sorted(doc, key=len, reverse=True)[0:5]
[a7e78d48888a6811d84e0759e9387647447d1e74d8c7c4f1bec00d318e4e5030f08eb35668a97873820ca1d9dc61ffb620f8992296f3b029a60f153beac8018f5fb77d000000,
 e44337d70d7a7fec79a8b6bd8aa573367224023e4272f22af6d0844d9682d5b48062e331b33ab3b92dac2c262ed4f154ba679ad07b30d2cf1c15851cdac901315b4e72000000,
 3064d36c909f9d437f7a3f405aa550f65529566547ae2308d6c4f2585250106d33b924ae9c8dcc08856e41f611d9bd15409a79f7ba21d318ab484f0cae10017201590a000000,
 69bdf5177f1ae8ed61ed71c477f7dc415b97a2b2d7e57be079feb1a2c52600a996fd0891e130c1ce13c94e4406f83ba59e5edb5a7e0fb45e5251a17bb29601081f3de0000000,
 lt;3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3]

The five longest tokens are entire tweets, four produced by an art bot that tweets hashes of Unix timestamps and one that’s just the HTML version of “<3” tweeted a bunch of times. In other words: normal Twitter weirdness. This is actual noise in the data and can be safely discarded without hurting downstream tasks, like sentiment analysis or topic modeling.

sorted(doc, key=len, reverse=True)[6:10]
[卒業したった(*^^*)\n彼女にクラスで一緒にいるやつに\nたった一人の同中の拓夢とも写真撮れたし満足や!(^。^)時間ギリギリまでテニスやってたからテニス部面と写真撮ってねーわ‼︎まぁこいつらわこれからも付き合いあるだろうからいいか!,
 眼鏡は近視用で黒のセルフレームかアンダーリムでお願いします。オフの日は赤いセルフレームです。形状はサークルでお願いします。30代前半です。髪型ボブカットもしくはティモシェンコ元ウクライナ首相みたいなので。色は黒目でとりあえずお願いします,
 普段は写真撮られるの苦手なので、\n\n顔も出さずw\n\n登場回数少ないですが、\n\n元気にampで働いておりますw\n\n一応こんな人が更新してますのでw\n\n#takahiromiyashitathesolois,
 love#instagood#me#cute#tbt#photooftheday#instamood#tweegram#iphonesia#picoftheday#igers#summer#girl#insta]

The next five longest tokens are also whole tweets which have been identified as single tokens. In this case, though, they were produced by humans!

The tokenizer (which assumes it will be given mainly English data) fails to correct tokenize these tweets because it’s looking for spaces. These tweets are in Japanese, though, and like many Asian languages (including all varieties of Chinese, Korean and Thai) they don’t actually use spaces between words.

In case you’re curious, “、” and “。” are single characters and don’t contain spaces! They are, respectively, the ideographic comma and ideographic full stop, and are part of a very long list of line breaking characters associated with specific orthographic systems.

In order to correctly tokenize Japanese, you’ll need to use a language-specific tokenizer.

The takeaway: if you ignore multiple languages, you’ll end up violating the assumptions behind major out-of-the-box NLP tools


Option 2: Only look at the parts of the data that are in English

So we know that just applying NLP tools designed for English willy-nilly won’t work on multiple languages. So what if we only grabbed the English-language data and then worked with that?

There are two big issues here:

  • Correctly identifying which tweets are in English
  • Throwing away data

Correctly identifying which tweets are in English

Probably the least time-intensive way to do this is by attempting to automatically identify the language that each Tweet is written in. A BIG grain of salt here: automatic language identifiers are very error prone, especially on very short texts. Let’s check out two of them.

  • LangID: Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, pp. 553—561. Available from http://www.aclweb.org/anthology/I11-1062
  • TextCat: Cavnar, W. B. and J. M. Trenkle, “N-Gram-Based Text Categorization” In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.

First off, here are the languages the first five tweets are actually written in, hand tagged by a linguist (i.e. me):

  1. Turkish
  2. Turkish
  3. English
  4. English
  5. English

Now let’s see how well two popular language identifiers can detect this.

# summerize the labelled language
tweetsData['Tweet'][0:5].apply(langid.classify)
0     (az, -30.30187177658081)
1     (ms, -83.29260611534119)
2      (en, 9.061840057373047)
3    (en, -195.55468368530273)
4     (en, -98.53013229370117)
Name: Tweet, dtype: object

LangID does…alright, with three out of five tweets identified correctly. While it’s pretty good at identifying English, the first tweet was identified as Azerbaijani and the second tweet was labeled as Malay, which is very wrong (not even in the same language family as Turkish).

Let’s look at another algorithm, TextCat, which is based on character-level N-Grams.

# N-Gram-Based Text Categorization
tc = TextCat()

# try to identify the languages of the first five tweets again
tweetsData['Tweet'][0:5].apply(tc.guess_language)
0    tur
1    ind
2    bre
3    eng
4    eng
Name: Tweet, dtype: object

TextCat also only got three out of the five correct. Oddly, it identifier “bed” as Breton. To be fair, “bed” is the Breton word for “world”, but it’s still a bit odd.

The takeaway: Automatic language identification, especially on very short texts, is very error prone. (I’d recommend using multiple language identifiers & taking the majority vote.)

Throwing away data

Even if language identification were very accurate, how much data would be just be throwing away if we only looked at data we were fairly sure was English?

Note: I’m only going to LangID here for time reasons, but given the high error rate I’d recommend using multiple language identification algorithms.

# get the language id for each text
ids_langid = tweetsData['Tweet'].apply(langid.classify)

# get just the language label
langs = ids_langid.apply(lambda tuple: tuple[0])

# how many unique language labels were applied?
print("Number of tagged languages (estimated):")
print(len(langs.unique()))

# percent of the total dataset in English
print("Percent of data in English (estimated):")
print((sum(langs=="en")/len(langs))*100)
Number of tagged languages (estimated):
95
Percent of data in English (estimated):
40.963625976

Only 40% of our data has been tagged as English by LangId. If we throw the rest of it, we’re going to lose more than half of our dataset! Especially if this is data you spent a lot of time and money collecting, that seems downright wasteful. (Plus, it might skew our analysis.)

So if 40% of our data is in English, what is the other 60% made up of? Let’s check out the distribution data across languages in our dataset.

# convert our list of languages to a dataframe
langs_df = pd.DataFrame(langs)

# count the number of times we see each language
langs_count = langs_df.Tweet.value_counts()

# horrible-looking barplot (I would suggest using R for visualization)
langs_count.plot.bar(figsize=(20,10), fontsize=20)

There’s a really long tail on our dataset; most that were identified in our dataset were only identified a few times. This means that we can get a lot of mileage out of including just a few more popular languages in our analysis. How much will we gain, exactly?

print("Languages with more than 400 tweets in our dataset:")
print(langs_count[langs_count > 400])

print("")

print("Percent of our dataset in these languages:")
print((sum(langs_count[langs_count > 400])/len(langs)) * 100)
Languages with more than 400 tweets in our dataset:
en    4302
es    1020
pt     751
ja     436
tr     414
id     407
Name: Tweet, dtype: int64

Percent of our dataset in these languages:
69.7962292897

By including only five more languages in our analysis (Spanish, Portugese, Japanese, Turkish and Indonesian) we can increase our coverage of the data in our dataset by almost a third!

The takeaway: Just incorporating a couple more languages in your analysis can give you access to a lot more data!


Option 3: Break the data apart by language & use language-specific tools

Ok, so what exactly does this pipeline look like? Let’s look at just the second most popular language in our dataset: Spanish. What happens when we pull out just the Spanish tweets & tokenize them?

# get a list of tweets labelled "es" by langid
spanish_tweets = tweetsData['Tweet'][langs == "es"]

# load a Spanish-language Spacy model
from spacy.es import Spanish
nlp_es = Spanish(path=None)

# apply the Spanish language model to our tweets
doc_es = nlp_es(' '.join(spanish_tweets))

# print the longest tokens
sorted(doc_es, key=len, reverse=True)[0:5]
[ViernesSantoEnElColiseoRobertoClemente,
 MiFantasia1DEnWembleyConCocaColaFM,
 fortaleciéndonos','escenarios,
 DirectionersConCocaColaFM1D,
 http://t.co/ezZEsXN3MF\nvia]

This time, the longest tokens are Spanish-language hashtags. This is exactly the sort of thing we’d expect to see! From here, we can use this tokenized dataset to feed into other downstream like sentiment analysis.

Of course, it would be impractical to do this for every single language in our dataset, even if we could be sure that they were all identified correctly. You’re probably going to have to accept that you probably won’t be able to consider every language in your dataset unless you can commit a lot of time. But including any additional language will enrich your analysis!

The takeaway: It doesn’t have to be onerous to incorporate multiple languages in your analysis pipeline!


So let’s review our options for analyzing multilingual data:

Option 1: Ignore Multilingualism

As we saw, this option will result in violating a lot of the assumptions built into NLP tools (e.g. there are spaces between words). If you do this, you’ll end up with a lot of noise and headaches as you try to move through your analysis pipeline.

Option 2: Only look at English

In this dataset, only looking at English would have led to us throwing away over half of our data. Especailly as NLP tools are developed and made avaliable for more and more languages, there’s less reason to stick to English-only NLP.

Option 3: Seperate your data by language & analyze them independently

This does take a little more work than the other options… but not that much more, especially for languages that already have resources avalialbe for them.

Additional resources:

Language Identification:

Here are some pre-built language identifiers to use in addition to LandID and TextCat:

Dealing with texts which contain multiple languages (code switching):

It’s very common for a span of text to include multiple languages. This example contains English and Malay (“kain kain” is Malay for “unwrap”):

Roasted Chicken Rice with Egg. Kain kain! 🙂 [Image of a lunch wrapped in paper being unwrapped.]

How to automatically handle code switching is an active research question in NLP. Here are some resources to get you started learning more:

 

Advertisements

Where can you find language data on the web?

In the course of my day-to-day work on Kaggle’s public data platform, I’ve learned a lot about the ecosystem of language data on the web (or at least the portions of it that have been annotated in English). For example, I’ve noticed a weird disconnect between European and American data repositories  resources that I’m pretty sure has its roots in historical and disciplinary divisions.

Computer Used to Create Printouts of Data (FDA 097) (8250815324)

I’ve also found a lot of great resources, though! At some point, I started keeping notes on interesting data repositories and link aggregators. I finally got around to tidying up and annotating my list of resources, and I figured that it would a useful thing to share with everyone. So, without further ado, here’s an (incomplete) list of some places to find language resources on the web:

  • META-SHARE
    • URL :http://www.meta-share.org/
    • META-SHARE has a lot of resources from The International Conference on Language Resources and Evaluation (LREC) on it.
  • Trolling
  • Linguistic Data Consortium (LDC)
    • URL: https://www.ldc.upenn.edu/
    • The Linguistic Data Consortium is an international non-profit that offers archival hosting of datasets. The data offered by them is high quality and usually not free (although they offer data grants for students).
  • Kaggle
    • URL: https://www.kaggle.com/datasets?search=corpus
    • Kaggle’s public data platform has a lot of language/NLP datasets available on it, many not in English. You can also do data analysis on Kaggle (with R or Python) without having to download anything or set up a local environment.
  • European Language Resources Association
  • Zenodo
    • URL: https://zenodo.org/
    • Hosted by CERN, has datasets (including corpora) from a wide variety of disciplines.
  • Document the Now
    • URL: http://www.docnow.io/catalog/
    • Contains lists of Tweet ID’s surrounding certain events. You’ll need to use the “rehydrator” to get the actual tweets.
  • International Standard Language Resource Number
    • URL: http://www.islrn.org/resources/identify_name/  (a list of unique ID #’s associated with language resources)
    • Like a digital object identifier (DOI) for language resources. Not the best search (only looks at the title)  but if you have a specific phrase you’re looking for it can be a good way to discover new resources.
  • Language & Culture Archives (SIL)
  • Open Language Archives Community (OLAC)
  • Free sound
  • GitHub
    • URL:  https://github.com/search?q=corpus
    • You can sometimes find interesting & high quality language data on Github, but it’s not centralized and of widely varying quality.
  • Re3data.org
  • Language Gold Mine

Know of a resource I forgot to include? Link it in the comments!