There are over 7000 languages in the world, 80% of which have fewer than a million speakers each. In fact, six in ten people on Earth speak a language with fewer than ten million speakers. In other words: the majority of people on Earth use low-resource languages.
As a result, any large sample of user-generated text is almost guaranteed to have multiple languages in it. So what can you do about it? There are a couple of options:
- Ignore it
- Only look at the parts of the data that are in English
- Break the data apart by language & use language-specific tools when available
Let’s take a quick look at the benefits and drawbacks of each approach.
    # import libraries we'll use
    import spacy  # fast NLP
    import pandas as pd  # dataframes
    import langid  # language identification (i.e. what language is this?)
    from nltk.classify.textcat import TextCat  # language identification from NLTK
    from matplotlib.pyplot import plot  # not as good as ggplot in R :p
To explore working with multilingual data, let’s look at a real-life dataset of user-generated text. This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages. It’s a realistic cross-section of the type of linguistic diversity you’ll see in a large text dataset.
    # read in our data
    tweetsData = pd.read_csv("../input/all_annotated.tsv", sep = "\t")
    # check out some of our tweets
    tweetsData['Tweet'][0:5]

    0    Bugün bulusmami lazimdiii
    1    Volkan konak adami tribe sokar yemin ederim :D
    2    Bed
    3    I felt my first flash of violence at some fool...
    4    Ladies drink and get in free till 10:30
    Name: Tweet, dtype: object
Maybe you’ve got a deadline coming up fast, or maybe you didn’t get a chance to actually look at your text data and just decided to treat it as if it were all English. What could go wrong?
To find out, let’s use Spacy to tokenize all our tweets and take a look at the longest tokens in our data.
Spacy is an open-source NLP library that is much faster than the Natural Language Toolkit, although it does not have as many tasks implemented. You can find more information in the Spacy documentation.
    # create a Spacy document of our tweets
    # load an English-language Spacy model
    nlp = spacy.load("en")
    # apply the English language model to our tweets
    doc = nlp(' '.join(tweetsData['Tweet']))
Now let’s look at the longest tokens in our Twitter data.
sorted(doc, key=len, reverse=True)[0:5]
[a7e78d48888a6811d84e0759e9387647447d1e74d8c7c4f1bec00d318e4e5030f08eb35668a97873820ca1d9dc61ffb620f8992296f3b029a60f153beac8018f5fb77d000000, e44337d70d7a7fec79a8b6bd8aa573367224023e4272f22af6d0844d9682d5b48062e331b33ab3b92dac2c262ed4f154ba679ad07b30d2cf1c15851cdac901315b4e72000000, 3064d36c909f9d437f7a3f405aa550f65529566547ae2308d6c4f2585250106d33b924ae9c8dcc08856e41f611d9bd15409a79f7ba21d318ab484f0cae10017201590a000000, 69bdf5177f1ae8ed61ed71c477f7dc415b97a2b2d7e57be079feb1a2c52600a996fd0891e130c1ce13c94e4406f83ba59e5edb5a7e0fb45e5251a17bb29601081f3de0000000, lt;3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3]
The five longest tokens are entire tweets, four produced by an art bot that tweets hashes of Unix timestamps and one that’s just the HTML version of “<3” tweeted a bunch of times. In other words: normal Twitter weirdness. This is actual noise in the data and can be safely discarded without hurting downstream tasks, like sentiment analysis or topic modeling.
sorted(doc, key=len, reverse=True)[6:10]
[卒業したった(*^^*)\n彼女にクラスで一緒にいるやつに\nたった一人の同中の拓夢とも写真撮れたし満足や！(^｡^)時間ギリギリまでテニスやってたからテニス部面と写真撮ってねーわ‼︎まぁこいつらわこれからも付き合いあるだろうからいいか！, 眼鏡は近視用で黒のセルフレームかアンダーリムでお願いします。オフの日は赤いセルフレームです。形状はサークルでお願いします。30代前半です。髪型ボブカットもしくはティモシェンコ元ウクライナ首相みたいなので。色は黒目でとりあえずお願いします, 普段は写真撮られるの苦手なので、\n\n顔も出さずw\n\n登場回数少ないですが、\n\n元気にampで働いておりますw\n\n一応こんな人が更新してますのでw\n\n#takahiromiyashitathesolois, love#instagood#me#cute#tbt#photooftheday#instamood#tweegram#iphonesia#picoftheday#igers#summer#girl#insta]
The next few longest tokens are also whole tweets that have been treated as single tokens. In this case, though, they were produced by humans!
The tokenizer (which assumes it will be given mainly English data) fails to correctly tokenize these tweets because it’s looking for spaces. Most of these tweets are in Japanese, though, and like several other Asian languages (including all varieties of Chinese, and Thai) Japanese doesn’t actually use spaces between words.
In case you’re curious, “、” and “。” are single characters and don’t contain spaces! They are, respectively, the ideographic comma and ideographic full stop, and are part of a very long list of line breaking characters associated with specific orthographic systems.
In order to correctly tokenize Japanese, you’ll need to use a language-specific tokenizer.
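As a rough first pass before picking a tokenizer, one option is to flag text written mostly in a script that does not put spaces between words, so it can be routed to a language-specific tokenizer. The sketch below is an illustration under assumptions, not part of the original analysis: the function name `uses_unspaced_script`, its `threshold` parameter and the list of script prefixes are all made up for this example. It relies only on Python's standard `unicodedata` module.

```python
import unicodedata

# Unicode character-name prefixes for scripts that are normally written
# without spaces between words. This list is an illustrative assumption,
# not an exhaustive inventory.
UNSPACED_SCRIPT_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "THAI")


def uses_unspaced_script(text, threshold=0.5):
    """Return True if at least `threshold` of the letters in `text`
    have Unicode names starting with one of the prefixes above."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    hits = sum(
        unicodedata.name(ch, "").startswith(UNSPACED_SCRIPT_PREFIXES)
        for ch in letters
    )
    return hits / len(letters) >= threshold
```

For example, `uses_unspaced_script("眼鏡は近視用で黒のセルフレーム")` is true, while the English tweets above come back false. A check like this only tells you that whitespace-based tokenization is likely to fail; it does not do the tokenizing for you.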
The takeaway: if you ignore multiple languages, you’ll end up violating the assumptions behind major out-of-the-box NLP tools
So we know that just applying NLP tools designed for English willy-nilly won’t work on multiple languages. So what if we only grabbed the English-language data and then worked with that?
There are two big issues here:
- Correctly identifying which tweets are in English
- Throwing away data
Correctly identifying which tweets are in English
Probably the least time-intensive way to do this is by attempting to automatically identify the language that each Tweet is written in. A BIG grain of salt here: automatic language identifiers are very error prone, especially on very short texts. Let’s check out two of them.
- LangID: Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, pp. 553-561. Available from http://www.aclweb.org/anthology/I11-1062
- TextCat: Cavnar, W. B. and J. M. Trenkle, “N-Gram-Based Text Categorization” In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.
First off, here are the languages the first five tweets are actually written in, hand tagged by a linguist (i.e. me): the first two are in Turkish and the last three are in English.
Now let’s see how well two popular language identifiers can detect this.
    # identify the language of the first five tweets with LangID
    tweetsData['Tweet'][0:5].apply(langid.classify)

    0    (az, -30.30187177658081)
    1    (ms, -83.29260611534119)
    2    (en, 9.061840057373047)
    3    (en, -195.55468368530273)
    4    (en, -98.53013229370117)
    Name: Tweet, dtype: object
LangID does…alright, with three out of five tweets identified correctly. While it’s pretty good at identifying English, the first tweet was identified as Azerbaijani and the second tweet was labeled as Malay, which is very wrong (not even in the same language family as Turkish).
Let’s look at another algorithm, TextCat, which is based on character-level N-Grams.
    # N-Gram-Based Text Categorization
    tc = TextCat()
    # try to identify the languages of the first five tweets again
    tweetsData['Tweet'][0:5].apply(tc.guess_language)

    0    tur
    1    ind
    2    bre
    3    eng
    4    eng
    Name: Tweet, dtype: object
TextCat also got only three out of the five correct. Oddly, it identified “bed” as Breton. To be fair, “bed” is the Breton word for “world”, but it’s still a surprising guess.
The takeaway: Automatic language identification, especially on very short texts, is very error prone. (I’d recommend using multiple language identifiers & taking the majority vote.)
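The majority-vote idea can be sketched as follows. This is a minimal illustration, not a tested recipe: `majority_language`, `ISO3_TO_ISO1` and the `"und"` fallback label are names chosen for this example. It assumes each identifier is wrapped as a function that takes a string and returns a language code. Because LangID returns two-letter codes and TextCat returns three-letter ones, the codes are normalized before counting.

```python
from collections import Counter

# Map the three-letter codes TextCat returns onto the two-letter codes
# LangID returns, so votes for the same language are comparable.
# Only codes seen in the examples above are listed; extend as needed.
ISO3_TO_ISO1 = {"eng": "en", "tur": "tr", "ind": "id", "bre": "br"}


def majority_language(text, identifiers, fallback="und"):
    """Ask every identifier for a label and return the most common one.

    `identifiers` is a list of callables that take a string and return a
    language code. If no label wins a strict majority, return `fallback`.
    """
    votes = []
    for identify in identifiers:
        code = identify(text)
        votes.append(ISO3_TO_ISO1.get(code, code))
    if not votes:
        return fallback
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else fallback
```

With the two tools used here, the identifiers could be `lambda t: langid.classify(t)[0]` and `tc.guess_language`. With only two voters, any disagreement falls through to the fallback label, so a third identifier is what makes the vote useful.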
Throwing away data
Even if language identification were very accurate, how much data would we be throwing away if we only looked at data we were fairly sure was English?
Note: I’m only going to use LangID here for time reasons, but given the high error rate I’d recommend using multiple language identification algorithms.
    # get the language id for each text
    ids_langid = tweetsData['Tweet'].apply(langid.classify)
    # get just the language label (the first item of each (label, score) pair)
    langs = ids_langid.apply(lambda pair: pair[0])
    # how many unique language labels were applied?
    print("Number of tagged languages (estimated):")
    print(len(langs.unique()))
    # percent of the total dataset in English
    print("Percent of data in English (estimated):")
    print((sum(langs == "en") / len(langs)) * 100)

    Number of tagged languages (estimated):
    95
    Percent of data in English (estimated):
    40.963625976
Only about 41% of our data has been tagged as English by LangID. If we throw away the rest of it, we’re going to lose more than half of our dataset! Especially if this is data you spent a lot of time and money collecting, that seems downright wasteful. (Plus, it might skew our analysis.)
So if about 41% of our data is in English, what is the other 59% made up of? Let’s check out the distribution of data across languages in our dataset.
    # convert our list of languages to a dataframe
    langs_df = pd.DataFrame(langs)
    # count the number of times we see each language
    langs_count = langs_df.Tweet.value_counts()
    # horrible-looking barplot (I would suggest using R for visualization)
    langs_count.plot.bar(figsize=(20,10), fontsize=20)
There’s a really long tail on our dataset; most of the languages that were identified in our dataset were only identified a few times. This means that we can get a lot of mileage out of including just a few more popular languages in our analysis. How much will we gain, exactly?
    print("Languages with more than 400 tweets in our dataset:")
    print(langs_count[langs_count > 400])
    print("")
    print("Percent of our dataset in these languages:")
    print((sum(langs_count[langs_count > 400]) / len(langs)) * 100)

    Languages with more than 400 tweets in our dataset:
    en    4302
    es    1020
    pt     751
    ja     436
    tr     414
    id     407
    Name: Tweet, dtype: int64

    Percent of our dataset in these languages:
    69.7962292897
By including only five more languages in our analysis (Spanish, Portuguese, Japanese, Turkish and Indonesian) we can raise our coverage from about 41% to about 70% of the dataset, gaining almost another third of the tweets!
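To see how much each additional language buys, the coverage can be accumulated one language at a time. The sketch below is a small illustration using only the counts printed above and the dataset size of 10,502 tweets; the helper name `cumulative_coverage` is made up for this example.

```python
# Tweet counts per language label, copied from the output above.
counts = {"en": 4302, "es": 1020, "pt": 751, "ja": 436, "tr": 414, "id": 407}
total_tweets = 10502  # size of the full dataset


def cumulative_coverage(counts, total):
    """Return (language, percent covered so far) pairs, most common first."""
    covered = 0
    rows = []
    for lang, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        covered += n
        rows.append((lang, 100 * covered / total))
    return rows


coverage = cumulative_coverage(counts, total_tweets)
for lang, pct in coverage:
    print(f"{lang}: {pct:.1f}%")
```

Reading the list from top to bottom shows the diminishing returns of the long tail: each additional language adds less coverage than the one before it.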
The takeaway: Just incorporating a couple more languages in your analysis can give you access to a lot more data!
Ok, so what exactly does this pipeline look like? Let’s look at just the second most popular language in our dataset: Spanish. What happens when we pull out just the Spanish tweets & tokenize them?
    # get a list of tweets labelled "es" by langid
    spanish_tweets = tweetsData['Tweet'][langs == "es"]
    # load a Spanish-language Spacy model
    from spacy.es import Spanish
    nlp_es = Spanish(path=None)
    # apply the Spanish language model to our tweets
    doc_es = nlp_es(' '.join(spanish_tweets))
    # print the longest tokens
    sorted(doc_es, key=len, reverse=True)[0:5]
[ViernesSantoEnElColiseoRobertoClemente, MiFantasia1DEnWembleyConCocaColaFM, fortaleciéndonos','escenarios, DirectionersConCocaColaFM1D, http://t.co/ezZEsXN3MF\nvia]
This time, the longest tokens are Spanish-language hashtags. This is exactly the sort of thing we’d expect to see! From here, we can feed this tokenized dataset into other downstream tasks like sentiment analysis.
Of course, it would be impractical to do this for every single language in our dataset, even if we could be sure that they were all identified correctly. Unless you can commit a lot of time, you’ll probably have to accept that you won’t be able to consider every language in your dataset. But including any additional language will enrich your analysis!
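The overall shape of such a pipeline can be sketched as: group the texts by their language label, tokenize the languages you have tools for, and keep a record of what was skipped. This is only an outline under assumptions. The function names are made up for this example, `str.split` stands in for real language-specific tokenizers such as the Spacy models used above, and the Spanish sample text is invented.

```python
from collections import defaultdict


def group_by_language(texts, labels):
    """Group texts into {language label: [texts]}."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[label].append(text)
    return dict(groups)


def tokenize_supported(groups, tokenizers):
    """Tokenize only the languages we have a tokenizer for.

    Returns (tokens per language, labels that were skipped).
    """
    tokens, skipped = {}, []
    for label, texts in groups.items():
        tokenizer = tokenizers.get(label)
        if tokenizer is None:
            skipped.append(label)
            continue
        tokens[label] = [tokenizer(text) for text in texts]
    return tokens, skipped


# Placeholder tokenizers: str.split stands in for real language-specific models.
tokenizers = {"en": str.split, "es": str.split}
texts = ["Ladies drink and get in free", "Bed", "hola a todos", "卒業したった"]
labels = ["en", "en", "es", "ja"]

tokens, skipped = tokenize_supported(group_by_language(texts, labels), tokenizers)
```

Keeping the skipped labels around makes it easy to report how much of the dataset was left out, and to decide which language is worth adding next.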
The takeaway: It doesn’t have to be onerous to incorporate multiple languages in your analysis pipeline!
Option 1: Ignore Multilingualism
As we saw, this option will result in violating a lot of the assumptions built into NLP tools (e.g. there are spaces between words). If you do this, you’ll end up with a lot of noise and headaches as you try to move through your analysis pipeline.
Option 2: Only look at English
In this dataset, only looking at English would have led to us throwing away over half of our data. Especially as NLP tools are developed and made available for more and more languages, there’s less reason to stick to English-only NLP.
Option 3: Separate your data by language & analyze them independently
This does take a little more work than the other options… but not that much more, especially for languages that already have resources available for them.
Here are some pre-built language identifiers to use in addition to LangID and TextCat:
- detect_language() from TextBlob. Uses a Google API & requires Internet access.
- langdetect Python library: Port of Google’s language-detection library (version from 03/03/2014) to Python.
Dealing with texts which contain multiple languages (code switching):
It’s very common for a span of text to include multiple languages. This example contains English and Malay (“kain kain” is Malay for “unwrap”):
Roasted Chicken Rice with Egg. Kain kain! 🙂 [Image of a lunch wrapped in paper being unwrapped.]
How to automatically handle code switching is an active research question in NLP. Here are some resources to get you started learning more:
- Resources for code-switching: Including training data and corpora
- First Workshop on Computational Approaches to Code Switching: Includes information on a shared task on identifying & labeling code switching
- Overview for the Second Shared Task on Language Identification in Code-Switched Data: Results of the second shared task on code-switching, with an overview of different approaches.