Datasets for data cleaning practice

Looking for datasets to practice data cleaning or preprocessing on? Look no further!

Each of these datasets needs a little bit of TLC before it’s ready for different analysis techniques. For each dataset, I’ve included a link to where you can access it, a brief description of what’s in it, and an “issues” section describing what needs to be done or fixed in order for it to fit easily into a data analysis pipeline.

Big thanks to everyone in this Twitter thread who helped me out by pointing me towards these datasets and letting me know what sort of pre-processing each needed. There were also some other data sources I didn’t include here, so check it out if you need more practice data. And feel free to comment with links to other datasets that would make good data cleaning practice! 🙂

List of datasets:

  • Hourly Weather Surface – Brazil (Southeast region)
  • PhyloTree Data
  • International Comprehensive Ocean-Atmosphere Data Set
  • CLEANEVAL: Development dataset
  • London Air
  • SO MUCH CANDY DATA, SERIOUSLY
  • Production and Perception of Linguistic Voice Quality
  • Australian Marriage Law Postal Survey, 2017
  • The Metropolitan Museum of Art Open Access
  • National Drug Code Directory
  • Flourish OA
  • WikiPlots
  • Register of UK Parliament Members’ Financial Interests
  • NYC Gifted & Talented Scores

Hourly Weather Surface – Brazil (Southeast region)

It covers hourly weather data from 122 weather stations in the southeast region of Brazil. The southeast includes the states of Rio de Janeiro, São Paulo, Minas Gerais and Espírito Santo. Dataset Source: INMET (National Meteorological Institute – Brazil).

Issues: Can you predict the amount of rain? Temperature? NOTE: Not all weather stations have been operating since 2000.

PhyloTree Data

Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Human mitochondrial DNA is widely used as a tool in many fields including evolutionary anthropology and population history, medical genetics, genetic genealogy, and forensic science. Many applications require detailed knowledge about the phylogenetic relationship of mtDNA variants. Although the phylogenetic resolution of global human mtDNA diversity has greatly improved as a result of increasing sequencing efforts of complete mtDNA genomes, an updated overall mtDNA tree is currently not available. In order to facilitate a better use of known mtDNA variation, we have constructed an updated comprehensive phylogeny of global human mtDNA variation, based on both coding‐ and control region mutations. This complete mtDNA tree includes previously published as well as newly identified haplogroups.

Issues: This data would be more useful if it were in the Newick tree format and could be read in using the read.newick() function. Can you help get the data in this format?
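
To give a feel for the target format, here is a minimal sketch that writes a small nested tree out as a Newick string. It assumes the haplogroup tree has already been parsed into (name, children) tuples. That input structure, the function name to_newick and the toy labels are all illustrative assumptions and not the layout of the PhyloTree files.

```python
def to_newick(node):
    """Serialize a (name, children) tuple into a Newick subtree string."""
    name, children = node
    if not children:
        return name
    inner = ",".join(to_newick(child) for child in children)
    return f"({inner}){name}"


# Toy tree: the labels are placeholders, not real PhyloTree haplogroups.
toy_tree = ("root", [("A", [("A1", []), ("A2", [])]), ("B", [])])
newick = to_newick(toy_tree) + ";"  # a Newick string ends with a semicolon
print(newick)
```

Real haplogroup names may contain characters (such as apostrophes or spaces) that need quoting in Newick, which this sketch does not handle.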

International Comprehensive Ocean-Atmosphere Data Set

The International Comprehensive Ocean-Atmosphere Data Set (ICOADS) offers surface marine data spanning the past three centuries, and simple gridded monthly summary products for 2° latitude x 2° longitude boxes back to 1800 (and 1°x1° boxes since 1960)—these data and products are freely distributed worldwide. As it contains observations from many different observing systems encompassing the evolution of measurement technology over hundreds of years, ICOADS is probably the most complete and heterogeneous collection of surface marine data in existence.

Issues: The ICOADS contains O(500M) meteorological observations from ~1650 onwards. Issues include bad observation values, mis-positioned data, missing date/time information, supplemental data in a variety of formats, duplicates etc.

CLEANEVAL: Development dataset

CLEANEVAL is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus, for linguistic and language technology research and development. There are three versions of each file: original, pre-processed, and manually cleaned. All files of each kind are gathered in a directory. The file number remains the same for the three versions of the same file.

Issues: Your task is to “clean up” a set of webpages so that their contents can be easily used for further linguistic processing and analysis. In short, this implies:

  • removing all HTML/Javascript code and “boilerplate” (headers, copyright notices, link lists, materials repeated across most pages of a site, etc.);
  • adding a basic encoding of the structure of the page using a minimal set of symbols to mark the beginning of headers, paragraphs and list elements.

London Air

The London Air Quality Network (LAQN) is run by the Environmental Research Group of King’s College London. It was formed in 1993 to coordinate and improve air pollution monitoring in London. The network collects air pollution data from London boroughs, with each one funding monitoring in its own area. Increasingly, this information is being supplemented with measurements from local authorities surrounding London in Essex, Kent and Surrey, thereby providing an overall perspective of air pollution in South East England, as well as a greater understanding of pollution in London itself.

Issues: Lots of gaps (null/zero handling), outliers, date handling, pivots and time aggregation needed first!
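
As one possible starting point for the gap handling and time aggregation, the sketch below averages hourly readings into daily means and skips blank readings instead of treating them as zero. The (timestamp, site, value) layout, the site name and the values are made up for illustration. The real export may be organized differently, and whether a zero is a genuine reading or a gap is a decision you would still need to make.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Made-up rows in an assumed (timestamp, site, value) layout.
rows = [
    ("2017-01-01 00:00", "SiteA", "41.0"),
    ("2017-01-01 01:00", "SiteA", ""),  # gap: no reading recorded
    ("2017-01-01 02:00", "SiteA", "39.0"),
    ("2017-01-02 00:00", "SiteA", "20.0"),
]


def daily_means(rows):
    """Average the non-missing hourly values for each (site, date) pair."""
    buckets = defaultdict(list)
    for stamp, site, raw in rows:
        if raw.strip() == "":
            continue  # skip gaps rather than counting them as zero
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M").date()
        buckets[(site, day)].append(float(raw))
    return {key: mean(values) for key, values in buckets.items()}


result = daily_means(rows)
for (site, day), value in sorted(result.items()):
    print(site, day, value)
```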

SO MUCH CANDY DATA, SERIOUSLY

Data from the 2017 Boing Boing Halloween candy hierarchy. This is survey data from this survey.

Issues: If you want to look for longitudinal effects, you also have access to previous datasets. Unfortunate quirks in the data include the fact that the 2014 data is not the raw set (can’t seem to find it), and in 2015, the candy preference was queried without the MEH option.

Production and Perception of Linguistic Voice Quality

Data from the “Production and Perception of Linguistic Voice Quality” project at UCLA. This project was funded by NSF grant BCS-0720304 to Prof. Pat Keating, with Prof. Abeer Alwan, Prof. Jody Kreiman of UCLA, and Prof. Christina Esposito of Macalester College, for 2007-2012.

The data includes spreadsheet files with measures gathered using Voicesauce (Shue, Keating, Vicenik & Yu 2011) for both acoustic measures and EGG measures. The accompanying readme file provides information on the various coding used in both spreadsheets.

Issues: The following issues are with the acoustics measures spreadsheet specifically.

  1. xlsx format with meaningful color coding created by a VBA script (which is copy-pasted into the second sheet)
  2. partially wide format instead of long/tidy, with a ton of columns split into different timepoints
  3. line 6461 has another set of column headers rather than data for some of the columns starting with “shrF0_mean”. I think this was a copy-paste error. Hopefully it doesn’t mean that all of the data below that row is shifted down by 1!
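
For the third issue, a quick check like the one below can flag rows where a cell repeats a column name. This is only a sketch: it assumes the sheet has been exported to CSV first, and apart from “shrF0_mean” the column names and values in the sample are invented for the example.

```python
import csv
import io


def split_repeated_headers(lines):
    """Return (header, data_rows, suspect_row_numbers) for CSV text lines."""
    reader = csv.reader(lines)
    header = next(reader)
    header_names = set(header)
    data, suspects = [], []
    for row_number, row in enumerate(reader, start=2):
        # A cell that exactly matches a column name is probably a pasted header.
        if any(cell in header_names for cell in row if cell):
            suspects.append(row_number)
        else:
            data.append(row)
    return header, data, suspects


# Invented sample: the third line repeats header text inside the data.
sample = io.StringIO(
    "speaker,shrF0_mean,H1H2_mean\n"
    "s1,120.5,3.2\n"
    "s2,shrF0_mean,H1H2_mean\n"
    "s3,118.0,2.9\n"
)
header, data, suspects = split_repeated_headers(sample)
print(suspects)
```

Comparing the values just above and below each flagged row is one way to check whether the data after it has shifted.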

Australian Marriage Law Postal Survey, 2017

Response: Should the law be changed to allow same-sex couples to marry?

Of the eligible Australians who expressed a view on this question, the majority indicated that the law should be changed to allow same-sex couples to marry, with 7,817,247 (61.6%) responding Yes and 4,873,987 (38.4%) responding No. Nearly 8 out of 10 eligible Australians (79.5%) expressed their view.

All states and territories recorded a majority Yes response. 133 of the 150 Federal Electoral Divisions recorded a majority Yes response, and 17 of the 150 Federal Electoral Divisions recorded a majority No response.

Issues: Miles McBain discusses his approach to cleaning this dataset in depth in this blog post.

The Metropolitan Museum of Art Open Access

The Metropolitan Museum of Art provides select datasets of information on more than 420,000 artworks in its Collection for unrestricted commercial and noncommercial use. To the extent possible under law, The Metropolitan Museum of Art has waived all copyright and related or neighboring rights to this dataset using Creative Commons Zero. This work is published from: The United States Of America. You can also find the text of the CC Zero deed in the file LICENSE in this repository. These select datasets are now available for use in any media without permission or fee; they also include identifying data for artworks under copyright. The datasets support the search, use, and interaction with the Museum’s collection.

Issues: Missing values, inconsistent information, missing documentation, possible duplication, mixed text and numeric data.

National Drug Code Directory

The Drug Listing Act of 1972 requires registered drug establishments to provide the Food and Drug Administration (FDA) with a current list of all drugs manufactured, prepared, propagated, compounded, or processed by it for commercial distribution. (See Section 510 of the Federal Food, Drug, and Cosmetic Act (Act) (21 U.S.C. § 360)). Drug products are identified and reported using a unique, three-segment number, called the National Drug Code (NDC), which serves as a universal product identifier for drugs. FDA publishes the listed NDC numbers and the information submitted as part of the listing information in the NDC Directory which is updated daily.

The information submitted as part of the listing process, the NDC number, and the NDC Directory are used in the implementation and enforcement of the Act.

Issue: Non-trivial duplication (which drugs are different names for the same things?).

Flourish OA

Our data comes from a variety of sources, including researchers, web scraping, and the publishers themselves. All data is cleaned and reviewed to ensure its validity and integrity. Our catalog expands regularly, as does the number of features our data contains. We strive to maintain the most complete and sophisticated store of Open Access data in the world, and it is this mission that drives our continued work and expansion.

A dataset on journal/publisher information that is a bit dirty and might make for great practice. It’s been a graduate student/community project: http://flourishoa.org/

Issues: Scraped data, has some missing fields, possible duplication and some encoding issues (possibly multiple character encodings).

WikiPlots

The WikiPlots corpus is a collection of 112,936 story plots extracted from English language Wikipedia. These stories are extracted from any English language article that contains a sub-header that contains the word “plot” (e.g., “Plot”, “Plot Summary”, etc.).

This repository contains code and instructions for how to recreate the WikiPlots corpus.

The dataset itself can be downloaded from here: plots.zip (updated: 09/26/2017). The zip file contains two files:

  • plots: a text file containing all story plots. Each story plot is given with one sentence per line. Each story is followed by <EOS> on a line by itself.
  • titles: a text file containing a list of titles for each article in which a story plot was found and extracted.
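
Reading the plots file back into separate stories could look something like the sketch below. It assumes the end-of-story marker is the literal string <EOS> on its own line. The function name read_stories and the sample sentences are made up for illustration.

```python
def read_stories(lines, marker="<EOS>"):
    """Group sentence lines into stories, splitting on the marker line."""
    stories, current = [], []
    for line in lines:
        text = line.strip()
        if text == marker:
            stories.append(current)
            current = []
        elif text:
            current.append(text)
    if current:  # keep a trailing story that lacks a closing marker
        stories.append(current)
    return stories


# Made-up sample lines standing in for the contents of the plots file.
sample = ["A hero leaves home.", "She returns.", "<EOS>", "A ship sinks.", "<EOS>"]
stories = read_stories(sample)
print(len(stories))
```

Since the function accepts any iterable of lines, an open file handle for the plots file can be passed in directly. The resulting list could then be zipped with the lines of the titles file, assuming the two files are in the same order.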

Issues: Some lines may be cut off due to abbreviations. Some plots may be incomplete or contain surrounding irrelevant information.

Register of UK Parliament Members’ Financial Interests

The main purpose of the Register is to provide information about any financial interest which a Member has, or any benefit which he or she receives, which others might reasonably consider to influence his or her actions or words as a Member of Parliament.

Members must register any change to their registrable interests within 28 days. The rules are set out in detail in the Guide to the Rules relating to the Conduct of Members, as approved by the House on 17 March 2015. Interests which arose before 7 May 2015 are registered in accordance with earlier rules.

The Register is maintained by the Parliamentary Commissioner for Standards. It is updated fortnightly online when the House is sitting, and less frequently at other times. Interests remain on the Register for twelve months after they have expired.

Issues: Each member’s transactions are on a separate webpage with a different text format, with contributions listed under different headings (not necessarily one per line) and in different formats. Will take quite a bit of careful preprocessing to get into CSV or JSON format.

NYC Gifted & Talented Scores

A couple of messy but easy datasets: NYC parents reporting their kids’ scores on the gifted and talented exam, as well as school priority rankings. Some enter the percentiles as point scores, some skip them altogether, there’s no standard preference format, etc. Also, birth quarter affects percentiles.

Data science & kitchen gadgets

One of the things I really enjoy about my current job is chatting with other data science folks. Almost inevitably in the course of these conversations, the old “Python vs. R” debate comes up.

For those of you who aren’t familiar, Python and R are both programming languages often used by data scientists and other folks who work with data. Python is a general-purpose programming language (originally designed as a teaching language) that has some popular packages used for data analysis. R is a computer language specifically designed for doing statistics and visualization. They’re both useful languages, but R is much more specialized.

I use both Python & R, but I tend to prefer R for data analysis and visualization. I also love kitchen gadgets. (I own and routinely use a melon baller, albeit only very rarely for actually balling melon.) My hypothesis is that my preference for R and love of kitchen gadgets share the same underlying cause: I really like specialized tools.

I was curious to see if there was a similar relationship for other people, so I reached out to my Twitter followers with a simple two-question poll:


Do you prefer Python or R?

  • Python
  • R

How do you feel about specialized kitchen gadgets (e.g. veggie peelers, egg slicers, specialized knives)?

  • Hate ’em
  • Love ’em

185 people filled out the poll (if you were one of them, thanks!). Unfortunately for my hypothesis, a quick analysis of the results revealed no evidence that there was any relationship between whether someone prefers Python or R and if they like kitchen gadgets. You can check out the big ol’ null result for yourself:

Regardless of how poorly this experiment illustrates my point, however, it still stands: R is a specialized tool, while Python is a general-purpose one.

I like to think of R as a bread knife and Python as a pocket knife. It’s much easier to slice bread with a bread knife, but sometimes it’s more convenient to use a pocket knife if you already have it to hand.

If you spend a lot of time cleaning and analyzing data that’s already in a tabular format or doing statistical analysis, you might consider checking out R. It’s certainly saved me a lot of time. (Oh, and I juuuust so happen to have a couple of short R tutorials for folks with little to no programming background.)

Can your use of capitalization reveal your political affiliation?

This week, I’m in Vancouver for the meeting of the Association for Computational Linguistics. (On the subject of conferences, don’t forget that my offer to help linguistics students from underrepresented minorities with the cost of conferences still stands!) The work I’m presenting is on a new research direction I’m pursuing and I wanted to share it with y’all!

If you’ve read some of my other posts on sociolinguistics, you may remember that one of its central ideas is that certain types of language usage pattern together with aspects of people’s social identities. In the US, for example, calling a group of people “yinz” is associated with being from Pittsburgh. Or in Spanish, replacing certain “s” sounds with “th” sounds is associated with being from northern or central Spain. When a particular linguistic form is associated with a specific part of someone’s social identity, we call that a “sociolinguistic variable”.

There’s been a lot of work on the type of sociolinguistic variables people use when they’re speaking, but there’s been less work on what people do when they’re writing. And this does make a certain amount of sense: many sociolinguistic variables are either 1) something people aren’t aware they’re doing or 2) something that they’re aware they’re doing but might not consider “proper”. As a result, they tend not to show up in formal writing.

This is where the computational linguistics part comes in; people do a lot of informal writing on computers, especially on the internet. In fact, I’d wager that humans are producing more text now than at any other point in history, and a lot of it is produced in public places. That lets us look for sociolinguistic variables in writing in a way that wasn’t really possible before.

Which is a whole lot of background to be able to say: I’m looking at how punctuation and capitalization pattern with political affiliation on Twitter.

Political affiliation is something that other sociolinguists have definitely looked at. It’s also something that’s very, very noticeable on Twitter these days. This is actually a boon to this type of research. One of the hard things about doing research on Twitter is that you don’t always necessarily know someone’s social identity. And if you use a linguistic feature to try to figure out their identity when what you’re interested in is linguistic features, you quickly end up with the problem of circular logic.

Accounts which are politically active, however, will often explicitly state their political affiliation in their Twitter bio. And I used that information to get tweets from people I was very sure had a specific political affiliation.

For this project, I looked at people who use the hashtags #MAGA and #theResistance in their Twitter bios. The former is an initialism for “Make America Great Again” and is used by politically conservative folks who support President Trump. The latter is used by politically liberal folks who are explicitly opposed to President Trump. These two groups not only have different political identities, but also are directly opposed to each other. This means there’s good reason to believe that they will use language in different ways that reflect that identity.

But what about the linguistic half of the equation? Punctuation and capitalization are especially interesting to me because they seem to be capturing some of the same information we might find in prosody or intonation in spoken language. Things like YELLING or…pausing….or… uncertainty? Punctuation is also much, much easier to measure than intonation, which is notoriously difficult and time-consuming to annotate. At the same time, I have good evidence that how you use punctuation and capitalization has some social meaning. Check out this tweet, for example:

[Image: screenshot of a tweet]
As this tweet shows, putting a capital letter at the beginning of a tweet is anything but “aloof and uninterested yet woke and humorous”.

So, if punctuation and capitalization are doing something socially, is part of what they’re doing expressing political affiliation?

That’s what I looked into. I grabbed up to 100 tweets each from accounts which used either #MAGA or #theResistance in their Twitter bios. Then I looked at how much punctuation and capitalization users from these two groups used in their tweets.
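
If you want to try something similar yourself, one simple way to put numbers on “how much punctuation and capitalization” is a per-tweet rate, as in the sketch below. To be clear, this is an illustrative assumption and not necessarily the measure used in the study: the function names are made up, only ASCII punctuation is counted, and hashtags, @-mentions and URLs would all inflate the punctuation rate unless you strip them first.

```python
import string


def punctuation_rate(text):
    """Share of non-space characters that are ASCII punctuation marks."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in string.punctuation for c in chars) / len(chars)


def capitalization_rate(text):
    """Share of alphabetic characters that are uppercase."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


tweet = "WOW... ok"  # invented example text
print(punctuation_rate(tweet), capitalization_rate(tweet))
```

Averaging these rates over each account’s tweets would give one value per user, which is the kind of thing you could then compare across the two groups.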

Punctuation

First, I looked at all punctuation marks. I did find that, on average, liberal users tended to use less punctuation. But when I took a closer look at the data, an interesting pattern emerged. In both the liberal and conservative groups, there were two clusters of users: those who used a lot of punctuation and those who used almost none.

[Figure: punctuation use by political affiliation]
Politically liberal users on average tended to use less punctuation than politically conservative users, but in both groups there are really two sets of users: those who use a lot of punctuation and those who use basically none. There just happen to be more of the latter in #theResistance.

What gives rise to these two clusters? I honestly don’t know, but I do have a hypothesis. I think that there’s probably a second social variable in this data that I wasn’t able to control for. It seems likely that the user’s age might have something to do with it, or their education level, or even whether they use their Twitter account for professional or personal communication.

Capitalization

My intuition that there’s a second latent variable at work in this data is even stronger given the results for the amount of capitalization folks used. Conservative users tended to use more capitalization than the average liberal user, but there was a really strong bi-modal distribution for the liberal accounts.

[Figure: capitalization use by political affiliation]
Again, we see that conservative accounts use more of the marker (in this case capitalization), but that there’s a strong bi-modal distribution in the liberal users’ data.

What’s more, the liberal accounts that used a lot of punctuation also tended to use a lot of capitalization. Since these features are both ones that I associate with very “proper” usage (things like always starting a tweet with a capital letter, and ending it with a period) this seems to suggest that some liberal accounts are very standardized in their use of language, while others reject at least some of those standards.

So what’s the answer to the question I posed in the title? Can capitalization or punctuation reveal political affiliation? For now, I’m going to go with a solid “maybe”. Users who use very little capitalization and punctuation are more likely to be liberal… but so are users who use a lot of both. And, while I’m on the subject of caveats, keep in mind that I was only looking at very politically active accounts who discuss their politics in their user bios. These observations probably don’t apply to all Twitter accounts (and certainly not across different languages).

If you’re interested in reading more, you can check out the fancy-pants versions of this research here and here. And I definitely intend to continue looking at this; I’ll keep y’all posted on my findings. For now, however, off to find me a Nanaimo bar!

Contest announcement! Making noise and going places ✈️🛄

I recently wrote the acknowledgements section of my dissertation and it really put into perspective how much help I’ve received during my degree. I’ve decided to pass some of that on by helping out others! Specifically, I’ve decided to help make travelling to conferences a little more affordable for linguistics students who are from underrepresented minorities (African American, American Indian/Alaska Native, or Latin@), LGBT or have a disability.

To enter:

Entry is open to any student (graduate or undergraduate) studying any aspect of language (broadly defined) who is from an underrepresented minority (African American, American Indian/Alaska Native, or Latin@), LGBT or has a disability.  E-mail me and attach:

  • An abstract or paper that has been accepted at an upcoming (i.e. starting after June 23, 2017) conference
  • The acceptance letter/email from the conference
  • A short biography/description of your work

One entry per person, please!

Prizes:

I’ll pick up to two entries. Each winner will receive 100 American dollars to help them with costs associated with the conference, and I’ll write a blog post highlighting each winner’s research.

Contest closes July 31. I’ll contact winners by July 5.

Good luck!

What is computational sociolinguistics? (And who’s doing it?)

If you follow me on Twitter (@rctatman) you probably already know that I defended my dissertation last week. That’s right: I’m now officially Dr. Tatman! [party horn emoji]

I’ve spent a lot of time focusing on all the minutiae of writing a dissertation lately, from formatting references to correcting a lot of typos (my committee members are all heroes). As a result, I’m more than ready to zoom out and think about big-picture stuff for a little while. And, in academia at least, pictures don’t get much bigger than whole disciplines. Which brings me to the title of this blog post: computational sociolinguistics. I’ve talked about my different research projects quite a bit on this blog (and I’ve got a couple more projects coming up that I’m excited to share with y’all!) but they can seem a little bit scattered. What do patterns of emoji use have to do with how well speech recognition systems deal with different dialects, or with how people’s political affiliation is reflected in their punctuation use? The answer is that they all fall within the same discipline: computational sociolinguistics.

Computational sociolinguistics is a fairly new field that lies at the intersection of two other, more established fields: computational linguistics and sociolinguistics. You’re actually probably already familiar with at least some of the work being done in computational linguistics and its sister field of Natural Language Processing (commonly called NLP). The technologies that allow us to interact with computers or phones using human language, rather than binary 1’s and 0’s, are the result of decades of research in these fields. Everything from spell check, to search engines that know that “puppy” and “dog” are related topics, to automatic translation are the result of researchers working in computational linguistics and NLP.

Sociolinguistics is another well-established field, which focuses on the effects of social context on how we use and understand language. “Social context”, in this case, can be everything from someone’s identity–like their gender or where they’re from–to the specific linguistic situation they’re in, like how much they like the person they’re talking to or whether or not they think they can be overheard. While a lot of work in sociolinguistics is more qualitative, describing observations without a lot of exact measures, a lot of it is also quantitative.

So what happens when you squish these two fields together? For me, the result is work that focuses on research questions that would be more likely to be asked by sociolinguists, but using methods from computational linguistics and NLP. It also means asking sociolinguistic questions about how we use language in computational contexts, drawing on the established research fields of Computer Mediated Communication (CMC), Computational Social Science (CSS) and corpus linguistics, but with a stronger focus on sociolinguistics.

One difficult thing about working in a very new field, however, is that it doesn’t have the established social infrastructure that older fields do. If you do variationist sociolinguistics, for example, there’s an established conference (New Ways of Analyzing Variation, or NWAV) and journals (Language Variation and Change, American Speech, the Journal of Sociolinguistics). Older fields also have an established set of social norms. For instance, conferences are considered more prestigious research venues in computational linguistics, while for sociolinguistics journal publications are usually preferred. But computational sociolinguistics doesn’t really have any of that yet. There also isn’t an established research canon, or any textbooks, or a set of studies that you can assume most people in the field have had exposure to (with the possible exception of Dong et al.’s really fabulous survey article). This is exciting, but also a little bit scary, and really frustrating if you want to learn more about it. Science is about the communities that do it as much as it is about the thing that you’re investigating, and as it stands there’s not really an established formal computational sociolinguistics community that you can join.

Fortunately, I’ve got your back. Below, I’ve collected a list of a few of the scholars whose work I’d consider to be computational sociolinguistics along with small snippets of how they describe their work on their personal websites. This isn’t a complete list, by any means, but it’s a good start and should help you begin to learn a little bit more about this young discipline.

  • Jacob Eisenstein at Georgia Tech
    • “My research combines machine learning and linguistics to build natural language processing systems that are robust to contextual variation and offer new insights about social phenomena.”
  • Jack Grieve at the University of Birmingham
    • “My research focuses on the quantitative analysis of language variation and change. I am especially interested in grammatical and lexical variation in the English language across time and space and the development of new methods for collecting and analysing large corpora of natural language.”
  • Dirk Hovy at the University of Copenhagen
  • Michelle A. McSweeney at Columbia
    • “My research can be summed up by the question: How do we convey tone in text messaging? In face-to-face conversations, we rely on vocal cues, facial expressions, and other non-linguistic features to convey meaning. These features are absent in text messaging, yet digital communication technologies (text messaging, email, etc.) have entered nearly every domain of modern life. I identify the features that facilitate successful communication on these platforms and understand how the availability of digital technologies (i.e., mobile phones) has helped to shape urban spaces.”
  • Dong Nguyen at the University of Edinburgh & Alan Turing Institute
    • “I’m interested in Natural Language Processing and Information Retrieval, and in particular computational text analysis for research questions from the social sciences and humanities. I especially enjoy working with social media data.”
  • Sravana Reddy at Wellesley
    • “My recent interests center around computational sociolinguistics and the intersection between natural language processing and privacy. In the past, I have worked on unsupervised learning, pronunciation modeling, and applications of NLP to creative language.”
  • Tyler Schnoebelen at Decoded AI Consulting
    • “I’m interested in how people make meaning with language. My PhD is from Stanford’s Department of Linguistics (my dissertation was on language and emotion). I’ve also founded a natural language processing start-up (four years), did UX research at Microsoft (ten years) and ran the features desk of a national newspaper in Pakistan (one year).”
  • (Ph.D. Candidate) Philippa Shoemark at the University of Edinburgh
    • “My research interests span computational linguistics, natural language processing, cognitive modelling, and complex networks. I am currently using computational methods to identify social and individual factors that condition linguistic variation and change on social media, under the supervision of Sharon Goldwater and James Kirby.”
  • (Ph.D. Candidate) Sean Simpson at Georgetown University
    • “My interests include computa­tional socio­linguistics, socio­phonetics, language variation, and conservation & revitalization of endangered languages. My dissertation focuses on the incorporation of sociophonetic detail into automated speaker profiling systems.”

Should English be the official language of the United States?

There is currently a bill in the US House to make English the official language of the United States. These bills have been around for a while now. H.R. 997, also known as “The English Language Unity Act”, was first proposed in 2003. The companion bill, last seen as S. 678 in the 114th Congress, was first introduced to the Senate as S. 991 in 2009, and if history is any guide will be introduced again this session.

So if these bills have been around for a bit, why am I just talking about them now? Two reasons. First, I had a really good conversation about this with someone on Twitter the other day and I thought it would be useful to discuss this in more depth. Second, I’ve been seeing some claims that President Trump made English the official language of the U.S. (he didn’t), so I thought it would be timely to discuss why I think that’s such a bad idea.

As both a linguist and a citizen, I do not think that English should be the official language of the United States.

In fact, I don’t think the US should have any official language. Why? Two main reasons:

  • Historically, language legislation at a national level has… not gone well for other countries.
  • Picking one official language ignores the historical and current linguistic diversity of the United States.

Let’s start with how passing legislation making one or more languages official has gone for other countries. I’m going to stick with just two, Canada and Belgium, but please feel free to discuss others in the comments.

Canada

Unlike the US, Canada does have an official language. In fact, thanks to a 1969 law, they have two: English and French. If you’ve ever been to Canada, you know that road signs are all in both English and French.

This law was passed in the wake of turmoil in Quebec sparked by a Montreal school board’s decision to teach all first grade classes in French, much to the displeasure of the English-speaking residents of St. Leonard. Quebec later passed Bill 101 in 1977, making French the only official language of the province. One commenter on this article by the Canadian Broadcasting Corporation called this “the most divisive law in Canadian history”.

Language legislation and its enforcement in Canada have been particularly problematic for businesses. On one occasion, an Italian restaurant faced an investigation for using the word “pasta” on their menu, instead of the French “pâtes”. Multiple retailers have faced prosecution at the hands of the Office Québécois de la langue Française for failing to have retail websites available in both English and French. A Montreal boutique narrowly avoided a large fine for making Facebook posts only in English. There’s even an official list of English words that Quebec Francophones aren’t supposed to use. While I certainly support bilingualism, personally I would be less than happy to see the same level of government involvement in language use in the US.

In addition, having only French and English as the official languages of Canada leaves out a very important group: aboriginal language users. There are over 60 different indigenous languages used in Canada, with over 213 thousand speakers between them. And even those don’t make up the majority of languages spoken in Canada: there are over 200 different languages used in Canada and 20% of the population speaks neither English nor French at home.

Belgium

Another country with a very storied past in terms of language legislation is Belgium. The linguistic situation in Belgium is very complex (there’s a more in-depth discussion here), but the general gist is that there are three languages used in different parts of the country. Dutch is used in the north, French in the south, and German in parts of the east. There is also a deep cultural divide between these regions, which language legislation has long served as a proxy for. There have been no fewer than eight separate national laws passed restricting when and where each language can be used. In 1970, four distinct language regions were codified in the Belgian constitution. You can use whatever language you want in private but there are restrictions on what language you can use for government business, in court, in education and employment. While you might think that would put an end to legislation on language, tensions have continued to be high. In 2013, for instance, the European Court of Justice overturned a Flemish law requiring that contracts written in Flanders be in Dutch to be binding, after a contractor working on an English contract was fired. Again, this represents a greater level of official involvement with language use than I’m personally comfortable with.

I want to be clear: I don’t think multi-lingualism is a problem. As a linguist, I value every language and I also recognize that bilingualism offers significant cognitive benefits. My problem is with legislating which languages should be used in a multi-lingual situation; it tends to lead to avoidable strife.

The US

Ok, you might be thinking, but in the US we really are just an English-speaking country! We wouldn’t have that same problem here. Weeeeelllllll….

Tree map of languages in the United States
Languages of the United States, by speakers, based on data provided by the Modern Language Association, which is in turn based on 2010 census data.

The truth is, the US is very multilingual. We have a Language Diversity Index of .353, according to the UN. That means that, if you randomly picked two people from the United States population, the chance that they’d have different first languages is over 35%. That’s far higher than a lot of other English-speaking countries. The UK clocks in at .139, while New Zealand and Australia are at .102 and .126, respectively. (For the record, Canada is at .549 and Belgium .734.)
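If you’re curious how a number like that can be calculated, here’s a minimal sketch. I’m assuming the index is computed the way Greenberg’s diversity index is: one minus the sum of the squared population shares of each language. The function name and the speaker counts below are made up purely for illustration; they are not real census figures.

```python
def diversity_index(speaker_counts):
    """Greenberg-style diversity index: 1 - sum of squared population shares.

    The result is the probability that two people picked at random
    (with replacement) have different first languages.
    """
    total = sum(speaker_counts)
    if total <= 0:
        raise ValueError("need at least one speaker")
    shares = [count / total for count in speaker_counts]
    return 1.0 - sum(share * share for share in shares)


# Hypothetical populations (illustrative numbers only).
one_language = diversity_index([100])         # everyone shares a language
even_split = diversity_index([50, 50])        # two equally sized groups
mostly_one = diversity_index([80, 15, 5])     # one dominant language

print(one_language, even_split, round(mostly_one, 3))
```

A population where everyone shares one language scores 0, and the score climbs towards 1 as speakers are spread more evenly over more languages.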

The number of different languages spoken in the US is also remarkable. In New York City alone there may be speakers of as many as 800 different languages, which would make it one of the most linguistically diverse places in the world; like the Amazon rainforest of languages. In King County, where I live now, there are over 170 different languages spoken, with the most common being Spanish, Chinese, Vietnamese and Amharic. And that linguistic diversity is reflected in the education system: there are almost 5 million students in the US education system who are learning English, nearly 1 out of 10 students.

Multilingualism in the United States is nothing new, either: it’s been a part of the American experience since long before there was an America. Of course, there continue to be many speakers of indigenous languages in the United States, including Hawaiian (keep in mind that Hawaii did not actually want to become a state). But even within just European languages, English has never had sole dominion. Spanish has been spoken in Florida since the 1500s. At the time of the signing of the Declaration of Independence, an estimated 10% of the citizens of the newly-founded US spoke German (although the idea that it almost became the official language of the US is a myth). New York City? Used to be called New Amsterdam, and Dutch was spoken there into the 1900s. Even the troops fighting the Revolutionary War were doing so in at least five languages.

Making English the official language of the United States would ignore the rich linguistic history and the current linguistic diversity of this country. And, if other countries’ language legislation is any guide, it would cause a lot of unnecessary fuss.

Social Speech Perception: Reviewing recent work

As some of you may already know, it’s getting to be crunch time for me on my PhD. So for the next little bit, my blog posts will probably mostly be on things that directly relate to my dissertation. Consider this your behind-the-scenes pass to the research process.

With that in mind, today we’ll be looking at some work that’s closely related to my own. (And, not coincidentally, that I need to include in my lit review. Twofer.) They all share a common thread: social categories and speech perception. I’ve talked about this before, but these are more recent peer-reviewed papers on the topic.

Vowel perception by listeners from different English dialects

In this paper, Karpinska and coauthors investigated the role of a listener’s regional dialect on their use of two acoustic cues: formants and duration.

An acoustic cue is a specific part of the speech signal that you pay attention to in order to decide what sound you’re hearing. For example, when listening to a fricative, like “s” or “sh”, you pay a lot of attention to the high-pitched, staticky-sounding part of the sound to tell which fricative you’re hearing. This cue helps you tell the difference between “sip” and “ship”, and if it gets removed or covered by another sound you’ll have a hard time telling those words apart.

They found that for listeners from the UK, New Zealand, Ireland and Singapore, formants were the most important cue distinguishing the vowels in “bit” and “beat”. For Australian listeners, however, duration (how long the vowel was) was a more important cue to the identity of these vowels. This study provides additional evidence that a listener’s dialect affects their speech perception, and in particular which cues they rely on.

Social categories are shared across bilinguals’ lexicons

In this experiment Szakay and co-authors looked at English-Māori bilinguals from New Zealand. In New Zealand, there are multiple dialects of English, including Māori English (the variety used by Native New Zealanders) and Pākehā English (the variety used by white New Zealanders). The authors found that there was a cross-language priming effect from Māori to English, but only for Māori English.

Priming is a phenomenon in linguistics where hearing or saying a particular linguistic unit, like a word, later makes it easier to understand or say a similar unit. So if I show you a picture of a dog, you’re going to be able to read the word “puppy” faster than you would have otherwise because you’re already thinking about canines.

They argue that this is due to the activation of language forms associated with a specific social identity, in this case Māori ethnicity. This provides evidence that listeners’ beliefs about a speaker’s social identity affect their processing.

Intergroup Dynamics in Speech Perception: Interaction Among Experience, Attitudes and Expectations

Nguyen and co-authors investigate the effects of three separate factors on speech perception:

  • Experience: How much prior interaction a listener has had with a given speech variety.
  • Attitudes: How a listener feels about a speech variety and those who speak it.
  • Expectations: Whether a listener knows what speech variety they’re going to hear. (In this case only some listeners were explicitly told what speech variety they were going to hear.)

They found that these three factors all influenced the speech perception of Australian English speakers listening to Vietnamese-accented English, and that there was an interaction between these factors. In general, participants with correct expectations (i.e. being told beforehand that they were going to hear Vietnamese-accented English) identified more words correctly.

There was an interesting interaction between listeners’ attitudes towards Asians and their experience with Vietnamese-accented English. Listeners who had negative prejudices towards Asians and little experience with Vietnamese English were more accurate than those with little experience and positive prejudice. The authors suggest that this was due to listeners with negative prejudice being more attentive. However, the opposite effect was found for listeners with experience listening to Vietnamese English. In this group, positive prejudice increased accuracy while negative prejudice decreased it. There were, however, uneven numbers of participants between the groups, so this might have skewed the results.

For me, this study is most useful because it shows that a listener’s experience with a speech variety and their expectation of hearing it affect their perception. I would, however, like to see a larger listener sample, especially given the strong negative skew in listeners’ attitudes towards Asians (which I realize the researchers couldn’t have controlled for).

Perceiving the World through Group-Colored Glasses: A Perceptual Model of Intergroup Relations

Rather than presenting an experiment, this paper lays out a framework for the interplay of social group affiliation and perception. The authors pull together numerous lines of research showing that an individual’s own group affiliation can change their perception and interpretation of the same stimuli. In the authors’ own words:

The central premise of our model is that social identification influences perception.

While they discuss perception across many domains (visual, tactile, olfactory, etc.) the part which directly fits with my work is that of auditory perception. As they point out, auditory perception of speech depends on both bottom-up and top-down information. Bottom-up information, in speech perception, is the acoustic signal, while top-down information includes both knowledge of the language (like which words are more common) and social knowledge (like our knowledge of different dialects). While the authors do not discuss dialect perception directly, other work (including the three studies discussed above) fits nicely into this framework.

The key difference between this framework and Kleinschmidt & Jaeger’s Robust Speech Perception model is the centrality of the speaker’s identity. Since all language users have their own dialect which affects their speech perception (although, of course, some listeners can fluently listen to more than one dialect) it is important to consider both the listener’s and talker’s social affiliation when modelling speech perception.