Today’s blog post is a bit different. It’s in dance!
If that wasn’t quite clear enough for you, you can check this blog post for a more detailed explanation.
In the course of my day-to-day work on Kaggle’s public data platform, I’ve learned a lot about the ecosystem of language data on the web (or at least the portions of it that have been annotated in English). For example, I’ve noticed a weird disconnect between European and American data repositories that I’m pretty sure has its roots in historical and disciplinary divisions.
I’ve also found a lot of great resources, though! At some point, I started keeping notes on interesting data repositories and link aggregators. I finally got around to tidying up and annotating my list of resources, and I figured that it would be a useful thing to share with everyone. So, without further ado, here’s an (incomplete) list of some places to find language resources on the web:
Know of a resource I forgot to include? Link it in the comments!
If you’ve been following my blog for a while, you may remember that last year I found that YouTube’s automatic captions didn’t work as well for some dialects, or for women. The effects I found were pretty robust, but I wanted to replicate them for a couple of reasons:
With that in mind, I did a second analysis on both YouTube’s automatic captions and Bing’s speech API (that’s the same tech that’s inside Microsoft’s Cortana, as far as I know).
For this project, I used speech data from the International Dialects of English Archive. It’s a collection of English speech from all over, originally collected to help actors sound more realistic.
I used speech data from four varieties: the South (speakers from Alabama), the Northern Cities (Michigan), California (California) and General American. “General American” is the sort of news-caster style of speech that a lot of people consider unaccented–even though it’s just as much an accent as any of the others! You can hear a sample here.
For each variety, I did an acoustic analysis to make sure that speakers I’d selected actually did use the variety I thought they should, and they all did.
For the YouTube captions, I just uploaded the speech files to YouTube as videos and then downloaded the subtitles. (I would have used the API instead, but when I was doing this analysis there was no Python Google Speech API, even though very thorough documentation had already been released.)
Bing’s speech API was a little more complex. For this one, my co-author built a custom Android application that sent the files to the API & requested a long-form transcript back. For some reason, a lot of our sound files were returned as only partial transcriptions. My theory is that there is a running confidence function for the accuracy of the transcription, and once the overall confidence drops below a certain threshold, you get back whatever was transcribed up to there. I don’t know if that’s the case, though, since I don’t have access to their source code. Whatever the reason, the Bing transcriptions were less accurate overall than the YouTube transcriptions, even when we account for the fact that fewer words were returned.
OK, now to the results. Let’s start with dialect area. As you might be able to tell from the graphs below, there were pretty big differences between the two systems we looked at. In general, there was more variation in the word error rate for Bing and overall the error rate tended to be a bit higher (although that could be due to the incomplete transcriptions we mentioned above). YouTube’s captions were generally more accurate and more consistent. That said, both systems had different error rates across dialects, with the lowest average error rates for General American English.
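In case it helps, here’s a minimal Python sketch of how word error rate (WER) is typically computed: the number of word-level substitutions, deletions and insertions needed to turn the system’s transcript into the reference transcript, divided by the number of words in the reference. This isn’t the exact script behind the graphs. The function name is hypothetical, and the lowercasing and whitespace tokenization are simplifying assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase and split on whitespace; real evaluations
    # often normalize punctuation and numerals as well.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the length of the reference, a transcript that is cut off early gets penalized for every missing word, which is one reason partial transcriptions push the error rate up.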
Now, let’s turn to gender. If you read my earlier work, you’ll know that I previously found that YouTube’s automatic captions were more accurate for men and less accurate for women. This time, with carefully recorded speech samples, I found no robust difference in accuracy by gender in either system. Which is great! In addition, the unreliable trends for each system pointed in opposite ways; Bing had a lower WER for male speakers, while YouTube had a lower WER for female speakers.
So why did I find an effect last time? My (untested) hypothesis is that there was a difference in the signal to noise ratio for male and female speakers in the user-uploaded files. Since women are (on average) smaller and thus (on average) slightly quieter when they speak, it’s possible that their speech was more easily masked by background noises, like fans or traffic. These files were all recorded in a quiet place, however, which may help to explain the lack of difference between genders.
Finally, what about race? For this part of the analysis, I excluded General American speakers, since they did not report their race. I also excluded the single Native American speaker. Even with fewer speakers, and thus reduced power, the differences between races were still robust enough to be significant for YouTube’s automatic captions and Bing followed the same trend. Both systems were most accurate for Caucasian speakers.
While I was happy to find no difference in performance by gender, the fact that both systems made more errors on non-Caucasian and non-General-American speakers is deeply concerning. Regional varieties of American English and African American English are both consistent and well-documented. There is nothing intrinsic to these varieties that makes them less easy to recognize. The fact that they are recognized with more errors is most likely due to bias in the training data. (In fact, Mozilla is currently collecting diverse speech samples for an open corpus of training data–you can help them out yourself.)
There are two things I’m really worried about with these types of speech recognition errors. The first is that higher error rates seem to overwhelmingly affect already-disadvantaged groups. In the US, strong regional dialects tend to be associated with speakers who aren’t as wealthy, and there is a long and continuing history of racial discrimination in the United States.
Given this, the second thing I’m worried about is the fact that these voice recognition systems are being incorporated into other applications that have a real impact on people’s lives.
Every automatic speech recognition system makes errors. I don’t think that’s going to change (certainly not in my lifetime). But I do think we can get to the point where those errors don’t disproportionately affect already-marginalized people. And if we keep using automatic speech recognition in high-stakes situations, it’s vital that we get to that point quickly and, in the meantime, stay aware of these biases.
If you’re interested in the long version, you can check out the published paper here.
This week, I’m in Vancouver for the meeting of the Association for Computational Linguistics. (On the subject of conferences, don’t forget that my offer to help linguistics students from underrepresented minorities with the cost of conferences still stands!) The work I’m presenting is on a new research direction I’m pursuing and I wanted to share it with y’all!
If you’ve read some of my other posts on sociolinguistics, you may remember that one of its central ideas is that certain types of language usage pattern together with aspects of people’s social identities. In the US, for example, calling a group of people “yinz” is associated with being from Pittsburgh. Or in Spanish, replacing certain “s” sounds with “th” sounds is associated with being from northern or central Spain. When a particular linguistic form is associated with a specific part of someone’s social identity, we call that a “sociolinguistic variable”.
There’s been a lot of work on the type of sociolinguistic variables people use when they’re speaking, but there’s been less work on what people do when they’re writing. And this does make a certain amount of sense: many sociolinguistic variables are either 1) something people aren’t aware they’re doing or 2) something that they’re aware they’re doing but might not consider “proper”. As a result, they tend not to show up in formal writing.
This is where the computational linguistics part comes in; people do a lot of informal writing on computers, especially on the internet. In fact, I’d wager that humans are producing more text now than at any other point in history, and a lot of it is produced in public places. That lets us look for sociolinguistic variables in writing in a way that wasn’t really possible before.
Which is a whole lot of background to be able to say: I’m looking at how punctuation and capitalization pattern with political affiliation on Twitter.
Political affiliation is something that other sociolinguists have definitely looked at. It’s also something that’s very, very noticeable on Twitter these days. This is actually a boon to this type of research. One of the hard things about doing research on Twitter is that you don’t always necessarily know someone’s social identity. And if you use a linguistic feature to try to figure out their identity when what you’re interested in is linguistic features, you quickly end up with the problem of circular logic.
Accounts which are politically active, however, will often explicitly state their political affiliation in their Twitter bio. And I used that information to get tweets from people I was very sure had a specific political affiliation.
For this project, I looked at people who use the hashtags #MAGA and #theResistance in their Twitter bios. The former is an initialism for “Make America Great Again” and is used by politically conservative folks who support President Trump. The latter is used by politically liberal folks who are explicitly opposed to President Trump. These two groups not only have different political identities, but also are directly opposed to each other. This means there’s good reason to believe that they will use language in different ways that reflect that identity.
But what about the linguistic half of the equation? Punctuation and capitalization are especially interesting to me because they seem to be capturing some of the same information we might find in prosody or intonation in spoken language. Things like YELLING or…pausing….or… uncertainty? It’s also much, much easier to measure punctuation than intonation, which is notoriously difficult and time-consuming to annotate. At the same time, I have good evidence that how you use punctuation and capitalization has some social meaning. Check out this tweet, for example:
So, if punctuation and capitalization are doing something socially, is part of what they’re doing expressing political affiliation?
That’s what I looked into. I grabbed up to 100 tweets each from accounts which used either #MAGA or #theResistance in their Twitter bios. Then I looked at how much punctuation and capitalization users from these two groups used in their tweets.
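To make “how much punctuation and capitalization” concrete, here’s one rough way to measure it in Python. This is an illustrative sketch rather than the exact measure from the paper: it assumes each rate is a simple proportion of characters, pooled over all of a user’s tweets, and the function names are made up for this example.

```python
import string


def punctuation_rate(text):
    """Share of non-whitespace characters that are ASCII punctuation."""
    # Caveat: string.punctuation includes "#" and "@", so hashtags and
    # mentions count as punctuation unless they are stripped first.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in string.punctuation for c in chars) / len(chars)


def capitalization_rate(text):
    """Share of letters that are uppercase."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def user_profile(tweets):
    """Pool a user's tweets (e.g. up to 100) and compute both rates."""
    pooled = " ".join(tweets)
    return {
        "punctuation": punctuation_rate(pooled),
        "capitalization": capitalization_rate(pooled),
    }


if __name__ == "__main__":
    print(user_profile(["This is SO important!!!", "ok sure"]))
```

With one pair of numbers per account, comparing the two groups comes down to comparing the distributions of those rates.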
First, I looked at all punctuation marks. I did find that, on average, liberal users tended to use less punctuation. But when I took a closer look at the data, an interesting pattern emerged. In both the liberal and conservative groups, there were two clusters of users: those who used a lot of punctuation and those who used almost none.
What gives rise to these two clusters? I honestly don’t know, but I do have a hypothesis. I think that there’s probably a second social variable in this data that I wasn’t able to control for. It seems likely that the user’s age might have something to do with it, or their education level, or even whether they use their Twitter account for professional or personal communication.
My intuition that there’s a second latent variable at work in this data is even stronger given the results for the amount of capitalization folks used. Conservative users tended to use more capitalization than the average liberal user, but there was a really strong bi-modal distribution for the liberal accounts.
What’s more, the liberal accounts that used a lot of punctuation also tended to use a lot of capitalization. Since these features are both ones that I associate with very “proper” usage (things like always starting a tweet with a capital letter, and ending it with a period) this seems to suggest that some liberal accounts are very standardized in their use of language, while others reject at least some of those standards.
So what’s the answer to the question I posed in the title? Can capitalization or punctuation reveal political affiliation? For now, I’m going to go with a solid “maybe”. Users who use very little capitalization and punctuation are more likely to be liberal… but so are users who use a lot of both. And, while I’m on the subject of caveats, keep in mind that I was only looking at very politically active accounts who discuss their politics in their user bios. These observations probably don’t apply to all Twitter accounts (and certainly not across different languages).
If you’re interested in reading more, you can check out the fancy-pants versions of this research here and here. And I definitely intend to continue looking into this; I’ll keep y’all posted on my findings. For now, however, off to find me a Nanaimo bar!
You may already be familiar with the phenomenon I’m going to be talking about today: when someone punctuates some text with the clap emoji. It’s a pretty transparent gestural scoring and (for me) immediately brings to mind the way my mom would clap with every word when she was particularly exasperated with my sibling and me (it was usually along with speech like “let’s go, let’s go, let’s go” or “get up now”). It looks like so:
This innovation, which started on Black Twitter, is really interesting to me because it ties in with my earlier work on emoji ordering. I want to know where emojis go, particularly in relation to other words. Especially since people have since extended this usage to other emoji, like the US Flag:
Logically, there are several different ways you can intersperse clap emojis with text:
I want to know which of these best describes what people actually do. I’m not aiming to write an internet style guide, but I am hoping to characterize this phenomenon in a general way: this is how most people who do this do it, and if you want to use this style in a natural way, you should probably do it the same way.
I used Fireant to grab 10,000 tweets from the Twitter streaming API which had the clap emoji in them at least once. (Twitter doesn’t let you search for a certain number of matches of the same string. If you search for “blob” and “blob blob” you’ll get the same set of results.)
From that set of 10,000 tweets, I took only the tweets that had a clap emoji followed by a word followed by another clap emoji and threw out any repeats. That left me with 260 tweets. (This may seem pretty small compared to my starting dataset, but there were a lot of retweets in there, and I didn’t want to count anything twice.) Then I removed @usernames, since those show up in the beginning of any tweet that’s a reply to someone, and URLs, which I don’t really think of as “words”. Finally, I looked at each word in a tweet and marked whether it was a clap or not. You can see the results of that here:
The “word” axis represents which word in the tweet we’re looking at: the first, second, third, etc. The red portion of each bar shows the words that are the clap emoji. The yellow portion shows the words that aren’t. (BTW, big shoutout to Hadley Wickham’s emo(ji) package for letting me include emoji in plots!)
From this we can see a clear pattern: almost no one starts a tweet with an emoji, but most people follow the first word with an emoji. The up-down-up-down pattern means that people are alternating the clap emoji with one word. So if we look back at our hypotheses about how emoji are used, we can see right off the bat that three of them are wrong:
We can pick between the two remaining hypotheses by looking at whether people are ending their tweets with a clap emoji. As it turns out, the answer is “yes”, more often than not.
If they’re using this clapping-between-words pattern (sometimes called the “ratchet clap”) people are statistically more likely to end their tweet with a clap emoji than with a different word or non-clap emoji. This means the most common pattern is to use 👏 a 👏 clap 👏 after 👏 every 👏 word, 👏 including 👏 the 👏 last. 👏
This makes intuitive sense to me. This pattern is mimicking someone clapping on every word. Since we can’t put emoji on top of words to indicate that they’re happening at the same time, putting them after makes good intuitive sense. In some sense, each emoji is “attached” to the word that comes before it in a similar way to how “quickly” is “attached” to “run” in the phrase “run quickly”. It makes less sense to put emoji between words, because then you end up with fewer claps than words, which doesn’t line up well with the way this is done in speech.
The “clap after every word” pattern is also what this website that automatically puts claps in your tweets does, so I’m pretty positive this is a good characterization of community norms.
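If you want to try this on your own tweets, here’s a rough Python sketch of the steps described above: dropping repeats, keeping tweets with a clap-word-clap sequence, removing @usernames and URLs, marking each token as a clap or not, and checking whether the tweet ends on a clap. It’s an illustration under some assumptions, not the code used for the plot. All the function names are made up, skin-tone modifiers are simply stripped, and usernames and URLs are removed before the clap-word-clap check rather than after, which could change a few edge cases. The last function mimics the “clap after every word” style.

```python
import re

CLAP = "\U0001F44F"  # the clap emoji
# Assumption: treat every skin-tone variant as the same clap by
# stripping the Fitzpatrick modifier characters.
SKIN_TONES = re.compile("[\U0001F3FB-\U0001F3FF]")


def tokenize(tweet):
    """Split a tweet into word and clap tokens, minus @usernames and URLs."""
    text = SKIN_TONES.sub("", tweet)
    # Pad each clap with spaces so claps glued to words become tokens.
    text = text.replace(CLAP, f" {CLAP} ")
    return [
        tok for tok in text.split()
        if not tok.startswith("@") and not re.match(r"https?://", tok)
    ]


def has_clap_word_clap(tokens):
    """True if a non-clap token sits directly between two claps."""
    return any(
        tokens[i] == CLAP and tokens[i + 1] != CLAP and tokens[i + 2] == CLAP
        for i in range(len(tokens) - 2)
    )


def select_tweets(tweets):
    """Drop exact repeats, then keep tweets with a clap-word-clap run."""
    selected = []
    for tweet in dict.fromkeys(tweets):  # keeps order, removes duplicates
        tokens = tokenize(tweet)
        if has_clap_word_clap(tokens):
            selected.append(tokens)
    return selected


def mark_claps(tokens):
    """One boolean per token position: is this token a clap?"""
    return [tok == CLAP for tok in tokens]


def ends_with_clap(tokens):
    """True if the last token of the tweet is a clap."""
    return bool(tokens) and tokens[-1] == CLAP


def add_claps(text):
    """Put a clap after every word, including the last one."""
    return " ".join(f"{word} {CLAP}" for word in text.split())


if __name__ == "__main__":
    print(add_claps("do it like this."))
```

Averaging the output of mark_claps by position across all the selected tweets gives the share of claps at each word position, which is the kind of summary the bar chart shows.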
So there you have it! If you’re going to put clap emoji in your tweets, you should probably do 👏 it 👏 like 👏 this. 👏 It’s not wrong if you don’t, but it does look kind of weird.
I recently wrote the acknowledgements section of my dissertation and it really put into perspective how much help I’ve received during my degree. I’ve decided to pass some of that on by helping out others! Specifically, I’ve decided to help make travelling to conferences a little more affordable for linguistics students who are from underrepresented minorities (African American, American Indian/Alaska Native, or Latin@), LGBT or have a disability.
Entry is open to any student (graduate or undergraduate) studying any aspect of language (broadly defined) who is from an underrepresented minority (African American, American Indian/Alaska Native, or Latin@), LGBT or has a disability. E-mail me and attach:
One entry per person, please!
I’ll pick up to two entries. Each winner will receive 100 American dollars to help them with costs associated with the conference, and I’ll write a blog post highlighting each winner’s research.
Contest closes July 31, I’ll contact winners by July 5
If you follow me on Twitter (@rctatman) you probably already know that I defended my dissertation last week. That’s right: I’m now officially Dr. Tatman! [party horn emoji]
I’ve spent a lot of time focusing on all the minutiae of writing a dissertation lately, from formatting references to correcting a lot of typos (my committee members are all heroes). As a result, I’m more than ready to zoom out and think about big-picture stuff for a little while. And, in academia at least, pictures don’t get much bigger than whole disciplines. Which brings me to the title of this blog post: computational sociolinguistics. I’ve talked about my different research projects quite a bit on this blog (and I’ve got a couple more projects coming up that I’m excited to share with y’all!) but they can seem a little bit scattered. What do patterns of emoji use have to do with how well speech recognition systems deal with different dialects, or with how people’s political affiliation is reflected in their punctuation use? The answer is that they all fall within the same discipline: computational sociolinguistics.
Computational sociolinguistics is a fairly new field that lies at the intersection of two other, more established fields: computational linguistics and sociolinguistics. You’re actually probably already familiar with at least some of the work being done in computational linguistics and its sister field of Natural Language Processing (commonly called NLP). The technologies that allow us to interact with computers or phones using human language, rather than binary 1’s and 0’s, are the result of decades of research in these fields. Everything from spell check, to search engines that know that “puppy” and “dog” are related topics, to automatic translation are the result of researchers working in computational linguistics and NLP.
Sociolinguistics is another well-established field, which focuses on the effects of social context on how we use and understand language. “Social context”, in this case, can be everything from someone’s identity–like their gender or where they’re from–to the specific linguistic situation they’re in, like how much they like the person they’re talking to or whether or not they think they can be overheard. While a lot of work in sociolinguistics is more qualitative, describing observations without a lot of exact measures, a lot of it is also quantitative.
So what happens when you squish these two fields together? For me, the result is work that focuses on research questions that would be more likely to be asked by sociolinguists, but using methods from computational linguistics and NLP. It also means asking sociolinguistic questions about how we use language in computational contexts, drawing on the established research fields of Computer Mediated Communication (CMC), Computational Social Science (CSS) and corpus linguistics, but with a stronger focus on sociolinguistics.
One difficult thing about working in a very new field, however, is that it doesn’t have the established social infrastructure that older fields do. If you do variationist sociolinguistics, for example, there’s an established conference (New Ways of Analyzing Variation, or NWAV) and journals (Language Variation and Change, American Speech, the Journal of Sociolinguistics). Older fields also have an established set of social norms. For instance, conferences are considered more prestigious research venues in computational linguistics, while for sociolinguistics journal publications are usually preferred. But computational sociolinguistics doesn’t really have any of that yet. There also isn’t an established research canon, or any textbooks, or a set of studies that you can assume most people in the field have had exposure to (with the possible exception of Dong et al.’s really fabulous survey article). This is exciting, but also a little bit scary, and really frustrating if you want to learn more about it. Science is about the communities that do it as much as it is about the thing that you’re investigating, and as it stands there’s not really an established formal computational sociolinguistics community that you can join.
Fortunately, I’ve got your back. Below, I’ve collected a list of a few of the scholars whose work I’d consider to be computational sociolinguistics along with small snippets of how they describe their work on their personal websites. This isn’t a complete list, by any means, but it’s a good start and should help you begin to learn a little bit more about this young discipline.