Are “a female” and “a male” used differently?

In the first part of this two-post series, I looked at how “a male” and “a female” were used on Twitter. I found that one part-of-speech tagger tagged “male” as a proper noun really frequently (which is weird, cause it isn’t one) and that, overall, the phrase “a female” was waaaay more frequent. Which is interesting in itself, since my initial question was “are these terms used differently?” and these findings suggest that they are. But the second question is how are these terms used differently? To answer that, we’ll need to get a little more qualitative with it.

Using the same set of tweets that I collected last time, I randomly selected 100 tweets each from the “a male” and “a female” datasets. Then I hand tagged each subset of tweets for two things: the topic of the tweet (who or what was being referred to as “male” or “female”) and the part of speech of “male” or “female”.

Who or what is being called “male” or “female”?

[Plot: who or what was referred to as “a male” vs. “a female”]

Because there were so few tweets to analyze, I could do a content analysis. This is a methodology that is really useful when you don’t know for sure ahead of time what types of categories you’re going to see in your data. It’s like clustering that a human does.

Going into this analysis, I thought that there might be a difference between these datasets in terms of how often each term was used to refer to an animal, so I tagged tweets for that. But as I went through the tweets, I was floored by the really high number of tweets talking about trans people, especially Mack Beggs, a trans man from Texas who was forced to wrestle in the women’s division. Trans men were referred to as “a male” really, really often. While there wasn’t a reliable difference between how often “a female” and “a male” were used to refer to animals or humans, there was a huge difference in terms of how often they were used to refer to trans people. “A male” was significantly more likely to be used to describe a trans person than “a female” (χ²(2, N = 200) = 55.33, p < .001).
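If you’re curious what that kind of test looks like in code, here’s a minimal sketch in Python. The data frame below is made up purely for illustration (the category labels and counts are not my real coded data), but the recipe is the same: cross-tabulate the hand-coded tags, then run a chi-squared test on the table.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical hand-coded data: one row per tweet, with the term the tweet
# contained and the referent category it was coded as. The labels and counts
# here are made up for illustration; they are not the real coded data.
coded = pd.DataFrame({
    "term":     ["a male", "a male", "a male", "a female", "a female", "a female"],
    "referent": ["trans person", "cis person", "cis person",
                 "cis person", "animal", "cis person"],
})

# Cross-tabulate term by referent category...
table = pd.crosstab(coded["term"], coded["referent"])
print(table)

# ...and test whether the two terms are distributed differently across categories.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-squared({dof}) = {chi2:.2f}, p = {p:.3f}")
```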

Part of Speech

Since the part of speech taggers I used for the first half of my analysis gave me really mixed results, I also hand tagged the part of speech of “male” or “female” in my samples. In line with my predictions during data collection, the only parts of speech I saw were nouns and adjectives.

When I looked at just the difference between nouns and adjectives, there was a little difference, but nothing dramatic. Then, I decided to break it down a little further. Rather than just looking at the differences in part of speech between “male” and “female”, I looked at the differences in part of speech and whether the tweet was about a trans person or a cis (not trans) person.

[Plot: part of speech of “male” and “female” by whether the tweet was about a trans or cis person]

For tweets with “female”, it was used as a noun and an adjective at pretty much the same rates regardless of whether someone was talking about a trans person or a cis (non-trans) person. For tweets with “male”, though, when the tweet was about a trans person, it was used almost exclusively as a noun.

And there was a huge difference there. The large majority of tweets containing “a male” that were talking about a trans person used “male” as a noun. In fact, more than a third of my subsample of tweets using “a male” were using it as a noun to talk about someone who was trans.

So what’s going on here? This construction (using “male” or “female” as a noun to refer to a human) is used more often to talk about:

  1. Women. (Remember that in the first blog post looking at this, I found that “a female” is twice as common as “a male”.)
  2. Trans men.

These both make sense if you consider the cultural tendency to think about cis men as, in some sense, the “default”. (Deborah Tannen has a really good discussion of this in her article “Marked Women, Unmarked Men”. “Marked” is a linguistics term which gets used in a lot of ways, but generally means something like “not the default” or “the weird one”.) So people seem to be more likely to talk about a person being “a male” or “a female” when they’re talking about anyone but a cis man.

A note on African American English


I should note that many of the tweets in my sample were in African American English, which is not surprising given the large Black community on Twitter, and that use of “female” as a noun is a feature of this variety. However, the parallel term used to refer to men in this variety is not “a man” or even “a male”, but rather “nigga”, with that spelling. This is similar to “dude” or “guy”: a nonspecific term for any man, regardless of race, as discussed at length by Rachel Jeantel here. You can see an example of this usage in speech in the Netflix show “The Unbreakable Kimmy Schmidt” or in this vine. (I will note, however, that it only has this connotation if used by a speaker of African American English. Borrowing it into another variety, especially if the speaker is white, will change the meaning.)

Now, I’m not a native user of African American English, so I don’t have strong intuitions about the connotation of this usage. Taylor Amari Little (who you may know from her TEDx talk on Revolutionary Self-Produced Justice) is, though, and tweeted this (quoted with permission):

If they call women “females” 24/7, leave em alone chile, run away

And this does square with my own intuitions: there’s something slightly sinister about someone who refers to women exclusively as “females”. As journalist Vonny Moyes pointed out in her recent coverage of ads offering women free rent in exchange for sexual favors, they almost always refer to women as “girls or females – rarely ever women”. Personally, I find that very good motivation not to use “a male” or “a female” to talk about any human.


Can what you think you know about someone affect how you hear them?

I’ll get back to the “a male”/“a female” question in my next blog post (promise!), but for now I want to discuss some of the findings from my dissertation research. I’ve talked about my dissertation research a couple times before, but since I’m going to be presenting some of it in Spain (you can read the full paper here), I thought it would be a good time to share some of my findings.

In my dissertation, I’m looking at how what you think you know about a speaker affects what you hear them say. In particular, I’m looking at American English speakers who have just learned to correctly identify the vowels of New Zealand English. Due to an on-going vowel shift, the New Zealand English vowels are really confusing for an American English speaker, especially the vowels in the words “hid”, “head” and “had”.

[Plot: individual US and NZ English vowel tokens by their first and second formants]

This plot shows individual vowel tokens by the frequency of their first and second formants (high-intensity frequency bands in the vowel). Note that the New Zealand “had” is very close to the US “head”, and the New Zealand “head” is really close to the US “hid”.
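If you’re wondering how a plot like that gets made, here’s a rough sketch in Python. It assumes a hypothetical CSV of formant measurements (with columns for dialect, word, F1 and F2), not my actual data; the only real trick is that vowel plots are conventionally drawn with both axes flipped.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input: one row per vowel token, with columns for
# dialect ("US" or "NZ"), word ("hid"/"head"/"had"), F1 and F2 in Hz.
tokens = pd.read_csv("vowel_tokens.csv")

fig, ax = plt.subplots()
for (dialect, word), group in tokens.groupby(["dialect", "word"]):
    ax.scatter(group["F2"], group["F1"], label=f"{dialect} {word}", alpha=0.6)

# Vowel plots are conventionally drawn with both axes reversed, so that
# high front vowels end up in the top left, like on an articulatory chart.
ax.invert_xaxis()
ax.invert_yaxis()
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
ax.legend()
plt.show()
```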

These overlaps can be pretty confusing when American English speakers are talking to New Zealand English speakers, as this Flight of the Conchords clip shows!

The good news is that, as language users, we’re really good at learning new varieties of languages we already know, so it only takes a couple minutes for an American English speaker to learn to correctly identify New Zealand English vowels. My question was this: once an American English speaker has learned to understand the vowels of New Zealand English, how do they know when to use this new understanding?

In order to test this, I taught twenty-one American English speakers who hadn’t had much, if any, previous exposure to New Zealand English to correctly identify the vowels in the words “head”, “heed” and “had”. While I didn’t play them any examples of a New Zealand “hid”–the vowel in “hid” is said more quickly in addition to having different formants, so there’s more than one way it varies–I did let them say that they’d heard “hid”, which meant I could tell if they were making the kind of mistakes you’d expect given the overlap between a New Zealand “head” and American “hid”.

So far, so good: everyone quickly learned the New Zealand English vowels. To make sure they weren’t just learning to understand the one talker they’d been listening to, I tested half of my listeners on both American English and New Zealand English vowels spoken by a second, different talker. These folks I told where the talker they were listening to was from. And, sure enough, they transferred what they’d learned about New Zealand English to the new New Zealand speaker, while still correctly identifying vowels in American English.

The really interesting results here, though, are the ones that came from the second half of the listeners. This group I lied to. I know, I know, it wasn’t the nicest thing to do, but it was in the name of science and I did have the approval of my institutional review board (the group of people responsible for making sure we scientists aren’t doing anything unethical).

In an earlier experiment, I’d played only New Zealand English at this point, and when I told them the person they were listening to was from America, they completely changed the way they listened to those vowels: they labelled New Zealand English vowels as if they were from American English, even though they’d just learned the New Zealand English vowels. And that’s what I found this time, too. Listeners learned the New Zealand English vowels, but “undid” that learning if they thought the speaker was from the same dialect as them.

But what about when I played someone vowels from their own dialect, but told them the speaker was from somewhere else? In this situation, listeners ignored my lies. They didn’t apply the learning they’d just done. Instead, they correctly treated the vowels of their own dialect as if they were, in fact, from their dialect.

At first glance, this seems like something of a contradiction: I just said that listeners rely on social information about the person who’s talking, but at the same time they ignore that same social information.

So what’s going on?

I think there are two things underlying this difference. The first is the fact that vowels move. And the second is the fact that you’ve heard a heck of a lot more of your own dialect than one you’ve been listening to for fifteen minutes in a really weird training experiment.

So what do I mean when I say vowels move? Well, remember when I talked about formants above? These are areas of high acoustic energy that occur at certain frequency ranges within a vowel and they’re super important to human speech perception. But what doesn’t show up in the plot up there is that these aren’t just static across the course of the vowel–they move. You might have heard of “diphthongs” before: those are vowels where there’s a lot of formant movement over the course of the vowel.

And the way that vowels move is different between different dialects. You can see the differences in the way New Zealand and American English vowels move in the figure below. Sure, the formants are in different places—but even if you slid them around so that they overlapped, the shape of the movement would still be different.


Comparison of how the New Zealand and American English vowels move. You can see that the shape of the movement for each vowel is really different between these two dialects.  
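If you want to poke at formant movement yourself, here’s a rough sketch of one way to measure it in Python with the praat-parselmouth package. The sound file name and the vowel start and end times are placeholders; in a real analysis they’d come from a TextGrid or a forced aligner.

```python
import numpy as np
import parselmouth  # the praat-parselmouth package

# Hypothetical recording and vowel interval (in seconds); in a real analysis
# these would come from a hand-marked TextGrid or a forced aligner.
snd = parselmouth.Sound("head_token.wav")
vowel_start, vowel_end = 0.12, 0.27

# Estimate formants with Praat's Burg algorithm, then sample F1 and F2 at
# evenly spaced points through the vowel to get the shape of the movement.
formants = snd.to_formant_burg()
times = np.linspace(vowel_start, vowel_end, 10)
f1 = [formants.get_value_at_time(1, t) for t in times]
f2 = [formants.get_value_at_time(2, t) for t in times]

for t, a, b in zip(times, f1, f2):
    print(f"t = {t:.3f} s   F1 = {a:.0f} Hz   F2 = {b:.0f} Hz")
```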

Ok, so the vowels are moving in different ways. But why are listeners doing different things between the two dialects?

Well, remember how I said earlier that you’ve heard a lot more of your own dialect than one you’ve been trained on for maybe five minutes? My hypothesis is that, for the vowels in your own dialect, you’re highly attuned to these movements. And when a scientist (me) comes along and tells you something that goes against your huge amount of experience with these shapes, even if you do believe them, you’re so used to automatically understanding these vowels that you can’t help but correctly identify them. BUT if you’ve only heard a little bit of a new dialect, you don’t have a strong idea of what these vowels should sound like, so you’re going to rely more on the other types of information available to you–like where you’re told the speaker is from–even if that information is incorrect.

So, to answer the question I posed in the title, can what you think you know about someone affect how you hear them? Yes… but only if you’re a little uncertain about what you heard in the first place, perhaps because it’s a dialect you’re unfamiliar with.

What’s up with calling a woman “a female”? A look at the parts of speech of “male” and “female” on Twitter.

This is something I’ve written about before, but I’ve recently had several discussions with people who say they don’t find it odd to refer to a woman as “a female”. Personally, I don’t like being called “a female” because it’s a term I associate strongly with talking about animals. (Plus, it makes you sound like a Ferengi.) I would also protest men being called “males”, for the same reason, but my intuition is that that doesn’t happen as often. I’m willing to admit that my intuition may be wrong in this case, though, so I’ve decided to take a more data-driven approach. I had two main questions:

  • Do “male” and “female” get used as nouns at different rates?
  • Does one of these terms get used more often?

Data collection

I used the Twitter public API to collect two thousand English tweets: one thousand containing the exact string “a male” and one thousand containing “a female”. I looked for these strings to help get as many tweets as possible with “male” or “female” used as a noun. “A” is what linguists call a determiner, and a determiner has to have a noun after it. It doesn’t have to be the very next word, though; you can get an adjective first, like so:

  • A female mathematician proved the theorem.
  • A female proved the theorem.

So this will let me directly compare these words in a situation where we should only be able to see a limited number of possible parts of speech & see if they differ from each other. Rather than tagging two thousand tweets by hand, I used a Twitter-specific part-of-speech tagger to tag each set of tweets.

A part of speech tagger is a tool that guesses the part of speech of every word in a text. So if you tag a sentence like “Apples are tasty”, you should get back that “apples” is a plural noun, “are” is a verb and “tasty” is an adjective. You can try one out for yourself on-line here.
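Here’s a tiny example of what that looks like in Python with NLTK’s default (non-Twitter) tagger, assuming you have NLTK installed and its standard English models downloaded. This isn’t the tagger I used for the analysis below, but the output is the same kind of thing:

```python
import nltk

# One-time downloads: the tokenizer and the default English POS tagger.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Apples are tasty")
print(nltk.pos_tag(tokens))
# Something like: [('Apples', 'NNS'), ('are', 'VBP'), ('tasty', 'JJ')]
# NNS = plural noun, VBP = present-tense verb, JJ = adjective, using the
# Penn Treebank tag set discussed below.
```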

Parts of Speech

In line with my predictions, every instance of “male” or “female” was tagged as either a noun, an adjective or a hashtag. (I went through and looked at the hashtags and they were all porn bots. #gross #hazardsOfTwitterData)

However, not every noun was tagged as the same type of noun. I saw three types of tags in my data: NN (regular old noun), NNS (plural noun) and, unexpectedly, NNP (proper noun, singular). (If you’re confused by the weird upper case abbreviations, they’re the tags used in the Penn Treebank, and you can see the full list here.) In case it’s been a while since you studied parts of speech, proper nouns are things like personal or place names. The stuff that tends to get capitalized in English. The examples from the Penn Treebank documentation include “Motown”, “Venneboerger” and “Czestochwa”. I wouldn’t consider either “female” or “male” a name, so it’s super weird that they’re getting tagged as proper nouns. What’s even weirder? It’s pretty much only “male” that’s getting tagged as a proper noun, as you can see below:


Number of times each word was tagged as each part of speech by the GATE Twitter part-of-speech tagger. NNS is a plural noun, NNP a proper noun, NN a noun and JJ an adjective.

The difference in tagged POS between “male” and “female” was super robust (χ²(6, N = 2033) = 1019.2, p < .01). So what’s happening here? My first thought was that it might be that, for some reason, “male” is getting capitalized more often and that was confusing the tagger. But when I looked into it, there wasn’t a strong difference between the capitalization of “male” and “female”: both were capitalized about 3% of the time.

My second thought was that it was a weirdness showing up because I used a tagger designed for Twitter data. Twitter is notoriously “messy” (in the sense that it can be hard for computers to deal with) so it wouldn’t be surprising if tagging “male” as a proper noun is the result of the tagger being trained on Twitter data. So, to check that, I re-tagged the same data using the Stanford POS tagger. And, sure enough, the weird thing where “male” is overwhelmingly tagged as a proper noun disappeared.


Number of times each word was tagged as each part of speech by the Stanford POS tagger. NNS is a plural noun, NNP a proper noun, NN a noun, JJ an adjective and FW a “foreign word”.

So it looks like “male” being tagged as a proper noun is an artifact of the tagger being trained on Twitter data; once we use a tagger trained on a different set of texts (in this case the Wall Street Journal), there isn’t a strong difference in what POS “male” and “female” are tagged as.

Rate of Use

That said, there was a strong difference between “a female” and “a male”: how often they get used. In order to get one thousand tweets with the exact string “a female”, Twitter had to go back an hour and thirty-four minutes. In order to get a thousand tweets with “a male”, however, Twitter had to go back two hours and fifty-eight minutes. Based on this sample, “a female” gets said almost twice as often as “a male”.
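That “almost twice” figure falls straight out of the two collection windows; here’s the back-of-the-envelope arithmetic:

```python
# 1,000 tweets with "a female" took 1 h 34 min; 1,000 with "a male" took 2 h 58 min.
female_minutes = 1 * 60 + 34   # 94 minutes
male_minutes   = 2 * 60 + 58   # 178 minutes

female_rate = 1000 / female_minutes   # about 10.6 tweets per minute
male_rate   = 1000 / male_minutes     # about 5.6 tweets per minute

print(f'"a female": {female_rate:.1f}/min, "a male": {male_rate:.1f}/min')
print(f"ratio: {female_rate / male_rate:.2f}")   # about 1.89, i.e. almost twice as often
```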

So what’s the deal?

  • Do “male” and “female” get used as nouns at different rates?  It depends on what tagger you use! In all seriousness, though, I’m not prepared to claim this based on the dataset I’ve collected.
  • Does one of these terms get used more often? Yes! Based on my sample, Twitter users use “a female” about twice as often as “a male”.

I think the greater rate of use of “a female” points to the possibility of an interesting underlying difference in how “male” and “female” are used, one that calls for a closer qualitative analysis. Does one term get used to describe animals more often than the other? What sort of topics are people talking about when they say “a male” and “a female”? These questions, however, will have to wait for the next blog post!

In the meantime, I’m interested in getting more opinions on this. How do you feel about using “a male” and “a female” as nouns to talk about humans? Do they sound OK or strike you as odd?

My code and data are available on my GitHub.

Should English be the official language of the United States?

There is currently a bill in the US House to make English the official language of the United States. These bills have been around for a while now. H.R. 997, also known as the “English Language Unity Act”, was first proposed in 2003. The companion bill, last seen as S. 678 in the 114th Congress, was first introduced to the Senate as S. 991 in 2009, and, if history is any guide, will be introduced again this session.

So if these bills have been around for a bit, why am I just talking about them now? Two reasons. First, I had a really good conversation about this with someone on Twitter the other day and I thought it would be useful to discuss  this in more depth. Second, I’ve been seeing some claims that President Trump made English the official language of the U.S. (he didn’t), so I thought it would be timely to discuss why I think that’s such a bad idea.

As both a linguist and a citizen, I do not think that English should be the official language of the United States.

In fact, I don’t think the US should have any official language. Why? Two main reasons:

  • Historically, language legislation at a national level has… not gone well for other countries.
  • Picking one official language ignores the historical and current linguistic diversity of the United States.

Let’s start with how passing legislation making one or more languages official has gone for other countries. I’m going to stick with just two, Canada and Belgium, but please feel free to discuss others in the comments.

Canada

Unlike the US, Canada does have an official language. In fact, thanks to a  1969 law, they have two: English and French. If you’ve ever been to Canada, you know that road signs are all in both English and French.

This law was passed in the wake of turmoil in Quebec sparked by a Montreal school board’s decision to teach all first grade classes in French, much to the displeasure of the English-speaking residents of St. Leonard. Quebec later passed Bill 101 in 1977, making French the only official language of the province. One commenter on this article by the Canadian Broadcasting Corporation called this “the most divisive law in Canadian history”.

Language legislation and its enforcement in Canada have been particularly problematic for businesses. On one occasion, an Italian restaurant faced an investigation for using the word “pasta” on their menu, instead of the French “pâtes”. Multiple retailers have faced prosecution at the hands of the Office Québécois de la langue Française for failing to have retail websites available in both English and French. A Montreal boutique narrowly avoided a large fine for making Facebook posts only in English. There’s even an official list of English words that Quebec Francophones aren’t supposed to use. While I certainly support bilingualism, personally I would be less than happy to see the same level of government involvement in language use in the US.

In addition, having only French and English as the official languages of Canada leaves out a very important group: aboriginal language users. There are over 60 different indigenous languages used in Canada, spoken by over 213,000 people. And even those don’t make up the majority of languages spoken in Canada: there are over 200 different languages used in Canada and 20% of the population speaks neither English nor French at home.

Belgium

Another country with a very storied past in terms of language legislation is Belgium. The linguistic situation in Belgium is very complex (there’s a more in-depth discussion here), but the general gist is that there are three languages used in different parts of the country: Dutch in the north, French in the south, and German in parts of the east. There is also a deep cultural divide between these regions, which language legislation has long served as a proxy for. There have been no fewer than eight separate national laws passed restricting when and where each language can be used. In 1970, four distinct language regions were codified in the Belgian constitution. You can use whatever language you want in private, but there are restrictions on what language you can use for government business, in court, in education and in employment. While you might think that would put a rest to legislation on language, tensions have continued to be high. In 2013, for instance, the European Court of Justice overturned a Flemish law requiring contracts written in Flanders to be in Dutch in order to be binding, after a contractor working under an English contract was fired. Again, this represents a greater level of official involvement with language use than I’m personally comfortable with.

I want to be clear: I don’t think multi-lingualism is a problem. As a linguist, I value every language and I also recognize that bilingualism offers significant cognitive benefits. My problem is with legislating which languages should be used in a multi-lingual situation; it tends to lead to avoidable strife.

The US

Ok, you might be thinking, but in the US we really are just an English-speaking country! We wouldn’t have that same problem here. Weeeeelllllll….

The truth is, the US is very multilingual. We have a Language Diversity Index of .353, according to the UN. That means that, if you randomly picked two people from the United States population, the chance that they’d speak two different languages is over 35%. That’s far higher than a lot of other English-speaking countries. The UK clocks in at .139,  while New Zealand and Australia are at .102 and .126, respectively. (For the record, Canada is at .549 and Belgium .734.)
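As far as I know, the figure here is Greenberg’s diversity index: one minus the sum of each language’s squared share of speakers. Here’s a quick sketch of the calculation with completely made-up shares, just to show how it works:

```python
# Greenberg's diversity index: the probability that two randomly chosen people
# have different (home) languages, given each language's share of speakers.
def diversity_index(shares):
    return 1 - sum(p ** 2 for p in shares)

# Completely made-up shares (not real US figures): 80% share one home language,
# the rest is split across three smaller ones.
print(diversity_index([0.80, 0.10, 0.06, 0.04]))   # about 0.34, in the ballpark of the US's 0.353
```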

The number of different languages spoken in the US is also remarkable. In New York City alone there may be speakers of as many as 800 different languages, which would make it one of the most linguistically diverse places in the world: a sort of Amazon rain-forest of languages. In King County, where I live now, there are over 170 different languages spoken, with the most common being Spanish, Chinese, Vietnamese and Amharic. And that linguistic diversity is reflected in the education system: there are almost 5 million students in the US education system who are learning English, nearly 1 out of every 10 students.

Multilingualism in the United States is nothing new, either: it’s been a part of the American experience since long before there was an America. Of course, there continue to be many speakers of indigenous languages in the United States, including Hawaiian (keep in mind that Hawaii did not actually want to become a state). But even within just European languages, English has never had sole dominion. Spanish has been spoken in Florida since the 1500’s. At the time of the signing of the Declaration of Independence, an estimated 10% of the citizens of the newly-founded US spoke German (although the idea that it almost became the official language of the US is a myth). New York City? It used to be called New Amsterdam, and Dutch was spoken there into the 1900’s. Even the troops fighting the Revolutionary War were doing so in at least five languages.

Making English the official language of the United States would ignore the rich linguistic history and the current linguistic diversity of this country. And, if other countries’ language legislation is any guide, would cause a lot of unnecessary fuss.

Social Speech Perception: Reviewing recent work

As some of you may already know, it’s getting to be crunch time for me on my PhD. So for the next little bit, my blog posts will probably mostly be on things that are directly related to my dissertation. Consider this your behind-the-scenes pass to the research process.

With that in mind, today we’ll be looking at some work that’s closely related to my own. (And, not coincidentally, that I need to include in my lit review. Twofer.) They all share a common thread: social categories and speech perception. I’ve talked about this before, but these are more recent peer-reviewed papers on the topic.

Vowel perception by listeners from different English dialects

In this paper, Karpinska and coauthors investigated the role of a listener’s regional dialect in their use of two acoustic cues: formants and duration.

An acoustic cue is a specific part of the speech signal that you pay attention to to help you decide what sound you’re hearing. For example, when listening to a fricative, like “s” or “sh”, you pay a lot of attention to the high-pitched, staticky-sounding part of the sound to tell which fricative you’re hearing. This cue helps you tell the difference between “sip” and “ship”, and if it gets removed or covered by another sound you’ll have a hard time telling those words apart.

They found that for listeners from the UK, New Zealand, Ireland and Singapore, formants were the most important cue distinguishing the vowels in “bit” and “beat”. For Australian listeners, however, duration (how long the vowel was) was a more important cue to the identity of these vowels. This study provides additional evidence that a listener’s dialect affects their speech perception, and in particular which cues they rely on.

Social categories are shared across bilinguals’ lexicons

In this experiment Szakay and co-authors looked at English-Māori bilinguals from New Zealand. In New Zealand, there  are multiple dialects of English, including Māori English (the variety used by Native New Zealanders) and Pākehā English (the variety used by white New Zealanders). The authors found that there was a cross-language priming effect from Māori to English, but only for  Māori English.

Priming is a phenomenon in linguistics where hearing or saying a particular linguistic unit, like a word, later makes it easier to understand or say a similar unit. So if I show you a picture of a dog, you’re going to be able to read the word “puppy” faster than you would have otherwise because you’re already thinking about canines.

They argue that this is due to the activation of language forms associated with a specific social identity–in this case Māori ethnicity. This provides evidence that listeners’ beliefs about a speaker’s social identity affect their processing.

Intergroup Dynamics in Speech Perception: Interaction Among Experience, Attitudes and Expectations

Nguyen and co-authors investigate the effects of three separate factors on speech perception:

  • Experience: How much prior interaction a listener has had with a given speech variety.
  • Attitudes: How a listener feels about a speech variety and those who speak it.
  • Expectations: Whether a listener knows what speech variety they’re going to hear. (In this case only some listeners were explicitly told what speech variety they were going to hear.)

They found that these three factors all influenced the speech perception of Australian English speakers listening to Vietnamese accented English, and that there was an interaction between these factors. In general, participants with correct expectations (i.e. being told beforehand that they were going to hear Vietnamese accented English) identified more words correctly.

There was an interesting interaction between listeners’ attitudes towards Asians and their experience with Vietnamese accented English. Listeners who had negative prejudices towards Asians and little experience with Vietnamese English were more accurate than those with little experience and positive prejudice. The authors suggest that this was due to listeners with negative prejudice being more attentive. However, the opposite effect was found for listeners with experience listening to Vietnamese English. In this group, positive prejudice increased accuracy while negative prejudice decreased it. There were, however, uneven numbers of participants between the groups, so this might have skewed the results.

For me, this study is most useful because it shows that a listener’s experience with a speech variety and their expectation of hearing it affect their perception. I would, however, like to see a larger listener sample, especially given the strong negative skew in listeners’ attitudes towards Asians (which I realize the researchers couldn’t have controlled for).

Perceiving the World through Group-Colored Glasses: A Perceptual Model of Intergroup Relations

Rather than presenting an experiment, this paper lays out a framework for the interplay of social group affiliation and perception. The authors pull together numerous lines of research showing that an individual’s own group affiliation can change their perception and interpretation of the same stimuli. In the authors’ own words:

The central premise of our model is that social identification influences perception.

While they discuss perception across many domains (visual, tactile, olfactory, etc.) the part which directly fits with my work is that of auditory perception. As they point out, auditory perception of speech depends on both bottom-up and top-down information. Bottom-up information, in speech perception, is the acoustic signal, while top-down information includes both knowledge of the language (like which words are more common) and social knowledge (like our knowledge of different dialects). While the authors do not discuss dialect perception directly, other work (including the three studies discussed above) fits nicely into this framework.

The key difference between this framework and Kleinschmidt & Jaeger’s Robust Speech Perception model is the centrality of the speaker’s identity. Since all language users have their own dialect which affects their speech perception (although, of course, some listeners can fluently listen to more than one dialect), it is important to consider both the listener’s and talker’s social affiliation when modelling speech perception.

What does the National Endowment for the Humanities even do?

From the title, you might think this is a US-centric post. To a certain extent, it is. But I’m also going to be talking about topics that are more broadly of interest: what are some specific benefits of humanities research? And who should fund basic research? A lot has been written about these topics generally, so I’m going to be talking about linguistics and computational linguistics specifically.

This blog post came out of a really interesting conversation I had on Twitter the other day, sparked by this article on the potential complete elimination of both the National Endowment for the Humanities and the National Endowment for the Arts. During the course of the conversation, I realized that the person I was talking to (who was not a researcher, as far as I know) had some misconceptions about the role and reach of the NEH. So I thought it might be useful to talk about the role the NEH plays in my field, and has played in my own development as a researcher.


Oh this? Well, we don’t have funding to buy books anymore, so I put a picture of them in my office to remind myself they exist.

What does the NEH do?

I think the easiest way to answer this is to give you specific examples of projects that have been funded by the National Endowment for the Humanities, and talk about their individual impacts. Keep in mind that this is just the tip of the iceberg; I’m only going to talk about projects that have benefitted my work in particular, and not even all of those.

  • Builds language teaching resources. One of my earliest research experiences was as a research assistant for Jack Martin, working with the Koasati tribe in Louisiana on a project funded by the NEH. The bulk of the work I did that summer was on a talking dictionary of the Koasati language, which the community especially wanted both as a record of the language and to support Koasati language courses. I worked with speakers to record the words for the dictionary, and edited and transcribed the sound files to be put into the talking dictionary. In addition to creating an important resource for the community, I learned important research skills that led me towards my current work on language variation. And the dictionary? It’s available on-line.
  • Helps fight linguistic discrimination. One of my main research topics is linguistic bias in automatic speech recognition (you can see some of that work here and here). But linguistic bias doesn’t only happen with computers. It’s a particularly pernicious form of discrimination that’s a big problem in education as well. As someone who’s both from the South and an educator, for example, I have purposefully cultivated my ability to speak mainstream American English because I know that, fair or not, I’ll be taken less seriously the more southern I sound. The NEH is at the forefront of efforts to help fight linguistic discrimination.
  • Documents linguistic variation. This is a big one for my work, in particular: I draw on NEH-funded resources documenting linguistic variation in the United States in almost every research paper I write.

How does funding get allocated?

  • Which projects are funded is not decided by politicians. I didn’t realize this wasn’t common knowledge, but which projects get funded by federal funding agencies, including the NEH, NSF (which I’m currently being funded through) and NEA (National Endowment for the Arts), is not decided by politicians. This is a good thing–even the most accomplished politician can’t be expected to be an expert on everything from linguistics to history to architecture. You can see the breakdown of the process of allocating funding here.
  • Who looks at funding applications? Applications are peer reviewed, just like journal articles and other scholarly publications. The people looking at applications are top scholars in their field. This means that they have a really good idea of which projects are going to have the biggest long-term impact, and that they can ensure no one’s going to be reinventing the wheel.
  • How many projects are funded? All federal  research funding is extremely competitive, with many more applications submitted than accepted. At the NEH, this means as few as 6% of applications to a specific grant program will be accepted. This isn’t just free money–you have to make a very compelling case to a panel of fellow scholars that your project is truly exceptional.
  • What criteria are used to evaluate projects? This varies from grant to grant, but for the documenting endangered languages grant (which is what my work with the Koasati tribe was funded through), the evaluation criteria include the following:
    • What is the potential for the proposed activity to
      1. Advance knowledge and understanding within its own field or across different fields (Intellectual Merit); and
      2. Benefit society or advance desired societal outcomes (Broader Impacts)?
    • To what extent do the proposed activities suggest and explore creative, original, or potentially transformative concepts?
    • Is the plan for carrying out the proposed activities well-reasoned, well-organized, and based on a sound rationale? Does the plan incorporate a mechanism to assess success?
    • How well qualified is the individual, team, or organization to conduct the proposed activities?
    • Are there adequate resources available to the PI (either at the home organization or through collaborations) to carry out the proposed activities?

Couldn’t this research be funded by businesses?

Sure, it could be. Nothing’s stopping companies from funding basic research in the humanities… but in my experience it’s not a priority, and they don’t. And that’s a real pity, because basic humanities research has a tendency of suddenly being vitally needed in other fields. Some examples from Natural Language Processing that have come up in just the last year:

  • Ethics: I’m currently taking what will  probably be my last class in graduate school. It’s a seminar course, filled with a mix of NLP researchers, electrical engineers and computer scientists, and we’re all reading… ethics texts. There’s been a growing awareness in the NLP and machine learning communities that algorithmic design and data selection is leading to serious negative social impacts (see this paper for some details). Ethics is suddenly taking center stage, and without the work of scholars working in the humanities, we’d be working up from first principles.
  • Pragmatics: Pragmatics, or the study of how situational factors affect meaning, is one of the more esoteric sub-disciplines in linguistics–many linguistics departments don’t even teach it as a core course. But one of the keynotes at the 2016 Empirical Methods in Natural Language Processing conference was about it (in NLP, conferences are the premier publication venue, so that’s a pretty big deal). Why? Because dialog systems, also known as chatbots, are a major research area right now. And modelling things like what you believe the person you’re talking to already knows is going to be critical to making interacting with them more natural.
  • Discourse analysis: Speaking of chatbots, discourse analysis–or the analysis of the structure of conversations–is another area of humanities research that’s been applied to a lot of computational systems. There are currently over 6000 ACL publications that draw on the discourse analysis literature. And given the strong interest in chatbots right now, I can only see that number going up.

These are all areas of research we’d traditionally consider humanities that have directly benefited the NLP community, and in turn many of the products and services we use day to day. But it’s hard to imagine companies supporting the work of someone working in the humanities whose work might one day benefit their products. These research programs, which may not have an immediate impact but end up being incredibly important down the line, are exactly the type of long-term investment in knowledge that the NEH supports, and that really wouldn’t happen otherwise.

Why does it matter?

“Now Rachael,” you may be saying, “your work definitely counts as STEM (science, technology, engineering and math). Why do you care so much about some humanities funding going away?”

I hope the reasons that I’ve outlined above help to make the point that humanities research has long-ranging impacts and is a good investment. NEH funding was pivotal in my development as a researcher. I would not be where I am today without early research experience on projects funded by the NEH.  And as a scholar working in multiple disciplines, I see how humanities research constantly enriches work in other fields, like engineering, which tend to be considered more desirable.

One final point: the National Endowment for the Humanities is, compared to other federal funding programs, very small indeed. In 2015 the federal government spent $146 million on the NEH, which was only 2% of the $7.1 billion Department of Defense research budget. In other words, if everyone in the US contributed equally to the federal budget, the NEH would cost us each less than fifty cents a year. I think that’s a fair price for all of the different on-going projects the NEH funds, don’t you?


The entire National Endowment for the Humanities & National Endowment for the Arts, as well as the National Park Service research budget, all fit in that tiny “other” slice at the very top.

 

Preference for wake words varies by user gender

I recently read a very interesting article on the design aspects of choosing a wake word, the word you use to turn on a voice-activated system. In Star Trek it’s “Computer”, but these days two of the more popular ones are “Alexa” and “OK Google”. The article’s author was a designer and noted that she found “Ok Google” or “Hey Google” to be more pleasant to use than “Alexa”. As I was reading the comments (I know, I know) I noticed that a lot of the people who strongly protested that they preferred “Alexa” had usernames or avatars that I would associate with male users. It struck me that there might be an underlying social pattern here.

So, being the type of nerd I am, I whipped up a quick little survey to look at the interaction between users’ gender and their preference for wake words. The survey only had two questions:

  • What is your gender?
    • Male
    • Female
    • Other
  • If Google Home and the Echo offered identical performance in all ways except for the wake word (the word or phrase you use to wake the device and begin talking to it), which wake word would you prefer?
    • “Ok Google” or “Hey Google”
    • “Alexa”

I included only those options because those are the defaults–I am aware you can choose to change the Echo’s wake word. (And probably should, given recent events.) 67 people responded to my survey. (If you were one of them, thanks!)

So what were the results? They were actually pretty strongly in line with my initial observations: as a group, only men preferred “Alexa” to “Ok Google”. Furthermore, this preference was far weaker than other genders’ preference for “Ok Google”. Women preferred “Ok Google” at a rate of almost two-to-one, and no people of other genders preferred “Alexa”.

I did have a bit of a skewed sample, with more women than men and people of other genders, but the differences between genders were robust enough to be statistically significant (χ²(2, N = 67) = 7.25, p = 0.02).


Women preferred “Ok Google” to “Alexa” 27:11, men preferred “Alexa” to “Ok Google” 14:11, and the four people of other genders in my survey all preferred “Ok Google”.
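Since the raw counts are right there in the caption, you can re-run the test yourself; here’s a minimal sketch using scipy:

```python
from scipy.stats import chi2_contingency

# Counts from the survey, as reported above:
# rows = women, men, other genders; columns = "Ok Google", "Alexa".
observed = [
    [27, 11],   # women
    [11, 14],   # men
    [ 4,  0],   # other genders
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-squared({dof}, N = 67) = {chi2:.2f}, p = {p:.3f}")
# Should come out around chi-squared(2) = 7.25, in line with the test reported above.
```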

So what’s the take-away? Well, for one, Johna Paolino (the author of the original article) is by no means alone in her preference for a non-gendered wake word. More broadly, I think that, like the Clippy debacle, this is excellent evidence that users’ gender strongly affects how they interact with virtual agents. If you’re working to create virtual agents, it’s important to consider all types of users or you might end up creating something that rubs more than half of your potential customers the wrong way.

My code and data are available here.

Acoustics Documentaries on Netflix

Happy New Year’s Eve! Have you made any resolutions? Perhaps a resolution to learn something new in the new year? If so, you’re in luck! I’ve recently run across a number of different Netflix documentaries that touch on different aspects of acoustics that readers of this blog might enjoy. (Yes, I’ve spent a lot of my winter break watching documentaries. Why do you ask?)


Sure, I guess they have, like, movies and stuff, but really I’m here for the documentaries.

  • Sanrachna (Hindi with English subtitles)
    • This series focuses on the architecture of ancient India. The second episode is all about the architectural acoustics of Golconda fort and Gol Gumbaz. Through careful design and construction, a handclap in the foyer of Golconda fort can be heard half a mile away!
  • The Lion in Your Living Room (English)
    • This Canadian documentary is about domestic house cats. In addition to some discussion of the ins and outs of cats’ ears, there’s a really cool segment by Karen McComb where she talks about the acoustic qualities of different types of purrs.
    • Bonus: Some sweet examples of the Canadian vowel shift.
  • Ocean Giants (English)
    • This BBC documentary about whales and dolphins has three hour-long episodes, and each includes a lot of underwater acoustics and animal communication. If you’ve only got time for one episode, the third episode “Voices of the Sea” is all about whale and dolphin vocalizations.
  • Do I sound gay? (English)
    • This documentary by David Thorpe explores the stereotype of “a gay voice” and does include some cameos by linguists. From a sociolinguistics standpoint, I think it’s a bit simplistic (to be fair, probably because I’m a sociolinguist) but it’s still an interesting discussion of speech and identity.
    • Bonus: If you want to get a more linguistics-y perspective, this post on Language Log (and the comments) go into a lot of depth.

Oh, and if you don’t have Netflix, I’ve got you covered too. Here are two Youtube channels with linguistics content you might like:

  • Lingthusiasm (English)
    • This is a brand-new podcast by Gretchen McCulloch and Lauren Gawne (two of my favorite internet linguistics people), and it’s a ton of fun. You should check it out!
  • The Ling Space (English)
    • This channel has been around for a while and has little bite-sized videos about a range of linguistics topics. They have a new video every Wednesday.

Do you know of any other good documentaries about linguistics or acoustics? Leave a comment and let me know!

Do emojis have their own syntax?

So a while ago I got into a discussion with someone on Twitter about whether emojis have syntax. Their original question was this:

As someone who’s studied sign language, my immediate thought was “Of course there’s a directionality to emoji: they encode the spatial relationships of the scene.” This is just fancy linguist talk for: “if there’s a dog eating a hot-dog, and the dog is on the right, you’re going to use 🌭🐕, not 🐕🌭.” But the more I thought about it, the more I began to think that maybe it would be better not to rely on my intuitions in this case. First, because I know American Sign Language and that might be influencing me and, second, because I am pretty gosh-darn dyslexic and I can’t promise that my really excellent ability to flip adjacent characters doesn’t extend to emoji.

So, like any good behavioral scientist, I ran a little experiment. I wanted to know two things.

  1. Does an emoji description of a scene show the way that things are positioned in that scene?
  2. Does the order of emojis tend to be the same as the ordering of those same concepts in an equivalent sentence?

As it turned out, the answers to these questions are actually fairly intertwined, and related to a third thing I hadn’t actually considered while I was putting together my stimuli (but probably should have): whether there was an agent-patient relationship in the photo.

Agent: The entity in a sentence that’s effecting a change, the “doer” of the action.

  • The dog ate the hot-dog.
  • The raccoons pushed over all the trash-bins.

Patient: The entity that’s being changed, the “receiver” of the action.

  • The dog ate the hot-dog.
  • The raccoons pushed over all the trash-bins.

Data

To get data, I showed people three pictures and asked them to “pick the emoji sequence that best describes the scene”, giving them two options that used different orders of the same emoji. Then, once they were done with the emoji part, I asked them to “please type a short sentence to describe each scene”. For all the language data, I just went through and quickly coded the order in which the concepts that were encoded in the emoji showed up.

Examples:

  • “The dog ate a hot-dog”  -> dog hot-dog
  • “The hot-dog was eaten by the dog” -> hot-dog dog
  • “A dog eating” -> dog
  • “The hot-dog was completely devoured” -> hot-dog

So this gave me two parallel data sets: one with emojis and one with language data.
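If you’re curious, here’s a rough sketch of what that concept-order coding could look like in code, using the castle scene below as an example. The keyword lists are my own guesses for illustration, not the actual coding scheme I used:

```python
import re

# Keyword lists for the concepts in one scene. These particular keywords are
# my own guesses for illustration, not the actual coding scheme.
CONCEPTS = {
    "man":    ["man", "dude", "person", "guy"],
    "castle": ["castle", "mansion", "chateau", "château"],
}

def concept_order(sentence):
    """Return the concepts in the order they first appear in the sentence."""
    tokens = re.findall(r"[\w\-]+", sentence.lower())
    positions = []
    for concept, keywords in CONCEPTS.items():
        hits = [i for i, tok in enumerate(tokens) if tok in keywords]
        if hits:
            positions.append((min(hits), concept))
    return [concept for _, concept in sorted(positions)]

print(concept_order("A man walking by a castle"))     # ['man', 'castle']
print(concept_order("Castle with a dude out front"))  # ['castle', 'man']
print(concept_order("A chateau in the countryside"))  # ['castle']
```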

Altogether, 133 people filled out the emoji half and 127 people did the whole thing, mostly in English (I had one person respond in Spanish and I went ahead and included it). I have absolutely no demographics on my participants, and that’s by design; since I didn’t go through the Institutional Review Board it would actually be unethical for me to collect data about people themselves rather than just general information on language use. (If you want to get into the nitty-gritty, this is a really good discussion of different types of on-line research.)

Picture one – A man counting money


I picked this photo as sort of a sanity-check: there’s no obvious right-to-left ordering of the man and the money, and there’s one pretty clear way of describing what’s going on in this scene. There’s an agent (the man) and a patient (the money), and since we tend to describe things as agent first, patient second, I expected people to pretty much all do the same thing with this picture. (Side note: I know I’ve read a paper about the cross-linguistic tendency for syntactic structures where the agent comes first, but I can’t find it and I don’t remember who it’s by. Please let me know if you’ve got an idea what it could be in the comments–it’s driving me nuts!)

[Chart: emoji and sentence orderings for the man-and-money scene]

And they did! Pretty much everyone described this picture by putting the man before the money, both with emoji and words. This tells us that, when there’s no information about orientation you need to encode (e.g. what’s on the right or left), people do tend to use emoji in the same order as they would the equivalent words.

Picture two – A man walking by a castle

[Photo: a man in a red shirt walking past the Château de Canisy]

But now things get a little more complex. What if there isn’t a strong agent-patient relationship and there is a strong orientation in the photo? Here, a man in a red shirt is walking by a castle, but he shows up on the right side of the photo. Will people be more likely to describe this scene with emoji in a way that encodes the relationship of the objects in the photo?

[Chart: emoji and sentence orderings for the man-and-castle scene]

I found that they were–almost four out of five participants described this scene by using the emoji sequence “castle man”, rather than “man castle”. This is particularly striking because, in the sentence writing part of the experiment, most people (over 56%) wrote a sentence where “man/dude/person etc.” showed up before “castle/mansion/chateau etc.”.
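If you want a rough sense of how far that is from a 50-50 split, here’s a quick binomial test; the count below is reconstructed from “almost four out of five” of 133 respondents, not the exact figure (that’s in the dataset linked at the end of the post):

```python
from scipy.stats import binomtest

# "Almost four out of five" of the 133 emoji respondents chose "castle man";
# 106 is my reconstruction of that count, not the exact figure (which is in
# the dataset linked at the end of the post).
result = binomtest(106, 133, p=0.5)
print(result.pvalue)   # a tiny p-value: this is very far from a 50-50 split
```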

So while people can use emoji to encode syntax, they’re also using them to encode spatial information about the scene.

Picture three – A man photographing a model


Ok, so let’s add a third layer of complexity: what about when spatial information and the syntactic agent/patient relationships are pointing in opposite directions? For the scene above, if you’re encoding the spatial information then you should use an emoji ordering like “woman camera man”, but if you’re encoding an agent-patient relationship then, as we saw in the picture of the man counting money, you’ll probably want to put the agent first: “man camera woman”.

(I leave it open for discussion whether the camera emoji here is representing a physical camera or a verb like “photograph”.)


For this chart I removed some data to make it readable. I kicked out anyone who picked another ordering of the emoji, and any word order that fewer than ten people (i.e. less than 10% of participants) used.

So people were a little more divided here. It wasn’t quite a 50-50 split, but it really does look like you can go either way with this one. The thing that jumped out at me, though, was how the word order and emoji order pattern together: if your sentence is something like “A man photographs a model”, then you are far more likely to use the “man camera woman” emoji ordering. On the other hand, if your sentence is something like “A woman being photographed by the sea” or “Photoshoot by the water”, then it’s more likely that your emoji ordering described the physical relation of the scene.

So what?

So what’s the big takeaway here? Well, one thing is that emoji don’t really have a fixed syntax in the same way language does. If they did, I’d expect that there would be a lot more agreement between people about the right way to represent a scene with emoji. There was a lot of variation.

On the other hand, emoji ordering isn’t just random either. It is encoding information, either about the syntactic/semantic relationship of the concepts or their physical location in space. The problem is that you really don’t have a way of knowing which one is which.

Edit 12/16/2016: The dataset and the R script I used to analyze it are now available on Github.

What’s the difference between & and +?

So if you’re like me, you sometimes take notes on the computer and end up using some shortcuts so you can keep up with the speed of whoever’s talking. One of the shortcuts I use a lot is replacing the word “and” with punctuation. When I’m handwriting things I only ever use “+” (because I can’t reliably write an ampersand), but in typing I use both “+” and “&”. And I realized recently, after going back to change which one I used, that I had the intuition that they should be used for different things.


I don’t use ampersands when I’m handwriting things because they’re hard to write.

Like sometimes happens with linguistic intuitions, though, I didn’t really have a solid idea of how they were different, just that they were. Fortunately, I had a ready-made way to figure it out. Since I use both symbols on Twitter quite a bit, all I had to do was grab tweets of mine that used either + or & and figure out what the difference was.

I got 450 tweets from between October 7th and November 11th of this year from my own account (@rctatman). I used either & or + in 83 of them, or roughly 18%. This number is a little bit inflated because I was livetweeting a lot of conference talks in that time period, and if a talk has two authors I start every livetweet from that talk with “AuthorName1 & AuthorName2:”. 43 tweets use & in this way. If we get rid of those, only around 8% of my tweets contain either + or &. They’re still a lot more common in my tweets than in writing in other genres, though, so it’s still a good amount of data.
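In case you want to check my math, those percentages come straight from the counts:

```python
total_tweets = 450
with_symbol  = 83    # tweets using & or +
author_names = 43    # the "AuthorName1 & AuthorName2" livetweets

print(with_symbol / total_tweets)                    # about 0.18
print((with_symbol - author_names) / total_tweets)   # about 0.089, i.e. around 8%
```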

So what do I use + for? See for yourself! Below are all the things I conjoined with + in my Twitter dataset. (Spelling errors intact. I’m dyslexic, so if I don’t carefully edit text—and even sometimes when I do, to my eternal chagrin—I tend to have a lot of spelling errors. Also, a lot of these tweets are from EMNLP so there’s quite a bit of jargon.)

  • time + space
  • confusable Iberian language + English
  • Data + code
  • easy + nice
  • entity linking + entity clustering
  • group + individual
  • handy-dandy worksheet + tips
  • Jim + Brenda, Finn + Jake
  • Language + action
  • linguistic rules + statio-temporal clustering
  • poster + long paper
  • Ratings + text
  • static + default methods
  • syntax thing + cattle
  • the cooperative principle + Gricean maxims
  • Title + first author
  • to simplify manipulation + preserve struture

If you’ve had some syntactic training, it might jump out to you that most of these things have the same syntactic structure: they’re noun phrases! There are just a couple of exceptions. The first is “static + default methods”, where the things that are being conjoined are actually adjectives modifying a single noun. The other is “to simplify manipulation + preserve struture”. I’m going to remain agnostic about where in the verb phrase that coordination is taking place, though, so I don’t get into any syntax arguments ;). That said, this is a fairly robust pattern! Remember that I haven’t been taught any rules about what I “should” do, so this is just an emergent pattern.

Ok, so what about &? Like I said, my number one use is for conjunction of names. This probably comes from my academic writing training. Most of the papers I read that use author names for in-line citations use an & between them. But I do also use it in the main body of tweets. My use of & is a little bit harder to characterize, so I’m going to go through and tell you about each type of thing.

First, I use it to conjoin user names with the @ tag. This makes sense, since I have a strong tendency to use & with names:

  • @uwengineering & @uwnlp
  • @amazon @baidu @Grammarly & @google

In some cases, I do use it in the same way as I do +, for conjoining noun phrases:

  • Q&A
  • the entities & relations
  • these features & our corpus
  • LSTM & attention models
  • apples & concrete
  • context & content

But I also use it for comparatives:

  • Better suited for weak (bag-level) labels & interpretable and flexible
  • easier & faster

And, perhaps more interestingly, for really high-level conjunction, like at the level of the sentence or entire verb phrase (again, I’m not going to make ANY claims about what happens in and around verbs—you’ll need to talk to a syntactician for that!).

  • Classified as + or – & then compared to polls
  • in 30% of games the group performance was below average & in 17% group was worse than worst individual
  • math word problems are boring & kids learn better if they’re interested in the theme of the problem
  • our system is the first temporal tagger designed for social media data & it doesn’t require hand tagging
  • use a small labeled corpus w/ small lexicon & choose words with high prob. of 1 label

And, finally, it gets used in sort of miscellaneous places, like hashtags and between URLs.

So & gets used in a lot more places than + does. I think that this is probably because, on some subconscious level I consider & to be the default (or, in linguistics terms, “unmarked“). This might be related to how I’m processing these symbols when I read them. I’m one of those people who hears an internal voice when reading/writing, so I tend to have canonical vocalizations of most typed symbols. I read @ as “at”, for example, and emoticons as a prosodic beat with some sort of emotive sound. Like I read the snorting emoji as the sound of someone snorting. For & and +, I read & as “and” and + as “plus”. I also use “plus” as a conjunction fairly often in speech, as do many of my friends, so it’s possible that it may pattern with my use in speech (I don’t have any data for that, though!). But I don’t say “plus” nearly as often as I say “and”. “And” is definitely the default and I guess that, by extension, & is as well.

Another thing that might possibly be at play here is ease of entering these symbols. While they’re pretty much equally easy to type on my phone, on a full keyboard + is slightly easier, since I don’t have to reach as far from the shift key. But if that were the only factor my default would be +, so I’m fairly comfortable claiming that the fact that I use & for more types of conjunction is based on the influence of speech.

A BIG caveat before I wrap up—this is a bespoke analysis. It may hold for me, but I don’t claim that it’s the norm of any of my language communities. I’d need a lot more data for that! That said, I think it’s really neat that I’ve unconsciously fallen into a really regular pattern of use for two punctuation symbols that are basically interchangeable. It’s a great little example of the human tendency to unconsciously tidy up language.