Today’s blog post is a bit different. It’s in dance!
If that wasn’t quite clear enough for you, you can check this blog post for a more detailed explanation.
Today’s blog post is a bit different. It’s in dance!
If that wasn’t quite clear enough for you, you can check this blog post for a more detailed explanation.
If you’ve been following my blog for a while, you may remember that last year I found that YouTube’s automatic captions didn’t work as well for some dialects, or for women. The effects I found were pretty robust, but I wanted to replicate them for a couple of reasons:
With that in mind, I did a second analysis on both YouTube’s automatic captions and Bing’s speech API (that’s the same tech that’s inside Microsoft’s Cortana, as far as I know).
For this project, I used speech data from the International Dialects of English Archive. It’s a collection of English speech from all over, originally collected to help actors sound more realistic.
I used speech data from four varieties: the South (speakers from Alabama), the Northern Cities (Michigan), California (California) and General American. “General American” is the sort of news-caster style of speech that a lot of people consider unaccented–even though it’s just as much an accent as any of the others! You can hear a sample here.
For each variety, I did an acoustic analysis to make sure that speakers I’d selected actually did use the variety I thought they should, and they all did.
For the YouTube captions, I just uploaded the speech files to YouTube as videos and then downloaded the subtitles. (I would have used the API instead, but when I was doing this analysis there was no Python Google Speech API, even though very thorough documentation had already been released.)
Bing’s speech API was a little more complex. For this one, my co-author built a custom Android application that sent the files to the API & requested a long-form transcript back. For some reason, a lot of our sound files were returned as only partial transcriptions. My theory is that there is a running confidence function for the accuracy of the transcription, and once the overall confidence drops below a certain threshold, you get back whatever was transcribed up to there. I don’t know if that’s the case, though, since I don’t have access to their source code. Whatever the reason, the Bing transcriptions were less accurate overall than the YouTube transcriptions, even when we account for the fact that fewer words were returned.
OK, now to the results. Let’s start with dialect area. As you might be able to tell from the graphs below, there were pretty big differences between the two systems we looked at. In general, there was more variation in the word error rate for Bing and overall the error rate tended to be a bit higher (although that could be due to the incomplete transcriptions we mentioned above). YouTube’s captions were generally more accurate and more consistent. That said, both systems had different error rates across dialects, with the lowest average error rates for General American English.
Now, let’s turn to gender. If you read my earlier work, you’ll know that I previously found that YouTube’s automatic captions were more accurate for men and less accurate for women. This time, with carefully recorded speech samples, I found no robust difference in accuracy by gender in either system. Which is great! In addition, the unreliable trends for each system pointed in opposite ways; Bing had a lower WER for male speakers, while YouTube had a lower WER for female speakers.
So why did I find an effect last time? My (untested) hypothesis is that there was a difference in the signal to noise ratio for male and female speakers in the user-uploaded files. Since women are (on average) smaller and thus (on average) slightly quieter when they speak, it’s possible that their speech was more easily masked by background noises, like fans or traffic. These files were all recorded in a quiet place, however, which may help to explain the lack of difference between genders.
Finally, what about race? For this part of the analysis, I excluded General American speakers, since they did not report their race. I also excluded the single Native American speaker. Even with fewer speakers, and thus reduced power, the differences between races were still robust enough to be significant for YouTube’s automatic captions and Bing followed the same trend. Both systems were most accurate for Caucasian speakers.
While I was happy to find no difference in performance by gender, the fact that both systems made more errors on non-Caucasian and non-General-American speaking talkers is deeply concerning. Regional varieties of American English and African American English are both consistent and well-documented. There is nothing intrinsic to these varieties that make them less easy to recognize. The fact that they are recognized with more errors is most likely due to bias in the training data. (In fact, Mozilla is currently collecting diverse speech samples for an open corpus of training data–you can help them out yourself.)
There are two things I’m really worried about with these types of speech recognition errors. The first is higher error rates seem to overwhelmingly affect already-disadvantaged groups. In the US, strong regional dialects tend to be associated with speakers who aren’t as wealthy, and there is a long and continuing history of racial discrimination in the United States.
Given this, the second thing I’m worried about is the fact that these voice recognition systems are being incorporated into other applications that have a real impact on people’s lives.
Every automatic speech recognition system makes errors. I don’t think that’s going to change (certainly not in my lifetime). But I do think we can get to the point where those error don’t disproportionately affect already-marginalized people. And if we keep using automatic speech recognition into high-stakes situations it’s vital that we get to that point quickly and, in the meantime, stay aware of these biases.
If you’re interested in the long version, you can check out the published paper here.
I’ll get back to “a male/a female” question in my next blog post (promise!), but for now I want to discuss some of the findings from my dissertation research. I’ve talked about my dissertation research a couple times before, but since I’m going to be presenting some of it in Spain (you can read the full paper here), I thought it would be a good time to share some of my findings.
In my dissertation, I’m looking at how what you think you know about a speaker affects what you hear them say. In particular, I’m looking at American English speakers who have just learned to correctly identify the vowels of New Zealand English. Due to an on-going vowel shift, the New Zealand English vowels are really confusing for an American English speaker, especially the vowels in the words “head”, “head” and “had”.
These overlaps can be pretty confusing when American English speakers are talking to New Zealand English speakers, as this Flight of the Conchords clip shows!
The good news is that, as language users, we’re really good at learning new varieties of languages we already know, so it only takes a couple minutes for an American English speaker to learn to correctly identify New Zealand English vowels. My question was this: once an American English speaker has learned to understand the vowels of New Zealand English, how do they know when to use this new understanding?
In order to test this, I taught twenty one American English speakers who hadn’t had much, if any, previous exposure to New Zealand English to correctly identify the vowels in the words “head”, “heed” and “had”. While I didn’t play them any examples of a New Zealand “hid”–the vowel in “hid” is said more quickly in addition to having different formants, so there’s more than one way it varies–I did let them say that they’d heard “hid”, which meant I could tell if they were making the kind of mistakes you’d expect given the overlap between a New Zealand “head” and American “hid”.
So far, so good: everyone quickly learned the New Zealand English vowels. To make sure that it wasn’t that they were learning to understand the one talker they’d been listening to, I tested half of my listeners on both American English and New Zealand English vowels spoken by a second, different talker. These folks I told where the talker they were listening to was from. And, sure enough, they transferred what they’d learned about New Zealand English to the new New Zealand speaker, while still correctly identifying vowels in American English.
The really interesting results here, though, are the ones that came from the second half the listeners. This group I lied to. I know, I know, it wasn’t the nicest thing to do, but it was in the name of science and I did have the approval of my institutional review board, (the group of people responsible for making sure we scientists aren’t doing anything unethical).
In an earlier experiment, I’d played only New Zealand English as this point, and when I told them the person they were listening to was from America, they’d completely changed the way they listened to those vowels: they labelled New Zealand English vowels as if they were from American English, even though they’d just learned the New Zealand English vowels. And that’s what I found this time, too. Listeners learned the New Zealand English vowels, but “undid” that learning if they thought the speaker was from the same dialect as them.
But what about when I played someone vowels from their own dialect, but told them the speaker was from somewhere else? In this situation, listeners ignored my lies. They didn’t apply the learning they’d just done. Instead, the correctly treated the vowels of thier own dialect as if they were, in fact, from thier dialect.
At first glance, this seems like something of a contradiction: I just said that listeners rely on social information about the person who’s talking, but at the same time they ignore that same social information.
So what’s going on?
I think there are two things underlying this difference. The first is the fact that vowels move. And the second is the fact that you’ve heard a heck of a lot more of your own dialect than one you’ve been listening to for fifteen minutes in a really weird training experiment.
So what do I mean when I say vowels move? Well, remember when I talked about formants above? These are areas of high acoustic energy that occur at certain frequency ranges within a vowel and they’re super important to human speech perception. But what doesn’t show up in the plot up there is that these aren’t just static across the course of the vowel–they move. You might have heard of “diphthongs” before: those are vowels where there’s a lot of formant movement over the course of the vowel.
And the way that vowels move is different between different dialects. You can see the differences in the way New Zealand and American English vowels move in the figure below. Sure, the formants are in different places—but even if you slid them around so that they overlapped, the shape of the movement would still be different.
Ok, so the vowels are moving in different ways. But why are listeners doing different things between the two dialects?
Well, remember how I said earlier that you’ve heard a lot more of your own dialect than one you’ve been trained on for maybe five minutes? My hypothesis is that, for the vowels in your own dialect, you’re highly attuned to these movements. And when a scientist (me) comes along and tells you something that goes against your huge amount of experience with these shapes, even if you do believe them, you’re so used to automatically understanding these vowels that you can’t help but correctly identify them. BUT if you’ve only heard a little bit of a new dialect you don’t have a strong idea of what these vowels should sound like, so if you’re going to rely more on the other types of information available to you–like where you’re told the speaker is from–even if that information is incorrect.
So, to answer the question I posed in the title, can what you think you know about someone affect how you hear them? Yes… but only if you’re a little uncertain about what you heard in the first place, perhaps becuase it’s a dialect you’re unfamiliar with.
So a friend of mine who’s a reference librarian (and has a gaming YouTube channel you should check out) recently got an interesting question: how loud would a million dogs barking be?
This is an interesting question because it gets at some interesting properties of how sound work, in particular the decibel scale.
So, first off, we need to establish our baseline. The loudest recorded dog bark clocked in at 113.1 dB, and was produced by a golden retriever named Charlie. (Interestingly, the loudest recorded human scream was 129 dB, so it looks like Charlie’s got some training to do to catch up!) That’s louder than a chain saw, and loud enough to cause hearing damage if you heard it consonantly.
Now, let’s scale our problem down a bit and figure out how loud it would be if ten Charlies barked together. (I’m going to use copies of Charlie and assume they’ll bark in phase becuase it makes the math simpler.) One Charlie is 113 dB, so your first instinct may be to multiply that by ten and end up 1130 dB. Unfortunately, if you took this approach you’d be (if you’ll excuse the expression) barking up the wrong tree. Why? Because the dB scale is logarithmic. This means that a 1130 dB is absolutely ridiculously loud. For reference, under normal conditions the loudest possible sound (on Earth) is 194 dB. A sound of 1000 dB would be loud enough to create a black hole larger than the galaxy. We wouldn’t be able to get a bark that loud even if we covered every inch of earth with clones of champion barker Charlie.
Ok, so we know what one wrong approach is, but what’s the right one? Well, we have our base bark at 113 dB. If we want a bark that is one million times as powerful (assuming that we can get a million dogs to bark as one) then we need to take the base ten log of one million and multiply it by ten (that’s the deci part of decibel). (If you want more math try this site.) The base ten log of one million is six, so times ten that’s sixty decibels. But it’s sixty decibels louder than our original sound of 113dB, for a grand total of 173dB.
Now, to put this in perspective, that’s still pretty durn loud. That’s loud enough to cause hearing loss in our puppies and everyone in hearing distance. We’re talking about the loudness of a cannon, or a rocket launch from 100 meters away. So, yes, very loud, but not quite “destroying the galaxy” loud.
A final note: since the current world record for loudest barking group of dogs is a more modest 124 dB from group of just 76 dogs, if you could get a million dogs to bark in unison you’d definitely set a new world record! But, considering that you’d end up hurting the dogs’ hearing (and having to scoop all that poop) I’m afraid I really can’t recommend it.
I got a cool question from Veronica the other day:
Which wavelength someone would use not to hear but feel it on the body as a vibration?
So this would depend on two things. The first is your hearing ability. If you’ve got no or limited hearing, most of your interaction with sound will be tactile. This is one of the reasons why many Deaf individuals enjoy going to concerts; if the sound is loud enough you’ll be able to feel it even if you can’t hear it. I’ve even heard stories about folks who will take balloons to concerts to feel the vibrations better. In this case, it doesn’t really depend on the pitch of the sound (how high or low it is), just the volume.
But let’s assume that you have typical hearing. In that case, the relationship between pitch, volume and whether you can hear or feel a sound is a little more complex. This is due to something called “frequency response”. Basically, the human ear is better tuned to hearing some pitches than others. We’re really sensitive to sounds in the upper ranges of human speech (roughly 2k to 4k Hz). (The lowest pitch in the vocal signal can actually be much lower [down to around 80 Hz for a really low male voice] but it’s less important to be able to hear it because that frequency is also reflected in harmonics up through the entire pitch range of the vocal signal. Most telephones only transmit signals between 300 Hz to 3400 Hz, for example, and it’s only really the cut-off at the upper end of the range that causes problems–like making it hard to tell the difference between “sh” and “s”.)
The takeaway from all this is that we’re not super good at hearing very low sounds. That means they can be very, very loud before we pick up on them. If the sound is low enough and loud enough, then the only way we’ll be able to sense it is by feeling it.
How low is low enough? Most people can’t really hear anything much below 20 Hz (like the lowest note on a really big organ). The older you are and the more you’ve been exposed to really loud noises in that range, like bass-heavy concerts or explosions, the less you’ll be able to pick up on those really low sounds.
What about volume? My guess for what would be “sufficiently loud”, in this case, is 120+ Db. 120 Db is as loud as a rock concert, and it’s possible, although difficult and expensive, to get out of a home speaker set-up. If you have a neighbor listening to really bass-y music or watching action movies with a lot of low, booming sound effects on really expensive speakers, it’s perfectly possible that you’d feel those vibrations rather than hearing them. Especially if there are walls between the speakers and you. While mid and high frequency sounds are pretty easy to muffle, low-frequency sounds are much more difficult to sound proof against.
Are there any health risks? The effects of exposure to these types of low-frequency noise is actually something of an active research question. (You may have heard about the “brown note“, for example.) You can find a review of some of that research here. One comforting note: if you are exposed to a very loud sound below the frequencies you can easily hear–even if it’s loud enough to cause permanent damage at much higher frequencies–it’s unlikely that you will suffer any permanent hearing loss. That doesn’t mean you shouldn’t ask your neighbor to turn down the volume, though; for their ears if not for yours!
I got an interesting question on Facebook a while ago and though it might be a good topic for a blog post:
I say “good morning” to nearly everyone I see while I’m out running. But I don’t actually say “good”, do I? It’s more like “g’ morning” or “uh morning”. Never just morning by itself, and never a fully articulated good. Is there a name for this grunt that replaces a word? Is this behavior common among English speakers, only southeastern speakers, or only pre-coffee speakers?
This sort of thing is actually very common in speech, especially in conversation. (Or “in the wild” as us laboratory types like to call it.) The fancy-pants name for it is “hypoarticulation”. That’s less (hypo) speech-producing movements of the mouth and throat (articulation). On the other end of the spectrum you have “hyperarticulation” where you very. carefully. produce. each. individual. sound.
Ok, so you can change how much effort you put into producing speech sounds, fair enough. But why? Why don’t we just sort of find a happy medium and hang out there? Two reasons:
So we want to communicate clearly and unambiguously, but with as little effort as possible. But how does that tie in with this example? “G” could be “great” or “grass” or “génial “, and “uh” could be any number of things. For this we need to look outside the linguistic system.
The thing is, language is a social activity and when we’re using language we’re almost always doing so with other people. And whenever we interact with other people, we’re always trying to guess what they know. If we’re pretty sure someone can get to the word we mean with less information, for example if we’ve already said it once in the conversation, then we will expend less effort in producing the word. These contexts where things are really easily guessable are called “low entropy“. And in a social context like jogging past someone in the morning, phrases liked “good morning” have very low entropy. Much lower than, for example “Could you hand me that pickle?”–if you jogged past someone and said that you’d be very likely to hyperarticulate to make sure they understood.
My neighbor talks loudly on the phone and I can’t sleep. What is the best method to block his voice noise?
Great question Atif! There are few things more distracting than hearing someone else’s conversation, and only hearing one side of a phone conversation is even worse. Even if you don’t want it to, your brain is trying to fill in the gaps and that can definitely keep you awake. So what’s the best way to avoid hearing your neighbor? Well, probably the very best way is to try talking to them. Failing that, though, you have three main options: isolation, damping and masking.
So what’s the difference between them and what’s the best option for you? Before we get down to the nitty gritty I think it’s worth a quick reminder of what sound actually is: sound waves are just that–waves. Just like waves in a lake or ocean. Imagine you and a neighbor share a small pond and you like to go swimming every morning. Your neighbor, on the other hand, has a motorboat that they drive around on thier side. The waves the motorboat makes keep hitting you as you try to swim and you want to avoid them. This is very similar to your situation: your neighbor’s voice is making waves and you want to avoid being hit by them.
Isolation: So one way to avoid feeling the effects of waves in a pond, to use our example, is to build a wall down the center of the pond. As long as there no holes in the wall for the waves to diffract through, you should be able to avoid feeling the effects of the waves. Noise isolation works much the same way. You can use earplugs that are firmly mounted in your ears to form a seal and that should prevent any sound waves from reaching your eardrums, right? Well, not quite. The wrinkle is that sound can travel through solids as well. It’s like we built our wall in our pond out of something flexible, like rubber, instead of something solid, like brick. As waves hit the wall the wall itself will move with the wave and then transmit it to your side. So you may still end up hearing some noises, even with well-fitted headphones.
Techniques: earplugs/earbuds, noise isolating headphone or earbuds, noise-isolating architecture,
Damping: So in our pond example we might imagine doing something that makes it harder for waves to move through the water. If you replaced all the water with molasses or honey, for example, it would take a lot more energy for the sound waves to move through it and they’d dissipate more quickly.
Techniques: acoustic tiles, covering the intervening wall (with a fabric wall-hanging, foam, empty egg cartons, etc.), covering vents, placing a rolled-up towel under any doors, hanging heavy curtains over windows, putting down carpeting
Masking: Another way to avoid noticing our neighbor’s waves is to start making our own waves. We can either make waves that are exactly the same size as our neighbor’s but out of phase (so when theirs are at their highest peak, ours is at our lowest) so they end up cancelling each other out. That’s basically what noise-cancelling headphones do. Or we can make a lot of own waves that all feel enough like our neighbor’s that when thier wave arrives we don’t even notice it. Of course, if the point it to hear no sound that won’t work quite as well. But if the point is to avoid abrupt, distracting changes in sound then this can work quite nicely.
Techniques: Listening to white noise or music, using noise-cancelling headphones or earbuds
So what would I do? Well, first I’d take as many steps as I could to sound-proof my environment. Try to cover as many of the surfaces in your bedroom as in absorbent, ideally fluffy, surfaces as you can. (If it can absorb water it will probably help absorb sound.) Wall hangings, curtains and a throw rug can all help a great deal.
Then you have a couple options for masking. A fan help to provide both a bit of acoustic masking and a nice breeze. Personally, though, I like a white noise machine that gives you some control over the frequency (how high or low the pitch is) and intensity (loudness) of the sounds it makes. That lets you tailor it so that it best masks the sounds that are bothering you. I also prefer the ones with the fans rather than those that loop recorded sounds, since I often find the loop jarring. If you don’t want to or can’t buy one, though, myNoise has a number of free generators that let you tailor the frequency and intensity of a variety of sounds and don’t have annoying loops. (There are a bunch of additional features available that you can access for a small donation as well.)
If you can wear earbuds in bed, try playing a non-distracting noise at around 200-1000 Hertz, which will cover a lot of the speech sounds you can’t easily dampen. Make sure your earbuds are well-fitted in the ear canal so that as much noise is isolated as possible. In addition, limiting the amount of exposed hard surface on them will also increase noise isolation. You can knit little cozies, try to find earbuds with a nice thick silicon/rubber coating or even try coating your own.
By using many different strategies together you can really reduce unwanted noises. I hope this helps and good luck!