Can you configure speech recognition for a specific speaker?

James had an interesting question based on one of my earlier posts on gender differences in speech recognition:

Is there a voice recognition product that is focusing on women’s voices or allows for configuring for women’s voices (or the characteristics of women’s voices)?

I don’t know of any ASR systems specifically designed for women. But the answer to the second half of your question is yes!


There are two main types of automatic speech recognition, or ASR, systems. The first is speaker independent. These are systems, like YouTube automatic captions or Apple’s Siri, that should work equally well across a large number of different speakers. Of course, as many other researchers have found and I corroborated in my own investigation, that’s not always the case. A major reason for this is socially-motivated variation between speakers. This is something we all know as language users. You can guess (with varying degrees of accuracy) a lot about someone from just their voice: their sex, whether they’re young or old, where they grew up, how educated they are, how formal or casual they’re being.

So what does this mean for speech recognition? Well, while different speakers speak in a lot of different ways, individual speakers tend to use less variation. (With the exception of bidialectal speakers, like John Barrowman.) Which brings me nicely to the second type of speech recognition: speaker dependent. These are systems that are designed to work for one specific speaker, and usually to adapt and get more accurate for that speaker over time.

If you’ve read some of my earlier posts, you’ll know I suggested that the difference in performance between dialects and genders was due to imbalances in the training data. The nice thing about speaker dependent systems is that the training data is made up of one voice: yours. (Although the system is usually initialized based on some other training set.)
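To make the "initialized on other data, then adapted to you" idea a bit more concrete, here is a minimal sketch of one standard adaptation scheme: a MAP (maximum a posteriori) update of a single Gaussian mean. It is a toy illustration only, not how any particular product works. The function name `map_adapt_mean`, the prior weight `tau`, and all the numbers are assumptions made up for the example.

```python
def map_adapt_mean(prior_mean, samples, tau=10.0):
    """Blend a prior mean with new observations from one speaker.

    prior_mean: value learned from the original (many-speaker) training set
    samples:    new measurements of the same feature from your own voice
    tau:        hypothetical prior weight, roughly "how many samples the
                prior is worth"; larger values mean slower adaptation
    """
    n = len(samples)
    if n == 0:
        # No data from the new speaker yet: keep the initial model.
        return prior_mean
    return (tau * prior_mean + sum(samples)) / (tau + n)


# Made-up numbers: the initial model expects a feature value around 200,
# but this speaker consistently produces values around 120.
initial = 200.0
print(map_adapt_mean(initial, [120.0] * 10))  # 160.0
print(map_adapt_mean(initial, [120.0] * 90))  # 128.0
```

With only a little of your speech the estimate stays close to the starting model, and as more of your data comes in it drifts toward your own voice, which is the "gets more accurate for that speaker over time" behaviour described above.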

So how can you get a speaker dependent ASR system?

  • Buy software such as Dragon speech recognition. This is probably the most popular commercial speaker-dependent voice recognition software (or at least the one I hear the most about). It does, however, cost real money.
  • Make your own! If you’re feeling inspired, you can make your own personalized ASR system. I’d recommend the CMU Sphinx toolkit; it’s free and well-documented. To make your own recognizer, you’ll need to build your own language model using text you’ve written as well as adapt the acoustic model using your recorded speech. The former lets the recognizer know what words you’re likely to say, and the latter how you say things. (If you’re REALLY gung-ho you can even build your own acoustic model from scratch, but that’s pretty involved.)
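If you’re wondering what "a language model built from text you’ve written" actually boils down to, here is a minimal sketch: count which words follow which in your own writing, then turn the counts into probabilities. This is a toy bigram model for illustration, not the CMU Sphinx tooling or its file format, and the function names and sample text are made up for the example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and pull out word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def train_bigram_model(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    tokens = ["<s>"] + tokenize(text)  # "<s>" marks the start of the text
    for prev, word in zip(tokens, tokens[1:]):
        counts[prev][word] += 1
    return counts


def next_word_probability(counts, prev, word):
    """Relative frequency of `word` right after `prev` (0.0 if `prev` is unseen)."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return followers[word] / total if total else 0.0


# Stand-in for a pile of text you've written yourself.
my_writing = "Recognize speech well. Recognize speech quickly. Wreck a nice beach once."
model = train_bigram_model(my_writing)

print(next_word_probability(model, "recognize", "speech"))  # 1.0
print(next_word_probability(model, "speech", "well"))       # 0.5
print(next_word_probability(model, "speech", "beach"))      # 0.0
```

A real recognizer would use far more text and some smoothing so that unseen word pairs don’t get a probability of exactly zero, but the idea is the same: your own writing tells the system which words you’re likely to say next.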

In theory, the bones of any ASR system should work equally well on any spoken human language. (Sign language recognition is a whole nother kettle of fish.) The difficulty is getting large amounts of (socially stratified) high-quality training data. By feeding a system data without a lot of variation, for example by using only one person’s voice, you can usually get more accurate recognition more quickly.

 

 
