How to be wrong: Measuring error in machine learning models

One thing I remember very clearly from writing my dissertation is how confused I initially was about which particular methods I could use to evaluate how often my models were correct or wrong. (A big part of my research was comparing human errors with errors from various machine learning models.) With that in mind, I thought it might be handy to put together a very quick equation-free primer of some different ways of measuring error.

The first step is to figure out what type of model you’re evaluating. Which type of error measurement you use depends on the type of model you’re evaluating. This was a big part of what initially confused me: much of my previous work had been with regression, especially mixed-effects regression, but my dissertation focused on multi-class classification instead. As a result, the techniques I was used to using to evaluate models just didn’t apply.

Today I’m going to talk about three types of models: regression, binary classification and multiclass classification.


In regression, your goal is to predict the value of an output value given one or more input values. So you might use regression to predict how much a puppy will weigh in four months or the price of cabbage. (If you want to learn more about regression, I recently put together a beginner’s guide to regression with five days of exercises.)

  • R-squared: This is a measurement of how correlated your predicted values are with the actual observed values. It ranges from 0 to 1, with 0 being no correlation and 1 being perfect correlation. In general, models with higher r-squareds are a better fit for your data.
  • Root mean squared error (RMSE), aka standard error: This measurement is an average of how wrong you were for each point you predicted. It ranges from 0 up, with closer to zero being better. Outliers (points you were really wrong about) will disproportionately inflate this measure.

Binary Classification

In binary classification, you aim to predict which of two classes an observation will fall. Examples include predicting whether a student will pass or fail a class or whether or not a specific passenger survived on the Titanic. This is a very popular type of model and there are a lot of ways of evaluating them, so I’m just going to stick to the four that I see most often in the literature.

  • Accuracy: This is proportion of the test cases that your model got right. It ranges from 0 (you got them all wrong) to 1 (you got them all right).
  • Precision: This is a measure of how good your model is at selecting only the members of a certain class. So if you were predicting whether students would pass or not and all of the students you predicted would pass actually did, then your model would have perfect precision. Precision ranges from 0 (none of the observations you said were in a specific class actually were) to 1 (all of the observations you said were in that class actually were). It doesn’t tell you about how good your model is at identifying all the members of that class, though!
  • Recall (aka True Positive Rate, Specificity): This is a measure of how good your model was at finding all the data points that belonged to a specific class. It ranges from 0 (you didn’t find any of them) to 1 (you found all of them). In our students example, a model that just predicted all students would pass would have perfect recall–since it would find all the passing students–but probably wouldn’t have very good precision unless very few students failed.
  • F1 (aka F-Score): The F score is the (harmonic) mean of both precision and recall. It also ranges from 0 to 1. Like precision and recall, it’s calculated based on a specific class you’re interested in. One thing to note about precision, recall and F1 is that they all don’t count true negatives (when you guessed something wasn’t in a specific class and you were right) so if that’s an important consideration for your model you probably shouldn’t rely on these measures.

Multiclass Classification

Multiclass classification is the task of determining which of three or more classes a specific observation belongs to. Things like predicting which icecream flavor someone will buy or automatically identifying the breed of a dog are multiclass classification.

  • Confusion Matrix: One of the most common ways to evaluate multiclass classifications is with a confusion matrix, which is a table with the actual labels along one axis and the predicted labels along the other (in the same order). Each cell of the table has a count value for the number of predictions that fell into that category. Correct predictions will fall along the center diagonal. This won’t give you a single summary measure of a system, but it will let you quickly compare performance across different classes.
  • Cohen’s Kappa: Cohen’s kappa is a measure of how much better than chance a model is at assigning the correct class to an observation. It range from -1 to 1, with higher being better. 0 indicates that the model is at chance levels (i.e. you could do as well just by randomly guessing). (Note that there are some people who will strongly advise against using Cohen’s Kappa.)
  • Informedness (aka Powers’ Kappa): Informedness tells us how likely we are to make an informed decision rather than a random guess. It is the true positive rate (aka recall) plus the true negative rate, minus 1. Like precision, recall and F1, it’s calculated on a class-by-class basis but we can calculate it for a multiclass classification model by taking the (geometric) mean across all of the classes. It ranges from -1 to 1, with 1 being a model that always makes correct predictions, 0 being a model that makes predictions that are no different than random guesses and -1 being a model that always makes incorrect predictions.

Packages for analysis

For R, the Metrics package and caret package both have implementations of these model metrics, and you’ll often find functions for evaluating more specialized models in the packages that contain the models themselves. In Python, you can find implementations of many of these measurements in the scikit-learn module.

Also, it’s worth noting that any single-value metric can only tell you part of the story about a model. It’s important to consider things besides just accuracy when selecting or training the best model for your needs.

Got other tips and tricks for measuring model error? Did I leave out one of your faves? Feel free to share in the comments. 🙂


Can a computer write my blog posts?

This post is pretty special: it’s the 100th post I’ve made since starting my blog! It’s hard to believe I’ve been doing this so long. I started blogging in 2012, in my final year of undergrad, and now I’m heading into my last year of my PhD. Crazy how fast time flies.

Ok, back on topic. As I was looking back over everything I’ve written, it struck me that 99 posts worth of text on a very specific subject domain (linguistics) in a very specific register (informal) should be enough text to train a simple text generator.

So how did I go about building a blog bot? It was pretty easy! All I needed was:

  • 67,000 words of text (all blog posts before this one)
  • 1 R script to tidy up the text
  • 1 Python script to train a Markov Chain  text generator

A Markov Whatnow?

A Markov Chain is a type of simple (but surprisingly powerful) statistical model that tells you, given the item you’re currently on, what item you’re likely to see next. Today we’re going to apply it to whole words in a text.

How does it work? Basically, for each word in your text, you count how many different words occur after it, how many time each shows up and figure out the probability of each transition. So if your text is “The dog ate the apple.”, then there’s a 50% chance that “the” will be followed by “apple”, but a 100% chance that “apple” will be followed by “.”. You can then use these probabilities to generate new sentences, like so:

  • Input: The dog ate the apple.
  • Possible outputs:
    • The apple.
    • The dog ate the apple.
    • The dog ate the dog ate the apple.
    • The dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the dog ate the apple. (This may seem silly, but remember that we’re only looking at two words at a time. This model doesn’t “remember” what came earlier or “know” that it’s repeating itself.)

OK, now for the actual training of the model. If you want to play around with this yourself, all my code and text is up on GitHub here.

  1. First, grab your text. I downloaded all my blog posts as an XML file and used a quick little R script to snag just the text I was interested in.
  2. Next, train your model. I used PyMarkovChain, but you can always write your own if you’re feeling it.
  3. Finally, use the model to generate text.

Alright, now I’ve got (admittedly very, very dumb) little program that kinda knows what I write like when I’m blogging. How well does it do at pretending to be me?

First try:

I’m going to break eye contact, look down at your own personalized ASR system

Well that’s… vaguely disturbing. But also not really very human-sounding, and certainly not something I’d write in a blog post. Let’s give it a couple more tries and see if it does any better.

  • But, if frosting has to have a career where you learned it from Clarice
  • We only perceive the things that our bodies and sensory organs and brains can perceive, and we really don’t care about to tell the sounds in the second part, I was expecting to see whether something is grammatical is to put up a bunch of historical stuff that we’ve all heard about the structure in the arts and sciences were depressed
  • (Which is really interesting–I’d love to learn your language doesn’t make you very, very good a recognizing speech
  • Believe me, I know what you can uncontract them and what’s the take-away
    People with short face syndrome–yes, that’s a classical description of what a language does, along with parallels drawn to another, related, languages
  • Short answer: they’re all correct
  • And those speakers are aware of
  • The Job Market for Linguistics PhDsWhat do you much

Hmm, not so great. The only output that sounded good to me was “Short answer: they’re all correct”. And that was just the exact same phrasing I used in this blog post. No wonder it sounds just like me; it is me!

So it looks like I won’t be handing the reins over to Making Noise and Hearing Things bot any time soon. True, I could have used a fancier tool, like a Recurrent Neural Network. But to be perfectly honest, I have yet to see any text generation system that can produce anything even close to approximating a human-written blog post. Will we get there? Maybe. But natural language generation, especially outside of really formulaic things like weather or sports reporting, is a super hard problem. Heck, we still haven’t gotten to point where computers can reliably solve third-grade math word problems.

The very complexities that make language so useful (and interesting to study) also make it so hard to model. Which is good news for me! It means there’s still plenty of work to do in language modelling and blogging.