How to be wrong: Measuring error in machine learning models

One thing I remember very clearly from writing my dissertation is how confused I initially was about which particular methods I could use to evaluate how often my models were correct or wrong. (A big part of my research was comparing human errors with errors from various machine learning models.) With that in mind, I thought it might be handy to put together a very quick equation-free primer of some different ways of measuring error.

The first step is to figure out what type of model you’re evaluating. Which type of error measurement you use depends on the type of model you’re evaluating. This was a big part of what initially confused me: much of my previous work had been with regression, especially mixed-effects regression, but my dissertation focused on multi-class classification instead. As a result, the techniques I was used to using to evaluate models just didn’t apply.

Today I’m going to talk about three types of models: regression, binary classification and multiclass classification.


In regression, your goal is to predict the value of an output value given one or more input values. So you might use regression to predict how much a puppy will weigh in four months or the price of cabbage. (If you want to learn more about regression, I recently put together a beginner’s guide to regression with five days of exercises.)

  • R-squared: This is a measurement of how correlated your predicted values are with the actual observed values. It ranges from 0 to 1, with 0 being no correlation and 1 being perfect correlation. In general, models with higher r-squareds are a better fit for your data.
  • Root mean squared error (RMSE), aka standard error: This measurement is an average of how wrong you were for each point you predicted. It ranges from 0 up, with closer to zero being better. Outliers (points you were really wrong about) will disproportionately inflate this measure.

Binary Classification

In binary classification, you aim to predict which of two classes an observation will fall. Examples include predicting whether a student will pass or fail a class or whether or not a specific passenger survived on the Titanic. This is a very popular type of model and there are a lot of ways of evaluating them, so I’m just going to stick to the four that I see most often in the literature.

  • Accuracy: This is proportion of the test cases that your model got right. It ranges from 0 (you got them all wrong) to 1 (you got them all right).
  • Precision: This is a measure of how good your model is at selecting only the members of a certain class. So if you were predicting whether students would pass or not and all of the students you predicted would pass actually did, then your model would have perfect precision. Precision ranges from 0 (none of the observations you said were in a specific class actually were) to 1 (all of the observations you said were in that class actually were). It doesn’t tell you about how good your model is at identifying all the members of that class, though!
  • Recall (aka True Positive Rate, Specificity): This is a measure of how good your model was at finding all the data points that belonged to a specific class. It ranges from 0 (you didn’t find any of them) to 1 (you found all of them). In our students example, a model that just predicted all students would pass would have perfect recall–since it would find all the passing students–but probably wouldn’t have very good precision unless very few students failed.
  • F1 (aka F-Score): The F score is the (harmonic) mean of both precision and recall. It also ranges from 0 to 1. Like precision and recall, it’s calculated based on a specific class you’re interested in. One thing to note about precision, recall and F1 is that they all don’t count true negatives (when you guessed something wasn’t in a specific class and you were right) so if that’s an important consideration for your model you probably shouldn’t rely on these measures.

Multiclass Classification

Multiclass classification is the task of determining which of three or more classes a specific observation belongs to. Things like predicting which icecream flavor someone will buy or automatically identifying the breed of a dog are multiclass classification.

  • Confusion Matrix: One of the most common ways to evaluate multiclass classifications is with a confusion matrix, which is a table with the actual labels along one axis and the predicted labels along the other (in the same order). Each cell of the table has a count value for the number of predictions that fell into that category. Correct predictions will fall along the center diagonal. This won’t give you a single summary measure of a system, but it will let you quickly compare performance across different classes.
  • Cohen’s Kappa: Cohen’s kappa is a measure of how much better than chance a model is at assigning the correct class to an observation. It range from -1 to 1, with higher being better. 0 indicates that the model is at chance levels (i.e. you could do as well just by randomly guessing). (Note that there are some people who will strongly advise against using Cohen’s Kappa.)
  • Informedness (aka Powers’ Kappa): Informedness tells us how likely we are to make an informed decision rather than a random guess. It is the true positive rate (aka recall) plus the true negative rate, minus 1. Like precision, recall and F1, it’s calculated on a class-by-class basis but we can calculate it for a multiclass classification model by taking the (geometric) mean across all of the classes. It ranges from -1 to 1, with 1 being a model that always makes correct predictions, 0 being a model that makes predictions that are no different than random guesses and -1 being a model that always makes incorrect predictions.

Packages for analysis

For R, the Metrics package and caret package both have implementations of these model metrics, and you’ll often find functions for evaluating more specialized models in the packages that contain the models themselves. In Python, you can find implementations of many of these measurements in the scikit-learn module.

Also, it’s worth noting that any single-value metric can only tell you part of the story about a model. It’s important to consider things besides just accuracy when selecting or training the best model for your needs.

Got other tips and tricks for measuring model error? Did I leave out one of your faves? Feel free to share in the comments. 🙂