Editor’s note: This article compares measures of model performance. Strictly speaking, “accuracy” is one specific measure, but this article uses the word generically to refer to performance measures in general.
In data mining, data scientists use algorithms to identify previously unrecognized patterns and trends hidden within vast amounts of structured and unstructured information. These patterns are used to create predictive models that try to forecast future behavior.
These models have many practical business applications—they help banks decide which customers to approve for loans, and marketers use them to determine which leads to target with campaigns.
But extracting real meaning from data can be challenging. Bad data, flawed processes and the misinterpretation of results can yield false positives and negatives, which can lead to inaccurate conclusions and ill-advised business decisions.
There are many different tests to determine if the predictive models you create are accurate, meaningful representations that will prove valuable to your organization—but which are best? To find out, we spoke to three top data mining experts. Here, we reveal the tests they use to measure their own results, and what makes each test so effective.
Karl Rexer is the founder of Rexer Analytics, a small consulting firm that provides predictive modeling and other analytic solutions to clients such as Deutsche Bank, CVS, Redbox and ADT Security Services. The measures his firm uses to create predictive models are often binary: people either convert on a website or don’t, bank customers close their accounts or leave them open, etc.
Rexer’s firm creates models that help clients determine how likely people are to complete a binary behavior. These models are created with algorithms that typically use historical data pulled from the client’s data warehouse to characterize behaviors and identify patterns.
To test the strength of these models, Rexer’s firm frequently uses lift charts and decile tables, which measure the performance of the model against random guessing, or what the results would be if you didn’t use any model. It’s a methodology commonly used to increase customer response rates by identifying the most promising targets.
Let’s look at an example. One of Rexer’s clients wanted to create a more focused, cost-effective marketing campaign for the coming year, so he and his team built a predictive model for the client using data from the previous year’s campaign.
In the previous year, the client had sent marketing material to 200,000 total leads, 1,579 of which became customers—a conversion rate of 0.8 percent. For the coming year, the client wanted to know in advance which of their new leads would be more likely to buy their product so they could focus their marketing efforts on this segment. Essentially, they wanted a smaller, cheaper campaign with a higher conversion rate.
Rexer and his team first took the client’s lead data and randomly split it into two groups. Approximately 60 percent was used to build the model, while the other 40 percent was set aside to test it (known as the “hold-out” sample).
Splitting the data “helps ensure that you’re not creating a ‘super’ model that only works on that one set of data and nothing else,” Rexer explains. “You’re essentially building something that works on two different sets of data.”
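The random split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_holdout`, the fixed seed and the stand-in list of lead IDs are hypothetical, and the article does not say what tooling Rexer's team actually used.

```python
import random

def split_holdout(records, build_fraction=0.6, seed=42):
    """Randomly split records into a model-building sample and a hold-out sample."""
    rng = random.Random(seed)        # fixed seed so the split is repeatable
    shuffled = list(records)         # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * build_fraction))
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 200,000 historical leads (hypothetical IDs, not real data).
leads = list(range(200_000))
build_sample, holdout_sample = split_holdout(leads)
print(len(build_sample), len(holdout_sample))
```

Every lead lands in exactly one of the two samples, so the hold-out sample stays unseen while the model is being built.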
Rexer and his team used a variety of data mining algorithms to comb through hundreds of data fields to find the best set of predictors that worked in the modeling sample. These predictors were used to identify a highly responsive subset of leads.
The resulting model was then used to score each lead in the hold-out sample on a scale of 1 to 100, based on the characteristics it had identified as predictive, such as how long the person had been a customer and various socio-demographic information. The closer to 100, the more likely the lead was to convert.
The list of leads in the hold-out sample was then rank ordered by score and split into 10 sections, called deciles. The top 10 percent of scores was decile one, the next 10 percent was decile two, and so forth.
Since these are all historical leads, the decile table can report on the number and proportion of sales for each decile. If the model is working well, the leads in the top deciles will have a much higher conversion rate than the leads in the lower deciles. The results of this client’s targeting model can be seen in the following decile table:
In this table, the “Lift” column signifies how much more successful the model is likely to be than if no predictive model was used to target leads.
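To make the arithmetic behind a decile table concrete, here is a minimal Python sketch. It assumes each hold-out lead has a model score and a 0/1 conversion flag. The function name `decile_table`, the column names and the synthetic data are all hypothetical and are not the client's figures.

```python
import random

def decile_table(scores, converted):
    """Rank leads by score (highest first), cut the list into 10 equal groups,
    and report the conversion rate and lift for each group."""
    if not any(converted):
        raise ValueError("lift is undefined when there are no conversions")
    ranked = sorted(zip(scores, converted), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    overall_rate = sum(converted) / n
    rows = []
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        sales = sum(flag for _, flag in chunk)
        rate = sales / len(chunk)
        # Lift: this decile's conversion rate relative to the overall rate.
        rows.append({"decile": d + 1, "leads": len(chunk), "sales": sales,
                     "rate": rate, "lift": rate / overall_rate})
    return rows

# Synthetic hold-out sample (NOT the client's data): higher scores convert more often.
rng = random.Random(0)
scores = [rng.uniform(1, 100) for _ in range(20_000)]
converted = [1 if rng.random() < 0.00016 * s else 0 for s in scores]

for row in decile_table(scores, converted):
    print(f"{row['decile']:>2}  {row['leads']:>5}  {row['sales']:>4}  "
          f"{row['rate']:.2%}  {row['lift']:.2f}")
```

A lift above 1.0 means the decile converts better than the list as a whole, and a lift below 1.0 means it converts worse.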
The data shown in a decile table can also be plotted on a lift chart to create a visual representation of model performance. If no model were used and leads were contacted randomly, the result would be a straight diagonal line (represented in red in the chart below). Contacting the first decile, or the first 10 percent of leads, would yield 10 percent of sales, contacting 20 percent of leads would yield 20 percent of sales, and so on.
In this chart, the predictive model is represented by the curved blue line. The red X marks the lift of the first decile above the random model. That lift is 4.0, meaning the first decile captures four times the 10 percent of sales that random selection would yield, or 40 percent. In other words, if one were to select the top 10 percent of leads with the highest model scores, one would obtain 40 percent of total sales, which is substantially better than random.
The green line represents the ideal, “perfect” model, where contacting only 0.8 percent of leads (the observed conversion rate) would yield 100 percent of sales.
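The three lines on a chart like this can be computed as follows. This is a sketch under stated assumptions: `cumulative_gains` is a hypothetical helper, and the 100-lead sample is made up so that the first decile captures 40 percent of sales, mirroring the 4.0 lift discussed above. Because the made-up sample converts at 10 percent rather than 0.8 percent, the perfect model here reaches 100 percent of sales after contacting 10 percent of leads.

```python
def cumulative_gains(scores, converted, steps=10):
    """Return (share contacted, share of sales captured by the model,
    share captured by a perfect ranking) at each step down the ranked list."""
    ranked = [flag for _, flag in
              sorted(zip(scores, converted), key=lambda p: p[0], reverse=True)]
    n, total_sales = len(ranked), sum(ranked)
    points = []
    for k in range(1, steps + 1):
        contacted = k * n // steps
        model_share = sum(ranked[:contacted]) / total_sales
        # A perfect ranking puts every buyer at the very top of the list.
        perfect_share = min(contacted, total_sales) / total_sales
        points.append((contacted / n, model_share, perfect_share))
    return points

# Made-up sample of 100 leads with 10 sales, 4 of them among the 10 top scores.
scores = list(range(100))
buyers = {99, 98, 97, 96, 85, 75, 60, 45, 30, 10}
converted = [1 if s in buyers else 0 for s in scores]

for contacted, model_share, perfect_share in cumulative_gains(scores, converted):
    lift = model_share / contacted   # the random baseline equals the share contacted
    print(f"{contacted:.0%} contacted: model {model_share:.0%}, "
          f"random {contacted:.0%}, perfect {perfect_share:.0%}, lift {lift:.1f}")
```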
While this chart shows just one predictive model, Rexer typically creates several predictive models and uses lift charts and decile tables to compare their performance in order to select the best one. It’s a process, he says, that “identifies the best way of prioritizing your list so that you can go just a short way into it and have a large proportion of people completing the desired behavior.”
It’s also a process that Rexer says is more “helpful and necessary” than traditional statistical significance tests. “Statistical significance is evaluating whether something could have happened by chance or not,” he says.
“[With data mining algorithms, lift charts and decile tables], you’re doing something called supervised learning. You’re using historical data where you know the outcome of the scenario to supervise the creation of your model and evaluate how well it will work to predict a certain behavior. It’s a different methodology.”
John Elder is the founder of data mining and predictive analytics services firm Elder Research. He tests the statistical accuracy of his data mining results through a process called target shuffling. It’s a method Elder says is particularly useful for identifying false positives: cases where two events or variables that occur together are perceived to have a cause-and-effect relationship when the connection is in fact coincidental.
“The more variables you have, the easier it becomes to ‘oversearch’ and identify (false) patterns among them,” Elder says. He calls this the “vast search effect.”
As an example, he points to the Redskins Rule, where for over 70 years, if the Washington Redskins won their last home football game, the incumbent party would win the presidential election. “There’s no real relationship between these two things,” Elder says, “but for generations, they just happened to line up.”
In situations like these, it becomes easy to make inferences that are not only incorrect, but can be dangerously misleading. To prevent this problem, Elder uses target shuffling with all of his clients. It’s a process that reveals how likely it is that results occurred by chance, which helps determine whether an apparent relationship between two variables is real or merely coincidental.
“Target shuffling is essentially a computer simulation that does what statistical tests were designed to when they were first invented,” Elder explains. “But this method is much easier to understand, explain and use than those mathematical formulas.”
Here’s how the process works:
Let’s break this down: imagine you have a math class full of students who are going to take a quiz. Before the quiz, everyone fills out a card with various personal information, such as name, age, how many siblings they have and what other math classes they’ve taken. Everyone then takes the quiz and receives their score.
To find out why certain students scored higher than others, you model the outputs (the score each student received) as a function of the inputs (students’ personal information) to identify patterns. Let’s say you find that older sisters had the highest quiz scores, which you think is a solid predictor of which types of future students will perform the best.
But depending on the size of the class and the number of questions you asked everyone, there’s always a chance that this relationship is not real, and therefore won’t hold true for the next class of students.
With target shuffling, you compare the same inputs and outputs against each other a second time to test the validity of the relationship. This time, however, you randomly shuffle the outputs so each student receives a different quiz score—Suzy gets Bob’s, Bob gets Emily’s, and so forth.
All of the inputs (personal information) remain the same for each person, but each now has a different output (test score) assigned to them. This effectively breaks the relationship between the inputs and the outputs without otherwise changing the data.
You then repeat this shuffling process over and over (perhaps 1,000 times), comparing the inputs with the randomly assigned outputs each time. While there should be no real relationship between each student’s personal information and these new, randomly assigned test scores, you’ll inevitably find some new false positives, or “bogus” relationships (e.g. older males receive the highest scores, women who also took Calculus receive the highest scores).
As you repeat the process, you record these “bogus” results using a histogram—a graphical representation of how data is distributed. You then evaluate where on this distribution (or beyond it) your model’s initial results (older sisters score highest) stand.
If you find that this result appears stronger than the best result of your shuffled data, you can be pleased with the original finding, as it was not matched by any random results.
If your initial result falls within the distribution, this tells you what proportion of random results did as well, which is your significance level. (For a study to be published in most medical journals, for example, the acceptable significance level is 0.05, meaning that there’s only a 5 percent chance of results that strong occurring by chance.)
In the histogram pictured above, the model scored in the high 20s. Only 0.04 percent of the random, shuffled models performed better, meaning the model is significant to that level (and would meet the criteria of a publishable result in any journal).
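The shuffling procedure can be sketched in Python as below. This is a simplified illustration, not Elder's implementation. It assumes every input is a yes/no fact about a student, and it uses a deliberately crude "model": search all inputs and keep the biggest gap in average score. The names `best_gap` and `target_shuffle` and the class data are hypothetical.

```python
import random
from statistics import mean

def best_gap(features, outcomes):
    """Search every yes/no feature and return the largest gap in mean outcome
    between rows that have the feature and rows that do not."""
    best = 0.0
    for j in range(len(features[0])):
        with_f = [y for row, y in zip(features, outcomes) if row[j]]
        without = [y for row, y in zip(features, outcomes) if not row[j]]
        if with_f and without:
            best = max(best, abs(mean(with_f) - mean(without)))
    return best

def target_shuffle(features, outcomes, n_shuffles=1000, seed=0):
    """Repeat the same search on shuffled outcomes and report the share of
    shuffles whose best 'bogus' result matched or beat the real one."""
    rng = random.Random(seed)
    observed = best_gap(features, outcomes)
    shuffled = list(outcomes)
    bogus = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)            # breaks the input-output relationship
        bogus.append(best_gap(features, shuffled))
    as_good = sum(1 for b in bogus if b >= observed)
    return observed, bogus, as_good / n_shuffles

# Made-up class of 40 students: three yes/no facts each, plus a quiz score
# that truly depends only on the first fact.
rng = random.Random(1)
features = [tuple(rng.random() < 0.5 for _ in range(3)) for _ in range(40)]
quiz = [70 + 15 * row[0] + rng.gauss(0, 5) for row in features]

observed, bogus, p_value = target_shuffle(features, quiz)
print(f"observed gap {observed:.1f}, best shuffled gap {max(bogus):.1f}, p = {p_value:.3f}")
```

The list `bogus` holds the values that would be drawn as the histogram, and the returned proportion plays the role of the significance level. Repeating the full search on every shuffle matters, because it lets the shuffled results reflect how much searching was done.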
Elder first came up with target shuffling 20 years ago, when his firm was working with a client who wasn’t sure if he wanted to invest more money into a new hedge fund. While the fund had done well in its first year, it had been a volatile ride, and the client was unsure if the success was real. A statistical test showed that the probability of the fund being that successful by chance was very low, but the client wasn’t convinced.
So Elder performed 1,000 simulations in which he shuffled the target variable (as described above), which in this case was the buy or hold signal for each day. He then compared the random results to how the hedge fund had actually performed.
Out of 1,000 simulations, the random distribution returned better results in just 15 instances—in other words, there was a 1.5 percent chance that the hedge fund’s success was a result of luck. This new way of presenting the data made sense to the client, and as a result he invested 10 times as much in the fund.
“I learned two lessons from that experience,” Elder says. “One is that target shuffling is a very good way to test non-traditional statistical problems. But more importantly, it’s a process that makes sense to a decision maker. Statistics is not persuasive to most people—it’s just too complex.
“If you’re a business person, you want to make decisions based upon things that are real and will hold up. So when you simulate a scenario like this, it quantifies how likely it is that the results you observed could have arisen by chance in a way that people can understand.”
Like Elder, Dean Abbott, president of Abbott Analytics, Inc., says predictive analytics and data mining practitioners typically don’t use the traditional statistical tests taught in college statistics classes to assess models. Instead, they assess them with data.
“One big reason for this is that everything passes statistical tests with significance,” he says. “If you have a million records, everything looks like it’s good.”
According to Abbott, there’s a difference between statistical significance and what he calls operational significance. “You can have a model that is statistically significant, but it doesn’t mean that it’s generating enough revenue to be interesting,” he explains.
“You might come up with a model for a marketing campaign that winds up generating $40,000 in additional revenue, but that’s not enough to even cover the cost of the modeler who built it.”
This, Abbott says, is where many data miners fall short. He gives the example of comparing two different models, where you’re looking at the lift in the third decile for each. In one model, the lift is 3.1, while the other is 3.05.
“Data miners would typically say, ‘Ah! The 3.1 model is higher, let me go with that,’” he says. “Well, that’s true—but is it [operationally] significant? And this is where I like to use other methods to find out if one is truly better than the other. It’s a more intuitive way to get there.”
To do this, one method Abbott uses is bootstrap sampling (also called bootstrapping). This method repeatedly draws random samples from the data, with replacement, and tests the model’s performance on each one to provide an estimate of accuracy.
Whenever Abbott builds a predictive model, he takes a random sample of the data and partitions it into three subsets: training, testing and validation. The model is built on the data in the training subset, and then evaluated on the testing subset data, allowing him to see how it performs on each of these two subsets. (We’ll discuss the third subset in a moment.)
Let’s apply this to a real-world example. If you’re studying to take the Graduate Record Examinations (GRE), for instance, you might go out and get a test book. This can be thought of as the training subset. You study the book for weeks, then take a sample test and score a 95.
To find out if this score is a realistic predictor of how you’ll do on the real exam, you go out and get a second book—this is your testing subset. If you do as well on your sample test after studying that second book as you did on the first, then you know the model is correct.
But let’s say you score a 70 on your second test. When your test performance is worse than your training performance, “it’s because you’ve overfit the model and made it too complex,” Abbott says. This means you have to go back and tune parameters to simplify the model in order to improve the test performance, and then repeat the process: training and testing, training and testing.
Now that you’ve gone through this process several times, however, you have a problem—because you’ve been using the testing subset to help guide how you define the settings and parameters of your model for training, it’s no longer fully independent.
This is why you have a third subset of validation data. Ideally, the data you use to test your model is data you’ve never seen before, so once you’re finally convinced that you have a good model that’s consistent and accurate, you then deploy it against this final subset.
“It’s like holding onto your lock and key,” Abbott explains. “You don’t let anybody see it until you’re ready to go out and test your model against this ‘new’ data.”
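A three-way partition of this kind might look like the sketch below. The 60/20/20 proportions, the seed and the function name `three_way_split` are assumptions made for illustration, since the article does not say what proportions Abbott uses.

```python
import random

def three_way_split(records, train=0.6, test=0.2, seed=7):
    """Shuffle once, then cut into training, testing and validation subsets.
    Whatever is left after the first two cuts becomes the validation subset."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    first_cut = int(round(len(shuffled) * train))
    second_cut = first_cut + int(round(len(shuffled) * test))
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]

records = list(range(1_000))   # hypothetical record IDs
training, testing, validation = three_way_split(records)

# Tune on training and testing as often as needed; score the validation
# subset only once, after the model and its settings are final.
print(len(training), len(testing), len(validation))
```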
Now let’s go back to Abbott’s example of comparing a model with a 3.1 lift versus a 3.05 lift. If you used bootstrap sampling to run each model through 1,000 bootstrap tests, here’s what you might find:
According to Abbott, bootstrap sampling is good for two things. The first, obviously, is picking which model wins. The second is that running a model through 1,000 bootstrap samples (or however many you choose to run) gives you a range of lifts rather than a single number.
“[Bootstrap sampling] tells you how the model accuracy is bounded, and thus what to expect when you run it live,” he says. “When you only run a model through test data, it’s hard to know if the lift you’re getting is real.”
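As a rough sketch of how such a comparison could be run, the Python below resamples a hold-out set with replacement and recomputes one decile's lift for two competing models on every resample. Everything here is illustrative. The helper names, the synthetic scores and the choice to measure the lift of the third decile on its own (rather than cumulatively) are assumptions, not Abbott's procedure.

```python
import random

def decile_lift(rows, score_index, decile=3):
    """Lift of one decile when rows are ranked by the chosen model's score.
    Each row is (score_a, score_b, converted)."""
    ranked = sorted(rows, key=lambda r: r[score_index], reverse=True)
    n = len(ranked)
    chunk = ranked[(decile - 1) * n // 10:decile * n // 10]
    overall_rate = sum(r[2] for r in ranked) / n
    return (sum(r[2] for r in chunk) / len(chunk)) / overall_rate

def bootstrap_lifts(rows, n_boot=1000, decile=3, seed=0):
    """Resample rows with replacement and collect both models' lifts each time."""
    rng = random.Random(seed)
    lifts_a, lifts_b = [], []
    for _ in range(n_boot):
        sample = rng.choices(rows, k=len(rows))
        if not any(r[2] for r in sample):
            continue                     # lift is undefined without conversions
        lifts_a.append(decile_lift(sample, 0, decile))
        lifts_b.append(decile_lift(sample, 1, decile))
    return lifts_a, lifts_b

def percentile_interval(values, lower=0.025, upper=0.975):
    """Simple percentile range of the bootstrap results."""
    ordered = sorted(values)
    return (ordered[int(lower * (len(ordered) - 1))],
            ordered[int(upper * (len(ordered) - 1))])

# Synthetic hold-out set: two models score the same leads, model B more noisily.
rng = random.Random(3)
rows = []
for _ in range(1000):
    propensity = rng.random()
    rows.append((propensity + rng.gauss(0, 0.2),
                 propensity + rng.gauss(0, 0.4),
                 1 if rng.random() < 0.2 * propensity else 0))

lifts_a, lifts_b = bootstrap_lifts(rows)
a_wins = sum(a > b for a, b in zip(lifts_a, lifts_b)) / len(lifts_a)
print("model A lift range:", percentile_interval(lifts_a))
print("model B lift range:", percentile_interval(lifts_b))
print(f"model A beats model B in {a_wins:.0%} of resamples")
```

If the two ranges overlap heavily and neither model wins a clear majority of resamples, a small gap such as 3.1 versus 3.05 is probably not a reliable reason to prefer one model over the other.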
Lift chart and decile table created by Karl Rexer. Histogram of shuffled model success courtesy of John Elder. Bootstrap image created by Dean Abbott.
By: Victoria Garment, Content Editor, Software Advice
As featured on Software Advice’s Plotting Success blog