Several scientific disciplines have been rocked by a crisis of reproducibility in recent years. Not long ago, Bayer researchers found that they were able to replicate only 25% of the important pharmaceutical papers they examined, and an MIT report on machine learning papers found similar results. Some fields have begun to emerge from their crises, but others, such as psychology, may not yet have hit bottom.
We might imagine that this is because many scientists are good at science but not so adept with statistics. We might even imagine that we Analytics practitioners should have fewer problems because we are good at statistics. In fact, we face an equivalent issue: predictive models that underperform once deployed. We have a powerful tool for preventing an underperforming model: Cross Validation (CV). But the ubiquity of CV in our modeling tools has led many Analysts to misunderstand how to use it properly or how to create appropriate CV partitions, leading to lower-performing models.
This article will address the proper use and partitioning of CV to help us avoid these crises of under-performance in our own projects.
Before discussing CV, we need to understand why it is necessary. Model validation allows us to test the accuracy of our models, but it is difficult to do correctly. The most basic mistake an analytics team can make is to test a model on the same data that was used for training. Doing this is like giving a test to students after handing out the answers: any sufficiently flexible model will get a perfect score.
A basic precaution that enables us to avoid testing and training on the same data is to divide our data into two sets before doing any analysis: one to train the model and the other to test the model. The training data represents the data that we will have in hand before deploying our model, and the test data represents the new, never-seen-before data that will arrive during production. This procedure is vital to modeling practice. Analysts who examine their entire dataset too closely learn too much about it to be surprised later by the held-out subset (unlike the real data of the future, which can always shock), and they become overconfident in their models’ accuracy until the cold cruel world brings ruin.
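A minimal sketch of such a hold-out split, in plain Python, using a hypothetical dataset of 100 rows (represented here by their row indices):

```python
import random

# Hypothetical dataset of 100 rows, represented by indices into it.
random.seed(42)
indices = list(range(100))
random.shuffle(indices)

# Hold out 30% for testing; the rest is for training.
split = int(len(indices) * 0.7)
train_idx, test_idx = indices[:split], indices[split:]

assert not set(train_idx) & set(test_idx)  # no row appears in both sets
print(len(train_idx), len(test_idx))  # 70 30
```

The shuffle before splitting matters: slicing an unshuffled file would reproduce exactly the ordered-file error discussed later.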
Though the hold-out sample approach is a very good practice, it has two main weaknesses. First, if the training or testing sample is unrepresentative of the whole, the model, or its test score, can be skewed. Careful sampling is advised. Second, if multiple models are being developed and compared, the train/test division is no longer good enough. This happens because we use the training data to construct each model, then use the test data to compare them. By definition, we will pick the model that does best on the test data — it’s a relative comparison — but now that we’ve already peeked at that data, we’re out of new data and have no way to test the winner to see how well it will do on data it’s never seen. The model competition is actually part of training the winning model.
An important thing to note is that the term “multiple models” doesn’t simply mean very different models that use different techniques (like a random forest versus a neural network). If we have a logistic regression that we try with and without a particular predictor, or with and without taking the log of a variable, we have two different models.
Practitioners familiar with mixture models or mixture-of-experts ensembles might see an analogy where the final model can be viewed as a mixture of the competing models, with the mixture coefficients having a single 1 (for the winner) and 0’s for the models that lost. In this analogy, the “training” data is used to adapt the lower-level coefficients, and the “test” data is used to adapt the mixing coefficient, so we’re left with no held-out data with which to test the resulting model.
Seen this way, it is clear that when model selection is involved, we need three partitions: training, validation, and testing. Each model is trained on the training set, the models are compared with the validation (1) set, and the winner is tested on the test set to confirm how well it will perform on unseen data in the future. The concept of a three-way partition will be important to keep in mind as we move on to our discussion of cross validation: it still applies and can help our intuition as the process becomes more complex.
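The three-way partition can be sketched the same way; the 60/20/20 proportions here are an illustrative assumption, not a rule:

```python
import random

random.seed(3)

# Hypothetical dataset of 100 rows, represented by indices.
indices = list(range(100))
random.shuffle(indices)

# A 60/20/20 split: train each candidate model on `train`, compare the
# candidates on `validation`, then confirm the winner's score on `test`.
train, validation, test = indices[:60], indices[60:80], indices[80:]
print(len(train), len(validation), len(test))  # 60 20 20
```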
Cross validation (CV) takes the basic idea of a train/test partition and generalizes it into something more efficient and informative. In CV, we break the data up into K partitions and then, K times in turn, we select one partition for testing and use the remaining ones for training. It’s more “fair” in its use of the data than a single train/test partition, since each case is used once in testing (and K-1 times in training), and, of course, never in both sets at once.
The magic of cross validation is that it provides us with an accuracy distribution rather than a point estimate. With 10-fold CV we obtain 10 accuracy measurements, which allows us to estimate a central tendency and a spread. The spread is often a critical piece of information, especially when making comparisons or choices. Never accept a point estimate when you can get a distribution instead. This addresses our first problem with a hold-out sample – that it will not be representative – since the distribution gives a better expected value for the accuracy, as well as a confidence (e.g., standard deviation) for that estimate.
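To make the mechanics concrete, here is a plain-Python sketch of 10-fold CV on toy data, using a trivial majority-class “model” as a stand-in for any real classifier. Note that the folds are disjoint, every case is tested exactly once, and we end with a distribution of ten accuracy scores rather than a single number:

```python
import random
import statistics

random.seed(0)

# Toy labeled data: boolean labels, roughly 70% positive.
labels = [random.random() < 0.7 for _ in range(200)]

K = 10
indices = list(range(len(labels)))
random.shuffle(indices)
folds = [indices[i::K] for i in range(K)]  # K disjoint partitions

scores = []
for k in range(K):
    test_idx = folds[k]
    train_idx = [i for j in range(K) if j != k for i in folds[j]]
    # "Train": find the majority class in the training partition.
    majority = sum(labels[i] for i in train_idx) * 2 >= len(train_idx)
    # "Test": accuracy of predicting that class on the held-out fold.
    acc = sum(labels[i] == majority for i in test_idx) / len(test_idx)
    scores.append(acc)

print(f"mean accuracy: {statistics.mean(scores):.3f}")
print(f"std deviation: {statistics.stdev(scores):.3f}")
```

The mean and standard deviation of `scores` are exactly the central tendency and spread discussed above.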
Cross validation is so ubiquitous that it often requires only a single extra argument to a fitting function to invoke a random 10-fold cross validation automatically. This ease of use can lead to two different errors in our thinking about CV: failing to recognize when a single CV is not enough for the task at hand, and accepting default random partitions when purposeful partitioning is needed. Using CV properly and designing partitions strategically are both important for analytics practitioners to understand.
A chief confusion about CV is not recognizing the need for multiple, layered uses of it. For instance, one cross validation can be used within the model fitting process and a different cross validation loop used to actually select the winning model. When we use CV in the model fitting process, typically by turning on an option in our regression or classification function, its repeated splitting is equivalent to a train/test partitioning, except more informative and efficient. As we’ve seen, a two-way split like this is not sufficient for model selection, which requires a three-way split.
The CV equivalent to the training/validation/test split – that is, for model selection — requires that we run an inner CV that’s equivalent to the train/validation partition within an outer CV that’s equivalent to the validation/test partition. This is called Nested CV. If you hear someone say, “We selected our final model via CV,” it is essential to know whether they correctly used Nested CV or mistakenly used only an inner CV – a common mistake.
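A plain-Python sketch of Nested CV on toy data may clarify the structure. The two competing “models” here (a constant mean predictor and a constant median predictor) are hypothetical stand-ins for real candidates; the point is that the inner CV sees only the outer training data, so the outer test folds remain untouched by model selection:

```python
import random
import statistics

random.seed(1)

# Toy data: skewed target values. Candidate "models" predict a constant
# (mean or median of their training data); scoring is mean squared error.
ys = [random.expovariate(1.0) for _ in range(120)]

MODELS = {"mean": statistics.mean, "median": statistics.median}

def mse(train, test, fit):
    pred = fit(train)  # "train" the constant model
    return statistics.mean((y - pred) ** 2 for y in test)

def make_folds(items, k):
    return [items[i::k] for i in range(k)]

K = 5
outer = make_folds(ys, K)
outer_scores = []
for i in range(K):
    outer_test = outer[i]
    outer_train = [y for j in range(K) if j != i for y in outer[j]]

    # Inner CV: compare the candidates using the outer training data only.
    inner = make_folds(outer_train, K)
    def inner_score(name):
        fit = MODELS[name]
        return statistics.mean(
            mse([y for j in range(K) if j != m for y in inner[j]],
                inner[m], fit)
            for m in range(K)
        )
    winner = min(MODELS, key=inner_score)

    # Outer loop: score the selected model on data it never influenced.
    outer_scores.append(mse(outer_train, outer_test, MODELS[winner]))

print(f"nested-CV MSE estimate: {statistics.mean(outer_scores):.3f}")
```

Note that different outer folds may pick different winners; the outer scores estimate the performance of the whole selection procedure, not of any one model.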
Once we’re comfortable with Nested CV, we should enlarge our vision of what building a model really entails to ensure a thorough modeling process. If we select certain variables but leave out others, that variable selection is actually part of our model. After all, we will select those same variables when we run our model in production to score new data. If we perform transformations on our selected variables, this is also part of our model, whether it shows up in a model’s formula or not. Because of this, the outer CV should wrap around selection and transformation choices as well.
Choosing the Right Partitioning Strategy
The default in most analytics packages is to select CV partitions at random. This avoids certain partitioning errors — such as using the first portion of a file for training and the remainder for testing, which will be non-representative if the file is ordered — but a random result is not always the best answer. At times, it’s actually the wrong answer.
If we’re doing classification, we need to guarantee that each class is represented in both training and test data, even if some classes are small. Purely random partitioning can end up with partitions that do not contain a particular class. When those partitions serve as the training set, CV breaks down, as the test data will contain a hitherto unseen class. If small classes are an issue, we need to use stratified random sampling – randomly breaking each class up into the partitions separately so that each class is represented proportionally in each partition.
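A minimal sketch of stratified partitioning, assuming toy labels with one rare class: shuffle each class’s indices separately, then deal them round-robin so every fold receives its proportional share:

```python
import random
from collections import defaultdict

random.seed(2)

# Toy labels with rare classes: 90 of "a", 9 of "b", 6 of "c".
labels = ["a"] * 90 + ["b"] * 9 + ["c"] * 6
K = 3

# Group row indices by class.
by_class = defaultdict(list)
for i, lab in enumerate(labels):
    by_class[lab].append(i)

# Shuffle within each class, then deal round-robin into K folds.
folds = [[] for _ in range(K)]
for idx_list in by_class.values():
    random.shuffle(idx_list)
    for pos, i in enumerate(idx_list):
        folds[pos % K].append(i)

for k, fold in enumerate(folds):
    counts = {lab: sum(labels[i] == lab for i in fold) for lab in by_class}
    print(f"fold {k}: {counts}")
```

Every fold ends up with 30 “a”, 3 “b”, and 2 “c” cases, so no training set is ever missing a class.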
Similarly, if we’re doing time series modeling, training and testing on a willy-nilly mixture of dates will break the sequential nature of a time series. Going back to our analogy where training data represents the data we have and testing data represents data from the future, we need to make sure that when each partition is used for testing, only samples from earlier in time are used for training. One way to do this is to partition by year/month and then step through each partition, only using earlier partitions for training.
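A plain-Python sketch of this walk-forward scheme, assuming hypothetical monthly partitions:

```python
# Walk-forward CV for a monthly time series: partition by month, then
# test on each month using only earlier months for training.
months = ["2023-01", "2023-02", "2023-03", "2023-04", "2023-05", "2023-06"]

splits = []
for i in range(1, len(months)):
    train, test = months[:i], months[i]
    splits.append((train, test))
    print(f"train on {train} -> test on {test}")
```

Each test partition sees a strictly earlier training set, preserving the sequential nature of the series.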
This idea of purposefully partitioning is necessary for time series, but it can be illuminating in other cases as well. For example, we could partition our data by city to see how our model works for a new city we’ve never seen before, or on large cities if we’ve only seen small cities before. Different kinds of partitions ask different questions of our model, giving us a feel for how it may perform in varied circumstances; that knowledge is the key to not being unpleasantly surprised by our model’s performance in production.
Models that underperform our expectations break trust with our clients — even a good model is embarrassing if a great model was anticipated. One of the key techniques for avoiding the pain of an under-performing model is cross validation, properly applied.
When using CV, we must know when single CV or Nested CV is required. If your analytics tool of choice does not directly support Nested CV, extra effort will be needed. Further, we need to consider how large our model building process actually is so that we can appropriately wrap it in CV. Finally, random partitioning is a default for most CV commands; make sure it is appropriate for your case.
(1) Note that some fields use the term “test” to refer to the data used to compare models, and “validation” to refer to the final held out data that’s used to determine how well the winning model might perform on unseen data. Others use the terms reversed, as in this article.
References

[1] M. Baker, “1,500 scientists lift the lid on reproducibility,” Nature, no. 533, pp. 452-454, 2016.
[2] F. Prinz, T. Schlange, and K. Abdallah, “Believe it or not: how much can we rely on published data on potential drug targets,” Nature Reviews, no. 712.
[3] Open Science Collaboration, “Estimating the reproducibility of psychological science,” Science, vol. 349, no. 6251, 2015.
[4] M. Baker, “Over half of psychology studies fail reproducibility test,” Nature, 2015.
About the Author
Elder Research Data Scientist Wayne Folta enjoys diving into a new problem space, working with customers to clarify their needs and objectives, and using sophisticated tools and analysis to turn their data into insights. His major technical interests are time-series and predictive analytics. Previously, Wayne developed software in the intelligence community as a government employee and a contractor, and worked in the non-profit world as a video producer. Wayne volunteers in his community, teaching and providing statistical consulting for the local library and others.
Wayne earned a Master’s Degree in Computer Science (Artificial Intelligence) from George Mason University, and is also an alumnus of the University of Maryland and the Rochester Institute of Technology.
Wayne enjoys snowboarding, the ancient game of Go, and learning R packages and other statistical tools.