By Eric Siegel, Predictive Analytics World
Originally published in OR/MS Today
In this excerpt from the updated edition of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, Revised and Updated Edition, I show that, although data science and predictive analytics’ explosive popularity promises meteoric value, a common misapplication readily backfires. The number crunching only delivers if a fundamental—yet often omitted—failsafe is applied.
Prediction is booming. Data scientists have the “sexiest job of the 21st century” (as Professor Thomas Davenport and US Chief Data Scientist D.J. Patil declared in 2012). Fueled by the data tsunami, we’ve entered a golden age of predictive discoveries. A frenzy of analysis churns out a bonanza of colorful, valuable, and sometimes surprising insights:
Look like fun? Before you dive in, be warned: This spree of data exploration must be tamed with strict quality control. It’s easy to get it wrong, crash, and burn—or at least end up with egg on your face.
In 2012, a Seattle Times article led with an eye-catching predictive discovery: “An orange used car is least likely to be a lemon.” This insight came from a predictive analytics competition to detect which used cars are bad buys (lemons). While insights also emerged pertaining to other car attributes—such as make, model, year, trim level, and size—the apparent advantage of being orange caught the most attention. Responding to quizzical expressions, data wonks offered creative explanations, such as the idea that owners who select an unusual car color tend to have more of a “connection” to and take better care of their vehicle.
Examined alone, the “orange lemon” discovery appeared sound from a mathematical perspective. Here’s the specific result:
Well-established statistics appeared to back up this “colorful” discovery. A formal assessment indicated it was statistically significant, meaning that the chances were slim this pattern would have appeared only by random chance. It seemed safe to assume the finding was sound. To be more specific, a standard mathematical test indicated there was less than a 1% chance this trend would show up in the data if orange cars weren’t actually more reliable.
But something had gone terribly wrong. The “orange car” insight later proved inconclusive. The statistical test had been applied in a flawed manner; the press had run with the finding prematurely. As data gets bigger, so does a potential pitfall in the application of common, established statistical methods.
The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt.
Big data brings big potential—but also big danger. With more data, a unique pitfall often dupes even the brightest of data scientists. This hidden hazard can undermine the process that evaluates for statistical significance, the gold standard of scientific soundness. And what a hazard it is! A bogus discovery can spell disaster. You may buy an orange car—or undergo an ineffective medical procedure—for no good reason. As the aphorisms tell us, bad information is worse than no information at all; misplaced confidence is seldom found again.
This peril seems paradoxical. If data’s so valuable, why should we suffer from obtaining more and more of it? Statistics has long advised that having more examples is better. A longer list of cases provides the means to more scrupulously assess a trend. Can you imagine what the downside of more data might be? As you’ll see in a moment, it’s a thought-provoking, dramatic plot twist.
The fate of science—and sleeping well at night—depends on deterring the danger. The very notion of empirical discovery is at stake. To leverage the extraordinary opportunity of today’s data explosion, we need a surefire way to determine whether an observed trend is real, rather than a random artifact of the data. How can we reaffirm science’s trustworthy reputation?
Question that statistics can answer: If orange cars were actually no more reliable than used cars in general, what would be the probability that this strong a trend—depicting orange cars as more reliable—would show in data anyway, just by random chance?
With any discovery in data, there’s always some possibility we’ve been Fooled by Randomness, as Nassim Taleb titled his compelling book. The book reveals the dangerous tendency people have to subscribe to unfounded explanations for their own successes and failures, rather than correctly attributing many happenings to sheer randomness. The scientific antidote to this failing is probability, which Taleb affectionately dubs “a branch of applied skepticism.”
Statistics is the resource we rely on to gauge probability. It answers the orange car question above by calculating the probability that what’s been observed in data would occur randomly if orange cars actually held no advantage. The calculation takes data size into account—in this case, there were 72,983 used cars varying across 15 colors, of which 415 were orange.
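For readers who want to see the mechanics, here is a minimal sketch of that kind of calculation: a one-sided, pooled two-proportion z-test using only Python's standard library. The group sizes (415 orange cars out of 72,983) come from the text above. The lemon counts are hypothetical placeholders, not the competition's figures, so the printed p-value illustrates the method rather than reproducing the published result. The function and variable names are likewise made up for this example.

```python
import math

def one_sided_proportion_p_value(hits_a, n_a, hits_b, n_b):
    """P-value for the claim that group A's rate is lower than group B's.

    Uses the pooled two-proportion z-test with a normal approximation.
    The null hypothesis is that both groups share the same underlying rate.
    """
    rate_a = hits_a / n_a
    rate_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Lower-tail area of the standard normal distribution at z.
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Group sizes taken from the article.
TOTAL_CARS = 72_983
ORANGE_CARS = 415
OTHER_CARS = TOTAL_CARS - ORANGE_CARS

# HYPOTHETICAL lemon counts, chosen only to make the example run.
HYPOTHETICAL_ORANGE_LEMONS = 40
HYPOTHETICAL_OTHER_LEMONS = 9_000

example_p_value = one_sided_proportion_p_value(
    HYPOTHETICAL_ORANGE_LEMONS, ORANGE_CARS,
    HYPOTHETICAL_OTHER_LEMONS, OTHER_CARS,
)
print(f"one-sided p-value (hypothetical counts): {example_p_value:.4f}")
```

A small p-value here means only that a gap this large would be unlikely if orange cars truly held no advantage. It says nothing about how many other attributes were examined along the way.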
In China when you’re one in a million, there are 1,300 people just like you.
So if there had only been a 1% long shot that we’d be misled by randomness, what went wrong?
The experimenters’ mistake was to not account for running many small risks, which had added up to one big one…
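A rough way to see how small risks compound is sketched below. It assumes one test per car color (15, per the count given earlier), each run at a 1% false-alarm threshold, and it assumes the tests are independent. These are simplifying assumptions made for illustration. The excerpt does not spell out the author's own calculation, and the function name is made up for this example.

```python
def chance_of_any_false_alarm(num_tests, per_test_risk):
    """Probability that at least one of several independent tests
    raises a false alarm, when each has the same per-test risk."""
    chance_all_clean = (1 - per_test_risk) ** num_tests
    return 1 - chance_all_clean

NUM_COLORS = 15        # number of car colors mentioned in the article
PER_TEST_RISK = 0.01   # the 1% threshold discussed in the article

combined_risk = chance_of_any_false_alarm(NUM_COLORS, PER_TEST_RISK)
print(f"chance at least one color looks special by luck: {combined_risk:.1%}")
```

Under these assumptions the combined risk comes to roughly 14%, far above the 1% that any single test suggests, and it keeps growing as more attributes (make, model, year, and so on) are tested.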
Adapted with permission of the publisher from Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, Revised and Updated Edition (Wiley, January 2016) by Eric Siegel, Ph.D., who is the founder of the Predictive Analytics World conference series (cross-sector events), executive editor of The Predictive Analytics Times, and a former computer science professor at Columbia University.
 This discovery was also featured by The Huffington Post, The New York Times, National Public Radio, The Wall Street Journal, and the New York Times Bestseller Big Data: A Revolution That Will Transform How We Live, Work, and Think.
 The notion that orange cars have no advantage is called the null hypothesis. The probability the observed effect would occur in data if the null hypothesis were true is called the p-value. If the p-value is low enough—e.g., below 1% or 5%—then a researcher will typically reject the null hypothesis as too unlikely, and view this as support for the discovery, which is thereby considered statistically significant.
 The applicable statistical method is a 1-sided equality of proportions hypothesis test, which calculated the p-value as under 0.0068.