This article will make you feel better. And you do need to feel better, if you are one of the many of us who practice analytics—or who must consume and rely on analytics—and find ourselves carrying tension in our shoulders or sometimes losing sleep.
The fear stems from a well-known warning of tragic mishap: "If you torture the data long enough, it will confess," as stated by University of Chicago economics professor Ronald Coase. There is a general sense that math could be wrong and that analytics is an art.
As John Elder of Elder Research put it, "It's always possible to get lucky (or unlucky). When you mine data and find something, is it real, or chance?" How can we confidently trust what a computer claims to have learned? How do we avert the dire declension, "Lies, damned lies, and statistics"?
There is a simple, elegant solution from Elder—but first, let me further magnify your fear: Even the very simplest predictive model risks utter failure. Mistaken, misleading conclusions are in fact terribly easy to come by.
A conclusion drawn about one single variable—even without the use of a common multivariate model (such as log-linear regression)—can go awry. In fact, one of the more famous such analytical insights, "an orange used car is least likely to be a lemon," has recently been debunked by Elder and his colleague Ben Bullard at Elder Research, Inc.
Big data, with all its pomp and circumstance, can actually mean big risk. More data can present more opportunities to inadvertently discover untrue patterns that appear misleadingly strong within your dataset but, in fact, do not hold true in general. To be more specific, "bigger" data could mean longer data (a longer list of examples, which generally helps avert spurious conclusions), but also could mean wider data (more columns, meaning more variables or factors per example). With wider data, even if you are only considering one variable at a time, such as the color of each car, you are more likely to come across one that just happens to look predictive in your data by sheer chance alone. This peril that arises when searching across many variables has been dubbed "vast search" by John Elder.
Dr. Elder puts it this way: "Modern predictive analytic algorithms are hypothesis-generating machines, capable of testing millions of 'ideas.' The best result stumbled upon in its vast search has a much greater chance of being spurious… The problem is so widespread that it is the chief reason for a crisis in experimental science, where most journal results have been discovered to resist replication; that is, to be wrong!"
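The vast search effect is easy to reproduce with nothing but random numbers. The sketch below is an illustration, not Elder's own method: it builds a target column of pure noise, then searches a growing number of equally random candidate columns and reports the best R^2 found. The function names, the 11-row length (chosen to mirror an 11-year annual series), and the column counts are all assumptions made for the example.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def best_spurious_r2(n_rows, n_columns, rng):
    """Search n_columns of pure noise for the one that best 'explains' a noise target.

    Returns the largest R^2 (squared correlation) found. Every column is
    random, so any apparent relationship is chance by construction.
    """
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_columns):
        column = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, pearson_r(column, target) ** 2)
    return best

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical sizes: 11 rows, and ever wider sets of candidate variables.
for n_columns in (1, 10, 100, 1000):
    print(n_columns, round(best_spurious_r2(11, n_columns, rng), 2))
```

Because no column has any real relationship to the target, every high score is a false discovery. The best R^2 tends to climb as more columns are searched, which is exactly the risk that wider data brings.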
A few years ago, Berkeley Professor David Leinweber made waves with his discovery that the annual closing price of the S&P 500 stock market index could have been predicted from 1983 to 1993 by the rate of butter production in Bangladesh. Bangladesh's butter production mathematically explains 75 percent of the index's variation over that time. Urgent calls were placed to the Credibility Police, since it certainly cannot be believed that Bangladesh's butter is closely tied to the U.S. stock market. If its butter production boomed or went bust in any given year, how could it be reasonable to assume that U.S. stocks would follow suit? This stirred up the greatest fears of predictive analytics (PA) skeptics, and vindicated nonbelievers. Eyebrows were raised so vigorously, they catapulted Professor Leinweber onto national television.
Crackpot or legitimate educator? …
Read the full article by Eric Siegel on the Predictive Analytics Times
A different form of statistical analysis could prove beneficial, but I think the main thing to keep in mind is that data mining algorithms just show you what trends there are in the data, rather than prove anything concretely. If a trend is found in the data, that is the beginning rather than the end of the research. Once the trend is found, it is important to dig deeper to understand whether it is just an anomaly or whether there is a significant correlation.
For example, if a grocery store finds a trend in the data that suggests people frequently buy milk and apples together, it doesn't mean that there is a strong correlation between those two items. By digging deeper, you may find that people buy milk all the time, so milk shows up alongside almost any other item (in other words, there is no real trend at all). A quick check for this would be to try to explain why the trend exists. If no logical explanation can be made for why a trend exists, then there is a good chance it was just an anomaly.
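One way to put a number on the milk-and-apples check is lift, a standard market-basket measure: the observed rate of buying both items divided by the rate you would expect if the two purchases were independent. The sketch below uses made-up baskets and hypothetical helper names purely for illustration. A lift near 1 (or below it) means the pairing is explained by how popular each item is on its own.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def lift(transactions, a, b):
    """Observed co-purchase rate divided by the rate expected under independence."""
    expected = support(transactions, {a}) * support(transactions, {b})
    return support(transactions, {a, b}) / expected

# Made-up data: milk is in 9 of 10 baskets, apples in 5 of 10, both in 4 of 10.
baskets = [
    {"milk", "apples"}, {"milk", "apples"}, {"milk", "apples"}, {"milk", "apples"},
    {"milk", "bread"}, {"milk", "bread"}, {"milk", "eggs"}, {"milk", "eggs"},
    {"milk"},
    {"apples", "bread"},
]

# 4 of the 5 apple baskets also contain milk, which looks like a strong trend...
apples_with_milk = support(baskets, {"milk", "apples"}) / support(baskets, {"apples"})
# ...but lift compares that against how often milk is bought anyway.
milk_apple_lift = lift(baskets, "milk", "apples")
print(round(apples_with_milk, 2), round(milk_apple_lift, 2))
```

In this toy data, 80 percent of apple buyers also buy milk, yet milk is in 90 percent of all baskets, so the lift comes out below 1. The apparent trend is just milk being popular.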