Machine Learning Times

CONTINUE READING: Access the complete article in ASUGNews, where it was originally published.  

5 years ago
How Stubby Datasets Can Lead to Predictive Analytics Snafus


Here’s a question for PATIMES members interested in Big Data and predictive analytics to ponder: “When is bigger data more hazardous?” And, here’s an answer to puzzle over: “When it is wider.”

Eric Siegel, executive editor of Predictive Analytics Times and founder of the Predictive Analytics World conference series, posed both this question and answer to attendees of the Predictive Analytics World Business event during his keynote session in San Francisco on Monday. To dig into the dangers of bigger data that he’s referencing, we need to talk about “p-values.”


Anyone who has taken an entry-level university course on statistics has come across the term “p-value,” that perceived holy grail of statistical validity for separating studies with true takeaways from those still searching for answers. For the layperson, a p-value can be defined as the probability of seeing a result at least as strong as the one observed when there is no real effect behind it; in other words, how often chance alone would produce such a result.

In academia, it’s common for results with p-values of less than 0.05, or five percent, to be deemed credible for publication. So, if the p-value of an experiment is 0.05, a result that strong would turn up by chance only five percent of the time if no real effect existed.

But with the advent of Big Data, and the capability for analytics software to handle an ever-increasing number of variables at once, it’s becoming easier to get “tricked by randomness,” Siegel says.

So, why is a wider dataset more hazardous? A wider dataset means more columns, or variables. Typically, bigger is better when it comes to data, but that only applies to longer datasets, which have more recorded events for each variable, Siegel says. Adding variables can be good, since it allows more elaborate ways of explaining and predicting events. But every time a variable is added, another test is added, and the likelihood that at least one result looks significant purely by chance increases.

Siegel provides the example of a jackpot with a one-in-100 chance of winning. Each independent attempt has a one percent chance of winning. But after 70 attempts, the chance of winning at least once jumps to a near 50/50 proposition.
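The jackpot arithmetic can be checked with a few lines of Python. This is a minimal sketch, assuming the attempts are independent; the function name is made up for illustration.

```python
def chance_of_at_least_one_win(p_single: float, attempts: int) -> float:
    """Probability of winning at least once in `attempts` independent tries."""
    # The chance of losing every single time is (1 - p)^n,
    # so the chance of at least one win is the complement.
    return 1.0 - (1.0 - p_single) ** attempts


if __name__ == "__main__":
    for attempts in (1, 10, 70, 100):
        chance = chance_of_at_least_one_win(0.01, attempts)
        print(f"{attempts:>3} attempts: {chance:.3f}")
```

For 70 attempts at a one percent chance, the result comes out at roughly one half, which matches the near 50/50 proposition described above.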

In a real-world example, Siegel points to an Associated Press article from 2012, which reported that orange-colored cars are less likely to be “lemons” or defective beyond repair, based on a study that was conducted for a predictive analytics competition. On the face of it, that’s an absurd notion, but the author of the article attempts to explain why orange cars would indeed be more lemon-proof:

As for why orange used cars are most likely to be in good shape, the numbers did not hold the answer. One notion was that such a flashy color would only attract car fanatics who would be more likely to take care of their vehicles. That didn’t pan out, however, since the least well-kept used cars turned out to be purple.

Why go to such lengths to explain a result as silly as “orange cars are less likely to be lemons”? Because the p-value for that result was less than 0.01, meaning a result that strong would show up by chance less than one percent of the time if color had no real effect.

But here is where Siegel’s jackpot example comes into play. The study didn’t just test the lemon-ness of orange cars. It tested 15 different colors, meaning the chance that at least one color would look this significant at random was bumped up to over seven percent, a number that is not recognized as statistically significant.
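A small simulation can make the effect of testing 15 colors concrete. This is only a sketch under stated assumptions: each color test is treated as independent, and with no real effect each test's p-value is treated as uniformly random between 0 and 1. The article gives the orange-car p-value only as "less than 0.01," so the per-test thresholds below are illustrative, and the function name is hypothetical.

```python
import random


def simulate_false_alarm_rate(n_tests: int, threshold: float,
                              n_studies: int = 100_000, seed: int = 0) -> float:
    """Fraction of simulated no-effect studies in which at least one of
    `n_tests` p-values falls below `threshold`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Assumption: with no real effect, each p-value is uniform on [0, 1)
        # and the tests are independent of one another.
        if any(rng.random() < threshold for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for threshold in (0.01, 0.005):
        rate = simulate_false_alarm_rate(n_tests=15, threshold=threshold)
        print(f"per-test threshold {threshold}: false alarm rate {rate:.3f}")
```

Under these assumptions, a per-test level of about 0.005 across 15 tests gives an overall false alarm chance of a little over seven percent, in line with the figure quoted above. Real color tests on the same cars are not fully independent, so this is a rough guide rather than an exact reconstruction of the study.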

“Don’t take a statistically significant result immediately at face value,” Siegel says. “How many things did you also test that came up dry that you are not telling me about? That’s the context we need.”

How to Avoid an Orange Lemon Situation

So how do companies identify these results that appear to be statistically significant, but are much more likely to be random? One method that Siegel highlights is “target shuffling,” citing a paper called “Are Orange Cars Really Not Lemons?” from data science consulting firm Elder Research, which dubbed the problem of false positives from increased variables “vast search.”

In target shuffling, the outcome values are randomly reassigned among the records so that it is known there are no real connections between the variables and the outcome. So, for example, black car no. 1 is reassigned from “not a lemon” to “a lemon,” even though that isn’t the case in the real data. Then the same tests are run on this shuffled data, typically over many reshuffles, to see how often a false statistically significant result turns up.

This gives a company a point of reference for the validity of its own statistically significant result: if an equally strong result is found with shuffled data, then it is certainly possible that the result in the real dataset is random too.
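The shuffling idea can be sketched in Python as follows. This is an illustration under assumptions, not the procedure from the Elder Research paper: the "test" here is simply the largest gap between any one color's lemon rate and the overall lemon rate, and the names `best_gap` and `target_shuffle` are hypothetical.

```python
import random


def best_gap(colors, is_lemon):
    """Largest absolute gap between any one color's lemon rate and the overall rate."""
    overall = sum(is_lemon) / len(is_lemon)
    totals, lemons = {}, {}
    for color, outcome in zip(colors, is_lemon):
        totals[color] = totals.get(color, 0) + 1
        lemons[color] = lemons.get(color, 0) + outcome
    return max(abs(lemons[c] / totals[c] - overall) for c in totals)


def target_shuffle(colors, is_lemon, n_shuffles=1000, seed=0):
    """Share of shuffles whose best gap is at least as large as the real one."""
    rng = random.Random(seed)
    observed = best_gap(colors, is_lemon)
    shuffled = list(is_lemon)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Reassign outcomes at random, breaking any real link to color.
        rng.shuffle(shuffled)
        if best_gap(colors, shuffled) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


if __name__ == "__main__":
    # Made-up data: 15 colors, lemon status unrelated to color.
    rng = random.Random(1)
    colors = [f"color_{rng.randrange(15)}" for _ in range(2000)]
    is_lemon = [1 if rng.random() < 0.12 else 0 for _ in range(2000)]
    print(target_shuffle(colors, is_lemon))
```

A returned share near zero suggests the best real result is hard to reproduce by chance; a large share suggests that searching across many colors could easily have produced it at random.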

“When you account for vast search with methods such as target shuffling, you are holding [the result] to a higher standard,” Siegel says.

However, that doesn’t really solve the core issue for companies—they don’t just want to know whether their results are valid, they want actual applicable predictive analytics insight. How can they fight vast search in that regard? Siegel says the best way is to make datasets less stubby.

“The only way to accommodate wider data is to get longer data—more examples,” he explains.

In the orange lemon situation, this would mean evaluating a larger number of cars—but not adding in more variables, such as new colors.
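The trade-off between longer and wider data can be illustrated with a small simulation on pure noise. This is a hedged sketch, not anything from Siegel's talk: every column is a made-up yes/no variable with no real link to the outcome, and the function name is hypothetical.

```python
import random


def largest_spurious_gap(n_rows: int, n_columns: int, seed: int = 0) -> float:
    """Largest lemon-rate gap found among pure-noise yes/no columns."""
    rng = random.Random(seed)
    # Random outcome with no connection to any column.
    target = [rng.random() < 0.5 for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_columns):
        column = [rng.random() < 0.5 for _ in range(n_rows)]
        yes = [t for t, c in zip(target, column) if c]
        no = [t for t, c in zip(target, column) if not c]
        if yes and no:
            gap = abs(sum(yes) / len(yes) - sum(no) / len(no))
            best = max(best, gap)
    return best


if __name__ == "__main__":
    for n_rows in (100, 1_000, 10_000):
        for n_columns in (1, 15, 200):
            gap = largest_spurious_gap(n_rows, n_columns)
            print(f"rows={n_rows:>6} columns={n_columns:>3} largest gap={gap:.3f}")
```

In runs like this, the largest chance gap tends to grow as columns are added and shrink as rows are added, which is the pattern behind the advice to get longer data rather than wider data.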

Barriers for Statistical Rigor

I have a good friend who works as a statistician on clinical trials at a pharmaceutical company. He has a master’s degree in biostatistics, so the sort of rigor that would be used to fight vast search comes naturally to him. Unfortunately, that’s not the case for others in his company who aren’t statistically trained, but are tasked with performing analysis—a role that has been called “citizen data scientist”, among other things.

By: Craig Powers,
Originally published at
