Archive for December, 2016

December 28th 2016

Sound Data Science: Avoiding the Most Pernicious Prediction Pitfall


By Eric Siegel, Predictive Analytics World

Originally published in OR/MS Today

In this excerpt from the updated edition of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, Revised and Updated Edition, I show that, although data science and predictive analytics’ explosive popularity promises meteoric value, a common misapplication readily backfires. The number crunching only delivers if a fundamental—yet often omitted—failsafe is applied.

Prediction is booming. Data scientists have the “sexiest job of the 21st century” (as Professor Thomas Davenport and US Chief Data Scientist D.J. Patil declared in 2012). Fueled by the data tsunami, we’ve entered a golden age of predictive discoveries. A frenzy of analysis churns out a bonanza of colorful, valuable, and sometimes surprising insights:

• People who “like” curly fries on Facebook are more intelligent.

• Typing with proper capitalization indicates creditworthiness.

• Users of the Chrome and Firefox browsers make better employees.

• Men who skip breakfast are at greater risk for coronary heart disease.

• Credit card holders who go to the dentist are better credit risks.

• High-crime neighborhoods demand more Uber rides.

Look like fun? Before you dive in, be warned: This spree of data exploration must be tamed with strict quality control. It’s easy to get it wrong, crash, and burn—or at least end up with egg on your face.

In 2012, a Seattle Times article led with an eye-catching predictive discovery: “An orange used car is least likely to be a lemon.” This insight came from a predictive analytics competition to detect which used cars are bad buys (lemons). While insights also emerged pertaining to other car attributes—such as make, model, year, trim level, and size—the apparent advantage of being orange caught the most attention. Responding to quizzical expressions, data wonks offered creative explanations, such as the idea that owners who select an unusual car color tend to have more of a “connection” to and take better care of their vehicle.

Examined alone, the “orange lemon” discovery appeared sound from a mathematical perspective. Here’s the specific result: orange cars turn out to be lemons one third less often than average. Put another way, if you buy a car that’s not orange, you increase your risk by 50%.

Well-established statistics appeared to back up this “colorful” discovery. A formal assessment indicated it was statistically significant, meaning that the chances were slim this pattern would have appeared only by random chance. It seemed safe to assume the finding was sound. To be more specific, a standard mathematical test indicated there was less than a 1% chance this trend would show up in the data if orange cars weren’t actually more reliable.

But something had gone terribly wrong. The “orange car” insight later proved inconclusive. The statistical test had been applied in a flawed manner; the press had run with the finding prematurely. As data gets bigger, so does a potential pitfall in the application of common, established statistical methods.

The Little Gotcha of Big Data

The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt.

—Bertrand Russell

Big data brings big potential—but also big danger. With more data, a unique pitfall often dupes even the brightest of data scientists. This hidden hazard can undermine the process that evaluates for statistical significance, the gold standard of scientific soundness. And what a hazard it is! A bogus discovery can spell disaster. You may buy an orange car—or undergo an ineffective medical procedure—for no good reason. As the aphorisms tell us, bad information is worse than no information at all; misplaced confidence is seldom found again.

This peril seems paradoxical. If data’s so valuable, why should we suffer from obtaining more and more of it? Statistics has long advised that having more examples is better. A longer list of cases provides the means to more scrupulously assess a trend. Can you imagine what the downside of more data might be? As you’ll see in a moment, it’s a thought-provoking, dramatic plot twist.

The fate of science—and sleeping well at night—depends on deterring the danger. The very notion of empirical discovery is at stake. To leverage the extraordinary opportunity of today’s data explosion, we need a surefire way to determine whether an observed trend is real, rather than a random artifact of the data. How can we reaffirm science’s trustworthy reputation?

Statistics approaches this challenge in a very particular way. It tells us the chances the observed trend could randomly appear even if the effect were not real. That is, it answers this question:

Question that statistics can answer: If orange cars were actually no more reliable than used cars in general, what would be the probability that this strong a trend—depicting orange cars as more reliable—would show in data anyway, just by random chance?

With any discovery in data, there’s always some possibility we’ve been Fooled by Randomness, as Nassim Taleb titled his compelling book. The book reveals the dangerous tendency people have to subscribe to unfounded explanations for their own successes and failures, rather than correctly attributing many happenings to sheer randomness. The scientific antidote to this failing is probability, which Taleb affectionately dubs “a branch of applied skepticism.”

Statistics is the resource we rely on to gauge probability. It answers the orange car question above by calculating the probability that what’s been observed in data would occur randomly if orange cars actually held no advantage. The calculation takes data size into account—in this case, there were 72,983 used cars varying across 15 colors, of which 415 were orange.

Calculated answer to the question: Under 0.68%

Looks like a safe bet. Common practice considers this risk acceptably remote, low enough to at least tentatively believe the data. But don’t buy an orange car just yet—or write about the finding in a newspaper for that matter.
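The article doesn’t show the underlying lemon counts, but a one-sided binomial test of this kind can be sketched as follows. The overall lemon rate (12.3%) and the orange-lemon count (34) below are hypothetical figures, chosen only to be consistent with the “one third less often” claim; the 415 orange cars are from the article.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p): the chance of seeing at most
    k lemons if each car independently is a lemon with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n_orange = 415      # orange cars in the data set (from the article)
base_rate = 0.123   # hypothetical overall lemon rate
k_lemons = 34       # hypothetical orange-lemon count, roughly 1/3 below average

# One-sided question: if orange cars were no better than average,
# how likely is a lemon count this low, just by chance?
p_value = binom_cdf(k_lemons, n_orange, base_rate)
print(f"one-sided p-value: {p_value:.4f}")
```

With figures like these, the p-value comes out well under the conventional 5% threshold, which is why the finding looked safe when examined alone.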

What Went Wrong: Accumulating Risk

In China when you’re one in a million, there are 1,300 people just like you.

—Bill Gates

So if there had only been a 1% long shot that we’d be misled by randomness, what went wrong?

The experimenters’ mistake was to not account for running many small risks, which had added up to one big one…
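One way to see how those small risks accumulate: if each of the 15 colors is checked against the same 0.68% threshold, the chance that at least one color clears it by luck alone is roughly ten times larger. A minimal sketch, assuming the 15 tests are independent (an approximation; the actual analysis may have examined many more attributes):

```python
# Chance that at least one of k independent looks at the data
# produces a pattern this strong purely by chance.
p_single = 0.0068   # per-test risk from the article
k_colors = 15       # colors examined (from the article)

family_risk = 1 - (1 - p_single) ** k_colors
print(f"risk across {k_colors} colors: {family_risk:.1%}")  # → about 9.7%
```

This is the intuition behind standard multiple-testing corrections such as Bonferroni, which divide the significance threshold by the number of tests run.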

Click here to access the complete article as originally published in OR/MS Today

 

Adapted with permission of the publisher from Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, Revised and Updated Edition (Wiley, January 2016) by Eric Siegel, Ph.D., who is the founder of the Predictive Analytics World conference series (cross-sector events), executive editor of The Predictive Analytics Times, and a former computer science professor at Columbia University.


December 23rd 2016

Wise Practitioner – Predictive Analytics Interview Series: Ashish Bansal and John Schlerf from Capital One

By: Eric Siegel, Founder, Predictive Analytics World

In anticipation of their upcoming conference co-presentation, The Quest for Labeled Data: Integrating Human Steps, at Predictive Analytics World San Francisco, May 14-18, 2017, we asked Ashish Bansal, Senior Director, Data Science, and John Schlerf, Data Scientist at Capital One, a few questions about their work in predictive analytics.

Q: In your work with predictive analytics, what behavior or outcome do your models predict?

A: These models are used to match credit card transactions to augmented merchant data to improve readability of online credit card statements.

Q: How does predictive analytics deliver value for your customers – what is one specific way in which it actively improves operational outcomes?

A: Customers expect their bank to know the merchants where they shop. By expanding that knowledge beyond just the merchant name, we can improve our customers’ brand perception and loyalty. This also lowers operational call center costs, as customers can access more details about where they made a purchase.

Q: Can you describe a quantitative result, such as the predictive lift of your model or the ROI of an analytics initiative?

A: We increased our library of known restaurants by over 10% in less than 1 week.
 
Q: What surprising discovery or insight have you unearthed in your data?
 
A: We were quite surprised by the vibrancy of the communities of micro-workers. They take real pride in their work, and deliver very high quality output.

Q: Sneak preview: Please tell us a take-away that you will provide during your talk at Predictive Analytics World.

A: Human judgement is often the best source of labeled data for training and testing complicated models. Collecting such data at scale can be daunting. This talk focuses on building automated data pipelines that integrate manual labeling steps.

———————

Don't miss Ashish and John’s conference co-presentation, The Quest for Labeled Data: Integrating Human Steps, on Wednesday, May 17, 2017, from 11:15 am to 12:00 pm at Predictive Analytics World San Francisco. Click here to register to attend.


December 19th 2016

Wise Practitioner – Predictive Analytics Interview Series: Kristina Pototska at TriggMine

By: Eric Siegel, Founder, Predictive Analytics World

In anticipation of her upcoming conference presentation, 7 Examples of Customer Retention with Predictive Email Marketing at Predictive Analytics World San Francisco, May 14-18, 2017, we asked Kristina Pototska, CMO at TriggMine, a few questions about her work in predictive analytics.

Q: In your work with predictive analytics, what behavior or outcomes did your models predict?

A: We launched predictive analytics for eCommerce websites with the goal of dramatically increasing customer retention. We aimed for a specific target audience, boosted engagement, reduced turnover rates, drove more conversions and increased revenue for every email campaign our clients ran.

Q: How does predictive analytics increase profits? Name one specific way in which it directly drives customer decisions.

A: Our company provides email marketing automation for eCommerce, which isn't really new to the market, but there is always room for improvement. That's why we started with predictive analytics, which we knew would deliver greater personalization and increased revenue for every email campaign we send. Now, predictive analytics is one of our standout features, the one that offers the best value for our customers.

Q: Can you describe a quantitative result, such as your model's predictive increase or ROI from an analytics campaign?

A: The first email campaign we optimized using predictive analytics was abandoned cart recovery. We had a 100% conversion increase in the first month alone — same for the click rate.

Following that, we implemented predictive technologies in our other email campaigns and saw the same fantastic results. Now, our customers are seeing an average ROI hovering at about 2000%.

Q: What surprising discovery have you found in your data?

A: The first discovery for me was the amount of data we already had. We'd been collecting it over a three-year period but never touched it. However, when we started to analyze that data, we were suddenly able to create in-depth customer profiles, including portraits of and predictions for the lifetime value of each customer, their click rates, as well as their likelihood of purchasing or abandoning their carts.

That's when I discovered the incredible power of predictive technology.

Q: Sneak preview: Please give us a take-away you'll later (also) provide during your talk at Predictive Analytics World.

A: Attendees will be able to implement our proven technology, already tested on eCommerce websites, to send more personalized, behavior-based emails; create in-depth customer profiles and lookalike portraits; present better offers that customers are more likely to click; decrease abandonment; and increase revenue with every email they send.

My presentation offers case-studies that will show attendees well-tested methods for vastly improving their email marketing — even if they don't think they need it.

———————

Don't miss Kristina’s conference presentation, 7 Examples of Customer Retention with Predictive Email Marketing, on Tuesday, May 16, 2017, from 11:20 to 11:40 am at Predictive Analytics World San Francisco. Click here to register to attend.


December 16th 2016

Wise Practitioner – Predictive Analytics Interview Series: Frederick Guillot at The Co-operators General Insurance Company

By: Eric Siegel, Founder, Predictive Analytics World

In anticipation of his upcoming conference presentation, Defining Optimal Segmentation Territories – 10 Years of Research at Predictive Analytics World San Francisco, May 14-18, 2017, we asked Frédérick Guillot, Senior Manager, Research and Innovation at The Co-operators General Insurance Company, a few questions about his work in predictive analytics.

Q: In your work with predictive analytics, what behavior or outcome do your models predict?

A: At the Co-operators, we leverage predictive modeling for many purposes. We use more traditional models to quantify the insurance risk of our clients. We also leverage more advanced techniques, such as propensity models, to maximize client engagement.

Q: How does predictive analytics deliver value at your organization – what is one specific way in which it actively drives decisions or operations?

A: Most pricing decisions regarding our insurance products are backed by predictive modeling. Predictive models also allow us to be more agile in many operational areas such as in the buying process, or the settlement of claims.

Q: Can you describe a quantitative result, such as the predictive lift of your model or the ROI of an analytics initiative?

A: Mixing traditional predictive modeling with geospatial analytics and Big Data is a very fertile research area for insurers. We have spent more than 10 years refining the way we define our segmentation territories, and each improvement translates into tangible benefits for the organization. Over time, we managed to triple our territories’ homogeneity.

Q: What surprising discovery or insight have you unearthed in your data?

A: From the perspective of creating segmentation territories, I definitely think that the increasing availability of geospatial open data now enables many more opportunities to improve models and get more accurate predictions of insurance risk.

Q: Sneak preview: Please tell us a take-away that you will provide during your talk at Predictive Analytics World.

A: If you are not a little shy about a new predictive modeling application you are launching within your organization, you probably waited too long before releasing it. The territory analysis I will present in my talk is a good illustration of the minimum viable product mindset, in which it is better to launch fast and iterate over time to grow your model's accuracy.

———————

Don't miss Frédérick’s conference presentation, Defining Optimal Segmentation Territories – 10 Years of Research on Wednesday, May 17, 2017 from 11:40 am to 12:00 pm at Predictive Analytics World San Francisco. Click here to register to attend.
