By Eric Siegel, Predictive Analytics World


Watch the latest episode of The Dr. Data Show, which reveals five bizarre insights found in data.


About the Dr. Data Show. This new web series breaks the mold for data science infotainment, captivating the planet with short webisodes that cover the very best of machine learning and predictive analytics.

Click here for prior episodes and to sign up for future episodes of The Dr. Data Show

Full Transcript of This Episode

Please note that viewing the video (above) is recommended, since it includes complementary visuals. Also, since some vocal inflections and gesticulations express meaning, some of the intented content is lost by reading this transcript rather watching the video.

Welcome to “The Dr. Data Show”! I’m Eric Siegel.

It’s weird: Vegetarians miss fewer flights.

It’s wacky: People who like “curly fries” on Facebook are more intelligent.

You’re traveling through a dimension whose boundaries are that of imagination. Well, it’s not really The Twilight Zone, but let’s call it “The ‘Freakonomics’ of big data,” or “The Ripley’s Believe It or Not of data science.”

We live in a weird and wacky world, full of bizarre and surprising connections, and these connections are reflected within all those tons of data constantly being collected. That’s what makes data the world’s most potent, flourishing “unnatural resource.” And so the world has enthusiastically dived into a golden age of data discoveries. A frenzy of number-crunching is churning out a heap of insights that are colorful, sometimes surprising, and often valuable.

So, I’m gonna explain how this works, and investigate five bizarre discoveries found in data.

Now, the whole point of computers is to have things done automatically — so why not automate scientific discovery? Well, that is, in fact, often what’s achieved by machine learning, when computers learn from the experience encoded in data. You can think of the learning process in two phases:

Number one, find individual insights like the one about vegetarians.

Number two, combine the insights together to improve their capacity to predict — this is also called “predictive modeling,” since you’re combining them together into a single model.

Now, that’s actually a little simplified, ’cause machine learning often interweaves these two phases, but we can think of the first one, the hunt for insights, as a foundational part of the process.

Having computers make such discoveries is the very automation of scientific research. This is a major paradigm shift that upends the traditional scientific method, which is to form a hypothesis and then test it. For example, an airline might speculate that passengers who request a vegetarian meal end up missing their flight less often. Then they’d examine data in order to test this hypothesis.

By the way, the reason why this is the case for vegetarians is a separate question, which I’ll address in a couple minutes — the first question is simply whether or not this little theory holds true.

But, you know, there are so many trends like that you could check for. How about passengers who prefer an aisle seat rather than a window, or passengers traveling from certain cities, etc.? Perhaps those groups are also less likely to miss their flight. We humans are pretty smart, but, instead of sitting around conjecturing on such things, why not just unleash the computer to search through a whole bunch of possible trends to see which turn out to be true? Use your CPU as a hypothesis-testing machine, a “robot scientist.” By hunting tirelessly and robotically, no stone is left unturned. The technical terms for doing this include feature selection, or “trying each independent variable one at a time as a univariate model.” But whatever you call it, this kind of search process reaches beyond the more limited range of ideas that originate from human hunches. Allow it to do its number-crunching thing and the machine will spit out valuable discoveries — which sometimes turn out to be unexpected and may seem to defy logic.

Now, before you get too excited about a potential “robot scientist,” here’s an important warning. When you push the “go” button, cranking up the open-ended, massive search for scientific discoveries, it can backfire and get you into major trouble by drawing false conclusions. It might find apparent trends within the data that don’t actually hold true in general. The technical word for this pitfall is p-hacking. Actually, you can never ever ever be literally 100% conclusive about a connection found in data — there’s always at least some remote chance the data fooled you, like even if you flip a coin 100 times and it comes up heads every time, there’s still some very remote chance it’s just a normal coin and you had a crazy long streak of heads. But, with precautions that properly check the conclusions drawn from data — which apply a high standard of statistical rigor — we can reduce the odds of being mislead by data down to an acceptably remote long-shot.

Alright, let’s see what our robot scientist came up with. Here are several connections found in data. Each one helps predict things — like who will miss a flight, who is more intelligent, and who will prove to be more financially creditworthy — and so these serve as foundational building blocks with which machine learning builds predictive models.

I’ll first cover two typical examples that are pretty interesting and are useful in the commercial world, and then five truly kooky connections, for a total of seven.

Number one, if you buy diapers, you’re more likely to also buy beer. A pharmacy chain found this trend among evening-time shoppers, across dozens of outlets. Some have offered as an explanation, “Daddy needs a beer.” Now, although I’ve heard people presume this classic diapers/beer data mining example is an urban legend, you can actually find the details about this and all the examples I’m covering in the Notes for my book, “Predictive Analytics.” The Notes are available for free at By the way, the book itself includes a grand total of 46 bizarre discoveries, all compiled into one single table, with examples from the likes of Walmart, Shell, Microsoft, and Wikipedia.

Now on to number two. If your friend switches cell phone companies, you’re more likely to do so yourself. A major North American carrier discovered that a customer is seven times more likely to cancel if someone in the person’s calling network cancels. Birds of a feather really do flock together.

Number three, vegetarians miss fewer flights. Well, to be a bit more precise, an airline found that passengers who’ve preordered a vegetarian meal are less likely to miss their flight. Why is that so? I don’t know. To be completely honest, the title of this episode of The Dr. Data Show, “Why Vegetarians Miss Fewer Flights,” has a slight case of the click-bates, ’cause the fact is, nobody knows for sure. The researchers have suggested it could be because the knowledge of a personalized or specific meal awaiting the customer establishes a sense of commitment. But we actually can’t conclusively answer the “why” for most of these insights or discoveries, at least not without further research. That’s what’s meant by the often-heard adage, “Correlation does not entail causation.” Each of these discoveries, trends, or links in the data — whatever you wanna call them — are a correlation, and any explanation as to why they’re true would involve understanding causation. When analyzing the data that businesses accumulate — that is, typical “big data,” rather than analyzing data collected specifically for experiments or scientific inquiry — we often only get the “what” but not the “why.” However, it still helps predict — an airline can still use this discovery to help calculate, for example, how overbooked a flight is likely to be, even without understanding why it’s true.

Number four, people who like “Curly Fries” on Facebook are more intelligent. Researchers at the University of Cambridge and Microsoft found that liking “Curly Fries” is predictive of high scores on a certain test designed to measure intelligence. The researchers don’t think it’s necessarily because you have to be smart to realize how great curly fries are — rather they suspect that some intelligent person was the first to like the “Curly Fries” Facebook page, and his or her friends saw it and thus seeing and liking the page spread through a network of relatively smart people.

By the way, regarding the “more intelligent” part, I’m personally pretty skeptical about the paring down of human intelligence to a single number, but so-called intelligence tests probably do measure how good you are at some valuable group of skills that cover some but certainly not all of what we mean by “intelligence.” Just saying.

Number five, men who skip breakfast have a greater risk of coronary heart disease. Harvard University medical researchers found that men 45 to 82 who skip breakfast have a 27% higher risk. This isn’t necessarily because of any direct health effects of the meal itself, but rather because eating breakfast may be a proxy for lifestyle. People living a high-paced, more stressful life are more likely to skip breakfast and are also subjected to a higher health risk. Again, that’s largely just an intuitive hunch — as usual, there could also be other plausible explanations.

Number six, neighborhoods in San Francisco that exhibit higher rates of certain crime also have a higher demand for Uber rides. This is not necessarily because criminals are taking Ubers, but rather, as Uber has postulated, because the crime is a proxy for non-residential population — those happen to also be the areas where there are more people who don’t actually live there.

And finally, number seven, people who go to bars are a higher credit risk. Canadian Tire issues credit cards and looks at how their customers use their card relative to their on-time bill payments. It turns out that people observed spending at a drinking establishment are more likely than average to miss repeated credit card bill payments. However, if you spend with your card at the dentist, you’re a lower credit risk, less likely to miss bill payments. And if you buy those little felt pads that keep the legs of your chair from scratching the floor… lower credit risk, you’re a more reliable bill payer.

And in a related story, typing with proper capitalization corresponds with creditworthiness, according to a financial services startup company that analyzed how online loan applications were filled out.

Now, the reasoning behind these trends may seem intuitive and self-evident, as far as the kinds of personalities and how organized people are, but, again, remember to take such interpretations and hunches with a huge grain of salt. This “freak show” of surprising discoveries delivers predictive value, but does little to explain itself, scientifically speaking.

I’m Eric Siegel; thanks for watching. Hit “like” and share this video if you think your friends would also be interested in vegetarians, curly fries, and the automation of scientific research. And for access to the entire web series, go to

Excerpt of “Predict This!” rap during closing credits:

Who’s your data?

Provide me the data to improve
and I’ll apply the computation.

Predictive analytics can help you with decisions;
you can call, mail, credit, or hire with precision.
On law, love, and life, you can prognosticate
whom to investigate, incarcerate, set up on a date, or medicate.

Charlie Brown never gets his kicks;
that’s why every old dog needs a brand new trick.
If you get sick of chasing sticks or clicks with just a quick fix,
you need to learn to predict.

I can predict your every move;
just gimme all your information.
Who’s your data?
Provide me the data to improve
and I’ll apply the computation.

I love it when you call me big data.

To receive notifications of new webisodes of The Dr. Data Show as they are released, register for The Predictive Analytics Times – the machine learning professionals’ premier resource.

Please note that viewing the video (above) is recommended, since it includes complementary visuals. Also, since some vocal inflections and gesticulations express meaning, some of the intented content is lost by reading this transcript rather watching the video.