Watch the latest episode of The Dr. Data Show, which answers the question, “What the hell do data science and big data really mean?”
About the Dr. Data Show. This new web series breaks the mold for data science infotainment, captivating the planet with short webisodes that cover the very best of machine learning and predictive analytics.
Full Transcript of This Episode
Please note that viewing the video (above) is recommended, since it includes complementary visuals. Also, certain vocal inflections and gesticulations hold meaning. Some of the intended meaning is lost by reading this transcript rather than watching the video.
Welcome to “The Dr. Data Show”! I’m Eric Siegel.
“Data science.” “Big data.” What the hell do these buzzwords really, specifically mean? Are they just cockamamie — intentionally vague jargon that overhypes and overpromises? Or are these terms actually helpful — do they somehow designate, like, the most profound impact of the Information Age? Well, I’ll start with the vague and overhyping side and then circle back to why these buzzwords may matter after all. It’s time for the Dr. Data buzzword smackdown.
There are a lotta problems with these words.
First, “data scientist” is redundant. It’s like calling a librarian a “book librarian.” If you’re doing science, it involves data. Duh!
Furthermore, don’t tell anyone I said this, but real sciences like physics and chemistry don’t have “science” in their name. Your science is trying too hard if it has to call itself a science: Social science, political science, data science, and I gotta say — even though I have three degrees in it and was a professor of it — computer science is an arbitrarily defined field. It’s just the amalgam of everything to do with computers — as a concept and as an appliance — from the engineering of how to build them and the deep mathematics about their theoretical limitations to how to make them more user friendly, and even business strategies for managing a team of programmers…
Universities might as well also have a “toaster science” department, which covers the engineering of better toasters as well as the culinary arts on how to best cook with them.
But I digress. Okay, next buzzword: “Big data.” First of all, it’s just grammatically incorrect. It’s like looking at the Pacific Ocean and saying “big water.” It should be “a lotta data” or “plenty of data.”
But the real problem with “big data” is that it emphasizes the size. ‘Cause what’s exciting about data isn’t how much of it there is per se — it’s about how quickly it’s growing — which is amazing by the way. There’s always so much more data today than there was yesterday. So we’re gonna run out of adjectives really quickly: “big data,” “bigger data,” “even bigger data,” “the biggest data.” Actually, there’s been a long-running conference called the International Conference on Very Large Databases since 1975. I’m not joking. That’s before the first Star Wars movie came out!
Now, in some cases, people use the terms data science and big data just to refer to machine learning, i.e., when computers learn from the experience encoded in data. That’s the topic of most episodes of this program, The Dr. Data Show. It’s a show about machine learning — which is a well-defined field and by the way is also often called predictive analytics, especially when you’re talking about its deployment in the private or public sector. I would urge folks to use the well-defined terms machine learning or predictive analytics if in fact that’s what you’re specifically talking about.
But as for data science and big data, in their general usage they suffer from a terrible case of vagueness. They have a wide range of subjective definitions, which compete and conflict. Basically, they’re often used to mean nothing more specific than “some clever use of data.” The terms don’t necessarily refer to any particular technology, method, or value proposition. They’re just plain subjective; you can use them to mean whichever technology you’d like: machine learning, data visualization, or even just basic reporting.
But much worse than that, this vagueness often serves to mislead and misrepresent by alluding to capabilities that don’t exist. For example, the popular press, as well as certain analytics vendors, sometimes use “data science” to denote some whole collection of methods that includes machine learning as well as some other advanced methods. The problem is, those other advanced methods are implied but often just don’t exist. They’re vaporware. This confusion is sometimes inadvertent, such as when journalists aren’t fully knowledgeable of the topic yet want it to sound as powerful as possible, but, either way, the end result is souped-up hype that over-promises and circulates misinformation.
All these issues, by the way, also apply to the older-school term “data mining,” also totally subjective. Besides, calling it “data mining” is like instead of “gold mining,” saying “dirt mining.” Malfunction, failed analogy… ‘Cause we aren’t searching for data, we’re searching within data.
So now you’re probably asking yourself, “How could Dr. Data come down so hard on these words if he loves data so much?” Well, no, Dr. Data doesn’t hate these words, only the misleading ways in which they’re often used. Dr. Data’s love for data is fully intact. After all, he named himself after it. Anyway, let’s talk data for a moment.
These buzzwords are all “data this” and “data that” — so what exactly is all the fuss about data? I mean, most people couldn’t be less interested in data. The non-geeks out there think it’s the driest, most boring word ever. The word “data” is a deal-killer at cocktail parties. I know from personal experience. I have the data.
And data just grows like a weed anyway. It’s so indiscriminately collected and warehoused, like some bland, uninteresting residue that companies dump into the cloud as they transactionally churn away endlessly.
But, no! That’s wrong! Let me make a correction. It isn’t indiscriminate. The stuff logged into all these memory banks are exactly the things that matter. That’s why they’re being recorded. People think data’s boring because they’re overlooking the fact that data is experience — it’s a long list of prior events from which it’s possible to analytically learn.
In fact, we could say that data is powerful and all-encompassing for the very same reason that it’s misconstrued as boring, which is that it’s very abstract. Data can mean anything and everything. In its most abstract, it means nothing in particular, but in the particular, it always means something valuable and interesting.
Every medical diagnosis, medical procedure, credit application, phone call, Facebook post, movie viewing, ad click, fraudulent transaction, spammy e-mail, traffic camera passed, flight taken, earthquake, purchase, successful or failed sales call, each positive and negative outcome of any significance is encoded as data somewhere.
There are “quintillions and quintillions” of bytes. That’s my Carl Sagan impersonation. Data grows by an estimated 2.5 quintillion bytes per day. A quintillion is a 1 with 18 zeros after it.
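To put that figure in more familiar units, here is a minimal Python sketch. It takes the estimated 2.5 quintillion bytes per day quoted above at face value. The constant and function names are illustrative and not from the show, and the per-second rate assumes, purely for illustration, that growth is spread evenly across the day.

```python
# Figures taken from the transcript: data grows by an estimated
# 2.5 quintillion bytes per day, and a quintillion is a 1 followed by 18 zeros.
QUINTILLION = 10**18
BYTES_PER_DAY = 25 * QUINTILLION // 10  # 2.5 quintillion, kept as an exact integer

EXABYTE = 10**18          # SI decimal prefix "exa" = 10^18
TERABYTE = 10**12         # SI decimal prefix "tera" = 10^12
SECONDS_PER_DAY = 24 * 60 * 60


def daily_growth_in_exabytes(bytes_per_day: int = BYTES_PER_DAY) -> float:
    """Express the daily growth estimate in exabytes."""
    return bytes_per_day / EXABYTE


def bytes_per_second(bytes_per_day: int = BYTES_PER_DAY) -> float:
    """Average growth rate, assuming (hypothetically) even growth all day."""
    return bytes_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{BYTES_PER_DAY:,} bytes per day")
    print(f"{daily_growth_in_exabytes():.1f} exabytes per day")
    print(f"about {bytes_per_second() / TERABYTE:.1f} terabytes per second")
```

In other words, one quintillion bytes is one exabyte, so the quoted estimate works out to 2.5 exabytes per day.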
And here’s the big win: We can improve everything with this data. All the main functions and day-to-day operational decisions of companies and governments are exactly what these data streams are recording. Therefore, data records exactly the right, relevant experience to apply predictive analytics where it’s needed most. We have just the right data for this technology to learn how to streamline the major operations behind financial risk management, fraud detection, marketing, law enforcement, healthcare, and manufacturing. Boom!
This is major. We’re witnessing an epic, fundamental shift in how technology integrates with, alters, and improves society and its functions.
And so data isn’t the most boring after all. In fact, it’s the most… sexy? The Harvard Business Review declared data scientist “the sexiest job of the 21st century.” I mean, really? Data people are the most sexy? That’s great news! Geek is the new chic. It’s hip to be square. You know, I had always assumed the sexiest profession was firefighters. But who knows… Maybe it’s just the hard hat. This is a picture of me dressed up as a data miner for Halloween…
Actually, the New York City Fire Department uses predictive analytics to triage and prioritize the inspections of buildings with the highest risk of fire. Yet another priceless application of machine learning.
Anyway, we actually produced a rap music video about predictive analytics and how being a data geek affects your social life. It’s the best ever educational predictive analytics rap music video ever created ever, period. And also the only one. Just three and a half minutes long, you can check it out at PredictThis.org.
In conclusion, there’s a lot to be excited about when it comes to the data explosion and what we can do with it. The buzzwords are kinda inane when viewed up close — perhaps an equally appropriate and less misleading buzzword for all this would be “datapalooza” — but, in any case, the terms really allude to a culture of smart people doing creative things to make value of all this data. Today’s totally historical advent of having data about everything and using data for everything is mind-blowingly profound and important.
I’m Eric Siegel; thanks for watching. Hit “like” and share this video if you think your friends were also wondering what the hell “data science” and “big data” really mean. And for access to the entire web series, go to TheDoctorDataShow.com.
Excerpt of “Predict This!” rap during closing credits:
Who’s your data?
Provide me the data to improve
and I’ll apply the computation.
Predictive analytics can help you with decisions;
you can call, mail, credit, or hire with precision.
On law, love, and life, you can prognosticate
whom to investigate, incarcerate, set up on a date, or medicate.
Charlie Brown never gets his kicks;
that’s why every old dog needs a brand new trick.
If you get sick of chasing sticks or clicks with just a quick fix,
you need to learn to predict.
I can predict your every move;
just gimme all your information.
Who’s your data?
Provide me the data to improve
and I’ll apply the computation.
I love it when you call me big data.
To receive notifications of new webisodes of The Dr. Data Show as they are released, register for The Predictive Analytics Times – the machine learning professionals’ premier resource.
About the Author
Eric Siegel, Ph.D., founder of the Predictive Analytics World and Deep Learning World conference series and executive editor of The Predictive Analytics Times, makes the how and why of predictive analytics (aka machine learning) understandable and captivating. He is the author of the award-winning Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, the host of The Dr. Data Show web series, a former Columbia University professor, and a renowned speaker, educator, and leader in the field. Read also his articles on data and social justice and follow him at @predictanalytic.