The problem of monetizing Big Data, or improving its usefulness in decision-making, stands out in every forward-looking organization today. To solve it, as everyone knows by now, it is not enough to store or manage the data. We must analyze it. Hence the term (what else?): “Big Data Analytics” (BDA).
Yet in spite of the overwhelming need for BDA, without which Big Data is just more data, “Computer Science-IT Data Science” (CSIT DS) focuses the great majority of its effort on data storage and management. Most of its work could more properly be described as Big Data Management (BDM).
BDA is nevertheless generally thought to play a lead role in CSIT DS. But only a relatively small number of globe-spanning organizations like Amazon (social media firms, CERN, Citibank, and the NSA also come to mind) have the resources, the ability, and, most importantly, the pragmatic interest or organizational mission to produce valid predictive analyses from counting, sorting, classifying and cross-classifying, tabulating and cross-tabulating, or clustering on an almost astronomically large number of plausible explanatory and descriptive vectors within and across many data matrices.
Moreover, among the thousands of firms just beginning to execute predictive analytics projects, there are rumblings from grassroots “industry sources” who say that CSIT DS BDA, so far, is not delivering the analytical goods. Are these people just churning their stock, or do their stories reflect the actual extent to which CSIT DS tools and techniques have had difficulty producing, in that overused term, “actionable information”? That information is the desideratum of predictive analytics theory and practice.
From the mash-up of contradictions internal to the Big Data wave, one sticks out: it simultaneously covers us in an immensity of data and surrounds us with a desert of useable information we could have or should be refining from the raw data. What’s worse is that CSIT DS BDA, in its descriptive and exploratory modes (the latter deserves an entire blog entry, so I won’t discuss it here), generates more data about the data without reducing the amount the analyst works with at each successive step of the investigation, that is, without “narrowing down” to the relevant data and setting aside the rest.
Well, how do you do that? It’s the main question in dealing with Big Data, or a dataset of any size, in any analytic effort. It’s not by letting the data “speak,” because at any level of disaggregation, across any numbers and types of transformations, and for all types of data, CSIT DS Descriptive BDA already can say many things. It can tell you how much, how many, how often and when, to name the first few among many thousands—actually thousands of thousands—of plausibly descriptive vectors in a big dataset. Descriptive BDA can depict these on tables, charts, and graphs, or use any other visualization technique you care to name. It also tabulates, sorts, filters, and graphs nominal data, recorded on many kinds and very large numbers of evaluation or survey instruments, scored on a variety of scales.
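To make that point concrete, here is a minimal sketch in pure Python, with invented toy data and field names, of how descriptive summaries multiply: every combination of grouping dimensions yields yet another table of counts, while the underlying dataset is never reduced at all.

```python
from collections import Counter
from itertools import combinations

# Toy transaction records; fields and values are invented for illustration.
transactions = [
    {"region": "east", "product": "A", "channel": "web"},
    {"region": "east", "product": "B", "channel": "store"},
    {"region": "west", "product": "A", "channel": "web"},
    {"region": "west", "product": "A", "channel": "store"},
]

fields = ["region", "product", "channel"]

# Descriptive BDA in miniature: tabulate and cross-tabulate every
# combination of dimensions. The number of summary tables grows
# combinatorially with the number of dimensions, yet the underlying
# data are never narrowed down.
summaries = {}
for r in range(1, len(fields) + 1):
    for dims in combinations(fields, r):
        key = " x ".join(dims)
        summaries[key] = Counter(tuple(t[d] for d in dims) for t in transactions)

print(len(summaries))       # 7 summary tables from only 3 dimensions
print(summaries["region"])  # counts by region, e.g. 2 east, 2 west
```

With three dimensions there are already seven tabulations; a real dataset with hundreds of columns produces the “thousands of thousands” of descriptive vectors the paragraph above describes, which is exactly more data about the data.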
This isn’t a complete accounting of Descriptive BDA’s ways and means. Yet it can do all these things and still not produce what a client wants. That’s because it cannot ask the questions that only humans can formulate: the things we want to find out or know about, which we ask before we ever lay a hand on the data.
Most often, the one thing missing from a Descriptive BDA data assemblage is the decidedly low-tech (pencil and paper), high-concept (thinking) preparation that is the bedrock processual logic for all data analysis, whether of Big or small datasets. This is the step in which managers and researcher/analysts concentrate on the issues in the organization that demand deep questions, questions to which they attach certain expectations, hypotheses, about the answers. Data collection (selection) follows in train, but only with hypotheses firmly in mind.
Without that, we see agonizingly unsuccessful BDA projects in which the first order of business is to resolve data storage and management issues. Hypothesis-forming—really, it can simply be the recognition of informed guesses—sits in the back row, up against the windows. And without having done that kind of conceptual work at the outset, these efforts produce more data about the dataset(s) in question, but little refined information.
I think this is a direct consequence of not asking good questions about what we want to know. And by “good,” I mean those that successively reduce the amount of data each step of the way, to the components or elements relevant to answering our research questions. That’s not a complete answer to “how do you reduce data to that which is relevant?” But it’s an essential start.
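As a minimal sketch of that hypothesis-first alternative (the data, fields, and hypothesis below are all invented for illustration): each question is applied as a filter, so the working set shrinks at every step instead of growing.

```python
# Toy customer records; the hypothesis, fields, and thresholds are
# invented for illustration only.
customers = [
    {"id": 1, "segment": "retail", "tenure_yrs": 5, "churned": True},
    {"id": 2, "segment": "retail", "tenure_yrs": 1, "churned": True},
    {"id": 3, "segment": "corporate", "tenure_yrs": 7, "churned": False},
    {"id": 4, "segment": "retail", "tenure_yrs": 2, "churned": False},
    {"id": 5, "segment": "retail", "tenure_yrs": 1, "churned": True},
]

# Hypothesis, stated before touching the data: short-tenure retail
# customers churn at a higher rate. Each question narrows the data
# to what is relevant to answering it.
step1 = [c for c in customers if c["segment"] == "retail"]  # who is in scope?
step2 = [c for c in step1 if c["tenure_yrs"] < 3]           # who is short-tenure?
churn_rate = sum(c["churned"] for c in step2) / len(step2)  # what is the rate?

print(len(customers), len(step1), len(step2))  # 5 4 3: shrinking, not growing
print(round(churn_rate, 2))                    # 0.67
```

The mechanics are trivial; the point is the order of operations. The questions and the hypothesis come first, and they decide which rows and columns survive each step, which is the opposite of generating every possible tabulation and hoping something actionable falls out.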
About the Author:
Bill Luker Jr., PhD, is an economist, applied econometrician, and methodologist, with 30+ years of analytic experience in private industry, US and state government, academia, and community and labor organizations. Most recently, he was a Senior Vice President at Citibank, overseeing Documentation for CCAR Modeling in fulfillment of the Bank’s regulatory obligations to the US Federal Reserve under the Dodd-Frank Act. In 2017, he founded a boutique economics and statistics consultancy, terraNovum Solutions. He co-authored Signal From Noise, Information From Data (https://tinyurl.com/yccjqyo9), an introductory statistics textbook for business and government professionals, and dozens more academic and professional publications. He can be reached at firstname.lastname@example.org, or at 940-435-2028.