Five years ago, in 2013, two emerging advocates for Big Data raised its public profile to the highest level it had enjoyed by then, and possibly since. They punched through the obliviousness of those who needed to hear about it but had not been vouchsafed the secret handshakes; of the too many who had tried to understand it and couldn’t; and of those who didn’t understand it but thought they did.
Their book was sensationally entitled Big Data: A Revolution That Will Transform How We Live, Work, and Think, by Viktor Mayer-Schönberger and Kenneth Cukier (New York, NY: Houghton Mifflin Harcourt). (Hereafter, M-S&C.) It was a finalist for the Financial Times Business Book of the Year. Since then, it seems that anyone with experience in data and analytics has followed with book-length opinions. But M-S&C’s series of sweeping claims has yet to be superseded. These claims seem to be the ground truth for big data practitioners: what there is by way of a rough epistemology for “big.” This epistemology is what I labeled previously in The Predictive Analytics Times as Computer Science-IT Data Science (or CS-IT Data Science).
Those claims run as follows. First, there is no need for sampling anymore. You’ve got plenty of data. Whole populations? Forget it. More. Petabytes. Second, forget about causation, because our predictors are pairwise or multi-way correlations that occur at one point or another, and correlation does not imply causation. (It’s a necessary, but not sufficient, condition.) Case closed. Third, with that we no longer need the scientific method, or modeling, simulation, and the like, all aimed at finding causation from data analyzed statistically and theoretically. (In fairness, it was Wired’s Chris Anderson who said this last part. After a while, though, the colors run together.)
Stephen Few, author of a new book as boldly titled as M-S&C’s, Big Data, Big Dupe: A little book about a big bunch of nonsense (Analytics Press, 2018), discusses how uncritical acceptance of CS-IT Data Science’s ground truth and its applications can lead to wholly incorrect conclusions about methodology and methods (two different things, actually) across the whole of science.
According to Few, big data, whether taken as one concept or several, remains ill-defined. He therefore believes there are no clear logical or empirical reasons for adopting M-S&C’s recommendation that we abandon 21st-century statistical data science methodology and methods. To do so would be developmentally “regressive,” and could block the continued evolution of frequentist-Bayesian statistics. In the meantime, the promises of quantitative, evidence-driven decision making in business and government seem to recede further into the distance.
Few’s biggest beef concerns something about which some writers in the tech commentariat may have been disingenuous: big data’s claims for data-analytic advances (e.g., no math, no coding), driven in many cases by marketing impulses, are often without foundation. And this is not the first CS-IT wave to generate unfulfilled promise. There are enough instances that similar failed “innovations” are now mapped to Gartner’s “hype cycle.”
By now, “big” has gone through a cycle or two. M-S&C’s best-seller codified certain of its aspects for CS-IT Data Science, but “enthusiasts” (a euphemism for “marketers,” along with “evangelists,” to give it a movement feel) grabbed the credentialed authors’ coattails. Typically, they possessed little or no formal understanding that would enable scientific discussion of M-S&C’s objectively debatable assertions. Instead, the marketers made for the bottom line.
Nowadays, it’s hard to distinguish CS-IT marketers from CS-IT data scientists. Here are two rules: The former want to sell you something. The latter are agnostic about methods and platforms, and read the scientific literature rather than marketing papers resembling scholarly journal articles. CS-IT Data Science’s more rigorous practitioners may be embarrassed as Few takes an entire chapter of this otherwise short book (75 pp.) to emphasize the issue of marketing imperatives trumping science.
Finally, full disclosure: I don’t know Stephen Few, nor have I had any dealings with him or with M-S&C. Few concludes his smackdown with “We must abandon big data to begin using data in effective ways.” (p. 69) Having read both books, and thought about the issues in various industry and occupational contexts, I’d say: “We must reform ‘big,’ as currently conceived and practiced, to increase mean ratios of signal-to-noise and information-to-data. Big data and its accompaniments are too new to be hindered by stone-carved dogma. That means including, not abandoning, predictive analytic techniques and approaches that many had been led to think were obsolete.” Few’s book ultimately leads us to think again about these issues, all the way through, as many times as needed.
About the Author
Bill Luker Jr., PhD, is an economist, applied econometrician, and methodologist, with 30+ years of analytic experience in private industry, US and state government, academia, and community and labor organizations. Most recently, he was a Senior Vice President at Citibank, overseeing Documentation for CCAR Modeling in fulfillment of the Bank’s regulatory obligations to the US Federal Reserve under the Dodd-Frank Act. In 2017, he founded a boutique economics and statistics consultancy, terraNovum Solutions. He co-authored Signal From Noise, Information From Data (https://tinyurl.com/yccjqyo9), an introductory statistics textbook for business and government professionals, and 30 more academic and professional publications.