Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. Revised and Updated. Eric Siegel. (Wiley: Hoboken NJ. 2013, 2016)
In 2013, as the worst effects of the crash had begun to reverberate out of the system, analysts like myself, along with dozens of other stripes (statisticians, biostatisticians, econometricians, financial quants, psycho-sociological researchers, and so on, in no particular order), were exposed to the first wave of evangelism for Big Data and what it meant. Characterized by ready enthusiasm, it radiated a minimum of science and a maximum of advertising-as-pseudo-science.
We all know what Big Data is now. As soon as it began to be popularized, however, we also began looking for a payoff in terms of analytic insights that were in some ways proportionate to the amount of data being crunched. More data = more information, right? But it appeared that the predictive capabilities thought to be inherent in Big Data qua Big Data were more or less mythic.
In 2016, other market vectors intersected. Their goals were the same as those of their predecessors in Big Data and Data Mining: to realize the promise of Big Data by extracting and validating big amounts (in terms of volume, and importance or salience) of information. One line of thinking, Predictive Analytics (PA), was first set out in a firecracker of a book by Eric Siegel, a former Columbia University professor: Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (2016).
Some observers thought that it might be the next step from Big Data Management (BDM) to Big Data Analytics (BDA). But BDA had not exactly swept the oceans. It was primarily descriptive statistics, which PA is not. You could say with some validity that the reason PA works and BDA doesn’t is that PA had almost nothing at all to do with the emergence of Big Data.
Various forms of PA have been developed and used in scholarly work for more than a century, beginning with linear regression (Ordinary Least Squares, or OLS). OLS is now almost universally employed in data analysis but, ironically, much derided. No comment on that. But it has matured well into the current period as what it is and always has been: a statistical technique for prediction.
Analysts use OLS multiple regression to predict all kinds of behavior. But PA, as Siegel practices it and writes about it, is a different mechanism. Using machine learning, it can do multiple OLS, and my experience is that many data scientists frequently use their machine to build “multiple-multiples.” Siegel calls these “ensemble models,” and shows how PA can build them and how their benefits show up in question-answering. PA will gin up a nice logistic regression if you want that, and just about any other type of model. In essence, the machine or mechanism that’s being “learned”
…takes characteristics of the individual as input, and provides a predictive score as output. The higher the score, the more likely it is that the individual will exhibit the predicted behavior. (p. 34)
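To make the quoted description concrete, here is a minimal sketch of a scoring model in Python. It assumes a logistic form, and the feature names, weights, and intercept are hypothetical values chosen for illustration; none of them come from Siegel’s book, and a real model would learn its weights from data.

```python
import math

# Hypothetical weights and intercept, for illustration only.
# A real PA model would learn these from historical data.
WEIGHTS = {"visits_last_month": 0.4, "items_in_cart": 0.9}
INTERCEPT = -3.0


def predictive_score(individual):
    """Take an individual's characteristics as input; return a score in (0, 1)."""
    z = INTERCEPT + sum(
        weight * individual.get(name, 0.0) for name, weight in WEIGHTS.items()
    )
    # The logistic function squashes the weighted sum into a probability-like score.
    return 1.0 / (1.0 + math.exp(-z))


casual = {"visits_last_month": 1, "items_in_cart": 0}
engaged = {"visits_last_month": 6, "items_in_cart": 2}

# The higher score marks the individual judged more likely to exhibit the behavior.
print(predictive_score(casual), predictive_score(engaged))
```

The point is only the shape of the mechanism: characteristics in, one score out, with higher scores for individuals judged more likely to buy.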
This could apply not only to individuals but also to processes, things, or any other objects. A more detailed technical explanation might be that, in the context of questions about the behaviors of people or objects, the learned machine (I’m getting a bit nutty on that) searches the data and finds multiply correlated features or variables (aka “variates,” such as clusters or classes) for use as input data. From these, it builds predictive models that score the probability of how the object of interest will behave, one way or another. Then you, the analyst, compare models to find the best-predicting ones (which may be an OLS multiple regression model, broadly understood, or a neural net).
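The compare-and-pick step can be sketched as follows. This is an illustration under stated assumptions, not Siegel’s procedure: the data are made up, the two candidates (a mean-only baseline and a one-variable OLS line) stand in for whatever models the machine builds, and held-out mean squared error is just one reasonable yardstick.

```python
def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_ols(xs, ys):
    """One-variable OLS candidate: closed-form least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: training cases plus held-out cases the models never saw.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
test_x = [7, 8]
test_y = [14.2, 15.9]

candidates = {"mean_baseline": fit_mean, "ols_line": fit_ols}
fitted = {name: fit(train_x, train_y) for name, fit in candidates.items()}
errors = {name: mse(model, test_x, test_y) for name, model in fitted.items()}

# The analyst keeps the candidate with the smallest held-out error.
best = min(errors, key=errors.get)
print(best, errors)
```

Scoring on cases the models have not seen is the design choice that matters here, since a model can fit its own training data well and still predict poorly.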
Predictive Analytics builds many models, then, anywhere from one to hundreds, all of which make predictions about behavior more or less well. They can be of different types, particularly non-stochastic ones: neural networks (hard to compute and interpret) and deep learning networks (even more difficult to compute and interpret). Decision trees, clustering, and classification are three non-statistical approaches that Siegel wants to persuade us are more revealing, i.e., information rich, than neural nets and other algorithmic learning methods. To this author, he is preaching to the choir. But I suspect from reading Siegel’s discussion that he believes, mainly, that he has found a less laborious way to do PA than building great honking OLS multiple regression models.
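Since many models are built, the “ensemble” idea mentioned earlier can be sketched as combining their scores. The three scoring rules below are hypothetical stand-ins (one is written in the style of a single decision-tree split), and simple averaging is only one of several ways to combine models; none of this is taken from the book.

```python
from statistics import mean


def stump_model(person):
    """Decision-stump-style rule: one split on a single characteristic."""
    return 0.8 if person["items_in_cart"] > 0 else 0.2


def visits_model(person):
    """Score rises with visit count, capped at 1.0."""
    return min(1.0, person["visits_last_month"] / 10)


def base_rate_model(person):
    """Ignores the individual and returns an assumed base rate."""
    return 0.5


def ensemble_score(person, models):
    """Average the scores of all component models."""
    return mean([model(person) for model in models])


models = [stump_model, visits_model, base_rate_model]
shopper = {"items_in_cart": 2, "visits_last_month": 6}
print(ensemble_score(shopper, models))
```

Averaging lets each component’s errors partly offset the others’, which is the usual argument for ensembles.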
All of this rolls up into one major takeaway, which Siegel may be too modest to claim but which this author sees quite clearly: PA is a universal data analysis system. My Figure 1 reproduces Siegel’s general logic of method for PA:
Each PA Application is Defined By:
There are almost two hundred (182) PA Cases in Siegel’s book. They are some of the best things about it. With almost no mathematics, he is able to show and tell, and summarize, in tabular form, PA case applications like the one in the table, above. They are very useful in framing analytic questions, and clarifying the uses to which answering them will apply.
The bulleted list, below, illustrates the elements that Siegel uses for each PA Application.
And if that’s not exactly a correct verbal transformation of what’s going on in the machine that learns, I will defer to whoever else wants to give it a try.
In some cases these can be enormously data-intensive PA projects, because of the level of granularity analysts want on each observed individual for a more reliable prediction of behavior for the next unknown individual, and so on. But we have plenty of data, and as Siegel points out, PA is not dependent on a Big Dataset. It scales down as well as up, depending upon which PA technique you use.
I mentioned earlier that PA is a universal data analysis system. Staying with that idea, there is a concept that supports an unexpected symmetry between algorithmic PA (e.g., machine-learned OLS) and frequentist PA (statistical multiple OLS).
It is that both approaches are analyses of dependence. The analyst observes and asks questions about the behavior, or set of behaviors, of a target or dependent phenomenon. In the most general sense, the goal of her analysis is to find out whether and how the observed (and possibly “discovered”) variables drive what she believes to be the behaviors of the dependent object, thing, process, or individual.
This is the intersection between Siegel’s algorithmic, machine-learning PA and the frequentist, statistical PA, which has been around for a very long time. Some may view this with consternation, because it could indicate that what we have thought were new approaches to data analysis are in fact not new, and that no progress has been made over the last 20 years.
Or the glass-half-full types among us could find it encouraging that there is a structural similarity between new PA and old PA. Again, it’s the analysis of dependence we all do in one way or another: the educated observer, analyst, data scientist, statistical data scientist, social and behavioral data scientist, industrial engineer, labor economic data scientist, and data scientist in general. With realizations like that, we can begin to speak the same languages. Great work, Dr. Siegel, in helping bring us to this point, where we can see that PA will be one of those languages.
Some other current books you might (or might not) want to look into:
The authors are elegant writers, and they, too, provide use cases for various analytic strategies, including machine learning, classification, clustering, etc. Their attitude toward and understanding of regression is, in one instance, clear and coherent, but in another (at the outset), completely incomprehensible. And yes, I do think that to criticize the technique, you should understand what it is, and be able to describe it coherently.
This is for the beginning techster, hence the “Dummies” in the title. But it is not much help for the analyst. It covers things like how to get a Hadoop cluster to run. But the word on the grapevine is that no one has been able to get a Hadoop cluster to run. I’d like to see some real-world examples, along with the output from a Hadoop cluster that answers a clearly and unambiguously stated research question. But I don’t think this book will give the reader a fully connected conceptual and practical understanding of just exactly what analysis and analytics are, much less Predictive Analytics, because of its focus on the technical issues involved in getting Big Data Management apps to run.
This contains, almost exclusively, how-to’s for Spark and Hadoop clusters. In that vein, it’s very useful. But the author’s understanding of what constitutes analytics is, to understate it, highly deficient. It involves shuffling huge volumes of numbers around in various ways without ever stating what the research or analytic questions are. What’s the point? There is nothing here about PA. Regrettably, IT professionals manifest these Big Data Analytics “approaches,” appearing to believe that Big Datasets have an inherent analytic capability, requiring no a priori understanding or questions to answer relevant queries. Nothing could be further from the truth. Data does not simply speak for itself.
About the Author
Bill Luker Jr., PhD, is an economist, applied econometrician, and methodologist, with 30+ years of analytic experience in private industry, US and state government, academia, and community and labor organizations. Most recently, he was a Senior Vice President at Citibank, overseeing Documentation for CCAR Modeling in fulfillment of the Bank’s regulatory obligations to the US Federal Reserve under the Dodd-Frank Act. In 2017, he founded a boutique economics and statistics consultancy, terraNovum Solutions. He co-authored Signal From Noise, Information From Data (https://tinyurl.com/yccjqyo9), an introductory statistics textbook for business and government professionals, and has 30 more academic and professional publications to his credit. drblukerjr [at] gmail [dot] com +1 940-435-2028.