Machine Learning Times

12 months ago
The Loss of Inference

For more from this writer, Stephen Chen, see his session, “The Perils of Prediction,” at PAW Business, June 19, 2019, in Las Vegas, part of Mega-PAW.
The burgeoning field of Data Science / Machine Learning borrows heavily from Statistics but bastardizes it. For example, “dummy variable” becomes “one-hot encoding”, and “independent variables” become “features”. This shift in nomenclature results in a loss of the methodological meaning that was inherent in the original names; for instance, a casual Google search on the “auto-mpg” dataset will turn up many how-to pages, almost all of which treat the variables as “features” and throw everything (including non-independent variables) into the model.
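To make the point about lost meaning concrete, here is a minimal sketch of the difference between the two terms. It assumes a hypothetical categorical column called origin with three made-up levels; the function names are illustrative and not taken from any library. Classical dummy coding keeps k-1 columns and treats one level as the reference, whereas one-hot encoding keeps all k columns, so the columns always sum to 1 and are perfectly collinear with a regression intercept.

```python
def one_hot(values):
    """One column per category: k columns for k categories."""
    levels = sorted(set(values))
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, rows


def dummy_code(values, reference):
    """k-1 columns: the reference level is represented by an all-zeros row."""
    kept = [lvl for lvl in sorted(set(values)) if lvl != reference]
    rows = [[1 if v == lvl else 0 for lvl in kept] for v in values]
    return kept, rows


# Hypothetical data, for illustration only.
origin = ["usa", "europe", "japan", "usa", "japan"]

oh_levels, oh_rows = one_hot(origin)
dm_levels, dm_rows = dummy_code(origin, reference="usa")

# Every one-hot row sums to 1, which duplicates the intercept column.
one_hot_sums_to_one = all(sum(row) == 1 for row in oh_rows)

print(oh_levels, oh_rows[0])   # columns and the encoding of the first "usa" row
print(dm_levels, dm_rows[0])   # the reference level becomes all zeros
```

With dummy coding, each remaining coefficient reads as a contrast against the reference level, which is the interpretive step the older name carried with it.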

This democratization of Data Science shifts the priority from explanation to production. The how-to, can-do approach contrasts with the why-not-to dictums of Statistics classes (at least in my time). As a result, inference, once a matter of triangulated evidence and reasoning, has increasingly been reduced to a merely mathematical or computational problem.

What is unique to Data Science/Machine Learning is a set of purely algorithmic approaches (e.g. Genetic Algorithms, Neural Networks, Boosting and Stacking) that have been increasingly hyped because of their superior “inference”, where they can outperform other methods / humans / whatever strawman is on hand.

These algorithmic-based approaches are marketed as “learning from data”, but even that concept has been bastardized – the Bayesian approach of adjusting probabilities based on extant data has been reduced to an algorithm that essentially attempts to fit a curve to as many points as possible. In short, inference has gone from a sniper aiming to hit as close to the target as possible, to a shotgun fired in the hope that one pellet will land close to the target via computational brute force.
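For readers who have only met “learning from data” as curve fitting, the Bayesian sense of adjusting probabilities based on extant data can be sketched with the textbook Beta-Binomial update. This is an illustrative example only: the uniform prior and the counts of 7 successes and 3 failures are made up, and the function names are hypothetical.

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior plus binomial data
    gives a Beta(alpha + successes, beta + failures) posterior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Assumed prior: Beta(1, 1), i.e. uniform over the unknown success probability.
prior = (1.0, 1.0)

# Made-up evidence: 7 successes and 3 failures.
posterior = update_beta(*prior, successes=7, failures=3)

print("prior mean:    ", beta_mean(*prior))
print("posterior mean:", beta_mean(*posterior))
```

The output is a full distribution over the unknown probability rather than a single fitted curve, so the uncertainty that remains after seeing the data stays explicit.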
Statistical inference is intimately tied to probability distributions – Gaussian, Poisson, Binomial etc. are evidence-backed probability distributions corresponding to specific event characteristics. There are application domains where algorithmic approaches are wholly appropriate (e.g. genetic algorithms in robotics), and even necessary (neural networks and image classification) when it is difficult to operationalize probability density (and the scope of data and context are contained).

The illustration above shows why Data Scientists applying the latest algorithmic approach in a domain with known probability densities risk sacrificing predictive power for model accuracy. They are either seduced by the latest “fad” or unaware that algorithmic-based approaches perform poorly beyond the limits of their input data range. More importantly, model accuracy should not be the sole end in itself, because accuracy and overfitting are two sides of the same coin.
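The claim about input data range can be shown with a toy sketch. It is not the article's illustration: it assumes noise-free synthetic data from a known linear process (y = 2x + 1), and uses a one-nearest-neighbour lookup as a simple stand-in for a method that memorizes its training points. Both predictors are perfect inside the training range, and only the one that encodes the known functional form holds up outside it.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def nearest_neighbor(xs, ys, x_new):
    """Predict with the y value of the closest training x (1-NN)."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x_new))
    return ys[i]


# Synthetic training data on x = 0..10 from the assumed process y = 2x + 1.
train_x = [float(x) for x in range(11)]
train_y = [2.0 * x + 1.0 for x in train_x]

slope, intercept = fit_line(train_x, train_y)

x_out = 20.0                          # well beyond the training range
truth = 2.0 * x_out + 1.0
line_pred = slope * x_out + intercept
nn_pred = nearest_neighbor(train_x, train_y, x_out)

print("truth:", truth, "line:", line_pred, "nearest neighbour:", nn_pred)
```

The line extrapolates correctly here only because its assumed form matches the process that generated the data, which is exactly the domain knowledge the purely algorithmic method does not use.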

All this means it is more important than ever to practice the 3 criteria of model evaluation that are typically tossed out at the beginning of Research / Statistics classes, and given scant attention after.

Parsimony – the simplest possible model is the best model. This recognizes that adding more variables eventually leads to overfitting, and that each method introduces its own biases and data assumptions.

Validity – addressing potential biases in the data, and triangulating results with external sources vs. accepting model-generated metrics as truth.

Reliability – the extent to which the results are replicable in different / real-world contexts. The current reliance on accuracy metrics in Machine Learning parallels the reliance on p-values by the scientific community. By itself, a highly accurate model or a highly statistically significant study does not guarantee similar performance in a different / real-world context.

This problem of an ever-increasing accumulation of large bodies of unreplicable published research forced the ASA (American Statistical Association) to issue a statement on p-values in 2016, essentially saying they should not be used as the sole basis of evaluation, and are not a substitute for scientific reasoning.

Data Scientists/ML experts etc. should take heed of the same (particularly in domains dealing with stochastic processes, shifting time series etc.), as it typically takes less than 10% input noise/variance to break predictions.

About the Author

Stephen Chen spent over 10 years as an advanced analytics management consultant in diverse industries (including financial, retail, CPG, automotive, telecom, technology, healthcare, agency etc.) helping senior management turn around their businesses with his blend of data science, strategy, and human insights expertise. Along the way, he has audited numerous models as well as developed analytic techniques that address shortcomings of existing methods.

Originally trained as a social scientist, Stephen was one of the handful of researchers to specialize in Social Network Analysis before the ascendance of Google and Facebook popularized those algorithms. He taught classes on Critical Statistics in university and has been a passionate advocate for model robustness and reliability.
