By: Stephen Chen
In an earlier article (“The Loss of Inference”), I cited the misuse of non-independent variables in models built on the auto-mpg dataset as a symptom of the mathematical / technological determinism that now pervades data science, a consequence of a focus on production that has devalued critical evaluation.
In this article, I hope to show that this shift from critical evaluation to production is not necessarily a virtue, by looking at how the 2015 Flight Delays dataset is used. This is another popular dataset because it is large (comprising 5.8 million flights within the US) and easy to understand. I have reviewed many articles, found via Google Search, that use this dataset as-is, and their models and conclusions are pretty much useless or wrong. I have yet to come across one that considers the conditional probability of flight delays; that is, no one (from what I have reviewed) has bothered to establish whether the patterns they explore, or that their model predicts, are a valid phenomenon or merely a function of the number of flights on any given day, airline, or airport.
Plotting the number of flights alongside the number of delays by day of week, we can see a high correlation between delay incidence and flight incidence. Any “pattern” in flight delays on a daily basis is an artifact of the number of flights that day.
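To make the distinction between delay counts and delay rates concrete, here is a minimal Python sketch. The records are hypothetical toy values, not figures from the 2015 Flight Delays dataset, and the helper name `delay_rate_by_group` is an assumption introduced only for illustration.

```python
from collections import Counter

# Hypothetical toy records: (day_of_week, was_delayed).
# These are made-up numbers, NOT taken from the 2015 Flight Delays dataset.
flights = (
    [("Mon", True)] * 30 + [("Mon", False)] * 70
    + [("Wed", True)] * 60 + [("Wed", False)] * 140
    + [("Sat", True)] * 15 + [("Sat", False)] * 35
)


def delay_rate_by_group(records):
    """Return {group: P(delay | group)} from (group, was_delayed) pairs."""
    total = Counter(group for group, _ in records)
    delayed = Counter(group for group, was_delayed in records if was_delayed)
    return {group: delayed[group] / total[group] for group in total}


# Raw delay counts differ a lot from day to day...
delay_counts = Counter(day for day, was_delayed in flights if was_delayed)
# ...but once we condition on the number of flights, the rate is flat.
delay_rates = delay_rate_by_group(flights)

print(delay_counts)
print(delay_rates)
```

In this toy data the busiest day has four times as many delays as the quietest day, yet the delay rate is identical on all three days. A chart of raw counts would suggest a weekly "pattern" that the rates do not support.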
When we look at delays by airline and destination airport, we observe that the conditional probability of a delay is the same for each airline and destination airport (with one or two blips): the points pretty much fall on the same line. In other words, the odds of a flight being delayed on Hawaiian Airlines (which has the fewest delays) are pretty much the same as on Southwest Airlines (which has the most delays).
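One way to check whether groups "fall on the same line" is to compare each group's delay rate with the pooled rate, scaled by the sampling error expected for a group of that size. The sketch below is an illustration under stated assumptions, not the analysis behind the plots in this article: the airline names and counts are hypothetical, and the binomial standard error is a simplification that treats flights as independent.

```python
import math

# Hypothetical (flights, delays) per airline. Illustrative numbers only,
# not taken from the 2015 Flight Delays dataset.
airline_counts = {
    "Airline A": (100_000, 20_300),
    "Airline B": (10_000, 2_000),
    "Airline C": (1_000, 210),
}


def compare_to_pooled_rate(counts):
    """Return the pooled delay rate and, per group, (rate, z-score).

    The z-score measures how far a group's rate sits from the pooled rate,
    in units of the binomial standard error for that group's flight count.
    Assumes flights are independent, which is a simplification.
    """
    total_flights = sum(n for n, _ in counts.values())
    total_delays = sum(d for _, d in counts.values())
    pooled = total_delays / total_flights
    rows = {}
    for name, (n, d) in counts.items():
        rate = d / n
        std_err = math.sqrt(pooled * (1 - pooled) / n)
        rows[name] = (rate, (rate - pooled) / std_err)
    return pooled, rows


pooled_rate, per_airline = compare_to_pooled_rate(airline_counts)
for name, (rate, z) in per_airline.items():
    print(f"{name}: rate={rate:.3f}, z={z:+.2f}")
```

In this toy example the airline with the most delays has roughly a hundred times as many delays as the one with the fewest, but every rate sits well within sampling noise of the pooled rate.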
This evidence leads to the conclusion that models trained on this data as-is are neither predicting anything, nor finding anything interesting. Their results merely replicate local maxima in the data that do not constitute any meaningful pattern.
In the plot below, we can see that average delay (y-axis) is itself contingent on delay incidence (x-axis) as well as flight incidence to the airport (size of bubbles). The high variance of average delay across low-incidence airports, which converges as incidence increases (moving along the x-axis), is a fundamental statistical property of small samples.
As with the hypothetical example in the earlier article, if one were to employ techniques such as boosting (or a density-based algorithm) on this data, the resulting model would score well on error/accuracy measures but would have limited predictive power beyond this input dataset. By contrast, a methodologically appropriate probabilistic/stochastic model would appear to “underfit” because of the high variance at lower-incidence airports, and would be rejected as “inferior” in the current naive zeitgeist of better “accuracy” / newer technique = better.
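The small-sample effect can be reproduced with a short simulation. The sketch below assumes, purely for illustration, that every airport draws its delays from the same exponential distribution with a 30-minute mean; the function name, the distribution, and all parameter values are hypothetical and not estimated from the dataset.

```python
import random
import statistics


def spread_of_average_delay(n_delays, n_airports=500, mean_delay=30.0, seed=0):
    """Standard deviation of the per-airport average delay.

    Every simulated airport shares the SAME underlying delay distribution
    (exponential with the given mean, an assumption made for illustration),
    so any spread between airports is sampling noise alone.
    """
    rng = random.Random(seed)
    averages = []
    for _ in range(n_airports):
        delays = [rng.expovariate(1 / mean_delay) for _ in range(n_delays)]
        averages.append(statistics.fmean(delays))
    return statistics.stdev(averages)


# Spread of average delay for airports with 10, 100, and 1000 delayed flights.
spread_by_incidence = {n: spread_of_average_delay(n) for n in (10, 100, 1000)}
for n, spread in spread_by_incidence.items():
    print(f"{n:>5} delays per airport -> spread of average delay: {spread:.2f} min")
```

The spread shrinks roughly in proportion to one over the square root of the number of delays, even though no airport is "better" or "worse" than any other. A flexible model fitted to the low-incidence airports would be fitting this noise.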
Model accuracy and overfitting are two sides of the same coin
No amount of regularization, cross-validation, or fancy algorithms can address a fundamentally flawed understanding of, or assumptions about, the data. Consider what happens whenever an airline changes its flight schedule and/or destinations: the shift in the incidence of flights to an airport affects both the incidence of delays and the variability of that incidence, which will break most “accurate” models.
This has broader implications beyond the immediate dataset. Consider the treatment of important but low incidence customers (e.g. fraud cases, high value customers): to what extent have your models been over-engineered to either suppress or fit their tendency for high variability?
Those working in healthcare to train machine learning models for diagnostic purposes should assess whether the data on rare conditions captures sufficient variance, beyond what is needed for model convergence. If not, there could be an additional whammy on top of the false positive paradox, which occurs when the incidence of a condition in the overall population is lower than the false positive rate (here is a link to a concise plain-language explanation).
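The false positive paradox follows directly from Bayes' rule, and a few lines of Python make the arithmetic explicit. The prevalence, sensitivity, and false positive rate below are hypothetical values chosen for illustration, not figures for any real diagnostic test.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """P(condition | positive result), computed with Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = false_positive_rate * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical test: the condition affects 1% of the population, the test
# catches 99% of true cases, and it wrongly flags 5% of healthy people.
ppv = positive_predictive_value(
    prevalence=0.01, sensitivity=0.99, false_positive_rate=0.05
)
print(f"Chance that a positive result is a true case: {ppv:.3f}")  # about 0.167
```

Because the 1% prevalence is below the 5% false positive rate, false positives outnumber true positives by about five to one, so only about one positive result in six is a real case despite the test's 99% sensitivity.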
Hopefully I have made a case that there should be more to data science than a production stream of scikit.fit and scikit.predict (or equivalents).
About the Author
Stephen Chen spent over 10 years as an advanced analytics management consultant in diverse industries (including financial, retail, CPG, automotive, telecom, technology, healthcare, and agency) helping senior management turn around their businesses with his blend of data science, strategy, and human insights expertise. Along the way, he has audited numerous models as well as developed analytic techniques that address shortcomings of existing methods.