Originally published in KDnuggets, October 2019.
Fallacies are what we call the results of faulty reasoning. Statistical fallacies, a form of misuse of statistics, are instances of poor statistical reasoning; you may have started off with sound data, but your use or interpretation of it, regardless of your possible purity of intent, has gone awry. Therefore, whatever decisions you base on these wrong moves will necessarily be incorrect.
There are countless ways to reason incorrectly from data, some of which are much more obvious than others. Given that people have been making these mistakes for so long, many statistical fallacies have been identified and can be explained. The good thing is that once they are identified and studied, they can be avoided. Let’s have a look at a few of the more common fallacies and see how we can avoid them.
Out of interest, when misuse of statistics is not intentional, the process bears a resemblance to cognitive biases, which Wikipedia defines as “tendencies to think in certain ways that can lead to systematic deviations from a standard of rationality or good judgment.” The former builds incorrect reasoning on top of data and its explicit and active analysis, while the latter reaches a similar outcome much more implicitly and passively. That distinction is not hard and fast, as there is definitely overlap between these two phenomena. The end result is the same, however: plain ol’ wrong.
Here are five statistical fallacies — traps — which data scientists should be aware of and definitely avoid. The failure to do so will be catastrophic in terms of both data outcomes and a data scientist’s credibility.
1. Cherry Picking
In an attempt to demonstrate just how obvious and simplistic statistical fallacies can be, let’s start off with the classic that everyone should already know: cherry picking. We can put this in the same category as other easily recognizable fallacies, such as the Gambler’s Fallacy, False Causality, biased sampling, overgeneralization, and many others.
The idea of cherry picking is a simple one, and something you have definitely done before: the intentional selection of data points which help support your hypothesis, at the expense of other data points which either do not support your hypothesis or actively oppose it. Have you ever heard a politician talk? Then you’ve heard cherry picking. Also, if you are a living, breathing human being, you have cherry picked data at some point in your life. You know you have. It’s often tempting, a piece of low-hanging fruit which can win over or confound an opponent in a debate, or help push your agenda at the expense of an opposing view.
Why is it bad? Because it’s dishonest, that’s why. If data is truth, and analysis of data using statistical tools is supposed to help unearth truth, then cherry picking is the antithesis of truth-seeking. Don’t do it.
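To see how little effort it takes, here is a minimal Python sketch of cherry picking in action. Everything in it is an assumption made up for illustration: the "daily lift" numbers are synthetic draws from a distribution with no true effect, and the helper names `simulate_daily_lift` and `cherry_pick` are hypothetical, not taken from any library. The point is only that keeping the most favorable data points manufactures an effect that the full data set does not support.

```python
import random
import statistics


def simulate_daily_lift(n_days=60, seed=0):
    """Synthetic daily 'lift' values with zero true effect (illustrative data only)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_days)]


def cherry_pick(values, keep=10):
    """Keep only the `keep` values most favorable to the hypothesis (the largest lifts)."""
    return sorted(values, reverse=True)[:keep]


lifts = simulate_daily_lift()
picked = cherry_pick(lifts)

# Honest summary: every observation counts.
full_mean = statistics.mean(lifts)
# Cherry-picked summary: only the observations that flatter the hypothesis.
picked_mean = statistics.mean(picked)

print(f"all {len(lifts)} days:  mean lift = {full_mean:+.2f}")
print(f"best {len(picked)} days: mean lift = {picked_mean:+.2f}")
```

The cherry-picked mean is always at least as large as the full-sample mean, simply because of how the subset was chosen. That is exactly why a result reported without its selection criteria deserves suspicion.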
About the Author
Matthew Mayo is a Machine Learning Researcher and the Editor of KDnuggets, the seminal online Data Science and Machine Learning resource. He is particularly interested in unsupervised learning, deep neural networks, natural language processing, algorithm design and optimization, and distributed approaches to data processing and analysis. Matthew holds a Master’s degree in CS and a graduate diploma in Data Mining. Email him at mattmayo at kdnuggets[dot]com.