Little attention is paid these days to data reliability and analytic validity (R&V) in quantitative thought and practice. Many data practitioners, especially those in Computer Science-IT Data Science (CS-IT Data Science), don’t know what the terms mean, let alone recognize the R&V work they should be doing routinely but are not doing at all.
After all, we’ve got Big Data. And ultimately, “‘bigness’ mitigates whatever may be wrong with data that might bias findings from analytical operations on it,” right? Maybe not.
This statement is similar to other claims of CS-IT Data Science “evangelists,” such as “no need for statistics (i.e., ‘not invented here’),” “NO THEORY ALLOWED Past This Point,” or “no sampl(ers) need apply.” Pardon the puns, but these are mere slogans for logical and analytical dead-ends. They are not supported, and must be (though they may not be supportable), by evidence from independent research, i.e., research not carried out by vendors. I don’t see anybody doing that yet.
Big Data has proved harder to build, manage, and, most importantly, analyze than originally claimed. I think everyone recognizes this. But the reasons are unfortunately not clear to the two groups most likely to be misled by Big Data marketing about the scientific facts of life in the data and predictive analysis world: people who run companies and governments, and want answers; and the CS-IT Data Scientists who are often mis-employed in finding them. The fact is that we simply cannot analyze Big Datasets without the tools and empirically grounded theory from what I call the Statistical Data Sciences, known everywhere as just plain statistics. And CS-IT Data Science (with perhaps the exception of machine learning, as in automated applied statistics) has backed itself into a blind alley by dismissing statistics.
But with more and more data, CS-IT Data Scientists and data scientists from all disciplines have a greater need for the tools and statistical theory of data collection, selection, and construction; bivariate and multivariate correlation analysis; and sampling. Ultimately, R&V analysis also demands these analytic and testing routines, not the near-zero attention it too often receives.
So what is reliability? It is the key standard (the “platinum” standard, if you’re with me so far) for the quality of a measure, because a measure must be reliable if it’s to be valid. It refers to a measure’s consistency. We consider a measure reliable if, when we repeatedly measure the same person, plant, machine or system, along a particular dimension or variable, it produces approximately the same value. To illustrate, there is the simple example of a mechanical bathroom scale, and measurements recorded under the variable name “weight.”
Measure your weight 10 times in a row, with no time for drinking, eating, or exercise in between. The first shows 180.10, the next 180.07, the next 180.08, and so on, all very close to 180. Even though there is some small variation—no matter how precise the instrument, all measurements vary, and it’s a mechanical scale, after all—for our purposes, the scale produces a consistent and therefore reliable “weight” measure. But if one weighing registers 180, the next 350, and the next 400, the scale produces inconsistent and therefore unreliable measures.
Another way to estimate the scale’s reliability, and, again, the reliability of the “weight” measure/variable in your database, is to weigh a group of n people once, and then a second time, with nothing in between that might affect their weight. Calculate the correlation coefficient between the two sets of weight measurements. That’s a test-retest reliability coefficient, or RXX.
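The two-weighings procedure can be sketched in a few lines of Python. This is only an illustrative simulation, not a prescribed method: the group size, the spread of true weights, the size of the scale’s random error, and the function names pearson and simulate_test_retest are all assumptions made up for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_test_retest(n_people, true_sd, error_sd, seed=0):
    """Weigh n_people twice on a scale with random error; return the RXX estimate."""
    rng = random.Random(seed)
    # Hypothetical "true" weights; the mean of 170 is an arbitrary choice.
    true_weights = [rng.gauss(170.0, true_sd) for _ in range(n_people)]
    # Each weighing adds fresh, independent, zero-mean error to the true weight.
    first = [w + rng.gauss(0.0, error_sd) for w in true_weights]
    second = [w + rng.gauss(0.0, error_sd) for w in true_weights]
    return pearson(first, second)


good_scale_rxx = simulate_test_retest(n_people=2000, true_sd=25.0, error_sd=1.0)
bad_scale_rxx = simulate_test_retest(n_people=2000, true_sd=25.0, error_sd=50.0)
print(f"reliable scale RXX:   {good_scale_rxx:.3f}")
print(f"unreliable scale RXX: {bad_scale_rxx:.3f}")
```

Under these assumed variances, the theoretical reliabilities are roughly 0.998 for the good scale and 0.2 for the bad one; the simulated estimates should land near those values, though not exactly on them.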
RXX is bounded by 0 and 1. The difference between 1 and RXX is a quick but clean estimate of measurement error: the share of the observed variance that is due to error rather than to true differences. There are other estimates, of course. All of them derive their formalism from the premise that a given measurement, when recorded, becomes an observation, assumed to consist of the “true” measure, X(True), plus Xε, a random error that has zero mean and zero correlation with X(True).
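Written out, the premise looks like this. The label X(Observed) for the recorded value is mine, added for illustration. Reading the test-retest correlation as the variance ratio below also rests on an assumption the text does not state: that the errors in the two weighings are uncorrelated with each other and have the same variance.

```latex
\begin{align*}
X_{\mathrm{Observed}} &= X_{\mathrm{True}} + X_{\varepsilon},
  \qquad \mathrm{E}[X_{\varepsilon}] = 0,
  \qquad \mathrm{Cov}(X_{\mathrm{True}}, X_{\varepsilon}) = 0 \\
\mathrm{Var}(X_{\mathrm{Observed}}) &= \mathrm{Var}(X_{\mathrm{True}}) + \mathrm{Var}(X_{\varepsilon}) \\
R_{XX} &= \frac{\mathrm{Var}(X_{\mathrm{True}})}{\mathrm{Var}(X_{\mathrm{Observed}})} \\
1 - R_{XX} &= \frac{\mathrm{Var}(X_{\varepsilon})}{\mathrm{Var}(X_{\mathrm{Observed}})}
\end{align*}
```

Because both variances on the right-hand side are non-negative, the ratio stays between 0 and 1.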
The smaller the value of RXX, the less its corresponding measure correlates with any other set of variables, e.g., target or dependent variables for predictive analytics, or variables chosen randomly for data mining (since we are talking about Big Data). In short, the lower or higher the RXX of a given measure/variable, the weaker or stronger its observable correlation with other measures in the data will be.
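A small simulation makes the weakening visible. It leans on the standard attenuation relation from classical measurement theory (observed correlation is approximately the true correlation times the square root of RXX times RYY), which the text does not name. Everything specific here is an assumption for illustration: the true correlation of 0.8, a target measured without error (so RYY = 1), the sample size, and the function names.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def observed_correlation(rxx, n=20000, true_corr=0.8, seed=1):
    """Correlate an error-laden predictor with a target, given the predictor's RXX."""
    rng = random.Random(seed)
    # With true variance 1, rxx = 1 / (1 + error_var), so error_var = (1 - rxx) / rxx.
    error_sd = math.sqrt((1.0 - rxx) / rxx)
    xs, ys = [], []
    for _ in range(n):
        true_x = rng.gauss(0.0, 1.0)
        # Target constructed to correlate true_corr with the true predictor.
        # Assumption: the target itself is measured without error (RYY = 1).
        y = true_corr * true_x + math.sqrt(1.0 - true_corr ** 2) * rng.gauss(0.0, 1.0)
        xs.append(true_x + rng.gauss(0.0, error_sd))  # recorded, error-laden value
        ys.append(y)
    return pearson(xs, ys)


results = {}
for rxx in (1.0, 0.8, 0.5, 0.2):
    results[rxx] = observed_correlation(rxx)
    predicted = 0.8 * math.sqrt(rxx)  # attenuation relation with RYY = 1
    print(f"RXX={rxx:.1f}  observed r={results[rxx]:.3f}  predicted r={predicted:.3f}")
```

The underlying relationship never changes across the four runs; only the reliability of the predictor does. If the target were also measured with error, the observed correlation would be expected to shrink further.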
Ok, so you knew some of that. The weight example is almost risibly simplistic. But most small errors, the kind that contaminate, say, 1-5 percent of a dataset, are cumulatively additive. This may be especially so in what you might think are elementary processes like counting and summing rows and columns of enormous datasets. As dataset size increases, small errors become bigger, in number and size. This happens especially when we talk about the orders-of-magnitude leaps that tera- and petabyte-size datasets represent (and much bigger ones than that, of course) over what we used to think was a lot of data. This can generate a lot more error, and more unreliability across both descriptive and analytic data runs, than unwary researchers might expect. The issue immediately complicates data quality assurance and analytic validity if it’s not dealt with from the beginning and throughout the process.
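Here is a rough sketch of how that accumulation plays out when summing a column. It is a toy model under stated assumptions, not a description of any particular dataset: a 2 percent contamination rate, recording errors that all run in one direction (between 0 and 10 units too high), and the function name contaminated_total_error are all hypothetical.

```python
import random


def contaminated_total_error(n_rows, error_rate=0.02, seed=2):
    """Return (number of bad rows, absolute error in the column total)."""
    rng = random.Random(seed)
    bad_rows = 0
    total_error = 0.0
    for _ in range(n_rows):
        if rng.random() < error_rate:
            bad_rows += 1
            # Hypothetical one-directional recording error of 0 to 10 units.
            total_error += rng.uniform(0.0, 10.0)
    return bad_rows, total_error


summary = {}
for n_rows in (1_000, 100_000, 1_000_000):
    bad_rows, total_error = contaminated_total_error(n_rows)
    summary[n_rows] = (bad_rows, total_error)
    print(f"rows={n_rows:>9,}  bad rows={bad_rows:>6,}  "
          f"error in total={total_error:>12,.1f}  "
          f"error per row={total_error / n_rows:.4f}")
```

In this toy model, the number of bad rows and the error in the total both grow with the row count, while the error per row settles near a constant instead of shrinking toward zero. More rows, in other words, do not wash the contamination out.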
The point of this discussion is not to bash Big Data; I’ve done it, along with many others. It’s that if you don’t pay attention to things like R&V (which come out of statistics research and the accompanying literature, and which do require thought and adaptation for Big Data), unreliable measurements can really bite you. You won’t detect, much less mitigate, the source(s) of unreliability, and unreliable measures will invalidate your results. Irrelevant or nonsense measures will mask functional or structural (fixed effect) relationships, increasing the probability that you won’t find a relationship even if there is one, i.e., Type II error. Gag. Embarrassing, at best.
Before you begin an analysis, then, establish estimates of reliability for your chosen variables. Failing to do so is a form of data-analytic or statistical malpractice. Will it invalidate you as a statistical data scientist? Not for first offenders. But it’s your job to know these things. Learn them.
About the Author:
Bill Luker Jr., PhD, is an economist, applied econometrician, and methodologist, with 30+ years of analytic experience in private industry, US and state government, academia, and community and labor organizations. Most recently, he was a Senior Vice President at Citibank, overseeing Documentation for CCAR Modeling in fulfillment of the Bank’s regulatory obligations to the US Federal Reserve under the Dodd-Frank Act. In 2017, he founded a boutique economics and statistics consultancy, terraNovum Solutions. He co-authored Signal From Noise, Information From Data (https://tinyurl.com/yccjqyo9), an introductory statistics textbook for business and government professionals, and has 30 more academic and professional publications to his credit. firstname.lastname@example.org +1 940-435-2028.