Data reliability and analytic validity (DR&AV) get little attention these days. Many data science practitioners, especially those in Computer Science-IT Data Science (CS-IT Data Science), don’t understand what the terms mean, and so they skip DR&AV work that they should be doing routinely.
After all, we’ve got Big Data. And ultimately, “bigness” mitigates whatever may be wrong with the data that might bias the findings of analytical operations on it, right? Maybe not. The statement is similar to other claims of CS-IT Data Science “evangelists,” such as “no need for statistics (i.e., ‘not invented here’),” “NO THEORY ALLOWED Past This Point,” or “no sampl(ers) need apply.” Pardon the puns, but these are mere slogans for logical and analytical dead-ends. They are not supported by evidence from independent research, i.e., research not carried out by vendors. They must be, and it may turn out that they cannot be.
Instead, as data volumes grow, CS-IT Data Scientists and data scientists from all disciplines have a greater need for the statistical theory and tools of data collection, selection, and construction; bi- and multi-variate correlation analysis; and sampling. Ultimately, DR&AV analysis demands these descriptive and testing routines as essential preparation for analytic work, and it deserves far more than the near-zero attention it too often receives.
So what is reliability? It is a key standard for the quality of a measure, because a measure must be reliable if it’s to be valid. It refers to a measure’s consistency and repeatability. We consider a measure reliable if, when we repeatedly measure the same person, plant, machine or system, along a particular dimension or variable, it produces approximately the same value. To illustrate, there is the simple example of a mechanical bathroom scale, and measurements recorded under the variable name “weight.”
Measure your weight 10 times in a row, with no time for drinking, eating, or exercise in between. The first shows 180.10, the next 180.07, the next 180.08, and so on, all very close to 180. Even though there is some small variation—no matter how precise the instrument, all measurements vary, and it’s a mechanical scale, after all—for our purposes, the scale produces a consistent and therefore reliable “weight” measure. But if one weighing registers 180, the next 350, and the next 400, the scale produces inconsistent and therefore unreliable measures.
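As a minimal sketch of that consistency check, the snippet below computes the sample standard deviation of the two sets of readings quoted above. The function name `spread` is an illustrative choice, not an established term, and with only three readings per set the numbers are rough indications at best.

```python
import statistics


def spread(readings):
    """Sample standard deviation of repeated readings of the same object."""
    return statistics.stdev(readings)


# Readings quoted in the text, in the scale's own units.
consistent = [180.10, 180.07, 180.08]
erratic = [180, 350, 400]

print(f"consistent scale: sd = {spread(consistent):.3f}")
print(f"erratic scale:    sd = {spread(erratic):.1f}")
```

A spread that is tiny relative to the quantity being measured suggests a consistent instrument; a spread on the same order as the quantity itself does not.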
Another way to estimate the scale’s reliability, and the reliability of the “weight” measure/variable in your database, is to weigh a group of n people once, and then a second time, with nothing in between that might affect their weight. Calculate the correlation coefficient between the two sets of weight measurements. That is a test-retest reliability coefficient, or RXX.
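The two-weighings procedure can be sketched in a few lines of Python. Everything numeric here is an assumption made for illustration: the 200 simulated people, their weight distribution, and the scale's error of 2 units are invented, and `pearson_r` is a hand-written helper rather than a library call.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: simulated weights, not taken from any real dataset.
rng = random.Random(42)
true_weights = [rng.gauss(170, 25) for _ in range(200)]
scale_error_sd = 2.0  # assumed random error of the scale

# Two weighings of the same people, each with its own random error.
first_weighing = [w + rng.gauss(0, scale_error_sd) for w in true_weights]
second_weighing = [w + rng.gauss(0, scale_error_sd) for w in true_weights]

r_xx = pearson_r(first_weighing, second_weighing)
print(f"test-retest reliability RXX = {r_xx:.3f}")
```

With real data, the two lists would simply be the two recorded columns of weights for the same people, in the same order.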
In theory, RXX lies between 0 and 1, although a correlation computed from a sample can come out negative. The difference between 1 and RXX is a quick but clean estimate of the share of a measure’s variance that is due to measurement error. There are others, of course. All of them derive their formalism from the premise that a given measurement, when recorded, becomes an observation, assumed to consist of the “true” measure X(True), plus X_{ε}, a random error that has zero mean and zero correlation with X(True).
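Written out, the premise looks like the following. This is the standard classical measurement model, stated here as a sketch; the symbol X for the recorded observation is an assumed label, and the last line additionally assumes that the errors of the two weighings are uncorrelated with each other and have equal variance, so that the test-retest correlation estimates RXX.

```latex
\begin{align*}
X &= X_{\mathrm{True}} + X_{\varepsilon}, \qquad
  \operatorname{E}[X_{\varepsilon}] = 0, \qquad
  \operatorname{Cov}(X_{\mathrm{True}}, X_{\varepsilon}) = 0 \\
\operatorname{Var}(X) &= \operatorname{Var}(X_{\mathrm{True}}) + \operatorname{Var}(X_{\varepsilon}) \\
R_{XX} &= \frac{\operatorname{Var}(X_{\mathrm{True}})}{\operatorname{Var}(X)}, \qquad
1 - R_{XX} = \frac{\operatorname{Var}(X_{\varepsilon})}{\operatorname{Var}(X)}
\end{align*}
```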
The smaller the RXX of a given measure, the weaker its observed correlation with other measures in the data will be; the larger the RXX, the stronger. Measurement error attenuates the observed correlation below what the underlying relationship would produce. Those other measures can be sets of target or dependent variables chosen for predictive analytics, or variables sampled more or less at random for data mining, since we are talking about Big Data.
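A small simulation can illustrate the effect. It is a sketch under stated assumptions: the underlying correlation of 0.6, the sample size, and the three reliability levels are all invented, only one of the two variables is measured with error, and `observed_correlation` and `pearson_r` are hypothetical helper names. Under the classical model, the observed correlation should come out near the true correlation times the square root of RXX.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def observed_correlation(reliability, true_rho=0.6, n=20000, seed=7):
    """Correlation between a noisy measure of x and an error-free y."""
    rng = random.Random(seed)
    # With Var(true) = 1, this error SD yields the requested reliability.
    error_sd = math.sqrt((1 - reliability) / reliability)
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    y = [true_rho * t + math.sqrt(1 - true_rho ** 2) * rng.gauss(0, 1)
         for t in x_true]
    x_obs = [t + rng.gauss(0, error_sd) for t in x_true]
    return pearson_r(x_obs, y)


# Assumed reliability levels, chosen only for illustration.
results = {rel: observed_correlation(rel) for rel in (1.0, 0.8, 0.5)}
for rel, r in results.items():
    print(f"RXX = {rel:.1f}: observed r = {r:.3f}, "
          f"theory = {0.6 * math.sqrt(rel):.3f}")
```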
Ok, so you knew some of that. The weight example is almost risibly simplistic. Most small errors, though, that contaminate, say, 1-5 percent of a dataset, are cumulatively additive. This may be especially so in what you might think are elementary processes like counting and summing rows and columns of enormous datasets. As dataset size increases, small errors become bigger, in number and size. This happens especially when we talk about the leaps in orders of magnitude that tera- and petabyte size datasets represent over what we used to think was a lot of data. This can generate a lot more error—and unreliability across both descriptive and analytic data runs—than unwary CS-IT or statistical data scientists might expect. The issue immediately compromises data quality, and in connection, the validity of any analytic results. In short, your findings will be suspect.
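One way to see why size alone does not wash this out is a toy simulation of a column total. It assumes, purely for illustration, that every row carries zero-mean random error and that 2 percent of rows also carry a fixed error of +5 units; `column_error` is a hypothetical helper. Under those assumptions the error in the total grows roughly in proportion to the number of rows, and the error in the mean settles near 0.02 × 5 = 0.1 instead of shrinking toward zero.

```python
import random


def column_error(n_rows, contamination_rate=0.02, bias=5.0,
                 noise_sd=1.0, seed=1):
    """Return (error in the column total, error in the column mean)."""
    rng = random.Random(seed)
    total_error = 0.0
    for _ in range(n_rows):
        # Random error with zero mean: tends to cancel out across rows.
        total_error += rng.gauss(0, noise_sd)
        # Systematic error in contaminated rows: does not cancel out.
        if rng.random() < contamination_rate:
            total_error += bias
    return total_error, total_error / n_rows


# Assumed dataset sizes, chosen only for illustration.
sizes = (1_000, 10_000, 100_000)
errors = {n: column_error(n) for n in sizes}
for n, (total, mean) in errors.items():
    print(f"{n:>7} rows: error in total = {total:9.1f}, "
          f"error in mean = {mean:.3f}")
```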
Why analytic validity? A measure is valid, of course, if it measures what you think it’s measuring and, in doing so, is relevant to answering your business or organizational questions. A measure or set of measures can only be valid if the data on which it is based are reliable, i.e., consistent and repeatable. Unreliable data cannot represent the thing you want to know. Validity does not work that way: you cannot have a valid measure with unreliable data.
The point of this discussion is that DR&AV requires thought: it has to be adapted to Big Data, and Big Data practice has to adapt to it. If you don’t pay attention, unreliable measurements can bite back. You won’t detect, let alone mitigate, the source(s) of unreliability, and unreliable measures will ruin your analytic results. Irrelevant or nonsense measures will mask functional or structural (fixed-effect) relationships, increasing the probability that you won’t find a relationship even if there is one, i.e., a Type II error. Gag. Embarrassing, at best. Before you begin an analysis, then, establish estimates of reliability for your chosen variables. It’s a mark of statistical due diligence, and your clients will thank you for it.
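The Type II risk can be sketched with a simulation. All settings are assumptions chosen for illustration: a real underlying correlation of 0.3, samples of 50, 500 simulated studies per reliability level, and a two-sided test at roughly the 5 percent level using the Fisher z approximation. `miss_rate` and `pearson_r` are hypothetical helper names.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def miss_rate(reliability, true_rho=0.3, n=50, trials=500, seed=11):
    """Share of simulated studies that fail to detect a real correlation."""
    rng = random.Random(seed)
    # With Var(true) = 1, this error SD yields the requested reliability.
    error_sd = math.sqrt((1 - reliability) / reliability)
    misses = 0
    for _ in range(trials):
        x_true = [rng.gauss(0, 1) for _ in range(n)]
        y = [true_rho * t + math.sqrt(1 - true_rho ** 2) * rng.gauss(0, 1)
             for t in x_true]
        x_obs = [t + rng.gauss(0, error_sd) for t in x_true]
        r = pearson_r(x_obs, y)
        # Fisher z approximation: a normal-theory test, not an exact one.
        z = math.atanh(r) * math.sqrt(n - 3)
        if abs(z) < 1.96:
            misses += 1  # the relationship exists but was not detected
    return misses / trials


# Assumed reliability levels, chosen only for illustration.
rates = {rel: miss_rate(rel) for rel in (1.0, 0.8, 0.5)}
for rel, rate in rates.items():
    print(f"RXX = {rel:.1f}: estimated Type II error rate = {rate:.2f}")
```

Under these assumptions, the share of missed relationships should rise as reliability falls, which is the masking effect described above.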
About the Author:
Bill Luker Jr., PhD, is an economist, applied econometrician, and methodologist, with 30+ years of analytic experience in private industry, US and state government, academia, and community and labor organizations. Most recently, he was a Senior Vice President at Citibank, overseeing Documentation for CCAR Modeling in fulfillment of the Bank’s regulatory obligations to the US Federal Reserve under the Dodd-Frank Act. In 2017, he founded a boutique economics and statistics consultancy, terraNovum Solutions. He co-authored Signal From Noise, Information From Data (https://tinyurl.com/yccjqyo9), an introductory statistics textbook for business and government professionals, and has 30 more academic and professional publications to his credit. drblukerjr [at] gmail [dot] com +1 940-435-2028.
The Machine Learning Times © 2019 • 211 E. Victoria Street, Suite E •
Santa Barbara, CA 93101
Produced by: Rising Media & Prediction Impact