Businesses often have a glut of redundant data stored in multiple systems and in different formats. While many companies realize the importance of normalizing the data to minimize data redundancy, transformation is also important.
“Some people will take the same data and not realize they have to do some data transformation before they can make sense of it,” said Booz Allen Hamilton principal data scientist Kirk Borne. “Let’s say you’re looking for purchasing patterns based on income. You may have collected data sets from data sources, and even though it’s the same data, some portion of the data had the monthly data on families and another part of it included the annual income of families. If you don’t do the transformation, you end up completely misusing the data.”
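Borne's income example can be sketched in a few lines of Python. This is a minimal illustration, not anyone's actual pipeline: the records, the `period` field, and the `to_annual` helper are all hypothetical, and the sketch assumes every record states whether its income figure is monthly or annual.

```python
# Hypothetical records: field names and values are assumptions for illustration.
records = [
    {"family": "A", "income": 5000, "period": "monthly"},
    {"family": "B", "income": 60000, "period": "annual"},
    {"family": "C", "income": 4000, "period": "monthly"},
]

MONTHS_PER_YEAR = 12


def to_annual(record):
    """Return the record's income expressed as an annual figure."""
    if record["period"] == "monthly":
        return record["income"] * MONTHS_PER_YEAR
    if record["period"] == "annual":
        return record["income"]
    # Refuse to guess when the reporting period is unknown.
    raise ValueError(f"unknown period: {record['period']!r}")


# Transform every record to the same unit before comparing or aggregating.
annual_incomes = {r["family"]: to_annual(r) for r in records}

# Averaging the raw column mixes monthly and annual figures.
naive_mean = sum(r["income"] for r in records) / len(records)
correct_mean = sum(annual_incomes.values()) / len(annual_incomes)

print(f"mean without transformation: {naive_mean:.0f}")
print(f"mean after transformation:   {correct_mean:.0f}")
```

Without the transformation, the average mixes two units and understates income by more than half for this sample. Raising an error on an unknown period is a deliberate choice here, since silently passing a value through would reintroduce the problem.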
Data is often incomplete, inaccurate, inconsistent, irrelevant, and perhaps not even timely, so it needs to be cleansed. According to John R. Talburt and Yinle Zhou, authors of Entity Information Life Cycle for Big Data: Master Data Management and Data Integration, “If data cleansing and transformation processes are not applied, or applied differently at different times in an [error resolution] process, running the same input data using the same match rule could produce dramatically different results.”
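The effect the authors describe can be shown with a toy example. The sketch below is an assumption-laden illustration, not the method from the book: the company names are invented, the match rule is plain string equality, and `cleanse` is a hypothetical normalizer that lowercases, strips punctuation, and collapses whitespace.

```python
import re


def cleanse(name):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(name.split())


def count_matches(names, normalize=None):
    """Count record pairs that satisfy the match rule: exact string equality."""
    if normalize is not None:
        names = [normalize(n) for n in names]
    return sum(
        1
        for i in range(len(names))
        for j in range(i + 1, len(names))
        if names[i] == names[j]
    )


# Invented input data: three spellings of one company, two identical records of another.
names = ["Acme Corp.", "acme corp", "ACME  Corp", "Globex Inc", "Globex Inc"]

raw_matches = count_matches(names)                       # same rule, no cleansing
clean_matches = count_matches(names, normalize=cleanse)  # same rule, after cleansing

print(raw_matches, clean_matches)
```

The input data and the match rule are identical in both runs. Only the cleansing step differs, and the number of matched pairs changes with it.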
Different algorithms tend to yield different results. One algorithm may be better suited to a particular task, be more efficient, or introduce less uncertainty than another. Take the Latent Dirichlet Allocation (LDA) algorithm, for example, which is used to identify related topics in unstructured text. Luis N.A. Amaral, a professor at the McCormick School of Engineering and the Feinberg School of Medicine at Northwestern University, tested the popular algorithm and found that it was 90% accurate and 80% reproducible. Amaral said that, for an algorithm that solves a simple problem, LDA is less accurate than it should be.
Acceptable margins of error vary depending on several factors, including the acceptable level of risk.
Data models have parameters and other conditions that can cause results to differ. In a simulation, the parameter values must be specified in advance, so the output applies only to those values.
When University of Tennessee theoretical and computational astrophysics chair Tony Mezzacappa runs 3D simulations of supernovae, the outcomes depend on the spatial resolution of the model.
“Mother Nature is a continuum. When we model it, we grid it into something that is manageable so we have a discrete set of points in 3D space or an arbitrary number of dimensions of some abstract space, and we model the phenomenon on that limited subset of special points or whatever that may be in abstract space,” said Mezzacappa. He and his team rerun the simulations many times using finer and coarser resolutions to see whether the outcomes in the model change.
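The idea of rerunning a model at several resolutions can be illustrated with a far simpler stand-in than a supernova simulation. The sketch below is not Mezzacappa's code: it assumes a one-dimensional problem, approximating the integral of sin(x) from 0 to pi (which is exactly 2) with the trapezoid rule, and it uses an arbitrary set of grid sizes.

```python
import math


def trapezoid(f, a, b, n):
    """Approximate the integral of f on [a, b] using n equal grid cells."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


# Grid sizes chosen arbitrarily for illustration, from coarse to fine.
resolutions = [4, 16, 64, 256]
estimates = [trapezoid(math.sin, 0.0, math.pi, n) for n in resolutions]

# How much the answer moves each time the grid is refined.
changes = [abs(fine - coarse) for coarse, fine in zip(estimates, estimates[1:])]

for n, estimate in zip(resolutions, estimates):
    print(f"{n:4d} cells: {estimate:.6f}")
print("change between successive resolutions:", [f"{c:.6f}" for c in changes])
```

If the change between successive resolutions keeps shrinking, the result is becoming insensitive to the grid. If it does not, the outcome still depends on the discretization rather than on the phenomenon being modeled.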