Machine Learning Times

This excerpt is from InformationWeek.

8 years ago
6 Causes Of Big Data Discrepancies


The same data can yield wildly different results. Here are some of the reasons for these fascinating, frustrating, or even dangerous discrepancies.

Same Data, Different Quality

Businesses often have a glut of redundant data stored in multiple systems and in different formats. While many companies realize the importance of normalizing the data to minimize data redundancy, transformation is also important.

“Some people will take the same data and not realize they have to do some data transformation before they can make sense of it,” said Booz Allen Hamilton principal data scientist Kirk Borne. “Let’s say you’re looking for purchasing patterns based on income. You may have collected data sets from data sources, and even though it’s the same data, some portion of the data had the monthly data on families and another part of it included the annual income of families. If you don’t do the transformation, you end up completely misusing the data.”
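
The income example Borne describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `IncomeRecord` type, its `period` field, and the sample figures are hypothetical and do not come from the article. It shows how averaging mixed monthly and annual figures without a transformation gives a misleading number.

```python
from dataclasses import dataclass

MONTHS_PER_YEAR = 12


@dataclass
class IncomeRecord:
    """Hypothetical record; `period` says how the income was reported."""
    family_id: str
    income: float
    period: str  # assumed to be "monthly" or "annual"


def to_annual(record: IncomeRecord) -> float:
    """Convert a reported income to an annual figure."""
    if record.period == "annual":
        return record.income
    if record.period == "monthly":
        return record.income * MONTHS_PER_YEAR
    raise ValueError(f"unknown period: {record.period!r}")


# Made-up sample data: two families with the same real income,
# reported on different time scales.
records = [
    IncomeRecord("A", 5000.0, "monthly"),
    IncomeRecord("B", 60000.0, "annual"),
]

# Mixing units silently: the average is not a meaningful income.
naive_mean = sum(r.income for r in records) / len(records)

# Transforming first: both values are annual before they are combined.
annual_mean = sum(to_annual(r) for r in records) / len(records)

print(naive_mean, annual_mean)
```

In practice the reporting period is often not stored alongside the value, which is why the mismatch is easy to miss.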

Data-Cleansing Issues

Data is often incomplete, inaccurate, inconsistent, irrelevant, and perhaps not even timely, so it needs to be cleansed. According to John Talburt and Yinle Zhou, authors of Entity Information Lifecycle for Big Data: Master Data Management and Information Integration, “If data cleansing and transformation processes are not applied, or applied differently at different times in an [error resolution] process, running the same input data using the same match rule could produce dramatically different results.”
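
A small Python sketch can make the point about match rules concrete. It is a toy illustration, not the authors' method: the `cleanse` function, the exact-equality match rule, and the sample records are all assumptions made for the example. The same input and the same rule yield different match counts depending on whether cleansing is applied first.

```python
def cleanse(value: str) -> str:
    """Hypothetical cleansing step: collapse whitespace and lowercase."""
    return " ".join(value.split()).lower()


def count_matches(records, normalize):
    """Count records that duplicate an earlier record.

    The match rule is simple equality on (name, city) after `normalize`
    has been applied to each field.
    """
    seen = set()
    matches = 0
    for name, city in records:
        key = (normalize(name), normalize(city))
        if key in seen:
            matches += 1
        else:
            seen.add(key)
    return matches


# Made-up records: the first three refer to the same person.
records = [
    ("Ann Lee", "Boston"),
    ("ann  lee ", "boston"),
    ("ANN LEE", "Boston "),
    ("Bob Ray", "Austin"),
]

raw_matches = count_matches(records, lambda value: value)  # no cleansing
clean_matches = count_matches(records, cleanse)            # cleansing first

print(raw_matches, clean_matches)
```

Without cleansing, the rule finds no duplicates; with cleansing, it finds two.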

Problems With The Algorithm

Different algorithms tend to yield different results. One algorithm may be better suited to a particular task, be more efficient, or introduce less uncertainty than another. Take the Latent Dirichlet Allocation (LDA) algorithm, for example, which is used to identify related topics in unstructured text. Luis N.A. Amaral, a professor at the McCormick School of Engineering and the Feinberg School of Medicine at Northwestern University, tested the popular algorithm and found that it was 90% accurate and 80% reproducible. Amaral said the LDA algorithm is less accurate than it should be, given that it solves a simple problem.
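
One rough way to think about reproducibility is to rerun a randomized algorithm with different random seeds and ask how often the runs agree. The sketch below does that with a tiny one-dimensional k-means, used here only as a stand-in: it is not LDA, and it is not how Amaral measured reproducibility. The function names, the sample points, and the agreement metric are assumptions made for illustration.

```python
import random


def kmeans_1d(points, k, seed, iterations=20):
    """Toy 1D k-means with random initialization; returns sorted centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return tuple(sorted(centers))


def reproducibility(points, k, seeds):
    """Fraction of runs that agree with the most common outcome."""
    results = [
        tuple(round(c, 6) for c in kmeans_1d(points, k, seed))
        for seed in seeds
    ]
    most_common = max(set(results), key=results.count)
    return results.count(most_common) / len(results)


# Made-up data: three loose groups, but the algorithm is asked for two,
# so the outcome can depend on where the initial centers land.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.2, 8.7]
score = reproducibility(points, k=2, seeds=range(10))
print(f"share of runs agreeing with the most common result: {score:.0%}")
```

A score of 100% would mean every seed produced the same centers; anything lower means the same data gave different answers from run to run.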

Acceptable margins of error vary depending on several factors, including the acceptable level of risk.


Models Differ

Data models have parameters and other conditions that can cause results to differ. In a simulation, the parameter values must be specified up front, so the output applies only to those values.

When University of Tennessee theoretical and computational astrophysics chair Tony Mezzacappa runs 3D simulations of supernovae, the outcomes depend on the spatial resolution of the model.

“Mother Nature is a continuum. When we model it, we grid it into something that is manageable so we have a discrete set of points in 3D space or an arbitrary number of dimensions of some abstract space, and we model the phenomenon on that limited subset of special points or whatever that may be in abstract space,” said Mezzacappa. He and his team rerun the simulations many times using finer and coarser resolutions to see whether the outcomes in the model change.
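
The idea of rerunning a model at finer and coarser resolutions can be illustrated with a much simpler problem than a supernova. The sketch below is a toy under stated assumptions, not Mezzacappa's code: it approximates the area under sin(x) from 0 to pi (whose exact value is 2) on grids of increasing resolution and checks how much the answer changes each time the grid is refined. The function name and the chosen resolutions are hypothetical.

```python
import math


def midpoint_integral(f, a, b, cells):
    """Approximate the integral of f over [a, b] on a grid of `cells` cells."""
    width = (b - a) / cells
    return sum(f(a + (i + 0.5) * width) for i in range(cells)) * width


# Rerun the same calculation on progressively finer grids.
resolutions = [10, 20, 40, 80]
estimates = [midpoint_integral(math.sin, 0.0, math.pi, n) for n in resolutions]

# How much the result moves each time the resolution is doubled.
changes = [abs(fine - coarse) for coarse, fine in zip(estimates, estimates[1:])]

for n, estimate in zip(resolutions, estimates):
    print(f"{n:3d} cells: {estimate:.6f}")
```

When the change between successive resolutions keeps shrinking, the result is becoming less sensitive to the grid; if it does not shrink, the outcome is still an artifact of the resolution.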
