6 Causes Of Big Data Discrepancies

This excerpt is from InformationWeek. To view the whole article click here. ♦

Jun 7, 2015

By: Lisa Morgan, Freelance Writer, InformationWeek
Originally published at www.informationweek.com

The same data can yield wildly different results. Here are some of the reasons for these fascinating, frustrating, or even dangerous discrepancies.

Same Data, Different Quality

Businesses often store redundant data in multiple systems and in different formats. Many companies recognize the importance of normalizing data to minimize redundancy, but transforming it into consistent units and formats matters just as much.

“Some people will take the same data and not realize they have to do some data transformation before they can make sense of it,” said Booz Allen Hamilton principal data scientist Kirk Borne. “Let’s say you’re looking for purchasing patterns based on income. You may have collected data sets from data sources, and even though it’s the same data, some portion of the data had the monthly data on families and another part of it included the annual income of families. If you don’t do the transformation, you end up completely misusing the data.”
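Borne's income example can be sketched in a few lines of Python. This is a minimal illustration, not code from the article: the IncomeRecord type, the to_annual helper, and the sample figures are all hypothetical, and it assumes every record is tagged as either monthly or annual.

```python
from dataclasses import dataclass

MONTHS_PER_YEAR = 12


@dataclass
class IncomeRecord:
    """Hypothetical record: one family's income and the period it covers."""
    family_id: str
    income: float
    period: str  # assumed to be "monthly" or "annual"


def to_annual(record: IncomeRecord) -> float:
    """Return the record's income expressed per year."""
    if record.period == "annual":
        return record.income
    if record.period == "monthly":
        return record.income * MONTHS_PER_YEAR
    raise ValueError(f"unknown period: {record.period!r}")


# Made-up sample data: both families earn the same amount per year,
# but the two sources report it over different periods.
records = [
    IncomeRecord("A", 5_000.0, "monthly"),
    IncomeRecord("B", 60_000.0, "annual"),
]

# Averaging the raw numbers mixes monthly and annual figures.
naive_mean = sum(r.income for r in records) / len(records)

# Averaging after the transformation compares like with like.
annual_mean = sum(to_annual(r) for r in records) / len(records)

print(naive_mean, annual_mean)
```

The raw average understates income because one record is twelve times smaller than it should be, which is the kind of misuse Borne describes.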

Data-Cleansing Issues

Data is often incomplete, inaccurate, inconsistent, irrelevant, or out of date, so it needs to be cleansed. According to John Talburt and Yinle Zhou, authors of Entity Information Life Cycle for Big Data: Master Data Management and Data Integration, “If data cleansing and transformation processes are not applied, or applied differently at different times in an [entity resolution] process, running the same input data using the same match rule could produce dramatically different results.”
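The effect of skipping or changing a cleansing step can be shown with a toy exact-match rule. This is only a sketch under stated assumptions: the cleanse and count_matches helpers and the company names are invented for illustration, and a real entity resolution process would use far richer rules.

```python
def cleanse(name: str) -> str:
    """Drop periods and commas, lowercase, and collapse extra whitespace."""
    stripped = name.replace(".", "").replace(",", "")
    return " ".join(stripped.lower().split())


def count_matches(left, right, normalize=None):
    """Count pairs judged equal by an exact-match rule.

    The match rule itself never changes; only the optional
    normalization applied beforehand does.
    """
    if normalize is not None:
        left = [normalize(x) for x in left]
        right = [normalize(x) for x in right]
    return sum(1 for a in left for b in right if a == b)


# Hypothetical records for the same three companies in two systems.
crm = ["Acme Corp.", "  Globex  Inc", "Initech"]
billing = ["acme corp", "Globex Inc", "Initech"]

raw_matches = count_matches(crm, billing)
clean_matches = count_matches(crm, billing, normalize=cleanse)

print(raw_matches, clean_matches)
```

Same input, same match rule, different answers: only one pair matches on the raw strings, while all three match once both lists pass through the same cleansing step.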

Problems With The Algorithm

Different algorithms tend to yield different results. One algorithm may be better suited to a particular task, more efficient, or less prone to introducing uncertainty than another. Take the Latent Dirichlet Allocation (LDA) algorithm, for example, which is used to identify related topics in unstructured text. Luis N.A. Amaral, a professor at the McCormick School of Engineering and the Feinberg School of Medicine at Northwestern University, tested the popular algorithm and found that it was 90% accurate and 80% reproducible. Amaral said that is less accurate than the algorithm should be, given that it solves a simple problem.

Acceptable margins of error vary depending on several factors, including the acceptable level of risk.
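One way to think about a reproducibility figure like the one Amaral reports is to rerun a randomized algorithm with different seeds and measure how often the runs agree. The sketch below does that for a tiny two-cluster routine rather than LDA itself; the two_means and reproducibility functions, the sample points, and the agreement measure are all assumptions chosen for illustration, not Amaral's test procedure.

```python
import random
from collections import Counter


def two_means(points, seed, iterations=20):
    """Split 1-D points into two clusters from randomly chosen starting centers."""
    rng = random.Random(seed)
    c0, c1 = rng.sample(points, 2)
    for _ in range(iterations):
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    labels = tuple(abs(p - c0) <= abs(p - c1) for p in points)
    # Canonical form: the first point is always labeled True, so two runs
    # that find the same grouping with swapped labels count as equal.
    return labels if labels[0] else tuple(not flag for flag in labels)


def reproducibility(points, seeds):
    """Fraction of runs that agree with the most common clustering."""
    outcomes = Counter(two_means(points, s) for s in seeds)
    return outcomes.most_common(1)[0][1] / len(seeds)


# Made-up data with one ambiguous point in the middle.
points = [0.8, 1.0, 1.2, 3.0, 4.9, 5.0, 5.3]
score = reproducibility(points, seeds=range(50))
print(f"{score:.0%} of runs agree with the most common result")
```

A fixed seed always reproduces the same clustering; the score only drops below 100% when different starting points lead the algorithm to different answers on the same data.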


Models Differ

Data models have parameters and other conditions that can cause results to differ. In a simulation, the parameter values must be specified up front, so the output applies only to those values.

When University of Tennessee theoretical and computational astrophysics chair Tony Mezzacappa runs 3D simulations of supernovae, the outcomes depend on the spatial resolution of the model.

“Mother Nature is a continuum. When we model it, we grid it into something that is manageable so we have a discrete set of points in 3D space or an arbitrary number of dimensions of some abstract space, and we model the phenomenon on that limited subset of spatial points or whatever that may be in abstract space,” said Mezzacappa. He and his team rerun the simulations many times using finer and coarser resolutions to see whether the outcomes in the model change.
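The idea of rerunning a model at finer and coarser resolutions can be illustrated with a much simpler problem than a supernova. The sketch below is an assumption-laden stand-in, not Mezzacappa's method: it grids a 1-D interval, approximates the integral of sin(x) from 0 to pi (exact value 2) with the midpoint rule, and checks how much the answer moves as the grid is refined. The function name and the chosen resolutions are hypothetical.

```python
import math


def midpoint_integral(f, a, b, cells):
    """Approximate the integral of f over [a, b] on a grid of equal cells."""
    width = (b - a) / cells
    return width * sum(f(a + (i + 0.5) * width) for i in range(cells))


# Coarse, medium, and fine grids over the same continuous problem.
resolutions = [4, 16, 64]
estimates = {
    n: midpoint_integral(math.sin, 0.0, math.pi, n) for n in resolutions
}

# How much the result moves each time the grid is refined.
changes = [
    abs(estimates[finer] - estimates[coarser])
    for coarser, finer in zip(resolutions, resolutions[1:])
]

for n in resolutions:
    print(f"{n:3d} cells: {estimates[n]:.6f}")
print("change between successive grids:", changes)
```

If the change between successive grids keeps shrinking, the result is probably not an artifact of the grid; if it does not, the outcome still depends on the resolution chosen.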


The Machine Learning Times © 2026 • 1221 State Street • Suite 12, 91940 • Santa Barbara, CA 93190
Produced by: Rising Media & Prediction Impact
