Machine Learning Times

8 years ago
“The hungry statistician” – or why we never can get enough data


As the “Year of Statistics” comes to a close, I write this blog in support of the many statisticians who carefully fulfil their analysis tasks day by day, and to defend what may appear to be demanding behavior when it comes to data requirements.

How do statisticians get this reputation?

Are we really that complicated, with data requirements that are hard to fulfil? Or are we just pushed by the business question itself and the subsequent demands of the appropriate analytical methods?

Let’s accept the presumption of innocence and postulate that none of us ask IT departments to excavate data from historical time periods or add just any old variable to the data mart just for fun.

Very often a statistician’s work is assessed on the quality of the analytical results. The more convincing, beneficial, and significant the results, the more readily the success is attributed to the statistician. It is also well known that good results depend on high-quality, meaningful data. Thus it is understandable that statisticians emphasize the importance of the data warehouse.

This insistence is not driven by a selfish wish “to appear in a good light,” but by the fact that we take seriously our work of making well-informed decisions based on the analysis.

Let’s consider three frequent data requirements in more detail.

Who is interested in old news? – can we learn from history?

In order to make projections about the future, historic patterns need to be discovered, analysed, and extrapolated. To do so, historic data is needed. For many operational IT systems, historic versions of the data are irrelevant; their focus is on having the current version of the data to keep the operational process up and running.

Consider the example of a tariff (fee) change with your mobile phone provider. The operational billing system primarily requires the currently contracted tariff, in order to bill each phone call correctly. To analyze customer behaviour, however, we must also know the prior tariff in order to find out which patterns of tariff change frequently lead to a certain event, such as a product upgrade or a cancellation.
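As a minimal sketch of the tariff example, assume a hypothetical history table that keeps one row per tariff a customer has held, rather than only the current one. The table layout, customer IDs, and tariff names below are invented for illustration; the point is only that tariff-change patterns can be counted once prior tariffs are retained.

```python
from collections import Counter
from datetime import date

# Hypothetical history table: (customer, tariff, valid_from).
# An operational billing system would keep only the latest row per customer.
tariff_history = [
    ("C1", "Basic", date(2013, 1, 1)),
    ("C1", "Premium", date(2013, 6, 1)),
    ("C2", "Premium", date(2013, 2, 1)),
    ("C2", "Basic", date(2013, 7, 1)),
    ("C3", "Basic", date(2013, 3, 1)),
]


def tariff_transitions(history):
    """Return (customer, old_tariff, new_tariff) for every recorded change."""
    by_customer = {}
    # Sort by customer, then by start date, so tariffs are in time order.
    for customer, tariff, _valid_from in sorted(history, key=lambda r: (r[0], r[2])):
        by_customer.setdefault(customer, []).append(tariff)
    transitions = []
    for customer, tariffs in by_customer.items():
        for old, new in zip(tariffs, tariffs[1:]):
            transitions.append((customer, old, new))
    return transitions


transitions = tariff_transitions(tariff_history)
# Count how often each change pattern occurs across customers.
pattern_counts = Counter((old, new) for _, old, new in transitions)
```

With only the current tariff per customer, `transitions` would be empty and no change pattern could be related to upgrades or cancellations.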

For many analyses we need to differentiate between historic data (what happened on a past day) and historic snapshots of data (what was known about that day at an earlier point in time).

To forecast the number of rented cars for a car rental agency for the next four weeks, we may need not only the daily number of rented cars but often also the bookings that had already been received by certain earlier dates. For example, for November 18, 2013, the statistical model will use the following data:

  • Number of rented cars on November 18
  • Number of bookings for the rental day, November 18, that are known as of November 17 (the day before)
  • Number of bookings for the rental day, November 18, that are known as of November 16 (two days before)

As the historic booking status for a selected rental day is continuously overwritten by the operational system, the required data can only be provided if they are historicized in a data warehouse.
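The idea of historicizing the booking status can be sketched as follows. This is an illustration under assumptions, not a description of any particular data warehouse: the snapshot store, the function names, and all booking numbers are hypothetical. Each booking status is stored under the pair (rental day, snapshot date) instead of overwriting the previous status.

```python
from datetime import date, timedelta

# Hypothetical snapshot store:
# (rental_day, snapshot_date) -> number of bookings known on snapshot_date.
booking_snapshots = {}


def record_snapshot(rental_day, snapshot_date, bookings_known):
    """Keep the booking status per snapshot date instead of overwriting it."""
    booking_snapshots[(rental_day, snapshot_date)] = bookings_known


def model_inputs(rental_day, rented_cars, days_before=(1, 2)):
    """Assemble one row of model inputs for a rental day."""
    row = {"rental_day": rental_day, "rented_cars": rented_cars}
    for d in days_before:
        snapshot_date = rental_day - timedelta(days=d)
        # None signals that the snapshot was never stored.
        row[f"bookings_known_{d}d_before"] = booking_snapshots.get(
            (rental_day, snapshot_date)
        )
    return row


day = date(2013, 11, 18)
record_snapshot(day, date(2013, 11, 16), 41)  # illustrative numbers only
record_snapshot(day, date(2013, 11, 17), 55)
row = model_inputs(day, rented_cars=62)
```

If only the latest booking status were kept, the entries for one and two days before could not be reconstructed afterwards.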

“More” is almost always better

In order to get well-founded conclusions from statistical results, a certain minimum data quantity is needed. This minimum quantity (also called sample size or number of cases) depends on the analysis task and the distribution of the data. The area of sample size planning deals with determining the number of cases that are needed to make sure that a potential difference in the data can be recognized with a given statistical significance.
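To make sample size planning concrete, here is a minimal sketch for one common textbook case: comparing the means of two equally sized groups with a two-sided test, using the normal approximation n = 2 * ((z_alpha + z_beta) * sigma / delta)^2 per group. The choice of this particular design, the default significance level and power, and the function name are assumptions made for illustration; other analysis tasks need other formulas.

```python
import math
from statistics import NormalDist


def cases_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Cases needed per group to detect a mean difference `delta`.

    Assumes a two-sided test, equal group sizes, a known common
    standard deviation `sigma`, and the normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # quantile for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


n_half_sd = cases_per_group(delta=0.5, sigma=1.0)
n_quarter_sd = cases_per_group(delta=0.25, sigma=1.0)
```

Halving the difference that should be detectable roughly quadruples the required number of cases, which is one reason why data requirements can grow quickly.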

In predictive modelling, where for example the probability of a certain event is predicted, we care not only about the number of observations but also about the number of events. In a campaign response analysis, a data sample with 30 buyers and 70 non-buyers will allow us to draw better conclusions about the reasons for a product purchase than a sample with only five buyers and 95 non-buyers (although both cases have a sample size of n=100).
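One way to see the difference between the two samples is the widely cited "events per variable" rule of thumb, which suggests roughly ten cases of the rarer outcome per candidate predictor in a model such as logistic regression. The rule and the threshold of ten are a heuristic brought in here for illustration, not a claim from this article, and the function name is hypothetical.

```python
def supported_predictors(events, non_events, events_per_variable=10):
    """Rough number of predictors a binary-outcome model can support.

    Uses the rarer outcome as the limiting quantity; the default of ten
    events per variable is a rule of thumb, not a hard limit.
    """
    limiting = min(events, non_events)
    return limiting // events_per_variable


balanced_sample = supported_predictors(events=30, non_events=70)
sparse_sample = supported_predictors(events=5, non_events=95)
```

Under this heuristic the sample with 30 buyers supports a small model, while the sample with five buyers supports essentially none, even though both contain 100 cases.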

“More” can also mean that a larger number of attributes need to be in the data warehouse. These additional attributes can potentially increase the accuracy of the prediction or reveal additional relationships. The increase in the number of attributes can be achieved by including additional data sources or by creating derived variables from transactional data.

Analyzing more data can also make the data volume hard to handle. Because of its computing power and the ability to handle large amounts of data (big data), SAS has always been excellently prepared for that task. Now we also offer a specialized SAS High Performance Solution.

Detailed data vs. aggregated data – or why external data are not always the solution to the data availability problem

External data are often considered the solution for enriching analysis data with those aspects that are missing in your own data. In many cases this works well; for example, where socio-demographic data per district is used to describe customer background.

Sometimes, however, the features in these data are not available for individual customers but only aggregated by group, whereas analytical methods often need detailed data per analysis subject.

In my book “Data Quality for Analytics” I use an example of the performance of a sailboat during a sailing race. The boat has a GPS-tracking device on board but no wind-measuring device. Thus the position, speed, and compass heading are available for the boat but not the wind speed and the wind direction.

We could assume that “external data” from a meteorological station in the harbour could be a good substitute for these data. A more detailed view reveals that these data give a good picture of the general wind situation, but they are measured far away from the race area and are not representative of the individual race behaviour of a boat. In addition, they are only collected at five-minute intervals and do not allow a detailed analysis of short-term wind shifts.
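The effect of the five-minute interval can be illustrated with a small sketch. The wind directions below are invented, not measurements from the race: a hypothetical series with one reading every 30 seconds contains a short 20-degree wind shift that disappears completely when only every fifth minute is kept.

```python
# Hypothetical wind direction in degrees, one reading every 30 seconds for 10 minutes.
STEP_SECONDS = 30
readings = []
for i in range(21):  # t = 0, 30, ..., 600 seconds
    t = i * STEP_SECONDS
    # Invented short-term shift from 180 to 200 degrees between t = 90 and t = 210.
    direction = 200 if 90 <= t <= 210 else 180
    readings.append((t, direction))


def sample_every(series, interval_seconds):
    """Keep only the readings that fall on the coarser sampling grid."""
    return [(t, v) for t, v in series if t % interval_seconds == 0]


def spread(series):
    """Difference between the largest and smallest observed direction."""
    values = [v for _, v in series]
    return max(values) - min(values)


detailed_spread = spread(readings)
five_minute_spread = spread(sample_every(readings, 300))
```

In this constructed case the detailed series shows the shift, while the five-minute series reports a constant wind direction.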

We care! That’s why we are demanding.

When we request comprehensive, historical, detailed data, we statisticians do not want to be nasty; we just want to treat each analysis question with the appropriate care.

By: Dr. Gerhard Svolba, senior solutions architect and analytic expert, SAS Institute Inc. Originally published at
