By: Richard Boire, Senior Vice President, Environics Analytics
Some of the more recent literature has discussed pure random or noise variables ending up as key variables in predictive models. In our big data environment, with millions of records and thousands of variables, one might intuitively expect random or spurious variables to be a normal outcome in many models. As a data miner, I am always intrigued by colleagues who arrive at such findings through their research and work. In weighing the validity of these claims, I thought back to my own experience of building hundreds of models over the years across a variety of industry sectors. Our approaches to building robust models, however, made it unlikely that a “pure random” variable would end up in any final model equation. More about this approach will be discussed later on.
But in initially considering how I would either refute or support this hypothesis, I commenced my exploration from a “small” data standpoint. I created a series of datasets or scenarios where each dataset had a different number of records. Each dataset had five variables with four independent variables and the fifth being the target variable. All five variables were randomly generated using a random number generator. Multiple regressions were conducted on the four independent variables versus the target variable for each dataset. Listed below for each scenario or dataset are the p-values of each independent variable in both the development and validation datasets.
In the above table, you will observe that there are indeed random variables which appear to be significant, as denoted by the cells highlighted in yellow. Under four of the five scenarios, a random variable appears significant at the 95% confidence level. That is not surprising in itself: with 40 tests in total (four variables, two datasets, five scenarios), about two would be expected to fall below the 0.05 threshold by chance alone. In the 1,000-record and 5,000-record scenarios, the significant variable appears in development, while in the 10,000-record and 100,000-record scenarios, it appears in validation. In the 500,000-record scenario, no variable is significant in either development or validation. The real finding from all these results is that no variable is significant in both development and validation under any scenario. No trend emerges, whether we look at smaller or larger datasets, regarding the likelihood of random variables ending up in a final predictive model.
These results do not necessarily refute the findings in the literature. Instead, they point to the need for a more disciplined approach in determining whether variables are indeed random or spurious. The need for validation, validation, and more validation is simply reinforced as one task in eliminating random variables from any predictive analytics solution.
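The experiment described above can be approximated with a short simulation. This is only a minimal sketch under stated assumptions, not the code behind the article's table: the variables are drawn from a uniform distribution, development and validation are two independently generated samples of the same size, p-values use a normal approximation to the t distribution, and the names `invert`, `ols_p_values`, and `run_scenario` are hypothetical. Only the three smaller record counts are run here to keep a pure-Python version quick.

```python
import math
import random


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_p_values(xs, y):
    """Two-sided p-values for each predictor in a multiple regression with intercept.

    Assumption: the t statistic is treated as standard normal, which is a
    reasonable approximation for the record counts used here.
    """
    n, k = len(y), len(xs[0]) + 1
    rows = [[1.0] + list(x) for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    sigma2 = rss / (n - k)
    p_values = []
    for i in range(1, k):  # skip the intercept
        t = beta[i] / math.sqrt(sigma2 * inv[i][i])
        p_values.append(math.erfc(abs(t) / math.sqrt(2.0)))
    return p_values


def run_scenario(n_records, seed):
    """Generate pure-noise development and validation samples and fit both."""
    rng = random.Random(seed)

    def make_sample():
        xs = [[rng.random() for _ in range(4)] for _ in range(n_records)]
        y = [rng.random() for _ in range(n_records)]
        return xs, y

    return ols_p_values(*make_sample()), ols_p_values(*make_sample())


for n_records in (1000, 5000, 10000):
    dev, val = run_scenario(n_records, seed=n_records)
    in_both = [i + 1 for i in range(4) if dev[i] < 0.05 and val[i] < 0.05]
    print(n_records, [round(p, 3) for p in dev], [round(p, 3) for p in val], in_both)
```

The last column lists the variables that fall below 0.05 in both samples. Under this setup a single noise variable does so with probability 0.05 × 0.05 = 0.0025 per scenario, which is why checking both datasets is such an effective filter.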
Yet as the volume of records and variables continues to increase with Big Data, the rules of statistics dictate that many more variables are likely to be significant, and this might increase the likelihood of random variables ending up in a predictive analytics solution. But why are more variables likely to be significant in a big data environment? Understanding some of the basic tenets of statistics which govern significance helps to shed light on this matter. Most test statistics used to determine significance have a standard error in the denominator. A lower standard error increases the probability of significance, and a higher one decreases it. The standard error in turn has the square root of the sample size in its denominator (for a mean, it is the standard deviation divided by the square root of the number of records), so the higher the sample size, the lower the standard error and the greater the likelihood of some event or variable being significant. In a Big Data environment, this logic not surprisingly implies that many more variables are likely to be statistically significant.
But using measures of significance as a threshold for filtering out variables is only the first step. If hundreds of variables are now significant, we may want to select the top 200 variables as ranked by their correlation coefficient versus the target variable. As discussed in previous articles and in my book (Data Mining for Managers: How to Use Data (Big and Small) to Solve Business Challenges), we then run a series of step-wise routines as a key component in selecting variables for a final solution. This approach is one method that certainly mitigates the impact of random variables on the final solution. But validating the model, and looking at the variables themselves in both the development and validation datasets, is our final check on the robustness of our solution.
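The effect of sample size on significance, and the first stages of the selection funnel, can be illustrated with a small sketch. It rests on several assumptions that are not from the article: the function names are hypothetical, p-values use a normal approximation, data arrive as dictionaries of column name to list of values, and the step-wise stage is left out entirely. The only figure taken from the text is the cut-off of 200 variables.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_p_value(r, n):
    """Approximate two-sided p-value for a correlation r observed on n records."""
    if abs(r) >= 1.0:
        return 0.0
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))  # normal approximation (assumption)


def shortlist(dev, val, target, top_k=200, alpha=0.05):
    """Keep variables significant in BOTH samples with the same sign,
    ranked by absolute correlation with the target in development."""
    ranked = []
    for name in dev:
        if name == target:
            continue
        r_dev = pearson_r(dev[name], dev[target])
        r_val = pearson_r(val[name], val[target])
        sig_dev = correlation_p_value(r_dev, len(dev[target])) < alpha
        sig_val = correlation_p_value(r_val, len(val[target])) < alpha
        if sig_dev and sig_val and r_dev * r_val > 0:
            ranked.append((abs(r_dev), name))
    ranked.sort(reverse=True)
    return [name for _, name in ranked[:top_k]]


# The same very weak correlation becomes "significant" purely through volume.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, correlation_p_value(0.01, n))
```

A correlation of 0.01 is far from significant on a thousand records and overwhelmingly significant on a million, which is why a significance threshold alone filters out very little at Big Data volumes and why the ranking and validation steps carry most of the weight.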
Typically, the development of acquisition models represents the most likely scenario where random variables may appear in a final model. This is because real challenges exist in obtaining robust variables that could be considered for a final model. There are often fewer than fifty variables which are statistically significant in correlation analysis versus the target variable. From my practical experience, this limited information increases the likelihood of random variables appearing. These challenges are manifested in models with limited performance, yielding lift ratios between the top decile and bottom decile of less than 3 to 1. These sub-par lift ratios are the compromise made in finding variables that are significant in both the development and validation files. Even with Big Data, it is still difficult to find individual-level information that is privacy compliant and, more importantly, unbiased. Remember that these individuals are prospects, so any information collected from them is obtained as a result of the individual opting in through some survey or platform that captures their digital information. One might reasonably argue that this proactive behavior by the prospects is atypical and not representative of the prospect pool at large. In order to obtain a more unbiased prospect universe, the practitioner will opt for external geographic information which depicts the demographics and behaviors of the geographic area where the prospect resides. This can achieve lift, but it is clearly not as powerful as data at the individual level. As a result, we observe weak lift results, as it is difficult to find model variables that maintain consistent results in both the development and validation datasets.
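For readers less familiar with the lift ratio mentioned above, here is one way it might be computed. This is a hedged sketch: the function name `decile_lift_ratio` is hypothetical, outcomes are assumed to be coded 1 for a responder and 0 otherwise, and deciles are formed by simply cutting the score-ranked file into tenths.

```python
import math


def decile_lift_ratio(scores, outcomes):
    """Response rate in the top-scored decile divided by that in the bottom decile."""
    # Rank records from highest to lowest model score.
    ranked = [o for _, o in sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)]
    size = len(ranked) // 10
    if size == 0:
        raise ValueError("need at least 10 records to form deciles")
    top_rate = sum(ranked[:size]) / size
    bottom_rate = sum(ranked[-size:]) / size
    if bottom_rate == 0:
        return math.inf  # no responders at all in the bottom decile
    return top_rate / bottom_rate


# Toy file of 100 prospects: 6 responders in the top decile, 2 in the bottom.
scores = list(range(100))
outcomes = [1 if (i >= 94 or i < 2) else 0 for i in range(100)]
print(decile_lift_ratio(scores, outcomes))
```

In this toy file the top decile responds at 60% and the bottom at 20%, a ratio of about 3 to 1. Computing the same ratio separately on the development and validation files shows whether the lift holds up outside the data the model was built on.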
Very powerful individual-level data will keep random “noise” variables out of any model, presuming there is no inherent bias in the data. As mentioned above, a sound modelling approach encompassing both correlation analysis and a series of step-wise routines does mitigate the likelihood of noise variables in any equation. But as we have seen in virtually all discussions dealing with successful predictive analytics solutions.
Richard Boire, B.Sc. (McGill), MBA (Concordia), is the founding partner at the Boire Filler Group. He is a nationally recognized expert in the database and data analytical industry and is among the top experts in this field in Canada, with unique expertise and background experience. Boire Filler Group was recently acquired by Environics Analytics, where he is currently senior vice-president.
Mr. Boire’s mathematical and technical expertise is complemented by experience working at and with clients in the B2C and B2B environments. He previously worked at and with clients such as Reader’s Digest, American Express, Loyalty Group, and Petro-Canada, among many others, to establish his top-notch credentials.
After 12 years of progressive data mining and analytical experience, Mr. Boire established his own consulting company, Boire Direct Marketing, in 1994. He writes numerous articles for industry publications, is a sought-after speaker on data mining, and works closely with the Canadian Marketing Association in a number of areas, including Education and the Database and Technology councils. He is currently the Chair of Predictive Analytics World Toronto.