In the last article, I discussed the concept of feature engineering as comprising two components with the first component being the ability to create and derive meaningful variables in the analytical file which is used as the source information in the development of any predictive analytics solution. Within this first component, access to data has grown exponentially with data scientists now being able to use semi-structured and unstructured data as data inputs. Although the ETL process for this type of data requires new technical skills to essentially structure this data, the intensive data science work in creating and deriving variables still adheres to the same disciplines that have been used for the last 30 years.
Meanwhile, the second component looks at tools and techniques that are used against the analytical file in reducing or filtering out the number of fields which will be analyzed by the more advanced mathematical routines. As the data science industry matures as a discipline, new tools continue to emerge within the industry that facilitate this process for the practitioner Many schools of thought emanate from this discussion in how to best filter or reduce the variable inputs for modelling.. One of the most commonly discussed techniques is the use of PCA (principle components analysis), which is the commonly used version of factor analysis. Although PCA and factor analysis is a great data reduction technique, it is unsupervised implying that this information filter occurs without the use of any information from the target variable. Let’s take a look at an example.
In this simple example of nine variables, we have reduced the variable set down to three variable categories or factors (factor1, factor2,factor3) which can be used as input variables to a model. Based on the statistical parameters (eigenvalue cutoff) ) used by the analyst, factor analysis reduced the nine variables into three factors. Using the factor loading scores within each factor, we then identify those groups of variables with the largest factor loading score from each factor. In this type of exercise, we are then able to identify the key variable drivers within each factor which in this case would indicate the following:
The above example is simplistic in order to better demonstrate the benefits of factor analysis in its use as a data or variable reduction technique. In the real world though, this technique is applied against hundreds of thousands of variables. Yet, in developing predictive models, the question remains as to whether this technique is really optimal in the selection of variables. Factor analysis being an unsupervised technique is selecting variables based on a given factor’s variation relative to all the variation on the data. But shouldn’t techniques be used that look at variation of the data but with respect to the target variable’s overall variation. In other words, the selection of variables should be supervised based on the target variable we are trying to predict. Techniques such as correlation analysis and/or decision trees are examples of this supervised approach.
Once our analytical file with the many independent variables and dependent variable has been created, correlation analysis can be conducted to rank order variables based on the absolute value of the correlation coefficient. In developing a typical response model, the analyst might run several hundred variables to be correlated against response. The output would then indicate the top variables which are significant against response albeit in a uni-variate manner. Selecting the “top” variables is where the “art of data science” comes into play. In working with data sets that encompass hundreds of thousands of records, most of the several hundred variables will be statistically significant at a 95% level. But the variables are rank-ordered and could be used to select the top 50-100 variables which would be considered in the more robust multivariate routines. See example below for a typical response model which lists the top 4 variables as correlated against response.
The use of decision trees is another technique which can be used to detect if there are key interactions between variables. Although decision trees in many cases are used to actually create models, one can also apply decision trees against the top variables(top 20 variables) as determined from correlation. The critical piece of knowledge arising from the use of decision trees as opposed to correlation analysis are the identification of key interaction variables. For example, we might discover that age and gender such as males under 25 years represent a key interaction in predicting auto claim losses. In a response model, customer tenure and where they live might be another interaction variable in predicting response. In our example below which is trying to optimize response in an RRSP model, we see the interaction variables between financial holdings and tenure vs. financial holdings and relationships and mortgage.
Besides the determination of interaction variables, another capability of decision trees is the ability to group very sparse data into groups. For example, zip codes in the U.S. or postal codes in Canada represent very fine levels of geography that typically would contain a small proportion of individuals that reside in this area relative to those other records not residing in the area. Developing a binary variable at this level would be a meaningless exercise as in most cases the proportion of individuals residing in the area relative to those records not living in the area would be very small.(typically well under 5%). There is simply no way that a pattern or trend can be determined with < 5% of the individuals actually residing in that area. With decision- trees, these geo-area levels can then be grouped to larger geo-areas such that larger geo-area groupings now contain significant proportion of records(20% or higher). Here, the grouping determination is supervised in that the target variable or variable being modelled is optimizing the grouping categories based on how they impact the target variable. See example below.
Now the grouping levels can be altered (i.e. collapsed or expanded) based on statistical thresholds that the user sets when running the decision trees. As seen in this example, the utility of decision trees can be expanded beyond just the creation of predictive models but towards the area of variable creation or feature engineering.
Another example of filtering out variables is the use of exploratory data analysis techniques which enable the practitioner to identify the optimum transformation of a given variable but eliminate the original variable with its source information. Listed below is one example which examines the impact of the floor area of properties and its impact on frequency of claims on properties. In the example here, we observe that the trend tends to be bi modal rather than linear.
We now transform this variable into a more linear type relationship with claim frequency using the following rules:
We then rerun the EDA (exploratory data analysis) on the transformed variable(IND_FLOORAREA) where the trend is now linear.
All these filtering or variable reduction techniques allow the practitioner to identify those meaningful variables which will be input into the more advanced statistical routines. In many of these routines, this filtering out process is a critical component in optimizing these routines. The so-called “black box” nature of predictive models is mitigated as the filtering-out process enables the practitioner to be more knowledgeable on all the details of how the model was developed.
Both components (derivation of new variables and filtering- out variables) of feature engineering represent the most labor-intensive phase of the predictive analytics process. Yet, most practitioners consider feature engineering as being most important within the predictive analytics process. Using the analogy of cooking, one can compare feature engineering as the determination of the key inputs into a meal while the determination of the right temperature for cooking can be compared to using the right mathematical technique for predictive models. But as in predictive analytics, feature engineering is the most critical stage in cooking which is reinforced by the level of content in cookbooks that are devoted to inputs and ingredients.