On Variable Importance in Logistic Regression

The model looks good. It’s parsimonious, provides effective segmentation, and the predictors appear to be intuitively reasonable. While there is no problem with deploying results, you need to be able to order the variables in terms of their importance, a seemingly straightforward task.

However, this is not as easy as it appears. It’s not at all clear what ‘most important’ means. ‘Important’ might not mean the same thing to a researcher studying a treatment for an illness as it does to a marketer seeking to target the best customers.

Then there’s the issue of the type of analysis you are conducting. Most data scientists will agree that when a continuous variable is analyzed via regression, there is a more widely agreed-upon approach for answering our original question. Dealing with binary outcomes, the typical problem logistic regression addresses, is a different animal, and may require a bit more creativity to untangle.

Let’s first spend a moment discussing the analysis of a continuous variable, and then we’ll proceed to examining binary outcomes (the haves and have-nots), the primary objective of this article.

In this example, a retailer is investigating sales (continuous), trying to determine what is impacting this revenue metric, as well as which of the ‘impacts’ is greatest. Three predictors were employed to explain sales.

• Age
• Number of Previous visits to website
• Income

The following regression analysis was produced. At first glance, of the three predictors, previous visits appears to be most important. If we look under the column titled ‘B’, we find the weights that are applied to each of the three variables. The 297.537 is the largest, and this is associated with previous visits.

But wait a minute.

Regression coefficients, or weights, depict the association between each predictor and what we are trying to predict (sales, in our case). Each weight denotes the average change in sales for a one-unit increase in the predictor, with the other predictors held constant. It seems intuitive to conclude that variables with larger coefficients (weights) are more important because they point to larger changes in sales.

But there’s a problem in this thinking. Each variable is measured differently. Age is in years, income is in dollars, and visits is gauged on yet another scale!

By using the weights to compare among the three variables, we are essentially comparing apples to oranges! The objective then would be to place all three predictors on the same scale, so that we are equating apples with apples!

Enter the standardized coefficient. This statistic converts all the variables to the same scale, and thus allows a direct comparison of the modified weights to answer our original question: which is most important? Any text on this topic will provide the calculations used to standardize the coefficients.

One simply needs to identify the independent variable that has the largest absolute value for its standardized coefficient. Looking at the regression output under ‘standardized coefficients’, it is now clear that AGE plays the most important role, as its standardized weight, .422, is the largest.
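To make the rescaling concrete, here is a minimal Python sketch of the usual textbook calculation: the unstandardized weight multiplied by the standard deviation of the predictor and divided by the standard deviation of the outcome. The function name and the small data set are hypothetical and are not the retailer's figures.

```python
from statistics import stdev

def standardized_coefficient(b, x_values, y_values):
    """Return b * sd(x) / sd(y), the standardized (beta) weight."""
    return b * stdev(x_values) / stdev(y_values)

# Hypothetical illustration: made-up numbers, not the retailer's data.
visits = [1, 2, 3, 4, 5]
sales = [900, 1500, 1300, 2100, 2400]
b_visits = 360.0  # hypothetical unstandardized weight for visits

print(round(standardized_coefficient(b_visits, visits, sales), 3))
```

Because the result is expressed in standard deviation units, it can be compared across predictors that were originally measured in years, dollars, or counts.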

Enough for multiple regression. There is widespread, although not universal, agreement that standardization is the way to go.

Let’s turn our attention to logistic regression where the item to be analyzed is a yes/no type variable. Typical examples include response/nonresponse, buy/not buy, click/not click, etc.

I am going to briefly discuss five approaches that tackle our original question. The techniques include:

1. Standardization
2. Odds ratio
3. Wald
4. Tree analysis
5. Decile analysis

Take a look at the following mini case study that investigated the likelihood of a customer churning. Let’s reproduce the logistic regression output, and then suggest some approaches to attack our ‘most important’ variable problem.

The above table, standard output from IBM SPSS, does not include any reference to standardization, as did our multiple regression, continuous variable analysis.

Perhaps surprisingly, standardized regression coefficients do not appear to be typically employed in the logistic regression setting. Thumbing through several textbooks on logistic regression, I found only minimal reference to standardized coefficients in any of them. This could very well be because there does not appear to be an agreed-upon definition of standardization for logistic regression type problems.

Other researchers, however, disagree, and have suggested using the following formula to compute standardized coefficients for logistic regression.

We’ll use the ‘internet’ variable above to demonstrate the approach.

Standardized coefficient of ‘internet’ = (√3/π) × (standard deviation of ‘internet’) × (unstandardized coefficient of ‘internet’)

The standard deviation of any variable would have to be calculated from the data.
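The formula above can be sketched in a few lines of Python. The function name is hypothetical, and the use of the sample standard deviation (rather than the population one) is my assumption; the coefficient itself would come from your logistic regression output.

```python
import math
from statistics import stdev

def standardized_logit_coefficient(b, x_values):
    """Apply (sqrt(3) / pi) * sd(x) * b to one predictor.

    b        -- unstandardized logistic regression coefficient
    x_values -- observed values of that predictor
    """
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return (math.sqrt(3) / math.pi) * stdev(x_values) * b

# Hypothetical 0/1 'internet' indicator and coefficient, for illustration only.
internet = [0, 1, 0, 0, 1, 1, 0, 1]
print(standardized_logit_coefficient(0.8, internet))
```

The constant √3/π is the reciprocal of the standard deviation of the standard logistic distribution (π/√3), which is why it appears in the formula.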

I have appended the standardized coefficients to the above table by computing them with this formula. Now the analyst can compare the absolute values and quickly conclude that ‘Months of service’ is the most critical predictor. SAS users can invoke the STB option to calculate standardized coefficients. Those who prefer R can use beta.glm from the ‘reghelper’ package. ‘RELIMP’ is also a useful R routine. Python users can investigate scikit-learn to assist in generating standardized coefficients.

With these calculations completed for each variable, we are now in a position to interpret these standardized values much as we did in multiple regression. We can directly observe the values, rank them from high to low (ignoring the + or – sign), and report the order of importance.

2. Odds ratio

The odds ratio, or Exp(B) in the above table, is the conventional way of gauging the impact of a predictor variable on the dependent variable, churn likelihood. The issue with this metric is that all other predictors are held constant, so one cannot really assess the overall effect. The odds ratio does not take into account the variation of the other predictor variables. Moreover, because the variables are measured differently (the same concern we raised earlier), it may be inappropriate to rank the variables in terms of odds ratios. If all the variables were on the same scale, we would have a more suitable approach.

Odds ratios that are greater than 1 indicate that the event is more likely to occur as the predictor increases. Odds ratios that are less than 1 indicate that the event is less likely to occur as the predictor increases.
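As a rough illustration of how Exp(B) behaves, the sketch below converts a coefficient into an odds ratio for a chosen increase in the predictor. The function name and the coefficient values are hypothetical. Passing a predictor's standard deviation as `delta` is one possible way to put odds ratios on a common scale; that is my assumption and not something the SPSS output reports.

```python
import math

def odds_ratio(b, delta=1.0):
    """Multiplicative change in the odds for a `delta`-unit increase in the predictor."""
    return math.exp(b * delta)

# Hypothetical coefficients, for illustration only.
b_months = -0.04    # per month of service
b_internet = 0.90   # having internet versus not

print(odds_ratio(b_months))        # below 1: churn odds fall as months increase
print(odds_ratio(b_internet))      # above 1: churn odds rise with internet
print(odds_ratio(b_months, 20.0))  # odds ratio for a 20-month increase
```

Note how the odds ratio for months of service looks close to 1 per month, yet is far from 1 over a 20-month span. This is the scale problem described above.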

3. Wald: p-value of the Wald chi-square test

The Wald test (also called the Wald Chi-Squared Test) is used to assess whether predictors in a model are significant. This helps to demonstrate whether they provide some added value to the model.

Essentially, we are assessing whether the predictor has a ‘real’ effect. Small p‐values suggest that there is some indication of a non‐zero relationship with the dependent variable. Small p values are associated with larger Wald statistics.

If the Wald test fails to reject the hypothesis that the coefficient for a predictor variable is zero, this suggests that there is little value in incorporating that predictor in the model. It may be removed.

On the other hand, if the test indicates that the coefficient of the explanatory variable is not equal to zero, then the predictor should be a candidate for inclusion in the model.

The p-value and Wald statistic only denote that there is reason to believe that there is some relationship. The strength of the association is neither presented nor available. So, the Wald statistic indicates how strongly we believe that the weights, or coefficients, are not equal to zero.

While there are researchers who use the p-value, it doesn’t directly answer our initial question. Availability of this statistic makes it easy to use. But ease of use does not imply usefulness.
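For readers who want to see where the Wald numbers in this section come from, here is a minimal sketch for a single coefficient: the Wald statistic is the squared ratio of the coefficient to its standard error, compared against a chi-square distribution with one degree of freedom. The function name and inputs are hypothetical; the p-value uses the identity that a chi-square variable with one degree of freedom is a squared standard normal.

```python
import math

def wald_test(b, se):
    """Return (Wald chi-square statistic, p-value) for a single coefficient."""
    wald = (b / se) ** 2
    # Upper tail of a chi-square with 1 df: P(Z**2 > wald) = erfc(sqrt(wald / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

# Hypothetical coefficient and standard error, for illustration only.
wald, p = wald_test(b=-0.04, se=0.01)
print(wald, p)
```

A large statistic with a tiny p-value says only that the coefficient is unlikely to be zero. It says nothing about how the predictor ranks against the others.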

4. Tree analysis

Another approach, proposed by Ratner [1] in addressing our question, is to use CHAID. CHAID is a tool used to ascertain the relationship between variables. CHAID analysis constructs a predictive model, or tree, to help establish how variables best combine to explain the outcome of the dependent variable. The mini trees below will help to clarify this analysis.

Here, we use the predicted value that emerged from our logistic regression as the dependent variable (the node on top of the tree).

Let’s take a look at each variable separately. The CHAID tree for months of service is shown below. By moving from NODE 1 (months of service <=14) to NODE 2 (months of service > 14 and <=27), the attrition rate declines from 53.2% (mean=.532) to 38%. Thus, the analyst can gauge how sensitive predicted churn is to this variable. I will not go through the remainder of the nodes, but the analysis would be similar.

Let’s move to our next variable, AGE. The tree below can also be used to deduce the relationship of AGE to churn.

Glancing at the difference between the youngest age group and the oldest one, we observe a substantial difference of about 35 percentage points in attrition rates (.461-.108)!

Analyzing ‘Equipment last month’, we are also able to assess its impact, as shown below.

Finally, the CHAID tree on the internet variable reveals a major change in churn rate. As we go from not having internet to having internet, there is a change of about 26 percentage points in attrition likelihood (.434-.173).
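The arithmetic behind these comparisons can be sketched as follows. This is not CHAID itself and not Ratner's procedure, only a simple way to line up the gaps between node means. The function name is hypothetical, and only the extreme node means quoted in the text are used, since the intermediate nodes are not listed here.

```python
def node_spread(node_means):
    """Gap between the highest and lowest mean churn score across a tree's nodes."""
    return max(node_means) - min(node_means)

# Extreme node means quoted in the text; intermediate nodes are not listed there.
spreads = {
    "age": node_spread([0.461, 0.108]),
    "internet": node_spread([0.434, 0.173]),
}

# Rank the variables by the size of the gap, largest first.
for name, gap in sorted(spreads.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gap:.3f}")
# age: 0.353
# internet: 0.261
```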

I think most would agree that employing CHAID to assess influence or importance is intuitively reasonable, as well as easily interpretable. But, as Ratner [2] points out, “it does not reveal the effects of the predictor variable with respect to the variation of the other predictor variables.”

While I will not go through the remainder of the analysis, Ratner demonstrates that by using multivariate (employing more than one variable as a predictor) CHAID analysis, it is possible to account for the variation of the other variables in the model, as well.

5. Decile table

Our last routine for gauging variable importance is to examine the decile, or lift, table. This analysis simply divides the population into 10 equal groups based on model score, with the highest probabilities of churning in decile 1 and the lowest in decile 10. The approach looks at part of the decile table, comparing results using all variables with the decile results obtained by removing one variable at a time. Below are the results utilizing this procedure. By using the gains table, a sense of importance begins to emerge. We achieve a 71% churn rate in decile 1 by employing all the variables. However, if we omit tenure, then the churn rate diminishes to 56%.

While this does not directly address our ‘importance’ question, it does provide the user an intuitive sense as to what may be contributing more to the overall prediction.

The cutoff at the first or second decile is just that: arbitrary. But clearly, as we proceed deeper into the deciles, the differences shrink. At decile 10, the cumulative churn rates of all the combinations of variables will be identical, since every customer is then included!
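A decile table of this kind can be sketched in plain Python. The function name, the simulated churn flags, and the two sets of scores below are all hypothetical. In practice the scores would be the predicted probabilities from the full model and from a model refit without one variable.

```python
import random

def decile_table(scores, outcomes, n_groups=10):
    """Per-group and cumulative event rates, with the highest scores in group 1."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    rows, cum_events, cum_n = [], 0, 0
    for g in range(n_groups):
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        events = sum(outcome for _, outcome in group)
        cum_events += events
        cum_n += len(group)
        rows.append({
            "decile": g + 1,
            "rate": events / len(group) if group else float("nan"),
            "cum_rate": cum_events / cum_n if cum_n else float("nan"),
        })
    return rows

# Simulated data: 1 = churned, 0 = stayed (hypothetical, for illustration only).
random.seed(0)
outcomes = [1 if random.random() < 0.3 else 0 for _ in range(200)]
full_scores = [0.6 * y + 0.4 * random.random() for y in outcomes]     # stand-in for all variables
reduced_scores = [0.2 * y + 0.8 * random.random() for y in outcomes]  # stand-in for one variable dropped

full = decile_table(full_scores, outcomes)
reduced = decile_table(reduced_scores, outcomes)
print(full[0]["rate"], reduced[0]["rate"])             # decile 1 churn rates
print(full[-1]["cum_rate"], reduced[-1]["cum_rate"])   # identical at decile 10
```

Comparing the decile 1 rates of the full and reduced scores mirrors the 71% versus 56% comparison above, while the last line shows why the comparison is uninformative by decile 10.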

One of the critical determinants of which ranking method to use, whether it’s one that was mentioned here or some other tool, is whether the method helps elucidate results for the end user.

Between these and other potential approaches, you have several options in selecting how to order variable importance. None of these are perfect. There are some data scientists that will use one tool when results coincide with their thinking, and will employ other techniques when these are more in alignment with their hypotheses. Surprisingly, this may be appropriate, depending on the objective of the study.

Whichever approach is selected, you should feel reasonably comfortable that the end result provides a ‘feeling’ for the causal relationships between your explanatory variables and the resulting change in likelihoods for your dependent variable.

Citations

1. Ratner, Bruce. Statistical Modeling and Analysis for Database Marketing: Effective Techniques for Mining Big Data. Chapman and Hall/CRC, 2003, p. 111.
2. Ibid., p. 116.

Sam Koslowsky serves as Senior Analytic Consultant for Harte Hanks. Sam’s responsibilities include developing quantitative and analytic solutions for a wide variety of firms. Sam is a frequent speaker at industry conferences, a contributor to many analytic related publications, and has taught at Columbia and New York Universities. He has an undergraduate degree in mathematics, an MBA in finance from New York University, and has completed post-graduate work in statistics and operations research.
