By: Sam Koslowsky, Senior Analytic Consultant, Harte Hanks
After more than thirty years in the targeted modeling field, I still hear the same question I heard when I was starting out: “Which tool produces the best result?” Indeed, at a recent seminar I was conducting for data science graduate students, I was asked that very question. A dialogue ensued among several participants, each claiming that the tool he or she used was the best. At the conclusion of their give-and-take, I chimed in, “It’s not the tool – it’s the data.”
What distinguishes the novices from the professionals in the predictive modeling world is the ability of the data scientist to employ data creatively. There is little expertise, if any, in presenting data to some software and waiting for a result; Excel does this wonderfully. It is the ability to create new data elements, and to incorporate them in your analysis, that can make all the difference. It is these derived data elements that can convert a mediocre result into a superior one.
Equally interesting is the ability of unsupervised methods to creatively develop some of these data derivations.
Data derivation is quite a broad topic, and I cannot delve into every aspect in depth. In this article, I will provide a taste, a feeling for what should be considered; it is the analyst’s job to do any necessary additional research. While not all analysts will perform all these steps in all of their efforts, some of them have to be performed in order to optimize the final modeling algorithms. Let’s divide the tactics into five sections.
Derived through binning
Often, we deal with a continuous data element that may not be as predictive as we hypothesized. One way of further refining this variable is to split it into bins, or buckets. We may very well discover that one of these bins has a better relationship with our dependent variable. Consider the following variable, TENURE. Employed the way it exists, the variable offers little predictive ability. Upon binning it, however, we may observe the following.
Here the bucket referred to as ‘between 37 and 60’, BIN 3, has a response rate of 4.47%.
We can now create a new variable, a flag, that indicates whether a customer’s tenure is between 37 and 60. The flag is turned on (1) when the customer falls into this bucket; otherwise it is turned off (0).
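A minimal sketch of such a flag in Python (the 37–60 cutoffs come from the example above; the function name and sample tenures are illustrative):

```python
def tenure_flag(tenure, low=37, high=60):
    """Return 1 when tenure falls in the responsive 37-60 bucket, else 0."""
    return 1 if low <= tenure <= high else 0

# Illustrative customer tenures, in months
tenures = [12, 40, 58, 61, 37]
flags = [tenure_flag(t) for t in tenures]  # [0, 1, 1, 0, 1]
```

The new 0/1 column can then be offered to the model alongside, or instead of, the raw TENURE value.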
While there are many ways of determining how the above groups are created, two of the better-known strategies work through either a DECISION TREE program or dividing the population into equal groups. Tree programs, such as CHAID, can provide near-optimal binning; the above was generated through such an analysis.
Alternatively, one may divide the dataset into equal groups. So, if we are analyzing 40,000 customers and decide on four groups, each group will contain 10,000 records. First, sort the file by tenure, from HIGH to LOW; divide it into four ranges of 10,000 each; and calculate response rates (or whatever behavior you’re studying) for each segment. This will provide a clue as to whether binning may make sense.
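The equal-groups procedure can be sketched as follows, assuming a small in-memory list of records carrying a continuous `tenure` field and a 0/1 `response` field (both names, and the toy data, are illustrative):

```python
def equal_group_rates(records, field="tenure", n_groups=4):
    """Sort by the field (high to low), split into equal-size groups,
    and compute the response rate within each group."""
    ordered = sorted(records, key=lambda r: r[field], reverse=True)
    size = len(ordered) // n_groups
    groups = [ordered[i * size:(i + 1) * size] for i in range(n_groups)]
    return [sum(r["response"] for r in g) / len(g) for g in groups]

# Tiny illustrative file: longer-tenured customers respond more often
customers = [{"tenure": t, "response": 1 if t >= 5 else 0} for t in range(1, 9)]
rates = equal_group_rates(customers)  # [1.0, 1.0, 0.0, 0.0]
```

A sharp difference in rates across groups, as here, is the clue that binning the variable may pay off.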
Derived variables through domain knowledge
Data scientists require a broad set of skills. Often, their expertise emerges from their software engineering and/or statistical proficiency. These essential skills are indispensable, and they can be applied to all domains. However, some industries have their own peculiarities and require someone who not only possesses the technical know-how, but is also comfortable in the particular industry. While a data miner may very well have the analytic skills to develop a comprehensive modeling solution, she may not possess the domain knowledge that provides added insight into the analysis.
Take an analyst working on a credit card churn model. Typical variables that the bank maintains include balances, purchases, credit line, transaction dates, purchase channel, and repayment history, to mention just a few. What turns out to be suggestive of attrition is not necessarily balances or purchases; rather, the ratio of balance to credit line may be more indicative. Known as credit card utilization, this proportion is often correlated with churn rates. It is not at all clear that an analyst without the domain knowledge would independently arrive at this transformed variable.
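As a sketch, the utilization ratio is simple arithmetic once balance and credit line are on the record (the figures below are invented, and the zero-line guard is my own defensive choice, not something the text prescribes):

```python
def utilization(balance, credit_line):
    """Balance-to-credit-line ratio; a zero line yields 0.0 to avoid division errors."""
    return balance / credit_line if credit_line else 0.0

# Two customers with the same balance but very different utilization
u1 = utilization(4500, 5000)   # 0.9  -- near the limit
u2 = utilization(4500, 30000)  # 0.15 -- plenty of headroom
```

The point is that the derived ratio separates these two customers even though the raw balance cannot.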
Or take a Midwest retailer who is conducting a major upgrade campaign to convert customers to a loyalty program. Monthly spending is, of course, available. Calculating the ratio of spending in the first quarter (the lowest period of spending) to the fourth quarter (the highest), the retailer has observed, may provide clues as to the likelihood that a customer will convert. Again, an analyst may not have considered developing such a significant transformation.
Derived through logical assumptions
Take an auto manufacturer who maintains dates of service for vehicles that are presented for a variety of repair-related visits. The elapsed time between these visits is often a predictor of whether a customer will open a new lease. This I refer to as a logical assumption; it does not really require extensive domain knowledge.
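Assuming the visit dates are available as Python `date` objects (the dates below are invented for illustration), the elapsed-time feature might be derived like this:

```python
from datetime import date

def visit_gaps(visit_dates):
    """Days elapsed between consecutive service visits, in chronological order."""
    ordered = sorted(visit_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]

visits = [date(2023, 1, 10), date(2023, 4, 10), date(2024, 2, 1)]
gaps = visit_gaps(visits)            # [90, 297]
mean_gap = sum(gaps) / len(gaps)     # a candidate predictor per customer
```

A summary of the gaps (mean, most recent, or trend) becomes the customer-level variable fed to the model.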
Or take a bank that is soliciting customers for a new investment product. Various types of income are available to the financial services organization; earned income, as well as investment income, are resident on its data mart. While two customers may have the same investment income, one individual’s investment income as a percent of total income may be 100%, while the second individual’s may be only 5%! This share-of-income relationship may be strongly predictive of response to the new product.
Derived through mathematical transformations
While raw data elements and some of the transformations we have spoken about are essential components of the model development process, often additional variable modifications are needed. These often take the form of square roots, logarithms, arctangents, reciprocals, sines, etc. While spending may be a predictive variable, the square root of spending may provide additional model lift.
Take a look at the following decile performance report for a retail response model. The second column, using a transformation of spending, produces a ‘better’ response rate for DECILE 1. If the marketer’s goal is to reach 10% of the population, employing the square root transformation improves results.
I often evaluate several of these transformations to determine whether additional improvement is possible. Often, they do provide added model gains.
As to why these transformations may be necessary, suffice it to say that the relationships between these modified predictors and response, for example, are stronger. The distribution of the variable is critical in determining these transformations. I leave it to the reader to peruse some standard textbooks to secure a more comprehensive treatment of this particular subject.
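As a sketch of how such candidates can be screened side by side (the shifted reciprocal is my own guard against zero values, not something the text prescribes):

```python
import math

def candidate_transforms(x):
    """Standard re-expressions of a non-negative predictor such as spending."""
    return {
        "raw": x,
        "sqrt": math.sqrt(x),
        "log1p": math.log1p(x),          # log(1 + x) tolerates zero spending
        "reciprocal": 1.0 / (1.0 + x),   # shifted to avoid dividing by zero
    }

t = candidate_transforms(100.0)  # t["sqrt"] == 10.0
```

Each transformed column can then be correlated with the response, and the strongest version of the variable retained.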
Derived through principal component analysis
Often, with a large number of predictors, we encounter multicollinearity (high dependence) among the predictors. Typically, this is not a desirable result. So we may be faced with at least two issues: first, a huge number of data elements that do not do a satisfactory job of helping the analyst negotiate an acceptable model; and second, the multicollinearity problem.
Principal component analysis (PCA) is an effective tool that is widely employed in predictive analytics and data science. The PCA algorithm analyzes the data to extract the components that account for the most variation in the dataset.
PCA produces a reduced set of new variables that retains the most important information; a model developed on that reduced data can also be applied, through the same transformation, to the original, larger dataset. In addition, the variables in the new, reduced dataset are mutually uncorrelated.
The transformed data is useful in explaining the dynamics of the model, as well as assuring the correlation problem is satisfactorily addressed.
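As an illustrative sketch rather than a prescribed procedure, PCA can be computed from the singular value decomposition of the centered data; here it is applied to two deliberately collinear predictors. In practice one would typically reach for a packaged implementation such as scikit-learn's `PCA`.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Center the data and project it onto its top principal components."""
    Xc = X - X.mean(axis=0)
    # The right singular vectors of the centered matrix are the principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
driver = rng.normal(size=(200, 1))
# Two nearly collinear predictors: a textbook multicollinearity problem
X = np.hstack([driver, 2.0 * driver + rng.normal(scale=0.01, size=(200, 1))])
Z = pca_reduce(X, 2)  # component scores are mutually uncorrelated
```

The first component absorbs nearly all of the shared variation, and the component scores, unlike the raw predictors, are uncorrelated with one another.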
While PCA can provide enhanced outcomes, it is a purely data-driven statistical tool, and interpretation of the results is not always clear. The analyst should balance interpretability against incremental model improvement.
The bottom line is that there is more than meets the eye – the raw data is, more often than not, unsatisfactory. A good analyst will assess all of the above approaches and employ those that best meet his objectives. If you are a consumer of predictive models, then you need to inform your analyst about any hypotheses you might have; these assumptions should be evaluated. If you deny your analyst the domain knowledge that you have, then what you get may not be precisely what you want. Remember: it’s not the tool – it’s the data.
About the Author
Sam Koslowsky serves as Senior Analytic Consultant for Harte Hanks. Sam’s responsibilities include developing quantitative and analytic solutions for a wide variety of firms. Sam is a frequent speaker at industry conferences, a contributor to many analytic related publications, and has taught at Columbia and New York Universities. He has an undergraduate degree in mathematics, an MBA in finance from New York University, and has completed post-graduate work in statistics and operations research.